Configure autoscaling policies for services with KServe

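When you deploy and manage KServe model services, the inference service must cope with highly dynamic load fluctuations. KServe integrates the Kubernetes-native Horizontal Pod Autoscaler (HPA) and scaling controllers to automatically adjust the number of model-service Pods based on CPU utilization, memory usage, GPU utilization, and custom performance metrics, which keeps the service performant and stable. This topic uses the Qwen-7B-Chat-Int8 model on a V100 GPU as an example to describe how to configure autoscaling for a service with KServe.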

Prerequisites

The Arena client is installed, version 0.9.15 or later. For details, see Configure the Arena client.
ack-kserve is installed. For details, see Install ack-kserve.

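Configure an autoscaling policy based on CPU or memory

In Raw Deployment mode, autoscaling relies on the Kubernetes Horizontal Pod Autoscaler (HPA). This is the most basic autoscaling method: the HPA dynamically adjusts the number of Pod replicas in the ReplicaSet based on the CPU or memory utilization of the Pods.

The following steps show how to configure autoscaling based on CPU utilization. For background on the HPA mechanism, see Horizontal Pod Autoscaling in the Kubernetes community documentation.

Run the following command to submit the service.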

arena serve kserve \
    --name=sklearn-iris \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/ai-sample/kserve-sklearn-server:v0.12.0 \
    --cpu=1 \
    --memory=200Mi \
    --scale-metric=cpu \
    --scale-target=10 \
    --min-replicas=1 \
    --max-replicas=10 \
    "python -m sklearnserver --model_name=sklearn-iris --model_dir=/models --http_port=8080"

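The parameters are described below:

Parameter	Description
--scale-metric	Scaling metric. Supported values are `cpu` and `memory`. This example uses `cpu`.
--scale-target	Scaling threshold, as a percentage.
--min-replicas	Minimum number of replicas. Must be an integer greater than 0. The HPA policy does not currently support scaling in to 0.
--max-replicas	Maximum number of replicas. Must be an integer greater than the value of `--min-replicas`.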
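
To make the roles of `--scale-target`, `--min-replicas`, and `--max-replicas` concrete, the following sketch applies the replica formula from the Kubernetes HPA documentation, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), and then clamps the result to the configured bounds. The function name `desired_replicas` is hypothetical, and the sketch ignores the controller's tolerance band, stabilization windows, and scaling policies, so treat it as an approximation of what the HPA computes rather than its exact behavior.

```shell
# Hypothetical helper: approximate the replica count an HPA would request.
# Usage: desired_replicas CURRENT_REPLICAS CURRENT_UTIL TARGET_UTIL MIN MAX
desired_replicas() {
  current=$1; util=$2; target=$3; min=$4; max=$5
  # Integer ceiling of current * util / target.
  desired=$(( ($current * $util + $target - 1) / $target ))
  # Clamp to the --min-replicas / --max-replicas bounds.
  if [ "$desired" -lt "$min" ]; then desired=$min; fi
  if [ "$desired" -gt "$max" ]; then desired=$max; fi
  echo "$desired"
}

# One replica at 80% CPU against a 10% target, bounded to 1..10 replicas.
desired_replicas 1 80 10 1 10   # prints 8
```

With a target as low as 10%, even moderate load pushes the computed value toward `--max-replicas`, which is why this example scales out so readily under the load test.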

Expected output:

inferenceservice.serving.kserve.io/sklearn-iris created
INFO[0002] The Job sklearn-iris has been submitted successfully 
INFO[0002] You can run `arena serve get sklearn-iris --type kserve -n default` to check the job status

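The output shows that the sklearn-iris service was created successfully.

Run the following command to prepare the inference input request.
The command creates a file named iris-input.json and writes the following JSON into it as the input data for model prediction.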
```
cat <<EOF > "./iris-input.json"
{
  "instances": [
    [6.8,  2.8,  4.8,  1.4],
    [6.0,  3.4,  4.5,  1.6]
  ]
}
EOF
```

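Run the following commands to access the service and run an inference.

# Get the load balancer IP address of the nginx-ingress-lb Service in the kube-system namespace. This is the entry point for external access.
NGINX_INGRESS_IP=$(kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}')
# Get the URL of the sklearn-iris InferenceService and extract the hostname for later use.
SERVICE_HOSTNAME=$(kubectl get inferenceservice sklearn-iris -o jsonpath='{.status.url}' | cut -d "/" -f 3)
# Send the request to the model service with curl. The headers set the target hostname (SERVICE_HOSTNAME from above) and a JSON content type. -d @./iris-input.json reads the request body from the local file iris-input.json, which must contain the input data for prediction.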
curl -H "Host: $SERVICE_HOSTNAME" -H "Content-Type: application/json" \
     http://$NGINX_INGRESS_IP:80/v1/models/sklearn-iris:predict -d @./iris-input.json

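Expected output:

{"predictions":[1,1]}

The output shows that the request triggered two inferences, one per input instance, and that both returned the same result.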
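
The `SERVICE_HOSTNAME` assignment above relies on `cut -d "/" -f 3`: splitting a URL such as `http://host/path` on `/` yields `http:`, an empty field, and then the host. The following minimal sketch shows this in isolation. The function name `url_host` and the example URLs are assumptions for illustration, not values your cluster returns.

```shell
# Hypothetical helper: extract the host part of a URL the same way the
# SERVICE_HOSTNAME assignment does.
url_host() {
  printf '%s\n' "$1" | cut -d "/" -f 3
}

# Example URL invented for illustration.
url_host "http://sklearn-iris-default.example.com"   # prints sklearn-iris-default.example.com
```

If `.status.url` is still empty because the InferenceService is not ready, the extracted hostname is empty as well, so it is worth checking the variable before sending requests.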

Run the following command to start a load test.

Note

For details about the hey load-testing tool, see Hey.

hey -z 2m -c 20 -m POST -host $SERVICE_HOSTNAME -H "Content-Type: application/json" -D ./iris-input.json http://${NGINX_INGRESS_IP}:80/v1/models/sklearn-iris:predict

While the load test is running, open another terminal and run the following command to watch how the service scales.

kubectl describe hpa sklearn-iris-predictor

Expected output:

Name:                                                  sklearn-iris-predictor
Namespace:                                             default
Labels:                                                app=isvc.sklearn-iris-predictor
                                                       arena.kubeflow.org/uid=3399d840e8b371ed7ca45dda29debeb1
                                                       chart=kserve-0.1.0
                                                       component=predictor
                                                       heritage=Helm
                                                       release=sklearn-iris
                                                       serving.kserve.io/inferenceservice=sklearn-iris
                                                       servingName=sklearn-iris
                                                       servingType=kserve
Annotations:                                           arena.kubeflow.org/username: kubecfg:certauth:admin
                                                       serving.kserve.io/deploymentMode: RawDeployment
CreationTimestamp:                                     Sat, 11 May 2024 17:15:47 +0800
Reference:                                             Deployment/sklearn-iris-predictor
Metrics:                                               ( current / target )
  resource cpu on pods  (as a percentage of request):  0% (2m) / 10%
Min replicas:                                          1
Max replicas:                                          10
Behavior:
  Scale Up:
    Stabilization Window: 0 seconds
    Select Policy: Max
    Policies:
      - Type: Pods     Value: 4    Period: 15 seconds
      - Type: Percent  Value: 100  Period: 15 seconds
  Scale Down:
    Select Policy: Max
    Policies:
      - Type: Percent  Value: 100  Period: 15 seconds
Deployment pods:       10 current / 10 desired
Conditions:
  Type            Status  Reason               Message
  ----            ------  ------               -------
  AbleToScale     True    ScaleDownStabilized  recent recommendations were higher than current one, applying the highest recent recommendation
  ScalingActive   True    ValidMetricFound     the HPA was able to successfully calculate a replica count from cpu resource utilization (percentage of request)
  ScalingLimited  False   DesiredWithinRange   the desired count is within the acceptable range
Events:
  Type    Reason             Age                  From                       Message
  ----    ------             ----                 ----                       -------
  Normal  SuccessfulRescale  38m                  horizontal-pod-autoscaler  New size: 8; reason: cpu resource utilization (percentage of request) above target
  Normal  SuccessfulRescale  28m                  horizontal-pod-autoscaler  New size: 7; reason: All metrics below target
  Normal  SuccessfulRescale  27m                  horizontal-pod-autoscaler  New size: 1; reason: All metrics below target

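The Events section of the expected output shows that the HPA adjusted the replica count automatically based on CPU usage, for example to 8, 7, and 1 at different points in time. In other words, the HPA scales the service out and in according to CPU utilization.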
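
When the event list grows, it can help to pull out only the replica counts. The sketch below filters `SuccessfulRescale` lines with standard `grep` and `awk`. The function name `rescale_sizes` is hypothetical, and the sample input consists of the three events from the output above with the column spacing shortened.

```shell
# Hypothetical helper: print the "New size" value of each SuccessfulRescale
# event read from standard input, one value per line.
rescale_sizes() {
  grep 'SuccessfulRescale' | grep -o 'New size: [0-9]*' | awk '{ print $3 }'
}

# Sample events taken from the expected output. In practice, pipe in the
# output of `kubectl describe hpa sklearn-iris-predictor` instead.
rescale_sizes <<'EOF'
  Normal  SuccessfulRescale  38m  horizontal-pod-autoscaler  New size: 8; reason: cpu resource utilization (percentage of request) above target
  Normal  SuccessfulRescale  28m  horizontal-pod-autoscaler  New size: 7; reason: All metrics below target
  Normal  SuccessfulRescale  27m  horizontal-pod-autoscaler  New size: 1; reason: All metrics below target
EOF
```

For the sample input, the function prints 8, 7, and 1 on separate lines.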

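Configure a custom-metric autoscaling policy based on GPU utilization

Custom-metric autoscaling relies on the ack-alibaba-cloud-metrics-adapter component provided by ACK together with the Kubernetes HPA mechanism. For details, see Horizontal pod scaling based on Alibaba Cloud Prometheus metrics.

The following example shows how to configure custom-metric autoscaling based on the GPU utilization of Pods.

Prepare the Qwen-7B-Chat-Int8 model data. For details, see Deploy a vLLM inference service.
Configure custom GPU metrics. For details, see Implement autoscaling based on GPU metrics.

Run the following command to deploy the vLLM service.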

arena serve kserve \
    --name=qwen \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:0.4.1 \
    --gpus=1 \
    --cpu=4 \
    --memory=12Gi \
    --scale-metric=DCGM_CUSTOM_PROCESS_SM_UTIL \
    --scale-target=50 \
    --min-replicas=1 \
    --max-replicas=2 \
    --data="llm-model:/mnt/models/Qwen-7B-Chat-Int8" \
    "python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code --served-model-name qwen --model /mnt/models/Qwen-7B-Chat-Int8 --gpu-memory-utilization 0.95 --quantization gptq --max-model-len=6144"

Expected output:

inferenceservice.serving.kserve.io/qwen created
INFO[0002] The Job qwen has been submitted successfully 
INFO[0002] You can run `arena serve get qwen --type kserve -n default` to check the job status

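The output shows that the inference service was deployed successfully.

Run the following commands to access the inference service through the Nginx Ingress gateway address and verify that the vLLM service works.

# Get the IP address of the Nginx Ingress.
NGINX_INGRESS_IP=$(kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}')
# Get the hostname of the InferenceService.
SERVICE_HOSTNAME=$(kubectl get inferenceservice qwen -o jsonpath='{.status.url}' | cut -d "/" -f 3)
# Send a request to the inference service.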
curl -H "Host: $SERVICE_HOSTNAME" -H "Content-Type: application/json" http://$NGINX_INGRESS_IP:80/v1/chat/completions -d '{"model": "qwen", "messages": [{"role": "user", "content": "測試一下"}], "max_tokens": 10, "temperature": 0.7, "top_p": 0.9, "seed": 10}'

Expected output:

{"id":"cmpl-77088b96abe744c89284efde2e779174","object":"chat.completion","created":1715590010,"model":"qwen","choices":[{"index":0,"message":{"role":"assistant","content":"好的，請問您有什么需要測試的？<|im_end|>"},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":10,"total_tokens":20,"completion_tokens":10}}

The output shows that the request reached the server and that the server returned the expected JSON response.

Run the following command to load-test the service.

Note

For details about the hey load-testing tool, see Hey.

hey -z 2m -c 5 -m POST -host $SERVICE_HOSTNAME -H "Content-Type: application/json" -d '{"model": "qwen", "messages": [{"role": "user", "content": "測試一下"}], "max_tokens": 10, "temperature": 0.7, "top_p": 0.9, "seed": 10}' http://$NGINX_INGRESS_IP:80/v1/chat/completions

During the load test, open another terminal and run the following command to watch how the service scales.

kubectl describe hpa qwen-hpa

Expected output:

Name:                                     qwen-hpa
Namespace:                                default
Labels:                                   <none>
Annotations:                              <none>
CreationTimestamp:                        Tue, 14 May 2024 14:57:03 +0800
Reference:                                Deployment/qwen-predictor
Metrics:                                  ( current / target )
  "DCGM_CUSTOM_PROCESS_SM_UTIL" on pods:  0 / 50
Min replicas:                             1
Max replicas:                             2
Deployment pods:                          1 current / 1 desired
Conditions:
  Type            Status  Reason            Message
  ----            ------  ------            -------
  AbleToScale     True    ReadyForNewScale  recommended size matches current size
  ScalingActive   True    ValidMetricFound  the HPA was able to successfully calculate a replica count from pods metric DCGM_CUSTOM_PROCESS_SM_UTIL
  ScalingLimited  True    TooFewReplicas    the desired replica count is less than the minimum replica count
Events:
  Type    Reason             Age   From                       Message
  ----    ------             ----  ----                       -------
  Normal  SuccessfulRescale  43m   horizontal-pod-autoscaler  New size: 2; reason: pods metric DCGM_CUSTOM_PROCESS_SM_UTIL above target
  Normal  SuccessfulRescale  34m   horizontal-pod-autoscaler  New size: 1; reason: All metrics below target

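The expected output shows that the number of Pods scaled out to 2 during the load test and scaled back in to 1 some time (about 5 minutes) after the test ended. In other words, KServe can scale on a custom metric based on the GPU utilization of Pods.

Configure a scheduled scaling policy

Scheduled scaling relies on the ack-kubernetes-cronhpa-controller component provided by ACK. With this component, you can change the number of application replicas at specific points in time or on a recurring schedule to handle predictable load changes.

Install the CronHPA component. For details, see Use CronHPA for scheduled horizontal scaling.
Prepare the Qwen-7B-Chat-Int8 model data. For details, see Deploy a vLLM inference service.

Run the following command to deploy the vLLM service.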

arena serve kserve \
    --name=qwen-cronhpa \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:0.4.1 \
    --gpus=1 \
    --cpu=4 \
    --memory=12Gi \
    --annotation="serving.kserve.io/autoscalerClass=external" \
    --data="llm-model:/mnt/models/Qwen-7B-Chat-Int8" \
   "python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code --served-model-name qwen --model /mnt/models/Qwen-7B-Chat-Int8 --gpu-memory-utilization 0.95 --quantization gptq --max-model-len=6144"

Expected output:

inferenceservice.serving.kserve.io/qwen-cronhpa created
INFO[0004] The Job qwen-cronhpa has been submitted successfully 
INFO[0004] You can run `arena serve get qwen-cronhpa --type kserve -n default` to check the job status

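Run the following commands to verify that the vLLM service works.

# Get the IP address of the Nginx Ingress.
NGINX_INGRESS_IP=$(kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}')
# Get the hostname of the InferenceService. The service deployed in this section is named qwen-cronhpa.
SERVICE_HOSTNAME=$(kubectl get inferenceservice qwen-cronhpa -o jsonpath='{.status.url}' | cut -d "/" -f 3)
# Send a request to the inference service.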
curl -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" \
     http://$NGINX_INGRESS_IP:80/v1/chat/completions -X POST \
     -d '{"model": "qwen", "messages": [{"role": "user", "content": "你好"}], "max_tokens": 512, "temperature": 0.7, "top_p": 0.9, "seed": 10, "stop":["<|endoftext|>", "<|im_end|>", "<|im_start|>"]}'

Expected output:

{"id":"cmpl-b7579597aa284f118718b22b83b726f8","object":"chat.completion","created":1715589652,"model":"qwen","choices":[{"index":0,"message":{"role":"assistant","content":"好的，請問您有什么需要測試的？<|im_end|>"},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":10,"total_tokens":20,"completion_tokens":10}}

The output shows that the request reached the service and that the service returned the expected JSON response.

Run the following command to configure scheduled scaling.

kubectl apply -f- <<EOF
apiVersion: autoscaling.alibabacloud.com/v1beta1
kind: CronHorizontalPodAutoscaler
metadata:
  name: qwen-cronhpa
  namespace: default 
spec:
   scaleTargetRef:
      apiVersion: apps/v1
      kind: Deployment
      name: qwen-cronhpa-predictor
   jobs:
   # Scale out at 10:30 every day
   - name: "scale-up"
     schedule: "0 30 10 * * *"
     targetSize: 2
     runOnce: false
   # Scale in at 12:00 every day
   - name: "scale-down"
     schedule: "0 0 12 * * *"
     targetSize: 1
     runOnce: false
EOF

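After the resource is created, describing it (for example with `kubectl describe cronhpa qwen-cronhpa`) is expected to produce output similar to the following:
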
Name:         qwen-cronhpa
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  autoscaling.alibabacloud.com/v1beta1
Kind:         CronHorizontalPodAutoscaler
Metadata:
  Creation Timestamp:  2024-05-12T14:06:49Z
  Generation:          2
  Resource Version:    9205625
  UID:                 b9e72da7-262e-4***-b***-26586b7****c
Spec:
  Jobs:
    Name:         scale-up
    Schedule:     0 30 10 * * *
    Target Size:  2
    Name:         scale-down
    Schedule:     0 0 12 * * *
    Target Size:  1
  Scale Target Ref:
    API Version:  apps/v1
    Kind:         Deployment
    Name:         qwen-cronhpa-predictor
Status:
  Conditions:
    Job Id:           3972f7cc-bab0-482e-8cbe-7c4*******5
    Last Probe Time:  2024-05-12T14:06:49Z
    Message:          
    Name:             scale-up
    Run Once:         false
    Schedule:         0 30 10 * * *
    State:            Submitted
    Target Size:      2
    Job Id:           36a04605-0233-4420-967c-ac2********6
    Last Probe Time:  2024-05-12T14:06:49Z
    Message:          
    Name:             scale-down
    Run Once:         false
    Schedule:         0 0 12 * * *
    State:            Submitted
    Target Size:      1
  Scale Target Ref:
    API Version:  apps/v1
    Kind:         Deployment
    Name:         qwen-cronhpa-predictor
Events:           <none>

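The output shows that the qwen-cronhpa custom resource defines a scheduled scaling plan: following the configured schedule, it automatically adjusts the number of Pods in the qwen-cronhpa-predictor Deployment at specific times each day to meet the preset scaling requirements.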
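
The `schedule` strings in the manifest have six fields rather than the five of a classic crontab entry. Judging from the comments in the manifest, the first field is seconds, followed by minutes and hours. The sketch below reads those three leading fields back as a time of day to make that interpretation explicit. The function name `schedule_time` is hypothetical, and the sketch handles only plain numeric values in the first three fields.

```shell
# Hypothetical helper: print the HH:MM:SS encoded in the first three fields
# (seconds, minutes, hours) of a six-field schedule string.
schedule_time() {
  printf '%s\n' "$1" | awk '{ printf "%02d:%02d:%02d\n", $3, $2, $1 }'
}

schedule_time "0 30 10 * * *"   # prints 10:30:00
schedule_time "0 0 12 * * *"    # prints 12:00:00
```

Reading the schedule this way helps catch a common slip: pasting a five-field crontab expression shifts every field by one position.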
