Kubernetes_Deployment_CPU-Mem

Okay, this is an excellent example to illustrate how resource allocation works with a sidecar pattern and how it drives VM sizing and scaling decisions.

Let's break down your qsm-julia-combined.yml:

# ... (ServiceAccount is fine, not directly part of resource calculation for VMs) ...
---
# --- Deployment for the QSM + Julia Sidecar Pod ---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: qsm-julia-sidecar
  # ...
spec:
  replicas: 1 # You are starting with 1 instance of this Pod
  # ...
  template:
    # ...
    spec:
      serviceAccountName: sa-qsm-julia
      containers:
      # --- Container 1: QSM (cdsqsmimg) ---
      - name: qsm
        # ...
        resources:
          requests:
            memory: "2.5Gi" # qsm requests 2.5 GiB RAM
            cpu: "1"        # qsm requests 1 vCPU
          limits:
            memory: "4Gi"
            cpu: "1.5"

      # --- Container 2: Julia Service ---
      - name: julia-service
        # ...
        resources:
          requests:
            memory: "16Gi"  # julia-service requests 16 GiB RAM
            cpu: "4"        # julia-service requests 4 vCPUs
          limits:
            memory: "20Gi"
            cpu: "5"
# ... (Service is for network exposure, not direct resource calculation for VMs) ...

1. Calculating Total Pod Resource Requests

In a sidecar pattern, multiple containers run within the same Pod. For scheduling purposes, Kubernetes considers the sum of the resource requests of all containers within that Pod.

  • Total CPU Requested by the Pod:

    • qsm container CPU request: 1

    • julia-service container CPU request: 4

    • Total Pod CPU Request: 1 + 4 = 5 CPUs

  • Total Memory Requested by the Pod:

    • qsm container memory request: 2.5Gi

    • julia-service container memory request: 16Gi

    • Total Pod Memory Request: 2.5Gi + 16Gi = 18.5Gi

So, for each replica of your qsm-julia-sidecar Pod, Kubernetes needs to find a node that can guarantee 5 vCPUs and 18.5 GiB of RAM.
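
Once a Pod is scheduled, you can verify what the scheduler accounted for: kubectl describe node <node-name> lists each Pod's summed CPU and memory requests in its Non-terminated Pods table, and the node-wide totals under Allocated resources.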

2. Calculating Total Pod Resource Limits (Conceptual)

While limits are enforced per container, understanding the sum helps in capacity planning for potential peak usage:

  • Total CPU Limit for the Pod:

    • qsm container CPU limit: 1.5

    • julia-service container CPU limit: 5

    • Total Pod CPU Limit: 1.5 + 5 = 6.5 CPUs

  • Total Memory Limit for the Pod:

    • qsm container memory limit: 4Gi

    • julia-service container memory limit: 20Gi

    • Total Pod Memory Limit: 4Gi + 20Gi = 24Gi

This means that, in theory, a single Pod could burst up to 6.5 CPUs (if the node has spare cycles; each container is throttled once it hits its own CPU limit) and use up to 24 GiB of RAM (individual containers are OOMKilled if they exceed their respective memory limits).
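
Also note that because each container's requests are lower than its limits, the Pod lands in the Burstable QoS class (you can confirm with kubectl get pod <pod-name> -o jsonpath='{.status.qosClass}'); under node memory pressure, Burstable Pods are candidates for eviction before Guaranteed ones.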

3. Measuring for Best VM Sizes

Your Pod requests 5 CPUs and 18.5Gi RAM. This is the minimum amount of free allocatable capacity a node must have to schedule one such Pod.

Step 1: Identify Potential VM SKUs

You need Azure VM SKUs whose allocatable resources exceed 5 vCPUs and 18.5 GiB of RAM. Remember, total VM resources != allocatable resources: some capacity is reserved for the OS, the kubelet, and eviction thresholds.
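
Rather than guessing, you can read a node's real numbers with kubectl describe node <node-name>: compare the Capacity and Allocatable sections; the difference is exactly that reserved overhead.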

Let's look at some Azure VM families (General Purpose, Memory Optimized):

  • Standard_D8s_v3 / Standard_D8ds_v4 / Standard_D8as_v4:

    • vCPUs: 8

    • RAM: 32 GiB

    • Estimated Allocatable (approximate, check kubectl describe node for specifics):

      • CPU: ~7.0 - 7.5 vCPUs (after ~0.5-1 vCPU for system)

      • Memory: ~26 - 28 GiB (after ~4-6 GiB for system & kube-reserved)

    • Can this fit one Pod? Yes. (5 requested CPU < ~7 allocatable CPU; 18.5Gi requested RAM < ~26Gi allocatable RAM).

    • Can it fit two Pods? No. (2 * 5 CPU = 10 CPU > ~7 allocatable CPU; 2 * 18.5Gi RAM = 37Gi RAM > ~26Gi allocatable RAM). So, one Pod per D8s_v3 type node.

  • Standard_E8s_v3 / Standard_E8ds_v4 / Standard_E8as_v4:

    • vCPUs: 8

    • RAM: 64 GiB

    • Estimated Allocatable:

      • CPU: ~7.0 - 7.5 vCPUs

      • Memory: ~55 - 58 GiB

    • Can this fit one Pod? Yes.

    • Can it fit two Pods? No for CPU. Yes for memory. Still limited by CPU. So, one Pod per E8s_v3 type node if only these Pods are running.

    • This SKU offers much more memory than needed for a single Pod of this type, leading to potentially wasted memory if you only run these Pods. However, if you have other memory-intensive Pods or this Pod has very high memory limits that it often reaches, it might be considered.

  • Standard_D16s_v3 / Standard_D16ds_v4 / Standard_D16as_v4:

    • vCPUs: 16

    • RAM: 64 GiB

    • Estimated Allocatable:

      • CPU: ~14.5 - 15 vCPUs

      • Memory: ~55 - 58 GiB

    • Can this fit one Pod? Yes.

    • Can this fit two Pods? Yes. (2 * 5 CPU = 10 CPU < ~14.5 allocatable; 2 * 18.5Gi RAM = 37Gi RAM < ~55Gi allocatable).

    • Can this fit three Pods? Not reliably. CPU: 3 * 5 = 15 vCPUs, which is at or slightly above typical allocatable (~14.5 - 15). Memory: 3 * 18.5Gi = 55.5Gi, which sits right at the edge of typical allocatable (~55 - 58 GiB).

    • So, likely two Pods per D16s_v3 type node.

Step 2: Load Testing and Monitoring (Crucial for "Best" Fit)

The "best" VM size isn't just about fitting requests; it's about actual usage and cost-effectiveness.

  1. Deploy your application with initial resource requests/limits.

  2. Use Azure Monitor for containers / Prometheus & Grafana:

    • Monitor actual CPU and Memory usage of each container (qsm and julia-service) within the Pods under realistic load.

    • kubectl top pod <pod-name> -n <namespace> --containers is good for quick checks.

  3. Observe:

    • CPU Usage: Are containers consistently far below their requests? Or are they frequently hitting their limits (getting throttled)?

    • Memory Usage: Is actual memory usage far below requests? Or are containers getting OOMKilled (hitting their limits)?

  4. Adjust Requests/Limits:

    • If qsm consistently uses only 0.2 CPU and 1Gi RAM, you could lower its requests/limits.

    • If julia-service consistently uses 3.5 CPU and 14Gi RAM under peak load but sometimes spikes to 4.5 CPU and 18Gi RAM, its current requests/limits might be reasonable.

    • The goal: Set requests to what your application typically needs to run stably. Set limits to a cap that prevents runaway consumption but allows for reasonable bursts.

  5. Re-evaluate VM Sizing: After tuning requests, re-calculate the total Pod request and see how that affects bin-packing onto VMs.

    • If you manage to lower total Pod requests significantly, you might fit more Pods onto a smaller/cheaper VM, or more Pods onto the same VM SKU, improving density and reducing cost.

    • For instance, if after optimization, your Pod requests become 3 CPUs and 12Gi RAM, a Standard_D8s_v3 (allocatable ~7 CPU, ~26Gi RAM) could now fit two such Pods.
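
To make that concrete, here is a sketch of what the tuned containers section could look like after such an optimization pass. The per-container split shown here (0.5 CPU / 1.5Gi for qsm, 2.5 CPU / 10.5Gi for julia-service) and the limits are illustrative numbers only; your load tests determine the real values.

      containers:
      - name: qsm
        resources:
          requests:
            memory: "1.5Gi"   # illustrative tuned request
            cpu: "500m"
          limits:
            memory: "2Gi"
            cpu: "1"
      - name: julia-service
        resources:
          requests:
            memory: "10.5Gi"  # illustrative tuned request
            cpu: "2.5"
          limits:
            memory: "14Gi"
            cpu: "3.5"

Total requests are now 3 CPUs and 12Gi, so two such Pods fit comfortably within a Standard_D8s_v3-class node's allocatable resources.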

4. How to Scale Up & Down

Once you have a good understanding of your Pod's resource needs and have selected a suitable VM SKU for your node pool(s):

  • Horizontal Pod Autoscaler (HPA):

    • The HPA will automatically adjust the replicas count in your Deployment based on observed metrics.

    • For your sidecar setup, you need to decide which container's metrics (or a combination via custom metrics) should drive scaling.

      • Option 1 (CPU of dominant container): If julia-service's CPU usage is the primary indicator of load:

        apiVersion: autoscaling/v2
        kind: HorizontalPodAutoscaler
        metadata:
          name: qsm-julia-hpa
          namespace: default
        spec:
          scaleTargetRef:
            apiVersion: apps/v1
            kind: Deployment
            name: qsm-julia-sidecar # Target your deployment
          minReplicas: 1
          maxReplicas: 10 # Example max
          metrics:
          - type: ContainerResource # Use ContainerResource for sidecars
            containerResource:
              name: cpu                  # The resource to measure (cpu or memory)
              container: julia-service   # The container whose usage drives scaling
              target:
                type: Utilization
                averageUtilization: 70 # Target 70% of julia-service's CPU request
      • Option 2 (Custom Metrics): If a business metric (e.g., queue length, requests per second processed by julia-service) is a better indicator, use custom metrics with HPA.
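
        A minimal sketch of the metrics block for this approach, assuming a metrics adapter (e.g., prometheus-adapter) is installed and exposes a per-Pod metric; the metric name and target value are purely illustrative:

          metrics:
          - type: Pods
            pods:
              metric:
                name: julia_requests_per_second # hypothetical custom metric
              target:
                type: AverageValue
                averageValue: "100"             # aim for ~100 req/s per Pod before adding a replica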

    • When HPA decides to scale up (e.g., from 1 to 2 replicas), it tells Kubernetes to create another qsm-julia-sidecar Pod. This new Pod will again require 5 CPUs and 18.5Gi RAM.

  • Cluster Autoscaler (CA):

    • The CA monitors for Pods that are in a "Pending" state because no node has enough allocatable resources to schedule them.

    • If HPA scales up your qsm-julia-sidecar deployment, and all existing nodes are full (i.e., cannot satisfy the 5 CPU / 18.5Gi RAM request of the new Pod), the CA will kick in.

    • It will look at your node pool configurations (VM size, min/max node count) and provision a new VM of the appropriate size (e.g., another Standard_D8s_v3 or Standard_D16s_v3 depending on your node pool setup).

    • Once the new node is ready and joined to the cluster, the pending Pod will be scheduled on it.

    • Conversely, if HPA scales down and nodes become underutilized for a configurable period, the CA can terminate nodes to save costs.

Workflow Summary:

  1. Initial Sizing: Sum container requests to get total Pod requests (5 CPU, 18.5Gi RAM).

  2. VM Selection: Choose VM SKUs whose allocatable resources can satisfy the Pod's total requests (e.g., Standard_D8s_v3 for one Pod, Standard_D16s_v3 for two Pods). Create an AKS node pool with this VM SKU.

  3. Monitor & Tune: Deploy, load test, monitor actual usage of each container. Adjust container requests and limits based on observed behavior to optimize. Re-evaluate VM choice if necessary.

  4. Configure HPA: Based on the critical container's resource usage (e.g., julia-service CPU) or custom metrics.

  5. Enable CA: For the node pool hosting these Pods, allowing it to scale the number of VMs up/down.
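
On AKS, the Cluster Autoscaler is enabled per node pool, for example with az aks nodepool update --enable-cluster-autoscaler --min-count 1 --max-count 5 (plus the usual --resource-group, --cluster-name, and --name arguments); confirm the exact flags against your Azure CLI version.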

Example Node Pool Strategy:

Given the significant resource requirements of this Pod, you might consider:

  • A dedicated node pool: For qsm-julia-sidecar Pods, using a VM SKU like Standard_D16ds_v4 (or similar, based on your final tuned requests and how many you want per node). Enable CA on this node pool.

  • Another node pool: For other, less resource-intensive workloads, using smaller, more cost-effective VM SKUs.

This separation prevents your large qsm-julia Pods from forcing all your nodes to be oversized if you have other smaller applications.
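
As a minimal sketch of how that pinning could look, assume the dedicated pool is named qsmjulia and is tainted workload=qsm-julia:NoSchedule (both names are assumptions; AKS labels every node with agentpool=<pool name>). In the Deployment's Pod template spec you would add:

      nodeSelector:
        agentpool: qsmjulia            # hypothetical dedicated node pool name
      tolerations:
      - key: "workload"                # hypothetical taint applied to that pool
        operator: "Equal"
        value: "qsm-julia"
        effect: "NoSchedule"

The taint keeps other workloads off the large nodes, and the nodeSelector keeps these Pods off the smaller, cheaper pool.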

Key Takeaway for Sidecars: The Pod is the unit of scheduling. Its total requested resources (sum of all its containers' requests) must be satisfiable by an available node. Fine-tuning individual container requests/limits and then selecting appropriate VM SKUs in conjunction with HPA and CA is key to an efficient and scalable setup.
