Kubernetes Sidecar Monitoring and Debugging Guide
1. Enhanced Kubectl Debugging Commands
Pod-level Debugging
# Get detailed pod information
kubectl describe pod -l app=qsm-julia-sidecar
# Check pod logs from both containers
kubectl logs -l app=qsm-julia-sidecar -c qsm --tail=100 -f
kubectl logs -l app=qsm-julia-sidecar -c julia-service --tail=100 -f
# View logs from both containers simultaneously
kubectl logs -l app=qsm-julia-sidecar --all-containers=true -f
# Get previous container logs (if containers restarted)
kubectl logs -l app=qsm-julia-sidecar -c qsm --previous
kubectl logs -l app=qsm-julia-sidecar -c julia-service --previous
# Execute into specific container for debugging
kubectl exec -it <pod-name> -c qsm -- /bin/bash
kubectl exec -it <pod-name> -c julia-service -- /bin/bash
# Test connectivity between containers
kubectl exec -it <pod-name> -c qsm -- curl http://localhost:8000/health
kubectl exec -it <pod-name> -c qsm -- netstat -tulnNetwork Debugging
# Check network policies
kubectl get networkpolicies
# Check service endpoints
kubectl get endpoints qsm-julia-svc
# Describe the service
kubectl describe service qsm-julia-svc
# Port forward for external debugging
kubectl port-forward deployment/qsm-julia-sidecar 8000:8000
kubectl port-forward deployment/qsm-julia-sidecar 8080:802. Enhanced Manifest with Better Observability
apiVersion: v1
kind: ServiceAccount
metadata:
name: sa-qsm-julia
namespace: default
annotations:
azure.workload.identity/client-id: "f6f41384-8515-4dfe-ba7e-28c6bb5a190f"
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: qsm-julia-sidecar
namespace: default
spec:
replicas: 4
selector:
matchLabels:
app: qsm-julia-sidecar
template:
metadata:
labels:
app: qsm-julia-sidecar
azure.workload.identity/use: "true"
annotations:
# Enable Prometheus scraping
prometheus.io/scrape: "true"
prometheus.io/port: "9090"
prometheus.io/path: "/metrics"
spec:
serviceAccountName: sa-qsm-julia
containers:
- name: qsm
image: cdsoptmzprod.azurecr.io/cdsqsmimg:__QSMIMG__
imagePullPolicy: IfNotPresent
ports:
- containerPort: 80
name: http-qsm
- containerPort: 9090
name: metrics-qsm
env:
- name: QSM_ENVIRONMENT
value: "prod"
- name: JULIA_SERVICE_URL
value: "http://localhost:8000"
- name: LOG_LEVEL
value: "DEBUG"
resources:
requests:
memory: "2.5Gi"
cpu: "1.5"
limits:
memory: "4Gi"
cpu: "2.5"
# Health checks
livenessProbe:
httpGet:
path: /health
port: 80
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /ready
port: 80
initialDelaySeconds: 5
periodSeconds: 5
- name: julia-service
image: cdsoptmzprod.azurecr.io/julia-service:__JULIA__
imagePullPolicy: IfNotPresent
ports:
- containerPort: 8000
name: http-julia
- containerPort: 9091
name: metrics-julia
env:
- name: CLUSTER
value: "prod"
- name: LOG_LEVEL
value: "DEBUG"
resources:
requests:
memory: "16Gi"
cpu: "4"
limits:
memory: "20Gi"
cpu: "5"
# Health checks
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 60
periodSeconds: 10
readinessProbe:
httpGet:
path: /ready
port: 8000
initialDelaySeconds: 10
periodSeconds: 5
# Optional: Add logging sidecar for centralized logs
- name: fluent-bit
image: fluent/fluent-bit:latest
volumeMounts:
- name: shared-logs
mountPath: /var/log
volumes:
- name: shared-logs
emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
name: qsm-julia-svc
namespace: default
labels:
app: qsm-julia-sidecar
spec:
type: LoadBalancer
ports:
- name: http-qsm
port: 80
targetPort: http-qsm
- name: http-julia
port: 8000
targetPort: http-julia
- name: metrics-qsm
port: 9090
targetPort: metrics-qsm
- name: metrics-julia
port: 9091
targetPort: metrics-julia
selector:
app: qsm-julia-sidecar3. Application-Level Monitoring Code
QSM Service Monitoring (Example in Python/Go)
import time
import logging
import requests
from prometheus_client import Counter, Histogram, start_http_server
# Metrics
REQUEST_COUNT = Counter('qsm_requests_total', 'Total QSM requests', ['status'])
REQUEST_DURATION = Histogram('qsm_request_duration_seconds', 'QSM request duration')
JULIA_REQUEST_COUNT = Counter('julia_requests_total', 'Total requests to Julia', ['status'])
JULIA_REQUEST_DURATION = Histogram('julia_request_duration_seconds', 'Julia request duration')
class QSMService:
def __init__(self):
self.julia_url = "http://localhost:8000"
self.logger = logging.getLogger(__name__)
@REQUEST_DURATION.time()
def process_request(self, request_data):
try:
self.logger.info(f"Processing request: {request_data.get('id', 'unknown')}")
# Send to Julia service with monitoring
with JULIA_REQUEST_DURATION.time():
response = self.send_to_julia(request_data)
REQUEST_COUNT.labels(status='success').inc()
return response
except Exception as e:
self.logger.error(f"Error processing request: {str(e)}")
REQUEST_COUNT.labels(status='error').inc()
raise
def send_to_julia(self, data):
try:
self.logger.debug(f"Sending to Julia: {data}")
response = requests.post(f"{self.julia_url}/optimize", json=data, timeout=300)
if response.status_code == 200:
JULIA_REQUEST_COUNT.labels(status='success').inc()
self.logger.info(f"Julia response received: {response.status_code}")
return response.json()
else:
JULIA_REQUEST_COUNT.labels(status='error').inc()
self.logger.error(f"Julia service error: {response.status_code} - {response.text}")
response.raise_for_status()
except requests.exceptions.Timeout:
JULIA_REQUEST_COUNT.labels(status='timeout').inc()
self.logger.error("Julia service timeout")
raise
except requests.exceptions.ConnectionError:
JULIA_REQUEST_COUNT.labels(status='connection_error').inc()
self.logger.error("Julia service connection error")
raise
# Start metrics server
start_http_server(9090)4. Distributed Tracing Setup
Using Jaeger for Request Tracing
# Add to your deployment
- name: JAEGER_AGENT_HOST
value: "jaeger-agent.default.svc.cluster.local"
- name: JAEGER_SERVICE_NAME
value: "qsm-julia-sidecar"5. Advanced Debugging Techniques
Network Analysis
# Install network tools in debugging pod
kubectl run debug --image=nicolaka/netshoot -it --rm
# From debug pod, test connectivity
nslookup qsm-julia-svc
curl -v http://qsm-julia-svc:80/health
curl -v http://qsm-julia-svc:8000/health
# Check DNS resolution within the pod
kubectl exec -it <pod-name> -c qsm -- nslookup localhost
kubectl exec -it <pod-name> -c qsm -- netstat -tlnpPerformance Analysis
# Resource usage per container
kubectl top pod -l app=qsm-julia-sidecar --containers
# Get detailed resource metrics
kubectl describe pod <pod-name> | grep -A 10 "Containers:"
# Check for OOM kills or restarts
kubectl get events --sort-by='.lastTimestamp' | grep <pod-name>6. Monitoring Dashboards
Prometheus Queries
# Request rate
rate(qsm_requests_total[5m])
# Error rate
rate(qsm_requests_total{status="error"}[5m]) / rate(qsm_requests_total[5m])
# Request duration
histogram_quantile(0.95, rate(qsm_request_duration_seconds_bucket[5m]))
# Julia service availability
up{job="julia-service"}
# Container restarts
increase(kube_pod_container_status_restarts_total[1h])7. Specific Debugging for Complex Optimization Requests
Your JSON Payload Analysis
Based on your nutritional optimization request, here are targeted debugging strategies:
# Monitor payload size and processing time
kubectl logs -l app=qsm-julia-sidecar -c qsm | grep -E "(payload_size|processing_time|ingredients_count|nutrients_count)"
# Check for JSON parsing errors
kubectl logs -l app=qsm-julia-sidecar -c julia-service | grep -i "json\|parse\|invalid"
# Monitor memory usage during large requests
kubectl top pod -l app=qsm-julia-sidecar --containers
# Check for optimization timeouts (your Julia optimization might be complex)
kubectl logs -l app=qsm-julia-sidecar -c qsm | grep -i "timeout\|502\|504"Enhanced Request Monitoring Code
import json
import sys
from datetime import datetime
class OptimizationRequestMonitor:
def log_request_summary(self, request_data):
"""Log key metrics about the optimization request"""
summary = {
"timestamp": datetime.utcnow().isoformat(),
"request_type": "optimization",
"locations_count": len(request_data.get("Locations", [])),
"total_ingredients": 0,
"total_nutrients": 0,
"payload_size_bytes": len(json.dumps(request_data))
}
for location in request_data.get("Locations", []):
ingredients = location.get("Ingredients", [])
summary["total_ingredients"] += len(ingredients)
for ingredient in ingredients:
summary["total_nutrients"] += len(ingredient.get("NutrientLevels", []))
self.logger.info(f"REQUEST_SUMMARY: {json.dumps(summary)}")
# Alert on unusually large requests
if summary["payload_size_bytes"] > 10 * 1024 * 1024: # > 10MB
self.logger.warning(f"LARGE_PAYLOAD: {summary['payload_size_bytes']} bytes")
if summary["total_ingredients"] > 1000:
self.logger.warning(f"HIGH_COMPLEXITY: {summary['total_ingredients']} ingredients")
def validate_request_structure(self, request_data):
"""Validate the request structure before sending to Julia"""
errors = []
if not isinstance(request_data.get("Locations"), list):
errors.append("Locations must be an array")
for i, location in enumerate(request_data.get("Locations", [])):
if not isinstance(location.get("Ingredients"), list):
errors.append(f"Location[{i}].Ingredients must be an array")
continue
for j, ingredient in enumerate(location.get("Ingredients", [])):
if not ingredient.get("IngredientId"):
errors.append(f"Location[{i}].Ingredients[{j}].IngredientId is required")
if ingredient.get("Cost") is None:
errors.append(f"Location[{i}].Ingredients[{j}].Cost is required")
nutrients = ingredient.get("NutrientLevels", [])
for k, nutrient in enumerate(nutrients):
if not nutrient.get("NutrientId"):
errors.append(f"Location[{i}].Ingredients[{j}].NutrientLevels[{k}].NutrientId missing")
if errors:
self.logger.error(f"REQUEST_VALIDATION_ERRORS: {json.dumps(errors)}")
return False
return TrueMemory and Timeout Configuration Updates
# Updated container configs for large optimization requests
containers:
- name: qsm
resources:
requests:
memory: "4Gi" # Increased for large JSON processing
cpu: "2"
limits:
memory: "8Gi" # Increased limits
cpu: "4"
env:
- name: HTTP_TIMEOUT
value: "600" # 10 minute timeout for complex optimizations
- name: MAX_REQUEST_SIZE
value: "50MB" # Allow larger payloads
- name: julia-service
resources:
requests:
memory: "20Gi" # Increased for optimization algorithms
cpu: "6"
limits:
memory: "32Gi" # More memory for complex problems
cpu: "8"
env:
- name: JULIA_OPTIMIZATION_TIMEOUT
value: "540" # 9 minute timeout (less than QSM timeout)
- name: JULIA_GC_ENABLE
value: "true" # Enable garbage collection for memory managementSpecific Request Testing
# Create a test payload file
cat > test_optimization.json << 'EOF'
{
"Locations": [
{
"Ingredients": [
{
"IngredientId": "test-id",
"Available": true,
"Global": true,
"Cost": 6610.0,
"NutrientLevels": [
{"NutrientId": "nutrient-1", "Level": 30.0},
{"NutrientId": "nutrient-2", "Level": 0.22}
]
}
]
}
]
}
EOF
# Test with simplified payload first
kubectl exec -it <pod-name> -c qsm -- curl -X POST \
-H "Content-Type: application/json" \
-H "X-Request-ID: test-001" \
--data-binary @test_optimization.json \
http://localhost:8000/optimize
# Monitor the test request
kubectl logs -f -l app=qsm-julia-sidecar -c julia-service | grep "test-001"Performance Analysis Queries
# Find slow optimization requests
kubectl logs -l app=qsm-julia-sidecar -c julia-service --since=1h | \
grep "optimization_complete" | \
awk '{print $NF}' | \
sort -n | \
tail -10
# Check memory usage patterns
kubectl logs -l app=qsm-julia-sidecar -c julia-service | \
grep -E "memory_usage|gc_run|out_of_memory"
# Find failed optimizations
kubectl logs -l app=qsm-julia-sidecar -c julia-service | \
grep -E "optimization_failed|error|exception" | \
tail -208. Common Issues with Your Request Type
1. JSON Size and Memory Issues
Symptom: OOMKilled containers or 413 Request Entity Too Large
Solution: Increase memory limits, implement request streaming, or batch processing
2. Optimization Complexity Timeouts
Symptom: 504 Gateway Timeout, incomplete responses
Solution: Implement progress callbacks, increase timeouts, or problem decomposition
3. Numerical Precision Issues
Symptom: Inconsistent results, optimization failures
Solution: Validate nutrient levels, check for NaN/Inf values, implement bounds checking
4. Ingredient/Nutrient ID Validation
Symptom: Optimization errors, constraint violations
Solution: Pre-validate all IDs exist in your database, implement ID caching
Debugging Commands for Your Specific Case
# Check for ingredient/nutrient processing errors
kubectl logs -l app=qsm-julia-sidecar -c julia-service | grep -E "IngredientId|NutrientId|constraint"
# Monitor optimization progress
kubectl logs -f -l app=qsm-julia-sidecar -c julia-service | grep -E "optimization|iteration|solution"
# Check for memory pressure during optimization
kubectl describe pod -l app=qsm-julia-sidecar | grep -A 5 -B 5 "memory"Last updated