Break It Before Production Does
Failures will happen in production. Chaos engineering simulates them under controlled conditions to verify that the system recovers gracefully.
Chaos Experiments
LLM Timeout
✓ RESILIENT
Fault: LLM does not respond within 5 seconds
Response: Fallback to simpler responses
TTS Service Down
✓ RESILIENT
Fault: TTS service becomes unavailable
Response: Queue audio, retry with backup
Database Latency
⚠ PARTIAL
Fault: DB responds with a 2 s delay
Response: Graceful degradation
Network Partition
✓ RESILIENT
Fault: 50% packet loss
Response: Connection recovery
Memory Pressure
✓ RESILIENT
Fault: Node at 95% memory
Response: Auto-scaling trigger
Calendar API Down
✓ RESILIENT
Fault: Integration endpoint fails
Response: Offer callback instead
Experiment: LLM Provider Failure
Hypothesis
If OpenAI becomes unavailable, the system switches to the Anthropic backup in under 2 seconds, without the user noticing an interruption.
Injection
The fault injector blocks all requests to api.openai.com.
Result
Failover completed in 1.2s. User experienced ~1s pause. Backup provider handled 100% of traffic during the 10-minute outage simulation.
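The failover behaviour exercised by this experiment can be sketched roughly as follows. This is a minimal illustration, not the system's actual implementation: `call_with_failover`, `ProviderDown`, `make_provider`, and the `BLOCKED_HOSTS` set (standing in for the fault injector) are hypothetical names, and the backup host is assumed.

```python
import time

class ProviderDown(Exception):
    """Raised when a provider cannot be reached."""

# Hypothetical fault injector: hosts listed here are treated as unreachable.
BLOCKED_HOSTS = set()

def make_provider(name, host):
    """Build a fake provider call that fails while its host is blocked."""
    def call(prompt):
        if host in BLOCKED_HOSTS:
            raise ProviderDown(f"{name} unreachable ({host})")
        return f"{name}: reply to {prompt!r}"
    return call

primary = make_provider("openai", "api.openai.com")
backup = make_provider("anthropic", "api.anthropic.com")  # assumed backup host

def call_with_failover(prompt, providers, budget_s=2.0):
    """Try providers in order; stop once the failover budget is spent."""
    start = time.monotonic()
    last_error = None
    for provider in providers:
        if time.monotonic() - start > budget_s:
            break
        try:
            return provider(prompt)
        except ProviderDown as exc:
            last_error = exc  # fall through to the next provider
    raise ProviderDown(f"all providers failed: {last_error}")

# Inject the fault from the experiment, then send a request.
BLOCKED_HOSTS.add("api.openai.com")
reply = call_with_failover("hello", [primary, backup])
print(reply)  # anthropic: reply to 'hello'
```

The 2-second budget mirrors the hypothesis; a real check would also measure the elapsed time and compare it against that threshold.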
Chaos Monkey Configuration
# chaos-config.yaml
chaos_monkey:
  enabled: true
  schedule: "0 3 * * *"  # Daily at 3 AM
  duration: 30m
  environment: staging
  experiments:
    - name: llm_timeout
      type: latency_injection
      target: openai-api
      latency: 10s
      probability: 0.1
    - name: tts_failure
      type: service_kill
      target: tts-service
      duration: 5m
    - name: network_chaos
      type: packet_loss
      target: voice-gateway
      loss_rate: 0.3
    - name: memory_pressure
      type: resource_stress
      target: ai-worker
      memory_percent: 90
  blast_radius:
    max_affected_calls: 100
    auto_rollback: true
  notifications:
    slack: "#chaos-alerts"
    pagerduty: false  # staging only
Resilience Patterns Tested
Circuit Breaker
After 5 consecutive failures, the circuit opens and requests go straight to the fallback.
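A circuit breaker with this behaviour might look like the sketch below. The `CircuitBreaker` class and its method names are illustrative assumptions, not the tested system's code. Production breakers usually also include a recovery timeout (a half-open state), which is omitted here because the source does not describe one.

```python
class CircuitBreaker:
    """Opens after `threshold` consecutive failures; while open, calls skip the primary."""

    def __init__(self, threshold=5):
        self.threshold = threshold
        self.failures = 0  # consecutive failures seen so far

    @property
    def is_open(self):
        return self.failures >= self.threshold

    def call(self, primary, fallback):
        if self.is_open:
            return fallback()  # short-circuit: do not touch the primary
        try:
            result = primary()
        except Exception:
            self.failures += 1
            return fallback()
        self.failures = 0  # any success resets the streak
        return result

# Usage: a primary that always fails is only attempted 5 times out of 8 calls.
primary_calls = []

def failing_primary():
    primary_calls.append(1)
    raise RuntimeError("provider down")

breaker = CircuitBreaker(threshold=5)
results = [breaker.call(failing_primary, lambda: "fallback") for _ in range(8)]
print(len(primary_calls), breaker.is_open)  # 5 True
```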
✓ Working correctly
Retry with Backoff
Failed requests are retried with exponential backoff (100 ms, 200 ms, 400 ms).
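The delay schedule above doubles on every retry. A minimal sketch, assuming three retries and a 100 ms base delay (the function names and the injectable `sleep_ms` parameter are hypothetical, added so the example can run without real waiting):

```python
import time

def backoff_delays(base_ms=100, retries=3):
    """Return the delay before each retry: base, 2*base, 4*base, ..."""
    return [base_ms * (2 ** i) for i in range(retries)]

def retry_with_backoff(func, base_ms=100, retries=3,
                       sleep_ms=lambda ms: time.sleep(ms / 1000)):
    """Call func; on failure wait and retry, re-raising the last error."""
    delays = backoff_delays(base_ms, retries)
    for attempt in range(retries + 1):
        try:
            return func()
        except Exception:
            if attempt == retries:
                raise  # retries exhausted
            sleep_ms(delays[attempt])

# Usage: fail twice, then succeed; record the waits instead of sleeping.
attempts = {"n": 0}

def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

waits = []
result = retry_with_backoff(flaky, sleep_ms=waits.append)
print(backoff_delays(), result, waits)  # [100, 200, 400] ok [100, 200]
```

Real deployments often add random jitter to these delays so that many clients do not retry in lockstep; the source does not say whether that is done here.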
✓ Working correctly
Timeout Handling
All external calls have a 5 s timeout with a graceful fallback.
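One way to enforce such a timeout around an arbitrary blocking call is to run it in a worker thread and stop waiting after the deadline. The helper below is a sketch under that assumption; `call_with_timeout` is a hypothetical name, and the demo uses a much shorter timeout than 5 s so it finishes quickly.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

_pool = ThreadPoolExecutor(max_workers=4)

def call_with_timeout(func, timeout_s=5.0, fallback=None):
    """Run func in a worker thread; return fallback if it exceeds timeout_s."""
    future = _pool.submit(func)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        future.cancel()  # best effort; an already running thread is not interrupted
        return fallback

def slow_call():
    time.sleep(0.5)  # stands in for an external service that is too slow
    return "real answer"

slow_result = call_with_timeout(slow_call, timeout_s=0.05, fallback="fallback answer")
fast_result = call_with_timeout(lambda: "fast answer")
print(slow_result, "|", fast_result)  # fallback answer | fast answer
```

Note that the slow call keeps running in the background after the caller has moved on. For HTTP calls, setting the client library's own timeout parameter is usually the better option, since it actually aborts the request.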
✓ Working correctly
Bulkhead Isolation
A failure in one service does not cascade into others.
✓ Working correctly
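Bulkhead isolation is commonly implemented by giving each dependency its own bounded pool of slots, so a stuck dependency can only exhaust its own compartment. The sketch below assumes a semaphore-based design; the `Bulkhead` class, the slot counts, and the compartment names are illustrative rather than taken from the tested system.

```python
import threading

class Bulkhead:
    """Caps concurrent calls to one dependency; excess calls are rejected."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, fallback):
        if not self._slots.acquire(blocking=False):
            return fallback()  # compartment full: fail fast instead of queueing
        try:
            return func()
        finally:
            self._slots.release()

# One compartment per dependency (names are illustrative).
bulkheads = {"tts": Bulkhead(2), "calendar": Bulkhead(2)}

# Simulate a hung TTS service that occupies both of its slots.
release = threading.Event()
started = threading.Semaphore(0)

def stuck_tts():
    started.release()  # signal that a slot is now in use
    release.wait()     # hang until the demo lets go
    return "audio"

workers = [
    threading.Thread(target=bulkheads["tts"].call,
                     args=(stuck_tts, lambda: "rejected"))
    for _ in range(2)
]
for w in workers:
    w.start()
for _ in workers:
    started.acquire()  # wait until both TTS slots are taken

tts_result = bulkheads["tts"].call(lambda: "audio", lambda: "rejected")
calendar_result = bulkheads["calendar"].call(lambda: "booked", lambda: "rejected")

release.set()
for w in workers:
    w.join()
print(tts_result, calendar_result)  # rejected booked
```

The third TTS call is turned away immediately, while the calendar compartment is unaffected, which is the property the test above checks.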