Phase 3: Production Hardening
Phase Objective
Prepare the system for production with retry logic, monitoring, health checks, a systemd service, and complete documentation. This phase adds resilience, observability, and operability.
Main deliverables:
- Retry with exponential backoff
- Automatic cleanup of old jobs
- Admin endpoints for managing failed jobs
- Prometheus metrics and health checks
- Structured logging and alerting
- Systemd service with auto-restart
- Complete technical documentation
Subphases
3.1 Retry + Error Handling
Tasks: 4 cards | Estimate: 14 hours | Objective: implement retry logic, cleanup, and admin endpoints
Main tasks:
- Retry columns migration (2h) — see the sketch after this list
- Retry logic with exponential backoff (5h)
- Stale jobs cleanup cronjob (3h)
- Failed jobs admin endpoint (4h)
Dependencies: requires Phase 2 to be fully complete
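A minimal sketch of that retry-columns migration, assuming PostgreSQL and the background_jobs table referenced later in this plan (the next_attempt_at column is an extra assumption for storing the backoff schedule):
```php
<?php
// Hypothetical migration sketch — adds the retry bookkeeping columns.
// Assumes a PDO connection to the PostgreSQL database.
$pdo->exec(<<<'SQL'
ALTER TABLE background_jobs
    ADD COLUMN IF NOT EXISTS retry_count     INTEGER NOT NULL DEFAULT 0,
    ADD COLUMN IF NOT EXISTS max_retries     INTEGER NOT NULL DEFAULT 5,
    ADD COLUMN IF NOT EXISTS next_attempt_at TIMESTAMPTZ -- assumed, for scheduling
SQL);
```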
3.2 Monitoring + Health Checks
Tasks: 4 cards | Estimate: 14 hours | Objective: metrics, health checks, structured logging, and alerting
Main tasks:
- Prometheus metrics endpoint (4h)
- Health check endpoint (3h)
- Structured logging (3h) — see the sketch after this list
- Alert triggers (4h)
Dependencies: requires Phase 3.1 complete
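As a sketch of what the structured logging task could produce — one JSON object per line, to STDOUT and a file, as the 3.2 completion criteria require (the JobLogger name and field set are illustrative):
```php
<?php
// Hypothetical JobLogger sketch: JSON lines to STDOUT and a log file.
final class JobLogger
{
    public function __construct(private string $logFile) {}

    public function info(string $message, array $context = []): void
    {
        $line = json_encode([
            'ts'      => date(DATE_ATOM),
            'level'   => 'info',
            'message' => $message,
            'context' => $context,
        ], JSON_UNESCAPED_SLASHES) . PHP_EOL;

        fwrite(STDOUT, $line); // journalctl captures STDOUT under systemd
        file_put_contents($this->logFile, $line, FILE_APPEND | LOCK_EX);
    }
}
```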
3.3 Systemd Service
Tasks: 3 cards | Estimate: 5 hours | Objective: systemd service with auto-restart and journalctl logs
Main tasks:
- Systemd service file (2h)
- Auto-restart config (1h)
- journalctl logs (2h)
Dependencies: requires Phase 3.2 complete
3.4 Documentation
Tasks: 5 cards | Estimate: 15 hours | Objective: complete documentation (CLAUDE.md, technical, architecture, OpenAPI, CHANGELOG)
Main tasks:
- Update server/CLAUDE.md (3h)
- Create docs/backend/background-jobs-system.md (4h)
- Create docs/architecture/background-jobs-architecture.md (3h)
- Update OpenAPI (3h)
- Update CHANGELOG.md (2h)
Dependencies: requires Phases 3.1-3.3 complete
Dependencies Between Subphases
Phase 2 SSE complete
↓
3.1 Retry + Error Handling
↓
3.2 Monitoring + Health Checks
↓
3.3 Systemd Service
↓
3.4 Documentation
Recommended sequence:
- Validate that Phase 2 is complete and tests pass
- Implement 3.1 (retry and error handling)
- Implement 3.2 (monitoring and observability)
- Implement 3.3 (systemd for deployment)
- Implement 3.4 (complete documentation)
Total Phase Estimate
- Total cards: 16
- Total hours: ~48
- Estimated duration (1 dev full-time): 6-7 working days
- Estimated duration (1 dev, 50% part-time): 12-14 working days
Completion Criteria
Phase 3 is considered complete when:
3.1 Retry
- [ ] retry_count and max_retries columns added
- [ ] Exponential backoff implemented (delay = base * 2^attempt)
- [ ] Cleanup cronjob runs every hour
- [ ] Admin endpoint allows viewing/retrying/deleting failed jobs
- [ ] Tests verify the complete retry flow
3.2 Monitoring
- [ ] /jobs/metrics endpoint returns Prometheus format
- [ ] Health check verifies worker, DB, and queue size
- [ ] Structured JSON logging to STDOUT and a file
- [ ] Alert service sends notifications to an external webhook
- [ ] Tests verify format and thresholds
3.3 Systemd
- [ ] Service file installed and enabled
- [ ] Auto-restart configured (max 5 restarts in 600s)
- [ ] Logs visible in journalctl
- [ ] Graceful shutdown on SIGTERM
- [ ] Tests simulate crash/restart
3.4 Documentation
- [ ] server/CLAUDE.md updated with a Background Jobs section
- [ ] docs/backend/background-jobs-system.md complete
- [ ] docs/architecture/background-jobs-architecture.md complete
- [ ] OpenAPI spec updated
- [ ] CHANGELOG.md updated with the new version
- [ ] All docs reviewed, with no dead links
Technical Notes
Exponential Backoff
```php
// RetryStrategy.php — next attempt time grows exponentially, capped at 1 hour
public function calculateNextAttempt(int $retryCount): DateTime
{
    $baseDelay = 60; // 1 minute
    $delay = min($baseDelay * (2 ** $retryCount), 3600); // max 1 hour
    return new DateTime("+{$delay} seconds");
}
```
Resulting delays:
Retry 0: 60s (1 min)
Retry 1: 120s (2 min)
Retry 2: 240s (4 min)
Retry 3: 480s (8 min)
Retry 4: 960s (16 min)
Retry 5: 1920s (32 min)
Retry 6+: 3600s (1 hour) ← max cap
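How the worker could consume this on failure — increment the count, then reschedule or give up. Column names follow the 3.1 criteria; the $jobModel->update API and the next_attempt_at column are assumptions:
```php
<?php
// Hypothetical failure-path sketch: retry with backoff or mark as failed.
if ($job->retry_count < $job->max_retries) {
    $jobModel->update($job->id, [
        'status'          => 'pending',
        'retry_count'     => $job->retry_count + 1,
        'next_attempt_at' => $strategy->calculateNextAttempt($job->retry_count),
    ]);
} else {
    $jobModel->update($job->id, ['status' => 'failed']);
}
```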
Cleanup Stale Jobs
```bash
# Crontab — run the cleanup script every hour
0 * * * * /usr/bin/php /var/www/Bautista/server/bin/cleanup-stale-jobs.php >> /var/log/background-jobs-cleanup.log 2>&1
```
```php
// cleanup-stale-jobs.php — purge completed/failed jobs older than 30 days
$deleted = $jobModel->deleteWhere([
    'status' => ['completed', 'failed'],
    'completed_at < NOW() - INTERVAL \'30 days\''
]);
$logger->info("Cleanup: deleted {$deleted} jobs");
```
Prometheus Metrics
```
# HELP background_jobs_pending Number of pending jobs
# TYPE background_jobs_pending gauge
background_jobs_pending{schema="suc0001"} 15
background_jobs_pending{schema="suc0002"} 3

# HELP background_jobs_processing Number of processing jobs
# TYPE background_jobs_processing gauge
background_jobs_processing{schema="suc0001"} 2

# HELP background_jobs_completed_total Total completed jobs
# TYPE background_jobs_completed_total counter
background_jobs_completed_total{schema="suc0001"} 1543

# HELP background_jobs_failed_total Total failed jobs
# TYPE background_jobs_failed_total counter
background_jobs_failed_total{schema="suc0001"} 12

# HELP background_jobs_execution_seconds Job execution time in seconds
# TYPE background_jobs_execution_seconds histogram
background_jobs_execution_seconds_bucket{schema="suc0001",le="1"} 450
background_jobs_execution_seconds_bucket{schema="suc0001",le="5"} 890
background_jobs_execution_seconds_bucket{schema="suc0001",le="10"} 1200
```
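A sketch of how the /jobs/metrics endpoint could render this exposition format from per-schema counts. The schema_name grouping and the query itself are illustrative assumptions, not the final implementation:
```php
<?php
// Hypothetical /jobs/metrics handler sketch: Prometheus text exposition.
header('Content-Type: text/plain; version=0.0.4');

// Assumed query: pending job counts grouped by tenant schema.
$rows = $pdo->query(
    "SELECT schema_name, COUNT(*) AS n
       FROM background_jobs WHERE status = 'pending'
      GROUP BY schema_name"
)->fetchAll(PDO::FETCH_ASSOC);

echo "# HELP background_jobs_pending Number of pending jobs\n";
echo "# TYPE background_jobs_pending gauge\n";
foreach ($rows as $row) {
    printf("background_jobs_pending{schema=\"%s\"} %d\n", $row['schema_name'], $row['n']);
}
```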
Health Check
GET /jobs/health
```json
{
  "status": "healthy",
  "timestamp": "2025-01-15T10:30:00Z",
  "checks": {
    "worker": {
      "status": "up",
      "last_heartbeat": "2025-01-15T10:29:55Z",
      "latency_ms": 850
    },
    "database": {
      "status": "up",
      "connection_pool": "5/10"
    },
    "queue": {
      "status": "healthy",
      "pending_count": 15,
      "threshold": 1000
    }
  }
}
```
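A minimal sketch of the aggregation behind that response — each check contributes to the overall status, degrading to 503 when anything is down (helper names like checkWorker are assumptions):
```php
<?php
// Hypothetical /jobs/health handler sketch: aggregate individual checks.
$checks = [
    'worker'   => checkWorker($pdo),   // assumed: last heartbeat vs. now
    'database' => checkDatabase($pdo), // assumed: SELECT 1 + pool stats
    'queue'    => checkQueue($pdo),    // assumed: pending count vs. threshold
];

$healthy = !in_array('down', array_column($checks, 'status'), true);

http_response_code($healthy ? 200 : 503);
header('Content-Type: application/json');
echo json_encode([
    'status'    => $healthy ? 'healthy' : 'unhealthy',
    'timestamp' => gmdate('Y-m-d\TH:i:s\Z'),
    'checks'    => $checks,
], JSON_PRETTY_PRINT);
```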
Systemd Service
```ini
# /etc/systemd/system/background-jobs-worker.service
[Unit]
Description=Background Jobs Worker - Sistema Bautista
After=postgresql.service
Requires=postgresql.service
# Rate-limit restarts: at most 5 starts per 600s
StartLimitIntervalSec=600
StartLimitBurst=5

[Service]
Type=simple
User=www-data
WorkingDirectory=/var/www/Bautista/server
ExecStart=/usr/bin/php /var/www/Bautista/server/bin/worker.php
Restart=on-failure
RestartSec=5
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target
```
```bash
# Installation
sudo cp server/systemd/background-jobs-worker.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable background-jobs-worker
sudo systemctl start background-jobs-worker

# Monitoring
sudo systemctl status background-jobs-worker
journalctl -u background-jobs-worker.service -f
```
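Restart=on-failure only behaves well if the worker exits cleanly on deploys and stops; a sketch of the SIGTERM handling that the 3.3 criteria call for (the loop and helper names are illustrative, not the actual worker.php):
```php
<?php
// Hypothetical worker.php shutdown sketch: finish the current job, then exit.
pcntl_async_signals(true);

$running = true;
pcntl_signal(SIGTERM, function () use (&$running) {
    $running = false; // let the current iteration finish cleanly
});

while ($running) {
    $job = claimNextJob();   // assumed helper: atomically claim a pending job
    if ($job !== null) {
        processJob($job);    // assumed helper: execute and record the result
    } else {
        sleep(1);            // idle poll
    }
}

exit(0); // clean exit, so systemd does not count this as a failure
```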
Alert Triggers
```php
// AlertService.php — alert definitions
$alerts = [
    'high_failure_rate' => [
        'condition' => 'failed jobs > 5 in 5 minutes',
        'threshold' => 5,
        'window'    => 300,  // 5 min
        'cooldown'  => 1800, // 30 min
    ],
    'queue_overflow' => [
        'condition' => 'pending jobs > 500',
        'threshold' => 500,
    ],
    'worker_inactive' => [
        'condition' => 'no heartbeat in 10 minutes',
        'threshold' => 600,
    ],
];

// Webhook to Slack/Email
$this->sendAlert([
    'level'     => 'critical',
    'title'     => 'High failure rate detected',
    'message'   => '8 jobs failed in the last 5 minutes',
    'schema'    => 'suc0001',
    'timestamp' => now(),
]);
```
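The delivery side can stay small; a sketch of sendAlert posting JSON to the configured external webhook (ALERT_WEBHOOK_URL is an assumed environment variable, in line with the .env item in the deployment checklist):
```php
<?php
// Hypothetical AlertService delivery sketch.
final class AlertService
{
    // POST the alert payload to the external webhook as JSON.
    public function sendAlert(array $payload): void
    {
        $ch = curl_init(getenv('ALERT_WEBHOOK_URL')); // assumed env var
        curl_setopt_array($ch, [
            CURLOPT_POST           => true,
            CURLOPT_POSTFIELDS     => json_encode($payload),
            CURLOPT_HTTPHEADER     => ['Content-Type: application/json'],
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_TIMEOUT        => 5, // never block the worker on alerting
        ]);
        curl_exec($ch);
        curl_close($ch);
    }
}
```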
Deployment Checklist
Pre-Deployment
- [ ] All migrations executed
- [ ] Tests passing (unit + integration + e2e)
- [ ] Coverage >= 85%
- [ ] ENABLE_BACKGROUND_JOBS feature flag configured
- [ ] Environment variables set in .env
- [ ] Systemd service installed
Deployment
- [ ] Run migrations in production
- [ ] Start the systemd service
- [ ] Verify the health check returns 200 OK
- [ ] Verify the metrics endpoint works
- [ ] Configure Prometheus scraping (if applicable)
- [ ] Configure alerting webhooks
Post-Deployment
- [ ] Monitor journalctl logs during the first hours
- [ ] Verify jobs complete successfully
- [ ] Verify SSE works in production
- [ ] Verify retry logic works
- [ ] Configure automatic backups of the background_jobs table
- [ ] Document the runbook in the wiki