How ReTraced Helped Me Build ReTraced
Known Issue: Exponential Backoff Runtime Discrepancy
Problem Discovery
While stress-testing ReTraced by intentionally breaking the system to force failures, I discovered that the actual retry intervals don't match the expected exponential backoff behavior. This is exactly why explicit retry tracking matters: the data itself exposed the bug.
Test Setup
I deliberately configured a job to always fail with a network error so I could observe the full retry → DLQ lifecycle:
```json
{
  "backoffStrategy": "exponential",
  "backoffConfig": {
    "baseDelaySeconds": 5,
    "maxDelaySeconds": 60,
    "factor": 2,
    "limitOfTries": 5
  }
}
```
The Output
```json
{
  "jobId": "job-805",
  "createdAt": 1769032645464,
  "updatedAt": 1769032673534,
  "queueName": "email",
  "status": "dead",
  "tries": 5,
  "maxTries": 5,
  "jobData": {
    "emailFrom": "noreply@test.com",
    "emailTo": "user@test.com",
    "subject": "Test",
    "body": "Hello"
  },
  "backoffStrategy": "exponential",
  "backoffConfig": {
    "baseDelaySeconds": 5,
    "maxDelaySeconds": 60,
    "factor": 2,
    "limitOfTries": 5
  },
  "retryAttempts": [
    {
      "attemptedAt": 1769032648561,
      "trigger": "AUTO",
      "changesMade": false,
      "result": "PENDING",
      "error": {
        "code": "NETWORK_ERROR",
        "message": "NETWORK_ERROR",
        "failedAt": 1769032648561
      }
    },
    {
      "attemptedAt": 1769032654078,
      "trigger": "AUTO",
      "changesMade": false,
      "result": "PENDING",
      "error": {
        "code": "NETWORK_ERROR",
        "message": "NETWORK_ERROR",
        "failedAt": 1769032654077
      }
    },
    {
      "attemptedAt": 1769032661127,
      "trigger": "AUTO",
      "changesMade": false,
      "result": "PENDING",
      "error": {
        "code": "NETWORK_ERROR",
        "message": "NETWORK_ERROR",
        "failedAt": 1769032661126
      }
    },
    {
      "attemptedAt": 1769032667470,
      "trigger": "AUTO",
      "changesMade": false,
      "result": "PENDING",
      "error": {
        "code": "NETWORK_ERROR",
        "message": "NETWORK_ERROR",
        "failedAt": 1769032667469
      }
    },
    {
      "attemptedAt": 1769032673538,
      "trigger": "AUTO",
      "changesMade": false,
      "result": "PENDING",
      "error": {
        "code": "NETWORK_ERROR",
        "message": "NETWORK_ERROR",
        "failedAt": 1769032673537
      }
    }
  ]
}
```
Investigation Commands
Access Redis CLI
```bash
docker compose exec redis redis-cli
```
Inspect job data
```
127.0.0.1:6379> GET job:job-805
```
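For repeated inspection, the same lookup can be scripted. Below is a minimal sketch using redis-py, assuming the job record is stored as a plain JSON string under the `job:<jobId>` key shown above and that Redis is reachable on localhost:6379 (both are assumptions, not a documented ReTraced interface):

```python
import json

import redis  # redis-py client

# Assumption: Redis is exposed on localhost:6379 by docker compose.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Assumption: the job record is a JSON string stored under job:<jobId>.
raw = r.get("job:job-805")
if raw is None:
    raise SystemExit("job not found")

job = json.loads(raw)
print(job["status"], f'{job["tries"]}/{job["maxTries"]} tries')

# Print each recorded retry attempt with its timestamp and error code.
for attempt in job["retryAttempts"]:
    print(attempt["attemptedAt"], attempt["result"], attempt["error"]["code"])
```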
Expected vs Actual Results
| Attempt | Delay Formula | Delay | Cumulative Time |
|---|---|---|---|
| 1 | Immediate | 0s | 0s |
| 2 | 5 × 2⁰ | 5s | 5s |
| 3 | 5 × 2¹ | 10s | 15s |
| 4 | 5 × 2² | 20s | 35s |
| 5 | 5 × 2³ | 40s | 75s |
Expected total retry duration: ~75 seconds
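The expected schedule in the table follows directly from the config. Here is a minimal sketch of the capped-exponential formula implied above (first attempt immediate, then `delay = min(baseDelaySeconds × factor^(attempt − 2), maxDelaySeconds)`); it is an illustration of the intended behavior, not ReTraced's actual scheduling code:

```python
base, factor, cap, max_tries = 5, 2, 60, 5  # values from backoffConfig

cumulative = 0
for attempt in range(1, max_tries + 1):
    # First attempt runs immediately; later attempts wait base * factor^(n-2),
    # capped at maxDelaySeconds.
    delay = 0 if attempt == 1 else min(base * factor ** (attempt - 2), cap)
    cumulative += delay
    print(f"attempt {attempt}: delay {delay}s, cumulative {cumulative}s")

print(f"expected total retry duration: ~{cumulative}s")  # ~75s
```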
Actual Behavior (From Redis Data)
```jsonc
{
  "jobId": "job-805",
  "status": "dead",
  "tries": 5,
  "retryAttempts": [
    { "attemptedAt": 1769032648561 }, // +3.1s after createdAt
    { "attemptedAt": 1769032654078 }, // +5.5s
    { "attemptedAt": 1769032661127 }, // +7.0s
    { "attemptedAt": 1769032667470 }, // +6.3s
    { "attemptedAt": 1769032673538 }  // +6.1s
  ]
}
```
Actual intervals: 3.1s, 5.5s, 7.0s, 6.3s, 6.1s.
Actual total: ~28 seconds (vs. the expected ~75 seconds).
The delays plateau at ~6 seconds instead of growing exponentially.
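The plateau is easy to verify straight from the stored timestamps. A short sketch that recomputes the intervals above from the `createdAt` and `attemptedAt` values in the job record:

```python
created_at = 1769032645464  # createdAt from the job record (ms)
attempted_at = [
    1769032648561,
    1769032654078,
    1769032661127,
    1769032667470,
    1769032673538,
]

# Interval between each attempt and the previous event (creation or prior attempt).
previous = created_at
for i, ts in enumerate(attempted_at, start=1):
    print(f"attempt {i}: +{(ts - previous) / 1000:.1f}s")
    previous = ts

total = (attempted_at[-1] - created_at) / 1000
print(f"total: ~{total:.0f}s (expected ~75s)")
```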
Why This Matters
This bug was only discoverable because ReTraced makes retry data explicit:

✅ Timestamps exposed the timing issue
✅ Retry history showed the pattern
✅ Structured data enabled analysis

Most schedulers hide this information, making such bugs invisible.
Takeaway
Intentionally breaking the system revealed bugs that would be hidden in traditional job schedulers. This validates ReTraced's core philosophy: explicit retry data makes systems debuggable and observable.