How ReTraced Helped Me Build ReTraced

Known Issue: Exponential Backoff Runtime Discrepancy

Problem Discovery

While stress-testing ReTraced by intentionally breaking the system to force failures, I discovered that the actual retry intervals don't match the expected exponential backoff behavior. This is exactly why explicit retry tracking matters: the data itself exposed the bug.

Test Setup

I deliberately configured a job to always fail (network error) to observe the full retry → DLQ lifecycle:

{
  "backoffStrategy": "exponential",
  "backoffConfig": {
    "baseDelaySeconds": 5,
    "maxDelaySeconds": 60,
    "factor": 2,
    "limitOfTries": 5
  }
}
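
For reference, with this configuration each automatic retry should wait roughly baseDelaySeconds × factor^(retryIndex) seconds, capped at maxDelaySeconds. Here's a minimal sketch of that expectation (my own helper, not ReTraced code; the values are hard-coded from the config above):

// A sketch of the schedule the config above implies: retry i (0-indexed)
// waits baseDelaySeconds * factor^i seconds, capped at maxDelaySeconds.
function expectedDelaySeconds(retryIndex: number): number {
  // Values copied from backoffConfig above.
  const baseDelaySeconds = 5;
  const factor = 2;
  const maxDelaySeconds = 60;
  return Math.min(baseDelaySeconds * Math.pow(factor, retryIndex), maxDelaySeconds);
}

// Delays after the immediate first attempt: 5s, 10s, 20s, 40s (~75s total).
const expected = [0, 1, 2, 3].map(expectedDelaySeconds);
console.log(expected, expected.reduce((sum, d) => sum + d, 0));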

The Output

{
  "jobId": "job-845",
  "createdAt": 1769032645464,
  "updatedAt": 1769032673534,
  "queueName": "email",
  "status": "dead",
  "tries": 5,
  "maxTries": 5,

  "jobData": {
    "emailFrom": "noreply@test.com",
    "emailTo": "user@test.com",
    "subject": "Test",
    "body": "Hello"
  },

  "backoffStrategy": "exponential",
  "backoffConfig": {
    "baseDelaySeconds": 5,
    "maxDelaySeconds": 60,
    "factor": 2,
    "limitOfTries": 5
  },

  "retryAttempts": [
    {
      "attemptedAt": 1769032648561,
      "trigger": "AUTO",
      "changesMade": false,
      "result": "PENDING",
      "error": {
        "code": "NETWORK_ERROR",
        "message": "NETWORK_ERROR",
        "failedAt": 1769032648561
      }
    },
    {
      "attemptedAt": 1769032654078,
      "trigger": "AUTO",
      "changesMade": false,
      "result": "PENDING",
      "error": {
        "code": "NETWORK_ERROR",
        "message": "NETWORK_ERROR",
        "failedAt": 1769032654077
      }
    },
    {
      "attemptedAt": 1769032661127,
      "trigger": "AUTO",
      "changesMade": false,
      "result": "PENDING",
      "error": {
        "code": "NETWORK_ERROR",
        "message": "NETWORK_ERROR",
        "failedAt": 1769032661126
      }
    },
    {
      "attemptedAt": 1769032667470,
      "trigger": "AUTO",
      "changesMade": false,
      "result": "PENDING",
      "error": {
        "code": "NETWORK_ERROR",
        "message": "NETWORK_ERROR",
        "failedAt": 1769032667469
      }
    },
    {
      "attemptedAt": 1769032673538,
      "trigger": "AUTO",
      "changesMade": false,
      "result": "PENDING",
      "error": {
        "code": "NETWORK_ERROR",
        "message": "NETWORK_ERROR",
        "failedAt": 1769032673537
      }
    }
  ]
}

Investigation Commands

Access Redis CLI

docker compose exec redis redis-cli

Inspect job data

127.0.0.1:6379> GET job:job-845
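
If you'd rather inspect the record programmatically, here's a minimal sketch using the ioredis client (my choice for illustration; ReTraced itself may use a different client). It only assumes jobs are stored as JSON strings under job:<jobId> keys, as the redis-cli session above shows:

import Redis from "ioredis";

async function inspectJob(jobId: string): Promise<void> {
  const redis = new Redis(); // defaults to 127.0.0.1:6379
  const raw = await redis.get(`job:${jobId}`);
  if (raw) {
    const job = JSON.parse(raw);
    // Print the fields relevant to the retry investigation.
    console.log(job.status, job.tries, job.retryAttempts?.length);
  } else {
    console.log(`no record found for ${jobId}`);
  }
  redis.disconnect();
}

inspectJob("job-845").catch(console.error);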

Expected vs Actual Results

| Attempt | Delay Formula | Delay | Cumulative Time |
|---------|---------------|-------|-----------------|
| 1       | Immediate     | 0s    | 0s              |
| 2       | 5 × 2⁰        | 5s    | 5s              |
| 3       | 5 × 2¹        | 10s   | 15s             |
| 4       | 5 × 2²        | 20s   | 35s             |
| 5       | 5 × 2³        | 40s   | 75s             |

Expected total retry duration: ~75 seconds

Actual Behavior (From Redis Data)

{
  "jobId": "job-845",
  "status": "dead",
  "tries": 5,
  "retryAttempts": [
    { "attemptedAt": 1769032648561 }, // +3.1s
    { "attemptedAt": 1769032654078 }, // +5.5s
    { "attemptedAt": 1769032661127 }, // +7.0s
    { "attemptedAt": 1769032667470 }, // +6.3s
    { "attemptedAt": 1769032673538 }  // +6.1s
  ]
}

Actual intervals: 3.1s, 5.5s, 7.0s, 6.3s, 6.1s
Actual total: ~28 seconds (vs expected 75 seconds)

The delays plateau at ~6 seconds instead of growing exponentially.
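
The intervals above fall straight out of the stored timestamps. Here's a quick sketch of that arithmetic, using createdAt and the attemptedAt values copied from the record:

// Millisecond timestamps copied from the Redis record above.
const createdAt = 1769032645464;
const attemptedAt = [
  1769032648561, 1769032654078, 1769032661127, 1769032667470, 1769032673538,
];

// Interval before each attempt, measured from the previous attempt (or createdAt).
const intervals = attemptedAt.map((t, i) =>
  ((t - (i === 0 ? createdAt : attemptedAt[i - 1])) / 1000).toFixed(1)
);
console.log(intervals); // [ '3.1', '5.5', '7.0', '6.3', '6.1' ]
console.log(((attemptedAt[4] - createdAt) / 1000).toFixed(1)); // ~28.1s total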

Why This Matters

This bug was only discoverable because ReTraced makes retry data explicit:

✅ Timestamps exposed the timing issue
✅ Retry history showed the pattern
✅ Structured data enabled analysis

Most schedulers hide this information, making such bugs invisible.

Takeaway

Intentionally breaking the system revealed bugs that would be hidden in traditional job schedulers. This validates ReTraced's core philosophy: explicit retry data makes systems debuggable and observable.