Retry Strategies Implemented In ReTraced
ReTraced lets every job define its own retry strategy. Not every job needs a unique strategy, but different classes of failure (network, infrastructure, logic, business rules) call for different retry behavior.
Below are the four retry strategies supported in ReTraced, along with where and why each strategy is used in real-world systems.
1. Multi-Phase Retry Strategy (Three-Tier Architecture)
Immediate → Exponential Backoff (± Jitter) → DLQ
This is the most robust and production-proven retry architecture, used by Netflix, Uber, LinkedIn, Stripe, and other large-scale distributed systems.
What This Strategy Is
A progressive retry model in which retries become slower and more controlled as failures persist, eventually landing in a DLQ for inspection.
Instead of retrying blindly, the system adapts retry behavior based on failure persistence.
Phase 1 – Immediate Retry
Characteristics
- Zero / near-zero delay
- Worker-level retry
- No persistence overhead
- Optimized for transient failures
Why This Phase Exists
Most failures are short-lived:
- Temporary network hiccups
- Brief DB disconnects
- Cold starts
- Momentary resource contention
Immediate retry avoids unnecessary scheduling overhead and often succeeds instantly.
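As a rough illustration, an immediate retry can live entirely in the worker, with no persistence between attempts. The function and parameter names below are placeholders, not part of the ReTraced API:

```typescript
// Minimal sketch of a worker-level immediate retry loop.
// `processJob` and `maxImmediateAttempts` are illustrative, not ReTraced's API.
async function runWithImmediateRetry<T>(
  processJob: () => Promise<T>,
  maxImmediateAttempts = 3
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxImmediateAttempts; attempt++) {
    try {
      return await processJob(); // most transient failures succeed here
    } catch (err) {
      lastError = err; // no delay, no persistence: retry right away
    }
  }
  throw lastError; // hand the failure off to the next phase (backoff / DLQ)
}
```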
Phase 2 – Exponential Backoff (± Jitter)
delay = base * 2^n ± random()
Characteristics
- Persisted delay
- Retry interval grows exponentially
- Optional jitter to randomize delay
- Controlled retry rate
Why This Phase Exists
If failures persist:
- Downstream services may be unhealthy
- Immediate retries cause retry storms
- System load amplifies the outage
Exponential backoff protects the system and dependencies while still retrying.
Jitter
Jitter randomizes retry timing to prevent synchronized retries, avoiding thundering-herd problems when many workers retry at the same time.
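A minimal sketch of the delay formula above, using the common "full jitter" variant (randomizing across the whole window); the base, cap, and jitter choice here are illustrative, not necessarily what ReTraced uses internally:

```typescript
// Sketch: exponential backoff with optional full jitter.
// baseMs and capMs are illustrative defaults.
function backoffDelay(attempt: number, baseMs = 1000, capMs = 60_000, jitter = true): number {
  const exponential = Math.min(capMs, baseMs * 2 ** attempt); // base * 2^n, capped
  return jitter
    ? Math.floor(Math.random() * exponential) // full jitter: anywhere in [0, exponential)
    : exponential;
}
```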
Phase 3 – Dead Letter Queue (DLQ)
Triggered when: number of retries >= maximum retry limit
Characteristics
- Retry termination
- Payload preserved
- Error metadata captured
- Manual or automated replay possible
Why This Phase Exists
Some failures are not recoverable by retrying:
- Invalid input
- Logic bugs
- Contract mismatches
- Corrupted data
DLQ enables forensics, debugging, and safe recovery.
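A DLQ entry typically preserves the original payload alongside error metadata so the job can be inspected and replayed. The shape below is purely illustrative and not ReTraced's actual schema:

```typescript
// Hypothetical DLQ record; field names are assumptions, not ReTraced's schema.
interface DeadLetterEntry<P> {
  jobId: string;
  payload: P;            // original payload, preserved verbatim
  error: { message: string; stack?: string };
  attemptsMade: number;  // how many retries were exhausted
  failedAt: Date;
}

// Sketch: move an exhausted job into the DLQ for later inspection or replay.
// `store` stands in for whatever persistence layer backs the DLQ.
async function deadLetter<P>(
  store: { insert(entry: DeadLetterEntry<P>): Promise<void> },
  jobId: string,
  payload: P,
  err: Error,
  attemptsMade: number
): Promise<void> {
  await store.insert({
    jobId,
    payload,
    error: { message: err.message, stack: err.stack },
    attemptsMade,
    failedAt: new Date(),
  });
}
```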
Why Big Tech Uses This Strategy at Massive Scale
Netflix
Where
- Netflix Conductor
- Kafka-based job & workflow systems
How
- Immediate retry in workers
- Exponential backoff via retry queues/topics
- Dead Letter Queue for inspection & replay
Why
- Prevent retry storms during outages
- Preserve failed payloads
- Maintain platform stability
Stripe
Where
- Payment processing
- Webhook delivery systems
How
- Fast retries for flaky networks
- Backoff over minutes to hours
- Final failure surfaced to ops (DLQ equivalent)
Why
- Financial correctness > speed
- Zero data loss tolerance
- Safe recovery from failures
2. Exponential Backoff (With / Without Jitter)
How It Works
delay = base * 2^n ± random()
This is the most commonly used retry strategy and the default in many job schedulers (e.g. BullMQ).
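For reference, this is roughly how BullMQ expresses the same strategy when enqueuing a job; the queue name, payload, and Redis connection details are placeholders:

```typescript
import { Queue } from 'bullmq';

// Sketch only: queue name, payload, and connection details are placeholders.
const queue = new Queue('example-jobs', {
  connection: { host: 'localhost', port: 6379 },
});

async function enqueueWithBackoff(): Promise<void> {
  await queue.add(
    'send-email',
    { to: 'user@example.com' },
    {
      attempts: 5,                                   // give up after 5 attempts
      backoff: { type: 'exponential', delay: 1000 }, // exponential growth from a 1s base
    }
  );
}
```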
Characteristics
- Simple and predictable
- Prevents retry floods
- Scales well in distributed systems
When to Use
- External API calls
- Network-bound jobs
- Rate-limited services
- Idempotent operations
When NOT to Use
- Time-critical jobs requiring immediate success
- Non-idempotent operations
- Jobs requiring deterministic retry timing
3. Linear Retry Strategy
Retries happen at a delay that grows by a fixed increment on each attempt.
How It Works
delay = baseDelay * attempt
Example:
- Retry 1 → 5s
- Retry 2 → 10s
- Retry 3 → 15s
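A minimal sketch of this schedule; the 5-second base mirrors the example above and is purely illustrative:

```typescript
// Sketch: linear backoff, where the delay grows by a fixed step on each attempt.
// With baseDelayMs = 5000 this yields 5s, 10s, 15s, ... as in the example above.
function linearDelay(attempt: number, baseDelayMs = 5000): number {
  return baseDelayMs * attempt; // attempt is 1-based
}
```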
Characteristics
- Predictable retry pattern
- Slower escalation than exponential
- Easier to reason about
When to Use
- Stable systems with moderate load
- Internal services with known recovery time
- Jobs where exponential growth is too aggressive
Downsides
- Can still overload dependencies during outages
- Not ideal for large-scale distributed systems
4. Fixed Retry Strategy
Retries occur after a constant delay, regardless of attempt count.
How It Works
delay = fixedDelay
Example:
- Every retry happens after 10 seconds
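In BullMQ terms this maps onto the built-in fixed backoff type; the attempt count and delay below are placeholder values:

```typescript
// Sketch: BullMQ-style job options for a fixed backoff; values are placeholders.
const fixedRetryOptions = {
  attempts: 3,                               // total attempts before giving up
  backoff: { type: 'fixed', delay: 10_000 }, // constant 10s between attempts
};
```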
Characteristics
- Extremely simple
- Deterministic behavior
- No backoff logic
When to Use
- Development and testing environments
- Short-lived background jobs
- Systems with guaranteed downstream stability
When NOT to Use
- High-scale systems
- Unstable or rate-limited services
- Production workloads under variable load
Summary
| Strategy | Complexity | Scale Suitability | Recommended Use |
|---|---|---|---|
| Multi-Phase | High | Very High | Mission-critical systems |
| Exponential Backoff | Medium | High | Most production jobs |
| Linear | Low-Medium | Medium | Controlled internal systems |
| Fixed | Low | Low | Testing / simple jobs |
ReTraced allows per-job retry customization, enabling teams to choose the right strategy based on failure characteristics, system scale, and correctness guarantees.
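To make the per-job customization concrete, a job definition could carry its strategy as data. The shape below is a hypothetical sketch, not ReTraced's actual configuration API:

```typescript
// Hypothetical per-job retry configuration; not ReTraced's actual API.
type RetryStrategy =
  | { kind: 'multi-phase'; immediateAttempts: number; baseDelayMs: number; maxAttempts: number }
  | { kind: 'exponential'; baseDelayMs: number; jitter: boolean; maxAttempts: number }
  | { kind: 'linear'; baseDelayMs: number; maxAttempts: number }
  | { kind: 'fixed'; delayMs: number; maxAttempts: number };

// Example: a payment-webhook job that prioritizes correctness over speed.
const webhookRetry: RetryStrategy = {
  kind: 'multi-phase',
  immediateAttempts: 2,
  baseDelayMs: 1000,
  maxAttempts: 8,
};
```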