Retry Strategies Implemented In ReTraced

ReTraced lets every job define its own retry strategy. Not every job needs a unique one, but different classes of failure (network, infrastructure, logic, business rules) call for different retry behavior.

Below are the four retry strategies supported in ReTraced, along with where and why each strategy is used in real-world systems.


1. Multi-Phase Retry Strategy (Three-Tier Architecture)

Immediate → Exponential Backoff (± Jitter) → DLQ

This is the most robust of the four strategies, and the pattern is used in large-scale distributed systems at companies such as Netflix, Uber, LinkedIn, and Stripe.

What This Strategy Is

A progressive retry model where retries become slower and more controlled as failures persist, and eventually terminate into a DLQ for inspection.

Instead of retrying blindly, the system adapts retry behavior based on failure persistence.
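
The progression above can be sketched as a small decision function. This is a minimal illustration, not ReTraced's actual API: `plan_retry`, `RetryDecision`, and the default limits (2 immediate retries, 6 retries in total, a 1-second base) are assumptions chosen for the example, and jitter is left out so the result is deterministic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryDecision:
    phase: str            # "immediate", "backoff", or "dlq"
    delay_seconds: float  # how long to wait before the next attempt


def plan_retry(
    retries_done: int,
    immediate_retries: int = 2,   # assumed size of phase 1
    max_retries: int = 6,         # assumed overall retry limit
    base_seconds: float = 1.0,    # assumed backoff base
) -> RetryDecision:
    """Pick the phase for a job that has just failed after `retries_done` retries."""
    if retries_done >= max_retries:
        # Retry budget exhausted: stop retrying and hand the job to the DLQ.
        return RetryDecision("dlq", 0.0)
    if retries_done < immediate_retries:
        # Early failures are assumed transient, so retry without delay.
        return RetryDecision("immediate", 0.0)
    # Persistent failures: the delay doubles with each further retry.
    n = retries_done - immediate_retries
    return RetryDecision("backoff", base_seconds * 2 ** n)
```

With these assumed defaults, the first two retries run immediately, the next four wait 1, 2, 4, and 8 seconds, and a job that has used all six retries goes to the DLQ.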


Phase 1 – Immediate Retry

Characteristics

  • Zero / near-zero delay
  • Worker-level retry
  • No persistence overhead
  • Optimized for transient failures

Why This Phase Exists

Most failures are short-lived:

  • Temporary network hiccups
  • Brief DB disconnects
  • Cold starts
  • Momentary resource contention

Immediate retry avoids unnecessary scheduling overhead and often succeeds instantly.
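
A worker-level immediate retry might look like the sketch below. `TransientError` and `run_with_immediate_retry` are hypothetical names introduced for illustration. The loop keeps everything in memory and never sleeps, which matches the "no persistence overhead" and "zero delay" characteristics listed above.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying right away."""


def run_with_immediate_retry(job: Callable[[], T], max_immediate_retries: int = 2) -> T:
    """Run `job`, retrying in-process with no delay on transient errors."""
    retries_left = max_immediate_retries
    while True:
        try:
            return job()
        except TransientError:
            if retries_left <= 0:
                # Immediate retries are used up; let the caller escalate
                # (for example, to the backoff phase).
                raise
            retries_left -= 1
```

Only errors explicitly marked as transient are retried here; anything else propagates on the first failure, which keeps logic bugs from consuming the immediate-retry budget.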


Phase 2 – Exponential Backoff (± Jitter)

delay = base * 2^n ± random()

Here n is the retry count and random() is the optional jitter term.

Characteristics

  • Persisted delay
  • Retry interval grows exponentially
  • Optional jitter to randomize delay
  • Controlled retry rate

Why This Phase Exists

If failures persist:

  • Downstream services may be unhealthy
  • Immediate retries cause retry storms
  • System load amplifies the outage

Exponential backoff protects the system and dependencies while still retrying.

Jitter

Jitter randomizes retry timing to prevent synchronized retries, avoiding thundering herd problems when multiple workers retry simultaneously.
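
The formula for this phase can be written as a short function. This is a sketch under stated assumptions: the name `backoff_delay`, the 1-second base, and the 0.5-second jitter bound are illustrative, and the jitter is drawn uniformly from a symmetric range, which is one common reading of "± random()".

```python
import random
from typing import Optional


def backoff_delay(
    n: int,
    base_seconds: float = 1.0,     # assumed base delay
    jitter_seconds: float = 0.5,   # assumed maximum jitter in either direction
    rng: Optional[random.Random] = None,
) -> float:
    """Return base * 2^n plus a random offset in [-jitter_seconds, +jitter_seconds]."""
    source = rng if rng is not None else random
    jitter = source.uniform(-jitter_seconds, jitter_seconds)
    # Clamp at zero so a large jitter value can never produce a negative delay.
    return max(0.0, base_seconds * 2 ** n + jitter)


if __name__ == "__main__":
    # Five workers retrying the same attempt receive slightly different delays.
    rng = random.Random(7)
    print([round(backoff_delay(3, rng=rng), 2) for _ in range(5)])
```

Passing a seeded `random.Random` makes the delays reproducible in tests, while production callers can omit `rng` and use the module-level generator.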


Phase 3 – Dead Letter Queue (DLQ)

Triggered when: number of retries >= maximum retry limit

Characteristics

  • Retry termination
  • Payload preserved
  • Error metadata captured
  • Manual or automated replay possible

Why This Phase Exists

Some failures are not recoverable by retrying:

  • Invalid input
  • Logic bugs
  • Contract mismatches
  • Corrupted data

DLQ enables forensics, debugging, and safe recovery.
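
The hand-off to the DLQ can be sketched as follows. The `DeadLetter` record and `route_failure` helper are hypothetical, and a plain Python list stands in for whatever durable queue a real deployment would use. The point is only that the payload and the error metadata are kept together once the retry limit is reached.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class DeadLetter:
    job_id: str
    payload: Dict[str, Any]   # original payload, preserved for replay
    error_type: str           # error metadata for later inspection
    error_message: str
    retries: int
    failed_at: float = field(default_factory=time.time)


def route_failure(
    job_id: str,
    payload: Dict[str, Any],
    error: Exception,
    retries: int,
    max_retries: int,
    dlq: List[DeadLetter],
) -> bool:
    """Move the job to the DLQ once retries >= max_retries; return True if moved."""
    if retries < max_retries:
        return False  # still within the retry budget
    dlq.append(
        DeadLetter(
            job_id=job_id,
            payload=payload,
            error_type=type(error).__name__,
            error_message=str(error),
            retries=retries,
        )
    )
    return True
```

Because each entry carries the full payload, a replay tool can resubmit the job unchanged after the underlying bug or bad input has been fixed.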


Why Big Tech Uses This Strategy at Massive Scale

Netflix

Where

  • Netflix Conductor
  • Kafka-based job & workflow systems

How

  • Immediate retry in workers
  • Exponential backoff via retry queues/topics
  • Dead Letter Queue for inspection & replay

Why

  • Prevent retry storms during outages
  • Preserve failed payloads
  • Maintain platform stability

Stripe

Where

  • Payment processing
  • Webhook delivery systems

How

  • Fast retries for flaky networks
  • Backoff over minutes to hours
  • Final failure surfaced to ops (DLQ equivalent)

Why

  • Financial correctness > speed
  • Zero data loss tolerance
  • Safe recovery from failures

2. Exponential Backoff (With / Without Jitter)

How It Works

delay = base * 2^n ± random()

This is the most commonly used retry strategy and a built-in backoff option in many job queues (e.g. BullMQ).
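
As a worked example of the formula without jitter, the sketch below lists the delay for each retry and the total time spent waiting. The function name `exponential_schedule` and the choice to start n at 0 are assumptions made for illustration.

```python
from typing import List


def exponential_schedule(base_seconds: float, retries: int) -> List[float]:
    """Delays for retries n = 0 .. retries - 1, using delay = base * 2^n."""
    return [base_seconds * 2 ** n for n in range(retries)]


if __name__ == "__main__":
    schedule = exponential_schedule(1.0, 5)
    print(schedule, "total wait:", sum(schedule))
```

With a 1-second base, five retries wait 1, 2, 4, 8, and 16 seconds, 31 seconds in total. Each additional retry roughly doubles the total wait, which is why this strategy is a poor fit for time-critical jobs.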

Characteristics

  • Simple and predictable
  • Prevents retry floods
  • Scales well in distributed systems

When to Use

  • External API calls
  • Network-bound jobs
  • Rate-limited services
  • Idempotent operations

When NOT to Use

  • Time-critical jobs requiring immediate success
  • Non-idempotent operations
  • Jobs requiring deterministic retry timing

3. Linear Retry Strategy

The delay grows by a fixed increment with each retry.

How It Works

delay = baseDelay * attempt

Example:

  • Retry 1 → 5s
  • Retry 2 → 10s
  • Retry 3 → 15s
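
The example above corresponds to a base delay of 5 seconds. A minimal sketch, with `linear_delay` as an illustrative name and attempts counted from 1:

```python
def linear_delay(attempt: int, base_delay: float = 5.0) -> float:
    """Return baseDelay * attempt, where the first retry is attempt 1."""
    if attempt < 1:
        raise ValueError("attempt is 1-based")
    return base_delay * attempt


if __name__ == "__main__":
    print([linear_delay(attempt) for attempt in (1, 2, 3)])
```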

Characteristics

  • Predictable retry pattern
  • Slower escalation than exponential
  • Easier to reason about

When to Use

  • Stable systems with moderate load
  • Internal services with known recovery time
  • Jobs where exponential growth is too aggressive

Downsides

  • Can still overload dependencies during outages
  • Not ideal for large-scale distributed systems

4. Fixed Retry Strategy

Retries occur after a constant delay, regardless of attempt count.

How It Works

delay = fixedDelay

Example:

  • Every retry happens after 10 seconds
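
A fixed-delay retry loop can be sketched in a few lines. The helper name `retry_fixed` and its parameters are hypothetical. The `sleep` argument is injectable so the loop can be tested without actually waiting, and catching every `Exception` is a simplification that a real job would likely narrow.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_fixed(
    job: Callable[[], T],
    retries: int = 3,                 # assumed retry limit
    fixed_delay: float = 10.0,        # same delay before every retry
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `job`, waiting `fixed_delay` seconds before each retry."""
    retries_left = retries
    while True:
        try:
            return job()
        except Exception:
            if retries_left <= 0:
                raise  # retry limit reached; surface the last error
            retries_left -= 1
            sleep(fixed_delay)
```

The delay never changes with the attempt count, so a failing dependency sees a steady stream of requests. That is acceptable in development and risky under production load.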

Characteristics

  • Extremely simple
  • Deterministic behavior
  • No backoff logic

When to Use

  • Development and testing environments
  • Short-lived background jobs
  • Systems with guaranteed downstream stability

When NOT to Use

  • High-scale systems
  • Unstable or rate-limited services
  • Production workloads under variable load

Summary

Strategy            | Complexity | Scale Suitability | Recommended Use
Multi-Phase         | High       | Very High         | Mission-critical systems
Exponential Backoff | Medium     | High              | Most production jobs
Linear              | Low-Medium | Medium            | Controlled internal systems
Fixed               | Low        | Low               | Testing / simple jobs

ReTraced allows per-job retry customization, enabling teams to choose the right strategy based on failure characteristics, system scale, and correctness guarantees.
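
Per-job customization could be modeled as a lookup from job name to strategy. The sketch below does not show ReTraced's real configuration format: the job names, the `JOB_RETRY_CONFIG` shape, and the `delay_for` helper are all hypothetical. Attempts are 1-based, so the exponential entry uses n = attempt - 1, and the multi-phase strategy is omitted for brevity.

```python
from typing import Any, Callable, Dict

# Each strategy maps (base_seconds, attempt) to a delay; attempt is 1-based.
STRATEGIES: Dict[str, Callable[[float, int], float]] = {
    "exponential": lambda base, attempt: base * 2 ** (attempt - 1),
    "linear": lambda base, attempt: base * attempt,
    "fixed": lambda base, attempt: base,
}

# Hypothetical per-job settings; names and values are for illustration only.
JOB_RETRY_CONFIG: Dict[str, Dict[str, Any]] = {
    "send-webhook": {"strategy": "exponential", "base_seconds": 2.0},
    "sync-inventory": {"strategy": "linear", "base_seconds": 5.0},
    "dev-smoke-test": {"strategy": "fixed", "base_seconds": 10.0},
}


def delay_for(job_name: str, attempt: int) -> float:
    """Look up the job's strategy and compute the delay before `attempt`."""
    config = JOB_RETRY_CONFIG[job_name]
    strategy = STRATEGIES[config["strategy"]]
    return strategy(config["base_seconds"], attempt)
```

Keeping the strategies in a table means a new job only needs a configuration entry, not new retry code.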