Introduction
ReTraced is a transparent and extensible distributed job scheduler built to make
retry behavior, failure handling, and job lifecycle transitions explicit and observable.
Unlike many schedulers that abstract retries behind configuration flags,
ReTraced treats retries as first-class data — visible, auditable, and configurable per job.
ReTraced is not designed to hide complexity.
It is designed to expose it clearly.
Why ReTraced Exists
Modern job schedulers are powerful, but they often:
- Hide retry decisions inside internal engines
- Expose retry counts without retry intent
- Make failure analysis opaque and indirect
ReTraced was built to answer questions such as:
- Why did this job retry at this moment?
- Was the retry automatic or manually triggered?
- Is this failure temporary or permanent?
- When and why did retries stop?
These questions matter when building reliable, debuggable distributed systems.
Core Philosophy
Explicit Over Implicit
- Retry attempts are stored as structured, queryable data
- Failures are classified (temporary vs permanent)
- Dead Letter Queue (DLQ) is a first-class system component, not a side effect
This makes execution behavior predictable, inspectable, and explainable.
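As a rough illustration of what classifying failures could look like, here is a minimal sketch. The names `FailureKind`, `TEMPORARY_ERRORS`, and `classify_failure` are hypothetical, and the choice of which exception types count as temporary is an assumption rather than ReTraced's actual rule.

```python
from enum import Enum


class FailureKind(Enum):
    TEMPORARY = "temporary"  # worth retrying, e.g. a timeout
    PERMANENT = "permanent"  # retrying will not help, e.g. invalid input


# Hypothetical mapping: exception types treated as temporary failures.
TEMPORARY_ERRORS = (TimeoutError, ConnectionError)


def classify_failure(exc: BaseException) -> FailureKind:
    """Classify an exception as a temporary or a permanent failure."""
    if isinstance(exc, TEMPORARY_ERRORS):
        return FailureKind.TEMPORARY
    return FailureKind.PERMANENT
```

A scheduler built this way would retry only `TEMPORARY` failures and send `PERMANENT` ones straight to the DLQ.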
Practical Before Perfect
ReTraced intentionally favors clarity and control over hidden guarantees:
- At-least-once delivery semantics
- Redis-backed state for simplicity and speed
- Minimal internal coordination logic
These choices keep the system easy to reason about while remaining useful in real scenarios.
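At-least-once delivery means a job can occasionally be delivered more than once, so handlers are usually written to tolerate duplicates. The sketch below shows one common way to do that, assuming each job carries a unique ID. `IdempotentRunner` is a hypothetical helper, and the in-memory set stands in for whatever shared store a real deployment would use.

```python
from typing import Callable, Set


class IdempotentRunner:
    """Runs a handler at most once per job ID (in-memory, single process)."""

    def __init__(self, handler: Callable[[dict], None]):
        self._handler = handler
        self._done: Set[str] = set()

    def run(self, job_id: str, payload: dict) -> bool:
        """Run the handler; return False if this delivery is a duplicate."""
        if job_id in self._done:
            return False
        self._handler(payload)
        # Mark the job done only after the handler succeeds, so a failed
        # attempt can still be retried.
        self._done.add(job_id)
        return True
```

For example, calling `run("job-1", payload)` twice invokes the handler once and returns `False` the second time.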
Performance Snapshot
ReTraced prioritizes correctness and visibility while maintaining solid performance.
Benchmark (local, Redis-backed):
- 10,000 jobs in ~2.4 seconds with 1 worker
- 10,000 jobs in ~2.1 seconds with 5 workers
These numbers suggest:
- Low scheduler overhead
- Horizontal scaling at the worker layer, with a modest gain on this workload (roughly 12% less wall-clock time going from 1 to 5 workers)
- Retry orchestration that does not dominate execution time
Benchmarks are indicative and not a production SLA.
What Makes ReTraced Different
Retry as Data
Each job maintains a structured retry history:
- Every retry attempt
- Timestamp and error message
- Retry trigger (AUTO or MANUAL)
- Retry outcome
This enables:
- Full retry audit trails
- DLQ forensics
- Safe and informed replays
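To make the idea of a retry history concrete, here is a minimal sketch of how such a record might be modeled. The class and field names (`RetryAttempt`, `JobRecord`, `outcome` values, and so on) are illustrative assumptions, not ReTraced's actual schema.

```python
import json
from dataclasses import asdict, dataclass, field
from enum import Enum
from typing import List


class RetryTrigger(str, Enum):
    AUTO = "AUTO"
    MANUAL = "MANUAL"


@dataclass
class RetryAttempt:
    attempt: int           # 1-based attempt number
    timestamp: float       # Unix time in seconds
    error: str             # error message from the failed run
    trigger: RetryTrigger  # who or what caused this retry
    outcome: str           # hypothetical values: "failed", "succeeded"


@dataclass
class JobRecord:
    job_id: str
    retries: List[RetryAttempt] = field(default_factory=list)

    def record_retry(self, timestamp: float, error: str,
                     trigger: RetryTrigger, outcome: str) -> RetryAttempt:
        """Append one retry attempt to the job's history."""
        entry = RetryAttempt(len(self.retries) + 1, timestamp, error,
                             trigger, outcome)
        self.retries.append(entry)
        return entry

    def manual_retries(self) -> List[RetryAttempt]:
        """Return only the retries that were triggered manually."""
        return [r for r in self.retries if r.trigger is RetryTrigger.MANUAL]

    def to_json(self) -> str:
        """Serialize the full history, e.g. for storage or an audit export."""
        return json.dumps(asdict(self))
```

Because every attempt is a plain record, questions such as "which retries were manual?" become simple queries over the history.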
Per-Job Retry Strategies
Each job can define its own retry behavior:
- Fixed delay
- Linear backoff
- Exponential backoff (with or without jitter)
- Multi-phase retry (immediate → backoff → DLQ)
This mirrors real-world production patterns without hiding the mechanics.
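The strategies above can be sketched as small delay functions. This is an illustration under stated assumptions: the function names, the 60-second cap, the "full jitter" variant, and the phase sizes in `multi_phase` are all hypothetical choices, not ReTraced's configuration.

```python
import random
from typing import Optional


def fixed_delay(attempt: int, base: float) -> float:
    """Same delay before every retry."""
    return base


def linear_backoff(attempt: int, base: float) -> float:
    """Delay grows linearly with the 1-based attempt number."""
    return base * attempt


def exponential_backoff(attempt: int, base: float, cap: float = 60.0,
                        jitter: bool = False,
                        rng: Optional[random.Random] = None) -> float:
    """Delay doubles with each attempt, up to a cap."""
    delay = min(cap, base * 2 ** (attempt - 1))
    if jitter:
        # "Full jitter": pick a random delay between 0 and the computed one.
        rng = rng or random.Random()
        delay = rng.uniform(0, delay)
    return delay


def multi_phase(attempt: int, immediate: int = 2, backoff: int = 3,
                base: float = 1.0) -> Optional[float]:
    """Immediate retries, then exponential backoff, then the DLQ.

    Returns the delay in seconds, or None when the job should be
    moved to the DLQ instead of retried.
    """
    if attempt <= immediate:
        return 0.0
    if attempt <= immediate + backoff:
        return exponential_backoff(attempt - immediate, base)
    return None
```

With the defaults shown, `multi_phase` retries attempts 1 and 2 immediately, backs off for attempts 3 to 5, and signals the DLQ from attempt 6 onward.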
First-Class DLQ
Dead jobs are not treated as an afterthought:
- Complete retry history preserved
- Failure reasons recorded
- Manual retries explicitly tracked
- Poison jobs clearly identified
Failure recovery becomes intentional rather than accidental.
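A minimal sketch of a DLQ that keeps this context might look like the following. It uses an in-memory dictionary to stay self-contained, whereas a real deployment would keep this state in Redis. `DeadJob`, `DeadLetterQueue`, and the rule that a job becomes "poison" after a set number of manual replays are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DeadJob:
    job_id: str
    reason: str                                   # final failure reason
    retry_history: List[dict] = field(default_factory=list)
    manual_retries: int = 0                       # replays requested by an operator


class DeadLetterQueue:
    """In-memory stand-in for a DLQ that preserves failure context."""

    def __init__(self, poison_threshold: int = 2):
        # Hypothetical rule: a job replayed this many times is "poison".
        self.poison_threshold = poison_threshold
        self._jobs: Dict[str, DeadJob] = {}

    def add(self, job_id: str, reason: str, retry_history: List[dict]) -> DeadJob:
        """Store a dead job, keeping its manual-retry count across re-entries."""
        existing = self._jobs.get(job_id)
        manual = existing.manual_retries if existing else 0
        dead = DeadJob(job_id, reason, list(retry_history), manual)
        self._jobs[job_id] = dead
        return dead

    def replay(self, job_id: str) -> DeadJob:
        """Record a manual retry; the caller is expected to re-enqueue the job."""
        dead = self._jobs[job_id]
        dead.manual_retries += 1
        return dead

    def is_poison(self, job_id: str) -> bool:
        """True once the job has been manually replayed too many times."""
        return self._jobs[job_id].manual_retries >= self.poison_threshold
```

Tracking manual replays separately from automatic retries is what lets an operator see that a job keeps coming back no matter how often it is replayed.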
Availability and Direction
ReTraced is currently usable for experimentation and internal tools, with plans to evolve into a production-ready, self-hostable service.
It does not attempt to replace existing schedulers.
Instead, it focuses on:
- Making retry and failure behavior explicit
- Providing fine-grained control
- Improving observability and intent
ReTraced complements mature systems by exposing what they abstract away.
Summary
ReTraced is a clear and intentional job scheduler designed to:
- Surface retry decisions
- Preserve failure context
- Make execution behavior auditable
It sits between experimentation and production —
practical enough to run, transparent enough to understand.