Introduction

ReTraced is a transparent and extensible distributed job scheduler built to make
retry behavior, failure handling, and job lifecycle transitions explicit and observable.

Unlike many schedulers that hide retries behind configuration flags,
ReTraced treats retries as first-class data: visible, auditable, and configurable per job.

ReTraced is not designed to hide complexity.
It is designed to expose it clearly.


Why ReTraced Exists

Modern job schedulers are powerful, but they often:

  • Hide retry decisions inside internal engines
  • Expose retry counts without retry intent
  • Make failure analysis opaque and indirect

ReTraced was built to answer questions such as:

  • Why did this job retry at this moment?
  • Was the retry automatic or manually triggered?
  • Is this failure temporary or permanent?
  • When and why did retries stop?

These questions matter when building reliable, debuggable distributed systems.


Core Philosophy

Explicit Over Implicit

  • Retry attempts are stored as structured, queryable data
  • Failures are classified (temporary vs permanent)
  • The Dead Letter Queue (DLQ) is a first-class system component, not a side effect

This makes execution behavior predictable, inspectable, and explainable.
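
As a rough illustration of classifying failures explicitly, the sketch below maps an exception to a temporary or permanent kind and then decides the job's next state. The names `FailureKind`, `PERMANENT_ERRORS`, `classify_failure`, and `next_state`, and the state strings, are hypothetical and are not taken from ReTraced's API.

```python
from enum import Enum


class FailureKind(Enum):
    TEMPORARY = "temporary"  # retrying may help (e.g. a timeout)
    PERMANENT = "permanent"  # retrying cannot help (e.g. a malformed payload)


# Hypothetical rule: these exception types are treated as permanent failures.
PERMANENT_ERRORS = (ValueError, KeyError, TypeError)


def classify_failure(exc: Exception) -> FailureKind:
    """Classify an exception as a temporary or permanent failure."""
    if isinstance(exc, PERMANENT_ERRORS):
        return FailureKind.PERMANENT
    return FailureKind.TEMPORARY


def next_state(exc: Exception, attempts_made: int, max_attempts: int) -> str:
    """Decide what happens to a failed job.

    Permanent failures, or jobs that have used all attempts, go to the DLQ
    ("dead"); everything else is scheduled for another retry.
    """
    kind = classify_failure(exc)
    if kind is FailureKind.PERMANENT or attempts_made >= max_attempts:
        return "dead"
    return "retry_scheduled"
```

Under these assumptions, a timeout on the first of three attempts yields `retry_scheduled`, while a `ValueError` goes straight to `dead` regardless of the attempts remaining.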


Practical Before Perfect

ReTraced intentionally favors clarity and control over hidden guarantees:

  • At-least-once delivery semantics
  • Redis-backed state for simplicity and speed
  • Minimal internal coordination logic

These choices keep the system easy to reason about while remaining useful in real scenarios.
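
At-least-once delivery means a job handler may run more than once for the same job, so handlers are usually written to be idempotent. The sketch below shows one common way to do that with a completion marker keyed by job ID. `IdempotentRunner` is a hypothetical helper, and the in-memory set stands in for shared state; in a Redis-backed setup the marker would more likely be a key written with `SET` and the `NX` option.

```python
from typing import Callable, Set


class IdempotentRunner:
    """Skips jobs whose ID has already been processed successfully."""

    def __init__(self) -> None:
        # In-memory stand-in for a shared store such as Redis.
        self._done: Set[str] = set()

    def run(self, job_id: str, handler: Callable[[], None]) -> bool:
        """Run the handler unless this job already completed.

        Returns True if the handler was executed, False if it was skipped.
        """
        if job_id in self._done:
            return False
        handler()
        # Mark only after success, so a failed job can still be redelivered.
        self._done.add(job_id)
        return True
```

A crash between the handler finishing and the marker being written would still allow a duplicate run, which is why this pattern narrows the window rather than giving exactly-once behavior.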


Performance Snapshot

ReTraced prioritizes correctness and visibility while maintaining solid performance.

Benchmark (local, Redis-backed):

  • 10,000 jobs in ~2.4 seconds with 1 worker
  • 10,000 jobs in ~2.1 seconds with 5 workers

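In this run, the results suggest:

  • Low scheduler overhead (roughly 4,200 jobs per second with a single worker)
  • Workers can be added without a throughput penalty (about 12% less wall-clock time with 5 workers; the small gain suggests this workload is not worker-bound)
  • Retry orchestration that does not dominate execution time

These figures are indicative only and are not a production SLA.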


What Makes ReTraced Different

Retry as Data

Each job maintains a structured retry history:

  • Every retry attempt
  • Timestamp and error message
  • Retry trigger (AUTO or MANUAL)
  • Retry outcome

This enables:

  • Full retry audit trails
  • DLQ forensics
  • Safe and informed replays
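
A retry history of this shape could be modeled as follows. This is a minimal sketch, not ReTraced's actual schema: the class names `RetryAttempt` and `Job`, the field names, and the outcome strings are all assumptions chosen for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from enum import Enum
from typing import List, Optional


class Trigger(Enum):
    AUTO = "AUTO"      # scheduled by the retry strategy
    MANUAL = "MANUAL"  # requested by an operator


@dataclass
class RetryAttempt:
    attempt: int                 # 1-based position in the history
    timestamp: float             # Unix time in seconds
    trigger: Trigger
    outcome: str                 # hypothetical values: "failed", "succeeded"
    error: Optional[str] = None  # error message, if the attempt failed


@dataclass
class Job:
    job_id: str
    history: List[RetryAttempt] = field(default_factory=list)

    def record(self, trigger: Trigger, outcome: str, timestamp: float,
               error: Optional[str] = None) -> RetryAttempt:
        """Append one attempt to the history and return it."""
        entry = RetryAttempt(len(self.history) + 1, timestamp, trigger,
                             outcome, error)
        self.history.append(entry)
        return entry

    def manual_retries(self) -> List[RetryAttempt]:
        """Query the history: which attempts were manually triggered?"""
        return [a for a in self.history if a.trigger is Trigger.MANUAL]

    def to_json(self) -> str:
        """Serialize the job and its history, e.g. for storage or audit."""
        attempts = [{**asdict(a), "trigger": a.trigger.value}
                    for a in self.history]
        return json.dumps({"job_id": self.job_id, "history": attempts})
```

Because every attempt is a plain record, questions such as "was this retry automatic or manual?" become simple filters over the history rather than log searches.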

Per-Job Retry Strategies

Each job can define its own retry behavior:

  • Fixed delay
  • Linear backoff
  • Exponential backoff (with or without jitter)
  • Multi-phase retry (immediate → backoff → DLQ)

This mirrors real-world production patterns without hiding the mechanics.
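
The strategies listed above can be sketched as small delay functions. The function names, parameters, and defaults here are illustrative assumptions, not ReTraced's configuration format; `attempt` is taken to be the 1-based retry number, delays are in seconds, and the jitter variant shown is the common "full jitter" approach.

```python
import random
from typing import Optional


def fixed_delay(attempt: int, base: float) -> float:
    """Same delay before every retry."""
    return base


def linear_backoff(attempt: int, base: float) -> float:
    """Delay grows by `base` seconds with each retry."""
    return base * attempt


def exponential_backoff(attempt: int, base: float, cap: float = 60.0,
                        jitter: bool = False,
                        rng: Optional[random.Random] = None) -> float:
    """Delay doubles with each retry, up to `cap` seconds.

    With jitter enabled, a random delay in [0, computed delay] is returned
    so that many failing jobs do not retry at the same instant.
    """
    delay = min(cap, base * 2 ** (attempt - 1))
    if jitter:
        source = rng or random
        return source.uniform(0, delay)
    return delay


def multi_phase(attempt: int, immediate_attempts: int = 2,
                backoff_attempts: int = 3,
                base: float = 1.0) -> Optional[float]:
    """Immediate retries first, then exponential backoff, then give up.

    Returns the delay in seconds, or None when the job should go to the DLQ.
    """
    if attempt <= immediate_attempts:
        return 0.0
    if attempt <= immediate_attempts + backoff_attempts:
        return exponential_backoff(attempt - immediate_attempts, base)
    return None
```

With the defaults shown, `multi_phase` retries twice immediately, then waits 1, 2, and 4 seconds, and returns `None` on the sixth retry to signal that the job should be moved to the DLQ.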


First-Class DLQ

Dead jobs are not treated as an afterthought:

  • Complete retry history preserved
  • Failure reasons recorded
  • Manual retries explicitly tracked
  • Poison jobs clearly identified

Failure recovery becomes intentional rather than accidental.
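
One way to picture a DLQ with these properties is sketched below. The classes `DeadJob` and `DeadLetterQueue`, the method names, and the poison rule (a job is flagged once it has been manually replayed a given number of times) are assumptions made for illustration. ReTraced's real criteria and storage layout may differ.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DeadJob:
    job_id: str
    reason: str             # why the job was moved to the DLQ
    history: List[dict]     # full retry history, preserved as-is
    manual_retries: int = 0


class DeadLetterQueue:
    """In-memory DLQ sketch; a real deployment would persist this state."""

    def __init__(self, poison_threshold: int = 2) -> None:
        self._jobs: Dict[str, DeadJob] = {}
        self.poison_threshold = poison_threshold

    def bury(self, job_id: str, reason: str, history: List[dict]) -> DeadJob:
        """Store a dead job with its failure reason and retry history."""
        previous = self._jobs.get(job_id)
        manual = previous.manual_retries if previous else 0
        dead = DeadJob(job_id, reason, list(history), manual)
        self._jobs[job_id] = dead
        return dead

    def replay(self, job_id: str, timestamp: float) -> DeadJob:
        """Record a manual retry explicitly and return the job to re-run."""
        dead = self._jobs[job_id]
        dead.manual_retries += 1
        dead.history.append({"trigger": "MANUAL", "timestamp": timestamp})
        return dead

    def is_poison(self, job_id: str) -> bool:
        """Hypothetical rule: too many manual replays marks a poison job."""
        return self._jobs[job_id].manual_retries >= self.poison_threshold
```

Keeping the manual replay count alongside the history means an operator can see at a glance whether a job has already been retried by hand before trying again.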


Availability and Direction

ReTraced is currently suitable for experimentation and internal tools, with plans to evolve into a production-ready, self-hosted service.

It does not attempt to replace existing schedulers.
Instead, it focuses on:

  • Making retry and failure behavior explicit
  • Providing fine-grained control
  • Improving observability and making retry intent visible

ReTraced complements mature systems by exposing what they abstract away.


Summary

ReTraced is a clear and intentional job scheduler designed to:

  • Surface retry decisions
  • Preserve failure context
  • Make execution behavior auditable

It sits between experimentation and production:
practical enough to run, transparent enough to understand.