Introduction

ReTraced is a transparent and extensible distributed job queue built to make
retry behavior, failure handling, and job lifecycle transitions explicit and observable.

Unlike many queues that abstract retries behind configuration flags,
ReTraced treats retries as first-class data — visible, auditable, and configurable per job.

ReTraced is not designed to hide complexity.
It is designed to expose it clearly.

Why ReTraced Exists

Modern job queues are powerful, but they often:

Hide retry decisions inside internal engines
Expose retry counts without retry intent
Make failure analysis opaque and indirect

ReTraced was built to answer questions such as:

Why did this job retry at this moment?
Was the retry automatic or manually triggered?
Is this failure temporary or permanent?
When and why did retries stop?

These questions matter when building reliable, debuggable distributed systems.

Core Philosophy

Explicit Over Implicit

Retry attempts are stored as structured, queryable data
Failures are classified (temporary vs permanent)
Dead Letter Queue (DLQ) is a first-class system component, not a side effect

This makes execution behavior predictable, inspectable, and explainable.
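As an illustration of how failure classification can drive the next lifecycle step, here is a minimal sketch. The names `FailureKind`, `TRANSIENT_ERRORS`, `classify`, and `next_state` are hypothetical and are not taken from ReTraced's API. The rule that unknown errors count as permanent is also an assumption made for the example.

```python
from enum import Enum


class FailureKind(Enum):
    TEMPORARY = "TEMPORARY"  # worth retrying (timeouts, dropped connections)
    PERMANENT = "PERMANENT"  # retrying cannot help (bad input, missing handler)


# Hypothetical mapping: exception types treated as transient in this sketch.
TRANSIENT_ERRORS = (TimeoutError, ConnectionError)


def classify(error: Exception) -> FailureKind:
    """Classify an error; anything not known to be transient is permanent."""
    if isinstance(error, TRANSIENT_ERRORS):
        return FailureKind.TEMPORARY
    return FailureKind.PERMANENT


def next_state(error: Exception, attempts_made: int, max_attempts: int) -> str:
    """Decide where a failed job goes next: back to the queue or to the DLQ."""
    if classify(error) is FailureKind.PERMANENT:
        return "DLQ"  # permanent failures skip retries entirely
    if attempts_made >= max_attempts:
        return "DLQ"  # temporary failure, but the retry budget is spent
    return "RETRY"
```

Because the classification is an explicit value rather than a side effect, it can be stored next to the job and inspected later.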

Practical Before Perfect

ReTraced intentionally favors clarity and control over hidden guarantees:

At-least-once delivery semantics
Redis-backed state for simplicity and speed
Minimal internal coordination logic

These choices keep the system easy to reason about while remaining useful in real scenarios.
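At-least-once delivery means a job can reach a handler more than once, for example after a worker crashes before acknowledging it, so handlers are usually written to be idempotent. The sketch below shows one common way to do that. `IdempotentHandler` is a hypothetical wrapper, and its in-memory set only stands in for a durable store such as a Redis set. It does not describe ReTraced's own mechanism.

```python
from typing import Callable, Dict, Set


class IdempotentHandler:
    """Wraps a handler so a redelivered job is processed only once."""

    def __init__(self, handler: Callable[[Dict], None]) -> None:
        self._handler = handler
        self._done: Set[str] = set()  # stand-in for a durable "processed" set

    def handle(self, job: Dict) -> bool:
        """Return True if the job ran, False if it was a duplicate delivery."""
        job_id = job["id"]
        if job_id in self._done:
            return False
        self._handler(job)
        # Mark as done only after success, so a failure still allows a retry.
        self._done.add(job_id)
        return True


sent = []
handler = IdempotentHandler(lambda job: sent.append(job["payload"]))
first = handler.handle({"id": "job-1", "payload": "welcome-email"})
second = handler.handle({"id": "job-1", "payload": "welcome-email"})  # redelivery
```

Here the second delivery of `job-1` is skipped, so `sent` ends up holding a single entry.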

Performance Snapshot

ReTraced prioritizes correctness and visibility while maintaining solid performance.

Benchmark (local, Redis-backed):

10,000 jobs in ~2.4 seconds with 1 worker
10,000 jobs in ~2.1 seconds with 5 workers

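These figures suggest:

Low queue overhead (about 4,200 jobs per second with a single worker)
Workers can be added at the worker layer without hurting throughput, although the gain from 1 to 5 workers is modest (about 12% less total time)
Retry orchestration that does not dominate execution time

Benchmarks are indicative and not a production SLA.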

What Makes ReTraced Different

Retry as Data

Each job maintains a structured retry history:

Every retry attempt
Timestamp and error message
Retry trigger (AUTO or MANUAL)
Retry outcome

This enables:

Full retry audit trails
DLQ forensics
Safe and informed replays
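One possible shape for such a retry history is sketched below. The field names, the `Trigger` enum, and the `"FAILED"` / `"SUCCEEDED"` outcome values are assumptions made for illustration. ReTraced's actual schema may differ.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Trigger(Enum):
    AUTO = "AUTO"
    MANUAL = "MANUAL"


@dataclass
class RetryAttempt:
    attempt: int           # 1-based position in the history
    timestamp: float       # Unix seconds
    error: Optional[str]   # error message, or None if the attempt succeeded
    trigger: Trigger
    outcome: str           # hypothetical values: "FAILED" or "SUCCEEDED"


@dataclass
class Job:
    id: str
    history: List[RetryAttempt] = field(default_factory=list)

    def record(self, timestamp: float, error: Optional[str],
               trigger: Trigger, outcome: str) -> RetryAttempt:
        """Append one attempt to the job's history and return it."""
        entry = RetryAttempt(len(self.history) + 1, timestamp, error,
                             trigger, outcome)
        self.history.append(entry)
        return entry

    def manual_retries(self) -> List[RetryAttempt]:
        """Query the history: which attempts were triggered by a person?"""
        return [a for a in self.history if a.trigger is Trigger.MANUAL]
```

Keeping every attempt as a record, rather than only a counter, is what lets the audit and forensics questions listed above be answered by a simple query.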

Per-Job Retry Strategies

Each job can define its own retry behavior:

Fixed delay
Linear backoff
Exponential backoff (with or without jitter)
Multi-phase retry (immediate → backoff → DLQ)

This mirrors real-world production patterns without hiding the mechanics.
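The strategies listed above can be sketched as plain delay functions. The function names, the default values, the 60-second cap, and the choice of "full jitter" (a uniform draw between zero and the computed delay) are assumptions made for this example and do not come from ReTraced's configuration format. `attempt` is the 1-based number of the retry being scheduled.

```python
import random
from typing import Optional


def fixed_delay(attempt: int, base: float) -> float:
    """Same delay before every retry."""
    return base


def linear_backoff(attempt: int, base: float) -> float:
    """Delay grows by `base` seconds with each retry."""
    return base * attempt


def exponential_backoff(attempt: int, base: float, cap: float = 60.0,
                        jitter: bool = False,
                        rng: Optional[random.Random] = None) -> float:
    """Delay doubles with each retry, up to `cap` seconds."""
    delay = min(cap, base * 2 ** (attempt - 1))
    if jitter:
        # Full jitter: pick uniformly between 0 and the computed delay.
        delay = (rng or random).uniform(0.0, delay)
    return delay


def multi_phase(attempt: int, immediate: int = 2, backoff: int = 3,
                base: float = 1.0) -> Optional[float]:
    """Immediate retries first, then exponential backoff, then None (DLQ)."""
    if attempt <= immediate:
        return 0.0
    if attempt <= immediate + backoff:
        return exponential_backoff(attempt - immediate, base)
    return None  # retry budget exhausted: the job moves to the DLQ
```

With the defaults shown, `multi_phase` yields delays of 0, 0, 1, 2, and 4 seconds for retries 1 to 5 and returns `None` for retry 6, which a caller would treat as the signal to dead-letter the job.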

First-Class DLQ

Dead jobs are not treated as an afterthought:

Complete retry history preserved
Failure reasons recorded
Manual retries explicitly tracked
Poison jobs clearly identified

Failure recovery becomes intentional rather than accidental.
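To make these properties concrete, the sketch below models a dead letter queue that keeps each job's history, counts manual replays, and flags poison jobs. `DeadJob`, `DeadLetterQueue`, and the rule "poison after N manual replays" are hypothetical and are not drawn from ReTraced's implementation. A real deployment would persist this state in Redis rather than in a dictionary.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DeadJob:
    id: str
    reason: str                                        # why retries stopped
    history: List[Dict] = field(default_factory=list)  # preserved attempts
    manual_retries: int = 0


class DeadLetterQueue:
    def __init__(self, poison_after: int = 2) -> None:
        self._jobs: Dict[str, DeadJob] = {}
        self._poison_after = poison_after

    def add(self, job_id: str, reason: str, history: List[Dict]) -> DeadJob:
        """Park a job, keeping any manual-retry count from earlier visits."""
        previous = self._jobs.get(job_id)
        dead = DeadJob(job_id, reason, list(history),
                       previous.manual_retries if previous else 0)
        self._jobs[job_id] = dead
        return dead

    def replay(self, job_id: str) -> DeadJob:
        """Record a manual retry and return the job for re-enqueueing."""
        dead = self._jobs[job_id]
        dead.manual_retries += 1
        dead.history.append({"trigger": "MANUAL"})
        return dead

    def is_poison(self, job_id: str) -> bool:
        """Hypothetical rule: poison once manual replays reach the threshold."""
        return self._jobs[job_id].manual_retries >= self._poison_after
```

In this sketch a replay is itself an entry in the job's history, so a later reader can tell a manual intervention apart from an automatic retry.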

Availability and Direction

ReTraced is currently usable for experimentation and internal tools, with plans to evolve into a production-ready self-hostable service.

It does not attempt to replace existing queues.
Instead, it focuses on:

Making retry and failure behavior explicit
Providing fine-grained control
Improving observability and intent

ReTraced complements mature systems by exposing what they abstract away.

Summary

ReTraced is a clear and intentional job queue designed to:

Surface retry decisions
Preserve failure context
Make execution behavior auditable

It sits between experimentation and production —
practical enough to run, transparent enough to understand.
