Transitive RL: A Divide-and-Conquer Approach to Scalable Off-Policy Reinforcement Learning

This article introduces Transitive RL (TRL), a reinforcement learning (RL) algorithm that replaces traditional temporal difference (TD) learning with a divide-and-conquer paradigm. TRL addresses scalability challenges in off-policy RL, particularly for long-horizon tasks, by reducing error accumulation and eliminating the need for hyperparameter tuning (e.g., selecting $n$ in $n$-step TD learning).

Problem Setting: Off-Policy RL

Off-policy RL allows the use of any data (e.g., old experience, human demonstrations) for training, making it more flexible than on-policy RL (which requires fresh data from the current policy).
Challenges:
- Traditional off-policy algorithms like Q-learning rely on TD learning, which suffers from error accumulation over long horizons due to bootstrapping.
- Methods like $n$-step TD learning reduce error propagation but still require tuning $n$ and face high variance for large $n$.

Two Paradigms in Value Learning: TD vs. Monte Carlo (MC)

TD Learning (e.g., Q-learning):
- Uses bootstrapping (estimating $Q(s, a)$ based on $Q(s’, a’)$).
- Propagates errors across the entire horizon, worsening performance in long-horizon tasks.
Monte Carlo (MC) Methods:
- Use actual returns from trajectories, avoiding bootstrapping.
- No error accumulation but high variance and suboptimality for large horizons.
Hybrid $n$-step TD:
- Combines $n$ steps of MC returns with bootstrapping.
- Reduces error propagation by a constant factor ($n$) but still requires manual tuning of $n$.

The Divide-and-Conquer Paradigm

Key Idea: Split trajectories into smaller segments and update values recursively, reducing Bellman recursions logarithmically (instead of linearly).
Advantages:
- Eliminates the need for hyperparameter $n$.
- Avoids high variance and suboptimality inherent in $n$-step TD.
Mathematical Foundation:
- Uses the triangle inequality for shortest path distances in goal-conditioned RL: $$ d^(s, g) \leq d^(s, w) + d^*(w, g) $$
- Translates to a transitive Bellman update using subgoals $w$ as midpoints.

Practical Implementation: Transitive RL (TRL)

Core Innovation:
- Subgoal Selection: Restricts subgoals $w$ to states present in the dataset (specifically, those between $s$ and $g$ in trajectories).
- Expectile Regression: Replaces $\max$ with a “soft” $\text{argmax}$ to avoid value overestimation.
- Loss Function: $$ \min_{\bar{V}} \mathbb{E}{(s_i, s_k, s_j)} \left[ \ell^2\kappa \left( \bar{V}(s_i, g) - \left( \bar{V}(s_i, w) + \bar{V}(w, g) \right) \right) \right] $$ where $\ell^2_\kappa$ is the expectile loss.
Applicability: Designed for goal-conditioned RL, where reaching a target state is the objective.

Evaluation and Results

Benchmark: Tested on OGBench, focusing on humanoidmaze and puzzle tasks with 1B-sized datasets.
Performance:
- TRL outperformed strong baselines (TD, MC, quasimetric learning) on most tasks.
- Matched the best $n$-step TD results without tuning $n$.
- Successfully handled tasks requiring 3,000 environment steps (long-horizon).
Key Insight: Divide-and-conquer naturally scales to long horizons by recursively splitting trajectories.

Future Directions

Generalization to Regular RL: Extend TRL to non-goal-conditioned tasks by converting rewards into goals.
Stochastic Environments: Adapt to partial observability using stochastic triangle inequalities.
Algorithm Improvements:
- Better subgoal selection (beyond trajectory-based candidates).
- Reduce hyperparameters and stabilize training.

Conclusion

TRL represents a breakthrough in scalable off-policy RL by leveraging divide-and-conquer principles. It addresses the limitations of TD and MC methods, offering a promising path for long-horizon tasks. The approach aligns with broader trends in recursive decision-making and could inspire further advancements in RL and machine learning.

For more details, refer to the original research: RL without TD learning

On This Page

Transitive RL: A Divide-and-Conquer Approach to Scalable Off-Policy Reinforcement Learning