RT-J: Large-Scale Pretraining of Relational Transformers for Label-Efficient Predictions

Learning a new task from a handful of examples is central to the value of foundation models, yet foundation models for relational databases still need tens of thousands of labels to do well on a new task. RT-J closes that gap. It is a Relational Transformer (RT) pretrained for the few-shot regime, and it reaches state-of-the-art predictions on unseen relational databases from only hundreds of in-context labels, with no task-specific training.

23–32× fewer labeled examples than strong in-context-learning pipelines, at matched accuracy

650 real-world databases in THE JOIN, the largest open relational pretraining corpus to date

6,255 forecasting pretraining tasks spanning eight reasoning regimes

85M parameters, pretrained once and reused across tasks with no fine-tuning

The label-efficiency gap

Strong relational predictors today typically require task-specific training, substantial labeled data, or careful feature engineering. That makes them hard to use exactly where they are most valuable: when labels are scarce, tasks change frequently, and users need rapid ad-hoc predictions for decision-making. The Relational Transformer is a natural starting point: it operates directly over database cells, schema metadata, and primary–foreign-key links, but its original recipe was not designed to turn the architecture into a strong in-context learner.

We argue this label inefficiency is not primarily an architectural limitation, but a mismatch between existing relational foundation-model recipes and the demands of few-shot prediction. Three gaps are central, and RT-J is the recipe that closes each one:

Three gaps between relational pretraining and few-shot inference

Narrow data. Public relational corpora are small, leaving rare-event, cold-start, and heavy-tailed regimes sparsely represented.
Sparse supervision. Masked-cell pretraining supervises a single short context, while few-shot inference relies on long contexts full of labeled examples.
Naive context. Standard context construction ignores relational structure or expands neighborhoods blindly, instead of retrieving rows that are both relevant and label-bearing.

Our goal is not a new backbone, but to identify the ingredients that turn an RT-style model into an effective in-context learner for relational databases: a diverse pretraining corpus, a training recipe aligned with few-shot inference, a structure-aware retriever, and test-time compute. We call the resulting pretrained model RT-J: RT pretrained on THE JOIN.

RT-J operates directly on a relational database and frames any forecasting task as a task table. The recipe section below shows how the model builds a context around one target row.
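To make the task-table framing concrete, here is a minimal sketch on a made-up two-table database. The table names, the seven-day horizon, and the `build_task_table` helper are all illustrative assumptions, not RT-J's actual schema or code; the layout (one row per entity and cutoff time, plus a label) follows the usual RelBench-style convention.

```python
from datetime import date, timedelta

# Hypothetical database: users (parent) and orders (child, linked by user_id).
users = [{"user_id": 1}, {"user_id": 2}]
orders = [
    {"order_id": 10, "user_id": 1, "day": date(2024, 1, 3)},
    {"order_id": 11, "user_id": 1, "day": date(2024, 1, 9)},
    {"order_id": 12, "user_id": 2, "day": date(2024, 1, 12)},
]

def build_task_table(cutoffs, horizon_days=7):
    """One row per (user, cutoff); label = orders placed in (cutoff, cutoff + horizon]."""
    rows = []
    for cutoff in cutoffs:
        end = cutoff + timedelta(days=horizon_days)
        for user in users:
            label = sum(
                1
                for o in orders
                if o["user_id"] == user["user_id"] and cutoff < o["day"] <= end
            )
            rows.append({"user_id": user["user_id"], "cutoff": cutoff, "label": label})
    return rows

task_table = build_task_table([date(2024, 1, 1), date(2024, 1, 8)])
for row in task_table:
    print(row)
```

Each task-table row points back into the database through `user_id`, so a model can gather context from linked rows dated before the cutoff while the label stays in the future.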

THE JOIN: a large-scale pretraining corpus

Existing collections of relational databases comprise a limited set of datasets and hand-curated tasks. THE JOIN is, to our knowledge, the largest open pretraining corpus of relational data to date: 650 databases across e-commerce, sports, media, finance, healthcare and more, with 6,255 forecasting tasks deliberately covering rare-event, cold-start, and heavy-tailed targets. Tasks span eight reasoning regimes drawn from 25 template families.

Reasoning regime	What the task asks
aggregation	volume and intensity of future activity
threshold / existence	whether a continuous signal crosses a decision boundary
trend / change	direction and magnitude of change
behaviour / churn	entity lifecycle and retention
ranking	relative position within a peer group
adoption / novelty	first-time events
streak	resilience as a single metric
tail-event	rare and extreme regimes: cold-start, rare-positive, heavy-tail

The corpus is built with a four-stage pipeline that turns messy web data into clean, learnable forecasting tasks:

Collection

Raw databases are gathered from five disjoint source classes (curated relational repositories, public APIs, domain portals, community dumps, and benchmarks) under permissive licenses, keeping only multi-table databases with a parseable timestamp.

Standardization

Six idempotent detector-and-transformer fixes recover canonical relational form: boolean coercion, numeric coercion, foreign-key inference, temporal normalization, entity-axis subsampling, and source-overlap exclusion, which keeps evaluation benchmarks held out.

Task generation

An LLM proposer (Claude Opus) reads each schema and writes forecasting tasks from 25 registered templates; a lightweight deterministic checker drops any proposal whose columns, entities, or patterns do not validate.
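A deterministic checker of this kind can be very small. The sketch below is an assumption about its general shape only: the proposal fields (`template`, `entity_table`, `event_table`, `columns`) and the two template names are hypothetical stand-ins, not the 25 registered templates or the real proposal format.

```python
# Hypothetical stand-ins for the registered templates.
REGISTERED_TEMPLATES = {"count_future_events", "sum_future_value"}

def validate_proposal(proposal, schema):
    """Return a list of problems; an empty list means the proposal passes."""
    problems = []
    if proposal["template"] not in REGISTERED_TEMPLATES:
        problems.append(f"unknown template: {proposal['template']}")
    for role in ("entity_table", "event_table"):
        if proposal[role] not in schema:
            problems.append(f"unknown table: {proposal[role]}")
    for table, column in proposal["columns"]:
        if column not in schema.get(table, ()):
            problems.append(f"unknown column: {table}.{column}")
    return problems

# Schema maps each table name to its set of column names.
schema = {"users": {"user_id"}, "orders": {"order_id", "user_id", "day", "amount"}}
good = {
    "template": "sum_future_value",
    "entity_table": "users",
    "event_table": "orders",
    "columns": [("orders", "amount"), ("orders", "day")],
}
bad = dict(good, columns=[("orders", "price")])  # 'price' does not exist

print(validate_proposal(good, schema))
print(validate_proposal(bad, schema))
```

Because the check is purely structural, it is cheap enough to run on every proposal before any model is trained on the task.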

Task filtering

An XGBoost baseline over window-aggregation features scores every task. Tasks are retained if they are learnable (XGBoost beats the historical baselines) or random-tier (weak signal that may still help stronger models).

The RT-J pretraining recipe

Three ingredients align pretraining with few-shot inference. Together they expose the model to the long, label-rich contexts it will see at test time, while keeping supervision dense and context retrieval structure-aware.

Mixed context sizes

Context length is the axis that trades inference compute for quality. RT-J trains at a mix of lengths, L ∼ Unif{1k, 2k, 4k, 8k} cells, with a constant batch size and adaptive gradient accumulation at large L. One pretrained model then handles both short and long contexts at test time.
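One way to keep the batch size constant while the context length varies is to shrink the micro-batch and raise the number of gradient-accumulation steps as L grows. The sketch below illustrates that idea only; the batch size of 64 contexts, the per-pass cell budget, and reading "1k" as 1024 are assumptions, not values from the paper.

```python
import random

CONTEXT_SIZES = [1024, 2048, 4096, 8192]  # "1k..8k" cells, read as powers of two (assumption)
CONTEXTS_PER_STEP = 64        # hypothetical constant batch size, in contexts
CELLS_PER_MICROBATCH = 65536  # hypothetical memory budget per forward pass, in cells

def accumulation_plan(length):
    """Return (micro_batch, accum_steps) so each optimizer step sees CONTEXTS_PER_STEP contexts."""
    micro_batch = max(1, min(CONTEXTS_PER_STEP, CELLS_PER_MICROBATCH // length))
    accum_steps = -(-CONTEXTS_PER_STEP // micro_batch)  # ceiling division
    return micro_batch, accum_steps

rng = random.Random(0)
length = rng.choice(CONTEXT_SIZES)  # L ~ Unif{1k, 2k, 4k, 8k}
micro_batch, accum_steps = accumulation_plan(length)
print(length, micro_batch, accum_steps)
```

Under these assumed budgets, a 1k-cell step runs as one micro-batch of 64 contexts, while an 8k-cell step runs as eight micro-batches of 8.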

Multi-cell masking

As contexts grow, supervising only the single target cell wastes compute. A hierarchical Bernoulli scheme draws a per-context rate p ∼ Unif[0, 0.5] and masks many non-target cells, giving dense supervision within every window and keeping long-context pretraining data-efficient.
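The two-level sampling can be sketched as follows: first draw a rate per context, then mask each cell independently at that rate. The function name and the choice to always mask the target cell are assumptions made for illustration; the source specifies only the distribution of p.

```python
import random

def sample_mask(num_cells, target_index, rng, max_rate=0.5):
    """Hierarchical Bernoulli masking: p ~ Unif[0, max_rate], then each cell ~ Bernoulli(p)."""
    p = rng.uniform(0.0, max_rate)                      # per-context masking rate
    mask = [rng.random() < p for _ in range(num_cells)]  # independent per-cell draws
    mask[target_index] = True                           # assumption: the target is always masked
    return p, mask

rng = random.Random(0)
p, mask = sample_mask(num_cells=20, target_index=7, rng=rng)
print(f"rate={p:.2f}, masked={sum(mask)} of {len(mask)} cells")
```

Drawing p per context rather than fixing it means the model sees everything from nearly fully observed windows to windows with about half the cells hidden.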

Random-walk retrieval

A two-stage, on-the-fly retriever fills the context window: thousands of bounded random walks from the target row rank task rows by recency and visit count, then a width-bounded local search expands around them. It surfaces rows that are both topologically close to the target and likely to carry a useful label, with no precomputation.
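The first stage can be sketched with a plain adjacency list over primary–foreign-key links. This is a simplified illustration, not the paper's retriever: the walk length, the number of walks, and the rule for combining the two signals (visit count first, recency as tie-break) are assumptions, and the second-stage local search is omitted.

```python
import random
from collections import Counter

def rank_task_rows(graph, task_time, target, target_time, rng, num_walks=1000, max_steps=4):
    """Stage one only: count visits of earlier task rows over bounded random walks."""
    visits = Counter()
    for _ in range(num_walks):
        node = target
        for _ in range(max_steps):  # each walk is bounded in length
            neighbors = graph.get(node, [])
            if not neighbors:
                break
            node = rng.choice(neighbors)
            # Count only other task rows dated before the target (no future labels).
            if node != target and node in task_time and task_time[node] < target_time:
                visits[node] += 1
    # Assumption: more visits ranks higher; more recent rows win ties.
    return sorted(visits, key=lambda row: (visits[row], task_time[row]), reverse=True)

# Toy graph: task rows t0..t3 all link to the same user row u1.
graph = {"t0": ["u1"], "t1": ["u1"], "t2": ["u1"], "t3": ["u1"],
         "u1": ["t0", "t1", "t2", "t3"]}
task_time = {"t0": 5, "t1": 1, "t2": 2, "t3": 9}  # t3 lies after the target time
ranked = rank_task_rows(graph, task_time, target="t0", target_time=5, rng=random.Random(0))
print(ranked)
```

Nothing here is precomputed: the walks run from the target row at query time, which is what allows the retriever to operate on the fly.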

How the context window is built around a target row (RT's Algorithm 1): a bounded-width breadth-first search packs each visited row's feature cells into a fixed cell budget (parents always, children subsampled, future rows skipped) until the budget is full.
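The budgeted search described in this caption can be sketched as follows. It is a simplified reading, not RT's Algorithm 1 itself: the row layout (`time`, `cells`, `parents`, `children`) is hypothetical, and stopping at the first row that no longer fits is one assumed way of treating a full budget.

```python
import random
from collections import deque

def build_context(rows, target, budget, width, rng):
    """Bounded-width BFS from the target row; returns (row ids in context, cells used)."""
    cutoff = rows[target]["time"]
    context, used = [], 0
    seen, queue = {target}, deque([target])
    while queue:
        rid = queue.popleft()
        row = rows[rid]
        if used + row["cells"] > budget:
            break  # budget is full: stop packing
        context.append(rid)
        used += row["cells"]
        fresh = lambda r: r not in seen and rows[r]["time"] <= cutoff  # skip future rows
        parents = [p for p in row["parents"] if fresh(p)]    # parents are always followed
        children = [c for c in row["children"] if fresh(c)]
        if len(children) > width:
            children = rng.sample(children, width)            # children are subsampled
        for nxt in parents + children:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return context, used

order = lambda t: {"time": t, "cells": 3, "parents": ["u1"], "children": []}
rows = {"o1": order(5), "o2": order(3), "o3": order(4), "o4": order(9),
        "u1": {"time": 0, "cells": 2, "parents": [], "children": ["o1", "o2", "o3", "o4"]}}
print(build_context(rows, "o1", budget=100, width=1, rng=random.Random(0)))
print(build_context(rows, "o1", budget=5, width=1, rng=random.Random(0)))
```

In the toy data, `o4` is dated after the target and never enters the context, and with a budget of 5 cells only the target and its parent fit.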

Model & training. RT-J is an 85M-parameter Relational Transformer (12 layers, 512 hidden dimensions, 8 attention heads) pretrained for 100k optimizer steps on 32×H100 GPUs in about two days. The same recipe is used for classification and regression, with checkpoints selected independently per metric. The local context size, retrieval width, and recency bias are left as test-time knobs (see below) rather than baked into the weights.

Test-time compute

Two complementary forms of test-time compute improve few-shot predictions without changing the pretrained model. In both, extra inference budget buys accuracy.

average over context samples

Context ensembling. The retriever is stochastic, so different random seeds yield different but equally valid contexts. Averaging predictions over up to 16 sampled contexts improves both metrics, and is complementary to context size, since many small-context samples can beat a single large one.
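Context ensembling amounts to averaging one prediction per retrieval seed. In the sketch below, `predict_once` is a hypothetical stand-in for "sample a context with this seed, then run the frozen model"; the noisy toy predictor exists only to make the example runnable.

```python
import random
from statistics import fmean

def ensemble_predict(predict_once, target, num_contexts=16, base_seed=0):
    """Average predictions over num_contexts independently sampled contexts."""
    preds = [predict_once(target, random.Random(base_seed + k)) for k in range(num_contexts)]
    return fmean(preds)

def toy_predict_once(target, rng):
    # Stand-in for retrieval + model forward pass: a true score of 0.7 plus seed-dependent noise.
    return 0.7 + rng.uniform(-0.1, 0.1)

single = toy_predict_once("row-42", random.Random(0))
averaged = ensemble_predict(toy_predict_once, "row-42")
print(f"single={single:.3f}, averaged={averaged:.3f}")
```

The cost grows linearly with the number of sampled contexts, so the ensemble size trades inference budget against how much retrieval noise is averaged out.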

tune retrieval per task

Context tuning. The local context size, retrieval width, and recency bias are tuned per task on its own validation set. This adapts how far and how recently the retriever looks, lifting accuracy at large contexts with no change to the pretrained weights.
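Per-task tuning of the three knobs can be as simple as a search over a small grid scored on the task's validation set. The source does not say which search procedure is used, so the exhaustive grid, the candidate values, and the `validation_score` callable below are all assumptions.

```python
from itertools import product

def tune_context(validation_score, grid):
    """Exhaustive grid search; validation_score(config) returns higher-is-better."""
    names = list(grid)
    best_config, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        score = validation_score(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

# Hypothetical candidate values for the three test-time knobs.
grid = {"context_cells": [1024, 8192], "width": [4, 16], "recency_bias": [0.0, 1.0]}

def toy_score(config):
    # Stand-in for "run the frozen model on the validation set with this retrieval config".
    return config["context_cells"] / 8192 - abs(config["width"] - 16) + config["recency_bias"]

best_config, best_score = tune_context(toy_score, grid)
print(best_config, best_score)
```

For a regression task scored by nMAE, `validation_score` would return the negated error so that higher remains better.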

Results

We evaluate on RelBench, 21 forecasting tasks across 7 databases (9 regression, 12 binary classification), all held out from pretraining. We report Z-score normalized mean absolute error (nMAE, lower is better) for regression and AUROC (higher is better) for classification.
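For readers who want the two metrics spelled out, here is a minimal sketch. The AUROC function uses the standard pairwise definition (the probability that a random positive outscores a random negative, with ties counting half). The nMAE function assumes that labels are z-scored with training-set statistics before taking the mean absolute error; the source does not state which statistics are used.

```python
from statistics import fmean, pstdev

def nmae(y_true, y_pred, train_labels):
    """MAE after z-scoring with training mean/std (assumed normalization)."""
    mu, sigma = fmean(train_labels), pstdev(train_labels)
    z = lambda v: (v - mu) / sigma
    return fmean([abs(z(t) - z(p)) for t, p in zip(y_true, y_pred)])

def auroc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(nmae([2, 4], [3, 2], train_labels=[0, 2, 4]))
print(auroc([1, 0, 1, 0], [0.9, 0.1, 0.4, 0.5]))  # 0.75
```

Because the mean cancels inside the absolute difference, this nMAE equals the plain MAE divided by the training standard deviation, which is what makes errors comparable across tasks with different label scales.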

The recipe transforms a pretrained RT

With the same architecture but the few-shot recipe and THE JOIN, RT-J cuts regression error by a third and lifts classification AUROC by nearly ten points over the prior RT pretraining recipe, and test-time compute pushes both further, all without any task-specific training.

Regression. Mean nMAE over 9 RelBench regression tasks at an 8k-cell context (Table 1). Lower is better.

Classification. Mean AUROC over 12 RelBench classification tasks at an 8k-cell context (Table 1). Higher is better.

Few-shot: dominating in-context-learning pipelines

We compare against the strongest publicly available in-context pipelines, which flatten relational neighborhoods into engineered features before applying a tabular predictor: RDBLearn and an LLM Agent that writes SQL feature queries, each paired with the TabICLv2 tabular foundation model or with XGBoost. As the number of in-context labels grows, RT-J leads on regression at every label count, and leads on classification across most of the range; the strongest baselines only close the gap once they consume the full 8k-cell context. RT-J reaches their best results using 23–32× fewer labeled examples.

Regression. Few-shot nMAE vs. mean in-context labels, reproduced from the paper’s Figure 2. RT-J (cardinal) is lowest at every label count.

Classification. Few-shot AUROC vs. mean in-context labels, reproduced from the paper’s Figure 2. RT-J (cardinal) leads across most of the range.

On the full RelBench regression test splits, RT-J comes within ~3.5 nMAE points of the prior fully supervised state of the art (RT pretrained and fine-tuned) at an 8k-cell context, despite doing no task-specific training of its own. Full per-task numbers are in the paper and tracked on the RelBench leaderboard.

What drives the gains

Removing any one ingredient from the recipe hurts both metrics, and each contributes meaningfully. Schema semantics (the language-model meaning of table and column names) matter most: they supply world knowledge that synthetic-only tabular models cannot recover, and help at every label count.

Configuration	nMAE (%) ↓	AUROC (%) ↑
Prior RT recipe	41.1	63.4
RT-J, full recipe	27.8	73.1
− schema semantics	29.8 (+2.0)	70.7 (−2.4)
− multi-cell masking	29.5 (+1.7)	71.6 (−1.5)
− random-walk retrieval	29.1 (+1.3)	72.3 (−0.8)
− mixed pretraining	28.2 (+0.4)	72.3 (−0.8)
+ per-task context tuning	26.8 (−1.0)	74.9 (+1.8)
+ 16-seed context ensembling	27.4 (−0.4)	74.8 (+1.7)

Recipe summary on RelBench at an 8k-cell context (Table 1). nMAE averages 9 regression tasks; AUROC averages 12 classification tasks. Ablations remove one ingredient; test-time additions are applied on top of the full recipe.
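The headline figures follow directly from the table. The short check below recomputes them from the reported values; the variable names are illustrative, and the numbers are copied from the rows above.

```python
prior = {"nmae": 41.1, "auroc": 63.4}   # Prior RT recipe
full = {"nmae": 27.8, "auroc": 73.1}    # RT-J, full recipe

rel_drop = (prior["nmae"] - full["nmae"]) / prior["nmae"]  # about 0.324, roughly a third
auroc_gain = full["auroc"] - prior["auroc"]                # about 9.7 points

# Ablation rows: (nMAE, AUROC) with one ingredient removed.
ablations = {
    "schema semantics": (29.8, 70.7),
    "multi-cell masking": (29.5, 71.6),
    "random-walk retrieval": (29.1, 72.3),
    "mixed pretraining": (28.2, 72.3),
}
deltas = {
    name: (round(n - full["nmae"], 1), round(a - full["auroc"], 1))
    for name, (n, a) in ablations.items()
}

print(f"relative nMAE reduction: {rel_drop:.1%}")
print(f"AUROC gain: {auroc_gain:.1f} points")
print(deltas)
```

Every ablation raises nMAE and lowers AUROC relative to the full recipe, with schema semantics showing the largest change on both metrics.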

Summary. Few-shot relational prediction is unlocked not by changing the backbone, but by matching the entire foundation-model recipe to the few-shot regime: diverse real-world data, dense context-level supervision, retrieval that surfaces useful labeled examples, and scalable test-time compute. This is a step toward relational foundation models that work for databases the way language models work for text: pretrained once, then reused across new tasks from only a few examples.

Key takeaways

Hundreds of labels, not tens of thousands. RT-J matches strong in-context pipelines with 23–32× fewer labels, and comes within ~3.5 nMAE points of the fully supervised state of the art with no task-specific training.
The recipe is the contribution. A diverse corpus (THE JOIN), dense multi-cell supervision, and structure-aware random-walk retrieval each contribute meaningfully, and the same pretrained model scales with test-time compute.
Real data carries world knowledge. Schema semantics from 650 real databases supply signal that synthetic-only tabular models (TabICLv2, TabPFN, PluRel) cannot recover, helping at every label count.