Learning a new task from a handful of examples is central to the value of foundation models, yet those built for relational databases still need tens of thousands of labels to perform well on a new task. RT-J closes that gap. It is a Relational Transformer (RT) pretrained for the few-shot regime, and it reaches state-of-the-art predictions on unseen relational databases from only hundreds of in-context labels, with no task-specific training.
The label-efficiency gap
Strong relational predictors today typically require task-specific training, substantial labeled data, or careful feature engineering. That makes them hard to use exactly where they are most valuable: when labels are scarce, tasks change frequently, and users need rapid ad-hoc predictions for decision-making. The Relational Transformer is a natural starting point because it operates directly over database cells, schema metadata, and primary–foreign-key links. Its original recipe, however, was not designed to turn the architecture into a strong in-context learner.
We argue this label inefficiency is not primarily an architectural limitation, but a mismatch between existing relational foundation-model recipes and the demands of few-shot prediction. Three gaps are central, and RT-J is the recipe that closes each one:
- Narrow data. Public relational corpora are small, leaving rare-event, cold-start, and heavy-tailed regimes sparsely represented.
- Sparse supervision. Masked-cell pretraining supervises a single cell in a short context, while few-shot inference relies on long contexts full of labeled examples.
- Naive context. Standard context construction ignores relational structure or expands neighborhoods blindly, instead of retrieving rows that are both relevant and label-bearing.
Our goal is not a new backbone, but to identify the ingredients that turn an RT-style model into an effective in-context learner for relational databases: a diverse pretraining corpus, a training recipe aligned with few-shot inference, a structure-aware retriever, and test-time compute. We call the resulting pretrained model RT-J: RT pretrained on THE JOIN.
THE JOIN: a large-scale pretraining corpus
Existing collections of relational databases comprise a limited set of datasets and hand-curated tasks. THE JOIN is, to our knowledge, the largest open pretraining corpus of relational data to date: 650 databases across e-commerce, sports, media, finance, healthcare and more, with 6,255 forecasting tasks deliberately covering rare-event, cold-start, and heavy-tailed targets. Tasks span eight reasoning regimes drawn from 25 template families.
| Reasoning regime |
|---|
| aggregation |
| threshold / existence |
| trend / change |
| behaviour / churn |
| ranking |
| adoption / novelty |
| streak |
| tail-event |
The corpus is built with a four-stage pipeline that turns messy web data into clean, learnable forecasting tasks:
Collection
Raw databases are gathered from five disjoint source classes (curated relational repositories, public APIs, domain portals, community dumps, and benchmarks) under permissive licenses, keeping only multi-table databases with a parseable timestamp.
Standardization
Six idempotent detector-and-transformer fixes recover canonical relational form: boolean coercion, numeric coercion, foreign-key inference, temporal normalization, entity-axis subsampling, and source-overlap exclusion, the last of which keeps evaluation benchmarks held out.
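To make "idempotent detector-and-transformer" concrete, here is a minimal sketch of what one such fix (boolean coercion) could look like. The token vocabulary and the function names `detect_boolean_column` and `coerce_boolean_column` are hypothetical; the actual rules used to build THE JOIN are not given here.

```python
from typing import List, Union

Cell = Union[str, bool, None]

# Hypothetical vocabulary of boolean-like strings (an assumption, not the pipeline's rules).
TRUE_TOKENS = {"true", "t", "yes", "y", "1"}
FALSE_TOKENS = {"false", "f", "no", "n", "0"}
BOOL_TOKENS = TRUE_TOKENS | FALSE_TOKENS


def detect_boolean_column(values: List[Cell]) -> bool:
    """Detector: fire only if every non-null cell is boolean-like
    and at least one cell is still stored as a string."""
    needs_fix = False
    for v in values:
        if v is None or isinstance(v, bool):
            continue
        if not isinstance(v, str) or v.strip().lower() not in BOOL_TOKENS:
            return False
        needs_fix = True
    return needs_fix


def coerce_boolean_column(values: List[Cell]) -> List[Cell]:
    """Transformer: convert boolean-like strings to bool, leave nulls untouched.
    Columns the detector rejects are returned unchanged."""
    if not detect_boolean_column(values):
        return list(values)
    return [v.strip().lower() in TRUE_TOKENS if isinstance(v, str) else v
            for v in values]


raw = ["yes", "No", None, "TRUE"]
once = coerce_boolean_column(raw)
twice = coerce_boolean_column(once)
```

Because the detector no longer fires once the column holds real booleans, applying the fix a second time is a no-op, which is the property "idempotent" refers to.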
Task generation
An LLM proposer (Claude Opus) reads each schema and writes forecasting tasks from 25 registered templates; a lightweight deterministic checker drops any proposal whose columns, entities, or patterns do not validate.
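A deterministic checker of this kind might look roughly like the sketch below. The `TaskProposal` fields, the template names, and the error messages are all illustrative assumptions; only the idea of rejecting proposals whose templates or columns fail to validate against the schema comes from the text.

```python
from dataclasses import dataclass
from typing import Dict, List, Set

# Hypothetical template names; the 25 registered templates are not listed here.
REGISTERED_TEMPLATES = {"aggregation_sum", "threshold_exists", "churn"}

Schema = Dict[str, Set[str]]  # table name -> column names


@dataclass(frozen=True)
class TaskProposal:
    template: str
    entity_table: str
    entity_column: str
    target_table: str
    target_column: str


def validation_errors(proposal: TaskProposal, schema: Schema) -> List[str]:
    """Return every reason the proposal fails; an empty list means it is valid."""
    errors = []
    if proposal.template not in REGISTERED_TEMPLATES:
        errors.append(f"unknown template: {proposal.template}")
    references = [
        (proposal.entity_table, proposal.entity_column),
        (proposal.target_table, proposal.target_column),
    ]
    for table, column in references:
        if table not in schema:
            errors.append(f"unknown table: {table}")
        elif column not in schema[table]:
            errors.append(f"unknown column: {table}.{column}")
    return errors


def keep_valid(proposals: List[TaskProposal], schema: Schema) -> List[TaskProposal]:
    """Drop any proposal that does not validate against the schema."""
    return [p for p in proposals if not validation_errors(p, schema)]


schema = {
    "users": {"user_id", "signup_ts"},
    "orders": {"order_id", "user_id", "amount", "ts"},
}
good = TaskProposal("aggregation_sum", "users", "user_id", "orders", "amount")
bad = TaskProposal("aggregation_sum", "users", "user_id", "orders", "price")
kept = keep_valid([good, bad], schema)
```

Keeping the checker purely rule-based means a hallucinated column such as `orders.price` is rejected without another model call.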
Task filtering
An XGBoost baseline over window-aggregation features scores every task. Tasks are retained if they are learnable (the XGBoost model beats historical baselines) or random-tier (weak signal that may still help stronger models).
The RT-J pretraining recipe
Three ingredients align pretraining with few-shot inference. Together they expose the model to the long, label-rich contexts it will see at test time, while keeping supervision dense and context retrieval structure-aware.
Mixed context sizes
Context length is the axis that trades inference compute for quality. RT-J trains at a mix of lengths, L ∼ Unif{1k, 2k, 4k, 8k} cells, with a constant batch size and adaptive gradient accumulation at large L. One pretrained model then handles both short and long contexts at test time.
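The interaction between sampled context length and gradient accumulation can be sketched as follows. This is an illustration under stated assumptions: "1k" is read as 1024 cells, and the batch size of 256 and per-step budget of 65,536 cells are made-up values, not the settings used to train RT-J.

```python
import math
import random

# Assumption: "1k" means 1024 cells; the text gives only the rounded sizes.
CONTEXT_LENGTHS = (1024, 2048, 4096, 8192)


def accumulation_plan(length, batch_size=256, cell_budget=65536):
    """Pick a micro-batch that fits the (hypothetical) cell budget, then
    accumulate gradients until the constant batch size is reached."""
    micro_batch = max(1, min(batch_size, cell_budget // length))
    accum_steps = math.ceil(batch_size / micro_batch)
    return micro_batch, accum_steps


def sample_step(rng):
    """Draw L uniformly from the four sizes and return its accumulation plan."""
    length = rng.choice(CONTEXT_LENGTHS)
    micro_batch, accum_steps = accumulation_plan(length)
    return length, micro_batch, accum_steps


rng = random.Random(0)
length, micro_batch, accum_steps = sample_step(rng)
```

Under these assumed numbers, an 8k-cell step uses micro-batches of 8 with 32 accumulation steps, while a 1k-cell step uses micro-batches of 64 with 4, so every optimizer step sees the same number of contexts.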
Multi-cell masking
As contexts grow, supervising only the single target cell wastes compute. A hierarchical Bernoulli scheme draws a per-context rate p ∼ Unif[0, 0.5] and masks many non-target cells, giving dense supervision within every window and keeping long-context pretraining data-efficient.
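A minimal sketch of the two-level sampling, assuming each non-target cell is masked independently at the drawn rate and the target cell is always masked (the second point is an assumption; `sample_mask` is a hypothetical name):

```python
import random
from typing import List


def sample_mask(num_cells: int, target_index: int, rng: random.Random,
                max_rate: float = 0.5) -> List[bool]:
    """Level 1: draw one masking rate p ~ Unif[0, max_rate] for the whole context.
    Level 2: mask each cell independently with probability p.
    The target cell is always masked (assumed here)."""
    rate = rng.uniform(0.0, max_rate)
    mask = [rng.random() < rate for _ in range(num_cells)]
    mask[target_index] = True
    return mask


rng = random.Random(0)
mask = sample_mask(num_cells=1000, target_index=7, rng=rng)
target_only = sample_mask(num_cells=1000, target_index=7, rng=rng, max_rate=0.0)
```

Drawing the rate per context, rather than fixing it globally, exposes the model to windows ranging from almost fully observed to half hidden.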
Random-walk retrieval
A two-stage, on-the-fly retriever fills the context window: thousands of bounded random walks from the target row rank task rows by recency and visit count, then a width-bounded local search expands around them. It surfaces rows that are both topologically close to the target and likely to carry a useful label, with no precomputation.
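The two stages can be sketched on a toy row graph as below. Several details are assumptions made for illustration: rows are string ids, edges stand for primary–foreign-key links, ranking sorts by visit count and then by timestamp, and the budget is counted in rows rather than cells. The function names are hypothetical.

```python
import random
from collections import Counter, deque
from typing import Dict, List

Graph = Dict[str, List[str]]  # row id -> rows linked by a key relationship


def rank_task_rows(graph: Graph, target: str, task_time: Dict[str, int],
                   rng: random.Random, num_walks: int = 1000,
                   max_steps: int = 4) -> List[str]:
    """Stage 1: run bounded random walks from the target row and count
    visits to labeled task rows; rank by (visit count, recency)."""
    visits: Counter = Counter()
    for _ in range(num_walks):
        node = target
        for _ in range(max_steps):
            neighbors = graph.get(node, [])
            if not neighbors:
                break
            node = rng.choice(neighbors)
            if node in task_time and node != target:
                visits[node] += 1
    return sorted(visits, key=lambda r: (visits[r], task_time[r]), reverse=True)


def expand_context(graph: Graph, seeds: List[str], budget: int,
                   width: int = 2) -> List[str]:
    """Stage 2: breadth-first expansion around the seeds, following at most
    `width` links per row, until the row budget is filled."""
    context: List[str] = []
    seen = set()
    queue = deque(seeds)
    while queue and len(context) < budget:
        row = queue.popleft()
        if row in seen:
            continue
        seen.add(row)
        context.append(row)
        queue.extend(n for n in graph.get(row, [])[:width] if n not in seen)
    return context


graph = {
    "t0": ["u1"], "u1": ["t0", "t1", "o1"], "t1": ["u1"], "o1": ["u1"],
    "u2": ["t2"], "t2": ["u2"],  # unreachable from t0
}
task_time = {"t0": 3, "t1": 2, "t2": 1}
ranked = rank_task_rows(graph, "t0", task_time, random.Random(0))
context = expand_context(graph, ["t0"] + ranked, budget=3)
```

In the toy graph, the labeled row `t1` shares a user with the target and is surfaced, while `t2` is never visited because no key path connects it to the target.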
Model & training. RT-J is an 85M-parameter Relational Transformer (12 layers, 512 hidden dimensions, 8 attention heads) pretrained for 100k optimizer steps on 32×H100 GPUs in about two days. The same recipe is used for classification and regression, with checkpoints selected independently per metric. The local context size, retrieval width, and recency bias are left as test-time knobs (see below) rather than baked into the weights.
Test-time compute
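Two complementary forms of test-time compute improve few-shot predictions without changing the pretrained model: per-task context tuning, which adjusts the local context size, retrieval width, and recency bias for each task, and context ensembling, which combines predictions over contexts built from multiple seeds. In both cases, extra inference budget buys accuracy.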
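As an illustration of spending inference budget on ensembling, the sketch below rebuilds the context under several seeds and averages the resulting predictions. Plain averaging is an assumption, and `build_context` and `predict` are hypothetical stand-ins for the retriever and the frozen model.

```python
from statistics import fmean
from typing import Callable, Iterable, List


def ensemble_predict(build_context: Callable[[int], List[float]],
                     predict: Callable[[List[float]], float],
                     seeds: Iterable[int]) -> float:
    """Build one context per seed, predict on each, and average the outputs.
    The model weights are never updated."""
    return fmean([predict(build_context(seed)) for seed in seeds])


# Toy stand-ins: the "context" is two numbers and the "model" returns their mean.
def toy_build_context(seed: int) -> List[float]:
    return [float(seed), float(seed + 1)]


def toy_predict(context: List[float]) -> float:
    return sum(context) / len(context)


prediction = ensemble_predict(toy_build_context, toy_predict, range(16))
```

Cost grows linearly with the number of seeds, since each seed requires one forward pass.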
Results
We evaluate on RelBench, 21 forecasting tasks across 7 databases (9 regression, 12 binary classification), all held out from pretraining. We report Z-score normalized mean absolute error (nMAE, lower is better) for regression and AUROC (higher is better) for classification.
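For readers who want to reproduce the two metrics, here is a small dependency-free sketch. It assumes the z-score statistics come from a reference set of labels passed in by the caller and that the population standard deviation is used; which labels and which estimator the evaluation actually uses is not stated here. Since the mean cancels inside the absolute difference, z-score normalized MAE reduces to MAE divided by the standard deviation.

```python
from statistics import fmean, pstdev
from typing import Sequence


def nmae(y_true: Sequence[float], y_pred: Sequence[float],
         reference: Sequence[float]) -> float:
    """MAE after z-scoring targets and predictions with the reference labels.
    |(t - mu)/sigma - (p - mu)/sigma| = |t - p| / sigma, so only sigma matters."""
    sigma = pstdev(reference)
    return fmean([abs(t - p) for t, p in zip(y_true, y_pred)]) / sigma


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative,
    counting ties as one half (quadratic time; fine for a sketch)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


example_nmae = nmae([1.0, 3.0], [2.0, 2.0], reference=[0.0, 4.0])
example_auroc = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Both values are fractions; multiplying by 100 gives the percentage form used in the results table.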
The recipe transforms a pretrained RT
With the same architecture but the few-shot recipe and THE JOIN, RT-J cuts regression error by a third and lifts classification AUROC by nearly ten points over the prior RT pretraining recipe, and test-time compute pushes both further, all without any task-specific training.
Few-shot: dominating in-context-learning pipelines
We compare against the strongest publicly available in-context pipelines, which flatten relational neighborhoods into engineered features before applying a tabular predictor: RDBLearn and an LLM Agent that writes SQL feature queries, each paired with the TabICLv2 tabular foundation model or with XGBoost. As the number of in-context labels grows, RT-J leads on regression at every label count, and leads on classification across most of the range; the strongest baselines only close the gap once they consume the full 8k-cell context. RT-J reaches their best results using 23–32× fewer labeled examples.
On the full RelBench regression test splits, RT-J comes within ~3.5 nMAE points of the prior fully-supervised state of the art (RT pretrained and fine-tuned) at an 8k-cell context, despite doing no task-specific training of its own. Full per-task numbers are in the paper and tracked on the RelBench leaderboard.
What drives the gains
Removing any one ingredient from the recipe hurts both metrics, and each contributes meaningfully. Schema semantics (the language-model meaning of table and column names) matter most: they supply world knowledge that synthetic-only tabular models cannot recover, and help at every label count.
| Configuration | nMAE (%) ↓ | AUROC (%) ↑ |
|---|---|---|
| Prior RT recipe | 41.1 | 63.4 |
| RT-J, full recipe | 27.8 | 73.1 |
| − schema semantics | 29.8 (+2.0) | 70.7 (−2.4) |
| − multi-cell masking | 29.5 (+1.7) | 71.6 (−1.5) |
| − random-walk retrieval | 29.1 (+1.3) | 72.3 (−0.8) |
| − mixed pretraining | 28.2 (+0.4) | 72.3 (−0.8) |
| + per-task context tuning | 26.8 (−1.0) | 74.9 (+1.8) |
| + 16-seed context ensembling | 27.4 (−0.4) | 74.8 (+1.7) |
Recipe summary on RelBench at an 8k-cell context (Table 1). nMAE averages 9 regression tasks; AUROC averages 12 classification tasks. Ablations remove one ingredient; test-time additions are applied on top of the full recipe.
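The headline improvements can be re-derived from the first two rows of the table. The short check below uses only the reported numbers; the variable names are illustrative.

```python
# Values copied from the table: prior RT recipe vs. RT-J full recipe.
prior_nmae, rtj_nmae = 41.1, 27.8
prior_auroc, rtj_auroc = 63.4, 73.1

# Relative reduction in regression error (fraction of the prior error removed).
relative_nmae_reduction = (prior_nmae - rtj_nmae) / prior_nmae

# Absolute gain in classification AUROC, in percentage points.
auroc_gain_points = rtj_auroc - prior_auroc

print(f"nMAE reduction: {relative_nmae_reduction:.1%}")
print(f"AUROC gain: {auroc_gain_points:.1f} points")
```

The regression error falls by roughly 32%, about a third, and AUROC rises by 9.7 points.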
Summary. Few-shot relational prediction is unlocked not by changing the backbone, but by matching the entire foundation-model recipe to the few-shot regime: diverse real-world data, dense context-level supervision, retrieval that surfaces useful labeled examples, and scalable test-time compute. This is a step toward relational foundation models that work for databases the way language models work for text: pretrained once, then reused across new tasks from only a few examples.
Key takeaways
- Hundreds of labels, not tens of thousands. RT-J matches strong in-context pipelines with 23–32× fewer labels, and comes within ~3.5 nMAE points of the fully-supervised state of the art with no task-specific training.
- The recipe is the contribution. A diverse corpus (THE JOIN), dense multi-cell supervision, and structure-aware random-walk retrieval each contribute meaningfully, and the same pretrained model scales with test-time compute.
- Real data carries world knowledge. Schema semantics from 650 real databases supply signal that synthetic-only tabular models (TabICLv2, TabPFN, PluRel) cannot recover, helping at every label count.