WorldModel² – Alexander Epple

WorldModel² was a course research project in which we investigated whether model-based reinforcement learning (MBRL) agents could learn exploration behavior that transfers across previously unseen tasks, a capability central to the pursuit of Artificial General Intelligence (AGI). The full paper is available in the Papers section.

Overview

Alexander Epple

Jonas Lang

Thilo Dünßer

Yannik Bretschneider

The project asked whether an agent could learn how to explore, not merely what to do. We built around DreamerV3, a state-of-the-art MBRL algorithm that learns an internal world model via a Recurrent State Space Model (RSSM) and trains an actor-critic entirely on imagined trajectories in latent space. Rather than applying it to a fixed task, we adapted it to a procedurally generated environment where the underlying rules change across training episodes, forcing the agent to generalize rather than memorize.

We investigated three interconnected questions: how DreamerV3 variants compare to model-free baselines across rulesets of increasing complexity; whether Dreamer could be extended to a continual meta-reinforcement learning setting in which it repeatedly infers new rules from scratch; and how curriculum learning affects adaptation as rule complexity grows. Supporting all three directions required building substantial custom infrastructure, which became a major contribution in its own right.

Switchboard Environment

The switchboard environment presents the agent with a set of binary buttons and observation slots. To activate a target slot, the agent must discover and execute the correct sequence of button presses as defined by a hidden ruleset. Rules were encoded as hierarchical RuleTree structures built from a TreeRule class, with leaf nodes representing raw actions or observations, standard logic nodes (AND, OR, NOT), and temporal nodes supporting delays, holds, toggles, and sequences. This representation made it possible to generate an enormous and diverse distribution of solvable tasks while keeping rule structure fully introspectable.
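As a rough illustration of the tree representation described above, the following sketch evaluates a purely logical rule against the current button states. The `Node` class, its field names, and the `evaluate` function are hypothetical stand-ins rather than the project's `TreeRule` implementation, and temporal nodes are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical rule-tree node; not the project's TreeRule class."""
    kind: str                      # "BUTTON", "AND", "OR", or "NOT"
    button: int = -1               # index, used only by BUTTON leaves
    children: List["Node"] = field(default_factory=list)


def evaluate(node: Node, pressed: List[bool]) -> bool:
    """Evaluate a logical rule tree against the current button states."""
    if node.kind == "BUTTON":
        return pressed[node.button]
    if node.kind == "AND":
        return all(evaluate(c, pressed) for c in node.children)
    if node.kind == "OR":
        return any(evaluate(c, pressed) for c in node.children)
    if node.kind == "NOT":
        return not evaluate(node.children[0], pressed)
    raise ValueError(f"unknown node kind: {node.kind}")


# Example rule: the slot activates when button 0 is pressed and button 2 is not.
rule = Node("AND", children=[
    Node("BUTTON", button=0),
    Node("NOT", children=[Node("BUTTON", button=2)]),
])
```

Because the tree is plain data, the same structure can be walked by an evaluator, a solver, or a complexity scorer, which is what keeps the rules introspectable.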

To guarantee that every generated rule was actually solvable, we used Z3 as a symbolic ground-truth solver. Given a rule tree and a fixed planning horizon, Z3 constructed a symbolic constraint system by recursively encoding each node’s semantics, then checked whether any valid action sequence existed. Rules that failed the satisfiability check were discarded. Accepted rules were additionally validated by simulating the extracted action sequence in the live environment against the full current ruleset, catching cases where a locally satisfiable rule broke under the combined scenario.
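The solvability question itself can be illustrated without a solver. The sketch below deliberately swaps Z3's symbolic encoding for exhaustive enumeration over all action sequences up to the horizon, which is only feasible for tiny boards. The nested-tuple rule format, the `DELAY` semantics (the child must have held exactly `k` steps earlier), and the function names are assumptions made for this example.

```python
from itertools import product


def holds(rule, history, t):
    """True if `rule` is satisfied at step t of `history` (a sequence of button tuples)."""
    op = rule[0]
    if op == "BTN":
        return history[t][rule[1]]
    if op == "AND":
        return all(holds(r, history, t) for r in rule[1:])
    if op == "OR":
        return any(holds(r, history, t) for r in rule[1:])
    if op == "NOT":
        return not holds(rule[1], history, t)
    if op == "DELAY":  # assumed semantics: child held exactly k steps earlier
        k, child = rule[1], rule[2]
        return t - k >= 0 and holds(child, history, t - k)
    raise ValueError(f"unknown operator: {op}")


def find_solution(rule, n_buttons, horizon):
    """Return a shortest action sequence that satisfies `rule`, or None."""
    states = list(product([False, True], repeat=n_buttons))
    for length in range(1, horizon + 1):
        for history in product(states, repeat=length):
            if holds(rule, history, length - 1):
                return list(history)
    return None


# "Button 0 was pressed one step ago and is released now."
press_then_release = ("AND", ("DELAY", 1, ("BTN", 0)), ("NOT", ("BTN", 0)))
# Contradictory rule: no action sequence can satisfy it.
contradiction = ("AND", ("BTN", 0), ("NOT", ("BTN", 0)))
```

A rule for which `find_solution` returns `None` would be discarded, mirroring the unsatisfiable case; the returned sequence plays the role of the extracted ground-truth solution that is then replayed in the environment.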

Rule Infrastructure

We developed a procedural rule generator that recursively sampled rule trees from a weighted node-type distribution, with controllable maximum depth, early-stopping probabilities, and predefined templates for recurring patterns such as inhibitions and exclusive-or relations. Structural constraints (e.g. prohibiting repeated negations and recursive sequence nesting) kept the generated rules meaningfully complex rather than superficially so. A GPU-accelerated switchboard simulator substantially reduced the per-step evaluation cost during training, which was necessary because Z3-based validation made repeated on-the-fly rule generation computationally expensive.

A generated, complex AND/NOT/DELAY rule solvable in 6 steps
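A minimal sketch of recursive rule sampling along the lines described above. The node-type weights, the default early-stopping probability, the delay range, and the nested-tuple format are all illustrative assumptions; the only structural constraint shown is the ban on a NOT directly beneath another NOT, and templates are omitted.

```python
import random

# Hypothetical node-type weights; the project's actual distribution is not given.
WEIGHTS = {"AND": 3, "OR": 3, "NOT": 2, "DELAY": 1}


def sample_rule(rng, n_buttons, max_depth, stop_prob=0.3, parent=None):
    """Recursively sample a rule tree as nested tuples."""
    # Stop early with some probability, or when the depth budget is used up.
    if max_depth == 0 or rng.random() < stop_prob:
        return ("BTN", rng.randrange(n_buttons))
    # Structural constraint: never place a NOT directly under a NOT.
    kinds = [k for k in WEIGHTS if not (k == "NOT" and parent == "NOT")]
    kind = rng.choices(kinds, weights=[WEIGHTS[k] for k in kinds])[0]

    def child():
        return sample_rule(rng, n_buttons, max_depth - 1, stop_prob, kind)

    if kind == "NOT":
        return ("NOT", child())
    if kind == "DELAY":
        return ("DELAY", rng.randint(1, 2), child())
    return (kind, child(), child())


def depth(rule):
    """Number of node levels in the tree, counting the leaf level."""
    subtrees = [r for r in rule[1:] if isinstance(r, tuple)]
    return 1 + max((depth(r) for r in subtrees), default=0)


def has_double_not(rule):
    """True if any NOT node has a NOT node as its direct child."""
    if rule[0] == "NOT" and rule[1][0] == "NOT":
        return True
    return any(has_double_not(r) for r in rule[1:] if isinstance(r, tuple))
```

Passing a seeded `random.Random` instance keeps generation reproducible, which matters when each sampled rule is subsequently sent through an expensive solvability check.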

To quantify how difficult rules actually were, we introduced two complementary complexity scorers. The structural scorer derived a score directly from the rule tree, weighting each node type according to its logical demands (e.g. NOT nodes multiplied their subtree score, while OR nodes used a reciprocal sum reflecting parallel satisfiability). The action-based scorer instead analyzed the ground-truth action sequence from Z3, combining total step count, number of distinct buttons, and the entropy of the button usage distribution. Neither metric was sufficient alone: simple rules could have long solver sequences, and complex trees could be satisfied trivially. Combining both produced the stable estimates required for curriculum design and outlier filtering.
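The two scorers can be sketched as follows. The concrete weights are assumptions: leaves score 1, AND sums its children, OR uses a reciprocal sum (like parallel resistors), NOT multiplies by a hypothetical factor of 1.5, and DELAY adds its delay length. The action-based score simply adds step count, distinct-button count, and Shannon entropy with equal weight; the text does not specify how the project actually weighted these terms.

```python
import math
from collections import Counter


def structural_score(rule, not_factor=1.5):
    """Score a nested-tuple rule tree by its node types (weights are assumptions)."""
    op = rule[0]
    if op == "BTN":
        return 1.0
    if op == "NOT":
        return not_factor * structural_score(rule[1], not_factor)
    if op == "DELAY":
        return rule[1] + structural_score(rule[2], not_factor)
    scores = [structural_score(r, not_factor) for r in rule[1:]]
    if op == "AND":
        return sum(scores)
    if op == "OR":  # reciprocal sum: more alternatives make the rule easier
        return 1.0 / sum(1.0 / s for s in scores)
    raise ValueError(f"unknown operator: {op}")


def action_score(presses):
    """Score a solution given as the ordered list of pressed button indices."""
    counts = Counter(presses)
    total = len(presses)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return total + len(counts) + entropy
```

For instance, an OR over two single buttons scores 0.5 structurally, lower than either button alone, while a four-step solution alternating evenly between two buttons contributes one full bit of entropy to its action score.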

Dreamer Adaptation

Adapting DreamerV3 to the switchboard required replacing the standard convolutional encoder and decoder with feedforward networks suited to the binary vector-based observation space, projecting up to the RSSM’s hidden dimension and back down in steps. The policy was implemented as a latent goal-conditioned Bernoulli policy trained exclusively on imagined RSSM rollouts. We evaluated two world-model variants: Vanilla Dreamer (VD), trained via decoder reconstruction, and Energy Dreamer (ED), which supplemented reconstruction with a contrastive observation–state matching objective and detached the reconstruction loss from directly shaping the latent dynamics.

Adapted Dreamer architecture for the switchboard: feedforward encoder/decoder, recurrent latent state, and a goal-conditioned actor-critic. Yellow elements indicate additions for the self-learning extension.
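To make the term "goal-conditioned Bernoulli policy" concrete, the sketch below samples one independent press/no-press decision per button from a single linear layer over the concatenated latent state and goal. The linear head, the concatenation, and all names are simplifying assumptions; the project's actor is a learned network trained on imagined rollouts, which this example does not attempt to reproduce.

```python
import math
import random


def bernoulli_policy(latent, goal, weights, bias, rng):
    """Sample a binary action vector; returns (action, per-button probabilities)."""
    x = list(latent) + list(goal)          # goal conditioning by concatenation
    probs = []
    for w_row, b in zip(weights, bias):    # one output row per button
        logit = sum(w * xi for w, xi in zip(w_row, x)) + b
        probs.append(1.0 / (1.0 + math.exp(-logit)))  # sigmoid
    action = [1 if rng.random() < p else 0 for p in probs]
    return action, probs


# Two buttons, a 2-dimensional latent, and a 1-dimensional goal (toy sizes).
toy_weights = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
toy_bias = [50.0, -50.0]                   # button 0 almost surely on, button 1 off
toy_action, toy_probs = bernoulli_policy(
    [0.2, -0.1], [1.0], toy_weights, toy_bias, random.Random(0)
)
```

Treating each button as its own Bernoulli variable lets the policy express simultaneous presses, which rules such as AND over several buttons require.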

For the self-learning extension, training was restructured into alternating Learn Phases (1,000 steps of free exploration) and Answer Phases (30 steps with a goal provided and success evaluated). Both the world model and the policy received an additional binary input encoding the current phase, and the policy gained a reset action to handle rules with stateful preconditions. We trained this setup at three model sizes, tiny (~400K parameters), medium (~12M), and large (~100M), in both vanilla and energy variants to study how scale affected self-learning performance.
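The alternating schedule and the extra phase input can be sketched as below. The phase lengths come from the text; the flag encoding (0 for Learn, 1 for Answer), the choice to append the flag to the observation vector, and the function names are assumptions for illustration.

```python
def phase_schedule(n_cycles, learn_steps=1000, answer_steps=30):
    """Yield (global_step, phase_flag) pairs; flag 0 = Learn Phase, 1 = Answer Phase."""
    step = 0
    for _ in range(n_cycles):
        for flag, length in ((0, learn_steps), (1, answer_steps)):
            for _ in range(length):
                yield step, flag
                step += 1


def with_phase(observation, flag):
    """Append the binary phase flag to a binary observation vector."""
    return list(observation) + [flag]


# Two full Learn/Answer cycles: 2 * (1000 + 30) = 2060 steps.
schedule = list(phase_schedule(2))
```

Under this schedule, only about 3% of environment steps are spent in Answer Phases, so most experience the world model sees comes from free exploration.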

Baseline Comparison

We evaluated VD and ED against three model-free baselines: an Actor-Critic (AC), Goal-Conditioned Supervised Learning (GCSL), and Contrastive Reinforcement Learning (CRL). Experiments spanned three rulesets of increasing difficulty (direct, hard, and random), with each model trained twice under identical conditions to account for stochasticity. On hard rules, VD reached a mean success rate of 0.71 and ED of 0.69, compared to 0.57 for CRL, the strongest baseline. On direct rules, all models converged near the theoretical cap of 0.67, leaving little room for differentiation.

Trajectory heatmaps for Vanilla Dreamer, Energy Dreamer, and CRL across two training runs on the hard ruleset

The hard rules proved the most informative benchmark: sufficiently complex to differentiate models, yet structured enough to remain tractable. Trajectory analysis showed that the Dreamer variants could handle action sequences of up to five steps and partially omit irrelevant actions, behaviors the model-free baselines reproduced less reliably. On random rules, all methods declined sharply, with no model exceeding 0.50, indicating that deep nesting and diverse logical-temporal combinations produced too sparse a distribution of feasible solutions for any architecture to handle consistently within the given training budget.

Self-Learning and Curriculum

In the self-learning setting, the tiny models reached single-slot success rates of approximately 60%, while medium and large models performed substantially worse. This was unexpected. Reconstruction loss analysis revealed that larger models did not consistently reduce prediction error between the Learn and Answer phases, suggesting their world models failed to reliably encode the current ruleset in the latent state. The policy of larger models appeared to optimize reward under a mislearned world model rather than genuinely inferring rules, while smaller models forwarded a cleaner signal to the actor by compressing the world state less aggressively.

Mean success rate across four curriculum checkpoints. Each checkpoint was trained on a progressively larger subset of levels. All trained models consistently outperformed the untrained baseline.

The curriculum learning run covered four levels of boolean logic, from direct mappings through AND rules, OR rules, and deeper compositional trees. We observed clear negative transfer and catastrophic forgetting at the level 1–2 transition: Level 1 performance dropped from 70% to 34% as the agent struggled to reconcile single-button policies with the simultaneous-press requirement of AND rules. The level 3 checkpoint, however, recovered strongly to 51% on both earlier levels. We attribute this result to OR rules reintroducing single-action satisfiability and preventing overspecialization in either direction. Across all checkpoints, the trained agent consistently outperformed the untrained random baseline on the more complex levels, suggesting that even imperfect curriculum training produced a more structured and transferable exploration strategy.
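The checkpoint structure and the size of the forgetting effect can be written out as a small worked example. The level names are paraphrased from the text, the cumulative-subset reading follows the checkpoint description above, and the helper names are hypothetical; the success rates are the ones reported in this section.

```python
# Level names paraphrased from the text; ordering is from simplest to most complex.
LEVELS = ["direct", "and", "or", "compositional"]


def curriculum_stages(levels):
    """Checkpoint i is trained on the first i + 1 levels (a growing subset)."""
    return [levels[: i + 1] for i in range(len(levels))]


def forgetting(before, after):
    """Drop in success rate on an earlier level after training continued."""
    return before - after


stages = curriculum_stages(LEVELS)

# Level 1 success rate: 70% at checkpoint 1, 34% at checkpoint 2, 51% at checkpoint 3.
level1_drop = forgetting(0.70, 0.34)       # 36 percentage points lost
level1_rebound = 0.51 - 0.34               # 17 percentage points regained
```

Read this way, the level 3 checkpoint regained a little under half of what level 1 lost at the level 1–2 transition.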

Conclusion

WorldModel² offered partial evidence that transferable exploration behavior can be learned in procedurally generated environments. Dreamer-based agents outperformed model-free baselines on structured rulesets by a meaningful margin, the self-learning extension showed genuine adaptation in smaller models, and the curriculum experiments revealed a recovery dynamic that pointed toward real policy flexibility rather than simple memorization. At the same time, the results on random rules and the inverse size-performance relationship in self-learning highlighted that stable meta-structure encoding inside a world model remains a fundamentally open challenge, particularly when task rules change at every episode boundary.

The project was a rewarding opportunity to work at the intersection of MBRL, meta-reinforcement learning, and curriculum design, contributing novel infrastructure alongside the experimental findings. The source code is available on GitHub.