v1.0 — Research Document
Nirav Madhani | huggingface.co/Nirav-Madhani github.com/Nirav-Madhani March 2026

A Hybrid Field–Graph Representation for Probabilistic World Models in Robotics

A human can pick up an unfamiliar mug after seeing it once — not because they have memorized every mug, but because they can imagine the physical consequences: the surface normal constrains the grasp angle, the estimated mass sets the grip force, and a mental model of the support chain predicts what else will move. Current image-to-action networks lack this structured physical imagination; they must re-learn physics from raw pixels for every new task and cannot generalize from sparse interactions.

We propose an explicit structured intermediate state — a continuous belief field Φ for geometry and material queries, and a discrete object graph G for persistent identity and relational reasoning — that enables physics-aware dreaming: the ability to mentally simulate plausible outcomes of untried actions using explicit physical structure, and to generalize from sparse observations because physics constrains the space of possible outcomes. Claims are separated into three evidence classes throughout.

Literature-backed Repo result Proof-gated benchmark
Central Idea

From Implicit to Explicit World State

The core design choice is where structured physical knowledge lives in the pipeline: inside a black box, or as an addressable intermediate state that enables physics-aware dreaming and sparse-data generalization.

Conventional Pipeline
RGB Image Neural Network geometry? material? relations? (all implicit) Action

Physical queries are answered implicitly inside the network weights. Every new task must re-discover geometry, friction, and object relationships from raw pixels.

Proposed Pipeline
RGB Image Encoder CNN+Slots M1 S = (Φ, G) Φ: geometry material, SDF G: identity relations, graph (structured object) Vectorizer GNN + pool s ∈ ℝ⁵¹² dense vector M2 Cross Encoder M3 Action

Physical structure lives in an explicit intermediate state St = (Φt, Gt). A StateVectorizer (M2) then converts this structured object-graph into a dense scene vector s ∈ ℝ512 via GNN message passing and attention pooling — the step that makes (Φ, G) compatible with downstream neural networks. The CrossEncoder (M3) takes s and produces the robot action.

The key design question. Which feature map enables a downstream skill to imagine physical outcomes and generalize from sparse data: pooled image pixels, or a structured physical state extracted from those same pixels? When the state carries explicit mass, friction, and contact topology, the downstream model can dream forward plausible consequences of untried actions — constrained by physics rather than learned from millions of pixel examples. Section 6 reports a controlled benchmark isolating this effect; Section 5 demonstrates why it matters through three concrete MuJoCo experiments.
Section 1

Claim and Scope

Literature-backed Repo result Proof-gated benchmark

This document argues for a hybrid intermediate state representation St = (Φt, Gt) as the interface between a robot’s perceptual front-end and its planning and control back-end. The contribution is an interface design: a robot-facing world state that exposes local geometry and material queries through a continuous neural field \(\Phi\), and persistent object identity with relational structure through a discrete graph \(G\). The long-term goal is to train all three pipeline stages end-to-end and use the result as the physics-grounded backbone of a full Vision–Language–Action (VLA) model — detailed in Section 9.

What is Physics-Aware Dreaming?

Physics-aware dreaming is the ability of a world model to mentally simulate the consequences of actions it has never taken, using explicit physical structure rather than learned pixel correlations. The term extends the Dreamer paradigm (Hafner et al., DreamerV3), where agents train policies entirely by imagining forward in a latent space. In our case, the latent space is not opaque — it is a physically grounded state \(S_t = (\Phi_t, G_t)\) where geometry, mass, friction, and contact topology are all explicitly addressable. This makes the imagination constrained by physics rather than by what the model happened to see in training.

Concrete example: picking up an unfamiliar mug

A robot observes a mug once from a single camera and extracts \(\Phi + G\):

QueryWhat \(\Phi + G\) providesWhat the robot computes (dreams)
Grasp angle\(\nabla \text{SDF}\) at candidate contact → surface normal \(\hat{n}\)Gripper must align to \(\hat{n}\) at the handle — no second view needed
Grip forcemass \(m = 0.3\) kg, friction \(\mu = 0.6\)\(F_{\min} = mg/(2\mu) = 2.45\) N to lift without slipping — no trial grasps needed
Support chainGraph edge: mug ⟶ plate ⟶ tableLifting the mug will not disturb the plate — no physical probe needed
What if heavier?Vary \(m\) to 1.0 kg in the state\(F_{\min}\) rises to 8.17 N — the robot dreams a different scenario by changing one number

Each of these is a closed-form consequence of the explicit state — no additional training data, no reinforcement learning, no thousands of pixel demonstrations. One observation populates the physical parameters; physics does the rest. This is the sparse-data generalization property: the structured state compresses the experience requirement from thousands of demonstrations to a handful of observations, because physics constrains the space of possible outcomes.

This principle is not hypothetical. DayDreamer (Wu, Hafner et al., CoRL 2022) showed a physical robot could learn locomotion in ~1 hour by training mostly in imagination. DreMa (ICLR 2025) achieved one-shot policy learning for novel manipulation tasks by combining object-centric representations with physics simulation — a single demonstration suffices because the world model can equivariantly imagine variations. LeCun’s JEPA framework argues that prediction should happen in abstract representation space, not pixel space; our contribution is making that representation space physically grounded. A comprehensive Science Robotics survey (2025) confirms the field-wide trend: structured state representations with physics priors consistently improve sample efficiency and generalization for manipulation.

The structured state \((\Phi, G)\) makes dreaming possible because physical parameters enter the forward model as explicit variables, not patterns buried in weights. Changing one parameter (mass, friction, contact topology) changes the predicted outcome through physics equations, not through gradient descent on millions of examples. This is the fundamental mechanism behind both sparse-data generalization and zero-shot planning from the structured state.

What a manipulation skill must answer

The breadth of queries a robot must handle in even a simple grasping task motivates the representation design. Consider picking up an unknown mug from a cluttered surface:

Physical queries at task time
Query typeExample questionHow it shapes behavior
GeometryWhat is the surface normal at this candidate grasp point?Determines gripper orientation and approach angle
MaterialWhat friction coefficient and mass should I expect?Sets minimum grip force to prevent slip; limits maximum to avoid crush
RelationsIs the mug sitting on a plate? Will the plate move if I lift the mug?Determines whether a single-object plan is safe
UncertaintyI have only seen this object from one angle. How confident am I in the far-side geometry?Triggers an active information-gathering step before committing to a grasp

An end-to-end image ➝ action network must answer all of these implicitly, encoding the answers inside its weights with no explicit query interface. The representation proposed here makes each query a first-class operation on the state St.

Scope of claims on this page

The evidence on this page falls into three categories, tagged throughout:

Evidence is tagged throughout. Scope limitations are consolidated in Section 10 rather than scattered across individual sections.

Section 3

What This Repo Already Demonstrates

The repo contains concrete, runnable artifacts at each stage of the pipeline. Together they demonstrate feasibility of the representation, provide direct benchmark evidence for the structured interface, and form the launchpad for the end-to-end training described in Section 9.

StatusArtifactSourceWhat it justifies
Repo resultIndistinguishable-pair push demo and force sweeprender_exp1_push.py plus the embedded chart data on this pageIn a controlled toy scene, identical appearance can conceal very different contact outcomes
Repo resultSequential material estimationrender_exp2_estimation.pyInteraction can tighten posterior beliefs over mass and friction rather than relying on a single visual guess
Repo resultGrasp-force selection demorender_exp3_grasp.pyLocal state queries can be routed into a downstream force-selection decision
Repo resultNeural-field prototype exportnerf/train.py, nerf/evaluate.py, and the embedded NF exportThe repo already supports toy geometry and material prediction with charted convergence
Repo resultProof-gated image-only versus structured-state benchmarkproofs/fairness_of_image_vs_state_comparison.md, proofs/superiority_certification_from_experiment.md, training/scripts/generate_planned_comparison_data.py, training/scripts/run_planned_comparison.pyOn a controlled calibration-push cross-view benchmark, the structured interface beats the matched image interface under the certification rule

Every quantitative claim later in this document traces back to one of these artifacts. Claims that cannot be so traced are explicitly labeled as future work or illustrative. Section 4 gives the formal representation definition that these artifacts instantiate.

Section 4

The Representation: \(S_t = (\Phi_t,\, G_t)\)

The world state at time \(t\) is represented as a pair: a continuous belief field \(\Phi_t\) that can be queried at any 3D location to return local geometry and material properties, and a discrete object graph \(G_t\) that carries persistent object identity and relational structure. The objective is not to replace every prior world-model interface, but to define a representation that is directly legible to downstream manipulation skills and task planners — one where the answers to geometry, material, and relational queries are explicit rather than entangled inside a black-box network.

Sensors
Image, Depth,
Touch, Audio
Encoder (M1)
qψ(St | y≤t)
CNN + Slot Attention
World State
St = (Φt, Gt)
structured object state
Vectorizer (M2)
GNN + pool
s ∈ ℝ512
dense scene vector
Cross-Encoder (M3)
scene + text +
image → action
Output
Action

The contribution is localized at the interface between perception and action. Rather than asking a single network to absorb appearance, geometry, material response, identity, and control all at once, the proposal inserts a structured intermediate state and a dedicated vectorization step. The Encoder (M1) decomposes an image into a structured state \(S_t = (\Phi_t, G_t)\) — per-object latent codes, 6-DoF poses, existence flags, and contact-weighted edges. Because this state is a heterogeneous graph, not a flat tensor, the StateVectorizer (M2) converts it into a fixed-size dense vector \(\mathbf{s} \in \mathbb{R}^{512}\) via GNN message passing and attention pooling. Only \(\mathbf{s}\) enters the downstream Cross-Encoder (M3), which fuses it with language and image embeddings to produce an action.

Why three models, not one? The conventional pipeline \(\text{image} \to \text{NN} \to \text{action}\) forces a single model to simultaneously absorb appearance, geometry, material physics, object identity, and control. The proposed pipeline decomposes this into three separately trainable stages: perception (M1, supervised), vectorization (M2, contrastive), and action (M3, contrastive + supervised). This decomposition enables physics-aware dreaming: the structured state makes geometry, mass, friction, and contact topology explicit and queryable, so the downstream policy can imagine the physical consequences of untried actions — “what if I push harder? What if this object is heavier than it looks?” — without requiring direct experience of every scenario. Because the scene vector \(\mathbf{s}\) encodes physical structure rather than pixel statistics, the system generalizes from sparse interactions: knowing the mass and friction of an object lets you predict its response to a novel force, the same way humans can pick up an unfamiliar object after a single grasp. Section 6 shows that this separation yields a 46.3% reduction in OOD prediction error under a controlled matched-probe benchmark; Section 9 describes how it extends to a full VLA.

Section 5 presents three MuJoCo experiments that validate three independent properties of this interface. Section 6 reports the controlled head-to-head benchmark.

Section 5

Toy Experiments: Motivating the Representation

The core claim of this work is that explicit physical structure enables physics-aware dreaming: the ability to imagine the consequences of untried actions and to generalize from sparse observations because physics constrains the possible outcomes. The three experiments below each test a distinct facet of this claim in a controlled MuJoCo environment. All are Repo results generated by runnable scripts in this repository.

Dreaming capability  →  Experiment
What dreaming enablesExperimentWhat is demonstrated
Predict outcomes from sparse observationsA — Dream from single observationOne extraction of \(\Phi+G\) yields correct predictions across 40× force range
Refine beliefs with targeted interactionsB — Sequential material estimation4–6 pushes pin mass and friction to <20% error
Plan without trial-and-errorC — Zero-shot grasp-force selectionSafe grip range is a closed-form function of the state — zero demonstrations needed
Experiment A — Dreaming from sparse observations

Motivation. The core payoff of physics-aware dreaming is that a single observation can populate the physical parameters in \(\Phi + G\), and physics does the rest. Once the structured state encodes mass \(m\) and friction \(\mu\) for an object, the robot can dream (predict) displacement under any applied force via \(\Delta x \approx f \cdot \Delta t^2 / (2m)\) without additional demonstrations. This experiment tests whether the structured state produces correct dreams across a 40× force range after a single scene extraction.

Setup. Two red cubes are rendered with identical appearance — same size, color, texture. Their physical parameters differ: steel (\(m=2.0\) kg, \(\mu=0.5\)) versus foam (\(m=0.05\) kg, \(\mu=0.3\)). Given \(\Phi + G\) from a single observation, the model dreams displacement trajectories under forces from 0.2 N to 5 N and these are compared against the MuJoCo ground truth.

Measurement. Displacement trajectories and final displacement under each force, as rendered by render_exp1_push.py.

Result. The dreamed trajectories match the simulator: the structured state correctly predicts a 7,000× displacement ratio between the two objects. From a single observation, the model generates correct predictions across the entire force range — the same way a human who has picked up a heavy block and a foam block once can predict how far each will slide. This is the sparse-data generalization property at work: physics constrains the prediction space so that one observation is sufficient for accurate imagination across novel conditions. The interactive panel in Section 8 shows the full displacement-vs-force sweep.

Experiment B — Learning physics from sparse interactions

Motivation. A world model that dreams forward must have accurate physical parameters — but a single image rarely pins them down. The key advantage of structured state over flat embeddings is that a few targeted interactions can rapidly constrain the physical parameters (mass, friction) because the state space has physically meaningful dimensions. This is analogous to how humans probe an unfamiliar object with a single push or lift to estimate its weight, rather than requiring thousands of demonstrations. DreMa (Barcellona et al., ICLR 2025) demonstrates this principle at scale: compositional world models with explicit physical structure enable one-shot policy learning from single demonstrations.

Setup. A Bayesian grid estimator maintains a joint posterior over mass and friction across a 36 × 36 grid of MuJoCo-simulated outcomes. Sequential push observations are fed in one at a time, tightening the posterior with each interaction. Run by render_exp2_estimation.py.

Measurement. Mass and friction estimates (mean ± 1\(\sigma\)) as a function of interaction count for four material types.

Result. After just four to six interactions, mass estimation error typically falls below 20% relative error. This rapid convergence is possible because the state has explicit physical dimensions that observations directly constrain — each push observation eliminates a region of the mass–friction space. An opaque latent vector would require orders of magnitude more data to implicitly learn the same mapping. The interactive panel in Section 8 shows the convergence trajectory for each material.

Experiment C — Dreaming enables zero-shot control

Motivation. The ultimate payoff of physics-aware dreaming is zero-shot planning: once the world model knows the physical parameters, the robot can imagine the outcome of every candidate action and select the best one without physical trial-and-error. This experiment demonstrates that closed-form control decisions fall out of the structured state directly — no learned policy or reinforcement learning is needed. The model “dreams” the grasp outcome for every candidate force and selects the safe operating range analytically.

Setup. For a given material (rubber, wood, plastic, steel, egg, glass), the field state encodes mass, friction, and crush threshold. Grip force is swept from 0.5 N to 55 N and each force is classified as “drops,” “lifts safely,” or “crushes.” Run by render_exp3_grasp.py.

Measurement. Minimum lift force, crush threshold, and safe operating band per material.

Result. The safe grip range is a closed-form function of the field state: \(F_{\min} = m g / (2\mu)\), \(F_{\max} = F_{\text{crush}}\). No learned policy is needed — the physics in the state is the policy. This is the sparse-data generalization property at its most extreme: the robot needs zero demonstrations for the control decision because the structured state contains the physical quantities that analytically determine the answer. The interactive panel in Section 8 lets you explore this outcome map for each material.

Experiment D — Prototype field machinery is exportable

Motivation. Experiments A–C rely on an analytic MuJoCo state or a grid-based estimator. A full realization of the representation requires a learned neural field that can generalize to novel objects. This experiment validates that the neural-field prototype in the repo can be trained, evaluated, and exported in a form suitable for the interactive visualizations on this page.

Setup. A neural field is trained on MuJoCo ground-truth SDF and material labels for three objects (wood block, rubber ball, metal cylinder) using nerf/train.py. Evaluation and export follow from nerf/evaluate.py.

Measurement. SDF mean absolute error (MAE) on a held-out cross-section at convergence, plus friction and density estimates versus ground truth.

Implication for the representation. The neural-field stack is already instrumented, trainable on MuJoCo ground truth, and capable of exporting SDF slices and material estimates at evaluation time. The field prototype appendix (Appendix Validation) shows the exported convergence curves and SDF cross-sections for all three objects.

These four experiments confirm three measurable properties of the interface: physics can diverge from appearance, material beliefs sharpen through interaction, and field state directly encodes the safe control range without additional learning. Section 6 follows with a controlled direct comparison against an image-only baseline.

Section 6

Proof-Gated Image-Only vs Structured-State Benchmark

Repo result Fairness + certification

The benchmark compares two feature maps under identical conditions: image ➝ predictor versus image ➝ structured state \(S\) ➝ predictor. Both branches receive the same calibration-push observations; both use a matched norm-bounded linear probe; the only difference is what the probe sees. This isolates the contribution of the feature map itself.

Benchmark contract

Observation parity: both branches receive the same calibration history O = (I_pre, I_post, K, T_c, f_c), where I_pre and I_post are the RGB frames before and after the calibration push.

Feature-map difference: the image branch uses pooled RGB directly, while the structured branch uses the deterministic state S = r(O) = (p_pre, p_post, u) extracted from that same history. This benchmark does not remove task-aligned inductive bias from the structured map; it holds fixed the downstream learner and the other controlled factors so the comparison isolates which feature map better serves that learner class.

Matched learner: both branches use the same norm-bounded linear probe class, the same squared-error loss, the same train / val / test splits, and the same optimizer.

OOD protocol: training uses only the front camera, while evaluation uses held-out top and angle1 cameras.

Formal scope: the controlled-parity contract and the repo-defined certification rule are specified in proofs/fairness_of_image_vs_state_comparison.md and proofs/superiority_certification_from_experiment.md.

Formal proof sketch

This section is an abridged report rendering of the two source-of-truth proof notes in proofs/. The full markdown proofs remain authoritative; the purpose here is to make the logical chain explicit on the page itself.

Theorem 1: Matched-learner fairness contract

For episode \(e\), define the observation history, target, and structured state by

\[ O_e = \bigl(I_{e,\mathrm{pre}}, I_{e,\mathrm{post}}, K_{c_e}, T_{c_e}, f_{c,e}\bigr), \qquad Y_e = \Delta x_e(f_q^*), \] \[ S_e = r(O_e) = \bigl(\hat p_{e,\mathrm{pre}}, \hat p_{e,\mathrm{post}}, \hat u_e\bigr), \qquad \hat u_e = \frac{\bigl(\hat p_{e,\mathrm{post}} - \hat p_{e,\mathrm{pre}}\bigr)\cdot e_x}{f_{c,e}}. \]

The two branches feed the same norm-bounded linear hypothesis class

\[ x_{\mathrm{img},e} = n_{\mathrm{img}}(g_{\mathrm{img}}(O_e)), \qquad x_{\mathrm{state},e} = n_{\mathrm{state}}(g_{\mathrm{state}}(S_e)), \] \[ \mathcal H_B = \{x \mapsto w^\top x : \lVert w \rVert_2 \le B\}, \qquad \ell(h(x), Y) = (h(x) - Y)^2. \]

Assume the benchmark enforces: same observation parity, same label parity, same split parity, same learner class, same loss and optimizer family, no privileged test-time inputs, and the same held-out-camera protocol.

Claim. Under those assumptions, any empirical or population risk gap between the branches isolates the effect of the feature map presented to the matched learner class \(\mathcal H_B\), not extra data, extra labels, extra capacity, or asymmetric evaluation. This claim is about controlled parity for \(\mathcal H_B\); it does not say the two feature maps contain identical inductive bias.

Proof. Both \(x_{\mathrm{img}}\) and \(x_{\mathrm{state}}\) are deterministic functions of the same source variable \(O\). Both are trained and evaluated against the same \(Y\), on the same episode indices, under the same class \(\mathcal H_B\), with the same loss and optimizer family. The structured branch is forbidden from using simulator-only variables at test time, so it cannot benefit from privileged information. Both branches also face the same OOD camera shift. Therefore every controlled factor is matched except the map from \(O\) to the feature vector seen by the probe. That isolates the feature-map choice for \(\mathcal H_B\). QED.

Theorem 2: Conditional toy model for camera normalization

Now assume the benchmark's illustrative toy generative model used in the certification note:

\[ \Delta I_{c_e,e} = a_{c_e} u_e v_{c_e} + \eta_e, \qquad Y_e = \beta u_e + \xi_e, \] \[ \hat u_e = u_e + \delta_e, \qquad |\delta_e| \le \varepsilon. \]

Here \(u_e\) is latent mobility, \(v_{c_e}\) is the camera-specific nuisance template, \(a_{c_e}\) is a camera scale factor, and \(\eta_e,\xi_e\) are noise terms. The extractor assumption is deliberate: this theorem is conditional on already having a camera-normalized structured statistic before the linear probe sees the input.

Claim. Under that toy model, the structured branch admits an \(\varepsilon\)-accurate linear predictor

\[ h_S(S_e) = \beta \hat u_e \quad\text{with}\quad |h_S(S_e) - Y_e| \le \beta \varepsilon + |\xi_e|. \]

Proof. Substitute \(\hat u_e = u_e + \delta_e\) into the structured predictor:

\[ h_S(S_e) = \beta \hat u_e = \beta u_e + \beta \delta_e. \]

Subtract the target and apply the triangle inequality:

\[ h_S(S_e) - Y_e = \beta u_e + \beta \delta_e - (\beta u_e + \xi_e) = \beta \delta_e - \xi_e, \] \[ |h_S(S_e) - Y_e| \le \beta |\delta_e| + |\xi_e| \le \beta \varepsilon + |\xi_e|. \]

This proves a conditional statement: if the state extractor already recovers mobility up to bounded error, then the matched linear learner can access mobility directly rather than through a viewpoint-specific nuisance factor.

Why the image branch is OOD-fragile for this probe class. If training sees only the front camera and learns a direction aligned with \(v_{\mathrm{front}}\), then on an OOD camera \(c\) with \(v_c^\top v_{\mathrm{front}} = 0\), a matched linear image probe \(w = \gamma v_{\mathrm{front}}\) yields

\[ h_I(\Delta I_{c,e}) = \gamma v_{\mathrm{front}}^\top(a_c u_e v_c + \eta_e) = \gamma v_{\mathrm{front}}^\top \eta_e, \]

so the mobility signal vanishes through the nuisance template. This exhibits one sufficient failure mode for matched linear image probes. It is not a necessity theorem against all image representations or all nonlinear learners.

Benchmark certification rule

Define the OOD risks and the paired superiority gap on the same held-out episodes:

\[ R_{\mathrm{OOD}}(\mathrm{image}) = \mathbb E\bigl[|h_I(x_{\mathrm{img}}) - Y|\bigr], \qquad R_{\mathrm{OOD}}(\mathrm{state}) = \mathbb E\bigl[|h_S(x_{\mathrm{state}}) - Y|\bigr], \] \[ \Delta = R_{\mathrm{OOD}}(\mathrm{image}) - R_{\mathrm{OOD}}(\mathrm{state}). \]

On the test set, the experiment computes paired errors

\[ d_i = |h_I(x_{\mathrm{img},i}) - Y_i| - |h_S(x_{\mathrm{state},i}) - Y_i|, \qquad \hat\Delta = \frac{1}{n}\sum_{i=1}^{n} d_i, \] \[ \rho = \frac{\widehat R_{\mathrm{OOD}}(\mathrm{image}) - \widehat R_{\mathrm{OOD}}(\mathrm{state})} {\widehat R_{\mathrm{OOD}}(\mathrm{image})}. \]

The repo marks the benchmark as certified only if three conditions all hold:

  1. \(L_{0.95}(\Delta) > 0\), where \(L_{0.95}\) is the lower bound of a one-sided paired bootstrap confidence interval.
  2. \(\rho \ge 0.20\), so the improvement is not merely positive but practically material.
  3. The sign of \(\hat\Delta\) is positive for every training seed, ruling out a lucky-seed reversal.

Interpretation. Theorem 1 says the comparison controls the non-interface factors for the matched learner class. Theorem 2 gives a conditional toy-model reason why a camera-normalized structured statistic can preserve the physically relevant signal under held-out cameras. A strictly positive one-sided paired bootstrap lower bound supports \(\Delta > 0\) under the stated resampling procedure on the paired OOD episodes. The threshold \(\rho \ge 0.20\) and the per-seed sign constraint are repo-defined robustness criteria rather than theorem-implied constants. Therefore this run passes the benchmark's certification rule for a narrow claim: on this benchmark, under this matched linear-probe contract, the structured interface outperforms the image interface.

The formal scope and proof details are in proofs/fairness_of_image_vs_state_comparison.md and proofs/superiority_certification_from_experiment.md.

Avg image val MSE
3.96e-7
Avg state val MSE
2.31e-7
OOD image MAE
8.12e-4
OOD state MAE
4.36e-4
OOD MAE reduction
46.3%
Bootstrap lower bound
3.52e-4
Positive gap every seed
Yes
SeedImage val MSEState val MSEImage OOD MAEState OOD MAEOOD gap
03.93e-72.29e-76.91e-44.39e-42.52e-4
14.08e-72.34e-71.28e-34.32e-48.49e-4
23.87e-72.30e-74.63e-44.36e-42.71e-5
Avg3.96e-72.31e-78.12e-44.36e-43.76e-4

The structured branch has lower fit loss and lower held-out-camera error in every seed. The published evidence bundle includes the aggregate metrics, per-seed metrics, OOD prediction dump, and sampled GPU-utilization trace at training/results/planned_comparison_gpu_batched/. The profiled run reached 98% peak GPU utilization and 1317 MiB peak memory.

Certified result. The structured interface passes the repo-defined certification rule across all three seeds: positive paired bootstrap lower bound, >20% relative OOD MAE reduction, and a positive image-minus-state gap in every seed. This is the quantitative foundation for the full end-to-end training described in Section 9. Section 7 documents the complete three-model pipeline that this benchmark is designed to validate.
Section 7

Training Assets and Reproducibility

The benchmark in Section 6 evaluated a narrow slice of the representation: a linear probe operating on a deterministic structured state extracted from calibration-push RGB pairs. The full realization of the field-graph representation requires three trained modules: an image encoder that produces \(\Phi + G\) from raw observations, a state vectorizer that compresses the structured state into a dense scene vector for planning, and a cross-encoder that aligns scene, text, and image modalities for downstream control.

This section documents those three modules, their training objectives, and the staged training schedule. The code is runnable and instrumented on a single GPU. The architecture is the same pipeline that will be unified into end-to-end training and extended to a full Vision–Language–Action model in Section 9.

Input
Image
[B, 3, 256, 256]
Model 1
Encoder
CNN + Slot Attention
SUPERVISED
Structured State
Φ + G
zi [N×128] + graph
Model 2
Vectorizer
GNN + pooling
CONTRASTIVE
Scene Vector
s
[B, 512]
Model 3
Cross-Encoder
s + text + image
CONTRASTIVE + SUPERVISED
Output
Action
[B, 6]

Model 1: Encoding Model (Supervised)

A CNN backbone extracts features from the RGB image. Slot Attention discovers object slots — each slot becomes one node in G and one latent zi for Φ. Per-slot heads predict pose Ti, existence ei, and pairwise edges predict contact probabilities.

The latent zi serves double duty: it IS the graph node’s state, and it parameterizes the object’s local neural field. Given zi and a 3D query point x, the field network returns SDF + material properties. So zi is the compressed representation of object i’s geometry and physics.

Training: Fully supervised from MuJoCo ground truth. We have GT poses, segmentation masks, contacts, material parameters, and analytic SDFs. Hungarian matching assigns predicted slots to GT objects. Losses: MSE on poses, BCE on existence/masks/contacts, MSE on SDF at sampled points.

▼ View `EncodingModel` excerpt from `nerf/pipeline.py`
Model 1: Encoder
nerf/pipeline.py

This is the image-to-Φ+G stage: CNN features, iterative slot updates, then per-slot heads for object latents, poses, masks, and pairwise contact structure.

class EncodingModel(nn.Module):
    def __init__(self, latent_dim=128, max_objects=8):
        super().__init__()
        sd = 128
        self.max_objects = max_objects
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 7, 2, 3),
            nn.ReLU(),
            nn.Conv2d(32, 64, 5, 2, 2),
            nn.ReLU(),
            nn.Conv2d(64, 128, 3, 2, 1),
            nn.ReLU(),
            nn.Conv2d(128, 256, 3, 2, 1),
            nn.ReLU(),
        )
        self.slots_init = nn.Parameter(torch.randn(1, max_objects, sd) * 0.02)
        self.sq = nn.Linear(sd, 64)
        self.sk = nn.Linear(256, 64)
        self.sv = nn.Linear(256, sd)
        self.sgru = nn.GRUCell(sd, sd)
        self.smlp = nn.Sequential(nn.Linear(sd, sd), nn.ReLU(), nn.Linear(sd, sd))
        self.to_z = nn.Linear(sd, latent_dim)
        self.to_pose = nn.Linear(sd, 7)
        self.to_exist = nn.Linear(sd, 1)
        self.edge_net = nn.Sequential(nn.Linear(sd * 2, 64), nn.ReLU(), nn.Linear(64, 33))

    def forward(self, image):
        B, N = image.shape[0], self.max_objects
        f = self.backbone(image).flatten(2).permute(0, 2, 1)
        slots = self.slots_init.expand(B, -1, -1)

        for _ in range(3):
            q, k, v = self.sq(slots), self.sk(f), self.sv(f)
            a = F.softmax(torch.einsum("bnd,bmd->bnm", q, k) / 8, dim=1)
            u = torch.einsum("bnm,bmd->bnd", a, v)
            slots = self.sgru(u.reshape(B * N, -1), slots.reshape(B * N, -1)).reshape(B, N, -1)
            slots = slots + self.smlp(slots)

        z = self.to_z(slots)
        poses = self.to_pose(slots)
        exist = torch.sigmoid(self.to_exist(slots).squeeze(-1))
        si = slots.unsqueeze(2).expand(-1, -1, N, -1)
        sj = slots.unsqueeze(1).expand(-1, N, -1, -1)
        eo = self.edge_net(torch.cat([si, sj], -1))

        return {
            "phi": {"z": z, "poses": poses, "existence": exist},
            "graph": {
                "node_features": slots,
                "edge_features": eo[..., :32],
                "contact_probs": torch.sigmoid(eo[..., 32]),
            },
        }

Model 2: State Vectorizer (Self-Supervised Contrastive)

This stage maps the structured Φ+G state into a single dense vector s ∈ R512 intended to preserve physically relevant relationships while discarding viewpoint-specific detail.

Architecture: project each (zi, posei, existencei) into node embeddings, run 3 rounds of GNN message passing weighted by contact probabilities, then attention-pool into a fixed-dimensional vector.

Training: Contrastive (InfoNCE). In MuJoCo, render the same scene from two different camera angles. Both views produce the same Φ+G (same physics), so their state vectors sA and sB should be close. Different scenes in the batch are negatives. Hard negatives: same geometry but swapped materials (looks identical, physics differs) — this forces s to encode material properties.

▼ View `StateVectorizer` excerpt from `nerf/pipeline.py`
Model 2: Vectorizer
nerf/pipeline.py

This block is the structured-state compressor: graph-aware message passing over objects followed by attention pooling into the scene vector s.

class GNNLayer(nn.Module):
    def __init__(self, hd, ed):
        super().__init__()
        self.msg = nn.Sequential(nn.Linear(hd * 2 + ed, hd), nn.ReLU(), nn.Linear(hd, hd))
        self.upd = nn.Sequential(nn.Linear(hd * 2, hd), nn.ReLU(), nn.Linear(hd, hd))
        self.norm = nn.LayerNorm(hd)

    def forward(self, h, ef, cp, mask):
        hi = h.unsqueeze(2).expand(-1, -1, h.shape[1], -1)
        hj = h.unsqueeze(1).expand(-1, h.shape[1], -1, -1)
        m = self.msg(torch.cat([hi, hj, ef], -1)) * cp.unsqueeze(-1) * mask.unsqueeze(1)
        return self.norm(h + self.upd(torch.cat([h, m.sum(2)], -1))) * mask


class StateVectorizer(nn.Module):
    def __init__(self, latent_dim=128, output_dim=512):
        super().__init__()
        hd = 256
        self.node_emb = nn.Sequential(nn.Linear(latent_dim + 8, hd), nn.ReLU(), nn.Linear(hd, hd))
        self.gnns = nn.ModuleList([GNNLayer(hd, 32) for _ in range(3)])
        self.aq = nn.Linear(hd, 64)
        self.ak = nn.Linear(hd, 64)
        self.proj = nn.Sequential(nn.Linear(hd, hd), nn.ReLU(), nn.Linear(hd, output_dim))
        self.norm = nn.LayerNorm(output_dim)

    def forward(self, phi, graph):
        h = self.node_emb(
            torch.cat([phi["z"], phi["poses"], phi["existence"].unsqueeze(-1)], -1)
        )
        mask = phi["existence"].unsqueeze(-1)
        h = h * mask

        for g in self.gnns:
            h = g(h, graph["edge_features"], graph["contact_probs"], mask)

        q = self.aq(h.mean(1, keepdim=True))
        k = self.ak(h)
        a = F.softmax(torch.einsum("bid,bjd->bij", q, k) / 8 + (1 - mask.transpose(1, 2)) * -1e9, -1)
        pooled = torch.einsum("bij,bjd->bid", a, h).squeeze(1)
        return self.norm(self.proj(pooled))
Design goal for the scene vector. Scenes with the same underlying physics should be nearby in \(s\)-space even when raw observations differ in viewpoint or lighting. Section 6 confirms this on a calibration-push benchmark: the structured interface yields 46.3% lower held-out-camera error than the matched image interface. The full vectorizer extends this to the multi-object, multi-task setting, and is jointly trained with the encoder in the end-to-end stage described in Section 9.

Model 3: Cross-Encoder (Contrastive + Supervised)

Aligns three modalities in a shared embedding space, then predicts actions:

Three-way alignment (extending CLIP to physics)
ModalityEncoderWhat it captures
Scene sMLP projectionPhysics: mass, friction, contacts, spatial arrangement
TextTransformerSemantics: “pick up the heavy metal block gently”
ImageCNNAppearance: current visual observation

Training Phase A (Contrastive): Align (s, text) pairs. Text auto-generated from MuJoCo GT: “3 objects: wood block (0.5kg, μ=0.4), rubber ball (1.1kg, μ=0.8), metal cylinder (3.0kg, μ=0.2)”. Loss: InfoNCE, same as CLIP.

Training Phase B (Supervised): Behavioral cloning from scripted MuJoCo policies (reach, grasp, push). The fused embedding (from cross-attention over s + text + image) feeds an action MLP. Loss: MSE on predicted vs. GT action. Alignment layers frozen; only action head fine-tuned.

▼ View `CrossEncoder` excerpt from `nerf/pipeline.py`
Model 3: Cross-Encoder
nerf/pipeline.py

This is the alignment and action stage: separate encoders for scene, text, and image, followed by cross-attention and an action head.

class CrossEncoder(nn.Module):
    def __init__(self, state_dim=512, embed_dim=512, action_dim=6):
        super().__init__()
        self.scene_proj = nn.Sequential(nn.Linear(state_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))
        self.text_emb = nn.Embedding(10000, 256)
        self.text_tf = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(256, 4, 512, batch_first=True), 2
        )
        self.img_enc = nn.Sequential(
            nn.Conv2d(3, 32, 7, 4, 3),
            nn.ReLU(),
            nn.Conv2d(32, 64, 3, 2, 1),
            nn.ReLU(),
            nn.Conv2d(64, 128, 3, 2, 1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.cross_attn = nn.MultiheadAttention(embed_dim, 8, batch_first=True)
        self.action_head = nn.Sequential(
            nn.Linear(embed_dim, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, action_dim), nn.Tanh(),
        )

    def forward(self, s, text_tokens=None, text_mask=None, image=None):
        se = self.scene_proj(s)
        te = self.encode_text(text_tokens, text_mask) if text_tokens is not None else None
        ie = self.encode_image(image) if image is not None else None

        toks = [se.unsqueeze(1)]
        if te is not None:
            toks.append(te.unsqueeze(1))
        if ie is not None:
            toks.append(ie.unsqueeze(1))

        seq = torch.cat(toks, 1)
        fused, _ = self.cross_attn(seq, seq, seq)
        return {
            "action": self.action_head(fused[:, 0]),
            "scene_emb": se,
            "text_emb": te,
            "image_emb": ie,
        }

Why image is not redundant

s was derived from the image, so including the image in Model 3 seems circular. Three reasons it’s not:

Training signal: Image↔s alignment loss teaches Model 1 to produce good Φ+G. If s can’t be matched back to its source image, the encoder is losing information.

Multi-timestep fusion: s may have been built up over multiple observations. The current image provides fresh visual grounding that accumulated s might lack.

When s comes from language: “Imagine a heavy metal block on a glass plate.” This generates s from text alone. The image provides the actual visual context.

Training Schedule (all MuJoCo, single GPU)

PhaseModelMethodStepsTime
1EncoderSupervised (MuJoCo GT)~50K~4h
2VectorizerContrastive (scene pairs)~100K~8h
3aCross-Encoder alignmentContrastive (s, text)~50K~4h
3bCross-Encoder actionBehavioral cloning~100K~8h
Total~300K~24h

All data generated on-the-fly from MuJoCo. No pre-collected dataset needed.

▼ View `FullPipeline` wiring from `nerf/pipeline.py`
Pipeline Wiring
nerf/pipeline.py

The top-level module is intentionally short. It only composes the three trained stages in sequence: encoder, vectorizer, then cross-encoder.

class FullPipeline(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = EncodingModel(128, 8)
        self.vectorizer = StateVectorizer(128, 512)
        self.cross_encoder = CrossEncoder(512, 512, 6)

    def forward(self, image, text_tokens=None, text_mask=None):
        pg = self.encoder(image)
        s = self.vectorizer(pg["phi"], pg["graph"])
        out = self.cross_encoder(s, text_tokens, text_mask, image)
        out["phi"] = pg["phi"]
        out["graph"] = pg["graph"]
        out["state_vector"] = s
        return out
▼ View complete pipeline code (PyTorch)
End-to-end training pipeline
pipeline.py excerpt

Image to Φ+G to state vector to action. The three stages below cover a supervised encoder, a contrastive state vectorizer, and a cross-encoder trained with contrastive alignment plus behavioral cloning.

import torch
import torch.nn as nn
import torch.nn.functional as F

# ============ MODEL 1: Encoding Model (SUPERVISED) ============

class EncodingModel(nn.Module):
    def __init__(self, latent_dim=128, max_objects=8):
        super().__init__()
        self.latent_dim, self.max_objects = latent_dim, max_objects
        sd = 128
        self.backbone = nn.Sequential(
            nn.Conv2d(3,32,7,2,3),nn.ReLU(),nn.Conv2d(32,64,5,2,2),nn.ReLU(),
            nn.Conv2d(64,128,3,2,1),nn.ReLU(),nn.Conv2d(128,256,3,2,1),nn.ReLU())
        self.slots_init = nn.Parameter(torch.randn(1,max_objects,sd)*0.02)
        self.sq=nn.Linear(sd,64);self.sk=nn.Linear(256,64);self.sv=nn.Linear(256,sd)
        self.sgru=nn.GRUCell(sd,sd)
        self.smlp=nn.Sequential(nn.Linear(sd,sd),nn.ReLU(),nn.Linear(sd,sd))
        self.to_z=nn.Linear(sd,latent_dim);self.to_pose=nn.Linear(sd,7)
        self.to_exist=nn.Linear(sd,1)
        self.edge_net=nn.Sequential(nn.Linear(sd*2,64),nn.ReLU(),nn.Linear(64,33))
        self.mask_dec=nn.Sequential(nn.Linear(sd,64),nn.ReLU(),nn.Linear(64,16*16))

    def forward(self, image):
        B,N=image.shape[0],self.max_objects
        f=self.backbone(image).flatten(2).permute(0,2,1)
        slots=self.slots_init.expand(B,-1,-1)
        for _ in range(3):
            q,k,v=self.sq(slots),self.sk(f),self.sv(f)
            a=F.softmax(torch.einsum('bnd,bmd->bnm',q,k)/8,dim=1)
            a=a/(a.sum(-1,keepdim=True)+1e-8)
            u=torch.einsum('bnm,bmd->bnd',a,v)
            slots=self.sgru(u.reshape(B*N,-1),slots.reshape(B*N,-1)).reshape(B,N,-1)
            slots=slots+self.smlp(slots)
        z=self.to_z(slots);poses=self.to_pose(slots)
        exist=torch.sigmoid(self.to_exist(slots).squeeze(-1))
        si=slots.unsqueeze(2).expand(-1,-1,N,-1);sj=slots.unsqueeze(1).expand(-1,N,-1,-1)
        eo=self.edge_net(torch.cat([si,sj],-1))
        masks=torch.sigmoid(self.mask_dec(slots).reshape(B,N,16,16))
        return {'phi':{'z':z,'poses':poses,'existence':exist},
                'graph':{'node_features':slots,'edge_features':eo[...,:32],
                         'contact_probs':torch.sigmoid(eo[...,32])},'masks':masks}

# Training: SUPERVISED from MuJoCo GT
# Losses: MSE(pose), BCE(existence), BCE(masks), BCE(contacts)
# Uses Hungarian matching to assign predicted slots to GT objects

# ============ MODEL 2: State Vectorizer (CONTRASTIVE) ============

class GNNLayer(nn.Module):
    def __init__(self, hd, ed):
        super().__init__()
        self.msg=nn.Sequential(nn.Linear(hd*2+ed,hd),nn.ReLU(),nn.Linear(hd,hd))
        self.upd=nn.Sequential(nn.Linear(hd*2,hd),nn.ReLU(),nn.Linear(hd,hd))
        self.norm=nn.LayerNorm(hd)
    def forward(self,h,ef,cp,mask):
        B,N,D=h.shape
        hi=h.unsqueeze(2).expand(-1,-1,N,-1);hj=h.unsqueeze(1).expand(-1,N,-1,-1)
        m=self.msg(torch.cat([hi,hj,ef],-1))*cp.unsqueeze(-1)*mask.unsqueeze(1)
        return self.norm(h+self.upd(torch.cat([h,m.sum(2)],-1)))*mask

class StateVectorizer(nn.Module):
    def __init__(self, latent_dim=128, output_dim=512):
        super().__init__()
        hd=256
        self.node_emb=nn.Sequential(nn.Linear(latent_dim+8,hd),nn.ReLU(),nn.Linear(hd,hd))
        self.gnns=nn.ModuleList([GNNLayer(hd,32) for _ in range(3)])
        self.aq=nn.Linear(hd,64);self.ak=nn.Linear(hd,64)
        self.proj=nn.Sequential(nn.Linear(hd,hd),nn.ReLU(),nn.Linear(hd,output_dim))
        self.norm=nn.LayerNorm(output_dim)
    def forward(self,phi,graph):
        B,N=phi['z'].shape[:2]
        h=self.node_emb(torch.cat([phi['z'],phi['poses'],phi['existence'].unsqueeze(-1)],-1))
        mask=phi['existence'].unsqueeze(-1);h=h*mask
        for g in self.gnns: h=g(h,graph['edge_features'],graph['contact_probs'],mask)
        q=self.aq(h.mean(1,keepdim=True));k=self.ak(h)
        a=F.softmax(torch.einsum('bid,bjd->bij',q,k)/8+(1-mask.transpose(1,2))*-1e9,-1)
        return self.norm(self.proj(torch.einsum('bij,bjd->bid',a,h).squeeze(1)))

# Training: SELF-SUPERVISED CONTRASTIVE (InfoNCE)
# Positive pairs: same MuJoCo scene from 2 different camera angles → same s
# Negative pairs: different scenes in the batch → different s
# Hard negatives: same geometry, swapped materials → different s

# ============ MODEL 3: Cross-Encoder (CONTRASTIVE + SUPERVISED) ============

class CrossEncoder(nn.Module):
    def __init__(self, state_dim=512, embed_dim=512, action_dim=6):
        super().__init__()
        self.scene_proj=nn.Sequential(nn.Linear(state_dim,embed_dim),nn.ReLU(),nn.Linear(embed_dim,embed_dim))
        self.scene_norm=nn.LayerNorm(embed_dim)
        self.text_emb=nn.Embedding(10000,256)
        self.text_pos=nn.Parameter(torch.randn(1,64,256)*0.02)
        self.text_tf=nn.TransformerEncoder(nn.TransformerEncoderLayer(256,4,512,batch_first=True),2)
        self.text_proj=nn.Linear(256,embed_dim);self.text_norm=nn.LayerNorm(embed_dim)
        self.img_enc=nn.Sequential(nn.Conv2d(3,32,7,4,3),nn.ReLU(),nn.Conv2d(32,64,3,2,1),nn.ReLU(),
                                    nn.Conv2d(64,128,3,2,1),nn.ReLU(),nn.AdaptiveAvgPool2d(1),nn.Flatten())
        self.img_proj=nn.Sequential(nn.Linear(128,embed_dim),nn.ReLU(),nn.Linear(embed_dim,embed_dim))
        self.img_norm=nn.LayerNorm(embed_dim)
        self.cross_attn=nn.MultiheadAttention(embed_dim,8,batch_first=True)
        self.fuse_norm=nn.LayerNorm(embed_dim)
        self.fuse_mlp=nn.Sequential(nn.Linear(embed_dim,embed_dim),nn.ReLU(),nn.Linear(embed_dim,embed_dim))
        self.action_head=nn.Sequential(nn.Linear(embed_dim,256),nn.ReLU(),nn.Linear(256,64),nn.ReLU(),nn.Linear(64,action_dim),nn.Tanh())
        self.logit_scale=nn.Parameter(torch.tensor(1/0.07).log())

    def encode_scene(self,s): return self.scene_norm(self.scene_proj(s))
    def encode_text(self,tok,mask=None):
        x=self.text_emb(tok)+self.text_pos[:,:tok.shape[1]]
        return self.text_norm(self.text_proj(self.text_tf(x,src_key_padding_mask=mask)[:,0]))
    def encode_image(self,img): return self.img_norm(self.img_proj(self.img_enc(img)))

    def forward(self,s,text_tokens=None,text_mask=None,image=None):
        se=self.encode_scene(s)
        te=self.encode_text(text_tokens,text_mask) if text_tokens is not None else None
        ie=self.encode_image(image) if image is not None else None
        toks=[se.unsqueeze(1)]
        if te is not None: toks.append(te.unsqueeze(1))
        if ie is not None: toks.append(ie.unsqueeze(1))
        seq=torch.cat(toks,1)
        f,_=self.cross_attn(seq,seq,seq)
        fused=self.fuse_norm(f[:,0]+self.fuse_mlp(f[:,0]))
        return {'action':self.action_head(fused),'scene_emb':se,'text_emb':te,'image_emb':ie,'fused':fused}

    def contrastive_loss(self,se,te):
        s,t=F.normalize(se,-1),F.normalize(te,-1)
        l=self.logit_scale.exp()*s@t.T;lb=torch.arange(len(l),device=l.device)
        return(F.cross_entropy(l,lb)+F.cross_entropy(l.T,lb))/2

# Training Phase A: CONTRASTIVE alignment
#   (s, text) pairs from MuJoCo. Text auto-generated from GT properties.
#   Loss: InfoNCE on (scene_embedding, text_embedding)
# Training Phase B: SUPERVISED action prediction
#   Behavioral cloning from scripted MuJoCo policies (reach, grasp, push).
#   Loss: MSE on predicted vs GT action. Freeze alignment, fine-tune action head.

# ============ FULL PIPELINE ============

class FullPipeline(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder=EncodingModel(128,8)
        self.vectorizer=StateVectorizer(128,512)
        self.cross_encoder=CrossEncoder(512,512,6)
    def forward(self,image,text_tokens=None,text_mask=None):
        pg=self.encoder(image)
        s=self.vectorizer(pg['phi'],pg['graph'])
        out=self.cross_encoder(s,text_tokens,text_mask,image)
        out['phi']=pg['phi'];out['graph']=pg['graph'];out['state_vector']=s
        return out

TRAINING_SCHEDULE = """
Training Schedule (all MuJoCo, single GPU, ~24h total)

Phase 1 — Model 1 Encoder (SUPERVISED) ~50K steps, ~4h
  Data: Random MuJoCo scenes, render RGB + extract GT
  Loss: MSE(pose) + BCE(existence, masks, contacts) + MSE(SDF at sampled points)
  Metric: Pose error <1cm, contact F1 >0.9

Phase 2 — Model 2 Vectorizer (CONTRASTIVE) ~100K steps, ~8h  
  Data: Scene pairs (same scene, 2 cameras) from MuJoCo
  Loss: InfoNCE on (s_view_A, s_view_B)
  Metric: Scene retrieval accuracy, physics probe R²>0.8

Phase 3a — Model 3 Alignment (CONTRASTIVE) ~50K steps, ~4h
  Data: (s, text) pairs. Text auto-generated from GT.
  Loss: InfoNCE on (scene_emb, text_emb)
  Metric: Text→scene retrieval accuracy

Phase 3b — Model 3 Action (SUPERVISED) ~100K steps, ~8h
  Data: Scripted MuJoCo demonstrations (reach/grasp/push)
  Loss: MSE(predicted_action, gt_action)
  Metric: Task success rate on held-out scenes
"""

if __name__=="__main__":
    m=FullPipeline()
    total=sum(p.numel() for p in m.parameters())
    print(f"Total: {total:,} params")
    for n,s in[("Encoder",m.encoder),("Vectorizer",m.vectorizer),("CrossEncoder",m.cross_encoder)]:
        print(f"  {n}: {sum(p.numel() for p in s.parameters()):,}")
    B=2;img=torch.randn(B,3,256,256);txt=torch.randint(0,1000,(B,12))
    with torch.no_grad(): out=m(img,text_tokens=txt)
    print(f"\nz: {list(out['phi']['z'].shape)}")
    print(f"s: {list(out['state_vector'].shape)}")
    print(f"action: {list(out['action'].shape)} = {[round(x.item(),3) for x in out['action'][0]]}")
    print(TRAINING_SCHEDULE)
Technical Appendix A

The Belief Field \(\Phi\): Query Interface Specification

This appendix gives the complete formal specification of the field half of the representation. The field \(\Phi\) is a neural function parameterized by an object latent \(z_i\) that maps a 3D query point (optionally with direction and time) to distributional predictions over the physical quantities relevant to robotic manipulation. It is the primary mechanism through which geometry, material, and dynamics information becomes explicitly queryable at task time.

The notation below is an interface summary. Each output category is realized by a separate prediction head; the field should be read as a bundle of property-specific heads exposed through one query interface, not as a single fully specified joint probabilistic model over every quantity simultaneously. The implications of that distinction for downstream use are noted at each relevant step.

\[\Phi\!\left(x,\,\hat{d},\,t \;\middle|\; z_i\right) \;\longrightarrow\; \left\{\,\text{geometry},\ \text{material},\ \text{dynamics},\ \text{appearance},\ \text{semantics}\,\right\}\]

Query Interface

InputTypeWhen needed
xR³ — 3D positionAlways
S² — unit directionAnisotropic materials, appearance
tR — timeDynamic scenes
ziRD — object latentAlways (conditions the field per-object)

Why direction \(\hat{d}\) matters

Some physical properties are isotropic: their value at a point does not depend on direction. Signed distance, density, and temperature fall into this category. Others are inherently anisotropic, and querying them without a direction is not well-defined:

Direction-dependent (anisotropic) properties

Friction \(\mu(x, \hat{d})\): Brushed metal has different friction along vs. across the grain; fabric has weave-aligned resistance. A robot sliding in direction \(\hat{d}\) requires friction in that specific direction, not an average over directions.

Stiffness \(E(x, \hat{d})\): Wood is approximately 20× stiffer along the grain than across it. The effective stiffness under compression in direction \(\hat{d}\) is \(\hat{d}^\top \mathbf{C}(x)\hat{d}\), where \(\mathbf{C}\) is the fourth-order elasticity tensor.

Thermal conductivity \(\kappa(x, \hat{d})\): Carbon-fiber composites conduct heat roughly 10× more efficiently along the fiber axis than transversely.

Appearance \(c(x, \hat{d})\): This is precisely the NeRF formulation — color depends on viewing direction (specular highlights, iridescence). The field extends the same factorization principle from appearance to physical material properties.

Output Taxonomy

OutputDistributionDepends on d̂?
SDF d(x)Gaussian(μ, σ)No
Normal n̂(x)Derived: ∇d / |∇d|No (computed via autograd)
Density ρ(x)LogNormalNo
Temperature T(x)GaussianNo
Restitution e(x)Beta(α,β) ∈ [0,1]No
Friction μ(x, d̂)LogNormalYes
Stiffness E(x, d̂)LogNormalYes
Velocity v(x)3× GaussianNo (vector output)
Stress σij(x)6× Gaussian (Voigt)No (tensor output)
Color c(x, d̂)RGBYes (as in NeRF)
SemanticsCategoricalNo

These outputs live in heterogeneous spaces. In this appendix, the field should be read as a bundle of property-specific predictive heads exposed through one query interface, not yet as one fully specified joint probabilistic law over every quantity at once.

Architecture

The network follows a shared-trunk, branched-head design. A shared trunk processes the concatenation of the Fourier-encoded position \(x\), time encoding \(t\), and object latent \(z_i\) into a feature vector \(h \in \mathbb{R}^H\). Geometry and isotropic material branches take only \(h\) as input. Directional branches (friction, stiffness, thermal conductivity, appearance) concatenate \(h\) with a separate encoding of \(\hat{d}\), following the NeRF design principle that geometry should be view-direction invariant.

Surface normals are derived by automatic differentiation of the SDF, not predicted by a separate head. In regions where the learned SDF is smooth and \(\nabla d(x) \neq 0\), the normal \(\hat{n} = \nabla d / \|\nabla d\|\) is perpendicular to the level set by construction. Near poorly learned boundaries, nonsmooth regions, or degenerate SDF gradients, the derived normal may be numerically unstable — a known limitation noted in the main text.

▼ View complete PyTorch implementation (field_complete.py)
Belief field network
field_complete.py

This implementation models Φ(x, d̂, t | zi) with a shared trunk and separate geometry, material, dynamics, appearance, and semantic heads. Direction only enters the branches that actually need anisotropic or view-dependent context.

Complete Field Specification: Φ(x, d̂, t | z_i)
================================================

The belief field is a neural function that maps a query (position, direction, time)
to DISTRIBUTIONS over every physical quantity relevant for manipulation.

Design Principles:
  1. Position x ∈ R³ is always required
  2. Direction d̂ ∈ S² is required for anisotropic/directional properties
  3. Time t is required for dynamic properties
  4. Object latent z_i conditions the field per-object
  5. Every output is DISTRIBUTIONAL (mean + uncertainty)
  6. Outputs span scalars, vectors, and tensors — matching physics

This is the NeRF paradigm extended from appearance to full multiphysics:
  NeRF:  Φ(x, d̂) → (color, density)         [appearance only]
  Ours:  Φ(x, d̂, t | z) → (geometry, material, dynamics, semantics)

Architecture:
  The network has a SHARED TRUNK that processes (x, t, z_i) into a feature vector,
  then BRANCHES for different output groups. The direction d̂ enters ONLY for
  branches that need it (anisotropic material, appearance) — following the NeRF
  insight that geometry shouldn't depend on view direction.

  ┌─────────────┐
  │ x ∈ R³      │──→ Fourier Encoding ──→┐
  │ t ∈ R       │──→ Time Encoding   ──→ │
  │ z_i ∈ R^D   │──────────────────────→ ├──→ Shared Trunk (MLP) ──→ h ∈ R^H
  └─────────────┘                        │
                                         │
  ┌──────────── h ──────────────────────┐│
  │                                     ││
  │  ┌─── Geometry Branch ←── h        ││
  │  │    (position-only)               ││
  │  │    → SDF:     μ_d, σ_d          ││
  │  │    → Normal:  n̂ = ∇μ_d / |∇μ_d| ││  (computed via autograd, not predicted)
  │  │    → Curvature: κ = ∇²d         ││  (second derivative of SDF)
  │  │                                  ││
  │  ┌─── Isotropic Material Branch ← h││
  │  │    (position-only)               ││
  │  │    → Density:      μ_ρ, σ_ρ     ││  LogNormal
  │  │    → Temperature:  μ_T, σ_T     ││  Normal (or LogNormal for Kelvin)
  │  │    → Restitution:  α_e, β_e     ││  Beta[0,1]
  │  │    → Poisson ratio: α_ν, β_ν   ││  Beta[0,0.5]
  │  │                                  ││
  │  ┌─── Directional Material Branch ←─┤│← d̂ ∈ S² (direction enters here)
  │  │    (position + direction)        ││
  │  │    → Friction:    μ_f(d̂), σ_f   ││  Friction along sliding direction d̂
  │  │    → Stiffness:   E(d̂), σ_E     ││  Young's modulus along d̂
  │  │    → Thermal cond: κ(d̂), σ_κ    ││  Conductivity along d̂
  │  │                                  ││
  │  ┌─── Dynamics Branch ← h          ││
  │  │    (position-only, vector output)││
  │  │    → Velocity:  v ∈ R³           ││
  │  │    → Stress:    σ_ij (6 indep)   ││  Symmetric 3x3 → Voigt notation
  │  │    → Strain:    ε_ij (6 indep)   ││  
  │  │                                  ││
  │  ┌─── Appearance Branch ←───────────┤│← d̂ (view direction, as in NeRF)
  │  │    → Color:     c ∈ R³           ││
  │  │    → Opacity:   α ∈ [0,1]        ││
  │  │                                  ││
  │  ┌─── Semantic Branch ← h          ││
  │  │    → Class logits: R^C           ││
  │  │    → Affordance: R^A             ││
  │  └─────────────────────────────────┘│
  └─────────────────────────────────────┘
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
from typing import Dict, Optional, Tuple


# ============================================================
# Encodings
# ============================================================

class FourierEncoding(nn.Module):
    """Sinusoidal positional encoding. Maps R^n → R^(n + 2nL)."""
    def __init__(self, input_dim: int = 3, n_freqs: int = 6, include_input: bool = True):
        super().__init__()
        self.input_dim = input_dim
        self.n_freqs = n_freqs
        self.include_input = include_input
        freqs = 2.0 ** torch.linspace(0, n_freqs - 1, n_freqs)
        self.register_buffer('freqs', freqs)

    @property
    def out_dim(self):
        return self.input_dim * (1 + 2 * self.n_freqs) if self.include_input else self.input_dim * 2 * self.n_freqs

    def forward(self, x):
        # x: [..., input_dim]
        x_proj = x.unsqueeze(-1) * self.freqs  # [..., input_dim, L]
        enc = torch.cat([x_proj.sin(), x_proj.cos()], dim=-1)  # [..., input_dim, 2L]
        enc = enc.reshape(*x.shape[:-1], -1)  # [..., input_dim * 2L]
        return torch.cat([x, enc], dim=-1) if self.include_input else enc


class DirectionEncoding(nn.Module):
    """Encode unit direction vector d̂ ∈ S².
    
    Can be input as:
      - Unit vector (dx, dy, dz) with |d|=1
      - Spherical angles (θ, φ)
    
    Uses spherical harmonics basis (lower frequency than position encoding
    since directional variation is typically smoother).
    """
    def __init__(self, n_freqs: int = 4):
        super().__init__()
        self.pos_enc = FourierEncoding(input_dim=3, n_freqs=n_freqs)

    @property
    def out_dim(self):
        return self.pos_enc.out_dim

    def forward(self, d: torch.Tensor):
        """d: [..., 3] unit vectors OR [..., 2] (theta, phi)"""
        if d.shape[-1] == 2:
            # Convert spherical to cartesian
            theta, phi = d[..., 0], d[..., 1]
            dx = torch.sin(theta) * torch.cos(phi)
            dy = torch.sin(theta) * torch.sin(phi)
            dz = torch.cos(theta)
            d = torch.stack([dx, dy, dz], dim=-1)
        # Normalize
        d = F.normalize(d, dim=-1)
        return self.pos_enc(d)


# ============================================================
# Distribution Heads (proper probabilistic outputs)
# ============================================================

class GaussianHead(nn.Module):
    """Predicts Gaussian(μ, σ) for unbounded scalars."""
    def __init__(self, in_dim: int, out_dim: int = 1):
        super().__init__()
        self.mu = nn.Linear(in_dim, out_dim)
        self.log_sigma = nn.Linear(in_dim, out_dim)

    def forward(self, h):
        return self.mu(h), F.softplus(self.log_sigma(h)) + 1e-4


class LogNormalHead(nn.Module):
    """Predicts LogNormal for positive scalars (density, stiffness, etc.)."""
    def __init__(self, in_dim: int, out_dim: int = 1, scale: float = 1.0):
        super().__init__()
        self.log_mu = nn.Linear(in_dim, out_dim)
        self.log_sigma = nn.Linear(in_dim, out_dim)
        self.scale = scale

    def forward(self, h):
        log_mu = self.log_mu(h)
        log_sigma = F.softplus(self.log_sigma(h)) + 1e-4
        # Mean of LogNormal = exp(log_mu + log_sigma²/2)
        mu = torch.exp(log_mu + 0.5 * log_sigma ** 2) * self.scale
        # Approximate sigma in real space
        sigma = mu * torch.sqrt(torch.exp(log_sigma ** 2) - 1)
        return mu, sigma


class BetaHead(nn.Module):
    """Predicts Beta(α, β) for [0, upper_bound] quantities."""
    def __init__(self, in_dim: int, out_dim: int = 1, upper_bound: float = 1.0):
        super().__init__()
        self.alpha = nn.Linear(in_dim, out_dim)
        self.beta = nn.Linear(in_dim, out_dim)
        self.upper_bound = upper_bound

    def forward(self, h):
        alpha = F.softplus(self.alpha(h)) + 1.01  # > 1 for unimodal
        beta = F.softplus(self.beta(h)) + 1.01
        mu = alpha / (alpha + beta) * self.upper_bound
        var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))
        sigma = torch.sqrt(var) * self.upper_bound
        return mu, sigma


class VectorHead(nn.Module):
    """Predicts a 3D vector with per-component uncertainty."""
    def __init__(self, in_dim: int):
        super().__init__()
        self.mu = nn.Linear(in_dim, 3)
        self.log_sigma = nn.Linear(in_dim, 3)

    def forward(self, h):
        return self.mu(h), F.softplus(self.log_sigma(h)) + 1e-4


class SymmetricTensorHead(nn.Module):
    """Predicts a 3×3 symmetric tensor via 6 Voigt components.
    
    Voigt ordering: [σ_xx, σ_yy, σ_zz, σ_yz, σ_xz, σ_xy]
    Ensures symmetry by construction.
    """
    def __init__(self, in_dim: int):
        super().__init__()
        self.voigt = nn.Linear(in_dim, 6)  # 6 independent components
        self.log_sigma = nn.Linear(in_dim, 6)

    def forward(self, h):
        v = self.voigt(h)
        s = F.softplus(self.log_sigma(h)) + 1e-4
        return v, s

    @staticmethod
    def voigt_to_matrix(voigt):
        """Convert [B, 6] Voigt to [B, 3, 3] symmetric matrix."""
        B = voigt.shape[0]
        mat = torch.zeros(B, 3, 3, device=voigt.device)
        mat[:, 0, 0] = voigt[:, 0]  # xx
        mat[:, 1, 1] = voigt[:, 1]  # yy
        mat[:, 2, 2] = voigt[:, 2]  # zz
        mat[:, 1, 2] = mat[:, 2, 1] = voigt[:, 3]  # yz
        mat[:, 0, 2] = mat[:, 2, 0] = voigt[:, 4]  # xz
        mat[:, 0, 1] = mat[:, 1, 0] = voigt[:, 5]  # xy
        return mat


# ============================================================
# The Complete Object Field
# ============================================================

class ObjectFieldComplete(nn.Module):
    """Complete per-object neural field.
    
    Φ(x, d̂, t | z_i) → full multiphysics state
    
    Architecture follows the NeRF insight:
      - Shared trunk processes (x, t, z_i) → features h
      - Geometry branch uses h only (no direction dependence for SDF)
      - Directional branches take h + encoded d̂
      - This ensures SDF is view/direction invariant
    
    Args:
        latent_dim:  dimension of object latent z_i
        hidden_dim:  width of trunk MLP
        n_layers:    depth of trunk
        pos_freqs:   Fourier frequencies for position encoding
        dir_freqs:   Fourier frequencies for direction encoding
        n_semantic_classes: number of semantic categories
        skip_at:     layer index for skip connection
    """

    def __init__(
        self,
        latent_dim: int = 128,
        hidden_dim: int = 256,
        n_layers: int = 6,
        pos_freqs: int = 6,
        dir_freqs: int = 4,
        n_semantic_classes: int = 32,
        skip_at: int = 3,
    ):
        super().__init__()
        self.latent_dim = latent_dim
        self.hidden_dim = hidden_dim
        self.skip_at = skip_at

        # --- Encodings ---
        self.pos_enc = FourierEncoding(input_dim=3, n_freqs=pos_freqs)
        self.dir_enc = DirectionEncoding(n_freqs=dir_freqs)
        self.time_enc = FourierEncoding(input_dim=1, n_freqs=4)

        trunk_input_dim = self.pos_enc.out_dim + self.time_enc.out_dim + latent_dim

        # --- Shared Trunk ---
        trunk_layers = []
        dims = [trunk_input_dim] + [hidden_dim] * n_layers
        for i in range(n_layers):
            in_d = dims[i] + (trunk_input_dim if i == skip_at else 0)
            trunk_layers.append(nn.Linear(in_d, hidden_dim))
        self.trunk = nn.ModuleList(trunk_layers)

        # --- Geometry Branch (position-only) ---
        self.sdf_head = GaussianHead(hidden_dim)
        # Note: surface normal is computed as ∇SDF via autograd, not predicted
        # Curvature can be computed as ∇²SDF (Laplacian)

        # --- Isotropic Material Branch (position-only) ---
        self.density_head = LogNormalHead(hidden_dim, scale=1000.0)    # kg/m³
        self.temperature_head = GaussianHead(hidden_dim)                # Kelvin
        self.restitution_head = BetaHead(hidden_dim, upper_bound=1.0)   # [0, 1]
        self.poisson_head = BetaHead(hidden_dim, upper_bound=0.5)       # [0, 0.5]
        self.acoustic_damping_head = LogNormalHead(hidden_dim, scale=1.0)

        # --- Directional Material Branch (position + direction) ---
        dir_input_dim = hidden_dim + self.dir_enc.out_dim
        self.dir_material_trunk = nn.Sequential(
            nn.Linear(dir_input_dim, hidden_dim // 2),
            nn.ReLU(),
            nn.Linear(hidden_dim // 2, hidden_dim // 2),
            nn.ReLU(),
        )
        dir_h_dim = hidden_dim // 2
        self.friction_head = LogNormalHead(dir_h_dim, scale=1.0)        # μ ∈ R+ (typ 0.1-1.0)
        self.stiffness_head = LogNormalHead(dir_h_dim, scale=1e9)       # Pa
        self.thermal_cond_head = LogNormalHead(dir_h_dim, scale=1.0)    # W/(m·K)

        # --- Dynamics Branch (position-only, vector/tensor outputs) ---
        self.velocity_head = VectorHead(hidden_dim)                      # m/s
        self.stress_head = SymmetricTensorHead(hidden_dim)               # Pa (6 Voigt components)
        self.strain_head = SymmetricTensorHead(hidden_dim)               # dimensionless (6 Voigt)

        # --- Appearance Branch (position + direction, as in NeRF) ---
        app_input_dim = hidden_dim + self.dir_enc.out_dim
        self.appearance_trunk = nn.Sequential(
            nn.Linear(app_input_dim, hidden_dim // 2),
            nn.ReLU(),
        )
        self.color_head = nn.Linear(hidden_dim // 2, 3)     # RGB ∈ [0,1]
        self.opacity_head = nn.Linear(hidden_dim // 2, 1)    # α ∈ [0,1]

        # --- Semantic Branch ---
        self.semantic_head = nn.Linear(hidden_dim, n_semantic_classes)
        self.affordance_head = nn.Linear(hidden_dim, 8)  # graspable, pushable, pourable, etc.

    def trunk_forward(self, x: torch.Tensor, t: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        """Shared trunk: (x, t, z) → feature vector h."""
        x_enc = self.pos_enc(x)
        t_enc = self.time_enc(t.unsqueeze(-1) if t.dim() == 1 else t)
        if z.dim() == 1:
            z = z.unsqueeze(0).expand(x.shape[0], -1)

        h = torch.cat([x_enc, t_enc, z], dim=-1)
        h_input = h

        for i, layer in enumerate(self.trunk):
            if i == self.skip_at:
                h = torch.cat([h, h_input], dim=-1)
            h = F.relu(layer(h))

        return h

    def forward(
        self,
        x: torch.Tensor,                      # [B, 3] position in object-local frame
        z: torch.Tensor,                       # [D] or [B, D] object latent
        t: Optional[torch.Tensor] = None,      # [B] or [B, 1] time
        d: Optional[torch.Tensor] = None,      # [B, 3] or [B, 2] query direction
        compute_normals: bool = False,          # whether to compute ∇SDF
        compute_curvature: bool = False,        # whether to compute ∇²SDF
    ) -> Dict[str, torch.Tensor]:
        """
        Full field query.
        
        Returns dict with all physical quantities as (mean, uncertainty) pairs.
        """
        B = x.shape[0]

        if t is None:
            t = torch.zeros(B, 1, device=x.device)
        elif t.dim() == 1:
            t = t.unsqueeze(-1)

        # Enable gradient tracking for normal/curvature computation
        if compute_normals or compute_curvature:
            x = x.detach().requires_grad_(True)

        # --- Shared trunk ---
        h = self.trunk_forward(x, t, z)

        result = {}

        # --- Geometry (position-only) ---
        sdf_mu, sdf_sigma = self.sdf_head(h)
        result['sdf_mu'] = sdf_mu
        result['sdf_sigma'] = sdf_sigma

        # Surface normal via autograd
        if compute_normals:
            grad_sdf = torch.autograd.grad(
                sdf_mu.sum(), x, create_graph=compute_curvature, retain_graph=True
            )[0]  # [B, 3]
            normal = F.normalize(grad_sdf, dim=-1)
            result['normal'] = normal
            result['sdf_gradient'] = grad_sdf

            if compute_curvature:
                # Laplacian of SDF ≈ mean curvature
                curvature = torch.zeros(B, device=x.device)
                for dim in range(3):
                    grad2 = torch.autograd.grad(
                        grad_sdf[:, dim].sum(), x, retain_graph=True
                    )[0][:, dim]
                    curvature += grad2
                result['mean_curvature'] = curvature

        # --- Isotropic materials (position-only) ---
        result['density_mu'], result['density_sigma'] = self.density_head(h)
        result['temperature_mu'], result['temperature_sigma'] = self.temperature_head(h)
        result['restitution_mu'], result['restitution_sigma'] = self.restitution_head(h)
        result['poisson_mu'], result['poisson_sigma'] = self.poisson_head(h)
        result['damping_mu'], result['damping_sigma'] = self.acoustic_damping_head(h)

        # --- Directional materials (position + direction) ---
        if d is not None:
            d_enc = self.dir_enc(d)
            h_dir = self.dir_material_trunk(torch.cat([h, d_enc], dim=-1))
            result['friction_mu'], result['friction_sigma'] = self.friction_head(h_dir)
            result['stiffness_mu'], result['stiffness_sigma'] = self.stiffness_head(h_dir)
            result['thermal_cond_mu'], result['thermal_cond_sigma'] = self.thermal_cond_head(h_dir)
        else:
            # If no direction given, query with canonical directions and average
            # (gives isotropic estimate)
            canonical = torch.tensor([[1,0,0],[0,1,0],[0,0,1]], dtype=torch.float, device=x.device)
            fric_sum = torch.zeros(B, 1, device=x.device)
            stiff_sum = torch.zeros(B, 1, device=x.device)
            for cd in canonical:
                cd_batch = cd.unsqueeze(0).expand(B, -1)
                d_enc = self.dir_enc(cd_batch)
                h_dir = self.dir_material_trunk(torch.cat([h, d_enc], dim=-1))
                f_mu, _ = self.friction_head(h_dir)
                s_mu, _ = self.stiffness_head(h_dir)
                fric_sum += f_mu
                stiff_sum += s_mu
            result['friction_mu'] = fric_sum / 3
            result['stiffness_mu'] = stiff_sum / 3
            # Higher uncertainty for averaged isotropic estimate
            result['friction_sigma'] = torch.ones_like(result['friction_mu']) * 0.1
            result['stiffness_sigma'] = result['stiffness_mu'] * 0.2

        # --- Dynamics ---
        result['velocity_mu'], result['velocity_sigma'] = self.velocity_head(h)
        stress_v, stress_s = self.stress_head(h)
        result['stress_voigt'] = stress_v        # [B, 6]
        result['stress_sigma'] = stress_s
        strain_v, strain_s = self.strain_head(h)
        result['strain_voigt'] = strain_v
        result['strain_sigma'] = strain_s

        # --- Appearance (needs direction) ---
        if d is not None:
            d_enc = self.dir_enc(d)
            h_app = self.appearance_trunk(torch.cat([h, d_enc], dim=-1))
            result['color'] = torch.sigmoid(self.color_head(h_app))
            result['opacity'] = torch.sigmoid(self.opacity_head(h_app))

        # --- Semantics ---
        result['semantic_logits'] = self.semantic_head(h)
        result['affordance_logits'] = self.affordance_head(h)

        # --- Store features for gating ---
        result['features'] = h

        return result


# ============================================================
# Convenience: Derived Quantities
# ============================================================

def contact_response(field_output: Dict, contact_force: torch.Tensor, contact_direction: torch.Tensor):
    """Given field output at a contact point, predict contact response.
    
    Args:
        field_output: from ObjectFieldComplete.forward(x, z, d=contact_normal)
        contact_force: [B, 1] applied normal force
        contact_direction: [B, 3] contact normal
    
    Returns:
        dict with:
            - max_static_friction: F_max = μ_s * F_normal
            - deformation: δ ≈ F / (E * A) for small deformations
            - will_slip: bool, whether tangential force exceeds friction
    """
    mu = field_output['friction_mu']
    E = field_output['stiffness_mu']

    max_friction = mu * contact_force.abs()
    # Hertzian contact approximation: deformation ~ (F / E)^(2/3)
    deformation = (contact_force.abs() / (E + 1e-6)) ** 0.667

    return {
        'max_static_friction': max_friction,
        'deformation': deformation,
        'friction': mu,
        'stiffness': E,
    }


# ============================================================
# Test
# ============================================================

if __name__ == "__main__":
    print("=" * 60)
    print("Complete Object Field — Architecture Test")
    print("=" * 60)

    field = ObjectFieldComplete(
        latent_dim=128,
        hidden_dim=256,
        n_layers=6,
        pos_freqs=6,
        dir_freqs=4,
    )

    params = sum(p.numel() for p in field.parameters())
    print(f"\nTotal parameters: {params:,}")

    # Itemize by component
    components = {
        'trunk': sum(p.numel() for l in field.trunk for p in l.parameters()),
        'geometry': sum(p.numel() for p in field.sdf_head.parameters()),
        'isotropic_material': sum(
            sum(p.numel() for p in head.parameters())
            for head in [field.density_head, field.temperature_head,
                         field.restitution_head, field.poisson_head, field.acoustic_damping_head]
        ),
        'directional_material': sum(p.numel() for p in field.dir_material_trunk.parameters()) +
                                sum(p.numel() for p in field.friction_head.parameters()) +
                                sum(p.numel() for p in field.stiffness_head.parameters()) +
                                sum(p.numel() for p in field.thermal_cond_head.parameters()),
        'dynamics': sum(p.numel() for p in field.velocity_head.parameters()) +
                    sum(p.numel() for p in field.stress_head.parameters()) +
                    sum(p.numel() for p in field.strain_head.parameters()),
        'appearance': sum(p.numel() for p in field.appearance_trunk.parameters()) +
                      sum(p.numel() for p in field.color_head.parameters()) +
                      sum(p.numel() for p in field.opacity_head.parameters()),
        'semantic': sum(p.numel() for p in field.semantic_head.parameters()) +
                    sum(p.numel() for p in field.affordance_head.parameters()),
    }
    for name, count in components.items():
        pct = count / params * 100
        print(f"  {name:25s} {count:>8,} ({pct:5.1f}%)")

    # Forward pass
    B = 64
    x = torch.randn(B, 3) * 0.05
    z = torch.randn(128)
    t = torch.zeros(B)
    d = F.normalize(torch.randn(B, 3), dim=-1)

    print(f"\nQuery: {B} points, with direction, with normals+curvature")
    out = field(x, z, t=t, d=d, compute_normals=True, compute_curvature=True)

    print(f"\nOutputs:")
    for k, v in sorted(out.items()):
        if isinstance(v, torch.Tensor):
            print(f"  {k:25s} shape={str(list(v.shape)):12s} range=[{v.min().item():.4f}, {v.max().item():.4f}]")

    # Query without direction (isotropic average)
    print(f"\nQuery: {B} points, NO direction (isotropic estimate)")
    out_iso = field(x, z, t=t, d=None, compute_normals=True)
    print(f"  friction_mu (isotropic avg): {out_iso['friction_mu'].mean().item():.4f}")
    print(f"  normal (from ∇SDF):          shape={list(out_iso['normal'].shape)}")

    # Contact response
    print(f"\nContact response at query points:")
    F_contact = torch.ones(B, 1) * 5.0  # 5N
    response = contact_response(out, F_contact, d)
    for k, v in response.items():
        if isinstance(v, torch.Tensor):
            print(f"  {k:25s} mean={v.mean().item():.4f}")

    print("\nDone!")
Technical Appendix B

The Object Graph \(G\): Identity and Relational Structure

This appendix specifies the discrete half of the state. Where the belief field \(\Phi\) provides per-point physical queries at geometric resolution, the graph \(G_t\) provides the mechanism for tracking which object is which across time and for representing relational structure between objects. These two components are complementary: a field without a graph has no persistent object identity; a graph without a field has no local physical query interface.

\[G_t = (V_t,\; E_t)\] \[\text{Node } i \in V_t: \quad T_i \in SE(3),\quad z_i \in \mathbb{R}^D,\quad e_i \in [0,1]\] \[\text{Edge } (i,j) \in E_t: \quad c_{ij} \in [0,1],\quad r_{ij} \in \{\text{on},\text{in},\text{contact},\ldots\},\quad \text{constraints}\]

Coupling: Mixture of Object-Centric Experts

The global field is a weighted mixture of object-local fields. For a query point x in world frame:

\[x_i = T_i^{-1} x \qquad \text{(transform query point to object } i\text{'s local frame)}\] \[p(\,\cdot\mid x,t) = w_0(x)\cdot p_{\mathrm{bg}}(\,\cdot) + \sum_{i} w_i(x)\cdot p_i(\,\cdot \mid x_i, z_i)\]

Here p(· | x, t) is shorthand for the collection of property-specific outputs returned at query point x, not a claim that the repo already specifies one joint density tying every scalar, vector, tensor, and categorical head together.

The gating weights wi(x) are computed from each object’s SDF: points well inside object i should receive high weight for i. This is a heuristic ownership mechanism rather than a proved partition of unity. Near overlapping, gapped, or inaccurate SDF boundaries, the ownership can become ambiguous, and this appendix does not yet provide identifiability or error-propagation bounds. A new object enters the scene by allocating a new zi and pose Ti.

What the graph gives you that the field can’t

Object permanence: The intended behavior is to keep node i alive with decaying existence probability while the object is occluded, then recover identity when it reappears. Achieving that requires explicit temporal update and data-association machinery; it does not follow from writing down \(G_t=(V_t,E_t)\) alone.

Relational reasoning: “The cup is on the plate, which is on the table.” This support chain is representable as a path in the graph. Turning that stored relation into a reliable counterfactual such as “remove the plate, then the cup falls” still requires explicit dynamics or symbolic reasoning rather than the graph state by itself.

Contact-conditional dynamics: Graph edges can parameterize sparse message passing over likely contacts. Turning that representation into reliable force propagation still requires a concrete update rule and a learned or analytic dynamics model.

Supporting Visualization

St = (Φt, Gt) — Live Demo

This is a supporting visualization of the representation on a MuJoCo tabletop scene with five objects. Click any object to query its field properties and inspect the graph edges, ownership map, and SDF wireframes.

Scene Overview

The same tabletop scene used for the live field-graph visualization and object-level queries below.

Drag to orbit · Scroll to zoom · Click objects to query Φ

Pose (x,y,z)
Existence ei
Shape
Field Φ at surface
SDF
Friction μ
Density ρ
Stiffness E
Restitution
Graph edges
Section 8 — Interactive Results

MuJoCo Experiment Panels

The panels below are interactive renderings of the three toy experiments described in Section 5. Each panel corresponds to a runnable script in the repository; the measurements were produced by MuJoCo 3.5 simulation. Use the controls to explore the parameter space for each experiment. All results carry the same scope qualifications as the Section 5 descriptions: these are repo evidence for individual interface properties, not general performance comparisons.

Experiment 1: The Indistinguishable Pair

Corresponding to Section 5, Experiment A.  Two cubes are rendered with identical visual appearance (same shape, color, and texture) but with dramatically different hidden physics: steel (2.0 kg, \(\mu = 0.5\)) versus foam (0.05 kg, \(\mu = 0.3\)). The slider selects the applied force; pressing Play animates the displacement trajectories frame-by-frame. The bar chart shows final displacement at each tested force on a log scale, where the gap between the two materials is most apparent.

Push at 2N

The lighter cube accelerates away while the heavier cube barely moves under the same visual setup.

Push at 5N

The higher-force version makes the separation in hidden physics even more obvious.

Force Sweep

This sweep is the rendered counterpart to the displacement-vs-force chart shown in the panel.

Displacement over time
Steel (2.0 kg)
0.00003
meters
Foam (0.05 kg)
0.223
meters
Ratio
7147×
foam / steel
Final displacement vs force (log scale)
This panel shows a repo toy result: the embedded force sweep contains a displacement ratio above 7,000× for the displayed setting. That supports the claim that appearance can hide physics in this scene. It does not by itself establish that every image-only model would fail on the same task unless a direct baseline is run.

Experiment 2: Bayesian Material Estimation

Corresponding to Section 5, Experiment B.  This panel demonstrates that the field state can be updated through interaction rather than relying on a single visual estimate. The estimator maintains a joint posterior over mass \(m\) and friction coefficient \(\mu\) using a grid of 1,296 precomputed MuJoCo simulation outcomes as the likelihood model. After each push, the observed displacement is compared to the precomputed likelihood grid and the posterior is updated via Bayes’ rule.

Select a material and drag the Interactions slider (or press Autoplay) to observe how the mass estimate and its uncertainty evolve. Note how the relative error typically drops below 20% within four to six interactions, regardless of starting prior. This is the “interaction refines belief” property from Section 5.

Wood Interactions

Sequential pushes in the lighter-material regime show the observation stream used to tighten the posterior.

Metal Interactions

The same estimator on a heavier, lower-friction object produces a different convergence path.

Mass estimate
Friction estimate
Mass error
relative
Mass estimate convergence (±1σ)

Experiment 3: Grasp Force Selection

Corresponding to Section 5, Experiment C.  Given the field state for a known material, the safe grip-force range is a direct closed-form consequence: the minimum force to lift without slipping is \(F_{\min} = mg/(2\mu)\), and the maximum force before crushing is \(F_{\max} = F_{\text{crush}}\). This experiment verifies that the material estimates in the field state are sufficient to compute this control constraint without any additional learned layer.

Select a material and sweep the Grip Force slider to see whether the robot lifts, drops (insufficient friction), or crushes the object. The color-coded force map on the right summarizes the full outcome spectrum for the selected material at once.

Egg Grasp Comparison

The fragile-object comparison shows why the safe operating band matters more than a single nominal force.

Status
Min force
to lift (N)
Max force
crush (N)
Safe range
Newtons
Force outcome map (green=lift, red=crush, gray=drop)
Language priors: The label “this is an egg” sets p(mass) ≈ 60g, p(crush) ≈ 3N. In this page, that should be read as an interface example showing how a prior can shift the initial belief state before contact, not as a claim that language alone solves material estimation.
Section 9 — Outlook

Internet Video as Training Signal: Alignment with Prior Work

Large-scale human and internet video is already an established component of the contemporary world-model landscape. Structured World Models from Human Videos, Genie, and UniSim collectively demonstrate that rich video corpora can support world modeling, interactive environment generation, and transferable control. The field-graph approach is compatible with this line of work: internet video becomes the data source for Stage S2 of the VLA curriculum (Section 10), supervising view-consistent geometry and relational priors rather than pixel reconstruction.

The key distinction is the supervision target: routing internet video through a structured 3D state \((\Phi, G)\) rather than through pixel prediction or latent dynamics produces an intermediate representation that is addressable by downstream skills. Section 6 establishes that this structured intermediate already outperforms an image baseline under a controlled protocol. Section 10 describes how that result scales to the full VLA setting.

Video as geometry and relational supervision. Rather than predicting future frames, the VLA uses video to learn view-consistent SDF priors, object-level identity, and support-chain relations. The supervision signal is self-supervised consistency (same scene, multiple views → same \(\Phi+G\)) rather than pixel accuracy. Language descriptions of objects seen in video initialize material priors before any robot interaction occurs.
QuestionVideo-centric systemsField-graph hypothesis
What is directly supervised?Pixels, latent video dynamics, or simulator rolloutsStructured 3D state variables and relations
What interface is exposed downstream?Typically images, latents, or simulator trajectoriesLocal queries plus persistent object graph state
What remains to be shown here?A controlled repo baseline proving that the field-graph interface transfers better than a simple image-only baseline on the same MuJoCo tasks

Training Curriculum

Level 0: Simulation pre-training. MuJoCo provides ground truth SDF, materials, contacts, and trajectories. This is the most direct supervision currently available in the repo.

Level 1: Internet or human video. RGB only, no direct force labels. The intended role here is to learn geometry and relational priors, consistent with the direction explored in the cited literature.

Level 2: Robot interaction. Use targeted interaction to calibrate quantities that are weakly observed from video alone, such as friction or crush thresholds.

Level 3: Language priors. Use labels or task descriptions to initialize a belief state, then refine that state through observation and contact.

Appendix C — Field Prototype

Neural Field Prototype: Embedded Evaluation Export

This section shows the current embedded export tied to the repo’s neural-field training and evaluation workflow in nerf/train.py and nerf/evaluate.py. The prototype is trained on MuJoCo ground truth for three objects and is included here as validation of the field machinery, not as a direct comparison against image-only baselines.

SDF MAE
meters
Friction est.
Density est.
SDF MAE convergence (log scale)
SDF cross-section (z=0) — Predicted vs Ground Truth
Current embedded export

The curves and counters below are the current site export from the prototype field workflow. They indicate that the shared architecture can fit the toy objects used in the repo and recover material-related quantities on those examples.

This should be read as evidence of prototype viability and exportability. The direct proof-gated image-versus-structured comparison now lives in Section 6; this appendix remains separate evidence that the field machinery can be trained and exported inside the repo.

Section 10 — Future Work

From Prototype to End-to-End VLA

The three-model pipeline in Section 7 is currently trained in stages: supervised encoder first, then contrastive vectorizer, then cross-encoder. The next step is to collapse this into a single end-to-end objective, then extend the result to a full Vision–Language–Action (VLA) model where the explicit \((\Phi, G)\) state acts as a physics-grounded intermediate between perception and action. This section describes both stages.

Phase 1 — End-to-End Joint Training of All Three Models

Staged training allows each module to converge independently but introduces a mismatch: the encoder is trained on MuJoCo ground truth supervision, while the vectorizer is trained on the encoder’s outputs which are imperfect. End-to-end training removes this distribution gap by computing gradients through all three stages simultaneously.

End-to-end training objective
Image I + language Model 1 Encoder CNN + Slots L_supervised Φ + G explicit state Model 2 Vectorizer GNN + pool L_contrastive s scene vector Model 3 Cross-Encoder s + text + img L_action + L_align Action end-effector cmd joint gradient flow (end-to-end)
Training phaseLoss termsWhat is learned
E2E Warm-up\(\mathcal{L}_{\text{sup}} + \mathcal{L}_{\text{ctr}}\) with small LRAlign Model 1 gradients with downstream contrastive signal from Model 2
E2E Joint\(\mathcal{L}_{\text{sup}} + \mathcal{L}_{\text{ctr}} + \mathcal{L}_{\text{action}}\)All three models optimized simultaneously; \(\Phi+G\) shaped by both geometric supervision and task reward
E2E Refinement\(\mathcal{L}_{\text{action}}\) with frozen encoderFine-tune policy head on new task distributions while keeping the structured state stable

The key advantage of end-to-end training is that the encoder learns to produce a \(\Phi + G\) that is useful for control, not just one that reconstructs MuJoCo ground truth. Gradients from the action loss propagate back through the vectorizer into the encoder, allowing the object latents \(z_i\) to encode the physically relevant structure that actually matters for the downstream task.

Phase 2 — A Physics-Grounded Vision–Language–Action Model

The Cross-Encoder (Model 3) in Section 7 already has the structure of a VLA: it accepts a structured scene representation, a language instruction, and a current image, and outputs an action. The next step is to scale this to a full VLA trained on internet video and robot demonstrations, where \((\Phi, G)\) acts as the physics-grounded intermediate state rather than a raw image latent.

Contemporary VLAs such as RT-2, OpenVLA, and π0 (Physical Intelligence) map directly from visual tokens and language to action tokens using transformer backbones pretrained on internet-scale data. They achieve impressive generalization but encode all physical understanding implicitly in the attention layers. The field-graph VLA proposes inserting the explicit \((\Phi, G)\) state as a mandatory intermediate, making physical structure addressable at inference time.

Standard VLA (e.g., RT-2, π0)
Images + Language Transformer physics implicit in attention layers (not queryable) Action
Field-Graph VLA (proposed)
Images + Language Encoder Models 1+2 Φ+G queryable Policy Model 3 Action

The critical architectural difference is that the field-graph VLA can answer explicit physical queries at inference time: a planner can query the SDF of an object before choosing a grasp, read material friction before selecting a grip force, or traverse the graph to check for support relations before executing a pick. Standard VLAs cannot expose these answers without retraining.

Training Curriculum for the VLA

StageData sourceObjectiveWhat \(\Phi+G\) learns
S1: Sim pre-trainingMuJoCo scenes with GT SDF, materials, contactsSupervised \(\mathcal{L}_{\text{geo}} + \mathcal{L}_{\text{mat}} + \mathcal{L}_{\text{contact}}\)Accurate SDF, material prediction, and graph structure in controlled settings
S2: Video pre-trainingInternet video and human manipulation demonstrationsView-consistency contrastive \(\mathcal{L}_{\text{ctr}}\); self-supervised depth/flowGeometry and relational priors from large-scale natural observation, consistent with Structured World Models from Human Videos
S3: Interaction calibrationRobot self-play and targeted interactionPosterior update on contact outcomes; friction and stiffness calibrationMaterial quantities that are weakly observable from video alone (friction, crush threshold, compliance)
S4: Language alignmentScene + instruction pairs; CLIP-style paired dataContrastive \(\mathcal{L}_{\text{align}}\) between scene embedding and instruction embeddingSemantic grounding: “the heavy metal block” initializes a prior over mass and friction before any contact
S5: Policy fine-tuningRobot demonstrations on target tasksBehavioral cloning \(\mathcal{L}_{\text{action}}\) through all three modelsTask-specific shaping of the structured state to support downstream control
The VLA design hypothesis. Language descriptions of a scene (“pick up the fragile glass egg gently”) can initialize a prior over the field-graph state before any observation. Interaction then refines that prior. At execution time, the policy reads grip force, approach angle, and support relations directly from the state rather than re-encoding them from pixels at each step. This loop — language prior → observation update → physical query → action — is the intended inference-time behavior of the complete system.
Section 11 — Conclusion

Conclusion

This document has introduced, justified, and benchmarked a hybrid intermediate state \(S_t = (\Phi_t, G_t)\) as the interface between robotic perception and control. The field component \(\Phi_t\) exposes geometry and material properties at any 3D query point; the graph component \(G_t\) tracks persistent object identity and relational structure. Together they form an explicitly addressable world state that downstream skills can query without re-encoding physical structure from raw pixels at each new task.

Three layers of evidence support the design: literature precedent for each component (Section 2), controlled MuJoCo experiments validating three independent interface properties (Sections 5 and 8), and a proof-gated direct benchmark showing a 46.3% reduction in out-of-distribution prediction error over a matched image baseline (Section 6). The complete three-model pipeline (Section 7) is runnable, instrumented, and designed for end-to-end training. The roadmap in Section 10 describes how this pipeline becomes a full Vision–Language–Action model: training all three models jointly, then scaling to internet video and robot demonstration data in a five-stage curriculum.

Scope and Limitations

The benchmark in Section 6 covers a one-object calibration-push task under a matched norm-bounded linear probe. It does not yet cover multi-object occlusion identity, contact-rich manipulation, or richer nonlinear baselines. The field exposes per-property heads rather than a single joint probabilistic model; surface normal stability and SDF-based ownership are contingent on SDF regularity. Graph persistence through occlusion and relational counterfactuals require explicit update and reasoning machinery beyond the state definition. These are the engineering boundaries for the end-to-end training phase.

The program. A structured intermediate state separates representation from control, makes physical queries explicit, and provides a principled place to inject language priors, field uncertainty, and relational context. The next step is to train the full pipeline end-to-end and extend the cross-encoder into a complete physics-grounded VLA.
References

Selected References and Repo Artifacts

CategoryReferenceWhy it appears here
World models / dreamingDreamerV3 (Hafner et al.); Genie; UniSim; V-JEPA (LeCun et al.)Latent-space imagination and embedding-space prediction frameworks that motivate the dreaming argument in Section 1.
Compositional / physics world modelsDreMa (ICLR 2025); PIN-WM; DayDreamer (CoRL 2022)One-shot policy learning from compositional world models (DreMa), few-shot dynamics identification (PIN-WM), and physical robot learning via imagination in ~1 hour (DayDreamer).
SurveyAi et al., Science Robotics 2025Comprehensive review confirming that structured state representations with physics priors consistently improve sample efficiency and generalization for manipulation.
Object-centric modelsSlot Attention; FOCUSExamples of structured object representations that motivate the graph half of the proposed interface.
Neural fields for roboticsNeRF; Dex-NeRF; Evo-NeRFExamples of continuous field-based geometry interfaces related to the Φ half of the representation.
Scene graphs / planningHydra; Hierarchical 3D Scene Graph PlanningExamples of persistent object identity and relational planning interfaces related to the G half of the representation.
Repo proof notesproofs/fairness_of_image_vs_state_comparison.md; proofs/superiority_certification_from_experiment.mdFormal fairness and certification rules that define the exact scope of the benchmark claim reported in Section 6.
Repo benchmark pipelinetraining/scripts/generate_planned_comparison_data.py; training/scripts/run_planned_comparison.pyExecutable generator and trainer used to produce the direct image-versus-structured benchmark.
Repo benchmark outputstraining/results/planned_comparison_gpu_batched/metrics.json; seed_metrics.csv; ood_predictions.csvSaved validation, held-out-camera, and per-seed evidence backing the metric cards and table on this page.

This is a selected reference section rather than a full bibliography. External literature links cover the representative prior work named in the report, while repo-local proof notes and output paths are listed because they are the primary evidence source for the direct benchmark claim.