A human can pick up an unfamiliar mug after seeing it once — not because they have memorized every mug, but because they can imagine the physical consequences: the surface normal constrains the grasp angle, the estimated mass sets the grip force, and a mental model of the support chain predicts what else will move. Current image-to-action networks lack this structured physical imagination; they must re-learn physics from raw pixels for every new task and cannot generalize from sparse interactions.
We propose an explicit structured intermediate state — a continuous belief field Φ for geometry and material queries, and a discrete object graph G for persistent identity and relational reasoning — that enables physics-aware dreaming: the ability to mentally simulate plausible outcomes of untried actions using explicit physical structure, and to generalize from sparse observations because physics constrains the space of possible outcomes. Claims are separated into three evidence classes throughout.
The core design choice is where structured physical knowledge lives in the pipeline: inside a black box, or as an addressable intermediate state that enables physics-aware dreaming and sparse-data generalization.
In a black-box pipeline, physical queries are answered implicitly inside the network weights. Every new task must re-discover geometry, friction, and object relationships from raw pixels.
In the structured pipeline, physical structure lives in an explicit intermediate state \(S_t = (\Phi_t, G_t)\). A StateVectorizer (M2) then converts this structured object graph into a dense scene vector \(\mathbf{s} \in \mathbb{R}^{512}\) via GNN message passing and attention pooling; this is the step that makes \((\Phi, G)\) compatible with downstream neural networks. The CrossEncoder (M3) takes \(\mathbf{s}\) and produces the robot action.
This document argues for a hybrid intermediate state representation \(S_t = (\Phi_t, G_t)\) as the interface between a robot’s perceptual front-end and its planning and control back-end. The contribution is an interface design: a robot-facing world state that exposes local geometry and material queries through a continuous neural field \(\Phi\), and persistent object identity with relational structure through a discrete graph \(G\). The long-term goal is to train all three pipeline stages end-to-end and use the result as the physics-grounded backbone of a full Vision–Language–Action (VLA) model, as detailed in Section 9.
Physics-aware dreaming is the ability of a world model to mentally simulate the consequences of actions it has never taken, using explicit physical structure rather than learned pixel correlations. The term extends the Dreamer paradigm (Hafner et al., DreamerV3), where agents train policies entirely by imagining forward in a latent space. In our case, the latent space is not opaque — it is a physically grounded state \(S_t = (\Phi_t, G_t)\) where geometry, mass, friction, and contact topology are all explicitly addressable. This makes the imagination constrained by physics rather than by what the model happened to see in training.
A robot observes a mug once from a single camera and extracts \(\Phi + G\):
| Query | What \(\Phi + G\) provides | What the robot computes (dreams) |
|---|---|---|
| Grasp angle | \(\nabla \text{SDF}\) at candidate contact → surface normal \(\hat{n}\) | Gripper must align to \(\hat{n}\) at the handle — no second view needed |
| Grip force | mass \(m = 0.3\) kg, friction \(\mu = 0.6\) | \(F_{\min} = mg/(2\mu) = 2.45\) N to lift without slipping — no trial grasps needed |
| Support chain | Graph edge: mug ⟶ plate ⟶ table | Lifting the mug will not disturb the plate — no physical probe needed |
| What if heavier? | Vary \(m\) to 1.0 kg in the state | \(F_{\min}\) rises to 8.17 N — the robot dreams a different scenario by changing one number |
Each of these is a closed-form consequence of the explicit state — no additional training data, no reinforcement learning, no thousands of pixel demonstrations. One observation populates the physical parameters; physics does the rest. This is the sparse-data generalization property: the structured state compresses the experience requirement from thousands of demonstrations to a handful of observations, because physics constrains the space of possible outcomes.
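As a minimal sketch of the grip-force and what-if rows in the table above, the calculation can be written as a plain function of the state parameters. The function name `min_grip_force` and the two-finger pinch model are illustrative assumptions rather than code from the repo, and standard gravity is assumed for \(g\).

```python
G_STANDARD = 9.80665  # m/s^2, standard gravity (assumed value of g)


def min_grip_force(mass_kg: float, mu: float, g: float = G_STANDARD) -> float:
    """Minimum normal force for a two-finger pinch: F_min = m * g / (2 * mu)."""
    if mass_kg <= 0 or mu <= 0:
        raise ValueError("mass and friction coefficient must be positive")
    return mass_kg * g / (2.0 * mu)


# One observation fills in (m, mu); a "dream" is a re-evaluation with one number changed.
observed = min_grip_force(0.3, 0.6)  # mug as observed
what_if = min_grip_force(1.0, 0.6)   # same mug, imagined heavier
print(f"{observed:.2f} N, {what_if:.2f} N")  # prints: 2.45 N, 8.17 N
```

No retraining step appears anywhere in the what-if case: the only change is the value of `mass_kg` passed in.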
This principle is not hypothetical. DayDreamer (Wu, Hafner et al., CoRL 2022) showed a physical robot could learn locomotion in ~1 hour by training mostly in imagination. DreMa (ICLR 2025) achieved one-shot policy learning for novel manipulation tasks by combining object-centric representations with physics simulation — a single demonstration suffices because the world model can equivariantly imagine variations. LeCun’s JEPA framework argues that prediction should happen in abstract representation space, not pixel space; our contribution is making that representation space physically grounded. A comprehensive Science Robotics survey (2025) confirms the field-wide trend: structured state representations with physics priors consistently improve sample efficiency and generalization for manipulation.
The structured state \((\Phi, G)\) makes dreaming possible because physical parameters enter the forward model as explicit variables, not patterns buried in weights. Changing one parameter (mass, friction, contact topology) changes the predicted outcome through physics equations, not through gradient descent on millions of examples. This is the fundamental mechanism behind both sparse-data generalization and zero-shot planning from the structured state.
The breadth of queries a robot must handle in even a simple grasping task motivates the representation design. Consider picking up an unknown mug from a cluttered surface:
| Query type | Example question | How it shapes behavior |
|---|---|---|
| Geometry | What is the surface normal at this candidate grasp point? | Determines gripper orientation and approach angle |
| Material | What friction coefficient and mass should I expect? | Sets minimum grip force to prevent slip; limits maximum to avoid crush |
| Relations | Is the mug sitting on a plate? Will the plate move if I lift the mug? | Determines whether a single-object plan is safe |
| Uncertainty | I have only seen this object from one angle. How confident am I in the far-side geometry? | Triggers an active information-gathering step before committing to a grasp |
An end-to-end image ➝ action network must answer all of these implicitly, encoding the answers inside its weights with no explicit query interface. The representation proposed here makes each query a first-class operation on the state \(S_t\).
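To make "first-class operation" concrete, the relational query from the table (will anything move if I lift this object?) can be sketched as a traversal over support edges. The edge format and the function name `objects_disturbed_by_lifting` are hypothetical; the repo's graph \(G\) may store relations differently.

```python
from collections import defaultdict


def objects_disturbed_by_lifting(support_edges, target):
    """Return every object that rests, directly or transitively, on `target`.

    `support_edges` is a list of (supported, supporter) pairs, so
    ("mug", "plate") means the mug sits on the plate.
    """
    resting_on = defaultdict(list)  # supporter -> objects directly on top of it
    for supported, supporter in support_edges:
        resting_on[supporter].append(supported)

    disturbed, stack = set(), [target]
    while stack:
        current = stack.pop()
        for obj in resting_on[current]:
            if obj not in disturbed:
                disturbed.add(obj)
                stack.append(obj)
    return disturbed


edges = [("mug", "plate"), ("plate", "table")]
print(objects_disturbed_by_lifting(edges, "mug"))    # nothing rests on the mug
print(objects_disturbed_by_lifting(edges, "plate"))  # the mug rides along
```

A single-object plan is safe exactly when the returned set is empty.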
The evidence on this page falls into three categories, and every claim is tagged with its category. Scope limitations are consolidated in Section 10 rather than scattered across individual sections.
The repo contains concrete, runnable artifacts at each stage of the pipeline. Together they demonstrate feasibility of the representation, provide direct benchmark evidence for the structured interface, and form the launchpad for the end-to-end training described in Section 9.
| Status | Artifact | Source | What it justifies |
|---|---|---|---|
| Repo result | Indistinguishable-pair push demo and force sweep | render_exp1_push.py plus the embedded chart data on this page | In a controlled toy scene, identical appearance can conceal very different contact outcomes |
| Repo result | Sequential material estimation | render_exp2_estimation.py | Interaction can tighten posterior beliefs over mass and friction rather than relying on a single visual guess |
| Repo result | Grasp-force selection demo | render_exp3_grasp.py | Local state queries can be routed into a downstream force-selection decision |
| Repo result | Neural-field prototype export | nerf/train.py, nerf/evaluate.py, and the embedded NF export | The repo already supports toy geometry and material prediction with charted convergence |
| Repo result | Proof-gated image-only versus structured-state benchmark | proofs/fairness_of_image_vs_state_comparison.md, proofs/superiority_certification_from_experiment.md, training/scripts/generate_planned_comparison_data.py, training/scripts/run_planned_comparison.py | On a controlled calibration-push cross-view benchmark, the structured interface beats the matched image interface under the certification rule |
Every quantitative claim later in this document traces back to one of these artifacts. Claims that cannot be so traced are explicitly labeled as future work or illustrative. Section 4 gives the formal representation definition that these artifacts instantiate.
The world state at time \(t\) is represented as a pair: a continuous belief field \(\Phi_t\) that can be queried at any 3D location to return local geometry and material properties, and a discrete object graph \(G_t\) that carries persistent object identity and relational structure. The objective is not to replace every prior world-model interface, but to define a representation that is directly legible to downstream manipulation skills and task planners — one where the answers to geometry, material, and relational queries are explicit rather than entangled inside a black-box network.
The contribution is localized at the interface between perception and action. Rather than asking a single network to absorb appearance, geometry, material response, identity, and control all at once, the proposal inserts a structured intermediate state and a dedicated vectorization step. The Encoder (M1) decomposes an image into a structured state \(S_t = (\Phi_t, G_t)\) — per-object latent codes, 6-DoF poses, existence flags, and contact-weighted edges. Because this state is a heterogeneous graph, not a flat tensor, the StateVectorizer (M2) converts it into a fixed-size dense vector \(\mathbf{s} \in \mathbb{R}^{512}\) via GNN message passing and attention pooling. Only \(\mathbf{s}\) enters the downstream Cross-Encoder (M3), which fuses it with language and image embeddings to produce an action.
Section 5 presents three MuJoCo experiments that validate three independent properties of this interface. Section 6 reports the controlled head-to-head benchmark.
The core claim of this work is that explicit physical structure enables physics-aware dreaming: the ability to imagine the consequences of untried actions and to generalize from sparse observations because physics constrains the possible outcomes. The three experiments below each test a distinct facet of this claim in a controlled MuJoCo environment. All are Repo results generated by runnable scripts in this repository.
| What dreaming enables | Experiment | What is demonstrated |
|---|---|---|
| Predict outcomes from sparse observations | A — Dream from single observation | One extraction of \(\Phi+G\) yields correct predictions across 40× force range |
| Refine beliefs with targeted interactions | B — Sequential material estimation | 4–6 pushes pin mass and friction to <20% error |
| Plan without trial-and-error | C — Zero-shot grasp-force selection | Safe grip range is a closed-form function of the state — zero demonstrations needed |
Motivation. The core payoff of physics-aware dreaming is that a single observation can populate the physical parameters in \(\Phi + G\), and physics does the rest. Once the structured state encodes mass \(m\) and friction \(\mu\) for an object, the robot can dream (predict) displacement under any applied force \(f\) without additional demonstrations. To first order, neglecting friction, \(\Delta x \approx f \cdot \Delta t^2 / (2m)\); once the object is sliding, Coulomb friction reduces the driving force to \(f - \mu m g\). This experiment tests whether the structured state produces correct dreams across a 40× force range after a single scene extraction.
Setup. Two red cubes are rendered with identical appearance — same size, color, texture. Their physical parameters differ: steel (\(m=2.0\) kg, \(\mu=0.5\)) versus foam (\(m=0.05\) kg, \(\mu=0.3\)). Given \(\Phi + G\) from a single observation, the model dreams displacement trajectories under forces from 0.2 N to 5 N and these are compared against the MuJoCo ground truth.
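A minimal sketch of what a dreamed displacement looks like as a function of the state, using the steel and foam parameters from the setup. The rigid Coulomb friction model, the 0.1 s push duration, and the function name `dreamed_displacement` are assumptions for illustration; this is not the code in render_exp1_push.py. Under this idealization the steel cube stays at rest for any push below \(\mu m g \approx 9.8\) N, whereas a simulator with compliant contacts, such as MuJoCo, can report a small nonzero displacement.

```python
G = 9.81  # m/s^2


def dreamed_displacement(force_n, mass_kg, mu, push_time_s):
    """Displacement from rest under a constant horizontal push (rigid Coulomb model)."""
    friction_limit = mu * mass_kg * G  # force needed to start sliding
    if force_n <= friction_limit:
        return 0.0  # idealized: the object does not move at all
    accel = (force_n - friction_limit) / mass_kg
    return 0.5 * accel * push_time_s ** 2


foam = dict(mass_kg=0.05, mu=0.3)  # parameters from the setup above
steel = dict(mass_kg=2.0, mu=0.5)
for force in (0.2, 1.0, 5.0):
    print(
        force,
        dreamed_displacement(force, push_time_s=0.1, **foam),
        dreamed_displacement(force, push_time_s=0.1, **steel),
    )
```

The two cubes share every visual attribute, so the difference between the two printed columns comes entirely from the `mass_kg` and `mu` entries of the state.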
Measurement. Displacement trajectories and final displacement under each force, as rendered by render_exp1_push.py.
Result. The dreamed trajectories match the simulator: the structured state correctly predicts a 7,000× displacement ratio between the two objects. From a single observation, the model generates correct predictions across the entire force range — the same way a human who has picked up a heavy block and a foam block once can predict how far each will slide. This is the sparse-data generalization property at work: physics constrains the prediction space so that one observation is sufficient for accurate imagination across novel conditions. The interactive panel in Section 8 shows the full displacement-vs-force sweep.
Motivation. A world model that dreams forward must have accurate physical parameters — but a single image rarely pins them down. The key advantage of structured state over flat embeddings is that a few targeted interactions can rapidly constrain the physical parameters (mass, friction) because the state space has physically meaningful dimensions. This is analogous to how humans probe an unfamiliar object with a single push or lift to estimate its weight, rather than requiring thousands of demonstrations. DreMa (Barcellona et al., ICLR 2025) demonstrates this principle at scale: compositional world models with explicit physical structure enable one-shot policy learning from single demonstrations.
Setup. A Bayesian grid estimator maintains a joint posterior over mass and friction across a 36 × 36 grid of MuJoCo-simulated outcomes. Sequential push observations are fed in one at a time, tightening the posterior with each interaction. Run by render_exp2_estimation.py.
Measurement. Mass and friction estimates (mean ± 1\(\sigma\)) as a function of interaction count for four material types.
Result. After just four to six interactions, the relative error of the mass estimate typically falls below 20%. This rapid convergence is possible because the state has explicit physical dimensions that observations directly constrain: each push observation eliminates a region of the mass–friction space. An opaque latent vector has no such axes and would be expected to need far more data to learn the same mapping implicitly. The interactive panel in Section 8 shows the convergence trajectory for each material.
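The sequential update can be sketched with a small grid posterior. This is a hedged illustration, not the code in render_exp2_estimation.py: an analytic Coulomb push model stands in for the MuJoCo-simulated outcome grid, and the grid ranges, push forces, push duration, and noise levels are all assumed values.

```python
import math
import random

G = 9.81
PUSH_TIME = 0.5  # s, assumed push duration
N = 36           # grid resolution per axis, as in the setup


def predicted_dx(force, mass, mu):
    """Analytic stand-in for a simulated push outcome."""
    accel = max(force / mass - mu * G, 0.0)
    return 0.5 * accel * PUSH_TIME ** 2


masses = [0.05 + i * (3.0 - 0.05) / (N - 1) for i in range(N)]
frictions = [0.1 + j * (1.0 - 0.1) / (N - 1) for j in range(N)]


def update(log_post, force, observed_dx, sigma=0.01):
    """One Bayesian update: add the Gaussian log-likelihood of the observation."""
    new = [
        [log_post[i][j] - 0.5 * ((observed_dx - predicted_dx(force, m, mu)) / sigma) ** 2
         for j, mu in enumerate(frictions)]
        for i, m in enumerate(masses)
    ]
    peak = max(max(row) for row in new)
    return [[v - peak for v in row] for row in new]  # shift for numerical stability


def posterior_means(log_post):
    weights = [[math.exp(v) for v in row] for row in log_post]
    total = sum(sum(row) for row in weights)
    mass_mean = sum(w * masses[i] for i, row in enumerate(weights) for w in row) / total
    mu_mean = sum(w * frictions[j] for row in weights for j, w in enumerate(row)) / total
    return mass_mean, mu_mean


rng = random.Random(0)
true_mass, true_mu = 0.5, 0.4
log_post = [[0.0] * N for _ in range(N)]  # uniform prior over the grid
for force in (3.0, 4.0, 6.0, 8.0, 10.0, 12.0):
    obs = predicted_dx(force, true_mass, true_mu) + rng.gauss(0.0, 0.002)
    log_post = update(log_post, force, obs)
    print(force, posterior_means(log_post))
mass_est, mu_est = posterior_means(log_post)
```

Pushes at different forces are what separate mass from friction: the slope of acceleration against force depends only on \(1/m\), while the offset depends only on \(\mu g\).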
Motivation. The ultimate payoff of physics-aware dreaming is zero-shot planning: once the world model knows the physical parameters, the robot can imagine the outcome of every candidate action and select the best one without physical trial-and-error. This experiment demonstrates that closed-form control decisions fall out of the structured state directly — no learned policy or reinforcement learning is needed. The model “dreams” the grasp outcome for every candidate force and selects the safe operating range analytically.
Setup. For a given material (rubber, wood, plastic, steel, egg, glass), the field state encodes mass, friction, and crush threshold. Grip force is swept from 0.5 N to 55 N and each force is classified as “drops,” “lifts safely,” or “crushes.” Run by render_exp3_grasp.py.
Measurement. Minimum lift force, crush threshold, and safe operating band per material.
Result. The safe grip range is a closed-form function of the field state: \(F_{\min} = m g / (2\mu)\), \(F_{\max} = F_{\text{crush}}\). No learned policy is needed — the physics in the state is the policy. This is the sparse-data generalization property at its most extreme: the robot needs zero demonstrations for the control decision because the structured state contains the physical quantities that analytically determine the answer. The interactive panel in Section 8 lets you explore this outcome map for each material.
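The outcome map can be sketched directly from the two closed-form bounds. The object parameters below are hypothetical placeholders, not values from render_exp3_grasp.py, and the function names are illustrative.

```python
G = 9.81


def classify_grip(force_n, mass_kg, mu, crush_force_n):
    """Label a grip force as 'drops', 'lifts safely', or 'crushes'."""
    f_min = mass_kg * G / (2.0 * mu)
    if force_n < f_min:
        return "drops"
    if force_n > crush_force_n:
        return "crushes"
    return "lifts safely"


def safe_band(mass_kg, mu, crush_force_n):
    """Return (F_min, F_max), or None when no force can both hold and not crush."""
    f_min = mass_kg * G / (2.0 * mu)
    return (f_min, crush_force_n) if f_min <= crush_force_n else None


# Hypothetical egg-like object; these numbers are illustrative, not repo values.
egg = dict(mass_kg=0.06, mu=0.4, crush_force_n=5.0)
sweep = [0.5 + 0.5 * k for k in range(110)]  # 0.5 N to 55 N in 0.5 N steps
labels = [classify_grip(f, **egg) for f in sweep]
print(safe_band(**egg), labels[0], labels[1], labels[-1])
```

Returning `None` from `safe_band` covers the case where an object is too heavy or too slippery to hold without exceeding its crush threshold, which a planner would treat as "choose a different grasp".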
Motivation. Experiments A–C rely on an analytic MuJoCo state or a grid-based estimator. A full realization of the representation requires a learned neural field that can generalize to novel objects. This experiment validates that the neural-field prototype in the repo can be trained, evaluated, and exported in a form suitable for the interactive visualizations on this page.
Setup. A neural field is trained on MuJoCo ground-truth SDF and material labels for three objects (wood block, rubber ball, metal cylinder) using nerf/train.py. Evaluation and export follow from nerf/evaluate.py.
Measurement. SDF mean absolute error (MAE) on a held-out cross-section at convergence, plus friction and density estimates versus ground truth.
Implication for the representation. The neural-field stack is already instrumented, trainable on MuJoCo ground truth, and capable of exporting SDF slices and material estimates at evaluation time. The field prototype appendix (Appendix Validation) shows the exported convergence curves and SDF cross-sections for all three objects.
These four experiments confirm three measurable properties of the interface: physics can diverge from appearance, material beliefs sharpen through interaction, and field state directly encodes the safe control range without additional learning. Section 6 follows with a controlled direct comparison against an image-only baseline.
The benchmark compares two feature maps under identical conditions: image ➝ predictor versus image ➝ structured state \(S\) ➝ predictor. Both branches receive the same calibration-push observations; both use a matched norm-bounded linear probe; the only difference is what the probe sees. This isolates the contribution of the feature map itself.
Observation parity: both branches receive the same calibration history O = (I_pre, I_post, K, T_c, f_c), where I_pre and I_post are the RGB frames before and after the calibration push.
Feature-map difference: the image branch uses pooled RGB directly, while the structured branch uses the deterministic state S = r(O) = (p_pre, p_post, u) extracted from that same history. This benchmark does not remove task-aligned inductive bias from the structured map; it holds fixed the downstream learner and the other controlled factors so the comparison isolates which feature map better serves that learner class.
Matched learner: both branches use the same norm-bounded linear probe class, the same squared-error loss, the same train / val / test splits, and the same optimizer.
OOD protocol: training uses only the front camera, while evaluation uses held-out top and angle1 cameras.
Formal scope: the controlled-parity contract and the repo-defined certification rule are specified in proofs/fairness_of_image_vs_state_comparison.md and proofs/superiority_certification_from_experiment.md.
This section is an abridged report rendering of the two source-of-truth proof notes in proofs/. The full markdown proofs remain authoritative; the purpose here is to make the logical chain explicit on the page itself.
For episode \(e\), define the observation history, target, and structured state by
The two branches feed the same norm-bounded linear hypothesis class
Assume the benchmark enforces: same observation parity, same label parity, same split parity, same learner class, same loss and optimizer family, no privileged test-time inputs, and the same held-out-camera protocol.
Claim. Under those assumptions, any empirical or population risk gap between the branches isolates the effect of the feature map presented to the matched learner class \(\mathcal H_B\), not extra data, extra labels, extra capacity, or asymmetric evaluation. This claim is about controlled parity for \(\mathcal H_B\); it does not say the two feature maps contain identical inductive bias.
Proof. Both \(x_{\mathrm{img}}\) and \(x_{\mathrm{state}}\) are deterministic functions of the same source variable \(O\). Both are trained and evaluated against the same \(Y\), on the same episode indices, under the same class \(\mathcal H_B\), with the same loss and optimizer family. The structured branch is forbidden from using simulator-only variables at test time, so it cannot benefit from privileged information. Both branches also face the same OOD camera shift. Therefore every controlled factor is matched except the map from \(O\) to the feature vector seen by the probe. That isolates the feature-map choice for \(\mathcal H_B\). QED.
Now assume the benchmark's illustrative toy generative model used in the certification note:
Here \(u_e\) is latent mobility, \(v_{c_e}\) is the camera-specific nuisance template, \(a_{c_e}\) is a camera scale factor, and \(\eta_e,\xi_e\) are noise terms. The extractor assumption is deliberate: this theorem is conditional on already having a camera-normalized structured statistic before the linear probe sees the input.
Claim. Under that toy model, the structured branch admits an \(\varepsilon\)-accurate linear predictor
Proof. Substitute \(\hat u_e = u_e + \delta_e\) into the structured predictor:
Subtract the target and apply the triangle inequality:
This proves a conditional statement: if the state extractor already recovers mobility up to bounded error, then the matched linear learner can access mobility directly rather than through a viewpoint-specific nuisance factor.
Why the image branch is OOD-fragile for this probe class. If training sees only the front camera and learns a direction aligned with \(v_{\mathrm{front}}\), then on an OOD camera \(c\) with \(v_c^\top v_{\mathrm{front}} = 0\), a matched linear image probe \(w = \gamma v_{\mathrm{front}}\) yields
so the mobility signal vanishes through the nuisance template. This exhibits one sufficient failure mode for matched linear image probes. It is not a necessity theorem against all image representations or all nonlinear learners.
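This failure mode can be checked numerically. The sketch assumes the image feature has the form \(x = a_c\, u\, v_c + \xi\), which is consistent with the symbols named above but is an assumption about the toy model; the proof note in proofs/ remains authoritative. Noise is set to zero so the effect is exact.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def image_feature(u, camera_scale, template, noise=None):
    """Assumed toy feature: x = a_c * u * v_c + xi (xi defaults to zero)."""
    noise = noise or [0.0] * len(template)
    return [camera_scale * u * t + n for t, n in zip(template, noise)]


v_front = [1.0, 0.0]  # nuisance template seen during training
v_top = [0.0, 1.0]    # held-out camera, orthogonal to v_front
gamma = 2.0
w = [gamma * t for t in v_front]  # linear probe aligned with the training template

for u in (0.1, 0.5, 1.0):
    in_dist = dot(w, image_feature(u, 1.0, v_front))
    ood = dot(w, image_feature(u, 1.0, v_top))
    print(u, in_dist, ood)  # in_dist scales with u; ood does not depend on u
```

With nonzero noise the out-of-distribution prediction would equal \(\gamma\, v_{\mathrm{front}}^\top \xi\), which still carries no information about mobility.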
Define the OOD risks and the paired superiority gap on the same held-out episodes:
On the test set, the experiment computes paired errors
The repo marks the benchmark as certified only if three conditions all hold:
Interpretation. Theorem 1 says the comparison controls the non-interface factors for the matched learner class. Theorem 2 gives a conditional toy-model reason why a camera-normalized structured statistic can preserve the physically relevant signal under held-out cameras. A strictly positive one-sided paired bootstrap lower bound supports \(\Delta > 0\) under the stated resampling procedure on the paired OOD episodes. The threshold \(\rho \ge 0.20\) and the per-seed sign constraint are repo-defined robustness criteria rather than theorem-implied constants. Therefore this run passes the benchmark's certification rule for a narrow claim: on this benchmark, under this matched linear-probe contract, the structured interface outperforms the image interface.
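The three criteria named in the interpretation can be sketched as a single check. This is an illustration under stated assumptions, not the rule implemented in the repo: \(\rho\) is assumed to be the mean paired gap divided by the mean image-branch error, the bootstrap uses a simple percentile bound, and the per-episode errors below are made-up placeholders (only the per-seed gaps come from the results table).

```python
import random


def bootstrap_lower_bound(paired_gaps, n_resamples=2000, alpha=0.05, seed=0):
    """One-sided (1 - alpha) percentile lower bound on the mean paired gap."""
    rng = random.Random(seed)
    n = len(paired_gaps)
    means = sorted(
        sum(rng.choice(paired_gaps) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    return means[int(alpha * n_resamples)]


def certify(image_errors, state_errors, per_seed_gaps, rho_min=0.20):
    """Return (passed, individual checks) for the three-part rule."""
    gaps = [ei - es for ei, es in zip(image_errors, state_errors)]
    mean_image = sum(image_errors) / len(image_errors)
    rho = (sum(gaps) / len(gaps)) / mean_image  # assumed definition of rho
    checks = {
        "bootstrap_lower_bound_positive": bootstrap_lower_bound(gaps) > 0.0,
        "relative_improvement": rho >= rho_min,
        "every_seed_positive": all(g > 0.0 for g in per_seed_gaps),
    }
    return all(checks.values()), checks


# Placeholder per-episode OOD errors (hypothetical) and per-seed gaps from the table.
image_errors = [8e-4, 9e-4, 7e-4, 1.0e-3]
state_errors = [4e-4, 5e-4, 4e-4, 4e-4]
per_seed_gaps = [2.52e-4, 8.49e-4, 2.71e-5]
passed, checks = certify(image_errors, state_errors, per_seed_gaps)
print(passed, checks)
```

Requiring all three checks makes the rule conservative: a large average gap driven by a single seed would fail the per-seed sign constraint even if the bootstrap bound were positive.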
The formal scope and proof details are in proofs/fairness_of_image_vs_state_comparison.md and proofs/superiority_certification_from_experiment.md.
| Seed | Image val MSE | State val MSE | Image OOD MAE | State OOD MAE | OOD gap |
|---|---|---|---|---|---|
| 0 | 3.93e-7 | 2.29e-7 | 6.91e-4 | 4.39e-4 | 2.52e-4 |
| 1 | 4.08e-7 | 2.34e-7 | 1.28e-3 | 4.32e-4 | 8.49e-4 |
| 2 | 3.87e-7 | 2.30e-7 | 4.63e-4 | 4.36e-4 | 2.71e-5 |
| Avg | 3.96e-7 | 2.31e-7 | 8.12e-4 | 4.36e-4 | 3.76e-4 |
The structured branch has lower validation MSE and lower held-out-camera MAE in every seed. The published evidence bundle includes the aggregate metrics, per-seed metrics, OOD prediction dump, and sampled GPU-utilization trace at training/results/planned_comparison_gpu_batched/. The profiled run reached 98% peak GPU utilization and 1317 MiB peak memory.
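The aggregate row can be cross-checked from the per-seed entries. The snippet below works from the rounded values printed in the table, so it reproduces the averages only to within rounding.

```python
image_ood = [6.91e-4, 1.28e-3, 4.63e-4]  # Image OOD MAE, seeds 0-2
state_ood = [4.39e-4, 4.32e-4, 4.36e-4]  # State OOD MAE, seeds 0-2


def mean(values):
    return sum(values) / len(values)


gaps = [img - st for img, st in zip(image_ood, state_ood)]
relative_reduction = mean(gaps) / mean(image_ood)

print(f"mean image OOD MAE: {mean(image_ood):.2e}")
print(f"mean state OOD MAE: {mean(state_ood):.2e}")
print(f"mean OOD gap:       {mean(gaps):.2e}")
print(f"relative reduction: {relative_reduction:.1%}")
```

On these rounded figures the structured branch reduces held-out-camera MAE by roughly 46% on average, with the smallest per-seed gap in seed 2.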
The benchmark in Section 6 evaluated a narrow slice of the representation: a linear probe operating on a deterministic structured state extracted from calibration-push RGB pairs. The full realization of the field-graph representation requires three trained modules: an image encoder that produces \(\Phi + G\) from raw observations, a state vectorizer that compresses the structured state into a dense scene vector for planning, and a cross-encoder that aligns scene, text, and image modalities for downstream control.
This section documents those three modules, their training objectives, and the staged training schedule. The code is runnable and instrumented on a single GPU. The architecture is the same pipeline that will be unified into end-to-end training and extended to a full Vision–Language–Action model in Section 9.
A CNN backbone extracts features from the RGB image. Slot Attention discovers object slots: each slot becomes one node in \(G\) and one latent \(z_i\) for \(\Phi\). Per-slot heads predict pose \(T_i\) and existence \(e_i\), and a pairwise edge head predicts contact probabilities.
The latent \(z_i\) serves double duty: it is the graph node’s state, and it parameterizes the object’s local neural field. Given \(z_i\) and a 3D query point \(x\), the field network returns the SDF value and material properties. So \(z_i\) is the compressed representation of object \(i\)’s geometry and physics.
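A geometry query against the field reduces to evaluating the SDF and its gradient. The sketch below uses an analytic sphere as a stand-in for the learned field and central differences in place of automatic differentiation; both choices, and the function names, are assumptions for illustration.

```python
import math


def sphere_sdf(p, center=(0.0, 0.0, 0.0), radius=0.05):
    """Analytic stand-in for the learned field: signed distance to a sphere."""
    return math.dist(p, center) - radius


def surface_normal(sdf, p, eps=1e-5):
    """Unit normal at p from the central-difference gradient of the SDF."""
    grad = []
    for axis in range(3):
        hi = list(p)
        lo = list(p)
        hi[axis] += eps
        lo[axis] -= eps
        grad.append((sdf(hi) - sdf(lo)) / (2.0 * eps))
    norm = math.sqrt(sum(g * g for g in grad))
    return [g / norm for g in grad]


# Query a candidate contact point on the +x side of the sphere.
normal = surface_normal(sphere_sdf, (0.05, 0.0, 0.0))
print(normal)  # approximately [1, 0, 0]
```

With a neural field, `sdf` would be the network evaluated at \((z_i, x)\), and the gradient with respect to \(x\) gives the same normal without a second camera view.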
Training: Fully supervised from MuJoCo ground truth. We have GT poses, segmentation masks, contacts, material parameters, and analytic SDFs. Hungarian matching assigns predicted slots to GT objects. Losses: MSE on poses, BCE on existence/masks/contacts, MSE on SDF at sampled points.
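The slot-to-object assignment step can be sketched as a minimum-cost matching. For readability the example below enumerates assignments by brute force, which finds the same optimum as the Hungarian algorithm for small slot counts; a real training loop would more likely call `scipy.optimize.linear_sum_assignment`. The cost values and the function name `match_slots` are hypothetical.

```python
from itertools import permutations


def match_slots(cost):
    """Minimum-cost assignment of predicted slots to ground-truth objects.

    cost[i][j] is the cost of assigning slot i to object j.
    Assumes there are at least as many slots as objects.
    Returns ([(slot, object), ...], total_cost).
    """
    n_slots, n_objects = len(cost), len(cost[0])
    best, best_cost = None, float("inf")
    for slots in permutations(range(n_slots), n_objects):
        total = sum(cost[s][j] for j, s in enumerate(slots))
        if total < best_cost:
            best, best_cost = slots, total
    return [(s, j) for j, s in enumerate(best)], best_cost


# Three predicted slots, two ground-truth objects (for example, pose error per pair).
cost = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.5, 0.5],
]
pairs, total = match_slots(cost)
print(pairs, total)  # slot 1 -> object 0, slot 0 -> object 1
```

Slots left unmatched (slot 2 here) would receive an existence target of 0, so the encoder learns to switch off unused slots.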
`EncodingModel` excerpt from `nerf/pipeline.py`. This is the image-to-Φ+G stage: CNN features, iterative slot updates, then per-slot heads for object latents, poses, masks, and pairwise contact structure.
import torch
import torch.nn as nn
import torch.nn.functional as F


class EncodingModel(nn.Module):
    def __init__(self, latent_dim=128, max_objects=8):
        super().__init__()
        sd = 128  # slot dimension
        self.max_objects = max_objects
        # CNN backbone: four stride-2 convolutions, 3 -> 256 channels.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 7, 2, 3),
            nn.ReLU(),
            nn.Conv2d(32, 64, 5, 2, 2),
            nn.ReLU(),
            nn.Conv2d(64, 128, 3, 2, 1),
            nn.ReLU(),
            nn.Conv2d(128, 256, 3, 2, 1),
            nn.ReLU(),
        )
        # Learned slot initialization plus query/key/value projections.
        self.slots_init = nn.Parameter(torch.randn(1, max_objects, sd) * 0.02)
        self.sq = nn.Linear(sd, 64)
        self.sk = nn.Linear(256, 64)
        self.sv = nn.Linear(256, sd)
        self.sgru = nn.GRUCell(sd, sd)
        self.smlp = nn.Sequential(nn.Linear(sd, sd), nn.ReLU(), nn.Linear(sd, sd))
        # Per-slot heads: latent z_i, 7-dimensional pose vector, existence logit.
        self.to_z = nn.Linear(sd, latent_dim)
        self.to_pose = nn.Linear(sd, 7)
        self.to_exist = nn.Linear(sd, 1)
        # Pairwise head: 32 edge features plus 1 contact logit.
        self.edge_net = nn.Sequential(nn.Linear(sd * 2, 64), nn.ReLU(), nn.Linear(64, 33))

    def forward(self, image):
        B, N = image.shape[0], self.max_objects
        # (B, 256, H', W') -> (B, H'*W', 256)
        f = self.backbone(image).flatten(2).permute(0, 2, 1)
        slots = self.slots_init.expand(B, -1, -1)
        for _ in range(3):  # three slot-attention iterations
            q, k, v = self.sq(slots), self.sk(f), self.sv(f)
            # Softmax over the slot axis so slots compete for each feature location.
            a = F.softmax(torch.einsum("bnd,bmd->bnm", q, k) / 8, dim=1)
            u = torch.einsum("bnm,bmd->bnd", a, v)
            slots = self.sgru(u.reshape(B * N, -1), slots.reshape(B * N, -1)).reshape(B, N, -1)
            slots = slots + self.smlp(slots)
        z = self.to_z(slots)
        poses = self.to_pose(slots)
        exist = torch.sigmoid(self.to_exist(slots).squeeze(-1))
        # Build all ordered slot pairs (i, j) for the edge head.
        si = slots.unsqueeze(2).expand(-1, -1, N, -1)
        sj = slots.unsqueeze(1).expand(-1, N, -1, -1)
        eo = self.edge_net(torch.cat([si, sj], -1))
        return {
            "phi": {"z": z, "poses": poses, "existence": exist},
            "graph": {
                "node_features": slots,
                "edge_features": eo[..., :32],
                "contact_probs": torch.sigmoid(eo[..., 32]),
            },
        }
This stage maps the structured \(\Phi + G\) state into a single dense vector \(\mathbf{s} \in \mathbb{R}^{512}\) intended to preserve physically relevant relationships while discarding viewpoint-specific detail.
Architecture: project each \((z_i, \text{pose}_i, \text{existence}_i)\) into node embeddings, run 3 rounds of GNN message passing weighted by contact probabilities, then attention-pool into a fixed-dimensional vector.
Training: Contrastive (InfoNCE). In MuJoCo, render the same scene from two different camera angles. Both views produce the same \(\Phi + G\) (same physics), so their state vectors \(\mathbf{s}_A\) and \(\mathbf{s}_B\) should be close. Different scenes in the batch are negatives. Hard negatives are scenes with the same geometry but swapped materials (they look identical, but the physics differs), which forces \(\mathbf{s}\) to encode material properties.
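The contrastive objective can be sketched without any framework. The version below is a one-directional InfoNCE over cosine similarities with an assumed temperature of 0.1; it is not the repo's training loss, and in PyTorch the same quantity is usually computed as a cross-entropy over the similarity matrix.

```python
import math


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss: positives[i] matches anchors[i]; other rows are negatives."""

    def normalize(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]

    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    loss = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        m = max(logits)  # subtract the max before exponentiating for stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_z - logits[i]
    return loss / len(a)


# Two scenes, each seen from two cameras (toy 2-D state vectors).
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b = [[0.9, 0.1], [0.1, 0.9]]
aligned = info_nce(view_a, view_b)
mismatched = info_nce(view_a, view_b[::-1])
print(aligned, mismatched)  # aligned pairs give the much smaller loss
```

Hard negatives with swapped materials would enter as extra rows in `positives` that share geometry with the anchor, so the loss can only be reduced by encoding the material difference.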
`StateVectorizer` excerpt from `nerf/pipeline.py`. This block is the structured-state compressor: graph-aware message passing over objects followed by attention pooling into the scene vector s.
class GNNLayer(nn.Module):
    def __init__(self, hd, ed):
        super().__init__()
        self.msg = nn.Sequential(nn.Linear(hd * 2 + ed, hd), nn.ReLU(), nn.Linear(hd, hd))
        self.upd = nn.Sequential(nn.Linear(hd * 2, hd), nn.ReLU(), nn.Linear(hd, hd))
        self.norm = nn.LayerNorm(hd)

    def forward(self, h, ef, cp, mask):
        # h: (B, N, hd) node states; ef: (B, N, N, ed) edge features;
        # cp: (B, N, N) contact probabilities; mask: (B, N, 1) existence weights.
        hi = h.unsqueeze(2).expand(-1, -1, h.shape[1], -1)
        hj = h.unsqueeze(1).expand(-1, h.shape[1], -1, -1)
        # Messages are scaled by contact probability and by the sender's existence.
        m = self.msg(torch.cat([hi, hj, ef], -1)) * cp.unsqueeze(-1) * mask.unsqueeze(1)
        # Residual update with the summed incoming messages, then re-mask.
        return self.norm(h + self.upd(torch.cat([h, m.sum(2)], -1))) * mask


class StateVectorizer(nn.Module):
    def __init__(self, latent_dim=128, output_dim=512):
        super().__init__()
        hd = 256
        # Node input = latent z (latent_dim) + pose (7) + existence (1).
        self.node_emb = nn.Sequential(nn.Linear(latent_dim + 8, hd), nn.ReLU(), nn.Linear(hd, hd))
        self.gnns = nn.ModuleList([GNNLayer(hd, 32) for _ in range(3)])
        self.aq = nn.Linear(hd, 64)
        self.ak = nn.Linear(hd, 64)
        self.proj = nn.Sequential(nn.Linear(hd, hd), nn.ReLU(), nn.Linear(hd, output_dim))
        self.norm = nn.LayerNorm(output_dim)

    def forward(self, phi, graph):
        h = self.node_emb(
            torch.cat([phi["z"], phi["poses"], phi["existence"].unsqueeze(-1)], -1)
        )
        mask = phi["existence"].unsqueeze(-1)
        h = h * mask
        for g in self.gnns:
            h = g(h, graph["edge_features"], graph["contact_probs"], mask)
        # Attention pooling: one query from the mean node state, keys from every node.
        q = self.aq(h.mean(1, keepdim=True))
        k = self.ak(h)
        # Low-existence slots receive a large negative bias before the softmax.
        a = F.softmax(torch.einsum("bid,bjd->bij", q, k) / 8 + (1 - mask.transpose(1, 2)) * -1e9, -1)
        pooled = torch.einsum("bij,bjd->bid", a, h).squeeze(1)
        return self.norm(self.proj(pooled))
The Cross-Encoder aligns three modalities in a shared embedding space, then predicts actions:
| Modality | Encoder | What it captures |
|---|---|---|
| Scene s | MLP projection | Physics: mass, friction, contacts, spatial arrangement |
| Text | Transformer | Semantics: “pick up the heavy metal block gently” |
| Image | CNN | Appearance: current visual observation |
Training Phase A (Contrastive): Align (s, text) pairs. Text auto-generated from MuJoCo GT: “3 objects: wood block (0.5kg, μ=0.4), rubber ball (1.1kg, μ=0.8), metal cylinder (3.0kg, μ=0.2)”. Loss: InfoNCE, same as CLIP.
Training Phase B (Supervised): Behavioral cloning from scripted MuJoCo policies (reach, grasp, push). The fused embedding (from cross-attention over s + text + image) feeds an action MLP. Loss: MSE on predicted vs. GT action. Alignment layers frozen; only action head fine-tuned.
This is the alignment and action stage: separate encoders for scene, text, and image, followed by cross-attention and an action head.
class CrossEncoder(nn.Module):
    def __init__(self, state_dim=512, embed_dim=512, action_dim=6):
        super().__init__()
        self.scene_proj = nn.Sequential(nn.Linear(state_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))
        self.text_emb = nn.Embedding(10000, 256)
        self.text_tf = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(256, 4, 512, batch_first=True), 2
        )
        self.img_enc = nn.Sequential(
            nn.Conv2d(3, 32, 7, 4, 3),
            nn.ReLU(),
            nn.Conv2d(32, 64, 3, 2, 1),
            nn.ReLU(),
            nn.Conv2d(64, 128, 3, 2, 1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.cross_attn = nn.MultiheadAttention(embed_dim, 8, batch_first=True)
        self.action_head = nn.Sequential(
            nn.Linear(embed_dim, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, action_dim), nn.Tanh(),
        )

    def forward(self, s, text_tokens=None, text_mask=None, image=None):
        # encode_text and encode_image are not part of this excerpt. For the
        # concatenation below to work they must return (B, embed_dim) tensors,
        # so the 256-d text and 128-d image features need a projection.
        se = self.scene_proj(s)
        te = self.encode_text(text_tokens, text_mask) if text_tokens is not None else None
        ie = self.encode_image(image) if image is not None else None
        # One token per available modality; the scene token is always first.
        toks = [se.unsqueeze(1)]
        if te is not None:
            toks.append(te.unsqueeze(1))
        if ie is not None:
            toks.append(ie.unsqueeze(1))
        seq = torch.cat(toks, 1)
        fused, _ = self.cross_attn(seq, seq, seq)
        # The action is read from the fused scene token (index 0).
        return {
            "action": self.action_head(fused[:, 0]),
            "scene_emb": se,
            "text_emb": te,
            "image_emb": ie,
        }
s was derived from the image, so including the image in Model 3 seems circular. Three reasons it’s not:
Training signal: Image↔s alignment loss teaches Model 1 to produce good Φ+G. If s can’t be matched back to its source image, the encoder is losing information.
Multi-timestep fusion: s may have been built up over multiple observations. The current image provides fresh visual grounding that accumulated s might lack.
When s comes from language: “Imagine a heavy metal block on a glass plate.” This generates s from text alone. The image provides the actual visual context.
| Phase | Model | Method | Steps | Time |
|---|---|---|---|---|
| 1 | Encoder | Supervised (MuJoCo GT) | ~50K | ~4h |
| 2 | Vectorizer | Contrastive (scene pairs) | ~100K | ~8h |
| 3a | Cross-Encoder alignment | Contrastive (s, text) | ~50K | ~4h |
| 3b | Cross-Encoder action | Behavioral cloning | ~100K | ~8h |
| Total | | | ~300K | ~24h |
All data generated on-the-fly from MuJoCo. No pre-collected dataset needed.
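Phases 2 and 3a in the table are both contrastive. As a rough illustration of the symmetric InfoNCE objective they rely on, here is a dependency-free sketch in plain Python. The function names (`info_nce`, `cross_entropy_diag`) and the toy vectors are hypothetical and chosen only for illustration; the temperature of 0.07 mirrors the initial `logit_scale` in the listing, and the real training code operates on batched tensors.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (guarding against zero vectors)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE: a[i] and b[i] are positives, every other pair is a negative."""
    a = [l2_normalize(v) for v in a]
    b = [l2_normalize(v) for v in b]
    logits = [[sum(x * y for x, y in zip(u, v)) / temperature for v in b] for u in a]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))

# Toy batch of three "scenes" seen from two views (illustrative values only).
view_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
view_b = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
shuffled_b = [view_b[1], view_b[2], view_b[0]]  # deliberately break the pairing

aligned_loss = info_nce(view_a, view_b)
misaligned_loss = info_nce(view_a, shuffled_b)
print(f"aligned: {aligned_loss:.6f}  misaligned: {misaligned_loss:.3f}")
```

The loss is low when matching indices point in similar directions and high when the pairing is broken, which is the behavior both the scene-pair phase and the scene-text phase depend on.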
`FullPipeline` wiring from `nerf/pipeline.py`. The top-level module is intentionally short. It only composes the three trained stages in sequence: encoder, vectorizer, then cross-encoder.
class FullPipeline(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = EncodingModel(128, 8)
        self.vectorizer = StateVectorizer(128, 512)
        self.cross_encoder = CrossEncoder(512, 512, 6)

    def forward(self, image, text_tokens=None, text_mask=None):
        pg = self.encoder(image)
        s = self.vectorizer(pg["phi"], pg["graph"])
        out = self.cross_encoder(s, text_tokens, text_mask, image)
        out["phi"] = pg["phi"]
        out["graph"] = pg["graph"]
        out["state_vector"] = s
        return out
Image to Φ+G to state vector to action. The three stages below cover a supervised encoder, a contrastive state vectorizer, and a cross-encoder trained with contrastive alignment plus behavioral cloning.
import torch
import torch.nn as nn
import torch.nn.functional as F

# ============ MODEL 1: Encoding Model (SUPERVISED) ============
class EncodingModel(nn.Module):
    def __init__(self, latent_dim=128, max_objects=8):
        super().__init__()
        self.latent_dim, self.max_objects = latent_dim, max_objects
        sd = 128
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 7, 2, 3), nn.ReLU(), nn.Conv2d(32, 64, 5, 2, 2), nn.ReLU(),
            nn.Conv2d(64, 128, 3, 2, 1), nn.ReLU(), nn.Conv2d(128, 256, 3, 2, 1), nn.ReLU())
        self.slots_init = nn.Parameter(torch.randn(1, max_objects, sd) * 0.02)
        self.sq = nn.Linear(sd, 64)
        self.sk = nn.Linear(256, 64)
        self.sv = nn.Linear(256, sd)
        self.sgru = nn.GRUCell(sd, sd)
        self.smlp = nn.Sequential(nn.Linear(sd, sd), nn.ReLU(), nn.Linear(sd, sd))
        self.to_z = nn.Linear(sd, latent_dim)
        self.to_pose = nn.Linear(sd, 7)
        self.to_exist = nn.Linear(sd, 1)
        self.edge_net = nn.Sequential(nn.Linear(sd * 2, 64), nn.ReLU(), nn.Linear(64, 33))
        self.mask_dec = nn.Sequential(nn.Linear(sd, 64), nn.ReLU(), nn.Linear(64, 16 * 16))

    def forward(self, image):
        B, N = image.shape[0], self.max_objects
        f = self.backbone(image).flatten(2).permute(0, 2, 1)
        slots = self.slots_init.expand(B, -1, -1)
        for _ in range(3):
            q, k, v = self.sq(slots), self.sk(f), self.sv(f)
            a = F.softmax(torch.einsum('bnd,bmd->bnm', q, k) / 8, dim=1)
            a = a / (a.sum(-1, keepdim=True) + 1e-8)
            u = torch.einsum('bnm,bmd->bnd', a, v)
            slots = self.sgru(u.reshape(B * N, -1), slots.reshape(B * N, -1)).reshape(B, N, -1)
            slots = slots + self.smlp(slots)
        z = self.to_z(slots)
        poses = self.to_pose(slots)
        exist = torch.sigmoid(self.to_exist(slots).squeeze(-1))
        si = slots.unsqueeze(2).expand(-1, -1, N, -1)
        sj = slots.unsqueeze(1).expand(-1, N, -1, -1)
        eo = self.edge_net(torch.cat([si, sj], -1))
        masks = torch.sigmoid(self.mask_dec(slots).reshape(B, N, 16, 16))
        return {'phi': {'z': z, 'poses': poses, 'existence': exist},
                'graph': {'node_features': slots, 'edge_features': eo[..., :32],
                          'contact_probs': torch.sigmoid(eo[..., 32])}, 'masks': masks}

# Training: SUPERVISED from MuJoCo GT
# Losses: MSE(pose), BCE(existence), BCE(masks), BCE(contacts)
# Uses Hungarian matching to assign predicted slots to GT objects

# ============ MODEL 2: State Vectorizer (CONTRASTIVE) ============
class GNNLayer(nn.Module):
    def __init__(self, hd, ed):
        super().__init__()
        self.msg = nn.Sequential(nn.Linear(hd * 2 + ed, hd), nn.ReLU(), nn.Linear(hd, hd))
        self.upd = nn.Sequential(nn.Linear(hd * 2, hd), nn.ReLU(), nn.Linear(hd, hd))
        self.norm = nn.LayerNorm(hd)

    def forward(self, h, ef, cp, mask):
        B, N, D = h.shape
        hi = h.unsqueeze(2).expand(-1, -1, N, -1)
        hj = h.unsqueeze(1).expand(-1, N, -1, -1)
        m = self.msg(torch.cat([hi, hj, ef], -1)) * cp.unsqueeze(-1) * mask.unsqueeze(1)
        return self.norm(h + self.upd(torch.cat([h, m.sum(2)], -1))) * mask

class StateVectorizer(nn.Module):
    def __init__(self, latent_dim=128, output_dim=512):
        super().__init__()
        hd = 256
        self.node_emb = nn.Sequential(nn.Linear(latent_dim + 8, hd), nn.ReLU(), nn.Linear(hd, hd))
        self.gnns = nn.ModuleList([GNNLayer(hd, 32) for _ in range(3)])
        self.aq = nn.Linear(hd, 64)
        self.ak = nn.Linear(hd, 64)
        self.proj = nn.Sequential(nn.Linear(hd, hd), nn.ReLU(), nn.Linear(hd, output_dim))
        self.norm = nn.LayerNorm(output_dim)

    def forward(self, phi, graph):
        B, N = phi['z'].shape[:2]
        h = self.node_emb(torch.cat([phi['z'], phi['poses'], phi['existence'].unsqueeze(-1)], -1))
        mask = phi['existence'].unsqueeze(-1)
        h = h * mask
        for g in self.gnns:
            h = g(h, graph['edge_features'], graph['contact_probs'], mask)
        q = self.aq(h.mean(1, keepdim=True))
        k = self.ak(h)
        a = F.softmax(torch.einsum('bid,bjd->bij', q, k) / 8 + (1 - mask.transpose(1, 2)) * -1e9, -1)
        return self.norm(self.proj(torch.einsum('bij,bjd->bid', a, h).squeeze(1)))

# Training: SELF-SUPERVISED CONTRASTIVE (InfoNCE)
# Positive pairs: same MuJoCo scene from 2 different camera angles → same s
# Negative pairs: different scenes in the batch → different s
# Hard negatives: same geometry, swapped materials → different s
# ============ MODEL 3: Cross-Encoder (CONTRASTIVE + SUPERVISED) ============
class CrossEncoder(nn.Module):
    def __init__(self, state_dim=512, embed_dim=512, action_dim=6):
        super().__init__()
        self.scene_proj = nn.Sequential(nn.Linear(state_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))
        self.scene_norm = nn.LayerNorm(embed_dim)
        self.text_emb = nn.Embedding(10000, 256)
        self.text_pos = nn.Parameter(torch.randn(1, 64, 256) * 0.02)
        self.text_tf = nn.TransformerEncoder(nn.TransformerEncoderLayer(256, 4, 512, batch_first=True), 2)
        self.text_proj = nn.Linear(256, embed_dim)
        self.text_norm = nn.LayerNorm(embed_dim)
        self.img_enc = nn.Sequential(
            nn.Conv2d(3, 32, 7, 4, 3), nn.ReLU(), nn.Conv2d(32, 64, 3, 2, 1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, 2, 1), nn.ReLU(), nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.img_proj = nn.Sequential(nn.Linear(128, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))
        self.img_norm = nn.LayerNorm(embed_dim)
        self.cross_attn = nn.MultiheadAttention(embed_dim, 8, batch_first=True)
        self.fuse_norm = nn.LayerNorm(embed_dim)
        self.fuse_mlp = nn.Sequential(nn.Linear(embed_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))
        self.action_head = nn.Sequential(
            nn.Linear(embed_dim, 256), nn.ReLU(), nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, action_dim), nn.Tanh())
        self.logit_scale = nn.Parameter(torch.tensor(1 / 0.07).log())

    def encode_scene(self, s):
        return self.scene_norm(self.scene_proj(s))

    def encode_text(self, tok, mask=None):
        # Learned positional table covers at most 64 tokens.
        x = self.text_emb(tok) + self.text_pos[:, :tok.shape[1]]
        return self.text_norm(self.text_proj(self.text_tf(x, src_key_padding_mask=mask)[:, 0]))

    def encode_image(self, img):
        return self.img_norm(self.img_proj(self.img_enc(img)))

    def forward(self, s, text_tokens=None, text_mask=None, image=None):
        se = self.encode_scene(s)
        te = self.encode_text(text_tokens, text_mask) if text_tokens is not None else None
        ie = self.encode_image(image) if image is not None else None
        toks = [se.unsqueeze(1)]
        if te is not None:
            toks.append(te.unsqueeze(1))
        if ie is not None:
            toks.append(ie.unsqueeze(1))
        seq = torch.cat(toks, 1)
        f, _ = self.cross_attn(seq, seq, seq)
        fused = self.fuse_norm(f[:, 0] + self.fuse_mlp(f[:, 0]))
        return {'action': self.action_head(fused), 'scene_emb': se, 'text_emb': te, 'image_emb': ie, 'fused': fused}

    def contrastive_loss(self, se, te):
        # dim must be passed by keyword: the second positional argument of F.normalize is p, not dim.
        s, t = F.normalize(se, dim=-1), F.normalize(te, dim=-1)
        l = self.logit_scale.exp() * s @ t.T
        lb = torch.arange(len(l), device=l.device)
        return (F.cross_entropy(l, lb) + F.cross_entropy(l.T, lb)) / 2

# Training Phase A: CONTRASTIVE alignment
# (s, text) pairs from MuJoCo. Text auto-generated from GT properties.
# Loss: InfoNCE on (scene_embedding, text_embedding)
# Training Phase B: SUPERVISED action prediction
# Behavioral cloning from scripted MuJoCo policies (reach, grasp, push).
# Loss: MSE on predicted vs GT action. Freeze alignment, fine-tune action head.
# ============ FULL PIPELINE ============
class FullPipeline(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = EncodingModel(128, 8)
        self.vectorizer = StateVectorizer(128, 512)
        self.cross_encoder = CrossEncoder(512, 512, 6)

    def forward(self, image, text_tokens=None, text_mask=None):
        pg = self.encoder(image)
        s = self.vectorizer(pg['phi'], pg['graph'])
        out = self.cross_encoder(s, text_tokens, text_mask, image)
        out['phi'] = pg['phi']
        out['graph'] = pg['graph']
        out['state_vector'] = s
        return out
TRAINING_SCHEDULE = """
Training Schedule (all MuJoCo, single GPU, ~24h total)
Phase 1 — Model 1 Encoder (SUPERVISED) ~50K steps, ~4h
Data: Random MuJoCo scenes, render RGB + extract GT
Loss: MSE(pose) + BCE(existence, masks, contacts) + MSE(SDF at sampled points)
Metric: Pose error <1cm, contact F1 >0.9
Phase 2 — Model 2 Vectorizer (CONTRASTIVE) ~100K steps, ~8h
Data: Scene pairs (same scene, 2 cameras) from MuJoCo
Loss: InfoNCE on (s_view_A, s_view_B)
Metric: Scene retrieval accuracy, physics probe R²>0.8
Phase 3a — Model 3 Alignment (CONTRASTIVE) ~50K steps, ~4h
Data: (s, text) pairs. Text auto-generated from GT.
Loss: InfoNCE on (scene_emb, text_emb)
Metric: Text→scene retrieval accuracy
Phase 3b — Model 3 Action (SUPERVISED) ~100K steps, ~8h
Data: Scripted MuJoCo demonstrations (reach/grasp/push)
Loss: MSE(predicted_action, gt_action)
Metric: Task success rate on held-out scenes
"""
if __name__ == "__main__":
    m = FullPipeline()
    total = sum(p.numel() for p in m.parameters())
    print(f"Total: {total:,} params")
    for n, s in [("Encoder", m.encoder), ("Vectorizer", m.vectorizer), ("CrossEncoder", m.cross_encoder)]:
        print(f" {n}: {sum(p.numel() for p in s.parameters()):,}")
    B = 2
    img = torch.randn(B, 3, 256, 256)
    txt = torch.randint(0, 1000, (B, 12))
    with torch.no_grad():
        out = m(img, text_tokens=txt)
    print(f"\nz: {list(out['phi']['z'].shape)}")
    print(f"s: {list(out['state_vector'].shape)}")
    print(f"action: {list(out['action'][0].shape)} = {[round(x.item(), 3) for x in out['action'][0]]}")
    print(TRAINING_SCHEDULE)
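The Model 1 training notes in the listing say that predicted slots are assigned to ground-truth objects by Hungarian matching before the per-object losses are computed. The sketch below illustrates that assignment step only. It swaps the Hungarian algorithm for a brute-force search over slot orderings, which returns the same optimum and is still cheap at `max_objects=8`; a real implementation would more likely call `scipy.optimize.linear_sum_assignment`. The function name `match_slots` and the position-only cost are assumptions, not the repository's code.

```python
import math
from itertools import permutations

def match_slots(pred_positions, gt_positions):
    """Return (assignment, cost), where assignment[g] is the slot matched to GT object g.

    Brute force over slot orderings: a stand-in for Hungarian matching that gives
    the same minimum-cost assignment for small slot counts.
    """
    n_slots, n_gt = len(pred_positions), len(gt_positions)
    assert n_gt <= n_slots, "need at least as many slots as GT objects"
    best, best_cost = None, math.inf
    for perm in permutations(range(n_slots), n_gt):
        cost = sum(math.dist(pred_positions[s], gt_positions[g]) for g, s in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return list(best), best_cost

# Three predicted slots, two ground-truth objects (illustrative coordinates in metres).
pred = [(0.50, 0.00, 0.00), (0.00, 0.00, 0.00), (1.00, 1.00, 0.00)]
gt = [(0.00, 0.00, 0.01), (0.50, 0.02, 0.00)]

assignment, cost = match_slots(pred, gt)
unmatched = sorted(set(range(len(pred))) - set(assignment))
print(assignment, round(cost, 3), unmatched)
```

Under this reading, matched slots would receive pose, mask, and contact targets from their ground-truth partner, while unmatched slots (slot 2 here) would plausibly get an existence target of zero.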
This appendix gives the complete formal specification of the field half of the representation. The field \(\Phi\) is a neural function parameterized by an object latent \(z_i\) that maps a 3D query point (optionally with direction and time) to distributional predictions over the physical quantities relevant to robotic manipulation. It is the primary mechanism through which geometry, material, and dynamics information becomes explicitly queryable at task time.
The notation below is an interface summary. Each output category is realized by a separate prediction head; the field should be read as a bundle of property-specific heads exposed through one query interface, not as a single fully specified joint probabilistic model over every quantity simultaneously. The implications of that distinction for downstream use are noted at each relevant step.
| Input | Type | When needed |
|---|---|---|
| x | R³ — 3D position | Always |
| d̂ | S² — unit direction | Anisotropic materials, appearance |
| t | R — time | Dynamic scenes |
| z_i | R^D — object latent | Always (conditions the field per-object) |
Some physical properties are isotropic: their value at a point does not depend on direction. Signed distance, density, and temperature fall into this category. Others are inherently anisotropic, and querying them without a direction is not well-defined:
Friction \(\mu(x, \hat{d})\): Brushed metal has different friction along vs. across the grain; fabric has weave-aligned resistance. A robot sliding in direction \(\hat{d}\) requires friction in that specific direction, not an average over directions.
Stiffness \(E(x, \hat{d})\): Wood is approximately 20× stiffer along the grain than across it. The Young's modulus under uniaxial loading in direction \(\hat{d}\) satisfies \(1/E(x, \hat{d}) = S_{ijkl}(x)\,\hat{d}_i \hat{d}_j \hat{d}_k \hat{d}_l\), where \(\mathbf{S} = \mathbf{C}^{-1}\) is the fourth-order compliance tensor, the inverse of the elasticity tensor \(\mathbf{C}\).
Thermal conductivity \(\kappa(x, \hat{d})\): Carbon-fiber composites conduct heat roughly 10× more efficiently along the fiber axis than transversely.
Appearance \(c(x, \hat{d})\): This is precisely the NeRF formulation — color depends on viewing direction (specular highlights, iridescence). The field extends the same factorization principle from appearance to physical material properties.
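To make the directional-stiffness point concrete, the sketch below evaluates the Young's modulus of an orthotropic material along an arbitrary unit direction. It uses the standard contraction of the compliance tensor with the direction, written out for material axes aligned with x, y, z. The function name and all material constants are hypothetical; the constants are chosen only so that the along-grain modulus is 20× the cross-grain one, in line with the wood example above.

```python
import math

def directional_youngs_modulus(d, E, G, nu):
    """Young's modulus along unit direction d for an orthotropic material.

    d:  (d1, d2, d3) unit vector in the material frame
    E:  (E1, E2, E3) moduli along the material axes
    G:  (G12, G23, G13) shear moduli
    nu: (nu12, nu23, nu13) Poisson ratios
    Implements 1/E(d) = S_ijkl d_i d_j d_k d_l for an axis-aligned compliance tensor.
    """
    d1, d2, d3 = d
    E1, E2, E3 = E
    G12, G23, G13 = G
    nu12, nu23, nu13 = nu
    inv_E = (
        d1**4 / E1 + d2**4 / E2 + d3**4 / E3
        + d1**2 * d2**2 * (1.0 / G12 - 2.0 * nu12 / E1)
        + d2**2 * d3**2 * (1.0 / G23 - 2.0 * nu23 / E2)
        + d1**2 * d3**2 * (1.0 / G13 - 2.0 * nu13 / E1)
    )
    return 1.0 / inv_E

# Illustrative wood-like constants in Pa; axis 1 is the grain direction.
E = (10e9, 0.5e9, 0.5e9)
G = (0.6e9, 0.1e9, 0.6e9)
nu = (0.3, 0.4, 0.3)

E_along = directional_youngs_modulus((1.0, 0.0, 0.0), E, G, nu)
E_across = directional_youngs_modulus((0.0, 1.0, 0.0), E, G, nu)
s = math.sqrt(0.5)
E_diag = directional_youngs_modulus((s, s, 0.0), E, G, nu)
print(f"along: {E_along:.3e}  across: {E_across:.3e}  45 degrees: {E_diag:.3e}")
```

A direction-free query would have to collapse these three quite different values into one number, which is the failure mode the directional branch is meant to avoid.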
| Output | Distribution | Depends on d̂? |
|---|---|---|
| SDF d(x) | Gaussian(μ, σ) | No |
| Normal n̂(x) | Derived: ∇d / ‖∇d‖ | No (computed via autograd) |
| Density ρ(x) | LogNormal | No |
| Temperature T(x) | Gaussian | No |
| Restitution e(x) | Beta(α,β) ∈ [0,1] | No |
| Friction μ(x, d̂) | LogNormal | Yes |
| Stiffness E(x, d̂) | LogNormal | Yes |
| Velocity v(x) | 3× Gaussian | No (vector output) |
| Stress σ_ij(x) | 6× Gaussian (Voigt) | No (tensor output) |
| Color c(x, d̂) | RGB | Yes (as in NeRF) |
| Semantics | Categorical | No |
These outputs live in heterogeneous spaces: scalars, 3-vectors, symmetric tensors in Voigt form, and categorical logits. As noted above, each is produced by its own head rather than by one joint probabilistic law over every quantity at once.
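As a small worked example of what "distributional output" means for the positive and bounded quantities in the table, the sketch below reproduces in plain Python the arithmetic the LogNormal and Beta heads perform on their raw network outputs. The function names are hypothetical, and the raw inputs are arbitrary numbers standing in for the outputs of a linear layer.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)

def lognormal_moments(log_mu, raw_sigma, scale=1.0):
    """Mean and standard deviation of a LogNormal given log-space parameters."""
    s = softplus(raw_sigma) + 1e-4          # positive log-space sigma
    mean = math.exp(log_mu + 0.5 * s * s) * scale
    std = mean * math.sqrt(math.exp(s * s) - 1.0)
    return mean, std

def beta_moments(raw_alpha, raw_beta, upper_bound=1.0):
    """Mean and standard deviation of a Beta rescaled to [0, upper_bound]."""
    a = softplus(raw_alpha) + 1.01          # > 1 keeps the density unimodal
    b = softplus(raw_beta) + 1.01
    mean = a / (a + b) * upper_bound
    var = a * b / ((a + b) ** 2 * (a + b + 1.0))
    return mean, math.sqrt(var) * upper_bound

density_mean, density_std = lognormal_moments(0.0, 0.0, scale=1000.0)   # kg/m^3
poisson_mean, poisson_std = beta_moments(0.0, 0.0, upper_bound=0.5)
print(f"density: {density_mean:.1f} +/- {density_std:.1f}")
print(f"poisson: {poisson_mean:.3f} +/- {poisson_std:.3f}")
```

With symmetric raw inputs the Beta head lands at the midpoint of its range (0.25 for a Poisson ratio bounded by 0.5), and the LogNormal head can never report a negative density regardless of what the network emits.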
The network follows a shared-trunk, branched-head design. A shared trunk processes the concatenation of the Fourier-encoded position \(x\), time encoding \(t\), and object latent \(z_i\) into a feature vector \(h \in \mathbb{R}^H\). Geometry and isotropic material branches take only \(h\) as input. Directional branches (friction, stiffness, thermal conductivity, appearance) concatenate \(h\) with a separate encoding of \(\hat{d}\), following the NeRF design principle that geometry should be view-direction invariant.
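The Fourier encoding feeding the trunk is easy to check by hand. Here is a minimal plain-Python sketch that follows the same layout as the listing's `FourierEncoding` (raw input first, then per coordinate all sines followed by all cosines, with frequencies 2^0 through 2^(L-1)). The function name is hypothetical; the frequency counts and latent size are the listing's defaults.

```python
import math

def fourier_encode(x, n_freqs=6, include_input=True):
    """Map a length-n list to length n * (1 + 2 * n_freqs), or n * 2 * n_freqs without the raw input."""
    freqs = [2.0 ** k for k in range(n_freqs)]
    enc = []
    for xi in x:
        enc.extend(math.sin(xi * f) for f in freqs)
        enc.extend(math.cos(xi * f) for f in freqs)
    return (list(x) + enc) if include_input else enc

pos_dim = len(fourier_encode([0.01, -0.02, 0.03], n_freqs=6))   # 3 * (1 + 12)
time_dim = len(fourier_encode([0.0], n_freqs=4))                # 1 * (1 + 8)
latent_dim = 128
trunk_input_dim = pos_dim + time_dim + latent_dim
print(pos_dim, time_dim, trunk_input_dim)
```

With the default settings the trunk therefore sees a 176-dimensional input: 39 for position, 9 for time, and 128 for the object latent.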
Surface normals are derived by automatic differentiation of the SDF, not predicted by a separate head. In regions where the learned SDF is smooth and \(\nabla d(x) \neq 0\), the normal \(\hat{n} = \nabla d / \|\nabla d\|\) is perpendicular to the level set by construction. Near poorly learned boundaries, nonsmooth regions, or degenerate SDF gradients, the derived normal may be numerically unstable — a known limitation noted in the main text.
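The derived-normal idea and its degenerate case can be illustrated without a neural network. The sketch below uses an analytic sphere SDF and central finite differences as a stand-in for automatic differentiation (the field itself uses autograd); the function names, the step size `h`, and the threshold `eps` are all assumptions made for illustration.

```python
import math

def sphere_sdf(p, radius=0.05):
    """Signed distance to a sphere centred at the origin (negative inside)."""
    return math.sqrt(sum(c * c for c in p)) - radius

def sdf_normal(sdf, p, h=1e-5, eps=1e-8):
    """Return (unit normal or None, gradient norm) at point p.

    Central differences stand in for autograd here. None signals a degenerate
    gradient, where a normal is not well-defined.
    """
    grad = []
    for i in range(3):
        hi, lo = list(p), list(p)
        hi[i] += h
        lo[i] -= h
        grad.append((sdf(hi) - sdf(lo)) / (2.0 * h))
    norm = math.sqrt(sum(g * g for g in grad))
    if norm < eps:
        return None, norm
    return [g / norm for g in grad], norm

surface_normal, surface_norm = sdf_normal(sphere_sdf, (0.05, 0.0, 0.0))
centre_normal, centre_norm = sdf_normal(sphere_sdf, (0.0, 0.0, 0.0))
print(surface_normal, centre_normal)
```

On the surface the normal points radially outward, as expected. At the sphere's centre the symmetric differences cancel and the gradient vanishes, so a consumer of the field would need a guard of this kind rather than trusting a normalized near-zero vector.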
Complete PyTorch implementation (`field_complete.py`). This implementation models \(\Phi(x, \hat{d}, t \mid z_i)\) with a shared trunk and separate geometry, material, dynamics, appearance, and semantic heads. Direction only enters the branches that actually need anisotropic or view-dependent context.
Complete Field Specification: Φ(x, d̂, t | z_i)
================================================
The belief field is a neural function that maps a query (position, direction, time)
to DISTRIBUTIONS over every physical quantity relevant for manipulation.
Design Principles:
1. Position x ∈ R³ is always required
2. Direction d̂ ∈ S² is required for anisotropic/directional properties
3. Time t is required for dynamic properties
4. Object latent z_i conditions the field per-object
5. Every output is DISTRIBUTIONAL (mean + uncertainty)
6. Outputs span scalars, vectors, and tensors — matching physics
This is the NeRF paradigm extended from appearance to full multiphysics:
NeRF: Φ(x, d̂) → (color, density) [appearance only]
Ours: Φ(x, d̂, t | z) → (geometry, material, dynamics, semantics)
Architecture:
The network has a SHARED TRUNK that processes (x, t, z_i) into a feature vector,
then BRANCHES for different output groups. The direction d̂ enters ONLY for
branches that need it (anisotropic material, appearance) — following the NeRF
insight that geometry shouldn't depend on view direction.
┌─────────────┐
│ x ∈ R³ │──→ Fourier Encoding ──→┐
│ t ∈ R │──→ Time Encoding ──→ │
│ z_i ∈ R^D │──────────────────────→ ├──→ Shared Trunk (MLP) ──→ h ∈ R^H
└─────────────┘ │
│
┌──────────── h ──────────────────────┐│
│ ││
│ ┌─── Geometry Branch ←── h ││
│ │ (position-only) ││
│ │ → SDF: μ_d, σ_d ││
│ │ → Normal: n̂ = ∇μ_d / |∇μ_d| ││ (computed via autograd, not predicted)
│ │ → Curvature: κ = ∇²d ││ (second derivative of SDF)
│ │ ││
│ ┌─── Isotropic Material Branch ← h││
│ │ (position-only) ││
│ │ → Density: μ_ρ, σ_ρ ││ LogNormal
│ │ → Temperature: μ_T, σ_T ││ Normal (or LogNormal for Kelvin)
│ │ → Restitution: α_e, β_e ││ Beta[0,1]
│ │ → Poisson ratio: α_ν, β_ν ││ Beta[0,0.5]
│ │ ││
│ ┌─── Directional Material Branch ←─┤│← d̂ ∈ S² (direction enters here)
│ │ (position + direction) ││
│ │ → Friction: μ_f(d̂), σ_f ││ Friction along sliding direction d̂
│ │ → Stiffness: E(d̂), σ_E ││ Young's modulus along d̂
│ │ → Thermal cond: κ(d̂), σ_κ ││ Conductivity along d̂
│ │ ││
│ ┌─── Dynamics Branch ← h ││
│ │ (position-only, vector output)││
│ │ → Velocity: v ∈ R³ ││
│ │ → Stress: σ_ij (6 indep) ││ Symmetric 3x3 → Voigt notation
│ │ → Strain: ε_ij (6 indep) ││
│ │ ││
│ ┌─── Appearance Branch ←───────────┤│← d̂ (view direction, as in NeRF)
│ │ → Color: c ∈ R³ ││
│ │ → Opacity: α ∈ [0,1] ││
│ │ ││
│ ┌─── Semantic Branch ← h ││
│ │ → Class logits: R^C ││
│ │ → Affordance: R^A ││
│ └─────────────────────────────────┘│
└─────────────────────────────────────┘
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
from typing import Dict, Optional, Tuple
# ============================================================
# Encodings
# ============================================================
class FourierEncoding(nn.Module):
    """Sinusoidal positional encoding. Maps R^n → R^(n + 2nL)."""
    def __init__(self, input_dim: int = 3, n_freqs: int = 6, include_input: bool = True):
        super().__init__()
        self.input_dim = input_dim
        self.n_freqs = n_freqs
        self.include_input = include_input
        freqs = 2.0 ** torch.linspace(0, n_freqs - 1, n_freqs)
        self.register_buffer('freqs', freqs)

    @property
    def out_dim(self):
        return self.input_dim * (1 + 2 * self.n_freqs) if self.include_input else self.input_dim * 2 * self.n_freqs

    def forward(self, x):
        # x: [..., input_dim]
        x_proj = x.unsqueeze(-1) * self.freqs  # [..., input_dim, L]
        enc = torch.cat([x_proj.sin(), x_proj.cos()], dim=-1)  # [..., input_dim, 2L]
        enc = enc.reshape(*x.shape[:-1], -1)  # [..., input_dim * 2L]
        return torch.cat([x, enc], dim=-1) if self.include_input else enc
class DirectionEncoding(nn.Module):
    """Encode unit direction vector d̂ ∈ S².

    Can be input as:
    - Unit vector (dx, dy, dz) with |d|=1
    - Spherical angles (θ, φ)
    Uses a Fourier encoding of the Cartesian unit vector (not a spherical-harmonics
    basis), with fewer frequencies than the position encoding since directional
    variation is typically smoother.
    """
    def __init__(self, n_freqs: int = 4):
        super().__init__()
        self.pos_enc = FourierEncoding(input_dim=3, n_freqs=n_freqs)

    @property
    def out_dim(self):
        return self.pos_enc.out_dim

    def forward(self, d: torch.Tensor):
        """d: [..., 3] unit vectors OR [..., 2] (theta, phi)"""
        if d.shape[-1] == 2:
            # Convert spherical to cartesian
            theta, phi = d[..., 0], d[..., 1]
            dx = torch.sin(theta) * torch.cos(phi)
            dy = torch.sin(theta) * torch.sin(phi)
            dz = torch.cos(theta)
            d = torch.stack([dx, dy, dz], dim=-1)
        # Normalize
        d = F.normalize(d, dim=-1)
        return self.pos_enc(d)
# ============================================================
# Distribution Heads (proper probabilistic outputs)
# ============================================================
class GaussianHead(nn.Module):
    """Predicts Gaussian(μ, σ) for unbounded scalars."""
    def __init__(self, in_dim: int, out_dim: int = 1):
        super().__init__()
        self.mu = nn.Linear(in_dim, out_dim)
        self.log_sigma = nn.Linear(in_dim, out_dim)

    def forward(self, h):
        return self.mu(h), F.softplus(self.log_sigma(h)) + 1e-4

class LogNormalHead(nn.Module):
    """Predicts LogNormal for positive scalars (density, stiffness, etc.)."""
    def __init__(self, in_dim: int, out_dim: int = 1, scale: float = 1.0):
        super().__init__()
        self.log_mu = nn.Linear(in_dim, out_dim)
        self.log_sigma = nn.Linear(in_dim, out_dim)
        self.scale = scale

    def forward(self, h):
        log_mu = self.log_mu(h)
        log_sigma = F.softplus(self.log_sigma(h)) + 1e-4
        # Mean of LogNormal = exp(log_mu + log_sigma²/2)
        mu = torch.exp(log_mu + 0.5 * log_sigma ** 2) * self.scale
        # Standard deviation of the LogNormal in real space: mean * sqrt(exp(log_sigma²) - 1)
        sigma = mu * torch.sqrt(torch.exp(log_sigma ** 2) - 1)
        return mu, sigma

class BetaHead(nn.Module):
    """Predicts Beta(α, β) for [0, upper_bound] quantities."""
    def __init__(self, in_dim: int, out_dim: int = 1, upper_bound: float = 1.0):
        super().__init__()
        self.alpha = nn.Linear(in_dim, out_dim)
        self.beta = nn.Linear(in_dim, out_dim)
        self.upper_bound = upper_bound

    def forward(self, h):
        alpha = F.softplus(self.alpha(h)) + 1.01  # > 1 for unimodal
        beta = F.softplus(self.beta(h)) + 1.01
        mu = alpha / (alpha + beta) * self.upper_bound
        var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))
        sigma = torch.sqrt(var) * self.upper_bound
        return mu, sigma

class VectorHead(nn.Module):
    """Predicts a 3D vector with per-component uncertainty."""
    def __init__(self, in_dim: int):
        super().__init__()
        self.mu = nn.Linear(in_dim, 3)
        self.log_sigma = nn.Linear(in_dim, 3)

    def forward(self, h):
        return self.mu(h), F.softplus(self.log_sigma(h)) + 1e-4

class SymmetricTensorHead(nn.Module):
    """Predicts a 3×3 symmetric tensor via 6 Voigt components.

    Voigt ordering: [σ_xx, σ_yy, σ_zz, σ_yz, σ_xz, σ_xy]
    Ensures symmetry by construction.
    """
    def __init__(self, in_dim: int):
        super().__init__()
        self.voigt = nn.Linear(in_dim, 6)  # 6 independent components
        self.log_sigma = nn.Linear(in_dim, 6)

    def forward(self, h):
        v = self.voigt(h)
        s = F.softplus(self.log_sigma(h)) + 1e-4
        return v, s

    @staticmethod
    def voigt_to_matrix(voigt):
        """Convert [B, 6] Voigt to [B, 3, 3] symmetric matrix."""
        B = voigt.shape[0]
        # Match the input dtype as well as the device.
        mat = torch.zeros(B, 3, 3, device=voigt.device, dtype=voigt.dtype)
        mat[:, 0, 0] = voigt[:, 0]  # xx
        mat[:, 1, 1] = voigt[:, 1]  # yy
        mat[:, 2, 2] = voigt[:, 2]  # zz
        mat[:, 1, 2] = mat[:, 2, 1] = voigt[:, 3]  # yz
        mat[:, 0, 2] = mat[:, 2, 0] = voigt[:, 4]  # xz
        mat[:, 0, 1] = mat[:, 1, 0] = voigt[:, 5]  # xy
        return mat
# ============================================================
# The Complete Object Field
# ============================================================
class ObjectFieldComplete(nn.Module):
    """Complete per-object neural field.

    Φ(x, d̂, t | z_i) → full multiphysics state

    Architecture follows the NeRF insight:
    - Shared trunk processes (x, t, z_i) → features h
    - Geometry branch uses h only (no direction dependence for SDF)
    - Directional branches take h + encoded d̂
    - This ensures SDF is view/direction invariant

    Args:
        latent_dim: dimension of object latent z_i
        hidden_dim: width of trunk MLP
        n_layers: depth of trunk
        pos_freqs: Fourier frequencies for position encoding
        dir_freqs: Fourier frequencies for direction encoding
        n_semantic_classes: number of semantic categories
        skip_at: layer index for skip connection
    """
    def __init__(
        self,
        latent_dim: int = 128,
        hidden_dim: int = 256,
        n_layers: int = 6,
        pos_freqs: int = 6,
        dir_freqs: int = 4,
        n_semantic_classes: int = 32,
        skip_at: int = 3,
    ):
        super().__init__()
        self.latent_dim = latent_dim
        self.hidden_dim = hidden_dim
        self.skip_at = skip_at
        # --- Encodings ---
        self.pos_enc = FourierEncoding(input_dim=3, n_freqs=pos_freqs)
        self.dir_enc = DirectionEncoding(n_freqs=dir_freqs)
        self.time_enc = FourierEncoding(input_dim=1, n_freqs=4)
        trunk_input_dim = self.pos_enc.out_dim + self.time_enc.out_dim + latent_dim
        # --- Shared Trunk ---
        trunk_layers = []
        dims = [trunk_input_dim] + [hidden_dim] * n_layers
        for i in range(n_layers):
            in_d = dims[i] + (trunk_input_dim if i == skip_at else 0)
            trunk_layers.append(nn.Linear(in_d, hidden_dim))
        self.trunk = nn.ModuleList(trunk_layers)
        # --- Geometry Branch (position-only) ---
        self.sdf_head = GaussianHead(hidden_dim)
        # Note: surface normal is computed as ∇SDF via autograd, not predicted
        # Curvature can be computed as ∇²SDF (Laplacian)
        # --- Isotropic Material Branch (position-only) ---
        self.density_head = LogNormalHead(hidden_dim, scale=1000.0)  # kg/m³
        self.temperature_head = GaussianHead(hidden_dim)  # Kelvin
        self.restitution_head = BetaHead(hidden_dim, upper_bound=1.0)  # [0, 1]
        self.poisson_head = BetaHead(hidden_dim, upper_bound=0.5)  # [0, 0.5]
        self.acoustic_damping_head = LogNormalHead(hidden_dim, scale=1.0)
        # --- Directional Material Branch (position + direction) ---
        dir_input_dim = hidden_dim + self.dir_enc.out_dim
        self.dir_material_trunk = nn.Sequential(
            nn.Linear(dir_input_dim, hidden_dim // 2),
            nn.ReLU(),
            nn.Linear(hidden_dim // 2, hidden_dim // 2),
            nn.ReLU(),
        )
        dir_h_dim = hidden_dim // 2
        self.friction_head = LogNormalHead(dir_h_dim, scale=1.0)  # μ ∈ R+ (typ 0.1-1.0)
        self.stiffness_head = LogNormalHead(dir_h_dim, scale=1e9)  # Pa
        self.thermal_cond_head = LogNormalHead(dir_h_dim, scale=1.0)  # W/(m·K)
        # --- Dynamics Branch (position-only, vector/tensor outputs) ---
        self.velocity_head = VectorHead(hidden_dim)  # m/s
        self.stress_head = SymmetricTensorHead(hidden_dim)  # Pa (6 Voigt components)
        self.strain_head = SymmetricTensorHead(hidden_dim)  # dimensionless (6 Voigt)
        # --- Appearance Branch (position + direction, as in NeRF) ---
        app_input_dim = hidden_dim + self.dir_enc.out_dim
        self.appearance_trunk = nn.Sequential(
            nn.Linear(app_input_dim, hidden_dim // 2),
            nn.ReLU(),
        )
        self.color_head = nn.Linear(hidden_dim // 2, 3)  # RGB ∈ [0,1]
        self.opacity_head = nn.Linear(hidden_dim // 2, 1)  # α ∈ [0,1]
        # --- Semantic Branch ---
        self.semantic_head = nn.Linear(hidden_dim, n_semantic_classes)
        self.affordance_head = nn.Linear(hidden_dim, 8)  # graspable, pushable, pourable, etc.

    def trunk_forward(self, x: torch.Tensor, t: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        """Shared trunk: (x, t, z) → feature vector h."""
        x_enc = self.pos_enc(x)
        t_enc = self.time_enc(t.unsqueeze(-1) if t.dim() == 1 else t)
        if z.dim() == 1:
            z = z.unsqueeze(0).expand(x.shape[0], -1)
        h = torch.cat([x_enc, t_enc, z], dim=-1)
        h_input = h
        for i, layer in enumerate(self.trunk):
            if i == self.skip_at:
                h = torch.cat([h, h_input], dim=-1)
            h = F.relu(layer(h))
        return h
    def forward(
        self,
        x: torch.Tensor,  # [B, 3] position in object-local frame
        z: torch.Tensor,  # [D] or [B, D] object latent
        t: Optional[torch.Tensor] = None,  # [B] or [B, 1] time
        d: Optional[torch.Tensor] = None,  # [B, 3] or [B, 2] query direction
        compute_normals: bool = False,  # whether to compute ∇SDF
        compute_curvature: bool = False,  # whether to compute ∇²SDF
    ) -> Dict[str, torch.Tensor]:
        """
        Full field query.
        Returns dict with all physical quantities as (mean, uncertainty) pairs.
        """
        B = x.shape[0]
        if t is None:
            t = torch.zeros(B, 1, device=x.device)
        elif t.dim() == 1:
            t = t.unsqueeze(-1)
        # Enable gradient tracking for normal/curvature computation
        if compute_normals or compute_curvature:
            x = x.detach().requires_grad_(True)
        # --- Shared trunk ---
        h = self.trunk_forward(x, t, z)
        result = {}
        # --- Geometry (position-only) ---
        sdf_mu, sdf_sigma = self.sdf_head(h)
        result['sdf_mu'] = sdf_mu
        result['sdf_sigma'] = sdf_sigma
        # Surface normal via autograd. The gradient is also needed for curvature,
        # so compute it whenever either flag is set (avoids an undefined grad_sdf).
        if compute_normals or compute_curvature:
            grad_sdf = torch.autograd.grad(
                sdf_mu.sum(), x, create_graph=compute_curvature, retain_graph=True
            )[0]  # [B, 3]
            normal = F.normalize(grad_sdf, dim=-1)
            result['normal'] = normal
            result['sdf_gradient'] = grad_sdf
        if compute_curvature:
            # Laplacian of the SDF. For an exact SDF this equals the sum of the
            # principal curvatures on the surface, i.e. twice the mean curvature.
            curvature = torch.zeros(B, device=x.device)
            for dim in range(3):
                grad2 = torch.autograd.grad(
                    grad_sdf[:, dim].sum(), x, retain_graph=True
                )[0][:, dim]
                curvature += grad2
            result['mean_curvature'] = curvature
        # --- Isotropic materials (position-only) ---
        result['density_mu'], result['density_sigma'] = self.density_head(h)
        result['temperature_mu'], result['temperature_sigma'] = self.temperature_head(h)
        result['restitution_mu'], result['restitution_sigma'] = self.restitution_head(h)
        result['poisson_mu'], result['poisson_sigma'] = self.poisson_head(h)
        result['damping_mu'], result['damping_sigma'] = self.acoustic_damping_head(h)
        # --- Directional materials (position + direction) ---
        if d is not None:
            d_enc = self.dir_enc(d)
            h_dir = self.dir_material_trunk(torch.cat([h, d_enc], dim=-1))
            result['friction_mu'], result['friction_sigma'] = self.friction_head(h_dir)
            result['stiffness_mu'], result['stiffness_sigma'] = self.stiffness_head(h_dir)
            result['thermal_cond_mu'], result['thermal_cond_sigma'] = self.thermal_cond_head(h_dir)
        else:
            # If no direction given, query with canonical directions and average
            # (gives isotropic estimate). Thermal conductivity is not returned here.
            canonical = torch.tensor([[1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=torch.float, device=x.device)
            fric_sum = torch.zeros(B, 1, device=x.device)
            stiff_sum = torch.zeros(B, 1, device=x.device)
            for cd in canonical:
                cd_batch = cd.unsqueeze(0).expand(B, -1)
                d_enc = self.dir_enc(cd_batch)
                h_dir = self.dir_material_trunk(torch.cat([h, d_enc], dim=-1))
                f_mu, _ = self.friction_head(h_dir)
                s_mu, _ = self.stiffness_head(h_dir)
                fric_sum += f_mu
                stiff_sum += s_mu
            result['friction_mu'] = fric_sum / 3
            result['stiffness_mu'] = stiff_sum / 3
            # Higher uncertainty for averaged isotropic estimate
            result['friction_sigma'] = torch.ones_like(result['friction_mu']) * 0.1
            result['stiffness_sigma'] = result['stiffness_mu'] * 0.2
        # --- Dynamics ---
        result['velocity_mu'], result['velocity_sigma'] = self.velocity_head(h)
        stress_v, stress_s = self.stress_head(h)
        result['stress_voigt'] = stress_v  # [B, 6]
        result['stress_sigma'] = stress_s
        strain_v, strain_s = self.strain_head(h)
        result['strain_voigt'] = strain_v
        result['strain_sigma'] = strain_s
        # --- Appearance (needs direction) ---
        if d is not None:
            d_enc = self.dir_enc(d)
            h_app = self.appearance_trunk(torch.cat([h, d_enc], dim=-1))
            result['color'] = torch.sigmoid(self.color_head(h_app))
            result['opacity'] = torch.sigmoid(self.opacity_head(h_app))
        # --- Semantics ---
        result['semantic_logits'] = self.semantic_head(h)
        result['affordance_logits'] = self.affordance_head(h)
        # --- Store features for gating ---
        result['features'] = h
        return result
# ============================================================
# Convenience: Derived Quantities
# ============================================================
def contact_response(field_output: Dict, contact_force: torch.Tensor, contact_direction: torch.Tensor):
    """Given field output at a contact point, predict contact response.

    Args:
        field_output: from ObjectFieldComplete.forward(x, z, d=contact_normal)
        contact_force: [B, 1] applied normal force
        contact_direction: [B, 3] contact normal (currently unused here; the
            direction already enters through the d argument of the field query)
    Returns:
        dict with:
        - max_static_friction: F_max = μ_s * |F_normal|
        - deformation: Hertz-style scaling (|F| / E)^(2/3); geometry factors such
          as contact radius are omitted, so this is a relative measure only
        - friction, stiffness: the field means used in the two quantities above
    """
    mu = field_output['friction_mu']
    E = field_output['stiffness_mu']
    max_friction = mu * contact_force.abs()
    # Hertzian contact approximation: deformation ~ (F / E)^(2/3)
    deformation = (contact_force.abs() / (E + 1e-6)) ** (2.0 / 3.0)
    return {
        'max_static_friction': max_friction,
        'deformation': deformation,
        'friction': mu,
        'stiffness': E,
    }
# ============================================================
# Test
# ============================================================
if __name__ == "__main__":
    print("=" * 60)
    print("Complete Object Field — Architecture Test")
    print("=" * 60)
    field = ObjectFieldComplete(
        latent_dim=128,
        hidden_dim=256,
        n_layers=6,
        pos_freqs=6,
        dir_freqs=4,
    )
    params = sum(p.numel() for p in field.parameters())
    print(f"\nTotal parameters: {params:,}")
    # Itemize by component
    components = {
        'trunk': sum(p.numel() for l in field.trunk for p in l.parameters()),
        'geometry': sum(p.numel() for p in field.sdf_head.parameters()),
        'isotropic_material': sum(
            sum(p.numel() for p in head.parameters())
            for head in [field.density_head, field.temperature_head,
                         field.restitution_head, field.poisson_head, field.acoustic_damping_head]
        ),
        'directional_material': sum(p.numel() for p in field.dir_material_trunk.parameters()) +
            sum(p.numel() for p in field.friction_head.parameters()) +
            sum(p.numel() for p in field.stiffness_head.parameters()) +
            sum(p.numel() for p in field.thermal_cond_head.parameters()),
        'dynamics': sum(p.numel() for p in field.velocity_head.parameters()) +
            sum(p.numel() for p in field.stress_head.parameters()) +
            sum(p.numel() for p in field.strain_head.parameters()),
        'appearance': sum(p.numel() for p in field.appearance_trunk.parameters()) +
            sum(p.numel() for p in field.color_head.parameters()) +
            sum(p.numel() for p in field.opacity_head.parameters()),
        'semantic': sum(p.numel() for p in field.semantic_head.parameters()) +
            sum(p.numel() for p in field.affordance_head.parameters()),
    }
    for name, count in components.items():
        pct = count / params * 100
        print(f" {name:25s} {count:>8,} ({pct:5.1f}%)")
    # Forward pass
    B = 64
    x = torch.randn(B, 3) * 0.05
    z = torch.randn(128)
    t = torch.zeros(B)
    d = F.normalize(torch.randn(B, 3), dim=-1)
    print(f"\nQuery: {B} points, with direction, with normals+curvature")
    out = field(x, z, t=t, d=d, compute_normals=True, compute_curvature=True)
    print(f"\nOutputs:")
    for k, v in sorted(out.items()):
        if isinstance(v, torch.Tensor):
            print(f" {k:25s} shape={str(list(v.shape)):12s} range=[{v.min().item():.4f}, {v.max().item():.4f}]")
    # Query without direction (isotropic average)
    print(f"\nQuery: {B} points, NO direction (isotropic estimate)")
    out_iso = field(x, z, t=t, d=None, compute_normals=True)
    print(f" friction_mu (isotropic avg): {out_iso['friction_mu'].mean().item():.4f}")
    print(f" normal (from ∇SDF): shape={list(out_iso['normal'].shape)}")
    # Contact response
    print(f"\nContact response at query points:")
    F_contact = torch.ones(B, 1) * 5.0  # 5N
    response = contact_response(out, F_contact, d)
    for k, v in response.items():
        if isinstance(v, torch.Tensor):
            print(f" {k:25s} mean={v.mean().item():.4f}")
    print("\nDone!")
This appendix specifies the discrete half of the state. Where the belief field \(\Phi\) provides per-point physical queries at geometric resolution, the graph \(G_t\) provides the mechanism for tracking which object is which across time and for representing relational structure between objects. These two components are complementary: a field without a graph has no persistent object identity; a graph without a field has no local physical query interface.
The global field is a weighted mixture of object-local fields. For a query point x in world frame:
Here p(· | x, t) is shorthand for the collection of property-specific outputs returned at query point x, not a claim that the repo already specifies one joint density tying every scalar, vector, tensor, and categorical head together.
The gating weights \(w_i(x)\) are computed from each object’s SDF: points well inside object \(i\) should receive high weight for \(i\). This is a heuristic ownership mechanism rather than a proved partition of unity. Near overlapping, gapped, or inaccurate SDF boundaries, the ownership can become ambiguous, and this appendix does not yet provide identifiability or error-propagation bounds. A new object enters the scene by allocating a new latent \(z_i\) and pose \(T_i\).
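The ownership heuristic can be sketched as a softmax over negated signed distances. This is a minimal illustration under stated assumptions, not the repo's gating code: the names `sphere_sdf`, `ownership_weights`, and `mix_property`, the temperature `tau`, and the two sphere objects are all introduced here for the example.

```python
import math

def sphere_sdf(center, radius):
    """Return a signed distance function for a sphere (negative inside)."""
    def sdf(x):
        return math.dist(x, center) - radius
    return sdf

def ownership_weights(sdf_values, tau=0.01):
    """Softmax over -sdf/tau: the object a point is deepest inside gets the largest weight.
    tau is an assumed sharpness parameter (metres)."""
    logits = [-s / tau for s in sdf_values]
    peak = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mix_property(weights, per_object_values):
    """Weighted mixture of one scalar property across object-local fields."""
    return sum(w * v for w, v in zip(weights, per_object_values))

# Two hypothetical objects 20 cm apart, each a 5 cm sphere.
sdfs = [sphere_sdf((0.0, 0.0, 0.0), 0.05), sphere_sdf((0.2, 0.0, 0.0), 0.05)]

x_inside_0 = (0.01, 0.0, 0.0)   # well inside object 0
x_midway = (0.10, 0.0, 0.0)     # equidistant from both surfaces

w_inside = ownership_weights([f(x_inside_0) for f in sdfs])
w_mid = ownership_weights([f(x_midway) for f in sdfs])
friction_at_x = mix_property(w_inside, [0.5, 0.3])
```

The softmax normalizes the weights by construction, but that does not make the ownership correct: at `x_midway` the weights split evenly, which is exactly the boundary ambiguity noted above.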
Object permanence: The intended behavior is to keep node i alive with decaying existence probability while the object is occluded, then recover identity when it reappears. Achieving that requires explicit temporal update and data-association machinery; it does not follow from writing down \(G_t=(V_t,E_t)\) alone.
Relational reasoning: “The cup is on the plate, which is on the table.” This support chain is representable as a path in the graph. Turning that stored relation into a reliable counterfactual such as “remove the plate, then the cup falls” still requires explicit dynamics or symbolic reasoning rather than the graph state by itself.
Contact-conditional dynamics: Graph edges can parameterize sparse message passing over likely contacts. Turning that representation into reliable force propagation still requires a concrete update rule and a learned or analytic dynamics model.
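The three graph behaviors above can be sketched on a toy structure. This is an illustration only: `ObjectGraph`, its method names, the decay rate of 0.95, and the weights in newtons are assumptions made for the example, and data association (deciding which detection belongs to which node) is taken as given rather than solved.

```python
from dataclasses import dataclass, field

@dataclass
class ObjectGraph:
    existence: dict = field(default_factory=dict)     # node -> probability it still exists
    supported_by: dict = field(default_factory=dict)  # node -> node it rests on (single parent assumed)

    def observe(self, node, detected, decay=0.95):
        """Reset existence on detection; otherwise decay it so the node stays alive."""
        prior = self.existence.get(node, 1.0)
        self.existence[node] = 1.0 if detected else prior * decay

    def support_chain(self, node):
        """Follow support edges downward from node to the root support."""
        chain = [node]
        while chain[-1] in self.supported_by:
            nxt = self.supported_by[chain[-1]]
            if nxt in chain:
                raise ValueError("cycle in support edges")
            chain.append(nxt)
        return chain

    def propagate_load(self, weight):
        """One round of message passing over support edges: each supporter
        adds the weight of whatever rests directly on it (hand-written rule)."""
        load = dict(weight)
        for child, parent in self.supported_by.items():
            load[parent] += weight[child]
        return load

g = ObjectGraph(supported_by={"cup": "plate", "plate": "table"})

for _ in range(10):                      # cup occluded for ten steps
    g.observe("cup", detected=False)
occluded_p = g.existence["cup"]          # decayed, but the node is still present
g.observe("cup", detected=True)          # identity recovered on reappearance

chain = g.support_chain("cup")
load = g.propagate_load({"cup": 2.0, "plate": 5.0, "table": 100.0})
```

A single round only moves information one hop, so the table does not yet see the cup's weight. Reaching a full force balance, or answering the counterfactual about removing the plate, needs repeated rounds or a dynamics model, as the text notes.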
This is a supporting visualization of the representation on a MuJoCo tabletop scene with five objects. Click any object to query its field properties and inspect the graph edges, ownership map, and SDF wireframes.
The same tabletop scene used for the live field-graph visualization and object-level queries below.
The panels below are interactive renderings of the three toy experiments described in Section 5. Each panel corresponds to a runnable script in the repository; the measurements were produced by MuJoCo 3.5 simulation. Use the controls to explore the parameter space for each experiment. All results carry the same scope qualifications as the Section 5 descriptions: these are repo evidence for individual interface properties, not general performance comparisons.
Corresponding to Section 5, Experiment A. Two cubes are rendered with identical visual appearance (same shape, color, and texture) but with dramatically different hidden physics: steel (2.0 kg, \(\mu = 0.5\)) versus foam (0.05 kg, \(\mu = 0.3\)). The slider selects the applied force; pressing Play animates the displacement trajectories frame-by-frame. The bar chart shows final displacement at each tested force on a log scale, where the gap between the two materials is most apparent.
The lighter cube accelerates away while the heavier cube barely moves under the same visual setup.
The higher-force version makes the separation in hidden physics even more obvious.
This sweep is the rendered counterpart to the displacement-vs-force chart shown in the panel.
Corresponding to Section 5, Experiment B. This panel demonstrates that the field state can be updated through interaction rather than relying on a single visual estimate. The estimator maintains a joint posterior over mass \(m\) and friction coefficient \(\mu\) using a grid of 1,296 precomputed MuJoCo simulation outcomes as the likelihood model. After each push, the observed displacement is compared to the precomputed likelihood grid and the posterior is updated via Bayes’ rule.
Select a material and drag the Interactions slider (or press Autoplay) to observe how the mass estimate and its uncertainty evolve. Note how the relative error typically drops below 20% within four to six interactions, regardless of starting prior. This is the “interaction refines belief” property from Section 5.
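The grid-based update can be sketched as follows. Several pieces here are assumptions rather than the repo's estimator: `predicted_displacement` is a toy analytic stand-in for the precomputed MuJoCo outcome table, the observation noise is Gaussian with an assumed `sigma`, and the 24 × 12 grid is illustrative rather than the 1,296-cell grid used in the panel.

```python
import math

def predicted_displacement(mass, mu, force, duration=0.1, g=9.81):
    """Toy stand-in for a precomputed simulation table (NOT MuJoCo):
    a block pushed with constant force for `duration`, then sliding to rest."""
    net = force - mu * mass * g
    if net <= 0.0:
        return 0.0                       # push too weak to overcome friction
    a = net / mass
    v = a * duration
    return 0.5 * a * duration ** 2 + v ** 2 / (2.0 * mu * g)

def bayes_update(posterior, grid, force, observed, sigma=0.002):
    """Multiply the prior by a Gaussian likelihood of the observed displacement, then renormalize."""
    unnorm = []
    for p, (m, mu) in zip(posterior, grid):
        err = observed - predicted_displacement(m, mu, force)
        unnorm.append(p * math.exp(-0.5 * (err / sigma) ** 2))
    total = sum(unnorm)
    return [u / total for u in unnorm]

masses = [0.05 * 1.2 ** k for k in range(24)]       # log-spaced mass hypotheses (kg)
frictions = [0.1 + 0.05 * k for k in range(12)]     # friction hypotheses
grid = [(m, mu) for m in masses for mu in frictions]
posterior = [1.0 / len(grid)] * len(grid)           # uniform prior

true_m, true_mu = masses[10], frictions[6]          # hidden ground truth for the demo
for force in (2.0, 4.0, 6.0, 8.0):                  # four pushes at different forces
    obs = predicted_displacement(true_m, true_mu, force)
    posterior = bayes_update(posterior, grid, force, obs)

best = max(range(len(grid)), key=posterior.__getitem__)
map_mass, map_mu = grid[best]
```

Varying the push force matters in this sketch: a single displacement is consistent with many (mass, friction) pairs, and pushes at different forces are what separate them.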
Sequential pushes in the lighter-material regime show the observation stream used to tighten the posterior.
The same estimator on a heavier, lower-friction object produces a different convergence path.
Corresponding to Section 5, Experiment C. Given the field state for a known material, the safe grip-force range follows in closed form. For a two-contact grasp, the combined friction \(2\mu F\) must at least balance the weight \(mg\), so the minimum force to lift without slipping is \(F_{\min} = mg/(2\mu)\); the maximum force before crushing is \(F_{\max} = F_{\text{crush}}\). This experiment verifies that the material estimates in the field state are sufficient to compute this control constraint without any additional learned layer.
Select a material and sweep the Grip Force slider to see whether the robot lifts, drops (insufficient friction), or crushes the object. The color-coded force map on the right summarizes the full outcome spectrum for the selected material at once.
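As a worked instance of the closed-form band, the sketch below uses the masses and friction coefficients of the steel and foam cubes from Experiment A. The crush thresholds (200 N and 3 N) are hypothetical placeholders, and the function names are introduced here for illustration.

```python
def safe_grip_range(mass, mu, crush_force, g=9.81, n_contacts=2):
    """Return (f_min, f_max), or None when no force both holds and does not crush."""
    f_min = mass * g / (n_contacts * mu)   # friction at the contacts must carry the weight
    f_max = crush_force
    return (f_min, f_max) if f_min <= f_max else None

def grip_outcome(force, mass, mu, crush_force):
    """Classify a grip force as 'drop', 'lift', or 'crush'."""
    if force > crush_force:
        return "crush"
    if force < mass * 9.81 / (2 * mu):
        return "drop"
    return "lift"

steel = safe_grip_range(mass=2.0, mu=0.5, crush_force=200.0)   # crush value is hypothetical
foam = safe_grip_range(mass=0.05, mu=0.3, crush_force=3.0)     # crush value is hypothetical

outcomes = [grip_outcome(f, 0.05, 0.3, 3.0) for f in (0.5, 2.0, 5.0)]
```

With these numbers the steel cube needs at least about 19.6 N while the foam cube needs under 1 N and tolerates only a few newtons, so no single nominal force serves both.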
The fragile-object comparison shows why the safe operating band matters more than a single nominal force.
Large-scale human and internet video is already an established component of the contemporary world-model landscape. Structured World Models from Human Videos, Genie, and UniSim collectively demonstrate that rich video corpora can support world modeling, interactive environment generation, and transferable control. The field-graph approach is compatible with this line of work: internet video becomes the data source for Stage S2 of the VLA curriculum (Section 10), supervising view-consistent geometry and relational priors rather than pixel reconstruction.
The key distinction is the supervision target: routing internet video through a structured 3D state \((\Phi, G)\) rather than through pixel prediction or latent dynamics produces an intermediate representation that is addressable by downstream skills. Section 6 establishes that this structured intermediate already outperforms a matched image baseline on a one-object calibration-push task under a controlled protocol. Section 10 describes how that result is intended to scale to the full VLA setting.
| Question | Video-centric systems | Field-graph hypothesis |
|---|---|---|
| What is directly supervised? | Pixels, latent video dynamics, or simulator rollouts | Structured 3D state variables and relations |
| What interface is exposed downstream? | Typically images, latents, or simulator trajectories | Local queries plus persistent object graph state |
| What remains to be shown here? | A controlled repo baseline proving that the field-graph interface transfers better than a simple image-only baseline on the same MuJoCo tasks | |
Level 0: Simulation pre-training. MuJoCo provides ground truth SDF, materials, contacts, and trajectories. This is the most direct supervision currently available in the repo.
Level 1: Internet or human video. RGB only, no direct force labels. The intended role here is to learn geometry and relational priors, consistent with the direction explored in the cited literature.
Level 2: Robot interaction. Use targeted interaction to calibrate quantities that are weakly observed from video alone, such as friction or crush thresholds.
Level 3: Language priors. Use labels or task descriptions to initialize a belief state, then refine that state through observation and contact.
This section shows the current embedded export tied to the repo’s neural-field training and evaluation workflow in nerf/train.py and nerf/evaluate.py. The prototype is trained on MuJoCo ground truth for three objects and is included here as validation of the field machinery, not as a direct comparison against image-only baselines.
The curves and counters below are the current site export from the prototype field workflow. They indicate that the shared architecture can fit the toy objects used in the repo and recover material-related quantities on those examples.
This should be read as evidence of prototype viability and exportability. The direct proof-gated image-versus-structured comparison now lives in Section 6; this appendix remains separate evidence that the field machinery can be trained and exported inside the repo.
The three-model pipeline in Section 7 is currently trained in stages: supervised encoder first, then contrastive vectorizer, then cross-encoder. The next step is to collapse this into a single end-to-end objective, then extend the result to a full Vision–Language–Action (VLA) model where the explicit \((\Phi, G)\) state acts as a physics-grounded intermediate between perception and action. This section describes both stages.
Staged training allows each module to converge independently but introduces a mismatch: the encoder is trained against MuJoCo ground-truth supervision, while the vectorizer is trained on the encoder’s outputs, which are imperfect. End-to-end training is intended to close this distribution gap by backpropagating through all three stages at once, so that each stage adapts to the outputs the others actually produce.
| Training phase | Loss terms | What is learned |
|---|---|---|
| E2E Warm-up | \(\mathcal{L}_{\text{sup}} + \mathcal{L}_{\text{ctr}}\) with small LR | Align Model 1 gradients with downstream contrastive signal from Model 2 |
| E2E Joint | \(\mathcal{L}_{\text{sup}} + \mathcal{L}_{\text{ctr}} + \mathcal{L}_{\text{action}}\) | All three models optimized simultaneously; \(\Phi+G\) shaped by both geometric supervision and task reward |
| E2E Refinement | \(\mathcal{L}_{\text{action}}\) with frozen encoder | Fine-tune policy head on new task distributions while keeping the structured state stable |
The key advantage of end-to-end training is that the encoder learns to produce a \(\Phi + G\) that is useful for control, not just one that reconstructs MuJoCo ground truth. Gradients from the action loss propagate back through the vectorizer into the encoder, allowing the object latents \(z_i\) to encode the physically relevant structure that actually matters for the downstream task.
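The phase table can be read as a schedule of loss weights. The sketch below is one way to encode it. The unit weights, the `lr_scale` values, and the names `PHASES`, `total_loss`, and `encoder_frozen` are assumptions, since the table specifies only which terms are active, a small learning rate during warm-up, and a frozen encoder during refinement.

```python
PHASES = {
    # phase: weight per loss term, whether the encoder is frozen, learning-rate scale
    "warmup":     {"sup": 1.0, "ctr": 1.0, "action": 0.0, "freeze_encoder": False, "lr_scale": 0.1},
    "joint":      {"sup": 1.0, "ctr": 1.0, "action": 1.0, "freeze_encoder": False, "lr_scale": 1.0},
    "refinement": {"sup": 0.0, "ctr": 0.0, "action": 1.0, "freeze_encoder": True,  "lr_scale": 1.0},
}

def total_loss(phase, losses):
    """Combine per-term losses using the weights of the given phase."""
    cfg = PHASES[phase]
    return sum(cfg[name] * losses[name] for name in ("sup", "ctr", "action"))

def encoder_frozen(phase):
    """True when gradients should not reach the encoder in this phase."""
    return PHASES[phase]["freeze_encoder"]

# Placeholder loss values for one batch (not measured results).
losses = {"sup": 0.40, "ctr": 0.25, "action": 0.10}
per_phase = {name: total_loss(name, losses) for name in PHASES}
```

In the joint phase the action term contributes to the same scalar that the encoder is optimized against, which is the mechanism by which task gradients reach the object latents \(z_i\).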
The Cross-Encoder (Model 3) in Section 7 already has the structure of a VLA: it accepts a structured scene representation, a language instruction, and a current image, and outputs an action. The next step is to scale this to a full VLA trained on internet video and robot demonstrations, where \((\Phi, G)\) acts as the physics-grounded intermediate state rather than a raw image latent.
Contemporary VLAs such as RT-2, OpenVLA, and π0 (Physical Intelligence) map directly from visual tokens and language to action tokens using transformer backbones pretrained on internet-scale data. They achieve impressive generalization but encode all physical understanding implicitly in the network weights. The field-graph VLA proposes inserting the explicit \((\Phi, G)\) state as a mandatory intermediate, making physical structure addressable at inference time.
The critical architectural difference is that the field-graph VLA can answer explicit physical queries at inference time: a planner can query the SDF of an object before choosing a grasp, read material friction before selecting a grip force, or traverse the graph to check for support relations before executing a pick. Standard VLAs cannot expose these answers without retraining.
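A minimal sketch of such a pre-grasp query is shown below. It assumes an analytic sphere SDF in place of the learned field and uses central finite differences in place of automatic differentiation. `sphere_sdf`, `sdf_normal`, and `plan_approach` are hypothetical names, not the repo's planner API.

```python
import math

def sphere_sdf(center, radius):
    """Analytic stand-in for an object's learned SDF (negative inside)."""
    def sdf(x):
        return math.dist(x, center) - radius
    return sdf

def sdf_normal(sdf, x, eps=1e-5):
    """Unit surface normal from a central-difference gradient of the SDF.
    Undefined where the gradient vanishes (e.g. the sphere center)."""
    grad = []
    for axis in range(3):
        hi, lo = list(x), list(x)
        hi[axis] += eps
        lo[axis] -= eps
        grad.append((sdf(hi) - sdf(lo)) / (2.0 * eps))
    norm = math.sqrt(sum(g * g for g in grad))
    return [g / norm for g in grad]

def plan_approach(sdf, x, surface_tol=1e-3):
    """Return an approach direction pointing into the surface if x lies on it, else None."""
    if abs(sdf(x)) > surface_tol:
        return None                      # candidate contact point is not on the surface
    return [-n for n in sdf_normal(sdf, x)]

mug = sphere_sdf((0.0, 0.0, 0.0), 0.05)
approach = plan_approach(mug, (0.05, 0.0, 0.0))   # on the surface, along +x
too_far = plan_approach(mug, (0.10, 0.0, 0.0))    # 5 cm off the surface
```

The same pattern extends to the other queries named above: read friction at the contact point before choosing a grip force, and walk the support edges before committing to a pick.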
| Stage | Data source | Objective | What \(\Phi+G\) learns |
|---|---|---|---|
| S1: Sim pre-training | MuJoCo scenes with GT SDF, materials, contacts | Supervised \(\mathcal{L}_{\text{geo}} + \mathcal{L}_{\text{mat}} + \mathcal{L}_{\text{contact}}\) | Accurate SDF, material prediction, and graph structure in controlled settings |
| S2: Video pre-training | Internet video and human manipulation demonstrations | View-consistency contrastive \(\mathcal{L}_{\text{ctr}}\); self-supervised depth/flow | Geometry and relational priors from large-scale natural observation, consistent with Structured World Models from Human Videos |
| S3: Interaction calibration | Robot self-play and targeted interaction | Posterior update on contact outcomes; friction and stiffness calibration | Material quantities that are weakly observable from video alone (friction, crush threshold, compliance) |
| S4: Language alignment | Scene + instruction pairs; CLIP-style paired data | Contrastive \(\mathcal{L}_{\text{align}}\) between scene embedding and instruction embedding | Semantic grounding: “the heavy metal block” initializes a prior over mass and friction before any contact |
| S5: Policy fine-tuning | Robot demonstrations on target tasks | Behavioral cloning \(\mathcal{L}_{\text{action}}\) through all three models | Task-specific shaping of the structured state to support downstream control |
This document has introduced, justified, and benchmarked a hybrid intermediate state \(S_t = (\Phi_t, G_t)\) as the interface between robotic perception and control. The field component \(\Phi_t\) exposes geometry and material properties at any 3D query point; the graph component \(G_t\) tracks persistent object identity and relational structure. Together they form an explicitly addressable world state that downstream skills can query without re-encoding physical structure from raw pixels for each new task.
Three layers of evidence support the design: literature precedent for each component (Section 2), controlled MuJoCo experiments validating three independent interface properties (Sections 5 and 8), and a proof-gated direct benchmark showing a 46.3% reduction in out-of-distribution prediction error over a matched image baseline (Section 6). The complete three-model pipeline (Section 7) is runnable, instrumented, and designed for end-to-end training. The roadmap in Section 10 describes how this pipeline becomes a full Vision–Language–Action model: training all three models jointly, then scaling to internet video and robot demonstration data in a five-stage curriculum.
The benchmark in Section 6 covers a one-object calibration-push task under a matched norm-bounded linear probe. It does not yet cover multi-object occlusion identity, contact-rich manipulation, or richer nonlinear baselines. The field exposes per-property heads rather than a single joint probabilistic model; surface normal stability and SDF-based ownership are contingent on SDF regularity. Graph persistence through occlusion and relational counterfactuals require explicit update and reasoning machinery beyond the state definition. These are the engineering boundaries for the end-to-end training phase.
| Category | Reference | Why it appears here |
|---|---|---|
| World models / dreaming | DreamerV3 (Hafner et al.); Genie; UniSim; V-JEPA (Bardes et al.) | Latent-space imagination and embedding-space prediction frameworks that motivate the dreaming argument in Section 1. |
| Compositional / physics world models | DreMa (ICLR 2025); PIN-WM; DayDreamer (CoRL 2022) | One-shot policy learning from compositional world models (DreMa), few-shot dynamics identification (PIN-WM), and physical robot learning via imagination in ~1 hour (DayDreamer). |
| Survey | Ai et al., Science Robotics 2025 | Comprehensive review confirming that structured state representations with physics priors consistently improve sample efficiency and generalization for manipulation. |
| Object-centric models | Slot Attention; FOCUS | Examples of structured object representations that motivate the graph half of the proposed interface. |
| Neural fields for robotics | NeRF; Dex-NeRF; Evo-NeRF | Examples of continuous field-based geometry interfaces related to the Φ half of the representation. |
| Scene graphs / planning | Hydra; Hierarchical 3D Scene Graph Planning | Examples of persistent object identity and relational planning interfaces related to the G half of the representation. |
| Repo proof notes | proofs/fairness_of_image_vs_state_comparison.md; proofs/superiority_certification_from_experiment.md | Formal fairness and certification rules that define the exact scope of the benchmark claim reported in Section 6. |
| Repo benchmark pipeline | training/scripts/generate_planned_comparison_data.py; training/scripts/run_planned_comparison.py | Executable generator and trainer used to produce the direct image-versus-structured benchmark. |
| Repo benchmark outputs | training/results/planned_comparison_gpu_batched/metrics.json; seed_metrics.csv; ood_predictions.csv | Saved validation, held-out-camera, and per-seed evidence backing the metric cards and table on this page. |
This is a selected reference section rather than a full bibliography. External literature links cover the representative prior work named in the report, while repo-local proof notes and output paths are listed because they are the primary evidence source for the direct benchmark claim.