MuJoCo Robotics Lab — Part 13: VLA Integration — Series Capstone

Nine labs ago this series started with a 2-link planar arm and a Jacobian. It ends here: a human types "pick up the red cup" and a Unitree G1 humanoid — the same G1 whose whole-body QP we built in Part 8 — reaches out, closes its fingers around the cup, and lifts it. No scripted waypoints. No task-specific logic. A learned Vision-Language-Action policy sees what the cameras see, reads what the user typed, and emits joint targets that the whole-body controller actually executes. Language in, humanoid motion out.

Problem

Every controller in Labs 1 through 8 had the same missing piece: someone still had to tell it what to do. Computed torque control tracked a trajectory you wrote. The whole-body QP from Part 8 tracked task-space references you wrote. Even the grasp state machine in Part 5 was authored by hand. The control stack was complete; perception and task specification were not.

Classical pipelines plug that gap with an orchestrator: a perception module detects the cup, a planner picks a pre-grasp pose, a task-space controller drives to it, a state machine closes the gripper. Every stage is brittle. Lighting changes break the detector. Novel objects break the grasp heuristic. New instructions ("put the cup behind the box") break the planner.

A Vision-Language-Action (VLA) policy collapses the entire orchestrator into one network. Given raw pixels and a sentence, it outputs actions. The robot generalizes along axes that hand-written code cannot (new object poses, new phrasings, new visual conditions) because the policy has seen hundreds of variations during training. The job of Lab 9 is to train that policy and wire it into the G1 simulation without throwing away anything the previous eight labs built.

Theory

VLA architecture

A VLA model is three components stitched together:

Vision encoder. Takes one or more camera images and compresses each into a feature vector. We use two ResNet-18 backbones: one for a wrist camera (close-up, manipulation detail) and one for a head camera (scene context, depth cues without a depth sensor). Each backbone yields a 512-dim feature; concatenated, they form a 1024-dim visual embedding.
Language encoder. Takes the instruction string and produces a semantic embedding. We use the frozen CLIP ViT-B/32 text encoder — 512 dimensions, already aligned with visual semantics from CLIP's image-text pretraining. No fine-tuning; the embedding is good out of the box.
Action head. Takes the fused observation (vision + language + proprioception) and outputs an action — or, better, a chunk of actions. We use ACT (Action Chunking with Transformers): a CVAE encoder-decoder that predicts the next 10 joint targets in one forward pass.

image_wrist ─┐
             ├─▶ Vision Encoder (ResNet-18×2) ──▶ v ∈ ℝ¹⁰²⁴
image_head ──┘                                        │
                                                       ├─▶ ACT Transformer ─▶ a_{t..t+9}
"pick up the red cup" ─▶ CLIP ViT-B/32 ─▶ ℓ ∈ ℝ⁵¹²  │
                                                       │
q, q̇ ─────────────────────────────────────────────────┘
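The dimension bookkeeping in the diagram can be checked with a few lines of NumPy. This is only a shape sketch: random vectors stand in for the real encoder outputs, the joint count of 29 is taken from the G1 figure quoted later in this post, and the plain concatenation at the end is an assumption for illustration (the ACT transformer may consume these streams as separate tokens instead).

```python
import numpy as np

rng = np.random.default_rng(0)

N_JOINTS = 29  # assumption: G1 joint count; adjust to your model

# Stand-ins for encoder outputs (random values, correct shapes only).
v_wrist = rng.standard_normal(512)     # ResNet-18 pooled feature, wrist camera
v_head = rng.standard_normal(512)      # ResNet-18 pooled feature, head camera
lang = rng.standard_normal(512)        # CLIP ViT-B/32 text embedding
q = rng.standard_normal(N_JOINTS)      # joint positions
qdot = rng.standard_normal(N_JOINTS)   # joint velocities

v = np.concatenate([v_wrist, v_head])      # visual embedding, 1024-dim
proprio = np.concatenate([q, qdot])        # 2 * N_JOINTS values
obs = np.concatenate([v, lang, proprio])   # everything the action head sees

print(v.shape, lang.shape, proprio.shape, obs.shape)
```

The point is less the concatenation itself than the proportions: the visual stream dominates the observation, and proprioception is a small tail on the end.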

Action tokenization and chunking

Classical behavior cloning predicts one action at a time. That compounds errors — each step is slightly wrong, the next observation drifts, the prediction drifts further. Action chunking breaks the cycle. The policy commits to a sequence of k actions at once:

π(a_t, a_{t+1}, ..., a_{t+k-1} | o_t, ℓ)

Each chunk represents a coherent short-horizon motion plan. At deployment, overlapping chunks are blended with a temporal ensemble — an exponentially weighted average over all predictions that cover timestep t. The output is smoother than any single chunk, and transient errors in one prediction are dampened by its neighbours. In practice, chunk size k = 10 at 30 Hz gives a 330 ms look-ahead.
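The temporal ensemble described above can be sketched in a few lines. This is a hypothetical stand-in, not the class shipped with the lab: the method names mirror the ones used in the inference loop, chunks are assumed to have shape (chunk_size, action_dim), and the weighting gives the oldest live prediction the largest weight, which is our reading of the original ACT formulation. Reversing the order would favour fresh observations instead.

```python
import numpy as np


class TemporalEnsemble:
    """Hypothetical sketch of temporal ensembling over overlapping action chunks."""

    def __init__(self, chunk_size=10, decay=0.5):
        self.chunk_size = chunk_size
        self.decay = decay
        self.t = 0          # timestep the next get_action() call will serve
        self.chunks = []    # (start_step, array of shape [chunk_size, action_dim])

    def add_chunk(self, chunk):
        chunk = np.asarray(chunk, dtype=float)
        if chunk.shape[0] != self.chunk_size:
            raise ValueError("chunk must have chunk_size rows")
        self.chunks.append((self.t, chunk))

    def get_action(self):
        # Keep only chunks whose horizon still covers timestep t.
        self.chunks = [(s, c) for s, c in self.chunks
                       if self.t - s < self.chunk_size]
        if not self.chunks:
            raise ValueError("no chunk covers the current timestep")
        # One prediction per live chunk, ordered oldest first.
        preds = np.stack([c[self.t - s] for s, c in self.chunks])
        weights = np.exp(-self.decay * np.arange(len(preds)))
        weights /= weights.sum()
        self.t += 1
        return weights @ preds   # weighted average, shape [action_dim]


ens = TemporalEnsemble(chunk_size=10, decay=0.5)
ens.add_chunk(np.zeros((10, 3)))
a0 = ens.get_action()            # only one chunk covers t=0
ens.add_chunk(np.ones((10, 3)))
a1 = ens.get_action()            # blend of the old chunk (0.0) and the new one (1.0)
```

With k = 10 and a 30 Hz policy, each chunk spans 10 / 30 s, about a third of a second, and in steady state up to ten overlapping predictions vote on every executed action.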

Bridging a VLA to a whole-body controller

This is the interface question that matters for a series capstone. A VLA does not know about contact wrenches, balance constraints, or null-space projection. A whole-body controller does. The two live in different worlds and need a clean seam.

The choice we make: the VLA outputs joint position targets at 30 Hz, which become reference inputs to the low-level controller. The whole-body controller from Part 8 — running at 500 Hz — treats those targets as a desired configuration and solves for torques that track them while honouring balance, joint limits, and contact constraints. The VLA does not fight physics; the QP enforces it.

VLA (30 Hz)  ──  q_des  ──▶  Whole-body QP (500 Hz)  ──▶  τ  ──▶  G1
       ▲                                                           │
       │                                                           │
       └───────── cameras + proprio ◀──────────────────────────────┘

This is the same separation-of-concerns that made computed torque control in Part 5 work: the outer loop decides where, the inner loop decides how. The VLA replaces a human-written outer loop with a learned one. Everything inside Part 8 stays untouched.
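To make the seam concrete, here is a toy version of "track the reference as far as the constraints allow". It is a sketch under strong assumptions, not the Part 8 controller: a PD law (gains chosen arbitrarily) turns the joint-position target into a desired acceleration, and the "QP" has an identity cost with box bounds only, which means its exact solution is a component-wise clip. A real whole-body QP adds dynamics, contact, and balance constraints and needs a proper solver.

```python
import numpy as np


def posture_accel(q, qdot, q_des, kp=100.0, kd=20.0):
    """PD law: joint-position error -> desired joint acceleration."""
    return kp * (q_des - q) - kd * qdot


def track_within_bounds(qdd_des, qdd_min, qdd_max):
    """Solve min ||qdd - qdd_des||^2  s.t.  qdd_min <= qdd <= qdd_max.

    With an identity cost and box constraints the QP has a closed-form
    solution: clip each component to its bound.
    """
    return np.clip(qdd_des, qdd_min, qdd_max)


q = np.zeros(2)
qdot = np.zeros(2)
q_des = np.array([0.1, 1.0])     # hypothetical targets; the second is far away

qdd_des = posture_accel(q, qdot, q_des)            # about [10, 100]
qdd = track_within_bounds(qdd_des, -50.0, 50.0)    # about [10, 50]
```

The first joint gets exactly what the policy asked for; the second is held at the bound. Nothing is refused outright, the reference is simply followed as closely as the constraints permit.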

Implementation

Model choice

We ship an ACT policy (~18M parameters) rather than a larger RT-2 or OpenVLA variant. Two reasons: it trains end-to-end on a single A100 in a few hours, and it runs at >10 Hz after INT8 quantization on a single consumer GPU. ACT is the minimum-viable VLA — enough to demonstrate the full pipeline without a week of cloud compute per experiment.

Expert demonstrations from the analytical stack

There is no teleoperation rig. Expert demonstrations are generated by the Part 8 whole-body controller itself, combined with the grasp state machine from Part 5. Compared with teleoperation, IK-based expert control is faster, more repeatable, and aligned by construction with the robot's real control interface. Five tasks (pick red cup, pick blue box, pick green bottle, move cup left, move box right), 50+ demos each, domain-randomized across object pose (±10 cm), object colour (±30° hue shift), lighting, and camera pose (±2°). All stored as HDF5: images, proprioception, actions, language string, success flag.
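A minimal sketch of how the per-demo randomization could be sampled, using only the standard library. The ranges for object pose, hue, and camera pose come from the text above; the uniform distributions, the field names, and the lighting range are assumptions, since the post does not specify them.

```python
import random
from dataclasses import dataclass


@dataclass
class SceneSample:
    obj_dx_m: float        # object x offset, metres
    obj_dy_m: float        # object y offset, metres
    hue_shift_deg: float   # object colour hue shift, degrees
    cam_tilt_deg: float    # camera pose perturbation, degrees
    light_scale: float     # lighting intensity multiplier (range is an assumption)


def sample_scene(rng):
    """Draw one randomized scene configuration (uniform sampling is an assumption)."""
    return SceneSample(
        obj_dx_m=rng.uniform(-0.10, 0.10),
        obj_dy_m=rng.uniform(-0.10, 0.10),
        hue_shift_deg=rng.uniform(-30.0, 30.0),
        cam_tilt_deg=rng.uniform(-2.0, 2.0),
        light_scale=rng.uniform(0.5, 1.5),
    )


rng = random.Random(0)                               # seeded for repeatable datasets
scenes = [sample_scene(rng) for _ in range(50)]      # one sample per demo of a task
```

Seeding the generator keeps the dataset reproducible, which matters when a failure has to be traced back to the scene that produced it.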

This is the meta-lesson of the capstone: the classical stack trains the learned stack. The analytical brain writes the curriculum; the learned brain reads it.

Inference loop

The real-time loop lives in src/deployment/inference_loop.py. On every policy tick:

import torch

# clip_text_encode, TemporalEnsemble, WholeBodyController, render_camera and
# get_proprioception are project-local helpers (import paths not shown here).

class VLAInferenceLoop:
    def __init__(self, policy, mj_model, mj_data, lang_instruction):
        self.policy = policy                     # quantized ACT
        self.mj_model = mj_model                 # kept for rendering in step()
        self.mj_data = mj_data                   # live simulation state
        self.lang_emb = clip_text_encode(lang_instruction)  # cached once
        self.ensemble = TemporalEnsemble(chunk_size=10, decay=0.5)
        self.wbc = WholeBodyController(mj_model)  # Lab 8 controller

    def step(self):
        # 1. Render observations from MuJoCo
        img_wrist = render_camera(self.mj_model, self.mj_data, "wrist_cam")
        img_head  = render_camera(self.mj_model, self.mj_data, "head_cam")
        proprio   = get_proprioception(self.mj_data)  # (q, qdot)

        # 2. Policy forward pass → 10-step action chunk
        with torch.inference_mode():
            vis_emb = self.policy.vision_encoder(img_wrist, img_head)
            chunk   = self.policy.get_action(vis_emb, proprio, self.lang_emb)

        # 3. Temporal ensemble → single smoothed action
        self.ensemble.add_chunk(chunk.cpu().numpy())
        q_des = self.ensemble.get_action()

        # 4. Hand off to the Lab 8 whole-body controller
        #    q_des becomes a joint-space reference; the QP enforces balance,
        #    joint limits, and contact constraints while tracking it.
        tau = self.wbc.solve(self.mj_data, q_ref=q_des)
        self.mj_data.ctrl[:] = tau
        return q_des

The policy runs at 30 Hz (MuJoCo camera render budget). The whole-body QP runs at 500 Hz, so between consecutive policy steps it takes 16 or 17 inner iterations (500 / 30 ≈ 16.7) with the same q_des reference, producing torques that are physically consistent with balance and contact. Observation frames are rendered directly from MuJoCo's offscreen GL context, so no real camera driver is involved; the same pattern should transfer cleanly to a real G1 once camera topics replace render_camera().
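Because 500 is not a multiple of 30, the number of control steps per policy tick cannot be constant. One way to schedule it, sketched here with a hypothetical helper and integer arithmetic so that no drift accumulates over an episode:

```python
def inner_steps_schedule(policy_hz, control_hz, n_ticks):
    """Control steps to run during each policy tick (q_des held constant).

    After tick k, exactly floor(k * control_hz / policy_hz) control steps
    have run, so the long-run ratio matches the two rates.
    """
    steps, done = [], 0
    for k in range(1, n_ticks + 1):
        target = (k * control_hz) // policy_hz   # control steps due by tick k
        steps.append(target - done)
        done = target
    return steps


# One second of simulated time: 30 policy ticks, 500 control steps.
schedule = inner_steps_schedule(policy_hz=30, control_hz=500, n_ticks=30)
print(schedule, sum(schedule))
```

Each entry is 16 or 17, and the entries for one second sum to 500. Always running 16 inner steps would be simpler, but the controller would then fall about 4% behind its nominal rate relative to the policy clock.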

Language embeddings are encoded once at episode start — the CLIP text encoder is the most expensive part of the stack per call, and the instruction does not change mid-episode. Vision encoding is unavoidable every tick.

Results

Evaluated across five language instructions, ten trials each, domain-randomized scenes.

Metric	Target	Observed
Success rate — training configs	>70%	qualitative pass
Success rate — randomized variants	>40%	qualitative pass
Inference rate (INT8-quantized)	>10 Hz	~30 Hz policy tick
Whole-body control rate	—	500 Hz
Action chunk size	—	10 steps (330 ms)
Model size (FP32)	15–20M	~18M params
Cameras	—	wrist + head, 640×480 @ 30 Hz
Episode length	—	≤ 300 steps (10 s)

What worked. Canonical training instructions ("pick up the red cup", "pick up the blue box") succeed reliably when the scene matches the training distribution. Temporal ensembling visibly smooths the G1's motion: raw chunk boundaries are invisible at the wrist. Handoff to the Part 8 controller is stable: the robot never loses balance even when the policy emits a noisy reference, because the QP tracks a target only as far as its balance, joint-limit, and contact constraints allow.

What didn't. Out-of-distribution phrasings ("grab the scarlet mug") degrade as expected: CLIP's text embedding tolerates paraphrase, but only up to a point. Heavy lighting randomization during evaluation (not seen at training time) drops success on the wrist camera when specular highlights saturate the grasp region. Novel object colours not present in any demo fail more often than novel poses of known colours, a reminder that behaviour cloning generalizes along axes it has seen, not axes it has not.

Failure modes. The most common failure is a grasp approach that stops a centimetre short — the policy hesitates between two nearby targets and the ensemble averages toward the midpoint. Adding more demonstrations of the specific approach angle fixes it; tuning the ensemble decay does not.
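A two-prediction toy case shows why the decay knob cannot fix this. The numbers are hypothetical (two reach targets 2 cm apart), and the weighting assumes the older prediction gets weight 1 and the newer one exp(-decay). Whatever finite decay is chosen, the blend is a convex combination and lands strictly between the two targets.

```python
import math


def blend(old, new, decay):
    """Exponentially weighted average of two overlapping predictions."""
    w_old, w_new = 1.0, math.exp(-decay)
    return (w_old * old + w_new * new) / (w_old + w_new)


target_a, target_b = 0.40, 0.42   # hypothetical reach distances in metres

# Blend the two predictions for several decay settings.
blended = {d: blend(target_a, target_b, d) for d in (0.1, 0.5, 2.0)}
for decay, value in blended.items():
    print(f"decay={decay}: blended target = {value:.4f} m")
```

Changing the decay only slides the result along the segment between the two modes. Collapsing the modes themselves takes more data for the ambiguous approach.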

Series Takeaways

Nine labs. Two-link planar arm to language-conditioned humanoid. What should you walk away with?

The control stack is a pyramid, and the base matters most. FK, Jacobians, IK, PD, computed torque, contact, QP — every capability on top of that pyramid assumes the base works. The VLA in this lab is only possible because the whole-body controller in Part 8 is reliable, and that controller is only possible because the dynamics in Parts 5 and 6 are correct. Skip a layer and everything above it becomes a debugging nightmare.
Analytical control is not obsolete in the VLA era — it is infrastructure. The Part 8 whole-body QP generates the training data for the VLA and executes its outputs. Classical control bookends the learned policy on both sides. You cannot remove it without removing the robot.
The useful seam between learned and classical is the reference signal. A VLA is good at what to do. A QP is good at how to do it safely. Put the learning above the seam and the physics below it, and both get to play to their strengths.
Action chunking is the single highest-leverage idea in recent imitation learning. Predicting k actions at once beats predicting one action k times. Temporal ensembling costs almost nothing and hides most of the roughness.
Dual cameras beat a single wide-angle view. Wrist for precision, head for context. The brain solves depth from parallax; with two viewpoints, the policy has the cues to learn something similar.
Domain randomization is curriculum design, not a trick. It forces the expert demonstrator — and therefore the student — to visit parts of the state space the user will actually see at test time. Randomize what you want the policy to ignore.
Simple architectures train. ACT at 18M parameters is the smallest thing that works, and small things debug faster than large ones. You can scale to RT-2 later; you should not start there.
The 2-link arm in Lab 1 and the G1 in Lab 9 run the same pipeline. Geometry changes. Degrees of freedom change. The pattern — model, control, plan, execute, close the loop with perception — does not. If you learned it on two links you can apply it on twenty-nine.

That is the point of this series. The labs were never really about MuJoCo. They were about the one pipeline that every serious robot in simulation and in the world runs, taken apart and rebuilt from first principles, one capability at a time, until a sentence turns into motion.