Humanoid VLA: Vision-Language-Action Controlled Robot

Overview

This system joins language conditioning, robot perception, and action chunking into a single control loop for the Unitree G1 humanoid in MuJoCo: a natural-language command goes in, and a trained ACT policy executes the manipulation. The work spanned 10 weeks across 6 project phases, and the focus throughout was making a research stack stable enough for repeated evaluation — not just a polished demo run.

System Architecture

flowchart LR
  cmd["Natural-language<br/>command"] --> parse["Task parser"]
  parse --> act["ACT policy<br/>15.6M params"]
  obs["Vision + proprioception"] --> act
  act -->|"20-step action chunks"| ens["Temporal ensembling"]
  ens --> pd["PD controller<br/>+ gravity compensation"]
  pd --> g1["Unitree G1<br/>(MuJoCo)"]
  g1 --> obs

The loop runs through ROS 2 at 30 Hz: language is parsed into a task specification, the ACT policy consumes camera images and joint state, and predicted action chunks are blended and executed by a low-level PD controller.

Simulation & Control

The G1 model uses torque-controlled joints, so every layer above it rests on a hand-built control foundation:

PD control with gravity compensation: τ = Kp(q_des − q) − Kd·q̇ + τ_gravity, with gravity torques computed via mj_inverse on a zero-velocity, zero-acceleration state, so the result is the pure holding torque for the current pose. Without this term, the arms fall limp before the policy can act.
Leg stabilization: even with a fixed base, uncontrolled leg joints drift over long episodes, so a separate PD loop pins them.
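
The control law above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's controller: the function name `pd_gravity_torque`, the per-joint gain lists, and the plain-list interface are hypothetical, and `tau_gravity` stands in for whatever the simulator's inverse-dynamics call returns for the static pose.

```python
from typing import List, Sequence


def pd_gravity_torque(
    q_des: Sequence[float],
    q: Sequence[float],
    qd: Sequence[float],
    tau_gravity: Sequence[float],
    kp: Sequence[float],
    kd: Sequence[float],
) -> List[float]:
    """Per-joint torque: Kp * (q_des - q) - Kd * qd + tau_gravity.

    All arguments are per-joint sequences of equal length. The gains and
    the gravity term are supplied by the caller (hypothetical interface).
    """
    n = len(q)
    if not all(len(v) == n for v in (q_des, qd, tau_gravity, kp, kd)):
        raise ValueError("all inputs must have one entry per joint")
    return [
        kp[i] * (q_des[i] - q[i]) - kd[i] * qd[i] + tau_gravity[i]
        for i in range(n)
    ]
```

When the joint is already at its target and at rest, the PD terms vanish and the output reduces to `tau_gravity`, which is exactly the torque that keeps the arm from sagging.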

Data Generation

Training data came from scripted demonstrations driven by Jacobian-based inverse kinematics, with two hard-won practices baked in:

Every generated demo is validated before entering the dataset — iterative IK diverges near workspace boundaries, and a handful of bad demos measurably poisons a small dataset.
Weld constraints during collection keep objects attached to the hand while scripting, avoiding physics artifacts that would otherwise teach the policy the wrong contact dynamics. The bimanual tasks then drop the weld and grasp by friction alone, which demands genuine force balancing.
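
The validation step can be illustrated on a toy problem. The sketch below is an assumption-laden stand-in, not the project's pipeline: it uses a planar two-link arm instead of the G1, damped least squares as one common Jacobian-based IK variant, and a simple rule (every waypoint must converge within tolerance) as the acceptance test. The link lengths, damping, tolerance, and all function names are hypothetical.

```python
import math

L1, L2 = 0.30, 0.25  # hypothetical link lengths in meters


def fk(q):
    """End-effector position of the planar two-link arm."""
    x = L1 * math.cos(q[0]) + L2 * math.cos(q[0] + q[1])
    y = L1 * math.sin(q[0]) + L2 * math.sin(q[0] + q[1])
    return x, y


def jacobian(q):
    """2x2 position Jacobian d(x, y) / d(q0, q1)."""
    s1, c1 = math.sin(q[0]), math.cos(q[0])
    s12, c12 = math.sin(q[0] + q[1]), math.cos(q[0] + q[1])
    return [[-L1 * s1 - L2 * s12, -L2 * s12],
            [L1 * c1 + L2 * c12, L2 * c12]]


def ik_step(q, target, damping=1e-2):
    """One damped least-squares update: dq = J^T (J J^T + lambda^2 I)^-1 e."""
    x, y = fk(q)
    ex, ey = target[0] - x, target[1] - y
    J = jacobian(q)
    # Entries of the symmetric 2x2 matrix A = J J^T + lambda^2 I.
    a = J[0][0] ** 2 + J[0][1] ** 2 + damping ** 2
    b = J[0][0] * J[1][0] + J[0][1] * J[1][1]
    d = J[1][0] ** 2 + J[1][1] ** 2 + damping ** 2
    det = a * d - b * b  # strictly positive because of the damping term
    ux = (d * ex - b * ey) / det
    uy = (-b * ex + a * ey) / det
    dq0 = J[0][0] * ux + J[1][0] * uy
    dq1 = J[0][1] * ux + J[1][1] * uy
    return [q[0] + dq0, q[1] + dq1]


def solve_ik(target, q_init, iters=100, tol=1e-4):
    """Iterate until the position error drops below tol; report success."""
    q = list(q_init)
    for _ in range(iters):
        q = ik_step(q, target)
        x, y = fk(q)
        if math.hypot(target[0] - x, target[1] - y) < tol:
            return q, True
    return q, False


def validate_demo(waypoints, q_start=(0.3, 0.6)):
    """Accept a scripted demo only if IK converges at every waypoint."""
    q = list(q_start)
    for wp in waypoints:
        q, ok = solve_ik(wp, q)
        if not ok:
            return False  # reject the whole demo, do not keep a partial one
    return True
```

A demo whose waypoints stay inside the reachable workspace passes, while one that includes a target beyond the arm's 0.55 m reach is rejected before it can enter the dataset.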

ACT Policy

The policy is a 15.6M-parameter Action Chunking Transformer trained in about 2.5 hours on an RTX 4050 laptop GPU:

Action chunking — predicting 20 timesteps per inference instead of one is the single biggest contributor to smooth, temporally consistent motion.
Partially frozen vision encoder — ResNet18 with early layers frozen; training the full encoder on a small robotics dataset overfits.
Temporal ensembling — overlapping chunks blended with exponential decay weighting, rather than executed back-to-back.
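
Temporal ensembling can be sketched as follows. This is a minimal illustration, not the project's code: the class name `TemporalEnsembler`, the list-based interface, and the default decay constant are assumptions, and the weighting direction follows the convention of the original ACT formulation (weight exp(-m·i), with i = 0 for the oldest chunk), which the source does not spell out.

```python
import math
from collections import deque
from typing import Deque, List, Tuple

Chunk = List[List[float]]  # chunk[k] is the action predicted for k steps ahead


class TemporalEnsembler:
    """Blend the overlapping chunk predictions that cover the current step."""

    def __init__(self, chunk_size: int = 20, decay: float = 0.01) -> None:
        self.chunk_size = chunk_size
        self.decay = decay  # assumed value; larger means faster decay
        self.t = 0
        self.chunks: Deque[Tuple[int, Chunk]] = deque()  # oldest first

    def step(self, new_chunk: Chunk) -> List[float]:
        """Register the chunk predicted at the current step and return the
        blended action to execute now."""
        if len(new_chunk) != self.chunk_size:
            raise ValueError("chunk length must equal chunk_size")
        self.chunks.append((self.t, new_chunk))
        # Drop chunks that no longer contain a prediction for step t.
        while self.t - self.chunks[0][0] >= self.chunk_size:
            self.chunks.popleft()
        # Exponential weights, index 0 for the oldest surviving chunk.
        weights = [math.exp(-self.decay * i) for i in range(len(self.chunks))]
        total = sum(weights)
        dim = len(new_chunk[0])
        action = [0.0] * dim
        for w, (start, chunk) in zip(weights, self.chunks):
            pred = chunk[self.t - start]  # this chunk's prediction for step t
            for j in range(dim):
                action[j] += (w / total) * pred[j]
        self.t += 1
        return action
```

Calling `step` once per control tick with the freshly predicted chunk yields one blended action per tick, so a single outlier prediction is averaged against up to `chunk_size - 1` earlier predictions for the same timestep instead of being executed directly.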

Evaluation

Capability	Result
Single-arm manipulation (4 tasks)	86% success
Bimanual grasping	100% success
In-distribution, with domain randomization	90% success
Out-of-distribution generalization	55% success, degrading gracefully rather than collapsing

Domain randomization was layered in deliberately to measure the gap between in-distribution and out-of-distribution performance. A drop from 90% to 55% with graceful degradation was the target behavior; the failure mode to avoid was a policy that silently memorizes the nominal scene.
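
The size of that gap can be made explicit with a small helper. The function below is a hypothetical convenience, not part of the project; it only restates the two reported success rates as an absolute drop in percentage points and as the fraction of in-distribution performance retained out of distribution.

```python
def generalization_gap(in_dist_rate: float, ood_rate: float):
    """Return (gap in percentage points, fraction of performance retained).

    Both rates are success fractions in [0, 1]; in_dist_rate must be nonzero.
    """
    if not (0.0 < in_dist_rate <= 1.0 and 0.0 <= ood_rate <= 1.0):
        raise ValueError("rates must be fractions, with in_dist_rate > 0")
    gap_points = round((in_dist_rate - ood_rate) * 100, 1)
    retention = round(ood_rate / in_dist_rate, 3)
    return gap_points, retention
```

With the figures from the table, `generalization_gap(0.90, 0.55)` gives a 35-point drop and a retention of roughly 0.61, meaning the policy keeps about three fifths of its in-distribution success rate on unseen conditions.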

Lessons

The condensed engineering lessons from this build — controller design, IK fragility, ACT training practice, bimanual contact — are written up in 50 Lessons from Building a Humanoid VLA System.