NVIDIA IsaacSim SmolVLA LeRobot VLA Robotics Deep Learning Python

Deployment of our fine-tuned SmolVLA policy on the simulated SO-101 arms. The failure-recovery behavior seen here likely emerges from SmolVLA's pretraining rather than from our fine-tuning, since our training dataset contained only successful demonstrations. As seen here, keeping the shirt from crumpling is a struggle; even so, this fold counts as a success by the competition's distance metrics. We could have improved the failure-recovery behavior either with an expert-driven method such as DAgger or with an RL-based post-training approach.

Overview

Main Repo | LingBot VLA Fork | HuggingFace

Team: Saif Ahmad, Andnet DeBoer, Conor Hayes, Praneeth Reddy Mallupalli, Kasina Jyothi Swaroop, Chenyu Zhu, Rob Zhu

In this project, we taught two HuggingFace SO-101 arms to fold laundry as entrants in ICRA 2026’s LeHome Challenge (team name: LaundryNauts).

Quick facts about the challenge:

4 garments to fold (pants, shorts, long-sleeve shirt, T-shirt)
Given a basic IsaacSim environment with the arms in a simulated kitchen
Given meshes for 4 examples of each garment
Competition performance was evaluated on an unseen validation set with many more garments & environments
A successful fold was defined by the distances between several pairs of keypoints, which mark regions of the fabric that should end up close together or far apart (e.g. a shirt’s left sleeve should be folded across the chest to the opposite shoulder, without crumpling the shirt). My write-up on how to achieve each fold is here
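
To make the success criterion concrete, here is a minimal sketch of a keypoint-distance check. The keypoint names, the thresholds, and the `PairRule` helper are hypothetical illustrations; this is not the challenge's actual evaluator.

```python
import math
from dataclasses import dataclass
from typing import Dict, Optional, Sequence, Tuple

Point = Tuple[float, float, float]


@dataclass(frozen=True)
class PairRule:
    """One constraint on the distance between two named keypoints (hypothetical)."""
    a: str
    b: str
    max_dist: Optional[float] = None  # pair must end up at most this far apart (m)
    min_dist: Optional[float] = None  # pair must stay at least this far apart (m)


def fold_succeeded(keypoints: Dict[str, Point], rules: Sequence[PairRule]) -> bool:
    """Return True only if every keypoint pair satisfies its distance bounds."""
    for rule in rules:
        d = math.dist(keypoints[rule.a], keypoints[rule.b])
        if rule.max_dist is not None and d > rule.max_dist:
            return False  # points that should be close are too far apart
        if rule.min_dist is not None and d < rule.min_dist:
            return False  # points that should stay apart have collapsed together
    return True


# Made-up thresholds for a shirt: the left cuff should reach the right shoulder,
# while the two shoulders should stay spread apart (no crumpling).
SHIRT_RULES = [
    PairRule("left_cuff", "right_shoulder", max_dist=0.10),
    PairRule("left_shoulder", "right_shoulder", min_dist=0.25),
]

folded = {
    "left_cuff": (0.32, 0.02, 0.0),
    "left_shoulder": (0.0, 0.0, 0.0),
    "right_shoulder": (0.35, 0.0, 0.0),
}
# Same pose, but the shoulders have been dragged together.
crumpled = dict(folded, left_shoulder=(0.30, 0.0, 0.0))

print(fold_succeeded(folded, SHIRT_RULES))    # True
print(fold_succeeded(crumpled, SHIRT_RULES))  # False
```

A check of this shape explains why a visibly wrinkled fold can still pass: only the listed pairs are measured, so wrinkles that leave those distances within bounds do not count against the fold.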

Our performance

40.0% success rate across 4 garment types on the competition’s private validation set, with a single policy handling all 4 garments.
Placed 54th out of ~230 entrants in the competition

My role:

Project manager (organized the effort, led meetings, distributed tasks)
Ported LingBot VLA to work with the competition environment, then set it up and trained it on a RunPod instance
Set up the local server environment & evaluated Diffusion Policy, ACT, and SmolVLA baselines
Designed & implemented state machines & motion planning for automated data generation pipeline

Egocentric video data as supplied to the policy. The policy receives these 3 camera inputs as well as joint angles, and outputs joint angles.

Solution

Our best result was achieved by fine-tuning SmolVLA, with a training set generated by data augmentation (changing rooms & garment color) from a limited number of teleoperated demonstrations. This policy achieved 40.0% success across all garments on the competition’s private validation set (long-sleeve shirts: 39.5%, T-shirts: 10.5%, pants: 43.5%, shorts: 66.5%), and 53.2% success across all garments on our training set.
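
As a quick sanity check, the overall figure matches an unweighted mean of the four per-garment rates. Equal weighting is an assumption here; the competition's exact aggregation is not stated above.

```python
# Per-garment success rates (%) on the private validation set, as reported above.
per_garment = {
    "long-sleeve shirt": 39.5,
    "T-shirt": 10.5,
    "pants": 43.5,
    "shorts": 66.5,
}

# Assumption: every garment type is weighted equally.
overall = sum(per_garment.values()) / len(per_garment)
hardest = min(per_garment, key=per_garment.get)

print(f"overall: {overall:.1f}%")  # overall: 40.0%
print(f"hardest garment: {hardest}")  # hardest garment: T-shirt
```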

Process:

Model inputs: wrist cameras (x2), overhead camera (x1), joint angles (2 arms x 6 DoF = 12)
Model outputs: joint angles (2 arms x 6 DoF = 12)
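
The interface above can be sketched as a small data structure with shape checks. The camera names, the `Observation` class, and the stand-in policy are all hypothetical; the real pipeline uses the competition environment and the trained model rather than anything defined here.

```python
from dataclasses import dataclass
from typing import Any, Dict, List

NUM_ARMS = 2
JOINTS_PER_ARM = 6
STATE_DIM = NUM_ARMS * JOINTS_PER_ARM  # 2 x 6 = 12 joint angles
CAMERAS = ("left_wrist", "right_wrist", "overhead")  # hypothetical camera names


@dataclass
class Observation:
    images: Dict[str, Any]     # one frame per camera, keyed by camera name
    joint_angles: List[float]  # current joint angles for both arms


def check_observation(obs: Observation) -> None:
    """Raise ValueError if a camera frame is missing or the state has the wrong size."""
    missing = [name for name in CAMERAS if name not in obs.images]
    if missing:
        raise ValueError(f"missing camera frames: {missing}")
    if len(obs.joint_angles) != STATE_DIM:
        raise ValueError(f"expected {STATE_DIM} joint angles, got {len(obs.joint_angles)}")


def check_action(action: List[float]) -> None:
    """Raise ValueError if the action is not one target angle per joint."""
    if len(action) != STATE_DIM:
        raise ValueError(f"expected {STATE_DIM} target angles, got {len(action)}")


def hold_position_policy(obs: Observation) -> List[float]:
    """Stand-in for the learned policy: command the current joint angles."""
    check_observation(obs)
    action = list(obs.joint_angles)
    check_action(action)
    return action


# Frames are left as None placeholders; a real observation would hold image arrays.
obs = Observation(images={name: None for name in CAMERAS},
                  joint_angles=[0.0] * STATE_DIM)
print(len(hold_position_policy(obs)))  # 12
```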

Models evaluated:

X-VLA
SmolVLA
LingBot
Diffusion Policy
Action Chunking Transformer (ACT)

Data augmentation strategies used:

NVIDIA Cosmos: Changing arm, garment, and background textures + colors
IsaacSim: Scripted changes to the room, garment colors, and lighting during teleop (the idea was that behavior may differ in low lighting, and to collect failure-recovery information)
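
A minimal sketch of the scripted-randomization idea: sample a room, a garment color, and a lighting level for each episode. The `SceneVariant` fields, the sampling ranges, and the room names are hypothetical, and applying a variant to the IsaacSim scene is deliberately left out because it depends on the environment's own API.

```python
import colorsys
import random
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class SceneVariant:
    room: str
    garment_rgb: Tuple[float, float, float]  # each channel in [0, 1]
    light_intensity: float                   # unitless scale factor (hypothetical)


def sample_variant(rng: random.Random,
                   rooms: Sequence[str],
                   intensity_range: Tuple[float, float] = (0.3, 1.0)) -> SceneVariant:
    """Draw one random scene configuration for an episode."""
    # Sample in HSV so any hue is possible while avoiding very dark or washed-out colors.
    hue = rng.random()
    saturation = rng.uniform(0.4, 1.0)
    value = rng.uniform(0.5, 1.0)
    return SceneVariant(
        room=rng.choice(list(rooms)),
        garment_rgb=colorsys.hsv_to_rgb(hue, saturation, value),
        light_intensity=rng.uniform(*intensity_range),
    )


rng = random.Random(0)  # fixed seed so a set of variants can be regenerated
variants = [sample_variant(rng, ["kitchen_a", "kitchen_b"]) for _ in range(3)]
for v in variants:
    print(v)
```

Seeding the generator keeps the set of variants reproducible, which helps when a particular room or lighting level needs to be revisited during evaluation.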

Data collection strategies:

Manual teleoperation: using leader arms to control follower arms in IsaacSim
Automated data generation: using traditional robotics programming techniques (the NVIDIA cuRobo motion planner plus a state machine for folding each garment) to perform the fold automatically, given perfect knowledge of the clothing mesh and keypoints. We didn’t get this done in time to submit for the competition.
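
The state-machine side of the automated pipeline can be sketched as follows. The state names, the retry rule, and the `execute` callback are hypothetical; in the real pipeline each state would call the motion planner, which is stubbed out here.

```python
from enum import Enum, auto
from typing import Callable, List


class FoldState(Enum):
    APPROACH = auto()  # move the gripper above the grasp keypoint
    GRASP = auto()     # close the gripper on the fabric
    LIFT = auto()      # raise the fabric off the table
    CARRY = auto()     # move to the target keypoint
    RELEASE = auto()   # open the gripper
    DONE = auto()


NEXT_STATE = {
    FoldState.APPROACH: FoldState.GRASP,
    FoldState.GRASP: FoldState.LIFT,
    FoldState.LIFT: FoldState.CARRY,
    FoldState.CARRY: FoldState.RELEASE,
    FoldState.RELEASE: FoldState.DONE,
}


def step(state: FoldState, motion_succeeded: bool) -> FoldState:
    """Advance on success; on failure, restart the fold from APPROACH."""
    if state is FoldState.DONE:
        return FoldState.DONE
    if not motion_succeeded:
        return FoldState.APPROACH
    return NEXT_STATE[state]


def run_fold(execute: Callable[[FoldState], bool], max_steps: int = 50) -> List[FoldState]:
    """Run one fold; `execute` performs a state's motion and reports success."""
    state = FoldState.APPROACH
    trace = [state]
    for _ in range(max_steps):
        if state is FoldState.DONE:
            break
        state = step(state, execute(state))
        trace.append(state)
    return trace


# Stub executor: every motion succeeds, so the fold runs straight through.
trace = run_fold(lambda state: True)
print([s.name for s in trace])
# ['APPROACH', 'GRASP', 'LIFT', 'CARRY', 'RELEASE', 'DONE']
```

Restarting from `APPROACH` on any failure is the simplest possible recovery rule. The `max_steps` cap keeps a fold that never succeeds from looping forever.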

Lessons Learned

Ease of use in the data collection, training, and evaluation pipelines is extremely important because it tightens the development-test loop. Without it, iterating on the system quickly becomes demoralizing.
COMPUTE MATTERS SO MUCH. The limited resources of our lab’s local servers (2 NVIDIA RTX 6000 GPUs with 48 GB of VRAM) made running IsaacSim slow and made it impossible to train large models like LingBot. Spending cash on an H200 instance on RunPod paid off immediately once I switched to it for training LingBot.
Teleoperation is hard. We made a last-minute pivot to automated data generation, and I think that strategy would have paid off had we finished in time. As for teleoperation itself, the leader/follower control interface was extremely difficult to master, especially with an arm like the 6-DoF SO-101, which is not overactuated. Things that would have made this easier:
- A more easily backdrivable leader arm. I think the motors are geared too high for this purpose.
- More power in the leader arm motors; they struggle to overcome the force of the hand even when the follower arm impacts a surface.
- More responsive haptic signals. When teleoperating the simulated arm in Isaac, there was a small but perceptible delay in control & haptic response, which made it extremely difficult to manipulate anything usefully.
- Tactile feedback: even just a buzzer to indicate impact with a surface, or lower complex impedance in the leader arm's control system, so that events like the follower scraping across a material come through as vibration.

My ideas on teleoperation above are greatly informed by my work with Prof. Ed Colgate, who is supervising my thesis, and his fantastic PhD students working on haptics, as well as Prof. Kevin Lynch & his equally fantastic students.