Tactile Sensing Robotic Manipulation Imitation Learning Diffusion Policy Raspberry Pi Python SLAM ROS 2

PolyUMI: Visual + Auditory + Tactile Manipulation Platform for Imitation Learning

GitHub Gripper CAD Franka Mount CAD Build Instructions (Coming Soon)

PolyUMI is a novel real-time data collection & control platform for robotic imitation learning, which unifies the following sensor modalities in a single end-effector:

  • touch (via a custom optical tactile-sensing finger, based off of PolyTouch) - 10fps 540x480 MJPEG video (MP4)
  • mechanical vibration (via a contact microphone fixed to the finger housing) - 16kHz PCM audio (WAV)
  • vision (via GoPro camera on wrist + finger camera peripheral vision) - 60fps 1920x1080 MJPEG video (MP4) + 10fps 540x480 MJPEG video
  • proprioception (via monocular inertial SLAM from GoPro + IMU in gripper, or robot joint encoders + FK in embodiments)

It combines the Universal Manipulation Interface (UMI) platform with a custom touch-sensing finger inspired by the PolyTouch tactile + audio sensor, with hardware, firmware, and software built from scratch for a modern robotics stack (ROS2 Kilted + Python 3.13 + Foxglove visualizer) [1, 2].

PolyUMI platform overview
PolyTouch + UMI =
Gripper + EE CAD Gripper + EE IRL Live visualization of gripper audiovisual data
PolyUMI combines the touch-sensing capabilities of the PolyTouch sensor with the UMI platform, with novel mechanisms, electrical design, firmware, and software purpose-built for both offline and real-time data collection and control.

Philosophically, my inspiration for this project comes from my belief that embodied AI’s greatest ally is hardware-software co-design. AI taking over the planning & control layers of the robotics stack means a more flexible, agile, intuitive approach to hardware and sensing is possible.

With the completion of this phase of the PolyUMI project, I’m now able to pursue multimodal imitation learning research on my own research platform built from the bottom-up; this means that my iteration on model performance will include iteration on the full design, including the hardware itself.

System Overview

PolyUMI Gripper

UMI-style gripper supporting the PolyUMI touch sensing finger (designed & manufactured by me, with UMI’s open source gripper as a starting point)

  • Fully wireless data recording, with ~5hrs of battery life. No external PC required until you want to transfer the data for processing & training.
  • Start and stop recording with a touch of a single button, with all data sources time-synchronized.
  • Novel mechanical, electrical, and software design to support PolyUMI’s 4 simultaneous datastreams, with a focus on ease of use and rapid iteration
PolyUMI platform overview
PolyUMI gripper CAD PolyUMI gripper IRL PolyUMI finger exploded view

PolyUMI End-Effectors

Franka Hand

PolyUMI end-effector for the Franka Hand (designed & manufactured by me)

  • Maintains the same spatial relationships between sensors & fingers as the PolyUMI gripper to aid in policy transfer
  • Separate finger-to-mount adapter (pink in photo left, gunmetal grey in CAD right) enable reuse of the Gripper’s finger designs on the Franka end-effector. This also enables using one physical sensing finger for both training and inference by transferring it from the Gripper to the arm, which will likely improve model performance.
Franka EE IRL Franka EE CAD Exploded View

Others

System can easily support other arms, humanoids, etc — anything with a wrist. Just need an end-effector hardware design for each.

Data Pipeline

PolyUMI software component diagram

PolyUMI’s firmware and software is architected from scratch with the following priorities:

  1. As close to turnkey as possible.
    • With this system, given a clean hardware setup, you can go from 0 to 1 in 20 minutes (nothing is installed -> livestreaming data from the arm, recording data on the gripper, and postprocessing gripper data on PC)
  2. Data collection & processing is quick and easy
    • a single button press starts & stops data collection, and a single command fetches all recorded episodes onto PC from the gripper over the network.
  3. Data is ready for training
    • all data is timestamped and synchronized to the same clock
    • all sensors produce data at rates and resolutions standard for training robot policies (>10Hz, >540p)
  4. Data artifacts are well-organized
    • the postprocessing pipeline gathers all data from the hardware and stores each episode at-rest in an MCAP file alongside comprehensive metadata, saving intermediate artifacts for traceability
    • built for robot policy training; pipeline extensions convert at-rest format to any common dataset format (Zarr, LeRobot Dataset, etc)
    • all demonstrations can be replayed or livestreamed to Foxglove

Sensors: Theory of Operation

Optical Tactile Finger

The finger surface deformation is clearly visible in the color gradient when an object (such as this M3x30 screw) presses against it. Audio is from the finger's integrated contact microphone.
labelled sensor view
Guide to the visual field of the finger camera. (click for full-size)

The PolyUMI finger follows the same core sensing principle as PolyTouch: a deformable sensing surface is observed by an internal camera through a curved mirror [2].

The hardware & software for the PolyUMI finger are designed and built from scratch; however, the sensing approach and system architecture draw heavily from PolyTouch. No source code or design files from that work were available — only the published paper — so all implementation details were independently developed, and modifications were made as appropriate to fit the goals of rapid iteration and integration with a UMI-style gripper design.

  • Sensing surface: VHB tape for conformability, coated with aluminim powder + covered with medical tape on the outer surface for reflectivity + texture & durability (following PolyTouch). Mounted on an acrylic plate which is detachable from finger assembly for easy replacement + iteration.
  • Illumination: Flexible LED tube mounted in the rear of the sensing surface (see below for ECAD + circuit diagrams, and Onshape linked above for mechanical details).
  • Peripheral vision: side windows provide a secondary view of the manipulation scene, similar to PolyTouch.
  • Output: 10 fps MJPEG video at 540x480 (stored in the finger as JPEG frames for efficiency, and at rest as MP4 + MCAP)

Sensing surface manufacturing instructions are here. Full build documentation including 3D print, laser cut, and mechanical + PCB assembly instructions are coming soon.

Tactile sensors inspiring this one: PolyTouch [2], GelSight [3], DenseTact [4].

PolyUMI gripper CAD
The camera obtains an "overhead view" of the sensing surface via a curved mirror mounted on the opposite inner surface of the finger. This gives a more detailed view of the surface interior and gives a sufficiently consistent optical distance from surface to lens that the whole surface fits in the camera's depth-of-field. Figure adapted from PolyTouch [2] to convey the general idea.

Contact Microphone

The contact microphone is rigidly coupled to the finger housing, and it mostly captures mechanical vibration traveling through the sensor body, with relatively little airborne sound. The primary purpose of this modality is to capture moments of contact between the finger and the environment, whether for precise timestamping of contact events or to capture information about the nature of the contact (e.g. a hard tap vs a soft press, or texture information from sliding contact) [5].

The audio DAC and amplification is handled prior to processing in the RPi firmware using the Raspiaudio ULTRA+ HAT, a COTS hobby audio interface board for the Raspberry Pi Zero. The contact mic itself is a low-end unit designed for use on acoustic guitars (see the BOM spreadsheet for details).

  • Output: 16 kHz mono PCM audio (stored at rest as WAV container, MCAP)
Contact mic waveform
Contact microphone's output for a 1-minute demo session. Contacts between the finger and a hard surface are evident as sharp spikes in the waveform.
Contact mic
Contact mic mount point.

Wrist Camera

This sensing system follows the UMI design with minor hardware updates for a new GoPro version (Hero 12 vs. Hero 9) [1]. PolyUMI also adds BLE-based control of the GoPro from the gripper’s onboard SBC (Raspberry Pi Zero 2W), which is required to synchronize timestamps for all sensor datastreams & ensure that the sensors begin recording at the same time.

  • GoPro Hero 12 + MAX Lens Mod 2.0 (approximately 177 deg FOV)
  • Side mirrors provide a binocular view of the manipulated object.
  • Output:: 60 fps video at rest as MP4
  • Control: via Open GoPro’s BLE API from the gripper’s onboard Raspberry Pi

Proprioception

PolyUMI proprioception dataflow
Proprioceptive data can be supplied as embodiment-specifc joint angles or embodiment-agnostic 6DoF pose + gripper width at both training and inference time.

Embodiment Proprioception

When mounted to an arm or any other embodiment, joint angle data can be sourced directly from motor encoders, and EE pose can be trivially obtained through forward kinematics. For the Franka arm, this data is all availble directly from the libfranka API.

PolyUMI Gripper Proprioception

When data is collected from the gripper alone, proprioceptive information must be derived from other sensor data. In particular, PolyUMI follows the original UMI paper in using ORB-SLAM3’s monocular-inertial SLAM implementation to derive a 6DoF pose trajectory using the GoPro’s video feed and integrated IMU [1, 6]. Gripper width is calculated using the ArUco tags on the gripper fingers.

Then, an embodiment-specific IK solver can be used to map this pose trajectory to a trajectory in joint-space.

System Architecture and Design

Mechanical

  • CAD resources: Gripper and Franka Mount Onshape documents (linked above).
  • 6 new parts designed:
    • PolyUMI sensor finger (4 parts):
      • Finger shell – housing for the mirror, sensing surface, LED strip, and contact mic.
      • Finger holder – encloses rear of finger, and mounts the finger camera + raspberry pi stack
      • Side window(s) + sensor surface acrylic - completes finger enclosure
    • Franka EE (2 parts):
      • Finger adapter - adapts the Franka Hand’s parallel gripper fingers to the PolyUMI’s fingers
      • Gopro + mirror mount - mounts the GoPro and mirrors to the Franka Hand
  • Additional modifications:
    • The original fin-ray finger + UMI finger mount were slightly enlarged to match the dimensions of the sensing finger.
    • The remaining parts of the original UMI (core trigger mechanism + GoPro mount + finger rail) were left identical to the original; this also enables existing users of the UMI to switch out only the fingers to switch to the PolyUMI system.

Electrical

LED circuit diagram LED circuit detail Electrical system diagram
Quick-and-dirty LED driver circuit. The RPi Zero 2W's 16mA GPIO pins can't drive the LED strip's 80mA load directly, so PWM control of a MOSFET is used, powered by the RPi's 5V rail.

Software

Pi firmware help screen Postprocessing help screen
Help screens for the PolyUMI pi firmware and postprocessing tools.
Receiver node info
ROS 2 parameters & topics published by the receiver node.

Next Steps

  • hardware design improvements
    • add second LED strip at fingertip; design custom LED pcb
      • need better illumination at front of sensor
      • better control over light orientation (the current tube gets twisted), more mechanical robustness, and smaller form factor
    • optimize mirror curvature + camera placement to minimize distortion, variation in effective sensing surface distance (so it all fits in the same depth-of-field), and blind spots (currently there’s one at the rear of the sensor)
    • sensing surface stiffness - experiment with materials to make it softer
  • slam pipeline robustness improvements
  • implement inference + training infrastructure, and evaluate various policies! planning to explore
    • diffusion policy
    • ACT
    • various VLA’s with SFT
    • RL-based fine-tuning of all for failure recovery
    • others TBD
  • publish :)

PolyUMI failed & test parts
I iterated on these parts quite a bit.
Here is a graveyard of not-quite-right prints, failed cuts, broken mirrors, and also a nice jig for doing heat-set inserts for the camera mount, which is at an awkward angle inside the finger housing cavity.

References

  1. Universal Manipulation Interface (UMI). Project page and publication links. https://umi-gripper.github.io/
  2. PolyTouch tactile + audio sensor. Project page and publication links. https://polytouch.alanz.info/
  3. GelSight tactile sensing technology. https://www.gelsight.com/
  4. DenseTact optical tactile sensor. https://techfinder.stanford.edu/technology/densetact-optical-tactile-sensor
  5. ManiWAV: Learning robot manipulation from in-the-wild audio-visual data. Project page and publication links. https://mani-wav.github.io/
  6. ORB-SLAM3 repository and related paper resources. https://github.com/UZ-SLAMLab/ORB_SLAM3