Touch-GS: Visual-Tactile Supervised 3D Gaussian Splatting

Stanford University
IROS 2024 (to appear)

Touch-GS combines the power of vision and touch to reconstruct high-quality few-shot and challenging scenes, such as few-view object-centric scenes, mirrors, and transparent objects.

Abstract

We present a novel method to supervise 3D Gaussian Splatting (3DGS) scenes using optical tactile sensors. Optical tactile sensors have become widespread in robotics for manipulation and object representation; however, raw optical tactile sensor data is unsuitable to directly supervise a 3DGS scene. Our representation leverages a Gaussian Process Implicit Surface to implicitly represent the object, combining many touches into a unified representation with uncertainty. We merge this model with a monocular depth estimation network, which is aligned in a two-stage process: coarse alignment with a depth camera, followed by fine adjustment to match our touch data. For every input color image, our method produces a corresponding fused depth and uncertainty map. Utilizing this additional information, we propose a new loss function, a variance-weighted depth-supervised loss. We leverage the DenseTact optical tactile sensor and a RealSense RGB-D camera, deployed on a Kinova Gen3 robot, to show that combining touch and vision in this manner leads to quantitatively and qualitatively better results than vision or touch alone for few-view scene synthesis on opaque, reflective, and transparent objects. Our method is highlighted below, where prior methods fail to reconstruct the geometry of a mirror.

Method Image.

Method

Our method leverages state-of-the-art monocular depth estimation and a Gaussian Process Implicit Surface built from touches along an object, and optimally fuses them to train a Gaussian Splatting model or any other traditional NeRF. The monocular depth estimator gives us a coarse depth map, which we align to real-world scale using depth data from a noisy depth camera and then refine with our touch data. We then combine this with our Gaussian Process Implicit Surface, which provides a finer depth map. Finally, we use a novel, uncertainty-weighted depth loss to train a NeRF on few-view scenes, as well as on mirrors and transparent objects, where vision alone fails.

Method Image.
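As a concrete illustration, below is a minimal sketch of what such an uncertainty-weighted (variance-weighted) depth loss could look like, assuming per-pixel fused depth and variance maps; the function and argument names are illustrative, not the paper's exact implementation.

# Sketch of a variance-weighted depth loss (illustrative names, not the
# paper's exact implementation).
import torch

def variance_weighted_depth_loss(rendered_depth: torch.Tensor,
                                 fused_depth: torch.Tensor,
                                 fused_variance: torch.Tensor,
                                 eps: float = 1e-6) -> torch.Tensor:
    """Penalize depth error less where the fused depth is uncertain.

    rendered_depth : depth rendered from the Gaussian Splatting model, (H, W)
    fused_depth    : fused vision + touch depth map, (H, W)
    fused_variance : per-pixel variance of the fused depth, (H, W)
    """
    weights = 1.0 / (fused_variance + eps)        # high variance -> low weight
    sq_err = (rendered_depth - fused_depth) ** 2  # per-pixel squared depth error
    return (weights * sq_err).mean()

Pixels with high fused variance (for example, far from any touch and with large predicted depth) contribute less to the loss, so unreliable depth does not dominate training.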

Gaussian Process Implicit Surface

We use a Gaussian Process Implicit Surface (GPIS) to represent a touched object. The GPIS seamlessly fuses many noisy touches into a 3D representation of the object with uncertainty. The images below demonstrate how our GPIS fills in the gaps between touches and provides a smooth representation of the object, along with uncertainty.
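For intuition, here is a minimal GPIS sketch using a scikit-learn Gaussian process regressor with an RBF kernel and a signed-distance-style labeling of the touch points; the kernel, noise model, and labeling scheme are assumptions for illustration, not the paper's implementation.

# Minimal GPIS sketch: surface contacts get target value 0, an interior
# point gets -1; the zero level set of the posterior mean is the surface,
# and the posterior standard deviation gives uncertainty.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def fit_gpis(touch_points: np.ndarray, interior_point: np.ndarray):
    """touch_points: (N, 3) contact points on the surface.
    interior_point: (3,) a point known to lie inside the object."""
    X = np.vstack([touch_points, interior_point[None, :]])
    y = np.concatenate([np.zeros(len(touch_points)), [-1.0]])
    kernel = RBF(length_scale=0.05) + WhiteKernel(noise_level=1e-4)
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=False)
    gp.fit(X, y)
    return gp

# Query the implicit surface and its uncertainty:
# gp = fit_gpis(touches, object_center)
# mean, std = gp.predict(query_points, return_std=True)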

Monocular Depth Estimation and Fusion

We perform a monocular depth estimation and alignment procedure, which uses the ZOEDepth monocular depth estimator to output a coarse depth map. We then align this depth with real-world data from a RealSense camera to learn a scale factor and offset. Finally, we align this depth with our touch data using an additional offset. The output depth is used to compute an uncertainty map, based on a simple heuristic that uncertainty is proportional to depth: larger depth predictions mean higher uncertainty.
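A rough sketch of this two-stage alignment is shown below; the least-squares fit, the mean-offset correction, and the proportionality constant are assumptions used for illustration, not the exact procedure from the paper.

# Sketch of coarse (depth camera) then fine (touch) alignment of
# monocular depth, plus the depth-proportional uncertainty heuristic.
import numpy as np

def align_monocular_depth(mono_depth, realsense_depth, touch_depth, touch_mask):
    """mono_depth      : (H, W) ZOEDepth prediction
    realsense_depth : (H, W) noisy metric depth, 0 where invalid
    touch_depth     : (H, W) depth rendered from the GPIS
    touch_mask      : (H, W) boolean mask of pixels covered by touch"""
    # Stage 1: coarse scale and offset against the depth camera (least squares).
    valid = realsense_depth > 0
    A = np.stack([mono_depth[valid], np.ones(valid.sum())], axis=1)
    scale, offset = np.linalg.lstsq(A, realsense_depth[valid], rcond=None)[0]
    aligned = scale * mono_depth + offset

    # Stage 2: fine offset so vision depth agrees with touch where available.
    aligned = aligned + np.mean(touch_depth[touch_mask] - aligned[touch_mask])

    # Heuristic uncertainty: proportional to the predicted depth.
    k = 0.05  # assumed proportionality constant
    return aligned, k * aligned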

Method at a Glance

At a glance, our method takes (a) an RGB image, from which we compute (b) a depth map with ZOEDepth and (c) a vision uncertainty map. In the touch pipeline, we use the DenseTact sensor to collect (e) a set of touches across the object, and use the GPIS representation to compute (f) a touch depth map and (g) its uncertainty. Finally, we construct the optimally fused (d) depth and (h) uncertainty maps.
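One common way to fuse two depth estimates with per-pixel uncertainties is inverse-variance weighting of independent Gaussian estimates; the sketch below shows how the fused maps (d) and (h) could be formed under that assumption, and is not a verbatim reproduction of the paper's fusion step.

# Per-pixel fusion of vision and touch depth via inverse-variance weighting.
import numpy as np

def fuse_depth(vision_depth, vision_var, touch_depth, touch_var, eps=1e-9):
    """All inputs are (H, W) maps; returns the fused depth and variance."""
    w_v = 1.0 / (vision_var + eps)   # weight of the vision estimate
    w_t = 1.0 / (touch_var + eps)    # weight of the touch estimate
    fused_depth = (w_v * vision_depth + w_t * touch_depth) / (w_v + w_t)
    fused_var = 1.0 / (w_v + w_t)    # fused estimate is never less certain
    return fused_depth, fused_var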

Results

We show our method on novel views compared to a baseline that uses only vision-based depth, on a simulated scene and several challenging real-world scenes. Each scene below consists of an RGB and depth comparison between Depth-GS (vision-only depth) and Touch-GS; our method is on the right side of the slider.

Bunny Blender Scene (5 input views)
Bunny Real Scene (8 input views)
Mirror Real Scene (151 input views)
Prism Real Scene (58 input views)

Ablations

We show ablations of our method on the simulated Blender scene (RGB on the left, depth on the right) to highlight the contribution of each component.

No Depth
Sparse Depth
Vision-based Dense Depth
Poorly Tuned Vision-based Dense Depth
Raw Depth from ZOEDepth
Touch-Only Depth
Touch-Only Depth Initializing With Touch Points
Full Method, No Uncertainty and Touch Point Initialization
Full Method, No Uncertainty
Full Method
Full Method trained on Ground Truth Depth

Visualizations

We include interactive visualizations of splats trained with our method to highlight its advantages (best viewed on larger screens, minimum 769-pixel width).

5 Input Views / Left: Depth-GS / Right: Touch-GS

8 Input Views / Left: Depth-GS / Right: Touch-GS
Navigate: click and drag / Zoom: shift and scroll

BibTeX

@article{swann2024touchgs,
  author    = {Aiden Swann and Matthew Strong and Won Kyung Do and Gadiel Sznaier Camps and Mac Schwager and Monroe Kennedy III},
  title     = {Touch-GS: Visual-Tactile Supervised 3D Gaussian Splatting},
  journal   = {arXiv},
  year      = {2024},
}