SHuBERT: Self-Supervised Sign Language Representation Learning via Multi-Stream Cluster Prediction

1Toyota Technological Institute at Chicago, 2Massachusetts Institute of Technology
Teaser

SHuBERT pre-training.

(a) We locate a set of landmarks in each frame of the input video using MediaPipe, with inter-frame interpolation to fill in missing landmarks. From these, we extract the upper body pose, crop the hand and face regions, and blur and partially mask the face crop for a measure of privacy.
(b) We extract DINOv2 features for the hand and face crops, which together with the body pose yield a four-stream representation (two hands, face, body pose) for each frame.
(c) We assign the feature vectors for frame \(t\) to cluster indices using pre-computed \(k\)-means clusters, yielding assignments \((f_t,l_t,r_t,b_t)\in [k]^4\) for the face, left hand, right hand, and body pose, respectively.
(d) We partially mask the features, which form the input to a transformer encoder. The output of the transformer is fed to a linear classifier
(e) predicting the cluster assignments for each frame, \((\widehat{f}_t,\widehat{l}_t,\widehat{r}_t,\widehat{b}_t)\). We train the linear layer and the encoder using the log-loss of the predicted vs. true assignments for the masked tokens.
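As a concrete illustration of steps (c)–(e), the sketch below implements a masked multi-stream cluster-prediction loss in PyTorch. The module names, dimensions, masking granularity (whole frames rather than per-stream spans), and hyperparameters are illustrative assumptions, not the released SHuBERT implementation.

# Minimal sketch of the masked multi-stream cluster-prediction objective (steps c-e).
# All names, dimensions, and the masking scheme below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

K = 1000          # number of k-means clusters per stream (assumed)
D = 768           # fused per-frame feature dimension (assumed)
N_STREAMS = 4     # face, left hand, right hand, body pose

class MaskedClusterPredictor(nn.Module):
    def __init__(self, encoder: nn.Module):
        super().__init__()
        self.encoder = encoder                         # transformer encoder over frame features
        self.mask_emb = nn.Parameter(torch.zeros(D))   # learned embedding for masked frames
        self.heads = nn.Linear(D, N_STREAMS * K)       # one linear classifier per stream

    def forward(self, feats, targets, mask):
        """
        feats:   (B, T, D) fused per-frame features from the four streams
        targets: (B, T, 4) long tensor of k-means indices (f_t, l_t, r_t, b_t)
        mask:    (B, T) bool, True where the input frame is masked
        """
        x = torch.where(mask.unsqueeze(-1), self.mask_emb, feats)  # replace masked frames
        h = self.encoder(x)                                        # (B, T, D)
        logits = self.heads(h).view(*h.shape[:2], N_STREAMS, K)    # (B, T, 4, K)
        # Log-loss (cross-entropy) of predicted vs. true assignments, masked frames only.
        return F.cross_entropy(
            logits[mask].reshape(-1, K),    # (num_masked * 4, K)
            targets[mask].reshape(-1),      # (num_masked * 4,)
        )

# Example instantiation with a generic transformer encoder (assumed configuration):
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=12, batch_first=True), num_layers=12)
model = MaskedClusterPredictor(encoder)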

Abstract

Sign language processing has traditionally relied on task-specific models, limiting the potential for transfer learning across tasks. We introduce SHuBERT (Sign Hidden-Unit BERT), a self-supervised transformer encoder that learns strong representations from approximately 1,000 hours of American Sign Language (ASL) video content. Inspired by the success of the HuBERT speech representation model, SHuBERT adapts masked prediction to multi-stream visual sign language input, learning to predict multiple targets corresponding to clustered hand, face, and body pose streams. SHuBERT achieves state-of-the-art performance across multiple benchmarks. On sign language translation, it outperforms prior methods trained on publicly available data on the How2Sign (+0.7 BLEU), OpenASL (+10.0 BLEU), and FLEURS-ASL (+0.3 BLEU) benchmarks. Similarly, for isolated sign language recognition, SHuBERT's accuracy surpasses that of specialized models on ASL-Citizen (+5%) and SEM-LEX (+20.6%), while coming close to them on WLASL2000 (-3%). Ablation studies confirm the contribution of each component of the approach.
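For context, transfer to a downstream task such as isolated sign recognition can be sketched as follows; the mean-pooling over time and the single linear head are illustrative assumptions, not the exact fine-tuning recipe used in the paper.

# Hedged sketch of transfer to isolated sign recognition with a pretrained encoder.
# Pooling choice and head are illustrative assumptions.
import torch
import torch.nn as nn

class SignClassifier(nn.Module):
    def __init__(self, pretrained_encoder: nn.Module, feat_dim: int, num_signs: int):
        super().__init__()
        self.encoder = pretrained_encoder           # SHuBERT-style encoder, (B, T, D) -> (B, T, D)
        self.head = nn.Linear(feat_dim, num_signs)  # task-specific classification head

    def forward(self, feats):                       # feats: (B, T, D) per-frame stream features
        h = self.encoder(feats)
        return self.head(h.mean(dim=1))             # mean-pool over time, then classify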

Radar graph

Comparison between our results using fine-tuned SHuBERT and results of the previous state-of-the-art task-specific models across a suite of tasks, datasets, and metrics.

BibTeX

@inproceedings{gueuwou2025shubert,
      title={SHuBERT: Self-Supervised Sign Language Representation Learning via Multi-Stream Cluster Prediction},
      author={Gueuwou, Shester and Du, Xiaodan and Shakhnarovich, Greg and Livescu, Karen and Liu, Alexander H},
      booktitle={Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
      year={2025},
}