Diverse Signer Avatars with Manual and Non-Manual Feature Modelling for Sign Language Production

Mohamed Ilyes Lakhal and Richard Bowden
CVSSP, University of Surrey

Our goal is to generate photorealistic digital avatars that preserve essential sign language cues — such as facial expressions, hand movements, and mouthing — while enabling diversity across ethnicities and adaptability to different sign languages.

Abstract

The diversity of sign representation is essential for Sign Language Production (SLP) as it captures variations in appearance, facial expressions, and hand movements. However, existing SLP models are often unable to capture this diversity while preserving visual quality and modelling non-manual attributes such as emotions. To address this problem, we propose a novel approach that leverages a Latent Diffusion Model (LDM) to synthesise photorealistic digital avatars from a generated reference image. We introduce a novel sign feature aggregation module that explicitly models the non-manual features (e.g., the face) and the manual features (e.g., the hands). We show that our proposed module preserves linguistic content while seamlessly using reference images of signers from different ethnic backgrounds to provide diversity. Experiments on the YouTube-SL-25 sign language dataset show that our pipeline achieves superior visual quality compared to state-of-the-art methods, with significant improvements on perceptual metrics.

Method Overview

Given a sequence of video frames \(\mathcal{V} = \{ \mathbf{V}_i \}\) from a sign language video, our goal is to synthesise a diverse sequence \(\mathcal{O} = \{ \mathbf{O}_i \}\) that faithfully preserves both the manual and non-manual linguistic features while allowing variation across signer appearances. Our novel feature aggregation module, \(\Psi_{\text{motion}}\), uses multi-scale dilated convolutions with dilation rates \(d \in \{1, 2, 4\}\) to fuse fine-grained non-manual details (e.g., facial expressions) and coarse manual gestures (e.g., hand movements) into a unified representation. The LDM then generates each frame \(\mathbf{O}_i\) through an iterative denoising process in the latent space, guided by the aggregated features from \(\Psi_{\text{motion}}\), enabling the synthesis of signers with diverse ethnic and visual characteristics.
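
To make the design concrete, the following is a minimal PyTorch sketch of a multi-scale dilated-convolution aggregator in the spirit of \(\Psi_{\text{motion}}\). The class name, channel sizes, and the summation used to combine face and hand features are illustrative assumptions; the authors' implementation may differ.

import torch
import torch.nn as nn

class SignFeatureAggregator(nn.Module):
    """Fuses non-manual (face) and manual (hand) feature maps with
    parallel dilated convolutions at rates d in {1, 2, 4}.
    Hypothetical sketch; names and channel sizes are assumptions."""

    def __init__(self, in_channels: int = 256, out_channels: int = 256):
        super().__init__()
        # One 3x3 branch per dilation rate; padding = dilation keeps
        # the spatial resolution unchanged.
        self.branches = nn.ModuleList([
            nn.Conv2d(in_channels, out_channels, kernel_size=3,
                      padding=d, dilation=d)
            for d in (1, 2, 4)
        ])
        # 1x1 convolution merges the concatenated multi-scale responses
        # into a single conditioning representation.
        self.fuse = nn.Conv2d(3 * out_channels, out_channels, kernel_size=1)
        self.act = nn.GELU()

    def forward(self, face_feat: torch.Tensor,
                hand_feat: torch.Tensor) -> torch.Tensor:
        # Combine fine-grained non-manual and coarse manual cues before
        # the multi-scale analysis (simple summation assumed here).
        x = face_feat + hand_feat
        multi_scale = torch.cat([self.act(b(x)) for b in self.branches], dim=1)
        return self.fuse(multi_scale)

# Usage: the fused map conditions each LDM denoising step, e.g. through
# the denoising UNet's conditioning pathway.
face = torch.randn(1, 256, 32, 32)   # non-manual features (face regions)
hand = torch.randn(1, 256, 32, 32)   # manual features (hand regions)
cond = SignFeatureAggregator()(face, hand)  # -> (1, 256, 32, 32)

Parallel dilation rates let the same module capture fine facial detail (d = 1) and coarser hand trajectories (d = 4) without changing the spatial resolution of the feature map.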

Comparisons against Baselines (DSGS)

Comparisons against Baselines (BSL)

Adaptability to any sign language

Our model works with any sign language. The first two examples shown are from Swiss-German Sign Language (DSGS) and the last is from British Sign Language (BSL).

Comparison to other LDM methods

Our model handles the driving pose for sign language sequences better than other LDM-based methods. Note: when reporting the metrics in Tab. 4, we apply a mask to remove the influence of the background.
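
As an illustration of such masked evaluation, below is a small NumPy sketch of a background-masked PSNR. The exact mask source and metric set used for Tab. 4 are the authors' own, so treat this as an assumed protocol rather than their evaluation code.

import numpy as np

def masked_psnr(pred: np.ndarray, target: np.ndarray,
                mask: np.ndarray, max_val: float = 255.0) -> float:
    """PSNR over foreground (signer) pixels only; hypothetical helper.

    pred, target: HxWx3 image arrays; mask: HxW boolean array that is
    True on the signer and False on the background.
    """
    fg = mask.astype(bool)
    # Boolean indexing with an HxW mask on an HxWx3 array selects the
    # masked pixels across all channels, giving an (N, 3) array.
    diff = pred[fg].astype(np.float64) - target[fg].astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

Restricting the error to signer pixels prevents a static or easily reproduced background from inflating the reported scores.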

Sample diversity


Our model easily adapts to diverse reference images within the same pose sequence.

User study


We highlight the accuracy of the synthesised videos (left), the realism of the videos (centre) and user preference (right).

BibTeX

@article{diverse_sign,
      title={Diverse Signer Avatars with Manual and Non-Manual Feature Modelling for Sign Language Production},
      author={Mohamed Ilyes Lakhal and Richard Bowden},
      journal={arXiv preprint},
      year={2026}
}