makarov-gv/avatars-review


GenAI for Digital Avatar Synthesis — Review Resources

A curated, browsable companion to the survey “GenAI for Digital Avatar Synthesis: A Comprehensive Review.”

License: MIT

This repository is the supplementary material for our task-oriented review of human-centric Generative AI for digital avatar synthesis, covering work published in 2024–2025 at leading (CORE A / A*) AI conferences. It collects, links, and organizes every generation method reviewed in the paper so the literature is easy to browse, cite, and extend.

Diffusion models increasingly act as strong priors for synthesis and editing, while Gaussian-splatting representations dominate real-time reconstruction and rendering — pointing toward hybrid pipelines that jointly optimize controllability and deployability.

An end-to-end avatar-synthesis pipeline: inputs and modalities → preprocessing → neural network (GAN / diffusion / NeRF / 3DGS) and auxiliary modules → postprocessing → deployable outputs.
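
The stage order in the caption can be read as a plain function composition. The sketch below is only an illustration of that ordering: `AvatarSample`, `make_stage`, and `run_pipeline` are hypothetical names invented here, not part of the paper or of any library, and each placeholder stage merely records that it ran.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class AvatarSample:
    """Hypothetical container passed between pipeline stages."""
    modality: str                                   # e.g. "image", "video", "audio", "text"
    trace: List[str] = field(default_factory=list)  # names of the stages applied so far


Stage = Callable[[AvatarSample], AvatarSample]


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that only appends its name to the trace."""
    def stage(sample: AvatarSample) -> AvatarSample:
        return AvatarSample(sample.modality, sample.trace + [name])
    return stage


def run_pipeline(sample: AvatarSample, stages: List[Stage]) -> AvatarSample:
    """Apply the stages in order, feeding each output into the next stage."""
    for stage in stages:
        sample = stage(sample)
    return sample


# Stage names follow the caption; real stages would wrap actual models.
PIPELINE = [make_stage(n) for n in ("preprocessing", "neural_network", "postprocessing")]

result = run_pipeline(AvatarSample("image"), PIPELINE)
print(" -> ".join(result.trace))
```

In a real system each placeholder would wrap a concrete component (for example a diffusion prior or a 3DGS renderer), but the ordering stays the same.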

Scope & taxonomy

This repository accompanies a task-oriented review of human-centric Generative AI for digital avatar synthesis, concentrating on work published in 2024–2025 at leading (CORE A / A*) AI conferences. Recent progress spans diverse output representations (images, video, 3D/4D assets) and conditioning signals (pose, speech, language instructions, affective attributes), broadening avatar applications to telepresence, virtual production, immersive AR / VR, and customer-facing interaction. The literature, however, is expanding rapidly and is fragmented across problem settings, architectures, and deployment constraints.

To consolidate it, the review introduces a unified taxonomy of nine task families, aligns each task with both its representative methods and the corpora they use, and connects them through an end-to-end pipeline that links inputs and preprocessing to model components and deployable outputs (see the graphical abstract above). Each method and corpus is placed under its primary task, mirroring the paper:

  1. Generalization — reconstruct or animate new identities from few, single, or unconstrained in-the-wild observations, ideally without per-subject optimization.
  2. Expressiveness — speech-, emotion-, and motion-driven faces and bodies with nuanced, fine-grained expressions and co-speech gestures.
  3. Text guidance & Stylization — language-conditioned avatar generation, stylization, and editing from natural-language prompts.
  4. Attribute editing — controllable, disentangled editing of appearance and shape, down to individual attributes such as hair, clothing, or expression.
  5. Physics improvements & World interaction — relighting, contact, cloth and body dynamics, and human–object or human–scene interaction.
  6. Hair and clothes improvements — strand-level hair, layered garments, and disentangled, simulation-ready assets.
  7. High fidelity and realism — photorealistic geometry, texture, and appearance, often at high resolution.
  8. Real-time generation & Compression — efficient, lightweight, on-device and streamable avatars.
  9. Temporal consistency — long-horizon, drift-free, identity-stable video and motion.
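
For scripts that index this list, the nine task families can be captured as a small enumeration. This is a minimal sketch: the `TaskFamily` class, its member names, and `primary_task_index` are hypothetical helpers introduced here for illustration; only the string values are taken from the list above.

```python
from enum import Enum


class TaskFamily(Enum):
    """The nine task families, in the order used by the taxonomy."""
    GENERALIZATION = "Generalization"
    EXPRESSIVENESS = "Expressiveness"
    TEXT_GUIDANCE_STYLIZATION = "Text guidance & Stylization"
    ATTRIBUTE_EDITING = "Attribute editing"
    PHYSICS_WORLD_INTERACTION = "Physics improvements & World interaction"
    HAIR_CLOTHES = "Hair and clothes improvements"
    HIGH_FIDELITY = "High fidelity and realism"
    REAL_TIME_COMPRESSION = "Real-time generation & Compression"
    TEMPORAL_CONSISTENCY = "Temporal consistency"


def primary_task_index(task: TaskFamily) -> int:
    """Return the 1-based position of a task family in the taxonomy."""
    return list(TaskFamily).index(task) + 1


print(len(TaskFamily))
print(primary_task_index(TaskFamily.ATTRIBUTE_EDITING))
```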

Corpora

108 corpora are used across the reviewed methods, grouped by the task they primarily serve (mirroring the paper). Real-time generation & Compression and Temporal consistency are architecture-level tasks and have no dedicated corpora. Each entry lists the corpus type and, where available, its modalities, size, and additional tasks.

The 108 reviewed corpora, grouped by primary task and publication year.
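
Each corpus entry in the tables below carries the same handful of fields, so it maps naturally onto a small record type. The sketch below is an assumption about how one might model the entries, not a format shipped with this repository: the `Corpus` class and the `serving` and `with_modality` helpers are hypothetical, while the three sample records are copied from the tables.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Corpus:
    """One table entry; field names are illustrative, values come from the tables."""
    name: str
    kind: str                        # "Image", "Video", "3D/4D", or "Else", as listed
    modalities: Tuple[str, ...]      # empty when the entry lists none
    size: str                        # kept as free text, e.g. "153k+ clips"
    primary_task: str
    additional_tasks: Tuple[str, ...]
    venue: str


CORPORA = [
    Corpus("VoxCeleb", "Video", ("audio",), "153k+ clips",
           "Generalization", ("Expressiveness",), "INTERSPEECH 2017"),
    Corpus("FaceScape", "3D/4D", ("emotions",), "938 subjects, 18,760 scans",
           "Generalization",
           ("Expressiveness", "Attribute editing", "High fidelity and realism"),
           "TPAMI 2023"),
    Corpus("FFHQ", "Image", (), "70k images",
           "High fidelity and realism", ("Generalization",), "CVPR 2019"),
]


def serving(task: str, corpora: List[Corpus]) -> List[str]:
    """Names of corpora whose primary or additional tasks include `task`."""
    return [c.name for c in corpora
            if c.primary_task == task or task in c.additional_tasks]


def with_modality(modality: str, corpora: List[Corpus]) -> List[str]:
    """Names of corpora that list the given modality."""
    return [c.name for c in corpora if modality in c.modalities]


print(serving("Expressiveness", CORPORA))
print(with_modality("audio", CORPORA))
```

Keeping `size` as free text avoids guessing at units, since the tables mix clips, hours, subjects, scans, and frames.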

Generalization Corpora

Corpus Title & Repository / Information Venue
FRGC
Overview of the Face Recognition Grand Challenge
  • Image corpus
  • Size: 5k+ images
  • Add. tasks: High fidelity and realism
CVPR 2005
FRGCv2
Overview of the Face Recognition Grand Challenge
  • Image corpus
  • Size: 50k images
  • Add. tasks: High fidelity and realism
CVPR 2005
VoxCeleb
VoxCeleb: A Large-Scale Speaker Identification Dataset
  • Video corpus
  • Modalities: audio
  • Size: 153k+ clips
  • Add. tasks: Expressiveness
INTERSPEECH 2017
AVSpeech
Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation
  • Video corpus
  • Modalities: audio
  • Size: 4,700 hours
  • Add. tasks: Expressiveness
arXiv 2018
VoxCeleb2
VoxCeleb2: Deep Speaker Recognition
  • Video corpus
  • Modalities: audio
  • Size: 1.09M clips
  • Add. tasks: Expressiveness
INTERSPEECH 2018
HUMBI
HUMBI: A Large Multiview Dataset of Human Body Expressions
  • 3D/4D corpus
  • Modalities: motions
  • Size: 772 subjects
  • Add. tasks: High fidelity and realism
CVPR 2020
LYHM
Statistical Modeling of Craniofacial Shape and Texture
  • 3D/4D corpus
  • Size: 1,216 subjects
  • Add. tasks: High fidelity and realism
IJCV 2020
TalkingHead-1KH
One-Shot Free-View Neural Talking-Head Synthesis for Video Conferencing
  • Video corpus
  • Modalities: audio
  • Size: 500k clips
  • Add. tasks: High fidelity and realism
CVPR 2021
THUman2.0
Function4D: Real-Time Human Volumetric Capture from Very Sparse Consumer RGBD Sensors
  • 3D/4D corpus
  • Size: 500 scans
  • Add. tasks: Physics improvements & World interaction, High fidelity and realism
CVPR 2021
THUman2.1
Function4D: Real-Time Human Volumetric Capture from Very Sparse Consumer RGBD Sensors
  • 3D/4D corpus
  • Size: 2,500 scans
  • Add. tasks: Physics improvements & World interaction, High fidelity and realism
CVPR 2021
WebFace42M
WebFace260M: A Benchmark Unveiling the Power of Million-Scale Deep Face Recognition
  • Image corpus
  • Size: 42M images
  • Add. tasks: High fidelity and realism
CVPR 2021
WebFace260M
WebFace260M: A Benchmark Unveiling the Power of Million-Scale Deep Face Recognition
  • Image corpus
  • Size: 260M images
  • Add. tasks: High fidelity and realism
CVPR 2021
THUman3.0
DeepCloth: Neural Garment Representation for Shape and Style Editing
  • 3D/4D corpus
  • Add. tasks: High fidelity and realism
TPAMI 2022
THUman4.0
Structured Local Radiance Fields for Human Avatar Modeling
  • Video corpus
  • Modalities: motions
  • Size: 3 clips, 7,500+ frames
  • Add. tasks: Physics improvements & World interaction, High fidelity and realism
CVPR 2022
CustomHumans
Learning Locally Editable Virtual Humans
  • 3D/4D corpus
  • Size: 643 scans
  • Add. tasks: High fidelity and realism
CVPR 2023
FaceScape
FaceScape: 3D Facial Dataset and Benchmark for Single-View 3D Face Reconstruction
  • 3D/4D corpus
  • Modalities: emotions
  • Size: 938 subjects, 18,760 scans
  • Add. tasks: Expressiveness, Attribute editing, High fidelity and realism
TPAMI 2023
NeRSemble
NeRSemble: Multi-View Radiance Field Reconstruction of Human Heads
  • Video corpus
  • Modalities: audio, emotions, motions
  • Size: 222 subjects, 31.7M frames
  • Add. tasks: Expressiveness, Attribute editing, High fidelity and realism
TOG 2023
RenderMe-360
RenderMe-360: A Large Digital Asset Library and Benchmarks Towards High-Fidelity Head Avatars
  • 3D/4D corpus
  • Modalities: audio, emotions, text, motions, hair/clothes
  • Size: 500 subjects, 243M frames
  • Add. tasks: Expressiveness, Text guidance & Stylization, Attribute editing, Hair and clothes improvements, High fidelity and realism
NeurIPS 2023
OpenHumanVid
OpenHumanVid: A Large-Scale High-Quality Dataset for Enhancing Human-Centric Video Generation
  • Video corpus
  • Modalities: audio, text, motions
  • Size: 13.2M clips, 16.7k hours
  • Add. tasks: Text guidance & Stylization, Physics improvements & World interaction, High fidelity and realism
CVPR 2025
WildAvatar
WildAvatar: Learning In-the-Wild 3D Avatars from the Web
  • Video corpus
  • Size: 10k+ subjects
  • Add. tasks: High fidelity and realism
CVPR 2025

Expressiveness Corpora

Corpus Title & Repository / Information Venue
BU-3DFE
A 3D Facial Expression Database for Facial Behavior Research
  • 3D/4D corpus
  • Modalities: emotions
  • Size: 6 emotions
  • Add. tasks: High fidelity and realism
FGR 2006
BP4D
BP4D-Spontaneous: A High-Resolution Spontaneous 3D Dynamic Facial Expression Database
  • 3D/4D corpus
  • Modalities: emotions, motions
  • Size: 41 subjects
  • Add. tasks: Attribute editing, High fidelity and realism
Image and Vision Computing 2014
CREMA-D
CREMA-D: Crowd-Sourced Emotional Multimodal Actors Dataset
  • Video corpus
  • Modalities: audio, emotions
  • Size: 7,442 clips
TAC 2014
BP4D+
Multimodal Spontaneous Emotion Corpus for Human Behavior Analysis
  • 3D/4D corpus
  • Modalities: emotions, motions
  • Size: 140 subjects
  • Add. tasks: Attribute editing, High fidelity and realism
CVPR 2016
AffectNet
AffectNet: A Database for Facial Expression, Valence, and Arousal Computing in the Wild
  • Image corpus
  • Modalities: emotions
  • Size: ~1M images
  • Add. tasks: Generalization
TAC 2017
CMU-MOSEI
Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph
  • Video corpus
  • Modalities: audio, emotions, text
  • Size: 23k+ clips
  • Add. tasks: Generalization
ACL 2018
CoMA
Generating 3D Faces Using Convolutional Mesh Autoencoders
  • 3D/4D corpus
  • Modalities: emotions, motions
  • Size: 12 subjects
  • Add. tasks: Attribute editing, High fidelity and realism
ECCV 2018
RAVDESS
The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A Dynamic, Multimodal Set of Facial and Vocal Expressions in North American English
  • Video corpus
  • Modalities: audio, emotions
  • Size: 1,440 clips
PloS one 2018
MEAD
MEAD: A Large-Scale Audio-Visual Dataset for Emotional Talking-Face Generation
  • Video corpus
  • Modalities: audio, emotions
  • Size: 60 subjects
ECCV 2020
HDTF
Flow-Guided One-Shot Talking Face Generation with a High-Resolution Audio-Visual Dataset
  • Video corpus
  • Modalities: audio
  • Size: 300+ subjects
  • Add. tasks: Generalization, High fidelity and realism
CVPR 2021
Speech2-AffectiveGestures
Speech2-AffectiveGestures: Synthesizing Co-Speech Gestures with Generative Adversarial Affective Expression Learning
  • Video corpus
  • Modalities: audio, text, motions
  • Size: 1,766 clips, 106.1 hours
  • Add. tasks: Text guidance & Stylization, Physics improvements & World interaction
ACMMM 2021
3D-ETF
EmoTalk: Speech-Driven Emotional Disentanglement for 3D Face Animation
  • 3D/4D corpus
  • Modalities: audio, emotions, motions
  • Add. tasks: Generalization, Attribute editing
ICCV 2023
FaMoS
Instant Multi-View Head Capture Through Learnable Registration
  • 3D/4D corpus
  • Modalities: emotions, motions
  • Size: 95 subjects, 600k frames
  • Add. tasks: Attribute editing, High fidelity and realism
CVPR 2023
MEAD-3D
Speech4Mesh: Speech-Assisted Monocular 3D Facial Reconstruction for Speech-Driven 3D Facial Animation
  • 3D/4D corpus
  • Modalities: emotions, motions
  • Add. tasks: Attribute editing
ICCV 2023
TalkSHOW
Generating Holistic 3D Human Motion from Speech
  • 3D/4D corpus
  • Modalities: audio, emotions, motions
  • Size: 26.9 hours, 4 subjects
  • Add. tasks: Attribute editing, Physics improvements & World interaction
CVPR 2023
EmoTalk3D
EmoTalk3D: High-Fidelity Free-View Synthesis of Emotional 3D Talking Head
  • 3D/4D corpus
  • Modalities: audio, emotions, motions
  • Size: 35 subjects
  • Add. tasks: Attribute editing, High fidelity and realism
ECCV 2024
FEED
EMOPortraits: Emotion-Enhanced Multimodal One-Shot Head Avatars
  • Video corpus
  • Modalities: audio, emotions
  • Add. tasks: High fidelity and realism
CVPR 2024
3D-BEF
EmoDiffusion: Enhancing Emotional 3D Facial Animation with Latent Diffusion Models
  • 3D/4D corpus
  • Modalities: audio, emotions, motions
  • Size: 2k+ sequences, 9 emotions
  • Add. tasks: Attribute editing, High fidelity and realism
arXiv 2025
AffectNet+
AffectNet+: A Database for Enhancing Facial Expression Recognition with Soft-Labels
  • Image corpus
  • Modalities: emotions
  • Size: ~1M images
  • Add. tasks: Generalization
TAC 2025
MENTOR
Vlogger: Multimodal Diffusion for Embodied Avatar Synthesis
  • Video corpus
  • Modalities: audio, emotions, motions
  • Size: 800k subjects
  • Add. tasks: Generalization
CVPR 2025
TalkBody4D
TaoAvatar: Real-Time Lifelike Full-Body Talking Avatars for Augmented Reality via 3D Gaussian Splatting
  • Video corpus
  • Modalities: audio, motions
  • Size: 8 sequences, 59 cameras
  • Add. tasks: Physics improvements & World interaction, High fidelity and realism
CVPR 2025
VOCASET
EmoVOCA: Speech-Driven Emotional 3D Talking Heads
  • 3D/4D corpus
  • Modalities: audio, motions
  • Size: 12 subjects
  • Add. tasks: Attribute editing
WACV 2025

Text guidance & Stylization Corpora

Corpus Title & Repository / Information Venue
BEAT
BEAT: A Large-Scale Semantic and Emotional Multi-Modal Dataset for Conversational Gestures Synthesis
  • 3D/4D corpus
  • Modalities: audio, emotions, text, motions
  • Size: 76 hours, 30 subjects
  • Add. tasks: Generalization, Expressiveness, Physics improvements & World interaction
ECCV 2022
CelebV-Text
CelebV-Text: A Large-Scale Facial Text-Video Dataset
  • Video corpus
  • Modalities: emotions, text
  • Size: 70k clips
  • Add. tasks: Generalization, Expressiveness, High fidelity and realism
CVPR 2023
Human-Art
Human-Art: A Versatile Human-Centric Dataset Bridging Natural and Artificial Scenes
  • Image corpus
  • Modalities: text
  • Size: 50k images
  • Add. tasks: Generalization
CVPR 2023
BEAT2
EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling
  • 3D/4D corpus
  • Modalities: audio, emotions, text, motions
  • Size: 60 hours
  • Add. tasks: Generalization, Expressiveness, Physics improvements & World interaction
CVPR 2024
CosmicManHQ-1.0
CosmicMan: A Text-to-Image Foundation Model for Humans
  • Image corpus
  • Modalities: text, hair/clothes
  • Size: 5.46M images
  • Add. tasks: Generalization, Hair and clothes improvements
CVPR 2024
SFHQ-T2I
Synthetic Faces High Quality - Text 2 Image (SFHQ-T2I) Dataset
  • Image corpus
  • Modalities: text
  • Size: 122,726 images
  • Add. tasks: Generalization, High fidelity and realism
Dataset 2024
SignAvatars
SignAvatars: A Large-Scale 3D Sign Language Holistic Motion Dataset and Benchmark
  • 3D/4D corpus
  • Modalities: text, motions
  • Size: 70k clips, 153 subjects
  • Add. tasks: Generalization, Physics improvements & World interaction
ECCV 2024

Attribute editing Corpora

Corpus Title & Repository / Information Venue
BiwiKinect
Random Forests for Real Time 3D Face Analysis
  • Video corpus
  • Modalities: motions
  • Size: 15k images, 20 subjects
  • Add. tasks: Physics improvements & World interaction, High fidelity and realism
IJCV 2013
FaceWarehouse
FaceWarehouse: A 3D Facial Expression Database for Visual Computing
  • 3D/4D corpus
  • Modalities: emotions, motions
  • Size: 150 subjects
  • Add. tasks: Generalization, Expressiveness
TVCG 2013
Stirling
Stirling ESRC 3D Face Database
  • 3D/4D corpus
  • Modalities: emotions, motions
  • Size: 99 subjects
  • Add. tasks: Expressiveness
Dataset 2013
NeRFace
Dynamic Neural Radiance Fields for Monocular 4D Facial Avatar Reconstruction
  • Video corpus
  • Add. tasks: High fidelity and realism
CVPR 2021
CelebV-HQ
CelebV-HQ: A Large-Scale Video Facial Attributes Dataset
  • Video corpus
  • Modalities: emotions
  • Size: 35k+ clips
  • Add. tasks: Generalization, Expressiveness, High fidelity and realism
ECCV 2022
NeRFBlendShape
Reconstructing Personalized Semantic Facial NeRF Models from Monocular Video
  • Video corpus
  • Modalities: motions
  • Size: 8 subjects
  • Add. tasks: High fidelity and realism
TOG 2022
AvatarReX
AvatarReX: Real-Time Expressive Full-Body Avatars
  • Video corpus
  • Modalities: motions
  • Size: 4 sequences, 16 cameras
  • Add. tasks: Physics improvements & World interaction, High fidelity and realism
TOG 2023
LPFF
LPFF: A Portrait Dataset for Face Generators Across Large Poses
  • Image corpus
  • Size: 19,590 images
  • Add. tasks: High fidelity and realism
ICCV 2023
PointAvatar
PointAvatar: Deformable Point-Based Head Avatars from Videos
  • Video corpus
  • Modalities: motions
  • Size: 3 subjects
  • Add. tasks: High fidelity and realism
CVPR 2023

Physics improvements & World interaction Corpora

Corpus Title & Repository / Information Venue
Decaf
Decaf: MEG-Based Multimodal Database for Decoding Affective Physiological Responses
  • Video corpus
  • Modalities: motions
  • Size: 8 subjects
  • Add. tasks: Attribute editing, High fidelity and realism
TAC 2015
KIT-ML
The KIT Motion-Language Dataset
  • Else corpus
  • Modalities: text, motions
  • Size: 3,911 clips, 6,278 texts
  • Add. tasks: Generalization, Text guidance & Stylization
Big Data 2016
MonoPerfCap
MonoPerfCap: Human Performance Capture from Monocular Video
  • Video corpus
  • Modalities: motions
  • Size: 120 clips
  • Add. tasks: High fidelity and realism
TOG 2018
PeopleSnapshot
Video Based Reconstruction of 3D People Models
  • Video corpus
  • Modalities: motions
  • Size: 11 subjects
CVPR 2018
AMASS
AMASS: Archive of Motion Capture as Surface Shapes
  • Else corpus
  • Modalities: motions
  • Size: 11,265 motions
ICCV 2019
PROX
Resolving 3D Human Pose Ambiguities with 3D Scene Constraints
  • Video corpus
  • Modalities: motions
  • Size: 12 scenes
ICCV 2019
Speech2Gesture
Learning Individual Styles of Conversational Gesture
  • Video corpus
  • Modalities: audio, motions
  • Size: 2,710 clips, 144 hours
  • Add. tasks: Generalization, Expressiveness
CVPR 2019
Talking-WithHands16.2M
Talking-WithHands16.2M: A Large-Scale Dataset of Synchronized Body-Finger Motion and Audio for Conversational Motion Analysis and Synthesis
  • 3D/4D corpus
  • Modalities: audio, motions
  • Size: 16.2M frames
  • Add. tasks: Generalization, Expressiveness
ICCV 2019
Talking-WithHands32M
Talking With Hands 16.2M: A Large-Scale Dataset of Synchronized Body-Finger Motion and Audio for Conversational Motion Analysis and Synthesis
  • 3D/4D corpus
  • Modalities: audio, motions
  • Size: 32M frames
  • Add. tasks: Generalization, Expressiveness
ICCV 2019
InterHand2.6M
InterHand2.6M: A Dataset and Baseline for 3D Interacting Hand Pose Estimation from a Single RGB Image
  • Image corpus
  • Modalities: motions
  • Size: 26 subjects
  • Add. tasks: Attribute editing
ECCV 2020
DeepCap
DeepCap: Monocular Human Performance Capture Using Weak Supervision
  • Video corpus
  • Modalities: motions
  • Size: 17 sequences
  • Add. tasks: High fidelity and realism
CVPR 2020
BABEL
BABEL: Bodies, Action and Behavior with English Labels
  • Else corpus
  • Modalities: text, motions
  • Size: 43 hours, 250+ actions
  • Add. tasks: Generalization, Text guidance & Stylization
CVPR 2021
DynaCap
Real-Time Deep Dynamic Characters
  • Video corpus
  • Modalities: motions
  • Size: 5 sequences
TOG 2021
ZJU-MoCap
Neural Body: Implicit Neural Representations with Structured Latent Codes for Novel View Synthesis of Dynamic Humans
  • Video corpus
  • Modalities: motions
  • Size: 9 sequences
CVPR 2021
BEHAVE
BEHAVE: Dataset and Method for Tracking Human Object Interactions
  • 3D/4D corpus
  • Modalities: motions
  • Size: 321 sequences
  • Add. tasks: High fidelity and realism
CVPR 2022
DART
DART: Articulated Hand Model with Diverse Accessories and Rich Textures
  • Image corpus
  • Modalities: motions
  • Size: 800k images
  • Add. tasks: High fidelity and realism
NeurIPS 2022
EgoBody
EgoBody: Human Body Shape and Motion of Interacting People from Head-Mounted Devices
  • Video corpus
  • Modalities: motions
  • Size: 125 sequences
ECCV 2022
HuMMan
HuMMan: Multi-Modal 4D Human Dataset for Versatile Sensing and Modeling
  • 3D/4D corpus
  • Modalities: motions
  • Size: 1k subjects, 60M frames
  • Add. tasks: Generalization, High fidelity and realism
ECCV 2022
HumanML3D
Generating Diverse and Natural 3D Human Motions from Text
  • Else corpus
  • Modalities: text, motions
  • Size: 14.6k clips, 45.0k texts
  • Add. tasks: Generalization, Text guidance & Stylization
CVPR 2022
MANO
Embodied Hands: Modeling and Capturing Hands and Bodies Together
  • 3D/4D corpus
  • Modalities: motions
  • Size: 1k+ scans
  • Add. tasks: Attribute editing
arXiv 2022
NeuMan
NeuMan: Neural Human Radiance Field from a Single Video
  • Video corpus
  • Size: 6 clips
  • Add. tasks: High fidelity and realism
ECCV 2022
CHAIRS
Full-Body Articulated Human-Object Interaction
  • 3D/4D corpus
  • Modalities: motions
  • Size: 17.3 hours, 46 subjects
  • Add. tasks: Attribute editing, High fidelity and realism
ICCV 2023
CIRCLE
CIRCLE: Capture in Rich Contextual Environments
  • 3D/4D corpus
  • Modalities: motions
  • Size: 10 hours
CVPR 2023
Re:InterHand
A Dataset of Relighted 3D Interacting Hands
  • 3D/4D corpus
  • Modalities: motions
  • Size: 106,766 scans
  • Add. tasks: Attribute editing, High fidelity and realism
NeurIPS 2023
X-Avatar
X-Avatar: Expressive Human Avatars
  • 3D/4D corpus
  • Modalities: motions
  • Size: 233 sequences
  • Add. tasks: Attribute editing, High fidelity and realism
CVPR 2023
Ava-256
Codec Avatar Studio: Paired Human Captures for Complete, Driveable, and Generalizable Avatars
  • 3D/4D corpus
  • Modalities: motions
  • Size: 256 subjects
  • Add. tasks: Attribute editing, High fidelity and realism
NeurIPS 2024
FS-DART
Have-Fun: Human Avatar Reconstruction from Few-Shot Unconstrained Images
  • 3D/4D corpus
  • Modalities: motions
  • Size: 100 subjects
  • Add. tasks: High fidelity and realism
CVPR 2024
LINGO
Autonomous Character-Scene Interaction Synthesis from Text Instruction
  • Else corpus
  • Modalities: text, motions
  • Size: 16 hours
  • Add. tasks: Text guidance & Stylization
SIGGRAPH 2024
TRUMANS
Scaling up Dynamic Human-Scene Interaction Modeling
  • 3D/4D corpus
  • Modalities: motions
  • Size: 15 hours
  • Add. tasks: High fidelity and realism
CVPR 2024

Hair and clothes improvements Corpora

Corpus Title & Repository / Information Venue
DeepFashion
DeepFashion: Powering Robust Clothes Recognition and Retrieval with Rich Annotations
  • Image corpus
  • Modalities: hair/clothes
  • Size: 801k items
  • Add. tasks: High fidelity and realism
CVPR 2016
DeepFashion2
DeepFashion2: A Versatile Benchmark for Detection, Pose Estimation, Segmentation and Re-Identification of Clothing Images
  • Image corpus
  • Modalities: hair/clothes
  • Size: 801k items
  • Add. tasks: High fidelity and realism
CVPR 2019
CAPE
Learning to Dress 3D People in Generative Clothing
  • 3D/4D corpus
  • Modalities: hair/clothes
  • Size: 150k scans, 15 subjects
  • Add. tasks: Physics improvements & World interaction
CVPR 2020
SIZER
SIZER: A Dataset and Model for Parsing 3D Clothing and Learning Size Sensitive 3D Clothing
  • 3D/4D corpus
  • Modalities: hair/clothes
  • Size: 2k scans
ECCV 2020
TikTok
Learning High Fidelity Depths of Dressed Humans by Watching Social Media Dance Videos
  • 3D/4D corpus
  • Size: 300+ sequences, 100k+ frames
  • Add. tasks: Generalization, High fidelity and realism
CVPR 2021
3DHumans
SHARP: Shape-Aware Reconstruction of People in Loose Clothing
  • 3D/4D corpus
  • Modalities: motions, hair/clothes
  • Size: ~180 scans
  • Add. tasks: Attribute editing, High fidelity and realism
IJCV 2023
DNA-Rendering
DNA-Rendering: A Diverse Neural Actor Repository for High-Fidelity Human-Centric Rendering
  • Video corpus
  • Modalities: motions, hair/clothes
  • Size: 1,500+ subjects, 67.5M frames
  • Add. tasks: Generalization, Physics improvements & World interaction, High fidelity and realism
ICCV 2023
Goliath
Codec Avatar Studio: Paired Human Captures for Complete, Driveable, and Generalizable Avatars
  • 3D/4D corpus
  • Modalities: motions, hair/clothes
  • Size: 4 subjects
  • Add. tasks: Physics improvements & World interaction, High fidelity and realism
NeurIPS 2024
I3D-Human
Within the Dynamic Context: Inertia-Aware 3D Human Modeling with Pose Sequence
  • 3D/4D corpus
  • Modalities: motions, hair/clothes
  • Size: 6 subjects, 10k frames
  • Add. tasks: Physics improvements & World interaction
ECCV 2024
MVHumanNet
MVHumanNet: A Large-Scale Dataset of Multi-View Daily Dressing Human Captures
  • Video corpus
  • Modalities: text, motions
  • Size: 4,500 subjects, 645M frames
  • Add. tasks: Generalization, Text guidance & Stylization, Physics improvements & World interaction, High fidelity and realism
CVPR 2024
MVHumanNet++
MVHumanNet++: A Large-Scale Dataset of Multi-View Daily Dressing Human Captures with Richer Annotations for 3D Human Digitization
  • Video corpus
  • Modalities: text, motions
  • Size: 4,500 subjects, 645M frames
  • Add. tasks: Generalization, Text guidance & Stylization, Physics improvements & World interaction, High fidelity and realism
arXiv 2025

High fidelity and realism Corpora

Corpus Title & Repository / Information Venue
Florence2D/3D
The Florence 2D/3D Hybrid Face Dataset
  • 3D/4D corpus
  • Modalities: emotions
J-HGBU 2011
FFHQ
A Style-Based Generator Architecture for Generative Adversarial Networks
  • Image corpus
  • Size: 70k images
  • Add. tasks: Generalization
CVPR 2019
Multiface
Multiface: A Dataset for Neural Face Rendering
  • Video corpus
  • Modalities: motions
  • Size: 13 subjects
arXiv 2022
SFHQ
Synthetic Faces High Quality (SFHQ) Dataset
  • Image corpus
  • Size: 100k images
  • Add. tasks: Generalization
Dataset 2022
SHHQ
StyleGAN-Human: A Data-Centric Odyssey of Human Generation
  • Image corpus
  • Size: 40k images
  • Add. tasks: Generalization
ECCV 2022
VFHQ
VFHQ: A High-Quality Dataset and Benchmark for Video Face Super-Resolution
  • Video corpus
  • Size: 16k+ clips
  • Add. tasks: Generalization
CVPR 2022
2K2K
High-Fidelity 3D Human Digitization from Single 2K Resolution Images
  • 3D/4D corpus
  • Modalities: motions
  • Size: 2k images
CVPR 2023
ActorsHQ
HumanRF: High-Fidelity Neural Radiance Fields for Humans in Motion
  • Video corpus
  • Size: 39,765 frames, 160 cameras
  • Add. tasks: Physics improvements & World interaction
TOG 2023
INSTA
Instant Volumetric Head Avatars
  • Video corpus
  • Modalities: motions
  • Add. tasks: Attribute editing
CVPR 2023
TexTalk4D
Towards High-Fidelity 3D Talking Avatar with Personalized Dynamic Texture
  • 3D/4D corpus
  • Modalities: audio, motions
  • Size: 100 subjects, 100 minutes
  • Add. tasks: Generalization, Expressiveness, Attribute editing
CVPR 2025
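
As a quick consistency check, the number of entries in each corpus table above can be tallied against the stated total of 108. The counts in this sketch were taken by hand from the tables in this README (they are not generated by any script in the repository), and the `CORPORA_PER_TASK` name is a hypothetical one used only here.

```python
# Entries per corpus table, counted by hand from the lists in this README.
CORPORA_PER_TASK = {
    "Generalization": 20,
    "Expressiveness": 22,
    "Text guidance & Stylization": 7,
    "Attribute editing": 9,
    "Physics improvements & World interaction": 29,
    "Hair and clothes improvements": 11,
    "High fidelity and realism": 10,
}

# Real-time generation & Compression and Temporal consistency have no
# dedicated corpora, so only seven of the nine task families appear here.
total = sum(CORPORA_PER_TASK.values())
print(total)  # 108
```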

Methods

203 primary methods are reviewed across 9 task families, plus 6 logical continuation papers (209 works in total). Every entry is an avatar-generation method published at a 2024–2025 CORE A/A* venue. Rows marked "cont. of" are follow-up papers grouped under the method they extend. Each entry includes a short summary.

The 203 reviewed methods, grouped by primary task and publication year.

Generalization Methods

Method Title & Repository / Description Venue
DisCo
DisCo: Disentangled Control for Realistic Human Dance Generation (Apache 2.0)
DisCo introduces a pose-guided synthesis model for realistic human dance generation that emphasizes two principles: generalizability and compositionality. To achieve this, the authors design a disentangled-control architecture with a human-attribute pretraining stage.
CVPR 2024
SiTH
SiTH: Single-View Textured Human Reconstruction with Image-Conditioned Diffusion
SiTH proposes a two-stage pipeline that reconstructs a fully textured 3D human mesh from a single input image. First, an image-conditioned diffusion model hallucinates the back-view appearance of the person. Then, a mesh reconstruction network uses both the original front view and the hallucinated back view, guided by a skinned human body prior, to reconstruct full-body geometry and texture.
CVPR 2024
DiffHuman
DiffHuman: Probabilistic Photorealistic 3D Reconstruction of Humans
DiffHuman is a conditional diffusion model that denoises pixel-aligned 2D observations of an underlying 3D shape representation. The authors also propose a neural generator that approximates rendering, reducing runtime by up to 55×.
CVPR 2024
HaveFun
HaveFun: Human Avatar Reconstruction from Few-Shot Unconstrained Images
The authors of HaveFun present a framework that can reconstruct animatable full‑body human avatars from a small set of unconstrained images by combining a skinning mechanism with Deep Marching Tetrahedra and a two‑phase optimization: reference alignment and unseen‑region guidance.
CVPR 2024
Morphable Diffusion
Morphable Diffusion: 3D-Consistent Diffusion for Single-Image Avatar Creation
The authors introduce a diffusion model for creating fully 3D, animatable, photorealistic human avatars. It integrates a 3D morphable model (e.g., SMPL or FLAME) into a multi-view-consistent denoising approach, so that facial expression and body pose control are incorporated seamlessly and accurately into the generation process.
CVPR 2024
Stratified Avatar
Stratified Avatar Generation from Sparse Observations
The paper proposes a stratified two-stage pipeline that first reconstructs an upper-body avatar from a small set of sparse HMD and hand observations and then conditions a lower-body synthesis on the learned upper-body latent to recover full-body poses. The authors leverage a VQ-VAE and latent diffusion formulation to model the conditional distribution of full-body motion given sparse inputs.
CVPR 2024
Portrait4D
Portrait4D: Learning One-Shot 4D Head Avatar Synthesis Using Synthetic Data
Portrait4D proposes a one-shot framework for 4D head avatar synthesis from a single image. It first trains a part-wise 4D generative model to synthesize multi-view, motion-varying training data, and then uses a transformer-based animatable tri-plane reconstructor for avatar reconstruction. Similar to prior work, the authors first train a 3D head synthesizer on synthetic multi-view images, use it to convert monocular real videos into pseudo multi-view ones, and then learn a full 4D head synthesizer via cross-view self-reenactment.
CVPR 2024
Portrait4D-v2 cont. of Portrait4D
Portrait4D-v2: Pseudo Multi-View Data Creates Better 4D Head Synthesizer
In their next work, the authors introduce Portrait4D-v2, a feedforward one-shot 4D head avatar synthesis method that replaces reliance on monocular-video reconstruction and 3DMM guidance with pseudo multi-view data.
ECCV 2024
AvatarOne
AvatarOne: Monocular 3D Human Animation
AvatarOne reconstructs an animatable 3D human avatar from a single monocular video and a tracked skeleton. The method builds a canonical SDF representation with accompanying texture, then uses a forward-skinning deformation module and grid-based volumetric rendering to support novel-pose and novel-view synthesis.
WACV 2024
SphereHead
SphereHead: Stable 3D Full-Head Synthesis with Spherical Tri-Plane Representation
SphereHead introduces a spherical tri-plane representation for 3D head synthesis, which models full-head geometry better and reduces back-view artifacts compared to standard Cartesian tri-planes. It also proposes a view-image consistency loss that enforces alignment between generated images and camera parameters, enabling stable 360-degree head generation and inversion from a single image.
ECCV 2024
PAV
PAV: Personalized Head Avatar from Unstructured Video Collection
PAV proposes learning a dynamic deformable NeRF from a collection of monocular videos of the same person under different appearances (e.g., hair, facial changes). The method attaches learnable latent appearance embeddings to a base mesh and conditions both density and color of the NeRF on them.
ECCV 2024
HumanSplat
HumanSplat: Generalizable Single-Image Human Gaussian Splatting with Structure Priors
In HumanSplat, the authors propose a method to reconstruct a 3D human avatar from a single image by predicting 3DGS parameters using a 2D multi‑view diffusion model and a latent reconstruction transformer, enriched with human-structure priors. This allows feedforward generation of human Gaussians without per-subject optimization or dense multi-view capture.
NeurIPS 2024
Human-3Diffusion
Human-3Diffusion: Realistic Avatar Creation via Explicit 3D Consistent Diffusion Models
The authors propose a realistic avatar creation pipeline. Similar to previous approaches, it first utilizes a 2D multi-view diffusion model as a prior. Then it uses an image-conditioned 3DGS reconstruction model for explicit 3D representation.
NeurIPS 2024
GAGAvatar
Generalizable and Animatable Gaussian Head Avatar
The authors propose GAGAvatar, a one-shot animatable head avatar method that regresses 3D Gaussian parameters from a single image using a dual-lifting approach and integrates 3DMM priors for expression control. The feedforward model reconstructs unseen identities without per-subject optimization and renders reenactments in real time.
NeurIPS 2024
Real3D-Portrait
Real3D-Portrait: One-Shot Realistic 3D Talking Portrait Synthesis
Real3D-Portrait presents a one-shot pipeline that reconstructs a 3D avatar from a single image and conditions it on audio or video to produce talking head avatars. The system uses a large image-to-plane 3D prior, an efficient motion adapter for conditioned animation, and a head-torso/background super-resolution model.
ICLR 2024
GPAvatar (multi-input)
GPAvatar: Generalizable and Precise Head Avatar from Image(s)
GPAvatar reconstructs a 3D head avatar from one or several input images in a single forward pass, using a dynamic point‑based expression field and a Multi Tri-planes Attention fusion module to combine information from multiple images.
ICLR 2024
Shafir et al.
Human Motion Diffusion as a Generative Prior
The paper proposes using a pretrained motion diffusion model as a generative prior to overcome data scarcity in motion synthesis. The authors introduce three composition mechanisms -- sequential, parallel, and model composition -- enabling long animations, two-person motion, and fine‑grained control without collecting huge new corpora. For example, with their “DoubleTake” inference trick, they generate long motion sequences from a prior trained only on short clips.
ICLR 2024
Fine Structure-Aware Sampling
Fine Structure-Aware Sampling: A New Sampling Training Scheme for Pixel-Aligned Implicit Models in Single-View Human Reconstruction
The paper proposes a Fine Structure-Aware Sampling strategy that emphasizes “fine” structures (ears, fingers, hair edges) when training pixel-aligned implicit models from single views, reducing reconstruction artifacts and improving detailed geometry/texture recovery.
AAAI 2024
InvertAvatar
InvertAvatar: Incremental GAN Inversion for Generalized Head Avatars
The authors introduce InvertAvatar, an incremental 3D GAN inversion method that improves avatar reconstruction quality as more frames are provided. The technique includes an animatable 3D-GAN prior, a neural texture encoder with UV parameterization, and temporal aggregation (ConvGRU) to boost geometry/texture detail from multi-frame input.
SIGGRAPH 2024
Pippo
Pippo: High-Resolution Multi-View Humans from a Single Image
Pippo is a generative model based on a multi-view DiT designed to create dense, 1K resolution turnaround videos or multi-view 3D representations of a person from a single input image. It uses a multi-stage training approach, starting with pretraining on 3B human images. Key innovations include an attention biasing technique that allows generating more views than in the original training distribution and a ControlMLP that uses pixel-aligned controls to enhance 3D consistency during high-resolution generation.
CVPR 2025
GAF
GAF: Gaussian Avatar Reconstruction from Monocular Videos via Multi-View Diffusion
In GAF, the authors propose reconstructing animatable 3DGS head avatars from a monocular video captured on a commodity device. They use a multi-view latent diffusion model conditioned on normal maps from a FLAME model mesh and VAE image features to generate pseudo-ground-truth novel-view renderings, which guide the optimization of a 3DGS avatar representation. A latent upsampler further refines facial detail before decoding.
CVPR 2025
CAP4D
CAP4D: Creating Animatable 4D Portrait Avatars with Morphable Multi-View Diffusion Models
CAP4D uses a morphable multi-view diffusion model to reconstruct 4D avatars. It works with an arbitrary number of reference images, even with just one. The proposed pipeline is capable of predicting novel views and unseen expressions.
CVPR 2025
AvatarArtist
AvatarArtist: Open-Domain 4D Avatarization
In AvatarArtist, the authors propose a training paradigm using both GANs and diffusion models. They explain that, based on their observations, 4D-GANs fail at cross-domain tasks, but excel at bridging images and tri-planes. 2D diffusion models in the pipeline serve as diverse data distribution experts that assist GANs in the avatar creation.
CVPR 2025
FRESA
FRESA: Feedforward Reconstruction of Personalized Skinned Avatars from Few Images
FRESA reconstructs personalized full-body skinned avatars from just a few casual images in a single feedforward pass. The method jointly infers shape, skinning weights, and pose-dependent deformations, improving geometric fidelity over shared-weight approaches. Multi-frame feature aggregation and 3D canonicalization help capture details.
CVPR 2025
Zero-1-to-A
Zero-1-to-A: Zero-Shot One Image to Animatable Head Avatars Using Video Diffusion
Zero-1-to-A is a method for synthesizing spatially and temporally consistent corpora for 4D digital avatar synthesis. It iteratively constructs video subsets and progressively trains a diffusion model so that the resulting quality improves and the animation becomes more temporally coherent.
CVPR 2025
Vid2Avatar-Pro
Vid2Avatar-Pro: Authentic Avatar from Videos in the Wild via Universal Prior
Sharing a common idea about efficient priors, Vid2Avatar-Pro uses a universal prior model trained on multiple clothed human views to guide the fitting of a photorealistic avatar from a monocular in-the-wild video. The avatar is represented via expressive 3D Gaussians with shared canonical front/back maps. Inverse rendering is used to adapt the prior to the input identity.
CVPR 2025
GASP
GASP: Gaussian Avatars with Synthetic Priors
The authors train a 3DGS model prior using a perfectly annotated synthetic corpus, which is then fit and fine-tuned on a single photo or short video to enable 360-degree animatable avatars on a specific identity. Correlations among per-Gaussian features learned in synthetic space are utilized within the fitting process to bridge the domain gap.
CVPR 2025
AniGS
AniGS: Animatable Gaussian Avatar from a Single Image with Inconsistent Gaussian Reconstruction
AniGS reconstructs animatable 3D avatars from a single image using 4D Gaussian Splatting. Multi-view canonical images are generated via a transformer-based model, and reconstruction inconsistencies are leveraged as motion cues for animation.
CVPR 2025
SynShot
Synthetic Prior for Few-Shot Drivable Head Avatar Inversion
The authors of SynShot use a large synthetic avatar head corpus to create prior knowledge within the model, which is then fine-tuned using just a few real images to bridge the domain gap.
CVPR 2025
Avat3r
Avat3r: Large Animatable Gaussian Reconstruction Model for High-Fidelity 3D Head Avatars
Avat3r is a model that regresses a high‑quality animatable 3D head avatar from just a few input images by learning a strong Gaussian‑splat prior over heads from a large multi-view 3D head corpus and enabling animation via cross‑attention to expression codes.
ICCV 2025
Sun et al.
Fine-Grained 3D Gaussian Head Avatars Modeling from Static Captures via Joint Reconstruction and Registration
The authors propose modeling high-fidelity head avatars by optimizing two parallel 3DGS sets from static image captures: one prior-based set with animation rigging and one prior-free with texture/geometry details. They jointly register and merge them, then combine occluded parts from the prior set to output a complete animatable avatar.
ICCV 2025
GAS (Generative Avatar Synthesis)
GAS: Generative Avatar Synthesis from a Single Image
The Generative Avatar Synthesis framework combines regression-based 3D human reconstruction with a diffusion-based approach. A dense driving signal rendered from the reconstructed human is preferred over cues such as depth or normal maps, which suffer from discrepancies, and serves as comprehensive conditioning for high-quality avatar synthesis.
ICCV 2025
GUAVA
GUAVA: Generalizable Upper Body 3D Gaussian Avatar
Generalizable Upper Body 3D Gaussian Avatar reconstructs an animatable upper-body Gaussian avatar (torso, hands, face) from a single image in about 0.1 seconds using an expressive human model and projection-based sampling.
ICCV 2025
MoGA
MoGA: 3D Generative Avatar Prior for Monocular Gaussian Avatar Reconstruction
The paper introduces Monocular Gaussian Avatar, a method that leverages a generative avatar prior to reconstruct high‑fidelity animatable avatars from monocular videos. The key idea, similar to that of previously described methods, lies in combining a learned 2D avatar prior with 3DGS for monocular reconstruction.
ICCV 2025
Low-Rank Register Modules
Low-Rank Head Avatar Personalization with Registers
The paper proposes a framework to personalize a pretrained head-avatar model using Low-Rank Register Modules based on the Low-Rank Adaptation mechanism first introduced for language models. Instead of fine-tuning the full network, small learnable modules are inserted to adapt identity, appearance, and subtle facial details for new subjects.
NeurIPS 2025
3D²-Actor
3D²-Actor: Learning Pose-Conditioned 3D-Aware Denoiser for Realistic Gaussian Avatar Modeling
3D²‑Actor proposes a pipeline combining a pose‑conditioned 2D denoiser with a 3DGS‑based rectifier. Given a multi‑view video of a person, the system denoises and generates multi‑view images in arbitrary poses, then reconstructs a 3D avatar with a two‑stage projection strategy and local coordinate representation.
AAAI 2025
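
The Low-Rank Register Modules entry above builds on Low-Rank Adaptation, where a frozen weight matrix is adjusted by a small trainable low-rank update instead of being fine-tuned in full. The sketch below illustrates only that general mechanism, W' = W + (alpha / r) · B · A, in plain Python; it is not the paper's register module, and the helper names (`matmul`, `lora_weight`, `count_params`) and the toy sizes are assumptions chosen for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A for a frozen weight W.

    a has shape (r, d_in) and b has shape (d_out, r); only a and b would be trained.
    """
    rank = len(a)
    delta = matmul(b, a)  # (d_out, r) @ (r, d_in) -> (d_out, d_in)
    scale = alpha / rank
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def count_params(m):
    """Count the entries of a matrix stored as a list of rows."""
    return sum(len(row) for row in m)


random.seed(0)
d_out, d_in, rank = 8, 8, 2  # toy sizes (assumption)

w = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]     # frozen
a = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]   # trainable
b = [[0.0] * rank for _ in range(d_out)]                                  # trainable, zero init

# With B initialized to zero, the adapted layer starts identical to the pretrained one.
adapted = lora_weight(w, a, b, alpha=4.0)
print("unchanged at init:", adapted == w)
print("full params:", count_params(w), "adapter params:", count_params(a) + count_params(b))
```

Even at this toy size the adapter holds half as many parameters as the full matrix; the saving grows as the layer dimensions increase while the rank stays small.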

Expressiveness Methods

Method Title & Repository / Description Venue
FaceTalk
FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models
The authors propose using a latent diffusion model in the expression space of neural parametric head models to generate temporally coherent, high-fidelity 3D head animations from input audio.
CVPR 2024
SMIRK
SMIRK: 3D Facial Expressions Through Analysis-by-Neural-Synthesis (MIT)
SMIRK replaces traditional differentiable-rendering losses with a neural renderer to reconstruct expressive 3D faces from single in-the-wild images. This enables faithful recovery of subtle, extreme, asymmetric, or rare expressions that prior methods often miss.
CVPR 2024
DiffTED
DiffTED: One-Shot Audio-Driven Ted Talk Video Generation with Diffusion-Based Co-Speech Gestures
DiffTED is a novel method for one-shot audio-driven avatar synthesis from a single image. It leverages a diffusion model to generate Thin-Plate Spline motion model keypoints that control the avatar's movements, yielding temporally coherent and diverse co-speech gestures. The method uses classifier-free guidance (CFG).
CVPR 2024
DiffusionAvatars
DiffusionAvatars: Deferred Diffusion for High-Fidelity 3D Head Avatars
DiffusionAvatars is a method for generating high-fidelity 3D head avatars with control over pose and expression. The work's notable contribution is a neural parametric head model that is used to guide expression and head pose, as it serves as a proxy geometry for the subject. It generates expression encodings that are aggregated into the DiffusionAvatars pipeline via cross-attention. It also creates a canonical space, utilized by learnable spatial features that are later rigged to the head's surface using tri-planes.
CVPR 2024
EMAGE
EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling
In this work, the authors propose a framework for full-body avatar motion generation conditioned on audio and masked gestures. These motions include facial, local body, hands, and global movements with high expressiveness and fidelity. To achieve this, they introduce the BEAT2 mesh-level co-speech corpus based on the SMPL-X body with FLAME head parameters.
CVPR 2024
EMOPortraits
EMOPortraits: Emotion-Enhanced Multimodal One-Shot Head Avatars
The authors focus on the limitations of the latent space for facial expression descriptors. They modify a previous SOTA method to work with asymmetric facial expressions, introduce an audio modality for audio-driven facial animation, and propose a new FEED corpus that, compared to MEAD, fills the gap with intense, asymmetric, and varied facial expressions of identities in videos.
CVPR 2024
Diffused Heads
Diffused Heads: Diffusion Models Beat GANs on Talking-Face Generation
The authors of Diffused Heads use an autoregressive diffusion model that -- given a single identity image and an audio clip -- generates a full talking‑head video. The method hallucinates natural head movement, blinks, and lip motion. It is capable of preserving identity and background, overcoming common limitations of GAN-based approaches.
WACV 2024
LaughTalk
LaughTalk: Expressive 3D Talking Head Generation with Laughter
The authors of LaughTalk propose a system for 3D talking-head synthesis that can produce both speech and natural laughter -- something many prior methods struggle with, since laughter involves subtle face and head dynamics beyond speech articulation.
WACV 2024
EMO
EMO: Emote Portrait Alive Generating Expressive Portrait Videos with Audio2video Diffusion Model Under Weak Conditions
In this work, the authors address the issue of human expressions and the uniqueness of facial styles. They propose a framework that synthesizes video directly from the audio modality. Alongside the audio, a reference image with motion frames and a face region mask are used in a Stable Diffusion-based pipeline.
ECCV 2024
EMO2 cont. of EMO
EMO2: End-Effector Guided Audio-Driven Avatar Video Generation
The same authors propose a two-stage pipeline to synchronize the audio modality with co-speech gestures. First, they generate hand positions using a DiT, where the audio is incorporated via cross-attention; the previous motion latent sequence is concatenated with the current one for smoother transitions. Second, the generated co-speech gestures are encoded and added into a noisy latent.
arXiv 2025
Arc2Face
Arc2Face: A Foundation Model for Id-Consistent Human Faces (MIT)
Arc2Face is a diffusion-based foundation model that generates photorealistic human faces conditioned solely on a person’s ArcFace embedding, achieving stronger identity fidelity than text-prompted methods.
ECCV 2024
Expressive Whole-Body 3D Gaussian Avatar
Expressive Whole-Body 3D Gaussian Avatar
Expressive Whole-Body 3D Gaussian Avatar introduces a hybrid representation combining a parametric mesh and 3DGS to produce animatable full-body avatars from short monocular videos. By rigging Gaussians to mesh vertices, the method models body, face, and hand deformations simultaneously, enabling expressive novel-pose synthesis with accurate facial expressions and hand gestures.
ECCV 2024
HeadGaS
HeadGaS: Real-Time Animatable Head Avatars via 3D Gaussian Splatting
HeadGaS presents a method to generate real-time animatable head avatars using 3D Gaussian splats with learnable latent features. The Gaussians are rigged to a parametric head model and incorporate expression-dependent color and opacity, enabling animatable facial expressions.
ECCV 2024
ScanTalk
ScanTalk: 3D Talking Heads from Unregistered Scans
ScanTalk is a framework that animates arbitrary 3D face meshes from speech. It overcomes the common limitation that many 3D face animation methods require fixed mesh topology and point‑to‑point correspondence. ScanTalk relies on a diffusion‑based mesh deformation network (DiffusionNet) that takes per‑vertex features and audio as input and outputs a deformation sequence, enabling speech‑driven animation even on previously unseen or unregistered scans.
ECCV 2024
ID-to-3D
ID-to-3D: Expressive Id-Guided 3D Heads via Score Distillation Sampling
The authors of ID-to-3D introduce a method that, starting from a single casual reference image and a text prompt, generates a 3D human head avatar with identity-consistent geometry and texture. It also supports up to 13 distinct expressions. They combine an ArcFace embedding for identity, task-specific 2D diffusion priors, and a neural parametric representation for expression, forgoing reliance on large captured 3D corpora.
NeurIPS 2024
VASA-1
VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time
VASA-1 generates photorealistic talking-face videos from a single input image and a speech-audio clip. The system models holistic facial dynamics and head motion in a disentangled latent space, producing synchronized lip movement, expressive facial nuances, and natural head motion.
NeurIPS 2024
MimicTalk
MimicTalk: Mimicking a Personalized and Expressive 3D Talking Face in Minutes
MimicTalk proposes a hybrid adaptation pipeline. It generates an avatar starting from a person-agnostic generic 3D talking-face model, then quickly fine-tunes to a given identity in only a few minutes, and uses an in-context stylized speech2motion module to replicate the target’s speaking style.
NeurIPS 2024
GAIA
GAIA: Zero-Shot Talking Avatar Generation
GAIA tackles talking avatar synthesis in a zero-shot setting. It generates natural videos without relying on 3DMMs or warping heuristics. The model disentangles appearance and motion, then uses a diffusion-based motion generator conditioned on the portrait and audio.
ICLR 2024
Follow-Your-Emoji
Follow-Your-Emoji: Fine-Controllable and Expressive Freestyle Portrait Animation
In this work, the authors offer a diffusion-based framework for animating a reference portrait under a target landmark sequence. Identity is preserved while expressions are applied, with a novel “expression-aware landmark” motion signal and a fine-grained facial loss for subtle expression transfer. The system also supports long-term temporal consistency via progressive generation.
SIGGRAPH 2024
Follow-Your-Emoji-Faster cont. of Follow-Your-Emoji
Follow-Your-Emoji-Faster: Towards Efficient, Fine-Controllable, and Expressive Freestyle Portrait Animation
Follow-Your-Emoji-Faster continues the authors' Follow-Your-Emoji line by making the same fine-controllable, expression-preserving portrait animation much faster and more robust. It adds a progressive generation strategy with a Taylor-interpolated cache to achieve roughly 2.6× faster inference while maintaining quality. It also improves landmark alignment and loss weighting to better handle exaggerated expressions and diverse portrait types.
arXiv 2025
Media2Face
Media2Face: Co-Speech Facial Animation Generation with Multi-Modality Guidance
Media2Face is a diffusion-based generator that integrates diverse media inputs (audio, image, and text) for facial animation and head pose synthesis for avatars. For its training, the authors utilize the Generalized Neural Parametric Facial Asset, an efficient VAE mapping facial geometry and images to a highly generalized expression latent space.
SIGGRAPH 2024
AniTalker
AniTalker: Animate Vivid and Diverse Talking Faces Through Identity-Decoupled Facial Motion Encoding
AniTalker decouples identity and motion via a motion encoder that produces identity-independent facial motion representations. A synthesis network then applies those motions to target identities to yield diverse, expressive talking-face videos from audio or text.
ACMMM 2024
TexTalker
Towards High-Fidelity 3D Talking Avatar with Personalized Dynamic Texture
The authors introduce TexTalk4D, a high-resolution 4D corpus of 100 minutes of audio-aligned scan-level meshes with 8K dynamic textures from 100 subjects. They also present the diffusion-based framework TexTalker to generate facial motion and aligned dynamic textures simultaneously from speech. They reveal that dynamic texture is critical for high-fidelity speech-driven 3D head avatars and propose a pivot-based style injection strategy to disentangle motion style and texture style for better controllability.
CVPR 2025
Arc2Avatar
Arc2Avatar: Generating Expressive 3D Avatars from a Single Image via Id Guidance
A continuation of Arc2Face, Arc2Avatar takes a single portrait image and generates a full 3D head avatar with blendshape-based expression control. The authors leverage a human-face foundation diffusion model fine-tuned for multi-view head synthesis and initialize a modified 3DGS representation in dense correspondence with a human face mesh template; connectivity regularizers ensure an expression-capable topology. An optional SDS-based correction step refines blendshape expressions, and strong identity priors reduce reliance on heavy guidance, solving color fidelity issues common in SDS workflows.
CVPR 2025
Wang et al.
3D Gaussian Head Avatars with Expressive Dynamic Appearances by Compact Tensorial Representations
The paper proposes a digital avatar synthesis method using rigged 3D Gaussian splats and a tensorial representation for dynamic textures. The authors add an adaptive truncated opacity penalty and class-balanced sampling to improve generalization across expressions.
CVPR 2025
VLOGGER
VLOGGER: Multimodal Diffusion for Embodied Avatar Synthesis
The authors of VLOGGER introduce an avatar synthesis method from a single input image with audio guidance. First, a motion generator creates a sequence of 3D facial expressions and body poses for each frame based on the audio. These are transformed into denser representations and added to the reference image. Second, the packed input is passed into a temporal diffusion model, where it undergoes the denoising process. Finally, the pipeline uses a trainable super-resolution module to make each generated frame photorealistic.
CVPR 2025
EmoVOCA
EmoVOCA: Speech-Driven Emotional 3D Talking Heads
The paper proposes a method for generating 3D talking-head avatars with realistic emotional expressions from audio input. The approach uses a speech-to-expression network to predict fine-grained, time-varying facial deformations corresponding to emotion cues in speech. To render these deformations, the authors employ a 3D face representation that preserves geometry and appearance under different expressions and head poses.
WACV 2025
GeoAvatar
GeoAvatar: Adaptive Geometrical Gaussian Splatting for 3D Head Avatar
GeoAvatar introduces an adaptive 3DGS framework that separates rigid and flexible facial regions for better deformation control. It applies distinct regularizations to stabilize geometry while maintaining expression flexibility and incorporates a mouth-specific rigging structure for more accurate lip motion.
ICCV 2025
GaussianSpeech
GaussianSpeech: Audio-Driven Personalized 3D Gaussian Avatars
In this work, the authors introduce a method that takes spoken audio and generates high-fidelity, personalized, multi-view-consistent 3D head avatars using a 3DGS representation. They couple a transformer-based audio feature extractor with expression-dependent Gaussian color modeling and capture a new large-scale multi-view audio-visual corpus for training.
ICCV 2025
FaceCraft4D
FaceCraft4D: Animated 3D Facial Avatar Generation from a Single Image
FaceCraft4D takes a single image as input and creates a 360-degree animatable avatar. To make this possible, the authors use three different priors -- a shape prior, an image prior, and a video prior. The latter is used to enhance control over expressions and articulations in animations.
ICCV 2025
VASA-3D
VASA-3D: Lifelike Audio-Driven Gaussian Head Avatars from a Single Image
The authors present VASA-3D, a logical continuation of VASA-1: a pipeline that builds a lifelike, audio-driven 3D Gaussian head avatar from a single portrait by leveraging a learned 2D audio-motion latent (from the prior VASA-1 work) and lifting it into a 3D Gaussian expression space.
NeurIPS 2025
CyberHost
CyberHost: A One-Stage Diffusion Framework for Audio-Driven Talking Body Generation
The authors propose CyberHost, an end-to-end audio-driven avatar synthesis framework that tackles the problems of hand integrity, identity consistency, and naturalness of motion. Its key design is the Region Codebook Attention mechanism, which refines the quality of facial and hand animations by integrating fine-grained local features with learned motion pattern priors.
ICLR 2025
TEASER
TEASER: Token Enhanced Spatial Modeling for Expressions Reconstruction
The authors of TEASER propose a hybrid representation combining explicit facial parameters (e.g., from a 3DMM) with implicit appearance tokens derived by a multi-scale tokenizer.
ICLR 2025
DEEPTalk
DEEPTalk: Dynamic Emotion Embedding for Probabilistic Speech-Driven 3D Face Animation
DEEPTalk is a novel approach for generating speech-driven 3D facial animations. To significantly increase expressiveness and reduce monotony, the authors first train a Dynamic Emotion Embedding. It serves as an embedding-space representation of both speech and facial motions. Then a Temporally Hierarchical VQ-VAE is employed as an expressive and robust motion prior, overcoming the limitations of VAEs and VQ-VAEs.
AAAI 2025
EchoMimic
EchoMimic: Lifelike Audio-Driven Portrait Animations Through Editable Landmark Conditions
EchoMimic presents a method for generating high‑quality videos driven by audio and/or editable facial landmarks. The core idea is to train a model that can take either an audio clip, a sequence of facial keypoints, or a combination of both and produce a portrait animation.
AAAI 2025
EchoMimicV2 cont. of EchoMimic
EchoMimicV2: Towards Striking, Simplified, and Semi-Body Human Animation
Continuing their work, the authors present EchoMimicV2, which extends the original idea to half‑body human animation. From a reference image, audio, and an optional hand‑pose sequence, it generates semi‑body (torso + arms + head) animated videos with synchronized speech, facial expression, and body/hand gestures.
CVPR 2025
VQTalker
VQTalker: Towards Multilingual Talking Avatars Through Facial Motion Tokenization
VQTalker introduces a vector‑quantization-based facial motion tokenizer to capture articulations/pose features underlying speech. It uses this to generate talking‑head avatars that generalize across multiple languages. By discretizing facial motion and then performing coarse‑to‑fine motion generation, it achieves high-quality lip‑sync and natural animation from audio.
AAAI 2025
Model See Model Do
Model See Model Do: Speech-Driven Facial Animation with Style Control
The authors of Model See Model Do propose a speech-driven facial animation framework that uses a style reference to control the expressive style of generated animations. The method separates speech and stylistic motion and enables transferring speaking styles from a reference model while preserving speaker identity and lip sync.
SIGGRAPH 2025
EVA
EVA: Expressive Virtual Avatars from Multi-View Videos
The authors introduce EVA, a framework that builds full‑body avatars from multi‑view video. It builds on a deformable template mesh and a decoupled 3DGS.
SIGGRAPH 2025
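
Several diffusion-based entries in this table rely on guidance at sampling time; DiffTED, for instance, is noted as using classifier-free guidance (CFG). As a minimal sketch of the commonly used CFG combination rule only (not any listed paper's implementation), the snippet below blends an unconditional and a conditional noise prediction. The function name `cfg_combine` and the toy lists standing in for network outputs are assumptions for illustration.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Blend noise predictions as eps_uncond + s * (eps_cond - eps_uncond)."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy stand-ins for the denoiser's outputs without and with the audio condition.
eps_uncond = [0.0, 1.0, -2.0]
eps_cond = [1.0, 1.0, 0.0]

no_guidance = cfg_combine(eps_uncond, eps_cond, 0.0)    # ignores the condition
plain_cond = cfg_combine(eps_uncond, eps_cond, 1.0)     # purely conditional prediction
guided = cfg_combine(eps_uncond, eps_cond, 3.0)         # pushes further toward the condition

print(no_guidance)
print(plain_cond)
print(guided)
```

A scale above 1 extrapolates past the conditional prediction, which typically trades sample diversity for stronger adherence to the conditioning signal.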

Text guidance & Stylization Methods

Method Title & Repository / Description Venue
Make-It-Vivid
Make-It-Vivid: Dressing Your Animatable Biped Cartoon Characters from Text
Make-It-Vivid allows the generation of high-quality UV-texture maps for 3D biped cartoon characters based on text prompts. The method uses a pretrained text-to-image diffusion model and a custom adversarial fine-tuning to handle the domain shift between natural images and cartoonish UV texture space.
CVPR 2024
CosmicMan
CosmicMan: A Text-to-Image Foundation Model for Humans
CosmicMan is a holistic text-to-image foundation model for synthesizing photorealistic static human images. Having identified the influence of the data production flow, the authors introduce a new Annotate Anyone paradigm and a large-scale CosmicManHQ-1.0 corpus with 6 million high-quality annotated human images. A Decomposed-Attention-Refocusing training framework is also introduced to exploit the relationship between dense text descriptions and image pixels.
CVPR 2024
HumanGaussian
HumanGaussian: Text-Driven 3D Human Generation with Gaussian Splatting (MIT)
The paper introduces HumanGaussian, a framework using 3DGS for text‑driven human avatar synthesis. The key innovations include a Structure‑Aware SDS that jointly optimizes geometry and appearance via both RGB and depth guidance, and an Annealed Negative Prompt Guidance scheme to reduce over‑saturation artifacts.
CVPR 2024
HumanNorm
HumanNorm: Learning Normal Diffusion Model for High-Quality and Realistic 3D Human Generation
HumanNorm is a text-conditioned 3D human synthesis approach. The core novelty is the use of two diffusion models, one normal-adapted and one normal-aligned. The first creates high-fidelity normal maps corresponding to user prompts with view-dependent, body-aware text. The second generates colored images aligned with the normal maps.
CVPR 2024
3DToonify
3DToonify: Creating Your High-Fidelity 3D Stylized Avatar Easily from 2D Portrait Images
The authors present 3DToonify, which converts a set of 2D portrait images into a stylized, high‑fidelity 3D avatar using implicit neural fields and a three‑stage progressive training scheme: guided prior learning, deformable geometry adaptation, and explicit texture adaptation.
CVPR 2024
DreamAvatar
DreamAvatar: Text-and-Shape Guided 3D Human Avatar Generation via Diffusion Models
DreamAvatar was among the first works devoted to text guidance in digital avatar synthesis. The proposed network takes a text prompt, a 3D shape, and a pose as inputs to train a NeRF. Pretrained Stable Diffusion models serve as supervisors that generate intermediate 2D representations of the avatar used in the optimization pipeline.
CVPR 2024
StyleAvatar
StyleAvatar: Stylizing Animatable Head Avatars
StyleAvatar introduces a method to stylize animatable 3D head avatars -- not by post-processing renders, but by directly editing the representation.
WACV 2024
Wang et al.
Disentangled Clothed Avatar Generation from Text Descriptions
The authors propose a text-to-avatar generation method that separately models the human body and clothes through a representation called SO-SMPL: a pair of meshes built on the SMPL parametric model. They introduce an SDS-based pipeline to generate both meshes from text prompts, enabling better semantic alignment, higher texture and geometry quality, and effective editing/try-on capabilities.
ECCV 2024
HeadStudio
HeadStudio: Text to Animatable Head Avatars with 3D Gaussian Splatting
HeadStudio introduces a pipeline that generates animatable 3D head avatars from text prompts by rigging 3D Gaussians to a FLAME head prior. The method couples FLAME-based mesh deformation with Gaussian-splat geometry/texture and uses text-to-3D optimization to produce avatars that can be animated in pose/expression and rendered in real time.
ECCV 2024
AvatarPopUp
Instant 3D Human Avatar Generation Using Image Diffusion Models
The proposed method, AvatarPopUp, generates a 3D human avatar quickly from either a single image or a text prompt. It first uses diffusion‑based image generation to synthesize front and back views with pose/shape control, and then applies a 3D lifting network to produce a rigged mesh.
ECCV 2024
MagicMirror
MagicMirror: Fast and High-Quality Avatar Generation with a Constrained Search Space
The authors of MagicMirror propose a hybrid approach for stylized avatar synthesis. It consists of a NeRF that creates a versatile initial solution space and a text-to-image diffusion model with a learned geometric prior. VSD is used instead of the more common SDS for the texture loss to mitigate oversaturation.
ECCV 2024
Stable Video Portraits
Stable Video Portraits
The authors propose Stable Video Portraits -- a novel hybrid 2D/3D generation method for photorealistic portrait videos. It leverages a large pretrained text-to-image prior controlled via a 3DMM. The method involves person-specific fine-tuning of a general 2D Stable Diffusion model with temporal conditioning on 3DMM sequences.
ECCV 2024
X-Oscar
X-Oscar: A Progressive Framework for High-Quality Text-Guided 3D Animatable Avatar Generation
In this work, the authors propose X-Oscar, a progressive (geometry, texture, animation) framework that generates high-quality animatable 3D avatars from text prompts, introducing Adaptive Variational Parameter and Avatar-aware Score Distillation Sampling to reduce oversaturation and improve optimization stability.
ICML 2024
AvatarVerse
AvatarVerse: High-Quality & Stable 3D Avatar Creation from Text and Pose
The authors of AvatarVerse propose a pipeline that generates full 3D avatars from a text prompt and pose guidance. The core is a 2D diffusion model conditioned on DensePose signals. The method uses a progressive high‑resolution 3D synthesis strategy to enhance geometric and texture detail.
AAAI 2024
Follow Your Pose
Follow Your Pose: Pose-Guided Text-to-Video Generation Using Pose-Free Videos
The authors propose Follow Your Pose, a two-stage pipeline to generate pose-controllable character videos from text and pose trajectories, even when no paired text-video corpus exists. First, they fine-tune a text-to-image model on pose-image pairs to encode pose. Then, they add temporal self-attention and cross-frame attention and fine-tune on pose-free video data to generate smooth guided videos.
AAAI 2024
HeadArtist
HeadArtist: Text-Conditioned 3D Head Generation with Self Score Distillation
The authors of HeadArtist propose a pipeline that generates 3D head avatars from text prompts by optimizing a parametric head model under the supervision of a frozen ControlNet model via the proposed Self Score Distillation.
SIGGRAPH 2024
DivAvatar
DivAvatar: Diverse 3D Avatar Generation with a Single Prompt
In DivAvatar, the authors address the limited diversity of existing text-to-avatar systems by allowing the synthesis of many distinct 3D avatars from a single text prompt. Their method fine-tunes a pretrained 3D generative model and introduces two key designs: a noise-sampling strategy at training time to preserve generation diversity, and a semantic-aware zoom mechanism paired with a novel depth loss to enforce geometry quality while adhering to textual semantics.
WACV 2025
StrandHead
StrandHead: Text to Hair-Disentangled 3D Head Avatars Using Human-Centric Priors
StrandHead generates 3D head avatars with strand-level hair from text prompts by first synthesizing a FLAME-aligned bald head via 2D human priors and then optimizing hair strands with a differentiable prismatization that enforces realistic orientation and curvature.
ICCV 2025
TeRA
TeRA: Rethinking Text-Guided Realistic 3D Avatar Generation
The authors of TeRA propose a two‑stage generative framework for text‑to‑3D‑avatar creation that distills a decoder producing a structured latent space from a large human reconstruction model.
ICCV 2025
AvatarGO
AvatarGO: Zero-Shot 4D Human-Object Interaction Generation and Animation
AvatarGO generates 4D HOI animations from high-level textual descriptions without requiring paired HOI training data. It first composes a 3D scene via text-guided 3D generation, then uses a SMPL‑X-based motion optimization to animate both human and object, enforcing spatial constraints and avoiding penetration.
NeurIPS 2025
InstructAvatar
InstructAvatar: Text-Guided Emotion and Motion Control for Avatar Generation
InstructAvatar introduces a novel system that lets users control both facial emotion and motion of a 2D avatar via text guidance, in addition to audio. The method uses a two-branch diffusion‑based generator: one branch conditions on audio and another on text.
AAAI 2025
Wu et al.
Text-Based Animatable 3D Avatars with Morphable Model Alignment
In the paper, the authors propose aligning text-driven digital avatar synthesis with morphable model geometry to produce animatable heads that respect parametric face constraints.
SIGGRAPH 2025
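
Several of the text-guided entries above optimize their avatar representation with Score Distillation Sampling (SDS) or a variant of it. The snippet below is only a toy sketch of the basic SDS update, not any listed method's implementation. It assumes an identity "renderer" (the parameters are the image), a hypothetical analytic noise predictor standing in for a frozen diffusion model, a cosine/sine noise schedule, and the weighting `w(t) = sigma_t ** 2`. All function names are illustrative.

```python
import math
import random


def noise_schedule(t):
    """Toy variance-preserving schedule (assumed): alpha_t^2 + sigma_t^2 = 1."""
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)


def toy_noise_predictor(x_t, t, target):
    """Hypothetical stand-in for a frozen, text-conditioned diffusion model.

    If the data distribution is concentrated on `target`, the optimal noise
    prediction for x_t = alpha_t * x0 + sigma_t * eps is
    (x_t - alpha_t * target) / sigma_t.
    """
    alpha, sigma = noise_schedule(t)
    return [(x - alpha * m) / sigma for x, m in zip(x_t, target)]


def sds_step(theta, target, lr, rng):
    """One SDS update with an identity renderer (x = theta, dx/dtheta = 1)."""
    t = rng.uniform(0.02, 0.98)                      # random diffusion time
    alpha, sigma = noise_schedule(t)
    eps = [rng.gauss(0.0, 1.0) for _ in theta]       # injected noise
    x_t = [alpha * x + sigma * e for x, e in zip(theta, eps)]
    eps_hat = toy_noise_predictor(x_t, t, target)    # treated as a constant
    w = sigma ** 2                                   # assumed weighting w(t)
    # SDS gradient: w(t) * (eps_hat - eps) * dx/dtheta
    grad = [w * (eh - e) for eh, e in zip(eps_hat, eps)]
    return [x - lr * g for x, g in zip(theta, grad)]


def run_sds(theta, target, steps=300, lr=0.5, seed=0):
    """Repeat the SDS update; theta drifts toward the prior's mode."""
    rng = random.Random(seed)
    for _ in range(steps):
        theta = sds_step(theta, target, lr, rng)
    return theta


if __name__ == "__main__":
    print(run_sds([0.0, 0.0, 0.0], [1.0, -2.0, 0.5]))
```

In a real pipeline the identity renderer would be replaced by a differentiable renderer of the 3D representation, and the gradient would be back-propagated through the render only, never through the diffusion model.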

Attribute editing Methods

Method Title & Repository / Description Venue
Control4D
Control4D: Efficient 4D Portrait Editing with Text
The authors propose a 4D portrait editing framework that uses a novel representation called GaussianPlanes -- a plane‑based decomposition of Gaussian Splatting over space-time -- and a generator trained to convert 2D diffusion text-driven edits into temporally consistent 4D outputs.
CVPR 2024
Animate Anyone
Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation
The authors propose Animate Anyone, a diffusion‑based framework that animates a static character image into a full video, preserving appearance detail via their spatial‑attention ReferenceNet and enabling pose‑controllable motion with a “pose guider” module and temporal modeling to ensure smooth transitions between frames.
CVPR 2024
NECA
NECA: Neural Customizable Human Avatar
The authors of NECA learn a fully customizable human avatar from monocular or sparse-view video. The model predicts disentangled neural fields for geometry, albedo, shadow, and external lighting in two complementary spaces (canonical and surface) and renders them volumetrically with high-frequency details.
CVPR 2024
GeneAvatar
GeneAvatar: Generic Expression-Aware Volumetric Head Avatar Editing from a Single Image
GeneAvatar introduces a method that, given a single input image, can produce a volumetric 3D head avatar and allow expression-aware editing by lifting 2D edits into a consistent 3D modification field.
CVPR 2024
PEGASUS
PEGASUS: Personalized Generative 3D Avatars with Composable Attributes
PEGASUS is a method that builds a person‑specific generative 3D avatar from a monocular video by first synthesizing a video collection of that identity with varying facial attributes (hair, nose, etc.), then training a generative model enabling disentangled compositional attribute control while preserving identity.
CVPR 2024
SplattingAvatar
SplattingAvatar: Realistic Real-Time Human Avatars with Mesh-Embedded Gaussian Splatting
The work introduces a hybrid avatar representation combining explicit triangle-mesh geometry for low-frequency deformation and embedded 3D Gaussians for high-frequency geometry and appearance. The method creates photorealistic avatars that render at 300+ FPS on desktop and ~30 FPS on a mobile device. It is trainable from monocular video for head or full-body avatars and explicitly controls Gaussians via mesh motion, avoiding purely MLP-based deformation fields.
CVPR 2024
AttriHuman-3D
AttriHuman-3D: Editable 3D Human Avatar Generation with Attribute Decomposition and Indexing
AttriHuman-3D proposes an editable avatar synthesis framework with attribute decomposition and indexing in latent space. By separating attributes such as body, hair, and clothing, it enables precise editing without affecting unrelated parts.
CVPR 2024
OHTA
OHTA: One-Shot Hand Avatar via Data-Driven Implicit Priors
The authors of OHTA introduce a one-shot framework for building realistic hand avatars from a single image using data-driven implicit priors. The model learns a shape-texture prior from a large hand corpus and fine-tunes it for the target identity.
CVPR 2024
RAM-Avatar
RAM-Avatar: Real-Time Photo-Realistic Avatar from Monocular Videos with Full-Body Control
RAM-Avatar presents a real-time system that learns a photorealistic, fully controllable human avatar from a single monocular video. The model uses a region-aware module to separately model the head, hands, and body. It integrates these into a unified avatar through pose-conditioned fusion.
CVPR 2024
Animatable Gaussians
Animatable Gaussians: Learning Pose-Dependent Gaussian Maps for High-Fidelity Human Avatar Modeling
Animatable Gaussians introduces a template-guided parameterization that learns pose-dependent Gaussian maps (front and back) with a StyleGAN/StyleUNet-style conditional generator.
CVPR 2024
GaussianAvatars
GaussianAvatars: Photorealistic Head Avatars with Rigged 3D Gaussians
The authors present GaussianAvatars, which rigs 3D Gaussian splats to a parametric face model so each splat moves with an underlying triangle frame and per-splat offsets are optimized jointly with morphable model parameters.
CVPR 2024
TexVocab
TexVocab: Texture Vocabulary-Conditioned Human Avatars
TexVocab constructs a pose-conditioned texture vocabulary by back-projecting multi-view RGB video frames into SMPL UV space, then learns to query and interpolate texture tokens per body part for dynamic, pose-dependent appearance synthesis.
CVPR 2024
CVTHead
CVTHead: One-Shot Controllable Head Avatar with Vertex-Feature Transformer
CVTHead generates a controllable 3D head avatar from a single reference image by treating the mesh vertices as a point set and applying a Vertex-feature Transformer to learn per-vertex descriptors. This representation supports animation of pose, expression, and view changes via a lightweight neural point-based renderer.
WACV 2024
CanonicalFusion
CanonicalFusion: Generating Drivable 3D Human Avatars from Multiple Images
CanonicalFusion's authors present a framework that reconstructs animatable 3D human avatars from multiple images by first predicting per-view depth maps and LBS weight maps via a shared-encoder, dual-decoder network, then canonicalizing each view into a unified mesh space. Rather than predicting full high-dimensional skinning weights, the method compresses them into a 3D vector per vertex using a pretrained MLP. A forward skinning-based differentiable rendering scheme merges the various reconstructions.
ECCV 2024
Champ
Champ: Controllable and Consistent Human Image Animation with 3D Parametric Guidance (MIT)
The authors of Champ integrate a 3D human parametric model (e.g., SMPL) into a latent-diffusion-based animation pipeline to improve motion guidance, shape alignment, and pose fidelity in human image animation. They condition on depth, normal, and semantic maps rendered from SMPL sequences, together with skeleton motion, to steer the latent diffusion model.
ECCV 2024
OmniControl
OmniControl: Control Any Joint at Any Time for Human Motion Generation
OmniControl presents a diffusion-based human motion generation model that -- unlike prior works limited to controlling only pelvis trajectory -- allows specification of spatial constraints for any joint at any time.
ICLR 2024
GG-Editor
GG-Editor: Locally Editing 3D Avatars with Multimodal Large Language Model Guidance
In this work, the authors present GG-Editor, a text-driven method for local editing of 3D avatars. Instead of global edits, the method uses a multimodal LLM (e.g., GPT-4V) to infer reasonable local editing regions (hair, clothes, geometry details), then applies a global-to-local view-synergy editing pipeline to modify geometry and texture while preserving cross-view consistency.
ACMMM 2024
E³Gen
E³Gen: Efficient, Expressive and Editable Avatars Generation
The paper introduces a novel method to generate high-fidelity, editable 3D avatars by encoding 3D Gaussian primitives into a structured 2D UV feature-plane defined over a parametric human mesh (e.g., SMPL-X). This UV-plane representation lets a diffusion model learn over many subjects, while a part-aware deformation module enables expressive full-body pose control and local editing (clothes, wrinkles).
ACMMM 2024
ControlFace
ControlFace: Harnessing Facial Parametric Control for Face Rigging
The authors of ControlFace propose a face‑rigging method that combines 3DMM renderings with a dual‑branch U‑Net to allow precise control over pose, expression, and lighting directly from a single image.
CVPR 2025
PERSE
PERSE: Personalized 3D Generative Avatars from a Single Portrait
PERSE presents a method that takes a single portrait image and builds a personalized 3D avatar with disentangled latent controls for facial attributes.
CVPR 2025
MeGA
MeGA: Hybrid Mesh-Gaussian Head Avatar for High-Fidelity Rendering and Head Editing
MeGA introduces a hybrid representation that uses a refined mesh model for facial skin and 3D Gaussian splats for hair, allowing higher fidelity and editing flexibility across the whole head. A UV displacement map enhances facial geometry detail, and occlusion-aware blending merges mesh and Gaussian components for seamless rendering.
CVPR 2025
Editable Photorealistic Avatar (Tetrahedral 3DGS)
Creating Your Editable 3D Photorealistic Avatar with Tetrahedron-Constrained Gaussian Splatting
The paper introduces a method for building editable photorealistic avatars by combining tetrahedral-grid constraints with 3DGS. The pipeline first instantiates an avatar from a monocular video, then uses local spatial adaptation via tetrahedrons to structure Gaussian kernels, and finally refines appearance with a coarse-to-fine strategy.
CVPR 2025
FATE
FATE: Full-Head Gaussian Avatar with Textural Editing from Monocular Video
FATE introduces a sampling-based densification to improve rendering efficiency and achieve a better positional distribution of points. For texture editing, the authors convert Gaussian representations into editable attribute maps.
CVPR 2025
Gaussian Deja-vu
Gaussian Deja-Vu: Creating Controllable 3D Gaussian Head-Avatars with Enhanced Generalization and Personalization Abilities
The authors present Gaussian Deja-vu, a two-stage framework that first trains a generalized 3DGS head prior on large 2D (synthetic + real) image corpora and then personalizes this prior quickly using monocular video with learnable expression-aware rectification blendmaps.
WACV 2025
PERSONA
PERSONA: Personalized Whole-Body 3D Avatar with Pose-Driven Deformations from a Single Image
PERSONA proposes a method to create a personalized, animatable, whole-body 3D avatar from a single image. The core innovation is using a diffusion-based video generation model to synthesize a pose-rich training video from the input image, which then guides the optimization of a 3D avatar representation. To maintain high fidelity and mitigate identity drift from the generated data, the framework uses balanced sampling of the original image and geometry-weighted optimization.
ICCV 2025
ToMiE
Towards Explicit Exoskeleton for the Reconstruction of Complicated 3D Human Avatars
The authors present ToMiE -- a framework that adapts the joint tree of the SMPL skeleton by dynamically growing “external joints” to explicitly model objects held by people or loose garments. The method involves two steps: localize parent joints by gradients from skin-blending weights and motion kernels, then optimize external joint transforms across frames.
ICCV 2025
CtrlAvatar
CtrlAvatar: Controllable Avatars Generation via Disentangled Invertible Networks
CtrlAvatar introduces a method to generate controllable, customizable human avatars by separating the deformation process into two disentangled streams: an implicit body geometry network and an explicit texture network.
AAAI 2025
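
Many of the animatable avatars above are driven by linear blend skinning (LBS); CanonicalFusion, for instance, predicts LBS weight maps and relies on forward skinning. As a minimal sketch of the standard LBS formula `v' = sum_j w_j * (T_j v)` only, the pure-Python snippet below poses a few vertices with two bones. The helper names (`rotation_z`, `transform_point`, `linear_blend_skinning`) are illustrative and not taken from any of the listed repositories.

```python
import math


def rotation_z(angle, tx=0.0, ty=0.0, tz=0.0):
    """4x4 rigid transform: rotation about the z-axis, then a translation."""
    c, s = math.cos(angle), math.sin(angle)
    return [
        [c, -s, 0.0, tx],
        [s, c, 0.0, ty],
        [0.0, 0.0, 1.0, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]


def transform_point(T, p):
    """Apply a 4x4 transform to a 3D point (homogeneous coordinate = 1)."""
    x, y, z = p
    return tuple(T[r][0] * x + T[r][1] * y + T[r][2] * z + T[r][3]
                 for r in range(3))


def linear_blend_skinning(vertices, weights, bone_transforms):
    """Pose rest-space vertices: v' = sum_j w_j * (T_j v).

    vertices:        list of (x, y, z) rest-pose positions
    weights:         per-vertex list of bone weights (each row sums to 1)
    bone_transforms: list of 4x4 matrices, one per bone
    """
    posed = []
    for v, w in zip(vertices, weights):
        if abs(sum(w) - 1.0) > 1e-6:
            raise ValueError("skinning weights must sum to 1 per vertex")
        acc = [0.0, 0.0, 0.0]
        for w_j, T_j in zip(w, bone_transforms):
            q = transform_point(T_j, v)
            for k in range(3):
                acc[k] += w_j * q[k]
        posed.append(tuple(acc))
    return posed


if __name__ == "__main__":
    bones = [rotation_z(0.0), rotation_z(math.pi / 2)]  # identity, 90 degrees
    verts = [(1.0, 0.0, 0.0)] * 3
    wts = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    for p in linear_blend_skinning(verts, wts, bones):
        print(p)
```

The third vertex, weighted equally between the two bones, lands halfway between the two rigidly transformed positions. This linear averaging is what makes LBS cheap, and also what causes its well-known volume loss at strongly bent joints.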

Physics improvements & World interaction Methods

Method Title & Repository / Description Venue
NIFTY
NIFTY: Neural Object Interaction Fields for Guided Human Motion Synthesis
NIFTY introduces a neural “interaction field” attached to objects that encodes valid HOI configurations. During motion generation, this field guides an object-conditioned human motion diffusion model to produce realistic interactions.
CVPR 2024
CG-HOI
CG-HOI: Contact-Guided 3D Human-Object Interaction Generation
CG-HOI tackles generation of full 3D HOI motion sequences from a text prompt and object geometry. The method jointly models human motion, object motion, and explicit contact between body and object, using a diffusion process with cross-attention to ensure coherence and physical plausibility.
CVPR 2024
WANDR
WANDR: Intention-Guided Human Motion Generation
WANDR introduces a conditional VAE that generates realistic human motion trajectories aiming at a 3D goal. Given an initial pose and a target goal position, it outputs natural full-body motion sequences that place the end-effector (e.g., hand) on the goal. Instead of reinforcement learning or hand-crafted controllers, the model uses learned “intention features” that guide movement.
CVPR 2024
RoHM
RoHM: Robust Human Motion Reconstruction via Diffusion
RoHM is a diffusion-based motion model that tackles the problem of robust reconstruction of 3D human motions in the presence of noise and occlusions. The paper proposes using two models addressing distinct solution spaces: one for global trajectory and one for local motion.
CVPR 2024
IntrinsicAvatar
IntrinsicAvatar: Physically Based Inverse Rendering of Dynamic Humans from Monocular Videos via Explicit Ray Tracing
The authors present a method that recovers intrinsic properties -- geometry, albedo, material, and lighting -- of clothed human avatars from a single monocular video by modeling volumetric scattering and performing explicit Monte‑Carlo ray tracing integrated with body articulation.
CVPR 2024
Saito et al.
Relightable Gaussian Codec Avatars
The paper presents a method to build high-fidelity head avatars that support real-time relighting and animation by using a geometry model based on 3D Gaussians that capture sub-millimeter details -- hair strands, pores -- and an appearance model based on learnable radiance transfer combined with spherical harmonics for diffuse and reflection components.
CVPR 2024
Xu et al.
Relightable and Animatable Neural Avatar from Sparse-View Video
The work addresses the problem of reconstructing animatable and relightable human avatars from sparse-view or monocular video under unknown illumination. The authors introduce a Hierarchical Distance Query algorithm that enables efficient sphere-tracing of deformed SDFs to estimate light visibility and surface intersections under arbitrary poses.
CVPR 2024
Intrinsic Hand Avatar
Intrinsic Hand Avatar: Illumination-Aware Hand Appearance and Shape Reconstruction from Monocular RGB Video
The work recovers a full hand avatar -- geometry, appearance, and environment lighting -- from a monocular RGB video of a user’s hand under arbitrary real-world illumination. They optimize shape, material, and lighting jointly using a differentiable renderer with Monte Carlo path tracing.
WACV 2024
CHOIS
Controllable Human-Object Interaction Synthesis
CHOIS (Controllable Human-Object Interaction Synthesis) is a conditional diffusion model that jointly generates human and object motion in 3D scenes, guided by language descriptions and object waypoint constraints.
ECCV 2024
HUMOS
HUMOS: Human Motion Model Conditioned on Body Shape
In this work, the authors propose a generative human motion model that conditions not only on pose but also on body shape, reflecting the fact that people with different body types move differently. The model is learned from unpaired data using cycle consistency together with physics and stability constraints.
ECCV 2024
URAvatar
URAvatar: Universal Relightable Gaussian Codec Avatars
URAvatar presents a pipeline to build photorealistic, relightable head avatars from a single phone scan under unknown illumination by learning a radiance-transfer style model rather than explicit inverse-rendered reflectance. In this way, avatars can be relit and animated in real time.
SIGGRAPH 2024
Jiang et al.
Autonomous Character-Scene Interaction Synthesis from Text Instruction
Although it concerns motion-level human animation rather than digital avatar synthesis proper, the paper proposes a framework for multi-stage, scene-aware interaction motion synthesis. It is conditioned on text instructions and a goal location. A diffusion model and an autonomous scheduler are used to predict sequential motion segments for each action stage.
SIGGRAPH 2024
VRMM
VRMM: A Volumetric Relightable Morphable Head Model
The authors of VRMM propose a volumetric, relightable morphable head prior that disentangles identity, expression, view, and lighting -- using volumetric primitives attached to a base mesh, yielding a head model that supports animation and/or relighting under novel lighting/view conditions.
SIGGRAPH 2024
PhysReaction
PhysReaction: Physically Plausible Real-Time Humanoid Reaction Synthesis via Forward Dynamics Guided 4D Imitation
The authors of PhysReaction propose a Forward Dynamics Guided 4D Imitation framework to synthesize physically plausible humanoid reactions in real time. Unlike purely kinematic approaches, which often suffer from sliding feet, foot penetration, or other non-physical motions, their method uses a learned policy to generate full-body reactions under physics constraints.
ACMMM 2024
HRAvatar
HRAvatar: High-Quality and Relightable Gaussian Head Avatar
In this work, the authors present HRAvatar, a method that reconstructs high-fidelity, animatable 3D head avatars from monocular videos while enabling realistic relighting and material editing. They address limitations in past 3DGS approaches by incorporating an end-to-end tracking optimization, learnable blend-shapes and LBS for improved deformation.
CVPR 2025
InteractAvatar
InteractAvatar: Modeling Hand-Face Interaction in Photorealistic Avatars with Deformable Gaussians
The authors of InteractAvatar introduce a novel avatar model that explicitly captures dynamic hand-face interactions, using 3D Gaussian splats anchored to a hand mesh that deform with articulation to model wrinkles, shadows, and contact effects. Their system has a “Dynamic Gaussian Hand” module that refines geometry and appearance via a neural network and a dedicated interaction module that adjusts facial geometry and shading when hands touch the face.
ICCV 2025
PRIMAL
PRIMAL: Physically Reactive and Interactive Motor Model for Avatar Learning
In this work, the authors propose a novel generative real-time system. It allows for physically reactive and interactive avatars controlled with discrete commands and continuous signals, such as being pulled by a “magnet”. In the pretraining stage, the model learns body movements from sub-second motion segments. Then a ControlNet-like adaptor is employed to further fine-tune the base model to new tasks.
ICCV 2025
BecomingLit
BecomingLit: Relightable Gaussian Avatars with Hybrid Neural Shading
BecomingLit presents a method to make 3DGS-based avatars relightable under arbitrary illumination conditions. The approach combines physically-based shading of Gaussian primitives with a neural network that refines shadows, highlights, and skin detail.
NeurIPS 2025
Agent-to-Sim
Agent-to-Sim: Learning Interactive Behavior Models from Casual Longitudinal Videos
Agent‑to‑Sim (ATS) learns interactive 3D agent behavior from casually captured, long-term video -- no MoCap suits or multi‑view rigs needed. It reconstructs a persistent 4D representation across videos using a coarse-to-fine registration, then builds a behavior model that generates new agent motion conditioned on ego‑perception and environment.
ICLR 2025
Wang et al.
Relightable Full-Body Gaussian Codec Avatars
The authors propose a new full‑body avatar framework combining 3DGS with a learned radiance‑transfer appearance model to enable relightable, pose‑dependent rendering including face and hands. Their method decomposes light transport into local and non-local effects through zonal harmonics for efficient diffuse transfer under articulation and a shadow network for occlusion shadows.
SIGGRAPH 2025
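
Several relighting entries above need to know whether a surface point can see a light source; Xu et al., for example, sphere-trace deformed SDFs to estimate light visibility and surface intersections. The snippet below is a minimal sketch of plain sphere tracing on a static, analytic SDF. It does not reproduce their Hierarchical Distance Query algorithm or any pose-dependent deformation, and the function names, step limits, and bias value are assumptions chosen for illustration.

```python
import math


def sphere_sdf(p, center=(0.0, 0.0, 0.0), radius=1.0):
    """Signed distance from point p to a sphere (negative inside)."""
    return math.dist(p, center) - radius


def sphere_trace(sdf, origin, direction, max_steps=128, eps=1e-5, t_max=100.0):
    """March along a ray, stepping by the SDF value each time.

    Returns the ray parameter t of the first hit, or None if the ray
    leaves the [0, t_max] range or runs out of steps.
    """
    norm = math.sqrt(sum(c * c for c in direction))
    d = tuple(c / norm for c in direction)
    t = 0.0
    for _ in range(max_steps):
        p = tuple(o + t * c for o, c in zip(origin, d))
        dist = sdf(p)
        if dist < eps:          # close enough to the surface: report a hit
            return t
        t += dist               # safe step: nothing is closer than `dist`
        if t > t_max:
            break
    return None


def light_visible(sdf, point, light, bias=1e-3):
    """Shadow-ray test: True if nothing blocks the segment point -> light."""
    to_light = tuple(l - p for l, p in zip(light, point))
    dist = math.sqrt(sum(c * c for c in to_light))
    d = tuple(c / dist for c in to_light)
    # Start slightly off the query point to avoid an immediate self-hit.
    start = tuple(p + bias * c for p, c in zip(point, d))
    return sphere_trace(sdf, start, d, t_max=dist - bias) is None


if __name__ == "__main__":
    print(sphere_trace(sphere_sdf, (0.0, 0.0, -3.0), (0.0, 0.0, 1.0)))
    print(light_visible(sphere_sdf, (0.0, 0.0, -3.0), (0.0, 0.0, 3.0)))
    print(light_visible(sphere_sdf, (0.0, 0.0, -3.0), (0.0, 5.0, -3.0)))
```

The first shadow ray passes through the unit sphere and is reported as blocked; the second points away from it and is reported as lit. An animatable avatar would replace `sphere_sdf` with a learned, pose-deformed distance field, which is where most of the cost and difficulty lies.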

Hair and clothes improvements Methods

Method Title & Repository / Description Venue
DiffAvatar
DiffAvatar: Simulation-Ready Garment Optimization with Differentiable Simulation
The authors of DiffAvatar introduce a method for generating high-quality, simulation-ready garment assets. It performs body and garment co-optimization using differentiable simulation. To reconstruct geometry and extract material parameters accurately, physical simulation is integrated into the optimization loop.
CVPR 2024
PhysAvatar
PhysAvatar: Learning the Physics of Dressed 3D Avatars from Visual Observations
PhysAvatar combines 4D mesh-aligned Gaussian techniques, inverse rendering, and a physics simulator to recover not only shape and appearance, but also physical properties of clothing from multi-view video.
ECCV 2024
Zakharov et al.
Human Hair Reconstruction with Strand-Aligned 3D Gaussians
The paper introduces a method that represents hair with strand‑aligned 3D Gaussians, combining classical hair‑strand geometry with 3DGS’s differentiable rendering to reconstruct realistic, strand‑level hairstyles from multi‑view data.
ECCV 2024
DLCA-Recon
DLCA-Recon: Dynamic Loose Clothing Avatar Reconstruction from Monocular Videos
DLCA‑Recon reconstructs dynamic human avatars with loose clothing from monocular video. They combine an explicit mesh and an implicit SDF representation and introduce a Dynamic Deformation Field to model realistic cloth deformation with frame-to-frame consistency.
AAAI 2024
LayGA
LayGA: Layered Gaussian Avatars for Animatable Clothing Transfer
LayGA separates body and clothing into two layers (body‑Gaussians + garment‑Gaussians). This enables animatable clothing transfer from multi‑view video, allowing users to switch clothes between avatars while preserving proper garment-body interaction and plausible deformation under motion.
SIGGRAPH 2024
DAGSM
DAGSM: Disentangled Avatar Generation with Gs-Enhanced Mesh
DAGSM is a text-conditioned avatar synthesis method that disentangles the human body and garments. It models the body and each clothing part separately using Gaussian-enhanced meshes to better represent complex textures like wool or transparent fabrics, and it supports clothing replacement and realistic animation via a view-consistent texture refinement module.
CVPR 2025
SimAvatar
SimAvatar: Simulation-Ready Avatars with Layered Hair and Clothing
The authors address the task of representing hair and garment geometry while also utilizing prior knowledge from a foundation model -- Stable Diffusion -- and making avatars simulation-ready via physics or neural simulators. They propose a two-stage framework. In the first stage, three text-conditioned diffusion-based models generate hair strands, a body mesh, and a garment. In the second stage, the elements are combined into a single model and assigned learnable 3D Gaussians, which then undergo optimization. Image-based Stable Diffusion is used to compute the SDS loss.
CVPR 2025
LUCAS
LUCAS: Layered Universal Codec Avatars
LUCAS is a Universal Prior Model for digital avatar synthesis that disentangles face and hair via a layered representation, enabling both real-time mesh-based rendering and high-fidelity Gaussian avatar synthesis with improved cross-identity generalization and dynamic expression/pose handling.
CVPR 2025
Zhang et al.
Disentangled Clothed Avatar Generation with Layered Representation
The authors propose a feedforward diffusion-based method that generates clothed avatars with fully disentangled components by using a layered UV feature-plane representation where each component occupies a distinct layer of a Gaussian-based UV feature map.
ICCV 2025
HADES
HADES: Human Avatar with Dynamic Explicit Hair Strands
HADES models full-body avatars with dynamic hair represented as deformable strands attached to 3D Gaussians. It simulates realistic hair motion through temporal fusion and color-consistency correction across multi-view inputs, achieving natural animation and stable rendering.
ICCV 2025
HairCUP
HairCUP: Hair Compositional Universal Prior for 3D Gaussian Avatars
The authors of the work present HairCUP, a universal prior model for 3D head avatars that explicitly disentangles face and hair by learning separate latent spaces for each component.
ICCV 2025
Im2Haircut
Im2Haircut: Single-View Strand-Based Hair Reconstruction for Human Avatars
Im2Haircut is a method that reconstructs 3D strand-based hair geometry from a single input photograph by combining a transformer-based global hair prior (trained on synthetic + real data) with a 3DGS reconstruction module.
ICCV 2025
SeqAvatar
Sequential Gaussian Avatars with Hierarchical Motion Context
The authors present SeqAvatar, a method for animatable human avatar synthesis using 3DGS enriched by a hierarchical motion context. They combine coarse skeleton-level and fine-grained vertex motions in a coarse-to-fine conditioning scheme. They then apply a spatio-temporal multi-scale sampling strategy to better capture non-rigid deformations (e.g., cloth folds) under motion.
ICCV 2025
DGH
DGH: Dynamic Gaussian Hair
The authors introduce DGH, a method for modeling dynamic hair within 3DGS-based avatars. Hair is represented as volumetric Gaussians that capture both the overall hairstyle and local motion dynamics.
NeurIPS 2025
MPMAvatar
MPMAvatar: Learning 3D Gaussian Avatars with Accurate and Robust Physics-Based Dynamics
MPMAvatar builds clothed human avatars from multi-view video, combining 3DGS with a Material‑Point‑Method physics simulator to realistically simulate cloth dynamics and body‑cloth interactions.
NeurIPS 2025

High fidelity and realism Methods

Method Title & Repository / Description Venue
GaussianAvatar
GaussianAvatar: Towards Realistic Human Avatar Modeling from a Single Video via Animatable 3D Gaussians (MIT)
The authors of GaussianAvatar propose a method for creating realistic human avatars from a single monocular video by introducing animatable 3DGS with dynamic appearance networks to support pose‑dependent appearance modeling and jointly optimizing motion and appearance to tackle motion‑estimation inaccuracies.
CVPR 2024
UltrAvatar
UltrAvatar: A Realistic Animatable 3D Avatar Diffusion Model with Authenticity Guided Textures
UltrAvatar is a novel 3D avatar synthesis approach with enhanced geometric fidelity and higher-quality physically based rendering (PBR) textures. It presents a diffuse color extraction model and an authenticity-guided texture diffusion model, both of which improve the overall quality of generated avatars.
CVPR 2024
Gaussian Head Avatar
Gaussian Head Avatar: Ultra High-Fidelity Head Avatar via Dynamic Gaussians
The authors propose a representation of animatable head avatars using controllable 3D Gaussians, jointly optimizing a neutral Gaussian set and an MLP-based deformation field to capture fine-grained dynamic expressions under sparse-view capture. A geometry-guided initialization using an implicit SDF and Deep Marching Tetrahedra stabilizes training and improves convergence.
CVPR 2024
RodinHD
RodinHD: High-Fidelity 3D Avatar Generation with Diffusion Models
In this work, the authors tackle the problem of catastrophic forgetting that arises when tri-planes are fitted sequentially across many avatars. They propose a novel data scheduling strategy and a weight consolidation regularization term, which together improve the rendering of sharp details in the resulting avatars. A hierarchical representation of the portrait image is also introduced to provide rich 2D texture cues that are injected into a 3D diffusion model via cross-attention.
ECCV 2024
Bridging the Gap (Studio-Quality from Phone)
Bridging the Gap: Studio-Like Avatar Creation from a Monocular Phone Capture
In the paper, the authors tackle the problem of producing studio‑quality human avatars from a short monocular smartphone video capture. They parameterize the phone‑captured texture maps via the latent space of StyleGAN2 and then fine‑tune a StyleGAN2 model using a small studio‑captured texture corpus, followed by a diffusion‑based super‑resolution step to improve fine details in the facial texture map.
ECCV 2024
MeshAvatar
MeshAvatar: Learning High-Quality Triangular Human Avatars from Multi-View Videos
MeshAvatar introduces a method for building high-quality human avatars from multi-view video by combining an implicit SDF representation with an extracted triangular mesh and a pose-conditioned material field. The system jointly optimizes geometry and materials, uses a 2D U-Net and pseudo-normal supervision to improve fine detail, and produces avatars that integrate cleanly into standard rendering pipelines.
ECCV 2024
Tri²-plane
Tri²-Plane: Thinking Head Avatar via Feature Pyramid
The method uses a multi-scale tri-plane representation to reconstruct photorealistic head avatars from monocular video. Instead of a single tri-plane, it stacks tri-planes at multiple scales to capture fine facial detail. The authors add a geometry-aware sliding window training augmentation to improve robustness under camera/pose variation.
ECCV 2024
Pose Modulated Avatars
Pose Modulated Avatars from Video
The paper proposes a method for reconstructing human avatars from video in which pose-dependent deformation is handled explicitly by a two-branch neural network: a GNN branch that models local correlations given the skeleton pose, and a frequency-modulation branch that adjusts rendering features based on these correlations.
ICLR 2024
Qin et al.
High-Fidelity 3D Head Avatars Reconstruction Through Spatially-Varying Expression Conditioned Neural Radiance Field
The paper presents a method for 3D head-avatar reconstruction from video that introduces spatially-varying expression conditioning: for each 3D point, the radiance field is conditioned not just on a global expression vector but also on spatial positional features.
AAAI 2024
IDOL
IDOL: Instant Photorealistic 3D Human Creation from a Single Image
The paper presents a method to reconstruct high-fidelity 3D human avatars from a single RGB image. The approach combines a parametric human model with neural rendering to capture detailed geometry, texture, and appearance in one shot.
CVPR 2025
StableAnimator
StableAnimator: High-Quality Identity-Preserving Human Image Animation
StableAnimator is an end-to-end video diffusion framework designed to preserve identity while animating a reference image to match a target pose sequence. It uses a distribution-aware ID Adapter, a face-refining encoder, and a Hamilton-Jacobi-Bellman-based optimization during inference to constrain denoising and maintain facial fidelity.
CVPR 2025
TAGA
TAGA: Self-Supervised Learning for Template-Free Animatable Gaussian Articulated Model
TAGA introduces a template‑free approach to build animatable human avatars using 3D Gaussians. The method detects and corrects “Ambiguous Gaussians” in sparse posed data, refining geometry and skinning for accurate novel pose/view animation.
CVPR 2025
HERA
HERA: Hybrid Explicit Representation for Ultra-Realistic Head Avatars
In HERA, the authors introduce a hybrid explicit representation combining UV-mapped 3D meshes with 3DGS, using the mesh to capture sharp surface textures (skin, stubble) and the Gaussians to model intricate geometry (hair, eyelashes).
CVPR 2025
TGA
TGA: True-to-Geometry Avatar Dynamic Reconstruction
TGA proposes a 4D Gaussian-based avatar reconstruction framework that combines perspective-aware Gaussian transformations with mesh extraction based on a dynamic Gaussian Bounding Volume Hierarchy (BVH) tree. This better captures fine facial geometry and dynamic deformations under motion, improving geometric accuracy over previous Gaussian-splat methods.
NeurIPS 2025
SurFhead
SurFhead: Affine Rig Blending for Geometrically Accurate 2D Gaussian Surfel Head Avatars
The authors of SurFhead propose a new avatar representation using 2D Gaussian surfels (instead of 3D Gaussians), rigged via affine‑transformation blending with polar decomposition. This allows much more accurate head geometry (surface normals, depth, mesh consistency) than prior 3DGS‑based avatars, while remaining riggable and animatable from RGB video alone.
ICLR 2025
ScaffoldAvatar
ScaffoldAvatar: High-Fidelity Gaussian Avatars with Patch Expressions
The authors of ScaffoldAvatar present a hybrid pipeline that builds high-fidelity Gaussian head avatars by anchoring “patch expressions” -- localized Gaussian patches tied to a scaffold mesh -- to capture fine expression detail and enable robust animation.
SIGGRAPH 2025
TeGA
TeGA: Texture Space Gaussian Avatars for High-Resolution Dynamic Head Modeling
The authors of TeGA introduce a high-detail 3D head avatar model that embeds 3D Gaussians within a continuous UVD texture space over a morphable head mesh -- allowing densification where detail matters while preserving efficient animation.
SIGGRAPH 2025
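
The multi-scale tri-plane idea summarized in the Tri²-plane entry above can be illustrated with a small sketch. This is not the authors' implementation: it assumes axis-aligned feature planes stored as nested lists, point coordinates normalized to [0, 1], summation across the three planes, and concatenation across scales. All function names (`bilinear`, `triplane_features`, `multiscale_features`, `constant_grid`) are hypothetical.

```python
from typing import List, Sequence

Grid = List[List[List[float]]]  # feature plane indexed as [row][col][channel]


def bilinear(grid: Grid, u: float, v: float) -> List[float]:
    """Bilinearly sample a feature plane at normalized (u, v) in [0, 1]."""
    rows, cols = len(grid), len(grid[0])
    # Clamp, then map normalized coordinates to continuous cell positions.
    x = min(max(u, 0.0), 1.0) * (cols - 1)
    y = min(max(v, 0.0), 1.0) * (rows - 1)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, cols - 1), min(y0 + 1, rows - 1)
    tx, ty = x - x0, y - y0
    out = []
    for c in range(len(grid[0][0])):
        top = grid[y0][x0][c] * (1 - tx) + grid[y0][x1][c] * tx
        bottom = grid[y1][x0][c] * (1 - tx) + grid[y1][x1][c] * tx
        out.append(top * (1 - ty) + bottom * ty)
    return out


def triplane_features(planes: Sequence[Grid], point: Sequence[float]) -> List[float]:
    """Project a 3D point onto the XY, XZ and YZ planes and sum the samples."""
    x, y, z = point
    xy = bilinear(planes[0], x, y)
    xz = bilinear(planes[1], x, z)
    yz = bilinear(planes[2], y, z)
    return [a + b + c for a, b, c in zip(xy, xz, yz)]


def multiscale_features(pyramid: Sequence[Sequence[Grid]], point: Sequence[float]) -> List[float]:
    """Concatenate tri-plane features from every pyramid level (coarse to fine)."""
    feats: List[float] = []
    for planes in pyramid:
        feats.extend(triplane_features(planes, point))
    return feats


def constant_grid(res: int, value: float, channels: int = 2) -> Grid:
    """Toy plane filled with one value (stands in for learned features)."""
    return [[[value] * channels for _ in range(res)] for _ in range(res)]


# Two pyramid levels: a 2x2 coarse tri-plane and a 4x4 fine tri-plane.
pyramid = [
    [constant_grid(2, 1.0)] * 3,
    [constant_grid(4, 0.5)] * 3,
]
features = multiscale_features(pyramid, (0.25, 0.5, 0.75))
print(features)  # [3.0, 3.0, 1.5, 1.5]
```

In a real model the resulting feature vector would be decoded by a small network into density and color; that decoder is omitted here.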

Real-time generation & Compression Methods

Method Title & Repository / Description Venue
GPS-Gaussian
GPS-Gaussian: Generalizable Pixel-Wise 3D Gaussian Splatting for Real-Time Human Novel View Synthesis (MIT)
In this work, the authors propose a framework that generates 3D Gaussian representations from sparse input views by regressing pixel-wise Gaussian parameters on the 2D source image planes. It targets human performers under sparse-view conditions and renders novel views in real time.
CVPR 2024
GPS-Gaussian+ (continuation of GPS-Gaussian)
GPS-Gaussian+: Generalizable Pixel-Wise 3D Gaussian Splatting for Real-Time Human-Scene Rendering from Sparse Views
This follow‑up work extends GPS‑Gaussian by targeting human‑scene rendering.
TPAMI 2025
GauHuman
GauHuman: Articulated Gaussian Splatting from Monocular Human Videos
The authors of GauHuman present an avatar synthesis framework that uses 3DGS with linear blend skinning (LBS) to animate full-body characters quickly. They encode Gaussians in canonical space and deform them via skinning to posed space, with modules refining pose and LBS weights for detail preservation.
CVPR 2024
Gaussian Shell Maps
Gaussian Shell Maps for Efficient 3D Human Generation
The authors of Gaussian Shell Maps propose a volumetric representation that uses shell‑structured Gaussian distributions to represent the human body -- capturing geometry and appearance -- and enable fast 3D human synthesis and rendering.
CVPR 2024
Bai et al.
Efficient 3D Implicit Head Avatar with Mesh-Anchored Hash Table Blendshapes
In this work, the authors propose a real‑time 3D head avatar system that uses a novel mesh‑anchored hash table blendshapes technique: multiple tiny hash tables are attached to vertices of a parametric face mesh and their embeddings are linearly blended (via weights predicted from a CNN) to represent expression‑dependent geometry and appearance. A lightweight MLP then predicts density and color from these embeddings for volumetric rendering, accelerated by a hierarchical kNN lookup.
CVPR 2024
FlashAvatar
FlashAvatar: High-Fidelity Head Avatar with Efficient Gaussian Embedding (MIT)
The authors propose FlashAvatar, a method for reconstructing a high-fidelity animatable head avatar from a short monocular video in minutes and rendering it at ~300 FPS on a consumer GPU. They embed a uniform 3DGS field on the surface of a parametric face model and learn additional spatial offsets for non-surface regions and subtle facial details.
CVPR 2024
3DGS-Avatar
3DGS-Avatar: Animatable Avatars via Deformable 3D Gaussian Splatting (MIT)
3DGS-Avatar presents an animatable human avatar model using deformable 3D Gaussian splats. A canonical Gaussian field is combined with a pose-conditioned deformation network, improving generalization to unseen poses.
CVPR 2024
GoMAvatar
GoMAvatar: Efficient Animatable Human Modeling from Monocular Video Using Gaussians-on-Mesh
GoMAvatar introduces the Gaussians-on-Mesh hybrid representation that attaches 3D Gaussian splats to a deformable mesh to get both high-quality appearance and efficient articulation. The model is trained end-to-end from a single monocular video.
CVPR 2024
GAvatar
GAvatar: Animatable 3D Gaussian Avatars with Implicit Mesh Learning
The authors propose GAvatar, which builds animatable avatars using a 3DGS representation embedded in pose-driven primitives and further learns an SDF-based implicit mesh on top of the Gaussians to extract high-fidelity geometry and texture.
CVPR 2024
MoRF
MoRF: Mobile Realistic Fullbody Avatars from a Monocular Video
In the paper, the authors propose MoRF, which builds realistic full-body avatars from monocular video. It uses a mesh-based body proxy (SMPL-X), a neural texture, and per-frame warping fields to improve temporal consistency and appearance fidelity.
WACV 2024
POCA
POCA: Post-Training Quantization with Temporal Alignment for Codec Avatars
POCA studies quantization for avatar decoders, showing that naive quantization (8-bit and 6-bit) introduces temporal noise in animated avatars. POCA proposes a novel Post-Training Quantization scheme with temporal alignment that preserves visual fidelity while compressing the decoder by 5.3×.
ECCV 2024
ReliaAvatar
ReliaAvatar: A Robust Real-Time Avatar Animator with Integrated Motion Prediction
The authors present ReliaAvatar, a real-time avatar animator that integrates full-body motion prediction into an autoregressive animation pipeline to handle low-quality or missing input signals.
IJCAI 2024
GGHead
GGHead: Fast and Generalizable 3D Gaussian Heads
The authors of GGHead propose embedding 3DGS within a 3D-GAN framework to learn a high‑fidelity, 3D‑consistent head prior from 2D image corpora. A CNN predicts Gaussian parameters over a template‑mesh UV layout. A novel total variation loss ensures geometric coherence, enabling real‑time rendering of full‑resolution heads without 2D super‑resolution.
SIGGRAPH 2024
GEM (Gaussian Eigen Models)
Gaussian Eigen Models for Human Heads
The authors of GEM propose representing 3D head avatars using a linear eigen-basis of 3D Gaussians (position, scale, rotation, opacity), enabling a low-dimensional, network-free representation that is light, animatable, and real-time friendly.
CVPR 2025
Zhan et al.
Real-Time High-Fidelity Gaussian Human Avatars with Position-Based Interpolation of Spatially Distributed MLPs
The paper proposes a 3DGS-based avatar synthesis method in which multiple MLPs are spatially distributed across the body and each Gaussian's properties are interpolated from the outputs of nearby MLPs.
CVPR 2025
FADA
FADA: Fast Diffusion Avatar Synthesis with Mixed-Supervised Multi-CFG Distillation
The authors propose a mixed-supervised loss to address the poor performance of distilled diffusion models on open-set input images. They also propose a multi-CFG distillation with learnable tokens to exploit the correlation between audio and reference image conditions.
CVPR 2025
GPAvatar (monocular)
GPAvatar: High-Fidelity Head Avatars by Learning Efficient Gaussian Projections
The authors of GPAvatar propose a method that reconstructs high-fidelity dynamic 3D head avatars from monocular videos using Gaussian splats in a high-dimensional embedding space.
CVPR 2025
TaoAvatar
TaoAvatar: Real-Time Lifelike Full-Body Talking Avatars for Augmented Reality via 3D Gaussian Splatting
TaoAvatar presents a high-fidelity, lightweight pipeline for creating full-body talking avatars optimized for AR devices. The method binds 3D Gaussians to a clothed parametric human template and distills pose-dependent non-rigid deformations into an MLP, alongside blend shapes that compensate for fine details.
CVPR 2025
RGBAvatar
RGBAvatar: Reduced Gaussian Blendshapes for Online Modeling of Head Avatars
RGBAvatar proposes an online framework for animatable head avatar modeling using a reduced Gaussian blendshape representation. Instead of fixed 3DMM bases, a compact learned space is created for each individual, improving identity accuracy and expressiveness. A color initialization scheme and batch-parallel Gaussian rasterization enable real-time training and inference.
CVPR 2025
MobilePortrait
MobilePortrait: Real-Time One-Shot Neural Head Avatars on Mobile Devices
MobilePortrait introduces a novel method for real-time head avatar synthesis on mobile devices. Lightweight U-Net backbones are used to reduce computational requirements. To compensate for possible quality loss, the authors mix explicit and implicit keypoints for motion modeling and utilize precomputed visual features for foreground and background synthesis.
CVPR 2025
LHM
LHM: Large Animatable Human Reconstruction Model for Single Image to 3D in Seconds
LHM proposes a feedforward model that reconstructs detailed, animatable 3D humans from a single image in seconds -- representing geometry and appearance with 3D Gaussian splats. It uses a multimodal transformer to fuse image features, body positional priors, and a head feature pyramid encoding to preserve facial identity and fine detail.
ICCV 2025
GraphAvatar
GraphAvatar: Compact Head Avatars with GNN-Generated 3D Gaussians
GraphAvatar proposes to replace explicit storage of 3D Gaussians for head avatars with a compact GNN that generates Gaussian attributes from a tracked mesh.
AAAI 2025
SqueezeMe
SqueezeMe: Mobile-Ready Distillation of Gaussian Full-Body Avatars
SqueezeMe shows how to distill high-fidelity 3D Gaussian full-body avatars into a lightweight representation suitable for mobile devices by compressing Gaussian decoding and reducing compute/memory overhead while preserving animation and rendering quality.
SIGGRAPH 2025
LAM
LAM: Large Avatar Model for One-Shot Animatable Gaussian Head
The authors of LAM propose a method that builds a fully animatable 3D Gaussian head avatar from a single input image in a single forward pass. No video, multi-view rig, or post-processing is needed.
SIGGRAPH 2025
HGC-Avatar
HGC-Avatar: Hierarchical Gaussian Compression for Streamable Dynamic 3D Avatars
The paper proposes a hierarchical compression scheme for dynamic Gaussian‑based avatars, aimed at efficient streaming and rendering. It splits the representation into a structural layer (pose‑to‑Gaussian generator) and a motion layer (via SMPL‑X), enabling compact transmission, progressive decoding, and controllable rendering under new poses.
ACMMM 2025
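
The canonical-to-posed deformation mentioned in the GauHuman entry above relies on linear blend skinning. A minimal sketch of that operation on a single Gaussian center follows. It is a generic illustration, not code from any listed method: the helper names (`rotation_z`, `apply_transform`, `skin_point`), the two-bone setup, and the restriction to rotations about the z axis are all assumptions made for brevity.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]  # 3x4 rigid transform laid out as [R | t]


def rotation_z(angle_rad: float, translation: Sequence[float] = (0.0, 0.0, 0.0)) -> Matrix:
    """Hypothetical bone transform: rotation about z followed by a translation."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    tx, ty, tz = translation
    return [[c, -s, 0.0, tx], [s, c, 0.0, ty], [0.0, 0.0, 1.0, tz]]


def apply_transform(transform: Matrix, point: Sequence[float]) -> List[float]:
    """Apply a 3x4 rigid transform to a 3D point."""
    x, y, z = point
    return [row[0] * x + row[1] * y + row[2] * z + row[3] for row in transform]


def skin_point(point: Sequence[float], weights: Sequence[float], bones: Sequence[Matrix]) -> List[float]:
    """Linear blend skinning: weighted average of the per-bone transformed points."""
    total = sum(weights)  # normalize so the weights sum to one
    posed = [0.0, 0.0, 0.0]
    for w, bone in zip(weights, bones):
        moved = apply_transform(bone, point)
        for i in range(3):
            posed[i] += (w / total) * moved[i]
    return posed


# Canonical Gaussian center and two bones: one at rest, one bent by 90 degrees.
center = (1.0, 0.0, 0.0)
bones = [rotation_z(0.0), rotation_z(math.pi / 2)]

rest = skin_point(center, [1.0, 0.0], bones)     # follows the resting bone only
bent = skin_point(center, [0.0, 1.0], bones)     # follows the rotated bone only
blended = skin_point(center, [0.5, 0.5], bones)  # halfway between the two
print(rest, bent, blended)
```

With equal weights the center lands roughly at (0.5, 0.5, 0), midway between the two bone-driven positions. This sketch moves only the Gaussian center; a full avatar pipeline would also have to update each Gaussian's orientation, which is omitted here.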

Temporal consistency Methods

Method Title & Repository / Description Venue
Lodge
Lodge: A Coarse to Fine Diffusion Network for Long Dance Generation Guided by the Characteristic Dance Primitives
The method generates long dance motion sequences using a coarse-to-fine diffusion network guided by extracted dance primitives, capturing both global structure and fine motion details over time.
CVPR 2024
Make-Your-Anchor
Make-Your-Anchor: A Diffusion-Based 2D Avatar Generation Framework
The authors address the problem of full-body avatar synthesis where movements are “anchored” to the ones from the video. Specifically, they propose a novel system, Make-Your-Anchor, that only needs a one-minute video for training to enable precise translation of torso and hands. A structure-guided diffusion model is fine-tuned to take 3D mesh conditions as a separate modality.
CVPR 2024
Loopy
Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency
In this work, the authors propose an end‑to‑end video diffusion model conditioned only on audio, designed to generate realistic portrait videos with natural long‑term motion. The model uses inter‑/intra‑clip temporal modules and an audio‑to‑latents mapping so it can leverage long‑range temporal dependencies and produce smooth, expressive motion from audio alone.
ICLR 2025
Hallo
Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation
Hallo proposes a diffusion-based framework for portrait image animation driven by audio. It provides a hierarchical audio-driven synthesis module that jointly generates lip motion, facial expressions, and head pose.
arXiv 2024
Hallo2 (continuation of Hallo)
Hallo2: Long-Duration and High-Resolution Audio-Driven Portrait Image Animation
The follow-up paper presents Hallo2, a method that generates long (tens of minutes) and high-resolution (up to 4K) talking-head videos from a single reference image and input audio, maintaining temporal coherence and avoiding drift over time.
ICLR 2025
DAWN
DAWN: Dynamic Frame Avatar with Non-Autoregressive Diffusion Framework for Talking Head Video Generation
DAWN presents a non-autoregressive diffusion‑based framework that generates full talking‑head videos (lip sync + head pose + blinks) from a single portrait and an audio clip.
ICLR 2025
MimicMotion
MimicMotion: High-Quality Human Motion Video Generation with Confidence-Aware Pose Guidance
MimicMotion introduces a video generation framework that can produce long, high‑quality human motion videos guided by a pose sequence. The method relies on confidence‑aware pose guidance to weigh pose keypoints by reliability, regional loss amplification to preserve detail in important regions (e.g., hands), and a progressive latent fusion strategy to enable temporally coherent videos of arbitrary length.
ICML 2025
MaintaAvatar
MaintaAvatar: A Maintainable Avatar Based on Neural Radiance Fields by Continual Learning
MaintaAvatar tackles the problem of updating a 3D avatar over time as a person’s appearance or pose changes, without losing the ability to render previous appearances. The method augments a NeRF-based avatar with a Global-Local Joint Storage Module and a Pose‑Distillation Module.
AAAI 2025
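
Long-sequence generation, as in the MimicMotion entry above, requires merging separately generated segments into one temporally coherent output. One common way to do this is to generate overlapping windows and cross-fade them. The sketch below shows that generic idea with triangular weights; it is an assumption standing in for the fusion rule of any listed method, scalars stand in for per-frame latents, and the names `split_windows` and `fuse_segments` are hypothetical.

```python
from typing import List, Sequence, Tuple


def split_windows(num_frames: int, window: int, overlap: int) -> List[Tuple[int, int]]:
    """Return (start, end) spans that cover num_frames with a fixed overlap."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    stride = window - overlap
    spans = []
    start = 0
    while True:
        end = min(start + window, num_frames)
        spans.append((start, end))
        if end == num_frames:
            break
        start += stride
    return spans


def fuse_segments(
    segments: Sequence[Sequence[float]],
    spans: Sequence[Tuple[int, int]],
    num_frames: int,
) -> List[float]:
    """Cross-fade overlapping segments into a single per-frame sequence."""
    total = [0.0] * num_frames
    weight = [0.0] * num_frames
    for values, (start, end) in zip(segments, spans):
        length = end - start
        for i in range(length):
            # Triangular weight: low at the window borders, high in the middle,
            # so each segment fades in and out across the overlap.
            w = float(min(i + 1, length - i))
            total[start + i] += w * values[i]
            weight[start + i] += w
    return [t / w for t, w in zip(total, weight)]


# Ten frames, windows of four frames, two frames shared by neighbors.
spans = split_windows(10, window=4, overlap=2)
# Each toy segment holds a constant value equal to its index.
segments = [[float(k)] * (end - start) for k, (start, end) in enumerate(spans)]
fused = fuse_segments(segments, spans, 10)
print(spans)  # [(0, 4), (2, 6), (4, 8), (6, 10)]
print(fused)
```

In this toy run the fused values rise gradually from the first segment's value to the last one's instead of jumping at window boundaries, which is the property that reduces visible seams between segments.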

Citation & license

If you find these resources useful, please cite the review:

@article{makarov2026avatars,
  title   = {GenAI for Digital Avatar Synthesis: A Comprehensive Review},
  author  = {Makarov, Georgy and Ryumin, Dmitry},
  journal = {Neurocomputing, Peer Review},
  year    = {2026}
}

This repository is released under the MIT License. Figures are reproduced from the accompanying review paper by its authors. Linked code repositories remain under their own licenses (shown in parentheses next to a title where available).

About

GenAI for Digital Avatar Synthesis: A Comprehensive Review (2026) - supplementary for a task-oriented review of human-centric avatar generation methods (2024–2025, CORE A/A* venues). Catalogs 203 methods and 108 datasets across 9 tasks: generalization, expressiveness, stylization, editing, physics, hair/clothes, fidelity, real-time, consistency.