GenAI for Digital Avatar Synthesis — Review Resources

A curated, browsable companion to the survey “GenAI for Digital Avatar Synthesis: A Comprehensive Review.”

This repository is the supplementary material for our task-oriented review of human-centric Generative AI for digital avatar synthesis, covering work published in 2024–2025 at leading (CORE A / A*) AI conferences. It collects, links, and organizes every generation method reviewed in the paper so the literature is easy to browse, cite, and extend.

Diffusion models increasingly act as strong priors for synthesis and editing, while Gaussian-splatting representations dominate real-time reconstruction and rendering — pointing toward hybrid pipelines that jointly optimize controllability and deployability.

An end-to-end avatar-synthesis pipeline: inputs and modalities → preprocessing → neural network (GAN / diffusion / NeRF / 3DGS) and auxiliary modules → postprocessing → deployable outputs.
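
The stage order in the pipeline above can be sketched as a chain of functions. This is only an illustrative skeleton: the `AvatarPipeline` class, the stage functions, and the dictionary keys are hypothetical names introduced here, and the stage bodies are placeholders rather than any reviewed method.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

# A stage maps a dictionary of named artifacts to an updated dictionary.
Stage = Callable[[Dict[str, Any]], Dict[str, Any]]


@dataclass
class AvatarPipeline:
    """Hypothetical container that runs stages in the order they are given."""
    stages: List[Stage] = field(default_factory=list)

    def run(self, inputs: Dict[str, Any]) -> Dict[str, Any]:
        state = dict(inputs)  # copy so the caller's dictionary is not mutated
        for stage in self.stages:
            state = stage(state)
        return state


def preprocess(state: Dict[str, Any]) -> Dict[str, Any]:
    # Placeholder for steps such as cropping or keypoint detection.
    return {**state, "preprocessed": True}


def generate(state: Dict[str, Any]) -> Dict[str, Any]:
    # Placeholder for the neural network (GAN / diffusion / NeRF / 3DGS).
    backbone = state.get("backbone", "diffusion")
    return {**state, "raw_output": f"avatar from {backbone}"}


def postprocess(state: Dict[str, Any]) -> Dict[str, Any]:
    # Placeholder for steps such as super-resolution or export.
    return {**state, "output": state["raw_output"] + " (postprocessed)"}


pipeline = AvatarPipeline([preprocess, generate, postprocess])
result = pipeline.run({"image": "portrait.png", "backbone": "3DGS"})
print(result["output"])
```

Keeping every stage behind the same signature makes it easy to swap one backbone for another without touching the surrounding steps.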

Scope & taxonomy

The review is task-oriented and covers human-centric Generative AI for digital avatar synthesis, concentrating on work published in 2024–2025 at leading (CORE A / A*) AI conferences. Recent progress spans diverse output representations (images, video, 3D/4D assets) and conditioning signals (pose, speech, language instructions, affective attributes), broadening avatar applications to telepresence, virtual production, immersive AR / VR, and customer-facing interaction. The literature, however, is expanding rapidly and is fragmented across problem settings, architectures, and deployment constraints.

To consolidate it, the review introduces a unified taxonomy of nine task families, aligns each task with its representative methods and the corpora they use, and connects them through an end-to-end pipeline that links inputs and preprocessing to model components and deployable outputs (see the graphical abstract above). Each method and corpus is placed under its primary task, mirroring the paper:

Generalization — reconstruct or animate new identities from few, single, or unconstrained in-the-wild observations, ideally without per-subject optimization.
Expressiveness — speech-, emotion-, and motion-driven faces and bodies with nuanced, fine-grained expressions and co-speech gestures.
Text guidance & Stylization — language-conditioned avatar generation, stylization, and editing from natural-language prompts.
Attribute editing — controllable, disentangled editing of appearance and shape, down to individual attributes such as hair, clothing, or expression.
Physics improvements & World interaction — relighting, contact, cloth and body dynamics, and human–object or human–scene interaction.
Hair and clothes improvements — strand-level hair, layered garments, and disentangled, simulation-ready assets.
High fidelity and realism — photorealistic geometry, texture, and appearance, often at high resolution.
Real-time generation & Compression — efficient, lightweight, on-device and streamable avatars.
Temporal consistency — long-horizon, drift-free, identity-stable video and motion.
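
One way to represent this taxonomy in code is sketched below, assuming each entry has exactly one primary task plus an optional list of additional tasks (the "Add. tasks" field used in the tables of this document). The names `Task`, `Entry`, and `group_by_primary` are hypothetical and not part of the repository.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class Task(Enum):
    """The nine task families, labelled as in the list above."""
    GENERALIZATION = "Generalization"
    EXPRESSIVENESS = "Expressiveness"
    TEXT_GUIDANCE = "Text guidance & Stylization"
    ATTRIBUTE_EDITING = "Attribute editing"
    PHYSICS = "Physics improvements & World interaction"
    HAIR_CLOTHES = "Hair and clothes improvements"
    HIGH_FIDELITY = "High fidelity and realism"
    REAL_TIME = "Real-time generation & Compression"
    TEMPORAL = "Temporal consistency"


@dataclass
class Entry:
    name: str
    primary: Task
    additional: List[Task] = field(default_factory=list)


def group_by_primary(entries: List[Entry]) -> Dict[Task, List[str]]:
    """Place each entry under its primary task only."""
    groups: Dict[Task, List[str]] = {task: [] for task in Task}
    for entry in entries:
        groups[entry.primary].append(entry.name)
    return groups


# Two corpora taken from the tables in this document.
entries = [
    Entry("VoxCeleb", Task.GENERALIZATION, [Task.EXPRESSIVENESS]),
    Entry("BEAT", Task.TEXT_GUIDANCE,
          [Task.GENERALIZATION, Task.EXPRESSIVENESS, Task.PHYSICS]),
]
groups = group_by_primary(entries)
print(groups[Task.GENERALIZATION])
```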

Corpora

108 corpora used across the reviewed methods, grouped by the task they primarily serve (mirroring the paper). Real-time generation & Compression and Temporal consistency are architecture-level tasks and have no dedicated corpora. Each entry lists the corpus type, modalities, size, and any additional tasks it serves ("Add. tasks").

The 108 reviewed corpora, grouped by primary task and publication year.
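
For readers who want to work with the tables programmatically, the following is a minimal sketch of a row parser. It assumes rows are tab-separated as in this plain-text rendering and that the information column follows the pattern "title, type, Modalities, Size, Add. tasks" with the last three fields optional; `parse_corpus_row` and `ROW_INFO` are hypothetical helpers, not part of the repository.

```python
import re
from typing import Any, Dict, List, Optional

# Corpus types as they appear in the tables of this document.
ROW_INFO = re.compile(
    r"^(?P<title>.+?) (?P<type>Image|Video|3D/4D|Else) corpus"
    r"(?: Modalities: (?P<modalities>.+?))?"
    r"(?: Size: (?P<size>.+?))?"
    r"(?: Add\. tasks: (?P<tasks>.+))?$"
)


def _split(value: Optional[str]) -> List[str]:
    """Split a comma-separated field; missing fields become empty lists."""
    return value.split(", ") if value else []


def parse_corpus_row(line: str) -> Dict[str, Any]:
    name, info, venue = line.rstrip("\n").split("\t")
    match = ROW_INFO.match(info)
    if match is None:
        raise ValueError(f"unrecognized row layout: {info!r}")
    venue_name, year = venue.rsplit(" ", 1)
    return {
        "corpus": name,
        "title": match["title"],
        "type": match["type"],
        "modalities": _split(match["modalities"]),
        "size": match["size"],  # kept as free text, e.g. "3 clips, 7500+ frames"
        "additional_tasks": _split(match["tasks"]),
        "venue": venue_name,
        "year": int(year),
    }


example = (
    "VoxCeleb\tVoxCeleb: A Large-Scale Speaker Identification Dataset "
    "Video corpus Modalities: audio Size: 153k+ clips "
    "Add. tasks: Expressiveness\tINTERSPEECH 2017"
)
row = parse_corpus_row(example)
print(row["type"], row["size"], row["additional_tasks"])
```

The size field is left as a string because the tables mix units (clips, hours, subjects, frames), so any numeric comparison would need a normalization step first.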

Generalization: Corpora

Corpus	Title & Repository / Information	Venue
FRGC	Overview of the Face Recognition Grand Challenge Image corpus Size: 5k+ images Add. tasks: High fidelity and realism	CVPR 2005
FRGCv2	Overview of the Face Recognition Grand Challenge Image corpus Size: 50k images Add. tasks: High fidelity and realism	CVPR 2005
VoxCeleb	VoxCeleb: A Large-Scale Speaker Identification Dataset Video corpus Modalities: audio Size: 153k+ clips Add. tasks: Expressiveness	INTERSPEECH 2017
AVSpeech	Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation Video corpus Modalities: audio Size: 4,700 hours Add. tasks: Expressiveness	arXiv 2018
VoxCeleb2	VoxCeleb2: Deep Speaker Recognition Video corpus Modalities: audio Size: 1.09M clips Add. tasks: Expressiveness	INTERSPEECH 2018
HUMBI	HUMBI: A Large Multiview Dataset of Human Body Expressions 3D/4D corpus Modalities: motions Size: 772 subjects Add. tasks: High fidelity and realism	CVPR 2020
LYHM	Statistical Modeling of Craniofacial Shape and Texture 3D/4D corpus Size: 1,216 subjects Add. tasks: High fidelity and realism	IJCV 2020
TalkingHead-1KH	One-Shot Free-View Neural Talking-Head Synthesis for Video Conferencing Video corpus Modalities: audio Size: 500k clips Add. tasks: High fidelity and realism	CVPR 2021
THUman2.0	Function4D: Real-Time Human Volumetric Capture from Very Sparse Consumer RGBD Sensors 3D/4D corpus Size: 500 scans Add. tasks: Physics improvements & World interaction, High fidelity and realism	CVPR 2021
THUman2.1	Function4D: Real-Time Human Volumetric Capture from Very Sparse Consumer RGBD Sensors 3D/4D corpus Size: 2,500 scans Add. tasks: Physics improvements & World interaction, High fidelity and realism	CVPR 2021
WebFace42M	WebFace260M: A Benchmark Unveiling the Power of Million-Scale Deep Face Recognition Image corpus Size: 42M images Add. tasks: High fidelity and realism	CVPR 2021
WebFace260M	WebFace260M: A Benchmark Unveiling the Power of Million-Scale Deep Face Recognition Image corpus Size: 260M images Add. tasks: High fidelity and realism	CVPR 2021
THUman3.0	DeepCloth: Neural Garment Representation for Shape and Style Editing 3D/4D corpus Add. tasks: High fidelity and realism	TPAMI 2022
THUman4.0	Structured Local Radiance Fields for Human Avatar Modeling Video corpus Modalities: motions Size: 3 clips, 7500+ frames Add. tasks: Physics improvements & World interaction, High fidelity and realism	CVPR 2022
CustomHumans	Learning Locally Editable Virtual Humans 3D/4D corpus Size: 643 scans Add. tasks: High fidelity and realism	CVPR 2023
FaceScape	FaceScape: 3D Facial Dataset and Benchmark for Single-View 3D Face Reconstruction 3D/4D corpus Modalities: emotions Size: 938 subjects, 18,760 scans Add. tasks: Expressiveness, Attribute editing, High fidelity and realism	TPAMI 2023
NeRSemble	NeRSemble: Multi-View Radiance Field Reconstruction of Human Heads Video corpus Modalities: audio, emotions, motions Size: 222 subjects, 31.7M frames Add. tasks: Expressiveness, Attribute editing, High fidelity and realism	TOG 2023
RenderMe-360	RenderMe-360: A Large Digital Asset Library and Benchmarks Towards High-Fidelity Head Avatars 3D/4D corpus Modalities: audio, emotions, text, motions, hair/clothes Size: 500 subjects, 243M frames Add. tasks: Expressiveness, Text guidance & Stylization, Attribute editing, Hair and clothes improvements, High fidelity and realism	NeurIPS 2023
OpenHumanVid	OpenHumanVid: A Large-Scale High-Quality Dataset for Enhancing Human-Centric Video Generation Video corpus Modalities: audio, text, motions Size: 13.2M clips, 16.7k hours Add. tasks: Text guidance & Stylization, Physics improvements & World interaction, High fidelity and realism	CVPR 2025
WildAvatar	WildAvatar: Learning In-the-Wild 3D Avatars from the Web Video corpus Size: 10k+ subjects Add. tasks: High fidelity and realism	CVPR 2025

Expressiveness: Corpora

Corpus	Title & Repository / Information	Venue
BU-3DFE	A 3D Facial Expression Database for Facial Behavior Research 3D/4D corpus Modalities: emotions Size: 6 emotions Add. tasks: High fidelity and realism	FGR 2006
BP4D	BP4D-Spontaneous: A High-Resolution Spontaneous 3D Dynamic Facial Expression Database 3D/4D corpus Modalities: emotions, motions Size: 41 subjects Add. tasks: Attribute editing, High fidelity and realism	Image and Vision Computing 2014
CREMA-D	CREMA-D: Crowd-Sourced Emotional Multimodal Actors Dataset Video corpus Modalities: audio, emotions Size: 7,442 clips	TAC 2014
BP4D+	Multimodal Spontaneous Emotion Corpus for Human Behavior Analysis 3D/4D corpus Modalities: emotions, motions Size: 140 subjects Add. tasks: Attribute editing, High fidelity and realism	CVPR 2016
AffectNet	AffectNet: A Database for Facial Expression, Valence, and Arousal Computing in the Wild Image corpus Modalities: emotions Size: ~1M images Add. tasks: Generalization	TAC 2017
CMU-MOSEI	Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph Video corpus Modalities: audio, emotions, text Size: 23k+ clips Add. tasks: Generalization	ACL 2018
CoMA	Generating 3D Faces Using Convolutional Mesh Autoencoders 3D/4D corpus Modalities: emotions, motions Size: 12 subjects Add. tasks: Attribute editing, High fidelity and realism	ECCV 2018
RAVDESS	The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A Dynamic, Multimodal Set of Facial and Vocal Expressions in North American English Video corpus Modalities: audio, emotions Size: 1,440 clips	PLoS ONE 2018
MEAD	MEAD: A Large-Scale Audio-Visual Dataset for Emotional Talking-Face Generation Video corpus Modalities: audio, emotions Size: 60 subjects	ECCV 2020
HDTF	Flow-Guided One-Shot Talking Face Generation with a High-Resolution Audio-Visual Dataset Video corpus Modalities: audio Size: 300+ subjects Add. tasks: Generalization, High fidelity and realism	CVPR 2021
Speech2-AffectiveGestures	Speech2-AffectiveGestures: Synthesizing Co-Speech Gestures with Generative Adversarial Affective Expression Learning Video corpus Modalities: audio, text, motions Size: 1,766 clips, 106.1 hours Add. tasks: Text guidance & Stylization, Physics improvements & World interaction	ACMMM 2021
3D-ETF	EmoTalk: Speech-Driven Emotional Disentanglement for 3D Face Animation 3D/4D corpus Modalities: audio, emotions, motions Add. tasks: Generalization, Attribute editing	ICCV 2023
FaMoS	Instant Multi-View Head Capture Through Learnable Registration 3D/4D corpus Modalities: emotions, motions Size: 95 subjects, 600k frames Add. tasks: Attribute editing, High fidelity and realism	CVPR 2023
MEAD-3D	Speech4Mesh: Speech-Assisted Monocular 3D Facial Reconstruction for Speech-Driven 3D Facial Animation 3D/4D corpus Modalities: emotions, motions Add. tasks: Attribute editing	ICCV 2023
TalkSHOW	Generating Holistic 3D Human Motion from Speech 3D/4D corpus Modalities: audio, emotions, motions Size: 26.9 hours, 4 subjects Add. tasks: Attribute editing, Physics improvements & World interaction	CVPR 2023
EmoTalk3D	EmoTalk3D: High-Fidelity Free-View Synthesis of Emotional 3D Talking Head 3D/4D corpus Modalities: audio, emotions, motions Size: 35 subjects Add. tasks: Attribute editing, High fidelity and realism	ECCV 2024
FEED	EMOPortraits: Emotion-Enhanced Multimodal One-Shot Head Avatars Video corpus Modalities: audio, emotions Add. tasks: High fidelity and realism	CVPR 2024
3D-BEF	EmoDiffusion: Enhancing Emotional 3D Facial Animation with Latent Diffusion Models 3D/4D corpus Modalities: audio, emotions, motions Size: 2k+ sequences, 9 emotions Add. tasks: Attribute editing, High fidelity and realism	arXiv 2025
AffectNet+	AffectNet+: A Database for Enhancing Facial Expression Recognition with Soft-Labels Image corpus Modalities: emotions Size: ~1M images Add. tasks: Generalization	TAC 2025
MENTOR	VLOGGER: Multimodal Diffusion for Embodied Avatar Synthesis Video corpus Modalities: audio, emotions, motions Size: 800k subjects Add. tasks: Generalization	CVPR 2025
TalkBody4D	TaoAvatar: Real-Time Lifelike Full-Body Talking Avatars for Augmented Reality via 3D Gaussian Splatting Video corpus Modalities: audio, motions Size: 8 sequences, 59 cameras Add. tasks: Physics improvements & World interaction, High fidelity and realism	CVPR 2025
VOCASET	EmoVOCA: Speech-Driven Emotional 3D Talking Heads 3D/4D corpus Modalities: audio, motions Size: 12 subjects Add. tasks: Attribute editing	WACV 2025

Text guidance & Stylization: Corpora

Corpus	Title & Repository / Information	Venue
BEAT	BEAT: A Large-Scale Semantic and Emotional Multi-Modal Dataset for Conversational Gestures Synthesis 3D/4D corpus Modalities: audio, emotions, text, motions Size: 76 hours, 30 subjects Add. tasks: Generalization, Expressiveness, Physics improvements & World interaction	ECCV 2022
CelebV-Text	CelebV-Text: A Large-Scale Facial Text-Video Dataset Video corpus Modalities: emotions, text Size: 70k clips Add. tasks: Generalization, Expressiveness, High fidelity and realism	CVPR 2023
Human-Art	Human-Art: A Versatile Human-Centric Dataset Bridging Natural and Artificial Scenes Image corpus Modalities: text Size: 50k images Add. tasks: Generalization	CVPR 2023
BEAT2	EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling 3D/4D corpus Modalities: audio, emotions, text, motions Size: 60 hours Add. tasks: Generalization, Expressiveness, Physics improvements & World interaction	CVPR 2024
CosmicManHQ-1.0	CosmicMan: A Text-to-Image Foundation Model for Humans Image corpus Modalities: text, hair/clothes Size: 5.46M images Add. tasks: Generalization, Hair and clothes improvements	CVPR 2024
SFHQ-T2I	Synthetic Faces High Quality - Text 2 Image (SFHQ-T2I) Dataset Image corpus Modalities: text Size: 122,726 images Add. tasks: Generalization, High fidelity and realism	Dataset 2024
SignAvatars	SignAvatars: A Large-Scale 3D Sign Language Holistic Motion Dataset and Benchmark 3D/4D corpus Modalities: text, motions Size: 70k clips, 153 subjects Add. tasks: Generalization, Physics improvements & World interaction	ECCV 2024

Attribute editing: Corpora

Corpus	Title & Repository / Information	Venue
BiwiKinect	Random Forests for Real Time 3D Face Analysis Video corpus Modalities: motions Size: 15k images, 20 subjects Add. tasks: Physics improvements & World interaction, High fidelity and realism	IJCV 2013
FaceWarehouse	FaceWarehouse: A 3D Facial Expression Database for Visual Computing 3D/4D corpus Modalities: emotions, motions Size: 150 subjects Add. tasks: Generalization, Expressiveness	TVCG 2013
Stirling	Stirling ESRC 3D Face Database 3D/4D corpus Modalities: emotions, motions Size: 99 subjects Add. tasks: Expressiveness	Dataset 2013
NeRFace	Dynamic Neural Radiance Fields for Monocular 4D Facial Avatar Reconstruction Video corpus Add. tasks: High fidelity and realism	CVPR 2021
CelebV-HQ	CelebV-HQ: A Large-Scale Video Facial Attributes Dataset Video corpus Modalities: emotions Size: 35k+ clips Add. tasks: Generalization, Expressiveness, High fidelity and realism	ECCV 2022
NeRFBlendShape	Reconstructing Personalized Semantic Facial NeRF Models from Monocular Video Video corpus Modalities: motions Size: 8 subjects Add. tasks: High fidelity and realism	TOG 2022
AvatarReX	AvatarReX: Real-Time Expressive Full-Body Avatars Video corpus Modalities: motions Size: 4 sequences, 16 cameras Add. tasks: Physics improvements & World interaction, High fidelity and realism	TOG 2023
LPFF	LPFF: A Portrait Dataset for Face Generators Across Large Poses Image corpus Size: 19,590 images Add. tasks: High fidelity and realism	ICCV 2023
PointAvatar	PointAvatar: Deformable Point-Based Head Avatars from Videos Video corpus Modalities: motions Size: 3 subjects Add. tasks: High fidelity and realism	CVPR 2023

Physics improvements & World interaction: Corpora

Corpus	Title & Repository / Information	Venue
Decaf	Decaf: MEG-Based Multimodal Database for Decoding Affective Physiological Responses Video corpus Modalities: motions Size: 8 subjects Add. tasks: Attribute editing, High fidelity and realism	TAC 2015
KIT-ML	The KIT Motion-Language Dataset Else corpus Modalities: text, motions Size: 3,911 clips, 6,278 texts Add. tasks: Generalization, Text guidance & Stylization	Big Data 2016
MonoPerfCap	MonoPerfCap: Human Performance Capture from Monocular Video Video corpus Modalities: motions Size: 120 clips Add. tasks: High fidelity and realism	TOG 2018
PeopleSnapshot	Video Based Reconstruction of 3D People Models Video corpus Modalities: motions Size: 11 subjects	CVPR 2018
AMASS	AMASS: Archive of Motion Capture as Surface Shapes Else corpus Modalities: motions Size: 11,265 motions	ICCV 2019
PROX	Resolving 3D Human Pose Ambiguities with 3D Scene Constraints Video corpus Modalities: motions Size: 12 scenes	ICCV 2019
Speech2Gesture	Learning Individual Styles of Conversational Gesture Video corpus Modalities: audio, motions Size: 2,710 clips, 144 hours Add. tasks: Generalization, Expressiveness	CVPR 2019
Talking-WithHands16.2M	Talking-WithHands16.2M: A Large-Scale Dataset of Synchronized Body-Finger Motion and Audio for Conversational Motion Analysis and Synthesis 3D/4D corpus Modalities: audio, motions Size: 16.2M frames Add. tasks: Generalization, Expressiveness	ICCV 2019
Talking-WithHands32M	Talking With Hands 16.2M: A Large-Scale Dataset of Synchronized Body-Finger Motion and Audio for Conversational Motion Analysis and Synthesis 3D/4D corpus Modalities: audio, motions Size: 32M frames Add. tasks: Generalization, Expressiveness	ICCV 2019
InterHand2.6M	InterHand2.6M: A Dataset and Baseline for 3D Interacting Hand Pose Estimation from a Single RGB Image Image corpus Modalities: motions Size: 26 subjects Add. tasks: Attribute editing	ECCV 2020
DeepCap	DeepCap: Monocular Human Performance Capture Using Weak Supervision Video corpus Modalities: motions Size: 17 sequences Add. tasks: High fidelity and realism	CVPR 2020
BABEL	BABEL: Bodies, Action and Behavior with English Labels Else corpus Modalities: text, motions Size: 43 hours, 250+ actions Add. tasks: Generalization, Text guidance & Stylization	CVPR 2021
DynaCap	Real-Time Deep Dynamic Characters Video corpus Modalities: motions Size: 5 sequences	TOG 2021
ZJU-MoCap	Neural Body: Implicit Neural Representations with Structured Latent Codes for Novel View Synthesis of Dynamic Humans Video corpus Modalities: motions Size: 9 sequences	CVPR 2021
BEHAVE	BEHAVE: Dataset and Method for Tracking Human Object Interactions 3D/4D corpus Modalities: motions Size: 321 sequences Add. tasks: High fidelity and realism	CVPR 2022
DART	DART: Articulated Hand Model with Diverse Accessories and Rich Textures Image corpus Modalities: motions Size: 800k images Add. tasks: High fidelity and realism	NeurIPS 2022
EgoBody	EgoBody: Human Body Shape and Motion of Interacting People from Head-Mounted Devices Video corpus Modalities: motions Size: 125 sequences	ECCV 2022
HuMMan	HuMMan: Multi-Modal 4D Human Dataset for Versatile Sensing and Modeling 3D/4D corpus Modalities: motions Size: 1k subjects, 60M frames Add. tasks: Generalization, High fidelity and realism	ECCV 2022
HumanML3D	Generating Diverse and Natural 3D Human Motions from Text Else corpus Modalities: text, motions Size: 14.6k clips, 45.0k texts Add. tasks: Generalization, Text guidance & Stylization	CVPR 2022
MANO	Embodied Hands: Modeling and Capturing Hands and Bodies Together 3D/4D corpus Modalities: motions Size: 1k+ scans Add. tasks: Attribute editing	arXiv 2022
NeuMan	NeuMan: Neural Human Radiance Field from a Single Video Video corpus Size: 6 clips Add. tasks: High fidelity and realism	ECCV 2022
CHAIRS	Full-Body Articulated Human-Object Interaction 3D/4D corpus Modalities: motions Size: 17.3 hours, 46 subjects Add. tasks: Attribute editing, High fidelity and realism	ICCV 2023
CIRCLE	CIRCLE: Capture in Rich Contextual Environments 3D/4D corpus Modalities: motions Size: 10 hours	CVPR 2023
Re:InterHand	A Dataset of Relighted 3D Interacting Hands 3D/4D corpus Modalities: motions Size: 106,766 scans Add. tasks: Attribute editing, High fidelity and realism	NeurIPS 2023
X-Avatar	X-Avatar: Expressive Human Avatars 3D/4D corpus Modalities: motions Size: 233 sequences Add. tasks: Attribute editing, High fidelity and realism	CVPR 2023
Ava-256	Codec Avatar Studio: Paired Human Captures for Complete, Driveable, and Generalizable Avatars 3D/4D corpus Modalities: motions Size: 256 subjects Add. tasks: Attribute editing, High fidelity and realism	NeurIPS 2024
FS-DART	Have-Fun: Human Avatar Reconstruction from Few-Shot Unconstrained Images 3D/4D corpus Modalities: motions Size: 100 subjects Add. tasks: High fidelity and realism	CVPR 2024
LINGO	Autonomous Character-Scene Interaction Synthesis from Text Instruction Else corpus Modalities: text, motions Size: 16 hours Add. tasks: Text guidance & Stylization	SIGGRAPH 2024
TRUMANS	Scaling up Dynamic Human-Scene Interaction Modeling 3D/4D corpus Modalities: motions Size: 15 hours Add. tasks: High fidelity and realism	CVPR 2024

Hair and clothes improvements: Corpora

Corpus	Title & Repository / Information	Venue
DeepFashion	DeepFashion: Powering Robust Clothes Recognition and Retrieval with Rich Annotations Image corpus Modalities: hair/clothes Size: 801k items Add. tasks: High fidelity and realism	CVPR 2016
DeepFashion2	DeepFashion2: A Versatile Benchmark for Detection, Pose Estimation, Segmentation and Re-Identification of Clothing Images Image corpus Modalities: hair/clothes Size: 801k items Add. tasks: High fidelity and realism	CVPR 2019
CAPE	Learning to Dress 3D People in Generative Clothing 3D/4D corpus Modalities: hair/clothes Size: 150k scans, 15 subjects Add. tasks: Physics improvements & World interaction	CVPR 2020
SIZER	SIZER: A Dataset and Model for Parsing 3D Clothing and Learning Size Sensitive 3D Clothing 3D/4D corpus Modalities: hair/clothes Size: 2k scans	ECCV 2020
TikTok	Learning High Fidelity Depths of Dressed Humans by Watching Social Media Dance Videos 3D/4D corpus Size: 300+ sequences, 100k+ frames Add. tasks: Generalization, High fidelity and realism	CVPR 2021
3DHumans	SHARP: Shape-Aware Reconstruction of People in Loose Clothing 3D/4D corpus Modalities: motions, hair/clothes Size: ~180 scans Add. tasks: Attribute editing, High fidelity and realism	IJCV 2023
DNA-Rendering	DNA-Rendering: A Diverse Neural Actor Repository for High-Fidelity Human-Centric Rendering Video corpus Modalities: motions, hair/clothes Size: 1,500+ subjects, 67.5M frames Add. tasks: Generalization, Physics improvements & World interaction, High fidelity and realism	ICCV 2023
Goliath	Codec Avatar Studio: Paired Human Captures for Complete, Driveable, and Generalizable Avatars 3D/4D corpus Modalities: motions, hair/clothes Size: 4 subjects Add. tasks: Physics improvements & World interaction, High fidelity and realism	NeurIPS 2024
I3D-Human	Within the Dynamic Context: Inertia-Aware 3D Human Modeling with Pose Sequence 3D/4D corpus Modalities: motions, hair/clothes Size: 6 subjects, 10k frames Add. tasks: Physics improvements & World interaction	ECCV 2024
MVHumanNet	MVHumanNet: A Large-Scale Dataset of Multi-View Daily Dressing Human Captures Video corpus Modalities: text, motions Size: 4,500 subjects, 645M frames Add. tasks: Generalization, Text guidance & Stylization, Physics improvements & World interaction, High fidelity and realism	CVPR 2024
MVHumanNet++	MVHumanNet++: A Large-Scale Dataset of Multi-View Daily Dressing Human Captures with Richer Annotations for 3D Human Digitization Video corpus Modalities: text, motions Size: 4,500 subjects, 645M frames Add. tasks: Generalization, Text guidance & Stylization, Physics improvements & World interaction, High fidelity and realism	arXiv 2025

High fidelity and realism: Corpora

Corpus	Title & Repository / Information	Venue
Florence2D/3D	The Florence 2D/3D Hybrid Face Dataset 3D/4D corpus Modalities: emotions	J-HGBU 2011
FFHQ	A Style-Based Generator Architecture for Generative Adversarial Networks Image corpus Size: 70k images Add. tasks: Generalization	CVPR 2019
Multiface	Multiface: A Dataset for Neural Face Rendering Video corpus Modalities: motions Size: 13 subjects	arXiv 2022
SFHQ	Synthetic Faces High Quality (SFHQ) Dataset Image corpus Size: 100k images Add. tasks: Generalization	Dataset 2022
SHHQ	StyleGAN-Human: A Data-Centric Odyssey of Human Generation Image corpus Size: 40k images Add. tasks: Generalization	ECCV 2022
VFHQ	VFHQ: A High-Quality Dataset and Benchmark for Video Face Super-Resolution Video corpus Size: 16k+ clips Add. tasks: Generalization	CVPR 2022
2K2K	High-Fidelity 3D Human Digitization from Single 2K Resolution Images 3D/4D corpus Modalities: motions Size: 2k images	CVPR 2023
ActorsHQ	HumanRF: High-Fidelity Neural Radiance Fields for Humans in Motion Video corpus Size: 39,765 frames, 160 cameras Add. tasks: Physics improvements & World interaction	TOG 2023
INSTA	Instant Volumetric Head Avatars Video corpus Modalities: motions Add. tasks: Attribute editing	CVPR 2023
TexTalk4D	Towards High-Fidelity 3D Talking Avatar with Personalized Dynamic Texture 3D/4D corpus Modalities: audio, motions Size: 100 subjects, 100 minutes Add. tasks: Generalization, Expressiveness, Attribute editing	CVPR 2025

Methods

203 primary methods reviewed across 9 task families, plus 6 logical continuation papers (209 works in total). Every entry is an avatar-generation method published at a 2024–2025 CORE A/A* venue. Rows marked ↳ are follow-up papers grouped under the method they extend. Each entry includes a short summary.

The 203 reviewed methods, grouped by primary task and publication year.
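
The counting convention above (primary methods plus follow-up papers attached to the method they extend) can be sketched as follows. The `Method` record and the `continues` field are hypothetical names used for illustration only; the two sample rows are taken from the Generalization table.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Method:
    name: str
    venue: str
    continues: Optional[str] = None  # name of the method this paper extends


def summarize(methods: List[Method]) -> Tuple[int, int, int]:
    """Return (primary methods, follow-up papers, total works)."""
    followups = sum(1 for m in methods if m.continues is not None)
    return len(methods) - followups, followups, len(methods)


def followups_by_parent(methods: List[Method]) -> Dict[str, List[str]]:
    """Group follow-up papers under the method they extend."""
    grouped: Dict[str, List[str]] = {}
    for m in methods:
        if m.continues is not None:
            grouped.setdefault(m.continues, []).append(m.name)
    return grouped


sample = [
    Method("Portrait4D", "CVPR 2024"),
    Method("Portrait4D-v2", "ECCV 2024", continues="Portrait4D"),
]
print(summarize(sample))
print(followups_by_parent(sample))

# The totals quoted in this section follow the same rule: 203 + 6 = 209.
assert 203 + 6 == 209
```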

Generalization: Methods

Method	Title & Repository / Description	Venue
DisCo	DisCo: Disentangled Control for Realistic Human Dance Generation (Apache 2.0) DisCo introduces a pose-guided synthesis model for realistic human dance generation that emphasizes two principles: generalizability and compositionality. To achieve this, the authors design a disentangled-control architecture with a human-attribute pretraining stage.	CVPR 2024
SiTH	SiTH: Single-View Textured Human Reconstruction with Image-Conditioned Diffusion SiTH proposes a two-stage pipeline that reconstructs a fully textured 3D human mesh from a single input image. First, an image-conditioned diffusion model hallucinates the back-view appearance of the person. Then, a mesh reconstruction network uses both the original front view and the hallucinated back view, guided by a skinned human body prior, to reconstruct full-body geometry and texture.	CVPR 2024
DiffHuman	DiffHuman: Probabilistic Photorealistic 3D Reconstruction of Humans DiffHuman is a conditional diffusion model that denoises pixel-aligned 2D observations of an underlying 3D shape representation. The authors also propose a neural generator that approximates rendering and reduces runtime by up to 55×.	CVPR 2024
HaveFun	HaveFun: Human Avatar Reconstruction from Few-Shot Unconstrained Images The authors of HaveFun present a framework that can reconstruct animatable full‑body human avatars from a small set of unconstrained images by combining a skinning mechanism with Deep Marching Tetrahedra and a two‑phase optimization: reference alignment and unseen‑region guidance.	CVPR 2024
Morphable Diffusion	Morphable Diffusion: 3D-Consistent Diffusion for Single-Image Avatar Creation The authors introduce a diffusion model for creating fully 3D-consistent, animatable, photorealistic human avatars. It integrates a 3D morphable model (e.g., SMPL or FLAME) into a multi-view-consistent denoising approach, so that facial expression and body pose control are incorporated seamlessly and accurately into the generation process.	CVPR 2024
Stratified Avatar	Stratified Avatar Generation from Sparse Observations The paper proposes a stratified two-stage pipeline that first reconstructs an upper-body avatar from a small set of sparse HMD and hand observations and then conditions a lower-body synthesis on the learned upper-body latent to recover full-body poses. The authors leverage a VQ-VAE and latent diffusion formulation to model the conditional distribution of full-body motion given sparse inputs.	CVPR 2024
Portrait4D	Portrait4D: Learning One-Shot 4D Head Avatar Synthesis Using Synthetic Data Portrait4D proposes a one-shot framework for 4D head avatar synthesis from a single image. It first trains a part-wise 4D generative model to synthesize multi-view, motion-varying training data, and then uses a transformer-based animatable tri-plane reconstructor for avatar reconstruction. The authors train a 3D head synthesizer on synthetic multi-view images, use it to convert monocular real videos into pseudo multi-view ones, and then learn a full 4D head synthesizer via cross-view self-reenactment.	CVPR 2024
↳ Portrait4D-v2 (cont. of Portrait4D)	Portrait4D-v2: Pseudo Multi-View Data Creates Better 4D Head Synthesizer In their follow-up work, the authors introduce Portrait4D-v2, a feedforward one-shot 4D head avatar synthesis method that replaces reliance on monocular-video reconstruction and 3DMM guidance with pseudo multi-view data.	ECCV 2024
AvatarOne	AvatarOne: Monocular 3D Human Animation AvatarOne reconstructs an animatable 3D human avatar from a single monocular video and a tracked skeleton. The method builds a canonical SDF representation with accompanying texture, then uses a forward-skinning deformation module and grid-based volumetric rendering to support novel-pose and novel-view synthesis.	WACV 2024
SphereHead	SphereHead: Stable 3D Full-Head Synthesis with Spherical Tri-Plane Representation SphereHead introduces a spherical tri‑plane representation for 3D head synthesis, which better models full-head geometry and reduces back-view artifacts compared to standard Cartesian tri-planes. It also proposes a view-image consistency loss that enforces alignment between generated images and camera parameters, enabling stable 360-degree head generation and inversion from a single image.	ECCV 2024
PAV	PAV: Personalized Head Avatar from Unstructured Video Collection PAV proposes learning a dynamic deformable NeRF from a collection of monocular videos of the same person under different appearances (e.g., hair, facial changes). The method attaches learnable latent appearance embeddings to a base mesh and conditions both density and color of the NeRF on them.	ECCV 2024
HumanSplat	HumanSplat: Generalizable Single-Image Human Gaussian Splatting with Structure Priors In HumanSplat, the authors propose a method to reconstruct a 3D human avatar from a single image by predicting 3DGS parameters using a 2D multi‑view diffusion model and a latent reconstruction transformer, enriched with human-structure priors. This allows feedforward generation of human Gaussians without per-subject optimization or dense multi-view capture.	NeurIPS 2024
Human-3Diffusion	Human-3Diffusion: Realistic Avatar Creation via Explicit 3D Consistent Diffusion Models The authors propose a realistic avatar creation pipeline. Similar to previous approaches, it first utilizes a 2D multi-view diffusion model as a prior. Then it uses an image-conditioned 3DGS reconstruction model for explicit 3D representation.	NeurIPS 2024
GAGAvatar	Generalizable and Animatable Gaussian Head Avatar The authors propose GAGAvatar, a one-shot animatable head avatar method that regresses 3D Gaussian parameters from a single image using a dual-lifting approach and integrates 3DMM priors for expression control. The feedforward model reconstructs unseen identities without per-subject optimization and renders reenactments in real time.	NeurIPS 2024
Real3D-Portrait	Real3D-Portrait: One-Shot Realistic 3D Talking Portrait Synthesis Real3D-Portrait presents a one-shot pipeline that reconstructs a 3D avatar from a single image and conditions it on audio or video to produce talking head avatars. The system uses a large image-to-plane 3D prior, an efficient motion adapter for conditioned animation, and a head-torso/background super-resolution model.	ICLR 2024
GPAvatar (multi-input)	GPAvatar: Generalizable and Precise Head Avatar from Image(s) In the work GPAvatar (not to be confused with), a method is proposed that reconstructs a 3D head avatar from one or several input images in a single forward pass by using a dynamic point‑based expression field and a Multi Tri-planes Attention fusion module to combine information from multiple images.	ICLR 2024
Shafir et al.	Human Motion Diffusion as a Generative Prior The paper proposes using a pretrained motion diffusion model as a generative prior to overcome data scarcity in motion synthesis. The authors introduce three composition mechanisms -- sequential, parallel, and model composition -- enabling long animations, two-person motion, and fine‑grained control without collecting huge new corpora. For example, with their “DoubleTake” inference trick, they generate long motion sequences from a prior trained only on short clips.	ICLR 2024
Fine Structure-Aware Sampling	Fine Structure-Aware Sampling: A New Sampling Training Scheme for Pixel-Aligned Implicit Models in Single-View Human Reconstruction The paper proposes a Fine Structure-Aware Sampling strategy that emphasizes “fine” structures (ears, fingers, hair edges) when training pixel-aligned implicit models from single views, reducing reconstruction artifacts and improving detailed geometry/texture recovery.	AAAI 2024
InvertAvatar	InvertAvatar: Incremental GAN Inversion for Generalized Head Avatars The authors introduce InvertAvatar, an incremental 3D GAN inversion method that improves avatar reconstruction quality as more frames are provided. The technique includes an animatable 3D-GAN prior, a neural texture encoder with UV parameterization, and temporal aggregation (ConvGRU) to boost geometry/texture detail from multi-frame input.	SIGGRAPH 2024
Pippo	Pippo: High-Resolution Multi-View Humans from a Single Image Pippo is a generative model based on a multi-view DiT designed to create dense, 1K resolution turnaround videos or multi-view 3D representations of a person from a single input image. It uses a multi-stage training approach, starting with pretraining on 3B human images. Key innovations include an attention biasing technique that allows generating more views than in the original training distribution and a ControlMLP that uses pixel-aligned controls to enhance 3D consistency during high-resolution generation.	CVPR 2025
GAF	GAF: Gaussian Avatar Reconstruction from Monocular Videos via Multi-View Diffusion In GAF, the authors propose reconstructing animatable 3DGS head avatars from a monocular video captured on a commodity device. They use a multi-view latent diffusion model conditioned on normal maps from a FLAME model mesh and VAE image features to generate pseudo-ground-truth novel-view renderings, which guide the optimization of a 3DGS avatar representation. A latent upsampler further refines facial detail before decoding.	CVPR 2025
CAP4D	CAP4D: Creating Animatable 4D Portrait Avatars with Morphable Multi-View Diffusion Models CAP4D uses a morphable multi-view diffusion model to reconstruct 4D avatars. It works with an arbitrary number of reference images, even with just one. The proposed pipeline is capable of predicting novel views and unseen expressions.	CVPR 2025
AvatarArtist	AvatarArtist: Open-Domain 4D Avatarization In AvatarArtist, the authors propose a training paradigm using both GANs and diffusion models. They explain that, based on their observations, 4D-GANs fail at cross-domain tasks, but excel at bridging images and tri-planes. 2D diffusion models in the pipeline serve as diverse data distribution experts that assist GANs in the avatar creation.	CVPR 2025
FRESA	FRESA: Feedforward Reconstruction of Personalized Skinned Avatars from Few Images FRESA reconstructs personalized full-body skinned avatars from just a few casual images in a single feedforward pass. The method jointly infers shape, skinning weights, and pose-dependent deformations, improving geometric fidelity over shared-weight approaches. Multi-frame feature aggregation and 3D canonicalization help capture details.	CVPR 2025
Zero-1-to-A	Zero-1-to-A: Zero-Shot One Image to Animatable Head Avatars Using Video Diffusion Zero-1-to-A is a method of synthesizing spatially and temporally consistent corpora for 4D digital avatar synthesis. It iteratively constructs video subsets and progressively trains a diffusion model so that the resulting quality improves and the animation becomes more temporally coherent.	CVPR 2025
Vid2Avatar-Pro	Vid2Avatar-Pro: Authentic Avatar from Videos in the Wild via Universal Prior Sharing a common idea about efficient priors, Vid2Avatar-Pro uses a universal prior model trained on multiple clothed human views to guide the fitting of a photorealistic avatar from a monocular in-the-wild video. The avatar is represented via expressive 3D Gaussians with shared canonical front/back maps. Inverse rendering is used to adapt the prior to the input identity.	CVPR 2025
GASP	GASP: Gaussian Avatars with Synthetic Priors The authors train a 3DGS model prior using a perfectly annotated synthetic corpus, which is then fit and fine-tuned on a single photo or short video to enable 360-degree animatable avatars on a specific identity. Correlations among per-Gaussian features learned in synthetic space are utilized within the fitting process to bridge the domain gap.	CVPR 2025
AniGS	AniGS: Animatable Gaussian Avatar from a Single Image with Inconsistent Gaussian Reconstruction AniGS reconstructs animatable 3D avatars from a single image using 4D Gaussian Splatting. Multi-view canonical images are generated via a transformer-based model, and reconstruction inconsistencies are leveraged as motion cues for animation.	CVPR 2025
SynShot	Synthetic Prior for Few-Shot Drivable Head Avatar Inversion The authors of SynShot use a large synthetic avatar head corpus to create prior knowledge within the model, which is then fine-tuned using just a few real images to bridge the domain gap.	CVPR 2025
Avat3r	Avat3r: Large Animatable Gaussian Reconstruction Model for High-Fidelity 3D Head Avatars Avat3r is a model that regresses a high‑quality animatable 3D head avatar from just a few input images by learning a strong Gaussian‑splat prior over heads from a large multi-view 3D head corpus and enabling animation via cross‑attention to expression codes.	ICCV 2025
Sun et al.	Fine-Grained 3D Gaussian Head Avatars Modeling from Static Captures via Joint Reconstruction and Registration The authors propose modeling high-fidelity head avatars by optimizing two parallel 3DGS sets from static image captures: one prior-based set with animation rigging and one prior-free with texture/geometry details. They jointly register and merge them, then combine occluded parts from the prior set to output a complete animatable avatar.	ICCV 2025
GAS (Generative Avatar Synthesis)	GAS: Generative Avatar Synthesis from a Single Image The Generative Avatar Synthesis framework combines regression-based 3D human reconstruction with a diffusion-based approach. The authors argue that a dense driving signal rendered from the reconstructed human is preferable to conditioning such as depth or normal maps, which suffers from discrepancies. The dense signal serves as comprehensive conditioning for high-quality avatar synthesis.	ICCV 2025
GUAVA	GUAVA: Generalizable Upper Body 3D Gaussian Avatar Generalizable Upper Body 3D Gaussian Avatar reconstructs an animatable upper-body Gaussian avatar (torso, hands, face) from a single image in about 0.1 seconds using an expressive human model and projection-based sampling.	ICCV 2025
MoGA	MoGA: 3D Generative Avatar Prior for Monocular Gaussian Avatar Reconstruction The paper introduces Monocular Gaussian Avatar, a method that leverages a generative avatar prior to reconstruct high‑fidelity animatable avatars from monocular videos. The key idea, similar to that of previously described methods, lies in combining a learned 2D avatar prior with 3DGS for monocular reconstruction.	ICCV 2025
Low-Rank Register Modules	Low-Rank Head Avatar Personalization with Registers The paper proposes a framework to personalize a pretrained head-avatar model using Low-Rank Register Modules based on the Low-Rank Adaptation mechanism first introduced for language models. Instead of fine-tuning the full network, small learnable modules are inserted to adapt identity, appearance, and subtle facial details for new subjects.	NeurIPS 2025
3D²-Actor	3D²-Actor: Learning Pose-Conditioned 3D-Aware Denoiser for Realistic Gaussian Avatar Modeling 3D²‑Actor proposes a pipeline combining a pose‑conditioned 2D denoiser with a 3DGS‑based rectifier. Given a multi‑view video of a person, the system denoises and generates multi‑view images in arbitrary poses, then reconstructs a 3D avatar with a two‑stage projection strategy and local coordinate representation.	AAAI 2025
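
Several entries above adapt a large pretrained prior to a new subject instead of retraining it; the Low-Rank Register Modules entry, for instance, builds on Low-Rank Adaptation. As a rough illustration only, the sketch below shows the generic low-rank update W + (alpha / r) · B · A in plain Python. It is not the register-module design from that paper, and the names `matmul`, `lora_effective_weight`, and the toy matrix sizes are assumptions made for this example.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w_frozen, lora_b, lora_a, alpha):
    """Return W + (alpha / r) * B @ A, where r is the adapter rank."""
    rank = len(lora_a)  # A has shape (r, d_in); B has shape (d_out, r)
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(w_frozen, delta)
    ]


# Hypothetical toy sizes: a 2x3 frozen weight with a rank-1 adapter.
w_frozen = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
lora_a = [[0.5, -0.5, 1.0]]

# B is initialized to zero, so the adapted layer starts out identical to the frozen one.
lora_b = [[0.0], [0.0]]
w_start = lora_effective_weight(w_frozen, lora_b, lora_a, alpha=2.0)

# Pretend training has updated B; only A and B (5 numbers) changed, not W (6 numbers).
lora_b = [[1.0], [2.0]]
w_adapted = lora_effective_weight(w_frozen, lora_b, lora_a, alpha=2.0)
```

In a real model the saving is much larger than in this toy case, because the rank r is small relative to the layer dimensions while the frozen weight stays untouched.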

Expressiveness _Methods

Method	Title & Repository / Description	Venue
FaceTalk	FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models The authors propose using a latent diffusion model in the expression space of neural parametric head models to generate temporally coherent, high-fidelity 3D head animations from input audio.	CVPR 2024
SMIRK	SMIRK: 3D Facial Expressions Through Analysis-by-Neural-Synthesis (MIT) SMIRK replaces traditional differentiable-rendering losses with a neural renderer to reconstruct expressive 3D faces from single in-the-wild images. This enables faithful recovery of subtle, extreme, asymmetric, or rare expressions that prior methods often miss.	CVPR 2024
DiffTED	DiffTED: One-Shot Audio-Driven Ted Talk Video Generation with Diffusion-Based Co-Speech Gestures DiffTED is a novel method for one-shot audio-driven avatar synthesis from a single image. It leverages a diffusion model to generate Thin-Plate Spline motion model keypoints to control the avatar's movements for temporally coherent and diverse co-speech articulation. The method uses classifier-free guidance (CFG).	CVPR 2024
DiffusionAvatars	DiffusionAvatars: Deferred Diffusion for High-Fidelity 3D Head Avatars DiffusionAvatars is a method for generating high-fidelity 3D head avatars with control over pose and expression. The work's notable contribution is a neural parametric head model that is used to guide expression and head pose, as it serves as a proxy geometry for the subject. It generates expression encodings that are aggregated into the DiffusionAvatars pipeline via cross-attention. It also creates a canonical space, utilized by learnable spatial features that are later rigged to the head's surface using tri-planes.	CVPR 2024
EMAGE	EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling In this work, the authors propose a framework for full-body avatar motion generation conditioned on audio and masked gestures. These motions include facial, local body, hands, and global movements with high expressiveness and fidelity. To achieve this, they introduce the BEAT2 mesh-level co-speech corpus based on the SMPL-X body with FLAME head parameters.	CVPR 2024
EMOPortraits	EMOPortraits: Emotion-Enhanced Multimodal One-Shot Head Avatars The authors focus on the limitations of the latent space for facial expression descriptors. They modify a previous SOTA method to work with asymmetric facial expressions, introduce an audio modality for audio-driven facial animation, and propose a new FEED corpus that, compared to MEAD, fills the gap with intense, asymmetric, and varied facial expressions of identities in videos.	CVPR 2024
Diffused Heads	Diffused Heads: Diffusion Models Beat GANs on Talking-Face Generation The authors of Diffused Heads use an autoregressive diffusion model that -- given a single identity image and an audio clip -- generates a full talking‑head video. The method hallucinates natural head movement, blinks, and lip motion. It is capable of preserving identity and background, overcoming common limitations of GAN-based approaches.	WACV 2024
LaughTalk	LaughTalk: Expressive 3D Talking Head Generation with Laughter The authors of LaughTalk propose a system for 3D talking-head synthesis that can produce both speech and natural laughter -- something many prior methods struggle with, since laughter involves subtle face and head dynamics beyond speech articulation.	WACV 2024
EMO	EMO: Emote Portrait Alive Generating Expressive Portrait Videos with Audio2video Diffusion Model Under Weak Conditions In this work, the authors address the issue of human expressions and the uniqueness of facial styles. A framework is proposed that directly synthesizes video using the audio modality. Along with it, a reference image with motion frames and a face region mask are utilized in a Stable Diffusion-based pipeline.	ECCV 2024
EMO2 ^{cont. of EMO}	EMO2: End-Effector Guided Audio-Driven Avatar Video Generation The same authors propose a two-stage pipeline to synchronize the audio modality with co-speech gestures. First, they generate hand positions using a DiT, where the audio is incorporated via cross-attention. The previous motion latent sequence is concatenated with the current one for better transition smoothness. Second, the generated co-speech gestures are encoded and added into a noisy latent.	arXiv 2025
Arc2Face	Arc2Face: A Foundation Model for Id-Consistent Human Faces (MIT) Arc2Face is a diffusion-based foundation model that generates photorealistic human faces conditioned solely on a person’s ArcFace embedding, achieving stronger identity fidelity than text-prompted methods.	ECCV 2024
Expressive Whole-Body 3D Gaussian Avatar	Expressive Whole-Body 3D Gaussian Avatar Expressive Whole-Body 3D Gaussian Avatar introduces a hybrid representation combining a parametric mesh and 3DGS to produce animatable full-body avatars from short monocular videos. By rigging Gaussians to mesh vertices, the method models body, face, and hand deformations simultaneously, enabling expressive novel-pose synthesis with accurate facial expressions and hand gestures.	ECCV 2024
HeadGaS	HeadGaS: Real-Time Animatable Head Avatars via 3D Gaussian Splatting HeadGaS presents a method to generate real-time animatable head avatars using 3D Gaussian splats with learnable latent features. The Gaussians are rigged to a parametric head model and incorporate expression-dependent color and opacity, enabling animatable facial expressions.	ECCV 2024
ScanTalk	ScanTalk: 3D Talking Heads from Unregistered Scans ScanTalk is a framework that animates arbitrary 3D face meshes from speech. It overcomes the common limitation that many 3D face animation methods require fixed mesh topology and point‑to‑point correspondence. ScanTalk relies on a diffusion‑based mesh deformation network (DiffusionNet) that takes per‑vertex features and audio as input and outputs a deformation sequence, enabling speech‑driven animation even on previously unseen or unregistered scans.	ECCV 2024
ID-to-3D	ID-to-3D: Expressive Id-Guided 3D Heads via Score Distillation Sampling The authors of ID-to-3D introduce a method that, starting from a single casual reference image and a text prompt, generates a 3D human head avatar with identity-consistent geometry and texture. It also supports up to 13 distinct expressions. They combine an ArcFace embedding for identity, task-specific 2D diffusion priors, and a neural parametric representation for expression, foregoing reliance on large captured 3D corpora.	NeurIPS 2024
VASA-1	VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time VASA-1 generates photorealistic talking-face videos from a single input image and a speech-audio clip. The system models holistic facial dynamics and head motion in a disentangled latent space, producing synchronized lip movement, expressive facial nuances, and natural head motion.	NeurIPS 2024
MimicTalk	MimicTalk: Mimicking a Personalized and Expressive 3D Talking Face in Minutes MimicTalk proposes a hybrid adaptation pipeline. It generates an avatar starting from a person-agnostic generic 3D talking-face model, then quickly fine-tunes to a given identity in only a few minutes, and uses an in-context stylized speech2motion module to replicate the target’s speaking style.	NeurIPS 2024
GAIA	GAIA: Zero-Shot Talking Avatar Generation GAIA tackles talking avatar synthesis in a zero-shot setting. It generates natural videos without relying on 3DMMs or warping heuristics. The model disentangles appearance and motion, then uses a diffusion-based motion generator conditioned on the portrait and audio.	ICLR 2024
Follow-Your-Emoji	Follow-Your-Emoji: Fine-Controllable and Expressive Freestyle Portrait Animation In this work, the authors offer a diffusion-based framework for animating a reference portrait under a target landmark sequence. Identity is preserved while expressions are applied, with a novel “expression-aware landmark” motion signal and a fine-grained facial loss for subtle expression transfer. The system also supports long-term temporal consistency via progressive generation.	SIGGRAPH 2024
Follow-Your-Emoji-Faster ^{cont. of Follow-Your-Emoji}	Follow-Your-Emoji-Faster: Towards Efficient, Fine-Controllable, and Expressive Freestyle Portrait Animation Follow-Your-Emoji-Faster continues the authors' Follow-Your-Emoji line by making the same fine-controllable, expression-preserving portrait animation much faster and more robust. It adds a progressive generation strategy with a Taylor-interpolated cache to achieve roughly 2.6× faster inference while maintaining quality. It also improves landmark alignment and loss weighting to better handle exaggerated expressions and diverse portrait types.	arXiv 2025
Media2Face	Media2Face: Co-Speech Facial Animation Generation with Multi-Modality Guidance Media2Face is a diffusion-based generator that integrates diverse media inputs (audio, image, and text) for facial animation and head pose synthesis for avatars. For its training, the authors utilize the Generalized Neural Parametric Facial Asset, an efficient VAE mapping facial geometry and images to a highly generalized expression latent space.	SIGGRAPH 2024
AniTalker	AniTalker: Animate Vivid and Diverse Talking Faces Through Identity-Decoupled Facial Motion Encoding AniTalker decouples identity and motion via a motion encoder that produces identity-independent facial motion representations. A synthesis network then applies those motions to target identities to yield diverse, expressive talking-face videos from audio or text.	ACMMM 2024
TexTalker	Towards High-Fidelity 3D Talking Avatar with Personalized Dynamic Texture The authors introduce TexTalk4D, a high-resolution 4D corpus of 100 minutes of audio-aligned scan-level meshes with 8K dynamic textures from 100 subjects. They also present the diffusion-based framework TexTalker to generate facial motion and aligned dynamic textures simultaneously from speech. They reveal that dynamic texture is critical for high-fidelity speech-driven 3D head avatars and propose a pivot-based style injection strategy to disentangle motion style and texture style for better controllability.	CVPR 2025
Arc2Avatar	Arc2Avatar: Generating Expressive 3D Avatars from a Single Image via Id Guidance A continuation of Arc2Face, Arc2Avatar is a method that takes a single portrait image and generates a full 3D head avatar with blendshape-based expression control. The authors leverage a human-face foundation diffusion model fine-tuned for multi-view head synthesis and initialize a modified 3DGS representation in dense correspondence with a human face mesh template; connectivity regularizers ensure expression-capable topology. An optional SDS-based correction step refines blendshape expressions, and strong identity priors reduce reliance on heavy guidance, solving color fidelity issues common in SDS workflows.	CVPR 2025
Wang et al.	3D Gaussian Head Avatars with Expressive Dynamic Appearances by Compact Tensorial Representations The paper proposes a digital avatar synthesis method using rigged 3D Gaussian splats and a tensorial representation for dynamic textures. The authors add an adaptive truncated opacity penalty and class-balanced sampling to improve generalization across expressions.	CVPR 2025
VLOGGER	VLOGGER: Multimodal Diffusion for Embodied Avatar Synthesis The authors of VLOGGER introduce an avatar synthesis method from a single input image with audio guidance. First, a motion generator creates a sequence of 3D facial expressions and body poses for each frame based on the audio. These are transformed into denser representations and added to the reference image. Second, the packed input is passed into a temporal diffusion model, where it undergoes the denoising process. Finally, the pipeline uses a trainable super-resolution module to make the generation of each frame photorealistic.	CVPR 2025
EmoVOCA	EmoVOCA: Speech-Driven Emotional 3D Talking Heads The paper proposes a method for generating 3D talking-head avatars with realistic emotional expressions from audio input. The approach uses a speech-to-expression network to predict fine-grained, time-varying facial deformations corresponding to emotion cues in speech. To render these deformations, the authors employ a 3D face representation that preserves geometry and appearance under different expressions and head poses.	WACV 2025
GeoAvatar	GeoAvatar: Adaptive Geometrical Gaussian Splatting for 3D Head Avatar GeoAvatar introduces an adaptive 3DGS framework that separates rigid and flexible facial regions for better deformation control. It applies distinct regularizations to stabilize geometry while maintaining expression flexibility and incorporates a mouth-specific rigging structure for more accurate lip motion.	ICCV 2025
GaussianSpeech	GaussianSpeech: Audio-Driven Personalized 3D Gaussian Avatars In this work, the authors introduce a method that takes spoken audio and generates high-fidelity, personalized, multi-view--consistent 3D head avatars using a 3DGS representation. They couple a transformer-based audio feature extractor with expression-dependent Gaussian color modeling and capture a new large-scale multi-view audio-visual corpus for training.	ICCV 2025
FaceCraft4D	FaceCraft4D: Animated 3D Facial Avatar Generation from a Single Image FaceCraft4D takes a single image as input to create 360-degree animatable avatars. To make this possible, the authors use three different priors -- a shape prior, an image prior, and a video prior. The latter is used to enhance control over expressions and articulations in animations.	ICCV 2025
VASA-3D	VASA-3D: Lifelike Audio-Driven Gaussian Head Avatars from a Single Image The authors present VASA-3D, a pipeline that builds a lifelike, audio-driven 3D Gaussian head avatar from a single portrait by leveraging a learned 2D audio-motion latent (from prior VASA-1 work) and lifting it into a 3D Gaussian expression space.	NeurIPS 2025
CyberHost	CyberHost: A One-Stage Diffusion Framework for Audio-Driven Talking Body Generation The authors propose an end-to-end audio-driven avatar synthesis framework. Within it, they tackle the problems of hand integrity, identity consistency, and naturalness of motion. The key design of the framework -- CyberHost -- is the Region Codebook Attention mechanism. It refines the quality of facial and hand animations by integrating fine-grained local features with learned motion pattern priors.	ICLR 2025
TEASER	TEASER: Token Enhanced Spatial Modeling for Expressions Reconstruction The authors of TEASER propose a hybrid representation combining explicit facial parameters (e.g., from a 3DMM) with implicit appearance tokens derived by a multi-scale tokenizer.	ICLR 2025
DEEPTalk	DEEPTalk: Dynamic Emotion Embedding for Probabilistic Speech-Driven 3D Face Animation DEEPTalk is a novel approach for generating speech-driven 3D facial animations. To significantly increase expressiveness and reduce monotony, the authors first train a Dynamic Emotion Embedding. It serves as an embedding-space representation of both speech and facial motions. Then a Temporally Hierarchical VQ-VAE is employed as an expressive and robust motion prior, overcoming the limitations of VAEs and VQ-VAEs.	AAAI 2025
EchoMimic	EchoMimic: Lifelike Audio-Driven Portrait Animations Through Editable Landmark Conditions EchoMimic presents a method for generating high‑quality videos driven by audio and/or editable facial landmarks. The core idea is to train a model that can take either an audio clip, a sequence of facial keypoints, or a combination of both and produce a portrait animation.	AAAI 2025
EchoMimicV2 ^{cont. of EchoMimic}	EchoMimicV2: Towards Striking, Simplified, and Semi-Body Human Animation Continuing their work, the authors present EchoMimicV2, which extends the original idea to half‑body human animation. From a reference image, audio, and an optional hand‑pose sequence, it generates semi‑body (torso + arms + head) animated videos with synchronized speech, facial expression, and body/hand gestures.	CVPR 2025
VQTalker	VQTalker: Towards Multilingual Talking Avatars Through Facial Motion Tokenization VQTalker introduces a vector‑quantization-based facial motion tokenizer to capture articulations/pose features underlying speech. It uses this to generate talking‑head avatars that generalize across multiple languages. By discretizing facial motion and then performing coarse‑to‑fine motion generation, it achieves high-quality lip‑sync and natural animation from audio.	AAAI 2025
Model See Model Do	Model See Model Do: Speech-Driven Facial Animation with Style Control The authors of Model See Model Do propose a speech-driven facial animation framework that uses a style reference to control the expressive style of generated animations. The method separates speech and stylistic motion and enables transferring speaking styles from a reference model while preserving speaker identity and lip sync.	SIGGRAPH 2025
EVA	EVA: Expressive Virtual Avatars from Multi-View Videos The authors introduce EVA, a framework that builds full‑body avatars from multi‑view video. It builds on a deformable template mesh and a decoupled 3DGS.	SIGGRAPH 2025
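
Many of the diffusion-based methods in this table steer generation with a conditioning signal such as audio, and the DiffTED entry explicitly mentions classifier-free guidance. The sketch below shows only the standard guidance combination of an unconditional and a conditional noise prediction; it is a generic illustration, not code from any listed method. The function name `classifier_free_guidance` and the toy vectors are assumptions made for this example.

```python
def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Combine two noise predictions: eps_uncond + s * (eps_cond - eps_uncond)."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Hypothetical noise predictions for a 3-element latent.
eps_uncond = [0.0, 1.0, -1.0]   # prediction with the condition (e.g. audio) dropped
eps_cond = [1.0, 1.0, 0.0]      # prediction with the condition supplied

# A scale above 1 pushes the result further along the conditional direction.
guided = classifier_free_guidance(eps_uncond, eps_cond, guidance_scale=3.0)
```

A scale of 1 reproduces the conditional prediction and a scale of 0 reproduces the unconditional one; larger scales generally trade diversity for closer adherence to the condition.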

Text guidance & Stylization _Methods

Method	Title & Repository / Description	Venue
Make-It-Vivid	Make-It-Vivid: Dressing Your Animatable Biped Cartoon Characters from Text Make-It-Vivid allows the generation of high-quality UV-texture maps for 3D biped cartoon characters based on text prompts. The method uses a pretrained text-to-image diffusion model and a custom adversarial fine-tuning to handle the domain shift between natural images and cartoonish UV texture space.	CVPR 2024
CosmicMan	CosmicMan: A Text-to-Image Foundation Model for Humans CosmicMan is a holistic text-to-image foundation model that allows for the synthesis of photorealistic static human images. Having observed how much the data production flow influences results, the authors introduce a new Annotate Anyone paradigm and a large-scale CosmicManHQ-1.0 corpus with 6 million high-quality annotated human images. A Decomposed-Attention-Refocusing training framework is also introduced to utilize the relationship between dense text descriptions and image pixels.	CVPR 2024
HumanGaussian	HumanGaussian: Text-Driven 3D Human Generation with Gaussian Splatting (MIT) The paper introduces HumanGaussian, a framework using 3DGS for text‑driven human avatar synthesis. The key innovations include a Structure‑Aware SDS that jointly optimizes geometry and appearance via both RGB and depth guidance, and an Annealed Negative Prompt Guidance scheme to reduce over‑saturation artifacts.	CVPR 2024
HumanNorm	HumanNorm: Learning Normal Diffusion Model for High-Quality and Realistic 3D Human Generation HumanNorm is a text-conditioned 3D human synthesis approach. The core novelty is the use of a normal-adapted and a normal-aligned diffusion model. The first one creates high-fidelity normal maps corresponding to user prompts with a view-dependent, body-aware text. The second one generates colored images aligned with the normal maps.	CVPR 2024
3DToonify	3DToonify: Creating Your High-Fidelity 3D Stylized Avatar Easily from 2D Portrait Images The authors present 3DToonify, which converts a set of 2D portrait images into a stylized, high‑fidelity 3D avatar using implicit neural fields and a three‑stage progressive training scheme: guided prior learning, deformable geometry adaptation, and explicit texture adaptation.	CVPR 2024
DreamAvatar	DreamAvatar: Text-and-Shape Guided 3D Human Avatar Generation via Diffusion Models DreamAvatar was among the first works devoted to text guidance in digital avatar synthesis. The proposed network takes a text prompt, a 3D shape, and a pose as inputs to train a NeRF. Pretrained Stable Diffusion models serve as supervisors that generate intermediate 2D representations of the avatar used in the optimization pipeline.	CVPR 2024
StyleAvatar	StyleAvatar: Stylizing Animatable Head Avatars StyleAvatar introduces a method to stylize animatable 3D head avatars -- not by post-processing renders, but by directly editing the representation.	WACV 2024
Wang et al.	Disentangled Clothed Avatar Generation from Text Descriptions The authors propose a text-to-avatar generation method that separately models the human body and clothes through a representation called SO-SMPL: a pair of meshes built on the SMPL parametric model. They introduce an SDS-based pipeline to generate both meshes from text prompts, enabling better semantic alignment, higher texture and geometry quality, and effective editing/try-on capabilities.	ECCV 2024
HeadStudio	HeadStudio: Text to Animatable Head Avatars with 3D Gaussian Splatting HeadStudio introduces a pipeline that generates animatable 3D head avatars from text prompts by rigging 3D Gaussians to a FLAME head prior. The method couples FLAME-based mesh deformation with Gaussian-splat geometry/texture and uses text-to-3D optimization to produce avatars that can be animated in pose/expression and rendered in real time.	ECCV 2024
AvatarPopUp	Instant 3D Human Avatar Generation Using Image Diffusion Models The proposed method, called AvatarPopUp, generates a 3D human avatar quickly from either a single image or a text prompt by first using diffusion‑based image generation to synthesize front and back views with pose/shape control and then applying a 3D lifting network to produce a rigged mesh.	ECCV 2024
MagicMirror	MagicMirror: Fast and High-Quality Avatar Generation with a Constrained Search Space The authors of MagicMirror propose a hybrid approach for stylized avatar synthesis. It consists of a NeRF that creates a versatile initial solution space and a text-to-image diffusion model with a learned geometric prior. Variational Score Distillation (VSD) is used instead of the more common SDS to mitigate texture loss and oversaturation.	ECCV 2024
Stable Video Portraits	Stable Video Portraits The authors propose Stable Video Portraits -- a novel hybrid 2D/3D generation method for photorealistic portrait videos. It leverages a large pretrained text-to-image prior bound by 3DMM control. The method involves person-specific fine-tuning of a general 2D Stable Diffusion model with temporal conditioning using 3DMM sequences.	ECCV 2024
X-Oscar	X-Oscar: A Progressive Framework for High-Quality Text-Guided 3D Animatable Avatar Generation In this work, the authors propose X-Oscar, a progressive (geometry, texture, animation) framework that generates high-quality animatable 3D avatars from text prompts, introducing Adaptive Variational Parameter and Avatar-aware Score Distillation Sampling to reduce oversaturation and improve optimization stability.	ICML 2024
AvatarVerse	AvatarVerse: High-Quality & Stable 3D Avatar Creation from Text and Pose The authors of AvatarVerse propose a pipeline that generates full 3D avatars from a text prompt and pose guidance. The core is a 2D diffusion model conditioned on DensePose signals. The method uses a progressive high‑resolution 3D synthesis strategy to enhance geometric and texture detail.	AAAI 2024
Follow Your Pose	Follow Your Pose: Pose-Guided Text-to-Video Generation Using Pose-Free Videos The authors propose Follow Your Pose, a two-stage pipeline to generate pose-controllable character videos from text and pose trajectories, even when no paired text-video corpus exists. First, they fine-tune a text-to-image model on pose-image pairs to encode pose. Then, they add temporal self-attention and cross-frame attention and fine-tune on pose-free video data to generate smooth guided videos.	AAAI 2024
HeadArtist	HeadArtist: Text-Conditioned 3D Head Generation with Self Score Distillation The authors of HeadArtist propose a pipeline that generates 3D head avatars from text prompts by optimizing a parametric head model under the supervision of a frozen ControlNet model via the proposed Self Score Distillation.	SIGGRAPH 2024
DivAvatar	DivAvatar: Diverse 3D Avatar Generation with a Single Prompt In DivAvatar, the authors address the limited diversity of existing text-to-avatar systems by allowing the synthesis of many distinct 3D avatars from a single text prompt. Their method fine-tunes a pretrained 3D generative model and introduces two key designs: a noise-sampling strategy at training time to preserve generation diversity, and a semantic-aware zoom mechanism paired with a novel depth loss to enforce geometry quality while adhering to textual semantics.	WACV 2025
StrandHead	StrandHead: Text to Hair-Disentangled 3D Head Avatars Using Human-Centric Priors StrandHead generates 3D head avatars with strand-level hair from text prompts by first synthesizing a FLAME-aligned bald head via 2D human priors and then optimizing hair strands with a differentiable prismatization that enforces realistic orientation and curvature.	ICCV 2025
TeRA	TeRA: Rethinking Text-Guided Realistic 3D Avatar Generation The authors of TeRA propose a two‑stage generative framework for text‑to‑3D‑avatar creation that distills a decoder producing a structured latent space from a large human reconstruction model.	ICCV 2025
AvatarGO	AvatarGO: Zero-Shot 4D Human-Object Interaction Generation and Animation AvatarGO generates 4D HOI animations from high-level textual descriptions without requiring paired HOI training data. It first composes a 3D scene via text-guided 3D generation, then uses a SMPL‑X-based motion optimization to animate both human and object, enforcing spatial constraints and avoiding penetration.	NeurIPS 2025
InstructAvatar	InstructAvatar: Text-Guided Emotion and Motion Control for Avatar Generation InstructAvatar introduces a novel system that lets users control both facial emotion and motion of a 2D avatar via text guidance, in addition to audio. The method uses a two-branch diffusion‑based generator: one branch conditions on audio and another on text.	AAAI 2025
Wu et al.	Text-Based Animatable 3D Avatars with Morphable Model Alignment In the paper, the authors propose aligning text-driven digital avatar synthesis with morphable model geometry to produce animatable heads that respect parametric face constraints.	SIGGRAPH 2025
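
Several entries in this table build on score distillation, in which a frozen diffusion model supplies gradients for the parameters of an avatar representation. The sketch below illustrates the basic Score Distillation Sampling (SDS) update on a toy problem and is not taken from any listed method. It assumes an identity "renderer" (the parameters are the image itself), a hypothetical `toy_noise_predictor` standing in for the frozen diffusion model, and the weighting w(t) = 1 - alpha_bar, which is one possible choice.

```python
import math
import random


def toy_noise_predictor(x_t, alpha_bar, target):
    """Hypothetical stand-in for a frozen diffusion model.

    For a data distribution concentrated at `target`, the ideal noise
    estimate is (x_t - sqrt(alpha_bar) * target) / sqrt(1 - alpha_bar).
    """
    s, n = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(xt - s * m) / n for xt, m in zip(x_t, target)]


def sds_gradient(x, target, rng):
    """One stochastic SDS gradient: w(t) * (eps_hat - eps) * dx/dtheta.

    The renderer is the identity here, so dx/dtheta is the identity too.
    """
    alpha_bar = rng.uniform(0.1, 0.9)           # random noise level
    eps = [rng.gauss(0.0, 1.0) for _ in x]      # sampled noise
    s, n = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    x_t = [s * xi + n * e for xi, e in zip(x, eps)]   # forward diffusion
    eps_hat = toy_noise_predictor(x_t, alpha_bar, target)
    w = 1.0 - alpha_bar                         # assumed weighting w(t)
    return [w * (eh - e) for eh, e in zip(eps_hat, eps)]


def optimize(target, steps=500, lr=0.1, seed=0):
    """Plain gradient descent on the parameters using SDS gradients."""
    rng = random.Random(seed)
    x = [0.0 for _ in target]
    for _ in range(steps):
        g = sds_gradient(x, target, rng)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x


if __name__ == "__main__":
    print(optimize([1.0, -2.0, 0.5]))
```

In this toy setting the update pulls the parameters toward the mode of the prior. Real pipelines replace the identity renderer with a differentiable 3D renderer and the toy predictor with a text-conditioned diffusion model.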

Attribute editing _Methods

Method	Title & Repository / Description	Venue
Control4D	Control4D: Efficient 4D Portrait Editing with Text The authors propose a 4D portrait editing framework that uses a novel representation called GaussianPlanes -- a plane‑based decomposition of Gaussian Splatting over space-time -- and a generator trained to convert 2D diffusion text-driven edits into temporally consistent 4D outputs.	CVPR 2024
Animate Anyone	Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation The authors propose Animate Anyone, a diffusion‑based framework that animates a static character image into a full video, preserving appearance detail via their spatial‑attention ReferenceNet and enabling pose‑controllable motion with a “pose guider” module and temporal modeling to ensure smooth transitions between frames.	CVPR 2024
NECA	NECA: Neural Customizable Human Avatar The authors of NECA train a fully customizable human avatar from monocular or sparse-view video. It predicts disentangled neural fields for geometry, albedo, shadow, and external lighting in two complementary spaces (canonical and surface) and renders them volumetrically with high-frequency details.	CVPR 2024
GeneAvatar	GeneAvatar: Generic Expression-Aware Volumetric Head Avatar Editing from a Single Image GeneAvatar introduces a method that, given a single input image, can produce a volumetric 3D head avatar and allow expression-aware editing by lifting 2D edits into a consistent 3D modification field.	CVPR 2024
PEGASUS	PEGASUS: Personalized Generative 3D Avatars with Composable Attributes PEGASUS is a method that builds a person‑specific generative 3D avatar from a monocular video by first synthesizing a video collection of that identity with varying facial attributes (hair, nose, etc.), then training a generative model enabling disentangled compositional attribute control while preserving identity.	CVPR 2024
SplattingAvatar	SplattingAvatar: Realistic Real-Time Human Avatars with Mesh-Embedded Gaussian Splatting The work introduces a hybrid avatar representation combining explicit triangle-mesh geometry for low-frequency deformation and embedded 3D Gaussians for high-frequency geometry and appearance. The method creates photorealistic avatars that render at 300+ FPS on desktop and ~30 FPS on a mobile device. It is trainable from monocular video for head or full-body avatars and explicitly controls Gaussians via mesh motion, avoiding purely MLP-based deformation fields.	CVPR 2024
AttriHuman-3D	AttriHuman-3D: Editable 3D Human Avatar Generation with Attribute Decomposition and Indexing AttriHuman-3D proposes an editable avatar synthesis framework with attribute decomposition and indexing in latent space. By separating attributes such as body, hair, and clothing, it enables precise editing without affecting unrelated parts.	CVPR 2024
OHTA	OHTA: One-Shot Hand Avatar via Data-Driven Implicit Priors The authors of OHTA introduce a one-shot framework for building realistic hand avatars from a single image using data-driven implicit priors. The model learns a shape-texture prior from a large hand corpus and fine-tunes it for the target identity.	CVPR 2024
RAM-Avatar	RAM-Avatar: Real-Time Photo-Realistic Avatar from Monocular Videos with Full-Body Control RAM-Avatar presents a real-time system that learns a photorealistic, fully controllable human avatar from a single monocular video. The model uses a region-aware module to separately model the head, hands, and body. It integrates these into a unified avatar through pose-conditioned fusion.	CVPR 2024
Animatable Gaussians	Animatable Gaussians: Learning Pose-Dependent Gaussian Maps for High-Fidelity Human Avatar Modeling Animatable Gaussians introduces a template-guided parameterization that learns pose-dependent Gaussian maps (front and back) with a StyleGAN/StyleUNet-style conditional generator.	CVPR 2024
GaussianAvatars	GaussianAvatars: Photorealistic Head Avatars with Rigged 3D Gaussians The authors present GaussianAvatars, which rig 3D Gaussian splats to a parametric face model so each splat moves with an underlying triangle frame and per-splat offsets are optimized jointly with morphable model parameters.	CVPR 2024
TexVocab	TexVocab: Texture Vocabulary-Conditioned Human Avatars TexVocab constructs a pose-conditioned texture vocabulary by back-projecting multi-view RGB video frames into SMPL UV space, then learns to query and interpolate texture tokens per body part for dynamic, pose-dependent appearance synthesis.	CVPR 2024
CVTHead	CVTHead: One-Shot Controllable Head Avatar with Vertex-Feature Transformer In the paper for CVTHead, the authors propose a method that generates a controllable 3D head avatar from a single reference image by treating the mesh vertices as a point set and applying a Vertex-feature Transformer to learn per-vertex descriptors. This representation supports animation of pose, expression, and view changes via a lightweight neural point-based renderer.	WACV 2024
CanonicalFusion	CanonicalFusion: Generating Drivable 3D Human Avatars from Multiple Images CanonicalFusion's authors present a framework that reconstructs animatable 3D human avatars from multiple images by first predicting per‑view depth maps and LBS weight maps via a shared encoder-dual‑decoder, then canonicalizing each view into a unified mesh space. Rather than predicting full high‑dimensional skinning weights, the method compresses them into one 3D vector per vertex using a pretrained MLP. A forward skinning‑based differentiable rendering scheme merges the various reconstructions.	ECCV 2024
Champ	Champ: Controllable and Consistent Human Image Animation with 3D Parametric Guidance (MIT) The authors of Champ integrate a 3D human parametric model (e.g., SMPL) into a latent-diffusion-based animation pipeline to improve motion guidance, shape alignment, and pose fidelity in human image animation. They condition on depth, normal, semantic maps rendered from SMPL sequences and skeleton motion to steer the latent diffusion model.	ECCV 2024
OmniControl	OmniControl: Control Any Joint at Any Time for Human Motion Generation OmniControl presents a diffusion-based human motion generation model that -- unlike prior works limited to controlling only pelvis trajectory -- allows specification of spatial constraints for any joint at any time.	ICLR 2024
GG-Editor	GG-Editor: Locally Editing 3D Avatars with Multimodal Large Language Model Guidance In this work, the authors present GG‑Editor, a text-driven method for local editing of 3D avatars. Instead of global edits, the method uses a multimodal LLM (e.g., GPT‑4V) to infer reasonable local editing regions (hair, clothes, geometry details), then applies a global‑to‑local view‑synergy editing pipeline to modify geometry and texture while preserving cross‑view consistency.	ACMMM 2024
E³Gen	E³Gen: Efficient, Expressive and Editable Avatars Generation The paper introduces a novel method to generate high-fidelity, editable 3D avatars by encoding 3D Gaussian primitives into a structured 2D UV feature-plane defined over a parametric human mesh (e.g., SMPL-X). This UV-plane representation lets a diffusion model learn over many subjects, while a part-aware deformation module enables expressive full-body pose control and local editing (clothes, wrinkles).	ACMMM 2024
ControlFace	ControlFace: Harnessing Facial Parametric Control for Face Rigging The authors of ControlFace propose a face‑rigging method that combines 3DMM renderings with a dual‑branch U‑Net to allow precise control over pose, expression, and lighting directly from a single image.	CVPR 2025
PERSE	PERSE: Personalized 3D Generative Avatars from a Single Portrait PERSE presents a method that takes a single portrait image and builds a personalized 3D avatar with disentangled latent controls for facial attributes.	CVPR 2025
MeGA	MeGA: Hybrid Mesh-Gaussian Head Avatar for High-Fidelity Rendering and Head Editing MeGA introduces a hybrid representation that uses a refined mesh model for facial skin and 3D Gaussian splats for hair, allowing higher fidelity and editing flexibility across the whole head. A UV displacement map enhances facial geometry detail, and occlusion-aware blending merges mesh and Gaussian components for seamless rendering.	CVPR 2025
Editable Photorealistic Avatar (Tetrahedral 3DGS)	Creating Your Editable 3D Photorealistic Avatar with Tetrahedron-Constrained Gaussian Splatting The paper introduces a method for building editable photorealistic avatars by combining tetrahedral-grid constraints with 3DGS. The pipeline first instantiates an avatar from a monocular video, then uses local spatial adaptation via tetrahedrons to structure Gaussian kernels, and finally refines appearance with a coarse-to-fine strategy.	CVPR 2025
FATE	FATE: Full-Head Gaussian Avatar with Textural Editing from Monocular Video FATE introduces a sampling-based densification to improve rendering efficiency and achieve a better positional distribution of points. For texture editing, the authors convert Gaussian representations into editable attribute maps.	CVPR 2025
Gaussian Deja-vu	Gaussian Deja-Vu: Creating Controllable 3D Gaussian Head-Avatars with Enhanced Generalization and Personalization Abilities The authors present Gaussian Deja-vu, a two-stage framework that first trains a generalized 3DGS head prior on large 2D (synthetic + real) image corpora and then personalizes this prior quickly using monocular video with learnable expression-aware rectification blendmaps.	WACV 2025
PERSONA	PERSONA: Personalized Whole-Body 3D Avatar with Pose-Driven Deformations from a Single Image PERSONA proposes a method to create a personalized, animatable, whole-body 3D avatar from a single image. The core innovation is using a diffusion-based video generation model to synthesize a pose-rich training video from the input image, which then guides the optimization of a 3D avatar representation. To maintain high fidelity and mitigate identity drift from the generated data, the framework uses balanced sampling of the original image and geometry-weighted optimization.	ICCV 2025
ToMiE	Towards Explicit Exoskeleton for the Reconstruction of Complicated 3D Human Avatars The authors present ToMiE -- a framework that adapts the joint tree of the SMPL skeleton by dynamically growing “external joints” to explicitly model objects held by people or loose garments. The method proceeds in two steps: it localizes parent joints using gradients from skin‑blending weights and motion kernels, then optimizes external joint transforms across frames.	ICCV 2025
CtrlAvatar	CtrlAvatar: Controllable Avatars Generation via Disentangled Invertible Networks CtrlAvatar introduces a method to generate controllable, customizable human avatars by separating the deformation process into two disentangled streams: an implicit body geometry network and an explicit texture network.	AAAI 2025
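
Many entries in this table drive geometry through skinning weights (for example the LBS weight maps mentioned for CanonicalFusion). As a point of reference, the following is a minimal sketch of standard linear blend skinning for a single point, written in plain Python. The function names and the 3x4 `[R | t]` nested-list layout are assumptions made for illustration, not any listed method's API.

```python
def apply_transform(T, p):
    """Apply a 3x4 rigid transform [R | t] (nested lists) to a 3D point."""
    return [sum(T[r][c] * p[c] for c in range(3)) + T[r][3] for r in range(3)]


def linear_blend_skinning(point, weights, bone_transforms):
    """Blend the per-bone transformed positions of `point`.

    `weights` are expected to be non-negative and sum to 1.
    """
    out = [0.0, 0.0, 0.0]
    for w, T in zip(weights, bone_transforms):
        q = apply_transform(T, point)
        out = [o + w * qi for o, qi in zip(out, q)]
    return out


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0, 0.0],
                [0.0, 1.0, 0.0, 0.0],
                [0.0, 0.0, 1.0, 0.0]]
    shift_x = [[1.0, 0.0, 0.0, 2.0],
               [0.0, 1.0, 0.0, 0.0],
               [0.0, 0.0, 1.0, 0.0]]
    # A point influenced equally by a static bone and a bone moved by +2 in x
    # ends up moved by +1 in x.
    print(linear_blend_skinning([1.0, 1.0, 1.0], [0.5, 0.5], [identity, shift_x]))
```

Methods that rig Gaussians to a mesh or skeleton apply the same idea to splat centers, and typically also transform each splat's rotation and scale, which this sketch omits.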

Physics improvements & World interaction _Methods

Method	Title & Repository / Description	Venue
NIFTY	NIFTY: Neural Object Interaction Fields for Guided Human Motion Synthesis NIFTY introduces a neural “interaction field” attached to objects that encodes valid HOI configurations. During motion generation, this field guides an object-conditioned human motion diffusion model to produce realistic interactions.	CVPR 2024
CG-HOI	CG-HOI: Contact-Guided 3D Human-Object Interaction Generation CG-HOI tackles generation of full 3D HOI motion sequences from a text prompt and object geometry. The method jointly models human motion, object motion, and explicit contact between body and object, using a diffusion process with cross-attention to ensure coherence and physical plausibility.	CVPR 2024
WANDR	WANDR: Intention-Guided Human Motion Generation WANDR introduces a conditional VAE that generates realistic human motion trajectories aiming at a 3D goal. Given an initial pose and a target goal position, it outputs natural full-body motion sequences that place the end-effector (e.g., hand) on the goal. Instead of reinforcement learning or hand-crafted controllers, the model uses learned “intention features” that guide movement.	CVPR 2024
RoHM	RoHM: Robust Human Motion Reconstruction via Diffusion RoHM is a diffusion-based motion model that tackles the problem of robust reconstruction of 3D human motions in the presence of noise and occlusions. The paper proposes using two models addressing distant solution spaces: one for global trajectory and one for local motion.	CVPR 2024
IntrinsicAvatar	IntrinsicAvatar: Physically Based Inverse Rendering of Dynamic Humans from Monocular Videos via Explicit Ray Tracing The authors present a method that recovers intrinsic properties -- geometry, albedo, material, and lighting -- of clothed human avatars from a single monocular video by modeling volumetric scattering and performing explicit Monte‑Carlo ray tracing integrated with body articulation.	CVPR 2024
Saito et al.	Relightable Gaussian Codec Avatars The paper presents a method to build high-fidelity head avatars that support real-time relighting and animation by using a geometry model based on 3D Gaussians that capture sub-millimeter details -- hair strands, pores -- and an appearance model based on learnable radiance transfer combined with spherical harmonics for diffuse and reflection components.	CVPR 2024
Xu et al.	Relightable and Animatable Neural Avatar from Sparse-View Video The work addresses the problem of reconstructing animatable and relightable human avatars from sparse-view or monocular video under unknown illumination. The authors introduce a Hierarchical Distance Query algorithm that enables efficient sphere-tracing of deformed SDFs to estimate light visibility and surface intersections under arbitrary poses.	CVPR 2024
Intrinsic Hand Avatar	Intrinsic Hand Avatar: Illumination-Aware Hand Appearance and Shape Reconstruction from Monocular RGB Video The work recovers a full hand avatar -- geometry, appearance, and environment lighting -- from a monocular RGB video of a user’s hand under arbitrary real-world illumination. They optimize shape, material, and lighting jointly using a differentiable renderer with Monte Carlo path tracing.	WACV 2024
CHOIS	Controllable Human-Object Interaction Synthesis CHOIS is a conditional diffusion model that jointly generates human and object motion in 3D scenes, guided by language descriptions and object waypoint constraints.	ECCV 2024
HUMOS	HUMOS: Human Motion Model Conditioned on Body Shape In this work, the authors propose a generative human motion model that conditions not only on pose but also on body shape -- meaning that people with different body types move differently. The model is learned from unpaired data using cycle consistency, physics and stability constraints.	ECCV 2024
URAvatar	URAvatar: Universal Relightable Gaussian Codec Avatars URAvatar presents a pipeline to build photorealistic, relightable head avatars from a single phone scan under unknown illumination by learning a radiance-transfer style model rather than explicit inverse-rendered reflectance. In this way, avatars can be relit and animated in real time.	SIGGRAPH 2024
Jiang et al.	Autonomous Character-Scene Interaction Synthesis from Text Instruction Although the paper targets human motion synthesis rather than digital avatar synthesis as such, it proposes a framework for multi-stage, scene-aware interaction motion synthesis conditioned on text instructions and a goal location. A diffusion model and an autonomous scheduler predict sequential motion segments for each action stage.	SIGGRAPH 2024
VRMM	VRMM: A Volumetric Relightable Morphable Head Model The authors of VRMM propose a volumetric, relightable morphable head prior that disentangles identity, expression, view, and lighting -- using volumetric primitives attached to a base mesh, yielding a head model that supports animation and/or relighting under novel lighting/view conditions.	SIGGRAPH 2024
PhysReaction	PhysReaction: Physically Plausible Real-Time Humanoid Reaction Synthesis via Forward Dynamics Guided 4D Imitation The authors of PhysReaction propose a Forward Dynamics Guided 4D Imitation framework to synthesize physically plausible humanoid reactions in real time. Instead of purely kinematic approaches, which often suffer from sliding feet, foot penetration or non-physical motions, their method uses a learned policy to generate full-body reactions under physics constraints.	ACMMM 2024
HRAvatar	HRAvatar: High-Quality and Relightable Gaussian Head Avatar In this work, the authors present HRAvatar, a method that reconstructs high-fidelity, animatable 3D head avatars from monocular videos while enabling realistic relighting and material editing. They address limitations in past 3DGS approaches by incorporating an end-to-end tracking optimization, learnable blend-shapes and LBS for improved deformation.	CVPR 2025
InteractAvatar	InteractAvatar: Modeling Hand-Face Interaction in Photorealistic Avatars with Deformable Gaussians The authors of InteractAvatar introduce a novel avatar model that explicitly captures dynamic hand-face interactions, using 3D Gaussian splats anchored to a hand mesh that deform with articulation to model wrinkles, shadows, and contact effects. Their system has a “Dynamic Gaussian Hand” module that refines geometry and appearance via a neural network and a dedicated interaction module that adjusts facial geometry and shading when hands touch the face.	ICCV 2025
PRIMAL	PRIMAL: Physically Reactive and Interactive Motor Model for Avatar Learning In this work, the authors propose a novel generative real-time system. It allows for physically reactive and interactive avatars controlled with discrete commands and continuous signals, such as being pulled by a “magnet”. In the pretraining stage, the model learns body movements from sub-second motion segments. Then a ControlNet-like adaptor is employed to further fine-tune the base model to new tasks.	ICCV 2025
BecomingLit	BecomingLit: Relightable Gaussian Avatars with Hybrid Neural Shading BecomingLit presents a method to make 3DGS-based avatars relightable under arbitrary illumination conditions. The approach combines physically-based shading of Gaussian primitives with a neural network that refines shadows, highlights, and skin detail.	NeurIPS 2025
Agent-to-Sim	Agent-to-Sim: Learning Interactive Behavior Models from Casual Longitudinal Videos Agent‑to‑Sim (ATS) learns interactive 3D agent behavior from casually captured, long-term video -- no MoCap suits or multi‑view rigs needed. It reconstructs a persistent 4D representation across videos using a coarse-to-fine registration, then builds a behavior model that generates new agent motion conditioned on ego‑perception and environment.	ICLR 2025
Wang et al.	Relightable Full-Body Gaussian Codec Avatars The authors propose a new full‑body avatar framework combining 3DGS with a learned radiance‑transfer appearance model to enable relightable, pose‑dependent rendering including face and hands. Their method decomposes light transport into local and non-local effects through zonal harmonics for efficient diffuse transfer under articulation and a shadow network for occlusion shadows.	SIGGRAPH 2025
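
Several relightable avatars in this table express diffuse lighting in a spherical-harmonics (SH) basis. The sketch below shows the classic first-order SH irradiance evaluation for a Lambertian surface, which is a textbook baseline and not the learned radiance-transfer models described in the entries. The function names, the coefficient ordering `(L00, L1m1, L10, L11)`, and the sign convention of the basis are assumptions made for illustration.

```python
import math

Y00 = 0.282095                 # real SH basis constant for l = 0
Y1 = 0.488603                  # real SH basis scale for l = 1
A0 = math.pi                   # Lambertian convolution factor, l = 0
A1 = 2.0 * math.pi / 3.0       # Lambertian convolution factor, l = 1


def sh_irradiance(normal, light_coeffs):
    """Irradiance at a unit normal from first-order SH lighting.

    `light_coeffs` = (L00, L1m1, L10, L11) for a single colour channel,
    with the l = 1 basis functions proportional to y, z, x respectively.
    """
    x, y, z = normal
    l00, l1m1, l10, l11 = light_coeffs
    return A0 * l00 * Y00 + A1 * Y1 * (l1m1 * y + l10 * z + l11 * x)


def diffuse_radiance(albedo, normal, light_coeffs):
    """Outgoing radiance of a Lambertian surface: albedo / pi * irradiance."""
    return albedo / math.pi * max(sh_irradiance(normal, light_coeffs), 0.0)


if __name__ == "__main__":
    uniform = (1.0 / Y00, 0.0, 0.0, 0.0)    # constant unit radiance
    print(diffuse_radiance(0.5, (0.0, 0.0, 1.0), uniform))
    top_lit = (1.0 / Y00, 0.0, 1.0, 0.0)    # extra light from +z
    print(diffuse_radiance(0.5, (0.0, 0.0, 1.0), top_lit),
          diffuse_radiance(0.5, (0.0, 0.0, -1.0), top_lit))
```

Under constant unit radiance the reflected radiance equals the albedo regardless of the normal, which is a convenient sanity check. Adding an l = 1 term brightens normals facing the light and darkens those facing away.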

Hair and clothes improvements _Methods

Method	Title & Repository / Description	Venue
DiffAvatar	DiffAvatar: Simulation-Ready Garment Optimization with Differentiable Simulation The authors of DiffAvatar introduce a method for generating high-quality garment assets that are simulation-ready. It performs body and garment co-optimization using differentiable simulation. For proper geometry reconstruction and material parameters extraction, physical simulations are integrated into the optimization loop.	CVPR 2024
PhysAvatar	PhysAvatar: Learning the Physics of Dressed 3D Avatars from Visual Observations PhysAvatar combines 4D mesh-aligned Gaussian techniques, inverse rendering, and a physics simulator to recover not only shape and appearance, but also physical properties of clothing from multi-view video.	ECCV 2024
Zakharov et al.	Human Hair Reconstruction with Strand-Aligned 3D Gaussians The paper introduces a method that represents hair with strand‑aligned 3D Gaussians, combining classical hair‑strand geometry with 3DGS’s differentiable rendering to reconstruct realistic, strand‑level hairstyles from multi‑view data.	ECCV 2024
DLCA-Recon	DLCA-Recon: Dynamic Loose Clothing Avatar Reconstruction from Monocular Videos DLCA‑Recon reconstructs dynamic human avatars with loose clothing from monocular video. They combine an explicit mesh and an implicit SDF representation and introduce a Dynamic Deformation Field to model realistic cloth deformation with frame-to-frame consistency.	AAAI 2024
LayGA	LayGA: Layered Gaussian Avatars for Animatable Clothing Transfer LayGA separates body and clothing into two layers (body‑Gaussians + garment‑Gaussians). This enables animatable clothing transfer from multi‑view video, allowing users to switch clothes between avatars while preserving proper garment-body interaction and plausible deformation under motion.	SIGGRAPH 2024
DAGSM	DAGSM: Disentangled Avatar Generation with Gs-Enhanced Mesh The paper proposes DAGSM, where the authors enable text-conditioned avatar synthesis that disentangles human body and garments. They model the body and each clothing part separately using Gaussian-enhanced meshes to better represent complex textures like wool or transparent fabrics and support clothing replacement and realistic animation via a view-consistent texture refinement module.	CVPR 2025
SimAvatar	SimAvatar: Simulation-Ready Avatars with Layered Hair and Clothing The authors address the task of representing hair and garment geometry while also utilizing prior knowledge from a foundational model -- Stable Diffusion -- and making avatars simulation-ready via physics or neural simulators. They propose a two-stage framework. In the first stage, three text-conditioned diffusion-based models generate hair strands, a body mesh, and a garment. In the second stage, the elements are combined into a model and assigned learnable 3D Gaussians, which then undergo optimization. Image-based Stable Diffusion is used in the SDS loss calculation.	CVPR 2025
LUCAS	LUCAS: Layered Universal Codec Avatars LUCAS is a Universal Prior Model for digital avatar synthesis that disentangles face and hair via a layered representation, enabling both real-time mesh-based rendering and high-fidelity Gaussian avatar synthesis with improved cross-identity generalization and dynamic expression/pose handling.	CVPR 2025
Zhang et al.	Disentangled Clothed Avatar Generation with Layered Representation The authors propose a feedforward diffusion-based method that generates clothed avatars with fully disentangled components by using a layered UV feature-plane representation where each component occupies a distinct layer of a Gaussian-based UV feature map.	ICCV 2025
HADES	HADES: Human Avatar with Dynamic Explicit Hair Strands HADES models full-body avatars with dynamic hair represented as deformable strands attached to 3D Gaussians. It simulates realistic hair motion through temporal fusion and color-consistency correction across multi-view inputs, achieving natural animation and stable rendering.	ICCV 2025
HairCUP	HairCUP: Hair Compositional Universal Prior for 3D Gaussian Avatars The authors of the work present HairCUP, a universal prior model for 3D head avatars that explicitly disentangles face and hair by learning separate latent spaces for each component.	ICCV 2025
Im2Haircut	Im2Haircut: Single-View Strand-Based Hair Reconstruction for Human Avatars Im2Haircut is a method that reconstructs 3D strand-based hair geometry from a single input photograph by combining a transformer-based global hair prior (trained on synthetic + real data) with a 3DGS reconstruction module.	ICCV 2025
SeqAvatar	Sequential Gaussian Avatars with Hierarchical Motion Context The authors present SeqAvatar, a method for animatable human avatar synthesis using 3DGS enriched by a hierarchical motion context. They combine coarse skeleton‑level and fine-grained vertex motions in a coarse‑to‑fine conditioning scheme. They then apply a spatio‑temporal multi‑scale sampling strategy to better capture non-rigid deformations (e.g., cloth folds) under motion.	ICCV 2025
DGH	DGH: Dynamic Gaussian Hair The authors introduce DGH, a method for modeling dynamic hair within 3DGS-based avatars. Hair is represented as volumetric Gaussians that capture both the overall hairstyle and local motion dynamics.	NeurIPS 2025
MPMAvatar	MPMAvatar: Learning 3D Gaussian Avatars with Accurate and Robust Physics-Based Dynamics MPMAvatar builds clothed human avatars from multi-view video, combining 3DGS with a Material‑Point‑Method physics simulator to realistically simulate cloth dynamics and body‑cloth interactions.	NeurIPS 2025

High fidelity and realism _Methods

Method	Title & Repository / Description	Venue
GaussianAvatar	GaussianAvatar: Towards Realistic Human Avatar Modeling from a Single Video via Animatable 3D Gaussians (MIT) The authors of GaussianAvatar propose a method for creating realistic human avatars from a single monocular video by introducing animatable 3DGS with dynamic appearance networks to support pose‑dependent appearance modeling and jointly optimizing motion and appearance to tackle motion‑estimation inaccuracies.	CVPR 2024
UltrAvatar	UltrAvatar: A Realistic Animatable 3D Avatar Diffusion Model with Authenticity Guided Textures UltrAvatar is a novel 3D avatar synthesis approach with enhanced fidelity of geometry and superior quality of physically based rendering (PBR) textures. It presents a diffuse color extraction model and an authenticity guided texture diffusion model, both used for improving overall quality of generated avatars.	CVPR 2024
Gaussian Head Avatar	Gaussian Head Avatar: Ultra High-Fidelity Head Avatar via Dynamic Gaussians The authors propose a representation of animatable head avatars using controllable 3D Gaussians, jointly optimizing a neutral Gaussian set and an MLP-based deformation field to capture fine-grained dynamic expressions under sparse-view capture. A geometry-guided initialization using an implicit SDF and Deep Marching Tetrahedra stabilizes training and improves convergence.	CVPR 2024
RodinHD	RodinHD: High-Fidelity 3D Avatar Generation with Diffusion Models In this work, the authors tackle the catastrophic forgetting that arises when tri-planes are fitted sequentially to many avatars. They propose a novel data scheduling strategy and a weight consolidation regularization term, which improve the rendering of sharp details in the resulting avatars. A hierarchical representation of the portrait image is also introduced for rich 2D texture cues that are injected into a 3D diffusion model via cross-attention.	ECCV 2024
Bridging the Gap (Studio-Quality from Phone)	Bridging the Gap: Studio-Like Avatar Creation from a Monocular Phone Capture In the paper, the authors tackle the problem of producing studio‑quality human avatars from a short monocular smartphone video capture. They parameterize the phone‑captured texture maps via the latent space of StyleGAN2 and then fine‑tune a StyleGAN2 model using a small studio‑captured texture corpus, followed by a diffusion‑based super‑resolution step to improve fine details in the facial texture map.	ECCV 2024
MeshAvatar	MeshAvatar: Learning High-Quality Triangular Human Avatars from Multi-View Videos MeshAvatar introduces a method for building high-quality human avatars from multi-view video by combining an implicit SDF representation with an extracted triangular mesh and a pose-conditioned material field. The system jointly optimizes geometry and materials, uses a 2D U-Net and pseudo-normal supervision to improve fine detail, and produces avatars that integrate cleanly into standard rendering pipelines.	ECCV 2024
Tri²-plane	Tri²-Plane: Thinking Head Avatar via Feature Pyramid The method uses a multi-scale tri-plane representation to reconstruct photorealistic head avatars from monocular video. Instead of a single tri-plane, it stacks tri-planes at multiple scales to capture fine facial detail. The authors add a geometry-aware sliding window training augmentation to improve robustness under camera/pose variation.	ECCV 2024
Pose Modulated Avatars	Pose Modulated Avatars from Video The paper proposes a method for reconstructing human avatars from video in which pose-dependent deformation is handled explicitly by a two‑branch neural network: a GNN branch that models local correlations given the skeleton pose, and a frequency‑modulation branch that adjusts rendering features based on these correlations.	ICLR 2024
Qin et al.	High-Fidelity 3D Head Avatars Reconstruction Through Spatially-Varying Expression Conditioned Neural Radiance Field The paper presents a method for 3D head‑avatar reconstruction from video, introducing a Spatially‑Varying Expression conditioning. For each 3D point, the radiance field is conditioned not just on a global expression vector but also on spatial positional features.	AAAI 2024
IDOL	IDOL: Instant Photorealistic 3D Human Creation from a Single Image The paper presents a method to reconstruct high-fidelity 3D human avatars from a single RGB image. The approach combines a parametric human model with neural rendering to capture detailed geometry, texture, and appearance in one shot.	CVPR 2025
StableAnimator	StableAnimator: High-Quality Identity-Preserving Human Image Animation StableAnimator is an end-to-end video diffusion framework designed to preserve identity while animating a reference image to match a target pose sequence. It uses a distribution-aware ID Adapter, a face-refining encoder, and a Hamilton-Jacobi-Bellman-based optimization during inference to constrain denoising and maintain facial fidelity.	CVPR 2025
TAGA	TAGA: Self-Supervised Learning for Template-Free Animatable Gaussian Articulated Model TAGA introduces a template‑free approach to build animatable human avatars using 3D Gaussians. The method detects and corrects “Ambiguous Gaussians” in sparse posed data, refining geometry and skinning for accurate novel pose/view animation.	CVPR 2025
HERA	HERA: Hybrid Explicit Representation for Ultra-Realistic Head Avatars In HERA, the authors introduce a hybrid explicit representation combining UV-mapped 3D meshes with 3DGS, using the mesh to capture sharp surface textures (skin, stubble) and the Gaussians to model intricate geometry (hair, eyelashes).	CVPR 2025
TGA	TGA: True-to-Geometry Avatar Dynamic Reconstruction TGA proposes a 4D Gaussian-based avatar reconstruction framework that integrates perspective-aware Gaussian transformations with mesh extraction based on a dynamic Gaussian Bounding Volume Hierarchy (BVH) tree. This better captures fine facial geometry and dynamic deformations under motion, improving geometric accuracy over previous Gaussian-splatting methods.	NeurIPS 2025
SurFhead	SurFhead: Affine Rig Blending for Geometrically Accurate 2D Gaussian Surfel Head Avatars The authors of SurFhead propose a new avatar representation using 2D Gaussian surfels (instead of 3D Gaussians), rigged via affine‑transformation blending with polar decomposition. This allows much more accurate head geometry (surface normals, depth, mesh consistency) than prior 3DGS‑based avatars, while remaining riggable and animatable from RGB video alone.	ICLR 2025
ScaffoldAvatar	ScaffoldAvatar: High-Fidelity Gaussian Avatars with Patch Expressions The authors of ScaffoldAvatar present a hybrid pipeline that builds high-fidelity Gaussian head avatars by anchoring “patch expressions” -- localized Gaussian patches tied to a scaffold mesh -- to capture fine expression detail and enable robust animation.	SIGGRAPH 2025
TeGA	TeGA: Texture Space Gaussian Avatars for High-Resolution Dynamic Head Modeling The authors of TeGA introduce a high-detail 3D head avatar model that embeds 3D Gaussians within a continuous UVD texture space over a morphable head mesh -- allowing densification where detail matters while preserving efficient animation.	SIGGRAPH 2025
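
Several entries above sample features from plane-based representations; Tri²-plane, for instance, stacks tri-planes at multiple scales. The following is a minimal sketch of such a multi-scale lookup, not the authors' implementation. It assumes scalar features per grid cell (real systems store feature vectors), bilinear sampling, summation across the three planes, and concatenation across scales; the names `bilinear`, `query_triplane`, `query_pyramid`, and `ramp` are hypothetical.

```python
from typing import List, Sequence

Plane = Sequence[Sequence[float]]  # square grid of scalar features, side >= 2


def bilinear(plane: Plane, u: float, v: float) -> float:
    """Bilinearly sample a square grid at (u, v), both in [0, 1]."""
    n = len(plane)
    x, y = u * (n - 1), v * (n - 1)
    x0, y0 = min(int(x), n - 2), min(int(y), n - 2)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * plane[y0][x0] + fx * plane[y0][x0 + 1]
    bottom = (1 - fx) * plane[y0 + 1][x0] + fx * plane[y0 + 1][x0 + 1]
    return (1 - fy) * top + fy * bottom


def query_triplane(planes: Sequence[Plane], point: Sequence[float]) -> float:
    """Project a 3D point onto the XY, XZ and YZ planes and sum the samples."""
    x, y, z = point
    xy, xz, yz = planes
    return bilinear(xy, x, y) + bilinear(xz, x, z) + bilinear(yz, y, z)


def query_pyramid(
    pyramid: Sequence[Sequence[Plane]], point: Sequence[float]
) -> List[float]:
    """Return one feature per scale (coarse to fine), concatenated in a list."""
    return [query_triplane(planes, point) for planes in pyramid]


def ramp(n: int) -> List[List[float]]:
    """Toy n x n plane whose value equals the horizontal coordinate u."""
    return [[col / (n - 1) for col in range(n)] for _ in range(n)]


# Assumed two-level pyramid: a 2x2 and a 5x5 tri-plane filled with toy data.
pyramid = [[ramp(2)] * 3, [ramp(5)] * 3]
features = query_pyramid(pyramid, (0.5, 0.25, 0.75))
```

With these toy planes every scale returns the same value, which makes the sampling easy to check by hand; in a trained model the finer planes would carry the higher-frequency detail.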

Real-time generation & Compression _Methods

Method	Title & Repository / Description	Venue
GPS-Gaussian	GPS-Gaussian: Generalizable Pixel-Wise 3D Gaussian Splatting for Real-Time Human Novel View Synthesis (MIT) The authors propose a framework that generates 3D Gaussian representations of human characters from sparse input views by regressing Gaussian parameters directly on the 2D image planes, enabling real-time novel view synthesis.	CVPR 2024
GPS-Gaussian+ ^{cont. of GPS-Gaussian}	GPS-Gaussian+: Generalizable Pixel-Wise 3D Gaussian Splatting for Real-Time Human-Scene Rendering from Sparse Views This follow-up work extends GPS-Gaussian beyond isolated human characters to humans in the context of scenes, still under sparse-view conditions and with real-time rendering.	TPAMI 2025
GauHuman	GauHuman: Articulated Gaussian Splatting from Monocular Human Videos The authors of GauHuman present an avatar synthesis framework that uses 3DGS with LBS to animate full-body characters quickly. They encode Gaussians in canonical space and deform them via skinning to posed space, with modules refining pose and LBS weights for detail preservation.	CVPR 2024
Gaussian Shell Maps	Gaussian Shell Maps for Efficient 3D Human Generation The authors of Gaussian Shell Maps propose a volumetric representation that uses shell‑structured Gaussian distributions to represent the human body -- capturing geometry and appearance -- and enable fast 3D human synthesis and rendering.	CVPR 2024
Bai et al.	Efficient 3D Implicit Head Avatar with Mesh-Anchored Hash Table Blendshapes In this work, the authors propose a real‑time 3D head avatar system that uses a novel mesh‑anchored hash table blendshapes technique: multiple tiny hash tables are attached to vertices of a parametric face mesh and their embeddings are linearly blended (via weights predicted from a CNN) to represent expression‑dependent geometry and appearance. A lightweight MLP then predicts density and color from these embeddings for volumetric rendering, accelerated by a hierarchical kNN lookup.	CVPR 2024
FlashAvatar	FlashAvatar: High-Fidelity Head Avatar with Efficient Gaussian Embedding (MIT) The authors propose FlashAvatar, a method for reconstructing a high-fidelity animatable head avatar from a short monocular video in minutes and rendering it at ~300 FPS on a consumer GPU. They embed a uniform 3DGS field on the surface of a parametric face model and learn additional spatial offsets for non-surface regions and subtle facial details.	CVPR 2024
3DGS-Avatar	3DGS-Avatar: Animatable Avatars via Deformable 3D Gaussian Splatting (MIT) 3DGS-Avatar presents an animatable human avatar model using deformable 3D Gaussian splats. A canonical Gaussian field is combined with a pose-conditioned deformation network, improving generalization to unseen poses.	CVPR 2024
GoMAvatar	GoMAvatar: Efficient Animatable Human Modeling from Monocular Video Using Gaussians-on-Mesh GoMAvatar introduces the Gaussians-on-Mesh hybrid representation that attaches 3D Gaussian splats to a deformable mesh to get both high-quality appearance and efficient articulation. The model is trained end-to-end from a single monocular video.	CVPR 2024
GAvatar	GAvatar: Animatable 3D Gaussian Avatars with Implicit Mesh Learning The authors propose GAvatar, which builds animatable avatars using a 3DGS representation embedded in pose-driven primitives and further learns an SDF-based implicit mesh on top of the Gaussians to extract high-fidelity geometry and texture.	CVPR 2024
MoRF	MoRF: Mobile Realistic Fullbody Avatars from a Monocular Video In the paper, the authors propose MoRF that builds realistic full-body avatars from monocular video. It uses a mesh-based body proxy (SMPL-X), a neural texture, and per-frame warping fields to improve temporal consistency and appearance fidelity.	WACV 2024
POCA	POCA: Post-Training Quantization with Temporal Alignment for Codec Avatars POCA studies quantization for avatar decoders, showing that naive quantization (8-bit and 6-bit) introduces temporal noise in animated avatars. POCA proposes a novel Post-Training Quantization scheme with temporal alignment that preserves visual fidelity while compressing the decoder by 5.3×.	ECCV 2024
ReliaAvatar	ReliaAvatar: A Robust Real-Time Avatar Animator with Integrated Motion Prediction The authors present ReliaAvatar, a real-time avatar animator that integrates full-body motion prediction into an autoregressive animation pipeline to handle low-quality or missing input signals.	IJCAI 2024
GGHead	GGHead: Fast and Generalizable 3D Gaussian Heads The authors of GGHead propose embedding 3DGS within a 3D-GAN framework to learn a high‑fidelity, 3D‑consistent head prior from 2D image corpora. A CNN predicts Gaussian parameters over a template‑mesh UV layout. A novel total variation loss ensures geometric coherence, enabling real‑time rendering of full‑resolution heads without 2D super‑resolution.	SIGGRAPH 2024
GEM (Gaussian Eigen Models)	Gaussian Eigen Models for Human Heads The authors of GEM propose representing 3D head avatars with a linear eigenbasis over 3D Gaussian attributes (position, scale, rotation, opacity), enabling a low-dimensional, network-free representation that is lightweight, animatable, and real-time friendly.	CVPR 2025
Zhan et al.	Real-Time High-Fidelity Gaussian Human Avatars with Position-Based Interpolation of Spatially Distributed MLPs The paper proposes a 3DGS-based avatar synthesis method in which multiple MLPs are spatially distributed across the body and each Gaussian's properties are interpolated from the outputs of nearby MLPs.	CVPR 2025
FADA	FADA: Fast Diffusion Avatar Synthesis with Mixed-Supervised Multi-CFG Distillation The authors propose a mixed-supervised loss to address the poor performance of distilled diffusion models on open-set input images. They also propose multi-CFG distillation with learnable tokens to exploit the correlation between the audio and reference-image conditions.	CVPR 2025
GPAvatar (monocular)	GPAvatar: High-Fidelity Head Avatars by Learning Efficient Gaussian Projections The authors of GPAvatar propose a method that reconstructs high-fidelity dynamic 3D head avatars from monocular videos using Gaussian splats in a high-dimensional embedding space.	CVPR 2025
TaoAvatar	TaoAvatar: Real-Time Lifelike Full-Body Talking Avatars for Augmented Reality via 3D Gaussian Splatting TaoAvatar presents a high-fidelity, lightweight pipeline for creating full-body talking avatars optimized for AR devices. The method binds 3D Gaussians to a clothed parametric human template and distills pose-dependent non-rigid deformations into an MLP to develop suitable blend shapes.	CVPR 2025
RGBAvatar	RGBAvatar: Reduced Gaussian Blendshapes for Online Modeling of Head Avatars RGBAvatar proposes an online framework for animatable head avatar modeling using a reduced Gaussian blendshape representation. Instead of fixed 3DMM bases, a compact learned space is created for each individual, improving identity accuracy and expressiveness. A color initialization scheme and batch-parallel Gaussian rasterization enable real-time training and inference.	CVPR 2025
MobilePortrait	MobilePortrait: Real-Time One-Shot Neural Head Avatars on Mobile Devices MobilePortrait introduces a novel method for real-time head avatar synthesis on mobile devices. Lightweight U-Net backbones are used to reduce computational requirements. To compensate for possible quality loss, the authors mix explicit and implicit keypoints for motion modeling and utilize precomputed visual features for foreground and background synthesis.	CVPR 2025
LHM	LHM: Large Animatable Human Reconstruction Model for Single Image to 3D in Seconds LHM proposes a feedforward model that reconstructs detailed, animatable 3D humans from a single image in seconds -- representing geometry and appearance with 3D Gaussian splats. It uses a multimodal transformer to fuse image features, body positional priors, and a head feature pyramid encoding to preserve facial identity and fine detail.	ICCV 2025
GraphAvatar	GraphAvatar: Compact Head Avatars with GNN-Generated 3D Gaussians GraphAvatar replaces explicit storage of 3D Gaussians for head avatars with a compact GNN that generates Gaussian attributes from a tracked mesh.	AAAI 2025
SqueezeMe	SqueezeMe: Mobile-Ready Distillation of Gaussian Full-Body Avatars SqueezeMe shows how to distill high-fidelity 3D Gaussian full-body avatars into a lightweight representation suitable for mobile devices by compressing Gaussian decoding and reducing compute/memory overhead while preserving animation and rendering quality.	SIGGRAPH 2025
LAM	LAM: Large Avatar Model for One-Shot Animatable Gaussian Head The authors of LAM propose a method that builds a fully animatable 3D Gaussian head avatar from a single input image in a single forward pass. No video, multi-view rig, or post-processing is needed.	SIGGRAPH 2025
HGC-Avatar	HGC-Avatar: Hierarchical Gaussian Compression for Streamable Dynamic 3D Avatars The paper proposes a hierarchical compression scheme for dynamic Gaussian‑based avatars, aimed at efficient streaming and rendering. It splits the representation into a structural layer (pose‑to‑Gaussian generator) and a motion layer (via SMPL‑X), enabling compact transmission, progressive decoding, and controllable rendering under new poses.	ACMMM 2025
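
Several entries above (for example GauHuman and 3DGS-Avatar) keep Gaussians in a canonical space and deform them to posed space via skinning. The following is a minimal sketch of linear blend skinning (LBS) applied to Gaussian centers only, not any listed method's implementation. It assumes each bone transform is a 3x4 rigid matrix `[R | t]` and that each row of skinning weights sums to 1; the names `apply_rigid`, `skin_centers`, `IDENTITY`, and `SHIFT_X` are hypothetical. A full pipeline would also transform each Gaussian's orientation, which is omitted here.

```python
from typing import List, Sequence

Vec3 = Sequence[float]
Mat34 = Sequence[Sequence[float]]  # three rows of [r0, r1, r2, t]


def apply_rigid(transform: Mat34, point: Vec3) -> List[float]:
    """Apply a 3x4 rigid transform [R | t] to a 3D point."""
    return [
        sum(row[j] * point[j] for j in range(3)) + row[3]
        for row in transform
    ]


def skin_centers(
    centers: Sequence[Vec3],
    weights: Sequence[Sequence[float]],
    bone_transforms: Sequence[Mat34],
) -> List[List[float]]:
    """Linear blend skinning: x' = sum_k w_k * (R_k x + t_k)."""
    posed = []
    for center, w in zip(centers, weights):
        out = [0.0, 0.0, 0.0]
        for w_k, transform in zip(w, bone_transforms):
            moved = apply_rigid(transform, center)
            for axis in range(3):
                out[axis] += w_k * moved[axis]
        posed.append(out)
    return posed


# Two assumed bones: one stays put, one shifts by 1 unit along x.
IDENTITY = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
SHIFT_X = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]

canonical = [[0.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
skin_weights = [[1.0, 0.0], [0.5, 0.5]]  # each row sums to 1
posed = skin_centers(canonical, skin_weights, [IDENTITY, SHIFT_X])
# posed == [[0.0, 0.0, 0.0], [0.5, 2.0, 0.0]]
```

The second center is bound equally to both bones, so it moves half of the shift; this weighted averaging is what the pose and LBS-weight refinement modules mentioned above would adjust.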

Temporal consistency _Methods

Method	Title & Repository / Description	Venue
Lodge	Lodge: A Coarse to Fine Diffusion Network for Long Dance Generation Guided by the Characteristic Dance Primitives The method generates long dance motion sequences with a coarse-to-fine diffusion network guided by extracted dance primitives, preserving both global structure and fine motion detail over time.	CVPR 2024
Make-Your-Anchor	Make-Your-Anchor: A Diffusion-Based 2D Avatar Generation Framework The authors address full-body avatar synthesis in which the generated movements are “anchored” to those in the video. Their system, Make-Your-Anchor, needs only a one-minute video for training to enable precise translation of torso and hand movements. A structure-guided diffusion model is fine-tuned to take 3D mesh conditions as a separate modality.	CVPR 2024
Loopy	Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency In this work, the authors propose an end‑to‑end video diffusion model conditioned only on audio, designed to generate realistic portrait videos with natural long‑term motion. The model uses inter‑/intra‑clip temporal modules and an audio‑to‑latents mapping so it can leverage long‑range temporal dependencies and produce smooth, expressive motion from audio alone.	ICLR 2025
Hallo	Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation Hallo proposes a diffusion-based framework for portrait image animation driven by audio. It provides a hierarchical audio-driven synthesis module that jointly generates lip motion, facial expressions, and head pose.	arXiv 2024
Hallo2 ^{cont. of Hallo}	Hallo2: Long-Duration and High-Resolution Audio-Driven Portrait Image Animation The follow-up paper presents Hallo2, a method that generates long (tens of minutes) and high-resolution (up to 4K) talking-head videos from a single reference image and input audio, maintaining temporal coherence and avoiding drift over time.	ICLR 2025
DAWN	DAWN: Dynamic Frame Avatar with Non-Autoregressive Diffusion Framework for Talking Head Video Generation DAWN presents a non-autoregressive diffusion‑based framework that generates full talking‑head videos (lip sync + head pose + blinks) from a single portrait and an audio clip.	ICLR 2025
MimicMotion	MimicMotion: High-Quality Human Motion Video Generation with Confidence-Aware Pose Guidance MimicMotion introduces a video generation framework that can produce long, high‑quality human motion videos guided by a pose sequence. The method relies on confidence‑aware pose guidance to weigh pose keypoints by reliability, regional loss amplification to preserve detail in important regions (e.g., hands), and a progressive latent fusion strategy to enable temporally coherent videos of arbitrary length.	ICML 2025
MaintaAvatar	MaintaAvatar: A Maintainable Avatar Based on Neural Radiance Fields by Continual Learning MaintaAvatar tackles the problem of updating a 3D avatar over time as a person’s appearance or pose changes, without losing the ability to render previous appearances. The method augments a NeRF-based avatar with a Global-Local Joint Storage Module and a Pose‑Distillation Module.	AAAI 2025
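
Long-video methods in this table, such as MimicMotion with its progressive latent fusion, generate overlapping segments and merge them so that no seam is visible. The following is a generic sketch of that idea as a linear cross-fade over the shared frames; it is not MimicMotion's actual algorithm, which operates on latents during denoising. The function name `fuse_segments` and the use of one scalar per frame in place of a latent tensor are assumptions made for brevity.

```python
from typing import List, Sequence


def fuse_segments(segments: Sequence[Sequence[float]], overlap: int) -> List[float]:
    """Concatenate segments, linearly cross-fading the shared `overlap` frames."""
    if not segments:
        return []
    fused = list(segments[0])
    for segment in segments[1:]:
        if overlap <= 0 or overlap > min(len(fused), len(segment)):
            raise ValueError("overlap must be in 1..min(segment lengths)")
        start = len(fused) - overlap
        for i in range(overlap):
            # Weight of the incoming segment ramps up across the overlap.
            alpha = (i + 1) / (overlap + 1)
            fused[start + i] = (1.0 - alpha) * fused[start + i] + alpha * segment[i]
        fused.extend(segment[overlap:])
    return fused


# Hypothetical per-frame values standing in for latent frames.
first = [0.0, 0.0, 0.0, 0.0]
second = [4.0, 4.0, 4.0, 4.0]
video = fuse_segments([first, second], overlap=3)
# video == [0.0, 1.0, 2.0, 3.0, 4.0]
```

Two 4-frame segments sharing 3 frames yield 5 frames, and the values change gradually across the overlap instead of jumping at a hard cut.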

Citation & license

If you find these resources useful, please cite the review:

@article{makarov2026avatars,
  title   = {GenAI for Digital Avatar Synthesis: A Comprehensive Review},
  author  = {Makarov, Georgy and Ryumin, Dmitry},
  journal = {Neurocomputing, Peer Review},
  year    = {2026}
}

This repository is released under the MIT License. Figures are reproduced from the accompanying review paper by its authors. Linked code repositories remain under their own licenses (shown in parentheses next to a title where available).
