Disentangled Clothed Avatar Generation from Text Descriptions

ECCV 2024

Jionghao Wang¹,², Yuan Liu³, Zhiyang Dou³,⁴, Zhengming Yu¹,
Yongqing Liang¹, Cheng Lin³, Rong Xie², Li Song², Xin Li¹, Wenping Wang¹

¹Texas A&M University ²Shanghai Jiao Tong University ³The University of Hong Kong ⁴TransGP

Paper (arXiv)

Code

Abstract

Our method generates high-quality, separate meshes for the human body and the clothes from text prompts. Kinematic or simulated motions can then drive the disentangled avatar representation to produce photorealistic animations.

In this paper, we introduce a novel text-to-avatar generation method that generates the human body and the clothes separately and allows high-quality animation of the generated avatar. While recent advancements in text-to-avatar generation have yielded diverse human avatars from text prompts, these methods typically combine all elements (clothes, hair, and body) into a single 3D representation. Such an entangled approach poses challenges for downstream tasks like editing or animation. To overcome these limitations, we propose a novel disentangled 3D avatar representation named Sequentially Offset-SMPL (SO-SMPL), built upon the SMPL model. SO-SMPL represents the human body and clothes with two separate meshes, but associates them with offsets to ensure physical alignment between the body and the clothes. We then design a Score Distillation Sampling (SDS)-based distillation framework to generate the proposed SO-SMPL representation from text prompts. Compared with existing text-to-avatar methods, our approach not only achieves higher texture and geometry quality and better semantic alignment with text prompts, but also significantly improves the visual quality of character animation, virtual try-on, and avatar editing.
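To make the idea of tying a clothes mesh to the body through offsets more concrete, here is a minimal sketch in plain Python. It is an illustration only, not the SO-SMPL implementation: this page does not specify how the offsets are parameterized. The function name `offset_clothes_vertices`, the per-vertex coverage mask, and the free 3D offset vectors are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def offset_clothes_vertices(
    body_vertices: Sequence[Vec3],
    cloth_mask: Sequence[bool],
    offsets: Sequence[Vec3],
) -> List[Vec3]:
    """Build clothes vertices from the body vertices they cover.

    Hypothetical parameterization (an assumption, not taken from the paper):
    every body vertex flagged in `cloth_mask` yields one clothes vertex,
    located at the body vertex plus that vertex's offset vector.
    """
    if not (len(body_vertices) == len(cloth_mask) == len(offsets)):
        raise ValueError("inputs must have one entry per body vertex")

    clothes: List[Vec3] = []
    for vertex, covered, delta in zip(body_vertices, cloth_mask, offsets):
        if covered:
            clothes.append(
                (vertex[0] + delta[0], vertex[1] + delta[1], vertex[2] + delta[2])
            )
    return clothes


# Toy body with three vertices; the first and third are covered by clothes.
body = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
mask = [True, False, True]
deltas = [(0.0, 0.0, 0.5), (0.0, 0.0, 0.5), (0.0, 0.0, 0.25)]

clothes = offset_clothes_vertices(body, mask, deltas)
```

Because the clothes vertices in this sketch are defined relative to the body vertices, any change to the body (a new pose or shape) moves the clothes with it, which is the kind of alignment the offsets are meant to preserve.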

Character Animations

a white teenage boy wearing
Christmas sweater and denim jeans

a Hispanic man wearing
black sports jersey and blue track pants

a chubby bald old man wearing
denim work shirt and dirty washed jeans

an African-American man wearing
textured white cable-knit sweater and tan chinos

an old lady wearing
long lilac cardigan and taupe shorts

a Caucasian girl with tousled hair wearing
yellow cycling jersey and beige yoga pants

a white teenage boy wearing
red flannel of checker pattern and denim shorts

an African-American woman wearing
navy and beige peplum shirt and beige yoga pants

a Chinese woman with short black hair wearing
jade green blouse with short sleeves and olive green shorts

a Hispanic woman with blonde hair wearing
green flannel shirt of checker pattern and light blue jeans

Framework

Our pipeline has two stages. In Stage I, we generate a base human body model by optimizing its shape parameter and albedo texture. In Stage II, we freeze the human body model and optimize the clothes shape and texture. The rendered RGB images and normal maps of both the clothed human and the clothes are used in computing the SDS losses. For more details, please check out our paper.
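The freeze-then-optimize schedule of the two stages can be sketched with a toy example. This is only a schematic under strong assumptions: the body and the clothes are each reduced to a single scalar parameter, and a simple quadratic loss stands in for the SDS losses, which in the actual pipeline are computed from rendered RGB images and normal maps. The names `grad_descent` and `two_stage_fit` are hypothetical.

```python
from typing import Callable, Tuple


def grad_descent(
    param: float,
    grad_fn: Callable[[float], float],
    lr: float = 0.1,
    steps: int = 200,
) -> float:
    """Plain gradient descent on a single scalar parameter."""
    for _ in range(steps):
        param -= lr * grad_fn(param)
    return param


def two_stage_fit(body_target: float, clothed_target: float) -> Tuple[float, float]:
    """Toy version of the two-stage schedule (placeholder losses, not SDS).

    Stage I optimizes the body parameter alone.
    Stage II keeps the body fixed and optimizes only the clothes offset.
    """
    # Stage I: loss = (body - body_target)^2, so the gradient is 2 * (body - body_target).
    body = grad_descent(0.0, lambda b: 2.0 * (b - body_target))

    # Stage II: the body is frozen; the loss is measured on the clothed result,
    # loss = (frozen_body + offset - clothed_target)^2, and only `offset` is updated.
    frozen_body = body
    offset = grad_descent(
        0.0, lambda o: 2.0 * (frozen_body + o - clothed_target)
    )
    return body, offset


body_param, clothes_offset = two_stage_fit(body_target=1.0, clothed_target=1.5)
```

In this toy setting the second stage can only explain the gap between the clothed target and the frozen body by changing the clothes parameter, which mirrors why freezing the body in Stage II keeps the two components disentangled.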