Over the summer of 2021, I worked as an AI/ML research intern at VMware, where I was tasked with creating a 3D avatar generation solution. This page gives an overview of why, how, and what I built.

Motivation

In a professional environment, employees are typically expected to upload a photo of their face to platforms such as Slack, Zoom, and whatever else their team uses. However, many people prefer not to share photos of their face: they may not have any photos of themselves that they like, or they may be uncomfortable with certain details of their appearance. 3D avatars are one possible solution. An avatar is still recognizable, but it hides or softens details of the face so that employees are comfortable sharing it. The result is a better employee experience: people are more willing to share their avatars, and everyone on the team can still recognize each other's profile images.

How it works

The avatar generation pipeline takes a single input image of a person's face and outputs either a 3D avatar or a 2D render of it that can be used as a profile picture. The pipeline consists of many steps, several of which use or build on existing open-source projects and research papers.

The pipeline begins by loading the image, then detecting, cropping, and aligning the face contained in it. After that, the pipeline splits into its two major parts: head generation and hair generation.
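
Here's a rough sketch of what this first step can look like, using OpenCV's bundled Haar cascades. It is an illustrative simplification rather than the exact detector the pipeline uses, and the margin and output size are placeholder values:

```python
import cv2
import numpy as np

# Illustrative detect/crop/align step using OpenCV's bundled Haar cascades;
# the real pipeline may use a different face detector entirely.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
eye_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_eye.xml")

def detect_crop_align(image_bgr, output_size=256):
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        raise ValueError("No face found in the input image")

    # Take the largest detected face and crop with some margin around it.
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])
    margin = int(0.25 * w)
    x0, y0 = max(x - margin, 0), max(y - margin, 0)
    x1 = min(x + w + margin, image_bgr.shape[1])
    y1 = min(y + h + margin, image_bgr.shape[0])
    crop = image_bgr[y0:y1, x0:x1]

    # Align by rotating so the line between the two detected eyes is horizontal.
    eyes = eye_cascade.detectMultiScale(cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY))
    if len(eyes) >= 2:
        eyes = sorted(eyes, key=lambda e: e[0])[:2]          # left-to-right
        centers = [(ex + ew / 2, ey + eh / 2) for ex, ey, ew, eh in eyes]
        dx, dy = centers[1][0] - centers[0][0], centers[1][1] - centers[0][1]
        angle = np.degrees(np.arctan2(dy, dx))
        rot = cv2.getRotationMatrix2D(
            (crop.shape[1] / 2, crop.shape[0] / 2), angle, 1.0)
        crop = cv2.warpAffine(crop, rot, (crop.shape[1], crop.shape[0]))

    return cv2.resize(crop, (output_size, output_size))
```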

Here’s an overview of the head generation pipeline:

  1. Detect important features on the face, such as:
    • Eye color
    • Gender (used to select an appropriate base head shape)
    • 2D/3D landmarks around the face outline, mouth, nose, eyes, and eyebrows (see the landmark-detection sketch after this list)
  2. Fit a statistical 3D head model, such as FLAME, to the landmarks (a minimal fitting sketch follows this list).
  3. Project the face image onto the 3D head shape. From this, we get a UV map: a flattened version of the 3D texture. We can also remove any facial expression by zeroing out the expression vector in FLAME, which gives the 3D model a neutral expression. Alternatively, we can create preset expressions or let users keep their facial expression in the avatar.
  4. The projected texture will contain stretching and other artifacts, as well as missing or incorrect regions caused by pose variation or occlusions such as hair and headwear. To address this, a UV-inpainting GAN fills in the missing regions and corrects areas where projection artifacts are present.
  5. The avatar is refined according to defaults or user configuration, either to make it look more realistic or to hide details. Users can customize skin smoothness, eye color, hair color, hairstyle, and more (a skin-smoothing sketch is included after this list).
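
As a concrete example of step 1, here is a minimal 2D landmark-detection sketch built on dlib's standard 68-point predictor. The actual pipeline may use a different detector and landmark set (and a separate model for 3D landmarks); the predictor file referenced below is dlib's publicly available model, not something specific to this project:

```python
import dlib
import numpy as np

# Illustrative 68-point 2D landmark detection with dlib. The file
# "shape_predictor_68_face_landmarks.dat" is dlib's standard predictor
# model and must be downloaded separately.
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def detect_landmarks(image_rgb):
    faces = detector(image_rgb, 1)          # upsample once to catch smaller faces
    if not faces:
        return None
    shape = predictor(image_rgb, faces[0])  # 68 (x, y) points on the first face
    return np.array([(p.x, p.y) for p in shape.parts()], dtype=np.float32)
```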
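
Steps 2 and 3 boil down to an optimization problem: find head-model coefficients whose projected landmarks match the detected ones, then zero the expression vector to get a neutral avatar. The sketch below is conceptual; `flame_layer` and `project` stand in for a differentiable FLAME implementation and a camera projection, and the coefficient dimensions are illustrative rather than FLAME's exact sizes:

```python
import torch

# Conceptual landmark-based fitting of a FLAME-style head model (steps 2-3).
# `flame_layer` maps (shape, expression, pose) coefficients to mesh vertices
# and 3D landmarks; `project` maps 3D landmarks to 2D image coordinates.
# Both are assumed helpers, not part of any specific library.
def fit_head_model(flame_layer, project, target_landmarks_2d, steps=500):
    shape = torch.zeros(1, 100, requires_grad=True)       # identity coefficients
    expression = torch.zeros(1, 50, requires_grad=True)   # expression coefficients
    pose = torch.zeros(1, 6, requires_grad=True)          # global + jaw rotation
    optimizer = torch.optim.Adam([shape, expression, pose], lr=1e-2)

    for _ in range(steps):
        optimizer.zero_grad()
        vertices, landmarks_3d = flame_layer(shape, expression, pose)
        landmarks_2d = project(landmarks_3d)               # model landmarks in image space
        loss = ((landmarks_2d - target_landmarks_2d) ** 2).mean()
        # Small L2 regularizers keep the fitted head plausible.
        loss = loss + 1e-4 * (shape ** 2).sum() + 1e-4 * (expression ** 2).sum()
        loss.backward()
        optimizer.step()

    # Neutralize the expression by zeroing the expression vector (step 3);
    # the fitted identity and pose are kept.
    neutral_vertices, _ = flame_layer(shape, torch.zeros_like(expression), pose)
    return neutral_vertices
```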
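
The "skin smoothness" control in step 5 can be as simple as blending the UV texture with an edge-preserving blur of itself. The bilateral-filter parameters below are placeholders rather than the pipeline's actual settings:

```python
import cv2

# Illustrative skin-smoothing control: blend the UV texture with a bilateral
# (edge-preserving) blur of itself. smoothness=0 keeps the original texture,
# smoothness=1 uses the fully blurred version.
def smooth_skin(uv_texture, smoothness=0.5):
    blurred = cv2.bilateralFilter(uv_texture, d=9, sigmaColor=75, sigmaSpace=75)
    return cv2.addWeighted(uv_texture, 1.0 - smoothness, blurred, smoothness, 0)
```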

Here's an overview of the automatic hair generation pipeline, which is used if the user does not manually select a hairstyle:

  1. Crop the input image and align it to the face
  2. Detect important features on the face, such as:
    • Hair color
    • Hair mask - a segmentation of the hair in the input image
    • Hair orientation map - which way the hair flows at each point, estimated with Gabor filters (see the Gabor-filter sketch after this list)
  3. The hair mask and orientation map are used to find the closest matching hairstyle in a database of 3D hairstyles (a matching sketch also follows this list).
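
Here's an illustrative version of the orientation-map step: filter the grayscale image with a bank of Gabor kernels at different angles and keep the strongest-responding angle at each pixel. The kernel parameters are placeholders, and depending on the kernel convention the strand direction may be offset by 90 degrees from the reported angle:

```python
import cv2
import numpy as np

# Illustrative hair-orientation estimation with a bank of Gabor filters:
# filter the grayscale image at several orientations and, at each pixel,
# keep the orientation with the strongest response. Kernel parameters are
# placeholders; the real pipeline would tune them.
def hair_orientation_map(gray, hair_mask, num_orientations=16):
    thetas = np.linspace(0, np.pi, num_orientations, endpoint=False)
    responses = []
    for theta in thetas:
        kernel = cv2.getGaborKernel(ksize=(17, 17), sigma=3.0, theta=theta,
                                    lambd=8.0, gamma=0.5, psi=0)
        responses.append(np.abs(cv2.filter2D(gray.astype(np.float32), -1, kernel)))
    responses = np.stack(responses, axis=-1)               # H x W x num_orientations
    orientation = thetas[np.argmax(responses, axis=-1)]    # strongest angle per pixel
    return np.where(hair_mask > 0, orientation, 0.0)       # keep values inside the hair mask only
```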
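
The matching step can then be framed as nearest-neighbor retrieval over descriptors computed from the hair mask and orientation map. The descriptor layout and database format below are assumptions for illustration, not the project's actual schema:

```python
import numpy as np

# Conceptual nearest-neighbor hairstyle retrieval. Each database entry is
# assumed to be a (hairstyle_id, descriptor) pair, with descriptors
# precomputed from rendered views of each 3D hairstyle.
def build_descriptor(hair_mask, orientation_map, grid=16):
    h, w = hair_mask.shape
    cells = []
    for i in range(grid):
        for j in range(grid):
            cell_mask = hair_mask[i*h//grid:(i+1)*h//grid, j*w//grid:(j+1)*w//grid]
            cell_orient = orientation_map[i*h//grid:(i+1)*h//grid, j*w//grid:(j+1)*w//grid]
            coverage = float((cell_mask > 0).mean())        # fraction of hair pixels in the cell
            mean_orientation = float(cell_orient[cell_mask > 0].mean()) if coverage > 0 else 0.0
            cells.extend([coverage, mean_orientation])
    return np.array(cells, dtype=np.float32)

def closest_hairstyle(query_descriptor, database):
    # database: list of (hairstyle_id, descriptor) pairs
    distances = [np.linalg.norm(query_descriptor - d) for _, d in database]
    return database[int(np.argmin(distances))][0]
```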

Next steps

This was a large project, and I really enjoyed working on it. Each of the steps I described above could be an internship project in itself! As such, there are many ways this project can be taken further. Here are some examples:

  • AR/VR conferencing - useful when people don’t want to show their face or background on video, or if they don’t want to be on video at all!
  • Accessories - allow avatars to wear hats, glasses, earrings and more!
  • More detail, features, customizability - can always make it better!

Examples

Myself

Barack Obama, source image: https://www.biography.com/us-president/barack-obama

Emma Stone, source image: http://www.indiewire.com/2014/11/dont-love-emma-stone-yet-this-interview-will-fix-that-68047/

References

  1. L. Hu, C. Ma, L. Luo, and H. Li, “Single-View Hair Modeling Using A Hairstyle Database,” ACM Transactions on Graphics (Proceedings SIGGRAPH 2015), vol. 34, no. 4, Jul. 2015.
  2. T. Li, T. Bolkart, M. J. Black, H. Li, and J. Romero, “Learning a model of facial shape and expression from 4D scans,” ACM Transactions on Graphics (Proc. SIGGRAPH Asia), vol. 36, no. 6, pp. 194:1–194:17, 2017. [Online]. Available: https://doi.org/10.1145/3130800.3130813
  3. J. Lin, Y. Yuan, and Z. Zou, “MeInGame: Create a Game Character Face from a Single Portrait,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2021.