Deep Learning Face Attributes in the Wild


Predicting face attributes from web images


The framework cascades two CNNs, LNet and ANet, for face localization and attribute prediction respectively.


(1) It shows how LNet and ANet can be improved by different pre-training strategies.

(2) It reveals that although the filters of LNet are fine-tuned with attribute labels, their response maps over the entire image strongly indicate the face’s location.

(3) It also demonstrates that the high-level hidden neurons of ANet automatically discover semantic concepts after pre-training, and that these concepts are significantly enriched after fine-tuning.

Pre-training and fine-tuning

LNet and ANet are first pre-trained differently and then jointly trained with attribute labels.

LNet is pre-trained by classifying massive general object categories. Thus, its pre-trained features generalize well across varied background clutter. LNet is then fine-tuned by predicting attributes.

ANet is pre-trained by classifying massive face identities, to obtain a discriminative face representation. It is then fine-tuned on the attribute prediction task.


A filter (or a group of filters) functions as a detector of an attribute. When a subset of neurons is activated, it indicates the presence of face images with a particular attribute configuration. The neurons at different layers can form many activation patterns, implying that the whole set of face images can be divided into many subsets based on attribute configurations, with each activation pattern corresponding to one subset (e.g. ‘pointy nose’, ‘rosy cheek’, and ‘smiling’). Therefore, it is not surprising that filters learned by attribute prediction lead to effective representations for face localization. Simply averaging and thresholding the response maps achieves good face localization.
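The averaging-and-thresholding step can be sketched as follows. This is a minimal illustration in plain Python, not the paper's implementation: the function name `localize_face`, the toy response maps, and the threshold value are all assumptions made here, and the maps are assumed to be equally sized 2-D grids.

```python
def localize_face(response_maps, threshold):
    """Average the response maps and box the cells above the threshold.

    Returns (top, left, bottom, right) with inclusive indices,
    or None when no cell exceeds the threshold.
    """
    n = len(response_maps)
    rows = len(response_maps[0])
    cols = len(response_maps[0][0])
    # Element-wise average over all response maps.
    avg = [[sum(m[r][c] for m in response_maps) / n for c in range(cols)]
           for r in range(rows)]
    # Keep the coordinates whose averaged response exceeds the threshold.
    hits = [(r, c) for r in range(rows) for c in range(cols)
            if avg[r][c] > threshold]
    if not hits:
        return None
    rs = [r for r, _ in hits]
    cs = [c for _, c in hits]
    return (min(rs), min(cs), max(rs), max(cs))


# Two toy 4x4 response maps (hypothetical values) that both fire in the centre.
toy_maps = [
    [[0, 0, 0, 0], [0, 1.0, 1.0, 0], [0, 1.0, 1.0, 0], [0, 0, 0, 0]],
    [[0, 0, 0, 0], [0, 0.8, 0.6, 0], [0, 0.9, 0.7, 0], [0.2, 0, 0, 0]],
]
face_box = localize_face(toy_maps, threshold=0.5)
```

The weak response in the bottom-left corner of the second map averages to 0.1 and is discarded, so only the central 2x2 block is boxed.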

With this pre-training and fine-tuning strategy for ANet, each face attribute is well explained by a sparse linear combination of these semantic concepts. Analyzing the coefficients of these combinations shows that attributes form clear groups, which can be interpreted semantically.

Structure Of Framework


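1 LNeto localizes the head and shoulders

2 LNets localizes the face (a more precise localization)

3 ANet represents the face and predicts its attributes


4 SVM classifies the face features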
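The four stages above can be chained as in the sketch below. It only illustrates the data flow: `crop`, `predict_attributes`, and the toy stand-ins for LNeto, LNets, ANet, and the SVMs are hypothetical placeholders, not the actual networks or classifiers.

```python
def crop(image, box):
    """Crop a 2-D grid to an inclusive (top, left, bottom, right) box."""
    top, left, bottom, right = box
    return [row[left:right + 1] for row in image[top:bottom + 1]]


def predict_attributes(image, lnet_o, lnet_s, anet, svms):
    """Run the cascade: LNeto -> LNets -> ANet -> one SVM per attribute."""
    head_shoulder = crop(image, lnet_o(image))       # 1. coarse localization
    face = crop(head_shoulder, lnet_s(head_shoulder))  # 2. fine localization
    features = anet(face)                            # 3. face representation
    # 4. each attribute has its own classifier on the shared features
    return {name: svm(features) for name, svm in svms.items()}


# Toy stand-ins (assumptions for illustration only).
image = [[r * 6 + c for c in range(6)] for r in range(6)]
toy_lnet_o = lambda img: (1, 1, 4, 4)   # fixed head-shoulder box
toy_lnet_s = lambda img: (1, 1, 2, 2)   # fixed face box inside that crop
toy_anet = lambda face: [sum(row) for row in face]
toy_svms = {
    "smiling": lambda f: f[0] + f[1] > 60,
    "eyeglasses": lambda f: f[0] > f[1],
}
result = predict_attributes(image, toy_lnet_o, toy_lnet_s, toy_anet, toy_svms)
```

Note that the face box returned by the second stage is expressed relative to the head-shoulder crop, not the full image.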






3.1. Coarse-to-fine Face Localization


3.2. Feature Extraction

4. Learning Algorithms
The convolutional structures (C1 to C5) of LNet+ are designed in the same way as those of LNeto and LNets. Two fully-connected hidden layers are added on top of C5 to improve the non-linearity for classification.
All the filters of LNeto and LNets are initialized from LNet+ after pre-training.
LNeto takes the full image xo as input.
LNets takes the head-shoulder image xs as input.
Two fully-connected layers are added to both LNeto and LNets, with weight matrices initialized randomly.
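The initialization scheme described here (copy the pre-trained convolutional filters, then add randomly initialized fully-connected layers) might look like the following sketch. Weights are represented as nested lists for simplicity; the function name, the layer names `FC1` and `FC2`, the layer shapes, and the Gaussian standard deviation of 0.01 are assumptions, not values taken from the paper.

```python
import random


def init_from_pretrained(pretrained, fc_shapes, seed=0):
    """Build a network dict: conv filters copied, FC weights drawn at random."""
    rng = random.Random(seed)
    net = {}
    # Copy the pre-trained filters of C1 to C5 (row-wise copy so that later
    # fine-tuning does not modify the pre-trained weights in place).
    for name in ("C1", "C2", "C3", "C4", "C5"):
        net[name] = [row[:] for row in pretrained[name]]
    # New fully-connected layers start from small random weights (assumed std).
    for name, (n_rows, n_cols) in fc_shapes.items():
        net[name] = [[rng.gauss(0.0, 0.01) for _ in range(n_cols)]
                     for _ in range(n_rows)]
    return net


# Toy "pre-trained" filters standing in for LNet+ (hypothetical values).
pretrained = {"C%d" % i: [[0.1 * i, 0.2 * i]] for i in range(1, 6)}
fc_shapes = {"FC1": (3, 2), "FC2": (2, 3)}
lnet_o = init_from_pretrained(pretrained, fc_shapes, seed=1)
lnet_s = init_from_pretrained(pretrained, fc_shapes, seed=2)
```

Both networks start from identical convolutional filters but different fully-connected weights, so any later divergence in C1 to C5 comes from fine-tuning on their different inputs.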

ANet takes the estimated face region xf as input.

1 Theoretical basis for SVM classification of face attributes as linear combinations
  The pre-training of ANet essentially discovers semantic concepts related to identity.
  The attributes present in each test image are explained by a sparse linear combination of these concepts.
  For instance, the first image is described as “a lady with big bang, brown hair, pale skin, narrow eyes, and high cheekbone”, which completely matches human perception.
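As a rough illustration of "a sparse linear combination of concepts", the sketch below scores one attribute as a weighted sum over only the few concepts that have non-zero coefficients, in the style of a linear SVM decision function. The concept names, activations, coefficients, and bias are invented for the example and do not come from the paper.

```python
def attribute_score(concepts, coefficients, bias=0.0):
    """Linear score w . f + b, where w is stored sparsely as a dict.

    Concepts missing from `coefficients` have an implicit weight of zero.
    """
    return sum(w * concepts[name] for name, w in coefficients.items()) + bias


# Hypothetical concept activations for one face image.
concepts = {
    "bangs": 0.9,
    "brown_hair": 0.8,
    "beard": 0.0,
    "pale_skin": 0.7,
    "hat": 0.1,
}
# Only two of the five concepts carry weight for this hypothetical attribute.
wavy_hair_coefficients = {"brown_hair": 0.5, "bangs": 1.0}

score = attribute_score(concepts, wavy_hair_coefficients, bias=-1.0)
has_attribute = score > 0  # positive side of the decision boundary
```

Because most coefficients are zero, the non-zero ones show directly which concepts an attribute relies on, which is what makes the grouping of attributes interpretable.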

Different attributes capture information from different regions of the face. We show that ANet automatically learns to discover these regions.