Convolutional neural net face recognition works in non-human-like ways

Convolutional neural networks (CNNs) give state-of-the-art performance in many pattern recognition problems but can be fooled by carefully crafted patterns of noise. We report that CNN face recognition systems also make surprising 'errors'. We tested six commercial face recognition CNNs and found that they outperform typical human participants on standard face-matching tasks. However, they also declare matches that humans would not, where one image from the pair has been transformed to appear a different sex or race. This is not due to poor performance; the best CNNs perform almost perfectly on the human face-matching tasks, but also declare the most matches for faces of a different apparent race or sex. Although differing on the salience of sex and race, humans and computer systems are not working in completely different ways. They tend to find the same pairs of images difficult, suggesting some agreement about the underlying similarity space.


Introduction
Convolutional Neural Networks (CNNs) have transformed pattern recognition, achieving state-of-the-art performance in many applications, including automated face recognition (AFR) (1). However, they can be deceived by noise patterns, either on their own or added to another image (2). For example, an image that to humans looks like a dog might be classified as a penguin. The current controversies around the public use of AFR, and the lack of clear legislation governing it, have resulted in a ban on its use in some places. It is therefore important to investigate whether the same sort of problems apply to faces.
What can you say about the people depicted in Figure 1? The second image (B) is a composite made of several actors. The first (A) is the same face transformed to look female; the third (C) is the same face transformed to look African American. This work originated with the observation that a state-of-the-art CNN face recognition system reports that images A and C are both a match for B. This is surprising because, to the human observer, image A appears to be a different sex, and image C a different race, from image B.

Figure 1. Variations on a face: A) transformed to look female, B) the original, a composite of a number of actors, C) transformed to look African American.
To test the generality of this finding and to try to understand something of the reason, we tested six commercial CNN face recognition engines during July and August 2019. In alphabetical order, these were Amazon Rekognition, Face++, FaceSoft, FaceX, Kairos, and Microsoft. The testing reported here absolutely does not speak to whether one system is better than another. Rather, our aim was to examine whether this insensitivity to changes in race or sex is common to a variety of different CNNs. In what follows, therefore, these systems will be referred to only by a number, in a different order to the list above.
The systems were tested on two types of task. The first type comprised four matching tasks designed to be difficult for human observers. In these matching tasks, human participants are shown two images and must decide whether they show the same person or two different people. Some pairs are 'matched' (same identity), and some are 'mismatched' (different identity). The second set of tests involved matching tasks with faces that had been transformed to appear either a different sex or a different race, using PsychoMorph (see Figure 2) (3). The CNNs were tested on the match between a transformed face and a different original photograph of the same person.

Human matching tasks
We used a) the Kent Face Matching Task (Kent) (4), b) the Models Matching Task (Models) (5), c) the Makeup Task (Makeup) using pictures of women with and without heavy makeup that were obtained from YouTube videos (6), and d) the Dutch Matching Task (Dutch), using images of two Dutch TV personalities chosen for their similar appearance (7).
The long version of the Kent task (4) consists of 200 match trials and 20 mismatch trials constructed from a set of 252 pairs of colour images. In each pair, one is a studio picture showing head and shoulders at a resolution of 283x332 pixels; the other is a passport-style image, taken some months earlier, showing just the head at a resolution of 142x192 pixels. The human data reported here come from the original paper (4); there were fifty participants with an average age of 19.5 years.
The Models task uses pictures of male models, who often vary markedly in appearance between shoots (5), and consists of a total of 90 trials, half matched and half mismatched, in three blocks A-C. The three sets were constructed to be of equal difficulty, based on pilot testing. All pairs were presented side by side, in colour, at a resolution of 300x420 pixels. The human data reported here came from 80 participants, average age 22, who had been tested on sets A and B only (8). The computer systems were tested on all 90 trials.
The Makeup task, created in our lab, uses images from YouTube makeup videos posted by vloggers from non-English speaking countries, to reduce the chance of familiarity for our participants (6). Four images of each vlogger were sourced, two with and two without makeup. Pairs of images consisted of both with makeup, both without makeup, and one with and one without makeup. Half of the trials were matched, and half were mismatched. Images were presented side by side, at a resolution of 320 pixels square, in colour and tightly cropped around the face. There were 294 pairs in total. We collected data from 48 participants, average age 26. Each was tested on 48 trials, 16 in each makeup condition, with the set used counterbalanced across participants to test 288 image pairs in total. The computer systems were tested on all 294 pairs.
The Dutch task, another in-house created test, consisted of 96 trials featuring two Dutch TV presenters, Bridget Maasland and Chantal Jantzen. Half the trials were matched and half mismatched. Of the mismatch trials, half consisted of one image of each presenter, while the other half showed one of the presenters with a similar-looking third person (different for each pair). The images chosen were highly varied, including some that were tightly cropped around the face, and ranged in size from 187x184 pixels to 548x500. We collected data from 60 participants, average age 22.
For all the sets, the mismatch items were chosen by the experimenters who created the task, based on visual similarity (e.g. same age, hair colour, eye colour).

Face Transforms
The transformed images were generated using Psychomorph (3). This can compute the average appearance of a set of faces, for example white male actors (see Figure 2A). It then computes the difference, in shape and colour, between this and another average, for example white female or African American male, and applies this vector difference to a third image to change its appearance. For instance, by adding the sex vector difference to a female identity, one can transform this identity to appear male while keeping all remaining information constant. That is, the new face deviates from the male average in the same way that the original deviated from the female average. This method was used to generate the images in Figure 1. The CNNs were tested on pairs made up of a face transformed by apparent sex or race and a different, unaltered photograph of the same person. Before any transformation, all these pairs should therefore be declared as matched; the question is whether they are still declared a match when one of the faces appears to be a different sex or race.
The faces for transformation came from the Glasgow Unfamiliar Face database (9), sets C1 and DV (images of the same people, taken with different cameras), which we had previously 'marked up' for use in Psychomorph (3) using the 179 points shown in Figure 2. Faces were transformed by sex or race by adding the vector difference between the two average faces to each original face. In the case of the sex transform, 120% of the difference was added, to produce an image more clearly of the opposite sex in all cases. The C1 images were transformed, and the DV images used to test for a match.
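The shape part of this transform can be illustrated with a minimal Python sketch. It is not Psychomorph's code or API: the function name `transform_shape`, the three-landmark toy faces and their coordinates are all invented for illustration, and the real procedure also transforms colour, which is omitted here. The sketch only shows the arithmetic described above, adding a scaled difference between two averages to an individual face, with a strength of 1.2 standing in for the 120% used for the sex transform.

```python
def transform_shape(face, source_avg, target_avg, strength=1.0):
    """Shift each (x, y) landmark of `face` by strength * (target_avg - source_avg).

    Hypothetical helper for illustration; all three arguments are lists of
    (x, y) landmark coordinates in the same order.
    """
    if not (len(face) == len(source_avg) == len(target_avg)):
        raise ValueError("all landmark lists must have the same length")
    return [
        (fx + strength * (tx - sx), fy + strength * (ty - sy))
        for (fx, fy), (sx, sy), (tx, ty) in zip(face, source_avg, target_avg)
    ]


# Toy data: three landmarks instead of 179, with invented coordinates.
male_avg = [(0.0, 0.0), (10.0, 0.0), (5.0, 8.0)]
female_avg = [(0.5, 0.0), (9.5, 0.0), (5.0, 7.0)]
face = [(1.0, 1.0), (11.0, 1.0), (6.0, 10.0)]

# Add 120% of the male-to-female difference to the individual face.
feminised = transform_shape(face, male_avg, female_avg, strength=1.2)
print(feminised)
```

Because the same offset is added to every face, the transformed face deviates from the target average exactly as the original deviated from the source average when the strength is 1.0.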
The average images used to perform the male race transform (Figure 2 A & B) came from images of white and African American students collected by Chris Meissner at the University of South Florida, and for the female set from Lisa DeBruine, collected in London. The male and female averages (Figure 2 C & D) were produced at Stirling, using many images of white actors and actresses. The average images and pointers to the face sets used are on our OSF page.
The Kairos system returns information about the sex and race of each face. We used this to assess the effectiveness of the transformations, at least as assessed by this CNN (it is a test of our manipulation, not of the CNN). The Glasgow face database participants are not all white; the biggest minority have south Asian origins. Rather than picking 'suitable' faces for transformation, we used the whole set. This was to ensure the faces used for transforms were not already 'biased' to look a specific way; for example, we did not purposefully select feminine-looking males. Table 1 shows the Kairos system classification for the Glasgow faces before and after each transformation. The CNN clearly indicates that the race transform has mostly worked; the sex transform works for women but less completely for men (some of whom have beards). Evidently the race transform also makes women look more masculine, according to the CNN.
Table 1. Sex and race classifications returned by the Kairos system for the original and transformed Glasgow faces. Sex classification is binary, so any faces not classed as female are classed as male. However, some faces do not reach a threshold of 0.5 for any race, so the total of the four race categories may not add up to the total N.

Human matching tasks
Table 2. Performance of the CNNs and humans on the four matching tasks: match/mismatch percent correct. AUC is area under the curve across all tests combined, calculated using perfcurve in Matlab.
Table 2 shows the performance of the CNNs and human participants on the four matching tasks. It is apparent that the computer systems mostly far outperform the humans, who average around 70% correct for both match and mismatch trials on these tasks. CNNs 1 and 4 show near-perfect performance, while others would improve with a different decision threshold (the decision boundary between match and mismatch). A high threshold means that a higher similarity score is required for a match to be declared. We used the manufacturer-recommended threshold in all tests. For these images, the threshold for CNN 3 could afford to be higher, reducing false positives on mismatch trials, and that for CNN 5 lower, which would improve the hit rate on matches. To compare performance with thresholds that are optimal for these tasks, the area under the curve is also shown. An AUC of 1 is perfect, meaning that the system can detect all the true matches with no false matches. CNN 1 made only 6 errors out of 700 trials, including one where it could not find the face. The human score, which is averaged over many participants rather than taken from individuals, corresponds to about 90% of matches at a false match rate of 10%.
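The relationship between a decision threshold, the resulting hit and false match rates, and the threshold-free AUC can be sketched in a few lines of Python. This is an illustration only, not the perfcurve analysis used for Table 2: the similarity scores are invented, the function names are hypothetical, and the AUC is computed through its standard equivalence to the probability that a randomly chosen match pair scores higher than a randomly chosen mismatch pair (ties counting half).

```python
def rates_at_threshold(match_scores, mismatch_scores, threshold):
    """Return (hit rate, false match rate) when scores >= threshold are declared a match."""
    hit_rate = sum(s >= threshold for s in match_scores) / len(match_scores)
    false_match_rate = sum(s >= threshold for s in mismatch_scores) / len(mismatch_scores)
    return hit_rate, false_match_rate


def auc(match_scores, mismatch_scores):
    """Area under the ROC curve as the fraction of (match, mismatch) score pairs
    in which the match pair scores higher, counting ties as half."""
    wins = 0.0
    for m in match_scores:
        for n in mismatch_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(match_scores) * len(mismatch_scores))


# Invented similarity scores for four match and four mismatch trials.
match_scores = [0.9, 0.8, 0.7, 0.4]
mismatch_scores = [0.6, 0.3, 0.2, 0.1]

print(rates_at_threshold(match_scores, mismatch_scores, 0.5))   # (0.75, 0.25)
print(rates_at_threshold(match_scores, mismatch_scores, 0.65))  # (0.75, 0.0)
print(auc(match_scores, mismatch_scores))                       # 0.9375
```

In this toy example, raising the threshold from 0.5 to 0.65 removes the single false match without losing any hits, which is the kind of adjustment described above; the AUC is unchanged because it does not depend on any particular threshold.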
For the human data, there is a slight bias to say mismatch in the last three tests, where there are equal numbers of match and mismatch trials. The Kent set has 200 match trials and only 20 mismatches, which is a more realistic ratio in applied contexts. Humans respond by being more likely to say match overall. By default, computer systems have a fixed threshold and are not affected by match-mismatch ratios. It is, however, possible to change the decision threshold depending on the relative likelihood and cost of errors in either direction. Thus, if face recognition is used to permit access to a secure area or bank account, a high threshold might be set to minimize the risk of fraudulent access. Conversely, a low threshold might be used if looking for face matches in a family photograph collection, which would give the best chance of finding people at the minor inconvenience of increasing false matches.
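One way to make the trade-off between likelihood and cost concrete is to pick the threshold that minimizes the expected cost of errors on a set of scored pairs. The Python sketch below is an assumption-laden illustration, not a procedure used in this study or recommended by any vendor: the function `best_threshold`, the scores, the prior probability of a match and the cost values are all invented.

```python
def best_threshold(match_scores, mismatch_scores, p_match, cost_miss, cost_false_match):
    """Return the candidate threshold with the lowest expected error cost.

    Scores >= threshold are declared a match. `p_match` is the assumed prior
    probability that a pair is a true match; the two costs are in arbitrary units.
    """
    candidates = sorted(set(match_scores) | set(mismatch_scores))
    candidates.append(max(candidates) + 1.0)  # a threshold that rejects every pair

    def expected_cost(threshold):
        miss_rate = sum(s < threshold for s in match_scores) / len(match_scores)
        false_match_rate = sum(s >= threshold for s in mismatch_scores) / len(mismatch_scores)
        return (p_match * cost_miss * miss_rate
                + (1.0 - p_match) * cost_false_match * false_match_rate)

    return min(candidates, key=expected_cost)


# Invented similarity scores for four match and four mismatch pairs.
match_scores = [0.9, 0.8, 0.7, 0.4]
mismatch_scores = [0.6, 0.3, 0.2, 0.1]

# Secure access: a false match is assumed to cost 100 times more than a miss.
secure = best_threshold(match_scores, mismatch_scores, 0.5, cost_miss=1, cost_false_match=100)

# Family photo search: a miss is assumed to cost 10 times more than a false match.
album = best_threshold(match_scores, mismatch_scores, 0.5, cost_miss=10, cost_false_match=1)

print(secure, album)  # 0.7 0.4
```

With these made-up numbers the secure-access setting ends up with the higher threshold and the photo-collection setting with the lower one, mirroring the two scenarios described above.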

Transformed images
Table 3. Percentage of sex and race transformed faces reported as 'matched' to different original images of the same person (see Figure 1 for illustration).
Table 3 shows the results of testing the CNNs on the transformed face tasks. Most systems are largely blind to the sex transform, declaring the transformed faces to be a match. The results from the race transform are more variable, ranging from 0 to 76% of images declared a match. The striking finding is that system 1, which does best in the four standard face matching tests, makes the most 'errors' (i.e. declares the most matches) on the transformed faces. Table 4 shows the almost perfect Spearman rank correlation between performance on the four face matching tasks and matches on the transformed images. This correlation appears to be driven more strongly by the match scores, totalled across all four tests, than by the mismatches. These findings demonstrate that high-performing current CNNs are relatively blind to face changes that would cause a human observer to reject a match out of hand, on the grounds of being the wrong sex or race.
One possible explanation is that the CNNs use a completely different similarity metric to human observers. We can start to investigate this by looking at the rank correlations between the confidence scores returned by the CNNs and the average accuracy of humans for each trial. High correlations would indicate that the CNNs and humans have a similar ordering of difficulty of matching the pairs, whereas low correlations would suggest different matching strategies. Indeed, it has been reported that a CNN and human observers have a similar hierarchy of the salience of changes to facial appearance. Abudarham, Shkiller and Yovel made controlled changes to faces and found a correlation between the similarity scores given by human observers and the match scores from a CNN (11).
It is important to note that our data do not speak to the issue of whether CNNs show differential performance between races or sexes (see 12 for a recent survey). What we are reporting is that the networks are relatively blind to variations on these dimensions that humans regard as highly salient. It could be said that CNNs do not share our preconceptions. This need not be the case; a simple model that learns identity by using Linear Discriminant Analysis on pre-processed face images learns sex as the first dimension and race as the second (13). CNNs can certainly be trained to identify sex and race. It would therefore be possible to add a rule that explicitly says that if the images do not match on sex or race, they are not the same person. Whether that would be desirable is questionable: transgender individuals are indeed the same person, and explicit attempts have been made to recognize people before and after transition (e.g. 14).
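The item-level comparison between CNN confidence scores and human accuracy amounts to a Spearman rank correlation, which can be sketched with the Python standard library alone. The five trials below are invented, and the helper names are hypothetical; this is a minimal illustration of the statistic, not the analysis script used for the reported correlations.

```python
import math


def average_ranks(values):
    """Rank values from 1 (smallest) upwards, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented data for five match trials: CNN confidence and proportion of humans correct.
cnn_scores = [0.95, 0.80, 0.60, 0.99, 0.70]
human_accuracy = [0.90, 0.75, 0.55, 0.85, 0.70]

rho = spearman(cnn_scores, human_accuracy)
print(round(rho, 3))  # 0.9
```

A high value means that the pairs the CNN scores least confidently are also the pairs humans most often get wrong, regardless of the scale on which each system reports its scores.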

Conclusion
The challenge in face matching is to see past the variations caused by changes in factors such as lighting, viewpoint, expression and age to decide whether two images depict the same identity. CNNs are trained by showing millions of images, depicting thousands of identities, and learning to match pictures to a specific identity. They learn to recognize the variety of images that can represent the same person, and to ignore the non-identity variation. It appears that the CNNs that are best able to ignore this extraneous variation are also most prone to ignoring variations that are obvious to humans. At one level of description, these nets are doing very well: it really is the same underlying face, transformed in a way that the net has never been exposed to before. The CNNs may therefore declare the face to be a match, which is not necessarily wrong, just surprising from a human perspective.

Figure 2. A & B: white and African American male average images, showing the 179 reference points used. C & D: average male and female faces.

Table 4. Rank correlations between CNN performance (AUC, matches and mismatches averaged over all four tests) and declared matches on transformed faces (* significant at .05, ** at .01, two-tailed, N=6).

Table 5. Average Spearman rank correlations between CNNs and human performance by item; match trials in bold, top right, and mismatch trials bottom left.

Table 5 shows the rank correlations, averaged across all four tests via a Z-transform, between the CNNs and with the human data. The correlations between the CNNs are high, with an overall average of 0.72 for matches and 0.57 for mismatches. The correlations between the CNNs and the human data are remarkably consistent across systems, averaging 0.41 for match trials and 0.27 for mismatch trials. A correlation of 0.4 is similar to what you might expect for humans doing two different face tasks (e.g. 10). So, while the CNNs agree more with each other than they do with the humans, the similarity metrics used clearly have something in common.
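Averaging correlations via a Z-transform is usually done with Fisher's transform: each correlation r is converted to z = atanh(r), the z values are averaged, and the mean is converted back with tanh. The sketch below assumes this is the procedure meant here and uses an unweighted mean; the function name and the four example correlations are invented, not values from Table 5.

```python
import math


def average_correlation(correlations):
    """Average correlations via Fisher's z: mean of atanh(r), mapped back with tanh.

    Assumes an unweighted mean and that every r lies strictly between -1 and 1.
    """
    z_values = [math.atanh(r) for r in correlations]
    return math.tanh(sum(z_values) / len(z_values))


# Invented correlations for one CNN against the human data on the four tests.
per_test = [0.30, 0.45, 0.50, 0.35]
print(average_correlation(per_test))
```

Working in z space is preferred to a plain arithmetic mean because the sampling distribution of r is skewed, increasingly so as r approaches 1; the two averages are close when the correlations are moderate and similar to each other.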