Model card: Person Segmentation (v0.11)

Model details

Model date: 7/25/2022
Model version: 0.11
License: refer to the terms of service for Lightship.

Technical specifications

The person segmentation model returns a floating point value from 0 to 1 for each pixel indicating the probability of that pixel being part of a person. This value is then thresholded to return a boolean mask of presence/absence of “person” at each pixel.
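
The thresholding step described above can be sketched as follows. This is a minimal illustration, not the ARDK implementation: the function name `threshold_mask`, the nested-list image representation, and the 0.5 default threshold are all assumptions, since this card does not state the threshold actually used.

```python
from typing import List


def threshold_mask(probabilities: List[List[float]], threshold: float = 0.5) -> List[List[bool]]:
    """Turn per-pixel person probabilities (0 to 1) into a boolean mask.

    The 0.5 default is an assumption made for illustration only.
    """
    return [[p >= threshold for p in row] for row in probabilities]


# Hypothetical 2x3 probability map (rows x columns).
probs = [
    [0.02, 0.40, 0.91],
    [0.10, 0.75, 0.99],
]
mask = threshold_mask(probs)

# Query presence/absence of "person" at pixel (row=1, column=1).
is_person = mask[1][1]
```

A lower threshold would mark more pixels as "person" at the cost of more false positives; a higher one does the opposite.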

Intended use

Intended use cases

General semantic segmentation of people for augmented reality applications accessed through the Lightship ARDK.
Querying the presence or absence of a person at any specified pixel in the camera feed.
Using the semantic mask for “person” to enable screen space effects.

Permitted users

Augmented reality developers through Niantic Lightship.

Out-of-scope use cases

This model does not provide the capability to:

Segment individual people (instance segmentation)
Track individuals
Identify or recognize individuals

Factors

The following factors apply to all semantic segmentation provided in the Lightship ARDK, including person segmentation:

Scale: objects / classes may not be segmented if they are very far away from the camera.
Lighting: extreme light conditions may affect the overall performance.
Viewpoint: extreme camera views that have not been seen during training may lead to a miss in detection or a class confusion.
Occlusion: objects / classes may not be segmented if they are covered by other objects.
Motion blur: fast camera or object motion may degrade the performance of the model.
Flicker: predictions are made frame by frame and no temporal smoothing or context is applied; this may lead to a ‘jittering’ effect between predictions of temporally adjacent frames.
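
Because no temporal smoothing is applied by the model, an application that is sensitive to flicker could smooth the per-pixel probabilities itself before thresholding. The sketch below uses an exponential moving average; it is not part of the model or of ARDK, and the function name `smooth_probabilities` and the weight `alpha = 0.6` are hypothetical choices.

```python
from typing import List, Optional


def smooth_probabilities(
    previous: Optional[List[float]], current: List[float], alpha: float = 0.6
) -> List[float]:
    """Blend the current frame's probabilities with the smoothed previous frame.

    alpha is the weight given to the current frame; alpha = 1.0 disables smoothing.
    """
    if previous is None:
        return list(current)
    return [alpha * c + (1.0 - alpha) * p for p, c in zip(previous, current)]


# Hypothetical probabilities for two pixels over three consecutive frames.
frames = [[0.9, 0.1], [0.3, 0.2], [0.9, 0.1]]

state: Optional[List[float]] = None
smoothed: List[List[float]] = []
for frame in frames:
    state = smooth_probabilities(state, frame)
    smoothed.append(state)
```

In the second frame the raw probability of the first pixel drops to 0.3, while the smoothed value stays near 0.54, so a 0.5 threshold would not flip that pixel. The trade-off is lag: the smoothed mask reacts more slowly to genuine motion.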

For person segmentation specifically, known problems with computer vision technology lead us to identify the following potentially relevant factors, each of which defines subgroups:

Geographical region
Skin tone
Gender
Body posture: certain body configurations may be harder to predict due to appearing less often in the training corpus.
Other: age, fashion style, accessories, body alterations

Fairness evaluation

At Niantic we strive for our technology to be inclusive and fair by following strict equality and fairness practices when building, evaluating, and deploying our models. We define person segmentation fairness as follows: a model makes fair predictions if it performs equally well on images that depict a variety of the identified subgroups. The evaluation results focus on measuring the performance of the person segmentation channel across the subgroups of the first three factors (geographical region, skin tone and gender).

Instrumentation and dataset details

Our benchmark dataset comprises 5650 images captured around the world using the back camera of a smartphone, with the following specifications:

Only one person per image is depicted.
Both indoor and outdoor environments.
Captured with a variety of devices.
No occlusions.
Full body within the frame of the image in a variety of poses.

Images are labeled with the following attributes:

Geographical region: based on the UN geoscheme, with the European subregions merged into a single region and Micronesia, Polynesia and Melanesia merged into another:
- Northern Africa
- Eastern Africa
- Middle Africa
- Southern Africa
- Western Africa
- Caribbean
- Central America
- South America
- Northern America
- Central Asia
- Eastern Asia
- South Eastern Asia
- Southern Asia
- Western Asia
- Europe
- Australia and New Zealand
- Melanesia, Micronesia and Polynesia
Skin tone: following the Fitzpatrick scale, images are annotated with a subgroup from 1 to 6. The skin tone is annotated by the person in the image, so it is a self-reported value.
Gender: images are annotated with self-reported gender.

Metrics

The standard metric for evaluating a segmentation model, and the one used here, is Intersection over Union (IoU). It is computed as follows:

IoU = true_positives / (true_positives + false_positives + false_negatives)

Reported IoUs are averages (mean IoU or mIoU) over images belonging to the referenced subgroup unless stated otherwise.
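
The IoU formula and the per-subgroup averaging can be sketched as below. This is an illustration under assumptions rather than the evaluation code behind this card: masks are represented as flattened lists of booleans, the function names are hypothetical, and treating two empty masks as a perfect match is a convention chosen here.

```python
from typing import Iterable, List, Sequence, Tuple


def iou(predicted: Sequence[bool], truth: Sequence[bool]) -> float:
    """Intersection over Union of two flattened boolean masks."""
    if len(predicted) != len(truth):
        raise ValueError("masks must have the same number of pixels")
    true_positives = sum(1 for p, t in zip(predicted, truth) if p and t)
    false_positives = sum(1 for p, t in zip(predicted, truth) if p and not t)
    false_negatives = sum(1 for p, t in zip(predicted, truth) if not p and t)
    denominator = true_positives + false_positives + false_negatives
    # Assumption: two empty masks count as a perfect match.
    return 1.0 if denominator == 0 else true_positives / denominator


def mean_iou(pairs: Iterable[Tuple[Sequence[bool], Sequence[bool]]]) -> float:
    """Average the per-image IoU over all (predicted, truth) pairs of a subgroup."""
    scores: List[float] = [iou(p, t) for p, t in pairs]
    return sum(scores) / len(scores)


# Two hypothetical 4-pixel images belonging to the same subgroup.
image_a = ([True, True, False, False], [True, False, True, False])  # IoU = 1/3
image_b = ([True, True, False, False], [True, True, False, False])  # IoU = 1
subgroup_miou = mean_iou([image_a, image_b])
```

Note that mIoU here averages per-image IoU values; it does not pool pixel counts across images before dividing.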

Fairness criteria

A model is considered to be making unfair predictions if its performance (mIoU) for a particular subgroup is three or more standard deviations away from the average across all the subgroups.
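
The criterion above can be expressed as a small check. This is a sketch under assumptions: the function name `unfair_subgroups` is hypothetical, and the use of the sample standard deviation (n - 1 in the denominator) is inferred from the figures reported in this card rather than stated by it. The input values are the skin tone mIoUs from the results table.

```python
from statistics import mean, stdev
from typing import Dict, List


def unfair_subgroups(miou_by_subgroup: Dict[str, float], num_stdevs: float = 3.0) -> List[str]:
    """Return the subgroups whose mIoU lies num_stdevs or more standard deviations from the mean."""
    values = list(miou_by_subgroup.values())
    average = mean(values)
    # Assumption: sample standard deviation across subgroup mIoUs.
    limit = num_stdevs * stdev(values)
    if limit == 0:
        return []  # all subgroups perform identically
    return [name for name, value in miou_by_subgroup.items() if abs(value - average) >= limit]


# Skin tone mIoU values (in percent) from the results table in this card.
skin_tone_miou = {"1": 85.45, "2": 84.48, "3": 84.14, "4": 83.28, "5": 84.02, "6": 81.72}

flagged = unfair_subgroups(skin_tone_miou)
limit = 3.0 * stdev(list(skin_tone_miou.values()))
```

With these inputs the mean is about 83.85, the threshold is about 3.78 percentage points, and no subgroup is flagged, in line with the skin tone results reported in this card.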

Results

Geographical evaluation results

Average performance across all 17 regions is 83.85% with a standard deviation of 2.06%. All regions yield a performance in the range of [80.06%, 87.07%]. The maximum difference between the mean and the worst performing region is 3.79%, within our fairness criterion threshold of 3 stdevs (3x2.06 = 6.18%).

Region	mIoU	stdev	Number of images
Northern Africa	85.37%	12.41%	301
Eastern Africa	83.61%	14.82%	336
Middle Africa	84.57%	14.83%	322
Southern Africa	83.15%	15.62%	368
Western Africa	80.81%	18.50%	364
Caribbean	84.52%	13.95%	412
Central America	85.14%	11.68%	415
South America	83.30%	16.19%	397
Northern America	80.06%	18.48%	335
Central Asia	87.07%	10.81%	229
Eastern Asia	86.06%	12.06%	346
South Eastern Asia	81.47%	14.83%	333
Southern Asia	83.64%	15.32%	353
Western Asia	85.94%	13.37%	370
Europe	86.26%	11.87%	320
Australia and New Zealand	82.34%	14.84%	374
Melanesia, Micronesia and Polynesia	82.10%	21.57%	75
Average (across all images)	83.86%	14.89%	5650
Average (across regions)	83.85%	2.06%	-

Skin tone evaluation results

Average performance across all six skin tones is 83.85% with a standard deviation of 1.26%. All skin tone subgroups yield a performance in the range of [81.72%, 85.45%]. The maximum difference between the mean and the worst performing skin tone subgroup is 2.13%, within our fairness criterion threshold of 3 stdevs (3x1.26 = 3.78%).

Skin tone (Fitzpatrick scale)	mIoU	stdev	Number of images
1	85.45%	10.87%	247
2	84.48%	13.81%	1919
3	84.14%	14.20%	1463
4	83.28%	15.57%	457
5	84.02%	14.70%	706
6	81.72%	18.19%	858
Average (across all images)	83.86%	14.83%	5650
Average (across skin tones)	83.85%	1.26%	-
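
The two "Average" rows differ in how they weight the data: the average across all images weights each subgroup by its number of images, while the average across skin tones gives every subgroup equal weight. The sketch below recomputes both from the rounded table values, so the results may differ from the reported figures in the last digit; the variable names are illustrative.

```python
# (mIoU in percent, number of images) per Fitzpatrick subgroup, copied from the table above.
skin_tone_rows = {
    1: (85.45, 247),
    2: (84.48, 1919),
    3: (84.14, 1463),
    4: (83.28, 457),
    5: (84.02, 706),
    6: (81.72, 858),
}

total_images = sum(count for _, count in skin_tone_rows.values())

# Average across all images: each subgroup mIoU weighted by its image count.
image_weighted = sum(miou * count for miou, count in skin_tone_rows.values()) / total_images

# Average across skin tones: every subgroup counts equally.
subgroup_average = sum(miou for miou, _ in skin_tone_rows.values()) / len(skin_tone_rows)
```

The image-weighted value is pulled towards the largest subgroups (2 and 3), which is why the fairness criterion is evaluated on the unweighted subgroup average instead.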

Gender evaluation results

Average performance across the evaluated gender subgroups is 83.76%, with a range of [82.58%, 84.93%]. The difference between the average and the worst performing gender is 1.18%, within our fairness criterion threshold of 3 stdevs (3x1.66 = 4.98%).

Gender (self-reported)	mIoU	stdev	Number of images
Female	82.58%	15.98%	2585
Male	84.93%	13.70%	3065
Average (across all images)	83.86%	14.83%	5650
Average (across genders)	83.76%	1.66%	-
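
With only two subgroups, the choice of standard deviation estimator changes the threshold noticeably. The sketch below recomputes the gender figures from the rounded table values: the 1.66 used in the 3-stdev threshold matches the sample standard deviation, whereas the population standard deviation of the same two values is about 1.18, which is also the gap between the average and the worse-performing subgroup. That the sample estimator is the intended one is an inference from these numbers, not something this card states.

```python
from statistics import mean, pstdev, stdev

# Gender mIoU values (in percent) from the table above.
gender_miou = [82.58, 84.93]

average = mean(gender_miou)                # about 83.76
sample_sd = stdev(gender_miou)             # n - 1 in the denominator, about 1.66
population_sd = pstdev(gender_miou)        # n in the denominator, about 1.18
gap_to_worst = average - min(gender_miou)  # about 1.18

threshold = 3 * sample_sd                  # about 4.98
within_criterion = gap_to_worst < threshold
```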

Ethical Considerations

Privacy: the model was trained and evaluated on images that may depict humans. All images used were either obtained with consent or anonymized when the data was captured in the public domain. When the model is used in ARDK, inference runs only on-device and the image is not transferred off the user's device.
Human life: this model is designed for entertainment purposes within an augmented reality application. It is not intended to be used for making human life-critical decisions.
Bias: training datasets have not been audited for diversity and may present biases not surfaced by our benchmarks.

Caveats and Recommendations

Our annotated dataset only contains binary genders, which we report as male/female. Further data is needed to evaluate across a spectrum of genders.
An ideal skin tone evaluation dataset would additionally include camera details and more environment details such as lighting and humidity. Furthermore, the Fitzpatrick scale has limitations, as it doesn’t represent the full spectrum of human skin tones.