Model card: Person Segmentation (v0.11)
Model date: 7/25/2022
Model version: 0.11
License: refer to the terms of service for Lightship.
The person segmentation model returns a floating-point value between 0 and 1 for each pixel, indicating the probability that the pixel is part of a person. This value is then thresholded to produce a boolean mask marking the presence or absence of “person” at each pixel.
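The thresholding step can be illustrated with a minimal sketch. This is not the ARDK API: the function name `threshold_mask`, the threshold of 0.5, the `>=` comparison, and the sample probabilities are all assumptions made for illustration, since this card does not state the threshold the model uses.

```python
from typing import List


def threshold_mask(probabilities: List[List[float]],
                   threshold: float = 0.5) -> List[List[bool]]:
    """Turn per-pixel person probabilities into a boolean mask.

    The 0.5 default is an assumption; the model card does not give
    the actual threshold.
    """
    return [[p >= threshold for p in row] for row in probabilities]


# A tiny 2x3 "image" of made-up probabilities, for illustration only.
probs = [[0.05, 0.40, 0.92],
         [0.10, 0.75, 0.99]]
mask = threshold_mask(probs)

# Query the presence of "person" at one pixel (row 1, column 1).
person_at_pixel = mask[1][1]
```

Indexing the mask at a pixel, as in the last line, corresponds to the "presence or absence at a specified pixel" query described under the intended use cases.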
Intended use cases
General semantic segmentation of people for augmented reality applications accessed through the Lightship ARDK.
Querying the presence or absence of a person at any specified pixel in the camera feed.
Using the semantic mask for “person” to enable screen space effects.
Augmented reality developers through Niantic Lightship.
Out-of-scope use cases
This model does not provide the capability to:
Segment individual people (instance segmentation)
Identify or recognize individuals
The following factors apply to all semantic segmentation provided in the Lightship ARDK, including person segmentation:
Scale: objects / classes may not be segmented if they are very far away from the camera.
Lighting: extreme light conditions may affect the overall performance.
Viewpoint: extreme camera views that were not seen during training may lead to missed detections or class confusion.
Occlusion: objects / classes may not be segmented if they are covered by other objects.
Motion blur: fast camera or object motion may degrade the performance of the model.
Flicker: predictions are made frame by frame and no temporal smoothing or context is applied; this may lead to a ‘jittering’ effect between predictions of temporally adjacent frames.
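Because the model applies no temporal smoothing, an application that is sensitive to flicker could smooth the per-pixel probabilities itself before thresholding. The following is a minimal sketch of one such approach, an exponential moving average. It is not something ARDK is stated to provide: the class name `ProbabilitySmoother` and the `alpha` parameter are hypothetical, and the approach assumes a roughly static camera, since it will lag behind fast motion.

```python
from typing import List, Optional


class ProbabilitySmoother:
    """Exponential moving average over flattened per-pixel probabilities.

    Hypothetical helper for illustration; not part of ARDK.
    """

    def __init__(self, alpha: float = 0.5) -> None:
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha  # weight given to the newest frame
        self.state: Optional[List[float]] = None

    def update(self, frame: List[float]) -> List[float]:
        """Blend a new frame into the running average and return it."""
        if self.state is None:
            # First frame: nothing to blend with yet.
            self.state = list(frame)
        else:
            a = self.alpha
            self.state = [a * new + (1.0 - a) * old
                          for new, old in zip(frame, self.state)]
        return self.state


smoother = ProbabilitySmoother(alpha=0.5)
first = smoother.update([0.0, 1.0])   # returned unchanged
second = smoother.update([1.0, 1.0])  # first pixel moves halfway to 1.0
```

A smaller `alpha` suppresses more jitter at the cost of slower response to real changes in the scene.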
For person segmentation specifically, based on known problems with computer vision technology, we identify potentially relevant factors, including the following subgroups:
Body posture: certain body configurations may be harder to predict due to appearing less often in the training corpus.
Other: age, fashion style, accessories, body alterations
At Niantic we strive for our technology to be inclusive and fair by following strict equality and fairness practices when building, evaluating, and deploying our models. We define person segmentation fairness as follows: a model makes fair predictions if it performs equally on images that depict a variety of the identified subgroups. The evaluation results focus on measuring the performance of the person segmentation channel on the first three main subgroups (geographical region, skin tone and gender).
Instrumentation and dataset details
Our benchmark dataset comprises 5650 images captured around the world using the back camera of a smartphone, with the following specifications:
Only one person per image is depicted.
Both indoor and outdoor environments.
Captured with a variety of devices.
Full body within the frame of the image in a variety of poses.
Images are labeled with the following attributes:
Geographical region: based on the UN geoscheme, with the European subregions merged into one region and Micronesia, Polynesia and Melanesia merged into another:
South Eastern Asia
Australia and New Zealand
Melanesia, Micronesia and Polynesia
Skin tone: images are annotated on the Fitzpatrick scale, from subgroup 1 to 6. The skin tone is self-reported by the person in the image.
Gender: images are annotated with self-reported gender.
The standard metric for evaluating a segmentation model, and the one used here, is Intersection over Union (IoU). It is computed as follows:
IoU = true_positives / (true_positives + false_positives + false_negatives)
Reported IoUs are averages (mean IoU or mIoU) over images belonging to the referenced subgroup unless stated otherwise.
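The IoU formula and the per-subgroup averaging can be sketched as follows. The function names `iou` and `mean_iou`, the flattened boolean-list mask representation, and the sample masks are assumptions for illustration; this is not the evaluation code used to produce the reported numbers.

```python
from typing import List, Sequence, Tuple


def iou(pred: Sequence[bool], truth: Sequence[bool]) -> float:
    """IoU = TP / (TP + FP + FN) for two flattened boolean masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of pixels")
    tp = sum(p and t for p, t in zip(pred, truth))
    fp = sum(p and not t for p, t in zip(pred, truth))
    fn = sum((not p) and t for p, t in zip(pred, truth))
    denom = tp + fp + fn
    if denom == 0:
        # Assumption: two empty masks count as a perfect match.
        return 1.0
    return tp / denom


def mean_iou(pairs: List[Tuple[Sequence[bool], Sequence[bool]]]) -> float:
    """Average per-image IoU over the images of one subgroup."""
    scores = [iou(pred, truth) for pred, truth in pairs]
    return sum(scores) / len(scores)


# Two made-up images: one partially correct, one perfect.
image_a = ([True, True, False, False], [True, False, True, False])
image_b = ([True, True], [True, True])
score_a = iou(*image_a)            # 1 TP, 1 FP, 1 FN
miou = mean_iou([image_a, image_b])
```

Note that true negatives do not appear in the formula, so large correctly predicted background areas do not inflate the score.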
A model is considered to be making unfair predictions if its performance (mIoU) for a particular subgroup is three standard deviations or more away from the average across all subgroups.
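As a worked check of this criterion, the sketch below applies it to the skin tone figures reported in this card (mean 83.84%, standard deviation 1.26%, worst subgroup 81.72%). The function name `is_unfair` and the parameter `k` are hypothetical, and treating a deviation of exactly three standard deviations as unfair follows the "three standard deviations or more" wording.

```python
def is_unfair(subgroup_miou: float, mean_miou: float,
              stdev: float, k: float = 3.0) -> bool:
    """Flag a subgroup whose mIoU is k or more stdevs from the mean."""
    return abs(subgroup_miou - mean_miou) >= k * stdev


# Figures reported for skin tone in this card, in percentage points.
worst_skin_tone = 81.72
flagged = is_unfair(worst_skin_tone, mean_miou=83.84, stdev=1.26)
# The deviation is about 2.1 points against a threshold of 3.78,
# so the worst-performing subgroup is not flagged.
```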
| Region | mIoU | stdev | Number of images |
|---|---|---|---|
| South Eastern Asia | 81.47% | 14.83% | 333 |
| Australia and New Zealand | 82.34% | 14.84% | 374 |
| Melanesia, Micronesia and Polynesia | 82.10% | 21.57% | 75 |
| Average (across all images) | 83.86% | 14.89% | 5650 |
| Average (across regions) | 83.85% | 2.06% | - |
Skin tone evaluation results
Average performance across all six skin tones is 83.84% with a standard deviation of 1.26%. All skin tone subgroups yield a performance in the range [81.72%, 85.45%]. The maximum difference between the mean and the worst-performing skin tone subgroup is 2.13%, within our fairness criterion threshold of 3 standard deviations (3 × 1.26% = 3.78%).
| Skin tone | mIoU | stdev | Number of images |
|---|---|---|---|
| Average (across all images) | 83.86% | 14.83% | 5650 |
| Average (across skin tones) | 83.85% | 1.26% | - |
Gender evaluation results
Average performance across all evaluated gender subgroups is 83.76%, with a range of [82.58%, 84.93%]. The difference between the average and the worst-performing gender is 1.18%, within our fairness criterion threshold of 3 standard deviations (3 × 1.66% = 4.98%).
| Perceived gender | mIoU | stdev | Number of images |
|---|---|---|---|
| Average (across all images) | 83.86% | 14.83% | 5650 |
| Average (across genders) | 83.76% | 1.18% | - |
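The gender results quote a standard deviation of 1.66% in the text and 1.18% in the table. One possible reading, which this card does not confirm, is that the first is a sample standard deviation and the second a population standard deviation over two subgroup scores. The sketch below assumes the two gender subgroups score exactly the endpoints of the reported range.

```python
import statistics

# Assumption: with two gender subgroups, the endpoints of the
# reported range are the two subgroup mIoU values (in percent).
gender_miou = [82.58, 84.93]

mean = statistics.mean(gender_miou)              # about 83.76
sample_sd = statistics.stdev(gender_miou)        # n - 1 denominator
population_sd = statistics.pstdev(gender_miou)   # n denominator
```

Under that assumption the sample value comes out near 1.66 and the population value near 1.18, and the fairness criterion is met with either one.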
Privacy: the model was trained and evaluated on images that may depict humans. All images used were either collected with consent or, when the data was captured in the public domain, anonymized. When the model is used in ARDK, inference runs only on-device and the image is not transferred off the user's device.
Human life: this model is designed for entertainment purposes within an augmented reality application. It is not intended to be used for making human life-critical decisions.
Bias: Training datasets have not been audited for diversity and may present biases not surfaced by our benchmarks.
Caveats and Recommendations
Our annotated dataset contains only binary genders, which we report as male/female. Further data is needed to evaluate across a spectrum of genders.
An ideal skin tone evaluation dataset would additionally include camera details and more environment details, such as lighting and humidity. Furthermore, the Fitzpatrick scale has limitations, as it does not represent the full spectrum of human skin tones.