ML Research Engineer · MachVIS Lab

TVFace

2,609,210 faces, 28,955 identities, mined from 22 TV networks. The largest public facial-clustering dataset.

Year: 2024

Role: ML Research Engineer · MachVIS Lab

Scope: Computer Vision, Dataset, Facial Recognition, Demographic Fairness

Device: Dataset · Springer publication

Tools: Python, PyTorch, Clustering, Face Detection

Links: github.com/zaineli/TVFace · Paper (Springer) · TVFace Challenge 2026 (PDF) · MachVIS Lab

01 Context

Real faces, at real scale.

TVFace is a large-scale facial-recognition and clustering dataset built at NUST's MachVIS lab: 2,609,210 high-resolution face images of 28,955 unique individuals, extracted from broadcasts across 22 global television networks.

Because the source is live television, the data captures genuine real-world conditions (pose, lighting, expression, and aging over time) instead of the clean, frontal images most datasets rely on.

02 The Problem

Benchmarks that don't look like the world.

Facial-recognition research needs scale, diversity, and a realistic long tail. Most public datasets are small, curated, or demographically skewed. Without a benchmark that mirrors real distributions, it's hard to study unsupervised clustering, large-scale recognition, or demographic fairness honestly.

03 Approach

Mine broadcasts, annotate probabilistically.

Faces were detected and clustered from television broadcasts into a natural long-tail distribution, from 10 to 21,983 images per identity, at 224×224 resolution. Each identity carries probabilistic demographic annotations: age, gender, ethnicity (7 groups), expression (7 categories), and head pose.
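As a rough illustration of what probabilistic annotation means in practice, the sketch below stores each attribute as a label-to-probability mapping and shows two ways to consume it: taking the most likely label, or summing probabilities into soft counts. The `IdentityAnnotation` record, its field names, and the example probabilities are assumptions made for illustration; the published annotation format may differ.

```python
from dataclasses import dataclass, field


# Hypothetical record layout: field names and values are assumptions,
# not the published TVFace annotation schema.
@dataclass
class IdentityAnnotation:
    identity_id: str
    gender: dict = field(default_factory=dict)      # label -> probability
    expression: dict = field(default_factory=dict)  # label -> probability


def top_label(distribution):
    """Return the most likely label and its probability."""
    label = max(distribution, key=distribution.get)
    return label, distribution[label]


def expected_counts(annotations, attribute):
    """Sum probabilities across identities, giving soft counts that keep uncertainty."""
    totals = {}
    for ann in annotations:
        for label, p in getattr(ann, attribute).items():
            totals[label] = totals.get(label, 0.0) + p
    return totals


# Made-up example values, for illustration only.
annotations = [
    IdentityAnnotation("id_0001", gender={"female": 0.9, "male": 0.1}),
    IdentityAnnotation("id_0002", gender={"female": 0.4, "male": 0.6}),
]
print(top_label(annotations[0].gender))
print(expected_counts(annotations, "gender"))
```

Soft counts avoid forcing an uncertain annotation into a single bucket, which matters when the labels feed a fairness analysis.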

The dataset ships with a PyTorch-compatible loader and an ethics-first usage framework: research-only, non-commercial, with explicit privacy and bias-mitigation requirements.
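A minimal sketch of what a PyTorch-compatible, map-style loader could look like, assuming one sub-folder of `.jpg` files per identity. `FolderFaceDataset`, the folder layout, and the `loader` parameter are hypothetical and are not the released loader's API; the repository is the place to check the real class.

```python
from pathlib import Path

try:
    from torch.utils.data import Dataset  # real base class when PyTorch is installed
except ImportError:
    Dataset = object  # fallback so the sketch still runs without PyTorch


class FolderFaceDataset(Dataset):
    """Hypothetical map-style dataset: one sub-folder per identity (assumed layout)."""

    def __init__(self, root, loader=None):
        self.root = Path(root)
        # Sorted folder names give a stable identity -> integer label mapping.
        self.identities = sorted(p.name for p in self.root.iterdir() if p.is_dir())
        self.samples = [
            (img, label)
            for label, name in enumerate(self.identities)
            for img in sorted((self.root / name).glob("*.jpg"))
        ]
        # Default: return raw bytes. Pass a decoding function to get images or tensors.
        self.loader = loader or (lambda path: path.read_bytes())

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        path, label = self.samples[index]
        return self.loader(path), label
```

Because it implements `__len__` and `__getitem__`, an instance can be wrapped in `torch.utils.data.DataLoader` once `loader` returns tensors of a consistent shape.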

04 By the Numbers

The scale of it.

Images: 2,609,210

Identities: 28,955

Per identity: 10 to 21,983 images

Networks: 22 worldwide

Resolution: 224 × 224

Size: 65 GB
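A quick back-of-the-envelope check shows how long the tail is. The figures are the ones listed above; treating 65 GB as decimal gigabytes is an assumption.

```python
images = 2_609_210
identities = 28_955
max_per_identity = 21_983
size_bytes = 65 * 10**9  # assumes decimal gigabytes (10^9 bytes)

mean_per_identity = images / identities
max_to_mean = max_per_identity / mean_per_identity
bytes_per_image = size_bytes / images

print(f"mean images per identity: {mean_per_identity:.1f}")
print(f"largest identity vs mean: {max_to_mean:.0f}x")
print(f"average size per image: {bytes_per_image / 1000:.1f} kB")
```

The average identity has about 90 images, so the largest identity holds roughly 240 times the mean, and each 224 × 224 image averages about 25 kB on disk.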

Figure: TVFace age-distribution bar chart. Age skews toward working age: the 20 to 29 and 30 to 39 bands hold ~675k and ~654k images, tapering to 17,347 at 70+.

Figure: TVFace ethnicity-distribution pie chart. Ethnicity, measured and published: White 52.5%, Middle Eastern 18.0%, Black 11.1%, East Asian 9.4%, plus Indian, Latino, and Southeast Asian.

Figure: TVFace expression-distribution bar chart. Expression in real broadcast faces: Neutral leads at 39.4%, with Sad and Happy near 17% each. Nothing posed, nothing staged.
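The published ethnicity shares also make the imbalance easy to quantify. The sketch below derives the combined share of the three groups whose figures are not quoted, then computes inverse-frequency weights for the four named groups. Inverse-frequency weighting is a common balancing heuristic used here for illustration; nothing in the source says TVFace applies it.

```python
# Shares quoted in the ethnicity chart caption, in percent.
named_shares = {
    "White": 52.5,
    "Middle Eastern": 18.0,
    "Black": 11.1,
    "East Asian": 9.4,
}

# Share left for Indian, Latino, and Southeast Asian combined.
remaining = round(100.0 - sum(named_shares.values()), 1)

# Inverse-frequency weights, scaled so the largest group has weight 1.0.
# Illustrative heuristic only, not a documented part of the dataset.
largest = max(named_shares.values())
weights = {group: largest / share for group, share in named_shares.items()}

print(f"remaining three groups combined: {remaining}%")
for group, weight in weights.items():
    print(f"{group}: {weight:.2f}")
```

A sampler or loss that uses such weights would draw on East Asian faces between five and six times as often, per image, as on White faces.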

05 What Shipped

The release

The dataset

2.6M annotated face images across 28,955 identities, released for research.

Demographic labels

Probabilistic age, gender, ethnicity, expression, and head-pose annotations.

PyTorch loader

A TVFaceDataset class for drop-in use in recognition and clustering pipelines.

Peer-reviewed

Published in Springer's Pattern Analysis and Applications (2025).

06 Outcome

Where it landed

A research resource for unsupervised facial clustering, large-scale recognition, and demographic-fairness analysis. Built at MachVIS and published in Springer's Pattern Analysis and Applications, under a Creative Commons non-commercial license.

It now anchors the TVFace Challenge 2026, a public clustering-and-retrieval competition built on the dataset.

Venue: Pattern Analysis & Applications

Year: 2025

License: CC BY-NC 4.0 · research only

07 What I Learned

Portable lessons

01 Real-world distributions beat clean ones for honest benchmarks.

02 A long tail is a feature, not a defect, when you study recognition.

03 Scale and ethics have to ship together.