Examining DINOv2

Deep Gan Team
3 min read · Feb 12, 2025


DINO is a state-of-the-art model from Facebook whose name stands for DIstillation with NO labels. As the name suggests, the model is trained on a large amount of data, much of which carries no explicit labels. While mostly trained without labels, DINO still relies on curated datasets, so the process is not completely unsupervised. Because gathering labeled data is expensive, training without labels potentially allows ML models to learn from much larger datasets. The second concept in the name, distillation, refers to using a teacher model to guide a student model toward higher performance. Usually this means knowledge distillation, where a larger model guides the training of a smaller one. DINO adapts the idea into a clever self-distillation technique used during training.

Data Source

The curated data sources include ImageNet-22k, the train split of ImageNet-1k, Google Landmarks, and several fine-grained datasets. For the uncurated source, DINOv2 collects a raw, unfiltered set of images from crawled web data, totaling 1.2 billion unique images.


Deduplication and Retrieval

DINOv2 applies a processing pipeline to the uncurated data that first removes near-duplicate images, then uses self-supervised image retrieval to select images that resemble the curated sources. Each image is first mapped to an embedding using a self-supervised ViT-H/16 network pretrained on ImageNet-22k. The uncurated embeddings are then clustered with k-means, with distances based on cosine similarity. Both deduplication and retrieval rely on the Faiss library, and each cluster keeps only representative image embeddings. At the end of the retrieval process, the resulting curated dataset, named LVD-142M, contains 142 million images.
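This kind of curation can be sketched with Faiss, which the paper uses for nearest-neighbor search and clustering. The sketch below is illustrative only: the embedding dimension, similarity threshold, and cluster count are assumptions, and random vectors stand in for real ViT-H/16 features.

```python
import numpy as np
import faiss  # similarity-search library used for deduplication and retrieval

# Stand-in embeddings: in the real pipeline these come from a self-supervised
# ViT-H/16 pretrained on ImageNet-22k; random vectors are used here for illustration.
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((10_000, 1024)).astype("float32")
faiss.normalize_L2(embeddings)  # after L2 normalization, inner product == cosine similarity

# Near-duplicate removal: flag images whose nearest neighbor exceeds a similarity threshold.
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)
sims, ids = index.search(embeddings, 2)          # columns: [self, nearest neighbor]
is_dup = (sims[:, 1] > 0.96) & (ids[:, 1] < np.arange(len(embeddings)))  # keep one of each pair
kept = embeddings[~is_dup]

# Cluster the deduplicated pool with spherical k-means, so distances reflect cosine similarity.
kmeans = faiss.Kmeans(d=kept.shape[1], k=256, niter=20, spherical=True)
kmeans.train(kept)
_, cluster_ids = kmeans.index.search(kept, 1)    # cluster assignment per image
```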

Self-Distillation Training

DINO’s namesake training method, self-distillation with no labels, cleverly adapts knowledge distillation to a different setting. Knowledge distillation trains a smaller model to match the outputs of a larger, more powerful model. Self-distillation has no pre-existing larger model, so it makes a few alterations instead.

The teacher and student are trained jointly, or co-distilled. They share the same architecture, and each receives an augmented view of the same source image. However, the augmentations differ: the teacher receives augmentations that preserve most of the source image, while the student receives smaller crops of the same source. The student updates its parameters every iteration via gradient descent. The teacher is also updated every iteration, but its weights come directly from the student's weights (an exponential moving average), not from backpropagation.
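A minimal PyTorch-style sketch of this co-distillation loop is shown below. It uses a simple stand-in backbone, illustrative crop sizes and hyperparameters, and omits details such as the centering of the teacher's outputs, so it is not the paper's exact recipe.

```python
import copy
import torch
import torch.nn.functional as F

# Stand-in backbone (a real setup would use a ViT); the teacher is a momentum copy of the student.
student = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 224 * 224, 256))
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad = False  # the teacher is never updated by gradients

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)
momentum, student_temp, teacher_temp = 0.996, 0.1, 0.04  # illustrative values

def train_step(global_crop, local_crop):
    # Teacher sees the global view; student sees a small crop of the same image.
    with torch.no_grad():
        t_out = F.softmax(teacher(global_crop) / teacher_temp, dim=-1)
    s_out = F.log_softmax(student(local_crop) / student_temp, dim=-1)
    loss = -(t_out * s_out).sum(dim=-1).mean()   # cross-entropy between teacher and student outputs

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Teacher update: exponential moving average of the student's weights, no backprop.
    with torch.no_grad():
        for t_p, s_p in zip(teacher.parameters(), student.parameters()):
            t_p.mul_(momentum).add_(s_p, alpha=1 - momentum)
    return loss.item()

# Toy call with random tensors; real local crops are smaller and resized by the data pipeline.
print(train_step(torch.randn(8, 3, 224, 224), torch.randn(8, 3, 224, 224)))
```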

The resulting training regime lets the teacher learn "globally" while the student learns "local" features. Because the teacher's weights are built from the student's, the student's "local" feature processing is merged into the teacher's "global" representation. Since the loss pushes the two networks' outputs to match across these different views, global and local features end up well correlated, provided the dataset itself contains strong correlations between them.


PCA of Patch Features

The authors performed principal component analysis (PCA) on the patch features extracted by the model. Thresholding the first component, keeping only patches with a positive value, separates the image's main object from the background. A second PCA is then applied to the remaining foreground patches across three images of the same category. Coloring the patches by these components naturally separates the foregrounds, and corresponding object parts receive the same colors across images. Even though the model was never trained for parsing or part segmentation, the PCA reveals a semantic understanding of the images.
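A rough sketch of this visualization, simplified to a single image rather than three of the same category, with random data standing in for real DINOv2 patch features and an assumed patch grid and feature dimension:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in patch features: one row per image patch, as a DINOv2 backbone would produce.
rng = np.random.default_rng(0)
patch_features = rng.standard_normal((16 * 16, 768))   # 16x16 patch grid, 768-dim features

# First PCA: threshold the first component to split foreground from background.
# (The sign of a principal component is arbitrary, so it may need flipping in practice.)
first_pc = PCA(n_components=1).fit_transform(patch_features)[:, 0]
foreground = first_pc > 0

# Second PCA: three components on the foreground patches, mapped to RGB for display.
fg = patch_features[foreground]
components = PCA(n_components=3).fit_transform(fg)
rgb = (components - components.min(0)) / (components.max(0) - components.min(0) + 1e-8)
```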

Conclusion

In this post, we discussed the main ideas behind DINO and DINOv2 and the novel techniques they introduce. The DINO family uses self-distillation with no labels to achieve state-of-the-art results.

Written by Deep Gan Team

We’re a team of Machine Learning Engineers exploring and researching deep learning technologies
