Very Rough Takeaways from CVPR 2024 Best Papers
This post lists some very rough takeaways/summaries from our three team members for these papers.
Generative Image Dynamics
The main idea is to use diffusion to generate a motion encoding for an image, and then use this intermediate representation to render a motion animation. The takeaway is that diffusion is shown to be effective at solving intermediate tasks whose outputs can be reused in downstream tasks. The motion representation itself may also find use in other tasks related to motion understanding.
This paper demonstrates a novel method for generating motion from a single still image. Motion is represented as a spectral volume, a per-pixel representation of motion at different frequencies. The spectral volume is generated using diffusion (a generative AI technique), where random noise is iteratively denoised into a realistic motion representation. Once the denoised spectral volume is obtained, it is used to render the final motion clip.
In computer graphics, motion can be modeled in the spectral (Fourier) domain, and video motion can be represented as dynamic textures. In this paper, the motion texture is represented in the 2D Fourier frequency domain, and the motion is predicted by an LDM (latent diffusion model).
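To make the spectral-volume idea a bit more concrete, here is a minimal numpy sketch (our own illustration, not the paper's code) of how per-pixel complex Fourier coefficients over a handful of low temporal frequencies could be summed into per-frame 2D displacement fields; the exact frequency set and normalization in the paper differ.

```python
import numpy as np

# Hypothetical shapes: H x W pixels, K temporal frequencies, 2 motion axes (x, y).
H, W, K = 64, 64, 16
rng = np.random.default_rng(0)

# A toy "spectral volume": complex Fourier coefficients per pixel, frequency, and axis.
# In the paper this is predicted by a latent diffusion model; here it is random.
spectral_volume = rng.normal(size=(H, W, K, 2)) + 1j * rng.normal(size=(H, W, K, 2))

def displacement_at_time(spec, t, num_frames=60):
    """Sum the per-pixel Fourier terms to get a 2D displacement field at frame t."""
    freqs = np.arange(spec.shape[2])                      # low temporal frequencies 0..K-1
    phases = np.exp(2j * np.pi * freqs * t / num_frames)  # e^{i * 2*pi*f*t/T}
    # Real part of the inverse Fourier sum -> (H, W, 2) displacement in pixels.
    return np.real((spec * phases[None, None, :, None]).sum(axis=2))

# Each frame's displacement field can then be used to warp/splat the input image.
disp = displacement_at_time(spectral_volume, t=10)
print(disp.shape)  # (64, 64, 2)
```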
Rich Human Feedback for Text-to-Image Generation
Drawing inspiration from current LLM techniques, this paper proposes tasks, with an accompanying dataset, to adapt RLHF for image generation. The tasks include targeted landmarking, which marks where a human would notice specific inconsistencies in a generated image, along with overall ratings on different qualities for human preference. This promises to bring the fine-tuning of text-to-image generative models in line with current state-of-the-art methods for LLMs.
The paper Rich Human Feedback for Text-to-Image Generation demonstrates the use of RLHF (Reinforcement Learning from Human Feedback) in the image generation setting. RLHF is used to create a model that scores images and text on three criteria: plausibility, alignment, and aesthetics. With these signals, the generative model can then be fine-tuned to improve the plausibility, alignment, and aesthetics of the final generated images.
This paper creates a dataset with rich human feedback. The authors also propose a multimodal Transformer model (RAHF) that predicts human annotations; this feedback can then be used to improve image generation. The model also shows good generalization to generation models unseen during training. The rich human feedback includes plausibility, alignment, aesthetics, and an overall score. You can access the GitHub repository at: https://github.com/google-research/google-research/tree/master/richhf_18k
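As a toy illustration of how such rich feedback could be used downstream (this is our own hypothetical example, not the paper's pipeline), the predicted per-criterion scores can be combined into a single reward to rank or filter candidate generations, e.g. for best-of-N selection or as a fine-tuning signal. The weights and score values below are arbitrary.

```python
from dataclasses import dataclass

@dataclass
class RichScores:
    """Hypothetical per-image scores a RAHF-style model might predict (all in [0, 1])."""
    plausibility: float
    alignment: float   # text-image alignment
    aesthetics: float

def reward(s: RichScores, w=(0.4, 0.4, 0.2)) -> float:
    """Weighted combination of the criteria; the weights here are arbitrary."""
    return w[0] * s.plausibility + w[1] * s.alignment + w[2] * s.aesthetics

# Rank a batch of candidate generations for one prompt and keep the best ones.
candidates = {
    "img_0": RichScores(0.9, 0.7, 0.6),
    "img_1": RichScores(0.7, 0.9, 0.8),
    "img_2": RichScores(0.4, 0.5, 0.9),
}
ranked = sorted(candidates, key=lambda k: reward(candidates[k]), reverse=True)
print(ranked)  # ['img_1', 'img_0', 'img_2']
```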
Mip-Splatting: Alias-free 3D Gaussian Splatting
Novel view synthesis (NVS) generates new images from viewpoints different from the original input images. Among NVS techniques, NeRFs and 3DGS (3D Gaussian Splatting) have demonstrated impressively realistic results, with 3DGS achieving real-time rendering at high resolutions. This paper modifies 3DGS to minimize artifacts: it introduces a 3D frequency-constrained filter tied to the input views and replaces the 2D dilation with a 2D Mip filter. In doing so, it achieves state-of-the-art results and reduces dilation effects as well as high-frequency artifacts.
The Mip-Splatting: Alias-free 3D Gaussian Splatting paper combines a 2D Mip filter and a 3D smoothing filter with 3D Gaussian Splatting to achieve state-of-the-art rendering. These techniques appear to match and/or surpass NeRF-based techniques that target a similar result. 3D Gaussian Splatting represents a scene with 3D Gaussians and renders it by projecting those Gaussians from 3D into 2D screen space.
Mip-Splatting improves the resulting renderings with two contributions: a 3D smoothing filter and a 2D Mip filter. These address the artifacts that appear when the focal length changes or the camera moves toward or away from the scene.
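Here is a rough numpy sketch of these ideas, written from our reading of the paper rather than from its code. The projection step is the standard Jacobian (EWA-style) approximation used in 3DGS; the two filters both amount to convolving a Gaussian with a low-pass Gaussian, which simply adds covariances. The constants and filter sizes below are placeholders, not the paper's exact values.

```python
import numpy as np

def project_covariance(cov3d, mean_cam, fx, fy):
    """Project a 3D Gaussian covariance (already in camera coordinates) to 2D screen
    space using the affine (Jacobian) approximation of perspective projection:
        Sigma_2D = J @ Sigma_3D @ J.T
    """
    x, y, z = mean_cam
    J = np.array([
        [fx / z, 0.0,    -fx * x / z**2],   # d(fx*x/z) / d(x, y, z)
        [0.0,    fy / z, -fy * y / z**2],   # d(fy*y/z) / d(x, y, z)
    ])
    return J @ cov3d @ J.T

def smooth_3d(cov3d, max_sampling_rate, s=0.5):
    """3D smoothing filter (sketch): convolve the 3D Gaussian with an isotropic low-pass
    Gaussian whose size is tied to the highest sampling rate among the training views."""
    sigma_lp = s / max_sampling_rate          # placeholder constant
    return cov3d + (sigma_lp ** 2) * np.eye(3)

def mip_filter_2d(cov2d, pixel_sigma=0.3):
    """2D Mip filter (sketch): instead of a fixed screen-space dilation, convolve the
    projected Gaussian with a pixel-sized Gaussian approximating the camera's box filter,
    and rescale its weight so the total contribution stays roughly constant."""
    filtered = cov2d + (pixel_sigma ** 2) * np.eye(2)
    weight = np.sqrt(np.linalg.det(cov2d) / np.linalg.det(filtered))
    return filtered, weight

cov3d = smooth_3d(np.diag([1e-4, 1e-4, 1e-4]), max_sampling_rate=200.0)
cov2d = project_covariance(cov3d, mean_cam=np.array([0.1, 0.0, 2.0]), fx=500.0, fy=500.0)
print(mip_filter_2d(cov2d))
```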
BIOCLIP: A Vision Foundation Model for the Tree of Life
BIOCLIP pairs a large collected dataset of images with taxonomic labels with a model that builds on CLIP, where feeding in the full taxonomic name has interesting effects. Because the text encoder is a language model in which a token at a given taxonomic level mixes only with tokens of higher levels, the learned visual features also optimize for hierarchical concepts. BIOCLIP's emergent ability to accurately infer the higher taxonomic levels of unseen species suggests this hierarchical training worked.
BIOCLIP builds off the CLIP paper and uses the CLIP contrastive pre-training objective to train a more generalizable model for classifying taxonomic hierarchies. The CLIP objective supports training across modalities, for example text and images: the cosine similarity between a text embedding and an image embedding is used to compute the loss. Importantly, cosine similarity gives the network a simple and effective way to map embeddings from different modalities into a shared space, which may also help models trained with the CLIP objective be more generalizable and robust than other approaches. CLIP is also trained on a dataset that was not curated for the final task; here, the dataset is made of text-image pairs that can potentially be collected more easily. BIOCLIP also continues the trend of ML models being multimodal rather than limited to processing only images or only text.
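For reference, here is a minimal PyTorch sketch of the symmetric CLIP-style contrastive loss described above (a standard formulation, not BIOCLIP's exact code): cosine similarity is the dot product of L2-normalized embeddings, and the i-th image should match the i-th text while mismatching every other text in the batch.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric CLIP-style contrastive loss over N matching (image, text) pairs.

    image_emb, text_emb: (N, D) embeddings from the image and text encoders.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (N, N) cosine similarities
    targets = torch.arange(logits.shape[0])           # the diagonal pairs are positives
    loss_i2t = F.cross_entropy(logits, targets)       # match each image to its text
    loss_t2i = F.cross_entropy(logits.t(), targets)   # match each text to its image
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random embeddings standing in for the encoders' outputs.
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(clip_contrastive_loss(img, txt).item())
```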
BIOCLIP combines several large datasets into TREEOFLIFE-10M for images, integrating and canonicalizing the taxonomic labels across sources. A large part of the effort went into unifying and backfilling taxonomic hierarchies; these text/database tasks were significant because the taxonomic hierarchies are messy. The generated text takes the form described below.
In BIOCLIP, the text encoder uses taxonomic names. The standard seven-level biological taxonomy, from higher to lower level, is kingdom, phylum, class, order, family, genus, and species. In the text for each species, the taxonomy is flattened into a single string.
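A small sketch of what this flattening can look like; the exact prompt template and wording used by BIOCLIP may differ, and the record below is just an illustrative example.

```python
# Flatten the seven taxonomic levels (kingdom -> species) into one caption-like string.
LEVELS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]

def flatten_taxonomy(record: dict) -> str:
    """Join the taxonomy from highest to lowest level into a single text string.
    The exact template BIOCLIP uses may differ; this shows the idea."""
    names = " ".join(record[level] for level in LEVELS)
    return f"a photo of {names}."

black_bear = {
    "kingdom": "Animalia", "phylum": "Chordata", "class": "Mammalia",
    "order": "Carnivora", "family": "Ursidae", "genus": "Ursus", "species": "americanus",
}
print(flatten_taxonomy(black_bear))
# -> "a photo of Animalia Chordata Mammalia Carnivora Ursidae Ursus americanus."
```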
BIOCLIP utilizes CLIP’s multimodal objective to learn hierarchical representations of a taxonomy, training on image and text pairs.
Resources