Transformers, Dall-E Mini, and the Cat in the Hat

Deep Gan Team
May 4, 2022

Wiley Wang, John Inacay, and Mike Wang (all authors contributed equally)

In our past blog posts, we introduced the Transformer model architecture and its application to language translation. The Transformer is a powerful neural network framework: when trained on a large amount of natural language data, it learns a remarkable amount about the language itself. One example is the GPT-3 model developed by OpenAI.

In most of our previous blog posts, we focused on the intricacies of specific deep learning models and architectures, including each algorithm’s inner workings and data structures. Building models from scratch has its limits, though, mostly set by the computing power and data sources available to us. With GPT-3, we instead take the view from the application layer and build something that is fun and demoable. In this blog post, we discuss how we used multiple off-the-shelf deep learning models as tools to create a piece of media.

Uncanny Hat Cat (What We Did)

We used several readily available models and open deep-learning-powered products to construct a derivative re-telling of Dr. Seuss’s The Cat in the Hat. The original text was the only input taken from the source work. Using a language model, we generated summarizations of paragraphs from the original text. Then, we passed those summarizations to another model to create illustrations. The text was also passed to a text-to-speech product to produce narration. We then assembled the derived text, illustrations, and narration into the final video.

Text Summarization with GPT-3

GPT-3 has the ability to interpret text commands. Fundamentally, GPT-3 is a model that was trained on a huge corpus of data to predict the next word. If the provided input contains a command written as text, GPT-3 is able to comprehend the command and “predict” the corresponding output. We asked GPT-3 to summarize and paraphrase one of Dr. Seuss’s poems, The Cat in the Hat. In this way, GPT-3 helps us generate novel content that we can use in our final video.

Query Development

Interacting with GPT-3 requires the user to develop queries (prompts) in order to get the desired response. To paraphrase the poem, we supplied one to three stanzas at a time to GPT-3 and asked it to summarize or paraphrase the text. In this way, we generated a new poem that roughly follows the original in content and style.
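As a rough illustration, a query like ours can be sent through OpenAI’s Python client. The snippet below is a minimal sketch, not our exact setup: the prompt wording, the text-davinci-002 engine, and the sampling parameters are all illustrative choices, and the API calls reflect how the client looked in 2022.

```python
import openai

openai.api_key = "YOUR_API_KEY"  # set from your OpenAI account

# One to three stanzas of the original poem, pasted in as plain text.
stanzas = "..."

# Illustrative prompt; the exact wording we used differed.
prompt = (
    "Summarize and paraphrase the following stanzas as a short rhyming verse:\n\n"
    + stanzas
)

# Completion endpoint as it existed in 2022; engine and parameters are illustrative.
response = openai.Completion.create(
    engine="text-davinci-002",
    prompt=prompt,
    max_tokens=100,
    temperature=0.7,
)

print(response["choices"][0]["text"].strip())
```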

Image Generation with Dall-E Mini

To make illustrations, we made use of the Dall-E Mini model, a model that takes in text and produces images. We fed the text summarization from the previous step into Dall-E Mini, and the model generated multiple candidate illustrations from the text prompt. In the spirit of human-computer collaboration, we selected the illustration from among the candidates to be paired with the text in the final video.
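Running Dall-E Mini locally takes a fair amount of setup, so the snippet below is only a minimal sketch of the idea, assuming the model is reachable through the Hugging Face Inference API. The model id, token, and hosted endpoint are assumptions for illustration, not part of our actual pipeline.

```python
import requests

# Assumed hosted endpoint for the dalle-mini model on the Hugging Face Inference API.
API_URL = "https://api-inference.huggingface.co/models/dalle-mini/dalle-mini"
headers = {"Authorization": "Bearer YOUR_HF_TOKEN"}

prompt = "a cat in a tall striped hat balancing a fish bowl on an umbrella"

# Text-to-image endpoints on the Inference API return raw image bytes.
response = requests.post(API_URL, headers=headers, json={"inputs": prompt})
response.raise_for_status()

with open("illustration.png", "wb") as f:
    f.write(response.content)
```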

Text-to-speech

For this project, we used Google’s text-to-speech to generate the voice used in the video. Compared to a couple of years ago, text-to-speech technologies are now quite mature. It’s easy to supply a piece of text to a system and generate speech that humans can readily understand. Google’s system also lets you select different types of voices, and we were able to generate the voiceover for our video in a few quick steps.
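For readers who want to try this step themselves, here is a minimal sketch using the Google Cloud Text-to-Speech Python client. The voice name and audio encoding below are illustrative choices rather than the exact settings we used.

```python
from google.cloud import texttospeech

# Requires Google Cloud credentials configured in the environment.
client = texttospeech.TextToSpeechClient()

# The paraphrased verse produced in the previous step.
synthesis_input = texttospeech.SynthesisInput(text="...")

# Voice and encoding are illustrative; the API offers many alternatives.
voice = texttospeech.VoiceSelectionParams(
    language_code="en-US",
    name="en-US-Wavenet-D",
)
audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.MP3
)

response = client.synthesize_speech(
    input=synthesis_input, voice=voice, audio_config=audio_config
)

with open("narration.mp3", "wb") as f:
    f.write(response.audio_content)
```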

Creating the Video

Finally, we manually stitched together the generated images, text, and voice-over to create the final video. We used iMovie to edit the movie with the necessary images, text, and audio.
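We did this step by hand in iMovie, but the same assembly can also be scripted. The sketch below uses the moviepy library, which was not part of our actual workflow, to pair each illustration with its narration clip; the file names are placeholders.

```python
from moviepy.editor import AudioFileClip, ImageClip, concatenate_videoclips

# Each scene pairs one generated illustration with its narration file (placeholder names).
scenes = [
    ("illustration_01.png", "narration_01.mp3"),
    ("illustration_02.png", "narration_02.mp3"),
]

clips = []
for image_path, audio_path in scenes:
    audio = AudioFileClip(audio_path)
    # Show each illustration for as long as its narration lasts.
    clip = ImageClip(image_path).set_duration(audio.duration).set_audio(audio)
    clips.append(clip)

final = concatenate_videoclips(clips, method="compose")
final.write_videofile("cat_in_the_hat.mp4", fps=24)
```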

On Creating Art with Deep Learning

We were able to create a piece of (mediocre) art using several off-the-shelf models and products, and we were able to do it without spending much compute or time. While our final output is interesting mostly for its novelty, the ease of the process suggests that a more genuinely entertaining piece of art is possible with better models and/or more mastery of the models as tools.

Conclusion

In conclusion, we were able to create a video using three different machine learning models spanning computer vision, natural language processing, and speech. Our project shows how much easier it has become to combine technologies from multiple disciplines into applications for users. As these technologies continue to advance, we should keep exploring newer and better ways of applying them. The applications of the future are likely to employ more, and more capable, models.


Deep Gan Team

We’re a team of Machine Learning Engineers exploring and researching deep learning technologies.