Trying to Modify an LLM in Early 2023

Deep Gan Team
6 min read · Aug 9, 2023

Neil Day, John Inacay, Wiley Wang, Mike Wang, Nishu Lahoti

Thank you to Kamaroopin Gestora for sponsoring the AWS services.

Note: We attempted our experiments and wrote this article before the release of Llama 2.

The GPT family of Large Language Models (LLMs) is built on the Transformer architecture and approaches language generation by predicting the next word given some input sequence, or “prompt”, whether that means finishing a sentence or responding in a dialogue. ChatGPT, initially based on GPT-3, goes further by fine-tuning on a dialogue dataset, making it an ideal basis for a chatbot. To optimize ChatGPT for dialogue, GPT-3 was fine-tuned using a technique called “Reinforcement Learning from Human Feedback”. We were interested in understanding the details of this process, which has been critical to producing the high-quality output we see today.
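As a tiny, concrete illustration of next-word prediction, here is roughly what prompting a small, openly available GPT-2 checkpoint through Hugging Face’s transformers library looks like. GPT-2 is only a stand-in for the much larger GPT family here, and the prompt and sampled continuation are just examples.

```python
# Minimal next-token generation demo with the openly available GPT-2 checkpoint.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator("The color of the sea is", max_new_tokens=10)
print(result[0]["generated_text"])
# The model simply keeps predicting the next word, e.g.
# "The color of the sea is blue, and the water is very clear ..."
```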

There are many excellent articles on GPT, ChatGPT, LLMs and transformers. This discussion assumes a basic knowledge of these concepts.

What is Reinforcement Learning from Human Feedback (RLHF)?

Reinforcement learning is a field of Machine Learning in which agents use trial and error to learn the actions that maximize their cumulative reward in an environment. RLHF builds on this concept and is one of the main mechanisms behind the engineering that makes ChatGPT work. RLHF requires three main components: a reward model, the LLM (large language model) itself, and a fine-tuning dataset.

What is the reward model?

RLHF tries to improve an LLM’s performance using a reward model built from human annotations. Given a response to some input, the reward model simply rates the quality of the computer-generated response. It is trained with supervised learning to predict or rate a response according to some metric; in effect, it replicates and automates how humans would rate the responses. For ChatGPT, human annotators rated responses according to criteria including factual correctness and response quality, and we expect the reward model to provide a decent approximation of how those annotators would respond. The reward model is an essential component of the RLHF training process. It should be noted that the reward model itself is a supervised model, but the process it feeds into is reinforcement learning.
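To make that concrete, here is a minimal sketch of how a reward model can be trained on pairwise human preferences. The GPT-2 backbone, the single toy preference pair, and the prompt/response formatting are all placeholders of ours; the actual reward model and annotation data behind ChatGPT are far larger and not public.

```python
# A toy reward-model training step on pairwise human preferences.
import torch
from torch.nn.functional import logsigmoid
from transformers import AutoTokenizer, AutoModelForSequenceClassification

backbone = "gpt2"  # placeholder backbone with a scalar "reward" head on top
tokenizer = AutoTokenizer.from_pretrained(backbone)
tokenizer.pad_token = tokenizer.eos_token

reward_model = AutoModelForSequenceClassification.from_pretrained(backbone, num_labels=1)
reward_model.config.pad_token_id = tokenizer.pad_token_id
optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-5)

# Each example pairs a prompt with a human-preferred ("chosen") and a
# less-preferred ("rejected") response.
preference_pairs = [
    {"prompt": "What color is the sea?",
     "chosen": "The sea is a deep sapphire blue.",
     "rejected": "idk, maybe blue or something"},
]

def score(prompt: str, response: str) -> torch.Tensor:
    """Return the reward model's scalar score for a (prompt, response) pair."""
    inputs = tokenizer(prompt + "\n" + response, return_tensors="pt", truncation=True)
    return reward_model(**inputs).logits.squeeze()

for pair in preference_pairs:
    chosen = score(pair["prompt"], pair["chosen"])
    rejected = score(pair["prompt"], pair["rejected"])
    # Pairwise ranking loss: push the chosen response's score above the rejected one's.
    loss = -logsigmoid(chosen - rejected)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```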

How do you fine-tune a model using Reinforcement Learning?

Specifically, ChatGPT fine-tunes a regular LLM using the reward model trained earlier on human-annotated data. During this training phase, the large language model is updated to generate responses that maximize the scores from the reward model. Note that a human is not scoring these responses on the fly; instead, the reward model scores the different responses inside the reinforcement learning training loop.
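For a feel of what that loop looks like in code, here is a stripped-down sketch built on Hugging Face’s TRL library as it existed around 2023. The model choice, generation settings, and the trivial stand-in reward function are our own assumptions, and TRL’s exact API has shifted between versions, so treat this as illustrative rather than definitive.

```python
# A bare-bones RLHF (PPO) loop: generate, score with a reward, update the policy.
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

model_name = "gpt2"  # small open model used as a stand-in policy
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

policy = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
reference = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)

config = PPOConfig(batch_size=1, mini_batch_size=1)
ppo_trainer = PPOTrainer(config=config, model=policy, ref_model=reference, tokenizer=tokenizer)

def reward_fn(response_text: str) -> torch.Tensor:
    # Stand-in for the trained reward model: in a real run, this would be the
    # reward model scoring the (prompt, response) pair.
    return torch.tensor(1.0 if "blue" in response_text.lower() else 0.0)

prompt = "What color is the sea?"
query = tokenizer(prompt, return_tensors="pt").input_ids[0]

for _ in range(10):  # a handful of PPO steps, just to show the loop structure
    output = policy.generate(query.unsqueeze(0), max_new_tokens=20,
                             pad_token_id=tokenizer.eos_token_id)[0]
    response = output[len(query):]  # keep only the newly generated tokens
    reward = reward_fn(tokenizer.decode(response, skip_special_tokens=True))
    # PPO update: nudge the policy toward high-reward responses while the frozen
    # reference model keeps it from drifting too far from the original LLM.
    ppo_trainer.step([query], [response], [reward])
```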

(Attempting) the Experimentation Phase

As we were learning about RLHF, it occurred to the team that we might be able to set up a relatively simple test to deflect an LLM’s behavior in an observable way.

One idea that we liked was getting an LLM to use more “snazzy” versions of adjectives in its answers. For example, instead of using “blue” to describe the color of the sea, we would use RLHF to retrain the model to say “sapphire” or “azure”. The thought was that we could then ask questions about the sky and other blue objects and observe the results. Red, green, and blue seemed like appropriate choices as a starting point.

To test the effects of the training, we would ask a series of questions that naturally elicit a target adjective in the untrained model’s response, and then repeat the same prompts after training. By running a reasonably large set of prompts through both the untrained and trained models, we can develop a statistical view of the effects of the RLHF training.

Pending successful tests, we could expand the training set to include a pretty wide range of adjective mappings.
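Below is a rough sketch of the before/after evaluation we had in mind. The prompts, the plain-versus-snazzy word lists, and the sample counts are placeholders, and the tuned checkpoint referenced at the end is hypothetical, since we never got that far.

```python
# Tally plain vs. "snazzy" color words in sampled responses, before and after tuning.
from collections import Counter
from transformers import pipeline

PROMPTS = [
    "What color is the sea on a clear day?",
    "Describe the color of the sky at noon.",
    "What color is fresh grass?",
]
PLAIN = {"blue", "red", "green"}
SNAZZY = {"sapphire", "azure", "crimson", "scarlet", "emerald", "jade"}

def count_adjectives(model_name: str, samples_per_prompt: int = 5) -> Counter:
    """Sample responses from a model and count plain vs. snazzy color words."""
    generator = pipeline("text-generation", model=model_name)
    counts = Counter()
    for prompt in PROMPTS:
        outputs = generator(prompt, max_new_tokens=30, do_sample=True,
                            num_return_sequences=samples_per_prompt)
        for out in outputs:
            words = [w.strip(".,!?") for w in out["generated_text"].lower().split()]
            counts["plain"] += sum(w in PLAIN for w in words)
            counts["snazzy"] += sum(w in SNAZZY for w in words)
    return counts

print("before:", count_adjectives("gpt2"))
# print("after:", count_adjectives("path/to/our-rlhf-tuned-gpt2"))  # hypothetical checkpoint
```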

Access to language models

In order to experiment with language models for learning purposes, we had to overcome a few hurdles: commercial language models are generally not available as “deployable” instances, and these models tend to be too large and costly for non-commercial experimentation.

Our first choice was ChatGPT, but we were unable to use it. As of May 2023, the published API did not allow developers to do supplemental training of the base model. In addition, we wanted a controlled environment in which to run experiments, which API access wouldn’t provide.

Our second choice was the LLaMA model. Given the work that the Alpaca project had done on top of Meta’s LLaMA model, we wanted to see if we could get developer access to the Meta codebase. There was ample evidence that developers were actively working with the system, and it looked very promising as a vehicle for running our experiments. Meta requires developers to request permission and access in order to work with the models. Unfortunately, we were unable to get a response from the Meta team after several attempts to obtain permission.

In March 2023, the LLaMA model, a large language model (LLM) developed by Facebook AI, was leaked. It is a 65-billion-parameter model trained by Meta; however, we were uncomfortable using model weights that were not explicitly provided to us by Meta under license.

Finally, our third choice, and the one we ultimately settled on, was GPT-2. OpenAI released the GPT-2 model in 2019; it is a language model with 1.5 billion parameters. We felt that this would be a sufficiently advanced model to work with, while still being small enough to deploy on cloud infrastructure.

Upon doing further research, we determined that LoRA (Low-Rank Adaptation) was a good test bed for experimentation, and that a GPT-2 model from Hugging Face was already integrated into the LoRA package. This appeared to be a complete package we could work from, and it was the most accessible option on which to build our test bed.
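For concreteness, here is roughly what attaching LoRA adapters to GPT-2 looks like. The LoRA package mentioned above ships its own reference code; the sketch below uses Hugging Face’s peft library instead, as one commonly used equivalent, and the rank, alpha, and dropout values are arbitrary choices of ours.

```python
# Wrap GPT-2 with LoRA adapters so only a small number of parameters are trained.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

base = "gpt2"  # smallest public GPT-2 checkpoint; "gpt2-xl" is the full 1.5B-parameter model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,              # rank of the low-rank update matrices
    lora_alpha=16,    # scaling factor applied to the update
    lora_dropout=0.05,
)

model = get_peft_model(model, lora_config)
# The original GPT-2 weights stay frozen; only the small adapter matrices are trainable.
model.print_trainable_parameters()
```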

The trouble with deploying

Now that we had a base model, we needed to fine-tune it. To run the fine-tuning, we initially tried two tools: Google Colab and Amazon SageMaker. Google Colab offers a Jupyter-notebook-like environment in the browser for users to run their models. Given the limitations of the free Google Colab tier, we turned to AWS and its SageMaker backend to train the LoRA model. This would let us get past the time and compute limitations we were hitting with free Colab. However, we still needed to perform AWS administrative tasks.

We ran into two extra hurdles setting up SageMaker before we could even attempt to train. The first was the administrative work required to set up the SageMaker backend itself: configuring roles and domains in AWS before we could even start a notebook. The second was gaining access to GPU compute. Our account was not granted GPU capacity at the start, and we needed to contact AWS support before capacity was allocated to us. Only then could we commence coding in a SageMaker notebook.
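For reference, kicking off a training job from the SageMaker Python SDK looks roughly like the sketch below. The entry-point script, instance type, framework versions, and hyperparameters are placeholders; your account needs the execution role and the GPU service quota described above before anything like this will run.

```python
# Launch a (hypothetical) LoRA fine-tuning script as a SageMaker training job.
import sagemaker
from sagemaker.huggingface import HuggingFace

role = sagemaker.get_execution_role()  # resolves inside a SageMaker notebook

estimator = HuggingFace(
    entry_point="train_lora_gpt2.py",  # hypothetical training script
    source_dir="./scripts",
    role=role,
    instance_type="ml.g4dn.xlarge",    # single-GPU instance; requires a granted quota
    instance_count=1,
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39",
    hyperparameters={"epochs": 3, "learning_rate": 5e-5},
)

estimator.fit()  # optionally pass S3 URIs pointing at training data
```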

After some troubleshooting and experimentation, we learned that feedback on why a notebook job failed can be slow to arrive, and that debugging a notebook job through the browser is sluggish compared to developing locally. Given these observations, we decided to document our experience here in this blog post as a learning exercise.

The team’s opinion was that it would take quite a bit of time to get a fully operational LLM + RLHF stack up and running, and behaving predictably. We felt that this was beyond the scope of the study group’s available time. It highlighted the fact that these systems, especially the open-source (OSS) versions, are still relatively immature and require quite a bit of “noodling” to produce a working system. It is reminiscent of early Linux-era projects: it felt a bit like trying to get Apache running from a tarball in 1996, possible but not straightforward.

Conclusion

In conclusion, we’ve covered how ChatGPT works using Reinforcement Learning from Human Feedback, and we’ve documented some of the lessons we learned in the process of trying to train a LoRA model in a notebook environment. As the technology improves, we expect it will get easier to run every part of the LLM fine-tuning stack.


Deep Gan Team

We’re a team of Machine Learning Engineers exploring and researching deep learning technologies.