Written by Francis Elhelou

March 18, 2021

Nvidia researchers have introduced a novel approach that has the potential to transform the way we align large language models (LLMs) with user instructions. Named SteerLM, this method seeks to address the constraints of conventional reinforcement learning from human feedback (RLHF), which is typically employed for aligning LLMs.

Unlike RLHF, which conditions responses on a single reward, SteerLM conditions on multiple attributes, promising a more nuanced and user-aligned model performance.

How LLM alignment works

Language models such as ChatGPT have become widely used because of their effectiveness in adhering to user instructions. The common approach for enhancing these models comprises a two-step procedure. First, the pre-trained model undergoes supervised fine-tuning (SFT), during which it learns from human-annotated examples of instructions and responses. SFT enables the model to improve its ability to match user instructions with its responses. Following that, the model goes through reinforcement learning from human feedback (RLHF), wherein a reward model, trained on human preferences, further fine-tunes the LLM to better align with user objectives.
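
To make the first stage of that recipe concrete, the sketch below shows what supervised fine-tuning looks like in practice: a pre-trained causal language model is trained with ordinary cross-entropy loss to reproduce a human-written response to an instruction. This is a minimal illustration, not the paper's setup; the "gpt2" checkpoint, the prompt template, and the single training example are stand-ins chosen only so the snippet runs.

```python
# Minimal sketch of the SFT stage, assuming the Hugging Face "gpt2" checkpoint
# as a stand-in for a pre-trained LLM. The instruction/response text and its
# formatting are illustrative, not the data format used in the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# One human-annotated (instruction, response) example.
text = (
    "Instruction: Summarize photosynthesis in one sentence.\n"
    "Response: Plants turn sunlight, water, and CO2 into sugar and oxygen."
)

inputs = tokenizer(text, return_tensors="pt")
# Standard causal-LM loss: the model learns to reproduce the annotated
# response token by token, conditioned on the instruction.
outputs = model(**inputs, labels=inputs["input_ids"])
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```

RLHF then adds a reward model and an online sampling loop on top of a model trained this way; that online loop is the part SteerLM replaces with purely offline fine-tuning.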

RLHF steers the model based on a single, one-dimensional reward, disregarding the many facets of human preference, such as helpfulness, safety, and humor. This makes it difficult for users to customize different aspects of the model's behavior at inference time. The researchers are confident that their approach, SteerLM, can address these deficiencies.

Nvidia introduces SteerLM as an approach to model alignment built entirely on supervised fine-tuning, addressing the limitations of both conventional SFT and RLHF. SteerLM trains the model to assess both response quality and a broad spectrum of human preferences.

In contrast to RLHF, SteerLM adopts an offline and scalable process that generates and annotates its training data independently. This simplifies the process, making SteerLM accessible to a wider range of organizations and users. The researchers clarify, “We train the response generation to consider both prompt instructions and annotated attributes, allowing SteerLM to effectively capture human preferences and generate responses that align with them.”

The authors’ tests demonstrate that SteerLM outperforms both SFT and RLHF at adhering to user instructions. In automated and human evaluations, a 43-billion-parameter LLaMA model fine-tuned with SteerLM surpassed other baseline models, including the larger ChatGPT-3.5. Impressively, even at 13 billion parameters, SteerLM outperformed most baseline models.

Furthermore, SteerLM provides greater control over various aspects of the model’s output, such as humor, creativity, and helpfulness. This heightened control allows for a more personalized and user-centered AI experience. The researchers conclude, “We aspire for our work to inspire further research into developing straightforward and efficient model alignment methods that empower improved AI assistants for everyone.”

SteerLM operates through a series of four steps. In the first step, an “Attribute Prediction Model” (APM) is trained. In contrast to traditional reward models that predict a single quality metric, the APM predicts several aspects of a response, including attributes such as humor, helpfulness, toxicity, creativity, and language quality. The APM is trained on an open-source dataset that has been manually annotated with response attributes, which turns attribute prediction into a manageable supervised learning task.
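
As a rough illustration, here is a minimal sketch of a multi-attribute prediction head in PyTorch. The attribute list, the MSE objective, and the random stand-in features are assumptions for illustration; in the paper the APM is itself a fine-tuned language model trained on human-annotated attribute values.

```python
# Sketch of an Attribute Prediction Model (APM): a regression head that maps
# a response representation to several attribute scores at once, instead of
# a single scalar reward. The feature extractor and attribute list below are
# illustrative stand-ins.
import torch
import torch.nn as nn

ATTRIBUTES = ["quality", "helpfulness", "humor", "toxicity", "creativity"]

class AttributePredictionModel(nn.Module):
    def __init__(self, hidden_size: int = 768, num_attributes: int = len(ATTRIBUTES)):
        super().__init__()
        # In the real setup this head would sit on top of an LLM encoder.
        self.head = nn.Linear(hidden_size, num_attributes)

    def forward(self, response_embedding: torch.Tensor) -> torch.Tensor:
        # One score per attribute, predicted jointly.
        return self.head(response_embedding)

apm = AttributePredictionModel()
fake_embeddings = torch.randn(4, 768)          # stand-in for pooled LLM features
fake_labels = torch.rand(4, len(ATTRIBUTES))   # human-annotated scores in [0, 1]

loss = nn.MSELoss()(apm(fake_embeddings), fake_labels)
loss.backward()
```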

The second step utilizes the trained APM to annotate additional collected data. This process is scalable and significantly faster compared to the manual labeling of data. The researchers emphasize that using the APM for annotation can address certain issues associated with crowd-sourced human-annotated data. These issues include noise resulting from annotators misinterpreting instructions, limited expertise in annotating responses, and varying levels of language comprehension proficiency. The researchers clarify, “By incorporating an Attribute Prediction Model, it becomes possible to mitigate these issues by cleaning up the human-annotated attributes and standardizing scores across different annotators.”
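
A sketch of that annotation pass is shown below. The `predict_attributes` helper is a hypothetical placeholder for a forward pass through the trained APM, and the data rows are made up for illustration.

```python
# Sketch of step 2: using a trained APM to machine-annotate a larger pool of
# (prompt, response) pairs, replacing noisy or inconsistent crowd-sourced labels.
from typing import Dict, List

def predict_attributes(response: str) -> Dict[str, float]:
    # Hypothetical placeholder for the APM forward pass; returns one score
    # per attribute for the given response.
    return {"quality": 0.9, "helpfulness": 0.8, "humor": 0.1, "toxicity": 0.0}

unlabeled: List[Dict[str, str]] = [
    {"prompt": "Explain recursion briefly.",
     "response": "A function that solves a problem by calling itself on smaller inputs."},
]

annotated = []
for row in unlabeled:
    scores = predict_attributes(row["response"])
    # The predicted attribute scores become part of the training record.
    annotated.append({**row, "attributes": scores})
```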

The third step, termed “Attribute-Conditioned SFT,” extends the conventional SFT process by integrating reward signal information through attribute labels. In this step, the large language model (LLM) is trained on offline examples annotated with the APM, as opposed to the online data collection used in RLHF. The researchers note, “By adopting a purely offline training approach, this significantly simplifies the training setup in comparison to the heterogeneous setup of RLHF.” In this stage, the model is conditioned on both the response and its associated attributes.
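
The key mechanical change is in how training examples are formatted: the annotated attribute values are serialized into the prompt so the model learns to condition its response on them. The template below is an assumption for illustration, not the paper's verbatim format.

```python
# Sketch of attribute-conditioned SFT formatting: attribute labels are folded
# into the training prompt, and the formatted string is then used as an
# ordinary SFT target (cross-entropy on the response tokens, no reward model).
def format_example(prompt: str, response: str, attributes: dict) -> str:
    attr_str = ",".join(f"{name}:{score}" for name, score in attributes.items())
    return f"Prompt: {prompt}\nAttributes: {attr_str}\nResponse: {response}"

example = format_example(
    "Write a limerick about GPUs.",
    "There once was a card that ran hot...",
    {"quality": 4, "humor": 4, "toxicity": 0},
)
print(example)
```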

The fourth step is a bootstrapping procedure that loosely mirrors RLHF's sampling loop. The fine-tuned model first generates multiple responses for each prompt while being asked for the highest-quality attribute values. These responses are then ranked using the APM's feedback, and the top-ranked samples inform another round of Attribute-Conditioned SFT. This iterative process allows the model to continuously refine its responses, aligning them more closely with user preferences and instructions. A minimal sketch of the loop follows.
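
In the sketch below, `generate_response` and `score_quality` are hypothetical placeholders for the fine-tuned model and the APM, and the target attribute values are illustrative.

```python
# Sketch of the bootstrapping step: sample several responses per prompt while
# requesting top attribute values, score them with the APM, and keep the
# best-ranked ones as training data for another round of attribute-conditioned SFT.
import random

def generate_response(prompt: str, target_attributes: dict) -> str:
    # Stand-in for sampling from the attribute-conditioned model.
    return f"sampled response to '{prompt}' targeting {target_attributes}"

def score_quality(response: str) -> float:
    # Stand-in for the APM's quality score.
    return random.random()

def bootstrap(prompt: str, num_samples: int = 4) -> str:
    target = {"quality": 4, "helpfulness": 4, "toxicity": 0}  # ask for the best
    candidates = [generate_response(prompt, target) for _ in range(num_samples)]
    # Rank candidates by the APM's feedback and keep the best one.
    return max(candidates, key=score_quality)

best = bootstrap("Explain why the sky is blue.")
```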

Why SteerLM is important

SteerLM possesses several attractive features that distinguish it from current methods. In addition to generating high-quality output, it simplifies the traditionally intricate RLHF pipeline, which typically requires coordinating multiple models, online data collection, and a large annotation workforce.

SteerLM leverages examples from open-source datasets, including the OpenAssistant dataset, the Helpful and Harmless – Reinforcement Learning from Human Feedback (HH-RLHF) dataset, and the Model Self-Identification Dataset. The source code, training recipe, and data are available for other researchers and organizations to use for further research. Furthermore, the trained 13-billion-parameter SteerLM model can be accessed on Hugging Face.

It’s important to note, however, that the training process for SteerLM remains computationally demanding. The researchers report that they used a cluster of 128 A100-80GB GPUs, at an approximate cost of $200 per hour, to train both the Attribute Prediction Model and the Attribute-Conditioned SFT model.

Nevertheless, SteerLM represents a significant reduction in the computational cost associated with RLHF. We can expect other researchers to build on and refine this technique in the coming weeks and months.

