Written by Francis Elhelou
Nvidia’s researchers have introduced a novel approach that could transform the way we align large language models (LLMs) with user instructions. Named SteerLM, the method seeks to address the constraints of conventional reinforcement learning from human feedback (RLHF), the technique typically employed for aligning LLMs.
Unlike RLHF, which conditions responses on a single reward, SteerLM conditions on multiple attributes, promising a more nuanced and user-aligned model performance.
How LLM alignment works
Language models such as ChatGPT have become widely used because of their effectiveness in adhering to user instructions. The common approach for enhancing these models comprises a two-step procedure. First, the pre-trained model undergoes supervised fine-tuning (SFT), during which it learns from human-annotated examples of instructions and responses. SFT enables the model to improve its ability to match user instructions with its responses. Following that, the model goes through reinforcement learning from human feedback (RLHF), wherein a reward model, trained on human preferences, further fine-tunes the LLM to better align with user objectives.
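To make the SFT half of that pipeline concrete, here is a minimal sketch of instruction tuning with ordinary next-token cross-entropy, assuming a small Hugging Face causal LM. The model name, the toy instruction/response pair, the prompt template, and the hyperparameters are illustrative stand-ins, not the setup used for the models discussed here.

```python
# A minimal sketch of the SFT stage, assuming a Hugging Face causal LM and a
# toy instruction/response pair. The model name, data, prompt template, and
# hyperparameters are illustrative, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in for a much larger pre-trained LLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

sft_examples = [
    ("Summarize: cats sleep for most of the day.", "Cats spend most of the day asleep."),
]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for instruction, response in sft_examples:
    text = f"User: {instruction}\nAssistant: {response}"
    batch = tokenizer(text, return_tensors="pt")
    # Ordinary next-token cross-entropy over the instruction/response example.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```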
RLHF steers the model based solely on a one-dimensional reward, disregarding the many facets of human preference, such as helpfulness and safety. This makes it difficult for users to customize different aspects of the model’s behavior at inference time. The researchers are confident that SteerLM can address these deficiencies.
Nvidia presents SteerLM as a model-alignment approach built on supervised fine-tuning that addresses the limitations of both conventional SFT and RLHF. SteerLM trains the model to take into account both response quality and a broad spectrum of human preferences.
In contrast to RLHF, SteerLM adopts an offline and scalable process that generates and annotates its training data independently. This simplifies the process, making SteerLM accessible to a wider range of organizations and users. The researchers clarify, “We train the response generation to consider both prompt instructions and annotated attributes, allowing SteerLM to effectively capture human preferences and generate responses that align with them.”
The authors’ tests demonstrate that SteerLM outperforms both SFT and RLHF in adhering to user instructions. In automated and human evaluations, a 43-billion-parameter LLaMA model fine-tuned with SteerLM surpassed the baseline models, including the larger ChatGPT-3.5. Impressively, even at 13 billion parameters, SteerLM outperformed most baselines.
Furthermore, SteerLM provides greater control over various aspects of the model’s output, such as humor, creativity, and helpfulness. This heightened control allows for a more personalized and user-centered AI experience. The researchers conclude, “We aspire for our work to inspire further research into developing straightforward and efficient model alignment methods that empower improved AI assistants for everyone.”
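As a rough illustration of what that control could look like in practice, the sketch below serializes a set of desired attribute values into the text that conditions the model. The template and attribute names are assumptions made for illustration; the released models define their own exact prompt format.

```python
# Sketch of inference-time steering: the caller picks target attribute values
# and they are folded into the conditioning text. The template below is a
# hypothetical format, not the one used by the released SteerLM checkpoints.
def build_steered_prompt(user_prompt: str, attributes: dict[str, int]) -> str:
    attr_str = ",".join(f"{name}:{value}" for name, value in attributes.items())
    return f"User: {user_prompt}\nAttributes: {attr_str}\nAssistant:"

print(build_steered_prompt(
    "Explain recursion to a five-year-old.",
    {"quality": 4, "helpfulness": 4, "humor": 3, "creativity": 4, "toxicity": 0},
))
```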
SteerLM operates through a series of four steps. In the first step, an “Attribute Prediction Model” (APM) is trained. In contrast to traditional reward models that predict a single quality score, the APM predicts several aspects of a response, including attributes such as humor, helpfulness, toxicity, creativity, and language quality. The APM is trained on an open-source dataset that has been manually annotated with response attributes, which turns attribute prediction into a manageable supervised learning task.
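One simple way to picture such a predictor is a language-model backbone with a small multi-output head, one score per attribute, trained on the annotated data. The backbone choice, attribute list, pooling, and regression-style head below are assumptions made for illustration, not the paper’s exact design.

```python
# Illustrative Attribute Prediction Model: a transformer backbone with a head
# that scores several response attributes at once. Backbone choice, pooling,
# and the regression formulation are assumptions, not the paper's exact design.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

ATTRIBUTES = ["quality", "helpfulness", "humor", "creativity", "toxicity"]

class AttributePredictionModel(nn.Module):
    def __init__(self, backbone_name: str = "distilbert-base-uncased"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        # One score per attribute (e.g. on a 0-4 scale).
        self.head = nn.Linear(self.backbone.config.hidden_size, len(ATTRIBUTES))

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        pooled = hidden.last_hidden_state[:, 0]  # first-token ([CLS]-style) pooling
        return self.head(pooled)

apm_tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
apm = AttributePredictionModel()
batch = apm_tokenizer("Prompt: tell me a joke\nResponse: ...", return_tensors="pt")
scores = apm(batch["input_ids"], batch["attention_mask"])  # shape: (1, 5)
```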
The second step uses the trained APM to annotate additional collected data. This process is scalable and significantly faster than manual labeling. The researchers emphasize that using the APM for annotation can address certain issues associated with crowd-sourced human annotations, including noise from annotators misinterpreting instructions, limited expertise in annotating responses, and varying levels of language proficiency. The researchers clarify, “By incorporating an Attribute Prediction Model, it becomes possible to mitigate these issues by cleaning up the human-annotated attributes and standardizing scores across different annotators.”
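A sketch of that annotation pass, reusing the illustrative `apm`, `apm_tokenizer`, and `ATTRIBUTES` objects from the previous sketch, could look like the loop below; the real pipeline runs the same idea in batches over much larger corpora.

```python
# Sketch of step two: run the trained Attribute Prediction Model over unlabeled
# prompt/response pairs to produce attribute annotations automatically.
import torch

unlabeled = [
    {"prompt": "Write a haiku about rain.", "response": "Soft rain on rooftops..."},
]

annotated = []
with torch.no_grad():
    for example in unlabeled:
        text = f"Prompt: {example['prompt']}\nResponse: {example['response']}"
        batch = apm_tokenizer(text, return_tensors="pt")
        scores = apm(batch["input_ids"], batch["attention_mask"]).squeeze(0)
        example["attributes"] = {
            name: round(float(score)) for name, score in zip(ATTRIBUTES, scores)
        }
        annotated.append(example)
```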
The third step, termed “Attribute-Conditioned SFT,” extends the conventional SFT process by integrating reward signal information through attribute labels. In this step, the large language model (LLM) is trained on offline examples annotated with the APM, as opposed to the online data collection used in RLHF. The researchers note, “By adopting a purely offline training approach, this significantly simplifies the training setup in comparison to the heterogeneous setup of RLHF.” In this stage, the model is conditioned on both the response and its associated attributes.
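In code, attribute-conditioned SFT can look almost identical to the plain SFT sketch above: the only change is that the annotated attribute values are serialized into the conditioning text ahead of the response. The template below, and the reuse of the illustrative `model`, `tokenizer`, and `annotated` objects from the earlier sketches, are assumptions for illustration.

```python
# Sketch of attribute-conditioned SFT: identical to plain SFT except that the
# annotated attributes become part of the text the response is conditioned on.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for example in annotated:
    attr_str = ",".join(f"{k}:{v}" for k, v in example["attributes"].items())
    text = (
        f"User: {example['prompt']}\n"
        f"Attributes: {attr_str}\n"
        f"Assistant: {example['response']}"
    )
    batch = tokenizer(text, return_tensors="pt")
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```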
Following the SFT phase, SteerLM employs a two-step procedure that mirrors RLHF. Initially, it generates multiple responses from the fine-tuned model for each prompt while specifying a maximum quality criterion. Subsequently, these responses are ranked using the APM’s feedback. This ranking informs another round of Attribute-Conditioned SFT. This iterative process allows the model to continuously refine its responses, aligning them more closely with user preferences and instructions.
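The sketch below mimics that sample-and-rank loop using the illustrative objects defined earlier (`model`, `tokenizer`, `apm`, `apm_tokenizer`, `ATTRIBUTES`, and `build_steered_prompt`). The sampling parameters and the “maximum quality” attribute values are assumptions; the highest-ranked candidates would then feed another round of attribute-conditioned SFT.

```python
# Sketch of the bootstrapping step: sample several responses while requesting
# maximum-quality attributes, then rank the candidates with the APM's scores.
import torch

def sample_and_rank(prompt: str, n_samples: int = 4) -> list[dict]:
    target_attrs = {"quality": 4, "helpfulness": 4, "toxicity": 0}  # illustrative "max quality"
    steered = build_steered_prompt(prompt, target_attrs)
    inputs = tokenizer(steered, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,
        top_p=0.9,
        max_new_tokens=64,
        num_return_sequences=n_samples,
        pad_token_id=tokenizer.eos_token_id,
    )
    candidates = []
    with torch.no_grad():
        for output_ids in outputs:
            response = tokenizer.decode(
                output_ids[inputs["input_ids"].shape[1]:], skip_special_tokens=True
            )
            batch = apm_tokenizer(
                f"Prompt: {prompt}\nResponse: {response}", return_tensors="pt"
            )
            scores = apm(batch["input_ids"], batch["attention_mask"]).squeeze(0)
            quality = float(scores[ATTRIBUTES.index("quality")])
            candidates.append({"prompt": prompt, "response": response, "quality": quality})
    # The best-ranked responses feed the next attribute-conditioned SFT round.
    return sorted(candidates, key=lambda c: c["quality"], reverse=True)
```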
Why SteerLM is important
SteerLM possesses several attractive features that distinguish it from the current methods. In addition to its ability to generate high-quality output, it simplifies the traditionally intricate RLHF pipeline, which typically requires the coordination of a large workforce.
SteerLM leverages examples drawn from open-source datasets, including the OpenAssistant dataset, the Helpful and Harmless RLHF (HH-RLHF) dataset, and the Model Self-Identification Dataset. The source code, training recipe, and data are available for other researchers and organizations to use in further research. Furthermore, the trained 13-billion-parameter SteerLM model can be accessed on Hugging Face.
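For readers who want to try the released checkpoint, loading it should follow the usual Hugging Face pattern shown below. The repository id in the sketch is a placeholder, not a verified name; consult Nvidia’s Hugging Face page for the actual model card and its documented prompt and attribute format.

```python
# Sketch of loading the released SteerLM checkpoint from the Hugging Face Hub.
# NOTE: the repository id below is a placeholder, not a verified model name.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "nvidia/steerlm-13b"  # placeholder; look up the real id on Hugging Face
steerlm_tokenizer = AutoTokenizer.from_pretrained(repo_id)
steerlm_model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto")
```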
It’s important to note, however, that the training process for SteerLM remains computationally demanding. The researchers report using a cluster of 128 A100-80GB GPUs, at an approximate cost of $200 per hour, to train both the Attribute Prediction Model and the Attribute-Conditioned SFT model.
Nevertheless, SteerLM represents a significant improvement over the previous computational costs associated with RLHF. We can anticipate that other researchers will enhance this technique in the upcoming weeks and months.