# An attempt at an RLVR terminology

Ever since DeepSeek-R1 came out, many people have been rushing to finetune models with the techniques that R1 was trained on. As of now, terms like 'verifier' have not really been clearly defined, leading to a lot of confusion, as everyone has their own ideas about what is and isn't part of the verifier. In this short post I want to make the case for one way to think about the different terms. I will quickly explain and give an example for each of the primitives that I deem to be necessary for building a successful Gym-like framework for training LLMs for reasoning. For now, I want to focus on RLVR and purposefully exclude RLHF and PRMs or other neural reward models, though we can talk about how they fit in some other time.

My claim is that all you really need as primitives are:

- verifiers
- reward functions
- extractors / parsers
- datasets
- environments
- RL algorithms

Let's go through them one by one and see how they interact on a superficial level:

1. Verifier
    - A verifier is just a function that takes in the relevant part of an LLM response and, optionally, the ground truth data, and tells us whether the output is correct.
    - Example: Suppose the LLM is tasked with telling us the derivative of some function. We extract the relevant part that contains the solution in a form that our verifier expects, like `3*x**2 + 1` as a string. The **SymPy verifier** then checks whether this is the correct answer, given the original function, which it has access to.
2. Reward Function
    - A reward function is a sum of several component functions. The most important part of that sum is the verifier reward, which is either $1$ if the verifier returns `True` or $0$ if the verifier returns `False`.
    - The other components of the sum are what I call *rubrics*^[Shoutouts to [Will Brown](https://x.com/willccbb) for popularizing (if not coining) the term. You should definitely check him out if you are interested in the stuff discussed here. He will go down in LLM RLVR history at this rate.]. These are just rules that the output of the LLM has to follow.
    - They can be orthogonal or partly related to the verifier. There are, from what I can tell, no empirical analyses of how best to design rubrics, though rubrics engineering is likely to emerge as a niche field in the future.
    - Naturally, rubrics could also themselves be generated by an LLM.
    - Example: In the DeepSeek-R1 paper, they added a format reward, which forced the LLM to do its thinking inside special tokens.
3. Extractor / Parser
    - An extractor or parser takes the response of the LLM, extracts the relevant part and converts it into a format that the verifier can handle.
    - Whether this is done inside the verifier or independently of it is an implementation detail that I will not comment on for now.
    - I prefer the term 'extractor', as 'parser' is quite the overloaded word (though for good reason), even in the LLM space.
    - Example: Suppose the LLM is tasked with telling us the derivative of some function. We have used rubrics to train the LLM to place its final response into a `\boxed{}`. The extractor would then find the last `\boxed{}` and convert its contents into a format that SymPy can understand. (A code sketch combining items 1-3 follows below.)
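To make items 1-3 a bit more concrete, here is a minimal sketch of how an extractor, a SymPy-based verifier, and a reward function with a single format rubric could fit together for the derivative example. All function names and the exact scoring are my own inventions for illustration; treat this as a sketch, not a reference implementation.

```python
import re

import sympy as sp


def extract_boxed(response: str) -> str | None:
    """Extractor: pull out the contents of the last \\boxed{...} in the response.
    (Naive regex; nested braces would need a real parser.)"""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1] if matches else None


def sympy_derivative_verifier(answer: str, function: str) -> bool:
    """Verifier: is `answer` the derivative of `function`?
    Both arguments are strings that SymPy can parse."""
    try:
        x = sp.symbols("x")
        expected = sp.diff(sp.sympify(function), x)
        return sp.simplify(sp.sympify(answer) - expected) == 0
    except (sp.SympifyError, TypeError):
        return False


def format_rubric(response: str) -> float:
    """Rubric: a small bonus if the model actually used \\boxed{} for its answer."""
    return 0.1 if "\\boxed{" in response else 0.0


def reward(response: str, function: str) -> float:
    """Reward function: verifier reward (1 or 0) plus the rubric terms."""
    answer = extract_boxed(response)
    verified = answer is not None and sympy_derivative_verifier(answer, function)
    return (1.0 if verified else 0.0) + format_rubric(response)


print(reward("The derivative is \\boxed{3*x**2 + 1}", "x**3 + x"))  # 1.1
```

Note that the verifier only ever sees the extracted string and the ground truth, which is exactly what keeps it swappable.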
4. Dataset
    - A dataset contains the questions and the ground truth data that we train our model on.
    - A dataset can also be *generative*, i.e. an LLM that has been trained to pose questions. For now, I will leave 'how we can do verification in this setting' as an exercise to the reader. The savvy reader will know.
    - Example: A wrapper around the MATH dataset that lets us sample questions to pose to the LLM being trained.
5. Environment
    - An environment ties it all together. It holds state and defines a reward function, a dataset, a reset function that pulls a new question from the dataset, and a step function that advances the state of the environment depending on the action the LLM took. (A rough code sketch of such an environment is at the end of this post.)
    - An action does not have to be restricted to a single output token. It can also be a tool call or a whole thought.
    - We could also define decoding behaviour based on logit statistics like entropy and varentropy. *cough* entropix?
6. RL Algorithm
    - An algorithm that takes in an environment and a base model, as well as some other parameters, and trains the model on the environment.
    - Example: TRL GRPOTrainer, veRL, open-instruct, ...

We should be able to mix and match all of these. Make RL composable again!

![[Pasted image 20250215233600.png]]

If you think that I am a blithering idiot for my mental model or you dislike some part of it, please do tell. Happy to discuss and update my model (or yours 😉).
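P.S. To make the 'mix and match' point a little more concrete, here is a rough sketch of what a dataset wrapper and a single-turn environment from the list above could look like, and where the reward function and the RL algorithm would plug in. Every name here is made up for illustration; this is not the API of any existing library.

```python
import random
from dataclasses import dataclass, field


@dataclass
class DerivativeDataset:
    """Dataset: a thin wrapper that samples (question, ground truth) pairs.
    Here the ground truth the verifier needs is simply the original function."""
    problems: list = field(default_factory=lambda: ["x**3 + x", "sin(x)", "exp(2*x)"])

    def sample(self) -> dict:
        problem = random.choice(self.problems)
        question = (f"What is the derivative of {problem} with respect to x? "
                    "Put your final answer in \\boxed{}.")
        return {"question": question, "ground_truth": problem}


class DerivativeEnvironment:
    """Environment: ties dataset and reward function together and exposes
    reset/step. Single-turn: one action (a full completion) ends the episode."""

    def __init__(self, dataset, reward_fn):
        self.dataset = dataset
        self.reward_fn = reward_fn  # e.g. the reward function sketched earlier
        self.current = None         # the environment's state

    def reset(self) -> str:
        """Pull a new question from the dataset and return it as the prompt."""
        self.current = self.dataset.sample()
        return self.current["question"]

    def step(self, action: str) -> tuple:
        """Score the LLM's action (its full completion) and end the episode."""
        reward = self.reward_fn(action, self.current["ground_truth"])
        return reward, True  # (reward, done)


def toy_reward(response: str, problem: str) -> float:
    """Stand-in reward so the snippet runs on its own; in practice you would
    plug in the verifier-based reward function sketched earlier."""
    return 1.0 if "\\boxed{" in response else 0.0


env = DerivativeEnvironment(DerivativeDataset(), reward_fn=toy_reward)
prompt = env.reset()
completion = "Differentiating term by term gives \\boxed{3*x**2 + 1}"  # pretend the LLM said this
print(env.step(completion))  # (1.0, True)

# An RL algorithm (TRL's GRPOTrainer, veRL, ...) would sit on top of this,
# sampling completions from the policy for each prompt and updating the
# policy with the resulting rewards.
```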