# Evolution of Reinforcement Learning in Large Language Model Training: Insights from the FLOCK Research Team
The FLOCK research team has released a report highlighting advancements in reinforcement learning (RL), which is considered the “second half” of the large language model (LLM) training process. According to the report, Chinese AI company DeepSeek has recently introduced the Group Relative Policy Optimization (GRPO) technique, which reduces human intervention while maintaining model performance.
Traditional LLM training involves three stages: pre-training, supervised fine-tuning, and reinforcement learning from human feedback (RLHF). Among these, RL is a critical process that refines the model to better meet user expectations.
# Understanding RL: AI Training Through Interaction and Rewards
Reinforcement learning is often likened to Pavlov’s dog experiment. In an environment where rewards are given for specific behaviors, an AI agent learns to make optimal choices. Here, rewards signal the success of actions. Prominent RL algorithms include Q-learning, Deep Q-Network (DQN), Policy Gradient, and Proximal Policy Optimization (PPO). These algorithms allow the agent to choose actions based on the current state and learn through received rewards.
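To make the reward-driven update loop concrete, here is a minimal tabular Q-learning sketch in Python. The environment, states, and hyperparameter values are illustrative assumptions, not details from the report.

```python
# Minimal tabular Q-learning sketch: the agent refines its estimate of each
# (state, action) pair from the rewards it receives. Values are illustrative only.
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.99, 0.1   # learning rate, discount factor, exploration rate
q_table = defaultdict(float)              # Q[(state, action)] -> estimated long-term return

def choose_action(state, actions):
    """Epsilon-greedy: usually pick the best-known action, occasionally explore."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: q_table[(state, a)])

def q_update(state, action, reward, next_state, actions):
    """Q-learning update: move Q toward reward + discounted value of the best next action."""
    best_next = max(q_table[(next_state, a)] for a in actions)
    target = reward + GAMMA * best_next
    q_table[(state, action)] += ALPHA * (target - q_table[(state, action)])
```

The same principle, choosing actions from the current state and learning from the reward signal, carries over to the deep variants (DQN, policy-gradient methods, PPO), which replace the table with a neural network.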
# RLHF and PPO: How Models Learn from Human Feedback
Reinforcement learning is frequently used in the final stage of LLM training. After the model generates several candidate responses, humans rank their quality. This ranking data is used to train a reward model, and algorithms like PPO are then applied to improve the policy. PPO updates the policy in small, constrained steps so no single update changes the model's behavior too drastically, while Generalized Advantage Estimation (GAE) estimates how much better each response is than expected. The critic (value model) predicts expected long-term reward, which smooths the model's updates.
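The sketch below shows, under common PPO conventions, how GAE accumulates per-step temporal-difference errors into a smoothed advantage and how the clipped loss bounds each policy update. The function names, hyperparameters, and inputs are illustrative assumptions rather than details from the report.

```python
# Hedged sketch of GAE and PPO's clipped surrogate loss (illustrative, not the
# report's implementation). Rewards come from the reward model; values from the critic.
import numpy as np

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """GAE: exponentially weighted sum of one-step TD errors, smoothing the advantage signal."""
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    advantages = np.zeros_like(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]   # one-step TD error
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

def ppo_clipped_loss(ratio, advantage, clip_eps=0.2):
    """PPO objective: clip the policy ratio so each update stays close to the old policy."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return -np.minimum(unclipped, clipped).mean()
```

The clipping is what keeps the policy update "stable without excessive changes": if the new policy drifts too far from the old one, the gradient contribution of that sample is cut off.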
# GRPO: Maintaining RLHF Performance Without a Critic—DeepSeek’s New Approach
DeepSeek’s GRPO is a simplified variant of PPO. The core idea is “Group-Based Advantage Estimation (GRAE)”: multiple responses are generated for a single prompt and compared against one another to judge their relative quality. These relative scores are then used to update the model with a PPO-style loss function. GRPO retains the reward model but removes the critic (value function), simplifying training. The approach requires fewer computational resources and makes training on complex reasoning tasks more efficient.
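A rough sketch of the group-based idea follows. Normalizing each response's reward within its group is a common reading of group-relative advantage estimation; the numbers and function names here are made up for illustration, not DeepSeek's exact recipe.

```python
# Group-relative advantages in the spirit of GRPO (assumed formulation): several
# responses to one prompt are scored by the reward model, and each response's
# advantage is its reward standardized within the group, so no learned critic is needed.
import numpy as np

def group_relative_advantages(group_rewards, eps=1e-8):
    """Advantage of each sampled response = (reward - group mean) / group std."""
    rewards = np.asarray(group_rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 responses sampled for one prompt, scored by the reward model.
advantages = group_relative_advantages([0.2, 0.9, 0.5, 0.4])
# Responses scoring above the group average get positive advantages and are reinforced
# by the PPO-style clipped loss; those below average are pushed down.
```

Because the baseline is the group's own mean reward rather than a critic's prediction, the value network and its training cost disappear, which is where the computational savings come from.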
# Transforming LLMs with RL
Reinforcement learning improves not just response quality but also the alignment of AI with human-centric criteria, yielding more reliable answers in real-world scenarios. RLHF in particular is crucial when expert-labeled data is scarce, since it fine-tunes the model from human preference judgments. The FLOCK research team stated, “The advancement of RL techniques is key to both the practicality and transparency of AI,” and plans to continue covering the topic through its educational series.