Kwai AI’s SRPO framework reduces large language model (LLM) reinforcement learning (RL) post-training steps by 90%, while still matching the performance of DeepSeek-R1 in tasks like math and coding. This two-stage RL strategy, which incorporates history resampling, effectively addresses the limitations of GRPO. The impressive results of OpenAI’s o1 series and DeepSeek-R1 clearly highlight how large-scale RL can unlock advanced reasoning abilities and significantly boost LLM performance.
SRPO: A New Frontier in Reinforcement Learning for Multi-Domain Reasoning
Despite major advances in reasoning models, the underlying training methodologies often remain opaque. Most recent efforts have emphasized mathematical reasoning, leaving cross-domain generalization, particularly between math and code, largely underexplored. Standard Group Relative Policy Optimization (GRPO) methods suffer from performance limitations, inefficient sample usage, and an inability to develop domain-specific reasoning skills on mixed datasets, impeding the scalability of reinforcement learning for large language models (LLMs).
To address these issues, the Kwaipilot team at Kuaishou has introduced Two-Staged history-Resampling Policy Optimization (SRPO), a novel RL framework tailored to overcome the inefficiencies of conventional GRPO. The team has open-sourced the SRPO-Qwen-32B model along with a detailed technical report.
Remarkably, SRPO achieves DeepSeek-R1-Zero-level performance across both math and code domains using only one-tenth of the training steps, surpassing the DeepSeek-R1-Zero results on benchmarks such as AIME24 (50) and LiveCodeBench (41.6) while using the same base model (Qwen2.5-32B).
Problems with Standard GRPO
Initial attempts with vanilla GRPO revealed several major bottlenecks:
- Cross-Domain Conflicts: Mathematical problems demand detailed chain-of-thought (CoT) reasoning, while code tasks typically do not. Mixing them led to degraded performance in both areas.
- Reward Homogeneity: When most samples in a batch receive similar rewards, gradient updates become negligible due to near-zero advantage values, severely reducing learning efficiency (see the sketch after this list).
- Early Saturation: Performance gains plateaued early, often due to low-quality or overly simplistic training data, limiting the model’s capacity to tackle complex problems.
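To make the reward-homogeneity problem concrete, here is a minimal sketch of the group-relative advantage computation that GRPO-style methods apply per prompt, assuming binary correctness rewards; the function name and epsilon handling are illustrative, not taken from the Kwaipilot codebase:

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Normalize rewards within a group of rollouts for the same prompt:
    advantage_i = (r_i - mean(r)) / (std(r) + eps)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# A group with mixed outcomes produces informative advantages...
print(group_relative_advantages(np.array([1.0, 0.0, 1.0, 0.0])))   # [ 1. -1.  1. -1.]

# ...but a homogeneous group (all rollouts correct, or all wrong) collapses
# to near-zero advantages, so that prompt contributes almost no gradient.
print(group_relative_advantages(np.array([1.0, 1.0, 1.0, 1.0])))   # [0. 0. 0. 0.]
```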
Two-Stage Training Strategy
SRPO employs a two-phase curriculum to resolve reasoning conflicts and maximize learning efficiency:
Stage 1: Reasoning Development
Focuses exclusively on challenging math data, fostering robust reasoning behaviors like reflection, backtracking, and step-by-step decomposition.
Stage 2: Skill Integration
Gradually introduces code data, building on the reasoning foundation. This enhances procedural thinking, recursion, and tool-use capabilities in programming tasks.
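The report describes Stage 1 as math-only training and Stage 2 as a gradual re-introduction of code data. Below is a minimal sketch of how such a staged sampling schedule could look in practice; the step counts, batch size, and ramp rate are assumptions for illustration, not the paper's actual values:

```python
import random

def sample_batch(step, math_pool, code_pool,
                 stage1_steps=400, ramp_steps=100, max_code_ratio=0.5,
                 batch_size=32):
    # Stage 1: reasoning development, math problems only.
    if step < stage1_steps:
        return random.sample(math_pool, batch_size)
    # Stage 2: skill integration, with the code share ramped up gradually.
    ramp = min(1.0, (step - stage1_steps) / ramp_steps)
    n_code = int(batch_size * max_code_ratio * ramp)
    return (random.sample(code_pool, n_code)
            + random.sample(math_pool, batch_size - n_code))
```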
Comparative Training Outcomes
The team analyzed how different training-data regimes affect reasoning depth and response length:
- Mixed Math + Code: Led to diluted reasoning, short responses, and mediocre performance.
- Math-Only: Encouraged long, detailed reasoning with transferable skills even for code tasks.
- Code-Only: Improved coding benchmarks but failed to cultivate deep reasoning or long-form answers.
- Staged Training (SRPO): Delivered best-in-class results, with consistent reflective reasoning in math and structured problem-solving in code, including spontaneous use of code for verifying math solutions.
History Resampling for Better Gradient Signals
During mid-to-late training, over half of the sampled groups in a GRPO batch returned identical rewards across all rollouts, yielding near-zero advantages and poor gradient signals. SRPO resolves this with History Resampling, which:
- Filters out trivial samples (where all rollouts are correct).
- Prioritizes diverse and difficult samples, ensuring better reward variance and meaningful gradient updates.
- Implements curriculum learning by retaining hard examples that may later yield productive gradients.
This strategy substantially outperformed dynamic sampling methods like those in DAPO, leading to more efficient training and stable improvements in response length and reasoning depth.
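As a rough illustration of the filtering described above, the sketch below rebuilds the training pool from the previous epoch's rollout rewards, assuming binary correct/incorrect scoring; the function name and the epoch-level granularity are assumptions, not the Kwaipilot implementation:

```python
import numpy as np

def history_resample(prompts, reward_history, keep_all_wrong=True):
    """Rebuild the training pool from recorded rollout rewards.

    - Drop prompts whose rollouts were all correct: their advantages are zero
      and they no longer provide a gradient signal.
    - Keep prompts with mixed outcomes, which give useful reward variance.
    - Optionally retain prompts where every rollout failed; as the policy
      improves they may start yielding productive gradients (a curriculum effect).
    """
    kept = []
    for prompt, rewards in zip(prompts, reward_history):
        r = np.asarray(rewards)
        if np.all(r == r.max()) and r.max() > 0:   # trivially solved: filter out
            continue
        if r.std() > 0 or keep_all_wrong:          # informative, or hard but retained
            kept.append(prompt)
    return kept
```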
Data Preparation and Quality Control
For robust performance, the Kwaipilot team curated and cleaned public math and code datasets using strict filters. They:
- Removed malformed entries and ambiguous solutions.
- Eliminated math questions requiring visual interpretation and code tasks needing specific runtime environments.
- Verified answer correctness and labeled problem difficulty via pass@k rates (a sketch of pass@k-based labeling follows).
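The report does not publish its exact difficulty buckets, but pass@k has a standard unbiased estimator: with n sampled solutions of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k). The sketch below shows how such an estimate could drive difficulty labels; the thresholds and the default k are assumptions:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k draws
    from n sampled solutions (c of them correct) passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def difficulty_label(n: int, c: int, k: int = 8) -> str:
    # Illustrative cut-offs; the report's actual buckets are not stated.
    p = pass_at_k(n, c, k)
    if p > 0.9:
        return "easy"
    if p > 0.3:
        return "medium"
    return "hard"
```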
Experimental Observations
SRPO’s training process demonstrated a steady and interpretable growth curve:
- Stage 1 yielded a rapid increase in reward and response length.
- Stage 2 initially saw a dip in rewards as code tasks were introduced, followed by stable growth.
- The response length for code tasks remained mostly consistent, confirming the model’s reliance on reasoning structures learned in Stage 1.
Emergence of Reflective Reasoning
SRPO led to the development of human-like metacognitive behaviors:
- Rechecking, hesitation, and exploratory reasoning patterns emerged during training.
- These behaviors increased significantly in frequency over time, demonstrating self-verification and adaptive problem-solving.
- The model began generating code to validate its mathematical solutions, showcasing integrated reasoning and tool usage.
Conclusion
The SRPO framework marks a significant leap in reinforcement learning for multi-domain reasoning. Through its two-staged approach and innovative history resampling strategy, SRPO not only overcomes the key limitations of traditional GRPO but also fosters advanced, human-like reasoning behaviors in LLMs—achieving high performance with remarkable training efficiency. This work sets a new benchmark for scalable, generalizable reasoning in LLM training.