Kwai AI’s SRPO framework reduces large language model (LLM) reinforcement learning (RL) post-training steps by 90%, while still matching the performance of DeepSeek-R1 in tasks like math and coding. This two-stage RL strategy, which incorporates history resampling, effectively addresses the limitations of GRPO. The impressive results of OpenAI’s o1 series and DeepSeek-R1 clearly highlight how large-scale RL can unlock advanced reasoning abilities and significantly boost LLM performance.
SRPO: A New Frontier in Reinforcement Learning for Multi-Domain Reasoning
Despite major advances in reasoning models, the underlying training methodologies often remain opaque. Most recent efforts have emphasized mathematical reasoning, leaving cross-domain generalization, particularly between math and code, largely underexplored. Standard Group Relative Policy Optimization (GRPO) methods suffer from performance limitations, inefficient sample usage, and an inability to develop domain-specific reasoning skills on mixed datasets, impeding the scalability of reinforcement learning for large language models (LLMs).
To address these issues, the Kwaipilot team at Kuaishou has introduced Two-Staged history-Resampling Policy Optimization (SRPO), a novel RL framework tailored to overcome the inefficiencies of conventional GRPO. The team has open-sourced the SRPO-Qwen-32B model along with a detailed technical report.
Remarkably, SRPO achieves DeepSeek-R1-Zero-level performance across both math and code domains using only one-tenth of the training steps, surpassing the DeepSeek-R1-Zero results on benchmarks such as AIME24 (50) and LiveCodeBench (41.6) while using the same base model (Qwen2.5-32B).
Problems with Standard GRPO
Initial attempts with vanilla GRPO revealed several major bottlenecks:
- Cross-Domain Conflicts: Mathematical problems demand detailed chain-of-thought (CoT) reasoning, while code tasks typically do not. Mixing them led to degraded performance in both areas.
- Reward Homogeneity: When most samples in a batch receive similar rewards, gradient updates become negligible due to near-zero advantage values, severely reducing learning efficiency (see the sketch after this list).
- Early Saturation: Performance gains plateaued early, often due to low-quality or overly simplistic training data, limiting the model’s capacity to tackle complex problems.
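To make the reward-homogeneity problem concrete, here is a minimal sketch of the group-relative advantage computation that GRPO-style methods apply per prompt, assuming binary correctness rewards; the function name and epsilon handling are illustrative, not taken from the Kwaipilot codebase:

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Normalize rewards within a group of rollouts for the same prompt:
    advantage_i = (r_i - mean(r)) / (std(r) + eps)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# A group with mixed outcomes produces informative advantages...
print(group_relative_advantages(np.array([1.0, 0.0, 1.0, 0.0])))   # [ 1. -1.  1. -1.]

# ...but a homogeneous group (all rollouts correct, or all wrong) collapses
# to near-zero advantages, so that prompt contributes almost no gradient.
print(group_relative_advantages(np.array([1.0, 1.0, 1.0, 1.0])))   # [0. 0. 0. 0.]
```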
Two-Stage Training Strategy
SRPO employs a two-phase curriculum to resolve reasoning conflicts and maximize learning efficiency:
Stage 1: Reasoning Development
Focuses exclusively on challenging math data, fostering robust reasoning behaviors like reflection, backtracking, and step-by-step decomposition.
Stage 2: Skill Integration
Gradually introduces code data, building on the reasoning foundation. This enhances procedural thinking, recursion, and tool-use capabilities in programming tasks.
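The report describes Stage 1 as math-only training and Stage 2 as a gradual re-introduction of code data. Below is a minimal sketch of how such a staged sampling schedule could look in practice; the step counts, batch size, and ramp rate are assumptions for illustration, not the paper's actual values:

```python
import random

def sample_batch(step, math_pool, code_pool,
                 stage1_steps=400, ramp_steps=100, max_code_ratio=0.5,
                 batch_size=32):
    # Stage 1: reasoning development, math problems only.
    if step < stage1_steps:
        return random.sample(math_pool, batch_size)
    # Stage 2: skill integration, with the code share ramped up gradually.
    ramp = min(1.0, (step - stage1_steps) / ramp_steps)
    n_code = int(batch_size * max_code_ratio * ramp)
    return (random.sample(code_pool, n_code)
            + random.sample(math_pool, batch_size - n_code))
```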
Comparative Training Outcomes
The team analyzed how different training-data regimes affect reasoning depth and response length:
- Mixed Math + Code: Led to diluted reasoning, short responses, and mediocre performance.
- Math-Only: Encouraged long, detailed reasoning with transferable skills even for code tasks.
- Code-Only: Improved coding benchmarks but failed to cultivate deep reasoning or long-form answers.
- Staged Training (SRPO): Delivered best-in-class results, with consistent reflective reasoning in math and structured problem-solving in code, including spontaneous use of code for verifying math solutions.
History Resampling for Better Gradient Signals
During mid-to-late training, over half of the sampled groups in a GRPO batch returned identical rewards across all rollouts, yielding near-zero advantages and poor gradient signals. SRPO resolves this with History Resampling, which:
- Filters out trivial samples (where all rollouts are correct).
- Prioritizes diverse and difficult samples, ensuring better reward variance and meaningful gradient updates.
- Implements curriculum learning by retaining hard examples that may later yield productive gradients.
This strategy substantially outperformed dynamic sampling methods like those in DAPO, leading to more efficient training and stable improvements in response length and reasoning depth.
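As a rough illustration of the filtering described above, the sketch below rebuilds the training pool from the previous epoch's rollout rewards, assuming binary correct/incorrect scoring; the function name and the epoch-level granularity are assumptions, not the Kwaipilot implementation:

```python
import numpy as np

def history_resample(prompts, reward_history, keep_all_wrong=True):
    """Rebuild the training pool from recorded rollout rewards.

    - Drop prompts whose rollouts were all correct: their advantages are zero
      and they no longer provide a gradient signal.
    - Keep prompts with mixed outcomes, which give useful reward variance.
    - Optionally retain prompts where every rollout failed; as the policy
      improves they may start yielding productive gradients (a curriculum effect).
    """
    kept = []
    for prompt, rewards in zip(prompts, reward_history):
        r = np.asarray(rewards)
        if np.all(r == r.max()) and r.max() > 0:   # trivially solved: filter out
            continue
        if r.std() > 0 or keep_all_wrong:          # informative, or hard but retained
            kept.append(prompt)
    return kept
```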
Data Preparation and Quality Control
For robust performance, the Kwaipilot team curated and cleaned public math and code datasets using strict filters. They:
- Removed malformed entries and ambiguous solutions.
- Eliminated math questions requiring visual interpretation and code tasks needing specific runtime environments.
- Verified answer correctness and labeled problem difficulty via pass@k rates (a sketch of pass@k-based labeling follows).
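The report does not publish its exact difficulty buckets, but pass@k has a standard unbiased estimator: with n sampled solutions of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k). The sketch below shows how such an estimate could drive difficulty labels; the thresholds and the default k are assumptions:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k draws
    from n sampled solutions (c of them correct) passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def difficulty_label(n: int, c: int, k: int = 8) -> str:
    # Illustrative cut-offs; the report's actual buckets are not stated.
    p = pass_at_k(n, c, k)
    if p > 0.9:
        return "easy"
    if p > 0.3:
        return "medium"
    return "hard"
```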
Experimental Observations
SRPO’s training process demonstrated a steady and interpretable growth curve:
- Stage 1 yielded a rapid increase in reward and response length.
- Stage 2 initially saw a dip in rewards as code tasks were introduced, followed by stable growth.
- The response length for code tasks remained mostly consistent, confirming the model’s reliance on reasoning structures learned in Stage 1.
Emergence of Reflective Reasoning
SRPO led to the development of human-like metacognitive behaviors:
- Rechecking, hesitation, and exploratory reasoning patterns emerged during training.
- These behaviors increased significantly in frequency over time, demonstrating self-verification and adaptive problem-solving.
- The model began generating code to validate its mathematical solutions, showcasing integrated reasoning and tool usage.
Conclusion
The SRPO framework marks a significant leap in reinforcement learning for multi-domain reasoning. Through its two-staged approach and innovative history resampling strategy, SRPO not only overcomes the key limitations of traditional GRPO but also fosters advanced, human-like reasoning behaviors in LLMs—achieving high performance with remarkable training efficiency. This work sets a new benchmark for scalable, generalizable reasoning in LLM training.