    Can GRPO Efficiency Be Increased Tenfold? Kwai AI’s SRPO Says Yes

By Irma E | June 27, 2025

    Kwai AI’s SRPO framework reduces large language model (LLM) reinforcement learning (RL) post-training steps by 90%, while still matching the performance of DeepSeek-R1 in tasks like math and coding. This two-stage RL strategy, which incorporates history resampling, effectively addresses the limitations of GRPO. The impressive results of OpenAI’s o1 series and DeepSeek-R1 clearly highlight how large-scale RL can unlock advanced reasoning abilities and significantly boost LLM performance.

    SRPO: A New Frontier in Reinforcement Learning for Multi-Domain Reasoning

    Despite major advances in reasoning models, the underlying training methodologies often remain opaque. Most recent efforts have emphasized mathematical reasoning, leaving cross-domain generalization (particularly between math and code) largely underexplored. Standard Group Relative Policy Optimization (GRPO) methods suffer from performance limitations, inefficient sample usage, and an inability to develop domain-specific reasoning skills on mixed datasets, impeding the scalability of reinforcement learning for large language models (LLMs).
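    For reference, standard GRPO scores each rollout against the other rollouts sampled for the same prompt. Its group-relative advantage (the standard formulation, not restated in the Kwaipilot report) is

    $$\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)},$$

    where $G$ rollouts are drawn per prompt and $r_i$ is the reward of the $i$-th rollout. This normalized quantity collapses to zero whenever all rollouts in a group earn the same reward, which is at the heart of the inefficiencies described below.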

    To address these issues, the Kwaipilot team at Kuaishou has introduced Two-Staged history-Resampling Policy Optimization (SRPO)—a novel RL framework tailored to overcome the inefficiencies of conventional GRPO. Accompanied by a detailed technical report, the team has open-sourced the SRPO-Qwen-32B model.

    Remarkably, SRPO achieves DeepSeek-R1-Zero-level performance across both math and code domains using only one-tenth of the training steps, surpassing its predecessor on benchmarks such as AIME24 (50) and LiveCodeBench (41.6) using the same base model (Qwen2.5-32B).

    Problems with Standard GRPO

    Initial attempts with vanilla GRPO revealed several major bottlenecks:

    • Cross-Domain Conflicts: Mathematical problems demand detailed chain-of-thought (CoT) reasoning, while code tasks typically do not. Mixing them led to degraded performance in both areas.
    • Reward Homogeneity: When most samples in a batch receive similar rewards, gradient updates become negligible due to near-zero advantage values, severely reducing learning efficiency (see the sketch after this list).
    • Early Saturation: Performance gains plateaued early, often due to low-quality or overly simplistic training data, limiting the model’s capacity to tackle complex problems.
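
    The reward-homogeneity bottleneck is easy to see numerically. Below is a minimal sketch using the group-relative advantage defined above; the helper function and example rewards are hypothetical, purely for illustration:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: normalize each rollout's reward against the
    mean and standard deviation of its own group."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Mixed outcomes: advantages carry signal, so gradients are informative.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))  # ≈ [ 1. -1.  1. -1.]

# Homogeneous outcomes (all rollouts correct, or all incorrect): advantages
# are all zero, so this prompt contributes essentially no gradient update.
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))  # [0. 0. 0. 0.]
```

    Once a prompt's rollouts are uniformly correct (or uniformly wrong), its advantages vanish and the batch spends compute on samples that teach the model nothing.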

    Two-Stage Training Strategy

    SRPO employs a two-phase curriculum to resolve reasoning conflicts and maximize learning efficiency:

    Stage 1: Reasoning Development

    Focuses exclusively on challenging math data, fostering robust reasoning behaviors like reflection, backtracking, and step-by-step decomposition.

    Stage 2: Skill Integration

    Gradually introduces code data, building on the reasoning foundation. This enhances procedural thinking, recursion, and tool-use capabilities in programming tasks.
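
    A rough sketch of how such a staged data schedule could be wired up; the stage names mirror the report, but the mixing ratios, dataset keys, and sampling logic are assumptions made for illustration:

```python
import random

# Hypothetical two-stage schedule: stage 1 trains on hard math only,
# stage 2 mixes code data in on top of the reasoning foundation.
CURRICULUM = {
    "stage_1_reasoning_development": {"math_hard": 1.0},
    "stage_2_skill_integration": {"math_hard": 0.6, "code": 0.4},
}

def sample_batch(stage, datasets, batch_size=32):
    """Draw a training batch according to the active stage's data mix."""
    mix = CURRICULUM[stage]
    batch = []
    for source, fraction in mix.items():
        k = max(1, int(batch_size * fraction))
        batch.extend(random.sample(datasets[source], k))
    return batch
```

    The point of the schedule is simply that code data never appears until the math-only stage has established long-form reasoning habits.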

    Comparative Training Outcomes

    The team analyzed how different data training regimes impact reasoning depth and response length:

    • Mixed Math + Code: Led to diluted reasoning, short responses, and mediocre performance.
    • Math-Only: Encouraged long, detailed reasoning with transferable skills even for code tasks.
    • Code-Only: Improved coding benchmarks but failed to cultivate deep reasoning or long-form answers.
    • Staged Training (SRPO): Delivered best-in-class results, with consistent reflective reasoning in math and structured problem-solving in code, including spontaneous use of code for verifying math solutions.

    History Resampling for Better Gradient Signals

    During mid-to-late training, over half of GRPO sample batches returned identical rewards, leading to poor gradient signals. SRPO resolves this with History Resampling, which:

    • Filters out trivial samples (where all rollouts are correct).
    • Prioritizes diverse and difficult samples, ensuring better reward variance and meaningful gradient updates.
    • Implements curriculum learning by retaining hard examples that may later yield productive gradients.
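
    A small sketch of that filtering rule; the data layout and function name are hypothetical, but the logic mirrors the three behaviors listed above: drop all-correct prompts, keep prompts with mixed outcomes, and retain all-incorrect hard prompts for later epochs.

```python
def history_resample(prompt_stats):
    """Rebuild the sampling pool from the previous epoch's rollout history.

    prompt_stats: iterable of dicts like
        {"prompt": ..., "rollout_rewards": [1, 1, 0, 1]}
    """
    pool = []
    for stats in prompt_stats:
        rewards = stats["rollout_rewards"]
        if rewards and all(r == 1 for r in rewards):
            continue  # trivial: every rollout already solves it, zero advantage
        # Keep prompts with mixed outcomes (useful reward variance now) and
        # all-incorrect prompts (hard cases that may pay off later).
        pool.append(stats)
    return pool
```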

    This strategy substantially outperformed dynamic sampling methods like those in DAPO, leading to more efficient training and stable improvements in response length and reasoning depth.

    Data Preparation and Quality Control

    For robust performance, the Kwaipilot team curated and cleaned public math and code datasets using strict filters. They:

    • Removed malformed entries and ambiguous solutions.
    • Eliminated math questions requiring visual interpretation and code tasks needing specific runtime environments.
    • Verified correctness and labeled problem difficulty via Pass@k rates.
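
    Pass@k, used here for difficulty labeling, has a commonly used unbiased estimator; the difficulty bands in the sketch below are illustrative thresholds, not values from the report:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k sampled
    completions is correct, given c correct completions out of n."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def difficulty_label(n_rollouts, n_correct, k=8):
    """Hypothetical banding of problems by pass@k; thresholds are assumed."""
    p = pass_at_k(n_rollouts, n_correct, k)
    if p > 0.9:
        return "easy"
    if p > 0.3:
        return "medium"
    return "hard"
```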

    Experimental Observations

    SRPO’s training process demonstrated a steady and interpretable growth curve:

    • Stage 1 yielded a rapid increase in reward and response length.
    • Stage 2 initially saw a dip in rewards as code tasks were introduced, followed by stable growth.
    • The response length for code tasks remained mostly consistent, confirming the model’s reliance on reasoning structures learned in Stage 1.

    Emergence of Reflective Reasoning

    SRPO led to the development of human-like metacognitive behaviors:

    • Rechecking, hesitation, and exploratory reasoning patterns emerged during training.
    • These behaviors increased significantly in frequency over time, demonstrating self-verification and adaptive problem-solving.
    • The model began generating code to validate its mathematical solutions, showcasing integrated reasoning and tool usage.

    Conclusion

    The SRPO framework marks a significant leap in reinforcement learning for multi-domain reasoning. Through its two-staged approach and innovative history resampling strategy, SRPO not only overcomes the key limitations of traditional GRPO but also fosters advanced, human-like reasoning behaviors in LLMs—achieving high performance with remarkable training efficiency. This work sets a new benchmark for scalable, generalizable reasoning in LLM training.
