Main Workflow

Figure 1 illustrates the development process, which is organised around a two-stage training workflow (SFT followed by RLVR) for a reasoning language model.


Figure 1: The development process includes five main stages: (1) SFT Dataset Generation (M.1-M.3, with S.X steps), (2) SFT Training (M.4), (3) RLVR Data Curation (M.5), (4) RLVR Training (M.6, with V.X steps), and (5) Deployment (M.7-M.9).

  • M.1 Initial Raw Data Collection: Creating template schemes for various problem types (e.g., matrix multiplication, linear equations) and defining hyperparameters that control difficulty level as well as format and output constraints.
  • M.2 Synthetic Data Generation: Generating raw problems programmatically by instantiating the templates from M.1 (a minimal generation sketch follows this list).
  • M.3 Data Distillation: Preparing the dataset for the Supervised Fine-Tuning (SFT) stage.
    • S.1 Chain-of-Thought (CoT) Generation: Generating step-by-step reasoning traces for collected questions using a powerful “teacher” model (i.e. DeepSeek-R1).
    • S.2 Verification: Validating the correctness of answers and CoTs through methods like Math Verify or LLM judges (a verification sketch follows this list).
    • S.3 Difficulty-Based Filtering: Selecting problems based on their difficulty level. Samples with moderate pass rates are retained, while those that are too easy or too hard are filtered out (a pass-rate filter is sketched after this list). Filtering may also involve learning impact measurement or off-the-shelf LLM judges.
    • S.4 Rigorous Data Cleaning: Performing deduplication (via embedding similarity or n-gram overlap; sketched after this list), rejection sampling, and decontamination to ensure high-quality and unbiased data.
  • M.4 Supervised Fine-Tuning (SFT) Training: Fine-tuning the base model on the entire reasoning dataset so that it learns to imitate high-quality reasoning traces from stronger models. This stage provides a stable baseline for the subsequent RL phase.
  • M.5 Reinforcement Learning from Verifiable Rewards (RLVR) Data Curation: This is a distinct and critical phase that involves preparing high-quality, verifiable datasets specifically for RL training. This includes:
    • Construction of Verified Questions and Answers: Gathering datasets for RL, often focused on math problems that can be objectively verified via automated tools.
    • Difficulty-Based Filtering: RL datasets are often designed to include samples where models are likely to make mistakes (i.e., moderate pass rates), as samples that are too easy or too hard offer limited learning opportunities and are often filtered out. This differs from SFT’s general emphasis on difficulty and diversity to ensure data richness.
    • Detailed Pre-processing (Verification, Filtering, Cleaning): Rigorous steps similar to SFT data preparation, but tailored for RL (e.g., filtering for samples with moderate pass rates, removing unverifiable formats, and ensuring decontamination).
  • M.6 Reinforcement Learning from Verifiable Rewards (RLVR) Training: The core RL phase that further enhances reasoning capabilities. This process includes:
    • V.1 RL Algorithm Implementation: Utilizing and adapting algorithms such as PPO, GRPO, or their variants.
    • V.2 Reward System Design: Defining rule-based reward functions (e.g., accuracy, format, length) that guide the model’s learning process (a reward sketch follows this list).
    • V.3 Sampling Strategies and Curriculum Learning: Employing techniques like dynamic sampling, epoch-level history resampling, or progressively increasing context/response length to optimize training efficiency and stability.
    • V.4 Reasoning Model: The policy produced by the RL optimisation is the final reasoning model.
  • M.7 Model Evaluation: Careful assessment of the Reasoning Language Model’s (RLM) performance on relevant benchmarks (e.g., AIME, MATH500).
  • M.8 Further Refinement: The model can be quantized to ease adoption.
  • M.9 Deployment and Release: Making the trained reasoning language model and dataset available on HuggingFace.
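
As an illustration of M.1 and M.2, the sketch below instantiates a matrix-multiplication template whose difficulty is controlled by the matrix size and value range. The template wording, field names, and hyperparameters are illustrative assumptions, not the dataset's actual configuration.

```python
import random


def generate_matmul_problem(size: int, max_value: int, seed: int | None = None) -> dict:
    """Instantiate a matrix-multiplication template.

    `size` and `max_value` act as difficulty hyperparameters: larger matrices
    and wider value ranges produce harder problems.
    """
    rng = random.Random(seed)
    A = [[rng.randint(-max_value, max_value) for _ in range(size)] for _ in range(size)]
    B = [[rng.randint(-max_value, max_value) for _ in range(size)] for _ in range(size)]
    # Exact ground-truth product, so the answer can later be verified automatically.
    C = [[sum(A[i][k] * B[k][j] for k in range(size)) for j in range(size)] for i in range(size)]
    question = (
        f"Compute the product A @ B for\nA = {A}\nB = {B}\n"
        "Return the result as a list of lists."
    )
    return {
        "question": question,
        "answer": C,
        "type": "matrix_multiplication",
        "difficulty": {"size": size, "max_value": max_value},
    }


# Example: an easy 2x2 instance and a harder 4x4 instance.
easy = generate_matmul_problem(size=2, max_value=5, seed=0)
hard = generate_matmul_problem(size=4, max_value=50, seed=0)
```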
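
For S.2, a minimal verification pass is sketched below. It assumes the open-source Math-Verify package (`pip install math-verify`) and its documented `parse`/`verify` helpers; the record layout and the LLM-judge fallback comment are assumptions rather than the project's actual setup.

```python
from math_verify import parse, verify


def is_correct(gold_answer: str, model_completion: str) -> bool:
    """Return True if the completion's final answer matches the gold answer.

    `parse` extracts and normalises mathematical expressions (LaTeX or plain
    text), and `verify` checks their equivalence, so "0.5" and "\\frac{1}{2}"
    should be treated as matching.
    """
    try:
        gold = parse(gold_answer)
        pred = parse(model_completion)
        return verify(gold, pred)
    except Exception:
        # Unparseable outputs are treated as incorrect; in practice such cases
        # could be routed to an LLM judge instead.
        return False


# Example: an equivalent but differently formatted answer should be accepted.
print(is_correct("$\\frac{1}{2}$", "The determinant is 0.5"))  # expected: True
```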
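
Both S.3 and the pass-rate filtering in M.5 rely on estimating how often the model solves each question. The sketch below samples k completions per question and keeps items with moderate pass rates; the low/high bounds of 0.2 and 0.8 and the `generate`/`is_correct` callables are placeholders for whatever sampler and verifier the pipeline actually uses.

```python
from typing import Callable, Iterable


def pass_rate(
    question: dict,
    generate: Callable[[str], str],
    is_correct: Callable[[str, str], bool],
    k: int = 8,
) -> float:
    """Fraction of k sampled completions whose final answer verifies as correct."""
    completions = [generate(question["question"]) for _ in range(k)]
    return sum(is_correct(str(question["answer"]), c) for c in completions) / k


def filter_by_difficulty(
    dataset: Iterable[dict],
    generate: Callable[[str], str],
    is_correct: Callable[[str, str], bool],
    low: float = 0.2,
    high: float = 0.8,
) -> list[dict]:
    """Keep questions with moderate pass rates; drop trivially easy or unsolvable ones."""
    kept = []
    for item in dataset:
        rate = pass_rate(item, generate, is_correct, k=8)
        if low <= rate <= high:
            kept.append({**item, "pass_rate": rate})
    return kept
```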
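
The n-gram variant of S.4's deduplication can be approximated with Jaccard similarity over character n-grams, as in the sketch below. The trigram size and 0.8 threshold are illustrative; embedding-similarity or MinHash deduplication scales better but follows the same keep/drop logic.

```python
def ngrams(text: str, n: int = 3) -> set[str]:
    """Character-level n-grams of a lowercased, whitespace-normalised string."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0


def deduplicate(questions: list[str], threshold: float = 0.8) -> list[str]:
    """Greedily keep a question only if it is not a near-duplicate of any kept one.

    The pairwise comparison is O(n^2), which is fine for a sketch but would be
    replaced by approximate methods on a large corpus.
    """
    kept: list[str] = []
    kept_grams: list[set[str]] = []
    for q in questions:
        g = ngrams(q)
        if all(jaccard(g, kg) < threshold for kg in kept_grams):
            kept.append(q)
            kept_grams.append(g)
    return kept
```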
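
Finally, for V.2 a rule-based reward combining accuracy, format, and length signals might look like the sketch below. The `<think>...</think><answer>...</answer>` format, the weights, and the length cap are assumptions for illustration, not the post's actual reward configuration; `is_correct` can be the verifier sketched above.

```python
import re
from typing import Callable

# Assumed completion format: reasoning inside <think>, final result inside <answer>.
THINK_ANSWER = re.compile(r"^<think>.*?</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL)


def reward(
    completion: str,
    gold_answer: str,
    is_correct: Callable[[str, str], bool],
    max_len: int = 4096,
) -> float:
    """Combine accuracy, format, and length signals into a single scalar reward."""
    match = THINK_ANSWER.match(completion.strip())
    format_reward = 0.5 if match else 0.0
    answer_text = match.group(1) if match else completion
    accuracy_reward = 1.0 if is_correct(gold_answer, answer_text) else 0.0
    # Mild penalty for overly long responses to discourage rambling chains of thought.
    length_penalty = -0.5 if len(completion) > max_len else 0.0
    return accuracy_reward + format_reward + length_penalty
```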

Citation

If you found this post useful for your work, please consider citing it as:

Razvan Florian Vasile. (June 2025). "LinAlgZero: A linear algebra dataset for reasoning". Atom Blog. Retrieved from https://atomwalk12.github.io/posts/lingalgzero/linalgzero/.

or

@misc{vasile2025linalgzero,
    title = "LinAlgZero: A linear algebra dataset for reasoning",
    author = "Razvan Florian Vasile",
    note = "Personal blog",
    year = "2025",
    month = "June",
    url = "https://atomwalk12.github.io/posts/lingalgzero/linalgzero/"
}