Megatron-LM 3 posts
| Megatron-LM Training Large Models Practical Guide | 2 - Model Construct |
15 minute read
A practical guide to constructing and modifying GPT-style models in Megatron-LM: code organization, the Spec-based layer system, parameter flow, and how to switch between local and Transformer Engi...
Megatron-LM Practical Guide
| Megatron-LM Training Large Models Practical Guide | 1 - Data Preprocess |
16 minute read
A practical overview of Megatron-LM data preprocessing: supported text formats, the two-step preprocessing pipeline, and how IndexedDataset/GPTDataset/BlendedDataset indexing works, with engineerin...
Megatron-LM Practical Guide
Paper Interpretation 4 posts
Paper Summary for Recursive Looped Transformers: Latent Reasoning
19 minute read
A paper-reading note on latent reasoning in Looped / Recursive Transformers: scaling test-time compute via recurrent depth, recursive latent thoughts, and large-scale looped language models.
Recursive Transformers Paper Interpretation
Paper Summary for Recursive Looped Transformers: Parameter Efficiency
25 minute read
Exploring how loops and recursion can improve parameter utilization efficiency in LLMs. A comprehensive summary of recursive mechanisms in Transformer architectures.
Recursive Transformers Paper Interpretation
A One-Stop Guide to Scaling Laws in LLM Quantization
27 minute read
A comprehensive overview of Quantization Scaling Laws. Dive deep into 5 papers to understand how performance loss from quantization varies with model parameters and token count.
Quantization Paper Interpretation
A 5,000-Word Analysis of FP4 Quantization for Training Large Language Models
29 minute read
A detailed interpretation of the paper ‘Optimizing Large Language Model Training Using FP4 Quantization’, walking through its motivation, key insights, and design rationale.
Quantization Paper Interpretation
Practical Guide 3 posts
| Megatron-LM Training Large Models Practical Guide | 2 - Model Construct |
15 minute read
A practical guide to constructing and modifying GPT-style models in Megatron-LM: code organization, the Spec-based layer system, parameter flow, and how to switch between local and Transformer Engi...
Megatron-LM Practical Guide
| Megatron-LM Training Large Models Practical Guide | 1 - Data Preprocess |
16 minute read
A practical overview of Megatron-LM data preprocessing: supported text formats, the two-step preprocessing pipeline, and how IndexedDataset/GPTDataset/BlendedDataset indexing works, with engineerin...
Megatron-LM Practical Guide
Quantization 2 posts
A One-Stop Guide to Scaling Laws in LLM Quantization
27 minute read
A comprehensive overview of Quantization Scaling Laws. Dive deep into 5 papers to understand how performance loss from quantization varies with model parameters and token count.
Quantization Paper Interpretation
A 5,000-Word Analysis of FP4 Quantization for Training Large Language Models
29 minute read
A detailed interpretation of the paper ‘Optimizing Large Language Model Training Using FP4 Quantization’, walking through its motivation, key insights, and design rationale.
Quantization Paper Interpretation
Recursive Transformers 2 posts
Paper Summary for Recursive Looped Transformers: Latent Reasoning
19 minute read
A paper-reading note on latent reasoning in Looped / Recursive Transformers: scaling test-time compute via recurrent depth, recursive latent thoughts, and large-scale looped language models.
Recursive Transformers Paper Interpretation
Paper Summary for Recursive Looped Transformers: Parameter Efficiency
25 minute read
Exploring how loops and recursion can improve parameter utilization efficiency in LLMs. A comprehensive summary of recursive mechanisms in Transformer architectures.
Recursive Transformers Paper Interpretation