Hi, I'm Ruizhe Wang, currently pursuing a joint Ph.D. program between the University of Science and Technology of China (USTC) and Microsoft Research Asia (MSRA), co-supervised by Prof. Zhengjun Zha and Prof. Baining Guo. I also collaborate closely with Peng Cheng and Yeyun Gong at MSRA.
My research interests are AI Infrastructure, Large Language Model (LLM) Pretraining, and Efficient AI System Design. I enjoy building scalable models, exploring novel training methods, and solving challenging problems at the intersection of theory and practice.
I'm always open to research collaborations. You can find more information about my research on my Google Scholar homepage:
Education
Publications
Optimizing Large Language Model Training Using FP4 Quantization
We propose the first FP4 training framework for LLMs, introducing Differentiable Gradient Estimation and Outlier Clamp and Compensation to address quantization challenges, achieving lossless pre-training performance on LLMs of up to 13B parameters and datasets of up to 100B tokens.
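For intuition, the sketch below fake-quantizes a tensor to an FP4-style grid in PyTorch: outliers are clamped with the clipped residual carried separately as compensation, and the backward pass uses a smooth surrogate gradient rather than a hard straight-through mask. The grid values, clamp quantile, and surrogate shape are illustrative assumptions, not the paper's actual recipe.

```python
import torch

# An E2M1-style FP4 magnitude grid; assumed here for illustration, not the paper's exact format.
FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_to_grid(x, grid):
    """Round each element to the nearest representable FP4 magnitude, keeping the sign."""
    idx = (x.abs().unsqueeze(-1) - grid).abs().argmin(dim=-1)
    return torch.sign(x) * grid[idx]

class FP4FakeQuant(torch.autograd.Function):
    """Fake FP4 quantization with outlier clamping/compensation and a smoothed gradient."""

    @staticmethod
    def forward(ctx, x, clamp_q):
        # Clamp outliers so the quantization scale is not dominated by rare large values.
        t = torch.quantile(x.abs().flatten(), clamp_q)
        x_clamped = x.clamp(-t, t)
        scale = t / FP4_GRID.max()
        x_q = quantize_to_grid(x_clamped / scale, FP4_GRID) * scale
        # Compensation: carry the clipped-off residual separately (kept in high precision here).
        residual = x - x_clamped
        ctx.save_for_backward(x, t)
        return x_q + residual

    @staticmethod
    def backward(ctx, grad_out):
        x, t = ctx.saved_tensors
        # Differentiable surrogate: attenuate gradients smoothly near the clamp boundary
        # instead of a hard 0/1 straight-through mask (a stand-in for the paper's estimator).
        soft_mask = torch.sigmoid(4.0 * (1.0 - x.abs() / t))
        return grad_out * soft_mask, None

x = torch.randn(4, 8, requires_grad=True)
y = FP4FakeQuant.apply(x, 0.99)
y.sum().backward()
print(x.grad.shape)  # gradients flow through the quantizer
```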
Recycling Pretrained Checkpoints: Orthogonal Growth of MoE for Efficient LLM Pre-Training
We propose a "checkpoint recycling" strategy that expands existing models through orthogonal growth, scaling to 70B-parameter MoE models trained on 1T tokens and delivering a 10.6% accuracy improvement over training from scratch, thereby maximizing the value of prior computational investments.
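As a rough illustration of the growth idea (the module layout and the copy-plus-small-noise initialization below are assumptions for this sketch, not the paper's procedure), new experts can be seeded from a pretrained expert pool rather than from random initialization:

```python
import copy
import torch
import torch.nn as nn

class Expert(nn.Module):
    """A standard two-layer FFN expert; sizes are illustrative."""
    def __init__(self, d_model=256, d_ff=1024):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.down(torch.relu(self.up(x)))

def grow_experts(pretrained_experts, new_num_experts, noise_std=1e-3):
    """Expand an expert pool by recycling a pretrained checkpoint.

    Each new expert starts as a copy of an existing expert plus a small perturbation,
    so the grown model inherits the checkpoint's behavior while the copies can diverge
    during continued pre-training.
    """
    grown = nn.ModuleList()
    for i in range(new_num_experts):
        source = pretrained_experts[i % len(pretrained_experts)]
        new_expert = copy.deepcopy(source)
        with torch.no_grad():
            for p in new_expert.parameters():
                p.add_(noise_std * torch.randn_like(p))
        grown.append(new_expert)
    return grown

pretrained = nn.ModuleList([Expert() for _ in range(8)])  # stands in for a loaded checkpoint
grown = grow_experts(pretrained, new_num_experts=16)
print(len(grown))  # 16 experts, each seeded from one of the original 8
```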
Training Matryoshka Mixture-of-Experts for Elastic Inference-Time Expert Utilization
Introducing Matryoshka MoE (M-MoE), which enables elastic expert utilization at inference time by instilling a coarse-to-fine structure into expert ensembles, allowing dynamic compute allocation under varying resource constraints while maintaining model quality.
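A minimal sketch of the elastic piece, assuming a standard top-k router whose k is chosen at call time; the nested, coarse-to-fine training that makes every k perform well is the paper's contribution and is not shown here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ElasticTopKRouter(nn.Module):
    """Token router whose top-k is supplied per call rather than fixed at training time."""

    def __init__(self, d_model, num_experts):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts, bias=False)

    def forward(self, x, k):
        logits = self.gate(x)                         # (tokens, num_experts)
        topk_vals, topk_idx = logits.topk(k, dim=-1)  # keep only k experts per token
        weights = F.softmax(topk_vals, dim=-1)        # renormalize over the selected experts
        return topk_idx, weights

router = ElasticTopKRouter(d_model=256, num_experts=16)
tokens = torch.randn(32, 256)
idx, w = router(tokens, k=2)   # tight compute budget
idx, w = router(tokens, k=8)   # generous compute budget, same router weights
print(idx.shape, w.shape)      # (32, 8), (32, 8)
```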
Technical Reports
FP8-LM: Training FP8 Large Language Models
A comprehensive FP8 automatic mixed-precision training framework for LLMs that achieves up to a 39% reduction in memory usage and a 1.75× training speedup on H100 GPUs while maintaining model accuracy comparable to BF16 training.
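The core numerical idea, per-tensor scaling into the FP8 range, can be sketched in a few lines of PyTorch (this assumes a build that exposes torch.float8_e4m3fn, i.e. 2.1 or later); the actual framework additionally covers FP8 GEMMs, gradients, optimizer states, and delayed scaling, none of which are shown here:

```python
import torch

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def fp8_roundtrip(x, fp8_max=E4M3_MAX):
    """Scale a tensor so its largest magnitude fits the FP8 range, cast down, and cast back."""
    amax = x.abs().max().clamp(min=1e-12)
    scale = fp8_max / amax
    x_fp8 = (x * scale).to(torch.float8_e4m3fn)  # 8-bit storage
    return x_fp8.to(x.dtype) / scale             # dequantize for comparison

w = torch.randn(1024, 1024)
err = (w - fp8_roundtrip(w)).abs().max()
print(f"max abs round-trip error: {err.item():.5f}")
```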
SIGMA: An AI-Empowered Training Stack on Early-Life Hardware
Introducing SIGMA, an open-source training stack designed to overcome the reliability and efficiency challenges of large-scale AI training on early-life accelerators, enabling the stable pre-training of a 200B MoE model with 94.45% accelerator utilization.
Sigma-MoE-Tiny Technical Report
Introducing Sigma-MoE-Tiny, an ultra-sparse MoE language model that activates only 0.5B out of 20B parameters per token by using a fine-grained 96-expert architecture, achieving state-of-the-art performance at this extreme sparsity.
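A back-of-the-envelope check on that sparsity (the top-k values below are assumed purely for illustration and are not taken from the report):

```python
total_params = 20e9    # total parameters reported for Sigma-MoE-Tiny
active_params = 0.5e9  # parameters activated per token
num_experts = 96       # fine-grained experts per MoE layer

print(f"activation ratio: {active_params / total_params:.1%}")  # 2.5%

# If expert FFNs dominate the parameter count, routing each token to k of the 96
# experts touches roughly k/96 of those parameters, so only a very small k can
# reach a 2.5% overall activation ratio, which is why the fine-grained expert
# pool matters.
for k in (1, 2, 4):
    print(f"top-{k}: ~{k / num_experts:.1%} of expert parameters active per token")
```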
Blogs
View the full blog page: All Blog Posts
A 5,000-Word Analysis of FP4 Quantization for Training LLMs
A detailed interpretation of our paper "Optimizing Large Language Model Training Using FP4 Quantization". This post walks through the motivation, key insights, and design rationale behind the work.
A One-Stop Guide to Scaling Laws in LLM Quantization
A comprehensive overview of Quantization Scaling Laws. Dive deep into 5 papers to understand how performance loss from quantization varies with model parameters and token count.
A Practical Guide to Training Large Models with Megatron-LM: 0 - Preface
Why Megatron-LM is essentially a must for large-model training, and some caveats for those who have never used it before. A practical guide drawn from personal experience.
Paper Summary for Recursive Looped Transformers: Parameter Efficiency
Exploring how loops and recursion can improve parameter utilization efficiency in LLMs. A comprehensive summary of recursive mechanisms in Transformer architectures.
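The basic recursion those papers study can be sketched as a single transformer block reused across depth, so the parameter count stays that of one block while the effective depth grows with the loop count; the sizes below are arbitrary:

```python
import torch
import torch.nn as nn

class LoopedTransformer(nn.Module):
    """Apply one shared transformer block repeatedly instead of stacking distinct layers."""

    def __init__(self, d_model=256, n_heads=4, num_loops=6):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
        self.num_loops = num_loops

    def forward(self, x):
        for _ in range(self.num_loops):
            x = self.block(x)  # the same weights are reused at every "layer"
        return x

model = LoopedTransformer()
tokens = torch.randn(2, 16, 256)  # (batch, sequence, d_model)
out = model(tokens)
n_params = sum(p.numel() for p in model.parameters())
print(out.shape, f"{n_params:,} parameters for an effective depth of 6")
```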