Hi, I'm Ruizhe Wang (็Ž‹็‘žๅ“ฒ) ๐Ÿ‘‹, currently pursuing a joint Ph.D. program between USTC and Microsoft Research Asia, co-supervised by Prof. Zhengjun Zha and Prof. Baining Guo. I also collaborate closely with Peng Cheng and Yeyun Gong at MSRA.

My research interests are AI Infrastructure, Large Language Model (LLM) Pretraining, and Efficient AI System Design. I enjoy building scalable models, exploring novel training methods, and solving challenging problems at the intersection of theory and practice.

AI Infrastructure · LLM Pretraining · Efficient AI Systems · Low-bit Quantization

Iโ€™m looking to collaborate on inspirable companions. You can find more info related to my research on my google scholar homepage:

๐Ÿ“– Educations

2023.09 - Present
Ph.D. Candidate in Automation
University of Science and Technology of China (USTC)
Hefei, China ยท Joint Program with Microsoft Research Asia
2019.09 - 2023.06
B.E. in Electronic Information Engineering
University of Science and Technology of China (USTC)
Hefei, China

๐Ÿ“ Publications

ICML 2025
FP4 Quantization

Optimizing Large Language Model Training Using FP4 Quantization

Ruizhe Wang, Yeyun Gong, Xiao Liu, Guoshuai Zhao, Ziyue Yang, Baining Guo, Zhengjun Zha, Peng Cheng

Jan 2025 ICML 2025

We propose the first FP4 training framework for LLMs, introducing Differentiable Gradient Estimation and Outlier Clamping and Compensation to address quantization challenges, achieving lossless pre-training performance on LLMs of up to 13B parameters trained on 100B tokens.
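
To make the two ideas above a bit more concrete, here is a minimal sketch of simulated ("fake") FP4 quantization with a differentiable gradient estimator and outlier clamping plus compensation. The function names (fp4_round, FP4FakeQuant, quantize_with_outlier_compensation), the tanh-based surrogate, and the 0.999 quantile are illustrative choices for this page, not the implementation from the paper.

```python
# Hypothetical sketch of simulated ("fake") FP4 quantization with a
# differentiable gradient estimator and outlier clamping + compensation.
# The surrogate function and thresholds are illustrative, not the paper's.
import torch

# Representable magnitudes of an E2M1 FP4 format (sign stored separately).
FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fp4_round(x: torch.Tensor) -> torch.Tensor:
    """Round |x| to the nearest FP4 grid value, preserving sign."""
    grid = FP4_GRID.to(x.device, x.dtype)
    idx = torch.argmin((x.abs().unsqueeze(-1) - grid).abs(), dim=-1)
    return torch.sign(x) * grid[idx]

class FP4FakeQuant(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, scale):
        ctx.save_for_backward(x, scale)
        return fp4_round(x / scale) * scale

    @staticmethod
    def backward(ctx, grad_out):
        x, scale = ctx.saved_tensors
        # Differentiable gradient estimation: weight the incoming gradient by
        # the derivative of a smooth surrogate of the rounding step instead of
        # passing it straight through (a tanh-based surrogate is used here).
        u = x / scale
        d = 1.0 - torch.tanh(4.0 * (u - fp4_round(u))) ** 2
        return grad_out * d, None

def quantize_with_outlier_compensation(w: torch.Tensor, q: float = 0.999):
    """Clamp outliers beyond the q-quantile, FP4-quantize the clamped tensor,
    and keep the clamped-away residual as a sparse high-precision correction."""
    thresh = torch.quantile(w.abs().float(), q).to(w.dtype)
    clamped = w.clamp(-thresh, thresh)
    residual = w - clamped                              # non-zero only at outliers
    scale = clamped.abs().max() / FP4_GRID[-1].item()   # per-tensor scale
    return FP4FakeQuant.apply(clamped, scale) + residual
```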

Recycling Checkpoints

Recycling Pretrained Checkpoints: Orthogonal Growth of MoE for Efficient LLM Pre-Training

Ruizhe Wang, Yucheng Ding, Xiao Liu, Yaoxiang Wang, Peng Cheng, Baining Guo, Zhengjun Zha, Yeyun Gong

Oct 2025 Preprint

We propose a "checkpoint recycling" strategy that expands existing pretrained checkpoints into larger MoE models through orthogonal growth. On a 70B MoE model trained with 1T tokens, it delivers a 10.6% accuracy improvement over training from scratch while making full use of prior computational investment.

Matryoshka MoE

Training Matryoshka Mixture-of-Experts for Elastic Inference-Time Expert Utilization

Yaoxiang Wang, Qingguo Hu, Yucheng Ding, Ruizhe Wang, Yeyun Gong, Jian Jiao, Yelong Shen, Peng Cheng, Jinsong Su

Sep 2025 Preprint

Introducing Matryoshka MoE (M-MoE), which enables elastic expert utilization during inference by instilling a coarse-to-fine structure into expert ensembles, allowing compute to be allocated dynamically under resource constraints while maintaining model quality.
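
As a rough illustration of the elastic-routing idea (not the M-MoE training recipe itself), the sketch below shows an MoE layer whose top-k is a runtime argument, so the same trained weights can be run under different compute budgets. The class name ElasticTopKMoE, the layer sizes, and the k values in the demo are made up for the example.

```python
# Minimal sketch of an MoE layer with an elastic top-k, in the spirit of
# Matryoshka MoE: the same weights can be run with more or fewer active
# experts. This illustrates the idea, not the M-MoE training recipe.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ElasticTopKMoE(nn.Module):
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor, k: int) -> torch.Tensor:
        # x: [tokens, d_model]; k is the elastic knob chosen at inference time.
        logits = self.router(x)                   # [tokens, n_experts]
        weights, idx = logits.topk(k, dim=-1)     # top-k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

if __name__ == "__main__":
    layer = ElasticTopKMoE(d_model=64, d_ff=256)
    x = torch.randn(10, 64)
    for k in (1, 2, 4, 8):  # same weights, different compute budgets
        print(k, layer(x, k).shape)
```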

๐Ÿ“ Technical Reports

FP8-LM

FP8-LM: Training FP8 Large Language Models

Houwen Peng*, Kan Wu*, Yixuan Wei*, Guoshuai Zhao, Yuxiang Yang, Ze Liu, Yifan Xiong, Ziyue Yang, Bolin Ni, Jingcheng Hu, Ruihang Li, Miaosen Zhang, Chen Li, Jia Ning, Ruizhe Wang, Zheng Zhang, Shuguang Liu, Joe Chau, Han Hu, Peng Cheng

Oct 2023 Technical Report

A comprehensive FP8 automatic mixed-precision training framework for LLMs that achieves up to a 39% reduction in memory usage and a 1.75× training speedup on H100 GPUs while maintaining accuracy comparable to BF16 training.
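
The basic mechanism behind FP8 mixed precision, per-tensor dynamic scaling before the low-precision cast, can be sketched in a few lines of PyTorch. This is only an illustration of the scaling idea and not the FP8-LM framework (which also covers gradients, optimizer states, and distributed communication); to_fp8_scaled and fp8_matmul are hypothetical helper names.

```python
# Illustrative per-tensor FP8 (E4M3) cast with dynamic scaling; only the basic
# scaling idea behind FP8 mixed precision, not the FP8-LM implementation.
import torch

FP8_E4M3_MAX = 448.0  # largest magnitude representable in torch.float8_e4m3fn

def to_fp8_scaled(x: torch.Tensor):
    """Scale x so its absmax maps onto the FP8 range, then cast to FP8.
    Returns the FP8 tensor and the scale needed to undo the mapping."""
    scale = FP8_E4M3_MAX / x.abs().max().clamp_min(1e-12)
    x_fp8 = (x * scale).to(torch.float8_e4m3fn)  # requires PyTorch >= 2.1
    return x_fp8, scale

def fp8_matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Quantize both operands to FP8 and multiply (upcast to BF16 here, since
    real FP8 matmul kernels, e.g. on H100 Tensor Cores, need dedicated ops)."""
    a_fp8, sa = to_fp8_scaled(a)
    b_fp8, sb = to_fp8_scaled(b)
    out = a_fp8.to(torch.bfloat16) @ b_fp8.to(torch.bfloat16)
    return out / (sa * sb)
```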

SIGMA

SIGMA: An AI-Empowered Training Stack on Early-Life Hardware

Lei Qu, Lianhai Ren, Peng Cheng, Rui Gao, Ruizhe Wang, Tianyu Chen, Xiao Liu, Xingjian Zhang, Yeyun Gong, Yifan Xiong, Yucheng Ding, Yuting Jiang, Zhenghao Lin, Zhongxin Guo, Ziyue Yang

Dec 2025 Technical Report

Introducing SIGMA, an open-source training stack designed to overcome the reliability and efficiency challenges of large-scale AI training on early-life accelerators, enabling the stable pre-training of a 200B MoE model with 94.45% accelerator utilization.

Sigma-MoE-Tiny

Sigma-MoE-Tiny Technical Report

Qingguo Hu, Zhenghao Lin, Ziyue Yang, Yucheng Ding, Xiao Liu, Yuting Jiang, Ruizhe Wang, Tianyu Chen, Zhongxin Guo, Yifan Xiong, Rui Gao, Lei Qu, Jinsong Su, Peng Cheng, Yeyun Gong

Dec 2025 Technical Report

Introducing Sigma-MoE-Tiny, an ultra-sparse MoE language model that activates only 0.5B out of 20B parameters per token by using a fine-grained 96-expert architecture, achieving state-of-the-art performance at this extreme sparsity.

๐Ÿฆˆ Blogs

๐Ÿ“š View full blogs page: All Blog Posts

FP4 Quantization

A 5,000-Word Analysis of FP4 Quantization for Training LLMs

May 30, 2025 15 min read

A detailed interpretation of our paper "Optimizing Large Language Model Training Using FP4 Quantization". This post walks you through the motivation, key insights, and design rationale behind our work.

Quantization · Paper Interpretation
Quantization Scaling Law

A One-Stop Guide to Scaling Laws in LLM Quantization

Aug 3, 2025 20 min read

A comprehensive overview of quantization scaling laws. A deep dive into five papers to understand how the performance loss from quantization varies with parameter count and token count.

Quantization · Scaling Laws
Megatron-LM Guide

A Practical Guide to Training Large Models with Megatron-LM: 0 - Preface

Oct 10, 2025 10 min read

Why we must use Megatron-LM for large-model training, and some warnings for those who have never used it before. A practical guide drawn from personal experience.

Megatron-LM · Practical Guide
Recursive Transformers

Paper Summary: Recursive and Looped Transformers for Parameter Efficiency

Oct 28, 2025 25 min read

Exploring how loops and recursion can improve parameter utilization efficiency in LLMs. A comprehensive summary of recursive mechanisms in Transformer architectures.

Recursive Transformers · Paper Interpretation

๐Ÿ† Honors and Awards

๐ŸŽ“
2023
Outstanding Graduate of USTC
๐Ÿ…
2022
China National Scholarship
โญ
2019-2021
Elites Program Scholarship (3ร—)

๐Ÿ’ป Internships

Research Intern · Natural Language Computing Group, Microsoft Research Asia
Jul 2022 - Present · Beijing, China