Can an AI write faster code than the world’s best compiler?
Compilers translate your code into processor instructions. They are built to be safe for every program ever written. We trained an AI to find the faster instruction sequences the compiler won’t try. The target: ARM, the architecture inside every smartphone, AWS data center, and AI chip.
Every AI model you use runs on a processor. The code that drives that processor was written by a compiler. We’re teaching an AI to write that code better than the compiler can.
Every smartphone, every AWS Graviton cloud instance, Azure data centers, every Apple Silicon Mac since 2020, and Meta’s in-house AI chips all run on ARM. Improving how code runs on ARM touches all of that.
A compiler translates your code into processor instructions. It is optimized to be safe for every possible program ever written. That safety comes at a cost: on specific hardware, for specific workloads, there are faster instruction sequences the compiler will never try because it cannot afford to be wrong even once.
ARM-Gym gives a 7-billion-parameter model a C function and asks it to rewrite the processor instructions from scratch. Every attempt is verified by a real assembler, a hardware emulator, and a cycle counter. The AI explores. The verifier confirms.
At the start of training, the model’s assembly was correct only 19% of the time. By step 250 it was correct 70% of the time, improving in every quarter of the run. Once it learned to write assembly that actually runs, it started finding sequences faster than the compiler’s. No prior system has done this on ARM.
Qwen2.5-Coder-7B-Instruct with LoRA fine-tuning, trained via GRPO on 649 kernel variants. 107 minutes on a single GPU.
A three-gate verification pipeline. Zero LLM judges. Fully deterministic reward.
Each step selects a C kernel from 649 variants across 15 templates. The compiler baseline is generated with clang-21 -O3. The model receives both and writes optimized AArch64 assembly.
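The pairing of C source and compiler baseline can be sketched as prompt construction; the template wording below is an assumption for illustration, not the exact prompt used in v10_train.py:

```python
def build_prompt(c_source: str, baseline_asm: str) -> str:
    """Pair a C kernel with its clang-21 -O3 baseline (hypothetical template)."""
    return (
        "Rewrite the AArch64 assembly below so it runs in fewer cycles.\n"
        "It must stay functionally identical to the C kernel.\n\n"
        "### C kernel\n" + c_source + "\n\n"
        "### clang-21 -O3 baseline\n" + baseline_asm + "\n\n"
        "### Optimized assembly\n"
    )

prompt = build_prompt(
    "void kernel(const float *a, const float *b, float *c) {\n"
    "    for (int i = 0; i < 16; i++) c[i] = a[i] + b[i];\n"
    "}",
    ".text\n.global kernel\nkernel:\n    // ... clang output elided ...\n    ret",
)
```

Giving the model the baseline, rather than C alone, turns the task into improving a concrete instruction sequence instead of compiling from scratch.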
Every attempt must pass: a real assembler for syntax, a hardware simulator running 20 adversarial tests for correctness, and a cycle counter for performance.
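The short-circuit structure of the three gates can be sketched in a few lines. In the sketch below the real tools are replaced by stub predicates, so only the gate ordering is faithful to the pipeline:

```python
def verify(assembly, gates):
    """Run the ordered gates, stopping at the first failure.
    Returns (passed_all, name_of_first_failed_gate_or_None)."""
    for name, check in gates:
        if not check(assembly):
            return False, name
    return True, None

# Stub predicates standing in for the real tools. In the actual pipeline:
#   syntax      -> GNU as (aarch64) must assemble the text
#   correctness -> QEMU must pass all 20 adversarial tests
#   performance -> llvm-mca (Neoverse V2 model) must report a cycle count
gates = [
    ("syntax",      lambda asm: asm.strip().endswith("ret")),
    ("correctness", lambda asm: "kernel:" in asm),
    ("performance", lambda asm: len(asm) > 0),
]
```

Stopping at the first failed gate keeps the reward deterministic: every attempt gets the same verdict every time, with no LLM judge in the loop.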
```mermaid
flowchart LR
    A["C Kernel<br/>15 templates x 649 variants"] --> B["clang-21 -O3<br/>Baseline Assembly"]
    B --> C["LLM Prompt<br/>C + Baseline ASM"]
    C --> D["Qwen2.5-Coder-7B<br/>+ LoRA r=32 G=8"]
    D --> E["Agent Assembly"]
    E --> F{"3-Gate<br/>Verifier"}
    F -->|"Syntax"| G["GNU as<br/>aarch64"]
    F -->|"Correctness"| H["QEMU x 20<br/>Adversarial Tests"]
    F -->|"Performance"| I["LLVM-MCA<br/>Neoverse V2"]
    I --> J["Dual Verifier<br/>Cross-Check"]
    J --> K["Reward<br/>fmt+syntax+correct+speedup"]
    K --> L["GRPO<br/>z-score clip ±1.5"]
    L --> D
```
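The reward and advantage stages of the loop can be sketched in plain Python. The staged reward weights below are illustrative assumptions; only the ±1.5 z-score clip comes from the diagram:

```python
import statistics

def reward(fmt_ok: bool, syntax_ok: bool, correct: bool, speedup: float) -> float:
    """Deterministic staged reward (weights are hypothetical, not the
    values used in training): fmt + syntax + correct + speedup bonus."""
    r = 0.1 * fmt_ok + 0.2 * syntax_ok + 0.5 * correct
    if correct and speedup > 1.0:
        # speedup over the clang -O3 baseline only counts for verified-correct code
        r += min(speedup - 1.0, 1.0)
    return r

def grpo_advantages(rewards, clip=1.5):
    """Group-relative advantages: z-score each reward within its
    generation group, then clip to +/-1.5 as in the diagram."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [max(-clip, min(clip, (r - mu) / sigma)) for r in rewards]

# One group of 8 generations: six fail syntax, one is correct,
# one is correct and 1.4x faster than the baseline.
group = [reward(True, False, False, 0.0)] * 6 + [
    reward(True, True, True, 1.0),
    reward(True, True, True, 1.4),
]
adv = grpo_advantages(group)
```

Because the group baseline is its own mean, the one standout sample gets the full positive advantage while the failed attempts share the negative mass; the clip stops a single outlier from dominating the gradient.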
Both training runs are fully reproducible. The scripts, logs, and trained weights are all public.
107 minutes on a single NVIDIA L40S. 8 generations per step. 649 kernel variants. This is the run all results are based on.
94 minutes on a single NVIDIA L40S. 6 generations per step. Used as the baseline comparison in all plots. Shows what a slightly smaller config achieves.
Training script (v10_train.py) →
Training logs, both runs (CSV) →
The notebook was built to run on HuggingFace infrastructure. It has dependency notes for running on Colab and snippets showing each stage of the GRPO training loop.
The environment is live. Open the interactive Swagger docs to test every endpoint directly in the browser, or use curl.
```shell
# Check toolchain: gcc, llvm-mca, QEMU
curl -s "https://kaori02-arm-gym.hf.space/health"

# Get a kernel to optimize (C source + baseline assembly + cycle count)
curl -s -X POST "https://kaori02-arm-gym.hf.space/reset?seed=42"

# Submit assembly, get reward (syntax + correctness + speedup)
curl -s -X POST "https://kaori02-arm-gym.hf.space/step" \
  -H "content-type: application/json" \
  -d '{"variant_id":"vec_add_n16_float32","assembly":".text .global kernel kernel: ret"}'

# All 649 kernel variants by difficulty
curl -s "https://kaori02-arm-gym.hf.space/tasks"
```