ARM-Gym

Can an AI write faster code than the world’s best compiler?

Compilers translate your code into processor instructions. They are built to be safe for every program ever written. We trained an AI to find the faster instruction sequences the compiler won’t try. The target: ARM, the architecture inside every smartphone, AWS data center, and AI chip.

Why This Matters

Compilers play it safe.
We don’t have to.

Every AI model you use runs on a processor. The code that drives that processor was written by a compiler. We’re teaching an AI to write that code better than the compiler can.

01
The Hardware

ARM is the world’s most deployed processor.

Every smartphone, AWS Graviton cloud instances, Azure data centers, every Apple Mac since 2020, and Meta’s in-house AI chips all run on ARM. Improving how code runs on ARM touches all of that.

02
The Problem

Compilers are brilliant generalists with a blind spot.

A compiler translates your code into processor instructions. It is optimized to be safe for every possible program ever written. That safety comes at a cost: on specific hardware, for specific workloads, there are faster instruction sequences the compiler will never try because it cannot afford to be wrong even once.

03
The Idea

A language model as a probabilistic scout.

ARM-Gym gives a 7-billion-parameter AI a C function and asks it to rewrite the processor instructions from scratch. Every attempt is verified by a real assembler, a hardware emulator, and a cycle counter. The AI explores. The verifier confirms.

04
The Result

The model learned to write valid ARM assembly, then started beating the compiler.

At the start of training, the AI’s assembly was correct only 19% of the time. By the end of 250 steps it was correct 70% of the time, improving in every quarter of the run. Once it learned to write assembly that actually runs, it started finding sequences faster than the compiler. No prior system has done this on ARM.

Training Results

250 Steps on an NVIDIA L40S

Qwen2.5-Coder-7B-Instruct with LoRA fine-tuning, trained via GRPO on 649 kernel variants. 107 minutes on a single GPU.

70%
Assembly Correctness at End
Started at 19%, rising in every quarter of the run
6.50
Final Quarter Reward
Up from 3.30 at the start
+14.5%
Best Speedup Over Compiler
Cycle estimate, single best attempt
649
Kernel Variants Trained On
15 AI inference templates
250
Training Steps
107 minutes wall clock
<1ms
Per Attempt Verification
Deterministic, no runtime noise
Figure: Reward curve across 250 training steps. Mean reward per group rose from 3.30 to 6.50, nearly doubling over the run.

Figure: QEMU correctness rate. Assembly correctness, verified by running 20 randomized tests per attempt, rose from 19% to 70% across 250 steps, improving in every quarter of the run.

Figure: Training loss over 250 steps. The steady decline confirms the model is learning to structure valid assembly, not exploring at random.

Figure: Compiler output vs. model output on a sample vector kernel. The model found a shorter sequence with a lower cycle estimate.
Architecture

How It Works

A three-gate verification pipeline. Zero LLM judges. Fully deterministic reward.

A
The Environment

C function in, ARM assembly out.

Each step selects a C kernel from 649 variants across 15 templates. The compiler baseline is generated with clang-21 -O3. The model receives both and writes optimized AArch64 assembly.
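As an illustrative sketch of what one environment step hands the model: the kernel body, baseline assembly, and prompt wording below are all assumptions for illustration, not the actual ARM-Gym templates.

```python
# Hypothetical sketch of prompt construction for one environment step.
# The C kernel, the baseline assembly, and the prompt wording are
# illustrative assumptions, not the real ARM-Gym templates.

C_KERNEL = """
void kernel(const float *a, const float *b, float *out) {
    for (int i = 0; i < 16; i++)      /* vec_add, n=16, float32 */
        out[i] = a[i] + b[i];
}
"""

BASELINE_ASM = """
kernel:                         // clang -O3 style NEON baseline
    ldp q0, q1, [x0]            // load a[0..7]
    ldp q2, q3, [x0, #32]       // load a[8..15]
    ldp q4, q5, [x1]            // load b[0..7]
    ldp q6, q7, [x1, #32]       // load b[8..15]
    fadd v0.4s, v0.4s, v4.4s
    fadd v1.4s, v1.4s, v5.4s
    fadd v2.4s, v2.4s, v6.4s
    fadd v3.4s, v3.4s, v7.4s
    stp q0, q1, [x2]            // store out[0..7]
    stp q2, q3, [x2, #32]       // store out[8..15]
    ret
"""

def build_prompt(c_src: str, baseline: str) -> str:
    """Combine the C source and the compiler baseline into the model's input."""
    return (
        "Optimize this AArch64 kernel. Beat the compiler baseline.\n"
        f"C source:\n{c_src}\n"
        f"clang-21 -O3 baseline:\n{baseline}\n"
        "Respond with AArch64 assembly only."
    )

prompt = build_prompt(C_KERNEL, BASELINE_ASM)
```

The model's completion is then handed straight to the verifier; nothing in the prompt is trusted as ground truth.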

B
The Verification

Three gates. No AI judge.

Every attempt must pass: a real assembler for syntax, a hardware simulator running 20 adversarial tests for correctness, and a cycle counter for performance.

flowchart LR
    A["C Kernel<br/>649 variants, 15 templates"] --> B["clang-21 -O3<br/>Baseline Assembly"]
    B --> C["LLM Prompt<br/>C + Baseline ASM"]
    C --> D["Qwen2.5-Coder-7B<br/>+ LoRA r=32 G=8"]
    D --> E["Agent Assembly"]
    E --> F{"3-Gate<br/>Verifier"}
    F -->|"Syntax"| G["GNU as<br/>aarch64"]
    F -->|"Correctness"| H["QEMU x 20<br/>Adversarial Tests"]
    F -->|"Performance"| I["LLVM-MCA<br/>Neoverse V2"]
    I --> J["Dual Verifier<br/>Cross-Check"]
    J --> K["Reward<br/>fmt+syntax+correct+speedup"]
    K --> L["GRPO<br/>z-score clip ±1.5"]
    L --> D
    
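The last two nodes of the diagram — the reward and the GRPO update — can be sketched in a few lines. The ±1.5 z-score clip and the fmt+syntax+correct+speedup structure come from the diagram; the individual component weights below are my own illustrative assumptions.

```python
import statistics

def reward(fmt_ok: bool, syntax_ok: bool, correct_frac: float, speedup: float) -> float:
    """Deterministic reward: format + syntax + correctness + speedup.
    The component weights are illustrative assumptions; only the structure
    (fmt + syntax + correct + speedup) is specified by the pipeline."""
    r = 0.0
    r += 1.0 if fmt_ok else 0.0        # response parses into an assembly block
    r += 1.0 if syntax_ok else 0.0     # GNU as accepts the assembly
    r += 4.0 * correct_frac            # fraction of the 20 QEMU tests passed
    if correct_frac == 1.0:            # speedup only counts when fully correct
        r += max(0.0, speedup - 1.0)   # baseline_cycles / agent_cycles - 1
    return r

def grpo_advantages(rewards: list[float], clip: float = 1.5) -> list[float]:
    """Z-score-normalize rewards within one generation group, clip at +/-1.5."""
    mu = statistics.fmean(rewards)
    sd = statistics.pstdev(rewards) or 1.0   # guard zero-variance groups
    return [max(-clip, min(clip, (r - mu) / sd)) for r in rewards]

# One hypothetical group of 8 is reduced here to 4 for brevity:
group = [reward(True, True, 1.0, 1.145),   # correct and 14.5% faster
         reward(True, True, 0.5, 1.0),     # runs, half the tests pass
         reward(True, False, 0.0, 0.0),    # well-formed but won't assemble
         reward(False, False, 0.0, 0.0)]   # malformed response
adv = grpo_advantages(group)
```

Because the reward is fully deterministic (no LLM judge, no runtime noise), identical completions always receive identical advantages.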
Reproduce the Training

Everything is open.

Both training runs are fully reproducible. The scripts, logs, and trained weights are all public.

V11
Primary Run

Qwen2.5-Coder-7B + LoRA r=32, 250 steps

107 minutes on a single NVIDIA L40S. 8 generations per step. 649 kernel variants. This is the run all results are based on.

Training script (v11_train.py) →
Trained LoRA adapters →

V10
Comparison Run

Qwen2.5-Coder-7B + LoRA r=24, 200 steps

94 minutes on a single NVIDIA L40S. 6 generations per step. Used as the baseline comparison in all plots. Shows what a slightly smaller config achieves.

Training script (v10_train.py) →
Training logs, both runs (CSV) →

NB
Colab Notebook

Key training steps with dependency notes

The notebook was built to run on Hugging Face infrastructure and includes dependency notes for running it on Colab, plus snippets showing each stage of the GRPO training loop.

Open notebook →

Live Environment

Try it now.

The environment is live. Open the interactive Swagger docs to test every endpoint directly in the browser, or use curl.

Open Interactive API Docs
bash
# Check toolchain: gcc, llvm-mca, QEMU
curl -s "https://kaori02-arm-gym.hf.space/health"

# Get a kernel to optimize (C source + baseline assembly + cycle count)
curl -s -X POST "https://kaori02-arm-gym.hf.space/reset?seed=42"

# Submit assembly, get reward (syntax + correctness + speedup)
curl -s -X POST "https://kaori02-arm-gym.hf.space/step" \
  -H "content-type: application/json" \
  -d '{"variant_id":"vec_add_n16_float32","assembly":".text\n.global kernel\nkernel:\n  ret"}'

# All 649 kernel variants by difficulty
curl -s "https://kaori02-arm-gym.hf.space/tasks"
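The same loop from Python, using only the standard library. The endpoint paths come from the curl examples above; the shape of the JSON responses is an assumption.

```python
# Sketch of driving the live environment from Python. Endpoint paths match
# the curl examples; response field names are assumptions, so inspect the
# Swagger docs for the authoritative schema.
import json
from urllib import request

BASE = "https://kaori02-arm-gym.hf.space"

def step_payload(variant_id: str, assembly: str) -> bytes:
    """JSON body for POST /step. Newlines in the assembly must be JSON-escaped,
    which json.dumps handles automatically."""
    return json.dumps({"variant_id": variant_id, "assembly": assembly}).encode()

def post(path, body=None) -> dict:
    req = request.Request(BASE + path, data=body, method="POST",
                          headers={"content-type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    task = post("/reset?seed=42")            # C source + baseline + cycle count
    result = post("/step", step_payload(
        "vec_add_n16_float32", ".text\n.global kernel\nkernel:\n  ret"))
    print(result)                            # reward breakdown for the attempt
```

Wrapping generation around `post("/step", ...)` is all it takes to plug in your own model.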