MLRift

An experimental ML compiler that AOT-compiles LLM inference, neural networks, and evolutionary algorithms directly to AMD GPU machine code: no Python, no PyTorch runtime, no hipcc, no LLVM. On a Ryzen 9 7900X + RX 7800 XT, MLRift now beats PyTorch on both GPU and CPU: Qwen3-0.6B 264 tok/s GPU (4.23× PT bf16), Llama-3.2-1B 99.8 tok/s GPU (+95% vs PT bf16), Llama-3.2-3B 40.7 tok/s GPU (+28% vs PT bf16), Mistral-7B 24.9 tok/s GPU (+96% vs PT bf16), and Llama-3.2-1B 16.2 tok/s CPU at 2.39 GB peak RSS (+7% vs PT CPU bf16 at -37% RAM), with bit-identical output.

MLRift is an ML compiler created by Pantelis Christou.

$ ./build/mlrc --target=amdgpu-native examples/qwen3_generate.mlr -o /tmp/qw3
$ MLRIFT_NATIVE_MEGAKERNEL=2 MLRIFT_QWEN3_MEGAKERNEL_SPECK16=1 MLRIFT_SPEC_K=16 /tmp/qw3
tokens_per_sec_x1000=266201

MLRift is an experimental zero-dependency machine-learning compiler built on the KernRift toolchain. It compiles models ahead-of-time (AOT) directly to machine code — bypassing the Python / PyTorch stack at runtime. The current focus is LLM inference performance ceilings on consumer hardware: what does Qwen3 actually run at when every layer of abstraction between MLRift source and the GPU's instruction stream has been removed? An AST-walking pass turns @kernel functions into raw RDNA 3 / RDNA 2 ISA code-objects with no hipcc, no LLVM, and no clang in the build path. A KFD-direct shim talks to the AMD kernel driver with no userspace ROCm dependency. The qwen3 inference example that loads HF safetensors and decodes 264 tokens/sec on an RX 7800 XT is one statically-linked ELF — Python is never involved.

LLM GPU Inference — RX 7800 XT

Qwen3-0.6B

Greedy decode through the same compiler stack — no Python, no PyTorch runtime. Hardware: AMD Ryzen 9 7900X + AMD Radeon RX 7800 XT (gfx1100). All MLRift rows produce token-id output bit-identical to HuggingFace transformers.generate(do_sample=False). Peak VRAM read from /sys/class/drm/card1/device/mem_info_vram_used with 3-second cooldown between runs; idle baseline 174 MiB. Methodology in SLICE4_MEGAKERNEL_DESIGN.md.
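The VRAM measurement described above can be approximated with a few lines of Python. This is a minimal sketch, assuming the sysfs counter reports a byte count (as the amdgpu driver does) and that the 174 MiB idle baseline is subtracted by hand; the helper names are hypothetical and are not part of MLRift or its bench scripts.

```python
from pathlib import Path

# Path quoted in the methodology above; the card index may differ per machine.
VRAM_USED = Path("/sys/class/drm/card1/device/mem_info_vram_used")

def parse_vram_mib(raw: str) -> float:
    """Convert the sysfs text (a decimal byte count) to MiB."""
    return int(raw.strip()) / (1024 * 1024)

def read_vram_mib(path: Path = VRAM_USED) -> float:
    """Read current VRAM use in MiB; needs an amdgpu device to be present."""
    return parse_vram_mib(path.read_text())

def above_idle(peak_mib: float, idle_mib: float = 174.0) -> float:
    """VRAM attributable to the workload, net of the idle baseline."""
    return peak_mib - idle_mib

if __name__ == "__main__":
    if VRAM_USED.exists():
        print(f"{read_vram_mib():.0f} MiB in use")
```

Polling `read_vram_mib()` during a run and keeping the maximum would give a peak figure comparable to the table column.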

| Stack | Weights → GEMM | tok/s | Peak VRAM | RSS | vs PyTorch bf16 |
|---|---|---|---|---|---|
| MLRift CPU | bf16 → f32 | 24.1 | 174 MiB | 1.67 GiB | 0.39× |
| MLRift GPU per-op | bf16 → f32 | 54.8 | 2 801 MiB | 1.82 GiB | 0.88× |
| PyTorch ROCm fp32 | f32 → f32 | 40.2 | 3 533 MiB | 4.92 GiB | 0.64× |
| PyTorch ROCm bf16 | bf16 → bf16 (WMMA) | 62.5 | 2 091 MiB | 4.52 GiB | 1.00× (baseline) |
| MLRift M=1 mega (slice 7.6) | bf16 → f32 | 119.7 | 2 046 MiB | 1.76 GiB | 1.92× |
| MLRift mks4 (spec_K=4) | bf16 → f32 | 169.3 | 2 046 MiB | 1.70 GiB | 2.71× |
| MLRift mks8 (spec_K=8) | bf16 → f32 | 201.6 | 2 046 MiB | 1.70 GiB | 3.23× |
| MLRift mks16 (spec_K=16) | bf16 → f32 | 264.2 | 2 046 MiB | 1.98 GiB | 4.23× |

mks16 = 4.23× PyTorch ROCm bf16, 6.57× ROCm fp32, and 11.0× the MLRift CPU row (264.2 vs 24.1 tok/s). All MLRift mega paths use bf16 weights with fp32 GEMM: mantissa-truncated weights stream from VRAM and the multiply happens in f32 lanes. Peak VRAM is slightly under PyTorch's native bf16 path (2 046 vs 2 091 MiB) and 42 % less than fp32. RSS is ~1.7-2.0 GiB across all MLRift configurations because the safetensors file is mmap'd; PyTorch carries an additional ~3 GiB of Python + transformers + tensor metadata. The mks-K kernels run as a single dispatch per layer (29 launches/token) with persistent work-groups and cross-WG barriers; the K query tokens per dispatch come from prompt-lookup-decode (PLD) speculative decoding, which reaches a 99 % accept rate on the bench prompt. Number re-verified 2026-05-17 post-reboot; reproducer in docs/bench_2026-05-17.md.
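The GPU ratios quoted above follow directly from the table. The snippet below is only a sanity check on that arithmetic, using figures copied from three of the rows; the function names are illustrative.

```python
# (tok/s, peak VRAM in MiB) copied from the Qwen3-0.6B table above.
rows = {
    "PyTorch ROCm fp32": (40.2, 3533),
    "PyTorch ROCm bf16": (62.5, 2091),
    "MLRift mks16": (264.2, 2046),
}

def speedup(stack: str, baseline: str) -> float:
    """Throughput of `stack` relative to `baseline`."""
    return rows[stack][0] / rows[baseline][0]

def vram_saving(stack: str, baseline: str) -> float:
    """Fraction of the baseline's peak VRAM that `stack` avoids."""
    return 1.0 - rows[stack][1] / rows[baseline][1]

print(f"{speedup('MLRift mks16', 'PyTorch ROCm bf16'):.2f}x vs bf16")
print(f"{speedup('MLRift mks16', 'PyTorch ROCm fp32'):.2f}x vs fp32")
print(f"{vram_saving('MLRift mks16', 'PyTorch ROCm fp32'):.0%} less VRAM than fp32")
```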

Llama-3.2-1B-Instruct

Same hardware. The MLRift GPU path here is a per-layer mega-kernel emitted by MLRift's @kernel AST-walker (no hipcc, no LLVM, no ROCm runtime in the launcher — the --target=amdgpu-native KFD shim talks straight to /dev/kfd). Bit-exact vs llama.cpp greedy on 20 / 20 token ids for prompt "hello". Precision: bf16 weights, f32 activations, f32 accumulator — higher numerical fidelity than the PyTorch bf16 GPU baseline (which casts activations to bf16 too). Methodology in docs/bench_2026-05-13.md.

| Stack | Weights → GEMM | tok/s | Peak VRAM | RSS | vs PyTorch bf16 |
|---|---|---|---|---|---|
| PyTorch ROCm fp32 | f32 → f32 | 31.2 | 4 749 MiB | ~4.7 GiB | 0.61× |
| PyTorch ROCm bf16 | bf16 → bf16 (SDPA) | 51.3 | 2 392 MiB | ~3.8 GiB | 1.00× (baseline) |
| MLRift M=1 mega (slice 7.6) | bf16 → f32 | 99.8 | ~2 100 MiB | 3.71 GiB | 1.95× (+95%) |
| MLRift speck4 + PLD | bf16 → f32 | 84.8 | ~2 100 MiB | 1.27 GiB | 1.65× (+65%) |

MLRift M=1 = 1.95× PyTorch bf16 GPU on Llama-3.2-1B, on any prompt (no PLD required). The per-layer mega-kernel fuses input-rmsnorm, QKV matmul, qkv-split + qknorm + RoPE, attention (decode-coop), o_proj + residual, post-norm, and gate_up + silu + down + residual into a single dispatch with 9 cross-WG barriers. This cuts per-layer launch overhead 9× and feeds the inner gemv loop with an s_clause-batched 12-load / 8-FMA pipeline (slice 4.25, 4× unroll). PyTorch fp32 needs 4 749 MiB of VRAM, more than twice MLRift's ~2.1 GiB envelope, which also sits below the 2 392 MiB bf16 baseline. Against PyTorch fp32 GPU, MLRift M=1 is 3.20× faster (99.8 vs 31.2 tok/s, +220%) at matching numerical fidelity.

Llama-3.2-3B-Instruct

Same mega-kernel emitter, scaled to a 3-B model with non-power-of-two GQA (24 Q heads, 8 KV heads; the 3:1 ratio was the slice-8.1 libdivide fix). Greedy decode 20 tokens for prompt "hello", output byte-identical to PyTorch bf16+fp32 CPU: 11, 358, 2846, 8173, 304, 6975, 810, 922, 279, 7434, 315, 330, 23449, 5030, 1, 323, 1202, 25127, 369, 1057. Precision: bf16 weights, f32 activations, f32 accumulator. Slice-7.4 host-RSS dealloc (commit 447d220) saves 3.5 GiB of un-fused per-tensor bf16 originals after they're memcpy'd into the stacked qkv/gate-up layouts. Number re-verified 2026-05-17 post-reboot.
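With a 3:1 GQA ratio, mapping a Q head to its KV head needs a divide by 3, which cannot be done with a plain shift. The sketch below shows the general libdivide-style idea (multiply by a precomputed constant, then shift) for 32-bit unsigned inputs. It illustrates the technique only; the constant and helper names are assumptions and not the code MLRift emits.

```python
# ceil(2**33 / 3): multiplying by this and shifting right by 33
# yields n // 3 for every 0 <= n < 2**32.
MAGIC_DIV3 = 0xAAAAAAAB

def div3(n: int) -> int:
    """n // 3 using one multiply and one shift (no divide instruction)."""
    return (n * MAGIC_DIV3) >> 33

def kv_head_for(q_head: int) -> int:
    """KV head shared by a given Q head when 3 Q heads share each KV head."""
    return div3(q_head)

N_KV_HEADS = 8   # from the paragraph above
GROUP = 3        # Q heads per KV head (the 3:1 ratio)

# Cross-check against ordinary integer division for every Q head.
for q in range(N_KV_HEADS * GROUP):
    assert kv_head_for(q) == q // GROUP
```

On a GPU the same multiply-and-shift maps to a 32×32→64-bit multiply followed by a right shift, which avoids the much slower integer-divide sequence.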

| Stack | Weights → GEMM | tok/s | Peak VRAM | RSS | vs PyTorch bf16 |
|---|---|---|---|---|---|
| PyTorch ROCm fp32 | f32 → f32 | 15.1 | 12 294 MiB | ~4.0 GiB | 0.48× |
| PyTorch ROCm bf16 | bf16 → bf16 (SDPA) | 31.7 | 6 164 MiB | ~3.5 GiB | 1.00× (baseline) |
| MLRift M=1 mega (slice 8.4 + 7.4) | bf16 → f32 | 40.7 | ~5 800 MiB | 9.63 GiB | 1.28× (+28%) |

MLRift Llama-3.2-3B = 1.28× PyTorch bf16 GPU at -6% VRAM, byte-identical to PyTorch's greedy decode (3B was the model that established PT — not MLRift CPU — as the canonical GPU reference; see memory project_llama_3b_gpu.md). Slice 8.4 collapsed an earlier 7× cputree perf regression (5.88 → 40.5 tok/s) by removing a dead cputree dispatch path that was built on the inverted "CPU bf16 reference" framing. Higher host RSS than 1-B because the 28-layer Q8_0→bf16/fp32 dequant footprint scales with layer count; the slice-7.4 dealloc clawed back 3.5 GiB of those buffers (commit 447d220).

Mistral-7B-Instruct-v0.3

Same hardware, same mega-kernel emitter, scaled to a 7-B model. Prompt "hello", greedy decode, 20 tokens, output bit-identical to transformers.generate(do_sample=False): 1, 6312, 28709, 28725, 13, 13, 28710, 506, 264, 2996, .... PyTorch fp32 GPU OOMs on a 16 GiB card (the bf16 model alone is ~14 GiB) so the only PyTorch GPU baseline that fits is bf16. Precision: bf16 weights, f32 activations, f32 accumulator — same as Llama-1B and Qwen3-0.6B rows above.

| Stack | Weights → GEMM | tok/s | Peak VRAM | RSS | vs PyTorch bf16 |
|---|---|---|---|---|---|
| PyTorch ROCm bf16 | bf16 → bf16 (SDPA) | 12.7 | 14 484 MiB | ~5.0 GiB | 1.00× (baseline) |
| MLRift M=1 mega (Q8_0 GGUF) (slice 7.9) | bf16 → f32 | 24.9 | 14 233 MiB | 14.0 GiB | 1.96× (+96%) |
| MLRift M=1 mega (bf16 safetensors) (slice 7.6) | bf16 → f32 | 22.7 | 14 233 MiB | 27.8 GiB | 1.79× (+79%) |

MLRift Mistral-7B = 1.79-1.96× PyTorch bf16 GPU at parity VRAM, with bit-identical greedy output through 22 of 23 tokens. The pos-6 divergence is exactly 1 bf16 ULP at the lm_head output and was bisected to reduction-tree topology: the same family as Ollama/llama.cpp reduction trees, structurally sub-ULP to chase. The same 9-phase mega-kernel that runs Qwen3-0.6B (HIDDEN=1024), Llama-3.2-1B (HIDDEN=2048) and Llama-3.2-3B (HIDDEN=3072) extends to Mistral's HIDDEN=4096 with no kernel rewrite, only arch-globals retargeting.

Slice 7.4 drops un-fused per-tensor host bf16 buffers right after fusing into the stacked qkv/gate-up layouts, trimming peak host RSS from 19.9 GiB to 14.0 GiB on the Q8_0 path.

Slice 7.5 doubles the inner-loop unroll factor from 4× to 8×: each outer iteration now issues 24 VMEM loads in one s_clause(23) batch and fires 16 v_fmac_f32 ops, widening the independent-FMA window from 8 to 16 to expose more ILP for the gfx11 issue pipeline.

Slice 7.6 splits the single-accumulator FMA chain in the phase-3 inner gemv into 8 independent accumulators with a final 7-add reduction, then packs adjacent FMA pairs into gfx11 VOPD v_dual_fmac_f32 for RDNA3 dual-issue, taking Mistral Q8_0 from 21.7 to 22.9 tok/s (+5.5%). Tokens are byte-identical to slice 7.5. Phase 9 (residual gemv / down_proj) retains the slice-7.5 pattern, because applying multi-acc there wedges MES firmware at Mistral's K=14336 (bisected 2026-05-17 to the "amdgpu: failed to remove hardware queue from MES" fingerprint, where MES stops responding to host control messages).

Slice 7.7 adds a 5-second time-based watchdog in std/kfd.mlr + std/hip_kfd.mlr that catches the wedge fingerprint, prints one [KFD-WEDGE] diagnostic, short-circuits subsequent dispatches, and skips the blocking DESTROY_QUEUE ioctl in teardown, so the launcher exits in ~13 s instead of the prior 60+ s opaque hang.

Slice 7.9 structurally splits phase 9 (o_proj) and phase 17 (down_proj) into two pieces with a cross-WG barrier between them: gemv → intermediate scratch, then residual-add. The gemv now matches phase 3's proven shape (multi-acc + VOPD + plain store) instead of the wedge-triggering dependent-VALU reduce → VMEM residual-load → store shape. Mistral goes from 22.9 to 24.9 tok/s (+8.7%); Llama-1B / Llama-3B / Qwen3 keep the monolithic phase-9 emitter and stay byte-identical. The Q8_0 GGUF path (default, lighter RSS) and the bf16 safetensors path (heavier mmap+heap duplication, used for sub-ULP PT bisection) produce similar tok/s (24.9 vs 22.7).
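The multi-accumulator idea from slice 7.6 can be shown in scalar form. This is a sketch under stated assumptions: it uses Python floats (f64) rather than f32 lanes, it assumes the vector length is a multiple of 8, and it assumes the 7 final adds form a pairwise tree, which the text above does not specify.

```python
def dot_single(w, x):
    """One accumulator: every add depends on the previous one."""
    acc = 0.0
    for a, b in zip(w, x):
        acc += a * b
    return acc

def dot_multi_acc(w, x, lanes=8):
    """`lanes` independent accumulators, then a pairwise reduction.

    Returns (result, number_of_adds_in_the_final_reduction).
    Assumes len(w) == len(x) and len(w) % lanes == 0.
    """
    acc = [0.0] * lanes
    for i in range(0, len(w), lanes):
        for j in range(lanes):          # these `lanes` updates are independent
            acc[j] += w[i + j] * x[i + j]
    adds = 0
    while len(acc) > 1:                 # 8 -> 4 -> 2 -> 1
        acc = [acc[k] + acc[k + 1] for k in range(0, len(acc), 2)]
        adds += len(acc)
    return acc[0], adds

w = [float(i) for i in range(1, 17)]
x = [2.0] * 16
print(dot_single(w, x), dot_multi_acc(w, x))
```

Because floating-point addition is not associative, the two orderings can differ in the last bit on real weights, which is the same class of effect as the 1-ULP reduction-tree divergence mentioned above.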

LLM CPU Inference — Ryzen 9 7900X

The same compiler, AVX2 backend, no GPU. MLRift programs (examples/qwen3_generate.mlr, examples/llama3_generate.mlr, examples/qwen3_14b_q8_generate.mlr) load HF safetensors or GGUF weights via mmap, run AVX2 matmul / SiLU / RMSNorm / attention kernels through a raw clone() + futex thread pool, and emit greedy decode. Hardware: AMD Ryzen 9 7900X (12 c / 24 t, AVX2). MLRift output is bit-identical to llama.cpp greedy on all rows. The Llama-3.2-1B row uses MLRIFT_CPU_BF16=1 which dequantises Q8_0 weights to bf16 once at load and routes through MLRift's AVX2 2-wide bf16 matmul. Methodology in docs/bench_2026-05-13.md, BENCH_QWEN3.md, and BENCH_QWEN3_14B.md.

| Model | Runtime / weights | Decode wall | tok/s | Peak RSS | vs PyTorch bf16 |
|---|---|---|---|---|---|
| Qwen3-0.6B | PyTorch F32 (f32 weights) | 2 043 ms / 20 tok | 9.79 | 7.23 GB | 0.38× |
| Qwen3-0.6B | PyTorch BF16 (bf16 weights, f32 GEMM) | 774 ms / 20 tok | 25.83 | 4.44 GB | 1.00× (baseline) |
| MLRift Qwen3-0.6B | bf16 safetensors, AVX2 + MT | 624 ms / 20 tok | 32.03 | 1.67 GB | 1.24× |
| MLRift Qwen3-0.6B | bf16 GGUF, AVX2 + MT | 619 ms / 20 tok | 32.27 | 1.67 GB | 1.25× |
| Qwen3-14B | PyTorch BF16 (bf16 weights, f32 GEMM) | 151 366 ms / 20 tok | 0.132 | 20.32 GB | 1.00× (baseline) |
| MLRift Qwen3-14B | Q8_0 GGUF, AVX2 + MT | 41 712 ms / 20 tok | 0.479 | 14.81 GB | 3.63× |
| Llama-3.2-1B | PyTorch BF16 (bf16 weights, f32 GEMM) | 1 316 ms / 20 tok | 15.2 | ~3.8 GB | 1.00× (baseline) |
| MLRift Llama-3.2-1B (slice 4.27) | Q8_0→bf16 dequant + madvise, AVX2 + MT | 1 233 ms / 20 tok | 16.2 | 2.39 GB | 1.07× at -37% RAM |
| Qwen3-VL-4B (text) | PyTorch BF16 (bf16 weights, f32 GEMM) | 4 082 ms / 20 tok | 4.9 | ~7.5 GB | 1.00× (baseline) |
| MLRift Qwen3-VL-4B (text) | Q8_0→bf16 dequant, AVX2 + MT | 3 953 ms / 20 tok | 5.06 | ~6.7 GB | 1.03× (+3%) |

MLRift CPU beats PyTorch CPU bf16 on every model where a like-for-like baseline exists, and now uses less RAM doing it. 14B compares MLRift Q8_0 vs PyTorch BF16 — different storage dtypes, real-world end-to-end comparison rather than arithmetic-parity. PyTorch F32 at 14B requires ~56 GB RAM and does not fit on the 30 GB test box, so it is omitted. Llama-1B and Qwen3-VL-4B compare PyTorch CPU bf16 (MKL+oneDNN under the hood) against MLRift's AVX2 2-wide bf16 inner loop (mm_worker_bf16_f32_avx2_naive_2w). Slice 4.27 (2026-05-13 mem opt) drops the Llama-1B CPU bf16 peak RSS from 3.63 → 2.39 GB (-34 %) by issuing madvise(MADV_DONTNEED) on the Q8_0 GGUF pages immediately after each tensor's bf16 dequant, then re-using the tied lmhead bf16 buffer for per-token embedding lookup so the embed Q8_0 region can also be evicted. Across all rows MLRift's RSS sits below PyTorch's because the safetensors / GGUF file is mmap'd and Python+transformers metadata is absent. Llama-1B prompt: "hello", N=20.
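The slice 4.27 pattern (dequantise from a mapped file, then tell the kernel the source pages are no longer needed) can be sketched with Python's `mmap` module. The sketch makes several simplifying assumptions: a single scale for all values instead of Q8_0's per-block scales, a plain Python list as the output buffer, and a hypothetical function name. `mmap.madvise` needs Python 3.8+ and a platform that exposes `MADV_DONTNEED`.

```python
import mmap
import os
import tempfile

def dequant_then_evict(path, count, scale):
    """Dequantise `count` int8 values from a mapped file, then drop its pages."""
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            # Reinterpret each byte as a signed int8 and apply the scale.
            out = [(b - 256 if b > 127 else b) * scale for b in mm[:count]]
            # Hint that the mapped source pages can be released; the
            # dequantised copy in `out` is all that is needed from here on.
            if hasattr(mmap, "MADV_DONTNEED"):
                mm.madvise(mmap.MADV_DONTNEED)
        finally:
            mm.close()
    return out

def demo():
    """Write four quantised bytes to a temp file and dequantise them."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(bytes([1, 2, 255, 128]))
        path = tmp.name
    try:
        return dequant_then_evict(path, 4, 0.5)
    finally:
        os.unlink(path)

print(demo())
```

For a read-only file mapping the advice is safe: if the pages are touched again they are simply re-read from the file.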

60 M-Neuron Spiking-Net Benchmark

LIF spiking network, 60 million neurons, 240 million synapses (sparse CSR, K = 4), 2 000 timesteps, identical algorithm across every runtime. Hardware: AMD Ryzen 9 7900X (12 c / 24 t) + AMD Radeon RX 7800 XT. Spike counts are bit-identical across MLRift CPU, MLRift GPU, and cupy; PyTorch differs by ≈ 0.0003 % due to a HIP uint64 remainder fallback. Methodology in docs/bench_60m.md.

| Runtime | Device | Threads | Total wall | Per step | vs PyTorch CPU |
|---|---|---|---|---|---|
| numpy | CPU | 1 | 58 min 37 s | 1.75 s | 0.60× |
| PyTorch | CPU | 24 | 35 min 01 s | 1.05 s | 1.00× (baseline) |
| cupy | GPU (ROCm 7.2) | | 2 min 15 s | 64 ms | 15.5× |
| PyTorch | GPU (ROCm 6.4) | | 1 min 43 s | 51 ms | 20.4× |
| MLRift | CPU (AVX2 + MT) | 24 | 6 min 26 s | 191 ms | 5.4× |
| MLRift | GPU (HIP / hipcc) | | 28.40 s | 13 ms | 74.0× |
| MLRift | GPU (native gfx1100 ISA) | | 28.87 s | 13 ms | 72.8× |

MLRift GPU runs the entire sim (CSR build, decay, LIF step, spike delivery) as four @kernel functions compiled to a single ROCm code object. The native-ISA row uses the AST-walking AMDGPU emitter (src/format_amdgpu.mlr), which emits all four kernels directly into a code object with no hipcc and no LLVM. Spike count is bit-identical to the HIP path over 120 billion neuron-step computations, including IEEE-correct f64 division and CAS-retry atomic_add_f64. Wall time is within 2 % of the HIP path (28.87 s vs 28.40 s, +1.7 %); the win is the build pipeline: a 0.92 ms .co build vs hipcc's 482 ms (524× faster) and zero toolchain dependency.
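The four stages named above (CSR build, decay, LIF step, spike delivery) can be sketched on a toy network. Everything concrete here is an assumption made for illustration: the ring wiring, the decay, threshold and weight values, and the function names. It is not the benchmark's actual kernel code.

```python
def build_csr(n, k):
    """CSR with k outgoing synapses per neuron (hypothetical ring wiring)."""
    row_ptr = [i * k for i in range(n + 1)]
    col_idx = [(i + j + 1) % n for i in range(n) for j in range(k)]
    return row_ptr, col_idx

def step(v, inp, row_ptr, col_idx, decay=0.5, threshold=1.0, weight=0.25):
    """One timestep; returns (new potentials, next-step input, spike count)."""
    n = len(v)
    # Decay the membrane potential and integrate this step's input.
    v = [decay * v[i] + inp[i] for i in range(n)]
    # LIF step: neurons at or above threshold fire and reset.
    spikes = [i for i in range(n) if v[i] >= threshold]
    for i in spikes:
        v[i] = 0.0
    # Spike delivery: accumulate weights into each target's next input
    # (on a GPU this accumulation is where an atomic add is needed).
    nxt = [0.0] * n
    for i in spikes:
        for s in range(row_ptr[i], row_ptr[i + 1]):
            nxt[col_idx[s]] += weight
    return v, nxt, len(spikes)

row_ptr, col_idx = build_csr(8, 4)
v, inp = [0.0] * 8, [1.0] + [0.0] * 7
v, inp, fired = step(v, inp, row_ptr, col_idx)
print(fired, inp)
```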

Phase 3 — AST → AMDGPU ISA Compiler

An AST-walking pass in src/format_amdgpu.mlr recognises ~30 LLM-pipeline kernel shapes — gemv, gemm, gemv_coop_bf16_f32, attention (single + speck4 + 14B variants), RoPE, RMSNorm, softmax, layernorm, silu_mul, qkv_split, K/V scatter — and emits raw RDNA 3 / RDNA 2 ISA bytes directly into a code object. ~2200 lines of hand-emit code deleted. No hipcc, no LLVM, no clang in the build path.

Multi-Arch Backends

Same MLRift source, multiple ISAs. gfx1100 (RX 7000 / RDNA 3) is the primary target; gfx1030 (RX 6000 / RDNA 2) ships 31 kernels via asm_u32_arch pair-pickers, zero .long opaque bytes. WMMA tensor cores for bf16/f32 GEMM are wired on RDNA 3. RDNA 4 (gfx1200) and NVIDIA Blackwell / Ada / Ampere are on the roadmap — only the per-arch encoding tables change.

Speculative Mega-Kernel Inference

The 28-layer Qwen3 forward pass collapses into one GPU dispatch per layer with persistent work-groups and cross-WG barriers — 29 launches/token instead of 310. Multi-query variants (mks4, mks8, mks16) amortise weight-row reads across 4/8/16 query tokens per dispatch using prompt-lookup-decode (PLD) speculative decoding. 264 tok/s on a 16 GB consumer card.
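Prompt-lookup decoding drafts tokens by finding an earlier occurrence of the most recent n-gram in the context and proposing whatever followed it; the model then verifies the draft in one multi-token pass. The sketch below shows that general technique only. The n-gram length, the function names and the acceptance helper are assumptions, not MLRift's implementation.

```python
def pld_propose(tokens, ngram=2, k=4):
    """Propose up to k draft tokens by matching the last `ngram` tokens
    against the most recent earlier occurrence in the context."""
    if len(tokens) < ngram:
        return []
    tail = tokens[-ngram:]
    # Scan right to left, skipping the position of the tail itself.
    for start in range(len(tokens) - ngram - 1, -1, -1):
        if tokens[start:start + ngram] == tail:
            return tokens[start + ngram:start + ngram + k]
    return []

def accepted_prefix(draft, verified):
    """Number of leading draft tokens that match the model's greedy tokens."""
    n = 0
    for d, v in zip(draft, verified):
        if d != v:
            break
        n += 1
    return n

context = [5, 6, 7, 8, 9, 5, 6]
draft = pld_propose(context)
print(draft, accepted_prefix(draft, [7, 8, 1, 1]))
```

Since only draft tokens that equal the model's own greedy choice are kept, the output stays identical to plain greedy decoding; the accept rate only affects speed.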

KFD-Direct Driver Path

An --target=amdgpu-native mode talks straight to /dev/kfd via the kernel's KFD ioctl interface, with a thin shim that reimplements the HIP module-load / launch-kernel / memcpy API surface. The shim now beats stock ROCm on synchronous-launch latency (102 µs vs 850 µs baseline) by allocating completion signals in COHERENT|UNCACHED GTT.

Phaenora — Evolutionary Engine

A bare-metal implementation of CMA-ES and multi-objective evolutionary algorithms, evolving brain topologies directly on the GPU. Phaenora currently converges a 250 k-neuron working-memory network in roughly three and a half minutes of wall time — same compiler stack, no Python in the loop.

The EML Universal Operator

Exploring Andrzej Odrzywołek's continuous-time operator eml(x, y) = exp(x) − ln(y). A symbolic-regression engine sweeps EML expression trees to recover sequences and investigate continuous-time analogues of classical digital logic.
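As a quick illustration of the operator as defined above, the snippet below evaluates eml and checks two identities that follow directly from the definition: eml(x, 1) = exp(x) because ln 1 = 0, and eml(0, y) = 1 − ln(y) because exp(0) = 1. It is a plain numeric sketch and says nothing about how the symbolic-regression engine represents expression trees.

```python
import math

def eml(x: float, y: float) -> float:
    """eml(x, y) = exp(x) - ln(y), defined for y > 0."""
    return math.exp(x) - math.log(y)

# exp recovered by fixing y = 1.
print(eml(2.0, 1.0), math.exp(2.0))

# ln recovered from eml(0, y) = 1 - ln(y).
y = 10.0
print(1.0 - eml(0.0, y), math.log(y))
```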

Zero-Dependency Multithreading

A 24-thread pool built on raw Linux clone(CLONE_VM|CLONE_FS|…) and futex — no pthreads, no libc threading. Parallel CSR build uses splitmix64's seekable RNG so the same seed produces bit-identical output across any thread count.
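The seekable property comes from splitmix64's structure: its state advances by a fixed constant per output, so the i-th output can be computed directly from the seed. The sketch below uses the standard published splitmix64 constants; the chunking scheme and function names are illustrative assumptions, and the loop over workers stands in for real threads.

```python
MASK = (1 << 64) - 1
GAMMA = 0x9E3779B97F4A7C15   # state increment per output

def mix(z: int) -> int:
    """splitmix64 output function applied to a 64-bit state."""
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK
    return z ^ (z >> 31)

def splitmix64_at(seed: int, index: int) -> int:
    """Output number `index` (0-based) without generating earlier outputs."""
    return mix((seed + (index + 1) * GAMMA) & MASK)

def sequential(seed: int, n: int) -> list:
    """Reference: generate n outputs by stepping the state one at a time."""
    out, state = [], seed
    for _ in range(n):
        state = (state + GAMMA) & MASK
        out.append(mix(state))
    return out

def chunked(seed: int, n: int, workers: int) -> list:
    """Each worker fills its own contiguous slice by seeking to its indices."""
    out = [0] * n
    per = (n + workers - 1) // workers
    for w in range(workers):
        for i in range(w * per, min(n, (w + 1) * per)):
            out[i] = splitmix64_at(seed, i)
    return out

assert sequential(42, 100) == chunked(42, 100, 24)
```

Because every index maps to a fixed value regardless of which worker computes it, the result is independent of the thread count.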

AVX2 SIMD Codegen

KernRift's IR gained an AVX2 vector backend for the MLRift CPU path: F32 fused multiply-add, horizontal reductions, masked stores, BF16-decode-on-load via vpmovzxwd + vpslld 16. CPUID-dispatched, so the same binary runs on older machines without AVX2.
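The decode-on-load step works because a bf16 value is the upper 16 bits of the corresponding f32. A scalar sketch of what the `vpmovzxwd` + `vpslld 16` pair does per lane is below; the truncating encoder is an assumption chosen to match the "mantissa-truncated" wording used earlier, and the function names are illustrative.

```python
import struct

def f32_to_bf16_bits(x: float) -> int:
    """Keep the top 16 bits of the f32 encoding (truncation, no rounding)."""
    return struct.unpack("<I", struct.pack("<f", x))[0] >> 16

def bf16_bits_to_f32(bits: int) -> float:
    """Zero-extend 16 -> 32 bits, shift left by 16, reinterpret as f32."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]

for value in (1.0, -2.5, 3.14159):
    bits = f32_to_bf16_bits(value)
    print(f"{value} -> 0x{bits:04X} -> {bf16_bits_to_f32(bits)}")
```

Values whose f32 mantissa fits in 7 bits (such as 1.0 and -2.5) round-trip exactly; others lose their low mantissa bits.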

C-ABI Compliant IR

Strict SysV (x86_64) and AAPCS64 (ARM64) compliance in the IR layer — structs map to GPU memory with no marshalling, and host-GPU shared types produce identical offsets on either side of the hipMemcpy.
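What "identical offsets on either side" means can be checked with `ctypes`, which lays out structures using the platform's native C ABI. The struct below is purely hypothetical and exists only to show alignment padding; the expected offsets in the comments hold on 64-bit SysV x86_64 and AAPCS64 targets.

```python
import ctypes

class LayerHeader(ctypes.Structure):
    """Hypothetical host/GPU shared struct, for illustration only."""
    _fields_ = [
        ("flags", ctypes.c_uint8),    # offset 0
        ("count", ctypes.c_uint64),   # offset 8: padded up to 8-byte alignment
        ("scale", ctypes.c_float),    # offset 16
    ]                                 # size 24: tail padded to 8-byte alignment

for name, _ in LayerHeader._fields_:
    print(name, getattr(LayerHeader, name).offset)
print("sizeof", ctypes.sizeof(LayerHeader))
```

If both the host compiler and the GPU code generator follow the same rules, a struct written by one side can be read by the other without any repacking step.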

Quickstart

One repo, one compiler, one statically-linked binary per workload. The MLRift compiler is a self-hosted KernRift program; building it once gives you build/mlrc, which then compiles .mlr files directly to ROCm code objects or AMDGPU ISA.

Build & Run
# Clone and build — needs only KernRift's krc bootstrap compiler.
$ git clone https://github.com/Rift-Intelligence/MLRift && cd MLRift
$ make                       # produces build/mlrc

# Compile the Qwen3-0.6B inference example to native AMDGPU code.
$ ./build/mlrc --target=amdgpu-native examples/qwen3_generate.mlr -o /tmp/qw3
71732 tokens, 38134 nodes, 250720 bytes -> /tmp/qw3

# Decode 80 tokens at spec_K=4 (mks4 mega-kernel + PLD draft proposer).
# Slice 8.5 (2026-05-17): MLRIFT_NATIVE_MEGAKERNEL=2 is the single opt-in;
# the older MLRIFT_QWEN3_MEGAKERNEL=1 + MLRIFT_GPU_FULL_FORWARD=1 envs are
# now implied and no longer required.
$ MLRIFT_NATIVE_MEGAKERNEL=2 \
  MLRIFT_QWEN3_MEGAKERNEL_SPECK4=1 MLRIFT_SPEC_K=4 \
  MLRIFT_LONG_PROMPT=1 /tmp/qw3
[qwen3_generate] full-GPU forward ENABLED (M_eff=4)
[qwen3_generate] padded-row weight cache warmed for 28 layers
  step 0  took 62 ms launches=145
  step 1  took 21 ms launches=32
  ...
  step 19 took 22 ms launches=32
generated_tokens=80
tokens_per_sec_x1000=169308

silu_f32.mlr — A Real @kernel
// y = silu(x) = x * sigmoid(x) = x / (1 + exp(-x)).
// Recognised by `amdgpu_lower_silu_f32_3a` in src/format_amdgpu.mlr —
// the AST-walking pass emits this directly to RDNA 3 / RDNA 2 ISA.

fn block_idx_x() -> uint64 { return 0 }
fn tid_local_x() -> uint64 { return 0 }
fn silu(f32 x) -> f32 { return x }

@kernel
fn silu_f32(uint64 in_ptr, uint64 out_ptr, uint64 n) {
    uint64 gid = block_idx_x() * 256 + tid_local_x()
    if gid < n {
        uint64 ip = in_ptr + gid * 4
        uint64 op = out_ptr + gid * 4
        f32 x = 0.0
        unsafe { *(ip as f32) -> x }
        f32 r = silu(x)
        unsafe { *(op as f32) = r }
    }
}

qwen3_layer.mlr — Mega-Kernel Dispatch (excerpt)
// The 28-layer Qwen3 forward collapses into ONE dispatch per layer.
// 29 launches per token instead of 310 (per-op chain).

fn qwen3_forward_layer_megakernel_speck4_gpu(
    uint64 d_in_arr, uint64 d_out_arr,
    Qwen3LayerWeights w, uint64 pos,
    uint64 kc_layer, uint64 vc_layer,
    uint64 layer
) -> uint64 {
    // Persistent work-groups + cross-WG barriers — the entire
    // attention + FFN block runs as a single HIP launch.
    if qwen3_megakernel_speck4_ready() == 0 { return 1 }

    // Padded-row bf16 weight cache (slice 4.13 channel repack).
    uint64 w_qkv  = gpu_get_or_upload_bf16_weight_padded(
        w.q_proj_w, hidden, q_dim + 2 * kv_dim, 1152)

    return gpu_qwen3_layer_megakernel_speck4_to_dev(
        d_in_arr, d_out_arr, w_qkv, ..., pos, kc_layer, vc_layer)
}

Discussions & Feedback