MLRift | Bare-Metal LLM & ML Compiler for AMD GPUs

MLRift is an experimental zero-dependency machine-learning compiler built on the KernRift toolchain. It compiles models ahead-of-time (AOT) directly to machine code, bypassing the Python / PyTorch stack at runtime. The current focus is LLM inference performance ceilings on consumer hardware: what does Qwen3 actually run at when every layer of abstraction between MLRift source and the GPU's instruction stream has been removed? An AST-walking pass turns @kernel functions into raw RDNA 3 / RDNA 2 ISA code objects with no hipcc, no LLVM, and no clang in the build path. A KFD-direct shim talks to the AMD kernel driver with no userspace ROCm dependency. The qwen3 inference example, which loads HF safetensors and decodes at 229 tokens/sec on an RX 7800 XT, is a single statically linked ELF; Python is never involved.

Qwen3-0.6B GPU Inference — RX 7800 XT

Greedy decode through the same compiler stack — no Python, no PyTorch runtime. Hardware: AMD Ryzen 9 7900X + AMD Radeon RX 7800 XT (gfx1100). All MLRift rows produce token-id output bit-identical to HuggingFace transformers.generate(do_sample=False). Peak VRAM read from /sys/class/drm/card1/device/mem_info_vram_used with 3-second cooldown between runs; idle baseline 174 MiB. Methodology in SLICE4_MEGAKERNEL_DESIGN.md.

Stack	Weights → GEMM	tok/s	Peak VRAM	RSS	vs PyTorch bf16
MLRift CPU	bf16 → f32	24.1	174 MiB	1.67 GiB	0.39×
MLRift GPU per-op	bf16 → f32	54.8	2 801 MiB	1.82 GiB	0.88×
PyTorch ROCm fp32	f32 → f32	40.2	3 533 MiB	4.92 GiB	0.64×
PyTorch ROCm bf16	bf16 → bf16 (WMMA)	62.5	2 091 MiB	4.52 GiB	1.00× (baseline)
MLRift M=1 mega	bf16 → f32	87.5	2 046 MiB	1.67 GiB	1.40×
MLRift mks4 (spec_K=4)	bf16 → f32	169.3	2 046 MiB	1.70 GiB	2.71×
MLRift mks8 (spec_K=8)	bf16 → f32	201.6	2 046 MiB	1.70 GiB	3.23×
MLRift mks16 (spec_K=16)	bf16 → f32	229.8	2 046 MiB	1.69 GiB	3.68×

mks16 runs at 3.68× PyTorch ROCm bf16, 5.72× ROCm fp32, and 9.5× MLRift CPU. All MLRift mega paths use bf16 weights with fp32 GEMM: mantissa-truncated weights stream from VRAM and the multiply happens in f32 lanes. Peak VRAM is slightly under PyTorch's native bf16 path (2 046 vs 2 091 MiB) and 42 % less than fp32. RSS stays at roughly 1.7 to 1.8 GiB across all MLRift configurations because the safetensors file is mmap'd; PyTorch carries an additional ~3 GiB of Python + transformers + tensor metadata. The mks-K kernels run as a single dispatch per layer (29 launches/token) with persistent work-groups and cross-WG barriers; the K query tokens per dispatch are supplied by prompt-lookup-decode (PLD) speculative decoding, which reaches a 99 % accept rate on the bench prompt. The mks16 figure rose from slice 4.20's 216.4 after the post-4.20 lm_head bf16-direct fixes (2026-05-12).

Qwen3 CPU Inference — Ryzen 9 7900X

The same compiler, AVX2 backend, no GPU. The MLRift programs (examples/qwen3_generate.mlr, examples/qwen3_14b_q8_generate.mlr) load HF safetensors or GGUF weights via mmap, run AVX2 matmul / SiLU / RMSNorm / attention kernels through a raw clone() + futex thread pool, and emit greedy decode. Hardware: AMD Ryzen 9 7900X (16 threads, AVX2). First 10 generated token ids bit-identical to HuggingFace transformers.generate(do_sample=False) on both sizes. Methodology in BENCH_QWEN3.md and BENCH_QWEN3_14B.md.

Model	Runtime / weights	Decode wall	tok/s	Peak RSS	vs PyTorch bf16
Qwen3-0.6B	PyTorch F32 (f32 weights)	2 043 ms / 20 tok	9.79	7.23 GB	0.38×
Qwen3-0.6B	PyTorch BF16 (bf16 weights, f32 GEMM)	774 ms / 20 tok	25.83	4.44 GB	1.00× (baseline)
Qwen3-0.6B	MLRift (bf16 safetensors, AVX2 + MT)	624 ms / 20 tok	32.03	1.67 GB	1.24×
Qwen3-0.6B	MLRift (bf16 GGUF, AVX2 + MT)	619 ms / 20 tok	32.27	1.67 GB	1.25×
Qwen3-14B	PyTorch BF16 (bf16 weights, f32 GEMM)	151 366 ms / 20 tok	0.132	20.32 GB	1.00× (baseline)
Qwen3-14B	MLRift (Q8_0 GGUF, AVX2 + MT)	41 712 ms / 20 tok	0.479	14.81 GB	3.63×

14B compares MLRift Q8_0 vs PyTorch BF16 — different storage dtypes, real-world end-to-end comparison rather than arithmetic-parity. PyTorch F32 at 14B requires ~56 GB RAM and does not fit on the 30 GB test box, so it is omitted. The 0.6B rows use the same dtype on both sides for a fair arithmetic comparison. Prompt: "The capital of France is" → "Paris. What is the capital of the United States…" (greedy, 20 tokens).
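For readers unfamiliar with the Q8_0 storage dtype named above, here is a minimal Python sketch of one block. It assumes the standard GGML Q8_0 layout (32 int8 values sharing one fp16 scale, 34 bytes per block, so about 1.06 bytes per weight). The function names are illustrative and this is not MLRift's loader.

```python
import struct

QK8_0 = 32                   # values per block in the GGML Q8_0 layout
BLOCK_BYTES = 2 + QK8_0      # one fp16 scale + 32 int8 quants = 34 bytes

def quantize_q8_0(values):
    """Pack exactly 32 floats into one Q8_0 block (scale, then quants)."""
    assert len(values) == QK8_0
    amax = max(abs(v) for v in values)
    d = amax / 127.0                      # per-block scale
    quants = [round(v / d) if d else 0 for v in values]
    return struct.pack("<e32b", d, *quants)

def dequantize_q8_0(block):
    """Recover 32 floats: each weight is scale * int8 quant."""
    d, *quants = struct.unpack("<e32b", block)
    return [d * q for q in quants]

if __name__ == "__main__":
    vals = [(i - 16) / 8.0 for i in range(QK8_0)]
    restored = dequantize_q8_0(quantize_q8_0(vals))
    print(max(abs(a - b) for a, b in zip(vals, restored)))
```

The round trip is lossy by design, which is why the 14B row is described as an end-to-end comparison and not an arithmetic-parity one.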

60 M-Neuron Spiking-Net Benchmark

LIF spiking network, 60 million neurons, 240 million synapses (sparse CSR, K = 4), 2 000 timesteps, identical algorithm across every runtime. Hardware: AMD Ryzen 9 7900X (12 c / 24 t) + AMD Radeon RX 7800 XT. Spike counts are bit-identical across MLRift CPU, MLRift GPU, and cupy; PyTorch differs by ≈ 0.0003 % due to a HIP uint64 remainder fallback. Methodology in docs/bench_60m.md.

Runtime	Device	Threads	Total wall	Per step	vs PyTorch CPU
numpy	CPU	1	58 min 37 s	1.75 s	0.60×
PyTorch	CPU	24	35 min 01 s	1.05 s	1.00× (baseline)
cupy	GPU (ROCm 7.2)	—	2 min 15 s	64 ms	15.5×
PyTorch	GPU (ROCm 6.4)	—	1 min 43 s	51 ms	20.4×
MLRift	CPU (AVX2 + MT)	24	6 min 26 s	191 ms	5.4×
MLRift	GPU (HIP / hipcc)	—	28.40 s	13 ms	74.0×
MLRift	GPU (native gfx1100 ISA)	—	28.87 s	13 ms	72.8×

MLRift GPU runs the entire sim (CSR build, decay, LIF step, spike delivery) as four @kernel functions compiled to a single ROCm code object. The native-ISA row uses the AST-walking AMDGPU emitter (src/format_amdgpu.mlr), which emits all four kernels directly into a code object with no hipcc and no LLVM. Spike count is bit-identical to the HIP path over 120 billion neuron-step computations, including IEEE-correct f64 division and CAS-retry atomic_add_f64. Wall time is close to the HIP path (28.87 s vs 28.40 s, +1.7 %); the win is the build pipeline: a 0.92 ms .co build vs hipcc's 482 ms (524× faster) and zero toolchain dependency.
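The four stages named above (CSR build, decay, LIF step, spike delivery) can be sketched as a tiny CPU reference in Python. This is an illustration only: the ring connectivity, decay factor, threshold, and synapse weight are assumptions chosen to keep the example small, not the benchmark's actual parameters or kernels.

```python
def build_csr(n, k):
    """Assumed connectivity: neuron i projects to i+1 .. i+k (mod n)."""
    row_ptr = [i * k for i in range(n + 1)]
    col_idx = [(i + j) % n for i in range(n) for j in range(1, k + 1)]
    return row_ptr, col_idx

def decay(v, factor):
    """Leak: scale every membrane potential toward zero."""
    for i in range(len(v)):
        v[i] *= factor

def lif_step(v, current, threshold):
    """Integrate pending input, fire and reset where the threshold is reached."""
    spikes = []
    for i in range(len(v)):
        v[i] += current[i]
        current[i] = 0.0
        if v[i] >= threshold:
            spikes.append(i)
            v[i] = 0.0
    return spikes

def deliver(spikes, row_ptr, col_idx, weight, current):
    """Walk each spiking neuron's CSR row and add input to its targets."""
    for i in spikes:
        for s in range(row_ptr[i], row_ptr[i + 1]):
            current[col_idx[s]] += weight

def simulate(n=8, k=4, steps=5, factor=0.5, threshold=1.0, weight=1.2):
    row_ptr, col_idx = build_csr(n, k)
    v, current = [0.0] * n, [0.0] * n
    current[0] = 1.5                      # kick neuron 0 to start activity
    total_spikes = 0
    for _ in range(steps):
        decay(v, factor)
        spikes = lif_step(v, current, threshold)
        deliver(spikes, row_ptr, col_idx, weight, current)
        total_spikes += len(spikes)
    return total_spikes

if __name__ == "__main__":
    print(simulate())
```

Because the spike count is an integer produced by a fixed update order, it is the natural quantity to compare across runtimes, which is what the bit-identical claim above refers to.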

Phase 3 — AST → AMDGPU ISA Compiler

An AST-walking pass in src/format_amdgpu.mlr recognises ~30 LLM-pipeline kernel shapes, including gemv, gemm, gemv_coop_bf16_f32, attention (single + speck4 + 14B variants), RoPE, RMSNorm, softmax, layernorm, silu_mul, qkv_split, and K/V scatter, and emits raw RDNA 3 / RDNA 2 ISA bytes directly into a code object. The pass made ~2200 lines of hand-emit code redundant, and they have been deleted. No hipcc, no LLVM, no clang in the build path.

Multi-Arch Backends

Same MLRift source, multiple ISAs. gfx1100 (RX 7000 / RDNA 3) is the primary target; gfx1030 (RX 6000 / RDNA 2) ships 31 kernels via asm_u32_arch pair-pickers, zero .long opaque bytes. WMMA tensor cores for bf16/f32 GEMM are wired on RDNA 3. RDNA 4 (gfx1200) and NVIDIA Blackwell / Ada / Ampere are on the roadmap — only the per-arch encoding tables change.

Speculative Mega-Kernel Inference

The 28-layer Qwen3 forward pass collapses into one GPU dispatch per layer with persistent work-groups and cross-WG barriers — 29 launches/token instead of 310. Multi-query variants (mks4, mks8, mks16) amortise weight-row reads across 4/8/16 query tokens per dispatch using prompt-lookup-decode (PLD) speculative decoding. 229 tok/s on a 16 GB consumer card.
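Prompt-lookup decoding can be sketched in a few lines of Python. The draft step copies tokens that followed an earlier occurrence of the current trailing n-gram; the verify step keeps the longest draft prefix that the model's own greedy choice agrees with, then appends one token from the model. Everything here is hypothetical illustration: `toy_model`, the n-gram length, and the sequential verify loop stand in for the batched multi-query dispatch described above.

```python
def pld_draft(tokens, ngram=2, k=4):
    """Propose up to k tokens that followed the latest earlier match of the tail n-gram."""
    if len(tokens) <= ngram:
        return []
    tail = tokens[-ngram:]
    for start in range(len(tokens) - ngram - 1, -1, -1):
        if tokens[start:start + ngram] == tail:
            return tokens[start + ngram:start + ngram + k]
    return []

def verify(tokens, draft, next_token):
    """Accept draft tokens while they equal the greedy choice, then add one model token."""
    accepted, ctx = [], list(tokens)
    for d in draft:
        if next_token(ctx) != d:
            break
        accepted.append(d)
        ctx.append(d)
    accepted.append(next_token(ctx))
    return accepted

def generate(prompt, n_new, next_token, k=4):
    """Speculative greedy decode; output matches plain greedy decode token for token."""
    out = list(prompt)
    while len(out) - len(prompt) < n_new:
        out.extend(verify(out, pld_draft(out, k=k), next_token))
    return out[:len(prompt) + n_new]

def toy_model(ctx):
    """Stand-in for a real model: deterministic cyclic next-token rule."""
    return (ctx[-1] + 1) % 5

if __name__ == "__main__":
    print(generate([0, 1], 12, toy_model))
```

Since rejected drafts are simply discarded, the emitted tokens are the same as without speculation; the gain comes only from checking several positions per weight read.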

KFD-Direct Driver Path

An --target=amdgpu-native mode talks straight to /dev/kfd via the kernel's KFD ioctl interface, with a thin shim that reimplements the HIP module-load / launch-kernel / memcpy API surface. The shim now beats stock ROCm on synchronous-launch latency (102 µs vs 850 µs baseline) by allocating completion signals in COHERENT|UNCACHED GTT.

Noesis — Evolutionary Engine

A bare-metal implementation of CMA-ES and multi-objective evolutionary algorithms, evolving brain topologies directly on the GPU. Noesis currently converges a 250 k-neuron working-memory network in roughly three and a half minutes of wall time — same compiler stack, no Python in the loop.

The EML Universal Operator

Exploring Andrzej Odrzywołek's continuous-time operator eml(x, y) = exp(x) − ln(y). A symbolic-regression engine sweeps EML expression trees to recover sequences and investigate continuous-time analogues of classical digital logic.
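As a small worked example of how expressions compose under this operator, the Python sketch below recovers exp and ln from eml and the constant 1. The identities follow directly from the definition: eml(x, 1) = exp(x), and since eml(1, x) = e − ln x, we get eml(eml(1, x), 1) = e^e / x and then eml(1, e^e / x) = ln x. The helper names are illustrative and are not taken from the MLRift symbolic-regression engine.

```python
import math

def eml(x, y):
    """eml(x, y) = exp(x) - ln(y), defined for y > 0."""
    return math.exp(x) - math.log(y)

def exp_via_eml(x):
    """exp(x) as a depth-1 tree: ln(1) = 0, so eml(x, 1) = exp(x)."""
    return eml(x, 1.0)

def ln_via_eml(x):
    """ln(x) as a depth-3 tree: eml(1, eml(eml(1, x), 1)), for x > 0."""
    return eml(1.0, eml(eml(1.0, x), 1.0))

if __name__ == "__main__":
    for x in (0.5, 2.0, 10.0):
        print(x, exp_via_eml(x) - math.exp(x), ln_via_eml(x) - math.log(x))
```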

Zero-Dependency Multithreading

A 24-thread pool built on raw Linux clone(CLONE_VM|CLONE_FS|…) and futex — no pthreads, no libc threading. Parallel CSR build uses splitmix64's seekable RNG so the same seed produces bit-identical output across any thread count.
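The seekable property can be illustrated with a short Python sketch. splitmix64's state is a plain counter advanced by a fixed odd constant, so output number i can be computed directly from the seed without stepping through the first i values. The chunking scheme and function names below are assumptions for illustration, and the workers are simulated sequentially; the point is only that any split of the index range yields the same values.

```python
MASK = (1 << 64) - 1
GAMMA = 0x9E3779B97F4A7C15      # splitmix64 state increment

def mix(z):
    """splitmix64 output function applied to a 64-bit state."""
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK
    return z ^ (z >> 31)

def splitmix64_at(seed, index):
    """Output number `index` (0-based), computed by seeking, not stepping."""
    return mix((seed + (index + 1) * GAMMA) & MASK)

def sequential(seed, count):
    """Reference: the usual stateful generator, one step at a time."""
    state, out = seed, []
    for _ in range(count):
        state = (state + GAMMA) & MASK
        out.append(mix(state))
    return out

def chunked(seed, count, workers):
    """Each simulated worker fills its own contiguous slice by seeking."""
    out = [0] * count
    per = -(-count // workers)            # ceiling division
    for w in range(workers):
        for i in range(w * per, min(count, (w + 1) * per)):
            out[i] = splitmix64_at(seed, i)
    return out

if __name__ == "__main__":
    print(sequential(42, 100) == chunked(42, 100, 24))
```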

AVX2 SIMD Codegen

KernRift's IR gained an AVX2 vector backend for the MLRift CPU path: F32 fused multiply-add, horizontal reductions, masked stores, BF16-decode-on-load via vpmovzxwd + vpslld 16. CPUID-dispatched, so the same binary runs on older machines without AVX2.
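The BF16-decode-on-load step is a pure bit operation, sketched here in Python as a scalar reference: zero-extend the 16-bit value to 32 bits (what vpmovzxwd does per lane), shift left by 16 (vpslld 16), and reinterpret the result as an f32. The helper names are illustrative; note that Python accumulates in f64, so the dot product below shows the data flow and not the exact f32 rounding of the AVX2 path.

```python
import struct

def f32_to_bf16_bits(x):
    """Test helper: truncate an f32 to its upper 16 bits (no rounding)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def bf16_bits_to_f32(h):
    """Zero-extend to 32 bits, shift left by 16, reinterpret as f32."""
    (x,) = struct.unpack("<f", struct.pack("<I", (h & 0xFFFF) << 16))
    return x

def dot_bf16_f32(w_bits, x):
    """bf16 weights decoded on load, multiply-accumulate in floating point."""
    acc = 0.0
    for h, xi in zip(w_bits, x):
        acc += bf16_bits_to_f32(h) * xi
    return acc

if __name__ == "__main__":
    weights = [f32_to_bf16_bits(w) for w in (1.0, -2.0, 0.5)]
    print([hex(w) for w in weights], dot_bf16_f32(weights, [2.0, 1.0, 4.0]))
```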

C-ABI Compliant IR

Strict SysV (x86_64) and AAPCS64 (ARM64) compliance in the IR layer — structs map to GPU memory with no marshalling, and host-GPU shared types produce identical offsets on either side of the hipMemcpy.
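What "identical offsets on either side" means can be checked from Python with ctypes, which lays out structures according to the platform C ABI. The struct below is a hypothetical shared type, not one from the MLRift sources; on 64-bit SysV and AAPCS64 targets the 8-byte field forces 4 bytes of padding after `count` and 4 bytes of tail padding.

```python
import ctypes

class NeuronParams(ctypes.Structure):
    """Hypothetical host/GPU shared type used only for this illustration."""
    _fields_ = [
        ("count", ctypes.c_uint32),        # offset 0, then 4 bytes of padding
        ("weights_ptr", ctypes.c_uint64),  # offset 8 (8-byte aligned)
        ("threshold", ctypes.c_float),     # offset 16, then 4 bytes tail padding
    ]

def layout(struct_type):
    """Map each field name to its byte offset inside the struct."""
    return {name: getattr(struct_type, name).offset
            for name, _ in struct_type._fields_}

if __name__ == "__main__":
    print(layout(NeuronParams), ctypes.sizeof(NeuronParams))
```

If both sides of a copy agree on this table, the bytes can be transferred as-is with no field-by-field marshalling.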

Quickstart

One repo, one compiler, one statically-linked binary per workload. The MLRift compiler is a self-hosted KernRift program; building it once gives you build/mlrc, which then compiles .mlr files directly to ROCm code objects or AMDGPU ISA.

Build & Run

# Clone and build — needs only KernRift's krc bootstrap compiler.
$ git clone https://github.com/Rift-Intelligence/MLRift && cd MLRift
$ make                       # produces build/mlrc

# Compile the Qwen3-0.6B inference example to native AMDGPU code.
$ ./build/mlrc --target=amdgpu-native examples/qwen3_generate.mlr -o /tmp/qw3
71732 tokens, 38134 nodes, 250720 bytes -> /tmp/qw3

# Decode 80 tokens at spec_K=4 (mks4 mega-kernel + PLD draft proposer).
$ MLRIFT_GPU_MATMUL=1 MLRIFT_QWEN3_MEGAKERNEL=1 \
  MLRIFT_QWEN3_MEGAKERNEL_SPECK4=1 MLRIFT_SPEC_K=4 \
  MLRIFT_LONG_PROMPT=1 /tmp/qw3
[qwen3_generate] full-GPU forward ENABLED (M_eff=4)
[qwen3_generate] padded-row weight cache warmed for 28 layers
  step 0  took 62 ms launches=145
  step 1  took 21 ms launches=32
  ...
  step 19 took 22 ms launches=32
generated_tokens=80
tokens_per_sec_x1000=169308

silu_f32.mlr — A Real @kernel

// y = silu(x) = x * sigmoid(x) = x / (1 + exp(-x)).
// Recognised by `amdgpu_lower_silu_f32_3a` in src/format_amdgpu.mlr —
// the AST-walking pass emits this directly to RDNA 3 / RDNA 2 ISA.

fn block_idx_x() -> uint64 { return 0 }
fn tid_local_x() -> uint64 { return 0 }
fn silu(f32 x) -> f32 { return x }

@kernel
fn silu_f32(uint64 in_ptr, uint64 out_ptr, uint64 n) {
    uint64 gid = block_idx_x() * 256 + tid_local_x()
    if gid < n {
        uint64 ip = in_ptr + gid * 4
        uint64 op = out_ptr + gid * 4
        f32 x = 0.0
        unsafe { *(ip as f32) -> x }
        f32 r = silu(x)
        unsafe { *(op as f32) = r }
    }
}
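
A host-side reference is useful for checking a kernel like this one. The Python sketch below mirrors the kernel's indexing (gid = block index × 256 + local thread id, guarded by gid < n) and computes silu with a branch that avoids overflow for large negative inputs. It is a reference model written for this page, not code from the MLRift repository, and it works in f64 where the kernel works in f32.

```python
import math

BLOCK = 256   # matches the `* 256` in the kernel's gid computation

def silu(x):
    """x * sigmoid(x), written so exp() never sees a large positive argument."""
    if x >= 0.0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)

def silu_f32_reference(inp, n):
    """Emulate the launch grid: one 'thread' per element, bounds-checked."""
    out = [0.0] * len(inp)
    blocks = (n + BLOCK - 1) // BLOCK
    for block_idx in range(blocks):
        for tid in range(BLOCK):
            gid = block_idx * BLOCK + tid
            if gid < n:                   # same guard as the kernel
                out[gid] = silu(inp[gid])
    return out

if __name__ == "__main__":
    print(silu_f32_reference([-1.0, 0.0, 1.0], 3))
```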

qwen3_layer.mlr — Mega-Kernel Dispatch (excerpt)

// The 28-layer Qwen3 forward collapses into ONE dispatch per layer.
// 29 launches per token instead of 310 (per-op chain).

fn qwen3_forward_layer_megakernel_speck4_gpu(
    uint64 d_in_arr, uint64 d_out_arr,
    Qwen3LayerWeights w, uint64 pos,
    uint64 kc_layer, uint64 vc_layer,
    uint64 layer
) -> uint64 {
    // Persistent work-groups + cross-WG barriers — the entire
    // attention + FFN block runs as a single HIP launch.
    if qwen3_megakernel_speck4_ready() == 0 { return 1 }

    // Padded-row bf16 weight cache (slice 4.13 channel repack).
    uint64 w_qkv  = gpu_get_or_upload_bf16_weight_padded(
        w.q_proj_w, hidden, q_dim + 2 * kv_dim, 1152)

    return gpu_qwen3_layer_megakernel_speck4_to_dev(
        d_in_arr, d_out_arr, w_qkv, ..., pos, kc_layer, vc_layer)
}