MLRift is an experimental zero-dependency machine-learning compiler built on the KernRift toolchain. It compiles models ahead-of-time (AOT) directly to machine code — bypassing the Python / PyTorch stack at runtime. The current focus is LLM inference performance ceilings on consumer hardware: what does Qwen3 actually run at when every layer of abstraction between MLRift source and the GPU's instruction stream has been removed? An AST-walking pass turns @kernel functions into raw RDNA 3 / RDNA 2 ISA code-objects with no hipcc, no LLVM, and no clang in the build path. A KFD-direct shim talks to the AMD kernel driver with no userspace ROCm dependency. The qwen3 inference example that loads HF safetensors and decodes 229 tokens/sec on an RX 7800 XT is one statically-linked ELF — Python is never involved.
Qwen3-0.6B GPU Inference — RX 7800 XT
Greedy decode through the same compiler stack — no Python, no PyTorch runtime. Hardware: AMD Ryzen 9 7900X + AMD Radeon RX 7800 XT (gfx1100). All MLRift rows produce token-id output bit-identical to HuggingFace transformers.generate(do_sample=False). Peak VRAM read from /sys/class/drm/card1/device/mem_info_vram_used with 3-second cooldown between runs; idle baseline 174 MiB. Methodology in SLICE4_MEGAKERNEL_DESIGN.md.
| Stack | Weights → GEMM | tok/s | Peak VRAM | RSS | vs PyTorch bf16 |
|---|---|---|---|---|---|
| MLRift CPU | bf16 → f32 | 24.1 | 174 MiB | 1.67 GiB | 0.39× |
| MLRift GPU per-op | bf16 → f32 | 54.8 | 2 801 MiB | 1.82 GiB | 0.88× |
| PyTorch ROCm fp32 | f32 → f32 | 40.2 | 3 533 MiB | 4.92 GiB | 0.64× |
| PyTorch ROCm bf16 | bf16 → bf16 (WMMA) | 62.5 | 2 091 MiB | 4.52 GiB | 1.00× (baseline) |
| MLRift M=1 mega | bf16 → f32 | 87.5 | 2 046 MiB | 1.67 GiB | 1.40× |
| MLRift mks4 (spec_K=4) | bf16 → f32 | 169.3 | 2 046 MiB | 1.70 GiB | 2.71× |
| MLRift mks8 (spec_K=8) | bf16 → f32 | 201.6 | 2 046 MiB | 1.70 GiB | 3.23× |
| MLRift mks16 (spec_K=16) | bf16 → f32 | 229.8 | 2 046 MiB | 1.69 GiB | 3.68× |
mks16 = 3.68× PyTorch ROCm bf16, 5.71× ROCm fp32, 9.5× MLRift CPU. All MLRift mega paths use bf16 weights / fp32 GEMM — mantissa-truncated weights stream from VRAM and the multiply happens in f32 lanes. Peak VRAM is slightly under PyTorch's native bf16 path and 42 % less than fp32. RSS stays at ~1.7 GiB across all MLRift configurations because the safetensors file is mmap'd; PyTorch carries an additional ~3 GiB of Python + transformers + tensor metadata. The mks-K kernels run as a single dispatch per layer (29 launches/token) with persistent work-groups and cross-WG barriers; the per-dispatch cost is amortised over spec_K draft tokens supplied by prompt-lookup-decode (PLD) speculative decoding, which reaches a 99 % accept rate on the bench prompt. Bumped from slice-4.20's 216.4 tok/s by post-4.20 lm_head bf16-direct fixes (2026-05-12).
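For readers unfamiliar with prompt-lookup decoding, here is a minimal Python sketch of the scheme (illustrative only — the repo's implementation is in MLRift, and these function names are hypothetical). The proposer matches the trailing n-gram of the context against earlier positions and drafts the tokens that followed; the verifier then accepts the longest prefix the target model agrees with, so one mks-K dispatch can retire up to spec_K tokens.

```python
def pld_draft(context, ngram=3, spec_k=4):
    """Propose up to spec_k draft tokens by matching the trailing
    n-gram of `context` against earlier positions in `context`."""
    if len(context) < ngram:
        return []
    tail = context[-ngram:]
    # Scan candidates from most recent to oldest (excluding the tail itself).
    for i in range(len(context) - ngram - 1, -1, -1):
        if context[i:i + ngram] == tail:
            start = i + ngram
            return context[start:start + spec_k]
    return []

def accepted_prefix_len(draft, target_tokens):
    """Length of the prefix where the target model agrees with the draft."""
    n = 0
    for d, t in zip(draft, target_tokens):
        if d != t:
            break
        n += 1
    return n
```

Because the draft comes from the prompt itself, a verbatim-repetitive benchmark prompt yields very high accept rates, which is what makes spec_K=16 pay off.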
Qwen3 CPU Inference — Ryzen 9 7900X
The same compiler, AVX2 backend, no GPU. The MLRift programs (examples/qwen3_generate.mlr, examples/qwen3_14b_q8_generate.mlr) load HF safetensors or GGUF weights via mmap, run AVX2 matmul / SiLU / RMSNorm / attention kernels through a raw clone() + futex thread pool, and emit greedy decode. Hardware: AMD Ryzen 9 7900X (16 threads, AVX2). First 10 generated token ids bit-identical to HuggingFace transformers.generate(do_sample=False) on both sizes. Methodology in BENCH_QWEN3.md and BENCH_QWEN3_14B.md.
| Model | Runtime / weights | Decode wall | tok/s | Peak RSS | vs PyTorch bf16 |
|---|---|---|---|---|---|
| Qwen3-0.6B | PyTorch F32 (f32 weights) | 2 043 ms / 20 tok | 9.79 | 7.23 GB | 0.38× |
| Qwen3-0.6B | PyTorch BF16 (bf16 weights, f32 GEMM) | 774 ms / 20 tok | 25.83 | 4.44 GB | 1.00× (baseline) |
| MLRift Qwen3-0.6B | bf16 safetensors, AVX2 + MT | 624 ms / 20 tok | 32.03 | 1.67 GB | 1.24× |
| MLRift Qwen3-0.6B | bf16 GGUF, AVX2 + MT | 619 ms / 20 tok | 32.27 | 1.67 GB | 1.25× |
| Qwen3-14B | PyTorch BF16 (bf16 weights, f32 GEMM) | 151 366 ms / 20 tok | 0.132 | 20.32 GB | 1.00× (baseline) |
| MLRift Qwen3-14B | Q8_0 GGUF, AVX2 + MT | 41 712 ms / 20 tok | 0.479 | 14.81 GB | 3.63× |
14B compares MLRift Q8_0 vs PyTorch BF16 — different storage dtypes, real-world end-to-end comparison rather than arithmetic-parity. PyTorch F32 at 14B requires ~56 GB RAM and does not fit on the 30 GB test box, so it is omitted. The 0.6B rows use the same dtype on both sides for a fair arithmetic comparison. Prompt: "The capital of France is" → "Paris. What is the capital of the United States…" (greedy, 20 tokens).
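Q8_0 stores weights in blocks of 32 int8 values with a single f16 scale per block, which is where the 14B row's ~14.8 GB footprint comes from. A small numpy sketch of the standard GGUF Q8_0 round-trip (illustrative Python, not the repo's MLRift kernels):

```python
import numpy as np

QK8_0 = 32  # values per Q8_0 block

def q8_0_quantize(x):
    """Quantize a (n*32,)-element f32 array to (f16 scale, int8 quants) per block."""
    blocks = x.reshape(-1, QK8_0)
    amax = np.abs(blocks).max(axis=1)
    d = (amax / 127.0).astype(np.float16)      # one f16 scale per block
    scale = d.astype(np.float32)
    scale[scale == 0] = 1.0                    # all-zero block: avoid div-by-zero
    q = np.rint(blocks / scale[:, None]).astype(np.int8)
    return d, q

def q8_0_dequantize(d, q):
    """Reconstruct f32 values: x ≈ scale * int8 quant."""
    return (d.astype(np.float32)[:, None] * q.astype(np.float32)).ravel()
```

At ~9 bits/weight this is roughly half the footprint of bf16 storage, with per-block error bounded by half a quantization step.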
60 M-Neuron Spiking-Net Benchmark
LIF spiking network, 60 million neurons, 240 million synapses (sparse CSR, K = 4), 2 000 timesteps, identical algorithm across every runtime. Hardware: AMD Ryzen 9 7900X (12 c / 24 t) + AMD Radeon RX 7800 XT. Spike counts are bit-identical across MLRift CPU, MLRift GPU, and cupy; PyTorch differs by ≈ 0.0003 % due to a HIP uint64 remainder fallback. Methodology in docs/bench_60m.md.
| Runtime | Device | Threads | Total wall | Per step | vs PyTorch CPU |
|---|---|---|---|---|---|
| numpy | CPU | 1 | 58 min 37 s | 1.75 s | 0.60× |
| PyTorch | CPU | 24 | 35 min 01 s | 1.05 s | 1.00× (baseline) |
| cupy | GPU (ROCm 7.2) | — | 2 min 15 s | 64 ms | 15.5× |
| PyTorch | GPU (ROCm 6.4) | — | 1 min 43 s | 51 ms | 20.4× |
| MLRift | CPU (AVX2 + MT) | 24 | 6 min 26 s | 191 ms | 5.4× |
| MLRift | GPU (HIP / hipcc) | — | 28.40 s | 13 ms | 74.0× |
| MLRift | GPU (native gfx1100 ISA) | — | 28.87 s | 13 ms | 72.8× |
MLRift GPU runs the entire sim — CSR build, decay, LIF step, spike delivery — as four @kernel functions compiled to a single ROCm code object. The native-ISA row uses the AST-walking AMDGPU emitter (src/format_amdgpu.mlr) that emits all four kernels directly into a code object — no hipcc, no LLVM. Spike count is bit-identical to the HIP path over 120 billion neuron-step computations, including IEEE-correct f64 division and CAS-retry atomic_add_f64. Wall time is essentially identical (+1.7 %); the win is the build pipeline — 0.92 ms .co build vs hipcc's 482 ms (524× faster) and zero toolchain dependency.
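The per-timestep algorithm the table measures — decay, integrate, threshold, reset, CSR spike delivery — can be sketched in a few lines of numpy (illustrative only; the GPU path performs the scatter with atomic adds rather than a serial loop):

```python
import numpy as np

def lif_step(v, i_syn, decay, v_thresh, v_reset, indptr, indices, weights):
    """One LIF timestep: decay membrane potential, integrate input current,
    fire neurons at threshold, reset them, and deliver spikes along CSR
    synapses into the next step's input current."""
    v = v * decay + i_syn
    fired = np.flatnonzero(v >= v_thresh)
    v[fired] = v_reset
    i_next = np.zeros_like(i_syn)
    for n in fired:                      # GPU path: one atomic add per synapse
        s, e = indptr[n], indptr[n + 1]
        np.add.at(i_next, indices[s:e], weights[s:e])
    return v, i_next, fired.size
```

At K = 4 synapses per neuron this is exactly the 60 M-neuron / 240 M-synapse shape benchmarked above, just at toy scale.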
Phase 3 — AST → AMDGPU ISA Compiler
An AST-walking pass in src/format_amdgpu.mlr recognises ~30 LLM-pipeline kernel shapes — gemv, gemm, gemv_coop_bf16_f32, attention (single + speck4 + 14B variants), RoPE, RMSNorm, softmax, layernorm, silu_mul, qkv_split, K/V scatter — and emits raw RDNA 3 / RDNA 2 ISA bytes directly into a code object. ~2200 lines of hand-emit code deleted. No hipcc, no LLVM, no clang in the build path.
Multi-Arch Backends
Same MLRift source, multiple ISAs. gfx1100 (RX 7000 / RDNA 3) is the primary target; gfx1030 (RX 6000 / RDNA 2) ships 31 kernels via asm_u32_arch pair-pickers, zero .long opaque bytes. WMMA tensor cores for bf16/f32 GEMM are wired on RDNA 3. RDNA 4 (gfx1200) and NVIDIA Blackwell / Ada / Ampere are on the roadmap — only the per-arch encoding tables change.
Speculative Mega-Kernel Inference
The 28-layer Qwen3 forward pass collapses into one GPU dispatch per layer with persistent work-groups and cross-WG barriers — 29 launches/token instead of 310. Multi-query variants (mks4, mks8, mks16) amortise weight-row reads across 4/8/16 query tokens per dispatch using prompt-lookup-decode (PLD) speculative decoding. 229 tok/s on a 16 GB consumer card.
KFD-Direct Driver Path
An --target=amdgpu-native mode talks straight to /dev/kfd via the kernel's KFD ioctl interface, with a thin shim that reimplements the HIP module-load / launch-kernel / memcpy API surface. The shim now beats stock ROCm on synchronous-launch latency (102 µs vs 850 µs baseline) by allocating completion signals in COHERENT|UNCACHED GTT.
Noesis — Evolutionary Engine
A bare-metal implementation of CMA-ES and multi-objective evolutionary algorithms, evolving brain topologies directly on the GPU. Noesis currently converges a 250 k-neuron working-memory network in roughly three and a half minutes of wall time — same compiler stack, no Python in the loop.
The EML Universal Operator
Exploring Andrzej Odrzywołek's continuous-time operator eml(x, y) = exp(x) − ln(y). A symbolic-regression engine sweeps EML expression trees to recover sequences and investigate continuous-time analogues of classical digital logic.
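The operator is simple to state, and both of its constituent functions are recoverable from it alone, which is what makes it a candidate universal primitive. A minimal Python sketch:

```python
import math

def eml(x, y):
    """Odrzywolek's EML operator: eml(x, y) = exp(x) - ln(y)."""
    return math.exp(x) - math.log(y)

# exp and ln both fall out of eml by fixing one argument:
#   exp(x) = eml(x, 1)        because ln(1) = 0
#   ln(y)  = 1 - eml(0, y)    because exp(0) = 1
```

With exp and ln expressible, multiplicative structure follows (e.g. x·y = exp(ln x + ln y)), which is the kind of identity the symbolic-regression sweep searches over.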
Zero-Dependency Multithreading
A 24-thread pool built on raw Linux clone(CLONE_VM|CLONE_FS|…) and futex — no pthreads, no libc threading. Parallel CSR build uses splitmix64's seekable RNG so the same seed produces bit-identical output across any thread count.
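The seekability that makes the parallel CSR build deterministic comes from splitmix64's structure: its state after n steps is just seed + n·0x9E3779B97F4A7C15 mod 2⁶⁴, so any output index can be computed directly, with no dependence on which thread generated the preceding values. A Python sketch (the repo's version lives in MLRift):

```python
MASK = (1 << 64) - 1
GOLDEN = 0x9E3779B97F4A7C15  # splitmix64 state increment

def splitmix64_at(seed, i):
    """The i-th splitmix64 output without generating the preceding ones:
    jump straight to the state after i+1 increments, then apply the
    standard finalizer mix."""
    z = (seed + (i + 1) * GOLDEN) & MASK
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK
    return z ^ (z >> 31)
```

Each worker thread seeds at its own row range's start index, so any thread count produces the bit-identical stream.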
AVX2 SIMD Codegen
KernRift's IR gained an AVX2 vector backend for the MLRift CPU path: F32 fused multiply-add, horizontal reductions, masked stores, BF16-decode-on-load via vpmovzxwd + vpslld 16. CPUID-dispatched, so the same binary runs on older machines without AVX2.
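The bf16 decode trick in scalar terms: a bf16 value is the top 16 bits of the corresponding f32, so zero-extending each 16-bit word to 32 bits and shifting left by 16 reproduces the f32 bit pattern exactly — which is what vpmovzxwd + vpslld do eight lanes at a time. A numpy sketch of the same semantics:

```python
import numpy as np

def bf16_to_f32(bits_u16):
    """Decode bf16 -> f32 the way the AVX2 path does it: zero-extend
    each 16-bit word to 32 bits (vpmovzxwd), shift left 16 (vpslld),
    then reinterpret the result as f32."""
    widened = bits_u16.astype(np.uint32) << 16
    return widened.view(np.float32)
```

The decode is exact (every bf16 value is representable in f32), so the f32-lane GEMM sees precisely the mantissa-truncated weights that were stored.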
C-ABI Compliant IR
Strict SysV (x86_64) and AAPCS64 (ARM64) compliance in the IR layer — structs map to GPU memory with no marshalling, and host-GPU shared types produce identical offsets on either side of the hipMemcpy.
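The offset-matching guarantee rests on the SysV alignment rule: each scalar field is aligned to its natural alignment, and the struct's total size is padded to the largest member alignment. A small Python layout calculator illustrating the rule (a sketch, assuming alignment == size, which holds for the scalar types shown):

```python
def sysv_layout(fields):
    """Compute SysV x86_64 struct offsets for a list of (name, size)
    scalar fields: align each field to its own size, then round the
    total size up to the maximum alignment seen."""
    offsets, off, max_align = {}, 0, 1
    for name, size in fields:
        align = size                              # natural alignment for scalars
        off = (off + align - 1) // align * align  # pad up to field alignment
        offsets[name] = off
        off += size
        max_align = max(max_align, align)
    total = (off + max_align - 1) // max_align * max_align
    return offsets, total
```

For example, { f32; f64; i32 } lays out at offsets 0 / 8 / 16 with size 24 — and because the IR emits the same layout on both sides, a hipMemcpy of the struct needs no marshalling.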
Quickstart
One repo, one compiler, one statically-linked binary per workload. The MLRift compiler is a self-hosted KernRift program; building it once gives you build/mlrc, which then compiles .mlr files directly to ROCm code objects or AMDGPU ISA.
# Clone and build — needs only KernRift's krc bootstrap compiler.
$ git clone https://github.com/Rift-Intelligence/MLRift && cd MLRift
$ make # produces build/mlrc
# Compile the Qwen3-0.6B inference example to native AMDGPU code.
$ ./build/mlrc --target=amdgpu-native examples/qwen3_generate.mlr -o /tmp/qw3
71732 tokens, 38134 nodes, 250720 bytes -> /tmp/qw3
# Decode 80 tokens at spec_K=4 (mks4 mega-kernel + PLD draft proposer).
$ MLRIFT_GPU_MATMUL=1 MLRIFT_QWEN3_MEGAKERNEL=1 \
MLRIFT_QWEN3_MEGAKERNEL_SPECK4=1 MLRIFT_SPEC_K=4 \
MLRIFT_LONG_PROMPT=1 /tmp/qw3
[qwen3_generate] full-GPU forward ENABLED (M_eff=4)
[qwen3_generate] padded-row weight cache warmed for 28 layers
step 0 took 62 ms launches=145
step 1 took 21 ms launches=32
...
step 19 took 22 ms launches=32
generated_tokens=80
tokens_per_sec_x1000=169308
// y = silu(x) = x * sigmoid(x) = x / (1 + exp(-x)).
// Recognised by `amdgpu_lower_silu_f32_3a` in src/format_amdgpu.mlr —
// the AST-walking pass emits this directly to RDNA 3 / RDNA 2 ISA.
// Host-side placeholder stubs: on the GPU path the lowering pass
// substitutes the hardware workgroup-id / thread-id registers and the
// fused silu instruction sequence for these definitions.
fn block_idx_x() -> uint64 { return 0 }
fn tid_local_x() -> uint64 { return 0 }
fn silu(f32 x) -> f32 { return x }
@kernel
fn silu_f32(uint64 in_ptr, uint64 out_ptr, uint64 n) {
uint64 gid = block_idx_x() * 256 + tid_local_x()
if gid < n {
uint64 ip = in_ptr + gid * 4
uint64 op = out_ptr + gid * 4
f32 x = 0.0
unsafe { *(ip as f32) -> x }
f32 r = silu(x)
unsafe { *(op as f32) = r }
}
}
// The 28-layer Qwen3 forward collapses into ONE dispatch per layer.
// 29 launches per token instead of 310 (per-op chain).
fn qwen3_forward_layer_megakernel_speck4_gpu(
uint64 d_in_arr, uint64 d_out_arr,
Qwen3LayerWeights w, uint64 pos,
uint64 kc_layer, uint64 vc_layer,
uint64 layer
) -> uint64 {
// Persistent work-groups + cross-WG barriers — the entire
// attention + FFN block runs as a single HIP launch.
if qwen3_megakernel_speck4_ready() == 0 { return 1 }
// Padded-row bf16 weight cache (slice 4.13 channel repack).
uint64 w_qkv = gpu_get_or_upload_bf16_weight_padded(
w.q_proj_w, hidden, q_dim + 2 * kv_dim, 1152)
return gpu_qwen3_layer_megakernel_speck4_to_dev(
d_in_arr, d_out_arr, w_qkv, ..., pos, kc_layer, vc_layer)
}
Discussions & Feedback