The Compiler
MLRift is a bare-metal machine-learning compiler designed for end-to-end LLM inference without a Python or PyTorch runtime. It compiles ahead of time to native machine code — no interpreter, no framework, no hidden dispatch. Programs are static ELFs that mmap their model weights from disk and decode tokens directly through compiled GPU and CPU kernels. The language aims to make low-level ML work — bf16 weight streams, hand-tuned mega-kernels, persistent work-group barriers, KFD-direct device dispatch — feel first-class rather than buried under five layers of framework abstraction.
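The weight-loading scheme is plain OS machinery, so it can be illustrated outside MLRift. Below is a minimal C sketch of the same pattern: map the checkpoint read-only and use the bytes in place. The file name and the raw-bf16 layout are placeholders, not MLRift's actual container format.

```c
/* Minimal sketch of the mmap-the-weights pattern described above.
   Path and layout are placeholders, not MLRift's real format. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("qwen3-0.6b.weights", O_RDONLY);   /* placeholder path */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* Read-only, private mapping: pages stream in from the page cache on
       demand, so "loading the model" costs no upfront copy. */
    const uint16_t *bf16 = mmap(NULL, (size_t)st.st_size, PROT_READ,
                                MAP_PRIVATE, fd, 0);
    if (bf16 == MAP_FAILED) { perror("mmap"); return 1; }

    printf("mapped %lld bytes of weights at %p\n",
           (long long)st.st_size, (const void *)bf16);

    munmap((void *)bf16, (size_t)st.st_size);
    close(fd);
    return 0;
}
```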
MLRift is built on top of KernRift, the systems language and self-hosting compiler that provides the toolchain. The MLRift compiler driver mlrc is itself a KernRift program — there is no Rust, no C, and no LLVM in the build. mlrc consumes .mlr source files and emits ROCm code objects (--target=hip-amd) or raw RDNA 3 / RDNA 2 ISA code objects (--target=amdgpu-native) directly, with no hipcc or clang in the path.
The flagship workload is Qwen3-0.6B inference. On an AMD Radeon RX 7800 XT (gfx1100), the speculative-mega-kernel path decodes at 229 tok/s (3.68× PyTorch ROCm bf16 and 5.71× PyTorch ROCm fp32) while using less peak VRAM than PyTorch's native bf16 path. Token-id output is bit-identical to HuggingFace transformers.generate(do_sample=False) across every generated position. Qwen3-14B Q8_0 runs through the AVX2 CPU path at 3.63× the speed of PyTorch bf16 on a Ryzen 9 7900X.
The Phase 3 Pipeline
MLRift's defining piece is Phase 3: an AST-walking pass in src/format_amdgpu.mlr that recognises ~30 distinct LLM-pipeline kernel shapes and emits raw AMDGPU ISA bytes directly into a code object. The recognised shapes include gemv, gemm (with LDS double-buffer and WMMA tensor-core variants), gemv_coop_bf16_f32, attention (single-stream, multi-query speck4, and 14B-shape forms), RoPE, RMSNorm, softmax, layernorm, silu_mul, qkv_split, and K/V scatter. About 2,200 lines of hand-emit code were deleted when this pipeline landed; new kernel shapes now ship as MLRift source plus a small recogniser, not as opaque hex.
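To make the recogniser idea concrete, here is a toy C sketch of the Phase 3 flow under invented names: walk a kernel's AST, match it against a known shape such as gemv, and hand the extracted dimensions to a per-arch emitter. None of these node or function names are MLRift's real internals.

```c
/* Toy shape recogniser: an outer row loop whose body is one dot product
   is treated as a gemv of shape rows x cols. Invented AST, not MLRift's. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef enum { AST_LOOP, AST_DOT } ast_kind;

typedef struct ast {
    ast_kind          kind;
    int64_t           trip_count;  /* loop iterations / dot length */
    const struct ast *body;        /* single child in this toy model */
} ast;

typedef struct { int64_t rows, cols; } gemv_params;

static bool match_gemv(const ast *root, gemv_params *out) {
    if (root->kind != AST_LOOP || !root->body) return false;
    if (root->body->kind != AST_DOT) return false;
    out->rows = root->trip_count;
    out->cols = root->body->trip_count;
    return true;
}

int main(void) {
    ast dot  = { .kind = AST_DOT,  .trip_count = 1024, .body = NULL };
    ast loop = { .kind = AST_LOOP, .trip_count = 2048, .body = &dot };
    gemv_params p;
    if (match_gemv(&loop, &p))
        /* here Phase 3 would hand p to the gfx1100/gfx1030 byte emitter */
        printf("gemv %lld x %lld recognised\n",
               (long long)p.rows, (long long)p.cols);
    return 0;
}
```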
The same source compiles for multiple GPU architectures: gfx1100 (RDNA 3 / RX 7000) is the primary target; gfx1030 (RDNA 2 / RX 6000) ships 31 kernels via asm_u32_arch pair-pickers with zero .long opaque bytes. WMMA tensor cores (v_wmma_f32_16x16x16_bf16) are wired on RDNA 3 for bf16 → f32 GEMM at the gate-up phase. RDNA 4 (gfx1200) and NVIDIA Blackwell / Ada / Ampere are next on the roadmap — only the per-arch encoding tables change.
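The pair-picker mechanism can be sketched in a few lines of C. asm_u32_arch is the name from the text; the helper below and the two s_endpgm encodings are my own illustration (SOPP opcodes were renumbered between RDNA 2 and RDNA 3), so check the ISA manuals before trusting the exact words.

```c
/* Sketch of the pair-picker idea: the source names one instruction, and the
   emitter selects the 32-bit encoding for the target arch from an
   (RDNA 2, RDNA 3) pair, so no opaque .long bytes survive in the source. */
#include <stdint.h>

typedef enum { GFX1030, GFX1100 } gpu_arch;

static inline uint32_t asm_u32_pick(gpu_arch a, uint32_t rdna2, uint32_t rdna3) {
    return (a == GFX1030) ? rdna2 : rdna3;
}

/* Example: the same mnemonic encodes differently per arch. The words below
   are my reading of the ISA manuals; treat them as illustrative. */
static inline uint32_t emit_s_endpgm(gpu_arch a) {
    return asm_u32_pick(a, 0xBF810000u /* gfx10xx */, 0xBFB00000u /* gfx11xx */);
}
```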
A KFD-direct shim talks straight to /dev/kfd via the kernel's KFD ioctl interface and reimplements the HIP module-load / launch-kernel / memcpy API surface in pure MLRift. The shim now beats stock ROCm on synchronous-launch latency — 102 µs vs 850 µs baseline — by allocating its completion signals in COHERENT|UNCACHED GTT so the CP fetcher reads them without an L2 round-trip. The whole driver path lives inside the same compiler: no userspace ROCm dependency, no libhsa-runtime64.so, no libamdhip64.so.
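For readers who want to poke at the same interface, here is a hedged C sketch against the upstream uapi header linux/kfd_ioctl.h: open /dev/kfd, bind the process VM to a GPU, and allocate one GTT page with the COHERENT|UNCACHED flags the shim uses for its completion signals. The gpu_id, render node, and virtual address are placeholders; real code discovers them from sysfs and reserves the VA range first.

```c
/* Hedged sketch of KFD-direct allocation; plain C against the kernel uapi,
   not MLRift's shim. Requires a kernel new enough to expose UNCACHED (~5.19+). */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/kfd_ioctl.h>

int main(void) {
    int kfd = open("/dev/kfd", O_RDWR | O_CLOEXEC);
    if (kfd < 0) { perror("/dev/kfd"); return 1; }

    /* Bind this process's VM to the GPU. drm_fd comes from the matching
       /dev/dri/renderD* node; gpu_id from /sys/class/kfd/kfd/topology. */
    int drm = open("/dev/dri/renderD128", O_RDWR | O_CLOEXEC); /* placeholder */
    struct kfd_ioctl_acquire_vm_args vm = {
        .drm_fd = (uint32_t)drm,
        .gpu_id = 0x1234,                       /* placeholder gpu_id */
    };
    if (ioctl(kfd, AMDKFD_IOC_ACQUIRE_VM, &vm) < 0) { perror("acquire_vm"); return 1; }

    /* One 4 KiB signal page in GTT, coherent and uncached, so the CP can
       poll it without an L2 round-trip. Real code reserves va_addr first
       (e.g. an anonymous PROT_NONE mmap) and then maps the handle with
       AMDKFD_IOC_MAP_MEMORY_TO_GPU. */
    struct kfd_ioctl_alloc_memory_of_gpu_args alloc = {
        .va_addr = 0x7f0000000000ull,           /* placeholder GPU VA */
        .size    = 4096,
        .gpu_id  = vm.gpu_id,
        .flags   = KFD_IOC_ALLOC_MEM_FLAGS_GTT
                 | KFD_IOC_ALLOC_MEM_FLAGS_WRITABLE
                 | KFD_IOC_ALLOC_MEM_FLAGS_COHERENT
                 | KFD_IOC_ALLOC_MEM_FLAGS_UNCACHED,
    };
    if (ioctl(kfd, AMDKFD_IOC_ALLOC_MEMORY_OF_GPU, &alloc) < 0) { perror("alloc"); return 1; }
    printf("signal page handle: 0x%llx\n", (unsigned long long)alloc.handle);
    return 0;
}
```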
On top of the kernels sits the speculative mega-kernel — the 28-layer Qwen3 forward pass collapses into one GPU dispatch per layer with persistent work-groups and cross-WG barriers (29 launches/token instead of 310). Multi-query variants mks4, mks8, mks16 amortise weight-row reads across 4 / 8 / 16 query tokens per dispatch using prompt-lookup-decode (PLD) speculative decoding. The whole thing fits inside 2 GB of VRAM.
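The cross-WG barrier is what makes persistent work-groups safe, and the algorithm is portable enough to model on the host. The C11 sketch below shows the generation-counter pattern (the last group to arrive resets the count and flips the generation; everyone else spins on it); it illustrates the technique, not MLRift's device code.

```c
/* Host-side model of a cross-workgroup barrier for N persistent groups,
   using C11 atomics. On the GPU the spin would be a poll loop over a
   device-memory flag rather than a CPU busy-wait. */
#include <stdatomic.h>

typedef struct {
    atomic_uint arrived;     /* groups that have reached the barrier */
    atomic_uint generation;  /* bumped once per completed barrier round */
    unsigned    total;       /* number of persistent work-groups */
} wg_barrier;

static void wg_barrier_wait(wg_barrier *b) {
    /* Snapshot the generation BEFORE announcing arrival. */
    unsigned gen = atomic_load_explicit(&b->generation, memory_order_acquire);

    if (atomic_fetch_add_explicit(&b->arrived, 1, memory_order_acq_rel) + 1
            == b->total) {
        /* Last to arrive: reset the count, then release everyone. */
        atomic_store_explicit(&b->arrived, 0, memory_order_relaxed);
        atomic_fetch_add_explicit(&b->generation, 1, memory_order_release);
    } else {
        /* Everyone else spins until the generation flips. */
        while (atomic_load_explicit(&b->generation, memory_order_acquire) == gen)
            ; /* spin */
    }
}
```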
Design Values
- No Python, no framework, no runtime. Programs are static ELFs that load weights and decode tokens. PyTorch, transformers, ROCm userspace, even libc — none of them are linked. The bench numbers reflect what the hardware actually does, not what an abstraction layer charges you for.
- Bit-identical to the reference. Every MLRift Qwen3 row produces token-ids identical to HuggingFace transformers.generate(do_sample=False). Performance work that drifts the output is reverted; correctness comes before speed.
- Bare-metal primitives are first-class. @kernel functions, persistent work-groups, cross-WG barriers, raw AQL packet construction, KFD ioctls, AVX2 fused-multiply-add, raw clone()+futex thread pools: the pieces you need to write a competitive ML runtime are in the base language, not a library.
- Coverage over hand-tuning. The north star is "run the majority of LLMs at best perf," which means the encoding work is lifted into the compiler. Phase 3 emits parametric kernels at runtime per arch_params rather than shipping a hand-tuned variant per model (see the sketch after this list). Hand-emitted gfx1100 bytes do not scale across hardware; lifting that work into the AST-walking pass does.
- Measure on the same hardware. Every benchmark on the site was run on the same RX 7800 XT + Ryzen 9 7900X under the same Linux kernel. PyTorch baselines were measured on the same box, in the same session, from the same idle-VRAM starting point. No cherry-picked datasheets.
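As a rough illustration of that last point, a parametric emitter consumes one record of per-arch facts instead of baking constants into hand-written bytes. The field names below are assumptions, not MLRift's real arch_params.

```c
/* Illustrative only: what a per-arch parameter record could look like.
   These fields are assumptions, not MLRift's actual struct. */
typedef struct {
    unsigned wave_size;   /* 32 on RDNA 2/3 */
    unsigned lds_bytes;   /* LDS budget per work-group */
    unsigned max_vgprs;   /* occupancy-limiting register budget */
    int      has_wmma;    /* RDNA 3 tensor-core path available? */
} arch_params;
```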
The Creator
MLRift is created by Pantelis Christou, an Electrical and Computer Engineering student. The project started as a way to learn how machine-learning runtimes actually work from the bottom up: by building one rather than reading about one. It has grown well beyond that starting point.
The compiler, the AMDGPU ISA emitter, the KFD shim, the LLM kernels, the Qwen3 inference pipeline, the standard library, and this website are written and maintained by one person. Bug reports, questions, and contributions are welcome on GitHub.
Related Projects
MLRift sits inside a larger, ongoing effort to build an independent, vertically integrated computing stack from the physics level up to cognitive architectures. I am building these to explore alternatives to the modern computing monopoly. You can find the other pieces of this ecosystem on GitHub:
- KernRift: The bare-metal systems language and self-hosting compiler that MLRift is built on. Provides the SSA IR backend, the AVX2 / AArch64 emitters, the AMDGPU ISA primitives, and the static-ELF runtime that MLRift sits on top of. kernrift.org.
- Morphium: A design-space exploration platform for post-silicon, reconfigurable materials. It simulates alternative substrates, such as IGZO for logic and phase-change photonics, exploring a hardware future that can be fabricated differently.
- Noesis (Python Reference): A prototype for Whole-Brain Emulation (WBE). It simulates the continuous chaotic attractor dynamics of biological networks and tests Constructivist theories of intelligence before they are ported to the native MLRift stack.
- EIR (Energy Inference Representation): A differentiable compiler that treats planning as inference, modeling future trajectories as energy fields for latent world models.
- W-bit: An alternative non-binary, weighted, noisy logic substrate intended for machine-learning inference and learning, shifting away from strict binary computation.
Contact
For collaborations, architecture discussions, or inquiries regarding MLRift and my other projects, feel free to reach out through the project's GitHub discussions and feedback channels.