About

The compiler, the design, and the person behind it.

The Compiler

MLRift is a bare-metal machine-learning compiler designed for end-to-end LLM inference without a Python or PyTorch runtime. It compiles ahead of time to native machine code — no interpreter, no framework, no hidden dispatch. Programs are static ELFs that mmap their model weights from disk and decode tokens directly through compiled GPU and CPU kernels. The language aims to make low-level ML work — bf16 weight streams, hand-tuned mega-kernels, persistent work-group barriers, KFD-direct device dispatch — feel first-class rather than buried under five layers of framework abstraction.
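The weight-mapping idea — map the checkpoint once and let the kernel page bf16 rows in on demand — can be sketched in host C. This is an illustration only (MLRift's actual loader is written in MLRift itself, and the file layout here is hypothetical):

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a weight file read-only; startup cost is one mmap call, not a
 * full read of the checkpoint. Returns a pointer to raw bf16 values. */
static const uint16_t *map_weights(const char *path, size_t *len_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return NULL; }
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                       /* the mapping survives the close */
    if (p == MAP_FAILED) return NULL;
    *len_out = (size_t)st.st_size;
    return (const uint16_t *)p;      /* bf16 values, 2 bytes each */
}
```

Because the binary is a static ELF, this is the entire startup path: no interpreter to boot, no framework to initialise before the first token.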

MLRift is built on top of KernRift, the systems language and self-hosting compiler that provides the toolchain. The MLRift compiler driver mlrc is itself a KernRift program — there is no Rust, no C, and no LLVM in the build. mlrc consumes .mlr source files and emits ROCm code objects (--target=hip-amd) or raw RDNA 3 / RDNA 2 ISA code objects (--target=amdgpu-native) directly, with no hipcc or clang in the path.

The flagship workload is Qwen3-0.6B inference. On an AMD Radeon RX 7800 XT (gfx1100), the speculative-mega-kernel path decodes at 229 tok/s — 3.68× PyTorch ROCm bf16 and 5.71× PyTorch ROCm fp32 — while using less peak VRAM than PyTorch's native bf16 path. Token-id output is bit-identical to HuggingFace transformers.generate(do_sample=False) at every generated position. Qwen3-14B Q8_0 decodes through the AVX2 CPU path at 3.63× PyTorch bf16 throughput on a Ryzen 9 7900X.

The Phase 3 Pipeline

MLRift's defining piece is Phase 3: an AST-walking pass in src/format_amdgpu.mlr that recognises ~30 distinct LLM-pipeline kernel shapes — gemv, gemm (with LDS double-buffer + WMMA tensor-core variants), gemv_coop_bf16_f32, attention (single-stream, multi-query speck4, 14B-shape), RoPE, RMSNorm, softmax, layernorm, silu_mul, qkv_split, K/V scatter — and emits raw AMDGPU ISA bytes directly into a code object. About 2 200 lines of hand-emit code were deleted when this pipeline landed; new kernel shapes now ship as MLRift source plus a small recogniser, not as opaque hex.
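The recogniser-then-emit structure can be sketched in C. Everything here is invented for illustration — the real pass walks the MLRift AST in src/format_amdgpu.mlr and inspects loop structure and operand types, not a flat tag:

```c
/* A toy AST summary: the op kind plus the output rank stand in for
 * the structural pattern the real recogniser matches. */
typedef enum { OP_MATVEC, OP_MATMUL, OP_SOFTMAX, OP_OTHER } op_tag;
typedef struct { op_tag op; int out_rank; } kernel_ast;

typedef enum { SHAPE_GEMV, SHAPE_GEMM, SHAPE_SOFTMAX, SHAPE_FALLBACK } shape;

/* Map an AST summary to a known kernel shape. Unrecognised kernels
 * take the generic code path rather than the hand-scheduled emitters,
 * so an unmatched shape costs performance, never correctness. */
static shape recognise(const kernel_ast *k) {
    switch (k->op) {
    case OP_MATVEC:  return SHAPE_GEMV;
    case OP_MATMUL:  return k->out_rank == 2 ? SHAPE_GEMM : SHAPE_FALLBACK;
    case OP_SOFTMAX: return SHAPE_SOFTMAX;
    default:         return SHAPE_FALLBACK;
    }
}
```

The payoff of this split is the deletion noted above: the pattern lives in one small recogniser, and the kernel body lives in readable MLRift source instead of opaque hex.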

The same source compiles for multiple GPU architectures: gfx1100 (RDNA 3 / RX 7000) is the primary target; gfx1030 (RDNA 2 / RX 6000) ships 31 kernels via asm_u32_arch pair-pickers with zero .long opaque bytes. WMMA tensor cores (v_wmma_f32_16x16x16_bf16) are wired on RDNA 3 for bf16 → f32 GEMM at the gate-up phase. RDNA 4 (gfx1200) and NVIDIA Blackwell / Ada / Ampere are next on the roadmap — only the per-arch encoding tables change.
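The pair-picker idea — one logical instruction, per-arch encodings selected from a table — can be sketched as follows. The opcode words are placeholders, not real RDNA encodings, and asm_u32_arch here is a simplified stand-in for the MLRift construct of the same name:

```c
#include <stdint.h>

typedef enum { GFX1030, GFX1100 } gpu_arch;

/* RDNA 2 and RDNA 3 renumbered several opcode fields, so the emitter
 * keys each instruction word on the target arch instead of embedding
 * opaque .long bytes. Values below are invented for illustration. */
typedef struct { uint32_t rdna2; uint32_t rdna3; } enc_pair;

static uint32_t asm_u32_arch(const enc_pair *p, gpu_arch arch) {
    return arch == GFX1030 ? p->rdna2 : p->rdna3;
}
```

Adding a new architecture under this scheme means widening the table, not rewriting the emitters — which is why the roadmap targets are described as encoding-table changes.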

A KFD-direct shim talks straight to /dev/kfd via the kernel's KFD ioctl interface and reimplements the HIP module-load / launch-kernel / memcpy API surface in pure MLRift. The shim now beats stock ROCm on synchronous-launch latency — 102 µs vs 850 µs baseline — by allocating its completion signals in COHERENT|UNCACHED GTT so the CP fetcher reads them without an L2 round-trip. The whole driver path lives inside the same compiler: no userspace ROCm dependency, no libhsa-runtime64.so, no libamdhip64.so.
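The latency win comes from where the completion signal lives: the CPU spins on a value the GPU writes, and uncached coherent GTT keeps that poll off the L2 path. A host-side sketch of the spin-wait, with a C11 atomic standing in for the real GTT mapping (the names are hypothetical, not the shim's API):

```c
#include <stdatomic.h>
#include <stdint.h>

/* Completion signal: in the shim this u64 lives in COHERENT|UNCACHED
 * GTT memory so the command processor and the CPU agree on its value
 * without an L2 round-trip. A plain atomic stands in for it here. */
typedef struct { _Atomic uint64_t value; } completion_signal;

static void signal_store(completion_signal *s, uint64_t v) {
    atomic_store_explicit(&s->value, v, memory_order_release);
}

/* Spin until the signal reaches `target`; returns the observed value.
 * In the shim this is the tight poll behind the ~102 µs launch path. */
static uint64_t signal_wait_eq(completion_signal *s, uint64_t target) {
    uint64_t v;
    while ((v = atomic_load_explicit(&s->value,
                                     memory_order_acquire)) != target)
        ;
    return v;
}
```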

On top of the kernels sits the speculative mega-kernel — the 28-layer Qwen3 forward pass collapses into one GPU dispatch per layer with persistent work-groups and cross-WG barriers (29 launches/token instead of 310). Multi-query variants mks4, mks8, mks16 amortise weight-row reads across 4 / 8 / 16 query tokens per dispatch using prompt-lookup-decode (PLD) speculative decoding. The whole thing fits inside 2 GB of VRAM.
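A cross-work-group barrier is what lets one dispatch span a whole layer: every persistent work-group must finish layer N before any starts layer N+1. The shape of such a barrier can be sketched in host C with threads standing in for work-groups (a sense-reversing barrier on atomics; this is the general technique, not MLRift's actual GPU implementation):

```c
#include <pthread.h>
#include <stdatomic.h>

#define NWG 4       /* persistent work-groups (threads here) */
#define NLAYERS 8   /* layers crossed inside one "dispatch"   */

static _Atomic int arrived = 0;
static _Atomic int phase = 0;
static int per_wg_count[NWG];

/* Sense-reversing global barrier: the last arriver resets the counter
 * and bumps the phase; everyone else spins until the phase changes. */
static void wg_barrier(void) {
    int my_phase = atomic_load(&phase);
    if (atomic_fetch_add(&arrived, 1) == NWG - 1) {
        atomic_store(&arrived, 0);
        atomic_store(&phase, my_phase + 1);
    } else {
        while (atomic_load(&phase) == my_phase)
            ;  /* persistent WGs spin rather than exit and relaunch */
    }
}

static void *wg_main(void *arg) {
    int id = *(int *)arg;
    for (int layer = 0; layer < NLAYERS; layer++) {
        per_wg_count[id]++;  /* "compute" this layer's slice */
        wg_barrier();        /* no WG races ahead to the next layer */
    }
    return 0;
}

static void run_mega_kernel(void) {
    pthread_t t[NWG];
    int ids[NWG];
    for (int i = 0; i < NWG; i++) {
        ids[i] = i;
        pthread_create(&t[i], 0, wg_main, &ids[i]);
    }
    for (int i = 0; i < NWG; i++) pthread_join(t[i], 0);
}
```

Replacing per-kernel launches with in-dispatch barriers is what turns hundreds of launches per token into a handful — the launch overhead is paid once per dispatch, not once per kernel.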

Design Values

The Creator

MLRift is created by Pantelis Christou, an Electrical and Computer Engineering student. The project started as a way to learn how machine-learning runtimes actually work from the bottom up — by building one, not by reading about one. It has grown since.

The compiler, the AMDGPU ISA emitter, the KFD shim, the LLM kernels, the qwen3 inference pipeline, the standard library, and this website are written and maintained by one person. Bug reports, questions, and contributions are welcome on GitHub.

Related Projects

MLRift sits inside a larger, ongoing effort to build an independent, vertically integrated computing stack from the physics level up to cognitive architectures. I am building these to explore alternatives to the modern computing monopoly. You can find the other pieces of this ecosystem on GitHub.

Contact

For collaborations, architecture discussions, or inquiries regarding MLRift and my other projects, feel free to reach out:

[email protected]
