No PyTorch. No Python runtime. Just fast, portable, production-ready inference.
$ curl -sSL https://guoqingbao.github.io/xinfer/install.sh | bash
npm install -g xinfer-ai
Features
Everything you need for production LLM inference — without the Python baggage.
Pure Rust backend — no PyTorch, no CUDA Python bindings, no Python runtime.
Flash Attention, FlashInfer, CUDA Graphs, continuous batching. Up to 175+ tok/s.
Core scheduling & attention logic in under 5,000 lines of Rust.
CUDA on Linux, Metal on macOS. Same binary, same API everywhere.
OpenAI & Anthropic APIs, built-in Web UI, MCP tool calling, structured outputs.
TurboQuant 2–4 bit KV cache extends context up to 4.3× with minimal quality loss.
Performance
Tested on V100, A100, Hopper H800, and RTX 5090.
| Model | Format | Size | Hardware | Speed | Note |
|---|---|---|---|---|---|
| Qwen3-30B-A3B | NVFP4 | 30B MoE | RTX 5090 (SM120) | 0 tok/s | HW NVFP4 |
| Gemma4-26B-A4B | NVFP4 | 26B MoE | RTX 5090 (SM120) | 0 tok/s | HW NVFP4 |
| Qwen3.6-35B-A3B | FP8 | 35B MoE | H800 (SM90) | 0 tok/s | HW FP8 |
| DeepSeek-R1-Qwen3-8B | Q4_K_M | 8B | A100 (SM80) | 0 tok/s | GGUF |
| Llama-3.1-8B | ISQ Q4K | 8B | A100 (SM80) | 0 tok/s | SW quant |
| MiniMax-M2.5 | NVFP4 | 229B MoE | H800 ×2 (SM90) | 0 tok/s | SW NVFP4 (no HW) |
| Qwen3-30B-A3B | NVFP4 | 30B MoE | V100 (SM70) | 0 tok/s | SW FP4 |
| Qwen3.6-27B | FP8 | 27B Dense | H800 (SM90) | 0 tok/s | HW FP8 |
* HW = hardware-accelerated. Hopper (SM90) supports HW FP8 but not HW NVFP4; Blackwell (SM120) supports both. NVFP4 on Hopper uses software emulation.
Models
From 3B to 397B — dense, MoE, and multimodal architectures.
Supported Formats
Quick Start
Get up and running in minutes.
curl -sSL https://guoqingbao.github.io/xinfer/install.sh | bash
npm install -g xinfer-ai
# Prerequisites: Rust, CUDA Toolkit (or Metal Xcode CLI) curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh sudo apt-get install -y git build-essential libssl-dev pkg-config export XINFER_REPO="https://github.com/guoqingbao/xinfer" # macOS/Metal: replace features with `metal` # SM70/SM75 (V100): remove `flashinfer` and `cutlass` cargo install --git $XINFER_REPO xinfer --features cuda,nccl,flashinfer,cutlass
# Build Python wheel from source pip install maturin maturin[patchelf] # FlashInfer backend (SM80+) ./build.sh --release --features cuda,nccl,flashinfer,cutlass,python # macOS Metal maturin build --release --features metal,python # Install the wheel pip install target/wheels/xinfer*.whl --force-reinstall
# SM70/SM75: remove `flashinfer` and `cutlass` ./build_docker.sh "cuda,nccl,flashinfer,cutlass"
# HuggingFace model with Web UI xinfer --m Qwen/Qwen3.6-27B-FP8 --kvcache-dtype turbo4 --ui-server # Multi-GPU with local model xinfer --w /path/to/model --d 0,1 --ui-server # Python mode python3 -m xinfer.server --m Qwen/Qwen3.6-27B-FP8 --ui-server
Downloads
Pre-compiled binaries and pip wheels for every GPU architecture.
KV Cache
Extend context length up to 4.3× with --kvcache-dtype.
| Mode (--kvcache-dtype) | Compression | Quality | GPU |
|---|---|---|---|
| default (BF16) | 1× | Baseline | All |
| fp8 | 2× | Near-lossless | SM70+ / M1 |
| turbo8 | 2.6× | High quality | SM70+ / M1 |
| turbo4 | 3.7× | Best balance | SM70+ / M1 |
| turbo3 | 4.7× | Max compression | SM70+ |
Documentation
Guides, API references, and integration tutorials.