Show HN: InferX – an AI-native OS for running 50 LLMs per GPU with hot swapping
Hey folks, we’ve been building InferX, an AI-native runtime that snapshots the full GPU execution state of an LLM (weights, KV cache, CUDA context) and restores it in under 2 seconds. This lets us hot-swap models like threads: no reloading, no cold starts.
We treat each model as a lightweight, resumable process, like an OS for LLM inference.
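For intuition, here is a minimal Python sketch of the "model as a resumable process" idea. It is entirely hypothetical: the names (Snapshot, GpuSlot, swap_in, the model IDs) are not the real InferX API, and the snapshot/restore steps are stubbed out rather than touching CUDA. It just shows a scheduler that keeps parked snapshots and swaps the requested one in before serving a request.

    from __future__ import annotations

    import time
    from dataclasses import dataclass

    @dataclass
    class Snapshot:
        # Placeholder for a captured GPU execution state.
        model_id: str
        weights: bytes = b""       # in practice: device memory pages
        kv_cache: bytes = b""      # in practice: per-session attention cache
        cuda_context: bytes = b""  # in practice: serialized CUDA context

    class GpuSlot:
        """One GPU with a single resident model; other models stay parked
        as snapshots and are restored on demand (hot swap)."""

        def __init__(self) -> None:
            self.resident: str | None = None
            self.parked: dict[str, Snapshot] = {}

        def park(self, snap: Snapshot) -> None:
            self.parked[snap.model_id] = snap

        def swap_in(self, model_id: str) -> None:
            # Snapshot whatever is currently resident, then restore the target.
            if self.resident is not None:
                self.parked[self.resident] = Snapshot(self.resident)
            self.resident = self.parked[model_id].model_id

        def run(self, model_id: str, prompt: str) -> str:
            if self.resident != model_id:
                self.swap_in(model_id)  # the step the post claims takes < 2 s
            return f"[{model_id}] completion for: {prompt!r}"

    if __name__ == "__main__":
        gpu = GpuSlot()
        gpu.park(Snapshot("llama-7b"))     # stand-in model names
        gpu.park(Snapshot("mistral-7b"))
        for model, prompt in [("llama-7b", "hello"),
                              ("mistral-7b", "salut"),
                              ("llama-7b", "again")]:
            t0 = time.perf_counter()
            out = gpu.run(model, prompt)
            print(out, f"(swap+infer in {time.perf_counter() - t0:.4f}s)")

In the real system the snapshot would capture actual device state instead of empty byte strings, but the scheduling shape (park, swap in, run) is the point of the sketch.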
Why it matters:
• Run 50+ LLMs per GPU (7B–13B range)
• 90% GPU utilization (vs. ~30–40% with conventional setups)
• Avoids cold starts by snapshotting and restoring directly on the GPU
• Designed for agentic workflows, toolchains, and multi-tenant use cases
• Helpful for Codex CLI-style orchestration or bursty multi-model apps
Still early, but we’re seeing strong interest from builders and infra folks. Would love thoughts, feedback, or edge cases you’d want to see tested.
Demo: https://inferx.net
X: @InferXai
Very interesting. How would memory (or previous chat context awareness) work in the case of hot swapping, when multiple users are hot-swapping models like threads?
Wow, that's really cool!