Small models,
trained in the open.
An independent AI research lab. We build small, efficient language models you can own and run yourself — and release the weights, code and datasets in the open. This release wave ships Abacus, our Rust terminal coding agent; refreshes the Qwythos GGUFs with v2 runtime fixes and MTP variants; and announces Qwythos-27B as the next larger Mythos model. Claire remains our in-house language model in training.
Abacus, Qwythos GGUF v2, and Qwythos-27B.
Abacus is the coding agent.
A fast, local-first terminal agent in Rust for setup, search, edits, review, sessions and scripting. Bring your own model endpoint; every mutation is approval-gated and shown first as a per-file diff.
Redownload the Qwythos GGUFs.
v2 replaces the original normal files, fixes tokenizer and embedded chat/tool-template metadata for Qwen3.5 GGUF runtimes, adds -MTP- variants, and smoke-tests Q4/Q8 tool calling, 1M context and vision.
The next Qwythos size is on deck.
Qwythos-27B is announced as the larger next member of the Mythos line. The shipped 9B remains the current open-weight flagship while 27B moves the family toward a heavier local-reasoning tier.
A 9B that checks
its own work.
Qwythos-9B-Claude-Mythos-5-1M is our shipped flagship — a full-parameter reasoning model distilled from Claude Mythos 5 and Claude Fable traces, with a 1M-token context, native tool use, and a +34.3 MMLU lift over its base. The v2 GGUFs are live now: fixed Qwen3.5 runtime metadata, stronger embedded chat/tool templates, normal replacement files, explicit -MTP- variants for draft speculation, and a vision projector. Qwythos-27B is announced as the next larger Mythos model.
Matched lm-evaluation-harness run (limit 100). Not every benchmark moved up — gpqa-diamond −0.05, arc-challenge roughly flat. Full table on the model card.
How many prime numbers are there below 100,000? Verify with python.
This is a counting problem over a fixed range — the kind of thing I should verify rather than recall. I'll write a primality test, count every prime from 2 through 99,999, and run it in the python_executor rather than trusting a memorised figure.
There are 9,592 prime numbers below 100,000. I verified this by computing all primes from 2 through 99,999 using a primality test and counting them.
→ Reaches for a tool to verify rather than recalling a number. Writes the primality test, runs it, reports the checked result.
Our shipped flagship. A full-parameter reasoning model on a deeply uncensored Qwen3.5-9B base, post-trained on 500M+ tokens of Claude Mythos and Claude Fable traces with in-house chain-of-thought. Ships with a 1,048,576-token (1M) context via YaRN by default and native function calling — and self-corrects with tools (7/7 on hard factual prompts spanning math, cybersecurity, pharmacology and biochem). Versus the base, under a matched harness: +34.3 MMLU, +30 gsm8k-strict, +19 gsm8k-flex. GGUF v2 adds fixed runtime metadata, MTP variants and vision-projector support; Qwythos-27B is announced as the next larger Mythos member.
We build the whole stack ourselves — the harness that finds better architectures, the pipeline that builds the training data, and the model they produce.
An automated, LLM-in-the-loop harness for discovering novel attention mechanisms and transformer blocks. It proposes candidate architectures, sandbox-trains each one, benchmarks them against a synthetic gauntlet plus a mini language-model tier, and ranks the ideas worth promoting to real scale.
Our supervised-fine-tuning data factory. It generates conversation traces with a teacher model and assembles them into a staged, curriculum-by-position corpus — short and simple first, long and multi-turn last — validated against Claire's real tokenizer before training.
Our in-house language model, trained from scratch on a custom architecture and training recipe. A 6B mixture-of-experts with roughly 500M parameters active per token (6B-A500M) — frontier-style sparsity at a size you can actually own and run, built for reasoning, code and grounded tool use.
A fast, local-first terminal coding agent written in Rust. Bring your own OpenAI-compatible endpoint — local Ollama, llama.cpp, vLLM or hosted providers — then inspect approval-gated diffs, sessions, goals, skills, MCP tools, subagents and scheduled jobs from one focused TUI.
Multi-model fusion chat: a master model asks up to three independent fusion models the same task, then synthesizes one coherent answer from their replies. TUI plus a web UI, every turn logged.
A lean, local experiment tracker with a live web dashboard. Import and go — loss curves, arbitrary metrics, artifact saving and run comparison update in real time. No servers, no API keys, no cloud.
Automatically offload PyTorch checkpoints to (S)FTP as they are written, then delete the local copy so long runs never fill the disk. Resume by run name and checkpoint name.
A fast, concurrent task generator for distillation data, written in native Rust. Dozens of domains across math, code, science, creative writing and conversation. OpenAI-compatible; builds the public tasklist-* corpora.
A high-fidelity proxy that translates the Anthropic Messages API to any OpenAI-compatible backend: extended thinking, document blocks, streaming tool use and cache control preserved.
A nanochat fork with Block Attention Residuals: learned, input-dependent attention over previous block outputs in place of fixed additive residuals. A small, readable testbed for an architectural idea.
A tiny ~63M GPT-2 trained from scratch on public-domain scripture — a spare-compute tribute to Terry A. Davis. Not a serious model. That is the point.
Efficient models, from scratch
Claire is our clean-sheet language model — a custom architecture and training recipe, built as a 6B mixture-of-experts with only ~500M parameters active per token. We are convinced a carefully designed sparse model punches well above its active-parameter weight: efficient enough to own and run yourself, capable enough to reason, write code and call tools.
Automated architecture discovery
microverse is our LLM-in-the-loop harness for finding better building blocks. It proposes attention mechanisms and transformer blocks, sandbox-trains each against a synthetic gauntlet, and surfaces the few structural ideas worth promoting to a real training run — before we spend the compute.
Data as curriculum
SFTSuite and taskgen treat data as a first-class part of the model. Conversation traces are generated, validated against the real tokenizer, and ordered so that position in the corpus is the curriculum — simple and short first, long and multi-turn last. The raw datasets are published openly.
Open by default
Weights, code, datasets and tools ship in the open: Abacus in your terminal, runmonitor for live training, offside-checkpoints for storage, plus our published models on Hugging Face. No waitlists. Built because we needed them; open because someone else might too.
Follow the build.
An occasional dispatch from the lab — progress on Qwythos and Claire, what we found with microverse, new Abacus releases and the one thing we got wrong that week. No hype, no roadmap theatre. Cancel from any line.
