

Ha ha! I actually finished it over the weekend. Now it’s on to the documentation… ICBF lol
I just tried to get shit GPT to do it this morning, as it’s generally pretty OK for that. As always, it produces real “page turners”. Here is its idea of a “lay explainer”:
Mixture of Assholes: Llama-swap + “MoA router”: making small local models act reliably (without pretending they’re bigger)
This project is a harness for local inference: llama-swap is the model traffic-cop, and the router is the conductor that decides what kind of work you want done (straight answer, self-critique loop, style rewrite, vision/OCR), when, and with what context. Vodka acts as the memory layer and handles context re-rolls.
The goal isn’t to manufacture genius. It’s to make local models behave predictably under hardware constraints by:
- making retrieval explicit (no “mystery memory”),
- keeping “fancy modes” opt-in,
- and making the seams inspectable when something goes wrong.
The shape is simple:
UI → Router (modes + RAG + memory plumbing) → llama-swap (model switching) → answer. ([GitHub][1])
The “what”: one OpenAI-style endpoint that routes workflows, not just models
At the front is an OpenAI-compatible POST /v1/chat/completions endpoint. From the client’s point of view, it’s “just chat completions” (optionally streaming). From the router’s point of view, each request can become a different workflow.
It also accepts OpenAI-style multimodal message blocks (text + image_url), which matters for the vision/OCR paths.
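For a client, that’s just a normal OpenAI-style call. A minimal sketch (the base URL, port, and model name are placeholders; only the endpoint path and the message shape come from the project):

```python
import requests

# Placeholder address; point this at wherever the router is listening.
BASE = "http://localhost:8000"

payload = {
    "model": "local-default",   # placeholder; real IDs depend on your llama-swap config
    "stream": False,
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "# ocr What does this receipt say?"},
            {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
        ],
    }],
}

resp = requests.post(f"{BASE}/v1/chat/completions", json=payload, timeout=120)
print(resp.json()["choices"][0]["message"]["content"])
```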
Under the hood, the router does three things (sketched in code below):
- Decides the pipeline (Serious / Mentats / Fun / Vision / OCR)
- Builds an explicit FACTS block (RAG) if you’ve attached any KBs
- Calls llama-swap, which routes the request to the chosen local model backend behind an OpenAI-like interface ([GitHub][1])
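Sketched as code, those three steps look roughly like this (the selector strings match the docs; `build_facts_block`, `call_llama_swap`, and the session fields are hypothetical names for illustration):

```python
def route(session, message: str) -> str:
    # 1) Decide the pipeline from the per-turn selector (or sticky session state).
    if message.startswith("# mentats"):
        mode = "mentats"
    elif message.startswith("# fun") or session.sticky_fun:
        mode = "fun"
    elif message.startswith(("# vision", "# ocr")):
        mode = "vision"
    else:
        mode = "serious"

    # 2) Build an explicit FACTS block only if KBs are attached (opt-in RAG).
    facts = build_facts_block(session.attached_kbs, message) if session.attached_kbs else ""

    # 3) Hand off to llama-swap, which swaps in the right local model backend.
    return call_llama_swap(mode, message, facts)
```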
The “why”: small models fail less when you make the seams visible
A lot of local “agent” setups fail in the same boring ways:
- they silently change behaviour,
- they smuggle half-remembered context,
- they hallucinate continuity.
This design makes those seams legible and user-controlled:
- You pick the mode explicitly (no silent “auto-escalation”).
- Retrieval is explicit and inspectable.
- There’s a “peek” path that can show what the RAG facts block would look like without answering — which is unbelievably useful for debugging.
The philosophy is basically: if the system is going to influence the answer, it should be inspectable, not mystical.
The “what’s cool”: you’re routing workflows (Serious / Mentats / Fun / Vision)
There are two layers of control:
A) Session commands (>…): change the router state
These change how the router behaves across turns (things like sticky fun mode, which KBs are attached, and some retrieval observability):
- `>status` — show session state (sticky mode, attached KBs, last RAG query/hits)
- `>fun` / `>fun off` — toggle sticky fun mode
- `>attach <kb>` / `>detach <kb|all>` / `>list_kb` — manage KBs per session
- `>ingest <kb>` / `>ingest_all` — ingest markdown into Qdrant
- `>peek <query>` — preview the would-be facts block
B) Per-turn selectors (#…): choose the pipeline for one message
- `# mentats …` — deep 3-pass “draft → critique → final”
- `# fun …` — answer, then rewrite in a persona voice
- `# vision …` / `# ocr …` — image paths
The three main pipelines (what they actually do)
1) Serious: the default “boring, reliable” answer
Serious is the default when you don’t ask for anything special. It can inject a FACTS block (RAG), it receives a constraints block (currently a V1 placeholder), and it appends a confidence/source line if the model leaves one out.
Docs vs implementation (minor note): the docs describe Serious as “query + blocks” oriented. The current implementation also has a compact context/transcript shaping step as part of prompt construction. Treat the code as the operational truth; the docs are describing the intended shape and may lag slightly in details as things settle.
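A rough sketch of that assembly (the prompt wording, regex, and appended line are assumptions; the empty constraints placeholder and the enforced confidence/source line are from the docs):

```python
import re

def serious_messages(query: str, facts: str) -> list[dict]:
    constraints = ""  # V1 placeholder per the roadmap: the block ships empty
    system = "Answer plainly and cite sources from FACTS where used.\n"
    if facts:
        system += f"\nFACTS:\n{facts}\n"
    system += f"\nCONSTRAINTS:\n{constraints}\n"
    return [{"role": "system", "content": system},
            {"role": "user", "content": query}]

def enforce_confidence_line(answer: str) -> str:
    # If the model omitted a confidence/source line, append one.
    if not re.search(r"(?im)^confidence:", answer):
        answer += "\nConfidence: unstated (line appended by router)"
    return answer
```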
2) Mentats: explicit 3-pass “think → critique → final”
This is the “make the model check itself” harness:
- Thinker drafts using QUERY + FACTS + constraints
- Critic checks for overreach / violations
- Thinker produces the final, carrying forward a “FACTS_USED / CONSTRAINTS_USED” discipline
If the pipeline can’t complete cleanly (protocol errors), the router falls back to Serious.
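In sketch form (the prompt text and marker check are illustrative; the three passes, the FACTS_USED / CONSTRAINTS_USED discipline, and the fallback to Serious are from the docs; `llm` is a hypothetical prompt-in, text-out callable):

```python
class ProtocolError(Exception):
    """Raised when a pass omits the required protocol markers."""

def mentats(query: str, facts: str, constraints: str, llm) -> str:
    # Pass 1: Thinker drafts from QUERY + FACTS + constraints.
    draft = llm(f"QUERY:\n{query}\n\nFACTS:\n{facts}\n\n"
                f"CONSTRAINTS:\n{constraints}\n\nDraft an answer.")

    # Pass 2: Critic checks for overreach and constraint violations.
    critique = llm(f"Critique this draft for overreach and violations of the "
                   f"FACTS/constraints:\n{draft}")

    # Pass 3: Thinker finalises, declaring what it actually used.
    final = llm("Revise the draft using the critique. End with FACTS_USED and "
                f"CONSTRAINTS_USED lines.\n\nDRAFT:\n{draft}\n\nCRITIQUE:\n{critique}")

    if "FACTS_USED" not in final:
        raise ProtocolError("missing discipline markers")  # caller re-routes to Serious
    return final
```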
3) Fun: answer first, then do the performance
Fun is deliberately a post-processing transform:
- pass 1: generate the correct content (lower temperature)
- pass 2: rewrite in a persona voice (higher temperature), explicitly instructed not to change the technical meaning
This keeps “voice” from leaking into reasoning or memory. It’s: get it right first, then style it.
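As a sketch, with assumed temperatures and prompt wording (`llm` is a hypothetical callable; only the two-pass order and the “don’t change the technical meaning” instruction come from the docs):

```python
def fun(query: str, llm) -> str:
    # Pass 1: get the content right at low temperature.
    answer = llm(query, temperature=0.2)

    # Pass 2: restyle only; meaning, numbers, and commands must survive intact.
    return llm(
        "Rewrite the text below in the active persona's voice. Do NOT change "
        "the technical meaning, numbers, or commands.\n\n" + answer,
        temperature=0.9,
    )
```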
RAG, but practical: Qdrant + opt-in KB (knowledge base) attach + “peek what you’re feeding me”
KBs are opt-in per session
Nothing is retrieved unless you attach KBs (`>attach linux`, etc.). The FACTS block is built only from attached KBs and the router tracks last query/hit counts for debugging.
Ingestion: “KB folder → chunks → vectors in Qdrant”
Ingestion walks markdown, chunks, embeds, and inserts into Qdrant tagged by KB. It’s simple and operational: turn a folder of docs into something you can retrieve from reliably.
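A minimal version of that loop with qdrant-client (the collection name, payload keys, chunk size, and the `embed` callable are all assumptions; the walk → chunk → embed → upsert-tagged-by-KB flow is the documented part):

```python
from pathlib import Path
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

client = QdrantClient("localhost", port=6333)

def ingest_kb(kb_name: str, folder: str, embed, chunk_size: int = 800) -> None:
    points, idx = [], 0
    for md in Path(folder).rglob("*.md"):
        text = md.read_text(encoding="utf-8")
        # Naive fixed-size chunking, purely for illustration.
        for start in range(0, len(text), chunk_size):
            chunk = text[start:start + chunk_size]
            points.append(PointStruct(
                id=idx,
                vector=embed(chunk),  # hypothetical text -> vector callable
                payload={"kb": kb_name, "source": str(md), "text": chunk},
            ))
            idx += 1
    client.upsert(collection_name="kb_chunks", points=points)
```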
The KB refinery: SUMM → DISTILL → ingest
This is one of the more interesting ideas: treat the KB as a product, not a dump.
- SUMM produces a human-readable summary (strict: no fabrication, no silent renaming) from base text
- DISTILL produces dense, retrieval-shaped atoms (embedding-friendly headings/bullets, minimal noise)
- then ingest the distilled output
The key point: DISTILL isn’t “a nicer summary.” It’s explicitly trying to produce retrieval-friendly material.
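As prompt chaining it could look like this (the prompt wording is mine, and `llm` is the same hypothetical callable as above; the two stages and their rules, faithful summary then retrieval-shaped atoms, are from the docs):

```python
SUMM_PROMPT = ("Summarise the text for a human reader. Strict rules: no "
               "fabrication, no silent renaming of terms.\n\nTEXT:\n{text}")

DISTILL_PROMPT = ("Rewrite the summary as dense, retrieval-shaped atoms: short "
                  "headings and bullets, one fact per bullet, minimal "
                  "connective prose.\n\nSUMMARY:\n{summ}")

def refine(text: str, llm) -> str:
    summ = llm(SUMM_PROMPT.format(text=text))      # SUMM: human-readable, faithful
    return llm(DISTILL_PROMPT.format(summ=summ))   # DISTILL: what actually gets ingested
```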
Vodka: deterministic memory plumbing (not “AI memory vibes”)
Vodka does two jobs:
- context reduction / stability: keep the effective context small and consistent
- explicit notes: store/retrieve nuggets on demand (`!!store`, `??recall`, plus cleanup commands), with TTL (facts expire unless used)
It can also leave internal breadcrumb markers and later expand them when building a transcript/context — those IDs aren’t surfaced unless you deliberately show them.
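A toy sketch of the explicit-notes half (field names, the TTL value, and the refresh-on-use policy are assumptions; `!!store` / `??recall` and expire-unless-used are from the docs):

```python
import time

class NoteStore:
    def __init__(self, ttl_seconds: float = 7 * 24 * 3600):
        self.ttl = ttl_seconds
        self.notes: dict[str, dict] = {}

    def store(self, key: str, text: str) -> None:      # backs `!!store`
        self.notes[key] = {"text": text, "last_used": time.time()}

    def recall(self, key: str) -> str | None:          # backs `??recall`
        note = self.notes.get(key)
        if note is None:
            return None
        if time.time() - note["last_used"] > self.ttl:
            del self.notes[key]                        # expired: unused facts die
            return None
        note["last_used"] = time.time()                # recall refreshes the TTL
        return note["text"]
```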
Roadmap reality check: what’s left for V1.1
- Constraints/GAG: placeholder in V1 (constraints block currently empty)
- Coder role: present in config but not wired yet



You had me at horded.
You. Had. Me. At. Horded.