Research Observatory
The Model Mesh
A research fleet of up to 100 small open-weight GGUF language models on local hardware. Below is the honest, recorded benchmark - which model is fastest, which reasons best, and how it was verified - rendered server-side with no scripts.
Prove all things; hold fast that which is good. — 1 Thessalonians 5:21
Fleet & Benchmark
Recorded research benchmark · 2026-06-23 · run on the operator workstation. A research observatory, not a live public service.
A research catalog (the 100-model manifest) of small open-weight GGUF models on local disk - roughly 80 GB of weights. About 16 llama-server instances were wired across ports in the 8460-8506 range for the benchmark.
Of the 16 registered servers, 14 answered health and inference probes on 2026-06-23. The two that were offline were pre-existing instances on reserved ports, which was expected.
phi-3.5-mini was the only model verifiably correct across all three probes - a transitive-logic puzzle, a coding one-liner, and an order-of-operations math problem - at a useful speed. It is the reasoning crown of this fleet.
The 135M micro-model was fastest at inference - a good fit for ultra-fast routing and triage in front of the larger brains.
Atomized model pointer-packs were replayed against their sources with 20 samples each and reported zero mismatches - the recorded verification pass for the benchmarked set.
The model mesh runs on the operator workstation for research. It is not exposed publicly and is not served from this web host; this page only reports the recorded findings.
How the benchmark worked
Each live server was probed with a small fixed battery at temperature 0: a reasoning puzzle, a coding task, a math problem, a quick factual chat, and an instruction-format check. Health latency and tokens-per-second were measured per server. The goal was an honest map of which small local model is good at what - not a leaderboard against frontier models. These are CPU-class results for tiny open-weight models, recorded to guide routing decisions inside the estate.
Honest scope: "100" is the target catalog size, not 100 simultaneously live servers. Figures are from a single recorded benchmark on 2026-06-23 on local hardware and will drift as models are added or retired. Nothing here is a claim about production capacity or a public endpoint.
Every figure on this page is a recorded research finding from a single benchmark on local hardware, not a live status of a public endpoint.