Skip to content

Research Observatory

The Model Mesh

A research fleet of up to 100 small open-weight GGUF language models on local hardware. Below is the honest, recorded benchmark - which model is fastest, which reasons best, and how it was verified - rendered server-side with no scripts.

Prove all things; hold fast that which is good. — 1 Thessalonians 5:21

Fleet & Benchmark

Recorded research benchmark · 2026-06-23 · run on the operator workstation. A research observatory, not a live public service.

Fleet catalog
Up to 100 local GGUF models

A research catalog (the 100-model manifest) of small open-weight GGUF models on local disk - roughly 80 GB of weights. About 16 llama-server instances were wired across ports in the 8460-8506 range for the benchmark.

Benchmark coverage
14 servers benchmarked healthy

Of the 16 registered servers, 14 answered health and inference probes on 2026-06-23. The two that were offline were pre-existing instances on reserved ports, which was expected.

Reasoning crown
phi-3.5-mini (3.8B)

phi-3.5-mini was the only model verifiably correct across all three probes - a transitive-logic puzzle, a coding one-liner, and an order-of-operations math problem - at a useful speed. It is the reasoning crown of this fleet.

Fastest router
smollm2-135m - 14.3 tok/s

The 135M micro-model was fastest at inference - a good fit for ultra-fast routing and triage in front of the larger brains.

Verification
Pointer-pack replays: 0 mismatches

Atomized model pointer-packs were replayed against their sources with 20 samples each and reported zero mismatches - the recorded verification pass for the benchmarked set.

Public availability
Internal research - not a public API

The model mesh runs on the operator workstation for research. It is not exposed publicly and is not served from this web host; this page only reports the recorded findings.

How the benchmark worked

Each live server was probed with a small fixed battery at temperature 0: a reasoning puzzle, a coding task, a math problem, a quick factual chat, and an instruction-format check. Health latency and tokens-per-second were measured per server. The goal was an honest map of which small local model is good at what - not a leaderboard against frontier models. These are CPU-class results for tiny open-weight models, recorded to guide routing decisions inside the estate.

Honest scope: "100" is the target catalog size, not 100 simultaneously live servers. Figures are from a single recorded benchmark on 2026-06-23 on local hardware and will drift as models are added or retired. Nothing here is a claim about production capacity or a public endpoint.

Every figure on this page is a recorded research finding from a single benchmark on local hardware, not a live status of a public endpoint.