Production AI Platform

AngelsWorks Hub

The engineering platform underneath everything else I run — one LLM gateway, RBAC, observability. Built, then operated.

22 Modules
67 Containers
12+ LLM providers via gateway
7 Model tiers behind one gateway
9 Mission Control stations

AngelsWorks Hub architecture: modules, LiteLLM gateway, RBAC trust scoring, Mission Control surface

Hub is the engineering platform underneath everything else. It centralizes LLM routing, RBAC, and observability so every module talks to one substrate, and it turns specs into running modules — each one a container with its own manifest, health endpoint, and place in the dependency graph. One place to act when something is off, one place to add the next thing.

Decisions

01
Single LLM gateway, every call

All AI calls route through LiteLLM, which exposes seven model tiers across 12+ providers. During the day, calls hit hosted endpoints (Groq, Cerebras, Gemini); at night they fall back to local Ollama. The self-healing circuit-breaker fallback has held through real provider outages and quota exhaustion with zero downtime. The integration tax of "every call goes through one gateway" is real, but it bought me unified cost reporting, audit logs, one retry policy, and the ability to swap a provider without touching application code.
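To make the fallback idea concrete, here is a minimal sketch of a circuit breaker in front of an ordered provider list. It is plain Python, not LiteLLM's API and not Hub's actual code: `Gateway`, `Breaker`, the thresholds, and the injected `call` function are all hypothetical names chosen for illustration.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in a tier is skipped or fails."""


@dataclass
class Breaker:
    # Hypothetical thresholds; real values would be tuned per provider.
    max_failures: int = 3
    cooldown_s: float = 30.0
    failures: int = 0
    opened_at: Optional[float] = None

    def allows(self, now: float) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: permit a trial call once the cooldown has elapsed.
        return now - self.opened_at >= self.cooldown_s

    def record(self, ok: bool, now: float) -> None:
        if ok:
            self.failures, self.opened_at = 0, None
            return
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = now  # open (or re-open) the circuit


class Gateway:
    def __init__(
        self,
        tiers: Dict[str, List[str]],
        call: Callable[[str, str], str],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.tiers = tiers      # tier name -> providers in preference order
        self.call = call        # assumed signature: call(provider, prompt) -> text
        self.clock = clock
        self.breakers = {p: Breaker() for ps in tiers.values() for p in ps}

    def complete(self, tier: str, prompt: str) -> str:
        for provider in self.tiers[tier]:
            breaker = self.breakers[provider]
            if not breaker.allows(self.clock()):
                continue  # circuit is open: skip without calling
            try:
                result = self.call(provider, prompt)
            except Exception:
                breaker.record(False, self.clock())
                continue  # fall through to the next provider
            breaker.record(True, self.clock())
            return result
        raise AllProvidersFailed(tier)
```

Because consumers only ever call `complete(tier, prompt)`, reordering or replacing providers is a change to the `tiers` mapping rather than to application code.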
02
Service + repository layered pattern, applied strangler-fig

New hub modules, and any module being touched for non-trivial work, must follow a route → service → repo shape rather than the original single-file pattern. The migration is strangler-fig: existing modules don't have to convert until they're modified. The shift surfaced a real recurring cost in the old shape, but converting everything at once would have stalled feature work for weeks. The horizontal pattern that worked for "split by file type" stopped scaling, so the next axis had to be vertical (route → service → repo).
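A small sketch of what the route → service → repo shape can look like. Everything here is illustrative: the `Resource` type, the in-memory repository, and the `(status, body)` return convention are assumptions, not the contents of any real Hub module.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Protocol, Tuple


@dataclass(frozen=True)
class Resource:
    id: str
    name: str


class ResourceRepo(Protocol):
    """Repo layer: persistence only, no business rules."""

    def get(self, resource_id: str) -> Optional[Resource]: ...
    def save(self, resource: Resource) -> None: ...


class InMemoryResourceRepo:
    # Stand-in for a real database-backed repository.
    def __init__(self) -> None:
        self._rows: Dict[str, Resource] = {}

    def get(self, resource_id: str) -> Optional[Resource]:
        return self._rows.get(resource_id)

    def save(self, resource: Resource) -> None:
        self._rows[resource.id] = resource


class ResourceService:
    """Service layer: business rules, no HTTP and no SQL."""

    def __init__(self, repo: ResourceRepo) -> None:
        self._repo = repo

    def rename(self, resource_id: str, new_name: str) -> Resource:
        name = new_name.strip()
        if not name:
            raise ValueError("name must not be empty")
        current = self._repo.get(resource_id)
        if current is None:
            raise KeyError(resource_id)
        updated = Resource(id=current.id, name=name)
        self._repo.save(updated)
        return updated


def rename_route(service: ResourceService, resource_id: str, body: dict) -> Tuple[int, dict]:
    """Route layer: translate a request into a service call and a status code."""
    try:
        updated = service.rename(resource_id, body.get("name", ""))
    except ValueError as exc:
        return 400, {"error": str(exc)}
    except KeyError:
        return 404, {"error": "not found"}
    return 200, {"id": updated.id, "name": updated.name}
```

The point of the split is that the service can be tested against the in-memory repo, and the route can be swapped for another framework without touching the rules.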
03
RBAC with trust scoring, not just roles

Agents (and humans) accumulate a trust score from observed behavior. Sandboxed actions are checked against the score, not just against a static role. Patrol workers continuously verify that agents stay inside their declared sandboxes. The Mission Control Security station surfaces denials, trust-score deltas, and any sandbox escapes. Strict deny-by-default is feature-flagged until every agent is confirmed registered, then turned on.
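The check described above can be sketched as a role test plus a trust threshold, with the strict mode behind a flag. The 0 to 1 trust scale, the rule table, the step sizes, and the function names are all assumptions made for this example, not Hub's real policy.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Set


@dataclass
class Agent:
    name: str
    role: str
    trust: float = 0.5  # hypothetical scale: 0.0 (untrusted) to 1.0


@dataclass(frozen=True)
class ActionRule:
    roles: FrozenSet[str]   # roles that may perform the action at all
    min_trust: float        # trust score required on top of the role


# Illustrative rule table; actions not listed here are always denied.
RULES: Dict[str, ActionRule] = {
    "read_docs": ActionRule(frozenset({"reader", "operator"}), 0.2),
    "restart_module": ActionRule(frozenset({"operator"}), 0.7),
}


def is_allowed(agent: Agent, action: str, registry: Set[str], strict: bool = False) -> bool:
    rule = RULES.get(action)
    if rule is None:
        return False
    # Feature flag: in strict mode an unregistered agent is denied outright.
    # In permissive mode it still has to pass the role and trust checks.
    if strict and agent.name not in registry:
        return False
    return agent.role in rule.roles and agent.trust >= rule.min_trust


def observe(agent: Agent, ok: bool, step: float = 0.05) -> float:
    """Nudge trust from observed behavior; violations cost more than successes earn."""
    delta = step if ok else -2 * step
    agent.trust = min(1.0, max(0.0, agent.trust + delta))
    return agent.trust
```

With this shape, an operator whose trust drops below 0.7 after a violation keeps the role but loses `restart_module` until the score recovers.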
04
Mission Control as the observability surface

Nine stations: modules, agents, providers, security, telemetry, backups, knowledge, boot, and ops. Every module exposes a health endpoint and a manifest; the dashboard pulls from those rather than having modules push metrics. When something is wrong, I look in the same place every time, and the source of truth is the module's own /health rather than a parallel monitoring config that drifts.
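A rough sketch of the pull model: the dashboard asks each module's /health and records what it hears. The JSON shape (a `status` field), the URLs, and the function names are assumptions for illustration; only the standard library is used.

```python
import json
import urllib.request
from typing import Callable, Dict


def fetch_health(url: str, timeout: float = 2.0) -> dict:
    """GET a health endpoint and parse its JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def poll(modules: Dict[str, str], fetch: Callable[[str], dict] = fetch_health) -> Dict[str, str]:
    """Return one status string per module, keyed by module name."""
    statuses: Dict[str, str] = {}
    for name, url in modules.items():
        try:
            payload = fetch(url)
        except (OSError, ValueError):
            # Network errors, HTTP errors, timeouts, and bad JSON all count as unreachable.
            statuses[name] = "unreachable"
            continue
        if isinstance(payload, dict):
            statuses[name] = str(payload.get("status", "unknown"))
        else:
            statuses[name] = "unknown"
    return statuses
```

Passing `fetch` as a parameter keeps the poller testable without a network, and a module that stops answering shows up as `unreachable` instead of silently disappearing from a push-based feed.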
05
Modules are containers

Each module ships as its own Docker container with its own manifest declaring backup paths, ports, dependencies, and health endpoint. Adding a module is a docker-compose entry plus a manifest. Removing one doesn't leave residue elsewhere. This costs a bit more memory than running everything in one process, but the operational story (independent restart, isolated failure, clear dependency graph) is worth it.
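One way to picture a manifest and the dependency graph it feeds is the sketch below. The field names and the example modules are hypothetical, since the real manifest schema is not shown here; the ordering uses the standard library's `graphlib`.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # Python 3.9+
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class Manifest:
    # Hypothetical fields mirroring what the text says a manifest declares.
    name: str
    port: int
    health_path: str = "/health"
    backup_paths: Tuple[str, ...] = ()
    depends_on: Tuple[str, ...] = ()


def start_order(manifests: List[Manifest]) -> List[str]:
    """Order modules so each one starts after everything it depends on."""
    known = {m.name for m in manifests}
    graph: Dict[str, Set[str]] = {}
    for m in manifests:
        missing = set(m.depends_on) - known
        if missing:
            raise ValueError(f"{m.name} depends on unknown modules: {sorted(missing)}")
        graph[m.name] = set(m.depends_on)
    # static_order() yields dependencies first and raises CycleError on a cycle.
    return list(TopologicalSorter(graph).static_order())
```

Because every dependency is declared in a manifest, a typo or a removed module surfaces as an error at load time rather than as a container that never becomes healthy.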

The reason Hub exists is mundane. I was running 22 modules across 67 containers, calling 12+ AI providers, and the only way to answer “is this thing healthy right now” was to SSH to a box and read logs. Hub answers that question itself. Once it could, I started routing every LLM call through one place, enforcing access from one place, and instrumenting from one place. The shape of the system stopped fighting me, and the same shape became the platform underneath every new module I ship.

A few things I think transfer to a team building this kind of platform. Instrumentation pays back faster than features — every time I waited to add metrics until “after I shipped the thing”, I shipped a thing I couldn’t operate. A single gateway for AI calls lets you change everything later — swapping providers, adding observability, enforcing rate limits, none of it requires touching the consumers. And the boring parts (RBAC, backups, manifests, health endpoints) are what determine whether the interesting parts (agents, routing, evaluation) survive contact with a real workload. Get the substrate right and new modules ship at the rate the substrate allows.

Hub is also where the strangler-fig refactor lives. Old single-file modules still work; new and touched modules go through the route → service → repo layering. That migration is in progress in modules/resources/; the lessons file is honest about what worked and what was harder than expected.

Related work

ArtLens

Tenant Manager

Open to permanent AI Architect roles, EU remote.