~/insights $ cat

Why we built a private AI cloud instead of reselling public APIs

Architecture is a business decision. Here's the one we made.

The default architecture

Walk into any AI consulting engagement in 2025 and the default architecture is roughly the same: a thin application layer, a handful of integrations, and a stack of public API calls to OpenAI, Anthropic, Google, or some combination. The consultancy’s differentiation, to the extent it has any, is in the application layer and the integrations. The AI itself is rented.

This architecture has genuine advantages. It’s fast to start. It requires no infrastructure investment. It benefits from the capability improvements of the underlying foundation models without the consulting firm doing anything. For many workloads, it’s the right answer.

It was not the right answer for the firm we wanted to be.

What public APIs get wrong for a consulting practice

Three structural problems with building a consulting practice on top of rented inference:

1. The economics break at scale. Per-token pricing is reasonable for experimentation and early production. It becomes untenable for workloads with high query volume, large retrieval contexts, or agentic patterns that generate substantial token load per user interaction. Clients who scale their usage end up with quarterly bills that surprise their finance teams; the back-of-the-envelope sketch after this list shows why.

2. The data handling story collapses under due diligence. Public API providers publish terms of service, data retention policies, and training opt-out procedures. These are adequate for most purposes. They are not adequate for regulated work — healthcare, legal, public-sector, certain financial services — where the procurement review will read every line and conclude that sending client data to a third party is unacceptable. We did not want to build a practice that couldn’t serve those clients.

3. You don’t own any part of your stack. When a foundation model provider changes their pricing, their availability, their policies, or their API, you have no recourse. You either pass the change through to your clients or re-architect your applications. Neither is what a client is paying you for.
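
To make the first point concrete, here is a rough cost sketch. Every number in it is an illustrative assumption: the per-token rates, the token counts per query, and the flat capacity cost are stand-ins, not quotes from any provider or from our own pricing.

```python
# Back-of-the-envelope: per-token API pricing vs. flat provisioned capacity.
# All numbers are illustrative assumptions, not quotes.

PRICE_PER_1M_INPUT = 3.00          # assumed public-API rate, USD per 1M input tokens
PRICE_PER_1M_OUTPUT = 15.00        # assumed public-API rate, USD per 1M output tokens
MONTHLY_CAPACITY_COST = 20_000.00  # assumed flat cost of dedicated inference, USD

def monthly_api_cost(queries_per_day: int,
                     input_tokens_per_query: int = 8_000,   # large retrieval context
                     output_tokens_per_query: int = 1_000) -> float:
    """Per-token cost grows linearly with query volume."""
    days = 30
    input_tokens = queries_per_day * input_tokens_per_query * days
    output_tokens = queries_per_day * output_tokens_per_query * days
    return (input_tokens / 1e6 * PRICE_PER_1M_INPUT
            + output_tokens / 1e6 * PRICE_PER_1M_OUTPUT)

for qpd in (1_000, 10_000, 50_000):
    print(f"{qpd:>6,} queries/day: API ≈ ${monthly_api_cost(qpd):>9,.0f}/mo "
          f"vs. capacity ${MONTHLY_CAPACITY_COST:,.0f}/mo")
```

Under these assumed numbers, the public API wins at 1,000 queries a day (about $1,170 a month) and loses badly at 50,000 (about $58,500 a month), while the capacity line stays flat. The exact crossover moves with the rates, but the shape of the curve is the point.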

What the private cloud gives us

We operate across two Tier III TierPoint facilities in Marlborough and Chicago, connected by private fiber. We run open-weight language models — the Llama family, Mistral, Qwen, and others — in Kubernetes clusters. We operate vector databases, retrieval infrastructure, and the observability layer ourselves.

This gives us:

Data flows we can document. For every engagement, we can write down exactly what data traverses what component, and we can defend that document in a procurement review.

Predictable economics. Our clients pay for workload-scaled capacity, not per-token usage. Their bills do not move when their users’ behavior changes.

Independence from a single foundation model. Model selection is per-workload. When a better open-weight model ships, we evaluate it, test it, and roll it out for workloads where it helps. We are not tied to any one provider’s release schedule or policy direction. A sketch of what per-workload selection looks like follows this list.

A real posture for regulated work. Our facility-level attestations — SOC 2, ISO 27001, HIPAA/HITECH, PCI DSS — are real, and our architecture is structured to take advantage of them. We can serve clients whose procurement processes rule out public-API-only architectures.
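
As a sketch of how per-workload model selection and documented data flows can fit together, consider a small registry like the one below. The workload names, model IDs, endpoints, and data classes are hypothetical, chosen to illustrate the shape of the thing rather than to describe our production configuration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelAssignment:
    model_id: str                   # open-weight model serving this workload
    endpoint: str                   # internal inference service (in-cluster)
    data_classes: tuple[str, ...]   # data classes permitted to traverse it

# Hypothetical registry: every workload maps to exactly one model and one
# endpoint, so the data-flow document writes itself from this table.
REGISTRY: dict[str, ModelAssignment] = {
    "contract-summarization": ModelAssignment(
        model_id="meta-llama/Llama-3.3-70B-Instruct",     # assumed choice
        endpoint="http://llm-70b.inference.svc:8000/v1",  # assumed service
        data_classes=("client-confidential", "internal"),
    ),
    "support-triage": ModelAssignment(
        model_id="Qwen/Qwen2.5-32B-Instruct",             # assumed choice
        endpoint="http://llm-32b.inference.svc:8000/v1",  # assumed service
        data_classes=("internal",),
    ),
}

def resolve(workload: str, data_class: str) -> ModelAssignment:
    """Pick the model for a workload; refuse data flows the table disallows."""
    assignment = REGISTRY[workload]
    if data_class not in assignment.data_classes:
        raise PermissionError(f"{data_class!r} may not traverse {workload!r}")
    return assignment
```

Swapping in a newly shipped open-weight model then means editing one registry entry after evaluation, not re-architecting the applications that sit above it.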

Where we still use public APIs

We do not pretend everything runs in our cloud. For specific workloads, best-in-class external APIs are the right choice:

Vision analysis for low-frequency operations. When a client needs high-quality vision understanding for a once-per-document or once-per-catalog-item operation, the economics favor a top-tier external API. We use these APIs deliberately and transparently.

Frontier reasoning for high-value interpretive tasks. For a subset of reasoning tasks — the ones where the marginal quality of a frontier model materially changes outcomes — we route to external providers like Anthropic’s Claude. The decision is per-task, not blanket; a sketch of that decision follows below.
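
A minimal sketch of that per-task decision, with a hypothetical task list and a deliberately explicit data-clearance check (both are assumptions for illustration, not our actual policy):

```python
from enum import Enum

class Route(Enum):
    PRIVATE = "private-cloud"        # open-weight model in our own clusters
    EXTERNAL = "external-frontier"   # frontier API, e.g. Anthropic's Claude

# Hypothetical short list of tasks where frontier-model quality materially
# changes outcomes. Everything else defaults to the private cloud.
FRONTIER_TASKS = {
    "multi-document-legal-interpretation",
    "catalog-item-vision-intake",
}

def route(task: str, data_cleared_for_external: bool) -> Route:
    """Route externally only for listed tasks whose data is cleared to leave."""
    if task in FRONTIER_TASKS and data_cleared_for_external:
        return Route.EXTERNAL
    return Route.PRIVATE
```

Keeping the decision in one small, enumerable function is part of what makes the "what runs where and why" document possible to write and to audit.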

In every engagement, we document what runs where and why. Clients know, before they sign, exactly what their architecture looks like.

The tradeoff we accepted

Running your own inference is harder than reselling someone else’s. Capacity planning is a real problem. Model updates require engineering work. Observability and operations require investment. We built the firm around those capabilities because we believed the clients we wanted to serve needed them.

The architecture is a business decision. We made the one that matches the firm we wanted to build.

~/contact $ open

Want to talk about this work?

A 30-minute conversation is usually enough to tell whether we're the right partner for what you're working on.