Route, don't call: why a model gateway belongs between you and the LLM

The first integration everyone writes calls the provider SDK directly: import the vendor's client, pass your key, get a completion. It works on the first try and demos beautifully. It also welds your model choice into your codebase, where it sits as a time bomb until a price change, an outage, or a rate limit sets it off. It is the right way to learn and the wrong way to ship. My stance: the model belongs behind a gateway, a thin routing layer that every call goes through, not in your call sites. Here is what that one indirection buys, and why each item shows up the week after launch, not before.

The model becomes config, not code

When you call a provider directly, the model choice is welded into your code and your deployment. Through a gateway, the model is a string the gateway resolves, which means you can change it without shipping. That sounds minor until the first time you need to:

move a tenant to a cheaper model because their volume exploded,
pin one customer to a specific version while everyone else moves on,
A/B two models on real traffic.

None of those should require a release. With a gateway, the default model is a config value, not a code constant, and that single property is worth the abstraction on its own.
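A minimal sketch of what "the model is a string the gateway resolves" could look like. Everything here is hypothetical: the `ROUTING` layout, the tenant names, the model names and the `resolve_model` helper are illustrative assumptions, not any particular gateway's API. It covers the three cases above: a tenant moved to a cheaper model, a tenant pinned to a version, and a percentage split for an A/B test.

```python
import hashlib

# Hypothetical routing config. In practice this would be loaded from a
# config store so it can change without a deploy.
ROUTING = {
    "default": "model-large",
    "tenants": {
        "tenant-highvolume": "model-small",      # moved to a cheaper model
        "tenant-pinned": "model-large-2024-01",  # pinned to a specific version
    },
    # Send roughly 10% of remaining traffic to a candidate model.
    "experiment": {"model": "model-candidate", "percent": 10},
}


def _bucket(key: str) -> int:
    """Map a key to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def resolve_model(tenant: str, request_id: str, routing: dict = ROUTING) -> str:
    """Return the model name for a request; tenant overrides win."""
    override = routing["tenants"].get(tenant)
    if override is not None:
        return override
    experiment = routing.get("experiment")
    if experiment and _bucket(request_id) < experiment["percent"]:
        return experiment["model"]
    return routing["default"]
```

Call sites only ever ask for a completion; which model answers is decided here, so editing `ROUTING` is the whole change.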

Fallback you didn't have to hand-roll

Providers have bad minutes. A direct call turns a provider's 503 into your outage. A gateway can fail over (same request, second provider), so a single upstream wobble does not page you.

You can build this yourself, but you will build it badly the first time (retrying non-idempotent calls, retrying on the wrong status codes, no jitter). A gateway that already does it correctly is one of those "buy, don't build" boundaries.
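To make the failover idea concrete, here is a small sketch under stated assumptions: `UpstreamError` and `complete_with_fallback` are hypothetical names, and each provider is modelled as a plain callable rather than a real SDK client. It follows the rule from the diagram later in this post, where the second provider is used only on a 5xx.

```python
from typing import Callable, Optional, Sequence


class UpstreamError(Exception):
    """Hypothetical error type carrying the provider's HTTP status."""

    def __init__(self, status: int):
        super().__init__(f"upstream returned {status}")
        self.status = status


# A provider is modelled as "prompt in, completion text out".
Provider = Callable[[str], str]


def complete_with_fallback(prompt: str, providers: Sequence[Provider]) -> str:
    """Try providers in order, moving on only when one answers with a 5xx."""
    last_error: Optional[UpstreamError] = None
    for call in providers:
        try:
            return call(prompt)
        except UpstreamError as err:
            if err.status < 500:
                # Not a provider outage; sending the same request
                # to another provider is not the fix in this sketch.
                raise
            last_error = err
    if last_error is None:
        raise ValueError("no providers configured")
    raise last_error
```

This deliberately leaves out retries against the same provider, which have their own failure modes.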

Cost and limits in one place

Per-call cost is the metric that turns a successful feature into a budget problem. If every service calls providers directly, your cost data is scattered across vendors and impossible to attribute. Routed through one layer, you get spend per tenant, per feature, and per model in a single place, plus somewhere to enforce rate and budget limits before the bill arrives, not after.
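As a rough illustration of metering at the chokepoint, the sketch below keeps spend keyed by (tenant, feature, model) and refuses a call once a tenant's budget is used up. The `Meter` class, the prices and the budget figures are all made up for the example; real pricing is per provider and usually splits input and output tokens.

```python
from collections import defaultdict
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Raised before a call is sent when the tenant has no budget left."""


@dataclass
class Meter:
    price_per_1k: dict  # model name -> hypothetical USD per 1,000 tokens
    budgets: dict       # tenant -> maximum USD spend
    spend: dict = field(default_factory=lambda: defaultdict(float))

    def total_for(self, tenant: str) -> float:
        """Sum spend across all features and models for one tenant."""
        return sum(v for (t, _, _), v in self.spend.items() if t == tenant)

    def check(self, tenant: str) -> None:
        """Call before routing: enforce the limit before the bill, not after."""
        budget = self.budgets.get(tenant)
        if budget is not None and self.total_for(tenant) >= budget:
            raise BudgetExceeded(tenant)

    def record(self, tenant: str, feature: str, model: str, tokens: int) -> float:
        """Call after a completion: attribute its cost and return it."""
        cost = tokens / 1000 * self.price_per_1k[model]
        self.spend[(tenant, feature, model)] += cost
        return cost
```

Because the key includes tenant, feature and model, the same table answers all three attribution questions without touching any call site.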

Observability that survives a vendor swap

The questions you will actually need to answer in production (what was the prompt, what came back, how long did it take, which model, was it retried) are gateway-level questions. Put the logging there once and it keeps working when you change providers. Put it in each call site and you re-implement it every time, inconsistently.
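One way to picture provider-independent logging is a wrapper that records exactly those five answers for every call. This is a sketch with assumed names: `CallRecord`, `logged_call`, the `send` callable (taken here to return the response text and the number of attempts it needed) and the `sink` callable are hypothetical, not part of any real library.

```python
import time
from dataclasses import asdict, dataclass
from typing import Callable, Tuple


@dataclass
class CallRecord:
    model: str         # which model
    prompt: str        # what was the prompt
    response: str      # what came back
    latency_ms: float  # how long did it take
    retried: bool      # was it retried


def logged_call(
    model: str,
    prompt: str,
    send: Callable[[str, str], Tuple[str, int]],
    sink: Callable[[dict], None],
) -> str:
    """Send one request and emit one log record, whatever the provider is."""
    start = time.perf_counter()
    response, attempts = send(model, prompt)
    latency_ms = (time.perf_counter() - start) * 1000
    record = CallRecord(model, prompt, response, latency_ms, attempts > 1)
    sink(asdict(record))
    return response
```

Swapping providers changes what `send` does internally; the record shape and the sink stay the same.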

"But that's a single point of failure"

The honest objection. Two things make it a non-issue in practice:

The gateway is thin. It routes, it does not think. Less logic than the retry code you would otherwise scatter everywhere.
It is the layer that removes single points of failure downstream by enabling fallback. A thin, well-understood router in the middle is a far better risk than N copies of provider-coupling spread through your services.

Retries are a trap you want someone else to have already sprung

"Just retry on failure" is where most hand-rolled gateways quietly go wrong, and it is worth seeing why, because it is the strongest argument for not building this yourself:

Retrying the wrong status codes. A 429 (rate limit) and a 503 (upstream down) want retries; a 400 (your malformed request) wants to fail fast. Blindly retrying everything turns a bug in your payload into four identical failures and a 4× bill.
No backoff or jitter. Immediate retries during an upstream wobble are a thundering herd: you DoS the provider exactly when it is already struggling, and your own latency spikes.
Retrying non-idempotent work. If a call had a side effect before it failed, retrying double-applies it. With pure completions this is usually fine; the moment tool calls or stateful operations enter the picture, "just retry" corrupts state.

A mature gateway has already made these decisions correctly. That is the difference between "buy" and "build" here: not that you can't write retry logic, but that you will write the naive version first and discover the edge cases in production.
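The three traps above map onto three small decisions in code. The sketch below is one possible shape, not a reference implementation: `UpstreamError`, `backoff_delay` and `call_with_retry` are hypothetical names, the retryable set contains only the two status codes named above, and the base delay, cap and attempt count are arbitrary illustrative values. The backoff is exponential with full jitter, meaning a random delay between zero and the exponential ceiling.

```python
import random
import time
from typing import Callable

# Only the statuses named above are retried; a 400 fails fast.
RETRYABLE = {429, 503}


class UpstreamError(Exception):
    """Hypothetical error type carrying the provider's HTTP status."""

    def __init__(self, status: int):
        super().__init__(f"upstream returned {status}")
        self.status = status


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 8.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * 2 ** attempt)


def call_with_retry(call: Callable[[], str], idempotent: bool = True,
                    max_attempts: int = 4,
                    sleep: Callable[[float], None] = time.sleep) -> str:
    """Retry only retryable statuses, only for idempotent work, with backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except UpstreamError as err:
            is_last = attempt == max_attempts - 1
            if err.status not in RETRYABLE or not idempotent or is_last:
                raise
            sleep(backoff_delay(attempt))
    raise AssertionError("unreachable")
```

The `idempotent` flag is the caller's declaration; a request that can trigger tool calls or other side effects would pass `idempotent=False` and get exactly one attempt.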

Keep it thin: what does not belong in the gateway

The gateway's value comes from being boring. The failure mode is letting it accumulate responsibilities until it becomes a second application nobody understands. Things that belong outside it:

Prompt construction and business logic. The gateway routes a request; it does not decide what to ask. Grounding, templating and domain rules live in your feature code.
Per-feature behavior. If the gateway starts branching on which feature is calling, you have leaked your application into your infrastructure.
Caching of meaningful results. Tempting, but cache invalidation for model output is a feature-level decision, not a routing one.

Route, log, fail over, meter. Nothing else. A gateway that thinks is a gateway that becomes the new single point of failure for real.

The shape

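[Diagram: three call sites (Copilot feature, doc extraction, mail triage) feed a single gateway that handles, once: model = config, fallback and retries, cost and rate limits, prompt/response logging. The gateway sends traffic to provider A as the primary and to provider B only on 5xx.]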

Every call site converges on one chokepoint. Three features, three wires, all meet in the gateway, so fallback, cost, limits and logging exist exactly once. The model becomes a config value; the fallback provider is a dashed possibility, not a second code path.

Whether you run a hosted router or a few hundred lines of your own, the architectural point is the same: one chokepoint that every model call passes through. Direct provider calls feel faster on day one. By the time you are running real traffic, the gateway is the difference between changing a config value and shipping a release to do the same thing.

Start direct to learn. Put the gateway in before you depend on it, which is sooner than it feels.
