LLM APIs Have No Seatbelts. I Built One.
Why People Are Putting a Reverse Proxy in Front of Their AI Traffic
TL;DR: LLM APIs don’t ship with the controls you’d expect from any other piece of infrastructure — no per-caller auth, no tool restrictions, no cost ceiling. I got burned (a meeting invitation I never meant to send, a $385 weekend loop, IBANs going to the model in plaintext) and built a reverse proxy that fixes it. Here’s how three features work in practice, with the exact configs I run.
It was a Sunday afternoon. I was building a personal scheduling agent — the kind that reads your calendar, finds gaps, and books meetings automatically. Super useful for coordinating squash or catching up with friends. I’d been hacking on it for a couple of days and wanted to test the full flow end-to-end.
I needed test contacts who would actually respond, so I exported a few from my phone — figured I’d use people I actually know. My friends Marek and Tomek, and a couple of others. I told the agent to “book some test meetings for next week” and went to make coffee.
By the time I got back, it had sent real calendar invites. To all of them. For a meeting titled “Test Meeting 3” with no agenda, no description, nothing. Marek texted me: “what is this?” Lukasz replied with a meme. Tomek ignored it. Fine.
But there was a fourth contact in that export I’d forgotten about — someone from a networking event six months ago. I barely remembered his name. He accepted the invite without replying.
Monday morning he showed up on the call. I had no idea who was joining or why. I spent the next ten minutes pretending this was intentional.
The agent did exactly what I asked. “Book meetings.” With exactly the data I gave it. Nothing in between said “these are real people, maybe confirm first.”
That was the moment I understood the problem. Not that the agent was broken. That I had no layer between its decisions and the real world.
The Real Issue: LLM APIs Ship Without Operational Controls
After my calendar incident I started looking at how other teams run their agents. I talked to a friend at a mid-size fintech — five departments, three different API keys, zero idea what they were spending. Last month someone grep’d the logs during an unrelated investigation and found customer IBANs going to GPT-4 in plaintext. Thousands of requests over four months. Nobody had noticed because the bot worked great.
Different setups. Same gap.
“We have no idea what our agents are sending, what they’re allowed to do, or what they’re costing us.”
It’s not because anyone is careless. It’s because LLM APIs don’t ship with operational controls. There’s no per-caller identity. No way to say “this bot can’t see destructive tools.” No cost ceiling that actually shuts the door. You get an API key, you call the endpoint, and whatever the client sends goes straight through.
Every other piece of infrastructure I’ve run — databases, message queues, HTTP backends — has a proxy layer with auth, rate limiting, and observability. LLM traffic had none of that.
So I built one. It became the gateway component of Dativo Talon — an open-source tool I’ve been working on. A reverse proxy that sits between your clients and the LLM provider, identifies each caller, and applies policy before forwarding. One Go binary, one YAML config.
Here’s how the three features I needed most work in practice.
1. Tool Filtering — So the Model Never Learns calendar_invite Exists
This is the feature I built first, because it directly solves what happened to me.
My scheduling agent had five tools: read_calendar, find_gaps, create_draft, calendar_invite, and send_reminder. The first three are safe — they read data or create local drafts. The last two reach the real world. And the model couldn’t tell the difference, because I’d given it all five.
I didn’t need to remove calendar_invite from my code. I needed to remove it from what the model sees during testing.
That’s what the gateway does. It inspects the tools array in the JSON body before the request reaches OpenAI. Any tool matching a forbidden pattern gets stripped. The model never learns it exists. It can’t call calendar_invite if it was never told about calendar_invite.
Tool filtering is prevention, not detection. By the time you intercept a tool call, the model already decided to make it. The gateway removes the option before the decision happens.
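The stripping step itself is small. Talon is written in Go, but the idea fits in a few lines of Python — a sketch, assuming the matcher behaves like `fnmatch`-style globbing over lowercased names (the tool shapes below follow the OpenAI `tools` array format):

```python
from fnmatch import fnmatch

def filter_tools(tools, forbidden_patterns):
    """Strip any tool whose name matches a forbidden glob (case-insensitive)."""
    def is_forbidden(name):
        return any(fnmatch(name.lower(), pat.lower()) for pat in forbidden_patterns)
    return [t for t in tools if not is_forbidden(t["function"]["name"])]

tools = [
    {"type": "function", "function": {"name": "read_calendar"}},
    {"type": "function", "function": {"name": "calendar_invite"}},
    {"type": "function", "function": {"name": "send_reminder"}},
]
forbidden = ["calendar_invite", "send_*", "delete_*"]
print([t["function"]["name"] for t in filter_tools(tools, forbidden)])
# ['read_calendar']  (calendar_invite and send_reminder are stripped)
```

The forwarded request carries only the surviving tools, so the model’s planning loop never even considers the removed ones.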
Here’s what the config looks like:
```yaml
gateway:
  default_policy:
    # "filter" = silently strip matching tools before the model sees them
    tool_policy_action: "filter"
    forbidden_tools:
      - "calendar_invite"  # the tool that sent three real meeting invites
      - "send_*"           # matches send_email, send_reminder, send_sms
      - "delete_*"         # matches delete_thread, delete_emails
      - "admin_*"
      - "bulk_*"
      - "drop_*"
```

What this looks like in practice:
```
# What OpenAI sees WITHOUT the gateway:
tools: [read_calendar, find_gaps, create_draft, calendar_invite, send_reminder]

# What OpenAI sees WITH the gateway:
tools: [read_calendar, find_gaps, create_draft]
```

The model gets the read tools and the draft tool. It can plan meetings and prepare invites all day long. But it can’t send anything, because it doesn’t know sending is an option.
Patterns use glob syntax, case-insensitive. send_* matches send_email, send_reminder, Send_SMS. The lists are additive across levels — default policy, provider, and per-caller overrides all merge into one set.
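The additive merge is a plain set union. A sketch in Python (the level names are mine, for illustration):

```python
def merge_forbidden(default_level, provider_level, caller_level):
    """Forbidden-tool lists are additive: every level contributes patterns,
    duplicates collapse, and nothing is ever removed by a lower level."""
    return sorted(set(default_level) | set(provider_level) | set(caller_level))

print(merge_forbidden(["delete_*", "admin_*"], ["drop_*"], ["send_*", "delete_*"]))
# ['admin_*', 'delete_*', 'drop_*', 'send_*']
```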
Two modes:
filter (default) — silently removes forbidden tools, forwards the rest. The agent keeps working; it just can’t see the ones that reach the real world.
block — rejects the entire request with HTTP 403 if any forbidden tool is present.
You can also go the other direction with a per-caller allowlist. Only the tools you name get through:
```yaml
callers:
  - name: "scheduling-agent"
    api_key: "talon-gw-sched-001"
    tenant_id: "default"
    policy_overrides:
      # strict allowlist: ONLY these tools pass through
      allowed_tools: ["read_calendar", "find_gaps", "create_draft"]
      tool_policy_action: "block"
```

Now the agent can read, search, and draft. Nothing else. If I later wire up calendar_invite or send_email, the model will never see it unless I explicitly add it to the allowlist. This is the config I run for any agent that’s still in testing — default to read-only, unlock write tools deliberately.
The same principle would have prevented the OpenClaw incident. One forbidden_tools: ["delete_*"] line and the model would never have known deletion was an option.
Test it yourself:
```bash
curl -s -X POST http://localhost:8080/v1/proxy/openai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer talon-gw-sched-001" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [{"role":"user","content":"Book a test meeting for next Tuesday"}],
    "tools": [
      {"type":"function","function":{"name":"find_gaps","parameters":{}}},
      {"type":"function","function":{"name":"calendar_invite","parameters":{}}}
    ]
  }'
```

calendar_invite gets stripped. The model only sees find_gaps. It can find the time slot, but it can’t book anything. The evidence record logs which tools were requested, filtered, and forwarded — signed and timestamped.
2. PII-Based Routing — Because I Fed the Model Real Email Addresses
Here’s the thing I didn’t appreciate until after the calendar incident: the tool wasn’t the only problem. The data was the problem too. I fed my agent real contact email addresses, and those went straight to OpenAI as part of the prompt. Even if I’d blocked calendar_invite, the model would still have seen sarah.chen@company.com and marcus.klein@bigcorp.de in its context window. Those are real people’s real email addresses sitting on OpenAI’s servers.
The fintech story made it worse. Their support bot was summarising customer tickets, and those tickets contained IBANs, email addresses, phone numbers. Thousands of requests over four months. All of it went to GPT-4 in plaintext.
Under pii_action: "block", every one of those requests would have been rejected before reaching OpenAI. Under "redact", the IBANs would have been replaced with [REDACTED:iban] and the emails with [REDACTED:email] before the model saw them. Either way, four months of undetected PII leakage doesn’t happen.
Every request that hits the gateway goes through a PII classifier first. It scans for personal data patterns — IBANs, emails, phone numbers, tax IDs — and assigns a data tier (0, 1, or 2) based on what it finds. That tier feeds into what happens next.
Four actions, two directions. On the request side (what the client sends): allow passes through, warn logs to evidence, redact replaces PII with [REDACTED:type] before forwarding, block rejects with HTTP 400. On the response side (what the model sends back): same four actions, with block returning HTTP 451.
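To make the redact action concrete, here is a Python sketch with two toy patterns. The real classifier covers more PII types and is more careful about false positives; the regexes below are illustrative only:

```python
import re

# Hypothetical subset of the gateway's PII patterns (illustration only)
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "iban": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
}

def redact(text):
    """Replace each detected PII span with a [REDACTED:type] placeholder."""
    found = []
    for pii_type, pattern in PII_PATTERNS.items():
        if pattern.search(text):
            found.append(pii_type)
            text = pattern.sub(f"[REDACTED:{pii_type}]", text)
    return text, found

prompt = "Invoice for sarah.chen@company.com, IBAN DE44500105175407324931"
redacted, types = redact(prompt)
print(redacted)
# Invoice for [REDACTED:email], IBAN [REDACTED:iban]
```

The `found` list is what feeds the evidence record: which PII types appeared, so the audit trail can say *what* leaked without storing the values themselves.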
Different callers get different treatment:
```yaml
callers:
  - name: "internal-analytics"
    api_key: "talon-gw-analytics-001"
    tenant_id: "default"
    team: "data"
    policy_overrides:
      pii_action: "warn"            # log PII, forward unchanged — need to iterate fast
      response_pii_action: "warn"
      max_data_tier: 1              # deny tier 2 (high-sensitivity) requests
  - name: "customer-facing-bot"
    api_key: "talon-gw-custbot-002"
    tenant_id: "default"
    team: "support"
    policy_overrides:
      pii_action: "redact"          # IBANs, emails → [REDACTED:type] before OpenAI sees them
      response_pii_action: "redact" # redact PII in model responses too
      max_data_tier: 0              # only public/anonymised data allowed
  - name: "scheduling-agent-dev"
    api_key: "talon-gw-sched-dev-001"
    tenant_id: "default"
    team: "engineering"
    policy_overrides:
      pii_action: "redact"          # would have caught sarah.chen@company.com
      response_pii_action: "warn"
```

Internal analytics gets warn — I see what PII is flowing, but the team can iterate. Customer-facing bot gets redact — any IBAN or email in the prompt becomes [REDACTED:iban] before it touches OpenAI. My scheduling agent in dev gets redact — so even if I’m lazy and paste real contacts into the test prompt, the gateway scrubs them before the model sees them. Sunday-afternoon-proof.
The max_data_tier adds a second gate. If the classifier tags a request as tier 2 (high-sensitivity) but the caller is only cleared for tier 0, the policy engine denies it regardless of the PII action. Your customer-facing bot can’t accidentally process data it was never supposed to see.
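Sketching that gate in Python — note the tier-per-PII-type mapping below is my assumption for illustration, not Talon’s actual table:

```python
# Assumed mapping: financial identifiers are high-sensitivity (tier 2),
# contact details are tier 1, anything unrecognised stays tier 0.
TIER_BY_TYPE = {"iban": 2, "tax_id": 2, "email": 1, "phone": 1}

def classify_tier(pii_types_found):
    """Highest tier among the detected PII types; tier 0 if nothing was found."""
    return max((TIER_BY_TYPE.get(t, 0) for t in pii_types_found), default=0)

def tier_gate(pii_types_found, max_data_tier):
    """Deny the request, regardless of pii_action, if the detected tier
    outranks the caller's clearance."""
    return classify_tier(pii_types_found) <= max_data_tier

print(tier_gate(["email"], 1))  # True: a tier-1 caller may process emails
print(tier_gate(["iban"], 0))   # False: tier-0 caller, tier-2 data, denied
```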
Response scanning works for both streaming (SSE) and non-streaming. For streams, the gateway buffers the full response, scans, and forwards the original events if clean or rewrites them if redaction is needed.
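The buffer-then-scan step for streams looks roughly like this. The event shape is hypothetical (real SSE chunks are OpenAI delta JSON) and the single email regex stands in for the full classifier:

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def scan_stream(events):
    """Buffer every SSE delta, scan the assembled text once, then either
    forward the original events (clean) or emit a rewritten stream."""
    buffered = list(events)  # hold events until the upstream stream ends
    full_text = "".join(e["delta"] for e in buffered)
    if not EMAIL.search(full_text):
        return buffered      # clean: forward the original events untouched
    # PII found: rewrite the stream with the redacted text
    return [{"delta": EMAIL.sub("[REDACTED:email]", full_text)}]

clean = scan_stream([{"delta": "The slot "}, {"delta": "is free."}])
dirty = scan_stream([{"delta": "Mail sarah"}, {"delta": ".chen@company.com"}])
```

Buffering matters because PII can straddle chunk boundaries, as in the second example. The cost is latency: the client’s first token waits for the full upstream response.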
Every PII detection — both directions — ends up in the evidence store. talon audit list shows which requests contained PII, what types, and what action was taken. No log grepping.
3. Cost Caps — I Burned $385 on a Saturday. Here’s How I Made Sure It Never Happens Again.
Different weekend, different mistake. I left a test loop running — GPT-4, increasingly long context windows, no stop condition. By Sunday evening: $385 in API charges on a project budgeted at $20/month.
I seem to learn everything on weekends.
That Monday I added cost caps to the gateway. Least interesting feature to build, most money saved.
Here’s the thing most teams don’t realise: you find out about a cost overrun when the monthly invoice arrives. OpenAI’s usage dashboard updates, but there’s no hard stop. No circuit breaker. A gateway that blocks at the daily limit is fundamentally different from a provider alert that shows up 30 days later.
Every request gets a cost estimate based on the model and token count. The gateway tracks daily and monthly spend per caller by querying the evidence store — the same SQLite database that holds audit records. When a caller hits the cap, the next request gets a 403. No grace period.
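The check itself is one aggregate query. A sketch with a made-up evidence schema (the real store’s tables may differ); one reasonable policy, shown here, is to deny when the estimate would cross the cap:

```python
import sqlite3
from datetime import date

# Hypothetical schema standing in for the evidence store
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE evidence (caller TEXT, day TEXT, cost REAL)")
conn.executemany("INSERT INTO evidence VALUES (?, ?, ?)",
                 [("dev-sandbox", str(date.today()), 2.40),
                  ("dev-sandbox", str(date.today()), 2.80)])

def allow_request(conn, caller, estimated_cost, max_daily_cost):
    """Sum today's spend for the caller; deny (403 upstream) if this
    request's estimate would push the total over the daily cap."""
    (spent,) = conn.execute(
        "SELECT COALESCE(SUM(cost), 0) FROM evidence WHERE caller = ? AND day = ?",
        (caller, str(date.today()))).fetchone()
    return spent + estimated_cost <= max_daily_cost

print(allow_request(conn, "dev-sandbox", 0.10, 5.00))  # already at $5.20: denied
```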
```yaml
callers:
  - name: "production-agent"
    api_key: "talon-gw-prod-001"
    tenant_id: "default"
    policy_overrides:
      max_daily_cost: 50.00    # hard cap: 403 after $50/day
      max_monthly_cost: 1000.00
  - name: "dev-sandbox"
    api_key: "talon-gw-dev-002"
    tenant_id: "default"
    policy_overrides:
      max_daily_cost: 5.00     # weekend loops die at $5, not $385
      max_monthly_cost: 50.00

default_policy:
  max_daily_cost: 100.00       # global ceiling for callers without overrides
  max_monthly_cost: 2000.00
```

production-agent gets $50/day. dev-sandbox gets $5/day. If I leave another loop running on a Saturday, the gateway kills it at $5 instead of letting it burn for 48 hours.
The CLI tells you where you stand:
```
talon costs --tenant default
# Agent             Today ($)   Month ($)   Limit (day)   Limit (month)
# production-agent      22.10      487.30         50.00         1000.00
# dev-sandbox            1.80       28.70          5.00           50.00
# support-bot            0.80       15.20             —               —
# Total                 24.70      531.20        100.00         2000.00
```

Every evidence record includes model_used, cost, input_tokens, output_tokens, and duration_ms. Export with talon audit export --format csv and you can answer: which model is burning the most, which caller is growing fastest, where tokens are wasted on retries.
Rate limiting complements cost caps for the speed-of-spend problem. Cost caps say “no more than $50 today.” Rate limits say “no more than 60 requests per minute.” Together they catch both the slow bleed and the fast burst:
```yaml
rate_limits:
  global_requests_per_min: 300     # shared across all callers
  per_caller_requests_per_min: 60  # per-caller cap — slows runaway agents
```

A Full Config — All Three Together
Here’s what I actually run for three callers, each with different tool, PII, and cost policies:
```yaml
gateway:
  enabled: true
  listen_prefix: "/v1/proxy"
  mode: "enforce"
  providers:
    openai:
      enabled: true
      secret_name: "openai-api-key"  # real key in encrypted vault, never in client config
      base_url: "https://api.openai.com"
      allowed_models: ["gpt-4o", "gpt-4o-mini", "gpt-4-turbo"]
  callers:
    - name: "production-agent"
      api_key: "talon-gw-prod-001"   # caller token — not the OpenAI key
      tenant_id: "default"
      team: "engineering"
      allowed_providers: ["openai"]
      policy_overrides:
        max_daily_cost: 50.00
        max_monthly_cost: 1000.00
        pii_action: "redact"         # scrub PII from requests
        response_pii_action: "warn"  # log PII in responses, don't block
        allowed_models: ["gpt-4o", "gpt-4o-mini"]
        forbidden_tools: ["delete_*", "admin_*", "drop_*", "send_*"]
    - name: "internal-bot"
      api_key: "talon-gw-internal-001"
      tenant_id: "default"
      team: "support"
      allowed_providers: ["openai"]
      policy_overrides:
        max_daily_cost: 10.00
        max_monthly_cost: 200.00
        pii_action: "redact"
        response_pii_action: "redact"
        allowed_tools: ["search_kb", "read_ticket", "create_draft"]  # strict allowlist
        tool_policy_action: "block"  # reject if any other tool appears
    - name: "dev-sandbox"
      api_key: "talon-gw-dev-002"
      tenant_id: "default"
      team: "engineering"
      allowed_providers: ["openai"]
      policy_overrides:
        max_daily_cost: 5.00         # Saturday-proof
        max_monthly_cost: 50.00
        pii_action: "redact"         # scrub real emails from test prompts
        allowed_models: ["gpt-4o-mini"]  # cheapest model only
  default_policy:
    default_pii_action: "warn"
    response_pii_action: "warn"
    max_daily_cost: 100.00
    max_monthly_cost: 2000.00
    require_caller_id: true
    log_prompts: true
    tool_policy_action: "filter"
    forbidden_tools:
      - "delete_*"
      - "admin_*"
      - "export_all_*"
      - "bulk_*"
      - "rm_*"
      - "drop_*"
  attachment_policy:
    action: "warn"
    injection_action: "block"  # block prompt injection in file attachments
    max_file_size_mb: 10
  rate_limits:
    global_requests_per_min: 300
    per_caller_requests_per_min: 60
  timeouts:
    connect_timeout: 10s
    request_timeout: 120s
    stream_idle_timeout: 60s
```

Three callers, three risk profiles. production-agent gets a generous budget, PII redaction on input, and a blocklist of destructive and send tools. internal-bot gets a strict allowlist (three tools, nothing else), PII redaction both ways, and a tighter budget. dev-sandbox gets the cheapest model, PII redaction (no more testing with real emails), and a $5/day ceiling.
The clients don’t know about any of this. They point at the gateway URL with their caller key. The gateway does the rest.
When This Is the Wrong Choice
A gateway adds a hop. If you’re running a single script on your laptop and you’re the only user, it’s overhead for no benefit.
If you need the absolute lowest first-token latency and you’re at the edge, PII scanning on streaming responses adds buffering time. The passthrough path (pii_action: "allow") is ~1ms overhead, but redaction on a long stream is measurable.
If your agents only have read-only tools and never touch sensitive data, the risk profile is lower. Still worth auditing, but the urgency drops.
If you’re not dealing with customer PII yet — pre-revenue, purely internal — the compliance angle is less pressing. But the moment you start processing real user data or fall under NIS2 scope, the gateway goes from “nice to have” to “how did we not have this.”
And if you need policy on every individual tool invocation — not just what the model is told about, but what happens when the tool runs — a gateway isn’t enough. That’s a different shape: an MCP proxy or full agent runner with per-tool policy.
Final Thought
Calendar invites to real people. $385 on a weekend loop. IBANs in plaintext for four months. Every one of those happened because there was nothing between the agent and the API — no filter on what tools the model could see, no scan on what data was in the prompt, no ceiling on what it could spend.
The fix is the same pattern we’ve been using on HTTP traffic for twenty years: a reverse proxy with policy. It just hadn’t been applied to LLM APIs yet.
talon init takes fifteen minutes. The difference between “the agent booked a meeting with a stranger” and “the agent tried and the gateway said no” is one YAML file.
Check out the project on GitHub for more.