AI Gateway Architecture
Deep dive into the AI Gateway internals and design decisions.
Component Overview
┌──────────────────────────────────────────────────────────────────┐
│ AI Gateway │
├──────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Router │───▶│ Cache │───▶│ Metrics │ │
│ └──────┬──────┘ └─────────────┘ └─────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Budget │◀──▶│ Rate Limit │◀──▶│ Audit │ │
│ └──────┬──────┘ └─────────────┘ └─────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Provider Adapters │ │
│ ├─────────────┬─────────────┬─────────────┬──────────┤ │
│ │ OpenAI │ Anthropic │ Ollama │ Vercel │ │
│ └─────────────┴─────────────┴─────────────┴──────────┘ │
│ │
└──────────────────────────────────────────────────────────────────┘
Request Flow
1. Request Ingress
// Request arrives at gateway
const request = {
tenantId: 'tenant-123',
model: 'gpt-4o-mini',
messages: [...],
options: {
temperature: 0.7,
maxTokens: 1024
}
};
2. Authentication & Context
// Validate tenant and build context
const context = await buildContext({
tenantId: request.tenantId,
realmKey: request.realmKey,
userId: request.userId
});
// Check tenant entitlements
const entitlements = await getEntitlements(context);
if (!entitlements.aiEnabled) {
throw new ForbiddenError('AI not enabled for tenant');
}
3. Budget Check
// Check if tenant has budget remaining
const budget = await checkBudget(context, request.model);
if (budget.remaining <= 0) {
throw new BudgetExceededError('Monthly AI budget exceeded');
}
4. Cache Lookup
// Check cache for identical request
const cacheKey = computeCacheKey(request);
const cached = await cache.get(cacheKey);
if (cached && !request.options?.skipCache) {
return cached;
}
5. Rate Limiting
// Apply rate limits
const rateLimit = await checkRateLimit(context, request.model);
if (rateLimit.exceeded) {
throw new RateLimitError('Rate limit exceeded', {
retryAfter: rateLimit.retryAfter
});
}
6. Provider Routing
// Select provider based on model and availability
const provider = selectProvider(request.model, {
preferLocal: context.preferences.preferLocal,
fallbackEnabled: true
});
// Execute request
const response = await provider.chat(request);
7. Response Processing
// Record usage
await recordUsage(context, {
model: request.model,
inputTokens: response.usage.promptTokens,
outputTokens: response.usage.completionTokens,
cost: calculateCost(response.usage, request.model)
});
// Cache response
await cache.set(cacheKey, response, { ttl: 3600 });
// Audit log
await audit.log({
event: 'ai.chat.complete',
context,
request: redact(request),
response: summarize(response)
});
return response;
Provider Adapters
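Each adapter implements a shared AIProvider contract. The exact interface is not reproduced in this document; the sketch below is an assumption, with field names inferred from how the request flow above uses them (messages, options, usage).

// Hedged sketch of the shared types the adapters implement.
// Field names are inferred from the request flow above, not copied from source.
interface ChatMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

interface ChatRequest {
  tenantId: string;
  model: string;
  messages: ChatMessage[];
  options?: {
    temperature?: number;
    maxTokens?: number;
    skipCache?: boolean;
    stream?: boolean;
  };
}

interface ChatResponse {
  content: string;
  usage: {
    promptTokens: number;
    completionTokens: number;
  };
}

interface AIProvider {
  supportedModels: string[];
  chat(request: ChatRequest): Promise<ChatResponse>;
}

Normalizing every provider's output into a single ChatResponse shape is what keeps the usage recording and cost calculation later on this page provider-agnostic.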
OpenAI Adapter
import OpenAI from 'openai';

class OpenAIAdapter implements AIProvider {
constructor(private apiKey: string) {}
async chat(request: ChatRequest): Promise<ChatResponse> {
const client = new OpenAI({ apiKey: this.apiKey });
const response = await client.chat.completions.create({
model: this.mapModel(request.model),
messages: request.messages,
temperature: request.options?.temperature,
max_tokens: request.options?.maxTokens,
});
return this.normalizeResponse(response);
}
supportedModels = ['gpt-4o', 'gpt-4o-mini', 'gpt-4-turbo'];
}
Anthropic Adapter
import Anthropic from '@anthropic-ai/sdk';

class AnthropicAdapter implements AIProvider {
constructor(private apiKey: string) {}
async chat(request: ChatRequest): Promise<ChatResponse> {
const client = new Anthropic({ apiKey: this.apiKey });
const response = await client.messages.create({
model: this.mapModel(request.model),
messages: this.convertMessages(request.messages),
max_tokens: request.options?.maxTokens || 1024,
});
return this.normalizeResponse(response);
}
supportedModels = ['claude-3-opus', 'claude-3-sonnet', 'claude-3-haiku'];
}
Ollama Adapter
class OllamaAdapter implements AIProvider {
// Ollama's default local API endpoint
constructor(private baseUrl = 'http://localhost:11434') {}
async chat(request: ChatRequest): Promise<ChatResponse> {
const response = await fetch(`${this.baseUrl}/api/chat`, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
model: request.model,
messages: request.messages,
stream: false,
}),
});
return this.normalizeResponse(await response.json());
}
supportedModels = ['llama3', 'mistral', 'codellama', 'nomic-embed-text'];
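  // Hedged sketch of normalization into the common ChatResponse shape.
  // Field names (message.content, prompt_eval_count, eval_count) follow
  // Ollama's non-streaming /api/chat response; treat them as assumptions here.
  private normalizeResponse(raw: any): ChatResponse {
    return {
      content: raw.message?.content ?? '',
      usage: {
        promptTokens: raw.prompt_eval_count ?? 0,
        completionTokens: raw.eval_count ?? 0,
      },
    };
  }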
}
Caching Strategy
Cache Key Computation
import crypto from 'node:crypto';

function computeCacheKey(request: ChatRequest): string {
const normalized = {
model: request.model,
messages: request.messages.map(m => ({
role: m.role,
content: m.content
})),
temperature: request.options?.temperature || 0.7,
};
return crypto
.createHash('sha256')
.update(JSON.stringify(normalized))
.digest('hex');
}
Cache Invalidation
- TTL: 1 hour for deterministic responses
- Invalidate on: model update, tenant config change
- Skip cache: streaming, high temperature (>0.9); see the sketch below
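A minimal sketch of how these rules could gate the cache write in step 7. The shouldCache helper is hypothetical, and the version-bump comment describes one possible way to realize the invalidation triggers above, not a confirmed implementation.

// Hypothetical helper applying the skip rules above before cache.set().
function shouldCache(request: ChatRequest): boolean {
  if (request.options?.stream) return false;                       // never cache streams
  if ((request.options?.temperature ?? 0.7) > 0.9) return false;   // too non-deterministic
  if (request.options?.skipCache) return false;                    // caller opted out
  return true;
}

// One way to handle "invalidate on model update / tenant config change":
// fold a config version into the cache key so stale entries simply miss.
if (shouldCache(request)) {
  await cache.set(cacheKey, response, { ttl: 3600 }); // 1-hour TTL, as above
}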
Budget Management
Cost Calculation
const COST_PER_1K_TOKENS = {
'gpt-4o': { input: 0.0025, output: 0.01 },
'gpt-4o-mini': { input: 0.00015, output: 0.0006 },
'claude-3-opus': { input: 0.015, output: 0.075 },
'claude-3-sonnet': { input: 0.003, output: 0.015 },
'ollama-*': { input: 0, output: 0 }, // Free (local)
};
function calculateCost(usage: Usage, model: string): number {
// Models missing from the table (e.g. local Ollama models) incur no per-token cost
const rates = COST_PER_1K_TOKENS[model] ?? { input: 0, output: 0 };
return (
(usage.promptTokens / 1000) * rates.input +
(usage.completionTokens / 1000) * rates.output
);
}
Budget Tiers
| Tier | Monthly Budget | Rate Limit |
|---|---|---|
| Free | $5 | 10 req/min |
| Starter | $50 | 60 req/min |
| Pro | $500 | 300 req/min |
| Enterprise | Custom | Custom |
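One way the tiers above might be encoded in configuration; the constant name and field names are assumptions, not taken from the source.

// Hypothetical encoding of the budget tiers table.
const BUDGET_TIERS = {
  free:       { monthlyBudgetUsd: 5,    requestsPerMinute: 10 },
  starter:    { monthlyBudgetUsd: 50,   requestsPerMinute: 60 },
  pro:        { monthlyBudgetUsd: 500,  requestsPerMinute: 300 },
  enterprise: { monthlyBudgetUsd: null, requestsPerMinute: null }, // custom, set per contract
} as const;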
Fallback Strategy
const FALLBACK_CHAIN = {
'gpt-4o': ['gpt-4o-mini', 'claude-3-sonnet', 'llama3'],
'claude-3-opus': ['claude-3-sonnet', 'gpt-4o', 'llama3'],
'llama3': ['gpt-4o-mini'], // Fallback to cloud if local fails
};
async function executeWithFallback(request: ChatRequest) {
const chain = [request.model, ...(FALLBACK_CHAIN[request.model] ?? [])];
for (const model of chain) {
try {
const provider = getProviderForModel(model);
return await provider.chat({ ...request, model });
} catch (error) {
if (isRetryable(error)) continue;
throw error;
}
}
throw new AllProvidersFailedError();
}
Observability
Metrics Collected
- Request count by model, tenant, status
- Latency percentiles (p50, p95, p99)
- Token usage by model
- Cache hit rate
- Budget utilization
- Error rate by type
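A hedged sketch of how a couple of these metrics could be recorded. The document does not name a metrics library; prom-client and the metric names below are assumptions.

import { Counter, Histogram } from 'prom-client';

// Request count by model, tenant, and status
const requestsTotal = new Counter({
  name: 'ai_gateway_requests_total',
  help: 'AI Gateway requests',
  labelNames: ['model', 'tenant', 'status'],
});

// Latency distribution, from which p50/p95/p99 are derived
const latencyMs = new Histogram({
  name: 'ai_gateway_latency_ms',
  help: 'AI Gateway request latency in milliseconds',
  labelNames: ['model'],
  buckets: [50, 100, 250, 500, 1000, 2500, 5000, 10000],
});

// Example: record one successful gpt-4o-mini request that took 420 ms
requestsTotal.labels('gpt-4o-mini', 'tenant-123', 'ok').inc();
latencyMs.labels('gpt-4o-mini').observe(420);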
Audit Events
interface AIAuditEvent {
timestamp: string;
eventType: 'ai.chat.start' | 'ai.chat.complete' | 'ai.chat.error';
tenantId: string;
userId?: string;
model: string;
inputTokens?: number;
outputTokens?: number;
latencyMs: number;
cached: boolean;
errorCode?: string;
}
Security Considerations
- API Key Isolation: Keys stored in Vercel, never in code
- Request Redaction: PII stripped from audit logs (see the sketch below)
- Tenant Isolation: Strict context boundary per request
- Rate Limiting: Prevents abuse and cost overruns
- Input Validation: All requests validated before processing
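A minimal sketch of the redact() call used in the response-processing step above. Its real behavior is not specified in this document, so the pattern and placeholder below are illustrative assumptions.

// Hypothetical redaction pass applied to requests before audit logging.
const EMAIL_RE = /[\w.+-]+@[\w-]+\.[\w.-]+/g;

function redact(request: ChatRequest): ChatRequest {
  return {
    ...request,
    messages: request.messages.map(m => ({
      ...m,
      // Strip obvious PII patterns; a real policy would cover more than email addresses
      content: m.content.replace(EMAIL_RE, '[REDACTED_EMAIL]'),
    })),
  };
}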