AI Gateway Architecture

Deep dive into the AI Gateway internals and design decisions.

Component Overview

┌─────────────────────────────────────────────────────────────┐
│                         AI Gateway                          │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐      │
│  │   Router    │───▶│   Cache     │───▶│   Metrics   │      │
│  └──────┬──────┘    └─────────────┘    └─────────────┘      │
│         │                                                   │
│         ▼                                                   │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐      │
│  │   Budget    │◀──▶│ Rate Limit  │◀──▶│   Audit     │      │
│  └──────┬──────┘    └─────────────┘    └─────────────┘      │
│         │                                                   │
│         ▼                                                   │
│  ┌─────────────────────────────────────────────────────┐    │
│  │                  Provider Adapters                  │    │
│  ├─────────────┬─────────────┬─────────────┬───────────┤    │
│  │   OpenAI    │  Anthropic  │   Ollama    │  Vercel   │    │
│  └─────────────┴─────────────┴─────────────┴───────────┘    │
│                                                             │
└─────────────────────────────────────────────────────────────┘
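
All four adapters plug into the same provider contract. The exact types are not spelled out on this page, so the sketch below reconstructs them from the snippets that follow; treat the field list as illustrative rather than exhaustive.

interface ChatMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

interface ChatRequest {
  tenantId: string;
  model: string;
  messages: ChatMessage[];
  options?: {
    temperature?: number;
    maxTokens?: number;
    skipCache?: boolean;
    stream?: boolean; // assumed; referenced by the caching rules below
  };
}

interface ChatResponse {
  message: ChatMessage; // assumed shape
  usage: { promptTokens: number; completionTokens: number };
}

interface AIProvider {
  supportedModels: string[];
  isLocal?: boolean; // assumed flag, used by the routing sketch below
  chat(request: ChatRequest): Promise<ChatResponse>;
}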

Request Flow

1. Request Ingress

// Request arrives at gateway
const request = {
  tenantId: 'tenant-123',
  model: 'gpt-4o-mini',
  messages: [...],
  options: {
    temperature: 0.7,
    maxTokens: 1024
  }
};

2. Authentication & Context

// Validate tenant and build context
const context = await buildContext({
  tenantId: request.tenantId,
  realmKey: request.realmKey,
  userId: request.userId
});
 
// Check tenant entitlements
const entitlements = await getEntitlements(context);
if (!entitlements.aiEnabled) {
  throw new ForbiddenError('AI not enabled for tenant');
}

3. Budget Check

// Check if tenant has budget remaining
const budget = await checkBudget(context, request.model);
if (budget.remaining <= 0) {
  throw new BudgetExceededError('Monthly AI budget exceeded');
}
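
checkBudget is not shown above; a minimal version compares month-to-date spend against the tenant's budget. The two store helpers are hypothetical, and per-model budgets are omitted:

declare function getMonthlySpend(tenantId: string): Promise<number>;
declare function getMonthlyBudget(tenantId: string): Promise<number>;

// Sketch only: `model` is accepted for per-model budgets but unused here.
async function checkBudget(
  context: { tenantId: string },
  model: string
): Promise<{ remaining: number }> {
  const spent = await getMonthlySpend(context.tenantId);
  const limit = await getMonthlyBudget(context.tenantId);
  return { remaining: limit - spent };
}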

4. Cache Lookup

// Check cache for an identical request, unless the caller opted out
const cacheKey = computeCacheKey(request);
if (!request.options?.skipCache) {
  const cached = await cache.get(cacheKey);
  if (cached) {
    return cached;
  }
}

5. Rate Limiting

// Apply rate limits
const rateLimit = await checkRateLimit(context, request.model);
if (rateLimit.exceeded) {
  throw new RateLimitError('Rate limit exceeded', {
    retryAfter: rateLimit.retryAfter
  });
}
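
One plausible implementation of checkRateLimit is a fixed-window counter in a Redis-style store. The kv client and key scheme here are assumptions, not the gateway's actual storage:

declare const kv: {
  incr(key: string): Promise<number>;
  expire(key: string, seconds: number): Promise<void>;
};

async function checkRateLimit(context: { tenantId: string }, model: string) {
  const windowSeconds = 60;
  const nowSeconds = Math.floor(Date.now() / 1000);
  const key = `ratelimit:${context.tenantId}:${model}:${Math.floor(nowSeconds / windowSeconds)}`;

  const count = await kv.incr(key);
  if (count === 1) {
    await kv.expire(key, windowSeconds); // first request in the window sets the TTL
  }

  const limit = 60; // illustrative; in practice this comes from the tenant's tier
  return {
    exceeded: count > limit,
    retryAfter: windowSeconds - (nowSeconds % windowSeconds),
  };
}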

6. Provider Routing

// Select provider based on model and availability
const provider = selectProvider(request.model, {
  preferLocal: context.preferences.preferLocal,
  fallbackEnabled: true
});
 
// Execute request
const response = await provider.chat(request);
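
selectProvider is not shown either; a plausible shape matches the model against each adapter's supportedModels list. The registry is hypothetical, and fallbackEnabled is handled later by the fallback chain:

declare const providerRegistry: AIProvider[];

function selectProvider(
  model: string,
  opts: { preferLocal?: boolean; fallbackEnabled?: boolean }
): AIProvider {
  const candidates = providerRegistry.filter(p => p.supportedModels.includes(model));
  if (candidates.length === 0) {
    throw new Error(`No provider supports model: ${model}`);
  }
  // Prefer a local adapter (e.g. Ollama) when the tenant asks for it
  if (opts.preferLocal) {
    const local = candidates.find(p => p.isLocal);
    if (local) return local;
  }
  return candidates[0];
}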

7. Response Processing

// Record usage
await recordUsage(context, {
  model: request.model,
  inputTokens: response.usage.promptTokens,
  outputTokens: response.usage.completionTokens,
  cost: calculateCost(response.usage, request.model)
});
 
// Cache response
await cache.set(cacheKey, response, { ttl: 3600 });
 
// Audit log
await audit.log({
  event: 'ai.chat.complete',
  context,
  request: redact(request),
  response: summarize(response)
});
 
return response;

Provider Adapters

OpenAI Adapter

import OpenAI from 'openai';

class OpenAIAdapter implements AIProvider {
  async chat(request: ChatRequest): Promise<ChatResponse> {
    const client = new OpenAI({ apiKey: this.apiKey });
 
    const response = await client.chat.completions.create({
      model: this.mapModel(request.model),
      messages: request.messages,
      temperature: request.options?.temperature,
      max_tokens: request.options?.maxTokens,
    });
 
    return this.normalizeResponse(response);
  }
 
  supportedModels = ['gpt-4o', 'gpt-4o-mini', 'gpt-4-turbo'];
}

Anthropic Adapter

import Anthropic from '@anthropic-ai/sdk';

class AnthropicAdapter implements AIProvider {
  async chat(request: ChatRequest): Promise<ChatResponse> {
    const client = new Anthropic({ apiKey: this.apiKey });
 
    const response = await client.messages.create({
      model: this.mapModel(request.model),
      messages: this.convertMessages(request.messages),
      max_tokens: request.options?.maxTokens || 1024,
    });
 
    return this.normalizeResponse(response);
  }
 
  supportedModels = ['claude-3-opus', 'claude-3-sonnet', 'claude-3-haiku'];
}
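
convertMessages is needed because Anthropic's Messages API accepts only user and assistant roles in the messages array; system prompts travel in a separate system parameter. A minimal sketch, dropping system handling for brevity:

// Sketch: a fuller version would return system messages separately so the
// adapter can pass them via the API's `system` parameter.
function convertMessages(messages: ChatMessage[]) {
  return messages
    .filter(m => m.role !== 'system')
    .map(m => ({ role: m.role as 'user' | 'assistant', content: m.content }));
}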

Ollama Adapter

class OllamaAdapter implements AIProvider {
  async chat(request: ChatRequest): Promise<ChatResponse> {
    const response = await fetch(`${this.baseUrl}/api/chat`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        model: request.model,
        messages: request.messages,
        stream: false,
      }),
    });

    if (!response.ok) {
      throw new Error(`Ollama request failed: ${response.status} ${response.statusText}`);
    }

    return this.normalizeResponse(await response.json());
  }
 
  supportedModels = ['llama3', 'mistral', 'codellama', 'nomic-embed-text'];
}

Caching Strategy

Cache Key Computation

import crypto from 'node:crypto';

function computeCacheKey(request: ChatRequest): string {
  const normalized = {
    model: request.model,
    messages: request.messages.map(m => ({
      role: m.role,
      content: m.content
    })),
    temperature: request.options?.temperature ?? 0.7, // ?? so an explicit 0 is preserved
  };
 
  return crypto
    .createHash('sha256')
    .update(JSON.stringify(normalized))
    .digest('hex');
}

Cache Invalidation

  • TTL: 1 hour for deterministic responses
  • Invalidate on: model update, tenant config change
  • Skip cache: streaming requests and high temperature (>0.9); see the sketch below
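
Put together, the cache decision might look like this (stream is an assumed option name):

// Sketch of the rules above; returns false when a response must not be cached.
function shouldCache(request: ChatRequest): boolean {
  if (request.options?.skipCache) return false; // explicit opt-out
  if (request.options?.stream) return false;    // streaming responses are never cached
  const temperature = request.options?.temperature ?? 0.7;
  return temperature <= 0.9;                    // high temperature is effectively non-deterministic
}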

Budget Management

Cost Calculation

const COST_PER_1K_TOKENS: Record<string, { input: number; output: number }> = {
  'gpt-4o': { input: 0.0025, output: 0.01 },
  'gpt-4o-mini': { input: 0.00015, output: 0.0006 },
  'claude-3-opus': { input: 0.015, output: 0.075 },
  'claude-3-sonnet': { input: 0.003, output: 0.015 },
  'ollama-*': { input: 0, output: 0 }, // Free (local)
};
 
function calculateCost(usage: Usage, model: string): number {
  // Models without a configured rate (i.e. local Ollama models) are free
  const rates = COST_PER_1K_TOKENS[model] ?? { input: 0, output: 0 };
  return (
    (usage.promptTokens / 1000) * rates.input +
    (usage.completionTokens / 1000) * rates.output
  );
}
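
For example, a gpt-4o-mini request that uses 2,000 prompt tokens and 500 completion tokens costs (2000 / 1000) × 0.00015 + (500 / 1000) × 0.0006 = $0.0006.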

Budget Tiers

Tier          Monthly Budget    Rate Limit
Free          $5                10 req/min
Starter       $50               60 req/min
Pro           $500              300 req/min
Enterprise    Custom            Custom
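
In configuration form, the table might be encoded roughly as follows (the constant name and the null-for-custom convention are illustrative):

const BUDGET_TIERS = {
  free:       { monthlyBudgetUsd: 5,    requestsPerMinute: 10 },
  starter:    { monthlyBudgetUsd: 50,   requestsPerMinute: 60 },
  pro:        { monthlyBudgetUsd: 500,  requestsPerMinute: 300 },
  enterprise: { monthlyBudgetUsd: null, requestsPerMinute: null }, // negotiated per contract
} as const;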

Fallback Strategy

const FALLBACK_CHAIN: Record<string, string[]> = {
  'gpt-4o': ['gpt-4o-mini', 'claude-3-sonnet', 'llama3'],
  'claude-3-opus': ['claude-3-sonnet', 'gpt-4o', 'llama3'],
  'llama3': ['gpt-4o-mini'], // Fallback to cloud if local fails
};
 
async function executeWithFallback(request: ChatRequest) {
  const chain = [request.model, ...(FALLBACK_CHAIN[request.model] ?? [])];
 
  for (const model of chain) {
    try {
      const provider = getProviderForModel(model);
      return await provider.chat({ ...request, model });
    } catch (error) {
      if (isRetryable(error)) continue;
      throw error;
    }
  }
 
  throw new AllProvidersFailedError();
}
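
isRetryable decides whether the chain advances to the next model. A reasonable default, assuming provider errors carry an HTTP-like status code (the error shape is an assumption):

// Sketch: timeouts, rate limits, and provider 5xx errors move on to the next model.
function isRetryable(error: unknown): boolean {
  const status = (error as { status?: number }).status;
  return status === 408 || status === 429 || (status !== undefined && status >= 500);
}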

Observability

Metrics Collected

  • Request count by model, tenant, status (see the sketch after this list)
  • Latency percentiles (p50, p95, p99)
  • Token usage by model
  • Cache hit rate
  • Budget utilization
  • Error rate by type
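
A sketch of how these might be emitted per request; the MetricsClient interface and metric names are placeholders, not the gateway's actual telemetry API:

interface MetricsClient {
  increment(name: string, labels?: Record<string, string>): void;
  timing(name: string, ms: number, labels?: Record<string, string>): void;
}

// Covers the first bullets: request counts, latency, and cache hit/miss.
function recordRequestMetrics(
  metrics: MetricsClient,
  r: { model: string; tenantId: string; status: string; latencyMs: number; cached: boolean }
): void {
  const labels = { model: r.model, tenant: r.tenantId, status: r.status };
  metrics.increment('ai.requests.total', labels);
  metrics.timing('ai.request.latency_ms', r.latencyMs, labels);
  metrics.increment(r.cached ? 'ai.cache.hit' : 'ai.cache.miss', { model: r.model });
}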

Audit Events

interface AIAuditEvent {
  timestamp: string;
  eventType: 'ai.chat.start' | 'ai.chat.complete' | 'ai.chat.error';
  tenantId: string;
  userId?: string;
  model: string;
  inputTokens?: number;
  outputTokens?: number;
  latencyMs: number;
  cached: boolean;
  errorCode?: string;
}

Security Considerations

  1. API Key Isolation: Keys stored in Vercel, never in code
  2. Request Redaction: PII stripped from audit logs (see the sketch below)
  3. Tenant Isolation: Strict context boundary per request
  4. Rate Limiting: Prevents abuse and cost overruns
  5. Input Validation: All requests validated before processing
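
The redact() call from step 7 is what enforces item 2. A minimal sketch; the single email pattern stands in for a fuller PII ruleset:

// Illustrative redaction pass, not the gateway's actual rules.
function redact(request: ChatRequest): ChatRequest {
  const EMAIL = /[\w.+-]+@[\w-]+\.[\w.-]+/g;
  return {
    ...request,
    messages: request.messages.map(m => ({
      ...m,
      content: m.content.replace(EMAIL, '[REDACTED_EMAIL]'),
    })),
  };
}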