AI Gateway Architecture

Deep dive into the AI Gateway internals and design decisions.

Component Overview

┌─────────────────────────────────────────────────────────────┐
│                         AI Gateway                          │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐      │
│  │   Router    │───▶│   Cache     │───▶│   Metrics   │      │
│  └──────┬──────┘    └─────────────┘    └─────────────┘      │
│         │                                                   │
│         ▼                                                   │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐      │
│  │   Budget    │◀──▶│ Rate Limit  │◀──▶│   Audit     │      │
│  └──────┬──────┘    └─────────────┘    └─────────────┘      │
│         │                                                   │
│         ▼                                                   │
│  ┌─────────────────────────────────────────────────────┐    │
│  │                  Provider Adapters                  │    │
│  ├─────────────┬─────────────┬─────────────┬───────────┤    │
│  │   OpenAI    │  Anthropic  │   Ollama    │  Vercel   │    │
│  └─────────────┴─────────────┴─────────────┴───────────┘    │
│                                                             │
└─────────────────────────────────────────────────────────────┘
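
All four adapters plug into the same provider contract. The exact types are not spelled out on this page, so the sketch below reconstructs them from the snippets that follow; treat the field list as illustrative rather than exhaustive.

interface ChatMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

interface ChatRequest {
  tenantId: string;
  model: string;
  messages: ChatMessage[];
  options?: {
    temperature?: number;
    maxTokens?: number;
    skipCache?: boolean;
    stream?: boolean; // assumed; referenced by the caching rules below
  };
}

interface ChatResponse {
  message: ChatMessage; // assumed shape
  usage: { promptTokens: number; completionTokens: number };
}

interface AIProvider {
  supportedModels: string[];
  isLocal?: boolean; // assumed flag, used by the routing sketch below
  chat(request: ChatRequest): Promise<ChatResponse>;
}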

Request Flow

1. Request Ingress

// Request arrives at gateway
const request = {
  tenantId: 'tenant-123',
  model: 'gpt-4o-mini',
  messages: [...],
  options: {
    temperature: 0.7,
    maxTokens: 1024
  }
};

2. Authentication & Context

// Validate tenant and build context
const context = await buildContext({
  tenantId: request.tenantId,
  realmKey: request.realmKey,
  userId: request.userId
});
 
// Check tenant entitlements
const entitlements = await getEntitlements(context);
if (!entitlements.aiEnabled) {
  throw new ForbiddenError('AI not enabled for tenant');
}

3. Budget Check

// Check if tenant has budget remaining
const budget = await checkBudget(context, request.model);
if (budget.remaining <= 0) {
  throw new BudgetExceededError('Monthly AI budget exceeded');
}
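
checkBudget is not shown above; a minimal version compares month-to-date spend against the tenant's budget. The two store helpers are hypothetical, and per-model budgets are omitted:

declare function getMonthlySpend(tenantId: string): Promise<number>;
declare function getMonthlyBudget(tenantId: string): Promise<number>;

// Sketch only: `model` is accepted for per-model budgets but unused here.
async function checkBudget(
  context: { tenantId: string },
  model: string
): Promise<{ remaining: number }> {
  const spent = await getMonthlySpend(context.tenantId);
  const limit = await getMonthlyBudget(context.tenantId);
  return { remaining: limit - spent };
}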

4. Cache Lookup

// Check cache for an identical request, unless the caller opted out
const cacheKey = computeCacheKey(request);
if (!request.options?.skipCache) {
  const cached = await cache.get(cacheKey);
  if (cached) {
    return cached;
  }
}

5. Rate Limiting

// Apply rate limits
const rateLimit = await checkRateLimit(context, request.model);
if (rateLimit.exceeded) {
  throw new RateLimitError('Rate limit exceeded', {
    retryAfter: rateLimit.retryAfter
  });
}
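
One plausible implementation of checkRateLimit is a fixed-window counter in a Redis-style store. The kv client and key scheme here are assumptions, not the gateway's actual storage:

declare const kv: {
  incr(key: string): Promise<number>;
  expire(key: string, seconds: number): Promise<void>;
};

async function checkRateLimit(context: { tenantId: string }, model: string) {
  const windowSeconds = 60;
  const nowSeconds = Math.floor(Date.now() / 1000);
  const key = `ratelimit:${context.tenantId}:${model}:${Math.floor(nowSeconds / windowSeconds)}`;

  const count = await kv.incr(key);
  if (count === 1) {
    await kv.expire(key, windowSeconds); // first request in the window sets the TTL
  }

  const limit = 60; // illustrative; in practice this comes from the tenant's tier
  return {
    exceeded: count > limit,
    retryAfter: windowSeconds - (nowSeconds % windowSeconds),
  };
}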

6. Provider Routing

// Select provider based on model and availability
const provider = selectProvider(request.model, {
  preferLocal: context.preferences.preferLocal,
  fallbackEnabled: true
});
 
// Execute request
const response = await provider.chat(request);
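
selectProvider is not shown either; a plausible shape matches the model against each adapter's supportedModels list. The registry is hypothetical, and fallbackEnabled is handled later by the fallback chain:

declare const providerRegistry: AIProvider[];

function selectProvider(
  model: string,
  opts: { preferLocal?: boolean; fallbackEnabled?: boolean }
): AIProvider {
  const candidates = providerRegistry.filter(p => p.supportedModels.includes(model));
  if (candidates.length === 0) {
    throw new Error(`No provider supports model: ${model}`);
  }
  // Prefer a local adapter (e.g. Ollama) when the tenant asks for it
  if (opts.preferLocal) {
    const local = candidates.find(p => p.isLocal);
    if (local) return local;
  }
  return candidates[0];
}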

7. Response Processing

// Record usage
await recordUsage(context, {
  model: request.model,
  inputTokens: response.usage.promptTokens,
  outputTokens: response.usage.completionTokens,
  cost: calculateCost(response.usage, request.model)
});
 
// Cache response
await cache.set(cacheKey, response, { ttl: 3600 });
 
// Audit log
await audit.log({
  event: 'ai.chat.complete',
  context,
  request: redact(request),
  response: summarize(response)
});
 
return response;

Provider Adapters

OpenAI Adapter

import OpenAI from 'openai';

class OpenAIAdapter implements AIProvider {
  async chat(request: ChatRequest): Promise<ChatResponse> {
    const client = new OpenAI({ apiKey: this.apiKey });
 
    const response = await client.chat.completions.create({
      model: this.mapModel(request.model),
      messages: request.messages,
      temperature: request.options?.temperature,
      max_tokens: request.options?.maxTokens,
    });
 
    return this.normalizeResponse(response);
  }
 
  supportedModels = ['gpt-4o', 'gpt-4o-mini', 'gpt-4-turbo'];
}

Anthropic Adapter

import Anthropic from '@anthropic-ai/sdk';

class AnthropicAdapter implements AIProvider {
  async chat(request: ChatRequest): Promise<ChatResponse> {
    const client = new Anthropic({ apiKey: this.apiKey });
 
    const response = await client.messages.create({
      model: this.mapModel(request.model),
      messages: this.convertMessages(request.messages),
      max_tokens: request.options?.maxTokens || 1024,
    });
 
    return this.normalizeResponse(response);
  }
 
  supportedModels = ['claude-3-opus', 'claude-3-sonnet', 'claude-3-haiku'];
}
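
convertMessages is needed because Anthropic's Messages API accepts only user and assistant roles in the messages array; system prompts travel in a separate system parameter. A minimal sketch, dropping system handling for brevity:

// Sketch: a fuller version would return system messages separately so the
// adapter can pass them via the API's `system` parameter.
function convertMessages(messages: ChatMessage[]) {
  return messages
    .filter(m => m.role !== 'system')
    .map(m => ({ role: m.role as 'user' | 'assistant', content: m.content }));
}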

Ollama Adapter

class OllamaAdapter implements AIProvider {
  async chat(request: ChatRequest): Promise<ChatResponse> {
    const response = await fetch(`${this.baseUrl}/api/chat`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        model: request.model,
        messages: request.messages,
        stream: false,
      }),
    });

    if (!response.ok) {
      throw new Error(`Ollama request failed: ${response.status} ${response.statusText}`);
    }

    return this.normalizeResponse(await response.json());
  }
 
  supportedModels = ['llama3', 'mistral', 'codellama', 'nomic-embed-text'];
}

Caching Strategy

Cache Key Computation

import crypto from 'node:crypto';

function computeCacheKey(request: ChatRequest): string {
  const normalized = {
    model: request.model,
    messages: request.messages.map(m => ({
      role: m.role,
      content: m.content
    })),
    temperature: request.options?.temperature ?? 0.7, // ?? so an explicit 0 is preserved
  };
 
  return crypto
    .createHash('sha256')
    .update(JSON.stringify(normalized))
    .digest('hex');
}

Cache Invalidation

  • TTL: 1 hour for deterministic responses
  • Invalidate on: model update, tenant config change
  • Skip cache: streaming requests and high temperature (>0.9); see the sketch below
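
Put together, the cache decision might look like this (stream is an assumed option name):

// Sketch of the rules above; returns false when a response must not be cached.
function shouldCache(request: ChatRequest): boolean {
  if (request.options?.skipCache) return false; // explicit opt-out
  if (request.options?.stream) return false;    // streaming responses are never cached
  const temperature = request.options?.temperature ?? 0.7;
  return temperature <= 0.9;                    // high temperature is effectively non-deterministic
}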

Budget Management

Cost Calculation

const COST_PER_1K_TOKENS: Record<string, { input: number; output: number }> = {
  'gpt-4o': { input: 0.0025, output: 0.01 },
  'gpt-4o-mini': { input: 0.00015, output: 0.0006 },
  'claude-3-opus': { input: 0.015, output: 0.075 },
  'claude-3-sonnet': { input: 0.003, output: 0.015 },
  'ollama-*': { input: 0, output: 0 }, // Free (local)
};
 
function calculateCost(usage: Usage, model: string): number {
  // Models without a configured rate (i.e. local Ollama models) are free
  const rates = COST_PER_1K_TOKENS[model] ?? { input: 0, output: 0 };
  return (
    (usage.promptTokens / 1000) * rates.input +
    (usage.completionTokens / 1000) * rates.output
  );
}
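
For example, a gpt-4o-mini request that uses 2,000 prompt tokens and 500 completion tokens costs (2000 / 1000) × 0.00015 + (500 / 1000) × 0.0006 = $0.0006.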

Budget Tiers

Tier          Monthly Budget    Rate Limit
Free          $5                10 req/min
Starter       $50               60 req/min
Pro           $500              300 req/min
Enterprise    Custom            Custom
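
In configuration form, the table might be encoded roughly as follows (the constant name and the null-for-custom convention are illustrative):

const BUDGET_TIERS = {
  free:       { monthlyBudgetUsd: 5,    requestsPerMinute: 10 },
  starter:    { monthlyBudgetUsd: 50,   requestsPerMinute: 60 },
  pro:        { monthlyBudgetUsd: 500,  requestsPerMinute: 300 },
  enterprise: { monthlyBudgetUsd: null, requestsPerMinute: null }, // negotiated per contract
} as const;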

Fallback Strategy

const FALLBACK_CHAIN: Record<string, string[]> = {
  'gpt-4o': ['gpt-4o-mini', 'claude-3-sonnet', 'llama3'],
  'claude-3-opus': ['claude-3-sonnet', 'gpt-4o', 'llama3'],
  'llama3': ['gpt-4o-mini'], // Fallback to cloud if local fails
};
 
async function executeWithFallback(request: ChatRequest) {
  const chain = [request.model, ...(FALLBACK_CHAIN[request.model] ?? [])];
 
  for (const model of chain) {
    try {
      const provider = getProviderForModel(model);
      return await provider.chat({ ...request, model });
    } catch (error) {
      if (isRetryable(error)) continue;
      throw error;
    }
  }
 
  throw new AllProvidersFailedError();
}
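
isRetryable decides whether the chain advances to the next model. A reasonable default, assuming provider errors carry an HTTP-like status code (the error shape is an assumption):

// Sketch: timeouts, rate limits, and provider 5xx errors move on to the next model.
function isRetryable(error: unknown): boolean {
  const status = (error as { status?: number }).status;
  return status === 408 || status === 429 || (status !== undefined && status >= 500);
}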

Observability

Metrics Collected

  • Request count by model, tenant, status (see the sketch after this list)
  • Latency percentiles (p50, p95, p99)
  • Token usage by model
  • Cache hit rate
  • Budget utilization
  • Error rate by type
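
A sketch of how these might be emitted per request; the MetricsClient interface and metric names are placeholders, not the gateway's actual telemetry API:

interface MetricsClient {
  increment(name: string, labels?: Record<string, string>): void;
  timing(name: string, ms: number, labels?: Record<string, string>): void;
}

// Covers the first bullets: request counts, latency, and cache hit/miss.
function recordRequestMetrics(
  metrics: MetricsClient,
  r: { model: string; tenantId: string; status: string; latencyMs: number; cached: boolean }
): void {
  const labels = { model: r.model, tenant: r.tenantId, status: r.status };
  metrics.increment('ai.requests.total', labels);
  metrics.timing('ai.request.latency_ms', r.latencyMs, labels);
  metrics.increment(r.cached ? 'ai.cache.hit' : 'ai.cache.miss', { model: r.model });
}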

Audit Events

interface AIAuditEvent {
  timestamp: string;
  eventType: 'ai.chat.start' | 'ai.chat.complete' | 'ai.chat.error';
  tenantId: string;
  userId?: string;
  model: string;
  inputTokens?: number;
  outputTokens?: number;
  latencyMs: number;
  cached: boolean;
  errorCode?: string;
}

Security Considerations

  1. API Key Isolation: Keys stored in Vercel, never in code
  2. Request Redaction: PII stripped from audit logs (see the sketch below)
  3. Tenant Isolation: Strict context boundary per request
  4. Rate Limiting: Prevents abuse and cost overruns
  5. Input Validation: All requests validated before processing
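
The redact() call from step 7 is what enforces item 2. A minimal sketch; the single email pattern stands in for a fuller PII ruleset:

// Illustrative redaction pass, not the gateway's actual rules.
function redact(request: ChatRequest): ChatRequest {
  const EMAIL = /[\w.+-]+@[\w-]+\.[\w.-]+/g;
  return {
    ...request,
    messages: request.messages.map(m => ({
      ...m,
      content: m.content.replace(EMAIL, '[REDACTED_EMAIL]'),
    })),
  };
}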