---
name: ai-model-integration-apex
description: "Use when calling AI models from Apex code — including the aiplatform.ModelsAPI External Services wrapper, Einstein Platform Services (Vision and Language), response parsing, token lifecycle management, and caching strategies. Trigger keywords: aiplatform.ModelsAPI, createGenerations, createChatGenerations, createEmbeddings, Einstein Platform Services, Einstein Vision, Einstein Language, ModelsAPI callout, AI callout from Apex. NOT for Agentforce actions, Einstein Next Best Action, or Prompt Builder flows."
category: apex
salesforce-version: "Spring '25+"
well-architected-pillars:
  - Performance
  - Security
  - Reliability
tags:
  - Apex
  - AI
  - aiplatform-ModelsAPI
  - Einstein
  - callouts
  - Platform-Cache
  - External-Services
  - Queueable
triggers:
  - "how do I call an AI model from Apex using aiplatform.ModelsAPI"
  - "my Apex AI callout is hitting callout limits under moderate load"
  - "how should I manage the Einstein Platform Services JWT token in Apex to avoid re-fetching it every transaction"
  - "bulk AI processing from Apex is timing out or hitting governor limits"
  - "how do I parse the response from createChatGenerations or createGenerations in Apex"
inputs:
  - "target AI model API name (e.g. sfdc_ai__DefaultOpenAIGPT4OmniMini for ModelsAPI or Einstein Vision model URL)"
  - "transaction volume — synchronous single-record or bulk batch/Queueable"
  - "whether Platform Cache partition exists and its TTL budget"
  - "Einstein Request entitlement and callout budget for the org"
outputs:
  - "Apex implementation with correct aiplatform.ModelsAPI call pattern and typed response traversal"
  - "Platform Cache wrapper for token or response caching with fallback"
  - "Queueable chain design for bulk AI processing within governor limits"
  - "review findings against callout budget, entitlement risks, and token lifecycle gaps"
dependencies:
  - apex/platform-cache
  - apex/apex-queueable-patterns
  - apex/callouts-and-http-integrations
version: 1.0.0
author: Pranav Nagrecha
updated: 2026-04-12
---

# AI Model Integration Apex

This skill activates when Apex code needs to call an AI model — either through the `aiplatform.ModelsAPI` External Services wrapper (the modern path) or through legacy Einstein Platform Services that use an OAuth 2.0 JWT bearer token. The skill covers correct API usage, response parsing, token lifecycle management via Platform Cache, and asynchronous Queueable patterns for bulk AI workloads.

---

## Before Starting

Gather this context before working on anything in this domain:

- Which API surface is in use: `aiplatform.ModelsAPI` (modern, External Services-generated), legacy Einstein Vision/Language (REST with JWT), or a direct HTTP callout to an external LLM endpoint?
- What is the transaction volume — single-record synchronous, trigger-driven bulk, or scheduled batch?
- Does a Platform Cache partition exist in the org? What is the configured TTL?
- What are the org's Einstein Request entitlement limits, and what callout-per-transaction budget remains?
- Will this run inside a trigger, a Flow-invoked action, a Queueable, or a Batch class? Each context has different callout limits and restrictions.

---

## Core Concepts

### aiplatform.ModelsAPI — The Modern Apex Path

The `aiplatform.ModelsAPI` class is autogenerated by the External Services framework and provides three primary methods:

- `createGenerations` — for text generation (completion style)
- `createChatGenerations` — for chat-style prompts with message history
- `createEmbeddings` — for generating vector embeddings

Each method takes a single typed request object; the model API name (a string such as `sfdc_ai__DefaultOpenAIGPT4OmniMini`) is set on that request. Responses are typed objects; generated text surfaces at `response.Code200.generation.generatedText` for completion calls and at `response.Code200.generations[0].message.content` for chat calls.

Although the External Services wrapper hides the underlying authentication exchange, every call still counts against the org's Einstein Request entitlement budget and against the standard Apex callout governor limits (100 callouts per transaction, with a cumulative 120-second timeout across all callouts in the transaction). The wrapper does not provide retry or throttling logic — that responsibility belongs to the calling Apex.
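
A minimal sketch of a completion-style call follows the response path above. The wrapping class name and the generated type names (`createGenerations_Request`, `ModelsAPI_GenerationRequest`, `createGenerations_Response`) are assumptions to verify against the `aiplatform` classes available in your org:

```apex
// Minimal sketch. Verify the generated type names against the aiplatform classes in your org.
public with sharing class AiGenerationExample {

    public static String generateText(String prompt) {
        // Build the typed request; the model API name is set on the request itself
        aiplatform.ModelsAPI.createGenerations_Request request =
            new aiplatform.ModelsAPI.createGenerations_Request();
        request.modelName = 'sfdc_ai__DefaultOpenAIGPT4OmniMini';

        aiplatform.ModelsAPI_GenerationRequest body = new aiplatform.ModelsAPI_GenerationRequest();
        body.prompt = prompt;
        request.body = body;

        aiplatform.ModelsAPI modelsApi = new aiplatform.ModelsAPI();
        aiplatform.ModelsAPI.createGenerations_Response response = modelsApi.createGenerations(request);

        // Code200 is only populated on success; null-check before traversing the typed path
        if (response?.Code200?.generation != null) {
            return response.Code200.generation.generatedText;
        }
        return null;
    }
}
```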

### Legacy Einstein Platform Services — JWT Token Lifecycle

Legacy Einstein Vision and Language services authenticate with an OAuth 2.0 JWT bearer token obtained from `https://api.einstein.ai/v2/oauth2/token`. This token is a short-lived credential that must be included as a Bearer token on every REST callout.

The critical behavior: **the token does not auto-refresh**. Once obtained, it is valid for a limited window. If Apex fetches a new token on every transaction rather than reusing a cached token, each transaction spends an extra callout on a redundant authentication request. Under moderate load this doubles callout usage per transaction and can exhaust per-hour Einstein Request entitlements far faster than the model calls alone would.

The correct pattern is to store the token in Platform Cache (`Cache.Org`) with a TTL slightly below the token's validity window, check the cache before every callout, and only request a new token on a cache miss.

### Governor Limits and Bulk AI Processing

Apex enforces 100 callouts per transaction. A trigger processing a full batch of 200 records that each require one AI model call would exceed that limit partway through a single transaction. Bulk AI workloads must be moved to asynchronous processing. The correct structure is a Queueable chain: each Queueable job processes a bounded subset of records, makes its AI callout(s), persists results, and re-enqueues itself for the next batch if records remain. This pattern keeps every execution within the callout limit and within the asynchronous transaction limits that apply to callout-capable Queueable jobs.

Batch Apex can also make callouts, but only from `execute()` and only when the batch is defined with `Database.AllowsCallouts`. The batch scope (records per execute call) must be sized so the number of AI callouts per execute stays within the 100-callout limit.
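
A skeletal callout-capable batch is sketched below; the object, the `AI_Score__c` field, the `AiScoringService.scoreAccount` helper, and the scope size are illustrative assumptions:

```apex
// Sketch of a callout-capable batch; AiScoringService.scoreAccount and AI_Score__c are hypothetical.
public with sharing class AccountAiScoringBatch implements
        Database.Batchable<SObject>, Database.AllowsCallouts {

    public Database.QueryLocator start(Database.BatchableContext bc) {
        return Database.getQueryLocator([SELECT Id, Description FROM Account WHERE AI_Score__c = null]);
    }

    public void execute(Database.BatchableContext bc, List<Account> scope) {
        // One AI callout per record; the scope size must keep this under the 100-callout limit
        for (Account acc : scope) {
            acc.AI_Score__c = AiScoringService.scoreAccount(acc);
        }
        update scope;
    }

    public void finish(Database.BatchableContext bc) {}
}

// Usage: size the scope well below the callout limit, for example:
// Database.executeBatch(new AccountAiScoringBatch(), 50);
```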

### Response Parsing

Both `aiplatform.ModelsAPI` and legacy Einstein REST APIs return typed responses. Do not use dynamic JSON deserialization (`JSON.deserializeUntyped`) for ModelsAPI calls — the generated types provide compile-time safety. Null-check intermediate objects before traversing nested paths; a model returning a non-200 status code will populate a different response path and leave `Code200` null.

---

## Common Patterns

### Cache-Aside Token Management for Legacy Einstein Services

**When to use:** Any Apex class making legacy Einstein Vision or Language callouts.

**How it works:** Before constructing the HTTP request, check `Cache.Org.get(cacheKey)`. On a hit, use the cached token string directly. On a miss or null, call the JWT token endpoint, store the result in `Cache.Org.put(cacheKey, token, ttlSeconds)` with a TTL that is 30–60 seconds shorter than the actual token expiry, and proceed.

**Why not the alternative:** Fetching a fresh token per transaction doubles callout consumption and risks hitting per-hour Einstein Request limits before the model calls exhaust them.
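
A sketch of the wrapper under assumed values: the partition name `EinsteinAuth`, the cache key, the TTL, and the `requestNewToken`/`buildJwtAssertion` helpers are illustrative, and the real TTL must be derived from the token expiry your org requests.

```apex
// Cache-aside JWT token wrapper. Partition name, cache key, TTL, and the token-request
// details are illustrative assumptions; align them with your org's actual configuration.
public with sharing class EinsteinTokenProvider {

    private static final String CACHE_KEY = 'local.EinsteinAuth.accessToken';
    // Keep the TTL 30-60 seconds shorter than the token validity window you request
    private static final Integer TTL_SECONDS = 840;

    public static String getToken() {
        String cached = (String) Cache.Org.get(CACHE_KEY);
        if (cached != null) {
            return cached; // cache hit: no authentication callout this transaction
        }
        String fresh = requestNewToken(); // cache miss: one authentication callout
        Cache.Org.put(CACHE_KEY, fresh, TTL_SECONDS);
        return fresh;
    }

    private static String requestNewToken() {
        // JWT bearer exchange against the Einstein token endpoint; assertion signing
        // (Auth.JWT / Auth.JWS or Crypto) depends on your certificate setup.
        HttpRequest req = new HttpRequest();
        req.setEndpoint('https://api.einstein.ai/v2/oauth2/token');
        req.setMethod('POST');
        req.setHeader('Content-Type', 'application/x-www-form-urlencoded');
        req.setBody('grant_type=urn:ietf:params:oauth:grant-type:jwt-bearer'
            + '&assertion=' + EncodingUtil.urlEncode(buildJwtAssertion(), 'UTF-8'));
        HttpResponse res = new Http().send(req);
        Map<String, Object> parsed = (Map<String, Object>) JSON.deserializeUntyped(res.getBody());
        return (String) parsed.get('access_token');
    }

    private static String buildJwtAssertion() {
        return ''; // placeholder: build and sign the JWT per your Einstein account certificate
    }
}
```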

### Queueable Chain for Bulk AI Processing

**When to use:** Triggers, batch jobs, or any context that passes more records to AI processing than can be handled in a single synchronous transaction.

**How it works:** The trigger or entry point collects record IDs and enqueues the first Queueable. Each Queueable processes a fixed slice (e.g. 5–10 records per execution), makes the AI callout(s), updates results, and calls `System.enqueueJob(new AiProcessQueueable(remainingIds))` if the list is not exhausted.

**Why not the alternative:** Synchronous processing in a trigger exceeds the callout limit at scale and can produce partial failures that leave records in an inconsistent state.
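
A sketch of the chain structure; the slice size, the `Case.AI_Summary__c` field, and the `AiSummaryService.summarize` helper (which performs the actual AI callout) are illustrative assumptions.

```apex
// Queueable chain sketch; AiSummaryService.summarize and Case.AI_Summary__c are hypothetical.
public with sharing class AiProcessQueueable implements Queueable, Database.AllowsCallouts {

    private static final Integer SLICE_SIZE = 5; // bounded so callouts stay well under the limit
    private final List<Id> remainingIds;

    public AiProcessQueueable(List<Id> recordIds) {
        this.remainingIds = recordIds;
    }

    public void execute(QueueableContext context) {
        // Take a bounded slice off the front of the work list
        Integer sliceEnd = Math.min(SLICE_SIZE, remainingIds.size());
        List<Id> slice = new List<Id>();
        for (Integer i = 0; i < sliceEnd; i++) {
            slice.add(remainingIds[i]);
        }

        // All callouts happen before DML in this transaction
        List<Case> toUpdate = new List<Case>();
        for (Case c : [SELECT Id, Description FROM Case WHERE Id IN :slice]) {
            c.AI_Summary__c = AiSummaryService.summarize(c.Description); // one AI callout per record
            toUpdate.add(c);
        }
        update toUpdate;

        // Re-enqueue for the rest; terminate when the work list is exhausted
        List<Id> rest = new List<Id>();
        for (Integer i = sliceEnd; i < remainingIds.size(); i++) {
            rest.add(remainingIds[i]);
        }
        if (!rest.isEmpty()) {
            System.enqueueJob(new AiProcessQueueable(rest));
        }
    }
}
```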

### aiplatform.ModelsAPI Chat Generation with Null-Safe Response Parsing

**When to use:** Any call to a chat-capable model (e.g. GPT-4o Mini) where the response should be extracted safely.

**How it works:** Call `createChatGenerations`, check `response.Code200 != null`, then read from the typed path. Wrap in a try-catch for `aiplatform.ModelsAPI.createChatGenerationsException` to handle API-level errors distinctly from Apex exceptions.
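
A method sketch (to live in a callout-capable service class) that follows the response path and exception type named above; the request, body, and message class names are assumptions, and all generated member names should be verified against the `aiplatform` classes in your org.

```apex
// Sketch only; verify the generated member names against the aiplatform classes in your org.
public static String askModel(String userPrompt) {
    aiplatform.ModelsAPI.createChatGenerations_Request request =
        new aiplatform.ModelsAPI.createChatGenerations_Request();
    request.modelName = 'sfdc_ai__DefaultOpenAIGPT4OmniMini';

    aiplatform.ModelsAPI_ChatGenerationsRequest body = new aiplatform.ModelsAPI_ChatGenerationsRequest();
    aiplatform.ModelsAPI_ChatMessageRequest userMessage = new aiplatform.ModelsAPI_ChatMessageRequest();
    userMessage.role = 'user';
    userMessage.content = userPrompt;
    body.messages = new List<aiplatform.ModelsAPI_ChatMessageRequest>{ userMessage };
    request.body = body;

    try {
        aiplatform.ModelsAPI.createChatGenerations_Response response =
            new aiplatform.ModelsAPI().createChatGenerations(request);

        // Code200 is null on non-200 responses; check before traversing the typed path
        if (response.Code200 != null
                && response.Code200.generations != null
                && !response.Code200.generations.isEmpty()) {
            return response.Code200.generations[0].message.content;
        }
        return null;
    } catch (aiplatform.ModelsAPI.createChatGenerationsException e) {
        // API-level failure (throttling, quota, model error) handled apart from other Apex exceptions
        System.debug(LoggingLevel.ERROR, 'Chat generation failed: ' + e.getMessage());
        return null;
    }
}
```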

---

## Decision Guidance

| Situation | Recommended Approach | Reason |
|---|---|---|
| Single-record synchronous AI call | aiplatform.ModelsAPI with null-safe response parsing | Clean typing, no auth complexity |
| Bulk records needing AI processing | Queueable chain with bounded slice per execution | Stay within 100-callout and timeout limits |
| Legacy Einstein Vision or Language | JWT token via Platform Cache (cache-aside) | Avoid redundant token callouts under load |
| Token or response value reuse across short time window | Cache.Org with explicit TTL and fallback | Reduce entitlement burn and callout count |
| Response status is unexpected | Check Code200 null, handle alternate status codes | ModelsAPI routes failures to different typed paths |

---

## Recommended Workflow

Step-by-step instructions for an AI agent or practitioner activating this skill:

1. Identify the API surface — confirm whether the project uses `aiplatform.ModelsAPI`, legacy Einstein REST (Vision/Language), or direct HTTP to an external model. Each path has different auth requirements and response shapes.
2. Check entitlement and callout budget — confirm Einstein Request limits and how many callouts the target transaction context can safely make. Determine if async processing is required.
3. Implement the API call — use the typed ModelsAPI methods (`createGenerations`, `createChatGenerations`, or `createEmbeddings`) with the correct model API name string. For legacy Einstein services, implement the cache-aside token pattern against Platform Cache before writing the HTTP callout.
4. Add null-safe response parsing — traverse the typed response path only after a null-check on `Code200`. Add a try-catch for API-specific exception types and log meaningful error context.
5. Design for scale — if transaction volume can exceed safe callout budget, refactor to a Queueable chain. Size each job's slice so total callouts per execution stay below 100.
6. Validate — run the checker script, confirm Platform Cache TTL is set below token expiry, confirm Queueable chain terminates correctly on empty input, and confirm no synchronous bulk callout patterns remain.

---

## Review Checklist

- [ ] aiplatform.ModelsAPI model API name string is correct and matches an available model in the org.
- [ ] Response traversal null-checks `Code200` before reading nested fields.
- [ ] Legacy Einstein JWT token is stored in Platform Cache with TTL below actual token expiry.
- [ ] Synchronous callout paths stay within 100-callout limit per transaction.
- [ ] Bulk AI processing uses Queueable chain or batch with AllowsCallouts, not synchronous trigger logic.
- [ ] Error handling distinguishes model API errors (non-200 responses) from network or Apex exceptions.
- [ ] Einstein Request entitlement impact has been estimated for expected volume.

---

## Salesforce-Specific Gotchas

1. **aiplatform.ModelsAPI is not authentication-free despite appearances** — the External Services wrapper handles the token exchange internally, but every call still consumes an Einstein Request entitlement credit. High-frequency use without caching intermediate results will exhaust entitlements before callout limits are hit.
2. **Legacy Einstein JWT tokens do not auto-refresh** — Apex must track token lifecycle explicitly. Without Platform Cache, every transaction fetches a fresh token, doubling callout count and burning per-hour entitlements twice as fast.
3. **Callout limit of 100 per transaction is hard** — a trigger processing a full batch of 200 records that each need one AI call will throw a `System.LimitException` at record 101. Bulk AI processing must be refactored to async before going to production.
4. **ModelsAPI Code200 can be null on non-200 responses** — the generated type has separate fields for each HTTP status bucket. Code that assumes `response.Code200` is always populated will produce a NullPointerException on model errors, throttle responses, or quota-exceeded replies.

---

## Output Artifacts

| Artifact | Description |
|---|---|
| ModelsAPI call scaffold | Typed Apex calling createChatGenerations or createGenerations with null-safe response parsing |
| Platform Cache token wrapper | Cache-aside Apex class for JWT token storage and reuse |
| Queueable chain design | AiProcessQueueable structure with bounded slice, enqueue-next logic, and termination guard |
| Review findings | Identified callout budget risks, missing cache strategies, and bulk-processing antipatterns |

---

## Related Skills

- `apex/platform-cache` — detailed patterns for Cache.Org design, TTL selection, key namespacing, and cache-aside wrappers
- `apex/apex-queueable-patterns` — Queueable chain structure, re-enqueue patterns, and error handling for async jobs
- `apex/callouts-and-http-integrations` — general Apex callout patterns, Named Credentials, and HTTP request construction
- `apex/governor-limits` — full governor limit reference for callout budgets, heap limits, and transaction boundaries
- `agentforce/model-builder-and-byollm` — for configuring which AI models are available in the org for aiplatform.ModelsAPI
