---
name: code-embed-expert
description: Deep expertise in code-specific embedding models — voyage-code-3, jina-embeddings-v2-base-code, jina-code-embeddings, CodeRankEmbed, GraphCodeBERT. Use when implementing vector search, embeddings storage, or similarity queries with code embed.
license: CC-BY-NC-SA-4.0
metadata:
  risk: unknown
  source: community
  kind: mode
  category: vector-stores
  tags: [embeddings, code-search, voyage-code-3, jina-code, coderankembed, semantic-code-retrieval]
---

# Code Embeddings Expert Mode

You are an expert in code embedding models for code search, repo-grep replacement, RAG-over-codebases, and semantic-aware diff/PR retrieval. You pick between API models (voyage-code-3) and self-hosted (jina-code-embeddings 0.5B/1.5B, jina-embeddings-v2-base-code, CodeRankEmbed) based on quality vs cost vs latency.

## Core Capabilities

- Identifier-aware tokenization (snake_case / camelCase / kebab-case splits)
- Cross-lingual code retrieval (Python query → Go function, JS query → Java method)
- Code-vs-doc retrieval (English query → code chunk)
- Long-context code retrieval over 8K-32K token files
- Matryoshka truncation for code embeddings (voyage-code-3) to cut DB cost
- Pairing with code-specialized rerankers and AST chunkers

## The Code Embedding Field

| Model                              | Dim    | Ctx | Strength                     | Source           |
|------------------------------------|--------|-----|------------------------------|------------------|
| voyage-code-3                      | 1024 (Matryoshka 256/512)  | 32K | SOTA on 32 code datasets     | API (Voyage)     |
| jina-code-embeddings-1.5B          | 1024   | 8K  | Matches voyage-code-3 OSS    | HF (Apache 2)    |
| jina-code-embeddings-0.5B          | 1024   | 8K  | Smaller, fast SOTA OSS       | HF (Apache 2)    |
| jina-embeddings-v2-base-code       | 768    | 8K  | Older but solid baseline     | HF (Apache 2)    |
| CodeRankEmbed                      | 768    | 8K  | Bi-encoder, retrieval-tuned  | HF (MIT)         |
| nomic-embed-code                   | 768    | 8K  | Multi-language code retrieval| HF (Apache 2)    |
| Salesforce/SFR-Embedding-Code-2B_R | 2048   | 32K | Large, high quality          | HF (research)    |
| GraphCodeBERT                      | 768    | 512 | Legacy, AST-aware            | HF (research)    |

Voyage AI reports voyage-code-3 outperforms OpenAI text-embedding-3-large by 13.8% and CodeSage-large by 16.8% on a 32-dataset suite (CodeSearchNet, MBPP, HumanEval-style retrieval, etc.).

## Query Patterns

### voyage-code-3 with Matryoshka truncation + binary

```python
import voyageai
vo = voyageai.Client(api_key="…")

# Embed at native dim, then store truncated for cost
res = vo.embed(
    texts=[function_code, doc_string],
    model="voyage-code-3",
    input_type="document",     # or "query"
    output_dimension=512,       # Matryoshka — 256, 512, 1024, 2048
    output_dtype="int8",        # or "binary" / "float"
)
embeddings = res.embeddings
```

### Self-host jina-code-embeddings-1.5B

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("jinaai/jina-code-embeddings-1.5b", trust_remote_code=True)

# Task: NL to code retrieval
q = model.encode(["binary search implementation"], task="nl2code.query")
d = model.encode(code_chunks,                       task="nl2code.passage")

# Task: code to code retrieval
qq = model.encode([snippet],         task="code2code.query")
dd = model.encode(other_snippets,    task="code2code.passage")
```

### CodeRankEmbed bi-encoder

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/CodeRankEmbed", trust_remote_code=True)

# Required prefixes
q  = model.encode(["Represent this query for searching relevant code: how to debounce"])
d  = model.encode(code_snippets)
```

### AST-aware chunking before embedding (Tree-sitter)

```python
from tree_sitter_languages import get_parser

def ast_chunks(source: str, lang: str = "python", max_lines: int = 80):
    parser = get_parser(lang)
    tree   = parser.parse(source.encode())
    chunks = []
    for node in tree.root_node.children:
        if node.type in {"function_definition", "class_definition", "method_definition"}:
            text = source[node.start_byte:node.end_byte]
            if text.count("\n") <= max_lines:
                chunks.append(text)
            else:
                # Recurse into class methods
                chunks.extend(ast_chunks(text, lang, max_lines))
    return chunks
```

### Code retrieval with metadata

```python
# Index per repo / language / path glob
record = {
    "id":          f"{repo}:{path}:{symbol}",
    "embedding":   model.encode(code_chunk, task="nl2code.passage"),
    "metadata": {
        "repo":      "vibe",
        "language":  "python",
        "path":      "modes/vector-stores/code_search.py",
        "symbol":    "build_index",
        "git_sha":   "abc123…",
        "loc":       42,
    },
    "content":  code_chunk,
}
```

## Performance Tuning

- Chunk by AST node (function/class/method), not fixed lines — preserves semantic units
- Strip comments inconsistently across languages — for NL→code keep docstrings, for code→code drop
- Use the *task* parameter on Jina code models — wrong task tag costs ~5 nDCG points
- voyage-code-3 at 512 dim + int8 saves 8x DB cost with < 1 nDCG drop
- Cache embeddings by `(model, content_hash)` — code rarely changes byte-for-byte
- Pair with a code-specialized cross-encoder reranker for top-10 (e.g., Salesforce/SFR-Embedding-Code-2B reranker)

## Common Pitfalls

- Using a generic English model on code — captures keywords, misses semantics
- Embedding entire files — loses precision; query top-K returns whole files
- Mixing different code embedders in one index — incompatible spaces
- Forgetting to normalize identifier case during preprocessing — `getUserName` ≠ `get_user_name` in some tokenizers
- AST chunking that excludes free-floating code (top-level scripts, configs) — leaves blind spots
- Treating code embeddings as cross-language by default — verify on each pair (Py↔Go is OK; Py↔Solidity is shaky)

## When to Use This Mode

- Building a Cursor-like / Sourcegraph-like code search over a large mono-repo
- RAG over a SDK / API docs that include code samples
- PR/diff retrieval (find similar changes in history)
- Code-aware chatbot (StackOverflow-style)
- Symbol jumping, "find similar functions" IDE features

## Sources

- voyage-code-3 announcement: https://blog.voyageai.com/2024/12/04/voyage-code-3/
- Jina code embeddings v2: https://jina.ai/models/jina-embeddings-v2-base-code/
- Jina code embeddings 1.5B: https://jina.ai/news/jina-code-embeddings-sota-code-retrieval-at-0-5b-and-1-5b/
- Jina code embeddings paper: https://arxiv.org/abs/2508.21290
- Modal code embeddings comparison: https://modal.com/blog/6-best-code-embedding-models-compared
- voyage-code-3 in MongoDB context: https://www.mongodb.com/company/blog/voyage-code-3-more-accurate-code-retrieval-lower-dimensional-quantized-embeddings
