---
name: literature-review-agent
description: Step 3 of the PaperOrchestra pipeline (arXiv:2604.05018). Execute the literature search strategy from outline.json — discover candidate papers via web search, verify them through Semantic Scholar (Levenshtein > 70 fuzzy title match, temporal cutoff, dedup by paperId), build a BibTeX file, and draft Introduction + Related Work using ≥90% of the verified pool. Runs in parallel with the plotting-agent. TRIGGER when the orchestrator delegates Step 3 or when the user asks to "find citations for my paper", "draft the related work", or "build the bibliography".
---

# Literature Review Agent (Step 3)

Faithful implementation of the Hybrid Literature Agent from PaperOrchestra
(Song et al., 2026, arXiv:2604.05018, §4 Step 3, App. D.3, App. F.1 p.46).

**Cost: ~20–30 LLM calls.** This is one of the two longest steps (the other is
plotting). Wall-time floor is set by Semantic Scholar's 1 QPS verification
limit.

## Inputs

- `workspace/outline.json` — specifically `intro_related_work_plan` with the
  Introduction search directions and the 2-4 Related Work methodology
  clusters
- `workspace/inputs/conference_guidelines.md` — used to derive `cutoff_date`
- `workspace/inputs/idea.md`, `workspace/inputs/experimental_log.md` — for
  framing the Intro and grounding the Related Work positioning

## Outputs

- `workspace/citation_pool.json` — verified Semantic Scholar metadata for
  every paper that survived verification
- `workspace/refs.bib` — BibTeX file generated from the verified pool
- `workspace/drafts/intro_relwork.tex` — drafted Introduction and Related
  Work sections, written into the template, with the rest of the template
  preserved verbatim

## Two-phase pipeline (App. D.3)

```
PHASE 1 — Parallel Candidate Discovery
   For each search direction in introduction_strategy.search_directions:
   For each limitation_search_query in each related_work cluster:
     - Use the host's web search tool to discover up to ~10 candidate papers.
     - Run up to 10 discovery queries in parallel (host-permitting).
     - Collect (title, snippet, url) tuples — no verification yet.
   → PRE-DEDUP before Phase 2 (see Step 1.5 below)

PHASE 2 — Sequential Citation Verification (1 QPS, with cache)
   For each candidate (after pre-dedup), sequentially:
     0. Check s2_cache.json first (scripts/s2_cache.py --check).
        If HIT: use cached response, skip live S2 call. No throttle needed.
        If MISS: proceed with live request below.
     1. Query Semantic Scholar by title:
          GET https://api.semanticscholar.org/graph/v1/paper/search?query=<title>
              &fields=title,abstract,year,authors,venue,externalIds&limit=5
        (Public endpoint, no key. Throttle to 1 QPS for live requests only.)
     2. Store the S2 response in cache: s2_cache.py --store.
     3. Pick the top hit. Check Levenshtein title ratio against the original
        candidate title. If ratio < 70: discard.
     4. Bonus: if year and venue exactly align with hints, add a +5 point
        match-quality bonus.
     5. Require: abstract is non-empty.
     6. Require: paper.year (or month if known) strictly predates cutoff_date.
        Months default to day-1: e.g., "October 2024" → 2024-10-01.
     7. If all checks pass, add to verified pool.
   After all candidates are verified, dedup by Semantic Scholar paperId.
```

The host agent does the LLM/web work; the deterministic helpers in `scripts/`
do the math.

## Step-by-step

### 0. Derive `cutoff_date`

Parse `conference_guidelines.md` for the submission deadline. The paper aligns
research cutoff with venue submission deadline (App. D.1):

| Venue | Cutoff |
|---|---|
| CVPR 2025 | Nov 2024 |
| ICLR 2025 | Oct 2024 |
| Other | One month before the stated submission deadline |

Encode as `YYYY-MM-DD`. Months default to day-1 (e.g., `2024-10-01`).

### 1. Phase 1: Parallel Candidate Discovery

From `outline.json`:

- All `introduction_strategy.search_directions` (3-5 queries)
- For each cluster in `related_work_strategy.subsections`:
  - The cluster's `sota_investigation_mission` becomes a search query
  - All `limitation_search_queries` (1-3 each)

For each query, **use your host's web search tool** (e.g., `WebSearch` in
Claude Code, `@web` in Cursor, the search tool in Antigravity). Collect the
top ~10 candidates per query: title, abstract snippet, source URL.

If your host supports parallel sub-tasks, fire up to 10 concurrent search
queries. If not, run sequentially — slower but functionally equivalent.

#### Optional: Exa as a Phase 1 backend

If your host has no native web search, OR you want a research-paper-focused
backend with better signal-to-noise, you can use [Exa](https://exa.ai) via
the bundled `scripts/exa_search.py` helper. It is **opt-in** and reads
`EXA_API_KEY` from the environment — the repo never commits a key.

```bash
export EXA_API_KEY="your-key-here"   # get one at https://dashboard.exa.ai/
python skills/literature-review-agent/scripts/exa_search.py \
    --query "Sparse attention long context transformers" \
    --num-results 15 \
    --discovered-for "related_work[2.1]"
```

Output is a normalized candidate list ready to merge into
`raw_candidates.json`. Phase 2 verification (Semantic Scholar fuzzy match,
cutoff, dedup) is unchanged. See `references/exa-search-cookbook.md` for
the full recipe, query patterns, cost estimates, and security notes.

Combine all discovered candidates into a single working list. Tag each with
the originating query ID so you can later attribute it to "intro" vs
"related_work[i]".

### 1.5. Pre-dedup before Phase 2

**Always run this before starting Phase 2.** Multiple search queries routinely
return the same papers (e.g., "Attention is All You Need" appears in almost
every NLP discovery query). Verifying duplicates wastes 30-40% of S2 quota
at 1 QPS.

```bash
python skills/literature-review-agent/scripts/pre_dedup_candidates.py \
    --in workspace/raw_candidates.json \
    --out workspace/deduped_candidates.json
# Prints: "150 candidates → 97 unique (53 duplicates removed)"
```

Use `workspace/deduped_candidates.json` as input to Phase 2.

### 2. Phase 2: Sequential Verification via Semantic Scholar (with cache)

For each candidate in `deduped_candidates.json`, in **sequential** order:

**Step A — check cache first** (no S2 call, no throttle needed):
```bash
python skills/literature-review-agent/scripts/s2_cache.py \
    --cache workspace/cache/s2_cache.json \
    --check "<candidate title>"
# exit 0 + prints JSON → use cached response, skip Step B
# exit 1 → proceed to Step B
```

**Step B — live S2 request** (cache MISS only, throttle to 1 QPS):

**Preferred:** use the bundled `scripts/s2_search.py` helper — it handles
auth, retries, and 429 back-off automatically:

```bash
python skills/literature-review-agent/scripts/s2_search.py \
    --query "<URL-decoded candidate title>" --limit 5
# If SEMANTIC_SCHOLAR_API_KEY is set the key is forwarded automatically.
# If not, the public unauthenticated endpoint is used (≤1 QPS, still works).
```

Check whether the key is configured before starting Phase 2:

```bash
python skills/literature-review-agent/scripts/s2_search.py --check-key
```

**Fallback:** if you prefer your host's URL fetch tool, GET:
```
https://api.semanticscholar.org/graph/v1/paper/search?query=<URL-encoded title>&limit=5&fields=title,abstract,year,authors,venue,externalIds
```
Add header `x-api-key: <SEMANTIC_SCHOLAR_API_KEY>` if the env var is set.
Be polite: ≤1 request per second for live requests. Cache hits are free.

**Step C — store in cache** (after every successful live request):
```bash
python skills/literature-review-agent/scripts/s2_cache.py \
    --cache workspace/cache/s2_cache.json \
    --store "<candidate title>" \
    --response '<full S2 JSON response>'
```

For the top hit:

```bash
python skills/literature-review-agent/scripts/levenshtein_match.py \
    --candidate "Original candidate title" \
    --found "S2 returned title"
# prints integer 0-100. Discard if < 70.
```

Then check the temporal cutoff:

```bash
python skills/literature-review-agent/scripts/check_cutoff.py \
    --paper-year 2024 \
    --paper-month 9 \
    --cutoff 2024-10-01
# exit 0 if strictly predates, exit 1 if not
```

If both checks pass AND the abstract is non-empty, append the paper's full
S2 metadata to the verified pool.

### 3. Dedup and assemble the pool

After all candidates are verified:

```bash
python skills/literature-review-agent/scripts/dedupe_by_id.py \
    --in raw_pool.json \
    --out workspace/citation_pool.json
```

The dedupe script keys on `paperId` (Semantic Scholar's internal unique ID),
falling back to `externalIds.DOI`, then `externalIds.ArXiv`, then a
normalized title.

The script also computes and writes `min_cite_paper_count` =
`floor(0.9 * len(papers))` — the minimum number of papers the writing step
must cite (the paper's ≥90% integration rule, App. D.3).

**Immediately after dedupe_by_id.py**, validate and auto-fix the pool schema:

```bash
python skills/literature-review-agent/scripts/validate_pool.py \
    --pool workspace/citation_pool.json --fix
# Catches and fixes authors-as-strings, reports missing required fields.
# Must pass before proceeding to Step 4.
```

### 4. Build the BibTeX file

```bash
python skills/literature-review-agent/scripts/bibtex_format.py \
    --pool workspace/citation_pool.json \
    --out workspace/refs.bib
```

The script generates citation keys deterministically from `firstauthor + year
+ first significant word of title` (e.g., `vaswani2017attention`). It writes
out only `@article` / `@inproceedings` / `@misc` entries — never invents
fields. It also writes the canonical `bibtex_key` back into each paper record
in `citation_pool.json`.

**Immediately after bibtex_format.py**, sync keys in `intro_relwork.tex`:

```bash
python skills/literature-review-agent/scripts/sync_keys.py \
    --pool workspace/citation_pool.json \
    --tex  workspace/drafts/intro_relwork.tex \
    --inplace
# Replaces every \cite{agent_key} with \cite{canonical_bibtex_key}.
# Eliminates citation_coverage gate failures caused by key mismatch.
```

These two steps replace the manual Python snippets that were previously
required. The pipeline is now:

```
dedupe_by_id → validate_pool --fix → bibtex_format → sync_keys
```

### 5. Draft Introduction + Related Work

This is where you (the host agent) actually write text. Load the
**verbatim Literature Review Agent prompt** at `references/prompt.md`.
Substitute the template placeholders:

| Placeholder | Value |
|---|---|
| `intro_related_work_plan` | full JSON object from `outline.json` |
| `project_idea` | contents of `idea.md` |
| `project_experimental_log` | contents of `experimental_log.md` |
| `citation_checklist` | the BibTeX keys from `refs.bib` |
| `collected_papers` | list of `{key, title, abstract}` from `citation_pool.json` |
| `paper_count` | `len(citation_pool.papers)` |
| `min_cite_paper_count` | from `citation_pool.json` |
| `cutoff_date` | the date you derived in Step 0 |

**Also prepend the Anti-Leakage Prompt** from
`../paper-orchestra/references/anti-leakage-prompt.md`.

Run your LLM with the combined prompt against `template.tex`. The agent's
job is to fill in the empty Introduction and Related Work sections of the
template **and leave everything else untouched**. Output: the full
`template.tex` with those two sections filled. Save to
`workspace/drafts/intro_relwork.tex`.

### 6. Verify ≥90% citation coverage

```bash
python skills/literature-review-agent/scripts/citation_coverage.py \
    --tex workspace/drafts/intro_relwork.tex \
    --pool workspace/citation_pool.json
# exit 0 if ≥90% of pool is cited; exit 1 otherwise
```

If the gate fails, re-prompt the writing step explicitly listing the missing
keys and asking the agent to integrate them where contextually appropriate.

## Critical rules from the prompt

These are excerpted from `references/prompt.md`. The host agent MUST honor
them on the writing call:

- **Cite ONLY from `collected_papers`.** Never invent BibTeX keys, never
  reference papers not in the pool.
- **Cite at least `min_cite_paper_count` of them** in Intro + Related Work
  combined.
- **TIMELINE RULE**: Do not treat any papers published after `cutoff_date`
  as prior baselines to beat. They are concurrent work only.
- **EVALUATION RULE**: Do not claim our method beats / achieves SOTA over a
  specific cited paper UNLESS that paper is explicitly evaluated against in
  `experimental_log.md`. Frame other recent papers strictly as concurrent,
  orthogonal, or conceptual work.
- **Output format**: return the full code for the updated `template.tex`,
  with the two empty sections (Introduction and Related Work) filled in,
  and **all the other code** (packages, styles, other sections) **identical
  to the original** template.tex.
- Wrap output in ```` ```latex ... ``` ```` fences.
- Do not change `\usepackage[capitalize]{cleveref}` to `cleverref` (there is
  no `cleverref.sty`).

## Degraded mode (no web search)

If your host has no web search tool, switch to degraded mode:

1. If the user has placed a pre-built `workspace/inputs/refs.bib` in the
   workspace, load it directly into `workspace/refs.bib` and skip Phase 1
   and Phase 2.
2. Otherwise, emit `workspace/drafts/intro_relwork.tex` containing the
   template with two TODO markers in the Intro and Related Work sections,
   and tell the user the pipeline cannot complete Step 3 without web search.

## Resources

- `references/prompt.md` — verbatim Literature Review Agent prompt from App. F.1
- `references/discovery-pipeline.md` — Phase 1 + Phase 2 explained in detail
- `references/verification-rules.md` — Levenshtein cutoff, year alignment, dedup
- `references/citation-density-rule.md` — the ≥90% integration rule
- `references/s2-api-cookbook.md` — Semantic Scholar URLs, fields, rate limits
- `references/exa-search-cookbook.md` — optional Exa backend for Phase 1 (research-paper-focused web search)
- `scripts/pre_dedup_candidates.py` — **NEW** dedup Phase 1 candidates before Phase 2 (saves 30-40% S2 quota)
- `scripts/s2_cache.py` — **NEW** persistent S2 response cache (eliminates re-verification on re-runs)
- `scripts/validate_pool.py` — **NEW** validate & auto-fix citation_pool.json schema (authors format)
- `scripts/sync_keys.py` — **NEW** sync cite keys in .tex with canonical bibtex_keys after bibtex_format.py
- `scripts/levenshtein_match.py` — fuzzy title match (ratio > 70)
- `scripts/check_cutoff.py` — date cmp w/ month → day-1 default
- `scripts/dedupe_by_id.py` — dedup verified pool by S2 paperId
- `scripts/bibtex_format.py` — build refs.bib from JSON pool
- `scripts/citation_coverage.py` — ≥90% citation coverage gate
- `scripts/s2_search.py` — **NEW** Semantic Scholar title-search helper; reads `SEMANTIC_SCHOLAR_API_KEY` from env (optional — falls back to unauthenticated)
- `scripts/exa_search.py` — optional Exa Phase 1 backend (reads `EXA_API_KEY` from env)
