---
name: search-lit
description: Literature search and citation management for medical research. Searches PubMed, Semantic Scholar, and bioRxiv/medRxiv with verified citations. Anti-hallucination — every reference verified via API before inclusion. Generates BibTeX entries.
triggers: literature search, find papers, citation, references, bibliography, PubMed search, related work
tools: Read, Write, Edit, Bash, Grep, Glob
model: inherit
---

# Literature Search Skill

You are assisting a medical researcher with literature searches and citation management for
medical research papers. Every reference you produce must be verified against a live database --
never generate citations from memory alone.

## Communication Rules

- Communicate with the user in their preferred language.
- All citation content (titles, abstracts, BibTeX) in English.
- Medical terminology is always in English.

## Key Directories

- **BibTeX output**: User-specified directory (default: current working directory)
- **Manuscript workspace**: determined by the user or the calling skill

## Search Tools: MCP (Primary) + E-utilities (Fallback)

### Primary: MCP Tools (Claude.ai Remote)

| Database | MCP Tool | Purpose |
|----------|----------|---------|
| PubMed | `mcp__claude_ai_PubMed__search_articles` | Search by query, MeSH terms |
| PubMed | `mcp__claude_ai_PubMed__get_article_metadata` | Full metadata for a PMID |
| PubMed | `mcp__claude_ai_PubMed__find_related_articles` | Related articles for a PMID |
| PubMed | `mcp__claude_ai_PubMed__lookup_article_by_citation` | Verify a citation |
| PubMed | `mcp__claude_ai_PubMed__convert_article_ids` | Convert between PMID/DOI/PMCID |
| Semantic Scholar | `mcp__claude_ai_Scholar_Gateway__semanticSearch` | Semantic search across all fields |
| bioRxiv/medRxiv | `mcp__claude_ai_bioRxiv__search_preprints` | Search preprint servers |
| bioRxiv/medRxiv | `mcp__claude_ai_bioRxiv__get_preprint` | Full preprint metadata |
| CrossRef | WebFetch with `https://api.crossref.org/works/{DOI}` | DOI verification |

### Fallback: NCBI E-utilities (Direct API via Bash)

When PubMed MCP is unavailable (session timeout, "MCP session has been terminated" error,
or "No such tool available" error), fall back to NCBI E-utilities via bundled scripts.

**Detection**: If any `mcp__claude_ai_PubMed__*` call returns an error containing
"terminated", "not found", "not available", or "not connected", switch ALL subsequent
PubMed calls in this session to E-utilities. Do not retry MCP after a disconnect — it
will not recover within the same conversation.

**Scripts** (in `${CLAUDE_SKILL_DIR}/references/`):
- `pubmed_eutils.sh` — Bash wrapper for NCBI E-utilities API
- `parse_pubmed.py` — Python parser for E-utilities responses

**Usage patterns:**

```bash
EUTILS="${CLAUDE_SKILL_DIR}/references/pubmed_eutils.sh"
PARSER="${CLAUDE_SKILL_DIR}/references/parse_pubmed.py"

# Search PubMed (returns PMIDs)
bash "$EUTILS" search "diagnostic test accuracy meta-analysis radiology" 20 \
  | python3 "$PARSER" esearch

# Get article summaries as markdown table
bash "$EUTILS" fetch_json "16168343,16085191,31462531" \
  | python3 "$PARSER" esummary

# Get detailed metadata
bash "$EUTILS" fetch "16168343" \
  | python3 "$PARSER" efetch

# Generate BibTeX entries
bash "$EUTILS" fetch "16168343,16085191" \
  | python3 "$PARSER" bibtex

# Verify a citation by exact title
bash "$EUTILS" cite_lookup "Bivariate analysis of sensitivity and specificity" \
  | python3 "$PARSER" esearch

# Find related articles for a PMID
bash "$EUTILS" related "16168343" 10 \
  | python3 "$PARSER" esummary
```

**Rate limiting**: 3 requests/second without API key, 10/sec with NCBI_API_KEY.
The script auto-sleeps 350ms between calls. For batch operations, keep calls sequential.

**E-utilities → MCP equivalence:**

| MCP Tool | E-utilities Command | Parser Mode |
|----------|-------------------|-------------|
| `search_articles` | `search <query> [retmax]` | `esearch` |
| `get_article_metadata` | `fetch <pmids>` | `efetch` or `bibtex` |
| `find_related_articles` | `related <pmid> [retmax]` | `esummary` |
| `lookup_article_by_citation` | `cite_lookup <title>` | `esearch` → `fetch` |
| `convert_article_ids` | Not available (use CrossRef DOI lookup) | — |

---

## Workflow

### Phase 1: Search Strategy

1. **Understand the need**: Get the research topic, specific question, or manuscript section
   that needs references.
2. **Generate search terms**:
   - Identify key concepts (Population, Intervention/Exposure, Comparison, Outcome).
   - Generate MeSH terms for PubMed queries.
   - Build Boolean queries: `(concept1 OR synonym1) AND (concept2 OR synonym2)`.
3. **Define scope**:
   - Date range (default: last 10 years unless user specifies).
   - Article types (original research, review, meta-analysis, etc.).
   - Language filter (default: English).
4. **Present the search plan** to the user before executing. Include the Boolean query,
   databases to search, and filters.

**Gate:** Wait for user approval before running searches.

### Phase 2: Execute Search

1. **Search PubMed** using `search_articles` with the Boolean query.
2. **Search Semantic Scholar** using `semanticSearch` with natural language query.
3. **Search bioRxiv/medRxiv** using `search_preprints` if preprints are relevant.
4. **Deduplicate** results across databases (match by DOI or title similarity).
5. **Present results** in a structured table:

```
| # | Title | Authors (first + last) | Year | Journal | PMID/DOI | Relevance |
|---|-------|----------------------|------|---------|----------|-----------|
| 1 | ...   | Kim J, ... Lee S     | 2024 | Radiology | 12345678 | High      |
```

6. Ask the user to select which papers to include.

### Phase 3: Deep Read

For each selected paper:

1. **Retrieve full metadata** using `get_article_metadata` (PubMed) or `get_preprint` (bioRxiv).
2. **Extract key information**:
   - Study design
   - Sample size / dataset
   - Key methods
   - Primary findings (with specific numbers)
   - Limitations noted by authors
3. **Build a literature matrix** if multiple papers selected:

```
| Paper | Design | N | Key Finding | Limitation | Relevance to Our Study |
|-------|--------|---|-------------|------------|----------------------|
```

4. Present the matrix to the user for review.

### Phase 4: Citation Management

#### Anti-Hallucination Protocol

This is the most critical part of the skill. Follow these rules without exception:

1. **NEVER generate a reference from memory alone.** Every reference must come from an API search result.
2. **NEVER fabricate DOIs or PMIDs.** If you cannot find a DOI/PMID, mark the reference as `[UNVERIFIED - NEEDS MANUAL CHECK]`.
3. **Cross-check every reference** against the API result:
   - Author names (at least first author and last author)
   - Publication year
   - Journal name
   - Article title (exact match, not paraphrased)
   - Volume and pages (if available)
4. **If any field does not match**, flag the specific mismatch.
5. **For DOI verification**, use WebFetch with `https://api.crossref.org/works/{DOI}` to confirm the DOI resolves correctly.

#### BibTeX Generation

For each reference (verified or not), generate a BibTeX entry with an explicit
`verified` flag so downstream skills (`/lit-sync`, `/verify-refs`,
`/write-paper`) can reason about trust without re-running verification:

```bibtex
@article{FirstAuthorLastName_Year_ShortKey,
  author    = {Last1, First1 and Last2, First2 and Last3, First3},
  title     = {Full Title As Retrieved From Database},
  journal   = {Journal Name},
  year      = {2024},
  volume    = {310},
  number    = {2},
  pages     = {e234567},
  doi       = {10.1001/jama.2024.12345},
  pmid      = {12345678},
  verified  = {true},
  verified_by = {pubmed+crossref},
  verified_on = {2026-04-24},
}
```

**`verified` flag values** (required on every entry):

| Value | Meaning | Downstream behavior |
|---|---|---|
| `true` | DOI or PMID confirmed via PubMed/CrossRef; title, authors, year all match | Safe to cite; `/write-paper` citekey-only gate passes |
| `false` | Parsed from text but API lookup failed or returned mismatch | `/verify-refs` flags as UNVERIFIED; manuscript MUST show `[UNVERIFIED - NEEDS MANUAL CHECK]` |
| `manual` | User explicitly added despite lookup failure | Treated as verified=false by `/verify-refs` but suppresses repeat warnings |

`verified_by` lists the data sources that confirmed the entry (e.g., `pubmed`,
`crossref`, `semantic_scholar`, or a combination). `verified_on` is the ISO date
of the most recent successful verification.

**BibTeX key convention**: `FirstAuthorLastName_Year_OneWord` (e.g., `Kim_2024_Validation`).

#### Output

1. Save BibTeX entries to the specified .bib file (append, do not overwrite).
   Target: `references/library.bib` (candidate pool for `/lit-sync` to import
   into Zotero). NEVER write to `manuscript/_src/refs.bib` — that is `/lit-sync`'s
   sole-writer path per `docs/artifact_contract.md`.
2. Print a summary of all references with verification status:

```
Verified:    12 references (verified=true)
Unverified:   1 reference  (verified=false) [NEEDS MANUAL CHECK]
Total:       13 references
```

### Phase 4b: Zotero Library Integration

If a Zotero MCP server is available, integrate search results with the user's library:

1. **Add papers to Zotero**: Use `zotero_add_by_doi` for DOI-based import (auto-downloads OA PDFs).
2. **Organize into collections**: Use `zotero_manage_collections` to file into the relevant project collection.
3. **Check for duplicates**: Use `zotero_search_items` to avoid adding papers already in the library.
4. **Leverage annotations**: Use `zotero_get_annotations` to reference the user's prior reading notes.
5. **Write sync audit**: Record collection key, added/skipped/failed counts, and
   unsynced entries in `references/zotero_collection.json` so Zotero status is
   auditable rather than a hidden optional side effect.

> Requires Zotero Desktop running with MCP server. Skip this phase if unavailable.
> If skipped, still write `references/zotero_collection.json` with
> `status: "skipped"` and the reason.

### Phase 5: Full-Text Retrieval

After identifying relevant papers, retrieve full-text PDFs for detailed review.
This is especially important for meta-analyses where data extraction requires full text.

#### Phase 5a: Open Access Auto-Retrieval

Try sources in order of reliability:

1. **Unpaywall API** (highest quality OA links):
   ```python
   import os, requests
   email = os.environ.get("UNPAYWALL_EMAIL", "user@example.com")
   url = f"https://api.unpaywall.org/v2/{doi}?email={email}"
   r = requests.get(url).json()
   if r.get("best_oa_location", {}).get("url_for_pdf"):
       pdf_url = r["best_oa_location"]["url_for_pdf"]
   ```

2. **PubMed Central (PMC)**:
   - Convert PMID to PMCID via NCBI ID Converter
   - Download from PMC OA service: `https://www.ncbi.nlm.nih.gov/pmc/articles/PMC{id}/pdf/`

3. **OpenAlex API** (additional OA discovery):
   ```python
   url = f"https://api.openalex.org/works/https://doi.org/{doi}"
   # Requires polite pool: add email in User-Agent header or mailto= param
   r = requests.get(url, headers={"User-Agent": f"MyApp/1.0 (mailto:{email})"}).json()
   oa_url = r.get("open_access", {}).get("oa_url")
   ```

4. **CrossRef landing page**: Follow `https://api.crossref.org/works/{doi}` → publisher link
   → scrape `<meta name="citation_pdf_url">` tag

#### Phase 5b: Alternative Sources

Some researchers use alternative access methods for paywalled content.
**Users are responsible for ensuring compliance with their institutional access policies.**

If an environment variable (e.g., `SCIHUB_BASE`) is set, the skill may use it as an
alternative PDF source. No specific URLs are provided here — users configure this themselves.

Other options:
- **Institutional proxy/VPN**: Access publisher sites through institutional EZproxy or VPN
- **Interlibrary loan (ILL)**: Request through library services for papers not otherwise available
- **Author contact**: Email corresponding authors for preprints

#### PDF Validation

Always validate downloaded files before use:

```python
def is_valid_pdf(filepath):
    """Check that a downloaded file is actually a PDF, not an HTML redirect."""
    import os
    if os.path.getsize(filepath) < 10240:  # < 10KB is likely a stub/redirect
        return False
    with open(filepath, 'rb') as f:
        header = f.read(5)
    return header == b'%PDF-'
```

Additional checks:
- Verify HTTP `Content-Type: application/pdf` header before saving
- Files under 10KB are almost always HTML login/redirect pages, not real PDFs
- Some publishers return CAPTCHA pages — these fail the `%PDF-` check

#### Rate Limiting

- Unpaywall: Polite pool (no hard limit with email parameter)
- OpenAlex: Include email in User-Agent for polite pool access
- NCBI/PMC: 3 requests/sec without API key, 10/sec with `NCBI_API_KEY`
- General: 2-second minimum interval between requests to any single host

### Phase 6: Gap Analysis

When called during manuscript writing (especially by `/write-paper` Phase 7):

1. **Read the manuscript** to extract all inline citations.
2. **Compare** cited references against the search results.
3. **Identify gaps**:
   - Key papers in the field that are not cited.
   - Outdated references when newer versions exist.
   - Missing methodological references (e.g., statistical methods, reporting guidelines).
4. **Report** findings to the user with specific suggestions.

---

## Specialized Search Modes

### Mode: Systematic Search

For systematic reviews or comprehensive literature sections:

1. Document the full search strategy (PRISMA-compliant).
2. Record: database, date of search, query string, number of results.
3. Track inclusion/exclusion at each screening step.
4. Output a PRISMA flow diagram data summary.

### Mode: Quick Cite

For quickly finding a single reference the user describes:

1. User says something like "that 2023 paper by Smith about AI in chest X-ray."
2. Search PubMed and Semantic Scholar with the described details.
3. Present top 3 candidates.
4. User confirms which one.
5. Generate BibTeX entry.

### Mode: Related Papers

For expanding from a known paper:

1. User provides a PMID or DOI.
2. Use `find_related_articles` to get related papers.
3. Use Semantic Scholar for citation-based recommendations.
4. Present results ranked by relevance.

### Mode: Embase Browser Automation

Embase has no public API. Use Chrome browser automation (MCP) to search and export:

1. Navigate to `embase.com` — institutional SSO authenticates automatically.
   If cookie error (`login?error#`), clear Elsevier/Embase cookies and retry.
2. Go to **Advanced Search** tab.
3. Enter Embase-syntax query (Emtree `/exp` + `:ab,ti` field tags).
   Uncheck "Map to preferred term in Emtree" when using explicit `/exp` terms.
4. After results appear, use "Select number of items" dropdown → select total count.
5. Click **Export** (in Results section) → choose **CSV** format → check fields:
   Title, Author names, Source, Publication year, Publication type, DOI, Abstract,
   Language of article, Medline PMID.
6. Click Export → Download tab opens → click Download.
7. CSV is in **row format** (records separated by blank rows) — parse with:
   ```python
   # Each record = consecutive rows until blank row
   # Row format: [FIELD_NAME, value1, value2, ...]
   # AUTHOR NAMES row has multiple values (one per author)
   ```

**PubMed → Embase query translation:**
- MeSH `[Mesh]` → Emtree `/exp`
- `[tiab]` → `:ab,ti`
- `[Title/Abstract]` → `:ab,ti`
- Boolean operators stay the same (AND, OR)
- Phrase search: use single quotes in Embase (`'artificial ascites'`)

---

## Error Handling

- If a search returns 0 results, broaden the query (remove one concept or use broader MeSH terms) and retry.
- **CrossRef HTTP errors (token-saving rules):**
  - **403 (rate-limited):** Do NOT retry. Skip CrossRef silently → verify via PubMed title search instead.
  - **303 (redirect):** Follow the redirect if possible. If not, skip CrossRef → PubMed fallback.
  - **Any repeated failure:** After the first CrossRef 403/303 in a session, assume CrossRef is
    rate-limiting and skip CrossRef for ALL remaining references. Go directly to PubMed title
    verification. This avoids N×retry token waste.
  - **Never print raw error messages** like "Request failed with status code 403." Collect
    failures silently and report a single summary line at the end:
    `CrossRef unavailable for {N} references (rate-limited). Verified via PubMed instead.`
- If a DOI does not resolve via CrossRef (after applying the rules above), try searching PubMed by title to confirm the reference exists.
- If the user provides a reference that cannot be verified by any method, clearly state: "This reference could not be verified. Please check manually before submission."
- Never silently include an unverified reference.

## What This Skill Does NOT Do

- Does not download from paywalled journals without user-provided credentials or institutional access.
- Does not assess the quality of evidence (use `/analyze-stats` or `/check-reporting` for that).
- Does not write the literature review text (use `/write-paper` for that).
- Does not fabricate any part of a citation.
