---
name: parse_skill
description: Document parsing skill for extracting content from PDFs and other academic documents using MinerU and local parsers
version: 1.0.0
author: PaperAgent Team
---

# Document Parsing Skill

This skill enables you to parse and extract content from academic documents, particularly PDFs. Use this skill when you need to extract text, figures, tables, and structured content from research papers.

## Available Tools

You have access to the following parsing tools (registered in the Toolkit):

### 1. `parse_pdf_with_mineru`
Parse a PDF document using the MinerU API for high-quality extraction.

**Parameters:**
- `pdf_url` (str, required): URL to the PDF file
- `extract_figures` (bool, optional): Whether to extract figures (default: True)
- `model_version` (str, optional): "vlm" (vision-language model) or "basic"

**Returns:** Parsed content including:
- Full text in markdown format
- Extracted sections with headings
- Figures with captions
- Tables (converted to markdown)
- Reference list

**Example:**
```
parse_pdf_with_mineru(pdf_url="https://arxiv.org/pdf/2301.12345.pdf", extract_figures=True, model_version="vlm")
```

**Note:** MinerU API requires authentication. Ensure the API token is configured.

### 2. `parse_pdf_local`
Parse a PDF document locally using pypdf. This is faster than MinerU but less accurate.

**Parameters:**
- `file_path` (str, required): Local path to the PDF file

**Returns:** Extracted text content with page markers and metadata.

**Example:**
```
parse_pdf_local(file_path="/path/to/paper.pdf")
```
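
As a rough illustration of what local parsing with pypdf can look like, the sketch below reads each page and joins the text with page markers. The function names `join_pages` and `parse_pdf_local_sketch`, the `--- Page N ---` marker format, and the shape of the returned dictionary are all assumptions made for this example, not the actual implementation of `parse_pdf_local`.

```python
from pathlib import Path


def join_pages(page_texts):
    """Join per-page text, putting a marker line before each page.

    The "--- Page N ---" marker format is an assumption for illustration.
    """
    parts = []
    for number, text in enumerate(page_texts, start=1):
        parts.append(f"--- Page {number} ---\n{text.strip()}")
    return "\n\n".join(parts)


def parse_pdf_local_sketch(file_path):
    """Hypothetical stand-in for parse_pdf_local, built on pypdf."""
    # Imported lazily so join_pages can be used without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(str(Path(file_path)))
    # extract_text() can return an empty string for image-only pages.
    texts = [page.extract_text() or "" for page in reader.pages]
    info = reader.metadata  # None when the PDF has no info dictionary
    return {
        "text": join_pages(texts),
        "page_count": len(texts),
        "title": info.title if info else None,
    }
```

Because pypdf reads text in content-stream order, the output of a sketch like this can misorder text on multi-column pages.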

### 3. `extract_text_from_file`
Extract text from various file types.

**Parameters:**
- `file_path` (str, required): Path to the file (supports txt, md, pdf, docx)

**Returns:** Plain text content.

**Example:**
```
extract_text_from_file(file_path="/path/to/document.docx")
```
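
Before calling the tool, it can help to confirm that the file type is one of the four listed above. The helper below is a minimal sketch; `is_supported` and `SUPPORTED_SUFFIXES` are hypothetical names, and the tool itself may perform its own checks.

```python
from pathlib import Path

# File types listed for extract_text_from_file in this document.
SUPPORTED_SUFFIXES = {".txt", ".md", ".pdf", ".docx"}


def is_supported(file_path):
    """Return True if the file extension is one of the documented types."""
    return Path(file_path).suffix.lower() in SUPPORTED_SUFFIXES
```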

### 4. `extract_sections`
Extract structured sections from a parsed document.

**Parameters:**
- `content` (str, required): Document content in markdown format
- `section_patterns` (list[str], optional): Custom section heading patterns

**Returns:** List of sections with:
- Section title
- Section content
- Section level (h1, h2, h3, etc.)

**Example:**
```
extract_sections(content=markdown_content, section_patterns=["Abstract", "Introduction", "Methods", "Results", "Conclusion"])
```
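
To make the returned structure concrete, here is a minimal sketch of heading-based splitting over markdown content. It is not the implementation of `extract_sections`; the function name `split_sections` and the dictionary keys are assumptions, and the sketch does not skip `#` lines inside fenced code blocks.

```python
import re

# ATX-style markdown heading: one to six '#' characters, then the title.
HEADING = re.compile(r"^(#{1,6})[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def split_sections(markdown):
    """Split markdown into sections, one per heading."""
    matches = list(HEADING.finditer(markdown))
    sections = []
    for index, match in enumerate(matches):
        # A section runs until the next heading or the end of the document.
        if index + 1 < len(matches):
            end = matches[index + 1].start()
        else:
            end = len(markdown)
        sections.append({
            "title": match.group(2),
            "level": len(match.group(1)),  # 1 for h1, 2 for h2, ...
            "content": markdown[match.end():end].strip(),
        })
    return sections
```

Text that appears before the first heading is ignored by this sketch.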

### 5. `extract_figures`
Extract figure information from parsed content.

**Parameters:**
- `content` (str, required): Document content
- `include_images` (bool, optional): Whether to include image data (default: False)

**Returns:** List of figures with captions and positions.

**Example:**
```
extract_figures(content=parsed_content, include_images=True)
```
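
The following sketch shows one way figure entries could be located in parsed markdown, assuming figures appear as `![caption](url)` or `[Figure N: caption]` as described under Output Format. The name `find_figures` and the returned keys are hypothetical; the real tool may return different fields.

```python
import re

# Markdown image: ![caption](url)
IMAGE = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")
# Text placeholder: [Figure N: caption]. The lookbehind skips image captions.
PLACEHOLDER = re.compile(r"(?<!!)\[Figure (\d+): ([^\]]+)\]")


def find_figures(content):
    """Return figures with captions and character positions, in document order."""
    figures = []
    for match in IMAGE.finditer(content):
        figures.append({
            "caption": match.group(1),
            "url": match.group(2),
            "position": match.start(),
        })
    for match in PLACEHOLDER.finditer(content):
        figures.append({
            "caption": match.group(2),
            "number": int(match.group(1)),
            "position": match.start(),
        })
    return sorted(figures, key=lambda figure: figure["position"])
```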

### 6. `extract_references`
Extract and parse the reference list from a document.

**Parameters:**
- `content` (str, required): Document content
- `format` (str, optional): Expected citation format, used as a parsing hint

**Returns:** List of parsed references with structured fields.

**Example:**
```
extract_references(content=paper_content)
```
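
As an illustration of "structured fields", the sketch below parses a numbered reference list into entry number, raw text, and a best-guess year. It assumes entries start with `1.` or `[1]` and fit on one line; `parse_reference_lines` and its output keys are hypothetical and much simpler than a real citation parser.

```python
import re

# Entry prefix: "[12] ..." or "12. ..."
ENTRY = re.compile(r"^\s*(?:\[(\d+)\]|(\d+)\.)\s+(.+)$")
# First four-digit token that looks like a year (1900-2099).
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")


def parse_reference_lines(text):
    """Parse single-line numbered references; skip lines that do not match."""
    references = []
    for line in text.splitlines():
        match = ENTRY.match(line)
        if not match:
            continue
        number = int(match.group(1) or match.group(2))
        raw = match.group(3).strip()
        year = YEAR.search(raw)
        references.append({
            "number": number,
            "raw": raw,
            "year": int(year.group(0)) if year else None,
        })
    return references
```

The year heuristic can misfire on identifiers that begin with a year-like number, so treat the `year` field as a hint.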

## Parsing Strategies

### For arXiv Papers
1. Use the PDF URL from arXiv (e.g., `https://arxiv.org/pdf/xxxx.xxxxx.pdf`)
2. Prefer `parse_pdf_with_mineru` with `model_version="vlm"` for best results
3. For faster processing, use `model_version="basic"`
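
A small helper can build the PDF URL from an arXiv identifier before passing it to `parse_pdf_with_mineru`. This is a sketch: `arxiv_pdf_url` is a hypothetical name, and it only accepts new-style identifiers such as `2301.12345`, optionally with a version suffix.

```python
import re

# New-style arXiv identifier: YYMM.NNNN or YYMM.NNNNN, optional version (v2).
ARXIV_ID = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")


def arxiv_pdf_url(arxiv_id):
    """Return the arXiv PDF URL for a new-style identifier."""
    if not ARXIV_ID.match(arxiv_id):
        raise ValueError(f"not a new-style arXiv identifier: {arxiv_id!r}")
    return f"https://arxiv.org/pdf/{arxiv_id}.pdf"
```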

### For Local PDFs
1. If the file is accessible locally, use `parse_pdf_local` for quick extraction
2. For better quality, upload the file to a URL that MinerU can access and use `parse_pdf_with_mineru`

### For Multi-column PDFs
1. MinerU handles multi-column layouts automatically
2. Local parsing may return text from adjacent columns in the wrong order

### For PDFs with Complex Figures
1. Use `extract_figures=True` with MinerU
2. The VLM model (`model_version="vlm"`) provides better figure understanding
3. Figure captions are extracted separately
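
The strategies above can be condensed into a small decision helper. This is one possible reading of the guidance, not a rule the toolkit enforces; `choose_parser`, its flags, and the returned dictionary are hypothetical.

```python
from urllib.parse import urlparse


def choose_parser(source, need_figures=False, multi_column=False, prefer_speed=False):
    """Suggest a tool and arguments for a PDF given as a URL or local path."""
    is_url = urlparse(source).scheme in ("http", "https")
    needs_layout = need_figures or multi_column
    if is_url or needs_layout:
        # "basic" trades quality for speed; keep "vlm" when figures matter.
        version = "basic" if prefer_speed and not need_figures else "vlm"
        call = {
            "tool": "parse_pdf_with_mineru",
            "model_version": version,
            "extract_figures": need_figures,
        }
        if not is_url:
            # MinerU takes a URL, so a local file must be uploaded first.
            call["note"] = "upload the file to a reachable URL first"
        return call
    return {"tool": "parse_pdf_local", "file_path": source}
```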

## Best Practices

1. **Choose the right parser**:
   - MinerU: Best quality, requires API, slower
   - Local: Fast, offline, lower quality

2. **Handle large documents**:
   - Parse in chunks if needed
   - Extract specific sections rather than full document

3. **Validate extraction**:
   - Check for missing sections
   - Verify figure/table extraction

4. **Process structured content**:
   - Use `extract_sections` to navigate large documents
   - Use `extract_references` for citation analysis
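
For the validation step, a quick check for missing sections might look like the sketch below. The expected section names are taken from the `extract_sections` example in this document; `missing_sections` is a hypothetical helper that matches names case-insensitively against markdown headings.

```python
import re

DEFAULT_EXPECTED = ("Abstract", "Introduction", "Methods", "Results", "Conclusion")


def missing_sections(markdown, expected=DEFAULT_EXPECTED):
    """Return expected section names that appear in no markdown heading."""
    headings = re.findall(r"^#{1,6}[ \t]+(.+?)[ \t]*$", markdown, flags=re.MULTILINE)
    lowered = [heading.lower() for heading in headings]
    # Substring match so numbered headings like "1 Introduction" still count.
    return [
        name for name in expected
        if not any(name.lower() in heading for heading in lowered)
    ]
```

A non-empty result suggests re-parsing with a higher-quality parser or inspecting the raw output by hand.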

## Output Format

Parsed content is returned in markdown format:
- Headings preserved as `#`, `##`, `###`
- Figures as `![caption](url)` or `[Figure N: caption]`
- Tables as markdown tables
- References as numbered list
- Equations in LaTeX format when possible