---
title: "Turn messy document collections into structured rows with DocETL"
description: "Define repeatable extraction pipelines that pull fields from large document collections, normalize outputs, and audit failures across the corpus."
verification: "listed"
source: "https://github.com/ucbepic/docetl"
author: "UCB EPIC"
publisher_type: "organization"
category:
  - "Data Extraction & Transformation"
framework:
  - "Multi-Framework"
tool_ecosystem:
  github_repo: "ucbepic/docetl"
  github_stars: 3707
---

# Turn messy document collections into structured rows with DocETL

Define repeatable extraction pipelines that pull fields from large document collections, normalize outputs, and audit failures across the corpus.

## Prerequisites

- Python 3.10+
- DocETL
- A document corpus to process
- An extraction configuration for that corpus

## Installation

Install the skill using whichever option fits your setup:

1. Copy this skill folder into your local skills directory.
2. Clone the repo and symlink or copy the skill into your agent workspace.
3. Add the repo as a git submodule if you manage shared skills centrally.
4. Install it through your internal provisioning or packaging workflow.
5. Download the folder directly from GitHub and place it in your skills collection.

Then set up and run DocETL itself:

1. Install DocETL by following the project instructions.
2. Configure the extraction pipeline for your document set.
3. Run the pipeline to emit normalized structured outputs.
4. Review the failures.
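The review step can be sketched in plain Python. This is a minimal sketch, not part of DocETL: it assumes the pipeline output is a JSON array of objects, and the field names in `REQUIRED_FIELDS`, the helper functions, and the sample rows are all hypothetical placeholders to replace with your own schema.

```python
import json
from collections import Counter

# Hypothetical field names for illustration; replace with your own schema.
REQUIRED_FIELDS = ("title", "date", "amount")


def normalize_row(row):
    """Trim whitespace in string values and treat empty strings as missing."""
    cleaned = {}
    for key, value in row.items():
        if isinstance(value, str):
            value = value.strip()
            if value == "":
                value = None
        cleaned[key] = value
    return cleaned


def audit_rows(rows, required=REQUIRED_FIELDS):
    """Split rows into valid rows and failures, recording the missing fields."""
    valid, failures = [], []
    for index, raw in enumerate(rows):
        row = normalize_row(raw)
        missing = [field for field in required if row.get(field) is None]
        if missing:
            failures.append({"index": index, "missing": missing, "row": row})
        else:
            valid.append(row)
    return valid, failures


def summarize(failures):
    """Count how often each required field was missing across all rows."""
    return Counter(field for failure in failures for field in failure["missing"])


def load_rows(path):
    """Load pipeline output, assumed here to be a JSON array of objects."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


if __name__ == "__main__":
    # Made-up sample rows standing in for real pipeline output.
    sample = [
        {"title": " Invoice 17 ", "date": "2024-01-05", "amount": "120.00"},
        {"title": "Invoice 18", "date": "", "amount": "88.10"},
        {"title": "Invoice 19"},
    ]
    valid, failures = audit_rows(sample)
    print(len(valid), "valid;", len(failures), "failed")
    print(dict(summarize(failures)))
```

To audit a real run, pass the result of `load_rows("path/to/output.json")` to `audit_rows`, with the path pointing to wherever your pipeline writes its output. Keeping the failing rows with their original index makes it easier to trace each failure back to its source document.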

## Documentation

- https://docetl.org/

## Source

- [Agent Skill Exchange](https://agentskillexchange.com/skills/turn-messy-document-collections-into-structured-rows-with-docetl/)