---
title: "Extract schema.org, Open Graph, and JSON-LD metadata from web pages for indexing"
description: "Uses extruct to pull machine-readable metadata from raw HTML so an agent can classify, deduplicate, or enrich pages without brittle full-page parsing. It is best for metadata harvesting workflows, not for crawling an entire site or rendering JavaScript-heavy pages."
verification: "security_reviewed"
source: "https://github.com/scrapinghub/extruct"
author: "Scrapinghub"
publisher_type: "Company"
category:
  - "Research & Scraping"
framework:
  - "Multi-Framework"
tool_ecosystem:
  github_repo: "scrapinghub/extruct"
  github_stars: 961
---

# Extract schema.org, Open Graph, and JSON-LD metadata from web pages for indexing

Uses extruct to pull machine-readable metadata from raw HTML so an agent can classify, deduplicate, or enrich pages without brittle full-page parsing. It is best for metadata harvesting workflows, not for crawling an entire site or rendering JavaScript-heavy pages.

## Prerequisites

A Python 3 environment with `pip` available.

## Installation

Choose whichever fits your setup:

1. Copy this skill folder into your local skills directory.
2. Clone the repo and symlink or copy the skill into your agent workspace.
3. Add the repo as a git submodule if you manage shared skills centrally.
4. Install it through your internal provisioning or packaging workflow.
5. Download the folder directly from GitHub and place it in your skills collection.

Install the package from PyPI (or follow the upstream instructions):

```shell
pip install extruct
```

## Documentation

- https://github.com/scrapinghub/extruct#readme

## Source

- [Agent Skill Exchange](https://agentskillexchange.com/skills/extract-schema-org-open-graph-and-json-ld-metadata-from-web-pages-for-indexing/)
