--- name: ceres-publish-dataset description: Use when performing the monthly HuggingFace dataset export for the Ceres Open Data Index. Covers the full workflow — data quality checks, parquet export, README dataset card update, versioning, and push to HuggingFace. --- # Ceres Publish Dataset — Monthly HuggingFace Export Recurring monthly workflow to export the Ceres open data index to [HuggingFace](https://huggingface.co/datasets/AndreaBozzo/ceres-open-data-index) as a curated Parquet dataset. **Tracking issue:** https://github.com/AndreaBozzo/Ceres/issues/89 ## Checklist 1. Run data quality checks (noise, duplicates, new portals) 2. Export: `ceres export --format parquet --output ~/ceres-open-data-index` 3. Remove any stale per-portal parquet files from previous exports (e.g. renamed portals) 4. Update `README.md` dataset card (counts, portal table, snapshot date, bias notes) 5. Version the snapshot (commit + tag) 6. Push to HuggingFace 7. (Optional) Regenerate HuggingFace Space visualization ## Export Command ```bash ceres export --format parquet --output ~/ceres-open-data-index ``` The export streams all non-stale datasets from PostgreSQL, applies curation, and writes: - `all.parquet` — Complete flattened dataset - `data/.parquet` — Per-portal subsets - `metadata.json` — Snapshot metadata with counts and portal breakdown Portal names are resolved from `~/.config/ceres/portals.toml`. Portals not in the config fall back to hostname-based naming (e.g. `https://data.gov.ro` becomes `data-gov-ro`). The export takes ~30-40 minutes for 900k+ datasets due to JSONB flattening. ## Export Schema (14 columns) | Column | Type | Description | |--------|------|-------------| | `original_id` | string | Dataset ID from source portal | | `source_portal` | string | Portal base URL | | `portal_name` | string | Human-readable portal name | | `url` | string | Direct URL to dataset page | | `title` | string | Dataset title | | `description` | string | Dataset description (nullable) | | `tags` | string | Comma-separated tag names (nullable) | | `organization` | string | Publishing organization (nullable) | | `license` | string | License title or identifier (nullable) | | `metadata_created` | string | Original creation date ISO 8601 (nullable) | | `metadata_modified` | string | Last modification date ISO 8601 (nullable) | | `first_seen_at` | string | When Ceres first indexed this dataset (RFC 3339) | | `language` | string | Primary language code (nullable) | | `is_duplicate` | boolean | True if same title appears in another portal | ## Curation Rules (applied automatically during export) - **Noise filter**: Removes datasets where title < 5 chars, description is empty, or title contains "test"/"prova"/"esempio" (case-insensitive substring match) - **Duplicate flag**: Same title (case-insensitive) across different portals sets `is_duplicate=true`. Duplicates are kept, not removed. SQL: `SELECT LOWER(title) FROM datasets GROUP BY LOWER(title) HAVING COUNT(DISTINCT TRIM(TRAILING '/' FROM source_portal)) > 1` - **Metadata flattening**: Tags from `metadata.tags[].name`, organization from `metadata.organization.title`, license from `metadata.license_title` ## HuggingFace Dataset Repository - **Location:** `~/ceres-open-data-index` - **Remote:** `https://huggingface.co/datasets/AndreaBozzo/ceres-open-data-index` - **Branch:** `main` - **LFS:** `*.parquet` tracked via Git-LFS (configured in `.gitattributes`) ### Commit and push workflow ```bash cd ~/ceres-open-data-index git add all.parquet data/ metadata.json README.md git commit -m "vN: Month Year export (X portals, Yk datasets)" git tag vN git push origin main git push origin vN ``` ## README Dataset Card — What to Update The README at `~/ceres-open-data-index/README.md` uses HuggingFace dataset card format with YAML frontmatter. Update these sections: 1. **YAML frontmatter**: `language` list (add new language codes), `size_categories`, `tags` 2. **Opening line**: Total datasets count, portal count, country count 3. **Files section**: Row count, file count 4. **Portal table** (`### Splits by portal`): Add new portals, update all counts, sort by count descending 5. **Curation section**: Noise filtering count, duplicate flagging count 6. **Update frequency**: Snapshot date 7. **Known biases**: Geographic skew percentages, language distribution, portal selection notes 8. **Citation block**: Snapshot date, portal count All numbers come from `metadata.json` (generated by the export). ## Export History | Version | Date | Exported | Filtered | Duplicates | Portals | Countries | |---------|------|----------|----------|------------|---------|-----------| | v1 | 2026-02-12 | 230,315 | 4,693 | 58,364 | 23 | 8 + intl | | v2 | 2026-02-25 | 349,836 | 6,532 | 59,062 | 25 | 9 + intl | | v3 | 2026-03-30 | 890,143 | 10,331 | 75,527 | 32 | 13 + intl | ## Key Files | File | Purpose | |------|---------| | `~/.config/ceres/portals.toml` | Portal config — controls name resolution during export | | `~/ceres-open-data-index/README.md` | HuggingFace dataset card | | `~/ceres-open-data-index/metadata.json` | Export metadata (generated, do not edit) | | `crates/ceres-core/src/parquet_export.rs` | Export logic (noise filter, dedup, flattening) | | `crates/ceres-db/src/repository.rs` | SQL queries (duplicate detection, dataset streaming) | ## HuggingFace Space (Optional) The visualization dashboard at `https://huggingface.co/spaces/AndreaBozzo/Ceres` is a static Plotly site generated from the exported data. - **Location:** `~/Documenti/Ceres(huggingfacespace)` - **Generator:** `~/Documenti/open-data-galaxy/galaxy_constellation.py` - Regenerate after pushing the dataset, then push the Space repo separately ## Pre-export Data Quality Checks ```bash # Overall stats ceres stats # Check noise candidates in DB (optional) docker exec ceres_db psql -U ceres_user -d ceres_db -c " SELECT COUNT(*) FILTER (WHERE LENGTH(title) < 5) as tiny_titles, COUNT(*) FILTER (WHERE description IS NULL OR TRIM(description) = '') as empty_desc, COUNT(*) FILTER (WHERE LOWER(title) LIKE '%test%' OR LOWER(title) LIKE '%prova%' OR LOWER(title) LIKE '%esempio%') as noise_titles FROM datasets WHERE NOT is_stale; " ```