---
name: dscli—dataset-gathering-cli
description: Manage Turkish news content records for aa-news-encoder. Covers all dscli commands with required/optional args and usage examples.
---

# dscli — Dataset Gathering CLI

`dscli` is a local CLI tool for managing Turkish news content records used to build Hugging Face datasets from three sources: **sabah**, **hurriyet**, and **sozcu**.

Records go through two stages:
- **Stage 1 (metadata):** `source`, `baslik`, `kategori`, `kaynak_url` — added via `add`
- **Stage 2 (full content):** `ozet`, `icerik`, `yayim_tarihi` — filled via `update`

Data is stored locally in PGlite (`.data/db/`). Records are never hard-deleted; use soft-delete (`delete` / `bulk_delete`).

---

## Setup

```bash
# From repo root
pnpm install

# Switch to datasets-gathering/ for the remaining steps
cd datasets-gathering

# Generate migration SQL (first time or after schema changes)
pnpm db:generate

# Build the CLI
pnpm build

# Make dscli available globally
pnpm link --global

# Verify
dscli --help
```

---

## Kategori Values

> Values are **0-indexed** to match the model's convention directly (no remapping needed).

| Number | Name         |
|--------|--------------|
| 0      | POLITIKA     |
| 1      | EKONOMI      |
| 2      | SPOR         |
| 3      | SAGLIK       |
| 4      | KULTUR_SANAT |
| 5      | DUNYA        |
| 6      | TEKNOLOJI    |
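
When scripting `add` calls, it can be easier to resolve a kategori name to its number than to hard-code digits. A minimal sketch of such a lookup follows; `kategori_id` is a hypothetical helper name and not part of `dscli`.

```bash
# kategori_id: hypothetical helper (not part of dscli) that maps a kategori
# name from the table above to its 0-indexed number.
kategori_id() {
  case "$1" in
    POLITIKA)     echo 0 ;;
    EKONOMI)      echo 1 ;;
    SPOR)         echo 2 ;;
    SAGLIK)       echo 3 ;;
    KULTUR_SANAT) echo 4 ;;
    DUNYA)        echo 5 ;;
    TEKNOLOJI)    echo 6 ;;
    *) echo "unknown kategori: $1" >&2; return 1 ;;
  esac
}

# Usage: pass the result to --kategori, for example
#   --kategori "$(kategori_id EKONOMI)"
```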

---

## Commands

### `list`

List non-deleted records as a table (16 per page).

```
dscli list [--by <order>] [--page <number>]
```

| Option | Default      | Values                                               |
|--------|--------------|------------------------------------------------------|
| --by   | updated-desc | updated-desc, updated-asc, created-desc, created-asc |
| --page | 0            | 0-based page number                                  |

**Examples:**
```bash
dscli list
dscli list --by created-asc
dscli list --by updated-asc --page 2
```
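
Because `--page` is 0-based and each page holds 16 records, the highest valid page index for a given record count follows from a ceiling division. The sketch below shows only that arithmetic. `page_count` and `last_page` are hypothetical helper names, and the record count has to come from elsewhere, since no counting command is documented here.

```bash
PAGE_SIZE=16  # records per page, as stated for `list`

# Number of pages needed for N records (ceiling division).
page_count() {
  echo $(( ($1 + PAGE_SIZE - 1) / PAGE_SIZE ))
}

# Highest valid --page value for N records (pages are 0-based).
# Prints nothing when N is 0, because there is no page to show.
last_page() {
  if [ "$1" -gt 0 ]; then
    echo $(( $(page_count "$1") - 1 ))
  fi
}

page_count 40   # prints 3 (pages 0, 1, 2)
last_page 40    # prints 2
```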

---

### `detail`

Show all fields for a single record.

```
dscli detail --record_id <uuid>
```

**Examples:**
```bash
dscli detail --record_id 550e8400-e29b-41d4-a716-446655440000
```

---

### `add`

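Add a record with Stage 1 metadata. Prints the new `record_id` on success. The record's `unique_id` is derived from `source+baslik+kategori` (lowercased, spaces removed) and must be unique, so the same title cannot be added twice for the same source and category.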

```
dscli add --source <slug> --baslik <title> --kategori <0-6> --kaynak_url <url>
```

**Examples:**
```bash
dscli add --source sabah --baslik "Ekonomide son gelişmeler" --kategori 1 --kaynak_url "https://www.sabah.com.tr/ekonomi/..."
dscli add --source hurriyet --baslik "Fenerbahçe şampiyon" --kategori 2 --kaynak_url "https://www.hurriyet.com.tr/sporarena/futbol/..."
```
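
To make the uniqueness rule concrete, the derivation can be approximated in shell. This is only a sketch of the rule as described above: the concatenation order, the absence of separators, and the handling of non-ASCII letters are assumptions, and `derive_unique_id` is a hypothetical name rather than a `dscli` command. The `tr` calls here lowercase ASCII letters only, so Turkish letters such as `İ` or `Ş` pass through unchanged, which may differ from what `dscli` does.

```bash
# derive_unique_id: hypothetical illustration of the unique_id rule.
# ASSUMPTION: plain concatenation of source, baslik, kategori with no separator.
derive_unique_id() {
  # $1 = source, $2 = baslik, $3 = kategori
  printf '%s%s%s' "$1" "$2" "$3" | LC_ALL=C tr 'A-Z' 'a-z' | LC_ALL=C tr -d ' '
}

# Titles that differ only in ASCII case or spacing collapse to the same value.
derive_unique_id sabah "Ekonomide son durum" 1    # sabahekonomidesondurum1
```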

---

### `update`

Fill the Stage 2 content fields of a record that currently holds only Stage 1 metadata. Records whose Stage 2 fields are already filled, and soft-deleted records, cannot be updated.

```
dscli update --record_id <uuid> --ozet <text> --icerik <text> --yayim_tarihi <date>
```

**Examples:**
```bash
dscli update \
  --record_id 550e8400-e29b-41d4-a716-446655440000 \
  --ozet "Kısa açıklama metni" \
  --icerik "Temizlenmiş makale içeriği..." \
  --yayim_tarihi "2026-01-15"
```
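
The two stages can be chained in a script by capturing the ID that `add` prints and passing it to `update`. The sketch below assumes that `dscli add` writes only the `record_id` to standard output; if the real output contains extra text, the capture needs adjusting. `add_and_fill` is a hypothetical wrapper name, not a `dscli` command.

```bash
# add_and_fill: hypothetical wrapper (not part of dscli) that runs both stages.
# ASSUMPTION: `dscli add` prints only the record_id on stdout.
add_and_fill() {
  # $1 source, $2 baslik, $3 kategori, $4 kaynak_url,
  # $5 ozet, $6 icerik, $7 yayim_tarihi
  record_id=$(dscli add --source "$1" --baslik "$2" --kategori "$3" --kaynak_url "$4") || return 1
  dscli update --record_id "$record_id" --ozet "$5" --icerik "$6" --yayim_tarihi "$7"
}
```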

---

### `delete`

Soft-delete a single record. Soft-deleted records are excluded from `list` and `dump`.

```
dscli delete --record_id <uuid>
```

**Examples:**
```bash
dscli delete --record_id 550e8400-e29b-41d4-a716-446655440000
```

---

### `bulk_delete`

Soft-delete up to 16 records in one call by passing their IDs as a comma-separated list.

```
dscli bulk_delete --record_ids "<uuid1>,<uuid2>,..."
```

**Examples:**
```bash
dscli bulk_delete --record_ids "uuid1,uuid2,uuid3"
```
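
When more than 16 records need to go, the IDs have to be split into batches that respect the limit. One possible way to do that is sketched below; `batch_ids` is a hypothetical helper that reads one ID per line and prints one comma-separated batch per line.

```bash
# batch_ids: hypothetical helper (not part of dscli).
# stdin:  one record_id per line (blank lines are skipped)
# stdout: one comma-separated batch of at most 16 IDs per line
batch_ids() {
  awk -v n=16 '
    NF { printf "%s%s", (c % n ? "," : (c ? "\n" : "")), $1; c++ }
    END { if (c) print "" }
  '
}

# Usage, assuming ids.txt holds one record_id per line:
# batch_ids < ids.txt | while IFS= read -r batch; do
#   dscli bulk_delete --record_ids "$batch"
# done
```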

---

### `dump`

Export non-deleted records to CSV or JSONL.

```
dscli dump --source <slug|all> --format <csv|jsonl> --path <output-file>
```

| Option   | Description                                   |
|----------|-----------------------------------------------|
| --source | Source slug (sabah, hurriyet, sozcu) or `all` |
| --format | `csv` or `jsonl`                              |
| --path   | Output file path (relative or absolute)       |

**Examples:**
```bash
# Export all sources as JSONL
dscli dump --source all --format jsonl --path ./export.jsonl

# Export only sabah as CSV
dscli dump --source sabah --format csv --path ./sabah.csv
```
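
If one file per source is preferred over a single combined export, the same command can be looped over the three slugs. The sketch below uses only the options documented above; `dump_each_source` is a hypothetical wrapper name, and the output file naming is an arbitrary choice.

```bash
# dump_each_source: hypothetical wrapper (not part of dscli) that writes one
# JSONL file per source into the directory given as $1 (default: current dir).
dump_each_source() {
  out_dir=${1:-.}
  for src in sabah hurriyet sozcu; do
    dscli dump --source "$src" --format jsonl --path "$out_dir/$src.jsonl" || return 1
  done
}
```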

---

## Target Counts Per Source

| Kategori         | sabah | hurriyet | sozcu |
|------------------|-------|----------|-------|
| POLITIKA (0)     | 80    | 80       | 80    |
| EKONOMI (1)      | 72    | 72       | 72    |
| SPOR (2)         | 72    | 72       | 72    |
| SAGLIK (3)       | 72    | 72       | 72    |
| KULTUR_SANAT (4) | 72    | 72       | 72    |
| DUNYA (5)        | 72    | 72       | 72    |
| TEKNOLOJI (6)    | 108   | 0        | 108   |

> Hürriyet does not offer a usable Teknoloji feed. To compensate, Sabah and Sözcü each target 108 TEKNOLOJI items, for a combined 216. That matches the three-source total (3 × 72) of EKONOMI through DUNYA, so only POLITIKA (3 × 80 = 240) differs.
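
The totals implied by the table can be checked with a few lines of shell arithmetic. This is just a worked sum over the numbers above; `sum` is a hypothetical helper defined for the example.

```bash
# Per-source targets in kategori order 0..6, copied from the table above.
sabah="80 72 72 72 72 72 108"
hurriyet="80 72 72 72 72 72 0"
sozcu="80 72 72 72 72 72 108"

# sum: add up all integer arguments.
sum() {
  total=0
  for n in "$@"; do total=$((total + n)); done
  echo "$total"
}

# Word splitting on the unquoted variables is intentional here.
echo "sabah:     $(sum $sabah)"                      # 548
echo "hurriyet:  $(sum $hurriyet)"                   # 440
echo "sozcu:     $(sum $sozcu)"                      # 548
echo "all:       $(sum $sabah $hurriyet $sozcu)"     # 1536
echo "TEKNOLOJI: $(sum 108 0 108)"                   # 216, same as 3 x 72
```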
