---
name: hindi-language-intelligence
version: "1.0.0"
author: "Harshavardhan Bailur"
description: |
  Production-grade Hindi language processor that converts Hindi audio/video and screen
  recordings into THREE English-only output files. Primary output fields are always English;
  original Hindi is kept only as reference text. Implements a three-track system:
  - Track A: Hindi audio/video → English transcript JSON
  - Track B: Hindi screen text (OCR) → English transcript JSON
  - Track C: Merge both → Unified English knowledge base JSON

  Core technology: Whisper large-v3 for Hindi STT with task='translate' for native
  English output, Tesseract + PaddleOCR for Devanagari OCR, deep-translator for fallback
  translation, speaker diarization, entity extraction, and confidence tagging. Handles
  code-mixed Hinglish (Hindi+English) common in fintech/business contexts.

  USE THIS SKILL WHEN:
  - User has Hindi audio files (MP3, WAV, M4A, OGG, WEBM) to transcribe to English
  - User has Hindi video files (MP4, MOV, MKV) with Hindi speech to process
  - User says "transcribe Hindi to English", "translate Hindi audio", "Hindi speech to text"
  - User has screen recordings with Hindi UI text to extract and translate to English
  - User wants to create an English-language knowledge base from Hindi meetings/calls
  - User mentions "Hindi audio", "Hindi video", "Hinglish", "code-mixed Hindi"
  - User needs dual knowledge base (audio + screen text) all in English
  - User has CRM/LOS/fintech screen recordings with Hindi narration to process
  - User wants to process customer support calls in Hindi with English output
  - User needs speaker-wise Hindi transcription with English translation
  - User mentions "Devanagari OCR", "Hindi screen text", "Hindi to English translation"
  - ANY task involving Hindi speech recognition, transcription, or Hindi→English conversion
  - User has Hindi language content needing English translation and extraction
  - User mentions "Hindi meeting notes", "Hindi transcription", "audio KB from Hindi"
  - User wants to process Hinglish/code-mixed conversations
---

# Hindi Language Intelligence Skill

Production-grade processor for converting Hindi audio, video, and screen recordings into
three structured English-only output files with confidence scoring, speaker identification,
and unified knowledge base construction.

## Three Output Files (All English Only)

1. **`audio_transcript_english.json`** — Hindi speech → English text transcript
   - Full transcription with timestamps
   - Speaker identification and diarization
   - Confidence scores per segment
   - Topics/entities extracted
   - Original Hindi text included (for reference only)

2. **`screen_transcript_english.json`** — Hindi screen text (OCR) → English text
   - Frame-by-frame OCR results (Hindi Devanagari)
   - All text translated to English
   - UI element detection
   - Timestamp mapping
   - Confidence scores for each OCR segment

3. **`unified_knowledge_base.json`** — Audio + Screen → Unified English KB
   - Timeline-aligned audio + screen content
   - Single English-only knowledge base
   - Cross-referenced segments
   - Topics/entities merged from both sources
   - Master index for searching

## Architecture: 3-Pipeline System

```
HINDI AUDIO/VIDEO INPUT (Track A)
  ├─ Pipeline A1: Audio Extraction
  │  └─ Extract audio stream + convert to 16kHz mono WAV
  │
  ├─ Pipeline A2: Hindi STT + Translation
  │  ├─ Whisper large-v3 (language='hi', task='translate')
  │  ├─ Output: English translation with timestamps (Hindi reference text needs a separate task='transcribe' pass)
  │  ├─ Fallback: Vosk Hindi → deep-translator GoogleTranslator
  │  └─ Result: segments with confidence scores
  │
  └─ Pipeline A3: Audio Knowledge Base Assembly
     ├─ Speaker diarization + attribution
     ├─ Entity + topic extraction from English text
     └─ Output: audio_transcript_english.json

HINDI SCREEN RECORDING INPUT (Track B)
  ├─ Pipeline B1: Keyframe Extraction
  │  ├─ Scene detection (threshold=0.15)
  │  ├─ SSIM-based deduplication (threshold=0.95)
  │  └─ Extract representative frames per scene
  │
  ├─ Pipeline B2: Hindi OCR + Translation
  │  ├─ Tesseract (-l hin) for Devanagari
  │  ├─ PaddleOCR with devanagari_PP-OCRv5_mobile_rec
  │  ├─ Translate all detected text to English
  │  └─ Result: OCR with confidence per frame
  │
  └─ Pipeline B3: Screen Knowledge Base Assembly
     ├─ UI element detection + state tracking
     ├─ Merge OCR results with timestamps
     └─ Output: screen_transcript_english.json

UNIFIED KNOWLEDGE BASE (Track C)
  └─ Pipeline C: Merge A3 + B3
     ├─ Timeline alignment (audio timestamps ↔ screen keyframes)
     ├─ Cross-reference audio segments to screen context
     ├─ Merge topics/entities from both
     ├─ Create unified search index (English only)
     └─ Output: unified_knowledge_base.json
```
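
Pipelines A1 and B1 both reduce to an ffmpeg invocation. The sketch below only builds the argument lists; it assumes an `ffmpeg` binary on the `PATH`, and the function names and the `frame_%03d.jpg` naming pattern are illustrative, not taken from the skill's scripts. The SSIM deduplication step is not covered here.

```python
from pathlib import Path


def audio_extract_cmd(src: str, dst_wav: str) -> list[str]:
    """Build an ffmpeg command that writes 16 kHz mono 16-bit PCM WAV (Pipeline A1)."""
    return [
        "ffmpeg", "-y",        # overwrite the output file without prompting
        "-i", src,
        "-vn",                 # ignore any video stream
        "-ac", "1",            # downmix to mono
        "-ar", "16000",        # resample to 16 kHz
        "-c:a", "pcm_s16le",   # uncompressed 16-bit PCM
        dst_wav,
    ]


def keyframe_extract_cmd(src: str, out_dir: str, threshold: float = 0.15) -> list[str]:
    """Build an ffmpeg command that saves one JPEG per detected scene change (Pipeline B1)."""
    pattern = str(Path(out_dir) / "frame_%03d.jpg")
    return [
        "ffmpeg", "-y",
        "-i", src,
        # Keep only frames whose scene-change score exceeds the threshold.
        "-vf", f"select='gt(scene,{threshold})'",
        # Emit only the selected frames; newer ffmpeg builds prefer "-fps_mode vfr".
        "-vsync", "vfr",
        pattern,
    ]
```

Either list can be passed to `subprocess.run(cmd, check=True)` so that a failed extraction raises instead of silently producing an empty file.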

## When This Skill Activates

| User Intent | Trigger Keywords | Pipeline |
|---|---|---|
| **Hindi Speech to English** | "transcribe Hindi", "Hindi audio to English", "Hindi speech to text" | A1→A2→A3 |
| **Single Hindi Audio** | "Hindi meeting.mp3", "Hindi call", "Hindi recording" | A1→A2→A3 |
| **Single Hindi Video** | "Hindi video.mp4", "Hindi presentation", "Hindi demo" | A1→A2→A3 |
| **Hindi Screen Text** | "Hindi on screen", "Devanagari OCR", "Hindi screen capture" | B1→B2→B3 |
| **Audio + Screen** | "meeting with screen", "demo with narration", "video with Hindi UI" | A+B+C |
| **Batch Processing** | "process 5 Hindi files", "transcribe directory" | Wrapper (sequential) |
| **Merged KB** | "unified knowledge base", "combined audio+screen" | C (requires A3+B3) |

## Quick Start

### Single Hindi Audio File
```bash
# Input: meeting.mp3 (Hindi speech)
# Output: audio_transcript_english.json (English transcript with timestamps)

python scripts/process_audio.py meeting.mp3 ./output \
  --language hi \
  --model large \
  --translate \
  --diarize
```

**Output structure:**
```json
{
  "file": "meeting.mp3",
  "language": "hindi",
  "duration_seconds": 3600,
  "segments": [
    {
      "id": 1,
      "start_time": 0.0,
      "end_time": 5.3,
      "speaker": "Speaker_1",
      "reference_hindi_text": "नमस्ते, यह एक बैठक है",
      "english_text": "Hello, this is a meeting",
      "confidence": 0.92,
      "topics": ["greeting", "meeting"],
      "entities": []
    }
  ],
  "summary": "Meeting discussion on Q1 planning",
  "language_detected": "hindi",
  "processing_metadata": {
    "model": "whisper-large-v3",
    "translation_engine": "whisper_translate",
    "diarization_enabled": true,
    "total_segments": 145,
    "mean_confidence": 0.88
  }
}
```

### Single Hindi Screen Recording
```bash
# Input: screen_recording.mp4 (Hindi UI text + narration)
# Output: screen_transcript_english.json (English OCR + translations)

python scripts/process_screen.py screen_recording.mp4 ./output \
  --language hi \
  --threshold 0.15 \
  --ssim 0.95
```

**Output structure:**
```json
{
  "file": "screen_recording.mp4",
  "duration_seconds": 600,
  "keyframes": [
    {
      "id": 1,
      "timestamp": 5.2,
      "frame_file": "frame_001.jpg",
      "ocr_segments": [
        {
          "text_hindi": "खाता खोलें",
          "text_english": "Open Account",
          "bbox": [100, 50, 300, 100],
          "confidence": 0.87,
          "language_detected": "hindi"
        }
      ],
      "ui_elements": ["button", "text_field"],
      "scene_description": "Account creation form"
    }
  ],
  "summary": "Screen recording of account opening process with Hindi UI",
  "processing_metadata": {
    "total_keyframes": 42,
    "ocr_engine": "tesseract+paddleocr",
    "mean_ocr_confidence": 0.84
  }
}
```

### Combined Audio + Screen (Unified KB)
```bash
# Input: video_with_narration.mp4 (Hindi UI + Hindi voiceover)
# Outputs:
#   - audio_transcript_english.json
#   - screen_transcript_english.json
#   - unified_knowledge_base.json

python scripts/process_audio.py video_with_narration.mp4 ./output \
  --video \
  --extract-audio \
  --translate \
  --diarize

python scripts/process_screen.py video_with_narration.mp4 ./output \
  --language hi

python scripts/merge_knowledge_bases.py \
  --audio ./output/audio_transcript_english.json \
  --screen ./output/screen_transcript_english.json \
  --output ./output/unified_knowledge_base.json
```

**Unified KB structure:**
```json
{
  "title": "Account Opening Tutorial",
  "files": {
    "audio": "audio_transcript_english.json",
    "screen": "screen_transcript_english.json"
  },
  "timeline": [
    {
      "timestamp": 5.2,
      "type": "audio_segment",
      "audio_id": 2,
      "english_text": "Click the open account button",
      "speaker": "Trainer"
    },
    {
      "timestamp": 5.2,
      "type": "screen_change",
      "screen_id": 1,
      "english_ocr": "Open Account",
      "description": "Button becomes highlighted"
    }
  ],
  "search_index": {
    "topics": ["account_opening", "form_filling", "submission"],
    "entities": ["account_type", "requirements"],
    "keywords": ["hindi", "tutorial", "account", "process"]
  }
}
```
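
One possible way to produce a `timeline` like the one above is to interleave audio segments and keyframes by timestamp and link each audio segment to the most recent keyframe. This is a minimal sketch, not the logic of `merge_knowledge_bases.py`; it assumes audio segments carry `id`, `start` and `english_text`, and keyframes carry `id` and `timestamp`.

```python
from bisect import bisect_right


def build_timeline(audio_segments: list[dict], keyframes: list[dict]) -> list[dict]:
    """Interleave audio segments and screen keyframes into one sorted timeline."""
    frames = sorted(keyframes, key=lambda k: k["timestamp"])
    frame_times = [k["timestamp"] for k in frames]

    events = []
    for seg in audio_segments:
        # Index of the last keyframe shown at or before the segment starts.
        idx = bisect_right(frame_times, seg["start"]) - 1
        events.append({
            "timestamp": seg["start"],
            "type": "audio_segment",
            "audio_id": seg["id"],
            "english_text": seg["english_text"],
            "screen_id": frames[idx]["id"] if idx >= 0 else None,
        })
    for frame in frames:
        events.append({
            "timestamp": frame["timestamp"],
            "type": "screen_change",
            "screen_id": frame["id"],
        })

    # Sort by time; on ties "audio_segment" sorts before "screen_change".
    events.sort(key=lambda e: (e["timestamp"], e["type"]))
    return events
```

Audio segments that start before the first keyframe get `screen_id: None`, which keeps the cross-reference explicit rather than guessing a screen context.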

## ASR Fallback Chain

### Whisper-Based (Primary)
1. **Whisper large-v3** with language='hi', task='translate'
   - Best Hindi accuracy (92-95% on clean audio)
   - Native English output via translate task
   - Handles code-mixed Hinglish
   - GPU accelerated if available

2. **Whisper medium** (if large too slow)
   - Balanced speed/accuracy (88-91%)
   - Still English output

3. **Whisper base** (fallback if GPU memory limited)
   - Fastest Whisper variant
   - Acceptable accuracy (85-88%)
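
As a rough sketch of how Whisper output could be mapped to the segment schema: the `openai-whisper` package returns a `segments` list whose items include `start`, `end`, `text` and `avg_logprob`. Whisper does not report a 0-1 confidence directly, so the `exp(avg_logprob)` mapping below is an assumption made for illustration, and `to_output_segments` is a hypothetical helper name.

```python
import math


def to_output_segments(whisper_segments: list[dict]) -> list[dict]:
    """Convert Whisper-style segment dicts into the English transcript schema."""
    out = []
    for i, seg in enumerate(whisper_segments, start=1):
        # Assumption: treat exp(mean token log-probability) as a 0-1 confidence.
        logprob = min(seg.get("avg_logprob", float("-inf")), 0.0)
        out.append({
            "id": i,
            "start": round(seg["start"], 2),
            "end": round(seg["end"], 2),
            "english_text": seg["text"].strip(),
            "confidence": round(math.exp(logprob), 2),
        })
    return out
```

With the package installed, the input would typically come from `whisper.load_model("large-v3").transcribe(path, language="hi", task="translate")["segments"]`.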

### Offline/Local Fallback (no internet)
4. **Vosk Hindi**
   - Offline, no network needed
   - Lower accuracy (75-82%)
   - Good for real-time, handles Hinglish
   - Output: Hindi text only (requires translation)

### API Fallback (if local Whisper fails)
5. **Google Speech-to-Text API**
   - Requires credentials
   - Slower but reliable
   - Hindi output only (requires translation)

### Manual Override
6. **User-provided SRT subtitles**
   - Accept Hindi subtitle file
   - Align timing to audio/video
   - User manually fixes errors
   - Proceed with the A2 translation step

### Translation Chain (for non-Whisper outputs)
- Primary: `deep_translator.GoogleTranslator(source='hi', target='en')`
- Fallback: `deep_translator.MyMemoryTranslator(source='hi', target='en')`
- Last resort: Human-provided translations via SRT
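
The chain above can be sketched as a generic "try each engine in order" helper. The helper name and the returned keys are illustrative assumptions; engines are passed in as plain callables so the example runs without network access.

```python
from typing import Callable


def run_with_fallback(text: str, engines: list[tuple[str, Callable[[str], str]]]) -> dict:
    """Try each (name, translate_fn) pair in order and return the first usable result."""
    warnings = []
    for name, translate in engines:
        try:
            result = translate(text)
        except Exception as exc:  # any engine failure moves on to the next engine
            warnings.append(f"{name}: {exc}")
            continue
        if result and result.strip():
            return {
                "english_text": result.strip(),
                "translation_engine": name,
                "warnings": warnings,
            }
        warnings.append(f"{name}: empty result")
    # Nothing worked: the caller should ask for human-provided translations.
    return {"english_text": None, "translation_engine": None, "warnings": warnings}
```

In the real pipeline an engine entry could look like `("google", GoogleTranslator(source='hi', target='en').translate)`. Recording the failed attempts in `warnings` keeps the fallback path visible in the processing log.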

## OCR Fallback Chain

### Devanagari OCR (Track B, Pipeline B2)

1. **Tesseract 5.x** with `-l hin`
   - Best overall accuracy for printed Devanagari
   - Handles mixed Hindi+English
   - Local, no network needed

2. **PaddleOCR** with `lang='devanagari'` (or the `devanagari_PP-OCRv5_mobile_rec` recognition model)
   - Good for handwritten/stylized text
   - Faster on large images
   - Fallback if Tesseract fails

3. **EasyOCR** with `easyocr.Reader(['hi', 'en'])`
   - Last-resort OCR
   - Slower, GPU-friendly
   - Good confidence scoring
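
For the Tesseract path, word-level results from `pytesseract.image_to_data(..., output_type=Output.DICT)` arrive as parallel lists (`text`, `conf`, `left`, `top`, `width`, `height`). A minimal sketch of turning that dict into `ocr_segments` follows; the `[x1, y1, x2, y2]` reading of `bbox` and the `ocr_engine` field are assumptions, since the schema above does not spell them out.

```python
def tesseract_words_to_segments(data: dict, min_conf: float = 0.0) -> list[dict]:
    """Convert a pytesseract-style word table into OCR segment dicts."""
    segments = []
    rows = zip(data["text"], data["conf"], data["left"],
               data["top"], data["width"], data["height"])
    for text, conf, x, y, w, h in rows:
        conf = float(conf)              # some versions return strings
        if conf < 0 or not text.strip():
            continue                    # conf == -1 marks layout rows, not words
        score = conf / 100              # Tesseract reports 0-100
        if score < min_conf:
            continue
        segments.append({
            "text_hindi": text.strip(),
            "bbox": [x, y, x + w, y + h],   # assumed corner format
            "confidence": round(score, 2),
            "ocr_engine": "tesseract",
        })
    return segments
```

Each segment still needs its `text_english` filled in by the translation step before it is written to the output file.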

### Per-Frame Translation
After OCR detects Hindi text:
- `deep_translator.GoogleTranslator(source='hi', target='en')`
- Preserve original Hindi text as reference
- Tag with confidence + OCR engine used

## Confidence Tagging System

Every segment in every output gets confidence scores:

```json
{
  "segment": "...",
  "confidence": {
    "transcription": 0.92,
    "speaker_identification": 0.85,
    "translation": 0.88,
    "entity_extraction": 0.79
  },
  "confidence_source": "whisper_internal_score",
  "warnings": [
    "Low speaker confidence (<0.7) — manual review recommended"
  ]
}
```

**Confidence thresholds:**
- ≥ 0.90: High confidence, use as-is
- 0.70–0.89: Medium, flag for review
- < 0.70: Low, manual verification required
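
The thresholds above translate directly into a small classifier. This is a sketch with hypothetical function names; the warning wording is illustrative.

```python
def confidence_band(score: float) -> str:
    """Map a 0-1 confidence score to the band defined by the thresholds."""
    if score >= 0.90:
        return "high"     # use as-is
    if score >= 0.70:
        return "medium"   # flag for review
    return "low"          # manual verification required


def review_warnings(confidence: dict[str, float]) -> list[str]:
    """Return one warning per confidence dimension that falls in the low band."""
    return [
        f"Low {name} confidence ({score:.2f} < 0.70): manual verification required"
        for name, score in confidence.items()
        if confidence_band(score) == "low"
    ]
```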

## Quality Gates

| Gate | Metric | Pass Threshold | Failure Action |
|---|---|---|---|
| **Audio Extraction** | Audio stream found + extracted | > 0 bytes | Skip Track A, flag user |
| **Transcription Confidence** | Mean Whisper confidence | ≥ 0.70 | Warn, allow manual SRT override |
| **Language Detection** | Detected language is Hindi | > 0.85 confidence | Warn, then continue on the assumption that the audio is Hindi |
| **Translation Output** | Valid English text (non-empty) | 100% of segments | Block if >5% fail, flag user |
| **OCR Confidence** | Mean OCR confidence score | ≥ 0.65 | Warn, mark as low-confidence |
| **Unified KB Schema** | Valid JSON, proper structure | 100% | Block if invalid, show error |
| **File Size** | Output files reasonable | < 500MB each | Warn if oversized |
| **Processing Time** | Completes within timeout | < 3600 seconds per file | Cancel + error if timeout |
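
A few of these gates can be checked with plain arithmetic over the output data. The sketch below covers only the transcription, translation and OCR gates; the function name and return shape are assumptions, not part of the skill's scripts.

```python
from statistics import mean


def evaluate_gates(segments: list[dict], ocr_confidences: list[float]) -> dict:
    """Check three quality gates and report warnings and blocking errors."""
    warnings, errors = [], []

    if segments:
        mean_conf = mean(s["confidence"] for s in segments)
        if mean_conf < 0.70:
            warnings.append(f"mean transcription confidence {mean_conf:.2f} < 0.70")

        # Translation gate: block when more than 5% of segments lack English text.
        untranslated = sum(
            1 for s in segments if not (s.get("english_text") or "").strip()
        )
        if untranslated / len(segments) > 0.05:
            errors.append(f"{untranslated}/{len(segments)} segments lack English text")

    if ocr_confidences:
        mean_ocr = mean(ocr_confidences)
        if mean_ocr < 0.65:
            warnings.append(f"mean OCR confidence {mean_ocr:.2f} < 0.65")

    return {"blocked": bool(errors), "warnings": warnings, "errors": errors}
```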

## Anti-Patterns (Do NOT Do)

1. **Do NOT output Hindi text as primary** — English only in main fields
   - Hindi text is allowed only in reference fields
   - Name the field `"reference_hindi_text": "..."`, not `"hindi_text"`

2. **Do NOT skip translation**
   - Every Hindi segment must have English equivalent
   - Use fallback chain if primary fails

3. **Do NOT mix Hindi + English in output fields**
   - Fields like "english_text" must be English only
   - Code-mixed Hinglish segments: translate fully to English, preserve original as reference

4. **Do NOT lose speaker attribution**
   - Track who said what (Speaker_1, Speaker_2, etc.)
   - Include speaker confidence if available

5. **Do NOT merge files without timestamps**
   - Unified KB requires strict timestamp alignment
   - If alignment fails, warn user but output separate JSON files

6. **Do NOT ignore confidence scores**
   - Always tag confidence
   - If mean confidence < 0.70, warn user prominently

7. **Do NOT process unlabeled language**
   - Detect language first
   - Warn if the detected language differs from the expected one (e.g. the user said Hindi, but Marathi was detected)

8. **Do NOT lose original audio/screen context**
   - Always preserve reference to source file
   - Include file metadata (duration, codec, etc.)

## Production Checklist

Before outputting:
- [ ] All output JSON is valid (test with `json.load()`)
- [ ] All text is English only in primary fields
- [ ] Every segment has timestamp
- [ ] Every segment has confidence score
- [ ] No Devanagari text in English fields
- [ ] Speaker attribution present (if audio)
- [ ] File metadata included (filename, duration, language detected)
- [ ] Processing log included (model used, fallback chain, warnings)
- [ ] Unified KB only created if both audio + screen processed
- [ ] Output file size reasonable (< 500MB)
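
The "no Devanagari in English fields" item can be automated by scanning the parsed JSON for characters in the Devanagari Unicode block (U+0900 to U+097F). This sketch assumes that English fields are exactly those whose key contains `english`, which matches the field names used in this document but is still a convention, not a guarantee.

```python
import re

DEVANAGARI = re.compile(r"[\u0900-\u097F]")


def devanagari_in_english_fields(node, path: str = "$") -> list[str]:
    """Return JSON paths of 'english' fields whose string value contains Devanagari."""
    problems = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}"
            if isinstance(value, str):
                if "english" in key and DEVANAGARI.search(value):
                    problems.append(child)
            else:
                problems.extend(devanagari_in_english_fields(value, child))
    elif isinstance(node, list):
        for i, item in enumerate(node):
            problems.extend(devanagari_in_english_fields(item, f"{path}[{i}]"))
    return problems
```

Running it on the result of `json.load()` covers the JSON validity check at the same time, because invalid JSON fails before the scan starts.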

## Common Hinglish Patterns (Hindi+English Code-Mixing)

Handle these common fintech/business terms:

| Hinglish | Standard Hindi | English | Rule |
|---|---|---|---|
| "updation" | "अद्यतन" | "update" | Transliterated English → English |
| "revert" | "जवाब देना" | "reply" | Fintech slang → English |
| "prepone" | "आगे लाना" | "advance" | Indian English → English |
| "jugaad" | "जुगाड़" | "workaround" | Hindi concept → English equivalent |
| "KYC" | "केवाईसी" | "KYC" (know your customer) | Acronym stays same |
| "PAN" | "पैन" | "PAN" (permanent account number) | Acronym stays same |

**Processing rule:** Preserve acronyms, translate cultural/slang terms to nearest English equivalent.
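
A glossary lookup is one simple way to apply this rule as a post-processing pass over translated text. The sketch below uses only the four non-acronym entries from the table; acronyms such as KYC and PAN pass through untouched because they are not in the glossary. Inflected forms (for example "reverted") are deliberately left alone and would need their own entries.

```python
import re

# Indian English / Hinglish terms mapped to their nearest English equivalent.
GLOSSARY = {
    "updation": "update",
    "revert": "reply",
    "prepone": "advance",
    "jugaad": "workaround",
}

_TERMS = re.compile(
    r"\b(" + "|".join(re.escape(term) for term in GLOSSARY) + r")\b",
    re.IGNORECASE,
)


def normalize_hinglish(text: str) -> str:
    """Replace whole-word glossary terms, keeping a leading capital if present."""
    def swap(match: re.Match) -> str:
        word = match.group(0)
        replacement = GLOSSARY[word.lower()]
        return replacement.capitalize() if word[0].isupper() else replacement

    return _TERMS.sub(swap, text)
```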

## Execution Workflow

| Step | Task | Input | Output | Tool |
|---|---|---|---|---|
| 1 | Validate environment | User request | Status check | Python validation script |
| 2 | Validate input files | Audio/video/screen files | File validation report | Path + format checks |
| 3 | Extract audio (if video) | Video file | 16kHz mono WAV | ffmpeg |
| 4 | Transcribe with fallback | WAV file | Raw transcription JSON | Whisper → Vosk → Google Speech-to-Text |
| 5 | Translate to English | Hindi transcription | English + Hindi segments | Whisper translate / deep-translator |
| 6 | Diarize speakers (optional) | Audio + transcription | Speaker labels per segment | Pyannote + Whisper timestamps |
| 7 | Extract entities/topics | English text | Entity + topic tags | spaCy / keyword extraction |
| 8 | Build audio KB | All audio outputs | audio_transcript_english.json | A3 assembly |
| 9 | Extract keyframes (if video) | Video file | JPEG frames | ffmpeg scene detection |
| 10 | OCR keyframes (Devanagari) | Frame images | Hindi text + bbox | Tesseract + PaddleOCR |
| 11 | Translate OCR text | Hindi OCR results | English text per frame | GoogleTranslator |
| 12 | Detect UI elements | Keyframe images | Element labels + state | Vision API / pattern matching |
| 13 | Build screen KB | All screen outputs | screen_transcript_english.json | B3 assembly |
| 14 | Align timelines | audio_kb + screen_kb | Merged timeline | Timestamp matching algorithm |
| 15 | Build unified KB | Aligned data | unified_knowledge_base.json | C merge |
| 16 | Generate search index | Unified KB | Topic + entity index | Index building algorithm |
| 17 | Validate outputs | All JSON files | Validation report | Schema check + content check |
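
Step 6 combines diarization turns with transcript timestamps. One common approach, sketched here under the assumption that the diarizer yields `(start, end, speaker)` tuples, is to give each segment the speaker whose turn overlaps it the most. The function name is hypothetical.

```python
def assign_speakers(segments: list[dict], turns: list[tuple[float, float, str]]) -> list[dict]:
    """Label each transcript segment with the speaker of the most-overlapping turn."""
    labelled = []
    for seg in segments:
        best_speaker, best_overlap = None, 0.0
        for turn_start, turn_end, speaker in turns:
            # Length of the intersection of the two time intervals (negative if disjoint).
            overlap = min(seg["end"], turn_end) - max(seg["start"], turn_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        # Segments with no overlapping turn keep speaker=None for manual review.
        labelled.append({**seg, "speaker": best_speaker})
    return labelled
```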

## Example: Complete Hindi Fintech Call

**Input:** `customer_call.mp3` — 15-minute Hindi customer support call with Hinglish code-mixing

**Processing:**
```bash
python scripts/process_audio.py customer_call.mp3 ./output \
  --language hi \
  --model large \
  --translate \
  --diarize \
  --extract-topics
```

**Output: `audio_transcript_english.json`**
```json
{
  "file": "customer_call.mp3",
  "duration_seconds": 900,
  "language_detected": "hindi",
  "speakers": 2,
  "segments": [
    {
      "id": 1,
      "start": 12.5,
      "end": 28.3,
      "speaker": "Speaker_1",
      "confidence": 0.91,
      "english_text": "Hello, I want to check my account balance and update my KYC details.",
      "reference_hindi_text": "नमस्ते, मैं अपना खाता शेष जांचना चाहता हूँ और अपने केवाईसी विवरण अपडेट करना चाहता हूँ।",
      "topics": ["account_inquiry", "kyc_update"],
      "entities": ["account", "KYC"]
    },
    {
      "id": 2,
      "start": 28.5,
      "end": 45.0,
      "speaker": "Speaker_2",
      "confidence": 0.87,
      "english_text": "Sure sir, let me help you with that. Your current account balance is 50,000 rupees. For KYC, I need your PAN and Aadhaar.",
      "reference_hindi_text": "ठीक है सर, मैं आपकी मदद करूँगा। आपका वर्तमान खाता शेष 50,000 रुपये है। केवाईसी के लिए मुझे आपके पैन और आधार की जरूरत है।",
      "topics": ["balance_statement", "kyc_requirement"],
      "entities": ["amount", "PAN", "Aadhaar"]
    }
  ],
  "summary": "Customer support call: balance inquiry and KYC update request processed successfully.",
  "processing_metadata": {
    "model": "whisper-large-v3",
    "language": "hindi",
    "translation_engine": "whisper_translate",
    "diarization": true,
    "total_segments": 24,
    "mean_confidence": 0.88,
    "warnings": []
  }
}
```

## Confidence and Limitations

**Strengths:**
- Handles clean recorded speech very well (92-95% accuracy)
- Good for business/formal Hindi
- Excellent Hinglish code-mixing support
- Fast with GPU (roughly 2-3x faster than real time)
- Built-in translation to English

**Known Limitations:**
- Whisper confuses Hindi↔Urdu in ~5% of segments (the spoken languages are very similar, even though they are written in different scripts, Devanagari and Nastaliq)
- Handwritten/stylized Devanagari OCR accuracy 60-75% (vs. 90%+ for printed)
- Background noise >50dB reduces accuracy to 70-80%
- Heavy accents (regional Hindi dialects) may need manual correction
- Depends on FFmpeg for video processing (requires system binary)
- Large model needs roughly 10GB of GPU VRAM; without it, processing falls back to the much slower CPU path

## References

- [Whisper Model Card](https://github.com/openai/whisper/blob/main/model-card.md)
- [Tesseract OCR Documentation](https://github.com/UB-Mannheim/tesseract/wiki)
- [PaddleOCR Hindi Support](https://github.com/PaddlePaddle/PaddleOCR)
- [Deep Translator Documentation](https://github.com/nidhaloff/deep-translator)
- [Devanagari Unicode Range](https://en.wikipedia.org/wiki/Devanagari)
- [Code-mixed NLP Challenges](https://arxiv.org/abs/2110.05509)

---

**Version:** 1.0.0
**Author:** Harshavardhan Bailur
**Last Updated:** 2025-03
**Status:** Production-ready