---
name: competitor-scraper
description: Fetch and parse public competitor pages across blog, LinkedIn, pricing, careers, G2 reviews, and press — returns structured artifact payloads for downstream classification
version: 1.0.0
confidence: 0.85
usageCount: 0
---

## Overview

The competitor-scraper skill fetches public web content from six competitor source types and returns structured, deduplicated artifact payloads. It operates as the foundational data-collection layer of the competitor intelligence pipeline. Every artifact produced by this skill carries full provenance metadata so downstream classifiers, theme extractors, and gap-analysis agents can trace each signal back to its origin without re-fetching the source.

The skill is designed around precise structured extraction, not raw HTML forwarding. It parses each source type with purpose-built extraction logic and returns only the fields downstream agents need, which saves token budget, reduces hallucination risk in downstream agents, and makes deduplication reliable.

## When to Use

- When the competitor-scheduler dispatches a scan job for a specific competitor
- When a one-off competitive intelligence request requires fresh data from a specific source type
- When the gap-analysis-agent needs a spot-check refresh of a competitor's pricing or careers page
- When the weekly-digest-generator needs a manual pull of recent press mentions before digest assembly

## How to Execute

### Step 1: Job Validation
Validate the incoming job payload: confirm that competitor ID, source URL list, run ID, and last-scan timestamp are all present. If any required field is missing, return an error and do not proceed. Then load the competitor's entry from the competitor-intel knowledge file and confirm the source URL list is current. The knowledge file is the source of truth: if its URL list differs from the one in the job payload, use the knowledge file's list.
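
The validation step can be sketched as below. This is a minimal illustration, not the skill's actual implementation: the key names (`competitor_id`, `source_urls`, `run_id`, `last_scan_at`) and both helper functions are assumptions, since the skill names the required fields but not their exact keys.

```python
from typing import Any

# Assumed key names; the skill lists the fields but not how they are spelled.
REQUIRED_JOB_FIELDS = ("competitor_id", "source_urls", "run_id", "last_scan_at")


def validate_job(job_payload: dict) -> list:
    """Return the names of required fields that are missing or empty."""
    return [
        field
        for field in REQUIRED_JOB_FIELDS
        if job_payload.get(field) in (None, "", [])
    ]


def resolve_source_urls(knowledge_entry: dict) -> list:
    """Take the URL list from the knowledge file, which is the source of truth."""
    return list(knowledge_entry["source_urls"])
```

A caller would return an error and stop if `validate_job` returns a non-empty list, and otherwise scrape the URLs from `resolve_source_urls` rather than those in the payload.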

### Step 2: Blog Scraping
Fetch the competitor's blog or news index. Parse all posts with a publication date after the last-scan timestamp. For each post extract:
- **title** (string)
- **url** (string)
- **published_at** (ISO 8601 date)
- **author** (string, if available)
- **word_count** (integer)
- **body_text** (string, full text)
- **topic_tags** (array of strings)
- **product_claims** (array: any named features, capabilities, or integrations mentioned)
- **customer_references** (array: any named customers, logos, or case study references)
- **data_citations** (array: any third-party data, analyst reports, or statistics cited with source)

Limit to the 20 most recent posts unless the scheduler has set `high_activity_override`.
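
A minimal sketch of the date filter and the 20-post cap, assuming posts have already been parsed into dictionaries with an ISO 8601 `published_at`. The function name and the `DEFAULT_POST_LIMIT` constant are hypothetical.

```python
from datetime import datetime

DEFAULT_POST_LIMIT = 20  # the limit stated above


def select_new_posts(posts, last_scan_at, high_activity_override=False):
    """Keep posts published after last_scan_at, newest first, capped unless overridden.

    Timestamps are assumed to carry an explicit offset (e.g. "+00:00");
    comparing offset-aware and naive datetimes raises TypeError.
    """
    cutoff = datetime.fromisoformat(last_scan_at)
    fresh = [p for p in posts if datetime.fromisoformat(p["published_at"]) > cutoff]
    fresh.sort(key=lambda p: datetime.fromisoformat(p["published_at"]), reverse=True)
    return fresh if high_activity_override else fresh[:DEFAULT_POST_LIMIT]
```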

### Step 3: LinkedIn Post Scraping
Retrieve recent public LinkedIn company posts. For each post extract:
- **post_text** (string)
- **published_at** (ISO 8601 date)
- **post_type** (enum: organic, product-announcement, event-promo, hiring, customer-story, thought-leadership)
- **engagement_signals** (object: likes, comments, reposts — include only if visible without authentication)
- **external_links** (array: any URLs linked from the post)
- **lyzr_mention** (boolean: does the post name Lyzr or a Lyzr-adjacent term?)
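
One possible way to compute `lyzr_mention` is a case-insensitive whole-word match against a term list. The list below is a placeholder: the skill does not define which terms count as "Lyzr-adjacent", so `LYZR_TERMS` and `mentions_lyzr` are illustrative assumptions.

```python
import re

# Hypothetical term list; extend it with whatever the team treats as Lyzr-adjacent.
LYZR_TERMS = ("lyzr",)


def mentions_lyzr(post_text: str, terms=LYZR_TERMS) -> bool:
    """Return True if any term appears as a whole word, ignoring case."""
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b"
    return re.search(pattern, post_text, flags=re.IGNORECASE) is not None
```

Whole-word matching avoids false positives on words that merely contain a term as a substring.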

### Step 4: Pricing Page Scraping
Fetch the competitor's pricing page. Extract:
- **pricing_model** (enum: per-seat, consumption, flat-rate, enterprise-only, freemium)
- **tier_names** (array of strings)
- **tier_descriptions** (array: one string per tier)
- **feature_gates** (array: features listed as tier-specific)
- **deployment_options** (array: cloud, on-prem, hybrid — only if explicitly stated)
- **cta_language** (string: the primary call-to-action text, e.g., "Start Free Trial" or "Contact Sales")
- **minimum_contract_signals** (string: any language implying deal size floor, e.g., "for teams of 50+")
- **diff_from_previous** (boolean: has this page changed since the last scrape?)
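
One way to derive `diff_from_previous` is to compare a hash of the current page text with the hash stored at the previous scrape. This is a sketch under assumptions: the skill does not say how the diff is computed, and `content_hash`, the whitespace normalization, and returning `None` when no prior scrape exists are illustrative choices.

```python
import hashlib
import re


def content_hash(page_text: str) -> str:
    """Hash whitespace-normalized text so layout-only changes do not count as a diff."""
    normalized = re.sub(r"\s+", " ", page_text).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def diff_from_previous(current_text: str, previous_hash):
    """Return True/False, or None when there is no earlier scrape to compare against."""
    if previous_hash is None:
        return None
    return content_hash(current_text) != previous_hash
```

Returning `None` on a first scrape follows the quality rule that a field which cannot be determined is null rather than estimated.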

### Step 5: Careers Page Scraping
Fetch the competitor's open roles. Aggregate by department. Extract:
- **total_open_roles** (integer)
- **roles_by_department** (object: department name → count)
- **ai_ml_engineering_count** (integer)
- **sales_count** (integer)
- **vertical_specific_roles** (array: role titles that name a vertical — BFSI, healthcare, manufacturing)
- **net_new_roles_vs_prior** (integer: positive = growth, negative = shrinkage)
- **notable_senior_roles** (array: VP, C-suite, or Director-level roles newly posted)
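
The department aggregation and the prior-scan delta can be sketched as follows. The input shape (a list of role dictionaries with a `department` key) and the `prior_total` parameter are assumptions about how parsed roles and the previous count are made available.

```python
from collections import Counter


def summarize_roles(roles, prior_total=None):
    """Aggregate open roles by department and compare the total to the prior scan."""
    by_department = Counter(role["department"] for role in roles)
    total = len(roles)
    return {
        "total_open_roles": total,
        "roles_by_department": dict(by_department),
        # Null when no prior count exists, rather than a guessed delta.
        "net_new_roles_vs_prior": None if prior_total is None else total - prior_total,
    }
```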

### Step 6: G2 Review Scraping
Retrieve the most recent G2 reviews from the competitor's product listing page. For each review extract:
- **star_rating** (integer 1-5)
- **reviewed_at** (ISO 8601 date)
- **reviewer_title** (string)
- **reviewer_company_size** (enum: 1-50, 51-200, 201-1000, 1001-5000, 5001+)
- **pros_text** (string)
- **cons_text** (string)
- **features_praised** (array: named product features mentioned positively)
- **features_criticized** (array: named product features mentioned negatively)
- **high_signal_flag** (boolean: true if review mentions enterprise deployment, pricing, implementation complexity, or support response time)

Do not extract reviewer names — reviewer title and company size are sufficient for signal analysis.
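
A simple keyword check is one way to approximate `high_signal_flag`. The keyword list is an assumption derived from the four topics named above; a production version would likely need broader phrasing coverage or a classifier.

```python
# Hypothetical keywords covering the four high-signal topics listed for this field.
HIGH_SIGNAL_KEYWORDS = (
    "enterprise deployment",
    "pricing",
    "implementation",
    "support response",
)


def high_signal_flag(pros_text, cons_text) -> bool:
    """Return True if either review section mentions a high-signal topic."""
    text = f"{pros_text or ''} {cons_text or ''}".lower()
    return any(keyword in text for keyword in HIGH_SIGNAL_KEYWORDS)
```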

### Step 7: Press Mention Scraping
Search configured press sources and Google News for the competitor name within the scan window (from the last-scan timestamp to the time of the current run). For each mention extract:
- **headline** (string)
- **publication** (string)
- **published_at** (ISO 8601 date)
- **url** (string)
- **summary** (string: 2-3 sentence summary of the article's relevance)
- **mention_type** (enum: product-launch, partnership, funding, analyst-coverage, award, executive-change, negative-coverage, other)
- **lyzr_mentioned** (boolean: does the article also mention Lyzr or a direct Lyzr competitor alongside this competitor?)

### Step 8: Deduplication and Output Assembly
For every extracted artifact, compute a fingerprint: SHA-256 hash of (source_type + source_url + published_at + title). Check the artifact store for the fingerprint. If the fingerprint exists, discard the artifact and increment the duplicate count. Only write net-new artifacts to the output payload.

Each output artifact must include: `competitor_id`, `source_type`, `artifact_id` (UUID), `run_id`, `fingerprint`, `scraped_at` (ISO 8601), and the structured fields for that source type.
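
The fingerprinting and deduplication flow can be sketched as below. Several details are assumptions rather than part of the skill: the unit-separator delimiter between fields, treating a missing `published_at` or `title` as an empty string, and an in-memory set (`known_fingerprints`) standing in for the artifact store.

```python
import hashlib
import uuid
from datetime import datetime, timezone


def compute_fingerprint(source_type, source_url, published_at, title) -> str:
    """SHA-256 over the four identifying fields, joined with a separator.

    The separator is an assumption: without one, ("ab", "c") and ("a", "bc")
    would concatenate to the same string and collide.
    """
    parts = [source_type, source_url, published_at or "", title or ""]
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


def assemble_output(candidates, known_fingerprints, competitor_id, run_id):
    """Return (net-new artifacts, duplicate count)."""
    artifacts, duplicate_count = [], 0
    seen = set(known_fingerprints)
    for item in candidates:
        fingerprint = compute_fingerprint(
            item["source_type"], item["source_url"],
            item.get("published_at"), item.get("title"),
        )
        if fingerprint in seen:
            duplicate_count += 1
            continue
        seen.add(fingerprint)  # also catches repeats within the same run
        artifacts.append({
            **item,
            "competitor_id": competitor_id,
            "artifact_id": str(uuid.uuid4()),
            "run_id": run_id,
            "fingerprint": fingerprint,
            "scraped_at": datetime.now(timezone.utc).isoformat(),
        })
    return artifacts, duplicate_count
```

Source types without a natural `title` (for example pricing pages) would fingerprint on the remaining fields in this sketch; whether that is the intended behavior is not specified by the skill.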

## Inputs

- **job_payload** (object, required): Competitor ID, source URL list, run ID, last-scan timestamp
- **artifact_store_reference** (string, required): Reference to the artifact store for deduplication checks
- **high_activity_override** (boolean, optional): If true, lift the 20-post limit for blog scraping

## Outputs

- **artifacts** (array): Array of structured artifact objects, one per net-new item discovered
- **duplicate_count** (integer): Number of artifacts discarded as duplicates
- **source_errors** (array): List of source URLs that returned errors, with status codes
- **job_completion_status** (enum: complete, partial, failed)
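
The skill does not state how `job_completion_status` is derived. One plausible rule, shown here purely as an assumption, is: `complete` when no source errored, `failed` when every attempted source errored, and `partial` otherwise. The `url` and `status_code` keys on each error entry are also assumed.

```python
def completion_status(attempted_urls, source_errors) -> str:
    """Derive the job status from which attempted URLs returned errors."""
    failed = {error["url"] for error in source_errors}
    if not failed:
        return "complete"
    if failed >= set(attempted_urls):
        return "failed"
    return "partial"
```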

## Quality Criteria

- Every artifact must include all required metadata fields — an artifact with a missing run_id or fingerprint is invalid and must be discarded
- No fabricated or inferred fields — if a field cannot be extracted directly from the page content, it must be null, not estimated
- Deduplication must run before output assembly — duplicate artifacts must never appear in the output array
- Source errors must be reported accurately — a 429 rate-limit error and a 404 not-found error require different handling by the scheduler and must be reported with the correct status code
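
The first criterion can be enforced with a small check before output assembly. The required keys are the metadata fields listed in Step 8; the function name and the choice to treat empty strings as missing are assumptions.

```python
REQUIRED_METADATA = (
    "competitor_id",
    "source_type",
    "artifact_id",
    "run_id",
    "fingerprint",
    "scraped_at",
)


def is_valid_artifact(artifact: dict) -> bool:
    """Return True only if every required metadata field is present and non-empty."""
    return all(artifact.get(key) not in (None, "") for key in REQUIRED_METADATA)
```

Artifacts that fail this check would be discarded rather than repaired, in line with the rule against inferred fields.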
