---
name: integrations-daily-monitor
description: Monitor the Integrations dashboard "Daily monitoring" section from Datadog. Queries sync failures, sync durations, connection setup errors, CQ throttling, CloudScanErrors, CloudQuery interruptions, and GQL performance for the Automation team (production). Triggers when user says "daily monitor", "integrations monitor", "integrations dashboard", "daily monitoring", "check integrations", "sync health", or "integration health check".
---

# Integrations Daily Monitor

Query the "Daily monitoring" section of the Integrations Datadog dashboard (ID: `964-nd7-54k`) and produce a concise anomaly-focused summary.

## Input

- **preset**: Filter preset. Default: `automation` (env:production, vendor_team:automation). Options: `automation`, `web_platform`, `workforce`, `alchemy`, `all`.
- **time_range**: Lookback window for trend data. Default: `14d`.
- **vendor_slug**: Optional vendor filter. Default: `*` (all vendors for the preset).

## Preset Filters

| Preset | env | vendor_team | Notes |
|---|---|---|---|
| `automation` | production | automation | Default |
| `web_platform` | production | web_platform | |
| `workforce` | production | workforce | |
| `alchemy` | production | alchemy | |
| `all` | production | * | All teams |

## Procedure

Run ALL queries in parallel using Datadog MCP tools. Use a 14-day window with daily granularity (`DATE_TRUNC('day', timestamp)`). Compare last 1-4 days against prior 10-13 day baseline. **Only report anomalies and notable findings** — omit anything that looks normal/stable.

All log queries target `env:production -@ctx.company_slug:secureframe-dev`. Add `@ctx.vendor_team:{vendor_team}` based on preset.

### Base filter construction

```
For automation preset:
  BASE_SYNC_FILTER = @ctx.class:(SyncConnectionJob OR SyncCloudConnectionJob OR GithubSyncJob OR "Integrations::SyncIntegrationCleanupJob" OR "Integrations::SyncIntegrationResourcesJob" OR "Sidekiq::Batch::Callback") env:production -@ctx.company_slug:secureframe-dev @ctx.vendor_team:automation
```

### All Queries (run ALL in parallel)

#### Q1. Sync Failures - Unmapped errors by vendor per day (14d)

```
Tool: analyze_datadog_logs
filter: {BASE_SYNC_FILTER} "Sync job failed" (-@ctx.has_mapped_error:* OR @ctx.has_mapped_error:false)
from: now-14d
extra_columns: [{ name: "@ctx.vendor_slug", type: "string" }]
sql_query: SELECT DATE_TRUNC('day', timestamp) AS day, "@ctx.vendor_slug" AS vendor, COUNT(*) AS failures FROM logs GROUP BY DATE_TRUNC('day', timestamp), "@ctx.vendor_slug" ORDER BY day DESC, failures DESC
max_tokens: 8000
```

#### Q2. Sync Failures - First Sync by vendor per day (14d)

```
Tool: analyze_datadog_logs
filter: {BASE_SYNC_FILTER} "Sync job failed" @ctx.origin:create_vendor_auth
from: now-14d
extra_columns: [{ name: "@ctx.vendor_slug", type: "string" }]
sql_query: SELECT DATE_TRUNC('day', timestamp) AS day, "@ctx.vendor_slug" AS vendor, COUNT(*) AS failures FROM logs GROUP BY DATE_TRUNC('day', timestamp), "@ctx.vendor_slug" ORDER BY day DESC, failures DESC
max_tokens: 5000
```

#### Q3. NEW failing vendor/company pairs (first_seen detection, 14d)

```
Tool: analyze_datadog_logs
filter: {BASE_SYNC_FILTER} "Sync job failed" (-@ctx.has_mapped_error:* OR @ctx.has_mapped_error:false)
from: now-14d
extra_columns: [{ name: "@ctx.vendor_slug", type: "string" }, { name: "@ctx.company_slug", type: "string" }]
sql_query: SELECT "@ctx.vendor_slug" AS vendor, "@ctx.company_slug" AS company, COUNT(*) AS total_failures, MIN(timestamp) AS first_seen, MAX(timestamp) AS last_seen FROM logs GROUP BY "@ctx.vendor_slug", "@ctx.company_slug" ORDER BY first_seen DESC LIMIT 30
max_tokens: 5000
```

#### Q4. Connection setup errors by vendor per day (14d)

```
Tool: analyze_datadog_logs
filter: ("CreateVendorAuth: OAuthError" OR "CreateVendorAuth: ConnectionSetupError") env:production @ctx.vendor_team:automation
from: now-14d
extra_columns: [{ name: "@ctx.vendor_slug", type: "string" }]
sql_query: SELECT DATE_TRUNC('day', timestamp) AS day, "@ctx.vendor_slug" AS vendor, COUNT(*) AS errors FROM logs GROUP BY DATE_TRUNC('day', timestamp), "@ctx.vendor_slug" ORDER BY day DESC, errors DESC
max_tokens: 8000
```

#### Q5. NEW connection setup error vendors (first_seen detection, 14d)

```
Tool: analyze_datadog_logs
filter: ("CreateVendorAuth: OAuthError" OR "CreateVendorAuth: ConnectionSetupError") env:production @ctx.vendor_team:automation
from: now-14d
extra_columns: [{ name: "@ctx.vendor_slug", type: "string" }]
sql_query: SELECT "@ctx.vendor_slug" AS vendor, COUNT(*) AS total_errors, MIN(timestamp) AS first_seen, MAX(timestamp) AS last_seen FROM logs GROUP BY "@ctx.vendor_slug" ORDER BY first_seen DESC
max_tokens: 5000
```

#### Q6. CloudScanErrors - Unmapped per day (14d)

```
Tool: analyze_datadog_logs
filter: @ctx.class:CompanyVendorConnection message:"CloudScanError created" env:production -@ctx.company_slug:secureframe-dev @ctx.has_mapped_error:false @ctx.vendor_team:automation
from: now-14d
extra_columns: [{ name: "@ctx.vendor_slug", type: "string" }]
sql_query: SELECT DATE_TRUNC('day', timestamp) AS day, "@ctx.vendor_slug" AS vendor, COUNT(*) AS errors FROM logs GROUP BY DATE_TRUNC('day', timestamp), "@ctx.vendor_slug" ORDER BY day DESC, errors DESC
max_tokens: 8000
```

#### Q7. CloudScanErrors - Mapped per day (14d)

```
Tool: analyze_datadog_logs
filter: @ctx.class:CompanyVendorConnection message:"CloudScanError created" env:production -@ctx.company_slug:secureframe-dev @ctx.has_mapped_error:true @ctx.vendor_team:automation
from: now-14d
extra_columns: [{ name: "@ctx.vendor_slug", type: "string" }]
sql_query: SELECT DATE_TRUNC('day', timestamp) AS day, "@ctx.vendor_slug" AS vendor, COUNT(*) AS errors FROM logs GROUP BY DATE_TRUNC('day', timestamp), "@ctx.vendor_slug" ORDER BY day DESC, errors DESC
max_tokens: 8000
```

#### Q8. CQ Throttling Errors by day (14d)

```
Tool: analyze_datadog_logs
filter: @ctx.class:CompanyVendorConnection message:"CQ Error" env:production -@ctx.company_slug:secureframe-dev @ctx.cq_error:"Throttling: Rate exceeded" @ctx.vendor_team:automation
from: now-14d
extra_columns: [{ name: "@ctx.cq_table", type: "string" }]
sql_query: SELECT DATE_TRUNC('day', timestamp) AS day, "@ctx.cq_table" AS cq_table, COUNT(*) AS throttle_count FROM logs GROUP BY DATE_TRUNC('day', timestamp), "@ctx.cq_table" ORDER BY day DESC, throttle_count DESC
max_tokens: 5000
```

#### Q9. GQL Request Duration - top slowest endpoints (14d, daily)

```
Tool: get_datadog_metric
queries: ["avg:trace.execute.graphql.duration{env:production} by {resource_name}"]
from: now-14d
max_tokens: 10000
```

Filter results to only integrations-related resource_names (those containing: vendor, integration, connection, sync, device, computer, repository, cloud, task).

#### Q10. GQL Request Errors (14d)

```
Tool: get_datadog_metric
queries: ["sum:trace.execute.graphql.errors{env:production} by {resource_name}.as_count()"]
from: now-14d
max_tokens: 10000
```

Filter results same as Q9.

## Analysis Instructions

After collecting all data:

### Compute Baselines

For each metric, compute:
- **Baseline**: Average daily value over days 5-14 (older period)
- **Recent**: Average daily value over days 1-4

### Classify Findings

Only report items in these categories:

1. **NEW issues**: vendor/company pairs with `first_seen` in last 4 days (from Q3, Q5)
2. **Unresolved issues**: Started 2-4 days ago AND still has occurrences in the last 24h
3. **Spikes**: Any day in the last 4 days where a metric is >50% above baseline
4. **Resolved spikes**: Notable spikes from earlier that have now resolved (mention briefly)
5. **Trend changes**: Sustained increase/decrease >25% from baseline over multiple recent days
6. **GQL anomalies**: Endpoints with avg duration >2x their 14d average, or new errors appearing

### Suppress Normal Data

Do NOT report:
- Metrics that are stable (within +/-25% of baseline)
- Chronic known issues at steady state (mention once as "unchanged" if large volume)
- Empty categories
- Raw trend tables or daily breakdowns (unless illustrating an anomaly)

## Output Format

The output must be concise — suitable for a Slack message. Target under 2000 characters for quiet days, up to 3000 for days with issues.

```
## Integrations Daily Monitor — {status_emoji} {status_line}
{date} | Automation | env:production | [Dashboard](https://app.datadoghq.com/dashboard/964-nd7-54k)

{If NEW or UNRESOLVED issues exist:}
### Action Items
- {emoji} {description} — {vendor/company, when it started, count, whether ongoing}
- ...

{If spikes or trend changes exist:}
### Notable Changes
- {description with numbers}
- ...

{If GQL anomalies exist:}
### GQL Performance
- {endpoint}: avg {X}ms (was {Y}ms) — {change description}
- ...

### Steady State
- Unmapped CloudScanErrors: {total}/day (google_cloud {X}, aws {Y}) — {stable/changed}
- Mapped CloudScanErrors: {total}/day (aws {X}) — {stable/changed}
- CQ Throttling: {X}/day — {stable/changed}
- Sync failures: {X}/day — {stable/changed}

{Only if there are resolved items worth noting:}
### Recently Resolved
- {what resolved and when}
```

**Status line options:**
- "All clear" with :white_check_mark: — no new issues, no anomalies
- "Anomalies detected" with :warning: — trend changes or spikes but no new failures
- "New issues found" with :rotating_light: — new failing vendors/companies or new errors

## Caveats

- Log counts are exact (indexed logs, not sampled).
- Duration data comes from `ctx.sidekiq_job_duration` field in sync job ensure blocks.
- `CompleteCloudSyncJob` does not emit duration logs.
- The `vendor_team` filter relies on vendor metadata tagging.
- CloudScanErrors are specific to cloud vendors (AWS, Azure, GCP, etc.).
- Today's data is always partial (depends on time of day).
- GQL metrics use `env:production` without team filter (trace-level tags differ from log tags).
