--- name: errorlog-review description: Analyzes SQL Server ERRORLOG files for operational issues, availability group failures, memory pressure, I/O subsystem warnings, and security events. Use this skill whenever a SQL Server instance has experienced unexpected behavior, an AG failover, memory warnings, I/O latency alerts, or abnormal shutdown, and you need a structured timeline of what SQL Server recorded. Applies 28 checks (E1–E28) covering AG health, memory/resource pressure, I/O and storage, startup/shutdown, connectivity, and configuration signals. triggers: - /errorlog-review --- # SQL Server ERRORLOG Review Skill ## Purpose Parse and analyze SQL Server ERRORLOG content to surface operational warnings, high-availability failures, resource pressure signals, security events, and configuration anomalies. Applies 28 checks (E1–E28) across five categories: - **E1–E8** — AG / High Availability: failovers, lease expiry, replica state changes, synchronization errors - **E9–E14** — Memory and resource pressure: page allocation failures, OS paging, worker exhaustion, non-yielding schedulers - **E15–E19** — I/O and storage: slow I/O subsystem, corruption warnings, tempdb exhaustion, log backup gaps, VLF proliferation - **E20–E24** — Startup, shutdown, and connectivity: abnormal termination, restart cycling, login failure bursts, linked server errors - **E25–E28** — Configuration and informational: trace flags, unconfigured max memory, log rotation gaps, version end-of-support ## Input Accept any of: - **File path** — path to the SQL Server ERRORLOG file (default location: `C:\Program Files\Microsoft SQL Server\MSSQL.\MSSQL\Log\ERRORLOG`) - **Inline paste** — raw ERRORLOG text pasted directly into chat; partial excerpts are valid - **Natural language description** — describe the symptoms or paste selected log lines with context For best results, provide the current ERRORLOG and at least one prior log (`ERRORLOG.1`). When only partial content is available, state which time range is covered. ### Capture via T-SQL ```sql -- Read current ERRORLOG (0 = current, 1 = previous, 2 = the one before that) EXEC xp_readerrorlog 0, 1; -- SQL Server log, current file EXEC xp_readerrorlog 1, 1; -- SQL Server log, previous file -- Filter to AG-related messages only EXEC xp_readerrorlog 0, 1, N'availability', NULL, NULL, NULL, N'desc'; -- Filter to a time window (last 2 hours) DECLARE @start DATETIME = DATEADD(HOUR, -2, GETDATE()); EXEC xp_readerrorlog 0, 1, NULL, NULL, @start, NULL, N'desc'; ``` ### Column Reference | Column | Meaning | |--------|---------| | LogDate | Timestamp of the log entry (datetime2 precision) | | ProcessInfo | SPID or system process (e.g., `spid28s`, `Logon`, `Backup`) | | Text | Log message text | --- ## Thresholds Reference | Threshold | Value | Used by | |-----------|-------|---------| | Login failure burst — Warning | > 5 `Login failed` messages in any 5-min window | E22 | | Login failure burst — Critical | > 20 `Login failed` messages in any 5-min window | E22 | | Restart cycling | ≥ 2 SQL Server startup messages within 60 min | E21 | | I/O slow built-in threshold | 15 seconds (SQL Server internal, non-configurable) | E15 | | Log backup overdue — FULL/BULK_LOGGED | > 24 hr since last `Database backed up` entry | E18 | | Log backup overdue — active log pressure signal | > 8 hr when `log_reuse_wait_desc = LOG_BACKUP` | E18 | --- ## AG / High Availability Checks (E1–E8) ### E1 — AG Failover Event - **Trigger:** Log contains `performing a planned role change` or `automatic failover` in the same entry or within the same minute as a role-change message; also `in response to a request from the Windows Server Failover Cluster` - **Severity:** Warning — planned failover expected; Critical if the word `automatic` appears (unplanned loss of primary) - **Fix:** For unplanned failovers, check E2 (lease expiry) and E6 (health check timeout) as probable root causes. For planned failovers in unexpected windows, review change-management records. Run `/sqlwait-review` on HADR_SYNC_COMMIT and HADR_WORK_QUEUE waits. ### E2 — Lease Expiry - **Trigger:** Log contains `lease between the availability group and the Windows Server Failover Cluster has expired` or `The lease of availability group` combined with `has expired` - **Severity:** Critical — lease expiry is the most common root cause of unplanned AG failovers - **Fix:** Investigate the time immediately before this entry for E15 (slow I/O), E13 (non-yielding scheduler), or OS-level events. Common causes: storage latency spike causing the sp_server_diagnostics thread to miss its deadline, high CPU starvation, or WSFC network interruption. Increase `LeaseTimeout` in WSFC only as a temporary measure — fix the root cause. ### E3 — Replica State Change - **Trigger:** Log contains `The local replica of availability group ... is changing roles` or `is preparing to transition to the` - **Severity:** Warning — state transitions are normal during planned operations; unexpected transitions during business hours warrant investigation - **Fix:** Correlate the timestamp with E1 (failover), E2 (lease), or external WSFC events. If unplanned, check the Windows Event Log and WSFC cluster log for the triggering event. ### E4 — AG Database Joining Failure - **Trigger:** Log contains `Failed to join local availability database` or `The availability database ... is not in the correct state` - **Severity:** Critical — the AG database is not receiving redo; secondary is running but not synchronized, providing false HA coverage - **Fix:** Run `SELECT * FROM sys.dm_hadr_database_replica_states` to check `synchronization_state_desc` and `redo_queue_size`. If redo queue is growing, check disk I/O on the secondary. If the database is in `NOT SYNCHRONIZING`, re-join: `ALTER DATABASE [db] SET HADR AVAILABILITY GROUP = [ag_name]`. ### E5 — Data Synchronisation Suspended - **Trigger:** Log contains `Synchronization of this database ... has been suspended` or `Data movement for availability database ... has been suspended` - **Severity:** Warning — a suspended database is not receiving log records; RPO clock is running - **Fix:** Identify whether the suspension was manual (`ALTER DATABASE ... SET HADR SUSPEND`) or automatic (error-triggered). Check for E15/E16 (I/O or corruption) causing automatic suspension. Resume: `ALTER DATABASE [db] SET HADR RESUME`. Monitor redo queue. ### E6 — AG Health Check Timeout - **Trigger:** Log contains `availability group ... has failed to take necessary action within the time allotted` or `The availability group ... exceeded the health-check timeout` - **Severity:** Critical — health-check failure directly precedes automatic failover; this entry combined with E1 confirms the full failover sequence - **Fix:** Identify what the primary was doing at the time. E13 (non-yielding scheduler) or E9 (page allocation failure) are common co-occurrences. The `HealthCheckTimeout` WSFC property controls sensitivity — do not increase it without fixing the underlying responsiveness problem. ### E7 — Redo Thread Error - **Trigger:** Log contains `An error occurred in the redo thread for database` or `Redo thread for database ... encountered error` - **Severity:** Critical — the secondary redo thread has failed; the secondary is no longer applying log records and RPO is accumulating - **Fix:** Note the error number in the log message. Common causes: corruption on the secondary (check E16), log record version mismatch after an upgrade, or disk full on secondary. For disk-full, free space and resume synchronization. For corruption, restore the secondary from a backup and re-seed. ### E8 — Secondary Not Synchronising - **Trigger:** Log contains `Waiting for redo catch-up` or mentions secondary redo queue in a warning context; or `log send queue` appearing repeatedly with growing values - **Severity:** Warning — secondary is lagging; failover to this replica would result in data loss proportional to the redo queue depth - **Fix:** Check network bandwidth between primary and secondary. Run `SELECT redo_queue_size, redo_rate FROM sys.dm_hadr_database_replica_states`. If redo rate < log generation rate, the secondary cannot keep up — review disk I/O on secondary (E15) or increase network bandwidth. --- ## Memory and Resource Pressure Checks (E9–E14) ### E9 — FAIL_PAGE_ALLOCATION - **Trigger:** Log contains `FAIL_PAGE_ALLOCATION` (exact string, case-insensitive) - **Severity:** Critical — SQL Server could not satisfy an internal memory allocation; queries may have failed with out-of-memory errors; this entry often precedes OS paging (E10) - **Fix:** Check `max server memory` configuration (E26). Run `SELECT type, pages_kb FROM sys.dm_os_memory_clerks ORDER BY pages_kb DESC` to identify which clerk is consuming the most memory. Consider reducing max server memory by 10–15% to leave headroom for OS and other processes. ### E10 — OS Memory Pressure - **Trigger:** Log contains `A significant part of sql server process memory has been paged out` or `Working set trim` - **Severity:** Critical — Windows has paged SQL Server memory to disk under OS memory pressure; buffer pool pages are on disk, causing extreme I/O latency - **Fix:** Reduce `max server memory` to allow OS headroom (leave at least 10% of RAM or 4 GB, whichever is greater). Enable `Lock Pages in Memory` (LPIM) to prevent paging for 64-bit SQL Server service account. Investigate other processes competing for RAM on the host. ### E11 — Buffer Pool Insufficient - **Trigger:** Log contains `There is insufficient system memory in resource pool` or `Memory Manager: Memory node available memory is less than threshold` - **Severity:** Critical — queries requiring memory grants are being denied; workload will stall on `RESOURCE_SEMAPHORE` waits - **Fix:** Run `/sqlwait-review` and check for `RESOURCE_SEMAPHORE` dominance. Increase `max server memory` if physical RAM allows, or reduce `min memory per query` via Resource Governor. Identify large-grant queries with `/sqlplan-review` S2–S4. ### E12 — Worker Thread Exhaustion - **Trigger:** Log contains `There are no more threads available to process new requests` or `Worker Thread ... has been waiting too long` - **Severity:** Critical — new connections are being refused or queued; the instance is at maximum worker thread capacity - **Fix:** Increase `max worker threads` via `sp_configure` only after identifying root cause. Common causes: blocking chains holding threads (check `sys.dm_exec_requests`), long-running queries, or undersized `max worker threads` for the workload. Run `/sqlwait-review` for `THREADPOOL` waits (V-checks). ### E13 — Scheduler Non-Yielding - **Trigger:** Log contains `Process appears to be non-yielding on Scheduler` or `A scheduler appears to be non-yielding` - **Severity:** Critical — a thread is monopolising a scheduler without yielding; this blocks all other threads on that scheduler, degrades responsiveness, and can trigger AG health-check timeouts (E6) and lease expiry (E2) - **Fix:** A memory dump is typically generated automatically. Look for a `.mdmp` file in the SQL Server Log directory matching the timestamp. Common causes: large in-memory sort, CLR call, XTP operation, or a bug in a specific build — check if a known hotfix applies for the version (E28). Consider enabling `DBCC TRACEON(8086)` on advice from Microsoft Support. ### E14 — Memory Grant Timeout - **Trigger:** Log contains `Memory grant request timed out` or `A request for memory failed with OOM (out of memory) status` - **Severity:** Warning — a query could not acquire its requested memory grant within the timeout; it may have been killed or retried with a reduced grant, causing a spill to TempDb - **Fix:** Capture the affected query and run `/sqlplan-review` for S2–S4 (memory grant checks). Update statistics to improve cardinality estimates. Use Resource Governor to cap grants for ad-hoc workloads. Check for E11 (resource pool exhaustion) as a co-trigger. --- ## I/O and Storage Checks (E15–E19) ### E15 — I/O Subsystem Slow - **Trigger:** Log contains `SQL Server has encountered` combined with `I/O requests taking longer than 15 seconds` (SQL Server's built-in slow I/O threshold) - **Severity:** Critical — storage latency has exceeded the 15-second internal threshold; this is a primary trigger for AG lease expiry (E2) and health-check timeouts (E6) - **Fix:** Note the file path and database in the message. Investigate storage subsystem: check disk queue length, RAID controller cache status, SAN/NVMe latency metrics, and any concurrent backup or maintenance operations competing for I/O. If on a VM, check storage IOPS limits. Run `/sqlwait-review` for `PAGEIOLATCH_SH` and `PAGEIOLATCH_EX` dominance. ### E16 — Database Corruption Warning - **Trigger:** Log contains `checksum mismatch`, `torn page`, `consistency errors detected`, or `DBCC CHECKDB found` with error counts > 0 - **Severity:** Critical — data corruption has been detected; backup integrity is unknown until verified; the affected database may be inaccessible or returning wrong results - **Fix:** Run `DBCC CHECKDB ([database]) WITH NO_INFOMSGS` immediately to assess scope. Do not attempt to repair until a current, verified backup exists. For `REPAIR_ALLOW_DATA_LOSS`, treat it as a last resort — restore from backup is always preferable. Investigate E15 (I/O latency) and storage hardware health as root causes. ### E17 — TempDB Space Exhaustion - **Trigger:** Log contains `Could not allocate space` combined with `in database 'tempdb'` or `tempdb is full` or `tempdb ran out of space` - **Severity:** Critical — queries requiring temporary space (sorts, hashes, spools, row versioning) are failing; error 1105 is returned to applications - **Fix:** Immediately: `DBCC SHRINKFILE` on tempdb data files to recover any unused allocated space, or add a tempdb data file. Long term: investigate which query is consuming tempdb (check `sys.dm_db_session_space_usage`). Run `/sqlplan-review` for N41–N43 (spill operators). Consider pre-allocating tempdb to expected working size at startup. ### E18 — Log Backup Overdue - **Trigger:** Gap between consecutive `Database backed up` entries for the same database exceeds the threshold for that recovery model. For databases in FULL or BULK_LOGGED recovery, flag if the gap exceeds 24 hours; flag more urgently if log backup entries are absent while other evidence suggests active transaction log growth - **Severity:** Warning — log space will grow unboundedly without log backups; in a FULL recovery database, the log cannot be truncated until backed up - **Fix:** Run a log backup immediately: `BACKUP LOG [database] TO DISK = N'path\logbackup.bak'`. Verify the SQL Agent log backup job is scheduled and enabled. Check `sys.databases` column `log_reuse_wait_desc` — if `LOG_BACKUP`, the log is waiting for a backup to allow truncation. ### E19 — VLF Proliferation Signal - **Trigger:** Log shows repeated `autogrow` events on transaction log files (multiple autogrow completions in the log window), or the database log has grown significantly between ERRORLOG entries — inferred from repeated log file path growth messages - **Severity:** Info — excessive VLFs degrade recovery time and log-backup performance; auto-grow events indicate the log was not sized for the workload - **Fix:** Shrink and pre-size the log: set the initial log file size to cover expected working set and disable autogrow on the log (or set a large, infrequent growth increment). Run `DBCC LOGINFO ([database])` to count current VLFs — if > 1,000, shrink and re-expand in one step. Align with E18 (log backup cadence) to ensure the log truncates regularly. --- ## Startup, Shutdown, and Connectivity Checks (E20–E24) ### E20 — Abnormal Shutdown - **Trigger:** Log contains `SQL Server is terminating` or `SQL Server has encountered` combined with `stack dump` or shutdown messages, without a preceding graceful shutdown marker (`SQL Server is terminating due to a system shutdown request` at the end of the prior log file) - **Severity:** Critical — the instance crashed rather than shut down cleanly; uncommitted transactions were rolled back on restart; any in-flight work is lost - **Fix:** Check the Windows Event Log (`Application` and `System` sources) for the crash timestamp. Look for a dump file in the SQL Server Log directory. If the crash occurred mid-transaction in an AG, check whether secondary databases advanced beyond the primary (split-brain risk). Engage Microsoft Support with the minidump if the crash is reproducible. ### E21 — Repeated Restarts - **Trigger:** ERRORLOG or combined ERRORLOG + ERRORLOG.1 contains ≥ 2 SQL Server startup messages (lines containing `SQL Server is starting` or `This instance of SQL Server last reported using a process ID`) within a 60-minute window - **Severity:** Critical — the instance is crash-looping; each restart drops all plan cache and connection state; applications experience repeated connection failures - **Fix:** Check E20 (abnormal shutdown) for the crash cause between restarts. If the instance is restarting due to a failed startup condition (e.g., tempdb creation failure, master database corruption, or xp_cmdshell misconfiguration), resolve the startup error first. Enable Windows `Automatic Recovery` only after identifying the underlying fault. ### E22 — Login Failure Burst - **Trigger:** Count of `Login failed` entries exceeds the threshold within a 5-minute rolling window — see Thresholds Reference for Warning and Critical levels - **Severity:** Warning if > 5 failures in 5 min; Critical if > 20 failures in 5 min - **Fix:** Identify the `ClientConnectionID` and source IP in the failure messages. A burst from one account likely indicates a misconfigured application connection string after a password rotation. A burst from many accounts may indicate a brute-force attempt. For brute-force: enable SQL Server Audit or Extended Events on `Failed Logins` and block the source IP at the network layer. Ensure `LOGINAUDIT` is set to `Failed logins only` or `Both` in Server properties so future bursts appear in the ERRORLOG. ### E23 — Linked Server Error - **Trigger:** Log contains `OLE DB provider` combined with `reported an error` or `Cannot obtain the required interface` for a linked server provider - **Severity:** Warning — distributed queries or cross-server stored procedures using this linked server will fail until the provider error is resolved - **Fix:** Identify the linked server name and provider from the error text. Common causes: target server unavailable, credential expiry, or OLE DB provider version mismatch. Test connectivity: `EXEC sp_testlinkedserver [linked_server_name]`. If the provider is outdated, update it on the SQL Server host. ### E24 — Connectivity Error - **Trigger:** Log contains `A connection was successfully established with the server, but then an error occurred during the login process` or `The connection has been lost` or `A network-related or instance-specific error` in the ERRORLOG (as opposed to the client) - **Severity:** Warning — SQL Server is logging errors from its own outbound connections (linked servers, distributed queries, SSISDB, mail, replication) or from incoming connections that dropped after TCP establishment - **Fix:** Correlate the timestamp with E22 (login failures), network infrastructure changes, or TLS/SSL certificate renewals. If `TLS handshake` appears in the message, verify that the certificate in use has not expired and that the client supports the negotiated protocol. --- ## Configuration and Informational Checks (E25–E28) ### E25 — Trace Flag Active - **Trigger:** Log contains `Trace flag` combined with `is set` or `was enabled at startup` in startup messages - **Severity:** Info — trace flags change engine behavior; document intent and verify they are still appropriate for the current SQL Server version - **Fix:** List all active trace flags: `DBCC TRACESTATUS(-1)`. Common production trace flags and their intent: 1117/1118 (tempdb allocation — superseded in 2016+), 3226 (suppress successful backup log entries), 4199 (QO hotfixes). Remove trace flags that are no longer needed or that apply to behaviour fixed in a later CU. ### E26 — Max Server Memory Default - **Trigger:** Log contains startup line showing `max server memory` = 2147483647 MB, or the instance has been running with the default (unlimited) memory configuration — inferred from startup messages or the absence of an explicit `max server memory` setting entry - **Severity:** Info — unlimited memory allows SQL Server to consume all available RAM, starving the OS and any other services, which can trigger E10 (OS paging) - **Fix:** Set `max server memory` to total RAM minus OS headroom: leave at least 10% of RAM or 4 GB (whichever is larger) for the OS. For example, on a 64 GB server: `EXEC sp_configure 'max server memory (MB)', 57344; RECONFIGURE;` ### E27 — ERRORLOG Rotation Gap - **Trigger:** Only a single ERRORLOG file is provided, covering a window shorter than 24 hours, with no prior context from ERRORLOG.1 or earlier - **Severity:** Info — events before the current file (including the original startup, prior AG failovers, or earlier memory events) are not visible; findings may be incomplete - **Fix:** Retrieve prior ERRORLOG files: `EXEC xp_readerrorlog 1, 1` through `EXEC xp_readerrorlog 6, 1` (SQL Server retains up to 6 prior logs by default, configurable in SSMS → Server Properties → Database Settings → Number of error log files). State in the report: "Analysis covers [start] – [end] only; prior events not available." ### E28 — SQL Server Version - **Trigger:** Startup line containing `Microsoft SQL Server 20XX` version string — present in every ERRORLOG at instance start - **Severity:** Info — extract and evaluate: (1) is this build on extended support, mainstream support, or past end-of-support? (2) is this the latest CU for this major version? - **Fix:** Compare the build number in the log against the Microsoft SQL Server build list. If past end-of-support (e.g., SQL 2014 EOL 2019-07-09, SQL 2016 EOL 2026-07-14), plan upgrade. If not on the latest CU, evaluate whether open bugs fixed in later CUs are relevant to the observed issues. Report the version string verbatim in the Output Summary. --- ## Output Format Structure the report as follows. Use this exact section order. ``` ## SQL Server ERRORLOG Analysis ### Summary - X Critical, Y Warnings, Z Info - Time range: [first log entry datetime] – [last log entry datetime] - SQL Server version: [version string from E28 startup line, or "Not found in provided excerpt"] - Highest-risk finding: [check name and ID, e.g., "E2 — Lease Expiry"] - Log coverage note: [single file / multiple files / partial excerpt — dates if known] ### Critical Issues ### [C1 — E2] Lease Expiry (2026-01-15 14:32:05) - **Observed:** "lease between the availability group 'AG1' and the Windows Server Failover Cluster has expired" at 14:32:05. Preceded by E15 (I/O slow) at 14:28:44 on E:\Data\AG1_Primary.mdf. - **Impact:** Unplanned AG failover triggered. AG1 primary role transferred to secondary. Applications lost primary connection for the duration of the failover. - **Fix:** Investigate I/O latency on E:\Data at 14:28 (see C2 — E15). Do not increase LeaseTimeout without resolving the root cause I/O delay. ### Warnings ### [W1 — E1] AG Failover Event (2026-01-15 14:32:08) ... ### Info ### [I1 — E25] Trace Flag Active (startup) ... ### Passed Checks | Check | Result | |-------|--------| | E9 — FAIL_PAGE_ALLOCATION | PASS — no FAIL_PAGE_ALLOCATION entries found | | E16 — Database Corruption Warning | PASS — no checksum or torn-page errors found | ``` Each finding label uses `[C1]`, `[W1]`, `[I1]` sequence numbering, with the check ID in parentheses. Findings reference related checks by ID where one explains another (e.g., "root cause of C1 — E2"). Passed Checks must list every check explicitly evaluated. When a check cannot be evaluated (e.g., E18 with no backup log entries), state "SKIP — no `Database backed up` entries in provided log window" rather than PASS or FAIL. --- ## Notes - ERRORLOG entries use local server time — note timezone if it differs from the analyst's context. - Messages from `spid28s` (or any `spidNs`) are system threads; `Logon` is the login auditing thread; `Backup` is the backup thread. - When multiple ERRORLOG files span a long window, the startup entry in each file signals the beginning of a new SQL Server process (i.e., a restart occurred between files). - The ERRORLOG does not record all events — OS-level events (WSFC partitions, disk controller errors) appear only in the Windows Event Log and WSFC cluster log. Reference the companion skill list below for those artifacts when ERRORLOG evidence points to external causes. - Do not report a PASS for E18 if no `Database backed up` entries are present — the absence of backup log entries is itself an E18 signal for databases in FULL recovery. State clearly which databases had backup evidence and which did not. ## Companion Skills - `/sqlwait-review` — correlate ERRORLOG memory and I/O signals (E9–E15) with `PAGEIOLATCH_SH`, `RESOURCE_SEMAPHORE`, `HADR_SYNC_COMMIT`, and `THREADPOOL` wait dominance - `/sqlplan-review` + `/sqlplan-index-advisor` — analyze execution plans for queries that were running during the incident window; high-cost queries during a memory or I/O event often accelerate the failure - `/query-store-review` — identify plan regressions introduced after a post-incident restart clears the plan cache, causing previously stable queries to recompile with bad plans - `/tsql-review` — review T-SQL source of stored procedures flagged during the incident as high resource consumers before and after the failure - `/sqlplan-deadlock` — if E22 (login failure burst) or connectivity errors coincide with error 1205 in application logs, analyze the deadlock XML from the `system_health` XE session