---
name: implement-retry-logic
description: |
  Implements retry logic with exponential backoff and jitter for async operations. Use when implementing
  external service calls, database operations, network requests, or any operation that
  may fail transiently. Follows project patterns for ServiceResult wrapping, configuration
  injection, and resilience patterns. Triggers on "add retry logic for X", "implement exponential
  backoff", "make Z resilient", or "handle transient errors".
allowed-tools:
  - Read
  - Grep
  - Edit
  - MultiEdit
---

Works with async/await operations in Python.
# Implement Retry Logic

## Purpose

Add retry logic with exponential backoff and jitter to async operations, following project patterns for resilience, configuration injection, and ServiceResult wrapping.

## When to Use

Use this skill when:
- **Implementing external service calls** - API calls that may timeout or fail
- **Database operations** - Database connections that may drop
- **Network requests** - Any network operation subject to transient failures
- **Operations that may fail transiently** - Temporary errors that can be retried

**Trigger phrases:**
- "Add retry logic for X"
- "Implement exponential backoff for Y"
- "Make Z resilient to failures"
- "Handle transient errors in X"

## Table of Contents

### Core Sections

- [Purpose](#purpose)
  - Core capability of the skill
- [Quick Start](#quick-start)
  - Pattern detection and basic implementation
- [Instructions](#instructions)
  - [Step 1: Identify Retry Requirements](#step-1-identify-retry-requirements) - When to use retry logic
  - [Step 2: Add Configuration](#step-2-add-configuration) - Configuration injection patterns
  - [Step 3: Implement Retry Logic](#step-3-implement-retry-logic) - Core retry implementation
  - [Step 4: Add Backoff Helper](#step-4-add-backoff-helper) - Exponential backoff with jitter
  - [Step 5: Classify Errors](#step-5-classify-errors) - Retriable vs permanent errors
  - [Step 6: Add Tests](#step-6-add-tests) - Test coverage for retry behavior

### Examples & Reference

- [Examples](#examples)
  - [Example 1: Database Operation Retry](#example-1-database-operation-retry) - Neo4j database operations with retry
  - [Example 2: External API Retry](#example-2-external-api-retry) - HTTP API calls with rate limiting
- [Requirements](#requirements)
  - Dependencies and project patterns
- [See Also](#see-also)
  - Supporting resources and reference implementations

### Supporting Resources

- [references/reference.md](./references/reference.md) - Advanced patterns and troubleshooting
- [templates/retry-template.py](./templates/retry-template.py) - Copy-paste template

### Utility Scripts
- [Add Retry Logic](./scripts/add_retry_logic.py) - Auto-add retry logic to async service methods
- [Analyze Retryable Operations](./scripts/analyze_retryable_operations.py) - Analyze codebase to find operations that need retry logic
- [Validate Retry Patterns](./scripts/validate_retry_patterns.py) - Validate retry logic implementations against best practices

## Quick Start

**Pattern Detection**: Look for external API calls, database operations, or network requests without retry logic.

**Basic Implementation**: Add retry loop with exponential backoff + jitter:

```python
# Configuration in settings
max_retries: int = 3
retry_delay: float = 1.0

# Implementation
for attempt in range(max_retries):
    try:
        result = await external_operation()
        return ServiceResult.ok(result)
    except RetriableError as e:
        if attempt < max_retries - 1:
            delay = min(retry_delay * (2 ** attempt), 30.0)
            jitter = delay * 0.2 * (2 * (time.time() % 1) - 1)
            await asyncio.sleep(max(0.1, delay + jitter))
        else:
            return ServiceResult.fail(f"Failed after {max_retries} retries: {e}")
```

## Instructions

### Step 1: Identify Retry Requirements

**Check if retry is needed:**
- [ ] External service call (API, database, network)
- [ ] Operation may fail transiently (timeouts, rate limits)
- [ ] Operation is idempotent (safe to retry)
- [ ] Failures should not crash the system

**Anti-patterns to avoid:**
- ❌ Retrying non-idempotent operations (creates duplicates)
- ❌ Retrying permanent errors (syntax errors, bad input)
- ❌ No backoff delay (hammers failing service)
- ❌ Infinite retries (never gives up)

### Step 2: Add Configuration

**Add retry settings to config/settings.py:**

```python
@dataclass
class ServiceSettings:
    """Configuration for [service name]."""

    # Existing fields...

    # Retry configuration
    max_retries: int = 3  # Maximum retry attempts
    retry_delay: float = 1.0  # Base delay in seconds

    @classmethod
    def from_env(cls) -> "ServiceSettings":
        return cls(
            # Existing fields...
            max_retries=int(os.getenv("SERVICE_MAX_RETRIES", "3")),
            retry_delay=float(os.getenv("SERVICE_RETRY_DELAY", "1.0")),
        )
```

**Configuration Rules:**
1. Always inject via Settings (never hardcode)
2. Provide environment variable overrides
3. Use sensible defaults (max_retries=3, retry_delay=1.0)
4. Document units (seconds, milliseconds)

### Step 3: Implement Retry Logic

**Use project pattern with exponential backoff + jitter:**

```python
async def _call_with_retry(self, operation_name: str) -> ServiceResult[T]:
    """Call external service with retry logic.

    Args:
        operation_name: Name for logging

    Returns:
        ServiceResult with operation result or error
    """
    last_error: str = ""

    for attempt in range(self.settings.max_retries):
        try:
            # Perform operation
            result = await self._perform_operation()
            return ServiceResult.ok(result)

        except aiohttp.ClientConnectionError as e:
            # Connection errors are retriable
            last_error = f"Connection error: {e}"
            if attempt < self.settings.max_retries - 1:
                delay = self._calculate_backoff_delay(attempt)
                logger.warning(
                    f"{last_error}. Retrying in {delay:.1f}s "
                    f"(attempt {attempt + 1}/{self.settings.max_retries})"
                )
                await asyncio.sleep(delay)

        except TimeoutError as e:
            # Timeouts are retriable
            last_error = f"Request timed out: {e}"
            if attempt < self.settings.max_retries - 1:
                delay = self._calculate_backoff_delay(attempt)
                logger.warning(
                    f"{last_error}. Retrying in {delay:.1f}s "
                    f"(attempt {attempt + 1}/{self.settings.max_retries})"
                )
                await asyncio.sleep(delay)

        except ValueError as e:
            # Validation errors are NOT retriable (permanent)
            logger.error(f"Validation error (non-retriable): {e}")
            return ServiceResult.fail(f"Validation error: {e}")

        except Exception as e:
            # Unexpected errors - fail fast
            logger.error(f"Unexpected error in {operation_name}: {e}")
            return ServiceResult.fail(f"Unexpected error: {e}")

    # All retries exhausted
    return ServiceResult.fail(
        f"Failed after {self.settings.max_retries} retries: {last_error}"
    )
```

**Key Components:**
1. **Retry Loop**: `for attempt in range(max_retries)`
2. **Error Classification**: Retriable vs permanent errors
3. **Backoff Calculation**: Exponential with jitter
4. **Logging**: Warning on retry, error on failure
5. **ServiceResult**: Always return ServiceResult, never raise

### Step 4: Add Backoff Helper

**Helper method for exponential backoff with jitter:**

```python
def _calculate_backoff_delay(self, attempt: int) -> float:
    """Calculate exponential backoff with jitter.

    Args:
        attempt: Current attempt number (0-based)

    Returns:
        Delay in seconds with jitter
    """
    base_delay = self.settings.retry_delay

    # Exponential backoff: base * 2^attempt, capped at 30s
    delay = min(base_delay * (2 ** attempt), 30.0)

    # Add jitter to avoid thundering herd (±20%)
    jitter = delay * 0.2 * (2 * (time.time() % 1) - 1)

    return max(0.1, delay + jitter)
```

**Jitter prevents thundering herd:**
- Multiple clients don't retry at exact same time
- Uses time.time() fractional seconds for randomness
- ±20% variance is industry standard

### Step 5: Classify Errors

**Determine which exceptions are retriable:**

```python
def _is_retriable_error(self, error: Exception, status_code: int | None = None) -> bool:
    """Determine if error is retriable.

    Args:
        error: Exception that occurred
        status_code: HTTP status code if applicable

    Returns:
        True if error should be retried
    """
    # HTTP status codes
    if status_code:
        # 429 Rate Limited - retriable
        if status_code == 429:
            return True
        # 5xx Server errors - retriable
        if 500 <= status_code < 600:
            return True
        # 4xx Client errors (except 429) - NOT retriable
        if 400 <= status_code < 500:
            return False

    # Network/connection errors - retriable
    if isinstance(error, (
        aiohttp.ClientConnectionError,
        aiohttp.ServerDisconnectedError,
        TimeoutError,
    )):
        return True

    # Validation/syntax errors - NOT retriable
    if isinstance(error, (ValueError, TypeError, SyntaxError)):
        return False

    # Default: not retriable (fail fast)
    return False
```

**Error Classification Rules:**
- **Retriable**: Timeouts, rate limits, 5xx errors, network errors
- **Permanent**: Validation errors, 4xx errors (except 429), syntax errors
- **Default**: When uncertain, fail fast (not retriable)

### Step 6: Add Tests

**Test retry behavior:**

```python
async def test_retry_on_transient_error():
    """Test that transient errors trigger retry."""
    service = MyService(settings)

    # Mock to fail twice, then succeed
    with patch.object(service, "_perform_operation") as mock_op:
        mock_op.side_effect = [
            TimeoutError("timeout"),
            TimeoutError("timeout"),
            {"status": "success"}
        ]

        result = await service._call_with_retry("test")

        assert result.is_success
        assert mock_op.call_count == 3  # 2 failures + 1 success

async def test_no_retry_on_permanent_error():
    """Test that permanent errors do not retry."""
    service = MyService(settings)

    with patch.object(service, "_perform_operation") as mock_op:
        mock_op.side_effect = ValueError("bad input")

        result = await service._call_with_retry("test")

        assert result.is_failure
        assert mock_op.call_count == 1  # No retry
```

**Test Coverage:**
- Retry on transient errors
- No retry on permanent errors
- Exponential backoff delays
- Max retries exhausted
- Jitter variance

## Examples

### Example 1: Database Operation Retry

```python
async def create_database(self, database_name: str) -> ServiceResult[str]:
    """Create database with retry on transient errors."""
    last_error: str = ""

    for attempt in range(self.settings.max_retries):
        try:
            # Attempt database creation
            await self.execute_query(
                f"CREATE DATABASE `{database_name}` IF NOT EXISTS",
                database="system",
            )
            return ServiceResult.ok(database_name, was_created=True)

        except Neo4jConnectionError as e:
            # Connection errors are retriable
            last_error = f"Connection error: {e}"
            if attempt < self.settings.max_retries - 1:
                delay = self._calculate_backoff_delay(attempt)
                logger.warning(f"Retrying in {delay:.1f}s (attempt {attempt + 1})")
                await asyncio.sleep(delay)

        except Neo4jPermissionError as e:
            # Permission errors are NOT retriable
            return ServiceResult.fail(
                f"Permission denied: {e}",
                error_type="PermissionError",
                recoverable=False
            )

    return ServiceResult.fail(f"Failed after {self.settings.max_retries} retries: {last_error}")
```

### Example 2: External API Retry

```python
async def _call_embedding_api(self, texts: list[str]) -> ServiceResult[list[list[float]]]:
    """Call embedding API with retry and rate limit handling."""
    last_error: str = ""

    for attempt in range(self.settings.max_retries):
        try:
            session = await self._get_session()
            payload = {"model": self.model, "input": texts}

            async with session.post(self.api_url, json=payload) as response:
                if response.status == 200:
                    data = await response.json()
                    embeddings = [item["embedding"] for item in data["data"]]
                    return ServiceResult.ok(embeddings)

                elif response.status == 429:
                    # Rate limited - retriable
                    last_error = "Rate limited by API"
                    if attempt < self.settings.max_retries - 1:
                        # Use Retry-After header if available
                        retry_after = response.headers.get("Retry-After", "2")
                        delay = min(float(retry_after), 30.0)
                        logger.warning(f"Rate limited. Retrying in {delay}s")
                        await asyncio.sleep(delay)

                elif 500 <= response.status < 600:
                    # Server error - retriable
                    last_error = f"Server error {response.status}"
                    if attempt < self.settings.max_retries - 1:
                        delay = self._calculate_backoff_delay(attempt)
                        await asyncio.sleep(delay)

                else:
                    # Client error - NOT retriable
                    error_text = await response.text()
                    return ServiceResult.fail(f"API error {response.status}: {error_text}")

        except aiohttp.ClientConnectionError as e:
            last_error = f"Connection error: {e}"
            if attempt < self.settings.max_retries - 1:
                delay = self._calculate_backoff_delay(attempt)
                await asyncio.sleep(delay)

    return ServiceResult.fail(f"Failed after {self.settings.max_retries} retries: {last_error}")
```

## Requirements

**Dependencies:**
- `asyncio` - Async/await and sleep
- `time` - Jitter calculation
- `aiohttp` - HTTP client (for network operations)

**Project Patterns:**
- ServiceResult for return values
- Settings injection for configuration
- OTEL logging (not print statements)
- Fail-fast principle

**Configuration:**
- Add retry settings to config/settings.py
- Provide environment variable overrides
- Never hardcode retry parameters

## See Also

- [references/reference.md](./references/reference.md) - Advanced patterns and troubleshooting
- [templates/retry-template.py](./templates/retry-template.py) - Copy-paste template
- [scripts/add_retry_logic.py](./scripts/add_retry_logic.py) - Auto-add retry logic utility
- [scripts/analyze_retryable_operations.py](./scripts/analyze_retryable_operations.py) - Codebase analysis utility
- [scripts/validate_retry_patterns.py](./scripts/validate_retry_patterns.py) - Validation utility
- [ARCHITECTURE.md](/Users/dawiddutoit/projects/play/project-watch-mcp/ARCHITECTURE.md) - ServiceResult pattern
- [src/project_watch_mcp/infrastructure/embeddings/infinity/embedding_service.py](/Users/dawiddutoit/projects/play/project-watch-mcp/src/project_watch_mcp/infrastructure/embeddings/infinity/embedding_service.py) - Reference implementation