---
name: valkey-dev
description: "Use when contributing to the Valkey server source - C internals, event loop, data structures, cluster, replication, persistence, memory, threading, modules, scripting, tests, build. Only what diverges from Redis or is genuinely novel; agent-trained-on-Redis knowledge is assumed. Not for app development (valkey) or ops (valkey-ops)."
version: 3.0.0
argument-hint: "[subsystem or source path]"
---

# Valkey Contributor Reference

This skill does not replace reading and reasoning about the relevant code; it helps you find the nuance faster. Redis-baseline behavior is assumed and not repeated; what you read below is Valkey-specific divergence. All content targets `unstable`.

## Quick start

```sh
make -j$(nproc)                          # build
./runtest --verbose --tags -slow         # core integration tests
./runtest-cluster                        # legacy cluster tests
./runtest-moduleapi                      # module API tests
./runtest-sentinel                       # Sentinel tests
make test-unit                           # C++ gtest unit tests
make SANITIZER=address MALLOC=libc       # ASan
```

## Top-level grep hazards (Redis-trained knowledge that misleads)

- `redisCommand` is gone; the struct is `struct serverCommand`. Command table is `server.commands = hashtableCreate(&commandSetType)`, not a `dict`.
- `zmalloc` / `zfree` are `#define`d to `valkey_malloc` / `valkey_free` (stack traces, debuggers, `nm` show `valkey_*`). A rename PR with both sides is not a symbol change.
- `RedisModule_*` -> `ValkeyModule_*`. Compat shim `src/redismodule.h` is pinned at Redis 7.2.4; every post-7.2.4 API is reachable only through `src/valkeymodule.h`.
- Replication configs renamed: `slaveof`/`slave-priority`/`masteruser`/`masterauth` -> `replicaof`/`replica-priority`/`primaryuser`/`primaryauth`. Grepping the Redis name finds only the alias.
- Hashtable design is **bucket chaining**, not open-addressing, not Robin Hood. 64-byte buckets, 7 entries (12 on 32-bit) + chain pointer.
- Invalidation channel is still named `__redis__:invalidate` (not `__valkey__:*`) - renaming breaks every tracking-enabled client.
- `adjustIOThreadsByEventLoad` does NOT exist. Real call sites are `IOThreadsBeforeSleep` / `IOThreadsAfterSleep` with the Ignition/Cooldown CPU-sample policy.
- `events-per-io-thread` and `io-threads-do-reads` are deprecated (in `deprecated_configs[]`), silently accepted as no-ops. Scaling is CPU-based, not event-count.
- Embed-string budget is **128 bytes** (2 cache lines via `shouldEmbedStringObject`), not the old `OBJ_ENCODING_EMBSTR_SIZE_LIMIT 44`.
- `dict` is now `typedef hashtable dict;` in `src/dict.h` (and `src/dict.c` is gone). Callers still using `dict*`/`dictEntry*`: Sentinel, `cluster_legacy.c`, pub/sub patterns, latency, scripting, functions, blocked clients, `subcommands_ht`. `dictEntry->next` and old chaining fields are gone; casting through them will not compile.

## Critical correctness rules

These fail loudly in review or silently corrupt the wire protocol.

- Run `lookupKey*` and type checks BEFORE `addReplyArrayLen` / `addReplyMapLen` / `addReplyPushLen`. Emitting a header then an error desyncs RESP for every subsequent reply on the connection. `debugServerAssert` enforces this in debug builds. HPERSIST is the canonical example.
- Edits to `src/commands/*.json` require `make commands.def` and committing the regenerated file; CI fails on diff. `lastkey` is an argv position - miscounting breaks cluster routing with CROSSSLOT.
- Module-API struct changes use `VersionN` extension, never in-place mutation: bump `VALKEYMODULE_*_ABI_VERSION`, keep V1, add V2 that embeds V1, version-gate reads with `methods.version >= N`. A mutation silently breaks every compiled out-of-tree module. Enums are append-only (never insert); `_VALKEYMODULE_SUBEVENT_..._NEXT` sentinel is required on server-event subevents.
- `server.current_client` may be NULL during active expire, `delKeysInSlot`, module cron, module timer, AOF load, and cluster topology updates. Propagation / ACL / stats code that reads `current_client` must null-guard or resolve via `server.executing_client`.
- If you add a new allocation owned by a type (defrag, RDB, AOF, module), implement the `dismissObject` / defrag / RDB callbacks - silent leaks in fork children are hard to find later.
- PRs target `unstable`. DCO sign-off (`git commit -s`) is required. `clang-format-18` diff fails CI.
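
The `VersionN` rule above can be sketched as follows. All type and function names here are hypothetical and are not taken from `src/valkeymodule.h`; the sketch only shows the shape: V2 embeds V1 as its first member, and the server reads a V2-only field only after checking the declared version.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout: these are NOT real valkeymodule.h types. */
#define EXAMPLE_METHODS_ABI_VERSION 2

typedef struct ExampleMethodsV1 {
    uint64_t version;      /* the ABI version the module was compiled against */
    int (*on_load)(void);
} ExampleMethodsV1;

typedef struct ExampleMethodsV2 {
    ExampleMethodsV1 v1;   /* V1 stays first and is never edited in place */
    int (*on_defrag)(void); /* new in V2, appended only */
} ExampleMethodsV2;

/* Server side: read the V2 field only when the module declared version >= 2. */
static int exampleCallDefrag(const void *methods) {
    const ExampleMethodsV1 *v1 = methods;
    assert(v1 != NULL);
    if (v1->version >= EXAMPLE_METHODS_ABI_VERSION) {
        const ExampleMethodsV2 *v2 = methods;
        if (v2->on_defrag) return v2->on_defrag();
    }
    return -1; /* old module: never read past the V1 prefix */
}

/* Stand-in callback used to exercise the V2 path. */
static int exampleDefragStub(void) {
    return 7;
}
```

A module compiled against V1 passes a smaller struct; because the server gates on `version`, it never reads memory the module did not allocate.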

## Iterator-invariant taxonomy (read before matching individual hashtable rules)

Five UAF/corruption windows around the hashtable and its clients. The per-subsystem rules below cover obvious cases; these are the bug classes they miss. Any change widening one of these must close it before the rest of this file is relevant.

1. **Rehash cursor direction.** `rehash_idx` advances 0 upward. `idx < rehash_idx` has migrated to `tables[1]`; `idx >= rehash_idx` still lives in `tables[0]`. "Skip already-rehashed" helpers use `<`, not `<=`, not `>`. `findBucket`, `hashtableReplaceReallocatedEntry`, and safe-iterator init share this pattern. Inverting compiles, passes unmirrored tests, corrupts silently under rehash.
2. **Scan + shrink race.** `hashtableTwoPhasePopDelete` reserves a position during begin; a shrink between begin and finalize invalidates it. Pause auto-shrink across the pair or defer shrink until after finalize.
3. **Safe-iterator lifetime.** Safe iterators pause incremental rehash for their lifetime. `hashtableRelease` may run while an iterator is still registered, so `hashtableCleanupIterator` MUST unregister from `ht->safe_iterators` before the table is freed - otherwise a later release walks dangling `next_safe_iter`. Canonical UAF path for hashtable changes.
4. **stringRef ownership (entries).** `entryUpdateAsStringRef` points at a caller-owned buffer; if the caller frees or reuses it before the entry is freed or re-updated, the hash read path UAFs.
5. **Two-phase insert reservation.** `hashtableInsertAtPosition` commits what `hashtableFindPositionForInsert` reserved; any realloc, rehash, or shrink in between invalidates the reservation. Treat the pair as atomic - no allocations, no scans, no callback re-entry between them.
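
Item 1 reduces to a one-line predicate. A minimal sketch, assuming an unsigned cursor and a hypothetical helper name (the real code may use a signed `rehash_idx` with a "not rehashing" sentinel):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: has bucket `idx` of the old table already migrated?
 * Returns 1 if it now lives in tables[1], 0 if it is still in tables[0]. */
static int exBucketAlreadyRehashed(size_t idx, size_t rehash_idx, size_t old_size) {
    assert(idx < old_size && rehash_idx <= old_size);
    /* Strictly less: the bucket AT rehash_idx has not been moved yet. */
    return idx < rehash_idx;
}
```

Writing `<=` here would skip one live bucket in `tables[0]`; writing `>` would invert the whole split.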

## Data structures

### Keyspace: kvstore per DB

Each DB carries three kvstores in lockstep - `keys`, `expires`, `keys_with_volatile_items` - so any RDB, replication, or defrag path touching one must plan for the other two. The keyspace is a `kvstore`, not a `dict`: cluster mode allocates 16,384 hashtables per kvstore (one per slot), standalone uses one, routed via `getKVStoreIndexForKey()`. DBs are lazy (`server.db[]` is `serverDb *`, `createDatabaseIfNeeded(id)` on first use); only `db == NULL` is the genuine skip, since `kvstoreCreate()` has baseline cost at zero hashtables.

- `db->keys_with_volatile_items` holds every hash key carrying per-field TTL, maintained by `dbTrackKeyWithVolatileItems()` / `dbUntrackKeyWithVolatileItems()`. Active hash-field expiration must register with the kvstore's incremental-rehashing hook or expiration halts mid-rehash and leaks phantom vset bookkeeping.

### kvstore (`src/kvstore.c`, `src/kvstore.h`)

Valkey-only wrapper presenting an array of `hashtable *`. Consumers: main keyspace, `expires`, `keys_with_volatile_items`, `pubsub_channels`, `pubsubshard_channels`. `num_hashtables_bits` = log2 of array size (0 = a single hashtable in standalone, 14 = 16384 cluster slots, 16 max). `kvstoreCreate(type, bits, flags)` asserts four wired callbacks: `rehashingStarted`, `rehashingCompleted`, `trackMemUsage`, `getMetadataSize`. Flags: `KVSTORE_ALLOCATE_HASHTABLES_ON_DEMAND` (cluster - most slots empty), `KVSTORE_FREE_EMPTY_HASHTABLES`. A Fenwick tree `hashtable_size_index` tracks cumulative counts when `num_hashtables > 1`, powering `kvstoreFindHashtableIndexByKeyIndex` and `kvstoreGetFairRandomHashtableIndex`. `kvstoreScan` cursor packs `<upper 48: pos>|<lower: hashtable index>`; `onlydidx >= 0` restricts to one index (`-1` scans all). Both `kvstoreIterator` and `kvstoreHashtableIterator` are safe for in-iteration deletion.
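
The scan-cursor packing can be sketched like this. It assumes the hashtable index occupies the low `bits` bits (with `bits` standing in for `num_hashtables_bits`) and the in-table position the remaining high bits; the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: low `bits` bits = hashtable index, high bits = position. */
static uint64_t exPackCursor(uint64_t pos, unsigned didx, unsigned bits) {
    assert(bits <= 16 && didx < (1u << bits));
    assert(bits == 0 || (pos >> (64 - bits)) == 0); /* pos must fit above the index */
    return (pos << bits) | didx;
}

/* Extract the hashtable index from a packed cursor. */
static unsigned exCursorIndex(uint64_t cursor, unsigned bits) {
    return (unsigned)(cursor & ((UINT64_C(1) << bits) - 1));
}

/* Extract the in-table scan position from a packed cursor. */
static uint64_t exCursorPos(uint64_t cursor, unsigned bits) {
    return cursor >> bits;
}
```

With `bits == 0` the cursor is just the position, so a single-hashtable kvstore pays nothing for the packing.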

- Size hints are per-slot. `kvstoreHashtableExpand(kvs, slot, size)` pre-sizes one hashtable; sizing the top-level kvstore is not meaningful. RDB slot-info AUX sends one hint per `(slot, keys, expires, keys_with_volatile_items)` - all three need pre-sizing on load.
- `kvstoreExpand(size=0)` is a no-op returning `true`; the inner `TryExpand(0)` returns `false`. Do not conflate the two layers.
- Cluster slot migration: `importing` is a hashtable of slot indexes being imported, excluded from Fenwick counts and fair-random selection (`kvstoreGetFairRandomHashtableIndex` may return `KVSTORE_INDEX_NOT_FOUND`). `DBSIZE` differs from `COUNTKEYSINSLOT` during migration by design. Full-scan consumers (RDB, replication) set `HASHTABLE_ITER_INCLUDE_IMPORTING`; client-facing reads (SCAN, KEYS, RANDOMKEY, eviction, expiry) MUST NOT.
- kvstore iteration advances via `next_didx`, not `didx`, when filtering. `kvstoreIteratorNext` must reset the previous hashtable's iterator before re-initializing, gated on `kvs_it->didx != -1 AND kvstoreGetHashtable(kvs, didx) != NULL`; missing reset leaves rehashing paused forever on the abandoned table.
- `bucket_count` accounting includes `rehashing->to` during rehash. `freeHashtableIfNeeded` mid-rehash without this leaves a count stuck on the freed primary.
- `kvstore` must not `#include "server.h"`. One-way layering; `valkey-cli` and `valkey-benchmark` reuse data-structure modules.
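
For readers unfamiliar with the Fenwick-tree lookup behind "find the hashtable that holds the N-th key", here is a generic sketch. It is not the kvstore implementation: the array size, names, and 1-based target convention are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define EX_N 8 /* number of hashtables; a power of two, like 1 << num_hashtables_bits */

/* 1-based Fenwick tree over per-hashtable key counts. */
static long long ex_tree[EX_N + 1];

/* Add `delta` keys to hashtable `didx` (0-based). */
static void exFenwickAdd(int didx, long long delta) {
    assert(didx >= 0 && didx < EX_N);
    for (int i = didx + 1; i <= EX_N; i += i & -i) ex_tree[i] += delta;
}

/* Return the 0-based hashtable index that holds the `target`-th key
 * (1-based), counting keys across hashtables in index order. */
static int exFenwickFindIndex(long long target) {
    assert(target >= 1 && target <= ex_tree[EX_N]); /* tree[EX_N] is the total */
    int pos = 0;
    for (int step = EX_N; step > 0; step >>= 1) {
        int next = pos + step;
        if (next <= EX_N && ex_tree[next] < target) {
            pos = next;
            target -= ex_tree[next];
        }
    }
    /* The first `pos` hashtables hold fewer than the original target keys. */
    return pos;
}
```

Both operations are O(log n), which is what makes fair-random selection across 16384 mostly-empty slots affordable.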

### Hashtable (`src/hashtable.c`, `src/hashtable.h`)

Bucket chaining on 64-byte buckets. Each bucket holds 7 entries inline (12 on 32-bit); when a bucket overflows, its last entry slot is repurposed as a pointer to a child bucket (there is no separate 8th slot). `h2` holds the high hash bits, SIMD-scanned (SSE/AVX/NEON) to reject misses without touching entry pointers. Consumers: main keyspace (via kvstore), Set, Hash, Sorted Set (paired with skiplist), `server.commands` / `server.orig_commands`. Design: incremental rehashing across two tables, three-phase resize policy (`ALLOW` / `AVOID` / `FORBID`), incremental find (`hashtableIncrementalFindInit/Step/GetResult`), two-phase insert (`hashtableFindPositionForInsert` + `hashtableInsertAtPosition`).

- `resize()` asserts `!hashtableIsRehashing(ht)`. Callers gate on `hashtableIsRehashing` or `hashtableIsRehashingPaused`. Three-gate order: `MAX_FILL_PERCENT_HARD` -> `resize_policy` -> `resizeAllowed` callback (expand only). `AVOID` during fork; `FORBID` in child.
- Empty-bucket skipping lives only in `rehashStepShrink` (not grow). Condition `b->presence == 0 && !chained` (NOT `b == NULL`); cap 10 empty visits per step (covers 70 dict-equivalent slots).
- "Non-empty" means `used[idx] > 0 OR child_buckets[idx] > 0`. Using `used == 0` alone leaks chained-but-empty buckets; `hashtablePop` only compacts chains when `!hashtableIsRehashingPaused`.
- `pause_rehash` is correctness, `pause_auto_shrink` is performance. Bulk-delete loops over hashtable-encoded sets/zsets bracket with `hashtablePauseAutoShrink` / `hashtableResumeAutoShrink`; on resume, skip if the containing key was already deleted (`keyremoved`) - `dbDelete` freed the hashtable and touching it is a UAF. Gate resume on `keyremoved`, not on reordering `dbDelete` vs `notifyKeyspaceEvent`.
- `validateEntry` / `shouldSkip` are pure predicates on the read/sample path (HRANDFIELD, RANDOMKEY, expiry sampling). `true` = skip, `false` = include. They MAY trigger lazy-expire side effects via return value but MUST NOT mutate, propagate, or notify. This is the seam between read sampling and write propagation.
- `HASHTABLE_ITER_SKIP_VALIDATION` opts into iterating invalid/expired entries (defrag, RDB save, some tests). `randomEntry` must use skip-validation or it loops forever on all-invalid.
- When the hashtable stores DB value objects (robj with embedded key via `objectSetKeyAndExpire`), `hashtableType.hashFunction` and `keyCompare` MUST derive the name from `objectGetKey(o)`, never `dictEncObjHash` / `hashtableEncObjKeyCompare` on the raw robj - those hash `objectGetVal(o)`, so writes silently collide. `lookupKeyRead`'s key argument is a key-name robj; unwrap with `objectGetKey(keys[i])` before passing.
- Shrink-in-progress must be abortable: if an insert pushes `used` above current capacity during shrink, abort-shrink and switch to expand, unless a safe iterator pins the table.
- Fair-random sampling picks a fresh cursor per sample; iterating in scan order from one seed biases toward dense runs once the table is sparse (500x slowdown reproduced). `hashtableScan` reads `rehash_idx` as authoritative for "already migrated in source"; skipping those in table 0 prevents double-visit.
- `iter->num_of_active_buckets` is monotonically non-increasing; `num_of_buckets` is pinned for the iterator's lifetime. Mutating either mid-iteration breaks exhaustion logic.
- Hot callbacks (`validateEntry`, `keyCompare`, `hash`) live in the first cache line of `hashtableType`; cold ones after. Do NOT add `__attribute__((hot))` / `always_inline` - overrides PGO. Iterator bucket prefetch (`HASHTABLE_ITER_PREFETCH_VALUES`) must respect `entryPrefetchValue`, skip `OBJ_ENCODING_INT` (not a real pointer) and NULL; pattern: bucket at `i+2`, entries at `i+1`, consume at `i`.

Hashtable grep hazards: `findBucket` is internal (public surface is `hashtable{Find,Insert,Delete,TwoPhase*}`). "Next bucket" is ambiguous: `getChildBucket` = intra-chain; `getNextBucket` / pointer arithmetic = next top-level INDEX. A conflated `bucketNext` either skips chained children or walks off the table. Stored items are **entries**, not "elements". During rehash, `bucket_exp[hashtableIsRehashing(ht) ? 1 : 0]` reads the OLD table size. `hashTypeEntry` is opaque behind `hashTypeEntryGetField` / `GetValue` / `ReplaceValue`.

### Object lifecycle (`src/object.c`, `struct serverObject`)

The robj contract (types, encodings, refcount, `OBJ_SHARED_INTEGERS = 10000`, `tryObjectEncoding`, `dismissObject`) mostly matches Redis. Layout diverges: three bit-flags - `hasexpire`, `hasembkey`, `hasembval` - gate optional fields after the base struct, so one allocation carries expire + key + value. `objectSet*` MAY reallocate. Embed budget: `shouldEmbedStringObject` returns true when total <= **128 bytes** (base + optional expire + optional key SDS + value SDS). `OBJ_ENCODING_EMBSTR_SIZE_LIMIT 44` is gone. `KEY_SIZE_TO_INCLUDE_EXPIRE_THRESHOLD = 128`: keys >= 128 bytes pre-reserve expire space.

- `objectSetKeyAndExpire(o, key, expire)` and `objectSetExpire(o, expire)` MAY reallocate. Always use the returned pointer.
- Never dereference `val_ptr` when `hasembval` may be set; use `objectGetVal(o)`.
- `OBJ_ENCODING_EMBSTR` values are logically immutable. Write paths (APPEND, INCR, SETRANGE, bit ops, module StringDMA) MUST allocate a new unembedded copy. Any helper calling `objectSetVal` on an embstr is a bug class. Use `objectUnembedVal(o)` to convert EMBSTR -> RAW in place.
- Secondary indexes keyed by a DB robj MUST hash via `objectGetKey()`, never `objectGetVal()`.
- Encoding numbers: String `RAW=0`, `INT=1`, `EMBSTR=8`. Hash/Set/ZSet `HASHTABLE=2`, `INTSET=6`, `SKIPLIST=7`, `LISTPACK=11`. List `QUICKLIST=9`, `LISTPACK=11`. Stream `STREAM=10`. Values 3-5 reserved for legacy-RDB compat, not produced at runtime.
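
Why "always use the returned pointer" matters: a toy object whose optional expire field is appended by `realloc`. This is an analogy with hypothetical names, not the `serverObject` layout.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct exObj {
    unsigned hasexpire : 1;
    long long fields[]; /* fields[0] = expire, present only when hasexpire is set */
} exObj;

/* Allocate an object with no optional fields. */
static exObj *exObjCreate(void) {
    exObj *o = malloc(sizeof(*o));
    assert(o != NULL);
    o->hasexpire = 0;
    return o;
}

/* May move the object: callers must replace their pointer with the result. */
static exObj *exObjSetExpire(exObj *o, long long expire) {
    if (!o->hasexpire) {
        exObj *grown = realloc(o, sizeof(*o) + sizeof(long long));
        assert(grown != NULL);
        o = grown; /* the old address may now be invalid */
        o->hasexpire = 1;
    }
    o->fields[0] = expire;
    return o;
}
```

Usage follows the `o = exObjSetExpire(o, t);` pattern; any other pointer to the old address (for example one stored in a hashtable) must be updated as well.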

### Encoding transitions

Defaults diverge from Redis, and transitions are bidirectional. `hash-max-listpack-entries` is **512** in Valkey (Redis 7.x: 128). Others match Redis (`set-max-listpack-entries=128`, `zset-max-listpack-entries=128`, `set-max-intset-entries=512`, listpack-value cap 64 bytes). Full encoding is `hashtable` (not `dict`); Sorted Set keeps a paired skiplist. Valkey adds `zsetConvertToListpackIfNeeded` (`src/t_zset.c`) and `listTypeTryConvertListpack` (`src/t_list.c` - demotes quicklist -> listpack below half the threshold to avoid oscillation). Do NOT hardcode "listpack -> hashtable" as one-way. Defrag callbacks ASSERT `ob->type` / `ob->encoding` then cast `ob->ptr` - wrong-type is a memory-safety issue, not recoverable.
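
The anti-oscillation rule can be sketched as a small hysteresis function. The enum and function are hypothetical; the "below half the threshold" demotion point is the one the text gives for lists, applied here generically for illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { EX_ENC_LISTPACK, EX_ENC_FULL } exEncoding;

/* Promote when the count exceeds `max_entries`; demote only once it drops
 * below half of it, so a key hovering at the limit does not flip-flop. */
static exEncoding exNextEncoding(exEncoding cur, size_t count, size_t max_entries) {
    assert(max_entries > 0);
    if (cur == EX_ENC_LISTPACK && count > max_entries) return EX_ENC_FULL;
    if (cur == EX_ENC_FULL && count < max_entries / 2) return EX_ENC_LISTPACK;
    return cur;
}
```

The gap between the two thresholds is the point: any code that assumes an encoding from the element count alone is wrong inside that gap.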

### Skiplist (`src/t_zset.c`)

Max level 32, p=0.25; algorithm standard. Layout: SDS embedded after the level array in `zskiplistNode` (single allocation). The header node reuses slots via unions (`score`/`length`, `backward`/`tail` share storage), so list `length` and `tail` live inside the header node - there is no separate `zskiplist` struct. Level-0 `span` on the header stores max level. Grep hazard: code assuming Redis's separate `length`/`tail` fields reads wrong values.

### vset (`src/vset.c`, `src/vset.h`)

Valkey-only. **Not a user-facing "vector set" command** - internal adaptive container for entries with expiry. Used by hash field expiry (`volatile_fields` on hash objects), `db->keys_with_volatile_items`, and `vsetRemoveExpired(...)` in `src/t_hash.c`. A vset is a tagged pointer with four backing shapes that change under you - any code storing a vset reference across a mutation must re-read via `vsetResolve`.

Low 3 bits encode bucket type: `VSET_BUCKET_NONE` (-1), `VSET_BUCKET_SINGLE` (0x1, raw tagged pointer - entry must be odd-aligned), `VSET_BUCKET_VECTOR` (0x2, sorted SIMD vector), `VSET_BUCKET_HT` (0x4), `VSET_BUCKET_RAX` (0x6). Masks: `VSET_TAG_MASK=0x7`, `VSET_PTR_MASK=~0x7`. Growth path: `NONE -> SINGLE -> VECTOR (<=127, sorted by expiry) -> RAX (time-bucket VECTORs) or HT (clustered expiry)`. RAX key = 8-byte big-endian bucket-end timestamp. Adaptive widths: `VOLATILESET_BUCKET_INTERVAL_MIN=16ms`, `VOLATILESET_BUCKET_INTERVAL_MAX=8192ms`. `vsetInitIterator` / `vsetNext` are NOT safe - no mutations during iteration; use `vsetRemoveExpired(max_count, ctx)` for bulk reclaim. Sort path stores `vsetGetExpiryFunc` in `_Thread_local current_getter_func` (qsort_r not portable). Grep hazard: `pVector` is vset-only (`{len:30, alloc:34, data[]}`, NEON `pvFind`), distinct from the generic `vector` primitive.
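
A minimal sketch of the tag/untag arithmetic for the container shapes, using the mask and tag values quoted above. The `EX_`-prefixed names and helpers are hypothetical, and the `SINGLE` case (a raw odd pointer) is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

#define EX_VSET_TAG_MASK ((uintptr_t)0x7)
#define EX_VSET_PTR_MASK (~(uintptr_t)0x7)
#define EX_VSET_BUCKET_VECTOR ((uintptr_t)0x2)
#define EX_VSET_BUCKET_HT ((uintptr_t)0x4)
#define EX_VSET_BUCKET_RAX ((uintptr_t)0x6)

/* Store a container tag in the low 3 bits of an 8-byte-aligned pointer. */
static void *exVsetTag(void *p, uintptr_t tag) {
    assert(((uintptr_t)p & EX_VSET_TAG_MASK) == 0);
    assert((tag & EX_VSET_PTR_MASK) == 0);
    return (void *)((uintptr_t)p | tag);
}

/* Which backing shape does this tagged pointer currently have? */
static uintptr_t exVsetTagOf(const void *tagged) {
    return (uintptr_t)tagged & EX_VSET_TAG_MASK;
}

/* Recover the real pointer before dereferencing. */
static void *exVsetUntag(void *tagged) {
    return (void *)((uintptr_t)tagged & EX_VSET_PTR_MASK);
}
```

Because a mutation can replace both the pointer and its tag, a cached untagged pointer says nothing about the current shape, which is why references must be re-read after every mutation.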

### Hash field entry (`src/entry.c`, `src/entry.h`)

Valkey-only runtime representation of one hash field/value pair with optional per-field TTL, used by `t_hash.c` when hash values are in `hashtable` encoding. The stringRef window (taxonomy item 4) lives here. Type 1 has an `SDS_TYPE_5` field with field+value both embedded and NO expiry (SDS_TYPE_5 has no aux bits). Type 2+ uses a larger SDS field whose aux bit encodes expiry presence via `sdsGetAuxBit` / `sdsSetAuxBit`; value is embedded inline or externalized via `entryUpdateAsStringRef`. `EMBED_VALUE_MAX_ALLOC_SIZE = 128` caps inline embedding.

- Entries live inside the hash's `hashtable` as the entry pointer; `t_hash.c` wires them via the hashtable's `entryGetKey` callback returning the field SDS.
- `entryUpdateAsStringRef` borrows a caller-owned buffer - callers MUST keep the buffer alive until the entry is freed or re-updated. Use `entryHasStringRef` on defrag/dismiss paths.
- **Read paths MUST NOT lazily reclaim expired fields.** Field TTL is filtered by `validateEntry` without deleting; active expire cycle owns reclamation, AOF/replication propagation, and keyspace notifications. Calling `dbReclaimExpiredFields` from HRANDFIELD / HGETALL is a bug.
- Access: `entryGetField`, `entryGetValue(*len)`, `entryGetExpiry`, `entryHasExpiry`, `entryHasStringRef`, `entryIsExpired(against commandTimeSnapshot)`, `entryMemUsage`, `entryDefrag(defragfn, sdsdefragfn)`, `entryDismissMemory`.

## Event loop, I/O threads, BIO, prefetch

### `ae` reactor

The `ae.c` API is unchanged from Redis. Valkey adds `custompoll` (`aeCustomPollProc *`, installed via `aeSetCustomPollProc`) which replaces `aeApiPoll` when set, and `poll_mutex` (`pthread_mutex_t`) paired with the `AE_PROTECT_POLL` flag so I/O threads can reach the poll path without corrupting the fired-event buffer. Any new path that can reach `aeApiPoll` from an I/O thread must set `AE_PROTECT_POLL` or install a `custompoll`. Use `getMonotonicUs()` (TSC-backed) for event timing, not `gettimeofday`. `aeMain` starts only after `initServer` / `clusterInit` return.

### Ignition/Cooldown scaling

`server.c:beforeSleep` calls `IOThreadsBeforeSleep(current_time)` (commits queued I/O jobs) and `afterSleep` calls `IOThreadsAfterSleep(numevents)` which runs the Ignition/Cooldown policy in `io_threads.c`. That policy samples main-thread CPU via `RUSAGE_THREAD` and compares against thresholds - NOT event counts. More workers are woken under load; idle extras are parked by locking their per-thread mutex. Main-thread utilisation uses `clock_gettime(CLOCK_THREAD_CPUTIME_ID)`, not wall clock - with I/O threads active, main busy-spins `beforeSleep -> epoll_wait(timeout=0) -> afterSleep` so wall-clock sampling reports ~100% regardless of real load.

- `ProcessingEventsWhileBlocked` is true during RDB load, AOF load, full-sync load, long scripts, long module commands. Active expire / timer work in this state must NOT set `el_iteration_active` or bump `stat_active_time` - the outer caller already counts.
- An inactive I/O thread parked in `pthread_mutex_lock(io_threads_mutex[id])` must NOT advance its active-time counter. Store `prev_work_start_time`, re-sample at cycle top, `atomic_fetch_add_explicit` only the delta between successful work loops.
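
For orientation, sampling per-thread CPU time with the POSIX clock named above looks roughly like this. The helper names are hypothetical and the sketch says nothing about the actual thresholds or sampling cadence in `io_threads.c`.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

/* CPU time consumed so far by the calling thread, in microseconds. */
static long long exThreadCpuUs(void) {
    struct timespec ts;
    int rc = clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
    assert(rc == 0);
    (void)rc;
    return (long long)ts.tv_sec * 1000000LL + ts.tv_nsec / 1000;
}

/* CPU microseconds burned between two samples; a parked thread yields ~0. */
static long long exCpuDeltaUs(long long prev_us, long long now_us) {
    return now_us > prev_us ? now_us - prev_us : 0;
}
```

The same delta-between-samples shape applies to the parked-worker rule: time spent blocked on the mutex must fall outside the window being added to the active-time counter.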

### Main/IO ownership

Main owns dispatch, keyspace, cluster state, and every `server.*` global. I/O threads do only socket I/O (`read` / `writev` / `aeApiPoll` / TLS `SSL_accept`) and object free. ACL evaluation, command-table lookup, key prefetch, and `processCommand` run on main. The handoff is `io_*_state` on the client spun with `memory_order_acquire`; no mutex, because IO threads are only active under load and spinning wins on latency.

| Field | Owner | Fence |
|-------|-------|-------|
| `c->flags` | main only | never crosses boundary |
| `c->read_flags` | guarded by `io_read_state` | state transitions acquire/release |
| `c->write_flags` | guarded by `io_write_state` | state transitions |
| reply-block payload | main until push, IO during `CLIENT_PENDING_IO`, main after `COMPLETED_IO` reset | `io_*_state` + `memory_order_acquire` spin |
| keyspace, cluster state, `server.*` | main only | never crosses boundary |

- State transitions: main `IDLE -> PENDING_IO`, worker `PENDING_IO -> COMPLETED_IO`, main `COMPLETED_IO -> IDLE`. Between `COMPLETED_IO` and the main-thread reset, neither side may touch the client. `waitForClientIO(c)` is the acquire-spin barrier.
- `c->flags` is NOT shared. Only `read_flags` / `write_flags` (guarded by `io_*_state`) are. New cross-boundary features encode into reply-block headers, never a new `c->flags` bit.
- `io_read_state` / `io_write_state` are `volatile`, NOT `_Atomic` - atomics measurably slow the main hot-path. Correctness relies on IO threads being active only under load and on explicit fences at state transitions. Do not "upgrade" to `_Atomic` without redoing the microbenchmark.
- Any mutation of shared structures read by I/O threads (replacing command-table entries, rewriting shared RESP string objects) must first call `drainIOThreadsQueue()`. `blocked_clients == 0` is NOT sufficient. Module unload must call `drainIOThreadsQueue()` AND `bioDrainWorker(BIO_LAZY_FREE)` before `dlclose`.
- If an IO-thread enqueue can fail (bounded queue full), every field mutated before the enqueue must be fully rolled back - including `block->last_header`, `buf_encoded`, `write_flags`, `io_write_state`. Partial rollback corrupts the next write.
- Dynamic `io-threads` resize must drain all queues and reset `active_io_threads_num = 1` before growing or shrinking. Going from `io-threads=1` to `>1` must still invoke `prefetchCommandsBatchInit` - `initIOThreads` short-circuits at threads==1 and skips prefetch setup.
- `io_threads_mutex[]` is NOT a shared-queue mutex - it's per-thread, taken only to park/unpark workers. Queues between main and IO are SPSC lock-free (`spscEnqueue` / `spscDequeueBatch`).
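
A generic single-producer/single-consumer ring, to make the "bounded queue can be full" and "no mutex" points concrete. This is a textbook sketch with hypothetical `ex`-prefixed names and a fixed ring size, not the `spscEnqueue` / `spscDequeueBatch` implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define EX_RING_SIZE 8

typedef struct exSpsc {
    void *slots[EX_RING_SIZE];
    _Atomic size_t head; /* advanced by the consumer only */
    _Atomic size_t tail; /* advanced by the producer only */
} exSpsc;

/* Producer side. Returns 0 when the ring is full so the caller can roll back. */
static int exSpscEnqueue(exSpsc *q, void *job) {
    assert(q != NULL);
    size_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&q->head, memory_order_acquire);
    if (tail - head == EX_RING_SIZE) return 0;
    q->slots[tail % EX_RING_SIZE] = job;
    /* Release: the slot write must be visible before the new tail. */
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
    return 1;
}

/* Consumer side. Pops up to `max` jobs into `out` and returns the count. */
static size_t exSpscDequeueBatch(exSpsc *q, void **out, size_t max) {
    assert(q != NULL && out != NULL);
    size_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);
    size_t n = tail - head;
    if (n > max) n = max;
    for (size_t i = 0; i < n; i++) out[i] = q->slots[(head + i) % EX_RING_SIZE];
    atomic_store_explicit(&q->head, head + n, memory_order_release);
    return n;
}
```

The `return 0` branch is the case the rollback rule above exists for: everything the caller mutated in anticipation of the handoff has to be undone before it retries or writes inline.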

### Atomics and memory ordering

`_Atomic` is a correctness marker for actually-shared fields with a defined producer/consumer; plain types stay plain. `memory_order_relaxed` is for uniqueness-only counters; synchronization flags need explicit acquire/release.

- Sync flags (`replica_bio_disk_save_state`, `replica_bio_abort_save`, pending-context pointers, TLS-reload pending) need explicit acquire/release, not relaxed.
- On 32-bit targets, plain `long long` / `double` are 4-byte aligned; `_Atomic` bumps to 8. Lock-free fields (e.g. `server.fsynced_reploff_pending`) must be declared `_Atomic(long long)` - parenthesized form - or the compiler may emit non-atomic load/store pairs.
- Config values set once in argument parsing and only read thereafter stay plain `int`.

### Lazyfree and BIO

BIO workers own the memory (and sometimes the connection) for a job's full lifetime; main releasing the pointer is not enough. Every teardown path must drain the relevant BIO queue before freeing surrounding state; producers must perform the accounting increment strictly before enqueue.

- Lazyfree accounting: producer `atomic_fetch_add_explicit(&lazyfree_objects, ..., memory_order_relaxed)` before `bioCreateLazyFreeJob`. BIO worker `atomic_fetch_sub_explicit(&lazyfree_objects, ...)` + `atomic_fetch_add_explicit(&lazyfreed_objects, ...)` after the actual free. Flipping the order lets the worker decrement before the producer has incremented, so the counter wraps.
- Handing a list/buffer to a BIO async-free job and nulling the main-thread pointer is NOT enough: the BIO thread still holds the pointer. Any main-thread write to the list before the BIO job runs is a UAF / data race.
- Teardown: `bioDrainWorker` BEFORE freeing surrounding state. "Free name -> bioDrainWorker" is a bug.
- TLS material reload runs on a BIO worker (`BIO_TLS_RELOAD`, Valkey-only). Shared state between main and BIO (pending SSL contexts, `tls_reload_pending`) must be `_Atomic` with explicit acquire/release. Before enqueueing, check whether a previous job is still pending; duplicate jobs race.
- BIO workers own the connection for the job lifetime (e.g. dual-channel RDB download). Main must not read/write that connection while the BIO job is in flight. Bulk payload transfer from BIO must use `connRecvTimeout` - main's busy-wait has no independent watchdog.
- Busy-waiting on main for a BIO worker (`bioPendingJobsOfType`) mirrors `waitForClientIO` and is accepted. Raw `bioDrainWorker` from main is hazardous unless the worker has a read timeout.
- BIO queue is `mutexQueue` (mutex + condvar FIFO), NOT lock-free. `mutexQueuePeek` is unsafe under multiple readers.
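
The lazyfree accounting order, reduced to its two halves. Counter and function names are hypothetical stand-ins, and the "enqueue" is simulated by returning the pointer; the sketch only shows which side touches the counters when.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

static _Atomic size_t ex_lazyfree_objects;  /* queued, not yet freed */
static _Atomic size_t ex_lazyfreed_objects; /* freed so far */

/* Producer (main): account first, then hand off. */
static void *exProducerSubmit(void *obj) {
    atomic_fetch_add_explicit(&ex_lazyfree_objects, 1, memory_order_relaxed);
    return obj; /* stand-in for enqueueing the job */
}

/* Worker (BIO): free first, then move the count from pending to freed. */
static void exWorkerRun(void *job) {
    free(job);
    size_t before = atomic_fetch_sub_explicit(&ex_lazyfree_objects, 1, memory_order_relaxed);
    assert(before > 0); /* would fire if the producer incremented after enqueue */
    (void)before;
    atomic_fetch_add_explicit(&ex_lazyfreed_objects, 1, memory_order_relaxed);
}
```

With an unsigned counter, a decrement that lands before the matching increment wraps to a huge value, which is how the ordering bug shows up in `INFO` output.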

### Shutdown, teardown, signal handlers

Shutdown is cooperative: IO workers poll a stop-flag (relaxed atomic read inside the loop) and BIO drains via `bioDrainWorker`. `makeThreadKillable()` was removed to fix jemalloc teardown, so `PTHREAD_CANCEL_ASYNCHRONOUS` is never used. `pthread_cancel()` only lands at explicit `pthread_testcancel()` in `IOThreadMain`.

- `pthread_cancel()` does NOT take `io_threads_mutex[id]`. Shutdown must unlock the mutex for `id >= server.active_io_threads_num` before `pthread_cancel + pthread_join`, or it deadlocks.
- Worker loop must contain a cancellation point, or `pthread_join` hangs forever.
- Crash-handler stack traces run in a signal handler - only async-signal-safe work inline. Symbolization forks a child; the parent runs a `waitpid(WNOHANG)` loop with a SIGKILL fallback.
- Lua VM is single-threaded. `FUNCTION FLUSH ASYNC` + `FUNCTION LOAD` races require the BIO worker to own its own `lua_State` (for `lua_close`) and main to create a fresh one. Sharing one `lua_State` across teardown and load crashes.

### Batch key prefetching (`src/memory_prefetch.c`)

Interleaves CPU prefetch instructions across multiple keys so one key's memory access overlaps with another's. Two callers run the SAME logic on DIFFERENT code paths - the IO-thread batch (`processClientsCommandsBatch`) and the pipelined single-client path (`prefetchCommandQueueKeys` in `networking.c`); a change in only one is almost always a bug. Prefetch, command lookup, and `processCommand` run on MAIN only - IO threads run read/parse only. Value-prefetch is skipped when copy-avoidance is active (threads >= `min-io-threads-avoid-copy-reply`, default 7). Both callers use `hashtableIncrementalFindStep`; a blocking `hashtableFind` defeats the interleaving. `onMaxBatchSizeChange` reallocates the static `PrefetchCommandsBatch` only when no work is in flight.
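
The interleaving idea in miniature: each lookup advances one step, prefetches the memory it will need next, and yields to the other lookups in the batch. This is a toy over linked chains with hypothetical names, not the hashtable incremental-find API; `__builtin_prefetch` is a GCC/Clang builtin.

```c
#include <assert.h>
#include <stddef.h>

typedef struct exNode {
    int key;
    struct exNode *next;
} exNode;

typedef struct exFind {
    const exNode *cur;    /* next node to inspect; NULL when exhausted */
    int key;              /* key being searched for */
    const exNode *result; /* set on a hit */
    int done;
} exFind;

/* One step: inspect the current node, prefetch the next one, then yield.
 * Returns 1 while this lookup still has work left. */
static int exFindStep(exFind *f) {
    if (f->done) return 0;
    if (f->cur == NULL) {
        f->done = 1;
        return 0;
    }
    if (f->cur->key == f->key) {
        f->result = f->cur;
        f->done = 1;
        return 0;
    }
    f->cur = f->cur->next;
    if (f->cur) __builtin_prefetch(f->cur);
    return 1;
}

/* Round-robin over the batch so each prefetch has other lookups' work to
 * hide behind before its memory is actually read. */
static void exFindBatch(exFind *batch, size_t n) {
    assert(batch != NULL || n == 0);
    int pending = 1;
    while (pending) {
        pending = 0;
        for (size_t i = 0; i < n; i++) pending |= exFindStep(&batch[i]);
    }
}
```

Replacing `exFindStep` with a loop that runs one lookup to completion gives the same results but serializes the cache misses, which is the effect a blocking find has on the real batch.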

### Event-loop / client-state invariants

- All `blockInUse` API entry points (bstate dict, `server.blocked_clients` counters, `server.unblocked_clients`, `inuse_key_to_clients`) run on main only. Background threads post a completion event - never call `unblockClientsInUseOnKey()` directly.
- `BLOCKED_INUSE` clients have their read handler detached from the event loop; EOF is no longer detected there. The cron path (`clientsCronTcpIsClosing`) probes the fd via a connection-type-specific `is_closing` hook (TCP_INFO-based `getsockopt` on Linux/macOS). Connection types without `is_closing` (unix sockets, RDMA, non-Linux TCP) cannot reap zombies this way.
- Failover's `disconnectOrRedirectAllBlockedClients` must NOT unblock `BLOCKED_INUSE` clients. The `bgIterator` owner is responsible for `unblockClientsInUseOnKey` on failover; `BLOCKED_INUSE` has no timeout and sending an error mid-command is unsafe.
- `c->flag.reprocessing_command` must be set/cleared symmetrically around unblock-driven re-execution. The reprocessing branch in `processCommand` must test this explicit flag, NOT infer from `c->cmd != NULL` - with IO-threaded parsing, `c->cmd` is pre-populated from `c->io_parsed_cmd`.
- `evictClients` in one iteration can re-observe the same client because `freeClient` returning 0 (async close) leaves it with `CLIENT_CLOSE_ASAP`. Eviction accounting must track already-counted clients explicitly.
- Bulk key deletion on main (slot purge on migration failure, `FLUSHSLOT`) must be offloaded to lazyfree; gated on cluster version so both sides agree on the replicated flush primitive.
- Cluster bus must not rely on synchronous socket I/O on main during steady-state failover/migration. Blocking in `receiveSynchronousResponse()` during `REPL_STATE_RECEIVE_PING_REPLY` stalls the event loop up to `repl_syncio_timeout` (default 5s).

## Networking and command dispatch

### I/O-thread offload and the client struct

Byte flow: socket -> `readQueryFromClient` -> `parseInputBuffer` -> `processInputBuffer` -> `processCommand` -> `call` -> `cmd->proc`. Entry points in `src/networking.c` and `src/server.c`. The IO thread and main thread share a `struct client` through a narrow, deliberately un-atomic handshake: `CLIENT_IDLE` -> `CLIENT_PENDING_IO` -> `CLIENT_COMPLETED_IO` -> `CLIENT_IDLE`. Main arms PENDING before handoff, worker transitions PENDING -> COMPLETED, only main resets COMPLETED -> IDLE. While `io_*_state == CLIENT_COMPLETED_IO` neither side touches the client.

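- `io_read_state` / `io_write_state` are `volatile`, NOT `_Atomic`. The worker's release fence sits BEFORE its state update and main's acquire fence sits AFTER observing the new state; if the state store is reordered ahead of the `read_flags` writes, main observes the new state without the accompanying `read_flags`, breaking MPSC signalling.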
- `cur_tid` (`uint8_t`) = IO thread currently owning the client. `cmd_queue` / `io_parsed_cmd` (`cmdQueue`) = parsed pipelined commands awaiting dispatch (per-command `read_flags` lives on the queue entry, not on `c`).
- `read_flags` / `write_flags` are the ONLY client fields safe to cross the main/IO boundary. `write_flags` is guarded by `io_write_state` (PENDING = IO owns, otherwise main owns).
- `waitForClientIO(c)` spins on `io_*_state == CLIENT_PENDING_IO` with `memory_order_acquire` until the worker hands back ownership.
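
The three-state handshake can be walked through single-threaded. The sketch below uses hypothetical names and the usual C11 pairing (release fence before publishing the state, acquire fence after observing it); treat the fence placement as an assumption to check against `io_threads.c`, not as a description of it.

```c
#include <assert.h>
#include <stdatomic.h>

typedef enum { EX_CLIENT_IDLE, EX_CLIENT_PENDING_IO, EX_CLIENT_COMPLETED_IO } exIoState;

typedef struct exClient {
    volatile exIoState io_read_state; /* volatile, not _Atomic, as in the text */
    int read_flags;
} exClient;

/* Main: hand the client to a worker. */
static void exMainArm(exClient *c) {
    assert(c->io_read_state == EX_CLIENT_IDLE);
    c->io_read_state = EX_CLIENT_PENDING_IO;
}

/* Worker: write results, fence, then flip the state. */
static void exWorkerComplete(exClient *c, int flags) {
    assert(c->io_read_state == EX_CLIENT_PENDING_IO);
    c->read_flags = flags;
    atomic_thread_fence(memory_order_release); /* results before state */
    c->io_read_state = EX_CLIENT_COMPLETED_IO;
}

/* Main: spin until the worker lets go, then read results and reset. */
static int exMainWait(exClient *c) {
    while (c->io_read_state == EX_CLIENT_PENDING_IO) {
        /* spin: only the worker may leave PENDING_IO */
    }
    atomic_thread_fence(memory_order_acquire); /* state before results */
    int flags = c->read_flags;
    c->io_read_state = EX_CLIENT_IDLE; /* only main resets COMPLETED -> IDLE */
    return flags;
}
```

Each transition has exactly one writer, which is what lets the design get away without a lock or a compare-and-swap.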

### Shared query buffer (`thread_shared_qb`)

`_Thread_local sds` in `src/networking.c`. On short reads `c->querybuf` aliases this buffer until `resetSharedQueryBuf(c)` detaches it. Any code holding `c->querybuf` across `processCommand` or `processEventsWhileBlocked` must detach first, or another client on the same I/O thread will mutate it.

### IO-thread dispatch (`src/io_threads.c`)

`postponeClientRead(c)` returns 1 and queues read for an I/O thread (0 = read inline). `trySendWriteToIOThreads(c)` offloads `writev`, snapshotting `io_last_reply_block` / `io_last_bufpos` to cap what the worker writes. `trySendPollJobToIOThreads()` hands `aeApiPoll` to a worker when Ignition is active. `trySendAcceptToIOThreads(conn)` offloads TLS accept when `CONN_FLAG_ALLOW_ACCEPT_OFFLOAD` is set.

### Command table is `hashtable`, not `dict`

`server.commands` and `server.orig_commands` are `hashtable *` via `hashtableCreate(&commandSetType)`. Command struct is `struct serverCommand`. JSON metadata in `src/commands/`; generator `utils/generate-command-code.py` emits C tables, CI validates no diff. Runtime-renamed commands retain both `fullname` (original) and `current_name` - logs, NOPERM errors, LATENCY output, and internal comparisons all depend on the original. When the command-table hashtable stores `robj *` with embedded key, the `hashFunction` / `keyCompare` callbacks must derive the key via `objectGetKey()`, not `objectGetVal()`.

### `-REDIRECT` during coordinated failover

During `CLUSTER FAILOVER` with `server.failover_state == FAILOVER_IN_PROGRESS`, `processCommand` and the blocked-client path in `src/blocked.c` branch on redirect capability:

- Redirect-capable: reply `-REDIRECT <primary_host>:<primary_port>`. `EXEC` also triggers `discardTransaction`; other in-multi commands get `flagTransaction` + `rejected_calls++`.
- Non-redirect: `blockPostponeClient` so the client resumes naturally once the replica is promoted.

This is the only reply with a `-` prefix baked into the payload rather than via `addReplyError`.
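
For reference, the wire form of the redirect reply can be sketched as below. `format_redirect_reply` is a hypothetical helper, not a Valkey function; the CRLF terminator is the standard RESP error framing, and the host is assumed to be a plain hostname or IPv4 literal.

```c
#include <stdio.h>
#include <stddef.h>

/* Formats "-REDIRECT host:port\r\n" into buf.
 * Returns the number of bytes written (excluding the NUL),
 * or -1 if buf is too small or formatting failed. */
int format_redirect_reply(char *buf, size_t cap, const char *host, int port) {
    int n = snprintf(buf, cap, "-REDIRECT %s:%d\r\n", host, port);
    if (n < 0 || (size_t)n >= cap) return -1;
    return n;
}
```

The `-` is part of the formatted payload here, which is the property the paragraph above calls out.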

### Command dispatch invariants

Command execution is main-thread-only. IO workers handle read+parse, `writev`, `aeApiPoll`, TLS `SSL_accept`, and object free. The wire stream is length-prefixed: once you emit a header, the client is committed, so any type-check or `lookupKeyRead` / `lookupKeyWrite` must complete BEFORE `addReplyArrayLen` / `addReplyMapLen` / `addReplyPushLen`. Emitting a header then switching to `addReplyError` desyncs.
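
The header-commit rule can be illustrated with a toy wire buffer. Everything in this sketch (`toy_obj`, `toy_wire`, `wire_append`, `reply_list_items`, the error text) is hypothetical and stands in for `robj`, the client reply buffer, and the `addReply*` family; it only shows the ordering, not Valkey's implementation.

```c
#include <stdio.h>
#include <string.h>

/* Toy stand-ins, hypothetical: not Valkey's robj or client reply buffer. */
typedef enum { TOY_STRING, TOY_LIST } toy_type;
typedef struct { toy_type type; const char **items; int len; } toy_obj;
typedef struct { char buf[256]; size_t used; } toy_wire;

/* Appends s to the wire buffer; silently drops on overflow (toy behavior). */
void wire_append(toy_wire *w, const char *s) {
    size_t n = strlen(s);
    if (w->used + n >= sizeof(w->buf)) return;
    memcpy(w->buf + w->used, s, n);
    w->used += n;
    w->buf[w->used] = '\0';
}

/* All checks that can fail run BEFORE the array header is emitted. */
void reply_list_items(toy_wire *w, const toy_obj *o) {
    char tmp[32];
    if (o == NULL || o->type != TOY_LIST) {
        wire_append(w, "-WRONGTYPE toy error\r\n");
        return; /* nothing committed yet, so an error reply is still legal */
    }
    snprintf(tmp, sizeof(tmp), "*%d\r\n", o->len);
    wire_append(w, tmp); /* header sent: exactly len items must follow */
    for (int i = 0; i < o->len; i++) {
        snprintf(tmp, sizeof(tmp), "$%zu\r\n", strlen(o->items[i]));
        wire_append(w, tmp);
        wire_append(w, o->items[i]);
        wire_append(w, "\r\n");
    }
}
```

Moving the type check below the header write would produce `*2\r\n-WRONGTYPE ...`, which a client parses as the first of two array elements and then waits forever for the second.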

The write path has a five-step ordering:

1. AOF / replica propagation (as `DEL` if the key was removed).
2. `signalModifiedKey` for WATCH + client-tracking invalidation.
3. `notifyKeyspaceEvent`.
4. `server.dirty++`.
5. Reply via `addReply*`, using `shared.czero` / `shared.cone` for integer replies.

- `signalModifiedKey` + `notifyKeyspaceEvent` fire BEFORE any `addReply*`. `addReply*` installs the client on the pending-write queue via `prepareClientToWrite`, so a module blocking on a keyspace notification must enter blocked state before any reply byte is queued. Use `initDeferredReplyBuffer` when reply-first-notify-second is unavoidable.
- `LOOKUP_NOTOUCH` reads `server.current_client->flag.no_touch`, NOT `server.executing_client` - `executing_client` is NULL when `handleClientsBlockedOnKeys` re-executes an unblocked client.
- Command-JSON `WRITE` flag is per-command, not per-invocation: set `WRITE` if ANY optional arg can mutate state. HGETEX is `WRITE` because of EX/PX/EXAT/PXAT/PERSIST.
- Keyspec flags (`RM`, `ACCESS`, `DELETE`, `RW`) are orthogonal to command flags. A key used as a condition to decide write-vs-no-write (SETNX, DELIFEQ) is NOT `ACCESS`; `ACCESS` requires that stored user data is returned, copied, or exposed.
- `server.dirty` delta after `call()` is a reliable read/write classifier. Post-hoc assertion only, not dispatch logic.

### I/O-thread offload invariants

IO threads parse, write, and poll; main dispatches. Only `read_flags` / `write_flags` (guarded by `io_*_state`) and a reserved flag byte in the reply-block header may cross - `c->flags` is NOT thread-safe. Decisions needing both a live config value and per-client state are made on main at `addReply` time and encoded into the reply-block header; deciding later in the IO thread races `CONFIG SET`.

- Parsed-command queue carries per-command state. `read_flags` on the queue entry records `READ_FLAGS_PREFETCHED`, `READ_FLAGS_BAD_ARITY`, parse errors, input byte counts. Main reads from the queue entry; an unchecked bad-arity flag silently executes with undefined keys.
- Key prefetch and command lookup run ONLY on main. IO path: `ioThreadReadQueryFromClient` -> `parseCommand` -> `processMultibulkBuffer`; main picks up in `processIOThreadsReadDone` -> `processPendingCommandAndInputBuffer` -> `processCommandAndResetClient` -> `processInputBuffer`. When main parses more input (e.g. first command was AUTH), re-check `canParseCommand` / `canProcessCommand`.
- COB SDS length is main-read-only. IO threads must not dereference queued reply SDS to measure length - races main's mutation.
- Outbound main-thread clients (replicationAuth, cluster MIGRATE source link) must not block on synchronous `connRead` / `connWrite` while main serves traffic.
- Allocator locality: allocate+free on one thread, pass only sds bytes. Allocating on main, freeing on IO breaks per-thread allocator locality.
- ACL evaluation is main-thread-only. IO thread captures raw identity material (TLS peer cert CN); `clientAcceptHandler` on main does the ACL lookup post-handshake.

### Key prefetching

`prefetchCommandQueueKeys(c)` warms CPU cache via `hashtableIncrementalFindState`. Config: `prefetch-batch-max-size` (default 16, range 0-128; 0 or 1 disables). IO-thread-path equivalent in `src/memory_prefetch.c`.

### Transport layer (`src/connection.c` / `.h`)

Transports are a vtable. Registered types: `CONN_TYPE_SOCKET`, `CONN_TYPE_UNIX`, `CONN_TYPE_TLS`, `CONN_TYPE_RDMA`. Socket and Unix always register; TLS and RDMA are build-gated (`BUILD_TLS`, `BUILD_RDMA`). Instances at `connTypes[CONN_TYPE_MAX]` with cached accessors `connectionTypeTcp()` / `connectionTypeTls()` / `connectionTypeUnix()`. Adding a transport: declare `static ConnectionType CT_Foo = { ... }`, define `struct foo_connection { connection c; /* fields */ }` with `connection` as the FIRST field, implement `read` / `write` / `accept` / `connect` / `addr` / `listen`, guard registration with the compile flag in `connTypeInitialize()`.

- Never dereference type-specific fields outside the `CT_<name>` implementation.
- IO-thread-aware transports implement `postpone_update_state` + `update_state`. IO threads must not touch `ae*` state directly; queue via `postpone_update_state`, apply from main via `update_state`.
- `CONN_FLAG_WRITE_BARRIER` inverts read-before-write for fsync-before-reply in AOF `always` mode. Custom `ae_handler` must honor it.
- TCP_NODELAY + SO_KEEPALIVE go on BOTH accept-side and connect-side sockets that participate in cluster-bus or replication. Applying only on accept leaves outbound sockets Nagled. Valkey does not use Nagle's algorithm anywhere.
- `CLUSTER SLOTS` and `MOVED` / `ASK` return TCP-vs-TLS ports based on the originating client's connection type - these are the ONLY two ports gossiped. RDMA gossips as the TCP port, so `rdma-port` must equal `port` in cluster mode.
- Non-obvious vtable slots: `has_pending_data` / `process_pending_data` (TLS buffered reads, RDMA completion queue) are called outside the normal event loop; `get_peer_cert` / `get_peer_user` are TLS-only and back `tls-auth-clients-user`; `connIntegrityChecked` returns 1 for TLS, 0 for plain socket.
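
The "`connection` as the FIRST field" requirement exists so that a `connection *` and the transport-specific pointer refer to the same address. A minimal sketch, assuming a heavily trimmed `connection` struct (the real definition lives in `src/connection.h`); `foo_connection`, `foo_session`, and `foo_from_conn` are hypothetical:

```c
#include <stddef.h>

/* Hypothetical, trimmed stand-in for the real connection struct. */
typedef struct connection { int fd; int flags; } connection;

struct foo_connection {
    connection c;    /* MUST be first: connection* and foo_connection* alias */
    int foo_session; /* type-specific field, private to the CT_Foo code */
};

/* Only code inside the CT_Foo implementation should perform this downcast. */
struct foo_connection *foo_from_conn(connection *conn) {
    return (struct foo_connection *)conn;
}

/* Compile-time guard: breaks the build if someone reorders the fields. */
_Static_assert(offsetof(struct foo_connection, c) == 0,
               "connection must be the first field");
```

The static assertion is cheap insurance: generic code only ever holds `connection *`, so a reordered struct would otherwise fail silently at runtime.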

### RDMA (`src/rdma.c`)

Linux-only, gated by `USE_RDMA` (`USE_RDMA=1` linked, `=2` loadable module). Completion-channel fd is POLLIN-only - `connRdmaEventHandler` must poll the CQ, dispatch results, re-register RX buffer on full (`connRdmaRegisterRx`), manually invoke write handlers. Global `pending_list` tracks connections with outstanding write handlers.

- Not fork-safe. `connRdmaAllowCommand` returns `C_ERR` when `server.in_fork_child != CHILD_TYPE_NONE`.
- Cannot coexist with MPTCP on the same outbound connection (current policy: assert).
- MPTCP negotiation is asymmetric: `repl-mptcp yes` on replica + `mptcp yes` on primary yields MPTCP; `repl-mptcp` is immutable at runtime.

## Memory: allocation, eviction, defrag, expiry

### Eviction (`src/evict.c`, `src/lrulfu.c`)

Sampling-based approximate LRU/LFU, 8 policies, 16-entry eviction pool, `performEvictions()` before any command that may allocate. Algorithm matches Redis.

- Field reclaim is the expiry cycle (FIELDS job), not eviction. Eviction sampling walks top-level keys only.
- `kvstoreHashtableSampleEntries` tolerates duplicates by design. Random-cursor sampling keeps sparse/high-churn keyspaces off O(N) per eviction; do not add a `seen` set to dedupe.

### Lazy free (`src/lazyfree.c`)

Background deallocation via BIO. `LAZYFREE_THRESHOLD = 64` elements gates the async path; `lazyfreeGetFreeEffort()` estimates per-object cost. The load-bearing divergence from Redis: all five `lazyfree-lazy-*` knobs ship as `yes` in Valkey (`eviction`, `expire`, `server-del`, `user-del`, `user-flush`), `no` in Redis. A lazyfree test that passes on Redis defaults likely never exercises the background path.

### Active defragmentation (`src/defrag.c`, `src/allocator_defrag.c`)

Jemalloc-required (`HAVE_DEFRAG` needs `USE_JEMALLOC` + experimental.utilization). Defrag runs on its own timer event (not inside `serverCron`); duty cycle `D = P * W / (100 - P)`. Stages iterate `db->keys`, `db->expires`, and the Valkey-only `db->keys_with_volatile_items`. `active-defrag-cycle-us` default **500** microseconds; `cycle-min` / `cycle-max` bound CPU percentage. Pauses during active child processes via `hasActiveChildProcess()`. `DEBUG_FORCE_DEFRAG` is a build knob for running without the jemalloc mallctl.

- Every user-data hashtable is in scope. Missing one (per-hash volatile set, `keys_with_volatile_items`, per-slot kvstores) causes permanent fragmentation visible only through `force-defrag`. Adding a new persistent hashtable = add a defrag stage with a `hasActiveChildProcess()` pause gate.
- Type/encoding guards must be `serverAssert`, not silent early-return. Wrong-type object is a memory-safety issue.
- Per-hit time budget, not per-N scans - check after every hit. The old per-512-defrag / per-64-scan asymmetry was a latency-target bug.
- Allocator-slab defrag trigger is dual: utilization below 1.125x global-average OR small-slab rescue (slab < 1/8 full).
- Defrag callbacks cannot bypass the supplied `defragfn`. Calling `activeDefragAlloc` directly strips threading and accounting.
- Defrag the `stringRef` container itself, not only its buffer.
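
The duty-cycle formula `D = P * W / (100 - P)` quoted above can be worked through numerically. The symbol meanings are an assumption here, since the text does not define them: `P` is taken as the target CPU percentage, `W` as the wait between defrag runs in microseconds, and `D` as the resulting run duration, which follows from requiring `D / (D + W) = P / 100`. `defrag_duty_cycle_us` is a hypothetical helper name.

```c
/* D = P * W / (100 - P), derived from D / (D + W) = P / 100.
 * Returns 0 for inputs where the formula is undefined or meaningless. */
long defrag_duty_cycle_us(int target_cpu_pct, long wait_us) {
    if (target_cpu_pct <= 0 || target_cpu_pct >= 100 || wait_us < 0) return 0;
    return (long)target_cpu_pct * wait_us / (100 - target_cpu_pct);
}
```

Under that reading, a 25% target with a 1500 microsecond wait gives a 500 microsecond run, because 500 / (500 + 1500) = 25%.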

### Expiry (`src/expire.c`, `src/db.c`)

Lazy (`expireIfNeeded` on access) plus active (`activeExpireCycle`). Valkey adds per-field hash TTL, and the governing rule is: **read paths MUST NOT call `dbReclaimExpiredFields`. The active expire cycle owns reclamation, AOF/replication propagation, and keyspace notifications.** HGET, HRANDFIELD, HGETALL, HSCAN, HEXISTS, HLEN, HKEYS, HVALS skip expired fields via `validateEntry` but never propagate HDEL, never fire `hexpired`, and never mutate. If a read handler appears to need cleanup, raise the random-probe cap or add a bounded validated-scan fallback - do not reclaim.

Active expire runs only when `!server.import_mode && iAmPrimary()`. Replicas apply HDEL / `hexpired` strictly from the replication stream; replica-side expiry side effects are a correctness bug. `activeExpireCycle` carries two job types - KEYS and FIELDS - alternating priority each tick. `dbReclaimExpiredFields()` removes expired fields, propagates `HDEL`, fires `hexpired`, and deletes the parent key if the hash becomes empty. `active-expire-effort` (1-10) scales `keys_per_loop`, `ACTIVE_EXPIRE_CYCLE_ACCEPTABLE_STALE` (default 10%), and cycle time budget.

- `dbSetValue(overwrite=1)` on a hash with per-field TTL must call `dbUntrackKeyWithVolatileItems` on the old object before freeing, then `dbTrackKeyWithVolatileItems` on the new. SET / HSET / BITOP-dst / RENAME / RESTORE / MOVE / SORT-STORE / GETSET over an existing hash all go through this seam; missing the untrack leaves the active-expire cycle holding a stale pointer.
- TTL-setting commands rewrite relative TTLs to absolute PXAT before propagation. HEXPIRE / HPEXPIRE / HSETEX and SET EX/PX/EXAT rewrite argv before AOF/replication so replica and AOF-replay lifetime matches the primary regardless of application delay.
- Expired-in-past writes: primary propagates as UNLINK, not the original command. No non-import-mode node ever stores a negative absolute expire; import-mode must clamp to `[0, LLONG_MAX]`. HSET/HINCRBY over an expired-but-unreclaimed field emits HDEL on the stream before the user write. HSETEX KEEPTTL over an implicitly-expired field suppresses the KEEPTTL propagation and emits an explicit HDEL first. HSETEX FXX/FNX/NX/XX rejections do not propagate. HSETEX with zero fields written cannot leave an empty hash key behind. New write-path code surfacing an expiry goes through `deleteExpiredKeyAndPropagate`, not a bare `dbDelete` + notify.
- A single command may emit multiple KSN events (HSET can emit `hset` + `hexpire` + `hexpired` + `del`). `expire` fires at set-time with positive future timeout; `expired` fires when the key is actually removed. EXPIRE with past/negative goes through the expiration path (fires `expired`, increments `expired_keys`), not DEL.
- RDB type byte differs by TTL presence: HSETEX-loaded hashes serialize as `0x16` (`RDB_TYPE_HASH_2`); HPERSIST rewrites to `0x04`. `rdbSaveObject` byte counts must match across these transitions. RDB load of expired fields does not drop silently - `valkey-check-rdb` and RESTORE pass `now=0`. When `rdbLoadObject` does drop already-expired fields, the primary must propagate an explicit HDEL. `RDB_LOAD_ERR_ALL_ITEMS_EXPIRED` is the dedicated marker - do not reuse `empty_keys_skipped`.
- Writable-replica expire tracking leaks on promotion: keys a replica wrote acquire TTL via `replicaKeysWithExpire`; cleanup happens in the active expire cycle after role transition.
- Active expire cycles under `ProcessingEventsWhileBlocked` must not set `el_iteration_active` - the outer iteration already accounts.
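
The relative-to-absolute rewrite and the expired-in-past rule from the bullets above can be sketched as one decision function. All names (`ttl_unit`, `ttl_action`, `rewrite_relative_ttl`) are hypothetical, and treating `deadline <= now` as already expired is an assumption, not something the text states.

```c
#include <limits.h>

typedef enum { UNIT_SECONDS, UNIT_MILLISECONDS } ttl_unit;  /* hypothetical */
typedef enum {
    PROPAGATE_PXAT,   /* rewrite argv to the absolute ms deadline */
    PROPAGATE_UNLINK, /* deadline already passed: propagate removal instead */
    REJECT_OVERFLOW   /* deadline not representable */
} ttl_action;                                               /* hypothetical */

/* Converts a relative TTL to an absolute millisecond deadline and picks the
 * wire form. now_ms is assumed to be non-negative. */
ttl_action rewrite_relative_ttl(long long now_ms, long long ttl, ttl_unit unit,
                                long long *deadline_ms) {
    if (unit == UNIT_SECONDS) {
        if (ttl > LLONG_MAX / 1000 || ttl < LLONG_MIN / 1000) return REJECT_OVERFLOW;
        ttl *= 1000;
    }
    if (ttl > 0 && now_ms > LLONG_MAX - ttl) return REJECT_OVERFLOW;
    *deadline_ms = now_ms + ttl;
    return (*deadline_ms <= now_ms) ? PROPAGATE_UNLINK : PROPAGATE_PXAT;
}
```

Because the replica and AOF replay see only the absolute deadline, a ten second TTL applied three seconds late still expires at the primary's instant.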

## Persistence and replication

Five writer classes for propagation/ACL/stats code: **normal** (`current_client` set, `call()` + `alsoPropagate`); **synthetic** (cron, `current_client == NULL` - `activeExpireCycle`, `delKeysInSlot`, topology updates, module cron, HFE cleanup); **import-mode** (`server.import_mode`, active expire off, stream-driven); **replica-local** (writable-replica direct write, `replicaKeysWithExpire`); **AOF-replay** (fake client, `executing_client` set, `current_client` may be NULL, routed through `mustObeyClient()`).

### RDB (`src/rdb.c`, `src/rdb.h`)

Version-gated: `RDB_VERSION = 80`, 9-byte magic `VALKEY080`, legacy `REDIS0011` still accepted on load. Versions 12-79 are the foreign range (`RDB_FOREIGN_VERSION_MIN = 12`, `MAX = 79`) rejected by default via `rdbIsForeignVersion` - blocks Redis CE 7.4+ RDB unless overridden. Cross-version moves need a stepping stone: a 9.0 RDB is unreadable by 7.2/8.0 at signature check; 8.1 + `rdb-version-check=relaxed` is the only bridge.
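
A sketch of the signature check described above. The layout (`REDIS` plus four digits, `VALKEY` plus three digits) is inferred from the two example magics in the text, and `rdb_magic_version` / `rdb_is_foreign_version` are hypothetical helpers rather than the functions in `src/rdb.c`.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define RDB_FOREIGN_VERSION_MIN 12
#define RDB_FOREIGN_VERSION_MAX 79

/* Parses a 9-byte magic such as "REDIS0011" or "VALKEY080".
 * Returns the decimal version, or -1 if the signature is not recognized.
 * The input does not need to be NUL-terminated. */
int rdb_magic_version(const char *magic) {
    size_t prefix;
    int version = 0;
    if (memcmp(magic, "REDIS", 5) == 0) prefix = 5;
    else if (memcmp(magic, "VALKEY", 6) == 0) prefix = 6;
    else return -1;
    for (size_t i = prefix; i < 9; i++) {
        if (!isdigit((unsigned char)magic[i])) return -1;
        version = version * 10 + (magic[i] - '0');
    }
    return version;
}

/* True for versions in the range rejected by default. */
int rdb_is_foreign_version(int version) {
    return version >= RDB_FOREIGN_VERSION_MIN && version <= RDB_FOREIGN_VERSION_MAX;
}
```

With these, `REDIS0011` parses to 11 (accepted), `REDIS0012` parses to 12 (foreign), and `VALKEY080` parses to 80 (current).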

Opcode vs type handling: opcodes 245-255 are read-and-ignore on unknown; type range 1-22 unknown is a hard fail; bytes 23-244 must error distinctly from unknown 245-255. Forward-compatible hints use an opcode; a new object type does not. `RDB_OPCODE_SLOT_INFO` is read-and-ignore - parse via `rdbLoadLen`; the per-slot AUX (`keys`, `expires`, `keys_with_volatile_items`) drives `kvstoreHashtableExpand` for all three.

- Validate signature + version before `emptyDb()` on full sync. `RDB_INCOMPATIBLE` (pre-flush, preserves dataset) must be distinguished from `RDB_FAILED` (post-flush). Callers (`replicaLoadPrimaryRDBFromSocket`, `rdbLoad`, `rdbLoadRio`, `VM_RdbLoad`, `debug.c`) propagate `RDB_INCOMPATIBLE`.
- Length fields from AUX / RESIZEDB / SLOT_INFO must be range-validated before driving allocations. CRC64 covers transport corruption only.
- `RDB_TYPE_HASH_2 = 22` (0x16) encodes a hash with per-field TTL; `HPERSIST` rewrites to `0x04`. New RDB types gate at both ends: `rdbGetObjectType` returns -1 (skip) when target rdbver is too old; unknown type on decode aborts. No silent substitution.
- If all fields of a hash expire during load, skip the key via `RDB_LOAD_ERR_ALL_ITEMS_EXPIRED` and do NOT fire keyspace notifications. RESTORE deliberately keeps expired fields (intentional asymmetry). Dropping expired fields during load propagates HDEL to replicas.
- Aux fields forward-compatible by layout. Both `valkey-ver` and `redis-ver` are written; loaders recognize either. Modules using `auxsave2` must guard with an explicit "will save" check.
- In-progress slot-migration imports must be included in RDB via `HASHTABLE_ITER_INCLUDE_IMPORTING`; key counts combine `kvstoreSize + kvstoreImportingSize`. Client-facing reads still hide importing keys.
- RDB/AOF files open with mode 0666 so `server.umask` controls final mode. Hardcoding 0644 breaks backup/restore under a different user.
- DUMP framing: payload + 1 type byte + 2 RDB-version bytes + 8 CRC64 bytes (11 bytes overhead).
- `rdbSaveObject` returns 0/-1, NOT a byte count. Compression kicks in at 20+ bytes; benchmarks exercising RDB payload must use values above that threshold (512 bytes is standard).
- Return codes: `RDB_OK`, `RDB_NOT_EXIST`, `RDB_INCOMPATIBLE`, `RDB_FAILED`.
- Diskless sync terminates the socket stream with a 40-byte random hex EOF marker, verified on receive. `repl-diskless-load=flush-before-load` (8.1) opts into pre-load flush.
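
The DUMP framing bullet above (11 bytes of overhead) can be made concrete with a splitter. The field order is an assumption carried over from the Redis-lineage layout, not something stated in the text: type byte first, then the body, then a little-endian 2-byte version, then 8 CRC64 bytes. `dump_view` and `dump_split` are hypothetical, and the CRC is located but not verified.

```c
#include <stddef.h>
#include <stdint.h>

#define DUMP_FOOTER_LEN 10 /* 2 version bytes + 8 CRC64 bytes */
#define DUMP_OVERHEAD   11 /* footer + 1 leading type byte */

typedef struct {
    uint8_t type;           /* RDB type byte */
    const uint8_t *payload; /* serialized object body */
    size_t payload_len;
    uint16_t rdb_version;   /* decoded from the footer */
    const uint8_t *crc64;   /* 8 raw bytes, NOT verified here */
} dump_view;

/* Splits a DUMP blob into its parts. Returns 0 on success, -1 if too short. */
int dump_split(const uint8_t *buf, size_t len, dump_view *out) {
    if (buf == NULL || out == NULL || len < DUMP_OVERHEAD) return -1;
    const uint8_t *footer = buf + (len - DUMP_FOOTER_LEN);
    out->type = buf[0];
    out->payload = buf + 1;
    out->payload_len = len - DUMP_OVERHEAD;
    out->rdb_version = (uint16_t)(footer[0] | (footer[1] << 8));
    out->crc64 = footer + 2;
    return 0;
}
```

A length check of this kind belongs before any field access; a blob shorter than 11 bytes cannot contain a valid footer.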

### AOF (`src/aof.c`)

AOF load runs through a fake client (`CLIENT_ID_AOF`) exempted from cluster-slot rejection via `mustObeyClient()`. The AOF loader, primary replication link, and import-side slot-migration client all share that exemption. Because the loader uses `executing_client` rather than `current_client`, cross-cutting logic (notifications, stats, ACL, keyspace events) must null-guard `current_client` or resolve via `executing_client`.

- AOF preamble load applies the `rdbSaveInfo`. If rsi is invalid, free `repl_backlog` to avoid an assert during cluster failover.
- AOF preamble is detected by reading exactly 6 bytes matching literal `REDIS` or `VALKEY` followed by 3-digit zero-padded RDB version. Buffer is not NUL-terminated. Absence is not an error.
- New optional args in minor versions rewrite argv before propagation. HSETEX NX/XX (9.1) must strip/normalize so older replicas can replay.
- Fake-client AOF validation uses `commandCheckExistence` + arity check. AOF corruption is data-integrity, not best-effort. PSYNC-from-AOF is unsupported.
- AOF-rewrite/reload tests with hash-field expirations must `DEBUG SET-ACTIVE-EXPIRE 0` before `debug loadaof`.

### Replication (`src/replication.c`)

Source uses `primary` / `replica` as the canonical names; `master` / `slave` are aliases kept in INFO fields, config aliases, and a few error strings. The core write-path contract: every expiry-setting command (SET EX/PX/EXAT, HEXPIRE, HSETEX, HPEXPIRE) rewrites argv to PXAT before AOF/replication so replay lifetime is absolute ms. Ops classify as: **as-is** (normal, one `alsoPropagate` per command); **as-DEL** (net effect is key removal - expired-on-arrival, cleanup during overwrite, HFE reaching zero fields; wire form MUST be DEL); **as-HDEL** (already-expired-but-unreclaimed hash fields: emit HDEL first, then the user write - HINCRBY, HSETEX KEEPTTL, HINCRBYFLOAT; HINCRBYFLOAT always replicates as HSET of the final value plus field TTL); **as-expire-update** (absolute PXAT rewrite); **suppressed** (conditional rejections - HSETEX NX/XX/FNX/FXX that rejects the write streams nothing).

- Active expire is primary-only: `activeExpireCycle` runs only under `!server.import_mode && iAmPrimary()`. Replicas and import-mode nodes receive expirations from the stream only.
- Blocked writers during role change disconnect or redirect synchronously. On primary-to-replica transition (`replicationSetPrimary`), write-producing blocked clients must be flushed before the replication stream reopens.
- Multi-DB sub-streams inject SELECT at boundaries (atomic slot migration substream, synthetic AOF).
- Replica read loop is bounded per event by `repl-max-reads-per-io-event` (hardcoded 25 in 9.0). `shouldRepeatReadFromPrimary` must precede `beforeNextClient` to avoid UAF ordering.
- Writable replicas: locally-written keys can type-collide with inbound writes and stall the stream. TTLs in `replicaKeysWithExpire` are cleaned on role-promotion via the active expire cycle.
- Coordinated FAILOVER: primary treats `PSYNC_FULLRESYNC` identically to `PSYNC_CONTINUE` - both clear failover state across sync strategies including dual-channel. Writes during `FAILOVER_IN_PROGRESS` get `-REDIRECT host:port`.
- `ProcessingEventsWhileBlocked` windows skip main-thread accounting. `server.el_start` is not set in `afterSleep` when the flag is true.
- `propagation-error-behavior` governs replication, not RDB load. Unknown-command handling on a replica defaults to `ignore`.
- CLUSTER REPLICATE classifies as user-initiated flush - bucket via `lazyfree-lazy-user-flush`.
- `pending_repl_data` is freed on the bio/lazyfree thread via `freePendingReplDataBufAsync`.

### Dual-channel replication

Grep `dualChannel` in `src/replication.c`. Full resync uses two TCP connections so the replica buffers streaming writes locally instead of the primary buffering per-replica. The BIO thread owns the primary socket for the RDB download window; the main thread takes over for the load. The two must never write concurrently (TLS session state corruption). Main-thread busy-wait on BIO is accepted, bounded by `connRecvTimeout`. Config: `dual-channel-replication-enabled` (default `no`, replica-side); local buffer block size ~`repl-backlog-size / 16`.

Replica RDB-channel state machine (`REPL_DUAL_CHANNEL_*`), seven-step happy path, ordering load-bearing:

1. `SEND_HANDSHAKE`
2. `RECEIVE_AUTH_REPLY`
3. `RECEIVE_REPLCONF_REPLY`
4. `RECEIVE_ENDOFF`
5. `RDB_LOAD`
6. `RDB_LOADED`
7. `dualChannelSyncSuccess` streams `pending_repl_data`, calls `replicationResurrectProvisionalPrimary`, `replicationSteadyStateInit`, `replicationSendAck`.

`dualChannelFullSyncWithPrimary` drives the machine; `PSYNC_FULLRESYNC_DUAL_CHANNEL = 6`. Fallbacks: primary without capa does single-channel; RDB-channel failure falls back to single-channel; partial resync is single-channel only.

- Replica-side `pending_repl_data` is deliberately uncapped - replica buffers main-channel bytes while loading RDB; primary's COB for that replica grows symmetrically. Accounting lives in `server.pending_repl_data.mem/len/peak`, NOT in `mem_total_replication_buffers` - surface via `mem_replica_dual_channel_buffer` INFO field. **`pending_repl_data.len` tracks bytes consumed, not allocated capacity** - reading it as buffer size mis-accounts, and the `.len += X` must be `+= nread` (the bytes actually read into the block), not `+= tail->size` (capacity) or `+= tail->used` (zero at allocation time).
- Back-pressure: if `pending_repl_data` exceeds `client-output-buffer-limit replica` hard limit, the replica stops reading the main channel.
- RDB bytes account via `child_info` pipe, not client write buffer. Dual-channel needs its own `child_info` field.
- Abort cleanup is unconditional. `replicationAbortDualChannelSyncTransfer` must close `repl_transfer_fd` AND unlink `repl_transfer_tmpfile` on every exit path.
- If `replicaof no one` runs mid-load, `protected_rdb_channel` must be released so the shutdown replica-catch-up grace can complete.
- BIO-thread RDB receive preserves the two-loop payload read. Outer loop consumes primary pings (newline bytes) before bulk length is known (refreshes `last_io`); inner loop drains until EOF marker. Collapsing breaks primary timeout handling.
- Dual-channel + atomic slot migration has a known full-sync vs AOF-reload race.
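
The `pending_repl_data.len` accounting rule from the first bullet above, as a toy model. `toy_block`, `toy_pending`, and `pending_append` are hypothetical stand-ins; only the `+= nread` rule comes from the text.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for a buffer block and the pending-data counters. */
typedef struct { size_t size; size_t used; char buf[64]; } toy_block;
typedef struct { size_t len; } toy_pending;

/* Copies up to nread bytes into the tail block and accounts for them.
 * Returns the number of bytes actually stored. */
size_t pending_append(toy_pending *p, toy_block *tail, const char *src, size_t nread) {
    size_t room = tail->size - tail->used;
    if (nread > room) nread = room;
    memcpy(tail->buf + tail->used, src, nread);
    tail->used += nread;
    /* Account bytes actually read. Adding tail->size would count capacity;
     * adding tail->used sampled at allocation time would add zero. */
    p->len += nread;
    return nread;
}
```

After two reads of 5 and 3 bytes into a 64-byte block, `len` is 8, not 64 and not 0.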

### Fork machinery (`src/bio.c`, fork wrappers)

At most one background child at a time, discriminated by `server.child_type`. Cancellation sends SIGUSR1 after matching child_type, deferring reap to `checkChildrenDone` (`killRDBChild`, `killAppendOnlyChild`, `TerminateModuleForkChild`, `killSlotMigrationChild`). Fork-coalesce on the hot path: full resync, BGSAVE, AOF rewrite, slot export should coalesce with an in-flight save rather than start a new child.

- Copy-on-write accounting is per-child-type. Replication-child byte counts (dual-channel RDB) use `repl_output_bytes` sourced from `rio->processed_bytes`, not `stat()` on a transient file.
- `valkey_fork` wrapper preserves `errno` across fork. Save after `fork()`, restore after post-fork hooks. `restartServer()` must not close arbitrary FDs at shutdown - closing FDs owned by LTTng-UST corrupts tracer state.
- `dismissSds` / `dismissMemory` only help on > page-size allocations. Freeing small strings in a memory-pool bin can dirty the page. `dismissObject` is the entry point post-fork.
- Synthetic writers crossing fork boundaries must honor `iAmPrimary()`.
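
The errno-preservation pattern described for the fork wrapper can be sketched with injectable callbacks so it is testable without forking. This is not `valkey_fork` itself: `fork_preserving_errno`, `fork_fn`, `hook_fn`, and the two stubs are hypothetical, and a real caller would pass `fork` as the first argument.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

typedef pid_t (*fork_fn)(void);  /* hypothetical: fork() or a test double */
typedef void (*hook_fn)(pid_t);  /* hypothetical: post-fork bookkeeping */

/* Calls do_fork, runs the post-fork hook, and guarantees the caller sees
 * the errno that the fork call itself produced. */
pid_t fork_preserving_errno(fork_fn do_fork, hook_fn post_hook) {
    pid_t pid = do_fork();
    int saved = errno;                      /* save right after the fork call */
    if (post_hook != NULL) post_hook(pid);  /* hooks may clobber errno */
    errno = saved;                          /* restore after the hooks */
    return pid;
}

/* Test doubles: a fork that fails with EAGAIN and a hook that resets errno. */
pid_t failing_fork_stub(void) { errno = EAGAIN; return -1; }
void noisy_hook_stub(pid_t pid) { (void)pid; errno = 0; }
```

Without the save and restore, a caller checking `errno` after a failed fork would see whatever the last hook left behind.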

## High availability (cluster + Sentinel)

Base algorithms (16,384-slot hash cluster, gossip, MOVED/ASK, bus port = port+10000, PFAIL -> FAIL quorum, Raft-like Sentinel election, SDOWN/ODOWN/TILT) match Redis.

### Cluster shape

Cluster mode is fully multi-DB: `getKVStoreIndexForKey()` routes by slot, `selectDb()` works in cluster mode, migrations transfer all DBs. Topology state is split between `clusterState.shards` (`shard_id -> list(clusterNode)`) and `clusterState.slot_migration_jobs` which coexists with legacy `migrating_slots_to` / `importing_slots_from` - queries must consult both.

- `availability-zone` is a free-form SDS propagated via gossip and surfaced in `CLUSTER SHARDS` / `CLUSTER SLOTS`. Node field: `clusterNode.availability_zone`.
- `cluster-config-save-behavior` (default `sync`) controls `nodes.conf` persistence.

### Cluster bus

Wire format is **big-endian** across the board - all numeric fields in `clusterMsg` / `clusterMsgLight` and gossip/ext/aux scalars require byte-order conversion around memcpy (`htons`/`ntohs` for 16-bit fields, `htonl`/`ntohl` for 32-bit). The two message structs have `data` at different byte offsets; receive-side code must pick the right cast before dereferencing. Extension-support tracking lives on `clusterLink`, not `clusterNode` - during handshake the sender-node lookup returns NULL until the node is added.

- Ports gossiped: TCP + TLS only. RDMA requires `rdma-port == port`. MOVED/ASK/CLUSTER SLOTS return ports by the originating client's connection type.
- Cluster bus byte counters are per-category. Any new `CLUSTERMSG_TYPE_*` requires a `clusterBusAddNetworkBytesByType` call at both send and receive.
- Light-header support is per-peer. 8.0 used the bit for LIGHT_PUBLISH; 9.0 reuses for LIGHT_HDR_MODULE. Any new LIGHT_HDR_* bit must be checked for collision against 8.x before shipping.
- Module sender_id is fixed `CLUSTER_NAMELEN = 40` bytes: NOT null-terminated pre-8.1, null-terminated from 8.1+.
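
Two of the cluster-bus rules above lend themselves to small helpers: big-endian scalar access through `memcpy`, and handling a fixed-width 40-byte name that may or may not be NUL-terminated. The helper names are hypothetical; `CLUSTER_NAMELEN` is taken from the text.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

#define CLUSTER_NAMELEN 40

/* Reads a big-endian 16-bit field without assuming alignment. */
uint16_t wire_read_u16(const unsigned char *p) {
    uint16_t v;
    memcpy(&v, p, sizeof(v));
    return ntohs(v);
}

/* Reads a big-endian 32-bit field without assuming alignment. */
uint32_t wire_read_u32(const unsigned char *p) {
    uint32_t v;
    memcpy(&v, p, sizeof(v));
    return ntohl(v);
}

/* Writes a host-order 16-bit value as big-endian bytes. */
void wire_write_u16(unsigned char *p, uint16_t host) {
    uint16_t v = htons(host);
    memcpy(p, &v, sizeof(v));
}

/* Copies a fixed-width name into a 41-byte buffer and always terminates it,
 * so the result is safe whether or not the sender terminated the field. */
void copy_node_name(char dst[CLUSTER_NAMELEN + 1], const char *src) {
    memcpy(dst, src, CLUSTER_NAMELEN);
    dst[CLUSTER_NAMELEN] = '\0';
}
```

Going through `memcpy` rather than a pointer cast also sidesteps unaligned access, which matters because the two message structs place `data` at different offsets.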

### Failover

Two paths. Classic replica-initiated election matches Redis. Coordinated failover (`FAILOVER` command, also Sentinel's coordinated branch) is primary-driven: primary pauses writes, waits for replica catch-up, swaps roles atomically - `sentinelFailoverSendFailover` (coordinated) vs `sentinelFailoverSendReplicaOfNoOne` (classic) in `src/sentinel.c`. All cluster failover paths live in `src/cluster_legacy.c`.

- `failover_auth_time` must reset to 0 on abort/restart of a replica election; stale timer leaves the replica stuck waiting for votes.
- Rank-convergence delay (500 ms + random) exists for offset broadcasting, not FAIL-propagation.
- Primary sends full ~4 KB UPDATE back to the replica on failover-auth denial; otherwise the replica cannot recover.
- `CLUSTER REPLICATE NO ONE` vs `REPLICAOF NO ONE` diverge by design. Cluster variant must `emptyData` and move the node into a fresh shard (still a cluster member); standalone variant keeps data.
- Slot-bitmap comparisons in auth paths are endian-sensitive. Tests for `clusterSendFailoverAuthIfNeeded` must include a non-64-bit-aligned slot boundary (e.g. slot 12287).
- Cluster-aware lazyfree re-inits `slotToKeyInit` after async flush - `slotToKeyFlush` alone is not enough.
- `clusterSetNodeAsPrimary` + `replicationUnsetPrimary` are paired atomically.
- `cluster-manual-failover-timeout`: 1 to INT_MAX; `now + timeout*PAUSE_MULT` overflows at large values. Write-pause capped at `CLUSTER_MF_TIMEOUT = 500 ms` regardless. `server.cluster_mf_timeout` (default 5000 ms) is configurable; Redis hardcodes it.
- Gossip `shard_id` propagation: replica adopts primary's shard_id, not the reverse.
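
The `now + timeout*PAUSE_MULT` overflow noted above can be avoided by widening before multiplying and saturating the sum. This is one possible guard, not the fix Valkey ships: `MF_PAUSE_MULT` is a hypothetical stand-in with an assumed value of 2, and `mf_pause_deadline` is a hypothetical helper that assumes a non-negative timeout.

```c
#include <limits.h>

#define MF_PAUSE_MULT 2 /* hypothetical stand-in; the value is an assumption */

/* Computes now_ms + timeout_ms * MF_PAUSE_MULT without overflow,
 * saturating at LLONG_MAX. timeout_ms is assumed to be >= 0. */
long long mf_pause_deadline(long long now_ms, int timeout_ms) {
    /* Widen BEFORE multiplying: INT_MAX * 2 does not fit in int. */
    long long span = (long long)timeout_ms * MF_PAUSE_MULT;
    if (now_ms > LLONG_MAX - span) return LLONG_MAX;
    return now_ms + span;
}
```

With a config range of 1 to `INT_MAX`, the multiplication is the step that overflows in `int` arithmetic, so the cast is the important part of the sketch.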

### Slot migration

Two mechanisms coexist. Traditional `CLUSTER SETSLOT IMPORTING/MIGRATING/STABLE` + per-key `MIGRATE` is the Redis-compatible path. Atomic server-driven `CLUSTER MIGRATESLOTS` (new in Valkey, `src/cluster_migrateslots.c`) forks a child to ship the whole slot range as a mini-RDB. `getNodeByQuery()` in `src/cluster.c` is the single source of truth for redirect decisions.

MOVED vs ASK: MOVED = "slot permanently elsewhere, update your map"; ASK = "this one key is migrating, follow once with ASKING". ASK never invalidates the client's slot map.

- `CLUSTER SETSLOT` replicates via the replication stream before executing, to prevent topology loss on primary crash. Version-gated on `replica_version > 0x702ff` in `clusterCommandSetSlot`.
- `delKeysInSlot` during slot migration needs a version flag: DEL for pre-9.0 replicas, FLUSHSLOT for 9.0+.
- Slot-migration connections use `tls-replication` config, not `tls-cluster`. They connect to the peer's data-plane port, not the cluster-bus port.
- Atomic slot migration child type `CHILD_TYPE_SLOT_MIGRATION` is mutually exclusive with RDB `bgsave`. `killSlotMigrationChild` is the teardown.
- Atomic slot migration target persists in-flight import state as an RDB aux/opcode. Module `IMPORT_STARTED` event fires before imported keys load.
- Module-replicated cross-slot commands must abort during atomic slot migration.
- Importing keys excluded from Fenwick counts and fair-random selection. `HASHTABLE_ITER_INCLUDE_IMPORTING` is the opt-in flag.
- Cutover write-loss window is known, not a correctness guarantee. Between "source grants failover" and "source sees target's ownership gossip", target crash leaves the source eventually timeout-unpaused to accept writes the target won't see (logged *"Write loss risk!"*). Extending the pause is almost always wrong - tighten the gossip-learn-loop on the source.
- `canRedirectClient` is the standalone-only check; `clusterRedirectBlockedClientIfNeeded` handles the cluster case (blocked client on a slot that moved mid-wait). Do not call the standalone helper from cluster paths.

### Gossip

PING/PONG is the only carrier. 80% sampling of known nodes per ping with HANDSHAKE-state nodes excluded - strict equality tests are flaky by design. Stale PING/PONG after primary quiesce: cross-check `nodeEpoch(sender_claimed_primary) > sender_claimed_config_epoch` before latching a new role.

- `clusterNodeCleanupFailureReports` runs per PING/PONG with a millisecond-bucketed RAX key. The bucketing cut CPU from 100% to 32% at 450 concurrent failovers - do not regress it.
- Failure reports expire per-timer; once a node enters FAIL state, `clusterNodeAddFailureReport` short-circuits and the count decays rather than growing.

### SCAN cross-node

Cluster SCAN is primary-only. Per-node hash seeds mean results are not comparable across primary/replica of the same shard, and RDB dump-then-load does not preserve SCAN order. `CLUSTERSCAN` carries a seed fingerprint in the cursor; rerouting a cursor mid-iteration without the fingerprint silently skips or double-visits.

### CLUSTER SLOT-STATS

Valkey-only per-slot observability (`src/cluster_slot_stats.c`). Accounting hooks (network in/out, CPU, key count) live there - add one call per metric in new code paths attributing cost to a slot. Gated on `cluster-slot-stats-enabled` (default off) except `KEY_COUNT`.

### Sentinel

Monitoring matches Redis. Activation: `--sentinel` flag, or binary named `valkey-sentinel` (also accepts `redis-sentinel`). Coordinated failover: `SENTINEL FAILOVER <name> COORDINATED` sets `SRI_COORD_FAILOVER` - primary-driven and atomic.

## Client command patterns

Type-handler statement order is load-bearing: type check + `lookupKey`, then state mutation (`dbDelete`, `hashTypeSet`, `listTypePush`), then `signalModifiedKey` + `notifyKeyspaceEvent`, then `addReply*`.

### MULTI / EXEC

Coordinated failover tears down transactions rather than queuing across the primary boundary. During `CLUSTER FAILOVER` with `FAILOVER_IN_PROGRESS`, the `-REDIRECT` branch calls `discardTransaction(c)` on EXEC and `flagTransaction` on non-EXEC. Queued commands do not carry over.

### Blocking operations (`src/blocked.c`)

Redis core is unchanged. The Valkey-specific parts are the failover interaction and the module/standalone/cluster branches.

- `FAILOVER_IN_PROGRESS` routes to `blockPostponeClient` or emits `-REDIRECT` for `CLIENT_CAPA_REDIRECT`. Set `primary_host` BEFORE the redirect reply.
- `deny_blocking` is set inside scripts and transactions; module commands that would block are rejected synchronously. Keyspace-notification callbacks are the only legitimate exception.
- Reset the client timeout on `blockClient` creation. BLPOP, `CLUSTER SETSLOT`, and any deadline-scheduling path must write a fresh timeout.
- `BLOCKED_INUSE` accounting is one-increment / one-decrement in `server.blocked_clients`. Module clients (`c->flag.module`) skip the counter - unconditional decrement on reset underflows.
- Removing the read handler for a block's duration requires a per-connection-type `is_closing` callback - missing implementations on unix-socket, RDMA, and FreeBSD leak connection resources.
- Module blocking callbacks run on main only.
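
The counter rule in the list above can be modeled in isolation. The sketch below is a toy, not the code in `src/blocked.c`: `toy_server`, `toy_client`, and the `counted` flag are hypothetical stand-ins for `server.blocked_clients` and `c->flag.module`. It shows one way to keep increments and decrements paired so that a module client, which never bumped the counter, cannot underflow it on reset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the server struct (hypothetical). */
typedef struct {
    size_t blocked_clients; /* models server.blocked_clients */
} toy_server;

/* Toy stand-in for the client struct (hypothetical). */
typedef struct {
    bool is_module; /* models c->flag.module */
    bool counted;   /* true only if this client bumped the counter */
} toy_client;

/* Count the client once when it blocks; module clients are never counted. */
void toy_block(toy_server *s, toy_client *c) {
    if (c->is_module || c->counted) return;
    s->blocked_clients++;
    c->counted = true;
}

/* Undo exactly what toy_block did. A client that was never counted
 * (module client, or already reset) leaves the counter untouched. */
void toy_unblock(toy_server *s, toy_client *c) {
    if (!c->counted) return;
    assert(s->blocked_clients > 0); /* an underflow here is an accounting bug */
    s->blocked_clients--;
    c->counted = false;
}
```

Recording "this client bumped the counter" on the client is one design choice among several; checking the module flag at decrement time works as well, provided both sides use the same predicate.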

### Pub/Sub

Storage type diverges per channel type (the single biggest grep hazard). Global channel maps are `kvstore` (slot-partitioned in cluster mode), patterns are still `dict`, per-client subscription sets are `hashtable`.

| Structure | Type |
|-----------|------|
| `server.pubsub_channels` | `kvstore` |
| `server.pubsubshard_channels` | `kvstore` |
| `server.pubsub_patterns` | `dict` (still) |
| Per-client `pubsub_channels` / `pubsub_patterns` | `hashtable` |

- Iterate server-wide channel subscriptions through `kvstore`. Reintroducing flat `dict*` iteration breaks cluster slot partitioning.
- `notifyKeyspaceEvent` does NOT propagate across the cluster - it calls `pubsubPublishMessage` directly, bypassing `clusterPropagatePublish`. Subscribers see only local-node events.
- `publish(message, channel)` argument order reverses across GLIDE bindings (Python/Node/Java reverse; Go/C#/PHP/Ruby keep `publish(channel, message)`).

### Keyspace notifications (`src/notify.c`)

Governing rule: notification BEFORE reply. `signalModifiedKey` + `notifyKeyspaceEvent` must run before any `addReply*`. `addReply*` installs the client on the pending-write queue via `prepareClientToWrite` - once queued, the reply flushes before a notification-driven block can take effect. If a command must `addReply*` before notifying, route through `initDeferredReplyBuffer` (no-op when no module subscribes).

Events describe the effect on the key, not the command. SREM/ZREM/HDEL that empty a key emit the sub-element-removed event first (container still present), then `del`. A single command may emit multiple events - HSET can emit `hset` + `hexpire` + `hexpired` + `del`; SETEX on strings suppresses `set` when already expired. Consumers must re-fetch on notification; treat KSN as "key was modified somehow".

- `expire` fires at set-time with positive future timeout. `expired` fires when the key is actually removed. EXPIRE with past/negative timeout takes the expiration path.
- `hexpired` is a Valkey addition. Fires on hash per-field TTL expiry via `dbReclaimExpiredFields`.
- `A` flag expands to `g$lshzxetd`. Does NOT include `m` (KEY_MISS) or `n` (NEW). Adding a letter to `A`'s expansion is a behavior change.
- Module notifications always fire. `moduleNotifyKeyspaceEvent` runs regardless of the `notify-keyspace-events` mask - do not gate it on the flag string.
- HSETEX "nothing was set" return-0 conflates no-op / no-notify cases. Write-completion indicator and notification decision stay coupled.
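
To make the `A` expansion concrete, here is a minimal model of class-string parsing. The bit values and every `toy_` name are invented for the example (the real constants live in the server headers); the letter set follows the list above, and `K` / `E` are the keyspace / keyevent selectors from the Redis baseline. The point is only that `A` is a fixed alias that leaves `m` and `n` out.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative bit values only; not the server's real constants. */
enum {
    TOY_KEYSPACE = 1 << 0,  /* K */
    TOY_KEYEVENT = 1 << 1,  /* E */
    TOY_GENERIC  = 1 << 2,  /* g */
    TOY_STRING   = 1 << 3,  /* $ */
    TOY_LIST     = 1 << 4,  /* l */
    TOY_SET      = 1 << 5,  /* s */
    TOY_HASH     = 1 << 6,  /* h */
    TOY_ZSET     = 1 << 7,  /* z */
    TOY_EXPIRED  = 1 << 8,  /* x */
    TOY_EVICTED  = 1 << 9,  /* e */
    TOY_STREAM   = 1 << 10, /* t */
    TOY_MODULE   = 1 << 11, /* d */
    TOY_KEY_MISS = 1 << 12, /* m: deliberately outside A */
    TOY_NEW      = 1 << 13  /* n: deliberately outside A */
};

/* A == g$lshzxetd, nothing more. */
#define TOY_ALL                                                        \
    (TOY_GENERIC | TOY_STRING | TOY_LIST | TOY_SET | TOY_HASH |       \
     TOY_ZSET | TOY_EXPIRED | TOY_EVICTED | TOY_STREAM | TOY_MODULE)

/* Returns the flag mask for a class string, or -1 on an unknown letter. */
int toy_events_from_string(const char *classes) {
    assert(classes != NULL);
    int flags = 0;
    for (const char *p = classes; *p != '\0'; p++) {
        switch (*p) {
        case 'A': flags |= TOY_ALL; break;
        case 'K': flags |= TOY_KEYSPACE; break;
        case 'E': flags |= TOY_KEYEVENT; break;
        case 'g': flags |= TOY_GENERIC; break;
        case '$': flags |= TOY_STRING; break;
        case 'l': flags |= TOY_LIST; break;
        case 's': flags |= TOY_SET; break;
        case 'h': flags |= TOY_HASH; break;
        case 'z': flags |= TOY_ZSET; break;
        case 'x': flags |= TOY_EXPIRED; break;
        case 'e': flags |= TOY_EVICTED; break;
        case 't': flags |= TOY_STREAM; break;
        case 'd': flags |= TOY_MODULE; break;
        case 'm': flags |= TOY_KEY_MISS; break;
        case 'n': flags |= TOY_NEW; break;
        default: return -1;
        }
    }
    return flags;
}
```

A caller that wants miss or new-key events has to ask for them explicitly, for example with `"KEAm"` or `"KEAn"`.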

## Scripting and modules

Scripting (EVAL / FUNCTION) is a subsystem *of* the module framework - the built-in Lua engine registers through the module scripting-engine ABI. Files: `src/eval.c`, `src/functions.c`, `src/script.c`, `src/scripting_engine.{c,h}`, `src/module.c`, `src/modules/lua/**`, `src/valkeymodule.h`, `src/redismodule.h`, `deps/lua/**` (read-only).

### Scripting dispatch

EVAL/EVALSHA with LRU-bounded script cache (500 entries), shebang flags (`no-writes`, `allow-oom`, `allow-stale`, `no-cluster`, `allow-cross-slot-keys`), Functions (`FUNCTION LOAD/CALL/DELETE/LIST/STATS/DUMP/RESTORE` + `#!lua name=mylib` + `RDB_OPCODE_FUNCTION2` + effects-based replication) are Redis 7.0+ baseline. Valkey-specific: both paths delegate to the pluggable scripting engine - `src/eval.c` and `src/functions.c` are thin wrappers. "Script" = EVAL, "library" = FUNCTION; they share the engine but never share flag-parsing or dispatch state.

- `curr_run_ctx` is a single global, not a stack. Nested EVAL / FCALL is unsupported.
- EVAL reads flags from the shebang header; FUNCTION LOAD reads them per-function. Do not unify.
- Cluster mode: scripts touching non-local keys abort with `ERR Script attempted to access a non local key in a cluster node script`. The error string `This Redis command is not allowed from script` is asserted by tests/external tooling - do not rebrand.
- Lua count-hook gate: `busy_reply_threshold > 0 && !debug_enabled`. `<= 0` disables the interrupt. `lua-time-limit` is an alias.
- `VALKEYMODULE_ARGV_SCRIPT_MODE` (`S` flag to `VM_Call`) and `VALKEYMODULE_CTX_SCRIPT_EXECUTION` are distinct.
- ACL log labels Lua as `"lua"` and any other engine as `"script"` - engine name is NOT stored (preserves compat for log-scrapers).
- `VALKEY_VERSION_NUM` uses two-digit patch encoding; pre-releases reuse patch 240+ (e.g. `8.0.240 == 0x800f0`).
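
The version encoding in the last bullet can be checked with a few lines of arithmetic. This sketch assumes the layout is `0x00MMmmpp` (one byte, i.e. two hex digits, each for major, minor, and patch), which is consistent with the `8.0.240 == 0x800f0` example but is not copied from the header; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Pack major.minor.patch into 0x00MMmmpp (assumed layout). */
uint32_t toy_version_num(unsigned major, unsigned minor, unsigned patch) {
    assert(major <= 0xff && minor <= 0xff && patch <= 0xff);
    return ((uint32_t)major << 16) | ((uint32_t)minor << 8) | (uint32_t)patch;
}

/* Pre-releases occupy patch values 240 and up, per the text above. */
int toy_is_prerelease(uint32_t version_num) {
    return (version_num & 0xff) >= 240;
}
```

A module comparing versions numerically should keep in mind that under this scheme a pre-release sorts above every regular patch release of the same minor.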

### Scripting engine ABI

Engines register via `ValkeyModule_RegisterScriptingEngine(ctx, "name", engine_ctx, &methods)` from module `OnLoad`; shebang `#!<engine-name>` selects via case-insensitive lookup in `engineManager` dict. One engine serves two subsystems - `VMSE_EVAL` and `VMSE_FUNCTION` - and `reset_env` takes the subsystem so teardown differs per context.

Current ABI: `VALKEYMODULE_SCRIPTING_ENGINE_ABI_VERSION = 4UL`. Engine-declared version must be `<=` server's or registration fails. V1: `compile_code`/`free_function`/`call_function`/`get_function_memory_overhead`; V2: binary-safe `compile_code`; V3: subsystem param on `reset_env`; V4: `debugger_*` callbacks.

- Scripting-engine callback signature changes are ABI breaks.
- Free-callback signatures must match the registry's exact C signature - `void fn(engineCtx *, compiledFunction *)` for `engineLibraryFree`, not generic `void (*)(void *)`.
- Unregister drains `BIO_LAZY_FREE` before freeing the slot. Engine-name `sds` is freed AFTER `bioDrainWorker` returns.
- `engineSetupModuleCtx(engine, NULL)` runs even on built-in Lua; every callback tolerates a NULL module pointer.
- `VALKEYMODULE_*` is reserved for `valkeymodule.h`. Internal constants in `scripting_engine.c` use `MODULE_*`.

### Module lifecycle

Standard `.so` + `ValkeyModule_OnLoad` + `ValkeyModule_Init` + `ValkeyModule_CreateCommand`. Public header `src/valkeymodule.h`; the compat shim `src/redismodule.h` re-exports `RedisModule_*` via macros and the legacy `RedisModule_OnLoad` entry point still works. The shim is pinned at Redis 7.2.4 - every post-7.2.4 API is reachable only through `valkeymodule.h`.

- Modules using post-7.2.4 APIs must include `valkeymodule.h` directly. Valkey-only APIs are always new functions - never overload existing names.
- Modules using newer constants (e.g. `VALKEYMODULE_CLIENTINFO_FLAG_PRIMARY`, added in 9.1) must call `ValkeyModule_GetServerVersion` at runtime before relying on them.
- Unload requires cross-thread join before `dlclose`. Spawned threads must `pthread_join` in `OnUnload`. `drainIOThreadsQueue()` + `bioDrainWorker(BIO_LAZY_FREE)` precede the actual unregister.
- Command gating (`NO_MULTI`, slot routing, ACL categories) lives in `src/commands/*.json`, not runtime checks.
- Modules cannot register new connection types; they may provide an alternative *implementation* of existing ones.
- Loading uses `RTLD_NOW|RTLD_LOCAL|RTLD_DEEPBIND`. `RTLD_DEEPBIND` is incompatible with ASan (gate with `!defined(__SANITIZE_ADDRESS__)`) and is Linux/FreeBSD only.

### Custom data types

`ValkeyModule_CreateDataType(ctx, name, encver, &methods)` takes a 9-char name, `encver ∈ [0, 1023]`. Current `ValkeyModuleTypeMethods` version is **5**; RDB save/load, defrag, digest, mem-usage, free, and aux all wire through it.

- `.version = VALKEYMODULE_TYPE_METHOD_VERSION` at the top of the struct literal is required.
- `.defrag` returns `NULL` when relocation is unsupported - not a no-op, not the same pointer.
- `.aux_save2` / `.aux_save` fire once per RDB, not per key. V2 lets RDB skip the aux marker when the module has no aux data.
- `.mem_usage2` receives `ValkeyModuleKeyOptCtx *` and respects `MEMORY USAGE SAMPLES`.
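
The name and `encver` limits make sense once the two are seen packed into one 64-bit type ID. The sketch below follows the Redis-baseline scheme as I understand it (9 characters at 6 bits each plus a 10-bit encoding version); the 64-symbol charset and the `toy_` names are assumptions for illustration, not copied from `src/module.c`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* 64 symbols, so each character fits in 6 bits (assumed charset). */
static const char toy_charset[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-_";

/* Returns the packed 64-bit ID, or 0 when the name or encver is rejected. */
uint64_t toy_type_id(const char *name, int encver) {
    assert(name != NULL);
    if (strlen(name) != 9) return 0;           /* exactly 9 characters */
    if (encver < 0 || encver > 1023) return 0; /* must fit in 10 bits */

    uint64_t id = 0;
    for (int i = 0; i < 9; i++) {
        const char *p = strchr(toy_charset, name[i]);
        if (p == NULL) return 0; /* character outside the charset */
        id = (id << 6) | (uint64_t)(p - toy_charset);
    }
    return (id << 10) | (uint64_t)encver; /* 9 * 6 + 10 = 64 bits */
}

/* The low 10 bits carry the encoding version. */
int toy_type_encver(uint64_t id) {
    return (int)(id & 1023);
}
```

Because the name occupies the upper 54 bits, bumping `encver` keeps the type identity stable while letting the load callback distinguish old payloads.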

### Key API, blocking, threading

`ValkeyModule_OpenKey` flags: READ, WRITE, NOTOUCH, NONOTIFY, NOSTATS, NOEXPIRE, NOEFFECTS. Blocking while `FAILOVER_IN_PROGRESS` is set is postponed via `blockPostponeClient`.

`current_client` vs `executing_client`: `ctx->client` may be a fake / tmp client (module timer, cron, Lua caller, another module). APIs needing the real originating client read `server.current_client`; APIs needing the call-site client read `server.executing_client`. `mustObeyClient`-style checks target `current_client`.

- Module timers / cron run with `current_client == NULL`. `VM_Call` from those must null-guard.
- Blocking from inside Lua / MULTI / replica-stream is rejected synchronously. `ValkeyModule_BlockClient` returns NULL with `errno` (`EINVAL` / `ENOTSUP`).
- `ValkeyModule_Call` from a worker thread requires `ThreadSafeContextLock` / `Unlock`.
- Module-originated cross-slot replications are a hard error via `VM_Replicate` during atomic slot migration.
- `VALKEYMODULE_OPTIONS_SKIP_COMMAND_VALIDATION` is an optimization flag, NOT a sandbox escape.

### Lua engine

Lua vendored under `deps/lua` (read-only) with the engine at `src/modules/lua/`. VM is single-threaded; every whitelist / sandbox rule lives in the engine, not `deps/lua`. Lua GC must be controlled during execution (`LUA_GCCOLLECT` or `LUA_GCSTOP`) - both address the Lua-GC CVE class. No `debug` surface exposed to user scripts.

- Do not modify `deps/lua`. Makefile edits go in `src/modules/lua/Makefile` and `src/Makefile`.
- `FUNCTION FLUSH ASYNC` + `FUNCTION LOAD` race requires the BIO worker to own its own `lua_State` (for `lua_close`); main creates fresh. The fix shipped as a pair - cherry-picking one half leaves the crash reachable.
- `luaFunction.function_ref` (from `luaL_ref`) is the only portable handle. Never cache raw `lua_State` pointers outside EVAL context.
- `BUILD_LUA=yes` statically links; `=module` produces `libvalkeylua.so`. `-flto` / `-flto=auto` / `-ffat-lto-objects` are stripped from the Lua build.

### Rust SDK

Crate `valkey-module` (crates.io), repo `valkey-io/valkeymodule-rs`. Build as `crate-type = ["cdylib"]`. The declarative `valkey_module!` macro wires `OnLoad` and command registration, `Context` is the API surface, `ValkeyResult` / `ValkeyError` bubble as replies. All C-ABI invariants apply transparently.

## Monitoring

### CommandLog

`SLOWLOG` is a compat shim over the unified `COMMANDLOG` subsystem - three parallel circular buffers: `slow`, `large-request`, `large-reply`. State lives at `server.commandlog[]`, indexed by `COMMANDLOG_TYPE_SLOW`, `COMMANDLOG_TYPE_LARGE_REQUEST`, `COMMANDLOG_TYPE_LARGE_REPLY`. Each type has its own threshold, max-len, and ID counter. Old `slowlog-*` config names remain as aliases of the new `commandlog-*` names.

- Config pairs: `commandlog-execution-slower-than` (alias `slowlog-log-slower-than`), `commandlog-slow-execution-max-len` (alias `slowlog-max-len`); `-1` threshold or `0` max-len disables that log.
- `CMD_SKIP_COMMANDLOG` hides the entire entry (AUTH and similar).
- Per-argument redaction: `redactClientCommandArgument(c, argc)` sets bits in `c->redact_arg_bitmap`. Applied lazily at log time.
- Rewritten argv (SET-with-EX rewritten to SETEX) captured via `c->original_argv` - distinct from redaction.
- Script execution: value fields from executing client, but `peerid` / `cname` come from `scriptGetCaller()`.
- Truncation: `COMMANDLOG_ENTRY_MAX_ARGC = 32`, `COMMANDLOG_ENTRY_MAX_STRING = 128` bytes.
- Cluster aggregation: `COMMANDLOG GET`/`LEN`/`RESET` carry `REQUEST_POLICY:ALL_NODES`; `LEN` also carries `RESPONSE_POLICY:AGG_SUM`. Aggregated IDs not globally unique.
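
The per-type disable rule from the config bullet can be written as a small predicate. This is a sketch under assumptions: the `toy_` types are hypothetical, any negative threshold is treated like `-1`, and the `>=` comparison against the threshold is my guess at the boundary behavior rather than something stated above.

```c
#include <assert.h>
#include <stdbool.h>

/* Models the three logs indexed by type (hypothetical names). */
typedef enum {
    TOY_LOG_SLOW = 0,
    TOY_LOG_LARGE_REQUEST,
    TOY_LOG_LARGE_REPLY,
    TOY_LOG_TYPE_NUM
} toy_log_type;

/* Each log type keeps its own settings. */
typedef struct {
    long long threshold;   /* negative disables this log */
    unsigned long max_len; /* 0 disables this log */
} toy_log_config;

/* True when a measured value (microseconds or bytes) should be recorded. */
bool toy_should_log(const toy_log_config cfg[TOY_LOG_TYPE_NUM],
                    toy_log_type type, long long value) {
    assert(type < TOY_LOG_TYPE_NUM);
    const toy_log_config *c = &cfg[type];
    if (c->threshold < 0 || c->max_len == 0) return false; /* log disabled */
    return value >= c->threshold;
}
```

Keeping the check per type means disabling `large-reply` has no effect on `slow`, which matches the "own threshold, max-len" description.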

### Client Tracking

Invalidation channel frozen at `__redis__:invalidate` (see `TrackingChannelName`). Spurious invalidations begin when `tracking_total_keys` hits `tracking-table-max-keys`.

## Security

### ACL

Valkey adds database-scoped selectors absent from Redis. Key-permission selectors (`%R~`, `%W~`) are Redis 7.0+ baseline. Grep `alldbs` and `%R~` in `src/acl.c`.

- Database selectors: `alldbs`, `resetdbs`, per-ID rules.
- Key selectors: `%R~pattern` read-only, `%W~pattern` write-only (combine `%RW~` or plain `~` for both).
- `tls-auth-clients-user` (`CN` | `URI` | `OFF`): TLS client with matching cert subject is auto-authenticated as the ACL user named by that field.

### TLS

`tls-auto-reload-interval` (seconds, default 0 = off): BIO thread parses certs and constructs a new `SSL_CTX`, main thread atomically swaps the pointer. Change detection combines SHA-256 over cert content with inode/mtime on the key file. Expensive parts of `SSL_accept` offload via `trySendAcceptToIOThreads`.

- INFO cert expiry fields: `tls_server_cert_expire_time`, `tls_client_cert_expire_time`, `tls_ca_cert_expire_time` plus serials and `tls_*_expires_in_seconds` helpers; populated by `tlsUpdateCertInfoFromCtx`.
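
The inode/mtime half of the change detection can be sketched with plain `stat(2)`. This is an illustration only: the `toy_` names are hypothetical, the SHA-256 comparison over cert content is left out, and nothing here reflects how the BIO thread actually schedules the check.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>

/* Snapshot of the identity of a key file on disk (hypothetical type). */
typedef struct {
    ino_t inode;
    time_t mtime;
    bool valid; /* false when the file could not be stat'ed */
} toy_file_stamp;

/* Fill `out` from the file at `path`; returns false if stat fails. */
bool toy_stamp_file(const char *path, toy_file_stamp *out) {
    assert(path != NULL && out != NULL);
    struct stat st;
    if (stat(path, &st) != 0) {
        out->inode = 0;
        out->mtime = 0;
        out->valid = false;
        return false;
    }
    out->inode = st.st_ino;
    out->mtime = st.st_mtime;
    out->valid = true;
    return true;
}

/* A replaced file (new inode) or an in-place rewrite (new mtime) both count. */
bool toy_stamp_changed(const toy_file_stamp *a, const toy_file_stamp *b) {
    assert(a != NULL && b != NULL);
    if (a->valid != b->valid) return true;
    if (!a->valid) return false;
    return a->inode != b->inode || a->mtime != b->mtime;
}
```

Comparing the inode as well as the mtime catches the common rotate-by-rename pattern, where the new file can carry the same timestamp as the old one.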

## Build, test, config

### Code rules

- `clang-format-18` is the exact version (`src/.clang-format`). DCO sign-off on every commit (`git commit -s`).
- `UNUSED(x)` over `(void)x`. `size_t` for sizes even when `uint32_t` fits. `_Static_assert` over runtime asserts for compile-time invariants.
- No doxygen syntax. No `inline` in `.c` when the function is called across TUs.
- `deps/` (jemalloc, libvalkey, lua, hiredis fork) is upstream-first: land the fix upstream, then bump the vendored version here.
- Never stage runtime artifacts: `dump.rdb`, `nodes.conf`, `*.log`, ad-hoc cluster dirs.
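
A few of the rules above fit in one short function. The snippet is a self-contained illustration, so `UNUSED` is defined locally as a stand-in for the server's macro, and `toy_total_len` is a hypothetical helper rather than anything in the tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-in so the example compiles on its own. */
#define UNUSED(x) ((void)(x))

/* Compile-time invariant: checked by the compiler, no runtime cost. */
_Static_assert(sizeof(size_t) >= sizeof(uint32_t),
               "size_t must hold any 32-bit length");

/* Sizes and counts are size_t; the unused callback argument uses UNUSED. */
size_t toy_total_len(const size_t *lens, size_t count, void *privdata) {
    UNUSED(privdata);
    assert(lens != NULL || count == 0);
    size_t total = 0;
    for (size_t i = 0; i < count; i++) total += lens[i];
    return total;
}
```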

### Test rules

Flakiness traces to wall-clock timing, TCP backpressure on deferring clients, or active-expire cycle racing fixture loads. The TCL harness has primitives; prefer them over ad-hoc sleeps.

- `wait_for_condition 1000 50` over hardcoded sleeps. Bare `after N` without a source-rooted constant is rejected.
- Pipelined deferring clients use `CLIENT REPLY OFF` to avoid TCP-backpressure flakes.
- `DEBUG SET-ACTIVE-EXPIRE 0` must precede `debug loadaof` for TTL-bearing values. Tests calling `DEBUG` subcommands must tag `needs:debug` (see `tests/unit/expire.tcl` for examples) - otherwise they fail against servers built without debug commands.
- Cluster tests use hash-tag keys (`{t}key1 {t}key2`); `{} {cluster:skip}` fallback for standalone-only. SCAN is not consistent across RDB dump/load or across primary/replica (independent hash seeds) - must alternate cursors.
- Tags: `slow`, `valgrind:skip`, `tls:skip`, `external:skip`, `needs:debug`, `logreqres:skip`. `--tags network` does NOT include cluster tests; long cluster tests use `tags {"slow valgrind:skip"}`; ASAN+cluster OOMs runners, so heavy suites need `run_solo`.
- Latency-sensitive tests use `CLOCK_THREAD_CPUTIME_ID` / `CLOCK_PROCESS_CPUTIME_ID`, not wall-clock.
- Under ASAN: gate memory-heavy tests with `#ifdef __SANITIZE_ADDRESS__`. `RTLD_DEEPBIND` is ASAN-incompatible.
- Use `assert_morethan`, `assert_encoding`, `assert_equal`, `assert_match` - not raw `assert {...}`.
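
For the CPU-time rule, the measurement itself is a plain POSIX call. The helper below is a minimal sketch with hypothetical names, showing how a latency-sensitive check can read per-thread CPU time so that scheduler stalls on a busy CI runner do not count against the code under test.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* exposes clock_gettime under strict -std= modes */
#endif
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* CPU time consumed by the calling thread, in microseconds; -1 on error. */
int64_t toy_thread_cpu_usec(void) {
    struct timespec ts;
    if (clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts) != 0) return -1;
    return (int64_t)ts.tv_sec * 1000000 + ts.tv_nsec / 1000;
}

/* CPU time spent between two readings taken on the same thread. */
int64_t toy_cpu_elapsed_usec(int64_t start, int64_t end) {
    assert(start >= 0 && end >= start); /* a CPU-time clock does not run backward */
    return end - start;
}
```

Swapping in `CLOCK_PROCESS_CPUTIME_ID` gives the process-wide figure, which is the better fit when the work under test is spread across threads.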

### Sanitizer builds

Makefile forces the allocator per sanitizer because jemalloc's interposition fights ASAN and UBSAN instrumentation:

| Sanitizer | Makefile forces `MALLOC=` |
|-----------|---------------------------|
| `address` | `libc` |
| `undefined` | `libc` |
| `thread` | (no override) |

- Run `make distclean` between sanitizer modes - objects compiled against jemalloc can't re-link when `MALLOC=libc` kicks in.
- `check_sanitizer_errors` (`tests/support/server.tcl`) runs after every server stop; matches `Sanitizer` OR `runtime error` (GCC UBSAN uses the second form).
- TLS and RDMA compile only when explicitly enabled: `BUILD_TLS=yes`, `BUILD_RDMA=module`.

### Config system

New configs register via `standardConfig` in `src/config.c`. Flags: `IMMUTABLE_CONFIG`, `PROTECTED_CONFIG`, `HIDDEN_CONFIG`, `SENSITIVE_CONFIG`, `DEBUG_CONFIG`, `DENY_LOADING_CONFIG`. `PROTECTED_CONFIG` is further gated on `enable-protected-configs`.

Renamed configs - grepping the Redis name finds only the alias:

| Primary (Valkey) | Legacy alias |
|------------------|--------------|
| `replicaof` | `slaveof` |
| `replica-priority` | `slave-priority` |
| `primaryuser` | `masteruser` |
| `primaryauth` | `masterauth` |

Many other `slave-*` -> `replica-*` renames follow the same pattern.