Implementing Async Batch Updates with Celery

A copy-paste Celery recipe for idempotent async pharmacy inventory batches: SHA-256 idempotency keys, advisory-locked atomic ledger writes, and DEA/HIPAA-aligned audit logging that survives mid-batch retries.

A Celery worker that crashes halfway through a controlled-substance batch is not just an operational nuisance — it is a recordkeeping hazard. When task_acks_late is enabled, the broker redelivers the unacknowledged message, and a naive task re-applies every quantity delta it already committed, double-counting a Schedule II item against the perpetual record. The same fault appears when an OperationalError triggers an automatic retry partway through a loop. This page solves exactly that problem: how to make Celery batch updates exactly-once at the ledger level so that retries, redelivery, and worker restarts can never duplicate or drop an inventory movement. It is the focused implementation companion to the Async Batch Processing for Inventory Updates subsystem, which in turn sits inside the broader Data Ingestion & Inventory Sync Workflows architecture.

Prerequisites & Environment

This recipe targets Python 3.11+ and assumes you already understand the regulatory floor for controlled-substance recordkeeping: every Schedule II–V quantity adjustment must resolve to a complete perpetual record under 21 CFR § 1304.11, those records must be reproducible for inspection under 21 CFR § 1304.04, and audit controls over who touched the data are mandated by HIPAA 45 CFR § 164.312(b).

Dependency	Version	Role
`celery`	5.3+	Distributed task queue and retry machinery
`redis`	7.x	Broker plus the atomic `SET NX` idempotency claim
`sqlalchemy`	2.0+	Typed engine and transaction scope for ledger writes
`pydantic`	2.x	Strict payload validation at the queue boundary
PostgreSQL	14+	Append-only ledger with advisory locks

You should already have payloads normalized upstream — NDC values reconciled per the NDC-11 vs NDC-10 Parsing Standards rules and schedule mappings resolved through the DEA Schedule II-V Classification Mapping engine — so this task receives clean, canonical records and concerns itself only with safe persistence.

Implementation

The pattern has three guarantees layered together: a deterministic idempotency key derived by SHA-256 over the canonical payload, an atomic Redis claim that lets only the first arrival proceed, and a single advisory-locked database transaction so a partial batch can never leave a torn ledger. Late acknowledgment (task_acks_late=True) plus worker_prefetch_multiplier=1 make redelivery safe rather than dangerous.

python

import hashlib
import json
import logging
import os
from datetime import datetime, timezone
from typing import Any

import redis
from celery import Celery
from pydantic import BaseModel, Field, ValidationError, field_validator
from sqlalchemy import create_engine, text
from sqlalchemy.exc import OperationalError, SQLAlchemyError
from sqlalchemy.orm import Session

# --- Configuration (secrets injected via environment, never hard-coded) ---
BROKER_URL = os.environ["CELERY_BROKER_URL"]
DB_URL = os.environ["PHARMACY_DB_URL"]
IDEMPOTENCY_TTL_SECONDS = 86_400  # outlives the maximum retry window

app = Celery("pharmacy_inventory")
app.conf.update(
    broker_url=BROKER_URL,
    task_serializer="json",
    accept_content=["json"],
    timezone="UTC",
    enable_utc=True,
    task_acks_late=True,          # ack only after commit; redelivery is now safe
    worker_prefetch_multiplier=1, # fair dispatch; no hoarding of controlled batches
    task_default_queue="inventory_sync",
    task_routes={
        "pharmacy.tasks.process_batch_update": {"queue": "controlled_substance_ledger"},
    },
)

engine = create_engine(DB_URL, pool_pre_ping=True, pool_size=10, max_overflow=20)
kv = redis.Redis.from_url(BROKER_URL, decode_responses=True)

# --- Structured audit logger: operator attribution, zero PHI ---
class AuditFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "task_id": getattr(record, "task_id", "N/A"),
            "operator_id": getattr(record, "operator_id", "SYSTEM"),
            "event": record.getMessage(),
            "cfr": "21CFR1304.11;45CFR164.312b",
        })

audit = logging.getLogger("dea_inventory_audit")
audit.setLevel(logging.INFO)
_handler = logging.FileHandler("/var/log/pharmacy/dea_batch_audit.log", mode="a")
_handler.setFormatter(AuditFormatter())
audit.addHandler(_handler)


class DrugDelta(BaseModel, frozen=True):
    """Canonical, immutable inventory movement. additionalProperties is rejected."""
    model_config = {"extra": "forbid"}  # blocks stray PHI from entering the broker

    ndc: str = Field(pattern=r"^\d{5}-\d{4}-\d{2}$")
    quantity_delta: int = Field(ge=-10_000, le=10_000)
    operator_id: str = Field(min_length=3, max_length=50)
    facility_uuid: str = Field(
        pattern=r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$"
    )
    batch_id: str = Field(min_length=10)
    transaction_type: str = Field(pattern="^(dispense|receive|adjustment|return|waste)$")

    @field_validator("quantity_delta")
    @classmethod
    def reject_zero(cls, v: int) -> int:
        if v == 0:
            raise ValueError("zero-delta movement violates DEA audit requirements")
        return v


def idempotency_key(rec: DrugDelta) -> str:
    """Deterministic SHA-256 over the canonical movement, namespaced by batch_id."""
    canonical = json.dumps(rec.model_dump(), sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"idem:{rec.batch_id}:{digest}"


@app.task(
    name="pharmacy.tasks.process_batch_update",
    bind=True,
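    # OperationalError subclasses SQLAlchemyError, so one entry covers both;
    # redis-py's ConnectionError does not derive from the builtin, so list it too.
    autoretry_for=(SQLAlchemyError, ConnectionError, redis.exceptions.ConnectionError),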
    retry_backoff=True,
    retry_backoff_max=300,
    retry_jitter=True,
    max_retries=5,
)
def process_batch_update(self, payload_batch: list[dict[str, Any]]) -> dict[str, Any]:
    task_id = self.request.id

    # 1. Validate at the boundary; a malformed batch never reaches the ledger.
    try:
        records = [DrugDelta(**row) for row in payload_batch]
    except ValidationError as exc:
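        # include_input=False keeps rejected field values (possible PHI) out of the log.
        audit.warning(
            f"schema rejection for batch {task_id}: {exc.json(include_input=False)}",
            extra={"task_id": task_id},
        )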
        return {"status": "rejected", "task_id": task_id, "reason": "validation"}

    lock_id = int.from_bytes(b"ledger"[:4], "big") & 0x7FFF_FFFF
    claimed: list[tuple[str, DrugDelta]] = []  # claims made by THIS attempt only

    # 2. One advisory-locked transaction: all-or-nothing for the whole batch.
    with Session(engine) as session:
        # Transaction-scoped lock: PostgreSQL releases it at commit or rollback,
        # so it cannot leak onto a pooled connection the way a session lock can.
        session.execute(text("SELECT pg_advisory_xact_lock(:lid)"), {"lid": lock_id})
        try:
            for rec in records:
                key = idempotency_key(rec)
                # Atomic claim: only the FIRST arrival proceeds; retries no-op.
                if not kv.set(key, task_id, nx=True, ex=IDEMPOTENCY_TTL_SECONDS):
                    audit.info(f"idempotency short-circuit: {key}",
                               extra={"task_id": task_id, "operator_id": rec.operator_id})
                    continue
                claimed.append((key, rec))

                now = datetime.now(timezone.utc)
                session.execute(
                    text("UPDATE inventory_ledger SET on_hand = on_hand + :d, "
                         "last_updated = :ts WHERE ndc = :ndc"),
                    {"d": rec.quantity_delta, "ts": now, "ndc": rec.ndc},
                )
                session.execute(
                    text("INSERT INTO audit_idempotency_registry "
                         "(key, operator_id, facility_uuid, tx_type, processed_at) "
                         "VALUES (:k, :op, :f, :t, :ts)"),
                    {"k": key, "op": rec.operator_id, "f": rec.facility_uuid,
                     "t": rec.transaction_type, "ts": now},
                )
            session.commit()  # broker ack fires only after the task returns
        except Exception:
            session.rollback()
            # Release only the claims made in this attempt. Keys that were already
            # present belong to an earlier committed delivery and must stay claimed.
            if claimed:
                kv.delete(*(key for key, _ in claimed))
            raise  # autoretry_for handles backoff + redelivery

    # Log "applied" only after the commit succeeded, so a rolled-back attempt
    # never leaves a false apply line in the audit stream.
    for _, rec in claimed:
        audit.info(
            f"ledger applied ndc={rec.ndc} delta={rec.quantity_delta}",
            extra={"task_id": task_id, "operator_id": rec.operator_id},
        )
    committed = len(claimed)

    return {"status": "success", "committed": committed, "task_id": task_id}

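The critical detail is the rollback branch. The database transaction and the Redis claims must agree, so a failed transaction releases its claims and the redelivered task can re-claim them cleanly. Without that compensation, a transient DB error would leave the keys claimed while the ledger rows were rolled back, and every retry would short-circuit: the deltas would be silently dropped, not merely deferred. The compensation only runs when the exception is raised in-process. A worker killed between the claim and the commit leaves its keys claimed until the TTL expires, so treat audit_idempotency_registry, not Redis, as the authority when reconciling after a hard crash.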
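The claim-and-release behaviour can be exercised without Redis or PostgreSQL. The following is a minimal sketch, not the production task: InMemoryKV, apply_batch, and failing_write are hypothetical names introduced for illustration, and the in-memory dictionary only imitates SET NX and DELETE. It tracks which keys the current attempt claimed and releases only those, so a key claimed by an earlier successful delivery stays in place.

```python
from typing import Callable, Iterable


class InMemoryKV:
    """Hypothetical stand-in for the Redis client: SET NX and DELETE only."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    def set_nx(self, key: str, value: str) -> bool:
        # Mirrors SET key value NX: succeeds only if the key is absent.
        if key in self._data:
            return False
        self._data[key] = value
        return True

    def delete(self, *keys: str) -> int:
        return sum(1 for k in keys if self._data.pop(k, None) is not None)

    def has(self, key: str) -> bool:
        return key in self._data


def apply_batch(kv: InMemoryKV, keys: Iterable[str], task_id: str,
                write: Callable[[str], None]) -> int:
    """Claim each key, run the write, and release this attempt's claims on failure."""
    claimed: list[str] = []
    try:
        for key in keys:
            if not kv.set_nx(key, task_id):
                continue  # already claimed by an earlier delivery
            claimed.append(key)
            write(key)  # stands in for the ledger UPDATE and registry INSERT
    except Exception:
        kv.delete(*claimed)  # compensation: undo only what this attempt claimed
        raise
    return len(claimed)


def failing_write(key: str) -> None:
    if key.endswith("ccc"):
        raise RuntimeError("simulated transient database error")


kv = InMemoryKV()
kv.set_nx("idem:b1:aaa", "earlier-task")  # committed by a previous delivery
keys = ["idem:b1:aaa", "idem:b1:bbb", "idem:b1:ccc"]

try:
    apply_batch(kv, keys, "task-1", failing_write)
except RuntimeError:
    pass

survivors = [k for k in keys if kv.has(k)]          # only the earlier claim remains
retried = apply_batch(kv, keys, "task-1", lambda key: None)
```

After the failed attempt only the pre-existing key is still claimed, and the retry applies the two movements that were rolled back while skipping the one that was already committed.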

Verification & Testing

Correctness here means one thing: running the task twice with the same payload must mutate the ledger exactly once. The assertion below simulates a redelivery by invoking the task body a second time and proving the second pass is a no-op.

python

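def test_retry_is_exactly_once(seeded_ledger):
    # seeded_ledger, on_hand() and registry_count() are project test helpers not
    # shown here. The fixture must also clear this batch's idem:* keys in Redis
    # so an earlier test run cannot pre-claim them.
    batch = [{
        "ndc": "12345-6789-01", "quantity_delta": -2, "operator_id": "RPH001",
        "facility_uuid": "0f8fad5b-d9cb-469f-a165-70867728950e",
        "batch_id": "EDI852-2026-0628-0001", "transaction_type": "dispense",
    }]
    before = on_hand("12345-6789-01")

    # .apply() runs the task in-process with a generated task id. Calling .run()
    # directly leaves self.request.id as None, which redis-py rejects as a value.
    first = process_batch_update.apply(args=[batch]).get()    # initial delivery
    second = process_batch_update.apply(args=[batch]).get()   # simulated redelivery

    assert first["committed"] == 1
    assert second["committed"] == 0             # idempotency short-circuit
    assert on_hand("12345-6789-01") == before - 2   # applied once, not twice
    # exactly one registry row proves no double-count for the DEA audit
    assert registry_count("EDI852-2026-0628-0001") == 1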

To validate the compliance side rather than the logic, tail the audit stream. A clean exactly-once run emits one ledger applied line and, on redelivery, one idempotency short-circuit line — never two applies:

json

{"ts":"2026-06-28T14:02:11.004Z","level":"INFO","task_id":"a91f…","operator_id":"RPH001","event":"ledger applied ndc=12345-6789-01 delta=-2","cfr":"21CFR1304.11;45CFR164.312b"}
{"ts":"2026-06-28T14:02:13.881Z","level":"INFO","task_id":"a91f…","operator_id":"RPH001","event":"idempotency short-circuit: idem:EDI852-2026-0628-0001:9c1e…","cfr":"21CFR1304.11;45CFR164.312b"}

Cross-referencing batch_id against audit_idempotency_registry during an inspection then proves, row by row, that no Schedule II movement was counted twice.
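Reading the audit stream by eye does not scale past a few batches. The sketch below shows one way the check might be automated, assuming the JSON-lines format emitted by AuditFormatter; repeated_applies and the sample lines are hypothetical and introduced only for illustration. It flags any apply event that appears more than once under the same task_id.

```python
import json
from collections import Counter
from typing import Iterable

APPLIED_PREFIX = "ledger applied "


def repeated_applies(lines: Iterable[str]) -> dict[tuple[str, str], int]:
    """Return (task_id, event) pairs whose apply line occurs more than once."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        event = entry.get("event", "")
        if event.startswith(APPLIED_PREFIX):
            counts[(entry.get("task_id", "N/A"), event)] += 1
    return {pair: n for pair, n in counts.items() if n > 1}


# Illustrative log lines only; the task ids and key are made up.
clean_run = [
    '{"task_id": "t1", "event": "ledger applied ndc=12345-6789-01 delta=-2"}',
    '{"task_id": "t1", "event": "idempotency short-circuit: idem:b1:abc"}',
]
suspect_run = clean_run + [
    '{"task_id": "t1", "event": "ledger applied ndc=12345-6789-01 delta=-2"}',
]

clean_report = repeated_applies(clean_run)
suspect_report = repeated_applies(suspect_run)
```

A hit is a prompt to check audit_idempotency_registry, not proof of a double-post: if apply lines are written before the commit, an attempt that rolled back and was retried will also show up here.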

Gotchas & Compliance Pitfalls

Idempotency key collisions across batches. batch_id is deliberately part of the key twice: it is a field in the hashed payload and it prefixes the digest. Two legitimately distinct movements of the same NDC and quantity by the same operator in different batches must produce different keys; exclude batch_id from the hash input and you will silently suppress a real second dispense. Never hash only the NDC and delta.
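Redis TTL shorter than the retry horizon. If IDEMPOTENCY_TTL_SECONDS expires while a retry or redelivery is still pending, the late attempt re-claims an expired key and double-posts. retry_backoff_max × max_retries is a safe upper bound on the cumulative backoff (300 s × 5 = 1,500 s here), but with task_acks_late a redelivery can also arrive a full broker visibility timeout after a worker dies. Keep the TTL comfortably longer than the backoff bound plus the visibility timeout.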
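Early acknowledgment defeats the whole design. Celery has no task_acks_early setting; early acknowledgment is simply the default (task_acks_late=False), where the message is acknowledged before the task runs, so a worker crash loses the batch entirely. The pattern requires task_acks_late=True; pair it with a broker visibility timeout that exceeds your slowest ledger commit so a slow transaction is never redelivered and re-run concurrently.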
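PHI leaking into the broker. extra="forbid" on the Pydantic model rejects any batch that carries unexpected fields before it reaches the ledger, the idempotency keys, or the result backend, which supports the integrity controls of 45 CFR § 164.312(e)(2)(i). The worker only sees a message after it has been serialized into the queue, so validate with the same model on the producer side as well. Loosen it and protected health information can end up serialized in a queue you do not control.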
Swallowed validation failures. A rejected batch returns a rejected status rather than raising — route those to the quarantine and replay machinery described in Error Handling & Retry Mechanisms instead of discarding them, or you create an unrecorded gap in the perpetual inventory.
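The TTL pitfall above can be turned into a startup check. This is a rough sketch under two stated assumptions: that Celery's exponential backoff delay before jitter is min(retry_backoff_max, factor × 2^retries) and that jitter only shortens it, and that the Redis transport's visibility timeout is at its one-hour default. The function names and the 2× safety margin are hypothetical choices, not Celery APIs.

```python
def capped_backoff(retries: int, factor: int = 1, maximum: int = 300) -> int:
    """Assumed pre-jitter delay before retry number `retries` (0-based)."""
    return min(maximum, factor * (2 ** retries))


def worst_case_retry_window(max_retries: int, factor: int = 1,
                            maximum: int = 300) -> int:
    """Sum of the longest possible delays across all retries, in seconds."""
    return sum(capped_backoff(n, factor, maximum) for n in range(max_retries))


def ttl_is_safe(ttl_seconds: int, max_retries: int, visibility_timeout: int,
                maximum: int = 300, margin: float = 2.0) -> bool:
    """True if the idempotency TTL outlives backoff plus one redelivery delay."""
    horizon = worst_case_retry_window(max_retries, maximum=maximum) + visibility_timeout
    return ttl_seconds >= margin * horizon


backoff_window = worst_case_retry_window(5)           # retries at 1, 2, 4, 8, 16 s
recipe_ok = ttl_is_safe(86_400, max_retries=5, visibility_timeout=3_600)
short_ttl_ok = ttl_is_safe(3_600, max_retries=5, visibility_timeout=3_600)
```

Under these assumptions the backoff itself is short, and the visibility timeout dominates the horizon, so a TTL equal to the visibility timeout is already too tight.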

Frequently Asked Questions

Why use a Redis SET NX claim instead of just the database registry row?

The registry row is the durable audit receipt, but the SET NX claim is what makes the in-flight decision atomic across concurrent workers without holding the database lock longer than necessary. The Redis claim fences the work; the registry insert records that it happened. Together they give exactly-once semantics even under simultaneous redelivery.

Does the advisory lock serialize the entire pharmacy throughput?

Only if you leave lock_id constant. The listing derives one fixed lock_id, so every batch queues behind the same lock. The lock is held only for the duration of a single batch transaction, though, and it can be partitioned per NDC range or per facility by deriving lock_id from the batch scope, so independent controlled-substance ledgers commit in parallel while a single ledger’s writes stay strictly ordered.
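One way to derive a scoped lock_id is to hash the scope string into the signed 64-bit range that PostgreSQL's single-argument advisory lock functions accept. The helper below is a sketch; advisory_lock_id and the "facility:" scope format are hypothetical, and the second UUID is made up for the comparison.

```python
import hashlib


def advisory_lock_id(scope: str) -> int:
    """Map a scope string to a stable signed 64-bit integer (PostgreSQL bigint)."""
    digest = hashlib.sha256(scope.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


# Hypothetical scope format: one lock per facility ledger.
lock_a = advisory_lock_id("facility:0f8fad5b-d9cb-469f-a165-70867728950e")
lock_a_again = advisory_lock_id("facility:0f8fad5b-d9cb-469f-a165-70867728950e")
lock_b = advisory_lock_id("facility:7c9e6679-7425-40de-944b-e07fc1f90ae7")
```

A hash collision between two scopes would only make them wait on each other; it cannot corrupt either ledger, so a truncated digest is an acceptable trade-off here.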

What happens when retries are exhausted?

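After max_retries, Celery re-raises the last exception and marks the task failed. It does not dead-letter the message on its own, so route exhausted tasks to a dead-letter queue explicitly (for example from the task's on_failure handler) rather than letting them drop silently. A compliance reconciliation job inspects those payloads, applies an attributed manual override, and re-injects them under a fresh batch_id so the original idempotency keys are not reused.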
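The re-injection step can be sketched with plain dictionaries. The key function below mirrors the recipe's idempotency_key but takes a dict instead of a DrugDelta; reinject, the MANUAL- prefix, and the operator ids are hypothetical and stand in for whatever the reconciliation job actually does.

```python
import hashlib
import json
import uuid
from typing import Optional


def idempotency_key(movement: dict) -> str:
    """Same construction as the recipe's helper, applied to a plain dict."""
    canonical = json.dumps(movement, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"idem:{movement['batch_id']}:{digest}"


def reinject(movement: dict, override_operator_id: str,
             new_batch_id: Optional[str] = None) -> dict:
    """Copy a dead-lettered movement under a fresh batch_id and the overriding operator."""
    fresh = dict(movement)  # never mutate the dead-lettered original
    fresh["batch_id"] = new_batch_id or f"MANUAL-{uuid.uuid4().hex}"
    fresh["operator_id"] = override_operator_id
    return fresh


dead_lettered = {
    "ndc": "12345-6789-01", "quantity_delta": -2, "operator_id": "RPH001",
    "facility_uuid": "0f8fad5b-d9cb-469f-a165-70867728950e",
    "batch_id": "EDI852-2026-0628-0001", "transaction_type": "dispense",
}
replay = reinject(dead_lettered, override_operator_id="RPH-SUPERVISOR-07")

original_key = idempotency_key(dead_lettered)
replay_key = idempotency_key(replay)
```

Because batch_id feeds both the prefix and the digest, the replayed movement gets a new key and cannot be short-circuited by a stale claim left behind by the failed attempts.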

How does this stay consistent when a site goes offline?

Offline batches are buffered through the Fallback Routing for Offline Sync layer, which preserves ordering and idempotency keys locally. On reconnection the deferred deltas replay in sequence and deduplicate against the same registry, so no count drifts during the outage.

Related

Up: Async Batch Processing for Inventory Updates — the parent subsystem this recipe implements.
EDI 852 & 846 Parsing Pipelines — upstream source of the wholesale deltas fed into these batches.
Barcode Scan Log Routing Logic — upstream dispensing-scan source and downstream variance consumer.
Error Handling & Retry Mechanisms — the quarantine and replay machinery for rejected batches.
