
NestJS + BullMQ + Redis: the queue is the API

When a webhook fires that triggers six side-effects, the right answer is always the queue. Here is the operational playbook for shipping it without surprises — including the gotchas nobody documents.


the queue is the API

When a webhook fires that triggers six side-effects — send an email, hit a third-party API, update a denormalized read model, push to a Slack channel, write an analytics event, and reconcile with a payment provider — the right shape is always the same: accept the webhook, push a job, return 200 in 50ms. Everything else is the queue's problem.

Most teams figure this out the third time they have a webhook handler that takes 4 seconds to respond and costs them a webhook redelivery from Stripe. The first two times they build it inline because "it's just six things, how bad can it be."

This is a piece on how to ship it without surprises.

the stack

For a NestJS app on Node 22, the right answer is BullMQ on Redis. Not RabbitMQ, not Inngest, not fire-and-forget promises in the request handler — BullMQ. The reasons:

  • It's been in production at scale for years, with predictable failure modes.
  • It runs on Redis, which you already have for caching.
  • The NestJS module (@nestjs/bullmq) is well-typed and minimal.
  • The job lifecycle (waiting → active → completed or failed, with failed jobs parked as delayed between retries) is observable via Bull Board, which is two lines of setup.

import { Module } from '@nestjs/common';
import { BullModule } from '@nestjs/bullmq';

@Module({
  imports: [
    BullModule.forRoot({
      connection: {
        // BullMQ takes ioredis options here — host/port, not a URL string.
        host: process.env.REDIS_HOST ?? 'localhost',
        port: Number(process.env.REDIS_PORT ?? 6379),
        // BullMQ requires this for the blocking connections its workers use.
        maxRetriesPerRequest: null,
      },
    }),
    BullModule.registerQueue({ name: 'reconcile' }),
  ],
})
export class AppModule {}

Two minutes of setup, you have a queue.

the producer side

The webhook handler does almost nothing:

import { Body, Controller, Post } from '@nestjs/common';
import { InjectQueue } from '@nestjs/bullmq';
import { Queue } from 'bullmq';

@Controller()
export class StripeWebhookController {
  constructor(@InjectQueue('reconcile') private readonly queue: Queue) {}

  @Post('webhooks/stripe')
  async receive(@Body() body: StripeEvent) {
    await this.queue.add('reconcile', body, {
      attempts: 5,
      backoff: { type: 'exponential', delay: 1000 },
      removeOnComplete: 1000,
      removeOnFail: false,
    });
    return { ok: true };
  }
}

Five things to notice:

  1. attempts: 5 — five tries before giving up. The first failure is often transient (network blip, third-party rate limit), and the exponential backoff lets the system recover.
  2. backoff: exponential — with five attempts there are four retries, waiting 1s, 2s, 4s, then 8s. Total worst-case retry window is 15s. That's enough for most transient errors to resolve.
  3. removeOnComplete: 1000 — keep the last 1000 successful jobs for debugging. Don't keep them all forever; Redis is RAM.
  4. removeOnFail: false — keep failed jobs. You will want to inspect them. This is the line most teams forget, and then after a partial outage they have nothing to debug. (On a long-lived queue, cap it with a count — removeOnFail: 5000 — rather than keeping failures forever; Redis is still RAM.)
  5. No await on the side-effects — the only awaited call is queue.add. The handler returns { ok: true } to Stripe immediately, with the actual work queued. Stripe's webhook contract gives you a few seconds; you don't need them.

the consumer side

import { Logger } from '@nestjs/common';
import { Processor, WorkerHost } from '@nestjs/bullmq';
import { Job } from 'bullmq';

@Processor('reconcile')
export class ReconcileProcessor extends WorkerHost {
  private readonly logger = new Logger(ReconcileProcessor.name);

  async process(job: Job<StripeEvent>): Promise<void> {
    const name = job.data.type ?? 'unknown';
    this.logger.log(`processing ${name} (job ${job.id}, attempt ${job.attemptsMade})`);

    await this.sendEmail(job.data);
    await this.updateReadModel(job.data);
    await this.notifySlack(job.data);
    await this.recordAnalytics(job.data);
    await this.reconcileWithProvider(job.data);
  }
}

A few things to highlight here:

Log every attempt. job.attemptsMade is gold during incidents — you need to know whether the third retry is the one that finally succeeded.

Order matters within a job. If sendEmail succeeds but reconcileWithProvider fails, the retry will re-send the email. That's usually fine for emails (deduplication on the receiver side) but it might not be for other side-effects. If it isn't, split into separate jobs each with their own retry policy.
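A rough sketch of that split — the extra queues and injected handles here are illustrative, not part of the setup above:

// Each side-effect becomes its own job with its own retry policy, so a
// failed reconcile retries alone and never re-sends the email.
await Promise.all([
  this.emailQueue.add('send-receipt', event, { attempts: 3 }),
  this.reconcileQueue.add('reconcile', event, {
    attempts: 5,
    backoff: { type: 'exponential', delay: 1000 },
  }),
]);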

Idempotency is your responsibility. BullMQ guarantees at-least-once delivery, not exactly-once. Each side-effect needs an idempotency key — for Stripe, the event ID; for outbound APIs, a hash of the relevant request fields. Without this, retries cause duplicates.
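A minimal sketch of such a guard over Redis SET NX — the once helper, key prefix, and 24h TTL are made up for illustration:

import Redis from 'ioredis';

// Run fn at most once per key. SET ... NX returns null when the key
// already exists, i.e. this side-effect ran on an earlier attempt.
async function once(redis: Redis, key: string, fn: () => Promise<void>) {
  const fresh = await redis.set(`idem:${key}`, '1', 'EX', 86_400, 'NX');
  if (fresh === 'OK') {
    await fn();
  }
}

// inside process():
// await once(redis, `email:${job.data.id}`, () => this.sendEmail(job.data));

One caveat: marking before the work means a crash mid-call skips the retry; marking after the work means a duplicate is still possible. Pick per side-effect.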

the gotchas nobody tells you

A list of things I've learned the hard way:

Connection pooling. BullMQ holds open Redis connections — every queue and worker keeps at least one. If your worker process and your web process share the same managed Redis (Upstash, Redis Cloud, etc.), you can blow past the plan's connection limit. Budget connections per process, and give the queue its own database or instance if you're near the cap.

Worker concurrency. The default is concurrency: 1. For most webhook reconciliation jobs that's right — you want serialization. But for fan-out work (process this batch of 10,000 records), bump it. The default is what bites teams whose queue drains at 3 jobs/second while 10,000 sit waiting.
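In @nestjs/bullmq the knob rides on the processor decorator — a sketch, with the queue name and the figure of 50 as assumptions:

// Worker options on the decorator pass through to the underlying
// BullMQ Worker; up to 50 jobs run in parallel per worker process.
@Processor('fanout', { concurrency: 50 })
export class FanoutProcessor extends WorkerHost {
  async process(job: Job): Promise<void> {
    // ... fan-out work
  }
}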

Stalled jobs. A job is "stalled" when the worker locks it but doesn't extend the lock in time (typically because the worker process crashed). BullMQ retries stalled jobs automatically, but if your worker is slow rather than crashed, you'll see jobs run twice. Set lockDuration higher than your slowest expected job.
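The lock rides on the same decorator — lockDuration is a standard BullMQ worker option; the two-minute figure below is an assumption, not a recommendation:

// Jobs holding the lock past lockDuration without renewal are marked
// stalled and re-queued, so set it above your slowest job's runtime.
@Processor('reconcile', { lockDuration: 120_000 })
export class ReconcileProcessor extends WorkerHost { /* ...as before... */ }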

Cron jobs. BullMQ supports recurring jobs (repeatable jobs), but the API has gotchas — re-adding a repeatable job with changed options registers a second parallel schedule instead of replacing the first. Remove the old schedule with removeRepeatableByKey before re-adding.
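A sketch of the safe re-add — the job name and cron pattern are placeholders; getRepeatableJobs and removeRepeatableByKey are real Queue methods:

// Drop any existing schedule for this job before registering the new
// one, so a changed pattern replaces the old schedule rather than
// running beside it.
for (const r of await queue.getRepeatableJobs()) {
  if (r.name === 'nightly-report') {
    await queue.removeRepeatableByKey(r.key);
  }
}
await queue.add('nightly-report', {}, { repeat: { pattern: '0 3 * * *' } });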

Bull Board in production. Bull Board's UI is wonderful in dev. In production it needs auth. The default has none. Wire it behind your auth middleware before you forget.

the operational playbook

When something is wrong, here's the order of operations:

  1. Check Bull Board. Is the queue backed up? Are jobs failing? Are workers connected?
  2. Inspect a failed job's stack trace and inputs. removeOnFail: false saved you.
  3. Look at attemptsMade distribution. If most jobs need 3+ attempts, your backoff is wrong or the upstream is having a bad day.
  4. Re-run failed jobs once you've fixed the root cause. Bull Board can retry them from the UI, or do it programmatically with queue.getFailed() and job.retry() — see the sketch after this list.
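A sketch of the programmatic re-run; the page size is arbitrary:

// Pull failed jobs in pages and push each back into the waiting state.
const failed = await queue.getFailed(0, 999);
for (const job of failed) {
  await job.retry();
}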

why the queue is the API

A common temptation when this stuff works is to expose the queue as part of the API — let clients enqueue jobs directly. Resist this. The queue is an internal primitive. The API contract should remain "submit an event, get an immediate response." What happens behind the response is an implementation detail.

When the queue is the API, you can't change the queue without breaking clients. When the queue is hidden, you can swap BullMQ for SQS for Inngest for whatever-comes-next without touching anyone's integration. That's the version of this that survives.

what's next

The pattern in this post — webhook → 200 → queue → side-effects → idempotency-keyed retries — handles most of what a real backend needs. The interesting next step is making it observable. @nestjs/bullmq has hooks; pipe them into OpenTelemetry, and your queue becomes traceable end-to-end with the rest of your request flow. Once that's wired, you can finally answer "why was this Stripe webhook slow last Tuesday."
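The hook side looks like this — @OnWorkerEvent is part of @nestjs/bullmq; the span recording is the part left for that next post:

import { OnWorkerEvent } from '@nestjs/bullmq';

@Processor('reconcile')
export class ReconcileProcessor extends WorkerHost {
  // ...process() as before...

  @OnWorkerEvent('completed')
  onCompleted(job: Job) {
    // end an OpenTelemetry span here, tagged with job.id and attemptsMade
  }

  @OnWorkerEvent('failed')
  onFailed(job: Job, err: Error) {
    // record the failure and its error for trace correlation
  }
}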

That's a topic for another post.


by Krishna Adhikari · Mar 30, 2026