How to Build Long-Running Tasks with BullMQ (Queue + Worker)
Backend, NestJS, BullMQ, Redis, Queues, System Design, Reliability
Background
Some tasks should not run inside an HTTP request:
- Email delivery (verification, tracking links)
- Reconciliation / scheduled sweeps (fallback when webhooks are missed)
- Batch processing (imports, exports, backfills)
The common failure mode is predictable: the request path gets slow, timeouts increase, and downstream providers get spammed or rate-limited.
A reliable approach is a Queue + Worker model (BullMQ on top of Redis): enqueue work fast, return the HTTP response, and let background workers process jobs with retries/backoff, concurrency limits, and lifecycle events.
Goals
- Reduce HTTP latency: return quickly while heavy work runs asynchronously.
- Reliability: automatic retries/backoff; job lifecycle decoupled from requests.
- Rate-limit and concurrency control: prevent spam and protect downstream providers (e.g. email API, payment gateway).
- Operational visibility: logs/metrics for job states and failures.
Building blocks (code map)
- Background entrypoint: a NestJS module that imports queue modules (e.g. EmailQueueModule, PaymentsReconcileQueueModule)
- Queue names / job names: central constants so producers/consumers share the same identifiers (e.g. QueueName.email, QueueName.paymentsReconcile)
- Job payload types: explicit interfaces per job so you can evolve payloads safely (e.g. IVerifyEmailJob, IOrderTrackingLinkJob)
- Redis config: a single BullMQ Redis connection shared by producers, workers, and schedulers
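As a rough sketch of the shared constants and payload types described above: the member names QueueName.email, QueueName.paymentsReconcile, IVerifyEmailJob and IOrderTrackingLinkJob come from the article, while the string values, the EmailJobName and EmailJobMap helpers, and every payload field are assumptions made for illustration.

```typescript
// Central identifiers shared by producers and consumers.
// The string values are assumed; only the member names appear in the article.
export const QueueName = {
  email: 'email',
  paymentsReconcile: 'payments-reconcile',
} as const;

// Hypothetical constant for the two email job names the article mentions.
export const EmailJobName = {
  emailVerification: 'email-verification',
  orderTrackingLink: 'order-tracking-link',
} as const;

// Hypothetical payload fields; the real interfaces are not shown in the article.
export interface IVerifyEmailJob {
  userId: string;
  email: string;
  token: string;
}

export interface IOrderTrackingLinkJob {
  orderId: string;
  email: string;
  trackingUrl: string;
}

// Ties each job name to its payload so a producer/consumer mismatch
// is caught by the compiler instead of at runtime.
export interface EmailJobMap {
  'email-verification': IVerifyEmailJob;
  'order-tracking-link': IOrderTrackingLinkJob;
}
```

Keeping names and payloads in one module means a renamed job or a changed field breaks the build in both the producer and the processor, which is usually cheaper than finding out from a failed job.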
Producer / Consumer model
Producer (enqueue job)
- In feature service layers, inject the queue via @InjectQueue(QueueName.…).
- Enqueue with queue.add(jobName, data, opts).
- HTTP flows do not wait for job completion.
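The producer side might look roughly like the sketch below. To stay runnable without Redis it depends on a small QueueLike interface instead of the real BullMQ Queue; that interface, the payload fields, and the register signature are assumptions for illustration. In the actual service the queue would be injected with @InjectQueue(QueueName.…).

```typescript
// Structural stand-in for the injected queue, mirroring the
// queue.add(jobName, data, opts) call shape used in the article.
export interface QueueLike<T> {
  add(name: string, data: T, opts?: Record<string, unknown>): Promise<unknown>;
}

// Hypothetical payload fields, for illustration only.
export interface IVerifyEmailJob {
  userId: string;
  email: string;
  token: string;
}

export class AuthService {
  private readonly emailQueue: QueueLike<IVerifyEmailJob>;

  constructor(emailQueue: QueueLike<IVerifyEmailJob>) {
    this.emailQueue = emailQueue;
  }

  // Enqueues the verification email and returns without waiting for delivery.
  async register(userId: string, email: string, token: string): Promise<{ userId: string }> {
    // Awaiting add() only waits for the job to be stored, not for the email to be sent.
    await this.emailQueue.add(
      'email-verification',
      { userId, email, token },
      { attempts: 3, backoff: { type: 'exponential', delay: 60_000 } },
    );
    return { userId };
  }
}
```

The HTTP handler can respond as soon as register resolves; whether the email is eventually delivered is the worker's concern.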
Consumer/Worker (process job)
- Each queue has a @Processor(QueueName.…) that implements WorkerHost.process(job).
- Workers run in the same process as the NestJS API (there is no separate worker service).
If jobs become CPU-heavy or very high volume, running workers in-process can degrade API throughput. Consider splitting workers into a dedicated process/service when you hit that threshold.
Queue: email (async email delivery)
- Registration: apps/api/src/background/queues/email-queue/email-queue.module.ts
  - Configures BullMQ streams events (maxLen: 1000)
- Processor: apps/api/src/background/queues/email-queue/email.processor.ts
  - concurrency: 1 (sequential delivery)
  - limiter: { max: 1, duration: 150 } (rate limit ~ 1 job / 150ms)
  - removeOnComplete: garbage-collect completed jobs by age + count to keep Redis bounded
  - Dispatch by job.name:
    - email-verification → EmailQueueService.sendEmailVerification(...)
    - order-tracking-link → EmailQueueService.sendOrderTrackingLink(...)
- Queue events listener: apps/api/src/background/queues/email-queue/email-queue.events.ts
  - Logs lifecycle events (added/waiting/active/completed/failed)
- Business email send: apps/api/src/background/queues/email-queue/email-queue.service.ts
  - Calls MailService (Brevo integration)
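The processor's dispatch and rate limit can be sketched as plain functions. The concurrency and limiter values and the two job names are taken from the list above; JobLike, EmailSender, processEmailJob and maxJobsPerMinute are hypothetical names, and in the real processor the switch would sit inside WorkerHost.process(job).

```typescript
// Worker options quoted in the article, expressed as plain data.
export const emailWorkerOptions = {
  concurrency: 1,
  limiter: { max: 1, duration: 150 },
} as const;

// Upper bound implied by a limiter: `max` jobs per `duration` ms, scaled to one minute.
export function maxJobsPerMinute(limiter: { max: number; duration: number }): number {
  return Math.floor((60_000 / limiter.duration) * limiter.max);
}

// Minimal view of a job: only the fields the dispatch needs.
export interface JobLike {
  name: string;
  data: unknown;
}

// Stand-in for EmailQueueService; payload types are left open here.
export interface EmailSender {
  sendEmailVerification(data: unknown): Promise<void>;
  sendOrderTrackingLink(data: unknown): Promise<void>;
}

// Routes a job to the matching send method based on job.name.
export async function processEmailJob(job: JobLike, sender: EmailSender): Promise<void> {
  switch (job.name) {
    case 'email-verification':
      return sender.sendEmailVerification(job.data);
    case 'order-tracking-link':
      return sender.sendOrderTrackingLink(job.data);
    default:
      // Throwing fails the job, so an unknown name shows up in failure logs
      // instead of being silently dropped.
      throw new Error(`Unknown email job: ${job.name}`);
  }
}
```

With max: 1 and duration: 150, maxJobsPerMinute returns 400, which is the ceiling to compare against the email provider's own rate limit.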
Queue: payments-reconcile (scheduled reconciliation / long-running batch)
- Registration: a queue module that registers a reconcile queue + processor
- Scheduler (cron):
  - On module init, if PAYMENTS_RECONCILE_ENABLED=true: upsertJobScheduler(..., { pattern: '0 * * * *' }, { name: 'payments-sweep' })
  - Goal: a fallback when webhooks are missed (or delayed) by reconciling pending orders.
- Processor:
  - concurrency: 1
  - Batch:
    - WINDOW: last 24 hours
    - LIMIT: 200 orders / run
  - For each order: PaymentsService.reconcileProviderOrder(orderId) + logs outcomes
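One way the sweep body could be structured is sketched below. The 24-hour window, the 200-order limit and reconcileProviderOrder(orderId) come from the list above; findPendingOrders, SweepDeps, runPaymentsSweep and the choice to continue after a single failed order are assumptions, not the article's confirmed implementation.

```typescript
export interface PendingOrder {
  id: string;
}

export interface SweepDeps {
  // Hypothetical lookup: pending orders created since `since`, at most `limit` rows.
  findPendingOrders(since: Date, limit: number): Promise<PendingOrder[]>;
  // Mirrors PaymentsService.reconcileProviderOrder(orderId) from the article.
  reconcileProviderOrder(orderId: string): Promise<void>;
  log(message: string): void;
}

export const WINDOW_MS = 24 * 60 * 60 * 1000; // last 24 hours
export const LIMIT = 200; // orders per run

export async function runPaymentsSweep(
  deps: SweepDeps,
  now: Date = new Date(),
): Promise<{ ok: number; failed: number }> {
  const since = new Date(now.getTime() - WINDOW_MS);
  const orders = await deps.findPendingOrders(since, LIMIT);
  let ok = 0;
  let failed = 0;
  // Sequential on purpose: one provider call at a time keeps downstream load flat.
  for (const order of orders) {
    try {
      await deps.reconcileProviderOrder(order.id);
      ok += 1;
    } catch (err) {
      // Assumed policy: one bad order does not abort the rest of the batch.
      failed += 1;
      deps.log(`reconcile failed for ${order.id}: ${String(err)}`);
    }
  }
  deps.log(`sweep done: ok=${ok} failed=${failed}`);
  return { ok, failed };
}
```

Passing the dependencies in makes the loop testable without a database or a payment provider, and the returned counts give the processor something concrete to log per run.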
Retry / backoff policy (examples)
- Email jobs are enqueued in:
  - apps/api/src/api/auth/auth.service.ts (register → enqueue email-verification)
  - apps/api/src/api/payments/payments.service.ts (payment success → enqueue order-tracking-link)
- Common pattern:
  - attempts: 3
  - backoff: { type: 'exponential', delay: 60000 }
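To make the retry timing concrete, the sketch below computes the waits implied by these options, assuming BullMQ's documented exponential formula of 2^(attempts - 1) * delay applied after each failed attempt. The helper name exponentialBackoffDelays is hypothetical and the function only models the arithmetic; it does not call BullMQ.

```typescript
// Job options from the article's "common pattern".
export const emailJobOptions = {
  attempts: 3,
  backoff: { type: 'exponential', delay: 60_000 },
};

// Returns the wait (in ms) before each retry, assuming the exponential
// formula 2^(failedAttempts - 1) * delay.
export function exponentialBackoffDelays(attempts: number, delay: number): number[] {
  const delays: number[] = [];
  // `attempts` total tries allow at most attempts - 1 retries.
  for (let failedAttempts = 1; failedAttempts < attempts; failedAttempts += 1) {
    delays.push(Math.round(Math.pow(2, failedAttempts - 1) * delay));
  }
  return delays;
}
```

Under that assumption, attempts: 3 with delay: 60000 gives waits of 60 s and then 120 s, so a job that keeps failing is marked as failed roughly three minutes after its first attempt, plus processing time.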
Observability
- Worker events (processor-level): log active/progress/completed/failed/stalled/error
- Queue events (queue-level stream): log added/waiting/active/completed/failed
- Overall monitoring stack: Grafana Cloud + Loki/Mimir (see Monitoring Stack section)
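If the lifecycle logs are meant to be queried in Loki, a single structured line per event is easier to filter than free text. The formatter below is a hypothetical helper, not something the article describes: the event names are the ones listed above, while the field names and the mapping from event to log level are assumptions.

```typescript
// Event names listed in the article for worker-level and queue-level logging.
export type JobEvent =
  | 'added' | 'waiting' | 'active' | 'progress'
  | 'completed' | 'failed' | 'stalled' | 'error';

// Hypothetical log fields; adjust to whatever the dashboards filter on.
export interface JobLogEntry {
  queue: string;
  event: JobEvent;
  jobId?: string;
  reason?: string;
}

// Assumed severity mapping: failures are errors, stalls are warnings.
export function levelFor(event: JobEvent): 'info' | 'warn' | 'error' {
  if (event === 'failed' || event === 'error') return 'error';
  if (event === 'stalled') return 'warn';
  return 'info';
}

// Produces one JSON line per lifecycle event.
export function formatJobLog(entry: JobLogEntry): string {
  return JSON.stringify({ level: levelFor(entry.event), ...entry });
}
```

Each event handler would then call something like formatJobLog({ queue: 'email', event: 'failed', jobId, reason }) and hand the result to the logger, so alert rules can match on level and queue rather than on message text.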
Execution flows (sequence diagrams)
A) Sign-up → enqueue email verification
[Sequence diagram not rendered]
B) Payment confirmed → enqueue order tracking email
[Sequence diagram not rendered]
C) Hourly payment reconcile sweep (missed webhook fallback)
[Sequence diagram not rendered]
Scaling & pitfalls (important)
- In-process workers: because processors share the API process, heavy jobs or slow I/O can impact API throughput. If background volume grows, consider moving workers to a separate service/process to scale independently.
- Scheduler & multi-replica: the scheduler runs on module init. With multiple API replicas sharing the same Redis, ensure you don’t accidentally “double schedule”. Using upsertJobScheduler with a stable SCHEDULER_ID reduces risk, but validate in HA setups.
- DLQ/alerts: if failures are mostly log-driven, maturity improvements include dashboards/alerts for failed jobs (Bull Board + alert rules on Loki/Mimir/Sentry).
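The multi-replica point can be illustrated with a small registration helper. The cron pattern, the payments-sweep job name and the PAYMENTS_RECONCILE_ENABLED flag come from the article; the SCHEDULER_ID value, the SchedulerQueueLike interface and the registerPaymentsSweep function are assumptions. The interface mirrors the upsertJobScheduler(id, repeatOptions, jobTemplate) call shape so the sketch runs without Redis.

```typescript
// Structural stand-in for the queue's scheduler API.
export interface SchedulerQueueLike {
  upsertJobScheduler(
    id: string,
    repeat: { pattern: string },
    template?: { name?: string; data?: unknown },
  ): Promise<unknown>;
}

// Hypothetical value. What matters is that every replica uses the same ID,
// so repeated calls update one scheduler instead of creating several.
export const SCHEDULER_ID = 'payments-reconcile-hourly';

// Intended to run on module init; returns whether a scheduler was registered.
export async function registerPaymentsSweep(
  queue: SchedulerQueueLike,
  env: Record<string, string | undefined>,
): Promise<boolean> {
  if (env['PAYMENTS_RECONCILE_ENABLED'] !== 'true') return false;
  await queue.upsertJobScheduler(
    SCHEDULER_ID,
    { pattern: '0 * * * *' }, // top of every hour
    { name: 'payments-sweep' },
  );
  return true;
}
```

Because the call is an upsert keyed by SCHEDULER_ID, two replicas booting at the same time should converge on a single hourly schedule. It is still worth confirming this against the deployed BullMQ version, for example by listing the queue's schedulers after a rolling deploy.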
Written by Vũ Thanh Thiên