Ship faster when nothing “important” runs inside an HTTP request
LLM calls time out, tools fail, databases get throttled, and a single user prompt can fan out into dozens of steps. If you try to do all of that inside a request/response cycle, reliability becomes a game of chance.
In this post you’ll learn why “async by default” is the foundation for dependable AI operations, what to watch for when your workflows span LLMs, tools, and data systems, and how Cere Insight uses queues, workers, and orchestration to run AI workloads without blocking APIs.
The problem: AI workloads are long, spiky, and non-deterministic
Traditional web endpoints are designed for short, deterministic work: validate input, hit a database, return a response. AI systems break those assumptions:
- Long-running steps: An agent may need retrieval (RAG), multiple model calls, tool execution, and approvals before it can respond.
- Bursty traffic: A marketing email or in-app launch can multiply load instantly; LLM latency amplifies the impact.
- Partial failure is normal: A tool integration can fail while the rest of the workflow is valid. Retries must be selective and safe.
- Multi-tenant constraints: In a multi-org platform, one noisy organization must not degrade everyone else.
- Asynchrony is user-facing: Users still expect progress, status, and eventual results—especially for analytics queries and support workflows.
The result is an architectural choice: either block HTTP requests and hope everything finishes quickly, or treat AI work as background jobs with clear state, visibility, and recovery. Reliable AI platforms choose the second path.
How Cere Insight approaches it: queues + workers + orchestrated workflows
Cere Insight is built around operationalizing AI across organizations: knowledge base/RAG, workflow orchestration across modules, an AI Builder for multi-agent flows (router + tools + integrations), analytics bots that translate natural language to SQL (executed as async jobs), and embedded support inbox automation—all governed with JWT and RBAC. The connective tissue for reliability is an asynchronous execution model.
1) Separate “API acceptance” from “work execution”
APIs should validate, authorize (JWT/RBAC), persist intent, and enqueue work. This keeps request latency stable and prevents timeouts during model/tool execution. A user action becomes a job with a traceable lifecycle: queued → running → succeeded/failed.
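The acceptance path can be sketched with in-memory stand-ins (all names here are hypothetical; in a real deployment the map would be a database table and the array a Redis-backed queue):

```typescript
// Minimal sketch of "validate, persist intent, enqueue" (hypothetical names).
type JobState = "queued" | "running" | "succeeded" | "failed";

interface JobRecord {
  id: string;
  orgId: string;
  payload: unknown;
  state: JobState;
}

const jobs = new Map<string, JobRecord>(); // stand-in for a database table
const queue: string[] = [];                // stand-in for a Redis-backed queue

// The API handler does only cheap, bounded work, then returns a job id.
function acceptRequest(orgId: string, payload: unknown): string {
  // 1) validation + authorization (JWT/RBAC) would happen here
  const id = `job-${jobs.size + 1}`;
  // 2) persist intent with a traceable lifecycle
  jobs.set(id, { id, orgId, payload, state: "queued" });
  // 3) enqueue for a worker; the HTTP response does not wait for execution
  queue.push(id);
  return id;
}
```

The caller gets the job id back immediately and polls (or subscribes to) its lifecycle instead of holding a connection open through model and tool execution.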
2) Use queues and workers for long-running or failure-prone steps
Cere Insight runs background processing through a queue/worker model (commonly implemented with BullMQ over Redis in NestJS deployments). Workers handle LLM calls, retrieval, tool invocations, and database operations with controlled concurrency. This is particularly important for analytics bots: translating natural language into SQL is interactive, but executing SQL across organizational data sources can be expensive and should be handled as an asynchronous job with status updates.
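The controlled concurrency the workers depend on boils down to a counting semaphore. A minimal sketch (illustrative only; with BullMQ you would set the Worker `concurrency` option rather than build this yourself, but the underlying idea is the same):

```typescript
// A counting semaphore: at most `available` tasks in flight at once.
class Semaphore {
  private waiters: Array<() => void> = [];
  constructor(public available: number) {}

  async acquire(): Promise<void> {
    if (this.available > 0) {
      this.available--;
      return;
    }
    await new Promise<void>((resolve) => this.waiters.push(resolve));
  }

  release(): void {
    const next = this.waiters.shift();
    if (next) next(); // hand the permit directly to a waiting task
    else this.available++;
  }
}

// Run LLM/tool calls with a bounded number in flight.
async function runLimited<T>(sem: Semaphore, task: () => Promise<T>): Promise<T> {
  await sem.acquire();
  try {
    return await task();
  } finally {
    sem.release(); // always release, even when the task throws
  }
}
```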
3) Orchestrate multi-step flows across modules
Queues run tasks; orchestration defines the workflow. Cere Insight’s workflow orchestration coordinates steps that span modules: fetch context from the knowledge base (RAG), run an AI Builder agent flow (router chooses tools), call integrations, store intermediate results, and route outcomes into the embedded support inbox when human review or follow-up is needed.
This split matters because the platform must support different execution patterns: linear chains, conditional branching, parallel tool calls, and “wait for external input” states.
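One way to see why the split matters: those execution patterns can be modeled as plain data that an orchestrator interprets. A sketch with hypothetical types and step names (the real engine is more involved):

```typescript
// Workflow shapes as data: a tagged union per execution pattern.
type Step =
  | { kind: "task"; name: string }                               // one unit of work
  | { kind: "sequence"; steps: Step[] }                          // linear chain
  | { kind: "branch"; condition: string; ifTrue: Step; ifFalse: Step }
  | { kind: "parallel"; steps: Step[] }                          // concurrent tool calls
  | { kind: "awaitInput"; name: string };                        // wait for external input

// Example: RAG fetch, then a router branch into parallel tool calls,
// ending in a human-review wait state (step names are illustrative).
const flow: Step = {
  kind: "sequence",
  steps: [
    { kind: "task", name: "fetch-rag-context" },
    {
      kind: "branch",
      condition: "router-picked-tools",
      ifTrue: {
        kind: "parallel",
        steps: [
          { kind: "task", name: "call-crm-integration" },
          { kind: "task", name: "call-search-tool" },
        ],
      },
      ifFalse: { kind: "task", name: "answer-directly" },
    },
    { kind: "awaitInput", name: "support-inbox-review" },
  ],
};
```

Because the shape is data, the same queue/worker machinery can execute any of the patterns without special-casing each one.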
4) Enforce tenant-aware controls
Multi-tenant reliability is a product feature. Jobs carry organization identity, and workers can enforce per-org concurrency and rate limits. Combined with RBAC, this ensures the system only executes what the caller is allowed to run, and it prevents one organization’s heavy analytics runs from starving another organization’s support automation.
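Per-organization admission control can be sketched in a few lines (limits and names are illustrative; production systems enforce this in the queue/worker layer):

```typescript
// Per-org in-flight cap: a noisy org waits, other orgs keep flowing.
const perOrgLimit = 3; // max jobs in flight per organization (illustrative)
const inFlight = new Map<string, number>();

function tryAdmit(orgId: string): boolean {
  const current = inFlight.get(orgId) ?? 0;
  if (current >= perOrgLimit) return false; // this org is at its cap
  inFlight.set(orgId, current + 1);
  return true;
}

function complete(orgId: string): void {
  inFlight.set(orgId, Math.max(0, (inFlight.get(orgId) ?? 0) - 1));
}
```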
5) Make observability a first-class capability
Async systems fail quietly unless you instrument them. Cere Insight emphasizes operational visibility: job state, step-level timing, retries, and failure reasons—so teams can detect saturation, isolate problematic integrations, and improve prompts and tool schemas based on real performance data.
Practical checklist: patterns, pitfalls, and what to monitor
- Design jobs to be idempotent (or at least retry-safe).
If a worker crashes after calling a tool but before saving results, a retry can duplicate side effects. Use idempotency keys per workflow step (for example: orgId + workflowRunId + stepName), and store “step already completed” markers before triggering irreversible actions.
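A minimal sketch of the idempotency-key scheme above (illustrative names; the marker set would be a persisted table in practice):

```typescript
// Completed-step markers, keyed by orgId + workflowRunId + stepName.
const completedSteps = new Set<string>(); // persisted storage in a real system

function stepKey(orgId: string, runId: string, stepName: string): string {
  return `${orgId}:${runId}:${stepName}`;
}

// Guard an irreversible side effect so a retried job skips it.
function runOnce(key: string, sideEffect: () => void): boolean {
  if (completedSteps.has(key)) return false; // a retry lands here and skips
  completedSteps.add(key); // marker first: prefer skipping over duplicating
  sideEffect();
  return true;
}
```

Writing the marker before the effect trades a possible missed execution for never duplicating it, which is usually the right call for irreversible actions.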
- Apply retries with intent: transient vs. permanent failures.
Not all errors deserve the same retry policy. Timeouts, rate limits, and network errors are often retryable with backoff. Prompt validation errors, permission denials (RBAC), or malformed tool inputs are usually permanent until corrected. Classify failures and cap retries to avoid runaway queues.
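The classification can be sketched as a small policy function (status codes, caps, and attempt limits are illustrative):

```typescript
// Transient failures are worth retrying; permanent ones are not.
function isTransient(status: number): boolean {
  // timeouts, rate limits, and server/network errors: retry with backoff
  return status === 408 || status === 429 || status >= 500;
}

// Capped exponential backoff: 500ms, 1s, 2s, ... up to 30s.
function nextDelayMs(attempt: number, baseMs = 500, capMs = 30_000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

function shouldRetry(status: number, attempt: number, maxAttempts = 5): boolean {
  // 401/403 (RBAC denials) and 400 (malformed tool input) stay failed
  // until a human fixes the cause; capping attempts avoids runaway queues.
  return isTransient(status) && attempt < maxAttempts;
}
```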
- Set concurrency limits where the real bottlenecks are.
Workers should not simply “scale up.” Constrain concurrency per integration, per model provider, and per organization. For analytics bots, you may need separate queues for SQL execution vs. LLM planning so that a surge in dashboard queries doesn’t block support automation or RAG indexing.
- Persist workflow state and emit progress events users can trust.
Async UX fails when users get silence. Store workflow run state (queued/running/awaiting input/succeeded/failed) and push updates into the UI and support inbox timelines. This is especially important when AI Builder flows branch: users need to know which step is running and what the system is waiting on.
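The run states above can be guarded with a small transition table, so the UI and inbox timelines never see an impossible sequence (a sketch; the state names follow the text, the table itself is illustrative):

```typescript
// Workflow-run states and their legal transitions.
type RunState = "queued" | "running" | "awaiting_input" | "succeeded" | "failed";

const allowed: Record<RunState, RunState[]> = {
  queued: ["running"],
  running: ["awaiting_input", "succeeded", "failed"],
  awaiting_input: ["running", "failed"], // resumes when external input arrives
  succeeded: [],                         // terminal
  failed: [],                            // terminal
};

function transition(from: RunState, to: RunState): RunState {
  if (!allowed[from].includes(to)) {
    throw new Error(`illegal transition ${from} -> ${to}`);
  }
  // a real system would also persist the new state and emit a progress event
  return to;
}
```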
- Monitor the few metrics that predict incidents.
At minimum: queue depth (and age of oldest job), job success rate, retry counts, worker utilization, step latency percentiles, and per-tenant saturation signals. Add domain metrics: RAG retrieval latency, tool error rate by integration, and analytics job duration by data source. Alert on trends (growing backlog) rather than single failures.
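Trend-based alerting can be as simple as checking whether queue depth rises across most recent samples (thresholds are illustrative; a metrics backend's rate functions would do this for real):

```typescript
// Alert when the backlog is trending up, not on any single spike.
function backlogGrowing(queueDepthSamples: number[], minSlope = 1): boolean {
  if (queueDepthSamples.length < 2) return false;
  let rises = 0;
  for (let i = 1; i < queueDepthSamples.length; i++) {
    if (queueDepthSamples[i] - queueDepthSamples[i - 1] >= minSlope) rises++;
  }
  // trigger only if depth rose across most consecutive samples
  return rises >= Math.ceil((queueDepthSamples.length - 1) * 0.75);
}
```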
Closing: async is not an optimization; it’s the operating model
AI operations are inherently asynchronous: they involve multiple systems, uncertain latency, and real-world failure modes. Treating AI work as background jobs—executed by workers, coordinated by workflow orchestration, and governed by tenant-aware access controls—turns unpredictability into something you can measure and manage.
This approach is for platform teams building multi-tenant AI products, backend engineers working in NestJS who want predictable reliability, and operators who need observability and guardrails for LLM + tool + data workflows. If your AI features are growing beyond “one prompt, one response,” moving to async by default is the step that keeps everything shippable.
