Why observability isn't optional.
The automations that hurt the most aren't the ones that break loudly — they're the ones that fail silently for weeks. A workflow that stopped processing 5% of invoices, an integration that dropped WhatsApp leads with a weird emoji, a reconciliation silently skipping operations above a certain amount. By the time you find out, it's already a hole in your operation. That's why every flow we ship to production has observability from day one.
Business + technical dashboards
Two views, not one. The technical (executions, latency, error rate, throughput) for your IT team and the business one (invoices processed, leads qualified, tickets resolved vs. SLA) for ops. Both look at the same system but see what matters to them.
Alerts with escalation, not spam
Slack/Teams for warnings, email for recurring errors, phone/SMS for critical incidents with 15-minute escalation if no ack. Categorization by real severity, not arbitrary thresholds. If an alert isn't actionable, it isn't sent.
Smart retries with backoff and dead letter
Transient errors (502, timeout, rate limit) retry with exponential backoff. Permanent errors (validation, auth) go to a human review queue. Nothing is lost, nothing retries forever. Every execution ends in success, reviewable error, or dead letter — never in limbo.
Immutable audit log
Every execution records: what came in, what went out, what decision was made, what LLM said what, what validation passed/failed, who manually approved. GDPR compliance and traceability for financial audits, especially useful for accounting processes.
The day my OCR provider changed an endpoint without warning, the system alerted me at 9:03 with the exact exception and workflow line. We fixed it in 40 minutes. Without observability, we'd have found out at month-end close.
What we don't do
- We don't sell pretty dashboards without alerts. If nobody acts on a number, it's decoration.
- We don't use heavy SaaS APM (Datadog, New Relic) unless you already have them. Light stack by default.
- We don't instrument after the fact if the flow wasn't designed to be observed. In that case, refactor first.