
Database autopsy: A performance post-mortem

Duration: 21:26


PART 1 — Analytical Summary 🚀

Context and why it matters 💼

In “Database autopsy: A performance post‑mortem” (21:26), Victor from Odoo engineering walks new partners through how to diagnose and resolve a “database is unresponsive and everything is slow” incident. Speaking to both Odoo.sh and on‑premise setups, he frames a realistic morning crisis and then methodically shows how to use observability (monitoring) and traceability (logs) to pinpoint bottlenecks. The talk’s promise is practical: by the end, you’ll gain a “superpower” that solves a large share of typical performance issues.

The investigative flow: observe first, then trace 🧠

The workflow begins with observability to answer when the issue happened, before digging into logs to determine what happened. On Odoo.sh, built‑in monitoring and logs (including module/build updates and PostgreSQL requests slower than 30 seconds) are available; on‑premise teams can replicate the setup with Prometheus and Grafana.

Victor demystifies the standard graphs:

  • CPU utilization is cumulative (100% per worker). Colors matter: green is Python, blue is database, red is admin/editor tasks, yellow is scheduled jobs (cron). A healthy baseline is often ~70% Python and ~20–25% SQL. Spikes in blue can indicate database pressure; spikes in yellow can point to problematic scheduled jobs.
  • Memory is per worker (on Odoo.sh, ~2 GB per worker). Short spikes may not be visible due to 5‑second sampling; large PDFs, CAD plans, or big attachments can trigger out‑of‑memory despite “flat” graphs.
  • Request and concurrency graphs reveal whether you’re simply saturated (volume) or blocked by a few slow requests. Roughly, plan for about eight concurrent requests per worker (a rough sizing sketch follows this list). Short bursts are okay because requests queue; if the queue fills, users will see “bad gateway” errors.
  • Average response time is aggregated; check the last 24 hours for more granular patterns.
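
To make that sizing rule concrete, here is a back‑of‑the‑envelope check in Python; the worker count and response time are illustrative assumptions, not figures from the talk.

```python
# Rough capacity estimate based on the "about eight concurrent requests per
# worker" rule of thumb. All numbers below are illustrative assumptions.
workers = 4                      # HTTP workers configured for the instance
concurrent_per_worker = 8        # rule of thumb from the talk
avg_response_time_s = 0.25       # assumed average response time

in_flight_capacity = workers * concurrent_per_worker              # 32 requests in flight
sustained_throughput = in_flight_capacity / avg_response_time_s   # ~128 req/s

print(f"In-flight capacity: {in_flight_capacity} requests")
print(f"Sustained throughput before the queue grows: {sustained_throughput:.0f} req/s")
```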

Once you can circle the problematic window (e.g., a spike around 16:00), you switch to traceability via logs to classify errors and identify root causes.

Core ideas and innovations ⚙️

Victor categorizes what truly matters in logs:

  • Concurrency and operational errors: serialization failures (repeatable read mode in PostgreSQL). Occasional errors are normal and auto‑retried up to five times. Continuous waves indicate hot‑row contention (e.g., marking Discuss channels read, or heavy website traffic).
  • Deadlocks: Database‑level deadlocks are logged and retried; trickier are mixed Python/SQL deadlocks, often caused by opening a second cursor inside a transaction and then colliding on a unique constraint. These aren’t logged; catch them live with tools like pg_activity, watching blocking versus waiting queries (a sketch of this anti‑pattern follows this list).
  • Bad queries: Often just programming mistakes (e.g., a Python False slipping into a list of integers). They become performance problems when they run inside a cron that keeps failing and retriggering, creating “zombie” crons that hog resources.
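
As a hedged illustration of that second‑cursor trap, the sketch below uses Odoo‑style ORM code; the model, values, and method are made up, and it only shows the shape of the problem rather than a runnable reproduction outside an Odoo environment.

```python
# Sketch of the mixed Python/SQL deadlock described above. "x.import.batch"
# is a hypothetical model assumed to have a UNIQUE constraint on 'code'.
def action_import(self):
    # The request's main transaction inserts a row; until commit it holds the
    # lock associated with the unique 'code' index entry it just created.
    self.env['x.import.batch'].create({'code': 'ACME-2024'})

    # Anti-pattern: a second, independent cursor inside the same request.
    # Its insert on the same unique key must wait for the outer transaction to
    # finish, while the outer transaction waits for this block to return.
    # PostgreSQL sees two unrelated sessions, so no deadlock is logged and
    # nothing is retried; only a live view (pg_activity, blocking vs waiting
    # queries) reveals it.
    with self.env.registry.cursor() as new_cr:
        other_env = self.env(cr=new_cr)
        other_env['x.import.batch'].create({'code': 'ACME-2024'})
```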

To eliminate the zombie‑cron class of issues, Odoo introduced the commit_progress API. It changes how long‑running jobs are written: process in small batches, commit frequently, report progress, and re‑enqueue immediately. This frees locks early, reduces long transactions, and improves overall throughput. From Odoo 19, crons that fail repeatedly are automatically deactivated and an admin gets a Discuss notification. This stops infinite failure loops while flagging issues for quick remediation.
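
A minimal sketch of that batching pattern follows; the _commit_progress call, its arguments, and the model and field names are assumptions for illustration based on the talk's description, not a verified signature.

```python
# Sketch of a batched cron using the commit-progress pattern (illustrative).
BATCH_SIZE = 100

def _cron_recompute_prices(self):
    batch = self.search([('price_dirty', '=', True)], limit=BATCH_SIZE)
    for record in batch:
        record._recompute_price()          # assumed to clear price_dirty
    remaining = self.search_count([('price_dirty', '=', True)])

    # Commit the finished batch so locks are released early and progress is
    # visible, then report the remaining work so the scheduler re-enqueues
    # the job right away instead of keeping one long transaction open.
    self.env['ir.cron']._commit_progress(len(batch), remaining=remaining)
```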

For memory errors, the symptom is subtle: a brief disconnect/reconnect during an action and then “nothing happened.” Diagnosis requires profiling since logs only note the kill. Victor recommends Memray (by Bloomberg) for deep Python memory attribution, or Odoo’s built‑in memory profiler for a quick first view. Usual culprits include excessive prefetching of HTML/B64 fields, loading huge attachments, or large in‑memory filters.
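
For a quick first pass with Memray, a minimal sketch using its tracker API is shown below; build_report is a stand‑in workload that is not from the talk.

```python
# Capture a memory profile of a suspect workload with Memray
# (pip install memray). The workload below is a placeholder that simply
# allocates a lot, the way a huge attachment or an over-eager prefetch would.
import memray

def build_report():
    return [b"x" * 1024 for _ in range(100_000)]   # roughly 100 MB of throwaway data

with memray.Tracker("report_run.bin"):
    data = build_report()

# Inspect the capture afterwards with Memray's CLI, e.g.:
#   memray flamegraph report_run.bin
```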

Finally, Victor shares the promised “superpower”: stack dumps on demand. When a request is long‑running (not yet timed out), send SIGQUIT (kill -3) to the Odoo process. Odoo will dump thread stacks to the logs without aborting the operation. The dump includes the “three magic numbers” per request—query count, total SQL time, and the remaining Python time—instantly revealing whether the culprit is SQL or Python code. For timeouts, Odoo automatically dumps stacks at kill time.
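
A minimal sketch of triggering that dump from Python (equivalent to running kill -3 in a shell); the PID is a placeholder you would look up yourself.

```python
# Ask a running Odoo worker to dump its thread stacks to the log without
# interrupting the in-flight request. SIGQUIT is signal 3, hence "kill -3".
import os
import signal

odoo_worker_pid = 12345                    # placeholder: find the real PID with ps/top
os.kill(odoo_worker_pid, signal.SIGQUIT)   # same effect as: kill -3 12345
```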

To wrangle large log files (sometimes gigabytes per day), the team uses LNAV. It parses logs into SQLite for ad‑hoc SQL queries, histograms, and correlation with monitoring spikes. One handy technique is computing a rough “load” by multiplying request counts by average duration to find the top offending endpoints. LNAV can also surface source IP frequency, helping detect bots, scanners, or flash‑sale traffic patterns.
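
As a rough stand‑in for that LNAV query, the sketch below ranks endpoints by total time spent (count × average duration) straight from a raw log file. It assumes each werkzeug request line ends with the three numbers Odoo appends (query count, SQL time, remaining Python time); adjust the pattern to your version's log format.

```python
# Rank endpoints by total time spent, approximating the "load" metric above.
import re
from collections import defaultdict

LINE_RE = re.compile(
    r'"\w+ (?P<path>[^ ?"]+)[^"]*".*?'
    r'(?P<queries>\d+) (?P<sql>\d+\.\d+) (?P<python>\d+\.\d+)\s*$'
)

totals = defaultdict(lambda: [0, 0.0])            # path -> [count, total seconds]
with open("odoo.log") as log:
    for line in log:
        match = LINE_RE.search(line)
        if match:
            duration = float(match.group("sql")) + float(match.group("python"))
            totals[match.group("path")][0] += 1
            totals[match.group("path")][1] += duration

# count x average duration equals total time per endpoint, which surfaces both
# slow endpoints and cheap-but-hammered ones.
for path, (count, total) in sorted(totals.items(), key=lambda kv: kv[1][1], reverse=True)[:10]:
    print(f"{total:8.1f}s  {count:6d} requests  {path}")
```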

Impact and takeaways 💬

This talk translates performance firefighting into a repeatable practice. Teams get a clear protocol: use monitoring to find when, use logs and stack dumps to see what, and then apply targeted fixes. The guidance de‑risks common pain points (request saturation, lock contention, zombie crons, memory spikes) while introducing helpful tools and patterns such as commit_progress, pg_activity, Memray, and LNAV. For partners and admins, the practical sizing rules (eight concurrent requests per worker, memory per worker) and color‑coded CPU reading accelerate triage. For developers, the emphasis on batching, frequent commits, and avoiding second cursors inside long transactions improves code that scales under real‑world concurrency.

The net result is a simpler, faster path from “everything is slow” to an actionable root cause—and, just as important, habits and APIs that prevent the next incident.

PART 2 — Viewpoint: Odoo Perspective

Disclaimer: AI-generated creative perspective inspired by Odoo's vision.

Performance isn’t just about speed; it’s about clarity. When partners can see exactly when a slowdown starts, then dump the stack and read the three magic numbers, the path to a fix becomes obvious. That’s the spirit of Odoo: give people integrated tools so they don’t need a separate toolbox for every fire.

I’m particularly happy about commit_progress and the built‑in profiling. Long jobs should be simple to write and kind to the rest of the system. With better batching, automatic retries, and clear notifications, we’re turning “mystery slowdowns” into straightforward engineering. Our community thrives when good patterns are easy—and that’s our job.

PART 3 — Viewpoint: Competitors (SAP / Microsoft / Others)

Disclaimer: AI-generated fictional commentary. Not an official corporate statement.

The guidance is pragmatic and the developer ergonomics are strong. Odoo’s integrated monitoring, log parsing tips, and batching API are well aligned with SMB and mid‑market realities. The ability to quickly partition Python vs SQL time via stack dumps is especially useful for lean teams without full APM stacks.

For larger enterprises, questions remain around standardized, compliant observability across heterogeneous deployments, multi‑region scale, and deep SRE practices (e.g., automated lock diagnostics, fleet‑level Grafana kits, and governance). Still, Odoo’s UX and built‑in patterns lower operational friction—an area where many suites struggle to differentiate.

Disclaimer: This article contains AI-generated summaries and fictionalized commentaries for illustrative purposes. Viewpoints labeled as "Odoo Perspective" or "Competitors" are simulated and do not represent any real statements or positions. All product names and trademarks belong to their respective owners.
