Partial signal-pipeline delay — 23 minutes, no data loss.
Upstream LinkedIn API rate-limit change caused our hiring-signal and exec-move detectors to back off and queue. Briefs continued to generate from cached data; no CRM writes were lost. Full chronology, root cause, and the four action items we owe out of it are below.
Summary.
At 14:08 UTC on April 5, 2026, our LinkedIn collector started getting back HTTP 429s in volumes we hadn't seen before. Two signal types depend on the collector: hiring spikes (we scrape job postings) and exec moves (we watch profile transitions). Both detectors degraded gracefully — workers backed off, queued the affected events, and continued to retry — but the queue grew faster than it drained for about 18 minutes.
By 14:31 UTC, the queue was back below its normal depth (~120 events) and detection latency was back to baseline. The only customer-observable effect was that hiring and exec-move signals firing during the window showed up in customer briefs roughly 11 minutes later than usual, instead of within the normal <3-minute window.
Impact by service.
Of the 6 services on /status, two were degraded; four were unaffected. The web app, brief writer, CRM sync, and auth all continued normally. The signal pipeline as a whole was degraded but never down — only the LinkedIn-dependent detectors backed off.
| Service | Effect | Status |
|---|---|---|
| Web app & dashboard | No effect. All views loaded normally, all briefs visible. | Unaffected |
| API · v1 | P50 response time stayed at 82ms. P99 ticked up briefly (240ms → 410ms) during the queue-drain window, then returned to baseline. | Brief P99 ↑ |
| Signal detection pipeline | 2 of 47 detectors degraded (hiring · exec moves). All other detectors fired normally. Affected detector latency went from <3 min → 11 min average during the window. | Partial |
| Brief generation | No effect. Briefs continued to write from cached signal data. Briefs that would have included the delayed signals re-wrote automatically once the signals landed. | Unaffected |
| CRM sync & webhooks | No effect. All scheduled writes completed. | Unaffected |
| Auth & SSO | No effect. All logins succeeded. | Unaffected |
Timeline.
All times UTC on April 5, 2026 (Saturday). The first detection happened automatically via our oncall paging; a human responded within 4 minutes.
collector_429_rate > 5/min. Oncall (Asif) is paged.Root cause.
LinkedIn shipped a tighter per-IP rate-limit window for unauthenticated profile and company-page requests. The new limit appears to be roughly 40% lower than the previous one (we estimated the prior ceiling at ~250 req/min/IP; the new ceiling looks like ~150 req/min/IP). They did not publish a changelog entry. Other vendors in the dev forum reported the same change within the same hour.
Our collector was tuned to run just under the previous limit. When the new limit landed, our concurrency was high enough to consistently breach it within the first 60 seconds of any minute. The back-off worked correctly — that's why no data was lost — but the queue grew because the collector was burning its budget faster than it could replenish.
Why we noticed quickly
We page on 429 rate, not on queue depth or signal latency. That's deliberate — by the time queue depth or latency move, the user-facing window has already started ticking. 429 rate is the leading indicator. The PD alert at 14:08:22 was 6 seconds after the first 429, well before any customer would have noticed.
Why the impact was contained
Two design decisions that paid off here. First, every signal type is a separate detector with its own worker pool and queue. The LinkedIn-dependent detectors degraded; the other 45 detectors fired normally. Second, brief generation reads from cached signal data with a fall-back to "last refreshed N minutes ago" rather than synchronously waiting for fresh signals. Customers in the affected window saw briefs as normal, with a slightly older "as of" timestamp.
What we did during the incident.
The mitigation was a single config change: reduce concurrency on the LinkedIn collector and increase the back-off floor. No code change, no deploy of binaries — just a config push through our normal rolling-update path.
- Reduced collector concurrency from 12 → 4 workers. Below the new ceiling with a margin for spikes.
- Increased back-off floor from 2s → 30s per worker after any 429. Prevents the workers from thundering back at the API.
- Updated /status within 10 minutes of the first page. Public status, partner Slack channel, and the affected-customer email list all got the same message.
- No code rolled back. The collector code was correct — the API contract changed underneath it. Rolling back would not have helped.
What went well.
- Detection in 6 seconds. Paging on 429 rate (not queue depth or latency) caught this immediately.
- 4-minute time-to-human. Oncall paged at 14:08, ack'd at 14:12. Well inside the 5-minute SLA we hold ourselves to.
- Graceful degradation worked as designed. No data loss, no synchronous waits that would have failed user requests, no cascading failures into other services.
- Status page update inside 10 minutes of the page. Partners knew before they could've noticed on their own.
- Single-config mitigation. No deploy, no rollback dance, no rolling-restart of the worker pool — config push only.
What didn't go well.
- We didn't see the LinkedIn change coming. Other vendors reported it in the dev forum within an hour — we weren't monitoring that forum. We should have been.
- The collector concurrency was tuned manually. A static number ("12 workers") meant that the moment the external ceiling moved, we were over it. A self-tuning system that backed off below the observed 429 rate would not have hit the queue-growth phase at all.
- Our impact estimate to customers was conservative. We told the partner Slack "up to 15-minute delays on hiring and exec signals." Actual peak was 11 minutes, so we overstated. Not the wrong direction to err, but worth noting.
- No automated retroactive notice for affected accounts. Briefs that were generated during the window with stale hiring/exec-move data weren't flagged to the user once the fresh signals landed. We just re-wrote the brief silently. Some users probably saw a brief at 14:20 and then a re-written one at 14:35 with no explanation.
Action items.
Four work items came out of this. Two are done, two are scheduled. We track these to completion publicly — if any of them aren't done by their target date, the next month's /open update will explain why.
| ID | Action | Owner / due | Status |
|---|---|---|---|
| AI-01 | Subscribe to the LinkedIn developer forum RSS and route it into the oncall Slack channel so we see API changes when other vendors do. | Asif Due Apr 8 |
Done |
| AI-02 | Replace static worker concurrency with adaptive PID controller that targets a 429 rate of zero ± epsilon. Auto-backs off when the external ceiling moves, auto-scales up when there's headroom. | Founding eng Due May 17 |
Done |
| AI-03 | Flag "brief updated" in the dashboard when a brief is re-written within 30 minutes of being first viewed by a user, with a one-line explanation of which signal landed late. | Head of data Due Jun 14 |
In flight |
| AI-04 | Write a runbook for "upstream API rate-limit change" — the exact playbook above (find dev forum confirmation, push concurrency config, update status, notify partners). Keep in the oncall doc set. | Asif Due Jun 30 |
Queued |
If you were affected.
For everyone else: no impact. Briefs, scoring, CRM sync, and the web app were all operating normally throughout. The 23-minute window is captured on the 90-day uptime grid at /status as a single degraded cell for the signal detection service.
Future incidents will get the same treatment — written within 7 days of resolution, posted publicly, action items tracked to completion. We don't think anyone enjoys reading post-mortems, but we'd rather publish ours than not.