SEV-3 · Partial degradation · Resolved · April 5, 2026 · 14:08–14:31 UTC

Partial signal-pipeline delay — 23 minutes, no data loss.

Upstream LinkedIn API rate-limit change caused our hiring-signal and exec-move detectors to back off and queue. Briefs continued to generate from cached data; no CRM writes were lost. Full chronology, root cause, and the four action items we owe out of it are below.

Severity: SEV-3
Duration: 23 min
Affected: 2 of 6 services
Data loss: None
Customer reports: 0
01

Summary.

TLDR
LinkedIn shipped a tighter per-IP rate-limit window. Our hiring-signal and exec-move detectors hit the new ceiling, the worker queue grew, and signals firing during the 23-minute window were delayed by an average of 11 minutes. Briefs, scoring, web app, auth, and CRM sync were all unaffected. No data loss. No customer-facing pages of the app were down.

At 14:08 UTC on April 5, 2026, our LinkedIn collector started getting back HTTP 429s in volumes we hadn't seen before. Two signal types depend on the collector: hiring spikes (we scrape job postings) and exec moves (we watch profile transitions). Both detectors degraded gracefully — workers backed off, queued the affected events, and continued to retry — but the queue grew faster than it drained for about 18 minutes.

By 14:31 UTC, the queue was back below its normal depth (~120 events) and detection latency was back to baseline. The only customer-observable effect was that hiring and exec-move signals firing during the window showed up in customer briefs roughly 11 minutes later than usual, instead of within the normal <3-minute window.

02

Impact by service.

Of the 6 services on /status, two were degraded; four were unaffected. The web app, brief writer, CRM sync, and auth all continued normally. The signal pipeline as a whole was degraded but never down — only the LinkedIn-dependent detectors backed off.

| Service | Effect | Status |
|---|---|---|
| Web app & dashboard | No effect. All views loaded normally, all briefs visible. | Unaffected |
| API · v1 | P50 response time stayed at 82ms. P99 ticked up briefly (240ms → 410ms) during the queue-drain window, then returned to baseline. | Brief P99 ↑ |
| Signal detection pipeline | 2 of 47 detectors degraded (hiring · exec moves). All other detectors fired normally. Affected detector latency went from <3 min → 11 min average during the window. | Partial |
| Brief generation | No effect. Briefs continued to write from cached signal data. Briefs that would have included the delayed signals re-wrote automatically once the signals landed. | Unaffected |
| CRM sync & webhooks | No effect. All scheduled writes completed. | Unaffected |
| Auth & SSO | No effect. All logins succeeded. | Unaffected |
03

Timeline.

All times UTC on April 5, 2026 (Sunday). Detection was automatic and paged oncall directly; a human responded within 4 minutes.

14:08:22 Sun Apr 5
PagerDuty alert fires on collector_429_rate > 5/min, 6 seconds after the first HTTP 429 from the LinkedIn collector. Oncall (Asif) is paged.
14:09:14
Workers begin back-off as designed. Queue depth starts climbing from ~120 events to ~340 over the next 6 minutes.
14:12:08
Oncall acks PD page, opens incident channel, starts a Zoom war room. 4 minutes from page → human — within our 5-min SLA.
14:14
Pulled LinkedIn API status page and dev forum. Confirmed the rate-limit change — they shipped it without a changelog entry but other vendors are reporting it in the public dev forum.
14:18
Updated /status to "Partial degradation · signal detection · hiring + exec-moves". Posted to the public status page and the partner Slack channel.
14:22
Pushed a config change reducing collector concurrency from 12 → 4 and increasing the back-off floor from 2s → 30s. Config rollout completed at 14:24, at which point the queue began to drain.
14:27
Queue depth back below 200 events. 429 rate down to <1/min. Affected detector latency back below 5 min.
14:31:05 Sun Apr 5
Resolved. Queue depth back to baseline (~120 events). All detectors firing normally. Status page updated to "All systems operational". Incident channel closed.
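
The response-time figures quoted in this report can be re-derived from the timeline. A minimal sketch, assuming the 14:18 status update happened at 14:18:00 (the timeline only gives minute precision for that entry); the helper name `at` is ours:

```python
from datetime import datetime, timedelta

def at(hms: str) -> datetime:
    """Parse a timeline timestamp (UTC, April 5, 2026)."""
    return datetime.strptime(f"2026-04-05 {hms}", "%Y-%m-%d %H:%M:%S")

paged = at("14:08:22")
acked = at("14:12:08")
status_updated = at("14:18:00")  # assumption: seconds not given in the timeline
resolved = at("14:31:05")

time_to_human = acked - paged            # 0:03:46, inside the 5-minute SLA
time_to_status = status_updated - paged  # 0:09:38, inside 10 minutes
duration = resolved - paged              # 0:22:43, reported as 23 min

print(time_to_human, time_to_status, duration)
```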
04

Root cause.

LinkedIn shipped a tighter per-IP rate-limit window for unauthenticated profile and company-page requests. The new limit appears to be roughly 40% lower than the previous one (we estimated the prior ceiling at ~250 req/min/IP; the new ceiling looks like ~150 req/min/IP). They did not publish a changelog entry. Other vendors in the dev forum reported the same change within the same hour.

Our collector was tuned to run just under the previous limit. When the new limit landed, our concurrency was high enough to exhaust the per-minute budget well before each minute was up. The back-off worked correctly, which is why no data was lost, but the queue grew because the collector was burning its budget faster than it replenished.
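
To put rough numbers on this, here is a small worked example. Both ceilings are our estimates from above; the 240 req/min send rate and the fixed one-minute window are assumptions for illustration, not measured values.

```python
OLD_CEILING = 250  # req/min/IP, our estimate of the previous limit
NEW_CEILING = 150  # req/min/IP, our estimate of the new limit

def reduction_pct(old: float, new: float) -> float:
    """Percentage drop from the old ceiling to the new one."""
    return (old - new) * 100 / old

def seconds_until_budget_exhausted(rate_per_min: float, ceiling_per_min: float) -> float:
    """Seconds into a one-minute window at which a steady sender uses up the ceiling."""
    if rate_per_min <= ceiling_per_min:
        return 60.0  # never exhausted within the window
    return ceiling_per_min * 60 / rate_per_min

print(reduction_pct(OLD_CEILING, NEW_CEILING))              # 40.0
print(seconds_until_budget_exhausted(240, NEW_CEILING))     # 37.5
```

Under those assumptions, a collector tuned for the old ceiling spends the last third of every minute collecting 429s, which is the queue-growth phase we saw.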

Why we noticed quickly

We page on 429 rate, not on queue depth or signal latency. That's deliberate — by the time queue depth or latency move, the user-facing window has already started ticking. 429 rate is the leading indicator. The PD alert at 14:08:22 was 6 seconds after the first 429, well before any customer would have noticed.
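
A minimal sketch of that kind of leading-indicator alert, assuming a 60-second sliding window over 429 timestamps. The class and method names are hypothetical; the real rule lives in our alerting config as collector_429_rate > 5/min.

```python
from collections import deque

class RateAlert:
    """Fires when more than `threshold` events land inside a sliding window."""

    def __init__(self, threshold: int = 5, window_s: float = 60.0):
        self.threshold = threshold
        self.window_s = window_s
        self._events = deque()  # timestamps (seconds) of recent 429s

    def record_429(self, now: float) -> bool:
        """Record one 429 at time `now`; return True if the alert should fire."""
        self._events.append(now)
        cutoff = now - self.window_s
        # Drop events that have aged out of the window.
        while self._events and self._events[0] <= cutoff:
            self._events.popleft()
        return len(self._events) > self.threshold

alert = RateAlert()
fired = [alert.record_429(float(t)) for t in range(6)]
print(fired)  # [False, False, False, False, False, True]
```

The alert needs no knowledge of queue depth or latency, which is why it can fire before either of those moves.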

Why the impact was contained

Two design decisions paid off here. First, every signal type is a separate detector with its own worker pool and queue. The LinkedIn-dependent detectors degraded; the other 45 detectors fired normally. Second, brief generation reads from cached signal data with a fall-back to "last refreshed N minutes ago" rather than synchronously waiting for fresh signals. Customers in the affected window saw briefs as normal, with a slightly older "as of" timestamp.
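
The second decision can be sketched as a read path that never waits on a detector. This is an illustration under assumed names (`CachedSignals`, `signals_for_brief`), not our production code:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class CachedSignals:
    signals: List[str]
    refreshed_at: float  # epoch seconds of the last successful refresh

def signals_for_brief(cache: CachedSignals, now: float) -> Tuple[List[str], str]:
    """Return cached signals plus an 'as of' label; never blocks on fresh data."""
    age_min = int((now - cache.refreshed_at) // 60)
    if age_min == 0:
        label = "as of now"
    else:
        unit = "minute" if age_min == 1 else "minutes"
        label = f"last refreshed {age_min} {unit} ago"
    return cache.signals, label

cache = CachedSignals(signals=["hiring_spike"], refreshed_at=0.0)
print(signals_for_brief(cache, now=660.0))
# (['hiring_spike'], 'last refreshed 11 minutes ago')
```

The trade-off is staleness in exchange for availability: a slow upstream makes the label older, but it cannot fail the request.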

05

What we did during the incident.

The mitigation was a single config change: reduce concurrency on the LinkedIn collector and increase the back-off floor. No code change, no deploy of binaries — just a config push through our normal rolling-update path.

  • Reduced collector concurrency from 12 → 4 workers. Below the new ceiling with a margin for spikes.
  • Increased back-off floor from 2s → 30s per worker after any 429. Prevents the workers from thundering back at the API.
  • Updated /status within 10 minutes of the first page. Public status, partner Slack channel, and the affected-customer email list all got the same message.
  • No code rolled back. The collector code was correct — the API contract changed underneath it. Rolling back would not have helped.
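
For readers who want to see what the back-off floor does, here is a sketch. Only the floor (2s → 30s) and the concurrency (12 → 4) come from the change above; the exponential doubling, the 300-second cap, the jitter, and the config key names are assumptions made for illustration.

```python
import random

# Hypothetical shape of the pushed config; key names are illustrative.
COLLECTOR_CONFIG = {"concurrency": 4, "backoff_floor_s": 30.0}

def backoff_delay(attempt: int, floor_s: float = 30.0, cap_s: float = 300.0,
                  jitter: float = 0.0, rng=random.random) -> float:
    """Delay before retry number `attempt` (0-based) after a 429.

    Starts at the floor, doubles per attempt, and is capped at `cap_s`.
    `jitter` adds up to that fraction of the delay so workers do not
    all retry at the same instant.
    """
    base = min(cap_s, floor_s * (2 ** attempt))
    return base + base * jitter * rng()

print([backoff_delay(n, floor_s=2.0) for n in range(3)])   # old floor: [2.0, 4.0, 8.0]
print([backoff_delay(n, floor_s=30.0) for n in range(3)])  # new floor: [30.0, 60.0, 120.0]
```

With the old 2-second floor, a worker could retry many times inside a single rate-limit window; with a 30-second floor it gets at most a couple of attempts per minute.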
06

What went well.

  • Detection in 6 seconds. Paging on 429 rate (not queue depth or latency) caught this immediately.
  • 4-minute time-to-human. Oncall paged at 14:08, ack'd at 14:12. Well inside the 5-minute SLA we hold ourselves to.
  • Graceful degradation worked as designed. No data loss, no synchronous waits that would have failed user requests, no cascading failures into other services.
  • Status page update inside 10 minutes of the page. Partners knew before they could've noticed on their own.
  • Single-config mitigation. No binary deploy and no rollback dance, only a config push.
07

What didn't go well.

  • We didn't see the LinkedIn change coming. Other vendors reported it in the dev forum within an hour — we weren't monitoring that forum. We should have been.
  • The collector concurrency was tuned manually. A static number ("12 workers") meant that the moment the external ceiling moved, we were over it. A self-tuning system that reduced concurrency until the observed 429 rate fell back to near zero would not have hit the queue-growth phase at all.
  • Our impact estimate to customers was conservative. We told the partner Slack "up to 15-minute delays on hiring and exec signals." The measured delay averaged 11 minutes, so we likely overstated. Not the wrong direction to err, but worth noting.
  • No automated retroactive notice for affected accounts. Briefs that were generated during the window with stale hiring/exec-move data weren't flagged to the user once the fresh signals landed. We just re-wrote the brief silently. Some users probably saw a brief at 14:20 and then a re-written one at 14:35 with no explanation.
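
The rule we plan to ship for the last point (AI-03 in the action items) can be sketched as follows. The function name, the wording of the notice, and the handling of never-viewed briefs are assumptions; only the 30-minute window comes from the action item.

```python
from datetime import datetime, timedelta
from typing import Optional

FLAG_WINDOW = timedelta(minutes=30)

def brief_updated_notice(first_viewed_at: Optional[datetime],
                         rewritten_at: datetime,
                         late_signal: str) -> Optional[str]:
    """Return a one-line notice if a viewed brief was re-written soon after."""
    if first_viewed_at is None:
        return None  # nobody saw the older version, nothing to explain
    elapsed = rewritten_at - first_viewed_at
    if timedelta(0) <= elapsed <= FLAG_WINDOW:
        return f"Brief updated: {late_signal} signal landed late."
    return None

viewed = datetime(2026, 4, 5, 14, 20)
rewritten = datetime(2026, 4, 5, 14, 35)
print(brief_updated_notice(viewed, rewritten, "hiring"))
# Brief updated: hiring signal landed late.
```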
08

Action items.

Four work items came out of this. Two are done, two are scheduled. We track these to completion publicly — if any of them aren't done by their target date, the next month's /open update will explain why.

| ID | Action | Owner / due | Status |
|---|---|---|---|
| AI-01 | Subscribe to the LinkedIn developer forum RSS and route it into the oncall Slack channel so we see API changes when other vendors do. | Asif · Due Apr 8 | Done |
| AI-02 | Replace static worker concurrency with adaptive PID controller that targets a 429 rate of zero ± epsilon. Auto-backs off when the external ceiling moves, auto-scales up when there's headroom. | Founding eng · Due May 17 | Done |
| AI-03 | Flag "brief updated" in the dashboard when a brief is re-written within 30 minutes of being first viewed by a user, with a one-line explanation of which signal landed late. | Head of data · Due Jun 14 | In flight |
| AI-04 | Write a runbook for "upstream API rate-limit change": the exact playbook above (find dev forum confirmation, push concurrency config, update status, notify partners). Keep in the oncall doc set. | Asif · Due Jun 30 | Queued |
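
To give a sense of the idea behind AI-02, here is a minimal positional PID sketch. It is not the controller described in our design doc: the gains, the 1-per-minute target (standing in for "zero ± epsilon"), the worker bounds, and the class name are all illustrative assumptions.

```python
class ConcurrencyPID:
    """Maps the observed 429 rate to a worker count with a PID loop."""

    def __init__(self, target_429_per_min: float = 1.0,
                 kp: float = 0.2, ki: float = 0.1, kd: float = 0.0,
                 base_workers: int = 12, min_workers: int = 1, max_workers: int = 12):
        self.target = target_429_per_min
        self.kp, self.ki, self.kd = kp, ki, kd
        self.base = base_workers
        self.min_workers, self.max_workers = min_workers, max_workers
        self.integral = 0.0
        self.prev_error = 0.0
        # Anti-windup: the integral term may span the worker range, no more.
        self.integral_limit = (max_workers - min_workers) / ki if ki else 0.0

    def update(self, observed_429_per_min: float) -> int:
        """Feed one rate sample; return the worker count to run next."""
        # Positive error means headroom; negative means we are over the ceiling.
        error = self.target - observed_429_per_min
        self.integral = max(-self.integral_limit,
                            min(self.integral_limit, self.integral + error))
        derivative = error - self.prev_error
        self.prev_error = error
        output = self.kp * error + self.ki * self.integral + self.kd * derivative
        workers = self.base + output
        return round(max(self.min_workers, min(self.max_workers, workers)))

pid = ConcurrencyPID()
print(pid.update(30.0))  # a burst of 429s cuts the worker count sharply
```

A small positive target, rather than exactly zero, is what lets the loop scale back up: with no 429s the error stays positive, the integral term grows, and workers are added until the ceiling is found again.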
09

If you were affected.

For the 4 partner workspaces with hiring or exec-move signals in the window
If you opened a brief between 14:08 and 14:31 UTC on April 5 for an account where a hiring spike or exec move was happening, the brief you saw might have been ~11 minutes behind reality. The brief auto-rewrote with the fresh data once it landed; we did not flag the re-write at the time (see action item AI-03 above — we're fixing that). If you sent an outbound based on the older version and it's relevant, email [email protected] and we'll re-pull what the brief would have looked like with the late signals.

For everyone else: no impact. Briefs, scoring, CRM sync, and the web app were all operating normally throughout. The 23-minute window is captured on the 90-day uptime grid at /status as a single degraded cell for the signal detection service.

Future incidents will get the same treatment — written within 7 days of resolution, posted publicly, action items tracked to completion. We don't think anyone enjoys reading post-mortems, but we'd rather publish ours than not.

Asif M. · Co-founder + oncall for this incident · Written Apr 7, 2026, revised once. If you want a deeper technical drill on the PID-controller change in AI-02, email [email protected] and we'll share the design doc with anyone evaluating us for an enterprise contract.