Data & sources · 04 of 04

Custom bots — implementation deep-dive.

The dashboard feature page covers using custom bots. This page covers how they work under the hood — for power users who want to build robust bots that don't break.

Time: 8 min · Updated: 2026-05-25 · Audience: Pro+ power users, RevOps building data pipelines · Pair with: Dashboard feature page

TL;DR

Custom bots are headless scrapers + extractors + schedulers, plus a webhook-in fallback. Architecture is queue-based: source fetcher → extraction rule → idempotency check → result router. Idempotency is keyed on (bot_id, content_hash), so re-runs don't flood your signal feed with duplicates. Debug surface: per-run logs, a preview of the last 100 matches, and a manual trigger.

01 Architecture overview

Each bot run is a Lambda-style execution. When a run is triggered, it goes through five steps:

  1. Fetcher pulls the source (HTTP, RSS, sitemap, careers parser)
  2. Extractor runs your rule against the fetched content
  3. Idempotency check against (bot_id, content_hash): a duplicate match is logged but goes no further
  4. Result router writes to signal feed, fires action (alert/auto-brief/CRM push)
  5. Logger persists the run with status, latency, match count
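
The five steps above can be sketched in a few lines of Python. This is a minimal illustration, not the product's implementation: `Bot`, `RunLog`, and `run_bot` are hypothetical names, the idempotency store and signal feed are in-memory stand-ins, and the match count is assumed to include duplicates.

```python
import hashlib
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RunLog:
    status: str
    latency_ms: float
    match_count: int


@dataclass
class Bot:
    bot_id: str
    fetch: Callable[[], str]              # step 1: returns the fetched content
    extract: Callable[[str], list[str]]   # step 2: returns the extracted matches
    seen_keys: set[str] = field(default_factory=set)  # stand-in idempotency store
    feed: list[str] = field(default_factory=list)     # stand-in signal feed
    logs: list[RunLog] = field(default_factory=list)


def run_bot(bot: Bot) -> RunLog:
    started = time.perf_counter()
    status, match_count = "ok", 0
    try:
        content = bot.fetch()
        matches = bot.extract(content)
        match_count = len(matches)  # assumption: duplicates still count as matches
        for match in matches:
            key = hashlib.sha256(f"{bot.bot_id}|{match}".encode()).hexdigest()
            if key in bot.seen_keys:
                continue            # step 3: duplicate, goes no further
            bot.seen_keys.add(key)
            bot.feed.append(match)  # step 4: route to the feed (actions would fire here)
    except Exception:
        status = "error"
    latency_ms = (time.perf_counter() - started) * 1000
    log = RunLog(status, latency_ms, match_count)
    bot.logs.append(log)            # step 5: persist the run
    return log
```

Running the same bot twice against unchanged content leaves the feed untouched on the second run, while both runs still appear in the log.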

02 Source connectors

| Connector | What it fetches | Notes |
| --- | --- | --- |
| URL fetcher | Single HTTP GET, follows redirects, runs JS via a headless browser | Headless render adds ~2s latency |
| RSS reader | Standard RSS / Atom, dedup on item GUID | Most reliable connector |
| Sitemap walker | Crawls a sitemap.xml, detects new entries | Use for site-wide change monitoring |
| Careers parser | Structured extraction for Greenhouse / Lever / Ashby / Workday | Pre-built per-ATS extractors |
| Webhook in | Your system POSTs; bot reacts | Real-time, push-based |

03 Extraction rule types

Three modes, in order of increasing power and cost:

  • Keyword set — match if any phrase in the list appears (case-insensitive). Cheap, fast.
  • Regex — Python regex against the fetched content. Powerful, pattern-precise.
  • Semantic match — pre-defined intents (e.g., "is this a hiring announcement?") classified by Mama's local model. More accurate, slower, Pro+ only.
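
To make the first two modes concrete, here is a small sketch of how a keyword set and a regex rule might behave against fetched content. The helper names `keyword_match` and `regex_match` are hypothetical, and treating a keyword hit as a plain case-insensitive substring match is an assumption; the semantic mode is omitted because it depends on the hosted model.

```python
import re


def keyword_match(content: str, phrases: list[str]) -> list[str]:
    """Return the phrases that appear in content, ignoring case."""
    lowered = content.casefold()
    return [phrase for phrase in phrases if phrase.casefold() in lowered]


def regex_match(content: str, pattern: str) -> list[str]:
    """Return every non-overlapping match of a Python regex in content."""
    return [m.group(0) for m in re.finditer(pattern, content)]


page = "Acme raises $40M Series B and is now Hiring a VP of Sales."

print(keyword_match(page, ["hiring", "layoffs"]))        # ['hiring']
print(regex_match(page, r"\$\d+(?:\.\d+)?[MB]\b"))       # ['$40M']
```

The keyword rule fires on "Hiring" despite the capital letter, while the regex pulls out only the funding amount, which is the kind of precision the regex mode is for.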

04 Schedule engine

Schedules are cron-like, but exposed as friendly options (hourly, daily, weekly). Behind the scenes, each bot has a next-run timestamp; the scheduler polls every 60s and triggers any bot that is due, so a run can start up to about 60s after its scheduled time.
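
A single scheduler poll over next-run timestamps might look like the sketch below. `ScheduledBot`, `INTERVALS`, and `tick` are hypothetical names, and skipping missed runs rather than replaying them is an assumption, not documented behavior.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Friendly options mapped to intervals (assumed mapping).
INTERVALS = {
    "hourly": timedelta(hours=1),
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
}


@dataclass
class ScheduledBot:
    bot_id: str
    frequency: str
    next_run: datetime


def tick(bots: list[ScheduledBot], now: datetime) -> list[str]:
    """One poll: return ids of due bots and move each next_run into the future."""
    due = []
    for bot in bots:
        if bot.next_run <= now:
            due.append(bot.bot_id)
            # Advance from the scheduled time, not from `now`, to avoid drift.
            # Missed intervals are skipped rather than replayed (assumption).
            while bot.next_run <= now:
                bot.next_run += INTERVALS[bot.frequency]
    return due
```

In a real loop, `tick` would be called once per 60s poll with the current UTC time, and the returned ids handed to the run queue.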

Webhook-in bots have no schedule — they run on inbound POST.

Per-tier limits on total bots and on bots running concurrently:

  • Pro: 5 bots, max 1 running at a time per workspace
  • Company: 50 bots, max 5 running at a time

05 Idempotency

The same content extracted twice doesn't produce two signals. The idempotency key is:

sha256(bot_id + '|' + extracted_content)

If the same key was emitted in the last 90 days, the bot logs the match but doesn't fire actions. Override via "Force fire on every match" toggle (rare, mostly used for testing).
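
The key and the 90-day window can be sketched as follows. The function names and the in-memory `last_emitted` dict are hypothetical, and two details are assumptions: a suppressed duplicate does not refresh the stored timestamp, and a match exactly 90 days old counts as outside the window.

```python
import hashlib
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(days=90)


def idempotency_key(bot_id: str, extracted_content: str) -> str:
    """sha256(bot_id + '|' + extracted_content) as a hex digest."""
    raw = f"{bot_id}|{extracted_content}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


def should_fire(
    last_emitted: dict[str, datetime],
    bot_id: str,
    extracted_content: str,
    now: datetime,
    force_fire: bool = False,
) -> bool:
    """Return True if actions should fire for this match."""
    key = idempotency_key(bot_id, extracted_content)
    previous = last_emitted.get(key)
    is_duplicate = previous is not None and now - previous < WINDOW
    if is_duplicate and not force_fire:
        return False  # logged as a match elsewhere, but no actions fire
    last_emitted[key] = now
    return True
```

Because `bot_id` is part of the hashed string, two bots that extract identical content each fire once; only repeats within the same bot are suppressed.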

06 Debugging

Each bot has a "Debug" tab showing:

  • Last 100 runs with timestamp, latency, status, match count
  • Last 100 matches with extracted content snippet
  • Last error with stack trace if any
  • "Test fetch" button — run the fetcher once without firing actions
  • "Manual trigger" button — run the full bot now

Bots that fail 3 runs in a row auto-pause and a workspace alert fires.
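
The auto-pause rule amounts to a consecutive-failure counter that resets on any successful run. A minimal sketch, with `BotHealth` and `record_run` as hypothetical names and the alert left to the caller:

```python
from dataclasses import dataclass

MAX_CONSECUTIVE_FAILURES = 3


@dataclass
class BotHealth:
    consecutive_failures: int = 0
    paused: bool = False


def record_run(health: BotHealth, succeeded: bool) -> bool:
    """Update health after a run; return True when this run triggers the auto-pause."""
    if succeeded:
        health.consecutive_failures = 0  # any success resets the streak
        return False
    health.consecutive_failures += 1
    if health.consecutive_failures >= MAX_CONSECUTIVE_FAILURES and not health.paused:
        health.paused = True
        return True  # the caller would raise the workspace alert here
    return False
```

Two failures followed by a success do not count toward the limit; only an unbroken run of three failures pauses the bot.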

07 Common mistakes

Building bots without testing the extraction
The "Test fetch" button shows you exactly what the fetcher sees. Use it before saving, and check your rule against that output. A bot that matches nothing is useless; a bot that matches everything is noise.

Polling hourly when daily would do
Hourly costs 24× the compute of daily. Most use cases don't need it; daily is the right default.

Disabling idempotency for "more matches"
You don't get more value; you get a feed flooded with duplicates. Idempotency is what keeps the signal feed clean.