
Sources · data provenance · Updated May 22, 2026 · 47 signal types · 23 upstream sources

Every signal Mama fires comes from one of these places.

A signal is only useful if you trust where it came from. Below is the full list — every category of data we use, every upstream source we can name publicly, and a list at the bottom of what we explicitly don't touch. Published for the same reason /open is — so you can audit us against your procurement checklist instead of asking us on a call.

Partner · paid

Contracted data partners. Paid feeds, SLA-backed, contractually named.

Public · scraped

Public web sources. robots.txt-respected, ToS-compliant, refreshed on a schedule.

Customer-provided

Your own data. Read from your CRM, sequencer, or uploaded list. Never shared.

Never used

Explicit exclusions. See "What we never use" below: the things we don't touch and why.

How sources are scored.

TLDR

Every source gets a confidence multiplier. Contracted partner data > public-web data > inference. A signal fired from two independent sources gets a higher confidence score than one source — and we never let a single-source inference push an account above the working-list floor on its own.

A "source" here means an upstream data feed Mama pulls from. A "signal" is what we derive from one or more sources. Most signals fuse 2–4 sources to reduce false positives — the ICP rubric weights the resulting confidence into the score.
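A minimal sketch of how multi-source confidence could be combined, assuming a noisy-OR rule. The multiplier values, the floor value, and the names `TIER_WEIGHT`, `WORKING_LIST_FLOOR`, `Evidence`, and `fused_confidence` are all hypothetical; this page only states the ordering (partner > public web > inference) and the single-source-inference rule.

```python
from dataclasses import dataclass

# Hypothetical multipliers. Only the ordering comes from this page.
TIER_WEIGHT = {"partner": 0.9, "public": 0.7, "inference": 0.4}
WORKING_LIST_FLOOR = 0.6  # hypothetical threshold


@dataclass(frozen=True)
class Evidence:
    source: str  # e.g. "crunchbase"
    tier: str    # key into TIER_WEIGHT


def fused_confidence(evidence: list[Evidence]) -> float:
    """Combine independent sources with a noisy-OR: 1 - prod(1 - w_i)."""
    # Count each source once so a duplicated feed cannot inflate confidence.
    by_source = {e.source: e for e in evidence}
    miss = 1.0
    for e in by_source.values():
        miss *= 1.0 - TIER_WEIGHT[e.tier]
    confidence = 1.0 - miss
    # Guard: a lone inference must stay below the working-list floor.
    only = list(by_source.values())
    if len(only) == 1 and only[0].tier == "inference":
        confidence = min(confidence, WORKING_LIST_FLOOR - 0.01)
    return confidence
```

With these assumed weights, a partner record corroborated by a public filing scores higher than the partner record alone, and a single inference never reaches the floor regardless of how the weights are tuned.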

The three tiers

Tier 1 · Contracted partner — data we pay for from a vendor with an SLA and a usage contract. Highest confidence weight. Most expensive per record, used sparingly.
Tier 2 · Public web — scraped or fetched from sources that explicitly allow it (robots.txt + Terms-of-Service compliant). Refreshed on a fixed schedule. Most of our signals fuse 2+ Tier-2 sources.
Tier 3 · Customer-provided — data you uploaded or that we read from your CRM. Treated as authoritative for your workspace, never shared with others.

Tiers are not quality rankings — Tier 2 sources are often more current than Tier 1 partner feeds, which is why we mix them. The tier label is about provenance and contractual basis, not signal quality.

Firmographic data · ICP fit.

The "is this account my customer?" data — industry, employee count, revenue band, location, business model. Used for the ICP fit dimension (35% weight) of every account score.

Firmographic 4 sources · 5.2M companies covered

Daily refresh

Source	What we use it for	Tier
Crunchbase	Company-level firmographics: industry, employee count, location, founded date.	Partner
SEC EDGAR	Public-company filings — official revenue, executive list, board composition.	Public
OpenCorporates	Cross-jurisdictional company-registry lookups for non-US accounts.	Public
Your CRM	Account list + custom fields — treated as authoritative for your workspace.	Customer

Funding events.

Closed rounds, M&A, debt raises, IPOs. Highest-weight signal type in the working-list calculation. Refreshed every 2 hours — a Series B announced at 9am will be in your dashboard before lunch.

Funding events 3 sources · ~140 events/week

2-hour refresh

Source	What we use it for	Tier
Crunchbase	Round size, lead investor, follow-on investors, post-money valuation when disclosed.	Partner
Press release feeds	Real-time scraping of major business-wire press releases. De-duped against Crunchbase to avoid double-firing.	Public
SEC Form D filings	For US private placements — sometimes faster than the press release.	Public
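The de-duplication step mentioned in the table can be sketched as follows. This is an illustration under stated assumptions, not the production matcher: the `FundingEvent` fields, the 7-day matching window, and the rule that the Crunchbase record wins a collision are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class FundingEvent:
    company_domain: str
    round_type: str   # e.g. "series_b"
    announced: date
    source: str       # "crunchbase", "press_release", or "form_d"


def dedupe(events: list[FundingEvent], window_days: int = 7) -> list[FundingEvent]:
    """Keep one event per (company, round type) within a date window."""
    window = timedelta(days=window_days)
    kept: list[FundingEvent] = []
    # Visit Crunchbase records first so they win when two sources collide.
    ordered = sorted(events, key=lambda e: (e.source != "crunchbase", e.announced))
    for ev in ordered:
        duplicate = any(
            k.company_domain == ev.company_domain
            and k.round_type == ev.round_type
            and abs(k.announced - ev.announced) <= window
            for k in kept
        )
        if not duplicate:
            kept.append(ev)
    return kept
```

A press release dated one day after the matching Crunchbase record collapses into a single event, so the signal fires once.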

Exec moves.

VP+ joins, departures, lateral moves. Hardest source category to get right — LinkedIn changes happen in real time but data quality varies. We fuse 3 sources to confirm before firing the signal.

Exec moves 3 sources · ~80 events/week

6-hour refresh

Source	What we use it for	Tier
LinkedIn (public profiles)	Public profile change detection at VP+ titles. Read-only via official API access. Subject to LinkedIn's rate limits — see the April 5 incident post-mortem for what happens when they change those.	Public
Press releases & PR feeds	"New CRO" announcements via Business Wire, PR Newswire. Used to confirm a LinkedIn change isn't a profile error.	Public
SEC 8-K filings	For public companies — material exec changes are required disclosures.	Public
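One way to express "confirm before firing" is to require that a move be reported by more than one distinct source. The sketch below assumes a threshold of two sources and matches on lowercase person and company names; both choices, and the names `ExecObservation` and `confirmed_moves`, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecObservation:
    person: str
    company: str
    new_title: str
    source: str  # "linkedin", "press_release", or "sec_8k"


def confirmed_moves(
    observations: list[ExecObservation], min_sources: int = 2
) -> set[tuple[str, str]]:
    """Return (person, company) pairs seen in at least min_sources distinct sources."""
    sources_by_move: dict[tuple[str, str], set[str]] = {}
    for ob in observations:
        key = (ob.person.lower(), ob.company.lower())
        sources_by_move.setdefault(key, set()).add(ob.source)
    return {key for key, srcs in sources_by_move.items() if len(srcs) >= min_sources}
```

A LinkedIn title change with no matching press release or 8-K stays unconfirmed under this rule, which is the profile-error case the table describes.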

Hiring spikes.

Job-posting volume changes, role-type changes, geo expansion. The noisiest signal category, since most teams hire all the time. We threshold for "spike" relative to that team's 90-day baseline, not absolute count.

Hiring spikes 2 sources · ~210 events/week

12-hour refresh

Source	What we use it for	Tier
Greenhouse public job boards	Most B2B SaaS companies use Greenhouse — public boards are scrape-friendly via official API.	Public
Lever public job boards	Same pattern for Lever-based ATS deployments. Combined coverage hits ~70% of our target market.	Public
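Thresholding against a baseline rather than an absolute count can be sketched like this. The 2x ratio, the 7-day "recent" window, and the function name `is_hiring_spike` are assumptions; only the 90-day baseline comes from this page.

```python
from statistics import mean


def is_hiring_spike(
    daily_new_postings: list[int], ratio: float = 2.0, recent_days: int = 7
) -> bool:
    """daily_new_postings is ordered oldest first, one entry per day."""
    if len(daily_new_postings) <= recent_days:
        return False  # not enough history to form a baseline
    # Baseline: up to 90 days immediately before the recent window.
    baseline_window = daily_new_postings[-(90 + recent_days):-recent_days]
    recent_window = daily_new_postings[-recent_days:]
    baseline = mean(baseline_window)
    recent = mean(recent_window)
    if baseline == 0:
        return recent > 0  # any posting is a change for a team that never hires
    return recent >= ratio * baseline
```

A team posting 100 roles a day, every day, never triggers under this rule, while a team that goes from 1 a day to 5 a day does.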

Tech stack changes.

Software added, dropped, swapped on a target's marketing site, app, or job postings. Powers /lookup and the stack-change signal type. We detect ~1,400 tools across CRM, sequencer, MarTech, observability, data warehouse.

Tech stack changes 3 sources · 1,400+ tools covered

24-hour refresh

Source	What we use it for	Tier
HTTP fingerprinting	Headers, cookies, embedded scripts on the target's public sites. Same fingerprint pattern Wappalyzer / BuiltWith use.	Public
JS bundle inspection	Pattern-match against ~1,400 known vendor SDKs in the page's JavaScript bundles.	Public
Job-posting tool mentions	"Experience with Salesforce required" in a job description is a strong stack-confirmation signal. Cross-checked with HTTP detection.	Public
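A rough sketch of script-based detection with a job-posting cross-check. The three-entry `SIGNATURES` table is illustrative only and is not the ~1,400-tool catalog; the function names and the simple case-insensitive name match against job text are likewise assumptions.

```python
import re

# Illustrative signatures keyed by tool name (hypothetical subset).
SIGNATURES = {
    "HubSpot": re.compile(r"js\.hs-scripts\.com|js\.hsforms\.net"),
    "Segment": re.compile(r"cdn\.segment\.com/analytics\.js"),
    "Intercom": re.compile(r"widget\.intercom\.io"),
}

# Pull the src attribute out of each <script> tag.
SCRIPT_SRC = re.compile(r'<script[^>]+src=["\']([^"\']+)["\']', re.IGNORECASE)


def detect_tools(html: str) -> set[str]:
    """Return tools whose signature matches any script URL on the page."""
    srcs = SCRIPT_SRC.findall(html)
    return {
        tool
        for tool, pattern in SIGNATURES.items()
        if any(pattern.search(src) for src in srcs)
    }


def confirmed_tools(html: str, job_text: str) -> set[str]:
    """Keep only detections that a job posting also mentions by name."""
    text = job_text.lower()
    return {tool for tool in detect_tools(html) if tool.lower() in text}
```

Requiring both a page fingerprint and a job-posting mention trades recall for precision, which suits a signal category where false positives are the main risk.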

Voice mining.

Podcasts, interviews, conference talks, panel discussions, public blog posts from exec-level speakers at target accounts. Lowest false-positive rate of any signal type — execs only say things publicly that they want quoted.

Voice mining 4 sources · ~3,400 hours/week scanned

Daily refresh

Source	What we use it for	Tier
Podcast RSS feeds	~600 B2B-focused podcast feeds. Audio transcribed via in-house Whisper deployment, then quote-extracted by exec name.	Public
YouTube channel feeds	Conference recordings, panel talks. Auto-captions used where available, fallback to in-house transcription.	Public
Substack & corporate blogs	Text content under exec bylines. Polled hourly; quote extraction same as audio pipeline.	Public
Conference speaker lists	Public speaker-list pages from SaaStr, RevOps Co-op, similar events. Used to seed which execs to listen for in upcoming releases.	Public

What we never use.

Sources we've evaluated and explicitly decided not to touch — either because the legal basis is weak, the data is unreliable, or it would betray the trust that makes the rest of this site credible.

Never used · by principle Eight things you'll never see in a Mama signal.

Personal email scraping. No data from your buyer's personal Gmail / Outlook. Ever. Even when the inbox is "public" via Apollo-style exposure.

Cookie-tracker data. No buying-intent feeds from third-party cookie networks. The third-party-cookie era was a fragile foundation; we're not building on it.

Stolen breach data. No use of leaked credential dumps, even when legally accessible. Lots of vendors do this. We won't.

Phone metadata. No call-pattern data, no voicemail metadata, no SMS scraping. We don't touch the phone layer at all.

Cross-customer data fusion without consent. Your CRM's contact list doesn't feed any other customer's account scoring. Workspace boundaries are real, not marketing copy.

Predictive "intent" without provenance. Some vendors fire "intent signals" with no underlying event. If we can't show you the source link or the public artifact, we don't fire the signal.

Visitor de-anonymization. No tying anonymous web visits to a named person. Aggregate company-level visit signals only, and only when the customer has set them up themselves.

Generative inference passed off as fact. If an LLM wrote a "signal" without a source link, we tag it as inference, not fact. Most teams hate this; we think it's the floor.

If a source you'd expect to see is missing from both lists, email us — we either haven't gotten to it, or we've evaluated and rejected it for a reason we should add here.

Refresh cadence at a glance.

How often each source category gets re-pulled. Cadences are deliberately staggered — funding is fastest because the news window is short; tech stack is slowest because it changes slowly and false positives spike on tighter polling.

Funding events

2 hours

Faster than press cycles

Exec moves

6 hours

Matches LinkedIn rate-limit budget

Hiring + voice

12 hours

Daytime + overnight pulls

Tech stack + firmo

24 hours

Overnight, all accounts in your CRM
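The staggered cadences above can be written down as a small schedule table. The intervals are taken from the list above; the category keys, the `REFRESH_INTERVAL` name, and the `due_categories` helper are hypothetical.

```python
from datetime import datetime, timedelta

# Intervals from the at-a-glance list; key names are assumptions.
REFRESH_INTERVAL = {
    "funding": timedelta(hours=2),
    "exec_moves": timedelta(hours=6),
    "hiring": timedelta(hours=12),
    "voice": timedelta(hours=12),
    "tech_stack": timedelta(hours=24),
    "firmographic": timedelta(hours=24),
}


def due_categories(last_pulled: dict[str, datetime], now: datetime) -> list[str]:
    """Return categories whose interval has elapsed, or that were never pulled."""
    due = []
    for category, interval in REFRESH_INTERVAL.items():
        last = last_pulled.get(category)
        if last is None or now - last >= interval:
            due.append(category)
    return due
```

Three hours after a full pull, only the funding category is due again; everything else waits for its own window.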

The account score itself re-runs every 6 hours regardless — even if no new source data lands, the recency decay shifts the score. Customers on Pro can force a re-score outside the cycle via the API.
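To see why a score moves even when no new data lands, here is a sketch that assumes exponential recency decay. The decay shape, the 7-day (168-hour) half-life, and the function names are hypothetical; this page does not specify how the decay is computed.

```python
def decayed_weight(
    base_weight: float, age_hours: float, half_life_hours: float = 168.0
) -> float:
    """Exponential decay: the weight halves every half_life_hours."""
    return base_weight * 0.5 ** (max(age_hours, 0.0) / half_life_hours)


def account_score(signals: list[tuple[float, float]], now_hours: float) -> float:
    """signals holds (base_weight, fired_at_hours) pairs on a shared clock."""
    return sum(
        decayed_weight(weight, now_hours - fired_at) for weight, fired_at in signals
    )
```

Re-running `account_score` six hours later with the same signals returns a slightly lower number, which is the shift the scheduled re-score picks up.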

The three crawler tiers

"Crawlers" gets thrown around loosely. In Mama, there are three distinct tiers, and they do different jobs. Don't conflate them.

Tier 1 — Built-in 24/7 crawlers. Mama's own 10+ crawlers covering the 1,000+ source feeds listed above. Run continuously, hitting every account in every workspace on the cadences in §9. Free on every plan. Out of your control by design — we run them, we tune them, you benefit.

Tier 2 — On-demand crawls. Pro and Company override: re-run our built-in crawlers on one specific account, right now. Returns in ~30 seconds. Useful for consultancies prepping for a client call, teams responding to a same-day event, or anyone who just heard news they want validated against fresh data. Credit-based — 100/mo on Pro, 500/mo on Company. Available via the dashboard and the API.

Tier 3 — Custom bots. The Pro/Company-only tier where you tell Mama what else to watch. Point at any public source we don't cover by default — a competitor's pricing page, a specific subreddit, a niche Substack, a GitHub repo, a public Discord channel, a custom URL with CSS-selector extraction. Custom bots run on the same crawler infrastructure as our built-in ones (same auto-retry, rate-limiting, proxy rotation, failover). Pro = 25 active; Company = 100 active. Hard rules: public sources only (no scraping behind logins), respect robots.txt, auto-pause after 3 consecutive failures. See /changelog for limits + the dashboard for the builder.
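Two of the hard rules above, robots.txt compliance and auto-pause after 3 consecutive failures, can be sketched with the standard library. The `CustomBot` class, its fields, and the `allowed_by_robots` helper are hypothetical names for illustration; the robots.txt text is passed in directly so the example makes no network calls.

```python
from dataclasses import dataclass
from urllib.robotparser import RobotFileParser

MAX_CONSECUTIVE_FAILURES = 3  # limit stated on this page


@dataclass
class CustomBot:
    url: str
    consecutive_failures: int = 0
    paused: bool = False

    def record(self, success: bool) -> None:
        """Track run outcomes; a success resets the failure streak."""
        if success:
            self.consecutive_failures = 0
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= MAX_CONSECUTIVE_FAILURES:
            self.paused = True


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against already-fetched robots.txt content."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Because a success resets the streak, a flaky source that fails twice and then recovers keeps running; only three failures in a row pause the bot.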

Why this matters: the built-in crawlers cover the obvious sources (funding wires, job boards, G2, Reddit). Custom bots cover the niche-but-high-signal ones: the specific Substack a buyer reads, the Discord channel where deals get sourced. The result is a compounding signal advantage over tools that only resell vendor data.

When sources change.

Sources change. Vendors deprecate APIs. Press wires re-architect. LinkedIn ships a tighter rate limit (see the April 5 post-mortem for what happens when they do). Three commitments about how we handle source-side changes:

Adds get a /changelog entry tagged sources. New source, new tier, what it's used for, what changed.
Removals get a 30-day notice in the customer dashboard before the source stops being used. Plus a changelog entry. Nothing silent.
Material confidence changes get an alert on accounts where that source was a primary signal driver. "We used to rate this Tier 2; now it's Tier 3 because the vendor changed terms" — flagged on every affected brief.

If you ever see a signal where the source link 404s or the cited article has been pulled, that's a bug — tell us and we'll pull the signal from the affected briefs while we fix the source.

Looking for sub-processors instead of source feeds? The vendors we pay (Mercury, Linear, Stripe, AWS, etc.) are listed on /security. Source feeds are different from sub-processors — this page is for the former, that page for the latter.


Questions about a specific source before you start?

If a procurement or security person needs more depth than this page offers, email [email protected]. We'll share the per-source legal-basis memo we keep internally, under NDA.
