Where Mama's data comes from.

Honest provenance. Each data category has its own pipeline: firmographic from 4 providers + our crawl, technographic from JS sniffing + DNS + careers parsing, signals from 8 specialized pipelines, decision-makers from LinkedIn + careers + team pages, voice from public web.

Time: 6 min · Updated: 2026-05-25 · Audience: curious users, RevOps evaluating data quality

TL;DR

5 data categories, 5 different pipelines. Firmographic: Crunchbase, PitchBook, ZoomInfo, Apollo, plus our own crawl. Technographic: client-side JS detection + DNS + careers-page parsing. Signals: 8 pipelines (one per signal type). Decision-makers: LinkedIn + team pages + careers. Voice: public web mining. We're transparent about every source — see /sources for the public version.

01 5-category overview

Category	Primary source(s)	Coverage
Firmographic	Crunchbase + PitchBook + ZoomInfo + Apollo + Mama crawl	~14M companies
Technographic	JS sniffing + DNS/MX + careers-page parser + 3rd-party feeds	~9M companies with stack data
Signals (×8)	8 specialized pipelines (see Signal types overview)	Real-time to daily
Decision-makers	LinkedIn + team pages + careers pages + ZoomInfo	~80M people
Voice	Public web mining (blogs, podcasts, earnings, social)	Top 50K companies + key personas

02 Firmographic data pipeline

Four third-party providers are cross-referenced against our own crawl. Providers routinely disagree, so we apply a confidence-weighted merge and record, field by field, which source supplied the value.

Crunchbase — funding, founding date, headquarters, employee count band
PitchBook — investor data, valuation, late-stage rounds
ZoomInfo — employee count, industry, revenue band
Apollo — contact-heavy; firmographic secondary
Mama crawl — website, blog, press releases, structured data, real-time recency

Each fact in a brief carries a verify pill showing which source it came from and when it was last verified.
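
A confidence-weighted merge with per-field attribution can be sketched as below. This is a minimal illustration, not Mama's actual implementation: the `Observation` type, the `SOURCE_WEIGHT` values, and the rule that agreeing sources pool their weight are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Observation:
    source: str        # e.g. "crunchbase", "zoominfo", "mama_crawl"
    field: str         # e.g. "employee_count_band"
    value: str
    verified_on: date  # when this source last confirmed the value


# Hypothetical weights for illustration only; real weights would
# probably differ per field as well as per source.
SOURCE_WEIGHT = {
    "crunchbase": 0.8,
    "pitchbook": 0.8,
    "zoominfo": 0.7,
    "apollo": 0.5,
    "mama_crawl": 0.9,
}


def merge(observations):
    """Pick one value per field by summed source weight, keeping attribution."""
    by_field = defaultdict(list)
    for obs in observations:
        by_field[obs.field].append(obs)

    merged = {}
    for field, group in by_field.items():
        # Sources that agree on a value pool their weight behind it.
        score = defaultdict(float)
        for obs in group:
            score[obs.value] += SOURCE_WEIGHT.get(obs.source, 0.1)
        winner = max(score, key=score.get)
        backers = [o for o in group if o.value == winner]
        merged[field] = {
            "value": winner,
            "sources": sorted(o.source for o in backers),
            "last_verified": max(o.verified_on for o in backers),
        }
    return merged
```

The `sources` and `last_verified` keys carry the same two pieces of information the verify pill exposes: where a fact came from and when it was last checked.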

03 Technographic data pipeline

Four detection layers. We score each detection by source reliability: a JS-sniff confirmation outranks a DNS/MX record, which outranks a careers-page mention, which outranks a third-party feed.

Layer	Sees	Confidence
JS sniffing	Frontend tools that inject scripts (analytics, chat, A/B testing)	Very high
DNS & MX	Email provider, hosting, CDN, third-party subdomain pointers	High
Careers page parser	Stack mentions in JDs (Snowflake, dbt, etc.)	Medium-high
Third-party feeds	BuiltWith, partner directories, public case studies	Medium
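
The ranking in the table can be sketched as an ordered enum, with each tool keeping the most reliable layer that detected it. This is a rough sketch under assumptions: the `Layer` names and the numeric values are invented for illustration and only encode the ordering shown above.

```python
from enum import IntEnum


class Layer(IntEnum):
    # Higher value = more reliable, mirroring the table's ordering.
    THIRD_PARTY_FEED = 1  # Medium
    CAREERS_PAGE = 2      # Medium-high
    DNS_MX = 3            # High
    JS_SNIFF = 4          # Very high


def best_evidence(detections):
    """Map each tool to the most reliable layer that detected it.

    `detections` is an iterable of (tool_name, Layer) pairs.
    """
    best = {}
    for tool, layer in detections:
        if tool not in best or layer > best[tool]:
            best[tool] = layer
    return best
```

For example, a tool seen in both a third-party feed and a JS sniff would be reported at `Layer.JS_SNIFF`.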

04 Signals: 8 specialized pipelines

Each signal type has its own pipeline. The full per-pipeline source mapping:

Funding — Crunchbase + PitchBook + SEC EDGAR + press wires
Hiring — Greenhouse + Lever + Ashby + Workday + LinkedIn job posts
Exec moves — LinkedIn + press releases + Mama crawl on /about pages
Tech changes — JS sniffing + DNS diffs + careers parser deltas
Product launches — Product Hunt + RSS blog feeds + changelog scrapers + press wires
Office moves — Press releases + commercial real estate feeds + Crunchbase
Job changes — LinkedIn (via Chrome extension) + X bios + GitHub orgs + Substack bios
Custom bots — User-defined sources

Each pipeline has its own latency profile — see Refresh cadence.
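
The tech-changes pipeline is described in terms of diffs and deltas. One way to picture that, assuming each crawl yields a snapshot of detected tool names (the function name and the snapshot shape are hypothetical), is a plain set difference:

```python
def stack_changes(previous, current):
    """Compare two snapshots of detected tools and report what changed."""
    previous, current = set(previous), set(current)
    return {
        "added": sorted(current - previous),
        "removed": sorted(previous - current),
    }
```

Comparing `{"Intercom", "Segment"}` with `{"Segment", "Snowflake"}` would report Snowflake as added and Intercom as removed. A real pipeline would presumably also check detection confidence before emitting a signal.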

05 Decision-makers pipeline

People data comes from:

LinkedIn — current role, title, recent activity (public posts), connection graph context (via Chrome extension)
Team pages — company-published team and about pages
Careers pages — hiring manager attribution on JDs
ZoomInfo — verified email + phone (where licensed)

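Emails are re-verified over SMTP every 7 days. The check is a handshake only: we confirm that the mail server accepts the address, then disconnect without sending a message. A Verified pill on a person card means the address was SMTP-confirmed within the last 7 days.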
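
A verify-then-discard handshake can be sketched with Python's standard `smtplib`. This is an illustration under assumptions, not the production verifier: the probe sender address, the `smtp_factory` parameter (there so the function can be exercised without a network), and both function names are hypothetical. Some servers accept every recipient or refuse probes altogether, so a 2xx reply is evidence rather than proof.

```python
import smtplib
from datetime import datetime, timedelta
from typing import Optional

VERIFY_WINDOW = timedelta(days=7)  # from the text: re-verified every 7 days


def mailbox_accepts(mx_host, address, probe_sender="verify@example.com",
                    smtp_factory=smtplib.SMTP):
    """Return True if the server accepts RCPT TO for `address`.

    The session stops before DATA, so no message is ever sent.
    """
    client = smtp_factory(mx_host, 25, timeout=10)
    try:
        client.ehlo()
        client.mail(probe_sender)
        code, _ = client.rcpt(address)
        return 200 <= code < 300
    finally:
        client.quit()


def is_verified(last_confirmed: Optional[datetime], now: datetime) -> bool:
    """Verified pill rule: SMTP-confirmed within the last 7 days."""
    return last_confirmed is not None and now - last_confirmed <= VERIFY_WINDOW
```

Looking up the MX host for a domain is left out because it needs a DNS library outside the standard library.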

06 Voice pipeline

Public web mining across 6 source types:

Earnings calls — full transcripts for ~5K public companies, with speaker attribution
Blog posts — 20K+ company blogs via RSS
Conference talks — YouTube descriptions + transcripts (auto-CC) for major industry events
Podcast appearances — top 5K tech/business podcasts, transcripts via Listen Notes API + our own transcription
X (Twitter) — public posts from tracked executives
LinkedIn posts — public posts (no login required)
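
For the blog source, reading posts from an RSS feed can be sketched with the standard library alone. This assumes a plain RSS 2.0 feed whose text has already been fetched; Atom feeds use different, namespaced element names and would need separate handling. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def rss_items(xml_text):
    """Extract title, link, and pubDate from each <item> in an RSS 2.0 feed."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default="").strip(),
            "link": item.findtext("link", default="").strip(),
            "published": item.findtext("pubDate", default="").strip(),
        })
    return items
```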

07 Why we publish provenance

Most data tools obscure sources. We publish them. Reasons:

Trust — users should know what they're acting on
Verifiability — every fact in a brief links back to its source where possible
Accountability — when data is wrong, we can trace the failure quickly
Competitive honesty — third-party APIs aren't a moat, because competitors can license the same ones; our pipeline and signal combinatorics are

08 Common mistakes

Assuming all data refreshes at the same cadence

It doesn't — firmographic is weekly, signals are real-time to daily, voice is daily to weekly. See Refresh cadence.

Treating verify pills as suggestions

They're load-bearing. Verified > Stale > Inferred is a real confidence ladder, ordered from most to least trustworthy. Use it.

Asking us to source-attribute everything inside the UI

Every fact is already source-attributed through its verify pill: hover over or click the pill to see the source. We don't display the source inline on every fact by default because it adds noise.
