Where Mama's data comes from.
Honest provenance. Each data category has its own pipeline: firmographic from 4 providers + our crawl, technographic from JS sniffing + DNS + careers parsing, signals from 8 specialized pipelines, decision-makers from LinkedIn + careers + team pages, voice from public web.
TL;DR
5 data categories, each with its own pipeline. Firmographic: Crunchbase, PitchBook, ZoomInfo, Apollo, plus our own crawl. Technographic: client-side JS detection + DNS + careers-page parsing + third-party feeds. Signals: 8 pipelines (one per signal type). Decision-makers: LinkedIn + team pages + careers pages + ZoomInfo. Voice: public web mining. We're transparent about every source; see /sources for the public version.
01 5-category overview
| Category | Primary source(s) | Coverage |
|---|---|---|
| Firmographic | Crunchbase + PitchBook + ZoomInfo + Apollo + Mama crawl | ~14M companies |
| Technographic | JS sniffing + DNS/MX + careers-page parser + 3rd-party feeds | ~9M companies with stack data |
| Signals (×8) | 8 specialized pipelines (see Signal types overview) | Real-time to daily |
| Decision-makers | LinkedIn + team pages + careers pages + ZoomInfo | ~80M people |
| Voice | Public web mining (blogs, podcasts, earnings, social) | Top 50K companies + key personas |
02 Firmographic data pipeline
4 third-party providers cross-referenced against our own crawl. Disagreement is the norm — we apply a confidence-weighted merge with explicit per-field source attribution.
- Crunchbase — funding, founding date, headquarters, employee count band
- PitchBook — investor data, valuation, late-stage rounds
- ZoomInfo — employee count, industry, revenue band
- Apollo — contact-heavy; firmographic secondary
- Mama crawl — website, blog, press releases, structured data, real-time recency
Each fact in a brief carries a verify pill showing which source it came from and when it was last verified.
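As a rough illustration of what a confidence-weighted merge with per-field source attribution could look like, here is a minimal Python sketch. The source weights, the `Observation` type, and the `merge_field` helper are all hypothetical; the actual weighting is not documented on this page.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical per-source weights, chosen only for illustration.
SOURCE_WEIGHT = {
    "crunchbase": 0.8,
    "pitchbook": 0.8,
    "zoominfo": 0.7,
    "apollo": 0.5,
    "mama_crawl": 0.9,
}


@dataclass(frozen=True)
class Observation:
    field: str          # e.g. "employee_band"
    value: str          # the value this source reported
    source: str         # key into SOURCE_WEIGHT
    verified_on: date   # when this source last confirmed the value


def merge_field(observations):
    """Pick the value with the highest summed source weight for one field."""
    totals = {}
    for obs in observations:
        totals[obs.value] = totals.get(obs.value, 0.0) + SOURCE_WEIGHT[obs.source]
    best_value = max(totals, key=totals.get)
    supporters = [o for o in observations if o.value == best_value]
    newest = max(supporters, key=lambda o: o.verified_on)
    return {
        "value": best_value,
        # Attribution: every source that agreed with the winning value.
        "sources": sorted(o.source for o in supporters),
        "last_verified": newest.verified_on,
        # Share of total weight behind the winner, between 0 and 1.
        "confidence": totals[best_value] / sum(totals.values()),
    }
```

The returned `sources` and `last_verified` fields are the kind of information a verify pill would need to display; the `confidence` value is one simple way to express how strongly the sources agree.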
03 Technographic data pipeline
Four detection layers. We score each detection by source reliability: a JS-sniff confirmation outranks a DNS/MX record, which outranks a careers-page mention, which outranks a third-party feed.
| Layer | Sees | Confidence |
|---|---|---|
| JS sniffing | Frontend tools that inject scripts (analytics, chat, A/B testing) | Very high |
| DNS & MX | Email provider, hosting, CDN, third-party subdomain pointers | High |
| Careers page parser | Stack mentions in JDs (Snowflake, dbt, etc.) | Medium-high |
| Third-party feeds | BuiltWith, partner directories, public case studies | Medium |
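The reliability ordering in the table can be sketched as a simple ranking. This is a minimal Python illustration, assuming each detection is a `(tool, layer)` pair; the `Layer` enum and `best_evidence` helper are hypothetical names, and the integer ranks only encode the order of the Confidence column, not real scores.

```python
from enum import IntEnum


class Layer(IntEnum):
    # Higher value = more reliable, following the Confidence column above.
    THIRD_PARTY_FEED = 1   # Medium
    CAREERS_PAGE = 2       # Medium-high
    DNS_MX = 3             # High
    JS_SNIFF = 4           # Very high


def best_evidence(detections):
    """Map each tool to the most reliable layer that detected it."""
    best = {}
    for tool, layer in detections:
        if tool not in best or layer > best[tool]:
            best[tool] = layer
    return best


detections = [
    ("Snowflake", Layer.THIRD_PARTY_FEED),
    ("Snowflake", Layer.CAREERS_PAGE),
    ("Segment", Layer.JS_SNIFF),
]
for tool, layer in best_evidence(detections).items():
    print(tool, layer.name)
```

Keeping only the strongest layer per tool is one possible design; a fuller version might keep every layer so that agreement between layers can raise confidence further.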
04 Signals — 8 specialized pipelines
Each signal type has its own pipeline. The full per-pipeline source mapping:
- Funding — Crunchbase + PitchBook + SEC EDGAR + press wires
- Hiring — Greenhouse + Lever + Ashby + Workday + LinkedIn job posts
- Exec moves — LinkedIn + press releases + Mama crawl on /about pages
- Tech changes — JS sniffing + DNS diffs + careers parser deltas
- Product launches — Product Hunt + RSS blog feeds + changelog scrapers + press wires
- Office moves — Press releases + commercial real estate feeds + Crunchbase
- Job changes — LinkedIn (via Chrome extension) + X bios + GitHub orgs + Substack bios
- Custom bots — User-defined sources
Each pipeline has its own latency profile — see Refresh cadence.
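The Tech changes pipeline is described above in terms of diffs and deltas. One way to picture that, assuming each crawl produces a set of detected tool names per company, is a plain set difference between two snapshots. The `diff_stack` function and the snapshot shape are assumptions made for this sketch.

```python
def diff_stack(previous: set[str], current: set[str]) -> dict[str, list[str]]:
    """Compare two stack snapshots and report what changed."""
    return {
        "added": sorted(current - previous),     # tools seen now but not before
        "removed": sorted(previous - current),   # tools seen before but not now
    }


last_week = {"Google Analytics", "Intercom", "Optimizely"}
today = {"Google Analytics", "Drift", "Optimizely"}
print(diff_stack(last_week, today))
```

A non-empty `added` or `removed` list is what would become a tech-change signal in this simplified model.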
05 Decision-makers pipeline
People data comes from:
- LinkedIn — current role, title, recent activity (public posts), connection graph context (via Chrome extension)
- Team pages — company-published team and about pages
- Careers pages — hiring manager attribution on JDs
- ZoomInfo — verified email + phone (where licensed)
Emails are re-verified every 7 days with an SMTP handshake: we check whether the mail server accepts the address, then disconnect without sending a message. A Verified pill on a person card means the address was SMTP-confirmed within the last 7 days.
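To make the handshake concrete, here is a minimal sketch using Python's standard `smtplib`. It stops after `RCPT TO`, so no message body is ever transmitted. The function names, the probe sender address, and the injectable `smtp_factory` parameter are assumptions for illustration, not a description of the production verifier.

```python
import smtplib
from datetime import datetime, timedelta, timezone

VERIFY_WINDOW = timedelta(days=7)  # the 7-day window described in the text


def smtp_accepts(address, mx_host, probe_sender="probe@example.com",
                 smtp_factory=smtplib.SMTP, timeout=10):
    """Ask the mail server whether it would accept mail for `address`.

    The session ends after RCPT TO, so nothing is delivered.
    """
    server = smtp_factory(mx_host, 25, timeout=timeout)
    try:
        server.helo()
        server.mail(probe_sender)
        code, _ = server.rcpt(address)
        return 200 <= code < 300   # 2xx replies mean the recipient was accepted
    finally:
        try:
            server.quit()
        except smtplib.SMTPException:
            pass  # the server may already have dropped the connection


def shows_verified_pill(last_confirmed, now=None):
    """True if the last SMTP confirmation falls inside the 7-day window."""
    now = now or datetime.now(timezone.utc)
    return now - last_confirmed <= VERIFY_WINDOW
```

Note that a 2xx reply is not conclusive on its own: some servers accept every recipient (catch-all domains) and others defer unknown senders, so a real verifier would need extra handling for those cases.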
06 Voice pipeline
Public web mining across 6 source types:
- Earnings calls — full transcripts for ~5K public companies, with speaker attribution
- Blog posts — 20K+ company blogs via RSS
- Conference talks — YouTube descriptions + transcripts (auto-CC) for major industry events
- Podcast appearances — top 5K tech/business podcasts, transcripts via Listen Notes API + our own transcription
- X (Twitter) — public posts from tracked executives
- LinkedIn posts — public posts (no login required)
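For the blog source type, collecting posts via RSS can be pictured with a small standard-library parser. This sketch assumes plain RSS 2.0 feeds with `<item>` elements; the `parse_rss_items` helper is hypothetical, and real feeds (Atom, namespaced extensions, malformed XML) would need more care.

```python
import xml.etree.ElementTree as ET


def parse_rss_items(xml_text):
    """Extract title, link, and publication date from each RSS 2.0 item."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "published": (item.findtext("pubDate") or "").strip(),
        })
    return items


sample_feed = """<rss version="2.0"><channel>
  <title>Example Co Blog</title>
  <item>
    <title>Launch notes</title>
    <link>https://example.com/blog/launch</link>
    <pubDate>Mon, 01 Jan 2024 09:00:00 GMT</pubDate>
  </item>
</channel></rss>"""

for post in parse_rss_items(sample_feed):
    print(post["title"], post["link"])
```

Keeping the `link` alongside each item preserves the path back to the original post, which is what source attribution depends on.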
07 Why we publish provenance
Most data tools obscure sources. We publish them. Reasons:
- Trust — users should know what they're acting on
- Verifiability — every fact in a brief links back to its source where possible
- Accountability — when data is wrong, we can trace the failure quickly
- Competitive honesty — access to the same APIs our competitors use isn't a moat; our pipeline and signal combinatorics are