Data & sources · 01 of 04

Where Mama's data comes from.

Honest provenance. Each data category has its own pipeline: firmographic from 4 providers + our crawl, technographic from JS sniffing + DNS + careers parsing + third-party feeds, signals from 8 specialized pipelines, decision-makers from LinkedIn + careers + team pages + ZoomInfo, voice from the public web.

Time: 6 min · Updated: 2026-05-25 · Audience: curious users, RevOps evaluating data quality

TL;DR

5 data categories, 5 different pipelines. Firmographic: Crunchbase, PitchBook, ZoomInfo, Apollo, plus our own crawl. Technographic: client-side JS detection + DNS + careers-page parsing + third-party feeds. Signals: 8 pipelines (one per signal type). Decision-makers: LinkedIn + team pages + careers pages + ZoomInfo. Voice: public web mining. We're transparent about every source; see /sources for the public version.

01 5-category overview

Category | Primary source(s) | Coverage
Firmographic | Crunchbase + PitchBook + ZoomInfo + Apollo + Mama crawl | ~14M companies
Technographic | JS sniffing + DNS/MX + careers-page parser + 3rd-party feeds | ~9M companies with stack data
Signals (×8) | 8 specialized pipelines (see Signal types overview) | Real-time to daily
Decision-makers | LinkedIn + team pages + careers pages + ZoomInfo | ~80M people
Voice | Public web mining (blogs, podcasts, earnings, social) | Top 50K companies + key personas

02 Firmographic data pipeline

4 third-party providers cross-referenced against our own crawl. Disagreement is the norm — we apply a confidence-weighted merge with explicit per-field source attribution.

  • Crunchbase — funding, founding date, headquarters, employee count band
  • PitchBook — investor data, valuation, late-stage rounds
  • ZoomInfo — employee count, industry, revenue band
  • Apollo — contact-heavy; firmographic secondary
  • Mama crawl — website, blog, press releases, structured data, real-time recency

Each fact in a brief carries a verify pill showing which source it came from and when it was last verified.
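To make the confidence-weighted merge concrete, here is a minimal sketch of how conflicting provider values for one field could be resolved while keeping per-field attribution. The weights, the `Claim` type, and `merge_field` are hypothetical names chosen for illustration; the actual weights and merge logic are not documented on this page.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical per-source weights (assumption: real weights are not published here).
SOURCE_WEIGHT = {
    "crunchbase": 0.8,
    "pitchbook": 0.8,
    "zoominfo": 0.7,
    "apollo": 0.5,
    "mama_crawl": 0.9,
}


@dataclass(frozen=True)
class Claim:
    """One provider's value for a single field, with its verification date."""
    source: str
    value: str
    verified_on: date


def merge_field(claims: list[Claim]) -> dict:
    """Pick the value with the highest summed source weight and keep attribution."""
    if not claims:
        raise ValueError("at least one claim is required")
    totals: dict[str, float] = {}
    for claim in claims:
        # Unknown sources get a small default weight instead of being dropped.
        totals[claim.value] = totals.get(claim.value, 0.0) + SOURCE_WEIGHT.get(claim.source, 0.1)
    winner = max(totals, key=totals.get)
    backers = [c for c in claims if c.value == winner]
    return {
        "value": winner,
        "confidence": round(totals[winner] / sum(totals.values()), 2),
        "sources": sorted(c.source for c in backers),
        "last_verified": max(c.verified_on for c in backers),
    }


claims = [
    Claim("crunchbase", "51-200", date(2026, 5, 1)),
    Claim("zoominfo", "201-500", date(2026, 5, 10)),
    Claim("mama_crawl", "51-200", date(2026, 5, 20)),
]
merged = merge_field(claims)
```

In this sketch, the `sources` and `last_verified` keys carry the kind of information a verify pill displays: which sources back the chosen value and when it was last confirmed.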

03 Technographic data pipeline

Four detection layers. We score each detection by source reliability — a JS-sniff confirmation outranks a careers-page mention which outranks a third-party feed.

Layer | Sees | Confidence
JS sniffing | Frontend tools that inject scripts (analytics, chat, A/B testing) | Very high
DNS & MX | Email provider, hosting, CDN, third-party subdomain pointers | High
Careers page parser | Stack mentions in JDs (Snowflake, dbt, etc.) | Medium-high
Third-party feeds | BuiltWith, partner directories, public case studies | Medium
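The reliability ordering in the table can be sketched as a simple ranking. The layer keys, the numeric ranks, and `best_detection` are illustrative assumptions; only the relative order comes from the table.

```python
# Relative order taken from the table above; the numbers themselves are illustrative.
LAYER_RANK = {
    "js_sniffing": 4,       # Very high
    "dns_mx": 3,            # High
    "careers_parser": 2,    # Medium-high
    "third_party_feed": 1,  # Medium
}


def best_detection(detections):
    """For each tool, keep the detection that came from the most reliable layer."""
    best = {}
    for tool, layer in detections:
        if tool not in best or LAYER_RANK[layer] > LAYER_RANK[best[tool]]:
            best[tool] = layer
    return best


detections = [
    ("Segment", "third_party_feed"),
    ("Segment", "js_sniffing"),
    ("Snowflake", "careers_parser"),
]
result = best_detection(detections)
```

Here the JS-sniff confirmation for Segment replaces the weaker third-party-feed mention, while Snowflake keeps its only detection.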

04 Signals: 8 specialized pipelines

Each signal type has its own pipeline. The full per-pipeline source mapping:

  • Funding — Crunchbase + PitchBook + SEC EDGAR + press wires
  • Hiring — Greenhouse + Lever + Ashby + Workday + LinkedIn job posts
  • Exec moves — LinkedIn + press releases + Mama crawl on /about pages
  • Tech changes — JS sniffing + DNS diffs + careers parser deltas
  • Product launches — Product Hunt + RSS blog feeds + changelog scrapers + press wires
  • Office moves — Press releases + commercial real estate feeds + Crunchbase
  • Job changes — LinkedIn (via Chrome extension) + X bios + GitHub orgs + Substack bios
  • Custom bots — User-defined sources

Each pipeline has its own latency profile — see Refresh cadence.
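As an illustration of the diff-based approach behind the Tech changes signal, the sketch below compares two snapshots of a company's detected stack. `stack_delta` and the snapshot contents are hypothetical; the page does not describe how snapshots are stored or compared.

```python
def stack_delta(previous: set[str], current: set[str]) -> dict[str, list[str]]:
    """Return tools added and removed between two stack snapshots."""
    return {
        "added": sorted(current - previous),
        "removed": sorted(previous - current),
    }


last_week = {"Google Analytics", "Intercom", "Optimizely"}
this_week = {"Google Analytics", "Drift", "Optimizely"}
delta = stack_delta(last_week, this_week)
```

A non-empty `added` or `removed` list is the kind of change that could be emitted as a signal.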

05 Decision-makers pipeline

People data comes from:

  • LinkedIn — current role, title, recent activity (public posts), connection graph context (via Chrome extension)
  • Team pages — company-published team and about pages
  • Careers pages — hiring manager attribution on JDs
  • ZoomInfo — verified email + phone (where licensed)

Emails are re-verified over SMTP every 7 days. The check is a handshake only: we confirm the mailbox accepts mail, then disconnect without sending a message. A Verified pill on a person card means the address was SMTP-confirmed within the last 7 days.
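A handshake-only check and the 7-day freshness rule might look roughly like the sketch below, built on Python's standard `smtplib`. The function names, the probe sender address, and the assumption that the MX host is already known are all illustrative; this is not a description of the production verifier.

```python
import smtplib
from datetime import datetime, timedelta, timezone

VERIFY_WINDOW = timedelta(days=7)  # the 7-day window stated above


def mailbox_accepts(address: str, mx_host: str,
                    sender: str = "probe@example.com",  # hypothetical probe sender
                    timeout: float = 10.0) -> bool:
    """Run HELO, MAIL FROM, and RCPT TO, then quit without ever sending DATA."""
    with smtplib.SMTP(mx_host, 25, timeout=timeout) as server:
        server.helo()
        server.mail(sender)
        code, _ = server.rcpt(address)
    # Leaving the with-block issues QUIT, so no message is delivered.
    return code == 250


def shows_verified_pill(last_confirmed: datetime, now: datetime) -> bool:
    """True when the last SMTP confirmation falls inside the 7-day window."""
    return timedelta(0) <= now - last_confirmed <= VERIFY_WINDOW


now = datetime(2026, 5, 25, tzinfo=timezone.utc)
fresh = shows_verified_pill(now - timedelta(days=3), now)
stale = shows_verified_pill(now - timedelta(days=8), now)
```

Some mail servers accept every recipient at the RCPT stage, so a 250 response is evidence of deliverability, not proof.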

06 Voice pipeline

Public web mining across 6 source types:

  • Earnings calls — full transcripts for ~5K public companies, with speaker attribution
  • Blog posts — 20K+ company blogs via RSS
  • Conference talks — YouTube descriptions + transcripts (auto-CC) for major industry events
  • Podcast appearances — top 5K tech/business podcasts, transcripts via Listen Notes API + our own transcription
  • X (Twitter) — public posts from tracked executives
  • LinkedIn posts — public posts (no login required)

07 Why we publish provenance

Most data tools obscure sources. We publish them. Reasons:

  • Trust — users should know what they're acting on
  • Verifiability — every fact in a brief links back to its source where possible
  • Accountability — when data is wrong, we can trace the failure quickly
  • Competitive honesty: the third-party APIs we use are available to competitors too, so they aren't a moat; our pipeline + signal combinatorics are

08 Common mistakes

Assuming all data refreshes at the same cadence
It doesn't — firmographic is weekly, signals are real-time to daily, voice is daily to weekly. See Refresh cadence.
Treating verify pills as suggestions
They're load-bearing. Inferred, Stale, and Verified form a real confidence ladder, with Verified at the top. Use it.
Asking us to source-attribute everything inside the UI
Every fact is source-attributed via the verify pill — hover or click it. We don't show source on every fact by default because it adds noise.