How we build trustworthy financial data
111M+ SEC facts across 19,000+ entities — point-in-time accurate, survivorship-free, and auditable back to its source filing. Here is exactly how each guarantee is built.
Bloomberg, WRDS, and Compustat make claims about point-in-time accuracy and survivorship-bias-free coverage. We document exactly how ours work — from coverage and survivorship to concept standardization, amendment handling, accepted_at semantics, the smart-money dataset, and the validation checks that run on every release. Then we show you how to audit any of it yourself.
Why XBRL is hard
The SEC has required XBRL submissions since 2009. The format is machine-readable, but standardization stops there. Each filer picks their own taxonomy: us-gaap, ifrs-full, or a custom extension. A single concept like “revenue” resolves to a dozen possible XBRL tags depending on the company, the year, and whether ASC 606 had been adopted.
Restatements are not corrections — they are new filings with the same fiscal period but different values. Quarterly cash flow statements report year-to-date totals, not quarter-only figures. Foreign filers use 20-F and 40-F instead of 10-K with subtly different concept names. And the companies that went bankrupt stopped filing — so a naïve dataset quietly forgets they ever existed.
Anyone can parse XBRL. Producing a dataset where SELECT revenue FROM fact WHERE ticker = 'AAPL' returns the same values 30 years apart, and where a backtest sees the world exactly as it looked on its as-of date — that's the work.
Coverage & scope
The dataset spans the full SEC EDGAR universe of XBRL filers from 1993–present: every active company and every company that has since delisted, gone bankrupt, or been acquired. It is organized as two datasets — a 111M+-fact fundamentals core and a 78M+-row smart-money dataset — across 17 Parquet tables.
Filing forms covered
10-KAnnual report
10-QQuarterly report
8-KMaterial events
20-FForeign annual report
40-FCanadian annual report
/AAmendments to any of the above
The fundamentals core is sourced from the SEC's quarterly EDGAR Financial Statements Data Sets plus the per-filing XBRL submissions — the same primary source the Commission publishes. Nothing is scraped from third-party aggregators. See the full dataset page for the pipeline and per-table schema.
Survivorship-free by construction
Survivorship bias is the single most expensive hidden flaw in a backtest. Most datasets only keep the companies that are still trading today — so your strategy is silently tested against a universe that already knows which companies survived. The Enrons, the Lehmans, the RadioShacks vanish, and historical returns inflate by a percentage point or two that evaporates the moment you trade live.
Valuein retains every entity that ever filed XBRL financial statements — delisted, bankrupt, acquired, merged — with its complete filing history through its final SEC filing. Roughly half of the 19,000+ entities in the universe are no longer actively trading. They stay in the dataset; they are simply not present on dates after they stopped filing, which is exactly how they would have appeared in real time.
A few of the failures still in the data
Each with complete financial statements through its final filing — so a strategy that would have bought them is held accountable for what happened next.
Point-in-time and survivorship-free are two of the trust guarantees we make. See the trust & security overview for provenance, zero-retention, and reliability — this page covers the data construction underneath them.
Concept standardization
We map 11,966 raw XBRL tags to 292 canonical concepts. Definitions are versioned in a taxonomy_guide table that ships with every Parquet bucket — so you can audit every transformation we apply, and unmapped tags fall through to a labelled Other rather than being silently dropped.
Worked example: Revenue
| Source XBRL tag | Used by | Note |
|---|---|---|
| us-gaap:Revenues | Apple, Microsoft | Most common |
| us-gaap:RevenueFromContractWithCustomerExcludingAssessedTax | Tesla, Walmart | Post-ASC 606 adoption |
| us-gaap:SalesRevenueNet | Pre-2018 filers | Legacy tag, deprecated |
| us-gaap:RevenueFromContractWithCustomerIncludingAssessedTax | Some retailers | Includes sales tax pass-through |
| msft:Revenues | Microsoft (custom extension) | Custom XBRL extension |
All five resolve to standard_concept = 'TotalRevenue'. Every fact also keeps its raw concept column — the exact XBRL tag the company filed — so you can always trace a standardized value back to source. This 5-row illustration is the level of detail we publish; the canonical name and definition of every concept lives in the data catalog.
Point-in-time, not point-in-hindsight
The most common look-ahead bias in financial data isn't malicious — it's using the wrong date column. Three timestamps live on every fact, and they mean different things.
report_dateWhen the period endede.g. 2024-09-28 (Apple FY2024)
Aligns financials to a fiscal calendar. Never use as a PIT cutoff — companies file weeks or months later.
filing_dateWhen the filing was submittede.g. 2024-11-01
Useful for filing-cadence analysis. Still not PIT-safe — filings can be accepted hours after the date stamp.
accepted_atWhen SEC accepted it (the canonical PIT field)e.g. 2024-11-01T06:01:36Z
The exact moment the data became public. Use this — and only this — for backtests and any look-ahead-free analysis.
Every PIT-safe MCP tool and SDK method accepts an as_of_date parameter. Internally, that filters on accepted_at <= as_of_date — the queryable equivalent of “what did the market know on this date?”
Amendments and restatements
When a company files a 10-K/A, the SEC treats it as a new filing — not an overwrite. Most data vendors collapse the amendment over the original, destroying the historical view. We keep both.
-- Apple FY2018 net income, original filing
ticker fiscal_year standard_concept numeric_value accepted_at
AAPL 2018 NetIncome 59531000000 2018-11-05T18:23:00Z
-- Apple FY2018 net income, after restatement (hypothetical)
ticker fiscal_year standard_concept numeric_value accepted_at
AAPL 2018 NetIncome 59300000000 2019-02-12T14:51:00ZA backtest that ran on 2018-12-01 sees the first row only — the original $59.531B. A current dashboard sees the latest accepted value. Both are correct; both are queryable. The PIT discipline is what guarantees you get the right one.
Provenance: every number has an identity
A figure is only trustworthy if you can prove where it came from. Every fact in the dataset carries a deterministic fact_id — a SHA-256 hash of the entity, the SEC accession, the concept, the period end, and the unit:
# Deterministic in the pipeline, recomputable in the SDK.fact_id = sha256( f"{entity_id}|{accession_id}|{concept}|{period_end}|{unit}").hexdigest() # The SDK exposes the same function — same inputs, same id, anywhere.from valuein_sdk import compute_fact_idcompute_fact_id(entity_id, accession_id, concept, period_end, unit)Because the hash is deterministic, the fact_id is identical in the Parquet files, the Python SDK, and the MCP server — there is no separate provenance database to drift out of sync; the identity travels with the value. Pass any fact_id to verify_fact_lineage and it resolves back to its source XBRL tag, accession ID, and filing URL for one-click verification against EDGAR.
The envelope on every fact
fact_idA deterministic SHA-256 hash of entity_id, accession_id, concept, period_end, and unit. The same fact computes the same id everywhere — Parquet, SDK, and MCP agree byte-for-byte, so a figure can be re-derived and re-located, never mistaken for another.
confidence_scoreHow directly the value came from the filing — a clean mapped tag scores higher than one recovered through a fallback rule. Lets you threshold on certainty rather than trusting every row equally.
reliability_codeA 1–4 grade of the standardization path, from a primary canonical mapping down to a best-effort fallback. A single integer you can filter or weight on.
restatement_countHow many times this fiscal period has been re-filed. A non-zero count is an immediate signal that the as-reported and current values diverge — and that you should pick the right one for your as-of date.
This envelope is the foundation for everything below: the model never has to be trusted with a number, because the number arrives already attributable, gradeable, and verifiable.
Quarterly cash flow derivation
In Q2 and Q3 10-Q filings, US GAAP requires cash flow statements to report year-to-date totals. Computing a clean quarterly time series requires subtracting the prior quarter — every time, for every issuer, for every line item.
Period numeric_value (YTD) derived_quarterly_value
Q1 2024 12.0B 12.0B
Q2 2024 28.0B 16.0B ← 28.0 − 12.0
Q3 2024 45.0B 17.0B ← 45.0 − 28.0
Q4 2024 62.0B 17.0B ← 62.0 − 45.0Both columns ship in every Parquet bucket. Use COALESCE(derived_quarterly_value, numeric_value) when you want a true quarterly time series; use numeric_value when you specifically want the as-reported YTD figure.
Point-in-time index membership
“What was in the S&P 500 on March 1, 2014?” is a survivorship trap in disguise: screen the index by its current members and you have quietly excluded everyone who was dropped. Index membership is therefore tracked the same way facts are — historically, with effective and removal dates.
The index_membership table records membership spells for S&P 500 and Russell 1000 / 2000 / 3000 with an effective_date and a removal_date per spell, using half-open [effective, removal) interval semantics. A company that left and rejoined gets two spells, not a merged one — so a point-in-time universe on any date reconstructs the index exactly as it stood.
Show the point-in-time query ↓Hide the query ↑
SELECT r.symbol, r.nameFROM index_membership imJOIN references r ON r.cik = im.cikWHERE im.index_name = 'SP500' AND '2014-03-01' >= im.effective_date AND ('2014-03-01' < im.removal_date OR im.removal_date IS NULL);There is no is_sp500 flag — a single boolean can only describe one index at one moment, which is precisely the snapshot bias we avoid. Membership is always a JOIN on cik.
The smart-money dataset
The second dataset — 78M+ rows across six tables — standardizes who is buying and who is holding. It is built from the SEC's mandatory ownership disclosures and held to the same point-in-time and survivorship guarantees as the fundamentals.
Every officer, director, and 10%+ owner transaction — buys, sells, option exercises, and proposed sales — standardized one row per transaction with the transaction code, shares, price, and post-transaction holdings.
insider_transactioninsider_filinginsider_party5%+ activist and passive stakes, one row per reporting person, with the percent owned and the full voting / dispositive-power breakdown.
insider_ownershipQuarterly position disclosures for every institutional manager — shares, USD market value, put/call, and voting authority — one row per holding, resolvable to the issuer.
institutional_holdinginstitutional_filingReporting persons are resolved into a deduplicated directory and 13F holdings are linked back to the issuer they describe, so each row carries a soft reference to entity.cik. The references are soft, not hard, foreign keys — a foreign, pre-IPO, or delisted issuer that doesn't resolve is kept rather than dropped, so coverage is never silently lost. Each disclosure is point-in-time via its own accepted_at. Full table-by-table detail is on the smart-money dataset page.
Foreign private issuers
Foreign private issuers don't file 10-Ks. They file 20-F (and Canadian issuers file 40-F), often under IFRS rather than US GAAP, with their own concept names. Those filings flow through the same standardization pipeline: concepts map into the same canonical standard_concept vocabulary, and an is_foreign flag on the entity lets you include or isolate them. The result is that a US filer and a foreign issuer answer the same query the same way.
Accounting-identity validation
XBRL is machine-readable, not self-consistent: a mis-tagged line item or a transposed figure will parse perfectly and still be wrong. Before any Parquet build is published, the pipeline runs a suite of roughly 48 GAAP accounting-identity rules against the standardized facts — the same articulation a financial statement is required to obey — with a tight numerical tolerance (on the order of 0.1%) to absorb legitimate rounding.
Balance-sheet identity
Assets = Liabilities + EquityCurrent-asset rollup
Current assets ≤ total assetsCash-flow articulation
Δ cash ≈ operating + investing + financingIncome-statement rollup
Gross profit = revenue − cost of revenueA statement that fails an identity is recorded in a qa_violation table rather than silently corrected — the discrepancy is visible and traceable, never papered over. Restatements are tracked the same way: a fact_lineage_summary flags a period as materially restated when a re-filed value moves by more than ~0.5% from the original, so a downstream consumer can tell a cosmetic re-tag apart from an economically meaningful correction.
Structural & coverage checks
On top of the accounting identities, the same release gate runs structural checks. Every fact returned by the MCP server includes a _meta.data_quality block listing which of these passed.
Uniqueness & orderingA company cannot report two FY2024 income statements, and quarterly periods must be strictly ordered. Detects dirty XBRL submissions, amendment collisions, and mis-tagged fiscal periods that would corrupt time-series queries.
Copy-paste error detectionAdjacent periods with statistically improbable identical metrics are flagged as likely filing errors before they reach the dataset.
Amendment lineageEvery restated value must trace back to its original via the accession_id chain. Orphan amendments are quarantined.
Coverage regression alarmsConcept coverage is monitored each release; an unexpected drop flags a pipeline regression before export.
Why the model can't hallucinate a number
The defining failure mode of an LLM over financial data is a confidently-stated figure that no filing supports. Valuein eliminates the failure at its root: the model is never the source of a number, and never recomputes one. Figures are born in the data layer, each already carrying its fact_id and source filing, and the agent is instructed to use them exactly.
The MCP server's provenance rules are explicit and binding on every tool response:
- Use the returned value exactly — never round, restate, recompute, or estimate it.
- Never do arithmetic on returned figures. A derived metric (growth, ratio, margin, valuation) must be requested from the tool that returns it pre-computed with its own input provenance.
- Cite the source filing or fact_id for every figure stated; verify_fact_lineage resolves it back to the filing.
- If a figure's availability is not_reported, not_mapped, suppressed, or error, state that it is unavailable — never substitute a value from prior knowledge.
- Distinguish a genuine reported zero from missing data.
The result is that a hallucinated figure is structurally impossible to pass off as data: anything the model states either carries a verifiable fact_id or is explicitly labelled unavailable. There is no third category where a plausible-sounding number can hide.
Deterministic output, not lucky sampling
Two analysts asking the same question should get the same answer. LLMs are stochastic by default, so the Workspace pins the controls that introduce variance:
temperature = 0Every Workspace model call is pinned to temperature 0, removing sampling randomness so the same prompt against the same data reproduces the same response.
pinned model snapshotModels such as gpt-4o are pinned to a dated snapshot rather than a floating alias, so a silent upstream model update can't change yesterday's output.
Crucially, the figures in an answer don't depend on sampling at all — they carry a fact_id and come straight from the data layer. Temperature and the model snapshot only shape the prose around numbers that are already fixed; the numbers themselves are reproducible regardless of which model, or which run, produced the narrative.
Authorization & prompt-injection safeguards
An agentic data service has two attack surfaces a plain API doesn't: untrusted filing text that could carry injected instructions, and a request path that has to authorize before it touches data. Both are handled at the boundary.
Untrusted-text fencing
SEC filing prose is attacker-controllable — a 10-K can contain text engineered to read like an instruction. Before any filing text is handed to a model, it is passed through a wrapUntrusted() fence that marks it as data, not directives, and known injection-style patterns are stripped or neutralized by a set of regex filters. The model reads the filing as evidence, never as a command.
Layered request hardening
Origin & DNS-rebind checks
Cross-origin and rebinding requests are rejected before any handler runs, so the endpoint can only be reached the way it was meant to be.
Bearer token validation
Every call must present a 64-character hex Bearer token, validated against the Cloudflare KV token store and resolved to a plan tier before a single tool executes.
Body-size cap & per-plan rate limiting
Oversized request bodies are refused and call rates are bounded per plan, so neither a runaway agent nor an abusive client can degrade the service.
Per-request server & Zod schemas
Each request gets a fresh, isolated server instance, and every tool argument is parsed through a strict Zod schema — malformed or unexpected input never reaches the data layer.
Tiering is enforced here too: a tool a caller's plan doesn't cover returns a structured featureNotAvailable envelope with an upgrade path — never a partial or silently downgraded result.
Delivery & freshness
Every table is a column-oriented Parquet file with ZSTD compression — built for DuckDB, Polars, and Spark. A manifest.json ships alongside the data with the snapshot date, the last_updated timestamp, and a row count for every table, so any integration can detect fresh data automatically and verify it received the whole dataset.
The fundamentals core refreshes on the SEC's quarterly EDGAR cadence with amendments processed continuously. On the Institutional tier, filings carry an intraday accepted_at — acceptance timestamps at the moment the SEC published, not a date-only floor.
Python SDK
valuein-sdk on PyPI — in-process DuckDB views over the Parquet tables, with point-in-time enforced at query time.
MCP Server
69 typed tools for any MCP-compatible agent (Claude, Cursor, Codex). The same standardized facts, no SQL required.
Bulk Data API
Authenticated HTTPS streaming of the raw Parquet partitions for B2B and partner integrations.
Workspace
The browser research environment — chat, theses, watchlists, alerts, and reports, all reading the same core.
All four read from the same standardized core — and a single Stripe-issued token unlocks every one of them at your tier. There is no per-channel divergence in the numbers, because there is only one set of numbers.
Verify it yourself
Every claim on this page is testable from the sample tier — no token, no signup. Pick any S&P500 ticker and inspect the lineage of any fact via verify_fact_lineage:
curl -X POST https://mcp.valuein.biz/mcp \ -H "Content-Type: application/json" \ -d '{ "jsonrpc": "2.0", "id": 1, "method": "tools/call", "params": { "name": "verify_fact_lineage", "arguments": { "ticker": "AAPL", "concept": "TotalRevenue", "period_end": "2024-12-31" } } }'The response chains the standardized value back to its source XBRL tag, the SEC accession ID, and the filing URL. If we changed it, you can see why.
Methodology you can audit, data you can trust.
Every step above ships with the data. Read the docs, query the sample tier, and compare against the SEC filings yourself.