DuckDB + Parquet recipes for Valuein
Seven query patterns that cover ~80% of what quants and analysts ask the warehouse. Every snippet runs against the real Parquet schema — copy, paste, edit the ticker, run.
Before you start
All snippets assume Parquet files are local — pulled from the Bulk Data API or via the Python SDK. You can also point read_parquet() directly at an HTTPS URL with DuckDB's httpfs extension.
-- One-time setupINSTALL httpfs;LOAD httpfs; -- Read directly from a presigned R2 URL (compute_ready_stream)SELECT COUNT(*)FROM read_parquet('https://r2.example.com/full/fact.parquet?...');Start every cross-company query with `references`
`references` is a pre-joined entity + security table — one row per security. Start here, then JOIN `index_membership` on `cik` for any membership filter.
Why this works ↓Hide details ↑
The `references` table is a flat join of entity + security — one row per security with `cik`, `sector`, `is_active`, and ticker metadata already populated. Joining entity and security yourself works too, but adds two scans and an order-of-magnitude planner cost. For index membership, JOIN with `index_membership` on `cik = cik` (same column name on both sides since migration 0015). The old `is_sp500` flag was dropped 2026-05-02 — it was snapshot-only and single-index, so all membership questions go through `index_membership`.
SELECT r.symbol, r.name, r.sector, r.industryFROM read_parquet('references.parquet') rJOIN read_parquet('index_membership.parquet') im ON im.cik = r.cikWHERE im.index_name = 'SP500' AND im.removal_date IS NULL -- current member AND r.is_active = TRUEORDER BY r.sector, r.symbol;Note · Use this as the spine of any screen, peer search, or universe construction.
Latest filing per company with LATERAL
Use a LATERAL join to correlate one row per company — never pull all filings and dedupe in pandas. `fact` keys on `entity_id` (= references.cik), not ticker.
Why this works ↓Hide details ↑
The most common bug in cross-sectional fundamentals queries: pulling all filings then deduplicating in pandas. DuckDB's LATERAL join lets you correlate one row per company without a self-join. Note: the `fact` table keys on `entity_id` (= references.cik), not on ticker.
WITH r AS ( SELECT r.cik, r.symbol, r.sector FROM read_parquet('references.parquet') r JOIN read_parquet('index_membership.parquet') im ON im.cik = r.cik WHERE im.index_name = 'SP500' AND im.removal_date IS NULL AND r.is_active = TRUE)SELECT r.symbol, r.sector, f.fiscal_year, f.fiscal_period, f.numeric_value AS revenueFROM r,LATERAL ( SELECT fiscal_year, fiscal_period, numeric_value FROM read_parquet('fact.parquet') AS f0 WHERE f0.entity_id = r.cik AND f0.standard_concept = 'TotalRevenue' AND f0.fiscal_period = 'FY' ORDER BY f0.fiscal_year DESC, f0.accepted_at DESC LIMIT 1) fORDER BY revenue DESC NULLS LASTLIMIT 25;Note · LATERAL evaluates the inner query once per outer row but stops at LIMIT 1 — much cheaper than a full self-join.
Pivot multiple concepts in a single fact scan
One MAX(CASE) pivot scans `fact` once and returns revenue, net income, and EPS as columns — the difference between 200ms and 2s across 111M+ rows.
Why this works ↓Hide details ↑
Naively, pulling revenue + net income + EPS means three queries against the fact table. A single MAX(CASE) pivot scans `fact` once and returns all three as columns. For 111M+ rows, that's the difference between 200ms and 2s. The fact table joins on entity_id (= references.cik) — there is no ticker column on fact.
WITH ids AS ( SELECT cik, symbol FROM read_parquet('references.parquet') WHERE symbol IN ('AAPL', 'MSFT', 'GOOGL', 'NVDA'))SELECT ids.symbol, fa.fiscal_year, MAX(CASE WHEN fa.standard_concept = 'TotalRevenue' THEN fa.numeric_value END) AS revenue, MAX(CASE WHEN fa.standard_concept = 'NetIncomeLoss' THEN fa.numeric_value END) AS net_income, MAX(CASE WHEN fa.standard_concept = 'EarningsPerShareDiluted' THEN fa.numeric_value END) AS eps_diluted, MAX(CASE WHEN fa.standard_concept = 'NetCashProvidedByUsedInOperatingActivities' THEN fa.numeric_value END) AS ocfFROM read_parquet('fact.parquet') faJOIN ids ON ids.cik = fa.entity_idWHERE fa.fiscal_period = 'FY' AND fa.fiscal_year >= 2020GROUP BY ids.symbol, fa.fiscal_yearORDER BY ids.symbol, fa.fiscal_year;Note · Add as many concepts as you need — the cost is one extra CASE per concept, not one extra scan.
Use QUALIFY ROW_NUMBER() instead of correlated subqueries
QUALIFY ROW_NUMBER() returns the latest accepted_at per (entity, period) far cleaner than a correlated MAX subquery — the idiomatic way to dedupe original-vs-amended filings.
Why this works ↓Hide details ↑
When you need the latest accepted_at for each (entity, period) — to dedupe original-vs-amended filings — QUALIFY is dramatically cleaner than a correlated MAX subquery. Resolve a ticker to its CIK via references first; fact has no ticker column.
WITH tsla AS ( SELECT cik FROM read_parquet('references.parquet') WHERE symbol = 'TSLA')SELECT fa.entity_id, fa.fiscal_year, fa.fiscal_period, fa.standard_concept, fa.numeric_value, fa.accepted_atFROM read_parquet('fact.parquet') faJOIN tsla ON tsla.cik = fa.entity_idWHERE fa.standard_concept = 'NetIncomeLoss'QUALIFY ROW_NUMBER() OVER ( PARTITION BY fa.entity_id, fa.fiscal_year, fa.fiscal_period, fa.standard_concept ORDER BY fa.accepted_at DESC) = 1ORDER BY fa.fiscal_year DESC, fa.fiscal_period;| Metric | as reported | restated |
|---|---|---|
| Revenue | 10,388 | 10,388 |
| Operating income | 2,114 | 1,902 |
| Net income | 1,640 | 1,431 |
| Diluted EPS | 3.05 | 2.66 |
verify_fact_lineage() → both versions retained · each keyed to its filing
Note · Returns one row per fiscal period — the most recently accepted (i.e. post-amendment) value. Drop the QUALIFY clause to see all amendments. Tip: fact also exposes value_current and value_as_filed if you want both restated and originally-filed values pre-computed.
Quarterly cash flow without YTD bleed
Q2/Q3 10-Q cash flows are year-to-date. COALESCE on `derived_quarterly_value` to get the true quarter-only figure the pipeline pre-computes.
Why this works ↓Hide details ↑
Q2 and Q3 10-Q cash flow figures are year-to-date, not quarter-only. The pipeline computes the incremental quarterly figure as `derived_quarterly_value`. Use COALESCE to prefer the derived value where present.
WITH amzn AS ( SELECT cik FROM read_parquet('references.parquet') WHERE symbol = 'AMZN')SELECT fa.entity_id, fa.fiscal_year, fa.fiscal_period, COALESCE(fa.derived_quarterly_value, fa.numeric_value) AS quarterly_ocfFROM read_parquet('fact.parquet') faJOIN amzn ON amzn.cik = fa.entity_idWHERE fa.standard_concept = 'NetCashProvidedByUsedInOperatingActivities' AND fa.fiscal_period IN ('Q1', 'Q2', 'Q3', 'Q4') AND fa.fiscal_year >= 2022ORDER BY fa.fiscal_year, fa.fiscal_period;Note · For annual numbers (FY rows) `numeric_value` is correct — the COALESCE just falls back to it when no quarterly derivation exists.
Point-in-time screen: what was investable on a date
Filter `index_membership` by effective/removal date AND `fact.accepted_at` so both the company and its data were knowable — survivorship-free and look-ahead-free.
Why this works ↓Hide details ↑
Backtesting a strategy requires the universe that existed on the trade date — not today's universe. Filter `index_membership` by `effective_date` / `removal_date` AND `fact.accepted_at` to ensure both the company AND its data were knowable. index_membership keys on cik (since migration 0015) and uses half-open [) interval semantics — a company removed on date D is NOT a member on date D.
WITH universe_at AS ( SELECT r.cik, r.symbol FROM read_parquet('references.parquet') AS r JOIN read_parquet('index_membership.parquet') AS im ON im.cik = r.cik WHERE im.index_name = 'SP500' AND im.effective_date <= DATE '2018-01-01' AND (im.removal_date IS NULL OR im.removal_date > DATE '2018-01-01')),known_facts AS ( SELECT fa.entity_id, fa.fiscal_year, MAX(CASE WHEN fa.standard_concept = 'TotalRevenue' THEN fa.numeric_value END) AS revenue, MAX(CASE WHEN fa.standard_concept = 'NetIncomeLoss' THEN fa.numeric_value END) AS net_income FROM read_parquet('fact.parquet') fa WHERE fa.accepted_at <= TIMESTAMP '2018-01-01 00:00:00' AND fa.fiscal_period = 'FY' AND fa.fiscal_year = 2017 GROUP BY fa.entity_id, fa.fiscal_year)SELECT u.symbol, k.revenue, k.net_income, k.net_income / NULLIF(k.revenue, 0) AS net_marginFROM universe_at uJOIN known_facts k ON k.entity_id = u.cikORDER BY net_margin DESC NULLS LASTLIMIT 50;- ENRNbankrupt · 2001
- LEHbankrupt · 2008
- WAMUseized · 2008
- BBBYbankrupt · 2023
- SIVBfailed · 2023
- FTXcollapsed · 2022
get_pit_universe(date) → the index as it stood, not as it survived
Note · The universe filter prevents survivorship bias; the accepted_at filter prevents look-ahead bias. Both are required.
Defensive ratio math: ABS and NULLIF every divisor
Wrap every cash-flow denominator in ABS (capex is reported negative) and every divisor in NULLIF (zero-revenue quarters) — make it a reflex.
Why this works ↓Hide details ↑
Capex is reported as negative on cash flow statements; some companies have zero revenue in a given quarter. Wrap every divisor in NULLIF and every cash-flow denominator in ABS.
WITH ids AS ( SELECT cik, symbol FROM read_parquet('references.parquet') WHERE symbol IN ('AAPL', 'MSFT', 'GOOGL', 'AMZN', 'META')),pivoted AS ( SELECT ids.symbol, fa.fiscal_year, MAX(CASE WHEN fa.standard_concept = 'TotalRevenue' THEN fa.numeric_value END) AS revenue, MAX(CASE WHEN fa.standard_concept = 'NetCashProvidedByUsedInOperatingActivities' THEN fa.numeric_value END) AS ocf, MAX(CASE WHEN fa.standard_concept = 'PaymentsToAcquirePropertyPlantAndEquipment' THEN fa.numeric_value END) AS capex FROM read_parquet('fact.parquet') fa JOIN ids ON ids.cik = fa.entity_id WHERE fa.fiscal_period = 'FY' AND fa.fiscal_year = 2024 GROUP BY ids.symbol, fa.fiscal_year)SELECT symbol, fiscal_year, ocf, capex, -- ABS handles companies that report capex as negative; -- NULLIF prevents division-by-zero on revenue gaps. ocf - ABS(capex) AS free_cash_flow, (ocf - ABS(capex)) / NULLIF(revenue, 0) AS fcf_marginFROM pivotedORDER BY fcf_margin DESC NULLS LAST;Note · Make NULLIF on every divisor a habit — DuckDB returns NULL on divide-by-zero, but errors on integer divide-by-zero in some compilation modes.
More patterns?
The Python SDK ships with pre-built query templates for the most common workflows — factor screens, DCF inputs, peer comparables, insider ownership.