MCP Server live — AI agents can now query 105M+ SEC facts. Connect your agent →
ValueinValuein
SQL Cookbook

DuckDB + Parquet recipes for Valuein

Seven query patterns that cover ~80% of what quants and analysts ask the warehouse. Every snippet runs against the real Parquet schema — copy, paste, edit the ticker, run.

Before you start

All snippets assume Parquet files are local — pulled from the Bulk Data API or via the Python SDK. You can also point read_parquet() directly at an HTTPS URL with DuckDB's httpfs extension.

-- One-time setup
INSTALL httpfs;
LOAD httpfs;

-- Read directly from a presigned R2 URL (compute_ready_stream)
SELECT COUNT(*)
FROM read_parquet('https://r2.example.com/full/fact.parquet?...');
01

Start every cross-company query with `references`

The `references` table is a flat join of entity + security — one row per security with `cik`, `sector`, `is_active`, and ticker metadata already populated. Joining entity and security yourself works too, but adds two scans and an order-of-magnitude planner cost. For index membership, JOIN with `index_membership` on `cik = cik` (same column name on both sides since migration 0015). The old `is_sp500` flag was dropped 2026-05-02 — it was snapshot-only and single-index, so all membership questions go through `index_membership`.

SELECT
  r.symbol,
  r.name,
  r.sector,
  r.industry
FROM read_parquet('references.parquet') r
JOIN read_parquet('index_membership.parquet') im
  ON im.cik = r.cik
WHERE im.index_name = 'SP500'
  AND im.removal_date IS NULL  -- current member
  AND r.is_active = TRUE
ORDER BY r.sector, r.symbol;

Note · Use this as the spine of any screen, peer search, or universe construction.

02

Latest filing per company with LATERAL

The most common bug in cross-sectional fundamentals queries: pulling all filings then deduplicating in pandas. DuckDB's LATERAL join lets you correlate one row per company without a self-join. Note: the `fact` table keys on `entity_id` (= references.cik), not on ticker.

WITH r AS (
  SELECT r.cik, r.symbol, r.sector
  FROM read_parquet('references.parquet') r
  JOIN read_parquet('index_membership.parquet') im
    ON im.cik = r.cik
  WHERE im.index_name = 'SP500'
    AND im.removal_date IS NULL
    AND r.is_active = TRUE
)
SELECT
  r.symbol,
  r.sector,
  f.fiscal_year,
  f.fiscal_period,
  f.numeric_value AS revenue
FROM r,
LATERAL (
  SELECT fiscal_year, fiscal_period, numeric_value
  FROM read_parquet('fact.parquet') AS f0
  WHERE f0.entity_id        = r.cik
    AND f0.standard_concept = 'TotalRevenue'
    AND f0.fiscal_period    = 'FY'
  ORDER BY f0.fiscal_year DESC, f0.accepted_at DESC
  LIMIT 1
) f
ORDER BY revenue DESC NULLS LAST
LIMIT 25;

Note · LATERAL evaluates the inner query once per outer row but stops at LIMIT 1 — much cheaper than a full self-join.

03

Pivot multiple concepts in a single fact scan

Naively, pulling revenue + net income + EPS means three queries against the fact table. A single MAX(CASE) pivot scans `fact` once and returns all three as columns. For 105M rows, that's the difference between 200ms and 2s. The fact table joins on entity_id (= references.cik) — there is no ticker column on fact.

WITH ids AS (
  SELECT cik, symbol
  FROM read_parquet('references.parquet')
  WHERE symbol IN ('AAPL', 'MSFT', 'GOOGL', 'NVDA')
)
SELECT
  ids.symbol,
  fa.fiscal_year,
  MAX(CASE WHEN fa.standard_concept = 'TotalRevenue'      THEN fa.numeric_value END) AS revenue,
  MAX(CASE WHEN fa.standard_concept = 'NetIncomeLoss'     THEN fa.numeric_value END) AS net_income,
  MAX(CASE WHEN fa.standard_concept = 'EarningsPerShareDiluted' THEN fa.numeric_value END) AS eps_diluted,
  MAX(CASE WHEN fa.standard_concept = 'NetCashProvidedByUsedInOperatingActivities' THEN fa.numeric_value END) AS ocf
FROM read_parquet('fact.parquet') fa
JOIN ids ON ids.cik = fa.entity_id
WHERE fa.fiscal_period = 'FY'
  AND fa.fiscal_year >= 2020
GROUP BY ids.symbol, fa.fiscal_year
ORDER BY ids.symbol, fa.fiscal_year;

Note · Add as many concepts as you need — the cost is one extra CASE per concept, not one extra scan.

04

Use QUALIFY ROW_NUMBER() instead of correlated subqueries

When you need the latest accepted_at for each (entity, period) — to dedupe original-vs-amended filings — QUALIFY is dramatically cleaner than a correlated MAX subquery. Resolve a ticker to its CIK via references first; fact has no ticker column.

WITH tsla AS (
  SELECT cik FROM read_parquet('references.parquet') WHERE symbol = 'TSLA'
)
SELECT
  fa.entity_id,
  fa.fiscal_year,
  fa.fiscal_period,
  fa.standard_concept,
  fa.numeric_value,
  fa.accepted_at
FROM read_parquet('fact.parquet') fa
JOIN tsla ON tsla.cik = fa.entity_id
WHERE fa.standard_concept = 'NetIncomeLoss'
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY fa.entity_id, fa.fiscal_year, fa.fiscal_period, fa.standard_concept
  ORDER BY fa.accepted_at DESC
) = 1
ORDER BY fa.fiscal_year DESC, fa.fiscal_period;

Note · Returns one row per fiscal period — the most recently accepted (i.e. post-amendment) value. Drop the QUALIFY clause to see all amendments. Tip: fact also exposes value_current and value_as_filed if you want both restated and originally-filed values pre-computed.

05

Quarterly cash flow without YTD bleed

Q2 and Q3 10-Q cash flow figures are year-to-date, not quarter-only. The pipeline computes the incremental quarterly figure as `derived_quarterly_value`. Use COALESCE to prefer the derived value where present.

WITH amzn AS (
  SELECT cik FROM read_parquet('references.parquet') WHERE symbol = 'AMZN'
)
SELECT
  fa.entity_id,
  fa.fiscal_year,
  fa.fiscal_period,
  COALESCE(fa.derived_quarterly_value, fa.numeric_value) AS quarterly_ocf
FROM read_parquet('fact.parquet') fa
JOIN amzn ON amzn.cik = fa.entity_id
WHERE fa.standard_concept = 'NetCashProvidedByUsedInOperatingActivities'
  AND fa.fiscal_period IN ('Q1', 'Q2', 'Q3', 'Q4')
  AND fa.fiscal_year >= 2022
ORDER BY fa.fiscal_year, fa.fiscal_period;

Note · For annual numbers (FY rows) `numeric_value` is correct — the COALESCE just falls back to it when no quarterly derivation exists.

06

Point-in-time screen: what was investable on a date

Backtesting a strategy requires the universe that existed on the trade date — not today's universe. Filter `index_membership` by `effective_date` / `removal_date` AND `fact.accepted_at` to ensure both the company AND its data were knowable. index_membership keys on cik (since migration 0015) and uses half-open [) interval semantics — a company removed on date D is NOT a member on date D.

WITH universe_at AS (
  SELECT r.cik, r.symbol
  FROM read_parquet('references.parquet') AS r
  JOIN read_parquet('index_membership.parquet') AS im
    ON im.cik = r.cik
  WHERE im.index_name = 'SP500'
    AND im.effective_date <= DATE '2018-01-01'
    AND (im.removal_date IS NULL OR im.removal_date > DATE '2018-01-01')
),
known_facts AS (
  SELECT
    fa.entity_id,
    fa.fiscal_year,
    MAX(CASE WHEN fa.standard_concept = 'TotalRevenue'  THEN fa.numeric_value END) AS revenue,
    MAX(CASE WHEN fa.standard_concept = 'NetIncomeLoss' THEN fa.numeric_value END) AS net_income
  FROM read_parquet('fact.parquet') fa
  WHERE fa.accepted_at <= TIMESTAMP '2018-01-01 00:00:00'
    AND fa.fiscal_period = 'FY'
    AND fa.fiscal_year = 2017
  GROUP BY fa.entity_id, fa.fiscal_year
)
SELECT
  u.symbol,
  k.revenue,
  k.net_income,
  k.net_income / NULLIF(k.revenue, 0) AS net_margin
FROM universe_at u
JOIN known_facts k ON k.entity_id = u.cik
ORDER BY net_margin DESC NULLS LAST
LIMIT 50;

Note · The universe filter prevents survivorship bias; the accepted_at filter prevents look-ahead bias. Both are required.

07

Defensive ratio math: ABS and NULLIF every divisor

Capex is reported as negative on cash flow statements; some companies have zero revenue in a given quarter. Wrap every divisor in NULLIF and every cash-flow denominator in ABS.

WITH ids AS (
  SELECT cik, symbol
  FROM read_parquet('references.parquet')
  WHERE symbol IN ('AAPL', 'MSFT', 'GOOGL', 'AMZN', 'META')
),
pivoted AS (
  SELECT
    ids.symbol,
    fa.fiscal_year,
    MAX(CASE WHEN fa.standard_concept = 'TotalRevenue'                                THEN fa.numeric_value END) AS revenue,
    MAX(CASE WHEN fa.standard_concept = 'NetCashProvidedByUsedInOperatingActivities' THEN fa.numeric_value END) AS ocf,
    MAX(CASE WHEN fa.standard_concept = 'PaymentsToAcquirePropertyPlantAndEquipment' THEN fa.numeric_value END) AS capex
  FROM read_parquet('fact.parquet') fa
  JOIN ids ON ids.cik = fa.entity_id
  WHERE fa.fiscal_period = 'FY'
    AND fa.fiscal_year   = 2024
  GROUP BY ids.symbol, fa.fiscal_year
)
SELECT
  symbol,
  fiscal_year,
  ocf,
  capex,
  -- ABS handles companies that report capex as negative;
  -- NULLIF prevents division-by-zero on revenue gaps.
  ocf - ABS(capex) AS free_cash_flow,
  (ocf - ABS(capex)) / NULLIF(revenue, 0) AS fcf_margin
FROM pivoted
ORDER BY fcf_margin DESC NULLS LAST;

Note · Make NULLIF on every divisor a habit — DuckDB returns NULL on divide-by-zero, but errors on integer divide-by-zero in some compilation modes.

More patterns?

The Python SDK ships with pre-built query templates for the most common workflows — factor screens, DCF inputs, peer comparables, insider ownership.