Finance RAG — Ask My 10-Ks

What This Is

Finance RAG is Part 4 of a six-month series building production AI systems in public — month by month, discipline by discipline. Where earlier parts established foundational retrieval patterns, this one confronts the domain where standard RAG fails most visibly: financial documents.

The system queries public company SEC filings — 10-Ks and 10-Qs sourced directly from SEC EDGAR — and returns structured answers grounded entirely in the retrieved text. Every factual claim carries an inline citation: [AAPL, 10-K, 2024-11-01, Page 3]. The model is explicitly forbidden from drawing upon general knowledge; if the filing corpus does not cover the question, it says so rather than speculating.

This is not a demonstration. It ships with JWT authentication, LangChain guardrails, dual observability via Langfuse and LangSmith, RAGAS evaluation gates that block pull requests on regression, and a one-command Docker deployment.

The Problem

Standard RAG systems fail on financial documents in four distinct and well-documented ways:

Ticker symbol loss — vector embeddings reduce short identifiers such as AAPL or NVDA to distributed representations averaged across the entire vocabulary. Cosine similarity then returns semantically adjacent but factually wrong companies. The model retrieves Apple-adjacent content for an NVIDIA query.
Financial term imprecision — specialised language (EBITDA, Tier 1 capital, Basel III, goodwill impairment) loses definitional precision when averaged into dense vectors, causing retrieval to return tangentially related passages rather than the precise clauses an analyst requires.
Hallucinated numbers — given only vague semantic context, large language models confidently fabricate revenue figures, debt ratios, and earnings-per-share values. A single invented number corrupts every downstream decision built upon it.
Unstructured output with no reliability signal — a free-form prose answer provides the caller with no means of assessing trustworthiness. There is no distinction between a claim supported by three corroborating passages and one assembled from a single ambiguous sentence.

Finance RAG addresses all four through hybrid retrieval, multi-stage reranking, mandatory citation enforcement, structured output validated at the API level, and a blended confidence score that ties reliability to retrieval quality rather than model verbosity.

Architecture

The system comprises five modular, independently testable components arranged in a sequential pipeline.

1. Ingestion Pipeline

A three-stage process that transforms raw SEC filings into two complementary searchable indices:

Fetch — downloads 10-K and 10-Q forms from SEC EDGAR for specified tickers and date ranges via the EDGAR full-text search API.
Parse — PyMuPDF handles PDF filings, preserving actual page numbers as they appear in the document. BeautifulSoup processes iXBRL HTML filings, approximating pages every 3,000 words to maintain citation granularity.
Index — constructs a FAISS vector index (text-embedding-3-small, 1,536 dimensions, L2-normalised) and a BM25 sparse index in parallel from the same chunked corpus.
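The page approximation used for iXBRL HTML filings can be sketched as follows. This is a minimal illustration, assuming the text has already been extracted from the HTML; the function name `approximate_pages` is hypothetical and not taken from the project.

```python
def approximate_pages(text: str, words_per_page: int = 3000) -> list[tuple[int, str]]:
    """Split extracted filing text into pseudo-pages of a fixed word count.

    Assumption: HTML filings carry no real page breaks, so a new "page"
    simply starts every `words_per_page` words.
    """
    words = text.split()
    pages = []
    for start in range(0, len(words), words_per_page):
        page_number = start // words_per_page + 1  # 1-based, like PDF pages
        page_text = " ".join(words[start:start + words_per_page])
        pages.append((page_number, page_text))
    return pages
```

A 7,000-word filing would come out as three pseudo-pages: two of 3,000 words and a final one of 1,000.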

Chunking employs 512-word windows with 64-word overlap. Every chunk carries full provenance metadata: ticker symbol, form type, filing date, source path, and page number. This provenance is what makes citations traceable rather than decorative.
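A sliding-window chunker with these parameters might look like the sketch below. The `Chunk` dataclass and `chunk_page` function are hypothetical names, and chunking one page at a time (so that each chunk maps to exactly one page number) is an assumption rather than something the project states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    ticker: str
    form_type: str
    filing_date: str
    source_path: str
    page: int


def chunk_page(page_text: str, *, ticker: str, form_type: str, filing_date: str,
               source_path: str, page: int,
               window: int = 512, overlap: int = 64) -> list[Chunk]:
    """Cut one page into overlapping word windows, copying provenance onto each."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    words = page_text.split()
    step = window - overlap  # 448 words between window starts by default
    chunks = []
    for start in range(0, len(words), step):
        piece = " ".join(words[start:start + window])
        chunks.append(Chunk(piece, ticker, form_type, filing_date, source_path, page))
        if start + window >= len(words):
            break  # the last window already reached the end of the page
    return chunks
```

Because every chunk carries its own metadata, a citation can be assembled from the chunk alone without a lookup back into the ingestion state.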

2. Hybrid Retrieval and Reciprocal Rank Fusion

Vector search and BM25 run in parallel for every query, each returning the top-20 candidates independently. Reciprocal Rank Fusion merges the two ranked lists without requiring manual tuning of relative weights:

RRF(doc) = Σ_r [ 1 / (k + rank_r(doc)) ]   where r ranges over the two retrievers (vector, BM25) and k = 60

This yields up to 40 fused candidates, fewer when the two lists overlap. BM25 handles exact ticker symbol and financial terminology matching; vector search handles semantic intent and paraphrase. Neither approach alone is sufficient; together, they are complementary.
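The fusion step is small enough to show in full. The sketch below is an illustrative implementation of the formula above, not the project's actual code; the function name and the use of plain string document IDs are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Merge several ranked lists of document IDs into one scored list.

    Each document earns 1 / (k + rank) from every list it appears in,
    with ranks starting at 1. Documents missing from a list earn nothing from it.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

With a vector ranking of `["a", "b", "c"]` and a BM25 ranking of `["b", "d", "a"]`, document `b` scores 1/62 + 1/61 and edges out `a` at 1/61 + 1/63, while `c` and `d` each appear in only one list and trail behind.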

3. Cross-Encoder Reranking

The fused candidates (up to 40) are passed to Cohere's cross-encoder reranker, which selects the 5 most relevant chunks. This is a meaningfully different operation from embedding similarity: instead of comparing two independently computed vectors, the cross-encoder reads the query alongside each candidate and scores their relationship directly.

Should Cohere credentials be absent, the system falls back gracefully to the fusion top-5 rather than failing. Every external dependency is designed with this degradation path in mind.
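One way to express this degradation path is to treat the reranker as an optional callable. In the sketch below, `select_top_chunks` and the `Reranker` signature are hypothetical; the real Cohere client call is deliberately left out and would sit behind that callable.

```python
from typing import Callable, Optional, Sequence

# Hypothetical signature: takes the query and candidate texts,
# returns one relevance score per candidate, in the same order.
Reranker = Callable[[str, Sequence[str]], list[float]]


def select_top_chunks(query: str, fused: Sequence[str],
                      reranker: Optional[Reranker] = None,
                      top_n: int = 5) -> tuple[list[str], Optional[list[float]]]:
    """Return the top chunks plus their rerank scores (None when not reranked)."""
    candidates = list(fused)
    if reranker is None:
        # No credentials configured: keep the fusion order as-is.
        return candidates[:top_n], None
    scores = reranker(query, candidates)
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)[:top_n]
    return [candidates[i] for i in order], [scores[i] for i in order]
```

Returning `None` for the scores, rather than an empty list, lets downstream code distinguish "reranker unavailable" from "reranker returned nothing".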

4. Structured Generation and Decision Support

GPT-4o is invoked via beta.chat.completions.parse with a Pydantic schema — not a plain chat completion. The model must return a typed StructuredAnswer object; it cannot return free-form text. Every field is validated before the response leaves the generation layer:

answer — prose response with inline [TICKER, FORM, DATE, Page N] citations woven through the text
claims[] — each factual statement as a structured { statement, citation } pair, making individual claims auditable
llm_confidence — the model's own reliability estimate on a 0–1 scale
confidence_reasoning — a brief explanation of the confidence rating, surfacing the model's epistemic state
decision_recommendation — an actionable next step for the analyst, derived from the retrieved evidence
data_sufficiency — SUFFICIENT or INSUFFICIENT: does the indexed corpus actually support the question posed?

Temperature is fixed at 0.0 so that repeated queries produce output that is as consistent as possible. External knowledge is explicitly forbidden in the system prompt; when the filing corpus does not cover the question, the model sets data_sufficiency: INSUFFICIENT rather than speculating from its training data.
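Because citations follow a fixed format, they can be audited mechanically. The sketch below is a hypothetical post-check, not part of the project as described: it parses citations out of an answer and reports any that do not correspond to a retrieved chunk. The ticker pattern (one to five capital letters) is a simplifying assumption.

```python
import re
from typing import NamedTuple

# Matches citations such as [AAPL, 10-K, 2024-11-01, Page 3].
CITATION_RE = re.compile(r"\[([A-Z]{1,5}), (10-[KQ]), (\d{4}-\d{2}-\d{2}), Page (\d+)\]")


class Citation(NamedTuple):
    ticker: str
    form: str
    filing_date: str
    page: int


def extract_citations(text: str) -> list[Citation]:
    """Return every well-formed citation in the order it appears."""
    return [Citation(t, f, d, int(p)) for t, f, d, p in CITATION_RE.findall(text)]


def unsupported_citations(answer: str, retrieved: set[Citation]) -> list[Citation]:
    """Citations in the answer that match no retrieved chunk's provenance."""
    return [c for c in extract_citations(answer) if c not in retrieved]
```

An empty result from `unsupported_citations` means every cited source was actually in the context handed to the model; it says nothing about whether the claim itself is faithful to that source.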

Blended confidence scoring combines two independent signals:

confidence_score = 0.7 × llm_confidence + 0.3 × mean(cohere_rerank_scores)

LLM self-assessed confidence is unreliable in isolation — the model's certainty correlates poorly with factual accuracy. Weighting it 70/30 against the mean Cohere rerank score ties the reported confidence to retrieval quality. When the reranker is unavailable, the score falls back to llm_confidence alone — degraded, but not broken.
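The blend and its fallback fit in a few lines. This is a minimal sketch of the formula above; the function name is hypothetical, and it assumes rerank scores are already on a 0-1 scale.

```python
from statistics import fmean
from typing import Optional, Sequence


def blended_confidence(llm_confidence: float,
                       rerank_scores: Optional[Sequence[float]] = None,
                       llm_weight: float = 0.7) -> float:
    """Blend the model's self-reported confidence with mean rerank relevance."""
    if not 0.0 <= llm_confidence <= 1.0:
        raise ValueError("llm_confidence must be within [0, 1]")
    if not rerank_scores:
        # Reranker unavailable: fall back to the model's own estimate.
        return llm_confidence
    return llm_weight * llm_confidence + (1.0 - llm_weight) * fmean(rerank_scores)
```

For example, an `llm_confidence` of 0.9 with rerank scores of 0.8 and 0.6 gives 0.7 × 0.9 + 0.3 × 0.7 = 0.84.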

5. Dual Observability

Langfuse traces every stage of the pipeline (retrieval, reranking, and generation), capturing per-component latency, token counts, and cost per request. Traces are emitted asynchronously; when credentials are missing, tracing setup returns None and the request path continues uninterrupted.

LangSmith receives RAGAS evaluation results after every CI run, enabling metric trend analysis across pull requests. Cost aggregates and pipeline performance can be monitored in LangSmith dashboards over time.

Both observability layers are additive, not load-bearing. The pipeline functions correctly without them.

Security

JWT Role-Based Access Control

Three roles govern what a caller may do:

| Role | Can query | Can ingest | Ticker scope |
|---|---|---|---|
| admin | Any ticker | Yes | Unrestricted |
| analyst | Assigned tickers only | No | e.g., AAPL, MSFT, NVDA |
| viewer | Assigned tickers only | No | e.g., AAPL only |

Tokens expire after 60 minutes. In production, all API keys are injected from AWS Secrets Manager at container start — nothing is stored in source code or committed environment files.
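The authorization rules in the table reduce to two checks on the decoded token claims. The sketch below assumes the JWT signature has already been verified elsewhere; the `Claims` dataclass, its field names, and the two helper functions are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

TOKEN_TTL = timedelta(minutes=60)


@dataclass(frozen=True)
class Claims:
    role: str                 # "admin", "analyst", or "viewer"
    tickers: frozenset[str]   # assigned tickers; ignored for admin
    issued_at: datetime


def _is_live(claims: Claims, now: datetime) -> bool:
    return now - claims.issued_at < TOKEN_TTL


def can_query(claims: Claims, ticker: str, now: datetime) -> bool:
    """Admins may query any ticker; other roles only their assigned ones."""
    if not _is_live(claims, now):
        return False
    if claims.role == "admin":
        return True
    return claims.role in {"analyst", "viewer"} and ticker in claims.tickers


def can_ingest(claims: Claims, now: datetime) -> bool:
    """Only admins may ingest new filings."""
    return _is_live(claims, now) and claims.role == "admin"
```

Passing `now` explicitly, for instance `datetime.now(timezone.utc)` in the request handler, keeps the expiry logic easy to test.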

LangChain Guardrails

Every query passes through two filters executed concurrently before reaching GPT-4o:

Scope filter — blocks questions that fall outside the domain of SEC filings, earnings disclosures, risk factors, capital structure, and related financial topics.
Safety filter — blocks toxicity, competitor promotion, and market manipulation language.

Both filters run via gpt-4o-mini in parallel. The combined cost is approximately $0.00008 per query — effectively negligible. A blocked query returns a structured error explaining precisely which constraint was violated and what categories of question the system accepts.
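The concurrency pattern can be sketched with `asyncio.gather`. In the illustration below, the two filter functions are stand-ins: in the described system each would be a gpt-4o-mini call, whereas here they are keyword checks invented purely to make the example runnable. All names and the shape of the returned error are assumptions.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Verdict:
    allowed: bool
    reason: str = ""


async def scope_filter(query: str) -> Verdict:
    # Placeholder for an LLM classification call (hypothetical keyword list).
    await asyncio.sleep(0)
    on_topic = any(term in query.lower() for term in ("10-k", "10-q", "risk", "revenue", "filing"))
    return Verdict(on_topic, "" if on_topic else "out_of_scope")


async def safety_filter(query: str) -> Verdict:
    # Placeholder for an LLM safety check (hypothetical trigger word).
    await asyncio.sleep(0)
    flagged = "pump" in query.lower()
    return Verdict(not flagged, "market_manipulation" if flagged else "")


async def run_guardrails(query: str) -> dict:
    """Run both filters concurrently and collect every violated constraint."""
    scope, safety = await asyncio.gather(scope_filter(query), safety_filter(query))
    violations = [v.reason for v in (scope, safety) if not v.allowed]
    return {"allowed": not violations, "violations": violations}
```

Since the two checks are independent, running them with `gather` means the guardrail stage costs roughly the latency of the slower filter rather than the sum of both.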

Quality Gates

The project ships with a RAGAS evaluation suite running against a fixed 25-question test set drawn from 1,260 chunks across 12 SEC filings (Apple, Microsoft, Amazon, Nvidia — 10-Ks from 2022 through 2024).

| Metric | What it measures | Threshold |
|---|---|---|
| Faithfulness | Every claim is supported by the retrieved context | ≥ 0.85 |
| Answer Relevancy | The answer addresses the question posed | ≥ 0.80 |
| Context Precision | Retrieved chunks are genuinely useful | ≥ 0.78 |
| Context Recall | Retrieval captures all context necessary to answer | ≥ 0.75 |

A GitHub Actions workflow runs this evaluation on every pull request targeting main. All four metrics must pass or the PR is blocked — a binary gate, with no exceptions and no manual overrides. Results are uploaded to LangSmith after each run so that metric trends are visible across the project's history.
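The gate itself amounts to a threshold comparison whose result becomes the CI job's exit code. The sketch below assumes the RAGAS scores arrive as a plain dictionary keyed by metric name; the key spellings and function names are hypothetical.

```python
# Thresholds from the table above, keyed by assumed metric names.
THRESHOLDS = {
    "faithfulness": 0.85,
    "answer_relevancy": 0.80,
    "context_precision": 0.78,
    "context_recall": 0.75,
}


def failing_metrics(scores: dict[str, float]) -> dict[str, tuple[float, float]]:
    """Map each failing metric to (observed, required); a missing metric counts as 0.0."""
    return {
        name: (scores.get(name, 0.0), minimum)
        for name, minimum in THRESHOLDS.items()
        if scores.get(name, 0.0) < minimum
    }


def gate_exit_code(scores: dict[str, float]) -> int:
    """Return 0 when all four metrics pass, 1 otherwise (a non-zero exit fails the CI job)."""
    return 1 if failing_metrics(scores) else 0
```

A workflow step could end with `sys.exit(gate_exit_code(scores))`. Treating a missing metric as a failure keeps an incomplete evaluation run from passing the gate by accident.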

Example Response

{
  "question": "What are Apple's risk factors around China supply chain?",
  "answer": "Apple's risk factors around its China supply chain include concentration of manufacturing in China [AAPL, 10-K, 2023-11-03, Page 3], political and trade risks including tariffs and geopolitical tensions [AAPL, 10-K, 2024-11-01, Page 2], and single-source supplier dependencies that create significant supply and pricing risks [AAPL, 10-K, 2024-11-01, Page 3].",
  "claims": [
    {
      "statement": "Manufacturing is concentrated in China",
      "citation": "[AAPL, 10-K, 2023-11-03, Page 3]"
    },
    {
      "statement": "Tariffs and geopolitical tensions pose political and trade risks",
      "citation": "[AAPL, 10-K, 2024-11-01, Page 2]"
    },
    {
      "statement": "Single-source supplier dependencies create significant supply and pricing risks",
      "citation": "[AAPL, 10-K, 2024-11-01, Page 3]"
    }
  ],
  "confidence_score": 0.871,
  "confidence_reasoning": "Three corroborating chunks from two filing years directly address supply chain concentration and regulatory risk with verbatim figures.",
  "decision_recommendation": "Monitor AAPL's China exposure language across successive 10-K filings; the 2024 filing shows heightened geopolitical risk language relative to 2023.",
  "data_sufficiency": "SUFFICIENT",
  "chunks_used": 5,
  "latency_ms": 2077.2,
  "cost_usd": 0.01999,
  "input_tokens": 3437,
  "output_tokens": 187
}

Key Technical Decisions

Hybrid retrieval over pure vector search — BM25 rewards exact token matches without the averaging loss inherent to embedding representations. Ticker symbols and precise financial terminology are retrieved verbatim, with no semantic dilution.

RRF fusion over weighted averaging — eliminates the need to manually tune a vector-versus-BM25 weight parameter, which is dataset-specific and drifts as the corpus grows. Items that rank highly in both methods are naturally boosted; no calibration is required.

Page number preservation — PyMuPDF extracts actual page numbers from PDF filings. HTML iXBRL filings approximate pages every 3,000 words. Every citation is traceable to a specific section of a specific filing on a specific date — a requirement for any context where the source must be verifiable, not merely plausible.

Prompt-based citation enforcement — context is presented to the model as numbered, labelled blocks ([Context 1] [AAPL, 10-K, 2023-11-03, Page 14]), making it structurally difficult for the model to answer from general knowledge rather than the provided passages. The system prompt explicitly forbids external knowledge.
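Building those labelled blocks from chunk metadata could look like the following sketch. The function name and the dictionary keys are assumptions; only the block format itself comes from the description above.

```python
def format_context(chunks: list[dict]) -> str:
    """Render retrieved chunks as numbered blocks, each headed by its citation label."""
    blocks = []
    for index, chunk in enumerate(chunks, start=1):
        label = (f"[{chunk['ticker']}, {chunk['form_type']}, "
                 f"{chunk['filing_date']}, Page {chunk['page']}]")
        blocks.append(f"[Context {index}] {label}\n{chunk['text']}")
    return "\n\n".join(blocks)
```

Because the label is already in citation form, the model can copy it verbatim next to any claim drawn from that block.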

Structured output over plain completion — beta.chat.completions.parse with a Pydantic schema enforces the response shape at the API level. The model cannot return a free-form string. Every field — claims[], decision_recommendation, data_sufficiency — is validated before the response is returned. A malformed response raises an exception rather than propagating silently.

Blended confidence over LLM self-report alone — LLM self-assessed confidence is a notoriously unreliable signal. Weighting it 70/30 against the mean Cohere rerank score ties the reported confidence to an independent measure of retrieval quality rather than model verbosity.

Guardrails before the expensive model — scope and safety checks run via gpt-4o-mini concurrently at a fraction of the cost of GPT-4o. Filtering at the boundary rather than post-hoc prevents wasted compute on out-of-scope or harmful queries and keeps the system's usage surface well-defined.

Stack

AI & LLM

LLM: GPT-4o (temp 0.0)

Guardrails: GPT-4o-mini via LangChain

Structured output: beta.chat.completions.parse + Pydantic

Embeddings: OpenAI text-embedding-3-small

Reranker: Cohere cross-encoder

Evaluation: RAGAS

Search & Retrieval

Vector index: FAISS (CPU)

Keyword index: BM25 (rank-bm25)

Fusion: Reciprocal Rank Fusion

Data Processing

PDF parsing: PyMuPDF

HTML parsing: BeautifulSoup4

Validation: Pydantic v2

Data source: SEC EDGAR API

Infrastructure & Observability

API: FastAPI + Uvicorn

Auth: JWT + AWS Secrets Manager

Tracing: Langfuse

Metrics: LangSmith

CI: GitHub Actions

Deployment: Docker + Render