
Finance RAG

Production-grade retrieval-augmented generation over SEC filings. Ask plain-English questions about 10-K and 10-Q disclosures and get answers grounded in cited source documents.

  • ~2.1s median latency
  • ~$0.02 per query
  • ≥0.85 RAGAS faithfulness
  • 12 SEC filings indexed
  • RAG
  • LLM
  • FAISS
  • OpenAI
  • FastAPI
  • Python
  • SEC EDGAR
  • Cohere
  • Langfuse
View on GitHub →

The Problem

Standard RAG systems break on financial documents in three specific ways:

  • Ticker symbol loss — vector embeddings reduce identifiers like AAPL or NVDA to distributed representations, causing semantic search to miss exact ticker queries and return unrelated companies.
  • Financial term imprecision — specialised language (EBITDA, Tier 1 capital, Basel III) loses precision when averaged into dense vectors.
  • Hallucinated numbers — LLMs confidently fabricate revenue figures, debt ratios, and profit margins when given only vague semantic context.

Finance RAG solves all three through hybrid retrieval, multi-stage reranking, and mandatory citation enforcement.

Architecture

The system has five modular components:

1. Ingestion Pipeline

A three-stage process that transforms raw SEC filings into searchable indexes:

  • Fetch — downloads 10-K and 10-Q forms from SEC EDGAR for specified tickers and date ranges
  • Parse — PyMuPDF for PDFs (preserving real page numbers), BeautifulSoup for iXBRL HTML filings (approximating pages every 3,000 words)
  • Index — builds a FAISS vector index (OpenAI text-embedding-3-small, 1536-dim, L2-normalised) and a BM25 sparse index in parallel

Chunking uses 512-word windows with 64-word overlap. Each chunk carries full provenance: ticker, form type, filing date, source path, and page number.
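The sliding-window chunking described above can be sketched in a few lines. This is a minimal illustration, not the project's actual code; the function name `chunk_words` is hypothetical, and attaching the provenance metadata (ticker, form type, filing date, path, page) is omitted for brevity:

```python
def chunk_words(words, size=512, overlap=64):
    """Slide a `size`-word window over the document, stepping by
    `size - overlap` words so consecutive chunks share `overlap` words."""
    step = size - overlap
    chunks = []
    start = 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # final window already covers the tail
        start += step
    return chunks
```

With the defaults, a 1,000-word filing section yields three chunks, and the last 64 words of each chunk reappear at the start of the next.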

2. Hybrid Retrieval + Fusion

For each query, vector search and BM25 run in parallel — each returning top-20 candidates. Reciprocal Rank Fusion merges the two ranked lists without manual score-weight tuning:

RRF(doc) = Σ [ 1 / (k + rank_r(doc)) ]   where k = 60

This produces ~40 fused candidates. BM25 handles exact ticker and term matching; vector search handles semantic intent.
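The RRF formula above translates directly into code. A minimal sketch (the function name `rrf_fuse` is illustrative; inputs are ranked lists of document IDs, best first):

```python
def rrf_fuse(rankings, k=60):
    """Merge ranked lists via Reciprocal Rank Fusion: each list
    contributes 1 / (k + rank) per document, ranks starting at 1."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked in both lists (e.g. top of BM25, second in vector search) accumulates two reciprocal-rank terms and outranks documents that appear in only one list, which is exactly the "consistently high across both methods" boost described later.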

3. Cross-Encoder Reranking

The ~40 fused candidates are passed to Cohere's cross-encoder reranker, which evaluates the query and each document jointly and filters to the top 5 most relevant chunks. If Cohere credentials are absent, the system falls back to the fusion top 5, degrading gracefully rather than failing.
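The graceful-fallback pattern can be sketched as follows. This is an illustrative sketch, not the project's code: the function name `rerank` is mine, and the model string `"rerank-english-v3.0"` is an assumed Cohere model identifier, not confirmed by the source:

```python
def rerank(query, candidates, top_n=5, client=None):
    """Rerank fused candidates with a Cohere client when one is
    configured; otherwise fall back to the RRF fusion order."""
    if client is None:
        # No credentials: the fusion order is already ranked by RRF score
        return candidates[:top_n]
    resp = client.rerank(
        model="rerank-english-v3.0",  # assumed model name
        query=query,
        documents=candidates,
        top_n=top_n,
    )
    # Cohere returns result objects carrying the original candidate index
    return [candidates[r.index] for r in resp.results]
```

Keeping the fallback inside the retrieval layer means the rest of the pipeline never needs to know whether reranking actually ran.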

4. Citation-Enforced Generation

GPT-4o is prompted with a strict system instruction: every factual claim must include an inline citation in the format [TICKER, FORM_TYPE, FILING_DATE, Page PAGE_NUMBER]. External knowledge is explicitly forbidden. If the retrieved filings don't contain enough information, the model is instructed to say so rather than speculate.

Temperature is fixed at 0.0 for determinism.
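The numbered, citation-tagged context construction might look like the sketch below. The prompt wording, function name `build_context`, and chunk dictionary keys are all illustrative assumptions, not the project's exact implementation:

```python
SYSTEM_PROMPT = (
    "Answer using ONLY the numbered context blocks provided. Every factual "
    "claim must carry an inline citation of the form "
    "[TICKER, FORM_TYPE, FILING_DATE, Page PAGE_NUMBER]. If the context "
    "does not contain enough information, say so. Do not use outside "
    "knowledge."
)

def build_context(chunks):
    """Render retrieved chunks as numbered blocks, each prefixed with
    the citation tag the model is required to reproduce."""
    blocks = []
    for i, c in enumerate(chunks, start=1):
        tag = f"[{c['ticker']}, {c['form']}, {c['date']}, Page {c['page']}]"
        blocks.append(f"[Context {i}] {tag}\n{c['text']}")
    return "\n\n".join(blocks)
```

Because every block the model sees already carries its citation tag, producing an answer without a citation requires actively ignoring the format of its own input.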

5. Observability

Langfuse traces every stage — retrieval, reranking, and generation — capturing per-component latency, token counts, and cost. Missing credentials return None without breaking requests; tracing is additive, not load-bearing.
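The "missing credentials return None" behaviour can be sketched as a lazy, credential-gated client factory. This is an assumed pattern, not the project's code; the environment variable names match Langfuse's documented conventions but the function `get_tracer` is hypothetical:

```python
import os

def get_tracer():
    """Return a Langfuse client only when credentials are configured;
    otherwise None, so tracing stays additive rather than load-bearing."""
    if not (os.getenv("LANGFUSE_PUBLIC_KEY") and os.getenv("LANGFUSE_SECRET_KEY")):
        return None
    from langfuse import Langfuse  # imported lazily; optional dependency
    return Langfuse()
```

Callers check for None before emitting spans, so a missing or misconfigured Langfuse deployment can never fail a user request.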

Quality Gates

The project ships with a RAGAS evaluation suite that runs on a fixed 25-question test set against 1,260 chunks from 12 SEC filings (Apple, Microsoft, Amazon, Nvidia — 10-Ks from 2022–2024):

| Metric | Threshold |
|---|---|
| Faithfulness | ≥ 0.85 |
| Answer Relevancy | ≥ 0.80 |
| Context Precision | ≥ 0.78 |

A GitHub Actions workflow runs this evaluation on every pull request targeting main. All three metrics must pass or the PR is blocked — binary gate, no exceptions.
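The binary gate reduces to a simple threshold check over the RAGAS scores. A minimal sketch (the function name `gate` and the metric key spellings are illustrative assumptions):

```python
THRESHOLDS = {
    "faithfulness": 0.85,
    "answer_relevancy": 0.80,
    "context_precision": 0.78,
}

def gate(scores, thresholds=THRESHOLDS):
    """Binary CI gate: every metric must meet its floor, or the PR is
    blocked. Returns (passed, {metric: failing_score})."""
    failures = {m: s for m, s in scores.items()
                if m in thresholds and s < thresholds[m]}
    return len(failures) == 0, failures
```

In CI, a falsy first element becomes a non-zero exit code, which is what actually blocks the pull request.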

Key Technical Decisions

Hybrid retrieval over pure vector search — BM25 rewards exact token matches without embedding averaging. Ticker symbols and financial terminology are retrieved verbatim.

RRF fusion over weighted averaging — eliminates manual tuning of vector-vs-BM25 weights. Items consistently high across both methods are naturally boosted.

Page number preservation — PyMuPDF extracts actual page numbers from PDFs. HTML iXBRL filings approximate pages every 3,000 words. Every citation is traceable to a specific filing section — critical for regulatory contexts.

Prompt-based citation enforcement — structured numbered context blocks ([Context 1] [AAPL, 10-K, 2023-11-03, Page 14]) make it structurally difficult for the model to answer from general knowledge rather than the provided filings.

Stack

AI & LLM

  • LLM: GPT-4o (temp 0.0)
  • Embeddings: OpenAI text-embedding-3-small
  • Reranker: Cohere cross-encoder
  • Evaluation: RAGAS

Search & Retrieval

  • Vector index: FAISS (CPU)
  • Keyword index: BM25 (rank-bm25)

Data Processing

  • PDF parsing: PyMuPDF
  • HTML parsing: BeautifulSoup4
  • Validation: Pydantic v2

Infrastructure

  • API: FastAPI + Uvicorn
  • Observability: Langfuse