Quiver Quantitative — OCR & Document Ingestion at Scale

Results

  • Throughput: 4× increase
  • Latency: –30% (median)
  • Accuracy: 95%+ extraction

Stack

Python, Google Cloud Vision (OCR), Tesseract, SQL database

TL;DR

Re-architected an OCR pipeline for SEC-compliant government trade filings. Improved accuracy and latency through preprocessing, layout-aware parsing, and scalable workers on GCP, delivering a 4× throughput increase and a 30% reduction in median latency.

Context

The previous pipeline struggled with skewed scans, small fonts, and nested tables, causing extraction errors and slow end-to-end times.

Problem

  • OCR confidence varied wildly by document type.
  • Throughput collapsed when large PDFs queued behind small ones (head-of-line blocking).
  • Table parsing required excessive, expensive post-processing.

What I Built

  • Preprocessing: de-skew, denoise, adaptive thresholding; dynamic DPI upscaling for small glyphs.
  • Layout detection: page segmentation to route pages to table vs text parsers.
  • Hybrid OCR: combine Tesseract for general text with a table-aware extractor; fall back for low-confidence spans.
  • Parallelism: split PDFs into pages → distribute across workers; work stealing to avoid long-PDF starvation.
  • Confidence-driven post-processing: only re-run expensive passes when below a defined confidence threshold.
  • Observability: per-page accuracy, latency histograms, tail-latency alarms.
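The hybrid-OCR and confidence-driven items above can be sketched together: run the cheap engine first, and re-run the expensive pass only when confidence falls below a cutoff. The names (`extract_with_fallback`, `OcrResult`) and the 0.85 threshold are illustrative assumptions, not the production values.

```python
from dataclasses import dataclass
from typing import Callable

CONFIDENCE_THRESHOLD = 0.85  # hypothetical cutoff, tuned per document type in practice


@dataclass
class OcrResult:
    text: str
    confidence: float  # normalized to 0.0-1.0


def extract_with_fallback(page: bytes,
                          primary: Callable[[bytes], OcrResult],
                          fallback: Callable[[bytes], OcrResult],
                          threshold: float = CONFIDENCE_THRESHOLD) -> OcrResult:
    """Run the primary engine; invoke the expensive fallback only
    when the primary's confidence is below the threshold."""
    result = primary(page)
    if result.confidence >= threshold:
        return result
    retry = fallback(page)
    # Keep whichever pass the engine was more confident about.
    return retry if retry.confidence > result.confidence else result
```

Gating the second pass on confidence is what keeps the expensive extractor off the hot path for the majority of clean pages.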
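The parallelism point deserves a sketch as well: submitting each page as its own task means a thousand-page filing cannot block single-page filings queued behind it, which is the head-of-line problem named above. This is a minimal sketch using a shared thread pool (the real system ran distributed workers on GCP); `ocr_page` is a placeholder for the per-page OCR call.

```python
import concurrent.futures


def ocr_page(doc_id: str, page_no: int) -> tuple[str, int, str]:
    # Placeholder for the real per-page OCR call.
    return doc_id, page_no, f"text of {doc_id} p{page_no}"


def process_batch(docs: dict[str, int], max_workers: int = 8) -> dict[str, list[str]]:
    """Fan out every page of every PDF as an independent task, then
    reassemble per-document results as pages complete."""
    results: dict[str, list[str]] = {d: [""] * n for d, n in docs.items()}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(ocr_page, d, p)
                   for d, n in docs.items() for p in range(n)]
        for fut in concurrent.futures.as_completed(futures):
            doc_id, page_no, text = fut.result()
            results[doc_id][page_no] = text
    return results
```

Because the unit of work is a page rather than a document, idle workers naturally pick up pages from the long PDF, which is the same effect work stealing provides in a distributed queue.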

Key Decisions

  • Google Cloud Vision OCR: lower per-document latency and higher recognition quality than the prior setup; augmented by Tesseract where beneficial.
  • Schema-first extraction (typed fields, validators) → fewer downstream corrections.

What I’d Do Next

  • Add learned layout models for tables; expand language packs.
  • Active feedback loop: human corrections feed weak-supervision rules.