Quiver Quantitative — OCR & Document Ingestion at Scale
Results
- Throughput: 4× increase
- Latency: –30% (median)
- Accuracy: 95%+ extraction
Stack
Python, Google Cloud Vision (OCR), Tesseract, SQL database
TL;DR
Re-architected an OCR pipeline for SEC-compliant government trade filings. Preprocessing, layout-aware parsing, and scalable workers on GCP raised extraction accuracy above 95% and delivered 4× throughput at 30% lower median latency.
Context
The previous pipeline struggled with skewed scans, small fonts, and nested tables, causing extraction errors and slow end-to-end times.
Problem
- OCR confidence varied wildly by document type.
- Throughput collapsed when large PDFs queued behind small ones (head-of-line blocking).
- Table parsing required excessive post-processing.
What I Built
- Preprocessing: de-skew, denoise, adaptive thresholding; dynamic DPI upscaling for small glyphs.
- Layout detection: page segmentation to route pages to table vs text parsers.
- Hybrid OCR: combine Tesseract for general text with a table-aware extractor; fall back for low-confidence spans.
- Parallelism: split PDFs into pages → distribute across workers; work stealing to avoid long-PDF starvation.
- Confidence-driven post-processing: only re-run expensive passes when below a defined confidence threshold.
- Observability: per-page accuracy, latency histograms, tail-latency alarms.
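The preprocessing bullet can be sketched in pure NumPy. An integral image gives O(1) local window sums, so each pixel is compared against its neighborhood mean; the window size and offset here are illustrative assumptions, not the production values:

```python
import numpy as np

def adaptive_threshold(gray: np.ndarray, window: int = 15, offset: int = 10) -> np.ndarray:
    """Binarize a grayscale page: a pixel becomes ink (0) when it is darker
    than its local mean minus an offset. Local means come from an integral
    image so cost is independent of window size."""
    h, w = gray.shape
    pad = window // 2
    # Replicate edges so border pixels still see a full window.
    padded = np.pad(gray.astype(np.float64), pad, mode="edge")
    # Zero border row/column, then 2-D cumulative sum -> integral image.
    integ = np.pad(padded, ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    ys, xs = np.arange(h), np.arange(w)
    y0, y1 = ys[:, None], ys[:, None] + window
    x0, x1 = xs[None, :], xs[None, :] + window
    sums = integ[y1, x1] - integ[y0, x1] - integ[y1, x0] + integ[y0, x0]
    means = sums / (window * window)
    # 255 = background, 0 = ink (dark text on light ground, as OCR expects).
    return np.where(gray < means - offset, 0, 255).astype(np.uint8)
```

Adaptive (rather than global) thresholding is what handles unevenly lit or skew-corrected scans, where a single cutoff would wipe out faint regions.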
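The parallelism bullet reduces to fanning pages from all documents into one shared task queue: idle workers pull whatever page is next, so a long PDF cannot starve short ones. (A single shared queue gives the same no-starvation property as work stealing in a simpler form; `ocr_page` is a stand-in for the real OCR call, not the actual API.)

```python
import queue
import threading

def process_documents(docs: dict[str, list[bytes]], ocr_page, workers: int = 4) -> dict[str, list[str]]:
    """OCR every page of every document in parallel via one shared queue,
    avoiding head-of-line blocking behind large PDFs."""
    tasks: queue.Queue = queue.Queue()
    results = {doc_id: [None] * len(pages) for doc_id, pages in docs.items()}
    for doc_id, pages in docs.items():
        for i, page in enumerate(pages):
            tasks.put((doc_id, i, page))

    def worker():
        while True:
            try:
                doc_id, i, page = tasks.get_nowait()
            except queue.Empty:
                return  # queue drained; worker exits
            # Each (doc, page) slot is written by exactly one worker.
            results[doc_id][i] = ocr_page(page)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

In production the workers would be separate GCP instances pulling from a message queue rather than threads, but the scheduling shape is the same.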
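The hybrid-OCR and confidence-gating bullets combine into a small routing function: run the cheap general-purpose engine first and only pay for the expensive pass when confidence is low. The 0.85 threshold and the two-pass shape are illustrative assumptions:

```python
def ocr_with_fallback(page, primary_ocr, fallback_ocr, threshold: float = 0.85):
    """Run the cheap OCR pass; invoke the expensive fallback (second engine,
    higher-DPI re-render) only when confidence is below the threshold."""
    text, confidence = primary_ocr(page)
    if confidence >= threshold:
        return text, confidence
    return fallback_ocr(page)
```

Because most pages clear the threshold, the expensive pass runs rarely, which is where the latency savings come from.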
Key Decisions
- Google Cloud Vision OCR: faster per-document processing and higher OCR quality; augmented by Tesseract where beneficial.
- Schema-first extraction (typed fields, validators) → fewer downstream corrections.
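Schema-first extraction can be sketched as a typed record that validates at construction, so a malformed extraction fails fast instead of propagating downstream. The field names and rules here are hypothetical, not the real schema:

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class TradeFiling:
    """Typed record for one extracted filing row; __post_init__ rejects
    values that OCR commonly garbles (tickers, amounts)."""
    ticker: str
    trade_date: date
    amount_usd: float

    def __post_init__(self):
        if not (self.ticker.isalpha() and self.ticker.isupper() and len(self.ticker) <= 5):
            raise ValueError(f"bad ticker: {self.ticker!r}")
        if self.amount_usd < 0:
            raise ValueError(f"negative amount: {self.amount_usd}")
```

Rejecting bad rows at the schema boundary is what cuts downstream corrections: errors surface next to the page that produced them, with the OCR context still available.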
What I’d Do Next
- Add learned layout models for tables; expand language packs.
- Active feedback loop: human corrections feed weak-supervision rules.