Quiver Quantitative — OCR & Document Ingestion at Scale
Results
- Throughput: 4× increase
- Latency: –30% (median)
- Accuracy: 95%+ extraction
Stack
Python, Google Cloud Vision (OCR), Tesseract, SQL database
TL;DR
Re-architected an OCR pipeline for SEC-compliant government trade filings. Preprocessing, layout-aware parsing, and scalable workers on GCP raised extraction accuracy above 95% and delivered 4× throughput at 30% lower median latency.
Context
The previous pipeline struggled with skewed scans, small fonts, and nested tables, causing extraction errors and slow end-to-end times.
Problem
- OCR confidence varied wildly by document type.
- Throughput collapsed when large PDFs queued behind small ones (head-of-line blocking).
- Table parsing required excessive post-processing.
What I Built
- Preprocessing: de-skew, denoise, adaptive thresholding; dynamic DPI upscaling for small glyphs.
- Layout detection: page segmentation to route pages to table vs text parsers.
- Hybrid OCR: combine Tesseract for general text with a table-aware extractor; fall back for low-confidence spans.
- Parallelism: split PDFs into pages → distribute across workers; work stealing to avoid long-PDF starvation.
- Confidence-driven post-processing: only re-run expensive passes when below a defined confidence threshold.
- Observability: per-page accuracy, latency histograms, tail-latency alarms.
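The preprocessing bullet can be sketched in pure NumPy. An integral image gives O(1) local window sums, so each pixel is compared against its neighborhood mean; the window size and offset here are illustrative assumptions, not the production values:

```python
import numpy as np

def adaptive_threshold(gray: np.ndarray, window: int = 15, offset: int = 10) -> np.ndarray:
    """Binarize a grayscale page: a pixel becomes ink (0) when it is darker
    than its local mean minus an offset. Local means come from an integral
    image so cost is independent of window size."""
    h, w = gray.shape
    pad = window // 2
    # Replicate edges so border pixels still see a full window.
    padded = np.pad(gray.astype(np.float64), pad, mode="edge")
    # Zero border row/column, then 2-D cumulative sum -> integral image.
    integ = np.pad(padded, ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    ys, xs = np.arange(h), np.arange(w)
    y0, y1 = ys[:, None], ys[:, None] + window
    x0, x1 = xs[None, :], xs[None, :] + window
    sums = integ[y1, x1] - integ[y0, x1] - integ[y1, x0] + integ[y0, x0]
    means = sums / (window * window)
    # 255 = background, 0 = ink (dark text on light ground, as OCR expects).
    return np.where(gray < means - offset, 0, 255).astype(np.uint8)
```

Adaptive (rather than global) thresholding is what handles unevenly lit or skew-corrected scans, where a single cutoff would wipe out faint regions.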
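The parallelism bullet reduces to fanning pages from all documents into one shared task queue: idle workers pull whatever page is next, so a long PDF cannot starve short ones. (A single shared queue gives the same no-starvation property as work stealing in a simpler form; `ocr_page` is a stand-in for the real OCR call, not the actual API.)

```python
import queue
import threading

def process_documents(docs: dict[str, list[bytes]], ocr_page, workers: int = 4) -> dict[str, list[str]]:
    """OCR every page of every document in parallel via one shared queue,
    avoiding head-of-line blocking behind large PDFs."""
    tasks: queue.Queue = queue.Queue()
    results = {doc_id: [None] * len(pages) for doc_id, pages in docs.items()}
    for doc_id, pages in docs.items():
        for i, page in enumerate(pages):
            tasks.put((doc_id, i, page))

    def worker():
        while True:
            try:
                doc_id, i, page = tasks.get_nowait()
            except queue.Empty:
                return  # queue drained; worker exits
            # Each (doc, page) slot is written by exactly one worker.
            results[doc_id][i] = ocr_page(page)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

In production the workers would be separate GCP instances pulling from a message queue rather than threads, but the scheduling shape is the same.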
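The hybrid-OCR and confidence-gating bullets combine into a small routing function: run the cheap general-purpose engine first and only pay for the expensive pass when confidence is low. The 0.85 threshold and the two-pass shape are illustrative assumptions:

```python
def ocr_with_fallback(page, primary_ocr, fallback_ocr, threshold: float = 0.85):
    """Run the cheap OCR pass; invoke the expensive fallback (second engine,
    higher-DPI re-render) only when confidence is below the threshold."""
    text, confidence = primary_ocr(page)
    if confidence >= threshold:
        return text, confidence
    return fallback_ocr(page)
```

Because most pages clear the threshold, the expensive pass runs rarely, which is where the latency savings come from.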
Key Decisions
- Google Cloud Vision OCR: faster per-document processing and higher OCR quality; augmented by Tesseract where beneficial.
- Schema-first extraction (typed fields, validators) → fewer downstream corrections.
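Schema-first extraction can be sketched as a typed record that validates at construction, so a malformed extraction fails fast instead of propagating downstream. The field names and rules here are hypothetical, not the real schema:

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class TradeFiling:
    """Typed record for one extracted filing row; __post_init__ rejects
    values that OCR commonly garbles (tickers, amounts)."""
    ticker: str
    trade_date: date
    amount_usd: float

    def __post_init__(self):
        if not (self.ticker.isalpha() and self.ticker.isupper() and len(self.ticker) <= 5):
            raise ValueError(f"bad ticker: {self.ticker!r}")
        if self.amount_usd < 0:
            raise ValueError(f"negative amount: {self.amount_usd}")
```

Rejecting bad rows at the schema boundary is what cuts downstream corrections: errors surface next to the page that produced them, with the OCR context still available.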
What I’d Do Next
- Add learned layout models for tables; expand language packs.
- Active feedback loop: human corrections feed weak-supervision rules.