NFe OCR Pipeline
Multi-engine pipeline for classification, OCR extraction and automated organization of ~500 invoices/month, with 98% extraction accuracy.
The Problem
The administrative team was spending days every month manually processing ~500 invoice PDFs, a mix of scanned and native files. Each file required identifying the issuer, date and representative so it could be filed correctly in a hierarchical, per-company folder structure.
The Solution
A modular pipeline. PyPDF2 classifies each file as native or scanned using a text threshold. Native files go to PyMuPDF as the primary extractor (5-10x faster than alternatives), with pdfplumber as a fallback for complex layouts. Scanned files go to EasyOCR (CRAFT for text detection, CRNN for recognition), which needs no OS-level dependencies. A pandas table indexed by CNPJ gives O(1) lookups in 0.15 ms. Files are then organized automatically, with backup and duplicate handling.
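As an illustration only, the threshold-based classification and the duplicate-safe filing step could look roughly like the sketch below. It uses just the standard library. The threshold value, the function names classify_pdf and file_invoice, the numeric-suffix naming scheme and the copy-instead-of-move backup policy are all assumptions for the example, not details taken from the project. In the real pipeline the per-page strings would presumably come from PyPDF2's text extraction.

```python
import shutil
from pathlib import Path

# Hypothetical threshold: average extractable characters per page below
# which a PDF is treated as scanned. The real value is not given in the source.
MIN_CHARS_PER_PAGE = 50


def classify_pdf(page_texts, min_chars_per_page=MIN_CHARS_PER_PAGE):
    """Return "native" if the pages carry enough extractable text, else "scanned".

    page_texts holds one string (or None) per page, for example the result
    of calling extract_text() on each page of a PyPDF2 reader.
    """
    if not page_texts:
        return "scanned"
    total_chars = sum(len((text or "").strip()) for text in page_texts)
    average = total_chars / len(page_texts)
    return "native" if average >= min_chars_per_page else "scanned"


def file_invoice(src, dest_dir):
    """Copy src into dest_dir without overwriting an existing file.

    On a name collision a numeric suffix is appended (invoice_1.pdf, ...).
    The source file is copied, not moved, so the original remains as a
    backup (an assumed policy for this sketch).
    """
    src = Path(src)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    target = dest_dir / src.name
    counter = 1
    while target.exists():
        target = dest_dir / f"{src.stem}_{counter}{src.suffix}"
        counter += 1

    shutil.copy2(src, target)
    return target
```

A file classified as "native" would be routed to the PyMuPDF/pdfplumber path and a "scanned" one to EasyOCR. The destination directory passed to file_invoice would be built from the extracted issuer and date, for example a per-company, per-month folder.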
Result
Operational time dropped by 75% and human filing errors were eliminated. The process now scales with no additional staff cost, at 98% extraction accuracy.