Data

Fictor Alimentos

NFe OCR Pipeline

Multi-engine pipeline for classification, OCR extraction and automated organization of ~500 invoices/month — 98% accuracy.

The Problem

The administrative team spent days every month manually processing ~500 invoice PDFs, a mix of scanned and native files. Each file required identifying the issuer, date and representative so it could be filed correctly in a hierarchical per-company structure.

The Solution

Modular pipeline. PyPDF2 runs a threshold-based check that classifies each PDF as native or scanned. PyMuPDF (5-10x faster than alternatives) is the primary extractor, with pdfplumber as a fallback for complex layouts. EasyOCR (CRAFT for text detection, CRNN for recognition) handles scanned files with no OS-level dependencies. A pandas table indexed by CNPJ gives O(1) lookups in about 0.15 ms. Files are then organized automatically, with backup and duplicate handling.
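The native-vs-scanned check can be sketched roughly as below. This is a minimal illustration, not the project's actual code: the source only says the classification is threshold-based, so the metric (average extractable characters per page), the value `MIN_CHARS_PER_PAGE = 50`, and the function names `classify_pages` and `classify_pdf` are all assumptions.

```python
from typing import Iterable, Optional

# Hypothetical threshold: the source does not state the real metric or value.
# Here a PDF counts as "native" when its text layer averages at least this
# many characters per page.
MIN_CHARS_PER_PAGE = 50


def classify_pages(page_texts: Iterable[Optional[str]],
                   min_chars: int = MIN_CHARS_PER_PAGE) -> str:
    """Return "native" or "scanned" from the text extracted for each page."""
    # Treat a missing text layer (None) the same as an empty string.
    lengths = [len((text or "").strip()) for text in page_texts]
    if not lengths:
        return "scanned"  # nothing readable: route the file to OCR
    average = sum(lengths) / len(lengths)
    return "native" if average >= min_chars else "scanned"


def classify_pdf(path: str, min_chars: int = MIN_CHARS_PER_PAGE) -> str:
    """Classify a PDF on disk using the text layer exposed by PyPDF2."""
    # Imported lazily so classify_pages stays usable without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    return classify_pages((page.extract_text() for page in reader.pages), min_chars)
```

Under these assumptions, a file classified as "native" would go to the PyMuPDF/pdfplumber path and a "scanned" one to EasyOCR. Keeping the decision in `classify_pages` separates the threshold logic from PDF parsing, so it can be tuned and tested on plain strings.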

Result

75% reduction in operational time, elimination of manual filing errors, scalability with no additional staff cost, and 98% extraction accuracy.

Related Projects

Fictor Alimentos

Data

RPA Suite Fictor

Suite of 80+ RPA pipelines automating critical logistics, supply chain and sales reports for 5 subsidiaries.

Python, Selenium, BeautifulSoup, FastAPI, +1


Data

Featured

ELT Pipeline AWS — Medallion

Multi-tenant analytical platform on AWS with 4-layer Medallion architecture — 99% cost reduction vs Azure Databricks.

AWS, S3, Athena, Glue, +6
