Hosted on gabo.es via the Hypermedia Protocol

Some customers want our help with document processing, for example with emails or invoices: we need to extract the relevant information and turn it into structured records.

I’d treat this as an “ingestion → extraction → validation → structuring → publishing” pipeline, with humans-in-the-loop where it matters (accuracy, exceptions, and policy).

1) Define the target “structured record”

Start with a schema per document type (invoice, receipt, purchase order, email request, etc.). Keep it simple and extensible.

Invoice example (core fields)

  • Vendor: name, VAT ID, address

  • Buyer: name, VAT ID

  • Invoice: number, issue date, due date, currency

  • Totals: subtotal, tax breakdown, total

  • Line items: description, qty, unit price, tax rate

  • Payment: IBAN, payment terms

  • Provenance: source file/email id, received timestamp, page count

  • Evidence: “this value came from this snippet / bbox / page”

That last part (evidence) is crucial for trust and audits.
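A minimal sketch of such a record in Python, with illustrative (not fixed) field names, and evidence attached per field:

```python
from dataclasses import dataclass, field

@dataclass
class Evidence:
    """Where a value came from: snippet + page (or bbox id)."""
    field_name: str
    snippet: str
    page: int

@dataclass
class InvoiceRecord:
    # Core fields only; extend per document type.
    vendor_name: str
    vendor_vat_id: str
    invoice_number: str
    issue_date: str          # ISO 8601
    currency: str
    subtotal: float
    total: float
    line_items: list = field(default_factory=list)
    evidence: list = field(default_factory=list)  # one Evidence per extracted field

rec = InvoiceRecord(
    vendor_name="Acme S.L.",
    vendor_vat_id="ESB12345678",
    invoice_number="INV-2024-001",
    issue_date="2024-03-01",
    currency="EUR",
    subtotal=100.0,
    total=121.0,
    evidence=[Evidence("total", "TOTAL: 121,00 EUR", page=1)],
)
```

Keeping evidence in the same record (rather than a side table) makes the audit trail travel with the data.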

2) Ingest + normalize

Inputs: PDFs, scans, email bodies, email attachments, EDI-ish PDFs, images.

Steps:

  • Collect from sources (email inbox, upload folder, API).

  • Convert to a canonical “document bundle”:

    • text (best-effort)

    • layout (pages, blocks)

    • images (per page)

    • metadata (sender, dates, thread id)

  • De-duplicate (hashing) and classify.
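The bundle and hash-based de-duplication can be sketched as follows (the bundle layout is an assumption, not a fixed format):

```python
import hashlib

def make_bundle(raw_bytes: bytes, text: str, metadata: dict) -> dict:
    """Canonical 'document bundle': best-effort text plus metadata,
    keyed by a content hash used for de-duplication."""
    return {
        "hash": hashlib.sha256(raw_bytes).hexdigest(),
        "text": text,
        "metadata": metadata,   # sender, dates, thread id, ...
        "pages": [],            # layout + per-page images filled in later
    }

seen = set()

def is_duplicate(bundle: dict) -> bool:
    # Byte-identical inputs produce the same hash, so re-sent
    # attachments and duplicate emails are dropped cheaply.
    if bundle["hash"] in seen:
        return True
    seen.add(bundle["hash"])
    return False
```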

```mermaid
graph TD
    subgraph client["PER-CLIENT (support contract)"]
        files["Client's .eml files\n(with PDF/Word/Excel)"]
        crawler["Folder Crawler Script"]
        llm_process["LLM Processing\n(user's API key)"]
        sql_adapter["SQL Export Adapter"]
        db["Client's Relational Database"]
    end

    subgraph product["PRODUCT (build into Seed)"]
        llm_layer["LLM Integration Layer\n(provider config + prompt gen)"]
        importers["Format Importers\n.eml | .pdf | .docx | .xlsx\n+ provenance annotations"]
        seed_docs["Seed Hypermedia Documents\n- Versioned & linked\n- Metadata fields\n- Block-level traceability\n- Full change history\n- Keyword + semantic search"]
        api["API / CLI / SDK\ndocument get | query | search"]
    end

    files --> crawler
    crawler --> llm_process
    llm_layer -.-> llm_process
    llm_process --> importers
    importers --> seed_docs
    seed_docs --> api
    api --> sql_adapter
    sql_adapter --> db
```

3) Classify document type + route

Use a lightweight classifier:

  • Heuristics (sender, keywords like “Invoice”, “Factura”, amounts, IBAN)

  • ML/LLM classification as fallback

Route to an extractor specialized for:

  • Invoices

  • Receipts

  • Contracts

  • Emails (requests, approvals, complaints, support)
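A keyword-based heuristic pass is often enough for the first cut; the keyword lists below are illustrative, with an ML/LLM call as the fallback when no rule fires:

```python
KEYWORDS = {
    "invoice": ["invoice", "factura", "iban", "vat"],
    "receipt": ["receipt", "ticket"],
    "contract": ["agreement", "contract", "hereby"],
}

def classify(text: str, sender: str = "") -> str:
    """Cheap heuristic classifier: count keyword hits per type.
    Falls through to 'email' (generic route) when nothing matches."""
    haystack = (sender + " " + text).lower()
    scores = {doc_type: sum(kw in haystack for kw in kws)
              for doc_type, kws in KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "email"
```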

4) Extract with “hybrid” methods (best results in practice)

Don’t bet everything on one technique.

For digital PDFs (text-based):

  • Parse text + layout (tables, key-value zones)

  • Use deterministic patterns for high-signal fields (VAT/IVA IDs, dates, invoice number formats, IBAN)

For scanned PDFs/images:

  • OCR

  • Then the same as above, but with lower confidence

LLM step (structured):

  • Ask the model to output strict JSON that matches your schema

  • Provide the model with:

    • extracted text

    • layout hints (tables, page headings)

    • instructions like “return null if missing, don’t guess”

  • Have the model also return citations/evidence (snippet + page, or bbox id) for each field.
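A sketch of the prompt construction and strict-JSON parsing around the model call (the schema fields and wording are assumptions; the actual LLM call is provider-specific and omitted):

```python
import json

SCHEMA_FIELDS = ["invoice_number", "issue_date", "total", "currency"]

def build_prompt(text: str) -> str:
    """Prompt that spells out 'strict JSON', 'null if missing,
    don't guess', and per-field evidence."""
    return (
        "Extract these fields from the invoice text below.\n"
        f"Fields: {', '.join(SCHEMA_FIELDS)}\n"
        "Rules: return strict JSON only; use null for missing fields; "
        "do not guess. For each field also return an 'evidence' object "
        "with the source snippet and page number.\n\n"
        f"TEXT:\n{text}"
    )

def parse_response(raw: str) -> dict:
    """Parse and sanity-check the model output; reject anything
    that is not valid JSON covering the full schema."""
    data = json.loads(raw)
    missing = [f for f in SCHEMA_FIELDS if f not in data]
    if missing:
        raise ValueError(f"model omitted fields: {missing}")
    return data
```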

5) Validate and score confidence

Run validators after extraction:

  • Invoice number present?

  • Totals match: sum(line_items) ≈ subtotal, subtotal + taxes ≈ total

  • Dates are sensible (due date ≥ issue date)

  • VAT/IVA format valid per country

  • IBAN checksum valid

  • Currency matches symbols
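The arithmetic and IBAN checks are fully deterministic; a minimal sketch (the IBAN check is the standard mod-97 algorithm from ISO 13616):

```python
def totals_match(line_items, subtotal, taxes, total, tol=0.01):
    # sum(line_items) ≈ subtotal and subtotal + taxes ≈ total,
    # within a small tolerance for rounding.
    return (abs(sum(line_items) - subtotal) <= tol
            and abs(subtotal + taxes - total) <= tol)

def iban_valid(iban: str) -> bool:
    """ISO 13616 mod-97 checksum: move the first four characters to
    the end, map letters A..Z to 10..35, and check mod 97 == 1."""
    s = iban.replace(" ", "").upper()
    if len(s) < 15 or not s.isalnum():
        return False
    rearranged = s[4:] + s[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1
```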

Compute an overall confidence score and decide automation level:

  • High confidence → auto-ingest

  • Medium → “review required”

  • Low → “manual entry”
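The routing itself is a couple of thresholds; the cutoffs below are illustrative and should be tuned per customer and document type:

```python
def route(confidence: float) -> str:
    """Map an overall confidence score to an automation level."""
    if confidence >= 0.95:
        return "auto-ingest"
    if confidence >= 0.70:
        return "review required"
    return "manual entry"
```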

6) Human-in-the-loop review UI (where you win deals)

For medium confidence cases:

  • Show the document side-by-side with extracted fields

  • Highlight evidence snippets

  • One-click fix + “why” (so you can learn)

Every correction becomes training data:

  • vendor-specific templates

  • recurring line-item patterns

  • preferred mappings (e.g., account codes, cost centers)

7) Map to the customer’s systems

Structured output typically needs to flow into:

  • ERP/accounting (NetSuite, SAP, Odoo, QuickBooks, Xero)

  • CRM/ticketing (HubSpot, Zendesk, Jira)

  • Document repository / knowledge base

Use a canonical internal model → export adapters:

  • JSON (API)

  • CSV (legacy)

  • UBL / Factur-X / PEPPOL-like formats if needed
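With a canonical internal model, the JSON and CSV adapters are thin; a sketch (UBL/Factur-X serializers would be separate, schema-driven adapters):

```python
import csv
import io
import json

def to_json(record: dict) -> str:
    """API-facing JSON export of the canonical record."""
    return json.dumps(record, ensure_ascii=False)

def to_csv(records: list) -> str:
    """Legacy CSV export: one row per record, columns are the
    union of keys across all records."""
    if not records:
        return ""
    fieldnames = sorted({k for r in records for k in r})
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()
```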

8) Store as “structured + source + provenance”

Keep:

  • Original document (immutable)

  • Extracted structured record (versioned)

  • Evidence map (field → snippet/page/bbox)

  • Processing log (model version, OCR version, rules triggered)

This makes audits, dispute resolution, and debugging straightforward.
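Concretely, one stored record might bundle all four pieces together; the layout and names here are illustrative only:

```python
# Original bytes live in immutable blob storage; everything else
# is versioned alongside the structured record.
stored = {
    "source": {"file": "inbox/inv-001.pdf", "immutable": True},
    "record": {"version": 3, "invoice_number": "INV-2024-001"},
    "evidence": {
        "invoice_number": {"snippet": "Invoice INV-2024-001", "page": 1},
    },
    "log": {"model": "extractor-v2", "ocr": None, "rules": ["iban_checksum"]},
}
```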

9) Practical deployment approach

Phase 1 (2–4 weeks): one document type (invoices) + 2–3 pilot customers

  • Build schema, ingestion, extraction, validation, review UI, export to one target system.

Phase 2: vendor learning + email intake

  • Recognize repeat vendors and handle their quirks

  • Parse “invoice via email” workflows (threading, attachments, approvals)

Phase 3: multi-doc workflows

  • Purchase order ↔ invoice matching

  • Payment status updates

  • Exceptions and disputes

10) What I’d implement first (minimal but powerful)

  • Ingestion from email + uploads

  • Invoice classifier

  • Extraction pipeline:

    • PDF text extraction → LLM JSON

    • OCR fallback

  • Validation engine (math + formats)

  • Review UI with evidence highlights

  • Export to customer’s accounting system

  • Continuous learning from corrections

If you tell me:

  1. which formats (Spanish “Factura” PDFs? scans?),

  2. where the structured data needs to end up (ERP/CRM?), and

  3. typical volume (10/day vs 10k/day),

I can sketch an architecture and a concrete “first pilot” plan with the exact fields and validators.

How to archive emails

Perfect — I’ll format this as a Seed-ready archive document, following the conventions you described:

  • One Email Thread document

  • One Email Message document per message

  • Attachments linked

  • Structured metadata

  • Full provenance-ready structure

  • Clean block structure for htmlToBlocks() or markdown import

Below is a Seed archive representation in Markdown form, ready for seed document create.
