Hosted on gabo.es via the Hypermedia Protocol

Some customers want our help with document processing, for example with emails or invoices: we need to extract the relevant information and turn it into structured records.

I’d treat this as an “ingestion → extraction → validation → structuring → publishing” pipeline, with humans-in-the-loop where it matters (accuracy, exceptions, and policy).

1) Define the target “structured record”

Start with a schema per document type (invoice, receipt, purchase order, email request, etc.). Keep it simple and extensible.

Invoice example (core fields)

  • Vendor: name, VAT ID, address

  • Buyer: name, VAT ID

  • Invoice: number, issue date, due date, currency

  • Totals: subtotal, tax breakdown, total

  • Line items: description, qty, unit price, tax rate

  • Payment: IBAN, payment terms

  • Provenance: source file/email id, received timestamp, page count

  • Evidence: “this value came from this snippet / bbox / page”

That last part (evidence) is crucial for trust and audits.
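A minimal sketch of such a record in Python, with illustrative (not fixed) field names, and evidence attached per field:

```python
from dataclasses import dataclass, field

@dataclass
class Evidence:
    """Where a value came from: snippet + page (or bbox id)."""
    field_name: str
    snippet: str
    page: int

@dataclass
class InvoiceRecord:
    # Core fields only; extend per document type.
    vendor_name: str
    vendor_vat_id: str
    invoice_number: str
    issue_date: str          # ISO 8601
    currency: str
    subtotal: float
    total: float
    line_items: list = field(default_factory=list)
    evidence: list = field(default_factory=list)  # one Evidence per extracted field

rec = InvoiceRecord(
    vendor_name="Acme S.L.",
    vendor_vat_id="ESB12345678",
    invoice_number="INV-2024-001",
    issue_date="2024-03-01",
    currency="EUR",
    subtotal=100.0,
    total=121.0,
    evidence=[Evidence("total", "TOTAL: 121,00 EUR", page=1)],
)
```

Keeping evidence in the same record (rather than a side table) makes the audit trail travel with the data.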

2) Ingest + normalize

Inputs: PDFs, scans, email bodies, email attachments, EDI-ish PDFs, images.

Steps:

  • Collect from sources (email inbox, upload folder, API).

  • Convert to a canonical “document bundle”:

    • text (best-effort)

    • layout (pages, blocks)

    • images (per page)

    • metadata (sender, dates, thread id)

  • De-duplicate (hashing) and classify.
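The bundle and hash-based de-duplication can be sketched as follows (the bundle layout is an assumption, not a fixed format):

```python
import hashlib

def make_bundle(raw_bytes: bytes, text: str, metadata: dict) -> dict:
    """Canonical 'document bundle': best-effort text plus metadata,
    keyed by a content hash used for de-duplication."""
    return {
        "hash": hashlib.sha256(raw_bytes).hexdigest(),
        "text": text,
        "metadata": metadata,   # sender, dates, thread id, ...
        "pages": [],            # layout + per-page images filled in later
    }

seen = set()

def is_duplicate(bundle: dict) -> bool:
    # Byte-identical inputs produce the same hash, so re-sent
    # attachments and duplicate emails are dropped cheaply.
    if bundle["hash"] in seen:
        return True
    seen.add(bundle["hash"])
    return False
```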

```mermaid
graph TD
    subgraph client["PER-CLIENT (support contract)"]
        files["Client's .eml files\n(with PDF/Word/Excel)"]
        crawler["Folder Crawler Script"]
        llm_process["LLM Processing\n(user's API key)"]
        sql_adapter["SQL Export Adapter"]
        db["Client's Relational Database"]
    end

    subgraph product["PRODUCT (build into Seed)"]
        llm_layer["LLM Integration Layer\n(provider config + prompt gen)"]
        importers["Format Importers\n.eml | .pdf | .docx | .xlsx\n+ provenance annotations"]
        seed_docs["Seed Hypermedia Documents\n- Versioned & linked\n- Metadata fields\n- Block-level traceability\n- Full change history\n- Keyword + semantic search"]
        api["API / CLI / SDK\ndocument get | query | search"]
    end

    files --> crawler
    crawler --> llm_process
    llm_layer -.-> llm_process
    llm_process --> importers
    importers --> seed_docs
    seed_docs --> api
    api --> sql_adapter
    sql_adapter --> db
```

3) Classify document type + route

Use a lightweight classifier:

  • Heuristics (sender, keywords like “Invoice”, “Factura”, amounts, IBAN)

  • ML/LLM classification as fallback

Route to an extractor specialized for:

  • Invoices

  • Receipts

  • Contracts

  • Emails (requests, approvals, complaints, support)
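A keyword-based heuristic pass is often enough for the first cut; the keyword lists below are illustrative, with an ML/LLM call as the fallback when no rule fires:

```python
KEYWORDS = {
    "invoice": ["invoice", "factura", "iban", "vat"],
    "receipt": ["receipt", "ticket"],
    "contract": ["agreement", "contract", "hereby"],
}

def classify(text: str, sender: str = "") -> str:
    """Cheap heuristic classifier: count keyword hits per type.
    Falls through to 'email' (generic route) when nothing matches."""
    haystack = (sender + " " + text).lower()
    scores = {doc_type: sum(kw in haystack for kw in kws)
              for doc_type, kws in KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "email"
```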

4) Extract with “hybrid” methods (best results in practice)

Don’t bet everything on one technique.

For digital PDFs (text-based):

  • Parse text + layout (tables, key-value zones)

  • Use deterministic patterns for high-signal fields (VAT/IVA IDs, dates, invoice number formats, IBAN)

For scanned PDFs/images:

  • OCR

  • Then the same as above, but with lower confidence

LLM step (structured):

  • Ask the model to output strict JSON that matches your schema

  • Provide the model with:

    • extracted text

    • layout hints (tables, page headings)

    • instructions like “return null if missing, don’t guess”

  • Have the model also return citations/evidence (snippet + page, or bbox id) for each field.
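A sketch of the prompt construction and strict-JSON parsing around the model call (the schema fields and wording are assumptions; the actual LLM call is provider-specific and omitted):

```python
import json

SCHEMA_FIELDS = ["invoice_number", "issue_date", "total", "currency"]

def build_prompt(text: str) -> str:
    """Prompt that spells out 'strict JSON', 'null if missing,
    don't guess', and per-field evidence."""
    return (
        "Extract these fields from the invoice text below.\n"
        f"Fields: {', '.join(SCHEMA_FIELDS)}\n"
        "Rules: return strict JSON only; use null for missing fields; "
        "do not guess. For each field also return an 'evidence' object "
        "with the source snippet and page number.\n\n"
        f"TEXT:\n{text}"
    )

def parse_response(raw: str) -> dict:
    """Parse and sanity-check the model output; reject anything
    that is not valid JSON covering the full schema."""
    data = json.loads(raw)
    missing = [f for f in SCHEMA_FIELDS if f not in data]
    if missing:
        raise ValueError(f"model omitted fields: {missing}")
    return data
```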

5) Validate and score confidence

Run validators after extraction:

  • Invoice number present?

  • Totals match: sum(line_items) ≈ subtotal, subtotal + taxes ≈ total

  • Dates are sensible (due date ≥ issue date)

  • VAT/IVA format valid per country

  • IBAN checksum valid

  • Currency matches symbols
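The arithmetic and IBAN checks are fully deterministic; a minimal sketch (the IBAN check is the standard mod-97 algorithm from ISO 13616):

```python
def totals_match(line_items, subtotal, taxes, total, tol=0.01):
    # sum(line_items) ≈ subtotal and subtotal + taxes ≈ total,
    # within a small tolerance for rounding.
    return (abs(sum(line_items) - subtotal) <= tol
            and abs(subtotal + taxes - total) <= tol)

def iban_valid(iban: str) -> bool:
    """ISO 13616 mod-97 checksum: move the first four characters to
    the end, map letters A..Z to 10..35, and check mod 97 == 1."""
    s = iban.replace(" ", "").upper()
    if len(s) < 15 or not s.isalnum():
        return False
    rearranged = s[4:] + s[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1
```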

Compute an overall confidence score and decide automation level:

  • High confidence → auto-ingest

  • Medium → “review required”

  • Low → “manual entry”
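The routing itself is a couple of thresholds; the cutoffs below are illustrative and should be tuned per customer and document type:

```python
def route(confidence: float) -> str:
    """Map an overall confidence score to an automation level."""
    if confidence >= 0.95:
        return "auto-ingest"
    if confidence >= 0.70:
        return "review required"
    return "manual entry"
```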

6) Human-in-the-loop review UI (where you win deals)

For medium confidence cases:

  • Show the document side-by-side with extracted fields

  • Highlight evidence snippets

  • One-click fix + “why” (so you can learn)

Every correction becomes training data:

  • vendor-specific templates

  • recurring line-item patterns

  • preferred mappings (e.g., account codes, cost centers)

7) Map to the customer’s systems

Structured output typically needs to flow into:

  • ERP/accounting (NetSuite, SAP, Odoo, QuickBooks, Xero)

  • CRM/ticketing (HubSpot, Zendesk, Jira)

  • Document repository / knowledge base

Use a canonical internal model → export adapters:

  • JSON (API)

  • CSV (legacy)

  • UBL / Factur-X / PEPPOL-like formats if needed
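With a canonical internal model, the JSON and CSV adapters are thin; a sketch (UBL/Factur-X serializers would be separate, schema-driven adapters):

```python
import csv
import io
import json

def to_json(record: dict) -> str:
    """API-facing JSON export of the canonical record."""
    return json.dumps(record, ensure_ascii=False)

def to_csv(records: list) -> str:
    """Legacy CSV export: one row per record, columns are the
    union of keys across all records."""
    if not records:
        return ""
    fieldnames = sorted({k for r in records for k in r})
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()
```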

8) Store as “structured + source + provenance”

Keep:

  • Original document (immutable)

  • Extracted structured record (versioned)

  • Evidence map (field → snippet/page/bbox)

  • Processing log (model version, OCR version, rules triggered)

This makes audits, dispute resolution, and debugging straightforward.
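Concretely, one stored record might bundle all four pieces together; the layout and names here are illustrative only:

```python
# Original bytes live in immutable blob storage; everything else
# is versioned alongside the structured record.
stored = {
    "source": {"file": "inbox/inv-001.pdf", "immutable": True},
    "record": {"version": 3, "invoice_number": "INV-2024-001"},
    "evidence": {
        "invoice_number": {"snippet": "Invoice INV-2024-001", "page": 1},
    },
    "log": {"model": "extractor-v2", "ocr": None, "rules": ["iban_checksum"]},
}
```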

9) Practical deployment approach

Phase 1 (2–4 weeks): one document type (invoices) + 2–3 pilot customers

  • Build schema, ingestion, extraction, validation, review UI, export to one target system.

Phase 2: vendor learning + email intake

  • Recognize repeat vendors and handle their quirks

  • Parse “invoice via email” workflows (threading, attachments, approvals)

Phase 3: multi-doc workflows

  • Purchase order ↔ invoice matching

  • Payment status updates

  • Exceptions and disputes

10) What I’d implement first (minimal but powerful)

  • Ingestion from email + uploads

  • Invoice classifier

  • Extraction pipeline:

    • PDF text extraction → LLM JSON

    • OCR fallback

  • Validation engine (math + formats)

  • Review UI with evidence highlights

  • Export to customer’s accounting system

  • Continuous learning from corrections

If you tell me:

  1. which formats (Spanish “Factura” PDFs? scans?),

  2. where the structured data needs to end up (ERP/CRM?), and

  3. typical volume (10/day vs 10k/day),

I can sketch an architecture and a concrete “first pilot” plan with the exact fields and validators.

How to archive emails

Perfect — I’ll format this as a Seed-ready archive document, following the conventions you described:

  • One Email Thread document

  • One Email Message document per message

  • Attachments linked

  • Structured metadata

  • Full provenance-ready structure

  • Clean block structure for htmlToBlocks() or markdown import

Below is a Seed archive representation in Markdown form, ready for seed document create.
