AI Mill Test Certificate Data Extraction: Methods

Quick Answer

Three practical methods exist for AI MTC data extraction: rule-based template matching (high accuracy, brittle to new layouts), OCR plus post-processing (broad coverage, error-prone on tables), and LLM-based vision extraction (flexible, layout-agnostic, requires confidence scoring and human review for compliance use cases).

A mill test certificate carries the complete material identity of a heat of steel, pipe, or plate: heat number, chemistry, mechanical test results, the standard the material was tested against, and the certifying mill's statement. Getting that data into your ERP or quality system without manual re-entry is the core problem AI MTC extraction addresses.

This guide breaks down the three principal extraction methods, where each performs well, and what a production-grade MTC parser actually requires.

Method 1: Rule-Based Template Matching

Rule-based parsers use predefined coordinate maps or regex patterns tied to specific mill layouts. If you know that Mill X always places the carbon percentage at coordinates (412, 318) on page one, you can extract it deterministically.
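As a rough illustration of the regex variant, the sketch below assumes a hypothetical Mill X layout in which each field sits on its own labelled line. The pattern set, field names, and sample text are invented for the example and are not taken from any real mill's certificate.

```python
import re

# Hypothetical patterns for one mill's layout; every new layout needs its own set.
MILL_X_PATTERNS = {
    "heat_number": re.compile(r"Heat\s*No\.?\s*[:\-]?\s*([A-Z0-9]+)"),
    "carbon_pct": re.compile(r"\bC\s*%?\s*[:=]\s*(\d+\.\d+)"),
    "yield_mpa": re.compile(r"Yield\s*Strength\s*[:=]?\s*(\d+)\s*MPa"),
}

def extract_mill_x(text: str) -> dict:
    """Return field -> matched string, or None when the pattern finds nothing."""
    result = {}
    for field, pattern in MILL_X_PATTERNS.items():
        match = pattern.search(text)
        result[field] = match.group(1) if match else None
    return result

# Invented sample text in the assumed layout.
sample = "Heat No: A12345\nC % = 0.18\nYield Strength: 355 MPa"
fields = extract_mill_x(sample)
```

When the layout changes, the patterns simply stop matching and every field comes back as None, with nothing to indicate how close the match was.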

When it works well:

Single-supplier relationships with stable document formats
High-volume, identical-format certificate flows
Environments where 100% deterministic extraction is required and layout changes are rare

Limitations:

Each new mill or new template version requires a new rule set
Any layout change breaks extraction silently (no confidence signal)
Maintenance burden scales linearly with supplier count
Fails on scanned documents that have no text layer, unless an OCR step runs first

For organizations receiving MTCs from ten or fewer mills with stable formats, rule-based extraction is a reasonable low-cost choice. For organizations with dozens of suppliers, the maintenance overhead becomes prohibitive.

Method 2: OCR Plus Post-Processing

Traditional OCR converts document images to text, then post-processing scripts apply entity recognition to find field values. This approach is more flexible than rule-based parsing because it handles varying layouts through NLP rather than coordinate lookup.

The pipeline typically looks like:

PDF rendering to image
OCR (Tesseract, AWS Textract, Azure AI Document Intelligence, formerly Form Recognizer)
Text normalization
Named entity recognition to identify field labels
Value association logic to link labels to values
Schema mapping
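The later stages of this pipeline can be sketched in a few lines, assuming the OCR step has already produced a list of text lines. A plain lookup table stands in for the named entity recognition step here, and the label synonyms and field names are hypothetical.

```python
import re

# Hypothetical label synonyms mapped to schema field names.
LABEL_TO_FIELD = {
    "heat no": "heat_number",
    "heat number": "heat_number",
    "cast no": "heat_number",
    "grade": "grade",
    "specification": "standard",
}

def normalize(line: str) -> str:
    """Collapse whitespace runs and trim the line."""
    return re.sub(r"\s+", " ", line).strip()

def associate(lines: list) -> dict:
    """Link 'label: value' lines to schema fields; unknown labels are skipped."""
    record = {}
    for raw in lines:
        line = normalize(raw)
        if ":" not in line:
            continue
        label, value = line.split(":", 1)
        key = label.strip().lower().rstrip(".")
        field = LABEL_TO_FIELD.get(key)
        if field and value.strip():
            # Keep the first occurrence if a label repeats.
            record.setdefault(field, value.strip())
    return record

# Invented OCR output for illustration.
ocr_lines = ["Heat  No.:  A12345", "Grade :  S355J2",
             "Specification: EN 10025-2", "Page 1 of 2"]
record = associate(ocr_lines)
```

This only covers labels and values that share a line; values sitting in a table cell beneath their label need positional information that plain text does not carry.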

Accuracy characteristics:

Free-text fields (mill name, standard reference): 90–95%
Simple key-value pairs: 88–94%
Chemistry tables: 75–88% (table structure frequently lost by OCR)
Multi-column mechanical property tables: 70–85%

The fundamental weakness is that OCR output consumed as plain text carries no spatial context. A chemistry table with eight elements across a row requires the post-processor to reconstruct column associations from raw text, a fragile operation that degrades significantly with non-standard layouts.
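A small sketch shows why positional reconstruction is fragile. It assumes the OCR layer returns the header row and the value row as two whitespace-separated strings; the function name and the sample rows are invented for illustration.

```python
from typing import Optional

def rebuild_row(header_line: str, value_line: str) -> Optional[dict]:
    """Pair element headers with values by position; None if counts disagree."""
    headers = header_line.split()
    values = value_line.split()
    if len(headers) != len(values):
        # A cell was dropped or merged, so positions can no longer be trusted.
        return None
    return {h: float(v) for h, v in zip(headers, values)}

good = rebuild_row("C Si Mn P S", "0.18 0.25 1.40 0.015 0.008")
# Same row with one value lost by OCR: pairing by position is no longer safe.
bad = rebuild_row("C Si Mn P S", "0.18 0.25 1.40 0.008")
```

Refusing to pair when the counts disagree is the conservative choice. Without that check, the missing cell would shift every later value into the wrong column without raising any error.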

Method 3: LLM-Based Vision Extraction

Large language models with vision capability (vision-language models, or VLMs) process the rendered page as an image or as a hybrid image+text representation. Unlike OCR pipelines, the model understands table structure visually—it sees that a column of numbers falls beneath a "C%" header and infers the relationship without requiring the OCR layer to preserve it.

How extraction works in practice:

The PDF page is rendered to a high-resolution image
The VLM receives the image with a structured prompt specifying the target schema (heat_number, chemical elements, mechanical properties, applicable standard, etc.)
The model returns a JSON object with extracted values and per-field confidence scores
Low-confidence fields are flagged for human review
Confirmed values are written to the database alongside the source document reference
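The review-routing step above might look like the following sketch. The response shape (a "fields" object holding value and confidence pairs) and the 0.90 threshold are assumptions made for the example, not any vendor's format, and the model call itself is replaced by a hard-coded stand-in.

```python
import json

REVIEW_THRESHOLD = 0.90  # assumed cut-off; tune per deployment

def triage(response_text: str, threshold: float = REVIEW_THRESHOLD):
    """Split a model response into auto-accepted values and fields needing review."""
    payload = json.loads(response_text)
    accepted, needs_review = {}, []
    for name, item in payload["fields"].items():
        if item["value"] is not None and item["confidence"] >= threshold:
            accepted[name] = item["value"]
        else:
            needs_review.append(name)
    return accepted, needs_review

# Stand-in for a model response; the shape is an assumption, not a vendor format.
response = json.dumps({"fields": {
    "heat_number": {"value": "A12345", "confidence": 0.98},
    "carbon_pct": {"value": 0.18, "confidence": 0.95},
    "yield_mpa": {"value": 355, "confidence": 0.62},
}})
accepted, needs_review = triage(response)
```

Only the accepted values would go straight to the database; the flagged field names drive the human review queue.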

Accuracy characteristics (native PDF):

Chemistry table fields: 93–97%
Mechanical property fields: 94–98%
Heat/lot number: 96–99%
Standard and grade references: 95–98%

Accuracy characteristics (scanned MTC, good quality):

Chemistry table fields: 89–94%
Mechanical property fields: 90–95%

Platforms like TestCert implement this approach with a standards-aware schema, so extracted chemistry values are immediately compared against stored ASTM or EN limits rather than requiring a separate validation step.

Handling the Hard Cases

Multi-heat certificates

Some steel service centers issue a single PDF covering multiple heats. The extractor must segment the document into per-heat sections before applying the extraction schema. This requires an initial segmentation step that identifies heat boundaries—typically based on heat number occurrences or table row separators.
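A minimal segmentation sketch based on heat number occurrences follows. It assumes each heat section opens with a line such as "Heat No: ..."; that label format and the sample lines are hypothetical, and real certificates vary by issuer.

```python
import re

# Assumed heat-number label format; real certificates vary by issuer.
HEAT_LABEL = re.compile(r"^Heat\s*No\.?\s*:?\s*(\S+)", re.IGNORECASE)

def split_by_heat(lines):
    """Group lines into sections, starting a new one at each heat-number line."""
    sections = {}
    current = None
    for line in lines:
        match = HEAT_LABEL.match(line.strip())
        if match:
            current = match.group(1)
            sections.setdefault(current, [])
        elif current is not None:
            sections[current].append(line.strip())
        # Lines before the first heat label (document header) are ignored here.
    return sections

# Invented document text for illustration.
doc = [
    "Certificate 2024-001",
    "Heat No: A111",
    "C 0.18 Mn 1.40",
    "Heat No: B222",
    "C 0.20 Mn 1.35",
]
sections = split_by_heat(doc)
```

Each resulting section can then be passed through the extraction schema on its own, so values from one heat are never attributed to another.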

Supplementary test data

MTCs for pressure vessel materials often carry supplementary tests (Charpy impact, PWHT records, corrosion test results) on additional pages. A robust extractor maps these to an extensible supplementary-data schema rather than discarding them.

Multi-language certificates

EN 10204 certificates from European mills often arrive in German, French, or Italian. LLM-based extractors handle these without separate language models—the underlying model understands field semantics across languages—though accuracy on less common languages degrades slightly.

Handwritten annotations

Any handwritten value on a printed MTC (common for inspector stamps or field corrections) should be routed to human review. Current models handle typed and machine-printed text reliably; handwriting is a known degradation point.

What a Production MTC Parser Requires

Beyond raw extraction capability, a production deployment needs:

Confidence scoring per field — not a single document-level score
Rejection routing — documents below a quality threshold held for full manual entry, not partially extracted
Audit trail — who extracted, when, what was flagged, what was corrected
Immutable source document storage — the original PDF retained alongside the structured record
Standards validation integration — extracted values checked against limits at extraction time, not downstream
Webhook or API output — extracted records pushed to ERP/MES without manual export steps
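Two of these requirements, the audit trail and the link to an immutable source document, can be sketched together. The record layout below is illustrative only: the field names are assumptions, and a SHA-256 digest of the PDF bytes is used as one possible way to tie the structured record to the stored original.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class AuditEntry:
    """One extraction event; field names here are illustrative."""
    source_sha256: str
    extracted_by: str
    flagged_fields: tuple
    corrections: tuple  # (field, old_value, new_value) triples
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

def make_entry(pdf_bytes, user, flagged, corrections):
    """Hash the source document and freeze the review outcome into a record."""
    digest = hashlib.sha256(pdf_bytes).hexdigest()
    return AuditEntry(digest, user, tuple(flagged), tuple(corrections))

# Invented inputs for illustration.
entry = make_entry(b"%PDF-1.7 sample bytes", "reviewer_1",
                   ["yield_mpa"], [("yield_mpa", 335, 355)])
```

The frozen dataclass prevents accidental mutation in application code; actual immutability still depends on how the storage layer is configured.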

FAQs

Can AI extract data from a scanned MTC that was faxed multiple times?

Quality degrades significantly with each fax generation. A fax-of-a-fax document often falls below the 150 DPI effective resolution that vision models need to perform reliably. These documents should be flagged automatically and routed to manual entry. Requesting a fresh PDF directly from the mill is always preferable when possible.
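An automatic flag of this kind could start from a simple resolution check, sketched below. It assumes a US Letter page (8.5 x 11 inches) by default and uses hypothetical function names. Pixel count is only an upper bound on effective resolution: a repeatedly faxed page can have enough pixels and still be illegible, so a real pipeline would likely add an image-quality measure on top.

```python
def effective_dpi(width_px: int, height_px: int,
                  page_width_in: float = 8.5,
                  page_height_in: float = 11.0) -> float:
    """Return the lower of horizontal and vertical pixels per inch."""
    return min(width_px / page_width_in, height_px / page_height_in)

def route_by_resolution(width_px: int, height_px: int,
                        min_dpi: float = 150.0) -> str:
    """Send scans below min_dpi to manual entry; others go on to extraction."""
    if effective_dpi(width_px, height_px) >= min_dpi:
        return "extract"
    return "manual_entry"

# A 1700 x 2200 px Letter page is 200 DPI; an 850 x 1100 px page is 100 DPI.
high_res = route_by_resolution(1700, 2200)
low_res = route_by_resolution(850, 1100)
```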

How does AI handle certificates with custom or non-standard fields?

LLM-based extractors can surface unrecognized fields as key-value pairs in an "additional data" bucket rather than discarding them. The reviewer can then decide whether to map the value to an existing schema field or record it as supplementary metadata. Rule-based parsers simply discard unrecognized fields.
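The "additional data" bucket amounts to a partition of the extracted keys, as in this sketch. The schema subset and the sample field names are invented for illustration.

```python
# Illustrative subset of a target schema.
SCHEMA_FIELDS = {"heat_number", "grade", "standard", "carbon_pct"}

def partition_fields(extracted: dict):
    """Separate schema fields from unrecognized ones kept for reviewer mapping."""
    known = {k: v for k, v in extracted.items() if k in SCHEMA_FIELDS}
    additional = {k: v for k, v in extracted.items() if k not in SCHEMA_FIELDS}
    return known, additional

# Invented extraction result containing one field the schema does not know.
known, additional = partition_fields({
    "heat_number": "A12345",
    "grade": "S355J2",
    "customer_po": "PO-0001",
})
```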

Does extraction accuracy improve over time?

Yes, if the system is designed for it. Reviewer corrections should be logged and periodically used to fine-tune the extraction model or update confidence thresholds for specific mill formats. Systems that treat every document as a fresh extraction without learning from corrections plateau quickly.
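One simple form of this feedback loop is to track how often reviewers correct each field per mill and tighten the review threshold where corrections are frequent. The sketch below assumes a correction log of (mill, field, was_corrected) tuples; the 5% trigger rate, the 0.05 step, and the 0.99 cap are placeholder values, not recommendations.

```python
from collections import defaultdict

def correction_rates(log):
    """log: iterable of (mill, field, was_corrected) -> rate per (mill, field)."""
    totals = defaultdict(int)
    corrected = defaultdict(int)
    for mill, field, was_corrected in log:
        totals[(mill, field)] += 1
        if was_corrected:
            corrected[(mill, field)] += 1
    return {key: corrected[key] / totals[key] for key in totals}

def adjusted_threshold(base, rate, max_rate=0.05, bump=0.05, cap=0.99):
    """Raise the review threshold when the correction rate exceeds max_rate."""
    return min(cap, base + bump) if rate > max_rate else base

# Invented review log for illustration.
log = [
    ("mill_a", "carbon_pct", True),
    ("mill_a", "carbon_pct", False),
    ("mill_b", "carbon_pct", False),
]
rates = correction_rates(log)
```

Model fine-tuning on the same log is a separate and heavier step; threshold adjustment is the cheaper of the two options mentioned above.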

What file formats does AI MTC extraction support beyond PDF?

Native PDFs and rasterized PDF images are the primary formats. Most production pipelines also handle TIFF, JPEG, and PNG for scanned documents. Excel-format MTCs (common from some mills in Asia) require a separate extraction path that reads the spreadsheet structure directly rather than rendering it as an image.

How do I validate that extracted chemistry matches the reported standard?

The extractor should output both the raw extracted value and a pass/fail flag against the applicable standard. This requires a stored, versioned standards database (ASTM, EN, API, ASME limits per grade) integrated with the extraction pipeline. If the extractor only outputs raw values, validation is a separate manual step—negating much of the automation benefit.
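A minimal sketch of that output shape follows, with limits keyed by grade and version. The grade name "EXAMPLE-GRADE" and all limit values are placeholders invented for the example and do not come from any ASTM, EN, API, or ASME document.

```python
# Illustrative limits only: "EXAMPLE-GRADE" and its numbers are NOT from any standard.
LIMITS = {
    ("EXAMPLE-GRADE", "2024"): {
        "C": (None, 0.20),   # (min, max) in weight %
        "Mn": (0.60, 1.60),
        "P": (None, 0.025),
    }
}

def validate(grade, version, chemistry):
    """Return element -> {'value', 'status'}; unlisted elements get 'no_limit'."""
    limits = LIMITS[(grade, version)]
    report = {}
    for element, value in chemistry.items():
        if element not in limits:
            status = "no_limit"
        else:
            low, high = limits[element]
            ok = (low is None or value >= low) and (high is None or value <= high)
            status = "pass" if ok else "fail"
        report[element] = {"value": value, "status": status}
    return report

# Invented extracted chemistry for illustration.
report = validate("EXAMPLE-GRADE", "2024", {"C": 0.18, "Mn": 1.70, "Cu": 0.10})
```

Keying the limits by version as well as grade lets a record be re-checked against the edition of the standard that applied when the material was certified.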

