SERVICE

Structured Datasets from Unstructured Sources

The vast majority of economically valuable data comes in unstructured formats — handwritten documents, PDF filings, free-text job postings, clinical records. We deploy large language models, NLP pipelines, and agentic AI workflows to extract structured, validated data from these sources at scale.

The Challenge

Vast Archives, No Structure

Governments, banks, and research institutions hold decades of records in formats no machine can read — handwritten ledgers, scanned PDFs, free-text filings. The information exists but cannot be analysed.

AI Hype, Validation Gap

Off-the-shelf LLMs produce output that looks plausible but requires rigorous validation against ground truth before any analytical use. Most AI vendors ship outputs without quality guarantees.

Bespoke Requirements, Generic Tools

Each institution's data has unique structures, quality issues, and domain-specific vocabulary that generic data extraction tools cannot handle. Economic data demands economic understanding.

Our Approach

01

Data Assessment

We evaluate your unstructured data sources and define the target structured output — schema, coverage, and quality standards.

02

Pipeline Design

We design the AI extraction pipeline: model selection, prompt engineering, validation strategy, and quality gates.
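A quality gate of the kind this step describes can be as simple as a schema plus a validation predicate. The sketch below is illustrative only; the `LoanRecord` fields and the range checks are hypothetical, not the schema used in any actual engagement.

```python
from dataclasses import dataclass

# Hypothetical target schema for one extracted record.
@dataclass
class LoanRecord:
    borrower: str
    lender: str
    amount_usd: float
    year: int

def passes_quality_gate(record: LoanRecord) -> bool:
    """Minimal quality gate: reject records with missing or
    out-of-range values before they enter the dataset."""
    return (
        bool(record.borrower and record.lender)
        and record.amount_usd > 0
        and 1900 <= record.year <= 2025
    )

good = passes_quality_gate(LoanRecord("Acme Co", "First Bank", 250_000.0, 1994))
bad = passes_quality_gate(LoanRecord("", "First Bank", -5.0, 1994))
```

In practice the gate is defined together with the schema during assessment, so extraction and validation share one definition of "valid".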

03

Build & Validate

We run the pipeline at scale, validating outputs against ground truth and iterating until quality thresholds are met.

04

Delivery & Documentation

We deliver clean, documented datasets with full methodology notes and reproducibility guarantees.

Capabilities

Document Digitisation

AI-powered extraction from archival, handwritten, and scanned documents — from loan records to administrative filings. We achieve production-grade accuracy through multi-agent validation.

Structured datasets, Extraction pipelines, Quality reports

Text Classification

High-accuracy classification of large-scale text data using fine-tuned language models, validated against 30,000+ human labels. We built the system that classifies remote work in 250M job postings with 99% accuracy.

Classification models, Labelled datasets, Accuracy benchmarks
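Validation against human labels, as described above, reduces to comparing model output with a held-out labelled set. This is a toy sketch with made-up labels, not the actual benchmark; the label names are hypothetical.

```python
# Benchmark model labels against a human-labelled validation set.
def accuracy(model_labels, human_labels):
    assert len(model_labels) == len(human_labels), "sets must align"
    hits = sum(m == h for m, h in zip(model_labels, human_labels))
    return hits / len(human_labels)

human = ["remote", "onsite", "remote", "hybrid"]
model = ["remote", "onsite", "remote", "onsite"]
score = accuracy(model, human)  # 3 of 4 agree on this toy sample
```

At production scale the same comparison runs over tens of thousands of human labels, stratified so the benchmark covers rare classes as well as common ones.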

Agentic AI Pipelines

Multi-agent AI workflows where specialised models collaborate to build, validate, and quality-check datasets. Our Machinery of Progress dataset was built entirely through agentic AI collaboration.

Multi-agent pipelines, Automated QA, Documentation
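One common pattern behind multi-agent validation is consensus: several independently prompted extractors propose a value, and only agreed values survive. A minimal sketch, assuming a simple majority vote (the threshold and field values are illustrative):

```python
from collections import Counter

# Hypothetical consensus step: keep a field only when enough
# independent extractor agents propose the same value.
def consensus(proposals, min_agreement=2):
    value, count = Counter(proposals).most_common(1)[0]
    return value if count >= min_agreement else None

agreed = consensus(["1994", "1994", "1894"])    # two of three agree
disputed = consensus(["1994", "1894", "1984"])  # no majority: flag for review
```

Disputed fields are not discarded silently; in a real pipeline they are routed to a review queue, which is where human effort concentrates.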

Custom Dataset Construction

End-to-end design and delivery of bespoke structured datasets from unconventional sources. We work with your data, your domain, and your analytical needs.

Bespoke datasets, Schema design, Methodology papers
DATA HIGHLIGHT

Digitising America's Lending History

We deployed LLM tools to digitise 40 million archival loan documents, transforming handwritten records into the most granular dataset of US firm-lender relationships available to researchers and policymakers. The dataset covers 1.8 million firms and 179 bank failures from 1990 to 2023.

40M+ documents
1.8M firms
33yr of coverage
GET IN TOUCH

Have unstructured data that needs structure?

Describe your data challenge. We will assess feasibility and outline an extraction approach.

Typical response time: 24 hours