Structured Datasets from Unstructured Sources
The vast majority of economically valuable data comes in unstructured formats — handwritten documents, PDF filings, free-text job postings, clinical records. We deploy large language models, NLP pipelines, and agentic AI workflows to extract structured, validated data from these sources at scale.
The Challenge
Vast Archives, No Structure
Governments, banks, and research institutions hold decades of records in formats no machine can read — handwritten ledgers, scanned PDFs, free-text filings. The information exists but cannot be analysed.
AI Hype, Validation Gap
Off-the-shelf LLMs produce output that looks plausible but requires rigorous validation against ground truth before any analytical use. Most AI vendors ship outputs without quality guarantees.
Bespoke Requirements, Generic Tools
Each institution's data has unique structures, quality issues, and domain-specific vocabulary that generic data extraction tools cannot handle. Economic data demands economic understanding.
Our Approach
Data Assessment
We evaluate your unstructured data sources and define the target structured output — schema, coverage, and quality standards.
Pipeline Design
We design the AI extraction pipeline: model selection, prompt engineering, validation strategy, and quality gates.
Build & Validate
We run the pipeline at scale, validating outputs against ground truth and iterating until quality thresholds are met; a sketch of one such quality gate follows these steps.
Delivery & Documentation
Clean, documented datasets delivered with full methodology notes and reproducibility guarantees.
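To make the Build & Validate step concrete, here is a minimal sketch of a per-field quality gate, assuming extracted records and a hand-labelled ground-truth sample keyed by document ID. The field names and thresholds are illustrative placeholders, not our production configuration.

```python
"""Minimal sketch of a per-field quality gate: compare extracted records
against a hand-labelled ground-truth sample and fail the batch if any
field falls below its accuracy threshold. All names are illustrative."""

FIELD_THRESHOLDS = {"borrower_name": 0.98, "loan_amount": 0.99, "origination_date": 0.97}

def field_accuracy(extracted: dict, ground_truth: dict, field: str) -> float:
    """Share of ground-truth documents where the extracted field matches exactly."""
    matches = sum(
        1 for doc_id, truth in ground_truth.items()
        if extracted.get(doc_id, {}).get(field) == truth[field]
    )
    return matches / len(ground_truth)

def passes_quality_gate(extracted: dict, ground_truth: dict) -> bool:
    """True only if every field clears its threshold on the validation sample."""
    return all(
        field_accuracy(extracted, ground_truth, field) >= threshold
        for field, threshold in FIELD_THRESHOLDS.items()
    )
```

A batch only ships when every field clears its threshold; anything below goes back into the pipeline for another iteration.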
Capabilities
Document Digitisation
AI-powered extraction from archival, handwritten, and scanned documents — from loan records to administrative filings. We achieve production-grade accuracy through multi-agent validation.
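As an illustration of what a single extraction step can look like, the sketch below prompts a model to return a fixed JSON schema from OCR text and rejects anything that fails to parse or type-check. `call_llm` is a stand-in for whichever model API a given pipeline uses, and the schema is hypothetical.

```python
"""Illustrative shape of a single-document extraction step: prompt an LLM
to return a fixed JSON schema, then reject anything that does not parse
or validate. `call_llm` stands in for the chosen model API."""

import json

SCHEMA_FIELDS = {"borrower_name": str, "loan_amount": (int, float), "origination_date": str}

PROMPT = (
    "Extract the following fields from this scanned loan record and "
    "return ONLY a JSON object with keys {keys}. Use null when a field "
    "is illegible.\n\n{ocr_text}"
)

def extract_record(ocr_text: str, call_llm) -> dict | None:
    """Return a validated record, or None if the model's output fails checks."""
    raw = call_llm(PROMPT.format(keys=list(SCHEMA_FIELDS), ocr_text=ocr_text))
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return None  # malformed output goes to a retry/review queue, not the dataset
    for field, expected_type in SCHEMA_FIELDS.items():
        value = record.get(field)
        if value is not None and not isinstance(value, expected_type):
            return None  # wrong type: treat as a failed extraction
    return record
```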
Text Classification
High-accuracy classification of large-scale text data using fine-tuned language models, validated against 30,000+ human labels. We built the system that classifies remote work in 250M job postings with 99% accuracy.
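The validation logic behind a claim like that is straightforward to sketch: score the classifier on a held-out, human-labelled sample and report accuracy with a confidence interval. In the sketch below, `classify` stands in for the fine-tuned model, and the sample is assumed to be (text, label) pairs.

```python
"""Sketch of scoring a classifier against human labels before release:
hold out a labelled sample, compare predictions, and report accuracy
with a simple binomial confidence interval."""

import math

def validate_classifier(classify, labelled_sample: list[tuple[str, str]]) -> dict:
    """labelled_sample: (posting_text, human_label) pairs."""
    correct = sum(1 for text, label in labelled_sample if classify(text) == label)
    n = len(labelled_sample)
    acc = correct / n
    # 95% normal-approximation interval; reasonable at validation sets of tens of thousands
    half_width = 1.96 * math.sqrt(acc * (1 - acc) / n)
    return {"n": n, "accuracy": acc, "ci95": (acc - half_width, acc + half_width)}
```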
Agentic AI Pipelines
Multi-agent AI workflows where specialised models collaborate to build, validate, and quality-check datasets. Our Machinery of Progress dataset was built entirely through agentic AI collaboration.
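The core pattern is simple to sketch, even though production versions involve more roles: one model extracts, a second independently checks, and unresolved disagreements escalate to human review. The callables below are stand-ins for separately prompted models, not a specific framework.

```python
"""Toy structure of a multi-agent extraction loop: an extractor proposes a
record, a checker independently reviews it, and rejections are retried with
the critique fed back before escalating to a human."""

def agentic_extract(document: str, extractor, checker, max_retries: int = 2):
    """Return (record, status). Only records the checker approves are accepted."""
    for attempt in range(max_retries + 1):
        record = extractor(document)
        verdict = checker(document, record)  # e.g. "approve" or a critique string
        if verdict == "approve":
            return record, "accepted"
        # Feed the critique back so the extractor can correct itself on retry
        document = f"{document}\n\nPrevious attempt was rejected: {verdict}"
    return record, "needs_human_review"  # unresolved after retries: escalate
```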
Custom Dataset Construction
End-to-end design and delivery of bespoke structured datasets from unconventional sources. We work with your data, your domain, and your analytical needs.
Digitising America's Lending History
We deployed LLM-based extraction pipelines to digitise 40 million archival loan documents, transforming handwritten records into the most granular dataset of US firm-lender relationships available to researchers and policymakers. The dataset covers 1.8 million firms and 179 bank failures from 1990 to 2023.
Related Data Products
Work from Home Map
The definitive picture of remote work, built from 250M+ job postings across five countries.
AIPNET
A generative AI map of global production, revealing input-output connections across 5,000+ products.
US Firm-Lender Credit Map
The hidden history of American credit, reconstructed from 40M+ archival loan documents.
Have unstructured data that needs structure?
Describe your data challenge. We will assess feasibility and outline an extraction approach.
Typical response time: 24 hours