benchmarks

Compare document extraction providers.

Representative document benchmark across 139 PDFs and 9,762 pages. Default rankings require at least 85% document success and non-zero source bounding-box (bbox) coverage.

extract latency: 1.6s p50 wall-clock (#1)

extract coverage: 100%, 139/139 documents in selected filter (#2)

speed vs consensus

Upper-left is better: faster latency with higher consensus against the majority-provider reference.

[Scatter plot. x-axis: latency (← faster, slower →); y-axis: consensus; legend: extract, qualified provider]

provider table

Exact measurements for the selected documents.

provider       mode              success   consensus F1   p50 latency   pages/sec   bbox
extract        hosted            139/139   97.6%          1.6s          12.6        100%
AWS Textract   LAYOUT            138/139   80.8%          23.6s         1.2         97.1%
LlamaParse     premium           139/139   93.6%          24.4s         1.6         96.4%
Azure DI       prebuilt-layout   139/139   99.4%          29.6s         1.2         100%
Reducto        standard parse    139/139   99%            36.6s         0.8         100%

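The default-ranking rule from the methodology (at least 85% document success and non-zero bbox coverage) can be applied to the rows above roughly as follows. This is a minimal sketch: the `ProviderRow` type, the `qualifies` and `rank_by_latency` helpers, and the choice to order qualified providers by p50 latency are illustrative assumptions, not the dashboard's actual implementation. The row values are copied from the table.

```python
from dataclasses import dataclass

# Threshold taken from the methodology section of this page.
MIN_SUCCESS_RATE = 0.85


@dataclass
class ProviderRow:  # hypothetical container for one table row
    provider: str
    succeeded: int        # documents extracted successfully
    total: int            # documents attempted
    p50_latency_s: float  # median wall-clock latency in seconds
    bbox_pct: float       # source bbox coverage, in percent


def qualifies(row: ProviderRow) -> bool:
    """True when a row meets both default-ranking requirements."""
    if row.total == 0:
        return False
    success_rate = row.succeeded / row.total
    return success_rate >= MIN_SUCCESS_RATE and row.bbox_pct > 0


def rank_by_latency(rows: list[ProviderRow]) -> list[str]:
    """Names of qualified providers, fastest p50 latency first (assumed ordering)."""
    qualified = [r for r in rows if qualifies(r)]
    return [r.provider for r in sorted(qualified, key=lambda r: r.p50_latency_s)]


ROWS = [
    ProviderRow("extract", 139, 139, 1.6, 100.0),
    ProviderRow("AWS Textract", 138, 139, 23.6, 97.1),
    ProviderRow("LlamaParse", 139, 139, 24.4, 96.4),
    ProviderRow("Azure DI", 139, 139, 29.6, 100.0),
    ProviderRow("Reducto", 139, 139, 36.6, 100.0),
]

if __name__ == "__main__":
    for name in rank_by_latency(ROWS):
        print(name)
```

Under these assumptions all five providers qualify, since the lowest success rate in the table is 138/139 and every bbox coverage value is above zero.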
documents in this view
Executive memo · 3p · tiny, core, tiny, born-digital, synthetic
Cover letter · 1p · tiny, tiny, synthetic, forms
Invoice (1pg) · 1p · tiny, tiny, synthetic, forms
Thermal receipt · 1p · tiny, rasterized, tiny, scanned, synthetic
Business card · 1p · tiny, rasterized, tiny, scanned, synthetic
USPS shipping label · 1p · tiny, rasterized, tiny, synthetic, forms
Meeting notes scan · 2p · tiny, rasterized, tiny, scanned, synthetic
Event flyer · 1p · tiny, tiny, synthetic, magazines
Attention Is All You Need · 15p · academic, core, tables, two-column, public
Curiosity-driven exploration · 12p · academic, core, two-column, public
Segment Anything · 30p · academic, public, two-column
COVID-19 paper (CORD-19 / arxiv) · 12p · academic, public, tables
Oncology / drug-discovery preprint · 26p · academic, public, tables
Long-form CS preprint (LoRA / arxiv) · 26p · academic, public, long-form
CLIP preprint (CS, two-column) · 48p · academic, public, two-column
math.AG preprint · 14p · academic, public, math
Physics preprint · 24p · academic, public, two-column
96-page financial PDF · 96p · financial, core, large, tables
Mixed compliance packet (mid) · 123p · financial, core, large, tables
Pfizer 10-K · 121p · financial, large, tables, public
HPE 10-K · 176p · financial, large, tables, public
Apple FY2025 10-K · 95p · financial, large, tables, public
Tesla Q1 10-Q · 62p · financial, tables, public
Walmart Q3 10-Q · 62p · financial, tables, public
Berkshire 2024 annual letter · 15p · financial, public
Berkshire 1995 annual letter · 22p · financial, rasterized, public, scanned
Microsoft FY25 proxy statement · 199p · financial, large, tables, public
Hoffman 2004 LinkedIn series-B deck · 15p · financial, public, presentations
NIST SP 800-53 r5 · 492p · compliance, core, large, public
CISA Binding Operational Directive · 30p · compliance, public
NIST Cybersecurity Framework v2.0 · 32p · compliance, public
HHS HIPAA Privacy Rule · 25p · compliance, public
NIST SP 800-66r2 HIPAA Security Rule (PCI-DSS substitute) · 122p · compliance, large, public
FedRAMP control matrix · 200p · compliance, large, public
FDA 510(k) decision letter · 30p · compliance, public
NIST SP 800-171 r3 · 120p · compliance, large, public
SEC enforcement order · 50p · compliance, public
CMMC 2.0 framework · 80p · compliance, public
CISA Telework Reference Architecture (CIS-Controls equivalent, public) · 70p · compliance, public
SaaS Master Service Agreement · 40p · legal, forms, long-form, synthetic
Mutual NDA · 5p · legal, synthetic
Executive employment agreement · 25p · legal, synthetic, long-form
Synthetic ToS · 25p · legal, synthetic, long-form
USPTO battery patent · 60p · legal, public, long-form
Federal court motion · 30p · legal, public
Class-action complaint · 50p · legal, public, long-form
SCOTUS slip opinion · 114p · legal, public, long-form
Congressional bill · 973p · legal, public, long-form
Sample warranty deed · 8p · legal, forms, synthetic
Clinical trial protocol · 89p · healthcare, long-form
FDA drug label · 7p · healthcare, public, tables
FDA warning letter · 16p · healthcare, public
CMS-1500 medical claim form · 4p · healthcare, public, forms
Sanitized EHR export · 20p · healthcare, synthetic, tables
ClinicalTrials.gov protocol · 90p · healthcare, public, long-form
PMC OA clinical trial report · 2p · healthcare, public, tables
CDC MMWR public-health bulletin · 5p · healthcare, public, tables
CMS Medicare Provider Manual chapter · 323p · healthcare, public, long-form
Clinical lab result · 4p · healthcare, synthetic, tables
UnitedHealthcare NJ policy · 72p · insurance, long-form
Auto insurance policy · 40p · insurance, synthetic, forms, long-form
Homeowners declaration page · 6p · insurance, synthetic, forms
Life insurance contract · 50p · insurance, synthetic, long-form
Auto claim form (filled) · 4p · insurance, synthetic, forms
Dental EOB · 2p · insurance, synthetic, tables
Pharmacy EOB · 2p · insurance, synthetic, tables
Long-term disability claim packet · 30p · insurance, synthetic, forms
Medicare benefits explanation · 4p · insurance, synthetic, tables
IRS 1040 instructions scan · 123p · government, core, rasterized, scanned, large, tables
IRS 1040 1990 form · 64p · government, rasterized, scanned, large
IRS W-2 · 11p · government, public, forms
USCIS I-9 · 4p · government, public, forms
IRS 1099-MISC · 6p · government, public, forms
IRS Schedule C · 2p · government, public, forms
FBI FOIA response packet · 60p · government, rasterized, public, scanned, long-form
USCIS I-130 petition · 12p · government, public, forms
USCIS Request for Evidence · 10p · government, synthetic, forms
Federal court docket sheet · 8p · government, public, tables
Okanogan County invoice listing · 984p · government, public, tables, large
RFC 9110 HTTP Semantics · 194p · technical, core, large, public
RFC 8446 (TLS 1.3) · 160p · technical, large, public
RFC 7540 (HTTP/2) · 96p · technical, large, public
RFC 5415 (CAPWAP) · 155p · technical, large, public
ARM v8 architecture reference excerpt · 120p · technical, large, public
USB Type-C spec excerpt · 50p · technical, public
RFC 9000 (QUIC transport protocol) · 151p · technical, large, public
W3C HTML5.2 spec (PDF mirror) · 250p · technical, large, public
NVMe public spec excerpt · 458p · technical, large, public
Eaton UPS manual · 74p · manuals, long-form
US Army FM 3-22.9 small arms · 200p · manuals, large, public
OSHA safety bulletin · 35p · manuals, public
DOE Energy Star HVAC compliance guide · 80p · manuals, public, long-form
US Army weapons-handling field manual · 60p · manuals, public
USDA Engineering Field Handbook chapter · 80p · manuals, public, long-form
OpenWrt router configuration documentation · 70p · manuals, public, long-form
Lecture slide handout · 38p · presentations, core, slides, tables
Scanned lecture handout · 38p · presentations, core, rasterized, slides, scanned
KubeCon keynote slides · 60p · presentations, public, slides
Board meeting deck · 25p · presentations, synthetic, slides
Sales enablement deck · 30p · presentations, synthetic, slides
Corporate training deck · 50p · presentations, synthetic, slides, long-form
Research conference poster · 3p · presentations, public, slides, image-heavy
Public conference keynote · 25p · presentations, public, slides
Image-heavy PDF · 77p · image-heavy, core, image-heavy, large
Marketing brochure · 12p · image-heavy, synthetic, image-heavy, magazines
Real-estate brochure (mockup) · 20p · image-heavy, synthetic, image-heavy, magazines
Smithsonian Open Access catalog · 80p · image-heavy, public, image-heavy, magazines
NPS Yellowstone brochure · 100p · image-heavy, public, image-heavy, magazines
Library of Congress photography collection · 30p · image-heavy, public, image-heavy
USGS scientific publication · 54p · image-heavy, public, image-heavy, tables
Chinese national standard GB 18030 · 30p · multilingual, multilingual, public
Japanese 国税庁 tax form · 10p · multilingual, multilingual, public, forms
German employment contract · 40p · multilingual, multilingual, synthetic
French medical journal article · 16p · multilingual, multilingual, public
IRS Form 1040(SP) Spanish · 162p · multilingual, multilingual, public, long-form
Arabic news magazine layout (RTL) · 12p · multilingual, multilingual, synthetic, magazines
Russian academic paper · 20p · multilingual, multilingual, public
Korean financial filing · 50p · multilingual, multilingual, public
Italian government policy doc · 30p · multilingual, multilingual, public
Portuguese financial filing · 50p · multilingual, multilingual, public
Unicode edge-case PDF · 20p · regression, core, unicode, pathological
Synthetic regression case · 10p · regression, core, synthetic, pathological
Mixed-rotation text · 8p · regression, synthetic, pathological
Malformed xref · 6p · regression, synthetic, pathological
Tiny-font footnote stress · 12p · regression, synthetic, pathological
Vertical-text CJK · 6p · regression, synthetic, pathological, multilingual
Multi-column rotated · 8p · regression, synthetic, pathological
Low-contrast text · 8p · regression, synthetic, pathological
DocLayNet financial (s1) · 50p · public-benchmark, rasterized, public, image-heavy, scanned
DocLayNet financial (s2) · 50p · public-benchmark, rasterized, public, image-heavy, scanned
DocLayNet financial (s3) · 50p · public-benchmark, rasterized, public, image-heavy, scanned
DocLayNet government (s1) · 50p · public-benchmark, rasterized, public, image-heavy, scanned
DocLayNet government (s2) · 50p · public-benchmark, rasterized, public, image-heavy, scanned
DocLayNet scientific (s1) · 50p · public-benchmark, rasterized, public, image-heavy, scanned
DocLayNet scientific (s2) · 50p · public-benchmark, rasterized, public, image-heavy, scanned
DocLayNet patents · 50p · public-benchmark, rasterized, public, image-heavy, scanned
DocLayNet laws (s1) · 50p · public-benchmark, rasterized, public, image-heavy, scanned
DocLayNet laws (s2) · 50p · public-benchmark, rasterized, public, image-heavy, scanned
DocLayNet manuals · 50p · public-benchmark, rasterized, public, image-heavy, scanned

Rasterized docs ship as page-image PDFs (no text layer). They test the OCR / layout path, not text-layer extraction, so they are treated as a sibling class to the scanned-doc set rather than as equivalent to born-digital business PDFs.

methodology
  • 139 documents across representative classes.
  • Default rankings require at least 85% document success and non-zero source bbox coverage.
  • The corpus spans tiny born-digital, academic, financial filings, compliance / regulatory, legal, healthcare, insurance, government forms, technical specs, manuals, slides, image-heavy / magazines, multilingual, regression cases, and DocLayNet rasterized layouts.
  • DocLayNet docs are stitched from rasterized page images (no text layer), so they test the OCR / layout path, not text-layer extraction. The page tags them as is_rasterized.
  • Consensus F1 is computed against tokens emitted by a majority of successful providers for each document.
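The consensus F1 described in the last bullet can be sketched as below. The page does not specify tokenization, whether tokens are compared as sets or with counts, or how "majority" is defined, so this sketch assumes lowercase whitespace tokens, set-based comparison, and a strict majority (more than half) of the successful providers for a document. All function names are hypothetical.

```python
from collections import Counter


def tokenize(text: str) -> set[str]:
    # Assumption: lowercase whitespace tokens, compared as a set.
    return set(text.lower().split())


def consensus_reference(outputs: dict[str, str]) -> set[str]:
    """Tokens emitted by a strict majority of the providers in `outputs`.

    `outputs` maps provider name -> extracted text for ONE document and
    should contain only providers that succeeded on that document.
    """
    votes: Counter[str] = Counter()
    for text in outputs.values():
        votes.update(tokenize(text))  # each provider votes once per token
    majority = len(outputs) // 2 + 1
    return {token for token, n in votes.items() if n >= majority}


def f1_against(reference: set[str], text: str) -> float:
    """Token-level F1 of one provider's output against the reference set."""
    tokens = tokenize(text)
    overlap = len(tokens & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(tokens)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def consensus_f1(outputs: dict[str, str]) -> dict[str, float]:
    """Per-provider consensus F1 for a single document."""
    reference = consensus_reference(outputs)
    return {name: f1_against(reference, text) for name, text in outputs.items()}


if __name__ == "__main__":
    # Made-up provider names and text, for illustration only.
    sample = {
        "provider_a": "Total due 42",
        "provider_b": "total due 42",
        "provider_c": "total due 47 USD",
    }
    print(consensus_f1(sample))
```

In the made-up sample, the reference is {total, due, 42}: the first two providers score 1.0, and the third scores 4/7 (precision 2/4, recall 2/3). A per-provider figure like the table's consensus F1 column would then need an aggregation across documents, which the page does not describe.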