By Rick Hightower | January 9, 2025
Amazon Textract: A Developer’s Guide to Document Intelligence
Ever stared at a scanned document knowing all the data you need is right there—but completely trapped? Traditional OCR reads words. Textract understands meaning. Here’s how to liberate your documents.
mindmap
  root((Amazon Textract))
    Core Features
      Forms Detection
      Table Recognition
      Layout Analysis
      Natural Language Queries
    Document Types
      Invoices
      Receipts
      IDs & Licenses
      Contracts
      Medical Forms
    Processing Modes
      Synchronous
      Asynchronous
      Batch Processing
    Integration
      boto3 SDK
      TRP Library
      S3 Integration
      SNS Notifications
The Document Intelligence Revolution
You know that sinking feeling. Staring at a scanned invoice, knowing every piece of data you need is technically there—but it might as well be carved in stone. Traditional OCR might read the words, but you’re left playing detective. Which value belongs to which label? Where’s the table structure? How do these pieces connect?
That’s the prison Amazon Textract breaks you out of.
This isn’t just about reading text. It’s about understanding documents the way humans do—recognizing forms, parsing tables, answering questions. Let’s dive into what makes Textract different and how you can use its power.
What Makes Textract Different? Structure Over Strings
Traditional OCR hands you a flat text file. Textract hands you intelligence.
Beyond Character Recognition
Textract understands documents in layers:
- Forms: Detects key-value pairs like “Invoice Number: 12345”—no more regex nightmares
- Tables: Recognizes rows, columns, headers, and summary rows with proper relationships
- Layout: Identifies paragraphs, lists, titles, and sections maintaining document flow
- Queries: Ask natural language questions like “What is the policy number?” and get direct answers
It also handles both printed and handwritten text for real-world documents.
graph LR
    A[Scanned Document] --> B{Textract Processing}
    B --> C[Text Extraction]
    B --> D[Form Detection]
    B --> E[Table Recognition]
    B --> F[Layout Analysis]
    C --> G[Structured Output]
    D --> G
    E --> G
    F --> G
    G --> H[Actionable Data]
    style A fill:#ffe0b2
    style H fill:#a5d6a7
    style B fill:#bbdefb
Getting Started: Your First Document Analysis
Let’s extract structured data from an invoice with Python and boto3:
import boto3

# Create Textract client
textract = boto3.client('textract', region_name='us-east-1')

# Analyze document with forms and tables
# (synchronous calls accept images and single-page PDF/TIFF files)
response = textract.analyze_document(
    Document={
        'S3Object': {
            'Bucket': 'your-document-bucket',
            'Name': 'invoices/invoice-2025-001.pdf'
        }
    },
    FeatureTypes=['FORMS', 'TABLES']
)

# Process the results
for block in response['Blocks']:
    if block['BlockType'] == 'KEY_VALUE_SET':
        # KEY_VALUE_SET blocks carry no text of their own; the text lives
        # in the WORD blocks they reference through Relationships
        entity = block.get('EntityTypes', ['UNKNOWN'])[0]
        print(f"Form field block detected: {entity} ({block['Id']})")
    elif block['BlockType'] == 'TABLE':
        # Found a table; its CELL blocks are linked as CHILD relationships
        print("Table detected with relationships")
Understanding the Response Structure
Textract returns “blocks” that represent document elements:
- PAGE: The document page
- LINE: A line of text
- WORD: Individual words
- KEY_VALUE_SET: Form fields (key-value pairs)
- TABLE: Table structure
- CELL: Individual table cells
Each block includes metadata like confidence scores, bounding box coordinates, and relationships to other blocks.
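To make the relationship model concrete, here is a minimal sketch of pairing form keys with their values by hand. The sample blocks are hand-written for illustration and trimmed to the fields the walk needs (real responses also carry Geometry, Confidence, and more). `get_text` and `extract_key_values` are hypothetical helper names, not part of boto3.

```python
def get_text(block, blocks_by_id):
    """Join the text of the WORD blocks referenced as CHILD relationships."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for child_id in rel["Ids"]:
                child = blocks_by_id[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)


def extract_key_values(blocks):
    """Pair each KEY block with the VALUE block it points to."""
    blocks_by_id = {b["Id"]: b for b in blocks}
    pairs = {}
    for block in blocks:
        if block["BlockType"] != "KEY_VALUE_SET":
            continue
        if "KEY" not in block.get("EntityTypes", []):
            continue  # VALUE blocks are reached through their KEY block
        key_text = get_text(block, blocks_by_id)
        value_text = ""
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                value_text = " ".join(
                    get_text(blocks_by_id[i], blocks_by_id) for i in rel["Ids"]
                )
        pairs[key_text] = value_text
    return pairs


# Hand-written sample, not real Textract output
sample_blocks = [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1", "w2"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Invoice"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Number:"},
    {"Id": "w3", "BlockType": "WORD", "Text": "12345"},
]

form_fields = extract_key_values(sample_blocks)
print(form_fields)
```

The key block links to its value through a VALUE relationship, and both link to their words through CHILD relationships, which is why a flat scan over block text is not enough to rebuild a form.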
Making Life Easier: The TRP Library
Raw Textract JSON is powerful but complex. Enter the amazon-textract-response-parser (TRP) library, your complexity tamer:
from trp import Document

# Parse Textract response into friendly objects
doc = Document(response)

# Extract form fields elegantly
for page_num, page in enumerate(doc.pages, start=1):
    print(f"\n--- Page {page_num} ---")

    # Process form fields
    for field in page.form.fields:
        key = field.key.text if field.key else "Unknown"
        value = field.value.text if field.value else "Not found"
        # Confidence is reported on a 0-100 scale on the key and value objects
        confidence = field.key.confidence if field.key else 0.0
        print(f"{key}: {value} (Confidence: {confidence:.1f}%)")

    # Process tables
    for table in page.tables:
        row_count = len(table.rows)
        column_count = len(table.rows[0].cells) if table.rows else 0
        print(f"\nTable with {row_count} rows, {column_count} columns:")
        for row in table.rows:
            row_data = [cell.text for cell in row.cells]
            print(" | ".join(row_data))
TRP transforms the block graph into intuitive Python objects—no more manual relationship parsing.
Advanced Magic: Natural Language Queries
Need specific information regardless of document format? Use Textract’s query feature:
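# Ask questions about your document
response = textract.analyze_document(
    Document={
        'S3Object': {
            'Bucket': 'your-document-bucket',
            'Name': 'insurance/policy-doc.pdf'
        }
    },
    FeatureTypes=['QUERIES'],
    QueriesConfig={
        'Queries': [
            {'Text': 'What is the policy number?'},
            {'Text': 'What is the coverage amount?'},
            {'Text': 'When does the policy expire?'}
        ]
    }
)

# Index blocks by Id so each QUERY can be matched to its QUERY_RESULT
blocks_by_id = {block['Id']: block for block in response['Blocks']}

# Extract answers
for block in response['Blocks']:
    if block['BlockType'] != 'QUERY':
        continue
    question = block['Query']['Text']
    answer, confidence = 'Not found', 0.0
    # A QUERY block points at its answer through an ANSWER relationship
    for rel in block.get('Relationships', []):
        if rel['Type'] == 'ANSWER':
            result = blocks_by_id[rel['Ids'][0]]
            answer = result.get('Text', 'Not found')
            confidence = result['Confidence']
    print(f"Q: {question}")
    # Confidence is already on a 0-100 scale
    print(f"A: {answer} (Confidence: {confidence:.1f}%)\n")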
Textract finds and extracts answers even when layouts vary widely, so one set of questions can serve many document types.
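Each query also accepts an optional `Alias`, which gives the answer a stable key for downstream code. The sketch below collects answers into a dictionary keyed by alias. The sample blocks are hand-written and trimmed, and `answers_by_alias` is a hypothetical helper name.

```python
def answers_by_alias(blocks):
    """Map each query's alias (or its text, if no alias) to its answer."""
    blocks_by_id = {b["Id"]: b for b in blocks}
    answers = {}
    for block in blocks:
        if block["BlockType"] != "QUERY":
            continue
        query = block["Query"]
        name = query.get("Alias", query["Text"])
        answers[name] = None  # stays None when no answer was found
        for rel in block.get("Relationships", []):
            if rel["Type"] == "ANSWER":
                result = blocks_by_id[rel["Ids"][0]]
                answers[name] = {
                    "text": result.get("Text", ""),
                    "confidence": result.get("Confidence", 0.0),
                }
    return answers


# Hand-written sample, not real Textract output
sample_blocks = [
    {"Id": "q1", "BlockType": "QUERY",
     "Query": {"Text": "What is the policy number?", "Alias": "POLICY_NUMBER"},
     "Relationships": [{"Type": "ANSWER", "Ids": ["r1"]}]},
    {"Id": "r1", "BlockType": "QUERY_RESULT", "Text": "PN-0042", "Confidence": 97.5},
    {"Id": "q2", "BlockType": "QUERY",
     "Query": {"Text": "When does the policy expire?", "Alias": "EXPIRY"}},
]

answers = answers_by_alias(sample_blocks)
print(answers)
```

Keeping unanswered queries in the result with a `None` value makes missing data explicit, which is easier to route to human review than a silently absent key.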
Specialized APIs for Common Use Cases
Amazon provides purpose-built APIs for common scenarios:
AnalyzeExpense: Receipt and Invoice Processing
# Extract expense data from receipts
response = textract.analyze_expense(
    Document={
        'S3Object': {
            'Bucket': 'your-document-bucket',
            'Name': 'receipts/restaurant-receipt.jpg'
        }
    }
)

# Process expense summary
for doc in response['ExpenseDocuments']:
    print("=== Expense Summary ===")
    for field in doc['SummaryFields']:
        field_type = field['Type']['Text']
        field_value = field.get('ValueDetection', {}).get('Text', 'N/A')
        print(f"{field_type}: {field_value}")

    # Process line items
    print("\n=== Line Items ===")
    for item in doc.get('LineItemGroups', []):
        for line_item in item['LineItems']:
            for field in line_item['LineItemExpenseFields']:
                print(f"  {field['Type']['Text']}: {field.get('ValueDetection', {}).get('Text', '')}")
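Printing is fine for exploration, but downstream code usually wants plain data. The following sketch flattens one expense document into a summary dictionary and a list of line-item rows. The sample document is hand-written and trimmed to the fields used here, and `summarize_expense` is a hypothetical helper name.

```python
def summarize_expense(expense_doc):
    """Return (summary_dict, line_item_rows) for one ExpenseDocuments entry."""
    summary = {}
    for field in expense_doc.get("SummaryFields", []):
        field_type = field.get("Type", {}).get("Text", "OTHER")
        value = field.get("ValueDetection", {}).get("Text", "")
        # Keep the first value seen for each field type
        summary.setdefault(field_type, value)

    rows = []
    for group in expense_doc.get("LineItemGroups", []):
        for line_item in group.get("LineItems", []):
            row = {}
            for field in line_item.get("LineItemExpenseFields", []):
                field_type = field.get("Type", {}).get("Text", "OTHER")
                row[field_type] = field.get("ValueDetection", {}).get("Text", "")
            rows.append(row)
    return summary, rows


# Hand-written sample, not real Textract output
sample_doc = {
    "SummaryFields": [
        {"Type": {"Text": "VENDOR_NAME"}, "ValueDetection": {"Text": "Corner Bistro"}},
        {"Type": {"Text": "TOTAL"}, "ValueDetection": {"Text": "$16.50"}},
    ],
    "LineItemGroups": [{
        "LineItems": [
            {"LineItemExpenseFields": [
                {"Type": {"Text": "ITEM"}, "ValueDetection": {"Text": "Burger"}},
                {"Type": {"Text": "PRICE"}, "ValueDetection": {"Text": "$12.00"}},
            ]},
            {"LineItemExpenseFields": [
                {"Type": {"Text": "ITEM"}, "ValueDetection": {"Text": "Fries"}},
                {"Type": {"Text": "PRICE"}, "ValueDetection": {"Text": "$4.50"}},
            ]},
        ]
    }],
}

summary, rows = summarize_expense(sample_doc)
print(summary)
print(rows)
```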
AnalyzeID: Identity Document Processing
# Extract data from driver's licenses or passports
response = textract.analyze_id(
    DocumentPages=[{
        'S3Object': {
            'Bucket': 'your-document-bucket',
            'Name': 'ids/drivers-license.jpg'
        }
    }]
)

# Process ID fields
for doc in response['IdentityDocuments']:
    for field in doc['IdentityDocumentFields']:
        field_type = field['Type']['Text']
        field_value = field.get('ValueDetection', {}).get('Text', 'N/A')
        print(f"{field_type}: {field_value}")
Synchronous vs. Asynchronous: Choosing Your Processing Mode
flowchart TD
    A[Document Processing Need] --> B{Document Size/Volume?}
    B -->|Small/Single Page| C[Synchronous Processing]
    B -->|Large/Multi-Page| D[Asynchronous Processing]
    C --> E[analyze_document]
    C --> F[Immediate Response]
    C --> G[Use Cases: Mobile apps, Real-time validation]
    D --> H[start_document_analysis]
    D --> I[Job ID Returned]
    D --> J[Poll or SNS Notification]
    D --> K[Use Cases: Batch processing, Large PDFs]
    style C fill:#c8e6c9
    style D fill:#bbdefb
Perfect for single-page documents and immediate feedback:
# Direct analysis - results returned immediately
response = textract.analyze_document(
    Document={'S3Object': {'Bucket': 'bucket', 'Name': 'scan.jpg'}},
    FeatureTypes=['FORMS', 'TABLES']
)
Asynchronous Processing (Batch Jobs)
Use this for multi-page PDFs and large-scale operations:
import time

# Start async job
job_response = textract.start_document_analysis(
    DocumentLocation={
        'S3Object': {'Bucket': 'bucket', 'Name': 'large-contract.pdf'}
    },
    FeatureTypes=['FORMS', 'TABLES', 'QUERIES'],
    QueriesConfig={
        'Queries': [{'Text': 'What is the contract value?'}]
    },
    NotificationChannel={
        'SNSTopicArn': 'arn:aws:sns:region:account:textract-notifications',
        'RoleArn': 'arn:aws:iam::account:role/TextractSNSRole'
    }
)
job_id = job_response['JobId']

# Option 1: Poll for results
while True:
    result = textract.get_document_analysis(JobId=job_id)
    status = result['JobStatus']
    if status in ('SUCCEEDED', 'PARTIAL_SUCCESS'):
        # Process results (follow NextToken to fetch every page of blocks)
        break
    elif status == 'FAILED':
        print(f"Job failed: {result.get('StatusMessage', 'Unknown error')}")
        break
    time.sleep(5)  # Still IN_PROGRESS; wait before polling again

# Option 2: Use SNS notification (recommended for production)
# Configure Lambda to process SNS notifications
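For Option 2, a Lambda function subscribed to the SNS topic receives Textract's completion message. The sketch below only parses that message. It assumes the usual SNS-to-Lambda event envelope and the `JobId`, `Status`, and `DocumentLocation` fields of the Textract notification. `parse_textract_notification`, the handler's return shape, and the sample event are all illustrative, not taken from the article.

```python
import json


def parse_textract_notification(event):
    """Pull job details out of every SNS record in a Lambda event."""
    jobs = []
    for record in event.get("Records", []):
        # The SNS message body is itself a JSON string
        message = json.loads(record["Sns"]["Message"])
        location = message.get("DocumentLocation", {})
        jobs.append({
            "job_id": message["JobId"],
            "status": message["Status"],
            "bucket": location.get("S3Bucket"),
            "key": location.get("S3ObjectName"),
        })
    return jobs


def lambda_handler(event, context):
    jobs = parse_textract_notification(event)
    succeeded = [job["job_id"] for job in jobs if job["status"] == "SUCCEEDED"]
    # A real handler would now call get_document_analysis(JobId=...) for
    # each succeeded job and store or forward the parsed results.
    return {"received": len(jobs), "succeeded": succeeded}


# Hand-written sample event for a local dry run
sample_event = {
    "Records": [{
        "Sns": {
            "Message": json.dumps({
                "JobId": "job-123",
                "Status": "SUCCEEDED",
                "API": "StartDocumentAnalysis",
                "DocumentLocation": {
                    "S3Bucket": "bucket",
                    "S3ObjectName": "large-contract.pdf",
                },
            })
        }
    }]
}

result = lambda_handler(sample_event, None)
print(result)
```

Keeping the parsing in its own function makes the handler easy to test locally without AWS credentials.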
Pro Tips for Maximum Accuracy
Document Preparation
- Resolution: Aim for 300 DPI scans minimum
- Alignment: Keep documents straight (skew affects accuracy)
- Lighting: Use even lighting for photos
- Contrast: High contrast between text and background
Implementation Best Practices
# Implement confidence thresholds
MIN_CONFIDENCE = 80.0

def process_with_confidence(block):
    confidence = block.get('Confidence', 0)
    return {
        'text': block.get('Text', ''),
        'confidence': confidence,
        # Flag low-confidence blocks for human review
        'needs_review': confidence < MIN_CONFIDENCE
    }

# Handle multi-page results
def get_all_pages(job_id):
    """Collect the blocks from every page of a finished job's results."""
    pages = []
    next_token = None
    while True:
        if next_token:
            response = textract.get_document_analysis(
                JobId=job_id,
                NextToken=next_token
            )
        else:
            response = textract.get_document_analysis(JobId=job_id)
        pages.extend(response['Blocks'])
        next_token = response.get('NextToken')
        if not next_token:
            break
    return pages
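Once all blocks are collected, it can help to see where the low-confidence text sits. The sketch below groups LINE blocks by their `Page` attribute and counts how many fall below a threshold. The sample blocks are hand-written, and `review_summary` is a hypothetical helper; the default of 80 simply mirrors `MIN_CONFIDENCE` above.

```python
from collections import defaultdict

MIN_CONFIDENCE = 80.0


def review_summary(blocks, min_confidence=MIN_CONFIDENCE):
    """Count LINE blocks per page and how many need human review."""
    summary = defaultdict(lambda: {"lines": 0, "needs_review": 0})
    for block in blocks:
        if block.get("BlockType") != "LINE":
            continue
        page = block.get("Page", 1)  # single-page results may omit Page
        summary[page]["lines"] += 1
        if block.get("Confidence", 0.0) < min_confidence:
            summary[page]["needs_review"] += 1
    return dict(summary)


# Hand-written sample, not real Textract output
sample_blocks = [
    {"BlockType": "PAGE", "Page": 1},
    {"BlockType": "LINE", "Page": 1, "Text": "Invoice Number: 12345", "Confidence": 99.1},
    {"BlockType": "LINE", "Page": 1, "Text": "Tota1 Due", "Confidence": 62.4},
    {"BlockType": "LINE", "Page": 2, "Text": "Thank you", "Confidence": 95.0},
]

summary = review_summary(sample_blocks)
print(summary)
```

A per-page view like this makes it easy to send only the problem pages to a reviewer instead of the whole document.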
Real-World Implementation Pattern
Here’s a starting point for a document processing pipeline that chooses between synchronous and asynchronous processing based on file size:
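import time
from typing import Dict, List, Optional

from trp import Document


class DocumentProcessor:
    def __init__(self, textract_client, s3_client):
        self.textract = textract_client
        self.s3 = s3_client

    def process_document(
        self,
        bucket: str,
        key: str,
        features: Optional[List[str]] = None,
        queries: Optional[List[str]] = None
    ) -> Dict:
        """Process a document with comprehensive error handling"""
        try:
            s3_object = {'S3Object': {'Bucket': bucket, 'Name': key}}

            # Options shared by the synchronous and asynchronous calls
            job_config = {
                'FeatureTypes': list(features or ['FORMS', 'TABLES'])
            }
            if queries:
                job_config['QueriesConfig'] = {
                    'Queries': [{'Text': q} for q in queries]
                }
                # QueriesConfig is only valid when QUERIES is requested
                if 'QUERIES' not in job_config['FeatureTypes']:
                    job_config['FeatureTypes'].append('QUERIES')

            # Check document size for sync/async decision
            # (a heuristic: a small multi-page PDF still needs the async API)
            obj_meta = self.s3.head_object(Bucket=bucket, Key=key)
            file_size_mb = obj_meta['ContentLength'] / (1024 * 1024)

            if file_size_mb < 5:  # Small file - use synchronous
                response = self.textract.analyze_document(
                    Document=s3_object,
                    **job_config
                )
                return self._parse_response(response)
            else:  # Large file - use asynchronous
                job = self.textract.start_document_analysis(
                    DocumentLocation=s3_object,
                    **job_config
                )
                return self._wait_for_job(job['JobId'])
        except Exception as e:
            return {
                'success': False,
                'error': str(e),
                'document': f"s3://{bucket}/{key}"
            }

    def _wait_for_job(self, job_id: str, poll_seconds: int = 5) -> Dict:
        """Poll until the job finishes, then collect every page of results"""
        while True:
            response = self.textract.get_document_analysis(JobId=job_id)
            status = response['JobStatus']
            if status == 'FAILED':
                raise RuntimeError(
                    response.get('StatusMessage', 'Textract job failed')
                )
            if status != 'IN_PROGRESS':
                break
            time.sleep(poll_seconds)

        responses = [response]
        while response.get('NextToken'):
            response = self.textract.get_document_analysis(
                JobId=job_id,
                NextToken=response['NextToken']
            )
            responses.append(response)
        return self._parse_response(responses)

    def _parse_response(self, response) -> Dict:
        """Parse one response, or a list of paginated responses, into structured data"""
        doc = Document(response)
        result = {
            'success': True,
            'pages': [],
            'forms': {},
            'tables': [],
            'queries': {}
        }
        for page in doc.pages:
            # Extract forms
            for field in page.form.fields:
                if field.key and field.value:
                    result['forms'][field.key.text] = {
                        'value': field.value.text,
                        'confidence': field.value.confidence
                    }
            # Extract tables
            for table in page.tables:
                table_data = []
                for row in table.rows:
                    table_data.append([cell.text for cell in row.cells])
                result['tables'].append(table_data)
        return result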
The Bottom Line: Why Textract Transforms Workflows
Textract isn’t just OCR with better marketing. It’s document intelligence that understands structure, context, and meaning. It supports forms, tables, layout analysis, and natural language queries. This transforms manual document processing into automated workflows.
From financial documents to medical forms, from contracts to receipts—Textract provides the API to liberate trapped data and make it actionable.
Ready to go deeper? Next, we’ll integrate Textract with Amazon Comprehend for sentiment analysis and entity extraction, building a complete intelligent document processing pipeline on AWS.
Your documents are talking. Are you ready to listen?
Helpful Resources
- Amazon Textract Documentation
- TRP Library on GitHub
- Textract Pricing Calculator
- AWS SDK for Python (boto3)
About the Author
Rick Hightower is a software architect and technical writer with deep experience in cloud computing and AI technologies. As a certified AWS Solutions Architect, he helps developers implement intelligent document processing solutions with AWS services.
With over two decades of hands-on experience building enterprise-scale applications, Rick combines technical knowledge with clear, practical writing to make complex topics accessible to developers at all skill levels.