May 13, 2025
Unlock the hidden potential of your documents! Dive into our latest guide on Amazon Textract and discover how to transform unstructured data into actionable insights. From invoices to contracts, learn the secrets of document intelligence that could revolutionize your workflow. Don’t let your data stay trapped—read on to unleash its power!
Amazon Textract converts documents into structured data by detecting forms, tables, and layouts while enabling natural language queries. It includes expense and ID analysis APIs and handles both real-time and batch processing.
This guide covers setup and advanced features. It provides examples for both beginners and experienced users looking to enhance their document processing capabilities.
Document Intelligence with Amazon Textract: A Developer’s Guide
You know that feeling, right? When you’re staring at a scanned document and all the useful stuff is there, but somehow it’s just trapped. Traditional OCR might read the words, but you’re left trying to figure out how it all fits together. Where’s the structure? Where are the tables? Which value belongs to which label?
That’s the puzzle Amazon Textract is built to solve.
In this hands-on tutorial, we’ll dive deep into what makes Textract different from plain OCR, explore how to work with its structured output, and walk through Python examples using the boto3 SDK and the amazon-textract-response-parser library (TRP). Whether you’re dealing with invoices, forms, or contracts, this guide will show you how to turn documents into structured, actionable data.
What Makes Textract Different?
Traditional OCR gives you flat text. Textract gives you structure.
That means it understands not just the characters on a page, but also what they mean in context:
- Forms: Detects key-value pairs like “Invoice Number: 12345”
- Tables: Recognizes rows and columns, including headers and summary rows
- Layout: Identifies paragraphs, lists, titles, and sections
- Custom Queries: Lets you ask natural-language questions like “What is the policy number?” and get a direct answer
Textract also supports both printed and handwritten text, making it suitable for real-world, messy documents.
Getting Started: AnalyzeDocument with Forms and Tables
Here’s how to extract structured data from an invoice using Python and boto3:
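import boto3

# Uses the default AWS credentials and region configured for boto3
textract = boto3.client('textract')

# Synchronous call: ask for form (key-value) and table analysis
response = textract.analyze_document(
    Document={
        'S3Object': {
            'Bucket': 'your-bucket-name',
            'Name': 'invoice.pdf'
        }
    },
    FeatureTypes=['FORMS', 'TABLES']
)

# Print every block's type and, where present, its text
for block in response['Blocks']:
    print(block['BlockType'], block.get('Text', ''))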
This returns a list of “blocks.” Each represents a word, line, key-value pair, table cell, etc. They include metadata like page number, position, and relationships to other blocks.
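To make those relationships concrete, here is a minimal sketch of resolving key-value pairs by hand from the block list. The helper names (`get_text`, `extract_key_values`) and the tiny `sample_response` are illustrative assumptions, not part of boto3; the sample only mimics the shape of a FORMS response.

```python
def get_text(block, block_map):
    """Join the text of a block's CHILD words (and checkbox states)."""
    parts = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = block_map[child_id]
            if child["BlockType"] == "WORD":
                parts.append(child["Text"])
            elif child["BlockType"] == "SELECTION_ELEMENT":
                selected = child["SelectionStatus"] == "SELECTED"
                parts.append("[x]" if selected else "[ ]")
    return " ".join(parts)


def extract_key_values(response):
    """Map each KEY block's text to the text of its linked VALUE block(s)."""
    block_map = {b["Id"]: b for b in response["Blocks"]}
    pairs = {}
    for block in response["Blocks"]:
        if block["BlockType"] != "KEY_VALUE_SET":
            continue
        if "KEY" not in block.get("EntityTypes", []):
            continue
        value_text = ""
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                value_text = " ".join(
                    get_text(block_map[value_id], block_map)
                    for value_id in rel["Ids"]
                )
        pairs[get_text(block, block_map)] = value_text
    return pairs


# Hand-written stand-in for a real response (assumption, for illustration only)
sample_response = {"Blocks": [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1", "w2"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Invoice"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Number:"},
    {"Id": "w3", "BlockType": "WORD", "Text": "12345"},
]}

print(extract_key_values(sample_response))
```

This is roughly the bookkeeping you would otherwise write yourself for every document: index blocks by Id, then follow VALUE and CHILD links.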
Making It Easier: Using the TRP Library
Textract’s raw JSON is powerful but complex. That’s where the amazon-textract-response-parser library (TRP) comes in.
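from trp import Document

# Wrap the raw response from analyze_document
doc = Document(response)

for page in doc.pages:
    # Form fields as key-value pairs; a key or value can be missing
    for field in page.form.fields:
        key = field.key.text if field.key else ''
        value = field.value.text if field.value else ''
        print(f"{key}: {value}")

    # Tables, one printed line per row
    for table in page.tables:
        for row in table.rows:
            print(" | ".join(cell.text for cell in row.cells))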
TRP handles the relationships for you, turning block graphs into familiar Python objects.
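Once tables are plain Python objects, exporting them is straightforward. The sketch below writes one table to CSV text. It assumes only the attributes used in the loop above (`table.rows`, `row.cells`, `cell.text`); the function name `table_to_csv` is hypothetical, and the demo uses `SimpleNamespace` stand-ins rather than real TRP objects so it runs without AWS access.

```python
import csv
import io
from types import SimpleNamespace


def table_to_csv(table):
    """Render a TRP-style table (rows -> cells -> text) as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table.rows:
        # Strip defensively: cell text may carry stray whitespace
        writer.writerow([(cell.text or "").strip() for cell in row.cells])
    return buffer.getvalue()


def _cell(text):
    """Stand-in for a table cell (assumption, for the demo only)."""
    return SimpleNamespace(text=text)


demo_table = SimpleNamespace(rows=[
    SimpleNamespace(cells=[_cell("Item "), _cell("Price ")]),
    SimpleNamespace(cells=[_cell("Widget, large "), _cell("9.99 ")]),
])

demo_csv = table_to_csv(demo_table)
print(demo_csv)
```

With a real response, you would call `table_to_csv(table)` inside the `for table in page.tables` loop. Using the `csv` module rather than joining on commas keeps cells that contain commas intact.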
Advanced Feature: Custom Queries
Want to ask specific questions about a document, regardless of layout?
response = textract.analyze_document(
    Document={
        'S3Object': {
            'Bucket': 'your-bucket-name',
            'Name': 'policy.pdf'
        }
    },
    FeatureTypes=['QUERIES'],
    QueriesConfig={
        'Queries': [{'Text': 'What is the policy number?'}]
    }
)

for block in response['Blocks']:
    if block['BlockType'] == 'QUERY_RESULT':
        print(f"Answer: {block.get('Text', '')}")
Textract will find and extract the answer, even if the format varies across documents.
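When you send several questions at once, it helps to know which answer belongs to which question. Here is a minimal sketch that pairs each QUERY block with its QUERY_RESULT blocks through the ANSWER relationship and keeps the confidence score. The function name `answers_by_query`, the alias `POLICY_NUMBER`, and the sample values are assumptions made up for illustration.

```python
def answers_by_query(response):
    """Return {query label: [(answer text, confidence), ...]}."""
    block_map = {b["Id"]: b for b in response["Blocks"]}
    results = {}
    for block in response["Blocks"]:
        if block["BlockType"] != "QUERY":
            continue
        query = block["Query"]
        # Prefer the optional Alias as a stable label, else the question text
        label = query.get("Alias") or query["Text"]
        answers = []
        for rel in block.get("Relationships", []):
            if rel["Type"] != "ANSWER":
                continue
            for answer_id in rel["Ids"]:
                answer = block_map[answer_id]
                answers.append(
                    (answer.get("Text", ""), answer.get("Confidence", 0.0))
                )
        results[label] = answers
    return results


# Hand-written stand-in for a QUERIES response (assumption)
sample_query_response = {"Blocks": [
    {"Id": "q1", "BlockType": "QUERY",
     "Query": {"Text": "What is the policy number?",
               "Alias": "POLICY_NUMBER"},
     "Relationships": [{"Type": "ANSWER", "Ids": ["a1"]}]},
    {"Id": "a1", "BlockType": "QUERY_RESULT",
     "Text": "PN-0042", "Confidence": 97.5},
    {"Id": "q2", "BlockType": "QUERY",
     "Query": {"Text": "What is the expiry date?"}},
]}

print(answers_by_query(sample_query_response))
```

A question with no ANSWER relationship comes back with an empty list, which is a useful signal that the document may need manual review.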
Real-World APIs: AnalyzeExpense and AnalyzeID
Amazon offers specialized APIs for common use cases:
- analyze_expense: Extracts totals, vendor names, tax, etc. from receipts and invoices
- analyze_id: Parses driver’s licenses and passports
Example:
response = textract.analyze_expense(
    Document={
        'S3Object': {
            'Bucket': 'your-bucket-name',
            'Name': 'receipt.pdf'
        }
    }
)

for doc in response['ExpenseDocuments']:
    for field in doc['SummaryFields']:
        print(f"{field['Type']['Text']}: {field.get('ValueDetection', {}).get('Text', '')}")
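The analyze_id response can be flattened in much the same way. This sketch turns each identity document into a plain dictionary of field name to value. The function name `id_fields` and the sample values are illustrative assumptions; the commented call shows how the real request might look, with a placeholder bucket and file name.

```python
def id_fields(response):
    """Return one {field name: value} dict per identity document."""
    documents = []
    for doc in response.get("IdentityDocuments", []):
        fields = {}
        for field in doc.get("IdentityDocumentFields", []):
            name = field.get("Type", {}).get("Text", "")
            value = field.get("ValueDetection", {}).get("Text", "")
            fields[name] = value
        documents.append(fields)
    return documents


# With a real client, the request might look like this (placeholders):
# response = textract.analyze_id(
#     DocumentPages=[{'S3Object': {'Bucket': 'your-bucket-name',
#                                  'Name': 'license.png'}}]
# )

# Hand-written stand-in for an analyze_id response (assumption)
sample_id_response = {"IdentityDocuments": [{
    "DocumentIndex": 1,
    "IdentityDocumentFields": [
        {"Type": {"Text": "FIRST_NAME"},
         "ValueDetection": {"Text": "JANE", "Confidence": 98.2}},
        {"Type": {"Text": "DOCUMENT_NUMBER"},
         "ValueDetection": {"Text": "X1234567", "Confidence": 95.0}},
        {"Type": {"Text": "EXPIRATION_DATE"},
         "ValueDetection": {"Text": ""}},
    ],
}]}

print(id_fields(sample_id_response))
```

Fields Textract could not find come back as empty strings here, so downstream code can tell "missing" apart from "present" without extra key checks.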
Synchronous vs. Asynchronous: Which One Should You Use?
- Synchronous: Great for small, real-time tasks (e.g., mobile scans)
- Asynchronous: Ideal for multi-page PDFs, batch jobs, or large files
Asynchronous usage:
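import time

job_id = textract.start_document_analysis(
    DocumentLocation={
        'S3Object': {'Bucket': 'your-bucket', 'Name': 'long-file.pdf'}
    },
    FeatureTypes=['FORMS', 'TABLES']
)['JobId']

# Polling loop (in production, use SNS)
while True:
    result = textract.get_document_analysis(JobId=job_id)
    status = result['JobStatus']
    if status == 'SUCCEEDED':
        break
    if status != 'IN_PROGRESS':
        # FAILED or PARTIAL_SUCCESS: stop instead of looping forever
        raise RuntimeError(f"Textract job ended with status {status}")
    time.sleep(5)  # wait before polling again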
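Results for long documents can span several response pages, so a single get_document_analysis call may not return every block. Below is a minimal sketch of following NextToken until the results run out. The function name `collect_blocks` and the `FakeTextract` class are assumptions for illustration; the fake client only imitates the paging behaviour so the example runs without AWS access.

```python
def collect_blocks(client, job_id):
    """Gather Blocks from every result page of a finished analysis job."""
    blocks = []
    next_token = None
    while True:
        kwargs = {"JobId": job_id}
        if next_token:
            kwargs["NextToken"] = next_token
        page = client.get_document_analysis(**kwargs)
        blocks.extend(page.get("Blocks", []))
        next_token = page.get("NextToken")
        if not next_token:
            return blocks


class FakeTextract:
    """Stand-in client that serves canned pages keyed by token."""

    def __init__(self, pages):
        self._pages = pages

    def get_document_analysis(self, JobId, NextToken=None):
        return self._pages[NextToken]


fake_client = FakeTextract({
    None: {"JobStatus": "SUCCEEDED",
           "Blocks": [{"Id": "a"}, {"Id": "b"}],
           "NextToken": "t1"},
    "t1": {"JobStatus": "SUCCEEDED",
           "Blocks": [{"Id": "c"}]},
})

all_blocks = collect_blocks(fake_client, "demo-job")
print([b["Id"] for b in all_blocks])
```

With the real service you would pass the boto3 Textract client and the JobId from start_document_analysis, after the job has reported SUCCEEDED.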
Tips for Better Results
- Aim for 300 DPI scans
- Keep images aligned and well-lit
- Clean up noise or background clutter before submission
- Use Custom Queries for inconsistent formats
- Implement confidence thresholds and human-in-the-loop validation for critical data
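The last tip can be sketched in a few lines: split blocks into those you accept automatically and those you send to a person. Textract reports Confidence on a 0 to 100 scale. The threshold of 90 and the function name `split_by_confidence` are assumptions; tune the cutoff against your own documents.

```python
def split_by_confidence(blocks, threshold=90.0):
    """Separate blocks into (accepted, needs_review) by Confidence."""
    accepted, needs_review = [], []
    for block in blocks:
        if "Confidence" not in block:
            continue  # e.g. PAGE blocks carry no score to judge
        if block["Confidence"] >= threshold:
            accepted.append(block)
        else:
            needs_review.append(block)
    return accepted, needs_review


# Made-up sample blocks for illustration
sample_blocks = [
    {"BlockType": "WORD", "Text": "12345", "Confidence": 99.1},
    {"BlockType": "WORD", "Text": "l2E45", "Confidence": 61.3},
    {"BlockType": "PAGE"},
]

accepted, needs_review = split_by_confidence(sample_blocks)
print(len(accepted), "accepted;", len(needs_review), "for human review")
```

In a real pipeline, the needs_review list is what you would hand to a human-in-the-loop step rather than writing it straight to a database.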
Wrap-Up: Why Textract Matters
Textract is more than OCR. It’s document intelligence. With support for forms, tables, layout, and targeted queries, it empowers you to automate what used to require manual review.
From financial documents to medical forms, Textract is your API for transforming unstructured files into structured, searchable data.
Ready to go further? In upcoming posts, we’ll integrate Textract with Amazon Comprehend for sentiment and entity extraction, and build a full document processing pipeline on AWS.
Stay tuned.
About the Author
Rick Hightower is a seasoned software developer and technical writer with extensive experience in cloud computing and AI technologies. As a certified AWS Solutions Architect, he specializes in helping developers implement intelligent document processing solutions using AWS services.
With over a decade of hands-on experience building enterprise-scale applications, Rick combines deep technical knowledge with clear, practical writing. He makes complex topics accessible to developers of all skill levels.
Follow Rick’s technical insights and tutorials on cloud computing, machine learning, and software development best practices through his regular contributions to this blog.