May 12, 2025
Tired of drowning in a sea of paperwork? Discover how to transform that mountain of PDFs into actionable insights with AWS’s intelligent document workflows! Say goodbye to chaos and hello to efficiency—your digital assistant awaits!
Learn how to build an intelligent document workflow with AWS Textract and Amazon Comprehend: automate document processing, extract text, analyze content, and turn unstructured data into structured, actionable insights.
Let’s explore how to use Textract’s FeatureTypes parameter to extract form data more precisely: the FORMS feature detects key-value pairs, while TABLES finds structured tabular data. We’ll also explore Comprehend’s ability to analyze sentiment and emotional tone in documents.
By combining Textract and Comprehend, you unlock powerful capabilities like extracting customer feedback from scanned forms and automatically determining if it’s positive or negative.
Building Your First Intelligent Document Workflow with AWS Textract and Comprehend
That mountain of PDFs isn’t going to process itself.
The Paper Problem
Picture this: your desk is disappearing under a mountain of papers. Invoices are stacking up, contracts are everywhere, and maybe there are some medical forms sprinkled in for good measure. It’s chaos. The information you need is buried somewhere in there, but finding it feels like an archaeological expedition.
Sound familiar?
This paper problem isn’t just annoying—it’s a massive bottleneck for businesses everywhere. Think about the hours wasted digging through documents and manually typing data into systems. It’s not just slow; it’s begging for mistakes. A single typo in an insurance ID? Claim denied. Overlooking an allergy note in a medical record? That could be serious.
The real issue isn’t just finding information reactively—it’s about understanding it proactively. Instead of searching for one policy detail, what if your system could automatically flag potential compliance issues?
That tedious, repetitive document work is screaming for a smarter approach.
Enter Document Intelligence
This is where document intelligence enters the picture. Think of it as a tireless digital assistant that doesn’t just read documents but actually understands what’s inside and organizes it for you.
At its core, document intelligence transforms unstructured chaos (your paper mountain) into structured, searchable, analyzable data. It’s like having a super analyst who can read thousands of documents simultaneously, never gets tired, and maintains consistent accuracy.
AWS offers a powerful toolkit for building document intelligence workflows, with two key services at the center:
- AWS Textract: This service extracts text, forms, tables, and document structure from images and PDFs. It’s your document interpreter.
- Amazon Comprehend: This service takes the text Textract pulls out and makes sense of it. It identifies people, organizations, dates, and other entities.
When these services work together, magic happens. Textract handles the extraction foundation, while Comprehend builds on it with semantic understanding: the meaning behind the text.
In this article, I’ll show you how to build your first intelligent document workflow using these AWS APIs. Don’t worry if you’re not an AWS expert. Curiosity is the main prerequisite.
Setting Up Your AWS Environment
Before diving into code, let’s set up our AWS environment. Think of this as preparing your workshop. You need the right tools and safety equipment before building anything.
AWS Account and Permissions
First, you’ll need an AWS account. If you don’t have one, sign up at aws.amazon.com. The AWS Free Tier covers most small-scale experiments with Textract and Comprehend, making it perfect for learning.
For document processing, you need permissions for three core services:
- Amazon Textract
- Amazon Comprehend
- Amazon S3 (for document storage)
Security best practice: Create a dedicated IAM role or user for document processing. Scope permissions to only the resources and actions needed. Never use your root account for day-to-day work.
Here’s a minimal IAM policy example. Update with your details:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"textract:AnalyzeDocument",
"textract:StartDocumentAnalysis",
"textract:GetDocumentAnalysis",
"textract:DetectDocumentText"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"comprehend:DetectEntities",
"comprehend:DetectKeyPhrases",
"comprehend:DetectPiiEntities"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::your-bucket-name/*",
"arn:aws:s3:::your-bucket-name"
]
}
]
}
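If you prefer scripting the setup, here is a hedged boto3 sketch that creates this policy and attaches it to a dedicated user. The file name, policy name, and user name are all assumptions, and the user is assumed to already exist:

import boto3

iam = boto3.client('iam')

# Load the minimal policy shown above (file name is an assumption)
with open('policy.json') as f:
    policy_document = f.read()

# Create the managed policy and attach it to a dedicated processing user
created = iam.create_policy(
    PolicyName='DocumentProcessingPolicy',  # assumed name
    PolicyDocument=policy_document
)
iam.attach_user_policy(
    UserName='doc-processing-user',  # assumed to exist
    PolicyArn=created['Policy']['Arn']
)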
Installing the AWS Tools
Next, you’ll need the AWS CLI and boto3 (AWS SDK for Python). Here’s how to install them:
For AWS CLI:
# macOS/Linux
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
# Or via Homebrew on macOS
brew install awscli
# Windows (using Chocolatey)
choco install awscli
For boto3:
pip install boto3
Configure your CLI with credentials:
aws configure
# Then enter your Access Key ID, Secret Access Key, default region, and output format
Test your setup by listing S3 buckets:
aws s3 ls
If that works, you’re ready to start building!
Your First Document Extraction with Textract
Let’s kick things off by extracting text from a document. First, upload a sample document to S3:
aws s3 cp sample-document.pdf s3://your-bucket-name/
Now let’s write some Python code to extract text from this document:
import boto3
# Create a Textract client
textract = boto3.client('textract')
# Call Textract to detect text in your S3 document
response = textract.detect_document_text(
Document={
'S3Object': {
'Bucket': 'your-bucket-name',
'Name': 'sample-document.pdf'
}
}
)
# Extract text lines from the response
lines = [block['Text'] for block in response['Blocks']
if block['BlockType'] == 'LINE']
# Print the extracted text
print('\n'.join(lines))
This code uses Textract’s detect_document_text API to find all text in your document. The response contains a list of “Blocks”, which represent different elements like pages, lines, and words. Note that this synchronous call handles images and single-page PDFs; multi-page PDFs need the asynchronous API we’ll use later.
For most basic tasks, you’ll want the ‘LINE’ blocks. These contain complete lines of text. Each block includes the detected text, a confidence score (how sure Textract is), and geometry data (where on the page the text appears).
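If you want to act on that metadata, say by routing shaky lines to human review, you can filter blocks by their scores. A minimal sketch building on the response above (the 90% threshold is an assumption; Textract reports confidence on a 0-100 scale):

# Flag lines Textract is less certain about (threshold is an assumption)
CONFIDENCE_THRESHOLD = 90.0

for block in response['Blocks']:
    if block['BlockType'] == 'LINE' and block['Confidence'] < CONFIDENCE_THRESHOLD:
        box = block['Geometry']['BoundingBox']
        print(f"Review: '{block['Text']}' "
              f"(confidence {block['Confidence']:.1f}, "
              f"top of line at {box['Top']:.2f} down the page)")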
For richer documents with tables or forms, you’ll want to use the analyze_document API with specific feature types. We’ll explore this later.
Understanding the Document with Comprehend
Extracting text is just the beginning. Next, let’s analyze the content using Amazon Comprehend to identify entities and key phrases. Entities include people, organizations, dates, and more.
import boto3
# Create a Comprehend client
comprehend = boto3.client('comprehend')
# Join the extracted lines into a single text block
text = '\n'.join(lines)
# Detect entities (people, organizations, locations, etc.)
entities = comprehend.detect_entities(
Text=text,
LanguageCode='en' # Use 'es', 'fr', etc. for other languages
)
# Detect key phrases (important terms/concepts)
key_phrases = comprehend.detect_key_phrases(
Text=text,
LanguageCode='en'
)
# Print entities with their types and confidence scores
for entity in entities['Entities']:
print(f"Entity: {entity['Text']} "
f"(Type: {entity['Type']}, "
f"Score: {entity['Score']:.2f})")
# Print key phrases
for phrase in key_phrases['KeyPhrases']:
print(f"Key Phrase: {phrase['Text']} "
f"(Score: {phrase['Score']:.2f})")
Here, we’re passing the extracted text to Comprehend’s entity and key phrase detection APIs. Comprehend returns structured information about what it found, including:
- Entities: Real-world items like people, organizations, locations, dates, or monetary amounts, each with a type and confidence score.
- Key Phrases: Important terms or expressions that summarize the main ideas in your document.
High confidence scores (close to 1.0) mean Comprehend is very certain about its results. You might choose to only use entities with scores above 0.90 or flag lower-scoring items for human review.
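To put those scores to work, and to make good on the sentiment scenario from the introduction, here’s a minimal sketch assuming the text, entities, and comprehend variables from the snippet above. The 0.90 threshold and the truncation to roughly 5 KB are assumptions; a production version would chunk the text the way the full workflow below does:

# Keep only entities Comprehend is confident about (0.90 is an assumed threshold)
confident = [e for e in entities['Entities'] if e['Score'] > 0.90]
print(f"{len(confident)} high-confidence entities")

# Gauge overall sentiment: POSITIVE, NEGATIVE, NEUTRAL, or MIXED
# detect_sentiment accepts up to 5 KB of UTF-8 text, so truncate for this sketch
sentiment = comprehend.detect_sentiment(
    Text=text[:4500],
    LanguageCode='en'
)
print(f"Sentiment: {sentiment['Sentiment']}")
print(f"Scores: {sentiment['SentimentScore']}")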
The Complete Workflow: From PDF to Insights
Now, let’s put everything together into a complete, end-to-end pipeline. This script handles larger documents using Textract’s asynchronous API (recommended for PDFs and multi-page documents) and processes text in chunks to respect Comprehend’s API limits:
import boto3
import json
import logging
import re
import sys
import time
# --- Configuration ---
BUCKET = 'your-bucket-name'
DOCUMENT = 'sample-document.pdf'
OUTPUT_FILE = 'document_analysis.json'
LANGUAGE_CODE = 'en' # Set to 'es', 'fr', etc. as needed
SAVE_TO_S3 = False # Set to True to upload output to S3
OUTPUT_S3_BUCKET = 'your-output-bucket'
OUTPUT_S3_KEY = 'results/document_analysis.json'
# --- Logging Setup ---
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s %(levelname)s %(message)s'
)
# --- Initialize AWS Clients ---
textract = boto3.client('textract')
comprehend = boto3.client('comprehend')
s3 = boto3.client('s3')
def start_textract_job(bucket, document):
"""Start an asynchronous Textract text detection job"""
    response = textract.start_document_text_detection(
        DocumentLocation={'S3Object': {'Bucket': bucket, 'Name': document}}
    )
return response['JobId']
def wait_for_textract_job(job_id, poll_interval=5, timeout=600):
"""Wait for a Textract job to complete"""
elapsed = 0
while elapsed < timeout:
status = textract.get_document_text_detection(JobId=job_id)
job_status = status['JobStatus']
if job_status in ['SUCCEEDED', 'FAILED', 'PARTIAL_SUCCESS']:
return status
time.sleep(poll_interval)
elapsed += poll_interval
raise TimeoutError(
f"Textract job {job_id} did not complete within {timeout} seconds."
)
def extract_text_from_textract(job_id, status_response):
    """Extract all text lines from a Textract job response, handling pagination"""
    lines = []
    response = status_response
# Process all pages of results
while True:
for block in response['Blocks']:
if block['BlockType'] == 'LINE':
lines.append(block['Text'])
next_token = response.get('NextToken', None)
if not next_token:
break
        response = textract.get_document_text_detection(
            JobId=job_id,
            NextToken=next_token
        )
return '\n'.join(lines)
def chunk_text(text, max_bytes=5000):
"""Split text into chunks that respect Comprehend's API limits"""
sentences = re.split(r'(?<=[.!?]) +', text)
chunks = []
current = ''
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if len(candidate.encode('utf-8')) > max_bytes:
            if current:
                chunks.append(current)
                current = ''
            if len(sentence.encode('utf-8')) > max_bytes:
                # Sentence alone exceeds max_bytes; split mid-sentence
                # (character-based split is approximate for multibyte text)
                for i in range(0, len(sentence), max_bytes):
                    chunks.append(sentence[i:i + max_bytes])
            else:
                current = sentence
        else:
            current = candidate
if current:
chunks.append(current)
return chunks
def analyze_with_comprehend(text, language_code):
"""Analyze text with Comprehend, handling chunking for API limits"""
entities = []
key_phrases = []
# Process each chunk of text
for chunk in chunk_text(text):
if chunk.strip():
ent_resp = comprehend.detect_entities(
Text=chunk,
LanguageCode=language_code
)
kp_resp = comprehend.detect_key_phrases(
Text=chunk,
LanguageCode=language_code
)
entities.extend(ent_resp['Entities'])
key_phrases.extend(kp_resp['KeyPhrases'])
return entities, key_phrases
def save_output(output, filename, to_s3=False, s3_bucket=None, s3_key=None):
"""Save results as JSON locally and optionally to S3"""
with open(filename, 'w', encoding='utf-8') as f:
json.dump(output, f, indent=2, ensure_ascii=False)
logging.info(f"Results saved locally to {filename}")
if to_s3 and s3_bucket and s3_key:
s3.upload_file(filename, s3_bucket, s3_key)
logging.info(f"Results uploaded to s3://{s3_bucket}/{s3_key}")
if __name__ == '__main__':
try:
# Step 1: Start Textract async job
logging.info(f"Starting Textract async job for s3://{BUCKET}/{DOCUMENT} ...")
job_id = start_textract_job(BUCKET, DOCUMENT)
# Step 2: Wait for completion
logging.info(f"Waiting for Textract job {job_id} to complete ...")
status_response = wait_for_textract_job(job_id)
if status_response['JobStatus'] != 'SUCCEEDED':
raise RuntimeError(
f"Textract job failed with status: {status_response['JobStatus']}"
)
# Step 3: Extract text
        text = extract_text_from_textract(job_id, status_response)
logging.info(f"Extracted {len(text.splitlines())} lines of text.")
# Step 4: Analyze with Comprehend (chunked)
logging.info(
"Analyzing text with Comprehend (entities/key phrases, chunked for API limits) ..."
)
entities, key_phrases = analyze_with_comprehend(text, LANGUAGE_CODE)
logging.info(f"Found {len(entities)} entities and {len(key_phrases)} key phrases.")
# Step 5: Save output as JSON (locally or to S3)
output = {
'ExtractedText': text,
'Entities': entities,
'KeyPhrases': key_phrases
}
save_output(
output, OUTPUT_FILE,
to_s3=SAVE_TO_S3,
s3_bucket=OUTPUT_S3_BUCKET,
s3_key=OUTPUT_S3_KEY
)
except Exception as e:
logging.error(f"Error during document processing: {e}")
sys.exit(1)
This script follows a clear extract-analyze-output pattern:
1. Start a Textract Job: Begin asynchronous text extraction from a document in S3.
2. Wait for Completion: Poll until the job finishes, with timeout handling.
3. Extract Text: Collect all text lines from the results, handling pagination for large documents.
4. Analyze Content: Feed the text to Comprehend in manageable chunks, gathering entities and key phrases.
5. Save Results: Output everything as structured JSON, locally or to S3.
The asynchronous approach lets the workflow handle large, multi-page documents, while chunking keeps each request within Comprehend’s API limits. Error handling and logging make the script sturdier, though production use would also call for retries and monitoring.
Understanding Your Results: The JSON Output
The output JSON from our workflow looks something like this:
{
"ExtractedText": "...full document text...",
"Entities": [
{ "Text": "John Doe", "Type": "PERSON", "Score": 0.99 },
{ "Text": "Acme Corp", "Type": "ORGANIZATION", "Score": 0.97 },
{ "Text": "January 15, 2024", "Type": "DATE", "Score": 0.98 }
],
"KeyPhrases": [
{ "Text": "Invoice Number", "Score": 0.98 },
{ "Text": "Due Date", "Score": 0.95 }
]
}
This structured format makes it easy to integrate with databases, business intelligence tools, or automated workflows. You could, for example:
- Import entities into a CRM system
- Track invoice dates and amounts in a financial dashboard
- Trigger alerts based on specific entities or key phrases (see the sketch after this list)
- Feed the structured data into a machine learning model for further analysis
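To make the alerting idea concrete, here is a hedged sketch that loads the saved JSON and flags any high-confidence ORGANIZATION entity on a watchlist. The watchlist contents and the 0.90 threshold are assumptions:

import json

# Load the output produced by the workflow above
with open('document_analysis.json') as f:
    results = json.load(f)

# Hypothetical watchlist; replace with your own criteria
WATCHLIST = {'Acme Corp'}

for entity in results['Entities']:
    if (entity['Type'] == 'ORGANIZATION'
            and entity['Score'] > 0.90
            and entity['Text'] in WATCHLIST):
        print(f"ALERT: document mentions {entity['Text']}")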
Common Pitfalls and How to Avoid Them
Even the best-designed workflows can hit snags. Here are some common issues and how to address them:
1. Permission Denied
Symptom: Errors like AccessDeniedException or UnauthorizedOperation.
Fix: Check your IAM policies. Ensure you have permissions for Textract (including async APIs), Comprehend, and the appropriate S3 buckets.
2. Poor Extraction Quality
Symptom: Textract returns incomplete or garbled text.
Fix: Use high-resolution, clear, and unprotected PDFs. Avoid blurry scans. Digital PDFs work best. For tables and forms, use the analyze_document API with appropriate feature types.
3. API Limits and Timeouts
Symptom: Errors about file size, page count, or request limits.
Fix: Use Textract’s asynchronous APIs for PDFs and multi-page documents. Split text into chunks for Comprehend (as we did in our example). For high-volume processing, consider batch APIs or parallel processing.
4. Language Mismatch
Symptom: Comprehend finds few or no entities for non-English text.
Fix: Set the LanguageCode parameter correctly (e.g., 'es' for Spanish). Make sure your text matches the language code.
Beyond the Basics: Where to Go Next
You’ve just built a solid foundation for document intelligence. Think of it as having learned to crawl—now it’s time to walk and eventually run. Here are some advanced directions to explore:
Extracting Forms and Key-Value Pairs
For documents with forms (like invoices or applications), you can extract specific fields:
# Extract key-value pairs from a form
response = textract.analyze_document(
Document={'S3Object': {'Bucket': 'your-bucket-name',
'Name': 'form-example.pdf'}},
FeatureTypes=['FORMS']
)
# This will identify fields like "Invoice Number: 12345" as key-value pairs
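The key-value pairs come back as KEY_VALUE_SET blocks that point to their text through relationships. A simplified sketch of walking them, assuming the response above (it ignores selection elements like checkboxes):

# Index every block by ID so relationships can be resolved
blocks = {b['Id']: b for b in response['Blocks']}

def block_text(block):
    """Concatenate the WORD children of a block"""
    words = []
    for rel in block.get('Relationships', []):
        if rel['Type'] == 'CHILD':
            for child_id in rel['Ids']:
                child = blocks[child_id]
                if child['BlockType'] == 'WORD':
                    words.append(child['Text'])
    return ' '.join(words)

# Pair each KEY block with the VALUE block(s) it references
for block in response['Blocks']:
    if (block['BlockType'] == 'KEY_VALUE_SET'
            and 'KEY' in block.get('EntityTypes', [])):
        key = block_text(block)
        for rel in block.get('Relationships', []):
            if rel['Type'] == 'VALUE':
                for value_id in rel['Ids']:
                    print(f"{key}: {block_text(blocks[value_id])}")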
Processing Tables
Tables are common in financial statements, reports, and many other documents:
# Extract tables from a document
response = textract.analyze_document(
Document={'S3Object': {'Bucket': 'your-bucket-name',
'Name': 'table-example.pdf'}},
FeatureTypes=['TABLES']
)
# This will give you structured data for each table in the document
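Tables arrive as TABLE blocks whose CELL children carry RowIndex and ColumnIndex. A rough sketch that rebuilds each table row by row, again assuming the response above:

# Index blocks by ID and rebuild tables from their CELL children
blocks = {b['Id']: b for b in response['Blocks']}

def cell_text(cell):
    """Concatenate the WORD children of a table cell"""
    return ' '.join(
        blocks[cid]['Text']
        for rel in cell.get('Relationships', [])
        if rel['Type'] == 'CHILD'
        for cid in rel['Ids']
        if blocks[cid]['BlockType'] == 'WORD'
    )

for table in (b for b in response['Blocks'] if b['BlockType'] == 'TABLE'):
    rows = {}
    for rel in table.get('Relationships', []):
        if rel['Type'] == 'CHILD':
            for cid in rel['Ids']:
                cell = blocks[cid]
                if cell['BlockType'] == 'CELL':
                    rows.setdefault(cell['RowIndex'], {})[cell['ColumnIndex']] = cell_text(cell)
    for r in sorted(rows):
        print([rows[r][c] for c in sorted(rows[r])])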
Custom Queries for Targeted Extraction
When information isn’t in a fixed spot, you can use custom queries:
# Ask specific questions about a document
response = textract.analyze_document(
Document={'S3Object': {'Bucket': 'your-bucket-name',
'Name': 'contract.pdf'}},
FeatureTypes=['QUERIES'],
QueriesConfig={
'Queries': [
{'Text': 'What is the contract end date?'},
{'Text': 'Who are the parties in this agreement?'}
]
}
)
# This will attempt to answer your specific questions about the document
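Each answer comes back as a QUERY block linked to one or more QUERY_RESULT blocks. A small sketch of reading them out, assuming the response above:

# Index blocks by ID, then match each question to its detected answers
blocks = {b['Id']: b for b in response['Blocks']}

for block in response['Blocks']:
    if block['BlockType'] == 'QUERY':
        question = block['Query']['Text']
        for rel in block.get('Relationships', []):
            if rel['Type'] == 'ANSWER':
                for answer_id in rel['Ids']:
                    answer = blocks[answer_id]
                    print(f"Q: {question}")
                    print(f"A: {answer['Text']} "
                          f"(confidence {answer['Confidence']:.1f})")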
Scaling with Serverless Architectures
As your document processing needs grow, consider serverless architectures using AWS Lambda and Step Functions. These can automatically scale to handle thousands of documents without maintaining any servers.
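As a taste of that direction, here’s a hedged sketch of a Lambda handler that starts an asynchronous Textract job whenever a document lands in an S3 bucket. The SNS topic and role ARNs are placeholders you would supply, and error handling is omitted for brevity:

import boto3

textract = boto3.client('textract')

def lambda_handler(event, context):
    """Kick off an async Textract job for each uploaded S3 object"""
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']
        textract.start_document_text_detection(
            DocumentLocation={'S3Object': {'Bucket': bucket, 'Name': key}},
            NotificationChannel={  # Textract publishes job completion here
                'SNSTopicArn': 'arn:aws:sns:us-east-1:123456789012:textract-done',  # placeholder
                'RoleArn': 'arn:aws:iam::123456789012:role/TextractSNSRole'  # placeholder
            }
        )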
Conclusion: From Paper Mountain to Digital Insights
We’ve come full circle. Remember that mountain of paper we started with? You now have the tools to scale it efficiently, extracting value instead of drowning in manual processing.
The workflow you’ve built—extract, analyze, structure, and act—is the foundation for document intelligence across industries:
- Banking: Process loan applications faster and more accurately
- Insurance: Streamline claims handling and policy management
- Healthcare: Digitize and understand patient records
- Legal: Extract key information from contracts and case files
- Any industry with document-heavy processes: Reduce manual effort and increase accuracy
Each document presents unique challenges, but the pattern remains consistent. As you experiment with different document types and more complex extraction needs, your understanding will deepen, and your solutions will become more sophisticated.
The most exciting part? Document intelligence isn’t just about technology—it’s about freeing humans from mundane data entry to focus on higher-value work that requires judgment, creativity, and personal interaction.
So the next time you face a mountain of documents, you’ll know exactly how to turn that paper chaos into digital intelligence.
If you liked this article, check out this chapter in this book.
About the Author
Rick Hightower is a seasoned software engineer and technical author with extensive experience in cloud computing, AI/ML, and enterprise software development. As an expert in AWS services and modern application architectures, Rick regularly shares insights through articles and technical publications.
With a focus on practical implementation and real-world solutions, Rick has helped numerous organizations modernize their document processing workflows and leverage cloud-native technologies. His writing style combines technical depth with accessible explanations, making complex topics approachable for developers at all levels.
When not writing about technology, Rick can be found experimenting with new tools and frameworks, contributing to open-source projects, and mentoring fellow developers in cloud and AI technologies.
Connect with Rick on LinkedIn. https://www.linkedin.com/in/rickhigh/