May 13, 2025

Learn how to use Amazon Textract to turn unstructured documents, from invoices to contracts, into structured data you can act on. This guide walks through the core APIs with working Python examples.

Amazon Textract converts documents into structured data by detecting forms, tables, and layouts while enabling natural language queries. It includes expense and ID analysis APIs and handles both real-time and batch processing.

This guide covers setup and advanced features. It provides examples for both beginners and experienced users looking to enhance their document processing capabilities.

Document Intelligence with Amazon Textract: A Developer’s Guide

You know that feeling, right? When you’re staring at a scanned document and all the useful stuff is there, but somehow it’s just trapped. Traditional OCR might read the words, but you’re left trying to figure out how it all fits together. Where’s the structure? Where are the tables? Which value belongs to which label?

That’s the puzzle Amazon Textract is built to solve.

In this hands-on tutorial, we’ll dive deep into what makes Textract different from plain OCR, explore how to work with its structured output, and walk through Python examples using the boto3 SDK and the amazon-textract-response-parser library (TRP). Whether you’re dealing with invoices, forms, or contracts, this guide will show you how to turn documents into structured, actionable data.

What Makes Textract Different?

Traditional OCR gives you flat text. Textract gives you structure.

That means it understands not just the characters on a page, but also what they mean in context:

Forms: Detects key-value pairs like “Invoice Number: 12345”
Tables: Recognizes rows and columns, including headers and summary rows
Layout: Identifies paragraphs, lists, titles, and sections
Custom Queries: Lets you ask natural-language questions like “What is the policy number?” and get a direct answer

Textract also supports both printed and handwritten text, making it suitable for real-world, messy documents.

Getting Started: AnalyzeDocument with Forms and Tables

Here’s how to extract structured data from a single-page invoice stored in S3 using Python and boto3:

import boto3

textract = boto3.client('textract')

response = textract.analyze_document(
    Document={
        'S3Object': {
            'Bucket': 'your-bucket-name',
            'Name': 'invoice.pdf'
        }
    },
    FeatureTypes=['FORMS', 'TABLES']
)

for block in response['Blocks']:
    print(block['BlockType'], block.get('Text', ''))

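This returns a list of “blocks.” Each block represents one element of the document, such as a page, line, word, key-value set, table, or table cell. Every block carries metadata: its page number, a confidence score, its position on the page, and relationships that point to other blocks by ID.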
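To make those relationships concrete, here is a minimal sketch that pairs form keys with their values straight from the raw block list. The function name `extract_key_values` is my own, not part of boto3, and the sketch assumes the response came from a call with FORMS enabled.

```python
def extract_key_values(blocks):
    """Pair KEY and VALUE blocks from a Textract FORMS response."""
    by_id = {block["Id"]: block for block in blocks}

    def related_ids(block, rel_type):
        # Collect the IDs a block points to for one relationship type.
        ids = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == rel_type:
                ids.extend(rel["Ids"])
        return ids

    def child_text(block):
        # Join WORD children; show selected checkboxes as "[X]".
        parts = []
        for child_id in related_ids(block, "CHILD"):
            child = by_id.get(child_id)
            if child is None:
                continue
            if child["BlockType"] == "WORD":
                parts.append(child["Text"])
            elif (child["BlockType"] == "SELECTION_ELEMENT"
                  and child.get("SelectionStatus") == "SELECTED"):
                parts.append("[X]")
        return " ".join(parts)

    pairs = {}
    for block in blocks:
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if not is_key:
            continue
        # A KEY block links to its VALUE block through a VALUE relationship.
        value_text = " ".join(
            child_text(by_id[value_id])
            for value_id in related_ids(block, "VALUE")
            if value_id in by_id
        )
        pairs[child_text(block)] = value_text
    return pairs
```

Calling `extract_key_values(response['Blocks'])` on the response above would return a plain dictionary of labels and values. Note that repeated labels overwrite each other in this sketch.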

Making It Easier: Using the TRP Library

Textract’s raw JSON is powerful but complex. That’s where the amazon-textract-response-parser library (TRP) comes in.

from trp import Document

doc = Document(response)

for page in doc.pages:
    for field in page.form.fields:
        print(f"{field.key.text}: {field.value.text}")

    for table in page.tables:
        for row in table.rows:
            print(" | ".join(cell.text for cell in row.cells))

TRP handles the relationships for you, turning block graphs into familiar Python objects.
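Once tables are plain Python objects, exporting them is straightforward. The sketch below writes each table on a page to CSV text using only the `pages`/`tables`/`rows`/`cells`/`text` attributes shown above. The helper name `tables_to_csv` is hypothetical, and the whitespace stripping is a precaution in case cell text carries padding.

```python
import csv
import io


def tables_to_csv(page):
    """Return one CSV string per table on a TRP-style page object."""
    outputs = []
    for table in page.tables:
        buffer = io.StringIO()
        writer = csv.writer(buffer, lineterminator="\n")
        for row in table.rows:
            # Strip padding so cells compare cleanly downstream.
            writer.writerow([(cell.text or "").strip() for cell in row.cells])
        outputs.append(buffer.getvalue())
    return outputs
```

With the `doc` object from the example above, `tables_to_csv(doc.pages[0])` would give a list of CSV strings, one per table on the first page.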

Advanced Feature: Custom Queries

Want to ask specific questions about a document, regardless of layout?

response = textract.analyze_document(
    Document={
        'S3Object': {
            'Bucket': 'your-bucket-name',
            'Name': 'policy.pdf'
        }
    },
    FeatureTypes=['QUERIES'],
    QueriesConfig={
        'Queries': [{'Text': 'What is the policy number?'}]
    }
)

for block in response['Blocks']:
    if block['BlockType'] == 'QUERY_RESULT':
        print(f"Answer: {block.get('Text', '')}")

Textract will find and extract the answer, even if the format varies across documents.
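When you send several queries at once, it helps to know which answer belongs to which question. In the response, each QUERY block points to its QUERY_RESULT blocks through an ANSWER relationship. Here is a minimal sketch that follows those links; the helper name `answers_by_query` is my own, and it keeps only the highest-confidence answer per query, which is a design choice rather than anything Textract requires.

```python
def answers_by_query(blocks):
    """Map each query (by alias, else by text) to its best answer."""
    by_id = {block["Id"]: block for block in blocks}
    answers = {}
    for block in blocks:
        if block["BlockType"] != "QUERY":
            continue
        query = block.get("Query", {})
        # Alias is optional in QueriesConfig; fall back to the question text.
        label = query.get("Alias") or query.get("Text", "")
        results = []
        for rel in block.get("Relationships", []):
            if rel["Type"] != "ANSWER":
                continue
            for result_id in rel["Ids"]:
                result = by_id.get(result_id)
                if result is not None:
                    results.append(
                        (result.get("Text", ""), result.get("Confidence", 0.0))
                    )
        # None marks a query Textract could not answer.
        answers[label] = max(results, key=lambda r: r[1]) if results else None
    return answers
```

Setting an `Alias` on each query, for example `{'Text': 'What is the policy number?', 'Alias': 'POLICY_NUMBER'}`, gives you stable dictionary keys even if you later reword the question.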

Real-World APIs: AnalyzeExpense and AnalyzeID

Amazon offers specialized APIs for common use cases:

analyze_expense – Extracts totals, vendor names, tax, etc. from receipts and invoices
analyze_id – Parses driver’s licenses and passports

Example:

response = textract.analyze_expense(
    Document={
        'S3Object': {
            'Bucket': 'your-bucket-name',
            'Name': 'receipt.pdf'
        }
    }
)

for doc in response['ExpenseDocuments']:
    for field in doc['SummaryFields']:
        print(f"{field['Type']['Text']}: {field.get('ValueDetection', {}).get('Text', '')}")
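The example above covers analyze_expense. For analyze_id, the request takes a list under `DocumentPages` instead of a single `Document`, and the fields come back under `IdentityDocuments`. Below is a minimal sketch of flattening that response; the helper name `id_fields`, the `min_confidence` parameter, and the file name `license.png` are placeholders of mine.

```python
def id_fields(response, min_confidence=0.0):
    """Flatten an AnalyzeID response into one dict per identity document."""
    documents = []
    for doc in response.get("IdentityDocuments", []):
        fields = {}
        for field in doc.get("IdentityDocumentFields", []):
            name = field.get("Type", {}).get("Text", "")
            value = field.get("ValueDetection", {})
            # Skip empty values and anything below the chosen confidence.
            if value.get("Text") and value.get("Confidence", 0.0) >= min_confidence:
                fields[name] = value["Text"]
        documents.append(fields)
    return documents


# Usage (needs AWS credentials; bucket and file names are placeholders):
# import boto3
# textract = boto3.client("textract")
# response = textract.analyze_id(
#     DocumentPages=[
#         {"S3Object": {"Bucket": "your-bucket-name", "Name": "license.png"}}
#     ]
# )
# print(id_fields(response, min_confidence=80.0))
```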

Synchronous vs. Asynchronous: Which One Should You Use?

Synchronous: Great for small, real-time tasks (e.g., mobile scans). Accepts images and single-page PDF or TIFF files
Asynchronous: Ideal for multi-page PDFs, batch jobs, or large files. The document must be stored in S3

Asynchronous usage:

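import time

job_id = textract.start_document_analysis(
    DocumentLocation={
        'S3Object': {'Bucket': 'your-bucket', 'Name': 'long-file.pdf'}
    },
    FeatureTypes=['FORMS', 'TABLES']
)['JobId']

# Polling loop (in production, use SNS)
while True:
    result = textract.get_document_analysis(JobId=job_id)
    status = result['JobStatus']
    if status == 'SUCCEEDED':
        break
    if status != 'IN_PROGRESS':
        # FAILED or PARTIAL_SUCCESS: stop instead of looping forever
        raise RuntimeError(f"Textract job {job_id} ended with status {status}")
    time.sleep(5)  # wait before polling again to avoid throttling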
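One thing to watch for: on long documents, get_document_analysis returns results in pages. Each response may include a `NextToken`, and you need to keep calling until it disappears. Here is a minimal sketch of that loop; the function name `collect_blocks` is my own, and it assumes the job has already reached SUCCEEDED.

```python
def collect_blocks(client, job_id):
    """Gather every block from a finished analysis job, following NextToken."""
    blocks = []
    kwargs = {"JobId": job_id}
    while True:
        page = client.get_document_analysis(**kwargs)
        blocks.extend(page.get("Blocks", []))
        token = page.get("NextToken")
        if not token:
            # No token means this was the last page of results.
            return blocks
        kwargs["NextToken"] = token
```

Passing the client in as an argument keeps the function easy to test with a stub. In real use you would call `collect_blocks(textract, job_id)`.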

Tips for Better Results

Aim for 300 DPI scans
Keep images aligned and well-lit
Clean up noise or background clutter before submission
Use Custom Queries for inconsistent formats
Implement confidence thresholds and human-in-the-loop validation for critical data
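The last tip can be sketched in a few lines. Textract reports confidence as a score from 0 to 100, so one simple approach is to split extracted fields into "accept" and "send to a human" piles. The function name `triage_summary_fields` and the default threshold of 90 are assumptions for illustration; pick a threshold based on your own error tolerance.

```python
def triage_summary_fields(expense_response, threshold=90.0):
    """Split AnalyzeExpense summary fields by confidence.

    Returns (accepted, needs_review), each a list of
    (field_name, text, confidence) tuples.
    """
    accepted, needs_review = [], []
    for doc in expense_response.get("ExpenseDocuments", []):
        for field in doc.get("SummaryFields", []):
            name = field.get("Type", {}).get("Text", "UNKNOWN")
            value = field.get("ValueDetection", {})
            confidence = value.get("Confidence", 0.0)
            entry = (name, value.get("Text", ""), confidence)
            # Anything below the threshold goes to human review.
            if confidence >= threshold:
                accepted.append(entry)
            else:
                needs_review.append(entry)
    return accepted, needs_review
```

Fields in the review list could then be routed to whatever manual validation step your pipeline uses.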

Wrap-Up: Why Textract Matters

Textract is more than OCR. It’s document intelligence. With support for forms, tables, layout, and targeted queries, it empowers you to automate what used to require manual review.

From financial documents to medical forms, Textract is your API for transforming unstructured files into structured, searchable data.

Ready to go further? In upcoming posts, we’ll integrate Textract with Amazon Comprehend for sentiment and entity extraction, and build a full document processing pipeline on AWS.

Stay tuned.

About the Author

Rick Hightower is a seasoned software developer and technical writer with extensive experience in cloud computing and AI technologies. As a certified AWS Solutions Architect, he specializes in helping developers implement intelligent document processing solutions using AWS services.

With over a decade of hands-on experience building enterprise-scale applications, Rick combines deep technical knowledge with clear, practical writing. He makes complex topics accessible to developers of all skill levels.

Follow Rick’s technical insights and tutorials on cloud computing, machine learning, and software development best practices through his regular contributions to this blog.
