May 13, 2025
Making Sense of Textract Output: A Developer’s Fast Track with the TRP Library
You know that feeling when you open a scanned document, and it’s like all the valuable information is just sitting there—but shattered across the page in a hundred disjointed fragments? Sure, traditional OCR gets you the text. But it doesn’t give you the map. It doesn’t tell you how the pieces fit together: what’s a table, what’s a form, what field goes with what value.
That’s exactly where Amazon Textract shines—and even more so with its companion library: the Amazon Textract Response Parser (TRP).
In this article, we’ll take a developer-focused, two-part look at how to:
- Transform raw Textract responses into clean, accessible Python objects using TRP
- Tap into Textract’s document intelligence to extract structure, meaning, and context
Let’s get into it.
Part 1: Wrestling Textract Output into Shape with TRP
The Problem
Textract returns a huge JSON file full of nested “blocks” representing words, lines, key-value pairs, and more—each connected by relationships. Navigating this by hand is like document archaeology.
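To make the "archaeology" concrete, here is a minimal sketch of pairing one form key with its value by hand. The sample_response dict is a hand-built stand-in, far smaller than a real response, and the helper names (related_ids, child_text, key_value_pairs) are ours for illustration, not part of any library.

```python
# Hand-built stand-in for a Textract response; real responses carry many more fields.
sample_response = {
    "Blocks": [
        {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
         "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                           {"Type": "CHILD", "Ids": ["w1", "w2"]}]},
        {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
         "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
        {"Id": "w1", "BlockType": "WORD", "Text": "Invoice"},
        {"Id": "w2", "BlockType": "WORD", "Text": "#:"},
        {"Id": "w3", "BlockType": "WORD", "Text": "2024001"},
    ]
}


def related_ids(block, relationship_type):
    """Return the IDs a block points to for one relationship type."""
    ids = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == relationship_type:
            ids.extend(rel["Ids"])
    return ids


def child_text(block, blocks_by_id):
    """Join the text of a block's WORD children."""
    children = [blocks_by_id[i] for i in related_ids(block, "CHILD")]
    return " ".join(c["Text"] for c in children if c["BlockType"] == "WORD")


def key_value_pairs(response):
    """Pair each KEY block's text with the text of its VALUE block(s)."""
    blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
    pairs = {}
    for block in response["Blocks"]:
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if is_key:
            key = child_text(block, blocks_by_id)
            values = [child_text(blocks_by_id[i], blocks_by_id)
                      for i in related_ids(block, "VALUE")]
            pairs[key] = " ".join(values)
    return pairs


print(key_value_pairs(sample_response))
```

Three lookups and two helpers for a single field: that is the bookkeeping a parser library takes off your hands.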
Enter the amazon-textract-response-parser library (TRP). It organizes the chaos into Python objects you can reason about.
Installing TRP
pip install amazon-textract-response-parser
Step 1: Deserialize Textract Output
Textract’s response JSON can be loaded into a structured Python object using TRP2:
from trp.trp2 import TDocumentSchema
# Assume you have the raw Textract JSON loaded into textract_response
document = TDocumentSchema().load(textract_response)
This converts everything into typed Python classes. You can also go back to JSON:
tdict = TDocumentSchema().dump(document)
Step 2: Work with the Document Object Model
Once structured, you can interact with the document using the classic TRP interface:
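from trp import Document
# The classic interface takes the plain dict form of the response
doc = Document(TDocumentSchema().dump(document))
# Print form fields and tables
for page in doc.pages:
    for field in page.form.fields:
        # A key can be detected without a value, so guard against None
        value = field.value.text if field.value else ""
        print(f"{field.key.text}: {value}")
    for table in page.tables:
        for row in table.rows:
            print(" | ".join(cell.text for cell in row.cells))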
TRP Pipelines: Smarter Post-Processing
TRP also comes with helpful “pipeline components” to enhance the parsed output.
1. Order Blocks by Geometry
Textract doesn’t guarantee that blocks come back in reading order. This component sorts them by their position on the page:
from trp.t_pipeline import order_blocks_by_geo
ordered_doc = order_blocks_by_geo(document)
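To show the idea behind geometry-based ordering, here is a rough sketch that sorts LINE blocks by their bounding boxes. It is not how TRP implements order_blocks_by_geo; the function name and the row_tolerance parameter are assumptions made for illustration, and the approach only suits single-column pages.

```python
def order_lines_top_to_bottom(blocks, row_tolerance=0.01):
    """Sort LINE blocks top-to-bottom, then left-to-right within a row.

    Coordinates are the 0..1 page ratios Textract reports in
    Geometry.BoundingBox. Lines whose Top values fall in the same
    row_tolerance bucket are treated as one row.
    """
    lines = [b for b in blocks if b["BlockType"] == "LINE"]

    def sort_key(block):
        box = block["Geometry"]["BoundingBox"]
        row = round(box["Top"] / row_tolerance)  # bucket nearby Top values
        return (row, box["Left"])

    return sorted(lines, key=sort_key)


# Hand-built sample blocks, deliberately out of reading order.
sample_blocks = [
    {"BlockType": "LINE", "Text": "Total: $10",
     "Geometry": {"BoundingBox": {"Top": 0.50, "Left": 0.10}}},
    {"BlockType": "LINE", "Text": "Date",
     "Geometry": {"BoundingBox": {"Top": 0.102, "Left": 0.60}}},
    {"BlockType": "LINE", "Text": "Invoice",
     "Geometry": {"BoundingBox": {"Top": 0.10, "Left": 0.10}}},
    {"BlockType": "WORD", "Text": "ignored",
     "Geometry": {"BoundingBox": {"Top": 0.0, "Left": 0.0}}},
]

ordered_text = [b["Text"] for b in order_lines_top_to_bottom(sample_blocks)]
print(ordered_text)
```

Bucketing by a fixed tolerance is crude: two lines that straddle a bucket boundary can still land in different rows, which is one reason to prefer the library component on real documents.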
2. Add Page Orientation
Adds a custom field to each page estimating its rotation:
from trp.t_pipeline import add_page_orientation
doc_with_orientation = add_page_orientation(document)
3. Merge or Link Multi-Page Tables
For big tables that span pages:
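from trp.t_pipeline import pipeline_merge_tables
from trp.t_tables import MergeOptions, HeaderFooterType
# Choose one option per document: MERGE combines the tables, LINK connects them
doc_merged = pipeline_merge_tables(document, MergeOptions.MERGE, None, HeaderFooterType.NONE)
doc_linked = pipeline_merge_tables(document, MergeOptions.LINK, None, HeaderFooterType.NONE)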
- Merged tables are easier to work with
- Linked tables preserve geometry and page info
4. Add OCR Confidence Scores to Form Fields
from trp.t_pipeline import add_kv_ocr_confidence
doc_confidence = add_kv_ocr_confidence(document)
This lets you flag low-confidence fields for human review.
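Once confidence scores are attached, routing fields to a reviewer is a simple threshold check. The sketch below works on plain dicts with a made-up shape (key, value, key_confidence, value_confidence); how you pull those numbers out of the parsed document is up to you, and the 90.0 threshold is only a placeholder.

```python
def fields_needing_review(fields, threshold=90.0):
    """Return fields whose key or value confidence falls below threshold.

    Confidence values are assumed to be on Textract's 0-100 scale.
    """
    flagged = []
    for field in fields:
        lowest = min(field["key_confidence"], field["value_confidence"])
        if lowest < threshold:
            flagged.append({**field, "lowest_confidence": lowest})
    return flagged


# Hypothetical records extracted from a parsed form.
sample_fields = [
    {"key": "Invoice #", "value": "2024001",
     "key_confidence": 99.1, "value_confidence": 98.7},
    {"key": "Total", "value": "$1,25O.00",
     "key_confidence": 97.3, "value_confidence": 71.4},
]

review_queue = fields_needing_review(sample_fields)
for item in review_queue:
    print(item["key"], item["value"], item["lowest_confidence"])
```

Taking the lower of the two scores means a crisp label next to a smudged value still gets a second look.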
TRP Object Access Patterns
After enhancement, the Document object gives you clean access to:
# First page lines
for line in doc.pages[0].lines:
    print(line.text, line.confidence)
# Table access
for table in doc.pages[0].tables:
    for row in table.rows:
        print(" | ".join(cell.text for cell in row.cells))
# Search form fields
fields = doc.pages[0].form.searchFieldsByKey("invoice")
for field in fields:
    # A matched key may have no detected value
    value = field.value.text if field.value else ""
    print(field.key.text, value)
Command-Line Support
You can also use the amazon-textract-pipeline CLI to test pipeline features:
aws textract analyze-document ... \
| amazon-textract-pipeline --components add_page_orientation \
| jq .
Great for testing without writing a full script.
Part 2: What Makes Textract Special?
Traditional OCR outputs unstructured text. Textract understands layout, structure, and semantics.
Core Features
- Key-Value Pairs: Recognizes labeled fields (e.g. “Invoice #: 2024001”)
- Tables: Preserves row/column structure
- Layout: Identifies paragraphs, headers, lists
- Custom Queries: Ask questions like “What’s the PO number?”
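As a small illustration of what "preserves row/column structure" means in the raw output, the sketch below rebuilds a grid from TABLE and CELL blocks using their RowIndex and ColumnIndex values. The blocks are hand-built, the helper names are ours, and merged cells (RowSpan/ColumnSpan) are ignored to keep it short.

```python
def child_ids(block):
    """Collect the IDs listed under a block's CHILD relationships."""
    ids = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            ids.extend(rel["Ids"])
    return ids


def table_to_grid(table_block, blocks_by_id):
    """Rebuild a list-of-rows grid from a TABLE block's CELL children."""
    cells = [blocks_by_id[i] for i in child_ids(table_block)]
    cells = [c for c in cells if c["BlockType"] == "CELL"]
    if not cells:
        return []
    n_rows = max(c["RowIndex"] for c in cells)
    n_cols = max(c["ColumnIndex"] for c in cells)
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for cell in cells:
        words = [blocks_by_id[i]["Text"] for i in child_ids(cell)
                 if blocks_by_id[i]["BlockType"] == "WORD"]
        # RowIndex and ColumnIndex are 1-based in Textract output.
        grid[cell["RowIndex"] - 1][cell["ColumnIndex"] - 1] = " ".join(words)
    return grid


# Hand-built 2x2 table with one empty cell.
sample_blocks = [
    {"Id": "t1", "BlockType": "TABLE",
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
    {"Id": "c1", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "c2", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w2"]}]},
    {"Id": "c3", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
    {"Id": "c4", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 2},
    {"Id": "w1", "BlockType": "WORD", "Text": "Item"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Price"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Widget"},
]

blocks_by_id = {b["Id"]: b for b in sample_blocks}
grid = table_to_grid(blocks_by_id["t1"], blocks_by_id)
for row in grid:
    print(" | ".join(row))
```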
Advanced Features
Layout Extraction: Returns paragraph, title, list, and header blocks
Custom Queries: Use natural language to target answers
import boto3
textract = boto3.client('textract')
response = textract.analyze_document(
    Document={'S3Object': {'Bucket': 'bucket', 'Name': 'file.pdf'}},
    FeatureTypes=['QUERIES'],
    QueriesConfig={'Queries': [{'Text': 'What is the invoice total?'}]}
)
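Reading the answers back takes one more hop: each QUERY block points to its QUERY_RESULT blocks through an ANSWER relationship. Here is a minimal sketch against a hand-built response; the query_answers helper is ours, and it keeps only the first answer per question.

```python
def query_answers(response):
    """Map each query's text to (answer text, confidence), or None if unanswered."""
    blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
    answers = {}
    for block in response["Blocks"]:
        if block["BlockType"] != "QUERY":
            continue
        result_ids = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "ANSWER":
                result_ids.extend(rel["Ids"])
        answer = None
        if result_ids:
            result = blocks_by_id[result_ids[0]]  # keep the first answer only
            answer = (result["Text"], result["Confidence"])
        answers[block["Query"]["Text"]] = answer
    return answers


# Hand-built stand-in for an analyze_document response with QUERIES enabled.
sample_response = {
    "Blocks": [
        {"Id": "q1", "BlockType": "QUERY",
         "Query": {"Text": "What is the invoice total?"},
         "Relationships": [{"Type": "ANSWER", "Ids": ["r1"]}]},
        {"Id": "r1", "BlockType": "QUERY_RESULT",
         "Text": "$1,250.00", "Confidence": 98.5},
        {"Id": "q2", "BlockType": "QUERY",
         "Query": {"Text": "What is the PO number?"}},
    ]
}

answers = query_answers(sample_response)
print(answers)
```

A query with no ANSWER relationship simply had nothing found on the page, so the sketch reports None instead of raising.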
Specialized APIs
- analyze_expense: Pretrained for invoices and receipts. Returns totals, vendor info, and line items in semantic structure.
- analyze_id: Tailored for identity documents like passports and driver’s licenses; returns normalized fields such as name, date of birth, and ID number.
- start_lending_analysis (Analyze Lending): Designed for mortgage and loan documents. It runs as an asynchronous job and extracts structured lending-specific information (available in select AWS regions).
These APIs improve accuracy, simplify parsing, and reduce downstream logic.
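To show how little parsing the expense output needs, here is a sketch that flattens the SummaryFields of an analyze_expense response into one dict per document. The sample response is hand-built and trimmed to the fields used; the summary_fields helper is ours, and it keeps the first value seen for each field type.

```python
def summary_fields(expense_response):
    """Return one {field type: value text} dict per expense document."""
    summaries = []
    for doc in expense_response.get("ExpenseDocuments", []):
        fields = {}
        for field in doc.get("SummaryFields", []):
            field_type = field.get("Type", {}).get("Text", "OTHER")
            value = field.get("ValueDetection", {}).get("Text", "")
            fields.setdefault(field_type, value)  # keep the first occurrence
        summaries.append(fields)
    return summaries


# Hand-built stand-in for an analyze_expense response.
sample_expense = {
    "ExpenseDocuments": [
        {"SummaryFields": [
            {"Type": {"Text": "VENDOR_NAME"},
             "ValueDetection": {"Text": "Acme Supplies"}},
            {"Type": {"Text": "TOTAL"},
             "ValueDetection": {"Text": "$1,250.00"}},
            {"Type": {"Text": "TOTAL"},
             "ValueDetection": {"Text": "$1,250.00 USD"}},
        ]}
    ]
}

summaries = summary_fields(sample_expense)
print(summaries)
```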
Asynchronous Processing for Scale
Use start_document_analysis for large or multi-page documents:
response = textract.start_document_analysis(
DocumentLocation={'S3Object': {'Bucket': 'your-bucket', 'Name': 'big-file.pdf'}},
FeatureTypes=['FORMS', 'TABLES']
)
job_id = response['JobId']
To poll for results:
import time
while True:
    result = textract.get_document_analysis(JobId=job_id)
    # Stop on any terminal status (SUCCEEDED, FAILED, or PARTIAL_SUCCESS)
    if result['JobStatus'] != 'IN_PROGRESS':
        break
    time.sleep(5)
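A bare loop like this never gives up. If you do poll, a timeout and a growing delay are cheap insurance. The sketch below is one way to do it: wait_for_job and its parameters are our own names, and the status lookup is passed in as a callable so the helper stays independent of boto3.

```python
import time


def wait_for_job(get_status, timeout_seconds=300.0, initial_delay=1.0,
                 max_delay=30.0, sleep=time.sleep):
    """Poll get_status() until it returns something other than IN_PROGRESS.

    The delay doubles after each attempt, capped at max_delay. Raises
    TimeoutError once the accumulated wait reaches timeout_seconds.
    """
    delay = initial_delay
    waited = 0.0
    while True:
        status = get_status()
        if status != "IN_PROGRESS":
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"job still in progress after {waited:.0f}s")
        sleep(delay)
        waited += delay
        delay = min(delay * 2, max_delay)


# Demo with canned statuses and a fake sleep that only records the delays.
# With a real client the callable could be:
#   lambda: textract.get_document_analysis(JobId=job_id)["JobStatus"]
canned = iter(["IN_PROGRESS", "IN_PROGRESS", "SUCCEEDED"])
recorded_delays = []
final_status = wait_for_job(lambda: next(canned), sleep=recorded_delays.append)
print(final_status, recorded_delays)
```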
Production Tip: Use Amazon SNS notifications to avoid polling. Configure an SNS topic to receive job completion messages.
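On the receiving end, a subscriber (a Lambda function, for example) gets the completion message as a JSON string inside the SNS envelope. The sketch below assumes the standard Lambda SNS event shape and a Textract message carrying JobId and Status fields; the completed_jobs helper and the sample event are ours, so check the payload your topic actually delivers.

```python
import json


def completed_jobs(sns_event):
    """Extract job ID and status from each SNS record in a Lambda event."""
    jobs = []
    for record in sns_event.get("Records", []):
        # The Textract notification arrives as a JSON string in Sns.Message.
        message = json.loads(record["Sns"]["Message"])
        jobs.append({"job_id": message["JobId"], "status": message["Status"]})
    return jobs


# Hand-built event with a made-up job ID, trimmed to the fields used above.
sample_event = {
    "Records": [
        {"Sns": {"Message": json.dumps({
            "JobId": "example-job-id",
            "Status": "SUCCEEDED",
            "API": "StartDocumentAnalysis",
        })}}
    ]
}

jobs = completed_jobs(sample_event)
print(jobs)
```

From there, a job with status SUCCEEDED can be handed to the same get_document_analysis call used for polling, just without the waiting.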
Error Handling Examples
Wrap API calls in try/except blocks and log the error:
import boto3
import logging
textract = boto3.client('textract')
try:
    response = textract.detect_document_text(
        Document={'S3Object': {'Bucket': 'bucket', 'Name': 'file.jpg'}}
    )
except textract.exceptions.InvalidS3ObjectException as e:
    logging.error("Invalid S3 object: %s", e)
except Exception as e:
    logging.error("Unexpected error: %s", e)
Document Preparation Tips
To improve accuracy, follow these guidelines:
- Resolution: Use scans of at least 300 DPI
- Alignment: Avoid skewed or rotated images
- Clarity: Remove smudges, stains, or noise
- Contrast: Use high contrast between text and background
- Cropping: Remove irrelevant margins or borders
- Rotation: Use tools like Pillow to auto-rotate based on EXIF
Example with Pillow:
from PIL import Image, ImageOps
img = Image.open('scan.jpg')
img = ImageOps.exif_transpose(img) # Auto-orient
img = ImageOps.grayscale(img) # Convert to grayscale
img.save('processed.jpg')
Pagination Handling for Large Responses
Asynchronous API responses can span multiple pages. Use pagination to get all results:
pages = []
next_token = None
while True:
    args = {'JobId': job_id}
    if next_token:
        args['NextToken'] = next_token
    result = textract.get_document_analysis(**args)
    pages.append(result)
    next_token = result.get('NextToken')
    if not next_token:
        break
Each response page includes a subset of Blocks. Combine them to get the full document analysis.
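Combining is a plain concatenation. As a sketch, the helper below folds a list of paginated results into one response-shaped dict; combine_pages is our own name, and the sample pages are hand-built stand-ins for get_document_analysis results.

```python
def combine_pages(result_pages):
    """Merge paginated results into a single dict with all Blocks in order."""
    blocks = []
    for page in result_pages:
        blocks.extend(page.get("Blocks", []))
    combined = {"Blocks": blocks}
    # Document metadata repeats on every page, so take it from the first one.
    if result_pages and "DocumentMetadata" in result_pages[0]:
        combined["DocumentMetadata"] = result_pages[0]["DocumentMetadata"]
    return combined


# Hand-built stand-ins for two paginated results.
sample_pages = [
    {"DocumentMetadata": {"Pages": 2},
     "Blocks": [{"Id": "a", "BlockType": "PAGE"}],
     "NextToken": "token-1"},
    {"DocumentMetadata": {"Pages": 2},
     "Blocks": [{"Id": "b", "BlockType": "PAGE"}]},
]

full_response = combine_pages(sample_pages)
print(len(full_response["Blocks"]), full_response["DocumentMetadata"])
```

The combined dict has the same Blocks layout as a single response, so it can be passed on to whatever parses your synchronous results.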
Final Thoughts
Amazon Textract is a shift from reading text to understanding documents. Combined with the TRP library, it gives you:
- A consistent interface to extract meaning
- Structured access to key fields and layout
- Scalable automation for forms, tables, and reports
This is more than OCR—it’s the beginning of true document intelligence.
So next time you’re handed a pile of PDFs, ask yourself: What if your software could actually understand them?
About the Author
Rick Hightower is a seasoned software developer and technical writer with extensive experience in cloud computing and document processing technologies. With a background in enterprise software development and AWS services, Rick specializes in creating developer-focused content that bridges the gap between complex technical concepts and practical implementation.
Currently working as a solutions architect, Rick is passionate about helping developers use cloud-native tools effectively. When not writing about technology, he can be found experimenting with new programming languages and contributing to open-source projects.