
May 13, 2025

                                                                           

Making Sense of Textract Output: A Developer’s Fast Track with the TRP Library

You know that feeling when you open a scanned document, and it’s like all the valuable information is just sitting there—but shattered across the page in a hundred disjointed fragments? Sure, traditional OCR gets you the text. But it doesn’t give you the map. It doesn’t tell you how the pieces fit together: what’s a table, what’s a form, what field goes with what value.

That’s exactly where Amazon Textract shines—and even more so with its companion library: the Amazon Textract Response Parser (TRP).

In this article, we’ll take a developer-focused, two-part look at how to:

  1. Transform raw Textract responses into clean, accessible Python objects using TRP
  2. Tap into Textract’s document intelligence to extract structure, meaning, and context

Let’s get into it.


Part 1: Wrestling Textract Output into Shape with TRP

The Problem

Textract returns a large JSON response full of nested “blocks” representing words, lines, key-value pairs, and more, each linked to the others through ID-based relationships. Navigating this by hand is like document archaeology.
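To make the archaeology concrete, here is a minimal sketch of resolving a single key-value pair by hand. The `sample_response` dict is a hypothetical, heavily trimmed response (real blocks also carry geometry, confidence, and page data), and the helper names are made up for illustration.

```python
# Hypothetical, trimmed-down Textract response: one "Invoice #: 2024001" pair.
sample_response = {
    "Blocks": [
        {"Id": "w1", "BlockType": "WORD", "Text": "Invoice"},
        {"Id": "w2", "BlockType": "WORD", "Text": "#:"},
        {"Id": "w3", "BlockType": "WORD", "Text": "2024001"},
        {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
         "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                           {"Type": "CHILD", "Ids": ["w1", "w2"]}]},
        {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
         "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
    ]
}


def related_ids(block, rel_type):
    """Return the IDs a block points to through one relationship type."""
    ids = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == rel_type:
            ids.extend(rel["Ids"])
    return ids


def child_text(block, blocks_by_id):
    """Join the text of a block's WORD children."""
    children = (blocks_by_id[i] for i in related_ids(block, "CHILD"))
    return " ".join(c["Text"] for c in children if c["BlockType"] == "WORD")


def key_value_pairs(response):
    """Follow KEY -> VALUE relationships and return a plain dict."""
    blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
    pairs = {}
    for block in response["Blocks"]:
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if not is_key:
            continue
        values = [blocks_by_id[i] for i in related_ids(block, "VALUE")]
        pairs[child_text(block, blocks_by_id)] = " ".join(
            child_text(v, blocks_by_id) for v in values
        )
    return pairs


print(key_value_pairs(sample_response))
```

That is three helpers and two levels of ID lookups for one field, before tables or multi-page documents enter the picture.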

Enter the amazon-textract-response-parser library (TRP). It organizes the chaos into Python objects you can reason about.

Installing TRP

pip install amazon-textract-response-parser

Step 1: Deserialize Textract Output

Textract’s response JSON can be loaded into a structured Python object using TRP2:

from trp.trp2 import TDocumentSchema


# Assume you have the raw Textract JSON loaded into textract_response
document = TDocumentSchema().load(textract_response)

This converts everything into typed Python classes. You can also go back to JSON:

tdict = TDocumentSchema().dump(document)

Step 2: Work with the Document Object Model

Once structured, you can interact with the document using the classic TRP interface:

from trp import Document

doc = Document(TDocumentSchema().dump(document))


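# Print form fields and tables
for page in doc.pages:
    for field in page.form.fields:
        # A key can be detected without a value, so guard against None
        key = field.key.text if field.key else ""
        value = field.value.text if field.value else ""
        print(f"{key}: {value}")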

    for table in page.tables:
        for row in table.rows:
            print(" | ".join(cell.text for cell in row.cells))

TRP Pipelines: Smarter Post-Processing

TRP also comes with helpful “pipeline components” to enhance the parsed output.

1. Order Blocks by Geometry

Textract doesn’t guarantee that blocks come back in reading order, especially on multi-column pages. This component sorts them by their position on the page:

from trp.t_pipeline import order_blocks_by_geo

ordered_doc = order_blocks_by_geo(document)

2. Add Page Orientation

Adds a custom field to each page estimating its rotation:

from trp.t_pipeline import add_page_orientation

doc_with_orientation = add_page_orientation(document)

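3. Merge or Link Tables Across Pages

For big tables that span pages:

from trp.t_pipeline import pipeline_merge_tables
from trp.t_tables import MergeOptions, HeaderFooterType

# Pick one mode per document: MERGE combines the tables, LINK only cross-references them
doc_merged = pipeline_merge_tables(document, MergeOptions.MERGE, None, HeaderFooterType.NONE)
# doc_linked = pipeline_merge_tables(document, MergeOptions.LINK, None, HeaderFooterType.NONE)

  • Merged tables are easier to work with
  • Linked tables preserve geometry and page info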

4. Add OCR Confidence Scores to Form Fields

from trp.t_pipeline import add_ocr_confidence

doc_confidence = add_ocr_confidence(document)

This lets you flag low-confidence fields for human review.
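A review queue can be sketched in a few lines. This is only an illustration: it assumes each field exposes `.key` and `.value` objects with `.text` and `.confidence` attributes, as the classic TRP `Field` objects do, and the threshold of 90 is an arbitrary placeholder you would tune. The `fake_field` helper exists only so the example runs without a real Textract response.

```python
from types import SimpleNamespace


def fields_needing_review(fields, threshold=90.0):
    """Return (key, value, confidence) tuples for fields below the threshold."""
    flagged = []
    for field in fields:
        # A key can be detected without a value, so skip missing parts.
        parts = [p for p in (field.key, field.value) if p is not None]
        if not parts:
            continue
        # Judge the field by its weakest part.
        confidence = min(p.confidence for p in parts)
        if confidence < threshold:
            key_text = field.key.text if field.key else ""
            value_text = field.value.text if field.value else ""
            flagged.append((key_text, value_text, confidence))
    return flagged


def fake_field(key, value, key_conf, value_conf):
    """Hypothetical stand-in for a TRP field, for demonstration only."""
    return SimpleNamespace(
        key=SimpleNamespace(text=key, confidence=key_conf),
        value=SimpleNamespace(text=value, confidence=value_conf),
    )


fields = [
    fake_field("Name", "Jane Doe", 99.1, 98.7),
    fake_field("Total", "1,2O4.00", 97.5, 61.3),  # misread digit, low score
]

for key, value, confidence in fields_needing_review(fields):
    print(f"REVIEW {key!r} = {value!r} (confidence {confidence:.1f})")
```

With real data you would pass `page.form.fields` instead of the stand-ins.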


TRP Object Access Patterns

After enhancement, the Document object gives you clean access to:


# First page lines
for line in doc.pages[0].lines:
    print(line.text, line.confidence)


# Table access
for table in doc.pages[0].tables:
    for row in table.rows:
        print(" | ".join(cell.text for cell in row.cells))


# Search form fields
fields = doc.pages[0].form.searchFieldsByKey("invoice")
for field in fields:
    # The matching key may have no detected value
    value = field.value.text if field.value else ""
    print(field.key.text, value)

Command-Line Support

You can also use the amazon-textract-pipeline CLI to test pipeline features:

aws textract analyze-document ... \
  | amazon-textract-pipeline --components add_page_orientation \
  | jq .

Great for testing without writing a full script.


Part 2: What Makes Textract Special?

Traditional OCR outputs unstructured text. Textract understands layout, structure, and semantics.

Core Features

  • Key-Value Pairs: Recognizes labeled fields (e.g. “Invoice #: 2024001”)
  • Tables: Preserves row/column structure
  • Layout: Identifies paragraphs, headers, lists
  • Custom Queries: Ask questions like “What’s the PO number?”

Advanced Features

Layout Extraction: Returns paragraph, title, list, and header blocks

Custom Queries: Use natural language to target answers

import boto3

textract = boto3.client('textract')

response = textract.analyze_document(
    Document={'S3Object': {'Bucket': 'bucket', 'Name': 'file.pdf'}},
    FeatureTypes=['QUERIES'],
    QueriesConfig={'Queries': [{'Text': 'What is the invoice total?'}]}
)
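Answers come back as blocks too: each QUERY block points to one or more QUERY_RESULT blocks through an ANSWER relationship. A rough sketch of pulling them out follows; `sample_response` is a hypothetical, trimmed response and `query_answers` is an illustrative helper, not part of boto3 or TRP.

```python
# Hypothetical, trimmed response: one answered query and one unanswered query.
sample_response = {
    "Blocks": [
        {"Id": "q1", "BlockType": "QUERY",
         "Query": {"Text": "What is the invoice total?"},
         "Relationships": [{"Type": "ANSWER", "Ids": ["a1"]}]},
        {"Id": "a1", "BlockType": "QUERY_RESULT",
         "Text": "$1,204.00", "Confidence": 97.0},
        {"Id": "q2", "BlockType": "QUERY",
         "Query": {"Text": "What is the PO number?"}},
    ]
}


def query_answers(response):
    """Map each question to (answer_text, confidence), or None if unanswered."""
    blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
    answers = {}
    for block in response["Blocks"]:
        if block["BlockType"] != "QUERY":
            continue
        question = block["Query"]["Text"]
        answers[question] = None  # stays None when no ANSWER relationship exists
        for rel in block.get("Relationships", []):
            if rel["Type"] != "ANSWER":
                continue
            results = [blocks_by_id[i] for i in rel["Ids"]]
            # Keep the highest-confidence candidate.
            best = max(results, key=lambda r: r.get("Confidence", 0.0))
            answers[question] = (best["Text"], best.get("Confidence"))
    return answers


for question, answer in query_answers(sample_response).items():
    print(question, "->", answer)
```

Treating a missing answer as `None` rather than an empty string keeps "not found" distinguishable from "found but blank".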

Specialized APIs

  • analyze_expense: Pretrained for invoices and receipts. Returns totals, vendor info, and line items in semantic structure.
  • analyze_id: Tailored for identity documents like passports and driver’s licenses; returns normalized fields such as name, date of birth, and ID number.
  • Analyze Lending (start_lending_analysis / get_lending_analysis): Designed for mortgage and loan documents. Runs as an asynchronous job and extracts structured lending-specific information (available in select AWS regions).

These APIs improve accuracy, simplify parsing, and reduce downstream logic.
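To show what "reduce downstream logic" means in practice, here is a sketch that flattens an analyze_expense response into plain dicts. The `sample_response` below is a hypothetical, trimmed example (real responses also include label detections, geometry, and confidence scores), and `summarize_expense` is an illustrative helper.

```python
# Hypothetical, trimmed analyze_expense response for a one-line invoice.
sample_response = {
    "ExpenseDocuments": [{
        "SummaryFields": [
            {"Type": {"Text": "VENDOR_NAME"}, "ValueDetection": {"Text": "Acme Corp"}},
            {"Type": {"Text": "TOTAL"}, "ValueDetection": {"Text": "$1,204.00"}},
        ],
        "LineItemGroups": [{
            "LineItems": [{
                "LineItemExpenseFields": [
                    {"Type": {"Text": "ITEM"}, "ValueDetection": {"Text": "Widget"}},
                    {"Type": {"Text": "PRICE"}, "ValueDetection": {"Text": "$602.00"}},
                ]
            }]
        }],
    }]
}


def field_pair(field):
    """Return (normalized type, detected value) for one expense field."""
    field_type = field.get("Type", {}).get("Text", "OTHER")
    value = field.get("ValueDetection", {}).get("Text", "")
    return field_type, value


def summarize_expense(response):
    """Flatten each expense document into summary fields and line items."""
    documents = []
    for doc in response.get("ExpenseDocuments", []):
        summary = {}
        for field in doc.get("SummaryFields", []):
            field_type, value = field_pair(field)
            summary.setdefault(field_type, value)  # keep the first occurrence
        line_items = []
        for group in doc.get("LineItemGroups", []):
            for item in group.get("LineItems", []):
                line_items.append(
                    dict(field_pair(f) for f in item.get("LineItemExpenseFields", []))
                )
        documents.append({"summary": summary, "line_items": line_items})
    return documents


print(summarize_expense(sample_response))
```

Because the field types are already normalized (TOTAL, VENDOR_NAME, and so on), there is no need to guess which label on the page means "total".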


Asynchronous Processing for Scale

Use start_document_analysis for large or multi-page documents:

response = textract.start_document_analysis(
    DocumentLocation={'S3Object': {'Bucket': 'your-bucket', 'Name': 'big-file.pdf'}},
    FeatureTypes=['FORMS', 'TABLES']
)
job_id = response['JobId']

To poll for results:

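import time

while True:
    result = textract.get_document_analysis(JobId=job_id)
    # Stop on any terminal status: SUCCEEDED, FAILED, or PARTIAL_SUCCESS
    if result['JobStatus'] != 'IN_PROGRESS':
        break
    time.sleep(5)

if result['JobStatus'] == 'FAILED':
    raise RuntimeError(result.get('StatusMessage', 'Textract job failed'))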
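A polling loop with no upper bound can hang forever if a job stalls. One way to cap it is sketched below. `wait_for_job` is a hypothetical helper, not a boto3 function; it takes the status lookup as a callable so it can be exercised without AWS, and the 5-second interval and 10-minute timeout are placeholder values.

```python
import time


def wait_for_job(get_status, poll_seconds=5.0, timeout_seconds=600.0,
                 sleep=time.sleep):
    """Poll get_status() until the job leaves IN_PROGRESS or the timeout passes."""
    waited = 0.0
    while True:
        status = get_status()
        if status != "IN_PROGRESS":
            return status  # SUCCEEDED, FAILED, or PARTIAL_SUCCESS
        if waited >= timeout_seconds:
            raise TimeoutError(f"job still running after {waited:.0f} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Demonstration with canned statuses and a no-op sleep, so it runs instantly.
canned = iter(["IN_PROGRESS", "IN_PROGRESS", "SUCCEEDED"])
final_status = wait_for_job(lambda: next(canned), sleep=lambda seconds: None)
print(final_status)

# Against the real service, the callable would wrap the API call, for example:
# wait_for_job(lambda: textract.get_document_analysis(JobId=job_id)["JobStatus"])
```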

Production Tip: Use Amazon SNS notifications to avoid polling. Configure an SNS topic to receive job completion messages.
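The notification is requested through the `NotificationChannel` parameter when the job is started. Here is a sketch of assembling that request; the ARNs, bucket, and file name are placeholders, and `build_async_request` is an illustrative helper. The IAM role must allow Textract to publish to the topic.

```python
def build_async_request(bucket, key, topic_arn, role_arn,
                        features=("FORMS", "TABLES")):
    """Assemble keyword arguments for start_document_analysis with SNS enabled."""
    return {
        "DocumentLocation": {"S3Object": {"Bucket": bucket, "Name": key}},
        "FeatureTypes": list(features),
        "NotificationChannel": {
            "SNSTopicArn": topic_arn,  # topic that receives the completion message
            "RoleArn": role_arn,       # role Textract assumes to publish to it
        },
    }


# Placeholder values for illustration only.
request = build_async_request(
    bucket="your-bucket",
    key="big-file.pdf",
    topic_arn="arn:aws:sns:us-east-1:123456789012:textract-jobs",
    role_arn="arn:aws:iam::123456789012:role/TextractPublishRole",
)
print(request["NotificationChannel"])

# With a real client: response = textract.start_document_analysis(**request)
```

A subscriber (an SQS queue or Lambda function, for instance) can then read the job ID and status from the message and fetch the results only once they are ready.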


Error Handling Examples

Wrap API calls in try/except blocks and log the error:

import boto3
import logging

textract = boto3.client('textract')

try:
    response = textract.detect_document_text(
        Document={'S3Object': {'Bucket': 'bucket', 'Name': 'file.jpg'}}
    )
except textract.exceptions.InvalidS3ObjectException as e:
    logging.error("Invalid S3 object: %s", e)
except Exception as e:
    logging.error("Unexpected error: %s", e)

Document Preparation Tips

To improve accuracy, follow these guidelines:

  • Resolution: Use scans of at least 300 DPI
  • Alignment: Avoid skewed or rotated images
  • Clarity: Remove smudges, stains, or noise
  • Contrast: Use high contrast between text and background
  • Cropping: Remove irrelevant margins or borders
  • Rotation: Use tools like Pillow to auto-rotate based on EXIF

Example with Pillow:

from PIL import Image, ImageOps

img = Image.open('scan.jpg')
img = ImageOps.exif_transpose(img)  # Auto-orient
img = ImageOps.grayscale(img)       # Convert to grayscale
img.save('processed.jpg')

Pagination Handling for Large Responses

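Results from an asynchronous job are returned in chunks (up to 1,000 blocks per call), not one chunk per document page. Each response includes a NextToken while more results remain, so keep calling until it is absent: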

pages = []
next_token = None

while True:
    args = {'JobId': job_id}
    if next_token:
        args['NextToken'] = next_token

    result = textract.get_document_analysis(**args)
    pages.append(result)

    next_token = result.get('NextToken')
    if not next_token:
        break

Each response page includes a subset of Blocks. Combine them to get the full document analysis.
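Combining them can be as simple as concatenating the `Blocks` lists. The sketch below builds one dict shaped like a single response; `combine_pages` is an illustrative helper and the `pages` list is canned data standing in for what the loop above collects.

```python
def combine_pages(pages):
    """Merge paginated get_document_analysis results into one response-like dict."""
    if not pages:
        raise ValueError("no result pages to combine")
    blocks = []
    for page in pages:
        blocks.extend(page.get("Blocks", []))
    return {
        # Document-level metadata is repeated on every chunk; take the first copy.
        "DocumentMetadata": pages[0].get("DocumentMetadata", {}),
        "JobStatus": pages[-1].get("JobStatus"),
        "Blocks": blocks,
    }


# Canned chunks standing in for real API results.
pages = [
    {"DocumentMetadata": {"Pages": 2}, "JobStatus": "SUCCEEDED",
     "Blocks": [{"Id": "b1", "BlockType": "PAGE"}], "NextToken": "abc"},
    {"DocumentMetadata": {"Pages": 2}, "JobStatus": "SUCCEEDED",
     "Blocks": [{"Id": "b2", "BlockType": "LINE", "Text": "Hello"}]},
]

full_response = combine_pages(pages)
print(len(full_response["Blocks"]), "blocks across", len(pages), "chunks")
```

Relationships reference blocks by ID across chunk boundaries, so resolve them only after everything has been combined.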


Final Thoughts

Amazon Textract is a shift from reading text to understanding documents. Combined with the TRP library, it gives you:

  • A consistent interface to extract meaning
  • Structured access to key fields and layout
  • Scalable automation for forms, tables, and reports

This is more than OCR—it’s the beginning of true document intelligence.

So next time you’re handed a pile of PDFs, ask yourself: What if your software could actually understand them?


About the Author

Rick Hightower is a seasoned software developer and technical writer with extensive experience in cloud computing and document processing technologies. With a background in enterprise software development and AWS services, Rick specializes in creating developer-focused content that bridges the gap between complex technical concepts and practical implementation.

Currently working as a solutions architect, Rick is passionate about helping developers use cloud-native tools effectively. When not writing about technology, he can be found experimenting with new programming languages and contributing to open-source projects.

                                                                           