Document Intelligence with Amazon Textract From OC

April 28, 2025

Turning Paper into Power—The Promise of Document Intelligence

Every business drowns in paperwork. This includes banks, hospitals, and law firms. Invoices, contracts, tax forms, and receipts are full of critical information. Finding what you need can feel like searching for a needle in a haystack. Manual data entry is tedious and expensive. It often leads to errors.

Even traditional scanning and basic OCR (Optical Character Recognition) usually produce a wall of unstructured text, missing the relationships and meaning hidden in the document’s layout.

ChatGPT Image Apr 29, 2025, 12_10_47 AM.png

Enterdocument intelligence. This is more than just reading text. It’s about understanding documents as a human would. Imagine instantly turning a stack of scanned contracts into a searchable database. It would contain client names, dates, and key clauses. Or extracting line items from thousands of invoices for real-time financial analysis. This is the promise of Amazon Textract. It’s a fully managed AWS service that goes beyond OCR to deliver structured, actionable data from your documents.

Amazon Textract can extract not just plain text. It also extracts complex elements like tables, forms, key-value pairs, and specialized data from receipts, IDs, and lending documents. Think of it as a digital assistant that reads documents at scale. It never tires and never makes a typo. Automating document processing frees up your team for higher-value work. It also opens the door to advanced AI applications.

The impact of document intelligence spans industries:

Healthcare: Pull patient info and diagnoses from medical forms for faster billing and care coordination.
Finance: Analyze receipts and contracts at scale for compliance and fraud detection.
Legal: Index and search legal documents by clause, party, or date—making due diligence much faster.
Retail: Automate expense and inventory tracking using scanned receipts and shipping manifests.

Quick Start Example: Detecting Document Text with Textract

Here’s a simple example: extracting every line of text from a PDF stored in Amazon S3 using Textract’s**DetectDocumentText**API.

Extracting Text Lines from a PDF with Textract

import boto3


# Always specify the latest supported region
textract = boto3.client("textract",
                        region_name="us-east-1")

bucket = "my-bucket"
document = "sample.pdf"


# Call Textract's DetectDocumentText API
response = textract.detect_document_text(
    Document={"S3Object": {
        "Bucket": bucket,
        "Name": document}
    }
)


# Print each detected line of text
for block in response["Blocks"]:
    if block["BlockType"] == "LINE":
        print(block["Text"])


# For large files, use StartDocumentTextDetection

What’s Happening

You specify your PDF’s location in S3.
Textract reads and analyzes the file, returning a structured response.
Each "LINE" block represents a detected line of text, which you can print, store, or process further.

This simple workflow is the foundation for more advanced pipelines—like extracting tables, forms, or specialized fields from complex documents.

Best Practices (2025)

Check for API deprecations: Monitor AWS release notes and enable deprecation warnings in your CI/CD pipelines.
Use asynchronous APIs for large files: For documents exceeding 5 MB, prefer StartDocumentTextDetection or StartDocumentAnalysis.
Secure your workflows: Always use IAM least-privilege roles, encrypt sensitive data, and consider VPC endpoints for private connectivity.
Automate compliance monitoring: Integrate with AWS Audit Manager or similar tools to provide ongoing compliance for sensitive workloads.

Textract Service Overview and Core Workflows

Amazon Textract transforms unstructured documents into structured, actionable data. Think of Textract as a digital mailroom with two service lanes: one for quick, interactive jobs and another for high-volume, batch processing.

Synchronous or Asynchronous?

Textract offers two main ways to process documents:

Synchronous: For small, fast jobs (like a single receipt or ID card). You send the document and get results in seconds. Limited to single-page PDFs or images and a 5MB file size cap.
Asynchronous: For large or multi-page documents (such as a 200-page contract). You submit the job, Textract works in the background, and you get notified when it’s done. Supports up to 1,000 pages per document.Key differences:
Synchronous jobs: single-page limit, strict 5MB size cap, immediate response.
Asynchronous jobs: multi-page support, larger file support, background processing, error notifications, and scalable parallelism.

Textract’s Core APIs

DetectDocumentText - Fast, synchronous OCR for plain text extraction
- Use when you only need raw text—no structured data required

Basic Synchronous OCR with DetectDocumentText

import boto3


# Create a Textract client
textract = boto3.client('textract')

try:
    # Extract text from a document in S3
    response = textract.detect_document_text(
        Document={'S3Object': {
            'Bucket': 'my-bucket',
            'Name': 'sample.pdf'
        }}
    )

    # Print each detected line of text
    for block in response['Blocks']:
        if block['BlockType'] == 'LINE':
            print(block['Text'])
except Exception as e:
    print(f"Textract error: {e}")

AnalyzeDocument - Extracts structured data: key-value pairs, tables, layout elements and custom queries
- Supports multiple FeatureTypes: FORMS, TABLES, LAYOUT, and QUERIES
- Returns a relationship graph linking detected elements

Advanced Extraction with AnalyzeDocument

import boto3

textract = boto3.client("textract")

response = textract.analyze_document(
    Document={"S3Object": {
        "Bucket": "my-bucket",
        "Name": "sample.pdf"
    }},
    FeatureTypes=["LAYOUT", "QUERIES"],
    QueriesConfig={
        "Queries": [
            {
                "Text": "What is the invoice number?",
                "Alias": "InvoiceNumber"
            }
        ]
    }
)

for block in response["Blocks"]:
    if block["BlockType"] == "QUERY_RESULT":
        print(f"{block['Query']['Alias']}: "
              f"{block['Text']}")

StartDocumentAnalysis - Asynchronous version of AnalyzeDocument for large or complex documents
- Supports all advanced features but processes in the background
- You submit a job and get notified via SNS when complete

Specialized APIs:

AnalyzeExpense (for invoices/receipts)
AnalyzeID (for IDs)
StartLendingAnalysis (for lending/financial documents)

Advanced Extraction Capabilities

Most business documents—contracts, invoices, onboarding forms—are more than just text. Their real value lies in structure: fields, tables, semantic sections, and the relationships between them.

Forms and Key-Value Pair Extraction

Forms are everywhere in business: HR onboarding, insurance claims, loan applications, and more. Each form field—like ‘Name’ or ‘Account Number’—acts as a key, paired with a value filled in by a person.

When Textract analyzes a form, it breaks the content into ‘Blocks.’ Each Block represents a detected element—such as a word, a key, a value, or their relationships. For forms, Textract uses KEY_VALUE_SET blocks to represent both keys and values, It also links them together.

Extracting Key-Value Pairs from a Form


# Analyze a form for key-value pairs
import boto3

textract = boto3.client('textract')

response = textract.analyze_document(
    Document={'S3Object': {
        'Bucket': 'my-bucket',
        'Name': 'form.pdf'
    }},
    FeatureTypes=['FORMS']
)


# Build a map of block IDs to blocks
block_map = {block['Id']: block
             for block in response['Blocks']}


# Traverse blocks to find key-value pairs
for block in response['Blocks']:
    if (block['BlockType'] == 'KEY_VALUE_SET' and
        'KEY' in block.get('EntityTypes', [])):

        key_text = ''
        value_text = ''

        # Extract key text
        for rel in block.get('Relationships', []):
            if rel['Type'] == 'CHILD':
                for cid in rel['Ids']:
                    child = block_map[cid]
                    if (child['BlockType'] == 'WORD' or
                        child['BlockType'] == 'SELECTION_ELEMENT'):
                        key_text += child.get('Text', '') + ' '

        # Find the value block
        for rel in block.get('Relationships', []):
            if rel['Type'] == 'VALUE':
                for vid in rel['Ids']:
                    value_block = block_map[vid]
                    # Extract value text
                    for vrel in value_block.get('Relationships', []):
                        if vrel['Type'] == 'CHILD':
                            for vcid in vrel['Ids']:
                                vchild = block_map[vcid]
                                if (vchild['BlockType'] == 'WORD' or
                                    vchild['BlockType'] == 'SELECTION_ELEMENT'):
                                    value_text += vchild.get('Text', '') + ' '

        print(f"Key: {key_text.strip()} -> "
              f"Value: {value_text.strip()} "
              f"(Confidence: {block['Confidence']:.1f}%)")

This code demonstrates how Textract processes form documents and extracts key-value pairs. Let’s break down what’s happening:

Block Structure: Textract represents document elements as “Blocks”, where each block can be text, a key-value pair, a table cell, etc.
KEY_VALUE_SET Blocks: These special blocks represent form fields. A single key-value pair consists of:
- A key block (like “Name:” or “Date:”)
- A value block (the actual information filled in)
- Relationship links connecting them

The code works in three main steps:

First, it creates a map of all blocks by their IDs for easy lookup
Then it finds blocks marked as KEY_VALUE_SET that are specifically keys (using EntityTypes)
For each key found, it:
- Extracts the key text by following CHILD relationships
- Finds the corresponding value using VALUE relationships
- Extracts the value text from its children
- Prints the complete key-value pair with confidence score

For example, if processing a form with “Name: John Doe”, the code would identify “Name” as the key and “John Doe” as the value, then extract and pair them together.

Flexible Field Extraction with Queries

The Queries feature allows you to extract specific information using natural-language questions, removing the need for brittle templates or custom parsing logic.

Extracting Fields Using Queries


# Using Textract Queries to extract specific fields
import boto3

textract = boto3.client('textract')

response = textract.analyze_document(
    Document={'S3Object': {
        'Bucket': 'my-bucket',
        'Name': 'invoice.pdf'
    }},
    FeatureTypes=['QUERIES'],
    QueriesConfig={
        "Queries": [
            {"Text": "What is the invoice date?",
             "Alias": "InvoiceDate"},
            {"Text": "What is the total amount?",
             "Alias": "TotalAmount"}
        ]
    }
)

for block in response['Blocks']:
    if block['BlockType'] == 'QUERY_RESULT':
        alias = block['Query']['Alias']
        answer = block.get('Text', '')
        confidence = block.get('Confidence', 0)
        print(f"{alias}: {answer} "
              f"(Confidence: {confidence:.1f}%)")

This code demonstrates using Textract’s Queries feature to extract specific information from documents. The key aspects are:

It uses the analyze_document API with the QUERIES feature type
Natural language queries are defined with aliases for easy reference (e.g., “What is the invoice date?” with alias “InvoiceDate”)
The code processes the response by looking for QUERY_RESULT blocks
For each result, it extracts:
- The query alias (to identify which question was answered)
- The extracted answer text
- The confidence score for the extraction

This approach is particularly useful when you need to extract specific fields without relying on fixed templates or exact field positions in the document.

Textract’s analysis for fields like invoice numbers uses several sophisticated techniques working together:

Pattern Recognition: The system identifies common invoice number formats and locations (usually near the top of documents or in specific header sections)
Contextual Understanding: It looks for surrounding text like “Invoice #”, “Invoice Number”, or similar labels that typically indicate where invoice numbers are located
Natural Language Processing: When using queries like “What is the invoice number?”, Textract analyzes the semantic meaning of document text to find the most relevant answer
Confidence Scoring: Each potential match is given a confidence score based on:
- How well it matches known invoice number patterns
- The presence of supporting contextual clues
- The clarity and quality of the text in that region

Complex Tables and Enhanced Layout Analysis

Textract’s TABLES feature now detects not only tables, cells, rows, and columns—even in non-standard layouts—but also table titles, section headers, footers, summary cells, and table type classification.

Extracting Enhanced Table Structure


# Extract tables and enhanced metadata
import boto3

textract = boto3.client('textract')

response = textract.analyze_document(
    Document={'S3Object': {
        'Bucket': 'my-bucket',
        'Name': 'statement.pdf'
    }},
    FeatureTypes=['TABLES']
)

block_map = {block['Id']: block
             for block in response['Blocks']}

table_blocks = [b for b in response['Blocks']
                if b['BlockType'] == 'TABLE']

for table in table_blocks:
    table_type = table.get('TableType', 'STANDARD')
    title = table.get('Title', {}).get('Text', '')
    footer = table.get('Footer', {}).get('Text', '')

    print(f"Table Type: {table_type}, "
          f"Title: {title}, "
          f"Footer: {footer}")
    # Further process rows, cells, summary cells

Let’s break down this code that extracts table information:

Block Map Creation: First, it creates a lookup dictionary (block_map) of all blocks by their IDs for efficient retrieval
Table Block Filtering: The code identifies all table blocks in the document using list comprehension
Table Information Extraction: For each table found, it extracts:
- The table type (standard or other specialized types)
- The table’s title (if present)
- The table’s footer text (if present)
Output Formatting: Finally, it prints the extracted information in a formatted string showing the table type, title, and footer

The code demonstrates Textract’s enhanced table detection capabilities, which go beyond basic cell detection to understand the semantic structure of tables in documents.

Semantic Layout Extraction

Business documents often contain rich semantic structure: headers, footers, section titles, numbered lists, and figures. The LAYOUT feature extracts these elements, preserving the true structure of your documents.

Extracting Semantic Layout Elements


# Extract semantic layout elements
import boto3

textract = boto3.client('textract')

response = textract.analyze_document(
    Document={'S3Object': {
        'Bucket': 'my-bucket',
        'Name': 'policy.pdf'
    }},
    FeatureTypes=['LAYOUT']
)

for block in response['Blocks']:
    if block['BlockType'].startswith('LAYOUT_'):
        print(f"{block['BlockType']}: "
              f"{block.get('Text', '')}")

This code demonstrates how to extract semantic layout elements from documents using Amazon Textract’s LAYOUT feature. Here’s how it works:

First, it initializes a Textract client using boto3
It calls the analyze_document API with the LAYOUT feature type enabled
The document is specified using an S3 bucket and object key
The code then iterates through all blocks in the response
It filters for blocks that have a BlockType starting with ‘LAYOUT_’
For each matching block, it prints:
- The block type (e.g., LAYOUT_HEADER, LAYOUT_FOOTER, etc.)
- The text content of that layout element

This helps preserve the semantic structure of documents by identifying different layout components like headers, footers, titles, and sections in their proper context.

Specialized APIs: Expense, Lending, and ID Docs

Some documents require extra intelligence. Textract provides specialized APIs for these cases:

AnalyzeExpense: Extracts structured fields, line items, currencies and taxes from invoices and receipts
StartLendingAnalysis: Pulls out borrower information, loan terms, and interest rates from loan documents
AnalyzeID: Focuses on government-issued IDs, extracting name, date of birth, document number, etc.

Processing an Invoice with AnalyzeExpense


# Call AnalyzeExpense for an invoice
import boto3

textract = boto3.client('textract')

response = textract.analyze_expense(
    Document={'S3Object': {
        'Bucket': 'my-bucket',
        'Name': 'invoice.pdf'
    }}
)


# Print summary fields
for expense_doc in response.get('ExpenseDocuments', []):
    for field in expense_doc.get('SummaryFields', []):
        print(f"{field['Type']['Text']}: "
              f"{field.get('ValueDetection', {}).get('Text', '')}")


# To extract line items
for expense_doc in response.get('ExpenseDocuments', []):
    for group in expense_doc.get('LineItemGroups', []):
        for item in group.get('LineItems', []):
            item_data = {
                field['Type']['Text']:
                field.get('ValueDetection', {}).get('Text', '')
                for field in item.get('LineItemExpenseFields', [])
            }
            print("Line Item:", item_data)

Let’s break down this code that processes line items from invoices:

Initialize Textract Client: The code starts by creating a Textract client using boto3
Call AnalyzeExpense API: It calls the specialized AnalyzeExpense API, pointing to a document stored in S3
Process Summary Fields: The first loop extracts summary-level information from the invoice (like totals, dates, vendor info)
Extract Line Items: The nested loops then process the detailed line items:
- It iterates through each expense document
- Then through each line item group
- Finally through individual line items
Create Item Dictionary: For each line item, it creates a dictionary mapping field types to their detected values
Print Results: Finally, it outputs the structured line item data for further processing

This structure allows for detailed extraction of both summary-level invoice data and individual line items, making it ideal for expense processing and accounting automation.

Performance, Security, and Compliance Best Practices

Running Amazon Textract at scale needs to balance speed, security, and compliance. This section covers efficient processing, data security, and compliance best practices.

Batching, Parallelism, and Modern Extraction Techniques

Break large jobs into small chunks—such as individual pages or logical sections—and process them in parallel. Use AWS Step Functions, Lambda, and EventBridge to orchestrate large-scale, resilient pipelines.

Parallel Textract Job Submission with Step Functions

import boto3
import time
from botocore.exceptions import ClientError

textract = boto3.client('textract')
document_keys = ['docs/page1.pdf',
                 'docs/page2.pdf',
                 'docs/page3.pdf']


# Example custom queries for targeted extraction
queries = [
    {"Text": "What is the invoice number?",
     "Alias": "InvoiceNumber"},
    {"Text": "What is the total amount?",
     "Alias": "TotalAmount"}
]

job_ids = []
for key in document_keys:
    try:
        response = textract.start_document_analysis(
            DocumentLocation={'S3Object': {
                'Bucket': 'my-bucket',
                'Name': key
            }},
            FeatureTypes=['TABLES', 'FORMS', 'QUERIES'],
            QueriesConfig={"Queries": queries}
        )
        job_ids.append(response['JobId'])
        print(f"Started Textract job for {key}: "
              f"{response['JobId']}")
        time.sleep(0.5)  # Avoid API throttling
    except ClientError as e:
        print(f"Error starting Textract job: {e}")

# In production, use Step Functions Map states

Security: Encryption, VPC Endpoints, and KMS Best Practices

When processing sensitive documents, security is non-negotiable. Encrypt all documents at rest with customer-managed KMS keys and use VPC interface endpoints to keep traffic private.

Uploading Documents with KMS Encryption


# Upload with customer-managed KMS key
import boto3
from botocore.exceptions import ClientError

def upload_document_with_kms(s3_client, file_path,
                            bucket, key, kms_key_id):
    try:
        with open(file_path, 'rb') as f:
            s3_client.put_object(
                Bucket=bucket,
                Key=key,
                Body=f,
                ServerSideEncryption='aws:kms',
                SSEKMSKeyId=kms_key_id
            )
        print(f"Uploaded {key} with KMS encryption.")
    except ClientError as e:
        print(f"Error uploading {key}: {e}")

# Usage:

# upload_document_with_kms(boto3.client('s3'),

# 'form.pdf',

# 'my-bucket',

# 'secure/form.pdf',

# 'arn:aws:kms:...:key/1234abcd')

IAM Roles, Auditing, and PII Handling

Think of AWS IAM as the badge system for your digital mailroom. Apply the principle of least privilege and create dedicated IAM roles for Lambda, Step Functions, or containers.

Modern IAM Policy for Textract Processing Role

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "textract:StartDocumentAnalysis",
        "textract:GetDocumentAnalysis"
      ],
      "Resource": "arn:aws:textract:us-east-1:123456789012:document/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject"
      ],
      "Resource": "arn:aws:s3:::my-bucket/secure/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "kms:Decrypt"
      ],
      "Resource": "arn:aws:kms:us-east-1:123456789012:key/1234abcd"
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:us-east-1:123456789012:log-group:/aws/lambda/*"
    }
  ]
}

For personally identifiable information (PII), use Amazon Comprehend for detection and build workflows to redact or mask sensitive fields. Enable AWS CloudTrail logging for a full audit trail.

Post-Processing, Data Modeling, and Integration Patterns

Extracting data with Amazon Textract is just the foundation. To make this information usable for analytics, AI, or search, you must parse, clean, and model it for downstream systems.

Parsing and Cleaning Textract Output

The open-source Textractor library helps convert Textract JSON into familiar Python data structures such as Pandas DataFrames, making it easier to analyze tabular data and extract fields.

Parsing Textract Output with Textractor

from textractor import Textractor


# Initialize Textractor
extractor = Textractor()


# Analyze a document in S3
result = extractor.analyze_document(
    file_source='s3://my-bucket/form.pdf',
    features=['FORMS', 'TABLES', 'LAYOUT']
)


# Convert the first table to a Pandas DataFrame
table_df = result.tables[0].to_pandas()
print(table_df)


# Extract key-value pairs as a dictionary
fields = result.form.fields_as_dict()
print(fields)

Once your data is structured, use AWS Glue Crawlers to automatically scan your S3 buckets, infer schema, and create tables in the AWS Glue Data Catalog. For additional cleaning, AWS Glue DataBrew offers a visual interface to standardize dates, currencies, and text fields.

Converting Outputs for Analytics Platforms

After parsing and cleaning, export your structured data into analytics-friendly formats. While CSV is universally supported, Parquet is preferred for large-scale analytics due to its columnar storage and compression.

Exporting Table Data to Parquet and CSV

import pandas as pd
import boto3


# Assume 'table_df' is from Textractor
csv_path = '/tmp/extracted_table.csv'
parquet_path = '/tmp/extracted_table.parquet'

table_df.to_csv(csv_path, index=False)
table_df.to_parquet(parquet_path,
                   index=False,
                   engine='pyarrow')


# Upload files to S3
s3 = boto3.client('s3')
s3.upload_file(csv_path,
              'my-analytics-bucket',
              'tables/extracted_table.csv')
s3.upload_file(parquet_path,
              'my-analytics-bucket',
              'tables/extracted_table.parquet')

Integration Patterns

With clean, structured data, you can unlock advanced integration patterns:

Retrieval-Augmented Generation (RAG): Feed extracted text and metadata into a vector store like Amazon OpenSearch for semantic search and generative AI applications.
Event-Driven, Serverless Automation: Automate document workflows using AWS EventBridge and Lambda, triggering processes when new files land in S3.
Human-in-the-Loop Review (Amazon A2I): Route low-confidence results to Amazon Augmented AI for human review, providing data quality while keeping most processing automated.

Routing Low-Confidence Results to A2I

import boto3
import json

a2i_client = boto3.client('sagemaker-a2i-runtime')


# Flag low-confidence fields for review
for field in extracted_fields:
    if field['confidence'] < 0.85:
        response = a2i_client.start_human_loop(
            HumanLoopName='doc-review-loop',
            FlowDefinitionArn='arn:aws:sagemaker:us-east-1:123456789012:flow-definition/doc-review',
            HumanLoopInput={
                'InputContent': json.dumps(field)
            }
        )

Operational Metrics, Monitoring, and Reference Architectures

Every document pipeline needs robust monitoring and safeguards. Focus on three essential metrics:

Throughput (Pages per Minute): How fast you process documents.
Error Rate: Percentage of failed extractions.
Cost per Page: What you pay to process each page.

Use Amazon CloudWatch for dashboards, alarms, and anomaly detection to monitor these metrics in real time.

Creating a CloudWatch Alarm for Textract Error Rate

aws cloudwatch put-anomaly-detector \
  --metric-name TextractDocumentPagesFailed \
  --namespace AWS/Textract \
  --statistic Sum \
  --period 300

aws cloudwatch put-metric-alarm \
  --alarm-name TextractErrorRateAnomalyAlarm \
  --metric-name TextractDocumentPagesFailed \
  --namespace AWS/Textract \
  --statistic Sum \
  --period 300 \
  --evaluation-periods 1 \
  --comparison-operator GreaterThanUpperThreshold \
  --threshold-metric-id anomalyDetector \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:MyTopic

CDK v2 Stack for Textract Processing

from aws_cdk import App, Stack, CfnOutput
from aws_cdk import aws_lambda as _lambda
from aws_cdk import aws_s3 as s3

class TextractStack(Stack):
    def __init__(self, scope: App, construct_id: str,**kwargs) -> None:
        super().__init__(scope, construct_id,**kwargs)

        # Define S3 bucket for storing documents
        bucket = s3.Bucket(self, "TextractBucket")

        # Define Lambda function (Python 3.11)
        lambda_function = _lambda.Function(
            self, "TextractFunction",
            runtime=_lambda.Runtime.PYTHON_3_11,
            handler="handler.lambda_handler",
            code=_lambda.Code.from_asset("lambda")
        )

        # Grant Lambda read access to the bucket
        bucket.grant_read(lambda_function)

        # Output the bucket name
        CfnOutput(self, "BucketName",
                 value=bucket.bucket_name)

app = App()
TextractStack(app, "TextractStack")
app.synth()

Sample Lambda Handler for Textract with Powertools

import boto3
from aws_lambda_powertools import Logger, Tracer, Metrics

logger = Logger()
tracer = Tracer()
metrics = Metrics(namespace="TextractPipeline")

textract = boto3.client('textract')

@tracer.capture_lambda_handler
@metrics.log_metrics
def lambda_handler(event, context):
    s3_info = event['Records'][0]['s3']
    bucket = s3_info['bucket']['name']
    key = s3_info['object']['key']

    logger.info(f"Received S3 event for {bucket}/{key}")

    try:
        response = textract.start_document_analysis(
            DocumentLocation={'S3Object': {
                'Bucket': bucket,
                'Name': key
            }},
            FeatureTypes=['TABLES', 'FORMS']
        )
        job_id = response['JobId']

        logger.info(f"Started analysis for {key} "
                   f"with JobId: {job_id}")

        metrics.add_metric(name="DocumentsProcessed",
                          unit="Count",
                          value=1)

        return {
            'statusCode': 200,
            'body': f'Started analysis for {key} '
                    f'with JobId: {job_id}'
        }
    except Exception as e:
        logger.error(f"Error processing {key}: {e}")
        metrics.add_metric(name="DocumentsFailed",
                          unit="Count",
                          value=1)
        raise

Summary and Key Ideas

Amazon Textract bridges the gap between paper and digital workflows, enabling automation at scale with high accuracy and robust compliance. Here are the key takeaways:

Textract converts unstructured documents into actionable data using state-of-the-art AWS APIs.
Advanced extraction features enable automation of complex, real-world workflows.
Security, performance, and integration best practices are essential for production.
Deep AWS integration unlocks analytics, Retrieval-Augmented Generation (RAG), and agentic automation.

Author Bio

Rick Hightower is a technology leader and software architect with deep expertise in cloud computing, artificial intelligence, and enterprise software development. His focus areas include AWS services, document intelligence systems, and machine learning integration patterns.

With years of hands-on experience, Rick specializes in developing scalable solutions and sharing knowledge through technical writing and education. His articles cover a wide range of topics from cloud architecture to AI implementation, helping other developers navigate complex technological landscapes.

Rick regularly writes about emerging technologies, best practices, and practical implementations of cloud-native solutions, with a particular emphasis on document processing, machine learning, and enterprise architecture.

comments powered by Disqus

Apache Spark Training
Kafka Tutorial
Akka Consulting
Cassandra Training
AWS Cassandra Database Support
Kafka Support Pricing
Cassandra Database Support Pricing
Non-stop Cassandra
Watchdog
Advantages of using Cloudurable™
Cassandra Consulting
Cloudurable™| Guide to AWS Cassandra Deploy
Cloudurable™| AWS Cassandra Guidelines and Notes
Free guide to deploying Cassandra on AWS
Kafka Training
Kafka Consulting
DynamoDB Training
DynamoDB Consulting
Kinesis Training
Kinesis Consulting
Kafka Tutorial PDF
Kubernetes Security Training
Redis Consulting
Redis Training
ElasticSearch / ELK Consulting
ElasticSearch Training
InfluxDB/TICK Training TICK Consulting