When your doc repository incorporates tons of of hundreds of thousands of recordsdata collected over almost a decade, how do you systematically discover and redact delicate buyer information with out taking years to finish? This was the problem dealing with The Huntington Nationwide Financial institution (Huntington), a high 10 financial institution in america.
Redacting delicate info at scale
Since 2015, Huntington’s doc administration system has securely saved tons of of hundreds of thousands of paperwork on-premises. In 2025, as a part of a proactive compliance initiative, Huntington got down to course of the paperwork on this system and redact delicate information. These paperwork come in several codecs, so the answer wanted flexibility to deal with different file sorts whereas delivering the throughput required to course of hundreds of thousands of paperwork rapidly.
Unique estimates indicated this effort would take years. Nonetheless, by designing a scalable redaction workflow utilizing Amazon Textract, Amazon SageMaker, AWS Step Capabilities, and AWS Lambda, Huntington lowered this timeline to months.
Answer overview
Earlier than analyzing the technical implementation, let’s have a look at the core necessities Huntington established for this venture. If you happen to’re dealing with the same large-scale doc processing problem, these necessities can function a place to begin in your personal answer design:
- Knowledge should be encrypted at relaxation and in transit.
- Places the place information is saved or accessed should meet strict entry necessities.
- Providers used should be in-scope for PCI DSS compliance.
- Outputs should be replicated again to on-premises information shops.
- Redaction accuracy should meet or exceed 95% to fulfill compliance necessities.
The next diagram illustrates the high-level answer structure.
Shifting information securely, with confidence
Huntington’s first goal was to maneuver paperwork from an on-premises file share to an Amazon Easy Storage Service (Amazon S3) bucket. Shifting paperwork is easy, however this effort required transferring over 400 million paperwork, encrypted in transit and at relaxation. To perform this, Huntington used AWS DataSync, AWS Direct Join, Amazon S3, and AWS Key Administration Service (AWS KMS).
AWS DataSync could be deployed as an agent in your on-premises information heart to watch a configured supply, reminiscent of an SMB file share. Whereas getting paperwork to AWS was crucial for processing, AWS DataSync additionally helps syncing information again to on-premises, which was one other key requirement for this venture.

Amazon Textract is an AWS machine studying service that extracts textual content, tables, and varieties from scanned paperwork. Monetary establishments use it to robotically course of paperwork like account statements or mortgage functions, then determine delicate information reminiscent of Social Safety numbers, account numbers, and private addresses. The next pattern bill demonstrates this functionality.


Amazon Textract detects numerous fields from a doc and gives coordinates of detected fields and different metadata inside a JSON output.
Huntington used Amazon Textract in an orchestrated course of with AWS Step Capabilities. This method lowered handbook assessment time whereas enhancing accuracy in detecting delicate info throughout massive doc volumes.
Scaling detection throughput
Automated pipelines for doc processing are helpful, however processing paperwork sequentially would have prolonged the venture timeline to years. To fulfill their aim, Huntington wanted to course of hundreds of thousands of paperwork every day.
Scaling to this degree required addressing two predominant concerns: maximizing concurrent Amazon Textract jobs inside service quotas, and controlling request charges to keep away from throttling.
AWS companies have quotas that may be adjusted by comfortable and laborious limits. The Amazon Textract jobs-per-second quota could be elevated by submitting a request by the AWS Service Quotas console.
To maximise throughput in opposition to the service quota, Huntington used the AWS Step Capabilities built-in map state, which processes collections of inputs in JSON, CSV, or different codecs. The group organized paperwork in Amazon S3 right into a JSON assortment and ran the map state in distributed mode for greater concurrency. To trace pipeline progress, they used AWS Step Capabilities map run execution summaries alongside Amazon CloudWatch dashboards to watch response instances, throttle counts, successes, and error charges.
To handle potential throttling, Huntington monitored their CloudWatch dashboard to confirm Amazon Textract profitable request counts and throttled counts. As wanted, they adjusted concurrency limits for baby workflow executions to substantiate they remained underneath the Amazon Textract service quota whereas sustaining excessive throughput. When jobs accomplished efficiently, detected fields and metadata had been written to a bucket for later assessment. The next diagram depicts this method:

The wait block throughout the step operate verified the method was able to proceed with writing job metadata and persevering with with the following Amazon Textract invocation. When there are not any failures, the state machine finishes with a cross state. When failures happen, AWS Step Capabilities writes to a log for human assessment and reprocessing.
Redacting detected delicate info
Up so far, the method centered on detecting delicate information and cataloging it inside metadata recordsdata written to Amazon S3. The ultimate steps are to redact the paperwork and transmit them again to on-premises storage.
Picture and PDF redaction is supported by a number of open-source and proprietary instruments. Widespread open-source Python libraries embody PyMuPDF or picture drawing libraries like PIL. The next determine reveals a pattern redaction of the bill proven earlier. Amazon Textract helps detection of assorted fields, and you can even create customized classifications utilizing regex patterns. Mixed with redaction software program, you’ll be able to confidently redact detected fields. If you wish to create a threshold for human intervention, Amazon Textract gives confidence scores that may set off validation workflows.

As soon as once more, Huntington confronted the identical architectural problem: how would this scale? AWS Step Capabilities supplied the answer for processing hundreds of thousands of paperwork whereas providing hooks for error dealing with and retry logic. Because the doc processing pipeline cataloged objects requiring redaction, Huntington ran a easy circulate in opposition to them:

To confirm accuracy and thoroughness, Huntington double-checked that detected fields matched anticipated patterns previous to redaction, adopted by a metadata replace for every file. Redacted recordsdata had been positioned in an Amazon S3 location monitored by AWS DataSync for transmission again to on-premises file storage.
Conclusion
Utilizing AWS, Huntington processed paperwork at a fee of roughly 10 million per day, lowering estimated processing time from years to only a few months. The price of processing the whole doc repository was roughly 5% of the unique estimate. Redaction accuracy exceeded 95%, assembly compliance necessities and supporting information safety aims.
This venture demonstrates how AWS companies can assist large-scale information processing and compliance initiatives. Huntington plans to proceed utilizing this framework for high-volume redaction wants reminiscent of mergers and acquisitions.
To be taught extra concerning the companies used on this answer, go to the Amazon Textract element web page or discover the AWS Step Capabilities documentation.
Acknowledgements
Particular due to the next people and groups for his or her contributions: Xuelei Yuan, Robert Carnell, Jeanne Keith, Debbie Montgomery, Invoice Gross, Jodi Pettiford, Jon Glazer, Marshall Doss, Bob Wojasinski, Tami Wolf, Marijane Eldridge, Pradeep Kumar Tata, Michael Burkhardt, Nirmal Antony, Trevor Pease, Bryan Griffith, Angus Ferguson (AWS) Randy Patrick (AWS), Stephanie Brenneman (AWS), Artwork Steele, Kevin Owen.
In regards to the authors
