# Introduction
PDF files are used extensively in many workflows. You may need to merge reports, split large files, extract text or tables, add watermarks, or redact sensitive content. These are all routine tasks, but handling them manually across multiple files can be slow and error-prone. These five Python scripts automate the process. They run from the command line, support batch processing, and are easy to configure.
You can find all the scripts on GitHub.
# 1. Merging and Splitting PDF Files
// The Pain Point
Combining multiple PDF files into one, or splitting a large PDF into separate files by page range, are among the most common PDF tasks. Both are tedious to do manually, particularly when dealing with many files or large page counts.
// What the Script Does
Merges a folder of PDF files into a single output file in a configurable order, or splits a single PDF into separate files by fixed page ranges, every N pages, or a list of specific page numbers. Both operations are handled by the same script via a mode flag.
// How It Works
The script uses pypdf for all page-level operations. In merge mode, it reads all PDFs from an input folder, sorts them by filename (or by a custom order defined in a text file), and writes them sequentially into a single output PDF. In split mode, it accepts a page range list, a fixed chunk size, or a list of page numbers to split on. Each split segment is written to a numbered output file. Metadata from the first input file is preserved in merge mode.
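The merge and fixed-chunk split modes described above could be sketched roughly as follows. This is a minimal illustration, not the script from the repository: the function names (`chunk_ranges`, `merge_folder`, `split_every_n`) and the output naming scheme are assumptions, and only the filename-sorted merge and the "every N pages" split are shown.

```python
from pathlib import Path


def chunk_ranges(total_pages: int, chunk_size: int) -> list[tuple[int, int]]:
    """Return zero-based, half-open (start, stop) ranges of at most chunk_size pages."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, total_pages))
        for start in range(0, total_pages, chunk_size)
    ]


def merge_folder(input_dir: str, output_path: str) -> int:
    """Merge every PDF in input_dir, sorted by filename; return the file count."""
    from pypdf import PdfReader, PdfWriter  # third-party: pip install pypdf

    paths = sorted(Path(input_dir).glob("*.pdf"))
    writer = PdfWriter()
    for path in paths:
        writer.append(str(path))  # appends all pages of this file
    if paths:
        first_meta = PdfReader(str(paths[0])).metadata
        if first_meta:
            writer.add_metadata(dict(first_meta))  # keep the first file's metadata
    with open(output_path, "wb") as fh:
        writer.write(fh)
    return len(paths)


def split_every_n(input_path: str, output_dir: str, chunk_size: int) -> list[Path]:
    """Write consecutive chunks of chunk_size pages to numbered files."""
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(input_path)
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    ranges = chunk_ranges(len(reader.pages), chunk_size)
    for index, (start, stop) in enumerate(ranges, start=1):
        writer = PdfWriter()
        for page_number in range(start, stop):
            writer.add_page(reader.pages[page_number])
        # Hypothetical naming scheme: <original name>_part001.pdf, _part002.pdf, ...
        target = out_dir / f"{Path(input_path).stem}_part{index:03d}.pdf"
        with open(target, "wb") as fh:
            writer.write(fh)
        written.append(target)
    return written
```

Keeping the range arithmetic in its own function makes it easy to check the chunk boundaries before any file is written.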
⏩ Get the PDF merge & split script
# 2. Extracting Text and Tables from PDFs
// The Pain Point
Getting usable data out of a PDF, whether it's text from a report or tabular data from a statement, has to happen before any further processing can take place. Copy-pasting from a PDF viewer is impractical for anything beyond a few pages, and the output is rarely clean.
// What the Script Does
Extracts text and tables from multiple PDF files and writes the results to structured output files. Text is written to plain text or Markdown files. Tables are written to CSV or Excel, with one sheet per table found. Supports both text-based PDFs and basic layout-preserving extraction.
// How It Works
The script uses pypdf for basic text extraction and pdfplumber for layout-aware extraction and table detection. For each input file, it works page by page, extracting text blocks and detecting table regions with pdfplumber’s table finder. Extracted tables are normalized (empty rows removed, headers detected) and written to separate output files. A summary report lists how many pages and tables were found in each file, and flags any pages where extraction produced no output.
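A page-by-page extraction loop along these lines might look like the sketch below. It uses pdfplumber only, writes tables to CSV, and leaves out header detection and Excel output. The function names, output filenames, and the shape of the returned summary are assumptions made for illustration.

```python
import csv
from pathlib import Path


def normalize_table(rows):
    """Strip whitespace, turn missing cells into '', and drop fully empty rows."""
    cleaned = []
    for row in rows:
        cells = [(cell or "").strip() for cell in row]
        if any(cells):
            cleaned.append(cells)
    return cleaned


def extract_pdf(pdf_path: str, output_dir: str) -> dict:
    """Write <stem>.txt and one CSV per table; return a small summary dict."""
    import pdfplumber  # third-party: pip install pdfplumber

    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(pdf_path).stem
    text_parts, table_count, empty_pages = [], 0, []

    with pdfplumber.open(pdf_path) as pdf:
        for page_number, page in enumerate(pdf.pages, start=1):
            text = page.extract_text() or ""
            tables = [normalize_table(t) for t in page.extract_tables()]
            tables = [t for t in tables if t]  # discard tables that ended up empty
            if not text.strip() and not tables:
                empty_pages.append(page_number)  # flag pages with no output
            text_parts.append(text)
            for table in tables:
                table_count += 1
                csv_path = out_dir / f"{stem}_table{table_count:03d}.csv"
                with open(csv_path, "w", newline="", encoding="utf-8") as fh:
                    csv.writer(fh).writerows(table)

    (out_dir / f"{stem}.txt").write_text("\n\n".join(text_parts), encoding="utf-8")
    return {"pages": len(text_parts), "tables": table_count, "empty_pages": empty_pages}
```

The returned dictionary holds the per-file numbers that a summary report would need: page count, table count, and the pages that produced nothing.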
⏩ Get the PDF text & table extractor script
# 3. Stamping, Watermarking, and Adding Page Numbers
// The Pain Point
Adding a watermark, a stamp, or page numbers to a batch of PDFs before distributing them is simple in principle but slow to do one file at a time through a graphical user interface (GUI). When the batch is large or the requirement is recurring, it needs automating.
// What the Script Does
Applies a text or image stamp to every page of multiple PDF files. Supports diagonal watermarks, header/footer text, page numbers, and image overlays. Position, font size, opacity, and color are all configurable. Processes entire folders in batch.
// How It Works
The script uses pypdf for page manipulation and reportlab to generate the stamp layer. For each input PDF, it creates a single-page stamp PDF in memory using reportlab. It renders text at the configured position, angle, font, and opacity, or places an image at specified coordinates. This stamp page is then merged onto every page of the source PDF using pypdf’s page merging. The result is written to a new output file, leaving the original unchanged. Page numbers are handled as a special case, generating a unique stamp per page.
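The stamp-and-merge approach can be sketched as below, assuming text stamps only. For simplicity this version builds a fresh stamp for every page, sized to that page, rather than reusing one stamp per file as described above. The helper names, the "Page n of total" template, and the default font, color, and positions are all illustrative assumptions.

```python
import io


def page_label(index: int, total: int, template: str = "Page {n} of {total}") -> str:
    """Build the footer text for a zero-based page index."""
    return template.format(n=index + 1, total=total)


def make_stamp(text, width, height, x, y, angle=0, font_size=12, opacity=1.0) -> bytes:
    """Render text centred on (x, y) into a one-page PDF held in memory."""
    from reportlab.pdfgen import canvas  # third-party: pip install reportlab

    buffer = io.BytesIO()
    c = canvas.Canvas(buffer, pagesize=(width, height))
    c.setFont("Helvetica", font_size)
    c.setFillColorRGB(0.5, 0.5, 0.5)
    c.setFillAlpha(opacity)
    c.translate(x, y)  # move the origin to the stamp position
    c.rotate(angle)    # rotate around that position
    c.drawCentredString(0, 0, text)
    c.save()
    return buffer.getvalue()


def stamp_pdf(input_path, output_path, watermark=None, number_pages=True):
    """Overlay an optional diagonal watermark and page numbers; write a new file."""
    from pypdf import PdfReader, PdfWriter  # third-party: pip install pypdf

    reader = PdfReader(input_path)
    writer = PdfWriter()
    total = len(reader.pages)
    for index, page in enumerate(reader.pages):
        width = float(page.mediabox.width)
        height = float(page.mediabox.height)
        stamps = []
        if watermark:
            stamps.append(make_stamp(watermark, width, height, width / 2, height / 2,
                                     angle=45, font_size=48, opacity=0.2))
        if number_pages:
            stamps.append(make_stamp(page_label(index, total), width, height,
                                     width / 2, 24))
        for stamp in stamps:
            page.merge_page(PdfReader(io.BytesIO(stamp)).pages[0])
        writer.add_page(page)
    with open(output_path, "wb") as fh:
        writer.write(fh)  # the input file is never modified
```

A call such as `stamp_pdf("report.pdf", "report_stamped.pdf", watermark="DRAFT")` would add both the watermark and footer page numbers; the file names here are placeholders.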
# 4. Redacting Sensitive Content
// The Pain Point
Before sharing a PDF externally, sensitive content such as names, reference numbers, financial figures, and addresses often needs removing. Manually drawing black boxes over text in a PDF editor works visually, but in some tools it doesn’t actually remove the underlying text, and it is impractical for more than a handful of pages.
// What the Script Does
Scans PDF pages for text matching patterns you define (regex patterns, exact strings, or predefined categories like email addresses and phone numbers) and permanently redacts matching content by replacing it with black rectangles. Outputs a new PDF with the underlying text removed, not just visually obscured.
// How It Works
The script uses pymupdf, which provides both text search with bounding box coordinates and the ability to draw redaction annotations that permanently remove the underlying content when applied. For each page, the script searches for all matches of each configured pattern, marks the bounding rectangles as redaction annotations, then applies them, which removes the text from the page content stream. A report is written listing every redaction made, including page number, matched text (before redaction), and the pattern that triggered it.
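The search, mark, and apply sequence might be sketched like this. The two regular expressions are deliberately simple placeholders rather than production-grade detectors, and the function names and report format are assumptions. Matching is done on the page's extracted text, after which each matched string is located with pymupdf's text search, so a match that wraps across lines may not be found.

```python
import re

# Illustrative patterns only; real-world email and phone formats vary widely.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\+?\d[\d ().-]{7,}\d"),
}


def find_matches(text, patterns=PATTERNS):
    """Return (pattern_name, matched_text) pairs for every regex hit in text."""
    return [
        (name, match.group(0))
        for name, regex in patterns.items()
        for match in regex.finditer(text)
    ]


def redact_pdf(input_path, output_path, patterns=PATTERNS):
    """Black out every match and return a list of report entries."""
    try:
        import pymupdf  # third-party: pip install pymupdf (newer releases)
    except ImportError:
        import fitz as pymupdf  # import name used by older PyMuPDF releases

    report = []
    doc = pymupdf.open(input_path)
    for page_number, page in enumerate(doc, start=1):
        for name, matched in find_matches(page.get_text(), patterns):
            # search_for returns the rectangles covering each occurrence
            for rect in page.search_for(matched):
                page.add_redact_annot(rect, fill=(0, 0, 0))
            report.append({"page": page_number, "pattern": name, "text": matched})
        page.apply_redactions()  # removes the text under the marked rectangles
    doc.save(output_path, garbage=4, deflate=True)
    doc.close()
    return report
```

Because the report contains the matched text before redaction, it should be stored as carefully as the original documents.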
⏩ Get the PDF redaction script
# 5. Extracting Metadata and Generating a PDF Inventory
// The Pain Point
When working with a large collection of PDF files, it’s often useful to know basic facts about each one: page count, file size, creation date, author, whether it’s encrypted, and whether it contains text or is a scanned image. Checking each file individually through a viewer isn’t practical at scale.
// What the Script Does
Scans a folder of PDF files and extracts metadata from each one, including page count, file size, creation and modification dates, author, producer, encryption status, and whether the document appears to contain searchable text or scanned images. Writes everything to a single CSV or Excel inventory file.
// How It Works
The script uses pypdf to read document metadata from the PDF info dictionary and pdfplumber to sample pages for text content. For each file, it attempts to open the PDF and read standard metadata fields. It samples the first few pages to determine whether the file contains extractable text or scanned image pages. Encrypted files that cannot be opened are flagged rather than skipped silently. The output inventory includes one row per file with all extracted fields, and a summary row at the bottom with totals and averages.
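A stripped-down version of the inventory pass could look like the following. To keep the sketch to one dependency, it samples page text with pypdf's own `extract_text` instead of pdfplumber, and it omits the summary row. The column names, the 20-character threshold, and the function names are assumptions.

```python
import csv
from pathlib import Path

FIELDS = ["file", "size_bytes", "pages", "author", "producer",
          "created", "encrypted", "content", "error"]


def classify_sample(page_texts, min_chars=20):
    """Guess 'text' or 'scanned' from the text extracted from sampled pages."""
    total = sum(len(text.strip()) for text in page_texts)
    return "text" if total >= min_chars else "scanned"


def inspect_pdf(path: Path, sample_pages: int = 3) -> dict:
    """Collect one inventory row; failures are recorded instead of raised."""
    from pypdf import PdfReader  # third-party: pip install pypdf

    row = dict.fromkeys(FIELDS, "")
    row["file"] = path.name
    row["size_bytes"] = path.stat().st_size
    try:
        reader = PdfReader(str(path))
        row["encrypted"] = reader.is_encrypted
        if reader.is_encrypted:
            row["error"] = "encrypted, not opened"  # flagged, not skipped
            return row
        row["pages"] = len(reader.pages)
        meta = reader.metadata
        if meta is not None:
            row["author"] = meta.author or ""
            row["producer"] = meta.producer or ""
            row["created"] = meta.creation_date_raw or ""
        count = min(sample_pages, len(reader.pages))
        texts = [reader.pages[i].extract_text() or "" for i in range(count)]
        row["content"] = classify_sample(texts)
    except Exception as exc:  # broad on purpose: any unreadable file gets a row
        row["error"] = str(exc)
    return row


def write_inventory(input_dir: str, csv_path: str) -> int:
    """Write one CSV row per PDF in input_dir; return the number of rows."""
    rows = [inspect_pdf(p) for p in sorted(Path(input_dir).glob("*.pdf"))]
    with open(csv_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Recording errors in their own column keeps problem files visible in the inventory rather than dropping them.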
# Wrapping Up
These five Python scripts handle the PDF tasks that usually turn into repetitive manual work: splitting files, extracting content, processing batches, and cleaning up document workflows. Each script is designed to work safely on single files or entire folders while producing new outputs instead of modifying the originals.
Start with a small batch, verify the output, then scale to larger folders once everything looks right. Most of the setup only involves installing the listed dependencies and adjusting the config section for your file paths and settings.
Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she’s working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.
