redact-pdf

A CLI tool that actually redacts PDFs. Not "draw a black box and hope for the best" redaction — real redaction that rips the text out of the file's content streams, scrubs metadata, and verifies the job is done.

Built because every year I'd prepare tax documents and think "surely there's a simple command-line tool for this." There wasn't.

How it works

Two phases. You scan first, review what will be removed, then apply.

# Step 1: Tell it what to find
redact scan taxes.pdf --terms terms.txt

# Step 2: Open the preview PDF, make sure it looks right
# Step 3: Pull the trigger
redact apply manifest.json --output taxes_redacted.pdf

That's it. Your SSNs, addresses, and account numbers are gone — not hidden under a rectangle, but deleted from the underlying PDF data.

The scan phase

$ redact scan taxes.pdf --terms terms.txt

Scanning taxes.pdf for 3 term(s)...

         Scan Results — taxes.pdf
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━┓
┃ Term            ┃ Matches ┃ Pages ┃
┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━┩
│ Jane Doe        │       1 │ 1     │
│ 555-12-9876     │       1 │ 1     │
│ 1234 Oak Street │       1 │ 1     │
└─────────────────┴─────────┴───────┘

Total: 3 matches across 1 pages (0 pages unaffected)

Preview: taxes_preview.pdf
Manifest: manifest.json

It produces two files:

Preview PDF — your original with yellow highlights over every match. Open it. Check that the right things are highlighted and nothing is missed.
Manifest — a JSON file that records where the redactions go. No sensitive content is stored in it.

The apply phase

$ redact apply manifest.json --output taxes_redacted.pdf

Source: taxes.pdf
Source integrity verified (SHA-256 match).
Applying 3 redactions...
Running verification...
Verification passed: 0 terms found in redacted output.
  Text extraction: clean
  Stream inspection: clean
  Byte scan: clean

Saved to: taxes_redacted.pdf

Three-level verification runs automatically:

Text extraction — can the text be read from the page?
Stream inspection — is the text hiding in the raw PDF content streams?
Byte scan — does the text appear anywhere in the file, in any encoding?

If any check fails, you'll know.
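
To make the difference between the levels concrete, here is a minimal sketch of a byte scan and a stream inspection using only the standard library. It is not the tool's actual implementation. The function names, the set of encodings checked, and the regex-based stream finder are all assumptions for illustration, and it only handles Flate-compressed streams.

```python
import re
import zlib

# Hypothetical helper names; the real tool's internals may differ.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def term_encodings(term: str) -> list[bytes]:
    """Return byte patterns under which `term` could appear in a PDF file."""
    patterns = []
    for codec in ("utf-8", "utf-16-be", "utf-16-le"):
        raw = term.encode(codec)
        patterns.append(raw)
        # PDF hex strings such as <4A616E65> spell each byte as hex digits.
        patterns.append(raw.hex().encode("ascii"))
        patterns.append(raw.hex().upper().encode("ascii"))
    return patterns


def inflated_streams(data: bytes):
    """Yield the contents of every stream that inflates as Flate data."""
    for match in STREAM_RE.finditer(data):
        try:
            # decompressobj tolerates the trailing newline before "endstream".
            yield zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # uncompressed, or compressed with another filter


def find_traces(data: bytes, terms: list[str]) -> dict[str, list[str]]:
    """Map each term that is still present to the levels where it was found."""
    blobs = {"bytes": [data], "streams": list(inflated_streams(data))}
    hits: dict[str, list[str]] = {}
    for term in terms:
        patterns = term_encodings(term)
        for level, chunks in blobs.items():
            if any(p in chunk for p in patterns for chunk in chunks):
                hits.setdefault(term, []).append(level)
    return hits
```

An empty result from `find_traces(open("taxes_redacted.pdf", "rb").read(), terms)` would mean neither level found anything. The stream level matters because a raw byte scan alone cannot see text inside compressed content streams.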

Bulk mode — process a whole folder

Got a stack of documents? Drop them in a folder and redact them all at once.

# Step 1: Scan everything
redact bulk scan ./input --terms terms.txt --drafts ./drafts

# Step 2: Browse the drafts/ folder, check the highlighted previews
# Step 3: Apply everything
redact bulk apply bulk_manifest.json --output ./output

$ redact bulk scan ./input --terms terms.txt

Scanning 3 PDF(s) in ./input for 3 term(s)...

            Bulk Scan Results
┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━┓
┃ File              ┃ Matches ┃ Pages affected ┃
┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━┩
│ tax_return.pdf    │       3 │              1 │
│ w2_form.pdf       │       1 │              1 │
└───────────────────┴─────────┴────────────────┘

2 file(s) with matches, 1 clean, 4 total redactions

Drafts:   ./drafts
Manifest: bulk_manifest.json

The folder layout:

input/              ← your original PDFs (never modified)
drafts/             ← highlighted previews for review
output/             ← final redacted PDFs
bulk_manifest.json  ← tracks everything between phases

Files with no matches are skipped automatically. Corrupt or unreadable PDFs are reported and skipped without stopping the batch. Each file gets its own SHA-256 integrity check — if a source file changes between scan and apply, that file is skipped and the rest continue.
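
The per-file integrity check can be sketched as follows. This is an illustration under assumptions, not the tool's code: `sha256_of` and `split_by_integrity` are hypothetical names, and the `{path: digest}` mapping stands in for whatever the bulk manifest actually stores.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large PDFs are never loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def split_by_integrity(recorded: dict[str, str]) -> tuple[list[str], list[str]]:
    """Partition recorded {path: sha256} entries into (unchanged, skipped)."""
    unchanged, skipped = [], []
    for name, expected in recorded.items():
        try:
            ok = sha256_of(Path(name)) == expected
        except OSError:
            ok = False  # missing or unreadable files are skipped, not fatal
        (unchanged if ok else skipped).append(name)
    return unchanged, skipped
```

Only the files in the first list would go on to the apply step. The second list is reported and the batch continues.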

The terms file

One term per line. Comments and blank lines are fine.

# terms.txt
# SSNs
123-45-6789
987-65-4321

# Names and addresses
Jane Doe
1234 Oak Street, Anytown
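
A parser for this format fits in a few lines. The sketch below assumes that a comment is any line whose first non-blank character is `#` and that surrounding whitespace is not part of a term. `parse_terms` is a hypothetical name, not the tool's API.

```python
def parse_terms(text: str) -> list[str]:
    """Return the search terms from a terms file, one per non-comment line."""
    terms = []
    for line in text.splitlines():
        stripped = line.strip()
        # Skip blank lines and full-line comments.
        if not stripped or stripped.startswith("#"):
            continue
        terms.append(stripped)
    return terms
```

Run on the example above, this would yield the two SSNs, the name, and the address, in file order.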

Install

Requires Python 3.10+.

# With uv (recommended)
uv tool install redact-pdf

# With pip
pip install redact-pdf

# From source
git clone https://github.com/gaborcsapo/redact-pdf.git
cd redact-pdf
uv venv && source .venv/bin/activate
uv pip install -e ".[dev]"

What makes this different from drawing black boxes

Most "redaction" tools, including some commercial ones, just paint over the text. The data is still in the file: select all, copy, paste into a text editor, and there's your SSN. This failure mode has exposed supposedly redacted text in court filings and government releases, from people who should have known better.

redact-pdf takes a different approach:

| Step | What happens |
| --- | --- |
| Text removal | PyMuPDF's `apply_redactions()` rewrites the content stream, destroying the text operators for redacted characters |
| Image removal | Images overlapping redaction areas are removed |
| Metadata scrub | pikepdf strips 15+ metadata locations: XMP, docinfo, structure tree, thumbnails, embedded files, JavaScript, form fields, annotations, and more |
| Clean rewrite | The entire PDF is rewritten from scratch via QPDF linearization: no incremental save artifacts, no orphaned objects, no recoverable history |
| Verification | The output is scanned at three levels to confirm zero traces remain |

A black REDACTED label is placed where the text used to be, so readers know something was there.
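
For readers who want to see the text-removal step in isolation, here is a minimal sketch built on PyMuPDF's public API (`search_for`, `add_redact_annot`, `apply_redactions`). It covers only that one step. The metadata scrub, QPDF rewrite, and verification are not shown, and `redact_terms` is a hypothetical function, not how this tool is organized internally.

```python
def redact_terms(src: str, dst: str, terms: list[str]) -> int:
    """Remove every occurrence of `terms` from src and save the result to dst.

    Returns the number of redaction areas applied.
    """
    import fitz  # PyMuPDF; imported lazily so this module loads without it

    count = 0
    with fitz.open(src) as doc:
        for page in doc:
            for term in terms:
                # search_for returns one rectangle per match on the page.
                for rect in page.search_for(term):
                    page.add_redact_annot(
                        rect,
                        text="REDACTED",
                        fill=(0, 0, 0),        # black box
                        text_color=(1, 1, 1),  # white label
                    )
                    count += 1
            # This is the call that deletes the text under the annotations.
            page.apply_redactions()
        # garbage=4 drops unreferenced objects; this is not a full QPDF rewrite.
        doc.save(dst, garbage=4, deflate=True)
    return count
```

The key point is that `add_redact_annot` only marks an area. Nothing is removed until `apply_redactions()` runs, which is exactly the step that "draw a black box" tools skip.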

Security model

Everything is local. No network calls, no telemetry, no cloud anything. The libraries used (PyMuPDF, pikepdf, Typer, Rich) have been audited — none of them make network connections or send data anywhere.

Additional safeguards:

Core dumps disabled at startup (a crash won't dump your data to disk)
Spotlight indexing blocked via .metadata_never_index in the working directory
Extended attributes stripped from output files on macOS
Sensitive text never printed to the terminal — only match counts and page numbers
Terms read from a file, not CLI arguments (so they don't end up in shell history)
Source integrity check — the apply phase verifies the PDF hasn't changed since scanning
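
Two of these safeguards are small enough to sketch with the standard library. The function names are hypothetical and this is an illustration of the idea rather than the tool's startup code. The core-dump limit uses the Unix-only `resource` module, so the sketch reports `False` where that module is unavailable.

```python
from pathlib import Path


def disable_core_dumps() -> bool:
    """Set the core file size limit to zero; return False if unsupported."""
    try:
        import resource  # not available on Windows
    except ImportError:
        return False
    # Lowering both soft and hard limits means the process cannot raise them again.
    resource.setrlimit(resource.RLIMIT_CORE, (0, 0))
    return True


def block_spotlight_indexing(directory: Path) -> Path:
    """Create the empty marker file named in the list above."""
    marker = directory / ".metadata_never_index"
    marker.touch(exist_ok=True)
    return marker
```

Both would need to run before any PDF is opened, so that no sensitive bytes are in memory or on disk while the protections are still off.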

Known limitations

Be aware of these edge cases:

Scanned/image-only PDFs: If the PDF is a scan with no text layer, there's nothing to search. The tool warns you when it detects this.
Text rendered as vector paths: Some PDFs convert text to outlines (curves). This text is invisible to any text-based search. The tool cannot find it.
Type 3 fonts / exotic encodings: Some fonts don't map to Unicode properly. The tool warns when it detects these, but matches may be missed.
Font subsetting: After redaction, the embedded font subset may still contain glyphs for redacted characters. For high-security use cases, consider re-processing with full font replacement.

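For maximum security on critical documents, consider also rasterizing the output: render each page to an image at 300 DPI and rebuild the PDF from those images. A plain print-to-PDF usually preserves the text layer, so check that the result really contains only images.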

Development

git clone https://github.com/gaborcsapo/redact-pdf.git
cd redact-pdf
uv venv && source .venv/bin/activate
uv pip install -e ".[dev]"
pytest

Tests generate PDFs on the fly with ReportLab — no binary test fixtures in the repo. Every redaction test verifies text removal at the byte level, not just visually.

License

MIT. Do whatever you want with it.

Note: PyMuPDF (a dependency) is licensed under AGPL-3.0. Personal or internal use is fine. AGPL obligations apply if you distribute the software or let others interact with a modified version over a network. See Artifex licensing for details.
