X-Ray: a Python Library for Finding Bad Redactions in PDF Documents
Source: github.com (story)
Key topics: PDF Analysis, Redaction, Client-Side Encryption, Information Extraction
Snapshot generated from the HN discussion
Discussion Activity
- Story posted: Dec 23, 2025 at 4:54 PM EST
- First comment: Dec 23, 2025 at 5:13 PM EST (19m after posting)
- Peak activity: 63 comments in 0-6h
- Average per period: 12.9 comments
- Latest activity: Dec 27, 2025 at 8:45 AM EST
- Based on 129 loaded comments
ID: 46369923 | Type: story | Last synced: 12/26/2025, 9:45:55 PM
why don't you come up with one of those instead of just crying about it? lmao.
hopefully this is straw that breaks the camel's back
That should in theory prevent overly redacted documents for political purposes.
An approach that could be rolled out today would be redacting with human review, but showing what % of redactions the AI would have done, and also showing the prompt given to the AI to perform redactions.
Whoever did these "bad" redactions doesn't even know how to use a PDF Editor.
We have paralegals and lawyers "mark for redaction", then review the documents, then "apply redactions". It's literally been done by thousands of lawyers/paralegals for decades. This is just someone not following the process and procedure, and making mistakes. It's actually quite amateurish. You should never, ever screw up redactions if you follow the proper process. Good on the X-ray project for trying to find errors.
I just want to add, applying black highlights on top of text is, in fact, the "old" way of redaction, as it was common to do this, and then simply print the paper with the black bars, and send the paper as the final product.
Whoever did it is probably old, and may have done it thinking they were going to print it on paper afterwards!! Just guessing as to why someone would do this.
Especially with the "draw a black box over it" method, the text also stops being trivially mouse-selectable (even if CTRL+A might still work).
Another possibility is, of course, that whoever was responsible for this knew exactly what they were doing, but this way they can claim an honest mistake rather than intentionally leaking the data.
Yes; that's presumably included in being "amateurish" and "not following proper process".
Anyway, I made X-ray to analyze the millions of documents we have in CourtListener so that we can try to educate people about the issue.
The analysis was fun. We used S3 batch jobs, but we haven’t done the hard part of looking at the results and reporting them out. One day.
> Information Leaking from Redaction Marks: Even when content is properly removed, the redaction marks themselves can leak some information if not done carefully. For example, if you have a black box exactly covering a word, the length of that black box gives a clue to the word’s length (and potentially its identity).
Does X-ray employ glyph spacing attacks and try to exploit font metric leaks?
I think the combination of AI and font-metrics is going to be wild though. You ought to be able to make a system that can figure out likely words based on the unredacted ones and the redaction's size. I haven't seen any redaction system yet that protects against this.
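A rough sketch of the width-matching idea described above, assuming the font's per-character advance widths are known. The `ADVANCE` table, the names, and the tolerance are made-up illustrative values, not real font metrics, and this is not how X-ray itself works.

```python
# Hypothetical per-character advance widths in font units (illustrative only).
ADVANCE = {"a": 500, "b": 550, "e": 480, "i": 250,
           "l": 240, "m": 800, "n": 540, "o": 530}


def text_width(word, advance=ADVANCE, default=500):
    """Sum the advance width of every character (kerning is ignored here)."""
    return sum(advance.get(ch, default) for ch in word)


def candidates_for_box(box_width, names, tolerance=10):
    """Keep only the candidate strings whose rendered width fits the box."""
    return [n for n in names if abs(text_width(n) - box_width) <= tolerance]


if __name__ == "__main__":
    names = ["bob", "ann", "emil", "mia"]
    print(candidates_for_box(1630, names))                # tight fit
    print(candidates_for_box(1630, names, tolerance=60))  # padded box
```

A tight box narrows the list to a single name in this toy data, while a looser tolerance (for example, a padded redaction) lets more candidates through.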
The linked article suggests widening redacted areas more than needed with some randomization applied to the width. Strikes me that that wouldn't do much except add a few more possible solutions.
If the redaction is a person's name, and there's nothing else to give the person's identity away, single word redaction probably works reasonably well, AI or no AI.
Of course, you can also take this further. Even if you can't recover names you can get meta information about how many parties are involved by recognizing different length redactions correspond to different entities. While same length redaction doesn't guarantee same entity it is a hint.
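The entity-counting hint can be sketched as a simple grouping of redaction box widths. The widths and the `tolerance` value below are invented for illustration; as the comment says, equal width only suggests, and does not prove, the same entity.

```python
def group_by_width(widths, tolerance=1.0):
    """Group redaction widths that lie within `tolerance` of their neighbour.

    Each resulting group is a *hint* that the boxes may hide the same string.
    """
    groups = []
    for w in sorted(widths):
        if groups and w - groups[-1][-1] <= tolerance:
            groups[-1].append(w)   # close enough to the previous width
        else:
            groups.append([w])     # start a new group
    return groups


if __name__ == "__main__":
    # Hypothetical box widths (in points) measured from a redacted document.
    boxes = [52.1, 80.0, 52.4, 79.6, 120.3]
    for group in group_by_width(boxes):
        print(len(group), "box(es) of width about", group[0])
```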
If you have a number of such locations with alternatives then you can make a number of identifiable versions by combining alternates.
If there’s a way to undo huge amounts of redactions, that’d certainly be a net negative. Sort of like if encryption were suddenly broken, you wouldn’t publish a paper saying so.
Our goal has always been to educate about the problem so that it can be addressed. We didn’t have resources to push on the font metrics approach, so we stayed mostly quiet about it.
I can't state emphatically enough how this is not the right mental playbook.
If you have found a vulnerability, it's likely someone else has too. By sitting on it, you only create more future victims.
Disclosure will lead to fixing this issue, mitigating its presence, or switching tools/workflows, possibly a combination of these. Sitting on it only ensures that folks who think they are protected, actually aren't.
It’s tricky stuff and we have limited resources, unfortunately.
What if you are not the only folks who have found and exploited this vulnerability?
You can play the "what if" game to justify not doing the right thing all day long, when really it should be one "if" that guides you. What if someone else found this?
While protecting victims is noble, something like this really needs the light of day and a truth and reconciliation commission so that everyone associated with the crime ring is punished and accounted for.
And no, if you do find somehow all encryption is mathematically broken, it’s your duty to publicize it even if existing secrets are jeopardized (you mitigate as best you can obviously in the short term) because it’s likely people more powerful than you might have that knowledge anyway and are engaged in asymmetric warfare.
The strings oioioi and oooiii will have different widths in some fonts because character organisation matters a lot.
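To illustrate why character order can change the width, here is a toy model with pair kerning. The advance and kerning values are hypothetical; real fonts store their own tables.

```python
# Hypothetical advance widths and kerning pairs, in font units.
GLYPH_ADVANCE = {"o": 530, "i": 250}
KERN_PAIRS = {("o", "i"): -20, ("i", "o"): -15}


def kerned_width(text):
    """Total width = sum of advances + kerning adjustment per adjacent pair."""
    total = sum(GLYPH_ADVANCE[ch] for ch in text)
    total += sum(KERN_PAIRS.get(pair, 0) for pair in zip(text, text[1:]))
    return total


if __name__ == "__main__":
    # Same letters, different order, different number of kerned pairs.
    print(kerned_width("oioioi"))
    print(kerned_width("oooiii"))
```

Both strings contain three of each letter, but the alternating one has five kerned pairs and the grouped one has only one, so the totals differ.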
I think the conclusion is honestly that PDF is an outdated format for keeping records that might have to be redacted in the future, like court documents. Something reflowable like epub could have the text replaced with constant-space black squares instead, so no hints are leaked, as someone mentioned in a parallel comment.
It's not hard to imagine some powerful LLMs being able to undo some light redactions that are deducible based on context
One fun thing I encountered from local government is releasing files with potato quality resolution and not considering the page size.
I had a FOI request that returned mainly Arch D sized drawings but they were in a 94 DPI PDF rendered as letter sized. It was a fun conversation trying to explain to an annoyed city employee that putting those large drawings in a 94 DPI letter size page effectively made it 30-ish DPI.
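The arithmetic behind the "30-ish DPI" figure can be checked with a short calculation. It assumes Arch D is 24 x 36 inches, letter is 8.5 x 11 inches, and the drawing was scaled uniformly to fit the page without rotation.

```python
def effective_dpi(source_size_in, page_size_in, page_dpi):
    """Resolution relative to the original drawing after shrinking to fit."""
    src_w, src_h = source_size_in
    page_w, page_h = page_size_in
    # Uniform scale factor that makes the whole drawing fit on the page.
    scale = min(page_w / src_w, page_h / src_h)
    return page_dpi * scale


if __name__ == "__main__":
    arch_d = (24, 36)    # inches
    letter = (8.5, 11)   # inches
    print(round(effective_dpi(arch_d, letter, 94), 1))
```

With these sizes the limiting dimension is 36 inches squeezed into 11 inches, a scale of about 0.31, which puts 94 DPI at just under 29 DPI relative to the original sheet.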
PDF redaction fails are everywhere and it's usually because people don't understand that covering text with a black box doesn't actually remove the underlying data.
I see this constantly in compliance. People think they're protecting sensitive info but the original text is still there in the PDF structure.
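A minimal sketch of the general check: flag any word whose bounding box is mostly covered by a drawn rectangle while the word is still present in the text layer. The `Box` type, the sample coordinates, and the 0.9 threshold are assumptions for illustration; this is not X-ray's actual implementation, and extracting the boxes from a real PDF is left out.

```python
from typing import NamedTuple


class Box(NamedTuple):
    x0: float
    y0: float
    x1: float
    y1: float


def overlap_fraction(text: Box, cover: Box) -> float:
    """Fraction of the text box's area that lies under the covering box."""
    w = min(text.x1, cover.x1) - max(text.x0, cover.x0)
    h = min(text.y1, cover.y1) - max(text.y0, cover.y0)
    area = (text.x1 - text.x0) * (text.y1 - text.y0)
    if w <= 0 or h <= 0 or area <= 0:
        return 0.0
    return (w * h) / area


def find_bad_redactions(words, rects, threshold=0.9):
    """Return words still in the text layer but hidden under a rectangle."""
    return [text for text, box in words
            if any(overlap_fraction(box, r) >= threshold for r in rects)]


if __name__ == "__main__":
    # Hypothetical text-layer words and one black rectangle.
    words = [("secret", Box(10, 10, 50, 20)), ("public", Box(60, 10, 100, 20))]
    rects = [Box(8, 8, 52, 22)]
    print(find_bad_redactions(words, rects))
```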
context: https://www.resetera.com/threads/sega-sammy-fiscal-report-mi...
the pdf: https://www.segasammy.co.jp/cms/wp-content/uploads/pdf/en/ir...
I'm almost fully convinced that someone did this badly on purpose, together with the bad redactions, as surely people tasked with redacting a bunch of files receive some instructions on what to do/not to do?
JaneDoe2 is redacted 150 times
for example
Shouldn't you be dilating?
If I were to hazard a guess, pure speculation, I would say the unretrievable parts were court / previously redacted and the retrievable parts are the latest round of panicked rushed redactions.
You've phrased this as a question; I gather that you know better than to assume a modicum of competence from these people.
Now I want a font that randomly adjusts the kerning automagically to be used by people in standard word processors not some graphics app. In this way, every time the same word appears in the document, the kerning is different between each one.
most people cannot detect differences in kerning, and it takes extreme adjustments to get people to notice. even then, the words would need to be aligned above/below each other for people to see the differences. however, a computer program analyzing the size of a bounding box would notice single pixel differences. so randomly adjusting the kerning per word by pixels between each letter would go unnoticed by the vast majority of readers, but could play absolute havoc with algos trying to decipher possible word combos based on bounding box size.
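A toy simulation of the per-word kerning jitter idea, assuming pixel-level advance widths and a jitter of at most one pixel per gap between letters. The `PIXEL_ADVANCE` table is invented, and whether this would resist a real attack is not something the sketch establishes.

```python
import random

# Hypothetical advance widths in pixels.
PIXEL_ADVANCE = {"a": 9, "b": 11, "n": 11, "o": 10}


def width_range(word, max_jitter=1):
    """Smallest and largest width the word can have under random kerning."""
    base = sum(PIXEL_ADVANCE[ch] for ch in word)
    gaps = max(len(word) - 1, 0)
    return base - gaps * max_jitter, base + gaps * max_jitter


def jittered_width(word, rng, max_jitter=1):
    """One random rendering: each gap gets an independent pixel offset."""
    base = sum(PIXEL_ADVANCE[ch] for ch in word)
    gaps = max(len(word) - 1, 0)
    return base + sum(rng.randint(-max_jitter, max_jitter) for _ in range(gaps))


if __name__ == "__main__":
    rng = random.Random(0)
    for word in ("bob", "ann"):
        print(word, width_range(word), [jittered_width(word, rng) for _ in range(5)])
```

Without jitter the two words differ by one pixel and are distinguishable; with it, their possible width ranges overlap, so a single measured box width no longer picks one out.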
https://www.justice.gov/multimedia/Court%20Records/Matter%20...
As a test, select with your mouse the entire first line of paragraph number 90, and then paste it into a text editor or a shell. The unredacted text appears!
It's clear that the DOJ was paying overtime, based on the number of redactions, so the agents and lawyers just roamed free...
Why should anyone involved retain any anonymity?
I’m asking in good faith because naively it seems like this should not even exist. All of it should be exposed.
What would be the fallout? The Democrats are complicit, the regime all but controls the judiciary (at least the Supreme Court.) And a lot of these guys are billionaires and untouchable anyway unless someone does a Luigi on them. They have the ability to just brute force past the controversy and yet they've chosen to attempt the most ridiculously inept coverup possible.
On the one hand, the sheer stupidity of this administration and its incompetence at implementing fascism means that as bad as things are they could be much worse. On the other hand I fear that once JD Vance or someone just as evil but without Trump's instability takes power we're going to wish we'd done something more when we had the chance.
David Leigh then published the decryption key in his 2011 book about Wikileaks (for some reason) and the info became publicly available. Everyone pinned the blame on Assange.
Moral of the story: journalists can and will disclose ridiculously sensitive info you give them for a bit of fame and you should be extremely careful about covering your tracks.
This case is so important and such a clusterfuck that the files need to be opened anyway.
So yes this is best explanation. Revealing everything might bring great harm to innocent people just because they were somehow mentioned in the documents.
Just add all the experience we already have with “internet investigators” that ruin people lives for petty reasons.
Specifically, a number of Epstein victims have complained that the release was unacceptable because it was incomplete, illegally redacted material other than victim names which was not excepted from release under the law mandating release, and because it failed to redact victim identities required to be protected under the law mandating release.
It generally prohibits other redactions, and expressly prohibits redactions for embarrassment, reputational harm, or political sensitivity.
Of course, there is considerable concern that the actual redactions do not appear to comply with the legal requirements.
But since we’re talking about accuracy: I don’t agree on redactions being binary. You can redact with a pen that under certain lighting still reveals the text; you can redact parts that are easy to reconstruct when you have additional information; you can redact with a pen color that over time loses its function; etc. The “perfect” redaction would perhaps leave no clues as to even how much text was redacted? It seems to depend on the goal and context of the redaction, whether it achieves its purpose or not.
I still think that the word redacted is meant to destroy the original text; it might not remove the metadata (e.g. length).
Redaction is done mostly in ways with a possibility to reveal the underlying text, but all this is not redacted in my understanding of the word. I always liked the English word for this – the German word "schwärzen" just means to "blacken" the text and this was never the same for me.
But after further research I must agree with you, it just means to obscure or remove, but not clearly just remove. I have been using it for years in a stronger meaning than it really has.
One more but: we hopefully can all agree that putting a black bar over some text which still is just copy/pasteable is not even obscuring.
text=about them to damage their credibility when they tried to go public with their stories of being
text=Epstein also threatened harm to victims and helped release damaging stories
=attorneys' fees and case costs in litigation related to this conduct.
=Defendants also attempted to conceal their criminal sex trafficking and abuse
text=$327,497.48 and $6,487.04 in New York City
text=trafficking and abuse conduct.
text=destroy evidence relevant to ongoing court proceedings involving Defendants' criminal sex
text=Epstein also instructed one or more Epstein Enterprise participant-witnesses to
text=trafficked and sexually abused.
text=conduct by paying large sums of money to participant-witnesses, including by paying for their
The releases haven't yielded anything so far. For all we know, Epstein used other methods of communications for the really sensitive stuff. This would not be a surprise, since the whole Maxwell family was deep into tech (Magellan, Chiliad) and Ehud Barak was the head of Israeli military intelligence in the 1980s.
The story is going to be closed in a bipartisan manner except that it might be used to remove some unwanted politicians. The New York Times has already released an article that "explains" Epstein's wealth which names all figures that appear in "conspiracy theories" in an innocent way. Basically, they claim that Epstein could just steal from billionaires like Wexner and the billionaires would roll over and do nothing.
That is the official confirmation that all intelligence angles will be squashed in a bipartisan manner. For all we know, the "incompetence" in the redactions may be a way of saying: "See, we have nothing to hide."
I think governments should not be able to hide information from citizens in general. I don't trust those who hide stuff while being fed money from the taxpayers - that is a modern form of slavery.