A File Format Uncracked for 20 Years
landaire.net
Key topics
Reverse Engineering
Game Development
File Formats
The article discusses the reverse engineering of a 20-year-old game file format, sparking discussion on the techniques and challenges involved in understanding legacy game data.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion. First comment: 10d after posting. Peak period: 49 comments (Day 11). Average per period: 28.5. Based on 57 loaded comments.
Key moments
1. Story posted: Nov 6, 2025 at 11:52 AM EST (about 2 months ago)
2. First comment: Nov 16, 2025 at 10:48 PM EST (10d after posting)
3. Peak activity: 49 comments in Day 11, the hottest window of the conversation
4. Latest activity: Nov 17, 2025 at 2:51 PM EST (about 2 months ago)
So in this file, seek doesn't do anything, because seeking is exactly what would blow the 45-second-per-loading-screen requirement.
Instead the logic is as follows: check whether a .lin file exists. If yes: open a handle to it and only ever read from it with fread, taking whatever is at the current file position. If no: while reading any file, write the bytes that were read out to a .lin file, in the order they were read.
This gives a highly optimized .lin file which can be read from disk straight into memory, without building a better dedicated loading mechanism.
So if you really want to unpack this, the first file being read is most likely the key, as it dictates what comes next. If it is a level model, then the player's position in it might affect which other files get loaded, etc.
In short, it's not a file format in the classical sense; it's a linear stream of game data.
Yes, it's buried deep in the details, but it's basically just incoming bytes being written 1:1 to an output file.
I don't know which stage of grief this is, but since I wrote this blog post I've now ported my IDA debugger scripts to a dedicated QEMU plugin which logs all I/O operations and some other metadata. I tried using this technique to statically rewrite files by basically following DataLoad (with unique identifier) -> Seek -> Read patterns.
There's some annoying nuance to deal with (like seeking backwards implying that data was read, tested, then discarded) but I got this working. Unfortunately some data types encode absolute offsets in them that need to be touched up.
Now I'm just using this data to completely reimplement the game engine's loading logic from scratch, using a custom IO stream that checks each incoming IO operation (seek/read) against what I logged from the game engine, to ensure a 1:1 match with how the game loads data.
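Roughly like this, as a sketch; the trace format and names are my assumptions, not the author's actual code:

    // A "verifying" IO stream: each seek/read the reimplemented loader
    // performs is checked against the trace logged from the real engine.
    #include <cassert>
    #include <cstddef>
    #include <cstdint>
    #include <utility>
    #include <vector>

    enum class Op { Seek, Read };

    struct LoggedIo {
        Op       op;
        uint64_t offset;  // seek target, or file position at time of read
        uint64_t length;  // bytes read (unused for seeks)
    };

    class VerifyingStream {
        std::vector<LoggedIo> log_;
        size_t next_ = 0;
    public:
        explicit VerifyingStream(std::vector<LoggedIo> log) : log_(std::move(log)) {}

        void OnSeek(uint64_t offset) {
            const LoggedIo& e = log_.at(next_++);
            assert(e.op == Op::Seek && e.offset == offset && "diverged from engine trace");
        }

        void OnRead(uint64_t pos, uint64_t len) {
            const LoggedIo& e = log_.at(next_++);
            assert(e.op == Op::Read && e.offset == pos && e.length == len);
        }
    };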
You stated in the blog post that your goal is to find unused content. However, if the file is, as described, just a record of how the game loads its data, then it won't contain any hidden unused assets, since unused assets would never have been read from the original unoptimised file, and thus never written to this optimized file.
The goalposts have shifted far beyond my original intention at this point. The devs working on the EnhancedSC mod have a strong desire to port some Xbox assets/maps to PC, so I'm mostly doing it at this point as an attempt to help them out.
This is an interesting thing I've noticed about game dev: it sometimes seems to live in a weird space of optimisation requirements vs hackiness, where you'll have stuff like using instruction data as audio to save space, but then someone forgets to compile in release mode or something. Really odd juxtaposition of near-genius-level optimisation with naive inefficiency. I'm assuming it's because, while there may be strict performance requirements, the devs are under the pump and there's so much going on that silly stuff ends up happening?
This also happened in GTA5: there was a ridiculous loading glitch that was quite well documented on here a while ago. Also a monetisation bolt-on.
So you have competing departments, one of which must justify itself by producing a heavily optimised system, and another which is licensed to generate revenue at any cost...
Loading happens once per session and is less painful than frame stuttering all game, for example, so given a tight deadline one gets prioritized over the other.
https://www.reddit.com/r/Doom/comments/bnsy4o/psa_deactivate...
common.lin is a separate file which I believe is supposed to contain data common to all levels _before_ the level is loaded.
There's a single exported object that all levels of the game have called `MyLevel`. The game attempts to load this and it triggers a load of the level data and all its unique dependencies. The common.lin file is a snapshot of everything read before this export. AFAIK this is deterministic so it should be the exact same across all maps but I've not tested all levels.
The training level, for instance, contains two distinct parts when loading. Part 1 of the map loads 0_0_2_Training.lin, and the second part loads 0_0_3_Training.lin. These parts are completely independent -- loading the second part does not require loading the first. It does a complete re-launch of the game using the Xbox's XLaunchNewImage API, so I think all prior memory should be evicted, but maybe there's some flag I'm unaware of.
So basically the game launches, looks in the "Training" map folder for common.lin, opens a HANDLE, then looks for whichever section it's loading, grabs a HANDLE, then starts reading common.lin and <map_part>.lin.
There are multiple parts, but only one common.lin in each map folder. So no matter what, it's not going to be laid out so that a contiguous disc region for common.lin leads into <map_part>.lin.
I don't know enough about optical media seek times to say if semi-near locality is noticeably better for the worst case than the files being on complete opposite sector ranges of the disc.
By the way, could the nonsensical offsets be checksums instead?
Nice reverse engineering work and analysis there!
I remember one game I worked on, I spent months optimising loading, especially boot flow, to ensure that every file the game was going to load was the very next file on the disk, or else the next file was an optionally loaded file that could be skipped (as reading and ignoring was quicker than seeking). For the few non-deterministic cases where order couldn't be predicted (e.g. music loaded from a different thread), I preloaded a bunch of assets up front so that the rest of the assets were deterministic.
One fun thing we often did around this era is eschew filenames and instead hash the name. If we were loading a file directly from C code, we'd use the preprocessor to hash the name via some complicated macros, so the final call would be compiled like LoadAsset(0x184e49da), but still retain a run-time hasher for cases where the filename was generated dynamically. This seems like a weird optimisation, but avoiding the directory scan and filename comparisons can actually save a lot of unnecessary seeking / CPU operations, especially for multi-level directories. The "file table" then just became a list of disk offsets and lengths, with a few gaps because the hash table size was a little bigger than the number of files, to avoid hash conflicts. Ironically, on one title I worked on we had the same modulo for about 2 years in development, and just before launch we needed to change it twice in a week due to conflicts!
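For flavor, a modern-C++ equivalent of that trick; the original used preprocessor macros, but constexpr gets the same effect of baking the hash into the call site (the hash function and names here are illustrative, not any shipped code):

    // FNV-1a in constexpr form: the compiler folds HashName("...") to a
    // constant, so the call site compiles down to LoadAsset(0x...).
    #include <cstdint>
    #include <cstdio>

    constexpr uint32_t HashName(const char* s) {
        uint32_t h = 2166136261u;              // FNV offset basis
        while (*s) {
            h ^= static_cast<unsigned char>(*s++);
            h *= 16777619u;                    // FNV prime
        }
        return h;
    }

    // Stub: a real engine would map the hash to {disk offset, length}.
    void LoadAsset(uint32_t nameHash) { std::printf("load 0x%08x\n", nameHash); }

    int main() {
        constexpr uint32_t kId = HashName("levels/training/mesh.bin");
        LoadAsset(kId);                        // only the constant survives
    }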
"Mel's job was to re-write the blackjack program for the RPC-4000. (Port? What does that mean?) The new computer had a one-plus-one addressing scheme, in which each machine instruction, in addition to the operation code and the address of the needed operand, had a second address that indicated where, on the revolving drum, the next instruction was located.
https://users.cs.utah.edu/~elb/folklore/mel.html
>By the way, could the nonsensical offsets be checksums instead?
If you're referring to those weird "addresses" that quickly became irrelevant, there's a CRC32 somewhere in the header immediately after them. The address value is the same across files with different contents too.
I was talking to a friend of mine about it and he suggested that maybe whatever process generated the files included the file's load address in case it could be mapped to the same address for some other optimization?
I think XISO is derived from ISO9660, so may have the same properties?
The longer it’s not released for sale, the more debt you’re incurring paying the staff.
I’ve worked with a few ex-game devs and they’re always great devs, specifically at optimising. They’re not great at the “forward maintainability” aspect though because they’ve largely never had experience having to do it.
Then, game dev was always full of fresh junior devs with tons of energy, ideas and dreams, but who came from homebrew, where things like reliable, beautiful, readable code are unnecessary.
And tons of things get missed. I keep hoping that the one published game I have was accidentally built with debug symbols in it so it can be easily traced. Two of us on the project were heavily into performance optimization, and I absolutely remember us going through compiler options, but things were crazy. I remember one major milestone build for Eidos I was hallucinating badly when I compiled and burned the CD because I'd been working for three days straight with no sleep.
Country?! What's the meaning?
I'd recommend taking a well-documented binary file format (Doom WAD file?), going over the documentation, and checking that you can pick out the individual values in a hex editor.
Now, after you have a feel for how things might look in hex, look at your own file. Start by saving an empty project from your program and identifying the header; maybe it's compressed?
If it's not, change a tiny thing in the program and save again, compare the files to see what changed. Or alternatively change the file a tiny bit and load it.
Write a parser and add things as you learn more. If the file isn't intentionally obfuscated, it should probably be just a matter of persevering until you can parse the entire file.
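A tiny byte-diff tool covers that save/change/compare loop; a minimal sketch:

    // Prints every offset at which two saved files differ.
    #include <algorithm>
    #include <cstdio>
    #include <fstream>
    #include <iterator>
    #include <vector>

    static std::vector<unsigned char> Slurp(const char* path) {
        std::ifstream f(path, std::ios::binary);
        return {std::istreambuf_iterator<char>(f), std::istreambuf_iterator<char>()};
    }

    int main(int argc, char** argv) {
        if (argc != 3) { std::fprintf(stderr, "usage: %s old.bin new.bin\n", argv[0]); return 1; }
        auto a = Slurp(argv[1]), b = Slurp(argv[2]);
        for (size_t i = 0, n = std::min(a.size(), b.size()); i < n; ++i)
            if (a[i] != b[i])
                std::printf("0x%08zx: %02x -> %02x\n", i, a[i], b[i]);
        if (a.size() != b.size())
            std::printf("sizes differ: %zu vs %zu\n", a.size(), b.size());
    }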
You can basically divide the world into read/write (or write-only) formats and read-only formats.
For read/write (or write-only) formats, usually the in-memory data structures were written first, and then the serialization/deserialization code. So it is almost always more useful to see how the code works than to try to figure out what random bytes in the file mean. A not-insignificant percentage of the time, the serialization/deserialization code is fairly straightforward: read some bytes, maybe decompress them, maybe checksum them and compare to a checksum field, shove them in the right place in a memory structure / create a class using them, move on.
Different parts of the program may read different parts of a file, but again, usually a given part of the deserialization/serialization code is fairly understandable.
Read-only formats are scattershot, for lots of reasons; I'll just cover a few. First, because the code doesn't usually contain both the writing and the reading, you have less of a point of reference for what the reading code is doing. Second, they are not uncommonly memory-mapped serializations of in-memory structures. But not necessarily even for the current platform. So the format may make perfect sense and be easy to understand on some platform, but on your platform, the code is doing weird conversions and such. This is essentially a variant of "the format was designed before the code". Lots and lots more issues.
I still would start by trying to understand the deserialization code rather than the format directly in this case, but it often is significantly harder to get a handle on.
There are commonalities in some types of programs (i.e. you will find commonalities between games using the same engine, etc.), but if you are talking "in general", the above is the best "in general" I can give you.
You can do something in between reverse-engineering the code and reverse-engineering the format if you can instrument the reader: attach breakpoints on every basic block in the reader, load a file, take a baseline trace of what gets hit, then vary bytes in the file and diff the new trace against the baseline.
It's a pretty fun tool to write, too.
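The diff step itself is simple once you have the traces; a sketch, with the trace capture (debugger scripting or a DBI framework) left out:

    // Given sets of basic-block addresses hit in a baseline run vs. a run
    // with mutated file bytes, report the blocks unique to each run.
    #include <cinttypes>
    #include <cstdint>
    #include <cstdio>
    #include <set>

    void DiffTraces(const std::set<uint64_t>& baseline,
                    const std::set<uint64_t>& mutated) {
        for (uint64_t addr : mutated)
            if (!baseline.count(addr))
                std::printf("block only hit with mutation: 0x%" PRIx64 "\n", addr);
        for (uint64_t addr : baseline)
            if (!mutated.count(addr))
                std::printf("block skipped with mutation:  0x%" PRIx64 "\n", addr);
    }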
If it's a really simple format, since you appear to have the ability to generate arbitrary file contents using the program, you can get some mileage by generating a suite of small files with few changes between them. I reverse engineered the DSP sphere blueprint format by generating a blueprint with one node, then the same node located elsewhere, then two nodes, then two nodes and one frame between them, etc. But this process is really only possible for the simplest formats; I'd wager that most reverse-engineered file formats are heavily based on decompilation of the deserialization code.
A lot of binary file formats end up being some form of "container" format--essentially, a file contains some form of directory mapping an item ID to a location in the file, and the contents of each item are in some other binary format. It's worth first checking whether this is the case, and matching against known formats like ZIP or HDF5.
I'd suggest looking at a format like msgpack to see what a binary data format could look like: https://msgpack.org/
Then be aware that proprietary formats are going to be a lot more complicated. Or maybe it's just zipped up json data, only way to tell is to start poking around at it.
(a) you reverse engineer the application writing or reading the file. Even without fully understanding the application it can give you valuable information about the format (e.g. "The application calls fwrite in a for loop ten times, maybe those are related to the ten elements that I see on the screen").
(b) you reverse engineer only the file. For example, you change one value in the application and compare the resulting output file. Or the opposite way: you change one value in the file and see what happens in the application when you load it.
The next step is always looking at the values of 32-bit or 64-bit integers: if a value is greater than 0 but less than the file's size, it is often an offset into the file, meaning it addresses a specific part (see the sketch after this comment).
Another recommendation is to understand what you are looking for. For games, you are most likely looking for meshes and textures. In 3D, every vertex of a mesh is most likely represented by 3 floats/doubles. If you see clusters of 3 floats with sensible values (e.g. without a +/-E component), it's likely that you're looking at real floats.
When looking for textures it can help to adjust the width of the data view to the resolution of the data you're looking for. For example, if you are looking for an 8-bit alpha map with a resolution of 64 x 64, try to get 64 bytes per row in your hex editor; you might be lucky and see the pattern show up.
For save games I can only reiterate what has been mentioned before: look for unique, specific values in the file as integers. For example, how much gold you have.
I used these techniques to reverse engineer:
* Diablo 2 save games
* World of Warcraft adt chunks
* .NET assembly files (I would recommend reading the ECMA specification though)
* the Jade format of Beyond Good and Evil
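The offset heuristic above is easy to automate; a sketch (expect plenty of false positives, it's a starting point, not a parser):

    // Scan a file for 32-bit little-endian integers whose value falls
    // inside the file, a common signature of offset fields.
    #include <cstdint>
    #include <cstdio>
    #include <cstring>
    #include <fstream>
    #include <iterator>
    #include <vector>

    int main(int argc, char** argv) {
        if (argc != 2) return 1;
        std::ifstream f(argv[1], std::ios::binary);
        std::vector<unsigned char> data{std::istreambuf_iterator<char>(f),
                                        std::istreambuf_iterator<char>()};
        for (size_t i = 0; i + 4 <= data.size(); i += 4) {  // assumes 4-byte-aligned fields
            uint32_t v;
            std::memcpy(&v, &data[i], 4);                   // little-endian host assumed
            if (v > 0 && v < data.size())
                std::printf("0x%08zx: candidate offset 0x%08x\n", i, v);
        }
    }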
* common.lin contains filenames, so that filename-expansion code in the game can work. But the offsets and sizes associated with the files are garbage
* <filename>.lin contains a stream of every byte read from every file while loading the level <filename>. The stream is then compressed in 16k chunks by zlib (see the sketch after this comment).
* There is no indication in that stream of which real file was being read, nor the length of each read, nor what seeking was done (if any). All that metadata is gone.
* The only way to recover this metadata is to run the game code and log the exact sequence of file opens, seeks, reads.
* Alternatively, extract all that Unreal object loader code from the game and reimplement it yourself, so that you can let the contents of the stream drive the correct reading of the stream. The code should be deterministic.
This sounds pretty hellish for the game developers, and I bet the debug versions of their game _ignored_ <filename>.lin and used the real source files, but _wrote_ <filename>.lin immediately after every load... any change to the Unreal objects could alter how they were read, and if the data streamed didn't perfectly match up with what was in the real files, you'd be toast.
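For the 16k zlib chunks, something like this should recover the raw stream, assuming the chunks are simply back-to-back zlib streams (the actual .lin framing may differ):

    // Inflate a buffer of concatenated zlib streams, resetting the
    // decompressor at each stream end. Link with -lz.
    #include <cstddef>
    #include <vector>
    #include <zlib.h>

    std::vector<unsigned char> InflateChunks(const unsigned char* in, size_t inLen) {
        std::vector<unsigned char> out;
        unsigned char buf[16 * 1024];                 // 16 KiB per chunk
        z_stream zs{};
        if (inflateInit(&zs) != Z_OK) return out;
        zs.next_in  = const_cast<unsigned char*>(in);
        zs.avail_in = static_cast<uInt>(inLen);
        while (zs.avail_in > 0) {
            zs.next_out  = buf;
            zs.avail_out = sizeof(buf);
            int rc = inflate(&zs, Z_NO_FLUSH);
            out.insert(out.end(), buf, buf + (sizeof(buf) - zs.avail_out));
            if (rc == Z_STREAM_END)
                inflateReset(&zs);                    // next chunk follows
            else if (rc != Z_OK)
                break;                                // unexpected framing, bail
        }
        inflateEnd(&zs);
        return out;
    }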
It reminds me of the extreme optimisation that Farbrausch did for .kkrieger -- they built a single binary, then ran and played it under instrumentation, and _any_ code path that wasn't taken was deleted from the binary to make it smaller. They forgot to take any damage in that playthrough, so all the code that applies damage to the player was deleted. Oops!
https://fgiesen.wordpress.com/2012/04/08/metaprogramming-for...
Now it makes sense.
> retn 4
> That's it.
I've seen this before; it's a random() function.
I wonder if any of the original devs will stumble upon the author's article and then remember why they did those weird file offsets.
There was a difference in the PC and Xbox versions, so it will be interesting to find out if the author sees any snippets or missing game assets in the Xbox version.
[1] Starbounded was supposed to become an editor: https://github.com/blixt/starbounded
[2] https://github.com/blixt/starbound-sha256
[3] https://github.com/blixt/py-starbound
It's an interesting rabbit hole to go down, and this post makes me appreciate the way in which this kind of forensic analysis is done.
1: https://eye-of-the-gopher.github.io/
2 more comments available on Hacker News