A File Format Uncracked for 20 Years
landaire.net
Key topics
Reverse Engineering
Game Development
File Formats
The article discusses the reverse engineering of a 20-year-old game file format, sparking discussion on the techniques and challenges involved in understanding legacy game data.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion. First comment: 10d after posting. Peak period: 49 comments (Day 11). Average per period: 28.5. Based on 57 loaded comments.
Key moments
1. Story posted: Nov 6, 2025 at 11:52 AM EST (about 2 months ago)
2. First comment: Nov 16, 2025 at 10:48 PM EST (10d after posting)
3. Peak activity: 49 comments in Day 11, the hottest window of the conversation
4. Latest activity: Nov 17, 2025 at 2:51 PM EST (about 2 months ago)
So in this file, seek doesn't do anything, because seeking is exactly what would blow the 45-second-per-loading-screen requirement.
Instead the logic is as follows: check whether a .lin file exists. If yes: open a handle to it and only ever read from it with fread, taking whatever is at the current file position. If no: while reading any file, write the bytes that were read out to a .lin file, in the order they were read.
This gives a highly optimized .lin file which can be read from disk straight into memory, without building a better dedicated loading mechanism.
So if you really want to unpack this, the first file being read is most likely the key, as it dictates what comes next. If it is a level model, then the player's position in it might affect which other files get loaded, etc.
In short, it's not a file format in the classical sense; it's a linear stream of game data.
Yes, it's buried deep in the details, but it's basically just incoming bytes being written 1:1 to an output file.
I don't know which stage of grief this is, but since I wrote this blog post I've now ported my IDA debugger scripts to a dedicated QEMU plugin which logs all I/O operations and some other metadata. I tried using this technique to statically rewrite files by basically following DataLoad (with unique identifier) -> Seek -> Read patterns.
There's some annoying nuance to deal with (like seeking backwards implying that data was read, tested, then discarded) but I got this working. Unfortunately some data types encode absolute offsets in them that need to be touched up.
Now I'm just using this data to completely reimplement the game engine's loading logic from scratch, using a custom IO stream that checks each incoming IO operation (seek/read) against what I logged from the game engine, to ensure a 1:1 match with how the game loads data.
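Roughly like this, as a sketch; the trace format and names are my assumptions, not the author's actual code:

    // A "verifying" IO stream: each seek/read the reimplemented loader
    // performs is checked against the trace logged from the real engine.
    #include <cassert>
    #include <cstddef>
    #include <cstdint>
    #include <utility>
    #include <vector>

    enum class Op { Seek, Read };

    struct LoggedIo {
        Op       op;
        uint64_t offset;  // seek target, or file position at time of read
        uint64_t length;  // bytes read (unused for seeks)
    };

    class VerifyingStream {
        std::vector<LoggedIo> log_;
        size_t next_ = 0;
    public:
        explicit VerifyingStream(std::vector<LoggedIo> log) : log_(std::move(log)) {}

        void OnSeek(uint64_t offset) {
            const LoggedIo& e = log_.at(next_++);
            assert(e.op == Op::Seek && e.offset == offset && "diverged from engine trace");
        }

        void OnRead(uint64_t pos, uint64_t len) {
            const LoggedIo& e = log_.at(next_++);
            assert(e.op == Op::Read && e.offset == pos && e.length == len);
        }
    };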
You stated in the blog post that your goal is to find unused content. However, if the file is, as described, just a record of how the game loads its data, then it won't contain any hidden unused assets, since unused assets would never have been read from the original unoptimised file, and thus never written to this optimized file.
The goalposts have shifted far beyond my original intention at this point. The devs working on the EnhancedSC mod have a strong desire to port some Xbox assets/maps to PC, so I'm mostly doing it at this point as an attempt to help them out.
This is an interesting thing I've noticed about game dev: it sometimes seems to live in a weird space of optimisation requirements vs hackiness, where you'll have stuff like using instruction data as audio to save space, but then someone forgets to compile in release mode or something. Really odd juxtaposition of near-genius-level optimisation with naive inefficiency. I'm assuming it's because, while there may be strict performance requirements, the devs are under the pump and there's so much going on that silly stuff ends up happening?
This also happened in GTA5: there was a ridiculous loading glitch that was quite well documented on here a while ago. Also a monetisation bolt-on.
So you have competing departments, one of which must justify itself by producing a heavily optimised system, and another which is licensed to generate revenue at any cost...
Loading happens once per session and is less painful than frame stuttering all game, for example, so given a tight deadline one gets prioritized over the other.
https://www.reddit.com/r/Doom/comments/bnsy4o/psa_deactivate...
common.lin is a separate file which I believe is supposed to contain data common to all levels _before_ the level is loaded.
There's a single exported object that all levels of the game have called `MyLevel`. The game attempts to load this and it triggers a load of the level data and all its unique dependencies. The common.lin file is a snapshot of everything read before this export. AFAIK this is deterministic so it should be the exact same across all maps but I've not tested all levels.
The training level, for instance, contains two distinct parts when loading. Part 1 of the map loads 0_0_2_Training.lin, and the second part loads 0_0_3_Training.lin. These parts are completely independent -- loading the second part does not require loading the first. It does a complete re-launch of the game using the Xbox's XLaunchNewImage API, so I think all prior memory should be evicted, but maybe there's some flag I'm unaware of.
So basically the game launches, looks in the "Training" map folder for common.lin, opens a HANDLE, then looks for whichever section it's loading, grabs a HANDLE, then starts reading common.lin and <map_part>.lin.
There are multiple parts, but only one common.lin in each map folder. So no matter what, it's not going to be laid out so that a contiguous disc region for common.lin leads into <map_part>.lin.
I don't know enough about optical media seek times to say if semi-near locality is noticeably better for the worst case than the files being on complete opposite sector ranges of the disc.
By the way, could the nonsensical offsets be checksums instead?
Nice reverse engineering work and analysis there!
I remember one game I worked on, I spent months optimising loading, especially boot flow, to ensure that every file the game was going to load was the very next file on the disk, or else the next file was an optionally loaded file that could be skipped (as reading and ignoring was quicker than seeking). For the few non-deterministic cases where order couldn't be predicted (e.g. music loaded from a different thread), I preloaded a bunch of assets up front so that the rest of the assets were deterministic.
One fun thing we often did around this era is eschew filenames and instead hash the name. If we were loading a file directly from C code, we'd use the preprocessor to hash the name via some complicated macros, so the final call would be compiled like LoadAsset(0x184e49da), but still retain a run-time hasher for cases where the filename was generated dynamically. This seems like a weird optimisation, but avoiding the directory scan and filename comparisons can actually save a lot of unnecessary seeking / CPU operations, especially for multi-level directories. The "file table" then just became a list of disk offsets and lengths, with a few gaps because the hash table size was a little bigger than the number of files, to avoid hash conflicts. Ironically, on one title I worked on we had the same modulo for about 2 years in development, and just before launch we needed to change it twice in a week due to conflicts!
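For flavor, a modern-C++ equivalent of that trick; the original used preprocessor macros, but constexpr gets the same effect of baking the hash into the call site (the hash function and names here are illustrative, not any shipped code):

    // FNV-1a in constexpr form: the compiler folds HashName("...") to a
    // constant, so the call site compiles down to LoadAsset(0x...).
    #include <cstdint>
    #include <cstdio>

    constexpr uint32_t HashName(const char* s) {
        uint32_t h = 2166136261u;              // FNV offset basis
        while (*s) {
            h ^= static_cast<unsigned char>(*s++);
            h *= 16777619u;                    // FNV prime
        }
        return h;
    }

    // Stub: a real engine would map the hash to {disk offset, length}.
    void LoadAsset(uint32_t nameHash) { std::printf("load 0x%08x\n", nameHash); }

    int main() {
        constexpr uint32_t kId = HashName("levels/training/mesh.bin");
        LoadAsset(kId);                        // only the constant survives
    }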
"Mel's job was to re-write the blackjack program for the RPC-4000. (Port? What does that mean?) The new computer had a one-plus-one addressing scheme, in which each machine instruction, in addition to the operation code and the address of the needed operand, had a second address that indicated where, on the revolving drum, the next instruction was located.
https://users.cs.utah.edu/~elb/folklore/mel.html
>By the way, could the nonsensical offsets be checksums instead?
If you're referring to those weird "addresses" that quickly became irrelevant, there's a CRC32 somewhere in the header immediately after them. The address value is the same across files with different contents too.
I was talking to a friend of mine about it and he suggested that maybe whatever process generated the files included the file's load address in case it could be mapped to the same address for some other optimization?
I think XISO is derived from ISO9660, so may have the same properties?
The longer it’s not released for sale, the more debt you’re incurring paying the staff.
I’ve worked with a few ex-game devs and they’re always great devs, specifically at optimising. They’re not great at the “forward maintainability” aspect though because they’ve largely never had experience having to do it.
Then, game dev was always full of fresh junior devs with tons of energy, ideas and dreams, but who came from homebrew, where things like reliable, beautiful, readable code are unnecessary.
And tons of things get missed. I keep hoping that the one published game I have was accidentally built with debug symbols in it so it can be easily traced. Two of us on the project were heavily into performance optimization, and I absolutely remember us going through compiler options, but things were crazy. I remember one major milestone build for Eidos I was hallucinating badly when I compiled and burned the CD because I'd been working for three days straight with no sleep.
Country?! What's the meaning?
I'd recommend taking a well-documented binary file format (Doom WAD file?), going over the documentation, and checking that you can pick out the individual values in a hex editor.
Now, after you have a feel for how things might look in hex, look at your own file. Start by saving an empty project from your program and identifying the header; maybe it's compressed?
If it's not, change a tiny thing in the program and save again, compare the files to see what changed. Or alternatively change the file a tiny bit and load it.
Write a parser and add things as you learn more. If the file isn't intentionally obfuscated, it should probably be just a matter of persevering until you can parse the entire file.
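A tiny byte-diff tool covers that save/change/compare loop; a minimal sketch:

    // Prints every offset at which two saved files differ.
    #include <algorithm>
    #include <cstdio>
    #include <fstream>
    #include <iterator>
    #include <vector>

    static std::vector<unsigned char> Slurp(const char* path) {
        std::ifstream f(path, std::ios::binary);
        return {std::istreambuf_iterator<char>(f), std::istreambuf_iterator<char>()};
    }

    int main(int argc, char** argv) {
        if (argc != 3) { std::fprintf(stderr, "usage: %s old.bin new.bin\n", argv[0]); return 1; }
        auto a = Slurp(argv[1]), b = Slurp(argv[2]);
        for (size_t i = 0, n = std::min(a.size(), b.size()); i < n; ++i)
            if (a[i] != b[i])
                std::printf("0x%08zx: %02x -> %02x\n", i, a[i], b[i]);
        if (a.size() != b.size())
            std::printf("sizes differ: %zu vs %zu\n", a.size(), b.size());
    }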
You can basically divide the world into read/write (or write-only) formats and read-only formats.
For read/write (or write-only) formats, usually the in-memory data structures were written first, and then the serialization/deserialization code. So it is almost always more useful to see how the code works than to try to figure out what random bytes in the file mean. A not-insignificant percentage of the time, the serialization/deserialization code is fairly straightforward: read some bytes, maybe decompress them, maybe checksum them and compare to a checksum field, shove them in the right place in a memory structure / create a class using them, move on.
Different parts of the program may read different parts of a file, but again, usually a given part of the deserialization/serialization code is fairly understandable.
Read-only formats are scattershot, for lots of reasons; I'll just cover a few. First, because the code doesn't usually contain both the writing and the reading, you have less of a point of reference for what the reading code is doing. Second, they are not uncommonly memory-mapped serializations of in-memory structures. But not necessarily even for the current platform. So the format may make perfect sense and be easy to understand on some platform, but on your platform, the code is doing weird conversions and such. This is essentially a variant of "the format was designed before the code". Lots and lots more issues.
I still would start by trying to understand the deserialization code rather than the format directly in this case, but it often is significantly harder to get a handle on.
There are commonalities in some types of programs (i.e. you will find commonalities between games using the same engine, etc.), but if you are talking "in general", the above is the best "in general" I can give you.
You can do something in between reverse-engineering the code and reverse-engineering the format if you can instrument the reader: attach breakpoints on every basic block in the reader, load a file, take a baseline trace of what gets hit, then vary bytes in the file and diff the new trace against the baseline.
It's a pretty fun tool to write, too.
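The diff step itself is simple once you have the traces; a sketch, with the trace capture (debugger scripting or a DBI framework) left out:

    // Given sets of basic-block addresses hit in a baseline run vs. a run
    // with mutated file bytes, report the blocks unique to each run.
    #include <cinttypes>
    #include <cstdint>
    #include <cstdio>
    #include <set>

    void DiffTraces(const std::set<uint64_t>& baseline,
                    const std::set<uint64_t>& mutated) {
        for (uint64_t addr : mutated)
            if (!baseline.count(addr))
                std::printf("block only hit with mutation: 0x%" PRIx64 "\n", addr);
        for (uint64_t addr : baseline)
            if (!mutated.count(addr))
                std::printf("block skipped with mutation:  0x%" PRIx64 "\n", addr);
    }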
If it's a really simple format, since you appear to have the ability to generate arbitrary file contents using the program, you can get some mileage by generating a suite of small files with few changes between them. I reverse engineered the DSP sphere blueprint format by generating a blueprint with one node, then the same node located elsewhere, then two nodes, then two nodes and one frame between them, etc. But this process is really only possible for the simplest formats; I'd wager that most reverse-engineered file formats are heavily based on decompilation of the deserialization code.
A lot of binary file formats end up being some form of "container" format--essentially, a file contains some form of directory mapping an item ID to a location in the file, and the contents of each item are in some other binary format. It's worth first checking whether this is the case, and matching against known formats like ZIP or HDF5.
I'd suggest looking at a format like msgpack to see what a binary data format could look like: https://msgpack.org/
Then be aware that proprietary formats are going to be a lot more complicated. Or maybe it's just zipped up json data, only way to tell is to start poking around at it.
(a) you reverse engineer the application writing or reading the file. Even without fully understanding the application it can give you valuable information about the format (e.g. "The application calls fwrite in a for loop ten times, maybe those are related to the ten elements that I see on the screen").
(b) you reverse engineer only the file. For example, you change one value in the application and compare the resulting output file. Or the opposite way: you change one value in the file and see what happens in the application when you load it.
The next step is always looking at the values of 32-bit or 64-bit integers: if a value is greater than 0 but less than the file's size, it is often an offset into the file, meaning it addresses a specific part (see the sketch after this comment).
Another recommendation is to understand what you are looking for. For games, you are most likely looking for meshes and textures. In 3D, every vertex of a mesh is most likely represented by 3 floats/doubles. If you see clusters of 3 floats with sensible values (e.g. without a +/-E component), it's likely that you're looking at real floats.
When looking for textures it can help to adjust the width of the data view to the resolution of the data you're looking for. For example, if you are looking for an 8-bit alpha map with a resolution of 64 x 64, try to get 64 bytes per row in your hex editor; you might be lucky and see the pattern show up.
For save games I can only reiterate what has been mentioned before: look for unique, specific values in the file as integers. For example, how much gold you have.
I used these techniques to reverse engineer:
* Diablo 2 save games
* World of Warcraft adt chunks
* .NET assembly files (I would recommend reading the ECMA specification though)
* the Jade format of Beyond Good and Evil
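The offset heuristic above is easy to automate; a sketch (expect plenty of false positives, it's a starting point, not a parser):

    // Scan a file for 32-bit little-endian integers whose value falls
    // inside the file, a common signature of offset fields.
    #include <cstdint>
    #include <cstdio>
    #include <cstring>
    #include <fstream>
    #include <iterator>
    #include <vector>

    int main(int argc, char** argv) {
        if (argc != 2) return 1;
        std::ifstream f(argv[1], std::ios::binary);
        std::vector<unsigned char> data{std::istreambuf_iterator<char>(f),
                                        std::istreambuf_iterator<char>()};
        for (size_t i = 0; i + 4 <= data.size(); i += 4) {  // assumes 4-byte-aligned fields
            uint32_t v;
            std::memcpy(&v, &data[i], 4);                   // little-endian host assumed
            if (v > 0 && v < data.size())
                std::printf("0x%08zx: candidate offset 0x%08x\n", i, v);
        }
    }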
* common.lin contains filenames, so that filename-expansion code in the game can work. But the offsets and sizes associated with the files are garbage
* <filename>.lin contains a stream of every byte read from every file while loading the level <filename>. The stream is then compressed in 16k chunks by zlib (see the sketch after this comment).
* There is no indication in that stream of which real file was being read, nor the length of each read, nor what seeking was done (if any). All that metadata is gone.
* The only way to recover this metadata is to run the game code and log the exact sequence of file opens, seeks, reads.
* Alternatively, extract all that Unreal object loader code from the game and reimplement it yourself, so that you can let the contents of the stream drive the correct reading of the stream. The code should be deterministic.
This sounds pretty hellish for the game developers, and I bet the debug versions of their game _ignored_ <filename>.lin and used the real source files, but _wrote_ <filename>.lin immediately after every load... any change to the Unreal objects could alter how they were read, and if the data streamed didn't perfectly match up with what was in the real files, you'd be toast.
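For the 16k zlib chunks, something like this should recover the raw stream, assuming the chunks are simply back-to-back zlib streams (the actual .lin framing may differ):

    // Inflate a buffer of concatenated zlib streams, resetting the
    // decompressor at each stream end. Link with -lz.
    #include <cstddef>
    #include <vector>
    #include <zlib.h>

    std::vector<unsigned char> InflateChunks(const unsigned char* in, size_t inLen) {
        std::vector<unsigned char> out;
        unsigned char buf[16 * 1024];                 // 16 KiB per chunk
        z_stream zs{};
        if (inflateInit(&zs) != Z_OK) return out;
        zs.next_in  = const_cast<unsigned char*>(in);
        zs.avail_in = static_cast<uInt>(inLen);
        while (zs.avail_in > 0) {
            zs.next_out  = buf;
            zs.avail_out = sizeof(buf);
            int rc = inflate(&zs, Z_NO_FLUSH);
            out.insert(out.end(), buf, buf + (sizeof(buf) - zs.avail_out));
            if (rc == Z_STREAM_END)
                inflateReset(&zs);                    // next chunk follows
            else if (rc != Z_OK)
                break;                                // unexpected framing, bail
        }
        inflateEnd(&zs);
        return out;
    }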
It reminds me of the extreme optimisation that Farbrausch did for .kkrieger -- they built a single binary, then ran and played it under instrumentation, and _any_ code path that wasn't taken was deleted from the binary to make it smaller. They forgot to take any damage in that playthrough, so all the code that applies damage to the player was deleted. Oops!
https://fgiesen.wordpress.com/2012/04/08/metaprogramming-for...
Now it makes sense.
> retn 4
> That's it.
I've seen this before; it's a random() function.
I wonder if any of the original devs will stumble upon the author's article and then remember why they did those weird file offsets.
There was a difference in the PC and Xbox versions, so it will be interesting to find out if the author sees any snippets or missing game assets in the Xbox version.
[1] Starbounded was supposed to become an editor: https://github.com/blixt/starbounded
[2] https://github.com/blixt/starbound-sha256
[3] https://github.com/blixt/py-starbound
It's an interesting rabbit hole to go down, and this post makes me appreciate the way in which this kind of forensic analysis is done.
1: https://eye-of-the-gopher.github.io/
2 more comments available on Hacker News