Generally good points. Unfortunately, existing file formats rarely follow these rules. In fact, these rules tend to emerge naturally once you have dealt with many different file formats anyway. Specific points follow:
- Agreed that human-readable formats have to be dead simple, otherwise binary formats should be used. Note that textual numbers are surprisingly complex to handle, so any format that makes significant use of numbers should just use binary.
- Chunking is generally good for structuring and incremental parsing, but do not expect it to somehow provide reorderability or back/forward compatibility. Unless explicitly designed in, those properties do not exist. Consider PNG for example; PNG chunks were designed to be quite robust, but nowadays some exceptions [1] do exist. Versioning is much more crucial for that.
[1] https://www.w3.org/TR/png/#animation-information
- Making a new file format from scratch is always difficult. As already mentioned, you should really consider using existing file formats as a container first. Some formats are even explicitly designed for this purpose, like sBOX [2] or RFC 9277 CBOR-labeled data tags [3].
[2] https://nothings.org/computer/sbox/sbox.html
[3] https://www.rfc-editor.org/rfc/rfc9277.html
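To make the chunking point concrete, here is a minimal sketch of a tagged, length-prefixed chunk walker (loosely PNG-like; the header layout and the function names are made up for illustration, not taken from any of the formats above):
```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* One chunk: 4-byte ASCII tag + 32-bit big-endian payload length, then payload. */
struct chunk_header {
    char     tag[5];   /* NUL-terminated for printing */
    uint32_t length;   /* payload bytes that follow */
};

/* Returns 1 on success, 0 on EOF or error. */
static int read_chunk_header(FILE *f, struct chunk_header *h) {
    uint8_t raw[8];
    if (fread(raw, 1, sizeof raw, f) != sizeof raw) return 0;
    memcpy(h->tag, raw, 4);
    h->tag[4] = '\0';
    h->length = (uint32_t)raw[4] << 24 | (uint32_t)raw[5] << 16 |
                (uint32_t)raw[6] << 8  | (uint32_t)raw[7];
    return 1;
}

/* Walk the file, skipping payloads of chunks we don't understand. */
static void list_chunks(FILE *f) {
    struct chunk_header h;
    while (read_chunk_header(f, &h)) {
        printf("%s: %u bytes\n", h.tag, h.length);
        if (fseek(f, (long)h.length, SEEK_CUR) != 0) break;
    }
}
```
Note that this gives you incremental parsing and skippability, but, as said above, not reorderability or compatibility by itself.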
> Note that textual numbers are surprisingly complex to handle, so any format that makes significant use of numbers should just use binary.
Especially true of floats!
With binary formats, it's usually enough to only support machines whose floating point representation conforms to IEEE 754, which means you can just memcpy a float variable to or from the file (maybe with some endianness conversion). But writing a floating point parser and serializer which correctly round-trips all floats and where the parser guarantees that it parses to the nearest possible float... That's incredibly tricky.
What I've sometimes done when I'm writing a parser for textual floats is, I parse the input into separate parts (the integer part, the fractional part, the exponent part), then serialize those parts into some other format which I already have a parser for. So I may serialize them into a JSON-style number and use a JSON library to parse it if I have that handy, or if I don't, I serialize it into a form that's guaranteed to work with strtod regardless of locale. (The C standard does, surprisingly, quite significantly constrain how locales can affect strtod's number parsing.)
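A minimal sketch of that re-serialization trick, assuming the three parts have already been validated as digit strings (the helper name is made up): by gluing the digits together and folding the decimal point into the exponent, the string handed to strtod contains no radix character at all, so the locale's decimal point never matters.
```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* int_part and frac_part are ASCII digit strings; exp_part may be empty or signed.
 * Example: "12", "345", "6"  ->  "12345e3"  ->  12.345e6 */
static double parts_to_double(const char *int_part, const char *frac_part,
                              const char *exp_part, int negative) {
    long exponent = exp_part[0] ? strtol(exp_part, NULL, 10) : 0;
    exponent -= (long)strlen(frac_part);   /* shift the point out of existence */

    char buf[512];
    snprintf(buf, sizeof buf, "%s%s%se%ld",
             negative ? "-" : "", int_part, frac_part, exponent);
    return strtod(buf, NULL);
}
```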
Here's a weird idea that has occurred to me from time to time. What if your editor could recognize a binary float, display it in a readable format, and allow you to edit it, but leave it as binary when the file is saved?
Maybe it's discipline-specific, but with the reasonable care in handling floats that most people are taught, I've never had a consequential mishap.
I don't know how you would do that in practice, since every valid sequence of 4 or 8 bytes is a valid float. Maybe you could exclude some of the more unusual NaN representations but it still leaves you with most byte sequences being floats.
For example, the ASCII string "Morn", stored as the bytes '0b01001101 0b01101111 0b01110010 0b01101110', could be interpreted as the 32-bit float 0b01101110011100100110111101001101, representing the number 1.8757481691240478e+28.
So you couldn't really just have smart "float recognition" built in to an editor as a general feature; you would need some special format which the editor understands which communicates "the following 4 bytes are a single-precision float" or "the following 8 bytes are a double-precision float".
Good point. It strikes me that Unicode is already analogous to recognition of multi-byte characters. Maybe something along similar lines would work.
You can use the hexadecimal floating-point literal format introduced in the C language in C99 [^1]. For example, `0x1.2p3` represents `9.0`.
[^1]: https://cppreference.com/w/c/language/floating_constant.html
To me this just sounds like an endianness nightmare waiting to happen.
Just add the conversion to your serialiser and deserialiser; it's really, really not hard in comparison to parsing and serialising textual floats. You do have to be aware of endianness, although it's really not hard to handle: pick one. Your code always knows what endian the file format is, so it always knows how to read it.
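For what it's worth, a minimal sketch of such serialiser/deserialiser helpers, assuming IEEE 754 doubles and a file format defined as little-endian (the helper names are made up):
```c
#include <stdint.h>
#include <string.h>

/* Serialize a double as 8 little-endian bytes, regardless of host endianness. */
static void put_f64_le(uint8_t out[8], double v) {
    uint64_t bits;
    memcpy(&bits, &v, sizeof bits);          /* assumes IEEE 754 binary64 */
    for (int i = 0; i < 8; i++)
        out[i] = (uint8_t)(bits >> (8 * i)); /* byte 0 = least significant */
}

/* Read the same 8 bytes back into a double. */
static double get_f64_le(const uint8_t in[8]) {
    uint64_t bits = 0;
    for (int i = 0; i < 8; i++)
        bits |= (uint64_t)in[i] << (8 * i);
    double v;
    memcpy(&v, &bits, sizeof v);
    return v;
}
```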
Couldn't you just write the hex bytes? That would be unambiguous, and it wouldn't lose precision.
That's still a binary format in my eyes, it's not human readable or writable. You've just written the binary format as hex.
At that point human-readability would suffer.
Reading hex floats isn't the most intuitive, but it isn't that bad.
An amusing anecdote is that Alan Turing taught himself to read numbers in the internal representation used by the computer, rather than allowing the waste of equipment, time, or labor to translate them.
Spent the weekend with an untagged chunked format, and... I rather hate it.
A friend wanted a newer save viewer/editor for Dragonball Xenoverse 2, because there's about a total of two, and they're slow to update.
I thought it'd be fairly easy to spin up something to read it, because I've spun up a bunch of save editors before, and they're usually trivial.
XV2 save files change over versions. They're also just arrays of structs [0], that don't properly identify themselves, so some parts of them you're just guessing. Each chunk can also contain chunks - some of which are actually a network request to get more chunks from elsewhere in the codebase!
[0] Also encrypted before dumping to disk, but the keys have been known since about the second release, and they've never switched them.
>Most extensions have three characters, which means the search space is pretty crowded. You may want to consider using four letters.
Is there a reason not to use a lot more characters? If your application's name is MustacheMingle, call the file foo.mustachemingle instead of foo.mumi?
This will decrease the probability of collision to almost zero. I am unaware of any operating systems that don't allow it, and it will be 100% clear to the user which application the file belongs to.
It will be less aesthetically pleasing than a shorter extension, but that's probably mainly a matter of habit. We're just not used to longer file name extensions.
Any reason why this is a bad idea?
A 14-character extension might cause UX issues in desktop environments and file managers, where screen real estate per directory entry is usually very limited.
When under pixel pressure, a graphical file manager might choose to prioritize displaying the file extension and truncate only the base filename. This would help the user identify file formats. However, the longer the extension, the less space remains for the base name. So a low-entropy file extension with too many characters can contribute to poor UX.
> it will be 100% clear to the user which application the file belongs to.
The most popular operating system hides it from the user, so clarity would not improve in that case. At least one other (Linux) doesn't really use "extensions" and instead relies on magic headers inside the files to determine the format.
Otherwise I think the decision is largely aesthetic. If you value absolute clarity, then I don't see any reason it won't work, it'll just be a little "ugly".
I don't even think it's ugly. I'm incredibly thankful every time I see someone make e.g. `db.sqlite`, it immediately sets me at ease to know I'm not accidentally dealing with a DuckDB file or something.
Yes, oh my god. Stop using .db for SQLite files!!! It's too generic and it's already used by Windows for those thumbnail system files.
>The most popular operating system hides it from the user, so clarity would not improve in that case.
If you mean Windows, that's not entirely correct. It defaults to hiding only "known" file extensions, like txt, jpg and such. (Which IMO is even worse than hiding all of them; that would at least be consistent.)
EDIT: Actually, I just checked and apparently an extension, even an exotic one, becomes "known" when it's associated with a program, so your point still stands.
> At least one other (Linux) doesn't really use "extensions" and instead relies on magic headers inside the files to determine the format.
Mostly for executable files. I doubt many Linux apps look inside a .py file to see if it's actually a JPEG they should build a thumbnail for.
Your doubts are incorrect. There's a fairly standard way of extracting the file type out of files on Linux, which relies on a mix of extensions and magic bytes. Here's where you can start to read about this:
https://wiki.archlinux.org/title/XDG_MIME_Applications
A lot of apps implement this (including most file managers).
I'm a little surprised that that link doesn't go to libmagic[1]. No doubt XDG_MIME is an important spec for desktop file detection, but I think libmagic and the magic database that underpins it are more fundamental to filetype detection in general.
It's also one of my favorite oddities on Linux. If you're a Windows user, the idea of a database of signatures for filetypes that exists outside the application that "owns" a file type is novel and weird.
[1]: https://man7.org/linux/man-pages/man3/libmagic.3.html
libmagic maintains its own separate database from XDG. The XDG db is meant from the ground up to have other apps add to it, etc., so that is the one apps usually use as a library if they want to integrate nicely and correctly with other installed apps. libmagic is the hackier of the two :)
It's tedious to type when you want to do `ls *.mustachemingle` or similar.
It's prone to get cut off in UIs with dedicated columns for file extensions.
As you say, it's unconventional and therefore risks not being immediately recognized as a file extension.
On the other hand, Java uses .properties as a file extension, so there is some precedent.
> call the file foo.mustachemingle
You could go the whole Java way then: foo.com.apache.mustachemingle
> Any reason why this is a bad idea
The focus should be on the name, not on the extension.
Why should a file format be locked down to one specific application? Both species are needed:
Generic, standardized formats like "jpg" and "pdf", and
Application-specific formats like extension files or state files for your program, that you do not wish to share with competitors.
I think the Mac got this right (before Mac OS X) and has since screwed it up. Every file had both a creator code and a type code. So, for every file, you would know which application created it and also which format it was.
So, double-clicking the file opened it in the application it was made in, but the Mac would also know which other applications could open that file.
For archive formats, or anything that has a table of contents or an index, consider putting the index at the end of the file so that you can append to it without moving a lot of data around. This also allows for easy concatenation.
What would probably allow for even easier concatenation is to store the header of each file immediately preceding the data of that file. You can build an index in memory when reading the file if that is helpful for your use.
This would require a separate seek and read operation per archive member, each yielding only one directory entry, rather than a few read operations to load the whole directory at once.
Why not put it at the beginning, so that it is available at the start of the file stream? That way it arrives first, and you know which other ranges of the file you may need.
>This also allows for easy concatenation.
How would it be easier than putting it at the front?
Files are... Flat streams. Sort of.
So if you rewrite an index at the head of the file, you may end up having to rewrite everything that comes afterwards, to push it further down in the file, if it overflows any padding offset. Which makes appending an extremely slow operation.
Whereas seeking to end, and then rewinding, is not nearly as costly.
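As an illustration of the append-friendly layout being discussed, a minimal sketch of a fixed-size footer that points at an index written at the end of the file (the magic value, struct, and helpers are made up for this example; a real format would serialize the fields explicitly with a fixed endianness rather than dumping the struct):
```c
#include <stdint.h>
#include <stdio.h>

#define FOOTER_MAGIC 0x58444E49u   /* made-up marker */

/* Fixed-size footer written as the very last bytes of the file. */
struct footer {
    uint64_t index_offset;  /* where the index starts */
    uint64_t index_size;    /* how many bytes it spans */
    uint32_t magic;         /* sanity check */
};

/* After appending new entries and a fresh index, finish with a new footer. */
static int write_footer(FILE *f, uint64_t index_offset, uint64_t index_size) {
    struct footer ft = { index_offset, index_size, FOOTER_MAGIC };
    return fwrite(&ft, sizeof ft, 1, f) == 1;
}

/* Readers seek to the end, read the footer, then jump straight to the index. */
static int read_footer(FILE *f, struct footer *ft) {
    if (fseek(f, -(long)sizeof *ft, SEEK_END) != 0) return 0;
    if (fread(ft, sizeof *ft, 1, f) != 1) return 0;
    return ft->magic == FOOTER_MAGIC;
}
```
Appending then means: write the new data, write a new index covering everything, write a new footer. The stale index in the middle of the file is simply never read again.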
You can do it via fallocate(2) FALLOC_FL_INSERT_RANGE and FALLOC_FL_COLLAPSE_RANGE but sadly these still have a lot of limitations and are not portable. Based on discussions I've read, it seems there is no real motivation for implementing support for it, since anyone who cares about the performance of doing this will use some DB format anyway.
In theory, files should be just unrolled linked lists (or trees) of bytes, but I guess a lot of internal code still assumes full, aligned blocks.
Most workflows do not modify files in place but rather create new files, as it's safer and allows you to go back to the original if you made a mistake.
If you're writing twice, you don't care about the performance to begin with. Or the size of the files being produced.
But if you're writing indices, there's a good chance that you do care about performance.
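A minimal sketch of the fallocate(2) approach mentioned above, for shifting existing data down to make room for a bigger index at the front of a file (Linux-only; both offset and length must be multiples of the filesystem block size, and not every filesystem supports it):
```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Shift everything from `offset` onwards down by `gap` bytes, creating a hole
 * we can then overwrite with the enlarged index. */
static int insert_gap(const char *path, off_t offset, off_t gap) {
    int fd = open(path, O_RDWR);
    if (fd < 0) { perror("open"); return -1; }

    /* offset and gap must both be multiples of the filesystem block size. */
    if (fallocate(fd, FALLOC_FL_INSERT_RANGE, offset, gap) != 0) {
        perror("fallocate(FALLOC_FL_INSERT_RANGE)");
        close(fd);
        return -1;
    }
    close(fd);
    return 0;
}
```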
Files are often authored once and read/used many times. When authoring a file, performance is less important and there is plenty of file space available. Indices are for the performance of using the file, which is more important than the performance of authoring it.
If storage and performance aren't a concern when writing, then you probably shouldn't be doing workarounds to include the index in the file itself. Follow the dbm approach and separate the two into different files.
Which is what dbm, bdb, Windows search indexes, IBM datasets, and so many, many other standards will do.
Separate files isn't always the answer. It can be more awkward to need to download both and always keep them together compared to when it's a single file.
If the archive is being updated in place, turning ABC# into ABCD#' (where # and #' are indices) is easier than turning #ABC into #'ABCD. The actual position of indices doesn't matter much if the stream is seekable. I don't think the concatenation is a good argument though.
Imagine you have a 12 GB zip file, and you want to add one more file to it. Very easy and quick if the index is at the end, very slow if it's at the start (assuming your index now needs more space than is currently available).
Reading the index from the end of the file is also quick; where you read next depends on what you are trying to find in it, which may not be the start.
Some formats are meant to be streamable. And if the stream is not seekable, then you have to read all 12 GB before you get to the index.
The point is, not all is black and white. Where to put the index is just another trade-off.
And most of them aren't. And even those that are - it's much easier to implement the ability to retrieve the last chunk of file than to deal with significant performance degradation of forced file rewrites.
Think about a format that has all those properties and you've used - PDF. PDFs the size of several 100s of MB aren't rare. Now imagine how it works in your world:
* Add a note? Wait for the file to be completely rewritten and burn 100s of MB of your data to sync to iCloud/Drive.
* Fill a form? Same.
* Add an annotation with your Apple Pencil? Yup, same.
Now look at how it works right now:
* Add a text? Fill a form? Add a drawing? A few KB of data is appended and uploaded.
* Sign the document to confirm authenticity? You got it, a few KB of data at the end.
* Determine which data was added after the document was signed and sign it with another cert? A few bytes.
Do you need to stream the PDF? Load the last chunk to detect the dictionary. If you don't want to do that, configure PDF writer to output the dictionary at the start and you still end up with a better solution.
Different trade-offs is why it might make sense to embrace the Unix way for file formats: do one thing well, and document it so that others can do a different thing well with the same data and no loss.
For example, if it is an archival/recording oriented use case, then you make it cheap/easy to add data and possibly add some resiliency for when recording process crashes. If you want efficient random access, streaming, storage efficiency, the same dataset can be stored in a different layout without loss of quality—and conversion between them doesn’t have to be extremely optimal, it just should be possible to implement from spec.
Like, say, you record raw video. You want “all of the quality” and you know all in all it’s going to take terabytes, so bringing excess capacity is basically a given when shooting. Therefore, if some camera maker, in its infinite wisdom, creates a proprietary undocumented format to sliiightly improve on file size but “accidentally” makes it unusable in most software without first converting it using their own proprietary tool, you may justifiably not appreciate it. (Canon Cinema Raw Light HQ—I kid you not, that’s what it’s called—I’m looking at you.)
On this note, what are the best/accepted approaches out there when it comes to documenting/speccing out file formats? Ideally something generalized enough that it can also handle cases where the “file” is in fact a particularly structured directory (a la macOS app bundle).
Adding to the recording _raw_ video point, for such purposes, try to design the format so that losing a portion of the file doesn't render it entirely unusable. Kinda like how you can recover DV video from spliced tapes because the data for the current frame (+/- the bordering frame) is enough to start a valid new file stream.
That’s true, but streamable formats often don’t need an index.
A team member just created a new tool that uses the tar format (streamable), but then puts the index as the penultimate entry, with the last entry just being a fixed size entry with the offset of the beginning of the index.
In this way normal tar tools just work but it’s possible to retrieve a listing and access a file randomly. It’s also still possible to append to it in the future, modulo futzing with the index a bit.
(The intended purpose is archiving files that were stored as S3 objects back into S3.)
Yes, a good point. Each file format must try to optimise for the use cases it supports, of course.
Make the index a linked data structure. You can then extend it whenever, wherever.
> How would it be easier than putting it at the front?
Have you ever wondered why `tar` is the Tape Archive? Tape. Magnetic recording tape. You stream data to it, and rewinding is Hard, so you put the list of files you just dealt with at the very end. This now-obsolete hardware expectation touches us decades later.
tar streams don't have an index at all, actually, they're just a series of header blocks and data blocks. Some backup software built on top may include a catalog of some kind inside the tar stream itself, of course, and may choose to do so as the last entry.
IIRC, the original TAR format was just writing the 'struct stat' from sys/stat.h, followed by the file contents for each file.
But new file formats being developed are most likely not going to be designed to be used with tapes. If you want to avoid rewinds you can write a new concatenated version of the files. This also allows you to keep the original in case you need it.
Sometimes, you'll need to pack multiple files inside of a single file. Those files will need to grow, and be able to be deleted.
At that point, you're asking for a filesystem inside of a file. And you can literally do exactly that with a filesystem library (FAT32, etc).
Consider the DER format. Partial parsing is possible; you can easily ignore any part of the file that you do not care about, since the framing is consistent. Additionally, it works like the "chunked" formats mentioned in the article, and one of the bits of the header indicates whether it includes other chunks or includes data. (Furthermore, I made up a text-based format called TER which is intended to be converted to DER. TER is not intended to be used directly; it is only intended to be converted to DER for later use in other programs. I had also made up some additional data types, and one of these (called ASN1_IDENTIFIED_DATA) can be used for identifying the format of a file (which might conform to multiple formats, and it allows this too).)
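To illustrate the consistent framing being described, a minimal sketch of decoding a DER identifier-and-length header; bit 6 (0x20) of the identifier octet is the flag that says whether the contents are nested TLVs or primitive data (low-tag-number form only, for brevity):
```c
#include <stddef.h>
#include <stdint.h>

struct der_header {
    int    constructed;  /* 1 = contains nested TLVs, 0 = primitive data */
    int    tag;          /* tag number (low-tag-number form only) */
    size_t length;       /* length of the contents */
    size_t header_len;   /* bytes consumed by identifier + length octets */
};

/* Returns 0 on success, -1 on truncated or unsupported input. */
static int der_read_header(const uint8_t *p, size_t n, struct der_header *h) {
    if (n < 2) return -1;
    h->constructed = (p[0] & 0x20) != 0;
    h->tag = p[0] & 0x1f;
    if (h->tag == 0x1f) return -1;        /* high tag numbers not handled here */

    size_t i = 1;
    if (p[i] < 0x80) {                    /* short form: one length octet */
        h->length = p[i++];
    } else {                              /* long form: 0x80 | count, then count octets */
        int count = p[i++] & 0x7f;
        if (count == 0 || count > (int)sizeof(size_t) || i + count > n) return -1;
        h->length = 0;
        for (int k = 0; k < count; k++)
            h->length = (h->length << 8) | p[i++];
    }
    h->header_len = i;
    return 0;
}
```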
I dislike JSON and some other modern formats (even binary formats); they often are just not as good in my opinion. One problem is they tend to insist on using Unicode, and/or on other things (e.g. 32-bit integers where you might need 64-bits). When using a text-based format where binary would do better, it can also be inefficient especially if binary data is included within the text as well, especially if the format does not indicate that it is meant to represent binary data.
However, even if you use an existing format, you should avoid using the existing format badly; using existing formats badly seems to be common. There is also the issue of if the existing format is actually good or not; many formats are not good, for various reasons (some of which I mentioned above, but there are others, depending on the application).
About target hardware, not all software is intended for a specific target hardware, although some is.
For compression, another consideration is: there are general compression schemes as well as being able to make up a compression scheme that is specific for the kind of data that is being compressed.
They also mention file names. However, this can also depend on the target system; e.g. for DOS files you will need to be limited to three characters after the dot. Also, some programs would not need to care about file names in some or all cases (many programs I write don't care about file names).
Maybe it's just because I've never needed the complexity, but ASN.1 seems a bit much for any of the formats I've created.
The ASN.1 format itself is pretty well-suited for generic file types. Unfortunately, there are very few good, open source/free ASN.1 (de)serializers out there.
In theory you could use ASN.1 DER files the same way you would JSON for human-readable formats. In practice, you're better off picking a different format.
Modern evolutions of ASN.1 like ProtoBuf or Cap'n Proto designed for transmitting data across the network might fit this purpose pretty well, too.
On the other hand, using ASN.1 may be a good way to make people trying to reverse engineer your format give up in despair, especially if you start using the quirks ASN.1 DER comes with and change the identifiers.
> Unfortunately, there are very few good, open source/free ASN.1 (de)serializers out there.
I wrote a library to read/write DER, which I have found suitable for my uses. (Although, I might change or add some things later, and possibly also some things might be removed too if I think they are unnecessary or cause problems.)
(You can complain about it if there is something that you don't like.)
> In theory you could use ASN.1 DER files the same way you would JSON for human-readable formats. In practice, you're better off picking a different format.
I do use ASN.1 DER for some things, because, in my opinion it is (generally) better than JSON, XML, etc.
> Modern evolutions of ASN.1 like ProtoBuf or Cap'n Proto designed for transmitting data across the network might fit this purpose pretty well, too.
I have found them to be unsuitable, with many problems, and that ASN.1 does them better in my experience.
> On the other hand, using ASN.1 may be a good way to make people trying to reverse engineer your format give up in despair, especially if you start using the quirks ASN.1 DER comes with and change the identifiers.
For me too, although you only need to use (and implement) the parts which are relevant for your application and not all of them, so it is not really the problem. (I also never needed to write ASN.1 schemas, and a full implementation of ASN.1 is not necessary for my purpose.) (This is also a reason I use DER instead of BER, even if canonical form is not required; DER is simpler to handle than all of the possibilities of BER.)
On the contrary, loading everything from a database is the limit case of "partial parsing" with queries that read only a few pages of a few tables and indices.
From the point of view of the article, a SQLite file is similar to a chunked file format: the compact directory of what tables etc. it contains is more heavyweight than listing chunk names and lengths/offsets, but at least as fast, and loading only needed portions of the file is automatically managed.
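For reference, a minimal sketch of treating an SQLite file as a chunk store and pulling out just one named chunk; the `chunks` table and its schema are assumptions for this example, and only (roughly) the pages holding that row get read:
```c
#include <sqlite3.h>
#include <stdio.h>

/* Schema assumed: CREATE TABLE chunks(name TEXT PRIMARY KEY, data BLOB); */
static int print_chunk_size(const char *dbpath, const char *chunk_name) {
    sqlite3 *db = NULL;
    sqlite3_stmt *stmt = NULL;
    int rc = sqlite3_open_v2(dbpath, &db, SQLITE_OPEN_READONLY, NULL);
    if (rc != SQLITE_OK) goto done;

    rc = sqlite3_prepare_v2(db, "SELECT data FROM chunks WHERE name = ?1",
                            -1, &stmt, NULL);
    if (rc != SQLITE_OK) goto done;
    sqlite3_bind_text(stmt, 1, chunk_name, -1, SQLITE_STATIC);

    if (sqlite3_step(stmt) == SQLITE_ROW) {
        int nbytes = sqlite3_column_bytes(stmt, 0);  /* size of the BLOB */
        printf("%s: %d bytes\n", chunk_name, nbytes);
        rc = SQLITE_OK;
    } else {
        rc = SQLITE_NOTFOUND;
    }
done:
    sqlite3_finalize(stmt);
    sqlite3_close(db);
    return rc;
}
```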
Using SQLite as a container format is only beneficial when the file format itself is a composite, like word processor files which will include both the textual data and any attachments. SQLite is just a hindrance otherwise, like image file formats or archival/compressed file formats [1].
[1] SQLite's own sqlar format is a bad idea for this reason.
From my own experience SQLite works just fine as the container for an archive format.
It ends up having some overhead compared to established ones, but the ability to query over the attributes of 10000s of files is pretty nice, and definitely faster than the worst case of tar.
My archiver could even keep up with 7z in some cases (for size and access speed).
Implementing it is also not particularly tricky, and SQLite even allows streaming the blobs.
Making readers for such a format seems more accessible to me.
SQLite format itself is not very simple, because it is a database file format at its heart. By using SQLite you are unknowingly constraining your use case; for example you can indeed stream BLOBs, but you can't randomly access BLOBs because the SQLite format puts a large BLOB into pages in a linked list, at least when I checked last. And BLOBs are limited in size anyway (2GB at most), so streaming itself might not be that useful. The use of SQLite also means that you have to bring SQLite into your code base, and SQLite is not very small if you are just using it as a container.
> My archiver could even keep up with 7z in some cases (for size and access speed).
7z might feel slow because it enables solid compression by default, which trades decompression speed with compression ratio. I can't imagine 7z having a similar compression ratio with correct options though, was your input incompressible?
Yes, the limits are important to keep in mind, I should have contextualized that before.
For my case it happened to work out because it was a CDC based deduplicating format that compressed batches of chunks. Lots of flexibility with working within the limits given that.
The primary goal here was also making the reader as simple as possible whilst still having decent performance.
I think my workload is very unfair towards (typical) compressing archivers: small incremental additions, needs random access, indeed frequent incompressible files, at least if seen in isolation.
I've really brought up 7z because it is good at what it does, it is just (ironically) too flexible for what was needed. There's probably some way of getting it to perform way better here.
zpack is probably a better comparison in terms of functionality, but I didn't want to assume familiarity with that one. (Also I can't really keep up with it, my solution is not tweaked to that level, even ignoring the SQLite overhead)
My statement wasn't precise enough; you are correct that a random access API is provided. But it is ultimately connected to the `accessPayload` function in btree.c, whose comment mentions that:
** The content being read or written might appear on the main page
** or be scattered out on multiple overflow pages.
In other words, the API can read from multiple scattered pages without the caller knowing. That said, I see this can be considered enough for being random accessible, as the underlying file system would use similarly structured indices behind the scenes anyway... (But modern file systems do have consecutively allocated pages for performance.)
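Concretely, the random-access interface being discussed is SQLite's incremental blob I/O; a minimal sketch of reading a slice out of a large BLOB without loading the whole thing (the table and column names are assumptions carried over from the example above):
```c
#include <sqlite3.h>

/* Read `len` bytes starting at `offset` from chunks.data where rowid = `row`.
 * SQLite resolves the right pages (possibly overflow pages) under the hood. */
static int read_blob_slice(sqlite3 *db, sqlite3_int64 row,
                           void *buf, int len, int offset) {
    sqlite3_blob *blob = NULL;
    int rc = sqlite3_blob_open(db, "main", "chunks", "data", row,
                               0 /* read-only */, &blob);
    if (rc != SQLITE_OK) return rc;
    rc = sqlite3_blob_read(blob, buf, len, offset);
    sqlite3_blob_close(blob);
    return rc;
}
```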
> Acorn’s native file format is used to losslessly store layer data, editable text, layer filters, an optional composite of the image, and various metadata. Its advantage over other common formats such as PNG or JPEG is that it preserves all this native information without flattening the layer data or vector graphics.
As I've mentioned, this is a good use case for SQLite as a container. But ZIP would work equally well here.
I think it's fine as an image format. I've used the mbtiles format which is basically just a table filled with map tiles. Sqlite makes it super easy to deal with it, e.g. to dump individual blobs and save them as image files.
It just may not always be the most performant option. For example, for map tiles there is alternatively the pmtiles binary format which is optimized for http range requests.
Except image formats and archival formats are composites (data+metadata). We have Exif for images, and you might be surprised by how much metadata the USTar format has.
With that reasoning almost every format is a composite, which doesn't sound like a useful distinction. Such metadata should be fine as long as the metadata itself is isolated and can be updated without the parent format.
My reasoning for Exif was that it is not only auxiliary but also post-hoc. Exif was defined independently from image formats and only got adopted later because those formats provided extension points (JPEG APP# markers, PNG chunks).
You've got a good point that there are multiple types of metadata and some metadata might be crucial for interpreting data. I would say such "structural" metadata should be considered part of the data. I'm not saying it is not metadata; it is metadata inside some data, so it doesn't count for our purpose of defining a composite.
I also don't think tar hardlinks are metadata for our purpose, because a hardlink entry technically consists of the linked path plus the information that the file is a hardlink. The former is clearly data, and the latter is metadata used to reconstruct the original file system, so it should be considered part of a larger piece of data (in this case, the logical notion of a "file").
I believe these examples should be enough to derive my own definition of "composite". Please let me know otherwise.
Unless you are using the container file as a database too, sqlar is strictly inferior to ZIP in terms of pretty much everything [1]. I'm actually more interested in the context sqlar did prove useful for you.
I remember seeing the comment you linked a few years back, and back then comments were already locked so I couldn't reply. This time I sadly don't have the time to get deeper into this; however, I recommend researching more about sqlar and using an SQLite db as a _file format_ in general, or at minimum looking at the SQLite Encryption Extension (SEE) (https://www.sqlite.org/see/doc/trunk/www/readme.wiki). You can get a lot out of the box with very little investment. IMHO sqlar is not competing with ZIP (can ZIP do metadata and transactions?)
SEE is a proprietary extension, however generous its license is. So it is not very meaningful when sqlar is compared against ZIP. Not to say that I necessarily see encryption as a fundamental feature for compressed archive formats though---I'm advocating for age [1] integration instead.
> IMHO sqlar is not competing with ZIP (can zip do metadata and transactions?)
In my understanding SQLite's support for sqlar and ZIP was added at the same time, so I believe that sqlar was created to demonstrate an alternative to ZIP (and that the demonstration wasn't good enough). I'm aware that this is just circumstantial evidence, so let me know if you have something concrete.
ZIP can of course do metadata in the form of per-file and archive comments. For more structured metadata, you can make use of extra fields if you really, really want, but at that point SQLite would indeed be a better choice. I however doubt it's a typical use case.
ZIP can be partially updated in place but can't do any transactions. But it should be noted that SQLite handles transactions via additional files (`-journal` or `-wal` files). So both sqlar and ZIP would write to an additional file during the update process, though SQLite would write much less data compared to ZIP. Any remaining differences are invisible to end users, unless in-place updates are common enough, in which case the use of SQLite is justified.
Point is that SEE exists, and so do free alternatives.
> In my understanding SQLite's support for sqlar and ZIP occurred at the same time
I believe so too.
I agree with you that SQLAR is a poor general-purpose archive or compression format compared to ZIP; what I'm arguing is that it's a very good file format for certain applications, offering structured, modifiable, and searchable file storage. We had great success using it as the db/file format for a PLM solution packaged both as a desktop and a web app. The same database can then be used to power the web UI (single-tenant SaaS deployments) and the desktop app (a web export is simply a working file for the desktop app). This file being just a simple SQLite db lets users play with the data, do their own imports, migrations etc., while having all files & docs in one place.
Compression: For anything that ends up large it's probably desired. Though consider both algorithm and 'strength' based on the use case carefully. Even a simple algorithm might make things faster when it comes time to transfer or write to permanent storage. A high cost search to squeeze out yet more redundancy is probably worth it if something will be copied and/or decompressed many times, but might not be worth it for that locally compiled kernel you'll boot at most 10 times before replacing it with another.
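As a small illustration of the algorithm/strength trade-off, a sketch using zlib's compress2, where the level argument is exactly that knob (Z_BEST_SPEED favours speed, Z_BEST_COMPRESSION spends extra CPU hunting for redundancy); the helper name and buffer sizing are made up for this toy example:
```c
#include <stdio.h>
#include <zlib.h>

/* Compress `src` at the given level (Z_BEST_SPEED .. Z_BEST_COMPRESSION). */
static int compress_at_level(const unsigned char *src, uLong src_len, int level) {
    unsigned char dst[1 << 16];              /* fine for this toy example */
    if (compressBound(src_len) > sizeof dst) return -1;

    uLongf dst_len = sizeof dst;
    if (compress2(dst, &dst_len, src, src_len, level) != Z_OK) return -1;
    printf("level %d: %lu -> %lu bytes\n", level, (unsigned long)src_len,
           (unsigned long)dst_len);
    return 0;
}
```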
If your data format contains multiple streams inside, consider ZIP for the container. Enables standard tools, and libraries available in all languages. The compression support is built-in but optional, can be enabled selectively for different entries.
The approach is widely used in practice. MS office files, Java binaries, iOS app store binaries, Android binaries, epub books, chm documentation are all using ZIP container format.
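For example, listing the entries of such a container with libzip (other ZIP libraries, in any language, look roughly the same); entry data can then be decompressed on demand rather than up front:
```c
#include <stdio.h>
#include <zip.h>

static int list_container(const char *path) {
    int err = 0;
    zip_t *za = zip_open(path, ZIP_RDONLY, &err);
    if (!za) { fprintf(stderr, "zip_open failed: %d\n", err); return -1; }

    zip_int64_t n = zip_get_num_entries(za, 0);
    for (zip_uint64_t i = 0; i < (zip_uint64_t)n; i++) {
        zip_stat_t st;
        if (zip_stat_index(za, i, 0, &st) == 0)
            printf("%s (%llu bytes)\n", st.name, (unsigned long long)st.size);
    }
    zip_close(za);
    return 0;
}
```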
Designing your file (and data) formats well is important.
“Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won’t usually need your flowcharts; they’ll be obvious.”
Agreed on that one. With a nice file format, streamable is hopefully just a matter of ordering things appropriately once you know the sizes of the individual chunks. You want to write the index last, but you want to read it first. Perhaps you want the most influential values first if you're building something progressive (level-of-detail split.)
Similar is the discussion of delimited fields vs. length prefix. Delimited fields are nicer to write, but length prefixed fields are nicer to read. I think most new formats use length prefixes, so I'd start there. I wrote a blog post about combining the value and length into a VLI that also handles floating point and bit/byte strings: https://tommie.github.io/a/2024/06/small-encoding
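Since the thread keeps coming back to length prefixes, here is a minimal LEB128-style varint encoder/decoder of the kind such formats typically use (not the exact scheme from the linked post, just the common 7-bits-per-byte one):
```c
#include <stddef.h>
#include <stdint.h>

/* Encode v as 7 bits per byte, high bit = "more bytes follow". Returns bytes written. */
static size_t varint_encode(uint64_t v, uint8_t *out) {
    size_t n = 0;
    do {
        uint8_t b = v & 0x7f;
        v >>= 7;
        out[n++] = b | (v ? 0x80 : 0);
    } while (v);
    return n;
}

/* Returns bytes consumed, or 0 if the input is truncated or too long. */
static size_t varint_decode(const uint8_t *in, size_t len, uint64_t *out) {
    uint64_t v = 0;
    for (size_t i = 0; i < len && i < 10; i++) {
        v |= (uint64_t)(in[i] & 0x7f) << (7 * i);
        if (!(in[i] & 0x80)) { *out = v; return i + 1; }
    }
    return 0;
}
```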
I don't think a single encoding is generally useful. A good encoding for given application would depend on the value distribution and neighboring data. For example any variable-length scalar encoding would make vectorization much harder.
Depends if you're optimizing for storage size or code size, and in-memory vs transfer. This encoding was meant to optimize transfer (and perhaps storage.)
If you find yourself building a file format - you should read this page carefully and make sure that you have very good arguments which support why it does not apply to you.
> However, it's cleaner to have a field in your header that states where the first sub-chunk starts; that way you can expand your header as much as you like in future versions, with old code being able to ignore those fields and jump to the good stuff.
That’s assuming that parsers will honor this, and not just use the fixed offset that worked for the past ten years. This has happened often enough in the past.
I've had to do just that to retrofit features I wasn't allowed to think about up front (we must get the product out the door.... we'll cross that bridge when we get to it)
iNES file format is guilty of badly designed bit packing. Four flags were packed into the lower 4 bits, then Mapper Number was assigned to the high 4 bits. But then they needed more than 16 mappers. They used 4 high bits of the next byte to store the remaining 4 bits, and that was enough... until they needed over 256 mappers.
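For the curious, this is what that packing looks like when reading an iNES header: the mapper number ends up split across the high nibbles of bytes 6 and 7, and NES 2.0 later bolted four more bits onto byte 8.
```c
#include <stdint.h>

/* header points at the 16-byte iNES header ("NES\x1a" ...). */
static int ines_mapper(const uint8_t header[16], int nes2)
{
    int mapper = (header[6] >> 4)          /* mapper bits 0-3: high nibble of byte 6 */
               | (header[7] & 0xf0);       /* mapper bits 4-7: high nibble of byte 7 */
    if (nes2)
        mapper |= (header[8] & 0x0f) << 8; /* NES 2.0: mapper bits 8-11 in byte 8 */
    return mapper;
}
```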
The "Chunk your binaries" point is spot on. Creating a huge binary blob that contains everything makes it hard to work with in constrained environments.
Also, +1 for "Document your format". More like "Document everything". Future you will thank you for it for sure.
Also you should consider the context in which you are developing. Often there are "standard" tools and methods to deal with the kind of data you want to store.
E.g. if you are interested in storing significant amounts of structured floating point data, choosing something like HDF5 will not only make your life easier it will also make it easy to communicate what you have done to others.
Thinking about a file format is a good way to clarify your vision. Even if you don’t want to facilitate interop, you’d get some benefits for free—if you can encapsulate the state of a particular thing that the user is working on, you could, for example, easily restore their work when they return, etc.
Some cop-out (not necessarily in a bad way) file formats:
1. Don’t have a file format, just specify a directory layout instead. Example: CinemaDNG. Throw a bunch of particularly named DNGs (a file for each frame of the footage) in a directory, maybe add some metadata file or a marker, and you’re good. Compared to the likes of CRAW or BRAW, you lose in compression, but gain in interop.
2. Just dump runtime data. Example: Mnemosyne’s old format. Do you use Python? Just dump your state as a Python pickle. (Con: dependency on a particular runtime, good luck rewriting it in Rust.)
3. Almost dump runtime data. Example: Anki, newer Mnemosyne with their SQLite dumps. (Something suggests to me that they might be using SQLite at runtime.) A step up from a pickle in terms of interop, somewhat opens yourself (but also others) to alternative implementations, at least in any runtime that has the means to read SQLite. I hope if you use this you don’t think that the presence of SQL schema makes the format self-documenting.
4. One or more of the above, except also zip or tar it up. Example: VCV, Anki.
About 1, directory of files, many formats these days are just a bunch of files in a ZIP. One thing most applications lack unfortunately is a way to instead just read and write the part files from/to a directory. For one thing it makes it much better for version control, but also just easier to access in general when experimenting. I don't understand why this is not more common, since as a developer it is much more fun to debug things when each thing is its own file rather than an entry in an archive. Most times it is also trivial to support both, since any API for accessing directory entries will be close to 1:1 to an API for accessing ZIP entries anyway.
When editing a file locally I would prefer to just have it split up in a directory 99% of the time, only exporting to a ZIP to publish it.
Of course it is trivial to write wrapper scripts to keep zipping and unzipping files, and I have done that, but it does feel a bit hacky and should be an unnecessary extra step.
Yes, the zipped version is number four. It’s not great for the reason you noted. Some people come up with smudge/clean filters that handle the (de)compression, letting Git store the more structured version of the data even though your working directory contains the compressed files your software can read and write—but I don’t know how portable these things are. I agree with you in general, and it is also why my number one example is that you might not need a single-file format at all. macOS app bundles is a great example of this approach in the wild.
One question I was hoping to ask anyone who thought about these matters: what accepted approaches do exist out there when it comes to documenting/speccing out file formats? Ideally, including the cases where the “file” is in fact a directory with a specific layout.
> 2. Just dump runtime data. Example: Mnemosyne’s old format. Do you use Python? Just dump your state as a Python pickle. (Con: dependency on a particular runtime, good luck rewriting it in Rust.)
Be particularly careful with this one as it can potentially vastly expand the attack surface of your program. Not that you shouldn't ever do it, just make sure the deserializer doesn't accept objects/values outside of your spec.
I certainly hope no one takes my list as an endorsement… It’s just some formats seen in the wild.
It should be noted (the article does not) that parsing and deserialisation is generally a known weak area and a common source of CVEs, even when pickling is not used. Being more disciplined about it helps, of course.
For Open-Source projects, human readable file formats are actively harmful.
This mostly is motivated by my experience with KiCad. Principally, there are multiple things that the UI does not expose at all (slots in PCB footprint files) where the only way to add them is to manually edit the footprint file in a text editor.
There are some other similar annoyances in the same vein.
Basically, human readable (and therefore editable) file formats wind up being a way for some things to never be exposed thru the UI. This actively leads to the software being less capable.
Not exposing things in the UI is not necessarily a problem (it depends on the program and on other stuff), although it can be (especially if it is not documented). I had not used the program you mention, but it does seem to be a problem in the way you describe, though someone who wants to add it to the UI could hopefully do so since it is FOSS. However, one potential problem is that if it is a text-based format, writing such a format (in a way which still remains readable rather than messy) can sometimes be more complicated than reading it.
(The TEMPLATE.DER lump (which is a binary file format and not plain text) in Super ZZ Zero is not exposed anywhere in the UI; you must use an external program to create this lump if you want it. Fortunately that lump is not actually mandatory, and only affects the automatic initial modifications of a new world file based on an existing template.)
However, I think that human readable file formats are harmful for other reasons.
Generally good points. Unfortunately existing file formats are rarely following these rules. In fact these rules should form naturally when you are dealing with many different file formats anyway. Specific points follow:
- Agreed that human-readable formats have to be dead simple, otherwise binary formats should be used. Note that textual numbers are surprisingly complex to handle, so any formats with significant number uses should just use binary.
- Chunking is generally good for structuring and incremental parsing, but do not expect it to provide reorderability or back/forward compatibility somehow. Unless explicitly designed, they do not exist. Consider PNG for example; PNG chunks were designed to be quite robust, but nowadays some exceptions [1] do exist. Versioning is much more crucial for that.
[1] https://www.w3.org/TR/png/#animation-information
- Making a new file format from scratch is always difficult. Already mentioned, but you should really consider using existing file formats as a container first. Some formats are even explicitly designed for this purpose, like sBOX [2] or RFC 9277 CBOR-labeled data tags [3].
[2] https://nothings.org/computer/sbox/sbox.html
[3] https://www.rfc-editor.org/rfc/rfc9277.html
> Note that textual numbers are surprisingly complex to handle, so any formats with significant number uses should just use binary.
Especially true of floats!
With binary formats, it's usually enough to only support machines whose floating point representation conforms to IEEE 754, which means you can just memcpy a float variable to or from the file (maybe with some endianness conversion). But writing a floating point parser and serializer which correctly round-trips all floats and where the parser guarantees that it parses to the nearest possible float... That's incredibly tricky.
What I've sometimes done when I'm writing a parser for textual floats is, I parse the input into separate parts (so the integer part, the floating point part, the exponent part), then serialize those parts into some other format which I already have a parser for. So I may serialize them into a JSON-style number and use a JSON library to parse it if I have that handy, or if I don't, I serialize it into a format that's guaranteed to work with strtod regardless of locale. (The C standard does, surprisingly, quite significantly constrain how locales can affect strtod's number parsing.)
Here's a weird idea that has occurred me from time to time. What if your editor could recognize a binary float, display it in a readable format, and allow you to edit it, but leave it as binary when the file is saved.
Maybe it's discipline-specific, but with the reasonable care in handling floats that most people are taught, I've never had a consequential mishap.
I don't know how you would do that in practice, since every valid sequence of 4 or 8 bytes is a valid float. Maybe you could exclude some of the more unusual NaN representations but it still leaves you with most byte sequences being floats.
For example, the ASCII string "Morn", stored as the bytes '0b01001101 0b01101111 0b01110010 0b01101110', could be interpreted as the 32-bit float 0b01101110011100100110111101001101, representing the number 1.8757481691240478e+28.
So you couldn't really just have smart "float recognition" built in to an editor as a general feature, you would need some special format which the editor understands which communicates "the following 4 bytes is a single-precision float" or "the following 8-byte is a double-precision float".
Good point. It strikes me that Unicode is already analogous to recognition of multi byte characters. Maybe something along similar lines would work.
You can use hexadecimal floating-point literal format that introduced in the C language since C99[^1].
[^1]: https://cppreference.com/w/c/language/floating_constant.html
For example, `0x1.2p3` represents `9.0`.
To me this just sounds like an endianness nightmare waiting to happen.
Just add this to your serialiser and deserialiser:
It's really, really not hard in comparison to parsing and serialising textual floats.You do have to be aware of endianness, although it's really not hard to handle it: pick one. Your code always knows what endian the file format is, so it always knows how to read it.
Couldn't you just write the hex bytes? That would be unambiguous, and it wouldn't lose precision.
That's still a binary format in my eyes, it's not human readable or writable. You've just written the binary format as hex.
At that point human-readability would suffer.
reading hex floats isn't the most intuitive, but isn't that bad.
An amusing anecdote is that Alan Turing taught himself to read numbers in the internal representation used by the computer, rather than allowing the waste of equipment, time, or labor to translate them.
Spent the weekend with an untagged chunked format, and... I rather hate it.
A friend wanted a newer save viewer/editor for Dragonball Xenoverse 2, because there's about a total of two, and they're slow to update.
I thought it'd be fairly easy to spin up something to read it, because I've spun up a bunch of save editors before, and they're usually trivial.
XV2 save files change over versions. They're also just arrays of structs [0], that don't properly identify themselves, so some parts of them you're just guessing. Each chunk can also contain chunks - some of which are actually a network request to get more chunks from elsewhere in the codebase!
[0] Also encrypted before dumping to disk, but the keys have been known since about the second release, and they've never switched them.
>Most extensions have three characters, which means the search space is pretty crowded. You may want to consider using four letters.
Is there a reason not to use a lot more characters? If your application's name is MustacheMingle, call the file foo.mustachemingle instead of foo.mumi?
This will decrease the probability of collision to almost zero. I am unaware of any operating systems that don't allow it, and it will be 100% clear to the user which application the file belongs to.
It will be less aesthetically pleasing than a shorter extension, but that's probably mainly a matter of habit. We're just not used to longer file name extensions.
Any reason why this is a bad idea?
A 14-character extension might cause UX issues in desktop environments and file managers, where screen real estate per directory entry is usually very limited.
When under pixel pressure, a graphical file manager might choose to prioritize displaying the file extension and truncate only the base filename. This would help the user identify file formats. However, the longer the extension, the less space remains for the base name. So a low-entropy file extension with too many characters can contribute to poor UX.
> it will be 100% clear to the user which application the file belongs to.
The most popular operating system hides it from the user, so clarity would not improve in that case. At leat one other (Linux) doesn't really use "extensions" and instead relies on magic headers inside the files to determine the format.
Otherwise I think the decision is largely aestethic. If you value absolute clarity, then I don't see any reason it won't work, it'll just be a little "ugly"
I don't even think it's ugly. I'm incredibly thankful every time I see someone make e.g. `db.sqlite`, it immediately sets me at ease to know I'm not accidentally dealing with a DuckDB file or something.
Yes, oh my god. Stop using .db for Sqlite files!!! It’s too generic and it’s already used by Windows for those thumbnail system files.
>The most popular operating system hides it from the user, so clarity would not improve in that case.
If you mean Windows, that's not entirely correct. It defaults to hiding only "known" file extensions, like txt, jpg and such. (Which IMO is even worse than hiding all of them; that would at least be consistent.)
EDIT: Actually, I just checked and apparently an extension, even an exotic one, becomes "known" when it's associated with a program, so your point still stands.
> At leat one other (Linux) doesn't really use "extensions" and instead relies on magic headers inside the files to determine the format.
mostly for executable files.
I doubt many Linux apps look inside a .py file to see if it's actually a JPEG they should build a thumbnail for.
Your doubts are incorrect. There's a fairly standard way of extracting the file type out of files on linux, which relies on a mix of extensions and magic bytes. Here's where you can start to read about this:
https://wiki.archlinux.org/title/XDG_MIME_Applications
A lot of apps implement this (including most file managers)
I'm a little surprised that that link doesn't go to libmagic[1]. No doubt XDG_MIME is an important spec for desktop file detection, but I think libmagic and the magic database that underpins it are more fundamental to filetype detection in general.
It's also one of my favorite oddities on Linux. If you're a Windows user the idea of a database of signatures for filetypes that exists outside the application that "owns" a file type is novel and weird.
[1]: https://man7.org/linux/man-pages/man3/libmagic.3.html
libmagic maintains its own separate database from xdg. XDG db is meant from the ground up to have other apps add to it etc… so that one is the one apps use as a library usually, if they want to integrate nicely and correctly with other installed apps. libmagic is the hacking of the two :)
It’s tedious to type when you want to do `ls *.mustachemingle` or similar.
It’s prone to get cut off in UIs with dedicated columns for file extensions.
As you say, it’s unconventional and therefore risks not being immediately recognized as a file extension.
On the other hand, Java uses .properties as a file extension, so there is some precedent.
> call the file foo.mustachemingle
You could go the whole java way then foo.com.apache.mustachemingle
> Any reason why this is a bad idea
the focus should be on the name, not on the extension.
Why should a file format be locked down to one specific application?
Both species are needed:
Generic, standardized formats like "jpg" and "pdf", and
Application-specific formats like extension files or state files for your program, that you do not wish to share with competitors.
I think the Mac got this right (before Mac OS X) and has since screwed it up. Every file had both a creator code and a type code. So, for every file, you would know which application created it and also which format it was.
So, double-clicking the file opened it in the application it was made in, but the Mac would also know which other applications could open that file.
For archive formats, or anything that has a table of contents or an index, consider putting the index at the end of the file so that you can append to it without moving a lot of data around. This also allows for easy concatenation.
What probably allows for even more easier concatenation would be to store the header of each file immediately preceding the data of that file. You can make a index in memory when reading the file if that is helpful for your use.
This would require a separate seek and read operation per archive member, each yielding only one directory entry, rather than very few read operation to load the whole directory at once.
Why not put it at the beginning so that it is available at the start of the filestream that way it is easier to get first so you know what other ranges of the file you may need?
>This also allows for easy concatenation.
How would it be easier than putting it at the front?
Files are... Flat streams. Sort of.
So if you rewrite an index at the head of the file, you may end up having to rewrite everything that comes afterwards, to push it further down in the file, if it overflows any padding offset. Which makes appending an extremely slow operation.
Whereas seeking to end, and then rewinding, is not nearly as costly.
You can do it via fallocate(2) FALLOC_FL_INSERT_RANGE and FALLOC_FL_COLLAPSE_RANGE but sadly these still have a lot of limitations and are not portable. Based on discussions I've read, it seems there is no real motivation for implementing support for it, since anyone who cares about the performance of doing this will use some DB format anyway.
In theory, files should be just unrolled linked lists (or trees) of bytes, but I guess a lot of internal code still assumes full, aligned blocks.
Most workflows do not modify files in place but rather create new files as its safer and allows you to go back to the original if you made a mistake.
If you're writing twice, you don't care about the performance to begin with. Or the size of the files being produced.
But if you're writing indices, there's a good chance that you do care about performance.
Files are often authored once and read / used many times. When authoring a file performance is less important and there is plenty of file space available. Indices are for the performance for using the file which is more important than the performance for authoring it.
If storage and concern aren't a concern when writing, then you probably shouldn't be doing workarounds to include the index in the file itself. Follow the dbm approach and separate both into two different files.
Which is what dbm, bdb, Windows search indexes, IBM datasets, and so many, many other standards will do.
Separate files isn't always the answer. It can be more awkward to need to download both and always keep them together compared to when it's a single file.
If the archive is being updated in place, turning ABC# into ABCD#' (where # and #' are indices) is easier than turning #ABC into #'ABCD. The actual position of indices doesn't matter much if the stream is seekable. I don't think the concatenation is a good argument though.
Imagine you have a 12Gb zip file, and you want to add one more file to it. Very easy and quick if the index is at the end, very slow if it's at the start (assuming your index now needs more space than is available currently).
Reading the index from the end of the file is also quick; where you read next depends on what you are trying to find in it, which may not be the start.
Some formats are meant to be streamable. And if the stream is not seekable, then you have to read all 12 Gb before you get to the index.
The point is, not all is black and white. Where to put the index is just another trade off.
And most of them aren't. And even those that are - it's much easier to implement the ability to retrieve the last chunk of file than to deal with significant performance degradation of forced file rewrites.
Think about a format that has all those properties and you've used - PDF. PDFs the size of several 100s of MB aren't rare. Now imagine how it works in your world:
* Add a note? Wait for the file to be completely rewritten and burn 100s of MB of your data to sync to iCloud/Drive.
* Fill a form? Same.
* Add an annotation with your Apple Pencil? Yup, same.
Now look at how it works right now:
- Add a text? Fill a form? Add a drawing? A few KB of data is appended and uploaded.
* Sign the document to confirm authenticy? You got it, a few KB of data at the end.
* Determine which data was added after the document was signed and sign it with another cert? A few bytes.
Do you need to stream the PDF? Load the last chunk to detect the dictionary. If you don't want to do that, configure PDF writer to output the dictionary at the start and you still end up with a better solution.
Different trade-offs is why it might make sense to embrace the Unix way for file formats: do one thing well, and document it so that others can do a different thing well with the same data and no loss.
For example, if it is an archival/recording oriented use case, then you make it cheap/easy to add data and possibly add some resiliency for when recording process crashes. If you want efficient random access, streaming, storage efficiency, the same dataset can be stored in a different layout without loss of quality—and conversion between them doesn’t have to be extremely optimal, it just should be possible to implement from spec.
Like, say, you record raw video. You want “all of the quality” and you know all in all it’s going to take terabytes, so bringing excess capacity is basically a given when shooting. Therefore, if some camera maker, in its infinite wisdom, creates a proprietary undocumented format to sliiightly improve on file size but “accidentally” makes it unusable in most software without first converting it using their own proprietary tool, you may justifiedly not appreciate it. (Canon Cinema Raw Light HQ—I kid you not, that’s what it’s called—I’m looking at you.)
On this note, what are the best/accepted approaches out there when it comes to documenting/speccing out file formats? Ideally something generalized enough that it can also handle cases where the “file” is in fact a particularly structured directory (a la macOS app bundle).
Adding to the recording _raw_ video point, for such purposes, try to design the format so that losing a portion of the file doesn't render it entirely unusable. Kinda like how you can recover DV video from spliced tapes because the data for the current frame (+/- the bordering frame) is enough to start a valid new file stream.
That’s true, but streamable formats often don’t need an index.
A team member just created a new tool that uses the tar format (streamable), but then puts the index as the penultimate entry, with the last entry just being a fixed size entry with the offset of the beginning of the index.
In this way normal tar tools just work but it’s possible to retrieve a listing and access a file randomly. It’s also still possible to append to it in the future, modulo futzing with the index a bit.
(The intended purpose is archiving files that were stored as S3 objects back into S3.?
Yes, a good point. Each file format must try to optimise for the use cases it supports of course.
make the index a linked data structure. You can then extend it whenever, wherever
> How would it be easier than putting it at the front?
Have you ever wondered why `tar` is the Tape Archive? Tape. Magnetic recording tape. You stream data to it, and rewinding is Hard, so you put the list of files you just dealt with at the very end. This now-obsolete hardware expectation touches us decades later.
tar streams don't have an index at all, actually, they're just a series of header blocks and data blocks. Some backup software built on top may include a catalog of some kind inside the tar stream itself, of course, and may choose to do so as the last entry.
IIRC, the original tar format was just writing the `struct stat` from `sys/stat.h`, followed by the file contents, for each file.
But new file formats being developed are most likely not going to be designed to be used with tapes. If you want to avoid rewinds you can write a new concatenated version of the files. This also allows you to keep the original in case you need it.
Sometimes, you'll need to pack multiple files inside of a single file. Those files will need to grow, and be able to be deleted.
At that point, you're asking for a filesystem inside of a file. And you can literally do exactly that with a filesystem library (FAT32, etc).
Consider the DER format. Partial parsing is possible; you can easily ignore any part of the file that you do not care about, since the framing is consistent. Additionally, it works like the "chunked" formats mentioned in the article, and one of the bits of the header indicates whether the element contains other chunks or contains data. (Furthermore, I made up a text-based format called TER which is intended to be converted to DER. TER is not intended to be used directly; it is only intended to be converted to DER for use in other programs. I have also made up some additional data types, and one of these (called ASN1_IDENTIFIED_DATA) can be used for identifying the format of a file (which might conform to multiple formats, and it allows this too).)
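As an illustration of how cheap that kind of partial parsing is, here is a minimal sketch of a DER tag-length-value walker; it ignores multi-byte tags and is not taken from any particular library:

```python
def der_iter(buf, pos=0, end=None):
    """Yield (tag, constructed, value_bytes) for each TLV at one nesting level."""
    end = len(buf) if end is None else end
    while pos < end:
        tag = buf[pos]
        constructed = bool(tag & 0x20)   # bit 6: constructed vs. primitive
        pos += 1
        # (Multi-byte tags, where the low 5 tag bits are all 1, are not handled here.)
        length = buf[pos]
        pos += 1
        if length & 0x80:                # long form: low 7 bits = number of length octets
            n = length & 0x7F
            length = int.from_bytes(buf[pos:pos + n], "big")
            pos += n
        yield tag, constructed, buf[pos:pos + length]
        pos += length                    # skipping anything you don't care about is trivial
```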
I dislike JSON and some other modern formats (even binary ones); they are often just not as good, in my opinion. One problem is that they tend to insist on using Unicode, and/or on other things (e.g. 32-bit integers where you might need 64 bits). Using a text-based format where binary would do better can also be inefficient, especially if binary data is embedded within the text and the format does not indicate that it is meant to represent binary data.
However, even if you use an existing format, you should avoid using it badly; using existing formats badly seems to be common. There is also the issue of whether the existing format is actually good or not; many formats are not good, for various reasons (some of which I mentioned above, but there are others, depending on the application).
About target hardware: not all software is intended for specific target hardware, although some is.
For compression, another consideration is that there are general-purpose compression schemes, and you can also make up a compression scheme specific to the kind of data being compressed.
They also mention file names. However, this can also depend on the target system; e.g. for DOS you are limited to three characters after the dot. Also, some programs need not care about file names in some or all cases (many programs I write don't care about file names).
Maybe it's just because I've never needed the complexity, but ASN.1 seems a bit much for any of the formats I've created.
The ASN.1 format itself is pretty well-suited for generic file types. Unfortunately, there are very few good, open source/free ASN.1 (de)serializers out there.
In theory you could use ASN.1 DER files the same way you would JSON for human-readable formats. In practice, you're better off picking a different format.
Modern evolutions of ASN.1 like ProtoBuf or Cap'n Proto designed for transmitting data across the network might fit this purpose pretty well, too.
On the other hand, using ASN.1 may be a good way to make people trying to reverse engineer your format give up in despair, especially if you start using the quirks ASN.1 DER comes with and change the identifiers.
> Unfortunately, there are very few good, open source/free ASN.1 (de)serializers out there.
I wrote a library to read/write DER, which I have found suitable for my uses. (Although, I might change or add some things later, and possibly also some things might be removed too if I think they are unnecessary or cause problems.)
https://github.com/zzo38/scorpion/blob/trunk/asn1/asn1.c https://github.com/zzo38/scorpion/blob/trunk/asn1/asn1.h
(You can complain about it if there is something that you don't like.)
> In theory you could use ASN.1 DER files the same way you would JSON for human-readable formats. In practice, you're better off picking a different format.
I do use ASN.1 DER for some things because, in my opinion, it is (generally) better than JSON, XML, etc.
> Modern evolutions of ASN.1 like ProtoBuf or Cap'n Proto designed for transmitting data across the network might fit this purpose pretty well, too.
I have found them to be unsuitable, with many problems, and that ASN.1 does them better in my experience.
> On the other hand, using ASN.1 may be a good way to make people trying to reverse engineer your format give up in despair, especially if you start using the quirks ASN.1 DER comes with and change the identifiers.
I am not so sure of this.
For me too, although you only need to use (and implement) the parts which are relevant for your application, not all of them, so it is not really a problem. (I have also never needed to write ASN.1 schemas, and a full implementation of ASN.1 is not necessary for my purposes.) (This is also a reason I use DER instead of BER, even when canonical form is not required; DER is simpler to handle than all of the possibilities of BER.)
If binary, consider just using SQLite.
Did you read the article?
That wouldn’t support partial parsing.
On the contrary, loading everything from a database is the limit case of "partial parsing" with queries that read only a few pages of a few tables and indices.
From the point of view of the article, a SQLite file is similar to a chunked file format: the compact directory of what tables etc. it contains is more heavyweight than listing chunk names and lengths/offsets, but at least as fast, and loading only needed portions of the file is automatically managed.
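A sketch of what that looks like from the application side, with a hypothetical chunk table (names and schema are made up):

```python
import sqlite3

# Hypothetical container: one row per named chunk/asset.
con = sqlite3.connect("document.container")
con.execute("""CREATE TABLE IF NOT EXISTS chunks(
                 name TEXT PRIMARY KEY, kind TEXT, payload BLOB)""")

# Reading one chunk touches only the relevant B-tree pages; the rest of the
# file is never loaded, which is the "partial parsing" being discussed.
row = con.execute("SELECT payload FROM chunks WHERE name = ?",
                  ("thumbnail",)).fetchone()
```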
Using SQLite as a container format is only beneficial when the file format itself is a composite, like word processor files which include both the textual data and any attachments. SQLite is just a hindrance otherwise, as with image file formats or archival/compressed file formats [1].
[1] SQLite's own sqlar format is a bad idea for this reason.
From my own experience SQLite works just fine as the container for an archive format.
It ends up having some overhead compared to established formats, but the ability to query over the attributes of tens of thousands of files is pretty nice, and definitely faster than the worst case of tar.
My archiver could even keep up with 7z in some cases (for size and access speed).
Implementing it is also not particularly tricky, and SQLite even allows streaming the blobs.
Making readers for such a format seems more accessible to me.
SQLite format itself is not very simple, because it is a database file format in its heart. By using SQLite you are unknowingly constraining your use case; for example you can indeed stream BLOBs, but you can't randomly access BLOBs because the SQLite format puts a large BLOB into pages in a linked list, at least when I checked last. And BLOBs are limited in size anyway (4GB AFAIK) so streaming itself might not be that useful. The use of SQLite also means that you have to bring SQLite into your code base, and SQLite is not very small if you are just using it as a container.
> My archiver could even keep up with 7z in some cases (for size and access speed).
7z might feel slow because it enables solid compression by default, which trades decompression speed for compression ratio. I can't imagine 7z having a similar compression ratio with the correct options, though; was your input incompressible?
Yes, the limits are important to keep in mind, I should have contextualized that before.
For my case it happened to work out because it was a CDC-based deduplicating format that compressed batches of chunks. That gives lots of flexibility for working within the limits.
The primary goal here was also making the reader as simple as possible whilst still having decent performance.
I think my workload is very unfair towards (typical) compressing archivers: small incremental additions, a need for random access, and indeed frequently incompressible files, at least when seen in isolation.
I really brought up 7z because it is good at what it does; it is just (ironically) too flexible for what was needed. There is probably some way of getting it to perform much better here.
zpack is probably a better comparison in terms of functionality, but I didn't want to assume familiarity with that one. (Also, I can't really keep up with it; my solution is not tweaked to that level, even ignoring the SQLite overhead.)
BLOBs support random access - the handles aren't stateful. https://www.sqlite.org/c3ref/blob_read.html
You're right that their size is limited, though, and it's actually worse than you even thought (1 GB).
My statement wasn't precise enough; you are correct that a random-access API is provided. But it is ultimately implemented by the `accessPayload` function in btree.c, which follows the overflow-page chain of a large BLOB.
In other words, the API can read from multiple scattered pages without the caller knowing. That said, I see this can be considered enough to count as random access, as the underlying file system would use similarly structured indices behind the scenes anyway... (But modern file systems do have consecutively allocated pages for performance.)
One gotcha to be aware of is that SQLite blobs can't exceed 1* GB. Don't use SQLite archives for large monolithic data.
*: A few bytes less, actually; the 1 GB limit is on the total size of a row, including its ID and any other columns you've included.
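For reference, a sketch of the incremental blob API from Python: `Connection.blobopen` (available since Python 3.11) wraps sqlite3_blob_open/read/write; the schema here is made up:

```python
import sqlite3

con = sqlite3.connect("archive.db")
con.execute("CREATE TABLE IF NOT EXISTS files(id INTEGER PRIMARY KEY, data BLOB)")
cur = con.execute("INSERT INTO files(data) VALUES (zeroblob(10000000))")
con.commit()

# Incremental blob I/O: read/write at an offset without loading the whole value.
with con.blobopen("files", "data", cur.lastrowid) as blob:
    blob.seek(5000000)
    blob.write(b"hello")
    blob.seek(5000000)
    print(blob.read(5))   # b'hello'
```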
The Mac image editor Acorn uses SQLite as its file format. It's described here:
https://shapeof.com/archives/2025/4/acorn_file_format.html
The author notes that an advantage is that other programs can easily read the file format and extract information from it.
It is clearly a composite file format [1]:
> Acorn’s native file format is used to losslessly store layer data, editable text, layer filters, an optional composite of the image, and various metadata. Its advantage over other common formats such as PNG or JPEG is that it preserves all this native information without flattening the layer data or vector graphics.
As I've mentioned, this is a good use case for SQLite as a container. But ZIP would work equally well here.
[1] https://flyingmeat.com/acorn/docs/technotes/ACTN002.html
I think it's fine as an image format. I've used the MBTiles format, which is basically just a table filled with map tiles. SQLite makes it super easy to deal with, e.g. to dump individual blobs and save them as image files.
It just may not always be the most performant option. For example, for map tiles there is also the PMTiles binary format, which is optimized for HTTP range requests.
Except image formats and archival formats are composites (data+metadata). We have Exif for images, and you might be surprised by how much metadata the USTar format has.
With that reasoning almost every format is a composite, which doesn't sound like a useful distinction. Such metadata should be fine as long as the metadata itself is isolated and can be updated without the parent format.
I agree that almost every format is a composite; you seem to not, which makes me think you mean something different than I by "composite."
Your reply suggests that, if all the metadata is auxiliary, it can be segregated from the data and doesn't make the format a composite.
However, that doesn't exclude archives (in many use cases the file metadata is as important as the data itself; consider e.g. hardlinks in TAR files).
Nor does it exclude certain vital metadata for images: resolution, color space, and bit depth come to mind.
My reasoning for Exif was that it is not only auxiliary but also post-hoc. Exif was defined independently from image formats and only got adopted later because those formats provided extension points (JPEG APP# markers, PNG chunks).
You've got a good point that there are multiple types of metadata and that some metadata might be crucial for interpreting the data. I would say such "structural" metadata should be considered part of the data. I'm not saying it is not metadata; it is metadata inside some data, so it doesn't count for our purpose of defining a composite.
I also don't think tar hardlinks are metadata for our purpose, because a hardlink entry technically consists of the linked path instead of the file contents, plus the information that the entry is a hardlink; the former is clearly data and the latter is metadata used to reconstruct the original file system, so both should be considered part of a larger piece of data (in this case, the logical notion of a "file").
I believe these examples should be enough to derive my definition of "composite". Please let me know otherwise.
I think I understand now. Thanks for the clarification.
sqlar proved a great solution in the past for me. Where does it fall short in your experience?
Unless you are using the container file as a database too, sqlar is strictly inferior to ZIP in terms of pretty much everything [1]. I'm actually more interested in the context sqlar did prove useful for you.
[1] https://news.ycombinator.com/item?id=28670418
I remember seeing the comment you linked a few years back; comments were already locked then so I couldn't reply, and this time I sadly don't have the time to get deeper into this. However, I recommend researching sqlar and using a SQLite db as a _file format_ in general, or at minimum looking at the SQLite Encryption Extension (SEE) (https://www.sqlite.org/see/doc/trunk/www/readme.wiki). You can get a lot out of the box with very little investment. IMHO sqlar is not competing with ZIP (can ZIP do metadata and transactions?).
> [...] at minimum looking at SQLite Encryption Extension (SEE) (https://www.sqlite.org/see/doc/trunk/www/readme.wiki).
SEE is a proprietary extension, however generous its license is, so it is not very meaningful when sqlar is compared against ZIP. Not that I necessarily see encryption as a fundamental feature for compressed archive formats, though; I'm advocating for age [1] integration instead.
[1] https://github.com/FiloSottile/age
> IMHO sqlar is not competing with ZIP (can zip do metadata and transactions?)
In my understanding, SQLite's support for sqlar and for ZIP arrived at the same time, so I believe that sqlar was created to demonstrate an alternative to ZIP (and that the demonstration wasn't good enough). I'm aware that this is just circumstantial evidence, so let me know if you have something concrete.
ZIP can of course do metadata in the form of per-file and archive comments. For more structured metadata, you can make use of extra fields if you really, really want, but at that point SQLite would indeed be a better choice. I however doubt it's a typical use case.
ZIP can be partially updated in place but can't do any transaction. It should be noted, though, that SQLite handles transactions via additional files (`-journal` or `-wal` files). So both sqlar and ZIP would write to an additional file during the update process, though SQLite would write much less data than ZIP. Any remaining differences are invisible to end users, unless in-place updates are common enough, in which case the use of SQLite is justified.
Point is that SEE exists, and so do free alternatives.
> In my understanding SQLite's support for sqlar and ZIP occurred at the same time
I believe so too.
I agree with you that sqlar is a poor general-purpose archive or compression format compared to ZIP; what I'm arguing is that it is a very good file format for certain applications, offering structured, modifiable, and searchable file storage. We had great success using it as the db/file format for a PLM solution shipped as both a desktop and a web app. The same database can then be used to power the web UI (single-tenant SaaS deployments) and the desktop app (a web export is simply a working file for the desktop app). The file being just a simple SQLite db lets users play with the data, do their own imports, migrations, etc., while having all files & docs in one place.
Most of that's pretty good.
Compression: for anything that ends up large, it's probably desired, though consider both the algorithm and the 'strength' carefully based on the use case. Even a simple algorithm might make things faster when it comes time to transfer or write to permanent storage. A high-cost search to squeeze out yet more redundancy is probably worth it if something will be copied and/or decompressed many times, but might not be worth it for that locally compiled kernel you'll boot at most 10 times before replacing it with another.
Good article, but I’d add one more paragraph.
If your data format contains multiple streams inside, consider ZIP for the container. It enables standard tools, and libraries are available in all languages. Compression support is built in but optional, and can be enabled selectively for different entries.
The approach is widely used in practice. MS Office files, Java JARs, iOS App Store binaries, Android APKs, and EPUB books all use the ZIP container format.
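A sketch of such a container in Python, with made-up file and entry names, showing per-entry compression choices:

```python
import zipfile

png_bytes = b"\x89PNG\r\n\x1a\n..."   # placeholder for an already-compressed asset

# Hypothetical container with several internal streams.
with zipfile.ZipFile("project.bundle", "w") as z:
    z.writestr("manifest.json", '{"version": 1}',
               compress_type=zipfile.ZIP_DEFLATED)   # small text: worth compressing
    z.writestr("assets/cover.png", png_bytes,
               compress_type=zipfile.ZIP_STORED)     # already compressed: store as-is

# Any stock unzip tool or library can open it and pull out a single entry.
with zipfile.ZipFile("project.bundle") as z:
    manifest = z.read("manifest.json")
```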
Designing your file (and data) formats well is important.
“Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won’t usually need your flowcharts; they’ll be obvious.”
— Fred Brooks
I would add: make it streamable, or at least allow it to be read remotely efficiently.
Agreed on that one. With a nice file format, streamable is hopefully just a matter of ordering things appropriately once you know the sizes of the individual chunks. You want to write the index last, but you want to read it first. Perhaps you want the most influential values first if you're building something progressive (a level-of-detail split).
Similar is the discussion of delimited fields vs. length prefixes. Delimited fields are nicer to write, but length-prefixed fields are nicer to read. I think most new formats use length prefixes, so I'd start there. I wrote a blog post about combining the value and length into a VLI that also handles floating point and bit/byte strings: https://tommie.github.io/a/2024/06/small-encoding
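For comparison, here is what plain length prefixing looks like; this sketch uses a fixed 4-byte prefix for simplicity, whereas the linked post folds value and length into a single varint:

```python
import struct

def write_record(out, payload: bytes):
    # Length prefix first, then the payload; the writer must know the size up front.
    out.write(struct.pack("<I", len(payload)))
    out.write(payload)

def read_records(f):
    # The reader can skip or slice records without scanning for delimiters.
    while (header := f.read(4)):
        (length,) = struct.unpack("<I", header)
        yield f.read(length)
```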
I don't think a single encoding is generally useful. A good encoding for a given application depends on the value distribution and the neighboring data. For example, any variable-length scalar encoding makes vectorization much harder.
Depends on whether you're optimizing for storage size or code size, and for in-memory use vs. transfer. This encoding was meant to optimize transfer (and perhaps storage).
Depending on the use case, it might be good to have a file format be "diffable", especially when it's going to be checked in to some version control system.
If you find yourself building a file format, you should read this page carefully and make sure that you have very good arguments for why it does not apply to you.
https://www.sqlite.org/appfileformat.html
> However, it's cleaner to have a field in your header that states where the first sub-chunk starts; that way you can expand your header as much as you like in future versions, with old code being able to ignore those fields and jump to the good stuff.
That’s assuming that parsers will honor this, and not just use the fixed offset that worked for the past ten years. This has happened often enough in the past.
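The honest reader is cheap to write, for what it's worth; a sketch with made-up header fields:

```python
import struct

# Hypothetical fixed header: magic, major/minor version, offset of the first chunk.
HEADER = struct.Struct("<4sHHI")

def read_header(f):
    magic, major, minor, first_chunk = HEADER.unpack(f.read(HEADER.size))
    # Seek to the stored offset instead of assuming the header is exactly
    # HEADER.size bytes, so future versions can grow the header safely.
    f.seek(first_chunk)
    return magic, (major, minor)
```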
Also, "Don't try to be clever to save a few bits.", like using the lower and upper 4 bits of a byte for different things (I'm looking at you, ONFI).
I've had to do just that, to retrofit features I wasn't allowed to think about up front (we must get the product out the door... we'll cross that bridge when we get to it).
The iNES file format is guilty of badly designed bit packing. Four flags were packed into the lower 4 bits of a byte, and the mapper number was assigned to the high 4 bits. But then they needed more than 16 mappers, so the high 4 bits of the next byte were used to store the remaining 4 bits, and that was enough... until they needed over 256 mappers.
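Reassembling the mapper number on the reading side ends up looking roughly like this (a sketch following the commonly documented iNES/NES 2.0 layout):

```python
def ines_mapper(header: bytes) -> int:
    """Reassemble the mapper number scattered across the iNES header."""
    mapper = (header[6] >> 4) | (header[7] & 0xF0)   # low nibble of byte 6 + high nibble of byte 7
    if header[7] & 0x0C == 0x08:                     # NES 2.0 marker in byte 7
        mapper |= (header[8] & 0x0F) << 8            # yet another nibble for bits 8-11
    return mapper
```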
Good overview.
The "Chunk your binaries" point is spot on. Creating a huge binary blob that contains everything makes it hard to work with in constrained environments.
Also, +1 for "Document your format". More like "Document everything". Future you will thank you for it for sure.
Also, you should consider the context in which you are developing. Often there are "standard" tools and methods to deal with the kind of data you want to store.
E.g. if you are interested in storing significant amounts of structured floating point data, choosing something like HDF5 will not only make your life easier, it will also make it easier to communicate what you have done to others.
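A sketch of the HDF5 route via the h5py binding (dataset names, shapes, and attributes here are made up):

```python
import numpy as np
import h5py   # third-party, but the de facto standard Python binding for HDF5

# Self-describing, chunked, optionally compressed storage for structured float data.
with h5py.File("run42.h5", "w") as f:
    dset = f.create_dataset("temperature", data=np.random.rand(1000, 1000),
                            compression="gzip")
    dset.attrs["units"] = "K"

with h5py.File("run42.h5") as f:
    block = f["temperature"][100:200, :]   # partial read; the rest stays on disk
```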
Thinking about a file format is a good way to clarify your vision. Even if you don’t want to facilitate interop, you’d get some benefits for free—if you can encapsulate the state of a particular thing that the user is working on, you could, for example, easily restore their work when they return, etc.
Some cop-out (not necessarily in a bad way) file formats:
1. Don’t have a file format, just specify a directory layout instead. Example: CinemaDNG. Throw a bunch of particularly named DNGs (a file for each frame of the footage) in a directory, maybe add some metadata file or a marker, and you’re good. Compared to the likes of CRAW or BRAW, you lose in compression, but gain in interop.
2. Just dump runtime data. Example: Mnemosyne’s old format. Do you use Python? Just dump your state as a Python pickle. (Con: dependency on a particular runtime, good luck rewriting it in Rust.)
3. Almost dump runtime data. Example: Anki, or newer Mnemosyne, with their SQLite dumps. (Something suggests to me that they might be using SQLite at runtime.) A step up from a pickle in terms of interop: it somewhat opens you (but also others) up to alternative implementations, at least in any runtime that has the means to read SQLite. I hope that if you use this you don't think the presence of an SQL schema makes the format self-documenting.
4. One or more of the above, except also zip or tar it up. Example: VCV, Anki.
About 1, a directory of files: many formats these days are just a bunch of files in a ZIP. One thing most applications unfortunately lack is a way to instead just read and write the part files from/to a directory. For one thing, it makes things much better for version control, but it is also just easier to access in general when experimenting. I don't understand why this is not more common, since as a developer it is much more fun to debug things when each thing is its own file rather than an entry in an archive. Most of the time it is also trivial to support both, since any API for accessing directory entries will be close to 1:1 with an API for accessing ZIP entries anyway.
When editing a file locally I would prefer to just have it split up in a directory 99% of the time, only exporting to a ZIP to publish it.
Of course it is trivial to write wrapper scripts to keep zipping and unzipping files, and I have done that, but it does feel a bit hacky and should be an unnecessary extra step.
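A sketch of how close the two can be in practice, using Python's zipfile.Path and pathlib.Path (the bundle layout and names here are made up):

```python
import pathlib
import zipfile

def open_bundle(path):
    """Return a root object with the same read interface for a directory or a .zip."""
    p = pathlib.Path(path)
    return zipfile.Path(p) if p.suffix == ".zip" else p

root = open_bundle("project.zip")            # or open_bundle("project/") while developing
manifest = (root / "manifest.json").read_text()
for entry in (root / "assets").iterdir():
    print(entry.name)
```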
Yes, the zipped version is number four. It’s not great for the reason you noted. Some people come up with smudge/clean filters that handle the (de)compression, letting Git store the more structured version of the data even though your working directory contains the compressed files your software can read and write—but I don’t know how portable these things are. I agree with you in general, and it is also why my number one example is that you might not need a single-file format at all. macOS app bundles is a great example of this approach in the wild.
One question I was hoping to ask anyone who has thought about these matters: what accepted approaches exist out there when it comes to documenting/speccing out file formats? Ideally, including the cases where the “file” is in fact a directory with a specific layout.
(Correction: instead of “CRAW”, I should have written “Canon Cinema Raw Light”. Apparently, those are different things.)
> 2. Just dump runtime data. Example: Mnemosyne’s old format. Do you use Python? Just dump your state as a Python pickle. (Con: dependency on a particular runtime, good luck rewriting it in Rust.)
Be particularly careful with this one as it can potentially vastly expand the attack surface of your program. Not that you shouldn't ever do it, just make sure the deserializer doesn't accept objects/values outside of your spec.
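One way to do that with pickle specifically, along the lines of the restricted-unpickler recipe in the Python docs (the whitelist here is just an example):

```python
import io
import pickle

# Whitelist only what your format's spec actually needs (hypothetical set).
ALLOWED = {("builtins", "set"), ("collections", "OrderedDict")}

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Refuse to resolve anything outside the whitelist instead of importing it.
        if (module, name) in ALLOWED:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"{module}.{name} is not allowed")

def safe_loads(data: bytes):
    return RestrictedUnpickler(io.BytesIO(data)).load()
```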
I certainly hope no one takes my list as an endorsement… It’s just some formats seen in the wild.
It should be noted (the article does not) that parsing and deserialisation are generally a known weak area and a common source of CVEs, even when pickling is not used. Being more disciplined about it helps, of course.
I have a rather idiosyncratic opinion here:
For Open-Source projects, human readable file formats are actively harmful.
This is mostly motivated by my experience with KiCad. In particular, there are multiple things that the UI does not expose at all (slots in PCB footprint files, for example), where the only way to add them is to manually edit the footprint file in a text editor.
There are some other similar annoyances in the same vein.
Basically, human-readable (and therefore hand-editable) file formats wind up being a way for some things to never be exposed through the UI. This actively leads to the software being less capable.
Not exposing things in the UI is not necessarily a problem (it depends on the program and other things), although it can be (especially if it is not documented). I have not used the program you mention, but it does seem to be a problem in the way you describe, although someone who wants to add it to the UI could hopefully do so if it is FOSS. However, one potential problem is that if it is a text-based format, writing the format (in a way which remains readable rather than messy) can sometimes be more complicated than reading it.
(The TEMPLATE.DER lump (which is a binary file format and not plain text) in Super ZZ Zero is not exposed anywhere in the UI; you must use an external program to create this lump if you want it. Fortunately that lump is not actually mandatory, and only affects the automatic initial modifications of a new world file based on an existing template.)
However, I think that human readable file formats are harmful for other reasons.