My question is, what historically has driven the industry to focus on database-specific solutions, rather than on filesystem-specific solutions?
This is not a rant against databases, but I do wonder why many major programming languages and frameworks (RoR's Active Record, C#'s Entity Framework, etc.) have put so much effort into making database interaction smooth from the perspective of the programming language, instead of putting that effort into interacting with data on disk. Kind of an alternate-reality sort of musing.
So your question is: why does the industry focus on reusable solutions to hard problems, rather than piecemeal recreating them in every project? And when phrased that way, the answer is self-evident: productivity, cost, and ease.
If you're using a relational DB as an actual relational database, it gives you a lot the FS doesn't give you. And even if you're only using a relational database as a key-value store, SQLite is 35% faster than the filesystem [1].
Perhaps one of the biggest users of the filesystem as a KV store is git -- (not an llm, I just wanted to use --). .git/objects/xx/xxxxx maps the SHA-1 object hash to the zlib-compressed data, fanned out by the first two hex characters of the hash. However, git also uses a database of sorts (.git/objects/pack/...). To sum up the git-pack-objects man page: it's more efficient.
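That loose-object layout is easy to sketch. Here is a minimal illustration in Python; the path logic mirrors git's fan-out scheme, but this is not git's own code:

```python
import hashlib
import os

def loose_object_path(data: bytes, repo: str = ".git") -> str:
    # Git hashes "blob <len>\0<content>" with SHA-1, zlib-compresses the
    # result, and stores it under objects/<first 2 hex chars>/<other 38>.
    header = b"blob %d\x00" % len(data)
    sha = hashlib.sha1(header + data).hexdigest()
    return os.path.join(repo, "objects", sha[:2], sha[2:])

print(loose_object_path(b"hello\n"))
```

The two-character prefix keeps any single directory from accumulating millions of entries, which many filesystems handle poorly.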
But the filesystem-as-database idea still exists -- SQLite is essentially this. I wrote an adapter along these lines, and from experience it is capable of doing everything the other adapters do, just really slowly :D
For example: datrix.update('comment', /* where */ { author: { group: { tag: 'abc' } } }, /* data */ { name: 'new name', anotherRelation: { connect: [123] } }) -- where author, group, and anotherRelation each resolve through another file.
To find the correct entries, this query has to check three files, then replace every matching entry's name with 'new name' and also connect another file's entry to each of them. What happens if another user wants to write to authors at the same time? Or if the names are updated but an exception happens while connecting the other relation? I have to roll back all the updates. How I solved this: I lock the filesystem for the duration of the query (all other requests are blocked), apply all updates in memory, and revert if anything goes wrong along the way. That is how slow this solution is. To fix it properly, I would have to create another application that manages all requests without blocking -- and that is exactly what we call a database engine.
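The pattern described above -- lock everything, mutate in memory, revert on failure -- can be sketched like this (a hypothetical toy, with a dict standing in for the on-disk files; it is not the actual adapter):

```python
import copy
import threading

# One global lock: every query blocks every other request.
_lock = threading.Lock()

def update_all(store: dict, where, mutate) -> None:
    """Apply `mutate` to every entry matching `where`; roll back on error."""
    with _lock:
        backup = copy.deepcopy(store)      # snapshot for the revert path
        try:
            for entry in store.values():
                if where(entry):
                    mutate(entry)
        except Exception:
            store.clear()
            store.update(backup)           # revert all partial updates
            raise

store = {1: {"tag": "abc", "name": "old"}, 2: {"tag": "xyz", "name": "old"}}
update_all(store, lambda e: e["tag"] == "abc",
           lambda e: e.update(name="new name"))
```

The deep copy and the global lock are what make this safe but slow: a real engine gets the same atomicity with a write-ahead log and finer-grained locking instead.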
https://archive.fosdem.org/2021/schedule/event/new_type_of_c...
I adapted most of it into an article for The Register:
https://www.theregister.com/2024/02/26/starting_over_rebooti...
It turns out having a defined abstraction like a database makes things faster than having to rely on a rawer interface like filesystems because you can then reduce the number of system calls and context switches necessary. If you wanted to optimize that in your own code rather than relying on a database, you'd end up reinventing a database system of sorts, when (probably) better solutions already exist.
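As a rough illustration of the syscall point (a hypothetical micro-benchmark with invented names, not rigorous measurement): fetching N small records stored as N separate files costs roughly an open/read/close per record, while the same data in one SQLite file is answered by a single query over one already-open file descriptor:

```python
import os
import sqlite3
import tempfile

N = 100
tmp = tempfile.mkdtemp()

# Filesystem layout: one file per record.
for i in range(N):
    with open(os.path.join(tmp, f"{i}.txt"), "w") as f:
        f.write(f"value-{i}")

def read_via_fs() -> dict:
    out = {}
    for i in range(N):  # ~3 syscalls per record: open, read, close
        with open(os.path.join(tmp, f"{i}.txt")) as f:
            out[i] = f.read()
    return out

# SQLite layout: all records as rows in one file.
db = sqlite3.connect(os.path.join(tmp, "kv.db"))
db.execute("CREATE TABLE kv (k INTEGER PRIMARY KEY, v TEXT)")
db.executemany("INSERT INTO kv VALUES (?, ?)",
               [(i, f"value-{i}") for i in range(N)])
db.commit()

def read_via_db() -> dict:
    # One query; SQLite reads pages from a single open file.
    return dict(db.execute("SELECT k, v FROM kv"))

assert read_via_fs() == read_via_db()
```

This is the mechanism behind SQLite's "35% faster than the filesystem" claim for small blobs: fewer boundary crossings per record fetched.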
It is an object store called Didgets (i.e. Data Widgets). Each Didget has a specific type. One type is used to hold unstructured data like a file does. These Didgets are unsurprisingly called File Didgets. Other types of Didgets can hold lists (used to create hierarchical folders, music play lists, photo albums, etc.).
Others hold sets of Key/Value pairs which are used to create a tagging system for other Didgets or columns in a relational table.
Using a variety of Didgets, I have been able to create hierarchical file systems where a simple query can find one or thousands of files instantly out of 200 million+ based on the values of any tags attached.
In the same container (called a pod), it can store tens of thousands of relational tables; each one capable of having 100,000+ columns and billions of rows.
The system is 'multi-model', so it can manage hierarchical data, relational data, graph data, or anything managed by a NoSQL system.
It is not only versatile but also incredibly fast.
It would be straightforward, for instance, to implement a lot of the functionality of a filesystem in a database with BLOBs. Random access might be a hassle, but people are getting used to "filesystem-like" systems which are bad at random access like S3.
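As a hedged sketch of that idea (the schema and helper names here are invented for illustration, not any real system's API), a few lines of SQLite give you path-addressed blobs, and substr() even provides a crude form of random access without pulling the whole blob into the application:

```python
import sqlite3

# Toy "filesystem in a database": paths map to BLOB contents.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE files (path TEXT PRIMARY KEY, data BLOB)")

def write_file(path: str, data: bytes) -> None:
    db.execute("INSERT OR REPLACE INTO files VALUES (?, ?)", (path, data))

def read_file(path: str) -> bytes:
    (data,) = db.execute("SELECT data FROM files WHERE path = ?",
                         (path,)).fetchone()
    return data

def read_range(path: str, offset: int, length: int) -> bytes:
    # Random access inside the blob: SQLite's substr() is 1-indexed
    # and works on BLOBs, so only the slice crosses into Python.
    (chunk,) = db.execute(
        "SELECT substr(data, ?, ?) FROM files WHERE path = ?",
        (offset + 1, length, path)).fetchone()
    return chunk

write_file("/notes/hello.txt", b"hello, world")
print(read_range("/notes/hello.txt", 7, 5))  # b'world'
```

For large blobs, SQLite also offers incremental blob I/O handles, which behave even more like seek-and-read on a file.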
Yes, but that's my point. Why is this not part of the standard library / a typical package, with very little friction with the rest of the code, instead of a separate program that standard libraries / typical packages merely interface with in an attempt to reduce the friction?
Or are you making the general point that databases already existed prior to the standard libraries etc, and this is just a case of interfacing with an existing technology instead of rebuilding from scratch?
ETA: look at SQLite for an example — it’s a relatively recent and simple entrant in the field and the closest you’ll find in the mainstream to a purely filesystem based RDBMS. How would you provide a stdlib that would let you implement something like that reasonably simply? What would be the use case for it?
A database is a data structure with (generally) many small items that need to be precisely updated, read and manipulated.
A lot of files don't necessarily have this access pattern (for instance, rendering a large video file)... a filesystem has a generic access pattern and is a lower-level primitive than a database.
For this same reason you even have different kinds of databases for different access patterns and data types (e.g. Elasticsearch for full-text search, MongoDB for JSON, Postgres for SQL).
Filesystem is generic and low-level, database is a higher order abstraction.
The API revolution might be another thing -- you were able to swap out a database for any other. Risky decisions are fine when they're reversible, and databases were a more reversible way to deal with scaling and architecture.
https://en.wikipedia.org/wiki/ISAM
https://en.wikipedia.org/wiki/Record_Management_Services
They were more like BerkeleyDB and lacked a query planner.
I think Oracle internally uses something similar, i.e. a native filesystem optimized for an RDBMS.