My question is, what historically has driven the industry to focus on database-specific solutions, rather than on filesystem-specific solutions?
This is not a rant against databases, but I do wonder why many major programming languages and frameworks (RoR's Active Record, C#'s Entity Framework, etc.) have put so much effort into making database interaction smooth from the perspective of the programming language, instead of putting that effort into interacting with data on disk. Kind of an alternate-reality sort of musing.
So your question is: why does the industry focus on reusable solutions to hard problems, rather than piecemeal recreating them in every project? And when phrased that way, the answer is self-evident: productivity, cost, and ease.
If you're using a relational DB as an actual relational database, it gives you a lot the FS doesn't give you. And even if you're only using a relational database as a key-value store, SQLite is 35% faster than the filesystem [1].
Perhaps one of the biggest users of the filesystem as a KV store is git -- (not an llm, I just wanted to use --). .git/objects/xx/xxxxx maps the SHA-1 object hash to the zlib-compressed data, fanned out by the first two hex characters of the hash. However, git also uses a database of sorts (.git/objects/pack/...). To sum up the git-pack-objects man page: it's more efficient.
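That loose-object layout is easy to sketch. Here is a minimal illustration in Python; the path logic mirrors git's fan-out scheme, but this is not git's own code:

```python
import hashlib
import os

def loose_object_path(data: bytes, repo: str = ".git") -> str:
    # Git hashes "blob <len>\0<content>" with SHA-1, zlib-compresses the
    # result, and stores it under objects/<first 2 hex chars>/<other 38>.
    header = b"blob %d\x00" % len(data)
    sha = hashlib.sha1(header + data).hexdigest()
    return os.path.join(repo, "objects", sha[:2], sha[2:])

print(loose_object_path(b"hello\n"))
```

The two-character prefix keeps any single directory from accumulating millions of entries, which many filesystems handle poorly.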
But the filesystem-as-database idea still exists -- SQLite is essentially this. I wrote an adapter along these lines, and from experience it is capable of doing everything the other adapters do, just really slowly :D
For example: datrix.update('comment', /* where */ { author: { group: { tag: 'abc' } } }, /* data */ { name: 'new name', anotherRelation: { connect: [123] } }) -- where author, group, and anotherRelation each resolve through another file.
To find the correct entries, this query has to check three files, then replace every matching entry's name with 'new name' and also connect another file's entry to each of them. What happens if another user wants to write to authors at the same time? Or if the names are updated but an exception happens while connecting the other relation? I have to roll back all the updates. How I solved this: I lock the filesystem for the duration of the query (all other requests are blocked), apply all updates in memory, and revert if anything goes wrong along the way. That is how slow this solution is. To fix it properly, I would have to create another application that manages all requests without blocking -- and that is exactly what we call a database engine.
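The pattern described above -- lock everything, mutate in memory, revert on failure -- can be sketched like this (a hypothetical toy, with a dict standing in for the on-disk files; it is not the actual adapter):

```python
import copy
import threading

# One global lock: every query blocks every other request.
_lock = threading.Lock()

def update_all(store: dict, where, mutate) -> None:
    """Apply `mutate` to every entry matching `where`; roll back on error."""
    with _lock:
        backup = copy.deepcopy(store)      # snapshot for the revert path
        try:
            for entry in store.values():
                if where(entry):
                    mutate(entry)
        except Exception:
            store.clear()
            store.update(backup)           # revert all partial updates
            raise

store = {1: {"tag": "abc", "name": "old"}, 2: {"tag": "xyz", "name": "old"}}
update_all(store, lambda e: e["tag"] == "abc",
           lambda e: e.update(name="new name"))
```

The deep copy and the global lock are what make this safe but slow: a real engine gets the same atomicity with a write-ahead log and finer-grained locking instead.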
https://archive.fosdem.org/2021/schedule/event/new_type_of_c...
I adapted most of it into an article for The Register:
https://www.theregister.com/2024/02/26/starting_over_rebooti...
It turns out having a defined abstraction like a database makes things faster than having to rely on a rawer interface like filesystems because you can then reduce the number of system calls and context switches necessary. If you wanted to optimize that in your own code rather than relying on a database, you'd end up reinventing a database system of sorts, when (probably) better solutions already exist.
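As a rough illustration of the syscall point (a hypothetical micro-benchmark with invented names, not rigorous measurement): fetching N small records stored as N separate files costs roughly an open/read/close per record, while the same data in one SQLite file is answered by a single query over one already-open file descriptor:

```python
import os
import sqlite3
import tempfile

N = 100
tmp = tempfile.mkdtemp()

# Filesystem layout: one file per record.
for i in range(N):
    with open(os.path.join(tmp, f"{i}.txt"), "w") as f:
        f.write(f"value-{i}")

def read_via_fs() -> dict:
    out = {}
    for i in range(N):  # ~3 syscalls per record: open, read, close
        with open(os.path.join(tmp, f"{i}.txt")) as f:
            out[i] = f.read()
    return out

# SQLite layout: all records as rows in one file.
db = sqlite3.connect(os.path.join(tmp, "kv.db"))
db.execute("CREATE TABLE kv (k INTEGER PRIMARY KEY, v TEXT)")
db.executemany("INSERT INTO kv VALUES (?, ?)",
               [(i, f"value-{i}") for i in range(N)])
db.commit()

def read_via_db() -> dict:
    # One query; SQLite reads pages from a single open file.
    return dict(db.execute("SELECT k, v FROM kv"))

assert read_via_fs() == read_via_db()
```

This is the mechanism behind SQLite's "35% faster than the filesystem" claim for small blobs: fewer boundary crossings per record fetched.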
It is an object store called Didgets (i.e. Data Widgets). Each Didget has a specific type. One type is used to hold unstructured data like a file does. These Didgets are unsurprisingly called File Didgets. Other types of Didgets can hold lists (used to create hierarchical folders, music play lists, photo albums, etc.).
Others hold sets of Key/Value pairs which are used to create a tagging system for other Didgets or columns in a relational table.
Using a variety of Didgets, I have been able to create hierarchical file systems where a simple query can find one or thousands of files instantly out of 200 million+ based on the values of any tags attached.
In the same container (called a pod), it can store tens of thousands of relational tables; each one capable of having 100,000+ columns and billions of rows.
The system is 'multi-model', so it can manage hierarchical data, relational data, graph data, or anything managed by a NoSQL system.
It is not only versatile but also incredibly fast.
It would be straightforward, for instance, to implement a lot of the functionality of a filesystem in a database with BLOBs. Random access might be a hassle, but people are getting used to "filesystem-like" systems which are bad at random access like S3.
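As a hedged sketch of that idea (the schema and helper names here are invented for illustration, not any real system's API), a few lines of SQLite give you path-addressed blobs, and substr() even provides a crude form of random access without pulling the whole blob into the application:

```python
import sqlite3

# Toy "filesystem in a database": paths map to BLOB contents.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE files (path TEXT PRIMARY KEY, data BLOB)")

def write_file(path: str, data: bytes) -> None:
    db.execute("INSERT OR REPLACE INTO files VALUES (?, ?)", (path, data))

def read_file(path: str) -> bytes:
    (data,) = db.execute("SELECT data FROM files WHERE path = ?",
                         (path,)).fetchone()
    return data

def read_range(path: str, offset: int, length: int) -> bytes:
    # Random access inside the blob: SQLite's substr() is 1-indexed
    # and works on BLOBs, so only the slice crosses into Python.
    (chunk,) = db.execute(
        "SELECT substr(data, ?, ?) FROM files WHERE path = ?",
        (offset + 1, length, path)).fetchone()
    return chunk

write_file("/notes/hello.txt", b"hello, world")
print(read_range("/notes/hello.txt", 7, 5))  # b'world'
```

For large blobs, SQLite also offers incremental blob I/O handles, which behave even more like seek-and-read on a file.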
Yes, but that's my point. Why is this not part of the standard library / a typical package, with very little friction with the rest of the code, instead of a separate program that standard libraries / typical packages merely interface with in an attempt to reduce the friction?
Or are you making the general point that databases already existed prior to the standard libraries etc, and this is just a case of interfacing with an existing technology instead of rebuilding from scratch?
ETA: look at SQLite for an example — it’s a relatively recent and simple entrant in the field and the closest you’ll find in the mainstream to a purely filesystem based RDBMS. How would you provide a stdlib that would let you implement something like that reasonably simply? What would be the use case for it?
A database is a data structure with (generally) many small items that need to be precisely updated, read and manipulated.
A lot of files don't necessarily have this access pattern (for instance, rendering a large video file)... a filesystem has a generic access pattern and is a lower-level primitive than a database.
For this same reason you even have different kinds of databases for different access patterns and data types (e.g. Elasticsearch for full-text search, MongoDB for JSON, Postgres for SQL).
Filesystem is generic and low-level, database is a higher order abstraction.
The API revolution might be another thing -- you were able to swap out a database for any other. Risky decisions are fine when they're reversible, and databases were a more reversible way to deal with scaling and architecture.
https://en.wikipedia.org/wiki/ISAM
https://en.wikipedia.org/wiki/Record_Management_Services
They were more like BerkeleyDB and lacked a query planner.
I think Oracle internally uses something similar, i.e. a native filesystem optimized for an RDBMS.