You have some data to store. You want to be able to retrieve parts of it later on. Should you store it in a file or a database?

Quick Answer

If you have very little data, store it in a file. If you have a lot of data, store it in a database.

Explanation

Let’s say you are building a social media website (facebook-like). You need to store information about people who create profiles on your site. You are expecting tons and tons of people to sign up, thus you will be storing a lot of data.

Somewhere down the line, you want to retrieve the information for a specific person, named “abdullah”. If you had stored all your data on files, how will you find the information just relavant to “abdullah”? Have you organized your your data by creating one file for each profile? What if you need to find “the school names for people who are between 18-22”? Will the way you “organized” your files still allow you to quickly find this information? Disk reads are expensive, so you want to minimize them. You want to avoid having to read lots of files or large parts of files.

This is what a database does for you. When you store information in a database, and then create proper indices, it allows you to quickly (with minimal disk reading) find just the information you are looking for.

The key here is to minimize disk reads!

How Does a Database Minmize Disk Reads - Indices

If you want to store information in a database, you have to present it in “structured form”. In other words, you gotta think of your data as “records” (“rows”). A database stores records linearly in its file.

Let’s say your data was about people’s profiles, their name, age, school, and hobby. What if you want to find all records that attended certain schools? Well, then you should have built an “index” on the “school” field. If you had done that, there would be a look-up table, that maps schools (keys) to values (“pointer” to records that attend that school). By “pointer” here I just mean the offset from the database file where the info of that record starts.

So now, you have quickly obtained the locations within your database file where records that attended said school exist. Just read these areas!

If you need quick search on another field, add an index for that field. Indices are also stored on disk. The more indices you add, the larger your database gets, but the faster your queries get (queries on indexed fields that is).

Adding an Index Speeds Up Search but Slows Updates

If you update a record (more specifically, if you update a field of the record that was indexed), you also need to update the index of the field! Thus adding an index on a field:

  1. makes searching on that field faster
  2. makes updating that field slower

B Trees

What if you index so much that your indices get huge. So every time you want to search, you have to read these huge indices to find out where relavant info is (remember indices are also stored on disk). Databases have a solution for this too. Indices are split into chunks, and the chunks are arranged into a B Tree structure. Each node of the BTree is the size of a disk block (usually around 4kb). Each node of the B Tree references its children via an “offset” from the database file. The leafs of the B Trees actually tell you where your relavant info is. Since B Trees are a lot fatter than Binary Search Trees, you are minimizing the height of the tree, thus the number of disk reads!

Summary

  • a database makes finding the location of certain pieces of data do the minimum number of disk reads (thus “faster”)
    • you can then read this data
    • or update its value
  • a database chunks up its indices and arranges them in a B Tree
    • each node of the B Tree is the size of a disk block (~4kb)
    • B Tree nodes “reference” children via “offsets” from database file start