homer reading

Let’s say you have some data that can be arrange in a table of rows/columns. For example, a table of students with columns like name, age, gpa, etc.

There are two ways to store this data. You can either store each row contiguously or each column contiguously.

Row Storage

For example, if you have 3 rows and 3 columns in your table, the data will be stored like this:

name1 age1 gpa1, name2 age2 gpa2, name3 age3 gpa3

Columnar Storage

The same data can be stored like this:

name1 name2 name3, age1 age2 age3, gpa1 gpa2 gpa3

In this case, you are storing a certain column contiguously.

This is kinda similar to AOS (Array of Structures) vs SOA (Structure of Arrays) in programming. AOS is like row storage and SOA is like columnar storage.

Advantages of Columnar Storage

If most of your queries involve reading and in particular, reading only a few columns, then columnar storage is better because you end up doing less reads/seeks. Most of the time the storage is in disk, not memory, and as we know reading/seeking from disk is a lot slower, so you definitely want to minimize it.

Concretely, let’s say your query was:

SELECT age FROM students

If you are using columnar storage, the disk seeks to age section of the students table, and reads all the age values.

If you are using row storage, the disk seeks to the first row, reads the age value, then seeks to the second row, reads the age value, and so on. Alternatively, you could read the entire row, but then you are reading more data than you need. Either way, you are either reading more or seeking more, both of which is slow. Even if you are using an SSD, which has no seeking, it’s still faster to read contiguously on SSDs.

It’s faster to read contiguously, even on SSDs, which don’t have any seeking. This is just how these chips are designed.

A bonus advantage of columnar storage is that you can have better compression, since there is more likely to be repeated values in a column.

Disadvantages of Columnar Storage

If your queries involve more writing, then columnar storage is worse because your writes are “dispersed” across the disk, as you’ll see shortly.

Let’s imagine you want to add a new student to the table. You:

  • Seek to the end of the name column and write the new name.
  • Seek to the end of the age column and write the new age.
  • Seek to the end of the gpa column and write the new gpa.

That’s a lotta seeking! If you were using row storage, you would just seek to the end of the table and write the new row.

So this is a classic read-write tradeoff. If you want to optimize for reads, use columnar storage.

Summary

The main thing to get out of this is that it is faster to read contiguously. This is true of memory, but especially true of disks. For classical disks, this makes sense, because seeking is slow. But this is even true for SSDs, which don’t have any seeking.

So if it is faster to read contiguously, then store your data in such a way that you read it contiguously, based on your access patter, i.e. how you access the data. So if you tend to access only a few columns, then columnar storage makes sense. But also, keep in mind writes!

Also, in my example I used an sql query, but this applies to any kind of data storage, not just databases. This is much more general than that. I just used SQL as an example. Just think about reading, and reading contiguously.

That’s all folks! Hope you have an awesome day!