When multiple cores operate on data that is too close together in memory (in the same 64 byte chunk), multi-threaded performance degrades. This is because each core has its own cache, and if one core wants to write to its cache (instead of main memory), it has to update the other cores’ caches with the new value as well (because, more than likely, the other cores have the same data cached too)!

 ┌─────────────────────────────────────────────────────────────┐
 │                                                             │
 │                                                             │
 │                         main memory                         │
 │                                                             │
 │                                                             │
 │                                                             │
 └──────────────────────────────┬──────────────────────────────┘
                                │
                   ┌────────────┴─────────────┐
                   │        L3 cache          │
                   └───┬─────────┬─────────┬──┘
                       │         │         │
                    ┌───┴───┐ ┌───┴───┐ ┌───┴───┐
                   │ L1/L2 │ │ L1/L2 │ │ L1/L2 │
                   └───┬───┘ └───┬───┘ └───┬───┘
                       │         │         │
                   ┌───┴───┐ ┌───┴───┐ ┌───┴───┐
                   │       │ │       │ │       │
                   │core1  │ │core2  │ │core3  │
                   │       │ │       │ │       │
                   └───────┘ └───────┘ └───────┘

Let’s talk a little about CPU caches.

Modern CPUs are multi-core, and each core has its own layers of cache (L1 and L2). All cores generally share the L3 cache.

When talking about caches, we usually talk about 64 byte chunks: data moves between main memory and the caches 64 bytes at a time, and such a chunk is called a “cache line”.
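
(If you’re curious what the line size is on your own machine, here’s a tiny sketch. The sysconf call is a Linux/glibc thing, and std::hardware_destructive_interference_size is optional in C++17, so your compiler may or may not provide it.)

```cpp
#include <iostream>
#include <new>       // std::hardware_destructive_interference_size (C++17, optional)
#include <unistd.h>  // sysconf (POSIX / glibc)

int main() {
#ifdef __cpp_lib_hardware_interference_size
    // What the C++ implementation reports as the granularity at which
    // concurrent writes start interfering with each other.
    std::cout << "hardware_destructive_interference_size: "
              << std::hardware_destructive_interference_size << " bytes\n";
#endif
    // What the OS reports as the L1 data cache line size (Linux/glibc).
    std::cout << "L1 dcache line size: "
              << sysconf(_SC_LEVEL1_DCACHE_LINESIZE) << " bytes\n";
}
```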

When a core needs to access some area of memory, it first checks its L1 cache. If the data isn’t there, it checks its L2 cache, then the shared L3 cache. If the data isn’t in any of the caches, the core fetches it from main memory and places it in L1. If there isn’t room in L1, older data gets pushed from L1 down to L2, and if there isn’t enough room in L2, even older data gets pushed from L2 down to L3. In other words, the caches keep the most recently used data closest to the core.
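
You can actually see this hierarchy from software. Here’s a rough sketch (not a rigorous benchmark; the sizes and iteration counts are arbitrary): it chases pointers through a randomly-ordered buffer and prints the average time per access. As the working set outgrows L1, then L2, then L3, the per-access time jumps.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>

// Chase pointers through a random cycle of `slots` 8 byte entries and return
// the average nanoseconds per access. The random order defeats the hardware
// prefetcher, so the cost is dominated by wherever the data actually lives
// (L1, L2, L3, or main memory).
double ns_per_access(std::size_t slots) {
    std::vector<std::size_t> order(slots);
    std::iota(order.begin(), order.end(), 0);
    std::shuffle(order.begin() + 1, order.end(), std::mt19937{42});

    // Build one big cycle: order[0] -> order[1] -> ... -> order[0].
    std::vector<std::size_t> next(slots);
    for (std::size_t i = 0; i + 1 < slots; ++i) next[order[i]] = order[i + 1];
    next[order[slots - 1]] = order[0];

    const std::size_t steps = 20'000'000;
    std::size_t p = 0;
    auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < steps; ++i) p = next[p];
    auto stop = std::chrono::steady_clock::now();

    if (p == slots) std::cout << "";  // keep the loop from being optimized away
    return std::chrono::duration<double, std::nano>(stop - start).count() / steps;
}

int main() {
    // Working sets chosen to straddle typical L1 / L2 / L3 / main-memory sizes.
    for (std::size_t kb : {16, 128, 1024, 8192, 65536}) {
        std::cout << kb << " KB working set: "
                  << ns_per_access(kb * 1024 / sizeof(std::size_t))
                  << " ns per access\n";
    }
}
```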

But having multiple cores makes this a little more complex. Remember, each core has its own L1/L2 cache. When one core changes some data, any other core that later reads that data needs to see the updated version. This is called “cache coherence”: all cores need a coherent, consistent view of memory.

CPUs handle this, but at a price. For example, when core1 writes to some memory that is in its cache, it checks whether any other cores have that same memory in their caches as well. If they do, those cores’ caches have to be updated too! And this happens at “cache line” (64 byte) granularity. So to be clearer: when core1 writes to any part of a certain 64 byte chunk of memory (a cache line) that is in its cache, every other core that has that same 64 byte chunk cached must have its copy updated as well.
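
To make the 64 byte granularity concrete, here’s a tiny sketch (assuming a 64 byte cache line) showing that two small variables declared next to each other usually land in the same 64 byte chunk, and therefore on the same cache line:

```cpp
#include <cstdint>
#include <iostream>

// Two 8 byte counters declared right next to each other.
struct Counters {
    std::uint64_t a;
    std::uint64_t b;
};

int main() {
    Counters c{};
    // Assuming a 64 byte cache line: addresses that agree after dividing
    // by 64 live in the same 64 byte chunk, i.e. the same cache line.
    auto line_of = [](const void* p) {
        return reinterpret_cast<std::uintptr_t>(p) / 64;
    };
    std::cout << "a lives on line " << line_of(&c.a) << "\n";
    std::cout << "b lives on line " << line_of(&c.b) << "\n";
    std::cout << (line_of(&c.a) == line_of(&c.b) ? "same cache line\n"
                                                 : "different cache lines\n");
}
```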

This leads to multi-threaded performance degradation when multiple cores operate in the same 64 byte chunk of memory (that is, in the same cache line). Imagine core1 has a certain 64 byte chunk of memory in its cache (let’s call it chunk1), and core1 wants to write to chunk1. It wants to be fast, so instead of writing to main memory, it writes to its cached copy of chunk1. However, core1 notices that other cores also have chunk1 cached, so it has to update their caches as well! This is expensive: one core ends up having to update the caches of several other cores!
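
Here’s a rough benchmark sketch of exactly this situation (the struct names and iteration count are made up for illustration, and alignas(64) assumes a 64 byte line; std::hardware_destructive_interference_size would be the more portable constant). Two threads each hammer their own counter. In the first version the two counters sit in the same 64 byte chunk; in the second, each counter gets a cache line to itself.

```cpp
#include <atomic>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <thread>

constexpr std::size_t kIters = 50'000'000;

// Both counters packed into one 64 byte chunk: two threads that never
// touch the same variable still fight over the same cache line.
struct alignas(64) Packed {
    std::atomic<std::uint64_t> a{0};
    std::atomic<std::uint64_t> b{0};
};

// Each counter padded out to its own 64 byte chunk: nothing is shared.
struct Padded {
    alignas(64) std::atomic<std::uint64_t> a{0};
    alignas(64) std::atomic<std::uint64_t> b{0};
};

template <typename Counters>
double run_ms() {
    Counters c;
    auto bump = [](std::atomic<std::uint64_t>& x) {
        for (std::size_t i = 0; i < kIters; ++i)
            x.fetch_add(1, std::memory_order_relaxed);
    };
    auto start = std::chrono::steady_clock::now();
    std::thread ta([&] { bump(c.a); });
    std::thread tb([&] { bump(c.b); });
    ta.join();
    tb.join();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

int main() {
    std::cout << "same cache line:      " << run_ms<Packed>() << " ms\n";
    std::cout << "separate cache lines: " << run_ms<Padded>() << " ms\n";
}
```

On a typical multi-core machine the “same cache line” version comes out several times slower, even though the two threads never touch the same counter. All they share is the cache line.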

By the way, this phenomenon, where multiple threads operate too close together in memory (on the same 64 byte chunks), is called false sharing. False sharing degrades multi-threaded performance, and now you know why!