curiouscoding 3 days ago

Some questions/remarks:

- in the initial `Contains` code snippet (and all early-break variants), most performance is probably lost to branch misses on which element returns true. Probably much better is something like `table[0] == fp | table[1] == fp | table[2] == fp | table[3] == fp`, which might actually get optimized to int/SIMD anyway; who knows. (A rough sketch follows after this list.)

- What is shown in the benchmark table? I assume 'operations' is the number of lookups, but how large is the array? Memory latency is probably one of the biggest effects on overall runtime. It's probably more useful to show time per lookup than a mean 'total time'.

- Also, how do you run the benchmarks? If you build a fixed array of queries and query those many times, the same cache lines will be hit and you won't measure memory latency, and the branch predictor might learn all the branches, so it's best to do each query only once in your benchmark.

- You don't really comment on the differences between benches with different number of 'operations'. Are there any takeaways from this? (otherwise just don't show them?)

- Maybe you could try with some `u8x8` SIMD instructions as well? Although the bitmasking is also cute as-is :)
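
For concreteness, here's a rough C# sketch of the branchless form from the first point; the `bucket`/`fp` names are just illustrative, not taken from the article:

```csharp
// Rough sketch: a bucket of four 8-bit fingerprints, compared with a
// non-short-circuiting bitwise OR, so all four comparisons always run and
// ideally there is no data-dependent branch for the predictor to miss.
static bool ContainsBranchless(ReadOnlySpan<byte> bucket, byte fp)
{
    return (bucket[0] == fp) | (bucket[1] == fp)
         | (bucket[2] == fp) | (bucket[3] == fp);
}
```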

  • axeluser 3 days ago

    Hi, thank you for your questions! 1. The "one-line" boolean condition still has jumps in the JIT-produced asm, at least to short-circuit and exit as soon as possible. I guess the C# compiler is far too conservative for such optimisations.

    2 and 3. The "operations" count is both the array's size (actually the nearest power of 2) and the number of lookups. I first thought latency would be the same in both cases, but my benchmarks may be too naive - they test lookup by sequential memory access, i.e., iterating buckets from 0 to n, and iterating from first to last element in each bucket. I'll try to experiment with a more shuffled workload in my free time (see the sketch after this list). Thank you for the question; I hadn't thought about it deeply. Also, counting mispredictions is a good idea; I'll add this metric to my benchmarks to get a complete picture.

    4. I didn't pay much attention to that because the trend was still the same, and even the ratio was similar across the different lookup counts for each underlying memory layout. There was just some minor deviation, which I ignored. Maybe I'll run more granular tests with hardware counters to catch some insights.

    5. SIMD, in my opinion, would add overhead there, because a single lookup in my configuration uses a single 32-bit value, and the Cuckoo filter will touch at most 64 bits (2 lookups). However, I was planning to experiment with batching lookups; I just haven't gotten to it yet.
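
    For reference, a minimal sketch of what a shuffled, query-each-key-once measurement could look like. The `contains` delegate is just a placeholder for the filter lookup; a real run would call the concrete filter directly rather than through a delegate:

    ```csharp
    using System;
    using System.Diagnostics;

    static class ShuffledBench
    {
        // Shuffles the keys and queries each one exactly once, so the run
        // neither streams through memory nor lets the branch predictor
        // learn a fixed query sequence.
        public static double NsPerLookup(Func<uint, bool> contains, uint[] keys)
        {
            // Fisher-Yates shuffle for a cache-unfriendly access order.
            var rng = new Random(42);
            for (int i = keys.Length - 1; i > 0; i--)
            {
                int j = rng.Next(i + 1);
                (keys[i], keys[j]) = (keys[j], keys[i]);
            }

            bool sink = false;            // keep results live so calls aren't optimized away
            var sw = Stopwatch.StartNew();
            foreach (var key in keys)
                sink |= contains(key);
            sw.Stop();

            GC.KeepAlive(sink);
            return sw.Elapsed.TotalMilliseconds * 1_000_000.0 / keys.Length;
        }
    }
    ```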

    • curiouscoding 3 days ago

      > they test lookup by sequential memory access, i.e., iterating buckets from 0 to n, and iterating from first to last element in each bucket

      Yeah; that completely eliminates the cache misses / memory latency you'd have in practice. Of course eliminating that bottleneck is useful if you want to purely optimize CPU speed, but probably not quite as representative of real workloads. Also makes sense then that different array sizes give similar results: streaming tends to be fast anyway, regardless of the array size.

    • maxlapdev 2 days ago

      The one-liner shouldn't have short-circuiting since it uses the bitwise OR `|`, not the boolean/logical OR `||`.

      • axeluser 2 days ago

        Aha, you're right - I didn't look carefully.

  • curiouscoding 3 days ago

    Another remark:

    If you benchmark it as something like `for q in queries { contains(q); }`, especially the branchless variants are probably executed in parallel by the CPU, and you are measuring throughput instead of latency. That may or may not be relevant depending on the application.
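
    For illustration, a sketch of the two measurement styles; `contains` is a stand-in for the lookup under test, and the latency version perturbs each key with the previous result purely to create a serial dependency chain (so it queries different keys, which is fine for measuring the mechanism):

    ```csharp
    using System;

    static class LookupTiming
    {
        // Independent lookups: the out-of-order core overlaps many of them,
        // so the total time mostly reflects throughput.
        public static ulong MeasureThroughput(Func<uint, bool> contains, uint[] queries)
        {
            ulong hits = 0;
            foreach (var q in queries)
                hits += contains(q) ? 1UL : 0UL;
            return hits;
        }

        // Serial dependency chain: each key is xor-ed with the previous result,
        // so a lookup cannot start before the previous one finishes, and the
        // total time reflects per-lookup latency.
        public static uint MeasureLatency(Func<uint, bool> contains, uint[] queries)
        {
            uint carry = 0;
            foreach (var q in queries)
                carry += contains(q ^ carry) ? 1u : 0u;
            return carry;
        }
    }
    ```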

    • axeluser 3 days ago

      Hm, that may actually be true, thank you for the heads-up.

mizmar 3 days ago

Something similar is used in Swiss tables - metadata bucket entries are a 1-bit occupancy marker plus the 7 MSBs of the hash (I don't remember how tombstones are represented). The metadata table is scanned first; the upper bits of the hash filter out most colliding entries, and the "false positives" lead to a probe into the entry table and a key comparison (possibly optimized with a full-hash comparison).

Buckets use 16 bytes because sse2 and arm neon SIMD are basically guaranteed.
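
Roughly, one group probe looks like this (a sketch from memory using .NET 7's cross-platform Vector128 intrinsics; the control-byte encoding details are illustrative, not absl's exact code):

```csharp
using System.Runtime.Intrinsics;

// One Swiss-table-style group probe: 16 control bytes per group, where a full
// slot stores 7 hash bits with the top bit clear, and the empty/deleted
// sentinels have the top bit set (so a tag can never match them).
static uint MatchMask(ReadOnlySpan<byte> controlGroup, byte tag)
{
    // Compare all 16 control bytes against the tag at once.
    Vector128<byte> group = Vector128.Create(controlGroup);   // exactly 16 bytes
    Vector128<byte> match = Vector128.Equals(group, Vector128.Create(tag));

    // One bit per control byte; each set bit is a candidate slot that still
    // needs a key (or full-hash) comparison in the entry table.
    return Vector128.ExtractMostSignificantBits(match);
}
```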

I was shocked to read how Swiss tables - proclaimed to be the fastest hash tables - work. It's just an open-addressing linear probe with no technique to deal with collisions and clustering. Plus the initial version rounded hash % capacity down to the bucket size, thus using 4 fewer bits of the hash and leading to even more collisions and clustering. Yet the super-fast probe apparently made it a non-issue? Mind-boggling. (A later version allowed scanning from an arbitrary position by mirroring the first bucket as the last.)

  • jandrewrogers 3 days ago

    A simple linear probe is very efficient if the hash quality is high enough. More complex hash tables have the useful property that their performance is less sensitive to lower hash quality.

    Historically, the computational cost of excellent hash quality was too high, so the overall cost of a hash table was lower using a worse hash and more complex table design. Today, it is possible to generate high quality hashes with minimal computational cost, so complex hash table designs are less useful. Modern hash table performance can largely be reduced to the average number of cache lines (both data and instruction) touched by the operations.

  • Sesse__ 3 days ago

    The absl tables (the canonical Swisstables) are good, but not the fastest. Of course, everything will depend on your hardware, compiler and usage patterns, but if you look at an all-round benchmark like https://martin.ankerl.com/2022/08/27/hashmap-bench-01/, they're about in the middle of the pack even when picking the best-working hash functions (and really bad if you happen to have a bad one). This mirrors my own experience.

    Of course, absl tables will have been improved since 2022, but there are many competitors that have improved as well. And you could argue “well, who ever copies or iterates over a map”, but even if you take away those categories, they're not a clear leader really anywhere else either.

    • vlovich123 3 days ago

      I wouldn't say it's in the middle of the pack. They're very clearly at the front of the pack for most operations, but the author's summary penalizes their weaknesses more than those of others because they have a horse in the race.

  • tialaramex 3 days ago

    The Swiss table is not presently the fastest known design for many purposes - other open-addressed schemes take that honour today - and yes, nobody who wants a fast modern hash table tries to mitigate collisions: just use a decent hash and the collisions are rare enough to have negligible effect, so long as your scheme degrades rather than failing.

    The C++ approach is focused on mitigating the price of collision: accept that you lost, and try to make it not sting too badly. This made some sense because their provided hash for integers is typically literally the identity function, so collisions are very frequent and predictable. But "probably we lost, let's mitigate that" isn't a route to success, and it only made sense when memory fetches were cheap and ALU operations were expensive. Forty years ago that was a sane choice; thirty years ago, when C++ standardization was in progress, it was already looking dodgy; fifteen years ago, when they actually standardized this feature, it was already very stupid.

    Open addressed strategies begin by assuming you probably won. Your first main memory fetch is probably enough to know if this key is present, and if it is the next fetch probably gets the key and value.

    The std::unordered_map strategy will only tell you that the key was not present on its first fetch a small fraction of the time; on every other occasion you must do a second fetch to even discover whether the key might be present, and several more fetches are needed on average to get the final key + value or certainty that it's not found.

  • dietr1ch 3 days ago

    I think it just exploits the fact that reading a few contiguous cache lines at random isn't a whole lot more expensive than reading a single random byte.

    After I started experimenting with my priority queues, I found myself never wanting to use binary heaps, and instead using a wider heap with an arity that fills at least a whole cache line, if not multiple ones. Wider nodes meant fewer jumps, and jumping between nodes was more expensive (time, cache misses) than the work done at each node.
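
    For reference, the index arithmetic for such a wide (d-ary) heap is a small tweak on the binary case; a sketch where the arity is picked so one node's children cover a cache line's worth of 4-byte keys:

    ```csharp
    // Sketch: d-ary heap indexing in a flat array. With 4-byte keys and
    // Arity = 16, one node's 16 children are 64 consecutive bytes, roughly
    // one cache line, so sift-down does one jump per level rather than
    // touching scattered nodes.
    static class WideHeap
    {
        public const int Arity = 16;

        public static int Parent(int i) => (i - 1) / Arity;
        public static int FirstChild(int i) => i * Arity + 1;
        // Sift-down scans the children c in [FirstChild(i), FirstChild(i) + Arity).
    }
    ```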

  • adgjlsfhk1 3 days ago

    tombstones use the same bit as empty (empty and deleted don't have hash codes, so there's room).

    the key realization behind swissdict is that all the complications people add are only necessary if you have a bad hash function, so you can just use a good one and be happier.

    linear probing also has the significant advantage that it allows removing tombstones whenever the next entry is empty

  • teo_zero 3 days ago

    > A later version allowed scanning from an arbitrary position by mirroring the first bucket as the last.

    I don't think this would help. The real issue with an arbitrary position is that you can't load 16 bytes into a 128-bit SIMD register if the memory is not aligned. The solution I found is to unroll the first iteration and mask out the results found before the initial offset.
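
    Something like this sketch, in .NET Vector128 terms (illustrative, not the exact code): load the aligned 16-byte group containing the start position, then drop the match bits for lanes before the offset.

    ```csharp
    using System.Runtime.Intrinsics;

    // Unrolled first iteration: `control` is the metadata array (length a
    // multiple of 16), `start` the unaligned probe start index, `tag` the
    // control byte to look for.
    static uint FirstGroupMatches(ReadOnlySpan<byte> control, int start, byte tag)
    {
        int groupBase = start & ~15;    // start of the aligned 16-byte group
        int offset = start & 15;        // lanes before the probe start

        Vector128<byte> group = Vector128.Create(control.Slice(groupBase, 16));
        uint matches = Vector128.ExtractMostSignificantBits(
            Vector128.Equals(group, Vector128.Create(tag)));

        // Mask out matches that fall before the initial offset in this group.
        return matches & (uint.MaxValue << offset);
    }
    ```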

    • mizmar 3 days ago

      It's one of the improvements they claimed in the 2019 presentation: https://youtu.be/JZE3_0qvrMg?feature=shared&t=1054 They report a 10% speedup on find, but a 15% slowdown on insert. The speedup probably comes from using 4 more bits of the hash, which leads to fewer collisions, and the slowdown from the more complicated code for the mirroring.

      I'm still confused about the SIMD alignment. There are load instructions with alignment requirements (_mm_load_si128) and without (_mm_loadu_si128), and both claim the same latency and throughput. Somewhere I've heard the slowdown of unaligned access comes from using more bandwidth to load two aligned 128-bit lines and compose the unaligned value, but I have no idea whether this affects multiple loads of contiguous memory.

  • Rietty 3 days ago

    > Yet the super-fast probe apparently made it a non-issue?

    Could you explain how that makes it a non-issue? It seems counter-intuitive to me that it solves the problem by just probing faster?

Const-me 3 days ago

Great idea.

BTW, when I implement similar logic in C#, I place expressions like ((xored - 0x01010101U) & ~xored & 0x80808080U) inside unchecked blocks, because when I compile my C# I set the <CheckForOverflowUnderflow>true</CheckForOverflowUnderflow> compiler option.
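
For example, a minimal illustration of the pattern (the `HasZeroByte` name is mine, not from the article):

```csharp
// The subtraction is meant to wrap around whenever xored < 0x01010101U; with
// <CheckForOverflowUnderflow> enabled, that wrap would throw OverflowException,
// so the intentional wraparound goes inside an unchecked expression.
static bool HasZeroByte(uint xored)
{
    return unchecked(((xored - 0x01010101U) & ~xored & 0x80808080U) != 0);
}
```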

On modern computers the performance impact of that option is negligible. I’ve encountered several cases where OverflowException produced by integer arithmetic exposed bugs which were trivial to fix, but would have been insanely hard to detect and debug without that compiler option.

  • qcnguy 3 days ago

    That's cool. What sort of bugs are being found with the overflow checks that wouldn't also be caught by array range checks?

warangal 3 days ago

Nice write-up! It is about speeding up the underlying hash table of a `cuckoo filter`, where false positives are allowed, unlike a data structure such as a dictionary, which guarantees a false-positive rate of 0.

Whenever I need a dictionary/hash table in a hot loop where the maximum possible number of entries is known, I use a custom hash table initialized with that value. A standard library implementation may store the `hash` along with the `key` and `value` in case it needs to `resize`; in my case I don't store the `hash`, which gives a speedup of about 15-20% due to fewer cache misses, since no resize is possible!
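
A rough sketch of that idea: capacity fixed up front (so no resize is ever possible), linear probing, and only `key`/`value` stored per slot. Names and details are mine, not from any particular library:

```csharp
// Fixed-capacity open-addressing map: no resize, so no cached hash per entry.
public sealed class FixedMap<TValue>
{
    private readonly int _mask;
    private readonly bool[] _used;
    private readonly uint[] _keys;
    private readonly TValue[] _values;

    public FixedMap(int maxEntries)
    {
        // Power-of-two capacity with headroom so linear probing stays short.
        int capacity = 2;
        while (capacity < maxEntries * 2) capacity <<= 1;
        _mask = capacity - 1;
        _used = new bool[capacity];
        _keys = new uint[capacity];
        _values = new TValue[capacity];
    }

    // Cheap multiplicative hash; good enough for a sketch.
    private int Slot(uint key) => (int)((key * 2654435769u) & (uint)_mask);

    public void Add(uint key, TValue value)
    {
        int i = Slot(key);
        while (_used[i] && _keys[i] != key) i = (i + 1) & _mask;  // linear probe
        _used[i] = true; _keys[i] = key; _values[i] = value;
    }

    public bool TryGetValue(uint key, out TValue value)
    {
        int i = Slot(key);
        while (_used[i])
        {
            if (_keys[i] == key) { value = _values[i]; return true; }
            i = (i + 1) & _mask;
        }
        value = default!;
        return false;
    }
}
```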

Also, `cuckoo filters` seem to use a `robinhood hashing`-like strategy; wouldn't using one in a hot loop lead to a higher insert cost?

  • axeluser 3 days ago

    Thank you! You are right. Cuckoo filters use an open-addressing technique to deal with collisions, which may result in a long eviction chain on inserts, or even a failure if the load factor is high or you're simply unlucky. This is one major drawback to consider before using them. https://maltsev.space/blog/010-cuckoo-filters#fingerprints-a...

jadenPete 3 days ago

This is an off-topic and low-quality comment, but the animated dots on the background of this website made me think I had dust on my device for a second.

  • axeluser 2 days ago

    Yeah, that's a known issue. I'll fix it later.

alwahi 3 days ago

Okay wait, I thought hash table lookup was already O(1)...

  • drysine 3 days ago

    O(1) only means that the lookup time doesn't depend on the number of elements in the table. It can be 10 nanoseconds or 10 days.

    • Retr0id 3 days ago

      And in practice it's secretly O(log n) or thereabouts if you microbenchmark it, due to caching (if your whole hash table fits in L1, it will be faster than a hash table that doesn't).

      • gpm 3 days ago

        Worse, it's O(sqrt(n)) because we lay out data on roughly a plane and time is bounded below by how far away the data is.

        O(1) is in the mythical (easy to analyze, good approximation due to small constants on the sqrt(n) and log(n) terms) data model where random memory access is O(1), but it's not physically possible.

        • Dylan16807 3 days ago

          While you're factoring in real world constraints, you've gone back to theory rather than practice. In practice the RAM is not far enough away for the travel time to get past a rounding error. Also very high RAM installs use all three dimensions to pack more.

          • gpm 3 days ago

            Yeah the justification I gave is pure theory until you're talking large distributed systems... and the theory amuses me here so I'm going to keep talking about it first.

            But it's actually a place where practice happens, by coincidence, to match theory pretty well: https://www.ilikebigbits.com/2014_04_21_myth_of_ram_1.html

            • Dylan16807 3 days ago

              That's an interesting correlation between the sizes and latencies for cache levels. Thanks for linking it.

              Though once you're at the "distributed" tier, I wouldn't expect further increases to be sqrt. Distributed within a datacenter is probably going to have a log(n) topology of links, and distributed across datacenters is a fixed number of milliseconds based on the area you're serving with no real connection to amount of RAM.

        • SkiFire13 3 days ago

          This is a bit misleading, however, since the `n` here represents the total amount of RAM in the system, not the amount used by the algorithm.

          • gpm 3 days ago

            Eh... if the algorithm is dominating your app's runtime, the data it loads is probably in the fastest storage device you have that fits it - whether that's L1 cache or a cluster of a dozen computers, each with a TB of RAM. If not, you can probably optimize by making that the case. At which point `n` is actually the size of the data for the slow algorithm.

            I agree you don't often care about this scaling factor, though; the O(1) random-access RAM model gets you pretty far, and after that you're probably interested in optimizing for some specific data size anyway.

        • hn_go_brrrrr 3 days ago

          I would love to read more about this. Should I be looking at RAM row/column addressing costs?

          • gpm 3 days ago

            The link __s posted is really more the level where you need to be aware of the sqrt(n) cost, as you scale from one device to another (L1 cache -> L2 cache -> L3 cache -> ram -> SSD -> SSD in another computer in the same datacenter -> SSD in another computer in the same continent...)

            As far as I know (and I may just be ignorant), ignoring the very important part of the equation that is caches, it doesn't really matter which row/column you address in RAM; at that level things are dictated by clock speeds.

            Caches are obviously very important though, and beyond optimizing the probability of cache hits, on modern CPUs some cores are "closer" than others and thus cache-interactions between them are faster: https://github.com/nviennot/core-to-core-latency And if you optimize based on this you can get speedups, both at the macro level by doing things like scheduling things that talk on "close" cores, and at the micro level by doing things like implementing NUMA aware locking primitives: https://dl.acm.org/doi/10.1145/3477132.3483557

            There's also definitely been CPUs (not sure if this is still a thing) where some cores share memory channels and some cores don't so you can access RAM faster (higher bandwidth) if you spread your access between the two sets of cores instead of staying within one.

    • johnisgood 3 days ago

      Exactly. So many people get this wrong, it should be a question in interviews.

  • ot 3 days ago

    Now it's O(0.5)