keldaris 4 days ago

This looks like a nice case study for when you're already using Rust for other reasons and just want to make a bit of numerical code go fast. However, as someone mostly writing C++ and Julia, this does not look promising at all: it's clear that the Julia implementation is both more elegant and faster, and it seems much easier to reproduce that result in C++ (which has no issues with compile-time float constants, SIMD, GPU support, etc.) than in Rust.

I've written very little Rust myself, but when I've tried, I've always come away with a similar impression: that it's just not a good fit for performant numerical computing, with seemingly basic things (like proper SIMD support, const generics without weird restrictions, etc.) treated as afterthoughts. For those more up to speed on Rust development, is this impression accurate, or have I missed something and should I reconsider my view?

  • gauge_field 4 days ago

    In terms of speed, Rust is up there with C/C++. See e.g. https://benchmarksgame-team.pages.debian.net/benchmarksgame/... I have also ported several algorithms from C, and they matched the original performance.

    Regarding SIMD support, the only thing that is missing is stable support for AVX-512 and some of the more exotic feature extensions for deep learning, e.g. avx_vnni. Those are implemented and waiting to be included in upcoming stable versions.

    GPU support: this is still an issue because not enough people are working on it, but there are projects trying to improve the situation: see https://github.com/tracel-ai/cubecl .

    Const generics: yeah, there are a few annoying issues: they are limited to a small set of types. For instance, you can't use an enum as a const generic parameter. Also, you can't use generic parameters in const expressions on stable Rust: see the unstable feature generic_const_exprs.
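
    For example, a minimal sketch of that last restriction (illustrative code, not from the article): the output array length below depends on a generic parameter, which stable Rust rejects without the unstable generic_const_exprs feature.

    ```
        // Error on stable: "generic parameters may not be used in const operations".
        // Compiles only on nightly with #![feature(generic_const_exprs)].
        fn pad<const N: usize>(input: [f32; N]) -> [f32; N + 1] {
            let mut out = [0.0f32; N + 1];
            out[..N].copy_from_slice(&input);
            out
        }
    ```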

    My main reasons for using Rust in numerical computing:

    - The type system. Some find it weird; I find it explicit and easier to understand.

    - Cargo (nicer cross-platform defaults, since I tend to develop from both Windows and Linux).

    - Unconditional code generation with #[target_feature(enable = "feature_list")]. This means I don't have to set a different set of flags for each compilation unit when building; it is enough to put that attribute on top of the function making use of SIMD (minimal sketch below).
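
    A minimal sketch of what I mean (illustrative x86-64 function, not from the article): codegen inside this one function assumes AVX2+FMA regardless of the crate-wide build flags, so only this function needs the annotation.

    ```
        // The body is compiled as if -C target-feature=+avx2,+fma were set,
        // which lets LLVM vectorize it; the caller must ensure the CPU
        // actually supports these features before invoking it.
        #[target_feature(enable = "avx2,fma")]
        unsafe fn axpy(a: f32, x: &[f32], y: &mut [f32]) {
            for (xi, yi) in x.iter().zip(y.iter_mut()) {
                *yi = a.mul_add(*xi, *yi);
            }
        }
    ```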

    I agree that if you want to be fast and exploratory in developing algorithms and you can sacrifice a little bit of performance, Julia is a better choice.

    • bobmcnamara 4 days ago

      TBH, Intel ISAs have never been very stable; the mixed AVX flavors are just the latest example.

      • gauge_field 3 days ago

        Yeah, it moves so fast that AMD is not even able to catch up, and many of the extensions are available only on Intel CPUs.

        As far as I can tell, it is only unstable in the sense of moving fast and having many features. I don't see any breakage in my code, which uses cpuid to detect AVX-512 features.
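
        The detection side is just the standard runtime check (a trimmed sketch of the kind of code I mean, feature names as examples):

        ```
            // Runtime CPUID-based feature detection on stable Rust (std::arch).
            #[cfg(target_arch = "x86_64")]
            fn pick_kernel() -> &'static str {
                if is_x86_feature_detected!("avx512f") && is_x86_feature_detected!("avx512vl") {
                    "avx512"
                } else if is_x86_feature_detected!("avx2") {
                    "avx2"
                } else {
                    "scalar"
                }
            }
        ```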

        • bobmcnamara 31 minutes ago

          No, I mean things like when they removed BCD support, or when they dropped support for running 16-bit code when they added 64-bit mode. Or the bit string instructions. Or moving the cmpxchg encoding across chip revisions.

jonathrg 4 days ago

Tight loops of SIMD operations seem like something that might be more convenient to implement directly in assembly, so you don't need to babysit the compiler like this.

  • adgjlsfhk1 4 days ago

    Counterpoint: that makes you rewrite it for every architecture and every datatype. A high-level language makes it a lot easier to get something more readable, runnable everywhere, and datatype-generic.

    • brrrrrm 4 days ago

      You almost always have to for good perf on non-trivial operations

    • jonathrg 4 days ago

      For libraries, yes, that is a concern. In practice you're often dealing with just one or a handful of instances of the problem at hand.

    • menaerus 4 days ago

      If you're ready to settle for much lower IPC, then yes.

      • adgjlsfhk1 4 days ago

        The Julia version of this (https://github.com/miguelraz/StagedFilters.jl) is very clearly generating pretty great code. For `smooth!(SavitzkyGolayFilter{4,3}, rand(Float64, 1000), rand(Float64, 1000))`, 36 of the 74 inner-loop instructions are vectorized fmadds (of one kind or another). There are a few register spills that seem plausibly unnecessary, and some dependency chains that I think are an inherent part of the algorithm, but I'm pretty sure there isn't an additional 2x of speed to be had here.

        • menaerus 4 days ago

          I can write SIMD code that's 2x the speed of the non-vectorized code but I can also rewrite that same SIMD code so that 2x becomes 6x.

          Point being, not only can you get 2x on top of the initial 2x SIMD implementation, but usually much more.

          Whether or not you see SIMD in the codegen is not a testament to how good the implementation really is.

          IPC is the relevant measure here.

          • adgjlsfhk1 4 days ago

            IPC looks like 3.5 instructions per clock (and for bigger inputs the speed will be memory-bound rather than compute-bound).

            • menaerus 3 days ago

              If true, that would be somewhere between an ok-ish and a decent result, but since you didn't say anything about the experiment, how you ran it, or the hardware, it's hard to say more.

              I assume you didn't use https://github.com/miguelraz/StagedFilters.jl/blob/master/te... since that benchmark is flawed for more than a few reasons:

              - It operates on a dataset that cannot fit even in the L3 cache of most consumer machines: ~76 MB (float64) and ~38 MB (float32)

              - It intermixes Python with the Julia code

              - Not even a single repetition - no statistical significance

              - It does not show what happens with different data distributions, e.g. random data; it only uses a monotonically increasing sequence

              • ChrisRackauckas 3 days ago

                Did you link the wrong script? The script you show runs everything to statistical significance using Chairmarks.@b.

                Also, I don't understand what the issue would be with mixing Python and Julia code in the benchmark script. The Julia side JIT-compiles the invocations, which we've seen removes pretty much all non-Python overhead and actually makes the resulting SciPy calls faster than doing so from Python itself in many cases; see for example https://github.com/SciML/SciPyDiffEq.jl?tab=readme-ov-file#m... where invocations from Julia very handily outperform SciPy+Numba from the Python REPL. Of course, that is a higher-order function, so it's benefiting from the Julia JIT in other ways as well, but the point is that in previous benchmarks we've seen the overhead floor was so low (~100ns IIRC) that it didn't affect benchmarks negatively for Python and actually improved many Python timings in practice. Though it would be good to show in this case what exactly the difference is in order to isolate any potential issue, I would be surprised if it's more than ~100ns of overhead for an invocation like this, and with 58ms being the benchmark size, that cost is well below the noise floor.

                Trying different datasets is of course valid, though. There's no reason to reject a benchmark just because it doesn't fit into the L3 cache; there are many use cases for that. But it does not mean that all use cases are likely to see such a result.

                • menaerus 2 days ago

                  If it doesn't fit into the L3 cache, then the benchmark is not measuring just the CPU efficiency of your SIMD approach; it also inherently includes the different (and complex) effects of the L1+L2+L3 cache hierarchy, the memory prefetchers, and main memory latency. The latency of accessing RAM alone is ~two to three orders of magnitude larger than the latency of executing a single instruction.

                  Per se this isn't wrong, but you may get a skewed picture of your algorithm's efficiency because you're measuring multiple things at once, perhaps things you aren't interested in, or didn't even know exist or were included in the numbers of your experiment, e.g. how well the memory prefetcher understands the access patterns of your experiment.

                  For the same reason, I think it's better to avoid mixing many different things into the same benchmark run.

                  I stand corrected wrt statistical significance; I am not familiar with Julia and misunderstood that part.

  • anonymoushn 4 days ago

    I'm happy to use compilers for register allocation, even though they aren't particularly good at it

  • menaerus 4 days ago

    Definitely. And I also found the title quite misleading: it's auto-vectorization that this presentation is actually covering. Compile-time SIMD, OTOH, would mean something totally different, e.g. doing the computation at compile time using constexpr SIMD intrinsics.

    I'd also add that it's not only about babysitting the compiler; you're also leaving a lot of performance on the table. Auto-vectorized code, generally speaking, unfortunately cannot beat manually vectorized code (written either with intrinsics or in asm).

jtrueb 4 days ago

I think scientific computing in Rust is getting too much attention without contribution lately. We don't have many essential language features stabilized: SIMD, generic const exprs, intrinsics, function-level optimization overrides, and reasonable floating-point overrides for fast math are all a long way from being stabilized. To get better perf, the code ends up full of informal compiler hints that guide the optimizer toward an optimization like autovectorization or branch elision. The semantics around strict floating-point standards are stifling, and intrinsics have become less accessible than they used to be.
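
For example, a sketch of the kind of informal hint I mean (illustrative function, not from the article): asserting the slice lengths up front so the optimizer can drop per-iteration bounds checks and autovectorize the loop.

```
    // Without the assert, indexing b[i] keeps a bounds check (and panic branch)
    // in every iteration, which tends to block autovectorization; the assert
    // lets LLVM prove the accesses are in range and vectorize the loop body.
    fn scale_add(a: &mut [f32], b: &[f32], k: f32) {
        assert_eq!(a.len(), b.len());
        for i in 0..a.len() {
            a[i] += k * b[i];
        }
    }
```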

Separately, is Julia hitting a different LA backend? Rust's ndarray with blas-src on Accelerate is pretty fast, but the Rust implementation is a little slower on my MacBook. This is a benchmark of a dot product.

```

    use divan::Bencher;
    use ndarray::Array;

    const M10: usize = 10_000_000;
    #[divan::bench]
    fn ndarray_dot32(b: Bencher) {
        b.with_inputs(|| (Array::from_vec(vec![0f32; M10]), Array::from_vec(vec![0f32; M10])))
            .bench_values(|(a, b)| {
                a.dot(&b)
            });
    }

    #[divan::bench]
    fn chunks_dot32(b: Bencher) {
        b.with_inputs(|| (vec![0f32; M10], vec![0f32; M10]))
            .bench_values(|(a, b)| {
                a.chunks_exact(32)
                    .zip(b.chunks_exact(32))
                    .map(|(a, b)| a.iter().zip(b.iter()).map(|(a, b)| a * b).sum::<f32>())
                    .sum::<f32>()
            });
    }

    #[divan::bench]
    fn iter_dot32(b: Bencher) {
        b.with_inputs(|| (vec![0f32; M10], vec![0f32; M10]))
            .bench_values(|(a, b)| {
                a.iter().zip(b.iter()).map(|(a, b)| a * b).sum::<f32>()
            });
    }
    
    ---- Rust ----
    Timer precision: 41 ns (100 samples)
    flops             fast     │ slow     │ median   │ mean
    ├─ chunks_dot32   3.903 ms │ 9.96 ms  │ 4.366 ms │ 4.411 ms
    ├─ chunks_dot64   4.697 ms │ 16.29 ms │ 5.472 ms │ 5.516 ms
    ├─ iter_dot32     10.37 ms │ 11.36 ms │ 10.93 ms │ 10.86 ms
    ├─ iter_dot64     11.68 ms │ 13.07 ms │ 12.43 ms │ 12.4 ms
    ├─ ndarray_dot32  1.984 ms │ 2.91 ms  │ 2.44 ms  │ 2.381 ms
    ╰─ ndarray_dot64  4.021 ms │ 5.718 ms │ 5.141 ms │ 4.965 ms

    ---- Julia ----
    native_dot32:
    Median: 1.623 ms, Mean: 1.633 ms ± 341.705 μs
    Range: 1.275 ms - 12.242 ms

    native_dot64:
    Median: 5.286 ms, Mean: 5.179 ms ± 230.997 μs
    Range: 4.736 ms - 5.617 ms

    simd_dot32:
    Median: 1.818 ms, Mean: 1.830 ms ± 142.826 μs
    Range: 1.558 ms - 2.169 ms

    simd_dot64:
    Median: 3.564 ms, Mean: 3.567 ms ± 586.002 μs
    Range: 3.123 ms - 22.887 ms

    iter_dot32:
    Median: 9.566 ms, Mean: 9.549 ms ± 144.503 μs
    Range: 9.302 ms - 10.941 ms

    iter_dot64:
    Median: 9.666 ms, Mean: 9.640 ms ± 84.481 μs
    Range: 9.310 ms - 9.867 ms

    All: 0 bytes, 0 allocs

```

https://github.com/trueb2/flops-bench

TinkersW 4 days ago

I only clicked through the slides and didn't watch the video... but ugh, all I see is scalar SIMD in the assembly output (the ss suffix means scalar; it would be ps if it were packed/vector).

And they are apparently relying on the compiler to generate it... just no.

Use intrinsics, it isn't that hard.
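
Something along these lines, for instance (a rough x86-64 sketch with AVX2+FMA intrinsics, remainder handling omitted), is what I'd expect instead of the scalar ss code:

```
    use std::arch::x86_64::*;

    // Manually vectorized f32 dot product: packed (ps) fused multiply-adds,
    // 8 lanes at a time. Assumes the caller has checked for AVX2+FMA support
    // and handles any remainder of len % 8 elements separately.
    #[target_feature(enable = "avx2,fma")]
    unsafe fn dot_avx2(a: &[f32], b: &[f32]) -> f32 {
        let mut acc = _mm256_setzero_ps();
        for (ca, cb) in a.chunks_exact(8).zip(b.chunks_exact(8)) {
            let va = _mm256_loadu_ps(ca.as_ptr());
            let vb = _mm256_loadu_ps(cb.as_ptr());
            acc = _mm256_fmadd_ps(va, vb, acc); // acc += va * vb
        }
        // Horizontal sum of the 8 accumulator lanes.
        let mut lanes = [0f32; 8];
        _mm256_storeu_ps(lanes.as_mut_ptr(), acc);
        lanes.iter().sum()
    }
```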

feverzsj 4 days ago

I think most Rust projects still depend on Clang's vector extensions when SIMD is required.