[C#] Performance gets crushed (-30%) on Intel Celeron® J1800 #583
-
@ItalyToast I think the size of the L1 cache also matters; at least on my hardware it does. I have a 32 kB L1 cache, and methods that do their calculations within that size, like my own, benefit from it. You can run `getconf -a` on Linux to get your cache sizes; the Celeron's L1 cache is 24 kB according to the published report in PrimeView.
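If you'd rather check from inside a .NET program than with getconf, here's a minimal sketch that reads the standard Linux sysfs cache entries (treat the paths as an assumption for your particular kernel/container setup):

```csharp
using System;
using System.IO;

class CacheInfo
{
    static void Main()
    {
        // On Linux, each index* directory under cpu0/cache describes one cache.
        const string root = "/sys/devices/system/cpu/cpu0/cache";
        foreach (var dir in Directory.EnumerateDirectories(root, "index*"))
        {
            string Read(string name) => File.ReadAllText(Path.Combine(dir, name)).Trim();
            // Prints e.g. "L1 Data: 32K" or "L2 Unified: 256K"
            Console.WriteLine($"L{Read("level")} {Read("type")}: {Read("size")}");
        }
    }
}
```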
-
I think you could make this a general discussion about performance differences between machines, not just one specific to C#. I know that my Julia solution's performance can differ wildly between machines, so it might be good to consolidate information here about which factors affect performance. That's one reason I opened #584, to improve consistency of results between machines, since I think my optimizations were getting close to exploiting CPU quirks instead of being overall performance improvements. I had a discussion over at the Julia Discourse about why my loop was faster on my machine, and it had something to do with register pressure. I don't really understand it all, but I do know that a lot of the solutions in this repository are about as optimized as they can be, so CPU features really start to matter and can affect how we optimize our solutions. The Rust solution's striped algorithm is designed, if I understand correctly, to make better use of the cache. It's also why they implemented "striped-blocks" to work with CPUs with small cache sizes, like the Raspberry Pi.
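To make the cache idea concrete, here's a simplified C# sketch of block-wise clearing. This is only an illustration of the general technique, not the actual Rust striped/striped-blocks implementation; the block size, names, and index math are made up for the example:

```csharp
using System;

static class BlockedClear
{
    // One bool (1 byte) per slot => 16 kB per block, comfortably under a
    // 24-32 kB L1 data cache. A bit-packed sieve would cover 8x the range.
    const int BlockSlots = 16 * 1024;

    // Illustration only: clear composites block by block so each block stays
    // hot in L1 while every prime's multiples inside it are cleared.
    public static void ClearBlocked(bool[] composite, int[] primes)
    {
        for (int blockStart = 0; blockStart < composite.Length; blockStart += BlockSlots)
        {
            int blockEnd = Math.Min(blockStart + BlockSlots, composite.Length);
            foreach (int p in primes)
            {
                // First multiple of p at or after blockStart (and no earlier than p*p).
                int first = Math.Max(p * p, (blockStart + p - 1) / p * p);
                for (int i = first; i < blockEnd; i += p)
                    composite[i] = true;
            }
        }
    }
}
```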
-
Another factor might be that your CPU is a lot faster (~3.9 GHz vs ~2.5 GHz), meaning that the instructions used to set up the loops, for example, carry a lot less overhead on your machine than on the Celeron. Most of the time is spent in the clearing loop, so the faster you can get through the setup instructions, the more the inner loop matters. If your C# version has high overhead but a faster loop, it would probably win on fast machines; if the Rust version has low overhead but a slightly slower loop, then on slower machines the reduced overhead might make up for the slower loop. I think that's what happened with my solution. I haven't examined the assembly for either your C# solution or the Rust solution, though, so I can't verify what I'm speculating.
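For what it's worth, here's what I mean by "setup" vs the inner loop, in a hypothetical, simplified clearing routine (not taken from either solution):

```csharp
// Hypothetical sketch, not from either solution. composite[i] == true means
// i is not prime; assumes only odd primes are passed in (evens handled elsewhere).
static void ClearMultiples(bool[] composite, int prime)
{
    // --- setup: a handful of instructions per prime ---
    int start  = prime * prime;   // first composite worth clearing
    int stride = prime * 2;       // skip even multiples

    // --- the hot inner loop: where nearly all of the time goes ---
    for (int i = start; i < composite.Length; i += stride)
        composite[i] = true;
}
```

On a ~3.9 GHz core the setup portion costs almost nothing, so the quality of the inner loop dominates; on the Celeron the per-prime setup is a bigger slice of the total.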
-
For what it's worth, here are the results from running a few solutions on Ubuntu 20.04 under WSL2 (without Docker, since I don't have the space for it) on a machine with an Intel Core i5-9300H. CPU-Z reports 32 KBytes each of L1 D-Cache and L1 I-Cache (8-way set associative, 64-byte line size), 256 KBytes of L2 Cache (4-way set associative, 64-byte line size), and 8 MBytes of L3 Cache (16-way set associative, 64-byte line size). I used @ItalyToast's csharp-docker-net5 branch from #603 since I don't have .NET Core 3.1 currently installed on WSL2.

PrimeCSharp/solution_4 (best run out of 3): 9048 passes per 5 seconds
I've shortened the output here since there's a lot of unnecessary information. This only concerns the single-threaded implementations.
And just for fun, here's my PrimeJulia/solution_3, which I really only optimized on this machine.
-
I was eagerly waiting to see my performance results pop up on PrimeView yesterday. Then the numbers came in: #28, 2680. 👀 What just happened? I was expecting to see it around mike-barber_bit-storage-striped, since my machine (Intel i7-4770K) beats it by a couple of hundred passes/sec. 2680 passes/s vs 3799 passes/s: that's a 30%+ deviation from the expected numbers. Both runs were in Docker with "docker run solution_x", where x is the solution number.
Maybe this is just a difference in hardware causing these discrepancies, but if so, what is the difference? The L2 cache is bigger than the sieve on both processors: the sieve stores only odd numbers, so it needs 1,000,000 / 2 = 500,000 bits ≈ 62.5 kB, while the Celeron and the 4770K have 1 MB and 256 kB of L2 cache respectively.
One theory is that the allocation and GC in C# consume more memory and therefore have an impact. The allocations stay nice and tidy and only cause one or two cache misses per pass, but we keep building up garbage on the heap until a GC runs. The GC would take a lot of cache misses, BUT the GC runs on a background thread if I'm not mistaken, so it shouldn't impact performance. Rust, in this case, allocates one sieve, calculates, deallocates, and the next iteration allocates in the same space that was just freed.
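One cheap way to test the GC theory would be to count collections during the 5-second window. A minimal sketch, where RunSieves is just a placeholder standing in for the solution's actual pass loop:

```csharp
using System;
using System.Diagnostics;
using System.Runtime;

// Snapshot GC counts before the benchmark window.
int gen0 = GC.CollectionCount(0), gen1 = GC.CollectionCount(1), gen2 = GC.CollectionCount(2);

var sw = Stopwatch.StartNew();
int passes = RunSieves(TimeSpan.FromSeconds(5));
sw.Stop();

Console.WriteLine($"{passes} passes in {sw.Elapsed.TotalSeconds:F2} s");
Console.WriteLine("GC gen0/gen1/gen2 during run: " +
    $"{GC.CollectionCount(0) - gen0}/{GC.CollectionCount(1) - gen1}/{GC.CollectionCount(2) - gen2}");
Console.WriteLine($"Server GC: {GCSettings.IsServerGC}, latency mode: {GCSettings.LatencyMode}");

// Placeholder standing in for the real benchmark loop.
static int RunSieves(TimeSpan duration) => 0;
```

If the gen0 count climbs much higher on one machine than the other, the allocation/GC theory would at least be worth digging into.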
Anyone got ideas about what is going on here, or how to debug these performance issues? Since I don't have access to Dave's machine, I'm not sure how to proceed.
The code in question: CSharp/solution_4
Here is the spec sheet for the two processors:
Celeron J1800
i7-4770K