[C#] Performance gets crushed (-30%) on Intel Celeron® J1800 #583
-
@ItalyToast I think the size of the L1 cache also matters; at least on my hardware it does. I have a 32 kB L1 cache, and methods that do their calculations within that size, like my own, benefit from it. You can run `getconf -a` on Linux to get your cache sizes; the Celeron's L1 cache is 24 kB according to the published report in PrimeView.
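If you'd rather check from inside a .NET program than with getconf, here's a minimal sketch that reads the standard Linux sysfs cache entries (treat the paths as an assumption for your particular kernel/container setup):

```csharp
using System;
using System.IO;

class CacheInfo
{
    static void Main()
    {
        // On Linux, each index* directory under cpu0/cache describes one cache.
        const string root = "/sys/devices/system/cpu/cpu0/cache";
        foreach (var dir in Directory.EnumerateDirectories(root, "index*"))
        {
            string Read(string name) => File.ReadAllText(Path.Combine(dir, name)).Trim();
            // Prints e.g. "L1 Data: 32K" or "L2 Unified: 256K"
            Console.WriteLine($"L{Read("level")} {Read("type")}: {Read("size")}");
        }
    }
}
```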
-
I think you could make this a general discussion about performance differences between machines, not just one specific to C#. I know that my Julia solution's performance can differ wildly between machines, so it might be good to consolidate information here about which factors affect performance. That's one reason I opened #584, to improve consistency of results between machines, since I think my optimizations were getting close to exploiting CPU quirks instead of being overall performance improvements. I had a discussion over at the Julia Discourse about why my loop was faster on my machine, and it had something to do with register pressure. I don't really understand it all, but I do know that a lot of the solutions in this repository are about as optimized as they can be, so CPU features really start to matter and can affect how we optimize our solutions. The Rust solution's striped algorithm is designed, if I understand correctly, to make better use of the cache. It's also why they implemented "striped-blocks" to work with CPUs with small cache sizes, like the Raspberry Pi.
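To make the cache idea concrete, here's a simplified C# sketch of block-wise clearing. This is only an illustration of the general technique, not the actual Rust striped/striped-blocks implementation; the block size, names, and index math are made up for the example:

```csharp
using System;

static class BlockedClear
{
    // One bool (1 byte) per slot => 16 kB per block, comfortably under a
    // 24-32 kB L1 data cache. A bit-packed sieve would cover 8x the range.
    const int BlockSlots = 16 * 1024;

    // Illustration only: clear composites block by block so each block stays
    // hot in L1 while every prime's multiples inside it are cleared.
    public static void ClearBlocked(bool[] composite, int[] primes)
    {
        for (int blockStart = 0; blockStart < composite.Length; blockStart += BlockSlots)
        {
            int blockEnd = Math.Min(blockStart + BlockSlots, composite.Length);
            foreach (int p in primes)
            {
                // First multiple of p at or after blockStart (and no earlier than p*p).
                int first = Math.Max(p * p, (blockStart + p - 1) / p * p);
                for (int i = first; i < blockEnd; i += p)
                    composite[i] = true;
            }
        }
    }
}
```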
-
Another factor might be that your CPU is a lot faster (~3.9 GHz vs ~2.5 GHz), meaning that the instructions used to set up the loops, for example, carry a lot less overhead on your machine than on the Celeron. Most of the time is spent in the clearing loop, so the faster you can get through the setup instructions, the more the inner loop matters. If your C# version has high overhead but a faster loop, it would probably win on fast machines; if the Rust version has low overhead but a slightly slower loop, then on slower machines the reduced overhead might make up for the slower loop. I think that's what happened with my solution. I haven't examined the assembly for either your C# solution or the Rust solution, though, so I can't verify what I'm speculating.
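For what it's worth, here's what I mean by "setup" vs the inner loop, in a hypothetical, simplified clearing routine (not taken from either solution):

```csharp
// Hypothetical sketch, not from either solution. composite[i] == true means
// i is not prime; assumes only odd primes are passed in (evens handled elsewhere).
static void ClearMultiples(bool[] composite, int prime)
{
    // --- setup: a handful of instructions per prime ---
    int start  = prime * prime;   // first composite worth clearing
    int stride = prime * 2;       // skip even multiples

    // --- the hot inner loop: where nearly all of the time goes ---
    for (int i = start; i < composite.Length; i += stride)
        composite[i] = true;
}
```

On a ~3.9 GHz core the setup portion costs almost nothing, so the quality of the inner loop dominates; on the Celeron the per-prime setup is a bigger slice of the total.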
-
For what it's worth, here are the results from running a few solutions on Ubuntu 20.04 under WSL2 (without Docker, since I don't have the space for it) on a machine with an Intel Core i5-9300H. CPU-Z reports 32 KBytes each of L1 D-Cache and L1 I-Cache (8-way set associative, 64-byte line size), 256 KBytes of L2 Cache (4-way set associative, 64-byte line size), and 8 MBytes of L3 Cache (16-way set associative, 64-byte line size). I used @ItalyToast's csharp-docker-net5 branch from #603 since I don't have .NET Core 3.1 currently installed on WSL2.

PrimeCSharp/solution_4 (best run out of 3): 9048 passes per 5 seconds
I've shortened the output here since there's a lot of unnecessary information. This only concerns the single-threaded implementations.
And just for fun, here's my PrimeJulia/solution_3, which I really only optimized on this machine.
-
I was eagerly waiting to see my performance results pop up on PrimeView yesterday. Then the numbers came in: #28, 2680. 👀 What just happened? I was expecting to see it around mike-barber_bit-storage-striped, since my machine (Intel i7-4770K) beats it by a couple of hundred passes/sec. 2680 passes/s vs 3799 passes/s: that's a 30%+ deviation from the expected numbers. Both runs were in Docker with "docker run solution_x", where x is the solution number.
Maybe this is just a difference in hardware causing these discrepancies, but if so, what is the difference? The L2 cache is bigger than the sieve on both processors: the sieve stores only odd numbers, so it needs 1,000,000 / 2 = 500,000 bits ≈ 62.5 kB, while the Celeron and the 4770K have 1 MB and 256 kB of L2 cache respectively.
One theory is that the allocation and GC in C# consume more memory and therefore have an impact. The allocations stay nice and tidy and only cause one or two cache misses per pass, but we keep building up garbage on the heap until a GC runs. The GC would take a lot of cache misses, BUT the GC runs on a background thread if I'm not mistaken, so it shouldn't impact performance. Rust, in this case, allocates one sieve, calculates, deallocates, and the next iteration allocates in the same space that was just freed.
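One cheap way to test the GC theory would be to count collections during the 5-second window. A minimal sketch, where RunSieves is just a placeholder standing in for the solution's actual pass loop:

```csharp
using System;
using System.Diagnostics;
using System.Runtime;

// Snapshot GC counts before the benchmark window.
int gen0 = GC.CollectionCount(0), gen1 = GC.CollectionCount(1), gen2 = GC.CollectionCount(2);

var sw = Stopwatch.StartNew();
int passes = RunSieves(TimeSpan.FromSeconds(5));
sw.Stop();

Console.WriteLine($"{passes} passes in {sw.Elapsed.TotalSeconds:F2} s");
Console.WriteLine("GC gen0/gen1/gen2 during run: " +
    $"{GC.CollectionCount(0) - gen0}/{GC.CollectionCount(1) - gen1}/{GC.CollectionCount(2) - gen2}");
Console.WriteLine($"Server GC: {GCSettings.IsServerGC}, latency mode: {GCSettings.LatencyMode}");

// Placeholder standing in for the real benchmark loop.
static int RunSieves(TimeSpan duration) => 0;
```

If the gen0 count climbs much higher on one machine than the other, the allocation/GC theory would at least be worth digging into.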
Anyone got ideas about what is going on here, or how to debug these performance issues? Since I don't have access to Dave's machine, I'm not sure how to proceed.
The code in question: CSharp/solution_4
Here is the spec sheet for the two processors:
Celeron J1800
i7-4770K