Naming convention for striped/unpeeled block based solutions #673
Replies: 3 comments 7 replies
-
How about making the default, unspecified block size 16 Kilobytes and adding the specification only for other sizes? Otherwise there will need to be a flurry of PR's to match your proposed standard, as most implementations using blocks use the 16 Kilobyte size. A block size of 16 Kilobytes is generally a good workable size, since the CPU L1 data cache sizes for the different CPU's on the leaderboard are 32, 24, and 32 Kilobytes for the Intel i7-8750, Intel Celeron J1800, and Broadcom BCM2711 ARM64, respectively. mike-barber's "small" blocks are 4 Kilobytes, and the smaller size only improves things slightly on the ARM64 CPU due to the way that CPU handles "cache thrashing". Using bigger than 16 Kilobytes, such as 32 Kilobytes, means the block size then exceeds the CPU L1 cache size for the Celeron, putting that CPU at a comparative disadvantage; and since a bit-packed odds-only buffer for all the odd composite bits to a million is only 62,500 bytes in size, a 64 Kilobyte block size doesn't do anything at all and, if used, may even slow the code if there are extra bounds checks.
If you want to standardize, we might want to use the term "striped" to refer to implementations that re-order the bit representations in each block so the composite order advances in common bit index order, as contrasted with "unpeeled", which advances through all the bit indices per word in word index order. The advantage of "striped" (as defined above) over "unpeeled" is that, since the represented composite numbers advance in order, no secondary storage of intermediate block termination values needs to be kept; the disadvantage is that resetting the bit indices per "stripe" is somewhat more complex. Performance-wise it is likely pretty much a wash, although it may depend slightly on the implementation and/or compiler optimizations.
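To make the sizes above concrete, here is a minimal sketch of a bit-packed, odds-only sieve culled one 16 Kilobyte block at a time. This is not any particular leaderboard entry; the function and macro names are mine for illustration. The 62,500-byte figure for one million falls straight out of the buffer declaration (1,000,000 / 2 odd numbers / 8 bits per byte):

```c
#include <stdint.h>
#include <string.h>

#define LIMIT       1000000
#define BLOCK_BYTES (16 * 1024)        /* 16 KB blocks, fitting typical L1 */
#define BLOCK_BITS  (BLOCK_BYTES * 8)

/* Bit j represents the odd number 2*j + 1; the whole odds-only buffer
   for one million is LIMIT/16 = 62,500 bytes. */
static uint8_t sieve[LIMIT / 16 + 1];

static int count_primes(void) {
    int nbits = LIMIT / 2;             /* 500,000 odd-number bits */
    memset(sieve, 0, sizeof sieve);
    /* walk the bit buffer one L1-sized block at a time */
    for (int base = 0; base < nbits; base += BLOCK_BITS) {
        int end = base + BLOCK_BITS < nbits ? base + BLOCK_BITS : nbits;
        /* cull each base prime's odd multiples within this block only */
        for (int i = 1; ; i++) {
            long long p = 2LL * i + 1;
            if (p * p > 2LL * end - 1)          /* past the block's top */
                break;
            if (sieve[i >> 3] & (1 << (i & 7))) /* i's number is composite */
                continue;
            int j = (int)((p * p - 1) / 2);     /* bit index of p*p */
            if (j < base)                       /* advance into this block */
                j += (int)((base - j + p - 1) / p * p);
            for (; j < end; j += (int)p)
                sieve[j >> 3] |= (uint8_t)(1 << (j & 7));
        }
    }
    int count = 1;                      /* the even prime, 2 */
    for (int j = 1; j < nbits; j++)     /* bit 0 represents 1: skip it */
        if (!(sieve[j >> 3] & (1 << (j & 7))))
            count++;
    return count;
}
```

Each block is fully culled by every relevant base prime before the next block is touched, which is exactly why block size relative to L1 matters: all the marking traffic for one block stays cache-resident.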
Even when the block size is specified, there is also a difference in what "unrolled" means. Most of the "block" implementations also use manual loop unrolling with four marking operations per loop, although I have found for some languages and CPU's that using eight marking operations is slightly faster (as it should be, since there is then slightly less loop overhead per marking operation, countered by how it may impact the loop branch prediction). I use "unrolled" to mean my "Extreme Loop Unrolling" technique, whereby the eight "unpeeled" loops are combined into a single loop with eight marking operations (or multiples of eight) per loop, each of the eight at a different bit index. If we want to make it clear which one is used, then I suppose I should change my implementations to use the tag "extreme" to distinguish them from the simply "unrolled" ones.
However, all of this tagging may make the label field pretty wide, as in "name_unpeeled_block4K_unrolled4_hybrid64", meaning that for the sparse culling I used the "unpeeled" technique combined with a 4K block size unrolled to four markings per loop, along with the dense hybrid technique over 64-bit words. A way to keep the tagging shorter would be to have a table of definitions in the CONTRIBUTING.md and/or README.md file for the whole project, defining what the tags mean along with standard abbreviations for them. Most of these are only of use for the compiled languages, from C#, F#, and AssemblyScript to all of the native-code-compiling languages, so it might be better just to standardize on the fastest technique available and all write a version to that, which as per my latest Chapel tests would be either the above (perhaps the fastest for the Celeron; I haven't fully tested yet) or "name_extreme_unrolled8_hybrid64" for the rest.
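A minimal sketch of the eight-markings-per-loop idea described above (this is not the "Extreme Loop Unrolling" code itself, just the underlying observation): because the culling step is odd, the bit position within a byte repeats with period eight, so eight marking operations with fixed masks and fixed byte offsets can share one loop body. The function names and the naive reference version are mine, for illustration:

```c
#include <stdint.h>
#include <string.h>

/* Naive reference: one marking operation per loop iteration. */
static void cull_naive(uint8_t *buf, int nbits, int start, int step) {
    for (int j = start; j < nbits; j += step)
        buf[j >> 3] |= (uint8_t)(1 << (j & 7));
}

/* Unrolled by 8: for an odd step, (j & 7) cycles through all 8 bit
   positions with period 8, so each of the 8 marks in the body uses a
   constant mask and a constant byte offset from the cycle's base byte. */
static void cull_unrolled8(uint8_t *buf, int nbits, int start, int step) {
    uint8_t mask[8];
    int off[8];
    for (int k = 0; k < 8; k++) {
        int j = start + k * step;
        mask[k] = (uint8_t)(1 << (j & 7));
        off[k]  = (j >> 3) - (start >> 3);   /* byte offset within cycle */
    }
    int j = start;
    /* full cycles: 8 marking operations per loop iteration */
    while (j + 7 * step < nbits) {
        uint8_t *base = buf + (j >> 3);      /* advances by step bytes */
        base[off[0]] |= mask[0]; base[off[1]] |= mask[1];
        base[off[2]] |= mask[2]; base[off[3]] |= mask[3];
        base[off[4]] |= mask[4]; base[off[5]] |= mask[5];
        base[off[6]] |= mask[6]; base[off[7]] |= mask[7];
        j += 8 * step;
    }
    for (; j < nbits; j += step)             /* remaining tail marks */
        buf[j >> 3] |= (uint8_t)(1 << (j & 7));
}
```

The payoff is that the unrolled loop amortizes one loop-control branch over eight stores, at the cost of a small setup and a scalar tail loop.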
If all "Top Fuel" implementations used these top techniques (or one or the other), this would allow an easier comparison between CPU's and compiler optimizations, given the common technique(s) used. I believe this technique can be used across all compiled languages - whether effectively or not will depend on the implementation and the compiler - although for languages without templates or macros, external code generation may need to be used for the "dense hybrid" technique, as I had to do for Chapel.
-
I want to propose a new characteristics tag. According to CONTRIBUTING.md, the tags look like they should support arbitrary data; the constraint is that neither name nor value is longer than 32 characters. If the tags support arbitrary key-value pairs, there should be no issue adding it to the output before it's implemented in primeview. Requesting feedback from @rbergen
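The cited length constraint is simple enough to state as code. A hypothetical check (the function name is mine; the 32-character limit is the one the comment quotes from CONTRIBUTING.md):

```c
#include <stdbool.h>
#include <string.h>

/* A characteristics tag is an arbitrary key/value pair, but neither the
   name nor the value may exceed 32 characters. */
static bool tag_is_valid(const char *name, const char *value) {
    return strlen(name) <= 32 && strlen(value) <= 32;
}
```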
-
All solutions have to iterate over the whole sieve every time they're resetting factors, or they're not faithful. In light of this, I don't know if a block size categorisation really makes sense. Let me try to explain... The reason I am using […] Given this, I'd say that my […] Having noted this, I'm certainly happy to relabel my blocks as follows, if @rbergen thinks it makes sense:
I'd probably suggest not changing other details of the reporting: it's bound to involve a lot of work for very little benefit.
-
There are a couple of stripe-inspired, block-based solutions on the leaderboard now, but it's not clear what the block sizes are for the different implementations. I want to propose that we include the block size in the name of the solution, like I have done on mine: "italytoast-stride8-blocks16k". The idea is that with the size in the name, we can easily see on the PrimeView site whether it fits in the L1 cache; that way we don't need to dive into the code to figure out what the actual block size is. For example, we can see on the ARM processor that mike's "small-blocks" is slightly faster than "blocks". We assume that is because the L1 is smaller than the block, but by how much, and how big an impact does it have?
@mike-barber @GordonBGood @ssovest @fvbakel Tagging you here since I see you have solutions with "block" in the name.
edit: this became quite the thread. It might seem like a trivial thing, but this is a "dragrace" and the L1 cache is the turbo that is feeding the dragster engine. How we optimize around the L1 cache is one of the most important performance characteristics of a solution, IMO.
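A hypothetical sketch of what a viewer like PrimeView could do with the proposed convention: pull a "blocks<N>k" suffix out of a solution name and check the implied block size against a given L1 data cache size. The helper names are mine, and the parsing assumes the exact "blocks16k"-style suffix shown above:

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Extract the block size, in bytes, from a name like
   "italytoast-stride8-blocks16k"; return -1 if no "blocks" tag exists. */
static int block_bytes_from_name(const char *name) {
    const char *tag = strstr(name, "blocks");
    if (tag == NULL)
        return -1;
    return atoi(tag + strlen("blocks")) * 1024;   /* "16k" -> 16384 */
}

/* Does the tagged block size fit a given L1 data cache? */
static bool fits_l1(const char *name, int l1_bytes) {
    int bytes = block_bytes_from_name(name);
    return bytes > 0 && bytes <= l1_bytes;
}
```

With the size encoded in the name, the fit check needs nothing but string parsing, which is exactly the "no need to dive into the code" point of the proposal.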