Naming convention for striped/unpeeled block based solutions #673
Replies: 3 comments 7 replies
-
How about making the default, unspecified block size 16 Kilobytes and adding the specification only for other sizes? Otherwise there will need to be a flurry of PR's to match your proposed standard, as most implementations using blocks use the 16 Kilobyte size. A block size of 16 Kilobytes is generally a good workable size, since the CPU L1 data cache sizes for the different CPU's on the leaderboard are 32, 24, and 32 Kilobytes for the Intel i7-8750, Intel Celeron J1800, and Broadcom BCM2711 ARM64, respectively. mike-barber's "small" blocks are 4 Kilobytes, and the smaller size only improves things slightly on the ARM64 CPU due to the way that CPU handles "cache thrashing". Using bigger than 16 Kilobytes, such as 32 Kilobytes, means the block size then exceeds the CPU L1 cache size for the Celeron, putting that CPU at a comparative disadvantage; and since a bit-packed odds-only buffer for all the odd composite bits to a million is only 62,500 bytes in size, a 64 Kilobyte block size doesn't do anything at all and, if used, may even slow the code if there are extra bounds checks.
If you want to standardize, we might want to use the term "striped" to refer to implementations that re-order the bit representations in each block so the composite order advances in common bit index order, as contrasted with "unpeeled", which advances through all the bit indices per word in word index order. The advantage of "striped" (as defined above) over "unpeeled" is that, since the represented composite numbers advance in order, no secondary storage of intermediate block termination values needs to be kept; the disadvantage is that resetting the bit indices per "stripe" is somewhat more complex. Performance-wise it is likely pretty much a wash, although it may depend slightly on the implementation and/or compiler optimizations.
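To make the sizes above concrete, here is a minimal sketch of a bit-packed, odds-only sieve culled one 16 Kilobyte block at a time. This is not any particular leaderboard entry; the function and macro names are mine for illustration. The 62,500-byte figure for one million falls straight out of the buffer declaration (1,000,000 / 2 odd numbers / 8 bits per byte):

```c
#include <stdint.h>
#include <string.h>

#define LIMIT       1000000
#define BLOCK_BYTES (16 * 1024)        /* 16 KB blocks, fitting typical L1 */
#define BLOCK_BITS  (BLOCK_BYTES * 8)

/* Bit j represents the odd number 2*j + 1; the whole odds-only buffer
   for one million is LIMIT/16 = 62,500 bytes. */
static uint8_t sieve[LIMIT / 16 + 1];

static int count_primes(void) {
    int nbits = LIMIT / 2;             /* 500,000 odd-number bits */
    memset(sieve, 0, sizeof sieve);
    /* walk the bit buffer one L1-sized block at a time */
    for (int base = 0; base < nbits; base += BLOCK_BITS) {
        int end = base + BLOCK_BITS < nbits ? base + BLOCK_BITS : nbits;
        /* cull each base prime's odd multiples within this block only */
        for (int i = 1; ; i++) {
            long long p = 2LL * i + 1;
            if (p * p > 2LL * end - 1)          /* past the block's top */
                break;
            if (sieve[i >> 3] & (1 << (i & 7))) /* i's number is composite */
                continue;
            int j = (int)((p * p - 1) / 2);     /* bit index of p*p */
            if (j < base)                       /* advance into this block */
                j += (int)((base - j + p - 1) / p * p);
            for (; j < end; j += (int)p)
                sieve[j >> 3] |= (uint8_t)(1 << (j & 7));
        }
    }
    int count = 1;                      /* the even prime, 2 */
    for (int j = 1; j < nbits; j++)     /* bit 0 represents 1: skip it */
        if (!(sieve[j >> 3] & (1 << (j & 7))))
            count++;
    return count;
}
```

Each block is fully culled by every relevant base prime before the next block is touched, which is exactly why block size relative to L1 matters: all the marking traffic for one block stays cache-resident.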
Even when the block size is specified, there is also a difference in what "unrolled" means. Most of the "block" implementations also use manual loop unrolling with four marking operations per loop, although I have found for some languages and CPU's that using eight marking operations is slightly faster (as it should be, since there is then slightly less loop overhead per marking operation, countered by how it may impact the loop branch prediction). I use "unrolled" to mean my "Extreme Loop Unrolling" technique, whereby the eight "unpeeled" loops are combined into a single loop with eight marking operations (or multiples of eight) per loop, each of the eight at a different bit index. If we want to make it clear which one is used, then I suppose I should change my implementations to use the tag "extreme" to distinguish them from the simply "unrolled" ones.
However, all of this tagging may make the label field pretty wide, as in "name_unpeeled_block4K_unrolled4_hybrid64", meaning that for the sparse culling I used the "unpeeled" technique combined with a 4K block size unrolled to four markings per loop, along with the dense hybrid technique over 64-bit words. A way to keep the tagging shorter would be to have a table of definitions in the CONTRIBUTING.md and/or README.md file for the whole project, defining what the tags mean along with standard abbreviations for them. Most of these are only of use for the compiled languages, from C#, F#, and AssemblyScript to all of the native-code-compiling languages, so it might be better just to standardize on the fastest technique available and all write a version to that, which as per my latest Chapel tests would be either the above (perhaps the fastest for the Celeron; I haven't fully tested yet) or "name_extreme_unrolled8_hybrid64" for the rest.
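A minimal sketch of the eight-markings-per-loop idea described above (this is not the "Extreme Loop Unrolling" code itself, just the underlying observation): because the culling step is odd, the bit position within a byte repeats with period eight, so eight marking operations with fixed masks and fixed byte offsets can share one loop body. The function names and the naive reference version are mine, for illustration:

```c
#include <stdint.h>
#include <string.h>

/* Naive reference: one marking operation per loop iteration. */
static void cull_naive(uint8_t *buf, int nbits, int start, int step) {
    for (int j = start; j < nbits; j += step)
        buf[j >> 3] |= (uint8_t)(1 << (j & 7));
}

/* Unrolled by 8: for an odd step, (j & 7) cycles through all 8 bit
   positions with period 8, so each of the 8 marks in the body uses a
   constant mask and a constant byte offset from the cycle's base byte. */
static void cull_unrolled8(uint8_t *buf, int nbits, int start, int step) {
    uint8_t mask[8];
    int off[8];
    for (int k = 0; k < 8; k++) {
        int j = start + k * step;
        mask[k] = (uint8_t)(1 << (j & 7));
        off[k]  = (j >> 3) - (start >> 3);   /* byte offset within cycle */
    }
    int j = start;
    /* full cycles: 8 marking operations per loop iteration */
    while (j + 7 * step < nbits) {
        uint8_t *base = buf + (j >> 3);      /* advances by step bytes */
        base[off[0]] |= mask[0]; base[off[1]] |= mask[1];
        base[off[2]] |= mask[2]; base[off[3]] |= mask[3];
        base[off[4]] |= mask[4]; base[off[5]] |= mask[5];
        base[off[6]] |= mask[6]; base[off[7]] |= mask[7];
        j += 8 * step;
    }
    for (; j < nbits; j += step)             /* remaining tail marks */
        buf[j >> 3] |= (uint8_t)(1 << (j & 7));
}
```

The payoff is that the unrolled loop amortizes one loop-control branch over eight stores, at the cost of a small setup and a scalar tail loop.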
If all "Top Fuel" implementations used these top techniques (or one or the other), this would allow an easier comparison between CPU's and compiler optimizations, given the common technique(s) used. I believe this technique can be used across all compiled languages - whether effectively or not will depend on the implementation and the compiler - although for languages without templates or macros, external code generation may need to be used for the "dense hybrid" technique, as I had to do for Chapel.
-
I want to propose a new characteristics tag. According to CONTRIBUTING.md, the tags look like they should support arbitrary data; the constraint is that neither name nor value is longer than 32 characters. If the tags support arbitrary key-value pairs, there should be no issue adding it to the output before it's implemented in primeview. Requesting feedback from @rbergen
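The cited length constraint is simple enough to state as code. A hypothetical check (the function name is mine; the 32-character limit is the one the comment quotes from CONTRIBUTING.md):

```c
#include <stdbool.h>
#include <string.h>

/* A characteristics tag is an arbitrary key/value pair, but neither the
   name nor the value may exceed 32 characters. */
static bool tag_is_valid(const char *name, const char *value) {
    return strlen(name) <= 32 && strlen(value) <= 32;
}
```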
-
All solutions have to iterate over the whole sieve every time they're resetting factors, or they're not faithful. In light of this, I don't know if a block size categorisation really makes sense. Let me try to explain... The reason I am using […] Given this, I'd say that my […] Having noted this, I'm certainly happy to relabel my blocks as follows, if @rbergen thinks it makes sense:
I'd probably suggest not changing other details of the reporting: it's bound to involve a lot of work for very little benefit.
-
There are a couple of stripe-inspired, block-based solutions on the leaderboard now, but it's not clear what the block sizes are for the different implementations. I want to propose that we include the block size in the name of the solution, like I have done on mine: "italytoast-stride8-blocks16k". The idea is that with the size in the name, we can easily see on the PrimeView site whether it fits in the L1 cache; that way we don't need to dive into the code to figure out what the actual block size is. For example, we can see on the ARM processor that mike's "small-blocks" is slightly faster than "blocks". We assume that is because the L1 is smaller than the block, but by how much, and how big an impact does it have?
@mike-barber @GordonBGood @ssovest @fvbakel Tagging you here since I see you have solutions with "block" in the name.
edit: this became quite the thread. It might seem like a trivial thing, but this is a "dragrace" and the L1 cache is the turbo that is feeding the dragster engine. How we optimize around the L1 cache is one of the most important performance characteristics of a solution, IMO.
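A hypothetical sketch of what a viewer like PrimeView could do with the proposed convention: pull a "blocks<N>k" suffix out of a solution name and check the implied block size against a given L1 data cache size. The helper names are mine, and the parsing assumes the exact "blocks16k"-style suffix shown above:

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Extract the block size, in bytes, from a name like
   "italytoast-stride8-blocks16k"; return -1 if no "blocks" tag exists. */
static int block_bytes_from_name(const char *name) {
    const char *tag = strstr(name, "blocks");
    if (tag == NULL)
        return -1;
    return atoi(tag + strlen("blocks")) * 1024;   /* "16k" -> 16384 */
}

/* Does the tagged block size fit a given L1 data cache? */
static bool fits_l1(const char *name, int l1_bytes) {
    int bytes = block_bytes_from_name(name);
    return bytes > 0 && bytes <= l1_bytes;
}
```

With the size encoded in the name, the fit check needs nothing but string parsing, which is exactly the "no need to dive into the code" point of the proposal.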