minimal but convincing example? #50

Open
jwaldmann opened this issue Oct 23, 2018 · 3 comments

@jwaldmann

I would very much like the documentation to contain a minimal convincing example
(and for that example to also be included in the tests, so we can be sure that it works).

"Convincing" in the sense that

  • running with +RTS -N<k> (for small k) visibly, and progressively, reduces execution time
  • code is readable and stand-alone (no fancy extra libraries)

For reference, I am using the following C# example (sum of bitcounts) in teaching,
and I like that it has these properties:

  • it's two one-liners (naive bitcount implementation, naive summation)
  • parallelisation is trivial (add .AsParallel())
  • effect is visible (cut execution time nearly in half)
  • works out of the box in the csharp REPL
Func<int,int> bc = (int x) => { int c = 0; while (x > 0) { c += x % 2; x >>= 1; } return c; }

Time(() => Console.WriteLine(Enumerable.Range(0, 1<<27).Select(bc).Sum()))
       1811939328
       00:00:03.5504990

Time(() => Console.WriteLine(Enumerable.Range(0, 1<<27).AsParallel().Select(bc).Sum()))
       1811939328
       00:00:02.1616790

If there is such an example (my naive use of parListChunk does not seem to cut it), I'm happy to write it up as Haddock and submit a PR.

NB: sure, I know (and discuss with students) that there are better implementations of bc, and that the sum of bitcounts (up to powers of two) has a closed form, so we don't actually need any of this...
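
For concreteness, here is a sketch of the kind of naive parListChunk attempt meant above (the chunk size and module layout are illustrative; whether this form gives a convincing speedup is exactly what is in question):

{-# LANGUAGE BangPatterns #-}
-- Sketch only: a naive Haskell analogue of the C# bitcount example above,
-- parallelised with parListChunk. Build with ghc -O2 -threaded and run
-- with +RTS -N<k>; the chunk size is a guess, not a tuned value.
import Control.Parallel.Strategies (withStrategy, parListChunk, rseq)
import Data.Bits ((.&.), shiftR)

-- naive bit count, mirroring the C# one-liner
bc :: Int -> Int
bc = go 0
  where
    go !c 0 = c
    go !c x = go (c + (x .&. 1)) (x `shiftR` 1)

main :: IO ()
main = print
     $ sum                                        -- expected: 1811939328, as in the C# run
     $ withStrategy (parListChunk (2 ^ 16) rseq)
     $ map bc [0 .. 2 ^ 27 - 1]

Building a list of 2^27 boxed Ints allocates heavily, which may be why this does not scale as nicely as the C# version; that is one of the rabbit holes discussed further down.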

@l29ah commented Sep 30, 2022

‰ time ghc -e 'import Control.Parallel.Strategies' -e 'import Control.DeepSeq' -e 'fib x = product [1..x]' -e 'deepseq (parMap rdeepseq fib $ replicate 5 200000) ()' +RTS -N
()
ghc -e 'import Control.Parallel.Strategies' -e 'import Control.DeepSeq' -e  -  36.62s user 5.15s system 388% cpu 10.743 total
‰ time ghc -e 'import Control.Parallel.Strategies' -e 'import Control.DeepSeq' -e 'fib x = product [1..x]' -e 'deepseq (map fib $ replicate 5 200000) ()' +RTS -N               
()
ghc -e 'import Control.Parallel.Strategies' -e 'import Control.DeepSeq' -e  -  33.28s user 6.57s system 165% cpu 24.062 total
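
For reference, here is the same test as a standalone program; compiling with -O2 -threaded (instead of running interpreted via ghc -e) is an assumption of this sketch, and the file name is arbitrary:

-- Sketch: the one-liner above as a compiled module (say, ParFac.hs).
-- Build with ghc -O2 -threaded ParFac.hs, run with ./ParFac +RTS -N<k>;
-- replacing parMap rdeepseq with a plain map gives the sequential baseline.
import Control.Parallel.Strategies (parMap, rdeepseq)
import Control.DeepSeq (deepseq)

fib :: Integer -> Integer   -- named as in the one-liner; it is really factorial
fib x = product [1 .. x]

main :: IO ()
main = deepseq (parMap rdeepseq fib (replicate 5 200000)) (putStrLn "done")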

@jwaldmann (Author)

yes, nice. (fib = factorial?)

I have to specify (reduce) the number of capabilities (my computer has too many?)


$ time ghc -e 'import Control.Parallel.Strategies' -e 'import Control.DeepSeq' -e 'fib x = product [1..x]' -e 'deepseq (parMap rdeepseq fib $ replicate 5 200000) ()' +RTS -N6 
()

real	0m6.178s
user	0m20.932s
sys	0m1.761s

$ time ghc -e 'import Control.Parallel.Strategies' -e 'import Control.DeepSeq' -e 'fib x = product [1..x]' -e 'deepseq (map fib $ replicate 5 200000) ()' +RTS -N6 
()

real	0m12.747s
user	0m13.292s
sys	0m1.911s

You are throwing away the result (I understand; otherwise we'd be measuring the time for printing). My example above has the advantage that we can see, and compare, the results.

And ... the rabbit holes ... (why is the time only cut in half, where does the extra system time come from, why does this allocate at all, etc., etc.)
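
A sketch of a variant that keeps a cheap digest of the results, so a sequential and a parallel run can be compared without also measuring the cost of printing; the modulus is arbitrary and purely for illustration:

import Control.Parallel.Strategies (parMap, rdeepseq)

fac :: Integer -> Integer
fac x = product [1 .. x]

main :: IO ()
main = do
  let results = parMap rdeepseq fac (replicate 5 200000)
  -- reducing each huge Integer modulo a small prime is cheap compared to
  -- computing the factorial, and avoids printing the enormous numbers themselves
  print (map (`mod` 1000003) results)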

@jwaldmann (Author) commented Oct 4, 2022

Perhaps this? A slightly silly way of repeatedly computing the factorial of 10:

import Control.Parallel.Strategies
import Data.List (permutations)

-- each list element counts the 10! = 3628800 permutations of a 10-element
-- list; parList rseq evaluates the eight counts in parallel
main =  print
    $ withStrategy (parList rseq)
    $ map (length . permutations . take 10 . enumFrom) [1 .. 8]

Or, with ghci:

$ ghci  +RTS -N8 -A1m

GHCi, version 9.2.4: https://www.haskell.org/ghc/  :? for help
ghci> import Control.Parallel.Strategies
ghci> import Data.List (permutations)
ghci> :set +s
ghci> map (length . permutations . take 10 . enumFrom) [1 .. 8]
[3628800,3628800,3628800,3628800,3628800,3628800,3628800,3628800]
(3.12 secs, 8,737,949,784 bytes)
ghci> withStrategy (parList rseq)  $ map (length . permutations . take 10 . enumFrom) [1 .. 8]
[3628800,3628800,3628800,3628800,3628800,3628800,3628800,3628800]
(1.28 secs, 1,092,361,960 bytes)

If we use -N4 when starting ghci, we see the first four list elements appear at the same time, then the next four. Or we can do:

ghci> setNumCapabilities 1
(0.00 secs, 34,296 bytes)
ghci> withStrategy (parList rseq)  $ map (length . permutations . take 10 . enumFrom) [1 .. 8]
[3628800,3628800,3628800,3628800,3628800,3628800,3628800,3628800]
(2.80 secs, 8,737,948,216 bytes)
ghci> setNumCapabilities 2
(0.00 secs, 34,304 bytes)
ghci> withStrategy (parList rseq)  $ map (length . permutations . take 10 . enumFrom) [1 .. 8]
[3628800,3628800,3628800,3628800,3628800,3628800,3628800,3628800]
(1.69 secs, 1,092,362,304 bytes)

warning - more rabbit holes: https://gitlab.haskell.org/ghc/ghc/-/issues/20783

I guess the problem with such examples in ghci is:

  • more allocations => more work for the RTS =?=> less parallelisable
  • bytecode produced by ghci is not optimised (enough)? => hard to program a non-allocating workload (see the compiled sketch at the end of this comment).

Your factorial example is nice because it mostly (?) spends its time in GMP (long multiplication). In my example, the work is done in Data.List.permutations, a library module for which ghci loads (optimised) object code rather than bytecode? But then that is a "fancy extra library" (see the top post) ...
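
One compiled way around the allocation point above, as a sketch (the chunk count, names, and build flags are illustrative, not tested here): give each spark a strict fold over its own sub-range, so no large intermediate list is built.

{-# LANGUAGE BangPatterns #-}
-- Sketch: a low-allocation workload using only base and parallel.
-- Each spark strictly sums bit counts over its own sub-range.
-- Build with ghc -O2 -threaded, run with +RTS -N<k>.
import Control.Parallel.Strategies (parMap, rseq)

-- naive bit count, as in the sketch near the top of the thread
bc :: Int -> Int
bc = go 0
  where
    go !c 0 = c
    go !c x = go (c + x `mod` 2) (x `div` 2)

-- strict sum of bit counts over the half-open range [lo, hi)
rangeSum :: (Int, Int) -> Int
rangeSum (lo, hi) = go 0 lo
  where
    go !acc i
      | i >= hi   = acc
      | otherwise = go (acc + bc i) (i + 1)

main :: IO ()
main = do
  let n      = 2 ^ (27 :: Int)
      chunks = 64                -- arbitrary; a few chunks per capability
      size   = n `div` chunks
      ranges = [ (k * size, (k + 1) * size) | k <- [0 .. chunks - 1] ]
  print (sum (parMap rseq rangeSum ranges))   -- expected: 1811939328, as before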
