GPU version is quite efficient (here is the score advantage, plotted on the y-axis against time, when playing against a single-core CPU)
Scales up nicely to a large distributed system using MPI (Message Passing Interface); tested on a 2048-node supercomputer with up to 3.5M GPU threads
I used this code while working on my PhD thesis. The MPI version has been tested on the Japanese TSUBAME supercomputer.