PairwiseDistancesReductions are Cython-based implementations of computationally expensive patterns found in many of scikit-learn's algorithms.
To maintain them over the long term, maintainers, as well as authors and reviewers of Pull Requests proposing changes, need to be able to assess performance regressions between revisions easily and confidently.
This independent asv benchmark suite is meant to help in this regard.
For more context, see:
This suite can be installed with:
git clone git@github.com:jjerphan/pairwise-distances-reductions-asv-suite.git
cd pairwise-distances-reductions-asv-suite
pip install git+https://github.com/airspeed-velocity/asv
This suite can be run with:
# This might take a while (i.e. several hours up to a day)
# if all combinations are benchmarked.
asv run
For more targeted runs, see the documentation of asv's commands.
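As one illustration of a narrower run, the sketch below assembles an `asv run` command line restricted to benchmarks whose names match a regular expression, using asv's `--bench` option. The helper `build_asv_run_command` and the pattern `ArgKmin` are hypothetical and chosen only for this example; the command is printed rather than executed.

```python
import shlex


def build_asv_run_command(bench_regex=None, quick=False):
    """Assemble an `asv run` argument list (hypothetical helper)."""
    command = ["asv", "run"]
    if bench_regex is not None:
        # --bench only runs benchmarks whose name matches the regex.
        command += ["--bench", bench_regex]
    if quick:
        # --quick runs each benchmark function once: handy as a smoke
        # test, but not suitable for reliable timings.
        command.append("--quick")
    return command


# "ArgKmin" is an assumed benchmark name pattern, not taken from the suite.
command = build_asv_run_command(bench_regex="ArgKmin", quick=True)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run` as-is, which avoids shell quoting issues with regular expressions.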
Get timely feedback on performance improvements or regressions when needed for a scikit-learn Pull Request
In particular:
- have a GitHub Actions workflow that can be triggered by a comment
- specify the revisions to compare (forwarded to asv continuous)
- be able to indicate the configuration to run benchmarks for, in particular regarding the following parameters' values:
  - PairwiseDistancesReductions
  - metric
  - format of (X, Y) (in {sparse, dense}²)
- have the full, verbose, sorted asv textual report
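The configuration space described above can be sketched as a Cartesian product. The reduction and metric names below are assumptions made for illustration, and `select_configurations` is a hypothetical helper; the suite itself may expose different values and a different interface.

```python
from itertools import product

# Assumed values, for illustration only.
REDUCTIONS = ("ArgKmin", "RadiusNeighbors")
METRICS = ("euclidean", "manhattan")
FORMATS = ("dense", "sparse")

# The (X, Y) formats span {sparse, dense}²: every ordered pair of formats.
XY_FORMATS = tuple(product(FORMATS, repeat=2))


def select_configurations(reductions=REDUCTIONS, metrics=METRICS,
                          xy_formats=XY_FORMATS):
    """Enumerate the (reduction, metric, (X format, Y format)) combinations."""
    return list(product(reductions, metrics, xy_formats))


# Restrict a run to a single metric while keeping all other combinations.
for reduction, metric, (x_format, y_format) in select_configurations(
    metrics=("euclidean",)
):
    print(reduction, metric, x_format, y_format)
```

Restricting any one axis shrinks the number of combinations multiplicatively, which is why being able to indicate a configuration matters for run time.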
In particular:
- output graphs of hardware scalability
- report an estimate of the proportion of sequential code using Amdahl's law
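One possible way to obtain such an estimate is to invert Amdahl's law, S(n) = 1 / (s + (1 - s) / n), for the sequential fraction s given a measured speedup S(n) on n threads. The sketch below does so and averages the per-thread-count estimates; the function names are hypothetical and the timings are synthetic, generated from the law itself rather than measured.

```python
def amdahl_speedup(sequential_fraction, n_threads):
    """Speedup predicted by Amdahl's law for a given sequential fraction."""
    return 1.0 / (sequential_fraction + (1.0 - sequential_fraction) / n_threads)


def estimate_sequential_fraction(timings):
    """Estimate the sequential fraction from wall-clock timings.

    `timings` maps a number of threads to a duration in seconds and
    must contain an entry for 1 thread plus at least one other entry.
    """
    single_thread_time = timings[1]
    estimates = []
    for n_threads, duration in timings.items():
        if n_threads == 1:
            continue
        speedup = single_thread_time / duration
        # Amdahl's law solved for s: s = (1/S - 1/n) / (1 - 1/n).
        estimates.append((1.0 / speedup - 1.0 / n_threads) / (1.0 - 1.0 / n_threads))
    if not estimates:
        raise ValueError("need at least one timing with more than 1 thread")
    return sum(estimates) / len(estimates)


# Synthetic timings for code that is 10% sequential (not real measurements).
timings = {n: 8.0 / amdahl_speedup(0.1, n) for n in (1, 2, 4, 8)}
estimate = estimate_sequential_fraction(timings)
print(round(estimate, 6))
```

On real measurements the per-thread-count estimates will differ, and a growing estimate as threads are added usually points to overheads that Amdahl's law does not model.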
Benchmarks are correctly and entirely reproducible, traceable and reportable when the following constraints are met:
- the same machine is used over time: in practice, we can't expect CI providers to allocate the same machine over time, nor to dispatch to machines with identical specifications at a given time.
- no process other than the benchmarks runs on the machine: in practice, we can't expect CI providers to guarantee process isolation
- benchmark definitions aren't changed between revisions: this requires not reformatting the benchmarks' Python code, because asv hashes the content of the file to track benchmarks over time
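The last constraint can be illustrated with a plain content hash. This is only a sketch of why a content-based identifier is sensitive to formatting; it is not asv's actual hashing code, and `source_fingerprint` and the benchmark snippet are hypothetical.

```python
import hashlib


def source_fingerprint(source):
    """Return a hex digest of the exact text of a benchmark definition."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


# Two behaviourally identical definitions that differ only in layout.
original = "def time_argkmin(self):\n    self.run(k=10)\n"
reformatted = "def time_argkmin(self):\n    self.run(\n        k=10,\n    )\n"

same_fingerprint = source_fingerprint(original) == source_fingerprint(reformatted)
print(same_fingerprint)
```

The comparison fails even though both definitions do the same work, so results recorded before and after a reformatting would no longer be attributed to the same benchmark under such a scheme.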