[PERF] MPdist : Replace np.sort with np.partition for k-th value #378

JaKasb · 2021-05-08T17:33:23Z

JaKasb
May 8, 2021

In MPdist.py
https://github.com/TDAmeritrade/stumpy/blob/main/stumpy/mpdist.py

The output is the k-th smallest value of sorted(P_ABBA)

However one can extract the k-th smallest value without sorting the full array.
Skipping the sort() is possible by using numpy.partition()
https://numpy.org/doc/stable/reference/generated/numpy.partition.html#numpy.partition

sort() is O(nlogn) whereas partition() is O(n)

I don't know if this speedup affects the overall runtime or if MASS/STOMP is by far the dominant runtime consumer.
I also don't know if dask and numba support numpy.partition()

BTW I like your code and writing style.
I use stumpy in my research.

seanlaw · 2021-05-08T19:48:06Z

seanlaw
May 8, 2021
Maintainer

@JaKasb Thank you so much for starting the discussion and for your wonderful feedback regarding the code/writing style as your support means a lot.

In MPdist.py https://github.com/TDAmeritrade/stumpy/blob/main/stumpy/mpdist.py
The output is the k-th smallest value of sorted(P_ABBA)

However one can extract the k-th smallest value without sorting the full array. Skipping the sort() is possible by using numpy.partition() https://numpy.org/doc/stable/reference/generated/numpy.partition.html#numpy.partition
sort() is O(nlogn) whereas partition() is O(n)

I could be wrong, but I believe that the time complexity might be closer to O(k + k log k), since np.partition returns values in unsorted order (O(k)) and then you'll need to sort the values (O(k log k)). Perhaps one could replace the sort and simply take the np.max value in O(k) time. Regardless, if n is substantially large then you are absolutely right that this would be much faster than a full sort, but I'd be curious to know what your average size of n is and how frequently you are calling mpdist. Hopefully this question does not come across as dismissive, as I am genuinely curious.

I don't know if this speedup affects the overall runtime or if MASS/STOMP is by far the dominant runtime consumer.

This is a valid point. If you are only doing this mpdist sorting once/twice then the computational time should be relatively cheap especially on a short array as the computation should be dominated by MASS/STOMP as you've pointed out. However, if you are running mpdist thousands of times then that sorting time can certainly add up as n increases. In its current form, it is probably acceptable for the average, infrequent use case and, in the STUMPY code base, we like to carefully balance "code readability" with "performance". Overall, I think that there is no doubt that your proposal is faster but I'd like to better understand if your use case would benefit significantly from this change.

I also don't know if dask and numba support numpy.partition()

So, this part shouldn't matter because Dask and Numba are not being used for the actual selection of the top-K P_ABBA values. I believe that all of the logic is accomplished in pure NumPy:

stumpy/stumpy/mpdist.py

Lines 204 to 214 in 10b2672

    
    _compute_P_ABBA(T_A, T_B, m, P_ABBA, dask_client, device_id, mp_func)
    P_ABBA.sort()
    if k is not None:
        k = min(int(k), P_ABBA.shape[0] - 1)
    else:
        percentage = min(percentage, 1.0)
        percentage = max(percentage, 0.0)
        k = min(math.ceil(percentage * (n_A + n_B)), n_A - m + 1 + n_B - m + 1 - 1)
    MPdist = _select_P_ABBA_value(P_ABBA, k, custom_func)

I use stumpy in my research.

That's great to hear! I'm sure that you are doing it already, but please don't forget to cite the STUMPY paper where appropriate, as we love reading about how STUMPY is being leveraged.

3 replies

JaKasb May 9, 2021
Author

I could be wrong, but I believe that the time complexity might be closer to O(k + k log k), since np.partition returns values in unsorted order (O(k)) and then you'll need to sort the values (O(k log k)). Perhaps one could replace the sort and simply take the np.max value in O(k) time.

We can skip sorting because
assert np.sort(a)[k] == np.partition(a, k)[k]

The output of np.partition is an array of
[Left, k-th smallest value, Right]
where all elements of Left are smaller than or equal to the k-th value;
likewise, all elements of Right are greater than or equal to it.

Left and Right sub-arrays are unordered.
The contents of Left and Right are irrelevant if we search for P_ABBA[k].
Consequently we don't need to sort Left or Right, so we can skip the sort.

import numpy as np
a = np.arange(0,10)
np.random.shuffle(a)
k = 3
a_part = np.partition(a, k)
left = a_part[:k]
center = a_part[k]
right = a_part[k+1:]
a
left, center, right

array([0, 5, 4, 3, 7, 8, 6, 2, 1, 9])
(array([0, 2, 1]), 3, array([5, 4, 6, 7, 8, 9]))

In the above snippet, center is always equal to 3 and the ordering of Left and Right varies.
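
The same property can be checked on many random inputs. The sketch below is illustrative only: the helper name `kth_smallest` is made up, and the small integer range is chosen to force repeated values, where elements equal to the k-th value may end up on either side of it.

```python
import numpy as np

def kth_smallest(a, k):
    """Return the k-th smallest value (0-based) without fully sorting `a`."""
    # np.partition returns a copy in which index k holds the value
    # that a full sort would place there
    return np.partition(a, k)[k]

rng = np.random.default_rng(0)
for _ in range(100):
    a = rng.integers(0, 20, size=50)      # small range, so duplicates occur
    k = int(rng.integers(0, a.shape[0]))
    part = np.partition(a, k)
    assert kth_smallest(a, k) == np.sort(a)[k]
    # everything left of k is <= part[k], everything right of k is >= part[k]
    assert np.all(part[:k] <= part[k]) and np.all(part[k + 1:] >= part[k])
    # hence the k-th value is also the maximum of the first k + 1 entries
    assert part[: k + 1].max() == part[k]
```

The last assertion corresponds to the np.max idea quoted above: after partitioning, no extra sort of the left part is needed to read off the k-th value.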

However, if you are running mpdist thousands of times then that sorting time can certainly add up as n increases.

In _mpdist_vect() the sort() is inside a loop.

stumpy/stumpy/mpdist.py

Line 219 in 10b2672

def _mpdist_vect(

I use _mpdist_vect() as alternative to MASS.
For my current use case _mpdist_vect() is already fast enough.

For my research n=1_641_600
query_length = 127 (actually intended to be 128)
Runtime for 35x3 MASS(query, ts) = 5 seconds
Runtime for 35x3 MPdist_vec(query,ts, m=query_length) = 50 seconds
Runtime for 35x3 MPdist_vec(query,ts, m=0.5*query_length) = 120 seconds

35 Jobs parallelized with Joblib over 32 cores, therefore not really a benchmark and the numbers vary between runs.

k = int(0.05*1_641_600)
a = np.random.standard_normal(size=2*1_641_600)
%timeit a.sort()
a = np.random.standard_normal(size=2*1_641_600)
%timeit a.partition(k)

39.8 ms ± 47.3 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
4.98 ms ± 176 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

However the %timeit benchmark seems to be ill-designed, because the speedup for mpdist_vec would be 1E6 * 35ms = 583 minutes, which is more than the overall runtime of mpdist_vec ... ?
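
One possible confounder in the %timeit numbers: a.sort() and a.partition(k) both work in place, so every repetition after the first operates on an array that is already sorted (or already partitioned). A minimal sketch that times each call on a fresh copy instead, using the standard timeit module; the helper name `time_on_fresh_copy` is hypothetical, the sizes mirror the ones above, and no timings are claimed here because they depend on the machine.

```python
import timeit
import numpy as np

def time_on_fresh_copy(func, a, repeat=5):
    """Best wall-clock time of func(b), where b is a fresh copy of `a`.

    The copy is made before the timer starts, so it is not measured.
    """
    best = float("inf")
    for _ in range(repeat):
        b = a.copy()  # unsorted input for every run
        start = timeit.default_timer()
        func(b)
        best = min(best, timeit.default_timer() - start)
    return best

n = 2 * 1_641_600
k = int(0.05 * 1_641_600)
a = np.random.default_rng(0).standard_normal(size=n)

t_sort = time_on_fresh_copy(lambda b: b.sort(), a)
t_part = time_on_fresh_copy(lambda b: b.partition(k), a)
print(f"sort: {t_sort * 1e3:.1f} ms, partition: {t_part * 1e3:.1f} ms")
```

Note also that the array sorted inside the _mpdist_vect() loop need not have the same length as this test array, so multiplying a per-call saving measured here by the number of iterations may not reflect the real loop.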

seanlaw May 10, 2021
Maintainer

We can skip sorting because
assert np.sort(a)[k] == np.partition(a, k)[k]

The output of np.partition is an array of
[Left, k-th smallest value, Right]
where all elements of Left are smaller than k-th,
likewise, all elements in Right are greater than k-th value.

Left and Right sub-arrays are unordered.
The content of Left and Right are irrelevant if we search for P_ABBA[k]
Consequently we don't need to sort Left nor Right, therefore we skip the sort.

Awesome! Thank you for pointing this out. I learned something new today!

In _mpdist_vect() the sort() is inside a loop.

Okay, so from what I can tell, in order to replace sort with np.partition, we'll need to make three changes in mpdist.py:

First, the _select_P_ABBA_value() function would change from:

stumpy/stumpy/mpdist.py

Lines 111 to 120 in 01e867c

    
    k = min(int(k), P_ABBA.shape[0] - 1)
    if custom_func is not None:
        MPdist = custom_func(P_ABBA)
    else:
        MPdist = P_ABBA[k]
        if ~np.isfinite(MPdist):
            k = max(0, np.count_nonzero(np.isfinite(P_ABBA[:k])) - 1)
            MPdist = P_ABBA[k]
    return MPdist

to:

    if custom_func is not None:
        MPdist = custom_func(P_ABBA)
    else:
        partition = np.partition(P_ABBA, k)
        MPdist = partition[k]
        if ~np.isfinite(MPdist):
            partition[:k].sort()
            k = max(0, np.count_nonzero(np.isfinite(partition[:k])) - 1)
            MPdist = partition[k]
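
To make the non-finite fallback concrete, here is a self-contained sketch of both variants side by side. The function names `select_sorted` and `select_partition` are hypothetical, the bodies only mirror the snippets in this thread, and the input is assumed to contain finite values and np.inf only. It is not the code that ended up in STUMPY.

```python
import numpy as np

def select_sorted(P_ABBA, k):
    """Sort-based selection: full sort of a copy, then index."""
    P = np.sort(P_ABBA)
    k = min(int(k), P.shape[0] - 1)
    value = P[k]
    if not np.isfinite(value):
        # fall back to the last finite entry before position k
        k = max(0, np.count_nonzero(np.isfinite(P[:k])) - 1)
        value = P[k]
    return value

def select_partition(P_ABBA, k):
    """Partition-based selection: sorts the left part only if a fallback is needed."""
    k = min(int(k), P_ABBA.shape[0] - 1)
    part = np.partition(P_ABBA, k)
    value = part[k]
    if not np.isfinite(value):
        part[:k].sort()  # in-place sort of a view, so `part` itself is updated
        k = max(0, np.count_nonzero(np.isfinite(part[:k])) - 1)
        value = part[k]
    return value

P = np.array([3.0, np.inf, 1.0, np.inf, 2.0])
for k in range(P.shape[0]):
    assert select_sorted(P, k) == select_partition(P, k)
```

In the common case where the k-th value is finite, the partition variant never sorts anything; the extra sort of the left part is paid only when the fallback branch is taken.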

Second, the _mpdist() function would change from:

stumpy/stumpy/mpdist.py

Lines 204 to 212 in 01e867c

    
    _compute_P_ABBA(T_A, T_B, m, P_ABBA, dask_client, device_id, mp_func)
    P_ABBA.sort()
    if k is not None:
        k = min(int(k), P_ABBA.shape[0] - 1)
    else:
        percentage = min(percentage, 1.0)
        percentage = max(percentage, 0.0)
        k = min(math.ceil(percentage * (n_A + n_B)), n_A - m + 1 + n_B - m + 1 - 1)

to (simply remove the sort line):

    P_ABBA = np.empty(n_A - m + 1 + n_B - m + 1, dtype=np.float64)

    _compute_P_ABBA(T_A, T_B, m, P_ABBA, dask_client, device_id, mp_func)

    if k is not None:
        k = min(int(k), P_ABBA.shape[0] - 1)

Lastly, we'd also remove the sort call in _mpdist_vect() and everything else stays unchanged:

     for i in range(MPdist_vect.shape[0]):
         P_ABBA[:j] = rolling_row_min[:, i]
         P_ABBA[j:] = col_min[i : i + j]
         MPdist_vect[i] = _select_P_ABBA_value(P_ABBA, k, custom_func)

What do you think @JaKasb? Hopefully, I didn't miss anything.

Of course, I'd need to update the relevant docstrings, since they currently claim that P_ABBA is sorted before selection and a custom_func would be affected by the change. I'll also need to update the unit tests to match. Your feedback would be appreciated.

seanlaw May 10, 2021
Maintainer

I use _mpdist_vect() as alternative to MASS.

One additional note: _mpdist_vect() is a private function and is subject to change without notice. Please use it at your own risk.

JaKasb · 2021-05-10T15:05:48Z

JaKasb
May 10, 2021
Author

Looks good to me.
Furthermore the sorting/selecting now happens inside _select_P_ABBA_value().
IMO this is a more reasonable place.

In the current version the sorting occurs inplace.
Eg a.sort() instead of np.sort(a)

Does the code depend on P_ABBA being sorted, outside of _select_P_ABBA_value() ?
Eg in :

stumpy/stumpy/mpdist.py

Lines 281 to 285 in 10b2672

    
    for i in range(MPdist_vect.shape[0]):
        P_ABBA[:j] = rolling_row_min[:, i]
        P_ABBA[j:] = col_min[i : i + j]
        P_ABBA.sort()
        MPdist_vect[i] = _select_P_ABBA_value(P_ABBA, k, custom_func)

Because if you sort P_ABBA inplace and afterwards write into P_ABBA, the output should be different if the sort() is removed.
The reference implementation of MPdist_vect does not sort before the ingress and egress.
I don't fully understand MPdist_vect yet. The paper mentions a moving_min() function.
Maybe this is a bug ?
https://sites.google.com/site/mpdistinfo/

percentage = min(percentage, 1.0) 
percentage = max(percentage, 0.0)

Can be simplified into
percentage = np.clip(percentage, 0.0, 1.0)
or maybe even use assert 0.0 <= percentage <= 1.0 to tell the user that the parameter percentage is faulty
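
The two options can be sketched as follows. Both helper names are hypothetical. The second one raises ValueError rather than using a bare assert, because assert statements are skipped when Python runs with -O.

```python
import numpy as np

def clamp_percentage(percentage):
    """Silently clamp to [0, 1]; equivalent to the min/max pair."""
    return float(np.clip(percentage, 0.0, 1.0))

def check_percentage(percentage):
    """Alternative: reject out-of-range input instead of clamping it."""
    if not 0.0 <= percentage <= 1.0:
        raise ValueError(f"percentage must be in [0, 1], got {percentage}")
    return percentage

for p in (-0.5, 0.0, 0.05, 1.0, 1.5):
    # np.clip agrees with the original two-line min/max version
    assert clamp_percentage(p) == max(min(p, 1.0), 0.0)
```

Clamping keeps existing calls working unchanged, whereas raising tells the caller about a faulty argument; which behaviour is preferable is a design choice for the maintainers.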

6 replies

seanlaw May 11, 2021
Maintainer

Does the code depend on P_ABBA being sorted, outside of _select_P_ABBA_value() ?

No, the code does not depend on P_ABBA being sorted outside of _select_P_ABBA_value(). If I recall correctly, we decided to move the sort outside of _select_P_ABBA_value() just so it was transparent/obvious, and so we purposely limited the _select_P_ABBA_value() function to only perform the selection process (with the assumption that P_ABBA was already sorted, so there would be no surprise if you chose to provide a custom_func). In the updated form that we are currently proposing, if a user passed a custom_func for selection then they would need to perform the sorting themselves (assuming they need P_ABBA to be sorted).
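
As an illustration of what that would mean for users, here is a hedged sketch of a custom selector that no longer assumes sorted input. Only the call shape custom_func(P_ABBA) comes from the snippets in this thread; the function name and the statistic it computes are made up.

```python
import numpy as np

def mean_of_smallest_five_percent(P_ABBA):
    """Hypothetical custom_func: average of the smallest 5% of values."""
    n = max(1, int(np.ceil(0.05 * P_ABBA.shape[0])))
    # P_ABBA may arrive unsorted, so sort a copy rather than assuming order;
    # np.sort leaves the caller's array untouched
    smallest = np.sort(P_ABBA)[:n]
    return smallest.mean()

P_ABBA = np.array([4.0, 0.5, 3.0, 1.5, 2.0])
print(mean_of_smallest_five_percent(P_ABBA))
```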

percentage = np.clip(percentage, 0.0, 1.0)

I like this a lot better; it's much cleaner! I'll make sure to make this change. Thank you!

JaKasb May 12, 2021
Author

I compared the output of _mpdist_vect() with the reference implementation in MATLAB.
The output is correct.
Not exactly the same output, but sufficiently correlated.

My mental model of the loop was wrong.
I was worried that the inplace sort P_ABBA.sort() would mutate the state of P_ABBA for the next loop iteration.
But then I realized that the content of P_ABBA is created from scratch in every iteration.

stumpy/stumpy/mpdist.py

Lines 281 to 285 in 10b2672

    for i in range(MPdist_vect.shape[0]):
        P_ABBA[:j] = rolling_row_min[:, i]
        P_ABBA[j:] = col_min[i : i + j]
        P_ABBA.sort()
        MPdist_vect[i] = _select_P_ABBA_value(P_ABBA, k, custom_func)
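
A small stand-alone sketch of this point, with made-up sizes and random stand-in data rather than real matrix profiles: because both halves of the buffer are overwritten at the top of each iteration, an in-place sort cannot leak into the next iteration, and selecting via np.partition gives the same values.

```python
import numpy as np

rng = np.random.default_rng(0)
j, n_out, k = 8, 20, 3                     # hypothetical sizes
rolling_row_min = rng.random((j, n_out))   # stand-in data, not real profiles
col_min = rng.random(n_out + j - 1)

buf_sort = np.empty(2 * j)
buf_part = np.empty(2 * j)
out_sort = np.empty(n_out)
out_part = np.empty(n_out)

for i in range(n_out):
    # both halves are overwritten, so nothing from iteration i - 1 survives
    buf_sort[:j] = rolling_row_min[:, i]
    buf_sort[j:] = col_min[i : i + j]
    buf_sort.sort()                        # in-place sort, as in the loop above
    out_sort[i] = buf_sort[k]

    buf_part[:j] = rolling_row_min[:, i]
    buf_part[j:] = col_min[i : i + j]
    out_part[i] = np.partition(buf_part, k)[k]  # returns a copy; buffer untouched

assert np.array_equal(out_sort, out_part)
```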

Interestingly, your implementation of MPdist_vec is faster than the reference MATLAB version of fastMPdist_Vect().
However, I assume that my Octave is much slower than the original MATLAB.

seanlaw May 12, 2021
Maintainer

I compared the output of _mpdist_vect() with the reference implementation in MATLAB.

When you say this, do you mean that you compared the np.partition version with the reference MATLAB implementation? I expect the original (P_ABBA.sort()) version to match the MATLAB code since it was based off it.

The output is correct. Not exactly the same output, but sufficiently correlated.

Can you please elaborate on this point? "Sufficiently correlated" seems suspicious to me and so I would like to know more if possible.

JaKasb May 13, 2021
Author

I used the current release from pypi for the comparison.

By "sufficiently correlated" I mean that np.allclose() fails for the comparison, but one cannot distinguish the two outputs when visualized on a line plot.
Each implementation of MASS produces slightly different output.
No need to worry.
Sorry for the confusion.
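
For readers wondering how np.allclose() can fail while two curves look identical, here is a minimal sketch. The 0.01% relative deviation is invented for illustration and was not measured from STUMPY, MATLAB, or Octave.

```python
import numpy as np

x = np.linspace(1.0, 2.0, 1000)
reference = np.sin(x)
other = reference * (1.0 + 1e-4)   # hypothetical 0.01% relative deviation

print(np.allclose(reference, other))             # default rtol=1e-05, atol=1e-08
print(np.allclose(reference, other, rtol=1e-3))  # looser relative tolerance
print(np.max(np.abs(reference - other)))         # largest absolute difference
print(np.corrcoef(reference, other)[0, 1])       # Pearson correlation
```

With the default tolerances the first check fails, while the looser tolerance passes; reporting the maximum absolute difference alongside the allclose result makes this kind of comparison easier to interpret.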

seanlaw May 17, 2021
Maintainer

Alright, @JaKasb! I've implemented all (?) of the things that we had discussed above in f6d0a37 and it should be available in the upcoming release of v1.9.0
