Support the three types of mSTAMP queries from paper Matrix Profile VI #180

robertsd · 2020-05-20T15:50:54Z

robertsd
May 20, 2020

Thank you for developing and maintaining the stumpy project; it has been very useful. I have a use case that requires a "constrained search" on multivariate time series, in which at least one variable must be included in the motif search. Constrained Search is one of three types of search discussed in the paper Matrix Profile VI; the others are Guided Search and Unconstrained Search.

What is the likelihood that these mSTAMP queries become possible in stumpy?

seanlaw · 2020-05-20T17:36:28Z

seanlaw
May 20, 2020
Maintainer

@robertsd Thank you for your feedback and for taking the time to submit this feature request! Please accept this response as an invitation for a longer form discussion as I'm just talking/thinking out loud here and I may not be fully understanding your request!

So, it's been a while since I implemented "our flavor" of mSTAMP (see stumpy.mstump, and stumpy.mstumped for the distributed version). Can you tell me if you've already been using these functions? Essentially, they are the "faster" STOMP-based implementation of mSTAMP for computing the matrix profile along each dimension.

When I originally read the Matrix Profile VI paper (accompanying code by the original author is here), I did not pay much attention to the "constrained search" section because there weren't any explicit algorithms or code beyond mSTAMP (which we've essentially implemented in stumpy.mstump/stumpy.mstumped).

After quickly re-skimming the paper, it appears that the only detail provided regarding "inclusion" is:

The implementation of inclusion is slightly more complicated, as we must move the distance computed by using whitelisted dimensions up to the front after a column wise-ascending sort has been applied (see line 10 in ALGORITHM I. ).

With Algorithm I being (copy/pasted from the paper):

𝑷 ← size 𝑑 × 𝑛 − 𝑚 + 1 inf matrix
𝑖𝑑𝑥𝑒𝑠 ← integers from 1 to 𝑛 − 𝑚 + 1
for each 𝑖𝑑𝑥 in 𝑖𝑑𝑥𝑒𝑠 // random order if anytime algorithm used
   𝑫 ← size 𝑑 × 𝑛 − 𝑚 + 1 zero matrix
   for 𝑖 from 1 to 𝑑
       𝑄 ← 𝑻[𝑖,𝑖𝑑𝑥:𝑖𝑑𝑥 + 𝑚 − 1]
       𝑫[𝑖, ∶] ← distanceProfile(𝑄, 𝑻[𝑖, ∶])
   end for
   𝑫 ← columnWiseAscendingSort(𝑫)
   𝐷′ ← length 𝑛 − 𝑚 + 1 zero array
   for 𝑖 from 1 to 𝑑
       𝐷′ ← 𝐷′ + 𝑫[𝑖, ∶]
       𝐷′′ ← 𝐷′ ÷ 𝑖
       𝑷[𝑖, ∶] ← elementWiseMin(𝑷[𝑖, ∶], 𝐷′′)
   end for
end for
return
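
For anyone who prefers something runnable, here is a rough, brute-force Python sketch of the pseudocode above. It is not how stumpy.mstump is implemented. The helper names (distance_profile, mstamp_naive) are made up for illustration, the distance is assumed to be the z-normalized Euclidean distance, and the exclusion zone of m // 2 around each query (to ignore trivial self-matches) is an assumption that the quoted pseudocode does not spell out. It also assumes that no window is constant, since the z-normalization would otherwise divide by zero.

```python
import numpy as np


def distance_profile(Q, T):
    """Z-normalized Euclidean distance from query Q to every length-m window of T."""
    m = len(Q)
    windows = np.lib.stride_tricks.sliding_window_view(T, m)
    q = (Q - Q.mean()) / Q.std()
    w = (windows - windows.mean(axis=1, keepdims=True)) / windows.std(axis=1, keepdims=True)
    return np.sqrt(((w - q) ** 2).sum(axis=1))


def mstamp_naive(T, m):
    """Brute-force version of Algorithm I for a (d, n) array T and window length m."""
    d, n = T.shape
    n_sub = n - m + 1
    P = np.full((d, n_sub), np.inf)
    excl = m // 2  # assumed exclusion zone half-width (not part of the quoted pseudocode)
    for idx in range(n_sub):
        D = np.empty((d, n_sub))
        for i in range(d):
            D[i] = distance_profile(T[i, idx:idx + m], T[i])
        # Ignore trivial matches close to the query itself
        D[:, max(0, idx - excl):idx + excl + 1] = np.inf
        # columnWiseAscendingSort: sort each column independently
        D.sort(axis=0)
        running_sum = np.zeros(n_sub)
        for i in range(d):
            running_sum += D[i]
            P[i] = np.minimum(P[i], running_sum / (i + 1))
    return P


rng = np.random.default_rng(0)
T = rng.standard_normal((3, 64))
P = mstamp_naive(T, m=8)
```

Because each column of D is sorted before averaging, row k - 1 of P holds the k-dimensional matrix profile, and any "inclusion" change would slot in right where D.sort(axis=0) is called.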

It sounds like we'd need to modify the step directly following the 𝑫 ← columnWiseAscendingSort(𝑫) step. Would you agree?

Do you happen to already know what one would need to add? I'd be happy to add it if you want to talk me through the process (and even better if you wanted to submit a pull request but absolutely no pressure to)?

seanlaw · 2020-05-20T17:46:26Z

seanlaw
May 20, 2020
Maintainer

I have some initial thoughts and questions:

1. Assume that we have a list called include that contains zero-based indices of the dimensions (i.e., if we have 5 dimensions, or 5 time series, then we can "include", say, the second and fourth dimensions by providing the list include = [1, 3]). You could then call stumpy.mstump with something like stumpy.mstump(T, m, include=[1, 3]). How does this sound?
2. If more than one dimension needs to be "included", how would you choose which dimension is more important for the sorted order? Maybe, in our example above, if we wanted the 4th dimension to be more important than the 2nd dimension, then we would call stumpy.mstump(T, m, include=[3, 1])?
3. How many dimensions does your data have? What is the length of your time series?
4. Are you able to provide more detail about your application? We like to track/understand user use cases and how people are leveraging STUMPY.
5. Do you need all three query types or only inclusion?

robertsd · 2020-05-20T18:43:19Z

robertsd
May 20, 2020
Author

Yes, I have been using stumpy.mstump for analysis of this data already, after reading the mentioned paper. I just happen to have a case where it is important to constrain the motif search to include an important dimension. While I am not super familiar with the algorithm guts I can perhaps review to determine how to implement it, mostly I wanted to find out if it might already be in the plans.

1. Perfect
2. Sounds great!
3. From roughly 20 to ~200 dimensions, and several hundred thousand in length
4. I can say it is extremely similar to the example from the paper explained just after the introduction of the three types of search
5. While Unconstrained Search seems massively interesting, what would be most useful right now is Constrained ("inclusion")

seanlaw · 2020-05-20T19:30:22Z

seanlaw
May 20, 2020
Maintainer

From roughly 20 to ~200 dimensions, and several hundred thousand in length

Have you tried mstump on this full M-dimensional time series already? I'm only asking since a dataset of that size will take quite some time to compute. For a ~200K length time series, it takes around 5 hours to compute the matrix profile on a modest 2 core machine. Extrapolating to 200 dimensions means it would take around 1,000 hours (200 x 5 hours) to compute the M-dimensional matrix profile. This is a super rough estimate but it may be a concern.
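
To make the back-of-the-envelope arithmetic explicit, here is a tiny sketch. It assumes, as the estimate above does, that the cost grows linearly with the number of dimensions and that one dimension of a ~200K length series takes about 5 hours on that 2 core machine; estimate_hours is just an illustrative helper, not part of stumpy.

```python
def estimate_hours(n_dimensions, hours_per_dimension=5.0):
    """Very rough run time estimate, assuming linear scaling in the number of dimensions."""
    return n_dimensions * hours_per_dimension


total_hours = estimate_hours(200)  # 200 dimensions x 5 hours each
total_days = total_hours / 24      # the same estimate expressed in days
```

Under those assumptions the estimate is 1,000 hours, or roughly 42 days of compute.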

Additionally, would you be comfortable with installing stumpy from source? It shouldn’t take much effort to add this feature but we probably won’t have an official release (on PyPI or Conda) until about a month from now as there is some work that we are focused on completing before the next release.

robertsd · 2020-05-20T19:38:22Z

robertsd
May 20, 2020
Author

I understand, I intended to start with small number of dimensions, I also have a GPU available. I am OK with official release being a month out! This would be fantastic!

seanlaw · 2020-05-20T20:05:41Z

seanlaw
May 20, 2020
Maintainer

I understand, I intended to start with small number of dimensions, I also have a GPU available. I am OK with official release being a month out! This would be fantastic!

Awesome! I'm sure you are already aware but, to be on the safe side, there is no GPU support (yet) for mstump (i.e., everything has to be computed using CPUs) since the GPU matrix profile calculation (extreme multithreading) is "different" from the CPU implementation.

robertsd · 2020-05-20T20:20:11Z

robertsd
May 20, 2020
Author

Understood.

seanlaw · 2020-05-22T01:11:29Z

seanlaw
May 22, 2020
Maintainer

While I am not super familiar with the algorithm guts I can perhaps review to determine how to implement it, mostly I wanted to find out if it might already be in the plans

I did a little digging/asking around and I think the simplest/cleanest solution would be something like:

import numpy as np

# D: the (d, n - m + 1) array of distance profiles; include: list of row indices or None
if include is not None:
    k = len(include)
    tmp_swap = np.empty((k, n - m + 1))  # This can be re-used for other iterations
    # Swap the rows in `include` with the first `len(include)` rows
    tmp_swap[:] = D[:k]
    D[:k] = D[include]
    D[include] = tmp_swap
    # Only sort the rows beyond the first `len(include)` rows
    D[k:].sort(axis=0)
else:
    D.sort(axis=0)  # i.e., columnWiseAscendingSort(D)

@robertsd Do you think this is sufficient?

robertsd · 2020-05-22T03:46:00Z

robertsd
May 22, 2020
Author

I don't think I could have done it better myself!

seanlaw · 2020-05-22T20:18:30Z

seanlaw
May 22, 2020
Maintainer

@robertsd I didn't see any implementation details regarding "Guided Search" beyond the description:

Guided Search: Find the best motif on k dimensions, where the integer k is given by the user, but which k dimensions to use is unspecified.

So, I'm guessing that this is simply accomplished by computing the full M-dimensional matrix profile and then the user explicitly chooses k (the number of dimensions) afterward.

As for "Unconstrained Search", this also seems to involve computing the full M-dimensional matrix profile and then applying some elbow metric to choose k (the number of dimensions) afterward.

So, realistically, the only type of search that needs a change at run time is the constrained ("inclusion") search.
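
As a rough sketch of what "choose k afterward" could look like for Guided Search: assume P is the d x (n - m + 1) array of multi-dimensional matrix profiles, where row k - 1 holds the k-dimensional profile. The helper name guided_search and the toy values in P are made up for illustration and are not part of stumpy.

```python
import numpy as np


def guided_search(P, k):
    """Return (start index, profile value) of the best motif on k dimensions."""
    row = P[k - 1]  # row k - 1 is the k-dimensional matrix profile
    start = int(np.argmin(row))
    return start, float(row[start])


# Toy 3-dimensional matrix profile with 4 subsequences (illustrative values only)
P = np.array([[0.5, 0.1, 0.9, 0.2],
              [0.8, 0.6, 1.0, 0.3],
              [1.2, 1.1, 1.4, 0.9]])

start, value = guided_search(P, k=2)
```

Note that the best location can change with k: in this toy array the 1-dimensional profile is smallest at index 1, while the 2-dimensional profile is smallest at index 3.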

seanlaw · 2020-05-22T20:37:07Z

seanlaw
May 22, 2020
Maintainer

@robertsd The feature has been added for mstump and mstumped and you should see it in the next release. Feel free to open up a new issue (or re-open this one) if you have any more questions and we certainly welcome any contribution(s) in the future (even documentation).

seanlaw · 2020-05-23T02:03:16Z

seanlaw
May 23, 2020
Maintainer

I don't think I could have done it better myself!

It turns out that we missed one crucial thing! When the user provides a list of indices to include, one has to account for cases where one or more of the indices is in/from one of the first few rows. So, let's say we have an array and indices as shown:

import numpy as np

x = np.array([[0,0],
              [1,1],
              [2,2],
              [3,3],
              [4,4],
              [5,5]])
indices = np.array([1, 2, 4])

If we followed our simple procedure above and simply swapped the rows, then we'd get the following wrong output (note that [4, 4] is missing and [1, 1] is repeated):

[[1 1]
 [0 0]
 [1 1]
 [3 3]
 [2 2]
 [5 5]]

Instead, what we really want is indices [1, 2, 4] to be in the first three rows and for the first row to be moved to index 4:

[[1 1]
 [2 2]
 [4 4]
 [3 3]
 [0 0]
 [5 5]]

To achieve this, we need to do some preparation work to identify which indices are "restricted" (i.e., those in the first few rows) and which indices are "unrestricted" (i.e., outside of the first few rows) and can ultimately be selectively written to at the end:

import numpy as np

x = np.array([[0,0],
              [1,1],
              [2,2],
              [3,3],
              [4,4],
              [5,5]])
indices = np.array([1, 2, 4])

# pre-preparation
restricted_indices = indices[indices < indices.shape[0]]
unrestricted_indices = indices[indices >= indices.shape[0]]
mask = np.ones(indices.shape[0], bool)
mask[restricted_indices] = False

# Same as before
tmp = x[:len(indices)].copy()
x[:len(indices)] = x[indices]
# x[indices] = tmp  # Replace this original step with the next one
x[unrestricted_indices] = tmp[mask]

This is what has been/will be implemented. Another thing that one needs to look out for is repeated indices in the input. This turned out to be a lot trickier than I had anticipated but I provide this here for completeness and transparency.
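
For convenience, the steps above can be wrapped into a small function and checked against the expected output. This is only a sketch: move_rows_to_front is a hypothetical name, it works on a copy rather than in place, and dropping repeated indices by keeping the first occurrence is an assumption about how duplicates might be handled, not necessarily what stumpy does.

```python
import numpy as np


def move_rows_to_front(x, indices):
    """Return a copy of x with the rows in `indices` moved to the top, in the given order."""
    x = x.copy()
    indices = np.asarray(indices)
    # Assumption: drop repeated indices, keeping the first occurrence of each
    _, first_seen = np.unique(indices, return_index=True)
    indices = indices[np.sort(first_seen)]
    k = indices.shape[0]

    # Indices that already live in the first k rows must not be written back to
    restricted = indices[indices < k]
    unrestricted = indices[indices >= k]
    mask = np.ones(k, dtype=bool)
    mask[restricted] = False

    tmp = x[:k].copy()
    x[:k] = x[indices]
    # Displaced rows (those not in `indices`) go into the slots freed further down
    x[unrestricted] = tmp[mask]
    return x


x = np.repeat(np.arange(6), 2).reshape(6, 2)  # rows [0, 0], [1, 1], ..., [5, 5]
result = move_rows_to_front(x, [1, 2, 4])
result_with_repeat = move_rows_to_front(x, [1, 2, 4, 2])
```

Here result has its rows in the order 1, 2, 4, 3, 0, 5, matching the desired output shown earlier, and the repeated index in the second call does not change the outcome.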

robertsd · 2020-05-23T18:26:29Z

robertsd
May 23, 2020
Author

I see, nice catch! Thank you so much for working on this. I did NOT expect you to build this feature so quickly!

seanlaw · 2020-05-25T00:00:19Z

seanlaw
May 25, 2020
Maintainer

Don’t mention it! Thanks again for submitting the feature request and please be sure to spread the word and share STUMPY with your network! 🙏
