Handling streaming data #19

rtmlp · 2019-05-24T01:15:46Z

I have gone through the notebooks to get an overview of the kinds of implementation that are possible with time series data, and noticed that all the examples, and possibly the API, are designed for static time series data.

Can the team share thoughts on applying these types of matrix profile algorithms to streaming data, for example motif/pattern discovery or discord/anomaly detection on streaming data? Thanks :)

One approach that I can think of is a combined offline and online approach. The idea is to compute the matrix profile (MP) on historical data and, for any new streaming data, use a query (of some length of streaming data) to compute its motif/discord. This is the online part.

For the offline part, after a certain length of time, we recompute the MP based on the historical + newly available data to check for a shift in motifs/discords.
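A rough sketch of the online part described above, using only the standard library rather than this package. It assumes a z-normalized Euclidean distance and compares the newest window against every window of the historical data: a small nearest-neighbor distance suggests a known pattern, a large one suggests a discord candidate. The names `znorm` and `nearest_neighbor_distance`, and the synthetic data, are made up for illustration.

```python
import math


def znorm(s):
    """Z-normalize a window (assumes the window is not constant)."""
    mu = sum(s) / len(s)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in s) / len(s))
    return [(v - mu) / sigma for v in s]


def nearest_neighbor_distance(query, history):
    """Smallest z-normalized distance between `query` and any window of `history`."""
    m = len(query)
    zq = znorm(query)
    best = math.inf
    for j in range(len(history) - m + 1):
        zh = znorm(history[j:j + m])
        best = min(best, math.sqrt(sum((p - q) ** 2 for p, q in zip(zq, zh))))
    return best


m = 20  # window size
history = [math.sin(2 * math.pi * t / 25) for t in range(500)]

# A new window that continues the historical pattern
familiar = [math.sin(2 * math.pi * t / 25) for t in range(500, 500 + m)]
# A new window unlike anything in the history: flat, ending in a spike
unusual = [0.0] * (m - 1) + [5.0]

d_familiar = nearest_neighbor_distance(familiar, history)
d_unusual = nearest_neighbor_distance(unusual, history)
print(d_familiar, d_unusual)
```

How large a distance has to be before it counts as an anomaly is a separate choice that depends on the data.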

seanlaw · 2019-05-24T01:30:00Z
Maintainer

@rtmatx Thanks for the question. We haven't tried this on streaming data so it will depend on how quickly the data is coming in. Presumably, not at the rate at which CERN is collecting data.

So, assuming that the velocity of the incoming data is "reasonable" (however you define it - maybe 1% of the length, n, of your starting time series), streaming is actually a far easier problem as it is no longer an O(n^2) calculation (in terms of computational complexity). If I recall the paper correctly, once the matrix profile has been calculated, additional streaming data can be added incrementally in an efficient manner that is essentially O(n) per new data point. This is because we don't have to recalculate the entire matrix profile. Instead, we just need to calculate the distance profile for the new (sliding) window (i.e., the window ending at the new data point) and then update the matrix profile to reflect any new minimum distances.
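To make the incremental idea concrete, here is a minimal, dependency-free sketch. It is not code from this library and not the algorithm from the paper: it uses a naive z-normalized distance, so each update costs more than O(n), but the update logic is the same (one distance profile for the newest window, then element-wise minimums). The function names and the `m // 2` exclusion zone are assumptions made for the example.

```python
import math
import random


def znorm_dist(a, b):
    """Z-normalized Euclidean distance between two equal-length sequences."""
    def znorm(s):
        mu = sum(s) / len(s)
        sigma = math.sqrt(sum((v - mu) ** 2 for v in s) / len(s))
        return [(v - mu) / sigma for v in s]  # assumes sigma > 0
    za, zb = znorm(a), znorm(b)
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(za, zb)))


def brute_force_mp(T, m):
    """O(n^2) reference: nearest-neighbor distance for every subsequence."""
    n_sub = len(T) - m + 1
    excl = m // 2  # assumed exclusion zone to skip trivial matches
    mp = [math.inf] * n_sub
    for i in range(n_sub):
        for j in range(n_sub):
            if abs(i - j) > excl:
                mp[i] = min(mp[i], znorm_dist(T[i:i + m], T[j:j + m]))
    return mp


def append_point(T, mp, m, value):
    """Extend T by one point and update mp in place using one distance profile."""
    T.append(value)
    new_idx = len(T) - m  # start index of the newest subsequence
    excl = m // 2
    query = T[new_idx:]
    best = math.inf
    for j in range(new_idx):
        if new_idx - j > excl:
            d = znorm_dist(query, T[j:j + m])
            mp[j] = min(mp[j], d)  # an old subsequence may now have a closer neighbor
            best = min(best, d)
    mp.append(best)  # profile value for the newest subsequence


random.seed(0)
m = 8
T = [random.random() for _ in range(60)]
mp = brute_force_mp(T, m)  # the expensive initial calculation
for _ in range(5):
    append_point(T, mp, m, random.random())  # cheap incremental updates
```

After the five updates, `mp` should match a full recomputation on the extended series, which is a handy sanity check for any real implementation as well.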

While this library doesn't give you streaming out of the box, it does the heavy lifting for you and calculates the matrix profile first which, again, is O(n^2). It should be fairly straightforward for you to update the matrix profile thereafter (both extending it and updating the existing distances).

This could be a great PR if you'd like to work on a Tutorial that demonstrates streaming (maybe with the streamz Python package).

seanlaw · 2019-05-24T01:42:07Z
Maintainer

For the record (and I haven't thought about this thoroughly), my first instinct is that streaming is out of scope for this package and should not be a core, built-in capability/feature. Not because I don't think it's important (it is definitely valuable) but because it's not really contributing to making the matrix profile calculation faster (which is the key contribution of this package). I could see a spin-off package called, say, "stumpy-streaming" that would install "stumpy" as a dependency.

However, a tutorial of how to accomplish streaming (after an initial matrix profile is computed) would be great.

seanlaw · 2019-05-24T02:08:00Z
Maintainer

I failed to acknowledge your point above regarding online vs offline. Yes, your approach would be correct in both cases.

One thing that is not explicitly obvious from the examples is that the stumpy.stump function (and stumpy.stumped for the distributed case) can perform an AB-join by providing the optional parameter T_B (without this parameter, it is automatically a self-join):

# Warning: code not tested so use at your own risk!
import numpy as np
import stumpy

m = 50  # Window size

x = np.random.rand(1000)
mp = stumpy.stump(x, m)  # This is a self-join

y = np.random.rand(100)  # New data

new_x = np.append(x, y)
new_y = np.append(x[-m + 1:], y)  # Take the last m - 1 = 49 data points from x and append y to them

partial_mp = stumpy.stump(new_x, m, T_B=new_y)  # AB-join between new_x and new_y

# Now update `mp` with `partial_mp`. One needs to take care and make sure the indices are aligned properly though.
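On the index alignment: the following sketch only illustrates the bookkeeping, using plain lists and no call to `stumpy.stump`. The names `offset` and `n_new` are made up for the example. It assumes the same sizes as above (1000 old points, 100 new points, `m = 50`).

```python
m = 50  # window size

x = [float(i) for i in range(1000)]        # stand-in for the old data
y = [float(i) for i in range(1000, 1100)]  # stand-in for the new data

new_x = x + y
new_y = x[-m + 1:] + y  # last m - 1 old points followed by the new points

# Number of subsequences already covered by the old matrix profile
offset = len(x) - m + 1
# Number of subsequences that end in new data
n_new = len(new_y) - m + 1

# Subsequence j of new_y is the same window as subsequence offset + j of new_x
aligned = all(
    new_y[j:j + m] == new_x[offset + j:offset + j + m] for j in range(n_new)
)
print(offset, n_new, aligned)
```

So an index `j` reported relative to `new_y` would need to be shifted by `offset` before it is compared with, or written into, a profile indexed relative to `new_x`.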

seanlaw · 2019-05-25T14:35:38Z
Maintainer

@rtmatx Let me know if it makes sense to close this issue

rtmlp · 2019-05-31T03:51:03Z
Author

Hi @seanlaw, Thanks a lot for the detailed info. It makes a lot of sense. I have been reading about the STAMPI and FLOSS algorithms that are discussed in the paper, which follows the implementation that you have highlighted. I am very keen on implementing the streaming/online version of the matrix profile algorithm as it can be helpful in a lot of places.

I have to go through the implementation of the underlying algorithms. I can make a PR for it if you feel it can be added to this library. It might take some time though.

rtmlp · 2019-05-31T03:51:55Z
Author

> I failed to acknowledge your point above regarding online vs offline. Yes, your approach would be correct in both cases.
>
> One thing that is not explicitly obvious from the examples is that the stumpy.stump function (and stumpy.stumped for the distributed case) can perform an AB-join by providing the optional parameter T_B (without this parameter, it is automatically a self-join):
>
> # Warning: code not tested so use at your own risk!
> import numpy as np
>
> m = 50  # Window size
>
> x = np.random.rand(1000)
> mp = stumpy.stump(x, 50)  # This is a self-join
>
> y = np.random.rand(100)  # New data
>
> new_x = np.append(x, y)
> new_y = np.append(x[:-m+1], y)  # Take the last 49 data points from x and append y to it
>
> partial_mp = stumpy.stump(new_x, 50, T_B=new_y)
>
> # Now update `mp` with `partial_mp`. One needs to take care and make sure the indices are aligned properly though.

I think it should be np.append(x[-m+1:], y), shouldn't it?

seanlaw · 2019-05-31T10:11:41Z
Maintainer

> I think it should be np.append(x[-m+1:], y) isn't it ?

You are right! I've edited my original post above to reflect this error.
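For anyone skimming, the difference between the two slices is easy to check with a plain list (the list contents here are just placeholders):

```python
m = 50
x = list(range(1000))

tail = x[-m + 1:]  # the last m - 1 = 49 points
head = x[:-m + 1]  # everything except the last 49 points

print(len(tail), tail[0])  # 49 951
print(len(head), head[-1])  # 951 950
```

Prepending exactly `m - 1` old points matters because each new data point then contributes exactly one new window of length `m`.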

seanlaw · 2019-05-31T10:34:01Z
Maintainer

> Hi @seanlaw, Thanks a lot for the detailed info. It makes a lot of sense. I have been reading about the STAMPI and FLOSS algorithms that are discussed in the paper, which follows the implementation that you have highlighted.

If I understand correctly, STAMPI is the interactive version of STAMP. However, it is significantly slower than STOMP and SCRIMP++ (i.e., interactive STOMP). SCRIMP++ and FLOSS are both on our project roadmap and we definitely welcome a contribution.

> I am very keen on implementing the streaming/online version of the matrix profile algorithm as it can be helpful in a lot of places.

I may be misunderstanding your meaning, as I am not a computer/software engineer and I haven't read all of the papers yet, so please correct me if I'm wrong or direct me to the right paper(s). I'm not aware of a "version of the matrix profile algorithm" that is specific to streaming/online. I've only read 1-2 sentence references to this but have yet to come across pseudocode in the papers. Can you please clarify what you are referring to?

> I have to go through the implementation of the underlying algorithms. I can make a PR for it if you feel it can be added to this library. It might take some time though.

@rtmatx Depending on what your plans are, I think a tutorial contribution would be nice. Is that what you were thinking as well? If so, then we can close this issue and create a new one that is focused on a streaming data tutorial and we can add a more detailed checklist that is similar to Tutorial 1 (or shorter). Once the components are all there then we can review the PR.

If indeed users start asking for streaming/online to be a code feature then we can open a new PR after the tutorial has been created that is targeted for migrating it to a new feature.

Let me know what you think

rtmlp · 2019-06-03T16:43:16Z
Author

> I may be misunderstanding your meaning, as I am not a computer/software engineer and I haven't read all of the papers yet, so please correct me if I'm wrong or direct me to the right paper(s). I'm not aware of a "version of the matrix profile algorithm" that is specific to streaming/online. I've only read 1-2 sentence references to this but have yet to come across pseudocode in the papers. Can you please clarify what you are referring to?

Sorry for the misunderstanding. I was referring to the STAMPI algorithm when I mentioned a matrix profile for streaming data.

> @rtmatx Depending on what your plans are, I think a tutorial contribution would be nice. Is that what you were thinking as well? If so, then we can close this issue and create a new one that is focused on a streaming data tutorial and we can add a more detailed checklist that is similar to Tutorial 1 (or shorter). Once the components are all there then we can review the PR.

That sounds good. I will make a notebook example and let's see if it can help other users. Based on that we can refactor it, if needed.

I will close this issue and will create a PR once I have a working implementation of the matrix profile for streaming data using STAMPI.

Thanks

seanlaw · 2019-06-03T17:40:33Z
Maintainer

You know, I always (incorrectly) thought that STAMPI was "interactive STAMP" (which I am against implementing; instead, we should implement SCRIMP++), but I just realized, after scanning the first paper, that it refers to "STAMP Incremental". This changes my perspective.

It looks like there is a "STOMPI" in Table 5 of this paper. At a glance, it doesn't look too complicated. We should go with "STUMPI" :)

seanlaw · 2020-01-13T20:05:42Z
Maintainer

@rtmatx Just following up on this to see if you've had any opportunity to take a look at this or have any example code? Absolutely no pressure though.

hmcoservit · 2020-06-04T12:38:29Z

Any updates on this, @seanlaw @rtmatx? It would be very helpful.

seanlaw · 2020-06-04T12:58:38Z
Maintainer

@hmcoservit Thank you for checking. Can you please describe your use case in this STUMPI issue?
