Saving motifs for comparison to other files #936

datadeng46 · 2023-12-06T15:08:13Z

datadeng46
Dec 6, 2023

Hi,
I have a vast amount of univariate data split across numerous files. Because of its size, loading every file at once is not feasible. As a result, I am looking for a way to load one file, run stumpy on it, and then update with a new file (each file is an array of length 6000). Generating a matrix profile for the entire dataset is likely unmanageable, so perhaps there is a way of running one file, saving the top motif, and comparing the next file against the top motif from the previous file?

seanlaw · 2023-12-06T15:44:11Z

seanlaw
Dec 6, 2023
Maintainer

@datadeng46 Thank you for your question and welcome to the STUMPY community. I understand that each file has ~6000 datapoints but can you tell me how many files there are in total?

> Generating a matrix profile for the entire dataset is likely unmanageable

This may/may not be true depending on the length of the full/complete dataset so knowing the size will be helpful. Additionally, what hardware (CPUs, GPUs, RAM?) do you have access to for the matrix profile computation?

> This data is univariate and due to the size of the data loading every single file in at once is not achievable

At the end of the day, you'll still need to read all of the files (either at once or in separate batches), so I/O will need to happen regardless. The advice would vary depending on the size of the data and the hardware.

> so perhaps a way of running a file, saving the top motif and comparing the next file against the top motif from the previous file.

This may be a crude approximation, but you run the risk of completely missing a pattern located at the very beginning of your time series that matches a subsequence at the end of the time series. Instead, you may be better off sub-sampling the full time series first (i.e., reading in only every 10th or 100th data point) and computing the matrix profile on the lower frequency/resolution data in order to get an idea of whether ANYTHING interesting exists. This approach can often help you discover potential motifs in a very computationally efficient way. Additionally, you can combine this down-sampled approach with computing a (somewhat cheaper) approximate matrix profile (see the scrump function).

8 replies

datadeng46 Dec 6, 2023
Author

@seanlaw There are approximately 54000 files. I do not have access to more computing power than my HP ZBook laptop.

The pattern is pretty consistent in the data. As you can see from the attached images, blue is what it is meant to look like, and anomalies are identified when the time series follows the red path. Within the data the amplitude varies a fair bit. Running stumpy on one file of data produces excellent results in picking up these anomalies, but I was hoping I could expand it and run it on every file in the dataset individually. The pattern happens at a very high frequency, which makes downsampling rather challenging because it removes the features in the data. The pattern also happens at varying frequencies. I can strategically remove datapoints to ensure the subsequence is the same length throughout the entire dataset, but due to hardware restrictions I am required to calculate in separate batches.

Thank you for your advice. I am new to data science and don't come from a coding background, so I do have a fair few personal limitations!

JaKasb Dec 6, 2023

54000 files * 6000 samples = 3.24E08 samples = 2.59 gigabytes,
roughly 2^28.
In the context of the benchmark figures here:
https://raw.githubusercontent.com/TDAmeritrade/stumpy/master/docs/images/performance.png
you are at 324 million on the x-axis.
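The estimate above can be reproduced with a few lines of arithmetic. This is a small sketch that assumes 8 bytes per sample (float64), which is the storage size that yields the 2.59 GB figure; the file and sample counts are the ones given earlier in the thread.

```python
import math

n_files = 54_000           # file count reported earlier in the thread
samples_per_file = 6_000   # samples per file
bytes_per_sample = 8       # assumption: values stored as float64

n_samples = n_files * samples_per_file
size_gb = n_samples * bytes_per_sample / 1e9
log2_n = math.log2(n_samples)

print(f"{n_samples:.3g} samples, {size_gb:.2f} GB, about 2^{log2_n:.1f}")
```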

Unless you have access to a GPU, forget it.

Downsample by a factor of 100 and try to run it locally.
Subsample the number of files, e.g., only include every 100th file.
Downsample the sample rate; even a factor of 2 can help a lot.
https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.decimate.html
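As a rough illustration of the downsampling suggestion, the sketch below applies `scipy.signal.decimate` to a synthetic 6000-sample signal (the signal itself is made up for the example). `decimate` applies an anti-aliasing filter before dropping samples, whereas plain slicing does not.

```python
import numpy as np
from scipy.signal import decimate

# Synthetic stand-in for one 6000-sample file (assumption for illustration).
rng = np.random.default_rng(0)
t = np.arange(6000)
x = np.sin(2 * np.pi * t / 200) + 0.1 * rng.standard_normal(t.size)

x_half = decimate(x, 2)     # low-pass filter, then keep every 2nd sample
x_tenth = decimate(x, 10)   # low-pass filter, then keep every 10th sample
x_naive = x[::10]           # plain subsampling with no filtering (may alias)

print(x_half.shape, x_tenth.shape, x_naive.shape)
```

For large factors such as 100, the SciPy documentation recommends calling `decimate` several times with smaller factors rather than once with a large one.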

seanlaw Dec 7, 2023
Maintainer

@datadeng46 If the pattern is THAT well conserved and you are mostly interested in anomalies that don't look like the pattern, then instead of computing the full matrix profile, you might simply consider taking the known pattern and computing the distance profile using the [mass function](https://stumpy.readthedocs.io/en/latest/api.html#stumpy.mass).

Answer selected by datadeng46

datadeng46 Dec 7, 2023
Author

@seanlaw I just started implementing a solution using the mass function and it appears to work exactly how I was hoping! Is there a best-practice way to turn the distance profile into an anomaly score? While experimenting, I took the inverse of the amplitude of the distance profile, using a window length equal to the subsequence length, which seems to give similar results to the matrix profile from stumpy.stump. However, if there is a best-practice way, I'd like to look at implementing that.

seanlaw Dec 8, 2023
Maintainer

@datadeng46 Would you mind showing the mass code that you've settled upon? This way I can see how you are going about it.

datadeng46 Dec 8, 2023
Author

@seanlaw Hi Sean, please see below. Apologies, I have had to remove details due to confidentiality. I iterate through the list of files and load the data, which is in the form of a numpy array. Because the frequency of the signal changes, I resample the data so that each subsequence is the same length, i.e., the full subsequence of 4 bumps occurs across the same number of samples. I then apply the mass function, with Q being a previously defined motif (a numpy array) and T the full time series that was loaded in. On the output distance profile I take the amplitude using the scipy min and max 1d filters. This is by no means perfect and I am open to improvements. The run time is excellent, however, and allows me to iterate through the 6000-sample files very quickly.
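The original code was not shared, so the following is only one possible reading of the description above: the "amplitude" is taken as the rolling peak-to-peak range of the distance profile, computed with `scipy.ndimage.maximum_filter1d` and `minimum_filter1d`, and the anomaly score is its inverse. The distance profile `D`, the window length, and the small `eps` constant are all assumptions made for illustration.

```python
import numpy as np
from scipy.ndimage import maximum_filter1d, minimum_filter1d


def rolling_amplitude(D, window):
    """Peak-to-peak range of D inside a centered sliding window."""
    return maximum_filter1d(D, size=window) - minimum_filter1d(D, size=window)


# Synthetic distance profile (assumption): it oscillates where the pattern
# repeats normally and stays flat and high where the pattern is absent.
n = 1000
window = 50                                  # hypothetical subsequence length
D = 2.0 + 2.0 * np.sin(2 * np.pi * np.arange(n) / window)
D[400:600] = 4.0                             # region with no good match

amp = rolling_amplitude(D, window)
eps = 1e-6                                   # avoids division by zero
score = 1.0 / (amp + eps)                    # inverse amplitude as anomaly score

print(amp[100], amp[500], score[500] > score[100])
```

Under this reading, a flat distance profile gives a small amplitude and therefore a large score, which is consistent with flagging stretches where the known pattern never matches well.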

seanlaw Dec 8, 2023
Maintainer

This seems reasonable
