Replies: 32 comments 16 replies
-
Furthermore, in the docs you mention that only self-joins are supported. Could you explain what you mean by this?
-
I'm glad you found it! Perhaps you can explain what you mean by "multidimensional analysis". In STUMPY, our multi-dimensional matrix profile computation accepts a D-dimensional array as input, where each dimension corresponds to a separate time series (though all D time series must be the same length and aligned). More specifically, we faithfully reproduce the multi-dimensional work from the original authors (paper here) and we strongly recommend reading through it if you haven't had the chance to. The core of this paper is currently implemented as stumpy.mstump.
For completeness (in case others are interested - I apologize in advance if you already understand this), a multi-dimensional matrix profile is not what most people think it is. That is, a multi-dimensional matrix profile is not individual 1-D matrix profiles stacked on top of/next to each other. Again, I strongly recommend reviewing the paper mentioned above and, while I'm sure that I don't have all of the answers, feel free to ask questions, as an open discussion helps my understanding as well. Of course, if what you need is in fact D 1-dimensional time series stacked on top of each other, then you might as well process them individually (in an embarrassingly parallel way) and then stack the results afterward. Just know that, fundamentally, this has a completely different definition from the original matrix profile papers and, for consistency and to avoid confusion, I would not refer to it as a multi-dimensional matrix profile. If you don't mind, for clarity of communication, let's refer to the published definition as a "multi-dimensional matrix profile" and to many individually analyzed 1-D matrix profiles (based on a set of related time series) as a "stacked matrix profile".
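To make the distinction concrete, here is a minimal sketch (random data and an arbitrary window size, purely to contrast the two objects):

import numpy as np
import stumpy

T = np.random.rand(3, 1000)  # 3 aligned time series (dimensions) of equal length
m = 50                       # arbitrary window size

# Multi-dimensional matrix profile (published definition): row k is the
# (k+1)-dimensional matrix profile, so the rows are NOT independent of each other
P, I = stumpy.mstump(T, m)

# "Stacked" matrix profile: each dimension analyzed on its own and then
# stacked afterward - a fundamentally different object
stacked = np.vstack([stumpy.stump(T[d], m)[:, 0].astype(float) for d in range(T.shape[0])])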
So, anomalies are really hard even in one dimension and @mexxexx and I have briefly discussed the meaning (or lack thereof) of discords in the multi-dimensional matrix profile sense (using the published definition here). I may be mistaken, but the multi-dimensional matrix profile paper didn't seem to provide many details regarding how best to interpret this matrix profile for discords. We could certainly use some help here.
I'm not sure I fully understand your question here and you may need to provide a little more information to clarify.
I wonder if you can just look at the final dimension of the multi-dimensional matrix profile to ascertain this, though I'm not experienced enough with this or your particular use case to elaborate. But I'm happy to talk through it if you want to provide more detail as to what you mean or are looking for.
In the 1-D case, a self-join means that you have a single time series and you are looking for conserved patterns within that one time series. So, for each subsequence in your time series, you are asking "where is its nearest-neighbor subsequence elsewhere within this same time series?" In the case of an "AB-join", you have two independent time series, and for each subsequence in the first time series you look for its nearest neighbor in the second time series. An AB-join makes very little sense in the case of multi-dimensional matrix profiles and there is no published work on this and, hence, it is not supported. Let me know if that makes sense.
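Here is a tiny 1-D sketch of the two join types (random data, arbitrary window size):

import numpy as np
import stumpy

m = 50
T_A = np.random.rand(1000)
T_B = np.random.rand(800)

# Self-join: for each subsequence in T_A, find its nearest neighbor
# elsewhere within T_A itself
self_join = stumpy.stump(T_A, m)

# AB-join: for each subsequence in T_A, find its nearest neighbor in T_B
# (trivial-match exclusion is turned off since the two inputs are independent)
ab_join = stumpy.stump(T_A, m, T_B=T_B, ignore_trivial=False)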
-
Many thanks for the quick response! The data might look like this (see the Python code below, which generates such dummy data; obviously, the real data is not normally distributed and might even contain missing records):
Description of the data: each device (here named device_id) generates an observation (= row) for each timestep.

What I want to achieve: sometimes devices are broken (totally not working) or, more importantly, partially not working and affecting their surroundings. I want to identify such (partially) broken devices (= anomalies).

How STUMPY / the matrix profile fits in (or at least how I understand it could fit in so far): with mp, indices = stumpy.mstump(df[['T1', 'T2', 'T3']], m) # this analyzes 3 time series at once, I can analyze a single multivariate time series of a single device (at least this is what I understand), and T1...T3 would be the multivariate metrics recorded by a single device. It is a desired property to not simply stack up single-dimension discords/anomalies but rather to combine the information from multiple dimensions (and, again, as far as I understand, this is exactly what the current mstump implementation does). Is this correct? Or am I misunderstanding this? Now my question is (assuming I am correct so far):
So far I thought that metrics should be columns and timestamps should be rows, with an indicator of which device a row belongs to. Though maybe univariate discords could be computed and then combined in some logical way (i.e., if k out of n metrics indicate an anomaly, it is considered an anomaly). The last comment is fine/makes sense.

import pandas as pd
import numpy as np
import random
random_seed = 42
np.random.seed(random_seed)
random.seed(random_seed)
def generate_df_for_device(n_observations, n_metrics, device_id, geo_id, topology_id, cohort_id):
    df = pd.DataFrame(np.random.randn(n_observations, n_metrics), index=pd.date_range('2020', freq='H', periods=n_observations))
    df.columns = [f'metrik_{c}' for c in df.columns]
    df['geospatial_id'] = geo_id
    df['topology_id'] = topology_id
    df['cohort_id'] = cohort_id
    df['device_id'] = device_id
    return df
def generate_multi_device(n_observations, n_metrics, n_devices, cohort_levels, topo_levels):
    results = []
    for i in range(1, n_devices + 1):
        r = random.randrange(1, n_devices)
        cohort = random.randrange(1, cohort_levels)
        topo = random.randrange(1, topo_levels)
        df_single_device = generate_df_for_device(n_observations, n_metrics, i, r, topo, cohort)
        results.append(df_single_device)
    return pd.concat(results)
# hourly data, 1 week of data
n_observations = 7 * 24
n_metrics = 3
n_devices = 20
cohort_levels = 3
topo_levels = 5
df = generate_multi_device(n_observations, n_metrics, n_devices, cohort_levels, topo_levels)
df = df.sort_index()
df.head()
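For concreteness, here is a minimal sketch (window size chosen arbitrarily) of how I imagine feeding one device's metrics into mstump:

import stumpy

m = 24  # e.g. one day of hourly data, chosen arbitrarily

single_device = df[df['device_id'] == 1]
metric_cols = [c for c in single_device.columns if c.startswith('metrik_')]

# multi-dimensional matrix profile across the metrics of a single device
mp, indices = stumpy.mstump(single_device[metric_cols], m)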
-
Yes, I think we are on the same page. Technically, there is no assumption that the multivariate metrics be "recorded by a single device". You can also have three separate devices that are all synchronized to record data at the same time (e.g., one temperature sensor, one humidity sensor, and one pressure sensor).
Correct.
Yes, I think so
Yes. You get a multi-dimensional matrix profile (and indices) as output, and this is identical to what is described in the original paper.
Well, "normal" behavior is likely captured by the smallest values in your matrix profile but the problem is figuring out which number of dimensions is the most useful. We are still building tools for this and it is more art than science at this stage. However, if you have multiple devices then you may consider simply concatenating the time series from multiple devices (separated by a
As described above, discords (which are "potential anomalies") would be the matrix profile values from within each dimension (i.e., row) that have the largest value(s). The challenge here is that the multi-dimensional matrix profile output doesn't tell you WHICH dimensions within your multivariate time series are the important/unimportant ones. It's not that the information is unavailable; it's just too much data and too complex to store efficiently in memory. In the future, we'll have some tools to help with the post-processing.
Well, it depends on whether you are using NumPy arrays or Pandas DataFrames. We had started with NumPy arrays only and, typically, I was used to related datasets (i.e., each time series) being in the same row within a NumPy array. Then, later, we added Pandas support, but Pandas typically orients a time series within a column. Sorry for the confusion 🤷♂️
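In other words, both of these orientations should be accepted (a minimal sketch with random data):

import numpy as np
import pandas as pd
import stumpy

m = 50
data = np.random.rand(3, 1000)

# NumPy convention: each ROW of the array is one dimension/time series
P_np, I_np = stumpy.mstump(data, m)

# Pandas convention: each COLUMN of the DataFrame is one dimension/time series
df_wide = pd.DataFrame(data.T, columns=['T1', 'T2', 'T3'])
P_pd, I_pd = stumpy.mstump(df_wide, m)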
-
I think that this is a good point:
But my assumption is that multiple metrics/sensors from a single device are more coherent than those from a second device, which may reside on the other side of the country. Thanks for the answer with regards to orientation. For me (assuming the pandas way of doing things), it is not yet clear how to properly structure my dataset with time series from multiple devices. Would the example data I shared above be considered to be in the right shape?
-
Maybe I can add a thought or two: if you have prior domain knowledge, it would be very helpful. Let's focus on one device, for example. Can you say that all metrics are relevant? Can you exclude some of them? Or could it be that some metrics are important and some are not? Depending on the answer, things would get easier; the third case will be the most complicated one.

Furthermore, I assume you know what the matrix profile is used for, but I'll just clarify once more: it can be used to easily detect repeated subsequences in a time series, so-called patterns or motifs. That means that, for it to be useful, your time series will probably need to have some sort of repeated pattern, which would correspond to either a functioning or a malfunctioning device. Depending on what your data looks like, you could reach your goal in different ways; maybe you can share a bit more info on it. But one example: suppose that a functioning device always has a recurring pattern (imagine the heartbeat of a human). Then you would calculate the matrix profile and look for very large matrix profile values, because a large MP value means that the subsequence has no similar matches. If you think about the heartbeat example, this could be a skipped beat.

You said that you have missing values. This is a problem, as you can't just set the values to NaN. You need to think about a strategy for how to mask these values. Can you just leave faulty timestamps out? Can you use the value from the last timestamp? Always remember that you are really looking for visual patterns, so whatever method suits your dataset, try to make it so the visual structure of the time series is not altered. One bad example would be if your time-series values are around some small range and you fill the gaps with a value far outside that range, which would completely distort that visual structure.
STUMPY can work either with NumPy arrays or with DataFrames. If you pass a DataFrame object, then the orientation is correct. However, depending on your data again, you will probably want to pass only the values of one device (something like selecting only the rows that belong to a single device_id; see the sketch below). I hope this helps!
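For example, something along these lines (a sketch based on the dummy df shared earlier; the device id and column name are just illustrative):

import stumpy

m = 24

# pass only the rows (and metric column) belonging to one device
one_device = df[df['device_id'] == 1]
mp_1d = stumpy.stump(one_device['metrik_0'], m)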
-
Yes, it looks about right. One word of caution is that STUMPY assumes floating-point (continuous) inputs so, realistically, only the metrics would be "valid" input.
-
I have looked further into my dataset and figured out that not all IoT devices are creating monitoring data (= the time series I am looking at) all the time, only when they are switched on. But, as you pointed out, this results in an additional numerical problem. I still need to think about how to handle this.
-
@mexxexx Would you mind elaborating on why one can't set the missing values to NaN?
-
To be honest, I can't pinpoint the problem exactly. But I have the feeling that if one of the dimensions is NaN and you calculate the multidimensional profile, something should go wrong. However, maybe I am mistaken. After all, mstump calculates the one-dimensional profiles first, so those would just be ... Does that make sense?
-
Yes, thank you! I agree that the matrix profile values would probably be ... Thank you for sharing your insights!
-
Many thanks for your great points. Let me give you some more context around this. The time series stem from HFC (cable modem) routers/modems. https://www.usenix.org/conference/nsdi20/presentation/hu-jiyao gives a good overview of the business domain. The individual interfaces (for the various frequencies) of a device (modem/router, identified by MAC address) record multivariate time series of metrics (see the figure in that paper for an illustration of this topology).

The idea is to detect anomalies (when a device is broken), as in the case of high noise this affects the neighborhood and degrades the quality of the internet for other modems as well; i.e., it is usually noticed at the most global (fiber node) level but is hard to pinpoint to a failure cause / individual device.

With regards to the NULL values: not all devices are powered on / generating monitoring metrics on all interfaces for all observation timestamps. So, in particular, it is not that one of the dimensions is NaN; rather, for this device/port/frequency no metrics were reported at all for an observational period. As a starter, maybe beginning with a univariate time series (signal-to-noise ratio) might be better - so I can (partially) work around this particular problem. However, I would also want to include details about what is normal from other routers in the hierarchy - and would then again face a time-alignment issue & missing data.
-
@geoHeil If you don't mind, I'm going to close this for now but feel free to re-open if you have any further questions or have any updates. We'd be happy to discuss it with you!
-
Indeed, for a single device 3 metrics would be considered relevant. But, as far as I understand it, we would need to pivot the matrix, choose a single metric, and have all the devices as columns - but this would then be a rather high-dimensional space (easily in the thousands to tens of thousands). So far, I have looked into ... When I want to compute the distance using a single metric and more than one device, how would you suggest doing it? I still have to get more familiar with mstump - but, as you said, it is a bit complex to define anomalies (and I will need more time to think about it). Furthermore: it looks like AAMP seems to work better. I guess this is because the time series are not very periodic (unlike heartbeats, networking layer-1 issues have no nice periodicity and contain a lot of noise).
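For reference, this is roughly how I have been comparing the two on a single device's metric (a sketch on the dummy data; window size chosen arbitrarily):

import stumpy

m = 24
T = df[df['device_id'] == 1]['metrik_0'].to_numpy(dtype=float)

# z-normalized matrix profile (the default), which compares subsequence shapes
mp_znorm = stumpy.stump(T, m)

# non-normalized (AAMP) matrix profile, which also takes raw amplitudes into account
mp_aamp = stumpy.aamp(T, m)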
-
@seanlaw I do not have permissions to reopen - please tell me whether, if I want to continue, I should keep discussing here in the closed thread or open a different issue.
-
@geoHeil I'm not sure I'm following what you are saying, as it feels like we are referring to different things when we say "concatenate". Would you mind providing a simple example with three devices (i.e., ...)? You've likely exceeded the extent of my concrete knowledge/understanding of matrix profiles so, full disclosure, know that any following responses may be limited/incorrect and are not meant to come across as authoritative. Your problem seems quite complex and the last thing that I want to do is pretend that I know the answer when, in reality, I don't.
-
Many thanks for your great help! I have a data frame similar to the one printed below (see S1). As you can see, it is ordered by time. If I sort by device instead (see S2), the observations of each device are all one after the other, which means the data can be worked on rather quickly (vectorized). However, as it is no longer sorted primarily by time, this also means (as far as I understand the matrix profile) that the distances between device 1 and device 2 should be rather large, i.e., finding anomalies would be hard (in the sense of incorporating the motifs of the different devices in a meaningful way to define normality). When I instead pivot the data (see S3), this is solved - but I would need to use mstump, with all the problems of multidimensional distances. Is there a better solution? I certainly do want to incorporate the information of other devices close to each one to define what constitutes normality / an anomaly. Starting with a single dimension is fine. My observations are: on the data of a single device ... Please see the code below.

%pylab inline
import stumpy
import pandas as pd
import numpy as np
import random
random_seed = 47
np.random.seed(random_seed)
random.seed(random_seed)
def generate_df_for_device(n_observations, n_metrics, device_id, geo_id, topology_id, cohort_id):
    df = pd.DataFrame(np.random.randn(n_observations, n_metrics), index=pd.date_range('2020', freq='H', periods=n_observations))
    df.columns = [f'metrik_{c}' for c in df.columns]
    df['geospatial_id'] = geo_id
    df['topology_id'] = topology_id
    df['cohort_id'] = cohort_id
    df['device_id'] = device_id
    return df
def generate_multi_device(n_observations, n_metrics, n_devices, cohort_levels, topo_levels):
    results = []
    for i in range(1, n_devices + 1):
        r = random.randrange(1, n_devices)
        cohort = random.randrange(1, cohort_levels)
        topo = random.randrange(1, topo_levels)
        df_single_device = generate_df_for_device(n_observations, n_metrics, i, r, topo, cohort)
        results.append(df_single_device)
    return pd.concat(results)
# hourly data, 1 week of data
n_observations = 7 * 24
n_metrics = 1
n_devices = 4
cohort_levels = 2
topo_levels = 3
df = generate_multi_device(n_observations, n_metrics, n_devices, cohort_levels, topo_levels)
df = df.sort_index()
df = df.reset_index().rename(columns={'index':'hour'})
df = df.drop(['geospatial_id', 'topology_id', 'cohort_id'], axis=1)
### S1
df.head()
### S2
df.sort_values(['device_id', 'hour']).head()
### S3
df.sort_values(['device_id', 'hour']).set_index(['hour', 'device_id']).unstack()
-
@geoHeil Can you tell me if you expect a pattern in, say, ...? Perhaps, once you've pivoted the data, you should just compute the matrix profile for each column (i.e., for each device) independently.
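Something along these lines, perhaps (a sketch assuming the pivoted frame from your previous comment, where each column corresponds to one device):

import stumpy

m = 24
pivoted = df.sort_values(['device_id', 'hour']).set_index(['hour', 'device_id']).unstack()

# one independent 1-D matrix profile per device (i.e., per column)
profiles = {
    col: stumpy.stump(pivoted[col].to_numpy(dtype=float), m)
    for col in pivoted.columns
}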
-
Well, these cohorts are failure boundaries, i.e., they do not propagate the error introduced by broken devices to devices in the other failure region. But devices within a region should fairly soon converge to a broken state, so it is expected that all regions will show broken devices from time to time. To answer your question: it is expected to see a lot of noise in the data. Perhaps it would be best to only look for matrix profiles with a small value/distance, i.e., repeating patterns, and not anomalies. And if phrased this way, indeed, it would be expected to see this noise in most of the devices - unless broken. For now, I have resorted to an approach similar to https://stackoverflow.com/questions/64751921/pandas-apply-function-to-each-group-output-is-not-really-an-aggregation to apply the matrix profile computation to each group. When you say transform and then apply: would this work in a vectorized way? Iterating over the groups as outlined above is still rather slow.
-
@geoHeil Unfortunately, matrix profiles are an expensive computation when you have many, many time series. Considering that you have thousands to tens of thousands of devices, this will certainly take a while.
-
Sure - I just want to make sure I use it in an optimized/vectorized way, or at least in a semantically correct way ;). This is a great point to use GPUs - I do have access to some, and I really appreciate that you have already looked into #53.
-
I think the only thing is to make sure that the time series is passed to STUMPY as a NumPy array, which should guarantee that the data is contiguous in memory. So, after you pivot your data as you had described above, you may want to do a ...
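Perhaps something along these lines (just a sketch; the exact conversion and the column names will depend on how you pivot):

import numpy as np
import stumpy

m = 24
# hypothetical column from the pivoted frame; the conversion ensures STUMPY
# receives a contiguous, float NumPy array
T = np.ascontiguousarray(pivoted[('metrik_0', 1)].to_numpy(dtype=float))
mp = stumpy.stump(T, m)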
Yes, I remembered you asking about NOT using GPUs before 😄
-
;) Well, I wanted to start with some basics first.
-
@geoHeil In case it matters, there is currently no GPU support for ...
-
@seanlaw The 1-D matrix profile works quite well now! I have to better understand ...
-
Congratulations, that's great to hear! I'd love to hear more about what you are finding and how you are leveraging the matrix profile. Also, if you end up publishing any related work, please don't forget to cite our STUMPY paper where appropriate.
In general, this will be more complex and possibly harder to provide assistance for, but I'm happy to share what little I know/understand.
-
No worries - but this will take a few more months.
-
Awesome writeup, #305 (reply in thread)! I will have to try this soon.
-
@seanlaw When I have matrix profiles from multiple devices, their values are only normalized within an individual device and are thus not easily comparable. Is this right? I.e., is it wrong to assume that a matrix profile value of 2 is 2x as anomalous as a value of 1 when the two values stem from different devices / time series?
-
I have the problem that my time series are quite noisy. I frequently observe the warning "A large number of values are smaller than 1e-5", even though I have set ignore_trivial=True. Is this a problem? How could this perhaps be adapted to be more suitable for such rather noisy time series?
-
Many thanks for #202, this is a great start. Do you plan to update the linked Jupyter notebook for multidimensional analysis & discord discovery?
I am interested in anomaly detection of time series.
My time series originate from many IoT devices. For each device, hourly metrics are recorded. Note: more than one metric is recorded. Additionally, metadata (geolocation and connection topology) are available. I wonder whether mstump / mstumped only works for a single time series, and how it could be extended to support multiple devices/time series and potentially calculate the deviation from a group/cohort of time series (geo-region, firmware version, ...).