Replies: 32 comments 16 replies
-
Furthermore, in the docs you mention that only self-joins are supported. Could you explain what you mean by this?
-
I'm glad you found it! Perhaps you can explain what you mean by "multidimensional analysis". In STUMPY, our multi-dimensional matrix profile computation accepts a D-dimensional array as input, where each dimension corresponds to a separate time series (though all D time series must be the same length and aligned). More specifically, we faithfully reproduce the multi-dimensional work from the original authors (paper here) and we strongly recommend reading through it if you haven't had the chance to. The core of this paper is currently implemented as stumpy.mstump.
For completeness (in case others are interested - I apologize in advance if you already understand this), a multi-dimensional matrix profile is not what most people think it is. That is, a multi-dimensional matrix profile is not individual 1-D matrix profiles stacked on top of/next to each other. Again, I strongly recommend reviewing the paper mentioned above and, while I'm sure that I don't have all of the answers, feel free to ask questions, as an open discussion helps my understanding as well. Of course, if what you need is in fact D 1-dimensional time series stacked on top of each other, then you might as well process them individually (in an embarrassingly parallel way) and then stack the results afterward. Just know that, fundamentally, this has a completely different definition from the original matrix profile papers and, for consistency and to avoid confusion, I would not refer to it as a multi-dimensional matrix profile. If you don't mind, for clarity of communication, let's refer to the published definition as a "multi-dimensional matrix profile" and to many individually analyzed 1-D matrix profiles (based on a set of related time series) as a "stacked matrix profile".
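To make the distinction concrete, here is a minimal sketch (random data and an arbitrary window size, purely to contrast the two objects):

import numpy as np
import stumpy

T = np.random.rand(3, 1000)  # 3 aligned time series (dimensions) of equal length
m = 50                       # arbitrary window size

# Multi-dimensional matrix profile (published definition): row k is the
# (k+1)-dimensional matrix profile, so the rows are NOT independent of each other
P, I = stumpy.mstump(T, m)

# "Stacked" matrix profile: each dimension analyzed on its own and then
# stacked afterward - a fundamentally different object
stacked = np.vstack([stumpy.stump(T[d], m)[:, 0].astype(float) for d in range(T.shape[0])])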
So, anomalies are really hard even in one dimension and @mexxexx and I have briefly discussed the meaning (or lack thereof) of discords in the multi-dimensional matrix profile sense (using the published definition here). I may be mistaken, but the multi-dimensional matrix profile paper didn't seem to provide many details regarding how best to interpret this matrix profile for discords. We could certainly use some help here.
I'm not sure I fully understand your question here and you may need to provide a little more information to clarify.
I wonder if you can just look at the final dimension of the multi-dimensional matrix profile to ascertain this, though I'm not experienced enough with this or your particular use case to elaborate. But I'm happy to talk through it if you want to provide more detail as to what you mean or are looking for.
In the 1-D case, a self-join means that you have a single time series and you are looking for conserved patterns within that one time series. So, for each subsequence in your time series, you are asking "where is its nearest-neighbor subsequence elsewhere within this same time series?" In the case of an "AB-join", you have two independent time series, and for each subsequence in the first time series you look for its nearest neighbor in the second time series. An AB-join makes very little sense in the case of multi-dimensional matrix profiles and there is no published work on this and, hence, it is not supported. Let me know if that makes sense.
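Here is a tiny 1-D sketch of the two join types (random data, arbitrary window size):

import numpy as np
import stumpy

m = 50
T_A = np.random.rand(1000)
T_B = np.random.rand(800)

# Self-join: for each subsequence in T_A, find its nearest neighbor
# elsewhere within T_A itself
self_join = stumpy.stump(T_A, m)

# AB-join: for each subsequence in T_A, find its nearest neighbor in T_B
# (trivial-match exclusion is turned off since the two inputs are independent)
ab_join = stumpy.stump(T_A, m, T_B=T_B, ignore_trivial=False)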
-
Many thanks for the quick response! The data might look like this (see the Python code below, which generates such dummy data; obviously, the real data is not normally distributed and might even contain missing records):
Description of the data: each device (here named device_id) generates an observation (= row) for each timestep.

What I want to achieve: sometimes devices are broken (totally not working) or, more importantly, partially not working and affecting their surroundings. I want to identify such (partially) broken devices (= anomalies).

How STUMPY / the matrix profile fits in (or at least how I understand it could fit in so far): with mp, indices = stumpy.mstump(df[['T1', 'T2', 'T3']], m) # this analyzes 3 time series at once, I can analyze a single multivariate time series of a single device (at least this is what I understand), and T1...T3 would be the multivariate metrics recorded by a single device. It is a desired property to not simply stack up single-dimension discords/anomalies but rather to combine the information from multiple dimensions (and, again, as far as I understand, this is exactly what the current mstump implementation does). Is this correct? Or am I misunderstanding this? Now my question is (assuming I am correct so far):
So far I thought that metrics should be columns and timestamps should be rows, with an indicator of which device a row belongs to. Though maybe univariate discords could be computed and then combined in some logical way (i.e., if k out of n metrics indicate an anomaly, it is considered an anomaly). The last comment is fine/makes sense.

import pandas as pd
import numpy as np
import random
random_seed = 42
np.random.seed(random_seed)
random.seed(random_seed)
def generate_df_for_device(n_observations, n_metrics, device_id, geo_id, topology_id, cohort_id):
    df = pd.DataFrame(np.random.randn(n_observations, n_metrics), index=pd.date_range('2020', freq='H', periods=n_observations))
    df.columns = [f'metrik_{c}' for c in df.columns]
    df['geospatial_id'] = geo_id
    df['topology_id'] = topology_id
    df['cohort_id'] = cohort_id
    df['device_id'] = device_id
    return df
def generate_multi_device(n_observations, n_metrics, n_devices, cohort_levels, topo_levels):
    results = []
    for i in range(1, n_devices + 1):
        r = random.randrange(1, n_devices)
        cohort = random.randrange(1, cohort_levels)
        topo = random.randrange(1, topo_levels)
        df_single_device = generate_df_for_device(n_observations, n_metrics, i, r, topo, cohort)
        results.append(df_single_device)
    return pd.concat(results)
# hourly data, 1 week of data
n_observations = 7 * 24
n_metrics = 3
n_devices = 20
cohort_levels = 3
topo_levels = 5
df = generate_multi_device(n_observations, n_metrics, n_devices, cohort_levels, topo_levels)
df = df.sort_index()
df.head()
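For concreteness, here is a minimal sketch (window size chosen arbitrarily) of how I imagine feeding one device's metrics into mstump:

import stumpy

m = 24  # e.g. one day of hourly data, chosen arbitrarily

single_device = df[df['device_id'] == 1]
metric_cols = [c for c in single_device.columns if c.startswith('metrik_')]

# multi-dimensional matrix profile across the metrics of a single device
mp, indices = stumpy.mstump(single_device[metric_cols], m)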
-
Yes, I think we are on the same page. Technically, there is no assumption that the multivariate metrics be "recorded by a single device". You can also have three separate devices that are all synchronized to record data at the same time (e.g., one temperature sensor, one humidity sensor, and one pressure sensor).
Correct.
Yes, I think so
Yes. You get a multi-dimensional matrix profile (and indices) as output, and this is identical to what is described in the original paper.
Well, "normal" behavior is likely captured by the smallest values in your matrix profile but the problem is figuring out which number of dimensions is the most useful. We are still building tools for this and it is more art than science at this stage. However, if you have multiple devices then you may consider simply concatenating the time series from multiple devices (separated by a
As described above, discords (which are "potential anomalies") would be the matrix profile values from within each dimension (i.e., row) that have the largest value(s). The challenge here is that the multi-dimensional matrix profile output doesn't tell you WHICH dimensions within your multivariate time series are the important/unimportant ones. It's not that the information is unavailable; it's just too much data and too complex to store efficiently in memory. In the future, we'll have some tools to help with the post-processing.
Well, it depends on whether you are using NumPy arrays or Pandas DataFrames. We had started with NumPy arrays only and, typically, I was used to related datasets (i.e., each time series) being in the same row within a NumPy array. Then, later, we added Pandas support, but Pandas typically orients a time series within a column. Sorry for the confusion 🤷♂️
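In other words, both of these orientations should be accepted (a minimal sketch with random data):

import numpy as np
import pandas as pd
import stumpy

m = 50
data = np.random.rand(3, 1000)

# NumPy convention: each ROW of the array is one dimension/time series
P_np, I_np = stumpy.mstump(data, m)

# Pandas convention: each COLUMN of the DataFrame is one dimension/time series
df_wide = pd.DataFrame(data.T, columns=['T1', 'T2', 'T3'])
P_pd, I_pd = stumpy.mstump(df_wide, m)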
-
I think that this is a good point:
But my assumption is that multiple metrics/sensors from a single device are more coherent than those from a second device, which may reside on the other side of the country. Thanks for the answer with regards to orientation. For me (assuming the pandas way of doing things), it is not yet clear how to properly structure my dataset with time series from multiple devices. Would the example data I shared above be considered to be in the right shape?
-
Maybe I can add a thought or two: if you have prior domain knowledge, it would be very helpful. Let's focus on one device, for example. Can you say that all metrics are relevant? Can you exclude some of them? Or could it be that some metrics are important and some are not? Depending on the answer, things would get easier; the third case will be the most complicated one.

Furthermore, I assume you know what the matrix profile is used for, but I'll just clarify once more: it can be used to easily detect repeated subsequences in a time series, so-called patterns or motifs. That means that, for it to be useful, your time series will probably need to have some sort of repeated pattern, which would correspond to either a functioning or a malfunctioning device. Depending on what your data looks like, you could reach your goal in different ways; maybe you can share a bit more info on it. But one example: suppose that a functioning device always has a recurring pattern (imagine the heartbeat of a human). Then you would calculate the matrix profile and look for very large matrix profile values, because a large MP value means that the subsequence has no similar matches. If you think about the heartbeat example, this could be a skipped beat.

You said that you have missing values. This is a problem, as you can't just set the values to NaN. You need to think about a strategy for how to mask these values. Can you just leave faulty timestamps out? Can you use the value from the last timestamp? Always remember that you are really looking for visual patterns, so whatever method suits your dataset, try to make it so the visual structure of the time series is not altered. One bad example would be if your time-series values are around some small range and you fill the gaps with a value far outside that range, which would completely distort that visual structure.
STUMPY can work either with NumPy arrays or with DataFrames. If you pass a DataFrame object, then the orientation is correct. However, depending on your data again, you will probably want to pass only the values of one device (something like selecting only the rows that belong to a single device_id; see the sketch below). I hope this helps!
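For example, something along these lines (a sketch based on the dummy df shared earlier; the device id and column name are just illustrative):

import stumpy

m = 24

# pass only the rows (and metric column) belonging to one device
one_device = df[df['device_id'] == 1]
mp_1d = stumpy.stump(one_device['metrik_0'], m)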
-
Yes, it looks about right. One word of caution is that STUMPY assumes floating-point (continuous) inputs so, realistically, only the metrics would be "valid" input.
-
I have looked further into my dataset and figured out that not all IoT devices are creating monitoring data (= the time series I am looking at) all the time, only when they are switched on. But, as you pointed out, this results in an additional numerical problem. I still need to think about how to handle this.
-
@mexxexx Would you mind elaborating on why one can't set the missing values to NaN?
-
To be honest, I can't pinpoint the problem exactly. But I have the feeling that if one of the dimensions is NaN and you calculate the multidimensional profile, something should go wrong. However, maybe I am mistaken. After all, mstump calculates the one-dimensional profiles first, so those would just be ... Does that make sense?
-
Yes, thank you! I agree that the matrix profile values would probably be ... Thank you for sharing your insights!
-
Many thanks for your great points. Let me give you some more context around this. The time series stem from HFC (cable modem) routers/modems. https://www.usenix.org/conference/nsdi20/presentation/hu-jiyao gives a good overview of the business domain. The individual interfaces (for the various frequencies) of a device (modem/router, identified by MAC address) record multivariate time series of metrics (see the figure in that paper for an illustration of this topology).

The idea is to detect anomalies (when a device is broken), as in the case of high noise this affects the neighborhood and degrades the quality of the internet for other modems as well; i.e., it is usually noticed at the most global (fiber node) level but is hard to pinpoint to a failure cause / individual device.

With regards to the NULL values: not all devices are powered on / generating monitoring metrics on all interfaces for all observation timestamps. So, in particular, it is not that one of the dimensions is NaN; rather, for this device/port/frequency no metrics were reported at all for an observational period. As a starter, maybe beginning with a univariate time series (signal-to-noise ratio) might be better - so I can (partially) work around this particular problem. However, I would also want to include details about what is normal from other routers in the hierarchy - and would then again face a time-alignment issue & missing data.
-
@geoHeil If you don't mind, I'm going to close this for now but feel free to re-open if you have any further questions or have any updates. We'd be happy to discuss it with you!
-
Indeed, for a single device 3 metrics would be considered relevant. But, as far as I understand it, we would need to pivot the matrix, choose a single metric, and have all the devices as columns - but this would then be a rather high-dimensional space (easily in the thousands to tens of thousands). So far, I have looked into ... When I want to compute the distance using a single metric and more than one device, how would you suggest doing it? I still have to get more familiar with mstump - but, as you said, it is a bit complex to define anomalies (and I will need more time to think about it). Furthermore: it looks like AAMP seems to work better. I guess this is because the time series are not very periodic (unlike heartbeats, networking layer-1 issues have no nice periodicity and contain a lot of noise).
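For reference, this is roughly how I have been comparing the two on a single device's metric (a sketch on the dummy data; window size chosen arbitrarily):

import stumpy

m = 24
T = df[df['device_id'] == 1]['metrik_0'].to_numpy(dtype=float)

# z-normalized matrix profile (the default), which compares subsequence shapes
mp_znorm = stumpy.stump(T, m)

# non-normalized (AAMP) matrix profile, which also takes raw amplitudes into account
mp_aamp = stumpy.aamp(T, m)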
-
@seanlaw I do not have permissions to reopen - please tell me whether, if I want to continue, I should keep discussing here in the closed thread or open a different issue.
-
@geoHeil I'm not sure I'm following what you are saying, as it feels like we are referring to different things when we say "concatenate". Would you mind providing a simple example with three devices (i.e., ...)? You've likely exceeded the extent of my concrete knowledge/understanding of matrix profiles so, full disclosure, know that any following responses may be limited/incorrect and are not meant to come across as authoritative. Your problem seems quite complex and the last thing that I want to do is pretend that I know the answer when, in reality, I don't.
-
Many thanks for your great help! I have a data frame similar to the one printed below (see S1). As you can see, it is ordered by time. If I sort by device instead (see S2), the observations of each device are all one after the other, which means the data can be worked on rather quickly (vectorized). However, as it is no longer sorted primarily by time, this also means (as far as I understand the matrix profile) that the distances between device 1 and device 2 should be rather large, i.e., finding anomalies would be hard (in the sense of incorporating the motifs of the different devices in a meaningful way to define normality). When I instead pivot the data (see S3), this is solved - but I would need to use mstump, with all the problems of multidimensional distances. Is there a better solution? I certainly do want to incorporate the information of other devices close to each one to define what constitutes normality / an anomaly. Starting with a single dimension is fine. My observations are: on the data of a single device ... Please see the code below.

%pylab inline
import stumpy
import pandas as pd
import numpy as np
import random
random_seed = 47
np.random.seed(random_seed)
random.seed(random_seed)
def generate_df_for_device(n_observations, n_metrics, device_id, geo_id, topology_id, cohort_id):
    df = pd.DataFrame(np.random.randn(n_observations, n_metrics), index=pd.date_range('2020', freq='H', periods=n_observations))
    df.columns = [f'metrik_{c}' for c in df.columns]
    df['geospatial_id'] = geo_id
    df['topology_id'] = topology_id
    df['cohort_id'] = cohort_id
    df['device_id'] = device_id
    return df
def generate_multi_device(n_observations, n_metrics, n_devices, cohort_levels, topo_levels):
    results = []
    for i in range(1, n_devices + 1):
        r = random.randrange(1, n_devices)
        cohort = random.randrange(1, cohort_levels)
        topo = random.randrange(1, topo_levels)
        df_single_device = generate_df_for_device(n_observations, n_metrics, i, r, topo, cohort)
        results.append(df_single_device)
    return pd.concat(results)
# hourly data, 1 week of data
n_observations = 7 * 24
n_metrics = 1
n_devices = 4
cohort_levels = 2
topo_levels = 3
df = generate_multi_device(n_observations, n_metrics, n_devices, cohort_levels, topo_levels)
df = df.sort_index()
df = df.reset_index().rename(columns={'index':'hour'})
df = df.drop(['geospatial_id', 'topology_id', 'cohort_id'], axis=1)
### S1
df.head()
### S2
df.sort_values(['device_id', 'hour']).head()
### S3
df.sort_values(['device_id', 'hour']).set_index(['hour', 'device_id']).unstack()
-
@geoHeil Can you tell me if you expect a pattern in, say, ...? Perhaps, once you've pivoted the data, you should just compute the matrix profile for each column (i.e., for each device) independently.
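Something along these lines, perhaps (a sketch assuming the pivoted frame from your previous comment, where each column corresponds to one device):

import stumpy

m = 24
pivoted = df.sort_values(['device_id', 'hour']).set_index(['hour', 'device_id']).unstack()

# one independent 1-D matrix profile per device (i.e., per column)
profiles = {
    col: stumpy.stump(pivoted[col].to_numpy(dtype=float), m)
    for col in pivoted.columns
}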
-
Well, these cohorts are failure boundaries, i.e., they do not propagate the error introduced by broken devices to devices in the other failure region. But devices within a region should fairly soon converge to a broken state, so it is expected that all regions will show broken devices from time to time. To answer your question: it is expected to see a lot of noise in the data. Perhaps it would be best to only look for matrix profiles with a small value/distance, i.e., repeating patterns, and not anomalies. And if phrased this way, indeed, it would be expected to see this noise in most of the devices - unless broken. For now, I have resorted to an approach similar to https://stackoverflow.com/questions/64751921/pandas-apply-function-to-each-group-output-is-not-really-an-aggregation to apply the matrix profile computation to each group. When you say transform and then apply: would this work in a vectorized way? Iterating over the groups as outlined above is still rather slow.
-
@geoHeil Unfortunately, matrix profiles are an expensive computation when you have many, many time series. Considering that you have thousands to tens of thousands of devices, this will certainly take a while.
-
Sure - I just want to make sure I use it in an optimized/vectorized way, or at least in a semantically correct way ;). This is a great point to use GPUs - I do have access to some, and I really appreciate that you have already looked into #53.
-
I think the only thing is to make sure that the time series is passed to STUMPY as a NumPy array, which should guarantee that the data is contiguous in memory. So, after you pivot your data as you had described above, you may want to do a ...
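Perhaps something along these lines (just a sketch; the exact conversion and the column names will depend on how you pivot):

import numpy as np
import stumpy

m = 24
# hypothetical column from the pivoted frame; the conversion ensures STUMPY
# receives a contiguous, float NumPy array
T = np.ascontiguousarray(pivoted[('metrik_0', 1)].to_numpy(dtype=float))
mp = stumpy.stump(T, m)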
Yes, I remembered you asking about NOT using GPUs before 😄
-
;) Well, I wanted to start with some basics first.
-
@geoHeil In case it matters, there is currently no GPU support for ...
-
@seanlaw The 1-D matrix profile works quite well now! I have to better understand ...
-
Congratulations, that's great to hear! I'd love to hear more about what you are finding and how you are leveraging the matrix profile. Also, if you end up publishing any related work, please don't forget to cite our STUMPY paper where appropriate.
In general, this will be more complex and possibly harder to provide assistance for, but I'm happy to share what little I know/understand.
-
No worries - but this will take a few more months.
-
Awesome writeup, #305 (reply in thread)! I will have to try this soon.
-
@seanlaw When I have matrix profiles from multiple devices, their values are only normalized within an individual device and are thus not easily comparable. Is this right? I.e., is it wrong to assume that a matrix profile value of 2 is 2x as anomalous as a value of 1 when the two values stem from different devices / time series?
-
I have the problem that my time series are quite noisy. I frequently observe the warning "A large number of values are smaller than 1e-5", even though I have set ignore_trivial=True. Is this a problem? How could this perhaps be adapted to be more suitable for such rather noisy time series?
-
Many thanks for #202, this is a great start. Do you plan to update the linked Jupyter notebook for multidimensional analysis & discord discovery?
I am interested in anomaly detection of time series.
My time series originate from many IoT devices. For each device, hourly metrics are recorded. Note: more than one metric is recorded. Additionally, metadata (geolocation and connection topology) are available. I wonder whether mstump / mstumped only works for a single time series, and how it could be extended to support multiple devices/time series and potentially calculate the deviation from a group/cohort of time series (geo-region, firmware version, ...).