Question: understanding the algorithm #150

hfwittmann · 2020-03-26T19:36:58Z

Thanks for an excellent package!

I am not sure whether I fully understand the algorithm. I have tried to replicate the result for the series from the example
https://stumpy.readthedocs.io/en/latest/Tutorial_The_Matrix_Profile.html

with a basic implementation, but I get a deviation at one point. I am therefore a little confused about where I went wrong, or whether this is to be expected.

So here are the results:

I get
```python
import pandas as pd
import stumpy
import numpy as np

# %%

time_series = np.array([0, 1, 3, 2, 9, 1, 14, 15, 1, 2, 2, 10, 7], dtype=float)

window = 4
S = stumpy.stump(time_series, m=window)
```

stumpy.stump(time_series, m=window)
array([[0.6424863376402249, 9, -1, 9],
[0.28570485146990177, 8, -1, 8],
[1.6401694431976326, 9, 0, 9],
[0.898130637894946, 1, 1, 8],
[1.2795471494078055, 9, 0, 9],
[1.781964662297751, 2, 2, 9],
[2.0583190140538696, 7, 3, 7],
[2.8394325732553067, 4, 4, 8],
[0.28570485146990177, 1, 1, 9],
[0.6424863376402249, 0, 0, -1]], dtype=object)

My own basic implementation:

from scipy.stats import zscore
import pandas as pd
import stumpy
import numpy as np
# %%
time_series = np.array([0, 1, 3, 2, 9, 1, 14, 15, 1, 2, 2, 10, 7], dtype=float)

window = 4
length = len(time_series)
d_plus = np.zeros(shape=[length - window + 1, length - window + 1]) + np.nan

do_zscore = True
for i in range(0, length - window + 1):
    for j in range(0, length - window + 1):
        # print(i, j)
        if j == i: continue
        time_series_i = time_series[i:i + window]
        time_series_j = time_series[j:j + window]

        if do_zscore:
            time_series_i = zscore(time_series_i)
            time_series_j = zscore(time_series_j)

        d_plus[i, j] = ((time_series_i - time_series_j)**2).sum()**0.5

yields:

argmins = np.nanargmin(d_plus, axis=1)
mins = np.nanmin(d_plus, axis=1)

argmins [9 8 9 1 9 2 7 6 1 0]
mins [0.642 0.286 1.64 0.898 1.28 1.782 2.058 2.058 0.286 0.642]

So the results match except at one position, namely position 7.

My result corresponds to

(i)
i = 7
j = 6
((zscore(time_series[i+0:i+4]) - zscore(time_series[j+0:j+4])) ** 2).sum() ** 0.5

2.058319014053869

Stumpy's result corresponds to
(ii)
i = 7
j = 4
((zscore(time_series[i+0:i+4]) - zscore(time_series[j+0:j+4])) ** 2).sum() ** 0.5

2.8394325732553067

It appears to me that (i) is correct.

Hence I am confused. Looking forward to your answer.

seanlaw · 2020-03-27T02:37:11Z

Maintainer

@hfwittmann Thank you for your question and your kind words. Unfortunately, the tutorial oversimplifies the problem, so I strongly recommend that you go over the original matrix profile papers to get a better sense of everything that needs to be accounted for. They are well worth reading.

Having said that, a naive implementation of what is happening can be found in our naive_mass unit test utility which is called in the stumpy.stump unit test. But essentially:

    m = 3
    zone = int(np.ceil(m / 4))
    left = np.array(
        [
            utils.naive_mass(Q, T_B, m, i, zone, True)
            for i, Q in enumerate(core.rolling_window(T_B, m))
        ],
        dtype=object,
    )

seanlaw · 2020-03-27T02:45:09Z

Maintainer

In short, the one piece that you are missing (and which is discussed in the paper) is the idea of an "exclusion zone". That is, not only is a subsequence, i, not allowed to match itself (i.e., j != i), it is also not allowed to match any neighbors that are within [i - m/2, i + m/2] (this is the so-called exclusion zone). Note that this is enforced for self-joins but not for AB-joins (i.e., comparisons between two different time series).

Thus, in your example m = 4, and so the closest match for i = 7 cannot have 5 < j < 9 (I can't remember if it's inclusive or exclusive of the endpoints). So, while j = 6 might be closest, it is excluded from consideration since it is within the exclusion zone.

In essence, for i = 7 you should replace the line if j == i: continue with a check that only computes the distance when j > 9 or j < 5 (again, I don't recall if it is inclusive, so it may be j >= 9 or j <= 5), and that should be it. Also, you'll really start to notice the magnitude of this problem when the length of your array is around 10,000+.
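
As a minimal sketch of the exclusion zone described above (not STUMPY's actual implementation), the brute-force loop from the question can skip every j with |i - j| <= zone. The names `znorm`, `naive_self_join` and the `zone` parameter are made up for illustration. The default `zone = ceil(m / 4)` is borrowed from the unit-test snippet quoted earlier in this thread and is only an assumption about what the library uses.

```python
import numpy as np


def znorm(x):
    # Z-normalize with the population standard deviation (ddof=0),
    # which matches the default of scipy.stats.zscore.
    return (x - x.mean()) / x.std()


def naive_self_join(ts, m, zone=None):
    # Hypothetical helper for illustration only, not part of STUMPY.
    if zone is None:
        zone = int(np.ceil(m / 4))  # assumption, taken from the test snippet
    n = len(ts) - m + 1
    dist = np.full((n, n), np.nan)
    for i in range(n):
        for j in range(n):
            if abs(i - j) <= zone:
                continue  # trivial match: j is inside the exclusion zone of i
            dist[i, j] = np.linalg.norm(znorm(ts[i:i + m]) - znorm(ts[j:j + m]))
    # Nearest-neighbour distance and index for every subsequence
    return np.nanmin(dist, axis=1), np.nanargmin(dist, axis=1)


time_series = np.array([0, 1, 3, 2, 9, 1, 14, 15, 1, 2, 2, 10, 7], dtype=float)
profile, indices = naive_self_join(time_series, m=4)
print(indices[7], round(profile[7], 3))  # 4 2.839
```

Passing `zone=0` falls back to the original `j == i` check, which brings back the match between positions 7 and 6.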

I hope that helps!

hfwittmann · 2020-03-27T08:12:59Z

Author

@seanlaw Excellent, thank you for your explanation, really helpful! From what you are saying, it appears to me that the exclusion zone should be symmetric. However,

(i) the algorithm at position 6 finds the closest match at position 7, but
(ii) the algorithm at position 7 does not find 6 (although it would still be the closest match, except for the exclusion zone).

So I am still confused. Should the exclusion zone not come into play for (i), too?
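
One way to make the asymmetry concrete is to test every row of the posted result against the exclusion zone. The sketch below hard-codes the nearest-neighbour column (the second column) of the `stumpy.stump` output at the top of this thread. It assumes a zone of 1, i.e. `ceil(4 / 4)` following the unit-test snippet quoted above, and the helper name `zone_violations` is made up for illustration.

```python
import numpy as np


def zone_violations(nn_index, zone):
    # Hypothetical helper: rows whose nearest neighbour lies inside
    # their own exclusion zone, i.e. |i - j| <= zone.
    nn_index = np.asarray(nn_index)
    rows = np.arange(len(nn_index))
    return rows[np.abs(rows - nn_index) <= zone]


# Second column (nearest-neighbour index) of the stumpy.stump output posted above.
posted_index = [9, 8, 9, 1, 9, 2, 7, 4, 1, 0]
print(zone_violations(posted_index, zone=1))  # [6]
```

Only row 6 is flagged: it was allowed to match its right-hand neighbour 7, while row 7 was prevented from matching 6.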

I will definitely follow your suggestion and have a look at the paper!

seanlaw · 2020-03-27T12:19:04Z

Maintainer

@hfwittmann You are correct. The exclusion zone should apply everywhere for self-joins. The issue that you are seeing seems to be fixed in the development version of the code (if you clone the repo, you should see the results change), and the output should look like:

import numpy as np
import stumpy

x = np.array([0, 1, 3, 2, 9, 1, 14, 15, 1, 2, 2, 10, 7], dtype=np.float64)
stumpy.stump(x, 4)

# array([[0.6424863376402249, 9, -1, 9],
#        [0.28570485146990177, 8, -1, 8],
#        [1.6401694431976326, 9, 0, 9],
#        [0.898130637894946, 1, 1, 8],
#        [1.2795471494078055, 9, 0, 9],
#        [1.781964662297751, 2, 2, 9],
#        [2.987226131718227, 3, 3, 8],
#        [2.8394325732553067, 4, 4, 9],
#        [0.28570485146990177, 1, 1, -1],
#        [0.6424863376402249, 0, 0, -1]], dtype=object)

@mexxexx Do you have any ideas as to what we may have changed (in dev) since our last release that would have fixed this issue? The 1.3.0 release on PyPI seems to give the following (wrong) result:

stumpy.stump(x, 4)

# array([[0.6424863376402249, 9, -1, 9],
#        [0.28570485146990177, 8, -1, 8],
#        [1.6401694431976326, 9, 0, 9],
#        [0.898130637894946, 1, 1, 8],
#        [1.2795471494078055, 9, 0, 9],
#        [1.781964662297751, 2, 2, 9],
#        [2.0583190140538696, 7, 3, 7],
#        [2.8394325732553067, 4, 4, 8],
#        [0.28570485146990177, 1, 1, 9],
#        [0.6424863376402249, 0, 0, -1]], dtype=object)

Since our unit tests were passing previously, this also implies that our unit tests were changed as well.
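
To see exactly which entries differ between the two outputs above, the arrays can be compared element by element. This is only a sketch: both arrays are copied by hand from this thread, and the column labels are an assumption based on the STUMPY documentation (distance, nearest-neighbour index, left index, right index).

```python
import numpy as np

# Development-version output, copied from this thread.
dev = np.array([
    [0.6424863376402249, 9, -1, 9],
    [0.28570485146990177, 8, -1, 8],
    [1.6401694431976326, 9, 0, 9],
    [0.898130637894946, 1, 1, 8],
    [1.2795471494078055, 9, 0, 9],
    [1.781964662297751, 2, 2, 9],
    [2.987226131718227, 3, 3, 8],
    [2.8394325732553067, 4, 4, 9],
    [0.28570485146990177, 1, 1, -1],
    [0.6424863376402249, 0, 0, -1],
])

# 1.3.0 release output: identical except for rows 6 to 8.
released = dev.copy()
released[6] = [2.0583190140538696, 7, 3, 7]
released[7] = [2.8394325732553067, 4, 4, 8]
released[8] = [0.28570485146990177, 1, 1, 9]

columns = ["distance", "index", "left index", "right index"]  # assumed labels
changed = ~np.isclose(dev, released)
for row, col in zip(*np.nonzero(changed)):
    print(f"row {row}: {columns[col]} {released[row, col]:g} -> {dev[row, col]:g}")
```

The left-index column is identical in both outputs. Every change involves a right-hand neighbour at i + 1, which is consistent with an exclusion zone that was only being applied on one side.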

mihailescum · 2020-03-27T13:12:29Z

Hi @seanlaw, I remember that I also came across the issue of the asymmetric exclusion zone. It appears to me that it was issue #131.
For this fix I also changed the behaviour of utils.naive_mass which explains why the unit tests were passing.

seanlaw · 2020-03-27T13:17:11Z

Maintainer

Ahhh, yes! Thank you, @mexxexx. If it's okay with you (i.e., is there anything else major that still needs to be done?), I plan to push a minor release today.

mihailescum · 2020-03-27T13:30:36Z

Amazing! No, from my side there is nothing that has to be fixed before.

hfwittmann · 2020-03-27T18:02:02Z

Author

@seanlaw @mexxexx Excellent stuff! I have pulled the update and it's working as described! Thank you!

seanlaw · 2020-03-28T01:22:09Z

Maintainer

@hfwittmann Closing this for now but, FYI, v1.3.1 (with all of the new additions) is now live and can be installed with conda or pip.

@mexxexx Thanks again for all of your hard work and contributions!
