Commit ae0a2c3

Added a new tutorial describing the matrix profile
1 parent 36dbcd6 commit ae0a2c3

13 files changed (+307, -2 lines)

docs/Tutorial_0.ipynb

Lines changed: 305 additions & 0 deletions
@@ -0,0 +1,305 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tutorial 0 - The Matrix Profile\n",
"\n",
"At its core, the STUMPY library efficiently computes something called a <i><b>matrix profile</b>, a vector that stores the z-normalized Euclidean distance between any subsequence within a time series and its nearest neighbor</i>.\n",
"\n",
"To fully understand what this means, let's take a step back and start with a simple illustrative example along with a few basic definitions:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Time Series with Length n = 13"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"time_series = [0, 1, 3, 2, 9, 1, 14, 15, 1, 2, 2, 10, 7]\n",
"n = len(time_series)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To analyze this time series of length `n = 13`, we could visualize the data or calculate global summary statistics (e.g., mean, median, mode, min, max). For a much longer time series, you might even feel compelled to build an ARIMA model, perform anomaly detection, or attempt forecasting, but these methods can be complicated, often produce false positives, and may offer no interpretable insights.\n",
"\n",
"<img src=\"images/time_series_viz.jpeg\" width=800>\n",
"\n",
"However, if we were to apply <i>Occam's Razor</i>, then what is the most <b>simple and intuitive</b> approach that we could take to analyze this time series?\n",
"\n",
"To answer this question, let's start with our first definition:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Subsequence /ˈsəbsəkwəns/ noun\n",
"\n",
"### <i>a part or section of the full time series</i>\n",
"\n",
"So, the following are all considered subsequences of our `time_series` since they can all be found in the time series above.\n",
"\n",
"<div style='margin-left: auto;margin-right: auto; text-align: center;'><img src=\"images/subsequence_a.jpeg\" width=400><img src=\"images/subsequence_b.jpeg\" width=400><img src=\"images/subsequence_c.jpeg\" width=400></div>"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[0, 1]\n",
"[9, 1, 14]\n",
"[3, 2, 9, 1, 14, 15, 1, 2]\n"
]
}
],
"source": [
"print(time_series[0:2])\n",
"print(time_series[4:7])\n",
"print(time_series[2:10])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that each subsequence can have a different length, which we'll call `m`. So, for example, if we choose `m = 4`, then we can think about how we might compare any two subsequences of the same length."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[0, 1, 3, 2] [1, 2, 2, 10]\n"
]
}
],
"source": [
"m = 4\n",
"i = 0 # starting index for the first subsequence\n",
"j = 8 # starting index for the second subsequence\n",
"\n",
"subseq_1 = time_series[i:i+m]\n",
"subseq_2 = time_series[j:j+m]\n",
"\n",
"print(subseq_1, subseq_2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div style='margin-left: auto;margin-right: auto; text-align: center;'><img src=\"images/subsequences.jpeg\" width=800></div>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to compare any two subsequences is to calculate what is called the Euclidean distance.\n",
"\n",
"## Euclidean Distance /yo͞oˈklidēən/ /ˈdistəns/ noun\n",
"\n",
"### <i>the straight-line distance between two points</i>\n",
"\n",
"<div style='margin-left: auto;margin-right: auto; text-align: center;'><img src=\"images/euclidean_distance.jpeg\" width=800></div>"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The square root of 67 = 8.18535277187245\n"
]
}
],
"source": [
"import math\n",
"\n",
"D = 0\n",
"for k in range(m):\n",
"    D += (time_series[i+k] - time_series[j+k])**2\n",
"print(f\"The square root of {D} = {math.sqrt(D)}\")"
]
},
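{
"cell_type": "markdown",
"metadata": {},
"source": [
"For convenience, we can wrap this calculation in a small helper function (purely illustrative, not part of STUMPY) that computes the Euclidean distance between the subsequences starting at any two indices `i` and `j`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def subseq_distance(ts, i, j, m):\n",
"    # Euclidean distance between ts[i:i+m] and ts[j:j+m]\n",
"    return math.sqrt(sum((ts[i+k] - ts[j+k])**2 for k in range(m)))\n",
"\n",
"subseq_distance(time_series, 0, 8, 4)  # same pair of subsequences as above"
]
},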
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Distance Profile - Pairwise Euclidean Distances\n",
"\n",
"Now, we can take this a step further: we keep one subsequence the same (the reference subsequence), change the second subsequence in a sliding window manner, and compute the Euclidean distance for each window. The resulting vector of pairwise Euclidean distances is also known as a <i><b>distance profile</b></i>.\n",
"\n",
"<div style='margin-left: auto;margin-right: auto; text-align: center;'><img src=\"images/pairwise_euclidean_distance.gif\" width=800></div>\n",
"\n",
"Of course, not all of these distances are useful. Specifically, the distance for the self match (or trivial match) isn't informative since the distance will always be zero when you compare a subsequence with itself. So, we'll ignore it and, instead, take note of the next smallest distance from the distance profile and choose that as our best match:\n",
"\n",
"<div style='margin-left: auto;margin-right: auto; text-align: center;'><img src=\"images/trivial_match.jpeg\" width=800></div>\n",
"\n",
"Next, we can shift our reference subsequence over one element at a time and repeat the same sliding window process to compute the distance profile for each new reference subsequence.\n",
"\n",
"<div style='margin-left: auto;margin-right: auto; text-align: center;'><img src=\"images/distance_matrix.gif\" width=800></div>\n",
"\n",
"## Distance Matrix\n",
"\n",
"If we take all of the distance profiles that were computed for each reference subsequence and stack them one on top of another, then we get something called a <i><b>distance matrix</b></i>.\n",
"\n",
"<div style='margin-left: auto;margin-right: auto; text-align: center;'><img src=\"images/distance_matrix.jpeg\" width=800></div>\n",
"\n",
"## The Real Problem - The Brute Force Approach\n",
"\n",
"This might all seem straightforward so far, but what we really need to consider is how to compute the full distance matrix efficiently. Let's start with the brute force approach:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"for i in range(n-m+1):\n",
"    for j in range(n-m+1):\n",
"        D = 0\n",
"        for k in range(m):\n",
"            D += (time_series[i+k] - time_series[j+k])**2\n",
"        D = math.sqrt(D)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"At first glance, this may not look too bad, but if we consider both the computational complexity and the spatial complexity, then we begin to understand the real problem. It turns out that, for longer time series (e.g., <i>n >> 10,000</i>), the computational complexity is <i>O(n<sup>2</sup>m)</i> (as evidenced by the three for loops in the code above) and the spatial complexity for storing the full distance matrix is <i>O(n<sup>2</sup>)</i>.\n",
"\n",
"To put this into perspective, imagine if you had a single sensor that collected data 20 times/min over the course of 5 years. This would result in:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"There would be n = 52560000 data points\n"
]
}
],
"source": [
"n = 20 * 60 * 24 * 365 * 5 # 20 times/min x 60 mins/hour x 24 hours/day x 365 days/year x 5 years\n",
"print(f\"There would be n = {n} data points\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Assuming that each calculation in the inner loop takes 0.0000001 seconds, then this would take:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"It would take 138127677.4 seconds to compute\n"
]
}
],
"source": [
"time = 0.0000001 * (n * n - n)/2\n",
"print(f\"It would take {time:.1f} seconds to compute\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<b>That is equivalent to 1,598.7 days (or 4.4 years) of compute time and 11.1 PB of memory!</b> So, it is clearly not feasible to compute the distance matrix using our naive brute force method. This is where the <i><b>matrix profile</b></i> comes into play. Recall,\n",
"\n",
"## Matrix Profile /ˈmātriks/ /ˈprōˌfīl/ noun\n",
"\n",
"### a vector that stores the (z-normalized) Euclidean distance between any subsequence within a time series and its nearest neighbor\n",
"\n",
"Practically, this means that the matrix profile only stores the smallest non-trivial distance from each distance profile, which significantly reduces the spatial complexity to O(n):\n",
"\n",
"<div style='margin-left: auto;margin-right: auto; text-align: center;'><img src=\"images/matrix_profile.gif\" width=800></div>\n",
"\n",
"But we still haven't figured out how to reduce the computational complexity, and this is precisely the problem that is addressed by [STUMPY](https://github.com/TDAmeritrade/stumpy)."
]
},
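{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before looking at how STUMPY solves this, here is a tiny brute-force sketch of the matrix profile idea itself on our toy data: for each reference subsequence, keep only the smallest non-trivial distance. This is illustrative only; it uses non-normalized distances and skips just the exact self match, whereas real implementations use z-normalized distances and a wider exclusion zone around the trivial match:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch only, not STUMPY's actual algorithm\n",
"ts = [0, 1, 3, 2, 9, 1, 14, 15, 1, 2, 2, 10, 7]\n",
"w = 4\n",
"matrix_profile = []\n",
"for i in range(len(ts) - w + 1):\n",
"    best = float('inf')\n",
"    for j in range(len(ts) - w + 1):\n",
"        if i == j:  # skip the trivial self match\n",
"            continue\n",
"        d = math.sqrt(sum((ts[i+k] - ts[j+k])**2 for k in range(w)))\n",
"        best = min(best, d)\n",
"    matrix_profile.append(best)"
]
},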
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## STUMPY\n",
"\n",
"In the fall of 2016, researchers from the [University of California, Riverside](https://www.cs.ucr.edu/~eamonn) and the [University of New Mexico](https://www.cs.unm.edu/~mueen/) published a beautiful set of [back-to-back papers](https://www.cs.ucr.edu/~eamonn/MatrixProfile.html) that described an <u>exact method</u> called <i><b>STOMP</b></i> for computing the matrix profile for any time series with a computational complexity of O(n<sup>2</sup>)! They also demonstrated this approach on GPUs, where it is called <i><b>GPU-STOMP</b></i>.\n",
"\n",
"With academics, data scientists, and developers in mind, we have taken some of these concepts and open sourced STUMPY, a powerful and scalable library that efficiently computes the matrix profile. And, thanks to other open source software such as [Numba](http://numba.pydata.org/) and [Dask](https://dask.org/), our implementation can be both highly parallelized (for a single server) and highly distributed (across multiple servers). We've tested STUMPY on as many as 256 CPU cores spread across 32 servers and achieved [performance](https://github.com/TDAmeritrade/stumpy#performance) similar to the published GPU-STOMP work."
]
},
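{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of what using the library looks like (assuming STUMPY and NumPy are installed), `stumpy.stump` takes a time series and a window size and returns an array whose first column is the matrix profile and whose second column holds the nearest neighbor indices:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import stumpy\n",
"\n",
"mp = stumpy.stump(np.asarray(time_series, dtype=np.float64), m=4)\n",
"# mp[:, 0] is the z-normalized matrix profile, mp[:, 1] the nearest neighbor indices"
]
},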
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Conclusion\n",
"\n",
"We are excited to share STUMPY with you. Please reach out and let us know how it has enabled your time series analysis work. We'd love to hear from you!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

docs/Tutorial_2.ipynb

Lines changed: 2 additions & 2 deletions
@@ -4,7 +4,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"# Analyzing Taxi Time Series Data with Time Series Chains\n",
+"# Tutorial 2 - Analyzing Taxi Time Series Data with Time Series Chains\n",
 "\n",
 "This example utilizes the main takeways from the research papers: [Matrix Profile VII](https://www.cs.ucr.edu/~eamonn/chains_ICDM.pdf).\n",
 "\n",
@@ -260,7 +260,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"# Resources\n",
+"## Resources\n",
 "\n",
 "[Matrix Profile VII](https://www.cs.ucr.edu/~eamonn/chains_ICDM.pdf)"
 ]

docs/images/distance_matrix.gif
4.23 MB

docs/images/distance_matrix.jpeg
283 KB
296 KB

docs/images/matrix_profile.gif
897 KB
2.57 MB

docs/images/subsequence_a.jpeg
267 KB

docs/images/subsequence_b.jpeg
270 KB

docs/images/subsequence_c.jpeg
280 KB
