DEV Community

wrighter
wrighter

Posted on • Originally published at wrighters.io on

Indexing time series data in pandas

Quite often the data that we want to analyze has a time based component. Think about data like daily temperatures or rainfall, stock prices, sales data, student attendance, or events like clicks or views of a web application. There is no shortage of sources of data, and new sources are being added all the time. As a result, most pandas users will need to be familiar with time series data at some point.

A time series is just a pandas DataFrame or Series that has a time based index. The values in the time series can be anything else that can be contained in the containers, they are just accessed using date or time values. A time series container can be manipulated in many ways in pandas, but for this article I will focus just on the basics of indexing. Knowing how indexing works first is important for data exploration and use of more advanced features.

DatetimeIndex

In pandas, a DatetimeIndex is used to provide indexing for pandas Series and DataFrames and works just like other Index types, but provides special functionality for time series operations. We’ll cover the common functionality with other Index types first, then talk about the basics of partial string indexing.

One word of warning before we get started. It’s important for your index to be sorted, or you may get some strange results.

Examples

To show how this functionality works, let’s create some sample time series data with different time resolutions.

import pandas as pd import numpy as np import datetime # this is an easy way to create a DatetimeIndex # both dates are inclusive d_range = pd.date_range("2021-01-01", "2021-01-20") # this creates another DatetimeIndex, 10000 minutes long m_range = pd.date_range("2021-01-01", periods=10000, freq="T") # daily data in a Series daily = pd.Series(np.random.rand(len(d_range)), index=d_range) # minute data in a DataFrame minute = pd.DataFrame(np.random.rand(len(m_range), 1), columns=["value"], index=m_range) # time boundaries not on the minute boundary, add some random jitter mr_range = m_range + pd.Series([pd.Timedelta(microseconds=1_000_000.0 * s) for s in np.random.rand(len(m_range))]) # minute data in a DataFrame, but at a higher resolution minute2 = pd.DataFrame(np.random.rand(len(mr_range), 1), columns=["value"], index=mr_range) daily.head() 
Enter fullscreen mode Exit fullscreen mode
2021-01-01 0.293300 2021-01-02 0.921466 2021-01-03 0.040813 2021-01-04 0.107230 2021-01-05 0.201100 Freq: D, dtype: float64 
Enter fullscreen mode Exit fullscreen mode
minute.head() 
Enter fullscreen mode Exit fullscreen mode
 value 2021-01-01 00:00:00 0.124186 2021-01-01 00:01:00 0.542545 2021-01-01 00:02:00 0.557347 2021-01-01 00:03:00 0.834881 2021-01-01 00:04:00 0.732195 
Enter fullscreen mode Exit fullscreen mode
minute2.head() 
Enter fullscreen mode Exit fullscreen mode
 value 2021-01-01 00:00:00.641049 0.527961 2021-01-01 00:01:00.088244 0.142192 2021-01-01 00:02:00.976195 0.269042 2021-01-01 00:03:00.922019 0.509333 2021-01-01 00:04:00.452614 0.646703 
Enter fullscreen mode Exit fullscreen mode

Resolution

A DatetimeIndex has a resolution that indicates to what level the Index is indexing the data. The three indices created above have distinct resolutions. This will have ramifications in how we index later on.

print("daily:", daily.index.resolution) print("minute:", minute.index.resolution) print("randomized minute:", minute2.index.resolution) 
Enter fullscreen mode Exit fullscreen mode
daily: day minute: minute randomized minute: microsecond 
Enter fullscreen mode Exit fullscreen mode

Typical indexing

Before we get into some of the “special” ways to index a pandas Series or DataFrame with a DatetimeIndex, let’s just look at some of the typical indexing functionality.

Basics

I’ve covered the basics of indexing before, so I won’t cover too many details here. However it’s important to realize that a DatetimeIndex works just like other indices in pandas, but has extra functionality. (The extra functionality can be more useful and convenient, but just hold tight, those details are next). If you already understand basic indexing, you may want to skim until you get to partial string indexing. If you haven’t read my articles on indexing, you should start with the basics and go from there.

Indexing a DatetimeIndex using a datetime-like object will use exact indexing.

getitem a.k.a the array indexing operator ([])

When using datetime-like objects for indexing, we need to match the resolution of the index.

This ends up looking fairly obvious for our daily time series.

daily[pd.Timestamp("2021-01-01")] 
Enter fullscreen mode Exit fullscreen mode
0.29330017699861666 
Enter fullscreen mode Exit fullscreen mode
try: minute[pd.Timestamp("2021-01-01 00:00:00")] except KeyError as ke: print(ke) 
Enter fullscreen mode Exit fullscreen mode
Timestamp('2021-01-01 00:00:00') 
Enter fullscreen mode Exit fullscreen mode

This KeyError is raised because in a DataFrame, using a single argument to the [] operator will look for a column, not a row. We have a single column called value in our DataFrame, so the code above is looking for a column. Since there isn’t a column by that name, there is a KeyError. We will use other methods for indexing rows in a DataFrame.

.iloc indexing

Since the iloc indexer is integer offset based, it’s pretty clear how it works, not much else to say here. It works the same for all resolutions.

daily.iloc[0] 
Enter fullscreen mode Exit fullscreen mode
0.29330017699861666 
Enter fullscreen mode Exit fullscreen mode
minute.iloc[-1] 
Enter fullscreen mode Exit fullscreen mode
value 0.999354 Name: 2021-01-07 22:39:00, dtype: float64 
Enter fullscreen mode Exit fullscreen mode
minute2.iloc[4] 
Enter fullscreen mode Exit fullscreen mode
value 0.646703 Name: 2021-01-01 00:04:00.452614, dtype: float64 
Enter fullscreen mode Exit fullscreen mode

.loc indexing

When using datetime-like objects, you need to have exact matches for single indexing. It’s important to realize that when you make datetime or pd.Timestamp objects, all the fields you don’t specify explicitly will default to 0.

jan1 = datetime.datetime(2021, 1, 1) daily.loc[jan1] 
Enter fullscreen mode Exit fullscreen mode
0.29330017699861666 
Enter fullscreen mode Exit fullscreen mode
minute.loc[jan1] # the defaults for hour, minute, second make this work 
Enter fullscreen mode Exit fullscreen mode
value 0.124186 Name: 2021-01-01 00:00:00, dtype: float64 
Enter fullscreen mode Exit fullscreen mode
try: # we don't have that exact time, due to the jitter  minute2.loc[jan1] except KeyError as ke: print("Missing in index: ", ke) # but we do have a value on that day # we could construct it manually to the microsecond if needed jan1_ms = datetime.datetime(2021, 1, 1, 0, 0, 0, microsecond=minute2.index[0].microsecond) minute2.loc[jan1_ms] 
Enter fullscreen mode Exit fullscreen mode
Missing in index: datetime.datetime(2021, 1, 1, 0, 0) value 0.527961 Name: 2021-01-01 00:00:00.641049, dtype: float64 
Enter fullscreen mode Exit fullscreen mode

Slicing

Slicing with integers works as expected, you can read more about regular slicing here. But here’s a few examples of “regular” slicing, which works with the array indexing operator ([]) or the .iloc indexer.

daily[0:2] # first two, end is not inclusive 
Enter fullscreen mode Exit fullscreen mode
2021-01-01 0.293300 2021-01-02 0.921466 Freq: D, dtype: float64 
Enter fullscreen mode Exit fullscreen mode
minute[0:2] # same 
Enter fullscreen mode Exit fullscreen mode
 value 2021-01-01 00:00:00 0.124186 2021-01-01 00:01:00 0.542545 
Enter fullscreen mode Exit fullscreen mode
minute2[1:5:2] # every other 
Enter fullscreen mode Exit fullscreen mode
 value 2021-01-01 00:01:00.088244 0.142192 2021-01-01 00:03:00.922019 0.509333 
Enter fullscreen mode Exit fullscreen mode
minute2.iloc[1:5:2] # works with the iloc indexer as well 
Enter fullscreen mode Exit fullscreen mode
 value 2021-01-01 00:01:00.088244 0.142192 2021-01-01 00:03:00.922019 0.509333 
Enter fullscreen mode Exit fullscreen mode

Slicing with datetime-like objects also works. Note that the end item is inclusive, and the defaults for hours, minutes, seconds, and microseconds will set the cutoff for the randomized data on minute boundaries (in our case).

daily[datetime.date(2021,1,1):datetime.date(2021, 1,3)] # end is inclusive 
Enter fullscreen mode Exit fullscreen mode
2021-01-01 0.293300 2021-01-02 0.921466 2021-01-03 0.040813 Freq: D, dtype: float64 
Enter fullscreen mode Exit fullscreen mode
minute[datetime.datetime(2021, 1, 1): datetime.datetime(2021, 1, 1, 0, 2, 0)] 
Enter fullscreen mode Exit fullscreen mode
 value 2021-01-01 00:00:00 0.124186 2021-01-01 00:01:00 0.542545 2021-01-01 00:02:00 0.557347 
Enter fullscreen mode Exit fullscreen mode
minute2[datetime.datetime(2021, 1, 1): datetime.datetime(2021, 1, 1, 0, 2, 0)] 
Enter fullscreen mode Exit fullscreen mode
 value 2021-01-01 00:00:00.641049 0.527961 2021-01-01 00:01:00.088244 0.142192 
Enter fullscreen mode Exit fullscreen mode

This sort of slicing work with [] and .loc, but not .iloc, as expected. Remember, .iloc is for integer offset indexing.

minute2.loc[datetime.datetime(2021, 1, 1): datetime.datetime(2021, 1, 1, 0, 2, 0)] 
Enter fullscreen mode Exit fullscreen mode
 value 2021-01-01 00:00:00.641049 0.527961 2021-01-01 00:01:00.088244 0.142192 
Enter fullscreen mode Exit fullscreen mode
try: # no! use integers with iloc  minute2.iloc[datetime.datetime(2021, 1, 1): datetime.datetime(2021, 1, 1, 0, 2, 0)] except TypeError as te: print(te) 
Enter fullscreen mode Exit fullscreen mode
cannot do positional indexing on DatetimeIndex with these indexers [2021-01-01 00:00:00] of type datetime 
Enter fullscreen mode Exit fullscreen mode

Special indexing with strings

Now things get really interesting and helpful. When working with time series data, partial string indexing can be very helpful and way less cumbersome than working with datetime objects. I know we started with objects, but now you see that for interactive use and exploration, strings are very helpful. You can pass in a string that can be parsed as a full date, and it will work for indexing.

daily["2021-01-04"] 
Enter fullscreen mode Exit fullscreen mode
0.10723013753233923 
Enter fullscreen mode Exit fullscreen mode
minute.loc["2021-01-01 00:03:00"] 
Enter fullscreen mode Exit fullscreen mode
value 0.834881 Name: 2021-01-01 00:03:00, dtype: float64 
Enter fullscreen mode Exit fullscreen mode

Strings also work for slicing.

minute.loc["2021-01-01 00:03:00":"2021-01-01 00:05:00"] # end is inclusive 
Enter fullscreen mode Exit fullscreen mode
 value 2021-01-01 00:03:00 0.834881 2021-01-01 00:04:00 0.732195 2021-01-01 00:05:00 0.291089 
Enter fullscreen mode Exit fullscreen mode

Partial String Indexing

Partial strings can also be used, so you only need to specify part of the data. This can be useful for pulling out a single year, month, or day from a longer dataset.

daily["2021"] # all items match (since they were all in 2021) daily["2021-01"] # this one as well (and only in January for our data) 
Enter fullscreen mode Exit fullscreen mode
2021-01-01 0.293300 2021-01-02 0.921466 2021-01-03 0.040813 2021-01-04 0.107230 2021-01-05 0.201100 2021-01-06 0.534822 2021-01-07 0.070303 2021-01-08 0.413683 2021-01-09 0.316605 2021-01-10 0.438853 2021-01-11 0.258554 2021-01-12 0.473523 2021-01-13 0.497695 2021-01-14 0.250582 2021-01-15 0.861521 2021-01-16 0.589558 2021-01-17 0.574399 2021-01-18 0.951196 2021-01-19 0.967695 2021-01-20 0.082931 Freq: D, dtype: float64 
Enter fullscreen mode Exit fullscreen mode

You can do this on a DataFrame as well.

minute["2021-01-01"] 
Enter fullscreen mode Exit fullscreen mode
<ipython-input-67-96027d36d9fe>:1: FutureWarning: Indexing a DataFrame with a datetimelike index using a single string to slice the rows, like `frame[string]`, is deprecated and will be removed in a future version. Use `frame.loc[string]` instead. minute["2021-01-01"] value 2021-01-01 00:00:00 0.124186 2021-01-01 00:01:00 0.542545 2021-01-01 00:02:00 0.557347 2021-01-01 00:03:00 0.834881 2021-01-01 00:04:00 0.732195 ... ... 2021-01-01 23:55:00 0.687931 2021-01-01 23:56:00 0.001978 2021-01-01 23:57:00 0.770587 2021-01-01 23:58:00 0.154300 2021-01-01 23:59:00 0.777973 [1440 rows x 1 columns] 
Enter fullscreen mode Exit fullscreen mode

See that deprecation warning? You should no longer use [] for DataFrame string indexing (as we saw above, [] should be used for column access, not rows). Depending on whether the value is found in the index or not, you may get an error or a warning. Use .loc instead so you can avoid the confusion.

minute2.loc["2021-01-01"] 
Enter fullscreen mode Exit fullscreen mode
 value 2021-01-01 00:00:00.641049 0.527961 2021-01-01 00:01:00.088244 0.142192 2021-01-01 00:02:00.976195 0.269042 2021-01-01 00:03:00.922019 0.509333 2021-01-01 00:04:00.452614 0.646703 ... ... 2021-01-01 23:55:00.642728 0.749619 2021-01-01 23:56:00.238864 0.053027 2021-01-01 23:57:00.168598 0.598910 2021-01-01 23:58:00.103543 0.107069 2021-01-01 23:59:00.687053 0.941584 [1440 rows x 1 columns] 
Enter fullscreen mode Exit fullscreen mode

If using string slicing, the end point includes all times in the day.

minute2.loc["2021-01-01":"2021-01-02"] 
Enter fullscreen mode Exit fullscreen mode
 value 2021-01-01 00:00:00.641049 0.527961 2021-01-01 00:01:00.088244 0.142192 2021-01-01 00:02:00.976195 0.269042 2021-01-01 00:03:00.922019 0.509333 2021-01-01 00:04:00.452614 0.646703 ... ... 2021-01-02 23:55:00.604411 0.987777 2021-01-02 23:56:00.134674 0.159338 2021-01-02 23:57:00.508329 0.973378 2021-01-02 23:58:00.573397 0.223098 2021-01-02 23:59:00.751779 0.685637 [2880 rows x 1 columns] 
Enter fullscreen mode Exit fullscreen mode

But if we include times, it will include partial periods, cutting off the end right up to the microsecond if it is specified.

minute2.loc["2021-01-01":"2021-01-02 13:32:01"] 
Enter fullscreen mode Exit fullscreen mode
 value 2021-01-01 00:00:00.641049 0.527961 2021-01-01 00:01:00.088244 0.142192 2021-01-01 00:02:00.976195 0.269042 2021-01-01 00:03:00.922019 0.509333 2021-01-01 00:04:00.452614 0.646703 ... ... 2021-01-02 13:28:00.925951 0.969213 2021-01-02 13:29:00.037827 0.758476 2021-01-02 13:30:00.309543 0.473163 2021-01-02 13:31:00.363813 0.846199 2021-01-02 13:32:00.867343 0.007899 [2253 rows x 1 columns] 
Enter fullscreen mode Exit fullscreen mode

Slicing vs. exact matching

Our three datasets have different resolutions in their index: day, minute, and microsecond respectively. If we pass in a string indexing parameter and the resolution of the string is less accurate than the index, it will be treated as a slice. If it’s the same or more accurate, it’s treated as an exact match. Let’s use our microsecond (minute2) and minute (minute) resolution data examples. Note that every time you get a slice of the DataFrame, the value returned is a DataFrame. When it’s an exact match, it’s a Series.

minute2.loc["2021-01-01"] # slice - the entire day minute2.loc["2021-01-01 00"] # slice - the first hour of the day minute2.loc["2021-01-01 00:00"] # slice - the first minute of the day minute2.loc["2021-01-01 00:00:00"] # slice - the first minute and second of the day 
Enter fullscreen mode Exit fullscreen mode
 value 2021-01-01 00:00:00.641049 0.527961 
Enter fullscreen mode Exit fullscreen mode
print(str(minute2.index[0])) # note the string representation include the full microseconds minute2.loc[str(minute2.index[0])] # slice - this seems incorrect to me, should return Series not DataFrame minute2.loc[minute2.index[0]] # exact match 
Enter fullscreen mode Exit fullscreen mode
2021-01-01 00:00:00.641049 value 0.527961 Name: 2021-01-01 00:00:00.641049, dtype: float64 minute.loc["2021-01-01"] # slice - the entire day minute.loc["2021-01-01 00"] # slice - the first hour of the day minute.loc["2021-01-01 00:00"] # exact match value 0.124186 Name: 2021-01-01 00:00:00, dtype: float64 
Enter fullscreen mode Exit fullscreen mode

Note that for a microsecond resolution string match, I don’t see an exact match (where the return would be a Series), but instead a slice match (because the return value is a DataFrame). On the minute resolution DataFrame it worked as I expected.

asof

One way to deal with this sort of issue is to use asof. Often, when you have data that is either randomized in time or may have missing values, getting the most recent value as of a certain time is preffered. You could do this yourself, but it looks little cleaner to use asof.

minute2.loc[:"2021-01-01 00:00:03"].iloc[-1] # vs minute2.asof("2021-01-01 00:00:03") 
Enter fullscreen mode Exit fullscreen mode
value 0.527961 Name: 2021-01-01 00:00:03, dtype: float64 
Enter fullscreen mode Exit fullscreen mode

truncate

You can also use truncate which is sort of like slicing. You specify a value of before or after (or both) to indicate cutoffs for data. Unlike slicing which includes all values that partially match the date, truncate assumes 0 for any unspecified values of the date.

minute2.truncate(after="2021-01-01 00:00:03") 
Enter fullscreen mode Exit fullscreen mode
 value 2021-01-01 00:00:00.641049 0.527961 
Enter fullscreen mode Exit fullscreen mode

Summary

You can now see that time series data can be indexed a bit differently than other types of Index in pandas. Understanding time series slicing will allow you to quickly navigate time series data and quickly move on to more advanced time series analysis.

The post Indexing time series data in pandas appeared first on wrighters.io.

Top comments (0)