Programming rolling window data analysis with Python and pandas

Learn to query and analyze time-series data using analytic functions and rolling window calculations.

By Yuli Vasiliev | October 2021



Time-series data, also referred to as time-stamped data, commonly represents a series of measurements or observations indexed in chronological order. Typically, it is collected through repeated measurements taken at regular intervals. This article covers some analysis techniques that you can apply to time-series data to extract meaningful statistics from it, using Python’s pandas data analysis library as well as the SQL language for comparison.

In fact, any dataset that can be indexed in time order may be considered a time series and used for time-series analysis.

For example, a collection of purchase orders can be viewed as a discrete-time sequence of documents, which you can analyze based on their time order. You might want to calculate a running total of orders by employee based on the processing date or job date. Such a request can be implemented using the technique known as rolling window calculations.

Another familiar example of time-series data is a stock’s price data, where the stock price is recorded at specific, regular intervals, producing a discrete time series. Using time-series analysis techniques, you can discover the most profitable stock within a specified period by computing an average value for each row based on the values in the nearby rows, which are referred to as a window in this context.

This article assumes that you know how to install and use Python.

Setting up the development environment for the sample programs

To follow along with the examples discussed in this article, you’ll need to create several example tables in Oracle Database and populate them with data. For that, you can execute the SQL statements found in the timeseries_article.sql script against the database; you can download the script here. Then, to get the data from the database into Python, you need to have the pandas, SQLAlchemy, and cx_Oracle libraries installed in your Python environment. If you don’t have them yet, you can install them with the pip (preferred installer program) command, as follows, from a terminal window:



$ pip install pandas
$ pip install SQLAlchemy
$ pip install cx_Oracle

To follow the stock price data analysis example, you also need to install yfinance, a Python wrapper for the Yahoo Finance API that provides historical and real-time data for stock quotes.



$ pip install yfinance

Analyzing purchase orders based on their dates

Consider a set of purchase orders, with some days having no orders, some days one order, and some days multiple orders. (A sample sequence of dummy purchase orders can be found in the orders table loaded by the timeseries_article.sql script.) The following Python code moves the example purchase orders data into a pandas DataFrame:



import pandas as pd
import cx_Oracle
import sqlalchemy
from sqlalchemy.exc import SQLAlchemyError

try:
  # replace usr, pswd, localhost, and orcl with your own connection details
  engine = sqlalchemy.create_engine("oracle+cx_oracle://usr:pswd@localhost/?service_name=orcl")
  # load the entire orders table into a DataFrame
  orders_sql = """SELECT * FROM orders"""
  df_orders = pd.read_sql(orders_sql, engine)
except SQLAlchemyError as e:
  print(e)
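
If you don’t have an Oracle database at hand, you can still follow along by building a small stand-in DataFrame. The following sketch mirrors the first day of the sample data shown in the output later in this article; the pono values are hypothetical, because the output doesn’t show them.


import pandas as pd

# a hypothetical stand-in for the first day of the orders table; pono values are made up
df_orders = pd.DataFrame({
  'pono':   [8721, 8722, 8723, 8724],
  'ordate': pd.to_datetime(['2021-08-14'] * 4),
  'empl':   ['Albert Forest', 'Tom Richards', 'Albert Forest', 'Albert Forest'],
  'total':  [404.34, 1032.89, 308.15, 298.49]
})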

Once the data is loaded into a DataFrame, you can use pandas methods to perform analysis on it. The snippet below processes the rows loaded into the df_orders DataFrame and calculates the cumulative total of orders within a day by employee.



df_orders['day_running_total_by_empl'] = (
  df_orders.sort_values(['ordate', 'empl', 'pono'])
           .groupby(['ordate', 'empl'])['total']
           .cumsum()
)

The code uses the sort_values() method to sort the rows by the ordate (order date), empl (employee name), and pono (purchase order number) columns, in that order. Then, it uses groupby() to group the rows by the values in the ordate and empl columns. Finally, it uses cumsum() to calculate a cumulative sum within each group. Note that although the right-hand side is computed on a sorted copy, the result keeps the original row index, so pandas aligns the new column correctly when assigning it to df_orders.

To view the result set, use the following code:



print(df_orders[['ordate', 'empl','total','day_running_total_by_empl']].sort_values(['ordate','empl']))

Here is the output; a line is skipped after each ordate group to improve readability.



       ordate           empl    total  day_running_total_by_empl
0  2021-08-14  Albert Forest   404.34                     404.34
2  2021-08-14  Albert Forest   308.15                     712.49
3  2021-08-14  Albert Forest   298.49                    1010.98
1  2021-08-14   Tom Richards  1032.89                    1032.89

4  2021-08-15  Albert Forest   327.16                     327.16
7  2021-08-15  Albert Forest   258.48                     585.64
5  2021-08-15   Tom Richards  1139.89                    1139.89
6  2021-08-15   Tom Richards   334.99                    1474.88

9  2021-08-16  Albert Forest   750.98                     750.98
11 2021-08-16  Albert Forest   118.14                     869.12
8  2021-08-16  Sydney Hurley    98.33                      98.33
12 2021-08-16  Sydney Hurley   401.89                     500.22
10 2021-08-16   Tom Richards   433.94                     433.94

13 2021-08-19     Ivan Levin   306.23                     306.23
15 2021-08-19     Ivan Levin   199.98                     506.21
14 2021-08-19  Sydney Hurley   503.54                     503.54

Compare the pandas result set to a SQL result set. To get the same result set in SQL, you can take advantage of the OVER clause in a SELECT statement. This clause lets you define the partitioning and ordering for the rowset and then specify a sliding window (range of rows around the row being evaluated) within which you apply an analytic function, thus computing an aggregated value for each row.

In this particular example, partition the rowset by the values in the ordate and empl columns and order by the same columns plus the pono column, applying the sum() function to compute a cumulative sum for each row in the rowset. Here’s the query.



SELECT rownum-1 as " ", ordate, empl, total order_total,
 SUM(total) OVER (PARTITION BY ordate, empl ORDER BY ordate, empl, pono ROWS UNBOUNDED PRECEDING) day_running_total_by_empl
FROM orders
ORDER BY ordate, empl;

You may notice rownum-1 as " " at the first position in the select list. You didn’t use anything like this in the pandas version of this query, so why is it here? The row number (index) is output automatically when you print a pandas DataFrame; by contrast, to see the row number in a SQL query, you must explicitly include the rownum pseudocolumn in the select list. You specify rownum-1 because pandas uses zero-based indexing, whereas rownum starts at 1.

The most important difference between the query built with pandas and the SQL query discussed here is that pandas does not use an explicit windowing clause. In the SQL query, the ROWS UNBOUNDED PRECEDING keywords compose the windowing clause; this defines a window that starts at the first row of a partition and ends at the current row, allowing for cumulative aggregations within a partition.

The groupby() and cumsum() method combination in the pandas query does the same thing, imitating the windowing technique.
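
If you want a pandas construct that reads more like the SQL windowing clause, you can combine groupby() with expanding(), the direct analogue of ROWS UNBOUNDED PRECEDING. Here’s a sketch that produces the same column as the cumsum() version above:


# an expanding window grows from the first row of each group to the current row,
# mirroring ROWS UNBOUNDED PRECEDING in SQL
df_orders['day_running_total_by_empl'] = (
  df_orders.sort_values(['ordate', 'empl', 'pono'])
           .groupby(['ordate', 'empl'])['total']
           .expanding()
           .sum()
           .reset_index(drop=True, level=[0, 1])
)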

Summing over a fixed number of rows

The above technique won’t work, however, if you want to define a fixed number of rows preceding (or following) the current row. Turning back to the example, suppose you need to accumulate a sum over only two nearby orders within a partition. With pandas, this can be implemented as follows:



df_orders['two_row_total_by_empl'] = (
  df_orders.sort_values(['ordate', 'empl', 'pono'])
           .groupby(['ordate', 'empl'])['total']
           .rolling(2, min_periods=1)
           .sum()
           .reset_index(drop=True, level=[0, 1])
)

This replaces the cumsum() method with the rolling() and sum() method combination, which explicitly defines the window to slide within a partition. The first parameter passed to rolling() defines the size of the window. This example passes the value 2, so that only two rows (the current one and the preceding one) make up the window. The min_periods=1 argument ensures that the first row of each group, which has no preceding row, still gets a value rather than NaN. To print the result set, use the following code:



print(df_orders[['ordate', 'empl','total','two_row_total_by_empl']].sort_values(['ordate','empl']))

Below are the first four rows of the output.



       ordate           empl    total      two_row_total_by_empl
0  2021-08-14  Albert Forest   404.34                     404.34
2  2021-08-14  Albert Forest   308.15                     712.49
3  2021-08-14  Albert Forest   298.49                     606.64
1  2021-08-14   Tom Richards  1032.89                    1032.89

As expected, only the sum in the third row differs from the corresponding result in the previous example. This is because the row in question is the third row within its partition, and the sum is now calculated within a two-row window.
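
To see rolling() in isolation, here is a minimal sketch that applies the same two-row window to just the three Albert Forest totals from that first group:


import pandas as pd

totals = pd.Series([404.34, 308.15, 298.49])
# a two-row window; min_periods=1 prevents a NaN in the first row
print(totals.rolling(2, min_periods=1).sum())
# 0    404.34
# 1    712.49
# 2    606.64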

Compare the pandas result set to a SQL result set. The corresponding SQL query would look as follows:



SELECT rownum-1 as " ", ordate, empl, total order_total,
 SUM(total) OVER (PARTITION BY ordate, empl ORDER BY ordate, empl, pono ROWS 1 PRECEDING) two_row_total_by_empl
FROM orders
ORDER BY ordate, empl;

This query defines a two-row window by specifying ROWS 1 PRECEDING in the windowing clause, which is shorthand for ROWS BETWEEN 1 PRECEDING AND CURRENT ROW; the window comprises the current row and its preceding row within the same partition.

Stock price analysis

You can base the size of a sliding analysis window on either a physical number of rows or a logical interval, such as time or date. If you’re wondering when you might want a logical interval window rather than a window with a physical number of rows, suppose you need to compute the average price of a stock over a three-calendar-day period. Because a stock exchange is not open every calendar day, there are no rows for holidays and weekends in historical stock data sequences. So, if you define a three-day window, it will cover a single row for any date falling on a Monday, two rows for a Tuesday, and three rows for a Wednesday, Thursday, or Friday; it also will not compensate for holidays.
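
Here is a minimal sketch that illustrates the difference, using a hypothetical price series that spans the August 7–8, 2021, weekend:


import pandas as pd

# Thursday, Friday, and Monday; the weekend has no rows
idx = pd.to_datetime(['2021-08-05', '2021-08-06', '2021-08-09'])
prices = pd.Series([10.0, 11.0, 12.0], index=idx)

# a physical three-row window always spans three observations
print(prices.rolling(3, min_periods=1).mean())   # 10.0, 10.5, 11.0

# a logical '3d' window spans three calendar days, so Monday's
# window covers Saturday through Monday: a single row
print(prices.rolling('3d').mean())               # 10.0, 10.5, 12.0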

To experiment with stock data, you first need to get some. The following script downloads stock data for three tech-company tickers for the last seven trading days and loads it into a pandas DataFrame:



import pandas as pd
import yfinance as yf

tickers = ['MSFT','ORCL','AMZN']
frames = []
for ticker in tickers:
  tkr = yf.Ticker(ticker)
  hist = tkr.history(period='7d')
  hist['Symbol'] = ticker
  frames.append(hist[['Symbol', 'Close']].rename(columns={'Close': 'Price'}))
# DataFrame.append() is deprecated (and removed in pandas 2.0), so use pd.concat() instead
df_stocks = pd.concat(frames)

To determine how these stocks have performed, evaluate the difference between a stock’s closing price on a trading day and an average of its closing prices on the previous trading days within a three-calendar-day interval. To start, you need to sort the rows in the dataset by the Symbol and Date columns.



df_stocks = df_stocks.sort_values(['Symbol','Date'])

For each row within a Symbol partition, calculate an average price of the previous trading days within a three-calendar-day interval, saving the results in a separate column.



df_stocks['three_day_avg'] = (
  df_stocks.groupby(['Symbol'])['Price']
           .rolling('3d', min_periods=1)
           .mean()
           .reset_index(drop=True, level=0)
)

To be able to compare the price in a row with the average calculated for the previous row, shift the values of the newly created three_day_avg column down by one row within each Symbol group. (Because the shift happens within each group, the first row of each symbol gets NaN.)



df_stocks['prev_avg'] = df_stocks.groupby(['Symbol'])['three_day_avg'].shift(1)

Now you can calculate how each day’s price changed relative to the average of the previous three days, expressed as a natural-log ratio. For small moves, this log return closely approximates the simple percentage change.



import numpy as np
df_stocks['priceRise_idx'] = np.log(df_stocks['Price']/df_stocks['prev_avg'])
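
As a quick sanity check against the output that follows: for AMZN on 2021-08-06, prev_avg is just the 2021-08-05 closing price (the only row in that window), so the first non-NaN AMZN value can be reproduced by hand:


import numpy as np

# ln(3344.939941 / 3375.989990) ≈ -0.009240, matching the AMZN row for 2021-08-06
print(np.log(3344.939941 / 3375.989990))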

The result set can be viewed as follows:



print(df_stocks[['Symbol','Price','priceRise_idx']])

The output should look like the following:



           Symbol        Price  priceRise_idx
Date                                       
2021-08-05   AMZN  3375.989990            NaN
2021-08-06   AMZN  3344.939941      -0.009240
2021-08-09   AMZN  3341.870117      -0.005549
2021-08-10   AMZN  3320.679932      -0.006361
2021-08-11   AMZN  3292.110107      -0.011826
2021-08-12   AMZN  3303.500000      -0.004446
2021-08-13   AMZN  3293.969971      -0.003473
2021-08-05   MSFT   289.519989            NaN
2021-08-06   MSFT   289.459991      -0.000207
2021-08-09   MSFT   288.329987      -0.004015
2021-08-10   MSFT   286.440002      -0.006577
2021-08-11   MSFT   286.950012      -0.001515
2021-08-12   MSFT   289.809998       0.008907
2021-08-13   MSFT   292.850006       0.017626
2021-08-05   ORCL    89.430000            NaN
2021-08-06   ORCL    89.519997       0.001006
2021-08-09   ORCL    89.900002       0.004739
2021-08-10   ORCL    89.639999      -0.002896
2021-08-11   ORCL    89.629997      -0.001561
2021-08-12   ORCL    89.809998       0.000965
2021-08-13   ORCL    90.379997       0.007627

A negative value in the priceRise_idx column indicates that the stock price fell compared to the average of the previous days; by contrast, a positive value indicates that the price rose. Looking through the results above, you may notice that over the investigated interval, the AMZN stock price declined steadily; the MSFT stock price went down and then began to rise; and only the ORCL stock price rose on most days.

Compare the pandas result set to a SQL result set. Before you can generate the same result set in SQL, populate the stocks table created with timeseries_article.sql. To do this, run the stocks_inserts.py script shown in Listing 1; you can download the script here.

Listing 1. Populating the stocks table



import pandas as pd
import yfinance as yf

tickers = ['MSFT','ORCL','AMZN']
frames = []
for ticker in tickers:
  tkr = yf.Ticker(ticker)
  hist = tkr.history(period='7d')
  hist['Symbol'] = ticker
  frames.append(hist[['Symbol', 'Close']].rename(columns={'Close': 'Price'}))
# DataFrame.append() is deprecated (and removed in pandas 2.0), so use pd.concat() instead
df_stocks = pd.concat(frames)

stocks_to_db = df_stocks[['Symbol', 'Price']].reset_index().rename(columns={'Date': 'Dt'}).round(2)
data = list(stocks_to_db.itertuples(index=False, name=None))

import cx_Oracle
try:
  conn = cx_Oracle.connect("usr", "pswd", "localhost/orcl")
  cursor = conn.cursor()
  # defining the query; cx_Oracle binds Python datetime values natively,
  # so the dates don't need a TO_DATE() conversion
  query_add_stocks = """INSERT INTO stocks (dt, symbol, price)
                      VALUES (:1, :2, :3)"""
  # inserting the stock rows
  cursor.executemany(query_add_stocks, data)
  conn.commit()
  cursor.close()
  conn.close()
except cx_Oracle.DatabaseError as exc:
  err, = exc.args
  print("Oracle-Error-Code:", err.code)
  print("Oracle-Error-Message:", err.message)

After that, issue the following SQL statement in your SQL command-line tool. It mirrors the pandas steps: the innermost subquery computes three_day_avg over a logical three-day window (RANGE BETWEEN 2 PRECEDING AND CURRENT ROW), the middle subquery uses the LAG() function as the counterpart of shift(1), and the outer query applies LN() just as np.log() did.



SELECT dt, symbol, price, LN(price/prev_avg) priceRise_idx FROM
 (SELECT dt, symbol, price, LAG(three_day_avg) OVER (PARTITION BY symbol ORDER BY symbol, dt) prev_avg FROM
  (SELECT dt, symbol, price,
   AVG(price) OVER (PARTITION BY symbol ORDER BY dt RANGE BETWEEN 2 PRECEDING AND CURRENT ROW) three_day_avg
   FROM stocks
   ORDER BY symbol, dt
  )
 );
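
If you’d like to pull the SQL result back into a DataFrame and compare it side by side with the pandas result set, you can run the same query (minus the trailing semicolon, which cx_Oracle doesn’t accept) through the SQLAlchemy engine from the first example. A minimal sketch:


import pandas as pd
import sqlalchemy

# reuse the same connection details as in the first example
engine = sqlalchemy.create_engine("oracle+cx_oracle://usr:pswd@localhost/?service_name=orcl")

compare_sql = """SELECT dt, symbol, price, LN(price/prev_avg) priceRise_idx FROM
 (SELECT dt, symbol, price, LAG(three_day_avg) OVER (PARTITION BY symbol ORDER BY symbol, dt) prev_avg FROM
  (SELECT dt, symbol, price,
   AVG(price) OVER (PARTITION BY symbol ORDER BY dt RANGE BETWEEN 2 PRECEDING AND CURRENT ROW) three_day_avg
   FROM stocks
   ORDER BY symbol, dt
  )
 )"""
df_check = pd.read_sql(compare_sql, engine)
print(df_check)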

Conclusion

Both SQL and Python’s pandas library can answer nearly any question you might ask of your data. In this article, you learned to use these technologies to query and analyze time-series data using analytic functions that compute an aggregate value for each row based on a group of nearby rows enclosed in a sliding window. You also learned to define both a window based on a physical number of rows and one based on a date interval.

Illustration: Wes Rowell

Yuli Vasiliev (@VasilievYuli) is a programmer, freelance writer, and consultant specializing in open source development, Oracle Database technologies, and natural-language processing (NLP). He is the author of Natural Language Processing with Python and spaCy.