By Yuli Vasiliev | October 2021
Time-series data, also referred to as time-stamped data, represents a series of measurements or observations indexed in chronological order. Typically, such data is collected through repeated measurements taken at regular intervals. This article covers some analysis techniques you can apply to time-series data to extract meaningful statistics from it, using Python’s pandas data analysis library as well as the SQL language for comparison.
Actually, any dataset that can be indexed in time order may be considered a time series and can be used for time-series analysis.
For example, a collection of purchase orders can be viewed as a sequence of discrete-time documents, which you can analyze based on their time order. You might want to calculate a running total of orders by employee based on the processing date or job date. This request can be implemented using the technique known as rolling window calculations.
Another familiar example of time-series data is a stock’s price data, where the price is recorded at specific, regular intervals, producing a discrete time series. Using time-series analysis techniques, you can identify the most profitable stock within a specified period by computing, for each row, an average of the values in the nearby rows; that group of nearby rows is referred to as a window in this context.
This article assumes that you know how to install and use Python.
To follow along with the examples discussed in this article, you’ll need to create several example tables in Oracle Database and populate them with data. For that, you can execute against the database the SQL statements found in the timeseries_article.sql script; you can download the script here. Then, to get the data from the database into Python, you need the pandas, SQLAlchemy, and cx_Oracle libraries installed in your Python environment. If you don’t have them yet, you can install them using the pip (preferred installer program) command, as follows, from a terminal window:
$ pip install pandas
$ pip install SQLAlchemy
$ pip install cx_Oracle
To follow the stock price data analysis example, you also need to install yfinance, a Python wrapper for the Yahoo Finance API that provides historical and real-time data for stock quotes.
$ pip install yfinance
Consider a set of purchase orders, with some days having no orders, some days one order, and some days multiple orders. (A sample sequence of dummy purchase orders can be found in the orders table loaded by the timeseries_article.sql script.) The following Python code moves the example purchase orders data into a pandas DataFrame:
import pandas as pd
import cx_Oracle
import sqlalchemy
from sqlalchemy.exc import SQLAlchemyError
try:
    engine = sqlalchemy.create_engine("oracle+cx_oracle://usr:pswd@localhost/?service_name=orcl")
    orders_sql = """SELECT * FROM orders"""
    df_orders = pd.read_sql(orders_sql, engine)
except SQLAlchemyError as e:
    print(e)
Once the data is loaded into a DataFrame, you can use pandas methods to analyze it. The snippet below processes the rows loaded into the df_orders DataFrame and calculates the cumulative total of orders within a day by employee.
df_orders['day_running_total_by_empl'] = df_orders.sort_values(['ordate','empl','pono']).groupby(['ordate', 'empl'])['total'].cumsum()
The code uses the sort_values() method to sort the values found in the ordate (order date), empl (employee name), and pono (purchase order number) columns, in that order. Then, it uses groupby() to group the rows by the values in the ordate and empl columns. Finally, it uses cumsum() to calculate a cumulative sum for each group.
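The same sort-group-accumulate pattern can be tried on a tiny, made-up DataFrame; the column names mirror the orders table, but the rows below are invented for illustration:

```python
import pandas as pd

# Invented sample rows; the columns mirror the article's orders table.
df = pd.DataFrame({
    'ordate': ['2021-08-14', '2021-08-14', '2021-08-14', '2021-08-15'],
    'empl':   ['Albert',     'Albert',     'Tom',        'Albert'],
    'pono':   [1, 2, 3, 4],
    'total':  [100.0, 50.0, 200.0, 75.0],
})

# Sort, then accumulate within each (ordate, empl) partition; cumsum()
# keeps the original index, so the result aligns back to the rows.
df['running'] = (df.sort_values(['ordate', 'empl', 'pono'])
                   .groupby(['ordate', 'empl'])['total']
                   .cumsum())
print(df['running'].tolist())   # [100.0, 150.0, 200.0, 75.0]
```

The second Albert row on 2021-08-14 accumulates 100.0 + 50.0, while Tom’s row and the 2021-08-15 row each start a fresh partition.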
To view the result set, use the following code:
print(df_orders[['ordate', 'empl','total','day_running_total_by_empl']].sort_values(['ordate','empl']))
Here is the output; a line is skipped after each ordate group to improve readability.
        ordate           empl    total  day_running_total_by_empl
0   2021-08-14  Albert Forest   404.34                     404.34
2   2021-08-14  Albert Forest   308.15                     712.49
3   2021-08-14  Albert Forest   298.49                    1010.98
1   2021-08-14   Tom Richards  1032.89                    1032.89

4   2021-08-15  Albert Forest   327.16                     327.16
7   2021-08-15  Albert Forest   258.48                     585.64
5   2021-08-15   Tom Richards  1139.89                    1139.89
6   2021-08-15   Tom Richards   334.99                    1474.88

9   2021-08-16  Albert Forest   750.98                     750.98
11  2021-08-16  Albert Forest   118.14                     869.12
8   2021-08-16  Sydney Hurley    98.33                      98.33
12  2021-08-16  Sydney Hurley   401.89                     500.22
10  2021-08-16   Tom Richards   433.94                     433.94

13  2021-08-19     Ivan Levin   306.23                     306.23
15  2021-08-19     Ivan Levin   199.98                     506.21
14  2021-08-19  Sydney Hurley   503.54                     503.54
Compare the pandas result set to a SQL result set. To get the same result set in SQL, you can take advantage of the OVER clause in a SELECT statement. This clause lets you define the partitioning and ordering of the rowset and then specify a sliding window (a range of rows around the row being evaluated) within which you apply an analytic function, thus computing an aggregated value for each row.
In this particular example, partition the rowset by the values in the ordate and empl columns and order by the same columns plus the pono column, applying the SUM() function to compute a cumulative sum for each row in the rowset. Here’s the query.
SELECT rownum-1 as " ", ordate, empl, total order_total,
SUM(total) OVER (PARTITION BY ordate, empl ORDER BY ordate, empl, pono ROWS UNBOUNDED PRECEDING) day_running_total_by_empl
FROM orders
ORDER BY ordate, empl;
You may notice rownum-1 as " " at the first position in the select list. You didn’t use anything like this in the pandas version of this query, so why is it here? The row number (index) is automatically output when you print a pandas DataFrame; by contrast, to see the row number in a SQL query, you need to explicitly include the rownum pseudocolumn in the select list. You specify rownum-1 because pandas uses zero-based indexing, while the rownum pseudocolumn starts at 1.
The most important difference between the query built with pandas and the SQL query discussed here is that with pandas, you don’t use the windowing technique used in SQL. The ROWS UNBOUNDED PRECEDING keywords in the SQL query compose the windowing clause; this defines a window that starts at the first row of a partition and ends at the current row, allowing for cumulative aggregations within a partition. The groupby() and cumsum() method combination in the pandas query does the same thing, imitating the windowing technique.
The above technique won’t work, however, if you want to define a fixed number of rows preceding (or following) the current row. Turning back to the example, suppose you need to accumulate a sum over only two nearby orders within a partition. With pandas, this can be implemented as follows:
df_orders['two_row_total_by_empl'] = df_orders.sort_values(['ordate','empl','pono']).groupby(['ordate', 'empl'])['total'].rolling(2, min_periods = 1).sum().reset_index(drop=True, level=[0,1])
This replaces the cumsum() method with the rolling() and sum() method combination, which explicitly defines the window to slide within a partition. The first parameter passed to rolling() defines the size of the window. This example passes the value 2, so that only two rows (the current one and the preceding one) make up the window. To print the result set, use the following code:
print(df_orders[['ordate', 'empl','total','two_row_total_by_empl']].sort_values(['ordate','empl']))
Below are the first four rows of the output.
       ordate           empl    total  two_row_total_by_empl
0  2021-08-14  Albert Forest   404.34                 404.34
2  2021-08-14  Albert Forest   308.15                 712.49
3  2021-08-14  Albert Forest   298.49                 606.64
1  2021-08-14   Tom Richards  1032.89                1032.89
As expected, only the cumulative sum in the third row differs from the corresponding result in the previous example: that row is the third row within its partition, but the sum is now calculated within a two-row window.
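The effect of the fixed two-row window can be reproduced in isolation on a plain Series; the three numbers below are the first three order totals from the output above:

```python
import pandas as pd

# First three order totals from the output above, as a plain Series.
s = pd.Series([404.34, 308.15, 298.49])

# rolling(2) sums the current and previous rows; min_periods=1 lets the
# first row, which has no predecessor, stand alone.
two_row = s.rolling(2, min_periods=1).sum()
print([round(x, 2) for x in two_row])   # [404.34, 712.49, 606.64]
```

Without min_periods=1, the first row would produce NaN because the window cannot be filled with two rows.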
Compare the pandas result set to a SQL result set. The corresponding SQL query would look as follows:
SELECT rownum-1 as " ", ordate, empl, total order_total,
SUM(total) OVER (PARTITION BY ordate, empl ORDER BY ordate, empl, pono ROWS 1 PRECEDING) two_row_total_by_empl
FROM orders
ORDER BY ordate, empl;
This query defines a two-row window by specifying ROWS 1 PRECEDING in the windowing clause, meaning the window comprises the current row and its preceding row within the same partition.
You can base the size of a sliding analysis window on either a physical number of rows or a logical interval such as time or date. If you’re wondering when you might want to use a logical interval window over a window with a physical number of rows, suppose you need to compute the average price of a stock over a three-calendar-day period. Because a stock exchange is not open every calendar day, there are no rows for holidays and weekends in historical stock data sequences. So, if you define a three-day window, it will cover a single row for any date falling on Monday, two rows for Tuesdays, and three rows for Wednesdays, Thursdays, and Fridays—and it will not compensate for holidays.
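This behavior is easy to demonstrate with a small sketch (the dates and prices below are made up): an offset-based window of '3d' in pandas spans the interval (t - 3 days, t], so a Monday row’s window reaches back only through the empty weekend.

```python
import pandas as pd

# Made-up data: Thursday, Friday, and the following Monday; Saturday
# and Sunday have no rows, just as in real exchange data.
idx = pd.to_datetime(['2021-08-05', '2021-08-06', '2021-08-09'])
prices = pd.Series([10.0, 20.0, 30.0], index=idx)

# The three-calendar-day window covers only the Monday row itself,
# while Friday's window still sees Thursday's row.
avg = prices.rolling('3d', min_periods=1).mean()
print(avg.tolist())   # [10.0, 15.0, 30.0]
```

A fixed rolling(3) window would instead average all three rows for the Monday entry, silently mixing in prices from up to four calendar days earlier.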
To experiment with stock data, you first need to get some. The following script downloads stock data for three tech-company tickers for the last seven trading days, loading it into a pandas DataFrame:
import pandas as pd
import yfinance as yf
df_stocks = pd.DataFrame()
tickers = ['MSFT','ORCL','AMZN']
for ticker in tickers:
    tkr = yf.Ticker(ticker)
    hist = tkr.history(period='7d')
    hist['Symbol'] = ticker
    # DataFrame.append was removed in pandas 2.0; pd.concat is the
    # equivalent replacement.
    df_stocks = pd.concat([df_stocks,
                           hist[['Symbol', 'Close']].rename(columns={'Close': 'Price'})])
To determine how these stocks have performed, evaluate the difference between a stock’s closing price on a trading day and an average of its closing prices on the previous trading days within a three-calendar-day interval. To start, sort the rows in the dataset by the Symbol and Date columns.
df_stocks = df_stocks.sort_values(['Symbol','Date'])
For each row within a Symbol partition, calculate an average price over a trailing three-calendar-day interval (including the current row), saving the results in a separate column.
df_stocks['three_day_avg'] = df_stocks.groupby(['Symbol'])['Price'].rolling('3d', min_periods = 1).mean().reset_index(drop=True, level=0)
To be able to compare the price in a row with the average calculated as of the previous row, shift the newly created three_day_avg column by 1 within each Symbol partition.
df_stocks['prev_avg'] = df_stocks.groupby(['Symbol'])['three_day_avg'].shift(1)
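Note that a grouped shift() operates per partition. A toy sketch with made-up symbols shows that the first row of each group gets NaN rather than a value leaked in from the previous group:

```python
import pandas as pd

# Made-up data: two symbols with two rows each.
df = pd.DataFrame({'Symbol': ['A', 'A', 'B', 'B'],
                   'val': [1.0, 2.0, 3.0, 4.0]})

# shift(1) within each group returns the previous row's value, or NaN
# for the first row of a group -- B does not see A's last value.
df['prev'] = df.groupby('Symbol')['val'].shift(1)
print(df['prev'].tolist())   # [nan, 1.0, nan, 3.0]
```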
Now you can calculate the percentage of change for the prices compared to the average of the previous three days.
import numpy as np
df_stocks['priceRise_idx'] = np.log(df_stocks['Price']/df_stocks['prev_avg'])
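The sign convention of this index can be checked with made-up numbers: a price above the previous average yields a positive log ratio, and a price below it yields a negative one.

```python
import numpy as np

# Made-up prices measured against a previous average of 100.0.
print(round(np.log(102.0 / 100.0), 4))   # 0.0198  -> the price rose
print(round(np.log(98.0 / 100.0), 4))    # -0.0202 -> the price fell
```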
The result set can be viewed as follows:
print(df_stocks[['Symbol','Price','priceRise_idx']])
The output should look like the following:
           Symbol        Price  priceRise_idx
Date
2021-08-05   AMZN  3375.989990            NaN
2021-08-06   AMZN  3344.939941      -0.009240
2021-08-09   AMZN  3341.870117      -0.005549
2021-08-10   AMZN  3320.679932      -0.006361
2021-08-11   AMZN  3292.110107      -0.011826
2021-08-12   AMZN  3303.500000      -0.004446
2021-08-13   AMZN  3293.969971      -0.003473
2021-08-05   MSFT   289.519989            NaN
2021-08-06   MSFT   289.459991      -0.000207
2021-08-09   MSFT   288.329987      -0.004015
2021-08-10   MSFT   286.440002      -0.006577
2021-08-11   MSFT   286.950012      -0.001515
2021-08-12   MSFT   289.809998       0.008907
2021-08-13   MSFT   292.850006       0.017626
2021-08-05   ORCL    89.430000            NaN
2021-08-06   ORCL    89.519997       0.001006
2021-08-09   ORCL    89.900002       0.004739
2021-08-10   ORCL    89.639999      -0.002896
2021-08-11   ORCL    89.629997      -0.001561
2021-08-12   ORCL    89.809998       0.000965
2021-08-13   ORCL    90.379997       0.007627
A negative value in the priceRise_idx column indicates that the stock price fell compared to the average of the previous days; a positive value indicates that it rose. Looking through the results above, you may notice that over the investigated interval the AMZN stock price declined steadily, the MSFT stock price went down and then began to rise, and only the ORCL stock price rose on most days.
Compare the pandas result set to a SQL result set. Before you can generate the same result set in SQL, populate the stocks table created with timeseries_article.sql. To do this, run the stocks_inserts.py script shown in Listing 1; you can download the script here.
Listing 1. Populating the stocks table
import pandas as pd
import yfinance as yf
df_stocks = pd.DataFrame()
tickers = ['MSFT','ORCL','AMZN']
for ticker in tickers:
    tkr = yf.Ticker(ticker)
    hist = tkr.history(period='7d')
    hist['Symbol'] = ticker
    # DataFrame.append was removed in pandas 2.0; pd.concat is the
    # equivalent replacement.
    df_stocks = pd.concat([df_stocks,
                           hist[['Symbol', 'Close']].rename(columns={'Close': 'Price'})])
stocks_to_db = df_stocks[['Symbol', 'Price']].reset_index().rename(columns={'Date': 'Dt'}).round(2)
data = list(stocks_to_db.itertuples(index=False, name=None))
import cx_Oracle
try:
    conn = cx_Oracle.connect("usr", "pswd", "localhost/orcl")
    cursor = conn.cursor()
    # define the insert statement; cx_Oracle binds the Python datetime
    # in the first column directly, so no TO_DATE conversion is needed
    query_add_stocks = """INSERT INTO stocks (dt, symbol, price)
                          VALUES (:1, :2, :3)"""
    # insert the stock rows
    cursor.executemany(query_add_stocks, data)
    conn.commit()
except cx_Oracle.DatabaseError as exc:
    err, = exc.args
    print("Oracle-Error-Code:", err.code)
    print("Oracle-Error-Message:", err.message)
finally:
    cursor.close()
After that, issue the following SQL statement in your SQL command-line tool:
SELECT dt, symbol, price, LN(price/prev_avg) priceRise_idx FROM
(SELECT dt, symbol, price, LAG(three_day_avg) OVER (PARTITION BY symbol ORDER BY symbol, dt) prev_avg FROM
(SELECT dt, symbol, price,
AVG(price) OVER (PARTITION BY symbol ORDER BY dt RANGE BETWEEN 2 PRECEDING AND CURRENT ROW) three_day_avg
FROM stocks
ORDER BY symbol, dt
)
);
Both SQL and Python’s pandas library can answer nearly any question you might ask of your data. In this article, you learned to use these technologies to query and analyze time-series data utilizing analytic functions that allow you to compute an aggregate value for each row based on a group of nearby rows enclosed in a sliding window. You also learned to define both a window that is based on a physical number of rows and a window based on a date interval.
Illustration: Wes Rowell
Yuli Vasiliev (@VasilievYuli) is a programmer, freelance writer, and consultant specializing in open source development, Oracle Database technologies, and natural-language processing (NLP). He is the author of Natural Language Processing with Python and spaCy.