# Introduction
Time series feature engineering doesn't follow the same rules as tabular data. Observations aren't independent, row order is not incidental, and the most useful features are rarely individual readings. You'll need to identify patterns across time like rates of change, lag comparisons, deviations from a rolling baseline, and more.
Building lags, sliding windows, and grouping across resolutions are all, at their core, iteration problems over ordered sequences. Python's `itertools` module is a natural fit for this kind of work. It doesn't replace high-level pandas abstractions like `.rolling()`, but it gives you lower-level building blocks to assemble exactly the features you need, with full control over the logic.
In this article, you'll build seven categories of time series features using `itertools`. You'll also apply each to a sample dataset.
You can get the code on GitHub.
# Creating a Sample Dataset
Before we start building the features, let's spin up a sample sensor dataset to work with throughout the article.
```python
import numpy as np
import pandas as pd
import itertools

np.random.seed(42)

periods = 168  # one week of hourly readings
index = pd.date_range(start="2024-03-01", periods=periods, freq="h")
hours = np.arange(periods)

# Temperature (°C): daily cycle + slow drift + noise
temp_base = 3.5
temp_daily = 1.2 * np.sin(2 * np.pi * hours / 24)
temp_drift = 0.003 * hours
temp_noise = np.random.normal(0, 0.3, periods)
temperature = temp_base + temp_daily + temp_drift + temp_noise

# Humidity (%): inverse relationship with temperature + noise
humidity = 78 - 2.1 * (temperature - temp_base) + np.random.normal(0, 1.2, periods)

# Power draw (kW): peaks during business hours, higher on weekdays
day_of_week = index.dayofweek
business_hours = ((index.hour >= 8) & (index.hour <= 18)).astype(int)
weekend_factor = np.where(day_of_week >= 5, 0.6, 1.0)
power = (
    42.0
    + 18.0 * business_hours * weekend_factor
    + np.random.normal(0, 2.1, periods)
)

df = pd.DataFrame({
    "temperature_c": np.round(temperature, 3),
    "humidity_pct": np.round(humidity, 2),
    "power_kw": np.round(power, 2),
}, index=index)
df.index.name = "timestamp"

print(df.head(8))
print(f"\nShape: {df.shape}")
```
Output:
```
                     temperature_c  humidity_pct  power_kw
timestamp
2024-03-01 00:00:00          3.649         77.39     40.27
2024-03-01 01:00:00          3.772         76.52     41.33
2024-03-01 02:00:00          4.300         75.25     42.87
2024-03-01 03:00:00          4.814         74.26     40.82
2024-03-01 04:00:00          4.481         75.85     40.27
2024-03-01 05:00:00          4.604         76.09     42.51
2024-03-01 06:00:00          5.192         74.78     42.51
2024-03-01 07:00:00          4.910         76.03     40.94

Shape: (168, 3)
```
We now have 168 hourly readings across three sensor channels. Now let's build features.
# 1. Generating Lag Features with islice
Lag features are the most fundamental time series feature: the value of a variable at a fixed number of steps in the past. For example, values from 1 step ago, 6 steps ago, or 24 steps ago can each capture distinct patterns such as short-term fluctuations, recurring intra-period behavior, and longer-term trends or seasonality.
Let's build lag features for our sample dataset using `islice`:
```python
sensor_readings = df["temperature_c"].tolist()
lag_offsets = [1, 6, 12, 24]

lag_features = {}
for lag in lag_offsets:
    lagged = list(itertools.islice(sensor_readings, 0, len(sensor_readings) - lag))
    # Pad the start with None to preserve index alignment
    lag_features[f"temp_lag_{lag}h"] = [None] * lag + lagged

lag_df = pd.DataFrame(lag_features, index=df.index)
lag_df["temperature_c"] = df["temperature_c"]
print(lag_df.iloc[24:30])
```
Output:
```
                     temp_lag_1h  temp_lag_6h  temp_lag_12h  temp_lag_24h
timestamp
2024-03-02 00:00:00        2.831        2.082         3.609         3.649
2024-03-02 01:00:00        3.409        1.974         2.654         3.772
2024-03-02 02:00:00        3.919        2.960         2.425         4.300
2024-03-02 03:00:00        3.833        2.647         2.528         4.814
2024-03-02 04:00:00        4.542        2.986         2.205         4.481
2024-03-02 05:00:00        4.443        2.831         2.486         4.604

                     temperature_c
timestamp
2024-03-02 00:00:00          3.409
2024-03-02 01:00:00          3.919
2024-03-02 02:00:00          3.833
2024-03-02 03:00:00          4.542
2024-03-02 04:00:00          4.443
2024-03-02 05:00:00          4.659
```
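`islice(sensor_readings, 0, len(sensor_readings) - lag)` takes every reading except the last `lag` ones, and the `None` padding at the front shifts them so each row holds the value from `lag` steps earlier. `islice` itself is lazy and copies nothing, although wrapping it in `list()` does materialize the shifted values. The padding keeps every lag feature aligned with the original index, and pandas stores it as NaN. This matters when you later drop NaNs for model training.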
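If you only need each reading paired with its lagged value, the padding step can be skipped. The following is a minimal sketch, not part of the original code: `lagged_pairs` and the `sample` values are hypothetical, and it assumes `readings` is a re-iterable sequence such as a list (a one-shot iterator would not work here, because both arguments to `zip` read from the same object).

```python
from itertools import islice

def lagged_pairs(readings, lag):
    """Return an iterator of (current, value_lag_steps_ago) pairs.

    Assumes `readings` is a sequence (e.g. a list), not a one-shot iterator.
    """
    # islice(readings, lag, None) starts at position `lag`;
    # zip stops as soon as that shorter input is exhausted.
    return zip(islice(readings, lag, None), readings)

sample = [3.6, 3.8, 4.3, 4.8, 4.5]  # hypothetical readings
pairs = list(lagged_pairs(sample, lag=2))
print(pairs)  # [(4.3, 3.6), (4.8, 3.8), (4.5, 4.3)]
```

The first `lag` rows simply produce no pair, which is the same set of rows you would lose when dropping NaNs from the padded version.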
# 2. Building Rolling Window Features with islice and accumulate
A single lag value tells you what the sensor read at a point in the past. A rolling statistic tells you what the sensor has been doing over a window of time, which is often far more useful.
```python
readings = df["temperature_c"].tolist()
window_size = 6  # 6-hour rolling window

rolling_features = []
for i in range(len(readings)):
    if i < window_size:
        # Not enough history yet: leave the features empty
        rolling_features.append({
            "rolling_mean_6h": None,
            "rolling_std_6h": None,
            "rolling_min_6h": None,
            "rolling_max_6h": None,
        })
        continue

    # The six readings before position i (the current reading is excluded)
    window = list(itertools.islice(readings, i - window_size, i))

    # Use accumulate to compute the running sum for the mean
    running_sum = list(itertools.accumulate(window))
    window_mean = running_sum[-1] / window_size
    window_mean_sq = sum(x**2 for x in window) / window_size
    # Clamp at 0 so floating-point error cannot produce a negative variance
    window_var = max(window_mean_sq - window_mean**2, 0.0)

    rolling_features.append({
        "rolling_mean_6h": round(window_mean, 4),
        "rolling_std_6h": round(window_var ** 0.5, 4),
        "rolling_min_6h": round(min(window), 4),
        "rolling_max_6h": round(max(window), 4),
    })

roll_df = pd.DataFrame(rolling_features, index=df.index)
roll_df["temperature_c"] = df["temperature_c"]
print(roll_df.iloc[6:12])
```
Output:
```
                     rolling_mean_6h  rolling_std_6h  rolling_min_6h
timestamp
2024-03-01 06:00:00           4.2700          0.4256           3.649
2024-03-01 07:00:00           4.5272          0.4386           3.772
2024-03-01 08:00:00           4.7168          0.2929           4.300
2024-03-01 09:00:00           4.7372          0.2662           4.422
2024-03-01 10:00:00           4.6912          0.2728           4.422
2024-03-01 11:00:00           4.6095          0.3769           3.991

                     rolling_max_6h  temperature_c
timestamp
2024-03-01 06:00:00           4.814          5.192
2024-03-01 07:00:00           5.192          4.910
2024-03-01 08:00:00           5.192          4.422
2024-03-01 09:00:00           5.192          4.538
2024-03-01 10:00:00           5.192          3.991
2024-03-01 11:00:00           5.192          3.704
```
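The `accumulate` call here computes the running sum of the window, so its last element, `running_sum[-1]`, is the window total and the mean needs no separate `sum()` call. The squares, minimum, and maximum still take their own passes over the window. Each window holds the six readings before the current row, so the feature never includes the value it sits beside, and the standard deviation is the population form (divided by the window size). For large datasets processed in a streaming fashion, avoiding redundant passes over the same data helps.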
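One way to cut the repeated work further is to run `accumulate` once over the whole series and take differences of prefix sums. This is a sketch of that alternative, not the method used above: `rolling_means_prefix` and the `sample` values are hypothetical, and the `initial` argument assumes Python 3.8 or later.

```python
from itertools import accumulate

def rolling_means_prefix(readings, window_size):
    """Mean of the `window_size` readings before each position.

    Positions without enough history get None.
    """
    # prefix[k] is the sum of the first k readings; initial=0 makes prefix[0] == 0
    prefix = list(accumulate(readings, initial=0))
    means = []
    for i in range(len(readings)):
        if i < window_size:
            means.append(None)
        else:
            # Sum of readings[i - window_size : i] as a difference of two prefix sums
            window_total = prefix[i] - prefix[i - window_size]
            means.append(window_total / window_size)
    return means

sample = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # hypothetical readings
result = rolling_means_prefix(sample, window_size=3)
print(result)  # [None, None, None, 2.0, 3.0, 4.0]
```

Each window mean then costs one subtraction regardless of window size. On very long float series the prefix sums can accumulate rounding error, so this trade-off is worth checking against your data.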
# 3. Creating Seasonal Interaction Features with product
Many time series exhibit layered seasonality, where several temporal cycles interact, such as time of day, day of week, and broader operational or cyclical periods. Interaction features that combine these dimensions can capture patterns that individual time components alone may miss.
Now let's build interaction features with `product`:
```python
hours_of_day = list(range(24))
day_types = ["weekday", "weekend"]
operational_shifts = ["off_peak", "on_peak"]  # on_peak: 08:00–18:00

# Build a full lookup grid for all combinations
season_grid = list(itertools.product(hours_of_day, day_types, operational_shifts))
season_df = pd.DataFrame(season_grid, columns=["hour", "day_type", "shift"])

# Simulate expected baseline temperature per combination
np.random.seed(14)
season_df["baseline_temp_c"] = np.round(
    3.5
    + 0.8 * np.sin(2 * np.pi * season_df["hour"] / 24)
    + np.where(season_df["day_type"] == "weekend", 0.3, 0.0)
    + np.where(season_df["shift"] == "on_peak", 0.5, 0.0)
    + np.random.normal(0, 0.1, len(season_df)),
    3
)

print(season_df[season_df["hour"].isin([0, 8, 14, 20])].head(16).to_string(index=False))
print(f"\nTotal grid combinations: {len(season_df)}")
```
Output:
```
 hour day_type    shift  baseline_temp_c
    0  weekday off_peak            3.655
    0  weekday  on_peak            4.008
    0  weekend off_peak            3.817
    0  weekend  on_peak            4.293
    8  weekday off_peak            4.325
    8  weekday  on_peak            4.601
    8  weekend off_peak            4.446
    8  weekend  on_peak            4.978
   14  weekday off_peak            3.370
   14  weekday  on_peak            3.628
   14  weekend off_peak            3.279
   14  weekend  on_peak            3.959
   20  weekday off_peak            2.726
   20  weekday  on_peak            3.256
   20  weekend off_peak            3.056
   20  weekend  on_peak            3.530

Total grid combinations: 96
```
This grid merges back onto your main dataset as a `baseline_temp_c` feature per row, giving every reading a context-aware expected value. The deviation from that baseline, `temperature_c - baseline_temp_c`, is then a useful anomaly detection feature.
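To illustrate the lookup step without pandas, here is a minimal sketch that keys a baseline dictionary by the same three dimensions and computes the deviation for a single reading. Everything in it is an assumption made for illustration: the `baseline` values are placeholders rather than the simulated ones above, and `season_key` and `deviation_from_baseline` are hypothetical helpers.

```python
from datetime import datetime
from itertools import product

hours_of_day = range(24)
day_types = ["weekday", "weekend"]
shifts = ["off_peak", "on_peak"]

def placeholder_baseline(day_type, shift):
    """Hypothetical expected temperature; a real one would come from history."""
    return 3.5 + (0.3 if day_type == "weekend" else 0.0) + (0.5 if shift == "on_peak" else 0.0)

# One entry per (hour, day_type, shift) combination: 24 * 2 * 2 = 96 keys
baseline = {
    (hour, day_type, shift): placeholder_baseline(day_type, shift)
    for hour, day_type, shift in product(hours_of_day, day_types, shifts)
}

def season_key(ts):
    """Map a timestamp to its (hour, day_type, shift) grid key."""
    day_type = "weekend" if ts.weekday() >= 5 else "weekday"
    shift = "on_peak" if 8 <= ts.hour <= 18 else "off_peak"
    return (ts.hour, day_type, shift)

def deviation_from_baseline(ts, temperature_c):
    """Reading minus the expected value for its seasonal context."""
    return round(temperature_c - baseline[season_key(ts)], 3)

ts = datetime(2024, 3, 2, 9)  # a Saturday morning
key = season_key(ts)
dev = deviation_from_baseline(ts, 4.9)
print(key, dev)  # (9, 'weekend', 'on_peak') 0.6
```

Note that the full grid also contains combinations that never occur in practice, such as hour 0 with `on_peak`. They are harmless in a lookup because no timestamp maps to them.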
# 4. Extracting Sliding Window Statistics with tee
Sometimes you need to process the same sequence through multiple statistical lenses at once (mean, variance, rate of change) without iterating over the source multiple times. `itertools.tee` creates independent iterators from a single source, which is exactly what you need.
```python
def sliding_window_stats(series, window_size):
    """Compute mean, range and rate-of-change over sliding windows using tee."""
    results = []
    it = iter(series)
    window = list(itertools.islice(it, window_size))
    if len(window) < window_size:
        return results

    results.append({
        "window_mean": round(sum(window) / window_size, 4),
        "window_range": round(max(window) - min(window), 4),
        "rate_of_change": round(window[-1] - window[0], 4),
    })

    for next_val in it:
        # Slide the window forward by one reading
        window = window[1:] + [next_val]
        # tee creates two independent iterators over the same window
        iter_a, iter_b = itertools.tee(iter(window))
        values_a = list(iter_a)
        values_b = list(iter_b)
        mean_val = sum(values_a) / window_size
        results.append({
            "window_mean": round(mean_val, 4),
            "window_range": round(max(values_b) - min(values_b), 4),
            "rate_of_change": round(window[-1] - window[0], 4),
        })
    return results

power_readings = df["power_kw"].tolist()
stats = sliding_window_stats(power_readings, window_size=8)
stats_df = pd.DataFrame(stats, index=df.index[7:])
stats_df["power_kw"] = df["power_kw"].iloc[7:].values
print(stats_df.iloc[0:8])
```
Output:
```
                     window_mean  window_range  rate_of_change  power_kw
timestamp
2024-03-01 07:00:00      41.4400          2.60            0.67     40.94
2024-03-01 08:00:00      43.7825         18.74           17.68     59.01
2024-03-01 09:00:00      46.1775         20.22           17.62     60.49
2024-03-01 10:00:00      47.9387         20.22           16.14     56.96
2024-03-01 11:00:00      49.9663         20.22           16.77     57.04
2024-03-01 12:00:00      52.2437         19.55           15.98     58.49
2024-03-01 13:00:00      54.3738         19.55           17.04     59.55
2024-03-01 14:00:00      56.6412         19.71           19.71     60.65
```
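As seen, `tee` lets you pass the same window iterator into two separate downstream computations without rewinding or copying the list yourself. Keep in mind that `tee` buffers items internally whenever one iterator runs ahead of the other, so consuming `iter_a` fully before touching `iter_b` holds the whole window in memory.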
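`tee` is most useful when the source is a one-shot iterator that cannot be rewound. A minimal sketch of that case follows, using the well-known offset-by-one pattern to get step-to-step changes. The function name `step_changes` and the `sample` values are hypothetical.

```python
from itertools import tee

def step_changes(values):
    """Return the change between each reading and the one before it."""
    current, previous = tee(values)
    next(current, None)  # advance one copy so the two iterators are offset by one step
    return [round(c - p, 2) for c, p in zip(current, previous)]

sample = [40.27, 41.33, 42.87, 40.82]  # hypothetical readings
changes = step_changes(iter(sample))   # works even on a one-shot iterator
print(changes)  # [1.06, 1.54, -2.05]
```

Because the two iterators advance in lockstep, `tee` only ever buffers a single item here.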
# 5. Combining Multi-Resolution Time Features with chain
Useful time series features often come from several temporal resolutions at once: the raw hourly reading, a 6-hour rolling mean, a 24-hour rolling mean, and a calendar feature like hour-of-day. These usually live in separate arrays and need assembling into one clean feature list. Here's how you can use `chain` to combine such features:
```python
humidity = df["humidity_pct"].tolist()

def rolling_means(series, window):
    means = []
    for i in range(len(series)):
        if i < window:
            means.append(None)
        else:
            w = list(itertools.islice(series, i - window, i))
            means.append(round(sum(w) / window, 3))
    return means

rolling_6h = rolling_means(humidity, 6)
rolling_24h = rolling_means(humidity, 24)
hour_of_day = df.index.hour.tolist()
is_business_hour = [1 if 8 <= h <= 18 else 0 for h in hour_of_day]

# chain assembles the feature name list from logically grouped sublists
feature_names = list(itertools.chain(
    ["humidity_raw"],
    ["humidity_roll_6h", "humidity_roll_24h"],
    ["hour_of_day", "is_business_hour"],
))

multi_res_df = pd.DataFrame({
    name: vals for name, vals in zip(
        feature_names,
        [humidity, rolling_6h, rolling_24h, hour_of_day, is_business_hour]
    )
}, index=df.index)
print(multi_res_df.iloc[24:30])
```
Output:
```
                     humidity_raw  humidity_roll_6h  humidity_roll_24h
timestamp
2024-03-02 00:00:00         78.45            79.622             78.055
2024-03-02 01:00:00         75.63            79.105             78.100
2024-03-02 02:00:00         77.51            78.190             78.062
2024-03-02 03:00:00         76.27            78.088             78.157
2024-03-02 04:00:00         74.96            77.805             78.240
2024-03-02 05:00:00         75.75            77.208             78.203

                     hour_of_day  is_business_hour
timestamp
2024-03-02 00:00:00            0                 0
2024-03-02 01:00:00            1                 0
2024-03-02 02:00:00            2                 0
2024-03-02 03:00:00            3                 0
2024-03-02 04:00:00            4                 0
2024-03-02 05:00:00            5                 0
```
`chain` here assembles the feature name list from logically grouped sublists: raw sensor, rolling aggregates, calendar features. As your feature set grows across more sensor channels and more resolutions, `chain` keeps that assembly readable and easy to extend.
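When the groups already live in a single container, `chain.from_iterable` flattens them without listing each group as a separate argument. A small sketch, with a hypothetical `feature_groups` mapping whose names are illustrative only:

```python
from itertools import chain

# Hypothetical feature groups, keyed by where each group comes from
feature_groups = {
    "temperature": ["temp_raw", "temp_roll_6h"],
    "humidity": ["humidity_raw", "humidity_roll_6h", "humidity_roll_24h"],
    "calendar": ["hour_of_day", "is_business_hour"],
}

# Flatten the groups in insertion order into one list of column names
feature_names = list(chain.from_iterable(feature_groups.values()))
print(feature_names)
```

Adding a new sensor channel then means adding one dictionary entry, and the flattening line stays unchanged.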
# 6. Computing Pairwise Temporal Correlations with combinations
In a multi-sensor setting, the relationships between variables over time often contain valuable signals that individual measurements alone cannot capture. For example, simultaneous increases across two sensors may reveal emerging conditions or interactions that would not be apparent when each series is analyzed in isolation.
Incorporating features that reflect these joint dynamics can improve a model's ability to detect subtle patterns and dependencies. Let's try building pairwise correlations using `combinations`:
```python
sensor_cols = ["temperature_c", "humidity_pct", "power_kw"]
window_size = 12

pairwise_features = {}
for col_a, col_b in itertools.combinations(sensor_cols, 2):
    feature_name = f"corr_{col_a[:4]}_{col_b[:4]}_12h"
    correlations = []
    series_a = df[col_a].tolist()
    series_b = df[col_b].tolist()

    for i in range(len(series_a)):
        if i < window_size:
            correlations.append(None)
            continue

        win_a = list(itertools.islice(series_a, i - window_size, i))
        win_b = list(itertools.islice(series_b, i - window_size, i))

        # Pearson correlation from population covariance and standard deviations
        mean_a = sum(win_a) / window_size
        mean_b = sum(win_b) / window_size
        cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(win_a, win_b)) / window_size
        std_a = (sum((a - mean_a)**2 for a in win_a) / window_size) ** 0.5
        std_b = (sum((b - mean_b)**2 for b in win_b) / window_size) ** 0.5

        # A constant window has zero spread, so the correlation is undefined
        corr = round(cov / (std_a * std_b), 4) if std_a > 0 and std_b > 0 else None
        correlations.append(corr)

    pairwise_features[feature_name] = correlations

corr_df = pd.DataFrame(pairwise_features, index=df.index)
print(corr_df.iloc[12:18])
```
Output:
```
                     corr_temp_humi_12h  corr_temp_powe_12h
timestamp
2024-03-01 12:00:00             -0.6700             -0.2281
2024-03-01 13:00:00             -0.7208             -0.4960
2024-03-01 14:00:00             -0.7442             -0.6669
2024-03-01 15:00:00             -0.7678             -0.7076
2024-03-01 16:00:00             -0.8116             -0.7265
2024-03-01 17:00:00             -0.8368             -0.7482

                     corr_humi_powe_12h
timestamp
2024-03-01 12:00:00              0.5380
2024-03-01 13:00:00              0.6614
2024-03-01 14:00:00              0.7202
2024-03-01 15:00:00              0.7311
2024-03-01 16:00:00              0.7233
2024-03-01 17:00:00              0.7219
```
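`combinations(sensor_cols, 2)` yields each unordered pair exactly once, so three channels give three correlation features and n channels give n(n-1)/2. The sketch below isolates that pairing logic on tiny hand-checkable data. The `pearson` helper and the `channels` values are hypothetical, chosen so the expected correlations are easy to verify.

```python
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists, or None if either is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # correlation is undefined without spread
    # The 1/n factors in covariance and variances cancel, so they are omitted
    return cov / (var_x * var_y) ** 0.5

# Hypothetical channels with short names standing in for the sensor columns
channels = {
    "temp": [1.0, 2.0, 3.0, 4.0],
    "humi": [8.0, 6.0, 4.0, 2.0],
    "powe": [1.0, 3.0, 2.0, 4.0],
}

corrs = {
    f"corr_{a}_{b}": round(pearson(channels[a], channels[b]), 4)
    for a, b in combinations(channels, 2)
}
print(corrs)
# {'corr_temp_humi': -1.0, 'corr_temp_powe': 0.8, 'corr_humi_powe': -0.8}
```

The pair order follows the order of the input, which is why the feature names in the output above read temp-humi, temp-powe, humi-powe.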
# 7. Accumulating Running Baselines with accumulate
A given value can carry different significance depending on when it occurs in a sequence. What matters is its deviation from the evolving baseline, that is, the running mean up to that point in time. Using an incremental approach such as `accumulate`, you can compute this running mean efficiently without re-summing the entire history at every step.
```python
readings = df["temperature_c"].tolist()

running_sums = list(itertools.accumulate(readings))
running_counts = list(itertools.accumulate([1] * len(readings)))
running_means = [
    round(s / c, 4)
    for s, c in zip(running_sums, running_counts)
]

# Running max: highest temperature seen so far, useful for breach tracking
running_max = list(itertools.accumulate(readings, func=max))

deviation_from_baseline = [
    round(r - m, 4)
    for r, m in zip(readings, running_means)
]

baseline_df = pd.DataFrame({
    "temperature_c": readings,
    "running_mean": running_means,
    "running_max": running_max,
    "deviation_from_baseline": deviation_from_baseline,
}, index=df.index)
print(baseline_df.iloc[20:28])
```
Output:
```
                     temperature_c  running_mean  running_max
timestamp
2024-03-01 20:00:00          2.960        3.5857        5.192
2024-03-01 21:00:00          2.647        3.5430        5.192
2024-03-01 22:00:00          2.986        3.5188        5.192
2024-03-01 23:00:00          2.831        3.4902        5.192
2024-03-02 00:00:00          3.409        3.4869        5.192
2024-03-02 01:00:00          3.919        3.5035        5.192
2024-03-02 02:00:00          3.833        3.5157        5.192
2024-03-02 03:00:00          4.542        3.5524        5.192

                     deviation_from_baseline
timestamp
2024-03-01 20:00:00                  -0.6257
2024-03-01 21:00:00                  -0.8960
2024-03-01 22:00:00                  -0.5328
2024-03-01 23:00:00                  -0.6592
2024-03-02 00:00:00                  -0.0779
2024-03-02 01:00:00                   0.4155
2024-03-02 02:00:00                   0.3173
2024-03-02 03:00:00                   0.9896
```
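The version above materializes full lists, which is convenient for building a DataFrame. For a genuinely streaming source, the same baseline can be produced one reading at a time. This is a sketch under assumptions: `running_deviations` and the `sample` values are hypothetical, and the `initial` argument requires Python 3.8 or later.

```python
from itertools import accumulate

def running_deviations(stream):
    """Yield (reading, running_mean, deviation) lazily for each reading."""
    def step(state, reading):
        # State is (count, total, last_reading); accumulate threads it through the stream
        count, total, _ = state
        return (count + 1, total + reading, reading)

    states = accumulate(stream, step, initial=(0, 0.0, None))
    next(states)  # skip the initial empty state
    for count, total, reading in states:
        mean = total / count
        yield reading, round(mean, 4), round(reading - mean, 4)

sample = iter([2.0, 4.0, 3.0])  # hypothetical one-shot stream of readings
out = list(running_deviations(sample))
print(out)  # [(2.0, 2.0, 0.0), (4.0, 3.0, 1.0), (3.0, 3.0, 0.0)]
```

Only the count and running total are kept between readings, so memory use stays constant however long the stream runs.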
# Summary
Time series feature engineering is fundamentally about describing context: what has this signal been doing, relative to what we expect it to be doing? Every function covered here is a different way of formalizing that question into a number a model can learn from.
Here's a summary of the patterns we've covered in this article:
| itertools Function | Time Series Feature | Example |
|---|---|---|
| `islice` | Lag features | Temperature 1h, 6h, 24h ago |
| `islice` + `accumulate` | Rolling window stats | 6h mean, std, min, max |
| `product` | Seasonal interaction grid | Hour × day type × shift baseline |
| `tee` | Parallel window statistics | Mean + range + rate of change |
| `chain` | Multi-resolution feature assembly | Raw + rolling + calendar features |
| `combinations` | Pairwise cross-sensor correlations | Temp–humidity, temp–power rolling corr |
| `accumulate` | Running baseline + deviation | Drift detection from historical mean |
And because `itertools` works at the iterator level, all of these patterns compose cleanly into streaming pipelines as well. Happy feature engineering!
Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she's working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.
