Time Series Data Transformation using Python - GeeksforGeeks (2024)

Last Updated : 15 May, 2024


Time series data transformation is a crucial step in time series analysis and forecasting: it converts raw time series data into a format that is suitable for analysis and modelling. In this article, we will see how to apply common time series transformations and why they benefit our analysis.

Types of transformations

For univariate time series data, four main types of transformation are commonly used to make the data fit for model building:

  • Power Transform
  • Difference Transform
  • Standardization
  • Normalization

Generating the dataset

This code utilizes Pandas and NumPy libraries to create a synthetic dataset representing weather conditions over a 100-hour period, starting from April 1, 2006. Random weather conditions, including temperature, humidity, wind speed, pressure, visibility, and apparent temperature, are generated and stored in a Pandas data frame. Each row in the DataFrame corresponds to a specific hour, with columns indicating the date, weather conditions, and various meteorological parameters.

Python
import pandas as pd
import numpy as np
import random

# Generate example data
np.random.seed(0)
random.seed(0)
dates = pd.date_range('2006-04-01', periods=100, freq='H')
formatted_dates = [date.strftime('%Y-%m-%d %H:%M:%S.000 +0200') for date in dates]

# Generate weather conditions and daily summaries for each date
weather_options = ['Partly cloudy', 'Sunny', 'Rainy', 'Cloudy']
weather_conditions = [random.choice(weather_options) for _ in range(100)]
daily_summary = [' '.join([random.choice(weather_options), 'throughout the day.'])
                 for _ in range(100)]
temperature = np.random.randint(50, 100, size=100)
humidity = np.random.randint(40, 90, size=100)
wind_speed = np.random.randint(0, 15, size=100)
pressure = np.random.randint(980, 1050, size=100)
visibility = np.random.randint(0, 15, size=100)
apparent_temperature = np.random.randint(50, 100, size=100)

# Create DataFrame
df = pd.DataFrame({
    'Formatted Date': formatted_dates,
    'Weather Conditions': weather_conditions,
    'Temperature (C)': temperature,
    'Humidity': humidity,
    'Wind Speed (km/h)': wind_speed,
    'Pressure (mbar)': pressure,
    'Visibility (km)': visibility,
    'Apparent Temperature (C)': apparent_temperature,
    'Daily Summary': daily_summary
})
print(df.head())


Output:

Formatted Date Weather Conditions Temperature (C) Humidity Wind Speed (km/h) Pressure (mbar) Visibility (km) Apparent Temperature (C) Daily Summary
0 2006-04-01 00:00:00.000 +0200 Partly cloudy 94 45 8 1016 2 80 Rainy throughout the day.
1 2006-04-01 01:00:00.000 +0200 Cloudy 97 81 8 1028 13 58 Partly cloudy throughout the day.
2 2006-04-01 02:00:00.000 +0200 Rainy 50 75 9 1005 7 70 Cloudy throughout the day.
3 2006-04-01 03:00:00.000 +0200 Rainy 53 40 2 1047 8 57 Rainy throughout the day.
4 2006-04-01 04:00:00.000 +0200 Rainy 53 71 8 1015 4 53 Rainy throughout the day.

Applying Transformations to Our Data

1. Time Series Data Transformation Using Power Transform:

The power transform is mainly used to make the variance of the data constant. It involves mathematically transforming the data so it changes its distribution to be more Gaussian (normal). This can be particularly useful in cases where the data has a skewed distribution or heteroscedasticity (varying variance).

The code below applies a power transformation using the Yeo-Johnson method, which, unlike Box-Cox, also handles zero and negative values.

Python
from sklearn.preprocessing import PowerTransformer

# Compute variance of the original 'Temperature (C)' column
original_variance = df['Temperature (C)'].var()

# Apply power transform to 'Temperature (C)' column
pt = PowerTransformer(method='yeo-johnson')
df['Temperature (C)'] = pt.fit_transform(df[['Temperature (C)']])

# Compute variance of the transformed 'Temperature (C)' column
transformed_variance = df['Temperature (C)'].var()
print("Original Variance:", original_variance)
print("Transformed Variance:", transformed_variance)

Output:

Original Variance: 217.82828282828282 
Transformed Variance: 1.0101010101010097

The original variance of the ‘Temperature (C)’ column was about 217.83; after the Yeo-Johnson power transform it is about 1.01, i.e. roughly unit variance. This significant reduction indicates that the power transform successfully stabilized the variance of the data.
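One useful property of `PowerTransformer` is that the transformation is invertible, so predictions made on the transformed scale can be mapped back to the original units. Below is a minimal sketch of this round trip, using a hypothetical temperature series (standing in for the `df['Temperature (C)']` column of the article's dataset):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Hypothetical temperature series, standing in for df['Temperature (C)']
temps = pd.DataFrame({'Temperature (C)':
                      np.random.default_rng(0).integers(50, 100, 100)})

pt = PowerTransformer(method='yeo-johnson')
transformed = pt.fit_transform(temps)

# inverse_transform maps the Gaussian-like values back to the original scale
recovered = pt.inverse_transform(transformed)
print(np.allclose(recovered, temps.values))
```

Keeping the fitted `pt` object around is essential: the learned lambda parameter is what makes the inverse mapping possible.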

2. Time Series Data Transformation Using Difference Transform:

The difference transform is a technique used to make time series data stationary by computing the differences between consecutive observations. This transformation is useful for removing trends or seasonal patterns in the data, making it easier to model using techniques like ARIMA.

This code applies a differencing transformation to the ‘Humidity’ column of a DataFrame and performs the Augmented Dickey-Fuller (ADF) test to check for stationarity.

Python
from statsmodels.tsa.stattools import adfuller

# Apply difference transform to 'Humidity' column
df['Humidity difference'] = df['Humidity'].diff()

# Perform Dickey-Fuller test for stationarity
result = adfuller(df['Humidity difference'].dropna())
print("Humidity difference ADF Statistic:", result[0])
print("Humidity difference p-value:", result[1])
print("Humidity difference Critical Values:")
for key, value in result[4].items():
    print(f"  {key}: {value}")

Output:

Humidity difference ADF Statistic: -6.594772523405528
Humidity difference p-value: 6.969838186303788e-09
Humidity difference Critical Values:
1%: -3.50434289821397
5%: -2.8938659630479413
10%: -2.5840147047458037

Here, we performed the Dickey-Fuller test for stationarity after applying the differencing transformation. The results for the ‘Humidity difference’ column indicate that the data is likely stationary: the p-value (6.969838186303788e-09) is far below the typical significance level of 0.05, and the ADF statistic is lower than the critical values at the 1%, 5%, and 10% levels, so we can reject the null hypothesis of non-stationarity.
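As with the power transform, differencing is invertible, which matters when a model's forecasts on the differenced scale must be converted back to original units. A minimal sketch, using a hypothetical humidity series (standing in for `df['Humidity']`): restore the first observation, then take a cumulative sum.

```python
import numpy as np
import pandas as pd

# Hypothetical humidity series, standing in for df['Humidity']
humidity = pd.Series(np.random.default_rng(1).integers(40, 90, 100))

diffed = humidity.diff()  # first value becomes NaN

# To undo the differencing, put the first observation back in place of the
# NaN and take a cumulative sum of the differences
reconstructed = diffed.fillna(humidity.iloc[0]).cumsum()
print(reconstructed.equals(humidity.astype(float)))
```

This is first-order differencing; seasonal differencing (e.g. `diff(24)` for hourly data with a daily cycle) is inverted analogously, one seasonal lag at a time.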

3. Time Series Data Transformation Using Standardization:

Standardization, also known as z-score normalization, is a preprocessing technique used to scale the features of a dataset to have a mean of 0 and a standard deviation of 1. This transformation can be useful when working with features that have different scales, as it helps to bring all features to a similar scale.

This code demonstrates how to use the StandardScaler from scikit-learn to standardize the ‘Humidity’ and ‘Pressure (mbar)’ columns in a DataFrame.

Python
from sklearn.preprocessing import StandardScaler

# Create a StandardScaler object
scaler = StandardScaler()

# Fit the scaler to the data and transform the 'Humidity' and 'Pressure (mbar)' columns
df['Humidity standardized'] = scaler.fit_transform(df[['Humidity']])
df['Pressure standardized'] = scaler.fit_transform(df[['Pressure (mbar)']])

# Display the transformed DataFrame
print(df[['Humidity standardized', 'Pressure standardized']].head())

Output:

 Humidity standardized Pressure standardized
0 -1.264019 0.045409
1 1.151303 0.664619
2 0.748750 -0.522201
3 -1.599480 1.645036
4 0.480381 -0.006192

The columns ‘Humidity standardized’ and ‘Pressure standardized’ are now standardized, with their values now having a mean of 0 and a standard deviation of 1, which brings them to a similar scale.
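To make the z-score formula concrete, the sketch below computes z = (x − mean) / std by hand and checks that it matches `StandardScaler` on a hypothetical humidity series (standing in for `df['Humidity']`). One detail worth knowing: `StandardScaler` uses the population standard deviation (`ddof=0`), not pandas' default sample standard deviation.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical humidity values, standing in for df['Humidity']
humidity = pd.Series(np.random.default_rng(2).integers(40, 90, 100),
                     name='Humidity')

# Manual z-score: (x - mean) / std, with the population std (ddof=0)
manual = (humidity - humidity.mean()) / humidity.std(ddof=0)

scaler = StandardScaler()
sklearn_scaled = scaler.fit_transform(humidity.to_frame()).ravel()
print(np.allclose(manual.values, sklearn_scaled))
```

In a real forecasting pipeline, the scaler should be fit on the training portion only and then applied to later data, to avoid leaking future statistics into the model.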

4. Time Series Data Transformation Using Normalization

Normalization is another data preprocessing technique used to scale the features of a dataset to a fixed range, typically [0, 1]. This is achieved by subtracting the minimum value of the feature and then dividing by the range of the feature. Normalization is particularly useful when the features have different ranges and units.

The code below fits the scaler to the data, transforms the ‘Humidity’ column, and prints the first few rows of the transformed data.

Python
from sklearn.preprocessing import MinMaxScaler

# Create a MinMaxScaler object
scaler = MinMaxScaler()

# Fit the scaler to the data and transform the 'Humidity' column
df['Humidity normalized'] = scaler.fit_transform(df[['Humidity']])

# Display the transformed DataFrame
print(df['Humidity normalized'].head())

Output:

0 0.102041
1 0.836735
2 0.714286
3 0.000000
4 0.632653
Name: Humidity normalized, dtype: float64

The ‘Humidity normalized’ column now lies in the range [0, 1], putting it on a common scale with other normalized features before model fitting.
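To make the min-max formula concrete, the sketch below computes (x − min) / (max − min) by hand and checks that it matches `MinMaxScaler` on a hypothetical humidity series (standing in for `df['Humidity']`):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical humidity values, standing in for df['Humidity']
humidity = pd.Series(np.random.default_rng(3).integers(40, 90, 100),
                     name='Humidity')

# Manual min-max normalization: (x - min) / (max - min), into [0, 1]
manual = (humidity - humidity.min()) / (humidity.max() - humidity.min())

scaler = MinMaxScaler()
sklearn_scaled = scaler.fit_transform(humidity.to_frame()).ravel()
print(np.allclose(manual.values, sklearn_scaled))
```

Note that min-max scaling is sensitive to outliers: a single extreme value compresses all other observations into a narrow band, which is one reason standardization is often preferred for noisy series.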

Conclusion

In conclusion, time series data transformation is a crucial step in time series analysis and forecasting: it converts raw data into a form suitable for modelling. We applied the power, difference, standardization, and normalization transforms to a sample dataset, showing how each one affects the data and its suitability for accurate and effective forecasting.



