Leo Smith of Gridwatch has done a cracking job of making electricity data available on an online resource. The data available through his website includes total demand, total generation from different generation types, interconnector flows and frequency with a temporal resolution of approximately 5 minutes.
This post is about using a Python package called pandas to perform data analysis on the data downloaded from the gridwatch website. Pandas is a powerful data analysis library for Python.
A small Python script follows that reads gridwatch data into a Pandas data frame, performs data cleaning for erroneous data and plot.
1 2 |
import pandas as pd import matplotlib.pyplot as plt |
The first thing we want to do is to import pandas and matplotlib packages in our Python script. This is achieved by lines 1 and 2.
3 4 |
filename = 'gridwatch.csv' data = pd.read_csv(filename) |
‘gridwatch.csv‘ is a csv file that I have downloaded from the gridwatch website. To keep things simple, I have only downloaded the demand data in this file. However, the same script would work for any other data or a combination of data types. The resulting file is a big csv file (~3.8MB, 104860 lines).
Line 4 reads our csv file as a pandas data frame.
5 6 7 |
data = data.rename(columns=lambda x:x.strip()) for key in data: data[key] = data[key].map(lambda x: str(x).strip()) |
The downloaded csv file (and hence the pandas dataframe) contains leading white space. We need to fix this by stripping the white space, as this would cause problems later. This is achieved by lines 5-7.
8 9 10 11 12 13 |
data['timestamp'] = pd.to_datetime(data['timestamp'], format="%Y-%m-%d %H:%M:%S") data['demand'] = pd.to_numeric(data['demand']) data.set_index('timestamp', inplace=True) ax = data['demand'].plot() plt.show() |
Line 8 converts the datatype of our dataframe column ‘timestamp’ to take a datetime data type. This is useful as I would like to plot the demand against time and for that need pandas to recognise the datetime data type. Line 10 converts the demand to numeric values. Line 11 sets ‘timestamp’ to index of the data frame. Line 12 and 13 plot the demand against time, and we have the following plot.
The above figure shows some spikes and dips. These spikes and dips corresponds to the erroneous data in the downloaded csv file. The total demand of Great British electricity system can not be higher than 60GW and less that 15GW. Based on this knowledge, we need to clean our data. This can be achieved by using following command.
data = data[(data.demand>15000) & (data.demand<60000)]
See how easy is plotting and making sense of data using Pandas and Python.
The full code of reading gridwatch data is available on this link.