Using linear regression to predict renewable energy production from weather data for Germany in 2016
In this post I describe how to predict wind and solar generation from weather data using a simple linear regression algorithm and a dataset containing energy production and weather information for Germany during 2016.
This project allowed me to play with several important Data Science concepts and practices.
First, I had to look for data in places other than Kaggle, where most datasets are clean and ready for implementing Machine Learning algorithms. Even though, as we’ll see below, the datasets I used were surprisingly devoid of big problems such as lots of missing values, they were still closer to what a practising Data Scientist actually uses.
Second, this project allowed me to deal with time series in a real-world context, as the data for the wind and solar generation and weather comes in hourly resolution.
Third, even though I’m using real data, I didn’t need to use complex Deep Learning models or other sophisticated models to make reasonably good predictions. It’s always an important lesson that some problems, even if they seem complicated, can be solved using simple techniques, in this case linear regression.
Finally, working with data connected to renewable energy isn’t just a nice academic exercise. You can easily imagine, for example, how important it is to predict future renewable generation given its intermittent nature. Using Machine Learning techniques seems to be a fruitful option.
The code I use can be found in my GitHub repository, Renewable-energy-weather.
The Jupyter notebook can also be viewed here.
The data I used in this analysis come from Open Power System Data, a free-of-charge platform with data on installed generation capacity by country/technology, individual power plants (conventional and renewable), and time series data.
The platform contains data for 37 European countries, but in this project I’m going to focus on data for Germany in 2016 as an example. In particular, I’m going to use two datasets:
- Time series with load, wind and solar, prices in hourly resolution. The CSV file with data for all 37 countries from 2006 to 2017 can be obtained here.
- Weather data with wind speed, radiation, temperature and other measurements. Given the huge amount of data, the Open Power System Data platform provides a CSV file for the German dataset for 2016 (which can be found here) and a script to download the data for other countries and years.
Wind and solar generation
First, we start with the CSV file with the time series for the 37 European countries from 2006 to 2017, but read only the data for Germany:
production = pd.read_csv("data/time_series_60min_singleindex.csv",
usecols=(lambda s: s.startswith('utc') |
Using the option parse_dates, together with index_col=0, guarantees that the column with the date and time of each measurement is stored as a DatetimeIndex.
After filtering for the rows for 2016, we end up with a DataFrame with 8784 entries (the number of hours in a leap year like 2016) and 48 columns, each corresponding to a different quantity such as solar capacity, wind capacity, etc. (you can see the full list here). I’m only interested in two of them:
- ‘DE_solar_generation_actual’, with the actual solar generation in MW;
- ‘DE_wind_generation_actual’, with the actual wind generation in MW.
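A complete version of this read-and-filter step might look like the minimal sketch below. It runs on a tiny inline CSV with made-up values rather than the real file, and it assumes that all German columns start with the prefix DE and that the timestamp is the first column. The helper name read_german_production is hypothetical.

```python
import io

import pandas as pd

# Tiny stand-in for the real CSV file; all values are made up for illustration.
csv_text = """utc_timestamp,DE_solar_generation_actual,DE_wind_generation_actual,FR_load_actual
2015-12-31T23:00:00Z,0,8000,60000
2016-01-01T00:00:00Z,0,8500,59000
2016-01-01T01:00:00Z,0,8700,58000
2017-01-01T00:00:00Z,0,9100,61000
"""


def read_german_production(source):
    """Read the timestamp column plus every column whose name starts with 'DE'."""
    return pd.read_csv(
        source,
        # Assumption: German columns share the 'DE' prefix.
        usecols=lambda name: name.startswith("utc") or name.startswith("DE"),
        parse_dates=[0],  # parse the first column as datetimes
        index_col=0,      # and use it as the index (a DatetimeIndex)
    )


production = read_german_production(io.StringIO(csv_text))

# Keep only the rows whose timestamp falls in 2016.
production_2016 = production.loc[production.index.year == 2016]

# Select the two columns of interest.
generation = production_2016[
    ["DE_solar_generation_actual", "DE_wind_generation_actual"]
]
print(generation)
```

With the real file, the same function could be called with the path data/time_series_60min_singleindex.csv instead of the in-memory text.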
Fortunately, there are no missing values to worry about, so we can make a couple of plots to have an idea about the data.
As we would naively expect, there is no clear pattern for the wind generation across the year, even though we can see slightly larger production roughly in February, November and December.
As for the solar generation, as expected, it was significantly larger in the middle months of the year.
Now, we read the CSV file containing the weather data for Germany in 2016.
weather = pd.read_csv("data/weather_data_GER_2016.csv",
If we call the info() method of the weather DataFrame, we obtain:
DatetimeIndex: 2248704 entries, 2016-01-01 00:00:00 to
Data columns (total 14 columns):
cumulated hours int64
dtypes: float64(11), int64(3)
memory usage: 257.3 MB
That’s 2248704 entries! If you do the maths, it corresponds to 8784 hourly entries for each of 256 geographical ‘chunks’ of Germany, each characterized by its latitude lat and longitude lon. The other columns are as follows:
- v1: velocity [m/s] at height h1 (2 meters above displacement height);
- v2: velocity [m/s] at height h2 (10 meters above displacement height);
- v_50m: velocity [m/s] at 50 meters above ground;
- h1: height above ground [m] (h1 = displacement height +2m);
- h2: height above ground [m] (h2 = displacement height +10m);
- z0: roughness length [m];
- SWTDN: total top-of-the-atmosphere horizontal radiation [W/m²];
- SWGDN: total ground horizontal radiation [W/m²];
- T: Temperature [K] at 2 meters above displacement height (see h1);
- Rho: air density [kg/m³] at surface;
- p: air pressure [Pa] at surface.
Note that, at this point, we have information about the wind and solar generation at a national level and information about the weather at a more local level, namely for each of the 256 parts into which Germany was divided for these measurements. If we want to use the two datasets together, we need to transform the weather data so that we have a single row for each hour of 2016.
The analysis has some limitations: for instance, we don’t know the location within Germany of the wind turbines and solar panels. Given the purposes of this project, I’m going to simply group the weather data by hour and take the average over the geographical ‘chunks’. In this way, we obtain a DataFrame with 8784 entries, which we can later merge with the first DataFrame.
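The averaging and merging step can be sketched on a toy example as follows. The data below are made up (three hours, two geographical chunks), and the variable names weather_by_hour and combined are illustrative assumptions. The sketch also assumes that both DataFrames share the same hourly index; with real data, the time zones of the two indices may need to be aligned first.

```python
import pandas as pd

# Three hourly timestamps, each measured in two geographical chunks (toy data).
hours = pd.date_range("2016-01-01 00:00", periods=3, freq=pd.Timedelta(hours=1))
weather = pd.DataFrame(
    {
        "lat": [47.5, 48.0] * 3,
        "lon": [6.0, 6.0] * 3,
        "v1": [2.0, 4.0, 3.0, 5.0, 1.0, 3.0],
        "SWGDN": [0.0, 0.0, 10.0, 30.0, 100.0, 200.0],
    },
    index=hours.repeat(2),  # each hour appears once per chunk
)
weather.index.name = "timestamp"

# Average over the chunks: one row per hour. The averaged coordinates
# carry no meaning, so they are dropped.
weather_by_hour = weather.groupby(level=0).mean().drop(columns=["lat", "lon"])

# National production with one row per hour (toy values).
production = pd.DataFrame(
    {"DE_wind_generation_actual": [8500.0, 8700.0, 8200.0]}, index=hours
)

# Merge the two DataFrames on their shared hourly index.
combined = production.join(weather_by_hour, how="inner")
print(combined)
```

Taking a plain mean treats every chunk equally, which is the simplification described above; a weighted mean would need information about where the installed capacity actually is.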
Before doing that, let’s see how some of these averaged weather quantities behaved in Germany in 2016.
As with the wind generation, we see that the wind velocity does not follow a specific pattern, although it was larger in February, November and December.
The horizontal radiation at ground level was, as expected, larger during the summer months, and so was the temperature, as plotted below.
As suggested by the above plots from both datasets (and by common sense), there seems to be some correlation between the wind and solar generation and some of the measured weather quantities. Further evidence for this claim can be obtained from the following plots, in which the wind and solar generation are shown as functions of several weather quantities.
There seems to be a linear relation between the wind generation and the wind velocities v1, v2 and v_50m, but not the other quantities.
Similarly, there seems to be a linear relation between the solar generation and the top-of-the-atmosphere and ground radiation.
Given these observations, I’m going to try a linear regression algorithm in order to predict the wind and solar generation from some of the above weather quantities.
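In one of my previous blog posts I gave a brief overview of linear regression. In summary, the output of a linear regression algorithm is a linear function of the input:
$$\hat{y} = \mathbf{w}^\top \mathbf{x}$$
where $\mathbf{w}$ is a vector of parameters. The objective is to find the parameters which minimize the mean squared error over the $m$ training examples:
$$\mathrm{MSE} = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i)^2$$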
This can be achieved using LinearRegression from the scikit-learn library.
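To make the least-squares idea concrete, here is a minimal NumPy-only sketch on made-up data. It is not the code used in this post: the synthetic features, the chosen coefficients and the helper name fit_least_squares are assumptions for illustration. It appends a column of ones so that the intercept is learned as one more parameter.

```python
import numpy as np


def fit_least_squares(X, y):
    """Return (weights, intercept) that minimize the mean squared error."""
    # Append a constant feature so the intercept becomes one more parameter.
    X_aug = np.column_stack([X, np.ones(len(X))])
    params, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
    return params[:-1], params[-1]


# Synthetic, noiseless data generated from known parameters (illustrative only).
rng = np.random.default_rng(42)
X = rng.uniform(0.0, 10.0, size=(100, 2))
true_w = np.array([3.0, -2.0])
true_b = 5.0
y = X @ true_w + true_b

w, b = fit_least_squares(X, y)
mse = np.mean((X @ w + b - y) ** 2)
print("weights:", w, "intercept:", b, "MSE:", mse)
```

Because the toy data contain no noise, the fit recovers the generating parameters and the mean squared error is zero up to floating-point error. With real measurements the error stays positive.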
To predict the wind generation, we construct the features matrix X_wind with the features v1, v2 and v_50m, and the target y_wind with the actual wind generation. Then, we implement the algorithm:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

lr = LinearRegression()
scores_wind = cross_val_score(lr, X_wind, y_wind, cv=5)
print(scores_wind, "\naverage =", np.mean(scores_wind))
Let’s analyse what’s going on here, if you aren’t familiar with it:
- We import LinearRegression from sklearn.linear_model, which implements ordinary least squares linear regression. (You can find more information in the documentation.)
- In order to evaluate the performance of the algorithm, we divide the data using a procedure called cross-validation (CV for short). For k-fold CV, the dataset is split into k smaller sets or ‘folds’, the model is trained on k-1 of those folds, and the resulting model is validated on the remaining fold. This is repeated so that each fold serves once as the validation set, and the performance measure provided by the CV is then the average of the performance measures computed in each experiment. In the code above, we use cross_val_score from sklearn.model_selection, with the number of folds set by cv=5 (more information in the documentation).
- The performance measure that LinearRegression gives by default is the coefficient of determination R² of the prediction. It measures how well the predictions approximate the true values. A value close to 1 means that the regression makes predictions which are close to the true values. It is formally computed using the formula:
$$R^2 = 1 - \frac{\sum_{i=1}^{m} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{m} (y_i - \bar{y})^2}$$
where $\bar{y}$ is the mean of the true values.
The output of the code above for our case is:
[0.88261401 0.88886305 0.83623262 0.88974363 0.85338174]
average = 0.870167010172279
The first line contains the five values of R² for each of the 5 folds in the cross validation procedure, whereas the second line is their average. We see that our linear model has an R² of approximately 0.87, which is quite good for such a simple model! We can make good predictions about the wind generation in Germany in 2016 given only the wind velocities at different heights.
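For readers who want to see what the cross-validation call does step by step, the sketch below repeats it by hand on synthetic data. The data, the coefficients and the helper name r_squared are assumptions for illustration, not part of the original analysis. It splits the rows into 5 consecutive folds with KFold, fits on 4 of them, and computes R² on the held-out fold.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


# Synthetic stand-in for three wind-speed features and a noisy linear target.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 15.0, size=(200, 3))
y = X @ np.array([500.0, 800.0, 1200.0]) + rng.normal(0.0, 1000.0, size=200)

# Manual 5-fold cross-validation: train on 4 folds, score on the remaining one.
manual_scores = []
for train_idx, test_idx in KFold(n_splits=5).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    manual_scores.append(r_squared(y[test_idx], model.predict(X[test_idx])))
manual_scores = np.array(manual_scores)

# The library call, for comparison.
library_scores = cross_val_score(LinearRegression(), X, y, cv=5)

print(manual_scores)
print(library_scores)
```

Note that the folds here are consecutive blocks of rows, without shuffling. For hourly data ordered in time, each fold therefore covers a contiguous part of the year.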
To predict the solar generation, we follow a very similar procedure. We again construct the features matrix X_solar, but now with the features SWTDN, SWGDN and T, and the target y_solar with the actual solar generation. Then, we implement the algorithm:
scores_solar = cross_val_score(lr, X_solar, y_solar, cv=5)
print(scores_solar, "\naverage =", np.mean(scores_solar))
The output is:
[0.8901974 0.95027431 0.95982151 0.95090201 0.8715077 ]
average = 0.9245405855731855
We get an even better value of R²! We can make good predictions about the solar generation in Germany in 2016 given only the temperature and top-of-the-atmosphere and ground radiation.
Even with these good results, there is certainly much we can do to improve the analysis. As an example, it is very probable that some of the features used in the regression are collinear, that is, they are moderately or highly correlated. For a first analysis like this, in which we are only interested in the predictive power of the model, it is not a major concern. However, without further work, we cannot say much about the influence of each individual feature on the wind and solar generation.
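A quick way to look for collinearity is to inspect the pairwise correlations between the features. The sketch below does this on synthetic wind speeds that are constructed to be strongly related; the numbers are made up and only illustrate the check, they say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic wind speeds at three heights, built to be strongly related.
rng = np.random.default_rng(0)
v1 = rng.uniform(0.0, 10.0, size=500)
features = pd.DataFrame(
    {
        "v1": v1,
        "v2": 1.2 * v1 + rng.normal(0.0, 0.3, size=500),
        "v_50m": 1.6 * v1 + rng.normal(0.0, 0.5, size=500),
    }
)

# Pearson correlation between every pair of features.
corr = features.corr()
print(corr.round(3))
```

Off-diagonal values close to 1 (or -1) indicate that the features carry largely the same information, in which case the individual regression coefficients should not be interpreted in isolation.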
It’s quite impressive how much we were able to accomplish using real-world data and a simple algorithm. Of course, more sophisticated analyses may be made, which are beyond the scope of this post. If you want to know more, I found the following article to be very instructive: