Using linear regression to predict renewable energy production from weather data for Germany in 2016

In this post I describe how to predict wind and solar generation from weather data using a simple linear regression algorithm and a dataset containing energy production and weather information for Germany during 2016.

This project allowed me to play with several important Data Science concepts and practices.

First, I had to look for data in places other than Kaggle, where most datasets are clean and ready for implementing Machine Learning algorithms. Even though, as we’ll see below, the datasets I used were surprisingly devoid of big problems such as lots of missing values, they were still closer to what a practising Data Scientist actually uses.

Second, this project allowed me to deal with time series in a real-world context, as the data for the wind and solar generation and weather comes in hourly resolution.

Third, even though I’m using real data, I didn’t need to use complex Deep Learning models or other sophisticated models to make reasonably good predictions. It’s always an important lesson that some problems, even if they seem complicated, can be solved using simple techniques, in this case linear regression.

Finally, working with data connected to renewable energy isn’t just a nice academic exercise. You can easily imagine, for example, how important it is to predict future renewable generation given its intermittent nature. Using Machine Learning techniques seems to be a fruitful option.

The code I use can be found in my GitHub repository:

The Jupyter notebook can also be viewed here.

The data I used in this analysis come from Open Power System Data, a free-of-charge platform with data on installed generation capacity by country/technology, individual power plants (conventional and renewable), and time series data.

Open Power System Data: a platform for open data of the European power system (open-power-system-data.org).

The platform contains data for 37 European countries, but in this project I’m going to focus on data for Germany in 2016 as an example. In particular, I’m going to use two datasets:

- **Time series** with load, wind and solar, and prices in hourly resolution. The CSV file with data for all 37 countries from 2006 to 2017 can be obtained here.
- **Weather data** with wind speed, radiation, temperature and other measurements. Given the huge amount of data, the Open Power System Data platform provides a CSV file for the German dataset for 2016 (which can be found here) and a script to download the data for other countries and years.

## Wind and solar generation

First, we start with the CSV file with the time series for the 37 European countries from 2006 to 2017, but read only the data for Germany:

    import pandas as pd

    # Keep only the timestamp column and the German ("DE...") columns
    production = pd.read_csv(
        "data/time_series_60min_singleindex.csv",
        usecols=lambda s: s.startswith("utc") or s.startswith("DE"),
        parse_dates=[0],
        index_col=0,
    )

Using the option `parse_dates=[0]`, together with `index_col=0`, guarantees that the column with the date and time of each measurement is stored as a `DatetimeIndex`.
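To see what these options do without downloading the full file, here is a minimal sketch on a tiny in-memory CSV. The rows and the `FR_load` column are made up for illustration; only the `utc...`/`DE...` naming pattern mirrors the real file.

```python
import io

import pandas as pd

# Made-up stand-in for the real CSV: same naming pattern, toy values.
csv_text = """utc_timestamp,DE_solar_generation_actual,DE_wind_generation_actual,FR_load
2016-01-01 00:00:00,0,10,100
2016-01-01 01:00:00,0,20,200
2016-01-01 02:00:00,0,30,300
"""

demo = pd.read_csv(
    io.StringIO(csv_text),
    # Callable usecols: keep a column only if its name passes this test
    usecols=lambda s: s.startswith("utc") or s.startswith("DE"),
    parse_dates=[0],  # parse the first column as datetimes
    index_col=0,      # and use it as the index
)

print(type(demo.index).__name__)
print(list(demo.columns))
```

The non-German column is dropped at read time, which keeps memory usage down when the file covers many countries.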

After filtering for the rows for 2016, we end up with a DataFrame with 8784 entries (the number of hours in a leap year like 2016) and 48 columns, each corresponding to a different quantity such as solar capacity, wind capacity, etc. (you can see the full list here). I’m only going to be interested in two of them:

- ‘DE_solar_generation_actual’, with the actual solar generation in MW;
- ‘DE_wind_generation_actual’, with the actual wind generation in MW.

Fortunately, there are no missing values to worry about, so we can make a couple of plots to have an idea about the data.
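The filtering and the missing-value check can be sketched as follows. This is only an illustration on a synthetic hourly index (the `generation` column is hypothetical), not the exact code from the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic hourly index spanning three years, standing in for the real data.
idx = pd.date_range(
    "2015-01-01 00:00", "2017-12-31 23:00", freq=pd.Timedelta(hours=1)
)
demo = pd.DataFrame({"generation": np.arange(len(idx), dtype=float)}, index=idx)

# Partial-string indexing on a DatetimeIndex selects every row in that year.
demo_2016 = demo.loc["2016"]

n_hours = len(demo_2016)                       # 366 days * 24 hours
n_missing = int(demo_2016.isna().sum().sum())  # total NaN count over all columns

print(n_hours, n_missing)
```

Because 2016 is a leap year, the row count is 366 × 24 rather than 365 × 24.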

As we would naively expect, there is no clear pattern for the wind generation across the year, even though we can see slightly larger production roughly in February, November and December.

As for the solar generation, as expected, it was significantly larger in the middle months of the year.

## Weather data

Now, we read the CSV file containing the weather data for Germany in 2016.

    weather = pd.read_csv("data/weather_data_GER_2016.csv",
                          parse_dates=[0], index_col=0)

If we call the `info()` method of the `weather` DataFrame, we obtain:

    <class 'pandas.core.frame.DataFrame'>
    DatetimeIndex: 2248704 entries, 2016-01-01 00:00:00 to 2016-12-31 23:00:00
    Data columns (total 14 columns):
    cumulated hours    int64
    lat                float64
    lon                float64
    v1                 float64
    v2                 float64
    v_50m              float64
    h1                 int64
    h2                 int64
    z0                 float64
    SWTDN              float64
    SWGDN              float64
    T                  float64
    rho                float64
    p                  float64
    dtypes: float64(11), int64(3)
    memory usage: 257.3 MB

That’s 2248704 entries! If you do the maths, it corresponds to 8784 hourly entries for each of 256 geographical ‘chunks’ of Germany (8784 × 256 = 2248704), each characterized by its latitude `lat` and longitude `lon`. The other columns are as follows:

*Wind parameters:*

- v1: velocity [m/s] at height h1 (2 meters above displacement height);
- v2: velocity [m/s] at height h2 (10 meters above displacement height);
- v_50m: velocity [m/s] at 50 meters above ground;
- h1: height above ground [m] (h1 = displacement height +2m);
- h2: height above ground [m] (h2 = displacement height +10m);
- z0: roughness length [m];

*Solar parameters:*

- SWTDN: total top-of-the-atmosphere horizontal radiation [W/m²];
- SWGDN: total ground horizontal radiation [W/m²];

*Temperature data:*

- T: Temperature [K] at 2 meters above displacement height (see h1);

*Air data:*

- rho: air density [kg/m³] at surface;
- p: air pressure [Pa] at surface.

Note that, at this point, we have information about the wind and solar generation at a *national* level and information about the weather at a more *local* level, namely for each of the 256 parts into which Germany was divided for these measurements. If we want to use the two datasets together, we need to transform the weather data so that we have a single row for each hour of 2016.

We have some limitations for a complete analysis, as, for instance, we don’t know the location within Germany of the wind turbines and solar panels. Given the purposes of this project, I’m going to simply group the weather data by hour and take the average over the geographical ‘chunks’. In this way, we obtain a DataFrame with 8784 entries, which we can later merge with the first DataFrame.
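The averaging and merging step can be sketched on toy data as follows. The values, the four grid cells and the `wind_generation` column are made up, and the sketch assumes both DataFrames use identical timestamps in their indexes.

```python
import numpy as np
import pandas as pd

hours = pd.date_range("2016-01-01 00:00", periods=3, freq=pd.Timedelta(hours=1))
n_cells = 4  # toy number of geographical chunks (256 in the real data)

# Toy weather data: one row per (hour, grid cell), indexed by timestamp.
weather_demo = pd.DataFrame(
    {
        "lat": np.tile([47.5, 48.0, 48.5, 49.0], len(hours)),
        "v1": np.arange(len(hours) * n_cells, dtype=float),
    },
    index=hours.repeat(n_cells),
)

# One row per hour: average every column over the grid cells.
weather_by_hour = weather_demo.groupby(level=0).mean()

# Toy national-level production data, one row per hour.
production_demo = pd.DataFrame(
    {"wind_generation": [100.0, 200.0, 300.0]}, index=hours
)

# Align the two datasets on their shared hourly index.
combined = production_demo.join(weather_by_hour[["v1"]], how="inner")
print(combined)
```

Averaging over all cells weights every location equally, which is a simplification given that the turbines and panels are not spread uniformly across the country.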

Before doing that, let’s see how some of these averaged weather quantities behaved in Germany in 2016.

As with the wind generation, we see that the wind velocity does not follow a specific pattern, although it was larger in February, November and December.

The horizontal radiation at ground level was, as expected, larger during the summer months, as was the temperature, plotted below.

As suggested by the above plots from both datasets (and by common sense), there seems to be some correlation between the wind and solar generation and some of the measured weather quantities. Further evidence for this claim can be obtained from the following plots, in which the wind and solar generation is shown as a function of the several weather quantities.

There seems to be a linear relation between the wind generation and the wind velocities v1, v2 and v_50m, but not the other quantities.

Similarly, there seems to be a linear relation between the solar generation and the top-of-the-atmosphere and ground radiation.

Given these observations, I’m going to try a linear regression algorithm in order to predict the wind and solar generation from some of the above weather quantities.

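In one of my previous blog posts I gave a brief overview of linear regression. In summary, the output of a **linear regression algorithm** is a linear function of the input:

$$\hat{y} = \mathbf{w}^\top \mathbf{x}$$

where $\mathbf{w}$ is a vector of parameters. The objective is to find the parameters which minimize the **mean squared error** over the $m$ examples:

$$\mathrm{MSE} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}_i - y_i \right)^2$$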

This can be achieved using `LinearRegression` from the scikit-learn library.
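As a sanity check on the idea of minimizing the mean squared error, the sketch below fits synthetic data with `LinearRegression` and compares the result with a direct least-squares solve from NumPy. The data and the "true" coefficients are invented for illustration. Note that `LinearRegression` also fits an intercept by default, which the direct solve reproduces by appending a column of ones to the inputs.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: a known linear relation plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 3.0 + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)

# Same minimization solved directly; the extra column of ones plays the
# role of the intercept.
X1 = np.column_stack([X, np.ones(len(X))])
w_lstsq, *_ = np.linalg.lstsq(X1, y, rcond=None)

# Mean squared error of the fitted model on the training data.
mse = np.mean((model.predict(X) - y) ** 2)

print(model.coef_, model.intercept_)
print(w_lstsq)
print(mse)
```

Both approaches should agree up to numerical precision, since they minimize the same objective.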

## Wind generation

To predict the wind generation, we construct the features matrix `X_wind` with the features v1, v2 and v_50m, and the target `y_wind` with the actual wind generation. Then, we implement the algorithm:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    lr = LinearRegression()
    # 5-fold cross-validation; each score is the R² on one held-out fold
    scores_wind = cross_val_score(lr, X_wind, y_wind, cv=5)
    print(scores_wind, "\naverage =", np.mean(scores_wind))

Let’s analyse what’s going on here, if you aren’t familiar with it:

- We import `LinearRegression` from `sklearn.linear_model`, which implements ordinary least squares linear regression. (You can find more information in the documentation.)
- In order to evaluate the performance of the algorithm, we divide the data using a procedure called **cross-validation** (CV for short). For *k*-fold CV, the dataset is split into *k* smaller sets or ‘folds’, the model is trained on *k*-1 of those folds, and the resulting model is validated on the remaining fold. This is repeated so that each fold is used once for validation. The performance measure provided by the CV is then the average of the performance measure computed in each experiment. In the code above, we use `cross_val_score` from `sklearn.model_selection`, with number of folds `cv=5` (more information in the documentation).
- The performance measure that `LinearRegression` gives by default is the **coefficient of determination** *R*² of the prediction. It measures how well the predictions approximate the true values. A value close to 1 means that the regression makes predictions which are close to the true values. It is formally computed using the formula:

  $$R^2 = 1 - \frac{\sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{m} \left( y_i - \bar{y} \right)^2}$$

  where $\bar{y}$ is the mean of the true values $y_i$ over the $m$ examples being scored.

The output of the code above for our case is:

    [0.88261401 0.88886305 0.83623262 0.88974363 0.85338174]
    average = 0.870167010172279

The first line contains the five values of *R*², one for each of the 5 folds in the cross-validation procedure, whereas the second line is their average. We see that our linear model has an *R*² of approximately 0.87, which is quite good for such a simple model! We can make good predictions about the wind generation in Germany in 2016 given only the wind velocities at different heights.
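To make the cross-validation procedure and the *R*² score more concrete, here is a sketch that reproduces what `cross_val_score` does for a regressor by hand. The data are synthetic and the helper `r_squared` is a hypothetical name of my own. It relies on the fact that, for a regressor and an integer `cv`, scikit-learn splits the data with an unshuffled `KFold`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


# Synthetic regression problem, for illustration only.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

# Manual 5-fold CV: train on four folds, score on the held-out fold.
manual_scores = []
for train_idx, test_idx in KFold(n_splits=5).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    manual_scores.append(r_squared(y[test_idx], model.predict(X[test_idx])))

library_scores = cross_val_score(LinearRegression(), X, y, cv=5)

print(np.round(manual_scores, 4))
print(np.round(library_scores, 4))
```

Since the folds are contiguous blocks of rows, with hourly data sorted by time each fold covers a different stretch of the year, which may explain part of the variation between the fold scores.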

## Solar generation

To predict the solar generation, we follow a very similar procedure. We again construct the features matrix `X_solar`, but now with the features SWTDN, SWGDN and T, and the target `y_solar` with the actual solar generation. Then, we implement the algorithm:

    scores_solar = cross_val_score(lr, X_solar, y_solar, cv=5)
    print(scores_solar, "\naverage =", np.mean(scores_solar))

The output is:

    [0.8901974  0.95027431 0.95982151 0.95090201 0.8715077 ]
    average = 0.9245405855731855

We get an even better value of *R*²! We can make good predictions about the solar generation in Germany in 2016 given only the temperature and top-of-the-atmosphere and ground radiation.

Even with these good results, there is certainly much we can do to improve the analysis. As an example, it is very probable that some of the features used in the regression are *collinear*, that is, they are moderately or highly correlated. For a first analysis like this, in which we are only interested in the predictive power of the model, it is not a major concern. However, without further work, we cannot say much about the influence of each individual feature on the wind and solar generation.
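A quick first check for collinearity is the correlation matrix of the features. The sketch below uses synthetic wind speeds built from a shared base signal, so the high correlations are there by construction. It illustrates the check, not a result for the real data.

```python
import numpy as np
import pandas as pd

# Synthetic wind speeds at three heights that all follow one base signal.
rng = np.random.default_rng(1)
base = rng.gamma(shape=2.0, scale=3.0, size=500)
features = pd.DataFrame(
    {
        "v1": base + rng.normal(scale=0.2, size=500),
        "v2": 1.2 * base + rng.normal(scale=0.2, size=500),
        "v_50m": 1.6 * base + rng.normal(scale=0.3, size=500),
    }
)

# Pairwise Pearson correlations; off-diagonal values near 1 signal collinearity.
corr = features.corr()
print(corr.round(3))
```

When features are this strongly correlated, the individual regression coefficients become hard to interpret, even though the predictions themselves can remain accurate.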

It’s quite impressive how much we were able to accomplish using real-world data and a simple algorithm. Of course, more sophisticated analyses may be made, which are beyond the scope of this post. If you want to know more, I found the following article to be very instructive: