- Data Science Lifecycle phases and the importance of each phase in the overall process.
- How to develop and deploy ML models for time-series forecasting.
- Best practices when developing machine learning applications for time series forecasting.
- Deep learning techniques for time series forecasting.
Srini Penchikala: My name is Srini Penchikala. I am the lead editor for the AI, ML and data engineering community at InfoQ. Thank you for checking out this podcast. In today's podcast, I will be speaking with Dr. Francesca Lazzeri on machine learning for time series forecasting as the main topic. This will include automated machine learning and deep learning for time series data forecasting. We will also talk about other emerging trends in machine learning development and operations areas, including the data science lifecycle. And if time permits, at the end of the podcast, we will also discuss MLOps best practices.
So let me first introduce Dr. Lazzeri to our listeners. Dr. Francesca Lazzeri currently works as principal cloud advocate manager at Microsoft. She is an experienced scientist and a machine learning practitioner with over 12 years of both academic and industry experience. She is the author of a number of publications, including technology journals, conferences and books. Dr. Lazzeri currently leads an international team of data scientists, cloud advocates and AI developers at Microsoft. Before joining Microsoft, she was a research fellow at Harvard University in the technology and operations management unit. Dr. Lazzeri is no stranger to the InfoQ audience. She has already published articles on infoq.com, and also spoke recently at the QCon Plus virtual conference on the topic of MLOps.
Francesca, welcome to the podcast. Thank you for joining me today. Before we get started, do you have any additional comments about your recent research projects that you have been leading, that may be of interest to our readers?
Francesca Lazzeri: Hi everyone and thank you so much, Srini, for inviting me to this podcast. It's just a pleasure for me to be here. I think that you gave our listeners a very good introduction. And today I would really like to talk about two main topics that are time series forecasting, and specifically how you can apply different machine learning techniques and deep learning approaches to time series data. So I'm sure that you have some questions for me on that topic.
And then some of the latest research areas I have been working on are automated machine learning for time series forecasting, but also for classification and regression. And of course we can talk a little bit more about that. And another area is interpretability for machine learning forecast and interpretability of open source packages that data scientists then can apply to machine learning models.
So I'm very happy to be here today, and thank you again.
Data science life cycle phases [02:52]
Srini Penchikala: Thank you Francesca. I would like to start this discussion with the data science life cycle, to kind of provide our listeners, based on your experience and expertise, what they should be looking for in their own machine learning development projects. So typically all machine learning applications will need some type of a process, and some kind of data science lifecycle, to be successful in their projects. Can you discuss what data science life cycle is about and what phases in that life cycle process are critical in terms of developing and deploying machine learning solutions to production?
Francesca Lazzeri: This is a great question. I started my career at Microsoft about seven years ago as a data scientist. And since then, I have been working on many different data science and applied machine learning solutions for different customers, both internal and external customers. And I have been noticing that there are a few steps that are very common to any machine learning and data science project. So usually I refer to those steps as the data science life cycle. Some other data scientists call it the machine learning life cycle. But basically we are speaking about five different steps. And I always like to share these steps with my customers and the data scientists on my team, because if you follow these steps, I'm pretty sure that you're going to build end-to-end successful solutions.
The first step is business understanding. The second is data acquisition and understanding. Then we have the third one that is a core part of the machine learning life cycle that is modeling, and it's the step that all data scientists work on a lot. Then there is the deployment; and then there is a final one that I like to call customer acceptance.
Let's understand these five steps a little bit better. The business understanding is really the moment in which you, as a data scientist or as a data science manager, need to specify what I call the key variables that are to serve as the model targets. So those are the variables that you're going to use for your machine learning models. And you're going to also define those metrics that you're going to look at in order to evaluate the performance of your model, but also the business impact of your model. And you can do this only if you are in a team where you have both experts from the data science field and the machine learning field.
The business understanding is really about identifying the relevant data sources that the business has access to, and also that the business sometimes needs to obtain, because sometimes you also want to include like external variables, external indicators that can improve your machine learning model. So business understanding is really the first step and the key step, in my opinion, when data scientist experts, data experts and business experts that can be also like customers for example, they talk together and they define the business problems that they want to solve with data.
After that, there is what I call the data acquisition and understanding. Here, I would say that the main goal is to produce a clean and also high quality data set. And this data set has to have a direct relationship to the target variable, because of course you need to put together a set of features that are going to help you with predicting and measuring the target variable that you need in order to solve the business problem. So here it's very important for the data engineer, the data scientist or the data architect (it depends of course on the team that you're working with) to develop a solution architecture of the data pipeline that refreshes and scores the data in a very, I would say, consistent and regular way.
Data acquisition is also interesting because again, it's not just about data acquisition, but it's also about data understanding. And so there are different assets that you need to look at in order to prepare these steps. These are like, for example, the data source, the data pipeline, the environment that you're going to leverage in order to build your data science application. And then there is all the exploration, wrangling, and cleaning of your data. So this second step is really about your data.
Then we have the third step that I was mentioning before, which is modeling. Modeling, I would say, is the comfort zone for data scientists in general, because this is the step in which data scientists determine the optimal data features for the machine learning model that they are creating. So it's really about making sure that they try different algorithms. There is also something that is very common in any data science project, that is the feature engineering process. This is about transforming, but also selecting the right features for your model. And then you move to the model training that is about selecting the right algorithm, but also doing parameter tuning and trying different model settings in order to make sure that you are building a healthy, and I would say also stable model. And finally there is the model evaluation. So all these steps that are feature engineering, model training, model evaluation are actually part of this third, big step, that is the modeling part.
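To make the parameter tuning and model evaluation steps concrete, here is a minimal sketch that was not part of the conversation. It assumes a very simple moving-average forecaster and invented demand values, and it uses only the Python standard library; a real project would more likely rely on a dedicated forecasting package. The function names are hypothetical.

```python
from statistics import mean


def moving_average_forecast(history, window):
    """Forecast the next value as the mean of the last `window` observations."""
    return mean(history[-window:])


def walk_forward_mae(series, window, n_test):
    """One-step-ahead mean absolute error over the last `n_test` points.

    Each prediction only sees values that come before the point being
    predicted, so the evaluation never peeks into the future.
    """
    errors = []
    for i in range(len(series) - n_test, len(series)):
        prediction = moving_average_forecast(series[:i], window)
        errors.append(abs(series[i] - prediction))
    return mean(errors)


# Hypothetical daily demand values, invented for illustration.
demand = [20, 24, 19, 23, 21, 25, 20, 24, 21, 23, 19, 24]

# Parameter tuning: try several window sizes and keep the lowest error.
scores = {w: walk_forward_mae(demand, w, n_test=4) for w in (1, 2, 3, 4)}
best_window = min(scores, key=scores.get)
print(best_window, scores[best_window])
```

The important design choice is the walk-forward split: with time series data, a random train/test split would let the model learn from observations that come after the ones it is asked to predict.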
Then after the modeling, usually there is the deployment. So here, as a data scientist, most of the time you have already identified the best model for your environment, the best model that is also helping you answer the business questions that you start with. And at this point, you just need to deploy the models with the data pipeline to a production or a production-like environment in order to allow the other people in your company or externally to consume your model. So I really like to define the deployment stage as the part in which your machine learning algorithm is more or less becoming AI. Because most of the time when you deploy your model, you are creating what we call a REST URL, or also an API, that other data scientists, other customers can call, they can consume it, so that they can leverage the results from your model. So deployment is really a critical part of the data science life cycle.
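As an illustration of what sits behind such a REST endpoint, the following sketch shows a scoring function that accepts a JSON request body and returns a JSON response. It is an assumption-laden example, not the deployment approach discussed in the podcast: the request fields `history` and `horizon`, the function names, and the stand-in naive model are all hypothetical, and the web framework or cloud service that would route HTTP requests to `score` is left out.

```python
import json


def load_model():
    """Stand-in for loading a trained model (hypothetical naive forecaster)."""
    def predict(history, horizon):
        # Repeat the last observed value for every future step.
        return [history[-1]] * horizon
    return predict


MODEL = load_model()


def score(request_body: str) -> str:
    """Parse a JSON request, run the model, and return a JSON response."""
    try:
        payload = json.loads(request_body)
        history = [float(v) for v in payload["history"]]
        horizon = int(payload.get("horizon", 1))
        if not history or horizon < 1:
            raise ValueError("history must be non-empty and horizon >= 1")
    except (AttributeError, KeyError, TypeError, ValueError) as exc:
        # Malformed input is reported to the caller instead of crashing.
        return json.dumps({"error": str(exc)})
    return json.dumps({"forecast": MODEL(history, horizon)})


response = score('{"history": [10, 12, 13], "horizon": 2}')
print(response)
```

Keeping the scoring logic in a plain function like this makes it easy to test before wiring it into whatever serving layer the production environment uses.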
And finally there is the customer acceptance. So this is the fifth step. It is about finalizing what I call the project deliverable. So here, of course, you have to make sure that everything's working. You have to confirm that the pipeline and the model, the deployment of the model is ready, using a production environment that is working. And also the most important step here is to make sure that it satisfies the customer objectives, the business objectives. So the customer acceptance is really about making sure that the business objectives, the business understanding that was decided at the beginning, are met. So you were able to reach those objectives through your solution and through your data pipeline.
So again, I refer to these five important steps: business understanding, the big one; the second is data acquisition and understanding; then we have the modeling; the fourth one is deployment; and the fifth one, that is the final one, is the customer acceptance. So I always refer to those five steps as the data science life cycle. And I usually encourage every customer and every data scientist in my team to follow these steps.
Data science life cycle process use cases [10:22]
Srini Penchikala: Do you have any specific use cases that you can briefly discuss from your recent projects, how you are able to use this data science life cycle process you just discussed?
Francesca Lazzeri: Honestly, we use this data science life cycle also internally at Microsoft when we develop new features, and also when we try to somehow improve the experience of our customers. Like, something we have been working on for a while now at Microsoft, I think we started this project in 2018, is automated machine learning that is also called automated ML or AutoML. And automated machine learning is really the process of automating this time-consuming and iterative task of machine learning model development. So with the automated machine learning, basically what you're going to do is really trying to understand what is the business question, the business goal that you're trying to follow with your data. And then automated machine learning is going to help you with the data featurization. So all the feature engineering part with the model selection, model training, and also of course the model evaluation. Because in order to achieve a model selection, you need to go through model training and model evaluation, and then it's going to help you also with the deployment.
It's a sort of acceleration of this data science life cycle that I was mentioning to you, and it now allows data scientists to leverage all these steps that are in the data science life cycle, like I mentioned, especially in terms of feature engineering, model training, model evaluation, model selection, and then the deployment. What is nice about automated machine learning, which again is a concept that we have been developing through the data science life cycle and also through a research project from Microsoft, is the fact that it accelerates this time-consuming process of creating models. And what is nice now about automated machine learning is that it supports three different scenarios in machine learning. Those scenarios are classification, regression and forecasting. When you want to train and also tune a model, you can specify a specific target metric based on these three different scenarios. Of course then you have to select one scenario, and then you can automate this process.
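The core idea of automating model selection against a chosen target metric can be sketched in a few lines. The example below is only a toy illustration and does not reflect how Microsoft's automated machine learning is implemented: the three candidate forecasters, the metric functions, the `select_model` helper and the data are all assumptions made up for this sketch.

```python
from statistics import mean


def naive(history):
    """Predict that the next value equals the last one."""
    return history[-1]


def overall_mean(history):
    """Predict the average of everything seen so far."""
    return mean(history)


def drift(history):
    """Extend the average step between the first and last observation."""
    return history[-1] + (history[-1] - history[0]) / (len(history) - 1)


def mae(errors):
    return mean([abs(e) for e in errors])


def rmse(errors):
    return mean([e * e for e in errors]) ** 0.5


def select_model(series, candidates, metric, n_valid):
    """Score every candidate on the last `n_valid` points and pick the best."""
    results = {}
    for name, model in candidates.items():
        errors = [
            series[i] - model(series[:i])
            for i in range(len(series) - n_valid, len(series))
        ]
        results[name] = metric(errors)
    return min(results, key=results.get), results


# Hypothetical monthly sales with a steady upward trend.
sales = [100, 104, 109, 113, 118, 122, 127, 131]
candidates = {"naive": naive, "mean": overall_mean, "drift": drift}

best, results = select_model(sales, candidates, metric=mae, n_valid=3)
print(best, results)
```

Swapping `metric=mae` for `metric=rmse` changes the target metric without touching the search loop, which is the part an automated system takes off the data scientist's hands.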
Just to give you a little bit more information on these: classification, I'm sure that our listeners are very familiar with this concept. Classification is really a common machine learning task, because it's a type of supervised learning that you use when you want to build models that learn using training data, and also apply those learnings to new data. So classification usually is an algorithm that allows you to identify if a specific variable result is going to be between two classes or between multiple classes. So as you can understand from this simple and basic explanation of this algorithm, it is a very common scenario in machine learning.
The second scenario that I was trying to explain, and that is supported right now by automated machine learning, is regression, which is also a very common supervised learning task. Regression is different from classification: in classification the predicted output values are categorical, while regression models predict numerical output values that are based on independent predictors. So again, these are two very similar scenarios, classification and regression, and they are very common in any data science and machine learning use case.
I have to say that automated machine learning has been used a lot by our customers in scenarios such as time series forecasting, because building forecasts, as you can understand, is really a crucial part of any business. Whether it's revenue, inventory, sales or customer demand, you always want to predict something that is going to happen in the future. And most of our customers have data that has this time component, and this makes the data set time series data. And you can now use automated machine learning to combine different techniques and approaches, and get these very good, very high quality time series forecasts.
So in general, automated machine learning is a research project, as I was saying, that started with Microsoft Research. We applied the data science life cycle through the design of this automated machine learning feature. And we have been seeing many data scientists, analysts and developers across different industries using automated machine learning to implement machine learning solutions without programming knowledge. There are also some data scientists that just prefer to accelerate the data science life cycle by leveraging this new feature, and of course the time that they spend building and programming the actual models is reduced. We saw that also our customers are saving a lot of time and resources with this. And finally, the most important piece of automated machine learning, in my opinion, is that as a data scientist, you can leverage data science best practices, because you have access to all these different options in terms of algorithms, but also model selection, model tuning and also featurization. So it's a very nice research project that we have been working on for the past few years.
Time series forecasting and its importance in overall machine learning efforts [16:03]
Srini Penchikala: Yes, definitely. Automated machine learning and time series data management and analytics, both topics have been getting a lot of attention these days. I know you've been working on time series forecasting as one of the main focus areas. Before we discuss the details of that process and techniques, I know you briefly mentioned, in the previous response, but for our listeners who are new to this topic, can you please define what is time series forecasting, and why it is important in the overall machine learning efforts?
Francesca Lazzeri: This is actually a great question and thanks for asking this question, Srini, because most of the time I'm talking to other data scientists and developers, and I assume that people immediately understand the time series forecasting and time series data. But it's important to start from the data, as you said, and understand very well, what is the data that we're dealing with.
So time series is a really unique type of data, but it's very common in many different data sets and industries. So time series is a type of data that measures how things and events in general, they change over time. Usually in a time series data set, the time column, so the time variable, is actually not a variable per se. So I call it the time column because you can, of course, use it for generating additional variables, but it's not a variable per se. It is actually a primary structure of your data set, and you can use it in order to improve and to, I would say, enrich your data set. Because it represents this primary temporal structure that makes time series problem more challenging as data scientists, because you need to apply specific data processing and specific feature engineering techniques to handle this time series data. However, this temporal structure also represents a source of additional information, additional knowledge that data scientists can use in order to actually create additional variables.
For example, you can learn how to leverage this temporal information to extrapolate insights from your time series data like trends, like seasonality information. And this is going to allow you, as a data scientist, to make your time series easier to model, and also to use it for future strategy and also planning operations in different industries. Honestly, I have been seeing customers from finance, to manufacturing, to healthcare, marketing as well. So time series forecasting as a whole has played a major role, I would say a lot, in the business insight with respect to time. So it's really something that you don't find only in finance, as I was saying, but you will now find time series data in many different industries. Also IoT is another common scenario.
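The trend and seasonality mentioned here can be estimated with a classical additive decomposition. The sketch below is a simplified illustration under stated assumptions: the series is invented, the seasonal period is known in advance and odd, and the helper names are hypothetical. Established libraries offer more complete decomposition routines.

```python
from statistics import mean


def centered_moving_average(series, period):
    """Estimate the trend with a centered moving average (odd `period`)."""
    half = period // 2
    trend = [None] * len(series)  # edges stay None: not enough neighbours
    for i in range(half, len(series) - half):
        trend[i] = mean(series[i - half:i + half + 1])
    return trend


def seasonal_profile(series, trend, period):
    """Average the detrended values at each position in the cycle.

    Assumes the series is long enough that every position in the cycle
    has at least one point with a trend estimate.
    """
    buckets = [[] for _ in range(period)]
    for i, (value, level) in enumerate(zip(series, trend)):
        if level is not None:
            buckets[i % period].append(value - level)
    return [mean(b) for b in buckets]


# Invented series: a rising level plus a repeating three-step pattern.
observations = [12, 10, 11, 15, 13, 14, 18, 16, 17]

trend = centered_moving_average(observations, period=3)
profile = seasonal_profile(observations, trend, period=3)
print(trend)
print(profile)
```

On this toy data the trend rises by one unit per step and the seasonal profile comes out as a bump of 2 at the start of each cycle followed by two dips of 1, which is exactly the kind of structure that can then be fed into a model as extra information.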
And some of the questions that you can answer by leveraging time series data are, for example: what are going to be the sales volumes of different items, of different products in different stores in the next week, or in the next four months, for example, if you are building a quarterly time series forecasting model? Also, what are the passenger numbers of different international airlines for next week or for next month? So as you can see, it's really a type of data that allows you to forecast something, to predict an event that is really related, really attached to this temporal structure.
Best practices for machine learning applications on time series forecasting [19:28]
Srini Penchikala: Definitely. And you have also written a book on this topic. It's titled "Machine Learning for Time Series Forecasting with Python." Can you talk about what are some best practices when developing a machine learning based applications for time series forecasting? Maybe highlight some best practices from the book you authored.
Francesca Lazzeri: Of course. So in 2020, last year, I wrote this book. The title is Machine Learning for Time Series Forecasting with Python. I published it with Wiley. And in this book, honestly, I focus a lot of my attention on Python. So what are the different Python packages that data scientists can leverage in order to build end-to-end time series forecasting solutions. However, I don't only focus on the modeling part, I focus also on the data preparation and the deployment part. Because as I discuss through the book, I always think about data scientists as professionals that are able to build end-to-end solutions. So they understand a little bit the business of the scenario that they are building. And then they, of course, prepare the data and do the feature engineering part. They build the model, but they are also responsible for the deployment of the model and the operationalization of the end-to-end time series forecasting solutions.
So some of the tips and tricks that I try to share in the book are also related to the end-to-end format that, in my opinion, data scientists should follow, in order to build successful time series solutions. So I think it's in chapter two you find best practices such as how to design an end-to-end time series forecasting solution. And I introduce some key techniques for building machine learning forecasting solutions such as, for example, how you can define the business understanding and the performance metrics definition. Because this is really an aspect that you need to go through in order to understand how to qualify the business problem, and to ensure that your predictive analytics and machine learning techniques are effective and applicable for that specific time series problem.
Then I also share some tips and tricks in terms of data ingestion. Here I explain how you can collect an important time series data, and how you can analyze this data, and how you can also manage different structured and unstructured data, in order to then discover what I call the real time insights. So the data ingestion part for a time series solution is very unique, and you need to be a little bit of an expert of time series data in order to make sure that you are building that step in a clear and healthy way.
Then I also share a little bit of my experience in terms of data exploration and understanding. This is more about taking your raw time series data and converting it into a format that can be used for data cleaning and feature engineering. And the most, I would say, interesting part of time series data and time series forecasting is really the data pre-processing and feature engineering step. Here I spend a lot of time showing different examples of how you can apply different types of techniques to your time series data, and how you need to do a few things. Like for example, you need to be able to leverage information such as trend and seasonality in your time series data so that you can build additional variables from your time series data. And those variables are going to help you in building a better and, I would say, more accurate time series model.
And then in this section I also try to share different examples of feature engineering for time series data. There are many different categories of feature engineering for time series data. There is, for example, the time-based feature engineering process that is about using the timestamp column that I was mentioning at the beginning of our conversation, and leveraging that to build additional variables such as, for example: is it a holiday or not; is it afternoon, morning or night; is it winter or summer. So you can understand that by creating those additional variables, you are somehow extracting additional insights from your time column, and you can use these insights to understand if there is a seasonality in your data set, if you can predict something in the future leveraging this seasonality, and so on. And so this is a very nice aspect of time series data where you need a little bit of creativity as a data scientist, because if you're really able to extract this type of time knowledge from your data set, then the modeling part is going to become much easier.
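The time-based features described above (holiday or not, part of the day, season) can be derived directly from a timestamp. The following sketch is not taken from the book; the holiday calendar, the hour boundaries, and the northern-hemisphere season mapping are assumptions chosen for illustration, and `time_features` is a hypothetical helper name.

```python
from datetime import date, datetime

# Hypothetical holiday calendar; a real project would load one for its region.
HOLIDAYS = {date(2021, 1, 1), date(2021, 12, 25)}

# Northern-hemisphere meteorological seasons, indexed by month - 1 (assumption).
SEASONS = (
    "winter", "winter", "spring", "spring", "spring", "summer",
    "summer", "summer", "autumn", "autumn", "autumn", "winter",
)


def part_of_day(hour):
    """Bucket an hour of the day; the boundaries are arbitrary choices."""
    if 5 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "afternoon"
    return "night"


def time_features(ts: datetime) -> dict:
    """Turn one timestamp into several candidate model features."""
    return {
        "hour": ts.hour,
        "day_of_week": ts.weekday(),  # Monday is 0, Sunday is 6
        "is_weekend": ts.weekday() >= 5,
        "is_holiday": ts.date() in HOLIDAYS,
        "part_of_day": part_of_day(ts.hour),
        "season": SEASONS[ts.month - 1],
    }


features = time_features(datetime(2021, 12, 25, 14, 30))
print(features)
```

For the example timestamp, the function reports a holiday on a weekend afternoon in winter. Each of these fields can become a column next to the original measurements, so that a model can pick up calendar effects.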
Another important suggestion that I gave to my readers in this part of the book is also how you can transform your time series data set into a supervised learning problem. Here, there are different techniques that you can do. Again, most of the time, I suggest the readers to do this data pre-processing and feature engineering steps. So there are different techniques that you can use in order to pivot your data set. And so you can use your past values as variables to predict the future variables. And here, we are very lucky as data scientists because we have different Python packages and techniques and libraries that we can just apply in order to do this pivoting exercise and transform our time series data into supervised learning problems.
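One common way to do this pivoting is a sliding window: past values become the input features and a later value becomes the target. The sketch below assumes that approach; `make_supervised`, `n_lags` and `horizon` are hypothetical names introduced for illustration, and data frame libraries provide shift operations that achieve the same result.

```python
def make_supervised(series, n_lags, horizon=1):
    """Reframe a series as (lag features, target) rows.

    Each row uses the `n_lags` values before time t as features and the
    value `horizon` steps ahead of the last lag as the target.
    """
    rows = []
    for t in range(n_lags, len(series) - horizon + 1):
        features = series[t - n_lags:t]
        target = series[t + horizon - 1]
        rows.append((features, target))
    return rows


series = [10, 20, 30, 40, 50]
rows = make_supervised(series, n_lags=2)
for lags, target in rows:
    print(lags, "->", target)
# [10, 20] -> 30
# [20, 30] -> 40
# [30, 40] -> 50
```

Once the data is in this shape, any regression algorithm can be trained on it, which is what makes the wide range of general machine learning libraries usable for forecasting.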
And then other couple of suggestions that I give to my readers in the book are, for example, the model deployment. As I was mentioning at the beginning of this podcast, I find that the deployment piece is extremely important for any data science and machine learning solution, because it's really the moment in which you, as a data scientist, give other people, other data scientists the opportunity to leverage your machine learning solution.
And there are specific techniques that you can leverage in order to make sure that your time series model is deployed. Python again can help you with that. And you can make sure that you integrate the machine learning model into an existing production environment in order to start using it and to make practical business decisions based on your time series data. Again, this is more related to the final chapter, and I think it's chapter six of the book, where I show you, okay now that you have a very nice time series forecasting model, how you can make sure that using still Python, you can deploy the end, you can actually productionalize it so that your business or other data scientists can consume it.
So I try, Srini, to summarize all the tips and tricks that I gave other data scientists, data analysts, business experts in the book on how they can build the end-to-end successful time series forecasting model.
Time-series forecasting technologies [26:24]
Srini Penchikala: Yes, sounds good, Francesca. Definitely very good information. You mentioned Python as one of the languages. Do you recommend any other technologies like TensorFlow or anything else for developers who are probably not experts in Python, but they want to try to learn new technologies?
Francesca Lazzeri: I think that TensorFlow, Keras, and also Python, those are all open source frameworks that data scientists can leverage. And the beauty of this is that they are very strong frameworks, very stable. They are supported by the community, and so you don't really need to write those algorithms from scratch. You can just leverage those frameworks. And then since they are open source, you can start with those. And then while you are developing your solution, and it can be any machine learning and data science solution, it doesn't have to be a time series forecasting one, you can customize it as much as you want, because again it's Python. So you can absolutely leverage those frameworks and then customize your solution for your own problem.
As you said, TensorFlow is a great one. Another one that I have been using a lot is PyTorch. PyTorch is interesting because as you know, it's a framework that was developed by Facebook. And at the beginning, it was really used a lot in academia. You couldn't see it too much in the industry. This was a few years ago. But I have to say, in the last year, it has become a really good framework. And now I have been seeing a lot of companies, developers, data scientists in the industry leveraging PyTorch. So I would say that PyTorch is also another framework that data scientists should look at. There are a couple of good trainings for PyTorch that you can find on the PyTorch website. Also in terms of additional applications, computer vision is one, NLP is another application. And both of them, I know, have been very popular applications leveraging the PyTorch framework.
Another great Python library that I know is very good for time series forecasting solutions is Prophet. In the book, unfortunately, I didn't have time and space, I would say, to talk about Prophet. But I use that open source package a lot for time series forecasting. So probably I should write another book where I'm going just to focus on Prophet and how you can use it to build end-to-end forecasting solutions.
Deep learning for time series forecasting [28:48]
Srini Penchikala: Yes. You mentioned NLP in the previous response. So can you quickly talk about how can we use deep learning techniques for time series forecasting? In addition to automated machine learning, does deep learning bring any additional value to time-series forecasting problems?
Francesca Lazzeri: There is a chapter in my book in which I talk only about deep learning techniques for time series forecasting. So I start the book with just some tips and tricks from an architectural point of view, for time series forecasting solutions. Then I introduce the time series preparation and data exploration. Then I spend a lot of time in the book to introduce classical approaches to time series forecasting such as ARIMA, SARIMA and SARIMAX, and then I get finally to the deep learning part.
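For readers new to the classical approaches, the autoregressive part of ARIMA can be illustrated with a first-order model, y_t = c + phi * y_(t-1), fitted by ordinary least squares. This sketch is deliberately limited: it is not full ARIMA (there is no differencing and no moving-average term), the data is invented so that the fit is exact, and `fit_ar1` and `forecast_ar1` are hypothetical helper names. In practice a statistics library would be used instead.

```python
from statistics import mean


def fit_ar1(series):
    """Least-squares fit of y_t = c + phi * y_(t-1)."""
    x = series[:-1]  # previous values
    y = series[1:]   # current values
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((a - x_bar) ** 2 for a in x)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    phi = sxy / sxx
    c = y_bar - phi * x_bar
    return c, phi


def forecast_ar1(last_value, c, phi, steps):
    """Iterate the fitted equation to forecast several steps ahead."""
    forecasts = []
    value = last_value
    for _ in range(steps):
        value = c + phi * value
        forecasts.append(value)
    return forecasts


# Invented series generated by y_t = 2 + 0.5 * y_(t-1), starting at 20.
series = [20, 12, 8, 6, 5, 4.5]

c, phi = fit_ar1(series)
forecasts = forecast_ar1(series[-1], c, phi, steps=2)
print(round(c, 6), round(phi, 6), [round(f, 6) for f in forecasts])
```

Because the toy data follows the equation exactly, the fit recovers c close to 2 and phi close to 0.5. Real data is noisy, and choosing the model order and handling seasonality is where the SARIMA and SARIMAX variants come in.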
So, yes there are a few techniques that have been very successful in the last few years. So many data scientists have been applying those techniques to time series forecasting solutions. I have to say that I have been applying techniques such as recurrent neural networks such as GRU, but also LSTM. LSTM, by the way, stands for long short-term memory. And those algorithms, which are still a type of recurrent neural network, are very good when you're applying them to time series forecasting. In the book, I also show you how you can leverage some Python packages. You mentioned, Srini, TensorFlow, and I show you how you can leverage those open-source packages to apply deep learning techniques such as recurrent neural networks to time series forecasting.
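What makes a recurrent network suited to sequences is that it carries a hidden state from one time step to the next. The sketch below shows the forward pass of the simplest possible recurrent cell with scalar weights. It is not an LSTM or a GRU, which add gating on top of this idea, and the weights are hand-picked for illustration rather than learned; frameworks such as TensorFlow or PyTorch provide trainable versions.

```python
import math


def rnn_forward(inputs, w_x, w_h, b):
    """Run a scalar recurrent cell over a sequence.

    h_t = tanh(w_x * x_t + w_h * h_(t-1) + b), starting from h_0 = 0.
    Returns the hidden state after every time step.
    """
    hidden = 0.0
    states = []
    for x in inputs:
        hidden = math.tanh(w_x * x + w_h * hidden + b)
        states.append(hidden)
    return states


# Hypothetical scaled sensor readings and hand-picked weights.
readings = [0.5, -0.2, 0.1]
states = rnn_forward(readings, w_x=1.0, w_h=0.5, b=0.0)
print(states)
```

Each state depends on the current input and on everything that came before it, which is how the network can summarize the recent history of a series when producing a forecast.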
Another very successful application of deep learning is convolutional neural networks. In the last few years, many data scientists have been applying convolutional neural networks to time series data, specifically in IoT and smart grid, with the energy sector. In these scenarios you have a lot of data, and most of the time the intervals, the granularity of your time series data is also like every five seconds. So you can imagine that you have a pretty good amount of data. And so techniques such as convolutional neural networks, which most of the time had been applied only to scenarios such as computer vision for example, have been showing great results also for time series applications.
So yes, I would say that, again it depends on your own business problem, your own time series data set of course. But if you have the right data and the right opportunity, I would say that both recurrent neural networks and convolutional neural networks have been pretty successful for time series scenarios.
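To show why convolutions transfer from images to time series, here is a minimal sketch of a one-dimensional convolution sliding over a sequence of readings. The kernel is hand-picked to respond to upward jumps, whereas a real convolutional network would learn its kernels during training; the readings and function names are assumptions for illustration.

```python
def conv1d(series, kernel):
    """Valid (no padding) 1-D cross-correlation, as used in convolutional layers."""
    k = len(kernel)
    return [
        sum(w * series[i + j] for j, w in enumerate(kernel))
        for i in range(len(series) - k + 1)
    ]


def relu(values):
    """Keep positive activations and zero out the rest."""
    return [max(0.0, v) for v in values]


# Hypothetical sensor readings with a step up and then back down.
readings = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 1.0]

# Hand-picked kernel that responds to increases across a three-step window.
rise_kernel = [-1.0, 0.0, 1.0]

feature_map = relu(conv1d(readings, rise_kernel))
print(feature_map)  # [0.0, 4.0, 4.0, 0.0, 0.0]
```

The same small kernel is reused at every position, so the layer detects a pattern wherever it occurs in the series. This weight sharing is part of why the approach scales to high-frequency data such as readings taken every few seconds.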
Recommendations for machine learning and time series forecasting [31:24]
Srini Penchikala: Before we wrap up this podcast, Francesca, do you have any additional recommendations for our listeners who want to learn more about machine learning in general, and the time series forecasting topic in particular?
Francesca Lazzeri: I always like to read at least two or three scientific articles per week on time series forecasting applications. And I usually publish those types of articles on my Twitter, because I think it is important to share this type of information in the industry. So if people want to follow me to see the type of articles that I have been reading, and also what I have been studying to improve my time series forecasting models, feel free to follow me. My handle on Twitter is F like Francesca, R like Roman. So it's the first two letters of my name, Francesca, and then my last name, Lazzeri. So it's @frlazzeri. And again, I always publish a lot of GitHub repos about time series forecasting, and the scientific articles.
I think that the best way for people to learn is not really to take like a PhD in machine learning or in artificial intelligence. It's more about really trying to understand what is the time series problem that they want to solve; it can be also a side problem, and then look at the community and understand what other people are sharing in terms of materials, content and so on.
Honestly, a resource where I usually go a lot to read articles is InfoQ. So when I was referring to those articles that I read, most of them, I would say 90% of them, are from InfoQ. And InfoQ really represents a great source for the community. So really, following people on Twitter, looking at GitHub repos, and reading articles from InfoQ is the best way, in my opinion, to stay updated and learn new things.
Final thoughts and wrap-up [33:19]
Srini Penchikala: Thank you for promoting InfoQ as one of the resources. Definitely, we try to write and publish these types of postings from subject matter experts like you. It's definitely practitioners like you helping the community learn about new things.
Francesca, thank you very much for joining this podcast. It's been great to discuss with you some of the emerging trends in ML space, like time series data management and forecasting, automated machine learning, and more importantly, the data science lifecycle that brings structure, governance and rigor to ML projects.
Do you have any final thoughts before we wrap up this podcast?
Francesca Lazzeri: It was just a pleasure and I look forward to other podcasts, other conferences and other articles on InfoQ. Thank you so much for inviting me today.
Srini Penchikala: Yes, thank you Francesca. Thank you.
You can keep up-to-date with the podcasts via our RSS Feed, and they are available via SoundCloud, Apple Podcasts, Spotify, Overcast and the Google Podcast. From this page you also have access to our recorded show notes. They all have clickable links that will take you directly to that part of the audio.