# `CREATE MODEL` statement for time series models

To create time series models in BigQuery, use the BigQuery ML `CREATE MODEL` statement and specify `MODEL_TYPE` to be `'ARIMA_PLUS'`.

For information about the supported model types of each SQL statement and function, and all supported SQL statements and functions for each model type, read End-to-end user journey for each model.

### BigQuery ML time series modeling pipeline

The BigQuery ML time series modeling pipeline includes multiple modules. The ARIMA model is the most computationally expensive module, which is why the model is named `ARIMA_PLUS`.

The modeling pipeline for the BigQuery ML time series includes the following functionalities:

- Infer the data frequency of the time series.
- Handle irregular time intervals.
- Handle duplicated timestamps by taking the mean value.
- Interpolate missing data using local linear interpolation.
- Detect and clean spike and dip outliers.
- Detect and adjust abrupt step (level) changes.
- Detect and adjust holiday effect.
- Detect multiple seasonal patterns within a single time series via Seasonal and Trend decomposition using Loess (STL), and extrapolate seasonality via double exponential smoothing (ETS).
- Detect and model the trend using the ARIMA model and the auto.ARIMA algorithm for automatic hyperparameter tuning. In auto.ARIMA, dozens of candidate models are trained and evaluated in parallel. The best model comes with the lowest Akaike information criterion (AIC).
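As background for the AIC-based selection in the last step: AIC = 2k − 2·ln(L̂), where k is the number of fitted parameters and L̂ is the maximized likelihood, and the candidate with the lowest AIC wins. The following is a minimal sketch of that selection rule; the candidate orders and log-likelihood values are hypothetical, not BigQuery ML internals:

```python
def aic(num_params: int, log_likelihood: float) -> float:
    """Akaike information criterion: 2k - 2 * ln(L-hat)."""
    return 2 * num_params - 2 * log_likelihood

# Hypothetical non-seasonal (p, d, q) candidates with their fitted
# log-likelihoods; real values would come from the ARIMA fitting step.
candidates = {
    (1, 1, 0): -120.4,
    (0, 1, 1): -118.9,
    (2, 1, 2): -117.5,
}

# Count p + q as the fitted parameters (d is a differencing choice, not a
# fitted coefficient), then keep the lowest-AIC model, as auto.ARIMA does.
scores = {(p, d, q): aic(p + q, ll) for (p, d, q), ll in candidates.items()}
best = min(scores, key=scores.get)
print(best)  # the lowest-AIC candidate
```

Note that a better raw fit (higher L̂) does not automatically win: the `2k` term penalizes the extra parameters of the (2, 1, 2) candidate here.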

### BigQuery ML large-scale time series

Up to 100,000,000 time series can be forecast simultaneously with a single query using the `TIME_SERIES_ID_COL` option. Different modeling pipelines run in parallel, assuming that enough slots are available.

### Large-scale time series forecasting best practices

To learn more, see Scalable forecasting with millions of time series in BigQuery.

- When you have a large number (for example, 100,000) of time series to forecast, we highly recommend that you first forecast a small batch of time series (for example, 1,000) to see how long the query takes. You can then estimate how long your entire time series forecast will take.
- You can use the `AUTO_ARIMA_MAX_ORDER` option to balance the query runtime against the forecast accuracy. Increasing `AUTO_ARIMA_MAX_ORDER` expands the hyperparameter search space to try more complex ARIMA models, that is, models with higher non-seasonal p and q. Decreasing the value of `AUTO_ARIMA_MAX_ORDER` can therefore accelerate the query. For example, if you use 3 instead of the default value of 5 for this option, the query runtime is reduced by at least 50%. The forecast accuracy might drop slightly for some of the time series. If a shorter training time is important for your case, use a smaller value for `AUTO_ARIMA_MAX_ORDER`.
- The model training time for each time series has a linear relationship to its length, or the number of data points. The longer the time series, the longer the training takes. On the other hand, for time series forecasting, not all data points contribute equally to the model fitting process. Instead, the more recent a data point is, the more it contributes. Therefore, if you have a long time series, for example ten years of daily data, you don't need to train a time series model using all the data points. The most recent two or three years of data points are enough.
- You can use the `TIME_SERIES_LENGTH_FRACTION`, `MIN_TIME_SERIES_LENGTH`, and `MAX_TIME_SERIES_LENGTH` training options to enable smart, fast model training with little or no loss of forecasting accuracy. The idea behind this is that while periodic modeling such as seasonality requires a certain number of time points, trend modeling doesn't need many time points. Meanwhile, trend modeling is much more computationally expensive than the other time series components. By using these training options, you can efficiently model the trend component with a subset of the time series, while the other time series components use the entire time series.
- To avoid a single long-running query, you can also use BigQuery multi-statement queries.
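As one way to follow the small-batch recommendation above, you could first train on a limited set of series IDs. The following is a hedged sketch; the project, table, and column names are placeholders:

```sql
-- Pilot run: train on roughly 1,000 of the series first to gauge runtime.
-- `myproject.mydataset.mytable` and its columns are hypothetical names.
CREATE OR REPLACE MODEL `myproject.mydataset.pilot_model`
OPTIONS(
  MODEL_TYPE='ARIMA_PLUS',
  time_series_timestamp_col='date',
  time_series_data_col='transaction',
  time_series_id_col='company_name') AS
SELECT date, transaction, company_name
FROM `myproject.mydataset.mytable`
WHERE company_name IN (
  SELECT DISTINCT company_name
  FROM `myproject.mydataset.mytable`
  LIMIT 1000
)
```

Timing this pilot query gives a rough per-series cost you can scale up to estimate the full run.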

## `CREATE MODEL` syntax

```sql
{CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL} model_name
OPTIONS(MODEL_TYPE = 'ARIMA_PLUS'
  [, TIME_SERIES_TIMESTAMP_COL = string_value ]
  [, TIME_SERIES_DATA_COL = string_value ]
  [, TIME_SERIES_ID_COL = { string_value | string_array } ]
  [, HORIZON = int64_value ]
  [, AUTO_ARIMA = { TRUE | FALSE } ]
  [, AUTO_ARIMA_MAX_ORDER = int64_value ]
  [, NON_SEASONAL_ORDER = (int64_value, int64_value, int64_value) ]
  [, DATA_FREQUENCY = { 'AUTO_FREQUENCY' | 'PER_MINUTE' | 'HOURLY' | 'DAILY' | 'WEEKLY' | 'MONTHLY' | 'QUARTERLY' | 'YEARLY' } ]
  [, INCLUDE_DRIFT = { TRUE | FALSE } ]
  [, HOLIDAY_REGION = string_value | string_array ]
  [, CLEAN_SPIKES_AND_DIPS = { TRUE | FALSE } ]
  [, ADJUST_STEP_CHANGES = { TRUE | FALSE } ]
  [, DECOMPOSE_TIME_SERIES = { TRUE | FALSE } ]
  [, TIME_SERIES_LENGTH_FRACTION = float64_value ]
  [, MIN_TIME_SERIES_LENGTH = int64_value ]
  [, MAX_TIME_SERIES_LENGTH = int64_value ]
  [, TREND_SMOOTHING_WINDOW_SIZE = int64_value ])
AS query_statement
```

### CREATE MODEL

Creates a new BigQuery ML model in the specified dataset. If the model name already exists, `CREATE MODEL` returns an error.

### CREATE MODEL IF NOT EXISTS

Creates a new BigQuery ML model only if the model does not currently exist in the specified dataset.

### CREATE OR REPLACE MODEL

Creates a new BigQuery ML model and replaces any existing model with the same name in the specified dataset.

### model_name

`model_name` is the name of the BigQuery ML model that you're creating or replacing. The model name must be unique per dataset: no other model or table can have the same name. The model name must follow the same naming rules as a BigQuery table. A model name can contain the following:

- Up to 1,024 characters
- Letters of either case, numbers, and underscores

`model_name` is not case-sensitive.

If you do not have a default project configured, prepend the project ID to the model name in the following format, including backticks:

`[PROJECT_ID].[DATASET].[MODEL]`

For example:

`myproject.mydataset.mymodel`

`CREATE MODEL` supports the following options:

### MODEL_TYPE

**Syntax**

`MODEL_TYPE = 'ARIMA_PLUS'`

**Description**

Specifies the model type. To create a time series model, set `model_type` to `'ARIMA_PLUS'`.

### model_option_list

In the `model_option_list`, the `model_type`, `time_series_timestamp_col`, and `time_series_data_col` options are always required. Other options are only required in certain scenarios; see the details below.

Time series models support the following options:

#### TIME_SERIES_TIMESTAMP_COL

**Syntax**

`TIME_SERIES_TIMESTAMP_COL = string_value`

**Description**

The timestamp column name for time series models.

**Arguments**

`string_value` is a `STRING`.

#### TIME_SERIES_DATA_COL

**Syntax**

`TIME_SERIES_DATA_COL = string_value`

**Description**

The data column name for time series models.

**Arguments**

`string_value` is a `STRING`.

#### TIME_SERIES_ID_COL

**Syntax**

`TIME_SERIES_ID_COL = { string_value | string_array }`

**Description**

The ID column names for time series models. These columns are used when the userwants to fit and forecast multiple time series using a single query. DifferentIDs indicate different time series.

**Arguments**

This can be either of the following:

- `string_value`: a `STRING`
- `string_array`: an array of `STRING`s

#### HORIZON

**Syntax**

`HORIZON = int64_value`

**Description**

The number of time points to forecast. When forecasting multiple time-series atonce, this parameter applies to each time series.

**Arguments**

The value is an `INT64`. The default value is 1,000. The maximum value is 10,000.

#### AUTO_ARIMA

**Syntax**

`AUTO_ARIMA = { TRUE | FALSE }`

**Description**

Whether the training process should use auto.ARIMA or not. If `TRUE`, training automatically finds the best non-seasonal order (that is, the p, d, q tuple) and decides whether to include a linear drift term when d is 1. If `FALSE`, the user must specify `non_seasonal_order` in the query. When forecasting multiple time series at the same time, the auto.ARIMA algorithm must be used for each time series, so this option must not be set to `FALSE`.

**Arguments**

The value is a `BOOL`. The default value is `TRUE`.

#### AUTO_ARIMA_MAX_ORDER

**Syntax**

`AUTO_ARIMA_MAX_ORDER = int64_value`

**Description**

The maximum value for the sum of non-seasonal p and q. It controls the parameter search space in the auto.ARIMA algorithm. Currently, the allowed values are (1, 2, 3, 4, 5). As a reference, for each value there are (3, 6, 10, 15, 21) candidate models to evaluate if non-seasonal d is determined to be 0 or 2. If non-seasonal d is determined to be 1, the number of candidate models to evaluate doubles, because there is an additional drift term to consider for all the existing candidate models. This option is disabled when `AUTO_ARIMA` is set to `FALSE`.

**Arguments**

The value is an `INT64`. The default value is 5. The minimum value is 1 and the maximum value is 5.
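The candidate counts quoted above follow from counting the non-seasonal (p, q) pairs with p + q ≤ `AUTO_ARIMA_MAX_ORDER`; a quick sketch to reproduce them:

```python
def num_candidates(max_order: int, d_is_one: bool = False) -> int:
    """Count (p, q) pairs with p + q <= max_order. When non-seasonal d
    is 1, each candidate also gets a with-drift variant, doubling the count."""
    pairs = sum(
        1
        for p in range(max_order + 1)
        for q in range(max_order + 1)
        if p + q <= max_order
    )
    return 2 * pairs if d_is_one else pairs

print([num_candidates(m) for m in range(1, 6)])  # [3, 6, 10, 15, 21]
print([num_candidates(m, d_is_one=True) for m in range(1, 6)])  # doubled when d is 1
```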

#### NON_SEASONAL_ORDER

**Syntax**

`NON_SEASONAL_ORDER = (int64_value, int64_value, int64_value)`

**Description**

The tuple of non-seasonal p, d, q for the `ARIMA_PLUS` model. There are no default values and you must specify all of them. You must explicitly set `auto_arima` to `FALSE` to use this option. Currently, p and q are restricted to [0, 1, 2, 3, 4, 5] and d is restricted to [0, 1, 2]. When forecasting multiple time series at the same time, because the auto.ARIMA algorithm must be used for each time series, this option is disabled.

**Arguments**

`(int64_value, int64_value, int64_value)` is a tuple of three `INT64` values.
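For example, a single-series model that fixes the order instead of using auto.ARIMA might look like the following sketch; the table and column names are placeholders, and `(2, 1, 1)` is an arbitrary choice:

```sql
-- `myproject.mydataset.mytable` and its columns are hypothetical names.
CREATE MODEL `myproject.mydataset.fixed_order_model`
OPTIONS(
  MODEL_TYPE='ARIMA_PLUS',
  time_series_timestamp_col='date',
  time_series_data_col='transaction',
  AUTO_ARIMA=FALSE,
  NON_SEASONAL_ORDER=(2, 1, 1)) AS
SELECT date, transaction
FROM `myproject.mydataset.mytable`
```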

#### DATA_FREQUENCY

**Syntax**

`DATA_FREQUENCY = { 'AUTO_FREQUENCY' | 'PER_MINUTE' | 'HOURLY' | 'DAILY' | 'WEEKLY' | 'MONTHLY' | 'QUARTERLY' | 'YEARLY' }`

**Description**

The data frequency of the input time series. The finest supported granularity is `'PER_MINUTE'`. When forecasting multiple time series at once, this argument applies to all individual time series.

**Arguments**

Accepts the following values:

- `'AUTO_FREQUENCY'`: the training process automatically infers the data frequency, which can be one of the values listed below.
- `'PER_MINUTE'`: per-minute time series
- `'HOURLY'`: hourly time series
- `'DAILY'`: daily time series
- `'WEEKLY'`: weekly time series
- `'MONTHLY'`: monthly time series
- `'QUARTERLY'`: quarterly time series
- `'YEARLY'`: yearly time series

The default value is `'AUTO_FREQUENCY'`.

#### INCLUDE_DRIFT

**Syntax**

`INCLUDE_DRIFT = { TRUE | FALSE }`

**Description**

Whether the `ARIMA_PLUS` model should include a linear drift term or not. The drift term is applicable when non-seasonal d is 1.

When `auto_arima` is set to `FALSE`, this argument defaults to `FALSE`. It can be set to `TRUE` only when non-seasonal d is 1; otherwise, the query returns an invalid query error.

When `auto_arima` is set to `TRUE`, training automatically decides whether to include a linear drift term. Therefore, this option is disabled for auto.ARIMA.

**Arguments**

The value is a `BOOL`. The default value is `FALSE` when `auto_arima` is disabled.

#### HOLIDAY_REGION

**Syntax**

`HOLIDAY_REGION = string_value | string_array`

**Description**

The geographical region based on which the holiday effect is applied in modeling. By default, holiday effect modeling is disabled. To turn it on, specify the holiday region using this option. The value can be a single region string or a list of region strings. If you include more than one region string, the union of the holidays in all the provided regions is taken into modeling.

**Arguments**

`HOLIDAY_REGION` is a polymorphic option that can be defined by a single string or an array of strings.

`string_value` is of type `STRING`. For example:

`HOLIDAY_REGION = 'GLOBAL'`

`string_array` is an `ARRAY` of type `STRING`, where each `STRING` is one of the supported region strings listed below. For example:

`HOLIDAY_REGION = ['US', 'UK']`

Accepts the following values:

*Top level: global*

- `'GLOBAL'`

*Second level: continental regions*

- `'NA'`: North America
- `'JAPAC'`: Japan and Asia Pacific
- `'EMEA'`: Europe, the Middle East and Africa
- `'LAC'`: Latin America and the Caribbean

*Third level: countries/regions*

- `'AE'`: United Arab Emirates
- `'AR'`: Argentina
- `'AT'`: Austria
- `'AU'`: Australia
- `'BE'`: Belgium
- `'BR'`: Brazil
- `'CA'`: Canada
- `'CH'`: Switzerland
- `'CL'`: Chile
- `'CN'`: China
- `'CO'`: Colombia
- `'CZ'`: Czechia
- `'DE'`: Germany
- `'DK'`: Denmark
- `'DZ'`: Algeria
- `'EC'`: Ecuador
- `'EE'`: Estonia
- `'EG'`: Egypt
- `'ES'`: Spain
- `'FI'`: Finland
- `'FR'`: France
- `'GB'`: United Kingdom
- `'GR'`: Greece
- `'HK'`: Hong Kong
- `'HU'`: Hungary
- `'ID'`: Indonesia
- `'IE'`: Ireland
- `'IL'`: Israel
- `'IN'`: India
- `'IR'`: Iran
- `'IT'`: Italy
- `'JP'`: Japan
- `'KR'`: South Korea
- `'LV'`: Latvia
- `'MA'`: Morocco
- `'MX'`: Mexico
- `'MY'`: Malaysia
- `'NG'`: Nigeria
- `'NL'`: Netherlands
- `'NO'`: Norway
- `'NZ'`: New Zealand
- `'PE'`: Peru
- `'PH'`: Philippines
- `'PK'`: Pakistan
- `'PL'`: Poland
- `'PT'`: Portugal
- `'RO'`: Romania
- `'RS'`: Serbia
- `'RU'`: Russia
- `'SA'`: Saudi Arabia
- `'SE'`: Sweden
- `'SG'`: Singapore
- `'SI'`: Slovenia
- `'SK'`: Slovakia
- `'TH'`: Thailand
- `'TR'`: Turkey
- `'TW'`: Taiwan
- `'UA'`: Ukraine
- `'US'`: United States
- `'VE'`: Venezuela
- `'VN'`: Vietnam
- `'ZA'`: South Africa

#### CLEAN_SPIKES_AND_DIPS

**Syntax**

`CLEAN_SPIKES_AND_DIPS = { TRUE | FALSE }`

**Description**

Whether or not to perform automatic spikes and dips detection and cleanup in the `ARIMA_PLUS` model training pipeline. The spikes and dips are replaced with local linear interpolated values when they are detected.

**Arguments**

The value is a `BOOL`. The default value is `TRUE`.

#### ADJUST_STEP_CHANGES

**Syntax**

`ADJUST_STEP_CHANGES = { TRUE | FALSE }`

**Description**

Whether or not to perform automatic step change detection and adjustment in the `ARIMA_PLUS` model training pipeline.

**Arguments**

The value is a `BOOL`. The default value is `TRUE`.

#### DECOMPOSE_TIME_SERIES

**Syntax**

`DECOMPOSE_TIME_SERIES = { TRUE | FALSE }`

**Description**

Whether the separate components of both the history and forecast parts of the time series (such as holiday effect and seasonal components) should be saved in the `ARIMA_PLUS` model.

**Arguments**

The value is a `BOOL`. The default value is `TRUE`.

#### TIME_SERIES_LENGTH_FRACTION

**Syntax**

`TIME_SERIES_LENGTH_FRACTION = float64_value`

**Description**

The fraction of the interpolated length of the time series that is used to model the time series trend component. All of the time points of the time series are used to model the non-trend component. For example, if the time series has 100 time points, then specifying a `TIME_SERIES_LENGTH_FRACTION` of 0.5 uses the most recent 50 time points for trend modeling. This training option accelerates model training without sacrificing much forecasting accuracy.

**Arguments**

`float64_value` is a `FLOAT64`. The value must be within (0, 1). The default behavior is to use all the points in the time series.

#### MIN_TIME_SERIES_LENGTH

**Syntax**

`MIN_TIME_SERIES_LENGTH = int64_value`

**Description**

The minimum number of time points in a time series that are used in modeling the trend component of the time series. `MIN_TIME_SERIES_LENGTH` requires that `TIME_SERIES_LENGTH_FRACTION` is present. For example, suppose a query forecasts two time series simultaneously by using `time_series_id_col`, where one time series has 100 time points and the other has 30 time points. If you set `TIME_SERIES_LENGTH_FRACTION` to 0.5 and `MIN_TIME_SERIES_LENGTH` to 20, then the first time series's last 50 points are used for trend modeling. For the second time series, because `MIN_TIME_SERIES_LENGTH` is 20, the last 20 points rather than 15 (30 * 0.5) points are used in trend modeling. This training option prevents too few time points from being used in trend modeling when `TIME_SERIES_LENGTH_FRACTION` is used, particularly when forecasting multiple time series in a single query using `time_series_id_col`.

**Arguments**

`int64_value` is an `INT64`. The default value is 20. The minimum value is 4.
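The interplay described above reduces to: use the fractional length, but never fewer points than `MIN_TIME_SERIES_LENGTH`. The following small illustration mirrors the documented example; it is a reading of the documented behavior, not BigQuery ML internals:

```python
def trend_points(series_length: int, fraction: float, min_length: int) -> int:
    """Points used for trend modeling: the fraction of the series length,
    floored at min_length, per the documented behavior."""
    return max(int(series_length * fraction), min_length)

print(trend_points(100, 0.5, 20))  # 50: the fraction dominates
print(trend_points(30, 0.5, 20))   # 20: the floor prevents using only 15 points
```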

#### MAX_TIME_SERIES_LENGTH

**Syntax**

`MAX_TIME_SERIES_LENGTH = int64_value`

**Description**

The maximum number of time points in a time series that can be used in modeling the trend component of the time series. For example, if a query forecasts two time series simultaneously by using `TIME_SERIES_ID_COL`, where one time series has 100 time points and the other has 50 time points, then setting `MAX_TIME_SERIES_LENGTH` to 30 makes both time series use the last 30 points for trend modeling.

**Arguments**

`int64_value` is an `INT64`. It doesn't have a default value, and the minimum value is 4. It's recommended to try `30` as a starting value.

#### TREND_SMOOTHING_WINDOW_SIZE

**Syntax**

`TREND_SMOOTHING_WINDOW_SIZE = int64_value`

**Description**

Smoothing window size for the trend component. When a positive value is specified, a center moving average smoothing is applied on the history trend. When the smoothing window is out of the boundary at the beginning or the end of the trend, the first element or the last element is padded to fill the smoothing window before the average is applied.

**Arguments**

`int64_value` is of type `INT64`. There is no default value. A positive value must be specified to smooth the trend.
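The padding behavior described above amounts to a centered moving average with edge replication; the following is a minimal sketch of that idea with an odd window size, illustrative only and not the exact BigQuery ML implementation:

```python
def smooth_trend(trend: list[float], window: int) -> list[float]:
    """Centered moving average over the trend; replicate the first or last
    element where the window overruns the series boundary."""
    half = window // 2
    padded = [trend[0]] * half + list(trend) + [trend[-1]] * half
    return [sum(padded[i:i + window]) / window for i in range(len(trend))]

print(smooth_trend([1.0, 2.0, 3.0, 4.0, 5.0], 3))
```

Interior points get the true centered average; the endpoints are averaged with replicated copies of themselves, so they are pulled toward the boundary values rather than dropped.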

### query_statement

The `AS query_statement` clause specifies the standard SQL query that is used to generate the training data. For information about the supported SQL syntax of the `query_statement` clause, see Standard SQL query syntax.

For time series models, the `query_statement` is expected to contain either two or three columns, depending on whether you want to forecast a single time series or multiple time series. In both cases, `time_series_timestamp_col` and `time_series_data_col` are required. Additional `time_series_id_col` columns are required for forecasting multiple time series.

## Supported inputs

The `CREATE MODEL` statement supports the following data types for the time series input columns.

### Supported data types for time series model inputs

BigQuery ML supports different standard SQL data types for the input columns for time series models. Supported data types for each respective column include:

| Time series input column | Supported types |
|---|---|
| `time_series_timestamp_col` | `TIMESTAMP`, `DATE`, `DATETIME` |
| `time_series_data_col` | `INT64`, `NUMERIC`, `BIGNUMERIC`, `FLOAT64` |
| `time_series_id_col` | `STRING`, `INT64`, `ARRAY` of `STRING` or `INT64` |

### Known limitations

`CREATE MODEL` statements for time series models must comply with the following rules:

- For the input time series, the maximum length is 1,000,000 time points and the minimum length is 3 time points. When forecasting multiple time series at the same time, the limit applies to each time series.
- The maximum number of time series to forecast simultaneously using the ID columns is 100,000,000.
- When forecasting multiple time series simultaneously using the ID column, invalid time series that fail the model fitting are ignored and don't appear in the forecast results; single-point time series are one example. A warning message is shown in this case, and you can use the ML.ARIMA_EVALUATE function to retrieve the possible error message.
- The maximum number of time points to forecast, which is specified using `horizon`, is 10,000.
- Holiday effect modeling is effective only for approximately five years.
- After a multiple time-series model is trained, the evaluation tab in the BigQuery page on the Google Cloud console only shows the evaluation metrics for the first 100 time series. To see the evaluation metrics for all of the time series, use ML.ARIMA_EVALUATE.
- The BigQuery ML training option `warm_start` is not supported by `ARIMA_PLUS` models.

## `CREATE MODEL` examples

The following examples create models named `mymodel` in `mydataset` in your default project.

### Training a time series model to forecast a single time series

This example creates a time series model.

```sql
CREATE MODEL `project_id.mydataset.mymodel`
OPTIONS(
  MODEL_TYPE='ARIMA_PLUS',
  time_series_timestamp_col='date',
  time_series_data_col='transaction') AS
SELECT
  date,
  transaction
FROM
  `mydataset.mytable`
```

### Training multiple time-series models for multiple time-series at the same time

This example creates multiple time-series models, one for each input time series.

```sql
CREATE MODEL `project_id.mydataset.mymodel`
OPTIONS(
  MODEL_TYPE='ARIMA_PLUS',
  time_series_timestamp_col='date',
  time_series_data_col='transaction',
  time_series_id_col='company_name') AS
SELECT
  date,
  transaction,
  company_name
FROM
  `mydataset.mytable`
```

### Training multiple time-series models for multiple time series at the same time using multiple time-series ID columns

This example creates multiple time-series models, one for each input time series.

```sql
CREATE MODEL `project_id.mydataset.mymodel`
OPTIONS(
  MODEL_TYPE='ARIMA_PLUS',
  time_series_timestamp_col='date',
  time_series_data_col='transaction',
  time_series_id_col=['company_name', 'department_name']) AS
SELECT
  date,
  transaction,
  company_name,
  department_name
FROM
  `mydataset.mytable`
```

### Train multiple time-series models for multiple time series using a subset defined as a fraction of the time points for speed-up

This example creates multiple time-series models, one for each input time series.

```sql
CREATE MODEL `project_id.mydataset.mymodel`
OPTIONS(
  MODEL_TYPE='ARIMA_PLUS',
  time_series_timestamp_col='date',
  time_series_data_col='transaction',
  time_series_id_col=['company_name', 'department_name'],
  time_series_length_fraction=0.5,
  min_time_series_length=30) AS
SELECT
  date,
  transaction,
  company_name,
  department_name
FROM
  `mydataset.mytable`
```

### Train multiple time-series models for multiple time series using a subset defined as a maximum number of points in the time points

This example creates multiple time-series models, one for each input time series.

```sql
CREATE MODEL `project_id.mydataset.mymodel`
OPTIONS(
  MODEL_TYPE='ARIMA_PLUS',
  time_series_timestamp_col='date',
  time_series_data_col='transaction',
  time_series_id_col=['company_name', 'department_name'],
  max_time_series_length=50) AS
SELECT
  date,
  transaction,
  company_name,
  department_name
FROM
  `mydataset.mytable`
```

## What's next

- Walk through our tutorials that use time series models in BigQuery ML:
  - Performing single time-series forecasting from Google Analytics data
  - Performing multiple time-series forecasting with a single query from NYC Citi Bike trips data
  - Scalable forecasting with millions of time series in BigQuery