Train deep learning PyTorch models - Azure Machine Learning (2023)

  • Article
  • 9 minutes to read

APPLIES TO: Train deep learning PyTorch models - Azure Machine Learning (1) Python SDK azureml v1

In this article, learn how to run your PyTorch training scripts at enterprise scale using Azure Machine Learning.

The example scripts in this article are used to classify chicken and turkey images to build a deep learning neural network (DNN) based on PyTorch's transfer learning tutorial. Transfer learning is a technique that applies knowledge gained from solving one problem to a different but related problem. Transfer learning shortens the training process by requiring less data, time, and compute resources than training from scratch. To learn more about transfer learning, see the deep learning vs machine learning article.

Whether you're training a deep learning PyTorch model from the ground-up or you're bringing an existing model into the cloud, you can use Azure Machine Learning to scale out open-source training jobs using elastic cloud compute resources. You can build, deploy, version, and monitor production-grade models with Azure Machine Learning.

Prerequisites

Run this code on either of these environments:

  • Azure Machine Learning compute instance - no downloads or installation necessary

    • Complete the Quickstart: Get started with Azure Machine Learning to create a dedicated notebook server pre-loaded with the SDK and the sample repository.
    • In the samples deep learning folder on the notebook server, find a completed and expanded notebook by navigating to this directory: how-to-use-azureml > ml-frameworks > pytorch > train-hyperparameter-tune-deploy-with-pytorch folder.
  • Your own Jupyter Notebook server

    You can also find a completed Jupyter Notebook version of this guide on the GitHub samples page. The notebook includes expanded sections covering intelligent hyperparameter tuning, model deployment, and notebook widgets.

Before you can run the code in this article to create a GPU cluster, you'll need to request a quota increase for your workspace.

Set up the experiment

This section sets up the training experiment by loading the required Python packages, initializing a workspace, creating the compute target, and defining the training environment.

Import packages

First, import the necessary Python libraries.

import osimport shutilfrom azureml.core.workspace import Workspacefrom azureml.core import Experimentfrom azureml.core import Environmentfrom azureml.core.compute import ComputeTarget, AmlComputefrom azureml.core.compute_target import ComputeTargetException

Initialize a workspace

The Azure Machine Learning workspace is the top-level resource for the service. It provides you with a centralized place to work with all the artifacts you create. In the Python SDK, you can access the workspace artifacts by creating a workspace object.

(Video) Train deep learning models with Azure Machine Learning Service & Databricks

Create a workspace object from the config.json file created in the prerequisites section.

ws = Workspace.from_config()

Get the data

The dataset consists of about 120 training images each for turkeys and chickens, with 100 validation images for each class. We'll download and extract the dataset as part of our training script pytorch_train.py. The images are a subset of the Open Images v5 Dataset. For more steps on creating a JSONL to train with your own data, see this Jupyter notebook.

Prepare training script

In this tutorial, the training script, pytorch_train.py, is already provided. In practice, you can take any custom training script, as is, and run it with Azure Machine Learning.

Create a folder for your training script(s).

project_folder = './pytorch-birds'os.makedirs(project_folder, exist_ok=True)shutil.copy('pytorch_train.py', project_folder)

Create a compute target

Create a compute target for your PyTorch job to run on. In this example, create a GPU-enabled Azure Machine Learning compute cluster.

Important

Before you can create a GPU cluster, you'll need to request a quota increase for your workspace.

# Choose a name for your CPU clustercluster_name = "gpu-cluster"# Verify that cluster does not exist alreadytry: compute_target = ComputeTarget(workspace=ws, name=cluster_name) print('Found existing compute target')except ComputeTargetException: print('Creating a new compute target...') compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC6', max_nodes=4) # Create the cluster with the specified name and configuration compute_target = ComputeTarget.create(ws, cluster_name, compute_config) # Wait for the cluster to complete, show the output log compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)

If you instead want to create a CPU cluster, provide a different VM size to the vm_size parameter, such as STANDARD_D2_V2.

Note

You may choose to use low-priority VMs to run some or all of your workloads. See how to create a low-priority VM.

For more information on compute targets, see the what is a compute target article.

(Video) Using the Azure Machine Learning Python SDK to Train a PyTorch model at Scale by Henk Boelman

Define your environment

To define the Azure ML Environment that encapsulates your training script's dependencies, you can either define a custom environment or use an Azure ML curated environment.

Use a curated environment

Azure ML provides prebuilt, curated environments if you don't want to define your own environment. There are several CPU and GPU curated environments for PyTorch corresponding to different versions of PyTorch.

If you want to use a curated environment, you can run the following command instead:

curated_env_name = 'AzureML-PyTorch-1.6-GPU'pytorch_env = Environment.get(workspace=ws, name=curated_env_name)

To see the packages included in the curated environment, you can write out the conda dependencies to disk:

pytorch_env.save_to_directory(path=curated_env_name)

Make sure the curated environment includes all the dependencies required by your training script. If not, you'll have to modify the environment to include the missing dependencies. If the environment is modified, you'll have to give it a new name, as the 'AzureML' prefix is reserved for curated environments. If you modified the conda dependencies YAML file, you can create a new environment from it with a new name, for example:

pytorch_env = Environment.from_conda_specification(name='pytorch-1.6-gpu', file_path='./conda_dependencies.yml')

If you had instead modified the curated environment object directly, you can clone that environment with a new name:

pytorch_env = pytorch_env.clone(new_name='pytorch-1.6-gpu')

Create a custom environment

You can also create your own Azure ML environment that encapsulates your training script's dependencies.

First, define your conda dependencies in a YAML file; in this example the file is named conda_dependencies.yml.

channels:- conda-forgedependencies:- python=3.6.2- pip=21.3.1- pip: - azureml-defaults - torch==1.6.0 - torchvision==0.7.0 - future==0.17.1 - pillow

Create an Azure ML environment from this conda environment specification. The environment will be packaged into a Docker container at runtime.

By default if no base image is specified, Azure ML will use a CPU image azureml.core.environment.DEFAULT_CPU_IMAGE as the base image. Since this example runs training on a GPU cluster, you'll need to specify a GPU base image that has the necessary GPU drivers and dependencies. Azure ML maintains a set of base images published on Microsoft Container Registry (MCR) that you can use. For more information, see AzureML-Containers GitHub repo.

pytorch_env = Environment.from_conda_specification(name='pytorch-1.6-gpu', file_path='./conda_dependencies.yml')# Specify a GPU base imagepytorch_env.docker.enabled = Truepytorch_env.docker.base_image = 'mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.1-cudnn7-ubuntu18.04'

Tip

Optionally, you can just capture all your dependencies directly in a custom Docker image or Dockerfile, and create your environment from that. For more information, see Train with custom image.

(Video) Build and Deploy PyTorch Models with Azure Machine Learning

For more information on creating and using environments, see Create and use software environments in Azure Machine Learning.

Configure and submit your training run

Create a ScriptRunConfig

Create a ScriptRunConfig object to specify the configuration details of your training job, including your training script, environment to use, and the compute target to run on. Any arguments to your training script will be passed via command line if specified in the arguments parameter. The following code will configure a single-node PyTorch job.

from azureml.core import ScriptRunConfigsrc = ScriptRunConfig(source_directory=project_folder, script='pytorch_train.py', arguments=['--num_epochs', 30, '--output_dir', './outputs'], compute_target=compute_target, environment=pytorch_env)

Warning

Azure Machine Learning runs training scripts by copying the entire source directory. If you have sensitive data that you don't want to upload, use a .ignore file or don't include it in the source directory . Instead, access your data using an Azure ML dataset.

For more information on configuring jobs with ScriptRunConfig, see Configure and submit training runs.

Warning

If you were previously using the PyTorch estimator to configure your PyTorch training jobs, please note that Estimators have been deprecated as of the 1.19.0 SDK release. With Azure ML SDK >= 1.15.0, ScriptRunConfig is the recommended way to configure training jobs, including those using deep learning frameworks. For common migration questions, see the Estimator to ScriptRunConfig migration guide.

Submit your run

The Run object provides the interface to the run history while the job is running and after it has completed.

run = Experiment(ws, name='Tutorial-pytorch-birds').submit(src)run.wait_for_completion(show_output=True)

What happens during run execution

As the run is executed, it goes through the following stages:

  • Preparing: A docker image is created according to the environment defined. The image is uploaded to the workspace's container registry and cached for later runs. Logs are also streamed to the run history and can be viewed to monitor progress. If a curated environment is specified instead, the cached image backing that curated environment will be used.

  • Scaling: The cluster attempts to scale up if the Batch AI cluster requires more nodes to execute the run than are currently available.

    (Video) Build and deploy PyTorch models with Azure Machine Learning

  • Running: All scripts in the script folder are uploaded to the compute target, data stores are mounted or copied, and the script is executed. Outputs from stdout and the ./logs folder are streamed to the run history and can be used to monitor the run.

  • Post-Processing: The ./outputs folder of the run is copied over to the run history.

Register or download a model

Once you've trained the model, you can register it to your workspace. Model registration lets you store and version your models in your workspace to simplify model management and deployment.

model = run.register_model(model_name='pytorch-birds', model_path='outputs/model.pt')

Tip

The deployment how-tocontains a section on registering models, but you can skip directly to creating a compute target for deployment, since you already have a registered model.

You can also download a local copy of the model by using the Run object. In the training script pytorch_train.py, a PyTorch save object persists the model to a local folder (local to the compute target). You can use the Run object to download a copy.

# Create a model folder in the current directoryos.makedirs('./model', exist_ok=True)# Download the model from run historyrun.download_file(name='outputs/model.pt', output_file_path='./model/model.pt'), 

Distributed training

Azure Machine Learning also supports multi-node distributed PyTorch jobs so that you can scale your training workloads. You can easily run distributed PyTorch jobs and Azure ML will manage the orchestration for you.

Azure ML supports running distributed PyTorch jobs with both Horovod and PyTorch's built-in DistributedDataParallel module.

For more information about distributed training, see the Distributed GPU training guide.

Export to ONNX

To optimize inference with the ONNX Runtime, convert your trained PyTorch model to the ONNX format. Inference, or model scoring, is the phase where the deployed model is used for prediction, most commonly on production data. For an example, see the Exporting model from PyTorch to ONNX tutorial.

Next steps

In this article, you trained and registered a deep learning, neural network using PyTorch on Azure Machine Learning. To learn how to deploy a model, continue on to our model deployment article.

  • How and where to deploy models
  • Track run metrics during training
  • Tune hyperparameters
  • Reference architecture for distributed deep learning training in Azure

FAQs

Can you train ML model on Azure? ›

Azure Machine Learning provides several ways to train your models, from code-first solutions using the SDK to low-code solutions such as automated machine learning and the visual designer.

Does Azure ml support PyTorch? ›

Whether you're training a deep learning PyTorch model from the ground-up or you're bringing an existing model into the cloud, you can use AzureML to scale out open-source training jobs using elastic cloud compute resources. You can build, deploy, version, and monitor production-grade models with AzureML.

What are two advantages of using the Azure ML platform? ›

6 Benefits of Azure Machine Learning
  • #1: Leverage ML as a Service. Microsoft offers Azure ML as a pay-as-you-go service. ...
  • #2: Benefit from MLOps. ...
  • #3: Accelerate ML with Best-of-Breed Algorithms. ...
  • #4: Support Remote Working with Cloud-based Services. ...
  • #5: Compliant & Secure ML Apps. ...
  • #6: Catalyze Business Growth.
28 Oct 2021

Which environment is best for machine learning? ›

Below are the most efficient, commonly used online machine learning environments:
  • Google Colaboratory.
  • IBM Watson.
  • Kaggle Kernel.
  • Coclac.
  • Microsoft Azure.
30 Sept 2020

What is PyTorch enterprise? ›

The PyTorch Enterprise Support Program provides long-term support, prioritized troubleshooting, and integration with Azure solutions. Long-term support (LTS): Microsoft will provide commercial support for the public PyTorch codebase. Each release will be supported for as long as it is current.

How do I train my Azure custom model? ›

Create a project in the Form Recognizer Studio
  1. Start by navigating to the Form Recognizer Studio. ...
  2. In the Studio, select the Custom models tile, on the custom models page and select the Create a project button. ...
  3. Next select the storage account you used to upload your custom model training dataset.
12 Oct 2022

How do you train a model for machine learning? ›

3 steps to training a machine learning model
  1. Step 1: Begin with existing data. Machine learning requires us to have existing data—not the data our application will use when we run it, but data to learn from. ...
  2. Step 2: Analyze data to identify patterns. ...
  3. Step 3: Make predictions.

How do you make an ML model in Azure? ›

Sign in to the studio
  1. Sign in to Azure Machine Learning studio.
  2. Select your subscription and the workspace you created.
  3. Select Get started.
  4. In the left pane, select Automated ML under the Author section. ...
  5. Select +New automated ML job.
4 days ago

How do you deploy a trained model in Azure? ›

Workflow for deploying a model
  1. Register the model.
  2. Prepare an entry script.
  3. Prepare an inference configuration.
  4. Deploy the model locally to ensure everything works.
  5. Choose a compute target.
  6. Deploy the model to the cloud.
  7. Test the resulting web service.
6 days ago

How do you save a trained model in Azure ML? ›

There's no way to do anything with a model outside of AzureML. "You can also download a local copy of the model by using the Run object. In the training script pytorch_train.py, a PyTorch save object persists the model to a local folder (local to the compute target). You can use the Run object to download a copy."

What are limitations of Azure ML? ›

Azure Machine Learning assets
ResourceMaximum limit
Datasets10 million
Runs10 million
Models10 million
Artifacts10 million
21 Sept 2022

Is Azure machine learning PaaS or SAAS? ›

Microsoft Azure is widely considered both a Platform as a Service (PaaS) and Infrastructure as a Service (IaaS) offering.

Who uses Azure machine learning? ›

Machine learning professionals, data scientists, and engineers can use it in their day-to-day workflows: Train and deploy models, and manage MLOps. You can create a model in Azure Machine Learning or use a model built from an open-source platform, such as Pytorch, TensorFlow, or scikit-learn.

Which IDE is best for deep learning? ›

6 Best Python IDEs for Data Science & Machine Learning [2022]
  • Spyder. Explore our Popular Data Science Courses.
  • Thonny. Explore our Popular Data Science Courses.
  • JupyterLab.
  • PyCharm.
  • Visual Code.
  • Atom.
23 Sept 2022

Which software is better for machine learning? ›

The 10 Best Machine Learning Software Summary
ToolPrice
1cnvrg.io Best machine learning software for the gaming industry$9500/instanc
2Amazon Machine Learning Best for those in the AWS ecosystem$0.42/hour
3H2O.ai Best open source integration with Spark$0.046/hour
4TensorFlow Best for ML models on mobile and IoT devices
6 more rows
2 Jan 2022

What IDE do machine learning engineers use? ›

3.3.

The common IDEs that being used for machine learning are Jupyter Notebook, PyCharm. These IDEs make programming of machine learning easier. Juyter notebook is easy if you are prototyping an idea or making a simple function. I find it is not easy to format my code or debug my code in Jupyter Notebook.

How do I create a machine learning model in Azure? ›

Each of these stages maps to a set of modules in Azure ML Studio.
  1. Step 1: preprocess your data. For this step, you'll be using the modules under Data Transformation . ...
  2. Step 2: split into train/test. ...
  3. Step 3: select and/or create your features. ...
  4. Step 4–5–6: train, score and evaluate your model. ...
  5. Step 7: deploy selected model.
14 Jun 2020

How do you deploy the Azure machine learning model? ›

Workflow for deploying a model
  1. Register the model.
  2. Prepare an entry script.
  3. Prepare an inference configuration.
  4. Deploy the model locally to ensure everything works.
  5. Choose a compute target.
  6. Deploy the model to the cloud.
  7. Test the resulting web service.
23 Sept 2022

How do I train my Azure custom model? ›

Create a project in the Form Recognizer Studio
  1. Start by navigating to the Form Recognizer Studio. ...
  2. In the Studio, select the Custom models tile, on the custom models page and select the Create a project button. ...
  3. Next select the storage account you used to upload your custom model training dataset.
12 Oct 2022

How do you train data in machine learning? ›

3 steps to training a machine learning model
  1. Step 1: Begin with existing data. Machine learning requires us to have existing data—not the data our application will use when we run it, but data to learn from. ...
  2. Step 2: Analyze data to identify patterns. ...
  3. Step 3: Make predictions.

How do you save a trained model in Azure ML? ›

There's no way to do anything with a model outside of AzureML. "You can also download a local copy of the model by using the Run object. In the training script pytorch_train.py, a PyTorch save object persists the model to a local folder (local to the compute target). You can use the Run object to download a copy."

What is azure machine learning designer? ›

Azure Machine Learning designer is a drag-and-drop interface used to train and deploy models in Azure Machine Learning. This article describes the tasks you can do in the designer. To get started with the designer, see Tutorial: Train a no-code regression model.

What is Azure machine learning pipeline? ›

ML pipelines execute on compute targets (see What are compute targets in Azure Machine Learning). Pipelines can read and write data to and from supported Azure Storage locations. If you don't have an Azure subscription, create a free account before you begin. Try the free or paid version of Azure Machine Learning.

Which three process should you perform in sequence before you deploy the model? ›

Question 21

Which three processes should you perform in sequence before you deploy the model? Processes: data encryption. model retraining.

How do you deploy a machine learning model on a website? ›

How was it done?
  1. Step 1: Create a new virtual environment using Pycharm IDE.
  2. Step 2: Install necessary libraries.
  3. Step 3: Build the best machine learning model and Save it.
  4. Step 4: Test the loaded model.
  5. Step 5: Create main.py file.
  6. Step 6: Upload local project to Github.
29 Dec 2021

What is speech model training? ›

Speech technologies increasingly rely on deep neural networks, a type of machine learning that helps us build more accurate and faster speech recognition models. Generally deep neural networks need larger amounts of data to work well and improve over time. This process of improvement is called model training.

What is form recognizer? ›

Form Recognizer is an AI service that applies advanced machine learning to extract text, key-value pairs, tables, and structures from documents automatically and accurately. Turn documents into usable data and shift your focus to acting on information rather than compiling it.

How do you train a PyTorch model? ›

To train the image classifier with PyTorch, you need to complete the following steps:
  1. Load the data. If you've done the previous step of this tutorial, you've handled this already.
  2. Define a Convolution Neural Network.
  3. Define a loss function.
  4. Train the model on the training data.
  5. Test the network on the test data.
22 Jun 2022

How do you train a model with large datasets? ›

Training on Large Datasets That Don't Fit In Memory in Keras
  1. Downloading and Understanding Dataset.
  2. Preparation of Dataset — To Load the Dataset in Batches.
  3. Shuffling and Splitting of the Dataset In Train And Validation Set.
  4. Creation of Custom Generator.
  5. Defining Model Architecture and Training Model.
  6. Conclusion.

What are training examples in machine learning? ›

Supervised machine learning is when the program is “trained” on a predefined set of “training examples,” which then facilitate its ability to reach an accurate conclusion when given new data. Unsupervised machine learning is when the program is given a bunch of data and must find patterns and relationships therein.

Videos

1. Build and deploy PyTorch models with Azure ML
(MTG France)
2. Build Machine Learning Models with PyTorch & Azure ML Service
(Microsoft Reactor)
3. Azure Machine Learning [June 2022]
(BlueGranite, Inc.)
4. Build and deploy PyTorch models with Azure Machine Learning
(Somos Lulo TV)
5. Quickstart: Train and deploy a model in Azure Machine Learning in 10 minutes
(Luis Valencia)
6. Deep Learning Container with PyTorch and Azure ML
(Microsoft Developer)
Top Articles
Latest Posts
Article information

Author: Jonah Leffler

Last Updated: 13/07/2023

Views: 6030

Rating: 4.4 / 5 (65 voted)

Reviews: 88% of readers found this page helpful

Author information

Name: Jonah Leffler

Birthday: 1997-10-27

Address: 8987 Kieth Ports, Luettgenland, CT 54657-9808

Phone: +2611128251586

Job: Mining Supervisor

Hobby: Worldbuilding, Electronics, Amateur radio, Skiing, Cycling, Jogging, Taxidermy

Introduction: My name is Jonah Leffler, I am a determined, faithful, outstanding, inexpensive, cheerful, determined, smiling person who loves writing and wants to share my knowledge and understanding with you.