Accelerate, Three Powerful Sublibraries for PyTorch

Zachary Mueller

What is 🤗 Accelerate?

graph LR
    A{"🤗 Accelerate"}
    A --> B["Launching<br>Interface"]
    A --> C["Training Library"]
    A --> D["Big Model<br>Inference"]

A Launching Interface

Launching scripts in different environments is complicated:

  • python script.py
  • torchrun --nnodes=1 --nproc_per_node=2 script.py
  • deepspeed --num_gpus=2 script.py

And more!

A Launching Interface

But it doesn’t have to be:

accelerate launch script.py

A single command to launch with DeepSpeed, Fully Sharded Data Parallelism, across single or multiple CPUs and GPUs, and to train on TPUs too!
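For example (a sketch; the flags below assume a recent version of the Accelerate CLI), the very same script can be pushed onto DeepSpeed or Fully Sharded Data Parallelism straight from the command line:

accelerate launch --use_deepspeed --num_processes=2 script.py
accelerate launch --use_fsdp --num_processes=2 script.py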

A Launching Interface

Generate a device-specific configuration through accelerate config
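Running accelerate config asks a short series of questions and saves the answers to a YAML file. A sketch of what a two-GPU setup might produce (the exact fields vary with the Accelerate version and your answers):

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 2
gpu_ids: all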

A Launching Interface

Or don’t! Running accelerate config isn’t required:

torchrun --nnodes=1 --nproc_per_node=2 script.py
accelerate launch --multi_gpu --num_processes=2 script.py

A quick default configuration can be made too:

accelerate config default
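This typically writes a default_config.yaml into the Hugging Face cache (on Linux, ~/.cache/huggingface/accelerate/), which accelerate launch then picks up automatically on later runs.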

A Launching Interface

With the notebook_launcher it’s also possible to launch code directly from your Jupyter environment!

from accelerate import notebook_launcher
notebook_launcher(
    training_loop_function, 
    args, 
    num_processes=2
)
Launching training on 2 GPUs.
epoch 0: 88.12
epoch 1: 91.73
epoch 2: 92.58
epoch 3: 93.90
epoch 4: 94.71
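Here training_loop_function is whatever training function you define, and args is a tuple of positional arguments passed to it. A minimal, self-contained sketch of such a function (the toy model, random data, and hyperparameters are made up purely for illustration):

import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

def training_loop_function(num_epochs=5):
    accelerator = Accelerator()

    # Toy regression data and model, created *inside* the function so that
    # nothing touches the GPU before notebook_launcher is called
    dataset = TensorDataset(torch.randn(1024, 16), torch.randn(1024, 1))
    dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
    model = torch.nn.Linear(16, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

    dataloader, model, optimizer = accelerator.prepare(dataloader, model, optimizer)

    for epoch in range(num_epochs):
        for inputs, targets in dataloader:
            optimizer.zero_grad()
            loss = torch.nn.functional.mse_loss(model(inputs), targets)
            accelerator.backward(loss)
            optimizer.step()
        accelerator.print(f"epoch {epoch}: loss {loss.item():.4f}")  # only the main process prints

It could then be launched with notebook_launcher(training_loop_function, (5,), num_processes=2).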

A Training Library

Okay, will accelerate launch make do_the_thing.py use all my GPUs magically?

A Training Library

  • We’ve just shown that it’s possible to use accelerate launch to run a Python script in various distributed environments
  • That doesn’t mean the script will automatically take advantage of the new compute and run on it efficiently.
  • Training on different kinds of compute often means changing many lines of code for each specific setup.
  • 🤗 Accelerate solves this by ensuring the same code can be run on a CPU, a GPU, multiples of either, or TPUs!

A Training Library

for batch in dataloader:
    optimizer.zero_grad()
    inputs, targets = batch
    inputs = inputs.to(device)
    targets = targets.to(device)
    outputs = model(inputs)
    loss = loss_function(outputs, targets)
    loss.backward()
    optimizer.step()
    scheduler.step()

A Training Library




from accelerate import Accelerator
accelerator = Accelerator()
dataloader, model, optimizer, scheduler = (
    accelerator.prepare(
        dataloader, model, optimizer, scheduler
    )
)

for batch in dataloader:
    optimizer.zero_grad()
    inputs, targets = batch
    # inputs = inputs.to(device)
    # targets = targets.to(device)
    outputs = model(inputs)
    loss = loss_function(outputs, targets)
    accelerator.backward(loss) # loss.backward()
    optimizer.step()
    scheduler.step()

A Training Library

What exactly happened inside Accelerator.prepare?

  1. Accelerator looked at the configuration
  2. The dataloader was converted into one that can dispatch each batch onto a separate GPU
  3. The model was wrapped with the appropriate DDP wrapper from either torch.distributed or torch_xla
  4. The optimizer and scheduler were both converted into an AcceleratedOptimizer and AcceleratedScheduler, which know how to handle any distributed scenario
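A quick way to see this for yourself is to inspect what prepare hands back. This is a sketch with a throwaway model and dataset; the wrapper types noted in the comments assume a multi-GPU configuration:

import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()

dataset = TensorDataset(torch.randn(64, 8), torch.randn(64, 1))
dataloader = DataLoader(dataset, batch_size=8)
model = torch.nn.Linear(8, 1)
optimizer = torch.optim.AdamW(model.parameters())
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)

dataloader, model, optimizer, scheduler = accelerator.prepare(
    dataloader, model, optimizer, scheduler
)

accelerator.print(type(dataloader))  # an Accelerate dataloader wrapper that shards/dispatches batches
accelerator.print(type(model))       # e.g. torch.nn.parallel.DistributedDataParallel on multi-GPU
accelerator.print(type(optimizer))   # AcceleratedOptimizer
accelerator.print(type(scheduler))   # AcceleratedScheduler
accelerator.print(accelerator.device, accelerator.num_processes)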

Let’s bring in fastai

Utilizing the notebook_launcher and 🤗 Accelerate at once requires a few steps:

  1. Migrate the DataLoaders creation to inside the train function
  2. Use the distrib_ctx context manager fastai provides
  3. Train!

Let’s bring in fastai

Here it is in code, based on the distributed app examples

from fastai.vision.all import *
from fastai.distributed import *

path = untar_data(URLs.PETS)/'images'

def train():
    dls = ImageDataLoaders.from_name_func(
        path, get_image_files(path), valid_pct=0.2,
        label_func=lambda x: x[0].isupper(), item_tfms=Resize(224))
    learn = vision_learner(dls, resnet34, metrics=error_rate).to_fp16()
    with learn.distrib_ctx(in_notebook=True, sync_bn=False):
        learn.fine_tune(1)

notebook_launcher(train, num_processes=2)


Let’s bring in fastai

The key points to remember are:

  • No code should touch the GPU before calling notebook_launcher
  • Generally it’s recommended to let fastai handle gradient accumulation and mixed precision in this case, so use its in-house Callbacks
  • Use the notebook_launcher to run the training function only after everything else has been defined.