```mermaid
graph LR
    A{"🤗 Accelerate"} --> B["Launching<br>Interface"]
    A --> C["Training Library"]
    A --> D["Big Model<br>Inference"]
```
Launching scripts in different environments is complicated: `torchrun`, DeepSpeed, TPU-specific launchers, and more, each with their own commands and flags!
But it doesn’t have to be:
A single command to launch with DeepSpeed, Fully Sharded Data Parallelism, across single and multiple CPUs and GPUs, and to train on TPUs too!
Generate a device-specific configuration through `accelerate config`.

Or don't. `accelerate config` doesn't have to be run!
```bash
# Instead of:
torchrun --nnodes=1 --nproc_per_node=2 script.py
# you can run:
accelerate launch --multi_gpu --num_processes=2 script.py
```
A quick default configuration can be made too:
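
One way to do that is with `accelerate.utils.write_basic_config` (or `accelerate config default` from the command line). Below is a minimal sketch; the `mixed_precision="fp16"` value is only illustrative and assumes your hardware supports it:

```python
from accelerate.utils import write_basic_config

# Writes a default config file into the Hugging Face cache directory,
# skipping the interactive `accelerate config` questionnaire.
write_basic_config(mixed_precision="fp16")
```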
With the `notebook_launcher` it's also possible to launch code directly from your Jupyter environment!
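
A bare-bones sketch of that pattern (everything here is placeholder code; `training_loop` is a hypothetical function and `num_processes=2` simply assumes two GPUs are available):

```python
from accelerate import notebook_launcher

def training_loop():
    # Build dataloaders, model, and optimizer *inside* this function,
    # then run the usual Accelerator-based training code.
    ...

# Spawns the requested number of processes and runs `training_loop` in each one.
notebook_launcher(training_loop, args=(), num_processes=2)
```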
Okay, will `accelerate launch` make `do_the_thing.py` use all my GPUs magically?

`accelerate launch` is used to launch a python script in various distributed environments.
`accelerate` solves this by ensuring the same code can be run on a CPU or GPU, on multiple devices, and on TPUs!

```python
from accelerate import Accelerator

accelerator = Accelerator()

# Wrap everything for the current distributed setup
dataloader, model, optimizer, scheduler = accelerator.prepare(
    dataloader, model, optimizer, scheduler
)

for batch in dataloader:
    optimizer.zero_grad()
    inputs, targets = batch
    # inputs = inputs.to(device)   # no longer needed
    # targets = targets.to(device) # no longer needed
    outputs = model(inputs)
    loss = loss_function(outputs, targets)
    accelerator.backward(loss)  # instead of loss.backward()
    optimizer.step()
    scheduler.step()
```
What all happened in `Accelerator.prepare`?

- `Accelerator` looked at the configuration
- `dataloader` was converted into one that can dispatch each batch onto a separate GPU
- `model` was wrapped with the appropriate DDP wrapper from either `torch.distributed` or `torch_xla`
- `optimizer` and `scheduler` were both converted into an `AcceleratedOptimizer` and `AcceleratedScheduler`, which know how to handle any distributed scenario
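
To make the wrapping a bit more concrete, here is a small sketch you could run on a single machine (the `Linear` model and `SGD` optimizer are throwaway examples, not part of the original walkthrough) that inspects what `prepare` hands back:

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.Linear(8, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

model, optimizer = accelerator.prepare(model, optimizer)

print(type(optimizer))            # an AcceleratedOptimizer wrapping the original SGD
print(accelerator.device)         # the device this process was assigned
print(accelerator.num_processes)  # how many processes were launched
```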
fastai

Utilizing the `notebook_launcher` and `accelerate` at once requires a few steps:

- Move the `DataLoaders` creation to inside the `train` function
- Wrap the training with the `distrib_ctx` context manager fastai provides
Here it is in code, based on the distributed app examples:

```python
from fastai.vision.all import *
from fastai.distributed import *
from accelerate import notebook_launcher

path = untar_data(URLs.PETS)/'images'

def train():
    dls = ImageDataLoaders.from_name_func(
        path, get_image_files(path), valid_pct=0.2,
        label_func=lambda x: x[0].isupper(), item_tfms=Resize(224))
    learn = vision_learner(dls, resnet34, metrics=error_rate).to_fp16()
    with learn.distrib_ctx(in_notebook=True, sync_bn=False):
        learn.fine_tune(1)

notebook_launcher(train, num_processes=2)
```
The key important parts to remember are:

- Nothing in the notebook should initialize CUDA before `notebook_launcher` is called
- Use `notebook_launcher` to run the training function after everything else is complete