Hugging Face Accelerate
Streamline your PyTorch training with Hugging Face Accelerate. Launch and scale effortlessly across diverse hardware and environments.
# Quicktour
There are many ways to launch and run your code depending on your training environment ([torchrun](https://pytorch.org/docs/stable/elastic/run.html), [DeepSpeed](https://www.deepspeed.ai/), etc.) and available hardware. Accelerate offers a unified interface for launching and training on different distributed setups, allowing you to focus on your PyTorch training code instead of the intricacies of adapting it to each setup. This lets you easily scale your PyTorch code for training and inference on distributed hardware like GPUs and TPUs. Accelerate also provides Big Model Inference to make it easier to load and run inference with very large models that usually don't fit in memory.
This quicktour introduces the three main features of Accelerate:
* a unified command line launching interface for distributed training scripts
* a training library for adapting PyTorch training code to run on different distributed setups
* Big Model Inference
## Unified launch interface
Accelerate automatically selects the appropriate configuration values for any given distributed training framework (DeepSpeed, FSDP, etc.) through a unified configuration file generated from the [`accelerate config`](package_reference/cli#accelerate-config) command. You can also pass the configuration values explicitly on the command line, which is helpful in certain situations, such as when you're using SLURM (see the example below).
But in most cases, you should always run [`accelerate config`](package_reference/cli#accelerate-config) first to help Accelerate learn about your training setup.
```bash
accelerate config
```
The [`accelerate config`](package_reference/cli#accelerate-config) command creates and saves a `default_config.yaml` file in Accelerate's cache folder. This file stores the configuration for your training environment, which helps Accelerate correctly launch your training script based on your machine.
After you've configured your environment, you can test your setup with [`accelerate test`](package_reference/cli#accelerate-test), which launches a short script to test the distributed environment.
```bash
accelerate test
```
> [!TIP]
> Add `--config_file` to the `accelerate test` or `accelerate launch` command to specify the location of the configuration file if it is saved in a non-default location like the cache.
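For instance, pointing `accelerate test` at a configuration saved somewhere else might look like this (the path is a placeholder):
```bash
accelerate test --config_file path/to/config.yaml
```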
Once your environment is set up, launch your training script with [`accelerate launch`](package_reference/cli#accelerate-launch)!
```bash
accelerate launch path_to_script.py --args_for_the_script
```
Check out the [Launch distributed code](basic_tutorials/launch) tutorial for more information about launching your scripts.
We also have a [configuration zoo](https://github.com/huggingface/accelerate/blob/main/examples/config_yaml_templates) which showcases a number of premade **minimal** example configurations for a variety of setups you can run.
## Adapt training code
The next main feature of Accelerate is the `Accelerator` class which adapts your PyTorch code to run on different distributed setups.
You only need to add a few lines of code to your training script to enable it to run on multiple GPUs or TPUs.
```diff
+ from accelerate import Accelerator
+ accelerator = Accelerator()
+ device = accelerator.device
+ model, optimizer, training_dataloader, scheduler = accelerator.prepare(
+     model, optimizer, training_dataloader, scheduler
+ )

for batch in training_dataloader:
    optimizer.zero_grad()
    inputs, targets = batch
-   inputs = inputs.to(device)
-   targets = targets.to(device)
    outputs = model(inputs)
    loss = loss_function(outputs, targets)
+   accelerator.backward(loss)
    optimizer.step()
    scheduler.step()
```
1. Import and instantiate the `Accelerator` class at the beginning of your training script. The `Accelerator` class initializes everything necessary for distributed training, and it automatically detects your training environment (a single machine with a GPU, a machine with several GPUs, several machines with multiple GPUs or a TPU, etc.) based on how the code was launched.
```python
from accelerate import Accelerator
accelerator = Accelerator()
```
2. Remove calls like `.cuda()` on your model and input data. The `Accelerator` class automatically places these objects on the appropriate device for you.
> [!WARNING]
> This step is *optional* but it is considered best practice to allow Accelerate to handle device placement. You could also deactivate automatic device placement by passing `device_placement=False` when initializing the `Accelerator`. If you want to explicitly place objects on a device with `.to(device)`, make sure you use `accelerator.device` instead. For example, if you create an optimizer before placing a model on `accelerator.device`, training fails on a TPU.
> [!WARNING]
> Accelerate does not use non-blocking transfers by default for its automatic device placement, which can result in potentially unwanted CUDA synchronizations. You can enable non-blocking transfers by passing a `DataLoaderConfiguration` with `non_blocking=True` set as the `dataloader_config` when initializing the `Accelerator`. As usual, non-blocking transfers will only work if the dataloader also has `pin_memory=True` set. Be wary that using non-blocking transfers from GPU to CPU may cause incorrect results if it results in CPU operations being performed on non-ready tensors.
```py
device = accelerator.device
```
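To illustrate the non-blocking transfer option from the second warning above, here is a minimal sketch; it only has an effect if your `DataLoader` is created with `pin_memory=True`:
```python
from accelerate import Accelerator
from accelerate.utils import DataLoaderConfiguration

# Opt in to non-blocking host-to-device copies for prepared dataloaders.
dataloader_config = DataLoaderConfiguration(non_blocking=True)
accelerator = Accelerator(dataloader_config=dataloader_config)
```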
3. Pass all relevant PyTorch objects for training (optimizer, model, dataloader(s), learning rate scheduler) to the `prepare()` method as soon as they're created. This method wraps the model in a container optimized for your distributed setup, uses Accelerate's version of the optimizer and scheduler, and creates a sharded version of your dataloader for distribution across GPUs or TPUs.
```python
model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
model, optimizer, train_dataloader, lr_scheduler
)
```
4. Replace `loss.backward()` with `accelerator.backward(loss)` to use the correct `backward()` method for your training setup.
```py
accelerator.backward(loss)
```
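Putting the four steps together, a minimal end-to-end script might look like the following sketch. The toy model, data, and hyperparameters are purely illustrative placeholders, not part of the Accelerate API:
```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()

# Toy model, optimizer, data, and scheduler for illustration only.
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataset = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))
training_dataloader = DataLoader(dataset, batch_size=8, shuffle=True)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)
loss_function = torch.nn.CrossEntropyLoss()

# Let Accelerate wrap everything for the current distributed setup.
model, optimizer, training_dataloader, scheduler = accelerator.prepare(
    model, optimizer, training_dataloader, scheduler
)

for batch in training_dataloader:
    optimizer.zero_grad()
    inputs, targets = batch
    outputs = model(inputs)
    loss = loss_function(outputs, targets)
    accelerator.backward(loss)
    optimizer.step()
    scheduler.step()
```
Launching this script with `accelerate launch` runs the same code unchanged on a single GPU, multiple GPUs, or a TPU, depending on your configuration.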