Update readme (#96)
Summary:
As title.

Pull Request resolved: #96

Reviewed By: ananthsub

Differential Revision: D40359341

Pulled By: yifuwang

fbshipit-source-id: 28348256d0f8966fc630a3ef745695c3f31b3479
yifuwang authored and facebook-github-bot committed Oct 13, 2022
1 parent 9c2cbec commit 4596fc6
Showing 2 changed files with 25 additions and 31 deletions.
54 changes: 24 additions & 30 deletions README.md
@@ -10,12 +10,12 @@

**This library is currently in Alpha and currently does not have a stable release. The API may change and may not be backward compatible. If you have suggestions for improvements, please open a GitHub issue. We'd love to hear your feedback.**

-A light-weight library for adding fault tolerance to large-scale PyTorch distributed training workloads.
+A performant, memory-efficient checkpointing library for PyTorch applications, designed with large, complex distributed workloads in mind.


## Install

-Requires Python >= 3.7 and PyTorch >= 1.11
+Requires Python >= 3.7 and PyTorch >= 1.12

From pip:

@@ -26,52 +26,46 @@ pip install --pre torchsnapshot-nightly
From source:

```bash
-git clone https://github.com/facebookresearch/torchsnapshot
+git clone https://github.com/pytorch/torchsnapshot
cd torchsnapshot
pip install -r requirements.txt
python setup.py install
```

-## Concepts
-- **Stateful object** - an object whose state can be obtained via `.state_dict()` and restored via `.load_state_dict()`. Most PyTorch components (e.g. `Module`, `Optimizer`, `LRScheduler`) already implement this [protocol](https://github.com/facebookresearch/torchsnapshot/blob/main/torchsnapshot/stateful.py) (see the sketch after this list).
-- **App state** - the application state, described using multiple stateful objects.
-- **Snapshot** - the persisted app state.
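
For illustration, the stateful protocol that the removed section refers to is just two methods; a minimal custom stateful object might look like this (a sketch — the class and its fields are hypothetical, not part of torchsnapshot):

```python
from typing import Any, Dict


class StepCounter:
    """A hypothetical stateful object; any class with these two methods
    can be placed in an app state alongside Module and Optimizer."""

    def __init__(self) -> None:
        self.step = 0

    def state_dict(self) -> Dict[str, Any]:
        # Capture everything needed to reconstruct this object's state
        return {"step": self.step}

    def load_state_dict(self, state_dict: Dict[str, Any]) -> None:
        # Restore from a previously captured state
        self.step = state_dict["step"]
```
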
+## Why TorchSnapshot

+**Performance**
+- TorchSnapshot provides a fast checkpointing implementation that employs various optimizations, including zero-copy serialization for most tensor types, overlapped device-to-host copy and storage I/O, and parallelized storage I/O.
+- TorchSnapshot greatly speeds up checkpointing for DistributedDataParallel workloads by distributing the write load across all ranks ([benchmark](https://github.com/pytorch/torchsnapshot/tree/main/benchmarks/ddp)).
+- When host memory is abundant, TorchSnapshot allows training to resume before all storage I/O completes, reducing the time blocked by checkpoint saving (see the `async_take` sketch below).
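
The last point refers to asynchronous snapshots. A sketch of what this looks like with `Snapshot.async_take`, assuming a toy model (consult the documentation for the exact API; paths are placeholders):

```python
import torch
from torchsnapshot import Snapshot

model = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
app_state = {"model": model, "optimizer": optimizer}

# async_take returns once the app state has been staged in host memory;
# storage I/O proceeds in the background while training resumes.
pending_snapshot = Snapshot.async_take(path="/tmp/snapshot", app_state=app_state)
# ... training continues here ...
snapshot = pending_snapshot.wait()  # block until persistence completes
```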

-## Basic Usage
+**Memory Usage**
+- TorchSnapshot's memory usage adapts to the host's available resources, greatly reducing the chance of out-of-memory issues when saving and loading checkpoints.
+- TorchSnapshot supports efficient random access to individual objects within a snapshot, even when the snapshot is stored in cloud object storage (see the `read_object` sketch below).
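
Random access within a snapshot is exposed via `Snapshot.read_object`. A sketch (the manifest path shown is illustrative; it depends on how the app state was structured):

```python
from torchsnapshot import Snapshot

snapshot = Snapshot(path="/tmp/snapshot")

# Read a single entry without materializing the whole snapshot in memory.
# "0/model/weight" is an illustrative manifest path (rank 0, key "model").
weight = snapshot.read_object(path="0/model/weight")
```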

-Describing the application state with multiple stateful objects:
-```python
-app_state = {"model": model, "optimizer": optimizer}
-```


-Taking a snapshot of the application state:
-```python
-from torchsnapshot import Snapshot
+**Usability**
+- Simple APIs that are consistent between distributed and non-distributed workloads.
+- Out-of-the-box integration with commonly used cloud object storage systems.
+- Automatic resharding (elasticity) on world size change for supported workloads ([more details](https://pytorch.org/torchsnapshot/getting_started.html#elasticity-experimental)).

-# File System
-snapshot = Snapshot.take(path="/foo/bar/baz", app_state=app_state)
+**Security**
+- Secure tensor serialization without pickle dependency [WIP].

-# S3
-snapshot = Snapshot.take(path="s3://foo/bar", app_state=app_state)

-# Google Cloud Storage
-snapshot = Snapshot.take(path="gcs://foo/bar", app_state=app_state)
-```
+## Getting Started

-Referencing an existing snapshot:
```python
-snapshot = Snapshot(path="foo/bar/baz")
-```
+from torchsnapshot import Snapshot

+# Taking a snapshot
+app_state = {"model": model, "optimizer": optimizer}
+snapshot = Snapshot.take(app_state=app_state, path="/path/to/snapshot")

-Restoring the application state from a snapshot:
-```python
+# Restoring from a snapshot
snapshot.restore(app_state=app_state)
```
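
The same calls work unchanged in distributed runs. A sketch assuming a `torchrun`-launched script, where every rank invokes `take`/`restore` collectively:

```python
import torch
import torch.distributed as dist
from torchsnapshot import Snapshot

# Assumes launch via torchrun, which provides the rendezvous env vars
dist.init_process_group(backend="gloo")

model = torch.nn.parallel.DistributedDataParallel(torch.nn.Linear(4, 2))
app_state = {"model": model}

# Same API as the single-process case; the write load is spread across ranks
snapshot = Snapshot.take(path="/tmp/snapshot", app_state=app_state)
snapshot.restore(app_state=app_state)
```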

-See the [example directory](https://github.com/facebookresearch/torchsnapshot/tree/main/examples) for more examples.
+See the [documentation](https://pytorch.org/torchsnapshot/getting_started.html) for more details.


## License
2 changes: 1 addition & 1 deletion setup.py
@@ -69,7 +69,7 @@ def parse_args() -> argparse.Namespace:
version=version,
author="torchsnapshot team",
author_email="[email protected]",
description="A lightweight library for adding fault tolerance to large-scale PyTorch distributed training workloads.",
description="A performant, memory-efficient checkpointing library for PyTorch applications, designed with large, complex distributed workloads in mind.",
long_description=readme,
long_description_content_type="text/markdown",
url="https://github.com/pytorch/torchsnapshot",
