Commit 06c02aa

Merge pull request #31 from liulanzheng/main
update doc
2 parents af1eb04 + 6e49e46 commit 06c02aa

4 files changed: 53 additions and 6 deletions

docs/README.md

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ OSS Connector for AI/ML contains some high-performance Python libraries specific

 Currently, the OSS connector is composed of two libraries: OSS Model Connector and OSS Torch Connector.

-- [OSS Torch Connector](https://aliyun.github.io/oss-connector-for-ai-ml/#/torchconnector/introduction) is dedicated to AI training scenarios, including loading [datasets](https://pytorch.org/docs/stable/data.html#dataset-types) from OSS and loading/saving checkpoints from/to OSS.
+- [OSS Torch Connector](https://aliyun.github.io/oss-connector-for-ai-ml/#/torchconnector/introduction) is dedicated to AI training scenarios, including loading [datasets](https://pytorch.org/docs/stable/data.html#dataset-types) from OSS and loading/saving checkpoints or [Distributed Checkpoints(DCP)](https://docs.pytorch.org/docs/stable/distributed.checkpoint.html) from/to OSS.

 - [OSS Model Connector](https://aliyun.github.io/oss-connector-for-ai-ml/#/modelconnector/introduction) focuses on AI inference scenarios, loading large model files from OSS into local AI inference frameworks.

docs/torchconnector/examples.md

Lines changed: 41 additions & 1 deletion
@@ -273,4 +273,44 @@ with checkpoint.writer(CHECKPOINT_WRITE_URI) as writer:
     torch.save(state_dict, writer)
 ```

-OssCheckpoint can be used for checkpoints, and also for high-speed uploading and downloading of arbitrary objects. In our testing environment, the download speed can exceed 15GB/s.
+OssCheckpoint can be used for checkpoints, and also for high-speed uploading and downloading of arbitrary objects. In our testing environment, the download speed can exceed 15GB/s.
+
+## Distributed checkpoints
+
+OSS connector for AI/ML supports [PyTorch distributed checkpoints(DCP)](https://docs.pytorch.org/docs/stable/distributed.checkpoint.html) since v1.2.0rc2.
+
+```py
+import torchvision
+import torch.distributed.checkpoint as DCP
+from osstorchconnector import OssFileSystem
+import torch
+
+ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
+CONFIG_PATH = "/etc/oss-connector/config.json"
+CRED_PATH = "/root/.alibabacloud/credentials"
+OSS_URI = "oss://ossconnectorbucket/dcp-checkpoint-resnet18"
+
+model = torchvision.models.resnet18()
+
+# write to OSS
+fs = OssFileSystem(endpoint=ENDPOINT, cred_path=CRED_PATH, config_path=CONFIG_PATH)
+oss_storage_writer = fs.writer(OSS_URI)
+# DCP.save or DCP.async_save
+checkpoint_future = DCP.async_save(
+    state_dict=model.state_dict(),
+    storage_writer=oss_storage_writer,
+)
+checkpoint_future.result()
+
+
+# load from OSS
+loaded_state_dict = {
+    key: torch.zeros_like(value) for key, value in model.state_dict().items()
+}
+oss_storage_reader = fs.reader(OSS_URI)
+DCP.load(
+    loaded_state_dict,
+    storage_reader=oss_storage_reader,
+)
+
+```

docs/torchconnector/installation.md

Lines changed: 10 additions & 3 deletions
@@ -4,8 +4,11 @@

 - OS: Linux x86-64
 - glibc: >= 2.17
-- Python: 3.8-3.12
-- PyTorch: >= 2.0
+- Python: 3.8 - 3.12
+- PyTorch:
+  - `>= 2.0` (v1.0.0rc1-v1.1.0)
+  - `>= 2.3` (since v1.2.0rc1)
+- Kernel module: userfaultfd (for checkpoint)

 ## Install

@@ -25,4 +28,8 @@ For example, download the `osstorchconnector/v1.1.0rc1` for Python 3.11 and inst
 wget https://github.com/aliyun/oss-connector-for-ai-ml/releases/download/osstorchconnector%2Fv1.1.0rc1/osstorchconnector-1.1.0rc1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

 pip install osstorchconnector-1.1.0rc1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
-```
+```
+
+### Additional configuration for Docker
+
+To use checkpoint-related features within Docker, the container must be run with `--privileged`. This is due to our reliance on userfaultfd to accelerate the reading of checkpoints.
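
The version constraints listed in the installation.md requirements above can be sketched as a pre-flight check. This is a minimal, unofficial sketch that uses only the Python standard library. The helper names (`release_tuple`, `minimum_torch`, `check_requirements`) are hypothetical and are not part of the connector, the distribution name `osstorchconnector` is assumed from the wheel filename, and the sketch does not try to detect the userfaultfd kernel module.

```python
import platform
import re
import sys
from importlib import metadata


def release_tuple(version):
    """Leading numeric part of a version string, padded to three fields.

    "1.2.0rc1" -> (1, 2, 0); "2.3.1+cu121" -> (2, 3, 1); "2.17" -> (2, 17, 0)
    """
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    parts = tuple(int(p) for p in match.group(0).split("."))
    return (parts + (0, 0, 0))[:3]


def minimum_torch(connector_version):
    """Minimum PyTorch (major, minor) for a connector version (hypothetical helper)."""
    # Assumption taken from the requirements list: 1.2.0 pre-releases
    # (rc1, rc2, ...) already need PyTorch 2.3, so only the numeric
    # release part of the connector version is compared.
    return (2, 3) if release_tuple(connector_version) >= (1, 2, 0) else (2, 0)


def check_requirements(connector_version, torch_version, python_version,
                       system, machine, glibc_version):
    """Return a list of problems; an empty list means none were found."""
    problems = []
    if (system, machine) != ("Linux", "x86_64"):
        problems.append(f"unsupported platform: {system} {machine}")
    if not (3, 8) <= tuple(python_version[:2]) <= (3, 12):
        problems.append(
            f"unsupported Python: {python_version[0]}.{python_version[1]}")
    if not glibc_version or release_tuple(glibc_version) < (2, 17, 0):
        problems.append(
            f"glibc >= 2.17 required, found: {glibc_version or 'unknown'}")
    needed = minimum_torch(connector_version)
    if release_tuple(torch_version)[:2] < needed:
        problems.append(
            f"PyTorch >= {needed[0]}.{needed[1]} required, found: {torch_version}")
    return problems


if __name__ == "__main__":
    try:
        report = check_requirements(
            connector_version=metadata.version("osstorchconnector"),
            torch_version=metadata.version("torch"),
            python_version=sys.version_info,
            system=platform.system(),
            machine=platform.machine(),
            glibc_version=platform.libc_ver()[1],
        )
    except metadata.PackageNotFoundError as exc:
        report = [f"package not installed: {exc}"]
    print("\n".join(report) or "no problems found")
```

Keeping `check_requirements` free of direct system calls means the same function can be exercised with made-up values, for example to confirm that a 1.2.0 release candidate paired with PyTorch 2.2 is reported as a problem.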

docs/torchconnector/introduction.md

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@
 ## Overview

 OSS Torch Connector provides both [Map-style and Iterable-style datasets](https://pytorch.org/docs/stable/data.html#dataset-types) for loading datasets from OSS.
-And also provides a method for loading and saving checkpoints from and to OSS.
+And also provides a method for loading and saving checkpoints or [Distributed Checkpoints(DCP)](https://docs.pytorch.org/docs/stable/distributed.checkpoint.html) from and to OSS.

 The core part of OSS Connector for AI/ML is implemented in C++ using [PhotonLibOS](https://github.com/alibaba/PhotonLibOS). This repository only contains the code of Python.