[build] make builder smarter and configurable wrt compute capabilities + docs #578
This PR fixes an error resulting from a build in which `multi_tensor_adam.cu` and `fused_lamb_cuda_kernel.cu` were getting only the `-gencode=arch=compute_80,code=sm_80` flag and missing all the rest:

```
-gencode arch=compute_60,code=compute_60 -gencode arch=compute_61,code=compute_61
-gencode arch=compute_70,code=compute_70 -gencode arch=compute_80,code=compute_80
-gencode arch=compute_86,code=compute_86
```
The lone `-gencode=arch=compute_80,code=sm_80` comes from `CUDAExtension`, which currently checks the capability of the 1st card only and assumes that the other cards are the same. Moreover, it clamps the number down to the minimum within the same major version, so, for example, `sm_86` becomes `sm_80`.

I'm pretty sure it's wrong though, since it's the card with `compute_61` that was getting this error, so these 2 libs weren't built to support `compute_61` at all. This is with pytorch-nightly. The 2 cards I have right now are a `compute_61` card and a `compute_86` (rtx-30*) card.
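As an aside (not part of this PR's code), a quick way to see what each card actually reports, assuming a CUDA-enabled pytorch install:

```python
import torch

# Print the compute capability of every visible GPU. With no arch list given,
# CUDAExtension effectively derives its -gencode flag from the 1st card only,
# so on a mixed setup the other cards' archs never make it into the build.
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(f"cuda:{i} {torch.cuda.get_device_name(i)} -> compute_{major}{minor}")
```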
Note: a PR has been proposed to fix this problem transparently to `CUDAExtension` users, but it won't be available until a future version of pytorch, if it's accepted: pytorch/pytorch#48891

The cost of this deepspeed PR is that the build process is now slightly slower, as it now has to build 4-5 kernels x 2 instead of just 2 (assuming all features are enabled to be compiled). So perhaps down the road we can fix that by conditioning on the pytorch version and building fewer kernels. Alternatively, you could copy the loop that derives just the required archs: https://github.com/pytorch/pytorch/blob/b8f90d778d4c0739c7c07fe2b2fb0aef5e7c77e7/torch/utils/cpp_extension.py#L1531-L1545
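For reference, here is a simplified sketch of what that kind of loop does: turn a `TORCH_CUDA_ARCH_LIST`-style spec into `-gencode` flags (the function name and details are illustrative, not the actual pytorch or deepspeed code):

```python
import os

def arch_list_to_gencode(arch_list: str) -> list:
    """Convert e.g. "6.1;7.5;8.6+PTX" into nvcc -gencode flags."""
    flags = []
    for arch in arch_list.replace(" ", ";").split(";"):
        if not arch:
            continue
        num = arch.split("+")[0].replace(".", "")  # "8.6" -> "86"
        flags.append(f"-gencode=arch=compute_{num},code=sm_{num}")
        if arch.endswith("+PTX"):  # also embed PTX so newer cards can JIT it
            flags.append(f"-gencode=arch=compute_{num},code=compute_{num}")
    return flags

print(arch_list_to_gencode(os.environ.get("TORCH_CUDA_ARCH_LIST", "6.1;7.5;8.6")))
```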
edit: After understanding how `CUDAExtension` sorts out its archs, I have made further improvements to this PR:

- added support for `TORCH_CUDA_ARCH_LIST`, in exactly the same manner as `CUDAExtension` does it. So now I can build deepspeed much faster by specifying only the archs that I need:
  `TORCH_CUDA_ARCH_LIST="6.1;7.5;8.6" DS_BUILD_OPS=1 pip install --no-clean --no-cache -v --disable-pip-version-check -e .`
- `TORCH_CUDA_ARCH_LIST` overrides the `CUDAOpBuilder`'s `cross_compile_archs` arg (see the sketch below)
- documented the `CUDAOpBuilder` nuances

A related PR: proposed support for `compute_86` in #577 to include the full capabilities of the rtx-30* cards.
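To make the override precedence explicit, a minimal sketch (illustrative names and defaults, not the exact `CUDAOpBuilder` implementation):

```python
import os

def resolve_cross_compile_archs(cross_compile_archs: str = "6.0;6.1;7.0") -> str:
    # TORCH_CUDA_ARCH_LIST, if set in the environment, wins over whatever
    # cross_compile_archs the op builder was constructed with; the default
    # value here is just a placeholder.
    return os.environ.get("TORCH_CUDA_ARCH_LIST") or cross_compile_archs

print(resolve_cross_compile_archs())                 # builder default
os.environ["TORCH_CUDA_ARCH_LIST"] = "6.1;7.5;8.6"
print(resolve_cross_compile_archs())                 # env var wins
```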
This might be related to #95