-
Notifications
You must be signed in to change notification settings - Fork 380
Description
Payloads are our way of facilitating IO between different entry points throughout a ray tracing pipeline.
In VK/DX, payload values map directly to payload registers, of which today there are 32. When payload structures exceed this limit, values spill into local memory. This all is handled driver side.
However, OptiX behavior is different. With OptiX, it's up to the end-user to call optixSetPayload_<#> and optixGetPayload_<#> explicitly.
On the first bringup of OptiX support in Slang, we implemented the easiest solution at the time, which was to always spill payload structures to local memory, which we'd then reference using the well known "pointer packing pattern" that OptiX users tend to do. We do this today in slang cuda prelude, here:
slang/prelude/slang-cuda-prelude.h
Lines 3171 to 3193 in 6b286bf
| static __forceinline__ __device__ void* unpackOptiXRayPayloadPointer(uint32_t i0, uint32_t i1) | |
| { | |
| const uint64_t uptr = static_cast<uint64_t>(i0) << 32 | i1; | |
| void* ptr = reinterpret_cast<void*>(uptr); | |
| return ptr; | |
| } | |
| static __forceinline__ __device__ void packOptiXRayPayloadPointer( | |
| void* ptr, | |
| uint32_t& i0, | |
| uint32_t& i1) | |
| { | |
| const uint64_t uptr = reinterpret_cast<uint64_t>(ptr); | |
| i0 = uptr >> 32; | |
| i1 = uptr & 0x00000000ffffffff; | |
| } | |
| static __forceinline__ __device__ void* getOptiXRayPayloadPtr() | |
| { | |
| const uint32_t u0 = optixGetPayload_0(); | |
| const uint32_t u1 = optixGetPayload_1(); | |
| return unpackOptiXRayPayloadPointer(u0, u1); | |
| } |
Although this works, it also is very bad for performance. Repos like 3DGUT use the full 32 payload registers to accelerate K-closest-gaussian collection for inverse rendering. If we were to try to write this same kernel today but in Slang, we'd take a pretty big hit to training perf.
To fix this issue, Slang should identify when the payload structure is less than 32 registers worth of data (128 bytes). In that case, Slang should drop this pointer indirection pattern and instead emit the appropriate optixSetPayload_<#> and optixGetPayload_<#> functions.
If it makes the compiler side of this problem any easier, we might be able to emit PTX directly. For example, optixSetPayload_13 is defined as follows:
static __forceinline__ __device__ void optixSetPayload_13( unsigned int p )
{
asm volatile( "call _optix_set_payload, (%0, %1);" : : "r"( 13 ), "r"( p ) : );
}
The "r" register here takes as input an integer literal between 0 and 31, which the Slang compiler could generate during OptiX code emission.
Likewise, here's how the optixGetPayload works, here for payload register 15:
static __forceinline__ __device__ unsigned int optixGetPayload_15()
{
unsigned int result;
asm volatile( "call (%0), _optix_get_payload, (%1);" : "=r"( result ) : "r"( 15 ) : );
return result;
}