关于 ITS 到底是怎么一回事的调研

今天想优化掉一部分现有 kernel 的 warp divergence。然后去大概讨论一下的时候，突然有人说这个优化可能没有效果，因为有一个 CUDA Programming Guide 的机制，并且贴了对应的内容给我，当时看的是老版本的文档，有点懵，于是自己去翻了一下。

这段文档新的位置是 Independent Thread Scheduling，大家当然可以自己去阅读，不过这边直接引用原文了。

On GPUs with compute capability lower than 7.0, warps used a single program counter shared amongst all 32 threads in the warp together with an active mask specifying the active threads of the warp.

在计算能力低于 7.0 的 GPU 上，一个 warp 使用单一程序计数器（program counter），由 warp 内全部 32 个线程共享；同时配合一个活动掩码（active mask）来标记该 warp 中哪些线程处于活动状态、会参与执行当前指令。

As a result, threads from the same warp in divergent regions or different states of execution cannot signal each other or exchange data, and algorithms requiring fine-grained sharing of data guarded by locks or mutexes can lead to deadlock, depending on which warp the contending threads come from.

因此，当同一 warp 中的线程处在不同的发散区域或不同的执行状态时，它们无法彼此通信。依赖锁或互斥量（locks/mutexes）来实现细粒度数据共享的算法，可能会发生死锁；是否死锁会取决于发生争用的线程来自哪个 warp。

In GPUs of compute capability 7.0 and later, independent thread scheduling allows full concurrency between threads, regardless of warp.

With independent thread scheduling, the GPU maintains execution state per thread, including a program counter and call stack, and can yield execution at a per-thread granularity, either to make better use of execution resources or to allow one thread to wait for data to be produced by another.

在计算能力 7.0 及更高的 GPU 上，独立线程调度（independent thread scheduling）允许线程之间实现完全并发，而不受 warp 的限制。在独立线程调度下，GPU 会为每个线程维护执行状态，包括程序计数器和调用栈（call stack）；并且可以以单线程粒度让出执行（yield），从而可以实现更充分利用执行资源，或者某个线程等待另一个线程产生数据。

A schedule optimizer determines how to group active threads from the same warp together into SIMT units. This retains the high throughput of SIMT execution as in prior NVIDIA GPUs, but with much more flexibility: threads can now diverge and reconverge at sub-warp granularity.

调度优化器会决定如何将同一 warp 中处于活动状态的线程组合成 SIMT 执行单元。这样既保留了以往 NVIDIA GPU 的 SIMT 高吞吐执行特性，同时也提供了更强的灵活性：线程现在可以在小于一个 warp 的粒度上发散，并在之后重新一起工作（reconverge）。

Independent thread scheduling can break code that relies on implicit warp-synchronous behavior from previous GPU architectures. Warp-synchronous code assumes that threads in the same warp execute in lockstep at every instruction, but the ability for threads to diverge and reconverge at sub-warp granularity makes such assumptions invalid.

独立线程调度可能会破坏那些依赖早期 GPU 架构中隐式 warp 同步（implicit warp-synchronous）行为的代码。所谓 warp 同步（warp-synchronous）代码，是假设同一 warp 内的线程在每条指令上都严格同步执行；但是现在的话，这个假设不成立。

This can lead to a different set of threads participating in the executed code than intended. Any warp-synchronous code developed for GPUs prior to CC 7.0 (such as synchronization-free intra-warp reductions) should be revisited to ensure compatibility. Developers should explicitly synchronize such code using __syncwarp() to ensure correct behavior across all GPU generations.

这可能导致实际参与执行的线程集合与预期不一致。任何为 CC 7.0 之前 GPU 编写的 warp 同步代码（例如不使用同步的 warp 内归约）都应重新审视，以确保兼容性。开发者应使用 syncwarp() 对这类代码进行显式同步，以保证在所有 GPU 架构上都能正确运行。

下面进行一个中译中：

第一段话是说，之前如果出现分支，我们就直接用掩码来决定某些线程做或不做某个分支的行为，然后顺序的激活掩码代表的线程执行对应分支的作业。一旦发生 warp divergence 那么整个路径有两个分支，程序员并不知道实际上先执行哪一个，因为这个发射的顺序是不可知的。

__device__ int flag;
__device__ int data;

if (lane_id < 4) {              // 生产者
    data = compute();
    flag = 1;
} else {                         // 消费者
    while (flag == 0) { }        // 等待生产者
    use(data);
}

在 Volta 以后的 ITS 机制下，GPU 会为每个线程维护独立的执行状态（包括 PC 和调用栈），因此同一 warp 内的线程可以在小于 warp 的粒度上发散和重聚。需要强调的是：这并不意味着线程从此“完全独立并行、互不影响”，因为硬件仍然以 SIMT 的方式发射指令——调度器只是在每个时刻从同一 warp 的活跃线程里，挑出当前位于同一执行点（同一 PC）的一组线程组成一个 SIMT 执行单元来执行。

因此，在一些“发散后某一侧线程自旋等待另一侧产出”的写法里，旧架构可能因为 warp 共享 PC 导致整个 warp 被卡在等待路径上，从而让“能推进并释放条件的那一侧”没有机会执行，出现结构性卡死；而 ITS 由于具备每线程独立进度与更灵活的重聚/调度，能让 warp 内处于不同执行点的线程更有机会交替推进，从而降低这类结构性死锁的风险。

后面那个对存量代码导致的问题，可以不用在乎，现在应该没有无 ITS 的卡还存在于市面上了。我的理解是，因为有的线程会让出，可能最后并不是两次发射，而是更多，因为让出的线程要重新激活，所以为了保证语义，要显示同步。

如果按照编程规范来写，本身就知道上面的代码性能是不好的，所以大部分时候本来就不会这么写。所以，如果你遇到一个其实是需要清零+向量乘的操作，先分支进行乘法再分支清零，其实是不如 warp divergence less 的先清零的，虽然 GPU 空转了，但是计算的代价不大，branchless 的写法更容易拿到性能优势，同时还能减少一个分支计算的寄存器的损耗。

综上所述，ITS 并不能消除 warp divergence，在 SIMT 的世界里，我们依然需要去设计算法，使得 branch 被尽可能的消除，只有在实在没办法，不得不写出上面的情况的时候（比如算法的内在结构 or 操作占比很小），我们才选择使用 ITS 给我们的能力，让程序可以尽可能跑起来。