Loop Vectorization: Diagnostics and Control

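Loop vectorization was first introduced in LLVM 3.2 and turned on by default in LLVM 3.3. It has been discussed previously on this blog in 2012 and 2013, as well as at FOSDEM 2014 and at Apple's WWDC 2013. The LLVM loop vectorizer combines multiple iterations of a loop to improve performance. Modern processors can exploit the independence of the interleaved instructions using advanced hardware features, such as multiple execution units and out-of-order execution.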

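Unfortunately, when loop vectorization is not possible or not profitable, the loop is silently skipped. This is a problem for the many applications that rely on the performance vectorization provides. Recent updates to LLVM provide command-line arguments that help diagnose vectorization issues, and a new pragma syntax for tuning loop vectorization, interleaving, and unrolling.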

New Feature: Diagnostic Remarks

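Diagnostic remarks give the user insight into the behavior of LLVM optimization passes, including unrolling, interleaving, and vectorization. They are enabled with the -Rpass command-line arguments. Interleaving and vectorization diagnostic remarks are produced by specifying the 'loop-vectorize' pass. For example, specifying '-Rpass=loop-vectorize' tells us that the following loop was vectorized with a width of 4 and interleaved by a factor of 2.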

void test1(int *List, int Length) {
  int i = 0;
  while(i < Length) {
    List[i] = i*2;
    i++;
  }
}

clang -O3 -Rpass=loop-vectorize -S test1.c -o /dev/null

test1.c:4:5: remark: vectorized loop (vectorization factor: 4, unrolling interleave factor: 2)
    while(i < Length) {
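To make the two numbers in that remark concrete, here is a hand-written model of what "vectorization factor 4, interleave factor 2" means for the test1 loop. It is only a sketch: the names fill_scalar and fill_vf4_ic2 are made up for this illustration, and the compiler emits SIMD instructions rather than the inner lane loops shown here.

```c
#include <assert.h>

/* Scalar reference: same computation as the test1 loop. */
void fill_scalar(int *List, int Length) {
  for (int i = 0; i < Length; i++)
    List[i] = i * 2;
}

/* Illustrative model (hypothetical helper, not compiler output):
   each trip of the main loop covers 2 groups of 4 elements, and a
   scalar remainder loop finishes whatever is left over. */
void fill_vf4_ic2(int *List, int Length) {
  assert(Length >= 0);
  int i = 0;
  for (; i + 8 <= Length; i += 8) {
    for (int lane = 0; lane < 4; lane++)   /* first vector of 4 lanes */
      List[i + lane] = (i + lane) * 2;
    for (int lane = 0; lane < 4; lane++)   /* second, independent vector */
      List[i + 4 + lane] = (i + 4 + lane) * 2;
  }
  for (; i < Length; i++)                  /* scalar remainder */
    List[i] = i * 2;
}
```

The two groups inside the main loop do not depend on each other, which is what lets a processor with several execution units work on both at once.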

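Many loops cannot be vectorized, including loops with complicated control flow, unvectorizable types, and unvectorizable calls. For example, to prove that it is safe to vectorize the following loop, we must prove that array 'A' is not an alias of array 'B'. However, the bounds of array 'A' cannot be identified.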

void test2(int *A, int *B, int Length) {
  for (int i = 0; i < Length; i++)
    A[B[i]]++;
}

clang -O3 -Rpass-analysis=loop-vectorize -S test2.c -o /dev/null

test2.c:3:5: remark: loop not vectorized: cannot identify array bounds
    for (int i = 0; i < Length; i++)
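The aliasing concern can be shown with a small experiment. The sketch below compares the scalar test2 loop with a hand-written model of a 4-wide version that loads four 'B' values before performing any store. The helper names (update_scalar, update_chunked, differs_when_aliased) are invented for this illustration, and the chunked function is only a model of the reordering, not what the vectorizer would emit.

```c
#include <assert.h>

/* Scalar reference: same computation as the test2 loop. */
void update_scalar(int *A, int *B, int Length) {
  for (int i = 0; i < Length; i++)
    A[B[i]]++;
}

/* Illustrative model: read four indices from B up front, then update A.
   This reordering is only valid if writes to A never change B. */
void update_chunked(int *A, int *B, int Length) {
  assert(Length >= 0);
  int i = 0;
  for (; i + 4 <= Length; i += 4) {
    int idx[4];
    for (int lane = 0; lane < 4; lane++)
      idx[lane] = B[i + lane];             /* "vector load" of B */
    for (int lane = 0; lane < 4; lane++)
      A[idx[lane]]++;                      /* updates happen afterwards */
  }
  for (; i < Length; i++)                  /* scalar remainder */
    A[B[i]]++;
}

/* Returns 1 if the two versions disagree when A and B are the same array. */
int differs_when_aliased(void) {
  int s[4] = {1, 2, 3, 0};
  int c[4] = {1, 2, 3, 0};
  update_scalar(s, s, 4);   /* ends as {1, 3, 4, 2} */
  update_chunked(c, c, 4);  /* ends as {2, 3, 4, 1} */
  for (int i = 0; i < 4; i++)
    if (s[i] != c[i])
      return 1;
  return 0;
}
```

When 'A' and 'B' are distinct arrays the two functions agree, which is exactly the property the compiler would have to prove before reordering the loads and stores.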

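Control flow and other unvectorizable statements are reported by the '-Rpass-analysis' command-line argument. For example, many uses of 'break' and 'switch' are not vectorizable.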

C/C++ code	-Rpass-analysis=loop-vectorize
for (int i = 0; i < Length; i++) { if (A[i] > 10.0) break; A[i] = 0; }	control_flow.cpp:5:9: remark: loop not vectorized: loop control flow is not understood by vectorizer if (A[i] > 10.0) ^
for (int i = 0; i < Length; i++) { switch(A[i]) { case 0: B[i] = 1; break; case 1: B[i] = 2; break; default: B[i] = 3; } }	no_switch.cpp:4:5: remark: loop not vectorized: loop contains a switch statement switch(A[i]) { ^
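One common response to the 'switch' remark is to express the same computation with conditional expressions, which leaves a single straight-line statement in the loop body. The sketch below shows the two forms side by side; classify_switch and classify_select are names invented here, and the element type int is an assumption. Whether a given clang version actually vectorizes either form is not guaranteed, so it is worth re-running with -Rpass=loop-vectorize and -Rpass-analysis=loop-vectorize to check.

```c
#include <assert.h>

/* The switch-based loop from the table, wrapped in a function. */
void classify_switch(const int *A, int *B, int Length) {
  for (int i = 0; i < Length; i++) {
    switch (A[i]) {
    case 0:  B[i] = 1; break;
    case 1:  B[i] = 2; break;
    default: B[i] = 3;
    }
  }
}

/* Same result written with conditional expressions instead of a switch. */
void classify_select(const int *A, int *B, int Length) {
  assert(Length >= 0);
  for (int i = 0; i < Length; i++)
    B[i] = (A[i] == 0) ? 1 : (A[i] == 1) ? 2 : 3;
}
```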

New Feature: Loop Pragma Directive

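Explicit control over the behavior of vectorization, interleaving, and unrolling is necessary to fine-tune performance. For example, when compiling for size (-Os), it is still a good idea to vectorize the hot loops of an application to improve performance. Vectorization, interleaving, and unrolling can be explicitly specified using the #pragma clang loop directive prior to any for, while, do-while, or C++11 range-based for loop. For example, the following loop uses the loop pragma directive to explicitly specify the vectorization width and interleave count, and to disable unrolling.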

void test3(float *Vx, float *Vy, float *Ux, float *Uy, float *P, int Length) {
  #pragma clang loop vectorize_width(4) interleave_count(4)
  #pragma clang loop unroll(disable)
  for (int i = 0; i < Length; i++) {
    float A = Vx[i] * Ux[i];
    float B = A + Vy[i] * Uy[i];
    P[i] = B;
  }
}

clang -O3 -Rpass=loop-vectorize -S test3.c -o /dev/null

test3.c:5:5: remark: vectorized loop (vectorization factor: 4, unrolling interleave factor: 4)
    for (int i = 0; i < Length; i++) {
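The directive is not limited to for loops. As a minimal sketch, the functions below place it in front of a while loop and a do-while loop. The function names are hypothetical, and the options vectorize(disable) and interleave(enable) do not appear in the examples above; they are taken from the clang loop pragma documentation. Compilers that do not know the pragma simply ignore it, so the code behaves the same either way.

```c
#include <assert.h>

/* Ask for vectorization and interleaving of a while loop. */
int sum_while(const int *List, int Length) {
  assert(Length >= 0);
  int sum = 0;
  int i = 0;
#pragma clang loop vectorize(enable) interleave(enable)
  while (i < Length) {
    sum += List[i];
    i++;
  }
  return sum;
}

/* Ask the vectorizer to leave a do-while loop alone. */
void scale_do_while(int *List, int Length, int Factor) {
  if (Length <= 0)
    return;                 /* a do-while body always runs at least once */
  int i = 0;
#pragma clang loop vectorize(disable)
  do {
    List[i] *= Factor;
    i++;
  } while (i < Length);
}
```

Compiling with -Rpass=loop-vectorize should show whether the request on the first loop was honored.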

Integer Constant Expressions

The options vectorize_width, interleave_count, and unroll_count take an integer constant expression, so their values can be computed, as in the example below.

template <int ArchWidth, int ExecutionUnits>
void test4(float *Vx, float *Vy, float *Ux, float *Uy, float *P, int Length) {
  #pragma clang loop vectorize_width(ArchWidth)
  #pragma clang loop interleave_count(ExecutionUnits * 4)
  for (int i = 0; i < Length; i++) {
    float A = Vx[i] * Ux[i];
    float B = A + Vy[i] * Uy[i];
    P[i] = B;
  }
}

void compute_test4(float *Vx, float *Vy, float *Ux, float *Uy, float *P, int Length) {
  const int arch_width = 4;
  const int exec_units = 2;
  test4<arch_width, exec_units>(Vx, Vy, Ux, Uy, P, Length);
}

clang -O3 -Rpass=loop-vectorize -S test4.cpp -o /dev/null

test4.cpp:6:5: remark: vectorized loop (vectorization factor: 4, unrolling interleave factor: 8)
    for (int i = 0; i < Length; i++) {
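Templates are not available in C, and under the C standard a const int variable is not an integer constant expression. A rough C counterpart of test4 can use enumeration constants instead, since those are integer constant expressions in C. This is a sketch under that assumption: test4_c, ARCH_WIDTH, and EXEC_UNITS are names invented here, and it is worth confirming with -Rpass=loop-vectorize that your clang version reports the expected factors.

```c
#include <assert.h>

/* Enumeration constants are integer constant expressions in C. */
enum { ARCH_WIDTH = 4, EXEC_UNITS = 2 };

void test4_c(float *Vx, float *Vy, float *Ux, float *Uy, float *P, int Length) {
  assert(Length >= 0);
#pragma clang loop vectorize_width(ARCH_WIDTH)
#pragma clang loop interleave_count(EXEC_UNITS * 4)
  for (int i = 0; i < Length; i++) {
    float A = Vx[i] * Ux[i];
    float B = A + Vy[i] * Uy[i];
    P[i] = B;
  }
}
```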

Performance Warnings

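Sometimes a loop transformation is not safe to perform. For example, vectorization can fail because of complex control flow. If vectorization is explicitly specified, a warning message is produced to alert the programmer that the directive cannot be followed. For example, the following function, which returns the index of the last positive value in the list, cannot be vectorized because the 'last_positive_index' variable is used outside of the loop.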

int test5(int *List, int Length) {
  int last_positive_index = 0;
  #pragma clang loop vectorize(enable)
  for (int i = 1; i < Length; i++) {
    if (List[i] > 0) {
      last_positive_index = i;
      continue;
    }
    List[i] = 0;
  }
  return last_positive_index;
}

clang -O3 -g -S test5.c -o /dev/null

test5.c:5:9: warning: loop not vectorized: failed explicitly specified loop vectorization
    for (int i = 1; i < Length; i++) {

The debug option '-g' allows the source line to be provided with the warning.
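When this warning appears, one option is to restructure the loop so that the value used after the loop no longer sits in the same loop as the stores. The sketch below splits test5 into a backward search for the index and a separate clamping loop. Both function names are hypothetical, and this is just one possible rewrite: whether the second loop is vectorized depends on the compiler version, so check it with -Rpass=loop-vectorize.

```c
#include <assert.h>

/* Reference: the test5 loop without the pragma. */
int test5_reference(int *List, int Length) {
  int last_positive_index = 0;
  for (int i = 1; i < Length; i++) {
    if (List[i] > 0) {
      last_positive_index = i;
      continue;
    }
    List[i] = 0;
  }
  return last_positive_index;
}

/* Split version: find the index first, then clamp in a simple loop. */
int test5_split(int *List, int Length) {
  assert(Length >= 0);
  int last_positive_index = 0;
  /* Search from the end; the first positive value found is the last one. */
  for (int i = Length - 1; i >= 1; i--) {
    if (List[i] > 0) {
      last_positive_index = i;
      break;
    }
  }
  /* No value escapes this loop, so it is a plain element-wise update. */
  for (int i = 1; i < Length; i++)
    List[i] = List[i] > 0 ? List[i] : 0;
  return last_positive_index;
}
```

The split version writes every element from index 1 onward rather than only the non-positive ones, but the final contents of the array and the returned index are the same.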

Conclusion

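Diagnostic remarks and the loop pragma directive are two new features that are useful for feedback-oriented performance tuning. Special thanks to all of the people who contributed to the development of these features. Future work includes adding diagnostic remarks to the SLP vectorizer and an additional option for the loop pragma directive to declare memory operations as safe to vectorize. Additional ideas for improvements are welcome.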