Vectorization is an important compiler optimization that lets a single instruction operate on multiple data elements at once. If the target architecture supports vector registers, several data elements can be loaded into a vector register, and special vector instructions can be executed on that register.
There are two types of vectorization in LLVM—Superword-Level Parallelism (SLP) and loop vectorization. Loop vectorization deals with vectorization opportunities across loop iterations, while SLP vectorization deals with vectorizing straight-line code in a basic block. In this recipe, we will see how straight-line code is vectorized.
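To make the distinction concrete, here is an illustrative C++ sketch (the function names are invented for this example): the loop vectorizer targets the loop form, while the SLP vectorizer targets the unrolled, straight-line form.

```cpp
#include <cassert>

// Loop form: the loop vectorizer can combine work across iterations.
int sum_loop(const int *a, int n) {
  int s = 0;
  for (int i = 0; i < n; ++i)
    s += a[i];
  return s;
}

// Straight-line form: no loop at all. The SLP vectorizer can pack these
// four scalar loads and adds into one vector load plus a horizontal add.
int sum_slp(const int *a) {
  return a[0] + a[1] + a[2] + a[3];
}
```

Both functions compute the same result; they differ only in the shape of the code the vectorizer has to recognize.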
SLP vectorization constructs a bottom-up tree of the IR expression and compares the nodes of the tree to see whether they are similar and can hence be combined to form vectors. The file to be modified is lib/Transforms/Vectorize/SLPVectorizer.cpp.
We will try to vectorize a piece of straight-line code, such as return a[0] + a[1] + a[2] + a[3].
The expression tree for the preceding type of code will be a somewhat one-sided tree. We will run a DFS to store the operands and the operators.
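This traversal can be sketched outside LLVM with a toy expression tree. The sketch below is an illustrative model, not the pass itself: Node, collectLeaves, and the leaf names are invented for this example. An iterative DFS pops a node, records leaf operands, and pushes binary-operator operands for further processing—mirroring how the function we will write gathers loads and adds:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Toy expression node: either a leaf (a value such as a load) or an
// "add" with two children. All names here are illustrative.
struct Node {
  bool IsAdd;
  std::string Name;          // set only for leaves
  Node *L = nullptr, *R = nullptr;
};

// Iterative DFS collecting leaf names. For the one-sided tree of
// a[0]+a[1]+a[2]+a[3], leaves come out as a3, a2, a1, a0, so a final
// reverse restores source order -- as the recipe does with ReducedVals.
std::vector<std::string> collectLeaves(Node *Root) {
  std::vector<std::string> Leaves;
  std::vector<Node *> Stack{Root};
  while (!Stack.empty()) {
    Node *N = Stack.back();
    Stack.pop_back();
    if (!N->IsAdd) {
      Leaves.push_back(N->Name);
      continue;
    }
    // Push left then right: the right (leaf) operand is popped first.
    Stack.push_back(N->L);
    Stack.push_back(N->R);
  }
  std::reverse(Leaves.begin(), Leaves.end());
  return Leaves;
}
```

Running this on the tree for a[0]+a[1]+a[2]+a[3] yields the leaves in source order, ready for a consecutiveness check.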
The IR for the preceding kind of expression will look like this:
define i32 @hadd(i32* %a) {
entry:
  %0 = load i32* %a, align 4
  %arrayidx1 = getelementptr inbounds i32* %a, i32 1
  %1 = load i32* %arrayidx1, align 4
  %add = add nsw i32 %0, %1
  %arrayidx2 = getelementptr inbounds i32* %a, i32 2
  %2 = load i32* %arrayidx2, align 4
  %add3 = add nsw i32 %add, %2
  %arrayidx4 = getelementptr inbounds i32* %a, i32 3
  %3 = load i32* %arrayidx4, align 4
  %add5 = add nsw i32 %add3, %3
  ret i32 %add5
}
The vectorization model follows three steps: checking whether the pattern can be vectorized, estimating whether doing so is profitable, and generating the vector code. Implement the changes as follows:
1. Open the lib/Transforms/Vectorize/SLPVectorizer.cpp file. A new function needs to be implemented for the DFS traversal of the expression tree of the IR shown in the Getting ready section:

bool matchFlatReduction(PHINode *Phi, BinaryOperator *B,
                        const DataLayout *DL) {
  if (!B)
    return false;
  if (B->getType()->isVectorTy() || !B->getType()->isIntegerTy())
    return false;

  ReductionOpcode = B->getOpcode();
  ReducedValueOpcode = 0;
  ReduxWidth = MinVecRegSize / DL->getTypeAllocSizeInBits(B->getType());
  ReductionRoot = B;
  ReductionPHI = Phi;

  if (ReduxWidth < 4)
    return false;
  if (ReductionOpcode != Instruction::Add)
    return false;

  SmallVector<BinaryOperator *, 32> Stack;
  ReductionOps.push_back(B);
  ReductionOpcode = B->getOpcode();
  Stack.push_back(B);

  // Traversal of the tree.
  while (!Stack.empty()) {
    BinaryOperator *Bin = Stack.back();
    if (Bin->getParent() != B->getParent())
      return false;
    Value *Op0 = Bin->getOperand(0);
    Value *Op1 = Bin->getOperand(1);
    if (!Op0->hasOneUse() || !Op1->hasOneUse())
      return false;
    BinaryOperator *Op0Bin = dyn_cast<BinaryOperator>(Op0);
    BinaryOperator *Op1Bin = dyn_cast<BinaryOperator>(Op1);
    Stack.pop_back();

    // Do not handle the case where both operands are binary operators.
    if (Op0Bin && Op1Bin)
      return false;

    // Neither operand is a binary operator.
    if (!Op0Bin && !Op1Bin) {
      ReducedVals.push_back(Op1);
      ReducedVals.push_back(Op0);
      ReductionOps.push_back(Bin);
      continue;
    }

    // One of the operands is a binary operator; push it onto the stack
    // for further processing, and push the other, non-binary operand
    // into ReducedVals.
    if (Op0Bin) {
      if (Op0Bin->getOpcode() != ReductionOpcode)
        return false;
      Stack.push_back(Op0Bin);
      ReducedVals.push_back(Op1);
      ReductionOps.push_back(Op0Bin);
    }
    if (Op1Bin) {
      if (Op1Bin->getOpcode() != ReductionOpcode)
        return false;
      Stack.push_back(Op1Bin);
      ReducedVals.push_back(Op0);
      ReductionOps.push_back(Op1Bin);
    }
  }

  SmallVector<Value *, 16> Temp;
  // Reverse the loads from a[3], a[2], a[1], a[0] to a[0], a[1], a[2],
  // a[3] so that consecutiveness can be checked further ahead.
  while (!ReducedVals.empty())
    Temp.push_back(ReducedVals.pop_back_val());
  ReducedVals.clear();
  for (unsigned i = 0, e = Temp.size(); i < e; ++i)
    ReducedVals.push_back(Temp[i]);
  return true;
}

2. In the same file, add the following lines to the getReductionCost() function:

int HAddCost = INT_MAX;
// If a horizontal addition pattern is identified, calculate its cost.
// Such horizontal additions can be modeled as a combination of
// shuffled sub-vectors and vector adds, plus one final extractelement
// from the last resultant vector.
// e.g. a[0]+a[1]+a[2]+a[3] can be modeled as
// %1 = load <4 x> %0
// %2 = shuffle %1 <2, 3, undef, undef>
// %3 = add <4 x> %1, %2
// %4 = shuffle %3 <1, undef, undef, undef>
// %5 = add <4 x> %3, %4
// %6 = extractelement %5 <0>
if (IsHAdd) {
  unsigned VecElem = VecTy->getVectorNumElements();
  unsigned NumRedxLevel = Log2_32(VecElem);
  HAddCost = NumRedxLevel *
                 (TTI->getArithmeticInstrCost(ReductionOpcode, VecTy) +
                  TTI->getShuffleCost(TargetTransformInfo::SK_ExtractSubvector,
                                      VecTy, VecElem / 2, VecTy)) +
             TTI->getVectorInstrCost(Instruction::ExtractElement, VecTy, 0);
}

3. After PairwiseRdxCost and SplittingRdxCost have been calculated, compare them with HAddCost:

VecReduxCost = HAddCost < VecReduxCost ? HAddCost : VecReduxCost;

4. In the vectorizeChainsInBlock() function, call the matchFlatReduction() function you just defined:

// Try to vectorize horizontal reductions feeding into a return.
if (ReturnInst *RI = dyn_cast<ReturnInst>(it))
  if (RI->getNumOperands() != 0)
    if (BinaryOperator *BinOp =
            dyn_cast<BinaryOperator>(RI->getOperand(0))) {
      DEBUG(dbgs() << "SLP: Found a return to vectorize.\n");
      HorizontalReduction HorRdx;
      IsReturn = true;
      if ((HorRdx.matchFlatReduction(nullptr, BinOp, DL) &&
           HorRdx.tryToReduce(R, TTI)) ||
          tryToVectorizePair(BinOp->getOperand(0), BinOp->getOperand(1), R)) {
        Changed = true;
        it = BB->begin();
        e = BB->end();
        continue;
      }
    }

5. Define the following two flags, which mark the pattern being handled:

static bool IsReturn = false;
static bool IsHAdd = false;

6. Finally, add the following condition to the isFullyVectorizableTinyTree() function so that such trees are considered vectorizable:

if (VectorizableTree.size() == 1 && IsReturn && IsHAdd)
  return true;
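The HAddCost formula can be checked with a little arithmetic: a <4 x i32> reduction needs log2(4) = 2 shuffle-plus-add levels and one extract. The sketch below plugs in made-up unit costs—the real numbers come from TargetTransformInfo and vary per target—so only the formula's shape is meaningful:

```cpp
#include <cassert>

// Mirrors LLVM's Log2_32 for powers of two (illustrative reimplementation).
unsigned log2u(unsigned V) {
  unsigned L = 0;
  while (V > 1) { V >>= 1; ++L; }
  return L;
}

// Illustrative version of the HAddCost formula. The per-instruction
// costs are invented placeholders; the pass queries TTI for them.
unsigned hAddCost(unsigned VecElem, unsigned AddCost, unsigned ShuffleCost,
                  unsigned ExtractCost) {
  unsigned NumRedxLevel = log2u(VecElem); // 2 levels for 4 elements
  return NumRedxLevel * (AddCost + ShuffleCost) + ExtractCost;
}
```

With unit costs of 1 each, a 4-element reduction costs 2 * (1 + 1) + 1 = 5; whether that beats the scalar cost is what getReductionCost() ultimately decides.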
Compile the LLVM project after saving the file containing the preceding code, and run the opt tool on the example IR, as follows:
Create an example.ll file and paste the following IR in it:

define i32 @hadd(i32* %a) {
entry:
  %0 = load i32* %a, align 4
  %arrayidx1 = getelementptr inbounds i32* %a, i32 1
  %1 = load i32* %arrayidx1, align 4
  %add = add nsw i32 %0, %1
  %arrayidx2 = getelementptr inbounds i32* %a, i32 2
  %2 = load i32* %arrayidx2, align 4
  %add3 = add nsw i32 %add, %2
  %arrayidx4 = getelementptr inbounds i32* %a, i32 3
  %3 = load i32* %arrayidx4, align 4
  %add5 = add nsw i32 %add3, %3
  ret i32 %add5
}

Then run the opt tool on example.ll:

$ opt -basicaa -slp-vectorizer -mtriple=aarch64-unknown-linux-gnu -mcpu=cortex-a57 example.ll -S
The output will be vectorized code, like the following:
define i32 @hadd(i32* %a) {
entry:
  %0 = bitcast i32* %a to <4 x i32>*
  %1 = load <4 x i32>* %0, align 4
  %rdx.shuf = shufflevector <4 x i32> %1, <4 x i32> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
  %bin.rdx = add <4 x i32> %1, %rdx.shuf
  %rdx.shuf1 = shufflevector <4 x i32> %bin.rdx, <4 x i32> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
  %bin.rdx2 = add <4 x i32> %bin.rdx, %rdx.shuf1
  %2 = extractelement <4 x i32> %bin.rdx2, i32 0
  ret i32 %2
}
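The shuffle-and-add sequence in this output can be traced with a small scalar simulation. Treat the following as a sanity check of the transformation's semantics rather than real codegen; the undef lanes are modeled as zero here, which is one legal refinement of undef:

```cpp
#include <array>
#include <cassert>

// Simulate the vectorized hadd: two shuffle+add reduction levels
// followed by an extract of lane 0.
int haddSimulated(const std::array<int, 4> &A) {
  std::array<int, 4> V = A;                   // %1 = load <4 x i32>
  std::array<int, 4> S1 = {V[2], V[3], 0, 0}; // shuffle <2, 3, undef, undef>
  std::array<int, 4> R1;                      // %bin.rdx = add
  for (int i = 0; i < 4; ++i)
    R1[i] = V[i] + S1[i];
  std::array<int, 4> S2 = {R1[1], 0, 0, 0};   // shuffle <1, undef, undef, undef>
  std::array<int, 4> R2;                      // %bin.rdx2 = add
  for (int i = 0; i < 4; ++i)
    R2[i] = R1[i] + S2[i];
  return R2[0];                               // extractelement lane 0
}
```

For {1, 2, 3, 4}, the first level produces lane sums {4, 6, ...} and the second folds lane 1 into lane 0, yielding 10—the same as the scalar a[0]+a[1]+a[2]+a[3].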
As observed, the code gets vectorized. The matchFlatReduction() function performs a DFS traversal of the expression, storing all the loads in ReducedVals and the adds in ReductionOps. After this, the cost of horizontal vectorization is calculated in HAddCost and compared with the scalar cost. Since it turns out to be profitable, the expression is vectorized. This is handled in the tryToReduce() function, which is already implemented.