Vectorizing IR

Vectorization is an important compiler optimization that executes a single instruction on multiple data elements at once. If the target architecture supports vector registers, multiple data elements can be loaded into those registers, and special vector instructions can operate on them in one go.

There are two types of vectorization in LLVM—Superword-Level Parallelism (SLP) vectorization and loop vectorization. Loop vectorization looks for vectorization opportunities across loop iterations, while SLP vectorization targets straight-line code within a basic block. In this recipe, we will see how straight-line code is vectorized.

Getting ready

SLP vectorization builds a bottom-up tree of the IR expression and compares the nodes of the tree to see whether they are similar and can therefore be combined into vectors. The file to be modified is lib/Transforms/Vectorize/SLPVectorizer.cpp.

We will try to vectorize a piece of straight-line code, such as return a[0] + a[1] + a[2] + a[3].

The expression tree for the preceding type of code will be a one-sided (left-leaning) tree. We will run a depth-first search (DFS) over it to store the operands and the operators.

The IR for the preceding kind of expression will look like this:

define i32 @hadd(i32* %a) {
entry:
    %0 = load i32* %a, align 4
    %arrayidx1 = getelementptr inbounds i32* %a, i32 1
    %1 = load i32* %arrayidx1, align 4
    %add = add nsw i32 %0, %1
    %arrayidx2 = getelementptr inbounds i32* %a, i32 2
    %2 = load i32* %arrayidx2, align 4
    %add3 = add nsw i32 %add, %2
    %arrayidx4 = getelementptr inbounds i32* %a, i32 3
    %3 = load i32* %arrayidx4, align 4
    %add5 = add nsw i32 %add3, %3
    ret i32 %add5
}

The vectorization model follows three steps:

  1. Checking whether it's legal to vectorize.
  2. Calculating the profitability of the vectorized code over the scalarized code.
  3. Vectorizing the code if these two conditions are satisfied.

How to do it...

  1. Open the SLPVectorizer.cpp file. A new function needs to be implemented for DFS traversal of the expression tree for the IR shown in the Getting ready section:
    bool matchFlatReduction(PHINode *Phi, BinaryOperator *B,
                            const DataLayout *DL) {
      if (!B)
        return false;

      if (B->getType()->isVectorTy() ||
          !B->getType()->isIntegerTy())
        return false;

      ReductionOpcode = B->getOpcode();
      ReducedValueOpcode = 0;
      ReduxWidth = MinVecRegSize / DL->getTypeAllocSizeInBits(B->getType());
      ReductionRoot = B;
      ReductionPHI = Phi;

      if (ReduxWidth < 4)
        return false;
      if (ReductionOpcode != Instruction::Add)
        return false;

      SmallVector<BinaryOperator *, 32> Stack;
      ReductionOps.push_back(B);
      Stack.push_back(B);

      // Traversal of the tree.
      while (!Stack.empty()) {
        BinaryOperator *Bin = Stack.back();
        if (Bin->getParent() != B->getParent())
          return false;
        Value *Op0 = Bin->getOperand(0);
        Value *Op1 = Bin->getOperand(1);
        if (!Op0->hasOneUse() || !Op1->hasOneUse())
          return false;
        BinaryOperator *Op0Bin = dyn_cast<BinaryOperator>(Op0);
        BinaryOperator *Op1Bin = dyn_cast<BinaryOperator>(Op1);
        Stack.pop_back();

        // Do not handle the case where both operands are binary
        // operators.
        if (Op0Bin && Op1Bin)
          return false;

        // Neither operand is a binary operator.
        if (!Op0Bin && !Op1Bin) {
          ReducedVals.push_back(Op1);
          ReducedVals.push_back(Op0);
          ReductionOps.push_back(Bin);
          continue;
        }

        // One of the operands is a binary operator; push it onto the
        // stack for further processing. Push the other, non-binary
        // operand into ReducedVals.
        if (Op0Bin) {
          if (Op0Bin->getOpcode() != ReductionOpcode)
            return false;
          Stack.push_back(Op0Bin);
          ReducedVals.push_back(Op1);
          ReductionOps.push_back(Op0Bin);
        }

        if (Op1Bin) {
          if (Op1Bin->getOpcode() != ReductionOpcode)
            return false;
          Stack.push_back(Op1Bin);
          ReducedVals.push_back(Op0);
          ReductionOps.push_back(Op1Bin);
        }
      }

      SmallVector<Value *, 16> Temp;
      // Reverse the loads from a[3], a[2], a[1], a[0] to a[0], a[1],
      // a[2], a[3] for checking incremental consecutiveness further
      // ahead.
      while (!ReducedVals.empty())
        Temp.push_back(ReducedVals.pop_back_val());
      ReducedVals.clear();
      for (unsigned i = 0, e = Temp.size(); i < e; ++i)
        ReducedVals.push_back(Temp[i]);
      return true;
    }
  2. Calculate the cost of the resultant vectorized IR and conclude whether it is profitable to vectorize. In the SLPVectorizer.cpp file, add the following lines to the getReductionCost() function:
    int HAddCost = INT_MAX;
    // If a horizontal addition pattern is identified, calculate its
    // cost. Such horizontal additions can be modeled as a combination
    // of shuffled sub-vectors and vector adds, plus one single
    // extractelement from the last resultant vector.
    // e.g. a[0]+a[1]+a[2]+a[3] can be modeled as:
    //   %1 = load <4 x> %0
    //   %2 = shuffle %1 <2, 3, undef, undef>
    //   %3 = add <4 x> %1, %2
    //   %4 = shuffle %3 <1, undef, undef, undef>
    //   %5 = add <4 x> %3, %4
    //   %6 = extractelement %5 <0>
    if (IsHAdd) {
      unsigned VecElem = VecTy->getVectorNumElements();
      unsigned NumRedxLevel = Log2_32(VecElem);
      HAddCost = NumRedxLevel *
                     (TTI->getArithmeticInstrCost(ReductionOpcode, VecTy) +
                      TTI->getShuffleCost(
                          TargetTransformInfo::SK_ExtractSubvector, VecTy,
                          VecElem / 2, VecTy)) +
                 TTI->getVectorInstrCost(Instruction::ExtractElement, VecTy, 0);
    }
  3. In the same function, after calculating PairwiseRdxCost and SplittingRdxCost, compare them with HAddCost:
    VecReduxCost = HAddCost < VecReduxCost ? HAddCost : VecReduxCost;
  4. In the vectorizeChainsInBlock() function, call the matchFlatReduction() function you just defined:
    // Try to vectorize horizontal reductions feeding into a return.
    if (ReturnInst *RI = dyn_cast<ReturnInst>(it))
      if (RI->getNumOperands() != 0)
        if (BinaryOperator *BinOp =
                dyn_cast<BinaryOperator>(RI->getOperand(0))) {
          DEBUG(dbgs() << "SLP: Found a return to vectorize.\n");
          HorizontalReduction HorRdx;
          IsReturn = true;
          if ((HorRdx.matchFlatReduction(nullptr, BinOp, DL) &&
               HorRdx.tryToReduce(R, TTI)) ||
              tryToVectorizePair(BinOp->getOperand(0),
                                 BinOp->getOperand(1), R)) {
            Changed = true;
            it = BB->begin();
            e = BB->end();
            continue;
          }
        }
  5. Define two global flags to keep a track of horizontal reduction, which feeds into a return:
    static bool IsReturn = false;
    static bool IsHAdd = false;
  6. Allow the vectorization of small trees if they feed into a return. Add the following line to the isFullyVectorizableTinyTree() function:
    if (VectorizableTree.size() == 1 && IsReturn && IsHAdd)
      return true;

How it works…

Compile the LLVM project after saving the file containing the preceding code, and run the opt tool on the example IR, as follows:

  1. Open the example.ll file and paste the following IR in it:
    define i32 @hadd(i32* %a) {
    entry:
        %0 = load i32* %a, align 4
        %arrayidx1 = getelementptr inbounds i32* %a, i32 1
        %1 = load i32* %arrayidx1, align 4
        %add = add nsw i32 %0, %1
        %arrayidx2 = getelementptr inbounds i32* %a, i32 2
        %2 = load i32* %arrayidx2, align 4
        %add3 = add nsw i32 %add, %2
        %arrayidx4 = getelementptr inbounds i32* %a, i32 3
        %3 = load i32* %arrayidx4, align 4
        %add5 = add nsw i32 %add3, %3
        ret i32 %add5
    }
    
  2. Run the opt tool on example.ll:
    $ opt -basicaa -slp-vectorizer -mtriple=aarch64-unknown-linux-gnu -mcpu=cortex-a57 -S example.ll
    

    The output will be vectorized code, like the following:

    define i32 @hadd(i32* %a) {
    entry:
        %0 = bitcast i32* %a to <4 x i32>*
        %1 = load <4 x i32>* %0, align 4
        %rdx.shuf = shufflevector <4 x i32> %1, <4 x i32> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
        %bin.rdx = add <4 x i32> %1, %rdx.shuf
        %rdx.shuf1 = shufflevector <4 x i32> %bin.rdx, <4 x i32> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
        %bin.rdx2 = add <4 x i32> %bin.rdx, %rdx.shuf1
        %2 = extractelement <4 x i32> %bin.rdx2, i32 0
        ret i32 %2
    }
    

As observed, the code gets vectorized. The matchFlatReduction() function performs a DFS traversal of the expression, storing all the loads in ReducedVals and the adds in ReductionOps. After this, the cost of horizontal vectorization is calculated in HAddCost and compared with the scalar cost; it turns out to be profitable, so the expression is vectorized. The actual reduction is handled by the tryToReduce() function, which is already implemented.

See also…

  • For detailed vectorization concepts, refer to the paper Loop-Aware SLP in GCC by Ira Rosen, Dorit Nuzman, and Ayal Zaks