LAB 9: Code Optimization


completed. By using VFPv4, this can be accomplished in one fused multiply-accumulate instruction, resulting in only one rounding.

9.5.3 NEON INTRINSICS

Although SIMD instructions can be used when writing hand-optimized assembly code, this approach is cumbersome and time consuming. Fortunately, NEON intrinsics come close to this approach from within C code by using the functions provided in the header arm_neon.h. This was encountered in a previous lab involving the Newton–Raphson iteration for finding the inverse and square root of a number.

However, it is important to note that the use of intrinsics has drawbacks in terms of aligned data. Aligned data refers to data stored in memory such that the base address is a multiple of a power of two and data accesses are performed with the same stride length. For ARM, an effective data alignment is one that matches the size of a cache line in the level 1 cache. For example, on the Cortex-A15 ARM processor, the cache line size is 64 bytes. Aligned data loads allow the processor to read ahead in memory and load data into the level 1 cache before it is read into the register file, resulting in decreased loading times. Aligned data loads cannot be performed when using intrinsics, and memory pointer increments cannot be done as part of load or store operations. The only way to incorporate these features is via assembly code.
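To make the notion of aligned data concrete, the following sketch allocates a buffer whose base address is a multiple of the cache line size and checks the alignment of a pointer. It is only an illustration under stated assumptions: the helper names is_aligned and alloc_aligned_floats are hypothetical, the 64-byte constant is the Cortex-A15 figure quoted above, and the C11 function aligned_alloc is assumed to be available. It does not change how the intrinsics themselves load data.

```c
#include <stdint.h>
#include <stdlib.h>

/* Assumption: 64-byte level 1 cache line, as quoted for the Cortex-A15 */
#define CACHE_LINE_BYTES 64

/* Returns 1 if ptr is a multiple of alignment, which must be a power of two. */
static int is_aligned(const void *ptr, size_t alignment) {
    return ((uintptr_t)ptr & (alignment - 1)) == 0;
}

/* Allocates room for count floats starting on a cache line boundary.
 * The caller releases the buffer with free(). */
static float *alloc_aligned_floats(size_t count) {
    size_t bytes = count * sizeof(float);
    /* aligned_alloc expects the size to be a multiple of the alignment,
     * so round the byte count up to the next full cache line */
    size_t padded = (bytes + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES
                    * CACHE_LINE_BYTES;
    return (float *)aligned_alloc(CACHE_LINE_BYTES, padded);
}
```

A buffer obtained this way starts on a cache line boundary, so every block of 16 consecutive floats beginning at a multiple of 16 elements occupies exactly one 64-byte line.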

Now, let us consider the situation involving floating-point arithmetic. The version of the filtering code that uses NEON intrinsics is shown below:

void computeFIR(FIRFilter* fir, float* input) {
    int i, j;
    float32x4_t freg1, freg2, freg3; //temporary registers
    float32_t* coeffsPnt;            //temporary pointer to coefficients array

    //temporary pointers to window buffer
    float32_t* windowPnt1 = fir->window;
    float32_t* windowPnt2 = &(fir->window[fir->frameSize]);

    //Assuming the number of coefficients is a multiple of 4
    for(i=0; i<fir->numCoefficients; i+=4) {
        //load four elements starting at window[frameSize + i]
        //and shift them to window[i]
        freg1 = vld1q_f32(windowPnt2);
        windowPnt2 += 4;
        vst1q_f32(windowPnt1, freg1);
        windowPnt1 += 4;
    }
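The window-shift loop above can be tried in isolation with the sketch below. Several things here are assumptions rather than part of the lab code: the FIRFilter layout shows only the three fields the loop touches, the window buffer is assumed to hold numCoefficients + frameSize samples, the function name shiftWindow is hypothetical, and a scalar fallback is included so that the code also compiles on targets without NEON.

```c
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical layout: only the fields used by the shift loop */
typedef struct {
    float *window;        /* assumed length: numCoefficients + frameSize */
    int numCoefficients;  /* assumed to be a multiple of 4 */
    int frameSize;
} FIRFilter;

/* Copies window[frameSize + i] to window[i] for i in [0, numCoefficients),
 * four samples per iteration. */
static void shiftWindow(FIRFilter *fir) {
    float *dst = fir->window;
    const float *src = &fir->window[fir->frameSize];
    int i;

    for (i = 0; i < fir->numCoefficients; i += 4) {
#if HAVE_NEON
        /* one 128-bit load and one 128-bit store move four floats */
        vst1q_f32(dst, vld1q_f32(src));
#else
        /* scalar fallback: same four samples, copied in ascending order */
        dst[0] = src[0];
        dst[1] = src[1];
        dst[2] = src[2];
        dst[3] = src[3];
#endif
        dst += 4;
        src += 4;
    }
}
```

For a 12-sample window holding 0 to 11 with numCoefficients = 8 and frameSize = 4, the first eight entries become 4 to 11 and the last four are left unchanged.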
