138 9. CODE OPTIMIZATION
completed. By using VFPv4, this can be accomplished in one fused multiply-accumulate
instruction, resulting in only one rounding.
9.5.3 NEON INTRINSICS
Although SIMD instructions can be used when writing hand-optimized assembly code, this
approach is cumbersome and time consuming. Fortunately, NEON intrinsics offer a way to
come close to it from within C code, by using the functions provided in the header
arm_neon.h. This was encountered during a previous lab involving the Newton–Raphson
iteration for finding the inverse and the square root of a number.
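As a small illustration of how intrinsics appear in C code, the sketch below performs an element-wise multiply-accumulate on four floats at a time using vld1q_f32, vmlaq_f32, and vst1q_f32 from arm_neon.h. The function name multiply_accumulate and the HAVE_NEON macro are hypothetical, and the scalar fallback is an added assumption so that the same file also builds on targets without NEON.

```c
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* acc[i] += a[i] * b[i] for i in [0, n).
   In the NEON path, n is assumed to be a multiple of 4. */
void multiply_accumulate(float *acc, const float *a, const float *b, size_t n) {
#if HAVE_NEON
    for (size_t i = 0; i < n; i += 4) {
        float32x4_t va   = vld1q_f32(a + i);    /* load 4 floats from a   */
        float32x4_t vb   = vld1q_f32(b + i);    /* load 4 floats from b   */
        float32x4_t vacc = vld1q_f32(acc + i);  /* load 4 running sums    */
        vacc = vmlaq_f32(vacc, va, vb);         /* vacc + va * vb, per lane */
        vst1q_f32(acc + i, vacc);               /* store 4 results        */
    }
#else
    /* Scalar fallback for targets without NEON. */
    for (size_t i = 0; i < n; i++) {
        acc[i] += a[i] * b[i];
    }
#endif
}
```

Each intrinsic call corresponds closely to one NEON instruction, while register allocation and scheduling are left to the compiler.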
However, it is important to note that the use of intrinsics has drawbacks in terms of aligned
data. Aligned data refers to data stored in memory such that the base address is a multiple of
a power of two, and data accesses are performed with the same stride length. For ARM, an
effective data alignment is one that matches the size of a cache line in the level 1 cache.
For example, on the Cortex-A15 ARM processor, the cache line size is 64 bytes. Aligned data
loads allow the processor to read ahead in memory and load data into the level 1 cache before
it is read into the register file, resulting in decreased loading times. Aligned data loads cannot
be performed when using intrinsics, and memory pointer increments cannot be done as part of
load or store operations. The only way to incorporate these features is via assembly code.
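The notion of alignment itself can be made concrete in plain C. The sketch below allocates a buffer whose base address is a multiple of 64 bytes, the cache line size quoted above for the Cortex-A15, and checks the alignment of a pointer. The names CACHE_LINE_BYTES, is_aligned, and alloc_cache_aligned are hypothetical, and the C11 function aligned_alloc is assumed to be available in the toolchain.

```c
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE_BYTES 64  /* assumed L1 line size, as in the Cortex-A15 example */

/* Returns 1 if ptr is a multiple of alignment (a power of two), else 0. */
int is_aligned(const void *ptr, size_t alignment) {
    return ((uintptr_t)ptr & (alignment - 1)) == 0;
}

/* Allocates room for count floats starting on a cache-line boundary.
   aligned_alloc expects the size to be a multiple of the alignment,
   so the byte count is rounded up. Release the buffer with free(). */
float *alloc_cache_aligned(size_t count) {
    size_t bytes = count * sizeof(float);
    size_t rounded = (bytes + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES
                     * CACHE_LINE_BYTES;
    return (float *)aligned_alloc(CACHE_LINE_BYTES, rounded);
}
```

A buffer obtained this way starts on a cache-line boundary, so a block of sixteen consecutive floats beginning at its base address falls within a single 64-byte line.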
Now, let us consider the situation involving floating-point arithmetic. The filtering code
version that uses NEON intrinsics is shown below:
void computeFIR(FIRFilter* fir, float* input) {
    int i, j;
    float32x4_t freg1, freg2, freg3; //temporary registers
    float32_t* coeffsPnt; //temporary pointer to coefficients array
    //temporary pointers to window buffer
    float32_t* windowPnt1 = fir->window;
    float32_t* windowPnt2 = &(fir->window[fir->frameSize]);
    //Assuming the number of coefficients is a multiple of 4
    for(i=0; i<fir->numCoefficients; i+=4) {
        //load four elements starting at window[frameSize + i]
        //and shift them to window[i]
        freg1 = vld1q_f32(windowPnt2);
        windowPnt2 += 4;
        vst1q_f32(windowPnt1, freg1);
        windowPnt1 += 4;
    }