Assembly Optimizations for OMAP4

From OMAPpedia


Revision as of 16:43, 7 March 2011



1.) OMAP4 chips have a dual-core ARM Cortex-A9 processor as the main processing unit.
2.) The main memory is accessed through a dual-channel controller, which means it can support 2 transactions at once.
3.) The A9 L1 cache is XXKb (per core)
4.) The A9 L2 cache is 1MB (shared between cores) - random eviction.
5.) 4-Deep Preload Pipeline

Rules of Assembly Optimization

1.) Learn the Assembly Language you are using.
2.) Know the Hardware you are using.

Optimization Strategies

How to convert your C code to optimized assembly is the topic of many, many books. Here's a concise description of some methods and tricks to use when converting C code for OMAP4.

Data Parallelism

NEON instructions normally operate on 8 data units at once (for example, eight bytes held in one 64-bit D register), though some instructions handle fewer. A plain C for loop that performs a memset would normally look something like this:

void memset(void *ptr, int data, size_t size) {
    for (size_t i = 0; i < size; i++) {
        ((uint8_t *)ptr)[i] = (uint8_t)data;   /* one byte per iteration */
    }
}

The NEON equivalent operates on this data in units of 8 unsigned characters, which means that the size must be a nonzero multiple of 8 (the loop stores 8 bytes before it checks the remaining count).

.global memset_neon
ptr  .req r0
data .req r1
size .req r2
memset_neon:
    PROLOG r0,r2                 @ PROLOG/EPILOG are macros assumed to be defined elsewhere
    vdup.8 d0, data              @ Copy the low 8 bits of data (r1) into all 8 lanes of d0
memset_neon_loop:
    vst1.8 {d0}, [ptr]!          @ Store 8 values at once and advance ptr by 8
    subs   size, size, #8        @ Reduce size by 8
    bgt    memset_neon_loop      @ Loop if more bytes are left
    EPILOG r0,r2
.unreq ptr
.unreq data
.unreq size

In this manner, 8 units are operated on in parallel.
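
The multiple-of-8 restriction can be lifted by finishing the leftover bytes one at a time. The following is a minimal sketch of that idea in C, not the original author's code. The name fill_bytes is hypothetical, and the NEON intrinsics from arm_neon.h are only used when the compiler reports NEON support. On other targets a plain 8-byte copy stands in for them so the sketch still builds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* fill_bytes is a hypothetical name used only for this sketch. */
void fill_bytes(void *ptr, int data, size_t size) {
    assert(ptr != NULL);
    uint8_t *p = (uint8_t *)ptr;
    uint8_t value = (uint8_t)data;   /* truncate to 8 bits, like vdup.8 */
    size_t blocks = size / 8;        /* number of full 8-byte groups */

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint8x8_t v = vdup_n_u8(value);  /* same role as vdup.8 d0, r1 */
    for (size_t i = 0; i < blocks; i++) {
        vst1_u8(p, v);               /* same role as vst1.8 {d0}, [r0]! */
        p += 8;
    }
#else
    /* Portable stand-in so the sketch also builds on non-ARM hosts. */
    uint8_t pattern[8];
    memset(pattern, value, sizeof pattern);
    for (size_t i = 0; i < blocks; i++) {
        memcpy(p, pattern, sizeof pattern);
        p += 8;
    }
#endif

    /* Scalar tail: the 0 to 7 bytes that do not fill a whole group. */
    for (size_t i = 0; i < size % 8; i++) {
        p[i] = value;
    }
}
```

With this shape, a call such as fill_bytes(buffer, 0xAB, 21) writes two 8-byte groups and then 5 single bytes.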

Thread Parallelism

Since OMAP4 is dual core and has a dual-channel memory controller, you can write thread-parallel operations in which each thread runs NEON-optimized code, one thread per core, with each thread receiving full memory access speed. However, because the L2 cache is shared between the cores, there is less of a guarantee that data will stay in the cache for the same length of time.
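
As a rough illustration of the one-thread-per-core idea, the sketch below splits a buffer in half and fills each half on its own thread using POSIX threads. The names fill_job, fill_worker and parallel_fill are hypothetical, and memset stands in for whatever NEON-optimized routine you would actually call. The sketch does not pin threads to cores, so it assumes the scheduler places the two threads on different cores.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical job description, introduced only for this sketch. */
typedef struct {
    uint8_t *start;   /* first byte this worker owns */
    size_t   length;  /* number of bytes this worker owns */
    uint8_t  value;   /* byte value to write */
} fill_job;

static void *fill_worker(void *arg) {
    fill_job *job = (fill_job *)arg;
    /* A NEON-optimized routine would go here; memset stands in for it. */
    memset(job->start, job->value, job->length);
    return NULL;
}

/* Returns 0 on success, or the pthread error code on failure. */
int parallel_fill(uint8_t *buf, size_t size, uint8_t value) {
    assert(buf != NULL);
    size_t half = size / 2;

    /* Second half runs on a new thread. */
    fill_job second = { buf + half, size - half, value };
    pthread_t thread;
    int err = pthread_create(&thread, NULL, fill_worker, &second);
    if (err != 0) {
        return err;
    }

    /* First half runs on the calling thread at the same time. */
    fill_job first = { buf, half, value };
    fill_worker(&first);

    /* Wait for the second half before reporting completion. */
    return pthread_join(thread, NULL);
}
```

The two halves do not overlap, so the workers need no locking. Whether this is actually faster than a single thread depends on the buffer size and on how the shared L2 cache behaves, so it is worth measuring.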


If you are going to operate on a large array, it helps to preload that data into the L2 cache using the "pld" instruction. This fetches the data from main memory ahead of its use and populates the L2 cache with it.
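
From C, preloading can be requested without writing assembly by using the GCC/Clang builtin __builtin_prefetch, which compilers for ARM generally turn into a "pld" instruction. The sketch below is only an illustration: sum_bytes is a hypothetical function, and the LINE_BYTES and PREFETCH_AHEAD values are assumptions that should be tuned by measurement on real hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values for illustration; tune them on the target. */
#define LINE_BYTES      32   /* assumed cache line size in bytes */
#define PREFETCH_AHEAD 256   /* how far ahead of the current read to prefetch */

/* sum_bytes is a hypothetical example of a loop over a large array. */
uint32_t sum_bytes(const uint8_t *data, size_t size) {
    assert(data != NULL || size == 0);
    uint32_t sum = 0;
    for (size_t i = 0; i < size; i++) {
        /* Once per cache line, hint that data further ahead will be read. */
        if ((i % LINE_BYTES) == 0 && i + PREFETCH_AHEAD < size) {
            /* Arguments: address, 0 = read access, 3 = keep in cache. */
            __builtin_prefetch(data + i + PREFETCH_AHEAD, 0, 3);
        }
        sum += data[i];
    }
    return sum;
}
```

The prefetch is only a hint, so the result is the same with or without it. Only the timing changes.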


ARM Info Center [1] - NEON Code Reference for the RVDS compiler. GCC's assembly syntax is very similar (though not all mnemonics are supported).

NEON Reference in PDF form through Google Docs: [2]
