Writing ARM Assembly

From OMAPpedia

Overview

This page covers the basics of writing ARM assembly on the OMAP platform for the GCC family of compilers and assemblers. If you have assembly in NASM format, you can port it using the guide at Porting NASM Assembly to GCC. For OMAP4 specifics, see Assembly Optimizations for OMAP4.

Reference

For assembly instruction references, refer to ARM's site http://infocenter.arm.com/help/index.jsp for the specific processor type in the OMAP you are using.

To figure out which instruction set you can use, and thus whether you can use NEON or some subset of parallel instructions, see this table:

OMAP Type  ARM Type        ARM Version  SIMD  Prefetch Depth
OMAP1xxx   ARM926EJ-S (1)  ARMv5        No    -
OMAP2xxx   ARM1136         ARMv6        Some  2
OMAP3xxx   ARM Cortex A8   ARMv7        NEON  4
OMAP4xxx   ARM Cortex A9   ARMv7        NEON  4
OMAP5xxx   ARM Cortex A15  ARMv7        NEON  6

(1) Some variation exists.

Makefiles

You'll need to make sure that your Makefile supports cross-compiling with the ARM toolchain. See OMAP Platform Support Tools. When compiling or assembling the assembly files, be sure that $(CC) is set to the ARM cross compiler rather than the host compiler.


Assembly Files

Assembly files have historically been named with a .S or .s extension. Use .S (uppercase) so that GCC passes the file through the C preprocessor before the assembler.

In the examples below, the C parameters are named r0-r3 to show which register carries each argument. Arguments beyond the fourth are passed on the stack. If you can't avoid more than four, the prolog can load the additional arguments from the stack into the r4-r11 registers.

Comments

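Comments should be written either as /* comment */ or as line comments. In the GNU assembler for ARM, @ starts a comment anywhere on a line, while # starts a comment only when it is the first character on the line (elsewhere it introduces an immediate value).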

Calling C functions from Assembly

Calling C functions from assembly is largely an issue of setting up the parameters correctly and then branching to the function.

In your C files define your function (no need to declare, unless other C functions call it).

int somefunc(int r0, int r1, int r2, int r3)
{
    return 0; /* does something */
}

In the Assembly File:

.extern somefunc

And in the subroutine itself:

    # move parameters manually to r0, r1, r2, r3
    bl somefunc
    # return code is in r0, r4-r11 should be preserved

If you need to add additional parameters to the stack you must also remove them after the function call to keep the sp correct.

Calling Assembly Functions from C

First, declare your function in a C header file so that the C/C++ code can find its prototype.

/** This is simple function which just returns 0 */
int function(int r0, int r1, int r2, int r3);

Second, you'll have to define the function symbol in the assembly file. Declaring it with .global exports the symbol so that the linker can resolve the reference from the C file.

.global function
function:

EABI Calling conventions

The EABI spec (http://en.wikipedia.org/wiki/Application_binary_interface#EABI) defines how functions are called, how stacks are used, which registers do what, etc. This allows assembly and C to link together successfully (even across different compilers which support EABI). The calling conventions can be found at http://en.wikipedia.org/wiki/Calling_convention#ARM. The EABI standard dictates that the ARM stack be "Full Descending", which means that stores decrement the stack pointer beforehand and loads increment it afterward. You can use the explicit addressing modes "DB" (on stores) and "IA" (on loads), or just "FD" on both.

Prolog

The prolog typically saves the state of registers r4 through r11 (you can save as many as you need, but those are the typical ones). The "!" on this instruction writes the decremented address back to the stack pointer (sp).

stmdb sp!, {r4-r11} /* Push 8 "longs" on the stack and subtracts sp beforehand */

If there are additional parameters on the stack, you can reference them after the stmdb instruction, but you'll need to offset the sp by the number of registers saved. This *assumes* that you use {r4-r11}, so that parameter 5, which was at [sp] on entry, is now 8 "longs" up.

ldr r4, [sp, #(4*8)] /* This loads parameter 5 which is 8 "longs" "up" on the stack now */
ldr r5, [sp, #(4*9)] /* This loads parameter 6 which is 9 "longs" "up" on the stack now */
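
The offset arithmetic can be checked with a small model. The following is a minimal C sketch, not part of the original page, that simulates a full-descending stack as an array. The names stack_model, push_word, peek_word and word_after_prolog are hypothetical, and it assumes the usual ARM procedure call layout in which the fifth argument sits at [sp] on entry to the function.

```c
#include <stdint.h>

#define STACK_WORDS 32

/* Hypothetical model of a full-descending stack:
 * sp is the index of the last word pushed. */
typedef struct {
    uint32_t mem[STACK_WORDS];
    int sp;
} stack_model;

/* Like "stmdb sp!, {reg}": decrement first, then store. */
static void push_word(stack_model *s, uint32_t value)
{
    s->sp -= 1;
    s->mem[s->sp] = value;
}

/* Like "ldr rX, [sp, #(4*word_offset)]": read without moving sp. */
static uint32_t peek_word(const stack_model *s, int word_offset)
{
    return s->mem[s->sp + word_offset];
}

/* Simulate a call with six arguments: the caller leaves arguments 5 and 6
 * on the stack (marked here as 500 and 600), then the callee prolog pushes
 * saved_regs registers. Returns the word at [sp, #(4*word_offset)]. */
uint32_t word_after_prolog(int saved_regs, int word_offset)
{
    stack_model s = { {0}, STACK_WORDS };
    int i;

    push_word(&s, 600);   /* argument 6, at the higher address */
    push_word(&s, 500);   /* argument 5, at [sp] on entry */
    for (i = 0; i < saved_regs; i++)
        push_word(&s, 0); /* prolog saves, e.g. r4-r11 */

    return peek_word(&s, word_offset);
}
```

In this model, saving eight registers puts the fifth argument eight words above sp, and saving a ninth register (such as lr) moves it to nine words.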

Epilog

The epilog restores the previous register set from the stack back to the registers and updates the sp value.

ldmia sp!, {r4-r11}

Return

The return places the return value into r0 and moves the lr (the return address) into the pc. This will cause the next instruction fetched to be the instruction after the call to the function.

mov r0, #0
mov pc, lr

Optimized Return

You can reduce your code size by also saving the LR in the prolog and popping it from the stack straight into the PC, which also acts as the "return" statement. Here I use the "FD" stack mode.

stmfd sp!,{r4-r11,lr} /* stack save + return address */
/* use 9 as the additional offset for other parameters off the stack since we're saving 9 ints now */
ldr r4, [sp, #(4*9)]
ldmfd sp!,{r4-r11,pc} /* stack restore + return */

Register Renaming

With the GNU assembler (gas), you can rename registers to aid in readability.

name .req register


pixels .req r0
width .req r1
height .req r2 

    mul pixels, width, height

Complete Listing

.text
.global function
function:
    # prolog
    stmdb sp!, {r4-r11}
    # epilog
    ldmia sp!, {r4-r11}
    # return value goes into r0, here it's zero
    mov r0, #0
    mov pc, lr
.end

Defining Strings

The assembler allows you to define strings, including escape sequences such as \n:

.global final_message
final_message:
.string "Sorry for the Inconvenience\n"

Use a label before the string in order to reference it.

Defining Constants

The GNU assembler takes constants in the form of

.equ symbol, value

Such that you could do this (capitalization is optional):


Defining Data Arrays

When you need to define large static arrays of data (tables, precomputed values, multiple constants, etc.) you can use a data section to do this. This is not quite the same as the .data section (which can be static data or functions).

.global my_array
my_array:
.long 127
.long 28
.long 94
.long 23

This symbol can then be used to load these values into registers to apply to calculations, etc.

ldr r4, =my_array
ldr r5, [r4, #0x0]
ldr r6, [r4, #0x4]
ldr r7, [r4, #0x8]
ldr r8, [r4, #0xC]
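
For readers more at home in C, the table and the offset loads can be mirrored as follows. This is only an illustrative sketch: load_word and my_array_at_offset are hypothetical helpers, and the array simply repeats the four values from the listing.

```c
#include <stdint.h>
#include <string.h>

/* Same values as the my_array table in the assembly listing. */
static const int32_t my_array[] = { 127, 28, 94, 23 };

/* Like "ldr rX, [r4, #byte_offset]" with r4 holding a base address:
 * read one 32-bit word at the given byte offset. */
int32_t load_word(const int32_t *base, size_t byte_offset)
{
    int32_t value;
    memcpy(&value, (const unsigned char *)base + byte_offset, sizeof value);
    return value;
}

/* Convenience wrapper that always reads from my_array. */
int32_t my_array_at_offset(size_t byte_offset)
{
    return load_word(my_array, byte_offset);
}
```

The byte offsets 0x0, 0x4, 0x8 and 0xC correspond to array indices 0 through 3, because each .long occupies four bytes.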

Types

Each directive takes zero or more expressions, separated by commas.

.byte 247         /* is 8 bit  */
.hword 2098       /* is 16 bit (on ARM, .word is 32 bit) */
.long 10238476    /* is 32 bit */
.quad 23487928374 /* is 64 bit */
.octa 928374928734982734 /* is 128 bit */
.float 3.141528   /* is 32 bit IEEE floating point. */

.byte 0xEF, 0xBE, 0xAD, 0xDE /* Byte sequence 0xDEADBEEF in LITTLE ENDIAN */
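
The little-endian ordering in the last line can be double-checked in C. This is a small sketch with hypothetical helper names (le32_byte, le32_from_bytes); it uses shifts only, so it does not depend on the byte order of the machine it runs on.

```c
#include <stdint.h>

/* Byte stored at memory offset `index` (0..3) when a 32-bit value is
 * laid out least significant byte first (little endian). */
uint8_t le32_byte(uint32_t value, unsigned index)
{
    return (uint8_t)((value >> (8u * index)) & 0xFFu);
}

/* Reassemble a 32-bit value from four bytes given in memory order. */
uint32_t le32_from_bytes(uint8_t b0, uint8_t b1, uint8_t b2, uint8_t b3)
{
    return (uint32_t)b0
         | ((uint32_t)b1 << 8)
         | ((uint32_t)b2 << 16)
         | ((uint32_t)b3 << 24);
}
```

Reading the four .byte values back in memory order gives 0xDEADBEEF, which is what a 32-bit load from that address would see on a little-endian ARM.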

Defining Macros

The GNU assembler also allows macros which can be used to simplify some assembly routines.

.macro name operand [,operand,...]

Here's an example that does a 4 value average

# avg = (a+b+c+d)/4;
.macro average avg,sum,a,b,c,d
	add \sum, \a, \b
	add \sum, \c, \sum
	add \sum, \d, \sum
	mov \avg, \sum, lsr #2
.endm
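
As a cross-check of what the macro computes, here is a C sketch of the same four steps. The function name average4 is hypothetical, and unsigned 32-bit operands are an assumption chosen to match the logical shift right.

```c
#include <stdint.h>

/* Mirrors the macro step by step: three adds into sum, then a logical
 * shift right by 2, which divides by 4 and discards the remainder. */
uint32_t average4(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    uint32_t sum = a + b;   /* add sum, a, b   */
    sum = c + sum;          /* add sum, c, sum */
    sum = d + sum;          /* add sum, d, sum */
    return sum >> 2;        /* mov avg, sum, lsr #2 */
}
```

Note that the result is truncated rather than rounded, and that, as in the assembly, the sum wraps around if it exceeds 32 bits.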

Odds 'n' Ends

You should define your assembly file with .text at the beginning and .end at the end.


Enabling NEON

If you are assembling ARMv7 or NEON instructions, you must say so in the Makefile in the AFLAGS with -march=armv7-a and, for NEON, -mfpu=neon. You can also state so in the assembly file as:

.arch armv7-a
.fpu neon

Register Usage

There's a good table reference for which registers are used for what in GCC (during inline assembly at least) at [1], under "Register Usage".

Hardware Considerations

When programming in NEON, there are several considerations as to how to craft the subroutine.

Stack Direction

This affects the prolog and epilog (stack push and pop). Know whether it should be a down-stack (EABI compliant) or some other version:

stmdb sp!,{r4-r12,lr} /* push + save return */
... subroutine
ldmia sp!,{r4-r12,pc} /* pop + return */

Loops

The most efficient loops on an ARM are not the traditional "for" loops but decrementing branch loops:

i .req rX
limit .req rY
mov i, limit
label:
... /* code */
subs i, i, #1
bgt label

For limit > 0, this is functionally equivalent to this for loop:

for (i = limit; i > 0; i--) {}

This loop works because the "subs" sets the status flags which the "bgt" tests. The "subs" is effectively a "sub"+"cmp" against zero in one instruction.
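
One detail worth spelling out is that the assembly form tests the counter after the body, so it behaves like a do/while rather than a for loop. The following C sketch (count_down_iterations and for_loop_iterations are hypothetical names) counts how many times each form runs its body.

```c
/* Models the assembly loop: the body runs, then the counter is
 * decremented and tested, as "subs" followed by "bgt" does. */
int count_down_iterations(int limit)
{
    int i = limit;          /* mov  i, limit */
    int iterations = 0;
    do {
        iterations++;       /* loop body */
        i -= 1;             /* subs i, i, #1 */
    } while (i > 0);        /* bgt  label */
    return iterations;
}

/* The for loop from the text, which tests before the first pass. */
int for_loop_iterations(int limit)
{
    int iterations = 0;
    int i;
    for (i = limit; i > 0; i--)
        iterations++;
    return iterations;
}
```

The two agree for any positive limit. With a limit of zero the assembly-style loop still runs its body once, so guard the loop if a zero count is possible.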

Prefetch

Prefetch is probably the single most important part of NEON optimization. You *must* use the "pld" instruction to get any significant speed up. You must know how many prefetches can be issued before the prefetch queue is full or capped. The table above gives some ideas about this. Additionally, you should know the architecture's cache line size so that consecutive prefetches cover contiguous data.

.equ L2_CACHE_LINE_SIZE, 64 /* on A15 */
pld [r0]
pld [r0, #L2_CACHE_LINE_SIZE]
... repeat until capped
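
The same idea can be tried from C before committing to hand-written assembly. The sketch below assumes GCC or Clang, whose __builtin_prefetch builtin is a prefetch hint that may become a pld on ARM targets. The function name sum_with_prefetch and the two constants are assumptions for illustration, not tuned values.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64  /* assumption: 64-byte cache lines */
#define PREFETCH_LINES  4   /* assumption: how many lines to run ahead */

/* Sum a byte buffer, hinting a prefetch a few cache lines ahead each
 * time a new line is entered. The hint never changes the result. */
uint32_t sum_with_prefetch(const uint8_t *data, size_t length)
{
    const size_t ahead = (size_t)PREFETCH_LINES * CACHE_LINE_SIZE;
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i < length; i++) {
        /* Only form addresses that stay inside the buffer. */
        if ((i % CACHE_LINE_SIZE) == 0 && i + ahead < length)
            __builtin_prefetch(data + i + ahead);
        sum += data[i];
    }
    return sum;
}
```

Whether this helps depends on the prefetch depth of the core and on the access pattern, so measure on the actual OMAP device.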

Stalls

When using intermediate registers, the usage of these registers should be staggered or separated by some instructions if possible to prevent stalls being introduced in the pipeline. Instead of this:

vadd.u8 d0, d1, d2
vadd.u8 d3, d0, d4
vadd.u8 d5, d6, d7
vadd.u8 d8, d5, d9

Here d0 and d5 are used directly after being computed. It is better to rearrange the operations to put some distance between the write and the subsequent read of these registers to prevent pipeline stalls:

vadd.u8 d0, d1, d2
vadd.u8 d5, d6, d7
vadd.u8 d3, d0, d4
vadd.u8 d8, d5, d9

This is a simple trick which speeds up computation.

Write Combiner

One trick used to get better write performance is the "write combiner". This allows smaller writes of a few bytes to be aggregated into larger writes which are more efficient on the bus. This is typically 128 _bits_ or 16 bytes (one Q register or two D registers). This means you may unroll your loops and do two or more computations per iteration to aggregate enough write-data to fill the combiner. There are some limitations and restrictions to using this, so read the manual.

Reference

DVP YUV NEON Color Convert Functions
