GBATEK
Gameboy Advance / Nintendo DS - Technical Info - Extracted from no$gba version 2.6a


 CPU Reference

General ARM7TDMI Information
CPU Overview
CPU Register Set
CPU Flags
CPU Exceptions
CPU Memory Alignments

The ARM7TDMI Instruction Sets
THUMB Instruction Set
ARM Instruction Set
Pseudo Instructions and Directives

Further Information
ARM CP15 System Control Coprocessor
CPU Instruction Cycle Times
CPU Versions
CPU Data Sheet


 CPU Overview < ^

The ARM7TDMI is a 32bit RISC (Reduced Instruction Set Computer) CPU, designed by ARM (Advanced RISC Machines) for both high performance and low power consumption.

Fast Execution
Depending on the CPU state, all opcodes are sized 32bit or 16bit (that's counting both the opcode bits and its parameter bits), providing fast decoding and execution. Additionally, pipelining allows (a) one instruction to be executed while (b) the next instruction is decoded and (c) the instruction after that is fetched from memory - all at the same time.

Data Formats
The CPU can deal with 8bit, 16bit, and 32bit data, which are called:
  8bit  - Byte
  16bit - Halfword
  32bit - Word

The two CPU states
As mentioned above, two CPU states exist:
- ARM state: Uses the full 32bit instruction set (32bit opcodes)
- THUMB state: Uses a cutdown 16bit instruction set (16bit opcodes)
Regardless of the opcode-width, both states are using 32bit registers, allowing 32bit memory addressing as well as 32bit arithmetic/logical operations.

When to use ARM state
Basically, there are two advantages in ARM state:
- Each single opcode provides more functionality, resulting
  in faster execution when using a 32bit bus memory system
  (such as opcodes stored in GBA Work RAM).
- All registers R0-R15 can be accessed directly.
The downsides are:
- Not so fast when using a 16bit memory system
  (but it still works though).
- Program code occupies more memory space.

When to use THUMB state
There are two major advantages in THUMB state:
- Faster execution, up to approx 160%, when using a 16bit bus
  memory system (such as opcodes stored in GBA GamePak ROM).
- Reduced code size, decreasing memory usage down to approx 65%.
The disadvantages are:
- Opcodes are less multi-functional than in ARM state, so it is
  sometimes required to use more than one opcode to get a
  similar result as for a single opcode in ARM state.
- Most opcodes allow only registers R0-R7 to be used directly.

Combining ARM and THUMB state
Switching between ARM and THUMB state is done by a normal branch (BX) instruction, which takes only a handful of cycles to execute (allowing states to be changed as often as desired, with almost no overhead).

Also, as both ARM and THUMB are using the same register set, it is possible to pass data between ARM and THUMB mode very easily.

The best memory & execution performance can be gained by combining both states: THUMB for normal program code, and ARM code for timing critical subroutines (such as interrupt handlers, or complicated algorithms).

Note: ARM and THUMB code cannot be executed simultaneously.

Automatic state changes
Besides the manual state switching by BX instructions described above, the following situations involve automatic state changes:
- CPU switches to ARM state when executing an exception
- User switches back to old state when leaving an exception


 CPU Register Set < ^

Overview
The following table shows the ARM7TDMI register set which is available in each mode. There's a total of 37 registers (32bit each), 31 general registers (Rxx) and 6 status registers (xPSR).
Note that only some registers are 'banked'; for example, each mode has its own R14 register, called R14, R14_fiq, R14_svc, etc. for each mode respectively.
However, other registers are not banked, for example, each mode is using the same R0 register, so writing to R0 will always affect the content of R0 in other modes also.
  System/User FIQ       Supervisor Abort     IRQ       Undefined
  --------------------------------------------------------------
  R0          R0        R0         R0        R0        R0
  R1          R1        R1         R1        R1        R1
  R2          R2        R2         R2        R2        R2
  R3          R3        R3         R3        R3        R3
  R4          R4        R4         R4        R4        R4
  R5          R5        R5         R5        R5        R5
  R6          R6        R6         R6        R6        R6
  R7          R7        R7         R7        R7        R7
  --------------------------------------------------------------
  R8          R8_fiq    R8         R8        R8        R8
  R9          R9_fiq    R9         R9        R9        R9
  R10         R10_fiq   R10        R10       R10       R10
  R11         R11_fiq   R11        R11       R11       R11
  R12         R12_fiq   R12        R12       R12       R12
  R13 (SP)    R13_fiq   R13_svc    R13_abt   R13_irq   R13_und
  R14 (LR)    R14_fiq   R14_svc    R14_abt   R14_irq   R14_und
  R15 (PC)    R15       R15        R15       R15       R15
  --------------------------------------------------------------
  CPSR        CPSR      CPSR       CPSR      CPSR      CPSR
  --          SPSR_fiq  SPSR_svc   SPSR_abt  SPSR_irq  SPSR_und
  --------------------------------------------------------------
R0-R12 Registers (General Purpose Registers)
These thirteen registers may be used for any general purpose. Basically, each has the same functionality and performance, ie. there is no 'fast accumulator' for arithmetic operations, and no 'special pointer register' for memory addressing.
However, in THUMB mode only R0-R7 (Lo registers) may be accessed freely, while R8-R12 and up (Hi registers) can be accessed only by some instructions.

R13 Register (SP)
This register is used as Stack Pointer (SP) in THUMB state. In ARM state, the user may decide to use R13 and/or other register(s) as stack pointer(s), or as general purpose register.
As shown in the table above, there's a separate R13 register in each mode, and (when used as SP) each exception handler may (and MUST!) use its own stack.

R14 Register (LR)
This register is used as Link Register (LR). That is, when calling to a sub-routine by a Branch with Link (BL) instruction, then the return address (ie. old value of PC) is saved in this register.
Storing the return address in the LR register is obviously faster than pushing it into memory, however, as there's only one LR register for each mode, the user must manually push its content before issuing 'nested' subroutines.
Same happens when an exception is called, PC is saved in LR of new mode.
Note: In ARM mode, R14 may be used as general purpose register also, provided that above usage as LR register isn't required.

R15 Register (PC)
R15 is always used as program counter (PC). Note that reading R15 will usually return a value of PC+nn because of read-ahead (pipelining), where 'nn' depends on the instruction and on the CPU state (ARM or THUMB).

CPSR and SPSR (Program Status Registers) (ARMv3 and up)
The current condition codes (flags) and CPU control bits are stored in the CPSR register. When an exception arises, the old CPSR is saved in the SPSR of the respective exception-mode (much like PC is saved in LR).
For details refer to chapter about CPU Flags.


 CPU Flags < ^

Current Program Status Register (CPSR)
  Bit   Expl.
  31    N - Sign Flag       (0=Not Signed, 1=Signed)
  30    Z - Zero Flag       (0=Not Zero, 1=Zero)
  29    C - Carry Flag      (0=No Carry, 1=Carry)
  28    V - Overflow Flag   (0=No Overflow, 1=Overflow)
  27    Q - Sticky Overflow (1=Sticky Overflow, ARMv5TE and up only)
  26-8  Reserved            (For future use) - Do not change manually!
  7     I - IRQ disable     (0=Enable, 1=Disable)
  6     F - FIQ disable     (0=Enable, 1=Disable)
  5     T - State Bit       (0=ARM, 1=THUMB) - Do not change manually!
  4-0   M4-M0 - Mode Bits   (See below)
Bit 31-28: Condition Code Flags (N,Z,C,V)
These bits reflect results of logical or arithmetic instructions. In ARM state, it is often optional whether an instruction modifies the flags or not; for example, it is possible to execute a SUB instruction that does NOT modify the condition flags.
In ARM state, all instructions can be executed conditionally depending on the settings of the flags, such as MOVEQ (Move if Z=1), while in THUMB state only Branch instructions (jumps) can be made conditional.

Bit 27: Sticky Overflow Flag (Q) - ARMv5TE and ARMv5TExP and up only
Used by QADD, QSUB, QDADD, QDSUB, SMLAxy, and SMLAWy only. These opcodes set the Q-flag in case of overflows, but leave it unchanged otherwise. The Q-flag can be tested/reset by MSR/MRS opcodes only.

Bit 27-8: Reserved Bits (except Bit 27 on ARMv5TE and up, see above)
These bits are reserved for possible future implementations. For best forwards compatibility, the user should never change the state of these bits, and should not expect these bits to be set to a specific value.

Bit 7-0: Control Bits (I,F,T,M4-M0)
These bits may change when an exception occurs. In privileged modes (non-user modes) they may be also changed manually.
The interrupt bits I and F are used to disable IRQ and FIQ interrupts respectively (a setting of "1" means disabled).
The T Bit indicates the current state of the CPU (0=ARM, 1=THUMB). This bit should never be changed manually; instead, changing between ARM and THUMB state must be done by BX instructions.
The Mode Bits M4-M0 contain the current operating mode.
  Binary Hex Dec  Expl.
  10000b 10h 16 - User (non-privileged)
  10001b 11h 17 - FIQ
  10010b 12h 18 - IRQ
  10011b 13h 19 - Supervisor (SWI)
  10111b 17h 23 - Abort
  11011b 1Bh 27 - Undefined
  11111b 1Fh 31 - System (privileged 'User' mode) (ARMv4 and up)
Writing any other values into the Mode bits is not allowed.
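
The bit layout and mode table above can be illustrated with a small Python sketch. It is not part of the original reference: the names decode_cpsr and MODE_NAMES are made up for this example, and the Q flag (ARMv5TE and up) is left out for brevity.

```python
# Illustrative model only; names are hypothetical, not from the reference.
MODE_NAMES = {
    0x10: "User", 0x11: "FIQ", 0x12: "IRQ", 0x13: "Supervisor",
    0x17: "Abort", 0x1B: "Undefined", 0x1F: "System",
}

def decode_cpsr(cpsr):
    """Split a 32bit CPSR value into the fields listed in the table."""
    return {
        "N": (cpsr >> 31) & 1,   # Sign Flag
        "Z": (cpsr >> 30) & 1,   # Zero Flag
        "C": (cpsr >> 29) & 1,   # Carry Flag
        "V": (cpsr >> 28) & 1,   # Overflow Flag
        "I": (cpsr >> 7) & 1,    # IRQ disable
        "F": (cpsr >> 6) & 1,    # FIQ disable
        "T": (cpsr >> 5) & 1,    # State Bit (0=ARM, 1=THUMB)
        # Any value outside the table is reported as "invalid".
        "mode": MODE_NAMES.get(cpsr & 0x1F, "invalid"),
    }

fields = decode_cpsr(0x600000D3)
```

For the value 600000D3h this reports Z and C set, IRQ and FIQ disabled, ARM state, and Supervisor mode.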

Saved Program Status Registers (SPSR_<mode>)
In addition to the above CPSR, five Saved Program Status Registers exist:
SPSR_fiq, SPSR_svc, SPSR_abt, SPSR_irq, SPSR_und
Whenever the CPU enters an exception, the current status register (CPSR) is copied to the respective SPSR_<mode> register. Note that there is only one SPSR for each mode, so nested exceptions inside of the same mode are allowed only if the exception handler saves the content of SPSR in memory.
For example, for an IRQ exception: IRQ-mode is entered, and CPSR is copied to SPSR_irq. If the interrupt handler wants to enable nested IRQs, then it must first push SPSR_irq before doing so.


 CPU Exceptions < ^

Exceptions are caused by interrupts or errors. In the ARM7TDMI the following exceptions may arise, sorted by priority, starting with highest priority:
- Reset
- Data Abort
- FIQ
- IRQ
- Prefetch Abort
- Software Interrupt
- Undefined Instruction

Exception Vectors
The following are the exception vectors in memory. That is, when an exception arises, the CPU is switched into ARM state, and the program counter (PC) is loaded with the respective address.
  Address  Exception                Mode on Entry      Interrupt Flags
  BASE+00h Reset                    Supervisor (_svc)  I=1, F=1
  BASE+04h Undefined Instruction    Undefined  (_und)  I=1, F=unchanged
  BASE+08h Software Interrupt (SWI) Supervisor (_svc)  I=1, F=unchanged
  BASE+0Ch Prefetch Abort           Abort      (_abt)  I=1, F=unchanged
  BASE+10h Data Abort               Abort      (_abt)  I=1, F=unchanged
  BASE+14h (Reserved)               -                  -    -
  BASE+18h Normal Interrupt (IRQ)   IRQ        (_irq)  I=1, F=unchanged
  BASE+1Ch Fast Interrupt (FIQ)     FIQ        (_fiq)  I=1, F=1
BASE is normally 00000000h, but may be optionally FFFF0000h in some ARM CPUs.
As there's only space for one ARM opcode at each of the above addresses, it is usually recommended to place a Branch opcode into each vector, which then redirects to the actual exception handler's address.

Actions performed by CPU when entering an exception
  - R14=PC+nn              ;save old PC, ie. return address
  - SPSR_<new mode>=CPSR   ;save old flags
  - CPSR new T,M bits      ;set to T=0 (ARM state), and M4-0=new mode
  - CPSR new I bit         ;IRQs disabled (I=1), done by ALL exceptions
  - CPSR new F bit         ;FIQs disabled (F=1), done by Reset and FIQ only
  - PC=exception_vector    ;see table above
Above "PC+nn" depends on the type of exception. Basically, in ARM state that nn-offset is caused by pipelining, and in THUMB state an identical ARM-style 'offset' is generated (even though the 'base address' may be only halfword-aligned).
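
As a rough model of the entry steps listed above, the following Python sketch computes the new register values. The names enter_exception and VECTORS are hypothetical, the caller is assumed to pass the already computed return address (PC+nn, which depends on the exception type), and BASE is assumed to default to 00000000h.

```python
# Illustrative model only; names are hypothetical, not from the reference.
VECTORS = {  # kind: (vector offset, new mode bits M4-M0, sets F bit)
    "reset":          (0x00, 0x13, True),
    "undefined":      (0x04, 0x1B, False),
    "swi":            (0x08, 0x13, False),
    "prefetch_abort": (0x0C, 0x17, False),
    "data_abort":     (0x10, 0x17, False),
    "irq":            (0x18, 0x12, False),
    "fiq":            (0x1C, 0x11, True),
}

def enter_exception(kind, cpsr, return_addr, base=0x00000000):
    """Return the banked R14, SPSR, new CPSR and new PC for one exception."""
    offset, mode, sets_f = VECTORS[kind]
    new_cpsr = (cpsr & 0xFFFFFFC0) | mode   # T=0 (ARM state), M4-0=new mode
    new_cpsr |= 0x80                        # I=1, done by ALL exceptions
    if sets_f:
        new_cpsr |= 0x40                    # F=1, done by Reset and FIQ only
    return {
        "r14": return_addr,                 # save old PC, ie. return address
        "spsr": cpsr,                       # save old flags
        "cpsr": new_cpsr,
        "pc": base + offset,                # exception vector
    }

# Example: IRQ taken while running THUMB code in User mode (CPSR=00000030h).
state = enter_exception("irq", 0x00000030, 0x08000104)
```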

Required user-handler actions when returning from an exception
Restore any general registers (R0-R14) which might have been modified by the exception handler. Use the return instruction listed in the respective descriptions below; this restores both PC and CPSR, which automatically restores the old CPU state (THUMB or ARM) as well as the old state of the FIQ and IRQ disable flags.
As mentioned above (see action on entering...), the return address is always saved in ARM-style format, so that exception handler may use the same return-instruction, regardless of whether the exception has been generated from inside of ARM or THUMB state.

FIQ (Fast Interrupt Request)
This interrupt is generated by a LOW level on the nFIQ input. It is supposed to process timing critical interrupts at a high priority, as fast as possible.
In addition to the common banked registers (R13_fiq,R14_fiq), five extra banked registers (R8_fiq-R12_fiq) are available in FIQ mode. The exception handler may freely access these registers without modifying the main program's R8-R12 registers (and without having to save those registers on stack).
In privileged (non-user) modes, FIQs may be also manually disabled by setting the F Bit in CPSR.

IRQ (Normal Interrupt Request)
This interrupt is generated by a LOW level on the nIRQ input. Unlike FIQ, IRQ mode does not have its own banked R8-R12 registers.
IRQ has lower priority than FIQ, and IRQs are automatically disabled when a FIQ exception is executed. In privileged (non-user) modes, IRQs may be also manually disabled by setting the I Bit in CPSR.
To return from IRQ Mode (continuing at following opcode):
  SUBS PC,R14,#4   ;both PC=R14_irq-4, and CPSR=SPSR_irq

Software Interrupt
Generated by a software interrupt instruction (SWI). Recommended to request a supervisor (operating system) function. The SWI instruction may also contain a parameter in the 'comment field' of the opcode:
If your main program issues SWIs from both THUMB and ARM states, then your exception handler must distinguish between 24bit comment fields in ARM opcodes, and 8bit comment fields in THUMB opcodes (if necessary, determine the old state by examining the T Bit in SPSR_svc). However, in Little Endian mode, you could use only the most significant 8bits of the 24bit ARM comment field (as done in the GBA, for example); the exception handler could then process the BYTE at [R14-2], regardless of whether it's been called from ARM or THUMB state.
To return from Supervisor Mode (continuing at following opcode):
  MOVS PC,R14   ;both PC=R14_svc, and CPSR=SPSR_svc
Note: Like all other exceptions, SWIs are always executed in ARM state, no matter whether it's been caused by an ARM or THUMB state SWI instruction.
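
The [R14-2] trick described above can be checked with a short Python sketch. It is only an illustration: swi_comment_byte and the memory image are made up for this example, memory is assumed to be Little Endian, and R14_svc is assumed to point at the opcode following the SWI.

```python
# Illustrative model only; names and addresses are hypothetical.
def swi_comment_byte(memory, r14_svc):
    """Return the BYTE at [R14-2], same access for ARM and THUMB callers."""
    return memory[r14_svc - 2]

# Memory image holding one ARM and one THUMB SWI opcode (Little Endian).
memory = bytearray(0x300)
memory[0x100:0x104] = (0xEF050000).to_bytes(4, "little")  # ARM:   SWI 050000h
memory[0x200:0x202] = (0xDF05).to_bytes(2, "little")      # THUMB: SWI 05h

arm_number = swi_comment_byte(memory, 0x100 + 4)    # R14_svc = ARM opcode + 4
thumb_number = swi_comment_byte(memory, 0x200 + 2)  # R14_svc = THUMB opcode + 2
```

Both calls yield the same function number, which is why the ARM caller has to place it in the upper 8bits of the 24bit comment field.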

Undefined Instruction Exception (supported by ARMv3 and up)
This exception is generated when the CPU comes across an instruction which it cannot handle. Most likely this indicates that the program has locked up, and that an error message should be displayed.
However, it might be also used to emulate custom functions, ie. as an additional 'SWI' instruction (which'd use R14_und and SPSR_und though, and it'd thus allow to execute the Undefined Instruction handler from inside of Supervisor mode without having to save R14_svc and SPSR_svc).
To return from Undefined Mode (continuing at following opcode):
  MOVS PC,R14   ;both PC=R14_und, and CPSR=SPSR_und
Note that not all unused opcodes are necessarily producing an exception, for example, an ARM state Multiply instruction with Bit 6 set to "1" would be blindly accepted as 'legal' opcode.

Abort (supported by ARMv3 and up)
Aborts (page faults) are mostly intended for virtual memory systems (ie. not used in GBA, as far as I know), otherwise they might be used just to display an error message. Two types of aborts exist:
- Prefetch Abort (occurs during an instruction prefetch)
- Prefetch Abort (also occurs on BKPT opcodes, ARMv5 and up)
- Data Abort (occurs during a data access)
A virtual memory system's abort handler would then most likely determine the fault address: For prefetch abort that's just "R14_abt-4". For Data abort, the THUMB or ARM instruction at "R14_abt-8" needs to be 'disassembled' in order to determine the addressed data in memory.
The handler would then fix the error by loading the respective memory page into physical memory, and then retry to execute the SAME instruction again, by returning as follows:
  prefetch abort: SUBS PC,R14,#4   ;PC=R14_abt-4, and CPSR=SPSR_abt
  data abort:     SUBS PC,R14,#8   ;PC=R14_abt-8, and CPSR=SPSR_abt
Separate exception vectors for prefetch/data abort exist; each should use the respective return instruction as shown above.

Reset
Forces PC=VVVV0000h, and forces control bits of CPSR to T=0 (ARM state), F=1 and I=1 (disable FIQ and IRQ), and M4-0=10011b (Supervisor mode).


 CPU Memory Alignments < ^

The CPU does NOT support accessing mis-aligned addresses (which would be rather slow because it'd have to merge/split that data into two accesses).
When reading/writing code/data to/from memory, Words and Halfwords must be located at well-aligned memory address, ie. 32bit words aligned by 4, and 16bit halfwords aligned by 2.

Mis-aligned STR,STRH,STM,LDM,LDRD,STRD,PUSH,POP (forced align)
The mis-aligned low bit(s) are ignored, the memory access goes to a forcibly aligned (rounded-down) memory address.
For LDRD/STRD, it isn't clearly defined if the address must be aligned by 8 (on the NDS, align-4 seems to be okay) (align-8 may be required on other CPUs with 64bit databus).

Mis-aligned LDR,SWP (rotated read)
Reads from forcibly aligned address "addr AND (NOT 3)", and does then rotate the data as "ROR (addr AND 3)*8". That effect is internally used by LDRB and LDRH opcodes (which do then mask-out the unused bits).
The SWP opcode works like a combination of LDR and STR, that means, it does read-rotated, but does write-unrotated.
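
The rotated read can be modelled in a few lines of Python. This is a sketch under assumptions: ror32 and ldr_rotated are hypothetical names, and memory is assumed to be a Little Endian byte sequence.

```python
# Illustrative model only; names are hypothetical, not from the reference.
def ror32(value, amount):
    """Rotate a 32bit value right by 'amount' bits."""
    amount &= 31
    return ((value >> amount) | (value << (32 - amount))) & 0xFFFFFFFF

def ldr_rotated(memory, addr):
    """LDR from a possibly mis-aligned address."""
    aligned = addr & ~3                                   # addr AND (NOT 3)
    word = int.from_bytes(memory[aligned:aligned + 4], "little")
    return ror32(word, (addr & 3) * 8)                    # ROR (addr AND 3)*8

memory = bytes([0x11, 0x22, 0x33, 0x44])                  # WORD[0] = 44332211h
results = [ldr_rotated(memory, addr) for addr in range(4)]
```

Note that the lowest byte of each result is the byte stored at the mis-aligned address itself, which matches the remark that LDRB can be built on this effect by masking.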

Mis-aligned LDRH,LDRSH (does or does not do strange things)
On ARM9 aka ARMv5 aka NDS9:
  LDRH Rd,[odd]   -->  LDRH Rd,[odd-1]        ;forced align
  LDRSH Rd,[odd]  -->  LDRSH Rd,[odd-1]       ;forced align
On ARM7 aka ARMv4 aka NDS7/GBA:
  LDRH Rd,[odd]   -->  LDRH Rd,[odd-1] ROR 8  ;read to bit0-7 and bit24-31
  LDRSH Rd,[odd]  -->  LDRSB Rd,[odd]         ;sign-expand BYTE value
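
The two behaviours listed above can be compared with a small Python model. The function names and the arm9 parameter are assumptions made for this sketch, and memory is assumed to be a Little Endian byte sequence.

```python
# Illustrative model only; names are hypothetical, not from the reference.
def read_half(memory, addr):
    """Read the halfword at the forcibly aligned address."""
    aligned = addr & ~1
    return int.from_bytes(memory[aligned:aligned + 2], "little")

def ldrh(memory, addr, arm9):
    half = read_half(memory, addr)
    if arm9 or (addr & 1) == 0:
        return half                                   # forced align
    return ((half >> 8) | (half << 24)) & 0xFFFFFFFF  # ARM7: ROR 8

def ldrsh(memory, addr, arm9):
    if (addr & 1) and not arm9:
        byte = memory[addr]                           # ARM7: acts as LDRSB
        return (byte | 0xFFFFFF00) if byte & 0x80 else byte
    half = read_half(memory, addr)
    return (half | 0xFFFF0000) if half & 0x8000 else half

memory = bytes([0x11, 0x82])                          # HALFWORD[0] = 8211h
```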
Mis-aligned PC/R15 (branch opcodes, or MOV/ALU/LDR with Rd=R15)
For ARM code, the low bits of the target address should be usually zero, otherwise, R15 is forcibly aligned by clearing the lower two bits.
For THUMB code, the low bit of the target address may/should/must be set, the bit is (or is not) interpreted as thumb-bit (depending on the opcode), and R15 is then forcibly aligned by clearing the lower bit.
In short, R15 will be always forcibly aligned, so mis-aligned branches won't have effect on subsequent opcodes that use R15, or [R15+disp] as operand.


 THUMB Instruction Set < ^

When operating in THUMB state, cut-down 16bit opcodes are used.
THUMB is supported on T-variants of ARMv4 and up, ie. ARMv4T, ARMv5T, etc.

Summary
THUMB Instruction Summary

Register Operations
THUMB.1: move shifted register
THUMB.2: add/subtract
THUMB.3: move/compare/add/subtract immediate
THUMB.4: ALU operations
THUMB.5: Hi register operations/branch exchange

Memory Addressing Operations
THUMB.6: load PC-relative
THUMB.7: load/store with register offset
THUMB.8: load/store sign-extended byte/halfword
THUMB.9: load/store with immediate offset
THUMB.10: load/store halfword
THUMB.11: load/store SP-relative
THUMB.12: get relative address
THUMB.13: add offset to stack pointer
THUMB.14: push/pop registers
THUMB.15: multiple load/store

Jumps and Calls
THUMB.16: conditional branch
THUMB.17: software interrupt and breakpoint
THUMB.18: unconditional branch
THUMB.19: long branch with link
(See also THUMB.5: BX Rs, and ADD/MOV PC,Rs.)

Note:
Switching between ARM and THUMB state can be done by using the Branch and Exchange (BX) instruction.


 THUMB Instruction Summary < ^

The table below lists all THUMB mode instructions with clock cycles, affected CPSR flags, Format/chapter number, and description.
Only registers R0..R7 can be used in THUMB mode (unless R8-15,SP,PC are explicitly mentioned).

Logical Operations
  Instruction         Cycles Flags Format Expl.
  MOV Rd,Imm8bit      1S     NZ--  3      Rd=nn
  MOV Rd,Rs           1S     NZ00  2      Rd=Rs+0
  MOV R0..14,R8..15   1S     ----  5      Rd=Rs
  MOV R8..14,R0..15   1S     ----  5      Rd=Rs
  MOV R15,R0..15      2S+1N  ----  5      PC=Rs
  MVN Rd,Rs           1S     NZ--  4      Rd=NOT Rs
  AND Rd,Rs           1S     NZ--  4      Rd=Rd AND Rs
  TST Rd,Rs           1S     NZ--  4      Void=Rd AND Rs
  BIC Rd,Rs           1S     NZ--  4      Rd=Rd AND NOT Rs
  ORR Rd,Rs           1S     NZ--  4      Rd=Rd OR Rs
  EOR Rd,Rs           1S     NZ--  4      Rd=Rd XOR Rs
  LSL Rd,Rs,Imm5bit   1S     NZc-  1      Rd=Rs SHL nn
  LSL Rd,Rs           1S+1I  NZc-  4      Rd=Rd SHL (Rs AND 0FFh)
  LSR Rd,Rs,Imm5bit   1S     NZc-  1      Rd=Rs SHR nn
  LSR Rd,Rs           1S+1I  NZc-  4      Rd=Rd SHR (Rs AND 0FFh)
  ASR Rd,Rs,Imm5bit   1S     NZc-  1      Rd=Rs SAR nn
  ASR Rd,Rs           1S+1I  NZc-  4      Rd=Rd SAR (Rs AND 0FFh)
  ROR Rd,Rs           1S+1I  NZc-  4      Rd=Rd ROR (Rs AND 0FFh)
  NOP                 1S     ----  5      R8=R8
Carry flag affected only if shift amount is non-zero.

Arithmetic Operations and Multiply
  Instruction         Cycles Flags Format Expl.
  ADD Rd,Rs,Imm3bit   1S     NZCV  2      Rd=Rs+nn
  ADD Rd,Imm8bit      1S     NZCV  3      Rd=Rd+nn
  ADD Rd,Rs,Rn        1S     NZCV  2      Rd=Rs+Rn
  ADD R0..14,R8..15   1S     ----  5      Rd=Rd+Rs
  ADD R8..14,R0..15   1S     ----  5      Rd=Rd+Rs
  ADD R15,R0..15      2S+1N  ----  5      PC=Rd+Rs
  ADD Rd,PC,Imm8bit*4 1S     ----  12     Rd=(($+4) AND NOT 2)+nn
  ADD Rd,SP,Imm8bit*4 1S     ----  12     Rd=SP+nn
  ADD SP,Imm7bit*4    1S     ----  13     SP=SP+nn
  ADD SP,-Imm7bit*4   1S     ----  13     SP=SP-nn
  ADC Rd,Rs           1S     NZCV  4      Rd=Rd+Rs+Cy
  SUB Rd,Rs,Imm3bit   1S     NZCV  2      Rd=Rs-nn
  SUB Rd,Imm8bit      1S     NZCV  3      Rd=Rd-nn
  SUB Rd,Rs,Rn        1S     NZCV  2      Rd=Rs-Rn
  SBC Rd,Rs           1S     NZCV  4      Rd=Rd-Rs-NOT Cy
  NEG Rd,Rs           1S     NZCV  4      Rd=0-Rs
  CMP Rd,Imm8bit      1S     NZCV  3      Void=Rd-nn
  CMP Rd,Rs           1S     NZCV  4      Void=Rd-Rs
  CMP R0-15,R8-15     1S     NZCV  5      Void=Rd-Rs
  CMP R8-15,R0-15     1S     NZCV  5      Void=Rd-Rs
  CMN Rd,Rs           1S     NZCV  4      Void=Rd+Rs
  MUL Rd,Rs           1S+mI  NZx-  4      Rd=Rd*Rs

Jumps and Calls
  Instruction         Cycles       Flags Format Expl.
  B disp              2S+1N        ----  18     PC=$+/-2048
  BL disp             3S+1N        ----  19     PC=$+/-4M, LR=$+5
  B{cond=true} disp   2S+1N        ----  16     PC=$+/-0..256
  B{cond=false} disp  1S           ----  16     N/A
  BX R0..15           2S+1N        ----  5      PC=Rs, ARM/THUMB (Rs bit0)
  SWI Imm8bit         2S+1N        ----  17     PC=8, ARM SVC mode, LR=$+2
  BKPT Imm8bit        ???          ----  17     ??? ARM9 Prefetch Abort
  BLX disp            ???          ----  ???    ??? ARM9
  BLX R0..R14         ???          ----  ???    ??? ARM9
  POP {Rlist,}PC      (n+1)S+2N+1I ----  14
  MOV R15,R0..15      2S+1N        ----  5      PC=Rs
  ADD R15,R0..15      2S+1N        ----  5      PC=Rd+Rs
The thumb BL instruction occupies two 16bit opcodes, 32bit in total.

Memory Load/Store
  Instruction         Cycles    Flags Format Expl.
  LDR Rd,[Rb,5bit*4]  1S+1N+1I  ----  9      Rd = WORD[Rb+nn]
  LDR Rd,[PC,8bit*4]  1S+1N+1I  ----  6      Rd = WORD[PC+nn]
  LDR Rd,[SP,8bit*4]  1S+1N+1I  ----  11     Rd = WORD[SP+nn]
  LDR Rd,[Rb,Ro]      1S+1N+1I  ----  7      Rd = WORD[Rb+Ro]
  LDRB Rd,[Rb,5bit*1] 1S+1N+1I  ----  9      Rd = BYTE[Rb+nn]
  LDRB Rd,[Rb,Ro]     1S+1N+1I  ----  7      Rd = BYTE[Rb+Ro]
  LDRH Rd,[Rb,5bit*2] 1S+1N+1I  ----  10     Rd = HALFWORD[Rb+nn]
  LDRH Rd,[Rb,Ro]     1S+1N+1I  ----  8      Rd = HALFWORD[Rb+Ro]
  LDSB Rd,[Rb,Ro]     1S+1N+1I  ----  8      Rd = SIGNED_BYTE[Rb+Ro]
  LDSH Rd,[Rb,Ro]     1S+1N+1I  ----  8      Rd = SIGNED_HALFWORD[Rb+Ro]
  STR Rd,[Rb,5bit*4]  2N        ----  9      WORD[Rb+nn] = Rd
  STR Rd,[SP,8bit*4]  2N        ----  11     WORD[SP+nn] = Rd
  STR Rd,[Rb,Ro]      2N        ----  7      WORD[Rb+Ro] = Rd
  STRB Rd,[Rb,5bit*1] 2N        ----  9      BYTE[Rb+nn] = Rd
  STRB Rd,[Rb,Ro]     2N        ----  7      BYTE[Rb+Ro] = Rd
  STRH Rd,[Rb,5bit*2] 2N        ----  10     HALFWORD[Rb+nn] = Rd
  STRH Rd,[Rb,Ro]     2N        ----  8      HALFWORD[Rb+Ro] = Rd
  PUSH {Rlist}{LR}    (n-1)S+2N ----  14
  POP {Rlist}{PC}               ----  14     (ARM9: with mode switch)
  STMIA Rb!,{Rlist}   (n-1)S+2N ----  15
  LDMIA Rb!,{Rlist}   nS+1N+1I  ----  15

THUMB Binary Opcode Format
This table summarizes the position of opcode/parameter bits for THUMB mode instructions, Format 1-19.
Form|_15|_14|_13|_12|_11|_10|_9_|_8_|_7_|_6_|_5_|_4_|_3_|_2_|_1_|_0_|
__1_|_0___0___0_|__Op___|_______Offset______|____Rs_____|____Rd_____|Shifted
__2_|_0___0___0___1___1_|_I,_Op_|___Rn/nn___|____Rs_____|____Rd_____|ADD/SUB
__3_|_0___0___1_|__Op___|____Rd_____|_____________Offset____________|Immedi.
__4_|_0___1___0___0___0___0_|______Op_______|____Rs_____|____Rd_____|AluOp
__5_|_0___1___0___0___0___1_|__Op___|Hd_|Hs_|____Rs_____|____Rd_____|HiReg/BX
__6_|_0___1___0___0___1_|____Rd_____|_____________Word______________|LDR PC
__7_|_0___1___0___1_|__Op___|_0_|___Ro______|____Rb_____|____Rd_____|LDR/STR
__8_|_0___1___0___1_|__Op___|_1_|___Ro______|____Rb_____|____Rd_____|""H/SB/SH
__9_|_0___1___1_|__Op___|_______Offset______|____Rb_____|____Rd_____|""{B}
_10_|_1___0___0___0_|Op_|_______Offset______|____Rb_____|____Rd_____|""H
_11_|_1___0___0___1_|Op_|____Rd_____|_____________Word______________|"" SP
_12_|_1___0___1___0_|Op_|____Rd_____|_____________Word______________|ADD PC/SP
_13_|_1___0___1___1___0___0___0___0_|_S_|___________Word____________|ADD SP,nn
_14_|_1___0___1___1_|Op_|_1___0_|_R_|____________Rlist______________|PUSH/POP
_17_|_1___0___1___1___1___1___1___0_|___________User_Data___________|BKPT ARM9
_15_|_1___1___0___0_|Op_|____Rb_____|____________Rlist______________|STM/LDM
_16_|_1___1___0___1_|_____Cond______|_________Signed_Offset_________|B{cond}
_U__|_1___1___0___1___1___1___1___0_|_____________var_______________|UNDEF ARM9
_17_|_1___1___0___1___1___1___1___1_|___________User_Data___________|SWI
_18_|_1___1___1___0___0_|________________Offset_____________________|B
_19_|_1___1___1___0___1_|_________________________var___________|_0_|BLXsuf ARM9
_U__|_1___1___1___0___1_|_________________________var___________|_1_|UNDEF ARM9
_19_|_1___1___1___1_|_H_|______________Offset_Low/High______________|BL (BLX ARM9)
Further UNDEFS ??? ARM9?
1011 0001 xxxxxxxx (reserved)
1011 0x1x xxxxxxxx (reserved)
1011 10xx xxxxxxxx (reserved)
1011 1111 xxxxxxxx (reserved)
1101 1110 xxxxxxxx (free for user)
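
The bit patterns in the table above translate into a simple classifier. The following Python sketch is one possible way to write it, not a definitive decoder: the name thumb_format is hypothetical, entries marked ARM9 are accepted as on an ARM9, and reserved or undefined patterns return None.

```python
# Illustrative model only; thumb_format is a hypothetical name.
def thumb_format(op):
    """Return the THUMB format number (1-19) of a 16bit opcode, or None."""
    hi4 = op >> 12
    if hi4 in (0, 1):                       # 000: shifted, 00011: add/subtract
        return 2 if ((op >> 11) & 3) == 3 else 1
    if hi4 in (2, 3):                       # 001: immediate
        return 3
    if hi4 == 4:
        if (op >> 10) == 0b010000:
            return 4                        # ALU operations
        if (op >> 10) == 0b010001:
            return 5                        # Hi register operations/BX
        return 6                            # 01001: load PC-relative
    if hi4 == 5:
        return 8 if op & 0x0200 else 7      # Bit 9 selects format 7 or 8
    if hi4 in (6, 7):
        return 9
    if 8 <= hi4 <= 10:
        return hi4 + 2                      # 1000=10, 1001=11, 1010=12
    if hi4 == 11:
        if (op >> 8) == 0xB0:
            return 13                       # ADD SP,nn
        if (op & 0x0600) == 0x0400:
            return 14                       # PUSH/POP
        if (op >> 8) == 0xBE:
            return 17                       # BKPT (ARM9)
        return None                         # reserved
    if hi4 == 12:
        return 15
    if hi4 == 13:
        cond = (op >> 8) & 0xF
        if cond == 0xF:
            return 17                       # SWI
        return None if cond == 0xE else 16  # 1101 1110 is undefined
    if hi4 == 14:
        if not op & 0x0800:
            return 18                       # 11100: B
        return None if op & 1 else 19       # 11101: BLX suffix (ARM9)
    return 19                               # 1111: BL
```

For example, thumb_format(0x4770), the opcode for BX R14, yields 5, and thumb_format(0xB500), the opcode for PUSH {LR}, yields 14.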

 THUMB.1: move shifted register < ^

Opcode Format
  Bit    Expl.
  15-13  Must be 000b for 'move shifted register' instructions
  12-11  Opcode
           00b: LSL Rd,Rs,#Offset   (logical/arithmetic shift left)
           01b: LSR Rd,Rs,#Offset   (logical shift right)
           10b: ASR Rd,Rs,#Offset   (arithmetic shift right)
           11b: Reserved (used for add/subtract instructions)
  10-6   Offset                     (0-31)
  5-3    Rs - Source register       (R0..R7)
  2-0    Rd - Destination register  (R0..R7)
Example: LSL Rd,Rs,#nn ; Rd = Rs << nn ; ARM equivalent: MOVS Rd,Rs,LSL #nn
A zero shift amount has a special meaning (same as for ARM shifts): LSL#0 performs no shift (the carry flag remains unchanged), while LSR/ASR#0 are interpreted as LSR/ASR#32. Attempts to specify LSR/ASR#0 in source code are automatically redirected to LSL#0, and source LSR/ASR#32 is encoded as opcode LSR/ASR#0.
Execution Time: 1S
Flags: Z=zeroflag, N=sign, C=carry (except LSL#0: C=unchanged), V=unchanged.
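
The special handling of a zero shift amount is easy to get wrong, so here is a small Python model of this format. It is a sketch only: thumb1_shift is a hypothetical name, rs is assumed to be an unsigned 32bit value, and only the result and the carry flag are modelled.

```python
# Illustrative model only; thumb1_shift is a hypothetical name.
def thumb1_shift(op, rs, offset, carry_in):
    """op: 0=LSL, 1=LSR, 2=ASR; offset: 5bit field (0-31).

    Returns (result, carry) as 32bit value and 0/1.
    """
    if op == 0:                                   # LSL
        if offset == 0:
            return rs, carry_in                   # LSL#0: no shift, C unchanged
        carry = (rs >> (32 - offset)) & 1         # last bit shifted out
        return (rs << offset) & 0xFFFFFFFF, carry
    amount = offset if offset else 32             # LSR/ASR#0 mean #32
    carry = (rs >> (amount - 1)) & 1              # last bit shifted out
    if op == 1:                                   # LSR
        return rs >> amount, carry
    signed = rs - (1 << 32) if rs & 0x80000000 else rs
    return (signed >> amount) & 0xFFFFFFFF, carry # ASR
```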


 THUMB.2: add/subtract < ^

Opcode Format
  Bit    Expl.
  15-11  Must be 00011b for 'add/subtract' instructions
  10-9   Opcode (0-3)
           0: ADD Rd,Rs,Rn   ;add register        Rd=Rs+Rn
           1: SUB Rd,Rs,Rn   ;subtract register   Rd=Rs-Rn
           2: ADD Rd,Rs,#nn  ;add immediate       Rd=Rs+nn
           3: SUB Rd,Rs,#nn  ;subtract immediate  Rd=Rs-nn
         Pseudo/alias opcode with Imm=0:
           2: MOV Rd,Rs      ;move (affects cpsr) Rd=Rs+0
  8-6    For Register Operand:
           Rn - Register Operand (R0..R7)
         For Immediate Operand:
           nn - Immediate Value  (0-7)
  5-3    Rs - Source register       (R0..R7)
  2-0    Rd - Destination register  (R0..R7)
Return: Rd contains result, N,Z,C,V affected (including MOV).
Execution Time: 1S


 THUMB.3: move/compare/add/subtract immediate < ^

Opcode Format
  Bit    Expl.
  15-13  Must be 001b for this type of instructions
  12-11  Opcode
           00b: MOV Rd,#nn      ;move     Rd   = #nn
           01b: CMP Rd,#nn      ;compare  Void = Rd - #nn
           10b: ADD Rd,#nn      ;add      Rd   = Rd + #nn
           11b: SUB Rd,#nn      ;subtract Rd   = Rd - #nn
  10-8   Rd - Destination Register  (R0..R7)
  7-0    nn - Unsigned Immediate    (0-255)
ARM equivalents for MOV/CMP/ADD/SUB are MOVS/CMP/ADDS/SUBS same format.
Execution Time: 1S
Return: Rd contains result (except CMP), N,Z,C,V affected (for MOV only N,Z).


 THUMB.4: ALU operations < ^

Opcode Format
  Bit    Expl.
  15-10  Must be 010000b for this type of instructions
  9-6    Opcode (0-Fh)
           0: AND Rd,Rs     ;AND logical       Rd = Rd AND Rs
           1: EOR Rd,Rs     ;XOR logical       Rd = Rd XOR Rs
           2: LSL Rd,Rs     ;log. shift left   Rd = Rd << (Rs AND 0FFh)
           3: LSR Rd,Rs     ;log. shift right  Rd = Rd >> (Rs AND 0FFh)
           4: ASR Rd,Rs     ;arit shift right  Rd = Rd SAR (Rs AND 0FFh)
           5: ADC Rd,Rs     ;add with carry    Rd = Rd + Rs + Cy
           6: SBC Rd,Rs     ;sub with carry    Rd = Rd - Rs - NOT Cy
           7: ROR Rd,Rs     ;rotate right      Rd = Rd ROR (Rs AND 0FFh)
           8: TST Rd,Rs     ;test            Void = Rd AND Rs
           9: NEG Rd,Rs     ;negate            Rd = 0 - Rs
           A: CMP Rd,Rs     ;compare         Void = Rd - Rs
           B: CMN Rd,Rs     ;neg.compare     Void = Rd + Rs
           C: ORR Rd,Rs     ;OR logical        Rd = Rd OR Rs
           D: MUL Rd,Rs     ;multiply          Rd = Rd * Rs
           E: BIC Rd,Rs     ;bit clear         Rd = Rd AND NOT Rs
           F: MVN Rd,Rs     ;not               Rd = NOT Rs
  5-3    Rs - Source Register       (R0..R7)
  2-0    Rd - Destination Register  (R0..R7)
ARM equivalent for NEG would be RSBS.
Return: Rd contains result (except TST,CMP,CMN).
Affected Flags:
  N,Z,C,V for  ADC,SBC,NEG,CMP,CMN
  N,Z,C   for  LSL,LSR,ASR,ROR (carry flag unchanged if zero shift amount)
  N,Z,C   for  MUL on ARMv4 and below: carry flag destroyed
  N,Z     for  MUL on ARMv5 and above: carry flag unchanged
  N,Z     for  AND,EOR,TST,ORR,BIC,MVN
Execution Time:
  1S      for  AND,EOR,ADC,SBC,TST,NEG,CMP,CMN,ORR,BIC,MVN
  1S+1I   for  LSL,LSR,ASR,ROR
  1S+mI   for  MUL on ARMv4 (m=1..4; depending on MSBs of incoming Rd value)
  1S+mI   for  MUL on ARMv5 (m=3; slow, regardless of MSBs of Rd value)

 THUMB.5: Hi register operations/branch exchange < ^

Opcode Format
  Bit    Expl.
  15-10  Must be 010001b for this type of instructions
  9-8    Opcode (0-3)
           0: ADD Rd,Rs   ;add        Rd = Rd+Rs
           1: CMP Rd,Rs   ;compare  Void = Rd-Rs  ;CPSR affected
           2: MOV Rd,Rs   ;move       Rd = Rs
           2: NOP         ;nop        R8 = R8
           3: BX  Rs      ;jump       PC = Rs     ;may switch THUMB/ARM
           3: BLX Rs      ;call       PC = Rs     ;may switch THUMB/ARM (ARM9)
  7      MSBd - Destination Register most significant bit (or BL/BLX flag)
  6      MSBs - Source Register most significant bit
  5-3    Rs - Source Register        (together with MSBs: R0..R15)
  2-0    Rd - Destination Register   (together with MSBd: R0..R15)
Restrictions: For ADD/CMP/MOV, MSBs and/or MSBd must be set, ie. it is not allowed that both are cleared.
When using R15 (PC) as operand, the value will be the address of the instruction plus 4 (ie. $+4). Except for BX R15: CPU switches to ARM state, and PC is auto-aligned as (($+4) AND NOT 2).
For BX, MSBs may be 0 or 1, MSBd must be zero, Rd is not used/zero.
For BLX, MSBs may be 0 or 1, MSBd must be set, Rd is not used/zero.
For BX/BLX, when Bit 0 of the value in Rs is zero:
  Processor will be switched into ARM mode!
  If so, Bit 1 of Rs must be cleared (32bit word aligned).
  Thus, BX PC (switch to ARM) may be issued from word-aligned address
  only, the destination is PC+4 (ie. the following halfword is skipped).
BLX may not use R15. BLX saves the return address as LR=PC+3 (with thumb bit).
Assemblers/Disassemblers should use MOV R8,R8 as NOP (in THUMB mode).
Return: Only CMP affects CPSR condition flags!
Execution Time:
  1S     for ADD/MOV/CMP
  2S+1N  for ADD/MOV with Rd=R15, and for BX
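
The state selection done by BX can be summarised in a few lines of Python. This is only a sketch: the name bx is hypothetical, and raising an error for an ARM target with Bit 1 set is a choice made for this example (the text above only says that the bit must be cleared).

```python
# Illustrative model only; bx is a hypothetical name.
def bx(rs_value):
    """Return (new PC, new state) for BX Rs, where state is 'THUMB' or 'ARM'."""
    if rs_value & 1:
        return rs_value & ~1, "THUMB"     # Bit 0 set: THUMB, clear the low bit
    if rs_value & 2:
        # Assumption of this sketch: treat the disallowed case as an error.
        raise ValueError("ARM target must be 32bit word aligned")
    return rs_value, "ARM"                # Bit 0 clear: switch to ARM state
```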

 THUMB.6: load PC-relative < ^

Opcode Format
  Bit    Expl.
  15-11  Must be 01001b for this type of instructions
  N/A    Opcode (fixed)
           LDR Rd,[PC,#nn]      ;load 32bit    Rd = WORD[PC+nn]
  10-8   Rd - Destination Register   (R0..R7)
  7-0    nn - Unsigned offset        (0-1020 in steps of 4)
The value of PC will be interpreted as (($+4) AND NOT 2).
Return: No flags affected, data loaded into Rd.
Execution Time: 1S+1N+1I
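
The address calculation can be written out as a short Python sketch. The name thumb6_target is hypothetical, and opcode_addr is assumed to be the address of the LDR opcode itself (ie. $).

```python
# Illustrative model only; thumb6_target is a hypothetical name.
def thumb6_target(opcode_addr, opcode):
    """Return (Rd, load address) for a THUMB.6 opcode located at opcode_addr."""
    if (opcode >> 11) != 0b01001:
        raise ValueError("not a THUMB.6 opcode")
    rd = (opcode >> 8) & 7
    nn = (opcode & 0xFF) * 4              # 0-1020 in steps of 4
    pc = (opcode_addr + 4) & ~2           # (($+4) AND NOT 2)
    return rd, pc + nn
```

Because of the AND NOT 2, the same opcode placed at 08000000h and at 08000002h loads from the same word-aligned address.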


 THUMB.7: load/store with register offset < ^

Opcode Format
  Bit    Expl.
  15-12  Must be 0101b for this type of instructions
  11-10  Opcode (0-3)
           0: STR  Rd,[Rb,Ro]   ;store 32bit data  WORD[Rb+Ro] = Rd
           1: STRB Rd,[Rb,Ro]   ;store  8bit data  BYTE[Rb+Ro] = Rd
           2: LDR  Rd,[Rb,Ro]   ;load  32bit data  Rd = WORD[Rb+Ro]
           3: LDRB Rd,[Rb,Ro]   ;load   8bit data  Rd = BYTE[Rb+Ro]
  9      Must be zero (0) for this type of instructions
  8-6    Ro - Offset Register              (R0..R7)
  5-3    Rb - Base Register                (R0..R7)
  2-0    Rd - Source/Destination Register  (R0..R7)
Return: No flags affected, data loaded either into Rd or into memory.
Execution Time: 1S+1N+1I for LDR, or 2N for STR


 THUMB.8: load/store sign-extended byte/halfword < ^

Opcode Format
  Bit    Expl.
15-12 Must be 0101b for this type of instructions
11-10 Opcode (0-3)
0: STRH Rd,[Rb,Ro] ;store 16bit data HALFWORD[Rb+Ro] = Rd
1: LDSB Rd,[Rb,Ro] ;load sign-extended 8bit Rd = BYTE[Rb+Ro]
2: LDRH Rd,[Rb,Ro] ;load zero-extended 16bit Rd = HALFWORD[Rb+Ro]
3: LDSH Rd,[Rb,Ro] ;load sign-extended 16bit Rd = HALFWORD[Rb+Ro]
9 Must be set (1) for this type of instructions
8-6 Ro - Offset Register (R0..R7)
5-3 Rb - Base Register (R0..R7)
2-0 Rd - Source/Destination Register (R0..R7)
Return: No flags affected, data loaded either into Rd or into memory.
Execution Time: 1S+1N+1I for LDR, or 2N for STR


 THUMB.9: load/store with immediate offset < ^

Opcode Format
  Bit    Expl.
15-13 Must be 011b for this type of instructions
12-11 Opcode (0-3)
0: STR Rd,[Rb,#nn] ;store 32bit data WORD[Rb+nn] = Rd
1: LDR Rd,[Rb,#nn] ;load 32bit data Rd = WORD[Rb+nn]
2: STRB Rd,[Rb,#nn] ;store 8bit data BYTE[Rb+nn] = Rd
3: LDRB Rd,[Rb,#nn] ;load 8bit data Rd = BYTE[Rb+nn]
10-6 nn - Unsigned Offset (0-31 for BYTE, 0-124 for WORD)
5-3 Rb - Base Register (R0..R7)
2-0 Rd - Source/Destination Register (R0..R7)
Return: No flags affected, data loaded either into Rd or into memory.
Execution Time: 1S+1N+1I for LDR, or 2N for STR
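
A small Python sketch of how a disassembler might decode this format. The function name and output string are assumptions made for this example; the 5bit nn field is assumed to be stored in units of 4 for word accesses, which is what gives the 0-124 range.

```python
def decode_thumb9(opcode):
    """Return a disassembly string for a THUMB.9 opcode (illustrative only)."""
    if (opcode >> 13) != 0b011:
        raise ValueError("not a THUMB.9 opcode")
    op = (opcode >> 11) & 3                       # bits 12-11: opcode
    mnemonic = ("STR", "LDR", "STRB", "LDRB")[op]
    nn = (opcode >> 6) & 0x1F                     # bits 10-6: offset field
    if op < 2:
        nn *= 4                                   # word access: units of 4
    rb = (opcode >> 3) & 7                        # bits 5-3: base register
    rd = opcode & 7                               # bits 2-0: source/dest register
    return f"{mnemonic} R{rd},[R{rb},#{nn}]"
```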


 THUMB.10: load/store halfword < ^

Opcode Format
  Bit    Expl.
15-12 Must be 1000b for this type of instructions
11 Opcode (0-1)
0: STRH Rd,[Rb,#nn] ;store 16bit data HALFWORD[Rb+nn] = Rd
1: LDRH Rd,[Rb,#nn] ;load 16bit data Rd = HALFWORD[Rb+nn]
10-6 nn - Unsigned Offset (0-62, step 2)
5-3 Rb - Base Register (R0..R7)
2-0 Rd - Source/Destination Register (R0..R7)
Return: No flags affected, data loaded either into Rd or into memory.
Execution Time: 1S+1N+1I for LDR, or 2N for STR


 THUMB.11: load/store SP-relative < ^

Opcode Format
  Bit    Expl.
15-12 Must be 1001b for this type of instructions
11 Opcode (0-1)
0: STR Rd,[SP,#nn] ;store 32bit data WORD[SP+nn] = Rd
1: LDR Rd,[SP,#nn] ;load 32bit data Rd = WORD[SP+nn]
10-8 Rd - Source/Destination Register (R0..R7)
7-0 nn - Unsigned Offset (0-1020, step 4)
Return: No flags affected, data loaded either into Rd or into memory.
Execution Time: 1S+1N+1I for LDR, or 2N for STR


 THUMB.12: get relative address < ^

Opcode Format
  Bit    Expl.
15-12 Must be 1010b for this type of instructions
11 Opcode/Source Register (0-1)
0: ADD Rd,PC,#nn ;Rd = (($+4) AND NOT 2) + nn
1: ADD Rd,SP,#nn ;Rd = SP + nn
10-8 Rd - Destination Register (R0..R7)
7-0 nn - Unsigned Offset (0-1020, step 4)
Return: No flags affected, result in Rd.
Execution Time: 1S


 THUMB.13: add offset to stack pointer < ^

Opcode Format
  Bit    Expl.
15-8 Must be 10110000b for this type of instructions
7 Opcode/Sign
0: ADD SP,#nn ;SP = SP + nn
1: ADD SP,#-nn ;SP = SP - nn
6-0 nn - Unsigned Offset (0-508, step 4)
Return: No flags affected, SP adjusted.
Execution Time: 1S
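
For illustration, the sign bit and the scaled offset can be applied like this in Python. The helper name is hypothetical, and the 7bit nn field is assumed to be stored in units of 4 (hence the 0-508 range).

```python
def thumb_add_sp(opcode, sp):
    """Return the new SP after ADD SP,#nn or ADD SP,#-nn (illustrative only)."""
    if (opcode >> 8) != 0b10110000:
        raise ValueError("not a THUMB.13 opcode")
    nn = (opcode & 0x7F) * 4      # bits 6-0: offset in units of 4
    if opcode & 0x80:             # bit 7: sign (1 = subtract)
        nn = -nn
    return (sp + nn) & 0xFFFFFFFF
```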


 THUMB.14: push/pop registers < ^

Opcode Format
  Bit    Expl.
15-12 Must be 1011b for this type of instructions
11 Opcode (0-1)
0: PUSH {Rlist}{LR} ;store in memory, decrements SP (R13)
1: POP {Rlist}{PC} ;load from memory, increments SP (R13)
10-9 Must be 10b for this type of instructions
8 PC/LR Bit (0-1)
0: No
1: PUSH LR (R14), or POP PC (R15)
7-0 Rlist - List of Registers (R7..R0)
In THUMB mode stack is always meant to be 'full descending', ie. PUSH is equivalent to 'STMFD/STMDB' and POP to 'LDMFD/LDMIA' in ARM mode.

Examples:
  PUSH {R0-R3}     ;push R0,R1,R2,R3
  PUSH {R0,R2,LR}  ;push R0,R2,LR
  POP  {R4,R7}     ;pop R4,R7
  POP  {R2-R4,PC}  ;pop R2,R3,R4,PC
Note: When calling to a sub-routine, the return address is stored in LR register, when calling further sub-routines, PUSH {LR} must be used to save higher return address on stack. If so, POP {PC} can be later used to return from the sub-routine.
POP {PC} ignores the least significant bit of the return address (processor remains in thumb state even if bit0 was cleared), when intending to return with optional mode switch, use a POP/BX combination (eg. POP {R3} / BX R3).
ARM9: POP {PC} copies the LSB to thumb bit (switches to ARM if bit0=0).
Return: No flags affected, SP adjusted, registers loaded/stored.
Execution Time: nS+1N+1I (POP), (n+1)S+2N+1I (POP PC), or (n-1)S+2N (PUSH).
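
The 'full descending' behaviour can be modelled roughly as below. This is a sketch under assumptions: registers are a plain list of 16 integers, memory is a dict of word addresses, and the function names are invented for the example. The lowest register is placed at the lowest address, as for STMDB/LDMIA, and POP {PC} clears bit 0 as described for the ARM7 above.

```python
def push(regs, mem, rlist, lr=False):
    """PUSH {Rlist}{LR}: decrement SP first, then store upwards."""
    items = [r for r in range(8) if rlist & (1 << r)]
    if lr:
        items.append(14)
    sp = (regs[13] - 4 * len(items)) & 0xFFFFFFFF
    regs[13] = sp
    for r in items:               # lowest register goes to lowest address
        mem[sp] = regs[r]
        sp += 4

def pop(regs, mem, rlist, pc=False):
    """POP {Rlist}{PC}: load upwards, then leave SP above the loaded data."""
    items = [r for r in range(8) if rlist & (1 << r)]
    if pc:
        items.append(15)
    sp = regs[13]
    for r in items:
        value = mem[sp]
        if r == 15:
            value &= ~1           # ARM7: bit 0 of return address is ignored
        regs[r] = value
        sp += 4
    regs[13] = sp
```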


 THUMB.15: multiple load/store < ^

Opcode Format
  Bit    Expl.
15-12 Must be 1100b for this type of instructions
11 Opcode (0-1)
0: STMIA Rb!,{Rlist} ;store in memory, increments Rb
1: LDMIA Rb!,{Rlist} ;load from memory, increments Rb
10-8 Rb - Base register (modified) (R0-R7)
7-0 Rlist - List of Registers (R7..R0)
Both STM and LDM are incrementing the Base Register.
The lowest register in the list (ie. R0, if it's in the list) is stored/loaded at the lowest memory address.
Examples:
  STMIA R7!,{R0-R2}  ;store R0,R1,R2
  LDMIA R0!,{R1,R5}  ;load R1,R5
Return: No flags affected, Rb adjusted, registers loaded/stored.
Execution Time: nS+1N+1I for LDM, or (n-1)S+2N for STM.

Strange Effects on Invalid Rlist's
Empty Rlist: R15 loaded/stored (ARMv4 only), and Rb=Rb+40h (ARMv4-v5).
Writeback with Rb included in Rlist: Store OLD base if Rb is FIRST entry in Rlist, otherwise store NEW base (STM/ARMv4), always store OLD base (STM/ARMv5), no writeback (LDM/ARMv4/ARMv5; at this point, THUMB opcodes work different than ARM opcodes).


 THUMB.16: conditional branch < ^

Opcode Format
  Bit    Expl.
15-12 Must be 1101b for this type of instructions
11-8 Opcode/Condition (0-Fh)
0: BEQ label ;Z=1 ;equal (zero) (same)
1: BNE label ;Z=0 ;not equal (nonzero) (not same)
2: BCS/BHS label ;C=1 ;unsigned higher or same (carry set)
3: BCC/BLO label ;C=0 ;unsigned lower (carry cleared)
4: BMI label ;N=1 ;negative (minus)
5: BPL label ;N=0 ;positive or zero (plus)
6: BVS label ;V=1 ;overflow (V set)
7: BVC label ;V=0 ;no overflow (V cleared)
8: BHI label ;C=1 and Z=0 ;unsigned higher
9: BLS label ;C=0 or Z=1 ;unsigned lower or same
A: BGE label ;N=V ;greater or equal
B: BLT label ;N<>V ;less than
C: BGT label ;Z=0 and N=V ;greater than
D: BLE label ;Z=1 or N<>V ;less or equal
E: Undefined, should not be used
F: Reserved for SWI instruction (see SWI opcode)
7-0 Signed Offset, step 2 ($+4-256..$+4+254)
Destination address must be halfword aligned (ie. bit 0 cleared)
Return: No flags affected, PC adjusted if condition true
Execution Time:
  2S+1N   if condition true (jump executed)
  1S      if condition false
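
As a sketch, the destination of a conditional branch can be computed like this in Python. The helper name is made up for the example; 'addr' stands for $ (the address of the branch opcode), and the 8bit offset is treated as a signed count of halfwords.

```python
def thumb_cond_branch_target(opcode, addr):
    """Return (condition, target) for a THUMB.16 opcode (illustrative only)."""
    if (opcode >> 12) != 0b1101:
        raise ValueError("not a THUMB.16 opcode")
    cond = (opcode >> 8) & 0xF
    if cond >= 0xE:
        raise ValueError("condition E is undefined, F is SWI")
    offset = opcode & 0xFF
    if offset & 0x80:             # sign-extend the 8bit field
        offset -= 0x100
    return cond, (addr + 4 + offset * 2) & 0xFFFFFFFF
```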

 THUMB.17: software interrupt and breakpoint < ^

Opcode Format
  Bit    Expl.
15-8 Opcode
11011111b: SWI nn ;software interrupt
10111110b: BKPT nn ;software breakpoint (ARMv5 and up)
7-0 nn - Comment Immediate (0-255)
SWI is intended for calls to the operating system; it enters Supervisor mode (SVC) in ARM state. BKPT is intended for debugging; it enters Abort mode in ARM state via the Prefetch Abort vector.

Execution SWI/BKPT:
  R14_svc=PC+2     R14_abt=PC+4     ;save return address
  SPSR_svc=CPSR    SPSR_abt=CPSR    ;save CPSR flags
  CPSR=<changed>   CPSR=<changed>   ;Enter svc/abt, ARM state, IRQs disabled
  PC=VVVV0008h     PC=VVVV000Ch     ;jump to SWI/PrefetchAbort vector address
Execution Time: 2S+1N

Interpreting the Comment Field:
The immediate parameter is ignored by the processor; the user interrupt handler may read out this number by examining the lower 8bit of the 16bit opcode at [R14_svc-2]. In case your program also executes SWIs from inside ARM mode: your SWI handler must then examine the T Bit in SPSR_svc to determine whether it was an ARM SWI, and if so, examine the lower 24bit of the 32bit opcode at [R14_svc-4].
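
The lookup described above might be modelled as follows. This is a Python sketch, not handler code: the read16/read32 callbacks and the function name are hypothetical, and the T Bit is assumed to be bit 5 of the PSR.

```python
T_BIT = 1 << 5   # THUMB state bit in CPSR/SPSR

def swi_comment(read16, read32, r14_svc, spsr_svc):
    """Return the SWI comment field seen by a handler (illustrative only).

    read16/read32 are caller-supplied memory readers (assumed for this sketch).
    """
    if spsr_svc & T_BIT:
        # Called from THUMB: 16bit opcode sits just below the return address.
        return read16(r14_svc - 2) & 0xFF
    # Called from ARM: 32bit opcode, comment is the lower 24 bits.
    return read32(r14_svc - 4) & 0x00FFFFFF
```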

For Returning from SWI use this instruction:
  MOVS PC,R14
That instruction restores both PC and CPSR, ie. PC=R14_svc, and CPSR=SPSR_svc. In this case (as called from THUMB mode), this also includes restoring THUMB mode.

Nesting SWIs:
SPSR_svc and R14_svc should be saved on stack before either invoking nested SWIs, or (if the IRQ handler uses SWIs) before enabling IRQs.


 THUMB.18: unconditional branch < ^

Opcode Format
  Bit    Expl.
15-11 Must be 11100b for this type of instructions
N/A Opcode (fixed)
B label ;branch (jump)
10-0 Signed Offset, step 2 ($+4-2048..$+4+2046)
Return: No flags affected, PC adjusted.
Execution Time: 2S+1N


 THUMB.19: long branch with link < ^

Opcode Format
This may be used to call (or jump) to a subroutine, return address is saved in LR (R14).
Unlike all other THUMB mode instructions, this instruction occupies 32bit of memory which are split into two 16bit THUMB opcodes.

First Instruction - LR = PC+4+(nn SHL 12)
  Bit    Expl.
15-11 Must be 11110b for BL/BLX type of instructions
10-0 nn - Upper 11 bits of Target Address
Second Instruction - PC = LR + (nn SHL 1), and LR = PC+2 OR 1 (and BLX: T=0)
  Bit    Expl.
15-11 Opcode
11111b: BL label ;branch long with link
11101b: BLX label ;branch long with link switch to ARM mode (ARM9)
10-0 nn - Lower 11 bits of Target Address (BLX: Bit0 Must be zero)
The destination address range is (PC+4)-400000h..+3FFFFEh, ie. PC+/-4M.
Target must be halfword-aligned. As Bit 0 in LR is set, it may be used to return by a BX LR instruction (keeping CPU in THUMB mode).
Return: No flags affected, PC adjusted, return address in LR.
Execution Time: 3S+1N (first opcode 1S, second opcode 2S+1N).
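
To illustrate how the two halves combine, here is a Python sketch for the plain BL case (not BLX). The function name is an assumption; 'addr' is the address of the first halfword, and the upper 11bit field is treated as signed.

```python
def thumb_bl_target(first, second, addr):
    """Return (target, lr) for a BL opcode pair (illustrative only)."""
    if (first >> 11) != 0b11110 or (second >> 11) != 0b11111:
        raise ValueError("not a BL opcode pair")
    hi = first & 0x7FF
    if hi & 0x400:                          # sign-extend upper 11 bits
        hi -= 0x800
    temp_lr = addr + 4 + (hi << 12)         # first opcode: LR = PC+4+(nn SHL 12)
    target = (temp_lr + ((second & 0x7FF) << 1)) & 0xFFFFFFFF
    lr = ((addr + 4) | 1) & 0xFFFFFFFF      # return address with thumb bit set
    return target, lr
```

With the extreme field values this reproduces the (PC+4)-400000h..+3FFFFEh range given above.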

Note
Exceptions may or may not occur between first and second opcode, this is "implementation defined" ???


 ARM Instruction Set < ^

When operating in ARM state, full 32bit opcodes are used.

Summaries
ARM Instruction Summary
ARM Condition Field

Jumps and Calls
ARM.3: Branch and Exchange (BX, BLX)
ARM.4: Branch and Branch with Link (B, BL, BLX)
(Also, most ALU, LDR, and LDM opcodes can change PC.)

Register Operations
ARM.5: Data Processing
ARM.6: PSR Transfer (MRS, MSR)
ARM.7: Multiply and Multiply-Accumulate (MUL,MLA)

Memory Addressing Operations
ARM.9: Single Data Transfer (LDR, STR, PLD)
ARM.10: Halfword, Doubleword, and Signed Data Transfer
ARM.11: Block Data Transfer (LDM,STM)
ARM.12: Single Data Swap (SWP)

Exception Calls and Coprocessor
ARM.13: Software Interrupt (SWI,BKPT)
ARM.14: Coprocessor Data Operations (CDP)
ARM.15: Coprocessor Data Transfers (LDC,STC)
ARM.16: Coprocessor Register Transfers (MRC, MCR)
ARM.X: Coprocessor Double-Register Transfer (MCRR,MRRC)
ARM.17: Undefined Instruction

ARM.X: Count Leading Zeros
ARM.X: QADD/QSUB

ARM 26bit Memory Interface

Note:
Switching between ARM and THUMB state can be done by using the Branch and Exchange (BX) instruction.


 ARM Instruction Summary < ^

Modification of CPSR flags is optional for all {S} instructions.

Logical Operations
  Instruction             Cycles   Flags Format Expl.
  MOV{cond}{S} Rd,Op2     1S+x+y   NZc-  5      Rd = Op2
  MVN{cond}{S} Rd,Op2     1S+x+y   NZc-  5      Rd = NOT Op2
  AND{cond}{S} Rd,Rn,Op2  1S+x+y   NZc-  5      Rd = Rn AND Op2
  TST{cond}{P} Rn,Op2     1S+x     NZc-  5      Void = Rn AND Op2
  EOR{cond}{S} Rd,Rn,Op2  1S+x+y   NZc-  5      Rd = Rn XOR Op2
  TEQ{cond}{P} Rn,Op2     1S+x     NZc-  5      Void = Rn XOR Op2
  ORR{cond}{S} Rd,Rn,Op2  1S+x+y   NZc-  5      Rd = Rn OR Op2
  BIC{cond}{S} Rd,Rn,Op2  1S+x+y   NZc-  5      Rd = Rn AND NOT Op2
Add x=1I cycles if Op2 shifted-by-register. Add y=1S+1N cycles if Rd=R15.
Carry flag affected only if Op2 contains a non-zero shift amount.

Arithmetic Operations
  Instruction             Cycles  Flags Format Expl.
  ADD{cond}{S} Rd,Rn,Op2  1S+x+y  NZCV  5      Rd = Rn+Op2
  ADC{cond}{S} Rd,Rn,Op2  1S+x+y  NZCV  5      Rd = Rn+Op2+Cy
  SUB{cond}{S} Rd,Rn,Op2  1S+x+y  NZCV  5      Rd = Rn-Op2
  SBC{cond}{S} Rd,Rn,Op2  1S+x+y  NZCV  5      Rd = Rn-Op2+Cy-1
  RSB{cond}{S} Rd,Rn,Op2  1S+x+y  NZCV  5      Rd = Op2-Rn
  RSC{cond}{S} Rd,Rn,Op2  1S+x+y  NZCV  5      Rd = Op2-Rn+Cy-1
  CMP{cond}{P} Rn,Op2     1S+x    NZCV  5      Void = Rn-Op2
  CMN{cond}{P} Rn,Op2     1S+x    NZCV  5      Void = Rn+Op2
Add x=1I cycles if Op2 shifted-by-register. Add y=1S+1N cycles if Rd=R15.

Multiply
  Instruction                     Cycles    Flags Format Expl.
  MUL{cond}{S} Rd,Rm,Rs           1S+mI     NZx-  7      Rd = Rm*Rs
  MLA{cond}{S} Rd,Rm,Rs,Rn        1S+mI+1I  NZx-  7      Rd = Rm*Rs+Rn
  UMULL{cond}{S} RdLo,RdHi,Rm,Rs  1S+mI+1I  NZx-  7      RdHiLo = Rm*Rs
  UMLAL{cond}{S} RdLo,RdHi,Rm,Rs  1S+mI+2I  NZx-  7      RdHiLo = Rm*Rs+RdHiLo
  SMULL{cond}{S} RdLo,RdHi,Rm,Rs  1S+mI+1I  NZx-  7      RdHiLo = Rm*Rs
  SMLAL{cond}{S} RdLo,RdHi,Rm,Rs  1S+mI+2I  NZx-  7      RdHiLo = Rm*Rs+RdHiLo
  SMLAxy{cond} Rd,Rm,Rs,Rn                  ---q  7      Rd=HalfRm*HalfRs+Rn          ARMv5TE(xP)
  SMLAWy{cond} Rd,Rm,Rs,Rn                  ---q  7      Rd=(Rm*HalfRs)/10000h+Rn     ARMv5TE(xP)
  SMULWy{cond} Rd,Rm,Rs                     ----  7      Rd=(Rm*HalfRs)/10000h        ARMv5TE(xP)
  SMLALxy{cond} RdLo,RdHi,Rm,Rs             ----  7      RdHiLo=RdHiLo+HalfRm*HalfRs  ARMv5TE(xP)
  SMULxy{cond} Rd,Rm,Rs                     ----  7      Rd=HalfRm*HalfRs             ARMv5TE(xP)
Memory Load/Store
  Instruction                     Cycles       Flags Format Expl.
LDR{cond}{B}{T} Rd,<Address> 1S+1N+1I +y ---- 9 Rd=[Rn+/-<offset>]
LDR{cond}H Rd,<Address> 1S+1N+1I +y ---- 10 Load Unsigned halfword
LDR{cond}D Rd,<Address> ---- 10 Load Dword ARMv5TE
LDR{cond}SB Rd,<Address> 1S+1N+1I +y ---- 10 Load Signed byte
LDR{cond}SH Rd,<Address> 1S+1N+1I +y ---- 10 Load Signed halfword
LDM{cond}{amod} Rn{!},<Rlist>{^} nS+1N+1I +y ---- 11 Load Multiple
STR{cond}{B}{T} Rd,<Address> 2N ---- 9 [Rn+/-<offset>]=Rd
STR{cond}H Rd,<Address> 2N ---- 10 Store halfword
STR{cond}D Rd,<Address> ---- 10 Store Dword ARMv5TE
STM{cond}{amod} Rn{!},<Rlist>{^} (n-1)S+2N ---- 11 Store Multiple
SWP{cond}{B} Rd,Rm,[Rn] 1S+2N+1I ---- 12 Rd=[Rn], [Rn]=Rm
PLD <Address> 1S ---- 9 Prepare Cache ARMv5TE
For LDR/LDM, add y=1S+1N if Rd=R15, or if R15 in Rlist.

Jumps, Calls, CPSR Mode, and others
  Instruction              Cycles  Flags Format Expl.
B{cond} label 2S+1N ---- 4 PC=$+8+/-32M
BL{cond} label 2S+1N ---- 4 PC=$+8+/-32M, LR=$+4
BX{cond} Rn 2S+1N ---- 3 PC=Rn, T=Rn.0 (THUMB/ARM)
BLX{cond} Rn 2S+1N ---- 3 PC=Rn, T=Rn.0, LR=PC+4, ARM9
BLX label 2S+1N ---- 4 PC=$+8+/-32M, LR=$+4, T=1, ARM9
MRS{cond} Rd,Psr 1S ---- 6 Rd=Psr
MSR{cond} Psr{_field},Op 1S (psr) 6 Psr[field]=Op
SWI{cond} Imm24bit 2S+1N ---- 13 PC=8, ARM Svc mode, LR=$+4
BKPT Imm16bit ??? ---- ??? PC=C, ARM Abt mode, LR=$+4 ARM9
The Undefined Instruction 2S+1I+1N ---- 17 PC=4, ARM Und mode, LR=$+4
cond=false 1S ---- .. Any opcode with condition=false
NOP 1S ---- 5 R0=R0
CLZ{cond} Rd,Rm ??? ---- ??? Count Leading Zeros ARMv5
QADD{cond} Rd,Rm,Rn ---q Rd=Rm+Rn ARMv5TE(xP)
QSUB{cond} Rd,Rm,Rn ---q Rd=Rm-Rn ARMv5TE(xP)
QDADD{cond} Rd,Rm,Rn ---q Rd=Rm+Rn*2 ARMv5TE(xP)
QDSUB{cond} Rd,Rm,Rn ---q Rd=Rm-Rn*2 ARMv5TE(xP)
Coprocessor Functions (if any)
  Instruction                         Cycles  Flags Format Expl.
CDP{cond} Pn,<cpopc>,Cd,Cn,Cm{,<cp>} 1S+bI ---- 14 Coprocessor specific
STC{cond}{L} Pn,Cd,<Address> (n-1)S+2N+bI 15 [address] = CRd
LDC{cond}{L} Pn,Cd,<Address> (n-1)S+2N+bI 15 CRd = [address]
MCR{cond} Pn,<cpopc>,Rd,Cn,Cm{,<cp>} 1S+bI+1C 16 CRn = Rn {<op> CRm}
MRC{cond} Pn,<cpopc>,Rd,Cn,Cm{,<cp>} 1S+(b+1)I+1C 16 Rn = CRn {<op> CRm}
CDP2,STC2,LDC2,MCR2,MRC2 - ARMv5 Extensions similar above, without {cond}
MCRR{cond} Pn,<cpopc>,Rd,Rn,Cm ;write Rd,Rn to coproc ARMv5TE
MRRC{cond} Pn,<cpopc>,Rd,Rn,Cm ;read Rd,Rn from coproc ARMv5TE
Note that no sections 1-2 exist; that is because the section numbers comply with the chapter numbers of the official ARM docs, which described ARM opcodes in chapters 3-17.

ARM Binary Opcode Format
  |..3 ..................2 ..................1 ..................0|
|1_0_9_8_7_6_5_4_3_2_1_0_9_8_7_6_5_4_3_2_1_0_9_8_7_6_5_4_3_2_1_0|
|_Cond__|0_0_0|___Op__|S|__Rn___|__Rd___|__Shift__|Typ|0|__Rm___| DataProc
|_Cond__|0_0_0|___Op__|S|__Rn___|__Rd___|__Rs___|0|Typ|1|__Rm___| DataProc
|_Cond__|0_0_1|___Op__|S|__Rn___|__Rd___|_Shift_|___Immediate___| DataProc
|_Cond__|0_0_1_1_0|P|1|0|_Field_|__Rd___|_Shift_|___Immediate___| PSR Imm
|_Cond__|0_0_0_1_0|P|L|0|_Field_|__Rd___|0_0_0_0|0_0_0_0|__Rm___| PSR Reg
|_Cond__|0_0_0_1_0_0_1_0_1_1_1_1_1_1_1_1_1_1_1_1|0_0|L|1|__Rn___| BX,BLX
|1_1_1_0|0_0_0_1_0_0_1_0|_____immediate_________|0_1_1_1|_immed_| BKPT ARM9
|_Cond__|0_0_0_1_0_1_1_0_1_1_1_1|__Rd___|1_1_1_1|0_0_0_1|__Rm___| CLZ ARM9
|_Cond__|0_0_0_1_0|Op_|0|__Rn___|__Rd___|0_0_0_0|0_1_0_1|__Rm___| QALU ARM9
|_Cond__|0_0_0_0_0_0|A|S|__Rd___|__Rn___|__Rs___|1_0_0_1|__Rm___| Multiply
|_Cond__|0_0_0_0_1|U|A|S|_RdHi__|_RdLo__|__Rs___|1_0_0_1|__Rm___| MulLong
|_Cond__|0_0_0_1_0|Op_|0|Rd/RdHi|Rn/RdLo|__Rs___|1|y|x|0|__Rm___| MulHalf
|_Cond__|0_0_0_1_0|B|0_0|__Rn___|__Rd___|0_0_0_0|1_0_0_1|__Rm___| TransSwp12
|_Cond__|0_0_0|P|U|0|W|L|__Rn___|__Rd___|0_0_0_0|1|S|H|1|__Rm___| TransReg10
|_Cond__|0_0_0|P|U|1|W|L|__Rn___|__Rd___|OffsetH|1|S|H|1|OffsetL| TransImm10
|_Cond__|0_1_0|P|U|B|W|L|__Rn___|__Rd___|_________Offset________| TransImm9
|_Cond__|0_1_1|P|U|B|W|L|__Rn___|__Rd___|__Shift__|Typ|0|__Rm___| TransReg9
|_Cond__|0_1_1|________________xxx____________________|1|__xxx__| Undefined
|_Cond__|1_0_0|P|U|S|W|L|__Rn___|__________Register_List________| BlockTrans
|_Cond__|1_0_1|L|___________________Offset______________________| B,BL,BLX
|_Cond__|1_1_0|P|U|N|W|L|__Rn___|__CRd__|__CP#__|____Offset_____| CoDataTrans
|_Cond__|1_1_0_0_0_1_0|L|__Rn___|__Rd___|__CP#__|_CPopc_|__CRm__| CoRR ARM9
|_Cond__|1_1_1_0|_CPopc_|__CRn__|__CRd__|__CP#__|_CP__|0|__CRm__| CoDataOp
|_Cond__|1_1_1_0|CPopc|L|__CRn__|__Rd___|__CP#__|_CP__|1|__CRm__| CoRegTrans
|_Cond__|1_1_1_1|_____________Ignored_by_Processor______________| SWI

 ARM Condition Field < ^

In ARM mode, all instructions can be conditionally executed depending on the state of the CPSR flags (C,N,Z,V). The respective suffixes {cond} must be appended to the mnemonics. For example: BEQ = Branch if Equal, MOVMI = Move if Signed.
  Code Suffix Flags         Meaning
  0:   EQ     Z=1           equal (zero) (same)
  1:   NE     Z=0           not equal (nonzero) (not same)
  2:   CS/HS  C=1           unsigned higher or same (carry set)
  3:   CC/LO  C=0           unsigned lower (carry cleared)
  4:   MI     N=1           negative (minus)
  5:   PL     N=0           positive or zero (plus)
  6:   VS     V=1           overflow (V set)
  7:   VC     V=0           no overflow (V cleared)
  8:   HI     C=1 and Z=0   unsigned higher
  9:   LS     C=0 or Z=1    unsigned lower or same
  A:   GE     N=V           greater or equal
  B:   LT     N<>V          less than
  C:   GT     Z=0 and N=V   greater than
  D:   LE     Z=1 or N<>V   less or equal
  E:   AL     -             always
  F:   NV     -             never (ARMv1,v2 only) (Reserved ARMv3 and up)
To define a non-conditional instruction which is always to be executed (regardless of any flags), the AL suffix may be used - that is the same as if no suffix is specified. For example, MOVAL would be usually abbreviated to MOV.

ARMv5 and up includes a few additional opcodes without condition field and which cannot be made conditional, these opcodes are: BKPT, PLD, CDP2, LDC2, MCR2, MRC2, STC2, and BLX_imm (however BLX_reg can be conditional).

Execution Time: If condition=false: 1S cycle.
Otherwise as specified for the respective opcode.
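
The condition table above translates directly into a flag test. The following Python sketch is one possible way to write it; the function name and the use of separate N,Z,C,V arguments are choices made for this example.

```python
def condition_passed(cond, n, z, c, v):
    """Evaluate a 4bit ARM condition code against the N,Z,C,V flags."""
    table = {
        0x0: z,                    # EQ
        0x1: not z,                # NE
        0x2: c,                    # CS/HS
        0x3: not c,                # CC/LO
        0x4: n,                    # MI
        0x5: not n,                # PL
        0x6: v,                    # VS
        0x7: not v,                # VC
        0x8: c and not z,          # HI
        0x9: (not c) or z,         # LS
        0xA: n == v,               # GE
        0xB: n != v,               # LT
        0xC: (not z) and n == v,   # GT
        0xD: z or n != v,          # LE
        0xE: True,                 # AL
    }
    if cond not in table:
        raise ValueError("condition F (NV) is reserved on ARMv3 and up")
    return bool(table[cond])
```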


 ARM.3: Branch and Exchange (BX, BLX) < ^

Opcode Format
  Bit    Expl.
31-28 Condition
27-8 Must be "0001.0010.1111.1111.1111" for this instruction
7-4 Opcode
0001b: BX{cond} Rn ;PC=Rn, T=Rn.0 (ARMv4T and ARMv5 and up)
0011b: BLX{cond} Rn ;PC=Rn, T=Rn.0, LR=PC+4 (ARMv5 and up)
3-0 Rn - Operand Register (R0-R14)
Switching to THUMB Mode: Set Bit 0 of the value in Rn to 1, program continues then at Rn-1 in THUMB mode.
Results in undefined behaviour if using R15 (PC+8 itself) as operand.
Execution Time: 2S + 1N
Return: No flags affected.


 ARM.4: Branch and Branch with Link (B, BL, BLX) < ^

Opcode Format
Branch (B) simply jumps to the given label. Branch with Link (BL) is meant to be used to call a subroutine; the return address is then saved in R14.
  Bit    Expl.
31-28 Condition (must be 1111b for BLX)
27-25 Must be "101" for this instruction
24 Opcode (0-1) (or Halfword Offset for BLX)
0: B{cond} label ;branch PC=PC+8+nn*4
1: BL{cond} label ;branch/link PC=PC+8+nn*4, LR=PC+4
H: BLX label ;ARM9 ;branch/link/thumb PC=PC+8+nn*4+H*2, LR=PC+4, T=1
23-0 nn - Signed Offset, step 4 (-32M..+32M in steps of 4)
Branch with Link can be used to 'call' to a sub-routine, which may then 'return' by MOV PC,R14 for example.
Execution Time: 2S + 1N
Return: No flags affected.
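
As an illustration, the branch destination can be computed from the opcode like this. This is a Python sketch with a made-up function name; 'addr' is the address of the opcode itself, and the 24bit nn field is treated as signed.

```python
def arm_branch_target(opcode, addr):
    """Return the destination of an ARM B, BL, or BLX label opcode."""
    if ((opcode >> 25) & 7) != 0b101:
        raise ValueError("not an ARM.4 opcode")
    nn = opcode & 0xFFFFFF
    if nn & 0x800000:                       # sign-extend the 24bit offset
        nn -= 0x1000000
    target = addr + 8 + nn * 4
    if (opcode >> 28) == 0xF:               # BLX label: bit 24 is H
        target += ((opcode >> 24) & 1) * 2  # optional halfword step
    return target & 0xFFFFFFFF
```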


 ARM.5: Data Processing < ^

Opcode Format
  Bit    Expl.
31-28 Condition
27-26 Must be 00b for this instruction
25 I - Immediate 2nd Operand Flag (0=Register, 1=Immediate)
24-21 Opcode (0-Fh) ;*=Arithmetic, otherwise Logical
0: AND{cond}{S} Rd,Rn,Op2 ;AND logical Rd = Rn AND Op2
1: EOR{cond}{S} Rd,Rn,Op2 ;XOR logical Rd = Rn XOR Op2
2: SUB{cond}{S} Rd,Rn,Op2 ;* ;subtract Rd = Rn-Op2
3: RSB{cond}{S} Rd,Rn,Op2 ;* ;subtract reversed Rd = Op2-Rn
4: ADD{cond}{S} Rd,Rn,Op2 ;* ;add Rd = Rn+Op2
5: ADC{cond}{S} Rd,Rn,Op2 ;* ;add with carry Rd = Rn+Op2+Cy
6: SBC{cond}{S} Rd,Rn,Op2 ;* ;sub with carry Rd = Rn-Op2+Cy-1
7: RSC{cond}{S} Rd,Rn,Op2 ;* ;sub cy. reversed Rd = Op2-Rn+Cy-1
8: TST{cond}{P} Rn,Op2 ;test Void = Rn AND Op2
9: TEQ{cond}{P} Rn,Op2 ;test exclusive Void = Rn XOR Op2
A: CMP{cond}{P} Rn,Op2 ;* ;compare Void = Rn-Op2
B: CMN{cond}{P} Rn,Op2 ;* ;compare neg. Void = Rn+Op2
C: ORR{cond}{S} Rd,Rn,Op2 ;OR logical Rd = Rn OR Op2
D: MOV{cond}{S} Rd,Op2 ;move Rd = Op2
E: BIC{cond}{S} Rd,Rn,Op2 ;bit clear Rd = Rn AND NOT Op2
F: MVN{cond}{S} Rd,Op2 ;not Rd = NOT Op2
20 S - Set Condition Codes (0=No, 1=Yes) (Must be 1 for opcode 8-B)
19-16 Rn - 1st Operand Register (R0..R15) (including PC=R15)
Must be 0000b for MOV/MVN.
15-12 Rd - Destination Register (R0..R15) (including PC=R15)
Must be 0000b (or 1111b) for CMP/CMN/TST/TEQ{P}.
When above Bit 25 I=0 (Register as 2nd Operand)
When below Bit 4 R=0 - Shift by Immediate
11-7 Is - Shift amount (1-31, 0=Special/See below)
When below Bit 4 R=1 - Shift by Register
11-8 Rs - Shift register (R0-R14) - only lower 8bit 0-255 used
7 Reserved, must be zero (otherwise multiply or undefined opcode)
6-5 Shift Type (0=LSL, 1=LSR, 2=ASR, 3=ROR)
4 R - Shift by Register Flag (0=Immediate, 1=Register)
3-0 Rm - 2nd Operand Register (R0..R15) (including PC=R15)
When above Bit 25 I=1 (Immediate as 2nd Operand)
11-8 Is - ROR-Shift applied to nn (0-30, in steps of 2)
7-0 nn - 2nd Operand Unsigned 8bit Immediate
Second Operand (Op2)
This may be a shifted register, or a shifted immediate. See Bit 25 and 11-0.
Unshifted Register: Specify Op2 as "Rm", assembler converts to "Rm,LSL#0".
Shifted Register: Specify as "Rm,SSS#Is" or "Rm,SSS Rs" (SSS=LSL/LSR/ASR/ROR).
Immediate: Specify as 32bit value, for example: "#000NN000h", assembler should automatically convert into "#0NNh,ROR#0ssh" as far as possible (ie. as far as a section of not more than 8bits of the immediate is non-zero).
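
The immediate conversion an assembler has to perform can be sketched as a search over the 16 possible rotations. The Python below is only an illustration; the function names are invented, and returning the first (smallest) rotation that fits is an arbitrary choice of this sketch.

```python
def ror32(value, amount):
    """Rotate a 32bit value right by 'amount' bits."""
    amount %= 32
    return ((value >> amount) | (value << (32 - amount))) & 0xFFFFFFFF

def encode_arm_immediate(value):
    """Return (Is, nn) such that nn ROR (Is*2) == value, or None if impossible."""
    value &= 0xFFFFFFFF
    for rot in range(16):
        imm8 = ror32(value, 32 - 2 * rot)   # rotate left to undo the ROR
        if imm8 < 0x100:
            return rot, imm8
    return None

def decode_arm_immediate(rot, imm8):
    """Inverse of the above: expand the 4bit rotate field and 8bit immediate."""
    return ror32(imm8, 2 * rot)
```

For example, 0x101 cannot be encoded because its two set bits are more than 8 bits apart in every even rotation.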

Zero Shift Amount (Shift Register by Immediate, with Immediate=0)
LSL#0: No shift performed, ie. directly Op2=Rm, the C flag is NOT affected.
LSR#0: Interpreted as LSR#32, ie. Op2 becomes zero, C becomes Bit 31 of Rm.
ASR#0: Interpreted as ASR#32, ie. Op2 and C are filled by Bit 31 of Rm.
ROR#0: Interpreted as RRX#1 (RCR), like ROR#1, but Op2 Bit 31 set to old C.
In source code, LSR#32, ASR#32, and RRX#1 should be specified as such - attempts to specify LSR#0, ASR#0, or ROR#0 will be internally converted to LSL#0 by the assembler.
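
The special cases for a zero shift amount can be summarised in a small model of the shift-by-immediate path. This Python sketch uses an invented function name and returns the pair (Op2, carry out); it does not cover shift-by-register.

```python
def shift_by_immediate(rm, shift_type, amount, carry_in):
    """Model of Op2 for 'Rm,SSS#Is' with the encoded amount 0..31."""
    rm &= 0xFFFFFFFF
    if shift_type == 0:                       # LSL
        if amount == 0:
            return rm, carry_in               # LSL#0: no shift, C unchanged
        return (rm << amount) & 0xFFFFFFFF, (rm >> (32 - amount)) & 1
    if shift_type == 1:                       # LSR
        if amount == 0:
            return 0, rm >> 31                # encoded 0 means LSR#32
        return rm >> amount, (rm >> (amount - 1)) & 1
    if shift_type == 2:                       # ASR
        if amount == 0:
            amount = 32                       # encoded 0 means ASR#32
        signed = rm - (1 << 32) if rm & 0x80000000 else rm
        return (signed >> amount) & 0xFFFFFFFF, (signed >> (amount - 1)) & 1
    if amount == 0:                           # ROR#0 means RRX#1
        return (carry_in << 31) | (rm >> 1), rm & 1
    rotated = ((rm >> amount) | (rm << (32 - amount))) & 0xFFFFFFFF
    return rotated, (rm >> (amount - 1)) & 1
```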

Using R15 (PC)
When using R15 as Destination (Rd), note below CPSR description and Execution time description.
When using R15 as operand (Rm or Rn), the returned value depends on the instruction: PC+12 if I=0,R=1 (shift by register), otherwise PC+8 (shift by immediate).

Returned CPSR Flags
If S=1, Rd<>R15, logical operations (AND,EOR,TST,TEQ,ORR,MOV,BIC,MVN):
  V=not affected
  C=carryflag of shift operation (not affected if LSL#0 or Rs=00h)
  Z=zeroflag of result
  N=signflag of result (result bit 31)
If S=1, Rd<>R15, arithmetic operations (SUB,RSB,ADD,ADC,SBC,RSC,CMP,CMN):
  V=overflowflag of result
  C=carryflag of result
  Z=zeroflag of result
  N=signflag of result (result bit 31)
If S=1, with unused Rd bits=1111b, {P} opcodes (CMPP/CMNP/TSTP/TEQP):
  R15=result  ;modify PSR bits in R15, ARMv2 and below only.
  In user mode only N,Z,C,V bits of R15 can be changed.
  In other modes additionally I,F,M1,M0 can be changed.
  The PC bits in R15 are left unchanged in all modes.
If S=1, Rd=R15; should not be used in user mode:
  CPSR = SPSR_<current mode>
  PC = result
For example: MOVS PC,R14 ;return from SWI (PC=R14_svc, CPSR=SPSR_svc).
If S=0: Flags are not affected (not allowed for CMP,CMN,TEQ,TST).

The instruction "MOV R0,R0" is used as "NOP" opcode in 32bit ARM state.
Execution Time: (1+p)S+rI+pN. Whereas r=1 if I=0 and R=1 (ie. shift by register); otherwise r=0. And p=1 if Rd=R15; otherwise p=0.


 ARM.6: PSR Transfer (MRS, MSR) < ^

Opcode Format
These instructions occupy an unused area (TEQ,TST,CMP,CMN with S=0) of Data Processing opcodes (ARM.5).
  Bit    Expl.
31-28 Condition
27-26 Must be 00b for this instruction
25 I - Immediate Operand Flag (0=Register, 1=Immediate) (Zero for MRS)
24-23 Must be 10b for this instruction
22 Psr - Source/Destination PSR (0=CPSR, 1=SPSR_<current mode>)
21 Opcode
0: MRS{cond} Rd,Psr ;Rd = Psr
1: MSR{cond} Psr{_field},Op ;Psr[field] = Op
20 Must be 0b for this instruction (otherwise TST,TEQ,CMP,CMN)
For MRS:
19-16 Must be 1111b for this instruction (otherwise SWP)
15-12 Rd - Destination Register (R0-R14)
11-0 Not used, must be zero.
For MSR:
19 f write to flags field Bit 31-24 (aka _flg)
18 s write to status field Bit 23-16 (reserved, don't change)
17 x write to extension field Bit 15-8 (reserved, don't change)
16 c write to control field Bit 7-0 (aka _ctl)
15-12 Not used, must be 1111b.
For MSR Psr,Rm (I=0)
11-4 Not used, must be zero. (otherwise BX)
3-0 Rm - Source Register <op> (R0-R14)
For MSR Psr,Imm (I=1)
11-8 Shift applied to Imm (ROR in steps of two 0-30)
7-0 Imm - Unsigned 8bit Immediate
In source code, a 32bit immediate should be specified as operand.
The assembler should then convert that into a shifted 8bit value.
MSR/MRS and CPSR/SPSR supported by ARMv3 and up.
ARMv2 and below contained PSR flags in R15, accessed by CMP/CMN/TST/TEQ{P}.
The field mask bits specify which bits of the destination Psr are write-able (or write-protected), one or more of these bits should be set, for example, CPSR_fsxc (aka CPSR aka CPSR_all) unlocks all bits (see below user mode restriction though).
Restrictions:
In non-privileged mode (user mode): only condition code bits of CPSR can be changed, control bits can't.
Only the SPSR of the current mode can be accessed; In User and System modes no SPSR exists.
The T-bit may not be changed; for THUMB/ARM switching use BX instruction.
Unused Bits in CPSR are reserved for future use and should never be changed (except for unused bits in the flags field).
Execution Time: 1S.
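
The effect of the field mask bits can be pictured as a byte-wise write mask. The Python sketch below is illustrative only: the function name is hypothetical, 'field_bits' holds opcode bits 19-16 (f,s,x,c), and the user mode and T-bit restrictions listed above are deliberately not modelled.

```python
def msr_write(psr, operand, field_bits):
    """Return the new PSR value after MSR Psr_<fields>,operand."""
    mask = 0
    if field_bits & 8:
        mask |= 0xFF000000    # f: flags field, bits 31-24
    if field_bits & 4:
        mask |= 0x00FF0000    # s: status field, bits 23-16
    if field_bits & 2:
        mask |= 0x0000FF00    # x: extension field, bits 15-8
    if field_bits & 1:
        mask |= 0x000000FF    # c: control field, bits 7-0
    return ((psr & ~mask) | (operand & mask)) & 0xFFFFFFFF
```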

Note: The A22i assembler recognizes MOV as alias for both MSR and MRS because it is practically not possible to remember whether MSR or MRS was the load or store opcode, and/or whether it does load to or from the Psr register.


 ARM.7: Multiply and Multiply-Accumulate (MUL,MLA) < ^

Opcode Format
  Bit    Expl.
31-28 Condition
27-25 Must be 000b for this instruction
24-21 Opcode
0000b: MUL{cond}{S} Rd,Rm,Rs ;multiply Rd = Rm*Rs
0001b: MLA{cond}{S} Rd,Rm,Rs,Rn ;mul.& accumulate Rd = Rm*Rs+Rn
0100b: UMULL{cond}{S} RdLo,RdHi,Rm,Rs ;multiply RdHiLo=Rm*Rs
0101b: UMLAL{cond}{S} RdLo,RdHi,Rm,Rs ;mul.& acc. RdHiLo=Rm*Rs+RdHiLo
0110b: SMULL{cond}{S} RdLo,RdHi,Rm,Rs ;sign.mul. RdHiLo=Rm*Rs
0111b: SMLAL{cond}{S} RdLo,RdHi,Rm,Rs ;sign.m&a. RdHiLo=Rm*Rs+RdHiLo
1000b: SMLAxy{cond} Rd,Rm,Rs,Rn ;Rd=HalfRm*HalfRs+Rn
1001b: SMLAWy{cond} Rd,Rm,Rs,Rn ;Rd=(Rm*HalfRs)/10000h+Rn
1001b: SMULWy{cond} Rd,Rm,Rs ;Rd=(Rm*HalfRs)/10000h
1010b: SMLALxy{cond} RdLo,RdHi,Rm,Rs ;RdHiLo=RdHiLo+HalfRm*HalfRs
1011b: SMULxy{cond} Rd,Rm,Rs ;Rd=HalfRm*HalfRs
20 S - Set Condition Codes (0=No, 1=Yes) (Must be 0 for Halfword mul)
19-16 Rd (or RdHi) - Destination Register (R0-R14)
15-12 Rn (or RdLo) - Accumulate Register (R0-R14) (Set to 0000b if unused)
11-8 Rs - Operand Register (R0-R14)
For Non-Halfword Multiplies
7-4 Must be 1001b for these instructions
For Halfword Multiplies
7 Must be 1 for these instructions
6 y - Rs Top/Bottom flag (0=B=Lower 16bit, 1=T=Upper 16bit)
5 x - Rm Top/Bottom flag (as above), or 0 for SMLAW, or 1 for SMULW
4 Must be 0 for these instructions
3-0 Rm - Operand Register (R0-R14)
Multiply and Multiply-Accumulate (MUL,MLA)
Restrictions: Rd may not be same as Rm. Rd,Rn,Rs,Rm may not be R15.
Note: Only the lower 32bit of the internal 64bit result are stored in Rd, thus no sign/zero extension is required and MUL and MLA can be used for both signed and unsigned calculations!
Execution Time: 1S+mI for MUL, and 1S+(m+1)I for MLA. Whereas 'm' depends on whether/how many most significant bits of Rs are all zero or all one. That is m=1 for Bit 31-8, m=2 for Bit 31-16, m=3 for Bit 31-24, and m=4 otherwise.
Flags (if S=1): Z=zeroflag, N=signflag, C=destroyed (ARMv4 and below) or C=not affected (ARMv5 and up), V=not affected. MUL/MLA supported by ARMv2 and up.

Multiply Long and Multiply-Accumulate Long (MULL, MLAL)
Optionally supported, INCLUDED in ARMv3M, EXCLUDED in ARMv4xM/ARMv5xM.
Restrictions: RdHi,RdLo,Rm must be different registers. R15 may not be used.
Execution Time: 1S+(m+1)I for MULL, and 1S+(m+2)I for MLAL. Whereas 'm' depends on whether/how many most significant bits of Rs are "all zero" (UMULL/UMLAL) or "all zero or all one" (SMULL,SMLAL). That is m=1 for Bit 31-8, m=2 for Bit 31-16, m=3 for Bit 31-24, and m=4 otherwise.
Flags (if S=1): Z=zeroflag, N=signflag, C=destroyed (ARMv4 and below) or C=not affected (ARMv5 and up), V=destroyed??? (ARMv4 and below???) or V=not affected (ARMv5 and up).
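
The value of 'm' in the timing formulas above depends only on Rs. A Python sketch of that rule follows; the function name and the 'signed' parameter are inventions of this example. Pass signed=True for MUL, MLA, SMULL, and SMLAL (leading zeros or leading ones count), and signed=False for UMULL and UMLAL (only leading zeros count).

```python
def multiply_m_cycles(rs, signed=True):
    """Return m (1..4) for the multiply timing formulas."""
    rs &= 0xFFFFFFFF
    for m, shift in ((1, 8), (2, 16), (3, 24)):
        top = rs >> shift                     # bits 31..shift of Rs
        all_ones = (1 << (32 - shift)) - 1
        if top == 0 or (signed and top == all_ones):
            return m
    return 4
```

With this, MLA would take 1S+(m+1)I and UMLAL 1S+(m+2)I, as stated above.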

Signed Halfword Multiply (SMLAxy,SMLAWy,SMLALxy,SMULxy,SMULWy)
Supported by E variants of ARMv5 and up, ie. ARMv5TE(xP).
Q-flag gets set on 32bit SMLAxy/SMLAWy addition overflows, however, the result is NOT truncated (as it'd be done with QADD opcodes).
Q-flag is NOT affected on (rare) 64bit SMLALxy addition overflows.
SMULxy/SMULWy cannot overflow, and thus leave Q-flag unchanged as well.
NZCV-flags are not affected by Halfword multiplies.
Execution Time: 1S+Interlock (SMULxy,SMLAxy,SMULWy,SMLAWy)
Execution Time: 1S+1I+Interlock (SMLALxy)


 ARM.9: Single Data Transfer (LDR, STR, PLD) < ^

Opcode Format
  Bit    Expl.
31-28 Condition (Must be 1111b for PLD)
27-26 Must be 01b for this instruction
25 I - Immediate Offset Flag (0=Immediate, 1=Shifted Register)
24 P - Pre/Post (0=post; add offset after transfer, 1=pre; before trans.)
23 U - Up/Down Bit (0=down; subtract offset from base, 1=up; add to base)
22 B - Byte/Word bit (0=transfer word quantity, 1=transfer byte quantity)
When above Bit 24 P=0 (Post-indexing, write-back is ALWAYS enabled):
21 T - Memory Management (0=Normal, 1=Force non-privileged access)
When above Bit 24 P=1 (Pre-indexing, write-back is optional):
21 W - Write-back bit (0=no write-back, 1=write address into base)
20 L - Load/Store bit (0=Store to memory, 1=Load from memory)
0: STR{cond}{B}{T} Rd,<Address> ;[Rn+/-<offset>]=Rd
1: LDR{cond}{B}{T} Rd,<Address> ;Rd=[Rn+/-<offset>]
(1: PLD <Address> ;Prepare Cache for Load, see notes below)
Whereas, B=Byte, T=Force User Mode (only for POST-Indexing)
19-16 Rn - Base register (R0..R15) (including R15=PC+8)
15-12 Rd - Source/Destination Register (R0..R15) (including R15=PC+12)
When above I=0 (Immediate as Offset)
11-0 Unsigned 12bit Immediate Offset (0-4095, steps of 1)
When above I=1 (Register shifted by Immediate as Offset)
11-7 Is - Shift amount (1-31, 0=Special/See below)
6-5 Shift Type (0=LSL, 1=LSR, 2=ASR, 3=ROR)
4 Must be 0 (Reserved, see ARM.17, The Undefined Instruction)
3-0 Rm - Offset Register (R0..R14) (not including PC=R15)
Instruction Formats for <Address>
An expression which generates an address:
  <expression>                  ;an immediate used as address
;*** restriction: must be located in range PC+/-4095+8, if so,
;*** assembler will calculate offset and use PC (R15) as base.
Pre-indexed addressing specification:
  [Rn]                          ;offset = zero
[Rn, <#{+/-}expression>]{!} ;offset = immediate
[Rn, {+/-}Rm{,<shift>} ]{!} ;offset = register shifted by immediate
Post-indexed addressing specification:
  [Rn], <#{+/-}expression>      ;offset = immediate
[Rn], {+/-}Rm{,<shift>} ;offset = register shifted by immediate
Whereas...
  <shift>  immediate shift such like LSL#4, ROR#2, etc. (see ARM.5).
{!} exclamation mark ("!") indicates write-back (Rn will be updated).
Notes
Shift amount 0 has special meaning, as described in ARM.5 Data Processing.
When writing a word (32bit) to memory, the address should be word-aligned.
When reading a byte from memory, upper 24 bits of Rd are zero-extended.
LDR PC,<op> on ARMv4 leaves CPSR.T unchanged.
LDR PC,<op> on ARMv5 sets CPSR.T to <op> Bit0, (1=Switch to Thumb).

When reading a word from a halfword-aligned address (which is located in the middle between two word-aligned addresses), the lower 16bit of Rd will contain [address] ie. the addressed halfword, and the upper 16bit of Rd will contain [address-2] ie. more or less unwanted garbage. However, by isolating the lower bits this may be used to read a halfword from memory. (Above applies to little endian mode, as used in GBA.)
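The halfword-aligned case can be modelled as a rotated read of the enclosing aligned word. The sketch below assumes little endian memory held in a Python bytes object; the helper names are hypothetical, and extending the same rotation to odd addresses is an assumption that the text above does not cover.

```python
def ror32(value, amount):
    """Rotate a 32bit value right by amount bits."""
    amount &= 31
    return ((value >> amount) | (value << (32 - amount))) & 0xFFFFFFFF

def ldr_word(memory, address):
    """Model of a word read: fetch the aligned word, rotate by the misalignment."""
    aligned = address & ~3
    word = int.from_bytes(memory[aligned:aligned + 4], "little")
    return ror32(word, 8 * (address & 3))

def read_halfword_via_ldr(memory, address):
    """Isolate the lower 16 bits, ie. the addressed halfword."""
    return ldr_word(memory, address) & 0xFFFF
```

With the bytes 11h,22h,33h,44h at address 0, reading a word from address 2 returns 22114433h in this model: the lower half is the halfword at address 2, the upper half is the halfword at address 0.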

In a virtual memory based environment (ie. not in the GBA), aborts (ie. page faults) may take place during execution. If so, Rm and Rn should not specify the same register when post-indexing is used, as the abort-handler might have problems reconstructing the original value of the register.

Return: CPSR flags are not affected.
Execution Time: For normal LDR: 1S+1N+1I. For LDR PC: 2S+2N+1I. For STR: 2N.

PLD <Address> ;Prepare Cache for Load
PLD must use the following settings: cond=1111b, P=1, B=1, W=0, L=1, Rd=1111b. The address may not use post-indexing, and may not use writeback; the opcode is encoded identically to LDRNVB R15,<Address>.
PLD signals to the memory system that a specific memory address will be accessed soon. The memory system may use this hint to prepare caching/pipelining; aside from that, PLD does not have any effect on the program logic, and behaves identically to NOP.
PLD supported by ARMv5TE only, not ARMv5, not ARMv5TExP.


 ARM.10: Halfword, Doubleword, and Signed Data Transfer < ^

Opcode Format
  Bit    Expl.
31-28 Condition
27-25 Must be 000b for this instruction
24 P - Pre/Post (0=post; add offset after transfer, 1=pre; before trans.)
23 U - Up/Down Bit (0=down; subtract offset from base, 1=up; add to base)
22 I - Immediate Offset Flag (0=Register Offset, 1=Immediate Offset)
When above Bit 24 P=0 (Post-indexing, write-back is ALWAYS enabled):
21 Not used, must be zero (0)
When above Bit 24 P=1 (Pre-indexing, write-back is optional):
21 W - Write-back bit (0=no write-back, 1=write address into base)
20 L - Load/Store bit (0=Store to memory, 1=Load from memory)
19-16 Rn - Base register (R0-R15) (Including R15=PC+8)
15-12 Rd - Source/Destination Register (R0-R15) (Including R15=PC+12)
11-8 When above Bit 22 I=0 (Register as Offset):
Not used. Must be 0000b
When above Bit 22 I=1 (immediate as Offset):
Immediate Offset (upper 4bits)
7 Reserved, must be set (1)
6-5 Opcode (0-3)
When Bit 20 L=0 (Store) (and Doubleword Load/Store):
0: Reserved for SWP instruction (see ARM.12 Single Data Swap)
1: STR{cond}H Rd,<Address> ;Store halfword [a]=Rd
2: LDR{cond}D Rd,<Address> ;Load Doubleword R(d)=[a], R(d+1)=[a+4]
3: STR{cond}D Rd,<Address> ;Store Doubleword [a]=R(d), [a+4]=R(d+1)
When Bit 20 L=1 (Load):
0: Reserved.
1: LDR{cond}H Rd,<Address> ;Load Unsigned halfword (zero-extended)
2: LDR{cond}SB Rd,<Address> ;Load Signed byte (sign extended)
3: LDR{cond}SH Rd,<Address> ;Load Signed halfword (sign extended)
4 Reserved, must be set (1)
3-0 When above Bit 22 I=0:
Rm - Offset Register (R0-R14) (not including R15)
When above Bit 22 I=1:
Immediate Offset (lower 4bits) (0-255, together with upper bits)
STRH,LDRH,LDRSB,LDRSH supported on ARMv4 and up.
STRD/LDRD supported on ARMv5TE only, not ARMv5, not ARMv5TExP.
STRD/LDRD: base writeback: Rn should not be same as R(d) or R(d+1).
STRD: index register: Rm should not be same as R(d) or R(d+1).
STRD/LDRD: Rd must be an even numbered register (R0,R2,R4,R6,R8,R10,R12).
STRD/LDRD: Address must be double-word aligned (multiple of eight).
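The split immediate (upper 4bits in Bit 11-8, lower 4bits in Bit 3-0) and the Opcode field in Bit 6-5 can be illustrated as follows. This is a sketch only; the function and table names are invented for this example, and the register offset form (I=0) is rejected rather than decoded.

```python
HALFWORD_MNEMONICS = {
    # (L bit, Bit 6-5) -> mnemonic, following the table above
    (0, 1): "STRH", (0, 2): "LDRD", (0, 3): "STRD",
    (1, 1): "LDRH", (1, 2): "LDRSB", (1, 3): "LDRSH",
}

def decode_halfword_transfer(opcode):
    """Return (mnemonic, signed immediate offset) for an ARM.10 opcode with I=1."""
    if ((opcode >> 25) & 7) != 0 or (opcode & 0x90) != 0x90:
        raise ValueError("not a halfword/doubleword/signed transfer")
    load = (opcode >> 20) & 1
    op = (opcode >> 5) & 3
    if (load, op) not in HALFWORD_MNEMONICS:
        raise ValueError("reserved opcode (SWP space or reserved load)")
    if ((opcode >> 22) & 1) == 0:
        raise ValueError("register offset form (I=0) carries no immediate")
    # Join the two 4bit halves into the 8bit offset (0-255).
    magnitude = (((opcode >> 8) & 0xF) << 4) | (opcode & 0xF)
    up = (opcode >> 23) & 1
    return HALFWORD_MNEMONICS[(load, op)], magnitude if up else -magnitude
```

For example, E1D101B2h decodes as LDRH with offset +12h, and the same opcode with U=0 (E15101B2h) gives offset -12h.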

Instruction Formats for <Address>
An expression which generates an address:
  <expression>                  ;an immediate used as address
;*** restriction: must be located in range PC+/-255+8, if so,
;*** assembler will calculate offset and use PC (R15) as base.
Pre-indexed addressing specification:
  [Rn]                          ;offset = zero
[Rn, <#{+/-}expression>]{!} ;offset = immediate
[Rn, {+/-}Rm]{!} ;offset = register
Post-indexed addressing specification:
  [Rn], <#{+/-}expression>      ;offset = immediate
[Rn], {+/-}Rm ;offset = register
Whereas...
  {!}      exclamation mark ("!") indicates write-back (Rn will be updated).
Return: No Flags affected.
Execution Time: For Normal LDR, 1S+1N+1I. For LDR PC, 2S+2N+1I. For STRH 2N.


 ARM.11: Block Data Transfer (LDM,STM) < ^

Opcode Format
  Bit    Expl.
31-28 Condition
27-25 Must be 100b for this instruction
24 P - Pre/Post (0=post; add offset after transfer, 1=pre; before trans.)
23 U - Up/Down Bit (0=down; subtract offset from base, 1=up; add to base)
22 S - PSR & force user bit (0=No, 1=load PSR or force user mode)
21 W - Write-back bit (0=no write-back, 1=write address into base)
20 L - Load/Store bit (0=Store to memory, 1=Load from memory)
0: STM{cond}{amod} Rn{!},<Rlist>{^} ;Store (Push)
1: LDM{cond}{amod} Rn{!},<Rlist>{^} ;Load (Pop)
Whereas, {!}=Write-Back (W), and {^}=PSR/User Mode (S)
19-16 Rn - Base register (R0-R14) (not including R15)
15-0 Rlist - Register List
(Above 'offset' is meant to be the number of words specified in Rlist.)
Addressing Modes {amod}
The IB,IA,DB,DA suffixes directly specify the desired U and P bits:
  IB  increment before          ;P=1, U=1
  IA  increment after           ;P=0, U=1
  DB  decrement before          ;P=1, U=0
  DA  decrement after           ;P=0, U=0
Alternately, FD,ED,FA,EA could be used, mostly to simplify mnemonics for stack transfers.
  ED  empty stack, descending   ;LDM: P=1, U=1  ;STM: P=0, U=0
  FD  full stack,  descending   ;     P=0, U=1  ;     P=1, U=0
  EA  empty stack, ascending    ;     P=1, U=0  ;     P=0, U=1
  FA  full stack,  ascending    ;     P=0, U=0  ;     P=1, U=1
Ie. the following expressions are aliases for each other:
  STMFD=STMDB=PUSH   STMED=STMDA   STMFA=STMIB   STMEA=STMIA
  LDMFD=LDMIA=POP    LDMED=LDMIB   LDMFA=LDMDA   LDMEA=LDMDB
Note: The equivalent THUMB functions use fixed organization:
  PUSH/POP: full descending     ;base register SP (R13)
LDM/STM: increment after ;base register R0..R7
Descending is the common stack organization, as used in 80x86 and Z80 CPUs: SP is decremented when pushing/storing data, and incremented when popping/loading data.

When S Bit is set (S=1)
If instruction is LDM and R15 is in the list: (Mode Changes)
  While R15 loaded, additionally: CPSR=SPSR_<current mode>
Otherwise: (User bank transfer)
  Rlist is referring to User Bank Registers, R0-R15 (rather than
register related to the current mode, such like R14_svc etc.)
Base write-back should not be used for User bank transfer.
! When instruction is LDM: !
! If the following instruction reads from a banked register, !
! like R14_svc, then CPU might still read R14 instead. If !
! necessary insert a dummy instruction such like MOV R0,R0. !
Notes
The lowest Register in Rlist (R0 if it is in the list) will be loaded/stored to/from the lowest memory address.
The base address should be usually word-aligned.
LDM Rn,...,PC on ARMv4 leaves CPSR.T unchanged.
LDM Rn,...,PC on ARMv5 sets CPSR.T to <op> Bit0, (1=Switch to Thumb).
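How the P and U bits map onto memory addresses can be sketched as below, using the rule that the lowest register always goes to the lowest address. The function name is hypothetical, the base is assumed to be word-aligned, and the invalid cases (empty Rlist, base register inside Rlist with write-back) are deliberately not modelled.

```python
def block_transfer_addresses(base, rlist, p, u):
    """Return ([(register, address), ...], written-back base) for LDM/STM.

    rlist is the 16bit register mask from Bit 15-0; p and u are the P and U bits.
    """
    regs = [r for r in range(16) if rlist & (1 << r)]
    n = len(regs)
    if u:
        start = base + (4 if p else 0)          # IB / IA
    else:
        start = base - 4 * n + (0 if p else 4)  # DB / DA
    new_base = base + 4 * n if u else base - 4 * n
    transfers = [(r, (start + 4 * i) & 0xFFFFFFFF) for i, r in enumerate(regs)]
    return transfers, new_base & 0xFFFFFFFF
```

With base=1000h and Rlist=R0,R1,R14, STMDB (push) uses addresses FF4h, FF8h, FFCh and writes back FF4h; a following LDMIA (pop) from FF4h reads the same three addresses and writes back 1000h.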

Return: No Flags affected.
Execution Time: For normal LDM, nS+1N+1I. For LDM PC, (n+1)S+2N+1I. For STM (n-1)S+2N. Where n is the number of words transferred.

Strange Effects on Invalid Rlist's
Empty Rlist: R15 loaded/stored (ARMv4 only), and Rb=Rb+/-40h (ARMv4-v5).
Writeback with Rb included in Rlist: Store OLD base if Rb is FIRST entry in Rlist, otherwise store NEW base (STM/ARMv4), always store OLD base (STM/ARMv5), no writeback (LDM/ARMv4), writeback if Rb is "the ONLY register, or NOT the LAST register" in Rlist (LDM/ARMv5).


 ARM.12: Single Data Swap (SWP) < ^

Opcode Format
  Bit    Expl.
31-28 Condition
27-23 Must be 00010b for this instruction
Opcode (fixed)
SWP{cond}{B} Rd,Rm,[Rn] ;Rd=[Rn], [Rn]=Rm
22 B - Byte/Word bit (0=swap word quantity, 1=swap byte quantity)
21-20 Must be 00b for this instruction
19-16 Rn - Base register (R0-R14)
15-12 Rd - Destination Register (R0-R14)
11-4 Must be 00001001b for this instruction
3-0 Rm - Source Register (R0-R14)
SWP/SWPB supported by ARMv2a and up.
Swap works properly including if Rm and Rn specify the same register.
R15 may not be used for either Rn,Rd,Rm. (Rn=R15 would be MRS opcode).
Upper bits of Rd are zero-expanded when using Byte quantity. For info about byte and word data memory addressing, read LDR and STR opcode description.
Execution Time: 1S+2N+1I. That is, 2N data cycles, 1S code cycle, plus 1I.


 ARM.13: Software Interrupt (SWI,BKPT) < ^

Opcode Format
  Bit    Expl.
31-28 Condition (must be 1110b for BKPT, ie. Condition=always)
27-24 Opcode
1111b: SWI{cond} nn ;software interrupt
0001b: BKPT nn ;breakpoint (ARMv5 and up)
For SWI:
23-0 nn - Comment Field, ignored by processor (24bit value)
For BKPT:
23-20 Must be 0010b for BKPT
19-8 nn - upper 12bits of comment field, ignored by processor
7-4 Must be 0111b for BKPT
3-0 nn - lower 4bits of comment field, ignored by processor
SWI is intended for calls to the operating system - enters Supervisor mode (SVC) in ARM state. BKPT is intended for debugging - enters Abort mode in ARM state via the Prefetch Abort vector.

Execution SWI/BKPT:
  R14_svc=PC+4     R14_abt=PC+4   ;save return address
  SPSR_svc=CPSR    SPSR_abt=CPSR  ;save CPSR flags
  CPSR=<changed>   CPSR=<changed> ;Enter svc/abt, ARM state, IRQs disabled
  PC=VVVV0008h     PC=VVVV000Ch   ;jump to SWI/PrefetchAbort vector address
Execution Time: 2S+1N

Interpreting the Comment Field:
The immediate parameter is ignored by the processor; the user interrupt handler may read out this number by examining the lower 24bit of the 32bit opcode at [R14_svc-4]. In case your program also executes SWIs from inside of THUMB mode: Your SWI handler must then examine the T Bit in SPSR_svc in order to determine whether it's been a THUMB SWI - if so, examine the lower 8bit of the 16bit opcode at [R14_svc-2].
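The lookup of the comment field can be modelled as below. This is a sketch rather than handler code: read16 and read32 are hypothetical memory-read callbacks, and the T Bit is assumed to be Bit 5 of the PSR (as listed in the CPU Flags chapter).

```python
T_BIT = 1 << 5  # assumption: Thumb state flag is PSR Bit 5

def swi_comment(read16, read32, r14_svc, spsr_svc):
    """Return the SWI comment field as seen by a SWI handler."""
    if spsr_svc & T_BIT:
        # THUMB SWI: lower 8bit of the 16bit opcode at [R14_svc-2]
        return read16(r14_svc - 2) & 0xFF
    # ARM SWI: lower 24bit of the 32bit opcode at [R14_svc-4]
    return read32(r14_svc - 4) & 0xFFFFFF
```

A real handler would do the same thing in assembly, using SPSR_svc and R14_svc directly.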

For Returning from SWI use this instruction:
  MOVS PC,R14
That instruction restores both PC and CPSR, ie. PC=R14_svc, and CPSR=SPSR_svc.

Nesting SWIs:
SPSR_svc and R14_svc should be saved on stack before either invoking nested SWIs, or (if the IRQ handler uses SWIs) before enabling IRQs.


 ARM.14: Coprocessor Data Operations (CDP) < ^

Opcode Format
  Bit    Expl.
31-28 Condition (or 1111b for CDP2 opcode on ARMv5 and up)
27-24 Must be 1110b for this instruction
ARM-Opcode (fixed)
CDP{cond} Pn,<cpopc>,Cd,Cn,Cm{,<cp>}
CDP2 Pn,<cpopc>,Cd,Cn,Cm{,<cp>}
23-20 CP Opc - Coprocessor operation code (0-15)
19-16 Cn - Coprocessor operand Register (C0-C15)
15-12 Cd - Coprocessor destination Register (C0-C15)
11-8 Pn - Coprocessor number (P0-P15)
7-5 CP - Coprocessor information (0-7)
4 Reserved, must be zero (otherwise MCR/MRC opcode)
3-0 Cm - Coprocessor operand Register (C0-C15)
CDP supported by ARMv2 and up, CDP2 by ARMv5 and up.
Execution time: 1S+bI, b=number of cycles in coprocessor busy-wait loop.
Return: No flags affected, no ARM-registers used/modified.
For details refer to original ARM docs, irrelevant in GBA because no coprocessor exists.


 ARM.15: Coprocessor Data Transfers (LDC,STC) < ^

Opcode Format
  Bit    Expl.
31-28 Condition (or 1111b for LDC2/STC2 opcodes on ARMv5 and up)
27-25 Must be 110b for this instruction
24 P - Pre/Post (0=post; add offset after transfer, 1=pre; before trans.)
23 U - Up/Down Bit (0=down; subtract offset from base, 1=up; add to base)
22 N - Transfer length (0-1, interpretation depends on co-processor)
21 W - Write-back bit (0=no write-back, 1=write address into base)
20 Opcode (0-1)
0: STC{cond}{L} Pn,Cd,<Address> ;Store to memory (from coprocessor)
0: STC2{L} Pn,Cd,<Address> ;Store to memory (from coprocessor)
1: LDC{cond}{L} Pn,Cd,<Address> ;Read from memory (to coprocessor)
1: LDC2{L} Pn,Cd,<Address> ;Read from memory (to coprocessor)
whereas {L} indicates long transfer (Bit 22: N=1)
19-16 Rn - ARM Base Register (R0-R15) (R15=PC+8)
15-12 Cd - Coprocessor src/dest Register (C0-C15)
11-8 Pn - Coprocessor number (P0-P15)
7-0 Offset - Unsigned Immediate, step 4 (0-1020, in steps of 4)
LDC/STC supported by ARMv2 and up, LDC2/STC2 by ARMv5 and up.
Execution time: (n-1)S+2N+bI, n=number of words transferred.
For details refer to original ARM docs, irrelevant in GBA because no coprocessor exists.


 ARM.16: Coprocessor Register Transfers (MRC, MCR) < ^

Opcode Format
  Bit    Expl.
31-28 Condition (or 1111b for MRC2/MCR2 opcodes on ARMv5 and up)
27-24 Must be 1110b for this instruction
23-21 CP Opc - Coprocessor operation code (0-7)
20 ARM-Opcode (0-1)
0: MCR{cond} Pn,<cpopc>,Rd,Cn,Cm{,<cp>} ;move from ARM to CoPro
0: MCR2 Pn,<cpopc>,Rd,Cn,Cm{,<cp>} ;move from ARM to CoPro
1: MRC{cond} Pn,<cpopc>,Rd,Cn,Cm{,<cp>} ;move from CoPro to ARM
1: MRC2 Pn,<cpopc>,Rd,Cn,Cm{,<cp>} ;move from CoPro to ARM
19-16 Cn - Coprocessor source/dest. Register (C0-C15)
15-12 Rd - ARM source/destination Register (R0-R15)
11-8 Pn - Coprocessor number (P0-P15)
7-5 CP - Coprocessor information (0-7)
4 Reserved, must be one (1) (otherwise CDP opcode)
3-0 Cm - Coprocessor operand Register (C0-C15)
MCR/MRC supported by ARMv2 and up, MCR2/MRC2 by ARMv5 and up.
A22i syntax allows MOV to be used with Rd specified as first (dest), or last (source) operand. Native MCR/MRC syntax uses Rd as middle operand; <cp> can be omitted if <cp> is zero.
When using MCR with R15: Coprocessor will receive a data value of PC+12.
When using MRC with R15: Bit 31-28 of data are copied to Bit 31-28 of CPSR (ie. N,Z,C,V flags), other data bits are ignored, CPSR Bit 27-0 are not affected, R15 (PC) is not affected.
Execution time: 1S+bI+1C for MCR, 1S+(b+1)I+1C for MRC.
Return: For MRC only: Either R0-R14 modified, or flags affected (see above).
For details refer to original ARM docs. The opcodes are irrelevant for GBA/NDS7 because no coprocessor exists (except for a dummy CP14 unit). However, NDS9 includes a working CP15 unit.
ARM CP14 ICEbreaker Debug Communications Channel
ARM CP15 System Control Coprocessor


 ARM.X: Coprocessor Double-Register Transfer (MCRR,MRRC) < ^

Opcode Format
  Bit    Expl.
31-28 Condition
27-21 Must be 1100010b for this instruction
20 L - Opcode (Load/Store)
0: MCRR{cond} Pn,opcode,Rd,Rn,Cm ;write Rd,Rn to coproc
1: MRRC{cond} Pn,opcode,Rd,Rn,Cm ;read Rd,Rn from coproc
19-16 Rn - Second source/dest register (R0-R14)
15-12 Rd - First source/dest register (R0-R14)
11-8 Pn - Coprocessor number (P0-P15)
7-4 CP Opc - Coprocessor operation code (0-15)
3-0 Cm - Coprocessor operand Register (C0-C15)
Supported by ARMv5TE only, not ARMv5, not ARMv5TExP.


 ARM.17: Undefined Instruction < ^

Opcode Format
  Bit    Expl.
31-28 Condition
27-25 Must be 011b for this instruction
24-5 Reserved for future use
4 Must be 1b for this instruction
3-0 Reserved for future use
No assembler mnemonic exists, following bitstreams are (not) reserved.
  cond011xxxxxxxxxxxxxxxxxxxx1xxxx - reserved for future use (except below).
cond01111111xxxxxxxxxxxx1111xxxx - free for user.
Execution time: 2S+1I+1N.


 ARM.X: Count Leading Zeros < ^

Opcode Format
  Bit    Expl.
31-28 Condition
27-16 Must be 0001.0110.1111b for this instruction
Opcode (fixed)
CLZ{cond} Rd,Rm ;Rd=Number of leading zeros in Rm
15-12 Rd - Destination Register (R0-R14)
11-4 Must be 1111.0001b for this instruction
3-0 Rm - Source Register (R0-R14)
CLZ supported by ARMv5 and up. Execution time: 1S.
Return: No Flags affected. Rd=0..32.
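The result of CLZ can be reproduced in a few lines. This is a sketch using Python integers; the function name is chosen for this example only.

```python
def clz32(value):
    """Number of leading zero bits in a 32bit value (0..32)."""
    return 32 - (value & 0xFFFFFFFF).bit_length()
```

An input of zero gives 32, and any input with Bit 31 set gives 0.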


 ARM.X: QADD/QSUB < ^

Opcode Format
  Bit    Expl.
31-28 Condition
27-24 Must be 0001b for this instruction
23-20 Opcode
0000b: QADD{cond} Rd,Rm,Rn ;Rd=Rm+Rn
0010b: QSUB{cond} Rd,Rm,Rn ;Rd=Rm-Rn
0100b: QDADD{cond} Rd,Rm,Rn ;Rd=Rm+Rn*2 (doubled)
0110b: QDSUB{cond} Rd,Rm,Rn ;Rd=Rm-Rn*2 (doubled)
19-16 Rn - Second Source Register (R0-R14)
15-12 Rd - Destination Register (R0-R14)
11-4 Must be 00000101b for this instruction
3-0 Rm - First Source Register (R0-R14)
Supported by E variants of ARMv5 and up, ie. ARMv5TE(xP).
Execution time: 1S+Interlock.
Results truncated to signed 32bit range in case of overflows, with the Q-flag being set (and being left unchanged otherwise). NZCV flags are not affected.
Note: Rn*2 is internally processed first, and may get truncated - even if the final result would fit into range.
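The saturation rules, including the early truncation of Rn*2, can be sketched as follows. Registers are modelled as signed Python integers and the Q-flag as a boolean that is passed in and returned; all names are hypothetical.

```python
INT32_MIN, INT32_MAX = -0x80000000, 0x7FFFFFFF

def saturate(value):
    """Clamp to signed 32bit range; second result tells whether clamping occurred."""
    clamped = max(INT32_MIN, min(INT32_MAX, value))
    return clamped, clamped != value

def qadd(rm, rn, q=False):
    result, overflow = saturate(rm + rn)
    return result, q or overflow      # Q is set on overflow, else left unchanged

def qsub(rm, rn, q=False):
    result, overflow = saturate(rm - rn)
    return result, q or overflow

def qdadd(rm, rn, q=False):
    doubled, overflow1 = saturate(rn * 2)   # Rn*2 is saturated first
    result, overflow2 = saturate(rm + doubled)
    return result, q or overflow1 or overflow2

def qdsub(rm, rn, q=False):
    doubled, overflow1 = saturate(rn * 2)
    result, overflow2 = saturate(rm - doubled)
    return result, q or overflow1 or overflow2
```

For example, qdadd(-1, 0x40000000) gives 7FFFFFFEh with Q set: the doubling saturates to 7FFFFFFFh, even though the exact result -1+80000000h would have fit into range.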


 ARM 26bit Memory Interface < ^

The 26bit Memory Interface was used by ARMv1 and ARMv2. The 32bit interface is used by ARMv3 and newer, however, 26bit backward compatibility was included in all ARMv3 (except ARMv3G), and optionally in some non-T variants of ARMv4.

Format of R15 in 26bit Mode (Program Counter Register)
  Bit   Name     Expl.
31-28 N,Z,C,V Flags (Sign, Zero, Carry, Overflow)
27-26 I,F Interrupt Disable bits (IRQ, FIQ) (1=Disable)
25-2 PC Program Counter, 24bit, Step 4 (64M range)
1-0 M1,M0 Mode (0=User, 1=FIQ, 2=IRQ, 3=Supervisor)
Branches with +/-32M range wrap the PC register, and can reach all 64M memory.
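The combined PC/PSR layout and the 64M wrapping can be illustrated with the sketch below. The function names and the dictionary keys are made up for this example.

```python
MODES_26BIT = ("User", "FIQ", "IRQ", "Supervisor")

def decode_r15_26bit(r15):
    """Split a 26bit-mode R15 value into flags, PC and mode."""
    return {
        "N": (r15 >> 31) & 1, "Z": (r15 >> 30) & 1,
        "C": (r15 >> 29) & 1, "V": (r15 >> 28) & 1,
        "I": (r15 >> 27) & 1, "F": (r15 >> 26) & 1,
        "PC": r15 & 0x03FFFFFC,          # Bit 25-2, step 4
        "mode": MODES_26BIT[r15 & 3],    # Bit 1-0
    }

def branch_target_26bit(pc, offset):
    """Branch destination, wrapped into the 64M address space."""
    return (pc + offset) & 0x03FFFFFC
```

For example, a branch by +20h from 03FFFFF0h wraps around to 00000010h.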

Reading from R15
If R15 is specified in bit16-19 of an opcode, then NZCVIF and M0,1 are masked (zero), otherwise the full 32bits are used.

Writing to R15
Data Processing opcodes with S=1, and LDM opcodes with PSR=1 can write to all 32bits in R15 (in 26bit mode, that is allowed even in user mode, though it does then affect only NZCF, not the write protected IFMM bits ???), other opcodes which write to R15 will modify only the program counter bits. Also, special CMP/CMN/TST/TEQ{P} opcodes can be used to write to the PSR bits in R15 without modifying the PC bits.

Exceptions
SWIs, Reset, Data/Prefetch Aborts and Undefined instructions enter Supervisor mode. Interrupts enter IRQ and FIQ mode. Additionally, a special 26bit Address Exception exists, which enters Supervisor mode on accesses to memory addresses>=64M as follows:
  R14_svc = PC ($+8, including old PSR bits)
M1,M0 = 11b = supervisor mode, F=same, I=1, PC=14h
to continue at the fault location, return by SUBS PC,LR,8.

26bit Backwards Compatibility on 32bit ARMv3 and up
CPSR M4=0 = 26bit mode (with USR,FIQ,IRQ,SVC modes in M1,M0)
32bit CPUs with 26bit compatibility mode can be configured to switch into 32bit mode when encountering exceptions.


 Pseudo Instructions and Directives < ^

ARM Pseudo Instructions
  nop              mov r0,r0
  ldr Rd,=Imm      ldr Rd,[r15,disp] ;use .pool as parameter field
  add Rd,=addr     add/sub Rd,r15,disp
  adr Rd,addr      add/sub Rd,r15,disp
  adrl Rd,addr     two add/sub opcodes with disp=xx00h+00yyh
  mov Rd,Imm       mvn Rd,NOT Imm    ;or vice-versa
  and Rd,Rn,Imm    bic Rd,Rn,NOT Imm ;or vice-versa
  cmp Rd,Rn,Imm    cmn Rd,Rn,-Imm    ;or vice-versa
  add Rd,Rn,Imm    sub Rd,Rn,-Imm    ;or vice-versa
All above opcodes may be made conditional by specifying a {cond} field.
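The mov/mvn substitution can be sketched as a search for an encodable immediate. The sketch assumes the ARM immediate format from ARM.5 (an 8bit value rotated right by an even amount); the function names are hypothetical and this is not A22i's actual selection logic.

```python
def rol32(value, amount):
    """Rotate a 32bit value left by amount bits."""
    amount &= 31
    return ((value << amount) | (value >> (32 - amount))) & 0xFFFFFFFF

def encode_arm_immediate(value):
    """Return (rotate field, imm8) if value is an 8bit value rotated right
    by an even amount, else None."""
    value &= 0xFFFFFFFF
    for rot in range(0, 32, 2):
        imm8 = rol32(value, rot)      # undo a rotate-right by rot
        if imm8 < 0x100:
            return rot // 2, imm8
    return None

def select_mov(imm):
    """Pick mov or mvn for an immediate; None means neither form fits."""
    imm &= 0xFFFFFFFF
    if encode_arm_immediate(imm) is not None:
        return "mov", imm
    inverted = ~imm & 0xFFFFFFFF
    if encode_arm_immediate(inverted) is not None:
        return "mvn", inverted
    return None                        # eg. use ldr Rd,=Imm instead
```

For example, FFFFFF00h cannot be encoded directly, but NOT FFFFFF00h = FFh can, so the pseudo instruction becomes mvn Rd,0FFh.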

THUMB Pseudo Instructions
  nop              mov r8,r8
  ldr Rd,=Imm      ldr Rd,[r15,disp] ;use .pool as parameter field
  add Rd,=addr     add Rd,r15,disp
  adr Rd,addr      add Rd,r15,disp
  mov Rd,Rs        add Rd,Rs,0       ;with Rd,Rs in range r0-r7 each
A22i Directives
  org  adr     assume following code from this address on
.gba indicate GBA program
.nds indicate NDS program
.fix fix GBA/NDS header checksum
.ereader_create_bmp create GBA e-Reader dotcode .BMP file(s) (bitmaps)
.ereader_create_raw create GBA e-Reader dotcode .RAW file (useless)
.ereader_create_bin create GBA e-Reader dotcode .BIN file (smallest)
.ereader_japan_plus japanese/plus (default is non-japanese)
.ereader_japan_original japanese/original (with Z80-stub for GBA-code)
.title 'Txt' defines a title (used for e-Reader dotcodes)
.norewrite do not delete existing output file (keep following data in file)
.data? following defines RAM data structure (assembled to nowhere)
.code following is normal ROM code/data (assembled to ROM image)
.include includes specified source code file (no nesting/error handling)
.import imports specified binary file (optional parameters: ,begin,len)
.radix nn changes default numeric format (nn=2,8,10,16 = bin/oct/dec/hex)
.errif expr generates an error message if expression is nonzero
.if expr assembles following code only if expression is nonzero
.else invert previous .if condition
.endif terminate .if/.ifdef/.ifndef
.ifdef sym assemble following only if symbol is defined
.ifndef sym assemble following only if symbol is not defined
.align nn aligns to an address divisible-by-nn, inserts 00's
.msg defines a no$gba debugmessage string, such like .msg 'Init Okay'
.brk defines a no$gba source code break opcode
l equ n l=n
l: [cmd] l=$ (global label)
@@l: [cmd] @@l=$ (local label, all locals are reset at next global label)
end end of source code
db ... define 8bit data (bytes)
dw ... define 16bit data (halfwords)
dd ... define 32bit data (words)
defs nn define nn bytes space (zero-filled)
;... defines a comment (ignored by the assembler)
// alias for CRLF, eg. allows <db 'Text',0 // dw addr> in one line
A22i Alias Directives (for compatibility with other assemblers)
  align        .align 4          code16    .thumb
align nn .align nn .code 16 .thumb
% nn defs nn code32 .arm
.space nn defs nn .code 32 .arm
..ds nn defs nn ltorg .pool
x=n x equ n .ltorg .pool
.equ x,n x equ n ..ltorg .pool
.define x n x equ n dcb db (8bit data)
incbin .import defb db (8bit data)
@@@... ;comment .byte db (8bit data)
@ ... ;comment .ascii db (8bit string)
@*... ;comment dcw dw (16bit data)
@... ;comment defw dw (16bit data)
.text .code .hword dw (16bit data)
.bss .data? dcd dd (32bit data)
.global (ignored) defd dd (32bit data)
.extern (ignored) .long dd (32bit data)
.thumb_func (ignored) .word dw/dd, don't use
#directive .directive .end end
.fill nn,1,0 defs nn
Alias Conditions, Opcodes, Operands
  hs   cs   ;condition higher or same = carry set
lo cc ;condition lower = carry cleared
asl lsl ;arithmetic shift left = logical shift left
A22i Numeric Formats & Dialects
  Type          Normal       Alias
  Decimal       85           #85  &d85
  Hexadecimal   55h          #55h  0x55  #0x55  $55  &h55
  Octal         125o         0o125  &o125
  Ascii         'U'          "U"
  Binary        01010101b    %01010101  0b01010101  &b01010101
  Roman         &rLXXXV      (very useful for arrays of kings and chapters)
Note: The default numeric format can be changed by the .radix directive (usually 10=decimal). For example, with radix 16, values like "85" and "0101b" are treated as hexadecimal numbers (in that case, decimal and binary numbers can be still defined with prefixes &d and &b).

A22i Numeric Operators Priority
  Prio  Operator           Aliases
8 (,) brackets
7 +,- sign
6 *,/,MOD,SHL,SHR MUL,DIV,<<,>>
5 +,- operation
4 EQ,GE,GT,LE,LT,NE =,>=,>,<=,<,<>,==,!=
3 NOT
2 AND
1 OR,XOR EOR
Operators of same priority are processed from left to right.
Boolean operators (priority 4) return 1=TRUE, 0=FALSE.

A22i Nocash Syntax
Even though A22i does recognize the official ARM syntax, it also allows the use of friendlier code:
  mov   r0,0ffh         ;no C64-style "#", and no C-style "0x" required
  stmia [r7]!,r0,r4-r5  ;square [base] brackets, no fancy {rlist} brackets
  mov   r0,cpsr         ;no confusing MSR and MRS (whatever which is which)
  mov   r0,p0,0,c0,c0,0 ;no confusing MCR and MRC (whatever which is which)
  ldr   r0,[score]      ;allows to use clean brackets for relative addresses
  push  rlist           ;alias for stmfd [r13]!,rlist (and same for pop/ldmfd)
  label:                ;label definitions recommended to use ":" colons
[A22i is the no$gba debug version's built-in source code assembler.]


 ARM CP14 ICEbreaker Debug Communications Channel < ^

The ICEbreaker aka EmbeddedICE module may be found in ARM7TDMI and possibly also in other ARM processors. The main functionality of the module relies on external inputs (BREAKPT signal, etc.) being controlled by external debugging hardware. At software side, ICEbreaker contains a Debug Communications Channel (again to access external hardware), which can be accessed as coprocessor 14 via following opcodes:
  MRC{cond} P14,0,Rd,C0,C0,0  ;Read Debug Comms Control Register
  MRC{cond} P14,0,Rd,C1,C0,0  ;Read Debug Comms Data Register
  MRC{cond} P14,0,Rd,C2,C0,0  ;Read Debug Comms Status Register
  MCR{cond} P14,0,Rd,C1,C0,0  ;Write Debug Comms Data Register
  MCR{cond} P14,0,Rd,C2,C0,0  ;Write Debug Comms Status Register
The Control register consists of Bit31-28=ICEbreaker version (0001b for ARM7TDMI), Bit27-2=Not specified, Bit0/Bit1=Data Read/Write Status Flags.

The NDS7 and GBA allow access to CP14 (unlike CP0..CP13 & CP15, access to CP14 doesn't generate any exceptions). However, the ICEbreaker module appears to be disabled (or completely unimplemented): any reads from P14,0,Rd,C0,C0,0 through P14,7,Rd,C15,C15,7 simply return the prefetched opcode value from [$+8]. ICEbreaker might possibly be used and enabled in Nintendo's hardware debuggers, although external breakpoints are reportedly implemented via /FIQ input rather than via ICEbreaker hardware.
The NDS9 doesn't include a CP14 unit (or it is fully disabled), any attempts to access it are causing invalid instruction exceptions.


 ARM CP15 System Control Coprocessor < ^

ARM CP15 Overview
ARM CP15 ID Codes
ARM CP15 Control Register
ARM CP15 Memory Management Unit (MMU)
ARM CP15 Protection Unit (PU)
ARM CP15 Cache Control
ARM CP15 Tightly Coupled Memory (TCM)
ARM CP15 Misc


 ARM CP15 Overview < ^

CP15
In many ARM CPUs, particularly those with memory control facilities, coprocessor number 15 (CP15) is used as built-in System Control Coprocessor.
CPUs without memory control functions typically do not include a CP15 at all; in that case even an attempt to read the Main ID register will cause an Undefined Instruction exception.

CP15 Opcodes
CP15 can be accessed via MCR and MRC opcodes, with Pn=P15, and <cpopc>=0.
  MCR{cond} P15,0,Rd,Cn,Cm,<cp>   ;move from ARM to CP15
MRC{cond} P15,0,Rd,Cn,Cm,<cp> ;move from CP15 to ARM
Rd can be any ARM register in range R0-R14, R15 should not be used with P15.
Cn,Cm,<cp> are used to select a CP15 register, eg. C0,C0,0 = Main ID Register.
Other coprocessor opcodes (CDP, LDC, STC) cannot be used with P15.

CP15 Register List
  Register     Expl.
C0,C0,0 Main ID Register (R)
C0,C0,1 Cache Type and Size (R)
C0,C0,2 TCM Physical Size (R)
C1,C0,0 Control Register (R/W, or R=Fixed)
C2,C0,0 PU Cachability Bits for Data/Unified Protection Region
C2,C0,1 PU Cachability Bits for Instruction Protection Region
C3,C0,0 PU Write-Bufferability Bits for Data Protection Regions
C5,C0,0 PU Access Permission Data/Unified Protection Region
C5,C0,1 PU Access Permission Instruction Protection Region
C5,C0,2 PU Extended Access Permission Data/Unified Protection Region
C5,C0,3 PU Extended Access Permission Instruction Protection Region
C6,C0..C7,0 PU Protection Unit Data/Unified Region 0..7
C6,C0..C7,1 PU Protection Unit Instruction Region 0..7
C7,Cm,Op2 Cache Commands and Halt Function (W)
C9,C0,0 Cache Data Lockdown
C9,C0,1 Cache Instruction Lockdown
C9,C1,0 TCM Data TCM Base and Virtual Size
C9,C1,1 TCM Instruction TCM Base and Virtual Size
C13,Cm,Op2 Misc Process ID registers
C15,Cm,Op2 Misc Implementation Defined and Test/Debug registers
Data/Unified Registers
Some Cache/PU/TCM registers are declared as "Data/Unified".
These registers are used for Data accesses in case the CPU contains separate Data and Instruction registers, otherwise the registers are used for both (unified) Data and Instruction accesses.


 ARM CP15 ID Codes < ^

C0,C0,0 - Main ID Register (R)
  12-15 ARM Era (0=Pre-ARM7, 7=ARM7, other=Post-ARM7)
Post-ARM7 Processors
  0-3   Revision Number
4-15 Primary Part Number (Bit12-15 must be other than 0 or 7)
(eg. 946h for ARM946)
16-19 Architecture (1=v4, 2=v4T, 3=v5, 4=v5T, 5=v5TE)
20-23 Variant Number
24-31 Implementor (41h=ARM, 44h=Digital Equipment Corp, 69h=Intel)
ARM7 Processors
  0-3   Revision Number
4-15 Primary Part Number (Bit12-15 must be 7)
16-22 Variant Number
23 Architecture (0=v3, 1=v4T)
24-31 Implementor (41h=ARM, 44h=Digital Equipment Corp, 69h=Intel)
Pre-ARM7 Processors
  0-3   Revision Number
4-11 Processor ID LSBs (30h=ARM3/v2, 60h,61h,62h=ARM600,610,620/v3)
12-31 Processor ID MSBs (fixed, 41560h)
Note: On the NDS9, this register is 41059461h. NDS7 and GBA don't have CP15s.
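The three layouts can be told apart by Bit 12-15, as in the sketch below. The function name and the dictionary keys are invented for this example.

```python
ARCH_POST_ARM7 = {1: "v4", 2: "v4T", 3: "v5", 4: "v5T", 5: "v5TE"}

def decode_main_id(value):
    """Decode C0,C0,0 according to the ARM Era in Bit 12-15."""
    era = (value >> 12) & 0xF
    info = {"revision": value & 0xF}
    if era == 0:
        info.update(family="Pre-ARM7",
                    processor_id_lsbs=(value >> 4) & 0xFF,
                    processor_id_msbs=value >> 12)
    elif era == 7:
        info.update(family="ARM7",
                    part=(value >> 4) & 0xFFF,
                    variant=(value >> 16) & 0x7F,
                    architecture="v4T" if (value >> 23) & 1 else "v3",
                    implementor=value >> 24)
    else:
        info.update(family="Post-ARM7",
                    part=(value >> 4) & 0xFFF,
                    architecture=ARCH_POST_ARM7.get((value >> 16) & 0xF, "unknown"),
                    variant=(value >> 20) & 0xF,
                    implementor=value >> 24)
    return info
```

Applied to the NDS9 value 41059461h this gives Implementor 41h (ARM), Variant 0, Architecture v5TE, Part Number 946h, Revision 1.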

C0,C0,1 - Cache Type Register (R)
  0-11  Instruction Cache (bits 0-1=len, 2=m, 3-5=assoc, 6-8=size, 9-11=zero)
  12-23 Data Cache        (bits 0-1=len, 2=m, 3-5=assoc, 6-8=size, 9-11=zero)
  24    Separate Cache Flag (0=Unified, 1=Separate Data/Instruction Caches)
  25-28 Cache Type (0,1,2,6,7=see below, other=reserved)
          Type  Method         Cache cleaning         Cache lock-down
          0     Write-through  Not needed             Not supported
          1     Write-back     Read data block        Not supported
          2     Write-back     Register 7 operations  Not supported
          6     Write-back     Register 7 operations  Format A
          7     Write-back     Register 7 operations  Format B
  29-31 Reserved (zero)
The 12bit Instruction/Data values are decoded as shown below,
  Cache Absent  = (ASSOC=0 and M=1)       ;in that case overriding below
  Cache Size    = 200h+(100h*M) shl SIZE  ;min 0.5Kbytes, max 96Kbytes
  Associativity = (1+(0.5*M)) shl ASSOC   ;min 1-way,     max 192-way
  Line Length   = 8 shl LEN               ;min 8 bytes,   max 64 bytes
For Unified cache (Bit 24=0), Instruction and Data values are identical.
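The decoding of one 12bit Instruction/Data field can be sketched as follows. The size formula is read as (200h+100h*M) shl SIZE, which is the reading that matches the stated minimum and maximum; the function name and the sample field values are assumptions for illustration.

```python
def decode_cache_field(field):
    """Decode a 12bit cache field; returns None if the cache is absent."""
    length = field & 3
    m = (field >> 2) & 1
    assoc = (field >> 3) & 7
    size = (field >> 6) & 7
    if assoc == 0 and m == 1:
        return None                                # Cache Absent
    return {
        "size_bytes": (0x200 + 0x100 * m) << size,
        "ways": ((2 + m) << assoc) // 2,           # (1+0.5*M) shl ASSOC
        "line_bytes": 8 << length,
    }
```

For example, a field value of 112h decodes as an 8Kbyte, 4-way cache with 32 byte lines.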

C0,C0,2 - Tightly Coupled Memory (TCM) Size Register (R)
  0-1   Reserved    (0)
2 ITCM Absent (0=Present, 1=Absent)
3-5 Reserved (0)
6-9 ITCM Size (Size = 512 SHL N) (or 0=None)
10-13 Reserved (0)
14 DTCM Absent (0=Present, 1=Absent)
15-17 Reserved (0)
18-21 DTCM Size (Size = 512 SHL N) (or 0=None)
22-31 Reserved (0)
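The TCM sizes can be extracted as shown in this sketch. The function name and the sample register value are assumptions for illustration; a size field of zero or a set Absent bit is reported as 0 bytes.

```python
def decode_tcm_size(value):
    """Decode C0,C0,2 into ITCM and DTCM sizes in bytes (0=none)."""
    def size_of(absent_bit, size_shift):
        absent = (value >> absent_bit) & 1
        n = (value >> size_shift) & 0xF
        if absent or n == 0:
            return 0
        return 512 << n                 # Size = 512 SHL N
    return {"itcm_bytes": size_of(2, 6), "dtcm_bytes": size_of(14, 18)}
```

For example, a register value of 00140180h decodes as 32Kbytes ITCM (N=6) and 16Kbytes DTCM (N=5).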
C0,C0,3..7 - Reserved (R)
Unused/Reserved registers, containing the same value as C0,C0,0.


 ARM CP15 Control Register < ^

C1,C0,0 - Control Register (R/W, or R=Fixed)
  0  MMU/PU Enable         (0=Disable, 1=Enable) (Fixed 0 if none)
1 Alignment Fault Check (0=Disable, 1=Enable) (Fixed 0/1 if none/always on)
2 Data/Unified Cache (0=Disable, 1=Enable) (Fixed 0/1 if none/always on)
3 Write Buffer (0=Disable, 1=Enable) (Fixed 0/1 if none/always on)
4 Exception Handling (0=26bit, 1=32bit) (Fixed 1 if always 32bit)
5 26bit-address faults (0=Enable, 1=Disable) (Fixed 1 if always 32bit)
6 Abort Model (pre v4) (0=Early, 1=Late Abort) (Fixed 1 if ARMv4 and up)
7 Endian (0=Little, 1=Big) (Fixed 0/1 if fixed)
8 System Protection bit (MMU-only)
9 ROM Protection bit (MMU-only)
10 Implementation defined
11 Branch Prediction (0=Disable, 1=Enable)
12 Instruction Cache (0=Disable, 1=Enable) (ignored if Unified cache)
13 Exception Vectors (0=00000000h, 1=FFFF0000h)
14 Cache Replacement (0=Normal/PseudoRandom, 1=Predictable/RoundRobin)
15 Pre-ARMv5 Mode (0=Normal, 1=Pre ARMv5; LDM/LDR/POP_PC.Bit0/Thumb)
16 DTCM Enable (0=Disable, 1=Enable)
17 DTCM Load Mode (0=R/W, 1=DTCM Write-only)
18 ITCM Enable (0=Disable, 1=Enable)
19 ITCM Load Mode (0=R/W, 1=ITCM Write-only)
20-31 Reserved (keep these bits unchanged) (usually zero)
Various bits in this register may be read-only (fixed 0 if unsupported, or fixed 1 if always activated).
On the NDS bit0,2,7,12..19 are R/W, Bit3..6 are always set, all other bits are always zero.
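The NDS behaviour described above (some bits R/W, some fixed) can be modelled as a mask applied on write. This is a sketch derived only from the bit list in the preceding sentence; the constant and function names are hypothetical.

```python
NDS_RW_MASK = 0x000FF085      # bit0, bit2, bit7, bit12..19 are R/W
NDS_FIXED_ONES = 0x00000078   # bit3..6 are always set

def nds_control_readback(written):
    """Value read back from C1,C0,0 after writing the given value."""
    return (written & NDS_RW_MASK) | NDS_FIXED_ONES
```

Writing zero reads back as 00000078h, and writing FFFFFFFFh reads back as 000FF0FDh in this model.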


 ARM CP15 Memory Management Unit (MMU) < ^

Function of some registers depends on whether the CPU contains a MMU or PU.
MMU handles virtual addressing tables.
  C2,Cm,Op2  MMU Translation Table Base
C3,Cm,Op2 MMU Domain Access Control
C5,Cm,Op2 MMU Fault Status
C6,Cm,Op2 MMU Fault Address
C8,Cm,Op2 MMU TLB Control
C10,Cm,Op2 MMU TLB Lockdown
The GBA, and Nintendo DS do not have a MMU.


 ARM CP15 Protection Unit (PU) < ^

Protection Unit can be enabled in Bit0 of C1,C0,0 (Control Register).

C2,C0,0 - Cachability Bits for Data/Unified Protection Region (R/W)
C2,C0,1 - Cachability Bits for Instruction Protection Region (if any) (R/W)
  0-7  Cachable (C) bits for region 0-7
8-31 Reserved/zero
C3,C0,0 - Write-Bufferability Bits for Data Protection Regions (R/W)
  0-7  Bufferable (B) bits for region 0-7
8-31 Reserved/zero
Instruction fetches are, obviously, always read-operations. So, there are no write-bufferability bits for Instruction Protection Regions.

C5,C0,0 - Access Permission Data/Unified Protection Region (R/W)
C5,C0,1 - Access Permission Instruction Protection Region (if any) (R/W)
C5,C0,2 - Extended Access Permission Data/Unified Protection Region (R/W)
C5,C0,3 - Extended Access Permission Instruction Protection Region (if any) (R/W)
For C5,C0,0 and C5,C0,1:
  0-15   Access Permission (AP) bits for region 0-7 (Bits 0-1=AP0, 2-3=AP1, etc)
  16-31  Reserved/zero
For C5,C0,2 and C5,C0,3 (Extended):
  0-31  Access Permission (AP) bits for region 0-7 (Bits 0-3=AP0, 4-7=AP1, etc)
The possible AP settings (0-3 for C5,C0,0..1, or 0-15 for C5,C0,2..3) are:
  AP  Privileged  User
  0   -           -
  1   R/W         -
  2   R/W         R
  3   R/W         R/W
  5   R           -
  6   R           R
Settings 5,6 only for Extended Registers, settings 4,7..15 are Reserved.
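
The two AP layouts described above differ only in field width (2 bits versus 4 bits per region). As a rough C sketch (the function names are hypothetical, not part of any SDK), the register values could be assembled like this:

```c
#include <stdint.h>

/* Pack eight 2bit AP values (0..3) for C5,C0,0 or C5,C0,1.
   Bits 0-1=AP0, 2-3=AP1, etc; bits 16-31 stay zero. */
static uint32_t pu_pack_ap_standard(const uint8_t ap[8])
{
    uint32_t value = 0;
    for (int region = 0; region < 8; region++)
        value |= (uint32_t)(ap[region] & 3u) << (region * 2);
    return value;
}

/* Pack eight 4bit AP values for C5,C0,2 or C5,C0,3 (Extended).
   Bits 0-3=AP0, 4-7=AP1, etc. */
static uint32_t pu_pack_ap_extended(const uint8_t ap[8])
{
    uint32_t value = 0;
    for (int region = 0; region < 8; region++)
        value |= (uint32_t)(ap[region] & 15u) << (region * 4);
    return value;
}

/* Returns 1 for the documented settings (0..3, 5, 6);
   settings 4 and 7..15 are reserved. */
static int pu_ap_is_defined(unsigned ap)
{
    return ap <= 3u || ap == 5u || ap == 6u;
}
```

Note that the standard packer silently truncates values above 3, so settings 5 and 6 must go through the extended registers.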

C6,C0..C7,0 - Protection Unit Data/Unified Region 0..7 (R/W)
C6,C0..C7,1 - Protection Unit Instruction Region 0..7 (R/W) if any
  0      Protection Region Enable (0=Disable, 1=Enable)
  1-5    Protection Region Size (2 SHL X) ;min=(X=11)=4KB, max=(X=31)=4GB
  6-11   Reserved/zero
  12-31  Protection Region Base address (Addr = Y*4K; must be SIZE-aligned)
Overlapping Regions are allowed; Region 7 has the highest priority, Region 0 the lowest.
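
As an illustration of the region register layout above, here is a small C sketch that builds a C6,C0..C7 value from a base address and a size. The function name and the "return 0 on invalid input" convention are assumptions made for this example only.

```c
#include <stdint.h>

/* Build a Protection Region register value.
   size_log2 is log2 of the region size in bytes: 12 (4KB) .. 32 (4GB).
   Returns 0 (region disabled) if the size is out of range or if the
   base address is not SIZE-aligned. */
static uint32_t pu_encode_region(uint32_t base, unsigned size_log2)
{
    if (size_log2 < 12 || size_log2 > 32)
        return 0;

    uint32_t x = size_log2 - 1;  /* Size = 2 SHL X */
    uint32_t align_mask = (size_log2 == 32) ? 0xFFFFFFFFu
                                            : ((1u << size_log2) - 1u);
    if (base & align_mask)
        return 0;                /* base must be SIZE-aligned */

    return base | (x << 1) | 1u; /* bit 0 = Enable */
}
```

With this sketch, a 64MB region at 04000000h encodes as 04000033h, and a 4GB region at address zero encodes as 0000003Fh.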

Background Region
Additionally, any memory areas outside of the eight Protection Regions are handled as Background Region, this region has neither Read nor Write access.
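
The priority rule and the Background Region can be modelled in a few lines of C. This is only a software model of the lookup as described above (function name hypothetical); how a size setting below the documented minimum behaves on real hardware is not covered here, so the sketch simply skips such regions.

```c
#include <stdint.h>

/* regs[] holds the raw values of the eight region registers.
   Returns the index (0..7) of the enabled region with the highest
   priority that contains addr, or -1 for the Background Region
   (no Read and no Write access). */
static int pu_find_region(const uint32_t regs[8], uint32_t addr)
{
    for (int region = 7; region >= 0; region--) {  /* 7 = highest priority */
        uint32_t reg = regs[region];
        if (!(reg & 1u))
            continue;                              /* region disabled */

        unsigned x = (reg >> 1) & 31u;             /* Size = 2 SHL X */
        if (x < 11)
            continue;                              /* below 4KB minimum (assumption: no match) */

        uint32_t offset_mask = (x == 31) ? 0xFFFFFFFFu : ((2u << x) - 1u);
        uint32_t base = reg & ~offset_mask;
        if ((addr & ~offset_mask) == base)
            return region;
    }
    return -1;
}
```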

Unified Region Note
On the NDS, the Region registers are unified (C6,C0..C7,1 are read/write-able mirrors of C6,C0..C7,0). Nevertheless, the Cachability and Permission registers are NOT unified (separate registers exist for code and data settings).


 ARM CP15 Cache Control < ^

Cache enabled/controlled by Bit 2,3,12,14 in Control Register.
Cache type detected in Cache Type Register.

C7,C0..C15,0..7 - Cache Commands (W)
Write-only Cache Command Register. Cm,Op2 operands used to select a specific command, with parameter value in Rd.
  Cn,Cm,Op2  Rd    ARM9  Command
  C7,C0,4    0     Yes   Wait For Interrupt (Halt)
  C7,C5,0    0     Yes   Invalidate Entire Instruction Cache
  C7,C5,1    VA    Yes   Invalidate Instruction Cache Line
  C7,C5,2    S/I   -     Invalidate Instruction Cache Line
  C7,C5,4    0     -     Flush Prefetch Buffer
  C7,C5,6    0     -     Flush Entire Branch Target Cache
  C7,C5,7    IMP?  -     Flush Branch Target Cache Entry
  C7,C6,0    0     Yes   Invalidate Entire Data Cache
  C7,C6,1    VA    Yes   Invalidate Data Cache Line
  C7,C6,2    S/I   -     Invalidate Data Cache Line
  C7,C7,0    0     -     Invalidate Entire Unified Cache
  C7,C7,1    VA    -     Invalidate Unified Cache Line
  C7,C7,2    S/I   -     Invalidate Unified Cache Line
  C7,C8,2    0     Yes   Wait For Interrupt (Halt), alternately to C7,C0,4
  C7,C10,1   VA    Yes   Clean Data Cache Line
  C7,C10,2   S/I   Yes   Clean Data Cache Line
  C7,C10,4   0     -     Drain Write Buffer
  C7,C11,1   VA    -     Clean Unified Cache Line
  C7,C11,2   S/I   -     Clean Unified Cache Line
  C7,C13,1   VA    Yes   Prefetch Instruction Cache Line
  C7,C14,1   VA    Yes   Clean and Invalidate Data Cache Line
  C7,C14,2   S/I   Yes   Clean and Invalidate Data Cache Line
  C7,C15,1   VA    -     Clean and Invalidate Unified Cache Line
  C7,C15,2   S/I   -     Clean and Invalidate Unified Cache Line
Parameter values (Rd) formats:
  0    Not used, should be zero
  VA   Virtual Address
  S/I  Set/index; Bit 31..(32-A) = Index, Bit (L+S-1)..L = Set ?

C9,C0,0 - Data Cache Lockdown
C9,C0,1 - Instruction Cache Lockdown
(Width (W) of index field depends on cache ASSOCIATIVITY.)
Format A:
  0..(31-W)   Reserved/zero
  (32-W)..31  Lockdown Block Index
Format B:
  0..(W-1)    Lockdown Block Index
  W..30       Reserved/zero
  31          L

Cache and write-buffer should not be enabled for the whole 4GB memory area: high-speed TCM memory doesn't require caching, and caching would have fatal results on I/O ports. So, cache can be used only in combination with the Protection Unit, which allows caching to be enabled/disabled in specified regions.

Note
ARMv5 instruction set supports a Cache Prepare for Load opcode (PLD), see
ARM.9: Single Data Transfer (LDR, STR, PLD)


 ARM CP15 Tightly Coupled Memory (TCM) < ^

TCM is high-speed memory, directly contained in the ARM CPU core.

TCM and DMA
TCM doesn't use the ARM bus. A minor disadvantage is that TCM cannot be accessed by DMA. However, the main advantage is that, when using TCM, the CPU can be kept running without any waitstates even while the bus is used for DMA transfers. Operation during DMA works only if all code/data is located in TCM, waitstates are generated if any code/data outside TCM is accessed; in worst case (if there are no gaps in the DMA) then the CPU is halted until the DMA finishes.

TCM and DMA and IRQ
It is unknown if/how IRQs are handled during DMA. Possibly (unlikely) code in TCM keeps executing until the DMA finishes (ie. until the IRQ vector can be accessed). Or possibly the IRQ vector is accessed instantly (causing the CPU to halt until the DMA finishes). In both cases: Assuming that IRQs are enabled, and that the IRQ vector and/or IRQ handler are located outside TCM.

Separate Instruction (ITCM) and Data (DTCM) Memory
DTCM can be used only for Data accesses, typically used for stacks and other frequently accessed data.
ITCM is primarily intended for instruction accesses, but it can also be used for data accesses (among others, this allows code to be copied to ITCM); however, performance isn't optimal when simultaneously accessing ITCM for code and data (such as opcodes in ITCM that use literal pool values in ITCM).

TCM Enable, TCM Load Mode
The CP15 Control Register allows ITCM and DTCM to be enabled, and to be switched into Load Mode. In Load Mode (when TCM is enabled), TCM becomes write-only; this allows data to be read from source addresses in main memory, and to be written to destination addresses in TCM by using the same addresses; useful for initializing TCM with overlapping source/dest addresses. Load mode works with all Load/Store opcodes, it does NOT work with SWP/SWPB opcodes.

TCM Physical Size can be detected in 3rd ID Code Register. (C0,C0,2)

C9,C1,0 - Data TCM Size/Base (R/W)
C9,C1,1 - Instruction TCM Size/Base (R/W)
  0      Reserved     (0)
  1-5    Virtual Size (Size = 512 SHL N) ;min=(N=3)=4KB, max=(N=23)=4GB
  6-11   Reserved     (0)
  12-31  Region Base  (Base = X SHL 12)  ;Base must be Size-aligned
The Virtual size settings should normally be the same as the Physical sizes (see C0,C0,2). However, smaller sizes are allowed (using only the first few KB), as well as bigger sizes (the TCM area is then filled with mirrors of physical TCM).
The ITCM region base may be fixed (read-only), for example, on the NDS, ITCM base is always 00000000h, nevertheless the virtual size may be changed (allowing to mirror ITCM to higher addresses).
If DTCM and ITCM do overlap, then ITCM appears to have priority.
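
The Size/Base encoding and the mirroring behaviour described above can be sketched in C as follows. The function names are hypothetical, and the mirror calculation assumes that the physical size is a power of two and that mirrors simply repeat throughout the virtual area.

```c
#include <stdbool.h>
#include <stdint.h>

/* Build a C9,C1,0 or C9,C1,1 value.
   size_log2 is log2 of the virtual size in bytes: 12 (4KB) .. 32 (4GB).
   Returns false if the size is out of range or base is not Size-aligned. */
static bool tcm_encode(uint32_t base, unsigned size_log2, uint32_t *out)
{
    if (size_log2 < 12 || size_log2 > 32)
        return false;

    uint32_t n = size_log2 - 9;  /* Size = 512 SHL N */
    uint32_t align_mask = (size_log2 == 32) ? 0xFFFFFFFFu
                                            : ((1u << size_log2) - 1u);
    if (base & align_mask)
        return false;            /* Base must be Size-aligned */

    *out = base | (n << 1);
    return true;
}

/* For an address inside the virtual TCM area, compute the offset within
   physical TCM (assumes power-of-two physical size, repeating mirrors). */
static uint32_t tcm_physical_offset(uint32_t addr, uint32_t base,
                                    uint32_t physical_size)
{
    return (addr - base) & (physical_size - 1u);
}
```

For example, a 16KB area at the (arbitrarily chosen) base 0B000000h encodes as 0B00000Ah, and a 32MB virtual area at address zero encodes as 00000020h.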

TCM and PU
TCM can be used without Protection Unit.
When the protection unit is enabled, TCM is controlled by the PU just like normal memory, the PU should provide R/W Access Permission for TCM regions; cache and write-buffer are not required for high-speed TCM (so both should be disabled for TCM regions).


 ARM CP15 Misc < ^

C13,C0,0 - Process ID for Fast Context Switch Extension (FCSE) (R/W)
  0-24   Reserved/zero
  25-31  Process ID (PID) (0-127) (0=Disable)
The FCSE allows different processes (each assembled with ORG 0) to be located at virtual addresses in the 1st 32MB area. The FCSE splits the total 4GB address space into blocks of 32MB, accesses to Block(0) are redirected to Block(PID):
  IF addr<32M then addr=addr+PID*32M
Respectively, with PID=0, the address remains unchanged (FCSE disabled).
The CPU-to-Memory address handling is shown below:
  1. CPU outputs a virtual address (VA)
  2. FCSE adjusts the VA to a modified virtual address (MVA)
  3. Cache hits determined by examining the MVA, continue below if no hit
  4. MMU translates MVA to physical address (PA) (if no MMU present: PA=MVA)
  5. Memory access occurs at PA
The FCSE allows limited virtual addressing even if no MMU is present.
If the MMU is present, then either the FCSE and/or the MMU can be used for virtual addressing; the advantage of using the FCSE (a single write to C13,C0,0) is less overhead; using the MMU for the same purpose would require changing the virtual address translation table in memory, and flushing the cache.
The NDS doesn't have a FCSE (the FCSE register is read-only, always zero).
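
For CPUs that do implement the FCSE, the VA to MVA step described above can be modelled in C as shown below. This is a software model only, with hypothetical function names; it does not apply to the NDS, where the register is fixed at zero.

```c
#include <stdint.h>

#define FCSE_BLOCK_SIZE 0x02000000u  /* 32MB */

/* Extract the PID (bits 25-31) from a raw C13,C0,0 value. */
static uint32_t fcse_pid(uint32_t c13_value)
{
    return c13_value >> 25;
}

/* Convert a virtual address (VA) to a modified virtual address (MVA):
   IF addr<32M then addr=addr+PID*32M. With PID=0 nothing changes. */
static uint32_t fcse_modify_address(uint32_t va, uint32_t pid)
{
    if (va < FCSE_BLOCK_SIZE)
        return va + (pid & 127u) * FCSE_BLOCK_SIZE;
    return va;
}
```

With PID=3, an access to 00001000h would be redirected to 06001000h, while an access to 02000000h or above is left unchanged.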

C13,C0,1 - Trace Process ID (R/W)
C13,C1,1 - Trace Process ID (Mirror) (R/W)
This value is output to ETMPROCID pins (if any), allowing to notify external hardware about the currently executed process within multi-tasking programs.
  0-31  Process ID
C13,C1,1 is a mirror of C13,C0,1 (for compatibility with other ARM processors).
Both registers are read/write-able on NDS9, but there are no external pin-outs.

<cpopc>
Unlike for all other CP15 registers, the <cpopc> operand of the MRC/MCR opcodes isn't always zero for below registers, so below registers are using "cpopc,Cn,Cm,op2" notation (instead of the normal "Cn,Cm,op2" notation).

Built-In-Self-Test (BIST)
Allows to test internal memory (ie. TCM, Cache Memory, and Cache TAGs). The tests are filling (and verifying) the selected memory region thrice (once with the fillvalue, then with the inverted fillvalue, and then again with the fillvalue). The BIST functions are intended for diagnostics purposes only, not for use in normal program code (ARM doesn't guarantee future processors to have backwards compatible BIST functions).

0,C15,C0,1 - BIST TAG Control Register (R/W)
1,C15,C0,1 - BIST TCM Control Register (R/W)
2,C15,C0,1 - BIST Cache Control Register (R/W)
  0-15   Data Control        (see below)
  16-31  Instruction Control (see below)
The above 16bit control values are:
  0     Start bit     (Write: 1=Start) (Read: 1=Busy)
  1     Pause bit     (1=Pause)
  2     Enable bit    (1=Enable)
  3     Fail Flag     (1=Error) (Read Only)
  4     Complete Flag (1=Ready) (Read Only)
  5-15  Size          (2^(N+2) bytes) (min=N=1=8bytes, max=N=24=64MB)
Size and Pause are not supported in all implementations.
Caution: While and as long as the Enable bit is set, the corresponding memory region(s) will be disabled. Eg. when testing <either> DTCM <and/or> ITCM, <both> DTCM <and> ITCM are forcefully disabled in C1,C0,0 (Control Register), after the test the software must first clear the BIST enable bit, and then restore DTCM/ITCM bits in C1,C0,0. And of course, the content of the tested memory region must be restored when needed.
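
To make the layout of the BIST control registers easier to follow, here is a small C sketch that splits a 32bit control register into its Data and Instruction halves and decodes one 16bit control value. The struct and function names are invented for this example; the sketch only decodes values and does not touch any hardware.

```c
#include <stdint.h>

/* Decoded view of one 16bit BIST control value. */
struct bist_control {
    int busy;        /* bit 0: Read: 1=Busy (Write: 1=Start) */
    int paused;      /* bit 1 */
    int enabled;     /* bit 2 */
    int failed;      /* bit 3: read only */
    int complete;    /* bit 4: read only */
    unsigned size_n; /* bits 5-15: N, size is 2^(N+2) bytes */
};

/* Bits 0-15 of the register are the Data Control value. */
static uint16_t bist_data_half(uint32_t reg)  { return (uint16_t)(reg & 0xFFFFu); }
/* Bits 16-31 of the register are the Instruction Control value. */
static uint16_t bist_instr_half(uint32_t reg) { return (uint16_t)(reg >> 16); }

static struct bist_control bist_decode(uint16_t ctrl)
{
    struct bist_control c;
    c.busy     = (ctrl >> 0) & 1;
    c.paused   = (ctrl >> 1) & 1;
    c.enabled  = (ctrl >> 2) & 1;
    c.failed   = (ctrl >> 3) & 1;
    c.complete = (ctrl >> 4) & 1;
    c.size_n   = (ctrl >> 5) & 0x7FFu;
    return c;
}

/* Size in bytes for a given N (documented range: N=1 .. N=24). */
static uint32_t bist_size_bytes(unsigned n)
{
    return 1u << (n + 2);
}
```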

0,C15,C0,2 - BIST Instruction TAG Address (R/W)
1,C15,C0,2 - BIST Instruction TCM Address (R/W)
2,C15,C0,2 - BIST Instruction Cache Address (R/W)
0,C15,C0,6 - BIST Data TAG Address (R/W)
1,C15,C0,6 - BIST Data TCM Address (R/W)
2,C15,C0,6 - BIST Data Cache Address (R/W)
  0-31  Word-aligned Destination Address within Memory Block (eg. within ITCM)
On the NDS9, bit0-1, and bit21-31 are always zero.

0,C15,C0,3 - BIST Instruction TAG Fillvalue (R/W)
1,C15,C0,3 - BIST Instruction TCM Fillvalue (R/W)
2,C15,C0,3 - BIST Instruction Cache Fillvalue (R/W)
0,C15,C0,7 - BIST Data TAG Fillvalue (R/W)
1,C15,C0,7 - BIST Data TCM Fillvalue (R/W)
2,C15,C0,7 - BIST Data Cache Fillvalue (R/W)
  0-31  Fillvalue for BIST
After BIST, the selected memory region is filled by that value. That is, on the NDS9 at least, all words will be filled with the SAME value (ie. NOT with increasing or randomly generated numbers).

0,C15,C0,0 - Cache Debug Test State Register (R/W)
  0-8    Reserved (zero)
  9      Disable Instruction Cache Linefill
  10     Disable Data Cache Linefill
  11     Disable Instruction Cache Streaming
  12     Disable Data Cache Streaming
  13-31  Reserved (zero/unpredictable)

3,C15,C0,0 - Cache Debug Index Register (R/W)
  0..1     Reserved (zero)
  2..4     Word Address
  5..N     Index
  N+1..29  Reserved (zero)
  30..31   Segment

3,C15,C0,1 - Cache Debug Instruction TAG (R/W)
3,C15,C0,2 - Cache Debug Data TAG (R/W)
3,C15,C0,3 - Cache Debug Instruction Cache (R/W)
3,C15,C0,4 - Cache Debug Data Cache (R/W)
  0..1     Set
  2..3     Dirty Bits
  4        Valid
  5..N     Index
  N+1..31  TAG Address

 CPU Instruction Cycle Times < ^

Instruction Cycle Summary
  Instruction      Cycles      Additional
  ---------------------------------------------------------------------
  Data Processing  1S          +1S+1N if R15 loaded, +1I if SHIFT(Rs)
  MSR,MRS          1S
  LDR              1S+1N+1I    +1S+1N if R15 loaded
  STR              2N
  LDM              nS+1N+1I    +1S+1N if R15 loaded
  STM              (n-1)S+2N
  SWP              1S+2N+1I
  BL (THUMB)       3S+1N
  B,BL             2S+1N
  SWI,trap         2S+1N
  MUL              1S+mI
  MLA              1S+(m+1)I
  MULL             1S+(m+1)I
  MLAL             1S+(m+2)I
  CDP              1S+bI
  LDC,STC          (n-1)S+2N+bI
  MCR              1N+bI+1C
  MRC              1S+(b+1)I+1C
  {cond} false     1S
ARM9:
  Q{D}ADD/SUB      1S+Interlock.
  CLZ              1S.
  LDR              1S+1N+1L
  LDRB,LDRH,LDRmis 1S+1N+2L
  LDR PC           ...
  STR              1S+1N       (not 2N, and both in parallel)
  Execution Time:  1S+Interlock    (SMULxy,SMLAxy,SMULWx,SMLAWx)
  Execution Time:  1S+1I+Interlock (SMLALxy)

Whereas,
  n = number of words transferred
  b = number of cycles spent in coprocessor busy-wait loop
  m = depends on most significant byte(s) of multiplier operand
Above 'trap' is meant to be the execution time for exceptions. And '{cond} false' is meant to be the execution time for conditional instructions which haven't been actually executed because the condition has been false.

The separate meaning of the N,S,I,C cycles is:

N - Non-sequential cycle
Requests a transfer to/from an address which is NOT related to the address used in the previous cycle. (Called 1st Access in GBA language).
The execution time for 1N is 1 clock cycle (plus non-sequential access waitstates).

S - Sequential cycle
Requests a transfer to/from an address which is located directly after the address used in the previous cycle. Ie. for 16bit or 32bit accesses at incrementing addresses, the first access is Non-sequential, the following accesses are sequential. (Called 2nd Access in GBA language).
The execution time for 1S is 1 clock cycle (plus sequential access waitstates).

I - Internal Cycle
CPU is just too busy, not even requesting a memory transfer for now.
The execution time for 1I is 1 clock cycle (without any waitstates).

C - Coprocessor Cycle
The CPU uses the data bus to communicate with the coprocessor (if any), but no memory transfers are requested.

Memory Waitstates
Ideally, memory may be accessed free of waitstates (1N and 1S are then equal to 1 clock cycle each). However, a memory system may generate waitstates for several reasons: The memory may be just too slow. Memory is currently accessed by DMA, eg. sound, video, memory transfers, etc. Or when data is squeezed through a 16bit data bus (in that special case, 32bit access may have more waitstates than 8bit and 16bit accesses). Also, the memory system may separate between S and N cycles (if so, S cycles would be typically faster than N cycles).
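
Putting the N/S/I definitions and the waitstates together, the clock count of an instruction can be estimated as in the following C sketch. It assumes that all accesses of the instruction go to one memory area with fixed N and S waitstates, and it ignores C cycles; the names and the waitstate numbers in the example are illustrative only.

```c
/* Extra clock cycles added to each N cycle and each S cycle. */
struct waitstates {
    unsigned n;
    unsigned s;
};

/* 1N = 1 clock + N waitstates, 1S = 1 clock + S waitstates,
   1I = 1 clock (never any waitstates). */
static unsigned clocks_for(unsigned s_cycles, unsigned n_cycles,
                           unsigned i_cycles, struct waitstates ws)
{
    return s_cycles * (1 + ws.s) + n_cycles * (1 + ws.n) + i_cycles;
}

/* LDR = 1S+1N+1I (ARM7TDMI summary table, R15 not loaded). */
static unsigned ldr_clocks(struct waitstates ws)
{
    return clocks_for(1, 1, 1, ws);
}

/* LDM of 'words' registers = nS+1N+1I (R15 not loaded). */
static unsigned ldm_clocks(unsigned words, struct waitstates ws)
{
    return clocks_for(words, 1, 1, ws);
}
```

Without waitstates an LDR takes 3 clock cycles; with a hypothetical memory that adds 3 waitstates per N cycle and 1 per S cycle, it takes 7.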

Memory Waitstates for Different Memory Areas
Different memory areas (eg. ROM and RAM) may have different waitstates. When executing code in one area which accesses data in another area, then the S+N cycles must be split into code and data accesses: 1N is used for data access, plus (n-1)S for LDM/STM, the remaining S+N are code access. If an instruction jumps to a different memory area, then all code cycles for that opcode are having waitstate characteristics of the NEW memory area (except Thumb BL which still executes 1S in OLD area).
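
The code/data split described above can be sketched for LDM and STM as follows. This is one reading of the rule (1N plus (n-1)S counted as data access, the remaining S+N counted as code access), written as hypothetical C helpers; it assumes at least one transferred word and no load of R15.

```c
/* Extra clock cycles per N cycle and per S cycle in one memory area. */
struct area_waits {
    unsigned n;
    unsigned s;
};

/* LDM = nS+1N+1I: data gets 1N+(n-1)S, code gets the remaining 1S. */
static unsigned ldm_clocks_split(unsigned words, struct area_waits code,
                                 struct area_waits data)
{
    unsigned data_clocks = (1 + data.n) + (words - 1) * (1 + data.s);
    unsigned code_clocks = 1 + code.s;
    return data_clocks + code_clocks + 1;  /* +1I, no waitstates */
}

/* STM = (n-1)S+2N: data gets 1N+(n-1)S, code gets the remaining 1N. */
static unsigned stm_clocks_split(unsigned words, struct area_waits code,
                                 struct area_waits data)
{
    unsigned data_clocks = (1 + data.n) + (words - 1) * (1 + data.s);
    unsigned code_clocks = 1 + code.n;
    return data_clocks + code_clocks;
}
```

With zero waitstates in both areas the results fall back to the plain summary values (6 clocks for a 4-word LDM, 5 clocks for a 4-word STM).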


 CPU Versions < ^

Version Numbers
ARM CPUs are distributed by name ARM#, and are described as ARMv# in specifications, whereas "#" is NOT the same as "v#", for example, ARM7TDMI is ARMv4TM. That is so confusing that ARM didn't even attempt to clarify the relationship between the various "#" and "v#" values.

Version Variants
Suffixes like "M" (long multiply), "T" (Thumb support), "E" (Enhanced DSP) indicate the presence of special features, additionally to the standard instruction set of a given version, or, when preceded by an "x", indicate the absence of those features.

ARMv1 aka ARM1
Some sort of a beta version; according to ARM, it has never been used in any commercial products.

ARMv2 and up
MUL,MLA
CDP,LDC,MCR,MRC,STC
SWP/SWPB (ARMv2a and up only)
Two new FIQ registers

ARMv3 and up
MRS,MSR opcodes (instead CMP/CMN/TST/TEQ{P} opcodes)
CPSR,SPSR registers (instead PSR bits in R15)
Removed never condition, cond=NV no longer valid
32bit addressing (instead 26bit addressing in older versions)
26bit addressing backwards compatibility mode (except v3G)
Abt and Und modes (instead handling aborts/undefined in Svc mode)
SMLAL,SMULL,UMLAL,UMULL (optionally, INCLUDED in v3M, EXCLUDED in v4xM/v5xM)

ARMv4 aka ARM7 and up
LDRH,LDRSB,LDRSH,STRH
Sys mode (privileged user mode)
BX (only ARMv4T, and any ARMv5 or ARMv5T and up)
THUMB code (only T variants, ie. ARMv4T, ARMv5T)

ARMv5 aka ARM9 and up
BKPT,BLX,CLZ (BKPT,BLX also in THUMB mode)
LDM/LDR/POP PC with mode switch (POP PC also in THUMB mode)
CDP2,LDC2,MCR2,MRC2,STC2 (new coprocessor opcodes)
C-flag unchanged by MUL (instead undefined flag value)
changed instruction cycle timings / interlock ??? or not ???
QADD,QDADD,QDSUB,QSUB opcodes, CPSR.Q flag (v5TE and V5TExP only)
SMLAxy,SMLALxy,SMLAWy,SMULxy,SMULWy (v5TE and V5TExP only)
LDRD,STRD,PLD,MCRR,MRRC (v5TE only, not v5, not v5TExP)

ARMv6
No public specifications available.

A Milestone in Computer History
The original ARMv2 was used in the relatively rare and expensive Archimedes deluxe home computers in the late eighties. The Archimedes caught a lot of attention, particularly for being the first home computer that used a BIOS programmed in BASIC language - which was an absolutely revolutionary decadency at that time.
Inspired, programmers all over the world have successfully developed even slower and much more inefficient programming languages, which are nowadays consequently used by nearly all ARM programmers, and by most non-ARM programmers as well.


 CPU Data Sheet < ^

This present document is an attempt to supply a brief ARM7TDMI reference, hopefully including all information which is relevant for programmers.

Some details that I have treated as meaningless for GBA programming aren't included - such as Big Endian format, Virtual Memory data aborts, and most of the chapters listed below.

Have a look at the complete data sheet (URL see below) for more detailed verbose information about ARM7TDMI instructions. That document also includes:

- Signal Description
  Pins of the original CPU, probably other for GBA.
- Memory Interface
  Optional virtual memory circuits, etc. not for GBA.
- Coprocessor Interface
  As far as I know, none such in GBA.
- Debug Interface
  For external hardware-based debugging.
- ICEBreaker Module
  For external hardware-based debugging also.
- Instruction Cycle Operations
  Detailed: What happens during each cycle of each instruction.
- DC Parameters (Power supply)
- AC Parameters (Signal timings)

The official ARM7TDMI data sheet can be downloaded from ARM's webpage,
  http://www.arm.com/Documentation/UserMans/PDF/ARM7TDMI.html
Be prepared for bloated PDF Format, approx 1.3 MB, about 200 pages.