Friday, December 14, 2007

ARM MMU in Linux: page table initialization and tweaks for integration with memory management code.

This post covers the details of the MMU initialization code of the Linux 2.6.11 kernel for ARM, and a few internal tweaks used for mapping ARM page tables to the x86-style page tables which the Linux memory management code expects. We'll cover only architecture-specific details; the generic MM code of the kernel will not be covered here. For all the architecture-specific details we take ARM11 (architecture V6) as our reference.

Following conventions are used in this writeup:
  • All 'C' function names are in italics.
  • All 'Assembly' function names and labels are mentioned in bold italics.
  • No file names will be mentioned here (with exception of assembly files), use ctags.
  • All details are for uniprocessor system, nothing related to SMP is covered here.
Kernel execution starts at stext (in arch/arm/kernel/head.S); at this point the kernel expects that the MMU, I-cache and D-cache are all off. After looking up the processor type and machine number, we call __create_page_tables, which sets up the initial MMU tables that are used only during initialization of the MMU. Here we create MMU mapping tables for 4MB of memory starting from the kernel's text (more precisely, starting at the 1MB section in which the kernel's text starts). The following is the code snippet which creates these mappings:

-----------------------------

__create_page_tables:

  1. ldr r5, [r8, #MACHINFO_PHYSRAM] @ physram
  2. pgtbl r4, r5 @ page table address
  3. mov r0, r4
  4. mov r3, #0
  5. add r6, r0, #0x4000
  6. 1: str r3, [r0], #4
  7. str r3, [r0], #4
  8. str r3, [r0], #4
  9. str r3, [r0], #4
  10. teq r0, r6
  11. bne 1b
  12. ldr r7, [r10, #PROCINFO_MMUFLAGS] @ mmuflags
  13. mov r6, pc, lsr #20 @ start of kernel section
  14. orr r3, r7, r6, lsl #20 @ flags + kernel base
  15. str r3, [r4, r6, lsl #2] @ identity mapping
  16. add r0, r4, #(TEXTADDR & 0xff000000) >> 18 @ start of kernel
  17. str r3, [r0, #(TEXTADDR & 0x00f00000) >> 18]!
  18. add r3, r3, #1 << 20
  19. str r3, [r0, #4]! @ KERNEL + 1MB
  20. add r3, r3, #1 << 20
  21. str r3, [r0, #4]! @ KERNEL + 2MB
  22. add r3, r3, #1 << 20
  23. str r3, [r0, #4] @ KERNEL + 3MB

-----------------------------

In this function we first find the physical address where our RAM starts, and locate the address where we'll store our initial MMU tables (lines #1 and #2). We create these initial MMU tables 16KB below the kernel entry point. Lines #12-#15 create an identity mapping of 1MB starting from the kernel entry point (the physical address of kernel entry, i.e. stext). We don't create a second-level page table for this mapping; instead, we specify in the first-level descriptor that the mapping is for a section (each section mapping covers 1MB). Similarly we create mappings for 4 sections (these are non-identity mappings, again 1MB each) starting at TEXTADDR (the virtual address of the kernel entry point).


So the initial memory map looks something like this:

[Figure missing: it showed the 1MB identity-mapped section at the kernel's physical address, plus the 4MB kernel mapping at TEXTADDR.]
After the initial page tables are set up, the next step is to enable the MMU. This code is tricky and does a lot of deep magic.

----------------------------


  1. ldr r13, __switch_data @ address to jump to after mmu has been enabled
  2. adr lr, __enable_mmu @ return (PIC) address
  3. add pc, r10, #PROCINFO_INITFUNC

  4. .type __switch_data, %object

  5. __switch_data: .long __mmap_switched
  6. .long __data_loc @ r4
  7. .long __data_start @ r5
  8. .long __bss_start @ r6
  9. .long _end @ r7
  10. .long processor_id @ r4
  11. .long __machine_arch_type @ r5
  12. .long cr_alignment @ r6
  13. .long init_thread_union+8192 @ sp
-----------------------------


Line #1 puts the virtual address of __mmap_switched in r13; after enabling the MMU the kernel will jump to the address in r13. The virtual address used here is not from the identity mapping; it is PAGE_OFFSET + the physical address of __mmap_switched. And since in __mmap_switched we start referring to virtual addresses of variables and functions, execution from __mmap_switched onwards is no longer position independent.

At lines #2-#3, we put the position-independent address of __enable_mmu in lr (the 'adr' pseudo-instruction translates to PC-relative addressing, which is why it is position independent) and jump to offset PROCINFO_INITFUNC (12 bytes) in the __v6_proc_info structure (arch/arm/mm/proc-v6.S). At PROCINFO_INITFUNC in __v6_proc_info we have a branch to __v6_setup, which does the following setup for enabling the MMU:

  • Clean and invalidate the D-cache and I-cache, and invalidate the TLB.
  • Prepare the control register 1 (c1) value that needs to be written when enabling the MMU, and return that value.
As __v6_setup returns we enter __enable_mmu, which just sets the cache-enable and branch-prediction-enable bits in r0, the value to be written to c1 (this value was returned in r0 by __v6_setup), and then calls __turn_mmu_on.

--------------------------------------

__turn_mmu_on:


  1. mov r0, r0
  2. mcr p15, 0, r0, c1, c0, 0 @ write control reg
  3. mrc p15, 0, r3, c0, c0, 0 @ read id reg
  4. mov r3, r3
  5. mov r3, r3
  6. mov pc, r13
--------------------------------------

__turn_mmu_on just writes r0 to c1 to enable the MMU. Lines #4 and #5 are nops which make sure that the pipeline does not contain an invalid address access when c1 is written. At line #6, r13 is moved into pc and we enter __mmap_switched. Now the MMU is on: every address is virtual, no physical addresses anymore. But the final kernel page tables are still not set up (they will be set up by paging_init, and the mappings created by __create_page_tables will be discarded); we are still running with the 4MB mapping that __create_page_tables set up for us. __mmap_switched copies the data segment if required, clears the BSS and calls start_kernel.

The page table mappings that we discussed above make sure that the position-dependent code of kernel startup runs peacefully; these mappings are overwritten at a later stage by paging_init( ) (start_kernel( ) -> setup_arch( ) -> paging_init( )).

paging_init( ) populates the master L1 page table (init_mm) with linear mappings of the complete SDRAM and the SOC-specific address space (SOC-specific memory mappings are created by the mdesc->map_io( ) function; this function pointer is initialized by SOC-specific code, arch/arm/mach-*). So in the master L1 page table (init_mm) we have mappings which map virtual addresses in the range PAGE_OFFSET to (PAGE_OFFSET + SDRAM size) to physical addresses from SDRAM start to (SDRAM start + SDRAM size). We also have the SOC-specific mappings created by mdesc->map_io( ).

One more point worth noting here is that whenever a new process is created, a new L1 page table is allocated for it, and the kernel mappings (SDRAM mapping, SOC-specific mappings) are copied into it from the master L1 page table (init_mm). Every process has its own user-space mappings, so nothing needs to be copied for user space.

Handling of mappings for the VMALLOC region is a bit tricky, because vmalloc virtual addresses are allocated when a process calls vmalloc( ). So processes created before the process which called vmalloc( ) will have no mapping for the newly vmalloc'ed region in their L1 page tables. The way this is taken care of is very simple: the mappings for the vmalloc'ed region are updated only in the master L1 page table (init_mm), and when a process whose page tables lack the new mapping accesses the vmalloc'ed region, a page fault is generated. The kernel handles page faults in the VMALLOC region specially, by copying the mappings for the newly vmalloc'ed area into the page tables of the process which generated the fault.

Tweaks for integrating ARM 2-level page tables with the Linux implementation of 3-level ix86-style page tables

1. Linux assumes that it is dealing with 3-level page tables, whereas ARM has 2-level page tables. To handle this, the ARM arch include files define __pmd as an identity macro (include/asm-arm/page.h):

-------------------------------------------------

#define __pmd(x) ((pmd_t) { (x) } )

--------------------------------------------------

So for ARM, Linux is told that the pmd has just one entry, effectively bypassing the pmd level.

2. The memory management code in Linux expects ix86-type page table entries; for example, it uses the 'P' (present) and 'D' (dirty) bits, but ARM page table entries (PTEs) don't have these bits. As a workaround to provide ix86 PTE flags, the ARM page table implementation tells Linux that the PGD has 2048 entries of 8 bytes each (whereas the ARM hardware level-1 page table has 4096 entries of 4 bytes each).

It also tells Linux that each PTE table has 512 entries (whereas an ARM hardware PTE table has 256 entries). This means that the PTE table exposed to Linux is actually 2 ARM PTE tables, arranged contiguously in memory. After these 2 ARM PTE tables (let's say h/w PTE tables 1 and 2), 2 more PTE tables (256 entries each, say Linux PTE tables 1 and 2) are allocated in memory contiguous to ARM hardware PTE table 2. Linux PTE tables 1 and 2 contain the ix86-style PTE flags corresponding to the entries in ARM PTE tables 1 and 2. So whenever Linux needs ix86-style PTE flags, it uses the entries in Linux PTE tables 1 and 2. ARM never looks into the Linux PTE tables during a hardware page table walk; it uses only the h/w PTE tables as mentioned above. Refer to include/asm-arm/pgtable.h (lines 20-76) for details of how the h/w and Linux PTE tables are organized in memory.

So here we conclude the architecture specific page table related stuff for ARM.

Friday, November 16, 2007

The story of interrupt handling in Linux 2.6.11 on ARM.

This post explains how interrupts are handled in the Linux kernel, and what homework the kernel has to do before the first interrupt is received. We take the 2.6.11 kernel as reference, and for architecture-specific details we use an ARM11-based development board.

Following conventions are used in this writeup:

  • All 'C' function names are in italics.
  • All 'Assembly' function names and labels are mentioned in bold italics.
  • No file names will be mentioned here (with exception of assembly files), use ctags.
  • All details are for uniprocessor system, nothing related to SMP is covered here.
  • Any other convention - ???? :) NO.

We'll cover the whole interrupt stuff in two sections:

  • Interrupt setup - Explanation of Generic and architecture specific setup that kernel does.

  • Interrupt handling - Explanation of what happens after the processor receives an interrupt.

1. Interrupt setup

start_kernel( ) is the first 'C' function that opens its eyes when the kernel is booting up. It initializes various subsystems of the kernel, including the IRQ system. Initialization of IRQs requires that you have a valid vector table in place and first-level interrupt handlers in place; both of these things are architecture specific. Let's set up the vector table first.

start_kernel( ) calls a function called trap_init( ) which does following:

  • Call __trap_init( ) to set up the exception vector table at location 0xffff0000.
  • Flush the I-cache in the range 0xffff0000 to 0xffff0000 + PAGE_SIZE. This is required because __trap_init( ) moves the vector table and vector stubs to 0xffff0000.

The vector table and vector stub code for ARM reside in the arch/arm/kernel/entry-armv.S file. In this file you'll find the implementation of the __trap_init( ) function. The following is a code snippet from entry-armv.S; we'll go into the details of this code:

----------------------------------------------------

  1. .equ __real_stubs_start, .LCvectors + 0x200
  2. .LCvectors:
  3. swi SYS_ERROR0
  4. b __real_stubs_start + (vector_und - __stubs_start)
  5. ldr pc, __real_stubs_start + (.LCvswi - __stubs_start)
  6. b __real_stubs_start + (vector_pabt - __stubs_start)
  7. b __real_stubs_start + (vector_dabt - __stubs_start)
  8. b __real_stubs_start + (vector_addrexcptn - __stubs_start)
  9. b __real_stubs_start + (vector_irq - __stubs_start)
  10. b __real_stubs_start + (vector_fiq - __stubs_start)
  11. ENTRY(__trap_init)
  12. stmfd sp!, {r4 - r6, lr}
  13. mov r0, #0xff000000
  14. orr r0, r0, #0x00ff0000 @ high vectors position
  15. adr r1, .LCvectors @ set up the vectors
  16. ldmia r1, {r1, r2, r3, r4, r5, r6, ip, lr}
  17. stmia r0, {r1, r2, r3, r4, r5, r6, ip, lr}
  18. add r2, r0, #0x200
  19. adr r0, __stubs_start @ copy stubs to 0x200
  20. adr r1, __stubs_end
  21. 1: ldr r3, [r0], #4
  22. str r3, [r2], #4
  23. cmp r0, r1
  24. blt 1b

LOADREGS(fd, sp!, {r4 - r6, pc})

-------------------------------------------------------------
In the code snippet given above, the vector table is between lines 4-10. As we can see, this vector table contains branch instructions for branching to the exception handler code, which also resides in the same file (entry-armv.S). The vector table contains branch instructions for all the exceptions defined in ARM (undefined instruction, SWI, data abort, prefetch abort, IRQ, and FIQ). The most important thing to note about this vector table is that branch instructions are used for all exceptions except SWI. Using a branch instruction, instead of loading the PC directly with the exception handler's address, makes this code position independent. Since a branch instruction takes a signed offset from the current PC, this code will run fine as long as the offset between the vector table instructions and the exception handlers is maintained as desired by this code. It is assumed here that the exception handlers will be at offset +0x200 from the starting address of the vector table.

The __trap_init( ) function copies the vector table to location 0xffff0000 (lines 13-17) and copies the exception handler stubs to 0xffff0200 (lines 18-24). Remember that the addresses we are talking about here are virtual addresses.

So we are done with setting up the vector table and the exception handlers. If you want, you can hook your exception handler directly into the vector table, bypassing all the Linux interrupt handling code, which is pretty heavy. But if you do so, you'll have to get your hands dirty with all the architecture details which the kernel handles beautifully and cleanly.

After setting up the vector table, start_kernel( ) calls init_IRQ( ) to set up the kernel's IRQ handling infrastructure. On ARM we have 32 hard interrupts, for which the kernel sets up a default descriptor called bad_irq_desc, which has do_bad_IRQ( ) as its IRQ handler. Then init_IRQ( ) calls init_arch_irq( ); here the architecture-specific code has to set up the IRQ handlers for the 32 IRQs. If they are not set, do_bad_IRQ( ) will handle the IRQs. On our reference architecture we set up do_level_IRQ( ) as the IRQ handler for all IRQs except the system timer IRQ. This is the second place where you can bypass the kernel IRQ handlers and hook your IRQ handler directly. If you don't have a requirement to hook your IRQ handler here, just let do_level_IRQ( ) handle the IRQs; you can then register your IRQ handlers with request_irq( ) in the traditional way, and the kernel will call your IRQ handler whenever an interrupt occurs, hiding all the dirty arch details :)

So now we have our IRQ infrastructure in place, and various modules can register their IRQ handlers through request_irq( ). When you call request_irq( ), the kernel appends your IRQ handler to the list of IRQ handlers registered for that particular IRQ line; it does not change the exception vector table.

Now let's see what happens after an interrupt is received.

2. Interrupt Handling

When an IRQ is raised, ARM stops what it is processing (assuming it is not processing an FIQ!), disables further IRQs (not FIQs), puts the CPSR into the SPSR, puts the current PC into the LR, switches to IRQ mode, refers to the vector table and jumps to the exception handler. In our case it jumps to the exception handler for IRQ. The following is the snippet of exception handler code for IRQ (again from the entry-armv.S file):

------------------------------------------------------------------------

  1. vector_irq:
  2. ldr r13, .LCsirq
  3. .if \correction
  4. sub lr, lr, #\correction
  5. .endif
  6. str lr, [r13] @ save lr_IRQ
  7. mrs lr, spsr
  8. str lr, [r13, #4] @ save spsr_IRQ
  9. mrs r13, cpsr
  10. bic r13, r13, #MODE_MASK
  11. orr r13, r13, #MODE_SVC
  12. msr spsr_cxsf, r13 @ switch to SVC_32 mode
  13. and lr, lr, #15
  14. ldr lr, [pc, lr, lsl #2]
  15. movs pc, lr @ Changes mode and branches
  16. .long __irq_usr @ 0 (USR_26 / USR_32)
  17. .long __irq_invalid @ 1 (FIQ_26 / FIQ_32)
  18. .long __irq_invalid @ 2 (IRQ_26 / IRQ_32)
  19. .long __irq_svc @ 3 (SVC_26 / SVC_32)

--------------------------------------------------------------------------

In the above snippet, a macro called vector_stub has been intentionally expanded to improve readability. So when ARM refers to the vector table, it follows the branch and lands at line 2. At this moment ARM is in IRQ mode, IRQs are disabled, LR contains the PC at the time the interrupt occurred and SPSR contains the CPSR at that time. Since we are in IRQ mode, r13 (SP) is banked, so we load SP with the address of a small stack frame that has been created at __temp_irq (.LCsirq contains the address of __temp_irq). This stack is only used while we are in IRQ mode.

In lines 6-8 we save LR and SPSR on the temporary IRQ stack (__temp_irq), and then we switch to SVC mode (lines 10-12). After this basic setup is done, depending on the mode ARM was in when the interrupt occurred, we switch to a specific handler. We'll assume that ARM was executing in SVC mode, so we'll look into the details of __irq_svc.

__irq_svc saves r0-r12 on the SVC mode stack (i.e. the kernel stack of the process which was interrupted), reads LR and SPSR from the temporary IRQ stack (__temp_irq) and saves them on the SVC mode stack, increments the preempt count and then calls get_irqnr_and_base to find out the IRQ line number. Each architecture has to provide an implementation of get_irqnr_and_base, in which it has to query the interrupt controller in an arch-specific way to find out which IRQ line raised this interrupt.

After doing all this, __irq_svc calls asm_do_IRQ( ). Finally we are out of the assembly code; now life will be simpler :). After asm_do_IRQ( ) returns, __irq_svc will restore the state of the process which was interrupted.

asm_do_IRQ( ) just calls the IRQ handler that was registered by the architecture code (refer to section 1); in our case we had set do_level_IRQ( ) as the interrupt handler for all IRQs except the timer IRQ, so for all IRQs except the timer, do_level_IRQ( ) will handle our interrupts.

do_level_IRQ( ) first ACKs the interrupt by calling the architecture-specific ACK function. On our reference architecture we just mask the interrupt line, which means that no IRQs on the same line will be allowed until all the IRQ handlers have completed their job. After this, do_level_IRQ( ) checks whether we have any action registered for this IRQ (the interrupt handlers that you register through request_irq( ) are called actions). If there is an action registered, __do_irq( ) is called, which in turn enables interrupts (remember ARM had disabled them), except the current IRQ line, and executes the actions.

After all the actions have completed executing, the IRQ line for which the interrupt was raised is unmasked and do_level_IRQ( ) returns. After this, interrupt handling is complete; __irq_svc restores the state of the interrupted process, and that process lives happily until another interrupt bugs it again :)

That was the overview of interrupt handling in Linux on ARM; we hope it was useful :)

- Pankaj And Sripurna.