Having Fun with ARM64 Linux Kernel
Lately, I’ve been working on a project related to recording Linux kernel execution traces at the object code level. During this process, I had the opportunity to work with the ARM64 architecture on Linux. I decided to write this blog to capture some of the highlights I learned along the way.
Table of Contents
- Objdump ARM Kernel
- Some Booting Process Introduction
- .word Directives and udf Instructions
- Runtime Patching
Objdump ARM Kernel
Since I’m working with an x86-64 server, performing an object dump on ARM64 ELF files can’t be done simply using the native objdump program. Instead, I had to use the aarch64-linux-gnu-objdump program. This tool can be installed on Ubuntu via the binutils-aarch64-linux-gnu package.
The command I used to objdump an ARM64 kernel image is:
aarch64-linux-gnu-objdump -d -z vmlinux
Note that the -z flag in the command above ensures that the full object dump is shown, preventing zeroed lines from being collapsed into ellipses (…).
Some Booting Process Introduction
The beginning of the ARM64 kernel image objdump looks like this:
/linux/vmlinux: file format elf64-littleaarch64
Disassembly of section .head.text:
ffff800080000000 <_text>:
ffff800080000000: fa405a4d ccmp x18, #0x0, #0xd, pl // pl = nfrst
ffff800080000004: 14713c37 b ffff800081c4f0e0 <primary_entry>
ffff800080000008: 00000000 .word 0x00000000
ffff80008000000c: 00000000 .word 0x00000000
ffff800080000010: 02c20000 .word 0x02c20000
ffff800080000014: 00000000 .word 0x00000000
ffff800080000018: 0000000a .word 0x0000000a
ffff80008000001c: 00000000 .word 0x00000000
ffff800080000020: 00000000 .word 0x00000000
ffff800080000024: 00000000 .word 0x00000000
ffff800080000028: 00000000 .word 0x00000000
ffff80008000002c: 00000000 .word 0x00000000
ffff800080000030: 00000000 .word 0x00000000
ffff800080000034: 00000000 .word 0x00000000
ffff800080000038: 644d5241 .word 0x644d5241
ffff80008000003c: 00000040 .word 0x00000040
ffff800080000040: 00004550 .word 0x00004550
ffff800080000044: 0002aa64 .word 0x0002aa64
ffff800080000048: 00000000 .word 0x00000000
The first two lines are the instructions executed when the kernel is entered from the bootloader. The primary_entry function can be found in the linux/arch/arm64/kernel/head.S source file. Let’s take a look at the corresponding snippet:
/*
* Kernel startup entry point.
* ---------------------------
*
* The requirements are:
* MMU = off, D-cache = off, I-cache = on or off,
* x0 = physical address to the FDT blob.
*
* Note that the callee-saved registers are used for storing variables
* that are useful before the MMU is enabled. The allocations are described
* in the entry routines.
*/
__HEAD
/*
* DO NOT MODIFY. Image header expected by Linux boot-loaders.
*/
efi_signature_nop // special NOP to identity as PE/COFF executable
b primary_entry // branch to kernel start, magic
.quad 0 // Image load offset from start of RAM, little-endian
le64sym _kernel_size_le // Effective size of kernel image, little-endian
le64sym _kernel_flags_le // Informative flags, little-endian
.quad 0 // reserved
.quad 0 // reserved
.quad 0 // reserved
.ascii ARM64_IMAGE_MAGIC // Magic number
.long .Lpe_header_offset // Offset to the PE header.
__EFI_PE_HEADER
To better understand the connection between the head.S snippet and the objdump results, I made the following annotation:
/linux/vmlinux: file format elf64-littleaarch64
Disassembly of section .head.text:
ffff800080000000 <_text>:
ffff800080000000: fa405a4d ccmp x18, #0x0, #0xd || efi_signature_nop // special NOP to identity as PE/COFF executable
ffff800080000004: 14713c37 b ffff800081c4f0e0 || b primary_entry // branch to kernel start, magic
ffff800080000008: 00000000 .word 0x00000000 || .quad 0 // Image load offset from start of RAM, little-endian
ffff80008000000c: 00000000 .word 0x00000000
ffff800080000010: 02c20000 .word 0x02c20000 || le64sym _kernel_size_le // Effective size of kernel image, little-endian
ffff800080000014: 00000000 .word 0x00000000
ffff800080000018: 0000000a .word 0x0000000a || le64sym _kernel_flags_le // Informative flags, little-endian
ffff80008000001c: 00000000 .word 0x00000000
ffff800080000020: 00000000 .word 0x00000000 || .quad 0 // reserved
ffff800080000024: 00000000 .word 0x00000000
ffff800080000028: 00000000 .word 0x00000000 || .quad 0 // reserved
ffff80008000002c: 00000000 .word 0x00000000
ffff800080000030: 00000000 .word 0x00000000 || .quad 0 // reserved
ffff800080000034: 00000000 .word 0x00000000
ffff800080000038: 644d5241 .word 0x644d5241 || .ascii ARM64_IMAGE_MAGIC // Magic number
ffff80008000003c: 00000040 .word 0x00000040 || .long .Lpe_header_offset // Offset to the PE header.
ffff800080000040: 00004550 .word 0x00004550
ffff800080000044: 0002aa64 .word 0x0002aa64
ffff800080000048: 00000000 .word 0x00000000
... || __EFI_PE_HEADER
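So, the .word entries that follow the first two instructions are not code. They are the image header, that is, data used by the bootloader and by the kernel's initial setup. For instance, the word at offset 0x3c holds the PE header offset 0x40, and the word at 0x40 (0x00004550) is the "PE\0\0" signature that opens the PE header.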
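The annotation above can also be checked mechanically. Here is a minimal Python sketch, assuming the 64-byte header layout shown in the head.S snippet. The dictionary keys (text_offset, image_size, flags, pe_offset) are names chosen for illustration, and the page-size decoding of the flags field follows the kernel's arm64 booting documentation, not anything objdump prints. It rebuilds the header from the dumped words, decodes the fields, and resolves the branch in the second word.

```python
import struct

# The first 16 words of .head.text, copied from the objdump above.
HEADER_WORDS = [
    0xfa405a4d, 0x14713c37, 0x00000000, 0x00000000,
    0x02c20000, 0x00000000, 0x0000000a, 0x00000000,
    0x00000000, 0x00000000, 0x00000000, 0x00000000,
    0x00000000, 0x00000000, 0x644d5241, 0x00000040,
]
TEXT_VADDR = 0xffff800080000000  # address of _text in the dump

# Assumption: flag bits 1-2 encode the page size (per the booting docs).
PAGE_SIZES = {0: "unspecified", 1: "4K", 2: "16K", 3: "64K"}


def parse_image_header(words):
    """Decode the 64-byte arm64 image header from little-endian words."""
    raw = struct.pack("<16I", *words)
    (code0, code1, text_offset, image_size, flags,
     _res2, _res3, _res4, magic, pe_offset) = struct.unpack("<IIQQQQQQ4sI", raw)
    return {
        "code0": code0,
        "code1": code1,
        "text_offset": text_offset,
        "image_size": image_size,
        "flags": flags,
        "page_size": PAGE_SIZES[(flags >> 1) & 0x3],
        "magic": magic,
        "pe_offset": pe_offset,
    }


def branch_target(word, pc):
    """Return the target of an unconditional `b`, or None if word is not one."""
    if (word >> 26) != 0b000101:
        return None
    imm26 = word & 0x03ffffff
    if imm26 & (1 << 25):  # sign-extend the 26-bit immediate
        imm26 -= 1 << 26
    return (pc + imm26 * 4) & 0xffffffffffffffff


hdr = parse_image_header(HEADER_WORDS)
print("magic      :", hdr["magic"])
print("image size :", hex(hdr["image_size"]))
print("page size  :", hdr["page_size"])
print("b target   :", hex(branch_target(hdr["code1"], TEXT_VADDR + 4)))
```

The branch target comes out as the primary_entry address shown in the disassembly, which is a handy sanity check that the words were copied correctly.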
MMU Enabling
When recording the entire boot process of the ARM64 Linux kernel, I noticed that the start of the boot process was not captured by my tool when using virtual addresses. This happens because the MMU (Memory Management Unit) is not yet enabled at the very beginning of kernel initialization, so the earliest code runs at physical addresses.
In linux/arch/arm64/kernel/head.S, I located the following snippet at the end of the primary_entry function:
SYM_CODE_START(primary_entry)
...
/*
* The following calls CPU setup code, see arch/arm64/mm/proc.S for
* details.
* On return, the CPU will be ready for the MMU to be turned on and
* the TCR will have been set.
*/
bl __cpu_setup // initialise processor
b __primary_switch
SYM_CODE_END(primary_entry)
Following the comment, I found the __cpu_setup function in linux/arch/arm64/mm/proc.S. It comes with a clear comment saying:
/*
* __cpu_setup
*
* Initialise the processor for turning the MMU on.
*
* Output:
* Return in x0 the value of the SCTLR_EL1 register.
*/
With this, I can conclude the following takeaways:
- When the ARM64 Linux kernel boots, the bootloader runs first. Execution then enters the kernel image at its very beginning, and primary_entry is the first function to be called.
- The kernel image starts with a header that objdump shows as .word entries. It contains data for booting and kernel initialization (e.g. the ARM64 image magic number and the kernel image size).
- The primary_entry function can be found in linux/arch/arm64/kernel/head.S. At its end it calls __cpu_setup, which prepares the processor for turning the MMU on, and then branches to __primary_switch. Only once the MMU is on are virtual addresses mapped to physical ones.
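One practical consequence for tracing: before the MMU is on, the program counter holds physical addresses, so early-boot samples have to be rebased before they can be matched against vmlinux. Below is a minimal sketch of that rebasing, assuming the image is loaded contiguously and KASLR is not in play. LOAD_BASE is a hypothetical physical load address (the real value depends on the bootloader and the machine); TEXT_VADDR and IMAGE_SIZE come from the dump and header shown earlier.

```python
TEXT_VADDR = 0xffff800080000000  # _text in the vmlinux objdump
IMAGE_SIZE = 0x02c20000          # kernel size field from the image header
LOAD_BASE = 0x40200000           # HYPOTHETICAL physical load address


def phys_to_link_addr(pc, load_base=LOAD_BASE):
    """Map a pre-MMU physical PC to its vmlinux address.

    Returns None when the PC does not fall inside the loaded image.
    """
    offset = pc - load_base
    if 0 <= offset < IMAGE_SIZE:
        return TEXT_VADDR + offset
    return None


# A physical PC 0x1c4f0e0 bytes into the image lands on primary_entry.
print(hex(phys_to_link_addr(LOAD_BASE + 0x1c4f0e0)))
```

With a mapping like this, a trace recorded by physical address can be lined up with the same objdump listing used for the rest of the boot.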
How Does QEMU Emulate Kernel Booting?
There are two options to load a kernel in QEMU:
- Boot via UEFI firmware.
- Direct Linux kernel boot.
When booting via UEFI firmware, pass a command-line argument like -bios QEMU_EFI.fd when launching QEMU. QEMU then loads the UEFI firmware binary into memory, and the firmware runs as it would on real hardware.
When booting the Linux kernel directly, pass a command-line argument like -kernel Image. QEMU loads the kernel into guest RAM, loads the device tree, sets up the CPU registers, and then transfers control to the kernel.
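To make the two modes concrete, here is a small Python sketch that builds the QEMU argument list for either one. The -bios, -kernel and -append options are the ones used in this post; the machine type (virt), the CPU model (max), -nographic, and the -s/-S pair for attaching a debugger are assumptions about a typical setup, not the exact command line I ran.

```python
import shlex


def qemu_cmd(kernel=None, firmware=None, append=None, gdb_stub=False):
    """Build a qemu-system-aarch64 argument list for one of the two boot modes."""
    if (kernel is None) == (firmware is None):
        raise ValueError("pass exactly one of kernel= or firmware=")
    # Assumed baseline: generic 'virt' board, 'max' CPU model, serial console.
    cmd = ["qemu-system-aarch64", "-M", "virt", "-cpu", "max", "-nographic"]
    if kernel is not None:
        cmd += ["-kernel", kernel]          # direct Linux kernel boot
        if append:
            cmd += ["-append", append]      # kernel command line
    else:
        cmd += ["-bios", firmware]          # boot via UEFI firmware
    if gdb_stub:
        cmd += ["-s", "-S"]                 # gdb server on port 1234, CPU halted
    return cmd


print(shlex.join(qemu_cmd(kernel="Image", gdb_stub=True)))
print(shlex.join(qemu_cmd(firmware="QEMU_EFI.fd")))
```

Building the list in one place makes it easy to rerun the same kernel with a different kernel command line, which is what the follow-up experiment at the end of this post does.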
.word Directives and udf Instructions
In the ARM64 kernel image objdump, one can observe many instances of .word directives; examples can be found in the Some Booting Process Introduction section above. These .word entries contain data rather than executable code.
A question comes naturally: how does the objdump program tell whether a 4-byte value in an ELF file is an instruction or .word data?
Besides .word directives, one can also observe udf instances, for example:
ffff800080010d48: cb2063e0 sub x0, sp, x0
ffff800080010d4c: f274cc1f tst x0, #0xfffffffffffff000
ffff800080010d50: 54001581 b.ne ffff800080011000 <__bad_stack> // b.any
ffff800080010d54: cb2063ff sub sp, sp, x0
ffff800080010d58: d53bd060 mrs x0, tpidrro_el0
ffff800080010d5c: 14000262 b ffff8000800116e4 <el0t_64_fiq>
ffff800080010d60: 00000000 udf #0
ffff800080010d64: 00000000 udf #0
ffff800080010d68: 00000000 udf #0
ffff800080010d6c: 00000000 udf #0
ffff800080010d70: 00000000 udf #0
ffff800080010d74: 00000000 udf #0
ffff800080010d78: 00000000 udf #0
ffff800080010d7c: 00000000 udf #0
ffff800080010d80: 14000003 b ffff800080010d8c <vectors+0x58c>
These udf instructions are used as padding and are never executed. In general, udf instructions have the form udf #imm16: the upper 16 bits of the encoding are zero and the lower 16 bits hold the immediate. So 0x00000009 could also be a udf instruction, interpreted as udf #9.
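The encoding rule fits in a few lines of Python. This is a minimal sketch; decode_udf is a hypothetical helper name, not something taken from binutils.

```python
def decode_udf(word):
    """Return imm16 if the 32-bit word encodes `udf #imm16`, else None."""
    if (word & 0xffff0000) == 0:   # upper 16 bits must all be zero
        return word & 0xffff       # lower 16 bits are the immediate
    return None


for word in (0x00000000, 0x00000009, 0xd503201f):
    imm = decode_udf(word)
    text = f"udf #{imm}" if imm is not None else "not a udf"
    print(f"{word:08x}: {text}")
```

Note that this only says a word can be read as udf. Whether it should be read as an instruction at all is a separate question, which is what the next section is about.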
So, another natural question is: how does the objdump program decide whether 0x00000000 is rendered as .word 0x00000000 or as udf #0?
How Are .word and udf Recognized?
I did a tiny experiment with the following example, saved as example.s:
.section .text
.global _start
_start:
mov x0, #1 // set x0 = 1 (exit code)
.word 0xd2800020
udf #0
.word 0x00000000
mov x1, #2 // set x1 = 2
mov x8, #93 // syscall number for exit on ARM64
svc #0 // syscall
On an x86 machine, I assembled and linked it for ARM64:
aarch64-linux-gnu-as example.s -o example.o
aarch64-linux-gnu-ld example.o -o example
Next, I disassembled it with aarch64-linux-gnu-objdump:
Notice how the two consecutive 0x00000000 words got collapsed into ellipses!
I added the -z flag to ask for the complete objdump:
It turned out that the aarch64-linux-gnu-objdump program correctly recognizes both .word and udf!
Now, take a look at what the actual bytes look like in example:
Well, it appears that within the .text section the raw bytes of each 0xd2800020 pair and each 0x00000000 pair are indistinguishable!
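So, the takeaway is:
ELF files carry more than raw bytes: there is symbol information (function boundaries, labels, and assembler-generated metadata) that helps the objdump program tell .word and udf entries apart in the .text section. On AArch64 the relevant metadata is the set of mapping symbols emitted by the assembler, where $x marks the start of a run of instructions and $d marks the start of a run of data. The raw bytes themselves don’t reflect such differences.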
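The decision can be sketched in Python as a lookup against the mapping symbols. The MAPPING list below is illustrative: it is what I would expect the assembler to emit for example.s (one symbol at each switch between instructions and data), not output copied from a real symbol table. Running aarch64-linux-gnu-readelf -s on the object file should show the actual entries. Treating words before the first mapping symbol as code is also an assumption of this sketch.

```python
from bisect import bisect_right

# ILLUSTRATIVE mapping symbols for example.s: (offset in .text, symbol name).
MAPPING = [(0x0, "$x"), (0x4, "$d"), (0x8, "$x"), (0xc, "$d"), (0x10, "$x")]

# The first four words of example.s: mov, .word, udf, .word.
WORDS = [(0x0, 0xd2800020), (0x4, 0xd2800020), (0x8, 0x00000000), (0xc, 0x00000000)]


def region_kind(offset, mapping=MAPPING):
    """Return the mapping symbol ("$x" or "$d") that covers the given offset."""
    starts = [start for start, _ in mapping]
    index = bisect_right(starts, offset) - 1
    # Assumption: with no earlier mapping symbol, treat the word as code.
    return mapping[index][1] if index >= 0 else "$x"


def render(offset, word, mapping=MAPPING):
    """Render a word the way a mapping-symbol-aware disassembler might."""
    if region_kind(offset, mapping) == "$d":
        return f".word 0x{word:08x}"
    if (word & 0xffff0000) == 0:
        return f"udf #{word & 0xffff}"
    return f"<instruction 0x{word:08x}>"  # a real tool would decode this


for offset, word in WORDS:
    print(f"{offset:4x}: {word:08x}  {render(offset, word)}")
```

The same word value is rendered differently depending only on which mapping symbol precedes it, which matches what the experiment showed.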
Runtime Patching
An interesting instance caught my attention when I was analyzing the execution trace of the ARM64 Linux kernel with QEMU.
In the objdump of the kernel image, there is the following snippet containing a NOP instruction at 0xffff800080018800:
ffff8000800187b0 <copy_thread>:
ffff8000800187b0: d503233f paciasp
ffff8000800187b4: a9bb7bfd stp x29, x30, [sp, #-80]!
...
ffff8000800187e4: f9401275 ldr x21, [x19, #32]
ffff8000800187e8: f9401a96 ldr x22, [x20, #48]
ffff8000800187ec: f9402298 ldr x24, [x20, #64]
ffff8000800187f0: 9441b294 bl ffff800081085240 <__memset>
ffff8000800187f4: aa1303e0 mov x0, x19
ffff8000800187f8: 97fffa41 bl ffff8000800170fc <fpsimd_flush_task_state>
ffff8000800187fc: 1400002c b ffff8000800188ac <copy_thread+0xfc>
ffff800080018800: d503201f nop
ffff800080018804: f9403280 ldr x0, [x20, #96]
ffff800080018808: d287d604 mov x4, #0x3eb0 // #16048
ffff80008001880c: 8b0402b9 add x25, x21, x4
ffff800080018810: b5000700 cbnz x0, ffff8000800188f0 <copy_thread+0x140>
ffff800080018814: 914012b5 add x21, x21, #0x4, lsl #12
However, in the actual emulation by QEMU, I found that the NOP got swapped with the branch instruction before it!
That is, in the objdump of the kernel image we observe
ffff8000800187fc: 1400002c b ffff8000800188ac <copy_thread+0xfc>
ffff800080018800: d503201f nop
while in the QEMU emulation there was
ffff8000800187fc: d503201f nop
ffff800080018800: 1400002c b ffff8000800188ac <copy_thread+0xfc>
To double-check that the instructions really got swapped in the QEMU emulation, I used the gdb-multiarch program to inspect runtime memory:
It turned out that the instructions truly got swapped at runtime. So I started the investigation by reading the source code of the copy_thread function in linux/arch/arm64/kernel/process.c. After comparing the source code with the objdump snippet, I came up with a rough matching between them:
The function ptrauth_thread_init_kernel looked pretty suspicious. In linux/arch/arm64/include/asm/pointer_auth.h, it is defined as a call to ptrauth_keys_init_kernel:
#ifdef CONFIG_ARM64_PTR_AUTH_KERNEL
#define ptrauth_thread_init_kernel(tsk) \
ptrauth_keys_init_kernel(&(tsk)->thread.keys_kernel)
#define ptrauth_thread_switch_kernel(tsk) \
ptrauth_keys_switch_kernel(&(tsk)->thread.keys_kernel)
#else
#define ptrauth_thread_init_kernel(tsk)
#define ptrauth_thread_switch_kernel(tsk)
#endif /* CONFIG_ARM64_PTR_AUTH_KERNEL */
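At this point, I concluded that the unusual swapping behavior was caused by runtime patching of the pointer-authentication code guarded by this #ifdef. (The #ifdef itself is resolved at compile time; the patching happens later, while the kernel boots.)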
I came across a blog post about runtime patching in the Linux kernel: https://blogs.oracle.com/linux/post/exploring-arm64-runtime-patching-alternatives. It mentions that the source code related to ARM64 runtime patching can be found in linux/arch/arm64/kernel/alternative.c.
Specifically, the patch_alternative function (source code) appeared to be highly relevant:
With gdb, I set a breakpoint at patch_alternative to find out when this runtime patching takes place.
One can manually press Enter over a thousand times to find out how many times this function gets triggered. Alternatively, store the following gdb script as instr.txt and use it when launching gdb:
set architecture aarch64
target remote :1234
b patch_alternative
set $i = 0
while ($i < 10000)
continue
set $i = $i + 1
end
Launch gdb with the following command, and all breakpoint hits will be printed to the screen.
gdb-multiarch -x instr.txt vmlinux
So, the takeaways for this section:
- The ARM64 Linux kernel may patch itself at runtime, modifying some of the instructions loaded into memory. A blog post discussing this behavior can be found at: https://blogs.oracle.com/linux/post/exploring-arm64-runtime-patching-alternatives. It mentions that this patching is done at boot time.
- QEMU faithfully emulates this runtime patching behavior.
Follow Up: Could This Be Disabled?
As far as I know, there is no way to entirely disable runtime patching.
Taking the same example from above, when I used the same kernel image but added the -append arm64.nopauth command-line argument while launching QEMU, it produced the following runtime memory:
This time, the branch instruction at 0xffff8000800187fc got replaced by a NOP instruction, while the original NOP didn’t get replaced. So, no matter how you try to disable runtime patching, the necessary patches will still be applied to match your CPU (or the configuration of your QEMU emulation).
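The comparisons made in this section (static objdump versus runtime memory, with and without arm64.nopauth) boil down to a word-by-word diff. Here is a minimal sketch using only the two addresses and the encodings quoted in this post. The dictionaries are typed in by hand from the listings above; in practice they would be filled from the objdump output and from gdb memory reads. The names patched_words, runtime_default and runtime_nopauth are made up for illustration.

```python
NOP = 0xd503201f
BRANCH = 0x1400002c  # encoding of the `b` as listed in this post

# Words at the two addresses of interest, as listed in the post.
static = {0xffff8000800187fc: BRANCH, 0xffff800080018800: NOP}
runtime_default = {0xffff8000800187fc: NOP, 0xffff800080018800: BRANCH}
runtime_nopauth = {0xffff8000800187fc: NOP, 0xffff800080018800: NOP}


def patched_words(static_words, runtime_words):
    """Return (address, static word, runtime word) for every word that differs."""
    return [
        (addr, static_words[addr], runtime_words[addr])
        for addr in sorted(static_words)
        if addr in runtime_words and static_words[addr] != runtime_words[addr]
    ]


for label, runtime in (("default", runtime_default), ("nopauth", runtime_nopauth)):
    print(label)
    for addr, before, after in patched_words(static, runtime):
        print(f"  {addr:016x}: {before:08x} -> {after:08x}")
```

Applied to a whole .text section, the same diff would give a list of every site the kernel patched during boot for a given CPU configuration.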
Update:
Another blog post about kernel patch alternatives (link) introduces a related IDA plugin.