Having Fun with ARM64 Linux Kernel

Lately, I’ve been working on a project related to recording Linux kernel execution traces at the object code level. During this process, I had the opportunity to work with the ARM64 architecture on Linux. I decided to write this blog to capture some of the highlights I learned along the way.

Objdump an ARM64 Kernel

Since I’m working on an x86-64 server, an object dump of ARM64 ELF files can’t be produced with the native objdump program. Instead, I had to use aarch64-linux-gnu-objdump, which can be installed on Ubuntu via the binutils-aarch64-linux-gnu package.
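
On Ubuntu, installing the package looks like:

sudo apt install binutils-aarch64-linux-gnu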

The command I used to objdump an ARM64 kernel image is:

aarch64-linux-gnu-objdump -d -z vmlinux

Note that the -z flag in the command above ensures that the full object dump is shown, preventing zeroed lines from being collapsed into ellipses (…).

Booting Process - A Brief Introduction

The beginning of the ARM64 kernel image objdump looks like this:

/linux/vmlinux:     file format elf64-littleaarch64


Disassembly of section .head.text:

ffff800080000000 <_text>:
ffff800080000000:	fa405a4d 	ccmp	x18, #0x0, #0xd, pl	// pl = nfrst
ffff800080000004:	14713c37 	b	ffff800081c4f0e0 <primary_entry>
ffff800080000008:	00000000 	.word	0x00000000
ffff80008000000c:	00000000 	.word	0x00000000
ffff800080000010:	02c20000 	.word	0x02c20000
ffff800080000014:	00000000 	.word	0x00000000
ffff800080000018:	0000000a 	.word	0x0000000a
ffff80008000001c:	00000000 	.word	0x00000000
ffff800080000020:	00000000 	.word	0x00000000
ffff800080000024:	00000000 	.word	0x00000000
ffff800080000028:	00000000 	.word	0x00000000
ffff80008000002c:	00000000 	.word	0x00000000
ffff800080000030:	00000000 	.word	0x00000000
ffff800080000034:	00000000 	.word	0x00000000
ffff800080000038:	644d5241 	.word	0x644d5241
ffff80008000003c:	00000040 	.word	0x00000040
ffff800080000040:	00004550 	.word	0x00004550
ffff800080000044:	0002aa64 	.word	0x0002aa64
ffff800080000048:	00000000 	.word	0x00000000

The first two lines are instructions executed when the kernel is entered from the bootloader. The primary_entry function can be found in the linux/arch/arm64/kernel/head.S source file. Let’s take a look at the corresponding snippet:

/*
 * Kernel startup entry point.
 * ---------------------------
 *
 * The requirements are:
 *   MMU = off, D-cache = off, I-cache = on or off,
 *   x0 = physical address to the FDT blob.
 *
 * Note that the callee-saved registers are used for storing variables
 * that are useful before the MMU is enabled. The allocations are described
 * in the entry routines.
 */
	__HEAD
	/*
	 * DO NOT MODIFY. Image header expected by Linux boot-loaders.
	 */
	efi_signature_nop			// special NOP to identity as PE/COFF executable
	b	primary_entry			// branch to kernel start, magic
	.quad	0				// Image load offset from start of RAM, little-endian
	le64sym	_kernel_size_le			// Effective size of kernel image, little-endian
	le64sym	_kernel_flags_le		// Informative flags, little-endian
	.quad	0				// reserved
	.quad	0				// reserved
	.quad	0				// reserved
	.ascii	ARM64_IMAGE_MAGIC		// Magic number
	.long	.Lpe_header_offset		// Offset to the PE header.

	__EFI_PE_HEADER

To better understand the connection between the head.S snippet and the objdump results, I made the following annotation:

/linux/vmlinux:     file format elf64-littleaarch64


Disassembly of section .head.text:

ffff800080000000 <_text>:
ffff800080000000:	fa405a4d 	ccmp	x18, #0x0, #0xd || efi_signature_nop // special NOP to identity as PE/COFF executable
ffff800080000004:	14713c37 	b	ffff800081c4f0e0    || b primary_entry // branch to kernel start, magic
ffff800080000008:	00000000 	.word	0x00000000      || .quad    0 // Image load offset from start of RAM, little-endian
ffff80008000000c:	00000000 	.word	0x00000000                    
ffff800080000010:	02c20000 	.word	0x02c20000      || le64sym _kernel_size_le // Effective size of kernel image, little-endian
ffff800080000014:	00000000 	.word	0x00000000
ffff800080000018:	0000000a 	.word	0x0000000a      || le64sym _kernel_flags_le // Informative flags, little-endian
ffff80008000001c:	00000000 	.word	0x00000000
ffff800080000020:	00000000 	.word	0x00000000      || .quad    0 // reserved
ffff800080000024:	00000000 	.word	0x00000000
ffff800080000028:	00000000 	.word	0x00000000      || .quad    0 // reserved
ffff80008000002c:	00000000 	.word	0x00000000
ffff800080000030:	00000000 	.word	0x00000000      || .quad    0 // reserved
ffff800080000034:	00000000 	.word	0x00000000
ffff800080000038:	644d5241 	.word	0x644d5241      || .ascii	ARM64_IMAGE_MAGIC // Magic number
ffff80008000003c:	00000040 	.word	0x00000040      || .long	.Lpe_header_offset // Offset to the PE header.
ffff800080000040:	00004550 	.word	0x00004550
ffff800080000044:	0002aa64 	.word	0x0002aa64
ffff800080000048:	00000000 	.word	0x00000000
...                                                     || __EFI_PE_HEADER

So, the .word directives following the first two instructions represent data used by the boot loader and early kernel setup rather than executable code; for example, 0x644d5241 at offset 0x38 is the little-endian encoding of the ARM64_IMAGE_MAGIC string “ARM\x64”.

MMU Enabling

When recording the entire booting process of the ARM64 Linux kernel, I noticed that the start of the boot was not captured by my tool when tracing by virtual addresses. This happens because the MMU (Memory Management Unit) is not yet enabled at the very beginning of kernel initialization.

In linux/arch/arm64/kernel/head.S I located the following snippet at the end of the primary_entry function:

SYM_CODE_START(primary_entry)
...
	/*
	 * The following calls CPU setup code, see arch/arm64/mm/proc.S for
	 * details.
	 * On return, the CPU will be ready for the MMU to be turned on and
	 * the TCR will have been set.
	 */
	bl	__cpu_setup			// initialise processor
	b	__primary_switch
SYM_CODE_END(primary_entry)

Following that comment, I found the __cpu_setup function in linux/arch/arm64/mm/proc.S; it comes with a clear header comment:

/*
 *	__cpu_setup
 *
 *	Initialise the processor for turning the MMU on.
 *
 * Output:
 *	Return in x0 the value of the SCTLR_EL1 register.
 */

With this, I can draw the following takeaways:

  1. When the ARM64 Linux kernel boots, the bootloader runs first. Execution then enters the kernel image at its very beginning, where primary_entry is the first function to be reached. The image reserves a block of .word data at the start that carries boot and initialization information (e.g. the ARM64 image magic number and the kernel image size).
  2. The primary_entry function lives in linux/arch/arm64/kernel/head.S. Near its end it calls __cpu_setup, which prepares the processor for turning the MMU on, and then branches to __primary_switch; only after the MMU is enabled do virtual addresses get mapped to physical ones.

How Does QEMU Emulate Kernel Booting?

There are two options to load a kernel in QEMU:

  1. Boot via UEFI firmware.
  2. Direct Linux kernel boot.

When booting via UEFI firmware, pass a command-line argument like -bios QEMU_EFI.fd when launching QEMU; QEMU loads the UEFI firmware binary into memory and runs it just as it would on real hardware.

When booting a Linux kernel directly, pass a command-line argument like -kernel Image; QEMU loads the kernel into guest RAM, loads the device tree, sets up the CPU registers, and then transfers control to the kernel.
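
For reference, a typical direct-boot invocation looks something like the following (the machine type, memory size, and file names are placeholders for whatever setup you use):

qemu-system-aarch64 -machine virt -cpu cortex-a57 -m 2G -nographic \
    -kernel Image -initrd rootfs.cpio.gz -append "console=ttyAMA0"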

.word Directives and udf Instructions

In an ARM64 kernel image objdump, one can observe many instances of .word directives; examples can be found in the Booting Process - A Brief Introduction section above. These .word entries contain data rather than executable code.

A question comes naturally: how does the objdump program tell whether a 4-byte value in an ELF file is an instruction or .word data?

Besides .word directives, one can also observe udf instances, for example:

ffff800080010d48:       cb2063e0        sub     x0, sp, x0
ffff800080010d4c:       f274cc1f        tst     x0, #0xfffffffffffff000
ffff800080010d50:       54001581        b.ne    ffff800080011000 <__bad_stack>  // b.any
ffff800080010d54:       cb2063ff        sub     sp, sp, x0
ffff800080010d58:       d53bd060        mrs     x0, tpidrro_el0
ffff800080010d5c:       14000262        b       ffff8000800116e4 <el0t_64_fiq>
ffff800080010d60:       00000000        udf     #0
ffff800080010d64:       00000000        udf     #0
ffff800080010d68:       00000000        udf     #0
ffff800080010d6c:       00000000        udf     #0
ffff800080010d70:       00000000        udf     #0
ffff800080010d74:       00000000        udf     #0
ffff800080010d78:       00000000        udf     #0
ffff800080010d7c:       00000000        udf     #0
ffff800080010d80:       14000003        b       ffff800080010d8c <vectors+0x58c>

These udf instructions are used as padding and are never executed. In general, udf instructions take the form udf #imm16, meaning 0x00000009 could also be a udf instruction, interpreted as udf #9.

So, another natural question: how does the objdump program decide whether a 0x00000000 is rendered as .word 0x00000000 or as udf #0?

How Are .word and udf Recognized?

I did a tiny experiment with the following code, saved as example.s:

    .section .text
    .global _start
_start:
    mov x0, #1             // set x0 = 1 (exit code)
    .word 0xd2800020
    udf #0
    .word 0x00000000
    mov x1, #2             // set x1 = 2
    mov x8, #93            // syscall number for exit on ARM64
    svc #0                 // syscall

On an x86-64 machine, I assembled and linked it for ARM64.

aarch64-linux-gnu-as example.s -o example.o
aarch64-linux-gnu-ld example.o -o example

Next, I disassembled it with aarch64-linux-gnu-objdump: disassembled

Notice how the two consecutive 0x00000000 words were collapsed into an ellipsis!

I added the -z flag to ask for the complete objdump: disassembled

It turned out that the aarch64-linux-gnu-objdump program correctly recognizes both .word and udf!!

Now, take a look at what the actual binary in example looks like:

offset binary

Well, within the .text section the raw bytes are indistinguishable: mov x0, #1 and .word 0xd2800020 are the same four bytes, and so are udf #0 and .word 0x00000000!

So, the takeaway is:

In ELF files there is symbol information alongside the raw bytes: on AArch64 the assembler emits mapping symbols such as $x (start of a code region) and $d (start of a data region), in addition to labels, function boundaries, and optional debug info (like DWARF). It is this metadata that lets objdump decide between .word and udf in a .text section; the raw bytes themselves carry no such distinction.
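
If I recall the toolchain behavior correctly, these mapping symbols can be listed straight from the symbol table of the small example above:

aarch64-linux-gnu-readelf -s example.o | grep -E '\$[xd]'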

Runtime Patching

Prologue

An interesting instance caught my attention when I was analyzing the execution trace of the ARM64 Linux kernel with QEMU.

In the objdump of the kernel image, there is the following snippet containing an NOP instruction at 0xffff800080018800:

ffff8000800187b0 <copy_thread>:
ffff8000800187b0:	d503233f 	paciasp
ffff8000800187b4:	a9bb7bfd 	stp	x29, x30, [sp, #-80]!
...
ffff8000800187e4:	f9401275 	ldr	x21, [x19, #32]
ffff8000800187e8:	f9401a96 	ldr	x22, [x20, #48]
ffff8000800187ec:	f9402298 	ldr	x24, [x20, #64]
ffff8000800187f0:	9441b294 	bl	ffff800081085240 <__memset>
ffff8000800187f4:	aa1303e0 	mov	x0, x19
ffff8000800187f8:	97fffa41 	bl	ffff8000800170fc <fpsimd_flush_task_state>
ffff8000800187fc:	1400002c 	b	ffff8000800188ac <copy_thread+0xfc>
ffff800080018800:	d503201f 	nop
ffff800080018804:	f9403280 	ldr	x0, [x20, #96]
ffff800080018808:	d287d604 	mov	x4, #0x3eb0                	// #16048
ffff80008001880c:	8b0402b9 	add	x25, x21, x4
ffff800080018810:	b5000700 	cbnz	x0, ffff8000800188f0 <copy_thread+0x140>
ffff800080018814:	914012b5 	add	x21, x21, #0x4, lsl #12

However, in the actual emulation by QEMU, I found that the NOP had been swapped with the branch instruction before it!

qemu1

That is to say, in the objdump of the kernel image we observe

ffff8000800187fc:	1400002c 	b	ffff8000800188ac <copy_thread+0xfc>
ffff800080018800:	d503201f 	nop

while in QEMU emulation there was

ffff8000800187fc:	d503201f 	nop
ffff800080018800:	1400002c 	b	ffff8000800188ac <copy_thread+0xfc>

To double-check that the instructions really were swapped in the QEMU emulation, I used the gdb-multiarch program to inspect runtime memory: runtime

It turned out that the instructions truly were swapped at runtime.
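
For reference, the check can be reproduced roughly as follows, assuming QEMU was started with its gdbstub enabled (-s, listening on port 1234):

gdb-multiarch vmlinux
(gdb) target remote :1234
(gdb) x/2i 0xffff8000800187fc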

Runtime Patching Analysis

There are many instances where the instructions executed at runtime differ from those in the static kernel image.

Runtime code patching in the Linux kernel can occur due to three main mechanisms:

  1. Alternatives (a.k.a. patch alternatives): A mechanism, prepared at compile time, that patches instructions at boot based on CPU features. It is applied once early during boot, not dynamically at runtime.

  2. Static keys (a.k.a. jump labels): A dynamic runtime patching mechanism used to enable/disable code paths efficiently. Static keys modify code in place at runtime to enable fast conditional execution.

  3. Live Patching: A mechanism that allows runtime replacement of functions or code in the kernel to fix bugs or security issues without rebooting the system.

1. Alternatives

There is a blog post by Oracle that introduces ARM64 runtime patching with alternatives; I primarily draw on that resource in the content below.

The Linux Alternatives Framework is a set of macros that kernel developers can use to prepare their code for boot-time patching. It is available for multiple CPU architectures, including x86, ARM64, S390, and PA-RISC. The alternative macro keeps the default (original) code in subsection 0 of .text and places the replacement code in subsection 1.

oracle

The macro also creates an alt_instr structure containing the offset locations, instruction lengths, and the CPU feature bit. These structures are stored in the .altinstructions section. At boot time, the Linux kernel walks through the .altinstructions section and compares each alt_instr structure with the running CPU’s features. If the machine does not have the specific feature, the default code remains unchanged. Otherwise, the kernel replaces the default code with the replacement code using the information in the alt_instr structure.

struct alt_instr {
    s32 orig_offset; /* offset to original instruction          */
    s32 alt_offset;  /* offset to replacement instruction       */
    u16 cpufeature;  /* cpufeature bit set for replacement      */
    u8 orig_len;     /* size of original instruction(s)         */
    u8 alt_len;      /* size of new instruction(s), <= orig_len */
};

For each code snippet that can have an alternative, a pair of macros is applied. An example:

SYM_FUNC_START(crc32_le)           /* start of the function                                    */
alternative_if_not ARM64_HAS_CRC32 /* assuming the runtime machine has no hardware CRC feature */
    b crc32_le_base                /* default branch to software CRC routine                   */
alternative_else_nop_endif         /* patch with nop if machine has hardware CRC feature       */
    __crc32                        /* a macro which uses hardware CRC instructions             */
SYM_FUNC_END(crc32_le) 

Please see the Oracle blog for a detailed analysis of the example above, in which they expand the macros and illustrate how the assembly code creates the alt_instr structure and stores the replacement code in a separate section.

The source code related to ARM64 runtime patching can be found in file linux/arch/arm64/kernel/alternative.c.

Specifically, the patch_alternative function (source code) appeared to be highly related:

alternative

To capture all the runtime memory changes done by patch alternatives, I set breakpoints at the functions apply_boot_alternatives and apply_alternatives_all. At each breakpoint, I used a gdb command to dump runtime memory to a binary file; an example of dumping the .text section looks like:

(gdb) dump binary memory text.bin 0xffff800080010000 0xffff8000810d5000

Comparing the memory dumps taken before and after apply_boot_alternatives and apply_alternatives_all (meaning four dumps in total) gives all the changes caused by patch alternatives.

The reason one cannot simply look at the memory after apply_alternatives_all to see all the alternatives-related changes is that other modifications to the .text section happen between the two patch alternative functions.
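
To locate the changed bytes between two such dumps, a plain byte-wise comparison is enough; for example (the file names are whatever you chose for the dumps):

cmp -l text_before.bin text_after.bin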

I also extracted the .text section from the vmlinux image I built. It turned out that four addresses were modified before apply_boot_alternatives was called, and the changes were actually applied even before __primary_switched. The four modified addresses appear as follows in the static kernel image.

ffff8000810c2e78 <__kvm_nvhe_$d>:
ffff8000810c2e78:	00000000 	udf	#0
ffff8000810c2e7c:	00000000 	udf	#0
ffff8000810c2e80:	00000000 	udf	#0
ffff8000810c2e84:	00000000 	udf	#0

It’s worth noting that in the Oracle blog they used hbreak rather than break to set breakpoints.

gdb oracle

It turns out that hbreak sets hardware breakpoints, which use the CPU’s hardware debug registers and are preferred when:

  1. Debugging read-only memory (e.g., ROM or flash).
  2. Debugging code in shared libraries or other regions where inserting software breakpoints might be unsafe or unsupported.
  3. Debugging low-level system code (like kernels or embedded systems) where instruction modification isn’t possible or desirable. 👀

According to the QEMU documentation and my own observation, QEMU treats gdb breakpoints as hardware breakpoints in TCG mode.
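
So, with the gdbstub attached and vmlinux symbols loaded, either form works; the hardware-breakpoint variant would be:

(gdb) hbreak apply_boot_alternatives
(gdb) hbreak apply_alternatives_all
(gdb) continue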

An important follow-up question: is there a way to fully disable patch alternatives? Sadly, from what I found on the Internet, there is no configuration option available when building the Linux kernel that disables ALL patch alternatives.

2. Static Keys

Related documentation can be found at this link.

Static keys allow the inclusion of seldom-used features in performance-sensitive fast-path kernel code, via a GCC feature and a code patching technique.

A quick example:

DEFINE_STATIC_KEY_FALSE(key);

...

if (static_branch_unlikely(&key))
    do unlikely code
else
    do likely code

...
static_branch_enable(&key);
...
static_branch_disable(&key);
...

The static_branch_unlikely() branch will be generated into the code with as little impact to the likely code path as possible.
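
On ARM64, enabling or disabling a key patches a single instruction site between a nop and a branch. A rough sketch (not actual compiler output, purely illustrative) of what the patched site looks like:

    nop                          // key disabled (default): fall straight into the likely code
    // ... likely code ...
    b   .Ldone
.Lunlikely:
    // ... unlikely code ...
.Ldone:
    ret

// After static_branch_enable(&key), the jump-label code patches the nop in place into:
    b   .Lunlikely               // key enabled: branch to the unlikely code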

I didn’t dive too deep into this topic since I found that setting CONFIG_JUMP_LABEL=n in the kernel configuration can fully disable these static keys.

3. Live Patching

Related documentation can be found at this link.

There are multiple mechanisms in the Linux kernel that are directly related to redirection of code execution; namely: kernel probes, function tracing, and livepatching:

  • The kernel probes are the most generic. The code can be redirected by putting a breakpoint instruction instead of any instruction.
  • The function tracer calls the code from a predefined location that is close to the function entry point. This location is generated by the compiler using the -pg gcc option.
  • Livepatching typically needs to redirect the code at the very beginning of the function entry before the function parameters or the stack are in any way modified.

All three approaches need to modify the existing code at runtime. Therefore they need to be aware of each other and not step over each other’s toes. Most of these problems are solved by using the dynamic ftrace framework as a base.

Runtime Patching Conclusion

After disabling static keys (a.k.a. jump labels) with CONFIG_JUMP_LABEL=n in the kernel configuration, the main contributor to runtime code modification in the kernel .text section is the patch alternatives mechanism.

With gdb and QEMU, I recorded a simple boot-and-shutdown emulation of an ARM64 Linux kernel; I dumped the runtime .text section at various breakpoints and compared each dump with the previous one, collecting the following statistics for future reference:

gdb-nice-graph

apply_boot_alternatives and apply_alternatives_all are two main sources of runtime code rewriting. Both are part of the Linux patch alternative framework.

See source code for the kvm_apply_hyp_relocations function. The function is responsible for relocating certain symbols used by the KVM hypervisor code (HYP mode) during the early initialization of KVM.

See source code for the vfs_caches_init_early function. The function is called early during the kernel boot process to prepare basic data structures used by the VFS. Further looking at the source code, I annotated the place where rewrite happens:

void __init vfs_caches_init_early(void)
{
	int i;

	for (i = 0; i < ARRAY_SIZE(in_lookup_hashtable); i++)
		INIT_HLIST_BL_HEAD(&in_lookup_hashtable[i]);

	dcache_init_early();            //   static void __init dcache_init_early(void)
                                    // 	{
                                    // 		/* If hashes are distributed across NUMA nodes, defer
                                    // 		* hash allocation until vmalloc space is available.
                                    // 		*/
                                    // 		if (hashdist)
                                    // 			return;

                                    // 		dentry_hashtable =
                                    // 			alloc_large_system_hash("Dentry cache",
                                    // 						sizeof(struct hlist_bl_head),
                                    // 						dhash_entries,
                                    // 						13,
                                    // 						HASH_EARLY | HASH_ZERO,
                                    // 						&d_hash_shift,
                                    // 						NULL,
                                    // 						0,
                                    // 						0);
                                    // 		d_hash_shift = 32 - d_hash_shift;

                                    // 		runtime_const_init(shift, d_hash_shift);       <- rewrite happens!
                                    // 		runtime_const_init(ptr, dentry_hashtable);     <- rewrite happens!
                                    // 	}
	inode_init_early();
}

Note that the “Relocate Kernel” breakpoint in the graph above actually refers to the relocate_kernel function, see source code. In GDB you cannot simply set a breakpoint on this function name, since the function is copied to a low kernel address and executed at a very early stage of booting. In the assembly this function is named __pi_relocate_kernel; the pi prefix stands for position-independent. These functions are related to the kernel KASLR (Kernel Address Space Layout Randomization) support on ARM64: with KASLR, the kernel is loaded at a randomized virtual address, so some parts of the kernel code are built as position-independent, and internally the kernel applies relocations and runtime fixups.

2025 Jun 25 Note: Simply disabling KVM with CONFIG_KVM=n in the kernel configuration removes all the rewrites related to KVM (including both “Relocate Kernel” and “KVM Apply Hyp Relocations”).

A Useful QEMU Plugin

When investigating the runtime patch behavior, I implemented a QEMU plugin called runtime_patch_monitor at my QEMU plugin fork. Please see this link for source code.

The plugin uses qemu_plugin_register_vcpu_mem_cb to observe all instructions that perform memory accesses, filters them down to store instructions whose target falls in the physical address range of the .text section, and thereby finds all writes to kernel code during QEMU emulation. The plugin dumps all the addresses together with the updated 32-bit data as binary files, and this dump can later be parsed to analyze all the runtime memory writes performed on the .text section.

To read the updated data I used qemu_plugin_read_memory_vaddr; this API function was introduced in QEMU v10.0, so the plugin requires at least that version of QEMU.
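
The real plugin lives in the fork linked above; as a rough, self-contained sketch of the same idea (the .text physical range below is a hardcoded placeholder, and the actual plugin records binary dumps rather than text output), the wiring looks roughly like this:

#include <inttypes.h>
#include <string.h>
#include <glib.h>
#include <qemu-plugin.h>

QEMU_PLUGIN_EXPORT int qemu_plugin_version = QEMU_PLUGIN_VERSION;

/* Physical address range of the kernel .text section -- placeholder values,
 * these depend on your build and would normally be passed as plugin args. */
static uint64_t text_start = 0x40210000;
static uint64_t text_end   = 0x412d5000;

/* Called for every memory access made by an instrumented instruction. */
static void mem_cb(unsigned int vcpu_index, qemu_plugin_meminfo_t info,
                   uint64_t vaddr, void *udata)
{
    if (!qemu_plugin_mem_is_store(info)) {
        return;                               /* only stores can patch code */
    }

    struct qemu_plugin_hwaddr *hw = qemu_plugin_get_hwaddr(info, vaddr);
    if (!hw) {
        return;
    }
    uint64_t paddr = qemu_plugin_hwaddr_phys_addr(hw);
    if (paddr < text_start || paddr >= text_end) {
        return;                               /* write is outside kernel .text */
    }

    /* Read back the 32-bit word at the written address; needs QEMU >= 10.0. */
    GByteArray *buf = g_byte_array_new();
    if (qemu_plugin_read_memory_vaddr(vaddr, buf, 4) && buf->len == 4) {
        uint32_t insn;
        memcpy(&insn, buf->data, sizeof(insn));
        g_autofree gchar *msg =
            g_strdup_printf("text write: pa=0x%" PRIx64 " val=0x%08x\n", paddr, insn);
        qemu_plugin_outs(msg);
    }
    g_byte_array_free(buf, TRUE);
}

/* Instrument every translated instruction so all of its memory writes are seen. */
static void tb_trans_cb(qemu_plugin_id_t id, struct qemu_plugin_tb *tb)
{
    size_t n = qemu_plugin_tb_n_insns(tb);
    for (size_t i = 0; i < n; i++) {
        struct qemu_plugin_insn *insn = qemu_plugin_tb_get_insn(tb, i);
        qemu_plugin_register_vcpu_mem_cb(insn, mem_cb, QEMU_PLUGIN_CB_NO_REGS,
                                         QEMU_PLUGIN_MEM_W, NULL);
    }
}

QEMU_PLUGIN_EXPORT int qemu_plugin_install(qemu_plugin_id_t id,
                                           const qemu_info_t *info,
                                           int argc, char **argv)
{
    qemu_plugin_register_vcpu_tb_trans_cb(id, tb_trans_cb);
    return 0;
}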

Two minor issues I observed when using this plugin:

  1. On very rare occasions, some memory rewrites are performed with instructions like ldrsw x5, [x21, #16], which extend a 32-bit value into a 64-bit one. This means two addresses are rewritten, but the plugin only captures the lower address, since the recorded write appears to affect only that one.
  2. The plugin faithfully records all memory writes. However, a small number of them (fewer than 50 in my experiments) turned out to be writing the same content that was already there. These are benign false positives to keep in mind.

Aside from these two issues, the plugin worked extremely well in detecting runtime patches. I manually verified all the abnormal entries (about 40 in total) it generated against gdb runtime memory dumps, and concluded that the plugin is capable of capturing ALL RUNTIME CODE PATCHES.

An IDA Pro Plugin

See this link for a related IDA Pro plugin for analyzing patch alternatives in a kernel image.

Written on May 30, 2025