32-bit x86 is one of the worst CPU architectures out there – so it was only a matter of time until I got so fed up with the idio(t)syncrasies of this lovely architecture and took the plunge to port it to amd64.

Truth be told, I had planned on porting the OS to some 64-bit ARMv8 platform rather than amd64, but the kludged together 32-bit paging code was starting to break in odd ways, as I added more and more features to the system. Crusty code, brokenness on real hardware, and most importantly, a little bit of inebriation drove me to instead try porting everything to amd64. After all: the kernel was nicely separated into architecture-specific code.

Looking back on the past week or so of coding in the evenings to make this happen makes me realize I would have saved myself so much time, effort, and most importantly, pain and suffering had I gone with amd64 from the start. I’ll describe some of the changes I made to the system below and problems I faced.

This article was really supposed to be done and published two or three weeks prior to when it actually did – Real Life™ (batteries not included) unfortunately kept me from finishing this post up. I haven’t had much time since to actually work on the OS either: I graduate in May and will hopefully get a few weeks then to work on this.

Bootloader

Most 64-bit kernels on the PC have some sort of stub (which runs in either 32-bit protected more, or in 16-bit real mode) that sets up the processor before putting it in long mode and executing the rest of the kernel. In fact, this is supported (and the only way) to make a 64-bit kernel, which can be booted using the Multiboot protocol via GRUB2.

However, I really didn’t want to have to continue the hacky assembly entry stubs that did way more work than they needed to, and was looking for any excuse to drop GRUB2. This meant I could either implement direct booting via EFI of the kernel, or build a small bootloader myself that the EFI loads, which can in turn load the kernel and init bundle from a filesystem.

Thankfully, a bit more research led me to BOOTBOOT: a relatively new legacy-free 64-bit loader. Not only does it abstract loading the binary and init bundle away (even allowing both to be compressed transparently,) it also supports AArch64, which will be nice when I eventually get around to porting it to that architecture.

Getting the kernel booting to a “hello world” level with BOOTBOOT was actually incredibly easy: I managed this in about an hour and a half, most of which was spent setting up the toolchain definitions, and integrating the BOOTBOOT tools for generating the boot disk image.

Paging

Although the 32-bit x86 architecture code supported and implemented PAE, which used a very similar 64-bit wide page table entry format, it really wasn’t actually used to support more than 4G of physical memory, but rather for the no-execute¹ support. So keeping any of that code around was just going to be worthless.

To support the larger virtual address space, paging structures are extended from the 32-bit PAE versions: the third level Page Directory Pointer Table (PDPT) is extended from only 4 entries to the full 512 (to map 512GB of VM space), and a fourth level PML4 table was added:

Note that this diagram isn’t to scale – each table actually has 512 entries, and they all occupy a whole 4K page. The top 16 bits of the virtual address will be a copy of bit 47.

The nice part about moving to a 64-bit platform is that there’s more virtual address space than you could possibly know what to do with. On current amd64 implementations, this is limited to 48 bits: the high 16 bits are “sign extended”² so systems code doesn’t try to use them as tag bits. This ensures that implementations that provide more than 48 virtual address bits in the future can continue to run software written today.³

Physical Identity Mapping

All this extra virtual address space means that we can drop the entire recursive mapping scheme the PAE-based x86 mapping code had a lot of hacks to accomodate. Instead, the low 2TB of the kernel’s address space are reserved for a physical identity map aperture. Using the 1GB size super-pages supported on amd64, the first 2TB of physical memory are mapped here. This means that given a physical address, accessing it as simple as adding the base address of this region.

This solution offers some nice performance benefits: 1G pages save us a lot of memory for paging structures, and means that accessing a page given its physical address requires only pointer arithmetic. That last benefit is twofold: it removes the need to handle temporary mappings, or modify page tables at all.

Despite the benefits, this does create a bit of a security problem in that this is basically an arbitrary read/write to any task’s memory, and even access to pages otherwise mapped read-only: such as kernel executable sections. Since there are very few places outside of the page table manipulation code that need to access arbitrary physical pages, a possible solution is to carve out slabs of physical memory that can be identity mapped; then use these regions only to satisfy paging structure allocations, and identity map only them.

Physical Allocator

Moving to 64-bit allows for significantly more addressable physical memory. To support this, the old memory allocator had to go. It represented each page as a single bit in a bitmap: efficient as far as space goes, but allocating a new page required iterating the entire bitmap. This in turn could also be optimized to check 32 (or 64) pages in the bitmap at once, but even still, memory allocations would get incredibly slow after a few hundred MB had been allocated.

On top of all that, the old allocator did not really support allocating larger blocks of contiguous physical memory. This made it all but impossible to implement support for large pages.

The new physical allocator works on the buddy principle: all physical regions are divided into blocks, each of which are \(\text{pagesz} \times 2^n\) in size, where \(n\) is the block’s order: a number between 0 and an implementation-defined maximum (currently: 11) that defines the maximum satisfiable single allocation. Each order has a bitmap that indicates whether a particular block is allocated, and an associated free list for all allocated blocks.

When we want to allocate physical memory, we’ll see if the order from which to allocate from has any blocks in the free map; if so, we’ll mark it as allocated and return it. Otherwise, we’ll go up order by order until we find a free block of order \(l\). This block is then split time and time again, with one half going to the free list of order \(l - 1\), the other being returned (or split further if it’s still too large.)

Freeing blocks works in a similar way. When freeing a block of order \(o\), we’ll see if its buddy is free. If so, we’ll remove the buddy from the free list of order \(o\), and create a merged block and insert it into the free list of \(o + 1\). This then will continue until the block’s buddy is allocated (or nonexistent) or it’s been merged to the largest size block.

SMP

The previous kernel never started any APs, even if they were available. The BOOTBOOT loader will start them for you, set up an unique stack and jump to the kernel entry point. This entry point checks whether it’s running on the initial processor (BSP) and if so, runs the kernel initialization as it currently exists.

Currently, the AP code just increments a counter and spins forever. This should wait for the kernel to be set up, then perform some per CPU initialization and launch the scheduler on that core. That is to say there’s no real implementation of SMP yet, but it’ll come sooner or later. The scheduler needs to be rewritten first anyhow: it’s currently pretty garbage code and doesn’t support priorities particularly well. I will likely write about this at some point in the future.

Conclusion

All in all, I was pleasantly surprised at how much nicer amd64 is: lots of the legacy stuff (segmentation, real mode, etc.) can simply be ignored. SIMD instructions (at least SSE2) are guaranteed to be available, which allows for better optimizations. And the 64-bit virtual address space simplifies a lot of things.

It’s pretty likely I will remove support for 32-bit x86 from the kernel. There was very little work required to get the kernel itself 64-bit clean, so I assume that this means that if I ever want to add support for another 32-bit platform, it should be doable. As it stands right now, we can get to the root server entry point, but timers are broken, and the dynamic linker hasn’t been updated to support 64-bit binaries.

On amd64, this is called the NX, or Never eXecute bit. ↩
Addresses with this pseudo-sign extension done are called canonical addresses. ↩
Back in the days of Classic Mac OS (a much better time when the 68000 architecture reigned supreme…) on 24-bit machines, the system used the high 8 bits of addresses to represent memory state. This led to code that wasn’t 32-bit clean and would crash on 68020+ machines that had all 32 address bits brought out. ↩

amd64-Flavored Memory Management

Bootloader

Paging

Physical Identity Mapping

Physical Allocator

SMP

Conclusion

Further Reading

Yet Another Microkernel

Building std::shared_ptr

A Stupid (Simple) RPC Framework