With much of the foundation for the internals of kush-os built, it was time to break from the confines of the small, size-limited boot RAM disk. This means reading Stuff™ from disk: so it’s time to write a driver for the AHCI to allow filesystem drivers to interface with SATA disks.

This was also the first more complicated driver that I’ve written for the system. There are already a few drivers: namely the PS/2 controller for keyboard and mouse input, the ACPI interpreter used to discover hardware, and support for PCI and PCI Express busses. But none of these are particularly complicated: they mostly sit around in a message loop waiting for requests to call into some existing code, or are otherwise relatively simple in their operation.

Background

The Advanced Host Controller Interface (or AHCI) is a standardized interface exposed by compatible HBAs to allow a generic driver to communicate with SATA devices. It is designed as a simple data movement engine that uses bus mastering DMA to transfer data to/from devices and host memory via command lists stored in RAM. This makes it much easier to achieve high performance compared to the legacy IO port based ATA controllers.

However, unlike the legacy ATA controllers, there is a lot less code out there (and the samples I’ve found are… of dubious quality) and generally fewer people seem to implement AHCI for their operating systems as the performance difference likely just does not matter in emulation. The best resources I’ve found for developing this driver were the official spec from Intel as well as an ATA command reference.

I did briefly consider implementing a virtio block device driver instead, which qemu also supports. This, however, wouldn’t serve any purpose on real hardware, so I went with AHCI, though virtio support will likely get added sooner rather than later.

Data Transfer

All data transfers between the device and host, in either direction, are encapsulated by Frame Information Structures (FIS), the layout of which varies depending on the type of command.¹ In most cases, these FIS are similar to the legacy ATA “task files,” or collections of device registers. For example, to send a read request, we build a host-to-device register FIS with the appropriate LBA, count, and command byte specified.
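
As a rough sketch of what building such a read request could look like (this is not the driver’s actual code; the struct and the `MakeReadDmaExt` helper are names made up for illustration), the field order below follows the host-to-device register FIS layout from the SATA spec, and 0x25 is the ATA READ DMA EXT command:

```cpp
#include <cassert>
#include <cstdint>

// Host-to-device register FIS, laid out as in the SATA specification.
struct RegH2dFis {
    uint8_t type;          // 0x27 identifies a host-to-device register FIS
    uint8_t flags;         // bits 3:0: port multiplier port; bit 7: command (C) bit
    uint8_t command;       // ATA command byte
    uint8_t featureLow;
    uint8_t lba0, lba1, lba2;
    uint8_t device;
    uint8_t lba3, lba4, lba5;
    uint8_t featureHigh;
    uint8_t countLow, countHigh;
    uint8_t icc;
    uint8_t control;
    uint8_t reserved[4];
};
static_assert(sizeof(RegH2dFis) == 20, "the FIS is exactly five dwords long");

constexpr uint8_t kFisTypeRegH2d = 0x27;
constexpr uint8_t kAtaCmdReadDmaExt = 0x25; // READ DMA EXT (48-bit LBA)

// Hypothetical helper: fill in a READ DMA EXT request for the given range.
inline RegH2dFis MakeReadDmaExt(uint64_t lba, uint16_t numSectors) {
    assert(lba < (1ULL << 48)); // only 48 bits of LBA fit into the FIS

    RegH2dFis fis{};
    fis.type = kFisTypeRegH2d;
    fis.flags = 0x80;           // C = 1: this FIS updates the command register
    fis.command = kAtaCmdReadDmaExt;
    fis.device = 1 << 6;        // select LBA addressing

    // the LBA and sector count are split into individual register bytes
    fis.lba0 = static_cast<uint8_t>(lba);
    fis.lba1 = static_cast<uint8_t>(lba >> 8);
    fis.lba2 = static_cast<uint8_t>(lba >> 16);
    fis.lba3 = static_cast<uint8_t>(lba >> 24);
    fis.lba4 = static_cast<uint8_t>(lba >> 32);
    fis.lba5 = static_cast<uint8_t>(lba >> 40);
    fis.countLow = static_cast<uint8_t>(numSectors);
    fis.countHigh = static_cast<uint8_t>(numSectors >> 8);
    return fis;
}
```

The resulting 20 bytes would then be copied into the command FIS area of a command table before the command is issued.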

Data read from or written to the device (whether through PIO transfers, like the ATA IDENTIFY command, or DMA reads/writes) is automatically written to or read from the host’s memory by the port’s built-in DMA engine. This is implemented as scatter-gather DMA: the driver specifies a list of physical region descriptors, each of which contains a word-aligned (2-byte) physical address and the number of bytes to transfer to or from the memory region covered by the descriptor.
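
To make the descriptor list a bit more concrete, here is a minimal sketch of turning a set of physically contiguous ranges into descriptors. The entry layout (64-bit address, a reserved dword, then a 22-bit “byte count minus one” field) follows the AHCI spec; `PhysRange` and `BuildPrdTable` are hypothetical names, not something taken from the driver:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One physical region descriptor, as it appears in an AHCI command table.
struct PrdEntry {
    uint32_t dataBaseLow;       // physical address, bits 31:0 (bit 0 is reserved)
    uint32_t dataBaseHigh;      // physical address, bits 63:32
    uint32_t reserved;
    uint32_t byteCountAndFlags; // bits 21:0: byte count - 1; bit 31: interrupt on completion
};
static_assert(sizeof(PrdEntry) == 16, "descriptors are four dwords long");

// The 22-bit count field limits a single descriptor to 4 MiB.
constexpr size_t kMaxPrdBytes = 4 * 1024 * 1024;

// Hypothetical description of a physically contiguous chunk of a buffer.
struct PhysRange {
    uint64_t base;
    size_t length;
};

// Emits one or more descriptors per range, splitting ranges larger than 4 MiB.
inline std::vector<PrdEntry> BuildPrdTable(const std::vector<PhysRange> &ranges) {
    std::vector<PrdEntry> table;
    for(const auto &range : ranges) {
        // the hardware wants even addresses and even byte counts
        assert((range.base & 1) == 0 && (range.length & 1) == 0);

        uint64_t addr = range.base;
        size_t left = range.length;
        while(left) {
            const size_t chunk = std::min(left, kMaxPrdBytes);
            PrdEntry entry{};
            entry.dataBaseLow = static_cast<uint32_t>(addr);
            entry.dataBaseHigh = static_cast<uint32_t>(addr >> 32);
            entry.byteCountAndFlags = static_cast<uint32_t>(chunk - 1) & 0x3FFFFF;
            table.push_back(entry);

            addr += chunk;
            left -= chunk;
        }
    }
    return table;
}
```

Since the buffer pages don’t need to be physically contiguous, a buffer backed by scattered pages simply ends up as one descriptor per contiguous run.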

Design

The driver is designed so that it can support multiple discrete controllers, each of which is connected via a single PCI or PCI Express device. A controller handles interrupt routing, PCI resource management, and some general housekeeping for all ports, as well as the work loop. Each controller, in turn, can have one or more active ports, each of which is represented by its own object with its own state and command queues.

In addition to the per controller work loop, the driver contains one RPC server that’s used to interface with all disk-type devices.² It provides a relatively simple interface that clients can use to set up the shared memory regions used for the command and read/write buffers, as well as to submit commands.

Work Loop

Each controller has a work loop. This consists of a thread that sits in an infinite loop waiting to receive notifications – these come from either an interrupt being fired by the controller, or by a work item being delivered to the loop.

void Controller::workLoopMain() {
    // ...
    while(this->workLoopRun) {
        const auto bits = NotificationReceive(0, UINTPTR_MAX);

        // both bits may be set at once, so these checks are deliberately not an if/else chain
        if(bits & kAhciIrqBit) {
            this->handleAhciIrq();
        }
        if(bits & kWorkBit) {
            this->handleWorkQueue();
        }
    }
    // ...
}

Work queue items are simply functions we invoke from the context of the work loop thread. This is used for completion handlers fired by the interrupt machinery (to ensure that all interrupts are handled before any callbacks execute) as well as for some external requests that require access to the hardware. By performing all hardware access from the work loop, we can skip putting locks around everything.
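
The queue itself could look something like the sketch below. This is an assumption about the general shape rather than the driver’s real implementation: it uses a plain `std::mutex` and `std::function`, and the `WorkQueue` class name is made up. The point is that only the queue needs a (short-lived) lock, while the items themselves run unlocked on the work loop thread:

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <utility>

// Minimal work queue: any thread may submit, only the work loop thread drains.
class WorkQueue {
    public:
        using Item = std::function<void()>;

        /// Queue an item; the caller would then notify the work loop (e.g. set a work bit)
        void submit(Item item) {
            std::lock_guard<std::mutex> guard(this->lock);
            this->items.push_back(std::move(item));
        }

        /// Run everything queued so far; returns the number of items executed
        size_t drain() {
            std::deque<Item> pending;
            {
                // hold the lock only long enough to steal the pending items
                std::lock_guard<std::mutex> guard(this->lock);
                pending.swap(this->items);
            }
            // items run without the lock held, so they may submit more work
            for(auto &item : pending) {
                item();
            }
            return pending.size();
        }

    private:
        std::mutex lock;
        std::deque<Item> items;
};
```

Items submitted while a drain is in progress are picked up by the next drain, which keeps a single pass of the loop bounded.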

Interrupts

Rather than mucking with the legacy PCI interrupt signaling scheme, we take advantage of the fact that PCI Express makes message signaled interrupts (MSI) mandatory. This is the only type of interrupt that we support for PCIe devices anyways; these work by having the device write to a special memory address, which causes a particular processor core to receive an interrupt with a particular vector number. The device may vary some of the payload data to trigger different interrupts.
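
On x86, that special address lives in the local APIC’s 0xFEE00000 window, and the data written is mostly just the vector number. A small sketch of composing the address/data pair that would be programmed into the device’s MSI capability (the `MsiMessage` struct and `MakeMsiMessage` function are illustrative names; this assumes fixed delivery, edge triggering, and physical destination mode):

```cpp
#include <cassert>
#include <cstdint>

// The address/data pair a device writes to raise a message signaled interrupt.
struct MsiMessage {
    uint64_t address;
    uint32_t data;
};

// Compose an x86 MSI message targeting one core (by APIC ID) and one vector.
inline MsiMessage MakeMsiMessage(uint8_t apicId, uint8_t vector) {
    // vectors below 32 are reserved for processor exceptions
    assert(vector >= 32);

    MsiMessage msg{};
    // bits 31:20 are fixed at 0xFEE; bits 19:12 select the destination APIC
    msg.address = 0xFEE00000ULL | (static_cast<uint64_t>(apicId) << 12);
    // bits 7:0 hold the vector; delivery mode (10:8) and trigger mode (15) stay
    // zero for fixed, edge-triggered delivery
    msg.data = vector;
    return msg;
}
```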

The kernel already has a mechanism for handling interrupts in userspace: a thread can call the IrqHandlerInstall() syscall to be notified whenever an interrupt is triggered. This relies on the caller knowing the physical IRQ number, though. On the PC, this is limited to whatever the IOAPIC provides, and the first 16 interrupts are likely reserved for legacy ISA stuff. This doesn’t quite work when we’re using message signaled interrupts. Instead, another system call, IrqHandlerInstallLocal(), allows the caller to allocate a core-local vector and bind a standard interrupt handler to that vector. As a result, the destination thread is bound to execute on only that core.

Interrupts are handled by the controller’s work loop in a separate function invoked when the appropriate notification bit is set. It then invokes the interrupt handler methods for every port that has its interrupt status flag set.

void Controller::handleAhciIrq() {
    const auto is = this->abar->irqStatus;
    this->abar->irqStatus = is;

    for(size_t i = 0; i < kMaxPorts; i++) {
        if(is & (1U << i)) {
            this->ports[i]->handleIrq();
        }
    }
}

In the port-specific interrupt handler, we can read out more specific interrupt information to determine the source. This can roughly be divided into success and failure interrupts as they usually come in response to a command we previously sent. Currently, the driver waits to receive either a task file error (indicating we screwed up our command submission) or a received “device to host register” packet, from which we can read the ATA status field and use that to either complete or fail the corresponding command.
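
Boiled down, that decision could be sketched as follows. The bit positions are the ones the AHCI spec assigns in the port interrupt status register (DHRS is bit 0, TFES is bit 30) and the ATA status byte (ERR is bit 0, DF is bit 5); the `IrqOutcome` enum and `ClassifyPortIrq` function are hypothetical and not lifted from the driver:

```cpp
#include <cassert>
#include <cstdint>

// Port interrupt status (PxIS) bits we care about
constexpr uint32_t kPortIrqD2hRegFis = 1U << 0;    // DHRS: D2H register FIS received
constexpr uint32_t kPortIrqTaskFileErr = 1U << 30; // TFES: task file error

// ATA status register bits
constexpr uint8_t kAtaStatusErr = 1U << 0;         // ERR: command failed
constexpr uint8_t kAtaStatusDeviceFault = 1U << 5; // DF: device fault

enum class IrqOutcome { None, Completed, Failed };

// Decide what to do with the outstanding command, given the port's interrupt
// status and the ATA status byte from the most recent D2H register FIS.
inline IrqOutcome ClassifyPortIrq(uint32_t portIrqStatus, uint8_t ataStatus) {
    // a task file error always fails the command
    if(portIrqStatus & kPortIrqTaskFileErr) {
        return IrqOutcome::Failed;
    }
    // otherwise, a D2H register FIS carries the final status of the command
    if(portIrqStatus & kPortIrqD2hRegFis) {
        const bool failed = ataStatus & (kAtaStatusErr | kAtaStatusDeviceFault);
        return failed ? IrqOutcome::Failed : IrqOutcome::Completed;
    }
    // some other interrupt source that doesn't complete a command
    return IrqOutcome::None;
}
```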

Ports

Most of the driver’s behavior is implemented on a per-port basis: in the AHCI, each port is effectively an independent unit. Each port has associated with it a buffer for FIS received from the device, as well as a command list. The command list contains up to 32 command headers, each of which can point to a command table. This allows support for native command queuing.
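
Picking a command header to use then comes down to finding a slot the hardware isn’t busy with. A minimal sketch, assuming the caller has already read the port’s command issue (PxCI) and SATA active (PxSACT) registers and the controller’s capabilities register; the function names are made up for this example:

```cpp
#include <cassert>
#include <cstdint>

// The HBA capabilities register stores the number of command slots per port,
// minus one, in bits 12:8.
inline unsigned NumCommandSlots(uint32_t hbaCapabilities) {
    return ((hbaCapabilities >> 8) & 0x1F) + 1;
}

// Returns the index of the first slot that is neither issued nor active, or -1
// if every slot is currently in use.
inline int FindFreeCommandSlot(uint32_t commandIssue, uint32_t sataActive,
        unsigned numSlots) {
    const uint32_t busy = commandIssue | sataActive;
    for(unsigned i = 0; i < numSlots && i < 32; i++) {
        if(!(busy & (1U << i))) {
            return static_cast<int>(i);
        }
    }
    return -1;
}
```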

When the port is constructed, it probes the identification register for the port to determine the attached device. Regular SATA disks result in the allocation of an AtaDisk object, which provides some convenience methods to issue the required read/write ATA commands, and ensures the disk is registered in the driver forest. At this time, we also read out the entire ATA IDENTIFY command response to determine various drive parameters.
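
The probing step can be illustrated roughly like this. The signature values and the layout of the SATA status register (device detection in bits 3:0, power management state in bits 11:8) are the well-known ones from the AHCI and SATA specs; the `PortDevice` enum and `ProbePort` function are hypothetical names, not the driver’s:

```cpp
#include <cassert>
#include <cstdint>

enum class PortDevice { None, Sata, Satapi, EnclosureBridge, PortMultiplier, Unknown };

// Classify what is attached to a port from its SATA status (PxSSTS) and
// signature (PxSIG) registers.
inline PortDevice ProbePort(uint32_t sataStatus, uint32_t signature) {
    const uint32_t detection = sataStatus & 0xF;         // DET field
    const uint32_t powerState = (sataStatus >> 8) & 0xF; // IPM field

    // DET = 3: device present and link established; IPM = 1: link active
    if(detection != 3 || powerState != 1) {
        return PortDevice::None;
    }

    switch(signature) {
        case 0x00000101: return PortDevice::Sata;            // regular ATA disk
        case 0xEB140101: return PortDevice::Satapi;          // packet device (optical drive)
        case 0xC33C0101: return PortDevice::EnclosureBridge; // enclosure management bridge
        case 0x96690101: return PortDevice::PortMultiplier;
        default:         return PortDevice::Unknown;
    }
}
```

Only the `Sata` case would lead to an AtaDisk being created; everything else is detected but left alone.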

Device Objects

Currently, there’s only support for hard disk-type drives, which are represented by the AtaDisk class in the driver. It works in conjunction with the RPC server (described further below) to expose read/write methods that wrap the appropriate ATA commands, and cache some metadata about the disk.

RPC Interface

Reading from/writing to disks is accomplished via an RPC interface common to all block device (disk) type drivers. In the context of the AHCI driver, this is implemented by the ATA disk server which translates the RPC requests into read/write calls (and thus, the corresponding ATA commands) as needed.

Most of the calls on the RPC interface are dedicated to allocating and managing an IO session. These sessions have multiple shared memory regions associated with them: the first, which is always allocated, is the command buffer region. Data is transferred through the lazily allocated read/write buffers.

Command Buffer

The command buffer consists of an array (of fixed size at initialization, defining the maximum number of simultaneous outstanding IO commands) of command structures, which define the operation to be performed:

enum class CommandType: uint8_t {
    None, Read, Write
};
struct Command {
    bool allocated{false}, busy{false}, completed{false};

    CommandType type{CommandType::None};
    int __attribute__((aligned(4))) status{0};

    uintptr_t notifyThread;
    uintptr_t notifyBits;

    uint64_t diskId{0};

    uint64_t sector;
    uint64_t bufferOffset;
    uint32_t numSectors;
    uint32_t bytesTransfered;

    uint8_t reserved[8];
} __attribute__((packed));

Most of the fields in the command structure are hopefully self-explanatory. Command completion is indicated by sending the specified notification bits to a particular thread; this makes blocking IO incredibly easy and allows for some degree of IO multiplexing and async IO by polling on notification bits. The bufferOffset field indicates the byte offset into either the read buffer (written by the driver) or into the write buffer (written by the client, after allocating a write buffer region via RPC) that data is located at.

When the client wants to submit a new command, it uses atomic operations over the command slots to find and allocate a slot; and when the command completes, the driver is responsible for marking the slot as allocatable again.

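size_t Disk::allocCommandSlot() {
    for(size_t i = 0; i < this->numCommands; i++) {
        auto &command = this->commandList[i];
        // test-and-set returns the previous value: false means the slot was free and is now ours.
        // acquire ordering keeps our writes to the slot from moving ahead of the allocation
        if(!__atomic_test_and_set(&command.allocated, __ATOMIC_ACQUIRE)) {
            return i;
        }
    }

    // no free slot: -1 wraps around to the largest size_t value
    return -1;
}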

These atomic operations over some shared memory regions are much more performant than serializing RPC messages and sending them to a remote task port for each command; we can then notify the driver that one or more commands are available through a single call, and it can scan the shared memory region for any newly allocated, but not yet started commands.
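
The driver’s side of this handshake might look roughly like the sketch below. This is a guess at the general shape, not the actual implementation: `Slot` is a cut-down stand-in for the command structure, it assumes the client only notifies the driver after it has finished filling in every slot it allocated, and it uses the GCC/Clang `__atomic` builtins:

```cpp
#include <cassert>
#include <cstddef>

// Cut-down stand-in for the shared command structure (flags only).
struct Slot {
    bool allocated{false}, busy{false}, completed{false};
};

// Scan the shared region and start every command that has been allocated by the
// client but not yet picked up. `start` is invoked with the slot index.
template<typename StartFn>
size_t ScanForNewCommands(Slot *slots, size_t numSlots, StartFn &&start) {
    size_t started = 0;
    for(size_t i = 0; i < numSlots; i++) {
        auto &slot = slots[i];
        // acquire pairs with the client's allocation of the slot
        if(!__atomic_load_n(&slot.allocated, __ATOMIC_ACQUIRE)) continue;
        // skip commands already in flight or waiting to be collected
        if(slot.busy || slot.completed) continue;

        slot.busy = true;
        start(i);
        started++;
    }
    return started;
}

// Once the client is done with a command, make its slot allocatable again.
inline void ReleaseSlot(Slot &slot) {
    slot.busy = false;
    slot.completed = false;
    // release ordering publishes the reset flags before the slot appears free
    __atomic_clear(&slot.allocated, __ATOMIC_RELEASE);
}
```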

Data Buffers

In addition to the command region, two additional shared memory regions are defined: the read and write buffers. These consist of physical pages, which are locked in memory, to/from which the driver will directly perform DMA transactions. These buffers are established as required when the first read or write requests come in.

Buffer regions are released automatically when the write command completes, or manually via RPC call once the data of a read command has been parsed or copied out of the IO buffer.

Wrapper Library

All of the complexities of the client side of this protocol are implemented in a simple wrapper library that higher-level drivers can build on. It provides a simple object-oriented read/write interface for each disk, which is represented by an object instantiated with the disk’s driver forest path:

class Disk: public rpc::DiskDriverClient {
    public:
        /// Attempt to allocate a disk with the given forest path
        [[nodiscard]] static int Alloc(const std::string_view &forestPath,
                std::shared_ptr<Disk> &outDisk);

        /// Return the capacity of the disk (bytes per sector, number of sectors)
        int GetCapacity(std::pair<uint32_t, uint64_t> &outCapacity);
        /// Performs a read from disk
        int Read(const uint64_t sector, const size_t numSectors, std::vector<std::byte> &out);
    // ...
};

This automatic connection relies on serialized connection info (a port handle and associated disk handle) being stored as a property on the device’s leaf in the driver manager forest. These info blobs are simply encoded MessagePack structs which are trivial to encode in drivers, and for the helper library to decode.
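
To give an idea of how small such a blob is, here’s a hand-rolled sketch of encoding and decoding one. The wire format is plain MessagePack (0x92 starts a two-element array, 0xCF introduces a big-endian 64-bit unsigned integer), but the layout of the blob itself, a two-element array of port handle and disk handle, is purely an assumption for this example, as are all of the names:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical connection info stored on the disk's forest leaf.
struct DiskConnectionInfo {
    uint64_t port;
    uint64_t diskId;
};

// Append one value as a MessagePack "uint 64": marker byte, then big endian.
inline void AppendUint64(std::vector<uint8_t> &out, uint64_t value) {
    out.push_back(0xCF);
    for(int shift = 56; shift >= 0; shift -= 8) {
        out.push_back(static_cast<uint8_t>(value >> shift));
    }
}

// Encode as a two-element array: [port, diskId]
inline std::vector<uint8_t> Encode(const DiskConnectionInfo &info) {
    std::vector<uint8_t> out;
    out.push_back(0x92); // fixarray with two elements
    AppendUint64(out, info.port);
    AppendUint64(out, info.diskId);
    return out;
}

// Decode exactly what Encode() produces; returns false on anything else.
inline bool Decode(const std::vector<uint8_t> &in, DiskConnectionInfo &out) {
    if(in.size() != 19 || in[0] != 0x92 || in[1] != 0xCF || in[10] != 0xCF) {
        return false;
    }
    out.port = out.diskId = 0;
    for(size_t i = 0; i < 8; i++) {
        out.port = (out.port << 8) | in[2 + i];
        out.diskId = (out.diskId << 8) | in[11 + i];
    }
    return true;
}
```

A real decoder would also have to accept the shorter integer encodings MessagePack allows, which is a good reason to use an existing library rather than something like this.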

A higher level filesystem driver, for example, can open a disk and read a hypothetical partition table stored in its first sector as simply as this:

const std::string_view path{"/AcpiGenericPc/AcpiPciExpressRootBridge/PciExpress8086.2922@0.1f.2/GenericDisk@0"};

std::shared_ptr<DriverSupport::disk::Disk> disk;
int err = DriverSupport::disk::Disk::Alloc(path, disk);
if(err) Abort("Failed to open disk: %d", err);

std::vector<std::byte> data;
err = disk->Read(0, 1, data);
if(err) Abort("Failed to read from disk: %d", err);

The wrapper library takes care of establishing the IO session, setting up the required shared memory segments, and building the command. It also copies the read data out of the read buffer and into the caller’s buffer automatically.
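
From there, interpreting the sector is ordinary byte wrangling. As an illustration, and assuming the hypothetical partition table is a classic MBR (boot signature 0x55 0xAA at offset 510, four 16-byte entries starting at offset 446), the `data` vector from above could be fed into something like this; `MbrPartition` and `ParseMbr` are names invented for this sketch:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct MbrPartition {
    uint8_t type;         // partition type byte; 0 marks an unused entry
    uint32_t firstSector; // LBA of the first sector
    uint32_t numSectors;  // length of the partition, in sectors
};

// Read a little-endian 32-bit value at the given byte offset.
inline uint32_t ReadLe32(const std::vector<std::byte> &data, size_t offset) {
    return std::to_integer<uint32_t>(data[offset])
        | (std::to_integer<uint32_t>(data[offset + 1]) << 8)
        | (std::to_integer<uint32_t>(data[offset + 2]) << 16)
        | (std::to_integer<uint32_t>(data[offset + 3]) << 24);
}

// Parse the primary partition entries out of the first sector of a disk.
// Returns an empty list if the sector doesn't look like an MBR.
inline std::vector<MbrPartition> ParseMbr(const std::vector<std::byte> &sector) {
    std::vector<MbrPartition> partitions;
    if(sector.size() < 512) return partitions;
    if(sector[510] != std::byte{0x55} || sector[511] != std::byte{0xAA}) {
        return partitions;
    }

    for(size_t i = 0; i < 4; i++) {
        const size_t base = 446 + (i * 16);
        const auto type = std::to_integer<uint8_t>(sector[base + 4]);
        if(type == 0) continue; // unused entry

        partitions.push_back({type, ReadLe32(sector, base + 8),
                ReadLe32(sector, base + 12)});
    }
    return partitions;
}
```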

Conclusion

Overall, having the drivers be just another userspace application made for a very pleasant development experience. All the standard C++ runtime library features are available, which is well worth it for the extensive containers library that’s part of the STL. Autogenerated RPC stubs made interfacing the driver with the rest of the system trivial.

However, debugging (when things went wrong) was pretty awful. There’s zero support for userspace debugging in the kernel right now, nor have I gotten around to writing some Python scripts lldb could use to extract task information and virtual memory maps (for loaded libraries, for example). On top of that, qemu would often simply assert if some structures were incorrectly filled out.

There were also a few times where I’d screw up and give a virtual address to the hardware. Unlike some of the platforms I’ve done most of my driver development on, we don’t get the benefit of an IOMMU that would allow this to work. There really isn’t a way to catch this, as many userspace virtual addresses are perfectly valid physical addresses as well; the only hint was that the AHCI controller would sometimes raise some sort of bus fault interrupt.

If you’re interested in more of the nitty-gritty implementation details, the driver’s source is available on GitHub, as with the rest of the kush-os code. It supports submitting arbitrary ATA commands with an associated data transfer component. Only reads from disks are implemented, however.

¹ All FIS types are outlined in the SATA specification; a few of them (necessary for the basic AHCI driver) are defined in a header file in the driver.
² The driver can detect SATAPI peripherals such as optical drives, but doesn’t support sending ATA packet (SCSI) commands to them yet. A second RPC server will handle these types of devices.

Building an AHCI Driver