Building a CPU-Based Vulkan Driver from Scratch

Arthur Vasseur
October 29, 2025
18 min read

Getting Started with Vulkan Driver Development: A Practical Guide Using the Vkd Software Driver

Important (Series Context)

This article is the first in a series about building a Vulkan driver. I’m writing it as I learn the Vulkan driver ecosystem, so it may contain mistakes or omissions. I’ll revisit topics and refine the implementation in future articles.

I built Vkd (short for Vulkan Driver) as an experimental CPU‑based ICD using modern C++20. It’s cross‑platform and designed as a learning project. Throughout this article, I use Vkd as a concrete example of how I have structured a driver. I reference the official Vulkan 1.3 specification and the Vulkan Loader and ICD Interface for correctness.

Note (Current Implementation Status)

I’ve successfully implemented the core ICD infrastructure with working transfer operations (buffer fill/copy), fence synchronization, and a ThreadPool-based command execution system. A test program validates the end-to-end pipeline. Rendering features (shaders, pipelines, images) are not yet implemented.


1 Project architecture: abstraction and implementation

Vkd is structured as a layered system that separates the hardware-agnostic driver interface from hardware-specific implementations. This design allows multiple backend implementations to coexist while sharing common infrastructure.

The two-layer architecture

  1. vkd (static library) – The platform-agnostic core where I define abstract classes for all Vulkan objects (Instance, PhysicalDevice, Device, CommandBuffer, etc.). This layer implements:

    • ICD interface negotiation (Icd.cpp)
    • Entry-point dispatch tables
    • Handle conversions and type-safety macros
    • Vulkan object lifecycle management
    • Command recording abstractions
  2. vkd-Software (shared library) – My concrete CPU-based implementation that inherits from vkd base classes. This layer provides:

    • software::PhysicalDevice – Emulates a CPU-based GPU
    • software::Device & software::Queue – CPU execution contexts
    • CommandDispatcher – Executes recorded commands through CpuContext
    • Memory management using host RAM

Extensibility for real hardware

I designed this architecture to support future hardware implementations like:

  • vkd-AMD – AMD GPU backend (interfacing with amdgpu kernel driver)
  • vkd-Nvidia – Nvidia GPU backend (using nvidia-drm)
  • vkd-Intel – Intel integrated graphics backend

Each implementation would compile as a separate DLL with its own ICD manifest, registered independently with the Vulkan loader. For example, vkd-Software.json currently points to vkd-Software.dll:

{
    "file_format_version": "1.0.1",
    "ICD": {
        "library_path": "C:\\path\\to\\vkd-Software.dll",
        "api_version": "1.4.304",
        "library_arch": "64",
        "is_portability_driver": false
    }
}

Build system (xmake)

The xmake.lua defines the vkd target with common infrastructure, then iterates over a drivers table to generate vkd-{DriverName} targets. Each driver target:

  • Links against the vkd static library
  • Includes its own source tree (Src/Vkd{DriverName}/)
  • Generates an ICD manifest after build

This approach ensures that adding a new hardware backend requires minimal build system changes: just add an entry to the drivers table, as sketched below.
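In rough form, a table-driven xmake.lua could look like the following. This is a sketch, not Vkd's real script; the file patterns and the drivers table contents are assumptions:

-- Hypothetical sketch of a table-driven xmake.lua (paths and patterns assumed)
local drivers = { "Software" } -- future entries: "AMD", "Nvidia", "Intel"

target("vkd")
    set_kind("static")
    add_files("Src/Vkd/**.cpp")

for _, name in ipairs(drivers) do
    target("vkd-" .. name)
        set_kind("shared")
        add_deps("vkd")
        add_files("Src/Vkd" .. name .. "/**.cpp")
        after_build(function (target)
            -- emit the ICD manifest (e.g., vkd-Software.json) next to the binary
        end)
    target_end()
end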


2 Understanding the loader and dispatch chains

Before writing a driver, you need to understand how the Vulkan loader works. The loader sits between applications and drivers. At startup, it enumerates layer and driver manifest files and builds a dispatch chain. Each exported Vulkan function, such as vkCreateInstance, is a trampoline that calls into the loader. The loader consults the list of layers and ICDs and routes the call down the chain so that each layer sees it before the ICD.

A Vulkan driver does not export the official function names directly; instead, it implements ICD interface functions such as vk_icdGetInstanceProcAddr so that the loader can query it for entry points. The Loader specification explains that this function must return valid pointers for all global‑level and instance‑level commands, including vkGetDeviceProcAddr, and that global entry points must also be queryable with a NULL instance.

The loader also needs to know which version of the ICD interface the driver supports. During initialization it calls vk_icdNegotiateLoaderICDInterfaceVersion, which allows the driver to tell the loader its supported version. For example, if your ICD supports interface version 7, the loader will call vk_icdGetPhysicalDeviceProcAddr for physical-device‑level commands. The spec also notes that for drivers supporting only Vulkan 1.0, the loader will pass a clamped VkApplicationInfo with apiVersion = VK_API_VERSION_1_0 to prevent VK_ERROR_INCOMPATIBLE_DRIVER.
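A minimal sketch of this negotiation follows. The contract comes from the Loader/ICD interface spec; the constant name is mine:

extern "C" VKAPI_ATTR VkResult VKAPI_CALL
vk_icdNegotiateLoaderICDInterfaceVersion(uint32_t* pSupportedVersion)
{
    constexpr uint32_t kMaxSupportedVersion = 7; // highest version this ICD understands (assumed)

    if (pSupportedVersion == nullptr)
        return VK_ERROR_INCOMPATIBLE_DRIVER;

    // The loader passes in the highest version it supports; the driver lowers
    // it to its own maximum, and both sides then use the agreed version.
    if (*pSupportedVersion > kMaxSupportedVersion)
        *pSupportedVersion = kMaxSupportedVersion;

    return VK_SUCCESS;
}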


3 Implementing vk_icdGetInstanceProcAddr and entry‑point dispatch

Because the driver does not export symbols such as vkCreateInstance or vkEnumeratePhysicalDevices, the loader uses vk_icdGetInstanceProcAddr to query pointers to global and instance‑level functions.

Important (ICD Interface Requirements)

The ICD must correctly implement entry-point lookup to integrate with the Vulkan loader. This includes returning NULL for unsupported functions and providing pointers to all ICD interface functions.

A typical implementation does the following:

  1. Match function names. The ICD compares the requested name against a static map of supported commands and returns the corresponding pointer. Vkd’s Icd class implements this with C++ macros to reduce boilerplate.
  2. Return NULL for unsupported functions. Returning NULL tells the loader to skip or emulate the function. For instance, since Vkd doesn’t implement swapchains yet, the loader won’t advertise VK_KHR_swapchain.
  3. Provide pointers for ICD interface functions. The ICD must return function pointers to vk_icdGetInstanceProcAddr, vk_icdNegotiateLoaderICDInterfaceVersion, and optionally vk_icdGetPhysicalDeviceProcAddr when queried.

Example: Vkd’s entry-point lookup

Here is how Icd.cpp implements vk_icdGetInstanceProcAddr:

PFN_vkVoidFunction Icd::GetInstanceProcAddr(VkInstance pInstance, const char* pName)
{
    if (pName == nullptr)
        return nullptr;

#define VKD_ENTRYPOINT_LOOKUP(klass, name) \
    if (strcmp(pName, "vk" #name) == 0) \
        return (PFN_vkVoidFunction)static_cast<PFN_vk##name>(klass::name)

    // Standard Vulkan API functions
    VKD_ENTRYPOINT_LOOKUP(vkd::Instance, CreateInstance);
    VKD_ENTRYPOINT_LOOKUP(vkd::Instance, DestroyInstance);
    VKD_ENTRYPOINT_LOOKUP(vkd::Instance, EnumeratePhysicalDevices);
    VKD_ENTRYPOINT_LOOKUP(vkd::PhysicalDevice, GetPhysicalDeviceProperties);
    VKD_ENTRYPOINT_LOOKUP(vkd::Device, CreateDevice);
    // ... more entries
#undef VKD_ENTRYPOINT_LOOKUP

#define VKD_ICD_ENTRYPOINT_LOOKUP(klass, name) \
    if (strcmp(pName, "vk_icd" #name) == 0) \
        return (PFN_vkVoidFunction) klass::name

    // ICD interface functions
    VKD_ICD_ENTRYPOINT_LOOKUP(vkd::Icd, NegotiateLoaderICDInterfaceVersion);
    VKD_ICD_ENTRYPOINT_LOOKUP(vkd::Icd, GetInstanceProcAddr);
    VKD_ICD_ENTRYPOINT_LOOKUP(vkd::Icd, GetPhysicalDeviceProcAddr);
#undef VKD_ICD_ENTRYPOINT_LOOKUP

    return nullptr; // Function not supported
}

4 Type conversions and handle wrappers

An essential part of writing a C++ Vulkan driver is converting between opaque Vulkan handles and internal C++ objects. Handles such as VkInstance or VkBuffer are just pointers to opaque data, while the driver manipulates strongly typed C++ objects (Instance, Device, Buffer, etc.).

Note (Handle Type Distinction)

Understanding the difference between dispatchable and non-dispatchable handles is crucial for correct driver implementation. Dispatchable handles require a loader dispatch table pointer as their first member.

Dispatchable vs Non-Dispatchable Handles

The Vulkan specification distinguishes between two types of handles, which require different internal representations:

Dispatchable handles (VkInstance, VkPhysicalDevice, VkDevice, VkQueue, VkCommandBuffer):

  • Must have a loader dispatch table pointer as their first member (required by the Vulkan loader spec)
  • Wrapped in a DispatchableObject<T> structure defined in ObjectBase.hpp:
template<typename T>
struct DispatchableObject
{
    VK_LOADER_DATA LoaderData; // Dispatch table pointer (first member!)
    T* Object;                 // Actual object pointer
};

The loader intercepts function calls by reading the dispatch table from the first member, which is why this layout is mandatory for dispatchable handles.

Non-dispatchable handles (VkBuffer, VkDeviceMemory, VkFence, VkPipeline, VkCommandPool, etc.):

  • Don’t need dispatch tables (they’re implicitly associated with a device that already has one)
  • Stored as direct pointers to objects, no wrapper structure needed
  • More efficient: one less indirection compared to dispatchable handles

Conversion macros

Vkd provides conversion macros in Defines.hpp that handle both types transparently:

// Generic conversion macros (work for both types via FromHandle)
#define VKD_FROM_HANDLE(type, name, handle) \
    VKD_CHECK(handle != nullptr); \
    type* name = type::FromHandle(handle); \
    VKD_CHECK((name) != nullptr); \
    VKD_CHECK(dynamic_cast<type*>(name) != nullptr)

#define VKD_TO_HANDLE(type, handle) (type)(handle)

// For dispatchable handles: unwrap DispatchableObject
#define VKD_DISPATCHABLE_HANDLE(type) \
    static inline type* FromHandle(Vk##type instance) \
    { \
        auto* dispatchable = reinterpret_cast<DispatchableObject<type>*>(instance); \
        if (!dispatchable) \
            return nullptr; \
        if (dispatchable->Object->GetObjectType() != type::ObjectType) \
        { \
            CCT_ASSERT_FALSE("Invalid Object Type"); \
            return nullptr; \
        } \
        return dispatchable->Object; /* Extract actual object */ \
    }

// For non-dispatchable handles: direct pointer cast
#define VKD_NON_DISPATCHABLE_HANDLE(type) \
    static inline type* FromHandle(Vk##type instance) \
    { \
        return reinterpret_cast<type*>(instance); /* Simple cast */ \
    }

Each Vulkan object class declares which type it is. For example, Buffer.hpp:

class Buffer : public ObjectBase
{
public:
    static constexpr VkObjectType ObjectType = VK_OBJECT_TYPE_BUFFER;
    VKD_NON_DISPATCHABLE_HANDLE(Buffer); // Buffer is non-dispatchable
    // ...
};

Whereas Device.hpp declares:

class Device : public ObjectBase
{
public:
    static constexpr VkObjectType ObjectType = VK_OBJECT_TYPE_DEVICE;
    VKD_DISPATCHABLE_HANDLE(Device); // Device is dispatchable
    // ...
};

Example: Bidirectional conversion in vkGetDeviceQueue

Here’s a simple example from Device.cpp showing both conversions:

void Device::GetDeviceQueue(VkDevice pDevice, uint32_t queueFamilyIndex,
                            uint32_t queueIndex, VkQueue* pQueue)
{
    // Convert VkDevice (dispatchable) -> Device*
    VKD_FROM_HANDLE(Device, device, pDevice);
    VKD_CHECK(pQueue);

    // Retrieve the queue object from the device's internal map
    auto* queue = device->GetQueue(queueFamilyIndex, queueIndex);
    if (!queue)
    {
        *pQueue = VK_NULL_HANDLE;
        return;
    }

    // Convert Queue* -> VkQueue (dispatchable)
    *pQueue = VKD_TO_HANDLE(VkQueue, queue);
}

The same pattern works for non-dispatchable handles (like VkBuffer or VkFence), but without the DispatchableObject wrapper overhead; they’re just direct pointer casts.
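For completeness, here is a sketch of the non-dispatchable direction; the DestroyObject helper is an assumption for illustration:

void Device::DestroyFence(VkDevice pDevice, VkFence pFence, const VkAllocationCallbacks* pAllocator)
{
    if (pFence == VK_NULL_HANDLE)
        return; // destroying a null handle is a no-op per the spec

    // Both conversions go through FromHandle; for Fence this is just a
    // reinterpret_cast, with no DispatchableObject unwrapping
    VKD_FROM_HANDLE(Device, device, pDevice);
    VKD_FROM_HANDLE(Fence, fence, pFence);

    device->DestroyObject(fence); // hypothetical cleanup helper
}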


5 Creating an instance

After negotiation, the driver must implement vkCreateInstance. Vkd’s Instance class performs the following steps:

  1. Validate VkInstanceCreateInfo, checking requested extensions and API version.
  2. Enumerate physical devices by calling EnumeratePlatformPhysicalDevices.
  3. Build an instance dispatch table mapping function names to pointers.
  4. Return the handle to the loader.

The Vulkan spec on instance creation details this process.
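To make the flow concrete, here is a condensed sketch of those four steps. It is illustrative rather than Vkd's exact code; ValidateCreateInfo and the handle-wrapping details are assumptions:

VkResult Instance::CreateInstance(const VkInstanceCreateInfo* pCreateInfo,
                                  const VkAllocationCallbacks* pAllocator,
                                  VkInstance* pInstance)
{
    VKD_CHECK(pCreateInfo != nullptr);
    VKD_CHECK(pInstance != nullptr);

    auto instance = std::make_unique<Instance>();

    // 1. Validate requested extensions and API version
    VkResult result = instance->ValidateCreateInfo(*pCreateInfo); // assumed helper
    if (result != VK_SUCCESS)
        return result;

    // 2. Discover the physical devices this driver exposes
    instance->EnumeratePlatformPhysicalDevices();

    // 3. Build the instance dispatch table mapping names to pointers (elided)

    // 4. Hand the handle back to the loader (DispatchableObject wrapping elided)
    *pInstance = VKD_TO_HANDLE(VkInstance, instance.release());
    return VK_SUCCESS;
}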


6 Enumerating physical devices

When vkEnumeratePhysicalDevices is called, the driver must provide an array of physical device handles, each representing a hardware or software device. In Vkd, this is implemented as a single software::PhysicalDevice, which fills in VkPhysicalDeviceProperties, queue family data, and feature structures when queried.
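vkEnumeratePhysicalDevices follows Vulkan's standard two-call idiom: the application first queries the count, then fills the array. A minimal sketch for the single-device case, with GetPhysicalDevice as an assumed accessor:

VkResult Instance::EnumeratePhysicalDevices(VkInstance pInstance,
                                            uint32_t* pPhysicalDeviceCount,
                                            VkPhysicalDevice* pPhysicalDevices)
{
    VKD_FROM_HANDLE(Instance, instance, pInstance);
    VKD_CHECK(pPhysicalDeviceCount != nullptr);

    // First call: the application only asks how many devices exist
    if (pPhysicalDevices == nullptr)
    {
        *pPhysicalDeviceCount = 1; // the single software::PhysicalDevice
        return VK_SUCCESS;
    }

    // Second call: fill as many handles as the caller made room for
    if (*pPhysicalDeviceCount == 0)
        return VK_INCOMPLETE;

    *pPhysicalDeviceCount = 1;
    pPhysicalDevices[0] = VKD_TO_HANDLE(VkPhysicalDevice, instance->GetPhysicalDevice());
    return VK_SUCCESS;
}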


7 Creating a device and queues

According to the Vulkan specification, all device queues are created at the same time as the device and are specified via the array of VkDeviceQueueCreateInfo structures passed to vkCreateDevice. The spec further clarifies that the number of queues created for each queue family is defined at device creation (see Device and Queue Creation).

Warning (Queue Creation Specification)

All device queues must be created during vkCreateDevice. You cannot create or destroy queues after device creation. Applications only retrieve existing handles via vkGetDeviceQueue.

What this implies

  • You cannot create or destroy queues after the device is created.
  • The driver must validate requested families/counts against vkGetPhysicalDeviceQueueFamilyProperties and instantiate the queues as part of vkCreateDevice.
  • Applications retrieve the already-created handles with vkGetDeviceQueue / vkGetDeviceQueue2; these do not create queues.

Minimal driver-side flow (Vkd-style pseudocode)

[...]
VkResult Device::Create(PhysicalDevice& phys, const VkDeviceCreateInfo* ci) {
    // Create queues now (spec requirement)
    for (uint32_t i = 0; i < ci->queueCreateInfoCount; ++i) {
        const VkDeviceQueueCreateInfo& qci = ci->pQueueCreateInfos[i];
        for (uint32_t q = 0; q < qci.queueCount; ++q) {
            auto qObj = std::make_unique<software::Queue>();
            VK_CHECK(qObj->Create(*this, qci.queueFamilyIndex, q, qci.flags));
            m_queues[{qci.queueFamilyIndex, q}] = std::move(qObj);
        }
    }
    return VK_SUCCESS;
}

Retrieving queues (does not create)

void Device::GetDeviceQueue(uint32_t family, uint32_t index, VkQueue* out) const {
    auto it = m_queues.find({family, index});
    VKD_CHECK(it != m_queues.end());
    *out = VKD_TO_HANDLE(VkQueue, it->second.get());
}

8 ThreadPool architecture for asynchronous queue execution

I implemented asynchronous command execution using a custom ThreadPool class located in VkdUtils/ThreadPool.

Why a ThreadPool?

My initial implementation used raw std::thread with detach() for queue submissions. However, I quickly realized this approach had several drawbacks:

  1. Resource overhead: Creating and destroying threads for every queue submission is expensive
  2. Uncontrolled thread count: Multiple rapid submissions could spawn excessive threads
  3. Undefined behavior on shutdown: If the program terminates before detached threads finish execution, resources may be destroyed prematurely, leading to undefined behavior or crashes.
  4. Difficult wait semantics: Implementing vkQueueWaitIdle required tracking all detached threads

The ThreadPool I built solves these issues by maintaining a fixed pool of worker threads that process tasks from a shared queue.

Tip (ThreadPool Benefits)

Using a ThreadPool instead of detached threads provides better resource management, controlled concurrency, and proper shutdown semantics. This is essential for implementing Vulkan’s asynchronous queue submission model correctly.

ThreadPool API

The ThreadPool.hpp interface provides two submission modes:

// Fire-and-forget for void-returning functions
template<typename F>
    requires std::invocable<std::decay_t<F>> && std::is_void_v<std::invoke_result_t<std::decay_t<F>>>
void AddTask(F&& f);

// Returns std::future<T> for result retrieval and chaining
template<typename F>
    requires std::invocable<std::decay_t<F>>
auto Submit(F&& f) -> std::future<std::invoke_result_t<std::decay_t<F>>>;

Key features:

  • Modern C++20: Uses std::jthread, std::stop_token, and concepts for clean shutdown
  • Wait operations: Wait() and WaitFor() block until all in-flight tasks complete
  • Graceful shutdown: RequestStop() prevents new submissions while finishing current work
  • Thread-safe: All public methods use appropriate synchronization primitives

How to integrate queue serialization

The Vulkan specification states:

Queue submission commands (vkQueueSubmit, vkQueueBindSparse) have no implicit ordering constraints, but submission order within the same queue defines the submission order for batches.

I implement this requirement using chained futures. Each Queue maintains:

// From https://github.com/ArthurVasseur/Vkd/tree/711c288ad8596880a08ef961480cffb59cfa78a0/Src/VkdSoftware/Queue/Queue.hpp
std::future<bool> m_previousSubmit;
std::mutex m_submitMutex;

When vkQueueSubmit is called (see Queue.cpp:38-59):

  1. The previous std::future is moved into the new task’s lambda capture
  2. The task waits for the previous future before executing
  3. The new future becomes m_previousSubmit for the next submission

This creates a dependency chain where submission N+1 automatically waits for submission N to complete, ensuring proper ordering while still allowing parallel execution across different queues.
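In sketch form (the real code lives in Queue.cpp:38-59; m_threadPool and the task body here are simplified for illustration):

VkResult Queue::Submit(uint32_t submitCount, const VkSubmitInfo* pSubmits, VkFence fence)
{
    std::lock_guard<std::mutex> lock(m_submitMutex);

    // 1. Move the previous future into the new task's capture
    auto task = [previous = std::move(m_previousSubmit)]() mutable
    {
        // 2. Wait for the predecessor before executing this submission
        if (previous.valid())
            previous.wait();

        // ... execute the recorded command buffers, then signal the fence ...
        return true;
    };

    // 3. The new future becomes the tail of the chain
    m_previousSubmit = m_threadPool->Submit(std::move(task));
    return VK_SUCCESS;
}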

Example: Queue::WaitIdle implementation

The WaitIdle implementation simply waits for the last submission:

VkResult Queue::WaitIdle()
{
    std::lock_guard<std::mutex> lock(m_submitMutex);

    // Wait for the previous submit to complete
    if (m_previousSubmit.valid())
    {
        m_previousSubmit.wait();
    }
    return VK_SUCCESS;
}

This guarantees that all prior submissions have finished executing, as each submission waits for its predecessor in the chain.


9 Recording and dispatching commands

Command buffers record GPU‑like operations. vkBeginCommandBuffer and vkEndCommandBuffer delimit recording, while vkCmd* functions such as vkCmdCopyBuffer append entries to the command buffer.

Command recording architecture

Important (Command Recording vs Execution)

Vulkan separates command recording (building command buffers) from execution (submitting to queues). This fundamental design allows multithreaded recording while maintaining ordered execution.

Following the Vulkan specification, I’ve separated command recording from execution into two distinct phases:

  1. Recording phase (vkCmdXXX functions): When vkCmdCopyBuffer or similar functions are called, they do not execute immediately. Instead, the command buffer creates an operation structure (e.g., Buffer::OpCopy) and stores it in a std::vector<Op>. This is purely a recording operation—no actual work happens.

  2. Execution phase (vkQueueSubmit): Only when vkQueueSubmit is called does execution begin. The queue submits the recorded command buffer to the ThreadPool, which instantiates a CpuContext and CommandDispatcher, then iterates through and executes the recorded operations asynchronously.

This separation is fundamental to Vulkan’s design: recording can happen on any thread, while execution is controlled by the queue and respects submission order.

Example: Command buffer operations

CommandBuffer.hpp defines operations as a variant of all possible command types:

using Op = std::variant<Buffer::Op, vkd::Op, /*...*/>;

Ops.hpp defines the operation structures:

struct OpBindVertexBuffer
{
    std::vector<Buffer*> Buffers;
    std::vector<VkDeviceSize> Offsets;
    UInt32 FirstBinding;
};

struct OpDraw
{
    UInt32 VertexCount;
    UInt32 InstanceCount;
    UInt32 FirstVertex;
    UInt32 FirstInstance;
};

struct OpBindPipeline
{
    VkPipelineBindPoint BindPoint;
    Pipeline* Pipeline;
};

Each Vulkan command (e.g., vkCmdDraw) is recorded as a corresponding operation structure pushed into the command buffer’s operation vector.
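As an illustration, the recording side of vkCmdDraw could look like the sketch below; PushOp is a hypothetical helper, and a real entry point would also validate the recording state:

void CommandBuffer::CmdDraw(VkCommandBuffer pCommandBuffer, uint32_t vertexCount,
                            uint32_t instanceCount, uint32_t firstVertex,
                            uint32_t firstInstance)
{
    VKD_FROM_HANDLE(CommandBuffer, commandBuffer, pCommandBuffer);

    // Recording only: nothing executes until vkQueueSubmit
    commandBuffer->PushOp(OpDraw{
        .VertexCount = vertexCount,
        .InstanceCount = instanceCount,
        .FirstVertex = firstVertex,
        .FirstInstance = firstInstance,
    });
}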


10 The software implementation: CommandDispatcher & CpuContext

In the vkd-Software driver, I’ve implemented CPU-based execution of Vulkan commands. In this section, I’ll explain how CommandDispatcher and CpuContext work together to execute recorded commands.

Architecture overview

Queue::Submit
└─> spawns thread
    └─> CpuContext ctx (execution state)
    └─> CommandDispatcher disp (visitor pattern)
        └─> disp.Execute(cmdBuf)
            └─> for each operation in cmdBuf:
                    std::visit(operation) → ctx.Draw() / ctx.CopyBuffer() / ...

CommandDispatcher: The visitor pattern

CommandDispatcher.hpp defines a visitor that dispatches operations to the CpuContext:

class CommandDispatcher
{
public:
    explicit CommandDispatcher(CpuContext& ctx);
    VkResult Execute(const vkd::CommandBuffer& cb);

private:
    VkResult operator()(vkd::Buffer::OpFill op);
    VkResult operator()(vkd::Buffer::OpCopy op);
    // ... more operation handlers

    CpuContext* m_context;
};

The Execute method in CommandDispatcher.cpp uses std::visit to dispatch each operation:

VkResult CommandDispatcher::Execute(const vkd::CommandBuffer& cb)
{
    if (!cb.IsSealed())
        return VK_ERROR_VALIDATION_FAILED_EXT;

    const auto& ops = cb.GetOps();
    for (const auto& op : ops)
    {
        VkResult result = std::visit([this]<typename T>(T operation)
        {
            return (*this)(std::move(operation));
        }, op);
        if (result != VK_SUCCESS)
            return result;
    }
    return VK_SUCCESS;
}

Each overloaded operator() forwards the operation to the appropriate CpuContext method:

VkResult CommandDispatcher::operator()(vkd::Buffer::OpCopy op)
{
    return m_context->CopyBuffer(std::move(op));
}

VkResult CommandDispatcher::operator()(vkd::OpDraw op)
{
    return m_context->Draw(std::move(op));
}

CpuContext: Execution state and implementation

CpuContext.hpp maintains the execution state for the CPU renderer:

class CpuContext
{
public:
    CpuContext();

    VkResult CopyBuffer(vkd::Buffer::OpCopy op);
    VkResult FillBuffer(vkd::Buffer::OpFill op);
    // ... more command implementations
};

Example implementations

CopyBuffer (CpuContext.cpp:55-77) performs CPU-side memory copies:

VkResult CpuContext::CopyBuffer(vkd::Buffer::OpCopy op)
{
    for (auto& region : op.regions)
    {
        cct::UByte* srcData = nullptr;
        op.src->GetMemory()->Map(region.srcOffset, region.size, reinterpret_cast<void**>(&srcData));

        cct::UByte* dstData = nullptr;
        op.dst->GetMemory()->Map(region.dstOffset, region.size, reinterpret_cast<void**>(&dstData));

        std::memcpy(dstData, srcData, region.size);

        op.dst->GetMemory()->Unmap();
        op.src->GetMemory()->Unmap();
    }
    return VK_SUCCESS;
}

FillBuffer (CpuContext.cpp:79-94) implements vkCmdFillBuffer:

VkResult CpuContext::FillBuffer(vkd::Buffer::OpFill op)
{
    cct::UByte* data = nullptr;
    op.dst->GetMemory()->Map(op.offset, op.size, reinterpret_cast<void**>(&data));

    UInt32* data32 = reinterpret_cast<UInt32*>(data);
    size_t count = op.size / sizeof(UInt32);
    for (size_t i = 0; i < count; ++i)
        data32[i] = op.data;

    op.dst->GetMemory()->Unmap();
    return VK_SUCCESS;
}

Why this design?

This architecture provides several benefits:

  1. Separation of concerns: CommandBuffer (in vkd) records operations agnostically, while CommandDispatcher and CpuContext (in vkd-Software) handle execution details.

  2. Extensibility: A hardware backend (e.g., vkd-AMD) would implement its own dispatcher and context that translate operations into GPU commands.

  3. Type safety: Using std::variant and std::visit ensures compile-time checking of all operation types.


11 Memory management and synchronization

Device memory implementation

I designed Vkd’s memory management to follow the Vulkan specification for vkAllocateMemory and vkFreeMemory. The base DeviceMemory class (Vkd/DeviceMemory/DeviceMemory.hpp) defines an abstract interface:

class DeviceMemory : public vkd::ObjectBase<DeviceMemory>
{
public:
    virtual ~DeviceMemory() = default;

    virtual VkResult Map(VkDeviceSize offset, VkDeviceSize size, void** ppData) = 0;
    virtual void Unmap() = 0;

    VkDeviceSize GetSize() const noexcept { return m_size; }
    uint32_t GetMemoryTypeIndex() const noexcept { return m_memoryTypeIndex; }

protected:
    VkDeviceSize m_size = 0;
    uint32_t m_memoryTypeIndex = 0;
    bool m_mapped = false;
};

The software implementation (VkdSoftware/DeviceMemory/DeviceMemory.hpp) uses host memory:

class DeviceMemory : public vkd::DeviceMemory
{
public:
    VkResult Create(vkd::Device& owner, const VkMemoryAllocateInfo& info) override
    {
        m_size = info.allocationSize;
        m_memoryTypeIndex = info.memoryTypeIndex;
        m_data.resize(m_size); // Allocate host memory
        return VK_SUCCESS;
    }

    VkResult Map(VkDeviceSize offset, VkDeviceSize size, void** ppData) override
    {
        VKD_CHECK(!m_mapped);
        *ppData = m_data.data() + offset; // Return pointer into vector
        m_mapped = true;
        return VK_SUCCESS;
    }

    void Unmap() override
    {
        VKD_CHECK(m_mapped);
        m_mapped = false;
        // No-op: host memory remains accessible
    }

private:
    std::vector<UByte> m_data; // Host-side backing storage
};

Key design decisions I made:

  1. Host memory backing: I use std::vector<UByte> for memory storage in the software driver, making it trivially mappable.

  2. Persistent mapping: Unlike real GPU drivers that may need to establish kernel mappings, my software implementation’s memory is always accessible; I made Unmap() essentially a no-op that just updates the m_mapped flag for validation. While this simplifies the implementation, I’m still considering whether tracking the mapped state is worth it for a CPU driver.

  3. No memory types: I simplified the implementation by exposing only a single memory heap with VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT. In the future, I could add support for different memory types.

  4. Buffer memory binding: When vkBindBufferMemory is called, the buffer stores a pointer to the DeviceMemory object and the offset, allowing direct memory access during command execution (see the sketch below).
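In sketch form, that binding could look like this; SetMemory stands in for whatever helper Buffer actually exposes:

VkResult Device::BindBufferMemory(VkDevice pDevice, VkBuffer pBuffer,
                                  VkDeviceMemory pMemory, VkDeviceSize memoryOffset)
{
    VKD_FROM_HANDLE(Device, device, pDevice);
    VKD_FROM_HANDLE(Buffer, buffer, pBuffer);
    VKD_FROM_HANDLE(DeviceMemory, memory, pMemory);

    // The buffer only records where its storage lives; command execution later
    // resolves real pointers through GetMemory()->Map()
    buffer->SetMemory(memory, memoryOffset); // assumed helper
    return VK_SUCCESS;
}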

Synchronization primitives

Fences (VkdSoftware/Synchronization/Fence) use standard C++ synchronization:

class Fence : public vkd::Fence
{
public:
    VkResult Create(vkd::Device& owner, const VkFenceCreateInfo& info) override
    {
        m_signaled = (info.flags & VK_FENCE_CREATE_SIGNALED_BIT) != 0;
        return VK_SUCCESS;
    }

    void Signal() // Called by Queue::Submit
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_signaled = true;
        m_cv.notify_all();
    }

    VkResult Wait(uint64_t timeout) override
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        if (timeout == UINT64_MAX)
        {
            m_cv.wait(lock, [this] { return m_signaled; });
            return VK_SUCCESS;
        }
        else
        {
            auto success = m_cv.wait_for(lock, std::chrono::nanoseconds(timeout),
                                         [this] { return m_signaled; });
            return success ? VK_SUCCESS : VK_TIMEOUT;
        }
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_cv;
    bool m_signaled = false;
};

My fence implementation demonstrates how I can emulate GPU synchronization semantics using CPU-based primitives:

  • I use std::mutex + std::condition_variable to replace GPU wait operations
  • The timeout support matches Vulkan’s vkWaitForFences behavior
  • Thread-safe signaling allows the queue submission thread to notify waiting threads

What I plan to add for synchronization:

  • Semaphores (timeline and binary) for queue-to-queue synchronization
  • Events for fine-grained command buffer synchronization
  • Pipeline barriers (currently unimplemented)

12 Profiling infrastructure

I’ve integrated Tracy Profiler into Vkd for performance analysis. The VKD_AUTO_PROFILER_SCOPE() macro is placed at the entry of critical functions to automatically track execution time:

VkResult Queue::Submit(uint32_t submitCount, const VkSubmitInfo* pSubmits, VkFence fence)
{
    VKD_AUTO_PROFILER_SCOPE(); // Automatic scope profiling
    // ... implementation
}
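Conceptually, the macro boils down to a thin wrapper over Tracy’s ZoneScoped. A simplified sketch, not the exact definition (the build flag is an assumption):

#include <tracy/Tracy.hpp>

#if defined(VKD_ENABLE_PROFILING) // hypothetical build flag
    #define VKD_AUTO_PROFILER_SCOPE() ZoneScoped
#else
    #define VKD_AUTO_PROFILER_SCOPE() ((void)0)
#endif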

Where I’ve instrumented profiling:

  • All queue operations (Submit, WaitIdle, BindSparse)
  • Command buffer operations (begin, end, reset)
  • Device creation and destruction
  • Memory allocation and mapping
  • Command execution in CpuContext

Tracy provides me with a visual timeline showing:

  • Function call hierarchies
  • Execution durations
  • Thread activity
  • Bottleneck identification

This instrumentation is particularly useful for analyzing the ThreadPool behavior, identifying lock contention in queue submissions, and measuring command execution overhead. Since Vkd is CPU-based, profiling helps me optimize the driver’s own overhead separately from the simulated “GPU” work.


Note (Future Implementation Roadmap)

The following features represent the next steps in building a complete Vulkan driver. While not yet implemented, the current architecture is designed to support these additions.

Major features I haven’t implemented yet

The following Vulkan features would significantly expand Vkd’s capabilities:

Rendering pipeline:

  • Shader compilation and SPIR-V parsing (vkCreateShaderModule)
  • Render passes and framebuffers (vkCreateRenderPass, vkCreateFramebuffer)
  • Actual rasterization (triangle setup, fragment processing)
  • Vertex input attribute processing
  • Viewport and scissor transformations

Image/Texture support:

  • Image creation and views (vkCreateImage, vkCreateImageView)
  • Image memory binding and layout transitions
  • Samplers (vkCreateSampler)
  • Image-to-image and buffer-to-image copies
  • Texture filtering and mipmapping

Descriptor sets:

  • Descriptor set layouts and pools (vkCreateDescriptorSetLayout, vkCreateDescriptorPool)
  • Descriptor set allocation and updates
  • Binding descriptors in command buffers
  • Push constants

Synchronization:

  • Semaphores (binary and timeline) for queue-to-queue synchronization
  • Events (vkCreateEvent, vkCmdSetEvent, vkCmdWaitEvents)
  • Pipeline barriers for image layout transitions
  • Memory barriers for cache coherency

Presentation:

  • Swapchain extension (VK_KHR_swapchain)
  • Surface creation and present support
  • Integration with window systems (Win32, Xlib, Wayland)

Advanced features:

  • Compute shaders and compute pipelines
  • Query pools (timestamps, occlusion queries)
  • Indirect drawing (vkCmdDrawIndirect, vkCmdDrawIndexedIndirect are stubbed)
  • Multi-draw indirect
  • Pipeline cache for faster shader compilation

13 References