Goldy: Modern GPU Library
Goldy is a Rust GPU library built around a typed bindless programming model, a dependency-driven task graph, and first-class compute support — targeting Vulkan 1.4+, DX12, and Metal Tier 2+ with native backends (no translation layers).
Typed Bindless Programming
Shaders are written in Slang using goldy_exp virtual entry points ([goldy_compute], [goldy_vertex], [goldy_fragment]). Resources are declared as typed parameters — the Goldy compiler resolves bindless slots automatically:
import goldy_exp;
[goldy_compute]
[numthreads(64, 1, 1)]
void cs_main(MyUniforms cfg, Scattered<uint> data, ThreadId id) {
data[id.x] = data[id.x] + cfg.base;
}
| Type | Maps To | Use |
|---|---|---|
| `Scattered<T>` | `RWStructuredBuffer<T>` | Read/write storage |
| `BufRO<T>` | `StructuredBuffer<T>` | Read-only storage |
| `DirectSpatial<T>` | `RWTexture2D<T>` | Read/write texture |
| `Interpolated<T>` | `Texture2D<T>` | Sampled texture |
| `Filter` | `SamplerState` | Texture sampler |
| `ThreadId` | `SV_DispatchThreadID` | Compute thread index |
| `VertexId` | `SV_VertexID` | Vertex index |
Struct parameters are automatically treated as broadcast (constant buffer) data.
Task Graph
TaskGraph provides explicit dependency scheduling for bindless compute work. You declare what each node reads and writes; Goldy inserts optimal barriers, parallelizes independent dispatches across waves, and aliases transient resources:
let mut graph = TaskGraph::new();
graph
    .node("simulate", &sim_pipeline)
    .bind_buffer(&particles, NodeAccess::ReadWrite)
    .bind_resources_raw(&[particles_handle.index()])
    .dispatch(group_count, 1, 1);
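The declared accesses are what drive scheduling: a node must run after the last node that wrote any resource it touches, and nodes with disjoint resources can run in the same wave. A minimal sketch of that inference in plain Rust (illustrative only; Goldy's actual TaskGraph internals are not shown here, and write-after-read hazards are omitted for brevity):

```rust
// Sketch: infer execution-order edges from declared read/write accesses.
// Resources are identified by a plain u32 id for illustration.
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq)]
enum Access { Read, ReadWrite }

/// Returns edges (from, to) meaning `to` must wait for `from`.
fn infer_edges(nodes: &[(&str, Vec<(u32, Access)>)]) -> Vec<(String, String)> {
    let mut last_writer: HashMap<u32, String> = HashMap::new();
    let mut edges = Vec::new();
    for (name, accesses) in nodes {
        for (res, access) in accesses {
            // Any access must be ordered after the resource's last writer.
            if let Some(w) = last_writer.get(res) {
                if w.as_str() != *name {
                    edges.push((w.clone(), name.to_string()));
                }
            }
            if *access == Access::ReadWrite {
                last_writer.insert(*res, name.to_string());
            }
        }
    }
    edges
}

fn main() {
    let edges = infer_edges(&[
        ("simulate", vec![(0, Access::ReadWrite)]), // writes particles
        ("render",   vec![(0, Access::Read)]),      // reads particles
        ("blur",     vec![(1, Access::ReadWrite)]), // independent resource
    ]);
    // "render" waits on "simulate"; "blur" has no edges and can run in parallel.
    println!("{:?}", edges);
}
```

Nodes that acquire no edges (like `blur` above) are exactly the ones a scheduler can batch into the same wave.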
Compute-to-Surface
Compute shaders can write directly to swapchain textures — no graphics pipeline, no vertex buffers, no render passes. Acquire a frame, get its texture handle, dispatch, present:
let frame = surface.begin()?;
let texture = frame.texture();
// ... build TaskGraph, dispatch compute ...
frame.submit_compute(&graph)?;
frame.present()?;
Multi-Backend, Single Shader Language
Goldy compiles Slang shaders to SPIR-V (Vulkan), DXIL (DX12), and Metal IR at runtime via the bundled Slang compiler. Each backend is a native implementation — Metal uses Metal idioms, not translated Vulkan.
| Platform | Backend |
|---|---|
| Linux | Vulkan |
| Windows | DX12 (Vulkan optional) |
| macOS | Metal |
License
Goldy is dual-licensed under LGPL-2.1-or-later and a commercial license. See License for details.
Installation
Requirements
- Rust stable (recent version recommended)
- A supported GPU
Adding Goldy to Your Project
[dependencies]
goldy = "0.1"
Or with cargo:
cargo add goldy
Feature Flags
| Feature | Default | Description |
|---|---|---|
| `vulkan` | yes | Vulkan 1.4+ backend (Linux, Windows) |
| `dx12` | yes | DirectX 12 backend (Windows) |
| `metal` | yes | Metal Tier 2+ backend (macOS) |
| `instrumentation` | yes | Structured tracing via `tracing-subscriber` (zero-cost when disabled) |
Platform-inappropriate features are no-ops — enabling metal on Linux or dx12 on macOS compiles cleanly but does nothing.
To build with only specific backends:
[dependencies]
goldy = { version = "0.1", default-features = false, features = ["vulkan"] }
Shader Toolchain
Goldy uses Slang as its shader language. The Slang compiler is bundled automatically via slang-rs — no separate SDK install is needed. Shaders are compiled at runtime to the appropriate target (SPIR-V, DXIL, or Metal IR).
Verifying Installation
use goldy::{Instance, DeviceType};

fn main() -> anyhow::Result<()> {
    let instance = Instance::new()?;
    println!("Available GPUs:");
    for adapter in instance.enumerate_adapters() {
        println!("  {} ({:?})", adapter.name, adapter.device_type);
    }
    let device = instance.create_device(DeviceType::DiscreteGpu)?;
    println!("\nUsing: {}", device.adapter_info().name);
    Ok(())
}
cargo run
Expected output:
Available GPUs:
NVIDIA GeForce RTX 4060 Ti (DiscreteGpu)
Intel(R) UHD Graphics 770 (IntegratedGpu)
Using: NVIDIA GeForce RTX 4060 Ti
Backend Selection
Goldy selects the best backend for your platform automatically:
| Platform | Default Backend |
|---|---|
| Windows | DX12 |
| Linux | Vulkan |
| macOS | Metal |
Override at runtime with GOLDY_BACKEND:
GOLDY_BACKEND=vulkan cargo run
Platform-Specific Setup
Windows
DX12 is used by default and requires no additional setup. For the Vulkan backend, install the Vulkan SDK. Ensure your GPU drivers are up to date.
Linux
Install Vulkan development packages:
# Ubuntu/Debian
sudo apt install libvulkan-dev vulkan-tools
# Fedora
sudo dnf install vulkan-loader-devel vulkan-tools
# Arch
sudo pacman -S vulkan-icd-loader vulkan-tools
macOS
Goldy uses the native Metal backend — no MoltenVK or Vulkan SDK needed. Ensure macOS 12+ and Xcode command-line tools are installed:
xcode-select --install
Windowing (for examples)
The examples use winit for windowing:
[dev-dependencies]
winit = "0.30"
anyhow = "1.0"
Next Steps
- Your First Triangle — draw a colored triangle
- Your First Compute Shader — write pixels from compute
Your First Triangle
This tutorial draws a colored triangle in a window using Goldy's render pipeline and Surface API.
Complete Code
use goldy::{
    shader::builtins, Buffer, Color, CommandEncoder, DataAccess, DeviceType, Instance,
    RenderPipeline, RenderPipelineDesc, ShaderModule, Surface, Vertex2D,
};
use std::sync::Arc;
use winit::{
    application::ApplicationHandler,
    event::WindowEvent,
    event_loop::{ActiveEventLoop, ControlFlow, EventLoop},
    window::{Window, WindowId},
};

struct App {
    instance: Instance,
    device: Option<Arc<goldy::Device>>,
    vertex_buffer: Option<Buffer>,
    pipeline: Option<RenderPipeline>,
    window: Option<Arc<Window>>,
    surface: Option<Surface>,
}

impl App {
    fn new() -> anyhow::Result<Self> {
        Ok(Self {
            instance: Instance::new()?,
            device: None,
            vertex_buffer: None,
            pipeline: None,
            window: None,
            surface: None,
        })
    }

    fn init_gpu(&mut self, window: &Arc<Window>) -> anyhow::Result<()> {
        let device = Arc::new(self.instance.create_device(DeviceType::DiscreteGpu)?);
        let vertices = [
            Vertex2D::new(0.0, -0.5, Color::RED),
            Vertex2D::new(-0.5, 0.5, Color::GREEN),
            Vertex2D::new(0.5, 0.5, Color::BLUE),
        ];
        let vertex_buffer = Buffer::with_data(&device, &vertices, DataAccess::Scattered)?;
        let surface = Surface::new(&device, window.as_ref())?;
        let shader = ShaderModule::from_slang(&device, builtins::VERTEX_COLOR_2D)?;
        let pipeline = RenderPipeline::new(
            &device,
            &shader,
            &shader,
            &RenderPipelineDesc {
                vertex_layout: Vertex2D::layout(),
                target_format: surface.format(),
                ..Default::default()
            },
        )?;
        self.device = Some(device);
        self.vertex_buffer = Some(vertex_buffer);
        self.pipeline = Some(pipeline);
        self.surface = Some(surface);
        Ok(())
    }

    fn render(&mut self) -> anyhow::Result<()> {
        let window = self.window.as_ref().unwrap();
        let size = window.inner_size();
        if size.width == 0 || size.height == 0 {
            return Ok(());
        }
        let pipeline = self.pipeline.as_ref().unwrap();
        let vertex_buffer = self.vertex_buffer.as_ref().unwrap();
        let surface = self.surface.as_ref().unwrap();
        let frame = surface.begin()?;
        let mut encoder = CommandEncoder::new();
        {
            let mut pass = encoder.begin_render_pass();
            pass.clear(Color { r: 0.1, g: 0.1, b: 0.2, a: 1.0 });
            pass.set_pipeline(pipeline);
            pass.set_vertex_buffer(0, vertex_buffer);
            pass.draw(0..3, 0..1);
        }
        frame.render(encoder)?;
        frame.present()?;
        Ok(())
    }
}

impl ApplicationHandler for App {
    fn resumed(&mut self, event_loop: &ActiveEventLoop) {
        if self.window.is_none() {
            let window = Arc::new(
                event_loop
                    .create_window(
                        Window::default_attributes()
                            .with_title("Goldy - Triangle")
                            .with_inner_size(winit::dpi::LogicalSize::new(800, 600)),
                    )
                    .unwrap(),
            );
            self.window = Some(window.clone());
            self.init_gpu(&window).unwrap();
        }
    }

    fn window_event(&mut self, event_loop: &ActiveEventLoop, _: WindowId, event: WindowEvent) {
        match event {
            WindowEvent::CloseRequested => event_loop.exit(),
            WindowEvent::RedrawRequested => {
                self.render().ok();
                self.window.as_ref().unwrap().request_redraw();
            }
            WindowEvent::Resized(new_size) => {
                if new_size.width > 0 && new_size.height > 0 {
                    if let Some(surface) = &mut self.surface {
                        let _ = surface.resize(new_size.width, new_size.height);
                    }
                }
            }
            _ => {}
        }
    }
}

fn main() -> anyhow::Result<()> {
    let event_loop = EventLoop::new()?;
    event_loop.set_control_flow(ControlFlow::Poll);
    event_loop.run_app(&mut App::new()?)?;
    Ok(())
}
Walkthrough
Instance and Device
let instance = Instance::new()?;
let device = Arc::new(instance.create_device(DeviceType::DiscreteGpu)?);
Instance discovers available GPUs. create_device opens a connection to one. The Arc wrapper is required for Surface lifetime management.
Vertex Buffer
let vertices = [
    Vertex2D::new(0.0, -0.5, Color::RED),
    Vertex2D::new(-0.5, 0.5, Color::GREEN),
    Vertex2D::new(0.5, 0.5, Color::BLUE),
];
let vertex_buffer = Buffer::with_data(&device, &vertices, DataAccess::Scattered)?;
Vertex2D is a built-in vertex type with position and color. Buffer::with_data allocates a GPU buffer and uploads the data. DataAccess::Scattered marks it as a bindless storage buffer.
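For intuition about what a vertex like this occupies in the buffer, here is a hypothetical mirror of a `Vertex2D`-style type on the Rust side (the actual field layout of `goldy::Vertex2D` is defined by the library; this sketch only illustrates `#[repr(C)]` sizing):

```rust
// Hypothetical Vertex2D-like type: 2 floats of position plus 4 floats of RGBA color.
// Illustrative only; the real goldy::Vertex2D layout may differ.
#[repr(C)]
#[derive(Clone, Copy)]
struct Vertex2DSketch {
    pos: [f32; 2],
    color: [f32; 4],
}

fn main() {
    // Six f32 fields with no padding: 24 bytes per vertex, which is the stride
    // a vertex layout for this type would declare.
    assert_eq!(std::mem::size_of::<Vertex2DSketch>(), 24);
    println!("stride = {} bytes", std::mem::size_of::<Vertex2DSketch>());
}
```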
Shader and Pipeline
let shader = ShaderModule::from_slang(&device, builtins::VERTEX_COLOR_2D)?;
let pipeline = RenderPipeline::new(
    &device,
    &shader,
    &shader,
    &RenderPipelineDesc {
        vertex_layout: Vertex2D::layout(),
        target_format: surface.format(),
        ..Default::default()
    },
)?;
builtins::VERTEX_COLOR_2D is a built-in Slang shader from the goldy_exp library that uses [goldy_vertex] and [goldy_fragment] virtual entry points to render vertex-colored geometry. ShaderModule::from_slang compiles Slang source to the active backend's IR at runtime.
The pipeline takes the same shader module for both vertex and fragment stages — goldy_exp virtual entry points let a single source file define both.
Surface and Presentation
let surface = Surface::new(&device, window.as_ref())?;
let frame = surface.begin()?;
let mut encoder = CommandEncoder::new();
{
    let mut pass = encoder.begin_render_pass();
    pass.clear(Color { r: 0.1, g: 0.1, b: 0.2, a: 1.0 });
    pass.set_pipeline(pipeline);
    pass.set_vertex_buffer(0, vertex_buffer);
    pass.draw(0..3, 0..1);
}
frame.render(encoder)?;
frame.present()?;
Surface manages the swapchain. begin() acquires the next swapchain image. Commands are recorded into a CommandEncoder, rendered to the frame with frame.render(), then presented with frame.present(). Rendering happens directly on the GPU — no CPU readback.
Run It
cargo run --example triangle
You should see a window with a colored triangle on a dark blue background.
Next Steps
- Your First Compute Shader — bypass the graphics pipeline entirely
- Examples — more complex demos
Your First Compute Shader
This tutorial renders an animated plasma effect by dispatching a compute shader directly to the swapchain texture — no graphics pipeline, no vertex buffers, no render passes.
The Shader
The compute shader uses goldy_exp virtual entry points. It reads uniforms via BufRO<Uniforms> and writes pixels to the swapchain texture via DirectSpatial<float4>:
import goldy_exp;
struct Uniforms {
uint width;
uint height;
float time;
float _padding;
};
[goldy_compute]
[numthreads(8, 8, 1)]
void cs_main(BufRO<Uniforms> uniforms_buf, DirectSpatial<float4> output, ThreadId tid) {
Uniforms u = uniforms_buf[0];
if (tid.x >= u.width || tid.y >= u.height)
return;
float2 uv = float2(float(tid.x) / float(u.width),
float(tid.y) / float(u.height));
float2 p = uv * 2.0 - 1.0;
p.x *= float(u.width) / float(u.height);
float t = u.time;
float v = 0.0;
v += sin(p.x * 6.0 + t);
v += sin(p.y * 6.0 + t * 1.3);
v += sin((p.x + p.y) * 4.0 + t * 0.7);
v += sin(length(p) * 8.0 - t * 2.0);
v *= 0.25;
float3 col = float3(0.5 + 0.5 * sin(v * 3.14159 + 0.0),
0.5 + 0.5 * sin(v * 3.14159 + 2.094),
0.5 + 0.5 * sin(v * 3.14159 + 4.188));
output[tid.xy] = float4(col, 1.0);
}
Key points:
- `BufRO<Uniforms>` is a read-only structured buffer. Index with `[0]` to load the single element.
- `DirectSpatial<float4>` is an `RWTexture2D<float4>` — write to it with `output[tid.xy]`.
- `ThreadId` maps to `SV_DispatchThreadID`. Each thread handles one pixel.
- The `[goldy_compute]` attribute tells the Goldy compiler to wire up bindless slots automatically.
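Because the plasma formula is pure arithmetic, it can be sanity-checked on the CPU before touching the GPU. A Rust transcription of the shader body for a single pixel (illustrative; the Slang version above is the one that actually runs):

```rust
// CPU transcription of the per-pixel plasma formula, matching the shader's constants.
fn plasma(px: u32, py: u32, width: u32, height: u32, t: f32) -> [f32; 3] {
    let uv = [px as f32 / width as f32, py as f32 / height as f32];
    let mut p = [uv[0] * 2.0 - 1.0, uv[1] * 2.0 - 1.0];
    p[0] *= width as f32 / height as f32; // aspect-ratio correction

    // Four phase-shifted sine waves, averaged.
    let mut v = 0.0f32;
    v += (p[0] * 6.0 + t).sin();
    v += (p[1] * 6.0 + t * 1.3).sin();
    v += ((p[0] + p[1]) * 4.0 + t * 0.7).sin();
    v += ((p[0] * p[0] + p[1] * p[1]).sqrt() * 8.0 - t * 2.0).sin();
    v *= 0.25;

    // Sine palette with 120-degree phase offsets, same constants as the shader.
    [
        0.5 + 0.5 * (v * 3.14159f32).sin(),
        0.5 + 0.5 * (v * 3.14159 + 2.094).sin(),
        0.5 + 0.5 * (v * 3.14159 + 4.188).sin(),
    ]
}

fn main() {
    // At the screen center with t = 0 every sine term is sin(0), so v = 0
    // and the red channel comes out exactly 0.5.
    let c = plasma(400, 300, 800, 600, 0.0);
    println!("{:?}", c);
    assert!(c.iter().all(|x| (0.0..=1.0).contains(x)));
}
```

The `0.5 + 0.5 * sin(...)` palette guarantees every channel lands in [0, 1], which is why the shader can write the result straight to the swapchain without clamping.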
Rust Side
Uniform Buffer
Define the uniform struct on the Rust side with matching layout:
#[repr(C)]
#[derive(Clone, Copy, bytemuck::Pod, bytemuck::Zeroable)]
struct Uniforms {
    width: u32,
    height: u32,
    time: f32,
    _padding: f32,
}

impl goldy::StructuredBufferElement for Uniforms {}
Create the buffer with DataAccess::Scattered so it gets a bindless descriptor:
let uniform_buffer = Buffer::with_data(
    &device,
    &[Uniforms { width, height, time: 0.0, _padding: 0.0 }],
    DataAccess::Scattered,
)?;
Pass a typed &[Uniforms] slice, not raw bytes. Buffer::with_data::<T> uses size_of::<T>() as the structured-buffer stride, which backends rely on for correct addressing.
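The stride assumption is easy to verify with a plain size check. This sketch repeats the field list from above (with the bytemuck derives dropped so it is self-contained):

```rust
// Same field list as the Uniforms struct above: two u32s plus two f32s.
#[repr(C)]
#[derive(Clone, Copy)]
struct Uniforms {
    width: u32,
    height: u32,
    time: f32,
    _padding: f32,
}

fn main() {
    // 4 fields x 4 bytes, no padding: this 16-byte figure is the stride the
    // backends use for structured-buffer addressing.
    assert_eq!(std::mem::size_of::<Uniforms>(), 16);
    assert_eq!(std::mem::align_of::<Uniforms>(), 4);
    println!("stride = {}", std::mem::size_of::<Uniforms>());
}
```

The explicit `_padding` field keeps the Rust size in lockstep with the 16-byte struct the shader declares, so the two sides never disagree about element boundaries.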
Compute Pipeline
Compile the Slang source and create a ComputePipeline:
let shader = ShaderModule::from_slang(&device, COMPUTE_SHADER)?;
let compute_pipeline = ComputePipeline::new(&device, &shader)?;
Rendering a Frame
Each frame follows this pattern: update uniforms, acquire the swapchain texture, build a TaskGraph, submit, present.
fn render_frame(state: &mut RenderState) -> Result<()> {
    let (width, height) = state.surface.size();
    let elapsed = state.start_time.elapsed().as_secs_f32();
    state.uniform_buffer.write(
        0,
        bytemuck::bytes_of(&Uniforms {
            width,
            height,
            time: elapsed,
            _padding: 0.0,
        }),
    )?;

    let frame = state.surface.begin()?;
    let texture = frame.texture();

    let wg_x = width.div_ceil(8);
    let wg_y = height.div_ceil(8);

    let uniform_handle = state
        .uniform_buffer
        .bindless_srv_handle()
        .expect("Uniform buffer has no bindless SRV handle");
    let texture_handle = texture
        .bindless_handle()
        .expect("Surface texture has no bindless handle");

    let mut graph = TaskGraph::new();
    graph
        .node("compute", &state.compute_pipeline)
        .bind_buffer(&state.uniform_buffer, NodeAccess::Read)
        .bind_resources_raw(&[uniform_handle.index(), texture_handle.index()])
        .dispatch(wg_x, wg_y, 1);

    frame.submit_compute(&graph)?;
    frame.present()?;
    Ok(())
}
Step by Step
1. Update uniforms — `Buffer::write` uploads new time/size values each frame.
2. Acquire the frame — `surface.begin()` returns a `Frame`. `frame.texture()` gives you the swapchain `Texture` for this frame.
3. Get bindless handles — `bindless_srv_handle()` returns the read-only descriptor index for the uniform buffer. `bindless_handle()` returns the storage-image descriptor index for the swapchain texture. These indices are passed to the shader as the `BufRO<Uniforms>` and `DirectSpatial<float4>` slots respectively.
4. Build the TaskGraph — `graph.node()` creates a compute node bound to a pipeline. `bind_buffer()` declares the dependency (the uniform buffer is read). `bind_resources_raw()` passes the bindless descriptor indices as push-constant slots. `dispatch()` sets the workgroup count.
5. Submit and present — `frame.submit_compute(&graph)` records and submits the compute work to the GPU. `frame.present()` presents the swapchain image. The compute shader already wrote the pixels — there is no blit or copy step.
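The workgroup count is the usual ceiling division: with 8x8 threads per group, a surface needs one group per 8x8 tile, rounding up so partially covered edge tiles still get a group. In Rust:

```rust
fn main() {
    // An odd width exercises the rounding: 801 pixels span 101 tiles of 8.
    let (width, height): (u32, u32) = (801, 600);

    // div_ceil rounds up so partially covered tiles still get a workgroup;
    // the shader's `tid >= size` bounds check discards the surplus threads.
    let wg_x = width.div_ceil(8);
    let wg_y = height.div_ceil(8);

    assert_eq!((wg_x, wg_y), (101, 75));
    println!("dispatch({wg_x}, {wg_y}, 1)");
}
```

This rounding is exactly why the shader begins with the `tid.x >= u.width || tid.y >= u.height` early-out: without it, the extra threads in the last tile column and row would write out of bounds.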
Run It
cargo run --example compute_to_surface
You should see an animated plasma pattern filling the window, rendered entirely from compute.
Next Steps
- Task Graph — multi-node graphs, transient resources, indirect dispatch
- Examples — particles, game of life, and more compute examples
Bindless by Default
Goldy uses a typed bindless resource model: there are no descriptor sets, no binding tables, and no manual layout declarations. Every GPU resource — buffers, textures, samplers — is identified at dispatch time by a small integer index packed into push constants (Vulkan/DX12) or argument buffers (Metal).
How It Works
Traditional GPU APIs require you to declare descriptor set layouts, allocate descriptor pools, update descriptor sets, and bind them before each draw or dispatch. Goldy eliminates all of this. Instead:
- Resources are registered in per-category descriptor heaps when created.
- Each resource gets a `BindlessHandle` — a `(category, index)` pair.
- At dispatch time, you pass these handles as ordinary arguments. The GPU shader resolves them to live buffer/texture/sampler handles through the `goldy_exp` access functions.
CPU side:                        GPU side:

Buffer::with_data(...)           goldy_scattered<T>(slot)
  → BindlessHandle {               → descriptor_heap[slot]
        category: Scattered,         → RWStructuredBuffer<T>
        index: 3,
    }
BindlessCategory
Goldy's descriptor heaps are organized into five pools, one per access pattern. A resource's index is only meaningful within its category:
| Category | Pool | Shader Access Function |
|---|---|---|
| `Scattered` | Storage buffers | `goldy_scattered<T>()` / `goldy_buf_ro<T>()` |
| `Broadcast` | Uniform/constant buffers | `goldy_broadcast<T>()` |
| `Texture` | Sampled textures | `goldy_interpolated<T>()` |
| `StorageImage` | Writable textures | `goldy_direct_spatial<T>()` |
| `Sampler` | Sampler states | `goldy_filter()` |
Scattered slot 3 and Broadcast slot 3 refer to different physical entries — on Metal these are storageBuffers[3] vs uniformBuffers[3], on Vulkan they live in different descriptor array bindings.
BindlessHandle
BindlessHandle is the typed wrapper that carries both the raw index and the resource category:
let buf = Buffer::with_data(&device, &particles, DataAccess::Scattered)?;
let handle: BindlessHandle = buf.bindless_handle().unwrap();
assert_eq!(handle.category(), BindlessCategory::Scattered);
assert_eq!(handle.index(), 3); // assigned by the device
When you bind handles at dispatch time, Goldy can validate that the handle's category matches what the shader expects in that slot — a Broadcast handle bound to a slot the shader reads through goldy_scattered is caught as a type error rather than silently producing garbage.
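Conceptually the check is just a per-slot comparison of category tags. An illustrative sketch in plain Rust (hypothetical types mirroring the description above, not Goldy's actual implementation):

```rust
// Illustrative category check. BindlessCategory/BindlessHandle here are
// stand-ins for the Goldy types described in this chapter.
#[derive(Clone, Copy, PartialEq, Debug)]
enum BindlessCategory { Scattered, Broadcast, Texture, StorageImage, Sampler }

#[derive(Clone, Copy, Debug)]
struct BindlessHandle { category: BindlessCategory, index: u32 }

/// Compare bound handles against the categories the shader declared per slot.
fn validate(expected: &[BindlessCategory], bound: &[BindlessHandle]) -> Result<(), String> {
    for (slot, (want, got)) in expected.iter().zip(bound).enumerate() {
        if *want != got.category {
            return Err(format!(
                "slot {slot}: shader expects {want:?}, got {:?} (index {})",
                got.category, got.index
            ));
        }
    }
    Ok(())
}

fn main() {
    // Shader declares (MyUniforms cfg, Scattered<T> data):
    let expected = [BindlessCategory::Broadcast, BindlessCategory::Scattered];
    let bad = [
        BindlessHandle { category: BindlessCategory::Scattered, index: 3 }, // wrong slot 0
        BindlessHandle { category: BindlessCategory::Scattered, index: 7 },
    ];
    // A Scattered handle in a Broadcast slot is rejected, not silently misread.
    assert!(validate(&expected, &bad).is_err());
}
```

The point of the check is the failure mode: an index alone would silently address the wrong descriptor heap, whereas a tagged handle turns the mismatch into an error message.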
Typed Bindless Parameters
In shader code, goldy_exp provides type aliases that map directly to the underlying Slang resource types. These are used as entry-point parameters in virtual entry points:
| Goldy Type | Underlying Slang Type | Usage |
|---|---|---|
| `Scattered<T>` | `RWStructuredBuffer<T>` | Read/write buffer: `data[i]`, `data[i].field = v` |
| `BufRO<T>` | `StructuredBuffer<T>` | Read-only buffer: `buf[i]` |
| `Interpolated<T>` | `Texture2D<T>` | Sampled texture: `tex.Sample(samp, uv)` |
| `DirectSpatial<T>` | `RWTexture2D<T>` | Writable texture: `img[int2(x,y)]` |
| `ByteAddress` | `RWByteAddressBuffer` | Raw byte access: `.Load()`, `.Store()`, `.Interlocked*()` |
| `Filter` | `SamplerState` | Sampler for texture filtering |
Any user-defined struct type (e.g. MyUniforms) declared as a parameter is automatically treated as a constant-buffer broadcast — no wrapper type needed.
Dispatch-Time Type Checking
When you call bind_resources_typed, Goldy compares each BindlessHandle.category against the shader's declared parameter types (extracted via extract_push_constant_categories):
let uniforms = uniform_buf.bindless_handle().unwrap(); // Broadcast
let data = storage_buf.bindless_handle().unwrap();     // Scattered

// Category validation happens here:
pass.bind_resources_typed(&[uniforms, data]);
pass.dispatch(workgroups, 1, 1);
If slot 0 expects Broadcast (from the shader's MyUniforms cfg parameter) but receives a Scattered handle, the dispatch fails with a clear error instead of producing undefined behavior.
Contrast with Traditional Binding
| | Traditional (Vulkan/DX12) | Goldy Bindless |
|---|---|---|
| Setup | Declare descriptor set layouts, allocate pools, create and update descriptor sets | Create resources; indices assigned automatically |
| Binding | Bind descriptor sets before each draw/dispatch | Pass BindlessHandle values as push constants |
| Shader access | layout(set=0, binding=1) buffer ... | Scattered<T> data as a function parameter |
| Validation | Runtime errors or silent corruption on mismatch | Category + stride checks at dispatch time |
| Cross-backend | Layout declarations differ per API | Same shader code on Vulkan, DX12, and Metal |
Example: Compute Shader with Bindless Resources
Shader (particle_update.slang):
import goldy_exp;
struct SimParams {
float dt;
uint count;
};
struct Particle {
float2 pos;
float2 vel;
};
[goldy_compute]
[numthreads(64, 1, 1)]
void cs_main(SimParams params, Scattered<Particle> particles, ThreadId id) {
if (id.x >= params.count) return;
Particle p = particles[id.x];
p.pos += p.vel * params.dt;
particles[id.x] = p;
}
Rust dispatch:
let params_buf = Buffer::with_data(&device, &[sim_params], DataAccess::Broadcast)?;
let particle_buf = Buffer::with_data(&device, &particles, DataAccess::Scattered)?;

let shader = ShaderModule::from_slang(&device, PARTICLE_UPDATE_SOURCE)?;
let pipeline = ComputePipeline::new(&device, &shader)?;

let mut encoder = ComputeEncoder::new();
let mut pass = encoder.begin_compute_pass();
pass.set_pipeline(&pipeline);
pass.bind_resources_typed(&[
    params_buf.bindless_handle().unwrap(),   // slot 0 → Broadcast → SimParams
    particle_buf.bindless_handle().unwrap(), // slot 1 → Scattered → Particle
]);
pass.dispatch(particle_count.div_ceil(64), 1, 1);
drop(pass);
encoder.dispatch(&device)?;
The shader author writes natural function parameters. The Rust side binds handles in declaration order. Goldy handles the rest — slot packing, category validation, and cross-backend descriptor plumbing.
Virtual Entry Points
Goldy's virtual entry points let you write shader entry points with clean, typed parameters instead of raw uniform uint slots and SV_* semantics. You annotate your function with [goldy_compute], [goldy_vertex], or [goldy_fragment], and a source-to-source transform generates the real Slang [shader("...")] entry point with all the bindless plumbing wired up.
The Attributes
| Attribute | Stage | Generated Slang Attribute |
|---|---|---|
| `[goldy_compute]` | Compute | `[shader("compute")]` |
| `[goldy_vertex]` | Vertex | `[shader("vertex")]` |
| `[goldy_fragment]` | Fragment | `[shader("fragment")]` |
A minimal example:
import goldy_exp;
[goldy_compute]
[numthreads(64, 1, 1)]
void cs_main(Scattered<uint> data, ThreadId id) {
data[id.x] = data[id.x] * 2;
}
This is equivalent to manually writing a [shader("compute")] entry point with uniform uint push-constant parameters, descriptor heap lookups, and SV_DispatchThreadID — but without any of that boilerplate.
What Virtual Entry Points Accept
Resource Parameters
Each resource parameter occupies one bindless slot (a 16-bit index packed into push constants). The generated wrapper calls the corresponding goldy_* free function to resolve the slot to a live GPU handle.
| Parameter Type | Resolves Via | Description |
|---|---|---|
| `Scattered<T>` | `goldy_scattered<T>(slot)` | Read/write storage buffer |
| `BufRO<T>` | `goldy_buf_ro<T>(slot)` | Read-only storage buffer |
| `Interpolated<T>` | `goldy_interpolated<T>(slot)` | Sampled 2D texture |
| `DirectSpatial<T>` | `goldy_direct_spatial<T>(slot)` | Read/write 2D texture |
| `ByteAddress` | `goldy_byte_address(slot)` | Raw byte-address buffer |
| `Filter` | `goldy_filter(slot)` | Sampler state |
Broadcast Parameters
Any user-defined struct type that isn't a recognized resource or system-value type is treated as a broadcast (constant buffer). The generated code calls goldy_broadcast<T>(slot) to fetch the entire struct from a uniform buffer:
struct SimParams { float dt; uint count; };
[goldy_compute]
[numthreads(64, 1, 1)]
void cs_main(SimParams params, Scattered<Particle> data, ThreadId id) {
// params is fetched from a constant buffer automatically
}
In vertex and fragment shaders, the last unrecognized struct is treated as the stage input (vertex attributes or fragment varyings) rather than a broadcast. All preceding unrecognized structs are broadcasts.
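The classification rule can be stated as a small function over the parameter type names. This is a sketch of the rule as described, not the transform's actual code, and it ignores scalar user parameters for brevity:

```rust
// Sketch of virtual-entry-point parameter classification (simplified).
#[derive(PartialEq, Debug)]
enum Kind { Resource, SystemValue, Broadcast, StageInput }

fn classify(params: &[&str], is_raster_stage: bool) -> Vec<Kind> {
    let is_resource = |t: &str| {
        ["Scattered<", "BufRO<", "Interpolated<", "DirectSpatial<", "ByteAddress", "Filter"]
            .iter().any(|p| t.starts_with(p))
    };
    let is_sv = |t: &str| {
        ["ThreadId", "GroupThreadId", "GroupId", "VertexId", "InstanceId", "IsFrontFace"]
            .contains(&t)
    };
    // The LAST unrecognized struct becomes the stage input in raster stages.
    let last_plain = params.iter().rposition(|t| !is_resource(t) && !is_sv(t));
    params.iter().enumerate().map(|(i, t)| {
        if is_resource(t) { Kind::Resource }
        else if is_sv(t) { Kind::SystemValue }
        else if is_raster_stage && Some(i) == last_plain { Kind::StageInput }
        else { Kind::Broadcast }
    }).collect()
}

fn main() {
    // Fragment stage: SceneUniforms is a broadcast, VSOutput is the stage input.
    let kinds = classify(&["SceneUniforms", "Interpolated<float4>", "Filter", "VSOutput"], true);
    println!("{:?}", kinds);
}
```

Running the same parameter list through a compute-stage classification (`is_raster_stage = false`) turns the trailing struct into a broadcast instead, which matches the rule stated above.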
System-Value Parameters
System-value wrapper types are mapped to SV_* semantics. The generated entry point declares the raw semantic parameter and constructs the wrapper:
| Wrapper Type | Maps To | Available Fields |
|---|---|---|
| `ThreadId` | `SV_DispatchThreadID` | `.x`, `.y`, `.z`, `.xy`, `.xyz` |
| `GroupThreadId` | `SV_GroupThreadID` | `.x`, `.y`, `.z`, `.xy`, `.xyz` |
| `GroupId` | `SV_GroupID` | `.x`, `.y`, `.z`, `.xy`, `.xyz` |
| `VertexId` | `SV_VertexID` | `.value` |
| `InstanceId` | `SV_InstanceID` | `.value` |
| `IsFrontFace` | `SV_IsFrontFace` | `.value` |
Scalar Parameters
Plain scalar types (uint, float, int, bool, and vector variants) become user parameters — full-precision u32 words in a separate region of the push constants. These are bound from Rust via bind_resources_raw_with_user:
[goldy_compute]
[numthreads(64, 1, 1)]
void cs_main(Scattered<uint> data, uint offset, ThreadId id) {
data[id.x + offset] += 1;
}
Pass-Through Parameters
In vertex and fragment shaders, the last unrecognized struct parameter passes through as a stage input (vertex attributes or interpolated varyings). It appears directly in the generated entry point signature without bindless resolution:
[goldy_fragment]
float4 fs_main(MyUniforms cfg, FullscreenVarying input) : SV_Target {
// cfg → broadcast (slot 0)
// input → pass-through stage input (interpolated varyings)
return float4(cfg.time, 0, 0, 1);
}
The Source-to-Source Transform
The transform (implemented in slang/virtual_main.rs) runs before Slang compilation and performs three operations:
- Generates a wrapper function with the real `[shader("...")]` attribute and a fixed 16-word push-constant signature.
- Renames the user function from `cs_main` to `_goldy_user_cs_main` so both can coexist.
- Removes the `[goldy_*]` attribute and `[numthreads]` from the renamed user function (they live on the generated wrapper).
Push Constant Layout
The generated entry point always declares a fixed signature regardless of how many parameters the user function has:
Words 0–7: _bw0.._bw7 — 16 × u16 bindless indices packed 2 per word
Words 8–15: _uw0.._uw7 — 8 × u32 user scalar parameters
Bindless indices are packed as pairs into 32-bit words: the low 16 bits of _bw0 hold slot 0, the high 16 bits hold slot 1, and so on. This fits up to 16 resource/broadcast parameters and 8 scalar parameters in 64 bytes of push constants.
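The packing rule is straightforward to reproduce: slot k lives in word k/2, shifted left 16 bits when k is odd. A sketch of the CPU-side packing (the function name is hypothetical; it only illustrates the layout described above):

```rust
// Pack 16-bit bindless slot indices two per 32-bit word, low half first.
fn pack_bindless_words(slots: &[u16]) -> [u32; 8] {
    assert!(slots.len() <= 16, "at most 16 bindless slots fit in 8 words");
    let mut words = [0u32; 8];
    for (k, &slot) in slots.iter().enumerate() {
        let shift = (k % 2) * 16; // even slot -> low 16 bits, odd slot -> high 16 bits
        words[k / 2] |= (slot as u32) << shift;
    }
    words
}

fn main() {
    // Slot 0 = broadcast index 3, slot 1 = scattered index 7.
    let words = pack_bindless_words(&[3, 7]);
    assert_eq!(words[0], 0x0007_0003);

    // Shader-side unpacking mirrors this, as in the generated wrapper:
    // _bw0 & 0xFFFF for slot 0, (_bw0 >> 16) & 0xFFFF for slot 1.
    assert_eq!(words[0] & 0xFFFF, 3);
    assert_eq!((words[0] >> 16) & 0xFFFF, 7);
}
```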
Before and After
What you write:
[goldy_compute]
[numthreads(64, 1, 1)]
void cs_main(TimeUniforms cfg, Scattered<uint> data, ThreadId id) {
data[id.x] = data[id.x] + cfg.base;
}
What gets compiled (generated wrapper prepended, user function renamed):
[shader("compute")]
[numthreads(64, 1, 1)]
void cs_main(uniform uint _bw0, ..., uniform uint _bw7,
uniform uint _uw0, ..., uniform uint _uw7,
uint3 _sv0 : SV_DispatchThreadID) {
TimeUniforms cfg = goldy_broadcast<TimeUniforms>(_bw0 & 0xFFFFu);
Scattered<uint> data = goldy_scattered<uint>((_bw0 >> 16u) & 0xFFFFu);
ThreadId id = ThreadId(_sv0);
_goldy_user_cs_main(cfg, data, id);
}
// Original function, renamed:
void _goldy_user_cs_main(TimeUniforms cfg, Scattered<uint> data, ThreadId id) {
data[id.x] = data[id.x] + cfg.base;
}
The #line 1 directive is inserted between the generated wrapper and the user source so that compiler diagnostics report correct line numbers.
Vertex/Fragment Example
[goldy_vertex]
VSOutput vs_main(SceneUniforms scene, Scattered<Instance> instances, VertexId vid, InstanceId iid) {
// scene → broadcast (slot 0)
// instances → scattered (slot 1)
// vid → SV_VertexID
// iid → SV_InstanceID
Instance inst = instances[iid.value];
VSOutput out;
// ... transform vertex ...
return out;
}
[goldy_fragment]
float4 fs_main(SceneUniforms scene, Interpolated<float4> albedo, Filter samp,
VSOutput input) : SV_Target {
// scene → broadcast (slot 0)
// albedo → texture (slot 1)
// samp → sampler (slot 2)
// input → pass-through stage varying
return albedo.Sample(samp, input.uv) * scene.tint;
}
Both entry points share the same push-constant layout. Fragment shader slot expectations take precedence when Goldy extracts category metadata (since resource binding typically lives there in a vertex+fragment pair).
Preprocessor Conditionals
Virtual entry points support #ifdef/#else/#endif blocks directly inside the parameter list. This is useful for shader variants like MSAA:
[goldy_compute]
[numthreads(4, 16, 1)]
void cs_main(BufRO<uint> config,
#ifdef msaa
BufRO<uint> mask_lut, DirectSpatial<float4> out_tex,
#else
DirectSpatial<float4> out_tex,
#endif
ThreadId tid) {
// ...
}
The transform generates conditional blocks in the wrapper's signature, body, and call arguments so that the correct branch is selected at compile time based on preprocessor defines.
Slang in One Source
Goldy uses Slang as its single shader language across all backends. You write one .slang file and Goldy compiles it to the native format for whichever GPU API is in use — no manual HLSL/GLSL/MSL translation, no per-backend shader files.
Compilation Targets
| Backend | Target Format | API Requirement |
|---|---|---|
| Vulkan | SPIR-V | Vulkan 1.4+ |
| DirectX 12 | DXIL | Windows 10+ |
| Metal | Metal IR | Metal Tier 2+ (Argument Buffers) |
Slang compiles through its native slang.dll / libslang.dylib — the same compiler used by NVIDIA, Khronos, and major game engines. Goldy links it directly; there is no intermediate translation step.
Why Slang
- One source: Vertex, fragment, and compute shaders all live in a single `.slang` file. No preprocessor gymnastics to target different backends.
- HLSL-compatible syntax: If you know HLSL, you already know Slang. Standard types (`float4`, `uint3`, `Texture2D`), standard intrinsics (`mul`, `lerp`, `smoothstep`), standard semantics (`SV_Position`, `SV_Target`).
- Modern language features: Modules (`import`), generics, interfaces, operator overloading, and automatic differentiation — features that HLSL and GLSL lack.
- Khronos governance: Long-term stability under open-source stewardship.
Cross-Backend Matrix Layout Consistency
Slang normalizes matrix layout across all backends. HLSL, GLSL, and Metal all default to column-major matrix storage, but the conventions for how mul(matrix, vector) is interpreted differ between APIs. Slang's compilation ensures that a float4x4 in your shader has identical memory layout and multiplication semantics whether it compiles to SPIR-V, DXIL, or Metal IR.
This means your Rust-side #[repr(C)] matrix types can use the same byte layout regardless of which backend the application runs on.
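To make the byte-layout claim concrete: in a column-major 4x4 matrix flattened to `[f32; 16]`, element (row r, column c) sits at index c*4 + r, so the identity's diagonal lands at indices 0, 5, 10, and 15. A sketch of that indexing (plain arrays; actual Goldy matrix types are not shown here):

```rust
// Column-major 4x4: element (row, col) lives at flat index col * 4 + row.
fn identity_col_major() -> [f32; 16] {
    let mut m = [0.0f32; 16];
    for i in 0..4 {
        m[i * 4 + i] = 1.0; // diagonal: row == col
    }
    m
}

// Matrix * column-vector with column-major indexing.
fn mat_mul_vec(m: &[f32; 16], v: [f32; 4]) -> [f32; 4] {
    let mut out = [0.0f32; 4];
    for row in 0..4 {
        for col in 0..4 {
            out[row] += m[col * 4 + row] * v[col];
        }
    }
    out
}

fn main() {
    let m = identity_col_major();
    // Identity maps every vector to itself under this indexing convention.
    assert_eq!(mat_mul_vec(&m, [1.0, 2.0, 3.0, 4.0]), [1.0, 2.0, 3.0, 4.0]);
}
```

Because the same flat index formula holds on every backend, a `#[repr(C)]` array of 16 floats uploaded from Rust is interpreted identically by SPIR-V, DXIL, and Metal IR builds of the shader.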
Shader Module Creation
Basic Compilation
ShaderModule::from_slang() compiles a Slang source string into GPU bytecode:
let shader = ShaderModule::from_slang(&device, r#"
    import goldy_exp;

    [goldy_compute]
    [numthreads(64, 1, 1)]
    void cs_main(Scattered<float> data, ThreadId id) {
        data[id.x] = data[id.x] * 2.0;
    }
"#)?;
The goldy_exp library is pre-registered on every device — import goldy_exp works without any setup.
Additional Search Paths
ShaderModule::from_slang_with_paths() adds filesystem directories to the Slang module search path:
let shader = ShaderModule::from_slang_with_paths(
    &device,
    source,
    &["my_project/shaders"],
)?;
Preprocessor Defines
ShaderModule::from_slang_with_paths_and_defines() passes preprocessor defines for shader variants:
let shader = ShaderModule::from_slang_with_paths_and_defines(
    &device,
    source,
    &[],
    &[("msaa", "1"), ("SAMPLE_COUNT", "4")],
)?;
Full Options
ShaderModule::from_slang_with_options() provides complete control — search paths, defines, optimization level, and layout validation checks:
let shader = ShaderModule::from_slang_with_options(
    &device,
    source,
    &["shaders/"],
    &[("DEBUG", "1")],
    OptimizationLevel::Default,
    &[TimeUniforms::LAYOUT_CHECK],
)?;
Built-in Shader Modules
Goldy ships a few complete shaders as Rust string constants in goldy::shader::builtins:
| Constant | Description |
|---|---|
| `VERTEX_COLOR_2D` | 2D vertex+fragment shader with per-vertex color |
| `SOLID_COLOR` | Solid-color fragment shader with a uniform |
These are self-contained (no import needed) and useful for bootstrapping:
use goldy::shader::builtins;

let shader = ShaderModule::from_slang(&device, builtins::VERTEX_COLOR_2D)?;
Shader Libraries
Shader libraries are reusable Slang modules registered with a Device. Once registered, any shader compiled on that device can import the library.
The Built-in goldy_exp Library
Every device comes with goldy_exp pre-registered. It provides:
- Resource type aliases (Scattered<T>, BufRO<T>, Interpolated<T>, etc.)
- System-value wrappers (ThreadId, VertexId, InstanceId, etc.)
- Vertex formats (FullscreenVarying, ColoredVarying, etc.)
- Math utilities (hash(), center_uv(), smootherstep(), etc.)
- Color utilities (rainbow(), palette(), hsv_to_rgb(), etc.)
- Procedural geometry (quad_position(), billboard_position(), etc.)
Registering Custom Libraries
```rust
use goldy::ShaderLibrary;

device.register_library(ShaderLibrary::from_source("myutils", r#"
    module myutils;

    public float3 my_effect(float t) {
        return float3(t, t * 0.5, 1.0 - t);
    }
"#))?;
```
Now any shader can import myutils:
import myutils;
[goldy_fragment]
float4 fs_main(FullscreenVarying input) : SV_Target {
return float4(my_effect(input.uv.x), 1.0);
}
Multi-Module Libraries
For larger libraries with internal sub-modules:
```rust
let lib = ShaderLibrary::from_embedded("effects", &[
    ("effects", r#"
        module effects;
        __include "effects/blur";
        __include "effects/bloom";
    "#),
    ("effects/blur", r#"
        implementing effects;
        public float4 gaussian_blur(Texture2D<float4> tex, SamplerState s, float2 uv) { ... }
    "#),
    ("effects/bloom", r#"
        implementing effects;
        public float4 bloom(Texture2D<float4> tex, SamplerState s, float2 uv, float threshold) { ... }
    "#),
]);
device.register_library(lib)?;
```
Loading from the Filesystem
#![allow(unused)] fn main() { let lib = ShaderLibrary::from_directory("effects", Path::new("shaders/effects/"))?; device.register_library(lib)?; }
Library Management
```rust
device.has_library("goldy_exp");      // true — always registered
device.list_libraries();              // ["goldy_exp", "myutils", ...]
device.unregister_library("myutils"); // remove a custom library
```
Layout Validation
When Rust structs are passed to shaders as uniform data (e.g. via Broadcast), the memory layout must match exactly. Goldy can validate this at compile time using Slang reflection.
Setup
1. Derive LayoutCheckable on your Rust struct:
#![allow(unused)] fn main() { #[derive(LayoutCheckable)] #[repr(C)] struct TimeUniforms { time: f32, delta_time: f32, frame: u32, _pad: u32, } }
2. Pass the layout check to shader compilation:
#![allow(unused)] fn main() { let shader = ShaderModule::from_slang_with_options( &device, source, &[], &[], OptimizationLevel::Default, &[TimeUniforms::LAYOUT_CHECK], )?; }
3. Enable validation via an environment variable:
```sh
GOLDY_VALIDATE_LAYOUTS=1 cargo run
# or
GOLDY_VALIDATION=layout cargo run
# or enable everything:
GOLDY_VALIDATION=all cargo run
```
What Gets Validated
- Field offsets: Each field's byte offset in the Rust struct is compared against the Slang reflection data.
- Struct size: Total size must match.
- Buffer element stride: At dispatch time, the buffer's recorded element stride is checked against what the shader expects.
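The first two checks can be pictured with std::mem::offset_of! and size_of. The expected values below stand in for what Slang reflection would report; this is a sketch of the kind of comparison the derive performs, not Goldy's actual code:

```rust
use std::mem::{offset_of, size_of};

// The struct from the setup example above.
#[repr(C)]
struct TimeUniforms {
    time: f32,
    delta_time: f32,
    frame: u32,
    _pad: u32,
}

fn main() {
    // Expected offsets stand in for Slang reflection data:
    // four 4-byte fields, packed back to back.
    assert_eq!(offset_of!(TimeUniforms, time), 0);
    assert_eq!(offset_of!(TimeUniforms, delta_time), 4);
    assert_eq!(offset_of!(TimeUniforms, frame), 8);
    assert_eq!(offset_of!(TimeUniforms, _pad), 12);
    // Total size must match too — no tail padding here.
    assert_eq!(size_of::<TimeUniforms>(), 16);
}
```

A mismatch in any of these assertions is exactly the class of bug the layout check reports before the shader ever runs.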
Validation is zero-cost when disabled — the checks are skipped entirely, not compiled out. The environment variable is read at runtime so it can be toggled without recompiling.
GOLDY_VALIDATION
The GOLDY_VALIDATION environment variable controls multiple validation categories:
| Value | Layout Checks | GPU API Validation |
|---|---|---|
layout | Yes | No |
api | No | Yes |
layout,api | Yes | Yes |
all | Yes | Yes |
1 / true / yes | No | Yes |
GOLDY_VALIDATE_LAYOUTS=1 is a standalone toggle that enables layout checks regardless of GOLDY_VALIDATION.
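The table's semantics can be sketched as a small parser — one plausible reading of the variable, not Goldy's actual implementation:

```rust
// A plausible parse of the GOLDY_VALIDATION table — illustrative only.
// Returns (layout_checks, api_validation).
fn parse_validation(value: &str) -> (bool, bool) {
    match value {
        "all" => (true, true),
        // Bare boolean values enable only GPU API validation.
        "1" | "true" | "yes" => (false, true),
        _ => {
            // Comma-separated category list: "layout", "api", or both.
            let mut layout = false;
            let mut api = false;
            for part in value.split(',') {
                match part.trim() {
                    "layout" => layout = true,
                    "api" => api = true,
                    _ => {}
                }
            }
            (layout, api)
        }
    }
}

fn main() {
    assert_eq!(parse_validation("layout"), (true, false));
    assert_eq!(parse_validation("api"), (false, true));
    assert_eq!(parse_validation("layout,api"), (true, true));
    assert_eq!(parse_validation("all"), (true, true));
    assert_eq!(parse_validation("1"), (false, true));
}
```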
ComputeEncoder
ComputeEncoder records compute commands into a flat command list. It is lock-free and can be used from any thread — no GPU interaction happens until you submit.
For multi-dispatch workloads with data dependencies between passes, prefer the Task Graph, which analyzes dependencies and inserts barriers automatically. ComputeEncoder is best for simple, single-dispatch workloads or cases where you manage barriers yourself.
Creating an encoder
#![allow(unused)] fn main() { let mut encoder = ComputeEncoder::new(); }
Recording a compute pass
Open a ComputePass, set a pipeline, bind resources, and dispatch:
#![allow(unused)] fn main() { let mut pass = encoder.begin_compute_pass(); pass.set_pipeline(&pipeline); pass.bind_resources_raw(&[buffer.bindless_index().unwrap()]); pass.dispatch(16, 1, 1); }
The pass borrows the encoder mutably. Drop it (or let it go out of scope) before opening another pass or finishing the encoder.
Binding resources
There are three ways to pass resource handles to a compute shader:
bind_resources — pass Buffer references directly. Indices are bound in declaration order:
#![allow(unused)] fn main() { pass.bind_resources(&[&particle_buffer, ¶ms_buffer]); }
bind_resources_raw — pass raw u32 slot indices. Use this when you need to mix buffer, texture, and sampler indices:
#![allow(unused)] fn main() { let tex_idx = texture.bindless_index().unwrap(); let buf_idx = buffer.bindless_index().unwrap(); pass.bind_resources_raw(&[buf_idx, tex_idx]); }
bind_resources_typed — pass typed BindlessHandles that carry both the index and the resource category:
#![allow(unused)] fn main() { let uniforms = uniform_buf.bindless_handle().unwrap(); let output = output_tex.bindless_handle().unwrap(); pass.bind_resources_typed(&[uniforms, output]); }
Per-dispatch scalar parameters
Parameters that aren't heap indices — offsets, counts, flags — are declared as typed entry-point parameters in the shader and passed alongside resource indices:
[goldy_compute]
[numthreads(64, 1, 1)]
void cs_main(Scattered<uint> data, uint offset, uint stride, ThreadId id) {
data[id.x * stride + offset] += 1;
}
#![allow(unused)] fn main() { pass.bind_resources_raw(&[data_buf.bindless_index().unwrap(), offset, stride]); }
Or use the two-region form to separate resource indices (region A) from user scalars (region B):
#![allow(unused)] fn main() { pass.bind_resources_raw_with_user( &[data_buf.bindless_index().unwrap()], &[offset, stride], ); }
Dispatching workgroups
The total thread count is the product of dispatch(x, y, z) and the shader's [numthreads(x, y, z)]:
```rust
let elements = 1024u32;
let threads_per_group = 64u32;
let groups = elements.div_ceil(threads_per_group);
pass.dispatch(groups, 1, 1); // 16 groups × 64 threads = 1024
```
Indirect dispatch
Let a prior pass write the workgroup counts into a buffer, then read them at dispatch time:
#![allow(unused)] fn main() { pass.dispatch_indirect(&count_buffer, 0); }
The buffer must contain three consecutive u32 values (x, y, z) at the given byte offset.
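On the CPU side, that argument region can be packed into bytes like this. It is a sketch: little-endian byte order is an assumption (true of every platform Goldy targets in practice), and pack_dispatch_args is our name, not a Goldy API:

```rust
// Pack (x, y, z) workgroup counts as three consecutive u32 values —
// the 12-byte layout dispatch_indirect reads at the given byte offset.
// Little-endian order is assumed; `pack_dispatch_args` is hypothetical.
fn pack_dispatch_args(x: u32, y: u32, z: u32) -> [u8; 12] {
    let mut bytes = [0u8; 12];
    bytes[0..4].copy_from_slice(&x.to_le_bytes());
    bytes[4..8].copy_from_slice(&y.to_le_bytes());
    bytes[8..12].copy_from_slice(&z.to_le_bytes());
    bytes
}

fn main() {
    let args = pack_dispatch_args(16, 1, 1);
    assert_eq!(args.len(), 12);
    assert_eq!(args[0..4], 16u32.to_le_bytes());
    assert_eq!(args[4..8], 1u32.to_le_bytes());
}
```

A prior compute pass would typically write these three values directly on the GPU instead; the CPU-side packing just shows the expected memory layout.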
Barriers and buffer clears
Insert a global memory barrier between dispatches within the same encoder:
#![allow(unused)] fn main() { pass.barrier(); }
Clear a buffer region to zero, batched into the same submission:
```rust
pass.clear_buffer(&buffer, 0, 0); // size=0 → clear to end of buffer
```
Submitting
Blocking — submit and wait for the GPU to finish:
#![allow(unused)] fn main() { encoder.dispatch(&device)?; }
Non-blocking — submit and get a TimelineValue for later synchronization:
```rust
let tv = encoder.submit(&device)?;
// CPU work while GPU is busy...
device.wait_until(tv)?;
```
See Device Timeline for more on TimelineValue and gpu_progress.
Recording into a task graph
For multi-pass workloads, record each dispatch as a task graph node instead of using ComputeEncoder directly. The task graph handles barriers for you:
#![allow(unused)] fn main() { let mut graph = TaskGraph::new(); graph.node("my_pass", &pipeline) .bind_buffer(&buf, NodeAccess::ReadWrite) .bind_resources_raw(&[buf.bindless_index().unwrap()]) .dispatch(16, 1, 1); graph.dispatch(&device)?; }
See Task Graph for the full API.
Task Graph
The task graph is one of Goldy's core abstractions. It pairs the bindless resource model with explicit dependency declarations so the runtime can insert optimal barriers and maximize GPU parallelism — all within a single command buffer.
Why the task graph exists
Goldy uses a bindless resource model: shaders access buffers and textures through heap-backed argument buffers indexed by slot numbers. This gives shaders flexible, low-overhead access to any resource, but it makes the GPU's automatic dependency tracking blind. Metal, for example, cannot see through argument buffer indirection to know which resources a dispatch reads or writes, so it cannot insert barriers automatically.
Without the task graph, the only correct approach is to submit each dispatch as a separate command buffer. This works, but it serializes everything and adds per-command-buffer scheduling overhead — worse than APIs like wgpu that use bind groups to infer hazards.
The task graph solves this: you declare what each node reads and writes, and Goldy does the rest.
- Builds a dependency DAG from declared resource access patterns
- Groups independent dispatches into waves that execute concurrently
- Inserts per-resource barriers only at true dependency edges (RAW, WAR, WAW)
- Submits everything in a single command buffer
Building a task graph
Create a TaskGraph, add nodes with resource access declarations, and submit:
```rust
use goldy::{TaskGraph, NodeAccess};

let mut graph = TaskGraph::new();

graph.node("write_data", &pipeline_a)
    .bind_buffer(&buf, NodeAccess::Write)
    .bind_resources_raw(&[buf_idx])
    .dispatch(64, 1, 1);

graph.node("read_data", &pipeline_b)
    .bind_buffer(&buf, NodeAccess::Read)
    .bind_resources_raw(&[buf_idx])
    .dispatch(64, 1, 1);

let tv = graph.submit(&device)?;
device.wait_until(tv)?;
```
The analyzer sees that read_data depends on write_data (RAW hazard on buf) and inserts a barrier between them. If two nodes touch completely different resources, they execute in the same wave with no barrier.
Node types
| Builder method | GPU operation |
|---|---|
graph.node(label, &pipeline) | Compute dispatch (direct or indirect) |
graph.clear_buffer(&buf, offset, size) | GPU-side buffer zero-fill |
graph.clear_buffer_view(&view, offset, size) | GPU-side zero-fill of a pool view region |
graph.write_buffer(&buf, offset, data) | CPU→GPU buffer upload |
graph.write_texture(&tex, data) | CPU→GPU texture upload |
graph.render_pass(label, &target) | Offscreen render pass |
All node types participate in the same dependency analysis.
Declaring resource access
Each node declares its resource access via bind_buffer, bind_buffer_view, or bind_texture:
#![allow(unused)] fn main() { graph.node("reduce", &pipeline) .bind_buffer(&input, NodeAccess::Read) .bind_buffer(&output, NodeAccess::Write) .bind_resources_raw(&[input_idx, output_idx]) .dispatch(64, 1, 1); }
bind_resources_raw sets the actual shader slot indices. The bind_buffer / bind_texture calls are purely for dependency analysis — they tell the scheduler what this node touches, not how to bind it.
Finalizing nodes
Compute nodes must be finalized with dispatch(x, y, z) or dispatch_indirect(&buf, offset). Render pass nodes are finalized with finish(commands) or finish_encoder(encoder).
NodeAccess and SWMR scheduling
NodeAccess is the per-node logical access, orthogonal to a buffer's physical DataAccess:
```rust
pub enum NodeAccess {
    Read,      // can overlap with other Reads
    Write,     // exclusive access
    ReadWrite, // exclusive access
}
```
The scheduler implements single-writer/multiple-reader (SWMR) parallelism:
- Multiple Read nodes on the same resource run concurrently in the same wave.
- A Write or ReadWrite node serializes against all prior accessors of that resource.
- Barriers are inserted only at true RAW/WAR/WAW edges.
Diamond example
```rust
let mut graph = TaskGraph::new();

// Wave 0: A writes buf_x
graph.node("A", &p1)
    .bind_buffer(&buf_x, NodeAccess::Write)
    .dispatch(1, 1, 1);

// Wave 1: B and C both read buf_x (SWMR — they run concurrently)
graph.node("B", &p2)
    .bind_buffer(&buf_x, NodeAccess::Read)
    .bind_buffer(&buf_y, NodeAccess::Write)
    .dispatch(1, 1, 1);
graph.node("C", &p3)
    .bind_buffer(&buf_x, NodeAccess::Read)
    .bind_buffer(&buf_z, NodeAccess::Write)
    .dispatch(1, 1, 1);

// Wave 2: D reads both outputs
graph.node("D", &p4)
    .bind_buffer(&buf_y, NodeAccess::Read)
    .bind_buffer(&buf_z, NodeAccess::Read)
    .dispatch(1, 1, 1);

graph.dispatch(&device)?;
```
This produces three waves with two barriers — the minimum possible for this dependency pattern.
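The wave numbers fall out mechanically from the declared accesses. A toy SWMR scheduler — illustrative only, not Goldy's implementation — reproduces the diamond's 0/1/1/2 assignment:

```rust
use std::collections::HashMap;

// Toy SWMR wave assignment. Each node lists (resource_id, access);
// a node's wave is one past the latest conflicting wave.
#[derive(Clone, Copy, PartialEq)]
enum Access { Read, Write }

fn assign_waves(nodes: &[Vec<(u32, Access)>]) -> Vec<usize> {
    let mut last_write: HashMap<u32, usize> = HashMap::new();
    let mut last_read: HashMap<u32, usize> = HashMap::new();
    let mut waves = Vec::new();
    for node in nodes {
        let mut wave = 0usize;
        for &(r, acc) in node {
            // RAW: any access must follow the latest write.
            if let Some(&w) = last_write.get(&r) {
                wave = wave.max(w + 1);
            }
            // WAR/WAW: a write must also follow prior reads.
            if acc == Access::Write {
                if let Some(&w) = last_read.get(&r) {
                    wave = wave.max(w + 1);
                }
            }
        }
        for &(r, acc) in node {
            match acc {
                Access::Write => { last_write.insert(r, wave); }
                Access::Read => {
                    let e = last_read.entry(r).or_insert(wave);
                    *e = (*e).max(wave);
                }
            }
        }
        waves.push(wave);
    }
    waves
}

fn main() {
    use Access::*;
    // The diamond: A writes X; B and C read X, writing Y and Z; D reads both.
    let nodes = vec![
        vec![(0, Write)],
        vec![(0, Read), (1, Write)],
        vec![(0, Read), (2, Write)],
        vec![(1, Read), (2, Read)],
    ];
    assert_eq!(assign_waves(&nodes), vec![0, 1, 1, 2]);
}
```

Note how B and C land in the same wave because reads never conflict with each other — the SWMR rule in action.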
Buffer views and pool tracking
When using BufferPool, you can declare access at view granularity. Non-overlapping views of the same pool produce no dependency edge and execute in the same wave:
```rust
let view_a = pool.alloc::<u32>(64)?;
let view_b = pool.alloc::<u32>(64)?;

let mut graph = TaskGraph::new();

graph.node("write_a", &pipeline)
    .bind_buffer_view(&view_a, NodeAccess::Write)
    .dispatch(1, 1, 1);

graph.node("write_b", &pipeline)
    .bind_buffer_view(&view_b, NodeAccess::Write)
    .dispatch(1, 1, 1);

// No barrier — view_a and view_b occupy disjoint byte ranges
graph.dispatch(&device)?;
```
Barriers are emitted against the parent buffer handle, so backends require no changes. The scheduler tracks byte ranges internally to determine true overlap.
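The overlap test itself is simple: two half-open byte ranges conflict only if each begins before the other ends. A minimal sketch of that check:

```rust
// Half-open byte ranges [start, end) overlap iff each starts before the
// other ends — the test a scheduler applies to views of one pool.
fn ranges_overlap(a: (u64, u64), b: (u64, u64)) -> bool {
    a.0 < b.1 && b.0 < a.1
}

fn main() {
    // Two 256-byte views allocated back to back in one pool: disjoint.
    let view_a = (0u64, 256u64);
    let view_b = (256u64, 512u64);
    assert!(!ranges_overlap(view_a, view_b));

    // A view straddling the boundary conflicts with both.
    assert!(ranges_overlap((128, 384), view_a));
    assert!(ranges_overlap((128, 384), view_b));
}
```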
Transient resources
Transient buffers and textures exist only for the lifetime of a single graph submission. They are allocated from a shared heap, and non-overlapping lifetimes can alias onto the same memory — reducing allocation pressure for temporaries.
```rust
let mut graph = TaskGraph::new();
let tmp = graph.transient_buffer(256);

graph.node("produce", &pipeline_a)
    .bind_transient_buffer(tmp, NodeAccess::Write)
    .bind_resources_raw(&[0])
    .dispatch(1, 1, 1);

graph.node("consume", &pipeline_b)
    .bind_transient_buffer(tmp, NodeAccess::Read)
    .bind_resources_raw(&[0])
    .dispatch(1, 1, 1);

graph.dispatch(&device)?;
```
Transient textures work the same way:
#![allow(unused)] fn main() { let tmp_tex = graph.transient_texture(width, height, TextureFormat::Rgba8Unorm); graph.node("render", &pipeline) .bind_transient_texture(tmp_tex, NodeAccess::Write) .bind_resources_raw(&[0]) .dispatch(wg_x, wg_y, 1); }
When transients are used, the graph blocks until the GPU completes so the staging heap can be freed. The scheduler uses wave-interval analysis to determine which transients can alias: if two transient buffers are never live in the same wave, they share the same backing memory.
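The aliasing decision can be sketched as greedy interval packing over wave lifetimes. This is illustrative only (not Goldy's allocator) and assumes transients are visited in order of first use:

```rust
// Toy wave-interval aliasing: transients whose [first_wave, last_wave]
// lifetimes never overlap can share one backing allocation.
// Assumes `lifetimes` is sorted by first wave.
fn alias_slots(lifetimes: &[(u32, u32)]) -> Vec<usize> {
    // slot_end[s] = last wave in which slot s is still live
    let mut slot_end: Vec<u32> = Vec::new();
    let mut slots = Vec::with_capacity(lifetimes.len());
    for &(first, last) in lifetimes {
        // Reuse the first slot that went dead before this lifetime starts.
        match slot_end.iter().position(|&end| end < first) {
            Some(s) => { slot_end[s] = last; slots.push(s); }
            None => { slot_end.push(last); slots.push(slot_end.len() - 1); }
        }
    }
    slots
}

fn main() {
    // Transient 0 dies in wave 1, transient 1 lives waves 2-3 — they alias.
    // Transient 2 overlaps transient 1, so it needs its own memory.
    let slots = alias_slots(&[(0, 1), (2, 3), (3, 4)]);
    assert_eq!(slots, vec![0, 0, 1]);
}
```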
Per-resource barriers on Metal
The graph emits ResourceBarrier commands with per-resource granularity. Each backend maps this to its native mechanism:
| Backend | Behavior |
|---|---|
| Metal | memoryBarrierWithResources:count: — precise per-resource barriers within a single compute encoder |
| Vulkan | Global compute pipeline barrier (per-resource VkBufferMemoryBarrier is a future optimization) |
| DX12 | Global UAV barrier (per-resource D3D12_RESOURCE_BARRIER is a future optimization) |
On Metal — the primary beneficiary — the graph enables single-encoder submission with per-resource barriers, eliminating the per-command-buffer overhead of the one-dispatch-per-command-buffer workaround.
Single command buffer submission
All nodes in a TaskGraph are submitted in a single command buffer (or compute encoder on Metal). The scheduler groups independent nodes into waves and inserts barriers only between waves that have true data dependencies. This minimizes scheduling overhead and enables the GPU to overlap independent work within a wave.
Blocking vs non-blocking submission
Non-blocking — returns a TimelineValue for CPU-side synchronization:
```rust
let tv = graph.submit(&device)?;
// CPU work while GPU executes...
device.wait_until(tv)?;
```
Blocking — submits and waits for completion:
#![allow(unused)] fn main() { graph.dispatch(&device)?; }
Practical example: Game of Life
A ping-pong compute pattern using buffer pool views and the task graph:
```rust
let (read_view, write_view) = if use_buffer_a {
    (&view_a, &view_b)
} else {
    (&view_b, &view_a)
};

let mut graph = TaskGraph::new();
graph.node("game_of_life", &compute_pipeline)
    .bind_buffer_view(read_view, NodeAccess::Read)
    .bind_buffer_view(write_view, NodeAccess::Write)
    .bind_resources_raw(&[
        read_view.bindless_handle().unwrap().index(),
        write_view.bindless_handle().unwrap().index(),
    ])
    .dispatch(GRID_WIDTH.div_ceil(8), GRID_HEIGHT.div_ceil(8), 1);

graph.dispatch(&device)?;
use_buffer_a = !use_buffer_a;
```
The graph analyzes the Read and Write declarations on each view and inserts barriers only where needed. Because the two views occupy disjoint byte ranges in the same pool, the scheduler can verify they don't alias — enabling correct execution with minimal synchronization.
Device Timeline
Goldy tracks GPU completion with a monotonic timeline counter — a u64 value (TimelineValue) that increments with each submission. This replaces fence-per-submission models with a single, always-increasing counter on the device.
TimelineValue
Every non-blocking submission returns a TimelineValue:
#![allow(unused)] fn main() { let tv: TimelineValue = graph.submit(&device)?; }
This value represents a point on the device's timeline. When the GPU finishes executing that submission, the timeline advances past tv.
Both TaskGraph::submit and ComputeEncoder::submit return timeline values. Surface presentation via Frame::present also returns one.
Querying GPU progress
device.gpu_progress() returns the latest completed timeline value without blocking:
```rust
let current = device.gpu_progress();
if current >= tv {
    // submission has finished — safe to read back results
}
```
This is a lightweight query (single atomic read on most backends) suitable for polling in a loop or checking once per frame.
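Conceptually, the timeline is just a pair of monotonically increasing counters. A minimal model — not Goldy's internals — shows why the poll is a single atomic load:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// A minimal model of a device timeline: submissions claim increasing
// values; a completion signal advances the counter. Illustrative only.
struct Timeline {
    completed: AtomicU64, // latest value the "GPU" has finished
    next: AtomicU64,      // value handed to the next submission
}

impl Timeline {
    fn new() -> Self {
        Timeline { completed: AtomicU64::new(0), next: AtomicU64::new(1) }
    }

    fn submit(&self) -> u64 {
        // Each submission claims the next point on the timeline.
        self.next.fetch_add(1, Ordering::Relaxed)
    }

    fn signal(&self, value: u64) {
        // Called when work for `value` finishes; never moves backward.
        self.completed.fetch_max(value, Ordering::Release);
    }

    fn gpu_progress(&self) -> u64 {
        // The lightweight poll: one atomic load.
        self.completed.load(Ordering::Acquire)
    }
}

fn main() {
    let tl = Timeline::new();
    let tv_a = tl.submit();
    let tv_b = tl.submit();
    assert!(tl.gpu_progress() < tv_a);
    // Completing b implies a completed too (in-order execution).
    tl.signal(tv_b);
    assert!(tl.gpu_progress() >= tv_a && tl.gpu_progress() >= tv_b);
}
```

The monotonic `fetch_max` is what makes transitive reasoning valid: once the counter passes a value, it has passed every smaller value as well.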
Waiting for completion
device.wait_until(value) blocks the current thread until the GPU timeline reaches at least value:
```rust
let tv = graph.submit(&device)?;

// CPU work while GPU executes...
prepare_next_frame();

// Block until this specific submission completes
device.wait_until(tv)?;
```
For bounded waits, use wait_until_timeout:
```rust
let completed = device.wait_until_timeout(tv, 1000)?; // 1 second timeout
if !completed {
    // GPU hasn't finished yet — handle timeout
}
```
Blocking dispatch
For simple cases where you don't need CPU/GPU overlap, dispatch combines submit + wait:
```rust
graph.dispatch(&device)?; // submits and blocks until complete
```
This is equivalent to:
#![allow(unused)] fn main() { let tv = graph.submit(&device)?; device.wait_until(tv)?; }
How this differs from fence-based synchronization
Traditional GPU APIs use one fence object per submission. You create a fence, attach it to a submit call, then query or wait on that specific fence. Managing multiple in-flight submissions means tracking multiple fence objects.
Goldy's timeline is a single monotonic counter shared across all submissions on a device:
| Fence-based | Timeline-based | |
|---|---|---|
| Tracking | One fence per submission | One counter for the device |
| Query | Poll each fence individually | gpu_progress() >= value |
| Wait | Wait on a specific fence | wait_until(value) |
| Ordering | Fences are independent | Values are monotonically ordered |
| Multi-frame | Track N fence objects | Compare N u64 values |
Because timeline values are ordered, you can reason about completion transitively: if gpu_progress() >= tv_b and tv_b > tv_a, then tv_a has also completed.
Practical use cases
CPU readback after compute
#![allow(unused)] fn main() { let tv = graph.submit(&device)?; device.wait_until(tv)?; let result: Vec<f32> = buffer.read_data(0)?; }
Multi-frame pipelining
Overlap CPU frame N+1 preparation with GPU frame N execution:
```rust
let mut pending: Option<TimelineValue> = None;

loop {
    // Wait for the previous frame to finish before reusing its resources
    if let Some(tv) = pending {
        device.wait_until(tv)?;
    }

    // Prepare frame N+1 on the CPU
    update_uniforms(&uniform_buffer)?;

    // Submit frame N+1 — GPU starts working, CPU continues
    let tv = graph.submit(&device)?;
    pending = Some(tv);

    // CPU work for the next iteration...
}
```
Polling without blocking
Check completion in a non-blocking render loop:
```rust
let tv = graph.submit(&device)?;
loop {
    if device.gpu_progress() >= tv {
        break; // done
    }
    // do other work, yield, etc.
}
```
Resource lifetime
Dropping a Buffer or Texture may be deferred internally: the GPU memory stays alive until all submissions that reference it have completed. Submit (or present a frame) before dropping resources that must outlive those commands.
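A sketch of the bookkeeping this implies: a queue of (timeline value, resource) pairs drained as gpu_progress advances. The names are hypothetical, and this is illustrative rather than Goldy's internals:

```rust
use std::collections::VecDeque;

// Deferred resource destruction keyed by timeline value: a dropped
// resource is parked until gpu_progress passes the last submission
// that referenced it. Names here are hypothetical.
struct DeferredQueue<T> {
    pending: VecDeque<(u64, T)>, // (last referencing submission, resource)
}

impl<T> DeferredQueue<T> {
    fn new() -> Self {
        DeferredQueue { pending: VecDeque::new() }
    }

    fn defer(&mut self, last_use: u64, resource: T) {
        // Timeline values are monotonic, so the queue stays sorted.
        self.pending.push_back((last_use, resource));
    }

    /// Free everything whose last referencing submission has completed;
    /// returns how many resources were released.
    fn collect(&mut self, gpu_progress: u64) -> usize {
        let mut freed = 0;
        while matches!(self.pending.front(), Some(&(tv, _)) if tv <= gpu_progress) {
            self.pending.pop_front(); // dropping T releases the memory
            freed += 1;
        }
        freed
    }
}

fn main() {
    let mut q = DeferredQueue::new();
    q.defer(5, "buffer_a");
    q.defer(7, "buffer_b");
    assert_eq!(q.collect(6), 1); // only buffer_a's submission has finished
    assert_eq!(q.collect(7), 1); // buffer_b retires once the GPU reaches 7
}
```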
Compute to Surface
Compute-to-surface lets a compute shader write directly to the swapchain texture, bypassing the rasterization pipeline entirely. There is no RenderPipeline, no vertex buffers, no CommandEncoder — just a compute dispatch that fills pixels.
When to use compute-to-surface
Use compute-to-surface when your rendering is naturally a per-pixel computation rather than geometry rasterization:
- Fullscreen image effects (plasma, fractals, ray marching)
- GPU-driven 2D renderers where the compute shader owns the output layout
- Post-processing that doesn't need triangle rasterization
- Prototyping visual effects without setting up a render pipeline
Use traditional rendering when you need the rasterization pipeline's features: triangle assembly, depth testing, MSAA, alpha blending, or vertex/fragment shader stages.
Getting the swapchain texture
Acquire a frame from the surface and call frame.texture() to get a writable Texture handle to the current swapchain image:
#![allow(unused)] fn main() { let frame = surface.begin()?; let texture = frame.texture(); }
This texture is valid until the frame is presented. You can obtain its bindless handle and pass it to a compute shader like any other texture:
#![allow(unused)] fn main() { let texture_handle = texture .bindless_handle() .expect("Surface texture has no bindless handle"); }
Building the task graph
Create a TaskGraph with a compute node that writes to the swapchain texture. The task graph handles barrier insertion between compute writes and the presentation engine:
#![allow(unused)] fn main() { let wg_x = width.div_ceil(8); let wg_y = height.div_ceil(8); let mut graph = TaskGraph::new(); graph.node("compute", &compute_pipeline) .bind_buffer(&uniform_buffer, NodeAccess::Read) .bind_resources_raw(&[uniform_handle.index(), texture_handle.index()]) .dispatch(wg_x, wg_y, 1); }
Submitting and presenting
Use frame.submit_compute(graph) to record the compute work into the frame, then present:
#![allow(unused)] fn main() { frame.submit_compute(&graph)?; frame.present()?; }
submit_compute compiles the task graph into a command stream and records it into the frame's command buffer. Presentation happens when you call present() — the compute shader has already written the pixels.
The compute shader
The shader receives the output texture as a DirectSpatial<float4> — a read-write 2D texture accessed by integer coordinates:
import goldy_exp;
struct Uniforms {
uint width;
uint height;
float time;
float _padding;
};
[goldy_compute]
[numthreads(8, 8, 1)]
void cs_main(BufRO<Uniforms> uniforms_buf, DirectSpatial<float4> output, ThreadId tid) {
Uniforms u = uniforms_buf[0];
if (tid.x >= u.width || tid.y >= u.height)
return;
float2 uv = float2(float(tid.x) / float(u.width),
float(tid.y) / float(u.height));
// Compute pixel color...
float3 col = my_color_function(uv, u.time);
output[tid.xy] = float4(col, 1.0);
}
The [numthreads(8, 8, 1)] workgroup size maps naturally to 2D image tiles. Dispatch enough workgroups to cover the full resolution:
#![allow(unused)] fn main() { let wg_x = width.div_ceil(8); let wg_y = height.div_ceil(8); }
Guard against out-of-bounds writes in the shader when the resolution isn't a multiple of the workgroup size.
Full example
A complete compute-to-surface application rendering an animated plasma effect:
```rust
use goldy::{
    Buffer, ComputePipeline, DataAccess, DeviceType, Instance, NodeAccess,
    PresentMode, ShaderModule, Surface, SurfaceConfig, TaskGraph,
};

// Create device and surface
let instance = Instance::new()?;
let device = instance.create_device(DeviceType::DiscreteGpu)?;
let surface = Surface::new_with_config(
    &device,
    &window,
    SurfaceConfig {
        present_mode: PresentMode::Fifo,
        depth_format: None,
    },
)?;

// Compile compute shader and create pipeline
let shader = ShaderModule::from_slang(&device, COMPUTE_SHADER)?;
let compute_pipeline = ComputePipeline::new(&device, &shader)?;

// Create uniform buffer
let uniform_buffer = Buffer::with_data(
    &device,
    &[Uniforms {
        width: surface.width(),
        height: surface.height(),
        time: 0.0,
        _padding: 0.0,
    }],
    DataAccess::Scattered,
)?;

// --- Render loop ---

// Update uniforms
uniform_buffer.write(0, bytemuck::bytes_of(&Uniforms {
    width,
    height,
    time: elapsed,
    _padding: 0.0,
}))?;

// Acquire frame and get swapchain texture
let frame = surface.begin()?;
let texture = frame.texture();

let uniform_handle = uniform_buffer
    .bindless_srv_handle()
    .expect("Uniform buffer has no bindless SRV handle");
let texture_handle = texture
    .bindless_handle()
    .expect("Surface texture has no bindless handle");

// Build and submit compute graph
let wg_x = width.div_ceil(8);
let wg_y = height.div_ceil(8);

let mut graph = TaskGraph::new();
graph.node("compute", &compute_pipeline)
    .bind_buffer(&uniform_buffer, NodeAccess::Read)
    .bind_resources_raw(&[uniform_handle.index(), texture_handle.index()])
    .dispatch(wg_x, wg_y, 1);

frame.submit_compute(&graph)?;
frame.present()?;
```
The uniform buffer uses bindless_srv_handle() because the shader accesses it through BufRO<Uniforms>, which maps to a read-only SRV on DX12. On Vulkan and Metal this falls back to the unified storage-buffer index.
Pipelines
Pipelines combine compiled shaders with fixed-function rendering state. Goldy provides RenderPipeline for graphics and ComputePipeline for compute workloads.
Render Pipelines
A RenderPipeline pairs vertex and fragment shaders with a RenderPipelineDesc that configures vertex input, primitive assembly, depth testing, and the output format.
Creating a Render Pipeline
```rust
use goldy::{
    RenderPipeline, RenderPipelineDesc, ShaderModule, Vertex2D,
    TextureFormat, PrimitiveTopology,
};

let vs = ShaderModule::from_slang(&device, include_str!("shaders/tri.vs.slang"))?;
let fs = ShaderModule::from_slang(&device, include_str!("shaders/tri.fs.slang"))?;

let pipeline = RenderPipeline::new(&device, &vs, &fs, &RenderPipelineDesc {
    vertex_layout: Vertex2D::layout(),
    topology: PrimitiveTopology::TriangleList,
    target_format: surface.format(),
    depth_stencil: None,
})?;
```
RenderPipelineDesc
```rust
pub struct RenderPipelineDesc {
    pub vertex_layout: VertexBufferLayout,
    pub topology: PrimitiveTopology,
    pub target_format: TextureFormat,
    pub depth_stencil: Option<DepthStencilState>,
}
```
| Field | Purpose | Default |
|---|---|---|
vertex_layout | Describes vertex buffer stride and attributes | Empty (no vertex input) |
topology | How vertices are assembled into primitives | TriangleList |
target_format | Pixel format of the render target — must match surface.format() or the format passed to RenderTarget::new() | Rgba8Unorm |
depth_stencil | Depth/stencil test configuration, or None to disable | None |
The default descriptor is valid for fullscreen passes that generate geometry from SV_VertexID and render to an Rgba8Unorm target without depth testing.
Format Matching
The pipeline's target_format must match the render target it will draw into. Mismatched formats produce backend errors or undefined output.
#![allow(unused)] fn main() { let desc = RenderPipelineDesc { target_format: surface.format(), ..Default::default() }; }
Vertex Buffer Layouts
A VertexBufferLayout tells the pipeline how to interpret vertex buffer memory. For passes that do not use vertex buffers (fullscreen triangles, quad instancing), the default empty layout is correct.
For typed vertex input, use the from_formats builder or a built-in type's layout() method. See Vertex Types and Layouts for details.
```rust
let layout = VertexBufferLayout::from_formats::<MyVertex>(&[
    VertexFormat::Float32x3, // position
    VertexFormat::Float32x2, // uv
]);
```
Primitive Topology
Controls how the vertex stream is assembled into geometric primitives:
```rust
pub enum PrimitiveTopology {
    PointList,
    LineList,
    LineStrip,
    TriangleList, // default
    TriangleStrip,
}
```
PointList: • • • •
LineList: •——• •——•
LineStrip: •——•——•——•
TriangleList: △ △
TriangleStrip: △▽△▽
Depth/Stencil State
Enable depth testing by setting depth_stencil. The surface or render target must have been created with a matching depth format.
```rust
use goldy::{DepthStencilState, DepthFormat, CompareFunction};

let pipeline = RenderPipeline::new(&device, &vs, &fs, &RenderPipelineDesc {
    vertex_layout: Vertex2D::layout(),
    target_format: surface.format(),
    topology: PrimitiveTopology::TriangleList,
    depth_stencil: Some(DepthStencilState {
        format: DepthFormat::Depth32Float,
        depth_write_enabled: true,
        depth_compare: CompareFunction::Less,
    }),
})?;
```
DepthStencilState fields:
| Field | Purpose | Default |
|---|---|---|
format | Depth texture format (Depth16Unorm, Depth24Plus, Depth32Float, etc.) | Depth24Plus |
depth_write_enabled | Whether fragments write to the depth buffer | true |
depth_compare | Comparison function — Less, LessEqual, Greater, Always, etc. | Less |
Available depth formats:
| Format | Bits | Stencil |
|---|---|---|
Depth16Unorm | 16-bit | No |
Depth24Plus | 24-bit (may use 32 internally) | No |
Depth24PlusStencil8 | 24-bit + 8-bit stencil | Yes |
Depth32Float | 32-bit float | No |
Depth32FloatStencil8 | 32-bit float + 8-bit stencil | Yes |
For reverse-Z rendering, use CompareFunction::Greater and clear depth to 0.0.
Compute Pipelines
ComputePipeline wraps a single compute shader. See the compute documentation for the full compute API.
#![allow(unused)] fn main() { use goldy::{ComputePipeline, ShaderModule}; let cs = ShaderModule::from_slang(&device, include_str!("shaders/sim.cs.slang"))?; let pipeline = ComputePipeline::new(&device, &cs)?; }
Why Goldy Has Fewer Pipelines
Pipeline State Object (PSO) explosion is one of the biggest pain points in modern graphics. Engines routinely manage thousands of pipeline permutations and ship massive shader caches. Goldy eliminates most combinatorial dimensions:
| Dimension | Traditional Vulkan/DX12 | Goldy |
|---|---|---|
| Render pass compatibility | N render passes × M subpasses | Eliminated — dynamic rendering |
| Descriptor set layouts | Per-material layout permutations | One global bindless layout |
| Pipeline layouts | Per-material | One shared layout |
| Viewport / scissor | Baked into PSO | Dynamic state |
| Vertex format | Baked | Baked (unavoidable) |
| Target format | Baked | Baked (unavoidable) |
RenderPipelineDesc has exactly four fields. The permutation space is vertex_layouts × topologies × target_formats × depth_configs — deliberately small.
Performance
Pipelines are expensive to create (shader compilation, PSO allocation) but cheap to bind during rendering. Create them once at startup and reuse across frames.
```rust
struct Renderer {
    scene_pipeline: RenderPipeline,
    ui_pipeline: RenderPipeline,
    wireframe_pipeline: RenderPipeline,
}

impl Renderer {
    fn new(device: &Device, surface: &Surface) -> Result<Self> {
        // Create all pipelines upfront
        Ok(Self {
            scene_pipeline: create_scene_pipeline(device, surface.format())?,
            ui_pipeline: create_ui_pipeline(device, surface.format())?,
            wireframe_pipeline: create_wireframe_pipeline(device, surface.format())?,
        })
    }
}
```
Command Encoding
CommandEncoder records GPU rendering commands without executing them. It is completely lock-free and does not touch the GPU backend — you can create and fill encoders on any thread. The actual GPU work happens when you submit the commands through Frame::render() or RenderTarget::render().
Recording Commands
#![allow(unused)] fn main() { use goldy::{CommandEncoder, Color}; let mut encoder = CommandEncoder::new(); { let mut pass = encoder.begin_render_pass(); pass.clear(Color::CORNFLOWER_BLUE); pass.set_pipeline(&pipeline); pass.set_vertex_buffer(0, &vertices); pass.draw(0..3, 0..1); } // pass ends when dropped let commands = encoder.finish(); }
Render Pass
A RenderPass is a borrow of the encoder that groups drawing commands. It begins with begin_render_pass() and ends when the RenderPass value is dropped.
#![allow(unused)] fn main() { let mut encoder = CommandEncoder::new(); { let mut pass = encoder.begin_render_pass(); // all draw commands go here } }
Commands within a pass execute in recorded order.
Clearing
Clear the color attachment, the depth buffer, or both:
#![allow(unused)] fn main() { pass.clear(Color::BLACK); pass.clear_depth(1.0); // standard depth clear (far plane) pass.clear_depth(0.0); // reverse-Z depth clear }
Setting the Pipeline
Bind the active RenderPipeline. You can switch pipelines within the same pass.
#![allow(unused)] fn main() { pass.set_pipeline(&scene_pipeline); // ... draw scene ... pass.set_pipeline(&ui_pipeline); // ... draw UI ... }
Vertex and Index Buffers
Bind vertex data to a numbered slot. Both Buffer and BufferView are accepted — for pool-allocated views, the parent buffer and offset are resolved automatically.
#![allow(unused)] fn main() { pass.set_vertex_buffer(0, &vertex_buffer); // With an explicit additional offset: pass.set_vertex_buffer_offset(0, &vertex_buffer, byte_offset); }
Bind an index buffer for indexed drawing:
#![allow(unused)] fn main() { use goldy::IndexFormat; pass.set_index_buffer(&index_buffer, IndexFormat::Uint16); // With an additional offset: pass.set_index_buffer_offset(&index_buffer, byte_offset, IndexFormat::Uint32); }
Binding Resources
Goldy's bindless model passes resource indices to shaders through push constants. There are three binding methods:
Typed handles (preferred for new code) — each handle carries its BindlessCategory, enabling validation against shader reflection:
#![allow(unused)] fn main() { let tex = texture.bindless_handle().unwrap(); let samp = sampler.bindless_handle().unwrap(); pass.bind_resources_typed(&[tex, samp]); }
Buffer references — extracts bindless indices from Buffer objects:
#![allow(unused)] fn main() { pass.bind_resources(&[&uniform_buffer, &data_buffer]); }
Raw indices — for manual control or when mixing resource types:
#![allow(unused)] fn main() { let tex_idx = texture.bindless_index().unwrap(); let samp_idx = sampler.bindless_index().unwrap(); pass.bind_resources_raw(&[tex_idx, samp_idx]); }
Raw indices can also carry user scalars alongside bindless indices:
#![allow(unused)] fn main() { pass.bind_resources_raw_with_user( &[buf_idx, tex_idx], // bindless indices (region A) &[frame_number], // user scalars (region B) ); }
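The two-region layout can be pictured as one contiguous push-constant payload. The sketch below is an illustrative assumption about how region A and region B might be packed, not Goldy's actual ABI:

```rust
// Hypothetical sketch: bindless indices (region A) followed by user scalars
// (region B) in one flat push-constant payload. Layout is assumed, not Goldy's.
fn pack_push_constants(bindless: &[u32], user: &[u32]) -> Vec<u32> {
    let mut payload = Vec::with_capacity(bindless.len() + user.len());
    payload.extend_from_slice(bindless); // region A: resource indices
    payload.extend_from_slice(user);     // region B: user scalars
    payload
}

fn main() {
    let payload = pack_push_constants(&[7, 42], &[1000]);
    assert_eq!(payload, vec![7, 42, 1000]);
}
```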
Draw Calls
draw
Draw non-indexed primitives:
#![allow(unused)] fn main() { // draw(vertex_range, instance_range) pass.draw(0..3, 0..1); // 3 vertices, 1 instance pass.draw(0..6, 0..10); // 6 vertices, 10 instances pass.draw(100..106, 0..1); // 6 vertices starting at vertex 100 }
draw_indexed
Draw indexed primitives. Requires a prior set_index_buffer() call.
#![allow(unused)] fn main() { // draw_indexed(index_range, base_vertex, instance_range) pass.draw_indexed(0..36, 0, 0..1); // base_vertex is added to each index before vertex fetch pass.draw_indexed(0..6, 1000, 0..1); // negative base_vertex is allowed pass.draw_indexed(0..3, -50, 0..1); }
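The base_vertex arithmetic is easy to model on the CPU. This sketch reproduces the semantics stated above (each fetched index is offset by base_vertex before vertex fetch):

```rust
// Model of the GPU's draw_indexed index math:
// fetched_vertex = index_buffer[i] + base_vertex
fn effective_vertices(indices: &[u32], base_vertex: i32) -> Vec<i64> {
    indices
        .iter()
        .map(|&i| i as i64 + base_vertex as i64)
        .collect()
}

fn main() {
    // base_vertex 1000 shifts every fetched vertex up by 1000
    assert_eq!(effective_vertices(&[0, 1, 2], 1000), vec![1000, 1001, 1002]);
    // a negative base_vertex shifts down
    assert_eq!(effective_vertices(&[50, 51, 52], -50), vec![0, 1, 2]);
}
```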
draw_fullscreen
Draw a fullscreen triangle (3 vertices, no vertex buffer needed). Pair with vs_fullscreen_triangle() from goldy_exp.vertex or fullscreen_position()/fullscreen_uv() from goldy_exp.primitives.
#![allow(unused)] fn main() { pass.set_pipeline(&fullscreen_pipeline); pass.bind_resources(&[&uniform_buffer]); pass.draw_fullscreen(); }
This is more efficient than a fullscreen quad (3 vertices vs 6) and eliminates vertex buffer overhead entirely.
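The oversized-triangle trick that helpers like vs_fullscreen_triangle() typically implement derives position and UV from the vertex index alone. This is the common formulation; Goldy's exact helper may differ:

```rust
// Standard fullscreen-triangle derivation from the vertex index:
// three vertices whose triangle covers the screen; clipping trims the overshoot.
fn fullscreen_vertex(id: u32) -> ([f32; 2], [f32; 2]) {
    let uv = [((id << 1) & 2) as f32, (id & 2) as f32];
    let pos = [uv[0] * 2.0 - 1.0, uv[1] * 2.0 - 1.0];
    (pos, uv)
}

fn main() {
    assert_eq!(fullscreen_vertex(0).0, [-1.0, -1.0]);
    assert_eq!(fullscreen_vertex(1).0, [3.0, -1.0]);
    assert_eq!(fullscreen_vertex(2).0, [-1.0, 3.0]);
}
```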
draw_quads
Draw N instanced quads (6 vertices each, no vertex buffer needed). The shader reads per-instance data from a buffer and uses quad_position() from goldy_exp.primitives to generate vertex positions.
#![allow(unused)] fn main() { pass.set_pipeline(&instanced_pipeline); pass.bind_resources(&[&instance_buffer]); pass.draw_quads(400); // draw 400 quads }
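A quad_position()-style helper maps each of the six per-quad vertex indices to a corner of a unit quad (two triangles). The corner ordering below is an illustrative assumption, not necessarily Goldy's:

```rust
// Plausible sketch of quad vertex generation: vertex id 0..6 selects a corner
// of a unit quad built from two triangles. Corner order is assumed.
fn quad_corner(vertex_id: u32) -> [f32; 2] {
    const CORNERS: [[f32; 2]; 4] = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]];
    const ORDER: [usize; 6] = [0, 1, 2, 0, 2, 3]; // triangle 1, triangle 2
    CORNERS[ORDER[(vertex_id % 6) as usize]]
}

fn main() {
    // draw_quads(n) issues n * 6 vertices; the instance index selects per-quad data
    assert_eq!(quad_corner(0), [0.0, 0.0]);
    assert_eq!(quad_corner(5), [0.0, 1.0]);
}
```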
Submitting Commands
After recording, submit the encoder to a surface frame or render target:
#![allow(unused)] fn main() { // Surface presentation let frame = surface.begin()?; frame.render(encoder)?; frame.present()?; // Headless render target target.render(encoder)?; }
Complete Example
#![allow(unused)] fn main() { let mut encoder = CommandEncoder::new(); { let mut pass = encoder.begin_render_pass(); pass.clear(Color::BLACK); pass.clear_depth(1.0); // Draw opaque geometry pass.set_pipeline(&scene_pipeline); pass.set_vertex_buffer(0, &mesh_vertices); pass.set_index_buffer(&mesh_indices, IndexFormat::Uint32); pass.bind_resources(&[&camera_uniforms]); pass.draw_indexed(0..index_count, 0, 0..1); // Draw fullscreen post-process pass.set_pipeline(&post_pipeline); pass.bind_resources(&[&post_uniforms]); pass.draw_fullscreen(); } let frame = surface.begin()?; frame.render(encoder)?; frame.present()?; }
Best Practices
- Batch draws by pipeline. Pipeline switches are cheap but not free. Group objects that share the same pipeline.
- Clear once per pass. Issue clear() at the start, then draw everything.
- Use convenience methods. draw_fullscreen() and draw_quads() avoid unnecessary vertex buffer allocations.
- Encode on any thread. CommandEncoder is lock-free; build command buffers in parallel if needed.
Vertex Types and Layouts
Goldy provides built-in vertex types for common 2D rendering and a layout builder for custom vertex formats. Vertex data is described by a VertexBufferLayout that tells the pipeline how to interpret buffer memory.
Built-in Vertex Types
Vertex2D
Position + color. Use for colored primitives, particles, and debug visualization.
#![allow(unused)] fn main() { use goldy::{Vertex2D, Color}; let vertices = vec![ Vertex2D::new(-0.5, -0.5, Color::RED), Vertex2D::new( 0.5, -0.5, Color::GREEN), Vertex2D::new( 0.0, 0.5, Color::BLUE), ]; }
Memory layout (24 bytes per vertex):
| Location | Field | Format | Offset |
|---|---|---|---|
| 0 | position | Float32x2 | 0 |
| 1 | color | Float32x4 | 8 |
Get the pipeline layout with Vertex2D::layout().
Vertex2DUv
Position + texture coordinates. Use for textured quads, sprites, and shader effects.
#![allow(unused)] fn main() { use goldy::Vertex2DUv; let vertices = vec![ Vertex2DUv::new(-1.0, -1.0, 0.0, 1.0), Vertex2DUv::new( 1.0, -1.0, 1.0, 1.0), Vertex2DUv::new( 0.0, 1.0, 0.5, 0.0), ]; }
Memory layout (16 bytes per vertex):
| Location | Field | Format | Offset |
|---|---|---|---|
| 0 | position | Float32x2 | 0 |
| 1 | uv | Float32x2 | 8 |
Get the pipeline layout with Vertex2DUv::layout().
Using Built-in Types in Pipelines
Both types provide a layout() method that returns the correct VertexBufferLayout:
#![allow(unused)] fn main() { let pipeline = RenderPipeline::new(&device, &vs, &fs, &RenderPipelineDesc { vertex_layout: Vertex2D::layout(), target_format: surface.format(), ..Default::default() })?; }
Both types implement StructuredBufferElement, so they can also be stored in Buffer::with_data and BufferPool::alloc_with_data.
Custom Vertex Layouts
Defining a Custom Vertex
Custom vertex types must be #[repr(C)] and derive bytemuck::Pod and bytemuck::Zeroable:
#![allow(unused)] fn main() { #[repr(C)] #[derive(Clone, Copy, bytemuck::Pod, bytemuck::Zeroable)] struct MyVertex { position: [f32; 3], normal: [f32; 3], uv: [f32; 2], color: u32, } }
Building a Layout with from_formats
VertexBufferLayout::from_formats::<T> infers locations (sequential from 0) and offsets (accumulated from format sizes), then validates that the total matches size_of::<T>():
#![allow(unused)] fn main() { use goldy::types::{VertexBufferLayout, VertexFormat}; let layout = VertexBufferLayout::from_formats::<MyVertex>(&[ VertexFormat::Float32x3, // position (12 bytes) VertexFormat::Float32x3, // normal (12 bytes) VertexFormat::Float32x2, // uv (8 bytes) VertexFormat::Uint32, // color (4 bytes) ]); // stride = 36, 4 attributes }
The builder panics if the summed format sizes don't equal size_of::<T>(), catching field-list mismatches at pipeline creation rather than producing silent GPU corruption.
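The inference rule is simple enough to sketch directly: offsets accumulate from format sizes, and the total must equal the struct size. This is an illustrative reimplementation, not Goldy's actual code:

```rust
// Sketch of from_formats inference: sequential locations, accumulated offsets,
// and a size check against the Rust struct.
fn infer_layout(format_sizes: &[usize], struct_size: usize) -> Result<Vec<(u32, usize)>, String> {
    let mut offset = 0;
    let mut attrs = Vec::new();
    for (location, &size) in format_sizes.iter().enumerate() {
        attrs.push((location as u32, offset)); // (location, byte offset)
        offset += size;
    }
    if offset != struct_size {
        return Err(format!("format sizes sum to {offset}, struct is {struct_size}"));
    }
    Ok(attrs)
}

fn main() {
    // MyVertex: Float32x3 + Float32x3 + Float32x2 + Uint32 = 36 bytes
    let attrs = infer_layout(&[12, 12, 8, 4], 36).unwrap();
    assert_eq!(attrs, vec![(0, 0), (1, 12), (2, 24), (3, 32)]);
    // A mismatched field list is caught here instead of corrupting GPU reads
    assert!(infer_layout(&[12, 12, 8], 36).is_err());
}
```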
Manual Layout
For full control, construct the layout directly:
#![allow(unused)] fn main() { use goldy::types::{VertexBufferLayout, VertexAttribute, VertexFormat}; let layout = VertexBufferLayout { stride: 32, attributes: vec![ VertexAttribute { location: 0, format: VertexFormat::Float32x3, offset: 0 }, VertexAttribute { location: 1, format: VertexFormat::Float32x3, offset: 12 }, VertexAttribute { location: 2, format: VertexFormat::Float32x2, offset: 24 }, ], }; }
Empty Layout
When the vertex shader generates geometry from SV_VertexID (fullscreen triangles, instanced quads), use the default empty layout:
#![allow(unused)] fn main() { let pipeline = RenderPipeline::new(&device, &vs, &fs, &RenderPipelineDesc { vertex_layout: VertexBufferLayout::empty(), ..Default::default() })?; }
VertexBufferLayout::default() also returns an empty layout.
Vertex Formats
Available formats for vertex attributes:
| Format | Rust Type | Size |
|---|---|---|
Float32 | f32 | 4 |
Float32x2 | [f32; 2] | 8 |
Float32x3 | [f32; 3] | 12 |
Float32x4 | [f32; 4] | 16 |
Uint32 | u32 | 4 |
Sint32 | i32 | 4 |
Uint8x4 | [u8; 4] (packed) | 4 |
Unorm8x4 | [u8; 4] (normalized) | 4 |
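The difference between Uint8x4 and Unorm8x4 is the normalization step: unorm formats divide each byte by 255 so the shader sees values in [0.0, 1.0]. This is the standard unorm conversion rule:

```rust
// Unorm8 conversion: byte 0..=255 maps linearly onto 0.0..=1.0.
fn unorm8_to_f32(byte: u8) -> f32 {
    byte as f32 / 255.0
}

fn main() {
    assert_eq!(unorm8_to_f32(0), 0.0);
    assert_eq!(unorm8_to_f32(255), 1.0);
    // mid-gray 128 lands just above 0.5
    assert!((unorm8_to_f32(128) - 0.50196).abs() < 1e-4);
}
```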
Vertex Data Flow
In Slang shaders, vertex attributes arrive through the [goldy_vertex] virtual entry point. The pipeline's VertexBufferLayout determines which attributes the hardware feeds into the shader's input struct. Attribute locations in the layout must match the shader's declared input locations.
For passes that bypass vertex buffers entirely, Slang helpers like vs_fullscreen_triangle() and quad_position() in goldy_exp.primitives generate geometry from SV_VertexID and SV_InstanceID.
Rendering Outputs
Surface manages a swapchain for zero-copy GPU-to-display presentation. It wraps the platform window handle, acquires drawable textures each frame, and presents finished frames to the display.
Creating a Surface
A Surface requires a Device and a window that implements HasWindowHandle + HasDisplayHandle (from the raw-window-handle crate).
#![allow(unused)] fn main() { use goldy::{Surface, SurfaceConfig, PresentMode, DepthFormat}; // Simplest form — Auto present mode, no depth buffer let surface = Surface::new(&device, &window)?; // With explicit configuration let surface = Surface::new_with_config(&device, &window, SurfaceConfig { present_mode: PresentMode::Fifo, depth_format: Some(DepthFormat::Depth32Float), })?; // Shorthand for depth-only configuration let surface = Surface::new_with_depth(&device, &window, Some(DepthFormat::Depth24Plus))?; }
SurfaceConfig
#![allow(unused)] fn main() { pub struct SurfaceConfig { pub present_mode: PresentMode, pub depth_format: Option<DepthFormat>, } }
| Field | Purpose | Default |
|---|---|---|
present_mode | Vsync strategy | Auto |
depth_format | Depth buffer format, or None to disable | None |
Present Modes
| Mode | Behavior | Backend Mapping |
|---|---|---|
Fifo | Vsync — wait for display refresh. No tearing, capped at monitor Hz. | Metal displaySyncEnabled=YES, Vulkan FIFO, DX12 Present(1) |
Mailbox | Triple-buffered — latest frame queued, older dropped. Low latency + no tearing. | Vulkan MAILBOX. Falls back to Fifo on Metal and some DX12 configurations. |
Immediate | No sync, may tear. Maximum throughput for benchmarks. | Metal displaySyncEnabled=NO, Vulkan IMMEDIATE, DX12 Present(0) |
Auto | Goldy chooses (Mailbox if available, then Fifo). | — |
Change the present mode at runtime without recreating the surface:
#![allow(unused)] fn main() { surface.set_present_mode(PresentMode::Immediate)?; let current = surface.present_mode(); }
Frame Acquisition Cycle
Each frame follows a begin → record → present sequence:
#![allow(unused)] fn main() { loop { // 1. Begin the frame (acquire a swapchain image) let frame = surface.begin()?; // 2. Record rendering commands let mut encoder = CommandEncoder::new(); { let mut pass = encoder.begin_render_pass(); pass.clear(Color::CORNFLOWER_BLUE); pass.set_pipeline(&pipeline); pass.set_vertex_buffer(0, &vertices); pass.draw(0..3, 0..1); } // 3. Submit and present frame.render(encoder)?; frame.present()?; } }
surface.acquire() is a legacy alias for surface.begin().
Frame
Frame represents a single acquired swapchain image, spanning begin() through present(). It tracks whether the frame has been presented and auto-presents on drop if you forget.
Frame Properties
#![allow(unused)] fn main() { let frame = surface.begin()?; frame.width(); // frame dimensions (may differ from surface after resize) frame.height(); }
Graphics Path — Frame::render
Record draw commands into a CommandEncoder and submit with render():
#![allow(unused)] fn main() { frame.render(encoder)?; frame.present()?; }
Compute Path — Frame::submit_compute
For compute-to-surface workflows, access the frame's texture directly and submit a TaskGraph:
#![allow(unused)] fn main() { let frame = surface.begin()?; let tex = frame.texture(); // the swapchain texture as a storage image // Build a task graph that writes to tex... frame.submit_compute(&task_graph)?; frame.present()?; }
frame.texture() returns a &Texture with SpatialAccess::Direct, suitable for binding as a storage image in compute shaders.
Presenting
frame.present() consumes the Frame, submits all recorded work, and queues the image for display. It returns a TimelineValue that can be used with Device::wait_until().
#![allow(unused)] fn main() { let timeline = frame.present()?; }
If a Frame is dropped without calling present(), it auto-presents to avoid leaking the swapchain image. This is safe but wastes a frame.
Surface Queries
#![allow(unused)] fn main() { surface.width(); surface.height(); surface.size(); // (width, height) surface.format(); // TextureFormat of the swapchain images // Validate that a pipeline's target format matches surface.validate_pipeline_format(pipeline_format)?; }
Resize Handling
Call resize() when the window size changes. Zero-size dimensions are silently ignored (common during window minimize).
#![allow(unused)] fn main() { fn on_resize(surface: &mut Surface, width: u32, height: u32) -> Result<()> { surface.resize(width, height)?; Ok(()) } }
Texture Format
The swapchain format is chosen by the backend at surface creation (typically Bgra8UnormSrgb). Always use surface.format() when creating pipelines to ensure a match:
#![allow(unused)] fn main() { let desc = RenderPipelineDesc { target_format: surface.format(), ..Default::default() }; }
Frame Lifetime
Frame follows Rust ownership semantics:
- begin() acquires the swapchain image and returns a Frame
- texture() borrows the frame — valid until present() is called
- present() consumes the Frame — the borrow checker prevents use-after-present
- Dropping without presenting auto-presents (prevents swapchain deadlock)
#![allow(unused)] fn main() { let frame = surface.begin()?; let tex = frame.texture(); // tex is valid here frame.present()?; // tex is now invalid — Rust prevents accessing it }
Buffers
Buffer is a GPU memory allocation for storing typed data — uniforms, vertex data, index data, compute storage, or anything a shader needs to read or write.
Creating Buffers
With Typed Data
Buffer::with_data creates a buffer and uploads an initial slice. The element stride is inferred from T, which is critical for correct StructuredBuffer views on DX12.
#![allow(unused)] fn main() { use goldy::{Buffer, DataAccess}; let positions = vec![[0.0f32, 1.0, 0.0], [1.0, 0.0, 0.0]]; let buffer = Buffer::with_data(&device, &positions, DataAccess::Scattered)?; }
Type matters. Passing &[u8] (e.g. from bytemuck::bytes_of) sets the element stride to 1 byte, while shaders usually expect a larger struct stride. Use a typed slice or with_bytes_stride instead.
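The stride difference is visible directly in size_of. The struct below is a hypothetical example illustrating the pitfall:

```rust
// Why the element type matters: stride comes from size_of::<T>(), so a typed
// slice and a byte view of the same data imply very different strides.
#[repr(C)]
#[derive(Clone, Copy)]
#[allow(dead_code)]
struct Particle {
    position: [f32; 3],
    velocity: [f32; 3],
}

fn main() {
    // Typed slice: stride = 24, matching StructuredBuffer<Particle> in the shader
    assert_eq!(std::mem::size_of::<Particle>(), 24);
    // Byte slice: stride = 1, which a struct-typed shader view will misread
    assert_eq!(std::mem::size_of::<u8>(), 1);
}
```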
With Typed Data and Flags
#![allow(unused)] fn main() { let buffer = Buffer::with_data_and_flags( &device, &data, DataAccess::Scattered, BufferFlags::CPU_READABLE, )?; }
With Raw Bytes
When the data is naturally &[u8], use one of the byte-oriented constructors:
#![allow(unused)] fn main() { // Stride defaults to 1 (byte-addressable) let buffer = Buffer::with_bytes(&device, &raw_bytes, DataAccess::Scattered)?; // Explicit stride for structured buffer views let buffer = Buffer::with_bytes_stride(&device, &raw_bytes, DataAccess::Scattered, 16)?; }
Empty Buffer
#![allow(unused)] fn main() { let buffer = Buffer::new(&device, 4096, DataAccess::Scattered)?; // With a specific element stride let buffer = Buffer::new_with_stride(&device, 4096, DataAccess::Scattered, Some(64))?; }
Data Access Patterns
The access pattern describes how shader threads access the buffer. This drives hardware optimizations and determines the bindless descriptor category.
#![allow(unused)] fn main() { pub enum DataAccess { Scattered, // default — any thread, any address, read/write Broadcast, // all threads read the same address } }
| Pattern | Shader Mapping | Use When |
|---|---|---|
Scattered | StructuredBuffer<T>, RWStructuredBuffer<T> | General storage: particles, meshes, compute I/O |
Broadcast | ConstantBuffer / uniform buffer | Uniform data: transforms, time, settings |
For read-only input buffers that don't need write access, create with DataAccess::Scattered and access through goldy_buf_ro<T> in the shader. This enables hardware read-cache optimizations without requiring a separate access pattern.
BufferFlags
#![allow(unused)] fn main() { bitflags! { pub struct BufferFlags: u32 { const COPY_SRC = 1 << 0; const COPY_DST = 1 << 1; const CPU_READABLE = 1 << 2; } } }
| Flag | Purpose |
|---|---|
COPY_SRC | Buffer can be a copy source |
COPY_DST | Buffer can be a copy destination |
CPU_READABLE | Optimize for readback. On Vulkan/Metal, read_to_cpu is a direct memcpy from host-visible memory. On DX12, it performs a GPU copy into a READBACK heap and waits. |
Query DeviceCapabilities::has_zero_copy_storage_readback to detect whether readback is zero-copy on the current backend.
Writing Data
Raw bytes
#![allow(unused)] fn main() { buffer.write(offset, &bytes)?; }
Typed data
#![allow(unused)] fn main() { buffer.write_data(offset, &[1.0f32, 2.0, 3.0])?; }
Both methods write at a byte offset from the start of the buffer.
Reading Data
Read buffer contents back to the CPU. The buffer should have been created with BufferFlags::CPU_READABLE for optimal performance.
#![allow(unused)] fn main() { let mut output = vec![0u8; buffer.size() as usize]; buffer.read_to_cpu(&device, &mut output)?; }
Clearing
Zero-fill a region of the buffer:
#![allow(unused)] fn main() { buffer.clear(&device, offset, size)?; }
Bindless Descriptors
Every buffer with Scattered or Broadcast access is registered in the global bindless descriptor set. Retrieve the index to pass to shaders:
#![allow(unused)] fn main() { // Typed handle (preferred) — carries BindlessCategory for validation let handle = buffer.bindless_handle().unwrap(); // Raw index let index = buffer.bindless_index().unwrap(); // Read-only SRV index (separate from UAV on DX12; same on Vulkan/Metal) let srv_handle = buffer.bindless_srv_handle().unwrap(); }
BufferView
A BufferView is a sub-region of an existing Buffer with its own bindless descriptor. The shader sees the sub-region as a zero-based buffer.
Creating Views
#![allow(unused)] fn main() { // Raw byte view — offset, size, optional element stride let view = buffer.create_view(1024, 512, Some(16))?; // Typed view — first element index, element count let view = buffer.create_typed_view::<[f32; 4]>(0, 256)?; }
Using Views
Views implement BufferSource, so they work anywhere a Buffer does — set_vertex_buffer, set_index_buffer, write_data, read_to_cpu, clear, and bindless binding:
#![allow(unused)] fn main() { let view_handle = view.bindless_handle().unwrap(); pass.set_vertex_buffer(0, &view); }
Lifetime
Dropping a BufferView unregisters its descriptor but does not free the parent buffer's memory. Multiple views of the same buffer can exist simultaneously.
StructuredBufferElement
The StructuredBufferElement trait marks types safe for Buffer::with_data and BufferPool::alloc_with_data. It is implemented for common multi-byte primitives (u16, u32, f32, f64, etc.), fixed-size arrays of those types, and #[repr(C)] structs via #[derive(goldy_derive::StructuredBufferElement)].
Not implemented for u8/i8 — passing &[u8] would set stride to 1, which almost never matches the shader's expected struct stride. Use Buffer::with_bytes_stride for raw bytes.
Matrix Convention
Goldy uses column-major matrix layout in uniform/constant buffers across all backends. Rust math libraries (glam, nalgebra, ultraviolet) already store matrices column-major, so upload directly without transposing:
#![allow(unused)] fn main() { let uniforms = MyUniforms { projection: proj.to_cols_array_2d(), modelview: view.to_cols_array_2d(), }; buffer.write_data(0, &[uniforms])?; }
Goldy sets SLANG_MATRIX_LAYOUT_COLUMN_MAJOR at the Slang session level, so DX12, Vulkan, and Metal all interpret float4x4 the same way.
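Column-major means consecutive memory holds one column at a time, so for a translation matrix the translation vector occupies the last column, i.e. the final four floats of the upload. A minimal sketch of that layout:

```rust
// Column-major 4x4 translation matrix: each inner array is one column,
// matching the [[f32; 4]; 4] shape produced by glam's to_cols_array_2d().
fn translation_cols(tx: f32, ty: f32, tz: f32) -> [[f32; 4]; 4] {
    [
        [1.0, 0.0, 0.0, 0.0], // column 0
        [0.0, 1.0, 0.0, 0.0], // column 1
        [0.0, 0.0, 1.0, 0.0], // column 2
        [tx, ty, tz, 1.0],    // column 3 holds the translation
    ]
}

fn main() {
    let m = translation_cols(5.0, 6.0, 7.0);
    assert_eq!(m[3], [5.0, 6.0, 7.0, 1.0]);
}
```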
Textures and Samplers
Texture holds image data on the GPU. Sampler controls how that data is filtered and addressed when read in shaders. Together, they provide the standard texture sampling pipeline.
Creating a Texture
#![allow(unused)] fn main() { use goldy::{Texture, SpatialAccess, TextureFormat, TextureFlags}; let texture = Texture::new( &device, 512, 512, TextureFormat::Rgba8Unorm, SpatialAccess::Interpolated, TextureFlags::COPY_DST, )?; }
With Initial Data
Data must be raw bytes matching width × height × bytes_per_pixel:
#![allow(unused)] fn main() { let pixels: Vec<u8> = load_image_rgba("sprite.png"); let texture = Texture::with_data( &device, &pixels, 256, 256, TextureFormat::Rgba8Unorm, SpatialAccess::Interpolated, TextureFlags::COPY_DST, )?; }
Spatial Access Patterns
The access pattern determines how the texture is bound and accessed in shaders:
| Access | Shader Mapping | Use When |
|---|---|---|
Interpolated | Texture2D with sampler | Image data filtered between texels — sprites, materials, UI |
Direct | RWTexture2D | Storage images, compute output, exact pixel reads/writes |
Texture Formats
| Format | BPP | Notes |
|---|---|---|
R8Unorm | 1 | Single-channel (masks, SDFs) |
Rg8Unorm | 2 | Two-channel (normal maps, motion vectors) |
Rgba8Unorm | 4 | Standard 8-bit RGBA |
Rgba8UnormSrgb | 4 | sRGB color space |
Bgra8UnormSrgb | 4 | sRGB, swapped channels (common swapchain format) |
Bgra8Unorm | 4 | Linear, swapped channels |
Rgba16Float | 8 | HDR |
Rgba32Float | 16 | Full precision |
TextureFlags
#![allow(unused)] fn main() { bitflags! { pub struct TextureFlags: u32 { const COPY_SRC = 1 << 0; const COPY_DST = 1 << 1; const RENDER_TARGET = 1 << 2; } } }
| Flag | Purpose |
|---|---|
COPY_SRC | Texture can be a copy source (needed for read_to_cpu) |
COPY_DST | Texture can be a copy destination (needed for write / write_region) |
RENDER_TARGET | Texture can be used as a color attachment |
Writing Data
Prefer TaskGraph::write_texture() for batched, non-blocking uploads. The synchronous methods below stall the GPU:
#![allow(unused)] fn main() { #[allow(deprecated)] texture.write(&pixels)?; #[allow(deprecated)] texture.write_region(x, y, width, height, ®ion_pixels)?; }
Reading Data
Read texture contents back to CPU memory. The texture must have been created with TextureFlags::COPY_SRC:
#![allow(unused)] fn main() { let mut output = vec![0u8; texture.byte_size()]; texture.read_to_cpu(&mut output)?; }
Texture Queries
#![allow(unused)] fn main() { texture.width(); texture.height(); texture.format(); texture.byte_size(); // width * height * bytes_per_pixel texture.access(); // SpatialAccess texture.flags(); // TextureFlags texture.is_owned(); // true if dropping destroys the GPU resource }
Bindless Descriptors
Textures are registered in the global bindless descriptor set. The category depends on the access pattern: Interpolated maps to BindlessCategory::Texture, Direct maps to BindlessCategory::StorageImage.
#![allow(unused)] fn main() { // Typed handle (preferred) let handle = texture.bindless_handle().unwrap(); // Raw index let index = texture.bindless_index().unwrap(); }
Texture Borrowing
Texture::borrow() creates a non-owning view that shares the GPU resource. Dropping a borrowed texture does not destroy the underlying resource. Use this when handing a texture reference into a system that may drop it before the owner is done.
#![allow(unused)] fn main() { let borrowed = texture.borrow(); assert!(!borrowed.is_owned()); // dropping `borrowed` does not free GPU memory }
Depth Textures
Depth textures are created through SurfaceConfig or RenderTarget, not directly via Texture::new. Available depth formats:
| Format | Bits | Stencil |
|---|---|---|
Depth16Unorm | 16 | No |
Depth24Plus | 24 | No |
Depth24PlusStencil8 | 24 + 8 | Yes |
Depth32Float | 32 | No |
Depth32FloatStencil8 | 32 + 8 | Yes |
#![allow(unused)] fn main() { let surface = Surface::new_with_config(&device, &window, SurfaceConfig { depth_format: Some(DepthFormat::Depth32Float), ..Default::default() })?; }
Texture as Render Target
A texture created with TextureFlags::RENDER_TARGET can be used as a color attachment for offscreen rendering.
#![allow(unused)] fn main() { let offscreen = Texture::new( &device, 1920, 1080, TextureFormat::Rgba16Float, SpatialAccess::Interpolated, TextureFlags::RENDER_TARGET | TextureFlags::COPY_SRC, )?; }
Samplers
A Sampler defines how texture coordinates are interpreted — filtering between texels and handling coordinates outside [0, 1].
Creating a Sampler
#![allow(unused)] fn main() { use goldy::{Sampler, SamplerDesc, FilterMode, AddressMode}; let sampler = Sampler::new(&device, &SamplerDesc { mag_filter: FilterMode::Linear, min_filter: FilterMode::Linear, mipmap_filter: FilterMode::Linear, address_mode_u: AddressMode::Repeat, address_mode_v: AddressMode::Repeat, ..Default::default() })?; }
Convenience Constructors
#![allow(unused)] fn main() { let nearest = Sampler::nearest(&device)?; // nearest filter, clamp to edge let linear = Sampler::linear(&device)?; // linear filter, clamp to edge let tiling = Sampler::linear_repeat(&device)?; // linear filter, repeat addressing let default = Sampler::default_sampler(&device)?; // nearest filter, clamp to edge }
SamplerDesc
#![allow(unused)] fn main() { pub struct SamplerDesc { pub address_mode_u: AddressMode, // default: ClampToEdge pub address_mode_v: AddressMode, // default: ClampToEdge pub address_mode_w: AddressMode, // default: ClampToEdge pub mag_filter: FilterMode, // default: Nearest pub min_filter: FilterMode, // default: Nearest pub mipmap_filter: FilterMode, // default: Nearest pub max_anisotropy: f32, // default: 1.0 (disabled) pub compare: Option<CompareFunction>, // default: None pub lod_min_clamp: f32, // default: 0.0 pub lod_max_clamp: f32, // default: 32.0 } }
Filter Modes
| Mode | Effect |
|---|---|
Nearest | Pixelated — nearest texel, no interpolation |
Linear | Smooth — bilinear interpolation between neighbors |
Address Modes
| Mode | Effect for UVs outside [0, 1] |
|---|---|
ClampToEdge | Stretches the border texel |
Repeat | Tiles the texture |
MirrorRepeat | Tiles with alternating mirror flips |
Depth Comparison Samplers
For shadow mapping and depth-based effects, set the compare field:
#![allow(unused)] fn main() { let shadow_sampler = Sampler::new(&device, &SamplerDesc { compare: Some(CompareFunction::LessEqual), mag_filter: FilterMode::Linear, min_filter: FilterMode::Linear, ..Default::default() })?; }
Bindless Descriptors
Samplers are registered under BindlessCategory::Sampler:
#![allow(unused)] fn main() { let handle = sampler.bindless_handle().unwrap(); let index = sampler.bindless_index().unwrap(); }
Binding Textures and Samplers in Shaders
Pass texture and sampler indices together through resource bindings:
#![allow(unused)] fn main() { let tex = texture.bindless_handle().unwrap(); let samp = sampler.bindless_handle().unwrap(); pass.bind_resources_typed(&[tex, samp]); }
In Slang:
import goldy_exp;
[goldy_fragment]
float4 fs_main(Interpolated<float4> tex, Filter smp, float2 uv : TEXCOORD) {
return tex.Sample(smp, uv);
}
Pooling and Sub-Allocation
GPU resource allocation is expensive. Creating many small buffers or textures each frame produces allocation overhead, descriptor churn, and VRAM fragmentation. Goldy provides three pooling types to amortize these costs.
BufferPool
BufferPool sub-allocates typed regions from a single large DataAccess::Scattered backing buffer. Each region gets its own bindless descriptor, so shaders see independent zero-based buffers.
Creating a Pool
#![allow(unused)] fn main() { use goldy::BufferPool; let mut pool = BufferPool::new(&device, 1024 * 1024)?; // 1 MB pool }
The backing buffer uses DataAccess::Scattered and a default sub-allocation alignment of 256 bytes (satisfies minStorageBufferOffsetAlignment on all known Vulkan/DX12 hardware).
For custom alignment:
#![allow(unused)] fn main() { let mut pool = BufferPool::with_alignment(&device, total_size, 512)?; }
Allocating Regions
Typed allocation — stride is inferred from T:
#![allow(unused)] fn main() { let tiles: BufferView = pool.alloc::<[u32; 2]>(1024)?; // 1024 elements let segments: BufferView = pool.alloc::<[f32; 6]>(4096)?; // 4096 elements }
Allocate and fill in one call:
#![allow(unused)] fn main() { let data = vec![[1.0f32, 0.0, 0.0]; 100]; let view: BufferView = pool.alloc_with_data(&data)?; }
Raw byte allocation with explicit stride:
#![allow(unused)] fn main() { let view = pool.alloc_bytes(4096, Some(16))?; }
Each allocation is aligned to satisfy both the pool alignment (256) and offset % element_stride == 0 (required by DX12 StructuredBuffer views).
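The dual alignment rule can be sketched as: round the bump pointer up to the pool alignment, then keep stepping until the offset is also a stride multiple. This is an illustrative model; Goldy's allocator may compute the offset differently:

```rust
// Sketch of dual alignment: the returned offset satisfies both
// offset % pool_align == 0 and offset % stride == 0.
fn align_offset(cursor: usize, pool_align: usize, stride: usize) -> usize {
    // round up to the pool alignment first
    let mut offset = (cursor + pool_align - 1) / pool_align * pool_align;
    if stride > 0 {
        while offset % stride != 0 {
            offset += pool_align; // stepping by pool_align preserves the first rule
        }
    }
    offset
}

fn main() {
    // cursor 100, stride 24: 256 is not a multiple of 24, but 768 is
    let offset = align_offset(100, 256, 24);
    assert_eq!(offset % 256, 0);
    assert_eq!(offset % 24, 0);
    assert_eq!(offset, 768);
}
```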
Using Allocated Views
Every BufferView from a pool has its own bindless descriptor. Bind it like any buffer:
#![allow(unused)] fn main() { let tile_handle = tiles.bindless_handle().unwrap(); pass.bind_resources_typed(&[tile_handle]); // Or as a vertex/index buffer pass.set_vertex_buffer(0, &tiles); }
Write data into a view:
#![allow(unused)] fn main() { view.write_data(&new_data)?; }
Sizing a Pool
Use BufferPool::padded_size to compute the exact byte capacity needed for a known set of allocations, including alignment padding:
#![allow(unused)] fn main() { let size = BufferPool::padded_size(&[ (1024, std::mem::size_of::<[u32; 2]>()), // tiles (4096, std::mem::size_of::<[f32; 6]>()), // segments (512, std::mem::size_of::<u32>()), // indices ]); let mut pool = BufferPool::new(&device, size)?; }
Resetting
reset() moves the bump pointer back to zero without invalidating existing views. Use for frame-to-frame reuse when previous views are no longer in flight.
#![allow(unused)] fn main() { pool.reset(); }
Pool Queries
#![allow(unused)] fn main() { pool.used(); // bytes currently allocated pool.capacity(); // total pool size pool.remaining(); // bytes available pool.backing_buffer(); // reference to the underlying Buffer }
BufferPoolRing
BufferPoolRing is a fixed-size ring of BufferPools for double- (or N-) buffered rendering. Each frame advances to the next slot, and the pool that was active N frames ago is safe to reset because its GPU work has completed.
Usage
#![allow(unused)] fn main() { use goldy::BufferPoolRing; let mut ring = BufferPoolRing::<2>::new(); // double-buffered // Each frame: ring.advance(); ring.prepare(&device, needed_bytes)?; if ring.take_clear_flag() { // New backing buffer was allocated — zero-fill it let pool = ring.current_mut().unwrap(); pool.backing_buffer().clear(&device, 0, pool.capacity())?; } let pool = ring.current_mut().unwrap(); let view = pool.alloc::<[f32; 4]>(256)?; }
How It Works
- advance() — rotates to the next pool slot (call once at frame start)
- prepare(device, size) — ensures the current slot has at least size bytes. Resets the pool if it is large enough, or allocates a new one if not. Sets a clear flag when a new allocation occurs.
- take_clear_flag() — returns true exactly once after prepare allocates a new backing buffer. Issue a clear_buffer for the backing when this fires.
- current_mut() / current() — access the current frame's pool
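The rotation itself is simple modular arithmetic: after N calls to advance(), you land back on a slot whose GPU work is N frames old and therefore assumed complete. A minimal model of the rotation (illustrative, not BufferPoolRing's code):

```rust
// Minimal N-slot ring rotation: the slot reached after advance() was last
// active N frames ago, so it is safe to reset.
struct Ring<const N: usize> {
    current: usize,
}

impl<const N: usize> Ring<N> {
    fn new() -> Self {
        Self { current: 0 }
    }
    fn advance(&mut self) -> usize {
        self.current = (self.current + 1) % N;
        self.current
    }
}

fn main() {
    let mut ring = Ring::<2>::new(); // double-buffered
    assert_eq!(ring.advance(), 1);
    assert_eq!(ring.advance(), 0); // back to slot 0 after 2 frames
    assert_eq!(ring.advance(), 1);
}
```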
Bounded Prepare
prepare_bounded adds an optional upper bound. If the current pool exceeds max_size, it is reallocated at size, enabling hysteresis-based shrinking:
```rust
ring.prepare_bounded(&device, needed_size, Some(max_size))?;
```
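The bounded variant adds a single shrink branch on top of prepare. A hypothetical sketch of the sizing decision:

```rust
// Hypothetical sketch of prepare_bounded's sizing decision — not Goldy's code.
// Returns the new capacity to allocate, or None to keep the pool and reset it.
fn bounded_capacity(current: usize, needed: usize, max: Option<usize>) -> Option<usize> {
    if current < needed {
        return Some(needed); // grow: pool too small for this frame
    }
    if let Some(max) = max {
        if current > max {
            return Some(needed); // shrink: pool ballooned past the bound
        }
    }
    None // within [needed, max]: reuse via reset()
}

fn main() {
    assert_eq!(bounded_capacity(100, 200, None), Some(200));      // grow
    assert_eq!(bounded_capacity(1000, 200, Some(500)), Some(200)); // shrink
    assert_eq!(bounded_capacity(300, 200, Some(500)), None);       // reuse
}
```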
Cleanup
```rust
ring.clear(); // drop all pools and reset state
```
TexturePool
TexturePool caches released textures for reuse, avoiding repeated GPU allocation and deallocation. This is particularly valuable on DX12 where texture allocation involves descriptor heap management.
Creating a Pool
```rust
use goldy::{TexturePool, TexturePoolConfig};

let mut pool = TexturePool::new(TexturePoolConfig {
    max_per_key: 4, // keep up to 4 textures per (width, height, format, access, flags) key
});

// Or use defaults (max_per_key = 8)
let mut pool = TexturePool::default();
```
Acquire and Release
```rust
use goldy::{SpatialAccess, TextureFormat, TextureFlags};

// Acquire — returns a pooled texture if available, otherwise creates a new one
let texture = pool.acquire(
    &device,
    1920,
    1080,
    TextureFormat::Rgba16Float,
    SpatialAccess::Direct,
    TextureFlags::COPY_SRC | TextureFlags::COPY_DST,
)?;

// ... use the texture for this frame's work ...

// Release — return to pool after GPU work completes
pool.release(texture);
```
Borrowed textures (texture.borrow()) are silently dropped on release and not pooled.
Pool Key
Textures are keyed by (width, height, format, access, flags). Acquiring a texture only matches exact keys — a 128×128 texture will not be returned for a 256×256 request.
Eviction
When a key already holds max_per_key entries, additional releases are dropped (destroyed) immediately.
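Exact-key matching and max_per_key eviction together amount to a map of bounded free lists. A hypothetical sketch, with a bare u32 standing in for a texture handle (not Goldy's internals):

```rust
use std::collections::HashMap;

// Hypothetical sketch of pool keying and eviction — u32 stands in for
// a GPU texture handle. The real key also includes access and flags.
type Key = (u32, u32, &'static str); // (width, height, format)

struct Pool {
    max_per_key: usize,
    free: HashMap<Key, Vec<u32>>,
}

impl Pool {
    /// Exact-key match only: a 128×128 entry never satisfies a 256×256 request.
    fn acquire(&mut self, key: Key, create: impl FnOnce() -> u32) -> u32 {
        self.free.get_mut(&key).and_then(Vec::pop).unwrap_or_else(create)
    }

    fn release(&mut self, key: Key, tex: u32) {
        let bucket = self.free.entry(key).or_default();
        if bucket.len() < self.max_per_key {
            bucket.push(tex);
        }
        // else: the bucket is full — drop (destroy) the texture immediately
    }
}

fn main() {
    let mut pool = Pool { max_per_key: 1, free: HashMap::new() };
    let key = (1920, 1080, "rgba16f");
    pool.release(key, 7);
    pool.release(key, 8); // evicted: bucket already holds max_per_key entries
    assert_eq!(pool.acquire(key, || 99), 7);  // pooled hit
    assert_eq!(pool.acquire(key, || 99), 99); // miss → create
}
```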
Stats and Cleanup
```rust
let stats = pool.stats();
println!("{} textures pooled, ~{} bytes", stats.entries, stats.estimated_bytes);

pool.clear(); // drop all pooled textures, free GPU memory
```
When to Use Pooling
| Scenario | Recommendation |
|---|---|
| Many small storage buffers with similar lifetime | BufferPool — one allocation, many views |
| Per-frame uniform/storage data that changes every frame | BufferPoolRing — ring-buffered pools, safe reset each frame |
| Transient render targets or compute textures | TexturePool — acquire/release cycle avoids allocation churn |
| Long-lived buffers (mesh data, static textures) | Individual Buffer / Texture — pooling adds no benefit |
| Uniform buffer updated once at startup | Individual Buffer — no per-frame reuse needed |
Sub-Allocation Patterns
Static Geometry Pool
Pack all static mesh data into one BufferPool at load time:
```rust
let size = BufferPool::padded_size(&[
    (vertex_count, std::mem::size_of::<Vertex>()),
    (index_count, std::mem::size_of::<u32>()),
]);
let mut pool = BufferPool::new(&device, size)?;
let vertices = pool.alloc_with_data(&vertex_data)?;
let indices = pool.alloc_with_data(&index_data)?;
```
Per-Frame Dynamic Data
Use BufferPoolRing for data that changes every frame:
```rust
let mut ring = BufferPoolRing::<2>::new();

// In the render loop:
ring.advance();
ring.prepare(&device, frame_data_size)?;
let pool = ring.current_mut().unwrap();
let uniforms = pool.alloc_with_data(&[camera_data])?;
let instances = pool.alloc_with_data(&instance_transforms)?;
```
Transient Compute Textures
Pool intermediate textures in a multi-pass compute pipeline:
```rust
let mut tex_pool = TexturePool::default();

// Each frame:
let temp = tex_pool.acquire(&device, w, h, fmt, SpatialAccess::Direct, flags)?;
// ... compute pass writes to temp ...
// ... next pass reads from temp ...
tex_pool.release(temp); // return for reuse next frame
```
Backend Architecture
Goldy supports three GPU backends, each implemented natively against the platform graphics API — no translation layers (like MoltenVK) are involved.
| Backend | API Level | Platforms | Rust Crate |
|---|---|---|---|
| Vulkan | 1.4+ | Windows, Linux | ash |
| DX12 | Direct3D 12 | Windows | windows + gpu-allocator |
| Metal | Tier 2+ | macOS, iOS | metal |
Native Implementations
Each backend maps Goldy concepts directly to the most natural primitives of its target API:
┌─────────────────────────────────────────────────────────────┐
│ Goldy Core API │
│ │
│ Device, Buffer, Texture, Pipeline, CommandEncoder, ... │
└─────────────────────────────────────────────────────────────┘
│ │ │
▼ ▼ ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ Vulkan 1.4+ │ │ Metal 2+ │ │ DX12 │
│ │ │ │ │ │
│ • ash crate │ │ • metal-rs │ │ • windows-rs │
│ • Dynamic │ │ • Argument │ │ • Root │
│ rendering │ │ buffers │ │ signatures │
│ • Descriptor │ │ • Native │ │ • Descriptor │
│ indexing │ │ hazard │ │ heaps │
│ • Buffer │ │ tracking │ │ │
│ device addr │ │ │ │ │
└───────────────┘ └───────────────┘ └───────────────┘
Translation layers introduce overhead from API mismatches, incompatible synchronization models, and extra validation. Native backends can leverage each API's strengths directly — for example, Metal's built-in hazard tracking, or Vulkan's descriptor indexing for bindless rendering.
Backend Selection
Default Selection
Goldy selects the platform-preferred backend automatically:
| Platform | Default Backend |
|---|---|
| macOS | Metal |
| Windows | DX12 |
| Linux | Vulkan |
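Conceptually the default is a cfg cascade; a sketch (hypothetical — not Goldy's exact code):

```rust
// Hypothetical sketch of the platform-default cascade.
fn default_backend() -> &'static str {
    if cfg!(target_os = "macos") {
        "metal"
    } else if cfg!(target_os = "windows") {
        "dx12"
    } else {
        "vulkan" // Linux and everything else
    }
}

fn main() {
    println!("default backend: {}", default_backend());
}
```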
Runtime Override — GOLDY_BACKEND
Override the backend at runtime with the GOLDY_BACKEND environment variable:
GOLDY_BACKEND=vulkan cargo run --example triangle
GOLDY_BACKEND=dx12 cargo run --example triangle
Accepted values (case-insensitive):
| Value | Backend |
|---|---|
vulkan, vk | Vulkan |
dx12, d3d12, directx | DX12 |
metal, mtl | Metal |
An unrecognized value produces a clear error listing the valid options.
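A sketch of how such alias parsing typically looks (hypothetical — the accepted values and error behavior above are the authoritative contract):

```rust
// Hypothetical sketch of case-insensitive GOLDY_BACKEND alias parsing.
fn parse_backend(value: &str) -> Result<&'static str, String> {
    match value.trim().to_ascii_lowercase().as_str() {
        "vulkan" | "vk" => Ok("Vulkan"),
        "dx12" | "d3d12" | "directx" => Ok("DX12"),
        "metal" | "mtl" => Ok("Metal"),
        other => Err(format!(
            "unknown GOLDY_BACKEND '{other}' — expected one of: \
             vulkan/vk, dx12/d3d12/directx, metal/mtl"
        )),
    }
}

fn main() {
    assert_eq!(parse_backend("VK"), Ok("Vulkan"));
    assert!(parse_backend("opengl").is_err());
}
```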
Programmatic Selection
Query the active backend at runtime:
```rust
let instance = Instance::new()?;
println!("Backend: {:?}", instance.backend_type());
// Prints: Backend: Dx12   (on Windows)
// Prints: Backend: Vulkan (on Linux)
// Prints: Backend: Metal  (on macOS)
```
Compile-Time Selection (Feature Flags)
You can also restrict which backends are compiled in via Cargo features. This excludes both the code and the dependencies of unselected backends:
cargo build --no-default-features --features vulkan
See Conditional Compilation for details on feature flags, dependency exclusion, and CI setup.
Adapter Enumeration
After creating an Instance, enumerate available GPU adapters to inspect
what hardware is present:
```rust
let instance = Instance::new()?;
let adapters = instance.enumerate_adapters();
for adapter in &adapters {
    println!("{}: {} ({})", adapter.id(), adapter.name(), adapter.vendor());
    println!("  Type: {:?}", adapter.device_type());
}
```
DeviceType
Each adapter reports a DeviceType:
| Variant | Meaning |
|---|---|
DiscreteGpu | Dedicated graphics card with its own VRAM |
IntegratedGpu | GPU integrated into the CPU (shared memory) |
Cpu | Software renderer (e.g. WARP on DX12, lavapipe on Vulkan) |
Other | Unknown or unrecognized device class |
Creating a Device
Request a device with a preferred DeviceType. If no adapter matches,
Goldy falls back to the first available adapter:
```rust
let device = instance.create_device(DeviceType::DiscreteGpu)?;

// Or target a specific adapter by ID:
let device = instance.create_device_for_adapter(adapter.id())?;
```
Backend Capabilities
Device Capabilities
Query format preferences and backend-specific capabilities after creating a device:
```rust
let caps = device.capabilities();
println!("Surface format: {:?}", caps.preferred_surface_format);
println!("Render target fmt: {:?}", caps.preferred_render_target_format);
println!("Zero-copy readback: {}", caps.has_zero_copy_storage_readback);
```
| Capability | Vulkan | DX12 | Metal |
|---|---|---|---|
| Zero-copy CPU storage readback | Yes | No (requires GPU copy to readback heap) | Yes |
| Preferred surface format | Bgra8UnormSrgb | Bgra8UnormSrgb | Bgra8UnormSrgb |
Vulkan Backend
The Vulkan backend requires Vulkan 1.4+ and uses:
- Dynamic rendering (`VK_KHR_dynamic_rendering`) — no `VkRenderPass` or `VkFramebuffer` objects
- Descriptor indexing — bindless resource access by index in shaders
- Buffer device address — 64-bit GPU pointers for direct memory access in shaders
DX12 Backend
The DX12 backend uses the windows crate and provides:
- Root signatures for resource binding
- Descriptor heaps for efficient bindless resource management
- Shader compilation via Slang to DXIL
- WARP software rasterizer for headless/CI use (`GOLDY_DX12_FORCE_WARP=1`)
- GPU-Based Validation for deep debugging (`GOLDY_DX12_GBV=1`)
Metal Backend
The Metal backend uses the metal crate (native Metal, not MoltenVK):
- Argument buffers for bindless resource binding
- Native hazard tracking — Metal tracks resource hazards automatically
- Shader compilation via Slang to Metal Shading Language
The GpuBackend Trait
All backends implement the GpuBackend trait, which defines the full
interface for device management, resource creation, shader compilation,
pipeline management, rendering, and compute dispatch:
```rust
pub trait GpuBackend: Send + Sync {
    fn backend_type(&self) -> BackendType;
    fn enumerate_adapters(&self) -> Vec<AdapterInfo>;
    fn create_device(&mut self, adapter_id: u32) -> Result<DeviceHandle>;
    fn create_buffer(&mut self, device: DeviceHandle, ...) -> Result<BufferHandle>;
    fn create_shader_with_paths(&mut self, device: DeviceHandle, ...) -> Result<ShaderHandle>;
    fn create_pipeline(&mut self, device: DeviceHandle, ...) -> Result<PipelineHandle>;
    // ... rendering, compute, surface, texture, sampler, timeline ...
}
```
Resources are identified by opaque u64 handles (DeviceHandle,
BufferHandle, ShaderHandle, etc.) that each backend maps to native
API objects internally.
Conditional Compilation
Most users should use GOLDY_BACKEND for runtime switching — see
Backend Architecture.
Compile-time feature flags are useful when you need smaller binaries, faster builds, or want to verify that each backend compiles independently in CI.
When to Use Compile-Time Features
Use --no-default-features --features <backend> when you need:
- Smaller binaries — exclude unused backend code
- Faster builds — skip compiling heavy backend dependencies
- Missing SDK — build on a system that lacks the Vulkan SDK or Windows SDK
- CI matrix — verify each backend compiles independently
Feature Flags
Goldy defines one feature per backend plus an instrumentation feature:
[features]
default = ["vulkan", "metal", "dx12", "instrumentation"]
vulkan = ["dep:ash"]
dx12 = ["dep:windows", "dep:gpu-allocator", "dep:windows-core"]
metal = ["dep:metal", "dep:cocoa", "dep:objc", "dep:core-graphics-types",
"dep:foreign-types", "dep:block"]
instrumentation = ["dep:tracing-subscriber"]
Dependency Exclusion
Building with only one backend excludes both the code and the dependencies for the others:
| Feature | Dependencies |
|---|---|
vulkan | ash |
dx12 | windows, gpu-allocator, windows-core |
metal | metal, cocoa, objc, core-graphics-types, foreign-types, block |
# Default build on Windows — compiles Vulkan + DX12 dependencies
cargo build
# Vulkan-only build — downloads only ash
cargo build --no-default-features --features vulkan
# DX12-only build
cargo build --no-default-features --features dx12
This can significantly reduce build times and binary size.
Platform-Specific Considerations
| Backend | Available On | Notes |
|---|---|---|
vulkan | Windows, Linux (any platform with a Vulkan loader) | Broadest platform support |
dx12 | Windows only | Gated by #[cfg(target_os = "windows")] — the feature is ignored on other platforms |
metal | macOS, iOS only | Gated by #[cfg(target_os = "macos")] — the feature is ignored on other platforms |
On macOS, enabling both vulkan and metal is valid — the default
backend will be Metal, but you can switch to Vulkan at runtime via
GOLDY_BACKEND=vulkan if a Vulkan loader (e.g. MoltenVK) is present.
Default Features
The default feature set enables all three backends plus instrumentation:
default = ["vulkan", "metal", "dx12", "instrumentation"]
To override, use --no-default-features and enable only what you need:
# Only Vulkan, no instrumentation
cargo build --no-default-features --features vulkan
# Vulkan + instrumentation
cargo build --no-default-features --features vulkan,instrumentation
# Metal-only on macOS
cargo build --no-default-features --features metal
FFI and Python Feature Passthrough
The goldy-ffi and goldy-py crates propagate features to the core
goldy crate, so you can control backend selection in downstream builds:
# FFI bindings with only Vulkan backend
cargo build -p goldy-ffi --no-default-features --features vulkan
# Python bindings with only DX12 backend
cargo build -p goldy-py --no-default-features --features dx12
This is useful for creating platform-specific binary distributions.
Cross-Compilation
When cross-compiling, keep in mind that platform-gated features are silently ignored if the target platform doesn't match:
# Targeting macOS — dx12 feature is silently ignored, only metal + vulkan
# are active
cargo build --target aarch64-apple-darwin
# Targeting Windows — metal feature is silently ignored
cargo build --target x86_64-pc-windows-msvc --no-default-features --features dx12
For cross-compilation to work, you need the appropriate system SDKs
available. Vulkan is the most portable backend since the ash crate only
needs a Vulkan loader at runtime, not at compile time.
CI Matrix Example
Verify each backend compiles independently in CI:
# GitHub Actions
jobs:
lint:
strategy:
matrix:
include:
- os: ubuntu-latest
features: vulkan
- os: windows-latest
features: vulkan
- os: windows-latest
features: dx12
- os: macos-latest
features: metal
runs-on: ${{ matrix.os }}
steps:
- uses: actions/checkout@v4
- run: cargo clippy --no-default-features --features ${{ matrix.features }} -- -D warnings
Checking the Active Backend
At runtime, query which backend was selected:
```rust
let instance = Instance::new()?;
println!("Backend: {:?}", instance.backend_type());
```
If no backend feature is enabled for the current platform, Instance::new()
returns an error:
No GPU backend available - enable 'vulkan', 'dx12', or 'metal' feature
Debugging and Observability
Goldy provides validation layers, structured instrumentation, and environment variable controls that together cover the full debugging workflow — from catching API misuse to profiling frame timing.
Validation
GOLDY_VALIDATION Environment Variable
The primary control for runtime validation. Accepts a comma-, semicolon-, or whitespace-separated list of categories:
| Value | Effect |
|---|---|
api | Enable backend GPU API validation (see below) |
layout | Enable Rust ↔ Slang struct layout checks and buffer stride checks |
all | Enable both api and layout |
1, true, yes | GPU API validation only (legacy shorthand; does not enable layout checks) |
Categories can be combined:
# API validation only
GOLDY_VALIDATION=api cargo run --example triangle
# Layout validation only
GOLDY_VALIDATION=layout cargo run --example triangle
# Both
GOLDY_VALIDATION=all cargo run --example triangle
GOLDY_VALIDATION=layout,api cargo run --example triangle
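A sketch of a parser honoring the separators and legacy shorthand described above (hypothetical — not Goldy's actual implementation):

```rust
// Hypothetical sketch of GOLDY_VALIDATION parsing: a comma-, semicolon-,
// or whitespace-separated category list, plus the legacy 1/true/yes shorthand.
#[derive(Default, Debug, PartialEq)]
struct Validation {
    api: bool,
    layout: bool,
}

fn parse_validation(value: &str) -> Validation {
    let mut v = Validation::default();
    for tok in value.split(|c: char| c == ',' || c == ';' || c.is_whitespace()) {
        match tok.to_ascii_lowercase().as_str() {
            "api" => v.api = true,
            "layout" => v.layout = true,
            "all" => { v.api = true; v.layout = true; }
            // Legacy shorthand: API validation only, no layout checks.
            "1" | "true" | "yes" => v.api = true,
            _ => {} // ignore empty tokens from consecutive separators
        }
    }
    v
}

fn main() {
    assert_eq!(parse_validation("layout,api"), Validation { api: true, layout: true });
    assert_eq!(parse_validation("1"), Validation { api: true, layout: false });
}
```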
API Validation
When GOLDY_VALIDATION includes api (or 1/true/yes), Goldy
enables backend-specific validation:
| Backend | What Gets Enabled |
|---|---|
| Vulkan | VK_LAYER_KHRONOS_validation + VK_EXT_debug_utils at instance creation |
| Metal | Sets MTL_SHADER_VALIDATION=1 (if not already set) before the first device is created |
| DX12 | See DX12 Debug Layer below |
For Vulkan, validation is also enabled when VK_INSTANCE_LAYERS contains
VK_LAYER_KHRONOS_validation (the standard loader-driven workflow).
Layout Validation
Layout validation catches mismatches between Rust struct layouts and their Slang shader counterparts at shader compile time, and buffer element-stride mismatches at dispatch time.
Enable via either:
GOLDY_VALIDATION=layout cargo run
GOLDY_VALIDATE_LAYOUTS=1 cargo run # legacy variable, equivalent
#[derive(LayoutCheckable)]
Annotate Rust structs that mirror Slang types to opt into automatic validation:
```rust
#[derive(LayoutCheckable)]
#[repr(C)]
struct SceneUniforms {
    projection: [[f32; 4]; 4],
    view: [[f32; 4]; 4],
    time: f32,
}
```
The derive macro generates a LAYOUT_CHECK constant containing the
struct's name, total size, and per-field offsets. Pass it when creating a
shader module:
```rust
let shader = ShaderModule::from_slang_with_options(
    &device,
    source,
    &[], // extra search paths
    &[], // defines
    Default::default(),
    &[SceneUniforms::LAYOUT_CHECK],
)?;
```
When layout validation is enabled, Goldy compiles the Slang shader, reflects each named struct, and compares:
- Total struct size — Rust `size_of` vs. Slang reflection
- Field offsets — each named field's byte offset
A mismatch produces an error naming the struct, the field, and the expected vs. actual offset — immediately surfacing padding or alignment bugs. When validation is disabled, the checks are skipped at zero cost.
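The comparison itself is ordinary layout arithmetic. A self-contained sketch using `std::mem::offset_of!` (stable since Rust 1.77), with hand-written expectations standing in for the data Slang reflection would supply:

```rust
use std::mem::{offset_of, size_of};

// Hand-written stand-in for what Slang reflection would report.
// (In Goldy the derive macro and the Slang compiler supply both sides.)
const EXPECTED: &[(&str, usize)] = &[("projection", 0), ("view", 64), ("time", 128)];
const EXPECTED_SIZE: usize = 132;

#[repr(C)]
struct SceneUniforms {
    projection: [[f32; 4]; 4], // 64 bytes, offset 0
    view: [[f32; 4]; 4],       // 64 bytes, offset 64
    time: f32,                 // 4 bytes, offset 128
}

fn check_layout() -> Result<(), String> {
    let actual = [
        ("projection", offset_of!(SceneUniforms, projection)),
        ("view", offset_of!(SceneUniforms, view)),
        ("time", offset_of!(SceneUniforms, time)),
    ];
    // Compare each named field's byte offset against the expectation.
    for ((name, want), (_, got)) in EXPECTED.iter().zip(actual) {
        if *want != got {
            return Err(format!("SceneUniforms.{name}: expected offset {want}, found {got}"));
        }
    }
    // Compare the total struct size.
    if size_of::<SceneUniforms>() != EXPECTED_SIZE {
        return Err(format!(
            "SceneUniforms: expected size {EXPECTED_SIZE}, found {}",
            size_of::<SceneUniforms>()
        ));
    }
    Ok(())
}

fn main() {
    assert!(check_layout().is_ok());
}
```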
Buffer Stride Validation
At dispatch time (when layout validation is enabled), Goldy also checks
that each bound buffer's element_stride matches the stride the shader
expects from Slang reflection. A mismatch produces an error like:
buffer element-stride mismatch in shader `my_shader`:
slot 0: shader expects element stride 16 but buffer has 4
DX12-Specific Debugging
DX12 Debug Layer
| Variable | Values | Effect |
|---|---|---|
GOLDY_DX12_DEBUG | 1 | Force-enable the D3D12 debug layer (even in release builds) |
GOLDY_DX12_NO_DEBUG | 1 | Disable the D3D12 debug layer (useful for parallel tests that crash the debug layer) |
GOLDY_DX12_GBV | 1 | Enable GPU-Based Validation (very slow; requires the debug layer) |
GPU-Based Validation (GBV) instruments shaders on the GPU to detect issues that the CPU-side debug layer cannot catch — such as out-of-bounds descriptor accesses and uninitialized resource reads. Expect a significant performance hit.
WARP Software Rasterizer
WARP is Microsoft's software implementation of D3D12. It runs on the CPU, so it works on headless CI runners with no GPU.
GOLDY_DX12_FORCE_WARP=1 cargo nextest run
After the first WARP device is created, Goldy prints a confirmation:
[WARP] d3d10warp.dll loaded from: C:\WINDOWS\SYSTEM32\d3d10warp.dll
On Windows, DX12 is the default backend, so GOLDY_DX12_FORCE_WARP=1 is
the only variable you need to run tests on a machine without a GPU.
Structured Instrumentation
Goldy includes a structured instrumentation system built on the tracing
crate. It provides named observation points with hierarchical
dot-notation names and structured context data.
Enabling Instrumentation
Instrumentation requires the instrumentation Cargo feature (enabled by
default). When disabled, all macros compile to no-ops at zero cost.
# Explicitly enable
cargo build --features instrumentation
# Disable (zero-cost removal)
cargo build --no-default-features --features vulkan
goldy_span! — Timed Sections
Create a span to measure the duration of a code section:
```rust
use goldy::goldy_span;

fn compile_shader(&self) {
    let _span = goldy_span!("slang.compile", target = "metal").entered();
    // ... compilation code ...
    // Duration is recorded automatically when _span is dropped
}
```
goldy_event! — Instant Markers
Emit a one-shot structured event:
```rust
use goldy::goldy_event;

goldy_event!("slang.library.load",
    path = %lib_path.display(),
    success = true
);
```
Built-in Observation Points
Goldy instruments its own internals at these observation points:
| Category | Point Name | Emitted Data |
|---|---|---|
| Slang | slang.library.load | path, success |
| | slang.compile.start | target, entry_points, bindless |
| | slang.compile.end | duration_ms, output_size, success |
| | slang.reflection.extract | parameter_blocks, fields |
| Shader | shader.module.create | backend, shader_type |
| | shader.pipeline.create | pipeline_type, bind_groups |
| Resource | resource.buffer.create | size, usage |
| | resource.texture.create | dimensions, format |
| | resource.bind_group.create | bindings_count |
| Render | render.frame.start | frame_id |
| | render.compute.dispatch | workgroups, pipeline |
| | render.draw | vertices, instances |
| | render.frame.end | frame_id, duration_ms |
JSON Logging
Install a JSON file logger to capture all instrumentation output as structured JSON:
```rust
use goldy::instrumentation::install_json_logger;

install_json_logger("/tmp/goldy-debug.json")?;
// All subsequent goldy_span!/goldy_event! calls are written to the file
```
Filtering with RUST_LOG
Use the standard RUST_LOG environment variable to control verbosity.
All Goldy instrumentation uses the goldy target:
RUST_LOG=goldy=debug cargo run --example triangle
RUST_LOG=goldy::render=trace cargo run --example triangle
Environment Variables Summary
| Variable | Values | Effect |
|---|---|---|
GOLDY_BACKEND | vulkan/vk, dx12/d3d12/directx, metal/mtl | Override backend selection |
GOLDY_VALIDATION | api, layout, all, 1/true/yes | Enable validation categories |
GOLDY_VALIDATE_LAYOUTS | 1, true, yes | Enable layout validation (legacy; prefer GOLDY_VALIDATION=layout) |
GOLDY_DX12_FORCE_WARP | 1 | Use WARP software rasterizer |
GOLDY_DX12_DEBUG | 1 | Force-enable D3D12 debug layer in release |
GOLDY_DX12_NO_DEBUG | 1 | Disable D3D12 debug layer |
GOLDY_DX12_GBV | 1 | Enable GPU-Based Validation |
RUST_LOG | e.g. goldy=debug | Filter instrumentation output |
Common Debugging Patterns
Catch API misuse early
GOLDY_VALIDATION=api cargo run --example my_app
Turn on API validation during development to catch invalid GPU API calls. On Vulkan this enables the Khronos validation layer; on Metal it enables shader validation.
Diagnose struct layout bugs
GOLDY_VALIDATION=layout cargo test
If a LayoutCheckable struct diverges from its Slang counterpart (due to
padding, alignment, or a field being added on only one side), the error
message names the exact struct and field.
Headless CI on Windows
GOLDY_DX12_FORCE_WARP=1 cargo nextest run
WARP gives you a fully functional D3D12 device on machines with no GPU.
Combine with GOLDY_VALIDATION=api for maximum coverage.
Profile frame timing
```rust
use goldy::instrumentation::install_json_logger;

install_json_logger("/tmp/goldy-profile.json")?;
// Run your application, then inspect the JSON output for
// render.frame.start / render.frame.end durations
```
Deep DX12 debugging
GOLDY_DX12_DEBUG=1 GOLDY_DX12_GBV=1 cargo run --example my_app
GPU-Based Validation catches GPU-side issues the CPU debug layer cannot see, at a significant performance cost. Use it when you suspect descriptor or resource access bugs.
Python Bindings
Goldy provides Python bindings via PyO3, offering a Pythonic API for GPU programming with seamless NumPy integration.
Installation
From PyPI
pip install goldy
From Source
git clone https://github.com/koubaa/goldy.git
cd goldy/python
pip install maturin
maturin develop --release
Requirements
- Python 3.9+
- NumPy 1.20+
- A GPU with Vulkan 1.4+, DX12, or Metal Tier 2+ support
Optional Dependencies
pip install goldy[dev] # pytest, pillow
pip install pillow # image output only
Quick Start
import goldy
import numpy as np
from PIL import Image
# Setup
instance = goldy.Instance()
device = instance.create_device(goldy.DeviceType.DISCRETE_GPU)
target = goldy.RenderTarget(device, 800, 600, goldy.TextureFormat.RGBA8_UNORM)
# Render
encoder = goldy.CommandEncoder()
with encoder.begin_render_pass() as rp:
rp.clear(goldy.Color.CORNFLOWER_BLUE)
target.render(encoder)
# Read back as NumPy array and save
pixels = target.read_to_cpu() # shape (600, 800, 4), dtype uint8
Image.fromarray(pixels, mode='RGBA').save('hello_goldy.png')
NumPy Integration
Creating GPU Buffers from Arrays
vertices = np.array([
# x, y, r, g, b, a
0.0, -0.5, 1.0, 0.0, 0.0, 1.0,
0.5, 0.5, 0.0, 1.0, 0.0, 1.0,
-0.5, 0.5, 0.0, 0.0, 1.0, 1.0,
], dtype=np.float32)
buffer = goldy.Buffer(device, vertices, goldy.DataAccess.SCATTERED)
Supported dtypes
| NumPy dtype | Typical use case |
|---|---|
np.float32 | Vertex positions, colors, uniforms |
np.float64 | High-precision data |
np.uint32 | Index buffers, compute data |
np.int32 | Signed integer data |
np.uint16 | 16-bit index buffers |
np.uint8 | Raw byte data |
Reading Results Back to NumPy
Render target readback returns a NumPy array directly:
pixels = target.read_to_cpu()
print(pixels.shape) # (height, width, 4)
print(pixels.dtype) # uint8
Updating Buffers
buffer = goldy.Buffer(device, np.zeros(256, dtype=np.float32), goldy.DataAccess.BROADCAST)
# Full update
buffer.write(0, np.random.rand(256).astype(np.float32))
# Partial update (starting at byte offset 64)
buffer.write(64, np.ones(32, dtype=np.float32))
Performance Tips
- Create once, update often — avoid allocating new `Buffer` objects every frame. Use `buffer.write()` instead.
- Use `np.float32` — match the GPU's expected dtype to avoid an extra conversion.
- Ensure contiguity — sliced arrays may not be contiguous. Call `np.ascontiguousarray()` before uploading if needed.
Compute Shaders
Goldy supports GPU compute from Python using Slang shaders.
Basic Example
import goldy
import numpy as np
instance = goldy.Instance()
device = instance.create_device(goldy.DeviceType.DISCRETE_GPU)
data = np.arange(256, dtype=np.float32)
buffer = goldy.Buffer(device, data, goldy.DataAccess.SCATTERED)
SHADER = """
import goldy_exp;
[goldy_compute]
[numthreads(64, 1, 1)]
void cs_main(Scattered<float> data, ThreadId id) {
data[id.x] = data[id.x] * 2.0;
}
"""
shader = goldy.ShaderModule.from_slang(device, SHADER)
pipeline = goldy.ComputePipeline(device, shader)
encoder = goldy.ComputeEncoder()
with encoder.begin_compute_pass() as cp:
cp.set_pipeline(pipeline)
cp.bind_resources([buffer])
cp.dispatch(4, 1, 1) # 4 workgroups × 64 threads = 256 threads
encoder.dispatch(device)
Ping-Pong Buffers
For iterative algorithms, alternate two buffers as input/output:
buf_a = goldy.Buffer(device, initial_data, goldy.DataAccess.SCATTERED)
buf_b = goldy.Buffer(device, initial_data, goldy.DataAccess.SCATTERED)
use_a = True
for _ in range(100):
encoder = goldy.ComputeEncoder()
with encoder.begin_compute_pass() as cp:
cp.set_pipeline(pipeline)
cp.bind_resources([buf_a, buf_b] if use_a else [buf_b, buf_a])
cp.dispatch(workgroups_x, workgroups_y, 1)
encoder.dispatch(device)
use_a = not use_a
Combining Compute and Graphics
Use compute results directly in a subsequent render pass through shared storage buffers:
# Compute pass
compute_encoder = goldy.ComputeEncoder()
with compute_encoder.begin_compute_pass() as cp:
cp.set_pipeline(compute_pipeline)
cp.bind_resources([buffer])
cp.dispatch(workgroups, 1, 1)
compute_encoder.dispatch(device)
# Render pass — reads the same buffer
render_encoder = goldy.CommandEncoder()
with render_encoder.begin_render_pass() as rp:
rp.set_pipeline(render_pipeline)
rp.bind_resources([buffer])
rp.draw(range(3))
target.render(render_encoder)
Key Differences from Rust
| Aspect | Rust | Python |
|---|---|---|
| Instance creation | Instance::new()? | goldy.Instance() |
| Error handling | Result<T, GoldyError> | Raises goldy.GoldyError |
| Buffer data | Buffer::with_data(&device, &[T], access) | goldy.Buffer(device, numpy_array, access) |
| Render pass | encoder.begin_render_pass() returns struct | Context manager (with ... as rp) |
| Pixel readback | target.read_to_cpu() → Vec<u8> | target.read_to_cpu() → NumPy array (H, W, 4) |
| Resource lifetime | Explicit Arc<Device> ownership | Managed by Python GC via PyO3 |
Backend Selection
Goldy auto-selects the best backend per platform (Metal on macOS, DX12 on Windows, Vulkan on Linux). Override with GOLDY_BACKEND:
import os
os.environ["GOLDY_BACKEND"] = "vulkan" # set before importing goldy
import goldy
instance = goldy.Instance()
API Reference
Core Classes
Instance
instance = goldy.Instance()
instance.backend_type # BackendType (Vulkan, DX12, Metal)
instance.enumerate_adapters() # list of AdapterInfo
instance.create_device(type) # Device
Device
device = instance.create_device(goldy.DeviceType.DISCRETE_GPU)
device.is_valid() # bool
Buffer
buf = goldy.Buffer(device, data, access) # data: numpy array or bytes
buf = goldy.Buffer.empty(device, size, access)
buf.size # int (bytes)
buf.write(offset, data) # update contents
RenderTarget
target = goldy.RenderTarget(device, width, height, format, depth_format=None)
target.width, target.height
target.format
target.has_depth
target.render(encoder)
target.read_to_cpu() # numpy array (H, W, 4)
ShaderModule
shader = goldy.ShaderModule.from_slang(device, slang_source)
RenderPipeline
pipeline = goldy.RenderPipeline(device, vertex_shader, fragment_shader, desc)
RenderPipelineDesc
desc = goldy.RenderPipelineDesc(
vertex_layout=None,
topology=goldy.PrimitiveTopology.TRIANGLE_LIST,
target_format=goldy.TextureFormat.RGBA8_UNORM,
depth_stencil=None,
)
CommandEncoder / RenderPass
encoder = goldy.CommandEncoder()
with encoder.begin_render_pass() as rp:
rp.clear(goldy.Color.BLACK)
rp.set_pipeline(pipeline)
rp.set_vertex_buffer(slot, buffer)
rp.set_index_buffer(buffer, format)
rp.bind_resources([buf1, buf2])
rp.draw(vertices, instances=range(1))
rp.draw_indexed(indices, base_vertex, instances)
Compute Classes
ComputePipeline
pipeline = goldy.ComputePipeline(device, shader)
ComputeEncoder
encoder = goldy.ComputeEncoder()
with encoder.begin_compute_pass() as cp:
cp.set_pipeline(pipeline)
cp.bind_resources([buffer])
cp.dispatch(wg_x, wg_y, wg_z)
encoder.dispatch(device)
Enums
# Device selection
goldy.DeviceType.DISCRETE_GPU | INTEGRATED_GPU | CPU | OTHER
# Texture formats
goldy.TextureFormat.RGBA8_UNORM | RGBA8_UNORM_SRGB | BGRA8_UNORM
| R8_UNORM | RG8_UNORM | RGBA16_FLOAT | RGBA32_FLOAT
# Buffer access patterns
goldy.DataAccess.SCATTERED # any thread, any address (StructuredBuffer)
goldy.DataAccess.BROADCAST # all threads same address (ConstantBuffer)
# Texture access patterns
goldy.SpatialAccess.INTERPOLATED # hardware-filtered (Texture2D + sampler)
goldy.SpatialAccess.DIRECT # direct indexing (RWTexture2D)
# Primitive topology
goldy.PrimitiveTopology.POINT_LIST | LINE_LIST | LINE_STRIP
| TRIANGLE_LIST | TRIANGLE_STRIP
# Index format
goldy.IndexFormat.UINT16 | UINT32
Types
Color
color = goldy.Color(r, g, b, a=1.0) # floats 0-1
color = goldy.Color.from_rgb(255, 128, 0) # bytes 0-255
# Predefined
goldy.Color.BLACK | WHITE | RED | GREEN | BLUE | CORNFLOWER_BLUE
VertexBufferLayout
layout = goldy.VertexBufferLayout.vertex_2d() # pos(2) + color(4)
layout = goldy.VertexBufferLayout.vertex_2d_uv() # pos(2) + uv(2)
layout = goldy.VertexBufferLayout(stride, [
goldy.VertexAttribute(location, format, offset),
])
DepthStencilState
depth = goldy.DepthStencilState(
format=goldy.DepthFormat.DEPTH32_FLOAT,
depth_write_enabled=True,
depth_compare=goldy.CompareFunction.LESS,
)
Exceptions
All errors are raised as goldy.GoldyError:
try:
device = instance.create_device(goldy.DeviceType.DISCRETE_GPU)
except goldy.GoldyError as e:
print(f"GPU error: {e}")
.NET Bindings
Goldy provides first-class C# bindings via P/Invoke interop over the native Rust FFI layer.
Installation
NuGet Package
dotnet add package Goldy
Or add to your .csproj directly:
<PackageReference Include="Goldy" Version="0.1.*" />
The NuGet package bundles native Goldy + Slang libraries for all supported platforms — no separate native installation is needed.
Building from Source
cargo build --package goldy-ffi --release
dotnet add reference path/to/goldy/dotnet/Goldy/Goldy.csproj
Requirements
- .NET 8.0 or later
- Windows x64, Linux x64, or macOS (x64 / arm64)
- A GPU with Vulkan 1.4+, DX12, or Metal Tier 2+ support
Quick Start
Headless Rendering
using Goldy;
using var instance = new Instance();
using var device = instance.CreateDevice(DeviceType.DiscreteGpu);
using var target = new RenderTarget(device, 800, 600, TextureFormat.Rgba8Unorm);
var encoder = new CommandEncoder();
encoder.Clear(new Color(0.2f, 0.3f, 0.8f, 1.0f));
target.Render(encoder);
byte[] pixels = target.ReadToCpu();
Console.WriteLine($"Rendered {pixels.Length} bytes ({target.Width}x{target.Height})");
Windowed Rendering
For interactive applications, use Surface with a window handle:
using Goldy;
using var surface = new Surface(device, windowHandle);
while (running)
{
using var frame = surface.Acquire();
var encoder = new CommandEncoder();
encoder.Clear(Color.CornflowerBlue);
// ... draw calls ...
frame.Render(encoder);
surface.Present(frame);
}
Shaders (Slang)
Goldy uses Slang as its shader language across all backends:
var source = """
[shader("vertex")]
float4 vs_main(float2 pos : POSITION) : SV_Position {
return float4(pos, 0.0, 1.0);
}
[shader("fragment")]
float4 fs_main() : SV_Target {
return float4(1.0, 0.5, 0.0, 1.0);
}
""";
using var shader = new ShaderModule(device, source);
using var pipeline = new RenderPipeline(device, shader, new RenderPipelineDesc
{
TargetFormat = TextureFormat.Rgba8Unorm,
Topology = PrimitiveTopology.TriangleList,
});
Resource Management
All Goldy objects implement IDisposable. Use using declarations or using blocks to ensure GPU resources are released promptly:
// Preferred: using declaration (C# 8+)
using var device = instance.CreateDevice(DeviceType.DiscreteGpu);
// Also valid: explicit using block
using (var target = new RenderTarget(device, 512, 512, TextureFormat.Rgba8Unorm))
{
// target is released when the block exits
}
Key Differences from Rust
| Aspect | Rust | C# |
|---|---|---|
| Instance creation | Instance::new()? | new Instance() |
| Error handling | Result<T, GoldyError> | Exceptions |
| Device lifetime | Arc<Device> | IDisposable / using |
| Buffer creation | Buffer::with_data(&device, &[T], access) | Buffer.WithData<T>(device, data, access) |
| Pixel readback | Vec<u8> | byte[] |
| Enums | DeviceType::DiscreteGpu | DeviceType.DiscreteGpu |
API Reference
Instance
public sealed class Instance : IDisposable
{
public Instance();
public IEnumerable<AdapterInfo> EnumerateAdapters();
public Device CreateDevice(DeviceType deviceType);
public Device CreateDeviceById(uint adapterId);
}
Device
public sealed class Device : IDisposable
{
public uint AdapterId { get; }
public bool IsValid { get; }
public ulong GpuProgress { get; }
public void WaitUntil(ulong value);
public bool WaitUntilTimeout(ulong value, uint timeoutMs);
public bool HasLibrary(string name);
}
Buffer
public sealed class Buffer : IDisposable
{
public static Buffer New(Device device, ulong size, DataAccess access);
public static Buffer WithData<T>(Device device, T[] data, DataAccess access)
where T : unmanaged;
public void Write<T>(T[] data) where T : unmanaged;
public void Write<T>(ulong offset, T[] data) where T : unmanaged;
public ulong Size { get; }
}
ShaderModule
public sealed class ShaderModule : IDisposable
{
public ShaderModule(Device device, string slangSource);
}
RenderPipeline / RenderPipelineDesc
public sealed class RenderPipeline : IDisposable
{
public RenderPipeline(Device device, ShaderModule shader, RenderPipelineDesc desc);
}
public sealed class RenderPipelineDesc
{
public TextureFormat TargetFormat { get; set; }
public PrimitiveTopology Topology { get; set; }
// ... vertex layout, depth state
}
CommandEncoder / RenderPass
public sealed class CommandEncoder
{
public CommandEncoder();
public void Clear(Color color);
public RenderPass BeginRenderPass();
}
public sealed class RenderPass : IDisposable
{
public void SetPipeline(RenderPipeline pipeline);
public void SetVertexBuffer(uint slot, Buffer buffer);
public void Draw(uint vertexStart, uint vertexCount,
uint instanceStart = 0, uint instanceCount = 1);
public void DrawIndexed(uint indexCount, uint instanceCount = 1);
}
RenderTarget
public sealed class RenderTarget : IDisposable
{
public RenderTarget(Device device, uint width, uint height, TextureFormat format);
public void Render(CommandEncoder encoder);
public byte[] ReadToCpu();
public void ReadToBuffer(byte[] output);
public uint Width { get; }
public uint Height { get; }
public TextureFormat Format { get; }
public int BufferSize { get; }
}
Surface / SurfaceFrame
public sealed class Surface : IDisposable
{
public Surface(Device device, nint windowHandle);
public SurfaceFrame Acquire();
public void Present(SurfaceFrame frame);
public void Resize(uint width, uint height);
public uint Width { get; }
public uint Height { get; }
}
public sealed class SurfaceFrame : IDisposable
{
public void Render(CommandEncoder encoder);
}
Compute
public sealed class ComputePipeline : IDisposable
{
public ComputePipeline(Device device, ShaderModule computeShader);
}
public sealed class ComputeEncoder
{
public ComputeEncoder();
public void SetPipeline(ComputePipeline pipeline);
public void BindResources(params Buffer[] buffers);
public void BindResourcesRaw(uint[] indices);
public void Dispatch(uint x, uint y, uint z);
public void DispatchIndirect(Buffer buffer, ulong offset);
public void ClearBuffer(Buffer buffer, ulong offset, ulong size);
public void Dispatch(Device device); // dispatch and block
public ulong Submit(Device device); // submit, return timeline value
}
Texture / Sampler
public sealed class Texture : IDisposable
{
public Texture(Device device, uint width, uint height, TextureFormat format,
SpatialAccess access, TextureFlags flags = TextureFlags.None);
public void Write(byte[] data);
public uint Width { get; }
public uint Height { get; }
public TextureFormat Format { get; }
}
public sealed class Sampler : IDisposable
{
public Sampler(Device device, SamplerDesc desc);
}
public struct SamplerDesc
{
public FilterMode MagFilter { get; set; }
public FilterMode MinFilter { get; set; }
public AddressMode AddressModeU { get; set; }
public AddressMode AddressModeV { get; set; }
}
Enums
public enum DeviceType { DiscreteGpu, IntegratedGpu, Cpu, Other }
public enum BackendType { Vulkan, Metal, Dx12 }
public enum DataAccess { Scattered, Broadcast }
public enum SpatialAccess { Interpolated, Direct }
public enum FilterMode { Nearest, Linear }
public enum AddressMode { Repeat, MirrorRepeat, ClampToEdge, ClampToBorder }
public enum TextureFormat
{
Rgba8Unorm, Rgba8Srgb, Bgra8Unorm,
Rgba16Float, Rgba32Float, Depth32Float,
}
public struct Color
{
public float R, G, B, A;
public Color(float r, float g, float b, float a);
public static Color CornflowerBlue { get; }
public static Color Black { get; }
public static Color White { get; }
}
Non-Blocking Submissions
ComputeEncoder.Submit returns a ulong device timeline value. Poll or wait on it via Device.GpuProgress and Device.WaitUntil:
ulong ticket = computeEncoder.Submit(device);
// ... do other work ...
device.WaitUntil(ticket); // block until the GPU catches up
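The submit-then-wait pattern above can be modeled host-side. The sketch below is a toy Rust model of a monotonic device timeline, not the Goldy API: names like Timeline, submit, and wait_until are illustrative. Each submission returns a ticket, and the host polls or blocks until progress reaches it.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

// Toy model of a device timeline: each submit returns a ticket (the value the
// timeline will reach once that submission completes), and the host can poll
// or block until progress passes the ticket. Illustrative names only.
struct Timeline {
    submitted: AtomicU64, // last ticket handed out
    progress: AtomicU64,  // last ticket the "GPU" has retired
}

impl Timeline {
    fn new() -> Self {
        Timeline { submitted: AtomicU64::new(0), progress: AtomicU64::new(0) }
    }
    fn submit(&self) -> u64 {
        self.submitted.fetch_add(1, Ordering::SeqCst) + 1
    }
    fn is_done(&self, ticket: u64) -> bool {
        self.progress.load(Ordering::SeqCst) >= ticket
    }
    fn wait_until(&self, ticket: u64) {
        while !self.is_done(ticket) {
            thread::yield_now(); // a real API blocks on a fence instead of spinning
        }
    }
}

fn main() {
    let tl = Arc::new(Timeline::new());
    let ticket = tl.submit();
    let gpu = Arc::clone(&tl);
    // "GPU" thread retires the submission after doing its work.
    let worker = thread::spawn(move || {
        gpu.progress.store(ticket, Ordering::SeqCst);
    });
    tl.wait_until(ticket); // blocks until progress >= ticket
    assert!(tl.is_done(ticket));
    worker.join().unwrap();
}
```

The same shape holds in C#: Submit hands back the ticket, GpuProgress is the progress counter, and WaitUntil is the blocking wait.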
Examples Gallery
Goldy ships with 22 examples that demonstrate its core concepts. Every example uses Slang shaders and runs on all supported backends (Vulkan 1.4+, DX12, Metal Tier 2+).
Running Examples
cd goldy
cargo run --example <name> --release
All windowed examples support Escape to exit and automatic window-resize handling.
Bindless Basics
These examples cover fundamental Goldy patterns: vertex buffers, the Surface API, uniforms, and fragment shaders.
| Example | What it demonstrates | Source |
|---|---|---|
triangle | The minimal Goldy program. Creates a vertex buffer with colored vertices, builds a render pipeline, and presents to a window via the zero-copy Surface API. | triangle.rs |
gradient | Animated full-screen gradient driven by a time uniform. Uses vertex-less rendering (SV_VertexID) and demonstrates GOLDY_VALIDATE_LAYOUTS for Rust ↔ Slang struct layout validation. | gradient.rs |
window | Triangle with continuous animation, showing the Surface API render loop and frame pacing. | window.rs |
checkerboard | Procedural animated checkerboard via UV distortion in a fragment shader. Also supports GOLDY_VALIDATE_LAYOUTS. | checkerboard.rs |
Compute Workflows
Examples that use ComputePipeline and TaskGraph for GPU-side data processing, including the compute-to-surface pattern.
| Example | What it demonstrates | Source |
|---|---|---|
compute_particles | Full compute + graphics loop. A compute shader updates 1024 particle positions and velocities each frame; a graphics shader renders them as instanced colored quads. Uses TaskGraph for dependency scheduling. | compute_particles.rs |
game_of_life | Conway's Game of Life on the GPU. A compute shader applies cellular-automaton rules on a 128×128 grid using ping-pong BufferViews from a shared BufferPool. A separate graphics pass renders the result. | game_of_life.rs |
compute_to_surface | Pure compute rendering — no RenderPipeline, no CommandEncoder, no vertex buffers. A compute shader writes directly to the swapchain texture via frame.texture() and TaskGraph. Demonstrates the compute-to-surface workflow. | compute_to_surface.rs |
Graphics Pipelines
Classic rendering techniques: depth testing, textures, instancing, and 3D projection.
| Example | What it demonstrates | Source |
|---|---|---|
solid_cube | Solid 3D cube with per-face colors. Demonstrates 3D rendering with a depth buffer and model/view/projection matrices. | solid_cube.rs |
spinning_cube | 3D wireframe cube using line primitives. Shows 3D projection and rotation matrices without depth testing. | spinning_cube.rs |
depth_quads | Two full-screen quads with oscillating depth values. Drawn in a fixed order, the depth buffer (CompareFunction::Less) ensures the nearer quad always wins — proving draw order independence. | depth_quads.rs |
textured_quad | Procedural checkerboard texture displayed on a quad. Demonstrates Texture, Sampler, cross-backend bindless resource access, and linear filtering with repeat addressing. | textured_quad.rs |
instancing | 400 rotating quads driven entirely by the GPU. A compute shader updates per-instance transforms and HSV-derived colors each frame; the graphics shader reads them from a storage buffer — no vertex buffer needed. | instancing.rs |
bouncing_lines | Lines bouncing off window edges. Uses the LINE_LIST primitive topology and simple physics. | bouncing_lines.rs |
waveform | Audio-style waveform visualizer using LINE_STRIP topology and multiple draw calls per frame. | waveform.rs |
Advanced Patterns
More complex examples combining multiple Goldy features or demonstrating interactive input, visual effects, and multi-window management.
Fragment Shader Effects
| Example | What it demonstrates | Source |
|---|---|---|
plasma | Classic demoscene plasma effect using complex trigonometric math in a fragment shader with time-based animation. | plasma.rs |
tunnel | Flying-through-a-tunnel effect using polar coordinates and procedural checkerboard texturing in screen space. | tunnel.rs |
metaballs | Organic blob simulation using distance-field evaluation and thresholding in a fragment shader. | metaballs.rs |
starfield | 3D starfield fly-through simulated entirely in a fragment shader with depth-based brightness. | starfield.rs |
Interactive Input
| Example | What it demonstrates | Source |
|---|---|---|
mandelbrot | Real-time fractal explorer. Arrow keys pan, +/- zoom, R resets. Demonstrates interactive uniform updates driving a fragment shader. | mandelbrot.rs |
particles | Rain and snow particle simulation. Press Space to toggle mode. Shows CPU-driven particle state with per-frame vertex buffer updates. | particles.rs |
digital_clock | 7-segment LED display rendered from vertex data. Space pauses, click changes color. Demonstrates dynamic vertex generation for complex shapes. | digital_clock.rs |
Multi-Window
| Example | What it demonstrates | Source |
|---|---|---|
multi_window | Three simultaneous windows, each running an independent effect (plasma, tunnel, starfield) with its own Surface, pipeline, and input handling. Demonstrates managing multiple GPU surfaces from a single device. | multi_window.rs |
Common Patterns
Surface API Render Loop (Rust)
let frame = surface.begin()?;
let mut encoder = CommandEncoder::new();
{
    let mut pass = encoder.begin_render_pass();
    pass.clear(background_color);
    pass.set_pipeline(&pipeline);
    pass.set_vertex_buffer(0, &vertices);
    pass.draw(0..vertex_count, 0..1);
}
frame.render(encoder)?;
frame.present()?;
Compute + Graphics with TaskGraph
let mut graph = TaskGraph::new();
graph
    .node("update", &compute_pipeline)
    .bind_buffer(&buffer, NodeAccess::ReadWrite)
    .bind_resources_raw(&[buffer.bindless_index().unwrap()])
    .dispatch(workgroups, 1, 1);
graph.dispatch(&device)?;
Slang Shader Template
import goldy_exp;
struct VertexOutput {
float4 position : SV_Position;
float2 uv;
};
[shader("vertex")]
VertexOutput vs_main(float2 pos : POSITION, float2 uv : TEXCOORD) {
VertexOutput output;
output.position = float4(pos, 0.0, 1.0);
output.uv = uv;
return output;
}
[shader("fragment")]
float4 fs_main(VertexOutput input) : SV_Target {
return float4(input.uv, 0.5, 1.0);
}
Motivation
The Problem with "Modern" Graphics APIs
DX12, Vulkan, and Metal are commonly called modern APIs, but they were designed over a decade ago for hardware that has since changed dramatically. Sebastian Aaltonen's "No Graphics API" captures the core tension:
"DirectX 12, Vulkan, and Metal are often referred to as 'modern APIs'. These APIs are now 10 years old. They were initially designed to support GPUs that are now 13 years old, an incredibly long time in GPU history."
The GPU architectures those APIs targeted lacked coherent caches, bindless descriptors, and 64-bit pointers. The APIs compensated with layers of indirection — descriptor sets, render pass objects, explicit image layout transitions, pipeline layouts as first-class objects — that served as hints and contracts for hardware that needed them.
Modern GPUs (roughly 2018+) no longer need most of that scaffolding:
| Then (2012-era) | Now (2018+) |
|---|---|
| Incoherent caches, manual flush | Coherent L2, automatic |
| Discrete memory, explicit copies | PCIe Resizable BAR (ReBAR), unified where possible |
| 32-bit pointers, indirect | 64-bit, direct in shaders |
| CPU-bound descriptor binding | Bindless, GPU-resident |
| Render passes for tile optimization | Dynamic rendering works fine |
Yet every application using these APIs still pays the complexity cost of the old model, even when targeting only recent hardware.
Why Bindless Matters
Traditional GPU programming organizes resources into descriptor sets — fixed layouts of bindings that must be declared ahead of time, allocated from pools, and swapped between draw calls. This model creates a cascade of complexity:
- Pipeline layout explosion: Every unique combination of descriptor set layouts produces a distinct pipeline layout, and each pipeline layout dimension multiplies the total pipeline state permutation count.
- CPU overhead: Updating and binding descriptor sets each frame is a significant portion of CPU-side draw call cost.
- Shader inflexibility: Shaders are coupled to their binding layout; changing which resources a shader accesses means changing the pipeline.
Bindless resource access replaces all of this with a single concept: resources live in GPU-visible memory, and shaders access them by index. There are no set layouts to declare, no pools to manage, no binding points to track. A shader that needs buffer #7 just reads slot 7 from a flat descriptor heap.
This isn't exotic — it's how game engines have been working internally for years. Goldy makes it the public API rather than hiding it behind compatibility abstractions.
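To make the flat-heap idea concrete, here is a toy Rust sketch of a bindless slot table. It is not Goldy's internal representation; BindlessHeap, create, and destroy are illustrative names. Creating a resource hands out an index, destroying it recycles the slot, and a lookup is just an array read, which is exactly what "reads slot 7 from a flat descriptor heap" amounts to.

```rust
// Toy sketch of a flat bindless heap: resources get a slot index at creation
// and are looked up by that index, with no set layouts or binding calls.
struct BindlessHeap<T> {
    slots: Vec<Option<T>>,
    free: Vec<u32>, // recycled slot indices
}

impl<T> BindlessHeap<T> {
    fn new() -> Self {
        BindlessHeap { slots: Vec::new(), free: Vec::new() }
    }

    /// Register a resource and return its slot index.
    fn create(&mut self, resource: T) -> u32 {
        if let Some(idx) = self.free.pop() {
            self.slots[idx as usize] = Some(resource);
            idx
        } else {
            self.slots.push(Some(resource));
            (self.slots.len() - 1) as u32
        }
    }

    /// What a shader effectively does: read slot N from the flat heap.
    fn get(&self, idx: u32) -> Option<&T> {
        self.slots.get(idx as usize).and_then(|s| s.as_ref())
    }

    fn destroy(&mut self, idx: u32) {
        self.slots[idx as usize] = None;
        self.free.push(idx); // slot is recycled for the next resource
    }
}

fn main() {
    let mut heap = BindlessHeap::new();
    let a = heap.create("vertex buffer");
    let b = heap.create("particle buffer");
    assert_eq!((a, b), (0, 1));
    assert_eq!(heap.get(b), Some(&"particle buffer"));
    heap.destroy(a);
    let c = heap.create("uniform buffer"); // reuses the freed slot
    assert_eq!(c, 0);
}
```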
Why a Task Graph
Bindless access means shaders can read any resource at any time. The traditional model of inserting barriers at the call site ("I'm about to read this buffer, so transition it now") breaks down when the set of resources a dispatch touches isn't known until the shader runs.
Goldy uses a task graph to solve this. You declare tasks and their resource dependencies; Goldy derives the barriers, layout transitions, and execution order automatically. This is both safer (no missed barriers) and simpler (no manual synchronization) than the alternative.
The task graph also enables Goldy to batch and reorder work across the frame, which matters for compute-heavy workloads where multiple dispatches feed into each other before anything reaches the screen.
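A minimal sketch of that scheduling idea in Rust, assuming nothing about Goldy's internals: each node declares read and write sets over resource indices, hazards between nodes derive the edges, and nodes with no dependency on each other fall into the same wave.

```rust
use std::collections::HashSet;

// Toy sketch of dependency-driven scheduling. Node, depends, and schedule are
// illustrative names, not Goldy's scheduler.
struct Node {
    name: &'static str,
    reads: HashSet<u32>,  // bindless resource indices this node reads
    writes: HashSet<u32>, // indices this node writes
}

fn depends(later: &Node, earlier: &Node) -> bool {
    // `later` must wait if it touches anything `earlier` wrote,
    // or writes anything `earlier` read.
    later.reads.iter().chain(later.writes.iter()).any(|r| earlier.writes.contains(r))
        || later.writes.iter().any(|w| earlier.reads.contains(w))
}

fn schedule(nodes: &[Node]) -> Vec<Vec<&'static str>> {
    // wave[i] = earliest wave node i can run in
    let mut wave = vec![0usize; nodes.len()];
    for i in 0..nodes.len() {
        for j in 0..i {
            if depends(&nodes[i], &nodes[j]) {
                wave[i] = wave[i].max(wave[j] + 1);
            }
        }
    }
    let max = wave.iter().copied().max().unwrap_or(0);
    (0..=max)
        .map(|w| {
            nodes.iter()
                .enumerate()
                .filter(|&(i, _)| wave[i] == w)
                .map(|(_, n)| n.name)
                .collect()
        })
        .collect()
}

fn main() {
    let set = |v: &[u32]| v.iter().copied().collect::<HashSet<_>>();
    let nodes = [
        Node { name: "simulate", reads: set(&[]),  writes: set(&[0]) },
        Node { name: "blur",     reads: set(&[1]), writes: set(&[2]) }, // independent
        Node { name: "draw",     reads: set(&[0]), writes: set(&[3]) }, // after simulate
    ];
    // "simulate" and "blur" share no resources, so they land in the same wave.
    assert_eq!(schedule(&nodes), vec![vec!["simulate", "blur"], vec!["draw"]]);
}
```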
Why Slang
The shader language landscape is fragmented. GLSL, HLSL, MSL, and WGSL each target a subset of platforms, and none is a clean superset of the others. Libraries that support multiple shading languages maintain translation layers and per-language workarounds, which is a significant source of bugs and complexity.
Slang solves this at the source level. A single Slang source file compiles to SPIR-V (Vulkan), DXIL (DX12), and MSL (Metal). It uses HLSL-familiar syntax with additions that matter for modern GPU programming:
| Feature | Why it matters |
|---|---|
| Modules and import | True separate compilation, no #include fragility |
| Generics | Type-safe reusable shader code |
| Automatic differentiation | First-class for ML and physics workloads |
| Khronos governance | Long-term stability and active development |
By committing to Slang as the sole shader language, Goldy eliminates an entire category of cross-platform bugs and keeps its codebase focused on GPU work rather than shader translation.
Intellectual Roots
Goldy synthesizes ideas from several sources:
- Sebastian Aaltonen, "No Graphics API" — The primary philosophical foundation. Modern GPUs have converged enough that a dramatically simpler API is possible if you drop legacy support.
- Raph Levien, "Requiem for piet-gpu-hal" — The insight that good abstractions expose cost and reality while abstracting meaning and rules. Classic HALs failed by hiding both.
- wgpu — Excellent API ergonomics (Instance/Device architecture, CommandEncoder pattern, explicit pass structure). Goldy borrows patterns but is free to diverge from the WebGPU spec.
- Wayland compositor architecture — Frames, not commands. Explicit synchronization, not implicit state machines.
- TU Darmstadt, "Recursive Hardware Abstraction Layers" — Rigorous analysis of what a minimal HAL actually needs when targeting converged modern hardware.
- CUDA — A composable language that exposes memory directly, with a broad library ecosystem built on that simplicity.
No single source defines Goldy. The value is in the synthesis — and the willingness to ship an opinionated library rather than wait for committee consensus.
The Name
Goldy aspires to exist in the golden mean between wgpu's emphasis on compatibility and the vision of no-graphics-api.
Further Reading
- Sebastian Aaltonen: No Graphics API
- Raph Levien: Requiem for piet-gpu-hal
- TU Darmstadt: Recursive HALs
- What Goldy Sheds
- Goldy vs wgpu
What Goldy Sheds
Goldy's bindless model and modern-hardware baseline make several traditional GPU programming concepts unnecessary. These aren't missing features — they're intentional design choices that keep the API small and the programming model coherent.
No Descriptor Set Management
Traditional APIs require you to declare descriptor set layouts, allocate descriptor pools, write descriptor sets, and bind them before each draw or dispatch. A typical Vulkan pipeline touches three to four descriptor set objects before anything reaches the GPU.
Goldy replaces all of this with a flat bindless heap. Resources get a slot index when created, and shaders access them by that index. There are no layouts, no pools, no binding calls.
// Shader receives resources by index — no descriptor sets
[goldy_compute]
[numthreads(64, 1, 1)]
void cs_main(Scattered<Particle> particles, ThreadId id) {
particles[id.x].position += particles[id.x].velocity;
}
This also eliminates pipeline layouts as objects. In Vulkan, each unique combination of descriptor set layouts produces a pipeline layout, which is baked into the pipeline at creation time. Goldy's single global bindless layout means one pipeline layout for all pipelines.
No Manual Barrier Insertion
In Vulkan and DX12, you manually insert memory barriers and image layout transitions to tell the GPU when a resource changes from "written by compute" to "read by fragment" (or any other transition). Missing a barrier is a silent correctness bug; inserting too many is a performance bug.
Goldy's task graph handles this automatically. You declare what each task reads and writes; Goldy derives the minimal set of barriers and transitions. This is both safer and typically more efficient than hand-placed barriers, because the task graph has a global view of the frame.
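As a toy illustration of deriving barriers from declared accesses (not Goldy's actual barrier logic), the sketch below walks accesses in submission order and emits a barrier only where a read-after-write, write-after-read, or write-after-write hazard exists; read-after-read needs none.

```rust
// Toy sketch: derive barriers from declared accesses in submission order.
// derive_barriers and Access are illustrative names.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Access { Read, Write }

fn derive_barriers(accesses: &[(&'static str, u32, Access)]) -> Vec<String> {
    use std::collections::HashMap;
    // Track the last (node, access) seen for each resource index.
    let mut last: HashMap<u32, (&str, Access)> = HashMap::new();
    let mut barriers = Vec::new();
    for &(node, res, acc) in accesses {
        if let Some(&(prev, prev_acc)) = last.get(&res) {
            // Read-after-read needs no barrier; RAW, WAR, and WAW all do.
            if !(prev_acc == Access::Read && acc == Access::Read) {
                barriers.push(format!("barrier on #{res}: {prev} -> {node}"));
            }
        }
        last.insert(res, (node, acc));
    }
    barriers
}

fn main() {
    let frame = [
        ("simulate", 0, Access::Write),
        ("draw",     0, Access::Read), // read-after-write: barrier
        ("overlay",  0, Access::Read), // read-after-read: no barrier
    ];
    assert_eq!(
        derive_barriers(&frame),
        vec!["barrier on #0: simulate -> draw".to_string()]
    );
}
```

Missing a hand-placed barrier in this situation is exactly the silent bug the prose describes; deriving the list from declarations makes it unrepresentable.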
No Shader Permutation Systems
Traditional engines maintain thousands of shader variants — combinations of feature flags, render pass compatibility, descriptor set layout versions, and pipeline state. Some ship dedicated cloud infrastructure just to compile and cache them all.
Goldy collapses most of the dimensions that drive permutation counts:
| Traditional dimension | Goldy equivalent |
|---|---|
| Render pass compatibility | Dynamic rendering — no render pass objects |
| Descriptor set layout | One global bindless layout |
| Pipeline layout | Implicit from the global layout |
| Viewport/scissor state | Dynamic state, not baked into PSO |
What remains — shader source × vertex format × target format × depth config — is a small, manageable space. Goldy addresses pipeline variety by having fewer pipelines, not by building infrastructure to manage many variants.
Minimal Pipeline State Management
A Vulkan VkGraphicsPipelineCreateInfo touches blend state, depth/stencil state, rasterizer state, multisample state, input assembly, viewport/scissor, dynamic state flags, render pass, subpass, pipeline layout, and shader stages. Many of these are baked in at pipeline creation time, producing the combinatorial explosion that drives PSO caches.
Goldy uses dynamic rendering and dynamic state to move viewport, scissor, and render target configuration out of the pipeline object. The remaining pipeline state is intentionally minimal:
let pipeline = RenderPipeline::new(&device, &shader, &shader, &desc)?;
Blend mode, depth testing, and vertex format are still part of the pipeline — they represent genuine hardware configuration. But the many compatibility dimensions that traditional APIs bake in are gone.
No Separate Compute API
OpenCL introduced compute to GPUs as an entirely separate API with its own device model, memory model, and dispatch semantics. Even "unified" APIs like Vulkan treat compute as a second-class citizen — compute pipelines and graphics pipelines share almost no code paths.
In Goldy, compute is a first-class citizen on the same footing as graphics. Compute shaders use the same bindless resource model, the same buffer types, and the same task graph. A compute dispatch that writes to a buffer and a draw call that reads from it are just nodes in the same graph.
// Compute updates particles, render draws them — same resources, same graph
graph.add_compute("update", &compute_shader, &[&particle_buf], [workgroups, 1, 1]);
graph.add_render("draw", &render_pipeline, &[&particle_buf], &surface);
The Design Principle
Each of these omissions follows the same logic: if modern hardware doesn't need a concept for correctness or performance, Goldy doesn't expose it. The result is an API where the concepts that remain — buffers, textures, shaders, pipelines, task graph — each carry their weight.
Goldy vs wgpu
Both Goldy and wgpu are Rust GPU libraries with multi-backend support. They make different tradeoffs that suit different use cases.
At a Glance
| | wgpu | Goldy |
|---|---|---|
| Identity | WebGPU implementation for Rust | Modern Rust GPU library |
| Spec governance | W3C WebGPU specification | Independent, opinionated |
| Browser support | Yes (WebGPU) | No |
| Minimum hardware | Wide compatibility (Vulkan 1.0+) | Modern only (Vulkan 1.4+, DX12, Metal Tier 2+) |
| Shader language | WGSL (primary); SPIR-V and GLSL via naga | Slang (compiles to SPIR-V, DXIL, MSL) |
| Resource model | Descriptor-based (bind groups) | Typed bindless |
| Synchronization | Manual pass ordering | Task graph |
| Metal support | Native via wgpu-hal (or Vulkan via MoltenVK) | Native Metal backend |
| Compute model | Supported but secondary | First-class (compute-to-surface) |
Resource Binding: Descriptors vs Bindless
wgpu uses bind groups — the WebGPU equivalent of Vulkan descriptor sets. You declare a bind group layout, create bind groups that match it, and bind them before each draw or dispatch:
// wgpu: declare layout, create group, bind before draw
let layout = device.create_bind_group_layout(&desc);
let group = device.create_bind_group(&wgpu::BindGroupDescriptor {
    layout: &layout,
    entries: &[wgpu::BindGroupEntry {
        binding: 0,
        resource: buffer.as_entire_binding(),
    }],
    ..
});
pass.set_bind_group(0, &group, &[]);
Goldy uses bindless access. Resources get a slot index at creation time, and shaders access them directly by index. There are no layouts, groups, or binding calls:
// Goldy: buffer already has a bindless slot, shader reads it by index
let buffer = Buffer::with_data(&device, &data, DataAccess::Scattered)?;
pass.bind_resources_raw(&[buffer.bindless_index().unwrap()]);
The bindless approach eliminates an entire layer of API surface and the pipeline layout permutations that come with it.
Synchronization: Manual vs Task Graph
wgpu provides implicit synchronization within a render/compute pass but requires you to order passes correctly. Resource transitions between passes are handled by wgpu internally, following WebGPU's implicit rules.
Goldy uses an explicit task graph. You declare tasks and their resource dependencies; Goldy derives barriers, layout transitions, and execution order. This gives the runtime a global view of the frame for optimal scheduling and makes synchronization bugs structurally impossible.
Shader Language: WGSL vs Slang
wgpu's primary shader language is WGSL, the WebGPU Shading Language. WGSL is designed for safety and portability across web and native targets, but it lacks features like modules, generics, and automatic differentiation.
Goldy uses Slang exclusively. Slang compiles a single source file to SPIR-V (Vulkan), DXIL (DX12), and MSL (Metal). It provides modules with true separate compilation, generics, and HLSL-familiar syntax. The goldy_exp shader library builds on Slang's module system to provide shared types and utilities:
import goldy_exp;
[goldy_compute]
[numthreads(64, 1, 1)]
void cs_main(Scattered<Particle> particles, ThreadId id) {
particles[id.x].position += particles[id.x].velocity;
}
Compute as First-Class Citizen
wgpu supports compute shaders, but the API is oriented around render passes. Compute-to-render workflows require manual buffer management and pass ordering.
Goldy treats compute and graphics as peers. Compute-to-surface is a built-in pattern: a compute dispatch writes to a buffer or texture, and a subsequent render pass reads from it, with the task graph handling the dependency automatically.
Metal: Native vs MoltenVK
wgpu supports Metal through its wgpu-hal Metal backend or via MoltenVK (Vulkan-on-Metal translation). MoltenVK adds a translation layer that can introduce overhead and compatibility limitations.
Goldy has a native Metal backend that uses Metal APIs directly — Argument Buffers Tier 2 for bindless, MSL compiled from Slang, and native Metal types throughout. No translation layer sits between Goldy and the Metal driver.
Architecture
wgpu:
Application → wgpu (WebGPU API) → wgpu-hal → Vulkan / Metal / DX12 / WebGPU
Goldy:
Application → Goldy (native API) → Vulkan 1.4+ / Metal Tier 2+ / DX12
wgpu implements the WebGPU specification faithfully, then maps it onto each backend through an internal HAL. Goldy talks to each backend directly using native idioms.
When to Choose Which
Choose wgpu when:
- You need browser deployment via WebGPU
- You need to support older GPUs or wide device compatibility
- You want the stability of a specification-driven API
- You need the wgpu ecosystem (examples, community, tooling)
Choose Goldy when:
- You target only modern desktop/mobile hardware (2018+)
- You want a minimal API surface with bindless as the default
- You want native Metal without a translation layer
- You want Slang's module system and shader language features
- Compute workloads are central to your application
Both libraries are valid choices — the right one depends on your hardware requirements, deployment targets, and whether you value broad compatibility or API simplicity.
Target Hardware
Goldy targets modern GPUs exclusively. This is a deliberate design choice — by requiring hardware from roughly 2018 onward, Goldy can use bindless descriptors, dynamic rendering, and coherent caches as baseline assumptions rather than optional features.
Backend Requirements
Vulkan 1.4+
Goldy requires Vulkan 1.4, which promotes several extensions that were optional in earlier versions to core:
| Feature | Vulkan history | Goldy usage |
|---|---|---|
| Dynamic rendering | VK_KHR_dynamic_rendering (1.3) | No render pass objects |
| Descriptor indexing | VK_EXT_descriptor_indexing (1.2) | Bindless resource access |
| Buffer device address | VK_KHR_buffer_device_address (1.2) | 64-bit GPU pointers |
| Synchronization2 | VK_KHR_synchronization2 (1.3) | Simplified barrier model |
| Push descriptors | Core in 1.4 | Efficient uniform updates |
Supported hardware:
- NVIDIA: Turing and later (RTX 2000 / GTX 1600 series, 2018+)
- AMD: RDNA 1 and later (RX 5000 series, 2019+)
- Intel: Xe architecture and later (Arc, 2022+)
- Qualcomm: Adreno 650+ (2019+, driver dependent)
DX12
Goldy's DX12 backend requires:
| Requirement | Details |
|---|---|
| D3D12 Enhanced Barriers | Windows 11 + WDDM 3.0+ driver |
| ResourceDescriptorHeap | Shader Model 6.6 bindless access |
| Root constants | Push constants equivalent |
Enhanced Barriers are mandatory — Goldy does not fall back to legacy resource state transitions. This effectively requires Windows 11 with a modern driver.
For software rendering and CI, Goldy supports the WARP software rasterizer via GOLDY_DX12_FORCE_WARP=1.
Metal Tier 2+
Goldy's Metal backend is native (no MoltenVK) and requires Argument Buffers Tier 2 for bindless resource access:
| Requirement | Details |
|---|---|
| Argument Buffers Tier 2 | Bindless via ParameterBlock |
| MSL (via Slang) | Slang compiles directly to Metal Shading Language |
Supported hardware:
- Apple Silicon: All models (M1/M2/M3/M4, A14+)
- Intel Macs: 2017+ (different iGPUs; some very early Intel UHD may not qualify)
- AMD discrete GPUs in Macs: 2015+
Older Intel integrated GPUs (pre-2017 Macs) are not supported — they lack Argument Buffers Tier 2.
What "Modern GPU" Means for Goldy
Goldy's hardware floor is defined by a set of architectural capabilities, not specific product names:
| Capability | Why Goldy needs it |
|---|---|
| Coherent L2 cache | No manual cache flush/invalidate logic |
| Bindless descriptors | Single global descriptor model, no set layouts |
| Dynamic rendering | No render pass objects or framebuffer compatibility |
| 64-bit buffer addresses | Direct pointer access in shaders |
| Unified or ReBAR memory | Simplified CPU-GPU data transfer |
GPUs from roughly 2018 onward universally support these features. The specific API version requirements (Vulkan 1.4, DX12 Enhanced Barriers, Metal Tier 2) are the mechanism by which Goldy enforces this floor.
What This Excludes
| Excluded | Reason |
|---|---|
| NVIDIA GTX 900 series (Maxwell) | No Vulkan 1.4 support |
| AMD GCN (RX 400/500) | Driver support ended; limited bindless |
| Intel Gen9 (HD 500/600) | Incomplete Vulkan feature coverage |
| Intel integrated GPUs pre-2017 (Mac) | No Argument Buffers Tier 2 |
| Pre-Windows 11 DX12 | No Enhanced Barriers |
Checking Compatibility
Goldy reports unsupported devices at initialization:
let instance = Instance::new()?;
for adapter in instance.enumerate_adapters() {
    println!("{}: {:?}", adapter.name, adapter.device_type);
}
// create_device returns an error on unsupported hardware
let device = instance.create_device(DeviceType::DiscreteGpu)?;
The Tradeoff
By drawing a line at modern hardware, Goldy avoids the fallback paths, compatibility checks, and feature-level negotiation that dominate traditional GPU libraries. Every code path in Goldy assumes the full feature set is available. This keeps the implementation small and the API surface predictable.
The cost is clear: Goldy cannot run on the long tail of older hardware. For applications that need broad device support, wgpu is the better choice.
Slang Quick Reference
Goldy uses Slang as its sole shading language. This page covers what you need to write Goldy shaders — not a full Slang language reference.
Basics
Slang uses HLSL-style syntax. If you've written HLSL or GLSL, most of it will look familiar.
Scalar Types
```slang
float f = 1.0;
int i = -5;
uint u = 10;
bool b = true;
```
Vector and Matrix Types
```slang
float2 v2 = float2(1.0, 2.0);
float3 v3 = float3(1.0, 2.0, 3.0);
float4 v4 = float4(1.0, 2.0, 3.0, 4.0);

// Swizzling
float2 xy = v4.xy;
float3 rgb = v4.rgb;

// Matrices
float4x4 mvp;
float4 transformed = mul(mvp, float4(pos, 1.0));
```
Structs
```slang
struct Particle {
    float2 position;
    float2 velocity;
    float age;
};
```
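On the Rust side, the matching plain-data struct might look like the sketch below. The `#[repr(C)]` layout and tight packing are assumptions for illustration; actual GPU stride rules depend on the target, which is what Goldy's layout validation (`GOLDY_VALIDATION=layout`) exists to catch:

```rust
// Hypothetical Rust-side mirror of the Slang `Particle` struct above.
// `#[repr(C)]` fixes field order and uses C layout so the byte layout
// can line up with the GPU-side struct (assuming tightly packed fields).
#[repr(C)]
#[derive(Clone, Copy, Debug)]
struct Particle {
    position: [f32; 2],
    velocity: [f32; 2],
    age: f32,
}

fn main() {
    // 2 + 2 + 1 floats = 20 bytes with no padding under repr(C).
    assert_eq!(std::mem::size_of::<Particle>(), 20);
    println!("Particle stride: {} bytes", std::mem::size_of::<Particle>());
}
```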
Functions
```slang
float square(float x) { return x * x; }

// Public functions are exported from modules
public float3 my_effect(float2 uv) { return float3(uv, 0.5); }
```
Modules
Slang has a real module system (not `#include`-based textual inclusion). Modules are separate compilation units:
```slang
// In mylib.slang
module mylib;
public float3 effect(float2 uv) { return float3(uv, 1.0); }

// In shader.slang
import mylib;
float3 c = effect(uv);
```
goldy_exp Resource Types
The goldy_exp module defines type aliases that map to native Slang buffer and texture types. When used as parameters in [goldy_*] entry points, the Goldy compiler automatically resolves slot indices to live resource handles.
Buffer Types
| Type alias | Underlying type | Access pattern | Usage |
|---|---|---|---|
| `Scattered<T>` | `StorageBuffer<T>` (`RWStructuredBuffer<T>`) | Read/write, any thread, any address | `data[i]`, `data[i].field = v` |
| `BufRO<T>` | `ReadOnlyBuffer<T>` (`StructuredBuffer<T>`) | Read-only, hardware read-cache hint | `data[i]` |
| `ByteAddress` | `ByteAddressView` (`RWByteAddressBuffer`) | Raw byte-level access | `.Load(addr)`, `.Store(addr, v)`, `.InterlockedMin(...)` |
Texture Types
| Type alias | Underlying type | Access pattern | Usage |
|---|---|---|---|
| `Interpolated<T>` | `Texture2D<T>` | Hardware-filtered sampling | `tex.Sample(samp, uv)`, `tex.Load(loc)` |
| `DirectSpatial<T>` | `RWTexture2D<T>` | Direct 2D read/write, no filtering | `img[int2(x,y)]`, `img.GetDimensions(w,h)` |
Sampler Type
| Type alias | Underlying type | Usage |
|---|---|---|
| `Filter` | `SamplerState` | Pass to `tex.Sample(filter, uv)` |
Broadcast (Constant Buffer)
To pass uniform data (same value for all threads), declare a struct type directly as a parameter — no wrapper needed. The codegen recognizes any non-resource, non-system-value struct as a constant-buffer broadcast:
```slang
struct TimeUniforms { float time; float delta_time; };

[goldy_compute]
[numthreads(64, 1, 1)]
void cs_main(TimeUniforms cfg, Scattered<Particle> particles, ThreadId id) {
    particles[id.x].position += particles[id.x].velocity * cfg.delta_time;
}
```
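A hedged sketch of the corresponding Rust-side data: the struct mirror is illustrative, and the 16-byte rounding shown here is an assumption based on the usual D3D/Vulkan constant-buffer convention, not a documented Goldy requirement:

```rust
// Hypothetical Rust mirror of the Slang `TimeUniforms` broadcast struct.
#[repr(C)]
struct TimeUniforms {
    time: f32,
    delta_time: f32,
}

// Constant-buffer bindings on most APIs expect sizes in 16-byte multiples
// (assumption: Goldy follows that convention), so a padded variant is shown.
#[repr(C)]
struct TimeUniformsPadded {
    time: f32,
    delta_time: f32,
    _pad: [f32; 2], // rounds 8 bytes up to 16
}

fn main() {
    assert_eq!(std::mem::size_of::<TimeUniforms>(), 8);
    assert_eq!(std::mem::size_of::<TimeUniformsPadded>(), 16);
}
```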
System-Value Types
Declare these as parameters in [goldy_*] entry points to receive GPU-provided values. The codegen maps each type to its SV_* semantic automatically.
Compute
| Type | Maps to | Components |
|---|---|---|
| `ThreadId` | `SV_DispatchThreadID` | `.x`, `.y`, `.z`, `.xy`, `.xyz` |
| `GroupThreadId` | `SV_GroupThreadID` | `.x`, `.y`, `.z`, `.xy`, `.xyz` |
| `GroupId` | `SV_GroupID` | `.x`, `.y`, `.z`, `.xy`, `.xyz` |
Graphics
| Type | Maps to | Components |
|---|---|---|
| `VertexId` | `SV_VertexID` | `.value` |
| `InstanceId` | `SV_InstanceID` | `.value` |
| `IsFrontFace` | `SV_IsFrontFace` | `.value` |
Entry Point Attributes
[goldy_compute]
Marks a compute shader entry point. The Goldy compiler generates the real [shader("compute")] wrapper that resolves resource slots and system values.
```slang
import goldy_exp;

[goldy_compute]
[numthreads(64, 1, 1)]
void cs_main(Scattered<uint> data, uint offset, ThreadId id) {
    data[id.x + offset] += 1;
}
```
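With 64 threads per group, the Rust-side dispatch needs a ceiling-divided group count. A small sketch (the helper itself is illustrative, not part of Goldy's API):

```rust
// Ceiling division: number of `group_size`-wide groups needed to cover
// `n` elements. Assumes `n + group_size` does not overflow u32.
fn group_count(n: u32, group_size: u32) -> u32 {
    (n + group_size - 1) / group_size
}

fn main() {
    assert_eq!(group_count(0, 64), 0);
    assert_eq!(group_count(64, 64), 1);
    assert_eq!(group_count(65, 64), 2);
    // Threads past the element count should be guarded in the shader,
    // e.g. `if (id.x >= count) return;`
    println!("{} groups for 1000 elements", group_count(1000, 64));
}
```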
[goldy_vertex]
Marks a vertex shader entry point.
```slang
import goldy_exp;

struct VSOutput {
    float4 position : SV_Position;
    float4 color : COLOR;
};

[goldy_vertex]
VSOutput vs_main(BufRO<Vertex> verts, VertexId vid) {
    Vertex v = verts[vid.value];
    VSOutput o;
    o.position = float4(v.pos, 0.0, 1.0);
    o.color = v.color;
    return o;
}
```
[goldy_fragment]
Marks a fragment shader entry point.
```slang
import goldy_exp;

[goldy_fragment]
float4 fs_main(Interpolated<float4> tex, Filter samp, float2 uv : TEXCOORD0) : SV_Target {
    return tex.Sample(samp, uv);
}
```
Common Patterns
Accessing Buffers by Index
All Scattered<T> and BufRO<T> parameters support standard array indexing. Field-level writes work directly on Scattered<T>:
```slang
[goldy_compute]
[numthreads(64, 1, 1)]
void cs_main(Scattered<Particle> particles, ThreadId id) {
    Particle p = particles[id.x];
    p.position += p.velocity;
    particles[id.x] = p;
    // Or field-level write:
    particles[id.x].age += 1.0;
}
```
Sampling Textures
```slang
[goldy_fragment]
float4 fs_main(Interpolated<float4> albedo, Filter samp, float2 uv : TEXCOORD0) : SV_Target {
    return albedo.Sample(samp, uv);
}
```
Writing to Storage Images
```slang
[goldy_compute]
[numthreads(8, 8, 1)]
void cs_main(DirectSpatial<float4> output, ThreadId id) {
    output[int2(id.x, id.y)] = float4(float(id.x) / 512.0, float(id.y) / 512.0, 0.5, 1.0);
}
```
Fullscreen Triangle (Vertex-less)
Use vs_fullscreen_triangle() from goldy_exp to render fullscreen effects without a vertex buffer:
```slang
import goldy_exp;

[shader("vertex")]
FullscreenVarying vs_main(uint vertex_id : SV_VertexID) {
    return vs_fullscreen_triangle(vertex_id);
}

[shader("fragment")]
float4 fs_main(FullscreenVarying input) : SV_Target {
    return float4(input.uv, 0.5, 1.0);
}
```
Compute + Render Buffer Sharing
Compute shaders and graphics shaders share the same bindless buffers. The task graph handles the dependency:
```slang
// Compute: update particles
[goldy_compute]
[numthreads(64, 1, 1)]
void cs_update(TimeUniforms cfg, Scattered<Particle> particles, ThreadId id) {
    particles[id.x].position += particles[id.x].velocity * cfg.delta_time;
}

// Vertex: read particles for rendering
[goldy_vertex]
VSOutput vs_draw(BufRO<Particle> particles, InstanceId iid, VertexId vid) {
    Particle p = particles[iid.value];
    // Generate quad geometry from particle position...
}
```
Rust-Side Resource Binding
Resources are bound in declaration order (left to right in the shader signature):
```rust
pass.bind_resources_raw(&[
    cfg_buf.bindless_index().unwrap(),
    particle_buf.bindless_index().unwrap(),
]);
```
Plain scalar parameters (such as `uint offset`) are bound as push constants; no wrapper struct is needed.
goldy_exp Utility Modules
| Module | Contents |
|---|---|
| `goldy_exp/math.slang` | `PI`, `TAU`, `hash()`, `hash2()`, `center_uv()`, `scale_uv()`, `to_polar()`, `smootherstep()` |
| `goldy_exp/color.slang` | `rainbow()`, `palette()`, `heat()`, `hsv_to_rgb()`, `luminance()`, `gamma_correct()` |
| `goldy_exp/primitives.slang` | `quad_position()`, `quad_position_rotated()`, `billboard_position()`, `fullscreen_position()`, `fullscreen_uv()` |
| `goldy_exp/types.slang` | `Particle2D`, `Particle3D`, `FrameUniforms`, `Transform2D`, `Instance2D` |
| `goldy_exp/vertex.slang` | `FullscreenVarying`, `ColoredVertex`, `ColoredVarying`, `vs_fullscreen_triangle()` |
| `goldy_exp/access.slang` | Resource type aliases and system-value types (documented above) |
Environment Variables
Goldy reads several environment variables at runtime for backend selection, validation, debugging, and Slang configuration.
General
| Variable | Values | Default | Description |
|---|---|---|---|
| `GOLDY_BACKEND` | `vulkan`, `vk`, `dx12`, `d3d12`, `directx`, `metal`, `mtl` | Platform default (macOS → Metal, Windows → DX12, Linux → Vulkan) | Override backend selection at runtime. |
| `GOLDY_SLANG_PATH` | File path | (not set) | Override the path to the Slang shared library (`slang.dll` / `libslang.dylib` / `libslang.so`). Bypasses the default search order (vendored next to executable → extracted from embedded). |
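The alias mapping in the table above can be sketched as follows; this is an illustrative parser, not Goldy's actual implementation, and the `Backend` enum here is hypothetical:

```rust
#[derive(Debug, PartialEq)]
enum Backend {
    Vulkan,
    Dx12,
    Metal,
}

// Map a GOLDY_BACKEND value to a backend, case-insensitively.
fn parse_backend(value: &str) -> Option<Backend> {
    match value.trim().to_ascii_lowercase().as_str() {
        "vulkan" | "vk" => Some(Backend::Vulkan),
        "dx12" | "d3d12" | "directx" => Some(Backend::Dx12),
        "metal" | "mtl" => Some(Backend::Metal),
        _ => None, // unrecognized: fall back to the platform default
    }
}

fn main() {
    // The variable is read from the environment at startup.
    let choice = std::env::var("GOLDY_BACKEND")
        .ok()
        .and_then(|v| parse_backend(&v));
    assert_eq!(parse_backend("VK"), Some(Backend::Vulkan));
    assert_eq!(parse_backend("directx"), Some(Backend::Dx12));
    assert_eq!(parse_backend("gl"), None);
    println!("backend override: {:?}", choice);
}
```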
Validation
| Variable | Values | Default | Description |
|---|---|---|---|
| `GOLDY_VALIDATION` | Comma/semicolon/whitespace-separated list: `api`, `layout`, `layouts`, `all`; or `1` / `true` / `yes` | (not set) | Enable validation categories. `api` enables GPU API validation (Vulkan validation layers, Metal shader validation). `layout` enables Rust/Slang struct layout and buffer stride checks. `all` enables both. The shorthand `1` / `true` / `yes` enables GPU API validation only (layout stays opt-in). |
| `GOLDY_VALIDATE_LAYOUTS` | `1`, `true`, `yes` | (not set) | Legacy toggle for layout validation only. Equivalent to `GOLDY_VALIDATION=layout`. |
Validation Examples
```sh
# GPU API validation only (Vulkan validation layers, Metal shader validation)
GOLDY_VALIDATION=api cargo run --example triangle

# Layout + stride checks only
GOLDY_VALIDATION=layout cargo run --example triangle

# Everything
GOLDY_VALIDATION=all cargo run --example triangle

# Shorthand for GPU API only
GOLDY_VALIDATION=1 cargo run --example triangle
```
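The parsing rules described above can be sketched as a small Rust function. This is illustrative only (the `Validation` struct and parser are not Goldy's internals), based on the documented behavior: tokens split on commas, semicolons, or whitespace, with `1`/`true`/`yes` enabling API validation only:

```rust
#[derive(Debug, Default, PartialEq)]
struct Validation {
    api: bool,
    layout: bool,
}

// Sketch of GOLDY_VALIDATION parsing per the documented rules.
fn parse_validation(value: &str) -> Validation {
    let mut v = Validation::default();
    for tok in value.split(|c: char| c == ',' || c == ';' || c.is_whitespace()) {
        match tok.to_ascii_lowercase().as_str() {
            "api" | "1" | "true" | "yes" => v.api = true,
            "layout" | "layouts" => v.layout = true,
            "all" => {
                v.api = true;
                v.layout = true;
            }
            _ => {} // empty or unknown tokens are ignored
        }
    }
    v
}

fn main() {
    assert_eq!(parse_validation("api"), Validation { api: true, layout: false });
    assert_eq!(parse_validation("layout"), Validation { api: false, layout: true });
    assert_eq!(parse_validation("all"), Validation { api: true, layout: true });
    assert_eq!(parse_validation("1"), Validation { api: true, layout: false });
    assert_eq!(parse_validation("api; layouts"), Validation { api: true, layout: true });
}
```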
DX12-Specific
| Variable | Values | Default | Description |
|---|---|---|---|
| `GOLDY_DX12_DEBUG` | `1`, `true` | On in debug builds | Enable the D3D12 debug layer. On by default in debug builds; set explicitly for release builds. |
| `GOLDY_DX12_NO_DEBUG` | `1`, `true` | (not set) | Force-disable the D3D12 debug layer even in debug builds. Useful to avoid debug-layer crashes in parallel test threads. |
| `GOLDY_DX12_GBV` | `1`, `true` | (not set) | Enable D3D12 GPU-Based Validation. Catches UAV/SRV descriptor mismatches, resource state errors, and out-of-bounds access on the GPU timeline. Very slow — use for targeted debugging only. |
| `GOLDY_DX12_FORCE_WARP` | `1`, `true` | (not set) | Force the DX12 backend to use the WARP software rasterizer, even when hardware GPUs are present. Use for headless CI or reproducing WARP-specific rendering bugs. |
| `GOLDY_DX12_ALLOW_WARP` | `1`, `true` | (not set) | Allow the WARP adapter to appear in device enumeration. Without this or `GOLDY_DX12_FORCE_WARP`, WARP is hidden. |
Debugging
| Variable | Values | Default | Description |
|---|---|---|---|
| `GOLDY_DUMP_SHADERS` | Directory path | (not set) | Dump compiled shader bytecode (SPIR-V, DXIL, MSL) to the specified directory. Files are written at shader compilation time. Useful for inspecting what Slang produces for each backend. |
Interop with System Variables
Goldy also respects these non-Goldy environment variables:
| Variable | Backend | Description |
|---|---|---|
| `VK_INSTANCE_LAYERS` | Vulkan | If set to include `VK_LAYER_KHRONOS_validation`, Goldy enables Vulkan validation regardless of `GOLDY_VALIDATION`. |
| `VK_LAYER_PATH` | Vulkan | Standard Vulkan loader variable for locating validation layer manifests. |
| `MTL_SHADER_VALIDATION` | Metal | When `GOLDY_VALIDATION` enables API validation and this variable is unset, Goldy sets it to `1` before creating the first Metal device. If you set it yourself, Goldy does not override it. |
License
Goldy is dual-licensed under the GNU Lesser General Public License v2.1 or later (LGPL-2.1-or-later) and a commercial license.
Open Source (LGPL-2.1-or-later)
You may use Goldy freely in any project — including proprietary and commercial software — as long as you comply with the LGPL:
- ✅ Use Goldy as a dynamically linked library in proprietary software
- ✅ Distribute your application without releasing your own source code
- ✅ Modify Goldy for your own use
- ✅ Use commercially
You must:
- Distribute (or offer access to) the source code of Goldy itself (including any modifications you make to it)
- Allow users to replace the Goldy library with their own build (dynamic linking satisfies this)
- Include the LGPL license and copyright notice
Commercial License
A commercial license removes all LGPL obligations. This is appropriate when you need to:
- Statically link Goldy into a proprietary binary
- Distribute modified versions of Goldy without source disclosure
- Embed Goldy in locked-down or proprietary firmware/SDKs
- Satisfy corporate policies that prohibit copyleft dependencies
For commercial licensing terms, contact: koubaa@github
Dependencies
Goldy depends on various open-source libraries with their own licenses:
| Dependency | License |
|---|---|
| ash | MIT/Apache-2.0 |
| anyhow | MIT/Apache-2.0 |
| thiserror | MIT/Apache-2.0 |
| tracing | MIT |
| bitflags | MIT/Apache-2.0 |
| bytemuck | Zlib/MIT/Apache-2.0 |
All dependencies are permissively licensed and compatible with the LGPL.