Goldy: Modern GPU Library
Goldy is a Rust GPU library built around a typed bindless programming model, a dependency-driven task graph, and first-class compute support — targeting Vulkan 1.4+, DX12, and Metal Tier 2+ with native backends (no translation layers).
Typed Bindless Programming
Shaders are written in Slang using goldy_exp virtual entry points ([goldy_compute], [goldy_vertex], [goldy_fragment]). Resources are declared as typed parameters — the Goldy compiler resolves bindless slots automatically:
import goldy_exp;
[goldy_compute]
[numthreads(64, 1, 1)]
void cs_main(MyUniforms cfg, Scattered<uint> data, ThreadId id) {
data[id.x] = data[id.x] + cfg.base;
}
| Type | Maps To | Use |
|---|---|---|
| `Scattered<T>` | `RWStructuredBuffer<T>` | Read/write storage |
| `BufRO<T>` | `StructuredBuffer<T>` | Read-only storage |
| `DirectSpatial<T>` | `RWTexture2D<T>` | Read/write texture |
| `Interpolated<T>` | `Texture2D<T>` | Sampled texture |
| `Filter` | `SamplerState` | Texture sampler |
| `ThreadId` | `SV_DispatchThreadID` | Compute thread index |
| `VertexId` | `SV_VertexID` | Vertex index |
Struct parameters are automatically treated as broadcast (constant buffer) data.
Task Graph
TaskGraph provides explicit dependency scheduling for bindless compute work. You declare what each node reads and writes; Goldy inserts optimal barriers, parallelizes independent dispatches across waves, and aliases transient resources:
let mut graph = TaskGraph::new();
graph
    .node("simulate", &sim_pipeline)
    .bind_buffer(&particles, NodeAccess::ReadWrite)
    .bind_resources_raw(&[particles_handle.index()])
    .dispatch(group_count, 1, 1);
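The declared accesses are what drive scheduling: a node must run after the last node that wrote any resource it touches, and nodes with disjoint resources can run in the same wave. A minimal sketch of that inference in plain Rust (illustrative only; Goldy's actual TaskGraph internals are not shown here, and write-after-read hazards are omitted for brevity):

```rust
// Sketch: infer execution-order edges from declared read/write accesses.
// Resources are identified by a plain u32 id for illustration.
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq)]
enum Access { Read, ReadWrite }

/// Returns edges (from, to) meaning `to` must wait for `from`.
fn infer_edges(nodes: &[(&str, Vec<(u32, Access)>)]) -> Vec<(String, String)> {
    let mut last_writer: HashMap<u32, String> = HashMap::new();
    let mut edges = Vec::new();
    for (name, accesses) in nodes {
        for (res, access) in accesses {
            // Any access must be ordered after the resource's last writer.
            if let Some(w) = last_writer.get(res) {
                if w.as_str() != *name {
                    edges.push((w.clone(), name.to_string()));
                }
            }
            if *access == Access::ReadWrite {
                last_writer.insert(*res, name.to_string());
            }
        }
    }
    edges
}

fn main() {
    let edges = infer_edges(&[
        ("simulate", vec![(0, Access::ReadWrite)]), // writes particles
        ("render",   vec![(0, Access::Read)]),      // reads particles
        ("blur",     vec![(1, Access::ReadWrite)]), // independent resource
    ]);
    // "render" waits on "simulate"; "blur" has no edges and can run in parallel.
    println!("{:?}", edges);
}
```

Nodes that acquire no edges (like `blur` above) are exactly the ones a scheduler can batch into the same wave.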
Compute-to-Surface
Compute shaders can write directly to swapchain textures — no graphics pipeline, no vertex buffers, no render passes. Acquire a frame, get its texture handle, dispatch, present:
let frame = surface.begin()?;
let texture = frame.texture();
// ... build TaskGraph, dispatch compute ...
frame.submit_compute(&graph)?;
frame.present()?;
Multi-Backend, Single Shader Language
Goldy compiles Slang shaders to SPIR-V (Vulkan), DXIL (DX12), and Metal IR at runtime via the bundled Slang compiler. Each backend is a native implementation — Metal uses Metal idioms, not translated Vulkan.
| Platform | Backend |
|---|---|
| Linux | Vulkan |
| Windows | DX12 (Vulkan optional) |
| macOS | Metal |
License
Goldy is dual-licensed under LGPL-2.1-or-later and a commercial license. See License for details.
Installation
Requirements
- Rust stable (recent version recommended)
- A supported GPU
Adding Goldy to Your Project
[dependencies]
goldy = "0.1"
Or with cargo:
cargo add goldy
Feature Flags
| Feature | Default | Description |
|---|---|---|
| `vulkan` | yes | Vulkan 1.4+ backend (Linux, Windows) |
| `dx12` | yes | DirectX 12 backend (Windows) |
| `metal` | yes | Metal Tier 2+ backend (macOS) |
| `instrumentation` | yes | Structured tracing via `tracing-subscriber` (zero-cost when disabled) |
Platform-inappropriate features are no-ops — enabling metal on Linux or dx12 on macOS compiles cleanly but does nothing.
To build with only specific backends:
[dependencies]
goldy = { version = "0.1", default-features = false, features = ["vulkan"] }
Shader Toolchain
Goldy uses Slang as its shader language. The Slang compiler is bundled automatically via slang-rs — no separate SDK install is needed. Shaders are compiled at runtime to the appropriate target (SPIR-V, DXIL, or Metal IR).
Verifying Installation
use goldy::{Instance, DeviceType};

fn main() -> anyhow::Result<()> {
    let instance = Instance::new()?;
    println!("Available GPUs:");
    for adapter in instance.enumerate_adapters() {
        println!("  {} ({:?})", adapter.name, adapter.device_type);
    }
    let device = instance.create_device(DeviceType::DiscreteGpu)?;
    println!("\nUsing: {}", device.adapter_info().name);
    Ok(())
}
cargo run
Expected output:
Available GPUs:
NVIDIA GeForce RTX 4060 Ti (DiscreteGpu)
Intel(R) UHD Graphics 770 (IntegratedGpu)
Using: NVIDIA GeForce RTX 4060 Ti
Backend Selection
Goldy selects the best backend for your platform automatically:
| Platform | Default Backend |
|---|---|
| Windows | DX12 |
| Linux | Vulkan |
| macOS | Metal |
Override at runtime with GOLDY_BACKEND:
GOLDY_BACKEND=vulkan cargo run
Platform-Specific Setup
Windows
DX12 is used by default and requires no additional setup. For the Vulkan backend, install the Vulkan SDK. Ensure your GPU drivers are up to date.
Linux
Install Vulkan development packages:
# Ubuntu/Debian
sudo apt install libvulkan-dev vulkan-tools
# Fedora
sudo dnf install vulkan-loader-devel vulkan-tools
# Arch
sudo pacman -S vulkan-icd-loader vulkan-tools
macOS
Goldy uses the native Metal backend — no MoltenVK or Vulkan SDK needed. Ensure macOS 12+ and Xcode command-line tools are installed:
xcode-select --install
Windowing (for examples)
The examples use winit for windowing:
[dev-dependencies]
winit = "0.30"
anyhow = "1.0"
Next Steps
- Your First Triangle — draw a colored triangle
- Your First Compute Shader — write pixels from compute
Your First Triangle
This tutorial draws a colored triangle in a window using Goldy's render pipeline and Surface API.
Complete Code
use goldy::{
    shader::builtins, Buffer, Color, CommandEncoder, DataAccess, DeviceType, Instance,
    RenderPipeline, RenderPipelineDesc, ShaderModule, Surface, Vertex2D,
};
use std::sync::Arc;
use winit::{
    application::ApplicationHandler,
    event::WindowEvent,
    event_loop::{ActiveEventLoop, ControlFlow, EventLoop},
    window::{Window, WindowId},
};

struct App {
    instance: Instance,
    device: Option<Arc<goldy::Device>>,
    vertex_buffer: Option<Buffer>,
    pipeline: Option<RenderPipeline>,
    window: Option<Arc<Window>>,
    surface: Option<Surface>,
}

impl App {
    fn new() -> anyhow::Result<Self> {
        Ok(Self {
            instance: Instance::new()?,
            device: None,
            vertex_buffer: None,
            pipeline: None,
            window: None,
            surface: None,
        })
    }

    fn init_gpu(&mut self, window: &Arc<Window>) -> anyhow::Result<()> {
        let device = Arc::new(self.instance.create_device(DeviceType::DiscreteGpu)?);
        let vertices = [
            Vertex2D::new(0.0, -0.5, Color::RED),
            Vertex2D::new(-0.5, 0.5, Color::GREEN),
            Vertex2D::new(0.5, 0.5, Color::BLUE),
        ];
        let vertex_buffer = Buffer::with_data(&device, &vertices, DataAccess::Scattered)?;
        let surface = Surface::new(&device, window.as_ref())?;
        let shader = ShaderModule::from_slang(&device, builtins::VERTEX_COLOR_2D)?;
        let pipeline = RenderPipeline::new(
            &device,
            &shader,
            &shader,
            &RenderPipelineDesc {
                vertex_layout: Vertex2D::layout(),
                target_format: surface.format(),
                ..Default::default()
            },
        )?;
        self.device = Some(device);
        self.vertex_buffer = Some(vertex_buffer);
        self.pipeline = Some(pipeline);
        self.surface = Some(surface);
        Ok(())
    }

    fn render(&mut self) -> anyhow::Result<()> {
        let window = self.window.as_ref().unwrap();
        let size = window.inner_size();
        if size.width == 0 || size.height == 0 {
            return Ok(());
        }
        let pipeline = self.pipeline.as_ref().unwrap();
        let vertex_buffer = self.vertex_buffer.as_ref().unwrap();
        let surface = self.surface.as_ref().unwrap();
        let frame = surface.begin()?;
        let mut encoder = CommandEncoder::new();
        {
            let mut pass = encoder.begin_render_pass();
            pass.clear(Color { r: 0.1, g: 0.1, b: 0.2, a: 1.0 });
            pass.set_pipeline(pipeline);
            pass.set_vertex_buffer(0, vertex_buffer);
            pass.draw(0..3, 0..1);
        }
        frame.render(encoder)?;
        frame.present()?;
        Ok(())
    }
}

impl ApplicationHandler for App {
    fn resumed(&mut self, event_loop: &ActiveEventLoop) {
        if self.window.is_none() {
            let window = Arc::new(
                event_loop
                    .create_window(
                        Window::default_attributes()
                            .with_title("Goldy - Triangle")
                            .with_inner_size(winit::dpi::LogicalSize::new(800, 600)),
                    )
                    .unwrap(),
            );
            self.window = Some(window.clone());
            self.init_gpu(&window).unwrap();
        }
    }

    fn window_event(&mut self, event_loop: &ActiveEventLoop, _: WindowId, event: WindowEvent) {
        match event {
            WindowEvent::CloseRequested => event_loop.exit(),
            WindowEvent::RedrawRequested => {
                self.render().ok();
                self.window.as_ref().unwrap().request_redraw();
            }
            WindowEvent::Resized(new_size) => {
                if new_size.width > 0 && new_size.height > 0 {
                    if let Some(surface) = &mut self.surface {
                        let _ = surface.resize(new_size.width, new_size.height);
                    }
                }
            }
            _ => {}
        }
    }
}

fn main() -> anyhow::Result<()> {
    let event_loop = EventLoop::new()?;
    event_loop.set_control_flow(ControlFlow::Poll);
    event_loop.run_app(&mut App::new()?)?;
    Ok(())
}
Walkthrough
Instance and Device
let instance = Instance::new()?;
let device = Arc::new(instance.create_device(DeviceType::DiscreteGpu)?);
Instance discovers available GPUs. create_device opens a connection to one. The Arc wrapper is required for Surface lifetime management.
Vertex Buffer
let vertices = [
    Vertex2D::new(0.0, -0.5, Color::RED),
    Vertex2D::new(-0.5, 0.5, Color::GREEN),
    Vertex2D::new(0.5, 0.5, Color::BLUE),
];
let vertex_buffer = Buffer::with_data(&device, &vertices, DataAccess::Scattered)?;
Vertex2D is a built-in vertex type with position and color. Buffer::with_data allocates a GPU buffer and uploads the data. DataAccess::Scattered marks it as a bindless storage buffer.
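For intuition about what a vertex like this occupies in the buffer, here is a hypothetical mirror of a `Vertex2D`-style type on the Rust side (the actual field layout of `goldy::Vertex2D` is defined by the library; this sketch only illustrates `#[repr(C)]` sizing):

```rust
// Hypothetical Vertex2D-like type: 2 floats of position plus 4 floats of RGBA color.
// Illustrative only; the real goldy::Vertex2D layout may differ.
#[repr(C)]
#[derive(Clone, Copy)]
struct Vertex2DSketch {
    pos: [f32; 2],
    color: [f32; 4],
}

fn main() {
    // Six f32 fields with no padding: 24 bytes per vertex, which is the stride
    // a vertex layout for this type would declare.
    assert_eq!(std::mem::size_of::<Vertex2DSketch>(), 24);
    println!("stride = {} bytes", std::mem::size_of::<Vertex2DSketch>());
}
```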
Shader and Pipeline
let shader = ShaderModule::from_slang(&device, builtins::VERTEX_COLOR_2D)?;
let pipeline = RenderPipeline::new(
    &device,
    &shader,
    &shader,
    &RenderPipelineDesc {
        vertex_layout: Vertex2D::layout(),
        target_format: surface.format(),
        ..Default::default()
    },
)?;
builtins::VERTEX_COLOR_2D is a built-in Slang shader from the goldy_exp library that uses [goldy_vertex] and [goldy_fragment] virtual entry points to render vertex-colored geometry. ShaderModule::from_slang compiles Slang source to the active backend's IR at runtime.
The pipeline takes the same shader module for both vertex and fragment stages — goldy_exp virtual entry points let a single source file define both.
Surface and Presentation
let surface = Surface::new(&device, window.as_ref())?;
let frame = surface.begin()?;
let mut encoder = CommandEncoder::new();
{
    let mut pass = encoder.begin_render_pass();
    pass.clear(Color { r: 0.1, g: 0.1, b: 0.2, a: 1.0 });
    pass.set_pipeline(pipeline);
    pass.set_vertex_buffer(0, vertex_buffer);
    pass.draw(0..3, 0..1);
}
frame.render(encoder)?;
frame.present()?;
Surface manages the swapchain. begin() acquires the next swapchain image. Commands are recorded into a CommandEncoder, rendered to the frame with frame.render(), then presented with frame.present(). Rendering happens directly on the GPU — no CPU readback.
Run It
cargo run --example triangle
You should see a window with a colored triangle on a dark blue background.
Next Steps
- Your First Compute Shader — bypass the graphics pipeline entirely
- Examples — more complex demos
Your First Compute Shader
This tutorial renders an animated plasma effect by dispatching a compute shader directly to the swapchain texture — no graphics pipeline, no vertex buffers, no render passes.
The Shader
The compute shader uses goldy_exp virtual entry points. It reads uniforms via BufRO<Uniforms> and writes pixels to the swapchain texture via DirectSpatial<float4>:
import goldy_exp;
struct Uniforms {
uint width;
uint height;
float time;
float _padding;
};
[goldy_compute]
[numthreads(8, 8, 1)]
void cs_main(BufRO<Uniforms> uniforms_buf, DirectSpatial<float4> output, ThreadId tid) {
Uniforms u = uniforms_buf[0];
if (tid.x >= u.width || tid.y >= u.height)
return;
float2 uv = float2(float(tid.x) / float(u.width),
float(tid.y) / float(u.height));
float2 p = uv * 2.0 - 1.0;
p.x *= float(u.width) / float(u.height);
float t = u.time;
float v = 0.0;
v += sin(p.x * 6.0 + t);
v += sin(p.y * 6.0 + t * 1.3);
v += sin((p.x + p.y) * 4.0 + t * 0.7);
v += sin(length(p) * 8.0 - t * 2.0);
v *= 0.25;
float3 col = float3(0.5 + 0.5 * sin(v * 3.14159 + 0.0),
0.5 + 0.5 * sin(v * 3.14159 + 2.094),
0.5 + 0.5 * sin(v * 3.14159 + 4.188));
output[tid.xy] = float4(col, 1.0);
}
Key points:
- `BufRO<Uniforms>` is a read-only structured buffer. Index with `[0]` to load the single element.
- `DirectSpatial<float4>` is an `RWTexture2D<float4>` — write to it with `output[tid.xy]`.
- `ThreadId` maps to `SV_DispatchThreadID`. Each thread handles one pixel.
- The `[goldy_compute]` attribute tells the Goldy compiler to wire up bindless slots automatically.
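Because the plasma formula is pure arithmetic, it can be sanity-checked on the CPU before touching the GPU. A Rust transcription of the shader body for a single pixel (illustrative; the Slang version above is the one that actually runs):

```rust
// CPU transcription of the per-pixel plasma formula, matching the shader's constants.
fn plasma(px: u32, py: u32, width: u32, height: u32, t: f32) -> [f32; 3] {
    let uv = [px as f32 / width as f32, py as f32 / height as f32];
    let mut p = [uv[0] * 2.0 - 1.0, uv[1] * 2.0 - 1.0];
    p[0] *= width as f32 / height as f32; // aspect-ratio correction

    // Four phase-shifted sine waves, averaged.
    let mut v = 0.0f32;
    v += (p[0] * 6.0 + t).sin();
    v += (p[1] * 6.0 + t * 1.3).sin();
    v += ((p[0] + p[1]) * 4.0 + t * 0.7).sin();
    v += ((p[0] * p[0] + p[1] * p[1]).sqrt() * 8.0 - t * 2.0).sin();
    v *= 0.25;

    // Sine palette with 120-degree phase offsets, same constants as the shader.
    [
        0.5 + 0.5 * (v * 3.14159f32).sin(),
        0.5 + 0.5 * (v * 3.14159 + 2.094).sin(),
        0.5 + 0.5 * (v * 3.14159 + 4.188).sin(),
    ]
}

fn main() {
    // At the screen center with t = 0 every sine term is sin(0), so v = 0
    // and the red channel comes out exactly 0.5.
    let c = plasma(400, 300, 800, 600, 0.0);
    println!("{:?}", c);
    assert!(c.iter().all(|x| (0.0..=1.0).contains(x)));
}
```

The `0.5 + 0.5 * sin(...)` palette guarantees every channel lands in [0, 1], which is why the shader can write the result straight to the swapchain without clamping.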
Rust Side
Uniform Buffer
Define the uniform struct on the Rust side with matching layout:
#[repr(C)]
#[derive(Clone, Copy, bytemuck::Pod, bytemuck::Zeroable)]
struct Uniforms {
    width: u32,
    height: u32,
    time: f32,
    _padding: f32,
}

impl goldy::StructuredBufferElement for Uniforms {}
Create the buffer with DataAccess::Scattered so it gets a bindless descriptor:
let uniform_buffer = Buffer::with_data(
    &device,
    &[Uniforms { width, height, time: 0.0, _padding: 0.0 }],
    DataAccess::Scattered,
)?;
Pass a typed &[Uniforms] slice, not raw bytes. Buffer::with_data::<T> uses size_of::<T>() as the structured-buffer stride, which backends rely on for correct addressing.
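The stride assumption is easy to verify with a plain size check. This sketch repeats the field list from above (with the bytemuck derives dropped so it is self-contained):

```rust
// Same field list as the Uniforms struct above: two u32s plus two f32s.
#[repr(C)]
#[derive(Clone, Copy)]
struct Uniforms {
    width: u32,
    height: u32,
    time: f32,
    _padding: f32,
}

fn main() {
    // 4 fields x 4 bytes, no padding: this 16-byte figure is the stride the
    // backends use for structured-buffer addressing.
    assert_eq!(std::mem::size_of::<Uniforms>(), 16);
    assert_eq!(std::mem::align_of::<Uniforms>(), 4);
    println!("stride = {}", std::mem::size_of::<Uniforms>());
}
```

The explicit `_padding` field keeps the Rust size in lockstep with the 16-byte struct the shader declares, so the two sides never disagree about element boundaries.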
Compute Pipeline
Compile the Slang source and create a ComputePipeline:
let shader = ShaderModule::from_slang(&device, COMPUTE_SHADER)?;
let compute_pipeline = ComputePipeline::new(&device, &shader)?;
Rendering a Frame
Each frame follows this pattern: update uniforms, acquire the swapchain texture, build a TaskGraph, submit, present.
fn render_frame(state: &mut RenderState) -> Result<()> {
    let (width, height) = state.surface.size();
    let elapsed = state.start_time.elapsed().as_secs_f32();
    state.uniform_buffer.write(
        0,
        bytemuck::bytes_of(&Uniforms {
            width,
            height,
            time: elapsed,
            _padding: 0.0,
        }),
    )?;

    let frame = state.surface.begin()?;
    let texture = frame.texture();

    let wg_x = width.div_ceil(8);
    let wg_y = height.div_ceil(8);

    let uniform_handle = state
        .uniform_buffer
        .bindless_srv_handle()
        .expect("Uniform buffer has no bindless SRV handle");
    let texture_handle = texture
        .bindless_handle()
        .expect("Surface texture has no bindless handle");

    let mut graph = TaskGraph::new();
    graph
        .node("compute", &state.compute_pipeline)
        .bind_buffer(&state.uniform_buffer, NodeAccess::Read)
        .bind_resources_raw(&[uniform_handle.index(), texture_handle.index()])
        .dispatch(wg_x, wg_y, 1);

    frame.submit_compute(&graph)?;
    frame.present()?;
    Ok(())
}
Step by Step
1. Update uniforms — `Buffer::write` uploads new time/size values each frame.
2. Acquire the frame — `surface.begin()` returns a `Frame`. `frame.texture()` gives you the swapchain `Texture` for this frame.
3. Get bindless handles — `bindless_srv_handle()` returns the read-only descriptor index for the uniform buffer. `bindless_handle()` returns the storage-image descriptor index for the swapchain texture. These indices are passed to the shader as the `BufRO<Uniforms>` and `DirectSpatial<float4>` slots respectively.
4. Build the TaskGraph — `graph.node()` creates a compute node bound to a pipeline. `bind_buffer()` declares the dependency (the uniform buffer is read). `bind_resources_raw()` passes the bindless descriptor indices as push-constant slots. `dispatch()` sets the workgroup count.
5. Submit and present — `frame.submit_compute(&graph)` records and submits the compute work to the GPU. `frame.present()` presents the swapchain image. The compute shader already wrote the pixels — there is no blit or copy step.
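The workgroup count is the usual ceiling division: with 8x8 threads per group, a surface needs one group per 8x8 tile, rounding up so partially covered edge tiles still get a group. In Rust:

```rust
fn main() {
    // An odd width exercises the rounding: 801 pixels span 101 tiles of 8.
    let (width, height): (u32, u32) = (801, 600);

    // div_ceil rounds up so partially covered tiles still get a workgroup;
    // the shader's `tid >= size` bounds check discards the surplus threads.
    let wg_x = width.div_ceil(8);
    let wg_y = height.div_ceil(8);

    assert_eq!((wg_x, wg_y), (101, 75));
    println!("dispatch({wg_x}, {wg_y}, 1)");
}
```

This rounding is exactly why the shader begins with the `tid.x >= u.width || tid.y >= u.height` early-out: without it, the extra threads in the last tile column and row would write out of bounds.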
Run It
cargo run --example compute_to_surface
You should see an animated plasma pattern filling the window, rendered entirely from compute.
Next Steps
- Task Graph — multi-node graphs, transient resources, indirect dispatch
- Examples — particles, game of life, and more compute examples
Bindless by Default
Goldy uses a typed bindless resource model: there are no descriptor sets, no binding tables, and no manual layout declarations. Every GPU resource — buffers, textures, samplers — is identified at dispatch time by a small integer index packed into push constants (Vulkan/DX12) or argument buffers (Metal).
How It Works
Traditional GPU APIs require you to declare descriptor set layouts, allocate descriptor pools, update descriptor sets, and bind them before each draw or dispatch. Goldy eliminates all of this. Instead:
- Resources are registered in per-category descriptor heaps when created.
- Each resource gets a `BindlessHandle` — a `(category, index)` pair.
- At dispatch time, you pass these handles as ordinary arguments. The GPU shader resolves them to live buffer/texture/sampler handles through the `goldy_exp` access functions.
CPU side:                        GPU side:

Buffer::with_data(...)           goldy_scattered<T>(slot)
  → BindlessHandle {               → descriptor_heap[slot]
        category: Scattered,         → RWStructuredBuffer<T>
        index: 3,
    }
BindlessCategory
Goldy's descriptor heaps are organized into five pools, one per access pattern. A resource's index is only meaningful within its category:
| Category | Pool | Shader Access Function |
|---|---|---|
| `Scattered` | Storage buffers | `goldy_scattered<T>()` / `goldy_buf_ro<T>()` |
| `Broadcast` | Uniform/constant buffers | `goldy_broadcast<T>()` |
| `Texture` | Sampled textures | `goldy_interpolated<T>()` |
| `StorageImage` | Writable textures | `goldy_direct_spatial<T>()` |
| `Sampler` | Sampler states | `goldy_filter()` |
Scattered slot 3 and Broadcast slot 3 refer to different physical entries — on Metal these are storageBuffers[3] vs uniformBuffers[3], on Vulkan they live in different descriptor array bindings.
BindlessHandle
BindlessHandle is the typed wrapper that carries both the raw index and the resource category:
let buf = Buffer::with_data(&device, &particles, DataAccess::Scattered)?;
let handle: BindlessHandle = buf.bindless_handle().unwrap();
assert_eq!(handle.category(), BindlessCategory::Scattered);
assert_eq!(handle.index(), 3); // assigned by the device
When you bind handles at dispatch time, Goldy can validate that the handle's category matches what the shader expects in that slot — a Broadcast handle bound to a slot the shader reads through goldy_scattered is caught as a type error rather than silently producing garbage.
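Conceptually the check is just a per-slot comparison of category tags. An illustrative sketch in plain Rust (hypothetical types mirroring the description above, not Goldy's actual implementation):

```rust
// Illustrative category check. BindlessCategory/BindlessHandle here are
// stand-ins for the Goldy types described in this chapter.
#[derive(Clone, Copy, PartialEq, Debug)]
enum BindlessCategory { Scattered, Broadcast, Texture, StorageImage, Sampler }

#[derive(Clone, Copy, Debug)]
struct BindlessHandle { category: BindlessCategory, index: u32 }

/// Compare bound handles against the categories the shader declared per slot.
fn validate(expected: &[BindlessCategory], bound: &[BindlessHandle]) -> Result<(), String> {
    for (slot, (want, got)) in expected.iter().zip(bound).enumerate() {
        if *want != got.category {
            return Err(format!(
                "slot {slot}: shader expects {want:?}, got {:?} (index {})",
                got.category, got.index
            ));
        }
    }
    Ok(())
}

fn main() {
    // Shader declares (MyUniforms cfg, Scattered<T> data):
    let expected = [BindlessCategory::Broadcast, BindlessCategory::Scattered];
    let bad = [
        BindlessHandle { category: BindlessCategory::Scattered, index: 3 }, // wrong slot 0
        BindlessHandle { category: BindlessCategory::Scattered, index: 7 },
    ];
    // A Scattered handle in a Broadcast slot is rejected, not silently misread.
    assert!(validate(&expected, &bad).is_err());
}
```

The point of the check is the failure mode: an index alone would silently address the wrong descriptor heap, whereas a tagged handle turns the mismatch into an error message.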
Typed Bindless Parameters
In shader code, goldy_exp provides type aliases that map directly to the underlying Slang resource types. These are used as entry-point parameters in virtual entry points:
| Goldy Type | Underlying Slang Type | Usage |
|---|---|---|
| `Scattered<T>` | `RWStructuredBuffer<T>` | Read/write buffer: `data[i]`, `data[i].field = v` |
| `BufRO<T>` | `StructuredBuffer<T>` | Read-only buffer: `buf[i]` |
| `Interpolated<T>` | `Texture2D<T>` | Sampled texture: `tex.Sample(samp, uv)` |
| `DirectSpatial<T>` | `RWTexture2D<T>` | Writable texture: `img[int2(x,y)]` |
| `ByteAddress` | `RWByteAddressBuffer` | Raw byte access: `.Load()`, `.Store()`, `.Interlocked*()` |
| `Filter` | `SamplerState` | Sampler for texture filtering |
Any user-defined struct type (e.g. MyUniforms) declared as a parameter is automatically treated as a constant-buffer broadcast — no wrapper type needed.
Dispatch-Time Type Checking
When you call bind_resources_typed, Goldy compares each BindlessHandle.category against the shader's declared parameter types (extracted via extract_push_constant_categories):
let uniforms = uniform_buf.bindless_handle().unwrap(); // Broadcast
let data = storage_buf.bindless_handle().unwrap();     // Scattered

// Category validation happens here:
pass.bind_resources_typed(&[uniforms, data]);
pass.dispatch(workgroups, 1, 1);
If slot 0 expects Broadcast (from the shader's MyUniforms cfg parameter) but receives a Scattered handle, the dispatch fails with a clear error instead of producing undefined behavior.
Contrast with Traditional Binding
| | Traditional (Vulkan/DX12) | Goldy Bindless |
|---|---|---|
| Setup | Declare descriptor set layouts, allocate pools, create and update descriptor sets | Create resources; indices assigned automatically |
| Binding | Bind descriptor sets before each draw/dispatch | Pass BindlessHandle values as push constants |
| Shader access | layout(set=0, binding=1) buffer ... | Scattered<T> data as a function parameter |
| Validation | Runtime errors or silent corruption on mismatch | Category + stride checks at dispatch time |
| Cross-backend | Layout declarations differ per API | Same shader code on Vulkan, DX12, and Metal |
Example: Compute Shader with Bindless Resources
Shader (particle_update.slang):
import goldy_exp;
struct SimParams {
float dt;
uint count;
};
struct Particle {
float2 pos;
float2 vel;
};
[goldy_compute]
[numthreads(64, 1, 1)]
void cs_main(SimParams params, Scattered<Particle> particles, ThreadId id) {
if (id.x >= params.count) return;
Particle p = particles[id.x];
p.pos += p.vel * params.dt;
particles[id.x] = p;
}
Rust dispatch:
let params_buf = Buffer::with_data(&device, &[sim_params], DataAccess::Broadcast)?;
let particle_buf = Buffer::with_data(&device, &particles, DataAccess::Scattered)?;

let shader = ShaderModule::from_slang(&device, PARTICLE_UPDATE_SOURCE)?;
let pipeline = ComputePipeline::new(&device, &shader)?;

let mut encoder = ComputeEncoder::new();
let mut pass = encoder.begin_compute_pass();
pass.set_pipeline(&pipeline);
pass.bind_resources_typed(&[
    params_buf.bindless_handle().unwrap(),   // slot 0 → Broadcast → SimParams
    particle_buf.bindless_handle().unwrap(), // slot 1 → Scattered → Particle
]);
pass.dispatch(particle_count.div_ceil(64), 1, 1);
drop(pass);
encoder.dispatch(&device)?;
The shader author writes natural function parameters. The Rust side binds handles in declaration order. Goldy handles the rest — slot packing, category validation, and cross-backend descriptor plumbing.
Virtual Entry Points
Goldy's virtual entry points let you write shader entry points with clean, typed parameters instead of raw uniform uint slots and SV_* semantics. You annotate your function with [goldy_compute], [goldy_vertex], or [goldy_fragment], and a source-to-source transform generates the real Slang [shader("...")] entry point with all the bindless plumbing wired up.
The Attributes
| Attribute | Stage | Generated Slang Attribute |
|---|---|---|
| `[goldy_compute]` | Compute | `[shader("compute")]` |
| `[goldy_vertex]` | Vertex | `[shader("vertex")]` |
| `[goldy_fragment]` | Fragment | `[shader("fragment")]` |
A minimal example:
import goldy_exp;
[goldy_compute]
[numthreads(64, 1, 1)]
void cs_main(Scattered<uint> data, ThreadId id) {
data[id.x] = data[id.x] * 2;
}
This is equivalent to manually writing a [shader("compute")] entry point with uniform uint push-constant parameters, descriptor heap lookups, and SV_DispatchThreadID — but without any of that boilerplate.
What Virtual Entry Points Accept
Resource Parameters
Each resource parameter occupies one bindless slot (a 16-bit index packed into push constants). The generated wrapper calls the corresponding goldy_* free function to resolve the slot to a live GPU handle.
| Parameter Type | Resolves Via | Description |
|---|---|---|
| `Scattered<T>` | `goldy_scattered<T>(slot)` | Read/write storage buffer |
| `BufRO<T>` | `goldy_buf_ro<T>(slot)` | Read-only storage buffer |
| `Interpolated<T>` | `goldy_interpolated<T>(slot)` | Sampled 2D texture |
| `DirectSpatial<T>` | `goldy_direct_spatial<T>(slot)` | Read/write 2D texture |
| `ByteAddress` | `goldy_byte_address(slot)` | Raw byte-address buffer |
| `Filter` | `goldy_filter(slot)` | Sampler state |
Broadcast Parameters
Any user-defined struct type that isn't a recognized resource or system-value type is treated as a broadcast (constant buffer). The generated code calls goldy_broadcast<T>(slot) to fetch the entire struct from a uniform buffer:
struct SimParams { float dt; uint count; };
[goldy_compute]
[numthreads(64, 1, 1)]
void cs_main(SimParams params, Scattered<Particle> data, ThreadId id) {
// params is fetched from a constant buffer automatically
}
In vertex and fragment shaders, the last unrecognized struct is treated as the stage input (vertex attributes or fragment varyings) rather than a broadcast. All preceding unrecognized structs are broadcasts.
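The classification rule can be stated as a small function over the parameter type names. This is a sketch of the rule as described, not the transform's actual code, and it ignores scalar user parameters for brevity:

```rust
// Sketch of virtual-entry-point parameter classification (simplified).
#[derive(PartialEq, Debug)]
enum Kind { Resource, SystemValue, Broadcast, StageInput }

fn classify(params: &[&str], is_raster_stage: bool) -> Vec<Kind> {
    let is_resource = |t: &str| {
        ["Scattered<", "BufRO<", "Interpolated<", "DirectSpatial<", "ByteAddress", "Filter"]
            .iter().any(|p| t.starts_with(p))
    };
    let is_sv = |t: &str| {
        ["ThreadId", "GroupThreadId", "GroupId", "VertexId", "InstanceId", "IsFrontFace"]
            .contains(&t)
    };
    // The LAST unrecognized struct becomes the stage input in raster stages.
    let last_plain = params.iter().rposition(|t| !is_resource(t) && !is_sv(t));
    params.iter().enumerate().map(|(i, t)| {
        if is_resource(t) { Kind::Resource }
        else if is_sv(t) { Kind::SystemValue }
        else if is_raster_stage && Some(i) == last_plain { Kind::StageInput }
        else { Kind::Broadcast }
    }).collect()
}

fn main() {
    // Fragment stage: SceneUniforms is a broadcast, VSOutput is the stage input.
    let kinds = classify(&["SceneUniforms", "Interpolated<float4>", "Filter", "VSOutput"], true);
    println!("{:?}", kinds);
}
```

Running the same parameter list through a compute-stage classification (`is_raster_stage = false`) turns the trailing struct into a broadcast instead, which matches the rule stated above.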
System-Value Parameters
System-value wrapper types are mapped to SV_* semantics. The generated entry point declares the raw semantic parameter and constructs the wrapper:
| Wrapper Type | Maps To | Available Fields |
|---|---|---|
| `ThreadId` | `SV_DispatchThreadID` | `.x`, `.y`, `.z`, `.xy`, `.xyz` |
| `GroupThreadId` | `SV_GroupThreadID` | `.x`, `.y`, `.z`, `.xy`, `.xyz` |
| `GroupId` | `SV_GroupID` | `.x`, `.y`, `.z`, `.xy`, `.xyz` |
| `VertexId` | `SV_VertexID` | `.value` |
| `InstanceId` | `SV_InstanceID` | `.value` |
| `IsFrontFace` | `SV_IsFrontFace` | `.value` |
Scalar Parameters
Plain scalar types (uint, float, int, bool, and vector variants) become user parameters — full-precision u32 words in a separate region of the push constants. These are bound from Rust via bind_resources_raw_with_user:
[goldy_compute]
[numthreads(64, 1, 1)]
void cs_main(Scattered<uint> data, uint offset, ThreadId id) {
data[id.x + offset] += 1;
}
Pass-Through Parameters
In vertex and fragment shaders, the last unrecognized struct parameter passes through as a stage input (vertex attributes or interpolated varyings). It appears directly in the generated entry point signature without bindless resolution:
[goldy_fragment]
float4 fs_main(MyUniforms cfg, FullscreenVarying input) : SV_Target {
// cfg → broadcast (slot 0)
// input → pass-through stage input (interpolated varyings)
return float4(cfg.time, 0, 0, 1);
}
The Source-to-Source Transform
The transform (implemented in slang/virtual_main.rs) runs before Slang compilation and performs three operations:
- Generates a wrapper function with the real `[shader("...")]` attribute and a fixed 16-word push-constant signature.
- Renames the user function from `cs_main` to `_goldy_user_cs_main` so both can coexist.
- Removes the `[goldy_*]` attribute and `[numthreads]` from the renamed user function (they live on the generated wrapper).
Push Constant Layout
The generated entry point always declares a fixed signature regardless of how many parameters the user function has:
Words 0–7: _bw0.._bw7 — 16 × u16 bindless indices packed 2 per word
Words 8–15: _uw0.._uw7 — 8 × u32 user scalar parameters
Bindless indices are packed as pairs into 32-bit words: the low 16 bits of _bw0 hold slot 0, the high 16 bits hold slot 1, and so on. This fits up to 16 resource/broadcast parameters and 8 scalar parameters in 64 bytes of push constants.
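The packing rule is straightforward to reproduce: slot k lives in word k/2, shifted left 16 bits when k is odd. A sketch of the CPU-side packing (the function name is hypothetical; it only illustrates the layout described above):

```rust
// Pack 16-bit bindless slot indices two per 32-bit word, low half first.
fn pack_bindless_words(slots: &[u16]) -> [u32; 8] {
    assert!(slots.len() <= 16, "at most 16 bindless slots fit in 8 words");
    let mut words = [0u32; 8];
    for (k, &slot) in slots.iter().enumerate() {
        let shift = (k % 2) * 16; // even slot -> low 16 bits, odd slot -> high 16 bits
        words[k / 2] |= (slot as u32) << shift;
    }
    words
}

fn main() {
    // Slot 0 = broadcast index 3, slot 1 = scattered index 7.
    let words = pack_bindless_words(&[3, 7]);
    assert_eq!(words[0], 0x0007_0003);

    // Shader-side unpacking mirrors this, as in the generated wrapper:
    // _bw0 & 0xFFFF for slot 0, (_bw0 >> 16) & 0xFFFF for slot 1.
    assert_eq!(words[0] & 0xFFFF, 3);
    assert_eq!((words[0] >> 16) & 0xFFFF, 7);
}
```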
Before and After
What you write:
[goldy_compute]
[numthreads(64, 1, 1)]
void cs_main(TimeUniforms cfg, Scattered<uint> data, ThreadId id) {
data[id.x] = data[id.x] + cfg.base;
}
What gets compiled (generated wrapper prepended, user function renamed):
[shader("compute")]
[numthreads(64, 1, 1)]
void cs_main(uniform uint _bw0, ..., uniform uint _bw7,
uniform uint _uw0, ..., uniform uint _uw7,
uint3 _sv0 : SV_DispatchThreadID) {
TimeUniforms cfg = goldy_broadcast<TimeUniforms>(_bw0 & 0xFFFFu);
Scattered<uint> data = goldy_scattered<uint>((_bw0 >> 16u) & 0xFFFFu);
ThreadId id = ThreadId(_sv0);
_goldy_user_cs_main(cfg, data, id);
}
// Original function, renamed:
void _goldy_user_cs_main(TimeUniforms cfg, Scattered<uint> data, ThreadId id) {
data[id.x] = data[id.x] + cfg.base;
}
The #line 1 directive is inserted between the generated wrapper and the user source so that compiler diagnostics report correct line numbers.
Vertex/Fragment Example
[goldy_vertex]
VSOutput vs_main(SceneUniforms scene, Scattered<Instance> instances, VertexId vid, InstanceId iid) {
// scene → broadcast (slot 0)
// instances → scattered (slot 1)
// vid → SV_VertexID
// iid → SV_InstanceID
Instance inst = instances[iid.value];
VSOutput out;
// ... transform vertex ...
return out;
}
[goldy_fragment]
float4 fs_main(SceneUniforms scene, Interpolated<float4> albedo, Filter samp,
VSOutput input) : SV_Target {
// scene → broadcast (slot 0)
// albedo → texture (slot 1)
// samp → sampler (slot 2)
// input → pass-through stage varying
return albedo.Sample(samp, input.uv) * scene.tint;
}
Both entry points share the same push-constant layout. Fragment shader slot expectations take precedence when Goldy extracts category metadata (since resource binding typically lives there in a vertex+fragment pair).
Preprocessor Conditionals
Virtual entry points support #ifdef/#else/#endif blocks directly inside the parameter list. This is useful for shader variants like MSAA:
[goldy_compute]
[numthreads(4, 16, 1)]
void cs_main(BufRO<uint> config,
#ifdef msaa
BufRO<uint> mask_lut, DirectSpatial<float4> out_tex,
#else
DirectSpatial<float4> out_tex,
#endif
ThreadId tid) {
// ...
}
The transform generates conditional blocks in the wrapper's signature, body, and call arguments so that the correct branch is selected at compile time based on preprocessor defines.
Slang in One Source
Goldy uses Slang as its single shader language across all backends. You write one .slang file and Goldy compiles it to the native format for whichever GPU API is in use — no manual HLSL/GLSL/MSL translation, no per-backend shader files.
Compilation Targets
| Backend | Target Format | API Requirement |
|---|---|---|
| Vulkan | SPIR-V | Vulkan 1.4+ |
| DirectX 12 | DXIL | Windows 10+ |
| Metal | Metal IR | Metal Tier 2+ (Argument Buffers) |
Slang compiles through its native slang.dll / libslang.dylib — the same compiler used by NVIDIA, Khronos, and major game engines. Goldy links it directly; there is no intermediate translation step.
Why Slang
- One source: Vertex, fragment, and compute shaders all live in a single `.slang` file. No preprocessor gymnastics to target different backends.
- HLSL-compatible syntax: If you know HLSL, you already know Slang. Standard types (`float4`, `uint3`, `Texture2D`), standard intrinsics (`mul`, `lerp`, `smoothstep`), standard semantics (`SV_Position`, `SV_Target`).
- Modern language features: Modules (`import`), generics, interfaces, operator overloading, and automatic differentiation — features that HLSL and GLSL lack.
- Khronos governance: Long-term stability under open-source stewardship.
Cross-Backend Matrix Layout Consistency
Slang normalizes matrix layout across all backends. HLSL, GLSL, and Metal all default to column-major matrix storage, but the conventions for how mul(matrix, vector) is interpreted differ between APIs. Slang's compilation ensures that a float4x4 in your shader has identical memory layout and multiplication semantics whether it compiles to SPIR-V, DXIL, or Metal IR.
This means your Rust-side #[repr(C)] matrix types can use the same byte layout regardless of which backend the application runs on.
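To make the byte-layout claim concrete: in a column-major 4x4 matrix flattened to `[f32; 16]`, element (row r, column c) sits at index c*4 + r, so the identity's diagonal lands at indices 0, 5, 10, and 15. A sketch of that indexing (plain arrays; actual Goldy matrix types are not shown here):

```rust
// Column-major 4x4: element (row, col) lives at flat index col * 4 + row.
fn identity_col_major() -> [f32; 16] {
    let mut m = [0.0f32; 16];
    for i in 0..4 {
        m[i * 4 + i] = 1.0; // diagonal: row == col
    }
    m
}

// Matrix * column-vector with column-major indexing.
fn mat_mul_vec(m: &[f32; 16], v: [f32; 4]) -> [f32; 4] {
    let mut out = [0.0f32; 4];
    for row in 0..4 {
        for col in 0..4 {
            out[row] += m[col * 4 + row] * v[col];
        }
    }
    out
}

fn main() {
    let m = identity_col_major();
    // Identity maps every vector to itself under this indexing convention.
    assert_eq!(mat_mul_vec(&m, [1.0, 2.0, 3.0, 4.0]), [1.0, 2.0, 3.0, 4.0]);
}
```

Because the same flat index formula holds on every backend, a `#[repr(C)]` array of 16 floats uploaded from Rust is interpreted identically by SPIR-V, DXIL, and Metal IR builds of the shader.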
Shader Module Creation
Basic Compilation
ShaderModule::from_slang() compiles a Slang source string into GPU bytecode:
let shader = ShaderModule::from_slang(&device, r#"
    import goldy_exp;

    [goldy_compute]
    [numthreads(64, 1, 1)]
    void cs_main(Scattered<float> data, ThreadId id) {
        data[id.x] = data[id.x] * 2.0;
    }
"#)?;
The goldy_exp library is pre-registered on every device — import goldy_exp works without any setup.
Additional Search Paths
ShaderModule::from_slang_with_paths() adds filesystem directories to the Slang module search path:
let shader = ShaderModule::from_slang_with_paths(
    &device,
    source,
    &["my_project/shaders"],
)?;
Preprocessor Defines
ShaderModule::from_slang_with_paths_and_defines() passes preprocessor defines for shader variants:
let shader = ShaderModule::from_slang_with_paths_and_defines(
    &device,
    source,
    &[],
    &[("msaa", "1"), ("SAMPLE_COUNT", "4")],
)?;
Full Options
ShaderModule::from_slang_with_options() provides complete control — search paths, defines, optimization level, and layout validation checks:
let shader = ShaderModule::from_slang_with_options(
    &device,
    source,
    &["shaders/"],
    &[("DEBUG", "1")],
    OptimizationLevel::Default,
    &[TimeUniforms::LAYOUT_CHECK],
)?;
Built-in Shader Modules
Goldy ships a few complete shaders as Rust string constants in goldy::shader::builtins:
| Constant | Description |
|---|---|
| `VERTEX_COLOR_2D` | 2D vertex+fragment shader with per-vertex color |
| `SOLID_COLOR` | Solid-color fragment shader with a uniform |
These are self-contained (no import needed) and useful for bootstrapping:
use goldy::shader::builtins;

let shader = ShaderModule::from_slang(&device, builtins::VERTEX_COLOR_2D)?;
Shader Libraries
Shader libraries are reusable Slang modules registered with a Device. Once registered, any shader compiled on that device can import the library.
The Built-in goldy_exp Library
Every device comes with goldy_exp pre-registered. It provides:
- Resource type aliases (Scattered<T>, BufRO<T>, Interpolated<T>, etc.)
- System-value wrappers (ThreadId, VertexId, InstanceId, etc.)
- Vertex formats (FullscreenVarying, ColoredVarying, etc.)
- Math utilities (hash(), center_uv(), smootherstep(), etc.)
- Color utilities (rainbow(), palette(), hsv_to_rgb(), etc.)
- Procedural geometry (quad_position(), billboard_position(), etc.)
Registering Custom Libraries
```rust
use goldy::ShaderLibrary;

device.register_library(ShaderLibrary::from_source("myutils", r#"
    module myutils;

    public float3 my_effect(float t) {
        return float3(t, t * 0.5, 1.0 - t);
    }
"#))?;
```
Now any shader can import myutils:
import myutils;
[goldy_fragment]
float4 fs_main(FullscreenVarying input) : SV_Target {
return float4(my_effect(input.uv.x), 1.0);
}
Multi-Module Libraries
For larger libraries with internal sub-modules:
```rust
let lib = ShaderLibrary::from_embedded("effects", &[
    ("effects", r#"
        module effects;
        __include "effects/blur";
        __include "effects/bloom";
    "#),
    ("effects/blur", r#"
        implementing effects;
        public float4 gaussian_blur(Texture2D<float4> tex, SamplerState s, float2 uv) { ... }
    "#),
    ("effects/bloom", r#"
        implementing effects;
        public float4 bloom(Texture2D<float4> tex, SamplerState s, float2 uv, float threshold) { ... }
    "#),
]);
device.register_library(lib)?;
```
Loading from the Filesystem
#![allow(unused)] fn main() { let lib = ShaderLibrary::from_directory("effects", Path::new("shaders/effects/"))?; device.register_library(lib)?; }
Library Management
```rust
device.has_library("goldy_exp");      // true — always registered
device.list_libraries();              // ["goldy_exp", "myutils", ...]
device.unregister_library("myutils"); // remove a custom library
```
Layout Validation
When Rust structs are passed to shaders as uniform data (e.g. via Broadcast), the memory layout must match exactly. Goldy can validate this at compile time using Slang reflection.
Setup
1. Derive LayoutCheckable on your Rust struct:
#![allow(unused)] fn main() { #[derive(LayoutCheckable)] #[repr(C)] struct TimeUniforms { time: f32, delta_time: f32, frame: u32, _pad: u32, } }
2. Pass the layout check to shader compilation:
#![allow(unused)] fn main() { let shader = ShaderModule::from_slang_with_options( &device, source, &[], &[], OptimizationLevel::Default, &[TimeUniforms::LAYOUT_CHECK], )?; }
3. Enable validation via an environment variable:
```sh
GOLDY_VALIDATE_LAYOUTS=1 cargo run
# or
GOLDY_VALIDATION=layout cargo run
# or enable everything:
GOLDY_VALIDATION=all cargo run
```
What Gets Validated
- Field offsets: Each field's byte offset in the Rust struct is compared against the Slang reflection data.
- Struct size: Total size must match.
- Buffer element stride: At dispatch time, the buffer's recorded element stride is checked against what the shader expects.
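The first two checks can be pictured with std::mem::offset_of! and size_of. The expected values below stand in for what Slang reflection would report; this is a sketch of the kind of comparison the derive performs, not Goldy's actual code:

```rust
use std::mem::{offset_of, size_of};

// The struct from the setup example above.
#[repr(C)]
struct TimeUniforms {
    time: f32,
    delta_time: f32,
    frame: u32,
    _pad: u32,
}

fn main() {
    // Expected offsets stand in for Slang reflection data:
    // four 4-byte fields, packed back to back.
    assert_eq!(offset_of!(TimeUniforms, time), 0);
    assert_eq!(offset_of!(TimeUniforms, delta_time), 4);
    assert_eq!(offset_of!(TimeUniforms, frame), 8);
    assert_eq!(offset_of!(TimeUniforms, _pad), 12);
    // Total size must match too — no tail padding here.
    assert_eq!(size_of::<TimeUniforms>(), 16);
}
```

A mismatch in any of these assertions is exactly the class of bug the layout check reports before the shader ever runs.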
Validation is zero-cost when disabled — the checks are skipped entirely, not compiled out. The environment variable is read at runtime so it can be toggled without recompiling.
GOLDY_VALIDATION
The GOLDY_VALIDATION environment variable controls multiple validation categories:
| Value | Layout Checks | GPU API Validation |
|---|---|---|
layout | Yes | No |
api | No | Yes |
layout,api | Yes | Yes |
all | Yes | Yes |
1 / true / yes | No | Yes |
GOLDY_VALIDATE_LAYOUTS=1 is a standalone toggle that enables layout checks regardless of GOLDY_VALIDATION.
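The table's semantics can be sketched as a small parser — one plausible reading of the variable, not Goldy's actual implementation:

```rust
// A plausible parse of the GOLDY_VALIDATION table — illustrative only.
// Returns (layout_checks, api_validation).
fn parse_validation(value: &str) -> (bool, bool) {
    match value {
        "all" => (true, true),
        // Bare boolean values enable only GPU API validation.
        "1" | "true" | "yes" => (false, true),
        _ => {
            // Comma-separated category list: "layout", "api", or both.
            let mut layout = false;
            let mut api = false;
            for part in value.split(',') {
                match part.trim() {
                    "layout" => layout = true,
                    "api" => api = true,
                    _ => {}
                }
            }
            (layout, api)
        }
    }
}

fn main() {
    assert_eq!(parse_validation("layout"), (true, false));
    assert_eq!(parse_validation("api"), (false, true));
    assert_eq!(parse_validation("layout,api"), (true, true));
    assert_eq!(parse_validation("all"), (true, true));
    assert_eq!(parse_validation("1"), (false, true));
}
```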
ComputeEncoder
ComputeEncoder records compute commands into a flat command list. It is lock-free and can be used from any thread — no GPU interaction happens until you submit.
For multi-dispatch workloads with data dependencies between passes, prefer the Task Graph, which analyzes dependencies and inserts barriers automatically. ComputeEncoder is best for simple, single-dispatch workloads or cases where you manage barriers yourself.
Creating an encoder
#![allow(unused)] fn main() { let mut encoder = ComputeEncoder::new(); }
Recording a compute pass
Open a ComputePass, set a pipeline, bind resources, and dispatch:
#![allow(unused)] fn main() { let mut pass = encoder.begin_compute_pass(); pass.set_pipeline(&pipeline); pass.bind_resources_raw(&[buffer.bindless_index().unwrap()]); pass.dispatch(16, 1, 1); }
The pass borrows the encoder mutably. Drop it (or let it go out of scope) before opening another pass or finishing the encoder.
Binding resources
There are three ways to pass resource handles to a compute shader:
bind_resources — pass Buffer references directly. Indices are bound in declaration order:
#![allow(unused)] fn main() { pass.bind_resources(&[&particle_buffer, ¶ms_buffer]); }
bind_resources_raw — pass raw u32 slot indices. Use this when you need to mix buffer, texture, and sampler indices:
#![allow(unused)] fn main() { let tex_idx = texture.bindless_index().unwrap(); let buf_idx = buffer.bindless_index().unwrap(); pass.bind_resources_raw(&[buf_idx, tex_idx]); }
bind_resources_typed — pass typed BindlessHandles that carry both the index and the resource category:
#![allow(unused)] fn main() { let uniforms = uniform_buf.bindless_handle().unwrap(); let output = output_tex.bindless_handle().unwrap(); pass.bind_resources_typed(&[uniforms, output]); }
Per-dispatch scalar parameters
Parameters that aren't heap indices — offsets, counts, flags — are declared as typed entry-point parameters in the shader and passed alongside resource indices:
[goldy_compute]
[numthreads(64, 1, 1)]
void cs_main(Scattered<uint> data, uint offset, uint stride, ThreadId id) {
data[id.x * stride + offset] += 1;
}
#![allow(unused)] fn main() { pass.bind_resources_raw(&[data_buf.bindless_index().unwrap(), offset, stride]); }
Or use the two-region form to separate resource indices (region A) from user scalars (region B):
#![allow(unused)] fn main() { pass.bind_resources_raw_with_user( &[data_buf.bindless_index().unwrap()], &[offset, stride], ); }
Dispatching workgroups
The total thread count is the product of dispatch(x, y, z) and the shader's [numthreads(x, y, z)]:
```rust
let elements = 1024u32;
let threads_per_group = 64u32;
let groups = elements.div_ceil(threads_per_group);
pass.dispatch(groups, 1, 1); // 16 groups × 64 threads = 1024
```
Indirect dispatch
Let a prior pass write the workgroup counts into a buffer, then read them at dispatch time:
#![allow(unused)] fn main() { pass.dispatch_indirect(&count_buffer, 0); }
The buffer must contain three consecutive u32 values (x, y, z) at the given byte offset.
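On the CPU side, that argument region can be packed into bytes like this. It is a sketch: little-endian byte order is an assumption (true of every platform Goldy targets in practice), and pack_dispatch_args is our name, not a Goldy API:

```rust
// Pack (x, y, z) workgroup counts as three consecutive u32 values —
// the 12-byte layout dispatch_indirect reads at the given byte offset.
// Little-endian order is assumed; `pack_dispatch_args` is hypothetical.
fn pack_dispatch_args(x: u32, y: u32, z: u32) -> [u8; 12] {
    let mut bytes = [0u8; 12];
    bytes[0..4].copy_from_slice(&x.to_le_bytes());
    bytes[4..8].copy_from_slice(&y.to_le_bytes());
    bytes[8..12].copy_from_slice(&z.to_le_bytes());
    bytes
}

fn main() {
    let args = pack_dispatch_args(16, 1, 1);
    assert_eq!(args.len(), 12);
    assert_eq!(args[0..4], 16u32.to_le_bytes());
    assert_eq!(args[4..8], 1u32.to_le_bytes());
}
```

A prior compute pass would typically write these three values directly on the GPU instead; the CPU-side packing just shows the expected memory layout.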
Barriers and buffer clears
Insert a global memory barrier between dispatches within the same encoder:
#![allow(unused)] fn main() { pass.barrier(); }
Clear a buffer region to zero, batched into the same submission:
```rust
pass.clear_buffer(&buffer, 0, 0); // size=0 → clear to end of buffer
```
Submitting
Blocking — submit and wait for the GPU to finish:
#![allow(unused)] fn main() { encoder.dispatch(&device)?; }
Non-blocking — submit and get a TimelineValue for later synchronization:
```rust
let tv = encoder.submit(&device)?;
// CPU work while GPU is busy...
device.wait_until(tv)?;
```
See Device Timeline for more on TimelineValue and gpu_progress.
Recording into a task graph
For multi-pass workloads, record each dispatch as a task graph node instead of using ComputeEncoder directly. The task graph handles barriers for you:
#![allow(unused)] fn main() { let mut graph = TaskGraph::new(); graph.node("my_pass", &pipeline) .bind_buffer(&buf, NodeAccess::ReadWrite) .bind_resources_raw(&[buf.bindless_index().unwrap()]) .dispatch(16, 1, 1); graph.dispatch(&device)?; }
See Task Graph for the full API.
Task Graph
The task graph is one of Goldy's core abstractions. It pairs the bindless resource model with explicit dependency declarations so the runtime can insert optimal barriers and maximize GPU parallelism — all within a single command buffer.
Why the task graph exists
Goldy uses a bindless resource model: shaders access buffers and textures through heap-backed argument buffers indexed by slot numbers. This gives shaders flexible, low-overhead access to any resource, but it makes the GPU's automatic dependency tracking blind. Metal, for example, cannot see through argument buffer indirection to know which resources a dispatch reads or writes, so it cannot insert barriers automatically.
Without the task graph, the only correct approach is to submit each dispatch as a separate command buffer. This works, but it serializes everything and adds per-command-buffer scheduling overhead — worse than APIs like wgpu that use bind groups to infer hazards.
The task graph solves this: you declare what each node reads and writes, and Goldy does the rest.
- Builds a dependency DAG from declared resource access patterns
- Groups independent dispatches into waves that execute concurrently
- Inserts per-resource barriers only at true dependency edges (RAW, WAR, WAW)
- Submits everything in a single command buffer
Building a task graph
Create a TaskGraph, add nodes with resource access declarations, and submit:
```rust
use goldy::{TaskGraph, NodeAccess};

let mut graph = TaskGraph::new();

graph.node("write_data", &pipeline_a)
    .bind_buffer(&buf, NodeAccess::Write)
    .bind_resources_raw(&[buf_idx])
    .dispatch(64, 1, 1);

graph.node("read_data", &pipeline_b)
    .bind_buffer(&buf, NodeAccess::Read)
    .bind_resources_raw(&[buf_idx])
    .dispatch(64, 1, 1);

let tv = graph.submit(&device)?;
device.wait_until(tv)?;
```
The analyzer sees that read_data depends on write_data (RAW hazard on buf) and inserts a barrier between them. If two nodes touch completely different resources, they execute in the same wave with no barrier.
Node types
| Builder method | GPU operation |
|---|---|
graph.node(label, &pipeline) | Compute dispatch (direct or indirect) |
graph.clear_buffer(&buf, offset, size) | GPU-side buffer zero-fill |
graph.clear_buffer_view(&view, offset, size) | GPU-side zero-fill of a pool view region |
graph.write_buffer(&buf, offset, data) | CPU→GPU buffer upload |
graph.write_texture(&tex, data) | CPU→GPU texture upload |
graph.render_pass(label, &target) | Offscreen render pass |
All node types participate in the same dependency analysis.
Declaring resource access
Each node declares its resource access via bind_buffer, bind_buffer_view, or bind_texture:
#![allow(unused)] fn main() { graph.node("reduce", &pipeline) .bind_buffer(&input, NodeAccess::Read) .bind_buffer(&output, NodeAccess::Write) .bind_resources_raw(&[input_idx, output_idx]) .dispatch(64, 1, 1); }
bind_resources_raw sets the actual shader slot indices. The bind_buffer / bind_texture calls are purely for dependency analysis — they tell the scheduler what this node touches, not how to bind it.
Finalizing nodes
Compute nodes must be finalized with dispatch(x, y, z) or dispatch_indirect(&buf, offset). Render pass nodes are finalized with finish(commands) or finish_encoder(encoder).
NodeAccess and SWMR scheduling
NodeAccess is the per-node logical access, orthogonal to a buffer's physical DataAccess:
```rust
pub enum NodeAccess {
    Read,      // can overlap with other Reads
    Write,     // exclusive access
    ReadWrite, // exclusive access
}
```
The scheduler implements single-writer/multiple-reader (SWMR) parallelism:
- Multiple Read nodes on the same resource run concurrently in the same wave.
- A Write or ReadWrite node serializes against all prior accessors of that resource.
- Barriers are inserted only at true RAW/WAR/WAW edges.
Diamond example
```rust
let mut graph = TaskGraph::new();

// Wave 0: A writes buf_x
graph.node("A", &p1)
    .bind_buffer(&buf_x, NodeAccess::Write)
    .dispatch(1, 1, 1);

// Wave 1: B and C both read buf_x (SWMR — they run concurrently)
graph.node("B", &p2)
    .bind_buffer(&buf_x, NodeAccess::Read)
    .bind_buffer(&buf_y, NodeAccess::Write)
    .dispatch(1, 1, 1);
graph.node("C", &p3)
    .bind_buffer(&buf_x, NodeAccess::Read)
    .bind_buffer(&buf_z, NodeAccess::Write)
    .dispatch(1, 1, 1);

// Wave 2: D reads both outputs
graph.node("D", &p4)
    .bind_buffer(&buf_y, NodeAccess::Read)
    .bind_buffer(&buf_z, NodeAccess::Read)
    .dispatch(1, 1, 1);

graph.dispatch(&device)?;
```
This produces three waves with two barriers — the minimum possible for this dependency pattern.
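The wave numbers fall out mechanically from the declared accesses. A toy SWMR scheduler — illustrative only, not Goldy's implementation — reproduces the diamond's 0/1/1/2 assignment:

```rust
use std::collections::HashMap;

// Toy SWMR wave assignment. Each node lists (resource_id, access);
// a node's wave is one past the latest conflicting wave.
#[derive(Clone, Copy, PartialEq)]
enum Access { Read, Write }

fn assign_waves(nodes: &[Vec<(u32, Access)>]) -> Vec<usize> {
    let mut last_write: HashMap<u32, usize> = HashMap::new();
    let mut last_read: HashMap<u32, usize> = HashMap::new();
    let mut waves = Vec::new();
    for node in nodes {
        let mut wave = 0usize;
        for &(r, acc) in node {
            // RAW: any access must follow the latest write.
            if let Some(&w) = last_write.get(&r) {
                wave = wave.max(w + 1);
            }
            // WAR/WAW: a write must also follow prior reads.
            if acc == Access::Write {
                if let Some(&w) = last_read.get(&r) {
                    wave = wave.max(w + 1);
                }
            }
        }
        for &(r, acc) in node {
            match acc {
                Access::Write => { last_write.insert(r, wave); }
                Access::Read => {
                    let e = last_read.entry(r).or_insert(wave);
                    *e = (*e).max(wave);
                }
            }
        }
        waves.push(wave);
    }
    waves
}

fn main() {
    use Access::*;
    // The diamond: A writes X; B and C read X, writing Y and Z; D reads both.
    let nodes = vec![
        vec![(0, Write)],
        vec![(0, Read), (1, Write)],
        vec![(0, Read), (2, Write)],
        vec![(1, Read), (2, Read)],
    ];
    assert_eq!(assign_waves(&nodes), vec![0, 1, 1, 2]);
}
```

Note how B and C land in the same wave because reads never conflict with each other — the SWMR rule in action.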
Buffer views and pool tracking
When using BufferPool, you can declare access at view granularity. Non-overlapping views of the same pool produce no dependency edge and execute in the same wave:
```rust
let view_a = pool.alloc::<u32>(64)?;
let view_b = pool.alloc::<u32>(64)?;

let mut graph = TaskGraph::new();

graph.node("write_a", &pipeline)
    .bind_buffer_view(&view_a, NodeAccess::Write)
    .dispatch(1, 1, 1);

graph.node("write_b", &pipeline)
    .bind_buffer_view(&view_b, NodeAccess::Write)
    .dispatch(1, 1, 1);

// No barrier — view_a and view_b occupy disjoint byte ranges
graph.dispatch(&device)?;
```
Barriers are emitted against the parent buffer handle, so backends require no changes. The scheduler tracks byte ranges internally to determine true overlap.
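The overlap test itself is simple: two half-open byte ranges conflict only if each begins before the other ends. A minimal sketch of that check:

```rust
// Half-open byte ranges [start, end) overlap iff each starts before the
// other ends — the test a scheduler applies to views of one pool.
fn ranges_overlap(a: (u64, u64), b: (u64, u64)) -> bool {
    a.0 < b.1 && b.0 < a.1
}

fn main() {
    // Two 256-byte views allocated back to back in one pool: disjoint.
    let view_a = (0u64, 256u64);
    let view_b = (256u64, 512u64);
    assert!(!ranges_overlap(view_a, view_b));

    // A view straddling the boundary conflicts with both.
    assert!(ranges_overlap((128, 384), view_a));
    assert!(ranges_overlap((128, 384), view_b));
}
```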
Transient resources
Transient buffers and textures exist only for the lifetime of a single graph submission. They are allocated from a shared heap, and non-overlapping lifetimes can alias onto the same memory — reducing allocation pressure for temporaries.
```rust
let mut graph = TaskGraph::new();
let tmp = graph.transient_buffer(256);

graph.node("produce", &pipeline_a)
    .bind_transient_buffer(tmp, NodeAccess::Write)
    .bind_resources_raw(&[0])
    .dispatch(1, 1, 1);

graph.node("consume", &pipeline_b)
    .bind_transient_buffer(tmp, NodeAccess::Read)
    .bind_resources_raw(&[0])
    .dispatch(1, 1, 1);

graph.dispatch(&device)?;
```
Transient textures work the same way:
#![allow(unused)] fn main() { let tmp_tex = graph.transient_texture(width, height, TextureFormat::Rgba8Unorm); graph.node("render", &pipeline) .bind_transient_texture(tmp_tex, NodeAccess::Write) .bind_resources_raw(&[0]) .dispatch(wg_x, wg_y, 1); }
When transients are used, the graph blocks until the GPU completes so the staging heap can be freed. The scheduler uses wave-interval analysis to determine which transients can alias: if two transient buffers are never live in the same wave, they share the same backing memory.
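The aliasing decision can be sketched as greedy interval packing over wave lifetimes. This is illustrative only (not Goldy's allocator) and assumes transients are visited in order of first use:

```rust
// Toy wave-interval aliasing: transients whose [first_wave, last_wave]
// lifetimes never overlap can share one backing allocation.
// Assumes `lifetimes` is sorted by first wave.
fn alias_slots(lifetimes: &[(u32, u32)]) -> Vec<usize> {
    // slot_end[s] = last wave in which slot s is still live
    let mut slot_end: Vec<u32> = Vec::new();
    let mut slots = Vec::with_capacity(lifetimes.len());
    for &(first, last) in lifetimes {
        // Reuse the first slot that went dead before this lifetime starts.
        match slot_end.iter().position(|&end| end < first) {
            Some(s) => { slot_end[s] = last; slots.push(s); }
            None => { slot_end.push(last); slots.push(slot_end.len() - 1); }
        }
    }
    slots
}

fn main() {
    // Transient 0 dies in wave 1, transient 1 lives waves 2-3 — they alias.
    // Transient 2 overlaps transient 1, so it needs its own memory.
    let slots = alias_slots(&[(0, 1), (2, 3), (3, 4)]);
    assert_eq!(slots, vec![0, 0, 1]);
}
```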
Per-resource barriers on Metal
The graph emits ResourceBarrier commands with per-resource granularity. Each backend maps this to its native mechanism:
| Backend | Behavior |
|---|---|
| Metal | memoryBarrierWithResources:count: — precise per-resource barriers within a single compute encoder |
| Vulkan | Global compute pipeline barrier (per-resource VkBufferMemoryBarrier is a future optimization) |
| DX12 | Global UAV barrier (per-resource D3D12_RESOURCE_BARRIER is a future optimization) |
On Metal — the primary beneficiary — the graph enables single-encoder submission with per-resource barriers, eliminating the per-command-buffer overhead of the one-dispatch-per-command-buffer workaround.
Single command buffer submission
All nodes in a TaskGraph are submitted in a single command buffer (or compute encoder on Metal). The scheduler groups independent nodes into waves and inserts barriers only between waves that have true data dependencies. This minimizes scheduling overhead and enables the GPU to overlap independent work within a wave.
Blocking vs non-blocking submission
Non-blocking — returns a TimelineValue for CPU-side synchronization:
```rust
let tv = graph.submit(&device)?;
// CPU work while GPU executes...
device.wait_until(tv)?;
```
Blocking — submits and waits for completion:
#![allow(unused)] fn main() { graph.dispatch(&device)?; }
Practical example: Game of Life
A ping-pong compute pattern using buffer pool views and the task graph:
```rust
let (read_view, write_view) = if use_buffer_a {
    (&view_a, &view_b)
} else {
    (&view_b, &view_a)
};

let mut graph = TaskGraph::new();
graph.node("game_of_life", &compute_pipeline)
    .bind_buffer_view(read_view, NodeAccess::Read)
    .bind_buffer_view(write_view, NodeAccess::Write)
    .bind_resources_raw(&[
        read_view.bindless_handle().unwrap().index(),
        write_view.bindless_handle().unwrap().index(),
    ])
    .dispatch(GRID_WIDTH.div_ceil(8), GRID_HEIGHT.div_ceil(8), 1);

graph.dispatch(&device)?;
use_buffer_a = !use_buffer_a;
```
The graph analyzes the Read and Write declarations on each view and inserts barriers only where needed. Because the two views occupy disjoint byte ranges in the same pool, the scheduler can verify they don't alias — enabling correct execution with minimal synchronization.
Device Timeline
Goldy tracks GPU completion with a monotonic timeline counter — a u64 value (TimelineValue) that increments with each submission. This replaces fence-per-submission models with a single, always-increasing counter on the device.
TimelineValue
Every non-blocking submission returns a TimelineValue:
#![allow(unused)] fn main() { let tv: TimelineValue = graph.submit(&device)?; }
This value represents a point on the device's timeline. When the GPU finishes executing that submission, the timeline advances past tv.
Both TaskGraph::submit and ComputeEncoder::submit return timeline values. Surface presentation via Frame::present also returns one.
Querying GPU progress
device.gpu_progress() returns the latest completed timeline value without blocking:
```rust
let current = device.gpu_progress();
if current >= tv {
    // submission has finished — safe to read back results
}
```
This is a lightweight query (single atomic read on most backends) suitable for polling in a loop or checking once per frame.
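Conceptually, the timeline is just a pair of monotonically increasing counters. A minimal model — not Goldy's internals — shows why the poll is a single atomic load:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// A minimal model of a device timeline: submissions claim increasing
// values; a completion signal advances the counter. Illustrative only.
struct Timeline {
    completed: AtomicU64, // latest value the "GPU" has finished
    next: AtomicU64,      // value handed to the next submission
}

impl Timeline {
    fn new() -> Self {
        Timeline { completed: AtomicU64::new(0), next: AtomicU64::new(1) }
    }

    fn submit(&self) -> u64 {
        // Each submission claims the next point on the timeline.
        self.next.fetch_add(1, Ordering::Relaxed)
    }

    fn signal(&self, value: u64) {
        // Called when work for `value` finishes; never moves backward.
        self.completed.fetch_max(value, Ordering::Release);
    }

    fn gpu_progress(&self) -> u64 {
        // The lightweight poll: one atomic load.
        self.completed.load(Ordering::Acquire)
    }
}

fn main() {
    let tl = Timeline::new();
    let tv_a = tl.submit();
    let tv_b = tl.submit();
    assert!(tl.gpu_progress() < tv_a);
    // Completing b implies a completed too (in-order execution).
    tl.signal(tv_b);
    assert!(tl.gpu_progress() >= tv_a && tl.gpu_progress() >= tv_b);
}
```

The monotonic `fetch_max` is what makes transitive reasoning valid: once the counter passes a value, it has passed every smaller value as well.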
Waiting for completion
device.wait_until(value) blocks the current thread until the GPU timeline reaches at least value:
```rust
let tv = graph.submit(&device)?;

// CPU work while GPU executes...
prepare_next_frame();

// Block until this specific submission completes
device.wait_until(tv)?;
```
For bounded waits, use wait_until_timeout:
```rust
let completed = device.wait_until_timeout(tv, 1000)?; // 1 second timeout
if !completed {
    // GPU hasn't finished yet — handle timeout
}
```
Blocking dispatch
For simple cases where you don't need CPU/GPU overlap, dispatch combines submit + wait:
```rust
graph.dispatch(&device)?; // submits and blocks until complete
```
This is equivalent to:
#![allow(unused)] fn main() { let tv = graph.submit(&device)?; device.wait_until(tv)?; }
How this differs from fence-based synchronization
Traditional GPU APIs use one fence object per submission. You create a fence, attach it to a submit call, then query or wait on that specific fence. Managing multiple in-flight submissions means tracking multiple fence objects.
Goldy's timeline is a single monotonic counter shared across all submissions on a device:
| Fence-based | Timeline-based | |
|---|---|---|
| Tracking | One fence per submission | One counter for the device |
| Query | Poll each fence individually | gpu_progress() >= value |
| Wait | Wait on a specific fence | wait_until(value) |
| Ordering | Fences are independent | Values are monotonically ordered |
| Multi-frame | Track N fence objects | Compare N u64 values |
Because timeline values are ordered, you can reason about completion transitively: if gpu_progress() >= tv_b and tv_b > tv_a, then tv_a has also completed.
Practical use cases
CPU readback after compute
#![allow(unused)] fn main() { let tv = graph.submit(&device)?; device.wait_until(tv)?; let result: Vec<f32> = buffer.read_data(0)?; }
Multi-frame pipelining
Overlap CPU frame N+1 preparation with GPU frame N execution:
```rust
let mut pending: Option<TimelineValue> = None;

loop {
    // Wait for the previous frame to finish before reusing its resources
    if let Some(tv) = pending {
        device.wait_until(tv)?;
    }

    // Prepare frame N+1 on the CPU
    update_uniforms(&uniform_buffer)?;

    // Submit frame N+1 — GPU starts working, CPU continues
    let tv = graph.submit(&device)?;
    pending = Some(tv);

    // CPU work for the next iteration...
}
```
Polling without blocking
Check completion in a non-blocking render loop:
```rust
let tv = graph.submit(&device)?;
loop {
    if device.gpu_progress() >= tv {
        break; // done
    }
    // do other work, yield, etc.
}
```
Resource lifetime
Dropping a Buffer or Texture may be deferred internally: the GPU memory stays alive until all submissions that reference it have completed. Submit (or present a frame) before dropping resources that must outlive those commands.
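A sketch of the bookkeeping this implies: a queue of (timeline value, resource) pairs drained as gpu_progress advances. The names are hypothetical, and this is illustrative rather than Goldy's internals:

```rust
use std::collections::VecDeque;

// Deferred resource destruction keyed by timeline value: a dropped
// resource is parked until gpu_progress passes the last submission
// that referenced it. Names here are hypothetical.
struct DeferredQueue<T> {
    pending: VecDeque<(u64, T)>, // (last referencing submission, resource)
}

impl<T> DeferredQueue<T> {
    fn new() -> Self {
        DeferredQueue { pending: VecDeque::new() }
    }

    fn defer(&mut self, last_use: u64, resource: T) {
        // Timeline values are monotonic, so the queue stays sorted.
        self.pending.push_back((last_use, resource));
    }

    /// Free everything whose last referencing submission has completed;
    /// returns how many resources were released.
    fn collect(&mut self, gpu_progress: u64) -> usize {
        let mut freed = 0;
        while matches!(self.pending.front(), Some(&(tv, _)) if tv <= gpu_progress) {
            self.pending.pop_front(); // dropping T releases the memory
            freed += 1;
        }
        freed
    }
}

fn main() {
    let mut q = DeferredQueue::new();
    q.defer(5, "buffer_a");
    q.defer(7, "buffer_b");
    assert_eq!(q.collect(6), 1); // only buffer_a's submission has finished
    assert_eq!(q.collect(7), 1); // buffer_b retires once the GPU reaches 7
}
```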
Compute to Surface
Compute-to-surface lets a compute shader write directly to the swapchain texture, bypassing the rasterization pipeline entirely. There is no RenderPipeline, no vertex buffers, no CommandEncoder — just a compute dispatch that fills pixels.
When to use compute-to-surface
Use compute-to-surface when your rendering is naturally a per-pixel computation rather than geometry rasterization:
- Fullscreen image effects (plasma, fractals, ray marching)
- GPU-driven 2D renderers where the compute shader owns the output layout
- Post-processing that doesn't need triangle rasterization
- Prototyping visual effects without setting up a render pipeline
Use traditional rendering when you need the rasterization pipeline's features: triangle assembly, depth testing, MSAA, alpha blending, or vertex/fragment shader stages.
Getting the swapchain texture
Acquire a frame from the surface and call frame.texture() to get a writable Texture handle to the current swapchain image:
#![allow(unused)] fn main() { let frame = surface.begin()?; let texture = frame.texture(); }
This texture is valid until the frame is presented. You can obtain its bindless handle and pass it to a compute shader like any other texture:
#![allow(unused)] fn main() { let texture_handle = texture .bindless_handle() .expect("Surface texture has no bindless handle"); }
Building the task graph
Create a TaskGraph with a compute node that writes to the swapchain texture. The task graph handles barrier insertion between compute writes and the presentation engine:
#![allow(unused)] fn main() { let wg_x = width.div_ceil(8); let wg_y = height.div_ceil(8); let mut graph = TaskGraph::new(); graph.node("compute", &compute_pipeline) .bind_buffer(&uniform_buffer, NodeAccess::Read) .bind_resources_raw(&[uniform_handle.index(), texture_handle.index()]) .dispatch(wg_x, wg_y, 1); }
Submitting and presenting
Use frame.submit_compute(graph) to record the compute work into the frame, then present:
#![allow(unused)] fn main() { frame.submit_compute(&graph)?; frame.present()?; }
submit_compute compiles the task graph into a command stream and records it into the frame's command buffer. Presentation happens when you call present() — the compute shader has already written the pixels.
The compute shader
The shader receives the output texture as a DirectSpatial<float4> — a read-write 2D texture accessed by integer coordinates:
import goldy_exp;
struct Uniforms {
uint width;
uint height;
float time;
float _padding;
};
[goldy_compute]
[numthreads(8, 8, 1)]
void cs_main(BufRO<Uniforms> uniforms_buf, DirectSpatial<float4> output, ThreadId tid) {
Uniforms u = uniforms_buf[0];
if (tid.x >= u.width || tid.y >= u.height)
return;
float2 uv = float2(float(tid.x) / float(u.width),
float(tid.y) / float(u.height));
// Compute pixel color...
float3 col = my_color_function(uv, u.time);
output[tid.xy] = float4(col, 1.0);
}
The [numthreads(8, 8, 1)] workgroup size maps naturally to 2D image tiles. Dispatch enough workgroups to cover the full resolution:
#![allow(unused)] fn main() { let wg_x = width.div_ceil(8); let wg_y = height.div_ceil(8); }
Guard against out-of-bounds writes in the shader when the resolution isn't a multiple of the workgroup size.
Full example
A complete compute-to-surface application rendering an animated plasma effect:
```rust
use goldy::{
    Buffer, ComputePipeline, DataAccess, DeviceType, Instance, NodeAccess,
    PresentMode, ShaderModule, Surface, SurfaceConfig, TaskGraph,
};

// Create device and surface
let instance = Instance::new()?;
let device = instance.create_device(DeviceType::DiscreteGpu)?;
let surface = Surface::new_with_config(
    &device,
    &window,
    SurfaceConfig {
        present_mode: PresentMode::Fifo,
        depth_format: None,
    },
)?;

// Compile compute shader and create pipeline
let shader = ShaderModule::from_slang(&device, COMPUTE_SHADER)?;
let compute_pipeline = ComputePipeline::new(&device, &shader)?;

// Create uniform buffer
let uniform_buffer = Buffer::with_data(
    &device,
    &[Uniforms {
        width: surface.width(),
        height: surface.height(),
        time: 0.0,
        _padding: 0.0,
    }],
    DataAccess::Scattered,
)?;

// --- Render loop ---

// Update uniforms
uniform_buffer.write(0, bytemuck::bytes_of(&Uniforms {
    width,
    height,
    time: elapsed,
    _padding: 0.0,
}))?;

// Acquire frame and get swapchain texture
let frame = surface.begin()?;
let texture = frame.texture();

let uniform_handle = uniform_buffer
    .bindless_srv_handle()
    .expect("Uniform buffer has no bindless SRV handle");
let texture_handle = texture
    .bindless_handle()
    .expect("Surface texture has no bindless handle");

// Build and submit compute graph
let wg_x = width.div_ceil(8);
let wg_y = height.div_ceil(8);

let mut graph = TaskGraph::new();
graph.node("compute", &compute_pipeline)
    .bind_buffer(&uniform_buffer, NodeAccess::Read)
    .bind_resources_raw(&[uniform_handle.index(), texture_handle.index()])
    .dispatch(wg_x, wg_y, 1);

frame.submit_compute(&graph)?;
frame.present()?;
```
The uniform buffer uses bindless_srv_handle() because the shader accesses it through BufRO<Uniforms>, which maps to a read-only SRV on DX12. On Vulkan and Metal this falls back to the unified storage-buffer index.
Pipelines
Pipelines combine compiled shaders with fixed-function rendering state. Goldy provides RenderPipeline for graphics and ComputePipeline for compute workloads.
Render Pipelines
A RenderPipeline pairs vertex and fragment shaders with a RenderPipelineDesc that configures vertex input, primitive assembly, depth testing, and the output format.
Creating a Render Pipeline
```rust
use goldy::{
    RenderPipeline, RenderPipelineDesc, ShaderModule, Vertex2D,
    TextureFormat, PrimitiveTopology,
};

let vs = ShaderModule::from_slang(&device, include_str!("shaders/tri.vs.slang"))?;
let fs = ShaderModule::from_slang(&device, include_str!("shaders/tri.fs.slang"))?;

let pipeline = RenderPipeline::new(&device, &vs, &fs, &RenderPipelineDesc {
    vertex_layout: Vertex2D::layout(),
    topology: PrimitiveTopology::TriangleList,
    target_format: surface.format(),
    depth_stencil: None,
})?;
```
RenderPipelineDesc
```rust
pub struct RenderPipelineDesc {
    pub vertex_layout: VertexBufferLayout,
    pub topology: PrimitiveTopology,
    pub target_format: TextureFormat,
    pub depth_stencil: Option<DepthStencilState>,
}
```
| Field | Purpose | Default |
|---|---|---|
vertex_layout | Describes vertex buffer stride and attributes | Empty (no vertex input) |
topology | How vertices are assembled into primitives | TriangleList |
target_format | Pixel format of the render target — must match surface.format() or the format passed to RenderTarget::new() | Rgba8Unorm |
depth_stencil | Depth/stencil test configuration, or None to disable | None |
The default descriptor is valid for fullscreen passes that generate geometry from SV_VertexID and render to an Rgba8Unorm target without depth testing.
Format Matching
The pipeline's target_format must match the render target it will draw into. Mismatched formats produce backend errors or undefined output.
#![allow(unused)] fn main() { let desc = RenderPipelineDesc { target_format: surface.format(), ..Default::default() }; }
Vertex Buffer Layouts
A VertexBufferLayout tells the pipeline how to interpret vertex buffer memory. For passes that do not use vertex buffers (fullscreen triangles, quad instancing), the default empty layout is correct.
For typed vertex input, use the from_formats builder or a built-in type's layout() method. See Vertex Types and Layouts for details.
```rust
let layout = VertexBufferLayout::from_formats::<MyVertex>(&[
    VertexFormat::Float32x3, // position
    VertexFormat::Float32x2, // uv
]);
```
Primitive Topology
Controls how the vertex stream is assembled into geometric primitives:
```rust
pub enum PrimitiveTopology {
    PointList,
    LineList,
    LineStrip,
    TriangleList, // default
    TriangleStrip,
}
```
PointList: • • • •
LineList: •——• •——•
LineStrip: •——•——•——•
TriangleList: △ △
TriangleStrip: △▽△▽
Depth/Stencil State
Enable depth testing by setting depth_stencil. The surface or render target must have been created with a matching depth format.
```rust
use goldy::{DepthStencilState, DepthFormat, CompareFunction};

let pipeline = RenderPipeline::new(&device, &vs, &fs, &RenderPipelineDesc {
    vertex_layout: Vertex2D::layout(),
    target_format: surface.format(),
    topology: PrimitiveTopology::TriangleList,
    depth_stencil: Some(DepthStencilState {
        format: DepthFormat::Depth32Float,
        depth_write_enabled: true,
        depth_compare: CompareFunction::Less,
    }),
})?;
```
DepthStencilState fields:
| Field | Purpose | Default |
|---|---|---|
format | Depth texture format (Depth16Unorm, Depth24Plus, Depth32Float, etc.) | Depth24Plus |
depth_write_enabled | Whether fragments write to the depth buffer | true |
depth_compare | Comparison function — Less, LessEqual, Greater, Always, etc. | Less |
Available depth formats:
| Format | Bits | Stencil |
|---|---|---|
Depth16Unorm | 16-bit | No |
Depth24Plus | 24-bit (may use 32 internally) | No |
Depth24PlusStencil8 | 24-bit + 8-bit stencil | Yes |
Depth32Float | 32-bit float | No |
Depth32FloatStencil8 | 32-bit float + 8-bit stencil | Yes |
For reverse-Z rendering, use CompareFunction::Greater and clear depth to 0.0.
Compute Pipelines
ComputePipeline wraps a single compute shader. See the compute documentation for the full compute API.
#![allow(unused)] fn main() { use goldy::{ComputePipeline, ShaderModule}; let cs = ShaderModule::from_slang(&device, include_str!("shaders/sim.cs.slang"))?; let pipeline = ComputePipeline::new(&device, &cs)?; }
Why Goldy Has Fewer Pipelines
Pipeline State Object (PSO) explosion is one of the biggest pain points in modern graphics. Engines routinely manage thousands of pipeline permutations and ship massive shader caches. Goldy eliminates most combinatorial dimensions:
| Dimension | Traditional Vulkan/DX12 | Goldy |
|---|---|---|
| Render pass compatibility | N render passes × M subpasses | Eliminated — dynamic rendering |
| Descriptor set layouts | Per-material layout permutations | One global bindless layout |
| Pipeline layouts | Per-material | One shared layout |
| Viewport / scissor | Baked into PSO | Dynamic state |
| Vertex format | Baked | Baked (unavoidable) |
| Target format | Baked | Baked (unavoidable) |
RenderPipelineDesc has exactly four fields. The permutation space is vertex_layouts × topologies × target_formats × depth_configs — deliberately small.
Performance
Pipelines are expensive to create (shader compilation, PSO allocation) but cheap to bind during rendering. Create them once at startup and reuse across frames.
```rust
struct Renderer {
    scene_pipeline: RenderPipeline,
    ui_pipeline: RenderPipeline,
    wireframe_pipeline: RenderPipeline,
}

impl Renderer {
    fn new(device: &Device, surface: &Surface) -> Result<Self> {
        // Create all pipelines upfront
        Ok(Self {
            scene_pipeline: create_scene_pipeline(device, surface.format())?,
            ui_pipeline: create_ui_pipeline(device, surface.format())?,
            wireframe_pipeline: create_wireframe_pipeline(device, surface.format())?,
        })
    }
}
```
Command Encoding
CommandEncoder records GPU rendering commands without executing them. It is completely lock-free and does not touch the GPU backend — you can create and fill encoders on any thread. The actual GPU work happens when you submit the commands through Frame::render() or RenderTarget::render().
Recording Commands
#![allow(unused)] fn main() { use goldy::{CommandEncoder, Color}; let mut encoder = CommandEncoder::new(); { let mut pass = encoder.begin_render_pass(); pass.clear(Color::CORNFLOWER_BLUE); pass.set_pipeline(&pipeline); pass.set_vertex_buffer(0, &vertices); pass.draw(0..3, 0..1); } // pass ends when dropped let commands = encoder.finish(); }
Render Pass
A RenderPass is a borrow of the encoder that groups drawing commands. It begins with begin_render_pass() and ends when the RenderPass value is dropped.
#![allow(unused)] fn main() { let mut encoder = CommandEncoder::new(); { let mut pass = encoder.begin_render_pass(); // all draw commands go here } }
Commands within a pass execute in recorded order.
Clearing
Clear the color attachment, the depth buffer, or both:
#![allow(unused)] fn main() { pass.clear(Color::BLACK); pass.clear_depth(1.0); // standard depth clear (far plane) pass.clear_depth(0.0); // reverse-Z depth clear }
Setting the Pipeline
Bind the active RenderPipeline. You can switch pipelines within the same pass.
#![allow(unused)] fn main() { pass.set_pipeline(&scene_pipeline); // ... draw scene ... pass.set_pipeline(&ui_pipeline); // ... draw UI ... }
Vertex and Index Buffers
Bind vertex data to a numbered slot. Both Buffer and BufferView are accepted — for pool-allocated views, the parent buffer and offset are resolved automatically.
#![allow(unused)] fn main() { pass.set_vertex_buffer(0, &vertex_buffer); // With an explicit additional offset: pass.set_vertex_buffer_offset(0, &vertex_buffer, byte_offset); }
Bind an index buffer for indexed drawing:
#![allow(unused)] fn main() { use goldy::IndexFormat; pass.set_index_buffer(&index_buffer, IndexFormat::Uint16); // With an additional offset: pass.set_index_buffer_offset(&index_buffer, byte_offset, IndexFormat::Uint32); }
Binding Resources
Goldy's bindless model passes resource indices to shaders through push constants. There are three binding methods:
Typed handles (preferred for new code) — each handle carries its BindlessCategory, enabling validation against shader reflection:
#![allow(unused)] fn main() { let tex = texture.bindless_handle().unwrap(); let samp = sampler.bindless_handle().unwrap(); pass.bind_resources_typed(&[tex, samp]); }
Buffer references — extracts bindless indices from Buffer objects:
#![allow(unused)] fn main() { pass.bind_resources(&[&uniform_buffer, &data_buffer]); }
Raw indices — for manual control or when mixing resource types:
#![allow(unused)] fn main() { let tex_idx = texture.bindless_index().unwrap(); let samp_idx = sampler.bindless_index().unwrap(); pass.bind_resources_raw(&[tex_idx, samp_idx]); }
Raw indices can also carry user scalars alongside bindless indices:
#![allow(unused)] fn main() { pass.bind_resources_raw_with_user( &[buf_idx, tex_idx], // bindless indices (region A) &[frame_number], // user scalars (region B) ); }
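The two-region layout can be pictured as one contiguous push-constant payload. The sketch below is an illustrative assumption about how region A and region B might be packed, not Goldy's actual ABI:

```rust
// Hypothetical sketch: bindless indices (region A) followed by user scalars
// (region B) in one flat push-constant payload. Layout is assumed, not Goldy's.
fn pack_push_constants(bindless: &[u32], user: &[u32]) -> Vec<u32> {
    let mut payload = Vec::with_capacity(bindless.len() + user.len());
    payload.extend_from_slice(bindless); // region A: resource indices
    payload.extend_from_slice(user);     // region B: user scalars
    payload
}

fn main() {
    let payload = pack_push_constants(&[7, 42], &[1000]);
    assert_eq!(payload, vec![7, 42, 1000]);
}
```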
Draw Calls
draw
Draw non-indexed primitives:
#![allow(unused)] fn main() { // draw(vertex_range, instance_range) pass.draw(0..3, 0..1); // 3 vertices, 1 instance pass.draw(0..6, 0..10); // 6 vertices, 10 instances pass.draw(100..106, 0..1); // 6 vertices starting at vertex 100 }
draw_indexed
Draw indexed primitives. Requires a prior set_index_buffer() call.
#![allow(unused)] fn main() { // draw_indexed(index_range, base_vertex, instance_range) pass.draw_indexed(0..36, 0, 0..1); // base_vertex is added to each index before vertex fetch pass.draw_indexed(0..6, 1000, 0..1); // negative base_vertex is allowed pass.draw_indexed(0..3, -50, 0..1); }
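The base_vertex arithmetic is easy to model on the CPU. This sketch reproduces the semantics stated above (each fetched index is offset by base_vertex before vertex fetch):

```rust
// Model of the GPU's draw_indexed index math:
// fetched_vertex = index_buffer[i] + base_vertex
fn effective_vertices(indices: &[u32], base_vertex: i32) -> Vec<i64> {
    indices
        .iter()
        .map(|&i| i as i64 + base_vertex as i64)
        .collect()
}

fn main() {
    // base_vertex 1000 shifts every fetched vertex up by 1000
    assert_eq!(effective_vertices(&[0, 1, 2], 1000), vec![1000, 1001, 1002]);
    // a negative base_vertex shifts down
    assert_eq!(effective_vertices(&[50, 51, 52], -50), vec![0, 1, 2]);
}
```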
draw_fullscreen
Draw a fullscreen triangle (3 vertices, no vertex buffer needed). Pair with vs_fullscreen_triangle() from goldy_exp.vertex or fullscreen_position()/fullscreen_uv() from goldy_exp.primitives.
#![allow(unused)] fn main() { pass.set_pipeline(&fullscreen_pipeline); pass.bind_resources(&[&uniform_buffer]); pass.draw_fullscreen(); }
This is more efficient than a fullscreen quad (3 vertices vs 6) and eliminates vertex buffer overhead entirely.
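The oversized-triangle trick that helpers like vs_fullscreen_triangle() typically implement derives position and UV from the vertex index alone. This is the common formulation; Goldy's exact helper may differ:

```rust
// Standard fullscreen-triangle derivation from the vertex index:
// three vertices whose triangle covers the screen; clipping trims the overshoot.
fn fullscreen_vertex(id: u32) -> ([f32; 2], [f32; 2]) {
    let uv = [((id << 1) & 2) as f32, (id & 2) as f32];
    let pos = [uv[0] * 2.0 - 1.0, uv[1] * 2.0 - 1.0];
    (pos, uv)
}

fn main() {
    assert_eq!(fullscreen_vertex(0).0, [-1.0, -1.0]);
    assert_eq!(fullscreen_vertex(1).0, [3.0, -1.0]);
    assert_eq!(fullscreen_vertex(2).0, [-1.0, 3.0]);
}
```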
draw_quads
Draw N instanced quads (6 vertices each, no vertex buffer needed). The shader reads per-instance data from a buffer and uses quad_position() from goldy_exp.primitives to generate vertex positions.
#![allow(unused)] fn main() { pass.set_pipeline(&instanced_pipeline); pass.bind_resources(&[&instance_buffer]); pass.draw_quads(400); // draw 400 quads }
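A quad_position()-style helper maps each of the six per-quad vertex indices to a corner of a unit quad (two triangles). The corner ordering below is an illustrative assumption, not necessarily Goldy's:

```rust
// Plausible sketch of quad vertex generation: vertex id 0..6 selects a corner
// of a unit quad built from two triangles. Corner order is assumed.
fn quad_corner(vertex_id: u32) -> [f32; 2] {
    const CORNERS: [[f32; 2]; 4] = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]];
    const ORDER: [usize; 6] = [0, 1, 2, 0, 2, 3]; // triangle 1, triangle 2
    CORNERS[ORDER[(vertex_id % 6) as usize]]
}

fn main() {
    // draw_quads(n) issues n * 6 vertices; the instance index selects per-quad data
    assert_eq!(quad_corner(0), [0.0, 0.0]);
    assert_eq!(quad_corner(5), [0.0, 1.0]);
}
```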
Submitting Commands
After recording, submit the encoder to a surface frame or render target:
#![allow(unused)] fn main() { // Surface presentation let frame = surface.begin()?; frame.render(encoder)?; frame.present()?; // Headless render target target.render(encoder)?; }
Complete Example
#![allow(unused)] fn main() { let mut encoder = CommandEncoder::new(); { let mut pass = encoder.begin_render_pass(); pass.clear(Color::BLACK); pass.clear_depth(1.0); // Draw opaque geometry pass.set_pipeline(&scene_pipeline); pass.set_vertex_buffer(0, &mesh_vertices); pass.set_index_buffer(&mesh_indices, IndexFormat::Uint32); pass.bind_resources(&[&camera_uniforms]); pass.draw_indexed(0..index_count, 0, 0..1); // Draw fullscreen post-process pass.set_pipeline(&post_pipeline); pass.bind_resources(&[&post_uniforms]); pass.draw_fullscreen(); } let frame = surface.begin()?; frame.render(encoder)?; frame.present()?; }
Best Practices
- Batch draws by pipeline. Pipeline switches are cheap but not free. Group objects that share the same pipeline.
- Clear once per pass. Issue clear() at the start, then draw everything.
- Use convenience methods. draw_fullscreen() and draw_quads() avoid unnecessary vertex buffer allocations.
- Encode on any thread. CommandEncoder is lock-free; build command buffers in parallel if needed.
Vertex Types and Layouts
Goldy provides built-in vertex types for common 2D rendering and a layout builder for custom vertex formats. Vertex data is described by a VertexBufferLayout that tells the pipeline how to interpret buffer memory.
Built-in Vertex Types
Vertex2D
Position + color. Use for colored primitives, particles, and debug visualization.
#![allow(unused)] fn main() { use goldy::{Vertex2D, Color}; let vertices = vec![ Vertex2D::new(-0.5, -0.5, Color::RED), Vertex2D::new( 0.5, -0.5, Color::GREEN), Vertex2D::new( 0.0, 0.5, Color::BLUE), ]; }
Memory layout (24 bytes per vertex):
| Location | Field | Format | Offset |
|---|---|---|---|
| 0 | position | Float32x2 | 0 |
| 1 | color | Float32x4 | 8 |
Get the pipeline layout with Vertex2D::layout().
Vertex2DUv
Position + texture coordinates. Use for textured quads, sprites, and shader effects.
#![allow(unused)] fn main() { use goldy::Vertex2DUv; let vertices = vec![ Vertex2DUv::new(-1.0, -1.0, 0.0, 1.0), Vertex2DUv::new( 1.0, -1.0, 1.0, 1.0), Vertex2DUv::new( 0.0, 1.0, 0.5, 0.0), ]; }
Memory layout (16 bytes per vertex):
| Location | Field | Format | Offset |
|---|---|---|---|
| 0 | position | Float32x2 | 0 |
| 1 | uv | Float32x2 | 8 |
Get the pipeline layout with Vertex2DUv::layout().
Using Built-in Types in Pipelines
Both types provide a layout() method that returns the correct VertexBufferLayout:
#![allow(unused)] fn main() { let pipeline = RenderPipeline::new(&device, &vs, &fs, &RenderPipelineDesc { vertex_layout: Vertex2D::layout(), target_format: surface.format(), ..Default::default() })?; }
Both types implement StructuredBufferElement, so they can also be stored in Buffer::with_data and BufferPool::alloc_with_data.
Custom Vertex Layouts
Defining a Custom Vertex
Custom vertex types must be #[repr(C)] and derive bytemuck::Pod and bytemuck::Zeroable:
#![allow(unused)] fn main() { #[repr(C)] #[derive(Clone, Copy, bytemuck::Pod, bytemuck::Zeroable)] struct MyVertex { position: [f32; 3], normal: [f32; 3], uv: [f32; 2], color: u32, } }
Building a Layout with from_formats
VertexBufferLayout::from_formats::<T> infers locations (sequential from 0) and offsets (accumulated from format sizes), then validates that the total matches size_of::<T>():
#![allow(unused)] fn main() { use goldy::types::{VertexBufferLayout, VertexFormat}; let layout = VertexBufferLayout::from_formats::<MyVertex>(&[ VertexFormat::Float32x3, // position (12 bytes) VertexFormat::Float32x3, // normal (12 bytes) VertexFormat::Float32x2, // uv (8 bytes) VertexFormat::Uint32, // color (4 bytes) ]); // stride = 36, 4 attributes }
The builder panics if the summed format sizes don't equal size_of::<T>(), catching field-list mismatches at pipeline creation rather than producing silent GPU corruption.
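The inference rule is simple enough to sketch directly: offsets accumulate from format sizes, and the total must equal the struct size. This is an illustrative reimplementation, not Goldy's actual code:

```rust
// Sketch of from_formats inference: sequential locations, accumulated offsets,
// and a size check against the Rust struct.
fn infer_layout(format_sizes: &[usize], struct_size: usize) -> Result<Vec<(u32, usize)>, String> {
    let mut offset = 0;
    let mut attrs = Vec::new();
    for (location, &size) in format_sizes.iter().enumerate() {
        attrs.push((location as u32, offset)); // (location, byte offset)
        offset += size;
    }
    if offset != struct_size {
        return Err(format!("format sizes sum to {offset}, struct is {struct_size}"));
    }
    Ok(attrs)
}

fn main() {
    // MyVertex: Float32x3 + Float32x3 + Float32x2 + Uint32 = 36 bytes
    let attrs = infer_layout(&[12, 12, 8, 4], 36).unwrap();
    assert_eq!(attrs, vec![(0, 0), (1, 12), (2, 24), (3, 32)]);
    // A mismatched field list is caught here instead of corrupting GPU reads
    assert!(infer_layout(&[12, 12, 8], 36).is_err());
}
```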
Manual Layout
For full control, construct the layout directly:
#![allow(unused)] fn main() { use goldy::types::{VertexBufferLayout, VertexAttribute, VertexFormat}; let layout = VertexBufferLayout { stride: 32, attributes: vec![ VertexAttribute { location: 0, format: VertexFormat::Float32x3, offset: 0 }, VertexAttribute { location: 1, format: VertexFormat::Float32x3, offset: 12 }, VertexAttribute { location: 2, format: VertexFormat::Float32x2, offset: 24 }, ], }; }
Empty Layout
When the vertex shader generates geometry from SV_VertexID (fullscreen triangles, instanced quads), use the default empty layout:
#![allow(unused)] fn main() { let pipeline = RenderPipeline::new(&device, &vs, &fs, &RenderPipelineDesc { vertex_layout: VertexBufferLayout::empty(), ..Default::default() })?; }
VertexBufferLayout::default() also returns an empty layout.
Vertex Formats
Available formats for vertex attributes:
| Format | Rust Type | Size |
|---|---|---|
Float32 | f32 | 4 |
Float32x2 | [f32; 2] | 8 |
Float32x3 | [f32; 3] | 12 |
Float32x4 | [f32; 4] | 16 |
Uint32 | u32 | 4 |
Sint32 | i32 | 4 |
Uint8x4 | [u8; 4] (packed) | 4 |
Unorm8x4 | [u8; 4] (normalized) | 4 |
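The difference between Uint8x4 and Unorm8x4 is the normalization step: unorm formats divide each byte by 255 so the shader sees values in [0.0, 1.0]. This is the standard unorm conversion rule:

```rust
// Unorm8 conversion: byte 0..=255 maps linearly onto 0.0..=1.0.
fn unorm8_to_f32(byte: u8) -> f32 {
    byte as f32 / 255.0
}

fn main() {
    assert_eq!(unorm8_to_f32(0), 0.0);
    assert_eq!(unorm8_to_f32(255), 1.0);
    // mid-gray 128 lands just above 0.5
    assert!((unorm8_to_f32(128) - 0.50196).abs() < 1e-4);
}
```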
Vertex Data Flow
In Slang shaders, vertex attributes arrive through the [goldy_vertex] virtual entry point. The pipeline's VertexBufferLayout determines which attributes the hardware feeds into the shader's input struct. Attribute locations in the layout must match the shader's declared input locations.
For passes that bypass vertex buffers entirely, Slang helpers like vs_fullscreen_triangle() and quad_position() in goldy_exp.primitives generate geometry from SV_VertexID and SV_InstanceID.
Rendering Outputs
Surface manages a swapchain for zero-copy GPU-to-display presentation. It wraps the platform window handle, acquires drawable textures each frame, and presents finished frames to the display.
Creating a Surface
A Surface requires a Device and a window that implements HasWindowHandle + HasDisplayHandle (from the raw-window-handle crate).
#![allow(unused)] fn main() { use goldy::{Surface, SurfaceConfig, PresentMode, DepthFormat}; // Simplest form — Auto present mode, no depth buffer let surface = Surface::new(&device, &window)?; // With explicit configuration let surface = Surface::new_with_config(&device, &window, SurfaceConfig { present_mode: PresentMode::Fifo, depth_format: Some(DepthFormat::Depth32Float), })?; // Shorthand for depth-only configuration let surface = Surface::new_with_depth(&device, &window, Some(DepthFormat::Depth24Plus))?; }
SurfaceConfig
#![allow(unused)] fn main() { pub struct SurfaceConfig { pub present_mode: PresentMode, pub depth_format: Option<DepthFormat>, } }
| Field | Purpose | Default |
|---|---|---|
present_mode | Vsync strategy | Auto |
depth_format | Depth buffer format, or None to disable | None |
Present Modes
| Mode | Behavior | Backend Mapping |
|---|---|---|
Fifo | Vsync — wait for display refresh. No tearing, capped at monitor Hz. | Metal displaySyncEnabled=YES, Vulkan FIFO, DX12 Present(1) |
Mailbox | Triple-buffered — latest frame queued, older dropped. Low latency + no tearing. | Vulkan MAILBOX. Falls back to Fifo on Metal and some DX12 configurations. |
Immediate | No sync, may tear. Maximum throughput for benchmarks. | Metal displaySyncEnabled=NO, Vulkan IMMEDIATE, DX12 Present(0) |
Auto | Goldy chooses (Mailbox if available, then Fifo). | — |
Change the present mode at runtime without recreating the surface:
#![allow(unused)] fn main() { surface.set_present_mode(PresentMode::Immediate)?; let current = surface.present_mode(); }
Frame Acquisition Cycle
Each frame follows a begin → record → present sequence:
#![allow(unused)] fn main() { loop { // 1. Begin the frame (acquire a swapchain image) let frame = surface.begin()?; // 2. Record rendering commands let mut encoder = CommandEncoder::new(); { let mut pass = encoder.begin_render_pass(); pass.clear(Color::CORNFLOWER_BLUE); pass.set_pipeline(&pipeline); pass.set_vertex_buffer(0, &vertices); pass.draw(0..3, 0..1); } // 3. Submit and present frame.render(encoder)?; frame.present()?; } }
surface.acquire() is a legacy alias for surface.begin().
Frame
Frame represents a single acquired swapchain image, spanning begin() through present(). It tracks whether the frame has been presented and auto-presents on drop if you forget.
Frame Properties
#![allow(unused)] fn main() { let frame = surface.begin()?; frame.width(); // frame dimensions (may differ from surface after resize) frame.height(); }
Graphics Path — Frame::render
Record draw commands into a CommandEncoder and submit with render():
#![allow(unused)] fn main() { frame.render(encoder)?; frame.present()?; }
Compute Path — Frame::submit_compute
For compute-to-surface workflows, access the frame's texture directly and submit a TaskGraph:
#![allow(unused)] fn main() { let frame = surface.begin()?; let tex = frame.texture(); // the swapchain texture as a storage image // Build a task graph that writes to tex... frame.submit_compute(&task_graph)?; frame.present()?; }
frame.texture() returns a &Texture with SpatialAccess::Direct, suitable for binding as a storage image in compute shaders.
Presenting
frame.present() consumes the Frame, submits all recorded work, and queues the image for display. It returns a TimelineValue that can be used with Device::wait_until().
#![allow(unused)] fn main() { let timeline = frame.present()?; }
If a Frame is dropped without calling present(), it auto-presents to avoid leaking the swapchain image. This is safe but wastes a frame.
Surface Queries
#![allow(unused)] fn main() { surface.width(); surface.height(); surface.size(); // (width, height) surface.format(); // TextureFormat of the swapchain images // Validate that a pipeline's target format matches surface.validate_pipeline_format(pipeline_format)?; }
Resize Handling
Call resize() when the window size changes. Zero-size dimensions are silently ignored (common during window minimize).
#![allow(unused)] fn main() { fn on_resize(surface: &mut Surface, width: u32, height: u32) -> Result<()> { surface.resize(width, height)?; Ok(()) } }
Texture Format
The swapchain format is chosen by the backend at surface creation (typically Bgra8UnormSrgb). Always use surface.format() when creating pipelines to ensure a match:
#![allow(unused)] fn main() { let desc = RenderPipelineDesc { target_format: surface.format(), ..Default::default() }; }
Frame Lifetime
Frame follows Rust ownership semantics:
- begin() acquires the swapchain image and returns a Frame
- texture() borrows the frame — valid until present() is called
- present() consumes the Frame — the borrow checker prevents use-after-present
- Dropping without presenting auto-presents (prevents swapchain deadlock)
#![allow(unused)] fn main() { let frame = surface.begin()?; let tex = frame.texture(); // tex is valid here frame.present()?; // tex is now invalid — Rust prevents accessing it }
Buffers
Buffer is a GPU memory allocation for storing typed data — uniforms, vertex data, index data, compute storage, or anything a shader needs to read or write.
Creating Buffers
With Typed Data
Buffer::with_data creates a buffer and uploads an initial slice. The element stride is inferred from T, which is critical for correct StructuredBuffer views on DX12.
#![allow(unused)] fn main() { use goldy::{Buffer, DataAccess}; let positions = vec![[0.0f32, 1.0, 0.0], [1.0, 0.0, 0.0]]; let buffer = Buffer::with_data(&device, &positions, DataAccess::Scattered)?; }
Type matters. Passing &[u8] (e.g. from bytemuck::bytes_of) sets the element stride to 1 byte, while shaders usually expect a larger struct stride. Use a typed slice or with_bytes_stride instead.
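The stride difference is visible directly in size_of. The struct below is a hypothetical example illustrating the pitfall:

```rust
// Why the element type matters: stride comes from size_of::<T>(), so a typed
// slice and a byte view of the same data imply very different strides.
#[repr(C)]
#[derive(Clone, Copy)]
#[allow(dead_code)]
struct Particle {
    position: [f32; 3],
    velocity: [f32; 3],
}

fn main() {
    // Typed slice: stride = 24, matching StructuredBuffer<Particle> in the shader
    assert_eq!(std::mem::size_of::<Particle>(), 24);
    // Byte slice: stride = 1, which a struct-typed shader view will misread
    assert_eq!(std::mem::size_of::<u8>(), 1);
}
```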
With Typed Data and Flags
#![allow(unused)] fn main() { let buffer = Buffer::with_data_and_flags( &device, &data, DataAccess::Scattered, BufferFlags::CPU_READABLE, )?; }
With Raw Bytes
When the data is naturally &[u8], use one of the byte-oriented constructors:
#![allow(unused)] fn main() { // Stride defaults to 1 (byte-addressable) let buffer = Buffer::with_bytes(&device, &raw_bytes, DataAccess::Scattered)?; // Explicit stride for structured buffer views let buffer = Buffer::with_bytes_stride(&device, &raw_bytes, DataAccess::Scattered, 16)?; }
Empty Buffer
#![allow(unused)] fn main() { let buffer = Buffer::new(&device, 4096, DataAccess::Scattered)?; // With a specific element stride let buffer = Buffer::new_with_stride(&device, 4096, DataAccess::Scattered, Some(64))?; }
Data Access Patterns
The access pattern describes how shader threads access the buffer. This drives hardware optimizations and determines the bindless descriptor category.
#![allow(unused)] fn main() { pub enum DataAccess { Scattered, // default — any thread, any address, read/write Broadcast, // all threads read the same address } }
| Pattern | Shader Mapping | Use When |
|---|---|---|
Scattered | StructuredBuffer<T>, RWStructuredBuffer<T> | General storage: particles, meshes, compute I/O |
Broadcast | ConstantBuffer / uniform buffer | Uniform data: transforms, time, settings |
For read-only input buffers that don't need write access, create with DataAccess::Scattered and access through goldy_buf_ro<T> in the shader. This enables hardware read-cache optimizations without requiring a separate access pattern.
BufferFlags
#![allow(unused)] fn main() { bitflags! { pub struct BufferFlags: u32 { const COPY_SRC = 1 << 0; const COPY_DST = 1 << 1; const CPU_READABLE = 1 << 2; } } }
| Flag | Purpose |
|---|---|
COPY_SRC | Buffer can be a copy source |
COPY_DST | Buffer can be a copy destination |
CPU_READABLE | Optimize for readback. On Vulkan/Metal, read_to_cpu is a direct memcpy from host-visible memory. On DX12, it performs a GPU copy into a READBACK heap and waits. |
Query DeviceCapabilities::has_zero_copy_storage_readback to detect whether readback is zero-copy on the current backend.
Writing Data
Raw bytes
#![allow(unused)] fn main() { buffer.write(offset, &bytes)?; }
Typed data
#![allow(unused)] fn main() { buffer.write_data(offset, &[1.0f32, 2.0, 3.0])?; }
Both methods write at a byte offset from the start of the buffer.
Reading Data
Read buffer contents back to the CPU. The buffer should have been created with BufferFlags::CPU_READABLE for optimal performance.
#![allow(unused)] fn main() { let mut output = vec![0u8; buffer.size() as usize]; buffer.read_to_cpu(&device, &mut output)?; }
Clearing
Zero-fill a region of the buffer:
#![allow(unused)] fn main() { buffer.clear(&device, offset, size)?; }
Bindless Descriptors
Every buffer with Scattered or Broadcast access is registered in the global bindless descriptor set. Retrieve the index to pass to shaders:
#![allow(unused)] fn main() { // Typed handle (preferred) — carries BindlessCategory for validation let handle = buffer.bindless_handle().unwrap(); // Raw index let index = buffer.bindless_index().unwrap(); // Read-only SRV index (separate from UAV on DX12; same on Vulkan/Metal) let srv_handle = buffer.bindless_srv_handle().unwrap(); }
BufferView
A BufferView is a sub-region of an existing Buffer with its own bindless descriptor. The shader sees the sub-region as a zero-based buffer.
Creating Views
#![allow(unused)] fn main() { // Raw byte view — offset, size, optional element stride let view = buffer.create_view(1024, 512, Some(16))?; // Typed view — first element index, element count let view = buffer.create_typed_view::<[f32; 4]>(0, 256)?; }
Using Views
Views implement BufferSource, so they work anywhere a Buffer does — set_vertex_buffer, set_index_buffer, write_data, read_to_cpu, clear, and bindless binding:
#![allow(unused)] fn main() { let view_handle = view.bindless_handle().unwrap(); pass.set_vertex_buffer(0, &view); }
Lifetime
Dropping a BufferView unregisters its descriptor but does not free the parent buffer's memory. Multiple views of the same buffer can exist simultaneously.
StructuredBufferElement
The StructuredBufferElement trait marks types safe for Buffer::with_data and BufferPool::alloc_with_data. It is implemented for common multi-byte primitives (u16, u32, f32, f64, etc.), fixed-size arrays of those types, and #[repr(C)] structs via #[derive(goldy_derive::StructuredBufferElement)].
Not implemented for u8/i8 — passing &[u8] would set stride to 1, which almost never matches the shader's expected struct stride. Use Buffer::with_bytes_stride for raw bytes.
Matrix Convention
Goldy uses column-major matrix layout in uniform/constant buffers across all backends. Rust math libraries (glam, nalgebra, ultraviolet) already store matrices column-major, so upload directly without transposing:
#![allow(unused)] fn main() { let uniforms = MyUniforms { projection: proj.to_cols_array_2d(), modelview: view.to_cols_array_2d(), }; buffer.write_data(0, &[uniforms])?; }
Goldy sets SLANG_MATRIX_LAYOUT_COLUMN_MAJOR at the Slang session level, so DX12, Vulkan, and Metal all interpret float4x4 the same way.
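Column-major means consecutive memory holds one column at a time, so for a translation matrix the translation vector occupies the last column, i.e. the final four floats of the upload. A minimal sketch of that layout:

```rust
// Column-major 4x4 translation matrix: each inner array is one column,
// matching the [[f32; 4]; 4] shape produced by glam's to_cols_array_2d().
fn translation_cols(tx: f32, ty: f32, tz: f32) -> [[f32; 4]; 4] {
    [
        [1.0, 0.0, 0.0, 0.0], // column 0
        [0.0, 1.0, 0.0, 0.0], // column 1
        [0.0, 0.0, 1.0, 0.0], // column 2
        [tx, ty, tz, 1.0],    // column 3 holds the translation
    ]
}

fn main() {
    let m = translation_cols(5.0, 6.0, 7.0);
    assert_eq!(m[3], [5.0, 6.0, 7.0, 1.0]);
}
```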
Textures and Samplers
Texture holds image data on the GPU. Sampler controls how that data is filtered and addressed when read in shaders. Together, they provide the standard texture sampling pipeline.
Creating a Texture
#![allow(unused)] fn main() { use goldy::{Texture, SpatialAccess, TextureFormat, TextureFlags}; let texture = Texture::new( &device, 512, 512, TextureFormat::Rgba8Unorm, SpatialAccess::Interpolated, TextureFlags::COPY_DST, )?; }
With Initial Data
Data must be raw bytes matching width × height × bytes_per_pixel:
#![allow(unused)] fn main() { let pixels: Vec<u8> = load_image_rgba("sprite.png"); let texture = Texture::with_data( &device, &pixels, 256, 256, TextureFormat::Rgba8Unorm, SpatialAccess::Interpolated, TextureFlags::COPY_DST, )?; }
Spatial Access Patterns
The access pattern determines how the texture is bound and accessed in shaders:
| Access | Shader Mapping | Use When |
|---|---|---|
Interpolated | Texture2D with sampler | Image data filtered between texels — sprites, materials, UI |
Direct | RWTexture2D | Storage images, compute output, exact pixel reads/writes |
Texture Formats
| Format | BPP | Notes |
|---|---|---|
R8Unorm | 1 | Single-channel (masks, SDFs) |
Rg8Unorm | 2 | Two-channel (normal maps, motion vectors) |
Rgba8Unorm | 4 | Standard 8-bit RGBA |
Rgba8UnormSrgb | 4 | sRGB color space |
Bgra8UnormSrgb | 4 | sRGB, swapped channels (common swapchain format) |
Bgra8Unorm | 4 | Linear, swapped channels |
Rgba16Float | 8 | HDR |
Rgba32Float | 16 | Full precision |
TextureFlags
#![allow(unused)] fn main() { bitflags! { pub struct TextureFlags: u32 { const COPY_SRC = 1 << 0; const COPY_DST = 1 << 1; const RENDER_TARGET = 1 << 2; } } }
| Flag | Purpose |
|---|---|
COPY_SRC | Texture can be a copy source (needed for read_to_cpu) |
COPY_DST | Texture can be a copy destination (needed for write / write_region) |
RENDER_TARGET | Texture can be used as a color attachment |
Writing Data
Prefer TaskGraph::write_texture() for batched, non-blocking uploads. The synchronous methods below stall the GPU:
#![allow(unused)] fn main() { #[allow(deprecated)] texture.write(&pixels)?; #[allow(deprecated)] texture.write_region(x, y, width, height, ®ion_pixels)?; }
Reading Data
Read texture contents back to CPU memory. The texture must have been created with TextureFlags::COPY_SRC:
#![allow(unused)] fn main() { let mut output = vec![0u8; texture.byte_size()]; texture.read_to_cpu(&mut output)?; }
Texture Queries
#![allow(unused)] fn main() { texture.width(); texture.height(); texture.format(); texture.byte_size(); // width * height * bytes_per_pixel texture.access(); // SpatialAccess texture.flags(); // TextureFlags texture.is_owned(); // true if dropping destroys the GPU resource }
Bindless Descriptors
Textures are registered in the global bindless descriptor set. The category depends on the access pattern: Interpolated maps to BindlessCategory::Texture, Direct maps to BindlessCategory::StorageImage.
#![allow(unused)] fn main() { // Typed handle (preferred) let handle = texture.bindless_handle().unwrap(); // Raw index let index = texture.bindless_index().unwrap(); }
Texture Borrowing
Texture::borrow() creates a non-owning view that shares the GPU resource. Dropping a borrowed texture does not destroy the underlying resource. Use this when handing a texture reference into a system that may drop it before the owner is done.
#![allow(unused)] fn main() { let borrowed = texture.borrow(); assert!(!borrowed.is_owned()); // dropping `borrowed` does not free GPU memory }
Depth Textures
Depth textures are created through SurfaceConfig or RenderTarget, not directly via Texture::new. Available depth formats:
| Format | Bits | Stencil |
|---|---|---|
Depth16Unorm | 16 | No |
Depth24Plus | 24 | No |
Depth24PlusStencil8 | 24 + 8 | Yes |
Depth32Float | 32 | No |
Depth32FloatStencil8 | 32 + 8 | Yes |
#![allow(unused)] fn main() { let surface = Surface::new_with_config(&device, &window, SurfaceConfig { depth_format: Some(DepthFormat::Depth32Float), ..Default::default() })?; }
Texture as Render Target
A texture created with TextureFlags::RENDER_TARGET can be used as a color attachment for offscreen rendering.
#![allow(unused)] fn main() { let offscreen = Texture::new( &device, 1920, 1080, TextureFormat::Rgba16Float, SpatialAccess::Interpolated, TextureFlags::RENDER_TARGET | TextureFlags::COPY_SRC, )?; }
Samplers
A Sampler defines how texture coordinates are interpreted — filtering between texels and handling coordinates outside [0, 1].
Creating a Sampler
#![allow(unused)] fn main() { use goldy::{Sampler, SamplerDesc, FilterMode, AddressMode}; let sampler = Sampler::new(&device, &SamplerDesc { mag_filter: FilterMode::Linear, min_filter: FilterMode::Linear, mipmap_filter: FilterMode::Linear, address_mode_u: AddressMode::Repeat, address_mode_v: AddressMode::Repeat, ..Default::default() })?; }
Convenience Constructors
#![allow(unused)] fn main() { let nearest = Sampler::nearest(&device)?; // nearest filter, clamp to edge let linear = Sampler::linear(&device)?; // linear filter, clamp to edge let tiling = Sampler::linear_repeat(&device)?; // linear filter, repeat addressing let default = Sampler::default_sampler(&device)?; // nearest filter, clamp to edge }
SamplerDesc
#![allow(unused)] fn main() { pub struct SamplerDesc { pub address_mode_u: AddressMode, // default: ClampToEdge pub address_mode_v: AddressMode, // default: ClampToEdge pub address_mode_w: AddressMode, // default: ClampToEdge pub mag_filter: FilterMode, // default: Nearest pub min_filter: FilterMode, // default: Nearest pub mipmap_filter: FilterMode, // default: Nearest pub max_anisotropy: f32, // default: 1.0 (disabled) pub compare: Option<CompareFunction>, // default: None pub lod_min_clamp: f32, // default: 0.0 pub lod_max_clamp: f32, // default: 32.0 } }
Filter Modes
| Mode | Effect |
|---|---|
Nearest | Pixelated — nearest texel, no interpolation |
Linear | Smooth — bilinear interpolation between neighbors |
Address Modes
| Mode | Effect for UVs outside [0, 1] |
|---|---|
ClampToEdge | Stretches the border texel |
Repeat | Tiles the texture |
MirrorRepeat | Tiles with alternating mirror flips |
Depth Comparison Samplers
For shadow mapping and depth-based effects, set the compare field:
#![allow(unused)] fn main() { let shadow_sampler = Sampler::new(&device, &SamplerDesc { compare: Some(CompareFunction::LessEqual), mag_filter: FilterMode::Linear, min_filter: FilterMode::Linear, ..Default::default() })?; }
Bindless Descriptors
Samplers are registered under BindlessCategory::Sampler:
#![allow(unused)] fn main() { let handle = sampler.bindless_handle().unwrap(); let index = sampler.bindless_index().unwrap(); }
Binding Textures and Samplers in Shaders
Pass texture and sampler indices together through resource bindings:
#![allow(unused)] fn main() { let tex = texture.bindless_handle().unwrap(); let samp = sampler.bindless_handle().unwrap(); pass.bind_resources_typed(&[tex, samp]); }
In Slang:
import goldy_exp;
[goldy_fragment]
float4 fs_main(Interpolated<float4> tex, Filter smp, float2 uv : TEXCOORD) {
return tex.Sample(smp, uv);
}
Pooling and Sub-Allocation
GPU resource allocation is expensive. Creating many small buffers or textures each frame produces allocation overhead, descriptor churn, and VRAM fragmentation. Goldy provides three pooling types to amortize these costs.
BufferPool
BufferPool sub-allocates typed regions from a single large DataAccess::Scattered backing buffer. Each region gets its own bindless descriptor, so shaders see independent zero-based buffers.
Creating a Pool
#![allow(unused)] fn main() { use goldy::BufferPool; let mut pool = BufferPool::new(&device, 1024 * 1024)?; // 1 MB pool }
The backing buffer uses DataAccess::Scattered and a default sub-allocation alignment of 256 bytes (satisfies minStorageBufferOffsetAlignment on all known Vulkan/DX12 hardware).
For custom alignment:
#![allow(unused)] fn main() { let mut pool = BufferPool::with_alignment(&device, total_size, 512)?; }
Allocating Regions
Typed allocation — stride is inferred from T:
#![allow(unused)] fn main() { let tiles: BufferView = pool.alloc::<[u32; 2]>(1024)?; // 1024 elements let segments: BufferView = pool.alloc::<[f32; 6]>(4096)?; // 4096 elements }
Allocate and fill in one call:
#![allow(unused)] fn main() { let data = vec![[1.0f32, 0.0, 0.0]; 100]; let view: BufferView = pool.alloc_with_data(&data)?; }
Raw byte allocation with explicit stride:
#![allow(unused)] fn main() { let view = pool.alloc_bytes(4096, Some(16))?; }
Each allocation is aligned to satisfy both the pool alignment (256) and offset % element_stride == 0 (required by DX12 StructuredBuffer views).
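The dual alignment rule can be sketched as: round the bump pointer up to the pool alignment, then keep stepping until the offset is also a stride multiple. This is an illustrative model; Goldy's allocator may compute the offset differently:

```rust
// Sketch of dual alignment: the returned offset satisfies both
// offset % pool_align == 0 and offset % stride == 0.
fn align_offset(cursor: usize, pool_align: usize, stride: usize) -> usize {
    // round up to the pool alignment first
    let mut offset = (cursor + pool_align - 1) / pool_align * pool_align;
    if stride > 0 {
        while offset % stride != 0 {
            offset += pool_align; // stepping by pool_align preserves the first rule
        }
    }
    offset
}

fn main() {
    // cursor 100, stride 24: 256 is not a multiple of 24, but 768 is
    let offset = align_offset(100, 256, 24);
    assert_eq!(offset % 256, 0);
    assert_eq!(offset % 24, 0);
    assert_eq!(offset, 768);
}
```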
Using Allocated Views
Every BufferView from a pool has its own bindless descriptor. Bind it like any buffer:
#![allow(unused)] fn main() { let tile_handle = tiles.bindless_handle().unwrap(); pass.bind_resources_typed(&[tile_handle]); // Or as a vertex/index buffer pass.set_vertex_buffer(0, &tiles); }
Write data into a view:
#![allow(unused)] fn main() { view.write_data(&new_data)?; }
Sizing a Pool
Use BufferPool::padded_size to compute the exact byte capacity needed for a known set of allocations, including alignment padding:
#![allow(unused)] fn main() { let size = BufferPool::padded_size(&[ (1024, std::mem::size_of::<[u32; 2]>()), // tiles (4096, std::mem::size_of::<[f32; 6]>()), // segments (512, std::mem::size_of::<u32>()), // indices ]); let mut pool = BufferPool::new(&device, size)?; }
Resetting
reset() moves the bump pointer back to zero without invalidating existing views. Use for frame-to-frame reuse when previous views are no longer in flight.
#![allow(unused)] fn main() { pool.reset(); }
Pool Queries
#![allow(unused)] fn main() { pool.used(); // bytes currently allocated pool.capacity(); // total pool size pool.remaining(); // bytes available pool.backing_buffer(); // reference to the underlying Buffer }
BufferPoolRing
BufferPoolRing is a fixed-size ring of BufferPools for double- (or N-) buffered rendering. Each frame advances to the next slot, and the pool that was active N frames ago is safe to reset because its GPU work has completed.
Usage
#![allow(unused)] fn main() { use goldy::BufferPoolRing; let mut ring = BufferPoolRing::<2>::new(); // double-buffered // Each frame: ring.advance(); ring.prepare(&device, needed_bytes)?; if ring.take_clear_flag() { // New backing buffer was allocated — zero-fill it let pool = ring.current_mut().unwrap(); pool.backing_buffer().clear(&device, 0, pool.capacity())?; } let pool = ring.current_mut().unwrap(); let view = pool.alloc::<[f32; 4]>(256)?; }
How It Works
- advance() — rotates to the next pool slot (call once at frame start)
- prepare(device, size) — ensures the current slot has at least size bytes. Resets the pool if it is large enough, or allocates a new one if not. Sets a clear flag when a new allocation occurs.
- take_clear_flag() — returns true exactly once after prepare allocates a new backing buffer. Issue a clear_buffer for the backing when this fires.
- current_mut() / current() — access the current frame's pool
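The rotation itself is simple modular arithmetic: after N calls to advance(), you land back on a slot whose GPU work is N frames old and therefore assumed complete. A minimal model of the rotation (illustrative, not BufferPoolRing's code):

```rust
// Minimal N-slot ring rotation: the slot reached after advance() was last
// active N frames ago, so it is safe to reset.
struct Ring<const N: usize> {
    current: usize,
}

impl<const N: usize> Ring<N> {
    fn new() -> Self {
        Self { current: 0 }
    }
    fn advance(&mut self) -> usize {
        self.current = (self.current + 1) % N;
        self.current
    }
}

fn main() {
    let mut ring = Ring::<2>::new(); // double-buffered
    assert_eq!(ring.advance(), 1);
    assert_eq!(ring.advance(), 0); // back to slot 0 after 2 frames
    assert_eq!(ring.advance(), 1);
}
```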
Bounded Prepare
prepare_bounded adds an optional upper bound. If the current pool exceeds max_size, it is reallocated at size, enabling hysteresis-based shrinking:
```rust
ring.prepare_bounded(&device, needed_size, Some(max_size))?;
```
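The bounded variant adds a single shrink branch on top of prepare. A hypothetical sketch of the sizing decision:

```rust
// Hypothetical sketch of prepare_bounded's sizing decision — not Goldy's code.
// Returns the new capacity to allocate, or None to keep the pool and reset it.
fn bounded_capacity(current: usize, needed: usize, max: Option<usize>) -> Option<usize> {
    if current < needed {
        return Some(needed); // grow: pool too small for this frame
    }
    if let Some(max) = max {
        if current > max {
            return Some(needed); // shrink: pool ballooned past the bound
        }
    }
    None // within [needed, max]: reuse via reset()
}

fn main() {
    assert_eq!(bounded_capacity(100, 200, None), Some(200));      // grow
    assert_eq!(bounded_capacity(1000, 200, Some(500)), Some(200)); // shrink
    assert_eq!(bounded_capacity(300, 200, Some(500)), None);       // reuse
}
```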
Cleanup
```rust
ring.clear(); // drop all pools and reset state
```
TexturePool
TexturePool caches released textures for reuse, avoiding repeated GPU allocation and deallocation. This is particularly valuable on DX12 where texture allocation involves descriptor heap management.
Creating a Pool
```rust
use goldy::{TexturePool, TexturePoolConfig};

let mut pool = TexturePool::new(TexturePoolConfig {
    max_per_key: 4, // keep up to 4 textures per (width, height, format, access, flags) key
});

// Or use defaults (max_per_key = 8)
let mut pool = TexturePool::default();
```
Acquire and Release
```rust
use goldy::{SpatialAccess, TextureFormat, TextureFlags};

// Acquire — returns a pooled texture if available, otherwise creates a new one
let texture = pool.acquire(
    &device,
    1920,
    1080,
    TextureFormat::Rgba16Float,
    SpatialAccess::Direct,
    TextureFlags::COPY_SRC | TextureFlags::COPY_DST,
)?;

// ... use the texture for this frame's work ...

// Release — return to pool after GPU work completes
pool.release(texture);
```
Borrowed textures (texture.borrow()) are silently dropped on release and not pooled.
Pool Key
Textures are keyed by (width, height, format, access, flags). Acquiring a texture only matches exact keys — a 128×128 texture will not be returned for a 256×256 request.
Eviction
When a key already holds max_per_key entries, additional releases are dropped (destroyed) immediately.
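Exact-key matching and max_per_key eviction together amount to a map of bounded free lists. A hypothetical sketch, with a bare u32 standing in for a texture handle (not Goldy's internals):

```rust
use std::collections::HashMap;

// Hypothetical sketch of pool keying and eviction — u32 stands in for
// a GPU texture handle. The real key also includes access and flags.
type Key = (u32, u32, &'static str); // (width, height, format)

struct Pool {
    max_per_key: usize,
    free: HashMap<Key, Vec<u32>>,
}

impl Pool {
    /// Exact-key match only: a 128×128 entry never satisfies a 256×256 request.
    fn acquire(&mut self, key: Key, create: impl FnOnce() -> u32) -> u32 {
        self.free.get_mut(&key).and_then(Vec::pop).unwrap_or_else(create)
    }

    fn release(&mut self, key: Key, tex: u32) {
        let bucket = self.free.entry(key).or_default();
        if bucket.len() < self.max_per_key {
            bucket.push(tex);
        }
        // else: the bucket is full — drop (destroy) the texture immediately
    }
}

fn main() {
    let mut pool = Pool { max_per_key: 1, free: HashMap::new() };
    let key = (1920, 1080, "rgba16f");
    pool.release(key, 7);
    pool.release(key, 8); // evicted: bucket already holds max_per_key entries
    assert_eq!(pool.acquire(key, || 99), 7);  // pooled hit
    assert_eq!(pool.acquire(key, || 99), 99); // miss → create
}
```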
Stats and Cleanup
```rust
let stats = pool.stats();
println!("{} textures pooled, ~{} bytes", stats.entries, stats.estimated_bytes);

pool.clear(); // drop all pooled textures, free GPU memory
```
When to Use Pooling
| Scenario | Recommendation |
|---|---|
| Many small storage buffers with similar lifetime | BufferPool — one allocation, many views |
| Per-frame uniform/storage data that changes every frame | BufferPoolRing — ring-buffered pools, safe reset each frame |
| Transient render targets or compute textures | TexturePool — acquire/release cycle avoids allocation churn |
| Long-lived buffers (mesh data, static textures) | Individual Buffer / Texture — pooling adds no benefit |
| Uniform buffer updated once at startup | Individual Buffer — no per-frame reuse needed |
Sub-Allocation Patterns
Static Geometry Pool
Pack all static mesh data into one BufferPool at load time:
```rust
let size = BufferPool::padded_size(&[
    (vertex_count, std::mem::size_of::<Vertex>()),
    (index_count, std::mem::size_of::<u32>()),
]);
let mut pool = BufferPool::new(&device, size)?;
let vertices = pool.alloc_with_data(&vertex_data)?;
let indices = pool.alloc_with_data(&index_data)?;
```
Per-Frame Dynamic Data
Use BufferPoolRing for data that changes every frame:
```rust
let mut ring = BufferPoolRing::<2>::new();

// In the render loop:
ring.advance();
ring.prepare(&device, frame_data_size)?;
let pool = ring.current_mut().unwrap();
let uniforms = pool.alloc_with_data(&[camera_data])?;
let instances = pool.alloc_with_data(&instance_transforms)?;
```
Transient Compute Textures
Pool intermediate textures in a multi-pass compute pipeline:
```rust
let mut tex_pool = TexturePool::default();

// Each frame:
let temp = tex_pool.acquire(&device, w, h, fmt, SpatialAccess::Direct, flags)?;
// ... compute pass writes to temp ...
// ... next pass reads from temp ...
tex_pool.release(temp); // return for reuse next frame
```
Backend Architecture
Goldy supports three GPU backends, each implemented natively against the platform graphics API — no translation layers (like MoltenVK) are involved.
| Backend | API Level | Platforms | Rust Crate |
|---|---|---|---|
| Vulkan | 1.4+ | Windows, Linux | ash |
| DX12 | Direct3D 12 | Windows | windows + gpu-allocator |
| Metal | Tier 2+ | macOS, iOS | metal |
Native Implementations
Each backend maps Goldy concepts directly to the most natural primitives of its target API:
┌─────────────────────────────────────────────────────────────┐
│ Goldy Core API │
│ │
│ Device, Buffer, Texture, Pipeline, CommandEncoder, ... │
└─────────────────────────────────────────────────────────────┘
│ │ │
▼ ▼ ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ Vulkan 1.4+ │ │ Metal 2+ │ │ DX12 │
│ │ │ │ │ │
│ • ash crate │ │ • metal-rs │ │ • windows-rs │
│ • Dynamic │ │ • Argument │ │ • Root │
│ rendering │ │ buffers │ │ signatures │
│ • Descriptor │ │ • Native │ │ • Descriptor │
│ indexing │ │ hazard │ │ heaps │
│ • Buffer │ │ tracking │ │ │
│ device addr │ │ │ │ │
└───────────────┘ └───────────────┘ └───────────────┘
Translation layers introduce overhead from API mismatches, incompatible synchronization models, and extra validation. Native backends can leverage each API's strengths directly — for example, Metal's built-in hazard tracking, or Vulkan's descriptor indexing for bindless rendering.
Backend Selection
Default Selection
Goldy selects the platform-preferred backend automatically:
| Platform | Default Backend |
|---|---|
| macOS | Metal |
| Windows | DX12 |
| Linux | Vulkan |
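Conceptually the default is a cfg cascade; a sketch (hypothetical — not Goldy's exact code):

```rust
// Hypothetical sketch of the platform-default cascade.
fn default_backend() -> &'static str {
    if cfg!(target_os = "macos") {
        "metal"
    } else if cfg!(target_os = "windows") {
        "dx12"
    } else {
        "vulkan" // Linux and everything else
    }
}

fn main() {
    println!("default backend: {}", default_backend());
}
```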
Runtime Override — GOLDY_BACKEND
Override the backend at runtime with the GOLDY_BACKEND environment variable:
GOLDY_BACKEND=vulkan cargo run --example triangle
GOLDY_BACKEND=dx12 cargo run --example triangle
Accepted values (case-insensitive):
| Value | Backend |
|---|---|
vulkan, vk | Vulkan |
dx12, d3d12, directx | DX12 |
metal, mtl | Metal |
An unrecognized value produces a clear error listing the valid options.
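A sketch of how such alias parsing typically looks (hypothetical — the accepted values and error behavior above are the authoritative contract):

```rust
// Hypothetical sketch of case-insensitive GOLDY_BACKEND alias parsing.
fn parse_backend(value: &str) -> Result<&'static str, String> {
    match value.trim().to_ascii_lowercase().as_str() {
        "vulkan" | "vk" => Ok("Vulkan"),
        "dx12" | "d3d12" | "directx" => Ok("DX12"),
        "metal" | "mtl" => Ok("Metal"),
        other => Err(format!(
            "unknown GOLDY_BACKEND '{other}' — expected one of: \
             vulkan/vk, dx12/d3d12/directx, metal/mtl"
        )),
    }
}

fn main() {
    assert_eq!(parse_backend("VK"), Ok("Vulkan"));
    assert!(parse_backend("opengl").is_err());
}
```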
Programmatic Selection
Query the active backend at runtime:
```rust
let instance = Instance::new()?;
println!("Backend: {:?}", instance.backend_type());
// Prints: Backend: Dx12   (on Windows)
// Prints: Backend: Vulkan (on Linux)
// Prints: Backend: Metal  (on macOS)
```
Compile-Time Selection (Feature Flags)
You can also restrict which backends are compiled in via Cargo features. This excludes both the code and the dependencies of unselected backends:
cargo build --no-default-features --features vulkan
See Conditional Compilation for details on feature flags, dependency exclusion, and CI setup.
Adapter Enumeration
After creating an Instance, enumerate available GPU adapters to inspect
what hardware is present:
```rust
let instance = Instance::new()?;
let adapters = instance.enumerate_adapters();
for adapter in &adapters {
    println!("{}: {} ({})", adapter.id(), adapter.name(), adapter.vendor());
    println!("  Type: {:?}", adapter.device_type());
}
```
DeviceType
Each adapter reports a DeviceType:
| Variant | Meaning |
|---|---|
DiscreteGpu | Dedicated graphics card with its own VRAM |
IntegratedGpu | GPU integrated into the CPU (shared memory) |
Cpu | Software renderer (e.g. WARP on DX12, lavapipe on Vulkan) |
Other | Unknown or unrecognized device class |
Creating a Device
Request a device with a preferred DeviceType. If no adapter matches,
Goldy falls back to the first available adapter:
```rust
let device = instance.create_device(DeviceType::DiscreteGpu)?;

// Or target a specific adapter by ID:
let device = instance.create_device_for_adapter(adapter.id())?;
```
Backend Capabilities
Device Capabilities
Query format preferences and backend-specific capabilities after creating a device:
```rust
let caps = device.capabilities();
println!("Surface format: {:?}", caps.preferred_surface_format);
println!("Render target fmt: {:?}", caps.preferred_render_target_format);
println!("Zero-copy readback: {}", caps.has_zero_copy_storage_readback);
```
| Capability | Vulkan | DX12 | Metal |
|---|---|---|---|
| Zero-copy CPU storage readback | Yes | No (requires GPU copy to readback heap) | Yes |
| Preferred surface format | Bgra8UnormSrgb | Bgra8UnormSrgb | Bgra8UnormSrgb |
Vulkan Backend
The Vulkan backend requires Vulkan 1.4+ and uses:
- Dynamic rendering (`VK_KHR_dynamic_rendering`) — no `VkRenderPass` or `VkFramebuffer` objects
- Descriptor indexing — bindless resource access by index in shaders
- Buffer device address — 64-bit GPU pointers for direct memory access in shaders
DX12 Backend
The DX12 backend uses the windows crate and provides:
- Root signatures for resource binding
- Descriptor heaps for efficient bindless resource management
- Shader compilation via Slang to DXIL
- WARP software rasterizer for headless/CI use (`GOLDY_DX12_FORCE_WARP=1`)
- GPU-Based Validation for deep debugging (`GOLDY_DX12_GBV=1`)
Metal Backend
The Metal backend uses the metal crate (native Metal, not MoltenVK):
- Argument buffers for bindless resource binding
- Native hazard tracking — Metal tracks resource hazards automatically
- Shader compilation via Slang to Metal Shading Language
The GpuBackend Trait
All backends implement the GpuBackend trait, which defines the full
interface for device management, resource creation, shader compilation,
pipeline management, rendering, and compute dispatch:
```rust
pub trait GpuBackend: Send + Sync {
    fn backend_type(&self) -> BackendType;
    fn enumerate_adapters(&self) -> Vec<AdapterInfo>;
    fn create_device(&mut self, adapter_id: u32) -> Result<DeviceHandle>;
    fn create_buffer(&mut self, device: DeviceHandle, ...) -> Result<BufferHandle>;
    fn create_shader_with_paths(&mut self, device: DeviceHandle, ...) -> Result<ShaderHandle>;
    fn create_pipeline(&mut self, device: DeviceHandle, ...) -> Result<PipelineHandle>;
    // ... rendering, compute, surface, texture, sampler, timeline ...
}
```
Resources are identified by opaque u64 handles (DeviceHandle,
BufferHandle, ShaderHandle, etc.) that each backend maps to native
API objects internally.
Conditional Compilation
Most users should use GOLDY_BACKEND for runtime switching — see
Backend Architecture.
Compile-time feature flags are useful when you need smaller binaries, faster builds, or want to verify that each backend compiles independently in CI.
When to Use Compile-Time Features
Use --no-default-features --features <backend> when you need:
- Smaller binaries — exclude unused backend code
- Faster builds — skip compiling heavy backend dependencies
- Missing SDK — build on a system that lacks the Vulkan SDK or Windows SDK
- CI matrix — verify each backend compiles independently
Feature Flags
Goldy defines one feature per backend plus an instrumentation feature:
[features]
default = ["vulkan", "metal", "dx12", "instrumentation"]
vulkan = ["dep:ash"]
dx12 = ["dep:windows", "dep:gpu-allocator", "dep:windows-core"]
metal = ["dep:metal", "dep:cocoa", "dep:objc", "dep:core-graphics-types",
"dep:foreign-types", "dep:block"]
instrumentation = ["dep:tracing-subscriber"]
Dependency Exclusion
Building with only one backend excludes both the code and the dependencies for the others:
| Feature | Dependencies |
|---|---|
vulkan | ash |
dx12 | windows, gpu-allocator, windows-core |
metal | metal, cocoa, objc, core-graphics-types, foreign-types, block |
# Default build on Windows — compiles Vulkan + DX12 dependencies
cargo build
# Vulkan-only build — downloads only ash
cargo build --no-default-features --features vulkan
# DX12-only build
cargo build --no-default-features --features dx12
This can significantly reduce build times and binary size.
Platform-Specific Considerations
| Backend | Available On | Notes |
|---|---|---|
vulkan | Windows, Linux (any platform with a Vulkan loader) | Broadest platform support |
dx12 | Windows only | Gated by #[cfg(target_os = "windows")] — the feature is ignored on other platforms |
metal | macOS, iOS only | Gated by #[cfg(target_os = "macos")] — the feature is ignored on other platforms |
On macOS, enabling both vulkan and metal is valid — the default
backend will be Metal, but you can switch to Vulkan at runtime via
GOLDY_BACKEND=vulkan if a Vulkan loader (e.g. MoltenVK) is present.
Default Features
The default feature set enables all three backends plus instrumentation:
default = ["vulkan", "metal", "dx12", "instrumentation"]
To override, use --no-default-features and enable only what you need:
# Only Vulkan, no instrumentation
cargo build --no-default-features --features vulkan
# Vulkan + instrumentation
cargo build --no-default-features --features vulkan,instrumentation
# Metal-only on macOS
cargo build --no-default-features --features metal
FFI and Python Feature Passthrough
The goldy-ffi and goldy-py crates propagate features to the core
goldy crate, so you can control backend selection in downstream builds:
# FFI bindings with only Vulkan backend
cargo build -p goldy-ffi --no-default-features --features vulkan
# Python bindings with only DX12 backend
cargo build -p goldy-py --no-default-features --features dx12
This is useful for creating platform-specific binary distributions.
Cross-Compilation
When cross-compiling, keep in mind that platform-gated features are silently ignored if the target platform doesn't match:
# Targeting macOS — dx12 feature is silently ignored, only metal + vulkan
# are active
cargo build --target aarch64-apple-darwin
# Targeting Windows — metal feature is silently ignored
cargo build --target x86_64-pc-windows-msvc --no-default-features --features dx12
For cross-compilation to work, you need the appropriate system SDKs
available. Vulkan is the most portable backend since the ash crate only
needs a Vulkan loader at runtime, not at compile time.
CI Matrix Example
Verify each backend compiles independently in CI:
# GitHub Actions
jobs:
lint:
strategy:
matrix:
include:
- os: ubuntu-latest
features: vulkan
- os: windows-latest
features: vulkan
- os: windows-latest
features: dx12
- os: macos-latest
features: metal
runs-on: ${{ matrix.os }}
steps:
- uses: actions/checkout@v4
- run: cargo clippy --no-default-features --features ${{ matrix.features }} -- -D warnings
Checking the Active Backend
At runtime, query which backend was selected:
```rust
let instance = Instance::new()?;
println!("Backend: {:?}", instance.backend_type());
```
If no backend feature is enabled for the current platform, Instance::new()
returns an error:
No GPU backend available - enable 'vulkan', 'dx12', or 'metal' feature
Debugging and Observability
Goldy provides validation layers, structured instrumentation, and environment variable controls that together cover the full debugging workflow — from catching API misuse to profiling frame timing.
Validation
GOLDY_VALIDATION Environment Variable
The primary control for runtime validation. Accepts a comma-, semicolon-, or whitespace-separated list of categories:
| Value | Effect |
|---|---|
api | Enable backend GPU API validation (see below) |
layout | Enable Rust ↔ Slang struct layout checks and buffer stride checks |
all | Enable both api and layout |
1, true, yes | GPU API validation only (legacy shorthand; does not enable layout checks) |
Categories can be combined:
# API validation only
GOLDY_VALIDATION=api cargo run --example triangle
# Layout validation only
GOLDY_VALIDATION=layout cargo run --example triangle
# Both
GOLDY_VALIDATION=all cargo run --example triangle
GOLDY_VALIDATION=layout,api cargo run --example triangle
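A sketch of a parser honoring the separators and legacy shorthand described above (hypothetical — not Goldy's actual implementation):

```rust
// Hypothetical sketch of GOLDY_VALIDATION parsing: a comma-, semicolon-,
// or whitespace-separated category list, plus the legacy 1/true/yes shorthand.
#[derive(Default, Debug, PartialEq)]
struct Validation {
    api: bool,
    layout: bool,
}

fn parse_validation(value: &str) -> Validation {
    let mut v = Validation::default();
    for tok in value.split(|c: char| c == ',' || c == ';' || c.is_whitespace()) {
        match tok.to_ascii_lowercase().as_str() {
            "api" => v.api = true,
            "layout" => v.layout = true,
            "all" => { v.api = true; v.layout = true; }
            // Legacy shorthand: API validation only, no layout checks.
            "1" | "true" | "yes" => v.api = true,
            _ => {} // ignore empty tokens from consecutive separators
        }
    }
    v
}

fn main() {
    assert_eq!(parse_validation("layout,api"), Validation { api: true, layout: true });
    assert_eq!(parse_validation("1"), Validation { api: true, layout: false });
}
```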
API Validation
When GOLDY_VALIDATION includes api (or 1/true/yes), Goldy
enables backend-specific validation:
| Backend | What Gets Enabled |
|---|---|
| Vulkan | VK_LAYER_KHRONOS_validation + VK_EXT_debug_utils at instance creation |
| Metal | Sets MTL_SHADER_VALIDATION=1 (if not already set) before the first device is created |
| DX12 | See DX12 Debug Layer below |
For Vulkan, validation is also enabled when VK_INSTANCE_LAYERS contains
VK_LAYER_KHRONOS_validation (the standard loader-driven workflow).
Layout Validation
Layout validation catches mismatches between Rust struct layouts and their Slang shader counterparts at shader compile time, and buffer element-stride mismatches at dispatch time.
Enable via either:
GOLDY_VALIDATION=layout cargo run
GOLDY_VALIDATE_LAYOUTS=1 cargo run # legacy variable, equivalent
#[derive(LayoutCheckable)]
Annotate Rust structs that mirror Slang types to opt into automatic validation:
```rust
#[derive(LayoutCheckable)]
#[repr(C)]
struct SceneUniforms {
    projection: [[f32; 4]; 4],
    view: [[f32; 4]; 4],
    time: f32,
}
```
The derive macro generates a LAYOUT_CHECK constant containing the
struct's name, total size, and per-field offsets. Pass it when creating a
shader module:
```rust
let shader = ShaderModule::from_slang_with_options(
    &device,
    source,
    &[], // extra search paths
    &[], // defines
    Default::default(),
    &[SceneUniforms::LAYOUT_CHECK],
)?;
```
When layout validation is enabled, Goldy compiles the Slang shader, reflects each named struct, and compares:
- Total struct size — Rust `size_of` vs. Slang reflection
- Field offsets — each named field's byte offset
A mismatch produces an error naming the struct, the field, and the expected vs. actual offset — immediately surfacing padding or alignment bugs. When validation is disabled, the checks are skipped at zero cost.
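The comparison itself is ordinary layout arithmetic. A self-contained sketch using `std::mem::offset_of!` (stable since Rust 1.77), with hand-written expectations standing in for the data Slang reflection would supply:

```rust
use std::mem::{offset_of, size_of};

// Hand-written stand-in for what Slang reflection would report.
// (In Goldy the derive macro and the Slang compiler supply both sides.)
const EXPECTED: &[(&str, usize)] = &[("projection", 0), ("view", 64), ("time", 128)];
const EXPECTED_SIZE: usize = 132;

#[repr(C)]
struct SceneUniforms {
    projection: [[f32; 4]; 4], // 64 bytes, offset 0
    view: [[f32; 4]; 4],       // 64 bytes, offset 64
    time: f32,                 // 4 bytes, offset 128
}

fn check_layout() -> Result<(), String> {
    let actual = [
        ("projection", offset_of!(SceneUniforms, projection)),
        ("view", offset_of!(SceneUniforms, view)),
        ("time", offset_of!(SceneUniforms, time)),
    ];
    // Compare each named field's byte offset against the expectation.
    for ((name, want), (_, got)) in EXPECTED.iter().zip(actual) {
        if *want != got {
            return Err(format!("SceneUniforms.{name}: expected offset {want}, found {got}"));
        }
    }
    // Compare the total struct size.
    if size_of::<SceneUniforms>() != EXPECTED_SIZE {
        return Err(format!(
            "SceneUniforms: expected size {EXPECTED_SIZE}, found {}",
            size_of::<SceneUniforms>()
        ));
    }
    Ok(())
}

fn main() {
    assert!(check_layout().is_ok());
}
```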
Buffer Stride Validation
At dispatch time (when layout validation is enabled), Goldy also checks
that each bound buffer's element_stride matches the stride the shader
expects from Slang reflection. A mismatch produces an error like:
buffer element-stride mismatch in shader `my_shader`:
slot 0: shader expects element stride 16 but buffer has 4
DX12-Specific Debugging
DX12 Debug Layer
| Variable | Values | Effect |
|---|---|---|
GOLDY_DX12_DEBUG | 1 | Force-enable the D3D12 debug layer (even in release builds) |
GOLDY_DX12_NO_DEBUG | 1 | Disable the D3D12 debug layer (useful for parallel tests that crash the debug layer) |
GOLDY_DX12_GBV | 1 | Enable GPU-Based Validation (very slow; requires the debug layer) |
GPU-Based Validation (GBV) instruments shaders on the GPU to detect issues that the CPU-side debug layer cannot catch — such as out-of-bounds descriptor accesses and uninitialized resource reads. Expect a significant performance hit.
WARP Software Rasterizer
WARP is Microsoft's software implementation of D3D12. It runs on the CPU, so it works on headless CI runners with no GPU.
GOLDY_DX12_FORCE_WARP=1 cargo nextest run
After the first WARP device is created, Goldy prints a confirmation:
[WARP] d3d10warp.dll loaded from: C:\WINDOWS\SYSTEM32\d3d10warp.dll
On Windows, DX12 is the default backend, so GOLDY_DX12_FORCE_WARP=1 is
the only variable you need to run tests on a machine without a GPU.
Structured Instrumentation
Goldy includes a structured instrumentation system built on the tracing
crate. It provides named observation points with hierarchical
dot-notation names and structured context data.
Enabling Instrumentation
Instrumentation requires the instrumentation Cargo feature (enabled by
default). When disabled, all macros compile to no-ops at zero cost.
# Explicitly enable
cargo build --features instrumentation
# Disable (zero-cost removal)
cargo build --no-default-features --features vulkan
goldy_span! — Timed Sections
Create a span to measure the duration of a code section:
```rust
use goldy::goldy_span;

fn compile_shader(&self) {
    let _span = goldy_span!("slang.compile", target = "metal").entered();
    // ... compilation code ...
    // Duration is recorded automatically when _span is dropped
}
```
goldy_event! — Instant Markers
Emit a one-shot structured event:
```rust
use goldy::goldy_event;

goldy_event!("slang.library.load",
    path = %lib_path.display(),
    success = true
);
```
Built-in Observation Points
Goldy instruments its own internals at these observation points:
| Category | Point Name | Emitted Data |
|---|---|---|
| Slang | slang.library.load | path, success |
| | slang.compile.start | target, entry_points, bindless |
| | slang.compile.end | duration_ms, output_size, success |
| | slang.reflection.extract | parameter_blocks, fields |
| Shader | shader.module.create | backend, shader_type |
| | shader.pipeline.create | pipeline_type, bind_groups |
| Resource | resource.buffer.create | size, usage |
| | resource.texture.create | dimensions, format |
| | resource.bind_group.create | bindings_count |
| Render | render.frame.start | frame_id |
| | render.compute.dispatch | workgroups, pipeline |
| | render.draw | vertices, instances |
| | render.frame.end | frame_id, duration_ms |
JSON Logging
Install a JSON file logger to capture all instrumentation output as structured JSON:
```rust
use goldy::instrumentation::install_json_logger;

install_json_logger("/tmp/goldy-debug.json")?;
// All subsequent goldy_span!/goldy_event! calls are written to the file
```
Filtering with RUST_LOG
Use the standard RUST_LOG environment variable to control verbosity.
All Goldy instrumentation uses the goldy target:
RUST_LOG=goldy=debug cargo run --example triangle
RUST_LOG=goldy::render=trace cargo run --example triangle
Environment Variables Summary
| Variable | Values | Effect |
|---|---|---|
GOLDY_BACKEND | vulkan/vk, dx12/d3d12/directx, metal/mtl | Override backend selection |
GOLDY_VALIDATION | api, layout, all, 1/true/yes | Enable validation categories |
GOLDY_VALIDATE_LAYOUTS | 1, true, yes | Enable layout validation (legacy; prefer GOLDY_VALIDATION=layout) |
GOLDY_DX12_FORCE_WARP | 1 | Use WARP software rasterizer |
GOLDY_DX12_DEBUG | 1 | Force-enable D3D12 debug layer in release |
GOLDY_DX12_NO_DEBUG | 1 | Disable D3D12 debug layer |
GOLDY_DX12_GBV | 1 | Enable GPU-Based Validation |
RUST_LOG | e.g. goldy=debug | Filter instrumentation output |
Common Debugging Patterns
Catch API misuse early
GOLDY_VALIDATION=api cargo run --example my_app
Turn on API validation during development to catch invalid GPU API calls. On Vulkan this enables the Khronos validation layer; on Metal it enables shader validation.
Diagnose struct layout bugs
GOLDY_VALIDATION=layout cargo test
If a LayoutCheckable struct diverges from its Slang counterpart (due to
padding, alignment, or a field being added on only one side), the error
message names the exact struct and field.
Headless CI on Windows
GOLDY_DX12_FORCE_WARP=1 cargo nextest run
WARP gives you a fully functional D3D12 device on machines with no GPU.
Combine with GOLDY_VALIDATION=api for maximum coverage.
Profile frame timing
```rust
use goldy::instrumentation::install_json_logger;

install_json_logger("/tmp/goldy-profile.json")?;
// Run your application, then inspect the JSON output for
// render.frame.start / render.frame.end durations
```
Deep DX12 debugging
GOLDY_DX12_DEBUG=1 GOLDY_DX12_GBV=1 cargo run --example my_app
GPU-Based Validation catches GPU-side issues the CPU debug layer cannot see, at a significant performance cost. Use it when you suspect descriptor or resource access bugs.
Python Bindings
Goldy provides Python bindings via PyO3, offering a Pythonic API for GPU programming with seamless NumPy integration.
Installation
From PyPI
pip install goldy
From Source
git clone https://github.com/koubaa/goldy.git
cd goldy/python
pip install maturin
maturin develop --release
Requirements
- Python 3.9+
- NumPy 1.20+
- A GPU with Vulkan 1.4+, DX12, or Metal Tier 2+ support
Optional Dependencies
pip install goldy[dev] # pytest, pillow
pip install pillow # image output only
Quick Start
import goldy
import numpy as np
from PIL import Image
# Setup
instance = goldy.Instance()
device = instance.create_device(goldy.DeviceType.DISCRETE_GPU)
target = goldy.RenderTarget(device, 800, 600, goldy.TextureFormat.RGBA8_UNORM)
# Render
encoder = goldy.CommandEncoder()
with encoder.begin_render_pass() as rp:
rp.clear(goldy.Color.CORNFLOWER_BLUE)
target.render(encoder)
# Read back as NumPy array and save
pixels = target.read_to_cpu() # shape (600, 800, 4), dtype uint8
Image.fromarray(pixels, mode='RGBA').save('hello_goldy.png')
NumPy Integration
Creating GPU Buffers from Arrays
vertices = np.array([
# x, y, r, g, b, a
0.0, -0.5, 1.0, 0.0, 0.0, 1.0,
0.5, 0.5, 0.0, 1.0, 0.0, 1.0,
-0.5, 0.5, 0.0, 0.0, 1.0, 1.0,
], dtype=np.float32)
buffer = goldy.Buffer(device, vertices, goldy.DataAccess.SCATTERED)
Supported dtypes
| NumPy dtype | Typical use case |
|---|---|
np.float32 | Vertex positions, colors, uniforms |
np.float64 | High-precision data |
np.uint32 | Index buffers, compute data |
np.int32 | Signed integer data |
np.uint16 | 16-bit index buffers |
np.uint8 | Raw byte data |
Reading Results Back to NumPy
Render target readback returns a NumPy array directly:
pixels = target.read_to_cpu()
print(pixels.shape) # (height, width, 4)
print(pixels.dtype) # uint8
Updating Buffers
buffer = goldy.Buffer(device, np.zeros(256, dtype=np.float32), goldy.DataAccess.BROADCAST)
# Full update
buffer.write(0, np.random.rand(256).astype(np.float32))
# Partial update (starting at byte offset 64)
buffer.write(64, np.ones(32, dtype=np.float32))
Performance Tips
- Create once, update often — avoid allocating new `Buffer` objects every frame. Use `buffer.write()` instead.
- Use `np.float32` — match the GPU's expected dtype to avoid an extra conversion.
- Ensure contiguity — sliced arrays may not be contiguous. Call `np.ascontiguousarray()` before uploading if needed.
Compute Shaders
Goldy supports GPU compute from Python using Slang shaders.
Basic Example
import goldy
import numpy as np
instance = goldy.Instance()
device = instance.create_device(goldy.DeviceType.DISCRETE_GPU)
data = np.arange(256, dtype=np.float32)
buffer = goldy.Buffer(device, data, goldy.DataAccess.SCATTERED)
SHADER = """
import goldy_exp;
[goldy_compute]
[numthreads(64, 1, 1)]
void cs_main(Scattered<float> data, ThreadId id) {
data[id.x] = data[id.x] * 2.0;
}
"""
shader = goldy.ShaderModule.from_slang(device, SHADER)
pipeline = goldy.ComputePipeline(device, shader)
encoder = goldy.ComputeEncoder()
with encoder.begin_compute_pass() as cp:
cp.set_pipeline(pipeline)
cp.bind_resources([buffer])
cp.dispatch(4, 1, 1) # 4 workgroups × 64 threads = 256 threads
encoder.dispatch(device)
Ping-Pong Buffers
For iterative algorithms, alternate two buffers as input/output:
buf_a = goldy.Buffer(device, initial_data, goldy.DataAccess.SCATTERED)
buf_b = goldy.Buffer(device, initial_data, goldy.DataAccess.SCATTERED)
use_a = True
for _ in range(100):
encoder = goldy.ComputeEncoder()
with encoder.begin_compute_pass() as cp:
cp.set_pipeline(pipeline)
cp.bind_resources([buf_a, buf_b] if use_a else [buf_b, buf_a])
cp.dispatch(workgroups_x, workgroups_y, 1)
encoder.dispatch(device)
use_a = not use_a
Combining Compute and Graphics
Use compute results directly in a subsequent render pass through shared storage buffers:
# Compute pass
compute_encoder = goldy.ComputeEncoder()
with compute_encoder.begin_compute_pass() as cp:
cp.set_pipeline(compute_pipeline)
cp.bind_resources([buffer])
cp.dispatch(workgroups, 1, 1)
compute_encoder.dispatch(device)
# Render pass — reads the same buffer
render_encoder = goldy.CommandEncoder()
with render_encoder.begin_render_pass() as rp:
rp.set_pipeline(render_pipeline)
rp.bind_resources([buffer])
rp.draw(range(3))
target.render(render_encoder)
Key Differences from Rust
| Aspect | Rust | Python |
|---|---|---|
| Instance creation | Instance::new()? | goldy.Instance() |
| Error handling | Result<T, GoldyError> | Raises goldy.GoldyError |
| Buffer data | Buffer::with_data(&device, &[T], access) | goldy.Buffer(device, numpy_array, access) |
| Render pass | encoder.begin_render_pass() returns struct | Context manager (with ... as rp) |
| Pixel readback | target.read_to_cpu() → Vec<u8> | target.read_to_cpu() → NumPy array (H, W, 4) |
| Resource lifetime | Explicit Arc<Device> ownership | Managed by Python GC via PyO3 |
Backend Selection
Goldy auto-selects the best backend per platform (Metal on macOS, DX12 on Windows, Vulkan on Linux). Override with GOLDY_BACKEND:
import os
os.environ["GOLDY_BACKEND"] = "vulkan" # set before importing goldy
import goldy
instance = goldy.Instance()
API Reference
Core Classes
Instance
instance = goldy.Instance()
instance.backend_type # BackendType (Vulkan, DX12, Metal)
instance.enumerate_adapters() # list of AdapterInfo
instance.create_device(type) # Device
Device
device = instance.create_device(goldy.DeviceType.DISCRETE_GPU)
device.is_valid() # bool
Buffer
buf = goldy.Buffer(device, data, access) # data: numpy array or bytes
buf = goldy.Buffer.empty(device, size, access)
buf.size # int (bytes)
buf.write(offset, data) # update contents
RenderTarget
target = goldy.RenderTarget(device, width, height, format, depth_format=None)
target.width, target.height
target.format
target.has_depth
target.render(encoder)
target.read_to_cpu() # numpy array (H, W, 4)
ShaderModule
shader = goldy.ShaderModule.from_slang(device, slang_source)
RenderPipeline
pipeline = goldy.RenderPipeline(device, vertex_shader, fragment_shader, desc)
RenderPipelineDesc
desc = goldy.RenderPipelineDesc(
vertex_layout=None,
topology=goldy.PrimitiveTopology.TRIANGLE_LIST,
target_format=goldy.TextureFormat.RGBA8_UNORM,
depth_stencil=None,
)
CommandEncoder / RenderPass
encoder = goldy.CommandEncoder()
with encoder.begin_render_pass() as rp:
rp.clear(goldy.Color.BLACK)
rp.set_pipeline(pipeline)
rp.set_vertex_buffer(slot, buffer)
rp.set_index_buffer(buffer, format)
rp.bind_resources([buf1, buf2])
rp.draw(vertices, instances=range(1))
rp.draw_indexed(indices, base_vertex, instances)
Compute Classes
ComputePipeline
pipeline = goldy.ComputePipeline(device, shader)
ComputeEncoder
encoder = goldy.ComputeEncoder()
with encoder.begin_compute_pass() as cp:
cp.set_pipeline(pipeline)
cp.bind_resources([buffer])
cp.dispatch(wg_x, wg_y, wg_z)
encoder.dispatch(device)
Enums
# Device selection
goldy.DeviceType.DISCRETE_GPU | INTEGRATED_GPU | CPU | OTHER
# Texture formats
goldy.TextureFormat.RGBA8_UNORM | RGBA8_UNORM_SRGB | BGRA8_UNORM
| R8_UNORM | RG8_UNORM | RGBA16_FLOAT | RGBA32_FLOAT
# Buffer access patterns
goldy.DataAccess.SCATTERED # any thread, any address (StructuredBuffer)
goldy.DataAccess.BROADCAST # all threads same address (ConstantBuffer)
# Texture access patterns
goldy.SpatialAccess.INTERPOLATED # hardware-filtered (Texture2D + sampler)
goldy.SpatialAccess.DIRECT # direct indexing (RWTexture2D)
# Primitive topology
goldy.PrimitiveTopology.POINT_LIST | LINE_LIST | LINE_STRIP
| TRIANGLE_LIST | TRIANGLE_STRIP
# Index format
goldy.IndexFormat.UINT16 | UINT32
Types
Color
color = goldy.Color(r, g, b, a=1.0) # floats 0-1
color = goldy.Color.from_rgb(255, 128, 0) # bytes 0-255
# Predefined
goldy.Color.BLACK | WHITE | RED | GREEN | BLUE | CORNFLOWER_BLUE
VertexBufferLayout
layout = goldy.VertexBufferLayout.vertex_2d() # pos(2) + color(4)
layout = goldy.VertexBufferLayout.vertex_2d_uv() # pos(2) + uv(2)
layout = goldy.VertexBufferLayout(stride, [
goldy.VertexAttribute(location, format, offset),
])
DepthStencilState
depth = goldy.DepthStencilState(
format=goldy.DepthFormat.DEPTH32_FLOAT,
depth_write_enabled=True,
depth_compare=goldy.CompareFunction.LESS,
)
Exceptions
All errors are raised as goldy.GoldyError:
try:
device = instance.create_device(goldy.DeviceType.DISCRETE_GPU)
except goldy.GoldyError as e:
print(f"GPU error: {e}")
.NET Bindings
Goldy provides first-class C# bindings via P/Invoke interop over the native Rust FFI layer.
Installation
NuGet Package
dotnet add package Goldy
Or add to your .csproj directly:
<PackageReference Include="Goldy" Version="0.1.*" />
The NuGet package bundles native Goldy + Slang libraries for all supported platforms — no separate native installation is needed.
Building from Source
cargo build --package goldy-ffi --release
dotnet add reference path/to/goldy/dotnet/Goldy/Goldy.csproj
Requirements
- .NET 8.0 or later
- Windows x64, Linux x64, or macOS (x64 / arm64)
- A GPU with Vulkan 1.4+, DX12, or Metal Tier 2+ support
Quick Start
Headless Rendering
using Goldy;
using var instance = new Instance();
using var device = instance.CreateDevice(DeviceType.DiscreteGpu);
using var target = new RenderTarget(device, 800, 600, TextureFormat.Rgba8Unorm);
var encoder = new CommandEncoder();
encoder.Clear(new Color(0.2f, 0.3f, 0.8f, 1.0f));
target.Render(encoder);
byte[] pixels = target.ReadToCpu();
Console.WriteLine($"Rendered {pixels.Length} bytes ({target.Width}x{target.Height})");
Windowed Rendering
For interactive applications, use Surface with a window handle:
using Goldy;
using var surface = new Surface(device, windowHandle);
while (running)
{
using var frame = surface.Acquire();
var encoder = new CommandEncoder();
encoder.Clear(Color.CornflowerBlue);
// ... draw calls ...
frame.Render(encoder);
surface.Present(frame);
}
Shaders (Slang)
Goldy uses Slang as its shader language across all backends:
var source = """
[shader("vertex")]
float4 vs_main(float2 pos : POSITION) : SV_Position {
return float4(pos, 0.0, 1.0);
}
[shader("fragment")]
float4 fs_main() : SV_Target {
return float4(1.0, 0.5, 0.0, 1.0);
}
""";
using var shader = new ShaderModule(device, source);
using var pipeline = new RenderPipeline(device, shader, new RenderPipelineDesc
{
TargetFormat = TextureFormat.Rgba8Unorm,
Topology = PrimitiveTopology.TriangleList,
});
Resource Management
All Goldy objects implement IDisposable. Use using declarations or using blocks to ensure GPU resources are released promptly:
// Preferred: using declaration (C# 8+)
using var device = instance.CreateDevice(DeviceType.DiscreteGpu);
// Also valid: explicit using block
using (var target = new RenderTarget(device, 512, 512, TextureFormat.Rgba8Unorm))
{
// target is released when the block exits
}
Key Differences from Rust
| Aspect | Rust | C# |
|---|---|---|
| Instance creation | Instance::new()? | new Instance() |
| Error handling | Result<T, GoldyError> | Exceptions |
| Device lifetime | Arc<Device> | IDisposable / using |
| Buffer creation | Buffer::with_data(&device, &[T], access) | Buffer.WithData<T>(device, data, access) |
| Pixel readback | Vec<u8> | byte[] |
| Enums | DeviceType::DiscreteGpu | DeviceType.DiscreteGpu |
API Reference
Instance
public sealed class Instance : IDisposable
{
public Instance();
public IEnumerable<AdapterInfo> EnumerateAdapters();
public Device CreateDevice(DeviceType deviceType);
public Device CreateDeviceById(uint adapterId);
}
Device
public sealed class Device : IDisposable
{
public uint AdapterId { get; }
public bool IsValid { get; }
public ulong GpuProgress { get; }
public void WaitUntil(ulong value);
public bool WaitUntilTimeout(ulong value, uint timeoutMs);
public bool HasLibrary(string name);
}
Buffer
public sealed class Buffer : IDisposable
{
public static Buffer New(Device device, ulong size, DataAccess access);
public static Buffer WithData<T>(Device device, T[] data, DataAccess access)
where T : unmanaged;
public void Write<T>(T[] data) where T : unmanaged;
public void Write<T>(ulong offset, T[] data) where T : unmanaged;
public ulong Size { get; }
}
ShaderModule
public sealed class ShaderModule : IDisposable
{
public ShaderModule(Device device, string slangSource);
}
RenderPipeline / RenderPipelineDesc
public sealed class RenderPipeline : IDisposable
{
public RenderPipeline(Device device, ShaderModule shader, RenderPipelineDesc desc);
}
public sealed class RenderPipelineDesc
{
public TextureFormat TargetFormat { get; set; }
public PrimitiveTopology Topology { get; set; }
// ... vertex layout, depth state
}
CommandEncoder / RenderPass
public sealed class CommandEncoder
{
public CommandEncoder();
public void Clear(Color color);
public RenderPass BeginRenderPass();
}
public sealed class RenderPass : IDisposable
{
public void SetPipeline(RenderPipeline pipeline);
public void SetVertexBuffer(uint slot, Buffer buffer);
public void Draw(uint vertexStart, uint vertexCount,
uint instanceStart = 0, uint instanceCount = 1);
public void DrawIndexed(uint indexCount, uint instanceCount = 1);
}
RenderTarget
public sealed class RenderTarget : IDisposable
{
public RenderTarget(Device device, uint width, uint height, TextureFormat format);
public void Render(CommandEncoder encoder);
public byte[] ReadToCpu();
public void ReadToBuffer(byte[] output);
public uint Width { get; }
public uint Height { get; }
public TextureFormat Format { get; }
public int BufferSize { get; }
}
Surface / SurfaceFrame
public sealed class Surface : IDisposable
{
public Surface(Device device, nint windowHandle);
public SurfaceFrame Acquire();
public void Present(SurfaceFrame frame);
public void Resize(uint width, uint height);
public uint Width { get; }
public uint Height { get; }
}
public sealed class SurfaceFrame : IDisposable
{
public void Render(CommandEncoder encoder);
}
Compute
public sealed class ComputePipeline : IDisposable
{
public ComputePipeline(Device device, ShaderModule computeShader);
}
public sealed class ComputeEncoder
{
public ComputeEncoder();
public void SetPipeline(ComputePipeline pipeline);
public void BindResources(params Buffer[] buffers);
public void BindResourcesRaw(uint[] indices);
public void Dispatch(uint x, uint y, uint z);
public void DispatchIndirect(Buffer buffer, ulong offset);
public void ClearBuffer(Buffer buffer, ulong offset, ulong size);
public void Dispatch(Device device); // dispatch and block
public ulong Submit(Device device); // submit, return timeline value
}
Texture / Sampler
public sealed class Texture : IDisposable
{
public Texture(Device device, uint width, uint height, TextureFormat format,
SpatialAccess access, TextureFlags flags = TextureFlags.None);
public void Write(byte[] data);
public uint Width { get; }
public uint Height { get; }
public TextureFormat Format { get; }
}
public sealed class Sampler : IDisposable
{
public Sampler(Device device, SamplerDesc desc);
}
public struct SamplerDesc
{
public FilterMode MagFilter { get; set; }
public FilterMode MinFilter { get; set; }
public AddressMode AddressModeU { get; set; }
public AddressMode AddressModeV { get; set; }
}
Enums
public enum DeviceType { DiscreteGpu, IntegratedGpu, Cpu, Other }
public enum BackendType { Vulkan, Metal, Dx12 }
public enum DataAccess { Scattered, Broadcast }
public enum SpatialAccess { Interpolated, Direct }
public enum FilterMode { Nearest, Linear }
public enum AddressMode { Repeat, MirrorRepeat, ClampToEdge, ClampToBorder }
public enum TextureFormat
{
Rgba8Unorm, Rgba8Srgb, Bgra8Unorm,
Rgba16Float, Rgba32Float, Depth32Float,
}
public struct Color
{
public float R, G, B, A;
public Color(float r, float g, float b, float a);
public static Color CornflowerBlue { get; }
public static Color Black { get; }
public static Color White { get; }
}
Non-Blocking Submissions
ComputeEncoder.Submit returns a ulong device timeline value. Poll or wait on it via Device.GpuProgress and Device.WaitUntil:
ulong ticket = computeEncoder.Submit(device);
// ... do other work ...
device.WaitUntil(ticket); // block until the GPU catches up
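The submit-then-wait pattern above can be modeled host-side. The sketch below is a toy Rust model of a monotonic device timeline, not the Goldy API: names like Timeline, submit, and wait_until are illustrative. Each submission returns a ticket, and the host polls or blocks until progress reaches it.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

// Toy model of a device timeline: each submit returns a ticket (the value the
// timeline will reach once that submission completes), and the host can poll
// or block until progress passes the ticket. Illustrative names only.
struct Timeline {
    submitted: AtomicU64, // last ticket handed out
    progress: AtomicU64,  // last ticket the "GPU" has retired
}

impl Timeline {
    fn new() -> Self {
        Timeline { submitted: AtomicU64::new(0), progress: AtomicU64::new(0) }
    }
    fn submit(&self) -> u64 {
        self.submitted.fetch_add(1, Ordering::SeqCst) + 1
    }
    fn is_done(&self, ticket: u64) -> bool {
        self.progress.load(Ordering::SeqCst) >= ticket
    }
    fn wait_until(&self, ticket: u64) {
        while !self.is_done(ticket) {
            thread::yield_now(); // a real API blocks on a fence instead of spinning
        }
    }
}

fn main() {
    let tl = Arc::new(Timeline::new());
    let ticket = tl.submit();
    let gpu = Arc::clone(&tl);
    // "GPU" thread retires the submission after doing its work.
    let worker = thread::spawn(move || {
        gpu.progress.store(ticket, Ordering::SeqCst);
    });
    tl.wait_until(ticket); // blocks until progress >= ticket
    assert!(tl.is_done(ticket));
    worker.join().unwrap();
}
```

The same shape holds in C#: Submit hands back the ticket, GpuProgress is the progress counter, and WaitUntil is the blocking wait.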
Examples Gallery
Goldy ships with 22 examples that demonstrate its core concepts. Every example uses Slang shaders and runs on all supported backends (Vulkan 1.4+, DX12, Metal Tier 2+).
Running Examples
cd goldy
cargo run --example <name> --release
All windowed examples support Escape to exit and automatic window-resize handling.
Bindless Basics
These examples cover fundamental Goldy patterns: vertex buffers, the Surface API, uniforms, and fragment shaders.
| Example | What it demonstrates | Source |
|---|---|---|
triangle | The minimal Goldy program. Creates a vertex buffer with colored vertices, builds a render pipeline, and presents to a window via the zero-copy Surface API. | triangle.rs |
gradient | Animated full-screen gradient driven by a time uniform. Uses vertex-less rendering (SV_VertexID) and demonstrates GOLDY_VALIDATE_LAYOUTS for Rust ↔ Slang struct layout validation. | gradient.rs |
window | Triangle with continuous animation, showing the Surface API render loop and frame pacing. | window.rs |
checkerboard | Procedural animated checkerboard via UV distortion in a fragment shader. Also supports GOLDY_VALIDATE_LAYOUTS. | checkerboard.rs |
Compute Workflows
Examples that use ComputePipeline and TaskGraph for GPU-side data processing, including the compute-to-surface pattern.
| Example | What it demonstrates | Source |
|---|---|---|
compute_particles | Full compute + graphics loop. A compute shader updates 1024 particle positions and velocities each frame; a graphics shader renders them as instanced colored quads. Uses TaskGraph for dependency scheduling. | compute_particles.rs |
game_of_life | Conway's Game of Life on the GPU. A compute shader applies cellular-automaton rules on a 128×128 grid using ping-pong BufferViews from a shared BufferPool. A separate graphics pass renders the result. | game_of_life.rs |
compute_to_surface | Pure compute rendering — no RenderPipeline, no CommandEncoder, no vertex buffers. A compute shader writes directly to the swapchain texture via frame.texture() and TaskGraph. Demonstrates the compute-to-surface workflow. | compute_to_surface.rs |
Graphics Pipelines
Classic rendering techniques: depth testing, textures, instancing, and 3D projection.
| Example | What it demonstrates | Source |
|---|---|---|
solid_cube | Solid 3D cube with per-face colors. Demonstrates 3D rendering with a depth buffer and model/view/projection matrices. | solid_cube.rs |
spinning_cube | 3D wireframe cube using line primitives. Shows 3D projection and rotation matrices without depth testing. | spinning_cube.rs |
depth_quads | Two full-screen quads with oscillating depth values. Drawn in a fixed order, the depth buffer (CompareFunction::Less) ensures the nearer quad always wins — proving draw order independence. | depth_quads.rs |
textured_quad | Procedural checkerboard texture displayed on a quad. Demonstrates Texture, Sampler, cross-backend bindless resource access, and linear filtering with repeat addressing. | textured_quad.rs |
instancing | 400 rotating quads driven entirely by the GPU. A compute shader updates per-instance transforms and HSV-derived colors each frame; the graphics shader reads them from a storage buffer — no vertex buffer needed. | instancing.rs |
bouncing_lines | Lines bouncing off window edges. Uses the LINE_LIST primitive topology and simple physics. | bouncing_lines.rs |
waveform | Audio-style waveform visualizer using LINE_STRIP topology and multiple draw calls per frame. | waveform.rs |
Advanced Patterns
More complex examples combining multiple Goldy features or demonstrating interactive input, visual effects, and multi-window management.
Fragment Shader Effects
| Example | What it demonstrates | Source |
|---|---|---|
plasma | Classic demoscene plasma effect using complex trigonometric math in a fragment shader with time-based animation. | plasma.rs |
tunnel | Flying-through-a-tunnel effect using polar coordinates and procedural checkerboard texturing in screen space. | tunnel.rs |
metaballs | Organic blob simulation using distance-field evaluation and thresholding in a fragment shader. | metaballs.rs |
starfield | 3D starfield fly-through simulated entirely in a fragment shader with depth-based brightness. | starfield.rs |
Interactive Input
| Example | What it demonstrates | Source |
|---|---|---|
mandelbrot | Real-time fractal explorer. Arrow keys pan, +/- zoom, R resets. Demonstrates interactive uniform updates driving a fragment shader. | mandelbrot.rs |
particles | Rain and snow particle simulation. Press Space to toggle mode. Shows CPU-driven particle state with per-frame vertex buffer updates. | particles.rs |
digital_clock | 7-segment LED display rendered from vertex data. Space pauses, click changes color. Demonstrates dynamic vertex generation for complex shapes. | digital_clock.rs |
Multi-Window
| Example | What it demonstrates | Source |
|---|---|---|
multi_window | Three simultaneous windows, each running an independent effect (plasma, tunnel, starfield) with its own Surface, pipeline, and input handling. Demonstrates managing multiple GPU surfaces from a single device. | multi_window.rs |
Common Patterns
Surface API Render Loop (Rust)
let frame = surface.begin()?;
let mut encoder = CommandEncoder::new();
{
    let mut pass = encoder.begin_render_pass();
    pass.clear(background_color);
    pass.set_pipeline(&pipeline);
    pass.set_vertex_buffer(0, &vertices);
    pass.draw(0..vertex_count, 0..1);
}
frame.render(encoder)?;
frame.present()?;
Compute + Graphics with TaskGraph
let mut graph = TaskGraph::new();
graph
    .node("update", &compute_pipeline)
    .bind_buffer(&buffer, NodeAccess::ReadWrite)
    .bind_resources_raw(&[buffer.bindless_index().unwrap()])
    .dispatch(workgroups, 1, 1);
graph.dispatch(&device)?;
Slang Shader Template
import goldy_exp;
struct VertexOutput {
float4 position : SV_Position;
float2 uv;
};
[shader("vertex")]
VertexOutput vs_main(float2 pos : POSITION, float2 uv : TEXCOORD) {
VertexOutput output;
output.position = float4(pos, 0.0, 1.0);
output.uv = uv;
return output;
}
[shader("fragment")]
float4 fs_main(VertexOutput input) : SV_Target {
return float4(input.uv, 0.5, 1.0);
}
Motivation
The Problem with "Modern" Graphics APIs
DX12, Vulkan, and Metal are commonly called modern APIs, but they were designed over a decade ago for hardware that has since changed dramatically. Sebastian Aaltonen's "No Graphics API" captures the core tension:
"DirectX 12, Vulkan, and Metal are often referred to as 'modern APIs'. These APIs are now 10 years old. They were initially designed to support GPUs that are now 13 years old, an incredibly long time in GPU history."
The GPU architectures those APIs targeted lacked coherent caches, bindless descriptors, and 64-bit pointers. The APIs compensated with layers of indirection — descriptor sets, render pass objects, explicit image layout transitions, pipeline layouts as first-class objects — that served as hints and contracts for hardware that needed them.
Modern GPUs (roughly 2018+) no longer need most of that scaffolding:
| Then (2012-era) | Now (2018+) |
|---|---|
| Incoherent caches, manual flush | Coherent L2, automatic |
| Discrete memory, explicit copies | PCIe Resizable BAR (ReBAR), unified where possible |
| 32-bit pointers, indirect | 64-bit, direct in shaders |
| CPU-bound descriptor binding | Bindless, GPU-resident |
| Render passes for tile optimization | Dynamic rendering works fine |
Yet every application using these APIs still pays the complexity cost of the old model, even when targeting only recent hardware.
Why Bindless Matters
Traditional GPU programming organizes resources into descriptor sets — fixed layouts of bindings that must be declared ahead of time, allocated from pools, and swapped between draw calls. This model creates a cascade of complexity:
- Pipeline layout explosion: Every unique combination of descriptor set layouts produces a distinct pipeline layout, and each pipeline layout dimension multiplies the total pipeline state permutation count.
- CPU overhead: Updating and binding descriptor sets each frame is a significant portion of CPU-side draw call cost.
- Shader inflexibility: Shaders are coupled to their binding layout; changing which resources a shader accesses means changing the pipeline.
Bindless resource access replaces all of this with a single concept: resources live in GPU-visible memory, and shaders access them by index. There are no set layouts to declare, no pools to manage, no binding points to track. A shader that needs buffer #7 just reads slot 7 from a flat descriptor heap.
This isn't exotic — it's how game engines have been working internally for years. Goldy makes it the public API rather than hiding it behind compatibility abstractions.
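To make the flat-heap idea concrete, here is a toy Rust sketch of a bindless slot table. It is not Goldy's internal representation; BindlessHeap, create, and destroy are illustrative names. Creating a resource hands out an index, destroying it recycles the slot, and a lookup is just an array read, which is exactly what "reads slot 7 from a flat descriptor heap" amounts to.

```rust
// Toy sketch of a flat bindless heap: resources get a slot index at creation
// and are looked up by that index, with no set layouts or binding calls.
struct BindlessHeap<T> {
    slots: Vec<Option<T>>,
    free: Vec<u32>, // recycled slot indices
}

impl<T> BindlessHeap<T> {
    fn new() -> Self {
        BindlessHeap { slots: Vec::new(), free: Vec::new() }
    }

    /// Register a resource and return its slot index.
    fn create(&mut self, resource: T) -> u32 {
        if let Some(idx) = self.free.pop() {
            self.slots[idx as usize] = Some(resource);
            idx
        } else {
            self.slots.push(Some(resource));
            (self.slots.len() - 1) as u32
        }
    }

    /// What a shader effectively does: read slot N from the flat heap.
    fn get(&self, idx: u32) -> Option<&T> {
        self.slots.get(idx as usize).and_then(|s| s.as_ref())
    }

    fn destroy(&mut self, idx: u32) {
        self.slots[idx as usize] = None;
        self.free.push(idx); // slot is recycled for the next resource
    }
}

fn main() {
    let mut heap = BindlessHeap::new();
    let a = heap.create("vertex buffer");
    let b = heap.create("particle buffer");
    assert_eq!((a, b), (0, 1));
    assert_eq!(heap.get(b), Some(&"particle buffer"));
    heap.destroy(a);
    let c = heap.create("uniform buffer"); // reuses the freed slot
    assert_eq!(c, 0);
}
```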
Why a Task Graph
Bindless access means shaders can read any resource at any time. The traditional model of inserting barriers at the call site ("I'm about to read this buffer, so transition it now") breaks down when the set of resources a dispatch touches isn't known until the shader runs.
Goldy uses a task graph to solve this. You declare tasks and their resource dependencies; Goldy derives the barriers, layout transitions, and execution order automatically. This is both safer (no missed barriers) and simpler (no manual synchronization) than the alternative.
The task graph also enables Goldy to batch and reorder work across the frame, which matters for compute-heavy workloads where multiple dispatches feed into each other before anything reaches the screen.
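A minimal sketch of that scheduling idea in Rust, assuming nothing about Goldy's internals: each node declares read and write sets over resource indices, hazards between nodes derive the edges, and nodes with no dependency on each other fall into the same wave.

```rust
use std::collections::HashSet;

// Toy sketch of dependency-driven scheduling. Node, depends, and schedule are
// illustrative names, not Goldy's scheduler.
struct Node {
    name: &'static str,
    reads: HashSet<u32>,  // bindless resource indices this node reads
    writes: HashSet<u32>, // indices this node writes
}

fn depends(later: &Node, earlier: &Node) -> bool {
    // `later` must wait if it touches anything `earlier` wrote,
    // or writes anything `earlier` read.
    later.reads.iter().chain(later.writes.iter()).any(|r| earlier.writes.contains(r))
        || later.writes.iter().any(|w| earlier.reads.contains(w))
}

fn schedule(nodes: &[Node]) -> Vec<Vec<&'static str>> {
    // wave[i] = earliest wave node i can run in
    let mut wave = vec![0usize; nodes.len()];
    for i in 0..nodes.len() {
        for j in 0..i {
            if depends(&nodes[i], &nodes[j]) {
                wave[i] = wave[i].max(wave[j] + 1);
            }
        }
    }
    let max = wave.iter().copied().max().unwrap_or(0);
    (0..=max)
        .map(|w| {
            nodes.iter()
                .enumerate()
                .filter(|&(i, _)| wave[i] == w)
                .map(|(_, n)| n.name)
                .collect()
        })
        .collect()
}

fn main() {
    let set = |v: &[u32]| v.iter().copied().collect::<HashSet<_>>();
    let nodes = [
        Node { name: "simulate", reads: set(&[]),  writes: set(&[0]) },
        Node { name: "blur",     reads: set(&[1]), writes: set(&[2]) }, // independent
        Node { name: "draw",     reads: set(&[0]), writes: set(&[3]) }, // after simulate
    ];
    // "simulate" and "blur" share no resources, so they land in the same wave.
    assert_eq!(schedule(&nodes), vec![vec!["simulate", "blur"], vec!["draw"]]);
}
```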
Why Slang
The shader language landscape is fragmented. GLSL, HLSL, MSL, and WGSL each target a subset of platforms, and none is a clean superset of the others. Libraries that support multiple shading languages maintain translation layers and per-language workarounds, which is a significant source of bugs and complexity.
Slang solves this at the source level. A single Slang source file compiles to SPIR-V (Vulkan), DXIL (DX12), and MSL (Metal). It uses HLSL-familiar syntax with additions that matter for modern GPU programming:
| Feature | Why it matters |
|---|---|
| Modules and import | True separate compilation, no #include fragility |
| Generics | Type-safe reusable shader code |
| Automatic differentiation | First-class for ML and physics workloads |
| Khronos governance | Long-term stability and active development |
By committing to Slang as the sole shader language, Goldy eliminates an entire category of cross-platform bugs and keeps its codebase focused on GPU work rather than shader translation.
Intellectual Roots
Goldy synthesizes ideas from several sources:
- Sebastian Aaltonen, "No Graphics API" — The primary philosophical foundation. Modern GPUs have converged enough that a dramatically simpler API is possible if you drop legacy support.
- Raph Levien, "Requiem for piet-gpu-hal" — The insight that good abstractions expose cost and reality while abstracting meaning and rules. Classic HALs failed by hiding both.
- wgpu — Excellent API ergonomics (Instance/Device architecture, CommandEncoder pattern, explicit pass structure). Goldy borrows patterns but is free to diverge from the WebGPU spec.
- Wayland compositor architecture — Frames, not commands. Explicit synchronization, not implicit state machines.
- TU Darmstadt, "Recursive Hardware Abstraction Layers" — Rigorous analysis of what a minimal HAL actually needs when targeting converged modern hardware.
- CUDA — A composable language that exposes memory directly, with a broad library ecosystem built on that simplicity.
No single source defines Goldy. The value is in the synthesis — and the willingness to ship an opinionated library rather than wait for committee consensus.
The Name
Goldy aspires to exist in the golden mean between wgpu's emphasis on compatibility and the vision of no-graphics-api.
Further Reading
- Sebastian Aaltonen: No Graphics API
- Raph Levien: Requiem for piet-gpu-hal
- TU Darmstadt: Recursive HALs
- What Goldy Sheds
- Goldy vs wgpu
What Goldy Sheds
Goldy's bindless model and modern-hardware baseline make several traditional GPU programming concepts unnecessary. These aren't missing features — they're intentional design choices that keep the API small and the programming model coherent.
No Descriptor Set Management
Traditional APIs require you to declare descriptor set layouts, allocate descriptor pools, write descriptor sets, and bind them before each draw or dispatch. A typical Vulkan pipeline touches three to four descriptor set objects before anything reaches the GPU.
Goldy replaces all of this with a flat bindless heap. Resources get a slot index when created, and shaders access them by that index. There are no layouts, no pools, no binding calls.
// Shader receives resources by index — no descriptor sets
[goldy_compute]
[numthreads(64, 1, 1)]
void cs_main(Scattered<Particle> particles, ThreadId id) {
particles[id.x].position += particles[id.x].velocity;
}
This also eliminates pipeline layouts as objects. In Vulkan, each unique combination of descriptor set layouts produces a pipeline layout, which is baked into the pipeline at creation time. Goldy's single global bindless layout means one pipeline layout for all pipelines.
No Manual Barrier Insertion
In Vulkan and DX12, you manually insert memory barriers and image layout transitions to tell the GPU when a resource changes from "written by compute" to "read by fragment" (or any other transition). Missing a barrier is a silent correctness bug; inserting too many is a performance bug.
Goldy's task graph handles this automatically. You declare what each task reads and writes; Goldy derives the minimal set of barriers and transitions. This is both safer and typically more efficient than hand-placed barriers, because the task graph has a global view of the frame.
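As a toy illustration of deriving barriers from declared accesses (not Goldy's actual barrier logic), the sketch below walks accesses in submission order and emits a barrier only where a read-after-write, write-after-read, or write-after-write hazard exists; read-after-read needs none.

```rust
// Toy sketch: derive barriers from declared accesses in submission order.
// derive_barriers and Access are illustrative names.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Access { Read, Write }

fn derive_barriers(accesses: &[(&'static str, u32, Access)]) -> Vec<String> {
    use std::collections::HashMap;
    // Track the last (node, access) seen for each resource index.
    let mut last: HashMap<u32, (&str, Access)> = HashMap::new();
    let mut barriers = Vec::new();
    for &(node, res, acc) in accesses {
        if let Some(&(prev, prev_acc)) = last.get(&res) {
            // Read-after-read needs no barrier; RAW, WAR, and WAW all do.
            if !(prev_acc == Access::Read && acc == Access::Read) {
                barriers.push(format!("barrier on #{res}: {prev} -> {node}"));
            }
        }
        last.insert(res, (node, acc));
    }
    barriers
}

fn main() {
    let frame = [
        ("simulate", 0, Access::Write),
        ("draw",     0, Access::Read), // read-after-write: barrier
        ("overlay",  0, Access::Read), // read-after-read: no barrier
    ];
    assert_eq!(
        derive_barriers(&frame),
        vec!["barrier on #0: simulate -> draw".to_string()]
    );
}
```

Missing a hand-placed barrier in this situation is exactly the silent bug the prose describes; deriving the list from declarations makes it unrepresentable.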
No Shader Permutation Systems
Traditional engines maintain thousands of shader variants — combinations of feature flags, render pass compatibility, descriptor set layout versions, and pipeline state. Some ship dedicated cloud infrastructure just to compile and cache them all.
Goldy collapses most of the dimensions that drive permutation counts:
| Traditional dimension | Goldy equivalent |
|---|---|
| Render pass compatibility | Dynamic rendering — no render pass objects |
| Descriptor set layout | One global bindless layout |
| Pipeline layout | Implicit from the global layout |
| Viewport/scissor state | Dynamic state, not baked into PSO |
What remains — shader source × vertex format × target format × depth config — is a small, manageable space. Goldy addresses pipeline variety by having fewer pipelines, not by building infrastructure to manage many variants.
Minimal Pipeline State Management
A Vulkan VkGraphicsPipelineCreateInfo touches blend state, depth/stencil state, rasterizer state, multisample state, input assembly, viewport/scissor, dynamic state flags, render pass, subpass, pipeline layout, and shader stages. Many of these are baked in at pipeline creation time, producing the combinatorial explosion that drives PSO caches.
Goldy uses dynamic rendering and dynamic state to move viewport, scissor, and render target configuration out of the pipeline object. The remaining pipeline state is intentionally minimal:
let pipeline = RenderPipeline::new(&device, &shader, &shader, &desc)?;
Blend mode, depth testing, and vertex format are still part of the pipeline — they represent genuine hardware configuration. But the many compatibility dimensions that traditional APIs bake in are gone.
No Separate Compute API
OpenCL introduced compute to GPUs as an entirely separate API with its own device model, memory model, and dispatch semantics. Even "unified" APIs like Vulkan treat compute as a second-class citizen — compute pipelines and graphics pipelines share almost no code paths.
In Goldy, compute is a first-class citizen on the same footing as graphics. Compute shaders use the same bindless resource model, the same buffer types, and the same task graph. A compute dispatch that writes to a buffer and a draw call that reads from it are just nodes in the same graph.
// Compute updates particles, render draws them — same resources, same graph
graph.add_compute("update", &compute_shader, &[&particle_buf], [workgroups, 1, 1]);
graph.add_render("draw", &render_pipeline, &[&particle_buf], &surface);
The Design Principle
Each of these omissions follows the same logic: if modern hardware doesn't need a concept for correctness or performance, Goldy doesn't expose it. The result is an API where the concepts that remain — buffers, textures, shaders, pipelines, task graph — each carry their weight.
Goldy vs wgpu
Both Goldy and wgpu are Rust GPU libraries with multi-backend support. They make different tradeoffs that suit different use cases.
At a Glance
| | wgpu | Goldy |
|---|---|---|
| Identity | WebGPU implementation for Rust | Modern Rust GPU library |
| Spec governance | W3C WebGPU specification | Independent, opinionated |
| Browser support | Yes (WebGPU) | No |
| Minimum hardware | Wide compatibility (Vulkan 1.0+) | Modern only (Vulkan 1.4+, DX12, Metal Tier 2+) |
| Shader language | WGSL (primary); SPIR-V and GLSL via naga | Slang (compiles to SPIR-V, DXIL, MSL) |
| Resource model | Descriptor-based (bind groups) | Typed bindless |
| Synchronization | Manual pass ordering | Task graph |
| Metal support | Native via wgpu-hal (or Vulkan via MoltenVK) | Native Metal backend |
| Compute model | Supported but secondary | First-class (compute-to-surface) |
Resource Binding: Descriptors vs Bindless
wgpu uses bind groups — the WebGPU equivalent of Vulkan descriptor sets. You declare a bind group layout, create bind groups that match it, and bind them before each draw or dispatch:
// wgpu: declare layout, create group, bind before draw
let layout = device.create_bind_group_layout(&desc);
let group = device.create_bind_group(&wgpu::BindGroupDescriptor {
    layout: &layout,
    entries: &[wgpu::BindGroupEntry {
        binding: 0,
        resource: buffer.as_entire_binding(),
    }],
    ..
});
pass.set_bind_group(0, &group, &[]);
Goldy uses bindless access. Resources get a slot index at creation time, and shaders access them directly by index. There are no layouts, groups, or binding calls:
// Goldy: buffer already has a bindless slot, shader reads it by index
let buffer = Buffer::with_data(&device, &data, DataAccess::Scattered)?;
pass.bind_resources_raw(&[buffer.bindless_index().unwrap()]);
The bindless approach eliminates an entire layer of API surface and the pipeline layout permutations that come with it.
Synchronization: Manual vs Task Graph
wgpu provides implicit synchronization within a render/compute pass but requires you to order passes correctly. Resource transitions between passes are handled by wgpu internally, following WebGPU's implicit rules.
Goldy uses an explicit task graph. You declare tasks and their resource dependencies; Goldy derives barriers, layout transitions, and execution order. This gives the runtime a global view of the frame for optimal scheduling and makes synchronization bugs structurally impossible.
Shader Language: WGSL vs Slang
wgpu's primary shader language is WGSL, the WebGPU Shading Language. WGSL is designed for safety and portability across web and native targets, but it lacks features like modules, generics, and automatic differentiation.
Goldy uses Slang exclusively. Slang compiles a single source file to SPIR-V (Vulkan), DXIL (DX12), and MSL (Metal). It provides modules with true separate compilation, generics, and HLSL-familiar syntax. The goldy_exp shader library builds on Slang's module system to provide shared types and utilities:
import goldy_exp;
[goldy_compute]
[numthreads(64, 1, 1)]
void cs_main(Scattered<Particle> particles, ThreadId id) {
particles[id.x].position += particles[id.x].velocity;
}
Compute as First-Class Citizen
wgpu supports compute shaders, but the API is oriented around render passes. Compute-to-render workflows require manual buffer management and pass ordering.
Goldy treats compute and graphics as peers. Compute-to-surface is a built-in pattern: a compute dispatch writes to a buffer or texture, and a subsequent render pass reads from it, with the task graph handling the dependency automatically.
Metal: Native vs MoltenVK
wgpu supports Metal through its wgpu-hal Metal backend or via MoltenVK (Vulkan-on-Metal translation). MoltenVK adds a translation layer that can introduce overhead and compatibility limitations.
Goldy has a native Metal backend that uses Metal APIs directly — Argument Buffers Tier 2 for bindless, MSL compiled from Slang, and native Metal types throughout. No translation layer sits between Goldy and the Metal driver.
Architecture
wgpu:
Application → wgpu (WebGPU API) → wgpu-hal → Vulkan / Metal / DX12 / WebGPU
Goldy:
Application → Goldy (native API) → Vulkan 1.4+ / Metal Tier 2+ / DX12
wgpu implements the WebGPU specification faithfully, then maps it onto each backend through an internal HAL. Goldy talks to each backend directly using native idioms.
When to Choose Which
Choose wgpu when:
- You need browser deployment via WebGPU
- You need to support older GPUs or wide device compatibility
- You want the stability of a specification-driven API
- You need the wgpu ecosystem (examples, community, tooling)
Choose Goldy when:
- You target only modern desktop/mobile hardware (2018+)
- You want a minimal API surface with bindless as the default
- You want native Metal without a translation layer
- You want Slang's module system and shader language features
- Compute workloads are central to your application
Both libraries are valid choices — the right one depends on your hardware requirements, deployment targets, and whether you value broad compatibility or API simplicity.
Target Hardware
Goldy targets modern GPUs exclusively. This is a deliberate design choice — by requiring hardware from roughly 2018 onward, Goldy can use bindless descriptors, dynamic rendering, and coherent caches as baseline assumptions rather than optional features.
Backend Requirements
Vulkan 1.4+
Goldy requires Vulkan 1.4, which promotes several extensions that were optional in earlier versions to core:
| Feature | Vulkan history | Goldy usage |
|---|---|---|
| Dynamic rendering | VK_KHR_dynamic_rendering (1.3) | No render pass objects |
| Descriptor indexing | VK_EXT_descriptor_indexing (1.2) | Bindless resource access |
| Buffer device address | VK_KHR_buffer_device_address (1.2) | 64-bit GPU pointers |
| Synchronization2 | VK_KHR_synchronization2 (1.3) | Simplified barrier model |
| Push descriptors | Core in 1.4 | Efficient uniform updates |
Supported hardware:
- NVIDIA: Turing and later (RTX 2000 / GTX 1600 series, 2018+)
- AMD: RDNA 1 and later (RX 5000 series, 2019+)
- Intel: Xe architecture and later (Arc, 2022+)
- Qualcomm: Adreno 650+ (2019+, driver dependent)
DX12
Goldy's DX12 backend requires:
| Requirement | Details |
|---|---|
| D3D12 Enhanced Barriers | Windows 11 + WDDM 3.0+ driver |
| ResourceDescriptorHeap | Shader Model 6.6 bindless access |
| Root constants | Push constants equivalent |
Enhanced Barriers are mandatory — Goldy does not fall back to legacy resource state transitions. This effectively requires Windows 11 with a modern driver.
For software rendering and CI, Goldy supports the WARP software rasterizer via GOLDY_DX12_FORCE_WARP=1.
Metal Tier 2+
Goldy's Metal backend is native (no MoltenVK) and requires Argument Buffers Tier 2 for bindless resource access:
| Requirement | Details |
|---|---|
| Argument Buffers Tier 2 | Bindless via ParameterBlock |
| MSL (via Slang) | Slang compiles directly to Metal Shading Language |
Supported hardware:
- Apple Silicon: All models (M1/M2/M3/M4, A14+)
- Intel Macs: 2017+ (different iGPUs; some very early Intel UHD may not qualify)
- AMD discrete GPUs in Macs: 2015+
Older Intel integrated GPUs (pre-2017 Macs) are not supported — they lack Argument Buffers Tier 2.
What "Modern GPU" Means for Goldy
Goldy's hardware floor is defined by a set of architectural capabilities, not specific product names:
| Capability | Why Goldy needs it |
|---|---|
| Coherent L2 cache | No manual cache flush/invalidate logic |
| Bindless descriptors | Single global descriptor model, no set layouts |
| Dynamic rendering | No render pass objects or framebuffer compatibility |
| 64-bit buffer addresses | Direct pointer access in shaders |
| Unified or ReBAR memory | Simplified CPU-GPU data transfer |
GPUs from roughly 2018 onward universally support these features. The specific API version requirements (Vulkan 1.4, DX12 Enhanced Barriers, Metal Tier 2) are the mechanism by which Goldy enforces this floor.
What This Excludes
| Excluded | Reason |
|---|---|
| NVIDIA GTX 900 series (Maxwell) | No Vulkan 1.4 support |
| AMD GCN (RX 400/500) | Driver support ended; limited bindless |
| Intel Gen9 (HD 500/600) | Incomplete Vulkan feature coverage |
| Intel integrated GPUs pre-2017 (Mac) | No Argument Buffers Tier 2 |
| Pre-Windows 11 DX12 | No Enhanced Barriers |
Checking Compatibility
Goldy reports unsupported devices at initialization:
let instance = Instance::new()?;
for adapter in instance.enumerate_adapters() {
    println!("{}: {:?}", adapter.name, adapter.device_type);
}
// create_device returns an error on unsupported hardware
let device = instance.create_device(DeviceType::DiscreteGpu)?;
The Tradeoff
By drawing a line at modern hardware, Goldy avoids the fallback paths, compatibility checks, and feature-level negotiation that dominate traditional GPU libraries. Every code path in Goldy assumes the full feature set is available. This keeps the implementation small and the API surface predictable.
The cost is clear: Goldy cannot run on the long tail of older hardware. For applications that need broad device support, wgpu is the better choice.
Slang Quick Reference
Goldy uses Slang as its sole shading language. This page covers what you need to write Goldy shaders — not a full Slang language reference.
Basics
Slang uses HLSL-style syntax. If you've written HLSL or GLSL, most of it will look familiar.
Scalar Types
```slang
float f = 1.0;
int i = -5;
uint u = 10;
bool b = true;
```
Vector and Matrix Types
```slang
float2 v2 = float2(1.0, 2.0);
float3 v3 = float3(1.0, 2.0, 3.0);
float4 v4 = float4(1.0, 2.0, 3.0, 4.0);

// Swizzling
float2 xy = v4.xy;
float3 rgb = v4.rgb;

// Matrices
float4x4 mvp;
float4 transformed = mul(mvp, float4(pos, 1.0));
```
Structs
```slang
struct Particle {
    float2 position;
    float2 velocity;
    float age;
};
```
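On the Rust side, the matching plain-data struct might look like the sketch below. The `#[repr(C)]` layout and tight packing are assumptions for illustration; actual GPU stride rules depend on the target, which is what Goldy's layout validation (`GOLDY_VALIDATION=layout`) exists to catch:

```rust
// Hypothetical Rust-side mirror of the Slang `Particle` struct above.
// `#[repr(C)]` fixes field order and uses C layout so the byte layout
// can line up with the GPU-side struct (assuming tightly packed fields).
#[repr(C)]
#[derive(Clone, Copy, Debug)]
struct Particle {
    position: [f32; 2],
    velocity: [f32; 2],
    age: f32,
}

fn main() {
    // 2 + 2 + 1 floats = 20 bytes with no padding under repr(C).
    assert_eq!(std::mem::size_of::<Particle>(), 20);
    println!("Particle stride: {} bytes", std::mem::size_of::<Particle>());
}
```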
Functions
```slang
float square(float x) { return x * x; }

// Public functions are exported from modules
public float3 my_effect(float2 uv) { return float3(uv, 0.5); }
```
Modules
Slang has a real module system (not `#include`-based textual inclusion). Modules are separate compilation units:
```slang
// In mylib.slang
module mylib;
public float3 effect(float2 uv) { return float3(uv, 1.0); }

// In shader.slang
import mylib;
float3 c = effect(uv);
```
goldy_exp Resource Types
The goldy_exp module defines type aliases that map to native Slang buffer and texture types. When used as parameters in [goldy_*] entry points, the Goldy compiler automatically resolves slot indices to live resource handles.
Buffer Types
| Type alias | Underlying type | Access pattern | Usage |
|---|---|---|---|
| `Scattered<T>` | `StorageBuffer<T>` (`RWStructuredBuffer<T>`) | Read/write, any thread, any address | `data[i]`, `data[i].field = v` |
| `BufRO<T>` | `ReadOnlyBuffer<T>` (`StructuredBuffer<T>`) | Read-only, hardware read-cache hint | `data[i]` |
| `ByteAddress` | `ByteAddressView` (`RWByteAddressBuffer`) | Raw byte-level access | `.Load(addr)`, `.Store(addr, v)`, `.InterlockedMin(...)` |
Texture Types
| Type alias | Underlying type | Access pattern | Usage |
|---|---|---|---|
| `Interpolated<T>` | `Texture2D<T>` | Hardware-filtered sampling | `tex.Sample(samp, uv)`, `tex.Load(loc)` |
| `DirectSpatial<T>` | `RWTexture2D<T>` | Direct 2D read/write, no filtering | `img[int2(x,y)]`, `img.GetDimensions(w,h)` |
Sampler Type
| Type alias | Underlying type | Usage |
|---|---|---|
| `Filter` | `SamplerState` | Pass to `tex.Sample(filter, uv)` |
Broadcast (Constant Buffer)
To pass uniform data (same value for all threads), declare a struct type directly as a parameter — no wrapper needed. The codegen recognizes any non-resource, non-system-value struct as a constant-buffer broadcast:
```slang
struct TimeUniforms { float time; float delta_time; };

[goldy_compute]
[numthreads(64, 1, 1)]
void cs_main(TimeUniforms cfg, Scattered<Particle> particles, ThreadId id) {
    particles[id.x].position += particles[id.x].velocity * cfg.delta_time;
}
```
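A hedged sketch of the corresponding Rust-side data: the struct mirror is illustrative, and the 16-byte rounding shown here is an assumption based on the usual D3D/Vulkan constant-buffer convention, not a documented Goldy requirement:

```rust
// Hypothetical Rust mirror of the Slang `TimeUniforms` broadcast struct.
#[repr(C)]
struct TimeUniforms {
    time: f32,
    delta_time: f32,
}

// Constant-buffer bindings on most APIs expect sizes in 16-byte multiples
// (assumption: Goldy follows that convention), so a padded variant is shown.
#[repr(C)]
struct TimeUniformsPadded {
    time: f32,
    delta_time: f32,
    _pad: [f32; 2], // rounds 8 bytes up to 16
}

fn main() {
    assert_eq!(std::mem::size_of::<TimeUniforms>(), 8);
    assert_eq!(std::mem::size_of::<TimeUniformsPadded>(), 16);
}
```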
System-Value Types
Declare these as parameters in [goldy_*] entry points to receive GPU-provided values. The codegen maps each type to its SV_* semantic automatically.
Compute
| Type | Maps to | Components |
|---|---|---|
| `ThreadId` | `SV_DispatchThreadID` | `.x`, `.y`, `.z`, `.xy`, `.xyz` |
| `GroupThreadId` | `SV_GroupThreadID` | `.x`, `.y`, `.z`, `.xy`, `.xyz` |
| `GroupId` | `SV_GroupID` | `.x`, `.y`, `.z`, `.xy`, `.xyz` |
Graphics
| Type | Maps to | Components |
|---|---|---|
| `VertexId` | `SV_VertexID` | `.value` |
| `InstanceId` | `SV_InstanceID` | `.value` |
| `IsFrontFace` | `SV_IsFrontFace` | `.value` |
Entry Point Attributes
[goldy_compute]
Marks a compute shader entry point. The Goldy compiler generates the real [shader("compute")] wrapper that resolves resource slots and system values.
```slang
import goldy_exp;

[goldy_compute]
[numthreads(64, 1, 1)]
void cs_main(Scattered<uint> data, uint offset, ThreadId id) {
    data[id.x + offset] += 1;
}
```
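With 64 threads per group, the Rust-side dispatch needs a ceiling-divided group count. A small sketch (the helper itself is illustrative, not part of Goldy's API):

```rust
// Ceiling division: number of `group_size`-wide groups needed to cover
// `n` elements. Assumes `n + group_size` does not overflow u32.
fn group_count(n: u32, group_size: u32) -> u32 {
    (n + group_size - 1) / group_size
}

fn main() {
    assert_eq!(group_count(0, 64), 0);
    assert_eq!(group_count(64, 64), 1);
    assert_eq!(group_count(65, 64), 2);
    // Threads past the element count should be guarded in the shader,
    // e.g. `if (id.x >= count) return;`
    println!("{} groups for 1000 elements", group_count(1000, 64));
}
```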
[goldy_vertex]
Marks a vertex shader entry point.
```slang
import goldy_exp;

struct VSOutput {
    float4 position : SV_Position;
    float4 color : COLOR;
};

[goldy_vertex]
VSOutput vs_main(BufRO<Vertex> verts, VertexId vid) {
    Vertex v = verts[vid.value];
    VSOutput o;
    o.position = float4(v.pos, 0.0, 1.0);
    o.color = v.color;
    return o;
}
```
[goldy_fragment]
Marks a fragment shader entry point.
```slang
import goldy_exp;

[goldy_fragment]
float4 fs_main(Interpolated<float4> tex, Filter samp, float2 uv : TEXCOORD0) : SV_Target {
    return tex.Sample(samp, uv);
}
```
Common Patterns
Accessing Buffers by Index
All Scattered<T> and BufRO<T> parameters support standard array indexing. Field-level writes work directly on Scattered<T>:
```slang
[goldy_compute]
[numthreads(64, 1, 1)]
void cs_main(Scattered<Particle> particles, ThreadId id) {
    Particle p = particles[id.x];
    p.position += p.velocity;
    particles[id.x] = p;
    // Or field-level write:
    particles[id.x].age += 1.0;
}
```
Sampling Textures
```slang
[goldy_fragment]
float4 fs_main(Interpolated<float4> albedo, Filter samp, float2 uv : TEXCOORD0) : SV_Target {
    return albedo.Sample(samp, uv);
}
```
Writing to Storage Images
```slang
[goldy_compute]
[numthreads(8, 8, 1)]
void cs_main(DirectSpatial<float4> output, ThreadId id) {
    output[int2(id.x, id.y)] = float4(float(id.x) / 512.0, float(id.y) / 512.0, 0.5, 1.0);
}
```
Fullscreen Triangle (Vertex-less)
Use vs_fullscreen_triangle() from goldy_exp to render fullscreen effects without a vertex buffer:
```slang
import goldy_exp;

[shader("vertex")]
FullscreenVarying vs_main(uint vertex_id : SV_VertexID) {
    return vs_fullscreen_triangle(vertex_id);
}

[shader("fragment")]
float4 fs_main(FullscreenVarying input) : SV_Target {
    return float4(input.uv, 0.5, 1.0);
}
```
Compute + Render Buffer Sharing
Compute shaders and graphics shaders share the same bindless buffers. The task graph handles the dependency:
```slang
// Compute: update particles
[goldy_compute]
[numthreads(64, 1, 1)]
void cs_update(TimeUniforms cfg, Scattered<Particle> particles, ThreadId id) {
    particles[id.x].position += particles[id.x].velocity * cfg.delta_time;
}

// Vertex: read particles for rendering
[goldy_vertex]
VSOutput vs_draw(BufRO<Particle> particles, InstanceId iid, VertexId vid) {
    Particle p = particles[iid.value];
    // Generate quad geometry from particle position...
}
```
Rust-Side Resource Binding
Resources are bound in declaration order (left to right in the shader signature):
```rust
pass.bind_resources_raw(&[
    cfg_buf.bindless_index().unwrap(),
    particle_buf.bindless_index().unwrap(),
]);
```
Plain scalar parameters (such as `uint offset`) are bound as push constants; no wrapper struct is needed.
goldy_exp Utility Modules
| Module | Contents |
|---|---|
| `goldy_exp/math.slang` | `PI`, `TAU`, `hash()`, `hash2()`, `center_uv()`, `scale_uv()`, `to_polar()`, `smootherstep()` |
| `goldy_exp/color.slang` | `rainbow()`, `palette()`, `heat()`, `hsv_to_rgb()`, `luminance()`, `gamma_correct()` |
| `goldy_exp/primitives.slang` | `quad_position()`, `quad_position_rotated()`, `billboard_position()`, `fullscreen_position()`, `fullscreen_uv()` |
| `goldy_exp/types.slang` | `Particle2D`, `Particle3D`, `FrameUniforms`, `Transform2D`, `Instance2D` |
| `goldy_exp/vertex.slang` | `FullscreenVarying`, `ColoredVertex`, `ColoredVarying`, `vs_fullscreen_triangle()` |
| `goldy_exp/access.slang` | Resource type aliases and system-value types (documented above) |
Environment Variables
Goldy reads several environment variables at runtime for backend selection, validation, debugging, and Slang configuration.
General
| Variable | Values | Default | Description |
|---|---|---|---|
| `GOLDY_BACKEND` | `vulkan`, `vk`, `dx12`, `d3d12`, `directx`, `metal`, `mtl` | Platform default (macOS → Metal, Windows → DX12, Linux → Vulkan) | Override backend selection at runtime. |
| `GOLDY_SLANG_PATH` | File path | (not set) | Override the path to the Slang shared library (`slang.dll` / `libslang.dylib` / `libslang.so`). Bypasses the default search order (vendored next to executable → extracted from embedded). |
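The alias mapping in the table above can be sketched as follows; this is an illustrative parser, not Goldy's actual implementation, and the `Backend` enum here is hypothetical:

```rust
#[derive(Debug, PartialEq)]
enum Backend {
    Vulkan,
    Dx12,
    Metal,
}

// Map a GOLDY_BACKEND value to a backend, case-insensitively.
fn parse_backend(value: &str) -> Option<Backend> {
    match value.trim().to_ascii_lowercase().as_str() {
        "vulkan" | "vk" => Some(Backend::Vulkan),
        "dx12" | "d3d12" | "directx" => Some(Backend::Dx12),
        "metal" | "mtl" => Some(Backend::Metal),
        _ => None, // unrecognized: fall back to the platform default
    }
}

fn main() {
    // The variable is read from the environment at startup.
    let choice = std::env::var("GOLDY_BACKEND")
        .ok()
        .and_then(|v| parse_backend(&v));
    assert_eq!(parse_backend("VK"), Some(Backend::Vulkan));
    assert_eq!(parse_backend("directx"), Some(Backend::Dx12));
    assert_eq!(parse_backend("gl"), None);
    println!("backend override: {:?}", choice);
}
```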
Validation
| Variable | Values | Default | Description |
|---|---|---|---|
| `GOLDY_VALIDATION` | Comma/semicolon/whitespace-separated list: `api`, `layout`, `layouts`, `all`; or `1` / `true` / `yes` | (not set) | Enable validation categories. `api` enables GPU API validation (Vulkan validation layers, Metal shader validation). `layout` enables Rust/Slang struct layout and buffer stride checks. `all` enables both. The shorthand `1` / `true` / `yes` enables GPU API validation only (layout stays opt-in). |
| `GOLDY_VALIDATE_LAYOUTS` | `1`, `true`, `yes` | (not set) | Legacy toggle for layout validation only. Equivalent to `GOLDY_VALIDATION=layout`. |
Validation Examples
```sh
# GPU API validation only (Vulkan validation layers, Metal shader validation)
GOLDY_VALIDATION=api cargo run --example triangle

# Layout + stride checks only
GOLDY_VALIDATION=layout cargo run --example triangle

# Everything
GOLDY_VALIDATION=all cargo run --example triangle

# Shorthand for GPU API only
GOLDY_VALIDATION=1 cargo run --example triangle
```
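The parsing rules described above can be sketched as a small Rust function. This is illustrative only (the `Validation` struct and parser are not Goldy's internals), based on the documented behavior: tokens split on commas, semicolons, or whitespace, with `1`/`true`/`yes` enabling API validation only:

```rust
#[derive(Debug, Default, PartialEq)]
struct Validation {
    api: bool,
    layout: bool,
}

// Sketch of GOLDY_VALIDATION parsing per the documented rules.
fn parse_validation(value: &str) -> Validation {
    let mut v = Validation::default();
    for tok in value.split(|c: char| c == ',' || c == ';' || c.is_whitespace()) {
        match tok.to_ascii_lowercase().as_str() {
            "api" | "1" | "true" | "yes" => v.api = true,
            "layout" | "layouts" => v.layout = true,
            "all" => {
                v.api = true;
                v.layout = true;
            }
            _ => {} // empty or unknown tokens are ignored
        }
    }
    v
}

fn main() {
    assert_eq!(parse_validation("api"), Validation { api: true, layout: false });
    assert_eq!(parse_validation("layout"), Validation { api: false, layout: true });
    assert_eq!(parse_validation("all"), Validation { api: true, layout: true });
    assert_eq!(parse_validation("1"), Validation { api: true, layout: false });
    assert_eq!(parse_validation("api; layouts"), Validation { api: true, layout: true });
}
```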
DX12-Specific
| Variable | Values | Default | Description |
|---|---|---|---|
| `GOLDY_DX12_DEBUG` | `1`, `true` | On in debug builds | Enable the D3D12 debug layer. On by default in debug builds; set explicitly for release builds. |
| `GOLDY_DX12_NO_DEBUG` | `1`, `true` | (not set) | Force-disable the D3D12 debug layer even in debug builds. Useful to avoid debug-layer crashes in parallel test threads. |
| `GOLDY_DX12_GBV` | `1`, `true` | (not set) | Enable D3D12 GPU-Based Validation. Catches UAV/SRV descriptor mismatches, resource state errors, and out-of-bounds access on the GPU timeline. Very slow — use for targeted debugging only. |
| `GOLDY_DX12_FORCE_WARP` | `1`, `true` | (not set) | Force the DX12 backend to use the WARP software rasterizer, even when hardware GPUs are present. Use for headless CI or reproducing WARP-specific rendering bugs. |
| `GOLDY_DX12_ALLOW_WARP` | `1`, `true` | (not set) | Allow the WARP adapter to appear in device enumeration. Without this or `GOLDY_DX12_FORCE_WARP`, WARP is hidden. |
Debugging
| Variable | Values | Default | Description |
|---|---|---|---|
| `GOLDY_DUMP_SHADERS` | Directory path | (not set) | Dump compiled shader bytecode (SPIR-V, DXIL, MSL) to the specified directory. Files are written at shader compilation time. Useful for inspecting what Slang produces for each backend. |
Interop with System Variables
Goldy also respects these non-Goldy environment variables:
| Variable | Backend | Description |
|---|---|---|
| `VK_INSTANCE_LAYERS` | Vulkan | If set to include `VK_LAYER_KHRONOS_validation`, Goldy enables Vulkan validation regardless of `GOLDY_VALIDATION`. |
| `VK_LAYER_PATH` | Vulkan | Standard Vulkan loader variable for locating validation layer manifests. |
| `MTL_SHADER_VALIDATION` | Metal | When `GOLDY_VALIDATION` enables API validation and this variable is unset, Goldy sets it to `1` before creating the first Metal device. If you set it yourself, Goldy does not override it. |
License
Goldy is dual-licensed under the GNU Lesser General Public License v2.1 or later (LGPL-2.1-or-later) and a commercial license.
Open Source (LGPL-2.1-or-later)
You may use Goldy freely in any project — including proprietary and commercial software — as long as you comply with the LGPL:
- ✅ Use Goldy as a dynamically linked library in proprietary software
- ✅ Distribute your application without releasing your own source code
- ✅ Modify Goldy for your own use
- ✅ Use commercially
You must:
- Distribute (or offer access to) the source code of Goldy itself (including any modifications you make to it)
- Allow users to replace the Goldy library with their own build (dynamic linking satisfies this)
- Include the LGPL license and copyright notice
Commercial License
A commercial license removes all LGPL obligations. This is appropriate when you need to:
- Statically link Goldy into a proprietary binary
- Distribute modified versions of Goldy without source disclosure
- Embed Goldy in locked-down or proprietary firmware/SDKs
- Satisfy corporate policies that prohibit copyleft dependencies
For commercial licensing terms, contact: koubaa@github
Dependencies
Goldy depends on various open-source libraries with their own licenses:
| Dependency | License |
|---|---|
| ash | MIT/Apache-2.0 |
| anyhow | MIT/Apache-2.0 |
| thiserror | MIT/Apache-2.0 |
| tracing | MIT |
| bitflags | MIT/Apache-2.0 |
| bytemuck | Zlib/MIT/Apache-2.0 |
All dependencies are permissively licensed and compatible with the LGPL.