A scene with a thousand trees of the same model has a problem the naive engine doesn’t notice: each Tree instance owns its own copy of the same vertex data, the same index data, and issues its own draw call against them. The GPU now holds a thousand identical buffers; the renderer makes a thousand calls per frame to draw geometry that is, byte for byte, the same. Both the memory and the CPU overhead are recoverable.
Vertex and index buffer sharing solves the first by giving many Tree objects a shared pointer to one buffer pair; instanced rendering solves the second by drawing many copies in a single draw call, with per-instance data supplied through a separate buffer. Together they are how modern engines render forests, particle systems, and crowds without melting the GPU.
Three things to take away:
- C++-level sharing: store vertex and index buffers (Nibble #071) behind std::shared_ptr so many renderable objects reference one GPU resource.
- GPU-level sharing: glDrawElementsInstanced draws N copies of a mesh in one draw call, with per-instance attributes supplied through a second buffer marked glVertexAttribDivisor(loc, 1).
- The two techniques compose: shared ownership cuts GPU memory; instanced draws cut CPU overhead; together they are the standard pattern for rendering many copies of identical geometry.
The naive cost
The straightforward way to put a thousand trees in a scene is to give each tree its own mesh:
struct Tree {
    VertexBuffer vbo; // owns its own GPU buffer (from Nibble #071)
    IndexBuffer ibo;
    glm::mat4 transform;
    // ...
};
std::vector<Tree> forest;
forest.reserve(1000);
// Populate with 1000 trees, each constructing its own VBO and IBO.
for (const auto& tree : forest) {
    tree.vbo.bind();
    tree.ibo.bind();
    setUniform("modelMatrix", tree.transform);
    glDrawElements(GL_TRIANGLES, /* count */, GL_UNSIGNED_INT, nullptr);
}
Two costs are visible here, both growing linearly with tree count:
- GPU memory. A 50 KB tree mesh times 1000 instances is 50 MB of GPU memory holding the same bytes a thousand times. For a forest of varied vegetation across multiple levels, the duplication can dominate the scene’s memory budget.
- Draw call overhead. Each glDrawElements call requires the driver to validate state, set up GPU command buffers, and communicate over the CPU↔GPU bus. That path is slow relative to the GPU’s own throughput, so a thousand calls per frame eats real time even when each call’s actual work is tiny.
These problems are independent. Sharing buffers fixes the first; instanced rendering fixes the second. Most production code does both.
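The memory arithmetic in the first bullet can be checked in a few lines. This is a minimal sketch: the helper names are made up for illustration, and 50 KB is taken as 50,000 bytes (decimal units).

```cpp
#include <cstddef>

// Hypothetical helpers for illustration; not part of any engine API.

// Naive layout: every instance owns a private copy of the mesh data.
constexpr std::size_t naiveMeshBytes(std::size_t meshBytes, std::size_t instances) {
    return meshBytes * instances;
}

// Shared layout: one copy of the mesh data, however many instances refer to it.
constexpr std::size_t sharedMeshBytes(std::size_t meshBytes, std::size_t /*instances*/) {
    return meshBytes;
}

// 50 KB (taken as 50,000 bytes) times 1000 trees is 50 MB of duplicated data.
static_assert(naiveMeshBytes(50'000, 1000) == 50'000'000);
static_assert(sharedMeshBytes(50'000, 1000) == 50'000);
```

The naive figure grows linearly with the instance count, while the shared figure does not depend on it at all.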
C++-level sharing: one buffer, many references
The VertexBuffer and IndexBuffer wrappers from Nibble #071 were deliberately written to be non-copyable but movable, so that two C++ objects can never own the same GPU handle at once. That rule keeps RAII honest, but it also stops us from giving many Tree instances “a reference to the same VBO.”
The standard answer is to wrap the buffer in std::shared_ptr and share that:
class Mesh {
public:
    Mesh(std::shared_ptr<const VertexBuffer> vbo,
         std::shared_ptr<const IndexBuffer> ibo,
         std::size_t indexCount)
        : m_vbo{ std::move(vbo) },
          m_ibo{ std::move(ibo) },
          m_indexCount{ indexCount } {}
    void bind() const { m_vbo->bind(); m_ibo->bind(); }
    std::size_t indexCount() const { return m_indexCount; }
private:
    std::shared_ptr<const VertexBuffer> m_vbo;
    std::shared_ptr<const IndexBuffer> m_ibo;
    std::size_t m_indexCount;
};
The buffer is loaded once, into one std::shared_ptr, and every Mesh that uses it receives a copy of the pointer. The underlying GPU buffer is destroyed only when the last Mesh holding it is destroyed — RAII still works, just with shared ownership.
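The lifetime rule can be observed without a GPU. The sketch below uses a hypothetical stand-in, CountedBuffer, which only counts how many objects are alive. It is not the VertexBuffer from Nibble #071, and it assumes C++17 for the inline static member.

```cpp
#include <cassert>
#include <memory>

// Stand-in for a GPU buffer wrapper (hypothetical). A real wrapper would
// create and delete a GL buffer; this one only counts live objects.
class CountedBuffer {
public:
    CountedBuffer() { ++s_alive; }
    ~CountedBuffer() { --s_alive; }
    CountedBuffer(const CountedBuffer&) = delete;             // non-copyable,
    CountedBuffer& operator=(const CountedBuffer&) = delete;  // like the real wrappers
    static int alive() { return s_alive; }

private:
    static inline int s_alive = 0;  // C++17 inline static member
};

// Walks through the lifetime of one shared buffer and returns the number of
// buffers still alive at the end.
inline int sharedLifetimeExample() {
    std::shared_ptr<const CountedBuffer> first = std::make_shared<CountedBuffer>();
    std::shared_ptr<const CountedBuffer> second = first;  // copies the pointer only
    std::shared_ptr<const CountedBuffer> third = first;

    assert(CountedBuffer::alive() == 1);  // three owners, one buffer
    assert(first.use_count() == 3);

    first.reset();
    second.reset();
    assert(CountedBuffer::alive() == 1);  // still alive: `third` owns it

    third.reset();                        // last owner gone, destructor runs
    return CountedBuffer::alive();
}
```

Copying the shared_ptr never copies the buffer, and the destructor runs exactly once, when the final owner lets go.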
A scene’s renderable objects then split into two layers:
// A model loader returns Meshes that share their backing buffers.
class ModelLibrary {
public:
    std::shared_ptr<Mesh> loadTree() {
        if (!m_treeMesh) {
            auto vbo = std::make_shared<VertexBuffer>(treeVertexData, /* ... */);
            auto ibo = std::make_shared<IndexBuffer>(treeIndexData, /* ... */);
            m_treeMesh = std::make_shared<Mesh>(std::move(vbo), std::move(ibo),
                                                treeIndexCount);
        }
        return m_treeMesh;
    }
private:
    std::shared_ptr<Mesh> m_treeMesh;
};
// A renderable object holds a reference to a shared Mesh and its own state.
struct Renderable {
    std::shared_ptr<Mesh> mesh;
    glm::mat4 transform;
    glm::vec4 tintColor;
};
A thousand Renderables pointing at one Mesh cost one set of GPU buffers plus a thousand (transform, color) pairs in CPU memory. The factor-of-1000 GPU-memory waste is gone.
The const qualifiers in the shared_ptr<const VertexBuffer> declarations are deliberate: a shared buffer should be immutable from any single owner’s perspective. Modifying the buffer through one Mesh would change what every other Mesh sees, which is almost never what you want. Mark the shared resource const and the type system enforces the discipline.
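To see the type system doing that enforcement, here is a small compile-time sketch. DemoBuffer and the two detection traits are hypothetical names introduced for this example, and std::void_t assumes C++17.

```cpp
#include <memory>
#include <type_traits>
#include <utility>

// Stand-in buffer type (hypothetical): bind() only reads, update() writes.
struct DemoBuffer {
    void bind() const {}
    void update(int /*newValue*/) {}
};

// Detects at compile time whether `ptr->update(0)` is well-formed for Ptr.
template <typename Ptr, typename = void>
struct CanUpdateThrough : std::false_type {};

template <typename Ptr>
struct CanUpdateThrough<
    Ptr, std::void_t<decltype(std::declval<Ptr&>()->update(0))>>
    : std::true_type {};

// Detects at compile time whether `ptr->bind()` is well-formed for Ptr.
template <typename Ptr, typename = void>
struct CanBindThrough : std::false_type {};

template <typename Ptr>
struct CanBindThrough<
    Ptr, std::void_t<decltype(std::declval<Ptr&>()->bind())>>
    : std::true_type {};

// A pointer-to-mutable allows writes; a pointer-to-const allows only reads.
static_assert(CanUpdateThrough<std::shared_ptr<DemoBuffer>>::value);
static_assert(!CanUpdateThrough<std::shared_ptr<const DemoBuffer>>::value);
static_assert(CanBindThrough<std::shared_ptr<const DemoBuffer>>::value);
```

In everyday code the traits are unnecessary: calling update() through a shared_ptr<const DemoBuffer> simply fails to compile, which is the point.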
GPU-level sharing: instanced rendering
C++-level sharing fixes the memory waste, but the rendering loop still issues a thousand draw calls. Modern graphics APIs expose a second mechanism — instanced rendering — that collapses N draws of the same mesh into one:
glDrawElementsInstanced(GL_TRIANGLES,
                        mesh.indexCount(),
                        GL_UNSIGNED_INT,
                        nullptr,
                        instanceCount);
The call says: “draw this mesh instanceCount times.” The GPU runs the vertex shader for each vertex, for each instance, with a built-in input gl_InstanceID running from 0 to instanceCount - 1. The driver makes one validation pass and the CPU↔GPU communication happens once.
But the same vertices are still being sent through the same shader — without per-instance data, every instance would draw on top of the others. The mechanism for varying state across instances is a second buffer of per-instance attributes, marked with glVertexAttribDivisor(location, 1):
// Assumes the mesh's vertex array object (VAO) is currently bound,
// since attribute pointers and divisors are stored in the VAO.
// Upload one buffer with per-instance transforms.
GLuint instanceVbo;
glGenBuffers(1, &instanceVbo);
glBindBuffer(GL_ARRAY_BUFFER, instanceVbo);
glBufferData(GL_ARRAY_BUFFER, sizeof(glm::mat4) * transforms.size(),
             transforms.data(), GL_DYNAMIC_DRAW);
// Wire up the per-instance attribute (a 4x4 matrix is four vec4 attributes).
for (int i = 0; i < 4; ++i) {
    glEnableVertexAttribArray(3 + i);
    glVertexAttribPointer(3 + i, 4, GL_FLOAT, GL_FALSE,
                          sizeof(glm::mat4),
                          (void*)(sizeof(glm::vec4) * i));
    glVertexAttribDivisor(3 + i, 1); // <- one update per instance, not per vertex
}
The divisor is the load-bearing call. Set to 0 (the default), the GPU advances the attribute per vertex, which is the standard behaviour for position, normal, and UV. Set to 1, the GPU advances it per instance: every vertex of instance 0 sees transforms[0]; every vertex of instance 1 sees transforms[1]; and so on. A single draw call now produces correctly positioned trees across the entire forest, each one placed by the matrix the divisor selected for it.
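The divisor rule is easy to model on the CPU. The sketch below is an illustration, not driver code: attributeElement and elementsFetched are hypothetical names, and base-instance offsets are ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU-side model of which buffer element an attribute reads for one vertex
// shader invocation. Illustrative only; base-instance offsets are ignored.
constexpr std::size_t attributeElement(std::size_t vertexIndex,
                                       std::size_t instanceId,
                                       unsigned divisor) {
    return divisor == 0 ? vertexIndex            // per-vertex attribute
                        : instanceId / divisor;  // advances once every `divisor` instances
}

// Records the element index read by every (instance, vertex) invocation of
// one instanced draw, in submission order.
inline std::vector<std::size_t> elementsFetched(std::size_t vertexCount,
                                                std::size_t instanceCount,
                                                unsigned divisor) {
    std::vector<std::size_t> seen;
    seen.reserve(vertexCount * instanceCount);
    for (std::size_t instance = 0; instance < instanceCount; ++instance) {
        for (std::size_t vertex = 0; vertex < vertexCount; ++vertex) {
            seen.push_back(attributeElement(vertex, instance, divisor));
        }
    }
    return seen;
}

// Usage: a 3-vertex mesh drawn as 2 instances.
inline void divisorExample() {
    const std::vector<std::size_t> perVertex   = {0, 1, 2, 0, 1, 2};
    const std::vector<std::size_t> perInstance = {0, 0, 0, 1, 1, 1};
    assert(elementsFetched(3, 2, 0) == perVertex);    // divisor 0: follows the vertex
    assert(elementsFetched(3, 2, 1) == perInstance);  // divisor 1: follows the instance
}
```

With divisor 1, all three vertices of an instance read the same slot, which is exactly what a per-instance transform needs.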
Inside the vertex shader, the per-instance attribute is just another input:
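#version 330 core
layout(location = 0) in vec3 aPos;
layout(location = 3) in mat4 aInstanceTransform; // occupies locations 3-6
uniform mat4 uViewProj; // combined view-projection matrix
void main() {
    gl_Position = uViewProj * aInstanceTransform * vec4(aPos, 1.0);
}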
The shader sees one position per vertex and one transform per instance. The GPU runs the vertex shader roughly vertexCount × instanceCount times, all from one batched submission; the driver is involved once.
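A mat4 input declared at location 3 consumes four consecutive attribute locations, one per column, which is why the setup loop wires up 3 + i with an offset of one vec4 per step. The layout can be sketched with stand-in types. Vec4 and Mat4 here are hypothetical substitutes for glm::vec4 and glm::mat4, assuming tightly packed 4-byte floats in column-major order.

```cpp
#include <cstddef>

// Stand-ins for glm::vec4 / glm::mat4 (assumed: tightly packed 4-byte floats,
// stored column by column).
struct Vec4 { float x, y, z, w; };
struct Mat4 { Vec4 columns[4]; };

static_assert(sizeof(Vec4) == 16);
static_assert(sizeof(Mat4) == 64);

constexpr unsigned kFirstLocation = 3;  // matches layout(location = 3)

// Attribute location that carries column `i` of the instance matrix.
constexpr unsigned columnLocation(unsigned i) { return kFirstLocation + i; }

// Byte offset of column `i` inside one matrix.
constexpr std::size_t columnOffset(unsigned i) { return sizeof(Vec4) * i; }

// Bytes between the same column of two consecutive instances.
constexpr std::size_t instanceStride() { return sizeof(Mat4); }

static_assert(columnLocation(0) == 3 && columnLocation(3) == 6);
static_assert(columnOffset(0) == 0 && columnOffset(3) == 48);
static_assert(instanceStride() == 64);
```

One practical consequence is that locations 4, 5, and 6 are taken as well, so any further vertex attribute has to start at location 7 or later.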
The two techniques compose
The two techniques solve different problems and stack cleanly:
| Technique | Cost reduced | Mechanism |
|---|---|---|
| Shared buffers | GPU memory | std::shared_ptr<const VertexBuffer> and std::shared_ptr<const IndexBuffer> |
| Instanced rendering | CPU draw-call overhead | glDrawElementsInstanced + divisor |
A thousand-tree forest using both costs:
- One vertex buffer and one index buffer on the GPU (shared across every tree).
- One per-instance attribute buffer holding the thousand transforms.
- One draw call per frame to render the entire forest.
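The naive baseline was 2000 GPU buffers (a vertex buffer and an index buffer per tree) and 1000 draw calls. The combined optimisation is 2 GPU buffers (plus the per-instance attribute buffer) and 1 draw call. That is a reduction of roughly three orders of magnitude on both counts, not an incremental saving.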
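One way to connect the two layers is to group the scene's objects by the mesh they share, so that each group becomes one instance buffer and one instanced draw. The following is a minimal sketch under stated assumptions: Mesh and Transform are empty stand-ins for the article's Mesh and glm::mat4, and buildBatches is a hypothetical helper, not part of any library.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <unordered_map>
#include <vector>

// Stand-ins (hypothetical) for the article's Mesh and glm::mat4.
struct Mesh {};
struct Transform { float m[16]; };

struct Renderable {
    std::shared_ptr<Mesh> mesh;
    Transform transform;
};

// One batch = one shared mesh plus the per-instance data of every object
// that uses it. The scene's shared_ptrs keep the meshes alive, so the raw
// pointer is used only as a lookup key.
using Batches = std::unordered_map<const Mesh*, std::vector<Transform>>;

inline Batches buildBatches(const std::vector<Renderable>& scene) {
    Batches batches;
    for (const Renderable& r : scene) {
        batches[r.mesh.get()].push_back(r.transform);
    }
    return batches;
}

// Usage: 1000 trees and 20 rocks collapse into two batches.
inline void batchingExample() {
    auto tree = std::make_shared<Mesh>();
    auto rock = std::make_shared<Mesh>();
    std::vector<Renderable> scene;
    for (int i = 0; i < 1000; ++i) scene.push_back({tree, Transform{}});
    for (int i = 0; i < 20; ++i) scene.push_back({rock, Transform{}});

    const Batches batches = buildBatches(scene);
    assert(batches.size() == 2u);  // two instanced draws instead of 1020 draws
    assert(batches.at(tree.get()).size() == 1000u);
    assert(batches.at(rock.get()).size() == 20u);
}
```

In a real renderer, each batch's vector would be uploaded to the per-instance attribute buffer and drawn with a single glDrawElementsInstanced call whose instance count is the vector's size.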
When this pattern fits, and when it doesn’t
Buffer sharing and instancing apply when:
- The mesh data is identical across instances. Same vertex positions, same indices, same UVs. Variation goes into the per-instance buffer (transforms, tints, IDs).
- The number of instances is large enough to make the fixed setup cost worthwhile. Instancing pays off in the hundreds-or-more range; for five trees, a regular loop is fine.
- The scene is dominated by repetition. Forests, crowds, particle systems, debris fields, grass, brick walls — these are the canonical use cases.
Cases where the pattern doesn’t fit:
- Per-instance mesh variation. If each tree has a slightly different number of branches, the meshes aren’t shareable. The mesh data must be identical for one buffer to back many draws.
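- Heavy per-instance fragment work. Instancing cuts CPU submission cost, not GPU shading cost: every instance still runs the full vertex and fragment pipeline. If each instance needs unique textures or expensive fragment shaders, the frame is bound by that work and instancing saves little.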
- Tiny instance counts. The overhead of setting up the divisor-attached buffer is real; for a handful of objects, ordinary draw calls are simpler and not measurably slower.
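The instance-count criterion can be written down as a small policy function. This is only a sketch: DrawPath and chooseDrawPath are hypothetical names, and the threshold of 100 used in the checks is an assumed value for illustration, not a measured break-even point.

```cpp
#include <cstddef>

enum class DrawPath { Individual, Instanced };

// `minInstances` is a tuning knob, not a known constant: the break-even point
// depends on driver, hardware, and scene, so measure before picking a value.
constexpr DrawPath chooseDrawPath(std::size_t instanceCount,
                                  std::size_t minInstances) {
    return instanceCount >= minInstances ? DrawPath::Instanced
                                         : DrawPath::Individual;
}

// With an assumed threshold of 100: five trees take the plain loop,
// a thousand take the instanced path.
static_assert(chooseDrawPath(5, 100) == DrawPath::Individual);
static_assert(chooseDrawPath(1000, 100) == DrawPath::Instanced);
```

Keeping the threshold as a parameter makes it straightforward to profile both paths and adjust the cut-over per platform.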
Takeaway
Naive rendering of N copies of a mesh costs N copies of the GPU buffers and N draw calls per frame. Buffer sharing fixes the first by storing each VertexBuffer and IndexBuffer behind std::shared_ptr and letting many Mesh objects reference one GPU resource — RAII still applies, just with shared ownership. Instanced rendering fixes the second by sending one draw call that produces all N copies, with per-instance state pulled from a separate buffer marked glVertexAttribDivisor(location, 1).
The two techniques compose: shared buffers cut GPU memory, instanced draws cut CPU overhead, and the combination is how modern engines render forests, crowds, and particle systems without paying the naive cost. The rule is simple: when you find yourself rendering the same mesh many times, share the buffers and instance the draws.