Monday, 10 August 2015
JSON - Yet Another Markup Language?
I have to admit, I am not sold on JSON - it is no more or less useful than XML in terms of configuring applications or loading assets.
I've implemented support for JSON-based config files, but I wonder if it was worth the effort.
Friday, 17 July 2015
Multi-Pass Shaders
There's no such thing, really, as a multi-pass shader.
To perform multi-pass rendering, we select a shader and optionally a render target surface (FBO), and we draw some subset of our renderables. Then we select another shader, and repeat the process.
Multi-pass rendering, then, simply means using multiple shaders, and drawing things more than once.
Not all renderables will be drawn during every 'pass' - for example, a pass might only draw Opaque objects, or it might only draw light volumes.
Materials date from the days of single-pass rendering, where they encapsulated all the various requirements for drawing surfaces - mainly textures and lighting-equation variables - which were typically fed to the Uniforms of one very specific 'game geometry shader'.
To support complex 'multi-pass' rendering, it might seem sensible to further extend the Material class to hold data for a number of shaders, and variables targeting a number of passes - which all sounds rather elaborate. I decided that Materials are in fact not the appropriate place to declare mappings for multiple shader passes.
FireStorm's Renderer implements a 'RenderGraph' made from 'RenderNodes', where each node (except the root node) represents a render target surface, and the root node represents final screen output. The rendergraph is described elsewhere in this blog, but basically this tree of rendertargets can define complex scene compositions - the composition being exactly how the final screen image is built up from zero or more rendertargets, which can be arranged as a dependency tree.
RenderNodes don't interact with a Material class - they deal instead with RenderState and ShaderPass classes, the latter being a container for visible Renderables. Each Renderable carries a bitkey that associates it with a subset of the ShaderPasses held by its target RenderNode.
In this way, each Renderable can declare which RenderNode it will be rendered to, and which ShaderPasses (within that Node) it will be rendered by.
Each shader can register its interest in various Material Attributes, and should it be required, a different material can be associated with each Renderable, for each ShaderPass.
When a RenderNode is processed, it executes a series of ShaderPasses, each with its own RenderState, its own Shader, and its own subset of (references to) Renderables, the results being drawn to that Node's rendertarget (either an FBO, or the screen).
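As a rough illustration (the class names follow the post, but every member and signature here is invented), a RenderNode's pass execution might look something like:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch only - not FireStorm's actual code.
struct Renderable {
    uint32_t passKey;   // bitkey: which passes of the target node draw this renderable
};

struct ShaderPass {
    uint32_t passBit = 0;                   // single bit identifying this pass within its node
    std::vector<const Renderable*> visible; // collected during culling
    int drawCalls = 0;                      // stand-in for actual GL draw submission

    void execute() {
        // bind this pass's shader and render state here (omitted)
        for (const Renderable* r : visible)
            if (r->passKey & passBit)       // renderable subscribed to this pass?
                ++drawCalls;                // issue the draw
    }
};

struct RenderNode {
    std::vector<ShaderPass> passes;
    void process() {
        // bind this node's render target (FBO or screen) here (omitted)
        for (ShaderPass& p : passes) p.execute();
    }
};
```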
Thursday, 16 July 2015
FireStorm ECS -> RenderGraph
The ultimate aims of the FireStorm ECS are those of any game system: apply behavioral logic, determine what is visible (and implicitly, what isn't), and draw visible stuff to some target surface, using some shader, some camera view and projection, etc.
When we reach the RenderableSystem, we have reached the end of the ECS mechanism: renderables which are visible to some camera will issue a draw request to some draw target.
FireStorm implements a RenderGraph which represents the high-level scene composition tree in terms of RenderTarget Surfaces (FBOs in GL speak) and defines how they are connected.
The RenderGraph can be visualized as a 'reverse tree', where the 'root node' represents the final output to the Screen, while every other 'RenderNode' is a FrameBuffer Object. This arrangement is close to what happens when we 'compose' our scene using multiple rendering stages via arbitrary control logic, and models what we actually want to express as graphics programmers (in my humble opinion): how each target texture is to be rendered, and how the targets chain together to form the final image. This is non-trivial due to Multiple Render Target technology, and in my opinion is best and most cleanly described in graph terms. Others may hold different opinions - please, chime in!
Please note again, the RenderGraph is *not* part of the ECS, it's not a Component or a System, not everything needs to fit inside the ECS.
The ECS RenderingSystem emits 'virtual draw requests' to specific nodes of the RenderGraph; these are sorted, batched and collected as Messages, awaiting final draw processing, which is realized through 'recursion return'. As usual, everything up to this point can be multithreaded fairly easily.
When it's time to draw the frame, the RenderGraph itself is recursed by the MAIN PROCESS THREAD, which owns the ONLY RENDER CONTEXT.
We start at the output node (the tree root), recurse the tree as we normally might, but only process the Messages captured by each RenderNode as we are RETURNING from recursion.
This ensures that Child rendertarget nodes are drawn ahead of Parent nodes who depend on them (this is a reverse tree remember? the root node is the final output...)
By processing each node only *after* recursing its children, we end up rendering to our rendertargets in such an order that any target surface is always ready for use as an input texture to a higher-order node (where 'higher order' means 'closer to the root node, which is final output')
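The 'returning from recursion' traversal can be sketched like this - a hypothetical, stripped-down RenderNode that keeps only the post-order property of the real thing:

```cpp
#include <string>
#include <vector>

// Minimal sketch of the 'recursion return' traversal described above.
// Structure is assumed; only the ordering property matters here.
struct RenderNode {
    std::string name;
    std::vector<RenderNode*> children;   // nodes whose output this node samples
};

// Children are rendered on the way *back* out of the recursion, so every
// dependency texture is ready before its parent node draws.
void renderGraph(RenderNode& node, std::vector<std::string>& order) {
    for (RenderNode* child : node.children)
        renderGraph(*child, order);
    order.push_back(node.name);          // process this node AFTER its children
}
```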
RenderNodes are containers for render surfaces (output textures, MRT supported), but they are also containers for ShaderPasses. In order to draw a RenderNode, we execute its ShaderPasses. Renderables subscribe to specific Passes via a binary key, so we can draw some things in some passes, and not in others, to a specific surface, and as we've learned, with a specific camera.
DOD-ECS RenderableComponent : this is where the ECS ends, and the Renderer begins.
Today I fleshed out the Lua interface for the RenderableComponent.
Without explaining what FireStorm's Script Bindings are about, here it is:
LUA_CLASS_METHODS(RenderableComponent) =
{
    LUA_CLASS_METHOD( getRenderTarget,   RenderableComponent::get_targetSurface),  // target RenderNode (container for FBO)
    LUA_CLASS_METHOD( setRenderTarget,   RenderableComponent::set_targetSurface),
    LUA_CLASS_METHOD( getCameraID,       RenderableComponent::get_CameraID),       // which camera entity to use for culling and POV-rendering
    LUA_CLASS_METHOD( setCameraID,       RenderableComponent::set_CameraID),
    LUA_CLASS_METHOD( getGeometry,       RenderableComponent::get_geometry),       // reference to GLVertexArray
    LUA_CLASS_METHOD( setGeometry,       RenderableComponent::set_geometry),
    LUA_CLASS_METHOD( getPrimitiveType,  RenderableComponent::get_primitiveType),  // instructions for drawing
    LUA_CLASS_METHOD( setPrimitiveType,  RenderableComponent::set_primitiveType),
    LUA_CLASS_METHOD( getFirstIndex,     RenderableComponent::get_firstIndex),
    LUA_CLASS_METHOD( setFirstIndex,     RenderableComponent::set_firstIndex),
    LUA_CLASS_METHOD( getNumIndices,     RenderableComponent::get_numIndices),
    LUA_CLASS_METHOD( setNumIndices,     RenderableComponent::set_numIndices),
    { 0, 0 }
};
We can learn a lot about FireStorm's internals by examining the above interface description.
For example, it supports multiple active render cameras within a single frame.
Each renderable can be associated with exactly one camera, and exactly one render target surface.
We can also see that rendering is broken down into virtual draw requests which can be sorted and batched elsewhere. Indexed drawing, and drawing of indexed subsets is supported.
The PrimitiveType can be set per renderable, and we know that FireStorm's ECS supports more than one renderable per entity, should we require it (support for submeshes is implied).
I'm still not sold on the idea of exposing Component classes to the Script Layer (Lua in my case).
It seems to make more sense design-wise to expose System classes, and force any 'setters' to use the Message mechanism rather than doing direct writes. Have to think about it some more.
If I decide that I should be dealing with Systems, and not Components, on the Script side, then so be it, some code moves around, nothing's lost.
Tuesday, 14 July 2015
DOD-ECS: Decoupling Systems is good for Concurrency, and good for Cache
FireStorm ECS implements a kind of distributed messaging service, fed by a central Dispatch mechanism: each System provides a Message Queue to receive Messages sent from other Systems, and addressed to one or more of its Components.
Just before each ECS System is Updated, any Messages sent to that System are Delivered, in terms of actually being Processed.
Messages are the primary way for the ECS Systems to send data to other Systems.
Since each System is updated separately, and has its own Message Queue,
the Message paradigm effectively decouples Systems from one another, which is good for Concurrency - the implications are that it's safe for multiple Threads to update different Components owned by the same System, and it's always safe to Read from Components of Another System, with no need for any kind of locking, other than a 'barrier' to ensure only one System is updated at any moment in time.
System Updates are the primary cause of messages getting sent to other Systems.
Since each System processes its Messages before updating its Components,
we usually want to send Messages to Systems that are of lower priority (are yet to be Updated).
The reason is that Messages sent to higher-priority Systems won't be processed until the next Frame... so the trick is to try to arrange your system updates rationally (which I impose through ordered system ids).
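A minimal sketch of that frame loop, assuming invented System and Message layouts (the real FireStorm types are surely richer):

```cpp
#include <queue>
#include <vector>

// Hedged sketch of the per-frame loop implied above: each System drains its
// Message Queue just before updating its Components, in priority (id) order.
struct Message { int targetComponent; float value; };

struct System {
    std::queue<Message> inbox;           // filled by other Systems' updates
    std::vector<float> componentData;    // flat component pool (stand-in)

    void deliverMessages() {
        while (!inbox.empty()) {
            Message m = inbox.front(); inbox.pop();
            componentData[m.targetComponent] = m.value;  // deferred write
        }
    }
    void updateComponents() {
        for (float& c : componentData) c += 1.0f;        // stand-in behavior
    }
};

// Systems are stored in ascending priority order; messages sent 'downstream'
// are seen this frame, messages sent 'upstream' wait until the next frame.
void updateFrame(std::vector<System>& systems) {
    for (System& s : systems) {
        s.deliverMessages();   // warm the cache on this system's pool...
        s.updateComponents();  // ...then do the linear update pass
    }
}
```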
So - why have a Message QUEUE at all? Why not just process ECS events as soon as they occur?
Messages are used to convey data from a component in one System, to a component in another System, usually for Writing locally within that remote component.
They are a means of implementing 'deferred-write' thread safety!
Is concurrency the only benefit to be gained from buffering inter-System messages?
We also benefit in at least one other important way: we get a much better data access pattern,
leading to much improved cache performance, which is perhaps the number one performance killer in modern games.
Think about it: if a System processed ECS events as we generated them, we would be immediately touching data
owned by other Systems in a non-linear fashion, which implies lots of cache misses.
We would later go on to process that System, by which time any cache warming we might have gained has likely been lost.
Rather, if each System first deals with messages sent to its components, then updates its components,
we get the full benefit of cache-warming just before our full linear component update pass.
If we're careful, the overhead we pay for this should be a lot less than the performance gained through
the application of cache-friendly data access patterns.
Saturday, 11 July 2015
Core Values: Treating strings as reference-counted, shareable objects
FireStorm provides a CORE::Value container class, originally introduced to facilitate interoperability with the Lua scripting language.
The Value class is basically a variant-type container that supports all the usual suspects (boolean, number, string, object, table, etc), implemented as a unioned structure. Until today, this structure was bloated by a 256-byte buffer whose purpose was to hold data for a short string, while all other possible types require far less memory than that.
This never presented a problem until I recently implemented an ECS AttributeComponent, which essentially wraps a pair of Values (a key/value pair). When the components were aligned in a Pool, it became painfully obvious how much memory was being wasted, so the decision was made to move strings out of the Value container, replacing them with a unique integer identifier whose value is based on a string hashing algorithm.
Values of 'string type' now hold 32-bit hashes, which reference unique strings.
The global container for unique strings is a map whose keys are hashes, and whose values are pairs containing the original string, and a reference count, which is used to unload strings when the last remaining reference is removed. Dynamic strings are not a problem, and modifying a string Value does not affect external references to the old string Value.
The changes to the Value container class seemed trivial, however I did run into a few problems relating to destruction of local value copies. Now that it's done, the size of an AttributeComponent has been reduced from 512 bytes to 16 bytes per Attribute, making Values and Attributes a lot more attractive in terms of memory footprint.
These changes removed an artificial limitation on the length of strings stored in Values, which was also the major culprit with respect to wasted memory in Value-based systems such as AttributeSystem. Further, references to the same string are now very cheap, as only one copy of the actual string data is held anywhere in memory.
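For illustration, a reference-counted string table along these lines might look like the following - the FNV-1a hash and all member names are my assumptions, not FireStorm's actual code:

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Illustrative sketch of an interned-string table as described above.
struct StringTable {
    // key: 32-bit hash, value: (original string, reference count)
    std::map<uint32_t, std::pair<std::string, int>> entries;

    static uint32_t hash(const std::string& s) {       // FNV-1a, as an example
        uint32_t h = 2166136261u;
        for (unsigned char c : s) { h ^= c; h *= 16777619u; }
        return h;
    }
    uint32_t acquire(const std::string& s) {           // intern + add reference
        uint32_t h = hash(s);
        auto it = entries.find(h);
        if (it == entries.end()) entries[h] = {s, 1};
        else ++it->second.second;
        return h;
    }
    void release(uint32_t h) {                         // unload on last reference
        auto it = entries.find(h);
        if (it != entries.end() && --it->second.second == 0)
            entries.erase(it);
    }
    const std::string& lookup(uint32_t h) const { return entries.at(h).first; }
};
```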
Arguably, these changes move away from data-oriented-design principles, since strings are no longer plain old data but references to plain old data, and lookups have been introduced - however the cost is not huge, given that the map is sorted by key, making lookups cheap.
Sparse Arrays meet Parented Transforms: so long, SceneGraph!
FireStorm Systems are derived from an Object Pooling Template.
In simple terms, that means each System stores a specific kind of Data, and we only have one instance of each System - they are effectively singletons.
The base pooling template class implements a custom memory allocator that is very fast for things we create and delete frequently (plain structs or object classes - both are fine). Under the hood, it allocates a small, flat array of space, and manages that space in terms of allocated and unallocated slots, until forced to grow.
When objects are deleted from a pool, they don't move automatically, they are marked as dead, and their index is stored on a list of 'dead stuff' that is also used as a first option when objects are created.
This is called an object recycling pool mechanism, and it's much faster than your operating system's heap allocator, which can't make guesses about the size of the objects being stored.
So, when we create a new object in a pool, the index of the new or recycled slot is returned.
The new element is identified uniquely by the combination of the pool it lives in (its type), and the index it resides at (its local ID).
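A bare-bones sketch of such a recycling pool (names and layout invented for illustration):

```cpp
#include <cstddef>
#include <vector>

// Rough sketch of a recycling pool as described: a flat array plus a
// free-list of 'dead' slot indices which are reused before the array grows.
template <typename T>
struct Pool {
    std::vector<T> slots;
    std::vector<size_t> freeList;        // indices of dead slots

    size_t create(const T& value) {
        if (!freeList.empty()) {         // recycle a hole if we have one
            size_t idx = freeList.back(); freeList.pop_back();
            slots[idx] = value;
            return idx;
        }
        slots.push_back(value);          // otherwise grow the flat array
        return slots.size() - 1;
    }
    void destroy(size_t idx) { freeList.push_back(idx); }  // mark slot as dead
};
```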
This is fine until we start deleting things, or swap things around...
At this point, we create 'holes' in our pool's underlying array space - although unsightly, they don't necessarily bother the cpu cache, and in most cases we could forgive that they exist.
But some systems like to sort components, or want their components sorted, or maybe we just occasionally want to defragment our pools (eliminate the holes in the arrays), potentially allowing us to dynamically adjust our memory footprint within practical reason.
Transforms are a good example case: we want our TransformSystem to ensure that TransformComponents are sorted such that any component with a parent transform appears later in the flat array than its parent - this allows us to get rid of the SceneGraph!
Modifying a SceneGraph becomes simply a matter of modifying the parenting of specific components of a specific system, and the details can be hidden to leave the impression of a conventional OOP scenegraph as far as the user can tell, while still taking full advantage of the hardware and composition benefits that DOD-ECS has to offer.
Once we've sorted our Transforms according to the rule "parents must appear earlier than children", we can linearly process them, there's no need for graph recursion.
In a nutshell: when the TransformSystem is updated, it performs a special kind of bubble-sort pass over the TransformComponents it manages. For each transform component, we check whether the component has a parent - if not, we can move on. If it has a parent stored at a higher index, we swap the two components in-situ, causing their component ids to be swapped as well, and then we repeat the compare step upon the current slot, which now holds the parent transform (checking whether the parent has a parent, and whether that parent is at a higher index...). This algorithm has the effect of moving parent nodes to earlier array slot indices.
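The sort can be sketched as follows - a simplified model where each component stores its parent's slot index, and references are remapped on each swap (all details assumed, not FireStorm's code):

```cpp
#include <utility>
#include <vector>

// Simplified sketch of the in-place parent-before-child sort.
struct Transform { int parent; };      // parent slot index, -1 = no parent

// After swapping slots a and b, all components referencing those slots
// must be remapped, since slot index doubles as component id here.
void fixupParents(std::vector<Transform>& pool, int a, int b) {
    for (Transform& t : pool) {
        if (t.parent == a) t.parent = b;
        else if (t.parent == b) t.parent = a;
    }
}

void sortParentsFirst(std::vector<Transform>& pool) {
    for (int i = 0; i < (int)pool.size(); ++i) {
        // keep re-checking slot i after a swap: the parent pulled down here
        // may itself have a parent stored at a higher index
        while (pool[i].parent > i) {
            int p = pool[i].parent;
            std::swap(pool[i], pool[p]);  // pull the parent down to slot i
            fixupParents(pool, i, p);     // ids swapped, so remap references
        }
    }
}
```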
Defragmenting of component pools is performed similarly, but using a different comparison rule.
In order to defragment a component pool, we iterate the slot indices 'forwards', looking for 'dead' elements, and simultaneously, we iterate the slot indices 'backwards', looking for 'live' elements.
Each time we discover a dead element, we swap it for a living element with the highest known index.
In this way, living elements are moved toward the beginning of the array, while dead elements are moved toward the end of the array. When the operation is complete, there is an opportunity to compact the array to dynamically reduce the memory footprint, if there are more dead elements than live ones (I use a grow/shrink factor of 2).
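The defragmentation pass can be sketched with two indices walking toward each other (the 'alive' flag layout is an assumption for illustration):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Sketch of the two-ended defragmentation pass described above: scan forward
// for dead slots, backward for live ones, and swap them.
struct Slot { bool alive; int data; };

// Returns the number of live elements, which end up packed at the front.
size_t defragment(std::vector<Slot>& pool) {
    size_t front = 0, back = pool.size();
    while (true) {
        while (front < back && pool[front].alive) ++front;    // find a hole
        while (front < back && !pool[back - 1].alive) --back; // find a live tail
        if (front + 1 >= back) break;                         // indices met
        std::swap(pool[front], pool[back - 1]); // move live element forward
    }
    return front;
}
```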
EDIT (16 July) - I'm looking at the possibility of moving the Sort operation into attach and detach methods, leaving the Update operation to just deal with updating subtrees of transforms that are marked as dirty. It's not a scenegraph, I swear!, but it has similarities with respect to hierarchical transforms - the fundamentals are still the same, but the recursion is gone.