I have spent a lot of time trying to figure out how to write a blog. I mean really, how can someone write something daily or even weekly that is remotely interesting or relevant? However, somewhere in my head I knew that I should be doing something with this space - something useful. It finally hit me what would be both useful and a way to actually write this in a relevant manner - I will document the ongoing changes and work being done on my home project (TGS), along with its features and design considerations. It may one day be useful to someone when they start puttering around with engine code :-)
My current trajectory for code development is to get a basic simulation loop up and running. I spent the last 6-8 months tweaking simulation code. This code base is primarily meant for floating point operation - a couple of routines were written for vectorized architectures - but I will have to revisit the collision code once I have isolated which tests and operations I need for vectorized architectures. However, to get a simulation loop up and running in a useful manner I need to be able to draw it, eh :-) So I've spent the last couple of weeks on a rendering engine, based on the design/architecture I originally laid out a few years ago. Amazingly - it's still valid (to my shock and surprise!).
In working with physics code for the last three or four years I have found that visualization of the data is extremely important. The amount of data a simulation both consumes and emits is very often much more than can be reasonably digested by simple examination. Since a simulation is time based, inserting code that changes the timing of the loop can also change how a bug manifests. Thus, proper visualization tools are very important - and they must not cause any non-trivial change in engine execution time. To this end I implemented a basic geometry draw call in the render engine (this allows someone to simply call something like draw(sphere)). Since physics code most often uses basic primitive shapes rather than meshes (or convex meshes), this opens up certain optimization techniques that might otherwise not be available. Specifically, we know that we will be re-rendering a very small and select group of vertex buffers. So I created a simple test that brought the current render engine to a crawl (5 FPS, caused by a 5-way tessellated sphere being drawn 1024 times in a fixed grid pattern). This was the main task during the week.
The weekend task was to get a geometry instancing approach implemented for these debug render functions. I went with a hardware (shader model 3.0) approach where the vertex stream for the primitive is assigned a frequency equal to the number of primitives to be drawn, alongside an instance data stream containing a colour and a model-to-world transformation (as a 3x4 matrix). The primitive stream is a managed data stream and the instance data is a dynamic stream that is kept locked whenever it's not actually rendering. A max limit is stored as an enumeration in the class, and if the number of calls exceeds this limit - the render call is immediately pushed out and the instance data is reset. Put more simply:
Instance Draw Call -> Store data in instance array -> Draw all instances
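To make that flow a bit more concrete, here is a minimal CPU-side sketch of the batching logic in C++. It is not the actual TGS code: the names DebugInstanceBatch, InstanceData, drawSphereGrid and the limit of 256 are all hypothetical, and a std::vector stands in for the locked dynamic vertex buffer. Where the real renderer would unlock the buffer and issue the instanced draw, this sketch only counts submissions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Per-instance data as described above: a colour plus a 3x4 model-to-world
// matrix. The exact field layout is an assumption made for this sketch.
struct InstanceData {
    float colour[4];
    float modelToWorld[12]; // 3 rows x 4 columns, row-major
};

// Hypothetical batcher: collects instances and flushes when the limit is hit.
class DebugInstanceBatch {
public:
    enum { kMaxInstances = 256 }; // assumed limit, stored as an enumeration

    // "Instance Draw Call -> Store data in instance array"
    void draw(const InstanceData& data) {
        instances_.push_back(data);
        if (instances_.size() >= static_cast<std::size_t>(kMaxInstances)) {
            flush(); // limit reached: push the render call out immediately
        }
    }

    // "-> Draw all instances"
    void flush() {
        assert(instances_.size() <= static_cast<std::size_t>(kMaxInstances));
        if (instances_.empty()) return;
        // A real renderer would unlock the instance buffer here, set the
        // stream frequencies, issue one indexed draw, then re-lock the buffer.
        ++drawCalls_;
        instancesSubmitted_ += instances_.size();
        instances_.clear(); // reset the instance data
    }

    std::size_t drawCalls() const { return drawCalls_; }
    std::size_t instancesSubmitted() const { return instancesSubmitted_; }
    std::size_t pending() const { return instances_.size(); }

private:
    std::vector<InstanceData> instances_;
    std::size_t drawCalls_ = 0;
    std::size_t instancesSubmitted_ = 0;
};

// Queues `count` white spheres on a 32-wide grid, then flushes the remainder.
inline void drawSphereGrid(DebugInstanceBatch& batch, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        InstanceData d = {};
        d.colour[0] = d.colour[1] = d.colour[2] = d.colour[3] = 1.0f;
        // Identity rotation on the diagonal, translation in the last column.
        d.modelToWorld[0] = d.modelToWorld[5] = d.modelToWorld[10] = 1.0f;
        d.modelToWorld[3] = static_cast<float>(i % 32) * 2.0f;
        d.modelToWorld[7] = static_cast<float>(i / 32) * 2.0f;
        batch.draw(d);
    }
    batch.flush();
}
```

With the 1024-sphere test grid and the assumed limit of 256, drawSphereGrid ends up making four draw submissions instead of 1024.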
Not surprisingly, this did not change my resulting frame rate very much. I say not surprisingly because, given the lack of any other processing occurring on the system, I am most definitely not CPU bound - and most of my processing time would be spent on pixel-bound work, since textures are for the most part not being used. Vertex processing time for instanced and non-instanced calls would not differ much, other than perhaps a slightly higher cost due to the instance data stream. However, in a fully working environment, I definitely see this new method as a win. The previous method required sending the colour and the model-to-world matrix to the card for each render call - that is a lot of SetVertexShaderConstantF calls. So the instancing method eliminates the SetVertexShaderConstantF calls and drops the number of DIPs (DrawIndexedPrimitive calls), thus reducing the number of possible CPU-GPU synchronization points. I am hoping this will make a big difference when running in a real world application. Future testing will tell me if I'm right :-)
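As a rough back-of-the-envelope check on the call counts, here is a small C++ sketch. It assumes the old path made two constant uploads per primitive (one for the colour, one for the matrix) and that the instanced path uses a batch limit of 256; both numbers are assumptions for illustration, not measurements from the engine, and the names CallCounts, nonInstancedCalls and instancedCalls are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// API calls issued per frame for the debug primitives.
struct CallCounts {
    std::size_t drawCalls;       // DrawIndexedPrimitive-style submissions
    std::size_t constantUploads; // SetVertexShaderConstantF-style calls
};

// Old path: one draw per primitive, plus the per-primitive constant uploads
// (assumed here to be two: colour and model-to-world matrix).
inline CallCounts nonInstancedCalls(std::size_t primitives,
                                    std::size_t uploadsPerDraw = 2) {
    return CallCounts{primitives, primitives * uploadsPerDraw};
}

// Instanced path: one draw per full or partial batch, and no per-primitive
// constant uploads because the data travels in the instance stream.
inline CallCounts instancedCalls(std::size_t primitives,
                                 std::size_t maxPerBatch) {
    assert(maxPerBatch > 0);
    const std::size_t batches = (primitives + maxPerBatch - 1) / maxPerBatch;
    return CallCounts{batches, 0};
}
```

Under those assumptions, the 1024-sphere test goes from 1024 draws and 2048 constant uploads down to 4 draws and no per-primitive constant uploads.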