OpenGL: storing index/vertex data together inside a single buffer

With modern OpenGL, it is possible to store index and vertex data together in a single buffer. Here is a short code snippet showing how to do this (using Direct State Access, OpenGL 4.5+).

Let’s assume our data is initially loaded into these two vectors.

vector<vec3> positions;
vector<unsigned int> indices;
...

First, let’s create a new buffer and upload everything into it.

const size_t sizeI = sizeof(unsigned int) * indices.size();
const size_t sizeV = sizeof(vec3) * positions.size();

GLuint buf;
glCreateBuffers(1, &buf);
glNamedBufferStorage(buf, sizeI + sizeV, nullptr, GL_DYNAMIC_STORAGE_BIT);
glNamedBufferSubData(buf, 0, sizeI, indices.data());
glNamedBufferSubData(buf, sizeI, sizeV, positions.data());

Create and set up a new VAO.

GLuint vao;
glCreateVertexArrays(1, &vao);

Now we can tell OpenGL that our indices should be read from the same buffer and the offset of our vertex data from the beginning of the buffer is equal to sizeI.

glVertexArrayElementBuffer(vao, buf);
glVertexArrayVertexBuffer(vao, 0, buf, sizeI, sizeof(vec3));
glEnableVertexArrayAttrib(vao, 0);
glVertexArrayAttribFormat(vao, 0, 3, GL_FLOAT, GL_FALSE, 0);
glVertexArrayAttribBinding(vao, 0, 0);

That’s it. Once the VAO is bound with glBindVertexArray(vao), rendering with glDrawElements(GL_TRIANGLES, (GLsizei)indices.size(), GL_UNSIGNED_INT, nullptr); just works as expected.
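
The two glNamedBufferSubData calls above could also be replaced by a single upload if the data is packed on the CPU first. Below is a minimal sketch of that idea, not the code from this post: Vec3, PackedMesh and PackMesh are hypothetical names, and Vec3 is only a stand-in for whatever vec3 type you use (glm::vec3, for example).

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for the vec3 type used above (e.g. glm::vec3).
struct Vec3 { float x, y, z; };

// CPU-side image of the combined buffer: indices first, vertices right after them.
struct PackedMesh
{
	std::vector<std::uint8_t> bytes; // contents to upload in one call
	std::size_t vertexOffset;        // byte offset of the first vertex (sizeI above)
};

PackedMesh PackMesh( const std::vector<std::uint32_t>& indices, const std::vector<Vec3>& positions )
{
	const std::size_t sizeI = sizeof( std::uint32_t ) * indices.size();
	const std::size_t sizeV = sizeof( Vec3 ) * positions.size();

	PackedMesh mesh;
	mesh.bytes.resize( sizeI + sizeV );
	mesh.vertexOffset = sizeI;

	// memcpy with an empty source is skipped to stay on the safe side
	if ( sizeI ) { std::memcpy( mesh.bytes.data(), indices.data(), sizeI ); }
	if ( sizeV ) { std::memcpy( mesh.bytes.data() + sizeI, positions.data(), sizeV ); }

	return mesh;
}
```

With this, mesh.bytes.data() and mesh.bytes.size() can be passed straight to glNamedBufferStorage (the flags can then be 0 if the buffer is never updated afterwards), and mesh.vertexOffset takes the place of sizeI in glVertexArrayVertexBuffer.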

Dreaming of an image loading/saving/manipulation library

For decades I was using FreeImage as a “backend” image loading/saving library, which I wrapped in my own Bitmap class providing additional image manipulation functionality on top: convert colors, blit, resize, convolve, move channels around, cut & paste, and so on and so forth. The Bitmap class has been evolving for almost 15 years to become what it is now, a true abomination combining different coding styles and looking like a patched stovepipe with numerous ugly fixes and hacks. While it is still doing its job, it is becoming more and more difficult to support and maintain. A full-featured, clean image manipulation library would be really nice to have. Here is my feature wish list for it.

  • Support 1D, 2D, 3D and Cube images. I’m doing graphics and using this class as a staging ground to upload textures to GPU. 1D image can be a particular case of 2D image with the height of 1, 2D image can be a case of 3D image with the depth of 1, and Cube image can be a 3D image with the depth equal to 6. Should be able to convert equirectangular projection to a cube image and back. Should be able to convert vertical/horizontal cross image to a cube image and back. Should be able to extract a 2D slice from a 3D image (and write it back as well).
  • Support different pixel formats. 8/16/32-bit unsigned/signed integer. 16/32-bit floating point. Arbitrary number of channels, R/RG/RGB/RGBA. Basic image operations like GetPixel, SetPixel, Resize, Flip, etc should work with all these basic formats. GetPixel should support bilinear and trilinear (for 3D textures) filtering. Cube image lookup should support seamless mode.
  • Some notion of channel semantics. All flavours of RGB-BGR, RGBA-BGRA, ARGB-RGBA, etc. conversions. Different color spaces (CMYK, Lab) are desirable but not a show stopper.
  • Conversion from sRGB to linear color and back.
  • Support of some esoteric formats to store compressed image data and upload it later to GPU. ETC2/ETC2_EAC is a minimum here. Basic image operations can skip all work with these formats. That’s it, just storage.
  • Blending operations. Normal, Lighten, Darken, Multiply, Average, Add, Subtract, Difference, Negation, Screen, Exclusion, SoftLight, HardLight, VividLight, PinLight, LinearLight, HardMix, ColorDodge, ColorBurn, LinearDodge, LinearBurn, etc.
  • The library should not be too pessimized on performance if compiled with -O0 (many C++14/17/20 libraries suffer from this). Should be able to run the code in debug builds reasonably well.
  • Add/remove scanline stride.
  • Load and save different image formats. JPEG, PNG, HDR, KTX, RAW (uncompressed pixels, mostly for 3D textures) loading and saving are vital. EXR loading is necessary as well; saving, however, is optional.
  • Load and save from/to files and memory.
  • Move channels around. “Take the 2nd channel from this image and put it into the 1st channel of this one”.
  • Some rudimentary drawing operations: ClearColor, FillBox, MakeXorPattern, line drawing, etc.
  • Store mipmap levels (just having a NextImage pointer will do). Treat mipmaps well while loading and saving image formats which support them (KTX).
  • Calculate normal maps from height maps and height maps from normal maps.
  • Should have configurable dependencies. All supported image formats should be optional and configurable at will.
  • Support multiple image loading/saving backends switchable at compile time. For example, libpng/libjpeg vs stb_image.
  • Simple to compile. Two files would be ideal: .cpp and .hpp
  • Anyone?
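
To make one of the wishes above more concrete, here is a minimal sketch of the sRGB to linear conversion and back for a single channel value in the [0, 1] range. It uses the standard piecewise sRGB transfer function; the function names are made up for illustration and are not taken from any existing library.

```cpp
#include <cmath>

// sRGB-encoded channel value in [0, 1] -> linear-light value in [0, 1].
inline float SRGBToLinear( float c )
{
	return ( c <= 0.04045f ) ? c / 12.92f : std::pow( ( c + 0.055f ) / 1.055f, 2.4f );
}

// Linear-light channel value in [0, 1] -> sRGB-encoded value in [0, 1].
inline float LinearToSRGB( float l )
{
	return ( l <= 0.0031308f ) ? l * 12.92f : 1.055f * std::pow( l, 1.0f / 2.4f ) - 0.055f;
}
```

Only the color channels should go through these functions; an alpha channel is normally stored linearly and left untouched.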

Visual Studio 2015 and lambda parameters

Was writing this code in Visual Studio 2015. Note the ‘auto’ specifier in the lambda parameter list.

template <typename Container, typename Entity>
void Remove( Container& c, Entity e )
{
	auto Iter = std::remove_if(
		c.begin(), c.end(),
		[ e ]( const auto& Ent ){ return Ent == e; }
	);
	c.erase( Iter, c.end() );
}

Visual Studio 2013 does not support generic lambdas (a C++14 feature), so I ended up with this code to stay compatible with it.

template <typename Container, typename Entity>
void Remove( Container& c, Entity e )
{
	auto Iter = std::remove_if(
		c.begin(), c.end(),
		[ e ]( const typename Container::value_type& Ent ) { return Ent == e; }
	);
	c.erase( Iter, c.end() );
}
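
For completeness, here is a small usage sketch of the portable version with a std::vector<int>. RemoveDemo is a hypothetical helper that exists only for this example. Note that the erase-remove idiom is meant for sequence containers such as std::vector or std::deque, not for associative ones.

```cpp
#include <algorithm>
#include <vector>

// Same as the Visual Studio 2013 compatible version above.
template <typename Container, typename Entity>
void Remove( Container& c, Entity e )
{
	auto Iter = std::remove_if(
		c.begin(), c.end(),
		[ e ]( const typename Container::value_type& Ent ) { return Ent == e; }
	);
	c.erase( Iter, c.end() );
}

// Hypothetical helper: removes every 2 from a small vector and returns the rest.
std::vector<int> RemoveDemo()
{
	std::vector<int> v = { 1, 2, 3, 2, 4 };
	Remove( v, 2 );
	return v; // the remaining elements keep their relative order
}
```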

Smart pointers passed by const reference

Smart pointers are often passed by const reference. C++ experts Andrei Alexandrescu, Scott Meyers and Herb Sutter discuss this topic during C++ and Beyond 2011 ([04:34] On shared_ptr performance and correctness).

Basically, a smart pointer that is passed-in by const reference already lives in the current scope, somewhere at the call site. It may be stored in a class member and you may do something that clears that member. But this is not the problem of passing by reference, it is the problem of your architecture and ownership policy.
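
The situation described above can be sketched in a few lines. This is only an illustration and it uses std::shared_ptr instead of the intrusive pointer from this post; Holder and both functions are hypothetical names.

```cpp
#include <memory>

// Hypothetical owner that stores the smart pointer in a class member.
struct Holder
{
	std::shared_ptr<int> member = std::make_shared<int>( 42 );
};

// The parameter is just another name for h.member at the call site below,
// so clearing the member also empties what the callee sees.
bool SurvivesByConstRef( Holder& h, const std::shared_ptr<int>& p )
{
	h.member.reset();
	return p != nullptr;
}

// The parameter is a copy holding its own reference,
// so the object stays alive until the function returns.
bool SurvivesByValue( Holder& h, std::shared_ptr<int> p )
{
	h.member.reset();
	return p != nullptr;
}
```

Calling SurvivesByConstRef( h, h.member ) returns false, while SurvivesByValue( h, h.member ) on a fresh Holder returns true. The difference comes from who owns what at the call site, not from the reference itself.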

However, this post is not about correctness. It is about performance and what we actually can gain by switching to const references. The first impression may be that the only thing we will get is avoidance of atomic increments/decrements in copy constructor and destructor. Let’s take a closer look and write some code to understand what is going on behind the scenes.

First, some helper functions:

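#include <atomic>	// std::atomic
#include <cstdio>	// printf
#include <ctime>	// clock, CLOCKS_PER_SEC

const size_t NUM_CALLS = 10000000;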

double GetSeconds()
{
	return ( double )clock() / CLOCKS_PER_SEC;
}

void PrintElapsedTime( double ElapsedTime )
{
	printf( "%f s/Mcalls\n", float( ElapsedTime / double( NUM_CALLS / 10000000 ) )  );
}

Then an intrusive counter:

class iIntrusiveCounter
{
public:
	iIntrusiveCounter():FRefCounter(0) {}
	virtual ~iIntrusiveCounter() {}
	void    IncRefCount() { FRefCounter++; }
	void    DecRefCount() { if ( --FRefCounter == 0 ) { delete this; } }
private:
	std::atomic<int> FRefCounter;
};

And an ad hoc intrusive smart pointer:

template <class T> class clPtr
{
public:
	clPtr(): FObject( 0 ) {}
	clPtr( const clPtr& Ptr ): FObject( Ptr.FObject ) { FObject->IncRefCount(); }
	clPtr( T* const Object ): FObject( Object ) { FObject->IncRefCount(); }
	~clPtr() { FObject->DecRefCount(); }
	clPtr& operator = ( const clPtr& Ptr )
	{
		T* Temp = FObject;
		FObject = Ptr.FObject;
		Ptr.FObject->IncRefCount();
		Temp->DecRefCount();
		return *this;
	}
	inline T* operator -> () const { return FObject; }
private:
	T*    FObject;
};

Pretty simple, right?
Let’s now declare a simple test class. A smart pointer to its instance will be passed to a function, first by value and then by const reference:

class clTestObject: public iIntrusiveCounter
{
public:
	clTestObject():FPayload(32167) {}
	// do some dummy work here
	void Do()
	{
		FPayload++;
	}

private:
	int FPayload;
};

Everything is now ready to write the actual benchmarking code:

void ProcessByValue( clPtr<clTestObject> O ) { O->Do(); }
void ProcessByConstRef( const clPtr<clTestObject>& O ) { O->Do(); }

int main()
{
	clPtr<clTestObject> Obj = new clTestObject;
	for ( size_t j = 0; j != 3; j++ )
	{
		double StartTime = GetSeconds();
		for ( size_t i = 0; i != NUM_CALLS; i++ ) { ProcessByValue( Obj ); }
		PrintElapsedTime( GetSeconds() - StartTime );
	}
	for ( size_t j = 0; j != 3; j++ )
	{
		double StartTime = GetSeconds();
		for ( size_t i = 0; i != NUM_CALLS; i++ ) { ProcessByConstRef( Obj ); }
		PrintElapsedTime( GetSeconds() - StartTime );
	}
	return 0;
}

Let’s build it and see what happens. First, we will start with a completely unoptimized debug version (I use gcc.EXE (GCC) 4.10.0 20140420 (experimental)):

gcc -O0 main.cpp -lstdc++ -std=c++11

The run time is 0.375 s/Mcalls for the pass by value version versus 0.124 s/Mcalls for the pass by const reference version. A persuasive 3x performance difference in the debug build. That is good. Let’s take a look at the underlying assembly. The by-value version:

L25:
	leal	-60(%ebp), %eax
	leal	-64(%ebp), %edx
	movl	%edx, (%esp)
	movl	%eax, %ecx
	call	__ZN5clPtrI12clTestObjectEC1ERKS1_		// call copy ctor
	subl	$4, %esp
	leal	-60(%ebp), %eax
	movl	%eax, (%esp)
	call	__Z14ProcessByValue5clPtrI12clTestObjectE
	leal	-60(%ebp), %eax
	movl	%eax, %ecx
	call	__ZN5clPtrI12clTestObjectED1Ev			// call dtor
	addl	$1, -32(%ebp)
L24:
	cmpl	$10000000, -32(%ebp)
	jne	L25

The by-const-reference version. Notice how clean it is even in a debug build:

L29:
	leal	-64(%ebp), %eax
	movl	%eax, (%esp)
	call	__Z17ProcessByConstRefRK5clPtrI12clTestObjectE	// just a single call
	addl	$1, -40(%ebp)
L28:
	cmpl	$10000000, -40(%ebp)
	jne	L29

All the calls are in place, and the only thing we save here is the two expensive atomic operations.
But debug builds are not what we actually want, right? Let’s optimize it and see what happens:

gcc -O3 main.cpp -lstdc++ -std=c++11

The by-value time is now 0.168 seconds per Mcalls. The by-const-reference time is ZERO. I mean it. No matter how many iterations you have, the elapsed time in this simple test sample will be zero. Let’s look at the assembly to make sure we are not mistaken somewhere. This is the optimized by-value version:

L25:
	call	_clock
	movl	%eax, 36(%esp)
	fildl	36(%esp)
	movl	$10000000, 36(%esp)
	fdivs	LC0
	fstpl	24(%esp)
	.p2align 4,,10
L24:
	movl	32(%esp), %eax
	lock addl	$1, (%eax)		// this is our inlined IncRefCount()...
	movl	40(%esp), %ecx
	addl	$1, 8(%ecx)			// bodies of ProcessByValue() and Do() - 2 instructions
	lock subl	$1, (%eax)		// .. and this is DecRefCount(). Quite impressive.
	jne	L23
	movl	(%ecx), %eax
	call	*4(%eax)
L23:
	subl	$1, 36(%esp)
	jne	L24
	call	_clock

OK, but why is the by-const-reference version so fast that we cannot even measure it? Here it is:

	call	_clock
	movl	%eax, 36(%esp)
	movl	40(%esp), %eax
	addl	$10000000, 8(%eax)		// here is the final result, no loops, no nothing
	call	_clock
	movl	%eax, 32(%esp)
	movl	$20, 4(%esp)
	fildl	32(%esp)
	movl	$LC2, (%esp)
	movl	$1, 48(%esp)
	flds	LC0
	fdivr	%st, %st(1)
	fildl	36(%esp)
	fdivp	%st, %st(1)
	fsubrp	%st, %st(1)
	fstpl	8(%esp)
	call	_printf

Just wow! The complete benchmark is actually in these assembly lines. The absence of the atomic hassle lets the optimizer kick in and collapse the whole loop into a single precalculated value. Of course, this is a very trivial code sample. However, it clearly makes two points on why passing smart pointers by const reference is not a premature optimization but a serious performance improvement:

1) elimination of atomic operations is a large benefit in itself
2) elimination of atomic ops allows the optimizer to jump in and do its magic

Here is the full source code.

Results with your compiler may vary 🙂

P.S. Herb Sutter has a very elaborate essay on the topic, covering the C++ side in great detail.

Rendering UI transitions on mobile: Adreno 330

In the post Rendering UI transitions on mobile, I mentioned a problem with my transition shader on Adreno 330 GPUs. The solution was pretty easy. I’ve just unrolled the for-loop (and reduced the number of taps, but this is another story). Here is the new code for main() in the fragment shader which works on Adreno perfectly:

void main(void)
{
	float T = u_TransitionValue;

	float S0 = 1.0;
	float S1 = u_PixelSize;
	float S2 = 1.0;

	// 2 segments, 1/2 each
	float Half = 0.5;

	float PixelSize = ( T < Half ) ? mix( S0, S1, T / Half ) : mix( S1, S2, (T-Half) / Half );

	vec2 D = PixelSize * u_Resolution.zw;

	vec2 UV = v_TexCoord.xy;

	// 5-tap Poisson disk coefficients
	vec2 Disk[5];
	Disk[0] = vec2( 0.1134811,   0.6604039) * D + UV;
	Disk[1] = vec2(-0.4988798,   0.2663419) * D + UV;
	Disk[2] = vec2(-0.4542479,  -0.4338912) * D + UV;
	Disk[3] = vec2( 0.7253948,  -0.1434357) * D + UV;
	Disk[4] = vec2( 0.09679408, -0.9359848) * D + UV;

	vec4 C0 = texture( Texture0, UV );
	C0 += texture( Texture0, Disk[0] );
	C0 += texture( Texture0, Disk[1] );
	C0 += texture( Texture0, Disk[2] );
	C0 += texture( Texture0, Disk[3] );
	C0 += texture( Texture0, Disk[4] );
	C0 /= 6.0;

	vec4 C1 = texture( Texture1, UV );
	C1 += texture( Texture1, Disk[0] );
	C1 += texture( Texture1, Disk[1] );
	C1 += texture( Texture1, Disk[2] );
	C1 += texture( Texture1, Disk[3] );
	C1 += texture( Texture1, Disk[4] );
	C1 /= 6.0;

	out_FragColor = mix( C0, C1, T );
}
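
The pixel size ramp at the top of the shader is easy to check on the CPU. Here is a small C++ sketch of the same piecewise interpolation; Mix and TransitionPixelSize are hypothetical names, with Mix mimicking the GLSL mix() built-in for floats and maxPixelSize playing the role of u_PixelSize.

```cpp
// GLSL-style linear interpolation: returns a when t == 0 and b when t == 1.
inline float Mix( float a, float b, float t )
{
	return a * ( 1.0f - t ) + b * t;
}

// Grows from 1 to maxPixelSize during the first half of the transition
// and shrinks back to 1 during the second half.
inline float TransitionPixelSize( float t, float maxPixelSize )
{
	const float Half = 0.5f;

	return ( t < Half ) ? Mix( 1.0f, maxPixelSize, t / Half )
	                    : Mix( maxPixelSize, 1.0f, ( t - Half ) / Half );
}
```

So the blur is at its widest exactly in the middle of the transition, where both textures contribute equally to the final color.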

Btw, you can check out what this transition looks like on GLSL.io.