Smart pointers passed by const reference

Smart pointers are often passed by const references. C++ experts, Andrei Alexandrescu, Scott Meyers and Herb Sutter, speculate on this topic during C++ and Beyond 2011 ([04:34] On shared_ptr performance and correctness).

Basically, a smart pointer that is passed-in by const reference already lives in the current scope, somewhere at the call site. It may be stored in a class member and you may do something that clears that member. But this is not the problem of passing by reference, it is the problem of your architecture and ownership policy.

However, this post is not about correctness. It is about performance and what we actually can gain by switching to const references. The first impression may be that the only thing we will get is avoidance of atomic increments/decrements in copy constructor and destructor. Let’s take a closer look and write some code to understand what is going on behind the scenes.

First, some helper functions:

const size_t NUM_CALLS = 10000000;

double GetSeconds()
{
	return ( double )clock() / CLOCKS_PER_SEC;
}

void PrintElapsedTime( double ElapsedTime )
{
	printf( "%f s/Mcalls\n", float( ElapsedTime / double( NUM_CALLS / 10000000 ) )  );
}

Then an intrusive counter:

class iIntrusiveCounter
{
public:
	iIntrusiveCounter():FRefCounter(0) {};
	virtual ~iIntrusiveCounter() {}
	void    IncRefCount() { FRefCounter++; }
	void    DecRefCount() { if ( --FRefCounter == 0 ) { delete this; } }
private:
	std::atomic<int> FRefCounter;
};

And an ad hoc intrusive smart pointer:

template <class T> class clPtr
{
public:
	clPtr(): FObject( 0 ) {}
	clPtr( const clPtr& Ptr ): FObject( Ptr.FObject ) { FObject->IncRefCount(); }
	clPtr( T* const Object ): FObject( Object ) { FObject->IncRefCount(); }
	~clPtr() { FObject->DecRefCount(); }
	clPtr& operator = ( const clPtr& Ptr )
	{
		T* Temp = FObject;
		FObject = Ptr.FObject;
		Ptr.FObject->IncRefCount();
		Temp->DecRefCount();
		return *this;
	}
	inline T* operator -> () const { return FObject; }
private:
	T*    FObject;
};

Pretty simple, right?
Let’s now declare a simple class, a smart pointer to an instance of which will be passed, first, by value and then by const reference to a function:

class clTestObject: public iIntrusiveCounter
{
public:
	clTestObject():FPayload(32167) {}
	// do some dummy work here
	void Do()
	{
		FPayload++;
	}

private:
	int FPayload;
};

Everything is now ready to write the actual benchmarking code:

void ProcessByValue( clPtr<clTestObject> O ) { O->Do(); }
void ProcessByConstRef( const clPtr<clTestObject>& O ) { O->Do(); }

int main()
{
	clPtr<clTestObject> Obj = new clTestObject;
	for ( size_t j = 0; j != 3; j++ )
	{
		double StartTime = GetSeconds();
		for ( size_t i = 0; i != NUM_CALLS; i++ ) { ProcessByValue( Obj ); }
		PrintElapsedTime( GetSeconds() - StartTime );
	}
	for ( size_t j = 0; j != 3; j++ )
	{
		double StartTime = GetSeconds();
		for ( size_t i = 0; i != NUM_CALLS; i++ ) { ProcessByConstRef( Obj ); }
		PrintElapsedTime( GetSeconds() - StartTime );
	}
	return 0;
}

Let’s build it and see what happens. First, we will start with a completely unoptimized debug version (I use gcc.EXE (GCC) 4.10.0 20140420 (experimental)):

gcc -O0 main.cpp -lstdc++ -std=c++11

The run time is 0.375 s/Mcalls for the pass by value version versus 0.124 s/Mcalls for the pass by const reference version. A persuasive 3x performance difference in the debug build. That is good. Let’s take a look at the underlying assembly. The by-value version:

L25:
	leal	-60(%ebp), %eax
	leal	-64(%ebp), %edx
	movl	%edx, (%esp)
	movl	%eax, %ecx
	call	__ZN5clPtrI12clTestObjectEC1ERKS1_		// call copy ctor
	subl	$4, %esp
	leal	-60(%ebp), %eax
	movl	%eax, (%esp)
	call	__Z14ProcessByValue5clPtrI12clTestObjectE
	leal	-60(%ebp), %eax
	movl	%eax, %ecx
	call	__ZN5clPtrI12clTestObjectED1Ev			// call dtor
	addl	$1, -32(%ebp)
L24:
	cmpl	$10000000, -32(%ebp)
	jne	L25

The by-const-reference version. Notice how clean it is even in a debug build:

L29:
	leal	-64(%ebp), %eax
	movl	%eax, (%esp)
	call	__Z17ProcessByConstRefRK5clPtrI12clTestObjectE	// just a single call
	addl	$1, -40(%ebp)
L28:
	cmpl	$10000000, -40(%ebp)
	jne	L29

All the calls are in their places and what we only save here are two expensive atomic operations.
But debug builds are not what we actually want, right? Let’s optimize it and see what happens:

gcc -O3 main.cpp -lstdc++ -std=c++11

The by-value time is now 0.168 seconds per Mcalls. The by-const-reference time is ZERO. I mean it. No matter how many iterations you have, the elapsed time in this simple test sample will be zero. Let’s see the assembly to check if we are not mistaken somewhere. This is the optimized by-value version:

L25:
	call	_clock
	movl	%eax, 36(%esp)
	fildl	36(%esp)
	movl	$10000000, 36(%esp)
	fdivs	LC0
	fstpl	24(%esp)
	.p2align 4,,10
L24:
	movl	32(%esp), %eax
	lock addl	$1, (%eax)		// this is our inlined IncRefCount()...
	movl	40(%esp), %ecx
	addl	$1, 8(%ecx)			// bodies of ProcessByValue() and Do() - 2 instructions
	lock subl	$1, (%eax)		// .. and this is DecRefCount(). Quite impressive.
	jne	L23
	movl	(%ecx), %eax
	call	*4(%eax)
L23:
	subl	$1, 36(%esp)
	jne	L24
	call	_clock

Ok, but why the by-const-reference version is so much faster we cannot measure it? Here it is:

	call	_clock
	movl	%eax, 36(%esp)
	movl	40(%esp), %eax
	addl	$10000000, 8(%eax)		// here is the final result, no loops, no nothing
	call	_clock
	movl	%eax, 32(%esp)
	movl	$20, 4(%esp)
	fildl	32(%esp)
	movl	$LC2, (%esp)
	movl	$1, 48(%esp)
	flds	LC0
	fdivr	%st, %st(1)
	fildl	36(%esp)
	fdivp	%st, %st(1)
	fsubrp	%st, %st(1)
	fstpl	8(%esp)
	call	_printf

Just Wow! The complete benchmark is actually in this assembly lines. The absence of atomic hassle lets the optimizer kick in and unroll everything into a single precalculated value. Of course, this is a very trivial code sample. However, it clearly makes 2 points why passing smart pointers by const reference is not a premature optimization but a serious performance improvement:

1) elimination of atomic operations is a large benefit in itself
2) elimination of atomic ops allows the optimizer to jump in and do its magic

Here is the full source code.

Results with your compiler may vary :)

P.S. Herb Sutter has a very elaborate essay on the topic, covering the C++ side in great detail.

Rendering UI transitions on mobile: Adreno 330

In the post Rendering UI transitions on mobile, I mentioned a problem with my transition shader on Adreno 330 GPUs. The solution was pretty easy. I’ve just unrolled the for-loop (and reduced the number of taps, but this is another story). Here is the new code for main() in the fragment shader which works on Adreno perfectly:

void main(void)
{
	float T = u_TransitionValue;

	float S0 = 1.0;
	float S1 = u_PixelSize;
	float S2 = 1.0;

	// 2 segments, 1/2 each
	float Half = 0.5;

	float PixelSize = ( T < Half ) ? mix( S0, S1, T / Half ) : mix( S1, S2, (T-Half) / Half );

	vec2 D = PixelSize * u_Resolution.zw;

	vec2 UV = v_TexCoord.xy;

	// 5-tap Poisson disk coefficients
	vec2 Disk[5];
	Disk[0] = vec2( 0.1134811,   0.6604039) * D + UV;
	Disk[1] = vec2(-0.4988798,   0.2663419) * D + UV;
	Disk[2] = vec2(-0.4542479,  -0.4338912) * D + UV;
	Disk[3] = vec2( 0.7253948,  -0.1434357) * D + UV;
	Disk[4] = vec2( 0.09679408, -0.9359848) * D + UV;

	vec4 C0 = texture( Texture0, UV );
	C0 += texture( Texture0, Disk[0] );
	C0 += texture( Texture0, Disk[1] );
	C0 += texture( Texture0, Disk[2] );
	C0 += texture( Texture0, Disk[3] );
	C0 += texture( Texture0, Disk[4] );
	C0 /= 6.0;

	vec4 C1 = texture( Texture1, UV );
	C1 += texture( Texture1, Disk[0] );
	C1 += texture( Texture1, Disk[1] );
	C1 += texture( Texture1, Disk[2] );
	C1 += texture( Texture1, Disk[3] );
	C1 += texture( Texture1, Disk[4] );
	C1 /= 6.0;

	out_FragColor = mix( C0, C1, T );
}

Btw, you can checkout how this transition looks like on GLSL.io.

Why std::vector<>::pop_back() does not return the popped value?

I was curious why std::vector<>::pop_back() does not return the popped value. Instead, there is a pair of methods: void pop_back() and const_reference back() const.

The answer was obvious when I tried to implement a vector-like class myself. Here is the code for an implementation of a would-be T std::vector<>::pop_back():

	   T pop_back()
	   {
	      FSize -= 1;
	      // call copy ctor
	      T Copy( FArray[ FSize ] );
	      // call dtor
	      FArray[ FSize ].~T();
	      // return the copy - this can raise an exception,
	      // but the value has been already popped from the stack
	      return Copy;
	   }

The implementation of the canonical pop_back() is straightforward and does not perform any redundant fuss with copying:

	void pop_back()
	{
		FSize -= 1;
		// this is exception safe since the dtor never throws
		FArray[ FSize ].~T();
	}

Exception safety is the

Adding density maps to the poisson points generator

I wanted to add varying density maps to my poisson points generator. For example, I want to use this density field to generate foliage:

I started with multiplying the MinDist parameter by the density value. This broke the entire algorithm. The solution that works is to generate a rectangle full of poisson points and then roll a dice for every point and discard it if the roll of the dice is above the density value at the considered point:

			float R = RandomFloat();
			float P = g_DensityMap[ x + y * ImageSize ];
			if ( R > P ) continue;

Here is the resulting image:

The complete source code is on GitHub: https://github.com/corporateshark/poisson-disk-generator

Rendering UI transitions on mobile

I was implementing a kind of defocus blur transition effect between two UI screens and came up with the following shader:

/// viewport resolution (in pixels) in .xy and inverse resolution in .zw
uniform vec4  u_Resolution;       
/// 0...1
uniform float u_TransitionValue;  
/// virtual pixel size
uniform float u_PixelSize;        
uniform sampler2D Texture0;      
uniform sampler2D Texture1;      
out vec4 out_FragColor;
void main(void)
{
	float T = u_TransitionValue;
	float S0 = 1.0;
	float S1 = u_PixelSize;
	float S2 = 1.0;
	// 2 segments, 1/2 each
	float Half = 0.5;
	float PixelSize = ( T < Half ) ? mix( S0, S1, T / Half ) : mix( S1, S2, (T-Half) / Half );
	vec2 D = PixelSize * u_Resolution.zw;
	vec2 UV = (gl_FragCoord.xy * u_Resolution.zw);
	// 12-tap Poisson disk coefficients:
	// https://github.com/spite/Wagner/blob/master/fragment-shaders/poisson-disc-blur-fs.glsl
	const int NumTaps = 12;
	vec2 Disk[NumTaps];
	Disk[0] = vec2(-.326,-.406);
	Disk[1] = vec2(-.840,-.074);
	Disk[2] = vec2(-.696, .457);
	Disk[3] = vec2(-.203, .621);
	Disk[4] = vec2( .962,-.195);
	Disk[5] = vec2( .473,-.480);
	Disk[6] = vec2( .519, .767);
	Disk[7] = vec2( .185,-.893);
	Disk[8] = vec2( .507, .064);
	Disk[9] = vec2( .896, .412);
	Disk[10] = vec2(-.322,-.933);
	Disk[11] = vec2(-.792,-.598);
	vec4 C0 = texture( Texture0, UV );
	vec4 C1 = texture( Texture1, UV );
	for ( int i = 0; i != NumTaps; i++ )
	{
		C0 += texture( Texture0, Disk[i] * D + UV );
		C1 += texture( Texture1, Disk[i] * D + UV );
	}
	C0 /= float(NumTaps+1);
	C1 /= float(NumTaps+1);
	out_FragColor = mix( C0, C1, T );
}

I get wrong results with this fragment shader on some mobile GPUs. For example, Andreno 330 on LG Nexus 5 gives just a series of shifted images instead of blur. Google Nexus 10 runs it just fine. Anyway, I will have to investigate it a bit later.

Work interruptions are evil

I can be doing something very important but if I get distracted for a little bit when I go back I totally forget what I was doing. Here is what happened today.

I wanted to code some new stuff and was about to add the line:

if ( !Node ) { return clPtr<clSceneNode>(); }

just before this line:

if ( !Node->LoadFromStream( Stream ) ) { return clPtr<clSceneNode>(); }

Suddenly I heard a phone ringing, got up from my computer, and left the code incomplete:

if ( !Node )
if ( !Node->LoadFromStream( Stream ) ) { return clPtr<clSceneNode>(); }

After the phone call I decided to drink some tee… I completely forgot what I was doing.
Long story short, when I returned to my coding I spent 30 minutes figuring out why my program was broken and what the heck was going on :(

Accessing Linux from Windows via sshd

I have a Virtual Box virtual machine and an Ubuntu installation running within it. Most of the time a do my development on Windows using Far Manager as my primary shell and text editor.

There is a nice plugin for it, called NetBox: http://plugring.farmanager.com/plugin.php?pid=859 (https://github.com/michaellukashov/Far-NetBox/downloads/),
which allows me to connect to a Linux machine via sshd and access its files as if they were on my local file system.

Here is a step-by-step checklist to make your Virtual Box sshd-enabled:

1. Run sudo apt-get install openssh-server
2. Run ifconfig and remember the IP-address on your virtual machine behind a NAT
3. Go to the Network settings menu: SettingsNetworkPort ForwardingInsert new rule and type:

Rule1 TCP 22 22

Reboot your VM. Now it is ready to accept sshd connections.

There are a few more steps to make your Far Manager to access the VM:

4. Unpack a NetBox distro to the plugins folder (for example, C:\Program Files (x86)\Far2\Plugins) and restart Far Manager
5. Press Alt-F1, select NetBox, press Shift+F4
6. Select the connection type: SCP. Enter the host IP-address and logic/password for your Ubuntu

This plugin allows not only accessing files but run the shell commands from Far.