Monthly Archives: July 2014

Smart pointers passed by const reference

Smart pointers are often passed by const reference. C++ experts Andrei Alexandrescu, Scott Meyers, and Herb Sutter discussed this topic at C++ and Beyond 2011 ([04:34] On shared_ptr performance and correctness).

Basically, a smart pointer passed in by const reference already lives somewhere in the current scope, at the call site. It may be stored in a class member, and something you do may clear that member. But that is not a problem of passing by reference; it is a problem of your architecture and ownership policy.

However, this post is not about correctness. It is about performance, and about what we can actually gain by switching to const references. The first impression may be that the only thing we gain is the avoidance of atomic increments and decrements in the copy constructor and destructor. Let's take a closer look and write some code to understand what is going on behind the scenes.

First, some helper functions:

#include <cstdio>   // printf
#include <ctime>    // clock, CLOCKS_PER_SEC
#include <atomic>   // std::atomic
 
const size_t NUM_CALLS = 10000000;
 
double GetSeconds()
{
    return ( double )clock() / CLOCKS_PER_SEC;
}
 
void PrintElapsedTime( double ElapsedTime )
{
    // NUM_CALLS / 1000000 converts the call count to millions of calls
    printf( "%f s/Mcalls\n", float( ElapsedTime / double( NUM_CALLS / 1000000 ) ) );
}

Then an intrusive counter:

class iIntrusiveCounter
{
public:
    iIntrusiveCounter(): FRefCounter( 0 ) {}
    virtual ~iIntrusiveCounter() {}
    void    IncRefCount() { FRefCounter++; }
    void    DecRefCount() { if ( --FRefCounter == 0 ) { delete this; } }
private:
    std::atomic<int> FRefCounter;
};

And an ad hoc intrusive smart pointer:

template <class T> class clPtr
{
public:
    clPtr(): FObject( 0 ) {}
    clPtr( const clPtr& Ptr ): FObject( Ptr.FObject ) { if ( FObject ) FObject->IncRefCount(); }
    clPtr( T* const Object ): FObject( Object ) { if ( FObject ) FObject->IncRefCount(); }
    ~clPtr() { if ( FObject ) FObject->DecRefCount(); }
    clPtr& operator = ( const clPtr& Ptr )
    {
        T* Temp = FObject;
        FObject = Ptr.FObject;
        if ( FObject ) FObject->IncRefCount();
        if ( Temp ) Temp->DecRefCount();     // safe even for self-assignment
        return *this;
    }
    inline T* operator -> () const { return FObject; }
private:
    T*    FObject;
};

Pretty simple, right?
Let's now declare a simple class; a smart pointer to an instance of it will be passed to a function, first by value and then by const reference:

class clTestObject: public iIntrusiveCounter
{
public:
    clTestObject():FPayload(32167) {}
    // do some dummy work here
    void Do()
    {
        FPayload++;
    }
 
private:
    int FPayload;
};

Everything is now ready to write the actual benchmarking code:

void ProcessByValue( clPtr<clTestObject> O ) { O->Do(); }
void ProcessByConstRef( const clPtr<clTestObject>& O ) { O->Do(); }
 
int main()
{
    clPtr<clTestObject> Obj = new clTestObject;
    for ( size_t j = 0; j != 3; j++ )
    {
        double StartTime = GetSeconds();
        for ( size_t i = 0; i != NUM_CALLS; i++ ) { ProcessByValue( Obj ); }
        PrintElapsedTime( GetSeconds() - StartTime );
    }
    for ( size_t j = 0; j != 3; j++ )
    {
        double StartTime = GetSeconds();
        for ( size_t i = 0; i != NUM_CALLS; i++ ) { ProcessByConstRef( Obj ); }
        PrintElapsedTime( GetSeconds() - StartTime );
    }
    return 0;
}

Let's build it and see what happens. First, we will start with a completely unoptimized debug build (I use gcc.EXE (GCC) 4.10.0 20140420 (experimental)):

gcc -O0 main.cpp -lstdc++ -std=c++11

The run time is 0.375 s/Mcalls for the pass-by-value version versus 0.124 s/Mcalls for the pass-by-const-reference version: a persuasive 3x performance difference in the debug build. That is good. Let's take a look at the underlying assembly. The by-value version:

L25:
    leal    -60(%ebp), %eax
    leal    -64(%ebp), %edx
    movl    %edx, (%esp)
    movl    %eax, %ecx
    call    __ZN5clPtrI12clTestObjectEC1ERKS1_      // call copy ctor
    subl    $4, %esp
    leal    -60(%ebp), %eax
    movl    %eax, (%esp)
    call    __Z14ProcessByValue5clPtrI12clTestObjectE
    leal    -60(%ebp), %eax
    movl    %eax, %ecx
    call    __ZN5clPtrI12clTestObjectED1Ev          // call dtor
    addl    $1, -32(%ebp)
L24:
    cmpl    $10000000, -32(%ebp)
    jne L25

The by-const-reference version. Notice how clean it is even in a debug build:

L29:
    leal    -64(%ebp), %eax
    movl    %eax, (%esp)
    call    __Z17ProcessByConstRefRK5clPtrI12clTestObjectE  // just a single call
    addl    $1, -40(%ebp)
L28:
    cmpl    $10000000, -40(%ebp)
    jne L29

All the calls are still in place; the only thing we save here is two expensive atomic operations.
But debug builds are not what we actually care about, right? Let's turn on optimizations and see what happens:

gcc -O3 main.cpp -lstdc++ -std=c++11

The by-value time is now 0.168 s/Mcalls. The by-const-reference time is ZERO. I mean it: no matter how many iterations you run, the elapsed time in this simple test will be zero. Let's look at the assembly to check that we are not mistaken somewhere. This is the optimized by-value version:

L25:
    call    _clock
    movl    %eax, 36(%esp)
    fildl   36(%esp)
    movl    $10000000, 36(%esp)
    fdivs   LC0
    fstpl   24(%esp)
    .p2align 4,,10
L24:
    movl    32(%esp), %eax
    lock addl   $1, (%eax)      // this is our inlined IncRefCount()...
    movl    40(%esp), %ecx
    addl    $1, 8(%ecx)         // bodies of ProcessByValue() and Do() - 2 instructions
    lock subl   $1, (%eax)      // .. and this is DecRefCount(). Quite impressive.
    jne L23
    movl    (%ecx), %eax
    call    *4(%eax)
L23:
    subl    $1, 36(%esp)
    jne L24
    call    _clock

OK, but why is the by-const-reference version so much faster that we cannot even measure it? Here it is:

call    _clock
movl    %eax, 36(%esp)
movl    40(%esp), %eax
addl    $10000000, 8(%eax)      // here is the final result, no loops, no nothing
call    _clock
movl    %eax, 32(%esp)
movl    $20, 4(%esp)
fildl   32(%esp)
movl    $LC2, (%esp)
movl    $1, 48(%esp)
flds    LC0
fdivr   %st, %st(1)
fildl   36(%esp)
fdivp   %st, %st(1)
fsubrp  %st, %st(1)
fstpl   8(%esp)
call    _printf

Just wow! The complete benchmark has collapsed into these few assembly lines. Without the atomic hassle, the optimizer kicks in and folds everything into a single precalculated value. Of course, this is a very trivial code sample. However, it clearly demonstrates two reasons why passing smart pointers by const reference is not a premature optimization but a serious performance improvement:

1) eliminating atomic operations is a large benefit in itself;
2) eliminating atomic operations lets the optimizer jump in and do its magic.

Here is the full source code.

Results with your compiler may vary 🙂

P.S. Herb Sutter has a very elaborate essay on this topic, covering the C++ side in great detail.