How much do GCC security features cost?

GCC has a number of useful features to mitigate possible bugs in the code, and a whole new batch was added in version 4.9 with the arrival of the Sanitizers. How feasible is their usage in production? Some of these flags have been known for a long time and are already used in Debian builds, for example. Specifically:

-fstack-protector-all -D_FORTIFY_SOURCE=2 -Wl,-z,relro -Wl,-z,now -fPIE -pie

As a result, in Ubuntu 14, standard system binaries are protected by default (output of the checksec script):

$ checksec --file $(which sshd)
RELRO           STACK CANARY      NX            PIE             RPATH      RUNPATH      FILE
Full RELRO      Canary found      NX enabled    PIE enabled     No RPATH   No RUNPATH   /usr/sbin/sshd
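
checksec inspects a finished binary. A complementary check can be made from inside the code, because GCC predefines a macro for several of these options. The sketch below is only an illustration: the enum and function names are made up for this example, while the macros (__SSP__, __SSP_ALL__, __SSP_STRONG__, __PIE__, __pie__, __SANITIZE_ADDRESS__) are the ones GCC sets and _FORTIFY_SOURCE is the one passed with -D. The linker options (-Wl,-z,relro and -Wl,-z,now) are invisible to the preprocessor, so for those checksec remains the tool to use.

```c
#include <stdio.h>

/* Bit flags describing how this translation unit was compiled.
 * The names are hypothetical; only the tested macros come from GCC. */
enum hardening_flag {
    HARDEN_SSP     = 1 << 0,  /* any -fstack-protector variant */
    HARDEN_SSP_ALL = 1 << 1,  /* -fstack-protector-all */
    HARDEN_FORTIFY = 1 << 2,  /* -D_FORTIFY_SOURCE=1 or higher */
    HARDEN_PIE     = 1 << 3,  /* -fPIE or -fpie */
    HARDEN_ASAN    = 1 << 4   /* -fsanitize=address */
};

/* Returns a bit mask of the hardening options seen at compile time. */
unsigned hardening_mask(void) {
    unsigned mask = 0;
#if defined(__SSP__) || defined(__SSP_ALL__) || defined(__SSP_STRONG__)
    mask |= HARDEN_SSP;
#endif
#if defined(__SSP_ALL__)
    mask |= HARDEN_SSP_ALL;
#endif
    /* Note: glibc only acts on _FORTIFY_SOURCE when optimisation is on. */
#if defined(_FORTIFY_SOURCE) && _FORTIFY_SOURCE > 0
    mask |= HARDEN_FORTIFY;
#endif
#if defined(__PIE__) || defined(__pie__)
    mask |= HARDEN_PIE;
#endif
#if defined(__SANITIZE_ADDRESS__)
    mask |= HARDEN_ASAN;
#endif
    return mask;
}

/* Prints one line per option, for example at program start-up. */
void print_hardening(void) {
    unsigned m = hardening_mask();
    printf("stack protector:     %s\n", (m & HARDEN_SSP) ? "yes" : "no");
    printf("stack protector all: %s\n", (m & HARDEN_SSP_ALL) ? "yes" : "no");
    printf("_FORTIFY_SOURCE:     %s\n", (m & HARDEN_FORTIFY) ? "yes" : "no");
    printf("PIE:                 %s\n", (m & HARDEN_PIE) ? "yes" : "no");
    printf("AddressSanitizer:    %s\n", (m & HARDEN_ASAN) ? "yes" : "no");
}
```

Calling print_hardening() from main() in a binary built with the flags listed above should report "yes" for the stack protector and PIE lines; the exact result depends on the compiler and on the flags actually passed.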

GCC 4.9 introduced a whole new range of advanced bounds checking mechanisms that were ported from an earlier, separate project (address-sanitizer). They are all listed in the GCC documentation and are enabled with the -fsanitize flag. There are quite a lot of them and some are mutually exclusive, but perhaps the most interesting are these:

-fsanitize=address

Detector of out-of-bounds accesses and use-after-free bugs (reads as well as writes).

-fsanitize=undefined

Undefined behaviour detector. "Undefined" here means a whole range of situations where program execution becomes unpredictable as a result of a programming bug.

Some of these bugs are demonstrated in this program:


// test2.c
// gcc -std=c99 -ggdb test2.c
#include <stdio.h>
#include <limits.h>
#include <stdlib.h>

int main(void) {
        int a = 10;
        printf("a) %d << 20 = %d\n", a, a << 20);

        /* Crashes with SIGFPE
        printf("%x\n", a/0);
        */

        int b = INT_MIN;
        int c = -b;
        printf("c) %d = -%d\n", b, c);

        printf("cc) %d/-1=%d\n", b, b/-1);

        signed char d = SCHAR_MAX;
        printf("d) %x++ = %x\n", d, d++);

        char e[] = "test";
        for(int i =0; i < 10; i++) {
                e[i] = 'x';
        }

        void *f = NULL;
        printf("%s\n", (char *) f);

        char *g = malloc(10);
        free(g);
        g[0] = 'x';
}

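Built without any protections, this program can be expected to die with SIGSEGV in at least two places (e and f) and with SIGFPE in one (a/0), although strictly speaking nothing is guaranteed once the behaviour is undefined. A recent GCC can also catch part of it with stack smashing detection; note that this mechanism guards the stack, so it covers the overrun of e rather than the heap write-after-free (g). Otherwise most of these operations will be happily executed.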

Now try compiling the program with the following options enabled:


gcc -std=c99 -ggdb -fsanitize=address -fsanitize=leak -fsanitize=undefined -fsanitize=signed-integer-overflow -fsanitize=shift -fsanitize=integer-divide-by-zero -fsanitize=null   test2.c


And the result:


a) 10 << 20 = 10485760
test2.c:14:13: runtime error: negation of -2147483648 cannot be represented in type 'int'; cast to an unsigned type to negate this value to itself
c) -2147483648 = --2147483648
test2.c:17:38: runtime error: division of -2147483648 by -1 cannot be represented in type 'int'
test2.c:17:38: runtime error: negation of -2147483648 cannot be represented in type 'int'; cast to an unsigned type to negate this value to itself
cc) -2147483648/-1=-2147483648
d) ?++ = 7f
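
The two runtime errors above (negating INT_MIN and dividing INT_MIN by -1) can also be avoided in the code itself by testing the operands before the operation. The following is a minimal sketch of that idea, not part of the original test program; the function names checked_neg and checked_div are invented for this example.

```c
#include <limits.h>
#include <stdbool.h>

/* Stores -x in *out and returns true, or returns false when the
 * result does not fit in an int (x == INT_MIN on two's complement). */
bool checked_neg(int x, int *out) {
    if (x < -INT_MAX)
        return false;
    *out = -x;
    return true;
}

/* Stores num / den in *out and returns true, or returns false for the
 * two undefined cases: division by zero and INT_MIN / -1. */
bool checked_div(int num, int den, int *out) {
    if (den == 0)
        return false;
    if (num < -INT_MAX && den == -1)
        return false;
    *out = num / den;
    return true;
}
```

A caller would then write something like `if (!checked_neg(b, &c)) { /* handle the error */ }` instead of `c = -b`. The sanitizer is still useful as a safety net for the places where such checks were forgotten.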


The most verbose reaction is to the attempted buffer overrun (e):


=================================================================
==31329==ERROR: AddressSanitizer: stack-buffer-overflow on address 0xbf9af495 at pc 0x8048b2b bp 0xbf9af428 sp 0xbf9af41c
WRITE of size 1 at 0xbf9af495 thread T0
    #0 0x8048b2a in main /home/kravietz/GCC/test2.c:24
    #1 0xb6c39a82 in __libc_start_main (/lib/i386-linux-gnu/libc.so.6+0x19a82)
    #2 0x8048850 (/home/kravietz/GCC/a.out+0x8048850)

Address 0xbf9af495 is located in stack of thread T0 at offset 37 in frame
    #0 0x804893a in main /home/kravietz/GCC/test2.c:5

  This frame has 1 object(s):
    [32, 37) 'e' <== Memory access at offset 37 overflows this variable
HINT: this may be a false positive if your program uses some custom stack unwind mechanism or swapcontext
      (longjmp and C++ exceptions *are* supported)
SUMMARY: AddressSanitizer: stack-buffer-overflow /home/kravietz/GCC/test2.c:24 main
Shadow bytes around the buggy address:
  0x37f35e40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x37f35e50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x37f35e60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x37f35e70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x37f35e80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1
=>0x37f35e90: f1 f1[05]f4 f4 f4 00 00 00 00 00 00 00 00 00 00
  0x37f35ea0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x37f35eb0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x37f35ec0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x37f35ed0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x37f35ee0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Heap right redzone:      fb
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack partial redzone:   f4
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Contiguous container OOB:fc
  ASan internal:           fe
==31329==ABORTING
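
For comparison, here is a sketch of how the loop over e could be written so that it never leaves the array. It assumes the caller knows the buffer size (for an array, sizeof gives it); fill_bounded is a hypothetical helper written for this example, not a library function.

```c
#include <stddef.h>

/* Replaces the contents of buf with up to `count` copies of ch.
 * At most cap - 1 characters are written, followed by a terminating
 * NUL, so the write never goes past buf[cap - 1].
 * Returns the number of characters written. */
size_t fill_bounded(char *buf, size_t cap, char ch, size_t count) {
    if (buf == NULL || cap == 0)
        return 0;
    size_t n = count < cap - 1 ? count : cap - 1;
    for (size_t i = 0; i < n; i++)
        buf[i] = ch;
    buf[n] = '\0';
    return n;
}
```

With `char e[] = "test";` the call `fill_bounded(e, sizeof e, 'x', 10)` writes four characters and stops, where the loop in test2.c goes on to write at offset 37, just past the [32, 37) object reported above.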


Performance penalty

All this runtime bounds checking doesn't come for free, unfortunately. I compiled BoringSSL with three option sets and then ran bssl speed on each resulting binary.

The option sets were:


-ggdb (unoptimized)
-O4 -march=native (optimized)
-fsanitize=address -fsanitize=leak -fsanitize=undefined -fstack-protector-all -Wl,-z,relro -Wl,-z,now -fPIE -pie -D_FORTIFY_SOURCE=2 (hardened)

The results:


Enabling sanitizers and other GCC protections only slightly impacts asymmetric cryptography operations. For a 2048-bit RSA key the optimized version did 2994 ops/sec as compared to 2805 ops/sec for the protected one. And, well, the definition of "slightly" depends on your usage scenario: it's 6 to 7% (depending on which build you take as the baseline), so if you're struggling for every operation per second then it's quite a lot.
It was similar for AES-256-GCM encryption, where the optimized version did 29.7 MB/s and the protected one 28.0 MB/s.
There was a huge drop in throughput in the case of ChaCha20-Poly1305. The optimized version did 105.0 MB/s and the protected one only 60.1 MB/s.
On the other hand, the difference in the case of RC4 was negligible (a few percent in favour of the optimized version).
There was also an interesting difference in the throughput of SHA-1 operations depending on the hashed block size. When large (8 KB) blocks were hashed there was only a small difference between the optimized and protected versions (a 1% drop). On the other hand, for small (16-byte) blocks the drop was over 40%.
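
To make the comparison explicit, the relative drop can be computed from the figures quoted in the list. The helper below is just the arithmetic, taking the optimized build as the baseline; percent_drop and print_drops are names chosen for this sketch.

```c
#include <stdio.h>

/* Relative slowdown of the hardened build, in percent of the baseline. */
double percent_drop(double baseline, double hardened) {
    return (baseline - hardened) / baseline * 100.0;
}

/* Prints the drop for the three results that come with absolute numbers. */
void print_drops(void) {
    printf("RSA 2048:          %.1f%%\n", percent_drop(2994.0, 2805.0)); /* 6.3% */
    printf("AES-256-GCM:       %.1f%%\n", percent_drop(29.7, 28.0));     /* 5.7% */
    printf("ChaCha20-Poly1305: %.1f%%\n", percent_drop(105.0, 60.1));    /* 42.8% */
}
```

Measured the other way round (how much faster the optimized build is than the protected one), the RSA figure comes out at about 6.7%.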

The large drop for ChaCha20 and SHA-1 might be related either to the way the implementation manages buffers or to the way one of the undefined-behaviour sanitizers works. Apart from that, all this looks quite promising and might mean that we could actually run binaries with run-time bounds checking enabled in production.
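
If one hot loop turned out to be responsible for most of the overhead, one option would be to exempt just that function from instrumentation. I have not measured this, so treat it as a sketch only. GCC provides the no_sanitize_address function attribute for this purpose (and no_sanitize_undefined as its counterpart for the undefined-behaviour checks). The function below is a stand-in, a plain FNV-1a hash, and not code from BoringSSL; the NO_ASAN macro name is also made up.

```c
#include <stddef.h>
#include <stdint.h>

/* Expands to the GCC attribute that switches AddressSanitizer
 * instrumentation off for a single function. */
#if defined(__GNUC__)
#define NO_ASAN __attribute__((no_sanitize_address))
#else
#define NO_ASAN
#endif

/* Stand-in for a hot inner loop: 32-bit FNV-1a over a byte buffer.
 * Memory accesses in this function are not checked by ASan, so the
 * caller must guarantee that data really holds len bytes. */
NO_ASAN
uint32_t mix_bytes(const unsigned char *data, size_t len) {
    uint32_t h = 2166136261u;      /* FNV-1a offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];
        h *= 16777619u;            /* FNV-1a prime; unsigned wraparound is well defined */
    }
    return h;
}
```

The trade-off is obvious: the exempted function gets back its speed but loses exactly the protection this post is about, so it only makes sense for small, well-reviewed routines.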