November 08, 2018

AES-NI SSL Performance


a study of AES-NI acceleration using LibreSSL, OpenSSL

The Advanced Encryption Standard New Instructions (AES-NI) are an extension to the x86 instruction set for microprocessors from Intel and AMD. The purpose of AES-NI is to speed up applications that encrypt and decrypt data using the Advanced Encryption Standard (AES), such as the AES-128 and AES-256 ciphers. AES-NI was designed to provide 4x to 8x speed improvements when using AES ciphers for bulk data encryption and decryption.

AES accelerated CPUs can increase efficiency and performance when setting up an SSL terminator for your HTTP web cluster, a VPN link or an sshfs file system mount, or when moving bulk data over an SSH connection using scp or rsync.

The following table lists the results of a quick study of various ciphers used on desktop, laptop and mobile devices. The benchmarks focus on the ciphers available to TLS v1.2 and TLS v1.3 connections made by HTTP/2 and HTTPS clients. The ChaCha20 cipher is used as our baseline. ChaCha20 is a 256 bit stream cipher which is not AES accelerated and relies on raw CPU processing power. The other ciphers are 128 bit and 256 bit AES ciphers which are accelerated by the CPU through AES-NI when AES-NI is enabled in the BIOS. LibreSSL (OpenSSL) is used to test all ciphers on the various CPUs we have access to. All numbers are in megabytes per second (MB/s) per single CPU core. Higher values are better.

Cipher Performance per CPU core

                    AES Performance per CPU core for TLS v1.2 Ciphers
                   (Higher is Better, Speeds in Megabytes per Second)

                   ChaCha20  AES-128-GCM  AES-256-GCM  AES-128-CBC  AES-256-CBC  Total Score

AMD Ryzen 7 1800X   573       3006         2642         1513         1101        = 8835
Intel W-2125        565       2808         2426         1698         1235        = 8732
Intel i7-6700       585       2607         2251         1561         1131        = 8135
Intel i5-6500       410       1729         1520         1078          783        = 5520
Intel i7-4750HQ     369       1556         1353          688          499        = 4465
AMD FX 8350         367       1453         1278          716          514        = 4328
AMD FX 8150         347       1441         1273          716          515        = 4292
Intel E5-2650 v4    404       1479         1286          652          468        = 4289
Intel i7-2700K      382       1353         1212          763          552        = 4262
Intel i7-3840QM     373       1279         1143          725          520        = 4040
Intel i5-2500K      358       1274         1140          728          522        = 4022
AMD FX 6100         326       1344         1186          671          481        = 4008
AMD A10-7850K       321       1303         1176          685          499        = 3984
AMD A8-7600 Kaveri  306       1246         1108          648          470        = 3778
Intel E5-2640 v3    303       1286         1126          585          419        = 3719
AMD Opteron 6380    293       1203         1063          589          423        = 3571
AMD Opteron 6378    282       1138          986          561          406        = 3373
AMD Opteron 6274    232       1054          926          524          376        = 3112
Intel Xeon E5-2630  247        962          864          541          394        = 3008
Intel Xeon E5645    262        817          717          727          524        = 3047
Intel i7-2635QM     151        989          881          564          404        = 2989
Intel Xeon L5630    225        701          610          626          450        = 2612
Intel E5-2603 v4    236        866          754          382          274        = 2512
AMD Opteron 2382    249        651          485          215          150        = 1750
Intel i7-950        401        256          218          358          257        = 1490
AMD Phenom 965      404         84           63          282          198        = 1031
Intel Core2 Q9300   231        126          133          221          161        =  872
AMD X4 610e         225         59           44          198          139        =  665
Intel Core2 Q6600   173        141           79          108           77        =  578
Intel P4 3Ghz Will  109         26           23           55           43        =  256
Intel ATOM D525      98         51           43           28           20        =  240
Snapdragon S4 Pro   131         41            -            -            -        =  172
ARM Cortex A9        73         24            -            -            -        =   97

Testing Notes: 
  LibreSSL 2.5.0 ( ~ OpenSSL 1.0.2d) 
  FreeBSD 11 ; Clang LLVM compiler
  AES-NI acceleration enabled if allowed by the CPU
  Speeds in megabytes per second (MB/s) per real cpu core
  8192 byte blocks
  Five(5) test runs, the average speed reported
  Snapdragon and ARM Cortex values reported by Google Developers
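
The article does not say how the "Total Score" column is derived, but every row in the table adds up as the plain sum of the five cipher columns. A minimal sketch of that check, assuming the sum interpretation is correct; the function name total_score is our own and the "-" entries are treated as missing values:

```shell
#!/bin/sh
# Sum the per-cipher speeds (MB/s) of one table row.
# A "-" argument marks a cipher with no reported value and is skipped.
total_score() {
    sum=0
    for v in "$@"; do
        case $v in
            -) continue ;;
        esac
        sum=$((sum + v))
    done
    echo "$sum"
}

total_score 573 3006 2642 1513 1101   # AMD Ryzen 7 1800X  -> 8835
total_score 131 41 - - -              # Snapdragon S4 Pro  -> 172
```

Because the score is a simple sum, the GCM columns dominate it on CPUs with AES-NI, so compare individual columns when a single cipher matters to you.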

How do I interpret the results ?

Theoretically, let us say we have a project with a 10 gigabit connection to the internet. 10 gigabits per second is 1,250 megabytes per second. The web page designers are expecting the web server to concurrently encrypt and decrypt data to saturate the 10 gigabit connection. Let us also say 100% of our clients are going to be using the AES-128-GCM based cipher just to make it easier to compare numbers from the table above.

Ideally we would want a CPU which could process 1,250 MB/s of AES encrypted data per CPU core. Since we need to receive (decrypt) and send (encrypt) the data, we need at least two(2) CPU cores, each able to sustain 1,250 MB/s. From the test results above, the "AMD Opteron 6380" comes close at 1,203 megabytes per second of AES-128-GCM data per CPU core, about 4% short of the target, and most of the CPUs listed above it meet or exceed 1,250 MB/s per core. Note that the AMD Opteron 6380 is a 16 core CPU, which leaves plenty of other CPU cores to do other work like network I/O, firewall rules or ZFS file system work.

In the real world the situation is more complicated. Clients connect with a variety of ciphers and the system is not dedicated to just cipher processing. It is also possible to add the cipher processing of multiple CPU cores together to reach the desired speed. The "Intel Xeon L5630" has four cores and each core could process 701 MB/s of AES data for a total of around 2,804 MB/s; just enough speed for encrypting and decrypting data on a 10 gigabit link using AES-128-GCM.
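
The arithmetic in the 10 gigabit example can be sketched as a small shell script. This is only a rough sizing aid: the function names link_mbs and cores_needed are our own, and the script assumes cipher throughput adds up linearly across cores, which is optimistic on a busy server.

```shell
#!/bin/sh
# Convert a link speed in gigabits per second to megabytes per second
# (decimal units, 8 bits per byte).
link_mbs() {
    awk -v gbit="$1" 'BEGIN { printf "%d\n", gbit * 1000 / 8 }'
}

# Cores needed to encrypt AND decrypt at line rate.
# $1 = link speed in gigabits/s, $2 = per-core cipher speed in MB/s
cores_needed() {
    awk -v gbit="$1" -v percore="$2" 'BEGIN {
        need = 2 * gbit * 1000 / 8       # send plus receive
        n = int(need / percore)
        if (n * percore < need) n++      # round up to whole cores
        print n
    }'
}

link_mbs 10            # -> 1250
cores_needed 10 701    # Intel Xeon L5630, AES-128-GCM -> 4
```

Substitute the per-core number of your own CPU and cipher from the table to get a first estimate, then leave headroom for everything else the machine has to do.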

Note that the per-core numbers apply to real CPU cores. Hyper threaded (HT) or virtual cores share the execution hardware of a real core, so do not count them as additional AES-NI throughput.

Check out our H2O and Nginx tutorials for tips on configuring a fast and secure web server or SSL terminator.

How can I test my own CPU ?

Using the following commands, download and build LibreSSL. The build process statically builds the LibreSSL binaries and libraries in the local directory. No files are installed to the system. Once the build is done, run each of the cipher speed tests with a 30 second sleep in between to make sure the load of the machine has returned to zero(0). When you are done testing, delete the build directory and everything is cleaned up.

cd /tmp
wget http://ftp.openbsd.org/pub/OpenBSD/LibreSSL/libressl-2.5.0.tar.gz
tar zxvf libressl-2.5.0.tar.gz
cd libressl-2.5.0
./configure && make && echo SUCCESS

./apps/openssl/openssl speed -elapsed -evp chacha
  sleep 30
./apps/openssl/openssl speed -elapsed -evp aes-128-gcm
  sleep 30
./apps/openssl/openssl speed -elapsed -evp aes-256-gcm
  sleep 30
./apps/openssl/openssl speed -elapsed -evp aes-128-cbc
  sleep 30
./apps/openssl/openssl speed -elapsed -evp aes-256-cbc

Cipher Speed Test Output Example

The LibreSSL (OpenSSL) cipher speed test will print out a few lines of output per test performed. The value we are interested in is on the last line under the label "8192 bytes". Our interests are focused on bulk data transfers and "8192 bytes" is the largest block test shown. The "8192 bytes" value is the amount of data the CPU can process using the cipher specified in thousands of bytes per second. Divide the value shown by one(1) thousand to get megabytes per second which is the same as our results in the table above.

# use dmesg and search for the cpu type. for example, 

$ dmesg  | grep CPU0
[    0.120426] smpboot: CPU0: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (fam: 06, model: 5e, stepping: 03)


# run the series of cipher speed tests, chacha is first...

$ ./apps/openssl/openssl speed -elapsed -evp chacha
You have chosen to measure elapsed time instead of user CPU time.
Doing chacha for 3s on 16 size blocks: 66892965 chacha's in 3.00s
Doing chacha for 3s on 64 size blocks: 25017290 chacha's in 3.00s
Doing chacha for 3s on 256 size blocks: 6502076 chacha's in 3.00s
Doing chacha for 3s on 1024 size blocks: 1692776 chacha's in 3.00s
Doing chacha for 3s on 8192 size blocks: 214511 chacha's in 3.00s

The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
chacha          356762.48k   533702.19k   554843.82k   577800.87k   585758.04k <----

... the result is 585758.04k / 1000 = 585 MB/s


$ ./apps/openssl/openssl speed -elapsed -evp aes-128-gcm
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-128-gcm for 3s on 16 size blocks: 134661060 aes-128-gcm's in 3.00s
Doing aes-128-gcm for 3s on 64 size blocks: 79432576 aes-128-gcm's in 3.00s
Doing aes-128-gcm for 3s on 256 size blocks: 28895019 aes-128-gcm's in 3.00s
Doing aes-128-gcm for 3s on 1024 size blocks: 7559486 aes-128-gcm's in 3.00s
Doing aes-128-gcm for 3s on 8192 size blocks: 954887 aes-128-gcm's in 3.00s

The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-gcm     718192.32k  1694561.62k  2465708.29k  2580304.55k  2607478.10k <----

... the result is 2607478.10k / 1000 = 2,607 MB/s


$ ./apps/openssl/openssl speed -elapsed -evp aes-256-gcm
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-256-gcm for 3s on 16 size blocks: 125601150 aes-256-gcm's in 3.00s
Doing aes-256-gcm for 3s on 64 size blocks: 75507034 aes-256-gcm's in 3.00s
Doing aes-256-gcm for 3s on 256 size blocks: 25591359 aes-256-gcm's in 3.00s
Doing aes-256-gcm for 3s on 1024 size blocks: 6547497 aes-256-gcm's in 3.00s
Doing aes-256-gcm for 3s on 8192 size blocks: 824454 aes-256-gcm's in 3.00s

The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-256-gcm     669872.80k  1610816.73k  2183795.97k  2234878.98k  2251309.06k <----

... the result is 2251309.06k / 1000 = 2,251 MB/s


$ ./apps/openssl/openssl speed -elapsed -evp aes-128-cbc
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-128-cbc for 3s on 16 size blocks: 250707357 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 64 size blocks: 71204109 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 256 size blocks: 18108237 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 1024 size blocks: 4563775 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 8192 size blocks: 571798 aes-128-cbc's in 3.00s

The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc    1337105.90k  1519020.99k  1545236.22k  1557768.53k  1561389.74k <----

... the result is 1561389.74k / 1000 = 1,561 MB/s


$ ./apps/openssl/openssl speed -elapsed -evp aes-256-cbc
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-256-cbc for 3s on 16 size blocks: 185732038 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 64 size blocks: 51745988 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 256 size blocks: 13073843 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 1024 size blocks: 3280738 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 8192 size blocks: 414517 aes-256-cbc's in 3.00s

The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-256-cbc     990570.87k  1103914.41k  1115634.60k  1119825.24k  1131907.75k <----

... the result is 1131907.75k / 1000 = 1,131 MB/s
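
The divide-by-1000 step shown above can be scripted. The following is a minimal sketch, not part of LibreSSL or OpenSSL; the function name speed_mbs is our own. It finds the "8192" column by its position in the "type" header line instead of taking the last column, since newer OpenSSL versions print an extra "16384 bytes" column.

```shell
#!/bin/sh
# Read "openssl speed" output on stdin and print the 8192 byte result
# in whole MB/s (truncated, like the hand calculations above).
speed_mbs() {
    awk '
        $1 == "type" {
            # Header fields alternate "<size> bytes", so the size found
            # at field i belongs to result column (i / 2) + 1.
            for (i = 2; i <= NF; i++)
                if ($i == "8192") col = (i / 2) + 1
            next
        }
        col {
            v = $col
            sub(/k$/, "", v)             # strip the trailing "k"
            printf "%d\n", v / 1000
            exit
        }
    '
}

# Example using the chacha result lines printed earlier; prints 585
speed_mbs <<'EOF'
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
chacha          356762.48k   533702.19k   554843.82k   577800.87k   585758.04k
EOF
```

To use it on a live test, pipe the speed command into the function, for example: ./apps/openssl/openssl speed -elapsed -evp aes-128-gcm | speed_mbs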

Questions?

Is OpenSSL faster than LibreSSL ?

Yes, both OpenSSL and BoringSSL are significantly faster than LibreSSL when using modern ciphers. LibreSSL is probably slower due to more locking, no internal crypto devices and single threaded processes, all with the idea of being more secure. The following listing shows a performance comparison using the elapsed speed tests built into both OpenSSL and LibreSSL. The server has a moderately powerful CPU with AES-NI enabled in the BIOS. The machine is set up with an Intel i5-6500 CPU, FreeBSD 11, and LibreSSL v2.5.0 and OpenSSL v1.1.0 built from source. The results show that OpenSSL is between 2.3 and 6.7 times faster than LibreSSL using ChaCha20 as well as AES-128-GCM and AES-256-GCM. This performance difference is great enough that you would need multiple HTTPS servers running Nginx built with LibreSSL to equal the speed of one(1) Nginx server built with OpenSSL.

Tip: take a look at the Nginx server resource sizing guide for deploying Nginx on bare metal servers and the Nginx testing methodology. The guide shows graduated hardware configurations and how many requests per second, transactions per second and total throughput an https server could achieve.

               AES Performance per CPU core for TLS v1.2 Ciphers
               (Higher is Better, Speeds in Megabytes per Second)

              ChaCha20  AES-128-GCM  AES-256-GCM  AES-128-CBC  AES-256-CBC  Total Score
Intel i5-6500  2762       4900         3554         1067          780       = 13063  OpenSSL   v1.1.0
               1760       4455         3370          460          402       = 10447  BoringSSL v2017_12
                410       1729         1520         1078          783       =  5520  LibreSSL  v2.5.0
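
The speedup range quoted for OpenSSL over LibreSSL can be reproduced from the table rows above. A minimal sketch; the helper name ratio is our own:

```shell
#!/bin/sh
# Print how many times faster the first value is than the second,
# rounded to one decimal place.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

# OpenSSL v1.1.0 versus LibreSSL v2.5.0 on the Intel i5-6500 (MB/s)
ratio 2762 410     # ChaCha20     -> 6.7
ratio 4900 1729    # AES-128-GCM  -> 2.8
ratio 3554 1520    # AES-256-GCM  -> 2.3
ratio 1067 1078    # AES-128-CBC  -> 1.0
```

The last line shows that the gain is limited to ChaCha20 and the GCM ciphers; for AES-128-CBC the two libraries are effectively tied on this CPU.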

###
############### Testing Results ##################
###

dmesg  | grep -i CPU
 CPU: Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz (3192.14-MHz K8-class CPU)


cd /tmp
wget http://ftp.openbsd.org/pub/OpenBSD/LibreSSL/libressl-2.5.0.tar.gz
tar zxvf libressl-2.5.0.tar.gz
cd libressl-2.5.0
./configure && make && echo SUCCESS

./apps/openssl/openssl speed -elapsed -evp chacha
 The 'numbers' are in 1000s of bytes per second processed.
 type            16 bytes     64 bytes     256 bytes    1024 bytes   8192 bytes
 chacha          229894.55k   374728.51k   401326.42k   407606.34k   410545.95k
                                                                     ^^^
./apps/openssl/openssl speed -elapsed -evp aes-128-gcm
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes     256 bytes    1024 bytes   8192 bytes
aes-128-gcm      578578.66k   1037298.77k  1496023.55k  1667607.21k  1729668.50k
                                                                     ^^^^
./apps/openssl/openssl speed -elapsed -evp aes-256-gcm
 The 'numbers' are in 1000s of bytes per second processed.
 type            16 bytes     64 bytes     256 bytes    1024 bytes   8192 bytes
 aes-256-gcm     514792.29k   953548.57k   1340996.10k  1478150.01k  1520833.77k
                                                                     ^^^^
./apps/openssl/openssl speed -elapsed -evp aes-128-cbc
 The 'numbers' are in 1000s of bytes per second processed.
 type            16 bytes     64 bytes     256 bytes    1024 bytes   8192 bytes
 aes-128-cbc     1070909.28k  1059120.83k  1084207.69k  1090894.01k  1078315.69k
                                                                     ^^^^
./apps/openssl/openssl speed -elapsed -evp aes-256-cbc
 The 'numbers' are in 1000s of bytes per second processed.
 type            16 bytes     64 bytes     256 bytes    1024 bytes   8192 bytes
 aes-256-cbc     806110.46k   767273.81k   793146.46k   803538.08k   783499.41k
                                                                     ^^^

cd /tmp
wget https://www.openssl.org/source/openssl-1.1.0e.tar.gz
tar zxvf openssl-1.1.0e.tar.gz
cd openssl-1.1.0e
./config && make
cp /tmp/openssl-1.1.0e/libssl.so.1.1 /usr/local/lib/
cp /tmp/openssl-1.1.0e/libcrypto.so.1.1 /usr/local/lib/

./apps/openssl speed -elapsed -evp chacha20
 The 'numbers' are in 1000s of bytes per second processed.
 type            16 bytes     64 bytes     256 bytes    1024 bytes   8192 bytes  16384 bytes
 chacha20        320078.35k   547365.25k   1287720.93k  2649847.21k  2762595.49k  2769084.88k
                                                                     ^^^^
./apps/openssl speed -elapsed -evp aes-128-gcm
 The 'numbers' are in 1000s of bytes per second processed.
 type            16 bytes     64 bytes     256 bytes    1024 bytes   8192 bytes  16384 bytes
 aes-128-gcm     453159.25k   1215246.40k  2437021.95k  3909602.78k  4900248.28k  4996923.22k
                                                                     ^^^^
./apps/openssl speed -elapsed -evp aes-256-gcm
 The 'numbers' are in 1000s of bytes per second processed.
 type            16 bytes     64 bytes     256 bytes    1024 bytes   8192 bytes  16384 bytes
 aes-256-gcm     397133.57k   1118061.03k  2050411.88k  3017616.18k  3554319.58k  3603072.56k
                                                                     ^^^^
./apps/openssl speed -elapsed -evp aes-128-cbc
 The 'numbers' are in 1000s of bytes per second processed.
 type            16 bytes     64 bytes     256 bytes    1024 bytes   8192 bytes  16384 bytes
 aes-128-cbc     812677.93k   1037389.63k  1066182.04k  1068901.72k  1067816.15k  1074969.69k
                                                                     ^^^^
./apps/openssl speed -elapsed -evp aes-256-cbc
 The 'numbers' are in 1000s of bytes per second processed.
 type            16 bytes     64 bytes     256 bytes    1024 bytes   8192 bytes  16384 bytes
 aes-256-cbc     720262.90k   757488.79k   775043.00k   776824.49k   780029.74k   792199.17k


git clone https://boringssl.googlesource.com/boringssl
cd boringssl
mkdir build && cd build
cmake -GNinja -DCMAKE_BUILD_TYPE=Release .. && ninja
cd tool
./bssl speed
...
Did 544000 AES-128-GCM (8192 bytes) seal operations in 1000170us (543907.5 ops/sec): 4455.7 MB/s
Did 412000 AES-256-GCM (8192 bytes) seal operations in 1001476us (411392.8 ops/sec): 3370.1 MB/s
Did 215000 ChaCha20-Poly1305 (8192 bytes) seal operations in 1000321us (214931.0 ops/sec): 1760.7 MB/s
...
Did 57000 AES-128-CBC-SHA1 (8192 bytes) seal operations in 1014216us (56201.0 ops/sec): 460.4 MB/s
Did 50000 AES-256-CBC-SHA1 (8192 bytes) seal operations in 1018187us (49106.9 ops/sec): 402.3 MB/s


How can I test OpenSSL with AES-NI on and off from the command line ?

Using the "OPENSSL_ia32cap" environment variable you can force OpenSSL to disable AES-NI acceleration. The following two tests show the results with AES-NI turned off and then back on. Notice that without AES-NI, the aes-128-gcm cipher processed data at 212 MB/s. With AES-NI enabled the same aes-128-gcm cipher speed jumped to 1,357 MB/s, a six(6) times performance boost.

# cpu example type: AMD FX 6100

$ dmesg  | grep -i cpu
[    0.277326] smpboot: CPU0: AMD FX(tm)-6100 Six-Core Processor (fam: 15, model: 01, stepping: 02)


OpenSSL AES-NI = OFF 

$ OPENSSL_ia32cap="~0x200000200000000" openssl speed -elapsed -evp aes-128-gcm
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-128-gcm for 3s on 16 size blocks: 11810234 aes-128-gcm's in 3.00s
Doing aes-128-gcm for 3s on 64 size blocks: 3458208 aes-128-gcm's in 3.00s
Doing aes-128-gcm for 3s on 256 size blocks: 2269863 aes-128-gcm's in 3.00s
Doing aes-128-gcm for 3s on 1024 size blocks: 612727 aes-128-gcm's in 3.00s
Doing aes-128-gcm for 3s on 8192 size blocks: 77820 aes-128-gcm's in 3.00s

The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-gcm      62987.91k    73775.10k   193694.98k   209144.15k   212500.48k

... the result is 212500.48k / 1000 = 212 MB/s


OpenSSL AES-NI = ON

$ openssl speed -elapsed -evp aes-128-gcm
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-128-gcm for 3s on 16 size blocks: 47814322 aes-128-gcm's in 3.00s
Doing aes-128-gcm for 3s on 64 size blocks: 32192031 aes-128-gcm's in 3.00s
Doing aes-128-gcm for 3s on 256 size blocks: 13198683 aes-128-gcm's in 3.00s
Doing aes-128-gcm for 3s on 1024 size blocks: 3757898 aes-128-gcm's in 3.00s
Doing aes-128-gcm for 3s on 8192 size blocks: 497117 aes-128-gcm's in 3.00s

The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-gcm     255009.72k   686763.33k  1126287.62k  1282695.85k  1357460.82k

... the result is 1357460.82k / 1000 = 1,357 MB/s
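
The mask value used in the AES-NI = OFF test is not arbitrary. In OpenSSL's OPENSSL_ia32cap bit numbering, bit 57 is the AES-NI capability and bit 33 is PCLMULQDQ, the carry-less multiply instruction used to accelerate GCM; the leading "~" clears both bits. A short sketch that rebuilds the mask from those two bit numbers; the function name aesni_mask is our own:

```shell
#!/bin/sh
# Build the capability mask from bit 57 (AES-NI) and bit 33 (PCLMULQDQ).
aesni_mask() {
    printf '0x%x\n' $(( (1 << 57) | (1 << 33) ))
}

aesni_mask    # -> 0x200000200000000
```

The result can be passed straight to the test, for example: OPENSSL_ia32cap="~$(aesni_mask)" openssl speed -elapsed -evp aes-128-gcm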

How can I test a remote server cipher ?

Use the openssl s_client tool and query a remote server. You can let the client and server choose the most preferred cipher or you can specify the exact cipher name you want to use during the connection.


# Test calomel.org using the client/server negotiated cipher
echo -n | ./apps/openssl/openssl s_client -connect calomel.org:443

# Test calomel.org using the ChaCha cipher
echo -n | ./apps/openssl/openssl s_client -cipher ECDHE-ECDSA-CHACHA20-POLY1305 -connect calomel.org:443

# Test calomel.org using the AES-128-GCM cipher
echo -n | ./apps/openssl/openssl s_client -cipher ECDHE-ECDSA-AES128-GCM-SHA256 -connect calomel.org:443
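
s_client prints a lot of output per connection. To see only the negotiated cipher, the output can be filtered. This is a sketch that assumes the summary line has the form "New, <protocol>, Cipher is <name>", which is what the s_client versions we have seen print; the function name negotiated_cipher is our own and the sample line below is illustrative, not captured from a real connection.

```shell
#!/bin/sh
# Print the cipher name from the "Cipher is" summary line of
# "openssl s_client" output read on stdin.
negotiated_cipher() {
    awk '/Cipher is/ { print $NF; exit }'
}

# Illustrative input line; prints ECDHE-ECDSA-AES128-GCM-SHA256
echo "New, TLSv1.2, Cipher is ECDHE-ECDSA-AES128-GCM-SHA256" | negotiated_cipher
```

Against a live server the call would look like: echo -n | ./apps/openssl/openssl s_client -connect calomel.org:443 2>/dev/null | negotiated_cipher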

