TLS handshake: fast and secure


Introduction

TLS (Transport Layer Security) is a cryptographic protocol used to encrypt data transfer between communication partners. It is located in the application layer of the Internet Protocol Suite (IP Suite) or in the presentation layer of the OSI model. It is typically used as a wrapper for other (insecure) protocols like HTTP or SMTP. Besides encryption, TLS can be used to authenticate both communication partners. There are several versions of TLS, with versions below 1.2 being deprecated. The current versions are 1.2[1] and 1.3[2]; version 1.3 dropped support for insecure hash functions like MD5 and SHA-224. An unconfigured v1.3 connection typically defaults to the cipher suite TLS_AES_256_GCM_SHA384.

Connection Handshake

[Figure: TLS v1.2 handshake]
[Figure: TLS v1.3 handshake]

A typical v1.2 connection without client authentication is established as follows:

  1. Client Hello: Client initiates the connection to the server, transmitting its list of supported cipher suites and possibly requesting server certificate status information (OCSP stapling via the status_request extension)
  2. Server Hello: Server replies with the cipher suite it has chosen from that list
  3. Server-Certificate-Exchange: Server sends its certificate along with the certificate chain (and a valid OCSP response if OCSP stapling is configured)
  4. Certificate check: Client verifies the validity of the server certificate
  5. Key exchange: Client and server establish the shared secret for the session using one of the following methods:
    - random data generated by the client and encrypted with the public key of the server
    - a shared secret negotiated via Diffie-Hellman key exchange
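
The Diffie-Hellman variant of the key exchange can be illustrated with a toy example. This is only a sketch under loud assumptions: the prime and generator below are chosen for readability and are far too weak for real use, the function names are hypothetical, and real TLS stacks use standardized groups or elliptic curves. It shows that both sides arrive at the same secret although only public values cross the wire.

```python
import secrets

# Toy parameters for illustration only (assumption, not TLS-grade):
# 2**127 - 1 is prime, but much too small for real-world security.
P = 2**127 - 1
G = 3


def dh_keypair():
    """Return (private, public) for one side of the exchange."""
    private = secrets.randbelow(P - 3) + 2  # random value in [2, P - 2]
    public = pow(G, private, P)             # only this value is transmitted
    return private, public


def dh_shared_secret(own_private, peer_public):
    """Combine the own private value with the peer's public value."""
    return pow(peer_public, own_private, P)


client_priv, client_pub = dh_keypair()
server_priv, server_pub = dh_keypair()

client_secret = dh_shared_secret(client_priv, server_pub)
server_secret = dh_shared_secret(server_priv, client_pub)
print(client_secret == server_secret)
```

In a real handshake, the shared secret is not used directly but fed into a key derivation function to obtain the session keys.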

In v1.3 this was revised to speed up the handshake. It now proceeds as follows:

  1. Client Hello: Client initiates the connection to the server and sends its list of supported cipher suites together with its key share
  2. Server Hello and Change Cipher Spec:
    - If the server knows and supports one of the cipher suites, it sends its own key share, its certificate, the certificate chain, and possibly an OCSP response
    - Server signals that it has started encrypted messaging
  3. Client Change Cipher Spec: Client responds that it has also started encrypted messaging

Unlike version 1.2, version 1.3 no longer allows a peer to request renegotiation to an unsafe lower version. Instead it refuses to complete the handshake, so no connection is established.

Checking the server or client certificate is implementation dependent but strongly advised, so as not to establish connections with a malicious actor using a compromised certificate. This implies not only building the issuer chain up to a trusted point, but also checking the revocation status of the certificate in question, either by OCSP[3] (Online Certificate Status Protocol), which is the preferred option, or by CRL[4] (Certificate Revocation List). Both methods are susceptible to DoS attacks. For this reason, Chromium-based web browsers, for example, use neither method when connecting to HTTPS addresses, but rather rely on built-in CA white-/blacklists.
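
As an illustration of these recommendations on the client side, the following sketch uses Python's standard ssl module to build a context that refuses versions below TLS 1.2 and requires a verifiable server certificate. The function name make_client_context is hypothetical, and the sketch does not cover revocation checking (OCSP or CRL), which would need additional configuration.

```python
import ssl


def make_client_context():
    """Client context: certificate chain and host name are verified."""
    # create_default_context() loads the system trust store and enables
    # CERT_REQUIRED plus host name checking.
    ctx = ssl.create_default_context()
    # Refuse handshakes that would fall back below TLS 1.2.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx


ctx = make_client_context()
print(ctx.verify_mode == ssl.CERT_REQUIRED, ctx.check_hostname)
```

Such a context would then be passed to, for example, ctx.wrap_socket(sock, server_hostname=...) with the host name the client expects to reach.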

Performance considerations

Instead of the standard AES-256 algorithm, the ChaCha20 algorithm may be used. It is typically faster than a software implementation of AES-256. However, current CPUs have instructions to execute AES in hardware, which can be used by tools like openssl. The following openssl speed runs compare AES-256-GCM and ChaCha20-Poly1305 with and without these CPU instructions on an Intel Xeon Gold 6150. The environment variable OPENSSL_ia32cap[5] was used to disable the AES or AVX-512 CPU instructions for testing:

[1] $ OPENSSL_ia32cap="~0x200000200000000" openssl speed -elapsed -aead -evp aes-256-gcm
[2] $ openssl speed -elapsed -aead -evp aes-256-gcm
[3] $ OPENSSL_ia32cap=0x7ffaf3ffffebffff:0x27ab openssl speed -elapsed -aead -evp chacha20-poly1305
[4] $ openssl speed -elapsed -aead -evp chacha20-poly1305
Measured throughput (in 1000s of bytes per second) by block size
#  type                     2 bytes     31 bytes    136 bytes   1024 bytes   8192 bytes  16384 bytes
1  AES-256-GCM             3912.58k    43681.83k   119433.57k   220805.46k   240091.14k   241401.86k
2  AES-256-GCM            12352.14k   156771.30k   600536.56k  2364060.33k  3829205.67k  4008596.82k
3  ChaCha20-Poly1305       4561.57k    67435.59k   185287.58k   472737.11k   557932.54k   565346.30k
4  ChaCha20-Poly1305       5406.06k    79034.59k   256344.41k  1439373.99k  2491817.98k  2634612.74k
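
To put these numbers into perspective, the speed-up from the CPU instructions can be computed directly from the measured values. The sketch below only uses the 16384-byte column of the measurements above; the dictionary keys are the run numbers [1] to [4], and the helper name speedup is hypothetical.

```python
# Throughput for 16384-byte blocks in 1000s of bytes per second,
# keyed by the run number of the corresponding openssl speed command.
throughput_16k = {
    1: 241401.86,   # AES-256-GCM, OPENSSL_ia32cap mask applied
    2: 4008596.82,  # AES-256-GCM, default CPU capabilities
    3: 565346.30,   # ChaCha20-Poly1305, OPENSSL_ia32cap mask applied
    4: 2634612.74,  # ChaCha20-Poly1305, default CPU capabilities
}


def speedup(restricted_run, default_run):
    """Factor by which the default run is faster than the restricted run."""
    return throughput_16k[default_run] / throughput_16k[restricted_run]


aes_speedup = speedup(1, 2)
chacha_speedup = speedup(3, 4)
print(f"AES-256-GCM:       {aes_speedup:.2f}x")
print(f"ChaCha20-Poly1305: {chacha_speedup:.2f}x")
```

For large blocks this gives a factor of roughly 16.6 for AES-256-GCM and roughly 4.7 for ChaCha20-Poly1305 on this machine.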

Another factor that may lead to a performance gain on Unix-like operating systems is kernel TLS (kTLS)[6][7], wherein the TLS workflow is offloaded from the userspace program to kernelspace functions. The kernel in use has to support this. It is typically available since Linux kernel version 5.1 (AES-256-GCM) or 5.11 (CHACHA20-POLY1305), or since FreeBSD 13.0[8]. On Linux, the loaded kernel module can be queried via the following command:

$ lsmod|grep tls
tls                   151552  0

Furthermore, the application must also support kTLS, either built in or by using external libraries such as openssl. Actual usage by programs is tracked via the following counters. In this example, TlsTxSw shows that 91967 served connections have used the kTLS function for sending, while the TlsCurr* counters show that nothing is using it at the moment of querying.

$ cat /proc/net/tls_stat
TlsCurrTxSw                             0
TlsCurrRxSw                             0
TlsCurrTxDevice                         0
TlsCurrRxDevice                         0
TlsTxSw                                 91967
TlsRxSw                                 0
TlsTxDevice                             0
TlsRxDevice                             0
TlsDecryptError                         0
TlsRxDeviceResync                       0
TlsDecryptRetry                         0
TlsRxNoPadViolation                     0
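
For monitoring, these counters can be read programmatically. The following sketch parses text in the format shown above; the function names parse_tls_stat and ktls_in_use are hypothetical, and the sample text is an abbreviated copy of the output above rather than a live reading.

```python
def parse_tls_stat(text):
    """Parse 'name value' lines as found in /proc/net/tls_stat."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            counters[parts[0]] = int(parts[1])
    return counters


def ktls_in_use(counters):
    """Return (ever_used, currently_used) based on the session counters."""
    total = ("TlsTxSw", "TlsRxSw", "TlsTxDevice", "TlsRxDevice")
    current = ("TlsCurrTxSw", "TlsCurrRxSw", "TlsCurrTxDevice", "TlsCurrRxDevice")
    ever_used = any(counters.get(name, 0) > 0 for name in total)
    currently_used = any(counters.get(name, 0) > 0 for name in current)
    return ever_used, currently_used


sample = """\
TlsCurrTxSw                             0
TlsCurrRxSw                             0
TlsTxSw                                 91967
TlsRxSw                                 0
"""

counters = parse_tls_stat(sample)
print(ktls_in_use(counters))
```

On a Linux system with the tls module loaded, the same function could be fed with the content of /proc/net/tls_stat, for example via open("/proc/net/tls_stat").read().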


Since Linux kernel 4.13 it is possible to offload TLS to the kernel level (known as kTLS). This should lead to better performance because it saves context switches between the user-level TLS implementation and the network stack. We deployed kTLS on various machines but could not measure a significant performance improvement. As an example, the biggest difference was observed on a VPS (1 vCore, 1 GB RAM) set up with nginx and a 10 MB file, with 10 parallel downloads:

Measurement from VPS "edmund"
                     time      throughput
kTLS enabled (ø)     4.57 s    112.52 MBit/s
kTLS disabled (ø)    4.73 s    108.07 MBit/s

On this machine kTLS improved the throughput by only about 4%, which does not seem significant to us because it was a virtualized platform.

A more realistic test was run using httperf and a real access log from a production server. The first 10000 connections of the log were requested using 500 parallel connections (i.e. clients) on an Intel Xeon Gold 6150[9] with 96 GB RAM under openSUSE 15.6 and on a Raspberry Pi 4[10] with 8 GB RAM under CentOS Stream 9. Both systems used the nginx web server, since it could readily be configured with kTLS. The tests were performed three times and the median was chosen for the evaluation. Note that the graphs are log scaled. They clearly show that using ECC key material instead of RSA key material has far more influence on performance than using kTLS, especially on the Raspberry Pi 4.

AES

ChaCha20

The ChaCha20 stream cipher[11] applies a fixed number of rounds of ADD, XOR, and ROT (rotate) operations to 32-bit words arranged in a 4x4 matrix. Vector instructions of the CPU can therefore be leveraged to gain a performance boost; see the previous section for measured values.
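
The core building block of these rounds is the quarter round, which mixes four of the sixteen 32-bit words. The following sketch follows the quarter-round definition in RFC 8439 and checks it against the test vector given there. It is meant to show why the operation maps well onto vector instructions, not to serve as a usable cipher implementation.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) & MASK32) | (x >> (32 - n))


def quarter_round(a, b, c, d):
    """ChaCha20 quarter round: only additions, XORs and rotations."""
    a = (a + b) & MASK32
    d = rotl32(d ^ a, 16)
    c = (c + d) & MASK32
    b = rotl32(b ^ c, 12)
    a = (a + b) & MASK32
    d = rotl32(d ^ a, 8)
    c = (c + d) & MASK32
    b = rotl32(b ^ c, 7)
    return a, b, c, d


# Test vector from RFC 8439, section 2.1.1
result = quarter_round(0x11111111, 0x01020304, 0x9B8D6F43, 0x01234567)
print([hex(word) for word in result])
# → ['0xea2a92f4', '0xcb1cf8ce', '0x4581472e', '0x5881c4bb']
```

Because four quarter rounds of one round operate on disjoint words, they can be computed in parallel, which is what SSE, AVX2, and AVX-512 implementations exploit.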

Top-10 instructions of chacha20-poly1305 of 2 bytes for 1 s, by CPU model

Pentium 4 Prescott (591 * 10^6 instr)
Instruction                                  Share
LEA_GPRv_AGEN                                0.03
POP_GPRv_58                                  0.0307
TEST_GPRv_GPRv                               0.0319
PUSH_GPRv_50                                 0.0365
MOV_MEMv_GPRv                                0.0489
MOV_GPRv_MEMv                                0.0629
MOV_GPRv_GPRv_89                             0.0677
ROL_GPRv_IMMb                                0.124
ADD_GPRv_GPRv_01                             0.1295
XOR_GPRv_GPRv_31                             0.1345

Haswell (467 * 10^6 instr)
Instruction                                  Share
PSHUFD_XMMdq_XMMdq_IMMb                      0.0317
LEA_GPRv_AGEN                                0.0363
MOV_MEMv_GPRv                                0.0366
POP_GPRv_58                                  0.0417
PADDD_XMMdq_XMMdq                            0.0423
TEST_GPRv_GPRv                               0.0434
PXOR_XMMdq_XMMdq                             0.0452
PUSH_GPRv_50                                 0.0466
MOV_GPRv_MEMv                                0.0568
MOV_GPRv_GPRv_89                             0.091

Skylake-X (409 * 10^6 instr)
Instruction                                  Share
LEA_GPRv_AGEN                                0.0395
MOV_MEMv_GPRv                                0.0398
POP_GPRv_58                                  0.0455
VPADDD_YMMqq_YMMqq_YMMqq                     0.046
VPROLD_YMMu32_MASKmskw_YMMu32_IMM8_AVX512    0.046
VPXOR_YMMqq_YMMqq_YMMqq                      0.046
TEST_GPRv_GPRv                               0.0473
PUSH_GPRv_50                                 0.0507
MOV_GPRv_MEMv                                0.0619
MOV_GPRv_GPRv_89                             0.0992

How to: fast and safe TLS handshake

  • OCSP is no longer supported by Let's Encrypt due to privacy concerns and extensive resource requirements
  • Chromium derivatives do not use OCSP/CRL by default anyway
  • Only use TLSv1.2 and newer with the ciphers recommended by the BSI [12][13]
  • Use elliptic-curve keys
  • If possible, use SSL offloading to hardware
  • kTLS does not seem to make a measurable improvement

For those who are interested

Enable kTLS
  1. Check whether kTLS is supported on your instance
    - $ lsmod | grep tls
  2. Check whether your web server was built with kTLS support
    - In most cases, you can find this in the build parameters.
  3. Enable kTLS in your web server
    - using nginx:
      - Add ssl_conf_command Options KTLS; to the server section.
      - Verify it by enabling debug output.
      - You should see lines containing BIO_get_ktls_send and SSL_sendfile.
      - Alternatively, verify by looking into /proc/net/tls_stat.
    - using apache2:
      - kTLS is not implemented, possibly due to its low impact, but
      - you can add EnableSendfile On to your config. (see the doc)
    - using lighttpd:
      - Add ssl.openssl.ssl-conf-cmd += ("Options" => "-KTLS")[14] (note: in OpenSSL's configuration command syntax a leading "-" disables a flag, so "-KTLS" switches kTLS off, while "KTLS" switches it on)
    - In general, you should be able to verify the changes using a benchmarking tool like ab or httperf.
    - Check /proc/net/tls_stat at server runtime.
  4. Use ECC key material, e.g. using the NIST P-256 curve
  5. Use hardware with CPU instructions for ciphers, e.g. AESENC/AESDEC on x86 CPUs, general vector instructions for ChaCha20, or bespoke crypto hardware[15].