What is faster: a local connection over a unix domain socket, or a local TCP/IP connection with TLS encryption? To a network expert, the answer seems intuitively obvious: the unix socket. But why? And by how much?
Moreover, online docs and tutorials for both PostgreSQL and CockroachDB recommend—or at least advertise—the use of TCP/IP connections when testing client apps locally during development. How much performance are they leaving on the table by not paying more attention to unix domain sockets? Does it even matter?
In this blog post, I set out to answer these questions experimentally.
Introduction
Both PostgreSQL and CockroachDB are SQL RDBMSs that can be deployed on a development computer to facilitate local experimentation and testing.
When I first started working with PostgreSQL, I discovered an intriguing discrepancy: the command-line client `psql` connects to a local server using a unix domain socket, whereas tutorials, documentation and UI tools use a TCP/IP connection to localhost with, usually, TLS encryption enabled by default (“SSL mode”). Could it be that this explains why I would observe different query latencies when running a query over `psql` compared to my own test programs?
Later, when I first started working with CockroachDB, I noticed that CockroachDB’s own `cockroach sql` command-line shell connects to its server using TCP/IP with TLS encryption enabled by default. Could this explain why I observed different query latencies compared to running the same queries in `psql`?
Much later, and somewhat recently, I taught CockroachDB to properly serve clients over unix domain sockets too. From then on, I advertised to my friends: “you should use the unix domain socket with CockroachDB, it is faster.” But I did not have data to back up my claim. Was it really true?
Then, much more recently, I helped align PostgreSQL’s and CockroachDB’s authentication configuration systems. This made me curious: which of the two databases is faster at authenticating (local) clients? Does it matter in practice?
Today, I spent some time running a few microbenchmarks to answer all these questions. It turns out that the intuition is right: unix sockets are generally faster than local TCP/IP connections with TLS, although not in the ways I had expected. Furthermore, authentication plays a surprising role.
Methodology
Database server
We want to consider PostgreSQL and CockroachDB as a developer would find them when experimenting for the first time. For PostgreSQL, this means a server running in the background, started by the standard system service manager. For CockroachDB, this means running the server in a terminal via `cockroach demo`.
To the astute reader, this choice may seem unfair with regards to performance: a CockroachDB database run via `cockroach demo` stores all its data in RAM, whereas PostgreSQL uses disk by default. This gives an “unfair” advantage to CockroachDB. However, I was particularly interested in showcasing the default performance “as a developer would establish first contact”, and `cockroach demo` is the most likely setup in that case. Moreover, as explained in the results sections below, this difference does not matter much in the end.
From this starting point, both databases are configured by default to accept clients either via a unix domain socket (URL `postgres://user@?host=/path/to/socketdir&port=NNNN`) or via TCP/IP with TLS (`postgres://user@127.0.0.1:NNNN?sslmode=require`).
Note that I am using PostgreSQL v11 and a CockroachDB binary built from source (from the `master` branch at commit `abf75a9588`). I am using a hand-built CockroachDB binary in order to pick up the latest code that will yield the upcoming v20.1 release, as opposed to the latest stable release v19.2, which does not support unix sockets well.
Database client
To focus on the protocol and performance differences of the database server, we will be using the `psql` command-line client in every experiment. This command is equally able to query PostgreSQL and CockroachDB servers, and it is also lightweight (this detail will matter later).
We will also use a separate SQL user account in both cases, with an empty (pristine) database; this can be obtained with:
CREATE USER foo WITH PASSWORD 'abc';
CREATE DATABASE foo;
GRANT ALL ON DATABASE foo TO foo;
in both databases.
Note: if you attempt to reproduce these experiments, consider that the username `root` has special status in CockroachDB and is handled separately, with different performance characteristics. Your results with `root` will likely diverge from those obtained with any other user name.
Experiments
To explore the raw differences between unix domain sockets and a TCP connection with TLS, we will run the following experiments:

1. Establishing a single connection, and running many simple queries through it, each returning just 1 row. The query is minimal: just `SELECT 1`. This is more or less a throughput benchmark: we will get a number of queries per second.

   This will also help identify the (average) latency of individual queries, as the cost of a query execution round-trip: sending the query to execute, letting the server prepare a query plan, and returning the result row. Since the query is minimal, we expect the planning time to be negligible, so the communication round-trip should dominate.

   We will also run this first step with authentication disabled (the `trust` authentication method), so as to remove the measurement noise due to the authentication handshake and focus on the query throughput and individual query latency.

2. Establishing many connections and running just 1 query returning 1 row through each. The query is the same as above, so that we can subtract its latency, measured in step (1), from the total connection + query latency measured at this step. This identifies the latency of the initial connection handshake itself.

3. Running the same experiment as step (2), but with authentication enabled. The difference between the results at steps (2) and (3) will tell us the authentication overhead.

4. In addition to step (1), I was also interested in measuring row throughput (the first experiment measures query throughput). To achieve this, the last experiment will establish a single connection and run a simple query that returns many rows: `SELECT generate_series(1,N)` with N sufficiently large.
Platforms
To abstract away platform specifics, we should run all the experiments on at least two different hardware/OS combinations.
For reference, I used the following two systems:
Property | System A | System B |
---|---|---|
CPU | Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz | AMD Ryzen 9 3950X 16-Core (3700.07-MHz K8-class CPU) |
Memory | Samsung DDR4-2400 8GB | G-Skill DDR4-3600 64GB |
OS kernel | Linux 5.3.0-46-generic #38-Ubuntu SMP x86_64 | FreeBSD 13.0-CURRENT #42 r358991 |
Distribution | Ubuntu 19.10 | FreeBSD 13-CURRENT |
C compiler | GNU C v9 | Clang 10 |
Go compiler (for CockroachDB) | 1.14.2 | 1.14.1 |
Query throughput / average latency (1)
This first experiment compares query throughput across the two types of local connections.
Gathering results
We use the following command to generate an input file with a large number of simple queries:
for i in $(seq 1 20000); do echo "select 1;"; done >s20k.sql
We verify that authentication is disabled, by loading the following HBA configuration in both databases:
# TYPE DATABASE USER ADDRESS METHOD
local all all trust
host all all 127.0.0.1/32 trust
Then we run the following commands to benchmark:
# CockroachDB unix - substitute the connection URL by that provided by cockroach demo
for i in $(seq 1 10); do time psql "postgres://foo:abc@?host=%2Ftmp%2Fdemo306023277&port=26257" -f s20k.sql >/dev/null; done
# CockroachDB tcp - also substitute the URL provided by cockroach demo
for i in $(seq 1 10); do time psql "postgres://foo:abc@127.0.0.1:14570?sslmode=require" -f s20k.sql >/dev/null; done
# PostgreSQL unix - substitute the socket path by the one applicable on your system
for i in $(seq 1 10); do time psql "postgres://foo:abc@?host=/tmp&port=5432" -f s20k.sql >/dev/null; done
# PostgreSQL tcp
for i in $(seq 1 10); do time psql "postgres://foo:abc@127.0.0.1:5432?sslmode=require" -f s20k.sql >/dev/null; done
The measurements reveal the following:
For PostgreSQL:
On system A:
Measurement | unix socket | tcp/ip with TLS |
---|---|---|
Total time at 20000 queries | 1.265s | 2.0509s |
Total time at 40000 queries | 2.5052s | 3.9991s |
Linear regression QPS | 16126 | 10265 |
Average time per query | 62.01µs | 97.41µs (+57%) |

On system B:

Measurement | unix socket | tcp/ip with TLS |
---|---|---|
Total time at 20000 queries | 0.518s | 0.7353s |
Total time at 40000 queries | 1.0442s | 1.4683s |
Linear regression QPS | 38008 | 27285 |
Average time per query | 26.13µs | 36.65µs (+39%) |
For CockroachDB:
On system A:
Measurement | unix socket | tcp/ip with TLS |
---|---|---|
Total time at 20000 queries | 3.228s | 4.9898s |
Total time at 40000 queries | 6.5244s | 10.3159s |
Linear regression QPS | 6067 | 3755 |
Average time per query | 164.6µs | 266.3µs (+62%) |

On system B:

Measurement | unix socket | tcp/ip with TLS |
---|---|---|
Total time at 20000 queries | 2.126s | 2.5356s |
Total time at 40000 queries | 4.6908s | 5.2107s |
Linear regression QPS | 7798 | 7476 |
Average time per query | 128.2µs | 133.7µs (+4.3%) |
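The “Linear regression QPS” rows are derived by fitting a line through the two (query count, total time) measurements: the slope cancels the fixed per-run overhead (psql startup plus connection handshake), leaving the marginal cost per query. A minimal sketch of this computation, using the PostgreSQL system A unix-socket numbers from the tables above:

```python
# Fit a line through (20000, t_20k) and (40000, t_40k); the slope is the
# marginal time per query, independent of the fixed per-run startup cost.

def qps_and_latency(t_20k: float, t_40k: float) -> tuple[float, float]:
    """Return (queries per second, microseconds per query)."""
    seconds_per_query = (t_40k - t_20k) / 20_000
    return 1 / seconds_per_query, seconds_per_query * 1e6

# PostgreSQL on system A, unix socket (numbers from the table above):
qps, us = qps_and_latency(1.265, 2.5052)
print(f"{qps:.0f} QPS, {us:.2f}µs/query")  # → 16126 QPS, 62.01µs/query
```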
Lessons learned
What the results teach us:
Sending queries and receiving single-row results over a TLS-encrypted TCP connection after the initial connection handshake is uniformly much slower than using an (unencrypted) unix domain socket, from 5% to 60% slower.
(This seems obvious in hindsight, but it is also good to see it in numbers.)
CockroachDB receives queries and sends results 2-4x more slowly than PostgreSQL over a local connection. This is true even though CockroachDB here runs from RAM only with no disk access, while PostgreSQL uses disk.
Does it matter?
When looking at individual queries, all these latencies are under a half millisecond. They are largely imperceptible to a human experimenter. So it does not matter.
However…
… when sending many simple queries over a single connection, there are notable differences; both between unix sockets and TCP connections, and between CockroachDB and PostgreSQL.
However…
… this experiment stretches both databases to their peak throughput capacity. For local development (the topic of these experiments), thousands of (simple) queries per second are unusual and likely unrealistic. Both databases can sustain a reasonable “development” workload of 10-100 QPS without issue, regardless of the connection type.
Beware, meanwhile, of what this experiment does not tell us:
- Anything about general query performance: this experiment uses a “simple” query which does nothing and does not access stored data. We can expect that many “normal” queries will have inherent CPU and I/O costs that dominate.
- Anything about performance at scale: we are using a local server without the benefits of horizontal / vertical scaling on production hardware or Cloud deployments.
- Anything about the overhead of the initial connection handshake. We will learn more about this in the next experiment.
Handshake overhead (2)
This second experiment identifies the initial connection overhead between the two types of connection.
Gathering results
For this experiment, we will still be operating with authentication disabled.
We run the following command:
time bash -c \
'for i in $(seq 1 300); do \
psql "postgres://foo:abc@?host=%2Ftmp%2Fdemo306023277&port=26257" -c "select 1"; \
done >/dev/null'
What is going on here? The basic unit of work is the command in the middle:
psql "postgres://foo:abc@?host=%2Ftmp%2Fdemo306023277&port=26257" -c "select 1"
Because it runs quickly, using `time` directly in front of it would show a lot of noise. So instead, we run it N times in a loop using `bash -c`, and time that instead. (We make N large enough to get multiple digits’ worth of time measurement precision, but not so large that the experiment runs forever.)
The measurements reveal the following:
For PostgreSQL:
On system A:
Measurement | unix socket | tcp/ip with TLS |
---|---|---|
Combined time for N runs | 1.987s | 2.181s |
Value of N | 30 | 30 |
Combined time for 1 run | 66.23ms | 72.69ms |
Average time per query [1] | 62.01µs | 97.41µs |
Combined - (per-query time) | 66.16ms | 72.59ms (+9%) |

On system B:

Measurement | unix socket | tcp/ip with TLS |
---|---|---|
Combined time for N runs | 1.853s | 6.125s |
Value of N | 1000 | 1000 |
Combined time for 1 run | 1.853ms | 6.125ms |
Average time per query [1] | 26.13µs | 36.65µs |
Combined - (per-query time) | 1.83ms | 6.09ms (+70%) |
For CockroachDB:
On system A:
Measurement | unix socket | tcp/ip with TLS |
---|---|---|
Combined time for N runs | 2.001s | 2.218s |
Value of N | 30 | 30 |
Combined time for 1 run | 66.69ms | 73.92ms |
Average time per query [1] | 164.6µs | 266.3µs |
Combined - (per-query time) | 66.53ms | 73.65ms (+10%) |

On system B:

Measurement | unix socket | tcp/ip with TLS |
---|---|---|
Combined time for N runs | 1.123s | 2.135s |
Value of N | 300 | 300 |
Combined time for 1 run | 3.764ms | 7.12ms |
Average time per query [1] | 128.2µs | 133.7µs |
Combined - (per-query time) | 3.64ms | 6.98ms (+48%) |
[1] We use the average time per query derived in experiment (1) above.
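The last row of each table is obtained by dividing the combined time by N and subtracting the per-query latency from experiment (1); what remains is psql startup plus connection establishment. As a sketch, using the PostgreSQL system B numbers from the tables above:

```python
# Per-run handshake cost = (combined wall time / N) - per-query latency.

def handshake_ms(combined_s: float, n_runs: int, per_query_us: float) -> float:
    per_run_ms = combined_s / n_runs * 1000
    return per_run_ms - per_query_us / 1000

# PostgreSQL on system B (numbers from the tables above):
unix = handshake_ms(1.853, 1000, 26.13)  # ≈ 1.83 ms
tcp = handshake_ms(6.125, 1000, 36.65)   # ≈ 6.09 ms
print(f"unix {unix:.2f}ms, tcp+tls {tcp:.2f}ms")
```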
Lessons learned
What the results teach us:
Establishing a SQL session over a local TCP/IP connection with TLS enabled is uniformly much slower than over a unix domain socket, from 9% to 70% slower.
(This seems obvious in hindsight, but it is also good to see it in numbers.)
While the measurement above combines both the time to load the `psql` executable in memory and the time to establish the SQL session, the time to load `psql` is dwarfed by the TCP/IP connection overheads.

(This would not be true if we had used CockroachDB’s own `cockroach` executable. With that program, it takes about a second on my systems just to load the 100MB+ executable in memory. For local testing, `psql` definitely yields a more lightweight experience.)

The overhead of establishing a TCP/IP connection and TLS stream is about the same in PostgreSQL and CockroachDB.
Does it matter?
- When a developer needs to run a batch of SQL statements, this result shows that it is much more advantageous to open a single connection and pump all the statements through it, rather than opening a new connection for each statement.
Authentication overheads (3)
The two previous experiments were run with authentication disabled. How much are the results affected if one adds PostgreSQL’s/CockroachDB’s simplest authentication method, namely `password`? Let’s find out.
Gathering results
First we load the following HBA configuration in both databases:
# TYPE DATABASE USER ADDRESS METHOD
local all all password
host all all 127.0.0.1/32 password
Then we can run the commands from experiment (2) again, and subtract the previous measurement from the new one to see how much time is spent in authentication alone.
The measurements reveal the following:
For PostgreSQL:
On system A:
Measurement | unix socket | tcp/ip with TLS |
---|---|---|
Combined time for N runs | 1.999s | 2.201s |
Value of N | 30 | 30 |
Combined time for 1 run | 66.64ms | 73.36ms |
Subtracting from above | 66.22ms | 72.69ms |
Authentication per run | 0.41ms (<1%) | 0.67ms (+38%) |

On system B:

Measurement | unix socket | tcp/ip with TLS |
---|---|---|
Combined time for N runs | 2.624s | 6.208s |
Value of N | 1000 | 1000 |
Combined time for 1 run | 2.624ms | 6.208ms |
Subtracting from above | 1.853ms | 6.125ms |
Authentication per run | 0.77ms (42%) | 0.08ms (-8x?) |
For CockroachDB:
On system A:
Measurement | unix socket | tcp/ip with TLS |
---|---|---|
Combined time for N runs | 5.726s | 5.934s |
Value of N | 30 | 30 |
Combined time for 1 run | 190.88ms | 197.96ms |
Subtracting from above | 66.69ms | 73.92ms |
Authentication per run | 124.19ms (+86%) | 124.05ms (~) |

On system B:

Measurement | unix socket | tcp/ip with TLS |
---|---|---|
Combined time for N runs | 1.909s | 2.074s |
Value of N | 30 | 30 |
Combined time for 1 run | 63.60ms | 69.15ms |
Subtracting from above | 3.764ms | 7.12ms |
Authentication per run | 59.84ms (+15x!) | 62.03ms (+3%) |
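The “Authentication per run” rows follow the same subtraction pattern as before: the per-run time with `password` authentication, minus the `trust` baseline from experiment (2). A sketch using the CockroachDB system A unix-socket numbers from the tables above:

```python
# Authentication cost per run = (combined time / N) - baseline from exp (2).

def auth_overhead_ms(combined_s: float, n_runs: int, baseline_ms: float) -> float:
    """Per-run time with password auth, minus the trust-auth baseline."""
    return combined_s / n_runs * 1000 - baseline_ms

# CockroachDB on system A, unix socket: ≈ 124ms spent in authentication alone.
print(f"{auth_overhead_ms(5.726, 30, 66.69):.1f}ms")
```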
Lessons learned
On PostgreSQL, password authentication comes at a negligible price—less than a millisecond on both systems—compared to the cost of establishing the SQL session (from ~2 to 66ms). This is true for both unix socket and TCP/IP+TLS connections.
In comparison, CockroachDB’s password authentication is super-expensive and dwarfs the cost to establish the connection (from 86% of the cost to a staggering 15x). With latencies north of 100ms, it will likely “feel” slow during interactive development.
This latency, incidentally, is due to the server’s internal use of bcrypt to thwart password brute-forcing. The latency cost of this protection is a known limitation in CockroachDB (see here, here and here) which will probably be lifted in a later version.
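To get a feel for why a single password check can cost tens of milliseconds by design, here is a sketch timing one evaluation of a deliberately expensive key derivation function. Python’s standard library has no bcrypt, so this uses scrypt as a stand-in; the parameters below are illustrative and are not CockroachDB’s actual settings.

```python
import hashlib
import time

def timed_kdf_ms(password: bytes) -> float:
    """Wall-clock milliseconds for one scrypt evaluation (a bcrypt stand-in)."""
    start = time.perf_counter()
    # Deliberately expensive parameters: ~16MB of memory per evaluation.
    hashlib.scrypt(password, salt=b"0123456789abcdef", n=2**14, r=8, p=1)
    return (time.perf_counter() - start) * 1000

# The server pays this cost once per connection attempt.
print(f"one KDF evaluation: {timed_kdf_ms(b'abc'):.1f}ms")
```

Compare this with the sub-millisecond cost of PostgreSQL’s plain `password` check measured above: the expensive hash, not the connection type, dominates CockroachDB’s login latency.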
There is no significant difference in authentication latency between unix sockets and TCP/IP+TLS connections. The overall difference, if any, exists in the initial connection set-up, as demonstrated by experiment (2) above.
What this does not tell us:
- There may be a slight difference in cost due to the additional client-server packet exchange to transmit the password. However, we do not have enough data in this experiment to know for sure. The next experiment will elucidate this somewhat.
Row throughput (4)
This final experiment identifies the cost difference between one-directional and bi-directional traffic in local connections.
Gathering results
For this experiment, we are using the databases with authentication disabled again.
Here we use a single database connection and a single query, and pump a large number of rows through it (2-4M rows).
for i in $(seq 1 10); do
# replace the URL to switch between crdb and pg,
# between unix sockets and TCP,
# and between 2M and 4M rows.
time psql "postgres://foo:abc@?host=/tmp&port=5432" -c 'select * from generate_series(1,2000000)' >/dev/null
done
The measurements reveal the following:
For PostgreSQL:
On system A:
Measurement | unix socket | tcp/ip with TLS |
---|---|---|
Total time at 2M rows | 2.51s | 2.5791s |
Total time at 4M rows | 5.1038s | 5.1801s |
Linear regression RPS | 771K | 769K |
Average time per row | 1.29µs | 1.30µs (~) |

On system B:

Measurement | unix socket | tcp/ip with TLS |
---|---|---|
Total time at 2M rows | 1.426s | 1.4509s |
Total time at 4M rows | 2.9434s | 2.9737s |
Linear regression RPS | 1.318M | 1.313M |
Average time per row | 0.76µs | 0.76µs (~) |
For CockroachDB:
On system A:
Measurement | unix socket | tcp/ip with TLS |
---|---|---|
Total time at 2M rows | 2.93s | 2.9996s |
Total time at 4M rows | 5.9136s | 6.0536s |
Linear regression RPS | 670K | 655K |
Average time per row | 1.49µs | 1.53µs (+2%) |

On system B:

Measurement | unix socket | tcp/ip with TLS |
---|---|---|
Total time at 2M rows | 1.701s | 1.7563s |
Total time at 4M rows | 3.6423s | 3.5601s |
Linear regression RPS | 1.030M | 1.180M |
Average time per row | 0.97µs | 0.90µs (-7%) |
Lessons learned
What the results teach us:
With PostgreSQL, the row throughput is independent of whether the client uses a unix socket or TCP/IP with TLS.
This is surprising: in the results of experiment (1) above, we found that query latency was 5% to 60% slower over TCP/IP. How can this be?
The explanation for this can be found by opening the black box and looking at the low-level protocol. In the first experiment, the client and server were exchanging many small packets of data: one packet from client to server and one in the other direction, for each query.
In this new experiment, there is just one small packet from the client to the server at the beginning. Then, all the result rows are streamed from the server to the client using large packets, each containing many result rows.
This explains the difference: multiple complex operations happen in the OS kernel to communicate a TCP/IP packet from one process to another locally. Comparatively, there is little OS overhead to exchange packets over unix sockets. With many packets, the cost of the OS operations dominates, and so the difference between TCP/IP and unix sockets is more visible.

With large packets, there are fewer packets overall and thus the cost of copying the data in and out of buffers dominates the OS overheads. This copy cost is also independent of the connection type. This is why the resulting rows/sec difference in this experiment is small to nonexistent.
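The per-packet effect described above can be reproduced outside any database with a pair of plain sockets. The sketch below times many one-byte round-trips (the traffic pattern of experiment (1)) over a unix socket and over loopback TCP; no TLS is involved, so it isolates only the kernel-level per-packet cost, and the absolute numbers depend entirely on the machine.

```python
import os
import socket
import tempfile
import threading
import time

def echo_server(srv: socket.socket) -> None:
    """Accept one client and echo everything back until it disconnects."""
    conn, _ = srv.accept()
    with conn:
        while data := conn.recv(65536):
            conn.sendall(data)

def time_round_trips(family: int, addr, rounds: int = 5000) -> float:
    srv = socket.socket(family, socket.SOCK_STREAM)
    srv.bind(addr)
    srv.listen(1)
    threading.Thread(target=echo_server, args=(srv,), daemon=True).start()
    with socket.socket(family, socket.SOCK_STREAM) as cli:
        cli.connect(srv.getsockname())
        start = time.perf_counter()
        for _ in range(rounds):
            cli.sendall(b"x")  # one tiny packet client -> server...
            cli.recv(1)        # ...and one tiny packet back, per round
        return time.perf_counter() - start

sock_path = os.path.join(tempfile.mkdtemp(), "bench.sock")
t_unix = time_round_trips(socket.AF_UNIX, sock_path)
t_tcp = time_round_trips(socket.AF_INET, ("127.0.0.1", 0))
print(f"5000 round-trips: unix {t_unix:.3f}s, tcp {t_tcp:.3f}s")
```

On typical systems the unix-socket loop tends to finish faster, mirroring the experiment (1) gap even before TLS enters the picture.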
What this does not tell us:
- Anything about the difference between CockroachDB and PostgreSQL on row throughput.
This experiment communicates integer (SQL `INT`) values. PostgreSQL’s INTs are 32-bit by default, whereas CockroachDB’s are 64-bit. It is thus expected that CockroachDB is able to transmit fewer rows per second than PostgreSQL: it has to transmit twice as much data, conceptually. However, the encoding algorithms are also likely different; it may be that CockroachDB is better at encoding integers quickly than PostgreSQL. There are really so many differences in the type of data transferred that the PostgreSQL and CockroachDB results are incomparable in this experiment.
- What is going on with the small differences in row throughput between unix sockets and TCP/IP + TLS connections for CockroachDB. On one system I observe that TCP/IP + TLS is 2% slower, whereas on the other it is 7% faster. What does this mean?
This cannot be explained by a platform (OS/hardware) difference only, because if it could, we would see a similar difference for PostgreSQL. It cannot be explained by a difference in the CockroachDB implementation (or binary code) either, because I used the same source code and nearly the same compiler version in both cases.
Further experiments would be needed to further understand this nuance.
Summary and conclusions
Unix domain sockets yield generally faster performance than TCP/IP connections with TLS enabled, even though the latter is the default recommended/advertised connection method for both PostgreSQL and CockroachDB. This result is intuitive to network experts, and is now demonstrated empirically by the experiments above.
Unix domain sockets offer both a lower latency (up to 3x lower) to establish the connection and a faster throughput overall (up to 3x faster) when communicating small queries and result sets back-and-forth between client and server. The throughput difference disappears for large result sets, when data can be streamed.
This difference is most noticeable with PostgreSQL, which has a lean connection set-up cost and protocol otherwise. With CockroachDB, the difference in initial connection cost is less noticeable when password authentication is enabled (because the password check itself is relatively expensive), but remains after the initial handshake.
In summary: use unix domain sockets for local connections. ✔