This blog post provides an overview of the data flows inside CockroachDB and its data security architecture.
This post was originally published in February 2020, targeting CockroachDB v20.1. It was edited in February 2022 to reflect changes in CockroachDB v22.1. A summary of changes is available on this page.
Disclaimers: this information is solely based on the public
CockroachDB source code repository on GitHub. At the time of this
writing, the most recent stable release of CockroachDB is 21.2.5. The
present analysis reflects the
master branch in the repository
which will become version 22.1. Although I also sometimes contribute to
the CockroachDB project and consult for Cockroach Labs, the opinions
enclosed herein are my own.
A running CockroachDB cluster looks like this: one or more unix processes running the cockroach program binary on different servers, and connected to each other.
```
 (client apps using SQL or web UI sessions in client browsers)
        |                    |                    |
+------ | ---------+ +------ | ---------+ +------ | ---------+
|       v          | |       v          | |       v          |
|  +------------+  | |  +------------+  | |  +------------+  |
|  | cockroach  |<----->| cockroach  |<----->| cockroach  |  |
|  +------------+  | |  +------------+  | |  +------------+  |
|       |          | |       |          | |       |          |
|  +------------+  | |  +------------+  | |  +------------+  |
|  | node-local |  | |  | node-local |  | |  | node-local |  |
|  | storage    |  | |  | storage    |  | |  | storage    |  |
|  +------------+  | |  +------------+  | |  +------------+  |
|                  | |                  | |                  |
+----- server 1 ---+ +----- server 2 ---+ +----- server 3 ---+
```
CockroachDB automatically replicates the data and handles failover: if a node is shut down or crashes, the other nodes “take over”. Data access is forwarded: clients connected to any node can access data on any other node. When a node goes down, clients can resume their work using any other node.
The diagram above reveals the two main categories of data flows inside CockroachDB:
- inter-node traffic.
- client traffic.
A single process running the
cockroach program contains
multiple separate components with data flows between them.
We can model these components as a directed graph:
- nodes represent the components;
- edges represent the “uses” relationship:
`A -> B` means “A sends requests to B and receives responses from B”.
We can then draw an inventory of the components and their data flows, as follows:
In this diagram, the boxes with a double border correspond to server
components that accept connections from outside of the
process. There are 5 such server components, separated into 3 groups:
- The HTTP API server, which can serve the web UI assets on its own and delegates all the other API logic to the RPC administrative server (see below).
- The SQL protocol server (“pgwire” inside CockroachDB’s source code),
for client apps compatible with the PostgreSQL wire protocol. This also
includes a subset of CockroachDB’s own CLI utilities.
- The RPC servers:
- 3a. The RPC server for clients, which exposes data to the web UI and
some of CockroachDB’s CLI utilities.
- 3b. The SQL execution server, which executes (parts of) SQL queries on behalf of other nodes.
- 3c. The internal RPC server for all the services of the KV layer and below. This includes replication, transactions, liveness, etc.
As revealed by the previous diagram, CockroachDB provides external connection opportunities as follows:
- over its HTTP port (`--listen-http-addr`), to its HTTP API server.
- over its SQL port (`--listen-sql-addr`), to its SQL protocol server.
- over a unix domain socket (`--socket-file`), to its SQL protocol server.
- over its RPC port (`--listen-addr`), to all its RPC servers: this port serves both client and inter-node traffic.
Additionally, an implicit interface is used: the boundary between the
cockroach process and the underlying storage (filesystem / hard disk):
- data files maintained by Pebble. This includes persistent data and spill-to-disk temporary query data.
- temporary files created during bulk I/O operations.
- log files, including but not limited to:
- the SQL statement logs.
- the SQL “experimental” audit logs.
- the Pebble log files.
- the “main” operational logs.
The Pebble log files only contain operational events (e.g. “the process is shutting down”). The other logs contain possibly sensitive application data, as follows:
- the SQL statement logs contain the SQL text of queries sent by clients, including the username, database/table names and the literal values passed as arguments to the query. The results are not logged; however, the number of rows in the result is. Error messages returned to SQL clients, if any, are also logged in full.
- The SQL audit logs contain usernames, and database/table names.
- The main (default) logs contain highly sensitive information:
- Data range boundary keys are copied in full in various log messages. Boundary keys include all the data from indexed columns in SQL tables (both primary and secondary indexes).
- Data value strings are included in various logged error messages.
- The amount of data revealed in this way is already relatively high
in the default configuration, and can become very detailed if the
`vmodule` option is changed (which can happen dynamically, during troubleshooting).
- This logging of keys and values can occur on any table, including
system tables.
Overall, the main log files should be considered highly sensitive and access to them should be considered to offer pathways to privilege escalation up to complete control over the CockroachDB cluster.
Applying the STRIDE model, here is how CockroachDB provides security over its external interfaces.
Tamper resistance is arranged for network connections as follows:
- TLS integrity over all TCP connections.
- Unix process isolation for the unix socket.
Tamper resistance for storage is arranged as follows:
- Regular consistency checks between replicas stored on different nodes.
- Optional encryption-at-rest for data files.
Meanwhile, there is no integrity protection for generated log files, nor for temporary files during Bulk I/O operations.
Generally, CockroachDB was designed with the assumption that storage
integrity is provided by the underlying environment and that no
malicious actors have network access to the OS where cockroach is
running. Production deployments that care about tampering via physical
access should use RAM and disk encryption.
Note: CockroachDB also uses checksums inside the data files. However, if my reading of the source code is right, Pebble does not use cryptographic hashes to maintain integrity, and programmatic tampering remains possible. As noted here, checksums are a protection against accidental damage to the data (e.g. disk corruption) and do not provide resistance against malicious tampering.
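To illustrate the difference, here is a small Python sketch (not CockroachDB code; the block layout is invented for illustration) contrasting a CRC-style checksum, which an attacker with file access can simply recompute after tampering, with a keyed MAC that cannot be forged without the secret key:

```python
# Checksums detect *accidental* corruption; a keyed MAC resists
# *deliberate* tampering. Illustrative only: the "block" format below
# is invented and does not mirror Pebble's actual SSTable layout.
import hashlib
import hmac
import zlib

def store_with_crc(data: bytes) -> bytes:
    # Mimics a checksum-protected block: payload + 4-byte CRC32 trailer.
    return data + zlib.crc32(data).to_bytes(4, "big")

def verify_crc(block: bytes) -> bool:
    data, crc = block[:-4], int.from_bytes(block[-4:], "big")
    return zlib.crc32(data) == crc

block = store_with_crc(b"balance=100")
assert verify_crc(block)

# An attacker with file access rewrites the payload *and* the checksum:
forged = store_with_crc(b"balance=999")
assert verify_crc(forged)  # the tampering goes undetected

# A keyed MAC resists this: without the secret key, the attacker cannot
# produce a valid tag for the modified payload.
key = b"secret-key-held-outside-the-data-files"
tag = hmac.new(key, b"balance=100", hashlib.sha256).digest()
bad_tag = hmac.new(b"guessed-key", b"balance=999", hashlib.sha256).digest()
expected = hmac.new(key, b"balance=999", hashlib.sha256).digest()
assert not hmac.compare_digest(bad_tag, expected)
```

Of course, a MAC only helps if the key is stored outside the attacker’s reach, which circles back to the same environmental assumptions discussed above.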
Resistance against information disclosure over network links is arranged as follows:
- TLS encryption over TCP connections is enabled by default.
- Unix process security for the unix socket.
Resistance against information disclosure in storage is arranged as follows:
- for data files, using optional encryption-at-rest.
- however, there is no confidentiality protection for generated log files, nor for certain (non-Pebble) intermediate temporary files during Bulk I/O operations. An attacker with access to logs can thus escalate privileges and disclose stored data using a client connection, regardless of whether storage encryption is enabled.
As with tamper resistance above, resistance against information
disclosure (data leaks) of the storage underlying a cockroach
server should be provided by the environment, e.g. via RAM and disk
encryption. Until now, CockroachDB was designed with the assumption
that the server running cockroach is a confidential enclave.
This raises the question of the purpose of the encryption-at-rest
feature, if access to the server running cockroach provides a
realistic path to information disclosure of stored data. I assume
that the feature was designed to maintain confidentiality of data
files when they are manually copied elsewhere, or when cockroach
is running against networked storage.
The logging subsystem is organized into channels, on which events produced inside CockroachDB are emitted, and sinks, which decide how the logging events are reported externally to the process; channels are routed to sinks.
It is possible to mark a log sink as
auditable, which makes CockroachDB
synchronize all log writes. This ensures that the log events going to that sink are persisted
before operations are allowed to go through.
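The mechanism can be sketched as follows in Python (class and method names are invented; this is not CockroachDB’s actual API): a sink marked auditable flushes and fsyncs every event to disk before returning control to the caller.

```python
# Minimal sketch of an "auditable" log sink: writes are synchronized to
# durable storage before the calling operation may proceed. The FileSink
# class is invented for illustration.
import os
import tempfile

class FileSink:
    def __init__(self, path: str, auditable: bool):
        self.f = open(path, "a", encoding="utf-8")
        self.auditable = auditable

    def emit(self, event: str) -> None:
        self.f.write(event + "\n")
        if self.auditable:
            # Synchronize: the event is durably persisted before the
            # operation that produced it is allowed to go through.
            self.f.flush()
            os.fsync(self.f.fileno())

fd, path = tempfile.mkstemp()
os.close(fd)
sink = FileSink(path, auditable=True)
sink.emit("user admin SELECT * FROM audited_table")
# Even if the process crashed right here, the event above is on disk.
with open(path, encoding="utf-8") as f:
    print(f.read().strip())
```

The trade-off is latency: every audited operation pays the cost of a disk sync, which is why only sinks that need the guarantee should be marked auditable.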
By default, the “important” channels, including accesses to audited tables, are
associated with a sink that has the auditable flag set.
It is also possible to route the log traffic to an external log collector,
e.g. Fluent, to ensure that the collected events cannot be tampered with
if the server where
cockroach is running gets compromised.
Authentication over network connections is arranged as follows:
- For HTTP clients, using either a password check (hash in system.users) or a valid session cookie (stored in system.web_sessions). The web interface, via an RPC, enables creation of a session cookie from a password or an OIDC handshake (“login workflow”); it is also possible to create a cookie manually using the `cockroach auth-session login` CLI command.
- For RPCs between nodes, two-way TLS certificate validation. Only node-to-node certificates are accepted.
- For RPC clients, TLS client certificate validation.
- For SQL clients, depending on the connection origin and the
username, according to the current HBA configuration (see below):
  - Either TLS client certificate validation, or
  - A password check against the hash stored in system.users, or
  - No authentication (always trust), or
  - Via GSSAPI / Kerberos.
CockroachDB also supports OCSP for TLS certificate revocation.
Authentication between the
cockroach process and the accompanying
storage is arranged as follows:
- Access to log files and the data files is authenticated by the unix permission system.
- With encryption-at-rest, using configured encryption keys.
Authentication for outgoing requests performed by cockroach,
when exporting data or backups to an external storage system, is
performed using credentials specified in SQL when the bulk I/O job is created.
CockroachDB supports a subset of PostgreSQL’s Host-Based
Authentication configuration language (`pg_hba.conf`), via a cluster setting.
I wrote a separate article dedicated to this topic: Authentication configuration in PostgreSQL and CockroachDB.
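For a flavor of how HBA rules behave, here is a simplified Python sketch of the first-match semantics shared by pg_hba.conf and CockroachDB’s subset of it (no CIDR matching here; the rule fields and examples are invented for illustration):

```python
# Sketch of host-based authentication (HBA) rule matching: rules are
# scanned top to bottom, and the FIRST rule matching the connection
# type, user, and client address decides the authentication method.
# Simplified: exact or "all" field matching only, no CIDR masks.
def pick_method(rules, conn_type, user, addr):
    for r_type, r_user, r_addr, method in rules:
        if (r_type == conn_type
                and r_user in ("all", user)
                and r_addr in ("all", addr)):
            return method
    return "reject"  # no rule matched: deny by default

rules = [
    ("host",  "root", "all",      "cert"),      # root must use TLS certs
    ("host",  "all",  "10.0.0.1", "trust"),     # a trusted internal proxy
    ("host",  "all",  "all",      "password"),  # everyone else: password
    ("local", "all",  "all",      "trust"),     # unix socket: trust
]

assert pick_method(rules, "host", "root", "192.0.2.7") == "cert"
assert pick_method(rules, "host", "app", "10.0.0.1") == "trust"
assert pick_method(rules, "host", "app", "192.0.2.7") == "password"
assert pick_method(rules, "local", "app", "") == "trust"
```

Because the first matching rule wins, rule order matters: placing the broad `("host", "all", "all", "password")` rule first would silently shadow the stricter certificate requirement for root.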
Resistance to network-based denial of service attacks is arranged as follows:
- Memory usage upon incoming network connections is kept minimal until authentication succeeds (with a caveat, explained below).
- Nodes are “multi-active”: if one node goes away, another node can take over with close to zero time to recovery. This exploits consensus-based replication and thus requires a majority of nodes in every replication group to remain available.
- Automatic scale-up is possible: adding more nodes automatically adds more capacity.
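The “multi-active” property above rests on simple quorum arithmetic, sketched here in Python (illustrative; function names are mine):

```python
# A replication group of size n makes progress as long as a strict
# majority of its replicas — floor(n/2) + 1 — remains available.
def quorum(n_replicas: int) -> int:
    return n_replicas // 2 + 1

def available(n_replicas: int, n_failed: int) -> bool:
    return n_replicas - n_failed >= quorum(n_replicas)

assert available(3, 1)      # 3-way replication survives 1 node failure
assert not available(3, 2)  # ...but not 2 simultaneous failures
assert available(5, 2)      # 5-way replication survives 2 failures
```

This is why losing a minority of nodes yields near-zero recovery time, while losing a majority in any replication group stalls the affected data until nodes return.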
As of v21.1, CockroachDB has a new multi-tenant architecture internally, which provides internal resource isolation between multiple “tenants” that are assumed to be adversaries with regards to availability.
As of this writing, this architecture is only used in the CockroachCloud “serverless” offering. CockroachCloud “dedicated” clusters, as well as users who run CockroachDB themselves, do not have access to this feature. In “single-tenant” deployments, any application can lock the others out via resource exhaustion.
Finally, the note about resource usage until authentication succeeds:
incoming requests are accepted by spawning a goroutine per incoming
connection. There is only minimal throttling based on the total amount
of memory allocated (each connection is initially counted as 10KiB
towards the configured
--max-sql-memory). It is thus possible to
open tens of thousands of pre-authentication connections to a single
node. A semaphore is used internally to limit the CPU usage of
authentication checks for incoming connections and prevent an attacker
from using pre-auth connections to starve a node from CPU.
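The semaphore idea can be sketched as follows in Python (names and limits are invented for illustration; this is not CockroachDB code): accepting a connection stays cheap, while the number of concurrently running authentication checks is capped.

```python
# Sketch of the pre-authentication throttle: a semaphore caps how many
# expensive authentication checks (password hashing, certificate
# validation) run concurrently, so a flood of pre-auth connections
# cannot starve the node of CPU.
import threading
import time

AUTH_CONCURRENCY = 4  # stand-in for the internal semaphore size
auth_sem = threading.BoundedSemaphore(AUTH_CONCURRENCY)
lock = threading.Lock()
in_flight = 0
peak = 0

def handle_connection(conn_id: int) -> None:
    global in_flight, peak
    # Accepting the connection itself costs little; the expensive part
    # is gated behind the semaphore.
    with auth_sem:  # at most AUTH_CONCURRENCY checks at a time
        with lock:
            in_flight += 1
            peak = max(peak, in_flight)
        time.sleep(0.01)  # stand-in for an expensive credential check
        with lock:
            in_flight -= 1

# Simulate a burst of 32 simultaneous pre-auth connections.
threads = [threading.Thread(target=handle_connection, args=(i,))
           for i in range(32)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("peak concurrent auth checks:", peak)  # bounded by AUTH_CONCURRENCY
```

Note that this bounds CPU usage only; the per-connection memory accounting described above is a separate, and weaker, line of defense.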
When nodes are started with the
--insecure flag, all the following
mechanisms are entirely disabled (all-or-nothing):
- confidentiality: all network and storage encryption is disabled.
- integrity: there are no digests on network messages.
- authentication: any client can authenticate as any principal,
including spoofing the node principal.
- authorization: all non-node principals are assumed to have the “admin” bit. Since node connections can be spoofed, any client can effectively also obtain the “node” bit on demand.
This mode is only offered for testing and disaster recovery.
While casually browsing the source code, I found the following points of interest:
- The virtual table crdb_internal.leases reports all the leases on SQL objects throughout the cluster, and this virtual table is readable by any principal. This reveals the existence and the name of all the “currently active” tables/views/sequences.
- Any SQL session can request “session tracing” via the statement SET tracing = on (or on,kv). After a trace is collected, the session can view its own trace via SHOW TRACE or the virtual table crdb_internal.session_trace. The trace contains details of KV operations (including detailed key/values) prior to authorization and filtering by authorization deputies. This can be used e.g. to discover the list of all database and table names, by tracing a SHOW TABLES statement. In general, SQL session tracing can be used to bypass authorization and read any stored data processed by deputies with the “admin” bit. This is a textbook confused deputy situation.
- error objects produced in one component flow through other components, including authorization deputies. While deputies filter the “positive” (non-error) data returned to requesters, they typically do not filter out sensitive data contained in errors produced with their “admin” bit. Given that KV-level error objects can contain key/value data, it is theoretically possible to exploit error handling to leak data through authorization deputies. This is another confused deputy situation.
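The confused-deputy pattern in both findings can be sketched in a few lines of Python (all names are invented; this is not CockroachDB code): a deputy reads with its own elevated rights and filters the rows it returns, but leaks the unfiltered data through a side channel such as a trace.

```python
# Sketch of a confused deputy: the authorization filter is applied to
# the returned rows, but not to the side channel (here, a trace)
# populated while the deputy ran with its "admin" bit.
class Deputy:
    def __init__(self):
        self.kv = {"/table/users/alice": "ssn=123-45-6789"}
        self.trace = []

    def read(self, requester_is_admin: bool):
        # The deputy scans KV data using its own elevated rights...
        for k, v in self.kv.items():
            self.trace.append(f"kv scan: {k} -> {v}")  # raw data leaks here
        # ...and filters only the *result* by the requester's rights:
        if requester_is_admin:
            return dict(self.kv)
        return {}  # a non-admin requester sees no rows

d = Deputy()
rows = d.read(requester_is_admin=False)
assert rows == {}                       # the authorization filter worked...
assert "ssn=123-45-6789" in d.trace[0]  # ...but the trace leaks the data
```

The fix, as argued later in this post, is to apply authorization at a single point that governs every channel out of the deputy, side channels included.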
CockroachDB is a distributed database where nodes communicate with each other and accept connections from database clients. From the STRIDE perspective, CockroachDB follows the state of the art and makes reasonable operational assumptions with regards to Authentication, Integrity (tamper-resistance) and “external” Confidentiality. There is little provision for Non-repudiability but this seems to remain out of scope for the project at this time.
CockroachDB’s Availability story is slightly more complex. While it was designed from the ground up as a Highly Available (HA) system in the face of accidents, there are some gaps regarding availability in the face of malicious behavior.
Finally, its Authorization story is “interesting” and offers the most opportunities for future improvements. CockroachDB uses a combination of SQL and non-SQL privilege bits, and the authorization policies that enforce privilege bits on service endpoints (through its HTTP, RPC and SQL APIs) are heterogeneously implemented, without a common structured framework. CockroachDB’s documentation is strong in its advertisement and explanation of the SQL privilege model over SQL resources, but does not yet systematically explain non-SQL data and non-SQL authorization rules. Although the currently released versions do offer reasonably strong authorization guarantees, there are very few guardrails in the source code to prevent the accidental introduction of privilege escalation avenues during the development of new features.
As an engineer with an interest in security, I can recommend several pieces of low-hanging fruit to lift the project into a better and stronger place:
- Immediately change the default configuration to separate the SQL and RPC ports and strongly recommend in documentation to firewall the RPC port to prevent malicious use or “node”-level privilege escalation.
- Separate the privilege bit required to access (and write to) system tables, containing critical operational parameters, from the “admin” bit required to view privileged cluster metadata. This would strongly incentivize the engineers who add new views into cluster metadata to think about implementing and documenting a new authorization deputy with clear escalation rules.
- Systematically eliminate the redundancy in authorization enforcement between the RPC servers and the SQL code. Ideally, a single point in the code should be responsible to both enforce authorization and log controlled privilege escalations, such as that performed by SQL virtual tables that act as non-privileged views into privileged data.
- Gradually introduce a capability authorization sub-system, where the privileged components would require capability bits from their requesters, and where capability tokens would be delivered to a principal as part of their authentication. This system should be designed using cryptographic principles so that an engineer would not be tempted to synthesize a capability token “for convenience” when pressed for time implementing a new feature.
So what do you think? Did I miss something? Is any part unclear? Leave your comments below.