This blog post provides an overview of the data flows inside CockroachDB and its data security architecture.
Disclaimers: this information is solely based on the public CockroachDB source code repository on GitHub. At the time of this writing, the most recent stable release of CockroachDB is 19.2.3. The present analysis reflects the master branch in the repository, which will become version 20.1. Although I also sometimes contribute to the CockroachDB project and consult for Cockroach Labs, the opinions enclosed herein are my own.
A running CockroachDB cluster looks like this: one or more Unix processes running the `cockroach` program binary on different servers, connected to each other.
```
 (client apps using SQL or web UI sessions in client browsers)
        |                    |                    |
+------ | ---------+ +------ | ---------+ +------ | ---------+
|       v          | |       v          | |       v          |
| +------------+   | | +------------+   | | +------------+   |
| | cockroach  |<----->| cockroach  |<----->| cockroach  |   |
| +------------+   | | +------------+   | | +------------+   |
|       |          | |       |          | |       |          |
| +------------+   | | +------------+   | | +------------+   |
| | node-local |   | | | node-local |   | | | node-local |   |
| |  storage   |   | | |  storage   |   | | |  storage   |   |
| +------------+   | | +------------+   | | +------------+   |
|                  | |                  | |                  |
+----- server 1 ---+ +----- server 2 ---+ +----- server 3 ---+
```
CockroachDB automatically replicates the data and handles failover: if a node is shut down or crashes, the other nodes “take over”. Data access is forwarded: clients connected to any node can access data on any other node. When a node goes down, clients can resume their work using any other node.
The diagram above reveals the two main categories of data flows inside CockroachDB:
- inter-node traffic.
- client traffic.
A single process running the
cockroach program contains
multiple separate components with data flows between them.
We can model these components as a directed graph:
- nodes represent the components;
- edges represent the “uses” relationship:
`A -> B` means “A sends requests to B and receives responses from B”.
We can then draw an inventory of the components and their data flows, as follows:
In this diagram, the boxes with a double border correspond to server components that accept connections from outside of the process. There are 5 such server components, separated into 3 groups:
- The HTTP API server, which can serve the web UI assets on its own and delegates all the other API logic to the RPC administrative server (see below).
- The SQL protocol server (“pgwire” inside CockroachDB’s source code), for client apps compatible with the PostgreSQL wire protocol. This also includes a subset of CockroachDB’s own CLI utilities.
- The RPC servers:
- 3a. The RPC server for clients, which exposes data to the web UI and some of CockroachDB’s CLI utilities.
- 3b. The SQL execution server, which executes (parts of) SQL queries on behalf of other nodes.
- 3c. The internal RPC server for all the services of the KV layer and below. This includes replication, transactions, liveness, etc.
As revealed by the previous diagram, CockroachDB provides external connection opportunities as follows:
- over its HTTP port (`--listen-http-addr`), to its HTTP API server.
- over its SQL port (`--listen-sql-addr`), to its SQL protocol server.
- over a unix domain socket (`--socket-file`), to its SQL protocol server.
- over its RPC port (`--listen-addr`), to all its RPC servers: this port serves both client and inter-node traffic.
Additionally, an implicit interface is used: the boundary between the `cockroach` process and the underlying storage (filesystem / hard disk):
- data files maintained by RocksDB. This includes persistent data and spill-to-disk temporary query data.
- temporary files created during bulk I/O operations.
- log files, including:
- the SQL statement logs,
- the SQL “experimental” audit logs,
- the RocksDB log files,
- the “main” operational logs.
The RocksDB log files only contain operational events (e.g. “the process is shutting down”). The other logs contain possibly sensitive application data, as follows:
- the SQL statement logs contain the SQL text of queries sent by clients, including the username, database/table names and the literal values passed as arguments to the query. The results are not logged; however, the number of rows in the result is. Error messages returned to SQL clients, if any, are also logged in full.
- The SQL audit logs contain usernames, and database/table names.
- The main (default) logs contain highly sensitive information:
- Data range boundary keys are copied in full in various log messages. Boundary keys include all the data from indexed columns in SQL tables (both primary and secondary indexes).
- Data value strings are included in various logged error messages.
- The amount of data revealed in this way is already relatively high in the default configuration, and can become very detailed if the `vmodule` option is changed (which can happen dynamically, during troubleshooting).
- This logging of keys and values can occur on any table, including system tables.
Overall, the main log files should be considered highly sensitive and access to them should be considered to offer pathways to privilege escalation up to complete control over the CockroachDB cluster.
Applying the STRIDE model, here is how CockroachDB provides security over its external interfaces.
Tamper resistance is arranged for network connections as follows:
- TLS integrity over all TCP connections.
- Unix process isolation for the unix socket.
Tamper resistance for storage is arranged as follows:
- Regular consistency checks between replicas stored on different nodes.
- Optional encryption-at-rest for data files.
Meanwhile, there is no integrity protection for generated log files, nor for temporary files during Bulk I/O operations.
Generally, CockroachDB was designed with the assumption that storage integrity is provided by the underlying environment and that no malicious actors have network access to the OS where `cockroach` is running. Production deployments that care about tampering via physical access should use RAM and disk encryption.
Note: CockroachDB also uses checksums inside the data files. However, if my reading of the source code is right, RocksDB does not use cryptographic hashes to maintain integrity, and programmatic tampering remains possible. As noted here, checksums are a protection against accidental damage to the data (e.g. disk corruption) and do not provide resistance against malicious tampering.
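The difference can be illustrated with a small sketch (plain Python, not CockroachDB code): a plain checksum like CRC-32 can simply be recomputed by whoever tampers with the data, whereas a keyed MAC cannot be forged without the key.

```python
import hmac
import hashlib
import zlib

data = b"account=alice;balance=100"

# Non-cryptographic checksum, stored next to the data (RocksDB-style).
checksum = zlib.crc32(data)

# An attacker who can write the file just recomputes the checksum:
tampered = b"account=alice;balance=999"
forged_checksum = zlib.crc32(tampered)
assert forged_checksum != checksum      # detects *accidental* corruption...
# ...but verification passes, because the attacker updated the checksum too:
assert zlib.crc32(tampered) == forged_checksum

# A keyed MAC resists this: without the key, no valid tag can be produced.
key = b"secret key kept outside the data files"
tag = hmac.new(key, data, hashlib.sha256).digest()
bad_tag = hmac.new(b"wrong key", tampered, hashlib.sha256).digest()
assert not hmac.compare_digest(tag, bad_tag)
```

The checksum only protects against bit rot; the MAC protects against an adversary, at the cost of having to manage a key outside the attacker's reach.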
Resistance against information disclosure over network links is arranged as follows:
- TLS encryption over all TCP connections.
- Unix process security for the unix socket.
Resistance against information disclosure in storage is arranged as follows:
- for data files, using optional encryption-at-rest.
- however, there is no confidentiality protection for generated log files nor for certain (non-RocksDB) intermediate temporary files during Bulk I/O operations. An attacker with access to logs can thus escalate privileges and disclose stored data using a client connection, regardless of whether storage encryption is enabled.
As with tamper resistance above, resistance against information disclosure (data leaks) of the storage underlying a `cockroach` server should be provided by the environment, e.g. via RAM and disk encryption. Until now, CockroachDB was designed with the assumption that the server running `cockroach` is a confidential enclave.
This raises the question of the purpose of the encryption-at-rest feature, if access to the server running `cockroach` provides a realistic path to information disclosure of stored data. I assume that the feature was designed to maintain confidentiality of data files when they are manually copied elsewhere, or when `cockroach` is running against networked storage.
There is an “experimental” audit log for access to marked tables. The experimental nature of the feature suggests that non-repudiability is not yet a focus for the project.
Authentication over network connections is arranged as follows:
- For HTTP clients, using either a password check (hash in `system.users`) or a valid session cookie (stored in `system.web_sessions`). The web interface, via an RPC, enables creation of a session cookie from a password (“login workflow”); it is also possible to create a cookie manually using the `cockroach auth-session login` CLI command.
- For RPCs between nodes, two-way TLS certificate validation. Only node-to-node certificates are accepted.
- For RPC clients, TLS client certificate validation.
- For SQL clients, depending on the connection origin and the username, according to the current HBA configuration (see below):
- either TLS client certificate validation, or
- a password check against the hash stored in `system.users`, or
- no authentication (always trust), or
- GSSAPI / Kerberos.
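As a generic illustration of the two-way TLS validation mentioned above (a plain-Python sketch, not CockroachDB's actual Go implementation), mutual TLS means the server both presents its own certificate and demands a valid certificate from the peer:

```python
import ssl

# Server-side context for a mutual-TLS ("two-way") setup: the server both
# presents its own certificate and *requires* a valid one from the client.
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)

# Hypothetical file names, for illustration only:
# ctx.load_cert_chain("node.crt", "node.key")   # this server's identity
# ctx.load_verify_locations("ca.crt")           # CA that signed peer certs

# Clients that do not present a certificate signed by the trusted CA
# are rejected at handshake time.
ctx.verify_mode = ssl.CERT_REQUIRED
```

On top of this generic mechanism, CockroachDB additionally checks which principal is named in the peer certificate, which is how it can restrict inter-node RPC to `node` certificates only.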
Authentication between the `cockroach` process and the accompanying storage is arranged as follows:
- Access to log files and the data files is authenticated by the unix permission system.
- With encryption-at-rest, using configured encryption keys.
CockroachDB supports a subset of PostgreSQL’s Host-Based Authentication configuration language (`pg_hba.conf`), via a cluster setting. I wrote a separate article dedicated to this topic: Authentication configuration in PostgreSQL and CockroachDB.
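For illustration, a hypothetical configuration in this syntax could look like the following. The rule fields follow PostgreSQL's `pg_hba.conf` layout (TYPE, DATABASE, USER, ADDRESS, METHOD; first matching rule wins); the specific addresses and method names here are made up for the example:

```
# TYPE  DATABASE  USER  ADDRESS      METHOD
host    all       root  10.0.0.0/8   cert       # ops network: certificates only
host    all       all   all          password   # everyone else: password check
```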
Resistance to network-based denial of service attacks is arranged as follows:
- Memory usage upon incoming network connections is kept minimal until authentication succeeds (with a caveat, explained below).
- Nodes are “multi-active”: if one node goes away, another node can take over with close to zero time to recovery. This exploits consensus-based replication and thus requires a majority of nodes in every replication group to remain available.
- Automatic scale-up is possible: adding more nodes automatically adds more capacity.
Note that there is no internal resource isolation inside CockroachDB: all authenticated clients are assumed to not be adversaries with regards to availability.
Conversely, CockroachDB is not suited for multi-tenant hosting, where two or more competing entities share access to a single CockroachDB cluster. Any such entity can lock the others out via resource exhaustion.
Finally, the promised note about resource usage until authentication succeeds: incoming requests are accepted by spawning a goroutine per incoming connection. There is only minimal throttling based on the total amount of memory allocated (each connection is initially counted as 10KiB towards the configured `--max-sql-memory`). It is thus possible to open tens of thousands of pre-authentication connections to a single node. Meanwhile, password authentication is performed using 10 rounds of bcrypt. Combined with the limited throttling, it is thus possible to starve a node of CPU time by forcing it to process thousands of password authentications concurrently.
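The cost asymmetry can be sketched in a few lines of Python. bcrypt is not in the standard library, so this sketch substitutes PBKDF2 as an analogous deliberately-expensive password hash; the iteration count plays the role of bcrypt's 10 rounds:

```python
import hashlib
import os
import time

def check_password(password: bytes, salt: bytes, expected: bytes) -> bool:
    # Deliberately slow KDF: good against offline cracking, but each
    # verification burns noticeable CPU time on the server.
    digest = hashlib.pbkdf2_hmac("sha256", password, salt, 200_000)
    return digest == expected

salt = os.urandom(16)
expected = hashlib.pbkdf2_hmac("sha256", b"hunter2", salt, 200_000)

start = time.monotonic()
assert check_password(b"hunter2", salt, expected)
elapsed = time.monotonic() - start
# A single check costs a perceptible fraction of a CPU second; thousands of
# concurrent unauthenticated attempts translate into CPU starvation.
```

The attacker pays only the cost of opening connections and sending bogus passwords; the server pays the full KDF cost for each attempt.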
When nodes are started with the `--insecure` flag, all the following mechanisms are entirely disabled (all-or-nothing):
- confidentiality: all network and storage encryption is disabled.
- integrity: there are no digests on network messages.
- authentication: any client can authenticate as any principal, including spoofing the `node` principal.
- authorization: all non-`node` principals are assumed to have the “admin” bit. Since `node` connections can be spoofed, any client can effectively also obtain the “node” bit on demand.
This mode is only offered for testing and disaster recovery.
While casually browsing the source code, I found the following points of interest:
- The virtual table `crdb_internal.leases` reports all the leases on SQL objects throughout the cluster, and this virtual table is readable by any principal. This reveals the existence and the name of all the “currently active” tables/views/sequences.
- Any SQL session can request “session tracing” via the statement `SET tracing = on` (or `on,kv`). After a trace is collected, the session can view its own trace via `SHOW TRACE` or the virtual table `crdb_internal.session_trace`. The trace contains details of KV operations (including detailed key/values) prior to authorization and filtering by authorization deputies. This can be used e.g. to discover the list of all database and table names, by tracing a `SHOW TABLES` statement. In general, SQL session tracing can be used to bypass authorization and read any stored data processed by deputies with the “admin” bit. This is a textbook confused deputy situation.
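A hypothetical session illustrating the bypass, assembled from the statements mentioned above (run by a user with no special privileges):

```sql
-- Enable tracing, then cause a deputy to read privileged metadata.
SET tracing = on;
SHOW TABLES;              -- executed internally with elevated rights
SET tracing = off;
-- The collected trace exposes the underlying KV operations, including
-- keys/values the session could not have read directly:
SELECT message FROM crdb_internal.session_trace;
```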
- `error` objects produced in one component flow through other components, including authorization deputies. While deputies filter the “positive” (non-error) data returned to requesters, they typically do not filter out sensitive data contained in errors produced with their “admin” bit. Given that KV-level error objects can contain key/value data, it is theoretically possible to exploit error handling to leak data through authorization deputies. This is another confused deputy situation.
CockroachDB is a distributed database where nodes communicate with each other and accept connections from database clients. From the STRIDE perspective, CockroachDB follows the state of the art and makes reasonable operational assumptions with regards to Authentication, Integrity (tamper-resistance) and “external” Confidentiality. There is little provision for Non-repudiability but this seems to remain out of scope for the project at this time.
CockroachDB’s Availability story is slightly more complex. While it was designed from the ground up as a Highly Available (HA) system in the face of accidents, there are some gaps regarding availability in the face of malicious behavior.
Finally, its Authorization story is “interesting” and offers the most opportunities for future improvements. CockroachDB uses a combination of SQL and non-SQL privilege bits, and the authorization policies that enforce privilege bits on service endpoints (through its HTTP, RPC and SQL APIs) are heterogeneously implemented, without a common structured framework. CockroachDB’s documentation is strong in its advertisement and explanation of the SQL privilege model over SQL resources, but does not yet systematically explain non-SQL data and non-SQL authorization rules. Although the currently released versions do offer reasonably strong authorization guarantees, there are very few guardrails in the source code to prevent the accidental introduction of privilege escalation avenues during the development of new features.
As an engineer with an interest in security, I can recommend several low-hanging fruits to lift the project into a better and stronger place:
- Immediately change the default configuration to separate the SQL and RPC ports and strongly recommend in documentation to firewall the RPC port to prevent malicious use or “node”-level privilege escalation.
- Fix the semantics of GRANT by addressing this issue.
- Separate the privilege bit required to access (and write to) `system` tables, containing critical operational parameters, from the “admin” bit required to view privileged cluster metadata. This would strongly incentivize the engineers who add new views into cluster metadata to think about implementing and documenting a new authorization deputy with clear escalation rules.
- Systematically eliminate the redundancy in authorization enforcement between the RPC servers and the SQL code. Ideally, a single point in the code should be responsible to both enforce authorization and log controlled privilege escalations, such as that performed by SQL virtual tables that act as non-privileged views into privileged data.
- Gradually introduce a capability authorization sub-system, where the privileged components would require capability bits from their requesters, and where capability tokens would be delivered to a principal as part of their authentication. This system should be designed using cryptographic principles, so that an engineer would not be tempted to synthesize a capability token “for convenience” when pressed for time implementing a new feature.
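The last point can be sketched concretely (an illustration of the principle, not a proposal for CockroachDB's actual design): if capability tokens are HMAC-signed by the authentication layer, feature code cannot mint one ad hoc, because it never holds the signing key. All names below are hypothetical.

```python
import hmac
import hashlib

# Key held only by the authentication layer, never by feature code.
SIGNING_KEY = b"held-by-the-authentication-layer-only"

def mint_capability(principal: str, capability: str) -> bytes:
    """Issued at authentication time, alongside the session credentials."""
    payload = f"{principal}:{capability}".encode()
    tag = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest().encode()
    return payload + b"." + tag

def check_capability(token: bytes, principal: str, capability: str) -> bool:
    """Called by a privileged component before honoring a request."""
    payload, _, tag = token.rpartition(b".")  # hex tag contains no "."
    expected = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest().encode()
    return (payload == f"{principal}:{capability}".encode()
            and hmac.compare_digest(tag, expected))

token = mint_capability("alice", "read:crdb_internal.leases")
assert check_capability(token, "alice", "read:crdb_internal.leases")

# A token synthesized without the key fails verification:
forged = b"alice:admin." + b"0" * 64
assert not check_capability(forged, "alice", "admin")
```

The design property of interest is that the verification path and the minting path share a secret that ordinary components never see, so an escalation requires compromising the authentication layer itself, not just any deputy.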
So what do you think? Did I miss something? Is any part unclear? Leave your comments below.