Cassandra troubleshooting guide

Apache Cassandra is a distributed NoSQL database designed to handle large amounts of data across multiple servers. It follows a peer-to-peer architecture: any node in the cluster provides the same functionality as any other node.

Cluster status

Before jumping into troubleshooting, let's discuss how to check the overall health of the cluster. You can use nodetool status to assess the state of the cluster, which is useful when one or more nodes are down or unreachable.

nodetool status

Nodetool status provides information such as each node's status, state, address, load, owns percentage, host ID, and rack.

  • Status - U (Up) or D (Down)
  • State - N (Normal), L (Leaving), J (Joining), M (Moving).
  • Leaving - Indicates a node is being removed/decommissioned from the cluster.
  • Joining - Indicates a node is being added to the cluster.
  • Moving - Indicates a node's token range is being moved.

Datacenter: Datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load      Tokens  Owns   Host ID                               Rack
UN  10.40.49.3    64.3 GB   256     48.9%  c469cbec-e0e0-46a0-b31e-f1c49860b03f  RAC2
UN  10.40.40.55   55.76 GB  256     50.8%  91cab37f-d427-454e-9656-4a25b93d9fe2  RAC1
UN  10.40.38.120  58.01 GB  256     49.7%  47219cd8-f55a-41f8-bb17-b21dbb4092fa  RAC3
UN  10.40.40.56   58.88 GB  256     50.4%  2bc08dc1-f132-469f-a4bc-4a6156237061  RAC3
UN  10.40.8.75    58.54 GB  256     51.1%  27fcfee0-2a3f-4e16-878e-cc4897233802  RAC2
DN  10.40.17.20   57 GB     256     49.2%  0934da93-f777-445c-a801-9ed0d4972fef  RAC1

Here you can see that one of the nodes is down or unreachable. If you have many nodes, use grep to pick out the abnormal ones.

nodetool status | grep -v '^UN'

This will list only the nodes that are not Up and Normal.

Datacenter: Datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load   Tokens  Owns   Host ID                               Rack
DN  10.40.17.20  57 GB  256     49.2%  0934da93-f777-445c-a801-9ed0d4972fef  RAC1
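
The same filtering can be scripted. A minimal Python sketch, assuming the column layout matches the sample output above (the two-letter status/state code is the first field of each data row):

```python
def abnormal_nodes(status_output):
    """Return (status_state, address) pairs for nodes that are not Up/Normal."""
    nodes = []
    for line in status_output.splitlines():
        fields = line.split()
        # Data rows start with a two-letter code such as UN or DN.
        if len(fields) >= 2 and len(fields[0]) == 2 and fields[0][0] in "UD":
            if fields[0] != "UN":
                nodes.append((fields[0], fields[1]))
    return nodes

sample = """\
-- Address Load Tokens Owns Host ID Rack
UN 10.40.49.3 64.3 GB 256 48.9% c469cbec RAC2
DN 10.40.17.20 57 GB 256 49.2% 0934da93 RAC1"""
print(abnormal_nodes(sample))  # [('DN', '10.40.17.20')]
```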

Common troubleshooting steps

  • Identify the problematic nodes
  • Narrow down the problems
  • Fix the root cause

Identify the problematic node

The first step in troubleshooting is to identify where the problem lies. With the help of client logs, Cassandra logs, and other tools provided by Cassandra, we have to determine whether the problem is in the client, a single node, or a subset of nodes.

Client logs

Client logs help identify the source of the problem. Some of the common errors you could find in the logs are:

NoHostAvailable Exception

When the client or driver is unable to establish a connection with any node in the cluster, this error is thrown. It can happen when:

Fig. 1: NoHostAvailable exception in Cassandra client logs
  • Multiple or all nodes are down in the cluster. Use nodetool status to verify the cluster status and node availability.
  • An incorrect configuration in the load-balancing policy. For instance, you might have configured DCAwareRoundRobinPolicy with an incorrect localDC or node address.
  • When all nodes are busy due to high load and cannot accept new incoming requests.

Unavailable Exception

In this case, the request has reached the coordinator node, but there are not enough live replicas available to meet the requested consistency level.

Fig 2: Unavailable Exception in Cassandra client logs
  • If this exception is thrown from multiple coordinator nodes, then multiple replicas were down or unreachable at that time. Use nodetool getendpoints to get the list of replicas for that query and start from there.
  • If the exception is always thrown from a single coordinator node, it is highly likely that the coordinator node has been separated from the rest of the cluster. Use nodetool status on the coordinator node to check its view of the cluster.
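
To reason about Unavailable exceptions, it helps to compute how many live replicas each consistency level demands. A small illustrative sketch (not driver API code; the level names are the standard Cassandra ones):

```python
def required_replicas(consistency, replication_factor):
    """Replicas a coordinator must see alive for a request to proceed."""
    if consistency == "ONE":
        return 1
    if consistency == "TWO":
        return 2
    if consistency == "QUORUM":
        return replication_factor // 2 + 1
    if consistency == "ALL":
        return replication_factor
    raise ValueError("unsupported consistency level: " + consistency)

# With RF=3, QUORUM needs 2 live replicas; if 2 of 3 are down, the
# coordinator throws an Unavailable exception without attempting the read.
print(required_replicas("QUORUM", 3))  # 2
```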

OperationTimedOutException

An operation timeout occurs when the coordinator does not respond within the timeout specified by the client. It can occur when the client-side timeout is too short or when the query fetches a large amount of data that cannot be processed within the specified time. The error message generally displays the last host tried by the client; that coordinator is a good starting point for debugging the issue.

ReadTimeOutException or WriteTimeOutException

This is a server-side timeout that occurs between the coordinator and the replicas. It happens when a request reaches the coordinator, which believes there are enough replicas available and requests data from them, but one or more replicas do not respond within the expected time, causing a read or write timeout. Because the default server-side timeouts are high, this exception generally points to a server-side problem.

Identifying the endpoints

If only a particular query consistently fails, you can use nodetool getendpoints to list all the nodes that own a particular partition key. You can start debugging from that list of nodes.

Assuming the following query always fails, one or more of the nodes that own the given partition key are the problem.

SELECT * FROM sampleks.table1 WHERE partitionkey = 123;

Use the following to query the endpoints of "123":

nodetool getendpoints sampleks table1 123
#Lists the replica nodes
10.40.40.55
10.40.38.59
10.40.41.66
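
Under the hood, getendpoints maps the partition key onto the token ring and walks clockwise to collect replicas. A toy sketch of that ring walk (real Cassandra uses the Murmur3 partitioner and replication-strategy rules; the hash and token values here are illustrative only):

```python
import bisect
import hashlib

def toy_token(key):
    # Real Cassandra uses Murmur3Partitioner; a toy hash suffices for the idea.
    return int(hashlib.md5(str(key).encode()).hexdigest(), 16) % 1000

def endpoints(ring, key, rf):
    """ring: sorted list of (token, node). Walk clockwise from the key's
    token and take the next rf nodes, wrapping around the ring."""
    tokens = [t for t, _ in ring]
    start = bisect.bisect_left(tokens, toy_token(key)) % len(ring)
    return [ring[(start + i) % len(ring)][1] for i in range(rf)]

ring = [(100, "10.40.40.55"), (400, "10.40.38.59"), (700, "10.40.41.66")]
print(endpoints(ring, 123, 2))  # two consecutive nodes on the ring
```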

Narrow down the problem

Once you have a list of faulty nodes, the next step is to narrow down the problem on each node. You can use the Cassandra logs and various nodetool commands to find out the node's state.

Read latency

One of the most common problems faced by Cassandra users is high read latency, which can originate from either the coordinator or the individual replicas.

Coordinator query latency

If the coordinator node is the suspect, you can use nodetool proxyhistograms to find the coordinator-level read and write latency.

./nodetool proxyhistograms
#Lists the coordinator node's latency
proxy histograms
Percentile  Read Latency  Write Latency  Range Latency
            (micros)      (micros)       (micros)
50%         315.85        379.02         263.21
75%         545.79        454.83         315.85
95%         943.13        654.95         454.83
98%         1358.10       943.13         454.83
99%         1955.67       1358.10        454.83
Min         51.01         88.15          152.32
Max         8409.01       2816.16        454.83

Common causes of high coordinator latency are:

  • Cross DC traffic
  • Consistency level
  • Large batches
  • Querying multiple partitions by using the IN clause from the partition key
  • Network latency
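
For the IN-clause item above: querying many partitions through one coordinator forces a single node to fan out and gather all the results. A hedged sketch of the usual fix, rewriting a multi-partition query into one single-partition query per key so each can be coordinated by a replica that owns it (keyspace and table names are from the earlier example):

```python
def split_in_query(keyspace, table, key_column, keys):
    """Rewrite SELECT ... WHERE k IN (...) into one query per partition key."""
    return [
        f"SELECT * FROM {keyspace}.{table} WHERE {key_column} = {k};"
        for k in keys
    ]

for q in split_in_query("sampleks", "table1", "partitionkey", [123, 456]):
    print(q)
# SELECT * FROM sampleks.table1 WHERE partitionkey = 123;
# SELECT * FROM sampleks.table1 WHERE partitionkey = 456;
```

The individual queries can then be issued concurrently from the client, spreading the coordination load across the cluster.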

Local query latency

Local query latency refers to the latency of the replica nodes to process the request. Common causes for high local latency are:

  • Reading a large number of sstables: Use nodetool tablehistograms to find the sstable count.
  • Reading a large partition: Use nodetool tablestats to find the average and max partition size and use logs to see the max partition size that is flushed or compacted.
  • High tombstone count: Use logs to find the tombstone count read and nodetool tablestats to check the average tombstone per slice.
  • Insufficient threads: Use nodetool tpstats to find the thread stats.
  • Low machine resources, such as insufficient RAM.

Table Histograms

The nodetool tablehistograms command provides the latency of local reads and writes for a given table.

./nodetool tablehistograms keyspace table
#provides table level latency
Percentile  SSTables  Write Latency  Read Latency  Partition Size  Cell Count
                      (micros)       (micros)      (bytes)
50%         3.00      51.01          454.83        88148           258
75%         3.00      73.46          785.94        263210          770
95%         5.00      88.15          1358.10       1131752         2759
98%         5.00      105.78         1955.67       3379391         3311
99%         5.00      126.93         4055.27       4866323         4768
Min         0.00      8.24           51.01         30              0
Max         5.00      263.21         4866.32       802187438       42510

Tablehistograms provides the number of sstables read per query, read latency, write latency, partition size, and cell count.

If the number of sstables read per query is high, read queries will have high latency. This can happen if the wrong compaction strategy was chosen; choose a compaction strategy based on your read and write load. Another cause could be that minor compaction has been disabled.

The partition size and cell count can help you tune your data model.
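
As a reminder of how to read the percentile rows: a read latency of 1358.10 µs at the 98th percentile means 98% of reads completed at or under that value. The same computation over raw samples, using the nearest-rank method (Cassandra's own histograms use fixed buckets, so this is only an approximation):

```python
def percentile(samples, pct):
    """Nearest-rank percentile: smallest value covering pct% of the samples."""
    ordered = sorted(samples)
    rank = max(1, round(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Latency samples in microseconds (illustrative values).
latencies = [51, 315, 379, 454, 545, 654, 943, 1358, 1955, 8409]
print(percentile(latencies, 50))   # 545
print(percentile(latencies, 100))  # 8409
```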

Table stats

If you suspect issues with one particular table, you can use nodetool tablestats to get the table level latency and other statistics.

nodetool tablestats cassandra.table1
#List the tablestats
Keyspace: cassandra
Read Count: 28498299
Read Latency: 0.3223316537594051 ms.
Write Count: 35110896
Write Latency: 0.04817532471971094 ms.
Pending Flushes: 0
Table: table1
SSTable count: 13
Space used (live): 4551417734
Space used (total): 4551417734
Space used by snapshots (total): 0
Off heap memory used (total): 4548182
SSTable Compression Ratio: 0.14427946324401666
Number of partitions (estimate): 183506
Memtable cell count: 474
Memtable data size: 195955
Memtable off heap memory used: 0
Memtable switch count: 529
Local read count: 28498299
Local read latency: 0.122 ms
Local write count: 35110896
Local write latency: 0.039 ms
Pending flushes: 0
Bloom filter false positives: 17515
Bloom filter false ratio: 0.00000
Bloom filter space used: 431528
Bloom filter off heap memory used: 431424
Index summary off heap memory used: 53606
Compression metadata off heap memory used: 4063152
Compacted partition minimum bytes: 30
Compacted partition maximum bytes: 43388628
Compacted partition mean bytes: 181607
Average live cells per slice (last five minutes): 2.69670113244707
Maximum live cells per slice (last five minutes): 35
Average tombstones per slice (last five minutes): 1.0
Maximum tombstones per slice (last five minutes): 1

Threadpool Stats

Cassandra maintains multiple thread pools for different tasks. Each threadpool has a specific function to perform. You can use nodetool tpstats to find the state of these thread pools, such as number of tasks that are active, completed, and blocked.

You can also obtain the number of dropped messages, which can shed some light on which stage is causing a problem. In the example below, mutation and request-response messages were dropped.

Pool Name                    Active   Pending      Completed   Blocked  All time blocked
MutationStage                     0         0      385618082         0                 0
ViewMutationStage                 0         0              0         0                 0
ReadStage                         0         0      537370386         0                 0
RequestResponseStage              0         0      376599718         0                 0
ReadRepairStage                   0         0        1603479         0                 0
CounterMutationStage              0         0              0         0                 0
MiscStage                         0         0              0         0                 0
CompactionExecutor                0         0       38697073         0                 0
MemtableReclaimMemory             0         0          41849         0                 0
PendingRangeCalculator            0         0              2         0                 0
GossipStage                       0         0       70732668         0                 0
SecondaryIndexManagement          0         0              0         0                 0
HintsDispatcher                   0         0            188         0                 0
MigrationStage                    0         0             11         0                 0
MemtablePostFlush                 0         0          68006         0                 0
ValidationExecutor                0         0              0         0                 0
Sampler                           0         0              0         0                 0
MemtableFlushWriter               0         0          41555         0                 0
InternalResponseStage             0         0           1122         0                 0
AntiEntropyStage                  0         0              0         0                 0
CacheCleanupExecutor              0         0              0         0                 0
Native-Transport-Requests         0         0     1668155897         0                72
UserDefinedFunctions              0         0              0         0                 0

Message type         Dropped
READ                       0
RANGE_SLICE                0
_TRACE                     0
HINT                       0
MUTATION                  10
COUNTER_MUTATION           0
BATCH_STORE                0
BATCH_REMOVE               0
REQUEST_RESPONSE           2
PAGED_RANGE                0
READ_REPAIR                0
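
The dropped-message section can also be extracted programmatically for alerting. A minimal sketch, assuming the two-column layout shown above:

```python
def dropped_messages(tpstats_output):
    """Return {message_type: count} for types with a nonzero dropped count."""
    dropped = {}
    for line in tpstats_output.splitlines():
        fields = line.split()
        # Data rows have exactly two fields: message type and count.
        if len(fields) == 2 and fields[1].isdigit() and int(fields[1]) > 0:
            dropped[fields[0]] = int(fields[1])
    return dropped

sample = """\
Message type Dropped
READ 0
MUTATION 10
REQUEST_RESPONSE 2"""
print(dropped_messages(sample))  # {'MUTATION': 10, 'REQUEST_RESPONSE': 2}
```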

Cassandra logs

Cassandra provides multiple log files for various purposes. Some of the main logs are:

  • system.log
  • debug.log
  • gc.log

system.log

This is the default general log provided by Cassandra. The fields available are level, thread name/ID, date, class name, line number, and message.

Using log level

Sometimes checking the whole log file can be overwhelming. Use the log level to filter messages:

# List the last 5 warning/error messages
grep 'WARN\|ERROR' system.log | tail -5

Search all rotated log files

By default, Cassandra compresses old rotated log files, so if you want to search through all the rotated logs, use zgrep:

# Search all compressed rotated files
zgrep 'WARN\|ERROR' system.log.* | less

GC pauses

The GCInspector class logs a message whenever there is a long GC pause.

#List last 10 gcpauses in system.log
grep 'GCInspector' system.log | tail -10
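
GCInspector lines can likewise be parsed to pull out pause durations. A sketch assuming the common "<collector> GC in <N>ms" wording, which can vary between Cassandra versions:

```python
import re

# Matches pause durations such as "ParNew GC in 248ms".
GC_RE = re.compile(r"GC in (\d+)ms")

def long_pauses(log_text, threshold_ms=500):
    """Return pause durations (ms) above the given threshold."""
    return [int(ms) for ms in GC_RE.findall(log_text) if int(ms) > threshold_ms]

log = ("INFO ... GCInspector.java:284 - ParNew GC in 248ms.\n"
       "WARN ... GCInspector.java:284 - ConcurrentMarkSweep GC in 1721ms.")
print(long_pauses(log))  # [1721]
```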

Reading too many tombstones

Tombstones can bring an entire node or cluster down if not handled properly. Read response time increases when a large number of tombstones must be read to answer a query.

grep 'tombstone cells' system.log | tail -10

Sample Output

WARN  [SharedPool-Worker-1] 2021-07-30 02:01:53,990 ReadCommand.java:520 - Read 452 live rows and 1392 tombstone cells for query SELECT * FROM keyspace.table1 WHERE id = 456 LIMIT 5000 (see tombstone_warn_threshold)
WARN [SharedPool-Worker-1] 2021-07-30 02:01:53,990 ReadCommand.java:520 - Read 2 live rows and 1892 tombstone cells for query SELECT * FROM keyspace.table1 WHERE id = 9887 LIMIT 5000 (see tombstone_warn_threshold)

In this warning, too many tombstones are being read, which slows the query down. Tune gc_grace_seconds according to your workload.
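
Tombstone warnings follow a regular format, so the live and tombstone counts can be pulled out for trending. A sketch assuming the message format shown in the sample output above:

```python
import re

# Pattern matching the tombstone warning format shown above.
TOMBSTONE_RE = re.compile(
    r"Read (\d+) live rows and (\d+) tombstone cells for query (.*?) "
    r"\(see tombstone_warn_threshold\)"
)

def tombstone_warning(log_line):
    """Return (live_rows, tombstones, query) or None if the line doesn't match."""
    m = TOMBSTONE_RE.search(log_line)
    if not m:
        return None
    return int(m.group(1)), int(m.group(2)), m.group(3)

line = ("WARN [SharedPool-Worker-1] 2021-07-30 02:01:53,990 ReadCommand.java:520 "
        "- Read 2 live rows and 1892 tombstone cells for query "
        "SELECT * FROM keyspace.table1 WHERE id = 9887 LIMIT 5000 "
        "(see tombstone_warn_threshold)")
print(tombstone_warning(line)[:2])  # (2, 1892)
```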

Large partition warning
grep "Writing large partition" system.log | tail -10

Sample Output

WARN  [CompactionExecutor:13606] 2021-08-08 00:00:35,968 BigTableWriter.java:171 - Writing large partition keyspace/table1:8354000001876068 (251029397 bytes to sstable /home/cassandra/data_directories/data/keyspace/table1-01e71f5bfc3d36fd80b7abac29bd7766/mc-77869-big-Data.db) 

WARN [CompactionExecutor:14390] 2021-08-11 00:00:36,176 BigTableWriter.java:171 - Writing large partition keyspace/table1:8354000001876068 (258559119 bytes to sstable /home/cassandra/data_directories/data/keyspace/table1-01e71f5bfc3d36fd80b7abac29bd7766/mc-77873-big-Data.db)
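
These warnings can also be parsed to track the largest partitions over time. A sketch assuming the message format in the sample above:

```python
import re

# Matches the "Writing large partition" warning format shown above.
LARGE_PART_RE = re.compile(r"Writing large partition (\S+) \((\d+) bytes")

def large_partitions(log_text):
    """Return {partition: max_bytes_seen} across all matching warnings."""
    sizes = {}
    for part, size in LARGE_PART_RE.findall(log_text):
        sizes[part] = max(sizes.get(part, 0), int(size))
    return sizes

log = ("WARN ... Writing large partition keyspace/table1:8354000001876068 "
       "(251029397 bytes to sstable ...)\n"
       "WARN ... Writing large partition keyspace/table1:8354000001876068 "
       "(258559119 bytes to sstable ...)")
print(large_partitions(log))
# {'keyspace/table1:8354000001876068': 258559119}
```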

Debug log

The debug log is similar to system.log but provides additional debugging information. It is noisier than system.log, so it's recommended to use a centralized logging or log-analysis tool to search it efficiently.

If you plan to search manually, it helps to have a list of messages you are going to search for.

#Lists the count of column families flushed
grep 'Flushing largest CFS' debug.log | awk '{print $9, $10}' | sort | uniq -c

4 CFS(Keyspace='cassandra', ColumnFamily='table1')
1 CFS(Keyspace='cassandra', ColumnFamily='table2')
2 CFS(Keyspace='cassandra', ColumnFamily='table1')
1 CFS(Keyspace='system', ColumnFamily='size_estimates')
1 CFS(Keyspace='sampleks', ColumnFamily='sampletable')
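
The same tally can be done with collections.Counter. A sketch assuming the "Flushing largest CFS(...)" wording shown above:

```python
import re
from collections import Counter

# Tally flushes per column family, equivalent to the grep|awk|sort|uniq pipeline.
FLUSH_RE = re.compile(
    r"Flushing largest (CFS\(Keyspace='[^']+', ColumnFamily='[^']+'\))")

def flush_counts(log_text):
    """Return a Counter of flushed column families."""
    return Counter(FLUSH_RE.findall(log_text))

log = ("DEBUG ... Flushing largest CFS(Keyspace='cassandra', ColumnFamily='table1') ...\n"
       "DEBUG ... Flushing largest CFS(Keyspace='cassandra', ColumnFamily='table1') ...\n"
       "DEBUG ... Flushing largest CFS(Keyspace='system', ColumnFamily='size_estimates') ...")
print(flush_counts(log).most_common())
```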

gc.log

This is a conventional Java GC log. You can use it to fine-tune the JVM and GC configurations.

Fixing the issue

In this article, we went through various troubleshooting steps, from identifying problematic nodes to narrowing down the problem. We also covered appropriate actions to resolve common issues, such as changing the data model, optimizing the query pattern, and scaling nodes.