Allowed Pod disruptions
You can configure the permitted Pod disruptions for HDFS nodes as described in Allowed Pod disruptions.
Unless you configure something else or disable our PodDisruptionBudgets (PDBs), we write the following PDBs:
JournalNodes
We only allow a single JournalNode to be offline at any given time, regardless of the number of replicas or `roleGroups`.
NameNodes
We only allow a single NameNode to be offline at any given time, regardless of the number of replicas or `roleGroups`.
DataNodes
For DataNodes, the question of how many instances may be unavailable at the same time is harder to answer:
HDFS stores your blocks on the DataNodes.
Every block can be replicated multiple times (to multiple DataNodes) to ensure maximum availability.
The default replication factor is 3, which can be configured using `spec.clusterConfig.dfsReplication`. However, it is also possible to change the replication factor for a specific file or directory to something other than the cluster default.
With a replication factor of 3, you can safely take down two DataNodes, as there will always be a third DataNode holding a copy of each block currently assigned to one of the unavailable DataNodes. However, you need to be aware that you are then down to a single point of failure - the last of three replicas!
Taking this into consideration, our operator uses the following algorithm to determine the maximum number of DataNodes allowed to be unavailable at the same time:
`num_datanodes` is the number of DataNodes in the HDFS cluster, summed over all `roleGroups`.
`dfs_replication` is the default replication factor of the cluster.
```rust
use std::cmp::{max, min};

/// Maximum number of DataNodes allowed to be unavailable at the same time.
fn max_unavailable_datanodes(num_datanodes: u16, dfs_replication: u16) -> u16 {
    // There must always be a DataNode left to serve each block. If we simply
    // subtracted one from `dfs_replication`, we would end up with a single
    // point of failure, so we subtract two instead.
    let max_unavailable = dfs_replication.saturating_sub(2);
    // Likewise, at least two DataNodes need to remain available, so at most
    // n - 2 DataNodes may be unavailable (again avoiding a single point of failure).
    let max_unavailable = min(max_unavailable, num_datanodes.saturating_sub(2));
    // Clamp to at least a single node allowed to be offline,
    // so we don't block Kubernetes nodes from draining.
    max(max_unavailable, 1)
}
```
For example, this results in the following numbers:
| Number of DataNodes | Replication factor | Maximum unavailable DataNodes |
|---|---|---|
| 1 | any factor | 1 |
| 2 | any factor | 1 |
| 3 | any factor | 1 |
| 4 | 1, 2 or 3 | 1 |
| 4 | 4, 5 or higher | 2 |
| 5 | 1, 2 or 3 | 1 |
| 5 | 4 | 2 |
| 5 | 5 or higher | 3 |
| 100 | 1, 2 or 3 | 1 |
| 100 | 4 | 2 |
| 100 | 5 - 100 | replication factor - 2 |
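As a quick sanity check, here is a small sketch (assuming the `max_unavailable_datanodes` function from the listing above) that reproduces a few rows of this table:

```rust
fn main() {
    // With 1 or 2 DataNodes we still allow one disruption,
    // so Kubernetes nodes are not blocked from draining.
    assert_eq!(max_unavailable_datanodes(1, 3), 1);
    assert_eq!(max_unavailable_datanodes(2, 3), 1);
    // With 100 DataNodes, the replication factor is the limiting factor.
    assert_eq!(max_unavailable_datanodes(100, 3), 1);
    assert_eq!(max_unavailable_datanodes(100, 4), 2);
    assert_eq!(max_unavailable_datanodes(100, 5), 3); // replication factor - 2
}
```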
Reduce rolling redeployment durations
The default PDBs we write out are pessimistic and will cause the rolling redeployment to take a considerable amount of time.
As an example, when you have 100 DataNodes and a replication factor of 3, we can only safely take down a single DataNode at a time. Assuming a DataNode takes 1 minute to properly restart, the whole re-deployment would take 100 minutes.
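To make that arithmetic explicit, here is a minimal sketch; the helper name and the batch model (DataNodes restarting in groups of at most `maxUnavailable`, one minute per batch) are assumptions for illustration only:

```rust
/// Rough duration of a rolling redeployment, assuming DataNodes restart in
/// batches of `max_unavailable` and each batch takes `restart_minutes`.
/// (Hypothetical helper, not part of the operator.)
fn rolling_redeploy_minutes(num_datanodes: u32, max_unavailable: u32, restart_minutes: u32) -> u32 {
    num_datanodes.div_ceil(max_unavailable) * restart_minutes
}

fn main() {
    // 100 DataNodes, replication factor 3 => only one may be down at a time.
    assert_eq!(rolling_redeploy_minutes(100, 1, 1), 100); // 100 minutes
}
```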
You can use the following measures to speed this up:

- Increase the replication factor, e.g. from `3` to `5`. In this case the number of allowed disruptions triples from `1` to `3` (assuming >= 5 DataNodes), reducing the time it takes by 66% (see the sketch after this list).
- Increase `maxUnavailable` using the `spec.dataNodes.roleConfig.podDisruptionBudget.maxUnavailable` field as described in Allowed Pod disruptions.
- Write your own PDBs as described in Using your own custom PDBs.
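For illustration, a sketch (reusing the hypothetical `rolling_redeploy_minutes` helper from above) of how the first two measures shorten the 100-minute example:

```rust
fn main() {
    // Baseline: replication factor 3 => at most 1 DataNode down => 100 minutes.
    assert_eq!(rolling_redeploy_minutes(100, 1, 1), 100);
    // Replication factor 5 => at most 3 DataNodes down => 34 minutes (~66% less).
    assert_eq!(rolling_redeploy_minutes(100, 3, 1), 34);
    // maxUnavailable raised to e.g. 10 => 10 minutes.
    assert_eq!(rolling_redeploy_minutes(100, 10, 1), 10);
}
```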
In case you modify or disable the default PDBs, it's your responsibility to either make sure there are enough DataNodes available or accept the risk of blocks not being available!