First steps
Once you have followed the steps in the Installation section to install the operator and its dependencies, you can deploy an HDFS cluster and the services it depends on. Afterwards, verify that it works by creating, checking and deleting a test file in HDFS.
Setup
Zookeeper
To deploy a Zookeeper cluster, create a file called zk.yaml:
---
apiVersion: zookeeper.stackable.tech/v1alpha1
kind: ZookeeperCluster
metadata:
  name: simple-zk
spec:
  image:
    productVersion: 3.8.0
    stackableVersion: 23.1.0
  servers:
    roleGroups:
      default:
        replicas: 1
We also need to define a ZNode that will be used by the HDFS cluster to reference Zookeeper. Create another file called znode.yaml:
---
apiVersion: zookeeper.stackable.tech/v1alpha1
kind: ZookeeperZnode
metadata:
  name: simple-hdfs-znode
spec:
  clusterRef:
    name: simple-zk
Apply both of these files:
kubectl apply -f zk.yaml
kubectl apply -f znode.yaml
The state of the Zookeeper cluster can be tracked with kubectl:
kubectl rollout status --watch statefulset/simple-zk-server-default
HDFS
An HDFS cluster has three components: the namenode, the datanode and the journalnode. Create a file named hdfs.yaml defining two namenodes, one datanode and one journalnode:
---
apiVersion: hdfs.stackable.tech/v1alpha1
kind: HdfsCluster
metadata:
  name: simple-hdfs
spec:
  image:
    productVersion: 3.3.4
    stackableVersion: 23.1.0
  zookeeperConfigMapName: simple-hdfs-znode
  dfsReplication: 3
  nameNodes:
    roleGroups:
      default:
        replicas: 2
  dataNodes:
    roleGroups:
      default:
        replicas: 1
  journalNodes:
    roleGroups:
      default:
        replicas: 1
Where:
- metadata.name contains the name of the HDFS cluster
- spec.image selects the Docker image provided by Stackable: productVersion is the Hadoop version and stackableVersion is the Stackable version
Please note that specifying the version of Hadoop which you want to roll out is not enough: you also have to specify a Stackable version, as shown. This Stackable version is the version of the underlying container image which is used to execute the processes. For a list of available versions please check our image registry. It should generally be safe to simply use the latest image version that is available.
Create the actual HDFS cluster by applying the file:
kubectl apply -f hdfs.yaml
Track the progress with kubectl, as this step may take a few minutes:
kubectl rollout status --watch statefulset/simple-hdfs-datanode-default
kubectl rollout status --watch statefulset/simple-hdfs-namenode-default
kubectl rollout status --watch statefulset/simple-hdfs-journalnode-default
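The three rollout checks above can also be wrapped in a small loop. The following is only a sketch: the function names hdfs_statefulsets and wait_for_hdfs are hypothetical, the <cluster>-<role>-<rolegroup> naming is inferred from the StatefulSet names used in this guide, and the 300s timeout is an arbitrary assumption.

```shell
#!/bin/sh
# Print the StatefulSet names for one HDFS cluster, one per line.
# Assumption: names follow <cluster>-<role>-<rolegroup>, as in the
# three commands above; the role group defaults to "default".
hdfs_statefulsets() {
  cluster="$1"
  rolegroup="${2:-default}"
  for role in datanode namenode journalnode; do
    printf '%s-%s-%s\n' "$cluster" "$role" "$rolegroup"
  done
}

# Wait for each StatefulSet in turn and stop at the first one that
# fails or does not finish within the timeout (assumed default: 300s).
wait_for_hdfs() {
  cluster="$1"
  timeout="${2:-300s}"
  for sts in $(hdfs_statefulsets "$cluster"); do
    kubectl rollout status --watch "statefulset/$sts" --timeout="$timeout" || return 1
  done
}
```

With these functions defined, `wait_for_hdfs simple-hdfs` should block until all three StatefulSets of the example cluster are rolled out, or return a non-zero status if one of them is not ready in time.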
Verify that it works
To test the cluster you can create a new file, check its status and then delete it. We will execute these actions from within a helper pod. Create a file called webhdfs.yaml:
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: webhdfs
  labels:
    app: webhdfs
spec:
  replicas: 1
  serviceName: webhdfs-svc
  selector:
    matchLabels:
      app: webhdfs
  template:
    metadata:
      labels:
        app: webhdfs
    spec:
      containers:
        - name: webhdfs
          image: docker.stackable.tech/stackable/testing-tools:0.1.0-stackable0.1.0
          stdin: true
          tty: true
Apply it and monitor its progress:
kubectl apply -f webhdfs.yaml
kubectl rollout status --watch statefulset/webhdfs
To begin with, the cluster should be empty. You can verify this by listing all resources in the root directory, which should return an empty array:
kubectl exec -n default webhdfs-0 -- curl -s -XGET "http://simple-hdfs-namenode-default-0:9870/webhdfs/v1/?op=LISTSTATUS"
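If you want to script this check rather than read the output by eye, a small filter can help. This is a sketch under the assumption that an empty directory is reported as {"FileStatuses":{"FileStatus":[]}}, which is the usual shape of a WebHDFS LISTSTATUS response; the function name is_empty_listing is hypothetical.

```shell
#!/bin/sh
# Succeed if the LISTSTATUS response on stdin contains an empty
# FileStatus array. Whitespace is stripped first so that pretty-printed
# and compact JSON are treated the same way.
is_empty_listing() {
  tr -d ' \n\t' | grep -qF '"FileStatus":[]'
}
```

For example, piping the output of the LISTSTATUS command above into `is_empty_listing && echo "HDFS root is empty"` should print the message only for a fresh cluster.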
Creating a file in HDFS using the WebHDFS API requires a two-step PUT: the namenode first answers with the location of a datanode, and the data is then sent to that location. The reason for the two-step create/append is to prevent clients from sending out data before the redirect. First, create a local file called testdata.txt with some text in it and copy it to the /tmp directory on the helper pod:
kubectl cp -n default ./testdata.txt webhdfs-0:/tmp
Then use curl to issue a PUT command:
kubectl exec -n default webhdfs-0 -- \
  curl -s -XPUT -T /tmp/testdata.txt "http://simple-hdfs-namenode-default-0:9870/webhdfs/v1/testdata.txt?user.name=stackable&op=CREATE&noredirect=true"
This will return a location that will look something like this:
http://simple-hdfs-datanode-default-0.simple-hdfs-datanode-default.default.svc.cluster.local:9864/webhdfs/v1/testdata.txt?op=CREATE&user.name=stackable&namenoderpcaddress=simple-hdfs&createflag=&createparent=true&overwrite=false
You can assign this to a local variable (e.g. $location) or copy and paste it into the command directly, and then issue a second PUT like this:
kubectl exec -n default webhdfs-0 -- curl -s -XPUT -T /tmp/testdata.txt "$location"
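The two steps can be chained so that the location does not have to be copied by hand. The sketch below assumes that, with noredirect=true, the namenode replies with a JSON object of the form {"Location":"..."} as described in the Hadoop WebHDFS documentation; the function names extract_location and webhdfs_create are hypothetical, and the pod, namespace, host and user names are the ones used in this guide.

```shell
#!/bin/sh
# Read a WebHDFS response from stdin and print the value of its
# "Location" field. Prints nothing if the field is missing.
extract_location() {
  sed -n 's/.*"Location"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Two-step create: ask the namenode where to write, then send the file
# to the returned datanode location. Both paths are inside the helper
# pod / HDFS respectively, e.g. /tmp/testdata.txt and /testdata.txt.
webhdfs_create() {
  pod_file="$1"
  hdfs_path="$2"
  location="$(kubectl exec -n default webhdfs-0 -- \
    curl -s -XPUT -T "$pod_file" \
    "http://simple-hdfs-namenode-default-0:9870/webhdfs/v1${hdfs_path}?user.name=stackable&op=CREATE&noredirect=true" \
    | extract_location)"
  if [ -z "$location" ]; then
    echo "no Location found in the namenode response" >&2
    return 1
  fi
  kubectl exec -n default webhdfs-0 -- curl -s -XPUT -T "$pod_file" "$location"
}
```

Calling `webhdfs_create /tmp/testdata.txt /testdata.txt` would then perform both PUT requests in sequence. Using sed keeps the sketch free of extra tools; if jq is available, `jq -r .Location` is a more robust way to read the field.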
Checking the status again with:
kubectl exec -n default webhdfs-0 -- curl -s -XGET "http://simple-hdfs-namenode-default-0:9870/webhdfs/v1/?op=LISTSTATUS"
will now display some metadata about the file that was created in the HDFS cluster:
{
  "FileStatuses": {
    "FileStatus": [
      {
        "accessTime": 1660821734999,
        "blockSize": 134217728,
        "childrenNum": 0,
        "fileId": 16396,
        "group": "supergroup",
        "length": 597,
        "modificationTime": 1660821735602,
        "owner": "stackable",
        "pathSuffix": "testdata.txt",
        "permission": "644",
        "replication": 3,
        "storagePolicy": 0,
        "type": "FILE"
      }
    ]
  }
}
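To confirm from a script that the file is present, it can be enough to pull the pathSuffix values out of this response. The following sketch uses only grep and sed and assumes a response shaped like the one above; the function name list_path_suffixes is hypothetical, and file names containing escaped quotes are not handled.

```shell
#!/bin/sh
# Read a LISTSTATUS response from stdin and print each pathSuffix
# value on its own line.
list_path_suffixes() {
  grep -o '"pathSuffix"[[:space:]]*:[[:space:]]*"[^"]*"' \
    | sed -e 's/^"pathSuffix"[[:space:]]*:[[:space:]]*"//' -e 's/"$//'
}
```

Piping the LISTSTATUS output into `list_path_suffixes | grep -qx testdata.txt` should then succeed once the upload has completed.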
To clean up, the file can be deleted like this:
kubectl exec -n default webhdfs-0 -- curl -s -XDELETE "http://simple-hdfs-namenode-default-0:9870/webhdfs/v1/testdata.txt?user.name=stackable&op=DELETE"
What’s next
Look at the Usage guide to find out more about configuring your HDFS cluster.