Node Management
Key Considerations When Rebooting a Node
- Check if the node has any rook-ceph-osd-* pods. Verify the health of the corresponding Ceph cluster and bring down only one node at a time (example check commands are sketched after this list).
- Check for haproxy-ingress-* pods. If the node will be down for an extended period, disable its record in Constellix DNS.
- Check if the node has the nautilus.io/linstor-server label. Such a node serves as a Linstor server; some Linstor servers are redundant, while others are critical.
- Check if the node has the nautilus.io/bgp-speaker label. Two nodes are used for MetalLB IPs, so make sure at least one remains active.
- Check if the node has the node-role.kubernetes.io/master label. Rebooting a master node will make the cluster inaccessible, unless it is an Admiralty virtual node.
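The checks above can typically be done with standard kubectl queries. A minimal sketch follows, using the same <nodename> placeholder as the reboot command later in this page; confirm the pod and label names against the list above.
# List all pods running on the node; look for rook-ceph-osd-* and haproxy-ingress-* pods
kubectl get pods -A -o wide --field-selector spec.nodeName=<nodename>
# Show the node's labels; look for nautilus.io/linstor-server, nautilus.io/bgp-speaker, and node-role.kubernetes.io/master
kubectl get node <nodename> --show-labels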
Prerequisites
- Install Ansible on your local computer.
- Clone the repository of Ansible playbooks:
git clone https://gitlab.nrp-nautilus.io/prp/nautilus-ansible.git
- Pull the latest updates from the playbook repository:
cd nautilus-ansible; git pull
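Before running a playbook, it can help to confirm that Ansible is installed and that the target node actually appears in the inventory. A minimal sketch, assuming the playbooks are run from the directory containing the cloned repository:
# Verify the local Ansible installation
ansible --version
# Show the inventory as a tree; the node you plan to reboot should be listed
ansible-inventory -i nautilus-ansible/nautilus-hosts.yaml --graph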
Reboot a Node Due to GPU Failure
Use the following command to reboot the node:
ansible-playbook reboot.yaml -i nautilus-ansible/nautilus-hosts.yaml -l <nodename>
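After the playbook finishes, a quick way to confirm the node rejoined the cluster is to check its status; this is a hedged example using the same <nodename> placeholder:
# The node should report Ready; if it does not, "kubectl describe node <nodename>" shows recent events
kubectl get node <nodename>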
Special Instructions to Reboot Ceph Nodes
To maintain redundancy in the Ceph cluster, only one node can be rebooted at a time.
Run this command to enter the rook-ceph-tools pod shell. Replace <namespace> with the appropriate Ceph cluster namespace (e.g., rook, rook-east, rook-pacific, rook-haosu, rook-suncave):
kubectl exec -it -n <namespace> $(kubectl get pods -n <namespace> --selector=app=rook-ceph-tools --output=jsonpath={.items..metadata.name}) -- bash
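For example, for the rook namespace listed above, the command expands to the following; substitute whichever cluster namespace you are working on:
kubectl exec -it -n rook $(kubectl get pods -n rook --selector=app=rook-ceph-tools --output=jsonpath={.items..metadata.name}) -- bash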
Once inside the pod shell, run:
watch ceph health detail
Wait until [WRN] OSD_DOWN: 1 osds down disappears from the ceph health detail output before rebooting the next node.
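If the warning persists, it can be useful to see which OSDs are down and on which hosts. A minimal sketch using standard Ceph commands, run from the same rook-ceph-tools pod shell:
# Overall cluster status, including how many OSDs are up and in
ceph -s
# Tree of hosts and OSDs with their up/down state; identify any OSDs that remain down
ceph osd tree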