Node Management

Key Considerations When Rebooting a Node

  1. Check if the node has any rook-ceph-osd-* pods. Verify the health of the corresponding Ceph cluster and bring down only one node at a time (example kubectl commands for these checks follow this list).
  2. Check for haproxy-ingress-* pods. If the node will be down for an extended period, disable its record in Constellix DNS.
  3. Check if the node has the nautilus.io/linstor-server label, which indicates the node serves as a Linstor server. Some Linstor servers are redundant, while others are critical.
  4. Check if the node has the nautilus.io/bgp-speaker label. There are two nodes used for MetalLB IPs—ensure one remains active.
  5. Check if the node has the node-role.kubernetes.io/master label. Rebooting a master node will make the cluster inaccessible unless it is an Admiralty virtual node.
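
A minimal sketch of those checks, using the placeholder <nodename> (substitute the actual node name):

Terminal window
# List all pods scheduled on the node; look for rook-ceph-osd-* and haproxy-ingress-* pods
kubectl get pods --all-namespaces --field-selector spec.nodeName=<nodename> -o wide
# Show the node's labels; look for nautilus.io/linstor-server, nautilus.io/bgp-speaker,
# and node-role.kubernetes.io/master
kubectl get node <nodename> --show-labels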

Prerequisites

  1. Install Ansible on your local computer (an example install command is shown after this list).
  2. Clone the repository of Ansible playbooks:
Terminal window
git clone https://gitlab.nrp-nautilus.io/prp/nautilus-ansible.git
  3. Pull the latest updates from the playbook repository:
Terminal window
cd nautilus-ansible;
git pull
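
Ansible can be installed in several ways; a minimal sketch assuming a Python environment with pip (system package managers such as apt or brew also work):

Terminal window
python3 -m pip install --user ansible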

Reboot a Node Due to GPU Failure

Use the following command to reboot the node:

Terminal window
ansible-playbook reboot.yaml -i nautilus-ansible/nautilus-hosts.yaml -l <nodename>
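
After the playbook completes, you can confirm that the node has rejoined the cluster and reports Ready (a quick check, not part of the playbook itself):

Terminal window
kubectl get node <nodename>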

Special Instructions to Reboot Ceph Nodes

To maintain redundancy in the Ceph cluster, reboot only one Ceph node at a time.

Run this command to enter the rook-ceph-tools pod shell. Replace <namespace> with the appropriate Ceph cluster namespace (e.g., rook, rook-east, rook-pacific, rook-haosu, rook-suncave):

Terminal window
kubectl exec -it -n <namespace> $(kubectl get pods -n <namespace> --selector=app=rook-ceph-tools --output=jsonpath={.items..metadata.name}) -- bash

Once inside the pod shell, run:

Terminal window
watch ceph health detail

Wait until [WRN] OSD_DOWN: 1 osds down disappears from the ceph health detail output before rebooting the next node.
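
As an additional check before moving on, you can confirm from the same tools pod shell that all OSDs are up and the cluster is healthy again (standard Ceph CLI commands):

Terminal window
# All OSDs should be reported as up and in
ceph osd stat
# Overall status; look for HEALTH_OK before rebooting the next node
ceph -s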