RDMA with RoCE
Introduction
This chapter introduces how Pods can access the network through the RoCE interfaces of the host.
Features
The RDMA subsystem supports two network namespace modes: shared and exclusive. Containers can either share RDMA network cards or access them exclusively. In Kubernetes, shared cards can be used with the macvlan or ipvlan CNI, while exclusive cards require the SR-IOV CNI.
- Shared mode. Spiderpool leverages macvlan or ipvlan CNI to expose RoCE network cards on the host machine for all Pods. The RDMA shared device plugin is employed for exposing RDMA card resources and scheduling Pods.
- Exclusive mode. Spiderpool utilizes SR-IOV CNI to expose RDMA cards on the host machine for Pods, providing access to RDMA resources. RDMA CNI is used to ensure isolation of RDMA devices.
For isolated RDMA network cards, at least one of the following conditions must be met:
(1) A kernel 5.3.0 or newer with the RDMA modules loaded; the rdma-core package provides a way to load the relevant modules automatically at system start.
(2) Mellanox OFED 4.7 or newer, in which case a kernel 5.3.0 or newer is not required.
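A quick way to check these conditions on a host is sketched below; it assumes a Debian- or RHEL-like distribution, and ofed_info is only present when Mellanox OFED is installed:

# condition (1): kernel version and RDMA core module
~# uname -r
~# modprobe ib_core && lsmod | grep -w ib_core

# condition (2): Mellanox OFED version, if installed
~# ofed_info -s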
Shared RoCE NIC with macvlan or ipvlan
The following steps demonstrate how to enable shared usage of RDMA devices by Pods in a cluster with two nodes via macvlan CNI:
- Ensure that the host machine has an RDMA NIC installed and that its driver is properly installed so that RDMA works correctly.
In our demo environment, the host machine is equipped with a Mellanox ConnectX-5 NIC with RoCE capabilities. Follow the official NVIDIA guide to install the latest OFED driver. To confirm the presence of RoCE devices, use the following commands:
~# rdma link
link mlx5_0/1 state ACTIVE physical_state LINK_UP netdev ens6f0np0
link mlx5_1/1 state ACTIVE physical_state LINK_UP netdev ens6f1np1

~# ibstat mlx5_0 | grep "Link layer"
    Link layer: Ethernet
Make sure that the RDMA subsystem of the host is in shared mode. If not, switch to shared mode.
~# rdma system
netns shared copy-on-fork on
# switch to shared mode
~# rdma system set netns shared
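If exclusive mode was previously made persistent through the ib_core module option (see the exclusive-mode section below), a sketch for reverting that persistent setting is shown here; netns_mode=1 is assumed to correspond to shared mode, the kernel default:

~# echo "options ib_core netns_mode=1" > /etc/modprobe.d/ib_core.conf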
- Verify the details of the RDMA NIC so that the device plugin can discover the device resources later.
Run the following command; in this environment the NIC vendor is 15b3 and its deviceID is 1017:
~# lspci -nn | grep Ethernet
af:00.0 Ethernet controller [0200]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]
af:00.1 Ethernet controller [0200]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]
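To cross-check which netdev name belongs to which PCI address, sysfs can be queried as sketched below; the output assumes that af:00.0 corresponds to ens6f0np0, as in this demo environment:

~# ls /sys/bus/pci/devices/0000:af:00.0/net/
ens6f0np0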
- Install Spiderpool and configure the RDMA shared device plugin:
helm upgrade --install spiderpool spiderpool/spiderpool --namespace kube-system --reuse-values \
    --set rdma.rdmaSharedDevicePlugin.install=true \
    --set rdma.rdmaSharedDevicePlugin.deviceConfig.resourcePrefix="spidernet.io" \
    --set rdma.rdmaSharedDevicePlugin.deviceConfig.resourceName="hca_shared_devices" \
    --set rdma.rdmaSharedDevicePlugin.deviceConfig.rdmaHcaMax=500 \
    --set rdma.rdmaSharedDevicePlugin.deviceConfig.vendors="15b3" \
    --set rdma.rdmaSharedDevicePlugin.deviceConfig.deviceIDs="1017"
- If macvlan is not installed in your cluster, you can specify the Helm parameter --set plugins.installCNI=true to install macvlan in your cluster.
- If you are a user in China, you can specify the parameter --set global.imageRegistryOverride=ghcr.m.daocloud.io to avoid image pull failures from Spiderpool.
After completing the installation of Spiderpool, you can manually edit the spiderpool-rdma-shared-device-plugin configmap to reconfigure the RDMA shared device plugin.
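For example, the current configuration can be inspected and changed as sketched below; the daemonset name is assumed from the Pod names listed in the next step, and its Pods need to be restarted for changes to take effect:

~# kubectl get configmap -n kube-system spiderpool-rdma-shared-device-plugin -o yaml
~# kubectl edit configmap -n kube-system spiderpool-rdma-shared-device-plugin
~# kubectl rollout restart daemonset -n kube-system spiderpool-rdma-shared-device-plugin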
Once the installation is complete, the following components will be installed:
~# kubectl get pod -n kube-system
spiderpool-agent-9sllh                       1/1     Running     0          1m
spiderpool-agent-h92bv                       1/1     Running     0          1m
spiderpool-controller-7df784cdb7-bsfwv       1/1     Running     0          1m
spiderpool-init                              0/1     Completed   0          1m
spiderpool-rdma-shared-device-plugin-dr7w8   1/1     Running     0          1m
spiderpool-rdma-shared-device-plugin-zj65g   1/1     Running     0          1m
- View the available resources on a node, including the reported RDMA device resources:
~# kubectl get no -o json | jq -r '[.items[] | {name:.metadata.name, allocable:.status.allocatable}]'
[
  {
    "name": "10-20-1-10",
    "allocable": {
      "cpu": "40",
      "memory": "263518036Ki",
      "pods": "110",
      "spidernet.io/hca_shared_devices": "500",
      ...
    }
  },
  ...
]
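To list only the shared RDMA resource per node, a narrower query can be sketched as follows; the output assumes the demo node shown above:

~# kubectl get no -o json | jq -r '.items[] | "\(.metadata.name): \(.status.allocatable["spidernet.io/hca_shared_devices"])"'
10-20-1-10: 500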
If the reported resource count is 0, it may be due to the following reasons:
(1) Verify that the vendors and deviceID in the spiderpool-rdma-shared-device-plugin configmap match the actual values.
(2) Check the logs of the rdma-shared-device-plugin. If you encounter errors like the following about RDMA NIC support, try installing rdma-core on the host machine (apt-get install rdma-core or dnf install rdma-core):
error creating new device: "missing RDMA device spec for device 0000:04:00.0, RDMA device \"issm\" not found"
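This error typically means the RDMA character devices were never created on the host. A quick check is sketched below; the exact device names depend on the host, but issm*, umad* and uverbs* entries should exist for each port once rdma-core (or OFED) is installed:

~# ls /dev/infiniband/
issm0  issm1  rdma_cm  umad0  umad1  uverbs0  uverbs1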
- Create a macvlan CNI configuration, specifying spec.macvlan.master as an RDMA NIC of the node, and set up the corresponding IPPool resources:

cat <<EOF | kubectl apply -f -
apiVersion: spiderpool.spidernet.io/v2beta1
kind: SpiderIPPool
metadata:
  name: v4-81
spec:
  gateway: 172.81.0.1
  ips:
    - 172.81.0.100-172.81.0.120
  subnet: 172.81.0.0/16
---
apiVersion: spiderpool.spidernet.io/v2beta1
kind: SpiderMultusConfig
metadata:
  name: macvlan-ens6f0np0
  namespace: kube-system
spec:
  cniType: macvlan
  macvlan:
    master:
      - "ens6f0np0"
    ippools:
      ipv4: ["v4-81"]
EOF
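As a verification sketch, the IPPool should exist and Spiderpool should have generated a matching Multus NetworkAttachmentDefinition for the macvlan configuration; the resource names are assumed from the manifest above:

~# kubectl get spiderippool v4-81
~# kubectl get network-attachment-definitions.k8s.cni.cncf.io -n kube-system macvlan-ens6f0np0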
- Using the configuration from the previous step, create a DaemonSet application that spans the nodes for testing:
ANNOTATION_MULTUS="v1.multus-cni.io/default-network: kube-system/macvlan-ens6f0np0"
RESOURCE="spidernet.io/hca_shared_devices"
NAME=rdma-macvlan
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ${NAME}
  labels:
    app: $NAME
spec:
  selector:
    matchLabels:
      app: $NAME
  template:
    metadata:
      name: $NAME
      labels:
        app: $NAME
      annotations:
        ${ANNOTATION_MULTUS}
    spec:
      containers:
      - image: docker.io/mellanox/rping-test
        imagePullPolicy: IfNotPresent
        name: mofed-test
        securityContext:
          capabilities:
            add: [ "IPC_LOCK" ]
        resources:
          limits:
            ${RESOURCE}: 1
        command:
        - sh
        - -c
        - |
          ls -l /dev/infiniband /sys/class/net
          sleep 1000000
EOF
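A quick verification sketch: both Pods should be Running, one per node, with IPv4 addresses assigned from the v4-81 pool (the label is taken from the manifest above):

~# kubectl get pod -o wide -l app=rdma-macvlan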
- Verify that RDMA communication works between the Pods across nodes.
Open a terminal, enter one Pod, and launch a service:
# You are able to see all the RDMA cards on the host machine
~# rdma link
0/1: mlx5_0/1: state ACTIVE physical_state LINK_UP
1/1: mlx5_1/1: state ACTIVE physical_state LINK_UP

# Start an RDMA service
~# ib_read_lat
Open another terminal, enter the other Pod, and access the service:
# You are able to see all the RDMA cards on the host machine
~# rdma link
0/1: mlx5_0/1: state ACTIVE physical_state LINK_UP
1/1: mlx5_1/1: state ACTIVE physical_state LINK_UP

# Access the service running in the other Pod
~# ib_read_lat 172.81.0.120
---------------------------------------------------------------------------------------
                    RDMA_Read Latency Test
 Dual-port       : OFF          Device         : mlx5_0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 TX depth        : 1
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 12
 Outstand reads  : 16
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x0107 PSN 0x79dd10 OUT 0x10 RKey 0x1fddbc VAddr 0x000000023bd000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:172:81:00:119
 remote address: LID 0000 QPN 0x0107 PSN 0x40001a OUT 0x10 RKey 0x1fddbc VAddr 0x00000000bf9000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:172:81:00:120
---------------------------------------------------------------------------------------
 #bytes  #iterations  t_min[usec]  t_max[usec]  t_typical[usec]  t_avg[usec]  t_stdev[usec]  99% percentile[usec]  99.9% percentile[usec]
Conflicting CPU frequency values detected: 2200.000000 != 1040.353000. CPU Frequency is not max.
Conflicting CPU frequency values detected: 2200.000000 != 1849.351000. CPU Frequency is not max.
 2       1000         6.88         16.81        7.04             7.06         0.31           7.38                  16.81
---------------------------------------------------------------------------------------
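Besides latency, bandwidth can be checked with the same perftest suite, assuming ib_write_bw is available in the test image; run the server side in one Pod and point the client in the other Pod at its address:

# in the first Pod
~# ib_write_bw

# in the second Pod
~# ib_write_bw 172.81.0.120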
Isolated RoCE NIC with SR-IOV
The following steps demonstrate how to enable isolated usage of RDMA devices by Pods in a cluster with two nodes via SR-IOV CNI:
- Ensure that the host machine has an RDMA- and SR-IOV-capable NIC and that its driver is properly installed.
In our demo environment, the host machine is equipped with a Mellanox ConnectX-5 NIC with RoCE capabilities. Follow the official NVIDIA guide to install the latest OFED driver.
To confirm the presence of RoCE devices, use the following command:
~# rdma link
link mlx5_0/1 state ACTIVE physical_state LINK_UP netdev ens6f0np0
link mlx5_1/1 state ACTIVE physical_state LINK_UP netdev ens6f1np1

~# ibstat mlx5_0 | grep "Link layer"
    Link layer: Ethernet
Make sure that the RDMA subsystem on the host is operating in exclusive mode. If not, switch to exclusive mode.
# switch to exclusive mode; note that this setting does not survive a host reboot
~# rdma system set netns exclusive

# to make the setting persistent across reboots
~# echo "options ib_core netns_mode=0" >> /etc/modprobe.d/ib_core.conf

~# rdma system
netns exclusive copy-on-fork on
(Optional) In an SR-IOV scenario, applications can enable NVIDIA's GPUDirect RDMA feature. For instructions on installing the kernel module, please refer to the official documentation.
- Install Spiderpool, setting the Helm value --set sriov.install=true to enable the SR-IOV components.
If you are a user in China, you can specify the parameter --set global.imageRegistryOverride=ghcr.m.daocloud.io to pull images from a registry in China.
Once the installation is complete, the following components will be installed:

~# kubectl get pod -n kube-system
spiderpool-agent-9sllh                       1/1     Running     0          1m
spiderpool-agent-h92bv                       1/1     Running     0          1m
spiderpool-controller-7df784cdb7-bsfwv       1/1     Running     0          1m
spiderpool-sriov-operator-65b59cd75d-89wtg   1/1     Running     0          1m
spiderpool-init                              0/1     Completed   0          1m
- Configure the SR-IOV operator.
To enable the SR-IOV CNI on specific nodes, label those nodes with the following command. This allows the sriov-network-operator to install its components on the designated nodes.
kubectl label node $NodeName node-role.kubernetes.io/worker=""
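As a verification sketch, the labelled nodes should be listed, and the operator should report each node's NIC state through SriovNetworkNodeState resources; the namespace is assumed to match this installation (kube-system):

~# kubectl get node -l node-role.kubernetes.io/worker
~# kubectl get sriovnetworknodestates -n kube-system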
Look up the device information of the RoCE interfaces. Run the following command; in this environment the NIC vendor is 15b3 and the deviceID is 1017:
~# lspci -nn | grep Ethernet
af:00.0 Ethernet controller [0200]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]
af:00.1 Ethernet controller [0200]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]
Note that the number of VFs determines how many SR-IOV network interfaces can be provided to Pods on a host. NICs from different manufacturers have different VF limits; for example, the Mellanox ConnectX-5 used in this example can create up to 127 VFs.
Apply the following configuration to create the VFs on the host. Note that this may cause the nodes to reboot, as the new configuration has to take effect in the NIC driver.
cat <<EOF | kubectl apply -f -
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: roce-sriov
  namespace: kube-system
spec:
  nodeSelector:
    kubernetes.io/os: "linux"
  resourceName: mellanoxroce
  priority: 99
  numVfs: 12
  nicSelector:
    deviceID: "1017"
    rootDevices:
      - 0000:af:00.0
    vendor: "15b3"
  deviceType: netdevice
  isRdma: true
EOF
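A verification sketch once the policy has been applied: the node state reported by the operator should reach Succeeded, and the PF should expose the requested number of VFs (the PF name ens6f0np0 is assumed from this environment):

~# kubectl get sriovnetworknodestates -n kube-system -o jsonpath='{.items[*].status.syncStatus}'
Succeeded

~# cat /sys/class/net/ens6f0np0/device/sriov_numvfs
12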
Verify the available resources on the node, including the reported SR-IOV device resources:
~# kubectl get no -o json | jq -r '[.items[] | {name:.metadata.name, allocable:.status.allocatable}]'
[
  {
    "name": "10-20-1-10",
    "allocable": {
      "cpu": "40",
      "pods": "110",
      "spidernet.io/mellanoxroce": "12",
      ...
    }
  },
  ...
]
- Create the SR-IOV CNI configuration and the corresponding IPPool resources:
cat <<EOF | kubectl apply -f -
apiVersion: spiderpool.spidernet.io/v2beta1
kind: SpiderIPPool
metadata:
  name: v4-81
spec:
  gateway: 172.81.0.1
  ips:
    - 172.81.0.100-172.81.0.120
  subnet: 172.81.0.0/16
---
apiVersion: spiderpool.spidernet.io/v2beta1
kind: SpiderMultusConfig
metadata:
  name: roce-sriov
  namespace: kube-system
spec:
  cniType: sriov
  sriov:
    resourceName: spidernet.io/mellanoxroce
    enableRdma: true
    ippools:
      ipv4: ["v4-81"]
EOF
- Using the configuration from the previous step, create a DaemonSet application that spans the nodes for testing:
ANNOTATION_MULTUS="v1.multus-cni.io/default-network: kube-system/roce-sriov"
RESOURCE="spidernet.io/mellanoxroce"
NAME=rdma-sriov
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ${NAME}
  labels:
    app: $NAME
spec:
  selector:
    matchLabels:
      app: $NAME
  template:
    metadata:
      name: $NAME
      labels:
        app: $NAME
      annotations:
        ${ANNOTATION_MULTUS}
    spec:
      containers:
      - image: docker.io/mellanox/rping-test
        imagePullPolicy: IfNotPresent
        name: mofed-test
        securityContext:
          capabilities:
            add: [ "IPC_LOCK" ]
        resources:
          limits:
            ${RESOURCE}: 1
        command:
        - sh
        - -c
        - |
          ls -l /dev/infiniband /sys/class/net
          sleep 1000000
EOF
- Verify that RDMA communication works between the Pods across nodes.
Open a terminal, enter one Pod, and launch a service:
# Only the RDMA device allocated to the Pod can be found
~# rdma link
7/1: mlx5_3/1: state ACTIVE physical_state LINK_UP netdev eth0

# launch an RDMA service
~# ib_read_lat
Open another terminal, enter the other Pod, and access the service:
# Only the RDMA device allocated to the Pod can be found
~# rdma link
10/1: mlx5_5/1: state ACTIVE physical_state LINK_UP netdev eth0

# Access the service running in the other Pod
~# ib_read_lat 172.81.0.118
libibverbs: Warning: couldn't stat '/sys/class/infiniband/mlx5_4'.
libibverbs: Warning: couldn't stat '/sys/class/infiniband/mlx5_2'.
libibverbs: Warning: couldn't stat '/sys/class/infiniband/mlx5_0'.
libibverbs: Warning: couldn't stat '/sys/class/infiniband/mlx5_3'.
libibverbs: Warning: couldn't stat '/sys/class/infiniband/mlx5_1'.
---------------------------------------------------------------------------------------
                    RDMA_Read Latency Test
 Dual-port       : OFF          Device         : mlx5_5
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 TX depth        : 1
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 2
 Outstand reads  : 16
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x0b69 PSN 0xd476c2 OUT 0x10 RKey 0x006f00 VAddr 0x00000001f91000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:172:81:00:105
 remote address: LID 0000 QPN 0x0d69 PSN 0xbe5c89 OUT 0x10 RKey 0x004f00 VAddr 0x0000000160d000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:172:81:00:118
---------------------------------------------------------------------------------------
 #bytes  #iterations  t_min[usec]  t_max[usec]  t_typical[usec]  t_avg[usec]  t_stdev[usec]  99% percentile[usec]  99.9% percentile[usec]
Conflicting CPU frequency values detected: 2200.000000 != 1338.151000. CPU Frequency is not max.
Conflicting CPU frequency values detected: 2200.000000 != 2881.668000. CPU Frequency is not max.
 2       1000         6.66         20.37        6.74             6.82         0.78           7.15                  20.37
---------------------------------------------------------------------------------------