RDMA
English | 简体中文
Introduction
Spiderpool uses the macvlan, ipvlan, and SR-IOV CNIs to expose RDMA network cards on the host machine to Pods. This page provides an overview of how to use RDMA network cards with Spiderpool.
Features
The RDMA subsystem offers two network namespace modes: shared and exclusive. Containers can therefore either share RDMA network cards with the host or access them exclusively. In Kubernetes, shared cards can be used with the macvlan or ipvlan CNI, while exclusive cards require the SR-IOV CNI.
In shared mode, Spiderpool leverages the macvlan or ipvlan CNI to expose RoCE network cards on the host machine to Pods. The RDMA shared device plugin is used to report RDMA card resources and schedule Pods onto them.
In exclusive mode, Spiderpool utilizes the SR-IOV CNI to expose RDMA cards on the host machine to Pods, providing access to RDMA resources. The RDMA CNI is used to ensure isolation of RDMA devices.
Shared usage of RoCE-capable NIC with macvlan or ipvlan
The following steps demonstrate how to enable shared usage of RDMA devices by Pods in a cluster with two nodes via macvlan CNI:
- Ensure that the host machine has an RDMA network card installed and that its driver is properly set up so that RDMA works correctly.

    In our demo environment, the host machine is equipped with a Mellanox ConnectX-5 NIC with RoCE capabilities. Follow the official NVIDIA guide to install the latest OFED driver. To confirm the presence of RDMA devices, use the following command:

    ```shell
    ~# rdma link show
    link mlx5_0/1 state ACTIVE physical_state LINK_UP netdev ens6f0np0
    link mlx5_1/1 state ACTIVE physical_state LINK_UP netdev ens6f1np1
    ```
    Make sure that the RDMA subsystem on the host is operating in shared mode. If not, switch to shared mode.

    ```shell
    ~# rdma system
    netns shared copy-on-fork on

    # switch to shared mode
    ~# rdma system set netns shared
    ```
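    The `rdma system set` command alone does not persist across a host reboot. As a minimal sketch, and by analogy with the exclusive-mode persistence shown later in this page, shared mode can be pinned through the `ib_core` module parameter (assuming `netns_mode=1` corresponds to shared mode on your kernel):

    ```shell
    # assumption: netns_mode=1 selects shared mode (the exclusive-mode example below uses netns_mode=0)
    ~# echo "options ib_core netns_mode=1" >> /etc/modprobe.d/ib_core.conf

    # verify the value currently loaded by the ib_core module
    ~# cat /sys/module/ib_core/parameters/netns_mode
    ```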
- Verify the details of the RDMA card so that the device plugin can discover the device resources later.

    Run the following command; in this environment the NIC vendor ID is 15b3 and its device ID is 1017:

    ```shell
    ~# lspci -nn | grep Ethernet
    af:00.0 Ethernet controller [0200]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]
    af:00.1 Ethernet controller [0200]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]
    ```
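    If the host has many NICs, lspci's vendor filter narrows the output; a small convenience command, assuming the Mellanox vendor ID 15b3 noted above:

    ```shell
    # list only devices from vendor 15b3 (Mellanox), with numeric [vendor:device] IDs
    ~# lspci -nn -d 15b3:
    ```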
- Install Spiderpool and enable the RDMA shared device plugin:

    ```shell
    helm install spiderpool spiderpool/spiderpool -n kube-system \
      --set multus.multusCNI.defaultCniCRName="macvlan-ens6f0np0" \
      --set rdma.rdmaSharedDevicePlugin.install=true \
      --set rdma.rdmaSharedDevicePlugin.deviceConfig.resourcePrefix="spidernet.io" \
      --set rdma.rdmaSharedDevicePlugin.deviceConfig.resourceName="hca_shared_devices" \
      --set rdma.rdmaSharedDevicePlugin.deviceConfig.rdmaHcaMax=500 \
      --set rdma.rdmaSharedDevicePlugin.deviceConfig.vendors="15b3" \
      --set rdma.rdmaSharedDevicePlugin.deviceConfig.deviceIDs="1017"
    ```
    - If Macvlan is not installed in your cluster, you can specify the Helm parameter `--set plugins.installCNI=true` to install it.
    - If you are a user from China, you can specify the parameter `--set global.imageRegistryOverride=ghcr.m.daocloud.io` to avoid image pull failures for Spiderpool.
    - After completing the installation of Spiderpool, you can manually edit the spiderpool-rdma-shared-device-plugin configmap to reconfigure the RDMA shared device plugin (see the example after the component list below).
    - Specify the name of the NetworkAttachmentDefinition instance for the default CNI used by Multus via `multus.multusCNI.defaultCniCRName`. If the `multus.multusCNI.defaultCniCRName` option is provided, an empty NetworkAttachmentDefinition instance will be automatically generated upon installation. Otherwise, Multus will attempt to create a NetworkAttachmentDefinition instance based on the first CNI configuration found in the /etc/cni/net.d directory. If no suitable configuration is found, a NetworkAttachmentDefinition instance named `default` will be created to complete the installation of Multus.
    Once the installation is complete, the following components will be running:

    ```shell
    ~# kubectl get pod -n kube-system
    spiderpool-agent-9sllh                         1/1     Running     0      1m
    spiderpool-agent-h92bv                         1/1     Running     0      1m
    spiderpool-controller-7df784cdb7-bsfwv         1/1     Running     0      1m
    spiderpool-init                                0/1     Completed   0      1m
    spiderpool-rdma-shared-device-plugin-dr7w8     1/1     Running     0      1m
    spiderpool-rdma-shared-device-plugin-zj65g     1/1     Running     0      1m
    ```
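    To review or adjust the device plugin configuration mentioned in the notes above, the configmap can be inspected directly; a minimal example, assuming the configmap keeps the spiderpool-rdma-shared-device-plugin name used throughout this page:

    ```shell
    # inspect the current RDMA shared device plugin configuration
    ~# kubectl get configmap -n kube-system spiderpool-rdma-shared-device-plugin -o yaml

    # edit it in place; the vendors, deviceIDs, and resourceName values referenced later come from here
    ~# kubectl edit configmap -n kube-system spiderpool-rdma-shared-device-plugin
    ```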
- View the available resources on a node, including the reported RDMA device resources:

    ```shell
    ~# kubectl get no -o json | jq -r '[.items[] | {name:.metadata.name, allocable:.status.allocatable}]'
    [
      {
        "name": "10-20-1-10",
        "allocable": {
          "cpu": "40",
          "memory": "263518036Ki",
          "pods": "110",
          "spidernet.io/hca_shared_devices": "500",
          ...
        }
      },
      ...
    ]
    ```
    If the reported resource count is 0, it may be due to the following reasons:

    (1) Verify that the vendors and deviceIDs in the spiderpool-rdma-shared-device-plugin configmap match the actual values.

    (2) Check the logs of the rdma-shared-device-plugin. If you encounter errors like the one below about RDMA NIC support, try installing rdma-core on the host machine (`apt-get install rdma-core` or `dnf install rdma-core`).

    ```shell
    error creating new device: "missing RDMA device spec for device 0000:04:00.0, RDMA device \"issm\" not found"
    ```
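    A quick way to pull those logs, assuming the plugin Pods keep the spiderpool-rdma-shared-device-plugin name prefix shown in the component list above:

    ```shell
    # find the device plugin Pod running on the node in question
    ~# kubectl get pod -n kube-system -o wide | grep spiderpool-rdma-shared-device-plugin

    # inspect its logs for device discovery errors (replace the Pod name with your own)
    ~# kubectl logs -n kube-system spiderpool-rdma-shared-device-plugin-dr7w8
    ```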
- Create a macvlan-related SpiderMultusConfig with the RDMA card as the master interface, along with the corresponding SpiderIPPool resource:

    ```shell
    cat <<EOF | kubectl apply -f -
    apiVersion: spiderpool.spidernet.io/v2beta1
    kind: SpiderIPPool
    metadata:
      name: v4-81
    spec:
      gateway: 172.81.0.1
      ips:
      - 172.81.0.100-172.81.0.120
      subnet: 172.81.0.0/16
    ---
    apiVersion: spiderpool.spidernet.io/v2beta1
    kind: SpiderMultusConfig
    metadata:
      name: macvlan-ens6f0np0
      namespace: kube-system
    spec:
      cniType: macvlan
      macvlan:
        master:
        - "ens6f0np0"
        ippools:
          ipv4: ["v4-81"]
    EOF
    ```
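    Spiderpool renders a Multus NetworkAttachmentDefinition from each SpiderMultusConfig. As a quick sanity check (assuming the object is created with the same name and namespace as the SpiderMultusConfig):

    ```shell
    # the SpiderIPPool should be listed with its IP usage
    ~# kubectl get spiderippool v4-81

    # the generated NetworkAttachmentDefinition is expected under the same name/namespace
    ~# kubectl get network-attachment-definitions.k8s.cni.cncf.io -n kube-system macvlan-ens6f0np0 -o yaml
    ```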
- Using the configurations from the previous step, create a DaemonSet application that spans the nodes:

    ```shell
    ANNOTATION_MULTUS="v1.multus-cni.io/default-network: kube-system/macvlan-ens6f0np0"
    RESOURCE="spidernet.io/hca_shared_devices"
    NAME=rdma-macvlan
    cat <<EOF | kubectl apply -f -
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: ${NAME}
      labels:
        app: $NAME
    spec:
      selector:
        matchLabels:
          app: $NAME
      template:
        metadata:
          name: $NAME
          labels:
            app: $NAME
          annotations:
            ${ANNOTATION_MULTUS}
        spec:
          containers:
          - image: docker.io/mellanox/rping-test
            imagePullPolicy: IfNotPresent
            name: mofed-test
            securityContext:
              capabilities:
                add: [ "IPC_LOCK" ]
            resources:
              limits:
                ${RESOURCE}: 1
            command:
            - sh
            - -c
            - |
              ls -l /dev/infiniband /sys/class/net
              sleep 1000000
    EOF
    ```
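    Once the Pods are running, their names and IPs (one of which is used as the ib_read_lat target below) can be listed via the app label set in the manifest:

    ```shell
    # list the test Pods and the IPs assigned from the v4-81 pool
    ~# kubectl get pod -o wide -l app=rdma-macvlan
    ```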
- Verify that RDMA data transmission works correctly between the Pods across nodes.

    Open a terminal, access one Pod, and launch a service:

    ```shell
    # All the RDMA cards on the host machine are visible in the Pod
    ~# rdma link
    0/1: mlx5_0/1: state ACTIVE physical_state LINK_UP
    1/1: mlx5_1/1: state ACTIVE physical_state LINK_UP

    # Start an RDMA service
    ~# ib_read_lat
    ```
    Open another terminal, access the other Pod, and connect to the service:

    ```shell
    # All the RDMA cards on the host machine are visible in the Pod
    ~# rdma link
    0/1: mlx5_0/1: state ACTIVE physical_state LINK_UP
    1/1: mlx5_1/1: state ACTIVE physical_state LINK_UP

    # Access the service running in the other Pod
    ~# ib_read_lat 172.81.0.120
    ---------------------------------------------------------------------------------------
                        RDMA_Read Latency Test
     Dual-port       : OFF          Device         : mlx5_0
     Number of qps   : 1            Transport type : IB
     Connection type : RC           Using SRQ      : OFF
     TX depth        : 1
     Mtu             : 1024[B]
     Link type       : Ethernet
     GID index       : 12
     Outstand reads  : 16
     rdma_cm QPs     : OFF
     Data ex. method : Ethernet
    ---------------------------------------------------------------------------------------
     local address: LID 0000 QPN 0x0107 PSN 0x79dd10 OUT 0x10 RKey 0x1fddbc VAddr 0x000000023bd000
     GID: 00:00:00:00:00:00:00:00:00:00:255:255:172:81:00:119
     remote address: LID 0000 QPN 0x0107 PSN 0x40001a OUT 0x10 RKey 0x1fddbc VAddr 0x00000000bf9000
     GID: 00:00:00:00:00:00:00:00:00:00:255:255:172:81:00:120
    ---------------------------------------------------------------------------------------
     #bytes #iterations    t_min[usec]    t_max[usec]    t_typical[usec]    t_avg[usec]    t_stdev[usec]    99% percentile[usec]    99.9% percentile[usec]
    Conflicting CPU frequency values detected: 2200.000000 != 1040.353000. CPU Frequency is not max.
    Conflicting CPU frequency values detected: 2200.000000 != 1849.351000. CPU Frequency is not max.
     2       1000          6.88           16.81          7.04               7.06           0.31             7.38                    16.81
    ---------------------------------------------------------------------------------------
    ```
Isolated usage of RoCE-capable NIC with SR-IOV
The following steps demonstrate how to enable isolated usage of RDMA devices by Pods in a cluster with two nodes via SR-IOV CNI:
- Ensure that the host machine has an RDMA- and SR-IOV-capable card and that its driver is properly set up so that RDMA works correctly.

    In our demo environment, the host machine is equipped with a Mellanox ConnectX-5 NIC with RoCE capabilities. Follow the official NVIDIA guide to install the latest OFED driver.

    To isolate the usage of an RDMA network card, ensure that at least one of the following conditions is met:

    (1) A kernel based on 5.3.0 or newer, with the RDMA modules loaded in the system. The rdma-core package provides the means to automatically load the relevant modules on system start.

    (2) Mellanox OFED version 4.7 or newer. In this case a kernel based on 5.3.0 or newer is not required.

    To confirm the presence of RDMA devices, use the following command:

    ```shell
    ~# rdma link show
    link mlx5_0/1 state ACTIVE physical_state LINK_UP netdev ens6f0np0
    link mlx5_1/1 state ACTIVE physical_state LINK_UP netdev ens6f1np1
    ```
    Make sure that the RDMA subsystem on the host is operating in exclusive mode. If not, switch to exclusive mode.

    ```shell
    # switch to exclusive mode; this setting alone is lost after a host reboot
    ~# rdma system set netns exclusive

    # apply a persistent setting
    ~# echo "options ib_core netns_mode=0" >> /etc/modprobe.d/ib_core.conf

    ~# rdma system
    netns exclusive copy-on-fork on
    ```
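    To double-check which mode will apply, the kernel exposes the `ib_core` parameter as a read-only sysfs file; a small check, assuming the module is loaded:

    ```shell
    # shows the value set above (0 = exclusive; shared mode is presumed to be 1)
    ~# cat /sys/module/ib_core/parameters/netns_mode
    ```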
    (Optional) In an SR-IOV scenario, applications can use NVIDIA's GPUDirect RDMA feature. For instructions on installing the required kernel module, please refer to the official NVIDIA documentation.
- Verify the details of the RDMA card so that the device plugin can discover the device resources later.

    Run the following command; in this environment the NIC vendor ID is 15b3 and its device ID is 1017:

    ```shell
    ~# lspci -nn | grep Ethernet
    af:00.0 Ethernet controller [0200]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]
    af:00.1 Ethernet controller [0200]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]
    ```
- Install Spiderpool:

    ```shell
    helm install spiderpool spiderpool/spiderpool -n kube-system \
      --set sriov.install=true \
      --set plugins.installRdmaCNI=true
    ```
    - If you are a user from China, you can specify the parameter `--set global.imageRegistryOverride=ghcr.m.daocloud.io` to avoid image pull failures for Spiderpool.
    - After completing the installation of Spiderpool, you can manually edit the spiderpool-rdma-shared-device-plugin configmap to reconfigure the RDMA shared device plugin.
    - Specify the name of the NetworkAttachmentDefinition instance for the default CNI used by Multus via `multus.multusCNI.defaultCniCRName`. If the `multus.multusCNI.defaultCniCRName` option is provided, an empty NetworkAttachmentDefinition instance will be automatically generated upon installation. Otherwise, Multus will attempt to create a NetworkAttachmentDefinition instance based on the first CNI configuration found in the /etc/cni/net.d directory. If no suitable configuration is found, a NetworkAttachmentDefinition instance named `default` will be created to complete the installation of Multus.
    Once the installation is complete, the following components will be running:

    ```shell
    ~# kubectl get pod -n kube-system
    spiderpool-agent-9sllh                         1/1     Running     0      1m
    spiderpool-agent-h92bv                         1/1     Running     0      1m
    spiderpool-controller-7df784cdb7-bsfwv         1/1     Running     0      1m
    spiderpool-init                                0/1     Completed   0      1m
    ```
- Configure the SR-IOV operator.
    The number of VFs determines how many SR-IOV network interfaces can be provided for Pods on a host. Network cards from different manufacturers have different VF limits; for example, the Mellanox ConnectX-5 used in this example can create up to 127 VFs.

    Apply the following configuration and the VFs will be created on the host. Note that this may cause the nodes to reboot in order for the new configuration to take effect in the network card driver.
    ```shell
    cat <<EOF | kubectl apply -f -
    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetworkNodePolicy
    metadata:
      name: policyrdma
      namespace: kube-system
    spec:
      nodeSelector:
        kubernetes.io/os: "linux"
      resourceName: mellanoxrdma
      priority: 99
      numVfs: 12
      nicSelector:
        deviceID: "1017"
        rootDevices:
        - 0000:af:00.0
        vendor: "15b3"
      deviceType: netdevice
      isRdma: true
    EOF
    ```
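    Before moving on, it can help to confirm that the operator has finished provisioning the VFs; a couple of checks, assuming the operator exposes SriovNetworkNodeState objects in the same namespace and that ens6f0np0 is the PF selected above:

    ```shell
    # node-level provisioning status reported by the SR-IOV operator
    ~# kubectl get sriovnetworknodestates -n kube-system

    # number of VFs actually created on the PF (checked on the host)
    ~# cat /sys/class/net/ens6f0np0/device/sriov_numvfs
    ```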
    Verify the available resources on the node, including the reported SR-IOV device resources:

    ```shell
    ~# kubectl get no -o json | jq -r '[.items[] | {name:.metadata.name, allocable:.status.allocatable}]'
    [
      {
        "name": "10-20-1-10",
        "allocable": {
          "cpu": "40",
          "pods": "110",
          "spidernet.io/mellanoxrdma": "12",
          ...
        }
      },
      ...
    ]
    ```
- Create a SpiderMultusConfig for SR-IOV and the corresponding SpiderIPPool resource:

    ```shell
    cat <<EOF | kubectl apply -f -
    apiVersion: spiderpool.spidernet.io/v2beta1
    kind: SpiderIPPool
    metadata:
      name: v4-81
    spec:
      gateway: 172.81.0.1
      ips:
      - 172.81.0.100-172.81.0.120
      subnet: 172.81.0.0/16
    ---
    apiVersion: spiderpool.spidernet.io/v2beta1
    kind: SpiderMultusConfig
    metadata:
      name: sriov-rdma
      namespace: kube-system
    spec:
      cniType: sriov
      sriov:
        resourceName: spidernet.io/mellanoxrdma
        enableRdma: true
        ippools:
          ipv4: ["v4-81"]
    EOF
    ```
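    As with the macvlan case, the SpiderMultusConfig should be rendered into a NetworkAttachmentDefinition. Inspecting it is a quick way to confirm that the SR-IOV resource name was wired in; the annotation named in the comment is an assumption about how the object is rendered:

    ```shell
    # the generated NetworkAttachmentDefinition is expected under the same name/namespace;
    # its k8s.v1.cni.cncf.io/resourceName annotation is presumed to reference spidernet.io/mellanoxrdma
    ~# kubectl get network-attachment-definitions.k8s.cni.cncf.io -n kube-system sriov-rdma -o yaml
    ```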
- Using the configurations from the previous step, create a DaemonSet application that spans the nodes:

    ```shell
    ANNOTATION_MULTUS="v1.multus-cni.io/default-network: kube-system/sriov-rdma"
    RESOURCE="spidernet.io/mellanoxrdma"
    NAME=rdma-sriov
    cat <<EOF | kubectl apply -f -
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: ${NAME}
      labels:
        app: $NAME
    spec:
      selector:
        matchLabels:
          app: $NAME
      template:
        metadata:
          name: $NAME
          labels:
            app: $NAME
          annotations:
            ${ANNOTATION_MULTUS}
        spec:
          containers:
          - image: docker.io/mellanox/rping-test
            imagePullPolicy: IfNotPresent
            name: mofed-test
            securityContext:
              capabilities:
                add: [ "IPC_LOCK" ]
            resources:
              limits:
                ${RESOURCE}: 1
            command:
            - sh
            - -c
            - |
              ls -l /dev/infiniband /sys/class/net
              sleep 1000000
    EOF
    ```
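    Once the DaemonSet is up, list the Pods via the app label to get the names and IPs used in the latency test below, then open a shell in one of them:

    ```shell
    # list the test Pods and their IPs
    ~# kubectl get pod -o wide -l app=rdma-sriov

    # open a shell in one of them (replace the Pod name with your own)
    ~# kubectl exec -it rdma-sriov-xxxxx -- sh
    ```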
- Verify that RDMA data transmission works correctly between the Pods across nodes.

    Open a terminal, access one Pod, and launch a service:

    ```shell
    # Only the RDMA device allocated to the Pod can be found
    ~# rdma link
    7/1: mlx5_3/1: state ACTIVE physical_state LINK_UP netdev eth0

    # Start an RDMA service
    ~# ib_read_lat
    ```
    Open another terminal, access the other Pod, and connect to the service:

    ```shell
    # Only the RDMA device allocated to the Pod can be found
    ~# rdma link
    10/1: mlx5_5/1: state ACTIVE physical_state LINK_UP netdev eth0

    # Access the service running in the other Pod
    ~# ib_read_lat 172.81.0.118
    libibverbs: Warning: couldn't stat '/sys/class/infiniband/mlx5_4'.
    libibverbs: Warning: couldn't stat '/sys/class/infiniband/mlx5_2'.
    libibverbs: Warning: couldn't stat '/sys/class/infiniband/mlx5_0'.
    libibverbs: Warning: couldn't stat '/sys/class/infiniband/mlx5_3'.
    libibverbs: Warning: couldn't stat '/sys/class/infiniband/mlx5_1'.
    ---------------------------------------------------------------------------------------
                        RDMA_Read Latency Test
     Dual-port       : OFF          Device         : mlx5_5
     Number of qps   : 1            Transport type : IB
     Connection type : RC           Using SRQ      : OFF
     TX depth        : 1
     Mtu             : 1024[B]
     Link type       : Ethernet
     GID index       : 2
     Outstand reads  : 16
     rdma_cm QPs     : OFF
     Data ex. method : Ethernet
    ---------------------------------------------------------------------------------------
     local address: LID 0000 QPN 0x0b69 PSN 0xd476c2 OUT 0x10 RKey 0x006f00 VAddr 0x00000001f91000
     GID: 00:00:00:00:00:00:00:00:00:00:255:255:172:81:00:105
     remote address: LID 0000 QPN 0x0d69 PSN 0xbe5c89 OUT 0x10 RKey 0x004f00 VAddr 0x0000000160d000
     GID: 00:00:00:00:00:00:00:00:00:00:255:255:172:81:00:118
    ---------------------------------------------------------------------------------------
     #bytes #iterations    t_min[usec]    t_max[usec]    t_typical[usec]    t_avg[usec]    t_stdev[usec]    99% percentile[usec]    99.9% percentile[usec]
    Conflicting CPU frequency values detected: 2200.000000 != 1338.151000. CPU Frequency is not max.
    Conflicting CPU frequency values detected: 2200.000000 != 2881.668000. CPU Frequency is not max.
     2       1000          6.66           20.37          6.74               6.82           0.78             7.15                    20.37
    ---------------------------------------------------------------------------------------
    ```