RDMA with Infiniband
English | 简体中文
Introduction
This chapter introduces how POD access network with the infiniband interface of the host.
Features
Different from RoCE, Infiniband network cards are proprietary devices for the Infiniband network, and the Spiderpool offers two CNI options:
-
IB-SRIOV CNI provides SR-IOV network card with the RDMA device. It is suitable for workloads requiring RDMA communication.
It offers two RDMA modes:
-
Shared mode: Pod will have a SR-IOV network interface with RDMA feature, but all RDMA devices cloud be seen by all PODs running in the same node. POD may be confused for which RDMA device it should use.
-
Exclusive mode: Pod will have a SR-IOV network interface with RDMA feature, and POD just enable to see its own RDMA device.
For isolated RDMA network cards, at least one of the following conditions must be met:
(1) Kernel based on 5.3.0 or newer, RDMA modules loaded in the system. rdma-core package provides means to automatically load relevant modules on system start
(2) Mellanox OFED version 4.7 or newer is required. In this case it is not required to use a Kernel based on 5.3.0 or newer.
-
-
IPoIB CNI provides an IPoIB network card for POD, without RDMA device. It is suitable for conventional applications that require TCP/IP communication, as it does not require an SRIOV network card, allowing more PODs to run on the host
RDMA based on IB-SRIOV
The following steps demonstrate how to use IB-SRIOV on a cluster with 2 nodes. It enables POD to own SR-IOV network card and the RDMA devices of isolated network namespace
-
Ensure that the host machine has an Infiniband card installed and the driver is properly installed.
In our demo environment, the host machine is equipped with a Mellanox ConnectX-5 VPI NIC.
Follow the official NVIDIA guide to install the latest OFED driver.
For Mellanox's VPI series network cards, you can refer to the official Switching Infiniband Mode to ensure that the network card is working in Infiniband mode.
To confirm the presence of Inifiniband devices, use the following command:
~# lspci -nn | grep Infiniband 86:00.0 Infiniband controller [0207]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017] ~# rdma link link mlx5_0/1 subnet_prefix fe80:0000:0000:0000 lid 2 sm_lid 2 lmc 0 state ACTIVE physical_state LINK_UP ~# ibstat mlx5_0 | grep "Link layer" Link layer: InfiniBand
Make sure that the RDMA subsystem of the host is in exclusive mode. If not, switch to shared mode.
~# rdma system set netns exclusive ~# echo "options ib_core netns_mode=0" >> /etc/modprobe.d/ib_core.conf ~# reboot ~# rdma system netns exclusive copy-on-fork on
if it is expected to work under shared mode,
rm /etc/modprobe.d/ib_core.conf && reboot
(Optional) In an SR-IOV scenario, applications can enable NVIDIA's GPUDirect RDMA feature. For instructions on installing the kernel module, please refer to the official documentation.
-
Install Spiderpool, and notice the helm options:
helm upgrade --install spiderpool spiderpool/spiderpool --namespace kube-system --reuse-values --set sriov.install=true
- If you are a user from China, you can specify the parameter
--set global.imageRegistryOverride=ghcr.m.daocloud.io
to pull image from China registry.
Once the installation is complete, the following components will be installed:
~# kubectl get pod -n kube-system spiderpool-agent-9sllh 1/1 Running 0 1m spiderpool-agent-h92bv 1/1 Running 0 1m spiderpool-controller-7df784cdb7-bsfwv 1/1 Running 0 1m spiderpool-sriov-operator-65b59cd75d-89wtg 1/1 Running 0 1m spiderpool-init 0/1 Completed 0 1m
- If you are a user from China, you can specify the parameter
-
configure SR-IOV operator.
To enable the SR-IOV CNI on specific nodes, you need to apply the following command to label those nodes. This will allow the sriov-network-operator to install the components on the designated nodes.
kubectl label node $NodeName node-role.kubernetes.io/worker=""
Use the following commands to look up the device information of infiniband card
~# lspci -nn | grep Infiniband 86:00.0 Infiniband controller [0207]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]
The number of VFs determines how many SR-IOV network cards can be provided for PODs on a host. The network card from different manufacturers have different amount limit of VFs. For example, the Mellanox connectx5 used in this example can create up to 127 VFs.
Apply the following configuration, and the VFs will be created on the host. Notice, this may cause the nodes to reboot, owing to taking effect the new configuration in the network card driver.
cat <<EOF | kubectl apply -f - apiVersion: sriovnetwork.openshift.io/v1 kind: SriovNetworkNodePolicy metadata: name: ib-sriov namespace: kube-system spec: nodeSelector: kubernetes.io/os: "linux" resourceName: mellanoxibsriov priority: 99 numVfs: 12 nicSelector: deviceID: "1017" rootDevices: - 0000:86:00.0 vendor: "15b3" deviceType: netdevice isRdma: true EOF
View the available resources on a node, including the reported RDMA device resources:
~# kubectl get no -o json | jq -r '[.items[] | {name:.metadata.name, allocable:.status.allocatable}]' [ { "name": "10-20-1-10", "allocable": { "cpu": "40", "pods": "110", "spidernet.io/mellanoxibsriov": "12", ... } }, ... ]
-
Create the CNI configuration of IB-SRIOV, and the ippool resource
cat <<EOF | kubectl apply -f - apiVersion: spiderpool.spidernet.io/v2beta1 kind: SpiderIPPool metadata: name: v4-91 spec: gateway: 172.91.0.1 ips: - 172.91.0.100-172.91.0.120 subnet: 172.91.0.0/16 --- apiVersion: spiderpool.spidernet.io/v2beta1 kind: SpiderMultusConfig metadata: name: ib-sriov namespace: kube-system spec: cniType: ib-sriov ibsriov: resourceName: spidernet.io/mellanoxibsriov ippools: ipv4: ["v4-91"] EOF
-
Following the configurations from the previous step, create a DaemonSet application that spans across nodes for testing
ANNOTATION_MULTUS="v1.multus-cni.io/default-network: kube-system/ib-sriov" RESOURCE="spidernet.io/mellanoxibsriov" NAME=ib-sriov cat <<EOF | kubectl apply -f - apiVersion: apps/v1 kind: DaemonSet metadata: name: ${NAME} labels: app: $NAME spec: selector: matchLabels: app: $NAME template: metadata: name: $NAME labels: app: $NAME annotations: ${ANNOTATION_MULTUS} spec: containers: - image: docker.io/mellanox/rping-test imagePullPolicy: IfNotPresent name: mofed-test securityContext: capabilities: add: [ "IPC_LOCK" ] resources: limits: ${RESOURCE}: 1 command: - sh - -c - | ls -l /dev/infiniband /sys/class/net sleep 1000000 EOF
-
Verify that RDMA data transmission is working correctly between the Pods across nodes.
Open a terminal and access one Pod to launch a service:
~# rdma link link mlx5_4/1 subnet_prefix fe80:0000:0000:0000 lid 8 sm_lid 1 lmc 0 state ACTIVE physical_state LINK_UP # Start an RDMA service ~# ib_read_lat
Open a terminal and access another Pod to launch a service:
~# rdma link link mlx5_8/1 subnet_prefix fe80:0000:0000:0000 lid 7 sm_lid 1 lmc 0 state ACTIVE physical_state LINK_UP # succeed to visit the service on the other POD ~# ib_read_lat 172.91.0.115
IPoIB
The following steps demonstrate how to use IPoIB on a cluster with 2 nodes, it enables Pod to own a regular TCP/IP network cards without RDMA device.
-
Ensure that the host machine has an Infiniband card installed and the driver is properly installed.
In our demo environment, the host machine is equipped with a Mellanox ConnectX-5 VPI NIC.
Follow the official NVIDIA guide to install the latest OFED driver.
For Mellanox's VPI series network cards, you can refer to the official Switching Infiniband Mode to ensure that the network card is working in Infiniband mode.
To confirm the presence of Inifiniband devices, use the following command:
~# lspci -nn | grep Infiniband 86:00.0 Infiniband controller [0207]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017] ~# rdma link link mlx5_0/1 subnet_prefix fe80:0000:0000:0000 lid 2 sm_lid 2 lmc 0 state ACTIVE physical_state LINK_UP ~# ibstat mlx5_0 | grep "Link layer" Link layer: InfiniBand
Check the ipoib interface of the Inifiniband device
~# ip a show ibs5f0 9: ibs5f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP group default qlen 256 link/infiniband 00:00:10:49:fe:80:00:00:00:00:00:00:e8:eb:d3:03:00:93:ae:10 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff altname ibp134s0f0 inet 172.91.0.10/16 brd 172.91.255.255 scope global ibs5f0 valid_lft forever preferred_lft forever inet6 fd00:91::172:91:0:10/64 scope global valid_lft forever preferred_lft forever inet6 fe80::eaeb:d303:93:ae10/64 scope link valid_lft forever preferred_lft forever
-
Install Spiderpool
If you are a user from China, you can specify the parameter
--set global.imageRegistryOverride=ghcr.m.daocloud.io
to pull image from China registry.Once the installation is complete, the following components will be installed:
~# kubectl get pod -n kube-system spiderpool-agent-9sllh 1/1 Running 0 1m spiderpool-agent-h92bv 1/1 Running 0 1m spiderpool-controller-7df784cdb7-bsfwv 1/1 Running 0 1m spiderpool-init 0/1 Completed 0 1m
-
Create the CNI configuration of ipoib, and the ippool. The
spec.ipoib.master
of SpiderMultusConfig should be set to the infiniband interface of the node.cat <<EOF | kubectl apply -f - apiVersion: spiderpool.spidernet.io/v2beta1 kind: SpiderIPPool metadata: name: v4-91 spec: gateway: 172.91.0.1 ips: - 172.91.0.100-172.91.0.120 subnet: 172.91.0.0/16 --- apiVersion: spiderpool.spidernet.io/v2beta1 kind: SpiderMultusConfig metadata: name: ipoib namespace: kube-system spec: cniType: ipoib ipoib: master: "ibs5f0" ippools: ipv4: ["v4-91"] EOF
-
Following the configurations from the previous step, create a DaemonSet application that spans across nodes for testing
ANNOTATION_MULTUS="v1.multus-cni.io/default-network: kube-system/ipoib" NAME=ipoib cat <<EOF | kubectl apply -f - apiVersion: apps/v1 kind: DaemonSet metadata: name: ${NAME} labels: app: $NAME spec: selector: matchLabels: app: $NAME template: metadata: name: $NAME labels: app: $NAME annotations: ${ANNOTATION_MULTUS} spec: containers: - image: docker.io/mellanox/rping-test imagePullPolicy: IfNotPresent name: mofed-test securityContext: capabilities: add: [ "IPC_LOCK" ] command: - sh - -c - | ls -l /dev/infiniband /sys/class/net sleep 1000000 EOF
-
Verify that the network communication is correct between the PODs across nodes.
~# kubectl get pod -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES ipoib-psf4q 1/1 Running 0 34s 172.91.0.112 10-20-1-20 <none> <none> ipoib-t9hm7 1/1 Running 0 34s 172.91.0.116 10-20-1-10 <none> <none>
Succeed to access each other
~# kubectl exec -it ipoib-psf4q bash kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead. root@ipoib-psf4q:/# ping 172.91.0.116 PING 172.91.0.116 (172.91.0.116) 56(84) bytes of data. 64 bytes from 172.91.0.116: icmp_seq=1 ttl=64 time=1.10 ms 64 bytes from 172.91.0.116: icmp_seq=2 ttl=64 time=0.235 ms