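Virtual GPUs¶
Kayobe contains playbooks to configure virtualised GPUs on supported NVIDIA hardware. This allows you to statically create mdev devices that Nova can use to present a virtual GPU to guest VMs. GPUs known to work include: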
NVIDIA A100
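BIOS configuration¶
Intel¶
Enable VT-x in the BIOS for virtualisation support.
Enable VT-d in the BIOS for IOMMU support.
AMD¶
Enable AMD-V in the BIOS for virtualisation support.
Enable AMD-Vi in the BIOS for IOMMU support.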
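For either vendor, the effect of these BIOS settings can be sanity-checked from Linux once the host has rebooted. The following is a minimal sketch, not part of Kayobe: the function names are hypothetical, and the optional path arguments exist only so the checks can be pointed at test files. It relies on the `vmx` (Intel) and `svm` (AMD) CPU flags and on the `/sys/kernel/iommu_groups` directory.

```shell
#!/bin/sh
# Report which hardware virtualisation flag a cpuinfo file advertises.
# The path argument is an assumption for testability; it defaults to /proc/cpuinfo.
cpu_virt_flag() {
    cpuinfo="${1:-/proc/cpuinfo}"
    if grep -qw vmx "$cpuinfo"; then
        echo "vmx"      # Intel VT-x
    elif grep -qw svm "$cpuinfo"; then
        echo "svm"      # AMD-V
    else
        echo "none"     # virtualisation extensions not visible
    fi
}

# Count IOMMU groups. Zero can mean the IOMMU is disabled in the BIOS
# or has not been enabled by the kernel.
iommu_group_count() {
    dir="${1:-/sys/kernel/iommu_groups}"
    if [ -d "$dir" ]; then
        ls "$dir" | wc -l | tr -d ' '
    else
        echo 0
    fi
}
```

Running `cpu_virt_flag` and `iommu_group_count` with no arguments on the hypervisor inspects the live system.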
Example: Dell¶
Enabling SR-IOV with racadm:
/opt/dell/srvadmin/bin/idracadm7 set BIOS.IntegratedDevices.SriovGlobalEnable Enabled
/opt/dell/srvadmin/bin/idracadm7 jobqueue create BIOS.Setup.1-1
<reboot>
Enabling CPU virtualisation with racadm:
/opt/dell/srvadmin/bin/idracadm7 set BIOS.ProcSettings.ProcVirtualization Enabled
/opt/dell/srvadmin/bin/idracadm7 jobqueue create BIOS.Setup.1-1
<reboot>
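Obtain driver from NVIDIA licensing portal¶
Download the NVIDIA GRID driver from the NVIDIA Licensing Portal (this requires a login).
Configuration¶
Add hosts with supported GPUs to the compute-vgpu group. If using bifrost and the kayobe overcloud inventory discover mechanism, this can be achieved with: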
$KAYOBE_CONFIG_PATH/overcloud.yml¶
overcloud_group_hosts_map:
  compute-vgpu:
    - "computegpu000"
Configure the location of the NVIDIA driver:
$KAYOBE_CONFIG_PATH/vgpu.yml¶
---
vgpu_driver_url: "https://example.com/NVIDIA-GRID-Linux-KVM-525.105.14-525.105.17-528.89.zip"
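If you don't know which vGPU types your card supports, you can determine them by following the steps in the NVIDIA vGPU types section.
You can then define group_vars describing the vGPU configuration: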
$KAYOBE_CONFIG_PATH/inventory/group_vars/compute-vgpu/vgpu¶
#nvidia-692 GRID A100D-4C
#nvidia-693 GRID A100D-8C
#nvidia-694 GRID A100D-10C
#nvidia-695 GRID A100D-16C
#nvidia-696 GRID A100D-20C
#nvidia-697 GRID A100D-40C
#nvidia-698 GRID A100D-80C
#nvidia-699 GRID A100D-1-10C
#nvidia-700 GRID A100D-2-20C
#nvidia-701 GRID A100D-3-40C
#nvidia-702 GRID A100D-4-40C
#nvidia-703 GRID A100D-7-80C
#nvidia-707 GRID A100D-1-10CME
vgpu_definitions:
  # Configuring a MIG backed vGPU
  - pci_address: "0000:17:00.0"
    mig_devices:
      # This section describes how to partition the card using MIG. The key
      # in the dictionary represents a MIG profile supported by your card and
      # the value is the number of MIG devices of that type that you want
      # to create. The vGPUs are then created on top of these MIG devices.
      # The available profiles can be found in the NVIDIA documentation:
      # https://docs.nvda.net.cn/grid/15.0/grid-vgpu-user-guide/index.html#virtual-gpu-types-grid-reference
      "1g.10gb": 1
      "2g.20gb": 3
    virtual_functions:
      # The mdev type is the NVIDIA identifier for a particular vGPU. When using
      # MIG backed vGPUs these must match up with your MIG devices. See the NVIDIA
      # vGPU types section in this document.
      - mdev_type: nvidia-700
        index: 0
      - mdev_type: nvidia-700
        index: 1
      - mdev_type: nvidia-700
        index: 2
      - mdev_type: nvidia-699
        index: 3
  # Configuring a card in a time-sliced configuration (non-MIG backed)
  - pci_address: "0000:65:00.0"
    virtual_functions:
      - mdev_type: nvidia-697
        index: 0
      - mdev_type: nvidia-697
        index: 1
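When choosing `mig_devices` counts, it can help to add up the compute slices that the requested profiles consume. The sketch below assumes that the number before `g` in a profile name (for example the `2` in `2g.20gb`) is the number of compute slices, and that an A100 offers seven of them, as described in NVIDIA's MIG documentation. The function name and the `profile=count` input format are hypothetical. NVIDIA also imposes placement rules, so a total of seven or fewer is necessary but may not be sufficient.

```shell
#!/bin/sh
# Sum the compute slices requested by a set of MIG profiles.
# Input: space-separated "profile=count" pairs, e.g. "1g.10gb=1 2g.20gb=3".
mig_slices_total() {
    total=0
    for pair in $1; do
        profile="${pair%%=*}"      # e.g. 2g.20gb
        count="${pair##*=}"        # e.g. 3
        slices="${profile%%g.*}"   # leading number = compute slices, e.g. 2
        total=$((total + slices * count))
    done
    echo "$total"
}

# The example definition above: one 1g.10gb and three 2g.20gb devices.
mig_slices_total "1g.10gb=1 2g.20gb=3"
```

For the example definition this gives 1 × 1 + 2 × 3 = 7, which uses every compute slice on the card.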
To apply this configuration, use:
(kayobe) $ kayobe overcloud host configure -t vgpu
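NVIDIA vGPU types¶
The NVIDIA vGPU drivers must be installed before you can query the available vGPU types. To install them without creating any vGPUs, leave the virtual functions undefined in the vGPU definition: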
$KAYOBE_CONFIG_PATH/inventory/group_vars/compute-vgpu/vgpu¶
vgpu_definitions:
  - pci_address: "0000:17:00.0"
    virtual_functions: []
Apply this as described in Configuration. You can then use mdevctl to query the available vGPU types:
mdevctl types
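The output of `mdevctl types` is verbose, so a small filter can make it easier to match type identifiers to vGPU names. This sketch assumes the indented layout that mdevctl normally prints: parent address at column 0, type indented by two spaces, and attributes such as `Name:` indented by four. The function name is hypothetical.

```shell
#!/bin/sh
# Print "<parent address> <mdev type> <name>" for each type found in
# `mdevctl types` output read from stdin.
summarise_mdev_types() {
    awk '
        /^[^ ]/       { parent = $1; next }   # parent PCI address
        /^  [^ ]/     { type = $1; next }     # mdev type, e.g. nvidia-700
        /^    Name:/  { sub(/^ *Name: */, ""); print parent, type, $0 }
    '
}

# Usage on a hypervisor (requires the NVIDIA vGPU driver to be installed):
#   mdevctl types | summarise_mdev_types
```

Each output line then pairs an identifier such as `nvidia-700` with its name, which is what the `mdev_type` field in the vGPU definition expects.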
Kolla Ansible configuration¶
To use the mdev devices that were created, modify nova.conf to add a list of mdev devices that can be passed through to guests:
$KAYOBE_CONFIG_PATH/kolla/config/nova/nova-compute.conf¶
{% raw %}
{% if inventory_hostname in groups['compute-vgpu'] %}
[devices]
enabled_mdev_types = nvidia-700, nvidia-699, nvidia-697
[mdev_nvidia-700]
device_addresses = 0000:17:00.4,0000:17:00.5,0000:17:00.6
mdev_class = CUSTOM_NVIDIA_700
[mdev_nvidia-699]
device_addresses = 0000:17:00.7
mdev_class = CUSTOM_NVIDIA_699
[mdev_nvidia-697]
device_addresses = 0000:65:00.4,0000:65:00.5
mdev_class = CUSTOM_NVIDIA_697
{% endif %}
{% endraw %}
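You will need to adjust the PCI addresses to match the virtual function addresses. These can be obtained by checking the mdevctl configuration after applying the Configuration step: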
# mdevctl list
73269d0f-b2c9-438d-8f28-f9e4bc6c6995 0000:17:00.4 nvidia-700 manual (defined)
dc352ef3-efeb-4a5d-a48e-912eb230bc76 0000:17:00.5 nvidia-700 manual (defined)
a464fbae-1f89-419a-a7bd-3a79c7b2eef4 0000:17:00.6 nvidia-700 manual (defined)
f3b823d3-97c8-4e0a-ae1b-1f102dcb3bce 0000:17:00.7 nvidia-699 manual (defined)
330be289-ba3f-4416-8c8a-b46ba7e51284 0000:65:00.4 nvidia-700 manual (defined)
1ba5392c-c61f-4f48-8fb1-4c6b2bbb0673 0000:65:00.5 nvidia-700 manual (defined)
f6868020-eb3a-49c6-9701-6c93e4e3fa9c 0000:65:00.6 nvidia-700 manual (defined)
00501f37-c468-5ba4-8be2-8d653c4604ed 0000:65:00.7 nvidia-699 manual (defined)
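Copying addresses by hand is error-prone, so the listing can be turned into a draft of the nova-compute.conf sections. This is a sketch under stated assumptions: the columns are UUID, parent address and mdev type, as in the listing above; the `CUSTOM_<TYPE>` class naming simply follows the example configuration; and the function name is hypothetical. Review the result before using it.

```shell
#!/bin/sh
# Turn `mdevctl list` output (stdin) into draft nova-compute.conf sections.
mdev_sections() {
    awk '
        NF >= 3 {
            if ($3 in addrs) {
                addrs[$3] = addrs[$3] "," $2
            } else {
                order[++n] = $3          # remember first-seen order of types
                addrs[$3] = $2
            }
        }
        END {
            if (n == 0) exit
            types = order[1]
            for (i = 2; i <= n; i++) types = types ", " order[i]
            printf "[devices]\nenabled_mdev_types = %s\n", types
            for (i = 1; i <= n; i++) {
                t = order[i]
                class = toupper(t)
                gsub(/-/, "_", class)    # nvidia-700 -> NVIDIA_700
                printf "[mdev_%s]\ndevice_addresses = %s\nmdev_class = CUSTOM_%s\n", t, addrs[t], class
            }
        }
    '
}

# Usage on a hypervisor:
#   mdevctl list | mdev_sections
```

Keeping the types in first-seen order makes the draft easy to compare against the listing it was generated from.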
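The mdev_class maps to a resource class that you can set in your flavor definition. Note that if you define only a single mdev type on a given hypervisor, the mdev_class configuration option is silently ignored and the VGPU resource class is used instead (see bug 1943934).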
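The fallback described above decides which resource class a flavor must request. A minimal sketch of that rule follows, assuming the `CUSTOM_<TYPE>` naming used in the example configuration (the value of mdev_class is otherwise your choice, provided it starts with `CUSTOM_` and uses only uppercase letters, digits and underscores). The function name and arguments are hypothetical.

```shell
#!/bin/sh
# expected_resource_class <mdev type> <number of mdev types on the hypervisor>
# Prints the resource class a flavor should request for that type.
expected_resource_class() {
    if [ "$2" -le 1 ]; then
        # A single mdev type: mdev_class is ignored, Nova reports VGPU.
        echo "VGPU"
    else
        # Several types: the custom class from mdev_class applies.
        echo "$1" | awk '{ s = toupper($0); gsub(/-/, "_", s); print "CUSTOM_" s }'
    fi
}

expected_resource_class nvidia-700 3
expected_resource_class nvidia-697 1
```

With three types enabled, `nvidia-700` maps to `CUSTOM_NVIDIA_700`. A hypervisor that only offers `nvidia-697` would need a flavor requesting `resources:VGPU` instead.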
To apply the configuration to Nova:
(kayobe) $ kayobe overcloud service deploy -kt nova
OpenStack flavors¶
Define some flavors that request the resource class configured in nova.conf. The following example definition can be used with the openstack.cloud.compute_flavor Ansible module:
openstack.cloud.compute_flavor:
  name: "vgpu.a100.2g.20gb"
  ram: 65536
  disk: 30
  vcpus: 8
  is_public: false
  extra_specs:
    hw:cpu_policy: "dedicated"
    hw:cpu_thread_policy: "prefer"
    hw:mem_page_size: "1GB"
    hw:cpu_sockets: 2
    hw:numa_nodes: 8
    hw_rng:allowed: "True"
    resources:CUSTOM_NVIDIA_700: "1"
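If you prefer the OpenStack CLI to the Ansible module, a roughly equivalent flavor can be created with `openstack flavor create`. The sketch below only prints the command rather than running it, and the helper name is hypothetical. It carries over the size, visibility and vGPU resource request from the example above and leaves out the other extra specs for brevity; each could be added with a further `--property`.

```shell
#!/bin/sh
# Print (not run) an `openstack flavor create` command for a vGPU flavor.
# Arguments: <flavor name> <resource class>
vgpu_flavor_cmd() {
    printf 'openstack flavor create --ram 65536 --disk 30 --vcpus 8 --private --property resources:%s=1 %s\n' "$2" "$1"
}

vgpu_flavor_cmd "vgpu.a100.2g.20gb" "CUSTOM_NVIDIA_700"
```

Printing the command first makes it easy to review before running it against the cloud.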
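Changing VGPU device types¶
This example converts the second card to nvidia-698 (the whole card). The hypervisor should be empty so that the mdevs can be deleted freely. If it is not, you will need to check which mdevs are in use and proceed with extreme caution. First, clean up the mdev definitions to make room for the new device: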
[stack@computegpu000 ~]$ sudo mdevctl list
5c630867-a673-5d75-aa31-a499e6c7cb19 0000:21:00.4 nvidia-697 manual (defined)
eaa6e018-308e-58e2-b351-aadbcf01f5a8 0000:21:00.5 nvidia-697 manual (defined)
72291b01-689b-5b7a-9171-6b3480deabf4 0000:81:00.4 nvidia-697 manual (defined)
0a47ffd1-392e-5373-8428-707a4e0ce31a 0000:81:00.5 nvidia-697 manual (defined)
[stack@computegpu000 ~]$ sudo mdevctl stop --uuid 72291b01-689b-5b7a-9171-6b3480deabf4
[stack@computegpu000 ~]$ sudo mdevctl stop --uuid 0a47ffd1-392e-5373-8428-707a4e0ce31a
[stack@computegpu000 ~]$ sudo mdevctl undefine --uuid 0a47ffd1-392e-5373-8428-707a4e0ce31a
[stack@computegpu000 ~]$ sudo mdevctl undefine --uuid 72291b01-689b-5b7a-9171-6b3480deabf4
[stack@computegpu000 ~]$ sudo mdevctl list --defined
5c630867-a673-5d75-aa31-a499e6c7cb19 0000:21:00.4 nvidia-697 manual (active)
eaa6e018-308e-58e2-b351-aadbcf01f5a8 0000:21:00.5 nvidia-697 manual (active)
# We can re-use the first virtual function
Second, remove the systemd units that start the mdev devices:
[stack@computegpu000 ~]$ sudo rm /etc/systemd/system/multi-user.target.wants/nvidia-mdev@0a47ffd1-392e-5373-8428-707a4e0ce31a.service
[stack@computegpu000 ~]$ sudo rm /etc/systemd/system/multi-user.target.wants/nvidia-mdev@72291b01-689b-5b7a-9171-6b3480deabf4.service
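On a card with many mdevs, typing each UUID by hand is tedious. The two cleanup steps above can be generated from the `mdevctl list` output instead. This sketch only prints the commands so they can be reviewed before being run with sudo. The function name is hypothetical, and it assumes the column layout shown in the listings above and the systemd unit path used in this section.

```shell
#!/bin/sh
# Read `mdevctl list` output on stdin and print the cleanup commands for every
# mdev whose parent address starts with the given PCI prefix, e.g. 0000:81:00.
mdev_cleanup_cmds() {
    awk -v prefix="$1" '
        index($2, prefix) == 1 {
            print "mdevctl stop --uuid " $1
            print "mdevctl undefine --uuid " $1
            print "rm /etc/systemd/system/multi-user.target.wants/nvidia-mdev@" $1 ".service"
        }
    '
}

# Usage on a hypervisor (review the output before executing it):
#   sudo mdevctl list | mdev_cleanup_cmds 0000:81:00.
```

Matching on the address prefix limits the cleanup to the virtual functions of one physical card and leaves the other card untouched.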
Adjust your Kayobe and Kolla Ansible configuration to match the desired state, then re-run host configure:
(kayobe) $ kayobe overcloud host configure --tags vgpu --limit computegpu000
Check the result:
[stack@computegpu000 ~]$ mdevctl list
5c630867-a673-5d75-aa31-a499e6c7cb19 0000:21:00.4 nvidia-697 manual
eaa6e018-308e-58e2-b351-aadbcf01f5a8 0000:21:00.5 nvidia-697 manual
72291b01-689b-5b7a-9171-6b3480deabf4 0000:81:00.4 nvidia-698 manual
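The same check can be scripted, for instance as a post-change assertion. This is a minimal sketch with a hypothetical function name. It reads `mdevctl list` output on stdin and succeeds only if an mdev of the given type exists on the given virtual function address.

```shell
#!/bin/sh
# has_mdev <virtual function address> <mdev type>
# Exit status 0 if `mdevctl list` output on stdin contains a matching mdev.
has_mdev() {
    awk -v addr="$1" -v type="$2" '
        $2 == addr && $3 == type { found = 1 }
        END { exit !found }
    '
}

# Usage on a hypervisor:
#   mdevctl list | has_mdev 0000:81:00.4 nvidia-698 && echo "conversion applied"
```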
Reconfigure nova to match the change:
(kayobe) $ kayobe overcloud service reconfigure -kt nova --kolla-limit computegpu000 --skip-prechecks