Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add API for CDI --devices flag in Docker and Podman for mapping GPUs #3290

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

lukeogg
Copy link

@lukeogg lukeogg commented Jun 28, 2023

This PR adds support for passing GPU parameters to Nvidia Container Toolkit through the CDI specification. Although there is a way to map all GPUs to a single node with device mounts, more granularity is desired. This PR will support mapping various device combinations to different nodes as needed.

Would resolve Issue 3164

  • Adds API for devices with a list of strings in the format specified in the CDI specification.
  devices:
  - "nvidia.com/gpu=0"
  - "nvidia.com/gpu=1"
  • Uses the CDI validation package to validate values passed in the devices API.
  • Supports both docker and podman
  • This relies on the upstream PR for Docker being merged and released - presumably in Docker v25.
  • Checks the docker version if devices have been specified.
  • Add documentation for mapping GPUs.

All GPUs mapped to a single control-plane:

{{< codeFromInline lang="yaml" >}}
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  devices:
  - "nvidia.com/gpu=all"

Specific GPUs mapped to specific worker nodes based on index:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
  devices:
  - "nvidia.com/gpu=0"
- role: worker
  devices:
  - "nvidia.com/gpu=1"

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 28, 2023
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: lukeogg
Once this PR has been reviewed and has the lgtm label, please assign bentheelder for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested a review from aojea June 28, 2023 19:49
@k8s-ci-robot k8s-ci-robot added the area/provider/docker Issues or PRs related to docker label Jun 28, 2023
@k8s-ci-robot k8s-ci-robot added area/provider/podman Issues or PRs related to podman needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 28, 2023
@k8s-ci-robot
Copy link
Contributor

Hi @lukeogg. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 28, 2023
go.mod Outdated Show resolved Hide resolved
Copy link
Contributor

@stmcginnis stmcginnis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/ok-to-test

Looks like this should be fine to podman, and I think the docker version handling makes sense.

Some nice markdown cleanup too. ;)

pkg/internal/apis/config/validate_test.go Outdated Show resolved Hide resolved
@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 30, 2023
Copy link

@elezar elezar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Some minor comments on the CDI device and documentation side of things.

pkg/internal/apis/config/types.go Outdated Show resolved Hide resolved
@@ -192,6 +196,17 @@ func validatePortMappings(portMappings []PortMapping) error {
return nil
}

func validateDevices(devices []string) error {
for _, device := range devices {
device := strings.TrimSpace(device)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should be able to use "github.com/container-orchestrated-devices/container-device-interface/pkg/parser" and call parser.IsQualifiedName(device) here. This is used in docker/cli#4084 and significantly reduces the dependencies pulled in.

It should also be fine to allow podman and / or docker to perform this validation though.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll give that a try.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update to what Evan suggests to reduce the dependencies pulled in.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BenTheElder What do you think?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@klueska I am using parser.ParseQualifiedName(device) instead of parser.ParseQualifiedName(device). The results are the same except you get the additional error message about why your device string is invalid. I think this is preferable. ParseQualifiedName() is called in IsQualifiedName().

This doesn't change the dependencies at all. There are tests on github.com/container-orchestrated-devices/container-device-interface/pkg/parser pulling in the additional packages. Maybe we should put the tests in a different package? @elezar wdyt?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can update the tests in that package to use a different import for comparison or to be in the parser_test package if require. As a matter of interest, which package that is equivalent to require would be preferred?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't personally have a preference - I haven't found any with fewer dependencies. I was thinking use the parser_test package to keep things very lightweight.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have created cncf-tags/container-device-interface#149 with an update to the test package. This should prevent github.com/stretchr/testify/require from getting pulled in.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, a few things came up back in the real world and here with #3277 etc., the dependency discussion wound up forked in https://github.com/kubernetes-sigs/kind/pull/3290/files#r1304873501

I'm going to try to get #3335 after which when this PR is rebased and go mod tidy it should be clearer what we're actually talking about for the new dependencies.

See linked comment for stance on deps.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently for example we have a lot of CRI inspired types with in-tree dependency-free code instead, to avoid creating dep hell for our users. That may not be so reasonable here.

site/content/docs/user/configuration.md Outdated Show resolved Hide resolved
site/content/docs/user/configuration.md Outdated Show resolved Hide resolved
containerPath: /var/run/nvidia-container-devices/all
{{< /codeFromInline >}}

Note: this method only support adding `all` GPUs to a single node. If you want to add specific GPUs to specific nodes, you will need to use the `devices` API.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know whether this is strictly true. Using /var/run/nvidia-container-devices/0 or /var/run/nvidia-container-devices/{{DEVICE_UUID}} should allow individual devices to be made available.

cc @klueska

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned in #3257 (comment), we can't achieve any finer granularity than all (regardless of whether we do volume mounts or use CDI) because the node's container is started with --privileged. Even if you tell it to selectively inject just a single GPU, the node will be able to see all of them because --privileged pulls in all device nodes under /dev.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing some testing today and this does allow access to all GPUs. I'll update the documentation. It is fairly convenient with Kind, however, as you don't need to inject drivers or install them on the node images.

Copy link

@klueska klueska left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, this seems reasonable. The only real issue I see is the limitation around not being able to inject individual GPUs because of the --privileged flag.

I'm assuming this will trip up many users thinking they should be able to selectively give different nodes access to different GPUs (based on their CDI device name), but then suddenly all nodes have access to all GPUs.

Comment on lines +258 to +271
// Append CDI device args (used for GPU support)
if len(node.CDIDevices) > 0 {
// Check for docker > 25
ver := Version()
if ver != "dev" || strings.Split(ver, ".")[0] < "25" {
return nil, errors.Errorf("using devices api in kind requires Docker >= v25, but found %q", ver)
}

// Append args for each device
for _, device := range node.CDIDevices {
args = append(args, "--device", strings.TrimSpace(device))
}
}

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will also allow passing "standard" devices via the CDIDevices setting? It looks like there is a new validate() call that will prevent this though. Is the validate call made sometime before this is done here?

Copy link
Author

@lukeogg lukeogg Jul 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, validation is called as part of the initial passing for the node configuration. parser.ParseQualifiedName(device) as part of this.

@@ -192,6 +196,17 @@ func validatePortMappings(portMappings []PortMapping) error {
return nil
}

func validateDevices(devices []string) error {
for _, device := range devices {
device := strings.TrimSpace(device)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update to what Evan suggests to reduce the dependencies pulled in.

- role: control-plane
- role: worker
devices:
- "nvidia.com/gpu=0"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this will not give you access to just GPU 0, but rather all GPUs on the system.

This is because --privileged is passed to the docker call that creates the node, thus giving it access to all devices under /dev (including all GPU devices).

Individual GPU injection will only be supported once if/when kind is able to start nodes without --privileged.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, this reduces the value of adding an API for CDI devices in my opinion.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, this is the same for all other mechanisms too. The --privileged flag will mean that any container will have access to all devices and the only value add is that the driver libraries are also injected.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, this reduces the value of adding an API for CDI devices in my opinion.

This isn't true. The use of the device mounts uses our "legacy" stack which has a number of shortcomings that using CDI addresses. It also means that users can access devices from other vendors such as Intel or their own device / resource definitions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the only value add is that the driver libraries are also injected.

I don't supposed we can just trigger driver library injection directly in a simple way and reduce the API surface to "yes I'd like the GPU drivers"?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I spent today putting this together:
https://github.com/klueska/kind-with-gpus-examples

It should be a good starting point for using GPUs with kind independent of this PR

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've also pinged the docker maintainers that helped us push CDI through to see what their thoughts are here:
https://dockercommunity.slack.com/archives/C04MZQZJE94/p1712086101448089

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was pointed to this on that thread:
moby/moby#39702 (comment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If moby/moby#47663 becomes available we'll have to version detect to use it to avoid breaking all of the users on older docker installs, and I'll be pretty wary of the branching codepath behaving differently vs --privileged.

We have only once required a newer docker version (when we started to require cgroupns=private be available) which caused a lot of consternation even though that change was aimed at mitigating the runc misc issues that caused kind to very flaky/broken ...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would've been nice when kind was being built, but now we have to consider that people undoubtedly have test environments that depend on being in a --privileged node :/

containerPath: /var/run/nvidia-container-devices/all
{{< /codeFromInline >}}

Note: this method only support adding `all` GPUs to a single node. If you want to add specific GPUs to specific nodes, you will need to use the `devices` API.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This limitation of all is true even in the CDI case.

Copy link
Author

@lukeogg lukeogg Jul 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, this reduces the value of adding an API for CDI devices in my opinion.

Same here.

Copy link

@elezar elezar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @lukeogg. Some minor comments / suggestions from my side.

Comment on lines +262 to +264
if ver != "dev" || strings.Split(ver, ".")[0] < "25" {
return nil, errors.Errorf("using devices api in kind requires Docker >= v25, but found %q", ver)
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: This also requires that the Daemon be configured in experimental mode.

What about dropping this check entirely and relying on the error reported by the Docker daemon. Should be something along the lines of "no driver for "cdi"". Or is this UX to confusing?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer this approach, rather than make kind to have awareness of the provider versions

@@ -33,6 +33,16 @@ func IsAvailable() bool {
return strings.HasPrefix(lines[0], "Docker version")
}

// Version gets the version of docker available on the system
func Version() string {
cmd := exec.Command("docker", "version", "--format", "'{{.Server.Version}}'")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we use docker info we can also extract information as to whether experimental is enabled.

Does changing this function to assertCDISupported() or something similar make sense here? We could then check the combination of server version and experimental to see whether it is supported.

@@ -212,6 +212,13 @@ func runArgsForNode(node *config.Node, clusterIPFamily config.ClusterIPFamily, n
args...,
)

// Append CDI device args (used for GPU support)
if len(node.CDIDevices) > 0 {
for _, device := range node.CDIDevices {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is a compatibility check similar to that for docker also required here? CDI support was only really added with Podman 4.1.0. As I've stated before, I think it's also fine to rely on the error returned by the CLI itself, but that may not be the nicest UX.

// Append CDI device args (used for GPU support)
if len(node.CDIDevices) > 0 {
for _, device := range node.CDIDevices {
args = append(args, "--device", strings.TrimSpace(device))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of having to remember to call strings.TrimSpace whereever the elements of node.CDIDevices are used, can we do that somewhere once?


As a pre-requisite you install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) installed on the host.

Using `devices` for GPU support requires Docker v25 or later. A [CDI specification](https://github.com/container-orchestrated-devices/container-device-interface) will need to be generated for your device. For Nvidia GPU devices see notes [here.](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#container-device-interface-cdi-support)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Using `devices` for GPU support requires Docker v25 or later. A [CDI specification](https://github.com/container-orchestrated-devices/container-device-interface) will need to be generated for your device. For Nvidia GPU devices see notes [here.](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#container-device-interface-cdi-support)
Using `devices` for GPU support requires Docker v25 or Podman v4.1.0 or later. A [CDI specification](https://github.com/container-orchestrated-devices/container-device-interface) will need to be generated for your device. For NVIDIA GPU devices see notes [here.](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#container-device-interface-cdi-support)


Using `devices` for GPU support requires Docker v25 or later. A [CDI specification](https://github.com/container-orchestrated-devices/container-device-interface) will need to be generated for your device. For Nvidia GPU devices see notes [here.](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#container-device-interface-cdi-support)

GPU devices can be mapped to Kind node copntainers with the devices API:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: Is it Kind or KinD?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

KIND or kind. #3290 (comment)


Steps to enable this:

1. Add nvidia as your default runtime in `/etc/docker/daemon.json` If you have the [NVIDIA Container Toolkit installed](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) this can be done with: `sudo nvidia-ctk runtime configure --runtime=docker --set-as-default`
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
1. Add nvidia as your default runtime in `/etc/docker/daemon.json` If you have the [NVIDIA Container Toolkit installed](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) this can be done with: `sudo nvidia-ctk runtime configure --runtime=docker --set-as-default`
1. Add `nvidia` as your default runtime in `/etc/docker/daemon.json` If you have the [NVIDIA Container Toolkit installed](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) this can be done with: `sudo nvidia-ctk runtime configure --runtime=docker --set-as-default`

- role: control-plane
- role: worker
devices:
- "nvidia.com/gpu=0"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, this reduces the value of adding an API for CDI devices in my opinion.

This isn't true. The use of the device mounts uses our "legacy" stack which has a number of shortcomings that using CDI addresses. It also means that users can access devices from other vendors such as Intel or their own device / resource definitions.

worker or control-plane (in HA mode),
KIND runs `kubeadm join` which can be configured using the
KIND runs `kubeadm join` which can be configured using the
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kind vs KinD vs KIND?

Copy link
Member

@BenTheElder BenTheElder Aug 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docs consistently use "KIND" or "kind" but not "KinD". We're not using the acronym anymore as it may be podman or in the future something else. Generally just kind like the CLI, but sometimes "KIND" or kind (code formatted) is less ambiguously the tool vs the english word.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -5,6 +5,7 @@ go 1.16
require (
github.com/BurntSushi/toml v1.0.0
github.com/alessio/shellescape v1.4.1
github.com/container-orchestrated-devices/container-device-interface v0.6.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

regarding dependency management, I'm working on #3335 to switch the modules to 1.17+, which will enable https://go.dev/ref/mod#graph-pruning and make the transitive dependencies clear.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just glancing at the go.sum changes, there's a concerningly large set of deps, but we can't tell with pre 1.17 go.mod files what is actually in the binary.

It remains important that kind the go libraries and CLI are trivially embeddable with minimal dependencies which also must meet kubernetes / CNCF's (allowed licenses) dependency standards. We have a number of users using kind directly in test tools etc and we don't want to bring in more dependencies.

https://github.com/cncf-tags/container-device-interface/blob/main/go.mod has a lot of deps for what I would expect to just be a spec/parser.

@lukeogg
Copy link
Author

lukeogg commented Aug 30, 2023

I will get back to this in the next week or so. Been out for a bit.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 2, 2024
@k8s-ci-robot
Copy link
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot
Copy link
Contributor

@lukeogg: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kind-verify d2a13ba link true /test pull-kind-verify
pull-kind-conformance-parallel-dual-stack-ipv4-ipv6 d2a13ba link true /test pull-kind-conformance-parallel-dual-stack-ipv4-ipv6
pull-kind-e2e-kubernetes-1-30 d2a13ba link true /test pull-kind-e2e-kubernetes-1-30
pull-kind-e2e-kubernetes-1-31 d2a13ba link true /test pull-kind-e2e-kubernetes-1-31

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/provider/docker Issues or PRs related to docker area/provider/podman Issues or PRs related to podman cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support GPUs
7 participants