Run:ai Supported Software
Run:ai is a cloud-native, Kubernetes-based orchestration tool that enables data scientists to accelerate AI initiatives and bring innovations to completion faster. Reach peak utilization and efficiency with an AI-dedicated, high-performance super-scheduler tailored for managing NVIDIA CUDA-enabled GPU resources. The platform gives IT and MLOps teams visibility into, and control over, job scheduling and dynamic GPU resource provisioning.
Run:ai Atlas
More than Just an Orchestration Tool
Faster Time to Innovation
With automatic resource pooling, queuing, and job prioritization, researchers can focus on their data science projects and develop ground-breaking innovations.
Increased Productivity
The Run:ai platform uses a fairness algorithm and configurable parameters to guarantee every user in the cluster a fair share of resources. Jobs finish faster, and utilization stays high.
Improved GPU Utilization
The integrated automatic super-scheduler lets users easily run any kind of workload on fractional or multiple GPUs. For the highest efficiency, every GPU in the cluster should be allocated at any given time.
Capabilities and Features
Fair Scheduling
Departments, teams, and jobs share GPU resources automatically on the Run:ai platform. Its GPU quota system guarantees a set amount of GPU resources to a specific job. If resources are sitting idle elsewhere in the cluster, your job automatically receives over-quota GPUs to accelerate its task. When other jobs need their GPU quota back, Run:ai automatically preempts and reallocates the over-quota resources.
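The mechanics are easier to see in miniature. The sketch below is not Run:ai's scheduler; it is a hypothetical Python illustration of the quota idea described above: each project is guaranteed its quota, idle GPUs are loaned out as over-quota capacity, and a real scheduler would preempt those loans when their owner returns. All names (`Project`, `schedule`, the fairness key) are invented for illustration.

```python
# Illustrative sketch of guaranteed-quota scheduling with over-quota lending.
# This is NOT Run:ai's implementation -- just the concept in miniature.
from dataclasses import dataclass

@dataclass
class Project:
    name: str
    quota: int       # GPUs guaranteed to this project
    demand: int      # GPUs its queued jobs currently want
    allocated: int = 0

def schedule(projects: list[Project], total_gpus: int) -> list[Project]:
    # Pass 1: every project gets min(demand, quota) -- its guaranteed share.
    for p in projects:
        p.allocated = min(p.demand, p.quota)
    spare = total_gpus - sum(p.allocated for p in projects)
    # Pass 2: lend idle GPUs, one at a time, to projects still wanting more.
    # A real scheduler would preempt these loans when the owner needs them.
    while spare > 0:
        hungry = [p for p in projects if p.allocated < p.demand]
        if not hungry:
            break
        hungry.sort(key=lambda p: p.allocated / max(p.quota, 1))  # fair share
        hungry[0].allocated += 1
        spare -= 1
    return projects

if __name__ == "__main__":
    cluster = [Project("team-a", quota=4, demand=1),
               Project("team-b", quota=4, demand=8)]
    for p in schedule(cluster, total_gpus=8):
        print(f"{p.name}: {p.allocated} GPUs (quota {p.quota})")
    # team-a keeps the 1 GPU it needs; team-b gets its 4 plus 3 over-quota.
```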
Fractional GPUs
Optimize workloads that need far less than a full GPU, such as model building or inference. Instead of allocating 10 GPUs to 10 data scientists (each using about 1/10th of the compute), a fractional GPU instance places all 10 data scientists on a single GPU. The other 9 GPUs are then free for a more productive training task that requires massive GPU resources.
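As a concrete sketch, the request below asks Kubernetes for half a GPU using the pod-annotation convention from Run:ai's documentation (`gpu-fraction`); verify the exact key and the scheduler name against the Run:ai version deployed on your cluster. The namespace, labels, and image are placeholders.

```python
# Sketch: request half a GPU for a pod via Run:ai's fraction annotation.
# The "gpu-fraction" key follows Run:ai's documented convention; confirm it
# against the Run:ai version on your cluster before relying on it.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside the cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(
        name="notebook-half-gpu",
        annotations={"gpu-fraction": "0.5"},   # half of one physical GPU
        labels={"project": "team-a"},          # hypothetical project label
    ),
    spec=client.V1PodSpec(
        scheduler_name="runai-scheduler",      # hand the pod to Run:ai
        containers=[client.V1Container(
            name="notebook",
            image="jupyter/base-notebook",     # placeholder image
        )],
        restart_policy="Never",
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="team-a", body=pod)
```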
Distributed Training
Leverage multiple GPUs, across nodes, to train large AI models. Distributed training runs automatically with little to no intervention from the data scientist: it is built into the Run:ai platform, so there is no extra code to write or enable.
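For context on what is being automated: launched by hand, a data-parallel job typically needs boilerplate like the generic PyTorch sketch below, with rank and world-size wiring supplied to every worker. This is not Run:ai code; it is the plumbing an orchestration layer takes off the data scientist's plate.

```python
# Generic PyTorch data-parallel boilerplate that an orchestration layer
# automates: each worker joins the process group, pins its GPU, and wraps
# the model so gradients are synchronized. Not Run:ai-specific code.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # RANK / WORLD_SIZE / MASTER_ADDR are injected per worker by the launcher.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 10).cuda(local_rank)  # stand-in model
    model = DDP(model, device_ids=[local_rank])

    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    for _ in range(100):
        x = torch.randn(32, 1024, device=local_rank)
        loss = model(x).sum()
        opt.zero_grad()
        loss.backward()   # gradients are all-reduced across workers here
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```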
Visibility
Workloads, resource allocation, and utilization can all be viewed through the Run:ai platform's user interface. Create departments, assign teams, and allocate resources to specific projects. Monitor usage by cluster, node, project, or a single user's job. This visibility into usage also provides the data to justify additional GPU nodes.
Utilization Only Goes Up
The GPU quota system and fair scheduling work in tandem to dynamically allocate resources for peak GPU utilization. When one job's resources sit idle, another job automatically picks them up to accelerate its own task. For example, if a team's 4-GPU quota goes unused overnight, another team's training job can borrow those 4 GPUs and return them when the owner resumes work.
- Almost always: your training tasks receive over-quota resources from tasks with lower utilization.
- Worst case: your tasks run with your requested GPU quota, the same way they would run without Run:ai.
- Effectively: every GPU in your cluster is allocated to a job and utilized to its fullest potential. Every minute a GPU sits idle is an expensive minute.