You are assessing GPU driver status and AI/ML workload capabilities.

Your Task

Evaluate the GPU's driver configuration and suitability for AI/ML workloads, including deep learning frameworks, compute capabilities, and performance optimization.

1. Driver Status Assessment

  • Installed driver: Type (proprietary/open-source) and version
  • Driver source: Distribution package, vendor installer, or compiled
  • Driver status: Loaded, functioning, errors
  • Kernel module: Module name and status
  • Driver age: Release date and recency
  • Latest driver: Compare installed vs. available
  • Driver compatibility: Kernel version compatibility
  • Secure boot status: Impact on driver loading
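
The checks above can be scripted. A minimal sketch for the NVIDIA case on Linux, using the /proc path and nvidia-smi query flags listed later under "Commands to Use":

# Minimal driver-status probe for an NVIDIA GPU on Linux. Assumes the
# proprietary driver exposes /proc/driver/nvidia/version when loaded.
import pathlib
import subprocess

version_file = pathlib.Path("/proc/driver/nvidia/version")
if version_file.exists():
    print("Kernel module loaded:", version_file.read_text().splitlines()[0])
else:
    print("NVIDIA kernel module not loaded")

try:
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    print("Driver version:", result.stdout.strip())
except (FileNotFoundError, subprocess.CalledProcessError):
    print("nvidia-smi missing or failing; driver likely not functional")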

2. Compute Framework Support

  • CUDA availability: CUDA Toolkit installation status
  • CUDA version: Installed CUDA version
  • CUDA compatibility: GPU compute capability vs. CUDA requirements
  • ROCm availability: For AMD GPUs
  • ROCm version: Installed ROCm version
  • OpenCL support: OpenCL runtime and version
  • oneAPI: Intel oneAPI toolkit status
  • Framework libraries: cuDNN, cuBLAS, TensorRT, etc.
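
To cross-check toolkit presence against what Python actually sees, a quick probe helps. A sketch assuming an NVIDIA system; it falls back to nvcc when PyTorch is absent:

# Sketch: report the CUDA version the Python stack sees, falling back to
# the standalone toolkit compiler. Assumes an NVIDIA system.
import shutil
import subprocess

try:
    import torch
    print(f"torch {torch.__version__}, built for CUDA {torch.version.cuda}")
    print("CUDA usable at runtime:", torch.cuda.is_available())
except ImportError:
    nvcc = shutil.which("nvcc")
    if nvcc:
        print(subprocess.run([nvcc, "--version"], capture_output=True,
                             text=True).stdout.strip())
    else:
        print("Neither PyTorch nor nvcc found; CUDA toolkit likely absent")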

3. GPU Compute Capabilities

  • Compute capability: NVIDIA CUDA compute version (e.g., 8.6, 8.9)
  • Architecture suitability: Architecture generation for AI/ML
  • Tensor cores: Presence and version (Gen 1/2/3/4)
  • RT cores: Ray tracing acceleration (less relevant for ML)
  • Memory bandwidth: Critical for ML workloads
  • VRAM capacity: Memory size for model loading
  • FP64/FP32/FP16/INT8: Precision support
  • TF32: Tensor Float 32 support (Ampere+)
  • Mixed precision: Automatic mixed precision capability
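
Compute capability and the key specs above are queryable from PyTorch. A sketch, assuming torch with CUDA is installed; the tensor-core test is a heuristic, since some capability-7.5 parts (the GTX 16 series) lack tensor cores:

# Sketch: compute capability and key specs via PyTorch (assumes torch with
# CUDA). Capability 7.0+ generally implies tensor cores, with exceptions
# such as the GTX 16 series at 7.5.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    major, minor = torch.cuda.get_device_capability(0)
    print(f"{props.name}: compute capability {major}.{minor}")
    print(f"VRAM: {props.total_memory / 2**30:.1f} GiB, "
          f"SMs: {props.multi_processor_count}")
    print("Tensor cores (heuristic):", (major, minor) >= (7, 0))
else:
    print("No CUDA device visible to PyTorch")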

4. Deep Learning Framework Compatibility

  • PyTorch: Installation status and CUDA/ROCm support
  • TensorFlow: Installation and GPU backend
  • JAX: Google JAX framework support
  • ONNX Runtime: ONNX with GPU acceleration
  • MXNet: Apache MXNet support (note: MXNet is retired upstream)
  • Hugging Face: Transformers library GPU support
  • Framework versions: Installed versions and compatibility
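
These per-framework checks mirror the one-liners under "Commands to Use"; a consolidated probe might look like this sketch (the framework set is illustrative):

# Sketch: probe which frameworks are importable and whether each sees a GPU.
import importlib

def probe(name, gpu_check):
    try:
        mod = importlib.import_module(name)
    except ImportError:
        print(f"{name}: not installed")
        return
    version = getattr(mod, "__version__", "unknown")
    try:
        gpu = gpu_check(mod)
    except Exception as exc:  # backend probing can itself fail
        gpu = f"probe failed ({exc})"
    print(f"{name} {version}: GPU -> {gpu}")

probe("torch", lambda m: m.cuda.is_available())
probe("tensorflow", lambda m: bool(m.config.list_physical_devices("GPU")))
probe("jax", lambda m: [d.platform for d in m.devices()])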

5. AI/ML Library Ecosystem

  • cuDNN: NVIDIA Deep Neural Network library
  • cuBLAS: CUDA Basic Linear Algebra Subprograms
  • TensorRT: High-performance deep learning inference
  • NCCL: NVIDIA Collective Communications Library (multi-GPU)
  • MIOpen: AMD GPU-accelerated primitives
  • rocBLAS: AMD GPU BLAS library
  • oneDNN: Intel oneAPI Deep Neural Network Library
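
Framework-bundled library versions can be read from PyTorch directly. A sketch, assuming torch with CUDA; the bundled copies may differ from the system-wide libraries that ldconfig reports:

# Sketch: library versions bundled with PyTorch (assumes torch with CUDA).
# These may differ from the system-wide copies that ldconfig finds.
import torch

print("cuDNN:", torch.backends.cudnn.version())
if torch.cuda.is_available():
    print("NCCL:", ".".join(map(str, torch.cuda.nccl.version())))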

6. Performance Characteristics

  • Memory bandwidth: GB/s for data transfer
  • Compute throughput: TFLOPS for different precisions
    • FP64 (double precision)
    • FP32 (single precision)
    • FP16 (half precision)
    • INT8 (integer quantization)
    • TF32 (Tensor Float 32)
  • Tensor core performance: Dedicated AI acceleration
  • Sparse tensor support: Structured sparsity acceleration
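
Spec-sheet TFLOPS figures can be sanity-checked with a crude matmul timing. A rough sketch, not a calibrated benchmark; it assumes torch with CUDA, and the matrix size n=4096 is an arbitrary choice:

# Rough FP16 matmul throughput probe (a sketch, not a calibrated benchmark;
# assumes torch with CUDA). A matmul of two n x n matrices costs ~2*n^3 FLOPs.
import time
import torch

n = 4096  # arbitrary size, large enough to keep the GPU busy
a = torch.randn(n, n, dtype=torch.float16, device="cuda")
b = torch.randn(n, n, dtype=torch.float16, device="cuda")
for _ in range(3):  # warm-up to exclude one-time initialization cost
    a @ b
torch.cuda.synchronize()

iters = 20
t0 = time.perf_counter()
for _ in range(iters):
    a @ b
torch.cuda.synchronize()
dt = (time.perf_counter() - t0) / iters
print(f"~{2 * n**3 / dt / 1e12:.1f} TFLOPS (FP16 matmul, n={n})")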

7. Model Size Compatibility

  • VRAM capacity: Total GPU memory
  • Practical model sizes: Estimated model capacity
    • Small models: < 1B parameters
    • Medium models: 1B-7B parameters
    • Large models: 7B-70B parameters
    • Very large models: > 70B parameters
  • Batch size implications: VRAM for different batch sizes
  • Multi-GPU potential: Scaling across GPUs
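
A common rule of thumb for the capacity estimates above: inference needs roughly parameter count times bytes per parameter, plus overhead; training needs several times more for gradients, optimizer state, and activations. A heuristic sketch (the 1.2x overhead factor is an assumption, not a measured constant):

# Heuristic VRAM estimate for inference: params * bytes-per-param * overhead.
# The 1.2x overhead factor (activations, KV cache, fragmentation) is a rough
# assumption; training requires several times more memory.
def inference_vram_gib(params_billion, bytes_per_param=2, overhead=1.2):
    return params_billion * 1e9 * bytes_per_param * overhead / 2**30

for size in (1, 7, 13, 70):
    print(f"{size:>3}B params @ FP16: ~{inference_vram_gib(size):.0f} GiB")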

8. Container and Virtualization Support

  • Docker NVIDIA runtime: NVIDIA Container Toolkit (successor to the deprecated nvidia-docker)
  • Docker ROCm runtime: ROCm Docker support
  • Podman GPU support: GPU passthrough capability
  • Kubernetes GPU: Device plugin support
  • GPU passthrough: VM GPU assignment capability
  • vGPU support: Virtual GPU for multi-tenancy
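
A quick presence check for the container toolkit before running the Docker smoke test listed under "Commands to Use" — a minimal sketch:

# Sketch: check for the NVIDIA Container Toolkit CLI before attempting the
# full Docker GPU smoke test.
import shutil
import subprocess

if shutil.which("nvidia-container-cli"):
    subprocess.run(["nvidia-container-cli", "info"], check=False)
else:
    print("nvidia-container-cli not found; NVIDIA Container Toolkit missing")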

9. Monitoring and Profiling Tools

  • nvidia-smi: Real-time monitoring (NVIDIA)
  • rocm-smi: ROCm system management (AMD)
  • Nsight Systems: NVIDIA profiling suite
  • Nsight Compute: CUDA kernel profiler
  • nvtop/radeontop: Terminal GPU monitoring
  • PyTorch profiler: Framework-level profiling
  • TensorBoard: Training visualization
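
Framework-level profiling can be smoke-tested in a few lines. A sketch using torch.profiler, assuming torch with a CUDA device:

# Sketch: minimal torch.profiler session (assumes torch with a CUDA device).
import torch
from torch.profiler import ProfilerActivity, profile

x = torch.randn(1024, 1024, device="cuda")
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    y = (x @ x).relu()
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=5))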

10. Optimization Features

  • Automatic mixed precision: AMP support
  • Gradient checkpointing: Memory optimization
  • Flash Attention: Optimized attention mechanisms
  • Quantization support: INT8, INT4 inference
  • Model compilation: TorchScript, XLA, TensorRT
  • Distributed training: Multi-GPU training support
  • CUDA graphs: Kernel launch optimization
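
AMP support is easiest to verify by running one training step under autocast. A sketch assuming torch with CUDA; the linear model and batch sizes are illustrative placeholders:

# Sketch: one optimizer step under automatic mixed precision (assumes torch
# with CUDA; model and dimensions are placeholders).
import torch

model = torch.nn.Linear(512, 512).cuda()
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()  # scales loss to avoid FP16 underflow

x = torch.randn(64, 512, device="cuda")
with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = model(x).square().mean()
scaler.scale(loss).backward()
scaler.step(opt)
scaler.update()
print("AMP step completed; loss:", loss.item())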

11. Workload Suitability Assessment

  • Training capability: Suitable for training workloads
  • Inference capability: Suitable for inference
  • Model type suitability:
    • Computer vision (CNNs)
    • Natural language processing (Transformers)
    • Generative AI (Diffusion models, LLMs)
    • Reinforcement learning
  • Performance tier: Consumer, Professional, Data Center

12. Bottleneck and Limitation Analysis

  • Memory bottlenecks: VRAM limitations for large models
  • Compute bottlenecks: Raw compute throughput limiting training speed
  • PCIe bandwidth: Data transfer limitations
  • Driver limitations: Missing features or bugs
  • Power throttling: Thermal or power constraints
  • Multi-GPU scaling: Efficiency of multi-GPU setup
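
PCIe transfer limits can be probed with a pinned-memory copy. A rough sketch assuming torch with CUDA; a result far below the link's rated bandwidth suggests a transfer bottleneck:

# Sketch: rough host-to-device bandwidth probe with a pinned 1 GiB buffer
# (assumes torch with CUDA).
import time
import torch

host = torch.empty(1024**3 // 4, dtype=torch.float32, pin_memory=True)
dev = torch.empty_like(host, device="cuda")
torch.cuda.synchronize()
t0 = time.perf_counter()
dev.copy_(host, non_blocking=True)
torch.cuda.synchronize()
gbs = host.numel() * 4 / (time.perf_counter() - t0) / 1e9
print(f"Host-to-device: ~{gbs:.1f} GB/s")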

Commands to Use

GPU and driver detection:

  • nvidia-smi (NVIDIA)
  • rocm-smi (AMD)
  • lspci | grep -iE 'vga|3d|display' (discrete NVIDIA GPUs often enumerate as "3D controller", not VGA)
  • lspci -v | grep -iEA 20 'vga|3d'

NVIDIA driver details:

  • nvidia-smi -q
  • cat /proc/driver/nvidia/version
  • modinfo nvidia
  • nvidia-smi --query-gpu=driver_version --format=csv,noheader

AMD driver details:

  • modinfo amdgpu
  • rocminfo
  • /opt/rocm/bin/rocm-smi --showdriverversion

CUDA/ROCm installation:

  • nvcc --version (CUDA compiler)
  • which nvcc
  • ls /usr/local/cuda*/
  • echo $CUDA_HOME
  • hipcc --version (ROCm)
  • ls /opt/rocm/

Compute capability:

  • nvidia-smi --query-gpu=compute_cap --format=csv,noheader
  • nvidia-smi -q | grep "Compute Capability"

Libraries check:

  • ldconfig -p | grep cudnn
  • ldconfig -p | grep cublas
  • ldconfig -p | grep tensorrt
  • ldconfig -p | grep nccl
  • ls /usr/lib/x86_64-linux-gnu/ | grep -i cuda

Python framework check:

  • python3 -c "import torch; print(f'PyTorch: {torch.__version__}, CUDA: {torch.cuda.is_available()}, Version: {torch.version.cuda}')"
  • python3 -c "import tensorflow as tf; print(f'TensorFlow: {tf.__version__}, GPU: {tf.config.list_physical_devices(\"GPU\")}')"
  • python3 -c "import torch; print(f'Compute Capability: {torch.cuda.get_device_capability()}')"

Container runtime:

  • docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi (adjust the image tag to a CUDA version the installed driver supports)
  • which nvidia-container-cli
  • nvidia-container-cli info

OpenCL:

  • clinfo
  • clinfo | grep "Device Name"

System libraries:

  • dpkg -l | grep -i cuda
  • dpkg -l | grep -i nvidia
  • dpkg -l | grep -i rocm
  • (Debian/Ubuntu; on RPM-based distros use rpm -qa | grep -i <pattern>)

Performance info:

  • nvidia-smi --query-gpu=name,memory.total,memory.free,driver_version,compute_cap --format=csv
  • nvidia-smi dmon -s pucvmet (dynamic monitoring)

Output Format

Executive Summary

GPU: [model]
Driver: [proprietary/open] v[version] ([status])
Compute: [CUDA/ROCm] v[version] (Compute [capability])
AI/ML Readiness: [Ready/Partial/Not Ready]
Best For: [Training/Inference/Both]
Recommended Frameworks: [PyTorch, TensorFlow, etc.]

Detailed AI/ML Assessment

Driver Status:

  • Type: [Proprietary/Open Source]
  • Version: [version number]
  • Release Date: [date]
  • Status: [Loaded/Error]
  • Kernel Module: [module] ([loaded/not loaded])
  • Latest Available: [version]
  • Update Recommended: [Yes/No]
  • Secure Boot: [Compatible/Issue]

Compute Framework Availability:

  • CUDA Toolkit: [Installed/Not Installed] - v[version]
  • CUDA Driver API: v[version]
  • ROCm: [Installed/Not Installed] - v[version]
  • OpenCL: [Available/Not Available] - v[version]
  • Compute Capability: [X.X] ([architecture name])

GPU Compute Specifications:

  • Architecture: [Turing/Ampere/Ada/RDNA3/Xe]
  • Tensor Cores: [Yes/No] - [Generation]
  • CUDA Cores / SPs: [count]
  • VRAM: [GB] [memory type]
  • Memory Bandwidth: [GB/s]
  • Precision Support:
    • FP64: [TFLOPS]
    • FP32: [TFLOPS]
    • FP16: [TFLOPS]
    • INT8: [TOPS]
    • TF32: [Yes/No]

AI/ML Libraries:

  • cuDNN: [version] ([installed/missing])
  • cuBLAS: [version] ([installed/missing])
  • TensorRT: [version] ([installed/missing])
  • NCCL: [version] ([installed/missing])
  • MIOpen: [version] (AMD only)
  • rocBLAS: [version] (AMD only)

Deep Learning Framework Support:

  • PyTorch: [version]
    • CUDA Enabled: [Yes/No]
    • CUDA Version: [version]
    • cuDNN Version: [version]
  • TensorFlow: [version]
    • GPU Support: [Yes/No]
    • CUDA Version: [version]
  • JAX: [installed/not installed]
  • ONNX Runtime: [GPU backend available]

Container Support:

  • NVIDIA Container Toolkit: [installed/not installed]
  • Docker GPU Access: [working/not working]
  • Podman GPU Support: [available]

Model Capacity Estimates:

  • Small Models (< 1B params): [batch size X]
  • Medium Models (1B-7B params): [batch size X]
  • Large Models (7B-70B params): [batch size X]
  • Very Large Models (> 70B params): [requires multi-GPU or not possible]

Example workload estimates based on [GB] VRAM:

  • LLaMA 7B: [inference only/training possible]
  • Stable Diffusion: [batch size X]
  • BERT Base: [batch size X]
  • GPT-2: [batch size X]

Workload Suitability:

  • Training:
    • Small models: [Excellent/Good/Fair/Poor]
    • Medium models: [rating]
    • Large models: [rating]
  • Inference:
    • Real-time: [Excellent/Good/Fair/Poor]
    • Batch: [rating]
    • Low-latency: [rating]

Use Case Recommendations:

  • Computer Vision (CNNs): [Excellent/Good/Fair/Poor]
  • NLP (Transformers): [rating]
  • Generative AI (LLMs): [rating]
  • Diffusion Models: [rating]
  • Reinforcement Learning: [rating]

Performance Tier:

  • Category: [Consumer/Professional/Data Center]
  • Training Performance: [rating]
  • Inference Performance: [rating]
  • Multi-GPU Scaling: [available/not available]

Optimization Features Available:

  • Automatic Mixed Precision: [Yes/No]
  • Tensor Core Utilization: [Yes/No]
  • TensorRT Optimization: [Available]
  • Flash Attention: [Supported]
  • INT8 Quantization: [Supported]
  • Multi-GPU Training: [Possible with [count] GPUs]

Limitations and Bottlenecks:

  • VRAM Constraint: [assessment]
  • Memory Bandwidth: [adequate/limited]
  • Compute Throughput: [assessment]
  • PCIe Bottleneck: [yes/no]
  • Driver Limitations: [any known issues]
  • Power/Thermal: [throttling concerns]

Recommendations:

  1. [Driver update/optimization suggestions]
  2. [Framework installation recommendations]
  3. [Workload optimization suggestions]
  4. [Hardware upgrade path if applicable]
  5. [Container/virtualization setup if beneficial]

AI/ML Readiness Scorecard

Driver Setup:        [βœ“/βœ—/⚠] [details]
CUDA/ROCm Install:   [βœ“/βœ—/⚠] [details]
Framework Support:   [βœ“/βœ—/⚠] [details]
Library Ecosystem:   [βœ“/βœ—/⚠] [details]
Container Runtime:   [βœ“/βœ—/⚠] [details]
VRAM Capacity:       [βœ“/βœ—/⚠] [details]
Compute Performance: [βœ“/βœ—/⚠] [details]

Overall Readiness: [Ready/Needs Setup/Limited/Not Suitable]

AI-Readable JSON

{
  "driver": {
    "type": "proprietary|open_source",
    "version": "",
    "status": "loaded|error",
    "latest_available": "",
    "update_recommended": false
  },
  "compute_platform": {
    "cuda": {
      "installed": false,
      "version": "",
      "compute_capability": ""
    },
    "rocm": {
      "installed": false,
      "version": ""
    },
    "opencl": {
      "available": false,
      "version": ""
    }
  },
  "gpu_specs": {
    "architecture": "",
    "tensor_cores": false,
    "vram_gb": 0,
    "memory_bandwidth_gbs": 0,
    "fp32_tflops": 0,
    "fp16_tflops": 0,
    "int8_tops": 0,
    "tf32_support": false
  },
  "libraries": {
    "cudnn": "",
    "cublas": "",
    "tensorrt": "",
    "nccl": ""
  },
  "frameworks": {
    "pytorch": {
      "installed": false,
      "version": "",
      "cuda_available": false
    },
    "tensorflow": {
      "installed": false,
      "version": "",
      "gpu_available": false
    }
  },
  "container_support": {
    "nvidia_container_toolkit": false,
    "docker_gpu_working": false
  },
  "workload_suitability": {
    "training": {
      "small_models": "excellent|good|fair|poor",
      "medium_models": "",
      "large_models": ""
    },
    "inference": {
      "real_time": "",
      "batch": ""
    }
  },
  "model_capacity": {
    "vram_gb": 0,
    "small_model_batch_size": 0,
    "llama_7b_possible": false,
    "stable_diffusion_batch": 0
  },
  "optimization_features": {
    "amp_support": false,
    "tensor_core_utilization": false,
    "tensorrt_available": false,
    "int8_quantization": false
  },
  "bottlenecks": {
    "vram_limited": false,
    "compute_limited": false,
    "pcie_bottleneck": false
  },
  "ai_ml_readiness": "ready|needs_setup|limited|not_suitable"
}

Execution Guidelines

  1. Identify GPU vendor first: NVIDIA, AMD, or Intel
  2. Check driver installation: Verify driver is loaded and working
  3. Assess compute platform: CUDA for NVIDIA, ROCm for AMD
  4. Query compute capability: Critical for framework compatibility
  5. Check library installation: cuDNN, TensorRT, etc.
  6. Test framework access: Try importing PyTorch/TensorFlow with GPU
  7. Evaluate VRAM capacity: Estimate model sizes
  8. Check container support: Important for ML workflows
  9. Identify bottlenecks: VRAM, compute, or driver issues
  10. Provide actionable recommendations: Setup steps or optimizations

Important Notes

  • NVIDIA GPUs have the most mature AI/ML ecosystem
  • CUDA compute capability determines supported features
  • cuDNN is critical for deep learning performance
  • VRAM is often the primary bottleneck for large models
  • Container runtimes simplify framework management
  • AMD ROCm support is improving but less mature than CUDA
  • Intel GPUs are emerging in AI/ML space
  • Tensor cores provide significant speedup for mixed precision
  • Driver version must match CUDA toolkit requirements
  • Some features require specific GPU generations
  • Multi-GPU setups require additional configuration
  • Consumer GPUs can be effective for smaller workloads
  • Professional/datacenter GPUs offer better reliability and support

Be thorough and practical - provide a clear assessment of AI/ML readiness and actionable next steps.