cuBLASLt Grouped GEMM Documentation 2021 Jun 2026
Working implementation samples can be found in the NVIDIA CUDALibrarySamples GitHub repository, specifically under the cuBLASLt directory.

Grouped GEMM vs. Batched GEMM

| Feature | Batched GEMM (cublasGemmBatchedEx) | Grouped GEMM (cublasLtMatmul) |
| :--- | :--- | :--- |
| Dimensions | All GEMMs must have the same dimensions. | Each GEMM can have unique dimensions. |
| Overhead | Lower launch overhead than individual calls. | Optimized for disparate problem sizes in one kernel. |
| Flexibility | Rigid layout and data types. | High flexibility in layouts, epilogues, and precisions. |

How to Implement
| Function | Purpose |
| :--- | :--- |
| cublasLtCreate | Initialize the library handle. |
| cublasLtMatmulDescCreate | Create the GEMM operation descriptor. |
| cublasLtMatrixLayoutCreate | Define dimensions and memory layout for matrices. |
| cublasLtMatmulPreferenceCreate | Define constraints for kernel selection (workspace). |
| cublasLtMatmulAlgoGetHeuristic | Find the best kernel for the grouped problem. |
| cublasLtMatmul | Execute the grouped matrix multiplication. |
Unlike standard batched GEMMs, each operation in a group can have unique dimensions.
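
To make the difference in dimension rules concrete, here is a minimal CPU-side reference sketch in plain Python. It does not call cuBLASLt and says nothing about how the library schedules work on the GPU. The helper names `gemm`, `batched_gemm`, and `grouped_gemm` are hypothetical and chosen for illustration only. It assumes row-major nested lists and the usual GEMM form D = alpha * A * B + beta * C.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def shape(x: Matrix) -> Tuple[int, int]:
    """Return (rows, cols) of a row-major nested-list matrix."""
    return (len(x), len(x[0]))


def gemm(a: Matrix, b: Matrix, c: Matrix,
         alpha: float = 1.0, beta: float = 0.0) -> Matrix:
    """Reference GEMM: alpha * (a @ b) + beta * c."""
    (m, k), (kb, n) = shape(a), shape(b)
    if kb != k:
        raise ValueError(f"inner dimensions differ: A is {m}x{k}, B is {kb}x{n}")
    if shape(c) != (m, n):
        raise ValueError(f"C must be {m}x{n}")
    return [
        [alpha * sum(a[i][p] * b[p][j] for p in range(k)) + beta * c[i][j]
         for j in range(n)]
        for i in range(m)
    ]


def batched_gemm(as_: List[Matrix], bs: List[Matrix], cs: List[Matrix],
                 alpha: float = 1.0, beta: float = 0.0) -> List[Matrix]:
    """Batched semantics (illustrative): every problem shares one (m, n, k)."""
    shapes = {(shape(a), shape(b)) for a, b in zip(as_, bs)}
    if len(shapes) > 1:
        raise ValueError("batched GEMM requires identical dimensions for every problem")
    return [gemm(a, b, c, alpha, beta) for a, b, c in zip(as_, bs, cs)]


def grouped_gemm(as_: List[Matrix], bs: List[Matrix], cs: List[Matrix],
                 alpha: float = 1.0, beta: float = 0.0) -> List[Matrix]:
    """Grouped semantics (illustrative): each problem carries its own (m, n, k)."""
    return [gemm(a, b, c, alpha, beta) for a, b, c in zip(as_, bs, cs)]


# Two problems with different shapes: (1x2)*(2x1) and (2x1)*(1x3).
a_list = [[[1.0, 2.0]], [[1.0], [2.0]]]
b_list = [[[3.0], [4.0]], [[1.0, 0.0, 2.0]]]
c_list = [[[0.0]], [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]]

results = grouped_gemm(a_list, b_list, c_list)
print(results)
```

Passing the same `a_list`, `b_list`, and `c_list` to `batched_gemm` raises `ValueError`, because the two problems do not share a single set of dimensions. A reference like this can also serve as a host-side check when validating GPU results on small inputs.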