Matrix multiplication of square bf16 matrices, accumulated in fp32.
Benchmarks:

N=4096: Kernel 763 TFLOPs, cuBLAS 716 TFLOPs
N=8192: Kernel 808 TFLOPs, cuBLAS 795 TFLOPs

Explanation at https://cudaforfun.substack.com/p/outperforming-cublas-on-h100-a-worklog
Build and run:

    make matmul && out/matmul

Example kernels are in examples/matmul/; orchestration is in matmul.cu.
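For reference, the computation is an ordinary C = A x B over square N x N matrices: bf16 inputs, dot products accumulated in fp32. Below is a minimal naive sketch of that reference computation, not the optimized kernels in examples/matmul/; row-major layout and bf16 output storage are assumptions here.

```cuda
// Naive reference: C = A * B for square N x N matrices.
// Inputs are bf16; the dot product is accumulated in fp32.
// Correctness baseline only, not the optimized kernel.
#include <cuda_bf16.h>

__global__ void matmul_ref(const __nv_bfloat16* A,
                           const __nv_bfloat16* B,
                           __nv_bfloat16* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= N || col >= N) return;

    float acc = 0.0f;  // fp32 accumulator
    for (int k = 0; k < N; ++k) {
        acc += __bfloat162float(A[row * N + k]) *
               __bfloat162float(B[k * N + col]);
    }
    C[row * N + col] = __float2bfloat16(acc);  // assumption: C stored as bf16
}
```

Launch it over a 2D grid covering the N x N output, e.g. 16x16 thread blocks.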
We compute the sum of 2^30 elements.
Build and run:

    make sum && out/sum

Kernel: 3240.11 GB/s
CUB library: 3193 GB/s

Example kernels are in sum.cu.
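The CUB figure comes from the library's device-wide reduction. A minimal sketch of how such a baseline is typically invoked is below; the element type and buffer names are illustrative, not taken from sum.cu.

```cuda
// Baseline sum via CUB's device-wide reduction (names and element type are
// illustrative; see sum.cu for the actual kernels and benchmark harness).
#include <cub/cub.cuh>
#include <cuda_runtime.h>

void cub_sum(const int* d_in, int* d_out, int num_items) {
    void*  d_temp     = nullptr;
    size_t temp_bytes = 0;
    // First call with a null buffer only queries the temp-storage size.
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, num_items);
    cudaMalloc(&d_temp, temp_bytes);
    // Second call performs the actual reduction over num_items elements.
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, num_items);
    cudaFree(d_temp);
}
```

Bandwidth is the natural metric here: a reduction reads each element once, so GB/s measures how close the kernel gets to the GPU's memory bandwidth.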