December 2024 • High-Performance Computing
Implementation of Cannon's algorithm for distributed matrix multiplication on HPC clusters. The objective was to minimize communication overhead and maximize computational throughput for large-scale matrix operations through efficient process organization and data distribution strategies.
Language: C++ (with Open MPI)
Environment: Linux-based HPC Cluster (Pronto)
Tools: mpic++ compiler, Shell scripting for benchmarking
Implemented Cannon's algorithm using MPI communication primitives to organize processes into a square 2D grid topology. Each process owns one block of A, B, and C: an initial skew shifts the blocks in row i of A left by i positions and the blocks in column j of B up by j positions, after which the algorithm alternates local block multiplication with single-step cyclic shifts of A along rows and B along columns. This schedule keeps every process computing on a valid block pair at every step, maintains balanced load, and keeps the communication-to-computation ratio low.
Achieved over 50 GFLOPS sustained performance on production HPC hardware, with near-linear scaling as the process count increases on large matrices. Demonstrated efficient memory utilization through block-wise distribution, and performed a comprehensive performance analysis comparing parallel execution against a serial baseline across a range of matrix sizes. The implementation serves as a foundation for machine learning training pipelines, computational physics simulations, and large-scale numerical linear algebra operations.