At SC21, MemVerge and the DMTCP Project announced a partnership designed to accelerate development and adoption of long-awaited Distributed MultiThreaded Checkpointing (DMTCP) technology.
Checkpointing is commonly used by enterprise apps to minimize downtime, but it is almost impossible for complex distributed HPC apps with massive data sets. Under development for over a decade, DMTCP has recently made the impossible possible for a range of workloads, including VLSI circuit simulators, circuit verification, formalization of mathematics, bioinformatics, network simulators, high energy physics, cybersecurity, big data, middleware, mobile computing, cloud computing, virtualization of GPUs, and high performance computing (HPC). DMTCP stands ready for commercialization and wider deployment.
The collaboration between the DMTCP Project and MemVerge will facilitate DMTCP’s move into the market. The partnership includes MemVerge developers joining the DMTCP Project and contributing to open-source development; MemVerge providing commercial support for the open-source DMTCP software; and MemVerge integrating the fully tested and supported version into application-specific Big Memory Solutions. MemVerge has also begun a collaboration with the National Energy Research Scientific Computing Center (NERSC) to optimize MPI-Agnostic Network-Agnostic (MANA), a plugin on top of DMTCP that has been used for transparent checkpointing of MPI on the Cori and Perlmutter supercomputers.
“Robust, performant checkpointing offers us flexibility in scheduling jobs for system maintenances and real-time data processing for experimental facilities. This feature also allows us to better backfill jobs, which ultimately leads to increased system utilization and improved job throughput for our nearly 8,000 scientific users,” said Rebecca Hartman-Baker, User Engagement Group Lead, National Energy Research Scientific Computing Center (NERSC), Lawrence Berkeley National Laboratory.
Gene Cooperman, Professor at Northeastern University, has led the open-source DMTCP Project for almost 20 years. He is especially excited about the recent three-way collaboration to support MANA for MPI.
According to Professor Cooperman, “The collaboration among NERSC/LBNL, MemVerge, and the DMTCP open-source community will bring reliable and efficient transparent checkpointing to MPI (and later to CUDA) for the production market. While DMTCP and MANA will always remain free and open source, the use of MemVerge technology for rapid writing of memory to stable storage will bring an important enhancement to this technology.”
“Distributed checkpointing is a perfect complement to ZeroIO™ In-Memory Snapshot technology that MemVerge has pioneered,” said Charles Fan, CEO of MemVerge. “We look forward to collaborating with the DMTCP community on future technology and market development.”
“Being able to seamlessly and graciously recover from system failures during complex simulation runs is critical to optimize efficiency for completing jobs with long run-times,” said Mark Nossokoff, Senior Research Analyst at Hyperion Research. “Checkpointing is a well-understood technique for saving the states of independent node memory during a failure mode and restoring that state when the machine is back up and running. Bringing checkpointing capability to big memory architectures with pooled, distributed memory across multiple nodes operating on large datasets should further enable adoption of in-memory computing techniques within the HPC and AI communities. Kudos to MemVerge for stepping up to provide the industry stewardship to make DMTCP a commercial reality.”