Open Source Cluster Monitoring and Management Software and Systems


On this page we have selected two open source monitoring tools for clusters and server farms. Monitoring a cluster calls for a different approach from monitoring a single server. At AmbitWire we hope this selection of open source cluster monitoring and management systems proves helpful in your search.

Ganglia - "Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids. It is based on a hierarchical design targeted at federations of clusters. It leverages widely used technologies such as XML for data representation, XDR for compact, portable data transport, and RRDtool for data storage and visualization. It uses carefully engineered data structures and algorithms to achieve very low per-node overheads and high concurrency. The implementation is robust, has been ported to an extensive set of operating systems and processor architectures, and is currently in use on thousands of clusters around the world. It has been used to link clusters across university campuses and around the world and can scale to handle clusters with 2000 nodes."
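To make the role of XML in that description more concrete, here is a minimal Python sketch that reads one metric per host out of a cluster report. The sample document is made up for illustration, and the element and attribute names (CLUSTER, HOST, METRIC, NAME, VAL) are assumptions modeled on the kind of XML a Ganglia daemon reports; check them against the output of your own installation before relying on them.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like a Ganglia XML report.
# Host names, addresses, and values are invented for this example.
SAMPLE_XML = """<GANGLIA_XML SOURCE="gmond">
  <CLUSTER NAME="demo-cluster">
    <HOST NAME="node01" IP="10.0.0.1">
      <METRIC NAME="load_one" VAL="0.42" TYPE="float" UNITS=""/>
      <METRIC NAME="mem_free" VAL="2048000" TYPE="float" UNITS="KB"/>
    </HOST>
    <HOST NAME="node02" IP="10.0.0.2">
      <METRIC NAME="load_one" VAL="3.10" TYPE="float" UNITS=""/>
    </HOST>
  </CLUSTER>
</GANGLIA_XML>"""


def metric_by_host(xml_text, metric_name):
    """Return {host name: metric value} for one metric across all hosts."""
    root = ET.fromstring(xml_text)
    values = {}
    for host in root.iter("HOST"):
        for metric in host.iter("METRIC"):
            if metric.get("NAME") == metric_name:
                values[host.get("NAME")] = float(metric.get("VAL"))
    return values


loads = metric_by_host(SAMPLE_XML, "load_one")
for name, load in sorted(loads.items()):
    print(f"{name}: load_one={load}")
```

In a real deployment the XML would come from a monitoring daemon over the network rather than from a string constant; the parsing step would stay the same.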

OVIS - "In the area of high-performance computing, the long-term goal of OVIS is to enable efficient and reliable computational clusters. We envision a system-wide integration of resource managers (e.g., scheduler), applications, and system resource analysis capabilities. Run-time information on resource utilization and predictive capabilities for anticipated resource needs and component failure can be used by schedulers and applications in order to better allocate resources. For example, information on reliable (or unreliable) system components can be used by the scheduler in making job allocation assignments and further used by applications in order to invoke fault-tolerance mechanisms. The OVIS tool for Intelligent Scalable Real-Time Monitoring for Large Computational Clusters was created to address the piece of this goal involving resource analysis and failure prediction."
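The idea of a scheduler using reliability information when assigning jobs can be sketched in a few lines of Python. This is not OVIS code or its API: the function name pick_nodes, the max_risk threshold, and the per-node failure estimates are all hypothetical, chosen only to illustrate the allocation step described in the quote.

```python
def pick_nodes(failure_risk, nodes_needed, max_risk=0.2):
    """Choose the most reliable nodes for a job.

    failure_risk maps node name to an estimated probability of failure
    (hypothetical values; a real system would get these from its
    monitoring and prediction layer).
    """
    # Drop nodes whose estimated risk exceeds the threshold.
    eligible = [(risk, name) for name, risk in failure_risk.items()
                if risk <= max_risk]
    if len(eligible) < nodes_needed:
        raise ValueError("not enough sufficiently reliable nodes")
    # Sort by risk (lowest first) and take as many as the job needs.
    return [name for risk, name in sorted(eligible)[:nodes_needed]]


# Invented estimates for a four-node cluster.
risk = {"node01": 0.02, "node02": 0.35, "node03": 0.10, "node04": 0.05}
chosen = pick_nodes(risk, 2)
print(chosen)
```

A production scheduler would weigh reliability against other factors such as load and locality; this sketch looks only at the reliability estimate.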