Best Practices for Monitoring On-Site Instances
Overview
To monitor an on-site instance effectively, three main aspects of the service require attention:
- Access
- Availability
- Utilization
The way you monitor each of these depends largely on the user's specific requirements. However, there are some basic methods that can be employed to provide a level of comfort while ensuring that the system is both accessible and usable.
Choosing a Monitoring System
The first step in the process is selecting a monitoring solution. While there are many applications on the market (both proprietary and open source), only a fairly short list of solutions is widely adopted and supported. The most important attribute to look for is the ability to support the solution in house, unless you specifically want a SaaS provider to do it for you. Second, you need methods to monitor your network, your servers, and your application so that the entire stack is in view. Some monitoring applications focus on only one or two of these areas, so choose accordingly.
Another consideration when choosing the right system is that this solution will be primarily focused on Sugar. It should be as reliable as Sugar itself to ensure minimal downtime for your instance. The monitoring solution should be architected accordingly, with both high availability and redundancy built into the architecture, if that is how your Sugar installation is designed. Also, take into account the need for the monitoring system to potentially send pages to on-call team members in order to notify them of high priority problems.
Because the Sugar application is key to the success of the sales and customer-facing organizations, it is imperative that you have the ability to monitor Sugar from outside of the network (this is particularly true if a considerable part of the organization accesses Sugar from remote locations). Therefore, consider implementing multiple monitoring solutions to ensure maximum coverage.
Monitoring the SugarCRM Stack
From a high level, the Sugar stack consists of the following types of systems:
- Load Balancer
- Webserver
- Databases
- NFS Server
- Elasticsearch
- Session Storage
Although there can be commonality between these different systems in terms of hardware platform, hardware configuration, operating systems, patch levels, etc., they are also unique in various areas such as:
- Running applications
- Configuration specific to the applications
- Possible hardware differences (based on the function/application the hardware performs)
For these reasons, there are tests at the server/network layer (common) as well as the application layer (unique) to allow for the broadest coverage and specifically focused checks. This helps ensure high availability, speedy recognition of incidents, and identification of global or functional performance trends across the Sugar instance.
Monitoring Access
Access describes all of the different methods for reaching the Sugar instance. This generally involves network access, firewalls, VPN connections, and access control lists. Tests can be performed from within the office network or by third-party services. In either case, the basic tests would include the following:
- Ping against the webserver (if ICMP traffic is allowed in the network)
- Telnet to the HTTP port on the server
- HTTP GET test against a suitably available test page
- Ping tests to the database
- Telnet to the database port
- Ping tests to the Elasticsearch server
- Telnet to the Elasticsearch port
- Ping tests to the NFS server
- Telnet to the NFS port
If sessions are stored in a separate location from the webserver, then ensure the same basic ping and telnet tests are set up to verify that the session storage system is accessible.
Provided that a network path exists between the monitoring server or service and the instance, these basic tests tell you whether each system in the stack is reachable. If the webserver responds to HTTP requests or serves the test page exposed for monitoring purposes, then the webserver is accessible. Likewise, if the database and Elasticsearch systems respond to the telnet attempts, then those servers are accessible.
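The telnet-style tests above can be approximated with a short script. The following is a minimal sketch, not a definitive implementation: the function name `port_is_open`, the host names, and the port numbers (80 for HTTP, 3306 assuming a MySQL database, 9200 for Elasticsearch, 2049 for NFS) are illustrative assumptions and should be replaced with the values for your own stack.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # A successful TCP handshake is what a manual telnet test verifies.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Hypothetical hosts and ports, for illustration only.
ACCESS_CHECKS = {
    "webserver": ("sugar-web.example.internal", 80),
    "database": ("sugar-db.example.internal", 3306),
    "elasticsearch": ("sugar-es.example.internal", 9200),
    "nfs": ("sugar-nfs.example.internal", 2049),
}


def run_access_checks(checks=ACCESS_CHECKS) -> dict:
    """Map each system name to True (reachable) or False (unreachable)."""
    return {name: port_is_open(host, port) for name, (host, port) in checks.items()}
```

Calling `run_access_checks()` on a schedule and alerting on any `False` value covers the telnet tests. The ping tests would still need a separate ICMP check where the network allows it.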
Access Best Practices
It is not enough to simply ensure access to the instance exists. The devices in between the client computers and the Sugar server(s) also need individual attention as do the pathways into the network. To that end, a more sophisticated system is required to complete the following:
- Ping tests to the various network devices between the server and the Internet Uplink
- SNMP trap monitoring to capture network device events that may lead to downtime (network devices require specific configurations enabled in order to permit this)
- Configuration monitoring to ensure access stays consistent when configuration is altered, with notifications if something changes unexpectedly that may prevent or inhibit access
- Ping tests across diverse network connections to ensure that a failover of network equipment will not interrupt business
- Out of band monitoring to ensure the Sugar instance is available to the people who need to access it
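One simple way to approach the configuration monitoring item is to fingerprint configuration files and compare them against a stored baseline. The sketch below assumes the relevant configuration is available as files on disk (for example, exported device or server configs). The function names and the JSON baseline file are hypothetical choices for illustration, not a feature of any particular monitoring product.

```python
import hashlib
import json
from pathlib import Path


def fingerprint(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def changed_files(paths, baseline_file: Path) -> list:
    """Return the paths whose contents differ from the stored baseline.

    The baseline is rewritten on every call, so each change is reported once.
    Files seen for the first time are recorded but not reported as changed.
    """
    baseline = {}
    if baseline_file.exists():
        baseline = json.loads(baseline_file.read_text())
    current = {str(p): fingerprint(Path(p)) for p in paths}
    changed = [p for p, digest in current.items()
               if p in baseline and baseline[p] != digest]
    baseline_file.write_text(json.dumps(current, indent=2))
    return changed
```

A non-empty return value is the signal to notify the team, who can then decide whether the change was planned.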
Monitoring Availability
Availability covers not only access, but also the ability to use the instance. To this end, there needs to be additional logic to do the following:
- Send a test API call or login attempt to the instance and look for a successful HTTP response code.
- Check the webserver access log file to ensure it is regularly updating (indicating that work is being done by the application).
- Check the webserver error log file to ensure failures and faults within its ability to serve Sugar are caught and appropriately handled.
- Telnet to the database server directly, or connect through the webserver, to ensure the database is available.
- Check the database logs (as appropriate) to ensure the file updates regularly (indicating the database is actually doing work).
- Monitor the webserver processes to ensure they are up and running.
- Monitor the database processes to ensure they are up and running.
- Monitor the Elasticsearch processes to ensure they are up and running.
- Monitor the session storage processes to ensure they are up and running.
While it may seem redundant to monitor a server as well as the running processes underneath, monitoring the systems in this manner provides multiple layers of checks. This increases the accuracy and reliability of the monitoring systems and, in turn, the support team's ability to use the Sugar application.
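Two of the checks above, the HTTP response code test and the log freshness test, can be sketched with the Python standard library alone. This is an illustrative sketch under stated assumptions: the function names, the five-minute freshness window, and any URL or log path you pass in are assumptions rather than values defined by Sugar.

```python
import os
import time
import urllib.error
import urllib.request


def http_status(url: str, timeout: float = 5.0):
    """Return the HTTP status code for a GET request, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404 or 500.
        return err.code
    except (urllib.error.URLError, OSError):
        # No answer at all: DNS failure, refused connection, or timeout.
        return None


def log_is_fresh(path: str, max_age_seconds: float = 300.0) -> bool:
    """Return True if the file was modified within the last max_age_seconds."""
    try:
        age = time.time() - os.path.getmtime(path)
    except OSError:
        # A missing or unreadable log is treated as a failed check.
        return False
    return age <= max_age_seconds
```

A status of 200 from `http_status` on a test page, combined with `log_is_fresh` returning `True` for the webserver access log, suggests the application is both reachable and doing work. The right freshness window depends on how busy the instance normally is.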
Availability Best Practices
In addition to the primary systems that make Sugar available, secondary systems and pathways should exist and also need to be monitored, as they represent the company's ability to continue business as usual if the primary systems fail:
- Monitor heartbeat processes to ensure the existence of a secondary device, as well as to ensure the primary has not otherwise failed.
Note: This can include databases set up in a cluster, load balancers, Elasticsearch, and session storage servers.
- Monitor the activity between layers of Sugar (webservers, databases, NFS servers, etc.) to ensure a consistent flow of traffic throughout the entire system.
Monitoring Utilization
Monitoring a system's resource utilization provides insight into various issues, including access, availability, and performance problems. If a system is running too close to its maximum capacity, or if it consistently hits that maximum, then problems can (and will) arise that appear to be access or availability related when they are actually resource or scalability problems. To catch such problems early, monitor at least the following:
- Server resource utilization including:
  - Memory usage
  - Disk usage
  - CPU usage
  - CPU load
  - Network usage
  - Connection counts and states
  - Application memory usage
  - Application CPU usage
- Network device utilization including:
  - Memory usage
  - CPU usage (where appropriate)
  - Network usage by interface
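As a rough illustration of how a few of these server metrics might be sampled and compared against limits, the sketch below uses only the Python standard library. The function names and threshold values are assumptions chosen for the example, and `os.getloadavg` is available only on Unix-like systems. A full monitoring agent would also cover memory, network, and connection metrics.

```python
import os
import shutil


def disk_used_percent(path: str = "/") -> float:
    """Return the percentage of disk space used on the filesystem at path."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def load_per_cpu() -> float:
    """Return the 1-minute load average divided by the CPU count (Unix only)."""
    one_minute, _, _ = os.getloadavg()
    return one_minute / (os.cpu_count() or 1)


def breaches(metrics: dict, thresholds: dict) -> dict:
    """Return only the metrics that meet or exceed their threshold."""
    return {name: value for name, value in metrics.items()
            if name in thresholds and value >= thresholds[name]}
```

For example, `breaches({"disk": disk_used_percent(), "load": load_per_cpu()}, {"disk": 90.0, "load": 1.0})` returns an empty dictionary while both values are under their limits, and otherwise returns the offending metrics so they can be included in an alert.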
Utilization Best Practices
Beyond the simple monitoring of computer and network resources, data trending is also highly valuable for gaining insight into the many issues that may occur when implementing Sugar. By collecting trend data for each of the items monitored, a resource usage profile can be assembled and analyzed to identify bottlenecks, resource saturation, usage patterns, opportunities to optimize, or opportunities to downsize. Moreover, presenting trend data visually lets the support teams who maintain the Sugar instance absorb the data quickly and act fast if things go awry. Besides trending the items listed above, here are some other data elements that can be tracked:
- Threshold alerting or agents measuring rapid changes in utilization which would indicate a usage spike or outage that is somehow impacting your instance
- Visual representation of the trend data
- Server and application log parsing for data, which yields insights into user behavior and performance impacting issues
- Application transaction counts
- Application transaction types
If all of the trend data can be visually represented via the same solution, then you gain the ability to cross-compare data elements and identify cause/effect correlations (such as network utilization vs. connection rates vs. transaction types/volumes) that may not otherwise be apparent.
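The idea of alerting on rapid changes in utilization can be illustrated with a small function that compares the newest sample against a trailing average. This is a minimal sketch, not a recommendation of specific values: the function name, the window of five samples, and the factor of two are all assumptions to tune against your own trend data.

```python
from statistics import mean


def is_spike(samples, window: int = 5, factor: float = 2.0) -> bool:
    """Flag the newest sample if it exceeds factor times the trailing mean.

    The trailing mean is computed over the `window` samples that precede
    the newest one, so the newest value does not dilute its own baseline.
    """
    if len(samples) <= window:
        return False  # not enough history to form a baseline
    baseline = mean(samples[-window - 1:-1])
    return baseline > 0 and samples[-1] > factor * baseline
```

With the default settings, a connection count series of `[10, 11, 9, 10, 10, 30]` is flagged because 30 is more than twice the trailing mean of 10, while `[10, 11, 9, 10, 10, 12]` is not.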
Summary
In summary, many solutions on the market can monitor everything identified here, and more. While this list is thorough, it is not comprehensive. There are many other factors that a Sugar user or customer can monitor to ensure top-level performance and availability. These factors depend on the implementation details and requirements that may be unique to each customer. Whatever solution you select, as long as it covers the items described above, you will have a robust monitoring solution that you can build and rely upon.