Infrastructure monitoring is the set of processes and methods used to diagnose and troubleshoot performance and capacity problems across datacenter components such as compute, storage, network and security before a production outage occurs. Monitoring IT infrastructure resources and generating alerts automatically lets organizations use those resources efficiently by ensuring that computing, networking, and storage resources are properly allocated, correctly engineered, and working as expected.
Previously, we covered the basics of the important components of any IT datacenter: compute virtualization, storage, network and security. In this post, we quickly review the basic concepts of infrastructure monitoring, including the different monitoring approaches, techniques, design principles and best practices that apply to both on-prem and cloud datacenters.
Infrastructure Monitoring Overview
Every organization depends on IT resources to build applications and deliver its products and services. To run the business, an organization must build and maintain an IT infrastructure. IT infrastructure means all of the assets necessary to deliver and support IT services in the datacenter, such as compute servers, networks, computer hardware and software, storage, and other equipment.
Organizations implement specialized software tools that aggregate data, in the form of event logs, from across the IT infrastructure landscape. Event logs are generated automatically by applications or devices on the network in response to network traffic or user activity. These log files contain information such as:
- Time and date that the event occurred
- The user that was logged into the machine
- The name of the computer
- A unique identifier
- The source of the event
- Description of the event type.
Some log files may also contain additional information depending on the application where they originated.
Monitoring software can capture these log files from various sources and aggregate them into a single database where they can be sorted, queried and analyzed by humans or by machine algorithms. With this type of infrastructure monitoring, IT organizations can detect operational issues, identify possible security breaches or malicious attacks, and identify new areas of business opportunity.
Monitoring tools also help organizations determine whether additional capacity is required and implement changes before production systems are affected by performance issues. Deployment problems can be identified and resolved, preferably before they become serious, either manually or through automation. The same data can be used to view trends and to plan for future expansion.
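As a rough illustration of aggregating event logs into a single queryable store, the sketch below loads a few records into an in-memory SQLite database and counts events per computer. The sample events, the table layout and the function names are assumptions made for this example; a real monitoring product will use its own schema and storage engine.

```python
import sqlite3

# Hypothetical sample events; the fields mirror the list above:
# time, user, computer name, unique identifier, source, event type.
SAMPLE_EVENTS = [
    ("2024-05-01T10:15:00", "alice", "web-01", "evt-1001", "nginx", "error"),
    ("2024-05-01T10:16:30", "bob", "db-01", "evt-1002", "postgres", "warning"),
    ("2024-05-01T10:17:45", "alice", "web-01", "evt-1003", "nginx", "error"),
]


def build_event_store(events):
    """Load event records into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events ("
        "occurred_at TEXT, username TEXT, computer TEXT, "
        "event_id TEXT PRIMARY KEY, source TEXT, event_type TEXT)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?, ?, ?)", events)
    conn.commit()
    return conn


def count_by_computer(conn, event_type):
    """Return {computer: count} for all events of the given type."""
    rows = conn.execute(
        "SELECT computer, COUNT(*) FROM events "
        "WHERE event_type = ? GROUP BY computer ORDER BY computer",
        (event_type,),
    )
    return dict(rows.fetchall())


if __name__ == "__main__":
    store = build_event_store(SAMPLE_EVENTS)
    print(count_by_computer(store, "error"))
```

Once the events sit in one table, sorting and filtering by time, user or source becomes an ordinary query instead of a search across many separate log files.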
Different Types of Infrastructure Monitoring
Today, there are many monitoring tools available in the market, and each vendor offers one or both of the following types of monitoring.
Agent based monitoring
- Agent-based monitoring tools are usually designed for a particular platform and can collect and analyze data from a wide variety of systems.
- This type of monitoring requires a software package (the agent) to be installed on each system to collect data and send it to the central monitoring platform.
- Therefore, the compatibility of these tools with the existing systems has to be checked before the software is deployed.
- Because these agents are vendor specific, their proprietary nature makes it difficult to migrate monitoring data to a different platform without data loss.
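To make the agent idea concrete, here is a minimal sketch of what an agent does on the monitored system: collect a local metric and serialize it for the central platform. It uses only the Python standard library; the function names and the payload fields are hypothetical, and the step that actually ships the payload to a monitoring server is deliberately left out because it depends on the vendor.

```python
import json
import shutil
import socket
import time


def collect_metrics(path="/"):
    """Collect basic disk metrics for the filesystem containing `path`."""
    usage = shutil.disk_usage(path)
    used_percent = usage.used / usage.total * 100 if usage.total else 0.0
    return {
        "host": socket.gethostname(),
        "timestamp": time.time(),
        "disk_total_bytes": usage.total,
        "disk_used_bytes": usage.used,
        "disk_used_percent": round(used_percent, 2),
    }


def to_payload(metrics):
    """Serialize metrics as JSON, the form an agent might send upstream."""
    return json.dumps(metrics, sort_keys=True)


if __name__ == "__main__":
    print(to_payload(collect_metrics()))
```

A real agent would run this kind of collection on a schedule and cover many more metrics; the sketch only shows the collect-then-send shape.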
Agentless monitoring
- This type of monitoring doesn’t need any software or agents installed on the systems to collect the data.
- These monitoring tools rely on a variety of protocols such as SNMP, WMI, SSH, NetFlow or others to relay system data and statistics to the central monitoring system.
- These built-in features monitor and manage infrastructure information without additional agents.
- Networking devices, servers, flow devices, storage devices, and virtual machines on platforms such as VMware and Hyper-V are all common components that have agentless monitoring capabilities.
Difference between Infrastructure Monitoring & Infrastructure Management
Infrastructure monitoring involves collecting and reviewing data associated with the various components of an organization’s IT infrastructure. Its main purpose is to identify a problem and communicate it, through alerts, to the person responsible.
Infrastructure management, by contrast, receives that alert, assesses the impact of the issue, and evaluates how to mitigate or resolve the problem.
Effective infrastructure monitoring is a key component of optimal infrastructure management, and both are essential to the productivity and profit of an organization.
Key Infrastructure Monitoring Design Principles
An IT datacenter is made up of many hardware and software resources, such as switches, racks, cables, cooling/heating equipment, servers and storage, that together meet the business needs. All these resources must be monitored and reviewed regularly based on their priority, and any issues that are noticed must be addressed. The following principles should be considered before planning and implementing any IT infrastructure monitoring system.
What can become a problem?
- Hardware errors and availability problems – Hardware equipment running in the datacenter can run into issues due to factors like overload and wear and tear. This can affect the availability of the hardware and its services.
- Poor provisioning of compute servers – If servers are not provisioned with the RAM, CPU and storage that the application requires, both the application installed on the server and the server’s hardware resources will run into issues.
- Misallocation of virtual resources – Poor planning leads to misallocation of virtual resources (hosts, virtual machines, network, storage), which can cause either overload or overprovisioning.
- Excessive network utilization and error rates – Poor server and application configurations may result in excessive use of network resources and may lead to network outages.
- Application problems as identified by built-in knowledge base – Apart from hardware issues, application-related issues such as latency, slowness and integrity problems appear if the system is not configured properly. These are generally documented in the knowledge base.
What should be monitored?
- Monitor Hardware Equipment – Hardware devices should be monitored on a regular basis, especially when a hardware failure could result in unplanned downtime or lost revenue. Hardware monitoring tools capture data from the sensors found in computers and other machines. These can include battery life data, power and load sensors, current and voltage sensors, fan speed sensors and user-defined artificial sensors that collect data on the operating system. Monitoring these metrics can help identify a malfunctioning component before its failure causes a server or computer to overheat.
- Monitor Operating Systems – Commonly used operating systems such as Windows and Unix servers must be configured for automatic monitoring, using various monitoring tools and protocols, to identify any OS-related issues. These issues can be related to the memory, page space, storage and CPU configured on the servers.
- Monitor Virtual resources – Virtualization hardware and software must be configured so that the virtual resources are monitored. Generally, the virtualization vendor provides the list of items that need to be monitored so that issues are addressed before they impact the applications running on the VMs.
- Monitor Network – Network monitoring helps to verify that the organization’s internal network is functioning appropriately and delivering the expected levels of speed and performance. With IT infrastructure monitoring tools, the transfer rates and connectivity levels that users are experiencing on the network can be tracked, along with incoming and outgoing connections. Network monitoring can help the IT organization respond proactively when an unauthorized user attempts to access the network.
- Monitor Storage – The storage allocated to the compute server and to the application can easily fill up due to factors like log generation and user activity logs. If sufficient storage space is not allocated, the application or compute server may go into a hung state, which might result in downtime. Storage resources, along with the storage systems, must be regularly monitored to prevent these kinds of capacity-related issues.
- Monitor Applications – Application monitoring, such as monitoring database, messaging, and web servers, is a critical aspect of IT infrastructure monitoring. Software applications deployed on the servers may be used by members of the IT organization or by customers of the business. In either case, applications represent a potential attack vector for a malicious actor and a powerful source of operational and business intelligence. With today’s IT infrastructure monitoring tools, organizations can track user behavior on applications to obtain operational insights and identify business opportunities.
What to do when there is a problem?
The following steps can be performed when a problem is identified by the monitoring tools.
- If the issue is a recurring problem that can be resolved by running a set of OS commands, automate those commands in a script to fix the problem where possible.
- Prioritize and escalate high-severity alerts with text messages or email alerts so that the right personnel are contacted when an issue occurs.
- Create monitoring dashboards for performance and availability based on user-definable criteria.
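The prioritize-and-escalate step can be sketched as a small routing table that maps alert severity to notification channels. The severity names, the channel names and the `Alert` type below are assumptions chosen for illustration and are not taken from any particular monitoring tool.

```python
from dataclasses import dataclass

# Hypothetical routing table: severity -> notification channels.
ROUTES = {
    "critical": ["sms", "email"],
    "warning": ["email"],
    "info": ["dashboard"],
}

# Lower rank means the alert is handled first.
SEVERITY_RANK = {"critical": 0, "warning": 1, "info": 2}


@dataclass
class Alert:
    source: str
    severity: str
    message: str


def prioritize(alerts):
    """Order alerts so the most severe come first (unknown severities last)."""
    return sorted(
        alerts, key=lambda a: SEVERITY_RANK.get(a.severity, len(SEVERITY_RANK))
    )


def channels_for(alert):
    """Pick notification channels; unknown severities only reach the dashboard."""
    return ROUTES.get(alert.severity, ["dashboard"])


if __name__ == "__main__":
    queue = [
        Alert("web-01", "info", "deployment finished"),
        Alert("db-01", "critical", "disk 95% full"),
        Alert("web-02", "warning", "high response time"),
    ]
    for alert in prioritize(queue):
        print(alert.severity, alert.source, channels_for(alert))
```

Keeping the routing in a table makes it easy to review who gets notified for what, which is usually the part that changes most often.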
What kind of reporting is needed?
The following monitoring reports may be generated for infrastructure monitoring, depending on the requirements. Regular review and analysis of these reports gives insights that help mitigate issues in the future.
- Create monitoring analytics reports to understand the performance trends and patterns.
- Generate capacity planning reports, which help in rightsizing the IT infrastructure.
- Create Service Level Agreement reports to understand baseline behavior and identify IT infrastructure components that deviate from the agreements.
- Create problem reports to show alert history and develop automation to resolve similar issues automatically.
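As an example of the arithmetic behind a Service Level Agreement report, availability over a period is commonly computed as (total time − downtime) / total time. The sketch below applies that to a set of components; the component names, downtime figures and the 99.9% target are made-up values for illustration.

```python
def availability_percent(total_minutes, downtime_minutes):
    """Availability as a percentage of the reporting period."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    return (total_minutes - downtime_minutes) / total_minutes * 100


def sla_report(downtime_by_component, total_minutes, target_percent):
    """Return availability and SLA status for each component."""
    report = {}
    for name, downtime in downtime_by_component.items():
        pct = availability_percent(total_minutes, downtime)
        report[name] = {
            "availability": round(pct, 3),
            "meets_sla": pct >= target_percent,
        }
    return report


if __name__ == "__main__":
    minutes_in_30_days = 30 * 24 * 60
    # Hypothetical downtime figures, in minutes, for a 30-day period.
    downtime = {"web-01": 20, "db-01": 120}
    for name, row in sla_report(downtime, minutes_in_30_days, 99.9).items():
        print(name, row)
```

In this example a 99.9% target over 30 days allows roughly 43 minutes of downtime, so the component with 120 minutes of downtime is flagged as deviating from the agreement.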
Best practices for Infrastructure Monitoring & Alerting
IT infrastructure monitoring creates opportunities to proactively identify security risks and mitigate operational issues before they negatively impact the business. Below are some of the best practices that help organizations achieve and maximize the benefits associated with IT infrastructure monitoring:
- Choose a Reliable Monitoring Tool – Businesses with mature IT organizations face a difficult choice when it comes to IT infrastructure monitoring: whether to purchase a tool from a vendor or develop a custom monitoring tool. A reliable vendor partner can offer one-on-one assistance and consultation, helping the IT team configure and get the most value from the IT infrastructure monitoring solution.
- Redundancy Solutions – Consider every possibility by using a combination of on-premises and cloud-based solutions. Also, if multiple datacenters exist, monitor each location for extra security.
- Create the Right Dashboards for the Right People – IT infrastructure monitoring tools can be configured to present processed data in a dashboard. Dashboards can be configured to provide operational data, give business insights or highlight anomalous events that could represent security threats. To leverage this data effectively, plan to customize dashboards for each role, such as a security dashboard for the security team, operational dashboards for the operations team and a financial or business metrics dashboard for sales managers or the CFO.
- Establish baselines and thresholds – Use the monitoring and alerting functions of the application to determine the baseline, then use the baseline as the reference point for deciding what is out of range or outside normal operations. The deviation from the baseline is referred to as the variance. If a counter moves above or below the baseline by more than the accepted variance, it can be treated as a serious issue.
- Use support services – Contact vendor support teams when there is a question or problem instead of spending resources trying to troubleshoot independently.
- Configure a Comprehensive Alert System – When configuring alerts, aim for high specificity and high coverage. Configuring alerts with very specific parameters reduces the number of false positives generated by the alerting system.
- Prioritize alerts – Decide which notifications are most urgent and set up detailed alerts for each, so that issues that could result in downtime or a negative experience for end users are not overlooked. With the density of infrastructure components, a constant flow of alerts can quickly become an irrelevant flood of data. The ability to assign priority to certain issues through user-defined parameters and/or artificial intelligence frees up the time and attention of Ops teams so that they can troubleshoot critical problems that may result in outages or downtime.
- Schedule dry runs or testing of the monitoring system – Even the most well-thought-out system may require fine-tuning. Schedule monitoring dry runs at regular intervals to find out how well the alert system functions.
- Review metrics – The metrics and KPIs used to configure the alerting system may not remain stable over time, so the initial thresholds and settings can stop being consistent or relevant. Set up periodic reviews of the performance metrics and of how alerts are configured, and make changes where necessary to ensure optimal performance.
- Create automated responses to specific events – The use of variances also assists in the automation of the cloud. This can of course be applied to storage, networking, and all servers and applications that are automated. These automated responses to events make IT operations more efficient, responsive, and resilient. The IT owner can set predefined thresholds, and when these thresholds are exceeded, the cloud provider can use automation applications to add capacity that has been agreed on and contracted for. With the known variance from the baseline, the automation systems may dynamically add web compute resources to handle the additional workload.
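The baseline, variance and automated-response practices above can be sketched together. In the example below the baseline is the mean of past samples, and a value counts as out of range when it deviates from the baseline by more than a multiple of the standard deviation. That band, the tolerance of 3, the sample values and the action names are all assumptions for this illustration; real tools may define the baseline and the accepted variance differently.

```python
from statistics import mean, stdev


def build_baseline(samples):
    """Return (baseline, spread) computed from historical samples."""
    return mean(samples), stdev(samples)


def is_out_of_range(value, baseline, spread, tolerance=3.0):
    """True when the value deviates from the baseline by more than the band."""
    return abs(value - baseline) > tolerance * spread


def scaling_action(value, baseline, spread, tolerance=3.0):
    """Map a deviation to a hypothetical automation action."""
    if not is_out_of_range(value, baseline, spread, tolerance):
        return "none"
    return "scale_out" if value > baseline else "scale_in"


if __name__ == "__main__":
    # Hypothetical CPU utilization samples (percent) from normal operation.
    history = [40, 42, 38, 41, 39]
    baseline, spread = build_baseline(history)
    for reading in (43, 90, 10):
        print(reading, scaling_action(reading, baseline, spread))
```

In practice the returned action would be handed to the provider's automation tooling, and the baseline would be recomputed as part of the periodic metric reviews.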
Common Infrastructure monitoring techniques and Protocols
Many different protocols can be used for monitoring and managing IT infrastructure and operations. Many of these protocols run in the background of applications, and it is important to understand what they do and how they are used. The following are some of the commonly used protocols for infrastructure monitoring.
- Simple Network Management Protocol (SNMP): SNMP is one of the oldest and most common management protocols. It is included in servers, storage, and almost all devices, both physical and virtual, that need to be monitored. A monitoring system uses SNMP to communicate with remote managed devices and query information, which is retrieved and recorded by the management station. SNMP is a management structure for reading from and writing to managed devices.
- Intelligent Platform Management Interface (IPMI): The IPMI protocol is used to access servers’ out-of-band management interfaces. The IT team can access a server even when it is powered off, turn it on, and then watch it boot as if directly connected to the server in the datacenter. IPMI also allows BIOS changes to be made remotely.
- Windows Management Instrumentation (WMI): WMI is a Microsoft-developed protocol used to remotely access server management information. WMI is useful for querying a remote Windows workstation, server, or application to gather information or to make configuration changes. WMI offers the advantage of scripting to automate many management functions useful in cloud deployments, such as tracking database activity or disk and memory utilization.
- Syslog: The syslog protocol is very common and is used by devices to send logging data to a remote collection server. There it can be stored, correlated with data from other devices, and used by other management and monitoring applications to gather trends, statistics, and any other type of information that the application is configured to support. Any device in the datacenter that supports syslog is configured with the IP address of the remote syslog server and the severity level of the messages to send to it.
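To illustrate the syslog severity level mentioned above: each syslog message starts with a priority value in angle brackets, computed as facility × 8 + severity, where severity runs from 0 (emergency) to 7 (debug). The sketch below decodes that value and decides whether a message passes a severity filter. The function names and the short severity labels are choices made for this example.

```python
import re

# Severity levels 0-7, in order, as defined by the syslog protocol.
SEVERITIES = [
    "emerg", "alert", "crit", "err", "warning", "notice", "info", "debug",
]

PRI_PATTERN = re.compile(r"^<(\d{1,3})>")


def decode_pri(message):
    """Return (facility_number, severity_name) from a syslog message."""
    match = PRI_PATTERN.match(message)
    if not match:
        raise ValueError("message has no <PRI> prefix")
    pri = int(match.group(1))
    if pri > 191:  # 23 * 8 + 7 is the largest valid priority value
        raise ValueError("priority value out of range")
    facility, severity = divmod(pri, 8)
    return facility, SEVERITIES[severity]


def should_forward(message, threshold):
    """True when the message is at least as severe as the threshold."""
    _, severity = decode_pri(message)
    # A lower index means a more severe message.
    return SEVERITIES.index(severity) <= SEVERITIES.index(threshold)


if __name__ == "__main__":
    line = "<34>Oct 11 22:14:15 mymachine su: 'su root' failed"
    print(decode_pri(line), should_forward(line, "warning"))
```

This is only the filtering logic; the device configuration that sets the remote server address and severity level is specific to each vendor.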
Common Infrastructure Alert Methods and protocols
Monitoring and management tools by default allow a variety of alerting methods to meet the alerting requirements. Among the more common methods, the application sends a text message or email to a distribution list with details of the critical event that occurred.
Proactive analysis of application infrastructure allows IT managers and business owners to predict or identify performance problems before they become urgent, and ensures that network resources are operating as intended. This can be achieved by configuring alerts at multiple threshold levels and directing the alerts to the right people. Below are some of the commonly used alert methods.
- SMS or text messaging
- Email using SMTP
These two methods alert an after-hours on-call support engineer to an event so that issues get a quick response. With mission-critical applications being hosted, 24-hour support is very common and important. Automated notification from the management station through text messages or SMTP email is often a requirement.
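A minimal sketch of the email alert path, using Python's standard `email.message.EmailMessage` class to build the notification. The addresses, host name and alert text are placeholders for illustration. The message is only constructed here; actually sending it would go through an SMTP server, for example with `smtplib.SMTP(...).send_message(msg)`, using whatever server the organization provides.

```python
from email.message import EmailMessage


def build_alert_email(sender, recipients, host, severity, detail):
    """Build an alert email for a distribution list (does not send it)."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg["Subject"] = f"[{severity.upper()}] {host}: {detail}"
    msg.set_content(
        f"Host: {host}\n"
        f"Severity: {severity}\n"
        f"Detail: {detail}\n"
    )
    return msg


if __name__ == "__main__":
    # Placeholder addresses and event details for illustration only.
    alert = build_alert_email(
        "monitoring@example.com",
        ["oncall@example.com", "ops@example.com"],
        "db-01",
        "critical",
        "disk 95% full",
    )
    print(alert["Subject"])
    print(alert.get_content())
```

Putting the severity and host in the subject line lets the on-call engineer judge urgency without opening the message.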
There are also machine-to-machine or application-to-application alerts. The common dashboard published by cloud companies, which shows the health of the operations in real time and is often accessed with a common web browser, is an example of viewing monitoring data collected by a single application.
Another important use of alerting methods is the automation of troubleshooting, where a management application alerts another application to perform troubleshooting and problem resolution based on the issue it reported.