Lessons learned from an innovative project in which ServiceNow’s new Dynatrace plugin was deployed.
Together with Olivier Hayard, we discussed the challenges and benefits of integrating monitoring systems with a service management solution. According to Olivier, as information systems are increasingly monitored, the number of alerts they produce keeps growing, so integrating these alerts into an incident management process becomes critical to handling them quickly and consistently. A simple example: at night and on weekends, alerts need to reach the right people quickly.
How did the idea to implement the new Dynatrace plugin from ServiceNow come about?
Olivier Hayard: Before talking about this project specifically, I would like to clarify something regarding the Dynatrace monitoring and alerting solution: a plugin already existed, and it worked well, but it used the old version of the API. Last November, ServiceNow provided a new plugin with the same purpose but built on a modernized architecture. There are currently two plugins:
- Configuration Item (CI) synchronization
- Alert reporting
Can you describe the general context of the project?
Olivier Hayard: Two years ago, a major change occurred in the management of the IT operations of one of our clients, imposing numerous changes at the organisational, process and technical levels.
So, for two years now, a complete overhaul of the ITSM processes has been underway, and ServiceNow has been implemented to cover the following processes: incident management, major incident management, problem management, change and release management, as well as the CMDB. Service requests were implemented through the ServiceNow portal.
In parallel, as part of infrastructure management, Dynatrace software was selected for monitoring and alerting. It is worth noting that some technologies have their own alerting system. As we will see later, this had an impact on how some decisions were made. Additionally, some other systems were out of scope of this project, as they are managed under a different approach. The Dynatrace deployment mostly dealt with the Windows and Linux server estate, which comprised several hundred machines.
What challenges did the customer hope to manage with this approach?
Olivier Hayard: Within the client organisation, everyone agreed on the importance of monitoring. The first challenge was to define the rules for deciding when a service is degraded or interrupted. The second was to determine when a support team should intervene, ideally in a preventive way.
Concretely, two types of monitoring were addressed:
- Business application: here we monitor what a human being could also observe, for example a longer response time, an unavailable page or a transaction error. In this case, we speak of Synthetic Monitoring: we monitor the behaviour of an application or business service, for example the e-banking application.
- Server, infrastructure: in this case we have a lot more information available, thanks to the deployment of agents (OneAgent) on all the servers concerned. Many different elements can be monitored, such as a host, a process on a machine, a group of processes spread over several machines (Tomcat, database, etc.), a server’s memory usage or the available disk space. It follows that you can have a high number of OneAgents, so processing must be managed carefully.
What issues did you face with defining alerts?
Olivier Hayard: As my previous answer suggests, the issues are quite specific, and each of the two plugins brings its own. I would like to mention two of them:
- Correlation between the CMDB and the monitored elements: the challenge is to have the same level of granularity between the components monitored in Dynatrace and the Configuration Items (CIs) in the CMDB, which is managed in ServiceNow. For example, if we monitor a memory component, it must be present in the CMDB. This allows us to create an incident linked to the right element. (For information, these elements are correlated every day.)
- The creation of the incident: the challenge is to create the incident with the right support group and the right priority. If you are disturbing your support team at night to handle a high-priority issue, it is best to wake up the right team!
(Strictly speaking, Dynatrace creates “problems”, but not in the ITIL sense.)
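The correlation requirement in the first point can be illustrated with a minimal sketch. It is not the plugin's logic: the `MonitoredEntity` and `ConfigurationItem` types, the `correlation_id` field and the sample data are all hypothetical, chosen only to show that every monitored element needs a matching CI before an incident can be linked to it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitoredEntity:
    """An element watched by the monitoring tool (hypothetical shape)."""
    correlation_id: str
    kind: str  # e.g. "host", "process", "memory"


@dataclass(frozen=True)
class ConfigurationItem:
    """A CI as stored in the CMDB (hypothetical shape)."""
    correlation_id: str
    name: str


def find_uncorrelated(entities, cis):
    """Return the monitored entities that have no CI with the same correlation id."""
    known_ids = {ci.correlation_id for ci in cis}
    return [e for e in entities if e.correlation_id not in known_ids]


# Sample data, invented for the example.
entities = [
    MonitoredEntity("HOST-1", "host"),
    MonitoredEntity("HOST-1/mem", "memory"),
    MonitoredEntity("HOST-2", "host"),
]
cis = [
    ConfigurationItem("HOST-1", "srv-app-01"),
    ConfigurationItem("HOST-2", "srv-db-01"),
]

missing = find_uncorrelated(entities, cis)
print([e.correlation_id for e in missing])  # ['HOST-1/mem']
```

In this example the memory component is monitored but has no CI, so an alert on it could not be attached to the right element. A daily check of this kind would surface such gaps in granularity.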
When an alert goes off, we generally have very little time to react, so the more relevant the information sent by Dynatrace, the better. The better the information in the CMDB, the better the chance of assigning the right resolution group – and therefore, the greater the probability of resolving the issue before an incident happens.
For example, if there is an alert on a server and this server is a member of a cluster, this information is very important. The following are also essential: what the business impacts are, which services could be affected, and whether there is a customer-facing or legal risk.
Another major consideration is not to disturb people with false alarms, or false positives. Beyond waking someone up at 3 am for no good reason, false positives are also a problem during the workday because they cost productivity.
In summary: good management of risks, impacts and urgency is the key to a successful implementation. These elements are provided by the links and information in the CMDB.
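To make the role of the CMDB concrete, here is a minimal sketch of how an incident's assignment group and priority might be derived from CI data. Everything in it is an assumption made for illustration: the `CmdbRecord` fields, the impact/urgency matrix, the rule that cluster membership lowers impact by one level, and the dictionary keys. It does not describe the client's actual rules or the plugin's behaviour.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CmdbRecord:
    """Subset of CI data used for routing (all field names are hypothetical)."""
    name: str
    support_group: str
    business_impact: int           # 1 = high, 2 = medium, 3 = low
    cluster: Optional[str] = None  # cluster the CI belongs to, if any


def priority(impact: int, urgency: int) -> int:
    """Illustrative impact/urgency matrix: 1 is the highest priority, 5 the lowest."""
    if impact not in (1, 2, 3) or urgency not in (1, 2, 3):
        raise ValueError("impact and urgency must be 1, 2 or 3")
    return impact + urgency - 1


def build_incident(ci: CmdbRecord, urgency: int, summary: str) -> dict:
    """Derive the assignment group and priority of an incident from CMDB data."""
    impact = ci.business_impact
    # Assumption for this sketch: a clustered server has redundancy, so the
    # impact of an alert on a single node is lowered by one level.
    if ci.cluster is not None:
        impact = min(impact + 1, 3)
    return {
        "short_description": summary,
        "cmdb_ci": ci.name,
        "assignment_group": ci.support_group,
        "priority": priority(impact, urgency),
    }


standalone = CmdbRecord("srv-db-01", "Database Support", business_impact=1)
clustered = CmdbRecord("srv-app-01", "Application Support", business_impact=1,
                       cluster="app-cluster-a")

first = build_incident(standalone, urgency=1, summary="Disk almost full")
second = build_incident(clustered, urgency=1, summary="High memory usage")
print(first["assignment_group"], first["priority"])    # Database Support 1
print(second["assignment_group"], second["priority"])  # Application Support 2
```

The point of the sketch is that the alert itself carries only the symptom. The support group, the business impact and the cluster membership all come from the CMDB, which is why its quality determines whether the right team is woken up.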
Can you give us some indication of the time and involvement required for this type of project?
Olivier Hayard: The project goals and stories were well defined, so an Agile project method such as Scrum lent itself to this project.
Additionally, the plugins are very well done, which obviously made the work much easier. On the other hand, there was a lot of parameterisation, tuning and adjustment, which took time. These activities work well with Agile batch management.
In practice, the project was carried out in two phases: Preparation, then Tuning.
- Phase 1: Preparation
Initially, all the Dynatrace problems were brought up so that the integration could be parameterised. Cases were created to validate that the CIs (Configuration Items) were well correlated, that the information provided was correct, that the groups were correctly determined, and that the priorities were correctly calculated (that is, that the right rules were in place). This phase lasted about 2 months and involved between 10 and 20 people.
- Phase 2: Tuning
The production launch was done by group and by domain.
- On the synthetic monitoring side, this was implemented application by application, from the most critical to the least critical. Each month we worked with the application teams to add a group of applications.
- On the server monitoring side, implementation was by support team. The start-up was done team by team, as the support organisation (e.g. standby, shift, process) was defined and implemented during this phase.
Dynatrace raises about 20 alerts per hour, which ultimately results in 3 to 4 incidents per day, across all levels of urgency and importance.
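As a rough back-of-the-envelope check of these figures, the short calculation below converts the hourly alert rate into a daily volume and compares it with the number of incidents. It assumes the rate of about 20 alerts per hour holds around the clock, which the interview does not state explicitly.

```python
ALERTS_PER_HOUR = 20        # figure quoted above
INCIDENTS_PER_DAY = (3, 4)  # range quoted above

# Assumption: the hourly rate is constant over 24 hours.
alerts_per_day = ALERTS_PER_HOUR * 24
alerts_per_incident = [alerts_per_day // n for n in INCIDENTS_PER_DAY]

print(alerts_per_day)       # 480
print(alerts_per_incident)  # [160, 120]
```

Under that assumption, roughly one alert in 120 to 160 ends up as an incident, which gives an idea of how much filtering and grouping happens between the monitoring tool and the incident queue.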
After these phases of preparation and adjustment, what are the first observations you can make?
- The first observation is the quality of the CMDB: out of 100,000 CIs, we may have had to correct a hundred or so by hand, which is very few. Importantly, the “Identification and Reconciliation Engine” (IRE) was very successful. Put simply, the ServiceNow CMDB has several sources, including the Dynatrace plugin. When receiving information from one of these sources, the challenge is for ServiceNow to know whether it describes the same CI or a different one. ServiceNow must also know which system has the right to update the data in the CMDB. A lot of work was invested in this IRE, and the result gave us a solid foundation, which is necessary for this type of project to run smoothly.
- Second observation: after less than two months, the bank’s critical applications are covered, with increasingly effective preventive interventions. The time available for intervention is greater, resulting in an improvement in root cause treatment (i.e. in the ITIL sense of problem investigation and resolution).
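The two questions the IRE has to answer, "is this the same CI?" and "is this source allowed to update it?", can be illustrated with a toy model. This is a deliberately simplified sketch and not ServiceNow's implementation: the `ToyCmdb` class, the use of a serial number as the only identifier, the source names other than Dynatrace and the precedence values are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical precedence: a lower number means the source is more trusted.
SOURCE_PRECEDENCE = {"Discovery": 1, "Dynatrace": 2, "Manual": 3}


@dataclass
class CI:
    serial_number: str                           # identifier attribute in this sketch
    attributes: dict = field(default_factory=dict)
    owners: dict = field(default_factory=dict)   # attribute -> source that last set it


class ToyCmdb:
    def __init__(self):
        self._by_serial = {}

    def ingest(self, source: str, serial_number: str, attributes: dict) -> CI:
        """Identify the CI by serial number, then reconcile attribute updates."""
        ci = self._by_serial.get(serial_number)
        if ci is None:
            # Identification: no match, so this payload describes a new CI.
            ci = CI(serial_number)
            self._by_serial[serial_number] = ci
        for name, value in attributes.items():
            owner = ci.owners.get(name)
            # Reconciliation: a source may overwrite an attribute only if it is
            # at least as trusted as the source that set it previously.
            if owner is None or SOURCE_PRECEDENCE[source] <= SOURCE_PRECEDENCE[owner]:
                ci.attributes[name] = value
                ci.owners[name] = source
        return ci

    def __len__(self):
        return len(self._by_serial)


cmdb = ToyCmdb()
cmdb.ingest("Discovery", "SN-001", {"name": "srv-app-01", "os": "Linux"})
ci = cmdb.ingest("Dynatrace", "SN-001", {"name": "SRV-APP-01.local", "ram_mb": 16384})

print(len(cmdb))      # 1
print(ci.attributes)  # {'name': 'srv-app-01', 'os': 'Linux', 'ram_mb': 16384}
```

In this example the second payload is recognised as the same server, so no duplicate CI is created. The less trusted source can add a new attribute but cannot overwrite the name set by the more trusted one.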
Apart from the integration between Dynatrace and ServiceNow, has this initiative had consequences on other alert processing?
Olivier Hayard: Dynatrace was the pioneer, the first monitoring system to be integrated with ServiceNow, an integration facilitated by a good quality plugin. In view of the excellent results obtained, other monitoring systems, such as Tivoli and certain security monitoring systems, will also be interfaced with ServiceNow.
The current thinking is that any alert beyond a certain threshold must be sent to ServiceNow and processed according to the Service Management processes in place: incident, major incident, problem, etc.
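The threshold rule described above could be sketched as a simple filter. The severity scale, the threshold value and the sample alerts below are hypothetical, since the interview does not say how the threshold is defined.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical severity scale; higher values are more severe."""
    INFO = 1
    WARNING = 2
    ERROR = 3
    CRITICAL = 4


FORWARD_THRESHOLD = Severity.ERROR  # hypothetical threshold


def should_forward(severity: Severity,
                   threshold: Severity = FORWARD_THRESHOLD) -> bool:
    """Return True if an alert is severe enough to be sent to service management."""
    return severity >= threshold


# Sample alerts, invented for the example.
alerts = [
    ("disk usage at 71%", Severity.INFO),
    ("response time degraded", Severity.WARNING),
    ("e-banking page unavailable", Severity.CRITICAL),
]

forwarded = [text for text, severity in alerts if should_forward(severity)]
print(forwarded)  # ['e-banking page unavailable']
```

Keeping the threshold as a single named value makes it easy to apply the same rule to every monitoring system that is later connected.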
What are the principal lessons learned?
Olivier Hayard: The support teams were reluctant at first about updating the CMDB, which was historically seen as an administrative task. However, they now realise the benefits and value for them.
The decision on where to handle event management must be made judiciously. In this case, it was decided to handle event management in the respective monitoring tools. Best practice confirms that, whether it is done in the service management tools or in the monitoring tools, it must be done.
Olivier Hayard, Vice President, Head of Knowledge Management
I spent the first 15 years of my professional career exclusively delivering client engagements on a full-time basis, acquiring knowledge and applying it to developing methodologies, norms and standards for our clients’ benefit.
Having studied software engineering and mathematics, I have an Information Technology degree in Software Engineering from the Institut Informatique d’Entreprise (IIE).
I started my professional career as a consultant, joining Itecor in 1993, and became a VP and Head of Knowledge Management in 2004.
In this capacity, I contributed to the creation and nurturing of Itecor’s practices, an invaluable knowledge repository that ensures our consultants acquire state-of-the-art expertise so that clients’ needs are precisely met.
Today, I spend 50% of my time with our clients to understand their requirements and organisations and then implement the processes our consultants will follow, e.g. change management, project management, release management as appropriate. Once the framework is in place, I set the measures and KPIs to ensure the processes are followed and I coach the teams.
From 2008 to 2016 I was a board member of the Swiss Society for Project Management in charge of the organization of monthly evening talks (sharing of experiences, Q+A).