Since Unic started a little more than 20 years ago, we have always been a hosting service provider. Back then we just had a few servers and all employees looked after them. Today, we host platforms on more than 700 servers and are always looking to improve efficiency and decrease maintenance effort as much as possible. However, a couple of years ago we started to face some issues and were stuck between a rock and a hard place.
As is usual for a service provider, we had a 3-level service organization. The 1st level was managed by our service desk team, which acted as a single point of contact for external and internal customers, accepted all kinds of requests and handled incidents. 2nd level support was provided by system admins, system engineers or application engineers, who took over all the tasks the service desk couldn't handle itself. The 3rd and last level was provided by subject-matter experts with specific know-how in a particular topic. This is close to the standard for service provision, so where were our issues?
A vicious cycle
First, let's take a look at the service desk. They had a very high number of incoming tickets and a constantly growing backlog of 20 or more tickets per person, which was quite a big workload. In addition, they had a list of more than 40 daily, weekly or monthly recurring tasks to manage.
2nd level support looked quite similar. System admins and engineers were responsible for the maintenance of hosted systems, as well as for setup and improvement tasks. Furthermore, the same employees were covering the 24/7 on-call duty and alert management.
These issues extended to 3rd level support as well, which was very unfortunate because these are highly skilled and thus expensive employees who should not have to invest too much time in support. This led to a vicious cycle: employees had no time to improve the situation and could not work on quality, which in turn decreased quality and made the situation even worse.
Recruiting IT specialists from India
To break free from this circle of doom, we decided to hire a team abroad. This was the birth of our L1 team in India, a small team of skilled IT specialists who can help on multiple levels. The goal was to free levels 1 and 2 from recurring and time-consuming tasks.
With a partner in India, we recruited a team of three system administrators, with the idea of growing further. Those three colleagues flew to Switzerland, where they stayed for some weeks while we onboarded them to our processes. This had several very good effects. The team got to know our people in Switzerland, as well as our work culture and ethics. They saw firsthand how we worked and what our goals and our customers' goals are. This helped to break down some cultural barriers on both sides, and I'm sure that the beers after work helped too. It was very hands-on and direct. Most importantly, we realized that we needed to document better and more conscientiously.
Therefore, we set up a knowledge management process and created a knowledge base in our Confluence, where we started with step-by-step documentation. Because the team from India was already working with us, we could involve them in creating this documentation. We showed them what they needed to do, and they wrote the documentation for it. We then just reviewed the documentation to ensure everything was correct. With that, we had 50–100 valuable knowledge base entries in no time.
With that jump-start, the L1 team went home, and the service desk was the first to profit from them. The service desk could then focus on what it does best: making customers happy! In addition, it had more time to look into more complex tasks and to document and standardize them.
The benefits of handing tasks over to the L1 team
Now, with the service desk freed from the vicious cycle and the L1 team eager to do more technical tasks, we went a step further. The L1 team was extended a little and started to work 24/5 shifts, and we handed alert and event management over to them. The goal here was to relieve 2nd level operations and the on-call staff.
The L1 team was trained and enabled to provide the level 1 response to incident alerts. We implemented OpsGenie to plan schedules and orchestrate escalations of incoming alerts, with the goal of reducing the reaction time and the number of alerts going to the 24/7 on-call staff in Switzerland. Enabling the L1 team without giving them any direct access to the customer systems was another hurdle to clear. We solved this challenge using ChatOps: we enabled the L1 team to send a limited set of predefined commands to the systems through our chat tool and get the output back in the chat. This made it possible to hand over tasks to the L1 team without worrying too much about managing rights on the hosts. The permissions are strictly limited to the commands we implemented for them, and there is no way to access any data on the systems. Still, they are able to react as quickly as possible to alerts and handle a lot of them themselves. This led to more sleep for our 2nd level support and 24/7 on-call staff and freed them up to put effort into preventing alerts in the first place.
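To make the ChatOps idea a bit more tangible, here is a minimal sketch of how such a command allowlist could look. It is not our actual implementation: the `!run` chat syntax, the command names, the host name pattern and the use of SSH as transport are all assumptions made for illustration. The point it shows is that the chat user only picks a name from a fixed list, and never supplies the command itself.

```python
import re
import shlex
import subprocess
from typing import Callable, Sequence

# Hypothetical allowlist: chat keyword -> fixed command line.
# Nothing typed in the chat is ever inserted into these commands.
ALLOWED_COMMANDS: dict[str, list[str]] = {
    "uptime": ["uptime"],
    "disk-usage": ["df", "-h"],
    "restart-web": ["systemctl", "restart", "nginx"],
}

# Assumed host naming rule; it also rejects values starting with "-",
# which could otherwise be read as options by the transport command.
HOST_PATTERN = re.compile(r"[a-z0-9][a-z0-9.-]{0,62}")

Runner = Callable[[str, Sequence[str]], str]


def run_over_ssh(host: str, argv: Sequence[str]) -> str:
    """Assumed transport: run the fixed command on the host via ssh."""
    result = subprocess.run(
        ["ssh", host, shlex.join(argv)],
        capture_output=True,
        text=True,
        timeout=30,
    )
    return result.stdout or result.stderr


def handle_chat_message(message: str, runner: Runner = run_over_ssh) -> str:
    """Turn a chat message like '!run uptime web01' into a reply text."""
    parts = message.split()
    if len(parts) != 3 or parts[0] != "!run":
        return "usage: !run <command> <host>"
    _, name, host = parts
    argv = ALLOWED_COMMANDS.get(name)
    if argv is None:
        allowed = ", ".join(sorted(ALLOWED_COMMANDS))
        return f"'{name}' is not an allowed command. Allowed: {allowed}"
    if not HOST_PATTERN.fullmatch(host):
        return "invalid host name"
    # Only the predefined argv reaches the runner; the output goes back to chat.
    return runner(host, argv)
```

A message such as `!run uptime web01` would be executed, while `!run rm web01` would only produce a reply listing the allowed commands. Passing the runner as a parameter keeps the allowlist logic testable without touching a real system.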
Where are we now?
Our L1 team in India supports us in many ways:
- handling nearly 80 recurring tasks on a weekly, monthly or yearly basis
- handling approximately 60% of the alerts
- handling 80% of the internal standardized requests
- handling the monthly Windows patching process
In our knowledge base, we now have an exceptional reservoir of know-how with nearly 500 entries. We plan to do even more with the L1 team. Managing Linux patching and helping with provisioning and rundown requests are at the top of that list. We are also thinking about making the L1 team available 24/7.
The L1 team helped us to work on ourselves by prompting us to think about improvements and automation of tasks. With the need to document and make things easier for them, we started to work in a more structured way and to document more, even for tasks we do not hand over.
All of this serves the goal of enabling people to work on what they do best: the service desk focuses on customers, 2nd level is not swamped with requests, and 3rd level can focus on quality and automation. It's not perfect yet, but it is already much better. And as Barack Obama said, "Better is good. I'll take better every time, because better is hard."