A Week in the Life of a SystemCare Ninja
One thing’s for sure: nothing surprises the SystemCare team anymore. We see all sorts :)
If you haven’t already seen our ninja video, it’s well worth spending a few minutes to take a look, especially if you’re a fan of the old school 2D platform video games.
SystemCare works by sending us alerts when there is a problem with a server, or when one is likely to develop soon.
These range from low disk space to a failed backup from the previous night, from out of date anti-virus to a server showing as offline.
Here’s an insight into the type of alerts we saw during a working week, how we resolved these issues, and what might have happened to our customers if not for our ninja-esque X-ray vision!
First thing on Monday morning we found low drive space on a server. The drive had filled rapidly over the weekend, which was unusual.
I picked this alert up and flagged it with the customer, explaining that the rate of data growth on this drive had changed sharply and that I would look into the reason why.
After logging onto the server, I discovered the main use of space on the drive was SQL backups. Following some investigation within SQL itself, I found a maintenance plan which should have been clearing the backups down once they were older than a week to conserve space. However, this job had stalled, causing the old backups to build up and fill the drive.
I recreated the maintenance plan and tested this to make sure it was working as expected. This then removed the backups older than a week and created enough free space on this drive to push it above our alert threshold.
If this alert hadn’t been picked up, the drive would have filled up completely, causing issues with the SQL databases and a complete shutdown that could stop programs such as Sage/Pegasus, SharePoint or any bespoke applications with a SQL database.
New SQL backups would also have stopped working once the drive was full, so this was a simple error which could have snowballed into a major issue.
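The clean-up rule described above (remove backups once they are older than a week) can be sketched outside SQL as a small script. This is only an illustration, not the maintenance plan we recreated: the function name `prune_old_backups`, the `*.bak` pattern and the folder layout are all assumptions made for the example.

```python
import time
from pathlib import Path


def prune_old_backups(backup_dir, max_age_days=7, pattern="*.bak",
                      now=None, dry_run=True):
    """Return backup files older than max_age_days, deleting them unless dry_run.

    All names and defaults here are hypothetical; a real SQL maintenance
    plan has its own clean-up task with its own settings.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # oldest modification time we keep

    removed = []
    for path in sorted(Path(backup_dir).glob(pattern)):
        # Only plain files matching the pattern are considered.
        if path.is_file() and path.stat().st_mtime < cutoff:
            if not dry_run:
                path.unlink()
            removed.append(path)
    return removed
```

Running it with `dry_run=True` first lists what would be deleted without touching anything, which is a sensible check before letting any clean-up job loose on a backup folder.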
We also received an alert to let us know a backup job failed on the previous run. Without this alert, the failed backups could have continued for several weeks, potentially going unnoticed.
The alert was picked up and raised with the customer before our SystemCare engineer logged onto the server and investigated the issue. After checking the backup logs and errors, we found there was a fault with the backup drive.
We performed several fixes on the drive to try to resolve this issue, but unfortunately the drive was beyond repair. Our engineer contacted HP and checked the warranty. Luckily the drive was still covered, so we arranged for a like-for-like replacement to be delivered to the customer’s site the next day.
Once the delivery was confirmed, we also requested a visit for a field engineer to attend the site, install it for the customer, and then test it.
Once the replacement and visit were arranged, our engineer asked the customer to back up any high priority documents to a local drive as a temporary measure until the new drive was installed the following day.
On Tuesday morning, we spotted an alert which indicated the anti-virus was out of date on a server. If this had gone unnoticed or been left unactioned, the server would have been completely open to infection, and valuable information could have been stolen.
The virus could also spread from the server to the network. Our ninjas picked this up and found the issue on the server was due to a failure of the service which was in place to download the latest Sophos updates.
After applying several fixes such as reinstalling the service and downloading the most recent updates, the server was fully up to date and the most recent updates were then pushed out to the workstations to make sure they were all fully up to date too.
This left the whole company with the most recent updates installed on the machines to help defend against viruses and malware on the systems.
We also received an alert to let us know that a Sage database was corrupt and one of our customers couldn’t get access to Sage. One of our Sage engineers raised the case to request that the database be restored.
I logged onto the affected server and took a backup of the corrupt database to another location, just in case this was still needed. I then removed the corrupt version.
Once this was complete, I restored the working database from the previous night’s backup tape to the original location, and our Sage team were able to reconnect the database and get users up and running again within a few minutes of the restore.
Without a good backup, monitored by our backup ninjas, and the ability to restore the database, the customer would have lost a lot of important financial information.
To begin Wednesday, one of our ninjas picked up an alert to say there was a predictive failure on one of the disks in a server. We immediately got in contact with HP to find out whether the drive was under warranty. It was still covered, so a replacement was ordered and sent to site for one of our field engineers to install. This saved the customer from potentially losing valuable data on the server, which could have happened if the drive had failed completely.
I received an alert for low disk space on a Hyper-V host server. I logged onto the affected server and found the VHDs (virtual hard disks) were set to expand beyond the physical space available on the host, which would have caused the virtual machines to shut down and encounter issues once the physical drive filled up.
We were able to pick this up in time to prevent this from happening and started the process to resolve this issue by using multiple strategies, including adding more space to the server, resizing VHDs and moving VHDs between drives.
This has now been future-proofed against over-expanding, provided no more data is added to the physical drives. It gives the customer peace of mind that the potential shutting down of all of their virtual servers isn’t just resolved in the short term, but in the long term too. And of course we still have our monitoring active to keep a watchful eye for any potential issues happening on the same drive. Just in case.
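The underlying problem, virtual disks that are allowed to grow larger than the physical drive holding them, comes down to simple arithmetic. Here is a minimal sketch of that check, assuming the maximum and current sizes of each VHD have already been collected by hand or by another tool. The `VirtualDisk` type, `overcommit_report` function and the figures are made up for illustration, and the sketch ignores anything else stored on the host drive.

```python
from dataclasses import dataclass


@dataclass
class VirtualDisk:
    name: str
    max_size_gb: float      # size the expanding disk is allowed to reach
    current_size_gb: float  # space it takes up on the host today


def overcommit_report(disks, physical_capacity_gb):
    """Compare the combined maximum VHD size with the host drive's capacity."""
    max_total = sum(d.max_size_gb for d in disks)
    used = sum(d.current_size_gb for d in disks)
    return {
        "max_total_gb": max_total,
        "used_gb": used,
        # How far the disks could outgrow the physical drive, if at all.
        "overcommitted_gb": max(0.0, max_total - physical_capacity_gb),
        "is_overcommitted": max_total > physical_capacity_gb,
    }


# Hypothetical host: two expanding disks on an 800 GB drive.
disks = [VirtualDisk("files", 500, 200), VirtualDisk("sql", 400, 100)]
report = overcommit_report(disks, physical_capacity_gb=800)
print(report["is_overcommitted"], report["overcommitted_gb"])
```

In this made-up case the two disks could grow to 900 GB between them, 100 GB more than the drive can hold, even though they only use 300 GB today.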
On Thursday, our SystemCare administrators released the most recent Microsoft patches to all supported workstations and servers. However, we found Microsoft had released a patch which caused an issue with certain applications such as Sage, causing them to crash or not load properly.
Our SystemCare developer created a script which removed a registry key on the affected machines and prevented the patch from being rolled out to any other machines. Any machines that encountered this issue simply needed this script to be run on them.
This enabled us to resolve any errors caused by this patch in seconds for hundreds of machines, as opposed to logging onto each machine individually and removing the registry key or the patch itself.
Without this, someone would have had to log onto each machine and uninstall this manually… I wouldn’t like to be asked to do that job!
We also picked up an alert for a lot of servers in a nearby location going offline, and checked with several Internet Service Providers to find out if there had been any issues with their service.
After some investigation, they found there was a line fault which caused a lot of sites we monitor to drop offline. Unfortunately we couldn’t do anything about this issue but we were able to provide the customer with this information and monitor the servers for any issues when they came back online following a fix from the ISP.
Without this information, the customer may not have been aware of any issues, they wouldn’t have had an ETA on a fix, and may have been left with problems on the servers following the resolution from the ISP. But luckily the ninjas dealt with all aspects of this problem.
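When lots of servers in one area drop offline together, the pattern itself is a clue that the cause is shared (a line fault, say) rather than a fault on each server. A rough sketch of spotting that pattern is below. It assumes each server's status arrives as a `(server, site, is_online)` tuple; the function name, site names and thresholds are invented for the example and are not how our monitoring is actually configured.

```python
from collections import defaultdict


def sites_with_outage(statuses, min_servers=2, offline_ratio=1.0):
    """Return sites where at least offline_ratio of the servers are offline.

    statuses is an iterable of (server_name, site, is_online) tuples.
    Sites with fewer than min_servers servers are skipped, because one
    offline server on its own says little about the site as a whole.
    """
    total = defaultdict(int)
    offline = defaultdict(int)
    for _server, site, is_online in statuses:
        total[site] += 1
        if not is_online:
            offline[site] += 1

    return sorted(
        site for site, count in total.items()
        if count >= min_servers and offline[site] / count >= offline_ratio
    )


# Hypothetical snapshot of monitored servers.
statuses = [
    ("srv-a", "north", False),
    ("srv-b", "north", False),
    ("srv-c", "south", True),
    ("srv-d", "south", False),
]
print(sites_with_outage(statuses))
```

Here every server at the "north" site is offline, so that site is flagged, while the single offline server at "south" is left to be investigated on its own.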
On Friday, we received an alert for low disk space on a server which was dedicated to hosting Exchange databases. If this type of drive filled completely, emails would have stopped coming in for the whole company.
Fortunately, we were able to pick this up in time and I explained several potential fixes to the customer who decided they preferred to clean the database up by removing any old mailboxes and email users no longer needed in order to create whitespace in the database.
Once this had been done, we sent a loan drive to site so we could run a defragmentation of the Exchange database and claim the whitespace back. This was done out of hours to minimize downtime for the customer.
Without our remote monitoring picking this type of alert up, this may have gone by unnoticed until all emails went down, and this would have then rolled on to cause further downtime while a fix was put in place. Luckily we got this in time and prevented a major companywide error before the potential issue actually occurred.
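At its simplest, a low disk space alert is just a comparison between the free space on a drive and a threshold. The sketch below uses Python's standard `shutil.disk_usage` to show the idea. The 10% threshold and the function names are assumptions for the example, not the values our monitoring actually uses.

```python
import shutil


def free_space_percent(path):
    """Return the percentage of free space on the drive holding path."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total * 100


def is_low_on_space(path, threshold_percent=10.0):
    """True when free space has dropped below the (hypothetical) threshold."""
    return free_space_percent(path) < threshold_percent


# Check the drive that holds the current directory.
print(f"{free_space_percent('.'):.1f}% free, low: {is_low_on_space('.')}")
```

A real monitoring agent would run a check like this on a schedule and look at the trend as well as the current figure, so that a drive filling unusually quickly is flagged before it reaches the threshold.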
On Friday afternoon, a customer was affected by the infamous Cryptolocker virus so they logged a case asking for the infection to be cleared and for the affected locked data to be restored.
We were able to remove the virus using some scripts which one of our SystemCare developers created. Our anti-virus trained ninjas also cleared any traces of this virus left behind, before passing this case to our backup team to restore the affected data.
Our backup ninjas were able to restore all of the affected data back to the original location from the previous night’s backup, and asked the customer to test this data to make sure everything was now functional without any issues. Following the teamwork of the SystemCare ninjas, the customer confirmed this was all working as expected.
The virus cleanse and the data restore took the customer from completely unable to work (and potentially having their whole network affected by the same virus) to fully functional within a matter of hours.
Without the working backup and the skills of the anti-virus guys, recreating all of the data stored on the systems would have taken hundreds of man-hours, and the customer would still have faced the danger of more viruses left behind on the network, waiting to cause another major issue across the whole company.
Just an average week in the life of a SystemCare ninja :)