Current Email Service Issues - 3PM Update (Detailed).

Email

Posted on: Monday 10 July 2006, 15:39

This update relates to customers who are currently unable to collect their email or are finding that mail previously stored on our server is now unavailable to them.

We are working on this matter as a top priority, and the information below provides additional background on the problem as well as the current ETA for resolution of the outstanding issues we are aware of.

Current Update
---------------

The issue with our mail platform has resulted in two major customer-affecting problems:

- Customers who are still impacted by today's issue are unable to access their mailboxes and receive a username or password error when they try. We expect this to be resolved shortly; please see the "Who is impacted?" section of this post for more details.

- Older mail stored in most customer mailboxes is currently unavailable. The data restoration process for older mail will take some time, and we are not yet in a position to confirm whether any email data has been permanently lost. Please see the "Data Restoration" section below for more details. This will be especially relevant for customers who use IMAP or Webmail and therefore normally store mail only on our servers.

In order to correct the problem for affected customers, we must do two things:

1. Re-create each problematic mailbox on our central file system. This will then allow for delivery of new mail and collection of mail that has been delivered since Sunday.

2. Restore older mail and other data, such as mailing list configurations, from backup data, provided we can access it fully.


Who is impacted?
-----------------

We prioritised the resolution of this problem for business customers, and believe that, in the main, business customers can now access their mailboxes successfully. We are, however, still investigating around 100 accounts where multiple mailboxes are in use and access to mail is still unavailable. We expect most of these mailboxes to be restored by the close of play today, and these issues to be fully resolved before tomorrow morning.

For residential customers, we are running various repair scripts to restore mailboxes still affected by access issues. A further script has also been implemented which rebuilds a broken mailbox whenever a message is delivered to it. Any customer who still has a mail access issue can therefore send an email to their own address, which should restore mail service within around 30 minutes.
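For technically minded readers, the rebuild-on-delivery behaviour described above can be sketched roughly as follows. This is only an illustration and not the script we are running: it assumes a Maildir-style mailbox layout (tmp, new and cur directories), and the function names `ensure_mailbox` and `deliver` are hypothetical.

```python
import mailbox
import os

# Assumption: mailboxes use the Maildir layout. The real platform may differ.
MAILDIR_SUBDIRS = ("tmp", "new", "cur")


def ensure_mailbox(path):
    """Create any missing Maildir directories; return True if anything was rebuilt."""
    rebuilt = False
    for sub in MAILDIR_SUBDIRS:
        subdir = os.path.join(path, sub)
        if not os.path.isdir(subdir):
            os.makedirs(subdir, exist_ok=True)
            rebuilt = True
    return rebuilt


def deliver(path, raw_message):
    """Rebuild the mailbox if it is broken, then store the incoming message."""
    rebuilt = ensure_mailbox(path)
    box = mailbox.Maildir(path, create=False)
    key = box.add(raw_message)  # written to tmp/ first, then moved into new/
    box.flush()
    return rebuilt, key
```

In this sketch, the first message delivered to a missing or damaged mailbox recreates the directory structure, which is why sending an email to your own address is enough to trigger the repair.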

Where mailboxes have been restored, email messages from before the weekend will remain unavailable until data restoration has been completed, but all mail sent since Sunday morning should be available soon, if it is not already.

Technical Background
----------------------

This major incident occurred as a result of human error during work to resolve timeouts when collecting mail, which we reported via service status last week:
http://usertools.plus.net/status/archive/1152295291.htm

As of Sunday morning, things had progressed well, and we were on track to solve the issues with mail timeouts, which had started to occur following our move to a new email storage system.

At 8AM on Sunday our engineers were in a position to switch over to the new storage solution. As the first stage of this, an engineer was bringing the new backup storage server into service. To prepare the mirrored disks on this platform, the disks had to be reconfigured and all existing data on them removed.

At the time of making this change the engineer had two management console sessions open - one to the backup storage system and one to live storage. Both have the same interface, and until Friday it was impossible to open more than one connection to any part of the storage system at once. The patches we installed on Friday evening removed this limitation but, unaware of this, the engineer wrongly assumed that the window he was working in was connected to the backup server rather than the live server. As a result, the command to reconfigure the disk pack and remove all data on it was issued to the wrong server.

Although this was noticed very quickly, over 700GB of live customer data was removed before the process could be halted.
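As an illustration of the kind of safeguard that guards against this class of mistake, the sketch below refuses a destructive command unless the target is in an allowed role and the operator re-types the target's name. It is a generic example only: it does not describe our storage vendor's console or any control we have in place, and the names `confirm_destructive`, `WrongTargetError` and the role labels are hypothetical.

```python
class WrongTargetError(Exception):
    """Raised when a destructive command is aimed at an unexpected system."""


def confirm_destructive(target_name, target_role, typed_name,
                        allowed_roles=("backup",)):
    """Return True only if the target may be wiped and the operator confirmed it.

    target_name: name the console session is actually connected to
    target_role: role of that system, for example "live" or "backup"
    typed_name:  name the operator typed when asked to confirm
    """
    if target_role not in allowed_roles:
        raise WrongTargetError(
            f"{target_name} has role {target_role!r}; refusing to wipe it"
        )
    if typed_name.strip() != target_name:
        raise WrongTargetError(
            f"confirmation {typed_name!r} does not match {target_name!r}"
        )
    return True
```

The point of the design is that two identical-looking console windows no longer behave identically: the same command succeeds in one and is rejected in the other.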

Data Restoration
-----------------

The live storage platform itself is made up of two halves. When one half of a storage system fails, it is normal for the other half to take over and ensure there is no data loss. However, a deliberate change on one half is always copied immediately to the other, and in this case the information was therefore lost from both halves of the storage system.
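The difference between a hardware failure and a deliberate change can be shown with a small model. This is a simplified sketch under our own assumptions, not a description of the actual storage product; the class `MirroredStore` and its methods are hypothetical.

```python
class MirroredStore:
    """Toy model of a two-half mirrored store (illustrative only)."""

    def __init__(self):
        # Each half holds a full copy of the data; None marks a failed half.
        self.halves = [{}, {}]

    def write(self, key, value):
        for half in self.halves:
            if half is not None:
                half[key] = value

    def delete(self, key):
        # A deliberate change is replicated to every working half.
        for half in self.halves:
            if half is not None:
                half.pop(key, None)

    def fail_half(self, index):
        # A hardware failure affects one half only.
        self.halves[index] = None

    def read(self, key):
        for half in self.halves:
            if half is not None and key in half:
                return half[key]
        return None
```

In this model, failing one half leaves every message readable from the other, whereas a delete removes the message from both. Mirroring protects against failure of a component, not against a mistaken command.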

The nature of storing data on a hard disk is that information which has been removed is not entirely erased straight away. As such, the first step the engineer took after identifying his error was to freeze the disk replication to prevent any further damage. This means that all the data should ultimately be recoverable, and the process of restoring it is now underway. We have engaged the help of data recovery professionals, and the frozen half of our storage platform was shipped to them yesterday. They have advised that there is a 99% chance of complete data recovery, and provided this goes to plan we can begin to copy older mail back to customers' mailboxes from tomorrow.

Outcome
---------

Any incident like this is very frustrating for both customers and everyone at PlusNet. Our focus is on resolving all customer-affecting issues, and once that is done we will carry out a full internal investigation. This will deal with the individual error that was made and look closely at our internal processes to identify whether there was anything further we could have done to prevent it. We will also be working with the PlusNet Usergroup, who have set up a dedicated forum at http://usergroup.plus.net/email_discussion for further discussion.

We would like to express our most sincere apologies for the inconvenience this is causing customers. We would again like to assure you that we recognise the seriousness of this issue and will leave no stone unturned in addressing the root cause and the remaining outstanding problems.

With Regards,

Ian Wild
Customer Communications Manager
