August 2010 crash


On Monday 23 August 2010, Hattrick went down around 15:20 HT-time.

Official announcement

Hattrick announces
Down Time 24.08.2010
As you have noticed, the site was down since 15:20 HT time yesterday. And here is what happened.

We were performing routine maintenance and upgrades to our servers, and our hard disk supplier was on site to perform an upgrade to the disks' firmware on our database storage solution. Unfortunately, a bug in this firmware caused all our disks to crash, which meant that we had to restore most of our databases from backup. Because of the enormous amount of data that needed to be restored, it took all night. Thankfully, the last backup ended at around 15:00 HT time yesterday, so just before the site went down. This means that only the data for, give or take, 30 minutes was lost.

How does that impact your team? Read more about it in system info.

These have been circumstances beyond our control, and we are very sorry for the inconvenience this may have caused some of you.

Downtime messages

Hattrick announces 23.08.2010
During routine maintenance today, our disk cabinet supplier introduced an upgrade to firmware which resulted in our database being erased. This means we need to go back to a backup which was made a few hours earlier on Monday. Because of the enormous amount of data needed to be restored, this will take at least all night to get done. We will then need to sort through the inconsistencies in the data and let the heart of Hattrick catch up for the down time. We are hopeful to be back online by tomorrow afternoon. At this point we can't guarantee that we will be back before Cup games start, but it is our goal and we will keep you updated throughout Tuesday.
Hattrick announces 24.08.2010 (9:50)
Update 09:50: We have now restored all data from a backup taken around 15.00 yesterday (which means not that much data has been lost). We're checking for inconsistencies in the restored data now, and estimate to be back online around noon.
Hattrick announces 24.08.2010 (11:30)
Sorry to have to take the site down again, but the new disks are too slow as it seems. We're investigating, together with our hard disk supplier.
Hattrick announces 24.08.2010 (13:00)
During routine maintenance yesterday, our disk cabinet supplier introduced an upgrade to firmware which caused all our disks to crash. Because of this we had to restore our databases from a backup, taken about half an hour before the disk crash. This morning we've checked for inconsistencies in the restored data, and the site is ready to launch. Unfortunately, the disk performance is too slow to open the site. We need to wait until our hard disk controller has finished checking the disks for errors, and this will take some hours more. We hope to be back around 16:00 HT-time. We regret any inconvenience this may cause you.
Hattrick announces 24.08.2010 (14:25)
14:25 HT-time: During routine maintenance yesterday, our disk cabinet supplier introduced an upgrade to firmware which caused all our disks to crash. Because of this we had to restore our databases from a backup, taken about half an hour before the disk crash. This morning we've checked for inconsistencies in the restored data, and the site is ready to launch. Unfortunately, the disk performance is too slow to open the site. We need to wait until our hard disk controller has finished checking the disks for errors, and this will take some hours more. We hope to be back around 17:00 HT-time. We regret any inconvenience this may cause you.


Around 17:00 the site started to act very oddly, and even brought good old Hammo back. Not a good sign.

Hattrick announces 24.08.2010 (16:45)
16:45 We will have to keep Hattrick offline for at least a few more hours, in order to increase the disk performance to open the site. Unfortunately, matches scheduled during this downtime will be played. Transfer deadlines will however be extended. We humbly apologize for all the inconveniences this may have caused you. Rest assured that we are working really hard to remedy the situation as quickly as possible.

By 19:15 the site was up again. By 7:30 the next day everything (including the forums) worked again.


Forum comments

Keywords: (Downtime)
From: HT-Anne (14298242.5) as reply to (14298242.4)
To: Everyone 24.08.2010 at 19:09
So we brought the site back up a little before noon and then we noticed our disks being excruciatingly slow. We had to run a disk check, which meant taking the site back down. The disk check is taking a very long time, and thus we opted to migrate the Data to another set of disks to be able to be up and running as fast as possible. We were able to abort the disk check to finish the migration. And we are now up and running.

All the problems stem from a buggy firmware update, made by our HD cabinet vendor, which crashed our Database.

The Youth section of the site is currently unavailable, we are working to get it back up as soon as possible.


Keywords: (Downtime)
From: HT-Johan (14298243.5) as reply to (14298243.4)
To: Everyone 25.08.2010 at 8:49
The forums are now back up, as you can see. We tried to bring them back online with the rest of the site yesterday evening, but as we had to move the game engine to our backup disk cabinet (which is a bit slower and has a great deal smaller storage capacity) the forums just were too slow to function. (This is also why youth could not open either.)

We now have them on another temporary disk for the forums and we will have to see how well that works. It is not inconceivable that we will need to take down the forums again when the site gets busier later today. The main disk storage - the one that caused the problems in the first place - is still being prepared to take on the load of the whole system.

We can do this over here I realize now :)

(14298242.1)


Keywords: (Downtime), (Youth Academy)
From: HT-Anne (14298242.54) as reply to (14298242.25)
To: Gambit_ 25.08.2010 at 9:05
Gambit_ wrote:
Youth teams are still down to allow the youth engine to catch up. They should be back shortly.

What do you mean shortly? I am waiting for opening YA since yesterday evening in order to send lineup for my friendly match which starts today at 8:00 :(


I'll update the information over there.

We don't have an estimate on when the Youth Academies will be back up yet. They are now running on the same disks as the forums and we need to monitor performance to be able to give you an estimate. Right now the Youth engine is trying to catch up for the down time. As soon as I know more I'll of course inform you.


Keywords: (Downtime)
From: HT-Anne (14298242.76) as reply to (14298242.48)
To: Pupske 25.08.2010 at 9:22
Pupske wrote:

Anne, it's too bad it happened, but we can't turn back time. What I'm really interested in is what is Hattrick going to do to prevent such thing from happening again?


From the details of what happened, I don't think there is very much that we could have done to prevent it from happening.

In this case our RAID supplier was updating the firmware and while he did that he crashed it. We managed to get most of it back, but one of the DBs didn't handle not having a hard disk anymore very well and that one had to be restored from backup, which had ended about 20-30 minutes before the crash happened.


Keywords: (Downtime)
From: HT-Tjecken (14298242.135) as reply to (14298242.48)
To: Pupske 25.08.2010 at 9:54
Pupske wrote:

Anne, it's too bad it happened, but we can't turn back time. What I'm really interested in is what is Hattrick going to do to prevent such thing from happening again?


Not updating firmware ever again.

We will naturally see what we can do to prevent this, but right now I have no good answer for you. We're still not over this situation; the site is up again and most functionality is back, but we still have a lot to do with our disk performance before we can call this over. And then we can start looking at whether there are more things we can do to avoid situations like this. That said, as someone pointed out earlier on, one can never totally avoid the risk of disk crashes, downtimes etc. But it should naturally be kept to the very minimum.


Keywords: (Downtime)
From: HT-Anne (14298242.283) as reply to (14298242.271)
To: Schnuff 25.08.2010 at 11:29
We were back before noon yesterday. That is about when the data recovery ended. But then disk checks made the site unbearably slow. So we had to take it down again. The information regarding that was available on the down web.

While reading you, I feel like you think we want to cause our users harm. Or don't care that some countries had their cup matches starting without them having any time to check orders. I can assure you that nothing is further from the truth. We tried all we could to bring back the site before 18:00 HT-time, which is when the first cup matches started. We even migrated data to another set of disks, but we then were only able to make it back for 19:17. It's unfortunate, but at one point, we were told another 24h. And I can tell you that I then had tears in my eyes. Because I knew just what that meant for users.


Keywords: (Downtime)
From: HT-Anne (14298242.325) as reply to (14298242.314)
To: Hiddink14 25.08.2010 at 12:05
-(Long quote from Hiddink14)-

Transfers were extended by 24h when we first brought the site back up before noon yesterday. When we pulled the plug again, we couldn't roll back that extension. So we added an extension on top of that.

So transfers between 15h Monday and 15h Tuesday were extended by 24h the first time, and by 5h the next time. But transfers between 15h Tuesday and 19h on Tuesday were extended by only 5h.

Is this clearer?
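
The extension rule described in this post can be written out as a small calculation. The following is only an illustrative sketch, not Hattrick's actual code: the function name, the constants and the exact handling of the window boundaries (start inclusive, end exclusive) are assumptions, and all times are treated as plain HT-time values.

```python
from datetime import datetime, timedelta

# Window boundaries taken from the post above (HT-time).
# Names and boundary handling are hypothetical, for illustration only.
MONDAY_15H = datetime(2010, 8, 23, 15, 0)
TUESDAY_15H = datetime(2010, 8, 24, 15, 0)
TUESDAY_19H = datetime(2010, 8, 24, 19, 0)


def extended_deadline(deadline: datetime) -> datetime:
    """Return the transfer deadline after both extensions are applied."""
    if MONDAY_15H <= deadline < TUESDAY_15H:
        # Extended by 24h the first time, then by 5h on top of that.
        return deadline + timedelta(hours=24 + 5)
    if TUESDAY_15H <= deadline < TUESDAY_19H:
        # Only the second, 5h extension applies.
        return deadline + timedelta(hours=5)
    # Deadlines outside the downtime windows are unchanged.
    return deadline
```

Under these assumptions, a deadline at 18:00 on Monday moves to 23:00 on Tuesday (29h later), while a deadline at 16:30 on Tuesday moves to 21:30 the same day.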


Keywords: (Downtime)
From: HT-Anne (14298242.419) as reply to (14298242.405)
To: Schnuff 25.08.2010 at 13:15
Schnuff wrote:
HT-Anne wrote:
While reading you, I feel like you think we want to cause our users harm. Or don't care that 
some countries had their cup matches starting without them having any time to check orders. 
I can assure you that nothing is further from the truth.

I can assure you that I am far from thinking this. I am sure that you all do your very best in the current setting/organisation form you are in. But I am also 100% sure that there are lots of things to improve especially when it comes to organisation/prioritisation issues. This is what I am talking about.

HT-Anne wrote:
We tried all we could to bring back the site before 18:00 HT-time, which is when the first cup matches started. 
We even migrated data to another set of disks, but we then were only able to make it back for 19:17. 
It's unfortunate, but at one point, we were told another 24h. And I can tell you that I then had tears in my eyes. 
Because I knew just what that meant for users.

There was nothing you could do. And this is what happens quite often. You cannot ensure that a downtime lasts only some minutes/hours and that there are no important matches in this period of time, but you can change the way you handle these problems. Make it independent, I mean, why do you handle Hattrick as live system even if it is completely down? Hattrick can come back online at any time point you want, there is absolutely no need for letting the system catch up while being offline...

Just think about it, it is just a proposal, I know that you all are hard-working and motivated :-)


I'm sorry I made the wrong assumptions :)

The only thing I can tell you about what you've written here is that Hattrick needs to be dealt with as a live environment because it's planned week to week and its schedule is full week to week. We don't have much wiggle room to shift things in time.

And sometimes, from what I'm told by those who know, it's best to let the system catch up while being offline because it causes less problems later on.

Yesterday when we brought the site back up, the people who were complaining because the site was down were then complaining because it was too slow. And that is what happens when we don't let the system catch up completely. It took us a couple of hours to get the engine caught up after we were back.

I'm sure there are things we could improve. We are very far from perfect. But I think this crash was handled in the best possible way considering everything at stake.


Keywords: (Downtime)
From: HT-Anne (14298242.476) as reply to (14298242.441)
To: Andreac-NH 25.08.2010 at 13:42
Andreac-NH wrote:
HT-Anne wrote:
I honestly don't think this debate is fit for this thread, or any thread really. 
We don't discuss our revenues, much like we don't discuss our expenses. 
Sorry.

It's ok. I understand. Really.

I just wonder if you're worried that this disaster could cost a lot of supporters or that you think that all in all, when the storm is gone, everything will be the same?

Are you thinking that the frustration is high, and maybe to show that you DO CARE about us, then it's time to try to be more open to community desires, even very little things like "more than 5 feds" or "change the obsolete goldengoal"?


Yes absolutely. I do think events like these are damaging to our "image". Which is what it boils down to really.

And yes, I agree, extra fed slots should have happened long ago. And will happen. Hopefully soon. We do need to "cuddle" our users now, to help them forget some of the recent problems.

But I also think there is a lot of positive in what happened in those 28h. The last crash, we had to roll back a whole week. This time? Only 30 minutes were lost. It means we've improved a lot on our processes and backups. Information flow had improved a lot. Users were kept informed, which is very important.

So yes of course it's awful that we had this crash. And we will pay for it for a while. But I prefer to look at the bigger picture rather than focus on the negative aspects.


Keywords: (Downtime)
From: HT-Anne (14298242.672) as reply to (14298242.658)
To: Grebulon 25.08.2010 at 16:58
Grebulon wrote:
HT-Anne wrote:
Thankfully, the last backup ended at around 15:00 HT time yesterday, so just before the site went down. 
This means that only the data for, give or take, 30 minutes was lost.

Was it coincidence and luck that a backup finished just before the bottom fell out? Or was this a planned backup which had been scheduled specifically to happen just before the hardware maintenance? (Nudge nudge, wink wink, know what I mean?)


We have a backup planned on that DB every hour on the hour, from what I'm told by our IT guru. So we waited for one full turn to complete before we let them near our machines.
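
With a backup every hour on the hour, the amount of data at risk is simply the time elapsed since the last full hour. The sketch below illustrates this with the 15:20 HT-time crash reported on this page; it assumes, for simplicity, that each backup completes exactly at the top of the hour, and the function names are hypothetical.

```python
from datetime import datetime, timedelta


def last_hourly_backup(moment: datetime) -> datetime:
    """Most recent top-of-the-hour at or before `moment`.

    Assumption: the backup is treated as complete exactly on the hour.
    """
    return moment.replace(minute=0, second=0, microsecond=0)


def data_loss_window(crash: datetime) -> timedelta:
    """Time between the last hourly backup and the crash."""
    return crash - last_hourly_backup(crash)


crash = datetime(2010, 8, 23, 15, 20)  # crash time reported on this page
print(last_hourly_backup(crash))       # 2010-08-23 15:00:00
print(data_loss_window(crash))         # 0:20:00
```

The 20-minute result is in line with the "20-30 minutes" and "give or take, 30 minutes" figures quoted above.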


Keywords: (Downtime)
From: HT-Anne (14298242.859) as reply to (14298242.805)
To: cartman89 26.08.2010 at 8:55
cartman89 wrote:

I might be the guy who came up with the idea of firing Swedes as a protest (a light-hearted one though ^^), but today, having some understanding of the problems associated with databases, computer networks and hardware failures, I would like to express my support to you the HT-Team... this might have turned out much worse, after all, as it had in the past.

The issues, like loss of training, walkovers, Cup eliminations, are apparent to everybody, I agree, but I can see how hard it would be to avoid them, given the circumstances...

Only thing I'd like to point out: if it isn't possible to postpone matches, in my opinion it would be advisable to prepare some sort of "Plan B".

For instance, free training would be a partial compensation... and it's not like we've never seen any tailor-made training scripts from you before ;)

So it sounds feasible, unless I'm missing something :)


Our policy is to usually never compensate for bugs etc. And I'm not sure giving free training would really solve anything. Should we give out free training, people who lost cup revenues (because they were eliminated for their lack of having a proper lineup) would be happy with that. There is just no contenting everybody. And when we've tried compensating users in the past, the only time I really remember is for the arenas. People weren't happy with that solution.

So for now, compensation is not happening.

I think for us, it is best to start thinking how we can "cuddle" "nurture" our users. Make your lives better on site. Rather than try to compensate arbitrarily.
