date: Thu, 02 Jul 2009 09:35:02 +0100, group: uk.net.providers.aaisp
[Status] [info] 21CN what is the problem?   
Posted at 2009-07-02 09:21 BST by AAISP

It is worth explaining the problem as I am sure there are many that
want to know.

In summary - one tricky problem which has hit us again and for which we
have a workaround, and one simpler problem we have definitely fixed.

We have deployed new L2TP LNS code. The reasons for this are various,
and it has been planned for a while. The main drive is to allow us to
add a lot of new features, but there were some issues with BT over the
last month and we wanted to be able to sort those. The new code allowed
various workarounds and additional diagnostics.

The new code has had two main issues.

1. Attack management.

Basically, with almost any system, it is possible to overload it with
enough packets, especially small packets. The system is designed to
handle this, but the system was not quite right, and so this caused
watchdog errors. The front end handling has been put in place quickly
and that initial problem was solved.

However, this moved the problem on - the front end could handle the
packets but specific types of packets are handled by the control
processor. Again, storms of packets coming in way too fast for it. We
have systems in place to limit traffic but there was a small window
where certain packets at just the right rate caused a watchdog error.
Faster or slower or bigger packets and it would not - it was packets at
just the wrong rate. The initial step was a very crude limit to stop
this. It was not elegant but worked.
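
As an illustration only, a crude limit of this kind can be sketched as a
fixed-window packet counter in front of the slow path. This is not the
actual LNS code; the class name, the cap of 100 packets and the one second
window are all assumptions made up for the example.

```python
class CrudeLimiter:
    """Fixed-window cap on packets passed to a slow path (hypothetical sketch)."""

    def __init__(self, max_per_window, window_seconds):
        self.max_per_window = max_per_window  # assumed cap, not a real LNS setting
        self.window_seconds = window_seconds
        self.window_start = None
        self.count = 0

    def allow(self, now):
        # Start a fresh window once the current one has expired.
        if self.window_start is None or now - self.window_start >= self.window_seconds:
            self.window_start = now
            self.count = 0
        if self.count < self.max_per_window:
            self.count += 1
            return True
        return False  # over the cap: drop the packet rather than queue it


def simulate_storm(limiter, packets, interval):
    """Feed `packets` arrivals spaced `interval` seconds apart; return how many pass."""
    passed = 0
    for i in range(packets):
        if limiter.allow(i * interval):
            passed += 1
    return passed


if __name__ == "__main__":
    # 5000 small packets arriving within half a second, capped at 100 per second.
    print(simulate_storm(CrudeLimiter(100, 1.0), 5000, 0.0001))
```

The point of the sketch is that the cap applies whatever the arrival rate,
so there is no "just the wrong rate" window left open; the cost is that
legitimate packets above the cap are dropped too.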

We have since put in a more elegant way of handling this, and we have
today discovered it may be more elegant but it does not work. For now
the crude limit is being reinstated.

What are these attacks though? Well, we have found out. They are not
actually attacks at all. They are some sort of storm on the LAN.
Thousands of small ICMPv6 packets bouncing around between switches in a
fraction of a second. Sometimes. Not all the time. It is clearly a
switch being stupid. We are planning to replace all of the switches
involved over the next few weeks as this is clearly unacceptable, even
if it has highlighted an issue.

This is why it has been impossible to reproduce the specific problem on
the bench.

2. Monday's issue...

On Monday lunchtime we had both LNSs crash repeatedly. Thankfully this
too was something we managed to track down. It was a simple matter of
customers accessing the graphs produced by the LNS from a LAN with a
very large MTU. Yesterday, having set my desktop machine to a large MTU
and not set it back, I inadvertently tested that this was indeed the
cause of the problem. The fix was simple and has been deployed.

What now?

Well, the management of these packet storms did not work correctly, and
so we are issuing a 1GB peak time topup on all of the 21CN logins
today. I really hope this will be the last.

We will be running newer LNS code on the other LNS, probably later
today, and moving people over to it. This will have the cruder fix for this
issue in place. We'll go back to the drawing board and work out how to
fix it properly for a later deployment. We'll see if there is any way
we can reproduce the cause on the bench as well.

I hope the explanation is useful.

Please do contact me on IRC to discuss in more detail.

Adrian
Director.

URL: http://aaisp.blogspot.com/2009/07/info-21cn-what-is-problem.html

-- 
AAISP Status Blog
URL:http://aaisp.blogspot.com/