date: Sat, 31 Oct 2009 22:39:04 +0000
group: uk.net.providers.aaisp
[Status] [Update #5] [open] Major issue (not BT)
Posted at 2009-10-31 22:12 GMT by RevK
Update #5: 2009-10-31 22:39 GMT
Something is up. Not a crash but we are investigating now. Looks like
it may be an issue with our RADIUS.
Update: Looks like some sort of issue started a little before 9pm and
has resulted in some lines going off line after 10pm.
Update: We can confirm the LNS did not crash. Lines are coming back
now, and we are trying to find the underlying cause of the issue.
Looking at graphs it seems most lines that have been affected were only
off for a couple of minutes.
Update: This looks more complex. We have seen lots of LCP timeouts as
well.
OK, the issue looks like one of RADIUS accounting stopping mostly
around 9pm and it is not clear why. We have restarted the RADIUS
servers, and it seems to be working at the moment. We are checking logs
carefully to find how this happened.
Update: We are clearly still seeing sessions stuck on 20CN RASs, but it
looks like we have most people back on line.
URL: http://aaisp.blogspot.com/2009/10/open-major-issue.html
--
AAISP Status Blog
URL:http://aaisp.blogspot.com/
date: Sat, 31 Oct 2009 22:39:04 +0000
author: RevK
|
[Status] [Update #6] [closed] Major issue (not BT)
Posted at 2009-10-31 22:12 GMT by RevK
Update #6: 2009-10-31 22:46 GMT
Update: It looks like 21CN lines had a brief outage but 20CN lines were
off for much longer, as we have had to go through clearing stuck
sessions in BT, something that was meant to have been fixed some time
ago. We'll chase this with BT.
URL: http://aaisp.blogspot.com/2009/10/open-major-issue.html
date: Sat, 31 Oct 2009 22:46:00 +0000
author: RevK
|
[Status] [Update #7] [closed] Major issue (not BT)
Posted at 2009-10-31 22:12 GMT by RevK
Update #7: 2009-11-01 16:54 GMT
Update: Sunday. It has taken a bit of investigation... We think we have
identified the cause. It was to do with work for the new test LNS on
Saturday night: some very minor changes that were not right caused
some authentication requests not to get a reply. On their own they
would not have had this effect, but the LNS is a tad inefficient when it
does not get a RADIUS response (an issue we have been working on
anyway), and this, combined with a problem in RADIUS accounting, meant
that several hours later things went wrong. The action taken at the time
the problems started was to revert all changes in source control
immediately as a precaution, and this solved the problem. BT having
stuck sessions caused further knock-on effects, as usual. Only now are
the pieces fitting together to explain how such a minor change had a
knock-on effect. The accounting issue has been sorted. The
authentication issue has been sorted. A fix for the LNS inefficiency is
in development for the next LNS release. So all should be fine now. We
do apologise for the inconvenience, and yes, this time it was not
entirely BT to blame!
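For readers wondering how unanswered requests can build up into a
problem hours later, here is a purely illustrative sketch of the
general idea of giving each outstanding request a timeout and a
bounded retry budget, so that requests which never get a reply are
dropped rather than left to accumulate. This is not the LNS or RADIUS
code involved in this incident; the class names, the timeout and the
retry limit are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Pending:
    """One outstanding request awaiting a reply (hypothetical structure)."""
    request_id: int
    deadline: float      # time at which the current attempt times out
    attempts: int = 1    # how many times the request has been sent so far


@dataclass
class RequestTracker:
    """Tracks unanswered requests with a bounded retry budget."""
    timeout: float = 2.0      # seconds to wait per attempt (assumed value)
    max_attempts: int = 3     # give up after this many sends (assumed value)
    pending: dict = field(default_factory=dict)

    def send(self, request_id: int, now: float) -> None:
        # Record a newly sent request and when its first attempt times out.
        self.pending[request_id] = Pending(request_id, now + self.timeout)

    def reply(self, request_id: int) -> bool:
        # A reply arrived: forget the request. Returns False if it is unknown.
        return self.pending.pop(request_id, None) is not None

    def tick(self, now: float) -> tuple[list[int], list[int]]:
        """Process timeouts; return (ids to resend, ids given up on)."""
        resend, failed = [], []
        for req in list(self.pending.values()):
            if now < req.deadline:
                continue  # still within its timeout window
            if req.attempts >= self.max_attempts:
                # Retry budget exhausted: drop it so state cannot pile up.
                del self.pending[req.request_id]
                failed.append(req.request_id)
            else:
                # Timed out but retries remain: schedule another attempt.
                req.attempts += 1
                req.deadline = now + self.timeout
                resend.append(req.request_id)
        return resend, failed
```

The caller passes the current time in explicitly, which keeps the
sketch deterministic; a real server would use its own clock and event
loop, and would pick its limits to suit its own traffic.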
URL: http://aaisp.blogspot.com/2009/10/open-major-issue.html
date: Sun, 01 Nov 2009 16:54:47 +0000
author: RevK