
ongoing
No. | Incident | Who Affected | Description | Application | Status
120 | Loss of radio channels | Estates and IT

On Thursday 14 March 2019 at 0750 hours a radio repeater failed on Lamorbey campus, leaving college radios unable to communicate on channels 3 and 4. Alternative procedures are in place using other channels whilst we await a replacement unit

UPDATE 15 March at 1530 hours : the repeater is back online and service to channels 3 and 4 is resumed

Radio communication | resolved
119 | VLE errors | Users in USA

On Wed 13 March 2019 at 1630 hours our external monitoring started to report problems accessing the college's VLE website, specifically from within the USA. This was tracked down to a problem with Facebook/Flickr at the time, as external content from these services is embedded in the college's VLE

We confirmed that whilst some errors were being displayed it only related to this missing content and did not affect the VLE functionality

We will continue to monitor

UPDATE 13 March at 1930 hours : the issue has been resolved

VLE | resolved
118 | SAN controller failure | All users

On Tues 22 Jan 2019 at 1130 hours one of the virtual server cluster hosts went off-line. This was caused by a controller card failure in the attached storage. All services remained online but with a loss of all resilience, so all services should be considered at risk

UPDATE 23 Jan at 0830 hours : the failed virtual server was brought back online and rejoined the cluster

virtual server host | resolved
117 | Water problem in server room | All users

On Sat 19 Jan 2019 at 1030 hours during routine maintenance a water leak was discovered in server room L115. This was traced to a leaking AC unit

UPDATE 21 Jan at 0800 hours : Estates dried out floor and provided temporary containment

UPDATE 22 Jan at 1030 hours : engineer identified the cause of the leak in the AC drain pipe and implemented a permanent fix

server room L115 | resolved
116 | Wireless controller failure | On campus users of wi-fi

On Sat 12 Jan 2019 at 0830 hours during routine maintenance checks one of the resilient wireless controllers failed after a reboot. This removed the high availability of the service, which should therefore be considered at risk

UPDATE Mon 14 Jan at 0900 hours - efforts to recover the controller have failed so a support call has been logged to get it replaced

UPDATE 24 Jan at 0800 hours : FortiNet engineers are investigating the cause and have collected logs for further analysis

UPDATE 28 Jan at 0800 hours : issue identified as bug 0532038, still awaiting fix

UPDATE 6 Feb at 0800 hours : patch applied, HA cluster rebuilt and tested, and normal service resumed

wireless | resolved
115 | Email delivery problem | Users on campus

On Thurs 20 Dec 2018 at 0920 hours monitoring identified problems with users sending emails to external email addresses - they were being returned undeliverable

After investigation it was discovered that an on campus email server had corrupted the TLS certificates used to secure the email flow off campus to Microsoft

UPDATE Thurs 20 Dec 2018 at 1000 hours - the problematic email server was removed from the service cluster and normal email flow resumed

outgoing email | resolved
114 | Google public DNS | off campus users

On Tues 18 Dec 2018 at 0710 hours our monitoring identified issues with DNS resolution for our domain bruford.ac.uk with Google's two public DNS servers

This was confirmed as only affecting Google's public DNS servers 8.8.8.8 and 8.8.4.4

We advise users to switch to another DNS server - like OpenDNS 208.67.222.222 whilst we resolve the issue with Google

UPDATE Tues 18 Dec 2018 at 0830 hours - mitigating problem with Google but expect further disruption throughout the day if using Google's public DNS servers

UPDATE Tues 18 Dec 2018 at 1030 hours - issue resolved and no further disruption anticipated

public DNS for domain services | resolved
113 | O365 SharePoint Online | All users

On Tuesday 6 November 2018 at 1100 hours our monitoring indicated problems with trying to access college resources in SharePoint sites both on and off campus

This was related to a Microsoft incident SP152986

UPDATE Tues 6 Nov 2018 at 1330 hours - mitigation being applied by MS

UPDATE Tues 6 Nov 2018 at 1430 hours ongoing mitigation being applied by MS

UPDATE Tues 6 Nov 2018 at 1620 hours - issue reported as remediated by MS, we will continue to monitor

SharePoint Online sites | resolved
112 | O365 SharePoint and OneDrive | All users

On Monday 22 October 2018 at 2010 hours our monitoring indicated problems with slow response when trying to access college resources in SharePoint sites and OneDrive both on and off campus

This was related to a Microsoft incident SP151830

UPDATE Tues 23 Oct 2018 at 0915 hours - The problem continues and now includes timeouts when trying to open and save documents within these areas

UPDATE Tues 23 Oct 2018 at 1315 hours - Ticket raised with MS #11844977 requesting further support

UPDATE Tues 23 Oct 2018 at 1515 hours - performance issues accessing sites returning to normal, we will continue to monitor

UPDATE Tues 23 Oct 2018 at 1730 hours - service returned to normal

SharePoint and OneDrive sites | resolved
111 | Athens authentication | Users using Athens gateway

On Mon 1 Oct 2018 at 1450 hours our monitoring indicated problems with the Athens authentication gateway that protects online resources. This was confirmed as a major outage

UPDATE Mon 1 Oct 2018 at 1715 hours - Eduserv have applied a fix and systems are recovering, we will continue to monitor

UPDATE Tues 2 Oct 2018 at 0715 hours - service now stable

Athens authentication | resolved
110 | Power failure | Users in Rose Theatre and Cafe

On Fri 7 Sept 2018 at 0917 hours monitoring reported loss of connectivity to all voice and data devices located within the Rose Theatre building. Further investigation identified a power failure to the data switch connecting to other buildings on campus

UPDATE Fri 7 Sept 2018 at 0945 hours - power was restored to the data cabinet but there were further problems when trying to power on the data switch - a replacement power supply was requested from support

UPDATE Fri 7 Sept 2018 at 1150 hours - the power supply was replaced and the switch powered on and repatched. All systems returned to normal operation

Voice, Wi-Fi, computers and CC terminals | resolved
109 | O365 shared areas and email | Users based in USA

On Tues 4 Sept 2018 at 1115 hours monitoring reported that access to O365 shared areas and email was failing in the USA

UPDATE 4 Sept 2018 at 1400 hours : Microsoft advised known incident MO147606 is the cause

UPDATE 4 Sept 2018 at 1530 hours : Microsoft advised further details of symptoms

 

UPDATE 4 Sept 2018 at 1630 hours : Monitoring reporting services returning to normal

O365 sharepoint sites and email | resolved
108 | O365 shared areas | all users of shared areas

On Thurs 30 Aug 2018 at 0810 hours users started to report issues accessing O365 sharepoint sites (shared areas); sometimes they got in but mostly access just hung after authenticating. Confirmed not an authentication issue as no other services were affected

UPDATE 30 Aug 2018 at 1000 hours : Microsoft advised known issue SP147225 is the cause

UPDATE 30 Aug 2018 at 1300 hours : Escalated with Microsoft #1276633

UPDATE 30 Aug 2018 at 1345 hours : Intermittent access restored but currently running very slow

UPDATE 30 Aug 2018 at 1435 hours : MS confirm our tenancy is affected by service incident SP147225 with no ETA to fix yet

UPDATE 30 Aug 2018 at 1730 hours : Mitigation by MS being applied and stability and response times have improved, still awaiting further confirmation of status

UPDATE 31 Aug 2018 at 0630 hours : Ongoing recovery of service by MS, still at risk but monitoring indicates stable and responsive access over the past 12 hours - more details Known issues

UPDATE 31 Aug 2018 at 1630 hours : MS confirm service restored and issue closed - more details Known issues

O365 sharepoint sites | resolved
107 | Server failure | all users

On Sat 18 Aug 2018 at 1140 hours one of the hypervisor nodes failed due to corruption during a routine update window. This resulted in all hypervisor hosts being served from the remaining hypervisor node

The impact is the loss of resilience and load balancing across multiple systems and services, which may result in certain services being slower to respond. All systems are now considered 'at risk'

UPDATE Mon 20 Aug 2018 at 0630 hours: recovery of the failed hypervisor node OS was completed and fully tested so starting the rebuild of the cluster drives

UPDATE Mon 20 Aug 2018 at 1940 hours: rebuild of cluster drives complete and sync across nodes active. Normal HA cluster operations resumed

various systems | resolved
106 | VLE down | all users accessing VLE

On Tues 7 August 2018 at 1320 hours monitoring showed loss of access from multiple locations. Cause currently being investigated

Update 7 Aug 2018 at 1340 hours VLE monitoring reporting back online

Update 8 Aug 2018 at 0700 no further issues logged 

VLE | resolved
105 | Primary server room failure | all users

On Sun 27 May 2018 at 1730 hours the two AC units in the primary server room failed, resulting in an uncontrolled increase in room temperature. This reached critical at 1750 hours when a number of systems and services located in the server room failed

Update 27 May at 1815 hours: all remaining services and systems were shut down, the AC units were power cycled and the server room vented

Update 27 May 1900 hours: a temporary AC unit was installed to allow some systems to be restarted

Update 27 May 1935 hours: identified failed hardware and started backup restore of systems

Update 27 May 2005 hours: key authentication, email and DNS systems back online

Update 27 May 2110 hours: restore of primary key systems complete and services back online but with no redundancy

Update 28 May 0830 hours: confirmed temporary AC in server room holding and no further failures

Update 28 May 1600 hours: confirmed temporary AC in server room holding and no further failures

Update 29 May 0830 hours: key secondary systems brought back online as temp AC still holding

Update 30 May 0700 hours: still awaiting fix/replacement of broken AC unit; as a result all third level systems remain off-line, which includes DA, wireless, SQL replication, DAG, hyper-v replication, GFI, WUS, WDS, RDP and all resilient systems. No current ETA to fix

Update 31 May 0700 hours: awaiting installation of temporary hire AC units later today

Update 31 May 1130 hours: BYOD wireless service restored 

Update 31 May 1730 hours: hired AC unit installed, all third level systems/services now back online, risk level changed from critical/red to warning/yellow

Update 20 June 1130 hours: failed AC unit replaced and server room returned to normal operations

all systems | resolved
104 | Global transit links across JaNET | all users

On Tues 8 May 2018 at 0818 hours the global transit providers out of the JaNET network went off line

This resulted in loss of internet access to certain parts of the world and also affected external users trying to access services on campus

UPDATE 8 May at 0935 hours; services returned to normal 

internet | resolved
103 | MyAthens login error | Users trying to access MyAthens

On 20 March 2018 at 0345 hours our monitoring reported problems accessing the myAthens home page after logging in to openAthens - it returns an HTTP 500 server error

This error was logged with EduServ as it seems to only affect accessing the myAthens site not authentication or resources

UPDATE 20 March at 1100 hours - site back online

myAthens site | resolved
102 | Problems accessing college resources off campus | Users based in USA and Australia

On 20 March 2018 at 0738 hours our external monitoring reported problems with users accessing college web resources from locations in the USA and Australia

This has been logged with JaNET 

UPDATE : 20 March at 0900 hours - JaNET confirm routing problems #TT180695

UPDATE : 20 March at 0930 hours - routing issue resolved and services are returning to normal 

off campus resources | resolved
101 | JaNET link issues | redundant systems

On 6 March 2018 at 0700 hours our external monitoring reported intermittent problems with our resilient JaNET link via PR

A ticket was raised with JaNET Operations #180634 

UPDATE : 6 March at 0900 hours - advised routing/resolver issue in core network which is spreading to other network services

UPDATE : 6 March at 0930 hours - affecting external peering on core network, which is affecting on campus services

UPDATE : 6 March at 1100 hours - JaNET advised that the issue experienced this morning was due to a corrupted forwarding table in a router within Telehouse North. All systems returned to normal

resilient link via PR and other services | resolved
100 | Power loss | none

On 3 March 2018 at 0054 hours the campus UPS devices switched to battery operation until 0104 hours

No disruption to live services was recorded during this period; further information on the cause has been requested from Estates

power | resolved
99 | VLE down | all users of vle.bruford.ac.uk

On 1 March 2018 at 1117 hours our monitoring reported that vle.bruford.ac.uk was not responding and further investigation confirmed the outage from multiple locations.  Ticket raised with CoSector Digital (ULCC)

UPDATE : 1 March at 1150 hours - ULCC confirm power outage at data centre, no time to fix given

UPDATE : 1 March at 1210 hours - website now responsive to requests but returning an error

 

UPDATE : 1 March at 1230 hours - CoSector Digital (ULCC) confirmed power restored but now in recovery mode for the next few hours until normal services return 

UPDATE : 1 March at 1645 hours - confirmed vle.bruford.ac.uk back online 

VLE | resolved
98 | AC failure | on campus services

On 22 Feb 2018 at 0040 hours the AC unit in server room two failed, which resulted in the shutdown of systems running in this room - primary affected users have been notified and all campus services systems should be considered 'at-risk' until further notice

UPDATE : 22 Feb at 0700 hours - Estates dept. notified of failure

UPDATE : 27 Feb at 0700 hours - No change still awaiting fix for failures

UPDATE : 5 March at 0700 hours - No change, still awaiting fix for failure

UPDATE : 12 March 0800 hours - engineer onsite investigating the failure

UPDATE : 19 March at 0700 hours - No change, still awaiting fix for failure

UPDATE : 26 March at 0700 hours - No change, still awaiting fix for failure

UPDATE : 2 April at 0700 hours - No change, still awaiting fix for failure

UPDATE : 9 April at 0700 hours - No change, still awaiting fix for failure

UPDATE : 16 April at 0700 hours - No change, still awaiting fix for failure

UPDATE : 23 April at 0700 hours - No change, still awaiting fix for failure

UPDATE : 30 April at 0700 hours - No change, still awaiting fix for failure

UPDATE : 3 May at 1130 hours - AC unit fixed in C120, waiting for temperature to stabilise over the next 24 hours 

UPDATE : 8 May at 0700 hours - AC unit fixed in C120, all services/systems back online

various | resolved
97 | Compromised website - vle.bruford.ac.uk | All visitors of VLE

On 11 Jan 2018 at 1030 hours CSIRT notified us of a possible website compromise

[JANET_CSIRT #1624173] Possible webserver compromise

> Google detected 4 suspicious URLs (space inserted to prevent
> accidental clicking in case your email client auto-links URLs):
> http://vle.bruford .ac.uk/mod/url/view.php?REDACTED (128.86.140.93)
> http://vle.bruford .ac.uk/mod/url/view.php?REDACTED (128.86.140.93)
> https://vle.bruford .ac.uk/mod/url/view.php?REDACTED (128.86.140.93)
> https://vle.bruford .ac.uk/mod/url/view.php?REDACTED (128.86.140.93)

https://transparencyreport.google.com/safe-browsing/search?url=http:%2F%2Fvle.bruford.ac.uk%2Fmod%2Furl%2Fview.php

Update : 11 Jan at 1230 hours - confirmed scan status also at WebsecurityGuard

UPDATE : 11 Jan at 1530 hours - ULCC confirmed location of offending link (theatrefutures.org.uk) and removed it

Waiting for Google to rescan the site to confirm status cleared 

UPDATE : 12 Jan at 0800 hours - Google safe search still not cleared 

UPDATE : 13 Jan at 0900 hours - Google safe search flag reset to safe

VLE | resolved
