Continuity of Services Plan

WSU Information Technology Division Continuity of Service Plan as of: Friday, November 15, 2002

INTRODUCTION

Over the past several years Weber State University (WSU) has set up a highly computerized environment. This includes the use of microcomputers in offices as well as minicomputer and mainframe servers that provide much of the operational support for administrative and academic units. A campus-wide network ties these various systems together and provides communications to other computer networks and the computer diagnostic facilities of the various computer vendors involved. In addition, operation of the campus network is a vital support component of the university system, including the operation of local and long distance telephone services, Utah Educational Network (UEN) and cable TV.

The reliability of computers and computer-based systems has increased dramatically in the past few years, and those computer failures that do occur can normally be diagnosed automatically and repaired promptly using both local and remote diagnostic facilities. Many computer systems contain redundant parts, which improve their reliability and provide continual operation when some failures occur.

In the past, when most computer operations were predominantly batch, reciprocal agreements for computer batch running, usually at night and/or on weekends, were often made between users of similar systems. This has become less feasible with the very complicated on-line and diverse network systems most institutions now have installed. Although institutions may have similar equipment and operating systems, they generally do not have the capacity to add a large number of users from another on-line environment to their systems even if the technical problems could be solved.

A trend is evolving to provide alternate sites near the central systems where any additional equipment needed can be shipped in rapidly and critical on-line operations for the organization can be resumed in a reasonable time. Redundancy in the communications network and a tie into the alternate site, or the ability to tie in rapidly, is an important part of the continuity of service plan. This type of site is called a cold backup site, as opposed to a hot backup site, which contains all equipment necessary to start immediate operations.

For the most part, the major problems that can cause a computing system to be inoperable for a length of time result from environmental problems related to the computing systems. This plan identifies the various situations or incidents that can disable, partially or completely, or impair support of WSU's computing facilities, and provides a working plan for dealing with each situation.

Almost any disaster will require special funding from the university in order to allow the affected systems to be repaired or replaced. This report assumes that these funds will be made available as needed. Proper approval will be obtained before any funds are committed for recovery.

OBJECTIVES/CONSTRAINTS

A major objective of this document is to define procedures for a contingency plan for recovery from disruption of computer and/or network services. This disruption may come from total destruction of the central site or from minor disruptive incidents. There is a great deal of similarity in the procedures to deal with the different types of incidents affecting different departments in Information Technology. However, special attention and emphasis are given to an orderly recovery and resumption of those operations that concern the critical business of running the university, including providing support to academic departments relying on computing. Consideration is given to recovery within a reasonable time and within cost constraints.

The objectives of this plan are limited to the computing support given to WSU clients from Information Technology, including academic and administrative systems under the stewardship of Information Technology. Some elements that concern microcomputers are addressed; however, client-related functions not directly tied to computer and telephone support by Information Technology are not addressed. Also, offices at WSU should develop their own plans to deal with manual operations within their office should computer and/or network services be disrupted. Due to cost factors and benefit considerations at this time, the alternatives of hot sites and contracts with disaster recovery companies are not considered feasible or necessary for WSU.

All major computing systems that are vital for the daily operation of the University and under the stewardship of Information Technology are maintained under service contracts with the equipment vendors. This ensures that routine maintenance problems will be addressed in a timely way with adequate resources. These contracts range from telephone support only to full hardware replacement.

ASSUMPTIONS

This section contains some general assumptions, but does not include all special situations that can occur. Senior technology staff members on site will make any special decisions for situations not covered in this plan that are needed at the time of an incident.

  1. This plan will be invoked upon the occurrence of an incident. The senior staff member on site at the time of the incident, or the first on site following an incident, will contact the CIO and/or the Managers for a determination of the need to declare an incident.
  2. The senior technology staff member on site at the time of the incident will assume immediate responsibility. The first responsibility will be to see that people are evacuated as needed. If injuries have resulted or may occur as a result of the incident, immediate attention will be given to those persons injured. The WSU Security Police Department and Facilities Management will be notified if necessary. If the situation allows, attention will be focused on shutting down systems, turning off power, etc., but evacuation is the highest priority.
  3. Once an incident covered by this plan has been declared, the plan, duties, and responsibilities will remain in effect until the incident is resolved and proper university authorities are notified.
  4. Invoking this plan implies that a recovery operation has begun and will continue with top priority until workable computer and/or communications network support to the university has been re-established.

INCIDENTS REQUIRING ACTION

This disaster recovery plan will be invoked under one of the following circumstances:

  • An incident that has disabled or will disable, partially or completely, the central computing facilities and/or the communications network for a period of 24 hours.
  • An incident that has impaired the use of computers and networks managed by Information Technology due to circumstances that fall beyond the normal processing of day-to-day operations. This includes all academic and administrative systems that Information Technology manages.
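The two circumstances above can be sketched as a simple check. This is only an illustration: the `Incident` record and its field names are hypothetical, and treating "a period of 24 hours" as "24 hours or longer" is an assumption. The actual decision to declare an incident rests with the CIO and/or the Managers, not with any automated test.

```python
from dataclasses import dataclass

# Threshold taken from the plan text ("for a period of 24 hours").
# Reading it as "24 hours or longer" is an assumption of this sketch.
OUTAGE_THRESHOLD_HOURS = 24


@dataclass
class Incident:
    """Hypothetical incident record; the field names are illustrative only."""
    expected_outage_hours: float      # estimated length of the outage
    affects_central_or_network: bool  # central facilities and/or the network
    beyond_normal_operations: bool    # impairment outside day-to-day processing


def plan_should_be_invoked(incident: Incident) -> bool:
    """Return True if either circumstance listed in the plan applies."""
    long_outage = (
        incident.affects_central_or_network
        and incident.expected_outage_hours >= OUTAGE_THRESHOLD_HOURS
    )
    return long_outage or incident.beyond_normal_operations
```

A short outage of the central site alone would not meet the first circumstance, but it could still meet the second if it falls outside normal day-to-day processing.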

RECOVERY TEAMS

In case of a disaster, the emergency call list will need to be used. General duties of the disaster recovery coordinator are discussed. Recovery team leaders have been assigned in each major area and general duties given. Assignment of personnel in the major areas to specific tasks during the recovery stage will be made by the team leader over that area.

ORGANIZATION OF THE DISASTER/RECOVERY TEAM

  • Disaster Recovery Coordinator - CIO
  • Manager, Systems/Network Management
  • Manager, Computing Support
  • Director, Administrative Computing
  • Director, Technology Services

Academic Systems Recovery Team

  • System Administrator, Stewart Library
  • Computing Support Staff

Communications Recovery Team

  • Manager, Systems/Network Management
  • Manager, Telecommunications
  • Network Administrator
  • Telecom Analysts
  • CATS Analysts
  • ES&R Technicians

Administrative Systems Recovery Team

  • Director, Administrative Services
  • System Administrator, Systems/Network Management
  • Programmer/Analysts
  • Manager, Systems/Operations

DISASTER/RECOVERY TEAM HEADQUARTERS

  1. If TE-210D is not usable, the recovery team will meet in LH-201.
  2. If LH-201 is hazardous or not usable, the team will meet in the Miller Administration Building Board Room.
  3. If the Administration Building is not usable, the Disaster Recovery Coordinator will be responsible for locating another meeting place on campus.
  4. If none of the campus facilities are usable, it is presumed that the disaster is of such proportions that recovery of computer support will take a lesser priority. The Disaster Recovery Coordinator will make appropriate arrangements.
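The fallback order above amounts to picking the first usable room from an ordered list. A minimal sketch follows, assuming a caller-supplied `is_usable` check (hypothetical; in practice this is a judgment made on site):

```python
from typing import Callable, Optional

# Meeting places in the order the plan lists them.
HEADQUARTERS_ORDER = [
    "TE-210D",
    "LH-201",
    "Miller Administration Building Board Room",
]


def choose_headquarters(is_usable: Callable[[str], bool]) -> Optional[str]:
    """Return the first usable meeting place, or None if none is usable.

    None means the Disaster Recovery Coordinator must locate another
    meeting place, as described in steps 3 and 4.
    """
    for room in HEADQUARTERS_ORDER:
        if is_usable(room):
            return room
    return None
```

For example, `choose_headquarters(lambda room: room != "TE-210D")` returns `"LH-201"`.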

DISASTER RECOVERY COORDINATOR

The CIO will serve as Disaster Recovery Coordinator. The major responsibilities include:

  1. Determining the extent and seriousness of the disaster, notifying the President immediately, and keeping her informed of the activities and recovery progress. The CIO will also keep the other Vice Presidents informed.
  2. Invoking the Disaster Recovery Plan after approval of the President.
  3. Supervising the recovery activities.
  4. Coordinating with the President on priorities for clients while going from partial to full recovery.
  5. Naming replacements, when needed, to fill in for any disabled or absent disaster recovery members. Any members who are out of town and are needed will be notified to return.

ACADEMIC SYSTEMS RECOVERY TEAM LEADER RESPONSIBILITIES

The Manager of Computing Support will serve as Academic Systems Recovery Team Leader. The responsibilities in this area include recovery in case of complete or partial disruption of services from the central academic computers. Responsibilities include:

  1. Coordinating hardware and software replacement with the academic hardware and software vendors.
  2. Coordinating the activities of moving backup media and materials from the off-site security files and using these for recovery when needed.
  3. Keeping the Provost informed of the extent of damage and recovery procedures being implemented.
  4. Coordinating recovery with client departments, those using the academic computers and/or those using labs.
  5. Coordinating appropriate computer and communications recovery with the Communications Recovery Team Leader.
  6. Keeping the Disaster Recovery Coordinator informed of the extent of damage and recovery procedures being implemented.

ADMINISTRATIVE SYSTEMS/OPERATIONS RECOVERY TEAM LEADER RESPONSIBILITIES

The Director, Administrative Computing will serve as Administrative Systems/Operations Recovery Team Leader. Responsibilities include:

  1. Coordinating hardware and software replacement with the administrative hardware and software vendors.
  2. Supervising moving backup media and materials from the off-site security files and using these for recovery when needed.
  3. Coordinating recovery with client departments.
  4. Coordinating appropriate computer and communications recovery with the Communications Recovery Team Leader.
  5. Coordinating recovery of administrative software with client departments.
  6. Coordinating scheduling for administrative programming, production services, and computer scheduling.
  7. Keeping the Disaster Recovery Coordinator informed of the extent of damage and recovery procedures being implemented.

COMMUNICATIONS RECOVERY TEAM LEADER RESPONSIBILITIES

The Manager, Systems/Network Management will serve as the Communications Recovery Leader. Responsibilities include:

  1. Coordinating hardware and software replacement with the communications hardware and software vendors.
  2. Supervising recovery of the computer communications, telephone system and/or cable TV.
  3. Assigning personnel duties from telecom analysts to project leaders of disaster recovery tasks as needed.
  4. Coordinating activities of communications recovery with the other Recovery Team Leaders.
  5. Keeping the Disaster Recovery Coordinator informed of the extent of damage and recovery procedures being implemented.

PREPARING FOR A DISASTER

This section contains the minimum steps necessary to prepare for a disaster and for implementing the recovery procedures. An important part of these procedures is ensuring that the off-site storage facility contains adequate and timely computer backup tapes and documentation for applications systems, operating systems, support packages, and operating procedures.

GENERAL PROCEDURES

Responsibilities have been assigned for ensuring that each of the following actions has been taken and that any updating needed is continued.

  1. Maintaining and updating the disaster recovery plan.
  2. Ensuring that all Information Technology personnel are aware of their responsibilities in case of a disaster.
  3. Ensuring that periodic scheduled rotation of backup media is being followed for the off-site storage facilities.
  4. Maintaining and periodically updating disaster recovery materials, specifically documentation and systems information, stored in the off-site areas.
  5. Maintaining a current status of equipment in the main equipment rooms in the Technical Education Building.
  6. Informing all technology personnel of the appropriate emergency and evacuation procedures from TE.
  7. Ensuring that all security warning systems and emergency lighting systems are functioning properly and are being periodically checked by operations personnel.
  8. Ensuring that fire protection systems are functioning properly and that they are being checked periodically.
  9. Ensuring that UPS systems are functioning properly and that they are being checked periodically.
  10. Ensuring that the client community is aware of appropriate disaster recovery procedures and any potential problems and consequences that could affect their operations.
  11. Ensuring that the operations procedure manual is kept current.
  12. Ensuring that proper temperatures are maintained in equipment areas.

PHYSICAL SAFEGUARDS

Campus Service Building - Telecommunications Equipment Room

This area houses the telephone switch, data and video communications equipment and fiber optic cable hub.

  • Door Locks - Restricted key access
  • Fire Protection - There is an automatic detection system with an audio alarm and alarm panel in heating plant. A Halon fire suppression system is triggered by this alarm.
  • Water Protection - None
  • Power - The telephone equipment is connected to a 48V DC UPS system. This will maintain the telephone switch for 72 hours.
  • HVAC (heating, ventilation & air conditioning) - AC is provided by roof units. Heating is provided by central boiler plant.

Telecommunications Contacts

  • Primary - Barbara LeDuc
    • home phone - 393-2547
    • cell number - 540-3799
    • business phone - 626-6808
     
  • Secondary - Kyle Stoddard
    • home phone - 985-2422
    • cell phone - 721-0448
    • business phone - 626-6024

Vendor Contacts

  • Lucent - Jody Arave, (801) 726-0204
  • AT&T - Bryan Arnett, (801) 568-3214

Others

  • Bruce Robb - WSU Electronic Services
    • business phone - 626-8068
    • cell phone - 860-7311

Stewart Library - Room 65 Data Communication, UEN and Cable TV. 

This room is the UEN hub, the primary exit point for the Internet and home to cable TV.

Door Locks - Doors use standard key locks.  Access to the area includes CATS, Facilities Management, Police and Systems/Network Management staff.

Fire Protection - There is an audio alarm with a panel light in Physical Plant. The room has automatic pressurized-water extinguisher heads.

Water Protection - No detection

Power - UEN, data communication and cable TV equipment are connected to a 230 DC UPS system.  This will maintain power to the equipment for 24 hours.

HVAC (heating, ventilation & air conditioning) - AC is provided by central chiller system.  A secondary AC provides additional cooling to rooms 61, 62 and 65.  Heating is provided by central boiler plant.

CATS Contacts

  • Primary - Karen Stock
    • home 801 399-0050
    • business 801 626-6862
  • Secondary - Bob King
    • home 801 731-5231
    • cell 801 721-7077
    • business 801 626-6865
  • Alan Ferrin
    • home 801 776-2552
    • cell 801 497-8884
    • business 801 626-641

Vendor Contacts

  • UEN (Utah Educational Network) - 800 863-3496
  • USWest - 800 306-3496

Technical Education - TE 212/213

This area houses the centralized equipment for support of Academic Computing, Administrative Computing, and most of the Systems & Network Management hardware.

Fire Protection - There is an alarm system to detect smoke and heat. A pre-action system was recently installed to put out a fire in any part of the IT facilities in TE.

Door Locks - Standard key and combination locks are used.  Combinations are changed periodically.

Water Protection - None

Power - The central computing facility in TE 208 and 209 benefits from a UPS. The capacity to maintain power to the equipment is about 20 minutes.

HVAC - There are two air conditioning units in the main computer room. WSU's HVAC group from Facilities Management is responsible for service of these units. They are periodically checked, and service for emergency problems is available nights and weekends. Response is usually within an hour of the report. An updated list of service personnel to call is kept with key technology staff members and computer operations personnel.

Contacts

Computing Support Contacts

  • Primary - Gail Niklason
    • home phone 476-0083
    • business phone 626-6753
  • Secondary - Robb Herrmann
    • home phone 392-7856
    • business phone 626-7050
    • cell number 549-8755

Systems & Network Management

  • Primary - Ted McGrath
    • home phone 745-2572
    • business phone 626-7196
  • Secondary - Bill Clark
    • business phone 626-7669
    • cell number 540-3925
  • Secondary - Dan Guarine
    • home phone 825-0824
    • business phone 626-6652

Vendor Contacts

  • Perpetual Storage Inc. - Jim Nowa - 801 942-1952
  • Compaq Hardware/Software Services - 800-354-9000
  • Benchmark - 801 298-8200
  • MGE - 1-800-438-7373
  • WSU Facilities Management Department - 626-6331

SOFTWARE SAFEGUARDS

Computing Support

Novell file servers, Windows file servers, and Administrative Computing software and data are secured by full backups each weekend and differential backups each weeknight. Backup media for FRS, HRS and STAARS are DLT 20 GB tapes. Backup media for Novell and Windows file servers are 4mm DAT 4 GB tapes. Every Wednesday, Perpetual Storage Inc. picks up the full backup tapes and returns the previous week's tapes.
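The practical consequence of the schedule above is that a restore needs the most recent full backup plus only the latest differential, because a differential backup holds every change made since the last full backup. The sketch below illustrates this. It assumes the weekend full backup runs on Saturday and that differentials run Monday through Friday; the plan says only "each weekend" and "each weeknight", so both choices and all function names are assumptions.

```python
from datetime import date, timedelta
from typing import List, Optional, Tuple

FULL_BACKUP_WEEKDAY = 5  # Saturday (assumption; the plan says "each weekend")


def backup_type(day: date) -> Optional[str]:
    """Return the kind of backup taken on the given day, if any."""
    weekday = day.weekday()  # Monday is 0, Sunday is 6
    if weekday == FULL_BACKUP_WEEKDAY:
        return "full"
    if weekday < 5:  # Monday through Friday nights (assumption)
        return "differential"
    return None  # no backup on Sunday in this sketch


def backups_needed_for_restore(failure_day: date) -> List[Tuple[date, str]]:
    """List the backups needed to restore to the state before failure_day."""
    day = failure_day - timedelta(days=1)
    latest_differential = None
    # Walk backwards until the most recent full backup is found,
    # remembering the first (that is, latest) differential on the way.
    while backup_type(day) != "full":
        if latest_differential is None and backup_type(day) == "differential":
            latest_differential = day
        day -= timedelta(days=1)
    needed = [(day, "full")]
    if latest_differential is not None:
        needed.append((latest_differential, "differential"))
    return needed
```

For a failure on Thursday, November 14, 2002, this returns the full backup of Saturday, November 9 and the differential of Wednesday, November 13. If the full tapes have already been picked up for off-site storage, they would need to be brought back before the restore can begin.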

Vendor Contact:

  • Perpetual Storage Inc. - Jim Nowa - 801 942-1952

Telecommunications

At the close of business each night, all translation data stored in volatile memory is saved to the server's integrated Mass Storage System (MSS). A backup of these translations and all switch software is made each night at 11:00 p.m. to a High Density Tape Drive. Each Friday, a duplicate of this tape is made and stored off premises for redundancy purposes. In the case of an MSS failure or system upgrade/update, all data may be restored via the High Density Tape Drive. In the case of a system failure, the server can be rebooted from the tape drive.

For Databases and Call Detail Recording Equipment: At the close of business on Friday, a backup is made of all databases and call records via the Connor Backup Basics software and a tape drive. Duplicates of each of these backups are made and stored off premises. Lost, damaged, or old records may be restored using the Connor Backup Basics software package.

For the Office Server:

The office server is protected by a Redundant Array of Independent Disks (RAID), which duplicates all data to independent drives. All drives in the RAID are "hot swappable" and can be replaced on the fly. This RAID is in turn backed up each Friday to a high capacity data cartridge. Two backups are made, one of which is stored off premises. All backed up data may be restored via either the RAID or data cartridge.

For desktops:

All pertinent data and files such as documents, spreadsheets or system files are backed up on an individual basis by copying to an Iomega Zip disk, a network folder, or a read/write CD.

RECOVERY PROCEDURES

Central Facilities Recovery Plan

An incident at the central computing/networking facilities in TE may place this plan into action.

An incident may be of the magnitude that the facilities are not usable and alternate site plans are required. In this case, the alternate site portions of this plan must be implemented. It is obvious that all major support sections in Information Technology will need to function together in a disaster, although a specific plan of action is written for each section.

This central support is provided by Compaq VAX systems. The VAX systems are compatible down to the smallest MicroVAX. In a disaster situation, VAX systems can be rapidly shipped, even by airfreight, in a matter of hours. These systems can then be installed across campus.

Administrative Computing

This portion of the disaster/recovery plan will be set into motion for Administrative Computing when an incident has occurred that requires use of the alternate site, or the damage is such that operations can be restored, but only in a degraded mode at the central site in a reasonable time.

It is assumed a disaster has occurred and the administrative recovery plan is to be put in effect. The CIO will make this decision on advice from the Assistant CIO.

In case of either a move to an alternate site, or a plan to continue operations at the main site, the following general steps must be taken:

  1. Determine the extent of the damage and if additional equipment and supplies are needed.
  2. Obtain approval for expenditure of funds to bring in any needed equipment and supplies.
  3. Notify local vendor marketing and/or service representatives if there is a need of immediate delivery of components to bring the computer systems to an operational level even in a degraded mode.
  4. If it is judged advisable, check with third-party vendors to see if a faster delivery schedule can be obtained.
  5. Notify vendor hardware support personnel that a priority should be placed on assistance to add and/or replace any additional components.
  6. Notify vendor systems support personnel that help is needed immediately to begin procedures to restore systems software at WSU.
  7. Order any additional electrical cables needed from suppliers.
  8. Rush order any supplies, forms, or media that may be needed.

In addition to the general steps listed at the beginning of this section, the following additional major tasks must be completed when the alternate site is used:

  1. Notify officials that an alternate site will be needed for an alternate Administrative Computing facility.
  2. Coordinate moving of equipment and support personnel into the alternate site with appropriate personnel.
  3. Bring the Administrative Computing recovery materials from the off-site storage to the alternate site.
  4. As soon as the hardware is up to specifications to run the operating system, load software and run necessary tests.
  5. Determine the priorities of the client software that need to be available and load these packages in order. These priorities often are a factor of the time of the month and semester when the disaster occurs.
  6. Prepare backup materials and return these to the off-site storage area.
  7. Set up Administrative Computing operations in the alternate site.
  8. Coordinate client activities to ensure the most critical jobs are being supported as needed.
  9. As production begins, ensure that periodic backup procedures are being followed and materials are being placed in off-site storage periodically.
  10. Work out plans to ensure all critical Administrative Computing support will be phased in.
  11. Keep administration and clients informed of the status, progress, and problems.
  12. Coordinate the longer range plans with the administration, the alternate site officials, and Administrative Computing staff for time of continuing support and ultimately the restoring of the Administrative Computing section.

DEGRADED OPERATIONS AT CENTRAL SITE

In this event, it is assumed that an incident has occurred but that degraded operations can be set up at Stewart Library room 73. In addition to the general steps that are followed in either case, special steps need to be taken.

  1. Evaluate the extent of the damage, and if only degraded service can be obtained, determine how long it will be before full service can be restored.
  2. Replace hardware as needed to restore service to at least a degraded service.
  3. Perform system installation as needed to restore service. If backup files are needed and are not available from the on-site backup files, they will be transferred from the off-site storage.
  4. Work with the various vendors, as needed, to ensure support in restoring full service.
  5. Keep the administration and clients informed of the status, progress and problems.

Academic Computing

Academic computing resources at the central site provide academic-type services to the university. In addition to some batch support at the central site, the majority of this support is over communications lines directly to clients, departments, and various labs across campus. Some general steps that should be taken in case of a disaster at the central site are listed below.

  1. Determine the extent of the damage and whether additional components can be brought in for present computer systems or whether additional computers need to be brought in.
  2. Obtain approval for expenditures of funds to bring in added equipment as needed.
  3. Notify vendor marketing and/or service officials that additional equipment needs to be shipped, with the highest priority, to WSU.
  4. Notify vendor technical support personnel of the disaster and the need for their assistance.
  5. Determine if there is a need for any additional electrical cables and order these for immediate shipment from suppliers.

Use of Alternative Sites

If the central site is destroyed, support of critical academic computing activities will be given from the alternate sites.

Additional computer systems will be brought in as needed. Some steps necessary in this process are listed.

  1. Determine the priorities of client needs and upgrade computers at the academic labs.
  2. Set up for operations support.
  3. Coordinate installing additional equipment and moving support personnel.
  4. When additional needed equipment is available, move backup materials from the off-site storage area.
  5. Coordinate restoring any communications with Technical Services.
  6. Coordinate client-computing support with clients.
  7. As production begins, ensure that backup procedures are followed and periodic backups are stored off site.
  8. Work with the Director of the Stewart Library, the Provost, and clients in coordinating long-range plans for restoring full support by the Academic Computing section.

Degraded Services from Central Site

If the central academic computing support can be resumed in a reasonable time from the central site, steps will need to be taken immediately to restore these services.

  1. Determine the extent of the damage and set up procedures to bring in any needed added equipment.
  2. Determine priorities of client needs and prepare for running at a degraded level of service.
  3. After the hardware is functioning, perform system installation as needed. If backup files are destroyed at the central site, bring these from the off-site storage area.
  4. If off-site files are used, replace these at the off-site storage as soon as possible.
  5. Work with vendors as needed to ensure support is given to restore full service.
  6. Keep the administration and clients informed of the status, progress and problems.

Network Communications

Redundancy is being built into the computer communications systems. We do not have complete redundancy, but most systems have backup equipment and/or cards.

This plan does not, at this time, address the problem of a need for redundancy in the telephone switch system. Considerable funds will be needed for an alternate plan in this area in case of a major disaster in the university telephone switch.

Since most of the telephone and computer communications lines are in the underground tunnels and in conduits across campus, connecting lines to alternate sites and to critical areas cannot be done rapidly. For example, it is estimated that if WSU Information Technology had to move, it would take 72 hours to restore critical data and voice communications lines.

Some general steps that must be taken in case of a communications disaster at the central site and/or other parts of the communications network are given.

  1. Assess the damage and evaluate the steps needed to restore services.
  2. Assign personnel to disaster crews and assign tasks. The priority of repairs will be set by the Disaster Recovery Coordinator after an evaluation of the critical needs of the University following the disaster.
  3. If present supplies and equipment on hand are not adequate to restore service as needed, obtain approval for funds needed and contact vendors for priority shipment.
  4. Coordinate repairs of data communications disasters affecting specific areas of technology support with the recovery team leader of that area.
  5. Keep the Disaster Recovery Coordinator and team leaders of support areas informed of the extent of the communications damage and recovery procedures being implemented.

A chart of the communications network at WSU is being developed. When it is completed, a copy of this chart will be placed in the off-site storage area and periodically updated.

MICROCOMPUTER RECOVERY PLAN

1.  Individual clients should plan backups as follows:

Daily - This procedure is used to back up all files created each day. This procedure copies these files to a floppy diskette or local tape for backup storage. It can be performed at the end of the day or when a client is through using the computer for the day. These backup diskettes or tapes need to be placed in a locked file cabinet.

Weekly - This procedure is used to back up all files. This procedure will also copy all files to a floppy diskette or local tape for backup storage. This procedure can be performed on any weekday, but should be done consistently once a week on the particular day chosen.

NOTE: It is recommended that each microcomputer workstation retain only one set of daily backups. It is also recommended that two sets of weekly backups be kept.
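One way to use two weekly sets is to alternate between them, so that the set being overwritten this week is never the only weekly copy. The plan does not say how the two sets should be used, so the alternation scheme, the set labels, and the function name below are all assumptions made for illustration.

```python
from datetime import date

# Hypothetical labels for the two weekly backup sets.
WEEKLY_SETS = ("weekly set A", "weekly set B")


def weekly_set_for(day: date) -> str:
    """Pick which weekly set to write during the week containing `day`.

    Weeks are counted Monday to Sunday from a fixed starting point
    (ordinal 1, which is a Monday), so the choice alternates strictly
    from one week to the next, including across year boundaries.
    """
    week_index = (day.toordinal() - 1) // 7
    return WEEKLY_SETS[week_index % 2]
```

With this scheme, any two days in the same week map to the same set, and the set chosen in one week is left untouched during the following week.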

2.  Provide a protective environment for all disks.

Weekly backup disks should be placed in a protective area away from the office. This area needs to be fireproof.

EMERGENCY PROCEDURES

If an incident that will drastically disrupt operations has happened or is imminent, the following steps should be taken to reduce the probability of personal injuries and, if there is no risk to employees, to limit the extent of the damage. Similar steps should be followed, where appropriate, in incidents occurring in a satellite center.

  1. An announcement should be made to evacuate the building, if appropriate, or move to a safe location in the building. As a preparation for a potential disaster, all Information Technology personnel should be aware of the exits available.
  2. If there are injured personnel, ensure their evacuation and call for emergency assistance as needed.
  3. If the computers and air conditioning have not automatically powered down, initiate an orderly shutdown of systems when possible.
  4. When possible and if time is available, set up damage limiting measures.
  5. Designate available personnel to carry out the lockup procedures normally performed at the end of the last shift.

Weber State University
Ogden, Utah 84408