
April 14, 2010

Troubleshooting hung backups

Even though backup resources are available, backups are taking hours to go active and complete. However, if they are canceled and re-submitted, they take off and complete right away.

Details:
Overview:
Jobs remain queued while resources are available because of the length of time nbrb takes to complete an evaluation cycle.

Troubleshooting/Log Files:
Determine if there is an excessive amount of time taken between nbjm requesting a resource and nbrb granting the resource.

For example:
1. From the Activity Monitor within the NetBackup GUI, view the job details.  In this particular instance, it took 10 minutes to grant a resource:
8/21/2007 2:58:44 PM - requesting resource figrin.NBU_POLICY.MAXJOBS.sybase-arvel1-SDS_SY48-sybsecurity-DB
8/21/2007 3:08:31 PM - granted resource figrin.NBU_CLIENT.MAXJOBS.arvel1

2. In the nbrb log, look at the amount of time between "evaluation cycle is in progress" and "reserved resource" for a particular jobid:
08/21/07 20:58:13.512 [Debug] NB 51216 nbrb 118 PID:16608 TID:3 File ID:118 [jobid=9180586] 2 [ResBroker_i::evaluateNow] evaluation cycle is in progress
08/21/07 21:08:30.706 [Debug] NB 51216 nbrb 118 PID:16608 TID:9 File ID:118 [jobid=9180586] 3 [CountedProvider::allocate] reserved resource figrin.NBU_CLIENT.MAXJOBS.arvel1 (current_ref_count=2, max_ref_count=10)

To view the nbrb log in readable form, use this command:
/usr/openv/netbackup/bin/vxlogview -p 51216 -o 118 -d all

3. Search the raw nbrb log (/usr/openv/logs/*118*) for the "msec" string:
For example:
51216-118-1067800690-070912-0000000011.log:0,51216,118,118,34289634,1189573603242,17599,9,0:,83:Resource evaluation completed. Evaluated 779 requests, evaluation time: 481417 msec,25:ResBroker_i::doEvaluation,2
51216-118-1067800690-070912-0000000024.log:0,51216,118,118,34308513,1189574183188,17599,10,0:,83:Resource evaluation completed. Evaluated 763 requests, evaluation time: 579944 msec,25:ResBroker_i::doEvaluation,2
51216-118-1067800690-070912-0000000040.log:0,51216,118,118,34327216,1189574666005,17599,9,0:,83:Resource evaluation completed. Evaluated 916 requests, evaluation time: 482815 msec,25:ResBroker_i::doEvaluation,2
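The measurements in steps 2 and 3 can be scripted. The following is a minimal sketch, assuming a POSIX shell with sed and awk, and assuming both timestamps fall on the same day. The helper names elapsed_secs and eval_cycle_times are hypothetical and are not NetBackup commands.

```shell
# elapsed_secs START END
# Print the whole seconds between two HH:MM:SS[.mmm] times on the same day.
# Hypothetical helper for comparing the two nbrb log timestamps in step 2.
elapsed_secs() {
  awk -v s="$1" -v e="$2" 'BEGIN {
    split(s, a, ":"); split(e, b, ":")
    printf "%d\n", (b[1]*3600 + b[2]*60 + b[3]) - (a[1]*3600 + a[2]*60 + a[3])
  }'
}

# eval_cycle_times
# Read raw nbrb log lines on stdin and print one summary line per
# "Resource evaluation completed" entry (hypothetical helper for step 3).
eval_cycle_times() {
  sed -n 's/.*Evaluated \([0-9][0-9]*\) requests, evaluation time: \([0-9][0-9]*\) msec.*/\1 \2/p' |
    awk '{ printf "%d requests, %d msec (%.1f min)\n", $1, $2, $2 / 60000 }'
}

# Example use (paths as given in step 3):
#   elapsed_secs 20:58:13.512 21:08:30.706
#   grep -h msec /usr/openv/logs/*118* | eval_cycle_times
```

With the sample lines above, the first call reports 617 seconds for jobid 9180586, and the three evaluation cycles come out at roughly 8 to 10 minutes each.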

Resolution:
A number of steps were taken to reduce the time taken to grant resources:

Note: All configuration files listed below can be found in the following subdirectory:
UNIX: /usr/openv/var/global
Windows: \Veritas\NetBackup\var\global\

1. If the master server has more than ten CPUs, it is necessary to increase the number of CPUs available to Sybase. See TechNote 285579, linked below.

2. Enable additional database connections via emm.conf changes.

For example, the following lines could be added or edited:
NUM_DB_CONNECTIONS=21
NUM_DB_BROWSE_CONNECTIONS=20
NUM_ORB_THREADS=35

See TechNote 285629, linked below.

Note: In NetBackup 6.0, the server.conf file must also be changed to allow more than 10 connections to the ASA database. To do this, edit the server.conf file and change the value of the -gn parameter to 30.

In NetBackup 6.5, the -gn parameter does not exist.

3. Starting in NetBackup 6.0 MP6 and 6.5.2, nbrb can be tuned via the nbrb.conf file.

This file offers the following parameters:
  • SECONDS_FOR_EVAL_LOOP_RELEASE
  • RESPECT_REQUEST_PRIORITY
  • DO_INTERMITTENT_UNLOADS
Note: By default all values are 0.

Example format for parameters in nbrb.conf:
SECONDS_FOR_EVAL_LOOP_RELEASE = 180
RESPECT_REQUEST_PRIORITY = 0
DO_INTERMITTENT_UNLOADS = 1

An explanation of nbrb.conf parameters can be found in TechNote 300442, linked below.

4. VERY IMPORTANT for nbrb caching: do not mix SSO and dedicated drives within a library.  Have all drives SSO or all drives dedicated.

5. Do not set max concurrent drives for storage units higher than the total number of drives actually available.
For example: if the library has 8 drives, the total max concurrent drives value can be at most 8.

6. Ensure that no drives need cleaning (CLEANING REQUIRED flag set). Clean any drives that do, then run tpclean -M on each drive to clear the flag if it is still set.

7. If using custom scripts or Aptare, ensure that scripts are not accessing media servers for drive/tape information. All information is now stored on the EMM server.

8. On HP-UX, improve file caching by increasing the kernel parameters dbc_max_pct to 50 and dbc_min_pct to 10.

9. Move the database transaction log and database to different disks:
For example:
If two additional disks (/u01 and /u02) are available:
# mkdir /u01/nbu
# mkdir /u02/nbu
# /usr/openv/netbackup/bin/goodies/netbackup stop
# /usr/openv/db/bin/nbdbms_start_server
# /usr/openv/db/bin/nbdb_move -data /u01/nbu -index /u01/nbu -tlog /u02/nbu
# /usr/openv/netbackup/bin/goodies/netbackup start
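
Before editing emm.conf or nbrb.conf (steps 2 and 3), it can help to record the values currently in place. The following is a minimal sketch, assuming the files use the KEY=VALUE and KEY = VALUE layouts shown in the examples above. The helper name conf_value is hypothetical and is not part of NetBackup.

```shell
# conf_value FILE KEY
# Print the value assigned to KEY in FILE, one line per matching assignment.
# Hypothetical helper; tolerates optional spaces or tabs around "=".
conf_value() {
  awk -F= -v key="$2" '
    {
      k = $1; v = $2
      gsub(/^[ \t]+|[ \t]+$/, "", k)   # trim the key
      gsub(/^[ \t]+|[ \t]+$/, "", v)   # trim the value
      if (k == key) print v
    }
  ' "$1"
}

# Example use (UNIX path from the note above):
#   conf_value /usr/openv/var/global/emm.conf NUM_DB_CONNECTIONS
#   conf_value /usr/openv/var/global/nbrb.conf SECONDS_FOR_EVAL_LOOP_RELEASE
```

An empty result means the parameter is not set in the file.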
