Symantec NetBackup Blog: April 2010

April 14, 2010

NetApp NDMP restore issue

Canceled NDMP restores on a NetApp NAS device will zero out the files destined to be restored.

If an NDMP restore on a NetApp NAS device is initiated and then canceled, the files destined to be overwritten by the restore will be zeroed out. This behavior is normal and is not specific to NetBackup; rather, it is part of the functionality of the NetApp NAS device.

The following is an excerpt from the NetApp Data ONTAP manual documenting the flow of a restore:
1 Data ONTAP catalogs the files that need to be extracted from the tape.
2 Data ONTAP creates directories and empty files.
3 Data ONTAP reads a file from tape, writes it to disk, and sets the permissions (including ACLs) on it.
4 Data ONTAP repeats steps 2 and 3 until all the specified files are copied from the tape.
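
The effect of this ordering on a canceled restore can be illustrated with a toy simulation. This is only a sketch: the dictionaries standing in for the filer's disk and the tape, and the names simulate_restore and cancel_after, are hypothetical and are not part of Data ONTAP or NetBackup. It also simplifies the flow by creating all empty files before copying any data.

```python
def simulate_restore(disk, tape, cancel_after=None):
    """Toy model of the restore flow: create empty files, then copy data.

    disk and tape map file paths to bytes. cancel_after is the number of
    files fully copied before the restore is canceled (None = run to the end).
    """
    # Step 2: empty files are created, replacing whatever was on disk.
    for path in tape:
        disk[path] = b""
    # Step 3: files are read from tape and written to disk one at a time.
    for copied, path in enumerate(tape):
        if cancel_after is not None and copied >= cancel_after:
            break  # restore canceled: the remaining files stay empty
        disk[path] = tape[path]
    return disk


disk = {"/vol/a.txt": b"current a", "/vol/b.txt": b"current b"}
tape = {"/vol/a.txt": b"backup a", "/vol/b.txt": b"backup b"}
simulate_restore(disk, tape, cancel_after=1)
print(disk)  # {'/vol/a.txt': b'backup a', '/vol/b.txt': b''}
```

In this model the second file is left empty because the restore was canceled after its placeholder was created but before its data was copied, which matches the behavior described above.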

There are no error messages in any log files, as all products are working as designed. Note that any canceled NDMP restore can have unexpected results, including data being overwritten with zeros.

If these actions are being performed to determine the tapes necessary for the restore, use the bpimagelist command with the -preview option to list the necessary media.

The -d and -e options of bpimagelist specify the start and end dates (mm/dd/yy format), and the -U option formats the output in a user-friendly mode:

C:\VERITAS\NetBackup\bin\admincmd>bpimagelist -d 09/20/05 -e 09/21/05 -U
Backed Up        Expires       Files       KB C Sched Type   Policy
---------------- ---------- -------- -------- - ------------ ------------
09/21/2005 09:29 10/05/2005        3      224 N User Backup  server01_windows
09/20/2005 14:59 10/04/2005        3      224 N Full Backup  server01_windows
09/09/2005 16:26 09/23/2005        5     2016 N User Backup  server01_sql
09/09/2005 16:26 09/23/2005        5     3296 N User Backup  server01_sql
09/09/2005 16:25 09/23/2005        5    12064 N User Backup  server01_sql
09/09/2005 16:25 09/23/2005        5     1376 N User Backup  server01_sql
09/09/2005 16:24 09/23/2005        5    16480 N User Backup  server01_sql
09/09/2005 16:24 09/23/2005        0        0 N Full Backup  server01_sql
09/09/2005 14:29 09/23/2005        4      226 N Full Backup  server01_windows

Then, specifying the same command with the -preview option at the end indicates the media IDs used for the backup:

C:\VERITAS\NetBackup\bin\admincmd>bpimagelist -d 09/20/05 -e 09/21/05 -U -preview
CD3643
CD3643
c:\veritas\netbackup\disk_storage_01\server01_1126301216_C1_F1
c:\veritas\netbackup\disk_storage_01\server01_1126301182_C1_F1
c:\veritas\netbackup\disk_storage_01\server01_1126301148_C1_F1
c:\veritas\netbackup\disk_storage_01\server01_1126301114_C1_F1
c:\veritas\netbackup\disk_storage_01\server01_1126301080_C1_F1
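
Because the -preview output repeats a media ID once per image, it can help to reduce it to a unique list. The following is a minimal sketch, assuming the output looks like the sample above; the function name split_preview_output is hypothetical, and the rule used to tell tape media IDs from disk image paths (the presence of a path separator) is an assumption, not documented NetBackup behavior.

```python
def split_preview_output(text):
    """Split `bpimagelist ... -preview` output into media IDs and disk paths.

    Assumption: a line containing a path separator is a disk image path and
    any other non-empty line is a tape media ID. Duplicates are dropped and
    the original order is kept.
    """
    media_ids, disk_paths = [], []
    for line in text.splitlines():
        entry = line.strip()
        if not entry:
            continue
        target = disk_paths if ("\\" in entry or "/" in entry) else media_ids
        if entry not in target:
            target.append(entry)
    return media_ids, disk_paths


sample = r"""CD3643
CD3643
c:\veritas\netbackup\disk_storage_01\server01_1126301216_C1_F1
c:\veritas\netbackup\disk_storage_01\server01_1126301182_C1_F1
"""
media_ids, disk_paths = split_preview_output(sample)
print(media_ids)        # ['CD3643']
print(len(disk_paths))  # 2
```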

Import Disk Images

How to import disk images into VERITAS NetBackup (tm) 6.0

Modification:
Starting with NetBackup 6.0, disk-based images can be imported into NetBackup. The process for importing disk images is essentially the same as importing images from tape. The NetBackup Administration Console has additional entries during the "phase 1 import" to allow selecting a disk path instead of a Media ID.

In order to import disk-based images, the images that were written to a disk storage unit must be present on the server. Copy the disk images to a directory on the server, or restore from a backup that contains the images from a disk storage unit. Once the files reside in a directory on the server, the NetBackup Administration Console or command line can be used to import the disk images.

Importing NetBackup disk images using the NetBackup Administration Console
First, run the "phase 1 import".
1. Start the NetBackup Administration Console and go to the Catalog section.
2. From the menu select Actions > Initiate Import and a dialog box will appear.
3. Select the "The images to be imported are on disk" radio button and enter a disk path to import images from.
4. Check the Activity Monitor for progress of the phase one import.

Figure 1 shows the new Initialize Import screen.

Once the "phase 1 import" completes, run the "phase 2 import".
1. Start the NetBackup Administration Console and go to the Catalog section.
2. Under Pathname:, enter the disk path used for the phase 2 import.
3. Select a date range for the images to import and click the Search button.
4. Select Backup IDs to import from the search results.
5. From the menu select Actions > Import to begin the phase 2 import.
6. Check the Activity Monitor for progress of the phase 2 import.

Once the phase 2 import completes successfully, the disk images will be available for restores.

Importing NetBackup disk images using the command line

The command line utility bpimport can also be used to import images from a disk storage unit. This will use the "-id" option that was previously used to pass the media ID to use for the import. However, the "-id" option now supports passing either a disk path or a media ID.

To start a phase 1 import from the command line run the command:

# cd /usr/openv/netbackup/bin/admincmd

# ./bpimport -create_db_info -id <disk_path> -L /usr/tmp/phase1.log

Replace <disk_path> with the disk path to use for the import. Then check the /usr/tmp/phase1.log file to monitor the progress of the phase 1 import.

To start a phase 2 import from the command line run the command:

# cd /usr/openv/netbackup/bin/admincmd

# ./bpimport -id <disk_path> -s <start_date> -e <end_date> -L /usr/tmp/phase2.log

Replace <disk_path> with the disk path to use for the import, and <start_date> and <end_date> with the date range for the images to import. The format to use for the start and end dates is 'MM/DD/YYYY hh:mm:ss'. Then check the /usr/tmp/phase2.log file to monitor the progress of the phase 2 import.

Once the phase 2 import completes successfully the disk images will be available for restores.
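
The two command lines above can be assembled programmatically, which avoids mistakes in the date format. This is a sketch under stated assumptions: it only builds argument lists and does not run anything, the function names phase1_command and phase2_command are hypothetical, and /disk_images is an example path. It uses only the bpimport options named in this note (-create_db_info, -id, -s, -e, -L).

```python
from datetime import datetime

BPIMPORT = "/usr/openv/netbackup/bin/admincmd/bpimport"
DATE_FORMAT = "%m/%d/%Y %H:%M:%S"  # 'MM/DD/YYYY hh:mm:ss'


def phase1_command(disk_path, log_path="/usr/tmp/phase1.log"):
    """Argument list for a phase 1 import of images under disk_path."""
    return [BPIMPORT, "-create_db_info", "-id", disk_path, "-L", log_path]


def phase2_command(disk_path, start, end, log_path="/usr/tmp/phase2.log"):
    """Argument list for a phase 2 import limited to a date range."""
    if end < start:
        raise ValueError("end date precedes start date")
    return [BPIMPORT, "-id", disk_path,
            "-s", start.strftime(DATE_FORMAT),
            "-e", end.strftime(DATE_FORMAT),
            "-L", log_path]


# /disk_images is a hypothetical example path.
print(phase1_command("/disk_images"))
print(phase2_command("/disk_images",
                     datetime(2010, 4, 1),
                     datetime(2010, 4, 14, 23, 59, 59)))
```

The resulting lists can be passed to subprocess.run on the server once the paths and dates have been reviewed.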

Notes

Disk images from a NetBackup 5.x server cannot be imported into NetBackup 6.0. This is due to a difference in the way disk images are handled in NetBackup 6.0. A phase 1 import of a directory that contains NetBackup 5.x disk images will fail with the message "INF - No importable NetBackup disk images found in path ". The Activity Monitor will show a Status 0 for the phase 1 import, even though no disk images are processed.

Imports from a read-only filesystem are not supported. Attempts to import from a read-only filesystem will result in an "invalid disk header file" error.

How to correctly decommission a NetBackup Media Server and remove it from the NetBackup environment

Modification:
Before removing a media server from the NetBackup environment, several activities must be completed, or restores will require the import of tapes, which is a time-consuming process.

These activities include:

Determining which media need to be associated with a new media server
Updating the internal NetBackup database references to any tapes with active images that are moved to a new media server
Removing the media server from the NetBackup environment
Updating all references involving storage units, policies, and volume groups and pools
Changing the configuration so NetBackup does not continue to recognize the old system as a valid media server

Perform the following steps to decommission a media server:

1. On the media server being decommissioned, determine which tapes have images that are not expired. This is done by running the bpmedialist command with the following options (Note: Use -l to get one line of output per tape):

bpmedialist -mlist -l -h <media_server>

2. Select another media server (or the master server, if it is also a media server) to inherit the tapes from the decommissioned media server. Then the NetBackup internal databases (the mediaDB and the images database) need to be updated to reflect the new media server assignment. Since the mediaDB databases are binary, they can only be updated by the use of the bpmedia command. To update these databases, execute the bpmedia command with the following options for each tape identified in step 1:

bpmedia -movedb -ev <media_id> -oldserver <old_server> -newserver <new_server>

3. Move all tapes in all robots attached to the decommissioned media server to non-robotic status (standalone). This is done through the NetBackup Administrative Console. Under Media and Device management, select Media.

Next, select the media to be moved, right-click on it, and select Move.

In the resulting dialog box, specify Standalone.

Multiple media can be selected using the Shift or Ctrl keys.

4. After moving all tapes in all robots associated with the decommissioned media server to standalone, use the Administrative Console to delete the tape drives and then the robots from the decommissioned media server.

5. Once the tape drives and robots are deleted, delete all storage units associated with all robots on the decommissioned media server through the Administrative Console.

6. If any robots from the decommissioned media server are being moved to other media servers, power down the affected servers, make any cabling changes required, and physically attach the robots to the new media servers. Once the robots are recognized by the operating systems on the new media servers, add the robots and tape drives to those media servers through the Administrative Console. Then, in the NetBackup Administrative Console, create the appropriate storage units. Next, physically load the tapes from the decommissioned media server into the correct robot. Finally, inventory all new robots, or robots with new tapes. This updates NetBackup with the new location of all tapes in those robots.

7. Modify any policy that explicitly specifies any of the storage units on the decommissioned media server. These policies must point to any other defined storage unit in the NetBackup configuration or to Any Available, as appropriate.

8. Update the bp.conf and vm.conf files on the master server and all media servers (or the servers list on a Windows master or media server) in the NetBackup environment to remove any reference to the decommissioned media server. Also update the server list on all clients to no longer include a reference to the decommissioned media server. Cycle the NetBackup daemons/services on any system where these files are modified.
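
Step 2 requires one bpmedia command per tape, which is tedious by hand when many tapes are involved. The following sketch generates those command lines from a list of media IDs. It assumes that -ev takes the media ID and that -oldserver and -newserver take host names; the function name movedb_commands, the media IDs, and the host names are hypothetical examples. It only builds the commands and does not run them.

```python
def movedb_commands(media_ids, old_server, new_server):
    """Build one `bpmedia -movedb` argument list per unique media ID."""
    commands = []
    for media_id in dict.fromkeys(media_ids):  # drop duplicates, keep order
        commands.append(["bpmedia", "-movedb", "-ev", media_id,
                         "-oldserver", old_server,
                         "-newserver", new_server])
    return commands


# A00001, A00002, oldmedia01 and master01 are made-up example values.
for command in movedb_commands(["A00001", "A00002", "A00001"],
                               "oldmedia01", "master01"):
    print(" ".join(command))
```

Reviewing the printed commands before running them makes it easier to confirm that every tape identified in step 1 is covered exactly once.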

Creating catarc policy

How to create a Catalog Archiving (catarc) policy

Modification:

The following steps can be taken to create a catalog archiving policy. There are several features that are specific to a catalog archiving policy.

For example:

The policy must be named catarc.
The policy must also be inactive.
The catalog archiving commands must be done manually. The administrator must specify a list of files to archive using the bpcatlist command. As a result, there isn't a way to run scheduled catalog archiving backups.

To create the policy in the GUI, do the following:
1. Go to the Policies tab in the GUI and create a new policy
2. For the Policy name: enter catarc
3. For the Policy type: enter Standard
4. Select the storage unit and volume pool you want to use for catalog archiving
5. Clear the Active. Go into effect at: check box. The policy is not run by the scheduler. Rather the bpcatarc command activates this special policy to perform a catalog backup job, then deactivates the policy after the job is done.
6. Accept the defaults for all the other options in the Attributes tab of the policy

Next, create a schedule for the catarc policy.
7. Go to the Schedules tab of the policy
8. Create a new schedule. You can enter any name for the schedule. Only the policy name must be catarc. It is valid to name the schedule catarc as well.
9. Set the Type of backup: to be User Backup
10. Select a retention level for the catalog archive. This must be at least as long as the longest retention period being archived.
Note: Failure to set the retention level of the catalog archive for a time at least as long as the longest retention period of the backups being archived can result in the loss of catalog data.
11. Accept the defaults for all the other options in the Schedule tab of the policy
12. Go to the Start Window tab in the schedule. Enter a backup window for when the catalog archiving commands can be run by the administrator.

Finally, complete the Backup Selections and Clients tab entries.
13. Go to the Files/Backup Selections tab of the policy
14. Select New to add the following path to the backup: /usr/openv/netbackup/db/images
15. Go to the Clients tab of the policy
16. Add the master server to the client list and select the correct hardware/OS type

Save the new policy and exit the GUI. The catarc policy is now available for catalog archiving.

Catalog Archiving process

Overview of Catalog Archiving process

Modification:

The following is an overview of the catalog archiving process. Catalog archiving is meant to be used to save disk space used by images with infinite or multi-year retention periods. It can be used for any image that will most likely not be restored, but must remain in the image catalog. It should not be used as a method to reclaim disk space when a catalog filesystem fills up. For those cases, investigate catalog compression or adding disk space and growing the filesystem.

Warning: Catalog archiving modifies existing catalog images. As a result, it should never be run when the catalog filesystem is 100% full. Entries are added to the header files in /usr/openv/netbackup/db/images. If the filesystem is at 100%, it is impossible to predict what would happen.

Note: There is no simple method to determine what tape the catalog has been archived to. The bpcatlist -offline command is the only administrative command to determine what images have been archived. This command does not list what tape was used for the archive. As a result, caution should be exercised to ensure the tapes used for catalog archiving are available for restoring the archived catalog images. Either create a separate volume pool to use exclusively for catalog archives or find a method to label the tape as a catalog archive tape.

Step 1: Use bpcatlist to determine what image files will be archived.

Before attempting to run bpcatarc or bpcatrm use the bpcatlist command to display what images are available for archiving. Running bpcatlist alone will not modify any catalog images. Only when the bpcatlist output is piped to bpcatarc and bpcatrm will the images be modified and the image .f files removed.

To determine what images are available for catalog archiving, run the following command:
# /usr/openv/netbackup/bin/admincmd/bpcatlist -online

To determine what images have been previously archived run the following command:
# /usr/openv/netbackup/bin/admincmd/bpcatlist -offline

Note: This should return "no entity was found" if catalog archiving has not been previously run. For more information on what the fields in the bpcatlist output indicate, refer to TechNote 273999.

To display all images for a specific client prior to January 1st, 2000 run the command:
# bpcatlist -client <client_name> -before Jan 1 2000

To display the help for the bpcatlist command run the command:
# bpcatlist -help

Once the bpcatlist output correctly lists all the images that are to be archived, then the archive itself can be run.

Step 2: Running the catalog archive.

Before running the catalog archive, create a catarc schedule. This is required in order for the bpcatarc command to successfully process images. Refer to the 5.x System Administrator's Guide or TechNote 274028 for more details on creating a catarc policy. When initiated, the catalog archive will create a Job ID for each time the catalog archiving is run.

To run the catalog archive, run the bpcatlist command with the same options used in step 1 to display images. Then just pipe the output through bpcatarc and bpcatrm.

For example:
# bpcatlist -client all -before Jan 1 2000 | bpcatarc | bpcatrm

A Job ID will appear in the activity monitor. The command will wait until the backup completes before returning the prompt. It will report an error only if the catalog archive fails. Otherwise the commands will simply return to the prompt. The File List: section of the Job Details in the Activity Monitor will show a list of image files that have been processed. When the job completes with a status 0, the bpcatrm will remove the corresponding .f files. If the job fails, then no catalog .f files will be removed.

Step 3: Restoring the Catalog archive

To restore the catalog archive, first use the bpcatlist command to list the files that need to be restored. Once bpcatlist displays the proper files to restore, bpcatres can be run to restore the actual files.

To restore all the archived files from Step 2 above, run the command:
# bpcatlist -client all -before Jan 1 2000 | bpcatres

This will restore all the catalog archive files prior to Jan 1, 2000.
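
Because the archive and restore pipelines must use the same bpcatlist selection, it can be safer to generate both from one set of options. The sketch below only composes the shell pipelines as strings so they can be reviewed before being run. The function names are hypothetical, and it uses only the bpcatlist options shown in this note (-client and -before).

```python
import shlex


def bpcatlist_command(client, before):
    """bpcatlist arguments; the date is passed as separate words."""
    return ["bpcatlist", "-client", client, "-before", *before.split()]


def archive_pipeline(client, before):
    """Pipeline that archives the selected images and removes their .f files."""
    listing = shlex.join(bpcatlist_command(client, before))
    return " | ".join([listing, "bpcatarc", "bpcatrm"])


def restore_pipeline(client, before):
    """Pipeline that restores the same selection from the catalog archive."""
    listing = shlex.join(bpcatlist_command(client, before))
    return " | ".join([listing, "bpcatres"])


print(archive_pipeline("all", "Jan 1 2000"))
# bpcatlist -client all -before Jan 1 2000 | bpcatarc | bpcatrm
print(restore_pipeline("all", "Jan 1 2000"))
# bpcatlist -client all -before Jan 1 2000 | bpcatres
```

Running the bpcatlist part on its own first, as described in Step 1, remains the way to confirm the selection before any images are modified.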

Some miscellaneous notes about catalog archiving:

In the /usr/openv/netbackup/db/images// directory, a header and files file will be created for the catalog archive job.
The header file will be named: catarc__UBAK
The files file will be named: catarc__UBAK.f

Do not attempt to archive the catarc__UBAK image files. These are not archived by default. Attempting to archive catalog archives will make it nearly impossible to determine what files are needed for restoring catalog entries. The catarc__UBAK image files contain a listing of what catalog images were archived and need to be present and intact on the master in order to do a catalog restore.

The catalog archive images will also appear in the Backup, Archive and Restore GUI. The catarc policy is of type Standard so it will display the catalog archive backups along with the regular filesystem backups. However, this is not the correct method to restore archived files. Catalog archive files should be restored using the bpcatlist | bpcatres commands.

Running bpcatlist | bpcatarc | bpcatrm as listed on page 229 of the NetBackup 5.0 System Administrator's Guide 1 will archive the entire NetBackup catalog. To recover from this, run bpcatlist | bpcatres to restore all archived images. Then work with the bpcatlist command to determine what options are needed to archive only the desired images.

Some recommendations for catalog archiving:
Perform catalog archiving operations when NetBackup is in a quiet state. It is unknown what would happen to an active backup if bpcatlist | bpcatarc | bpcatrm was running during an active backup.

To ensure that catalog backup images are not on the same tapes as user backups, create a separate media pool for catalog archives.

You may find it useful to set up, then designate, a special retention level for catalog archive images. To specify retention levels, go to Host Properties | Master Server | Retention Periods or see "Retention Periods Properties" on page 369 of the NetBackup 5.0 System Administrator's Guide 1.

Configure Hot Catalog Backups

How to configure Hot Catalog Backups with Veritas NetBackup (tm) 6.0.

Details:

Backing up the NetBackup catalogs is essential to proper maintenance of a NetBackup environment. To aid in this practice, NetBackup 6.0 introduces the concept of Hot Catalog Backups. Hot Catalog Backups do not have the limitation of needing to run when no other backup is running, so this greatly simplifies the scheduling of catalog backups.

Hot Catalog Backups are configured on the NetBackup Master Server. They automatically back up the catalog data for all necessary servers in the NetBackup environment, as well as the Disaster Recovery Information file that is created. All of these components are necessary to completely recover NetBackup catalogs.

To configure Hot Catalog Backups:
1. From the NetBackup Administration Console screen, select Configure the Catalog Backup.
2. In the first screen of the NetBackup Catalog Backup Wizard, click Next.
3. On the Catalog Backup Method screen, verify that Online, hot catalog backup is selected and click Next.
4. On the NetBackup Catalog Backup Policy screen, verify that Create a new catalog backup policy is selected, click Next.
5. On the Policy Name and Type screen, enter a valid Policy name and click Next.
6. On the Backup Type screen, select to do a Full Backup. If desired, select to perform an Incremental Backup, either Incremental or Differential. Click Next.
7. On the Rotation screen, set the desired rotation frequency and click Next.
8. On the Start Window screen, select the desired Start Window. Because these are Hot Catalog backups, the Start Window does not need to be outside of the Start Window for normal backups. Click Next.
9. On the Catalog Disaster Recovery Method screen, select whether to use Disk and Tape, Disk only, or Tape only. If using disk, provide an appropriate path. This path is usually a location other than where NetBackup is installed. Provide Logon and Password information if necessary. For tape, provide an appropriate Media ID and Media density. Click Next.
10. On the E-mail Catalog Disaster Recovery Information screen, choose either Yes or No to have the Disaster Recovery Information stored via email. Provide an appropriate E-mail address and click Next.
11. To save the Hot Catalog Backup policy, click Finish.
12. After NetBackup finishes adding the policy, verify it appears on the NetBackup Catalog Backup Policy screen. This time, verify that Create a new catalog backup policy is not selected and click Next.
13. You have now finished the Catalog backup configuration process. Click Finish to exit the wizard.

The policy will now appear in the Policies dialog of the NetBackup Administration Console. It can be modified similar to any other policy.

NBDB.log corruption

The NBDB database transaction log (NBDB.log) may become too large or corrupt for proper NetBackup operation.

Overview:

It is possible for the transaction log for the ASA database that runs under NetBackup to become corrupt or too large for proper operation. For example, in a file system that limits file size to 2 GB, the transaction log will be truncated and corrupt once it reaches the 2 GB limit. The transaction log is truncated during online catalog backups, so the log can grow too large if catalog backups have not been taking place.
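
A simple size check can warn before the log reaches such a limit. This is a minimal sketch, not a NetBackup feature: the function name log_near_limit and the 90% warning fraction are assumptions, and the demonstration uses a small temporary file instead of the real /usr/openv/db/data/NBDB.log.

```python
import os
import tempfile

TWO_GIB = 2 * 1024 ** 3  # file size limit used in the example above


def log_near_limit(path, limit_bytes=TWO_GIB, fraction=0.9):
    """Return True when the file at path has reached fraction of limit_bytes."""
    return os.path.getsize(path) >= limit_bytes * fraction


# Demonstration with a 95-byte temporary file and artificially small limits.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b"x" * 95)
    demo_path = handle.name

near = log_near_limit(demo_path, limit_bytes=100)   # 95 >= 90
far = log_near_limit(demo_path, limit_bytes=1000)   # 95 < 900
os.remove(demo_path)
print(near, far)  # True False
```

On a master server the same function could be pointed at the transaction log path and run from a scheduled job, with the limit set to whatever the file system actually enforces.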

When the ASA database fails due to transaction log problems, the symptoms can include:

- Catalog backups and regular backups can hang.

- Operator is unable to cancel queued jobs.

- The EMM database does not come back up after restarting NetBackup.

Troubleshooting:
Running the command "/usr/openv/db/bin/nbdb_ping" returns:
Database [NBDB] is not available.

Log Files:
There may be errors reported by the ASA database engine in the /usr/openv/db/log/server.log file as well. These errors would look similar to the following:
Error: Cannot open transaction log file -- Can't use log file "/usr/openv/db/data/NBDB.log" since it is shorter than expected

Resolution:
Rebuild the transaction log via the following steps:

1. Stop NetBackup if it is not already stopped:

/usr/openv/netbackup/bin/goodies/netbackup stop

2. Remove NBDB and BMRDB from auto_start option

/usr/openv/db/bin/nbdb_admin -auto_start NONE

3. Start the ASA database engine

/usr/openv/db/bin/nbdbms_start_server

4. To add the necessary environment variables to your shell, run the command:

. /usr/openv/db/vxdbms_env.sh

(please make note of the period at the beginning of the line).

5. Change directories to where the NBDB resides:

cd /usr/openv/db/data

6. Remove or rename the bad transaction log:

mv NBDB.log NBDB.log.bad

7. To force database recovery, while still in the /usr/openv/db/data directory, run the command (please note, the NBDB.log will not be created until after step 10):

/usr/openv/db/bin/dbeng9 -f NBDB

8. Stop the database by running the command:

/usr/openv/db/bin/nbdbms_start_server -stop

9. Add NBDB and BMRDB to auto_start option (Add BMRDB only if using BMR)

/usr/openv/db/bin/nbdb_admin -auto_start NBDB

/usr/openv/db/bin/nbdb_admin -auto_start BMRDB

10. To start the database, run the command:

/usr/openv/db/bin/nbdbms_start_server

11. To test if the database is operational, run the command:

/usr/openv/db/bin/nbdb_ping

which should return something similar to the following:

Database [NBDB] is alive and well on server [NB_master1].
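
For monitoring scripts, the two nbdb_ping messages quoted in this note can be told apart with a small parser. This is a sketch that assumes the output matches those two messages exactly; the function name parse_nbdb_ping is hypothetical, and other output formats are reported as unrecognized.

```python
import re

_ALIVE = re.compile(
    r"Database \[(?P<db>[^\]]+)\] is alive and well on server \[(?P<server>[^\]]+)\]")
_DOWN = re.compile(r"Database \[(?P<db>[^\]]+)\] is not available")


def parse_nbdb_ping(output):
    """Return (database, is_alive, server), or None if the text is unrecognized."""
    match = _ALIVE.search(output)
    if match:
        return (match["db"], True, match["server"])
    match = _DOWN.search(output)
    if match:
        return (match["db"], False, None)
    return None


print(parse_nbdb_ping("Database [NBDB] is alive and well on server [NB_master1]."))
# ('NBDB', True, 'NB_master1')
print(parse_nbdb_ping("Database [NBDB] is not available."))
# ('NBDB', False, None)
```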

Updating replaced tape drive in NBU

How to update NetBackup for a replaced tape drive without deleting and re-adding the drive

Details:

Manual: Veritas NetBackup (tm) 6.0 Commands for UNIX, Pages: 420-423
Modification Type: Supplement

Modification:
When a tape drive is replaced, the Veritas NetBackup (tm) configuration needs to be updated to reflect the changed drive.

Follow this procedure:

To swap a shared serialized drive or to update drive firmware on a shared drive
1 Down the drive. In the Device Monitor, select the drive to swap or update. From the Actions menu, select Down Drive.

2 Replace the drive or physically update the firmware for the drive. If you replace the drive, specify the same SCSI ID for the new drive as the old drive.

3 To produce a list of new and missing hardware, run tpautoconf -report_disc on one of the reconfigured servers. This command scans for new hardware and produces a report that shows the new and the replaced hardware.

4 Ensure that all servers that share the new hardware are up and that all NetBackup services are active.

5 Run tpautoconf with the -replace_drive drive_name -path path_name options or the -replace_robot robot_number -path robot_path options. The tpautoconf command reads the serial number from the new hardware device and then updates the EMM database.

6 If the new device is an unserialized drive, run the device configuration wizard on all servers that share the drive. If the new device is a robot, run the device configuration wizard on the server that is the robot control host.

7 Up the drive. In the Device Monitor, select the new drive. From the Actions menu, select Up Drive.

Here is an example. Make sure the drive is down before running tpautoconf -replace_drive; otherwise, the configuration could revert to the old drive information:

Once the drive is replaced, run the following command to report the discrepancy:

/usr/openv/volmgr/bin/tpautoconf -report_disc

This produces information similar to the following:

======================= New Device (Tape) ============
Inquiry = "QUANTUM DLT7000 245F"
Serial Number = PXA51S3232
Drive Path = /dev/rmt/21cbn
Found as TLD(0), Drive = 1
===================== Missing Device (Drive) =========
Drive Name = QUANTUMDLT70001
Drive Path = /dev/rmt/11cbn
Inquiry = "QUANTUM DLT7000 245F"
Serial Number = PXA51S3587
TLD(0) definition, Drive = 1
Hosts configured for this device:
Host = HOSTA
Host = HOSTB

This reports the discrepancy between the device database and the new device found. Take note of the new Drive Path for the device as this will be needed for the tpautoconf command. To resolve this, run:

# cd /usr/openv/volmgr/bin
# ./tpautoconf -replace_drive QUANTUMDLT70001 -path /dev/rmt/21cbn

Found a matching device in global DB, QUANTUMDLT70001 on host HOSTA
update of local DB on host HOSTA completed
globalDB update for host HOSTA completed

Found a matching device in global DB, QUANTUMDLT70001 on host HOSTB
update of local DB on host HOSTB completed
globalDB update for host HOSTB completed

This updates the global and local databases to reflect the replacement device.
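
The drive name and new path needed for tpautoconf -replace_drive can be pulled out of the -report_disc output instead of being copied by hand. The sketch below assumes the report has the layout shown above, with exactly one "New Device" section and one "Missing Device" section that refer to the same drive; the function names are hypothetical, and the command is only built, not run.

```python
import re

_SECTION = re.compile(r"^=+\s*(.*?)\s*=+$")


def parse_report_disc(text):
    """Parse the report into one dict per section, collecting Host lines."""
    sections = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = _SECTION.match(line)
        if header:
            current = {"section": header.group(1), "hosts": []}
            sections.append(current)
        elif current is not None and " = " in line:
            key, value = line.split(" = ", 1)
            if key == "Host":
                current["hosts"].append(value)
            else:
                current[key] = value.strip('"')
    return sections


def replace_drive_command(sections):
    """Build the replace_drive arguments from one new and one missing device."""
    new = [s for s in sections if s["section"].startswith("New Device")]
    missing = [s for s in sections if s["section"].startswith("Missing Device")]
    if len(new) != 1 or len(missing) != 1:
        raise ValueError("expected exactly one new and one missing device")
    return ["tpautoconf", "-replace_drive", missing[0]["Drive Name"],
            "-path", new[0]["Drive Path"]]


report = """======================= New Device (Tape) ============
Inquiry = "QUANTUM DLT7000 245F"
Serial Number = PXA51S3232
Drive Path = /dev/rmt/21cbn
Found as TLD(0), Drive = 1
===================== Missing Device (Drive) =========
Drive Name = QUANTUMDLT70001
Drive Path = /dev/rmt/11cbn
Inquiry = "QUANTUM DLT7000 245F"
Serial Number = PXA51S3587
TLD(0) definition, Drive = 1
Hosts configured for this device:
Host = HOSTA
Host = HOSTB
"""
sections = parse_report_disc(report)
print(" ".join(replace_drive_command(sections)))
# tpautoconf -replace_drive QUANTUMDLT70001 -path /dev/rmt/21cbn
```

If the report lists several new or missing devices, the sketch refuses to guess which pair belongs together, and the match has to be made manually (for example by drive position in the robot).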

Up the drive at this point.

Drive Status

The following list describes the current drive status field descriptions for drives within a Veritas NetBackup (tm) configuration.

Details:

Manual:
Veritas NetBackup (tm) 6.5 Troubleshooting Guide for UNIX, Linux, and Windows
Veritas NetBackup (tm) 6.0 Media Manager System Administrator's Guide for UNIX, Page: 240-242
Veritas NetBackup (tm) 6.0 Media Manager System Administrator's Guide for Windows, Page: 223-225

Modification Type: Supplement

Modification:
The following NetBackup Drive statuses can appear in command line output or in the Device Monitor in the NetBackup Administration Console.

Column Description Note for the Administration Console:
Drive Name - Drive name assigned to the drive during configuration.
Control - Control mode for the drive, which can be any of the following:

robot_designation

For example, TLD.

The robotic daemon managing the drive has connected to LTID (the device daemon and Device Manager service) and is running. The drive is in the usable state. AVR is assumed to be active for the drive, as all robotic drives must be in automatic volume recognition (AVR) mode (not OPR mode).

Applies only to robotic drives.

DOWN-robot_designation

For example, DOWN-TLD.

The drive is in an unusable state because it was downed by an operator or by NetBackup; or when the drive was configured, it was added as a down drive. Applies only to robotic drives.

DOWN

In this mode, the drive is not available to Media Manager.

Applies only to standalone drives.

A drive can be in a DOWN mode because of problems or because it was set to that mode using Actions | Down Drive.

PEND-robot_designation

For example, PEND-TLD.

The drive is in a pending status. Applies only to robotic drives.

PEND

The drive is in a pending status. Applies only to standalone drives.

If the drive reports a SCSI RESERVATION CONFLICT status, this column will show PEND. This status means that the drive is reserved when it should not be reserved. Some server operating systems (Windows, Tru64, and HP-UX) may report PEND if the drive reports Busy when opened. You can use the AVRD_PEND_DELAY entry in the Media Manager configuration file to filter out these reports.

AVR

The drive is running with automatic volume recognition enabled.

The drive is in a usable state with automatic volume recognition enabled, but the robotic daemon managing the drive is not connected or is not working. Automated media mounts do not occur with a drive in this state (unless the media is in a drive on the system, or, this is a standalone tape drive), but the operator can physically mount a tape in the drive or use robtest to cause a tape mount as needed.

OPR

The drive is running in a secure mode where operators must manually assign mount requests to the drive. AVRD is not scanning this drive when in this mode. This mode gives operators control over which mount requests can be satisfied.

Applies only to standalone drives.

NO-SCAN

A drive is configured for shared storage option (SSO), but has no available scan host (to be considered available, a host must register with a SSO_SCAN_ABILITY factor of non-zero and have the drive in the UP state). NO-SCAN may be caused if all available scan hosts have the drive in the DOWN state. Other hosts (that are not scan hosts) may want to use the drive, but they registered with a scan factor of zero. The drive is unusable by NetBackup until a scan host can be assigned.

Mixed

The control mode for a shared drive may not be the same on all hosts sharing the drive. For shared drives, each host can have a different status for the drive. If the control modes are all the same, that mode is displayed.

RESTART

The control mode for a shared drive may not be the same on all hosts sharing the drive. This status indicates that ltid needs to be restarted. To determine which servers need ltid restarted, right-click the drive in the Device Monitor and select Up. This reports the servers on which ltid needs to be restarted.
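
The control modes listed above can be summarized in a small lookup, which is handy when scanning command line output for drives that need attention. This is a sketch based only on the descriptions in this note; the function name describe_control_mode is hypothetical, and it assumes that any value not otherwise listed is a robot designation such as TLD.

```python
FIXED_MODES = {
    "DOWN": "standalone drive, not available to Media Manager",
    "PEND": "standalone drive in a pending status",
    "AVR": "usable with automatic volume recognition; "
           "robotic daemon not connected or not working",
    "OPR": "standalone drive; operators manually assign mount requests",
    "NO-SCAN": "shared drive with no available scan host; "
               "unusable until one is assigned",
    "MIXED": "shared drive whose control mode differs between hosts",
    "RESTART": "shared drive; ltid needs to be restarted",
}


def describe_control_mode(mode):
    """Return a one-line description of a Control column value."""
    key = mode.strip().upper()
    if not key:
        raise ValueError("empty control mode")
    if key in FIXED_MODES:
        return FIXED_MODES[key]
    for prefix, state in (("DOWN-", "downed"), ("PEND-", "pending")):
        if key.startswith(prefix) and len(key) > len(prefix):
            return f"robotic drive ({key[len(prefix):]}), {state}"
    # Assumption: anything else is a robot designation under robotic control.
    return f"robotic drive ({key}), usable under robotic control"


for value in ("TLD", "DOWN-TLD", "PEND", "NO-SCAN", "Mixed"):
    print(value, "->", describe_control_mode(value))
```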

Drives in AVR mode

Troubleshooting for when Robotic Drives are going into AVR mode, and backups are halting with a pending mount request.

Exact Error Message
TLD(0) unavailable: initialization failed: Control daemon connect or protocol error

Details:

This problem is most often caused by communication problems. There are two NetBackup daemons for robotic control: one runs on the machine with robotic control, the other runs on the machine that has drives in the robot. For example, if the robot is a TLD robot, the two daemons are tldcd (runs on the server with robotic control) and tldd (runs on the server with drives in the robot). In this commonly occurring problem, the drives change from TLD control to AVR control so that jobs go into a pending mount state rather than failing. That way, if network communications fail between the two servers for a short time, the jobs do not need to fail and can wait until the connection comes back up. However, at times this can be caused by more severe problems.

In the media server's system log would be an error such as this:
Dec 4 08:54:36 host01 tldd[260]: TLD(0) unavailable: initialization failed: Control daemon connect or protocol error
Dec 4 08:56:41 host01 tldd[858]: TLD(0) [858] unable to connect to tldcd on host02: Error 0 (0)

These errors cause the drives in a robot to go into AVR control mode, because the two daemons are unable to communicate.
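A minimal Python sketch for spotting these entries in a system log follows. The regular expression is modeled only on the two sample lines above, so other syslog layouts may need a different pattern; the function name is hypothetical.

```python
import re

# Pattern modeled on the two sample syslog lines above; other syslog
# formats may need a different expression.
TLDD_ERROR = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+"
    r"tldd\[(?P<pid>\d+)\]:\s+TLD\((?P<robot>\d+)\)\s+(?P<message>.*)$"
)

def find_tldd_errors(lines):
    """Return (host, robot number, message) for each tldd line that reports
    a failed connection to the robotic control daemon."""
    hits = []
    for line in lines:
        m = TLDD_ERROR.match(line.strip())
        if m and ("Control daemon connect" in m["message"]
                  or "unable to connect to tldcd" in m["message"]):
            hits.append((m["host"], int(m["robot"]), m["message"]))
    return hits

sample = [
    "Dec 4 08:54:36 host01 tldd[260]: TLD(0) unavailable: initialization failed: Control daemon connect or protocol error",
    "Dec 4 08:56:41 host01 tldd[858]: TLD(0) [858] unable to connect to tldcd on host02: Error 0 (0)",
]
for host, robot, message in find_tldd_errors(sample):
    print(host, robot, message)
```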

It is not possible to give a single cause or a single solution.

Some common causes are:

1. Network connectivity has just plain failed. In this case, the network must be restored.

2. There are multiple interfaces on one or both of the machines that cannot route to or resolve each other. In this case, routing needs to be changed so that a request is able to reach its destination. Adding the proper host names to the /etc/hosts file has been shown to work in some situations.

3. The tldcd daemon has entered an uninterruptible state or is hung, making it unable to reply to tldd. In this case, shut down the media management daemons by running /usr/openv/volmgr/bin/stopltid.

Next, run /usr/openv/volmgr/bin/vmps to get the pid (process ID) of the tldcd daemon and run a kill command on it. If that doesn't work, use kill -9. If this does not kill the process, the server will have to be rebooted. To restart the daemons, run /usr/openv/volmgr/bin/ltid.

Note: The daemon does not time out because it is hung on a system call. This is something out of an application's ability to control.

4. The /etc/services file is missing the correct entries on one or both of the servers. Below are the entries that should be in /etc/services:
# Media Manager services #
vmd 13701/tcp vmd
acsd 13702/tcp acsd
tl8cd 13705/tcp tl8cd
odld 13706/tcp odld
tldcd 13711/tcp tldcd
tl4d 13713/tcp tl4d
tshd 13715/tcp tshd
tlmd 13716/tcp tlmd
tlhcd 13717/tcp tlhcd
rsmd 13719/tcp rsmd
# End Media Manager services #

Note: Not only can this happen between two different servers, it can also happen on the same server.
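The /etc/services check from cause 4 can be sketched in Python. The expected names and ports are copied from the list above; the function name and the sample text are hypothetical.

```python
# Expected Media Manager entries, copied from the list above.
EXPECTED = {
    "vmd": 13701, "acsd": 13702, "tl8cd": 13705, "odld": 13706,
    "tldcd": 13711, "tl4d": 13713, "tshd": 13715, "tlmd": 13716,
    "tlhcd": 13717, "rsmd": 13719,
}

def missing_services(services_text, expected=EXPECTED):
    """Return the expected names that are absent or mapped to another TCP port."""
    found = {}
    for line in services_text.splitlines():
        fields = line.split("#", 1)[0].split()   # drop trailing comments
        if len(fields) >= 2 and fields[1].endswith("/tcp"):
            port = fields[1].split("/")[0]
            if port.isdigit():
                found[fields[0]] = int(port)
    return sorted(name for name, port in expected.items() if found.get(name) != port)

sample = """\
vmd 13701/tcp vmd
tldcd 13722/tcp tldcd   # wrong port
"""
print(missing_services(sample))
```

On a server, the same function could be fed the real file, for example `missing_services(open("/etc/services").read())`, and run on both hosts involved.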

Vault eject error

Vault does not eject any tapes and the error "Rejected - No frag in robotGroup" is seen in the vault detail log for the session ID.

Exact Error Message
Rejected - No frag in robotGroup

Details:

Overview:

The error "Rejected - No frag in robotGroup" may be seen when Vault has been configured to eject "original" tapes, but no tapes were picked up for ejection. A possible reason for this is that the Robotic Volume Group is incorrect in the Vault configuration. Vault will only eject tapes that belong to the Robotic Volume Group specified at the "Vault Name" level of the configuration. Figure 1 shows where to find the Robotic Volume Group information.

Figure 1 - Robotic Volume Group in the Vault configuration GUI

Log Files:

To determine if the Robotic Volume Group is incorrect, the following information will need to be collected:

(i) The name of the vault and the profile that is not ejecting tapes.

(ii) The vault configuration file:

\netbackup\db\vault\vault.xml (Windows)

/usr/openv/netbackup/db/vault/vault.xml (UNIX)

(iii) The output of the command:

\Volmgr\bin\vmquery -pn (Windows)

/usr/openv/volmgr/bin/vmquery -pn (UNIX)

Note: Use the volume pool name(s) listed in the Eject tab of the profile (that is not ejecting tapes). There may be more than one volume pool listed. This command should be run for each volume pool.

(iv) The \netbackup\vault\sessions\\SIDXXX directory (Windows)

The /usr/openv/netbackup/vault/sessions//SIDXXX directory (UNIX)

where "XXX" matches the session ID from the job details in activity monitor. Figure 2 below shows the session ID.

Figure 2 - Session ID from the vault job details

Troubleshooting:

Once the above information has been collected, perform the following:

1. Inspect the RobotVolumeGroup entry specified at the Vault Name level under which the profile that is not ejecting tapes resides. It should be something similar to 00_000_TL8 (see Figure 3).

Figure 3 - Robotic Volume Group in the vault.xml file

2. Check where one of the images (that suffered the error) resides by performing the following:

(i) From the detail.log file within the sid directory, find an image that exhibits the error message:

04:37:19.841 [23584] ImgFilterVaulted(): Image: hostname_1135991212 Rejected - No frag in robotGroup

(ii) Then run the bpimagelist command to obtain details of this image:

# /usr/openv/netbackup/bin/admincmd/bpimagelist -backupid hostname_1135991212

IMAGE hostname 0 0 7 hostname_1135991212 policy 13 ...

HISTO -1 -1 -1 -1 -1 -1 -1 -1 -1 -1

FRAG 1 1 14327919 0 2 15 1 C01243 ...

The FRAG line shows that the image is on tape C01243

(iii) Check the volume group where tape C01243 resides using the vmquery -m output.

# /usr/openv/volmgr/bin/vmquery -m C01243

==================================================

media ID: C01243

media type: 8MM cartridge tape (4)

barcode: ABCO1423

media description: NT|826|S981|20040421

volume pool: Duplicate (6)

robot type: TL8 - Tape Library 8MM (6)

robot number: 0

robot slot: 43

robot control host: spnbu02-nbu

volume group: 00_000_TL8

vault name: NT

vault sent date: Mon Oct 04 14:19:22 2004

vault return date: ---

vault slot: 834

vault session id: 57

vault container id: -

created: Wed Mar 17 14:02:45 2004

assigned: Sun Oct 03 14:37:13 2004

last mounted: Sun Oct 03 18:00:29 2004

first mount: Wed Mar 17 14:04:50 2004

expiration date: ---

number of mounts: 8

max mounts allowed: ---

status: 0x0

==============================================
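The three lookups in step 2 can be sketched in Python as plain text parsing. This is a sketch under stated assumptions: the position of the media ID in the FRAG line (ninth token) is inferred from the single sample above, and all function names are hypothetical.

```python
import re

REJECTED = re.compile(r"Image:\s+(\S+)\s+Rejected - No frag in robotGroup")

def rejected_images(detail_log_lines):
    """Backup IDs that detail.log reports as 'No frag in robotGroup'."""
    found = []
    for line in detail_log_lines:
        m = REJECTED.search(line)
        if m:
            found.append(m.group(1))
    return found

def media_id_from_frag(frag_line):
    """Media ID from a FRAG line (assumed to be the ninth token, as in the sample)."""
    fields = frag_line.split()
    if len(fields) < 9 or fields[0] != "FRAG":
        raise ValueError("not a FRAG line")
    return fields[8]

def parse_vmquery(text):
    """Turn the 'key: value' lines of vmquery -m output into a dict."""
    record = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            record[key.strip()] = value.strip()
    return record

def in_robotic_volume_group(vmquery_text, robot_volume_group):
    """True when the tape's volume group equals the vault's Robotic Volume Group."""
    return parse_vmquery(vmquery_text).get("volume group") == robot_volume_group

detail = ["04:37:19.841 [23584] ImgFilterVaulted(): Image: hostname_1135991212 Rejected - No frag in robotGroup"]
frag = "FRAG 1 1 14327919 0 2 15 1 C01243 ..."
vmquery = """\
media ID: C01243
volume pool: Duplicate (6)
volume group: 00_000_TL8
"""
print(rejected_images(detail))
print(media_id_from_frag(frag))
print(in_robotic_volume_group(vmquery, "00_000_TL8"))
```

A `False` result from the last call corresponds to the mismatch that prevents Vault from ejecting the tape.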

Resolution:

If the vmquery output shows a volume group different to the one specified in the vault configuration file (vault.xml), then the tapes will never be ejected. There are two different ways to resolve this, as follows:

A. The vault configuration will need to be changed so that it specifies a new Robotic Volume Group.
To do this:

1. Open the NetBackup Administration GUI and expand Vault Management.

2. Right-click on the name of the vault that is incorrect, and click Change.

3. Change the Robotic Volume Group entry to match the Volume Group entry in the vmquery output and save.

4. Re-run the vault profile and the tapes should be ejected.

B. Change the robotic volume group associated with the media to match what the vault is looking for.
To do this:

1. Open the NetBackup Administration GUI and expand Media

2. Right-click on the media you want to change.

3. Select "Change Volume Group" from the drop down menu.

4. Use the drop down menu to select the proper Volume Group entry in the vmquery output (or type it in) and click OK.

5. Re-run the vault profile and the tapes should be ejected.

CAUTION:

I. For Resolution A - If the Vault Offsite Volume Group for the Vault concerned already houses tapes of a certain media density, then ensure that the Robotic Volume Group being changed in the Vault configuration also houses tapes of the same media density. If the media density differs, the eject will fail at the point when Vault tries to move the tapes to the offsite volume group (most likely with error 101, media type and volume group mismatch). This is because a volume group (offsite or robotic, other than "---") cannot house more than one media density.

Note: The media density is not an issue for Resolution A if the Vault Offsite Volume Group is not yet assigned to any particular media density at the time of the eject. However, as soon as it does, it is obliged to stick with that media density. It is only freed from this obligation if all the tapes in it are moved elsewhere (to another volume group).

II. For Resolution B - Moving tapes to another volume group will only succeed if the destination volume group either already houses the same media density or is not yet assigned to a particular media density. Moving media of one density to a volume group of another density will result in error 101 (as above).
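The one-density-per-volume-group rule can be modeled with a small Python class. This is a conceptual sketch, not NetBackup behavior in code form: the class, its methods, and the `ValueError` standing in for error 101 are all hypothetical, and the special "---" group is not modeled.

```python
class VolumeGroup:
    """Toy model of a volume group that may hold only one media density."""

    def __init__(self, name):
        self.name = name
        self.tapes = {}          # media ID -> density

    @property
    def density(self):
        """Density the group is currently tied to, or None when it is empty."""
        return next(iter(self.tapes.values()), None)

    def add(self, media_id, density):
        # Stand-in for error 101 (media type and volume group mismatch).
        if self.density not in (None, density):
            raise ValueError(
                f"{self.name} holds {self.density} media, cannot accept {density}"
            )
        self.tapes[media_id] = density

    def remove(self, media_id):
        self.tapes.pop(media_id, None)

offsite = VolumeGroup("offsite")
offsite.add("C01243", "8MM")
try:
    offsite.add("H00001", "HCART3")      # different density: rejected
except ValueError as err:
    print("rejected:", err)
offsite.remove("C01243")                 # group is empty again, so it is freed
offsite.add("H00001", "HCART3")          # now accepted
print(offsite.density)
```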

Incremental Catalog backup failing with status 41

Incremental Catalog backup fails with Status 41 (network connection timeout)

Exact Error Message
15:38:11.876 [18923] <16> bpbrm main: db_FLISTsend failed: network connection timed out (41)

Details:

Overview:

An Incremental Catalog Backup fails with Status 41 (network connection timeout), whereas the Full Catalog Backup completed in approximately 20 minutes.

Troubleshooting:

The bpbrm log file on the media server ( /usr/openv/netbackup/logs/bpbrm/log. ) shows the error:

15:38:11.876 [18923] <16> bpbrm main: db_FLISTsend failed: network connection timed out (41)

Following a 10-minute delay, the bpbkar log file on the client ( /usr/openv/netbackup/logs/bpbkar/log. ) shows the error:

15:28:12.868 [18954] <2> bpbkar SelectFile: INF - Resolved_path = /usr/openv/netbackup/db/images/master1/1231000000/SYM_INC_1231711779_UBAK.f

15:38:59.484 [18954] <4> bpbkar PrintFile: /data/openv/netbackup/db/images/master1/1231000000/

15:38:59.484 [18954] <16> bpbkar sighandler: ERR - bpbkar killed by SIGPIPE

15:38:59.484 [18954] <2> bpbkar sighandler: INF - ignoring additional SIGPIPE signals

Creating or amending the file /usr/openv/netbackup/bin/DBMto to contain the value "30" increases the timeout (from the default of 10 minutes) allowing the incremental backups to complete. However, it was found that the incremental backup then took around 8 hours to complete.

The bpdbm log ( /usr/openv/netbackup/logs/bpdbm/log. ) shows:

10:31:34.768 [19001] <2> add_files: updating /usr/openv/netbackup/db/images/master1/1243000000/tmp/NB-Catalog_1243231645_INCR.f

10:48:29.184 [19001] <2> add_files: 5000 files added to /usr/openv/netbackup/db/images/master1/1243000000/tmp/NB-Catalog_1243231645_INCR.f

10:48:35.557 [22486] <2> add_files: updating /usr/openv/netbackup/db/images/master1/1243000000/tmp/NB-Catalog_1243231645_INCR.f

11:05:50.079 [22486] <2> add_files: 5000 files added to /usr/openv/netbackup/db/images/master1/1243000000/tmp/NB-Catalog_1243231645_INCR.f

The delays can be seen between the "add_files: updating" and "add_files: 5000 files added" entries for a given process (roughly 17 minutes per 5000 files in this log), which has a cumulative effect on the time the catalog backup takes.
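The per-process delay can be measured with a short Python sketch. It assumes the bpdbm line layout shown in the excerpt (time, PID in brackets, verbosity level, message) and does not handle logs that cross midnight; the function name is hypothetical.

```python
import re
from datetime import datetime

LINE = re.compile(
    r"^(\d{2}:\d{2}:\d{2}\.\d{3}) \[(\d+)\] <\d+> add_files: (updating|\d+ files added)"
)

def add_files_durations(lines):
    """Seconds between 'updating' and 'N files added' for each process ID."""
    started, durations = {}, []
    for line in lines:
        m = LINE.match(line.strip())
        if not m:
            continue                     # path-only lines and unrelated entries
        stamp = datetime.strptime(m.group(1), "%H:%M:%S.%f")
        pid = m.group(2)
        if m.group(3) == "updating":
            started[pid] = stamp
        elif pid in started:
            durations.append((pid, (stamp - started.pop(pid)).total_seconds()))
    return durations

sample = [
    "10:31:34.768 [19001] <2> add_files: updating",
    "10:48:29.184 [19001] <2> add_files: 5000 files added to",
    "10:48:35.557 [22486] <2> add_files: updating",
    "11:05:50.079 [22486] <2> add_files: 5000 files added to",
]
for pid, seconds in add_files_durations(sample):
    print(pid, round(seconds / 60, 1), "minutes")
```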

Solution:

Create /usr/openv/netbackup/bin/DBMto with the entry "30" to correct the Status 41 error, then restart the NetBackup services.
Create the file /usr/openv/netbackup/MAX_FILES_PER_ADD with the entry 500, then restart the NetBackup services.

It was found that following these changes, the Incremental Catalog backup successfully completed in 15 minutes.

SQL script to delete tape drives with MISSING_DRIVE status

SQL script to delete tape drives with "MISSING_DRIVE" status from the EMM database. There can be situations where drives in the EMM database have "MISSING_DRIVE" status and cannot be deleted via the "tpconfig -delete -drive " command.

Details of the issue:
----------------------------
A. The "tpconfig -emm_dev_list -noverbose" command, executed from an affected media server, will show entries similar to the following:
TPC_DEV60 DRIVE T10K_2_1_1_1 16 1 2 -1 - 2 1 1 1 media_server MISSING_DRIVE:531001002370 531001002370 - 128 4 -1 -1 -1 -1 0 11889 0 0 - 0 - - STK~~~~~T10000A~~~~~~~~~1.38 - 16

B. "tpconfig -l" executed on the media server may show an entry similar to the following:
drive - 16 hcart3 - DISABL - T10K_2_1_1_1 MISSING_DRIVE:531001002370 ACS=2, LSM=1, PANEL=1, DRIVE=1

C. To verify the entries in the Enterprise Media Manager (EMM) database tables, run the following command:

/usr/openv/db/bin/nbdb_unload -t EMM_MAIN.EMM_Device, EMM_MAIN.EMM_MachineDeviceConnection, EMM_MAIN.EMM_DriveIndex

Entries like the following appear in the nnn.dat files produced by the above command:

436.dat:
'2000185',0xECD8529A2C3213F88038865237F37639,'2','16','0','128','1','NetBackup HCART3','NetBackup HCART','523118080','16176','6','0','T10K_2_1_1_1',','2000176','2','1',','-1','STK','T10000A','1.38',',',','531001002370',',','STK T10000A 1.38',','0',','0','0','1970-01-01 00:00:00.000000','1970-01-01 00:00:00.000000','2009-04-23 19:33:18.000000','0','42802796','0','0',0x00000000,0x00000000000000000000000000000000,'-1','-1','1970-01-01 00:00:00.000000','0','0','2','1','1','1',',',','0','0','0','8388608','2006-08-15 02:18:16.184874','2009-05-18 23:22:46.007585'

444.dat:
'2002855',0x6F2D947E305911DE8000893E7E4907EB,'2','16','0','32907','1','NetBackup HCART3','NetBackup HCART','523118080','16176','6','0','T10K_2_1_1_1',','2000176','2','1',','-1','STK','T10000A','1.38',',',','531004007993',',','STK T10000A 1.38',','0',','1000036','1000036','1970-01-01 00:00:00.000000','1970-01-01 00:00:00.000000','2009-05-20 03:00:53.000000','0','248714','0','0',0x00000000,0x00000000000000000000000000000000,'-1','-1','2009-05-20 03:01:41.000000','1000036','1','2','1','1','1',','T17480','T17480','82,'0','0','8388608','2009-04-23 02:52:34.317443','2009-05-20 03:03:49.093255'

D. If "tpconfig -delete -drive " does not work, or if it works but entries like the above still appear in the output of the commands listed above, engage Symantec Technical Support to obtain a proper SQL script for deleting the drives with MISSING_DRIVE status. Support may need to involve Symantec Development, because a SQL script not approved by Symantec Development may further damage the EMM database.

E. The above condition may lead to backup/duplication failures, and may also lead to unexpectedly slow performance from resource allocation and selection processes such as nbrb and mds (emm).

F. After the drives with MISSING_DRIVE status have been removed successfully, execute the following steps on the NetBackup servers to start from a clean state in the environment:

1. Select a window with few or no jobs on the master server. If any jobs are running, cancel them and make sure all jobs are stopped. To make sure nbpem queues no further jobs during this activity, "nbpemreq -suspend_scheduling" can be run, and no manual backups, user backups or restores should be initiated during this time. If "nbpemreq -suspend_scheduling" is used, "nbpemreq -resume_scheduling" must be executed to resume job scheduling once all the steps are done.

2. Run "nbrbutil -releaseAllocHolds" on the master server.
3. Run "nbrbutil -resetAll" on the master server.
4. Run "nbrbutil -dump" and keep a copy of the output for Symantec to review.
5. Stop all daemons on the master server.
6. Recycle daemons on the media servers that had missing drives before the SQL script was executed. These media servers can also be found by running the following SQL command:

SELECT MachineKey, FQMachineName, MachinePrimaryName FROM "EMM_MAIN"."EMM_Machine" WHERE MachineKey IN (SELECT DISTINCT PrimaryMachineKey FROM "EMM_MAIN"."EMM_MachineDeviceConnection" WHERE PrimaryPath LIKE 'MISSING%')

7. Recycle daemons on all remaining media servers.
8. Start NBU daemons on the Master server.
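To confirm before and after the cleanup which drives are still flagged, the "tpconfig -l" output from item B can be scanned with a short Python sketch. It assumes, based only on the sample line in item B, that the drive name is the token immediately before the MISSING_DRIVE marker; the function name is hypothetical.

```python
import re

MISSING = re.compile(r"(\S+)\s+MISSING_DRIVE:(\S+)")

def missing_drives(tpconfig_l_output):
    """Return (drive name, identifier) pairs for MISSING_DRIVE entries.

    Assumes the 'tpconfig -l' layout from item B, where the drive name
    directly precedes the MISSING_DRIVE marker."""
    return MISSING.findall(tpconfig_l_output)

sample = (
    "drive - 16 hcart3 - DISABL - T10K_2_1_1_1 "
    "MISSING_DRIVE:531001002370 ACS=2, LSM=1, PANEL=1, DRIVE=1\n"
)
print(missing_drives(sample))
```

An empty list after the cleanup suggests that no MISSING_DRIVE entries remain in that host's view.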

Comments:
Although only Solaris is mentioned here as the Operating System, the above condition may exist in any NetBackup environment with any supported Operating Systems.

Rebuild SG File on Solaris

Drives are not being recognized by the OS - Steps to rebuild the /dev/sg/* and /dev/rmt/* devices on a Solaris Master or Media server

Details:

Here are the basic steps to rebuild the /dev/sg/* and /dev/rmt/* devices on a Solaris server without rebooting.

1. Create a backup copy of the current st.conf file:
# cp /kernel/drv/st.conf /kernel/drv/st.conf.`date +%m%d%y_%H%M%S`

2. Move the existing sg.conf to a backup (this must be a move, otherwise a later step will fail):
# mv /kernel/drv/sg.conf /kernel/drv/sg.conf.`date +%m%d%y_%H%M%S`

3. Create a backup copy of the current devlink.tab file:
# cp /etc/devlink.tab /etc/devlink.tab.`date +%m%d%y_%H%M%S`

4. Delete SCSI targets/LUNs from the /kernel/drv/st.conf file:
name="st" class="scsi"
target=0 lun=0;

All of these entries should be removed, otherwise duplicates will be added later.

5. Delete SCSI targets/LUNs from /etc/devlink.tab. This is typically the section near the end of the file and the entries are typically of the form:

# begin SCSA Generic devlinks file - creates nodes in /dev/sg
type=ddi_pseudo;name=sg;addr=0,0; sg/c\N0t0l0
type=ddi_pseudo;name=sg;addr=1,0; sg/c\N0t1l0
type=ddi_pseudo;name=sg;addr=2,0; sg/c\N0t2l0
type=ddi_pseudo;name=sg;addr=3,0; sg/c\N0t3l0
type=ddi_pseudo;name=sg;addr=4,0; sg/c\N0t4l0
type=ddi_pseudo;name=sg;addr=5,0; sg/c\N0t5l0
type=ddi_pseudo;name=sg;addr=6,0; sg/c\N0t6l0
type=ddi_pseudo;name=sg;addr=0,1; sg/c\N0t0l1
type=ddi_pseudo;name=sg;addr=1,1; sg/c\N0t1l1
type=ddi_pseudo;name=sg;addr=2,1; sg/c\N0t2l1
type=ddi_pseudo;name=sg;addr=3,1; sg/c\N0t3l1
type=ddi_pseudo;name=sg;addr=4,1; sg/c\N0t4l1
type=ddi_pseudo;name=sg;addr=5,1; sg/c\N0t5l1
type=ddi_pseudo;name=sg;addr=6,1; sg/c\N0t6l1
# end SCSA devlinks

Everything in this section should be removed, inclusive of the beginning and ending lines.

6. Change to the appropriate directory to run commands:
# cd /usr/openv/volmgr/bin/driver

7. Generate the configuration files (st.conf, sg.conf and sg.links):
../sg.build all -mt -ml

Note: The -mt and -ml options take the max_target and max_lun values (the maximum SCSI target and LUN), so these must be known before running the command.

8. Append the generated st.conf entries to the OS configuration file:
# cat st.conf >> /kernel/drv/st.conf

9. Unload the sg driver:
# rem_drv sg

10. Use the provided script to re-create the /kernel/drv/sg.conf file, append the SCSA entries to /etc/devlink.tab and reload the sg driver:
# ./sg.install

11. Now sgscan should see the appropriate devices:
# /usr/openv/volmgr/bin/sgscan all conf -v
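Step 5 (removing the SCSA section from /etc/devlink.tab, markers included) can be sketched in Python instead of editing by hand. The marker strings are taken from the excerpt in step 5; the function name and the sample content are hypothetical.

```python
BEGIN = "# begin SCSA Generic devlinks file"
END = "# end SCSA devlinks"

def strip_scsa_section(devlink_text):
    """Remove the SCSA block, including its begin and end marker lines."""
    kept, inside = [], False
    for line in devlink_text.splitlines(keepends=True):
        if line.startswith(BEGIN):
            inside = True
        if not inside:
            kept.append(line)            # lines outside the block stay verbatim
        if inside and line.startswith(END):
            inside = False
    return "".join(kept)

sample = (
    "type=ddi_pseudo;name=example;\tkeep/this\n"   # hypothetical unrelated entry
    "# begin SCSA Generic devlinks file - creates nodes in /dev/sg\n"
    "type=ddi_pseudo;name=sg;addr=0,0;\tsg/c\\N0t0l0\n"
    "# end SCSA devlinks\n"
    "# trailing comment\n"
)
print(strip_scsa_section(sample), end="")
```

Writing the result to a separate file and reviewing it before replacing /etc/devlink.tab keeps the backup from step 3 as the fallback.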

Troubleshooting hung backups

Even though backup resources are available, backups are taking hours to go active and complete. However, if they are canceled and re-submitted, they take off and complete right away.

Details:

Overview:
Jobs remain queued while resources are available because of the length of time it takes nbrb to complete an evaluation cycle.

Troubleshooting/Log Files:
Determine if there is an excessive amount of time taken between nbjm requesting a resource and nbrb granting the resource.

For example:
1. From the Activity Monitor within the NetBackup GUI, view the job details. In this particular instance, it took 10 minutes to grant a resource:
8/21/2007 2:58:44 PM - requesting resource figrin.NBU_POLICY.MAXJOBS.sybase-arvel1-SDS_SY48-sybsecurity-DB
8/21/2007 3:08:31 PM - granted resource figrin.NBU_CLIENT.MAXJOBS.arvel1

2. In the nbrb log, look for the amount of time between "evaluation cycle is in progress" and "reserved resource" for a particular jobid :
08/21/07 20:58:13.512 [Debug] NB 51216 nbrb 118 PID:16608 TID:3 File ID:118 [jobid=9180586] 2 [ResBroker_i::evaluateNow] evaluation cycle is in progress
08/21/07 21:08:30.706 [Debug] NB 51216 nbrb 118 PID:16608 TID:9 File ID:118 [jobid=9180586] 3 [CountedProvider::allocate] reserved resource figrin.NBU_CLIENT.MAXJOBS.arvel1 (current_ref_count=2, max_ref_count=10)

To generate an nbrb log, use this command:
/usr/openv/netbackup/bin/vxlogview -p 51216 -o 118 -d all

3. Search the raw nbrb log (/usr/openv/logs/*118*) for the "msec" string:
For example:
51216-118-1067800690-070912-0000000011.log:0,51216,118,118,34289634,1189573603242,17599,9,0:,83:Resource evaluation completed. Evaluated 779 requests, evaluation time: 481417 msec,25:ResBroker_i::doEvaluation,2
51216-118-1067800690-070912-0000000024.log:0,51216,118,118,34308513,1189574183188,17599,10,0:,83:Resource evaluation completed. Evaluated 763 requests, evaluation time: 579944 msec,25:ResBroker_i::doEvaluation,2
51216-118-1067800690-070912-0000000040.log:0,51216,118,118,34327216,1189574666005,17599,9,0:,83:Resource evaluation completed. Evaluated 916 requests, evaluation time: 482815 msec,25:ResBroker_i::doEvaluation,2
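The "msec" search in step 3 can be turned into numbers with a short Python sketch. The pattern is based only on the three raw log lines above; the function name is hypothetical.

```python
import re

EVAL = re.compile(r"Evaluated (\d+) requests, evaluation time: (\d+) msec")

def evaluation_cycles(lines):
    """Return (request count, seconds) for each completed evaluation cycle."""
    cycles = []
    for line in lines:
        m = EVAL.search(line)
        if m:
            cycles.append((int(m.group(1)), int(m.group(2)) / 1000.0))
    return cycles

sample = [
    "Resource evaluation completed. Evaluated 779 requests, evaluation time: 481417 msec,25:ResBroker_i::doEvaluation,2",
    "Resource evaluation completed. Evaluated 763 requests, evaluation time: 579944 msec,25:ResBroker_i::doEvaluation,2",
    "Resource evaluation completed. Evaluated 916 requests, evaluation time: 482815 msec,25:ResBroker_i::doEvaluation,2",
]
cycles = evaluation_cycles(sample)
slowest = max(cycles, key=lambda c: c[1])
print(f"slowest cycle: {slowest[1] / 60:.1f} minutes for {slowest[0]} requests")
```

In the sample, each evaluation cycle takes between eight and ten minutes, which matches the roughly ten-minute wait seen in the job details.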

Resolution:
A number of steps were taken to reduce the time taken to grant resources:

Note: All configuration files listed below can be found in the following subdirectory:
UNIX: /usr/openv/var/global
Windows: \Veritas\NetBackup\var\global\

1. If the master server has greater than ten CPUs, it is necessary to increase the number of CPUs available to Sybase. See TechNote 285579, linked below.

2. Enable additional database connections via emm.conf changes.

For example, the following lines could be added or edited:
NUM_DB_CONNECTIONS=21
NUM_DB_BROWSE_CONNECTIONS=20
NUM_ORB_THREADS=35

See TechNote 285629, linked below.

Note: In NetBackup 6.0, the server.conf file must also be changed to allow greater than 10 connections to the ASA database. This can be accomplished by editing the server.conf file and changing the value of the -gn parameter to 30.

In NetBackup 6.5, the -gn parameter does not exist.

3. Starting in NetBackup 6.0 MP6 and 6.5.2, nbrb can be tuned via the nbrb.conf file.

This file offers the following parameters:

SECONDS_FOR_EVAL_LOOP_RELEASE
RESPECT_REQUEST_PRIORITY
DO_INTERMITTENT_UNLOADS

Note: By default all values are 0.

Example format for parameters in nbrb.conf:
SECONDS_FOR_EVAL_LOOP_RELEASE = 180
RESPECT_REQUEST_PRIORITY = 0
DO_INTERMITTENT_UNLOADS = 1

An explanation of nbrb.conf parameters can be found in TechNote 300442, linked below.

4. VERY IMPORTANT for nbrb caching: do not mix SSO and dedicated drives within a library. Have all drives SSO or all drives dedicated.

5. Do not set max concurrent drives for a storage unit higher than the total number of drives.
For example: if the library has 8 drives, the largest value that total max concurrent drives can be is 8.

6. Ensure drives do not need cleaning (that is, the CLEANING REQUIRED flag is not set). Clean the drives, then run tpclean -M on all drives to clear the flag if it is still set.

7. If using custom scripts or Aptare, ensure that scripts are not accessing media servers for drive/tape information. All information is now stored on the EMM server.

8. Change HPUX kernel by increasing dbc_max_pct to 50 and dbc_min_pct to 10 to improve file caching.

9. Move the database transaction log and database to different disks:
For example:
If 2 additional drives (/u01 and /u02) are available:
# mkdir /u01/nbu
# mkdir /u02/nbu
# /usr/openv/netbackup/bin/goodies/netbackup stop
# /usr/openv/db/bin/nbdbms_start_server
# /usr/openv/db/bin/nbdb_move -data /u01/nbu -index /u01/nbu -tlog /u02/nbu
# /usr/openv/netbackup/bin/goodies/netbackup start
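The nbrb.conf format shown in step 3 can be read with a few lines of Python, for example to record the effective settings before and after tuning. The defaults follow the note that all values are 0 by default; the function names are hypothetical and integer values are assumed for every parameter.

```python
DEFAULTS = {
    "SECONDS_FOR_EVAL_LOOP_RELEASE": 0,
    "RESPECT_REQUEST_PRIORITY": 0,
    "DO_INTERMITTENT_UNLOADS": 0,
}

def parse_nbrb_conf(text):
    """Parse 'NAME = value' lines into a dict of integers."""
    settings = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name.strip():
            settings[name.strip()] = int(value.strip())
    return settings

def effective_settings(text):
    """Overlay the file's values on the documented defaults."""
    return {**DEFAULTS, **parse_nbrb_conf(text)}

sample = """\
SECONDS_FOR_EVAL_LOOP_RELEASE = 180
DO_INTERMITTENT_UNLOADS = 1
"""
for name, value in effective_settings(sample).items():
    print(name, "=", value)
```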

bpexpdate -deassignempty fails

The "bpexpdate -deassignempty" command fails with "Could not update media list, file read failed" due to bad image headers in the NetBackup catalog.

Exact Error Message
<16> compare_list_to_fragments: unexpected return value from db_IMAGEreceive: file read failed 13
<16> check_for_empty_media: Could not update media list, file read failed
<16> bpexpdate: file read failed

Details:

Overview:

The bpexpdate command will check the NetBackup (tm) image database for images to expire. If a bad image header file exists, this can cause the bpdbm daemon to return a "status 13, file read failed" to the bpexpdate command. To resolve this error, the NetBackup images database needs to be checked for bad image headers.

Troubleshooting:

The Problems report may show messages like:
05/29/05 16:41:40 nbmaster1 - cleaning image DB
05/29/05 16:41:42 nbmaster1 - Bad image header: Unix_1114295338_INCR
05/29/05 16:41:44 nbmaster1 - Bad image header: Unix_1114295341_INCR
05/29/05 16:42:01 nbmaster1 - Bad image header: Unix_1114704580_INCR

Master Server Log Files:

The /usr/openv/netbackup/logs/admin/log. file will show "file read failed" errors when trying to run bpexpdate -deassignempty.
<2> logconnections: BPDBM CONNECT FROM x.x.x.x.34611 TO x.x.x.x.13721
<16> compare_list_to_fragments: unexpected return value from db_IMAGEreceive: file read failed 13
<16> check_for_empty_media: Could not update media list, file read failed
<16> bpexpdate: file read failed
<2> bpexpdate: EXIT status = 13

The /usr/openv/netbackup/logs/bpdbm/log. file will show the specific image header that is generating the "file read failed" errors.
<16> db_get_image_info: fopen(D:\VERITAS\NetBackup\db\images\client\1114000000\Unix_1114704580_FULL): Permission denied (13)
<16> list_client_images: cannot get image info...client Unix_1114704580_FULL
<4> bpdbm: request complete: exit status 13 file read failed

Media Server Log Files: n/a

Client Log Files: n/a
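A short Python sketch can pull the failing image header path out of the bpdbm log. The pattern is modeled on the single `fopen(...)` line shown above, so other message layouts will not match; the function name is hypothetical.

```python
import re

FOPEN_FAIL = re.compile(
    r"fopen\((?P<path>.+)\): (?P<reason>.+) \((?P<errno>\d+)\)\s*$"
)

def failing_image_headers(lines):
    """Return (path, reason, errno) for each failed fopen reported by bpdbm."""
    hits = []
    for line in lines:
        m = FOPEN_FAIL.search(line)
        if m:
            hits.append((m["path"], m["reason"], int(m["errno"])))
    return hits

sample = [
    r"<16> db_get_image_info: fopen(D:\VERITAS\NetBackup\db\images\client\1114000000\Unix_1114704580_FULL): Permission denied (13)",
    "<4> bpdbm: request complete: exit status 13 file read failed",
]
for path, reason, errno in failing_image_headers(sample):
    print(errno, reason, path)
```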

Resolution:

1. Run a consistency check of the NetBackup database. This will help find bad image header and zero byte header files.

# /usr/openv/netbackup/bin/bpdbm -consistency > /tmp/filename.txt

Note: The consistency check should be run when there are no backups or restores running. False errors will appear for any current backup, since its .f file ("files file") will not be completely written until the backup completes. In a 24x7 environment, the check can be run while backups are running, but special care must be taken to check the timestamp of any image with errors to ensure it was not an active backup.

2. Review the output file /tmp/filename.txt that was created for any images listed with a "Bad image header" error.

3. Remove the image files that correspond to the "Bad image header" errors listed in the consistency output.

4. Run the bpexpdate command again.

"Bad Image Header" in Image DB

All Log Entries or Problems Log reports show: "Bad image header," indicating a problem with the NetBackup images database.

Exact Error Message
Bad image header

Details:

The All Log Entries or Problems Log reports show entries such as the following:

06/09/98 16:41:40 sparky - cleaning image DB
06/09/98 16:41:42 sparky - Bad image header: Unix_0896295338_INCR
06/09/98 16:41:44 sparky - Bad image header: Unix_0896295341_INCR
06/09/98 16:42:01 sparky - Bad image header: Unix_0896704580_INCR

"Bad image header" messages result from backup image headers that are incomplete.

Background Information

When VERITAS NetBackup (tm) executes backups, it writes information to the Images database on the NetBackup master server. The Images database resides in /usr/openv/netbackup/db/images. For every backup, NetBackup writes two files to the Images database: the backup image header file and the "files file." For a valid NetBackup image, both of these files must exist, otherwise the backup image is incomplete and a restore is impossible.

If the filesystem that the Images database resides on reaches 100% utilization, NetBackup can no longer record information about the backups it is attempting to process. The active backups can no longer write to the Images database, and the backup image headers of the active backups become corrupt. Typically, the backup image headers are re-written with a size of zero bytes. It is these zero-byte image headers that cause the "Bad image header" messages in the reports.

An unexpected shutdown (such as a power outage or system crash) or a disk device failure can also corrupt image header files. In these cases the image header file may be zero bytes in length, contain only spaces, or contain garbled data. This type of corruption can occur after the backup is complete - for example during duplication, copy expiration, or even while no NetBackup activity is occurring on the image (in the case of device failure).

Resolution:

There are several options available to resolve the problem of bad image header:

Option 1:
Under most circumstances, running the catalog consistency checking tool will remove the suspect bad image headers and place them in the /usr/openv/netbackup/db.corrupt folder. The tool can be run by executing the following command and redirecting the output to a file:

Unix: /usr/openv/netbackup/bin/bpdbm -consistency -move > /path_to_direct_output/consistency.out

Windows: \Veritas\netbackup\bin\bpdbm -consistency > \path_to_direct_output\consistency.out

Administrators should review the db.corrupt folder and consistency.out files and take corrective actions as needed to recreate the images database information such as importing media, modifying the files or restoring individual files.

Option 2:
Administrators may choose to manually remove the bad image headers to resolve the problem. This can be done by completing the following steps:

1. Shutdown the NetBackup master server

/usr/openv/netbackup/bin/goodies/bp.kill_all

2. Change directory to the affected host's subdirectory in the images database. Using the above example output from the reports, "sparky" is the affected host.

cd /usr/openv/netbackup/db/images/sparky

3. Change directory to the directory containing the affected backup image header. Given the file name of the backup image header (Unix_0896295338_INCR), you can determine which directory to change to.

cd 0896000000

4. Confirm that the backup image header is, in fact, zero bytes in size, contains all spaces or is otherwise obviously corrupted:

ls -l Unix_0896295338_INCR

5. Confirm if the backup image header's "files file" exists:

ls -l Unix_0896295338_INCR.f

6. Remove the backup image header file and its corresponding "files file" if it exists:

rm Unix_0896295338_INCR

rm Unix_0896295338_INCR.f

Complete steps 1 - 6 for each reported "Bad image header" and then restart NetBackup on the master server.
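When many headers are reported, steps 2 and 3 can be scripted. The Python sketch below builds the expected header path for each "Bad image header" report line. The directory rule (the 10-digit timestamp rounded down to a multiple of 1,000,000) is inferred from the examples in this post rather than taken from product documentation, and the function names are hypothetical. The sketch only prints paths; it does not remove anything.

```python
import re

REPORT = re.compile(r"^\S+\s+\S+\s+(\S+)\s+-\s+Bad image header:\s+(\S+)")

def image_directory(header_name):
    """Timestamp directory for a name such as Unix_0896295338_INCR
    (rule inferred from the examples: keep 4 digits, zero the last 6)."""
    m = re.search(r"_(\d{10})(?:_|$)", header_name)
    if not m:
        raise ValueError(f"no 10-digit timestamp in {header_name!r}")
    return m.group(1)[:4] + "000000"

def header_paths(report_lines, images_root="/usr/openv/netbackup/db/images"):
    """Expected header file path for each 'Bad image header' report line."""
    paths = []
    for line in report_lines:
        m = REPORT.match(line.strip())
        if m:
            host, name = m.groups()
            paths.append(f"{images_root}/{host}/{image_directory(name)}/{name}")
    return paths

sample = [
    "06/09/98 16:41:40 sparky - cleaning image DB",
    "06/09/98 16:41:42 sparky - Bad image header: Unix_0896295338_INCR",
    "06/09/98 16:42:01 sparky - Bad image header: Unix_0896704580_INCR",
]
for path in header_paths(sample):
    print(path)
```

Each printed path can then be checked with `ls -l` as in steps 4 and 5 before anything is deleted.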

Option 3:
If corruption of the image files occurs after completion of the backup, such as due to a hardware issue or unexpected power outage, the following steps may be used in order to recover the image files on 6.x systems. These directions must be followed exactly as failure to do so may cause other catalog corruption and should only be used after Options 1 and 2 are reviewed. Symantec NetBackup Support should be engaged if there are questions about this procedure.

1. Determine a time frame when the corruption occurred. If the problems database contains enough history, NetBackup administrators may be able to use the Problems report to find the first recorded occurrence of the "Bad Image Header" error.

2. Using the time of the first error reported, use the Backup, Archive and Restore interface to query for catalog backups of the type "NBU-Catalog" and find the catalog backup that occurred before the first error. Restore the suspect files to an alternate location and examine them manually for corruption. If the file in the most recent catalog backup is also corrupted, continue restoring from earlier catalog backups until a non-corrupted version of the image file is found.

Note: This is not a Hot Catalog recovery using the DR file - this is a standard restore to an alternate location.

3. Examine the contents of the restored image file for invalid media references (for example, any copies expired after the catalog backup was done would still exist in the restored image file). Manually edit the image file to remove any invalid media references. Once the file has been edited, it can be moved into the appropriate catalog directory.

4. Check whether the image is due to expire soon by reviewing the EXPIRATION tag in the image header and running /usr/openv/netbackup/bin/bpdbm -ctime on Unix or \Veritas\netbackup\bin\bpdbm -ctime on Windows. If the image is due to expire soon, create the touch file NOexpire in the following location:

Unix: /usr/openv/netbackup/bin
Windows: \Veritas\netbackup\bin

5. Verify the image to ensure that it is valid.

6. Take any steps necessary to protect the backup image. If images were duplicated after the catalog backup used in the restore, it may be necessary to extend the expiration date using the bpexpdate command or to duplicate the backup manually.

7. Remove the NOexpire file if it was created in step 4.

Prevention

Make sure that the NetBackup master server is protected from power outages via redundant power supply, such as a battery backup or other solution.
Keep the NetBackup master server current on critical OS and NetBackup patches.
Implement a maintenance routine for monitoring the status of disk devices used by the NetBackup master server.
Monitor disk usage, growing filesystems or adding disk space by other methods when needed.
