A Monday morning we found several messages like this from Bacula don t shoot the messenger : 09-Sep 18:27 Tape-A JobId 15196: Please mount Volume "A0000038" or label a new one for: Job: sites.2012-09-08_01.50.03_40 Storage: "LTO-4" (/dev/nst0) Pool: Default Media type: LTO-4
That was Bacula complaining about the wrong volume on the drive. However, after examining the status of storage the weird thing was that the tape with barcode “A0000038″ was actually mounted on the drive, it was purged and had 0 bytes used… WTF?
We started suspecting our tape rotation algorithm was not taking into account don t shoot the messenger that maybe purged tapes were not used if data retention hadn’t expired, but Automatic Volume Recycling says otherwise and some tests on other purged tapes with expired data retention failed too.
Several days of searching similar issues later, it seemed clear that there was something wrong only with some purged tapes from the media catalog. So i came up with a quick test for each one of our purged tapes in order to make sure that they were labeled correctly. First, i loaded “A0000038″ and read its label: # mtx -f /dev/sg4 load 22 Loading media from Storage Element 22 into drive 0...done # dd if=/dev/nst0 bs=65000 count=1 | od -a | head 0+0 records in 0+0 records out 0 bytes (0 B) copied, 3.81024 s, 0.0 kB/s 0000000
You can see that it failed to find any Bacula label, so next i tried the same test with another tape in full state that had been used the day before: # mtx -f /dev/sg4 load 16 Loading media from Storage Element 16 into drive 0...done # dd if=/dev/nst0 bs=65000 count=1 | od -a | head 0+1 records in 0+1 records out 64512 bytes (65 kB) copied, 3.73563 s, 17.3 kB/s 0000000 G ; = < nul nul nul D nul nul nul nul B B 0 2 0000020 nul nul stx gs L etx O sub del del del ~ nul nul nul nul 0000040 don t shoot the messenger nul nul nul 240 B a c u l a 1 . 0 i 0000060 m m o r t a l nl nul nul nul nul vt nul eot dc3 0000100 1 8 O si ff nul eot dc3 J + W stx p nul nul nul 0000120 nul nul nul nul nul nul nul nul nul nul nul nul nul A 0 0 0000140 0 0 0 0 6 nul nul D e f a u l t nul B 0000160 a c k u p nul L T O - 4 nul v m - h 0000200 o s t - 4 nul T a p e - A nul V e r 0000220 . 5 . 0 . 1 2 4 F e b r u
This time the Bacula label was found on the tape (the Bacula 1.0 immortal string is a clue), so we were now on the right track. Then i reviewed our wiki for the tape change procedure and there was an optional step in the tape purging section don t shoot the messenger explaining how to blank a tape from another Bacula director before labeling it for this one. Therefore, it seemed that someone had been blanking the tapes after purging them and Bacula didn’t know what to make with them even though they had the right barcodes.
The fix was quick: i tested every purged tape to make a list of the blanked ones, then deleted them from the catalog and labeled them back from the bconsole : * delete volume="A0000038" * label barcodes slots=22 pool=Default
Related This entry was posted in english , sysadmin and tagged bacula on 1 octubre 2012 by alexm . Navegació per l'entrada ← etckeeper en mode silenciós El futur del correu a la UPC →
alsa apport backscatter backup don t shoot the messenger backups bacula barcelona barcelonatech brusel les bzr catalan clamav cloud community cpl cron cyrus disseny ubuntaire dkms drag drumbeat ecryptfs educació etc etckeeper eucalyptus fai festival fluiddb don t shoot the messenger fosdem free software funcionariat gtd infiniband iwarp jack juju karaoke karmic kde kernel kvm locoteam maas mail mozilla nagios nul ong orchestra pair papercuts perl postfix postmaster pulseaudio puppet rabbitmq server shield sieve team thrust tic tm4sa ubuntu uds upc upstream vcs web zeitgeist Arxius juny 2014 maig 2014 juny 2013 octubre 2012 febrer 2012 gener 2012 octubre 2010 abril 2010 gener 2010 setembre 2009 juliol 2009 juny 2009 maig 2009
%d bloggers like this:
No comments:
Post a Comment