Recently I was taken aback to find that my Amanda backups didn’t contain files. Not even a single file! All that I had backed up were empty directories. Luckily it was part of a check and I didn’t lose anything. The incident, however, reminded me of the biggest problem with backups: you don’t want to realize that your backups are broken when you need to restore something. That motivated me to write a script that simulates what I will do when I need to get a backed up version of something. Writing the script taught me a few things about threading in Python, which was totally unexpected, but exciting.
`amrecover` is the utility that handles recovery of Amanda backups. It is geared toward interactive use and doesn’t provide an easy way to automate. Nevertheless, the following worked:
```sh
# List the files available for recovery in the "disk" /etc for localhost
echo '
sethost localhost
setdisk /etc
ls
' | amrecover daily
```
But here’s the problem: the `ls` command does not traverse into subdirectories, so the above won’t show `/etc/amanda/amanda.conf`, for example.
Looking through man pages and exercising google-fu didn’t help. “Fine, I’ll walk the subdirectories myself.” But to do that, I need something to hold the directories and files. And because Bash is not exactly good at data structures, it was time to switch to Python.
## Same In Python
I had used Python to execute external programs before. To do what I wanted, I would tell my Python script to execute `amrecover`, store the files in a list, then run `amrecover` again for each subdirectory. So, for a start:
```python
from subprocess import Popen, PIPE

# Same thing as the Bash script above
p = Popen(['amrecover', 'daily'], stdin=PIPE, stdout=PIPE)
p.stdin.write('sethost localhost\n')
p.stdin.write('setdisk /etc\n')
p.stdin.write('ls\n')
output, error = p.communicate()
print(output)
print(error)
```
## Getting To Subdirectories
It was fairly easy to parse the output to list subdirectories and files:
```python
def get_file_list(dir_path):
    files = []
    p = Popen(['amrecover', 'daily'], stdin=PIPE, stdout=PIPE)
    p.stdin.write('sethost localhost\n')
    p.stdin.write('setdisk /etc\n')
    p.stdin.write('cd %s\n' % dir_path)
    p.stdin.write('ls\n')
    output, _ = p.communicate()
    for line in output.splitlines():
        # Omitted: parse the output to get the_path
        if the_path.endswith(path.sep):
            # Doh, a recursive call
            files += get_file_list(the_path)
        else:
            # We found a file!
            files.append(the_path)
    return files
```
I tested and tested and tested. The code worked, and it worked consistently. The only problem was that it seemed to be slower than it should be. My guess was that it called `amrecover` multiple times, once for each subdirectory. That’s not very efficient. If only I could read the output and add more input at will.
`Popen.communicate` won’t let me do that, because it waits until the subprocess exits before returning any output.
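A minimal demonstration of that one-shot behavior (my own sketch, with `cat` standing in for `amrecover`):

```python
from subprocess import Popen, PIPE

# communicate() closes stdin, waits for the subprocess to exit, and
# only then hands back all of the output in one go. There is no way
# to look at 'first' and then decide to send a third line.
p = Popen(['cat'], stdin=PIPE, stdout=PIPE, universal_newlines=True)
p.stdin.write('first\n')
p.stdin.write('second\n')
out, err = p.communicate()
print(out)
```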
There are a few tricky things regarding reading from `stdout` and writing to `stdin` of a subprocess at runtime:

- There is no way to find out whether there is anything in `stdout` worth reading. You don’t want to wait forever to read nothing.
- You need the subprocess to die eventually, either by closing `stdin` or by sending an exit command.
A timeout should be able to eliminate both problems, so I wanted to do this:
```python
# Made-up code, won't work
files = []
p = Popen(['amrecover', 'daily'], stdin=PIPE, stdout=PIPE)
try:
    while True:
        line = read_line_with_timeout(p.stdout, timeout=1)
        # Omitted: parse line to get the_path
        if the_path.endswith(path.sep):
            p.stdin.write('cd %s\n' % the_path)
            p.stdin.write('ls\n')
        else:
            # We found a file!
            files.append(the_path)
except Timeout:
    pass
```
A `read_line_with_timeout` ability is nowhere to be found. I also tried `fcntl`, but it didn’t help, because I do want the reading code to block until it has read everything there is to read.
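For the record, the `fcntl` route looks roughly like this (my reconstruction, not the post’s code): set `O_NONBLOCK` on the pipe so reads return immediately instead of blocking. The catch is exactly the one described above: a non-blocking read gives up even when more output is still on the way, so you end up polling.

```python
import fcntl
import os
from subprocess import Popen, PIPE

p = Popen(['echo', 'hello'], stdout=PIPE)
fd = p.stdout.fileno()
# Flip the pipe's file descriptor into non-blocking mode
flags = fcntl.fcntl(fd, fcntl.F_GETFL)
fcntl.fcntl(fd, fcntl.F_SETFL, flags | os.O_NONBLOCK)
p.wait()                 # let echo finish so the pipe has data
data = p.stdout.read()   # returns immediately; no data means no wait
print(data)
```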
In the end I gave up on finding a solution that works without threads, and decided to try a clean-looking multi-thread method described in this StackOverflow answer. The idea is that you use a thread to read the subprocess’s output and put it in a `Queue`. The main thread then uses the queue’s `get` method, which has a timeout.
```python
from Queue import Queue, Empty
from threading import Thread

def read_pipe_to_lines(stdout, lines):
    while True:
        # This is a blocking read, it waits until it receives something
        line = stdout.readline()
        lines.put(line)

lines = Queue()
read_thread = Thread(target=read_pipe_to_lines, args=(stdout, lines))
read_thread.daemon = True
read_thread.start()
try:
    while True:
        line = lines.get(True, 1)
        # Omitted: do things with line
except Empty:
    print('no more line to read')
```
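To convince myself the queue trick works end to end, here is a self-contained version of the same idea (my own sketch, with `cat` standing in for `amrecover`):

```python
from subprocess import Popen, PIPE
from threading import Thread
try:
    from Queue import Queue, Empty   # Python 2
except ImportError:
    from queue import Queue, Empty   # Python 3

def read_pipe_to_lines(stdout, lines):
    while True:
        line = stdout.readline()
        if not line:      # EOF: the subprocess closed its end
            break
        lines.put(line)

# cat echoes whatever it receives, line by line
p = Popen(['cat'], stdin=PIPE, stdout=PIPE, universal_newlines=True)
lines = Queue()
read_thread = Thread(target=read_pipe_to_lines, args=(p.stdout, lines))
read_thread.daemon = True
read_thread.start()

p.stdin.write('alpha\n')
p.stdin.write('beta\n')
p.stdin.flush()

received = []
try:
    while True:
        received.append(lines.get(True, 1))  # give up after 1 second
except Empty:
    pass
p.stdin.close()
p.wait()
print(received)
```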
## Fake It To Make It
After finishing the lengthy multi-thread implementation, I looked back and found that the code lacks a quality that I always look for: intuition. It doesn’t work the way a human works with the shell. A human types a command, waits for it to finish, and decides what command to use next. The problem is that it’s hard for my program to know when one command finishes. A human can look at the prompt: its reappearance means that the previous command has finished. However, the prompt is in neither `stdout` nor `stderr`. If it could be detected somehow, I could write a read function that doesn’t need a timeout and doesn’t use another thread.
But I can fake it, hah! By executing a useless command and detecting its output, I can use that output as a delimiter. The only annoyance is that now the read function needs access to both `stdin` and `stdout`:
```python
def read_all(in_stream, out_stream):
    # Add some junk input to get a delimiter
    INVALID_DIR = 'intentionally-invalid'
    INVALID_RES = 'Invalid directory - %s' % INVALID_DIR
    enter_line(in_stream, 'cd %s' % INVALID_DIR)
    # Real reading here
    lines = []
    while True:
        line = str(out_stream.readline().strip())
        if line == INVALID_RES:
            return lines
        lines.append(line)
```
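Here is a self-contained sketch of the same delimiter trick (mine, not the script’s code): `cat` stands in for `amrecover`, and a made-up sentinel line plays the role of the “Invalid directory” message, since `cat` echoes whatever we write back to us.

```python
from subprocess import Popen, PIPE

SENTINEL = '--end-of-output--'

def read_all(in_stream, out_stream):
    # The sentinel we write comes straight back, marking the point
    # where all earlier output has been read
    in_stream.write(SENTINEL + '\n')
    in_stream.flush()
    lines = []
    while True:
        line = out_stream.readline().strip()
        if line == SENTINEL:
            return lines
        lines.append(line)

p = Popen(['cat'], stdin=PIPE, stdout=PIPE, universal_newlines=True)
p.stdin.write('one\n')
p.stdin.write('two\n')
result = read_all(p.stdin, p.stdout)
print(result)   # no timeout and no extra thread needed
p.stdin.close()
p.wait()
```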
And the rest is just straightforward Python. In the end, all the “optimizations” did not solve the performance problem with very large volumes. I changed the script to traverse one randomly chosen subdirectory instead of traversing everything. I lost the number-of-files statistics, but the script is fast enough to be practical.
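The random spot-check can be sketched like this (a reconstruction; `get_subdir_names` and `get_file_names` are hypothetical stand-ins for the parsed `ls` output of the real script):

```python
import random

def spot_check(dir_path, get_subdir_names, get_file_names):
    # Descend into one randomly chosen subdirectory per level,
    # instead of recursing into every one of them
    while True:
        subdirs = get_subdir_names(dir_path)
        if not subdirs:
            return dir_path, get_file_names(dir_path)
        dir_path = random.choice(subdirs)

# Tiny fake directory tree to exercise the walk
tree = {'/etc/': ['/etc/amanda/'], '/etc/amanda/': []}
leaves = {'/etc/amanda/': ['amanda.conf']}
path, files = spot_check('/etc/',
                         lambda d: tree.get(d, []),
                         lambda d: leaves.get(d, []))
print(path, files)
```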
I can finally submit to Hacker News. Your comments can go there :)