Recently I was taken aback to find that my Amanda backups didn’t contain files. Not even a single file! All that I had backed up were empty directories. Luckily it was part of a check and I didn’t lose anything. The incident, however, reminded me of the biggest problem with backups: you don’t want to realize that your backups are broken when you need to restore something. That motivated me to write a script that simulates what I would do when I need to get a backed-up version of something. Writing the script taught me a few things about threading in Python, which was totally unexpected, but exciting.
Naive Solution
amrecover is the utility that handles recovery of Amanda backups.
Unfortunately amrecover is geared toward interactive use and doesn’t provide
an easy way to automate. Nevertheless, the following worked:
# List the files available for recovery in the "disk" /etc for localhost
echo '
sethost localhost
setdisk /etc
ls
' | amrecover daily
But here’s the problem: amrecover’s ls command does not traverse into
subdirectories, so the above won’t show /etc/amanda/amanda.conf, for example.
Looking through man pages and exercising google-fu didn’t help. “Fine, I’ll
walk the subdirectories myself,” I thought, but to do that I needed something
to hold the directories and files. And because Bash is not exactly good at
data structures, it was time to switch to Python.
Same In Python
I had used Python to execute external programs before. To do what I wanted,
I would tell my Python script to execute amrecover, store the files in a
list, then run amrecover again for each subdirectory. So, for a start:
# Same thing as the Bash script above
from subprocess import Popen, PIPE

p = Popen(['amrecover', 'daily'], stdin=PIPE, stdout=PIPE, stderr=PIPE)
p.stdin.write('sethost localhost\n')
p.stdin.write('setdisk /etc\n')
p.stdin.write('ls\n')
output, error = p.communicate()
print output
print error
Getting To Subdirectories
It was fairly easy to parse the output to list subdirectories and files:
from os import path

def get_file_list(dir_path):
    files = []
    p = Popen(['amrecover', 'daily'], stdin=PIPE, stdout=PIPE)
    p.stdin.write('sethost localhost\n')
    p.stdin.write('setdisk /etc\n')
    p.stdin.write('cd %s\n' % dir_path)
    p.stdin.write('ls\n')
    output, _ = p.communicate()
    for line in output.splitlines():
        # Omitted: parse the line to get the_path
        if the_path.endswith(path.sep):
            # Doh, a recursive call
            files += get_file_list(the_path)
        else:
            # We found a file!
            files.append(the_path)
    return files
I tested and tested and tested. The code worked, and it worked consistently.
The only problem was that it seemed to be slower than it should be. My guess
was that it called amrecover multiple times, once for each subdirectory.
That’s not very efficient. If only I could read the output and add more input
at will. Too bad Popen.communicate won’t let me do that, because it waits
until the subprocess finishes.
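To see why, here is a minimal sketch, with cat standing in for amrecover
purely for illustration: communicate sends all the input, closes stdin and
waits for the process to exit, so there is no second round of commands.

# Minimal sketch of why communicate() is one-shot; 'cat' stands in for
# amrecover here purely for illustration.
from subprocess import Popen, PIPE

p = Popen(['cat'], stdin=PIPE, stdout=PIPE)
output, _ = p.communicate('ls\n')   # sends input, closes stdin, waits for exit
print output                        # 'ls\n' comes straight back from cat
# p.stdin.write('cd foo\n')         # too late: stdin is already closed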
Asynchronism
There are a few tricky things regarding reading from stdout and writing to
stdin of a subprocess at runtime:

- There is no way to find out whether there is anything in stdout worth
  reading. You don’t want to wait forever to read nothing.
- You need the subprocess to die eventually, either by closing stdin or by
  sending an exit command.
A timeout should be able to eliminate both problems, so I wanted to do this:
# Made-up code, won't work
files = []
p = Popen(['amrecover', 'daily'], stdin=PIPE, stdout=PIPE)
try:
    while True:
        line = read_line_with_timeout(p.stdout, timeout=1)
        # Omitted: parse line to get the_path
        if the_path.endswith(path.sep):
            p.stdin.write('cd %s\n' % the_path)
            p.stdin.write('ls\n')
        else:
            # We found a file!
            files.append(the_path)
except Timeout:
    pass
But that read_line_with_timeout ability is nowhere to be found. I also tried
fcntl, but it didn’t help because I do want the reading code to block until it
has read everything there is to read.
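For the record, the fcntl route looks roughly like this (a sketch, not code I
kept): with the pipe switched to non-blocking mode, a read hands back only
whatever happens to be buffered at that moment, or nothing at all.

# Rough sketch of the fcntl attempt: make the amrecover stdout pipe
# non-blocking, then read whatever is currently buffered.
import fcntl
import os
from subprocess import Popen, PIPE

p = Popen(['amrecover', 'daily'], stdin=PIPE, stdout=PIPE)

flags = fcntl.fcntl(p.stdout, fcntl.F_GETFL)
fcntl.fcntl(p.stdout, fcntl.F_SETFL, flags | os.O_NONBLOCK)

try:
    chunk = p.stdout.read()   # only what has arrived so far, maybe nothing
except IOError:               # EAGAIN when the pipe is empty right now
    chunk = ''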
In the end I gave up on finding a solution that works without threads, and
decided to try a clean-looking multi-threaded method described in this
StackOverflow answer. The idea is that you use a thread to read the
subprocess’s output and put it in a Queue. The main thread then uses the get
method of the queue, which has a timeout.
from Queue import Queue, Empty
from threading import Thread

def read_pipe_to_lines(stdout, lines):
    while True:
        # This is a blocking read, it waits until it receives something
        line = stdout.readline()
        lines.put(line)

# stdout here is the subprocess's output pipe (p.stdout in the snippets above)
lines = Queue()
read_thread = Thread(target=read_pipe_to_lines, args=(stdout, lines))
read_thread.daemon = True
read_thread.start()

try:
    while True:
        line = lines.get(True, 1)
        # Omitted: do things with line
except Empty:
    print 'no more line to read'
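Hooked up to amrecover, it looks roughly like this (a sketch; the drain helper
and the one-second timeout are illustrative choices of mine, not anything
amrecover dictates). The main thread can finally alternate between sending
commands and collecting whatever output has arrived:

# Rough sketch (Python 2, matching the snippets above) of driving amrecover
# interactively with the reader thread. 'drain' is an illustrative helper,
# not part of amrecover or the standard library.
from Queue import Queue, Empty
from subprocess import Popen, PIPE
from threading import Thread


def read_pipe_to_lines(stdout, lines):   # same helper as above
    while True:
        lines.put(stdout.readline())


def drain(lines, timeout=1):
    # Collect lines until nothing new shows up for `timeout` seconds
    collected = []
    try:
        while True:
            collected.append(lines.get(True, timeout))
    except Empty:
        return collected


p = Popen(['amrecover', 'daily'], stdin=PIPE, stdout=PIPE)
lines = Queue()
reader = Thread(target=read_pipe_to_lines, args=(p.stdout, lines))
reader.daemon = True
reader.start()

p.stdin.write('sethost localhost\n')
p.stdin.write('setdisk /etc\n')
p.stdin.write('ls\n')
print drain(lines)            # everything printed so far, including the listing

p.stdin.write('cd amanda\n')  # react to the output and go one level deeper
p.stdin.write('ls\n')
print drain(lines)

p.stdin.write('exit\n')       # let amrecover terminate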
Fake It To Make It
After finishing the lengthy multi-thread implementation, I looked back and
found that the code lacked a quality that I always look for: intuition. It
doesn’t work the way a human would work with the shell. A human types a
command, waits for it to finish, and decides what command to use next. The
problem is that it’s hard for my program to know when one command finishes.
A human can look at the prompt: its reappearance means that the previous
command has finished. However, the prompt is neither in stdout nor in stderr.
If that could be detected somehow, I could write a read_until_prompt function
that doesn’t need a timeout and doesn’t use another thread.
But I can fake it, hah! By executing a useless command and detecting its
output, I can use that output as a delimiter. The only annoyance is that the
read function now needs access to both stdin and stdout.
def read_all(in_stream, out_stream):
    # Add some junk input to get a delimiter
    INVALID_DIR = 'intentionally-invalid'
    INVALID_RES = 'Invalid directory - %s' % INVALID_DIR
    # enter_line (defined elsewhere in the script) writes one command to in_stream
    enter_line(in_stream, 'cd %s' % INVALID_DIR)
    # Real reading here
    lines = []
    while True:
        line = str(out_stream.readline().strip())
        if line == INVALID_RES:
            return lines
        lines.append(line)
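With read_all in place, the traversal needs only one amrecover process.
Roughly, the loop looks like this (a sketch, not the script verbatim:
enter_line is written out as the obvious one-liner, the parsing is still
omitted, and the starting path is just a placeholder):

# Sketch of the traversal on top of read_all: one amrecover process, one
# 'cd'/'ls' round trip per directory.
from os import path
from subprocess import Popen, PIPE


def enter_line(in_stream, command):
    in_stream.write(command + '\n')


def get_file_list(p, dir_path):
    files = []
    enter_line(p.stdin, 'cd %s' % dir_path)
    enter_line(p.stdin, 'ls')
    for line in read_all(p.stdin, p.stdout):
        # Placeholder: the real script parses the line to get the_path
        the_path = line
        if the_path.endswith(path.sep):
            files += get_file_list(p, the_path)   # same process, no new Popen
        else:
            files.append(the_path)
    return files


p = Popen(['amrecover', 'daily'], stdin=PIPE, stdout=PIPE)
enter_line(p.stdin, 'sethost localhost')
enter_line(p.stdin, 'setdisk /etc')
print get_file_list(p, '.')   # '.' as a starting point; exact path handling omitted
enter_line(p.stdin, 'exit')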
And the rest is just straightforward Python. In the end, all the “optimizations” did not solve the performance problem with very large volumes. I changed the script to traverse a random subdirectory instead of traversing everything. I lost the number-of-files statistics, but the script is fast enough to be practical.
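The change amounts to something like this (a sketch of one way to do it;
list_dir is a hypothetical helper standing in for the cd/ls/read_all round
trip above):

# Sketch of the random spot check: descend into one random subdirectory per
# level instead of walking all of them. list_dir is a hypothetical helper
# that does the cd/ls/read_all round trip and returns (subdirs, files).
import random


def spot_check(p, dir_path):
    subdirs, files = list_dir(p, dir_path)
    if subdirs:
        return spot_check(p, random.choice(subdirs))
    return random.choice(files) if files else None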
I can finally submit to Hacker News. Your comments can go there :)