Recently I was taken aback to find that my Amanda backups didn’t contain any files. Not even a single one! All I had backed up were empty directories. Luckily this was part of a routine check and I didn’t lose anything. The incident, however, reminded me of the biggest problem with backups: you don’t want to discover that your backups are broken at the moment you need to restore something. That motivated me to write a script that simulates what I would do when I need to get a backed-up version of something. Writing the script taught me a few things about threading in Python, which was totally unexpected, but exciting.

Naive Solution

amrecover is the utility that handles recovery of Amanda backups. Unfortunately, amrecover is geared toward interactive use and doesn’t provide an easy way to automate it. Nevertheless, the following worked:

# List the files available for recovery in the "disk" /etc for localhost
echo '
    sethost localhost
    setdisk /etc
    ls
' | amrecover daily

But here’s the problem: amrecover’s ls command does not traverse into subdirectories, so the above won’t show /etc/amanda/amanda.conf, for example. Looking through man pages and exercising google-fu didn’t help. “Fine, I’ll walk the subdirectories myself,” but to do that I needed something to hold the directories and files. And because Bash is not exactly good at data structures, it was time to switch to Python.

Same In Python

I had used Python to execute external programs before. To do what I wanted, I would tell my Python script to execute amrecover, store the files in a list, then run amrecover again for each subdirectory. So, for a start:

# Same thing as the Bash script above
from subprocess import Popen, PIPE

p = Popen(['amrecover', 'daily'], stdin=PIPE, stdout=PIPE, stderr=PIPE)
p.stdin.write('sethost localhost\n')
p.stdin.write('setdisk /etc\n')
p.stdin.write('ls\n')
output, error = p.communicate()
print output
print error

Getting To Subdirectories

It was fairly easy to parse the output to list subdirectories and files:

from os import path
from subprocess import Popen, PIPE

def get_file_list(dir_path):
    files = []
    p = Popen(['amrecover', 'daily'], stdin=PIPE, stdout=PIPE)
    p.stdin.write('sethost localhost\n')
    p.stdin.write('setdisk /etc\n')
    p.stdin.write('cd %s\n' % dir_path)
    p.stdin.write('ls\n')
    output, _ = p.communicate()
    for line in output.splitlines():
        # Omitted: parse the line to get the_path
        if the_path.endswith(path.sep):
            # Doh, a recursive call
            files += get_file_list(the_path)
        else:
            # We found a file!
            files.append(the_path)
    return files

I tested and tested and tested. The code worked, and it worked consistently. The only problem was that it seemed to be slower than it should be. My guess was that it called amrecover multiple times, once for each subdirectory. That’s not very efficient. If only I could read the output and add more input at will. Too bad Popen.communicate won’t let me do that, because it waits until the subprocess finishes.

Asynchronism

There are a few tricky things regarding reading from stdout and writing to stdin of a subprocess at runtime:

  • There is no obvious way to find out whether there is anything left in stdout worth reading, and you don’t want to wait forever to read nothing.
  • You need the subprocess to die eventually, either by closing stdin or by sending an exit command.

A timeout should be able to eliminate both problems, so I wanted to do this:

# Made-up code, won't work
files = []
p = Popen(['amrecover', 'daily'], stdin=PIPE, stdout=PIPE)
try:
    while True:
        line = read_line_with_timeout(p.stdout, timeout=1)
        # Omitted: parse line to get the_path
        if the_path.endswith(path.sep):
            p.stdin.write('cd %s\n' % the_path)
            p.stdin.write('ls\n')
        else:
            # We found a file!
            files.append(the_path)
except Timeout:
    pass

But that read_line_with_timeout ability is nowhere to be found. I also tried fcntl, but it didn’t help, because I do want the reading code to block until it has read everything there is to read.
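
For the curious, this is roughly what the fcntl experiment looked like, a minimal sketch assuming a Unix pipe. With O_NONBLOCK set, readline comes back immediately instead of blocking, which pushes you into a polling loop:

import fcntl
import os
from subprocess import Popen, PIPE

p = Popen(['amrecover', 'daily'], stdin=PIPE, stdout=PIPE)

# Put the stdout pipe into non-blocking mode
fd = p.stdout.fileno()
flags = fcntl.fcntl(fd, fcntl.F_GETFL)
fcntl.fcntl(fd, fcntl.F_SETFL, flags | os.O_NONBLOCK)

p.stdin.write('sethost localhost\n')
p.stdin.write('setdisk /etc\n')
p.stdin.write('ls\n')

try:
    line = p.stdout.readline()  # comes back right away, with or without data
except IOError:
    line = ''  # nothing available yet -- I would have to poll in a loop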

In the end I gave up on finding a solution that works without threads, and decided to try a multi-threaded method that looked clean, described in this StackOverflow answer. The idea is that a separate thread reads the subprocess’s output and puts it in a Queue. The main thread then uses the queue’s get method, which supports a timeout.

from Queue import Queue, Empty
from subprocess import Popen, PIPE
from threading import Thread

def read_pipe_to_lines(stdout, lines):
    while True:
        # This is a blocking read, it waits until it receives something
        line = stdout.readline()
        lines.put(line)

p = Popen(['amrecover', 'daily'], stdin=PIPE, stdout=PIPE)
lines = Queue()
read_thread = Thread(target=read_pipe_to_lines, args=(p.stdout, lines))
read_thread.daemon = True
read_thread.start()

try:
    while True:
        line = lines.get(True, 1)
        # Omitted: do things with line
except Empty:
    print 'no more line to read'

Fake It To Make It

After finishing the lengthy multi-threaded implementation, I looked back and found that the code lacks a quality I always look for: intuitiveness. It doesn’t work the way a human would work with the shell. A human would type a command, wait for it to finish, and then decide what command to run next. The problem is that it’s hard for my program to know when one command finishes. A human can look at the prompt: its reappearance means that the previous command has finished. However, the prompt appears in neither stdout nor stderr. If I could detect it somehow, I could write a read_until_prompt function that needs neither a timeout nor another thread.

But I can fake it, hah! By executing a deliberately useless command and detecting its output, I can use that output as a delimiter. The only annoyance is that the read function now needs access to both stdin and stdout.

def read_all(in_stream, out_stream):
    # Add some junk input to get a delimiter
    INVALID_DIR = 'intentionally-invalid'
    INVALID_RES = 'Invalid directory - %s' % INVALID_DIR
    enter_line(in_stream, 'cd %s' % INVALID_DIR)
    # Real reading here
    lines = []
    while True:
        line = str(out_stream.readline().strip())
        if line == INVALID_RES:
            return lines
        lines.append(line)
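
To give an idea of how it all fits together, here is a sketch of a single amrecover session driven through read_all; a minimal enter_line that just writes the command plus a newline (and flushes) is enough for this:

from subprocess import Popen, PIPE

def enter_line(in_stream, command):
    # Send one command to amrecover
    in_stream.write(command + '\n')
    in_stream.flush()

p = Popen(['amrecover', 'daily'], stdin=PIPE, stdout=PIPE)
enter_line(p.stdin, 'sethost localhost')
enter_line(p.stdin, 'setdisk /etc')
enter_line(p.stdin, 'ls')
for line in read_all(p.stdin, p.stdout):
    # Omitted: parse line to get the_path, as before
    print line
p.stdin.close()  # closing stdin lets amrecover see EOF and exit
p.wait()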

And the rest is just straightforward Python. In the end, all the “optimizations” did not solve the performance problem on very large volumes. I changed the script to pick a random subdirectory to traverse instead of traversing everything. I lost the number-of-files statistics, but the script is fast enough to be practical.
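
The spot-check logic itself is tiny; a minimal sketch, where subdirs stands for whatever subdirectory paths were parsed from the current ls output:

import random

def pick_random_subdir(subdirs):
    # Follow one random branch instead of recursing into every subdirectory
    if not subdirs:
        return None
    return random.choice(subdirs)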

I can finally submit to Hacker News. Your comments can go there :)