reducible complexity

Friday, January 17, 2014

New version of SeqTrace, 0.9.0, now available!

I am pleased to announce that a new version of SeqTrace, my free and open source software for viewing and processing Sanger DNA sequencing trace files, is now available. It had been nearly two years since the 0.8.1 release of SeqTrace, and over the Christmas break I decided it was finally time to get a new release finished up and out the door. So I went on a major coding binge and now, many hours of work later, SeqTrace 0.9.0 is done (actually, it was done a few days ago!).

SeqTrace 0.9.0 includes many new features and improvements in comparison to 0.8.1 as well as a few bug fixes. Some of the most important upgrades include full support for all IUPAC nucleotide codes, a new algorithm for computing consensus sequences from matched forward and reverse traces that is based on Bayesian statistical methods, the ability to search for PCR primers in the trace sequences and display the primers along with the sequence data, an algorithm for trimming primers (and trailing bases) from consensus sequences, and synchronized scrolling of matched forward and reverse traces. This last feature is a big improvement when navigating paired forward and reverse sequencing traces, at least in my opinion. There are many other improvements, too, which you can read about in SeqTrace's release history.

I want to thank the users of SeqTrace who took the time to send me feedback and suggestions regarding the previous version of the software. Your comments were very helpful in deciding what to focus on for 0.9.0!

Saturday, October 19, 2013

Reverting to a previous revision with Subversion

One of the great advantages of using a version control system to manage source code is that you never lose the history of changes to your software project. You can always see what a file looked like at any point in the past. Sometimes, you need to take this a step further and actually revert a file (or files) back to an earlier version.

Taking a file back in time is only rarely necessary (in my experience, anyway – I suppose it depends on how a project is managed). As a result, it is one of those things that I always forget how to do by the next time I need to do it. So, this is a short post discussing how to achieve this with Subversion.

First, what about the Subversion command svn revert? To avoid any confusion: This command simply reverts any local changes (i.e., uncommitted changes) to a working copy of one or more files. Thus, it does not make any changes to the repository and therefore does not do what we need.

The key to reverting a file to a previous revision is the svn merge command. The general idea is to do what is known as a "reverse merge." In a nutshell, you tell svn merge to make your local working copy look like a previous revision, then use svn commit to send these changes to the repository.

Suppose that the latest revision of a project is 270, and we need to roll all of the files in some directory back to revision 241. Here is the merge command to do this.

svn merge -r 270:241 https://repository.location/path/to/directory

The key is the argument "-r 270:241", where we specify the two revisions to compare. Here, we're finding the differences between the latest revision, 270 (the "left side" of the comparison), and revision 241 (the "right side" of the comparison), and then applying those differences to the local working copy (assumed to be the current directory in this form of the command). In effect, this undoes every change committed after revision 241.

If the merge is successful, then svn commit can be used to make the changes official, and you're done.

The additional argument "--dry-run" is also useful to know. By adding this to the end of the svn merge command, you can see a summary of what changes svn merge would make without actually altering any local files. Of course, if you do the merge and then decide you made a mistake, you can simply run svn revert to undo the changes to your working copy.
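If you do this often enough to want a script, the command line can be assembled programmatically. The following is only a sketch in Python: the function name reverse_merge_command is hypothetical, and it just builds the argument list for the svn merge command shown above without running anything.

```python
def reverse_merge_command(from_rev, to_rev, target, dry_run=False):
    """Build the argument list for 'svn merge -r FROM:TO TARGET'.

    Hypothetical helper for illustration; it does not run svn itself.
    """
    if to_rev >= from_rev:
        raise ValueError("a reverse merge needs to_rev < from_rev")
    cmd = ["svn", "merge", "-r", "%d:%d" % (from_rev, to_rev), target]
    if dry_run:
        # Preview the changes without touching the working copy.
        cmd.append("--dry-run")
    return cmd


if __name__ == "__main__":
    url = "https://repository.location/path/to/directory"
    preview = reverse_merge_command(270, 241, url, dry_run=True)
    print(" ".join(preview))
    # To actually run it (requires svn and a working copy in the current
    # directory), one option is: import subprocess; subprocess.check_call(preview)
```

Previewing with dry_run=True first, and only then running the real merge followed by svn commit, mirrors the workflow described above.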

For more details, see the section "Advanced Merging" in the official SVN book, Version Control with Subversion.

Sunday, September 22, 2013

Getting the total shutter actuation count from a Canon DSLR

Given enough use, the shutter mechanism on digital SLR cameras (or any camera with a shutter) will eventually fail. How many pictures can you take before that happens? Browsing the data at the Camera Shutter Life Database suggests that the actual number of actuations prior to camera death is highly variable, but you should expect to get tens of thousands of photos, and possibly one hundred thousand or more, out of most DSLRs.

As far as I know, all DSLR cameras keep an internal count of their total number of shutter actuations. This number can be of interest to the camera owner for a variety of reasons. Perhaps you're just curious how many photos you've taken since you acquired a camera. Or maybe you'd like to see if your camera is reaching its life expectancy and it might be time to look for a replacement. If you are shopping for a used camera body, the shutter actuation count is a bit like the mileage on a used car – it gives you a sense for how heavily the camera has been used.

Recently, I was curious about how to get this information out of a Canon DSLR body. I discovered that there is a lot of conflicting, and sometimes flat-out wrong, information about this floating around on the Web. So when I found a working solution, I thought it'd be worth sharing. Now, if you have a Nikon DSLR, getting the shutter actuation count is easy. Nikon cameras are nice enough to add this information to the EXIF data of every JPEG image they generate. All you have to do is inspect the EXIF data!

Despite numerous postings online claiming that Canon cameras also do this, they do not (at least, none that I know of). With some searching, you can find links to several freeware programs (mostly Windows-only) or websites that claim to be able to extract this information from Canon cameras. I haven't tried any of them, though, because once again, GNU/Linux and open-source software came to the rescue with an incredibly simple solution. I must acknowledge this forum thread because it pointed me to the solution I'll describe below.

First, you need to have the program gphoto2 installed. If you run Debian or a Debian-based Linux distribution, such as Ubuntu, installing gphoto2 is as simple as opening a terminal window and running this command.

sudo apt-get install gphoto2

Then, connect your camera to your computer's USB port and turn on the camera. Back at the terminal window, run this command.

gphoto2 --get-config /main/status/shuttercounter

If everything worked correctly, you should get output similar to the following.

Label: Shutter Counter
Type: TEXT
Current: 11892

The number at the end is the total number of shutter actuations. It couldn't be much easier than that!
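If you want to log the count from a script, the value can be pulled out of that output with a few lines of Python. This is a minimal sketch that assumes the output has the three-line format shown above; the function name parse_shutter_count is made up for illustration, and the sample text is simply the example output from this post.

```python
import re


def parse_shutter_count(gphoto2_output):
    """Return the integer on the 'Current:' line, or None if there is none."""
    match = re.search(r"^Current:\s*(\d+)\s*$", gphoto2_output, re.MULTILINE)
    return int(match.group(1)) if match else None


# Sample text copied from the example output above.
sample_output = """Label: Shutter Counter
Type: TEXT
Current: 11892
"""

print(parse_shutter_count(sample_output))
```

With a camera attached, the text could presumably be captured using subprocess.check_output(["gphoto2", "--get-config", "/main/status/shuttercounter"]) and decoded before parsing, although that step is not exercised here.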

Saturday, March 16, 2013

Using Java and the Jersey library to process multipart/form-data with "arrays" of form elements

Yes, that title is a mouthful, but I couldn't think of a more succinct way to describe the problem. Sometimes, it's very useful to create an HTML form that contains multiple input elements with the same name. Then, on the server, you would like to be able to treat the values of these elements as an array. This is generally fairly straightforward in most server-side programming languages, including Java with the Jersey library (a standard way to build RESTful Web services in Java). However, if your form data are encoded as "multipart/form-data" (the norm for file uploads), then it turns out that achieving the desired functionality using Java and Jersey is not as straightforward as you might think. Perhaps I didn't search in the right places, but I found relatively little helpful information on the Web (and some folks simply concluded it wasn't possible!). So, here is one working solution to the problem, and I hope this might save someone else the time that I wasted figuring this out.

To illustrate the problem, suppose you need to process form data that include a file upload along with some other information, perhaps a keyword designation for the file. Your HTML <form> element might look something like this.

<form id="upload" enctype="multipart/form-data" method="post" action="upload">
    Keyword: <input name="keyword" /><br />
    File:<br/>
    <input type="file" name="file" size="44"/><br/>
    <input type="submit" value="Upload file" /><br />
</form>

And the (simplified) Java code you use to handle the form data might look like this.

@POST
@Path("upload")
@Consumes(MediaType.MULTIPART_FORM_DATA)
public DataFile upload(
    @FormDataParam("keyword") String keyword,
    @FormDataParam("file") InputStream file_in,
    @FormDataParam("file") FormDataContentDisposition contentdisp)
    throws Exception {

    // Do something with the form data... 
}

That should all work fine, but what if you want to let users provide an arbitrary number of keywords, using a separate input box for each keyword? You could write some JavaScript to allow the user to click a "more keywords" button that adds more input boxes to the form. Then, your form would effectively look something like the following.

<form id="upload" enctype="multipart/form-data" method="post" action="upload">
    Keyword: <input name="keywords" /><br />
    Keyword: <input name="keywords" /><br />
    Keyword: <input name="keywords" /><br />
    <!-- Could be any number of keyword elements here... -->
    File:<br/>
    <input type="file" name="file" size="44"/><br/>
    <input type="submit" value="Upload file" /><br />
</form>

How should you handle this on the server side? The obvious solution is to specify a parameter that is a List, like this.

@POST
@Path("upload")
@Consumes(MediaType.MULTIPART_FORM_DATA)
public DataFile upload(
    @FormDataParam("keywords") List<String> keywords,
    @FormDataParam("file") InputStream file_in,
    @FormDataParam("file") FormDataContentDisposition contentdisp)
    throws Exception {

    // Do something with the form data... 
}

Unfortunately, this straightforward approach doesn't work for MediaType.MULTIPART_FORM_DATA, even though it works fine for other media types. The trick is to replace the "keywords" parameter with a List of FormDataBodyPart objects. Then, we can extract the keyword String from each FormDataBodyPart. In the following example, the extracted keyword strings are placed into another List.

@POST
@Path("upload")
@Consumes(MediaType.MULTIPART_FORM_DATA)
public DataFile upload(
    @FormDataParam("keywords") List<FormDataBodyPart> bparts,
    @FormDataParam("file") InputStream file_in,
    @FormDataParam("file") FormDataContentDisposition contentdisp)
    throws Exception {
    // Get the keyword strings.
    ArrayList<String> keywords = new ArrayList<String>();
    for (FormDataBodyPart bpart : bparts)
        keywords.add(bpart.getValueAs(String.class));

    // Do something with the form data... 
}

And that should work. It's a bit awkward, but still a fairly simple solution. Finally, I must acknowledge these two threads, which provided the clues I needed to figure this out.

Friday, December 21, 2012

Using Python array.arrays efficiently

In Python, the list is the workhorse "array-like" data structure, but the standard library does provide an alternative: the array, which is defined in a module of the same name. Most of the time, lists are the best (and certainly the most versatile) choice, but the array.array is useful any time you need a thin Python veneer on top of a C-style array. For instance, perhaps you need the memory efficiency of a fixed-type array structure, or you need to create or process simple, byte-packed, sequential data.

The array.array was perfect for a project I am working on, and I was curious about how to use it most efficiently. For the sorts of situations where array.array is useful, it is quite likely that you will know exactly how much storage space you need by the time you are ready to instantiate the array object. Consequently, I was surprised to see that the array constructor does not include a parameter for specifying an initial capacity. Because array.array guarantees that all elements are contiguous in memory, increasing the array's size could require copying all previous array elements, which is relatively expensive. Specifying an initial size is an obvious way to avoid this inefficiency.

The constructor does have an optional argument, though, that lets you provide an "initializer" -- an object that provides initial values for the array. Does an initializer improve the array's performance?

To test this, I wrote some simple code that creates an array of unsigned chars with 10,000,000 elements and assigns a value to each element (20, in this case). The first way to do this is without the initializer.

arr = array.array('B')
for i in xrange(10000000):
    arr.append(20)

The second option is to initialize the array first, using a list. Note that if all we wanted to do was assign a single value to all array elements, the initializer would be all we needed! The point of this exercise, though, is to test the performance for situations where we must separately assign each array element its own value. So just imagine that rather than assigning 20 to each element, we are assigning something unpredictable and more meaningful.

arr = array.array('B', [0]*10000000)
for i in xrange(10000000):
    arr[i] = 20

I ran each of these code snippets 100 times (in Python 2.7), timed each run (using time.clock()), and calculated the mean run time. The first option, with no initializer, took an average of 1.73 seconds. The second option took 1.49 seconds, on average. Using the initializer is about 14% faster.

So using an initializer definitely makes the subsequent write operations faster. What about using a tuple, rather than a list, as the initializer? Tuples should have less overhead, since they are read-only. Using a tuple as the initializer requires only a small change to the code.

arr = array.array('B', (0,)*10000000)
for i in xrange(10000000):
    arr[i] = 20

This version took 1.46 seconds to run, on average. So using a tuple as an initializer might be slightly faster than a list, but the difference is negligible and perhaps not even statistically significant. The important lesson here is that if you need to assign a large number of values to an array.array, you should initialize it first.
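To repeat this comparison on your own machine, a small harness along the following lines should do. It is a sketch rather than the exact code used for the numbers above: it uses the standard timeit module instead of time.clock(), uses range so that it also runs on Python 3, and uses a much smaller array (the constant N) so that it finishes quickly. The function names are chosen here for illustration, and absolute timings will differ between interpreters and machines.

```python
import array
import timeit

N = 100000  # far smaller than the 10,000,000 elements used above


def fill_by_append():
    # No initializer: grow the array one element at a time.
    arr = array.array('B')
    for _ in range(N):
        arr.append(20)
    return arr


def fill_with_list_initializer():
    # Initialize from a list of zeros, then overwrite each element.
    arr = array.array('B', [0] * N)
    for i in range(N):
        arr[i] = 20
    return arr


def fill_with_tuple_initializer():
    # Same as above, but with a tuple as the initializer.
    arr = array.array('B', (0,) * N)
    for i in range(N):
        arr[i] = 20
    return arr


if __name__ == "__main__":
    for func in (fill_by_append, fill_with_list_initializer,
                 fill_with_tuple_initializer):
        # Three timing runs of five calls each; report the best per-call time.
        best = min(timeit.repeat(func, number=5, repeat=3)) / 5
        print("%-30s %.4f s per call" % (func.__name__, best))
```

Reporting the minimum of several runs is a common choice with timeit because it is the figure least disturbed by other activity on the machine; taking the mean, as in the measurements above, is equally easy.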

Even better would be if the Python developers would add an option to the array constructor to specify the initial capacity. This would avoid any overhead incurred by copying the initializer values to the array. For now, though, your best option is to use an initializer.
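One variation on the initializer approach that may be worth trying: array objects support the * repetition operator, so a one-element array can be repeated to the desired length without first building a full-size list or tuple. The sketch below only shows the technique; it was not part of the timings above, so whether it is faster in practice is something you would need to measure. The element values assigned in the loop are arbitrary.

```python
import array

n = 1000  # desired number of elements (small here, for illustration)

# Repeat a one-element array n times; no n-element list or tuple is built.
arr = array.array('B', [0]) * n

# Assign each element its own value, as in the examples above.
for i in range(n):
    arr[i] = (i * 7) % 256

print(len(arr))
```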