Automating Remote Data Fetching With Terminal.app: A Practical Example

I always tell people that my favorite aspect of UNIX is the fact that you can quickly and easily string together a few command-line utilities and do things that would require days worth of custom programming to pull off on Windows. There are plenty of UNIX geeks using OS X these days, but I'd wager that the majority of scientific Mac users are still a bit shy when it comes to the terminal and are perhaps a bit glad that it is safely tucked away in the /Applications/Utilities folder. I thought I'd write up a practical example of how you can use the Terminal application and a few handy command-line utilities to make your life easier. In this example I'm going to illustrate how you can automate the retrieval of protein sequence data from NCBI's genbank database. Note that the principles illustrated in this simple example can easily be applied to numerous other situations.

If you'd like to try this example for yourself you'll need to launch the terminal application which is found in the /Applications/Utilities folder. In this example we begin with a list of accession numbers from a published study that are given to us in a text file. In NCBI speak, accession numbers are simply the unique identifier for a particular entry in the database. In this case the accession numbers are references to protein sequence data entries. We obviously can't do much science with a list of accession numbers so we'll have to retrieve the sequence data and place it into a properly formatted data file. If you lacked programming skills this normally means that you would use a web browser to manually search for each accession number, export the data to the desired format, and paste it into a data file. You can imagine how tedious this gets after the first few accession numbers. Mundane data collection tasks such as this are usually ripe for automation and in this case we can write a very simple command-line statement involving three simple utilities to automate the whole process.

Before we begin let's look at the contents of the text file containing the accession numbers:

17986270
4503385
118214
76673410
73951844
55666937
51458701
4507041
6753680
30794286
6978777
32483397
118206

as you can see it's a simple file containing one accession number per line. The next and most difficult part of this task is finding out how we can get data from the NCBI website without manually using a web browser. To do this I used a web browser to manually search the NCBI website using the first accession number and pull up the FASTA formatted sequence data for the accession number I entered. Then I took note of the URL and determined that the list_uids parameter expected an accession number, and all I have to do to request FASTA formatted sequence data for any accession number is to make a request with list_uids=my_accession_number. Through some further testing I determined that some of the other parameters in the URL were not required for a successful request so I ended up with the following:

http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&list_uids=my_accession_number&dopt=fasta&sendto=t

Now I have all the information I need to assemble the command-line and automate sequence retrieval. The web interaction is going to be performed by the curl utility which comes standard with OS X. One of the other standard utilities we are going to be using is xargs. The xargs utility is able to accept lines of text and use them as parameters to launch other utilities. It might sound simple, but it can be quite powerful if used wisely. Before we begin lets understand fully what we are trying to accomplish. For each accession number listed in the text file, we need to download fasta formatted sequence data and place it into a data file. We already know that we can use the URL above to request sequence data if we replace my_accession_number with a valid accession number. We know that curl is used to download data from the web, and that xargs can accept text lines and pass them on to other commands. I hope you are getting a feeling that our problem can easily be accomplished with curl and xargs. We haven't discussed how we connect these utilities and in order to do that we need to become familiar with the "|" or pipe as it is known. When the pipe character, |, is placed between two command line statements it tells the terminal environment, or "shell", to take the output from the first command and use it as the input for the second command. That explanation is a bit simplified but it is all we need to know for the purpose of this example. So the first thing we are going to do is use the cat utility which, when called with only a file name as an argument, will simply print out each line of the file one after the other. If you run the command yourself you should see the following:

PowerBook-G4-15:~/src/curl_fun$ cat accession_list.txt
17986270
4503385
118214
76673410
73951844
55666937
51458701
4507041
6753680
30794286
6978777
32483397
118206

Now that we can output each accession number in our file, we need to call curl to request the sequence data for that accession from NCBI. Before we do that, however, we need to make sure that the value for list_uids in the URL is set equal to each accession number as it is output. To do this we'll use the xargs utility to replace a variable character in the URL, in this case %, with the accession number produced as output from the cat utility. When we put it all together it looks like the following:

PowerBook-G4-15:~/src/curl_fun$ cat accession_list.txt | xargs -I % curl "http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&list_uids=%&dopt=fasta&sendto=t"

When this command-line is executed, the cat utility will send each accession number to the xargs command through the pipe mechanism. For each accession number that is received by xargs utility the -I option tells xargs to replace the % character in the URL with the accession number that it recieves from the via the pipe and then execute the curl command with the updated URL. Notice that the URL must be enclosed in double quotes otherwise the command will not execute properly. If you run the above command you'll notice that the sequence data is printed to the terminal. To send this data to the file we need to add one more simple element. The ">" mechanism tells the shell to take any output from the left side of the character and place it into a file with the file name given on the right side of the character. Let's call our new file data_file.fasta and the following is the updated command line:

PowerBook-G4-15:~/src/curl_fun joel$ cat accession_list.txt | xargs -I % curl "http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&list_uids=%&dopt=fasta&sendto=t" > data_file.fasta

When the above command is executed the sequence data will be written to a new file named data_file.fasta. if we wanted to append the sequence data to an already existing data_file.fasta we would need to use ">>" mechanism. Now we have a sequence data file that is ready to be used for analysis.

We've completed the automation example and as you can see there is not much to it at all. I hope this simple example gives you a taste of what's available to you through the terminal and how it might help you to automate or augment your research activities.

Comments

Applescript wrapper

One of the really nice things about MacOSX is that you can use applescript to enhance the UNIX tools. One think you may have to watch for is if you are using a proxy server to access the internet.

Copy this into applescript editor, I've tried to show a couple of different ways to interact.
You could of course also build this in Applescript Studio to build a more fully featured application.

on run
set TheFile to (choose file with prompt "Choose a file containing the accesion numbers")

GetProt(TheFile) of me

end run

on GetProt(TheFile)
set the_path to TheFile as string
--display dialog the_path this is a mac path need to convert to posix
set posix_path_assc to POSIX path of the_path

display dialog "Name for Results File" default answer "data_file.fasta" buttons {"OK"} default button 1
set the fasta_name to the text returned of the result as text
set this_file to (((path to desktop folder) as text) & fasta_name)
set posix_path_fasta to POSIX path of this_file

--cat accession_list.txt | xargs -I % curl "http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&list_uids=%&dopt=fasta&sendto=t" > data_file.fasta

--use this if you have a proxy server, change myproxy_name to your proxy server
--set the_script to "cat " & posix_path_assc & " | xargs -I % curl -x myproxy_name:proxyport \"http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&list_uids=%&dopt=fasta&sendto=t\" > " & posix_path_fasta

set the_script to "cat " & posix_path_assc & " | xargs -I % curl \"http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&list_uids=%&dopt=fasta&sendto=t\" > " & posix_path_fasta

do shell script the_script

end GetProt

pipelines/web services are the way forward

I was blown away the first time I used UNIX and realised the power of it for tasks like this. That's partly why I love having a Mac on my desk. ONly recently I have been similarly blown away by my exposure to pipelining tools such as KDE Inforsense and Pipeline Pilot (do a google). They take a bit to get your head round initially but once you see some examples you'll see how it's worth it. There's a good webex here showing how KDE was integrated with Spotfire - http://www.spotfire.com/events/webcasts/detail.cfm?id=6201. There's also the open source Taverna which is has a pretty good tutorial - http://taverna.sourceforge.net/ It's the future!!!