Getting used to Unix

This assignment introduces Unix tools useful for working with large data files commonly encountered in bioinformatics. Once you've completed this lab assignment, you'll be able to do the following:

Intro to Unix

The Unix operating system (OS) is popular in bioinformatics because of its powerful command-line tools that make scripting and performing automated analyses relatively easy.

One of the great benefits of Unix is that everything you learn today about how to interact with and use the system will probably still be useful in several years' time. Unix has been around since the 1970s and has been steadily improving since then. Time you invest today in learning Unix tools (most of which are free) will likely pay off throughout your entire career.

To interact with a Unix system, you invoke a program called a shell, an interpreter where you type in commands for the system to execute. The commands you type in the shell are programs that behave predictably and that you can chain together into simple programs of your own, called shell scripts.
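As a minimal sketch of what "chaining" means, the lines below send the output of one program (ls) into another (wc) and store the result in a variable. The scratch directory and the file names are made up for illustration only.

```shell
# Work in a scratch directory so nothing in your home directory changes.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Create three empty files (hypothetical names, for illustration only).
touch a.txt b.txt c.txt

# Chain two programs: ls lists the files, and wc -l counts the lines ls prints.
# tr removes the padding spaces that some versions of wc add.
count=$(ls | wc -l | tr -d ' ')
echo "This directory contains $count files"
```

Saving lines like these in a text file and running the file with the shell is all it takes to turn them into a shell script.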

The computer lab is equipped with Apple iMac computers, which run a version of the Unix operating system based (originally) on FreeBSD. However, you can also do the following exercises using other types of computers, provided Unix tools are installed. Students with Windows computers should install the Cygwin tools, which are Windows implementations of Unix utilities.

To do the exercises in this homework on the lab computers, open a terminal window, which is an interface to a single shell.

On a Mac, to open a terminal window, double-click the "Terminal" icon under Applications->Utilities. You can also launch a terminal window by clicking the Terminal icon on the Dock, the strip of application icons along the bottom of the screen.

On computers running other versions of Unix, e.g., CentOS or Red Hat, you can launch a terminal window by right-clicking on the desktop and choosing "Open Terminal" or something similarly worded.

Once you've opened a terminal (the shell), you can type in commands at the prompt, typically a sequence of characters ending with a % or $ character, depending on how the system is set up.

To find out what a Unix command does, use the "man" program, which displays a text file (the "man" page) for a given command. For example, at the prompt ($), type

    $ man ls

Exercises

Part 1: Getting started with Unix commands

Open a terminal window.

To start, read Introduction to the UNIX Operating System and do Tutorials 1 through 6 and 8. In Tutorial 8, create and edit a .bash_profile file instead of .cshrc, because the lab Macs use Bash (the GNU Bourne-Again Shell). See: http://www.ee.surrey.ac.uk/Teaching/Unix/.

Now, use an internet search tool to research how the following extremely useful commands operate. For each, write a short description of how it works. You will use these commands in the next part of the assignment:

  1. uniq




  2. cut




  3. sort




Unix saves typing: The up-arrow key tells the shell to retrieve the most recently typed command. This saves a lot of typing when you need to repeat the same command many times. The tab key has a similarly useful function. If you can remember only the first letter or two of a command's name, type those letters and press the tab key. The shell will either fill in the name of the command or indicate that it needs more information to determine the command you're trying to type. (Some systems will beep while others may flash the cursor; it depends on the variety of Unix you're working with.) In the latter case, if you press tab a second time, the shell will print a list of all the possible commands that begin with the letters you typed. You can then look at the various options, type the next few letters of the command you want, and press tab again. Once you have typed enough letters to distinguish a command unambiguously from all the other possibilities, the shell will fill in the rest. The same thing works if you are trying to type the name of a file residing in the current working directory.


Part 2: Working with bioinformatics data files

Download this compressed file from the BioViz Quickload site, a Web site that Integrated Genome Browser uses as a back end data server: TAIR9_mRNA.bed.gz.

You will store data files in a new directory inside your home directory. First, use the change directory command (cd) to change into your home directory.

    $ cd

QUESTION 1: When you execute the change directory command without specifying the name of a directory as an argument, the command assumes that you want to do what? Explain.





Create a new directory in your home directory for data files you download from the Internet:

    $ mkdir data

QUESTION 2: What happens if you try to create a new directory but a directory by that name already exists? Can you have more than one directory with the same name in different directories? Explain.





Move the downloaded file into your data directory. Examples showing how to do this inside the shell appear below. Note that the ~ ('tilde') symbol is a Unix shortcut that stands for your 'home' directory. Type something like:

    $ mv ~/downloads/TAIR9_mRNA.bed.gz data/.

where "downloads" is the directory where you saved the file. Note that different browsers will likely put downloaded files in different places on the file system. It is usually a good idea to configure your browser's preferences to put downloaded files in the same location relative to your home directory on every machine you use, so that you will not spend too much time chasing down downloaded files.

QUESTION 3: What does the '.' symbol tell Unix to do in the above command? Hint: It helps save typing. How?





Move into the data directory and use ls with the -l and -h options to get details about the file:

    $ cd data
    $ ls -lh 

Note: In the above command, you could also have typed:

    $ ls -l -h

You can typically combine several options on a single Unix command, provided they don't contradict each other.
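One way to convince yourself that the two spellings are equivalent is to capture the output of each and compare them. This is only a sketch; the scratch directory and the file name example.txt are made up for illustration.

```shell
# Work in a scratch directory containing one small file (hypothetical name).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'hello\n' > example.txt

# Capture the listing produced by each spelling of the options.
combined=$(ls -lh)
separate=$(ls -l -h)

# Compare the two listings as text.
if [ "$combined" = "$separate" ]; then
    echo "ls -lh and ls -l -h print the same listing"
fi
```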

Now uncompress the file using gunzip, if it is not already uncompressed. (Some Mac systems try to help you out by uncompressing files you download via a Web browser.) Then take a look at the top and bottom of the file using the head and tail commands. Note: By default, gunzip replaces the '.gz' version of the file with a new uncompressed version that lacks the '.gz' extension. As you might expect with Unix, you can modify this default behavior using options.

    $ gunzip TAIR9_mRNA.bed.gz 
    $ ls -lh 
    $ head TAIR9_mRNA.bed
    $ tail TAIR9_mRNA.bed 
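As one example of changing gunzip's default behavior, the -c option writes the uncompressed data to standard output and leaves the '.gz' file in place. The sketch below tries this on a small made-up file (example.txt) in a scratch directory rather than on the assignment's data file.

```shell
# Work in a scratch directory with a small made-up file.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'line one\nline two\n' > example.txt

# Default behavior: gzip replaces example.txt with example.txt.gz.
gzip example.txt

# With -c, gunzip writes to standard output and keeps the .gz file,
# so we redirect the output into a new file of our choosing.
gunzip -c example.txt.gz > copy.txt

# Both example.txt.gz and copy.txt should now be listed.
ls
```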

Now, compress it again:

    $ gzip TAIR9_mRNA.bed 
    $ ls -lh

Note the output from the ls command when you give it the -lh option. The -l (long) option tells ls to "list" all the files in the directory *and* their properties, such as how big they are, who owns them, etc. The -h option tells ls to make the output human-friendly by reporting file sizes in familiar units such as kilobytes or megabytes instead of bytes.

Uncompress the file again so that you can examine its contents in the questions below:

    $ gunzip TAIR9_mRNA.bed.gz

QUESTION 4: How big was the compressed file? How big was the uncompressed file?



The file you've downloaded is in 'bed' format, which stands for "Browser Extensible Data," a format invented by the University of California at Santa Cruz Genome Bioinformatics group.

Each line represents the structure of a single gene model, which is a hypothesis about the structure of a mature, spliced mRNA transcript arising from a gene. Note that a single gene can, in theory, produce multiple different mRNA transcripts due to alternative promoters, alternative splicing, and alternative polyadenylation, which are variations in how a newly synthesized transcript is started and processed as it progresses through the maturation process. This means that a single gene can be associated with multiple gene models, where each gene model is essentially just a theory about the structure of the gene's transcript products. Also, gene models that overlap along the genome sequence axis represent hypothetical alternative mRNA transcripts arising from the same gene or locus. Note that the term "locus" is a synonym for "gene". We will use these terms interchangeably in this class.

Take note of the following fields in the 'bed' file you downloaded:

Column 1: The name of the annotated sequence where the gene model is located.
Column 2: The start position for the gene model (genomic sequence coordinates).
Column 3: The end position for the gene model (genomic sequence coordinates).
Column 4: The name of the gene model.
Column 6: The strand (+ or -) it comes from.
Column 10: The number of exons the gene model contains.

For details, see http://genome.ucsc.edu/goldenPath/help/customTrack.html#BED.
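To see how these column numbers map onto a tab-delimited line, the sketch below builds a tiny two-line file in the same 12-column layout and pulls out columns 1, 4, 6, and 10 with cut. The file name toy.bed and every value in it (chrA, modelA.1, and so on) are invented for illustration and are not real annotations.

```shell
# Work in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two made-up lines with 12 tab-separated fields each (not real data).
printf 'chrA\t100\t500\tmodelA.1\t0\t+\t100\t500\t0\t2\t150,100,\t0,300,\n' > toy.bed
printf 'chrB\t200\t900\tmodelB.1\t0\t-\t200\t900\t0\t1\t700,\t0,\n' >> toy.bed

# cut -f selects fields by number; tab is its default field delimiter.
# Here: sequence name (1), gene model name (4), strand (6), exon count (10).
selected=$(cut -f 1,4,6,10 toy.bed)
echo "$selected"
```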

Answer the following questions using Unix commands you learned in the tutorials and in the exercises above. What commands did you use? Hint: be sure you understand how to use sort, wc, uniq, and the | (pipe) operator. (On most keyboards, the pipe key is above the Enter or Return key.)
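The sketch below shows how the pipe operator connects these commands, using a short made-up list of words (fruit.txt) rather than the assignment's data. Note that uniq only collapses repeated lines that sit next to each other, which is why the list is sorted first.

```shell
# Work in a scratch directory with a small made-up word list.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'pear\napple\npear\nfig\napple\npear\n' > fruit.txt

# Count all lines in the file. tr strips padding some versions of wc add.
total=$(wc -l < fruit.txt | tr -d ' ')

# Sort so that identical lines are adjacent, collapse them with uniq,
# then count what is left.
distinct=$(sort fruit.txt | uniq | wc -l | tr -d ' ')
echo "$total lines, $distinct distinct values"

# uniq -c prefixes each distinct line with the number of times it occurred.
sort fruit.txt | uniq -c
```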

QUESTION 5: How many distinct gene models does the file contain?




QUESTION 6: Are any gene models represented multiple times in the file? If yes, what are they?





QUESTION 7: How many annotated sequences (chromosomes) are represented in the file? What are their names?





QUESTION 8: Which annotated sequence (chromosome) has the smallest number of gene models? How many does it contain?



A single gene may be associated with multiple gene models. The TAIR9_mRNA.bed.gz file you downloaded from the class Web site contains many thousands of gene models, all of which have names like AT4G22890.5 or AT2G46660.1. These names are called AGI codes. (AGI stands for Arabidopsis Genome Initiative.) The first part of the gene model name is the locus identifier (e.g., AT4G22890) and the suffix ".[numeral]" indicates the gene model number. For example, the gene model name AT4G22890.5 means: gene model number 5 belonging to locus AT4G22890. Locus is a synonym for gene.
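The naming scheme can be illustrated on a single name with the shell's built-in pattern-removal syntax. This is a sketch for one name only, using the example name from the paragraph above; the variable names are arbitrary.

```shell
# The example gene model name from the text.
name="AT4G22890.5"

# ${name%.*} removes the shortest trailing piece matching ".*",
# leaving the locus identifier.
locus=${name%.*}

# ${name##*.} removes everything up to and including the last ".",
# leaving the gene model number.
model=${name##*.}

echo "locus: $locus, gene model number: $model"
```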

QUESTION 9: How many unique locus identifiers does the file contain?





QUESTION 10: What sequence of commands could you use to determine whether any gene models have identical genomic coordinates but different names? Be ready to explain your reasoning in class.




You're done!