Bioinformatics Tutorial

Finding Sequences


In this section, you will learn how to obtain nucleic acid or protein sequence information, in a format called FASTA, that is easy to use as input into bioinformatics tools.

What is the nucleotide sequence of this gene?

Remember that you are looking at information about the gene for the red-sensitive opsin in human vision, and that it is located near the bottom tip of the X chromosome. On the Entrez Gene page for OPN1LW opsin 1, scroll farther down (way down!) to NCBI Reference Sequences (RefSeq). In the first subsection, mRNA and Protein(s), all of the following are available:

Note that the two links to mRNA sequence and protein sequence are given as NM_020061.3→NP_064445.1, the arrow implying that the sequence of the NM entry is translated (by protein synthesis) to give the sequence of the NP entry.

Click the entry number for the mRNA sequence: NM_020061.3

This is a typical GenBank nucleotide file, and a lot of it is hard to read, but a few things are clear. First note, under references, citations to the publication of this sequence in the scientific literature. To see an abstract of the article in which this gene was described, click the PubMed link (a number) below the first reference and read it.

Scroll to the bottom of this long page. The last thing, labeled ORIGIN, is the sequence of this messenger RNA. You are seeing the actual list of As, Ts, Gs, and Cs that make up the message for synthesis of this opsin. But wait! You know that RNA contains no T. In most nucleotide databases, U from RNA is represented as T, to make for easy comparison of DNA and RNA sequences. This sequence information is not in the form that is most useful for searching in databases, say, searching for related genes. Let's display this entry in a form more useful for searching.
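The T-for-U convention described above is easy to undo when you want to see the sequence written as RNA. The following is a minimal sketch in Python; the function names and the short example sequence are made up for illustration and are not part of any NCBI tool or of the OPN1LW entry.

```python
def dna_alphabet_to_rna(seq: str) -> str:
    """Rewrite a database-style sequence (T standing in for U) with the RNA letter U."""
    return seq.upper().replace("T", "U")


def rna_to_dna_alphabet(seq: str) -> str:
    """Go the other way: write an RNA sequence in the database convention (T for U)."""
    return seq.upper().replace("U", "T")


# Hypothetical fragment, NOT taken from NM_020061.3.
example = "atggcccagcagtgg"
as_rna = dna_alphabet_to_rna(example)
print(as_rna)                       # the same letters, with every T shown as U
print(rna_to_dna_alphabet(as_rna))  # back to the database convention
```

Because the conversion is a simple one-letter swap, no information is lost either way, which is why databases can store mRNA entries in the DNA alphabet.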

At the top of the page, beside the Display button, pull down the menu that says GenBank (the default display format for each entry), and select FASTA (note that several other display options are available). Now you see one descriptive or "comment" line that begins with ">", followed by the nucleotide sequence. This little bit of text is just what you need to search nucleotide databases for similar sequences.

Keep it for future use, as follows. Click and drag on the web page to select everything from the ">" through the last nucleotides (CCAA). Be careful not to select anything else. From your browser's Edit menu, select Copy to make a copy of this information on your clipboard, for pasting elsewhere. Now start a simple word processor (use TextEdit on Mac or Notepad on Windows, to avoid inadvertent changes in the crucial formatting of sequence files), make a new document, and paste. The FASTA comment and sequence should appear. If necessary, select all of the text and change the font to Courier or Monaco. These "typewriter" fonts make it easy to align letters into columns, because all letters are the same width. Save this file, choosing text or plain text as the file type. Call it mrnared.txt (for mRNA sequence of red opsin). Save it to a convenient location for this and other files you'll be making for later searches.
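If you later want to read a saved FASTA file in a program rather than by eye, the format is simple enough to parse in a few lines. The sketch below is one possible approach in Python, not the method used by any particular tool; the function name parse_fasta and the two toy records are assumptions made for illustration, and the sample text is invented rather than copied from mrnared.txt.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {comment line without '>': sequence} for FASTA-formatted text lines."""
    records: Dict[str, str] = {}
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new comment line closes the previous record, if there was one.
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence lines are joined without line breaks
    if header is not None:
        records[header] = "".join(chunks)
    return records


# Invented sample text standing in for a saved file.
sample = """>toy_record_1 hypothetical example
ATGGCC
CAGTAA
>toy_record_2 another hypothetical example
GGGTTT
"""
records = parse_fasta(sample.splitlines())
for comment, sequence in records.items():
    print(comment, len(sequence))

# To read the file saved above instead (assuming it is in the current folder):
#     with open("mrnared.txt") as handle:
#         records = parse_fasta(handle)
```

The parser relies only on the two features described above: a comment line that begins with ">" and sequence lines that follow it.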

Click your browser's Back button until you return to the Entrez Gene page for this gene.

What is the amino-acid sequence of this gene?

Under NCBI Reference Sequences (RefSeq), click the entry number NP_064445.1 for the protein sequence.

Things look a lot like before, but this is a protein entry (the classical view is that gene products are proteins, but many are not), containing the amino-acid sequence in one-letter abbreviations. Just as with the mRNA entry, turn this into a FASTA display, and copy it into a new word-processor document. Save it in text format as protred.txt (for protein sequence of red opsin). Return to Entrez Gene.
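The relationship between a nucleotide entry and its protein entry (NM_020061.3→NP_064445.1) can be sketched in code. The example below is a simplified illustration in Python that assumes the standard genetic code and naively starts at the first ATG it finds; real mRNA entries carry annotated coding-region coordinates, which this sketch ignores. The function name and the short test sequence are hypothetical, not taken from the OPN1LW entries.

```python
from itertools import product

# Standard genetic code, written in the compact form used in NCBI's
# translation tables: codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate from the first ATG to the first stop codon (one-letter codes)."""
    seq = mrna.upper().replace("U", "T")  # accept either the RNA or DNA alphabet
    start = seq.find("ATG")
    if start == -1:
        return ""  # no start codon found
    protein = []
    for i in range(start, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")  # X marks an unrecognized codon
        if aa == "*":
            break  # stop codon: translation ends here
        protein.append(aa)
    return "".join(protein)


# Made-up sequence with two untranslated bases on each side of the coding part.
toy_mrna = "ccATGGCCGAGTTCTGAgg"
print(translate(toy_mrna))
```

Taking the first ATG as the start is a shortcut that works for this toy sequence but is not guaranteed to pick the true start codon in a real transcript.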

What does the neighborhood of this gene look like?

(Get ready for a surprise. Hint: OPN1LW is a human gene, and humans are eucaryotes. When people began to sequence eucaryotic genes, what big surprise was in store for them?)

Now take a look at the chromosome region that contains the red opsin gene. Scroll back to near the top of the Entrez Gene page for OPN1LW, to the section called Genomic context. The diagram shows you that the red opsin gene lies on the X chromosome, within a segment of base pairs (bp) stretching from position 152,929,151 to position 153,114,725 (a distance of 185,574 bp). [Don't worry if these numbers are not exactly the ones you see; these resources are constantly being updated.] The location of OPN1LW, shown as a red arrow, is about 3/4 of the way down this segment.

Now look at the diagram in the preceding section, Genomic regions, transcripts, and products. This diagram gives a closer look at the OPN1LW segment, representing only positions 153,062,939 to 153,077,701 (14,762 bp). The lower line shows coding regions as red blocks and noncoding regions as red lines. Here is the surprise: You knew, but you might have forgotten, that eucaryotic genes are often interrupted by non-coding regions called intervening sequences or introns. The coding regions are called exons. From this diagram, you can see that the OPN1LW gene consists of 6 exons and 5 introns, and that the introns are far larger than the exons. Of the 14,762 bp in the "gene", only 1095 bp code for protein, which means that less than 8% of the base pairs contain the code. When this gene is expressed in cells in the human retina, an RNA copy of the entire gene is synthesized. Then the intron regions are cut out, and the exon regions are joined together (a process called splicing) to produce the mature mRNA, which will be translated by ribosomes as they make the red opsin protein. In this case, 92% of the initial RNA transcript is tossed out, leaving the pure protein code. Seems wasteful, but our understanding of how all this works, while impressive, is still pretty fragmentary.
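The percentages quoted above follow directly from the positions in the diagram. Here is the arithmetic as a short Python sketch, using the numbers given in the text (yours may differ if the database has been updated); the variable names are chosen for illustration, and the span is computed as end minus start, the same way the text does.

```python
# Positions and coding length as quoted in the text.
gene_start = 153_062_939
gene_end = 153_077_701
coding_bp = 1095

gene_span = gene_end - gene_start      # 14,762 bp, as in the text
coding_fraction = coding_bp / gene_span
noncoding_fraction = 1 - coding_fraction
codons = coding_bp // 3                # three bases per codon

print(f"gene span:     {gene_span:,} bp")
print(f"coding:        {coding_fraction:.1%}")
print(f"tossed out:    {noncoding_fraction:.1%}")
print(f"codons in CDS: {codons}")
```

The coding fraction comes out to about 7.4%, consistent with "less than 8%", and 1095 bp divides evenly into 365 codons.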

At the ends of the lower line in the diagram, there are links to NM_020061.3 and NP_064445.1, the entries for the mRNA and protein sequences for this gene. You visited these pages in the two sections above. Click CCDS 14742.1 at the far right of the diagram to go to the Consensus Coding Sequence page for this gene. It shows nicely how the OPN1LW gene transcript is divided into exons. Under Chromosomal Locations for CCDS 14742.1 is a table listing start and end base-pair positions for each exon. Below that is the full nucleotide sequence of the mature mRNA, with alternating blue and black sections indicating exon boundaries. Farther below is the amino-acid sequence, again divided into exons by alternating blue and black, with red indicating amino-acid residues whose codons are partly in one exon and partly in the following exon. This makes it dramatically clear how the mRNA is pieced together from the exons.
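A table of exon start and end positions is all that is needed to piece the coding sequence together from the genomic sequence. The sketch below shows the idea in Python on a tiny invented gene; the sequences, the coordinates, and the function name splice are all hypothetical, and the coordinates are assumed to be 1-based and inclusive, which you should verify against any real table before relying on it.

```python
from typing import List, Tuple


def splice(genomic: str, exons: List[Tuple[int, int]]) -> str:
    """Join exon segments, given 1-based inclusive (start, end) positions."""
    return "".join(genomic[start - 1:end] for start, end in sorted(exons))


# Invented toy gene: exons in upper case, introns in lower case.
exon_seqs = ["ATGGCC", "GAGTTC", "TGA"]
intron_seqs = ["gtaagtttcag", "gtgagtcccag"]
toy_gene = (exon_seqs[0] + intron_seqs[0] + exon_seqs[1]
            + intron_seqs[1] + exon_seqs[2])

# Hypothetical exon table, in the spirit of a start/end position listing.
toy_exons = [(1, 6), (18, 23), (35, 37)]

spliced = splice(toy_gene, toy_exons)
exon_lengths = [end - start + 1 for start, end in toy_exons]

print(spliced)                            # exons joined, introns removed
print(exon_lengths, sum(exon_lengths))    # per-exon and total coding length
print(f"{len(spliced) / len(toy_gene):.0%} of the toy gene is exon")
```

In this toy case every exon length is a multiple of three, so no codon straddles an exon boundary; in the real gene, the residues shown in red on the CCDS page are the ones whose codons do.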

You still have not seen any of the actual sequences of the introns. Return to the Entrez Gene page for OPN1LW. Under Genomic regions, transcripts, and products, click Go to reference sequence details. This takes you down the page to NCBI Reference Sequences. You were here before, to retrieve the mRNA and protein sequences. This time, click the sequence of four entry numbers (all one link) beside Source Sequence(s). This takes you to the Entrez Nucleotide page that contains information about all four of the genome fragments from the Human Genome Project that contain all or part of the red opsin gene, along with information about how each clone was produced. This entry thus shows the gene in the larger context of the cloned fragments in which the gene was found. These sequences allow you to explore flanking regions around the gene, which might be useful in designing PCR primers for amplifying this region in useful quantities. From this page, you could also find neighboring sequences if you wanted to look farther afield. As before, you can display this entry in FASTA format. You will get a series of entries, each a different clone that was used to construct this region of the genome.