GenBank, RefSeq, and Entrez

These problems come from the NCBI Field Guide. Due Date: Feb 3rd, 2004
  1. Use Entrez Nucleotide to retrieve the finished record AC009453 from the human genome project. How many times has it been updated since it first appeared? Trace the history all the way back to the first version. Based on the update date, when did this record first appear? How many unordered pieces were there then? Now use electronic PCR (linked as a "hotspot" on the NCBI homepage) to identify STS markers present in this record. How many are there? These include radiation hybrid and genetic markers. Notice that one of these markers is also a repeat polymorphism that is mapped on two human genetic maps (Marshfield and Genethon). Follow the links from the ePCR results to see which marker it is.
  2. Retrieve the SWISS-PROT record for the human CFTR (cystic fibrosis) protein by searching for CFTR_HUMAN in proteins using the search box on the NCBI home page. View the record and look at the extensive annotations. How many primary database records are linked to this record? How many literature citations are linked? What is the prevalence of cystic fibrosis in the Caucasian population? Use the FEATURES table to find the nature and location of the most common mutation in this gene in cystic fibrosis. Now compare these annotations to those on the RefSeq record NP_000483 and the corresponding primary sequence database record M28668.

    Go back to the original SWISS-PROT entry at NCBI. Now use the BLink link to retrieve related proteins. Click the Best Hits button and find the related protein from the fish Fundulus heteroclitus. Follow the PubMed link from this record to read about the biology of this protein. What is the physiological role of this CFTR homologue in this animal?

    CFTR contains conserved domains that are homologous to bacterial transporters. These bacterial homologues do not appear in the BLink output because only the top 200 proteins are shown. You can use the "Related sequences" link on the CFTR_HUMAN record to find these. Go back to the CFTR_HUMAN record and follow the "Related sequences" link. How many related proteins are there? To identify the ones from bacteria, click on the History tab. Follow the instructions on that page for constructing a query that combines the protein neighbors with an Organism field search for bacteria. Your query will be something similar to the following:

    #13 AND bacteria[Organism]
    How many of the related sequences are bacterial proteins?
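The History query in problem 2 follows a fixed shape: a set number, the AND operator, and a field-limited term. The sketch below only assembles that string; the function name `combine_with_history` is hypothetical, and the set number 13 is just the example from the text (yours will depend on how many searches you have run in the session).

```python
def combine_with_history(set_number: int, term: str, field: str) -> str:
    """Build an Entrez query that ANDs a History set with a field-limited term.

    Hypothetical helper for illustration; it only formats the query string.
    """
    if set_number < 1:
        raise ValueError("History set numbers start at 1")
    return f"#{set_number} AND {term}[{field}]"


# Set #13 is the example number used in the problem text.
query = combine_with_history(13, "bacteria", "Organism")
print(query)  # prints: #13 AND bacteria[Organism]
```

The resulting string can be pasted into the Entrez search box, replacing 13 with the number shown next to your "Related sequences" result on the History tab.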
  3. Find the genomic scaffold AE003584 from Drosophila melanogaster using Entrez Nucleotide. Display protein links to see the predicted proteins for this scaffold. (You will need to increase the number of records displayed to see all of the proteins on one page. Then use the browser's "Find in page" function to find the protein that you want.) Identify conserved domains present in predicted protein CG10879 (AAF51293) by clicking on the BLink link and then clicking the CDD button. These conserved domains suggest a potential function for this hypothetical protein. Now perform a search against the Prosite patterns using the ScanProsite tool at ExPASy. Did you find the same protein family signature? To verify the Pfam results, try the search against the Prosite profiles. Do your results agree now? This points out the problems with representing a profile as a pattern.
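The closing remark of problem 3 can be made concrete with a small sketch. A Prosite-style pattern behaves like a regular expression: a sequence either matches every position or it does not match at all, whereas a profile scores each position and can tolerate a conservative substitution. The code below translates a subset of the pattern syntax into a Python regular expression. The function name `prosite_to_regex` is hypothetical, the P-loop motif `[AG]-x(4)-G-K-[ST]` is used only as a familiar example (it is not claimed to be the signature of CG10879), and the two test sequences are invented.

```python
import re

# One pattern element: optional "<" anchor, a residue / x / [class] / {exclusion},
# an optional repeat such as (4) or (2,4), and an optional ">" anchor.
_ELEMENT = re.compile(
    r"^(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)$"
)


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple Prosite-style pattern into a Python regex (sketch)."""
    pieces = []
    for element in pattern.rstrip(".").split("-"):
        m = _ELEMENT.match(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        start, core, low, high, end = m.groups()
        if core == "x":
            regex = "."                       # any residue
        elif core.startswith("{"):
            regex = "[^" + core[1:-1] + "]"   # any residue except these
        else:
            regex = core                      # single residue or [ABC] class
        if low is not None:
            regex += "{" + low + ("," + high if high else "") + "}"
        pieces.append(("^" if start else "") + regex + ("$" if end else ""))
    return "".join(pieces)


p_loop = prosite_to_regex("[AG]-x(4)-G-K-[ST]")

# Invented toy sequences: the second differs by a single K -> R substitution.
exact = "AVGPSGSGKSTL"
variant = "AVGPSGSGRSTL"

print(p_loop)                                  # [AG].{4}GK[ST]
print(bool(re.search(p_loop, exact)))          # True
print(bool(re.search(p_loop, variant)))        # False
```

A single substitution at a fixed position makes the pattern miss the variant entirely, which is the kind of false negative a position-weighted profile is designed to avoid.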


  4. The Entrez Nucleotide [Properties] field stores information about the kind of sequence and its source. You can use the index feature on the Preview/Index tab to display the terms that are indexed for this field. The Properties field terms are somewhat cryptic, but they are very useful for searching. Three useful types are the biomol, gbdiv and srcdb sets. The biomol terms classify records based on the type and origin of the molecule, for example biomol mrna or biomol genomic. The gbdiv sets of terms index records by the GenBank division code: gbdiv est, gbdiv pri, gbdiv htg and so on. The srcdb terms classify records based upon their database origin. For nucleotide records these could be GenBank, EMBL, DDBJ, RefSeq or PDB (srcdb genbank, srcdb embl, srcdb ddbj, srcdb refseq). Perform an organism search for mouse, then use the Preview/Index tab and the Properties field terms to count the number of mouse genomic records. How many of these are draft sequences (gbdiv htg)? How many are finished records (gbdiv rod)? How many are genome survey sequences? How many of these genomic records are RefSeqs? What kind of RefSeqs are they? Now retrieve all mouse mRNA records. How many of these are in the rodent division? How many are in the EST division? Using these Properties field terms, design a query and retrieve all the mouse known mRNA RefSeqs (NM_).
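The counting queries in problem 4 all have the same structure: an organism term ANDed with one or more Properties terms. The sketch below assembles such queries as plain strings. The function name `properties_query` is hypothetical, and the term spellings are copied from the problem text; the live index may spell them differently, so confirm each term in the Preview/Index listing before relying on it.

```python
def properties_query(organism: str, *properties: str) -> str:
    """AND an organism term with any number of [Properties] terms (sketch)."""
    clauses = [f"{organism}[Organism]"]
    clauses += [f"{prop}[Properties]" for prop in properties]
    return " AND ".join(clauses)


# Term spellings follow the problem text and are assumptions about the index.
print(properties_query("mouse", "biomol genomic"))
# prints: mouse[Organism] AND biomol genomic[Properties]

print(properties_query("mouse", "biomol genomic", "gbdiv htg"))
# prints: mouse[Organism] AND biomol genomic[Properties] AND gbdiv htg[Properties]
```

Each additional Properties term narrows the previous result, so the counts should only go down as terms are added.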


  5. Use Entrez Nucleotide to find the full-length cDNA (mRNA) sequence for Plasmodium falciparum glyceraldehyde 3-phosphate dehydrogenase (GAPD). This time start by typing Plasmodium in the search box without limiting to any field. How many records do you retrieve? Browse through your results to find some records that are not from Plasmodium. Display a few of these to see why you retrieved them; you should find "Plasmodium" somewhere on the record. Now use the Limits tab to restrict to Plasmodium in the Organism field [Organism]. How many nucleotide records in Entrez are from Plasmodium? Now find GAPD records by using the Preview/Index tab to add glyceraldehyde 3-phosphate dehydrogenase as a [Title] Word. How many records did you retrieve?
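Problem 5 turns on the difference between an unrestricted search and a field-limited one. The toy model below imitates that behavior with simple substring matching over three invented records; it is not how Entrez indexes terms, and the record contents and the function name `search` are made up purely for illustration.

```python
# Toy records, invented for illustration only (not real database entries).
RECORDS = [
    {"id": "rec1", "organism": "Plasmodium falciparum",
     "title": "glyceraldehyde 3-phosphate dehydrogenase mRNA"},
    {"id": "rec2", "organism": "Homo sapiens",
     "title": "erythrocyte receptor bound by Plasmodium merozoites"},
    {"id": "rec3", "organism": "Plasmodium vivax",
     "title": "circumsporozoite protein gene"},
]


def search(records, term, field=None):
    """Return ids of records containing term, in one field or in any field."""
    term = term.lower()
    hits = []
    for record in records:
        if field is None:
            values = [v for k, v in record.items() if k != "id"]
        else:
            values = [record[field]]
        if any(term in value.lower() for value in values):
            hits.append(record["id"])
    return hits


all_fields = search(RECORDS, "Plasmodium")
organism_only = search(RECORDS, "Plasmodium", field="organism")
title_hits = search(RECORDS, "glyceraldehyde 3-phosphate dehydrogenase", field="title")
both = [rec_id for rec_id in organism_only if rec_id in title_hits]

print(all_fields)     # ['rec1', 'rec2', 'rec3']
print(organism_only)  # ['rec1', 'rec3']
print(both)           # ['rec1']
```

The human record is retrieved by the unrestricted search only because "Plasmodium" appears in its title, which mirrors what you should see when you inspect the non-Plasmodium hits in Entrez.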


  6. Search for population and phylogenetic studies on bears in Entrez PopSet. Find the study on brown bears and polar bears and display the alignment. What gene or molecular regions were used in this study? Use the tool bar link to display variations in the alignment. Are there fixed differences in the sequences from the brown bear, Ursus arctos, and the polar bear sequences in the alignment? What if the Ursus arctos sequence from the "ABC" islands (Sequence 7) is removed? Link to the article to read more about these remarkable results.


  7. Substantial data are available for two species of filarial nematodes that are human parasites. Use the Taxonomy Browser to examine the number of nucleotide sequences for the superfamily Filarioidea and determine which two species these are. How many nucleotide and protein sequences are there for each of these two species? Display nucleotide records for each of these. What kinds of sequences are most of these?




  8. The last known Tasmanian tiger died in the Hobart Zoo in 1936. DNA sequences have been obtained from museum specimens. (In fact, there is an effort to clone this animal using museum material.) You can retrieve Tasmanian tiger sequences using the Taxonomy Browser. Search the taxonomy database for Tasmanian Tiger. How many DNA and protein sequences are there? What genes were cloned? You can build a phylogenetic dataset that could be used to analyze the taxonomic position of the Tasmanian tiger with the Taxonomy Browser. Click on the Metatheria (Marsupial) link in the lineage of the tiger. How many nucleotide sequences are there for Metatheria? Retrieve the entry for Metatheria and get the nucleotide sequences. In Entrez you can refine the query to include only cytochrome b sequences through the Preview/Index tab. How many marsupial cytochrome b sequences are there? You could save these in FASTA format for use in phylogenetic analysis if you wanted. You could browse up the lineage further to get an outgroup sequence.

    There are a number of sequences for extinct organisms in the NCBI databases. Visit the list of extinct taxa in the Taxonomy pages.
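If you do save the cytochrome b sequences from problem 8 in FASTA format, a few lines of code are enough to read them back and check what you downloaded. This is a minimal sketch of a FASTA reader; the function name `read_fasta` is hypothetical, and the two sample entries are short placeholder strings, not real marsupial sequences.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines may wrap over several lines
    if header is not None:
        yield header, "".join(chunks)


# Placeholder data for illustration only; replace with your saved file.
SAMPLE = """>toy_seq_1 placeholder cytochrome b fragment
ATGACCAACATT
CGAAAA
>toy_seq_2 placeholder cytochrome b fragment
ATGACAAAC
"""

records = list(read_fasta(io.StringIO(SAMPLE)))
for header, sequence in records:
    print(header.split()[0], len(sequence))
```

To use it on a real download, pass an open file instead of the `io.StringIO` wrapper, for example `read_fasta(open("your_file.fasta"))`, where the file name is whatever you chose when saving from Entrez.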