Use Entrez Nucleotide to retrieve the finished record
AC009453 from the Human Genome Project. How many times has it been
updated since it first appeared? Trace the history all the way back to
the first version. Based on the update date, when did this record first
appear? How many unordered pieces were there then? Now use electronic PCR
(linked as a "hotspot" on the NCBI homepage) to identify STS markers
present in this record. How many are there? These include radiation
hybrid and genetic markers. Notice that one of these markers is also a
repeat polymorphism that is mapped on two human genetic maps (Marshfield
and Genethon). Follow the links from the ePCR results to see which
marker it is.
Retrieve the SWISS-PROT record for the human CFTR (cystic fibrosis)
protein by searching for CFTR_HUMAN in Protein from the search box on
the NCBI home page. View the record and look at the extensive
annotations. How
many primary database records are linked to this record? How many
literature citations are linked? What is the prevalence of cystic
fibrosis in the Caucasian population? Use the FEATURES table to find the
nature and location of the most common mutation in this gene in cystic
fibrosis. Now compare these annotations to those on the RefSeq record
NP_000483 and the corresponding primary sequence database record M28668.
Go back to the original SWISS-PROT entry at NCBI. Now use the BLink
link to retrieve related proteins. Click the Best Hits button and find
the related protein from the fish Fundulus heteroclitus. Follow
the PubMed link from this record to read about the biology of this
protein. What is the physiological role of this CFTR homologue in this
animal?
CFTR contains conserved domains that are homologous to bacterial
transporters. These bacterial homologues do not appear in the BLink
output because only the top 200 proteins are shown. You can use the
"Related sequences" link on the CFTR_HUMAN record to find these. Go back
to the CFTR_HUMAN record and follow the "Related sequences" link. How
many related proteins are there? To identify the ones from bacteria
click on the History tab. Follow the instructions on that page for
constructing a query that combines the protein neighbors with an Organism
field search for bacteria. Your query will be something similar to the
following:
#13 AND bacteria[Organism]
How many of the related sequences are bacterial proteins?
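The combined query above is just a text string, so it can be assembled mechanically. The following is a minimal sketch in Python, not part of Entrez itself: the helper name `combine_with_organism` is hypothetical, and the history number 13 stands in for whatever number your own History tab shows for the protein neighbors.

```python
def combine_with_organism(history_number: int, organism: str) -> str:
    """Format an Entrez query that ANDs a History set with an organism term.

    Hypothetical helper for illustration only. It does not contact Entrez;
    it just builds the text you would type into the search box.
    """
    if history_number < 1:
        raise ValueError("History numbers start at 1")
    return f"#{history_number} AND {organism}[Organism]"


# 13 is the example number from the text; substitute your own.
query = combine_with_organism(13, "bacteria")
print(query)
```

The same helper works for any other organism name that Entrez recognizes in the Organism field.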
Find the genomic scaffold AE003584 from Drosophila
melanogaster using Entrez Nucleotide. Display protein links to see the
predicted proteins for this scaffold. (You will need to increase the
number of records displayed to see all of the proteins on one page. Then
use the browser's "Find in page" function to find the protein that you
want.) Identify conserved domains present in predicted protein CG10879
(AAF51293) by clicking on the BLink link and then clicking the CDD
button. These conserved domains suggest a potential function for this
hypothetical protein. Now perform a search against the PROSITE patterns
using the ScanProsite tool at ExPASy. Did you find the
same protein family signature? To verify the Pfam results, try the
search against the PROSITE profiles. Do
your results agree now? This points out the problems with representing a
profile as a pattern.
The Entrez Nucleotide [Properties] field stores
information about the kind of sequence and its source. You can use the
index feature on the Preview/Index tab to display the terms that are
indexed for this field. The Properties field terms are somewhat cryptic,
but they are very useful for searching. Three useful types are the
biomol, gbdiv and srcdb sets. The biomol terms classify records based on
the type and origin of the molecule, for example biomol mrna or
biomol genomic. The gbdiv sets of terms index records by the GenBank
division code, gbdiv est, gbdiv pri, gbdiv htg and so on. The srcdb
terms classify records based upon their database origin. For nucleotide
records these could be GenBank, EMBL, DDBJ, RefSeq or PDB (srcdb
genbank, srcdb embl, srcdb ddbj, srcdb refseq). Perform an organism
search for mouse, then use the Preview/Index tab and the Properties field
terms to count the number of mouse genomic records. How many of these
are draft sequences (gbdiv htg)? How many are finished records (gbdiv
rod)? How many are genome survey sequences? How many of these genomic
records are RefSeqs? What kind of RefSeqs are they? Now retrieve all
mouse mRNA records. How many of these are in the rodent division? How
many are in the EST division? Using these properties field terms, design
a query and retrieve all the mouse known mRNA RefSeqs (NM_).
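Properties terms combine with other fields using ordinary Boolean operators. As a rough sketch of how such queries are put together, the Python below builds search strings from an organism name and one or more Properties terms. The function name `properties_query` is hypothetical, and quoting each multi-word term before the `[Properties]` tag is an assumption about how you would prefer to type it rather than something this exercise prescribes.

```python
def properties_query(organism: str, *property_terms: str) -> str:
    """Join an organism search with Properties-field terms using AND.

    Hypothetical helper for illustration; it only formats the query text.
    """
    parts = [f"{organism}[Organism]"]
    # Each Properties term (e.g. "biomol genomic") is quoted and tagged.
    parts += [f'"{term}"[Properties]' for term in property_terms]
    return " AND ".join(parts)


# All mouse genomic records, then only those in the HTG (draft) division.
genomic = properties_query("mouse", "biomol genomic")
draft = properties_query("mouse", "biomol genomic", "gbdiv htg")
print(genomic)
print(draft)
```

Swapping in other terms from the index (gbdiv rod, gbdiv est, srcdb refseq and so on) gives the remaining counts asked for above.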
Use Entrez Nucleotide to find the full-length cDNA (mRNA)
sequence for Plasmodium falciparum glyceraldehyde 3-phosphate
dehydrogenase (GAPD). This time start by typing Plasmodium in the search
box without limiting to any field. How many records do you retrieve?
Browse through your results to find some records that are not from
Plasmodium. Display a few of these to see why you retrieved them;
you should find "Plasmodium" somewhere on the record. Now use the Limits
tab to restrict to Plasmodium in the Organism field [Organism]. How many
nucleotide records in Entrez are from Plasmodium? Now find GAPD
records by using the Preview/Index tab to add glyceraldehyde 3-phosphate
dehydrogenase as a [Title] Word. How many records did you retrieve?
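The exercise above uses the web interface, but the same field-restricted search can also be sent to NCBI's E-utilities, a separate programmatic interface that the exercise does not cover. The sketch below only builds the ESearch request URL and makes no network call. The function name `esearch_count_url` is hypothetical, and the choice of `nuccore` as the database name and `rettype=count` to ask for a count only are assumptions about how you might want to run it.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Base address of the E-utilities ESearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_count_url(term: str, db: str = "nuccore") -> str:
    """Build an ESearch URL that asks only for the number of matches.

    Hypothetical helper: it encodes the query but does not send it.
    """
    params = {"db": db, "term": term, "rettype": "count"}
    return f"{ESEARCH}?{urlencode(params)}"


term = ("Plasmodium[Organism] AND "
        "glyceraldehyde 3-phosphate dehydrogenase[Title]")
url = esearch_count_url(term)
print(url)

# Round-trip check: decode the URL to confirm the term survived encoding.
decoded = parse_qs(urlsplit(url).query)
print(decoded["term"][0])
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return a small XML document containing the count.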
Search for population and phylogenetic studies on
bears in Entrez PopSet. Find the study on brown bears and polar
bears and display the alignment. What gene or molecular regions were
used in this study? Use the tool bar link to display variations in the
alignment. Are there fixed differences in the sequences from the brown
bear, Ursus arctos, and the polar bear sequences in the
alignment? What if the Ursus arctos sequence from the "ABC"
islands (Sequence 7) is removed? Link to the article to read more about
these remarkable results.
Substantial data are available for two species of
filarial nematodes that are human parasites. Use the Taxonomy Browser
to examine the number of nucleotide
sequences for the superfamily Filarioidea and determine which two species
these are. How many nucleotide and protein sequences are there for each
of these two species? Display nucleotide records for each of these. What
kinds of sequences are most of these?
The last known Tasmanian tiger died in the Hobart Zoo in 1936.
DNA sequences have been obtained from museum specimens. (In fact,
there is an effort to clone this animal using museum material.)
You can retrieve Tasmanian tiger sequences using the Taxonomy
Browser. Search the taxonomy database
for Tasmanian Tiger. How many DNA and protein sequences are there?
What genes were cloned? You can build a phylogenetic dataset that
could be used to analyze the taxonomic position of the Tasmanian
Tiger with the Taxonomy Browser. Click on the Metatheria
(Marsupial) link in the lineage of the tiger. How many nucleotide
sequences are there for Metatheria? Retrieve the entry for
Metatheria and get the nucleotide sequences. In Entrez you can
refine the query to include only cytochrome b sequences
through the Preview/Index tab. How many marsupial cytochrome
b sequences are there? You could save these in FASTA format
for use in phylogenetic analysis if you wanted. You could browse
up the lineage further to get an outgroup sequence.
There are a number of sequences for extinct organisms in the
NCBI databases. Visit the list of extinct taxa in the Taxonomy pages.
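If you do save the cytochrome b sequences in FASTA format, a few lines of code are enough to read them back for later analysis. The sketch below is one simple way to do it in Python, assuming a plain FASTA file with ">" header lines. The function name `read_fasta` is hypothetical, and the two short sequences are made-up placeholders, not real marsupial data.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into a mapping of header line to sequence.

    Minimal illustrative reader: it joins wrapped sequence lines and
    skips blank lines, but does no validation of the residues.
    """
    chunks: Dict[str, List[str]] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            chunks[header] = []
        elif header is None:
            raise ValueError("sequence data found before the first header")
        else:
            chunks[header].append(line)
    return {name: "".join(parts) for name, parts in chunks.items()}


# Made-up example input; replace with the lines of your saved file.
example = """>seq1 made-up example
ATGACC
AACATT
>seq2 made-up example
ATGACA
""".splitlines()

records = read_fasta(example)
print(len(records), "sequences read")
```

To read a saved file instead, pass an open file object, for example `read_fasta(open("marsupial_cytb.fasta"))`, where the file name is again only a placeholder.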