Entrez LAB MODULE

Entrez (1) is a search and retrieval tool developed by NCBI that is capable of searching multiple NCBI databases with just one query. Entrez returns search results that can include a combination of many types of data on the query, such as nucleotide sequences, protein sequences, macromolecular structures, and related articles in the literature. Prior to the creation of Entrez, an individual might have to place one query to a nucleotide database to find a nucleotide sequence, submit another query to a structural database to find the published structure of the gene product, and submit a final query to a literature database to find citations for journal articles on the query topic. NCBI recognized the time and effort that could be saved by a tool that could cross-link these databases and integrate all information related to a given query subject into one report. View the Entrez Database page.

This module contains a few questions, designated Q#, for use in a computer lab setting. The lab instructor may require that you supply answers to these questions as an indication that you have completed the module.

The Entrez Nucleotides database includes sequences from GenBank, RefSeq, and PDB. GenBank is the National Institutes of Health (NIH) genetic sequence database. GenBank, the DNA DataBank of Japan (DDBJ) and the European Molecular Biology Laboratory (EMBL) comprise the International Nucleotide Sequence Database Collaboration. These three organizations exchange data on a daily basis. The number of bases in the Entrez Nucleotides database currently grows at an exponential rate. Click on the Nucleotides link located in the blue border section on the left side of the web page. (Q1) What is the total number of bases stored in the Entrez nucleotide database today?

MMDB (Molecular Modeling DataBase) is NCBI's structure database, and it is a subset of three-dimensional structures obtained from the Protein Data Bank (PDB), excluding theoretical models. The literature database is accessed through PubMed, which encompasses the National Library of Medicine's journals database, MEDLINE, and provides some additional online services. MEDLINE is a collection of medical and life science journal citations that includes articles dating back to the mid-1960s. Entrez also allows access to information such as nucleotide and protein sequences organized by species in the NCBI taxonomy database. These are the most commonly queried databases, but there are many more databases that are accessed by Entrez, as you can see by returning to the Entrez Database webpage and viewing the list in the blue border section on the left.

From the Entrez Database webpage, click on the blue "Try a Tutorial" link at the top. Follow the tutorial by reading and completing any instructions listed in the blue boxes. If you have performed all the instructions in a given blue box and nothing happens, click the "Go" button next to the search box at the top of the page to continue to the next page of the tutorial.

When you are ready to perform a search using the query Mycobacterium tuberculosis, the tutorial will request that you limit your search by changing "all fields" to "organism". (Q2) Why does the tutorial request this? (Q3) What would be the result if you left this option set on "all fields", instead of changing it to "organism"? (Q4) How many items are returned as matches for the query when you limit the search to "organism"? Keep in mind that this is a tutorial, so trying to access any results beyond the first page will cause an error message. Also, some of the links that are returned are outdated, presumably because the tutorial is not updated as frequently as the database. (Q5) Still, looking at the first page of results, when the search query Mycobacterium tuberculosis is limited to the field of "organism", what types of items are returned as matches to this query? (This question can be answered with a general answer, by looking at the list of results, but feel free to click on one of the first two links if you would like a more comprehensive look at the information returned as a match.)

The tutorial will next ask you to perform a combined search. Holding your mouse pointer over the accession number of the first returned item will yield a tutorial box containing a brief explanation of the item. Likewise, the mouse can reveal a tutorial box containing definitions for each of the blue links to the right of the accession number. Click on the accession number and read the information in the blue boxes to continue the tutorial.

The gene that encodes a putative penicillin-binding protein has been identified by the tutorial. Outside the tutorial, however, you would make this identification by scrolling through the results and looking at the CDS listings. The CDS tag identifies coding DNA sequences, meaning these sequences have been determined (most often by bioinformatics and not experimental methods) to encode proteins. They are thus distinguished from the noncoding regions that make up, for example, a substantial portion of the DNA in the human genome. A good primer on the basic characteristics of DNA, including the differences between coding and noncoding sequences, can be found on the Dolan DNA Learning Center web page (2).
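The field restriction used in the tutorial ("all fields" versus "organism") can also be written directly into a query as a field tag in square brackets. The short Python sketch below is not part of the tutorial; it only builds the search term and a matching request address for NCBI's E-utilities ESearch service, so that the two forms of the query can be compared side by side. The function names build_term and build_esearch_url are hypothetical helpers introduced here for illustration, and the database name "nucleotide" is an assumption about which database you want to search.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities ESearch service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(query, field=None):
    """Return an Entrez search term, optionally limited to one field.

    Hypothetical helper: a field tag such as "Organism" is appended in
    square brackets; with no field, the query is left unrestricted.
    """
    return f"{query}[{field}]" if field else query


def build_esearch_url(db, query, field=None):
    """Return an ESearch URL for the given database and query (hypothetical helper)."""
    params = {"db": db, "term": build_term(query, field)}
    return f"{ESEARCH_URL}?{urlencode(params)}"


# The same query, first unrestricted and then limited to the organism field.
unrestricted = build_esearch_url("nucleotide", "Mycobacterium tuberculosis")
by_organism = build_esearch_url("nucleotide", "Mycobacterium tuberculosis", "Organism")

print(unrestricted)
print(by_organism)
```

The sketch only constructs the addresses and does not contact NCBI. Pasting either address into a browser should return a short XML report for that query, which can help when thinking about Q2 through Q4.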

Use your browser to go back to the list of 14 records that were returned in response to your combined query. Now click on the next accession number in the list, next to the number 2. Scroll through the results, looking at the information marked by CDS tags, and find the gene for the predicted penicillin-binding protein that caused this record to be returned as a match. (Q6) What is the Rv number assigned to this gene? (M. tuberculosis genes from the sequenced genome of strain H37Rv are all assigned names beginning with Rv plus a numerical code.) Read the information marked by this CDS tag. There is a sequence of capital letters at the bottom of the section. (Q7) What does this sequence represent? Click on the blue CDS tag to the left of the Rv number. (Q8) What additional information is provided after the sequence of capital letters that you have already observed? If these questions regarding sequences have been difficult to answer, please review the genetic code, as this is prerequisite information for this course module.

Try your own search. Scroll back to the top of the web page and, this time, choose PubMed from the menu next to the Search command. Pick any life sciences topic that interests you for your query. Attempt a first query with a general topic, such as protein kinase or tuberculosis. (Q9) What kind of results does PubMed return from a query? Note how many items in total (not just on the first page) were returned. Now make your query more specific while keeping it related to your original choice. For example, change 'protein kinase' to 'protein kinase C'. (Q10) How much did this reduce the number of items returned?

This module is intended as an introduction to performing searches of the NCBI databases using Entrez. If you are unfamiliar with Entrez, please feel free to return to this module as a resource for getting started on NCBI searches.
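The PubMed comparison in Q9 and Q10 can also be approached from a script. The Python sketch below is one possible way to do this, not part of the module itself: it builds count-only ESearch addresses for a general and a more specific query, reads the Count element from the XML that ESearch returns, and expresses the difference as a percentage. The helper names count_url, parse_count, and percent_reduction are hypothetical, and the code assumes the response contains a single top-level Count element.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Base address of the NCBI E-utilities ESearch service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def count_url(query, db="pubmed"):
    """Return an ESearch URL that asks only for the number of matches."""
    params = {"db": db, "term": query, "rettype": "count"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


def parse_count(xml_text):
    """Extract the match count from an ESearch XML response.

    Assumes the root element has a direct <Count> child.
    """
    root = ET.fromstring(xml_text)
    return int(root.findtext("Count"))


def percent_reduction(broad, narrow):
    """Return how much smaller the narrow result set is, in percent."""
    return 100.0 * (broad - narrow) / broad


# Addresses for the general and the more specific example queries.
print(count_url("protein kinase"))
print(count_url("protein kinase C"))
```

The sketch deliberately stops short of sending the requests. The XML returned when you open each address in a browser can be passed to parse_count, and the two counts to percent_reduction, to check the answer you worked out by hand for Q10.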

References:

  1. Benson D.A., Boguski M.S., Lipman D.J., Ostell J. "GenBank." Nucleic Acids Res. 22:3441-3444 (1994).
  2. Dolan DNA Learning Center, Cold Spring Harbor Laboratory. http://www.bioservers.org/ (noncommercial, educational use only).