Help Page: get the most out of using RNAdb
Index
1. Description of RNAdb Datasets
2. Browsing RNAdb
3. Searching RNAdb - Keyword Search
4. Searching RNAdb - Using Filters
5. BLAST Search
6. Downloading datasets from RNAdb
7. Creating web links to individual RNAdb entries
8. References
9. Appendix A
1. Description of RNAdb Datasets
Noncoding RNAs (ncRNAs) in RNAdb are provided in a number of separate datasets. This decision was made not only to reflect the different sources from which ncRNAs were identified but also in acknowledgement that users may want to separately query and download each set. For release 2.0, the datasets are: i) miRNAs; ii) snoRNAs and scaRNAs; iii) piRNAs; iv) other ncRNAs from the literature; v) FANTOM3 ncRNAs; vi) H-Invitational ncRNAs; vii) ncRNAs predicted from structural alignments; and viii) predicted antisense ncRNAs.
A brief summary of each dataset is provided below:
i) miRNAs
Over 1800 mammalian miRNAs are found within RNAdb. These sequences were obtained from the latest release of miRBase (release 8.2, July 2006) (Griffiths-Jones et al., 2006). miRBase is the central repository for miRNA data on the web and is regularly maintained. We have elected to link the RNAdb miRNA entries directly to miRBase, so as to keep abreast of the most recent annotations and updates.
ii) snoRNAs and scaRNAs
RNAdb contains more than 500 mammalian snoRNAs and scaRNAs. The snoRNAs fall into two general classes, C/D box and H/ACA snoRNAs, which classically guide ribose methylation and pseudouridylation of rRNAs, respectively. Interestingly, some snoRNAs appear to regulate other RNAs. Human snoRNAs and scaRNAs in RNAdb were derived from snoRNA-LBME-db (release 3, August 2006) (Lestrade and Weber, 2006), and annotations for these sequences are maintained by linking out to this informative and specialized resource.
iii) piRNAs
The PIWI family of proteins is known to be important for germ cell development. PIWI proteins were recently discovered to bind thousands of small RNAs, termed piRNAs (Aravin et al., 2006; Girard et al., 2006; Lau et al., 2006). piRNAs have been identified in testis, are 26-31 nucleotides in length, and are distinct from miRNAs. Over 176,000 piRNA candidates have been cloned and sequenced from mouse, human and rat, and are included for the first time in the current release of RNAdb.
iv) other ncRNAs from the literature
This dataset contains more than 900 unique ncRNA sequences which have been identified and manually curated based upon extensive literature review. The majority of ncRNAs listed here are much longer than those in the previous three datasets. Altogether, 36 mammalian organisms are represented but most ncRNAs are either murine or human. Although some of these transcripts have documented biological roles, most are transcripts of unknown function. As well as sequence data, additional information - including Genbank accessions, references, chromosomal location, transcript length, splicing status, conservation notes, function, disease associations, antisense relationships, imprinting status, and tissue expression patterns - is provided wherever possible in separate searchable fields.
v) FANTOM3 ncRNAs
Using full-length cDNA cloning and sequencing strategies, the Functional Annotation of Mouse (FANTOM) project has identified thousands of novel transcripts from the mouse genome (Carninci et al., 2005). In the most recent round of annotation, 34,030 cDNAs were manually annotated as putative ncRNAs (Maeda et al., 2006). Since both cloning and manual annotation are subject to variation and error, the true number of ncRNAs remains unclear. To this end, we provide the results of various computational prediction strategies for use as additional filters in identifying ncRNAs (see Appendix A for more details). In addition to sequence data, details such as the Riken clone identifier, Genbank accession, genomic location, transcript length, likely imprinting status, and library of origin are provided. RNAdb also incorporates expression information from publicly available microarray datasets such as GNF Symatlas (Su et al., 2004). Although limited to only a small proportion of FANTOM3 ncRNAs, this information allows the identification of transcripts that are dynamically expressed across various tissues and cell types, and is expected to provide a useful starting point for their further characterization.
vi) H-Invitational ncRNAs
This dataset contains more than 1700 putative ncRNAs from the latest round of the Human Full-length cDNA Annotation Invitational (H-Invitational) project (release 3.4, August 2006) (Imanishi et al., 2004). Transcripts are defined as non-protein-coding in this dataset if they lack any open reading frame and do not belong to the pseudogene classification. In addition to the sequence data, details such as the Genbank accession, genomic location, transcript length, library of origin, and expression data (based upon publicly available microarray data where present) are also listed.
vii) ncRNAs predicted from structural alignments
Recently, a number of studies have identified thousands of putative ncRNAs based upon predicted structural features and alignments using novel comparative genomics tools. The datasets resulting from three independent approaches, RNAz (Washietl et al., 2005), ncRNAscan (Torarinsson et al., 2006) and EvoFold (Pedersen et al., 2006), are included here. RNAz combines a comparative approach (scoring conservation of secondary structure) with the observation that ncRNAs are thermodynamically more stable than expected by chance. Applied to sequences conserved in at least human, mouse, rat and dog, it identified over 35,000 structured elements in the human genome. ncRNAscan (Non-coding RNA Search) takes syntenic regions between human and mouse that are unalignable in primary sequence and then uses the FOLDALIGN algorithm to identify regions with conserved secondary structure. Finally, EvoFold uses a comparative genomics method based on phylogenetic stochastic context-free grammars to identify functional RNAs. Applied to an eight-way genome-wide alignment of human, chimpanzee, mouse, rat, dog, chicken, zebrafish and pufferfish, it identified over 48,000 candidate RNA structures in the human genome.
viii) predicted antisense ncRNAs
Natural antisense transcription is now recognized as being a common occurrence in the mammalian transcriptome, and a means by which gene expression can be regulated. In release 1.0, RNAdb contained a dataset of putative antisense ncRNAs identified from cDNA and EST databases for human and mouse using a computational pipeline. Coinciding with release 2.0, we have recently re-developed this pipeline, and experimentally validated a subset of its predictions (Engstrom et al., 2006). We will continue to use the improved pipeline in regular updates of the antisense ncRNA dataset.
2. Browsing RNAdb
Users can browse the entire collection or search for specific ncRNAs of interest (see below). When browsing, first select the dataset of interest. What will then appear is an Outline table that lists the first page of ncRNAs within the selected dataset. Users can then choose other pages of the dataset to display.
The Outline table displays only limited information about each ncRNA, including the RNAdb sequence identifier, description, Genbank accession, species and the source from which the ncRNA was identified. To view the detailed annotation for a particular ncRNA, click 'Select'; the sequence and relevant annotations are then displayed in a new Detailed View table.
3. Searching RNAdb - Keyword Search
Users can perform keyword searches to find particular ncRNAs of interest. To do so, first select the dataset of interest, then type your search term in the query box. To search for a phrase, enclose the phrase in double quotes (for instance "miRNA stem-loop"). The search interface accepts Boolean operators (AND, OR, NOT, NEAR) to allow more advanced queries to be run. For example, you can search for "Xist AND Bos taurus". Boolean operators can be entered in either upper or lower case, and are processed from left to right. Once you have entered one or more search terms, click 'Search' to run the query against the selected dataset. The hits will be displayed in the Results table below the search interface.
Examples of common searches
A. Query line: "expressed pseudogene" - result: displays all ncRNAs annotated as expressed pseudogenes for the selected dataset
B. Query line: "imprinted" - result: displays all ncRNAs annotated as being imprinted for the selected dataset
C. Query line: "host gene" - result: displays all ncRNAs annotated as being miRNA or snoRNA host genes for the selected dataset
D. Query line: "tissue-specific" - result: displays all ncRNAs annotated as being tissue-specific for the selected dataset
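The statement that Boolean operators are processed from left to right can be illustrated with a small sketch. This is not the RNAdb search engine, whose implementation is not described here; the function name, the toy index and the entry identifiers (E1 to E4) are all hypothetical, NOT is treated as a binary "and not", and NEAR is left out because its exact meaning is not specified on this page.

```python
def evaluate_left_to_right(tokens, hits_for_term):
    """Combine hit sets strictly left to right, with no operator precedence.

    tokens: alternating terms and operators, e.g. ["Xist", "OR", "Air", "AND", "imprinted"]
    hits_for_term: hypothetical mapping from a search term to a set of entry IDs
    """
    operations = {
        "AND": lambda left, right: left & right,  # entries matching both
        "OR": lambda left, right: left | right,   # entries matching either
        "NOT": lambda left, right: left - right,  # assumption: binary "and not"
    }
    if len(tokens) % 2 == 0:
        raise ValueError("expected alternating terms and operators")
    result = set(hits_for_term.get(tokens[0], set()))
    for i in range(1, len(tokens), 2):
        operator = tokens[i].upper()  # operators are case-insensitive
        if operator not in operations:
            raise ValueError(f"unsupported operator: {tokens[i]}")
        term_hits = hits_for_term.get(tokens[i + 1], set())
        result = operations[operator](result, term_hits)
    return result


# Toy index with made-up entry IDs, for illustration only.
toy_index = {
    "Xist": {"E1", "E2"},
    "Air": {"E3"},
    "imprinted": {"E2", "E3", "E4"},
}
print(sorted(evaluate_left_to_right(["Xist", "or", "Air", "AND", "imprinted"], toy_index)))
# ['E2', 'E3']
```

Under left-to-right processing the query is read as (Xist OR Air) AND imprinted, so E1 is dropped. If AND were instead evaluated before OR, E1 would be kept, which is why term order matters when mixing operators.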
4. Searching RNAdb - Using Filters
Filters - either singly or in combination - can also be applied to search for ncRNAs of interest. The number and type of filters vary between datasets, a reflection of the different informational content within each set. As with keyword searches, you must first select the dataset of interest. (Note: Filters and keyword searches can be combined together.)
5. BLAST Search
BLAST (Basic Local Alignment Search Tool) is a heuristic search algorithm tailored for finding sequence similarities (Karlin and Altschul, 1990). On the BLAST Search page, users can use a program called blastn to compare a nucleotide sequence of interest (up to 8000 nt in length) against the ncRNA sequences listed in RNAdb. To do so, paste your sequence into the large query text box, select the dataset of interest, then click 'BLAST sequence'. Your request will be queued for processing, and a BLAST job ID number will appear. Usually, your request will take only a few seconds but it may take longer depending upon the query. To view the results, click 'Get Results'. The search results are displayed in a tabular format that lists all the ncRNAs with similarity to your sequence of interest, along with standard BLAST statistics including % identity, length, mismatches, gaps, E-value and bit score. For more information on BLAST, go to http://www.ncbi.nlm.nih.gov/BLAST/
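For users who prefer to post-process BLAST hits in a script, the following is a minimal sketch. It assumes that results are available as tab-delimited text in the standard 12-column NCBI BLAST tabular layout (query, subject, % identity, length, mismatches, gap openings, query start/end, subject start/end, E-value, bit score); whether RNAdb exports results in exactly this form is an assumption, and the function and class names are hypothetical.

```python
import csv
import io
from typing import NamedTuple

MAX_QUERY_NT = 8000  # query length limit stated for the BLAST Search page


class Hit(NamedTuple):
    subject_id: str
    pct_identity: float
    length: int
    mismatches: int
    gaps: int
    evalue: float
    bit_score: float


def check_query(sequence: str) -> str:
    """Strip whitespace and confirm the query fits the stated length limit."""
    cleaned = "".join(sequence.split()).upper()
    if not cleaned:
        raise ValueError("empty query sequence")
    if len(cleaned) > MAX_QUERY_NT:
        raise ValueError(f"query is {len(cleaned)} nt; the limit is {MAX_QUERY_NT} nt")
    return cleaned


def parse_hits(tabular_text: str) -> list:
    """Parse tab-delimited BLAST output (assumed 12-column layout) into Hit records."""
    hits = []
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hits.append(Hit(row[1], float(row[2]), int(row[3]), int(row[4]),
                        int(row[5]), float(row[10]), float(row[11])))
    return hits


# Made-up example line, for illustration only.
example = "query1\tMIR1004\t98.5\t67\t1\t0\t1\t67\t1\t67\t2e-30\t125\n"
strong_hits = [hit for hit in parse_hits(example) if hit.evalue < 1e-10]
print(strong_hits[0].subject_id)
```

Filtering on E-value, as in the last lines, is one common way to keep only the most significant matches; the threshold shown is arbitrary.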
6. Downloading datasets from RNAdb
Each dataset contained within RNAdb can be downloaded individually as a zipped file from either the 'Fasta Downloads' or 'XML Downloads' page.
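Once a zipped FASTA file has been downloaded, its records can be read without unpacking it first. The sketch below assumes the archive contains one or more plain-text FASTA files; the archive name in the usage comment is hypothetical, as the actual file names are not listed on this page.

```python
import io
import zipfile


def read_fasta_from_zip(zip_source):
    """Yield (header, sequence) pairs from every file inside a zip archive.

    zip_source: a path or a binary file-like object holding the archive.
    """
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            with archive.open(name) as handle:
                header, chunks = None, []
                for raw_line in io.TextIOWrapper(handle, encoding="utf-8"):
                    line = raw_line.strip()
                    if line.startswith(">"):
                        if header is not None:
                            yield header, "".join(chunks)
                        header, chunks = line[1:], []
                    elif line and header is not None:
                        chunks.append(line)  # sequence may span several lines
                if header is not None:
                    yield header, "".join(chunks)


# Usage, with a hypothetical archive name:
# for header, sequence in read_fasta_from_zip("rnadb_mirna_fasta.zip"):
#     print(header, len(sequence))
```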
7. Creating web links to individual RNAdb entries
To facilitate links with other on-line resources, users can go directly to the detailed view of an entry by using the following URL and substituting the RNAdb unique identifier of interest for <RNAdbID>:
http://jsm-research.imb.uq.edu.au/rnadb/default.aspx?ncrna=<RNAdbID>
As an example, to look at the detailed view for MIR1004, use:
http://jsm-research.imb.uq.edu.au/rnadb/default.aspx?ncrna=MIR1004
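When many links need to be generated, for instance from a list of identifiers in another resource, the substitution can be scripted. The following is a minimal sketch using the URL pattern given above; the function name is hypothetical.

```python
from urllib.parse import urlencode

# Base address taken from the URL pattern given in this section.
BASE_URL = "http://jsm-research.imb.uq.edu.au/rnadb/default.aspx"


def rnadb_entry_url(rnadb_id: str) -> str:
    """Build a link to the detailed view of one RNAdb entry."""
    rnadb_id = rnadb_id.strip()
    if not rnadb_id:
        raise ValueError("an RNAdb identifier is required")
    # urlencode escapes any characters that are not safe in a query string
    return f"{BASE_URL}?{urlencode({'ncrna': rnadb_id})}"


print(rnadb_entry_url("MIR1004"))
# http://jsm-research.imb.uq.edu.au/rnadb/default.aspx?ncrna=MIR1004
```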
8. References
Aravin, A., Gaidatzis, D., Pfeffer, S., Lagos-Quintana, M., Landgraf, P., Iovino, N., Morris, P., Brownstein, M. J., Kuramochi-Miyagawa, S., Nakano, T., et al. (2006). A novel class of small RNAs bind to MILI protein in mouse testes. Nature.
Badger, J. H., and Olsen, G. J. (1999). CRITICA: coding region identification tool invoking comparative analysis. Mol Biol Evol 16, 512-524.
Carninci, P., Kasukawa, T., Katayama, S., Gough, J., Frith, M. C., Maeda, N., Oyama, R., Ravasi, T., Lenhard, B., Wells, C., et al. (2005). The transcriptional landscape of the mammalian genome. Science 309, 1559-1563.
Engstrom, P. G., Suzuki, H., Ninomiya, N., Akalin, A., Sessa, L., Lavorgna, G., Brozzi, A., Luzi, L., Tan, S. L., Yang, L., et al. (2006). Complex Loci in Human and Mouse Genomes. PLoS Genet 2, e47.
Furuno, M., Kasukawa, T., Saito, R., Adachi, J., Suzuki, H., Baldarelli, R., Hayashizaki, Y., and Okazaki, Y. (2003). CDS annotation in full-length cDNA sequence. Genome Res 13, 1478-1487.
Girard, A., Sachidanandam, R., Hannon, G. J., and Carmell, M. A. (2006). A germline-specific class of small RNAs binds mammalian Piwi proteins. Nature.
Griffiths-Jones, S., Grocock, R. J., van Dongen, S., Bateman, A., and Enright, A. J. (2006). miRBase: microRNA sequences, targets and gene nomenclature. Nucleic Acids Res 34, D140-144.
Imanishi, T., Itoh, T., Suzuki, Y., O'Donovan, C., Fukuchi, S., Koyanagi, K. O., Barrero, R. A., Tamura, T., Yamaguchi-Kabata, Y., Tanino, M., et al. (2004). Integrative Annotation of 21,037 Human Genes Validated by Full-Length cDNA Clones. PLoS Biol 2, E162.
Karlin, S., and Altschul, S. F. (1990). Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc Natl Acad Sci U S A 87, 2264-2268.
Lau, N. C., Seto, A. G., Kim, J., Kuramochi-Miyagawa, S., Nakano, T., Bartel, D. P., and Kingston, R. E. (2006). Characterization of the piRNA complex from rat testes. Science 313, 363-367.
Lestrade, L., and Weber, M. J. (2006). snoRNA-LBME-db, a comprehensive database of human H/ACA and C/D box snoRNAs. Nucleic Acids Res 34, D158-162.
Liu, J., Gough, J., and Rost, B. (2006). Distinguishing Protein-Coding from Non-Coding RNAs through Support Vector Machines. PLoS Genet 2.
Maeda, N., Kasukawa, T., Oyama, R., Gough, J., Frith, M., Engstrom, P. G., Lenhard, B., Aturaliya, R. N., Batalov, S., Beisel, K. W., et al. (2006). Transcript annotation in FANTOM3: mouse gene catalog based on physical cDNAs. PLoS Genet 2, e62.
Pedersen, J. S., Bejerano, G., Siepel, A., Rosenbloom, K., Lindblad-Toh, K., Lander, E. S., Kent, J., Miller, W., and Haussler, D. (2006). Identification and Classification of Conserved RNA Secondary Structures in the Human Genome. PLoS Comput Biol 2, e33.
Su, A. I., Wiltshire, T., Batalov, S., Lapp, H., Ching, K. A., Block, D., Zhang, J., Soden, R., Hayakawa, M., Kreiman, G., et al. (2004). A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci U S A 101, 6062-6067.
Torarinsson, E., Sawera, M., Havgaard, J. H., Fredholm, M., and Gorodkin, J. (2006). Thousands of corresponding human and mouse genomic regions unalignable in primary sequence contain common RNA structure. Genome Res.
Washietl, S., Hofacker, I. L., Lukasser, M., Huttenhofer, A., and Stadler, P. F. (2005). Mapping of conserved RNA secondary structures predicts thousands of functional noncoding RNAs in the human genome. Nat Biotechnol 23, 1383-1390.
9. Appendix A
Computational prediction of FANTOM3 ncRNAs
A number of computational strategies were used to help identify ncRNAs within the FANTOM3 cDNA clone set. These are described below.
A consensus method based upon the predictions of three independent computer programs - mTRANS, rsCDS and CRITICA - was used to identify 38,129 putative ncRNAs. cDNAs were annotated as ncRNAs by this method if at least two of the three programs classified a particular transcript as noncoding. mTRANS (M. Furuno, unpublished data) makes predictions based upon assessment of likely open reading frames (ORFs), while taking into account possible experimental errors in the cDNA sequences (for instance, due to sequencing errors, intron retention, and 3' end truncation). rsCDS uses sequence similarity to known proteins to identify likely coding transcripts (Furuno et al., 2003). CRITICA makes predictions by assessing synonymous vs nonsynonymous nucleotide substitution rates (Badger and Olsen, 1999).
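The two-out-of-three rule can be expressed in a few lines. The sketch below only illustrates the voting step; it does not run mTRANS, rsCDS or CRITICA, and the function name and the dictionary layout for the per-program calls are hypothetical.

```python
PROGRAMS = ("mTRANS", "rsCDS", "CRITICA")


def consensus_noncoding(calls, min_votes=2):
    """Return True if at least min_votes programs called the transcript noncoding.

    calls: mapping from program name to True (noncoding) or False (coding).
    """
    missing = [name for name in PROGRAMS if name not in calls]
    if missing:
        raise ValueError(f"missing calls for: {', '.join(missing)}")
    votes = sum(1 for name in PROGRAMS if calls[name])
    return votes >= min_votes


# Example: two of the three programs call the transcript noncoding.
print(consensus_noncoding({"mTRANS": True, "rsCDS": False, "CRITICA": True}))
# True
```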
A support vector machine-based algorithm has also recently been reported to distinguish protein-coding from noncoding RNAs with high sensitivity and specificity (Liu et al., 2006). This method confidently predicted 13,873 ncRNAs within the FANTOM3 clone set.
Many FANTOM3 cDNAs appear to be full-length, but some are not. Since the accurate prediction of ncRNAs relies upon having the entire cDNA sequence, some cDNAs predicted or annotated as ncRNAs are incorrectly classified. To address this issue, we have provided an additional filter to exclude likely truncated cDNAs, based upon the presence or absence of strong experimental support for completeness of their 5' and 3' ends (Carninci et al., 2005). Since ncRNAs are frequently quite long (>10 kb) and cannot be cloned in their entirety given limitations in current cDNA cloning technologies, it is important to note that this filter will also exclude some genuine ncRNAs, including cDNAs corresponding to well-known ncRNAs such as Xist and Air. The following criteria (each of which has less than a 1 in 1000 chance of occurring if the sites are randomly scattered across the genome) were used:
5' ends: 2 CAGE tag starts within ±15 nt, 3 CAGE tag starts within ±60 nt, 4 CAGE tag starts within ±100 nt, 1 GSC ditag start within ±0 nt, 2 GSC ditag starts within ±50 nt, 1 GIS ditag start within ±15 nt, 1 RIKEN 5'EST start within ±3 nt, 2 RIKEN 5'EST starts within ±100 nt, 1 non-RIKEN 5'EST start within ±2 nt, 2 non-RIKEN 5'EST starts within ±100 nt, 1 other FANTOM clone start within ±25 nt, or 1 non-RIKEN RNA start within ±50 nt.
3' ends: 1 GSC ditag end within ±0 nt, 2 GSC ditag ends within ±50 nt, 1 GIS ditag end within ±15 nt, 1 RIKEN 3'EST end within ±2 nt, 2 RIKEN 3'EST ends within ±100 nt, 1 non-RIKEN 3'EST end within ±7 nt, 2 non-RIKEN 3'EST ends within ±100 nt, 1 other FANTOM clone end within ±25 nt, or 1 non-RIKEN RNA end within ±50 nt.
Additionally, cDNAs were not considered full-length if their 3' ends lay immediately upstream of A-rich sequences in the genome (>10 As within 20 nt), to minimize the likelihood of 3' end truncation via internal oligo(dT)-based priming.
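The end-support criteria can be read as a table of (evidence type, minimum count, window) rules, any one of which is sufficient. The sketch below encodes the 5' end rules and the A-rich test in that way. It is an illustration under stated assumptions, not the pipeline used to build the filter: the function names, the evidence labels and the data layout are hypothetical, the A-rich test is assumed to count As in the single 20 nt window immediately downstream of the cDNA 3' end, and the caller is assumed to have already removed the clone itself from the "other FANTOM clone" evidence.

```python
# (evidence type, minimum number of supporting sites, window in nt either side)
FIVE_PRIME_CRITERIA = [
    ("CAGE", 2, 15), ("CAGE", 3, 60), ("CAGE", 4, 100),
    ("GSC_ditag", 1, 0), ("GSC_ditag", 2, 50),
    ("GIS_ditag", 1, 15),
    ("RIKEN_5EST", 1, 3), ("RIKEN_5EST", 2, 100),
    ("nonRIKEN_5EST", 1, 2), ("nonRIKEN_5EST", 2, 100),
    ("other_FANTOM_clone", 1, 25),
    ("nonRIKEN_RNA", 1, 50),
]


def has_end_support(end_position, evidence, criteria):
    """Return True if any single criterion is met for this cDNA end.

    evidence: mapping from evidence type to genomic positions of its starts (or ends).
    """
    for kind, min_count, window in criteria:
        nearby = sum(1 for position in evidence.get(kind, ())
                     if abs(position - end_position) <= window)
        if nearby >= min_count:
            return True
    return False


def is_a_rich(downstream_sequence, window=20, min_as=11):
    """Return True if more than 10 As occur in the first `window` nt downstream."""
    return downstream_sequence[:window].upper().count("A") >= min_as


# Two CAGE tag starts within 15 nt of the cDNA start satisfy the first rule.
print(has_end_support(1000, {"CAGE": [990, 1010]}, FIVE_PRIME_CRITERIA))
```

The 3' end rules can be encoded in the same way as a second list and passed to the same function.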