
GeneSigDB Release Notes

Archives:
Release 1, December 11th, 2009
Release 2, March 5th, 2010
Release 3, December 17th, 2010

Release 4, September 15th, 2011.

GeneSigDB (http://compbio.dfci.harvard.edu/genesigdb/) is a database of gene signatures that have been extracted and manually curated from the published literature. It provides the community with a standardized resource of published prognostic, diagnostic, and other gene signatures of cancer and related disease, so that researchers can compare the predictive power of gene signatures or use them in gene set enrichment analysis. GeneSigDB is built by performing a series of focused PubMed searches, retrieving PDFs of the resulting articles, reading them carefully, and manually extracting the gene signatures. We then map the gene identifiers in every signature to the genome (human, mouse or rat). Please see our documentation for more information about our process.

This release (version 4) is a significant data release for GeneSigDB.

In release 4, we have read and processed over 3,500 publications to extract 3,515 gene signatures from 1,604 publications, continuing to almost double the content of GeneSigDB at each release.

Gene signatures were collected and transcribed from published articles that largely focused on gene expression in cancer, stem cells, immune cells, development, and lung disease.

Growth of GeneSigDB

We have made substantial upgrades to the GeneSigDB website to improve accessibility and usability.

As always, while we attempt to describe signatures as well as possible, we know many are not described as well as we'd like, so we welcome your contributions and suggestions for improvement. Please contact us at genesigdb@jimmy.harvard.edu if you have signatures that you would like to contribute, or if you would like to recommend edits to a GeneSigDB signature.

Comparison to other gene signature databases

As GeneSigDB grows, the chance of overlap between GeneSigDB and other gene signature databases increases. We therefore compared the content of GeneSigDB Release 4.0 (3,515 signatures from 1,604 publications) with that of other gene signature databases: MSigDB at the Broad Institute and CCancer. MSigDB's C2 section contains 2,392 signatures from 786 publications.

We found minimal overlap (198 publications) between GeneSigDB and MSigDB. These 198 publications account for 12.3% of the publications in GeneSigDB. We are therefore delighted to say that GeneSigDB provides content not available in other public resources.
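The percentage above follows directly from the publication counts. A minimal Python sketch of the calculation, assuming each database is represented as a collection of PubMed IDs; the function name `overlap_stats` and the toy PMIDs are hypothetical and only illustrate the set arithmetic:

```python
def overlap_stats(pmids_a, pmids_b):
    """Return the PMIDs shared by both collections and the percentage
    of collection A that they represent."""
    a, b = set(pmids_a), set(pmids_b)
    shared = a & b
    pct_of_a = 100.0 * len(shared) / len(a) if a else 0.0
    return shared, pct_of_a


# Toy, made-up PMIDs (not real GeneSigDB or MSigDB content).
toy_genesigdb = ["1001", "1002", "1003", "1004"]
toy_msigdb = ["1002", "2001"]
shared, pct = overlap_stats(toy_genesigdb, toy_msigdb)
print(sorted(shared), pct)

# The same arithmetic applied to the counts reported in these notes:
# 198 shared publications out of 1,604 in GeneSigDB.
reported_pct = round(100.0 * 198 / 1604, 1)
print(reported_pct)
```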

CCancer contains 3,500 gene signatures curated from articles published in approximately 100 journals. The data in CCancer are not available as a bulk download, but we were able to identify the PMIDs of articles associated with 2,002 gene signatures in CCancer. Our comparison therefore represents a subset of their data. When we compared the articles curated by CCancer (n=1,820) and GeneSigDB (n=1,604), the overlap was 527 articles, which represents 31% of articles curated by GeneSigDB. Note that there is a large discrepancy in the number of tables curated from publications in GeneSigDB and CCancer.

Overlap in GeneSigDB, MSigDB and CCancer

We provide GeneSigDB for download in the Broad .gmt file format, so our content can be used in GSEA and other analyses.
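In the .gmt format each line describes one gene set: a set name, a description, and then the member genes, all separated by tabs. A minimal sketch of reading such a file in Python follows; the function name `parse_gmt` and the two example signatures are made up for illustration and are not actual GeneSigDB records:

```python
import io


def parse_gmt(handle):
    """Parse GMT text into {set_name: (description, [genes])}."""
    gene_sets = {}
    for line in handle:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines (need name, description, gene)
        name, description, genes = fields[0], fields[1], fields[2:]
        # Drop empty fields left behind by trailing tabs.
        gene_sets[name] = (description, [g for g in genes if g])
    return gene_sets


# Hypothetical two-signature file held in memory.
example = io.StringIO(
    "SIG_A\thypothetical signature A\tTP53\tBRCA1\tMYC\n"
    "SIG_B\thypothetical signature B\tEGFR\tKRAS\n"
)
signatures = parse_gmt(example)
for name, (description, genes) in signatures.items():
    print(name, len(genes), genes)
```

For a downloaded file, an open file handle can be passed in place of the in-memory example.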

Number of Gene Signatures Curated by Species

Number of articles processed and gene signatures extracted by species
                          Human    Mouse    Rat     Total
Publications              1368     208      39      1604
Gene Signatures           2951     493      71      3515
Genes                     20523    16258    5110    -
No. of Platforms          47       20       11      78
Average Genes/Signature   113      174      107     -



Number of Gene Signatures Curated by PubMed Search Terms

As our coverage of published gene signatures in GeneSigDB has improved, we have updated the PubMed searches to make them much more general. We now include signatures from head and neck cancers, sarcomas, leukemia, lymphoma, and many other cancer types. Searches used by GeneSigDB to retrieve articles include:

Search Name: Cancer
Search: ("tumor" OR "cancer" OR "tumour" OR "carcinoma" OR "neoplasm" OR "sarcoma")

Search Name: Development
Search: Morphogenesis OR "Embryonic and Fetal Development" OR "Embryonic Development" OR "Fetal Development" OR "Nervous System Malformations" OR "Sexual Development" OR "Muscle Development"
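Searches of this form can also be submitted programmatically. As a rough sketch (not a description of how GeneSigDB runs its searches), the following Python builds the Cancer query from its terms and a corresponding URL for NCBI's E-utilities `esearch` endpoint. The helper names `or_query` and `esearch_url` are hypothetical, and no request is actually sent:

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def or_query(terms):
    """Join quoted terms with OR and wrap the result in parentheses."""
    return "(" + " OR ".join('"{}"'.format(t) for t in terms) + ")"


def esearch_url(query, retmax=20):
    """Build (but do not fetch) a PubMed esearch URL for the query."""
    params = {"db": "pubmed", "term": query, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


cancer_terms = ["tumor", "cancer", "tumour", "carcinoma", "neoplasm", "sarcoma"]
cancer_query = or_query(cancer_terms)
url = esearch_url(cancer_query)
print(cancer_query)
print(url)
```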

Number of articles processed and gene signatures extracted (by Tissue Type or Search Term)

Note these terms are subjective and are generated by our curators. We have also analysed publications indexed in GeneSigDB by their MeSH terms.
Number of articles processed and gene signatures extracted.
Columns: Search Term/Tissue; Publications (all species); Gene Signatures (all species); Human Genes; Human Average Genes/Signature; Mouse Genes; Mouse Average Genes/Signature; Rat Genes; Rat Average Genes/Signature
Bladder 23 46 2948 92 221 5 0 0
Bone 12 38 1663 57 952 33 73 2
Bone Marrow 15 36 2972 100 225 7 0 0
Brain 55 108 4780 72 610 6 0 0
Brain not cancer 1 2 0 0 0 0 425 254
Breast 281 580 14897 139 3801 10 597 1
Cardiac Cells 1 2 93 48 0 0 0 0
CellDifferentiationMarkers 1 1 345 345 0 0 0 0
Cervical 4 5 749 160 0 0 0 0
Chemoprevention 1 2 0 0 54 27 0 0
Chorionic Villi Samples 1 1 10 10 0 0 0 0
Colon 60 142 6940 87 287 2 38 0
ConsensusCancerGenes 1 1 365 365 0 0 0 0
Dendritic Cells 2 7 35 5 85 12 0 0
Development 1 1 9 9 0 0 0 0
Embryo 3 11 2199 418 63 6 0 0
Embryonic Stem Cell 7 30 4243 201 2224 80 0 0
Endometrial 9 16 545 36 0 0 0 0
Esophagus 12 21 971 54 0 0 83 4
Eye 1 1 47 47 0 0 0 0
Glioblastoma 1 2 36 18 0 0 0 0
Head and Neck Cancer 23 54 3901 97 0 0 0 0
Heart 2 8 0 0 0 0 1194 162
Hyperglycemia 1 1 23 23 0 0 0 0
Hypothalamic 1 2 0 0 0 0 2536 1268
Immune 14 47 7765 266 190 4 650 14
Inner Ear 1 4 0 0 5932 2443 0 0
Intestine 7 17 2565 199 97 6 0 0
In Vitro 6 14 1082 91 38 3 0 0
Kidney 23 46 2563 103 1145 37 0 0
Leukemia 193 434 13005 94 1600 5 125 0
Liver 58 95 5197 73 1842 25 270 3
Lung 111 237 6943 54 2979 16 48 0
Lung not cancer 43 135 1526 21 5008 60 252 2
Lung Not Caner 1 5 74 18 0 0 0 0
Lymphatic Endothelial Cells 1 4 194 76 0 0 0 0
Lymphoma 93 210 8418 85 3196 22 0 0
Mesenchymal Stem Cells 5 10 1306 143 0 0 0 0
Mesothelioma 1 1 23 23 0 0 0 0
Mouth 5 7 131 19 0 0 0 0
Multiple Myeloma 2 9 288 39 0 0 0 0
Muscle 1 2 252 128 0 0 0 0
Myopathy 1 2 78 44 0 0 0 0
Neuroblastoma 1 1 14 14 0 0 0 0
Neuronal Cells 1 2 0 0 115 68 0 0
Ovarian 66 118 8641 145 221 2 0 0
Pancreas 30 58 3718 83 242 4 0 0
Placenta 2 3 53 18 0 0 0 0
Prostate 68 128 5390 70 179 2 295 3
ProteinKinases 2 3 805 294 0 0 0 0
Rectal 1 1 30 30 0 0 0 0
sarcoma 1 1 62 62 0 0 0 0
Sarcoma 16 27 1939 85 75 3 0 0
Skeletal Myoblasts 1 1 0 0 22 22 0 0
Skin 20 39 913 32 2601 74 33 1
Spleen 1 1 0 0 16 16 0 0
Stem Cell 183 406 13272 107 11395 71 276 1
Stomach 32 73 2048 37 23 0 0 0
Testicular 8 34 619 22 818 25 128 6
Thoracic Aorta 1 2 0 0 0 0 65 36
Thymus 1 4 0 0 26 6 0 0
Thyroid 15 27 964 43 0 0 0 0
Transcription factors 1 1 1764 1764 0 0 0 0
Umbilical Cord Blood Cells 1 2 198 99 0 0 0 0
Uterine 15 37 1347 47 726 29 0 0
Viral 64 149 5590 62 6084 53 34 0
Total 1604 3515 20523 104 16258 48 5110 27

Number of articles processed (by MeSH Term)


Number of Gene Signatures Curated from Different Platforms

The format of gene lists varied considerably. The most common array platforms were Affymetrix platforms, particularly the U133A and U133 Plus 2 arrays. There were also many signatures generated on spotted cDNA microarrays. However, in many cases authors do not provide probe identifiers for the gene expression probesets (see table below), and thus one is reliant on a static, user-provided probeset annotation.

Gene Expression Platforms in GeneSigDB
Platform    All Signatures (n=3,515)    Human Signatures (n=2,951)
affy_HC_G110 7 7
affy_HG-Focus 18 18
affy_HG-U95A 45 45
affy_HG-U95Av2 178 178
affy_Hu6800 14 14
Affy HuGene 1_0 st v1 11 11
affy_HuGeneFL 92 92
Affymetrix array (custom) 25 22
Affymetrix human genome U133 (HG-U133A and HG-133B 2 2
Affymetrix human genome U133 (HG-U133A and HG-133B) 70 70
Affymetrix Microarray moex 1 0 st v1 2 NA
Affymetrix SNP chips 5 5
Affymetrix Tiling array 2 2
Affy mg u74a 13 NA
Affy mg u74av2 120 NA
affy_MOE430A 38 NA
affy_Mouse430_2 139 5
Affy rae230a 15 NA
Affy RaGene 1_0 st v1 2 NA
Affy rat230 2 6 NA
Affy rg u34a 8 NA
Affy rt u34 2 NA
affy_U133A 589 582
affy_U133A_2 39 39
affy_U133Plus2 513 512
affy_U133_X3P 6 6
Agilent Array (custom) 18 18
Agilent Arrays (more than 1) 6 6
Agilent_Human1A 35 35
Agilent_Human1Av2 13 13
Agilent_Human1_cDNA 14 14
Agilent_HumanGenome 76 76
Agilent Mouse oligonucleotide microarrays 37 1
Agilent rat oligonucleotide microarrays 11 NA
APPLERA 5 5
Clontech_Atlas_13K 54 54
Codelink_Human_Whole_Genome 4 4
CodeLink_UniSet_Human_I_Bioarray 7 7
Custom cDNA Array 858 762
Custom Oligo Array 52 44
Illumina HumanWG 6 V1 22 22
Illumina HumanWG 6 V2 19 16
Illumina HumanWG 6 V3 15 15
Illumina MouseWG 6 v1 1 NA
Illumina MouseWG 6 v2 1 NA
Illumina RatRef-12 4 NA
Incyte Genomics 22 18
IntelliGene Human Cancer CHIP 6 6
miR_Base 1 1
miRNA Array 48 47
multiple platforms 82 49
N/A 28 28
Operon/CapitalBio Corporation 21329 9 9
OPERON_HUMANv2 16 16
OPERON_HUMANv3 4 4
Research_Genetics 6 4
RT-PCR 54 43
SAGE 17 17
tissue microarray 1 1
No Platform Specified 8 NA

*Studies with no platform are gene lists from sources such as ImmPort or the Sanger Consensus Cancer Gene list.


Gene Identifiers in Gene Signatures Curated by GeneSigDB

We observed that the success of mapping is greatly affected by the identifier provided. One lesson that clearly emerges from this analysis is that those identifiers closest to the primary data, such as probe identifiers, have the highest rate of mapping to the EnsEMBL geneIDs that are our standard identifiers.

Success of matching different gene signature identifiers to an EnsEMBL gene
No.  ID type               Success        Failures       % Success
                           (unique IDs)   (unique IDs)
1    affy_hc_g110          28             0              100%
2    affy_hg_focus         119            17             87%
3    affy_hg_u133_plus_2   24674          2973           89%
4    affy_hg_u133a         15636          2479           86%
5    affy_hg_u133a_2       4085           70             98%
6    affy_hg_u95a          593            19             96%
7    affy_hg_u95av2        4314           363            92%
8    affy_hugenefl         647            50             92%
9    affy_mg_u74a          119            1              99%
10   affy_mg_u74av2        6716           784            89%
11   affy_moe430a          5354           248            95%
12   affy_mouse430_2       10573          1106           90%
13   affy_rae230a          537            72             88%
14   affy_rat230_2         1438           610            70%
15   affy_rg_u34a          5              0              100%
16   affy_u133_x3p         186            10             94%
17   agilent_wholegenome   8519           1337           86%
18   embl                  1917           8552           18%
19   ensembl_gene_id       6133           589            91%
20   entrezgene            27007          3790           87%
21   hgnc_symbol           13828          13266          51%
22   mgi_symbol            6078           3798           61%
23   mirbase_id            436            148            74%
24   refseq_dna            18908          2407           88%
25   rgd_symbol            841            803            51%
26   unigene               5869           5233           52%
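
The % Success column can be reproduced from the two count columns as successes divided by the total number of unique IDs. The sketch below is an assumption about how the figures were derived, not the GeneSigDB code itself: the function name `percent_success` is hypothetical, and truncating (rather than rounding) is inferred from the tabulated values, for example 119 of 136 is 87.5% but is listed as 87%.

```python
def percent_success(success, failures):
    """Whole-number percentage of unique IDs mapped, truncated (assumed)."""
    total = success + failures
    if total == 0:
        raise ValueError("no identifiers to map")
    return (100 * success) // total


# (ID type, successes, failures) copied from a few rows of the table.
rows = [
    ("affy_hg_focus", 119, 17),
    ("embl", 1917, 8552),
    ("hgnc_symbol", 13828, 13266),
    ("affy_hc_g110", 28, 0),
]
rates = {id_type: percent_success(ok, bad) for id_type, ok, bad in rows}
for id_type, pct in rates.items():
    print(f"{id_type}: {pct}%")
```

Probe-level identifiers such as affy_hg_focus map far more reliably than sequence accessions (embl) or gene symbols (hgnc_symbol), which is the pattern the table is meant to highlight.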