clinvar数据库重新解读

在很久以前的我直播基因组活动,我提到过这个数据库: 【直播】我的基因组67:clinvar数据库

不过,那个时候遗传背景知识不够,其实并没有很好的理解它,现在有机会重新学习一下,可以使用以下代码下载并且注释到clinvar数据库

wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar_20180429.vcf.gz
wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar_20180429.vcf.gz.tbi
java -jar  ~/biosoft/SnpEff/snpEff/SnpSift.jar  annotate clinvar_20180429.vcf.gz merge_snpeff.vcf >merge_clinvar.vcf

得到的注释信息的描述是:

##SnpSiftCmd="SnpSift annotate clinvar_20180429.vcf.gz merge_snpeff.vcf"
##INFO=<ID=CLNDISDBINCL,Number=.,Type=String,Description="For included Variant: Tag-value pairs of disease database name and identifier, e.g. OMIM:NNNNNN">
##INFO=<ID=DBVARID,Number=.,Type=String,Description="nsv accessions from dbVar for the variant">
##INFO=<ID=CLNSIGCONF,Number=.,Type=String,Description="Conflicting clinical significance for this single variant">
##INFO=<ID=AF_TGP,Number=1,Type=Float,Description="allele frequencies from TGP">
##INFO=<ID=MC,Number=.,Type=String,Description="comma separated list of molecular consequence in the form of Sequence Ontology ID|molecular_consequence">
##INFO=<ID=AF_EXAC,Number=1,Type=Float,Description="allele frequencies from ExAC">
##INFO=<ID=CLNDN,Number=.,Type=String,Description="ClinVar's preferred disease name for the concept specified by disease identifiers in CLNDISDB">
##INFO=<ID=ORIGIN,Number=.,Type=String,Description="Allele origin. One or more of the following values may be added: 0 - unknown; 1 - germline; 2 - somatic; 4
 - inherited; 8 - paternal; 16 - maternal; 32 - de-novo; 64 - biparental; 128 - uniparental; 256 - not-tested; 512 - tested-inconclusive; 1073741824 - other">
##INFO=<ID=CLNVI,Number=.,Type=String,Description="the variant's clinical sources reported as tag-value pairs of database and variant identifier">
##INFO=<ID=CLNVC,Number=1,Type=String,Description="Variant type">
##INFO=<ID=AF_ESP,Number=1,Type=Float,Description="allele frequencies from GO-ESP">
##INFO=<ID=CLNHGVS,Number=.,Type=String,Description="Top-level (primary assembly, alt, or patch) HGVS expression.">
##INFO=<ID=CLNDNINCL,Number=.,Type=String,Description="For included Variant : ClinVar's preferred disease name for the concept specified by disease identifier
s in CLNDISDB">
##INFO=<ID=CLNVCSO,Number=1,Type=String,Description="Sequence Ontology id for variant type">
##INFO=<ID=CLNSIG,Number=.,Type=String,Description="Clinical significance for this single variant">
##INFO=<ID=CLNDISDB,Number=.,Type=String,Description="Tag-value pairs of disease database name and identifier, e.g. OMIM:NNNNNN">
##INFO=<ID=CLNREVSTAT,Number=.,Type=String,Description="ClinVar review status for the Variation ID">
##INFO=<ID=CLNSIGINCL,Number=.,Type=String,Description="Clinical significance for a haplotype or genotype that includes this variant. Reported as pairs of VariationID:clinical significance.">
##INFO=<ID=ALLELEID,Number=1,Type=Integer,Description="the ClinVar Allele ID">

记录最多的基因是

zcat clinvar_20180429.vcf.gz|perl -alne '{/GENEINFO=(.*?):/;print $1 if $1}'|sort |uniq -c |sort -k 1,1nr >gene.clinvar.freq
head gene.clinvar.freq
   9800 TTN
   9038 BRCA2
   6311 BRCA1
   4987 ATM
   4498 APC
   3312 TSC2
   3114 MSH6
   2888 NF1
   2653 MSH2
   2223 LDLR

记录的致病情况最多的基因是

zcat clinvar_20180429.vcf.gz|perl -alne '{/GENEINFO=(.*?):/;$g=$1;/CLNSIG=(.*?);/;print "$g\t$1" if $1}'| sort |uniq -c |sort -k 1,1nr >gene_sign.clinvar.freq
head gene_sign.clinvar.freq
   5218 TTN Uncertain_significance
   3293 BRCA2   Uncertain_significance
   2665 BRCA2   Pathogenic
   2523 ATM Uncertain_significance
   2162 BRCA1   Pathogenic
   1897 APC Uncertain_significance
   1887 TTN Likely_benign
   1817 BRCA1   Uncertain_significance
   1651 MSH6    Uncertain_significance
   1465 BRCA2   Likely_benign
grep Pathogenic gene_sign.clinvar.freq|head
   2665 BRCA2   Pathogenic
   2162 BRCA1   Pathogenic
    691 LDLR    Pathogenic
    594 MSH2    Pathogenic
    571 MLH1    Pathogenic
    544 COL4A5  Pathogenic
    536 NF1 Pathogenic
    469 DMD Pathogenic
    467 MSH6    Pathogenic
    466 APC Pathogenic

看看是否有被专家审核

zcat clinvar_20180429.vcf.gz|perl -alne '{/GENEINFO=(.*?):/;$g=$1;/CLNSIG=(.*?);/;$t=$1;/CLNREVSTAT=(.*?);/;print "$g\t$t\t$1" if $1}'| sort |uniq -c |sort -k 1,1nr >gene_sign_review.clinvar.freq
[jianmingzeng@jade anno]$ head gene_sign_review.clinvar.freq
   4200 TTN Uncertain_significance  criteria_provided,_single_submitter
   2065 BRCA2   Pathogenic  reviewed_by_expert_panel
   1915 BRCA2   Uncertain_significance  criteria_provided,_single_submitter
   1666 BRCA1   Pathogenic  reviewed_by_expert_panel
   1639 TTN Likely_benign   criteria_provided,_single_submitter
   1597 ATM Uncertain_significance  criteria_provided,_single_submitter
   1251 APC Uncertain_significance  criteria_provided,_single_submitter
   1200 TTN Conflicting_interpretations_of_pathogenicity    criteria_provided,_conflicting_interpretations
   1174 TSC2    not_provided    no_assertion_provided

可以看到之前的Pathogenic的BRCA2突变记录是2665个,但是现在增加了 reviewed_by_expert_panel条件后下降到了 2065个,还有600个是因为:

 grep BRCA2 gene_sign_review.clinvar.freq|grep Pathogenic
   2065 BRCA2   Pathogenic  reviewed_by_expert_panel
    436 BRCA2   Pathogenic  criteria_provided,_single_submitter
     94 BRCA2   Pathogenic  criteria_provided,_multiple_submitters,_no_conflicts
     70 BRCA2   Pathogenic  no_assertion_criteria_provided

Comments are closed.