<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>生信菜鸟团 &#187; Pseudogenes</title>
	<atom:link href="http://www.bio-info-trainee.com/tag/%e5%81%87%e5%9f%ba%e5%9b%a0/feed" rel="self" type="application/rss+xml" />
	<link>http://www.bio-info-trainee.com</link>
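	<description>You are welcome to leave comments and join the discussion on the forum at biotrainee.com, or to follow the WeChat official account of the same name, biotrainee.</description>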
	<lastBuildDate>Sat, 28 Jun 2025 14:30:13 +0000</lastBuildDate>
	<language>zh-CN</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=4.1.33</generator>
	<item>
		<title>Standard Genome Annotation Files: the GENCODE Database</title>
		<link>http://www.bio-info-trainee.com/1781.html</link>
		<comments>http://www.bio-info-trainee.com/1781.html#comments</comments>
		<pubDate>Fri, 08 Jul 2016 12:28:49 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[Basic databases]]></category>
		<category><![CDATA[gencode]]></category>
		<category><![CDATA[lncRNA]]></category>
		<category><![CDATA[Pseudogenes]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=1781</guid>
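		<description><![CDATA[The GENCODE database is a spin-off of the ENCODE project, curated and maintained by the renowned Sanger &#8230; <a href="http://www.bio-info-trainee.com/1781.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<div>The GENCODE database is a spin-off of the ENCODE project, curated and maintained by the renowned Sanger Institute. It records the functional annotation of the genome: which protein-coding genes, pseudogenes, and lncRNA genes lie on each chromosome, where they are located, and the coordinates of each gene's exons, introns, and UTRs. I used to download annotation from the Ensembl FTP server at EBI; since discovering GENCODE, I treat it as my gold standard.</div>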
<p><span id="more-1781"></span></p>
<div></div>
<div>Database paper: The GENCODE v7 catalog of human long noncoding RNAs, available at <a href="http://genome.cshlp.org/content/22/9/1775.full">http://genome.cshlp.org/content/22/9/1775.full</a></div>
<div>FTP site: <a href="ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/">ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/</a>  All of the database's files can be downloaded from there. They are very well organized, so it is easy to write your own scripts to extract whatever information you need.</div>
<div></div>
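<div>At the time of writing (July 2016) the latest GENCODE release is v24. On Linux, the following command downloads all of the metadata files (-nd stops wget from recreating the remote directory tree, so the files land in the current directory): wget -c -r -np -nd -k -L -A "*metadata*" <a href="ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_24/">ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_24/</a></div>
<div>Check the number of records in each file: ls *gz |while read id;do (echo -n $id;echo -n "    " ;zcat $id |wc -l ) ;done</div>
<div>The counts can be checked against the statistics on the official site: <a href="http://www.gencodegenes.org/stats.html">http://www.gencodegenes.org/stats.html</a></div>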
<div><a href="http://www.bio-info-trainee.com/wp-content/uploads/2016/07/gencode_statistics.png"><img class="alignnone size-full wp-image-1782" src="http://www.bio-info-trainee.com/wp-content/uploads/2016/07/gencode_statistics.png" alt="gencode_statistics" width="618" height="360" /></a></div>
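<div>As the statistics show, the number of protein-coding genes is not much larger than the number of lncRNA genes, and it is even roughly comparable to the number of pseudogenes.</div>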
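<div>First, let us look at the metadata files, which mainly describe how GENCODE entries map to other mainstream databases. Each line below gives a file name followed by its line count.</div>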
<div>gencode.v24.metadata.Annotation_remark.gz    40879</div>
<div>gencode.v24.metadata.EntrezGene.gz    170466</div>
<div>gencode.v24.metadata.Exon_supporting_feature.gz    19193542</div>
<div>gencode.v24.metadata.Gene_source.gz    66206</div>
<div>gencode.v24.metadata.HGNC.gz    182831</div>
<div>gencode.v24.metadata.PDB.gz    94547</div>
<div>gencode.v24.metadata.PolyA_feature.gz    84652</div>
<div>gencode.v24.metadata.Pubmed_id.gz    209094</div>
<div>gencode.v24.metadata.RefSeq.gz    75365</div>
<div>gencode.v24.metadata.Selenocysteine.gz    119</div>
<div>gencode.v24.metadata.SwissProt.gz    45067</div>
<div>gencode.v24.metadata.Transcript_source.gz    217202</div>
<div>gencode.v24.metadata.Transcript_supporting_feature.gz    87375</div>
<div>gencode.v24.metadata.TrEMBL.gz    61924</div>
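<div>The per-file line counts listed above can also be reproduced without zcat and wc. The following is a minimal Python sketch, assuming the downloaded .gz files sit in the current directory; the function names count_lines and count_all are made up for this illustration.</div>
<pre>
```python
import glob
import gzip


def count_lines(path):
    """Return the number of lines in a gzip-compressed text file."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for _ in handle)


def count_all(pattern="*.gz"):
    """Map every file matching the glob pattern to its line count."""
    return {name: count_lines(name) for name in sorted(glob.glob(pattern))}


if __name__ == "__main__":
    # Prints "file name, four spaces, line count", like the shell loop.
    for name, n_lines in count_all().items():
        print(f"{name}    {n_lines}")
```
</pre>
<div>Reading the files as a stream keeps memory use flat even for the largest file, Exon_supporting_feature, which has about 19 million lines.</div>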
<div></div>
<div>You can also download all of the GTF files:</div>
<div>wget -c -r -np -nd -k -L -A "*gtf.gz" <a href="ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_24/">ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_24/</a></div>
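<div>GTF files are especially important. Set aside a couple of hours to understand the format properly, write a few scripts to play with the file, and get to know it inside out!</div>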
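<div>As one way to start playing with a GTF file, here is a small Python sketch that splits each record into its nine tab-separated columns and counts genes per gene_type. The records in SAMPLE are invented for illustration and are not real GENCODE content; the function names are also hypothetical. It assumes the GENCODE convention of quoted attributes such as gene_type "protein_coding";.</div>
<pre>
```python
import collections
import re

# Matches one quoted GTF attribute, e.g.: gene_type "protein_coding";
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')

# Toy records in GTF layout (made up for illustration).
SAMPLE = [
    "##description: toy example\n",
    "\t".join(["chr1", "HAVANA", "gene", "100", "900", ".", "+", ".",
               'gene_id "G1"; gene_type "protein_coding"; gene_name "TOY1";']) + "\n",
    "\t".join(["chr1", "HAVANA", "exon", "100", "300", ".", "+", ".",
               'gene_id "G1"; gene_type "protein_coding"; gene_name "TOY1";']) + "\n",
    "\t".join(["chr2", "HAVANA", "gene", "50", "400", ".", "-", ".",
               'gene_id "G2"; gene_type "processed_pseudogene"; gene_name "TOY1P1";']) + "\n",
]


def parse_gtf_line(line):
    """Parse one GTF record; return None for comments and malformed lines."""
    if line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return None
    return {
        "chrom": fields[0],
        "feature": fields[2],
        "start": int(fields[3]),  # 1-based, inclusive
        "end": int(fields[4]),    # 1-based, inclusive
        "strand": fields[6],
        "attributes": dict(ATTRIBUTE.findall(fields[8])),
    }


def count_gene_types(lines):
    """Count records whose feature column is "gene", grouped by gene_type."""
    counts = collections.Counter()
    for line in lines:
        record = parse_gtf_line(line)
        if record is not None and record["feature"] == "gene":
            counts[record["attributes"].get("gene_type", "unknown")] += 1
    return counts


if __name__ == "__main__":
    for gene_type, n in count_gene_types(SAMPLE).most_common():
        print(gene_type, n, sep="\t")
```
</pre>
<div>To run it on a real file, pass an open handle instead of SAMPLE, for example one returned by gzip.open(path, "rt"). Attributes that repeat within a record (such as tag) keep only their last value in this sketch.</div>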
<div></div>
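<div>You can also download the reference transcriptome and reference proteome. Here I again use hg19 (GRCh37) as the example; the three links below are the transcript FASTA files:</div>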
<div>## <a href="ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_24/GRCh37_mapping/gencode.v24lift37.transcripts.fa.gz">ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_24/GRCh37_mapping/gencode.v24lift37.transcripts.fa.gz</a></div>
<div>## <a href="ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_24/GRCh37_mapping/gencode.v24lift37.lncRNA_transcripts.fa.gz">ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_24/GRCh37_mapping/gencode.v24lift37.lncRNA_transcripts.fa.gz</a></div>
<div>## <a href="ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_24/GRCh37_mapping/gencode.v24lift37.pc_transcripts.fa.gz">ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_24/GRCh37_mapping/gencode.v24lift37.pc_transcripts.fa.gz</a></div>
<div></div>
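<div>In fact, if you have the GTF file, you can also extract the reference transcriptome and reference proteome directly from the reference genome sequence. This is usually called gtf2fasta; a quick search turns up plenty of ways to do it.</div>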
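<div>The core of a gtf2fasta step can be sketched in a few lines of Python: take the exon coordinates of one transcript, cut those intervals out of the chromosome sequence, join them, and reverse-complement the result for minus-strand transcripts. This is only an illustration under simple assumptions (the whole chromosome is held in memory as a string, coordinates are 1-based and inclusive as in GTF); the function names are hypothetical and no particular gtf2fasta tool is being reproduced.</div>
<pre>
```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def transcript_sequence(chrom_seq, exons, strand):
    """Build the spliced sequence of one transcript.

    chrom_seq: chromosome sequence as a string
    exons: list of (start, end) pairs, 1-based and inclusive as in GTF
    strand: "+" or "-"
    """
    # Join exons in genomic order; GTF start is 1-based, Python slices are 0-based.
    spliced = "".join(chrom_seq[start - 1:end] for start, end in sorted(exons))
    return reverse_complement(spliced) if strand == "-" else spliced


if __name__ == "__main__":
    toy_chrom = "AAACCCGGGTTT"       # invented sequence
    toy_exons = [(7, 9), (1, 3)]     # two exons, deliberately unsorted
    print(transcript_sequence(toy_chrom, toy_exons, "+"))  # AAAGGG
    print(transcript_sequence(toy_chrom, toy_exons, "-"))  # CCCTTT
```
</pre>
<div>A protein sequence would additionally require the CDS records and a codon table, which this sketch leaves out.</div>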
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/1781.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Pseudogene Resource Center</title>
		<link>http://www.bio-info-trainee.com/1636.html</link>
		<comments>http://www.bio-info-trainee.com/1636.html#comments</comments>
		<pubDate>Mon, 16 May 2016 11:37:06 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[Basic databases]]></category>
		<category><![CDATA[Bioinformatics basics]]></category>
		<category><![CDATA[pseudogene]]></category>
		<category><![CDATA[Pseudogenes]]></category>
		<category><![CDATA[Databases]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=1636</guid>
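		<description><![CDATA[A pseudogene is derived from a gene that could originally be translated into protein and has lost that function through various mutations. For example, PTEN &#8230; <a href="http://www.bio-info-trainee.com/1636.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<div>A pseudogene is derived from a gene that could originally be translated into protein and has lost that function through various mutations.</div>
<div>For example:</div>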
<div>PTEN--&gt;PTENP1</div>
<div>KRAS--&gt;KRASP1</div>
<div>NANOG--&gt;NANOGP1</div>
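<div>This is easy to follow: as a rule of thumb, a symbol that ends in P1 or a similar suffix denotes a pseudogene. More than ten thousand pseudogenes have been named so far; I usually take the statistics at <a href="http://www.genenames.org/cgi-bin/statistics">http://www.genenames.org/cgi-bin/statistics</a> (the HUGO Gene Nomenclature Committee, HGNC) as my reference.</div>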
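<div>The naming rule of thumb can be written as a regular expression. The Python sketch below is a heuristic only, and the helper name guess_parent is made up: it recovers PTEN from PTENP1, but it also misfires on genuine protein-coding symbols such as TP53, which is why the official HGNC records remain the reference.</div>
<pre>
```python
import re

# Heuristic: a parent symbol followed by "P" and a number, e.g. PTENP1.
PSEUDOGENE_LIKE = re.compile(r"^([A-Z0-9]+?)P(\d+)$")


def guess_parent(symbol):
    """Return the presumed parent symbol, or None if the pattern does not fit."""
    match = PSEUDOGENE_LIKE.match(symbol)
    return match.group(1) if match else None


if __name__ == "__main__":
    for symbol in ["PTENP1", "KRASP1", "NANOGP1", "PTEN", "TP53"]:
        print(symbol, guess_parent(symbol))
```
</pre>
<div>TP53 matches the pattern (with "T" as the supposed parent) even though it is a protein-coding gene, so a symbol match should be treated as a hint to verify, not as a classification.</div>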
<div></div>
<div>For research you may need something more comprehensive, so I searched on Google and found a fairly complete collection.</div>
<div></div>
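<div>It is <a href="http://pseudogene.org/Human/">http://pseudogene.org/Human/</a>  (the central site).</div>
<div>It currently centers on two sets: the GENCODE 21 annotation from the ENCODE project, and Yale University's PseudoPipe output on Ensembl genome release 79.</div>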
<div>
<table border="1" width="100%" cellspacing="0" cellpadding="2">
<tbody>
<tr>
<td>Human Pseudogene Annotation</td>
</tr>
<tr>
<td>
<h1>GENCODE Annotation</h1>
<p><b>- Data:</b> The current human pseudogene annotation is in GENCODE 21.</p>
<p><b>- Description:</b> The GENCODE annotation of pseudogenes contains models that have been created by the Human and Vertebrate Analysis and Annotation (HAVANA) team, an expert manual annotation team at the Wellcome Trust Sanger Institute. This is informed by, and checked against, computational pseudogene predictions by the <a href="http://bioinformatics.oxfordjournals.org/content/22/12/1437.long" target="_blank">PseudoPipe</a> and <a href="http://www.biomedcentral.com/1471-2164/9/466" target="_blank">RetroFinder</a> pipelines.</p>
<h1>PseudoPipe Output</h1>
<p><b>- Data:</b> The current PseudoPipe results are on Ensembl genome release 79.</p>
<p><b>- Description:</b> Genome-wide human pseudogene annotation predicted by PseudoPipe. PseudoPipe is a homology-based computational pipeline that searches a mammalian genome and identifies pseudogene sequences.</p>
<p><b>- Reference:</b></p>
<h1>Other Human Pseudogene Sets</h1>
<p><b>- Data:</b> .</p>
<p><b>- Description:</b> Archived pseudogene annotation on previous human genome releases from PseudoPipe. Genome-wide annotation or specific subset.</p></td>
</tr>
</tbody>
</table>
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/1636.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>TCGA Data-Mining Paper Series: Exploring Pseudogenes</title>
		<link>http://www.bio-info-trainee.com/1630.html</link>
		<comments>http://www.bio-info-trainee.com/1630.html#comments</comments>
		<pubDate>Mon, 16 May 2016 11:31:04 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[cancer]]></category>
		<category><![CDATA[pseudogene]]></category>
		<category><![CDATA[TCGA]]></category>
		<category><![CDATA[Pseudogenes]]></category>
		<category><![CDATA[Data mining]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=1630</guid>
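		<description><![CDATA[This is one entry in a series on TCGA data-mining papers; the study was led by Han Liang of MD Anderson Cancer Center &#8230; <a href="http://www.bio-info-trainee.com/1630.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<div>This is one entry in a series on TCGA data-mining papers. The study was led by Han Liang of MD Anderson Cancer Center and is a pure bioinformatics data-analysis paper.</div>
<div>The paper: <a href="http://www.nature.com/ncomms/2014/140707/ncomms4963/full/ncomms4963.html">http://www.nature.com/ncomms/2014/140707/ncomms4963/full/ncomms4963.html</a></div>
<div>TCGA now holds a very substantial amount of data, with more than ten thousand tumor samples. This pseudogene paper was published in 2014, so the authors analyzed only 2,808 samples covering just 7 cancer types.</div>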
<p><span id="more-1630"></span></p>
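<div>A pseudogene is derived from a gene that could originally be translated into protein and has lost that function through various mutations.</div>
<div>For example:</div>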
<div>PTEN--&gt;PTENP1</div>
<div>KRAS--&gt;KRASP1</div>
<div>NANOG--&gt;NANOGP1</div>
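<div>This is easy to follow: as a rule of thumb, a symbol that ends in P1 or a similar suffix denotes a pseudogene. More than ten thousand pseudogenes have been named so far; I usually take the statistics at <a href="http://www.genenames.org/cgi-bin/statistics">http://www.genenames.org/cgi-bin/statistics</a> (the HUGO Gene Nomenclature Committee, HGNC) as my reference.</div>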
<div></div>
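<div>The paper mainly does six things.</div>
<div><span style="color: #ff0000;"><b>First, they redefined and standardized which pseudogenes to study.</b> </span>They merged the pseudogene sets from the Yale Pseudogene database and the GENCODE Pseudogene Resource, then applied a series of filters. The workflow is shown below.</div>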
<div><a href="http://www.bio-info-trainee.com/wp-content/uploads/2016/05/16.png"><img class="alignnone size-full wp-image-1631" src="http://www.bio-info-trainee.com/wp-content/uploads/2016/05/16.png" alt="1" width="600" height="477" /></a></div>
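<div><span style="color: #ff0000;">Second, they downloaded the level 2 RNA-seq data (that is, the BAM files) for those 2,808 TCGA samples and <b>re-extracted the pseudogene expression values.</b></span> If you simply download the processed expression data, the pseudogene quantification is not accurate, and only a little over five hundred pseudogenes are covered.</div>
<div>Of course, most people are not in a position to download the level 2 RNA-seq data, so if you just want to learn the workflow, download the expression matrix directly.</div>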
<div>
<table border="1" cellspacing="0" cellpadding="2">
<thead>
<tr>
<th><b>Cancer type</b></th>
<th><b>Number of nontumour samples</b></th>
<th><b>Number of tumour samples</b></th>
<th><b>Sequencing strategy</b></th>
<th><b>Number of mappable reads</b></th>
<th><b>Number of detectable pseudogenes</b></th>
</tr>
</thead>
<tbody>
<tr>
<td>Breast invasive carcinoma</td>
<td>105</td>
<td>837</td>
<td>Paired-end</td>
<td>161 M</td>
<td>747</td>
</tr>
<tr>
<td>Kidney renal clear cell carcinoma</td>
<td>67</td>
<td>448</td>
<td>Paired-end</td>
<td>166 M</td>
<td>712</td>
</tr>
<tr>
<td>Lung squamous cell carcinoma</td>
<td>17</td>
<td>220</td>
<td>Paired-end</td>
<td>171 M</td>
<td>813</td>
</tr>
<tr>
<td>Ovarian serous cystadenocarcinoma</td>
<td>0</td>
<td>412</td>
<td>Paired-end</td>
<td>170 M</td>
<td>670</td>
</tr>
<tr>
<td>Glioblastoma multiforme</td>
<td>0</td>
<td>154</td>
<td>Paired-end</td>
<td>106 M</td>
<td>875</td>
</tr>
<tr>
<td>Colorectal carcinoma</td>
<td>0</td>
<td>228</td>
<td>Single-end</td>
<td>22 M</td>
<td>168</td>
</tr>
<tr>
<td>Uterine corpus endometrioid carcinoma</td>
<td>4</td>
<td>316</td>
<td>Single-end</td>
<td>26 M</td>
<td>181</td>
</tr>
</tbody>
</table>
</div>
<div></div>
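<div><span style="color: #ff0000;">Third, they ran a correlation analysis between the expression of each pseudogene and that of its paired wild-type (parent) gene. In general, the strength of the correlation depends on the following three factors.</span></div>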
<div>(i) the sequence similarity between the pseudogene/gene pair;</div>
<div>(ii) the molecular mechanisms through which the pseudogene functions;</div>
<div>(iii) the detection sensitivity given the setting of RNA-seq experiments.</div>
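<div>The conclusion is that the two are only weakly correlated, which suggests that pseudogenes, although they do not encode protein products, still carry out some function of their own.</div>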
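<div>To make the pseudogene versus parent-gene comparison concrete, here is a small Python sketch that computes Pearson and Spearman correlation for one pair across samples. The expression values are invented, and the choice of statistic is an assumption for illustration; this post does not record which correlation measure the paper used.</div>
<pre>
```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    denominator = math.sqrt(var_x * var_y)
    # A constant vector has no defined correlation.
    return cov / denominator if denominator != 0 else float("nan")


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i != len(order):
        j = i
        while j + 1 != len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


if __name__ == "__main__":
    # Hypothetical expression values for one pair across six samples.
    parent_gene = [12.1, 8.4, 15.0, 9.9, 11.3, 7.2]
    pseudogene = [0.8, 1.9, 1.1, 0.2, 2.4, 1.0]
    print(f"Pearson:  {pearson(parent_gene, pseudogene):.3f}")
    print(f"Spearman: {spearman(parent_gene, pseudogene):.3f}")
```
</pre>
<div>In practice you would loop over every pseudogene and parent pair in the expression matrix and look at the distribution of coefficients, not at a single pair.</div>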
<div></div>
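<div><span style="color: #ff0000;">Fourth, for the cancer types whose RNA-seq data include normal controls, they ran a differential expression analysis between normal and tumor samples.</span> Normal controls are now available for all of them: the full expression data can be downloaded from GSE62944, so you only need to pull out the pseudogene expression values and run the differential analysis.</div>
<div>The differential analysis, however, turned out to have little practical significance. The authors therefore argue that comparing normal against tumor in this way is not sound, because tumors should be classified not by tissue of origin but by the six types of TCGA data:</div>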
<div>In recent years, various ‘omic’ data, such as mRNA expression, microRNA expression, DNA methylation, somatic copy number alteration and protein expression, have been widely used to classify tumour samples into different molecular subtypes<sup><a title="The Cancer Genome Atlas Research Network. Comprehensive molecular portraits of human breast tumours. Nature 490, 61–70 (2012)." href="http://www.nature.com/ncomms/2014/140707/ncomms4963/full/ncomms4963.html#ref13">13</a>, <a title="The Cancer Genome Atlas Research Network. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 455, 1061–1068 (2008)." href="http://www.nature.com/ncomms/2014/140707/ncomms4963/full/ncomms4963.html#ref14">14</a>, <a title="The Cancer Genome Atlas Research Network. Comprehensive molecular characterization of clear cell renal cell carcinoma. Nature 499, 43–49 (2013)." href="http://www.nature.com/ncomms/2014/140707/ncomms4963/full/ncomms4963.html#ref15">15</a>, <a title="The Cancer Genome Atlas Research Network. Comprehensive genomic characterization of squamous cell lung cancers. Nature 489, 519–525 (2012)." href="http://www.nature.com/ncomms/2014/140707/ncomms4963/full/ncomms4963.html#ref16">16</a>, <a title="The Cancer Genome Atlas Research Network. Integrated genomic analyses of ovarian carcinoma. Nature 474, 609–615 (2011)." href="http://www.nature.com/ncomms/2014/140707/ncomms4963/full/ncomms4963.html#ref17">17</a>, <a title="The Cancer Genome Atlas Research Network. Comprehensive molecular characterization of human colon and rectal cancer. Nature 487, 330–337 (2012)." href="http://www.nature.com/ncomms/2014/140707/ncomms4963/full/ncomms4963.html#ref18">18</a>, <a title="The Cancer Genome Atlas Research Network. Integrated genomic characterization of endometrial carcinoma. Nature 497, 67–73 (2013)." href="http://www.nature.com/ncomms/2014/140707/ncomms4963/full/ncomms4963.html#ref19">19</a></sup>.</div>
<div><a href="http://www.bio-info-trainee.com/wp-content/uploads/2016/05/21.png"><img class="alignnone size-full wp-image-1632" src="http://www.bio-info-trainee.com/wp-content/uploads/2016/05/21.png" alt="2" width="600" height="714" /></a></div>
<div></div>
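<div><span style="color: #ff0000;">Fifth, they compared the classification derived from pseudogene expression with several other classification schemes.</span></div>
<div>Those classifications come from earlier major TCGA papers:</div>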
<div>48 in UCEC (endometrioid vs serous)<sup><a title="Lax, S. F. &amp; Kurman, R. J. A dualistic model for endometrial carcinogenesis based on immunohistochemical and molecular genetic analyses. Verh. Dtsch. Ges. Pathol. 81, 228–232 (1997)." href="http://www.nature.com/ncomms/2014/140707/ncomms4963/full/ncomms4963.html#ref23">23</a></sup>,</div>
<div>138 in LUSC (basal, classical, primitive and secretory)<sup><a title="The Cancer Genome Atlas Research Network. Comprehensive genomic characterization of squamous cell lung cancers. Nature 489, 519–525 (2012)." href="http://www.nature.com/ncomms/2014/140707/ncomms4963/full/ncomms4963.html#ref16">16</a></sup>,</div>
<div>71 in GBM (classical, mesenchymal, neural and proneural)<sup><a title="Verhaak, R. G. et al. Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1. Cancer Cell 17, 98–110 (2010)." href="http://www.nature.com/ncomms/2014/140707/ncomms4963/full/ncomms4963.html#ref24">24</a></sup> and</div>
<div>547 in BRCA (PAM50 subtypes: luminal A, luminal B, basal-like, Her2-enriched and normal-like)<sup><a title="Sorlie, T. et al. Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc. Natl Acad. Sci. USA 98, 10869–10874 (2001)." href="http://www.nature.com/ncomms/2014/140707/ncomms4963/full/ncomms4963.html#ref25">25</a></sup></div>
<div>The papers are: <sup><a title="The Cancer Genome Atlas Research Network. Comprehensive molecular portraits of human breast tumours. Nature 490, 61–70 (2012)." href="http://www.nature.com/ncomms/2014/140707/ncomms4963/full/ncomms4963.html#ref13">13</a>, <a title="The Cancer Genome Atlas Research Network. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 455, 1061–1068 (2008)." href="http://www.nature.com/ncomms/2014/140707/ncomms4963/full/ncomms4963.html#ref14">14</a>, <a title="The Cancer Genome Atlas Research Network. Comprehensive molecular characterization of clear cell renal cell carcinoma. Nature 499, 43–49 (2013)." href="http://www.nature.com/ncomms/2014/140707/ncomms4963/full/ncomms4963.html#ref15">15</a>, <a title="The Cancer Genome Atlas Research Network. Comprehensive genomic characterization of squamous cell lung cancers. Nature 489, 519–525 (2012)." href="http://www.nature.com/ncomms/2014/140707/ncomms4963/full/ncomms4963.html#ref16">16</a>, <a title="The Cancer Genome Atlas Research Network. Integrated genomic analyses of ovarian carcinoma. Nature 474, 609–615 (2011)." href="http://www.nature.com/ncomms/2014/140707/ncomms4963/full/ncomms4963.html#ref17">17</a>, <a title="The Cancer Genome Atlas Research Network. Comprehensive molecular characterization of human colon and rectal cancer. Nature 487, 330–337 (2012)." href="http://www.nature.com/ncomms/2014/140707/ncomms4963/full/ncomms4963.html#ref18">18</a>, <a title="The Cancer Genome Atlas Research Network. Integrated genomic characterization of endometrial carcinoma. Nature 497, 67–73 (2013)." href="http://www.nature.com/ncomms/2014/140707/ncomms4963/full/ncomms4963.html#ref19">19</a></sup>.</div>
<div><a href="http://www.bio-info-trainee.com/wp-content/uploads/2016/05/32.png"><img class="alignnone size-full wp-image-1634" src="http://www.bio-info-trainee.com/wp-content/uploads/2016/05/32.png" alt="3" width="600" height="644" /></a></div>
<div></div>
<div></div>
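<div>Finally, they ran survival analyses and told an appealing story, for example that this kind of classification benefits precision medicine. Overall the paper looks solid and is worth studying. The data can all be downloaded, and the paper provides the syn accession IDs.</div>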
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/1630.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
