<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>生信菜鸟团 &#187; 基因坐标</title>
	<atom:link href="http://www.bio-info-trainee.com/tag/%e5%9f%ba%e5%9b%a0%e5%9d%90%e6%a0%87/feed" rel="self" type="application/rss+xml" />
	<link>http://www.bio-info-trainee.com</link>
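	<description>Feel free to leave comments and join the discussion on the forum biotrainee.com, or follow the WeChat official account of the same name, biotrainee</description>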
	<lastBuildDate>Sat, 28 Jun 2025 14:30:13 +0000</lastBuildDate>
	<language>zh-CN</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=4.1.33</generator>
	<item>
		<title>An improved program for mapping genomic coordinates to genes</title>
		<link>http://www.bio-info-trainee.com/1001.html</link>
		<comments>http://www.bio-info-trainee.com/1001.html#comments</comments>
		<pubDate>Fri, 18 Sep 2015 11:37:02 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[perl]]></category>
		<category><![CDATA[基因坐标]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=1001</guid>
		<description><![CDATA[This post answers an earlier question: given any genomic chr:pos, determine which gene it falls in &#8230; <a href="http://www.bio-info-trainee.com/1001.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<h1><span lang="EN-US">This post answers an earlier question: given any genomic chr:pos, which gene does it fall in? Is this program hard to write?</span></h1>
<p><span lang="EN-US">The chr, start, and end of every gene are known.</span></p>
<p><span lang="EN-US">Stated more formally: given the chromosomal positions of CNV data, for example chr1:2075000-2930999, how can the corresponding Gene Symbols be retrieved in batch?</span></p>
<p><span lang="EN-US">The data look like this:</span></p>
<p><b><span lang="EN-US">head gene_position.hg19 // </span></b><span lang="EN-US">21629 lines in total</span></p>
<p><span lang="EN-US">1 chr19 58858171 58874214 A1BG ENSG00000121410</span></p>
<p><span lang="EN-US">2 chr12 9220303 9268558 A2M ENSG00000175899</span></p>
<p><span lang="EN-US">3 chr12 9381128 9386803 A2MP1 ENSG00000256069</span></p>
<p><span lang="EN-US">9 chr8 18027970 18081198 NAT1 ENSG00000171428</span></p>
<p><span lang="EN-US">10 chr8 18248754 18258723 NAT2 ENSG00000156006</span></p>
<p><span lang="EN-US">12 chr14 95058394 95090390  ENSG00000273259</span></p>
<p><span lang="EN-US">13 chr3 151531860 151546276 AADAC ENSG00000114771</span></p>
<p><span lang="EN-US">14 chr2 219128851 219134893 AAMP ENSG00000127837</span></p>
<p><span lang="EN-US">15 chr17 74449432 74466199 AANAT ENSG00000129673</span></p>
<p><span lang="EN-US">16 chr16 70286296 70323412 AARS ENSG00000090861</span></p>
<p><b><span lang="EN-US">head pfam.df.hg19.bed // </span></b><span lang="EN-US">340960 lines in total</span></p>
<p><span lang="EN-US">chr1 12190 12689 Helicase_C_2 0 + 12190 12689 255,255,0</span></p>
<p><span lang="EN-US">chr1 69157 69220 7tm_4 0 + 69157 69220 255,255,0</span></p>
<p><span lang="EN-US">chr1 69184 69817 7TM_GPCR_Srsx 0 + 69184 69817 255,255,0</span></p>
<p><span lang="EN-US">chr1 69190 69931 7tm_1 0 + 69190 69931 255,255,0</span></p>
<p><span lang="EN-US">chr1 69490 69910 7tm_4 0 + 69490 69910 255,255,0</span></p>
<p><span lang="EN-US">Now we need to annotate our pfam data: for each row, use its chr and pos to find out which gene it belongs to.</span></p>
<p><span lang="EN-US">In total, 338879 pfam records can be annotated with a gene.</span></p>
<p><span lang="EN-US">After annotation, </span><b><span lang="EN-US">head pfam.gene.df.hg19</span></b><span lang="EN-US"> should look like this:</span></p>
<p><span lang="EN-US">CDK11B chr1 1571423 1573930 Pkinase 0 - 1571423 1573930 255,255,0</span></p>
<p><span lang="EN-US">CDK11B chr1 1572048 1573921 Pkinase_Tyr 0 - 1572048 1573921 255,255,0</span></p>
<p><span lang="EN-US">CDK11B chr1 1572120 1572823 Kinase-like 0 - 1572120 1572823 255,255,0</span></p>
<p><span lang="EN-US">CDK11B chr1 1572120 1572820 Kinase-like 0 - 1572120 1572820 255,255,0</span></p>
<p><span lang="EN-US">CDK11B chr1 1572120 1572817 Kinase-like 0 - 1572120 1572817 255,255,0</span></p>
<p><span lang="EN-US">CDK11B chr1 1573173 1573918 Kinase-like 0 - 1573173 1573918 255,255,0</span></p>
<p><span lang="EN-US">CDK11B chr1 1575747 1577317 Daxx 0 - 1575747 1577317 255,255,0</span></p>
<p><span lang="EN-US">CDK11B chr1 1576417 1577347 Nop14 0 - 1576417 1577347 255,255,0</span></p>
<p><span lang="EN-US">CDK11B chr1 1576423 1577332 Mitofilin 0 - 1576423 1577332 255,255,0</span></p>
<p><span lang="EN-US">CDK11B chr1 1576432 1577317 SAPS 0 - 1576432 1577317 255,255,0</span></p>
<p><span lang="EN-US">My first program loaded every position of every gene into a hash. That means scanning 1,313,904,974 positions, more than one third of the genome, which wastes a great deal of memory, especially because each key takes multiple bytes.</span></p>
<p><span lang="EN-US">Even on a server with 256G of memory, the run never finished.</span></p>
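<p><span lang="EN-US">To make the memory cost concrete, here is a minimal sketch of the per-position hash idea, written in Python for illustration (this post is tagged perl, and the original script is not shown here). The two genes come from the sample rows above; the function names and the assumption that start and end are both inclusive are mine.</span></p>
<pre>
```python
# Toy gene table: (chrom, start, end, symbol), copied from the sample rows above.
GENES = [
    ("chr8", 18027970, 18081198, "NAT1"),
    ("chr8", 18248754, 18258723, "NAT2"),
]


def build_position_hash(genes):
    """Map every (chrom, pos) inside a gene to its symbol: one key per base."""
    position_to_gene = {}
    for chrom, start, end, symbol in genes:
        # Assumption: start and end are both inclusive.
        for pos in range(start, end + 1):
            position_to_gene[(chrom, pos)] = symbol
    return position_to_gene


def count_positions(genes):
    """Number of keys the full scan would create, without building the hash."""
    return sum(end - start + 1 for _, start, end, _ in genes)


position_to_gene = build_position_hash(GENES)
print(len(position_to_gene))                       # 63199 keys for just two genes
print(position_to_gene.get(("chr8", 18030000)))    # NAT1
```
</pre>
<p><span lang="EN-US">Two genes already need 63199 keys; running count_positions over the whole gene table gives a quick estimate of the hash size before committing any memory to it.</span></p>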
<p><span lang="EN-US">Later I took a shortcut: </span><b><span lang="EN-US">I used the split command to divide the gene_position.hg19 file into 25 pieces, then looped 25 times to annotate the pfam.df.hg19.bed file.</span></b></p>
<p>&nbsp;</p>
<p><span lang="EN-US">This does solve the problem. It needs only a server with 32G of memory, and it is fast too, taking a little over ten minutes.</span></p>
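<p><span lang="EN-US">The split-and-loop workaround can be sketched roughly as follows, again in Python for illustration only. The query records are made up, the helper names are hypothetical, and looking a record up by its start position is my assumption about how the matching was done.</span></p>
<pre>
```python
# Toy gene table: (chrom, start, end, symbol), copied from the sample rows above.
GENES = [
    ("chr8", 18027970, 18081198, "NAT1"),
    ("chr8", 18248754, 18258723, "NAT2"),
    ("chr3", 151531860, 151546276, "AADAC"),
]

# Hypothetical query intervals (chrom, start, end, name); not real pfam records.
RECORDS = [
    ("chr8", 18028000, 18028500, "domain_a"),
    ("chr3", 151540000, 151541000, "domain_b"),
    ("chr8", 100, 200, "domain_c"),
]


def chunked(items, n_chunks):
    """Split a list into at most n_chunks consecutive pieces."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def annotate_in_chunks(genes, records, n_chunks):
    """Build a per-position hash for one gene chunk at a time, then scan all records."""
    hits = []
    for chunk in chunked(genes, n_chunks):
        lookup = {}  # only this chunk's positions are held in memory
        for chrom, start, end, symbol in chunk:
            for pos in range(start, end + 1):
                lookup[(chrom, pos)] = symbol
        for chrom, start, end, name in records:
            # Assumption: a record is matched by its start position only.
            symbol = lookup.get((chrom, start))
            if symbol is not None:
                hits.append((symbol, chrom, start, end, name))
    return hits


for hit in annotate_in_chunks(GENES, RECORDS, n_chunks=2):
    print(*hit)
```
</pre>
<p><span lang="EN-US">Peak memory now scales with the largest chunk instead of the whole gene table, at the price of re-reading the records once per chunk.</span></p>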
<p><span lang="EN-US">But this is only a workaround; the real fix has to come from the algorithm. As a first step I made just one change: instead of scanning every position in a gene, I sample each gene's positions with a 1K window. When I check the pfam coordinates, I likewise use 1K as the smallest unit.</span></p>
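<p><span lang="EN-US">One possible reading of the 1K-window idea is sketched below in Python. The original script is not shown here, so the rounding choices (store the multiples of 1000 inside each gene, round each query down to a multiple of 1000) and the function names are assumptions.</span></p>
<pre>
```python
WINDOW = 1000

# One gene from the sample rows above: (chrom, start, end, symbol).
GENES = [("chr8", 18027970, 18081198, "NAT1")]


def build_window_hash(genes, window=WINDOW):
    """Store only the multiples of `window` that fall inside each gene."""
    lookup = {}
    for chrom, start, end, symbol in genes:
        first = -(-start // window) * window  # first multiple at or after start
        for pos in range(first, end + 1, window):
            lookup[(chrom, pos)] = symbol
    return lookup


def lookup_gene(lookup, chrom, pos, window=WINDOW):
    """Round the query down to a multiple of `window` and look that key up."""
    return lookup.get((chrom, pos // window * window))


lookup = build_window_hash(GENES)
print(len(lookup))                                # 54 keys instead of 53229
print(lookup_gene(lookup, "chr8", 18030500))      # NAT1
print(lookup_gene(lookup, "chr8", 18027980))      # None, although inside NAT1
```
</pre>
<p><span lang="EN-US">The hash shrinks by about a factor of 1000, but a query that sits between a gene's start and the next multiple of 1000 rounds down to a key that was never stored.</span></p>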
<p><span lang="EN-US">This produces results in under 30s and annotates 303474 pfam records in total. That is still short of the final 338879, because this time I only annotated gene intervals at integer multiples of 1000, so a pfam record that falls less than 1K from a gene's start or end is not annotated. The code needs further optimization at this point.</span></p>
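<p><span lang="EN-US">One way the gap might be closed, which is not taken from the original script, is to index each gene under every 1K bin it touches and then confirm each candidate with an exact coordinate comparison. The sketch below is in Python; the names are hypothetical, and treating any overlap (rather than full containment) as a hit is an assumption.</span></p>
<pre>
```python
from collections import defaultdict

BIN = 1000

# Toy gene table: (chrom, start, end, symbol), copied from the sample rows above.
GENES = [
    ("chr8", 18027970, 18081198, "NAT1"),
    ("chr8", 18248754, 18258723, "NAT2"),
]


def build_bin_index(genes, bin_size=BIN):
    """Register each gene under every bin it touches, including partial edge bins."""
    index = defaultdict(list)
    for chrom, start, end, symbol in genes:
        for b in range(start // bin_size, end // bin_size + 1):
            index[(chrom, b)].append((start, end, symbol))
    return index


def genes_overlapping(index, chrom, start, end, bin_size=BIN):
    """Collect candidates from the query's bins, then keep only true overlaps."""
    found = set()
    for b in range(start // bin_size, end // bin_size + 1):
        for g_start, g_end, symbol in index.get((chrom, b), ()):
            # Exact check: two inclusive intervals overlap when the smaller
            # end is not before the larger start.
            if min(end, g_end) >= max(start, g_start):
                found.add(symbol)
    return sorted(found)


index = build_bin_index(GENES)
print(genes_overlapping(index, "chr8", 18027980, 18028100))  # ['NAT1']
print(genes_overlapping(index, "chr8", 18027000, 18027969))  # []
```
</pre>
<p><span lang="EN-US">The bins only narrow down the candidates; the exact comparison decides the hit, so records close to a gene boundary are neither missed nor wrongly assigned.</span></p>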
<p><span lang="EN-US">I did not bother to upload the script here; it is in <a href="http://note.youdao.com/share/?id=58e66d138e9434284ffa61c53b65abdc&amp;type=note">my Youdao Cloud Note</a>.</span></p>
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/1001.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
