<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>生信菜鸟团 &#187; RCurl</title>
	<atom:link href="http://www.bio-info-trainee.com/tag/rcurl/feed" rel="self" type="application/rss+xml" />
	<link>http://www.bio-info-trainee.com</link>
	<description>You are welcome to leave a comment and join the discussion at the forum biotrainee.com, or to follow the WeChat official account of the same name, biotrainee</description>
	<lastBuildDate>Sat, 28 Jun 2025 14:30:13 +0000</lastBuildDate>
	<language>zh-CN</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=4.1.33</generator>
	<item>
		<title>Batch-downloading bioinformatics course slides with the RCurl and XML packages in R</title>
		<link>http://www.bio-info-trainee.com/799.html</link>
		<comments>http://www.bio-info-trainee.com/799.html#comments</comments>
		<pubDate>Fri, 29 May 2015 23:29:07 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[R]]></category>
		<category><![CDATA[RCurl]]></category>
		<category><![CDATA[xml]]></category>
		<category><![CDATA[web scraping]]></category>
		<category><![CDATA[bioinformatics slides]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=799</guid>
		<description><![CDATA[First up is Pennsylvania State University (The Pennsylvania State Univ &#8230; <a href="http://www.bio-info-trainee.com/799.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>First up are the bioinformatics course slides from The Pennsylvania State University (abbreviated <em>PSU</em>). This course offers not only the slides but also a complete set of lecture videos on the Chinese video site Youku. Excellent!</p>
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/05/image0011.png"><img class="alignnone  wp-image-800" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/05/image0011.png" alt="image001" width="610" height="395" /></a></p>
<p>The course homepage is <a href="http://www.personal.psu.edu/iua1/courses/2013-BMMB-597D.html">http://www.personal.psu.edu/iua1/courses/2013-BMMB-597D.html</a></p>
<p>As you can see, all of the PDF slide links sit on this single page, so the code is very simple!</p>
<p>Here is the R code:</p>
<pre><code>library(XML)
library(RCurl)
library(dplyr)

psu_edu_url = 'http://www.personal.psu.edu/iua1/courses/2013-BMMB-597D.html'
wp = getURL(psu_edu_url)                      # fetch the course page
base = 'http://www.personal.psu.edu/iua1/courses/file'
# pse_edu_links = getHTMLLinks(psu_edu_url)   # getHTMLLinks() could also take the URL directly
psu_edu_links = getHTMLLinks(wp)              # extract every link on the page
psu_edu_pdf = psu_edu_links[grepl("\\.pdf$", psu_edu_links, perl = TRUE)]  # keep only PDF links
for (pdf in psu_edu_pdf) {
  down_url = getRelativeURL(pdf, base)        # resolve the link against the base URL
  filename = last(strsplit(pdf, "/")[[1]])    # file name = last path component
  cat("Now we download", filename, "\n")
  # pdf_file = getBinaryURL(down_url)         # manual alternative: fetch the bytes ...
  # FH = file(filename, "wb")                 # ... and write them out with writeBin()
  # writeBin(pdf_file, FH)
  # close(FH)
  download.file(down_url, filename, mode = "wb")  # mode="wb" keeps binary PDFs intact on Windows
}
</code></pre>
<p>Each of these thirty-odd slide decks is close to 10 MB, so the download takes quite a while.</p>
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/05/image0031.png"><img class="alignnone size-full wp-image-801" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/05/image0031.png" alt="image003" width="338" height="245" /></a></p>
<p>In fact R ships with the download.file() function, so instead of the commented-out getBinaryURL()/writeBin() steps you can fetch each file directly with download.file(down_url, filename).</p>
<p>Next I started downloading the bioinformatics slides from the Free University of Berlin. The difference from Pennsylvania State University is that here the course homepage only links to the individual lecture pages, and the PDF handouts live inside those pages. So I wrapped the PDF download in a function and applied it to each lecture page in turn.</p>
<pre><code>library(XML)
library(RCurl)
library(dplyr)

base = "http://www.mi.fu-berlin.de/w/ABI/Genomics12"

# download every PDF linked from the given page
down_pdf = function(url) {
  links = getHTMLLinks(url)
  pdf_links = links[grepl("\\.pdf$", links, perl = TRUE)]
  for (pdf in pdf_links) {
    down_url = getRelativeURL(pdf, base)
    filename = last(strsplit(pdf, "/")[[1]])
    cat("Now we download", filename, "\n")
    download.file(down_url, filename, mode = "wb")
  }
}

down_pdf(base)

# the 15 lecture pages all follow the same URL pattern
list_lecture = paste("http://www.mi.fu-berlin.de/w/ABI/GenomicsLecture", 1:15, "Materials", sep = "")
for (url in list_lecture) {
  cat("Now we process", url, "\n")
  try(down_pdf(url))   # try() keeps the loop running if one page fails
}
</code></pre>
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/05/image0051.png"><img class="alignnone size-full wp-image-802" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/05/image0051.png" alt="image005" width="540" height="277" /></a></p>
<p>Again, there are plenty of PDFs to download.</p>
<p>Next I downloaded the University of Minnesota&#8217;s collection of bioinformatics tutorial slides in PPT form.</p>
<p>The homepage is: <a href="https://www.msi.umn.edu/tutorial-materials">https://www.msi.umn.edu/tutorial-materials</a></p>
<p>That page contains 64 slide decks in PDF format plus a few zip archives. I had planned to write an R crawler for it, but that felt like overkill for material I might never get around to reading, and this was only for fun anyway, so I implemented the crawling with a few simple commands on the Linux command line instead.</p>
<pre><code>curl https://www.msi.umn.edu/tutorial-materials &gt; tmp.txt
# pull the pdf / txt / zip links out of the page source
perl -alne '{/(https.*?pdf)/;print $1 if $1}' tmp.txt &gt; pdf.address
perl -alne '{/(https.*?txt)/;print $1 if $1}' tmp.txt
perl -alne '{/(https.*?zip)/;print $1 if $1}' tmp.txt &gt; zip.address
# download everything in each link list
wget -i pdf.address
wget -i zip.address
</code></pre>
<p>And that is all it takes!</p>
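<p>As a side note, the same link extraction can be done with grep alone instead of perl. A minimal sketch, assuming an illustrative tmp.txt (the two links written below are made up; in practice tmp.txt is the curl output from above):</p>

```shell
# stand-in for the curl output above (illustrative links, not the real page)
cat > tmp.txt <<'EOF'
<a href="https://www.example.edu/slides/intro.pdf">Intro</a>
<a href="https://www.example.edu/data/archive.zip">Data</a>
EOF
# -o prints only the matched text, -E enables extended regexes
grep -oE 'https[^"]*\.pdf' tmp.txt > pdf.address
cat pdf.address
```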
<p>Doing it with the R crawler is likewise only a few lines of work: the download function is already written, so you just point it at a different homepage to grab every PDF on that page!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/799.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
