<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>生信菜鸟团 &#187; Web scraping</title>
	<atom:link href="http://www.bio-info-trainee.com/tag/%e7%88%ac%e8%99%ab/feed" rel="self" type="application/rss+xml" />
	<link>http://www.bio-info-trainee.com</link>
	<description>Join the discussion on the forum at biotrainee.com, or follow the WeChat official account of the same name, biotrainee</description>
	<lastBuildDate>Sat, 28 Jun 2025 14:30:13 +0000</lastBuildDate>
	<language>zh-CN</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=4.1.33</generator>
	<item>
		<title>Batch-downloading bioinformatics course slides with R's RCurl and XML packages</title>
		<link>http://www.bio-info-trainee.com/799.html</link>
		<comments>http://www.bio-info-trainee.com/799.html#comments</comments>
		<pubDate>Fri, 29 May 2015 23:29:07 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[R]]></category>
		<category><![CDATA[RCurl]]></category>
		<category><![CDATA[xml]]></category>
		<category><![CDATA[web scraping]]></category>
		<category><![CDATA[bioinformatics slides]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=799</guid>
		<description><![CDATA[First up is The Pennsylvania State Univ &#8230; <a href="http://www.bio-info-trainee.com/799.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>First up are the bioinformatics course slides from The Pennsylvania State University (abbreviated <em>PSU</em>). The course comes not only with slides but also with a full set of lecture videos on the Chinese video site Youku. Excellent material!</p>
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/05/image0011.png"><img class="alignnone  wp-image-800" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/05/image0011.png" alt="image001" width="610" height="395" /></a></p>
<p>The course homepage is <a href="http://www.personal.psu.edu/iua1/courses/2013-BMMB-597D.html">http://www.personal.psu.edu/iua1/courses/2013-BMMB-597D.html</a></p>
<p>As you can see, every slide PDF link sits on this single page, so the code is very simple!</p>
<p>Here is the R code:</p>
<pre>
library(XML)     # getHTMLLinks()
library(RCurl)   # getURL(), getRelativeURL()
library(dplyr)   # last()
psu_edu_url='http://www.personal.psu.edu/iua1/courses/2013-BMMB-597D.html'
wp=getURL(psu_edu_url)                  # fetch the course homepage
base='http://www.personal.psu.edu/iua1/courses/file'
psu_edu_links=getHTMLLinks(wp)          # every link on the page
psu_edu_pdf=psu_edu_links[grepl(".pdf$",psu_edu_links,perl=T)]   # keep only the PDFs
for (pdf in psu_edu_pdf){
  down_url=getRelativeURL(pdf,base)     # make the relative link absolute
  filename=last(strsplit(pdf,"/")[[1]]) # last path component is the file name
  cat("Now downloading",filename,"\n")
  # getBinaryURL() plus writeBin() would also work here
  download.file(down_url,filename)
}
</pre>
<p>Each of these thirty-odd slide decks is close to 10 MB, so the download takes quite a while.</p>
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/05/image0031.png"><img class="alignnone size-full wp-image-801" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/05/image0031.png" alt="image003" width="338" height="245" /></a></p>
<p>In fact R has the download.file() function built in, so download.file(down_url, filename) downloads a file directly.</p>
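<p>For comparison, the same link-extraction and filename logic can be sketched in Python with only the standard library. This is a hypothetical equivalent, not code from the original post: the PdfLinkParser class and the sample HTML below are made up, and a real run would fetch the course page first.</p>

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class PdfLinkParser(HTMLParser):
    """Collect the href of every <a> tag that points at a .pdf file."""
    def __init__(self):
        super().__init__()
        self.pdf_links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and value.endswith(".pdf"):
                    self.pdf_links.append(value)

base = "http://www.personal.psu.edu/iua1/courses/2013-BMMB-597D.html"
sample_html = '<a href="files/lecture01.pdf">L1</a> <a href="notes.txt">notes</a>'

parser = PdfLinkParser()
parser.feed(sample_html)
for link in parser.pdf_links:
    down_url = urljoin(base, link)   # resolve the relative link against the page URL
    filename = link.split("/")[-1]   # last path component, as in the R version
    print(down_url, filename)
```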
<p>Next I downloaded the bioinformatics slides from the Free University of Berlin. The difference from Penn State is that the course homepage only links to the individual topics, and the PDF handouts live inside each topic page, so I wrapped the PDF download in a function and batch-processed the topics with it.</p>
<pre>
library(XML)
library(RCurl)
library(dplyr)
base="http://www.mi.fu-berlin.de/w/ABI/Genomics12"
down_pdf=function(url){
  links=getHTMLLinks(url)
  pdf_links=links[grepl(".pdf$",links,perl=T)]
  for (pdf in pdf_links){
    down_url=getRelativeURL(pdf,url)      # resolve against the page the link came from
    filename=last(strsplit(pdf,"/")[[1]])
    cat("Now downloading",filename,"\n")
    download.file(down_url,filename)
  }
}
down_pdf(base)
list_lecture=paste("http://www.mi.fu-berlin.de/w/ABI/GenomicsLecture",1:15,"Materials",sep="")
for (url in list_lecture){
  cat("Now processing",url,"\n")
  try(down_pdf(url))    # try() keeps the loop alive when one lecture page fails
}
</pre>
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/05/image0051.png"><img class="alignnone size-full wp-image-802" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/05/image0051.png" alt="image005" width="540" height="277" /></a></p>
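<p>The pattern above (wrap the per-page download in a function, generate the 15 lecture URLs, and let a failure on one page not kill the loop) can be sketched in Python as well. The down_pdf stub and the deliberately failing URL are invented for illustration; only the URL generation mirrors the R paste() call.</p>

```python
def down_pdf(url):
    """Stub for the per-page PDF download; a real version would fetch and parse `url`."""
    if "Lecture13" in url:                 # pretend this one lecture page is broken
        raise RuntimeError("page not found")
    return url

# Python equivalent of paste("...GenomicsLecture", 1:15, "Materials", sep="")
list_lecture = [
    f"http://www.mi.fu-berlin.de/w/ABI/GenomicsLecture{i}Materials"
    for i in range(1, 16)
]

done, failed = [], []
for url in list_lecture:
    print("Now processing", url)
    try:                                   # same role as R's try(): one failure must not stop the loop
        done.append(down_pdf(url))
    except RuntimeError:
        failed.append(url)
```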
<p>Again, plenty of PDFs to download.</p>
<p>Next up: the University of Minnesota's collection of bioinformatics tutorial slides.</p>
<p>The homepage is: <a href="https://www.msi.umn.edu/tutorial-materials">https://www.msi.umn.edu/tutorial-materials</a></p>
<p>&nbsp;</p>
<p>The page holds 64 slide decks in PDF format plus a few archives. I was going to write a scraper for it, but that felt like a hassle for material I might never read, and this is just for fun anyway,</p>
<p>so I implemented the scraping with a few Linux command-line tools instead.</p>
<p>curl https://www.msi.umn.edu/tutorial-materials &gt;tmp.txt</p>
<p>perl -alne '{/(https.*?pdf)/;print $1 if $1}' tmp.txt &gt;pdf.address</p>
<p>perl -alne '{/(https.*?txt)/;print $1 if $1}' tmp.txt</p>
<p>perl -alne '{/(https.*?zip)/;print $1 if $1}' tmp.txt &gt;zip.address</p>
<p>wget -i pdf.address</p>
<p>wget -i zip.address</p>
<p>And that's it!</p>
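<p>The two Perl one-liners simply pull every https&#8230;pdf or https&#8230;zip match out of the page source, one line at a time. A rough Python equivalent is shown below; the sample text is made up, and the dot in .pdf is escaped here, which the one-liners skip:</p>

```python
import re

# two lines of made-up page source; the one-liners also work line by line (perl -n)
tmp = (
    '<a href="https://www.msi.umn.edu/sites/default/files/intro-to-perl.pdf">perl</a>\n'
    '<a href="https://www.msi.umn.edu/sites/default/files/examples.zip">zip</a>'
)

# same non-greedy idea as /(https.*?pdf)/ and /(https.*?zip)/
pdf_address = re.findall(r"https.*?\.pdf", tmp)
zip_address = re.findall(r"https.*?\.zip", tmp)
print(pdf_address, zip_address)
```

The resulting lists play the role of pdf.address and zip.address, ready to feed to a downloader.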
<p>&nbsp;</p>
<p>With the scraper it really is a matter of a few lines: the download function is already written, so swapping in a different homepage is all it takes to grab every PDF on that page!</p>
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/799.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Full set of bioinformatics course slides from the University of Minnesota, USA</title>
		<link>http://www.bio-info-trainee.com/655.html</link>
		<comments>http://www.bio-info-trainee.com/655.html#comments</comments>
		<pubDate>Tue, 21 Apr 2015 13:06:12 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[linux]]></category>
		<category><![CDATA[web scraping]]></category>
		<category><![CDATA[bioinformatics]]></category>
		<category><![CDATA[course slides]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=655</guid>
		<description><![CDATA[I just saw a post on Zhihu about PacBio data characteristics and stumbled on the Minnesota &#8230; <a href="http://www.bio-info-trainee.com/655.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>I was just reading a post on Zhihu about the characteristics of PacBio data, and along the way found the University of Minnesota's collection of bioinformatics tutorial slides, so I decided to download the whole set.</p>
<p>https://www.msi.umn.edu/tutorial-materials</p>
<p>The page holds 64 slide decks in PDF format plus a few archives. I was going to write a scraper for it, but that felt like a hassle for material I might never read, and this is just for fun anyway,<br />
so I implemented the scraping with a few Linux command-line tools instead.<br />
curl https://www.msi.umn.edu/tutorial-materials &gt;tmp.txt<br />
perl -alne '{/(https.*?pdf)/;print $1 if $1}' tmp.txt &gt;pdf.address<br />
perl -alne '{/(https.*?txt)/;print $1 if $1}' tmp.txt<br />
perl -alne '{/(https.*?zip)/;print $1 if $1}' tmp.txt &gt;zip.address<br />
wget -i pdf.address<br />
wget -i zip.address<br />
And that's it!<br />
The list of tutorial slides follows; anyone interested can download and browse them.</p>
<p>2009-04-22-mrm-presentation_0.pdf               Matlab_viz_image_UMR.pdf<br />
Analyzing ChIP at the command line.pdf          MaxQuant_Introduction_112409.pdf<br />
Analyzing ChIP using Galaxy.pdf                 Maxquant-step-by-step_rs091124.pdf<br />
Badalamenti_PacBio_tutorial_12-10-2014.pdf      MSI Applications Catalog Oct 21 MB slides.pdf<br />
basics_chip_seq.pdf                             MSIIntro2013Jun18.pdf<br />
Best_Practices_GATK_Variant_Detection_v1_0.pdf  MSIIntroBMEN5311.pdf<br />
blast2go.pdf                                    MSI_Workshop_for_Introduction_to_Structure_based_Drug_Design.pdf<br />
ClinProTools_0.pdf                              MTLB_GPUs.pdf<br />
CUDA_Programming.pdf                            OpenMP.tutorial_1.pdf<br />
cuda_tutorial_performance.pdf                   Open_Source_Proteomics_1.pdf<br />
FLUENT_2009April21_final.pdf                    OptimizingWithGA.pdf<br />
FLUENT_tutorial_2008aug14fin.pdf                Orbi_Data_Analysis_092811.pdf<br />
galaxy_101_V4_ljm_0.pdf                         Partek Training Handout_miRNA and mRNA Data Analysis.pdf<br />
GPU_tools.pdf                                   PerformanceTuning_itasca_11_27_12_0.pdf<br />
gpututorial-msi.pdf                             PETSc_Tutorial.pdf<br />
Hands_On_Tutorial_Using_ProTIP.pdf              Phi_Intro.pdf<br />
Introduction to MSI Systems.pdf                 Protein_Grouping_FDR_Analysis_and_Database_Pratik_March2012_Draft.pdf<br />
Introduction_to_PEAKS_0.pdf                     Proteomics_MSI_072309_Print.pdf<br />
Introduction_to_SBDD.pdf                        pymol_v5.pdf<br />
IntroMPI2011july19c.pdf                         QC_illumina_galaxy_V1_ljm.pdf<br />
IntroMPI2012_July25-part1.pdf                   Quality Control of Illumina Data at the Command Line.pdf<br />
IntroMSI2014.pdf                                remotevisualization.pdf<br />
IntroNWChem.pdf                                 RISS_Hsapiens_variant_Detection_v3.0-small.pdf<br />
IntroOpenMP_2011jun28b.pdf                      RNA_seq_Lecture2_2014_v2.pdf<br />
Intro_to_GAMESS.pdf                             RNA-Seq mod1v6.pdf<br />
IntroToGaussian09.pdf                           R_Spring2012_ver2.pdf<br />
introtomolpro.pdf                               SchrodingerTutorial2011.pdf<br />
Intro_to_MSI_Physicists.pdf                     Sybyl.pdf<br />
intro-to-perl.pdf                               Tutorial-Hsap-v15.pdf<br />
Matlab_11_29_UMR.pdf                            Tutorial-Stuber-v12-1.pdf<br />
Matlab_PCT.pdf                                  unix2013.6.18.pdf<br />
MATLAB_Tuning.pdf                               WRKSP_2_19.pdf</p>
<p>Total wall clock time: 40m 22s<br />
Downloaded: 64 files, 249M in 40m 2s (106 KB/s)</p>
<p>I have already downloaded everything, zipped it up, and shared it in the QQ group!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/655.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Scraping every thread from the popular bioinformatics forum seq-answers</title>
		<link>http://www.bio-info-trainee.com/328.html</link>
		<comments>http://www.bio-info-trainee.com/328.html#comments</comments>
		<pubDate>Wed, 18 Mar 2015 13:34:24 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[linux]]></category>
		<category><![CDATA[computing basics]]></category>
		<category><![CDATA[perl]]></category>
		<category><![CDATA[seq-answer]]></category>
		<category><![CDATA[web scraping]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=328</guid>
		<description><![CDATA[Scraping every thread from the popular bioinformatics forum seq-answers. This is the second installment of the scraping series, covering how &#8230; <a href="http://www.bio-info-trainee.com/328.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p style="text-align: center;"><b>Scraping every thread from the popular bioinformatics forum seq-answers</b></p>
<p>This is the second installment of the scraping series. It walks through analyzing the seq-answers site and scraping the full thread list, tag list, and so on. The prerequisite is that you know Perl, then learn Perl's LWP module; consider printing that book out and reading it, it is quite useful!</p>
<p>Honestly, scraping is a personal hobby of mine and has little to do with this site; downloading things one by one by hand would reach the same goal. I just think this way has more technical flair. If you would rather not do the mindless manual version, learn this yourself, and if anything is unclear you can ask me!</p>
<p>http://seqanswers.com/ is the homepage</p>
<p>http://seqanswers.com/forums/forumdisplay.php?f=18 has 570 pages to scrape in total</p>
<p>where f=18 identifies the bioinformatics board we want to scrape</p>
<p>http://seqanswers.com/forums/forumdisplay.php?f=18&#038;order=desc&#038;page=1</p>
<p>http://seqanswers.com/forums/forumdisplay.php?f=18&#038;order=desc&#038;page=570</p>
<p>&lt;tbody id="threadbits_forum_18"&gt; wraps a long run of &lt;tr&gt; pairs;</p>
<p>the first five &lt;tr&gt; pairs can be skipped, since their content is not needed.</p>
<p><span id="more-328"></span></p>
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/生信常用论坛seq556.png"><img class="alignnone size-full wp-image-329" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/生信常用论坛seq556.png" alt="生信常用论坛seq556" width="554" height="200" /></a></p>
<p>That captures the whole thread listing!</p>
<p>Here is the complete code:</p>
[perl]
use LWP::Simple;
use LWP::UserAgent;
use HTML::TreeBuilder;
use URI;

my $base = 'http://seqanswers.com/forums/';   # thread links in the listing are relative to this
my $tmp_ua = LWP::UserAgent->new;             # the UserAgent sends the page requests
$tmp_ua->timeout(15);                         # 15-second connection timeout
$tmp_ua->protocols_allowed( [ 'http', 'https' ] );   # allow only http and https
$tmp_ua->agent(
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727;.NET CLR 3.0.04506.30; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)"
);

open FH_OUT, ">bioinformatics.csv";
my $total_pages = 570;
foreach (1..$total_pages) {
    my $url = URI->new("http://seqanswers.com/forums/forumdisplay.php?");
    my ($f, $page) = (18, $_);
    $url->query_form(
        'f'     => $f,
        'order' => 'desc',
        'page'  => $page,
    );
    &amp;get_each_index($url, 'FH_OUT');
    print $url . "\n";
}

sub get_each_index {
    my ($url, $handle) = @_;
    my $response = $tmp_ua->get($url);
    my $html     = $response->content;
    my $tree = HTML::TreeBuilder->new;        # empty tree
    $tree->parse($html) or print "error : parse html";
    my $tmp = $tree->find_by_attribute("id", "threadbits_forum_18");
    return unless $tmp;                       # skip pages without the thread table
    my @list_tr = $tmp->find_by_tag_name('tr');
    splice(@list_tr, 0, 5);                   # drop the five header rows
    foreach (@list_tr) {
        my @list_td = $_->find_by_tag_name('td');
        next unless @list_td &gt; 4;
        my $brief  = $list_td[2]->attr('title');              # thread preview text, unused in the output
        my $title  = $list_td[2]->find_by_tag_name('a')->as_text();
        my $href   = $list_td[2]->find_by_tag_name('a')->attr('href');
        my $author = $list_td[3]->as_text();
        print $handle "$base$href\t$title\t$author\n";
    }
}
[/perl]
<p>The thread list looks like this:</p>
<p>17,109 threads in total.</p>
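<p>The script has two moving parts: building the paginated URLs (the job of URI-&gt;query_form) and walking the &lt;tr&gt; rows inside the threadbits_forum_18 element while skipping the header rows. A hypothetical Python sketch of both, with a made-up three-row sample instead of a live page:</p>

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

# Equivalent of URI->query_form: one URL per listing page
def page_url(f, page):
    return "http://seqanswers.com/forums/forumdisplay.php?" + urlencode(
        {"f": f, "order": "desc", "page": page}
    )

class RowCounter(HTMLParser):
    """Count <tr> rows inside the element with id='threadbits_forum_18'."""
    def __init__(self):
        super().__init__()
        self.inside = False
        self.rows = 0
    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("id") == "threadbits_forum_18":
            self.inside = True
        elif self.inside and tag == "tr":
            self.rows += 1
    def handle_endtag(self, tag):
        if tag == "tbody":
            self.inside = False

sample = '<tbody id="threadbits_forum_18"><tr></tr><tr></tr><tr></tr></tbody>'
p = RowCounter()
p.feed(sample)
data_rows = max(p.rows - 5, 0)   # the first five rows are headers and get skipped
print(page_url(18, 1), p.rows, data_rows)
```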
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/生信常用论坛seq2414.png"><img class="alignnone size-full wp-image-330" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/生信常用论坛seq2414.png" alt="生信常用论坛seq2414" width="554" height="424" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/328.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Scraping every post from the popular bioinformatics forum bio-star</title>
		<link>http://www.bio-info-trainee.com/323.html</link>
		<comments>http://www.bio-info-trainee.com/323.html#comments</comments>
		<pubDate>Wed, 18 Mar 2015 13:11:54 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[perl]]></category>
		<category><![CDATA[bio-star]]></category>
		<category><![CDATA[web scraping]]></category>
		<category><![CDATA[forums]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=323</guid>
		<description><![CDATA[Scraping every post from the popular bioinformatics forum bio-star. This is the first installment of the scraping series, covering how &#8230; <a href="http://www.bio-info-trainee.com/323.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p style="text-align: center;"><b>Scraping every post from the popular bioinformatics forum bio-star</b></p>
<p>This is the first installment of the scraping series. It walks through analyzing the bio-star site and scraping the full post list, tag list, and so on. The prerequisite is that you know Perl, then learn Perl's LWP module; consider printing that book out and reading it, it is quite useful!</p>
<p><a href="http://seqanswers.com/">http://seqanswers.com/</a> is the homepage</p>
<p>http://seqanswers.com/forums/forumdisplay.php?f=18 has 570 pages to scrape in total</p>
<p>http://seqanswers.com/forums/forumdisplay.php?f=18&#038;order=desc&#038;page=1</p>
<p>http://seqanswers.com/forums/forumdisplay.php?f=18&#038;order=desc&#038;page=570</p>
<p>&lt;tbody id="threadbits_forum_18"&gt; wraps a long run of &lt;tr&gt; pairs;</p>
<p>the first five &lt;tr&gt; pairs can be skipped, since their content is not needed.</p>
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/生信常用论坛bio_star462.png"><img class="alignnone size-full wp-image-324" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/生信常用论坛bio_star462.png" alt="生信常用论坛bio_star462" width="554" height="154" /></a></p>
<p><span id="more-323"></span></p>
<p>That captures the whole listing!</p>
<p>First let's look at how to scrape the board structure from the forum homepage; only then do we go into each board to scrape its posts.</p>
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/生信常用论坛bio_star520.png"><img class="alignnone size-full wp-image-325" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/生信常用论坛bio_star520.png" alt="生信常用论坛bio_star520" width="495" height="233" /></a></p>
<p>Next, the code that goes into each board and scrapes the posts; you can copy-paste and use it directly!</p>
[perl]
use LWP::Simple;
use LWP::UserAgent;
use HTML::TreeBuilder;
use URI;

my $tmp_ua = LWP::UserAgent->new;             # the UserAgent sends the page requests
$tmp_ua->timeout(15);                         # 15-second connection timeout
$tmp_ua->protocols_allowed( [ 'http', 'https' ] );   # allow only http and https
$tmp_ua->agent(
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727;.NET CLR 3.0.04506.30; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)"
);

$base = 'https://www.biostars.org';
open FH_IN, "index.txt";                      # one line per tag: listing URL, tag name, post count
while (&lt;FH_IN&gt;) {
    chomp;
    @F = split;
    open FH_OUT, ">index-$F[1].txt";
    $total_pages = int($F[2]/40) + 1;         # 40 posts per listing page
    foreach (1..$total_pages) {
        my $url = URI->new("$F[0]/?");
        my ($sort, $page) = ("update", $_);
        $url->query_form(
            'page' => $page,
            'sort' => $sort,
        );
        &amp;get_each_index($url, 'FH_OUT');
        print $url . "\n";
    }
}

sub get_each_index {
    my ($url, $handle) = @_;
    my $response = $tmp_ua->get($url);
    my $html     = $response->content;
    my $tree = HTML::TreeBuilder->new;        # empty tree
    $tree->parse($html) or print "error : parse html";
    my @list_title = $tree->find_by_attribute('class', "post-title");
    foreach (@list_title) {
        my $title = $_->as_text();
        my $href  = $_->find_by_tag_name('a')->attr('href');
        print $handle "$base$href,$title\n";
    }
}
[/perl]
<p>That scrapes out the post list.</p>
<p>https://www.biostars.org/t/rna-seq rna 1573</p>
<p>https://www.biostars.org/t/R R 1309</p>
<p>https://www.biostars.org/t/snp snp 1268</p>
<p>and so on for every other tag.</p>
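<p>The only arithmetic in the script is turning a tag's post count into a page count at 40 posts per page, via int($F[2]/40)+1. A small Python sketch of that step follows; note that the +1 form counts one page too many when the post count is an exact multiple of 40, which math.ceil avoids:</p>

```python
import math

POSTS_PER_PAGE = 40

def total_pages(post_count):
    # direct equivalent of the Perl: int(count/40) + 1
    return post_count // POSTS_PER_PAGE + 1

def total_pages_exact(post_count):
    # rounds up without the off-by-one on exact multiples
    return math.ceil(post_count / POSTS_PER_PAGE)

# tag counts quoted in the post: rna 1573, R 1309, snp 1268
for tag, n in [("rna", 1573), ("R", 1309), ("snp", 1268)]:
    print(tag, total_pages(n), total_pages_exact(n))
```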
<p>The post files are shown below. All of the code and the scraped posts are shared in my QQ group; you are welcome to join group 201161227, 生信菜鸟团!</p>
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/生信常用论坛bio_star2283.png"><img class="alignnone size-full wp-image-326" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/生信常用论坛bio_star2283.png" alt="生信常用论坛bio_star2283" width="218" height="482" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/323.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
