Indexer: finished at 2013-xx-xx xx:xx:xx, elapsed: 00:00:01
SolrDeleteDuplicates: starting at 2013-xx-xx xx:xx:xx
SolrDeleteDuplicates: Solr url: http://localhost:8983/solr/
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)
at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353)
at org.apache.nutch.crawl.Crawl.run(Crawl.java:160)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
After downgrading to Solr 4.4, however, this exception no longer occurs.
Prerequisites:
JAVA
# sudo aptitude install openjdk-7-jdk
Nutch
# wget http://archive.apache.org/dist/nutch/1.7/apache-nutch-1.7-src.tar.gz
Solr
# wget http://archive.apache.org/dist/lucene/solr/4.4.0/solr-4.4.0.tgz
1. First, extract the downloaded Nutch 1.7 and Solr 4.4.0 archives and move them to a path of your choice.
# tar -zxf apache-nutch-1.7-src.tar.gz
# tar -zxf solr-4.4.0.tgz
# mv apache-nutch-1.7 solr-4.4.0 U_PATH ### U_PATH: choose your own path
2. Enter the Nutch directory and compile.
# cd U_PATH/apache-nutch-1.7
# ant
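3. When the ant build finishes (about 4 to 5 minutes), a new runtime directory appears. From here on, everything you run and every configuration file you edit lives inside runtime.
4. Enter the runtime directory and set up the basic crawl configuration.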
# cd runtime/local
# chmod +x bin/nutch
# export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
"""JAVA_HOME is not the same on every machine, so check where Java is installed on yours.
You can also put the export line directly into ~/.profile.
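Because the Java install path differs between machines, one way to find it is to resolve the javac symlink. This is only a minimal sketch: it assumes a readlink that supports -f (GNU coreutils or BusyBox) and a JDK that keeps javac under its bin directory. The helper name java_home_from is made up for illustration.

```shell
# java_home_from PATH: hypothetical helper. Resolves symlinks in PATH and
# strips the trailing /bin/<name> to get the install directory.
java_home_from() {
    resolved="$(readlink -f "$1")" || return 1
    printf '%s\n' "${resolved%/bin/*}"
}

# Use javac rather than java: on OpenJDK 7 the java binary may resolve to
# <JDK>/jre/bin/java, which would yield the JRE directory instead of the JDK.
if command -v javac >/dev/null 2>&1; then
    JAVA_HOME="$(java_home_from "$(command -v javac)")"
    export JAVA_HOME
    echo "JAVA_HOME=$JAVA_HOME"
else
    echo "javac not found on PATH; install a JDK first" >&2
fi
```

If the printed path looks right, the same export line can go into ~/.profile.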
Edit nutch-site.xml:
# vim conf/nutch-site.xml
<configuration>
<property>
<name>http.agent.name</name>
<value>My Nutch Spider</value>
</property>
</configuration>
Set the site to crawl:
# mkdir urls
# cd urls
# echo http://nutch.apache.org/ > seed.txt ### replace http://nutch.apache.org/ with the site you want to crawl
# cd ..
# vim conf/regex-urlfilter.txt
Change the last line from
+.
to the following:
+^http://([a-z0-9]*\.)*nutch.apache.org/
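To see which URLs this rule lets through, the pattern can be tried with grep -E. This is only an approximation: Nutch evaluates the file with Java regexes and also applies the "-" rules earlier in the file, which this sketch ignores. The helper name url_allowed and the sample URLs other than http://nutch.apache.org/ are made up for illustration.

```shell
# url_allowed URL: hypothetical helper. Approximates only the final "+" rule
# of regex-urlfilter.txt; the earlier "-" rules in the file are not applied.
url_allowed() {
    printf '%s\n' "$1" | grep -Eq '^http://([a-z0-9]*\.)*nutch.apache.org/'
}

for url in \
    "http://nutch.apache.org/" \
    "http://wiki.nutch.apache.org/page" \
    "http://example.com/"
do
    if url_allowed "$url"; then
        echo "accept $url"
    else
        echo "reject $url"
    fi
done
```

The first two URLs match the rule and the third does not, so only pages on nutch.apache.org and its subdomains would be crawled.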
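That completes the basic configuration, and you could start crawling now.
Here, however, we want to build an index while crawling, so the next steps cover the Solr setup.
5. Start Solr first.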
# cd U_PATH/solr-4.4.0/example
# java -jar start.jar
Open this URL in a browser: http://localhost:8983/solr/
If the Solr page appears, Solr is running.
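The same check can be done from a terminal. A minimal sketch, assuming curl is installed; the helper name solr_is_up and the 5-second timeout are illustrative choices, not part of Solr.

```shell
# solr_is_up URL: hypothetical helper. Succeeds if URL answers with an
# HTTP success status within 5 seconds (redirects are followed).
solr_is_up() {
    curl --silent --fail --location --max-time 5 --output /dev/null "$1"
}

SOLR_URL="http://localhost:8983/solr/"
if solr_is_up "$SOLR_URL"; then
    echo "Solr is reachable at $SOLR_URL"
else
    echo "Solr is not reachable at $SOLR_URL" >&2
fi
```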
6. Modify the Solr configuration.
# mkdir U_PATH/solr-4.4.0/example/solr/conf
# cp -r U_PATH/apache-nutch-1.7/runtime/local/conf/* U_PATH/solr-4.4.0/example/solr/conf/
Add the following inside the <fields> ... </fields> element of U_PATH/solr-4.4.0/example/solr/collection1/conf/schema.xml:
# vim U_PATH/solr-4.4.0/example/solr/collection1/conf/schema.xml
<fields>
<field name="host" type="string" stored="false" indexed="true"/>
<field name="digest" type="string" stored="true" indexed="false"/>
<field name="segment" type="string" stored="true" indexed="false"/>
<field name="boost" type="float" stored="true" indexed="false"/>
<field name="tstamp" type="date" stored="true" indexed="false"/>
<field name="anchor" type="string" stored="true" indexed="true" multiValued="true"/>
<field name="cache" type="string" stored="true" indexed="false"/>
</fields>
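After editing, a quick grep can confirm that none of the fields was left out. A minimal sketch: the helper name missing_fields is made up, and the check only looks for the name attribute, not for the type or the stored and indexed settings.

```shell
# missing_fields FILE: hypothetical helper. Prints each field name from the
# list above that has no <field name="..."> entry in FILE.
missing_fields() {
    for name in host digest segment boost tstamp anchor cache; do
        grep -q "<field name=\"$name\"" "$1" || echo "$name"
    done
}

# Replace U_PATH with the path chosen in step 1.
SCHEMA="U_PATH/solr-4.4.0/example/solr/collection1/conf/schema.xml"
if [ -f "$SCHEMA" ]; then
    missing="$(missing_fields "$SCHEMA")"
    if [ -z "$missing" ]; then
        echo "all fields present"
    else
        echo "missing fields:"
        echo "$missing"
    fi
else
    echo "schema not found: $SCHEMA" >&2
fi
```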
Solr is now configured.
Solr must be restarted for the change to take effect:
# cd U_PATH/solr-4.4.0/example
# java -jar start.jar
7. Crawl and build the index.
# cd U_PATH/apache-nutch-1.7/runtime/local
# bin/nutch crawl urls -dir crawl -depth 2 -topN 5 -solr http://localhost:8983/solr/
If output like the following appears, the crawl succeeded:
SOLRIndexWriter
solr.server.url : URL of the SOLR instance (mandatory)
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : use authentication (default false)
solr.auth : username for authentication
solr.auth.password : password for authentication
Indexer: finished at 2013-10-21 11:54:42, elapsed: 00:00:01
SolrDeleteDuplicates: starting at 2013-10-21 11:54:42
SolrDeleteDuplicates: Solr url: http://localhost:8983/solr/
SolrDeleteDuplicates: finished at 2013-10-21 11:54:43, elapsed: 00:00:01
crawl finished: crawl
A new crawl directory appears in the current directory; it holds the crawled results and the index.
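To look at the result, the crawl directory and the Solr index can be inspected from runtime/local. A minimal sketch, assuming the crawl created the usual crawldb, linkdb and segments subdirectories, that curl is installed, and that the documents went into the collection1 core whose schema was edited in step 6. The helper name crawl_dir_ok is made up for illustration.

```shell
# crawl_dir_ok DIR: hypothetical helper. Checks for the subdirectories that
# a Nutch 1.x crawl normally creates.
crawl_dir_ok() {
    for sub in crawldb linkdb segments; do
        [ -d "$1/$sub" ] || { echo "missing $1/$sub" >&2; return 1; }
    done
}

if crawl_dir_ok crawl; then
    # Summarize the crawl database (URL counts by status).
    bin/nutch readdb crawl/crawldb -stats
    # Ask Solr how many documents it holds; rows=0 returns only the count.
    curl --silent "http://localhost:8983/solr/collection1/select?q=*:*&rows=0&wt=json"
fi
```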
Reference : http://blog.csdn.net/panjunbiao/article/details/12171147