Original English source:
DissectingTheNutchCrawler. When reposting this article, please credit the source: http://blog.csdn.net/pwlazy
Command "fetch": net.nutch.fetcher.Fetcher
> "fetch: fetch a segment's pages"
> Usage: Fetcher [-logLevel level] [-showThreadID] [-threads n] dir
So far we've created a webdb, primed it with URLs, and created a segment that a Fetcher can write to. Now let's look at the Fetcher itself, and try running it to see what comes out.
net.nutch.fetcher.Fetcher relies on several other classes:
Fetcher.main() reads arguments, instantiates a new Fetcher object, sets options, then calls run(). The Fetcher constructor is similarly simple; it just instantiates all of the input/output streams:
Fetcher.run() instantiates 1..threadCount FetcherThread objects, calls thread.start() on each, sleeps until all threads are gone or a fatal error is logged, then calls close() on the i/o streams.
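The start-then-wait pattern described for Fetcher.run() can be sketched roughly as follows. This is a minimal illustration, not the Nutch source: the class name `FetcherRunSketch`, the method `runWorkers`, and the counters are all hypothetical, and the worker body is a placeholder for the real fetching work.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for the Fetcher.run() pattern; not the Nutch source. */
class FetcherRunSketch {

    /** Starts threadCount workers, polls until all are gone, returns how many completed. */
    static int runWorkers(int threadCount) {
        AtomicInteger active = new AtomicInteger(threadCount); // threads still running
        AtomicInteger completed = new AtomicInteger();         // threads that did their work
        AtomicBoolean fatalError = new AtomicBoolean(false);   // never set in this sketch

        for (int i = 0; i < threadCount; i++) {
            Thread t = new Thread(() -> {
                try {
                    completed.incrementAndGet(); // placeholder for the fetch loop
                } finally {
                    active.decrementAndGet();    // this thread is "gone"
                }
            }, "fetcher-" + i);
            t.start();
        }

        // Sleep until every thread has finished or a fatal error was flagged.
        while (active.get() > 0 && !fatalError.get()) {
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }

        // The real Fetcher would call close() on its I/O streams at this point.
        return completed.get();
    }

    public static void main(String[] args) {
        System.out.println("workers completed: " + runWorkers(10));
    }
}
```

The polling loop mirrors the "sleeps until all threads are gone" wording above; in new code, `Thread.join()` or an `ExecutorService` would usually be a simpler way to wait.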
FetcherThread is an inner class of net.nutch.fetcher.Fetcher that extends java.lang.Thread. It has one instance method, run(), and three static methods: handleFetch(), handleNoFetch(), and logError().
FetcherThread.run() instantiates a new FetchListEntry called "fle", then runs the following in an infinite loop:
- If a fatal error was logged, break
- Get the next entry in the FetchList, break if none remain
- Extract url from FetchListEntry
- If the FetchListEntry is not tagged "fetch", call this.handleNoFetch() with status=1. This in turn does:
- If it is tagged "fetch", call ProtocolFactory and get Protocol and Content objects for this url
- Call this.handleFetch(url, fle, content). This in turn does:
  - Call ParserFactory.getParser() for this content type
  - Call parser.getParse(content)
  - Call Fetcher.outputPage() with a new FetcherOutput, including url MD5, the populated Content object, and a new ParseText
- On every 100th pass through loop, write a status message to the log
- Catch any exceptions and log as necessary
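The loop above can be sketched in simplified form. Everything here is an assumption made for illustration: `FetchLoopSketch`, its `Entry` class (a stand-in for FetchListEntry), the string placeholder for the fetched content, and the log messages are hypothetical, and the real handleFetch() additionally parses the content and writes the output.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical sketch of the FetcherThread.run() loop; not the Nutch source. */
class FetchLoopSketch {

    /** Stand-in for FetchListEntry: a URL plus its "fetch" tag. */
    static final class Entry {
        final String url;
        final boolean fetch;
        Entry(String url, boolean fetch) { this.url = url; this.fetch = fetch; }
    }

    /** Would be set when a fatal error is logged; stays false in this sketch. */
    static volatile boolean fatalError = false;

    /** Walks the fetch list and returns the log lines it produced. */
    static List<String> run(Iterator<Entry> fetchList) {
        List<String> log = new ArrayList<>();
        int passes = 0;
        while (true) {
            if (fatalError) break;              // a fatal error was logged
            if (!fetchList.hasNext()) break;    // no entries remain
            Entry fle = fetchList.next();
            String url = fle.url;
            try {
                if (!fle.fetch) {
                    log.add(handleNoFetch(url, 1));
                } else {
                    // Placeholder for the Protocol lookup and download.
                    String content = "<html>" + url + "</html>";
                    log.add(handleFetch(url, content));
                }
            } catch (RuntimeException e) {
                log.add("error " + url + ": " + e.getMessage());
            }
            if (++passes % 100 == 0) {          // status message every 100th pass
                log.add("status: " + passes + " entries processed");
            }
        }
        return log;
    }

    static String handleNoFetch(String url, int status) {
        return "skipped " + url + " status=" + status;
    }

    static String handleFetch(String url, String content) {
        // The real method parses the content and writes the output here.
        return "fetched " + url;
    }

    public static void main(String[] args) {
        List<Entry> entries = new ArrayList<>();
        entries.add(new Entry("http://a.example/", true));
        entries.add(new Entry("http://b.example/", false));
        run(entries.iterator()).forEach(System.out::println);
    }
}
```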
As we can see here, the fetcher relies on Factory classes to choose the code it uses for different content types: ProtocolFactory() finds a Protocol instance for a given url, and ParserFactory finds a Parser for a given contentType.
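A rough sketch of this factory-based dispatch, under stated assumptions: the lookup tables, the `Parser` interface, and the method signatures below are hypothetical and much simpler than Nutch's actual ProtocolFactory and ParserFactory. The sketch only shows the idea of choosing an implementation by URL scheme or by content type.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of factory lookup; not the Nutch API. */
class FactorySketch {

    /** Simplified parser: turns raw content into plain text. */
    interface Parser {
        String getParse(String content);
    }

    // Assumed registries; a real crawler would load these from configuration or plugins.
    static final Map<String, String> PROTOCOLS = new HashMap<>();
    static final Map<String, Parser> PARSERS = new HashMap<>();
    static {
        PROTOCOLS.put("http", "HttpProtocol");
        PROTOCOLS.put("ftp", "FtpProtocol");
        PARSERS.put("text/html", content -> content.replaceAll("<[^>]*>", "")); // strip tags
        PARSERS.put("text/plain", content -> content);
    }

    /** Picks a protocol name from the URL scheme. */
    static String getProtocol(String url) {
        String scheme = URI.create(url).getScheme();
        String protocol = PROTOCOLS.get(scheme);
        if (protocol == null) {
            throw new IllegalArgumentException("no protocol for scheme: " + scheme);
        }
        return protocol;
    }

    /** Picks a parser from the content type. */
    static Parser getParser(String contentType) {
        Parser parser = PARSERS.get(contentType);
        if (parser == null) {
            throw new IllegalArgumentException("no parser for type: " + contentType);
        }
        return parser;
    }

    public static void main(String[] args) {
        System.out.println(getProtocol("http://example.com/"));
        System.out.println(getParser("text/html").getParse("<b>hi</b>"));
    }
}
```

In this sketch, supporting a new content type means registering one more entry in the map, which is the same extension idea the text describes for new Protocol and Parser classes.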
It should now be apparent that implementing a custom crawler with Nutch will revolve around creating new Protocol/Parser classes, and updating ProtocolFactory/ParserFactory to load them as needed. Let's examine these classes now.
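The fetch command corresponds to the class net.nutch.fetcher.Fetcher.
It fetches all the pages of a segment.
The class is invoked as follows:
Fetcher [-logLevel level] [-showThreadID] [-threads n] dir
So far we have created a new webdb, injected URLs into it, and created a segment for the fetcher to write to. Now let's look at the fetcher itself and run it to see what actually happens.
net.nutch.fetcher.Fetcher depends on the following classes:
Fetcher's main method reads the input arguments, instantiates a new Fetcher object, sets its options, and then calls run(). The Fetcher constructor is quite simple; it only instantiates the following input/output streams:
Fetcher's run method instantiates 1 to threadCount FetcherThread objects (translator's note: threadCount is passed in when CrawlTool's main is called; the default is 10), calls start() on each of them, then the main thread sleeps until all child threads have finished or a fatal error has occurred, and finally calls close() on each I/O stream.
FetcherThread is an inner class of net.nutch.fetcher.Fetcher that extends java.lang.Thread. It has one instance method, run(), and three static methods: handleFetch(), handleNoFetch(), and logError().
FetcherThread's run() method first instantiates a new FetchListEntry object called "fle", then runs the following steps in an infinite loop: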
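- If a fatal error has occurred, break out of the loop
- Get the next entry in the FetchList; break out of the loop if none remain
- Get the url from fle
- If fle is not tagged "fetch", call this.handleNoFetch() with the status argument set to 1. The following steps then take place:
- If fle is tagged "fetch", call ProtocolFactory and get the Protocol and Content objects for the url
- Call this.handleFetch(url, fle, content). The following steps then take place:
  - Call ParserFactory.getParser(contentType, url) (translator's note: contentType = content.getContentType();)
  - Call parser.getParse(content) (translator's note: parser = ParserFactory.getParser(contentType, url))
  - Then call outputPage(new FetcherOutput(fle, hash, protocolStatus), content, new ParseText(parse.getText()), parse.getData()); the arguments include a new FetcherOutput object carrying the url MD5 (translator's note: the hash argument), a populated Content object, and a new ParseText
- Every 100 passes through the loop, write a status message to the log
- Catch any exceptions and log them as necessary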
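As we can see, the fetcher relies on factory classes to choose the code it uses for different content types: for example, ProtocolFactory() returns the Protocol instance for a given url, and ParserFactory returns the Parser instance for a given content type.
Clearly, extending the Nutch crawler comes down to creating new Protocol/Parser classes and updating ProtocolFactory/ParserFactory to load them as needed. We can now take a closer look at these classes.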