Dissecting The Nutch Crawler - Factory classes: '''URLFilterFactory'''

English original: DissectingTheNutchCrawler
When reposting this article, please credit the source: http://blog.csdn.net/pwlazy

Factory classes: '''URLFilterFactory'''

> Class net.nutch.net.URLFilterFactory
> used by:
> - net.nutch.db.WebDBInjector
> - net.nutch.tools.UpdateDatabaseTool

URLFilterFactory is not strictly part of the crawler, but it is a good extension point within Nutch. Here's how it works:

When the class is loaded, URLFILTER_CLASS is set to the value returned by NutchConf for the key "urlfilter.class".
When getFilter() is called, it checks whether the filter class has already been loaded. If not, it loads the class using Class.forName(URLFILTER_CLASS). In either case, the loaded filter is returned.
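The two steps above can be sketched roughly as follows. This is a minimal illustration, not the Nutch source: `java.util.Properties` stands in for NutchConf, the `URLFilter` interface is a simplified stand-in, and `HttpOnlyFilter` and `URLFilterFactorySketch` are hypothetical names. It also assumes that the factory instantiates the loaded class through a no-argument constructor and caches the instance.

```java
import java.util.Properties;

/** Simplified stand-in for Nutch's URLFilter interface (assumption). */
interface URLFilter {
    /** Returns the URL if it is accepted, or null if it is rejected. */
    String filter(String url);
}

/** Hypothetical filter used only for illustration: accepts http URLs. */
class HttpOnlyFilter implements URLFilter {
    public String filter(String url) {
        return url.startsWith("http://") ? url : null;
    }
}

/** Sketch of a lazily loading factory; Properties stands in for NutchConf. */
final class URLFilterFactorySketch {
    private static final Properties CONF = new Properties();
    static {
        // In Nutch this value would come from nutch-default.xml.
        CONF.setProperty("urlfilter.class", HttpOnlyFilter.class.getName());
    }

    // Step 1: read the class name once, when the factory class is loaded.
    private static final String URLFILTER_CLASS = CONF.getProperty("urlfilter.class");

    // Cached filter instance; null until getFilter() is first called.
    private static URLFilter filter;

    private URLFilterFactorySketch() {}

    // Step 2: load and instantiate the filter on first use, then reuse it.
    static synchronized URLFilter getFilter() {
        if (filter == null) {
            try {
                Class<?> filterClass = Class.forName(URLFILTER_CLASS);
                filter = (URLFilter) filterClass.getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException e) {
                throw new RuntimeException("Cannot load " + URLFILTER_CLASS, e);
            }
        }
        return filter;
    }
}
```

Because the class name is plain configuration, swapping in a different filter requires no change to the callers of getFilter(), which is what makes the factory a convenient extension point.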

It loads one class, which is configurable via "urlfilter.class". By default, nutch-default.xml specifies this as follows:

<!-- urlfilter properties -->

<property>
  <name>urlfilter.class</name>
  <value>net.nutch.net.RegexURLFilter</value>
  <description>Name of the class used to filter URLs.</description>
</property>

<property>
  <name>urlfilter.regex.file</name>
  <value>regex-urlfilter.txt</value>
  <description>Name of file on CLASSPATH containing default regular
  expressions used by RegexURLFilter.</description>
</property>
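To give a feel for what a regex-driven filter does with such a file, here is a small sketch. It is not Nutch's RegexURLFilter. It assumes a rule format in which each line starts with `+` (accept) or `-` (reject) followed by a regular expression, lines starting with `#` are comments, and the first matching rule decides. The class name `RegexRuleFilterSketch` is hypothetical, and the rules are passed in as a string rather than read from the CLASSPATH.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Sketch of a rule-based URL filter; not Nutch's actual RegexURLFilter. */
final class RegexRuleFilterSketch {

    /** One rule: a pattern plus whether a match accepts (+) or rejects (-). */
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Parses rule text in the assumed "+regex" / "-regex" line format. */
    RegexRuleFilterSketch(String rulesText) {
        for (String raw : rulesText.split("\\R")) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Bad rule: " + line);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(line.substring(1))));
        }
    }

    /** Returns the URL if the first matching rule accepts it, otherwise null. */
    String filter(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept ? url : null;
            }
        }
        return null; // no rule matched: treat the URL as rejected
    }
}
```

With the rules `-\.(gif|jpg|png)$` followed by `+^http://example\.com/`, this sketch rejects image URLs first and then accepts everything else under that host. Rule order matters because evaluation stops at the first match.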

Now let's look at the crawler factories, which are a bit more complex.


2006-08-08 20:49