The Hidden Web (暗网)

Original link:
http://www.cnblogs.com/istep/archive/2008/12/18/1357589.html

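Baidu also announced its "Aladdin platform" (阿拉丁平台) plan today. The plan aims to address the large amount of hidden Web (暗网) content on the Internet that existing search engines cannot crawl or index. Baidu has reportedly put more than 1,000 people to work on developing the platform.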

So I searched online for material on the "hidden Web" and found some in English.

Current-day crawlers retrieve content only from the publicly indexable Web, i.e., the set of Web pages reachable purely by following hypertext links, ignoring search forms and pages that require authorization or prior registration. In particular, they ignore the tremendous amount of high quality content “hidden” behind search forms, in large searchable electronic databases. In this paper, we address the problem of designing a crawler capable of extracting content from this hidden Web. We introduce a generic operational model of a hidden Web crawler and describe how this model is realized in HiWE (Hidden Web Exposer), a prototype crawler built at Stanford. We introduce a new Layout-based Information Extraction Technique (LITE) and demonstrate its use in automatically extracting semantic information from search forms and response pages. We also present results from experiments conducted to test and validate our techniques.
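
To make the idea of crawling "behind search forms" more concrete, here is a minimal Python sketch. It is not HiWE or LITE, whose details the abstract does not give. It only illustrates the general approach: find a search form on a page, read its field names, and generate query URLs from a list of candidate values. The names `FormFieldParser` and `build_query_urls`, the `example.com` URL, and the assumption that the form is submitted with GET are all hypothetical and chosen for illustration.

```python
from html.parser import HTMLParser
from itertools import product
from urllib.parse import urlencode, urljoin


class FormFieldParser(HTMLParser):
    """Collect the action URL and named fields of the first form on a page."""

    def __init__(self):
        super().__init__()
        self.action = None      # stays None if the page has no form
        self.fields = []        # names of fillable fields, in page order
        self._in_form = False
        self._done = False      # True once the first form has been closed

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done and not self._in_form:
            self._in_form = True
            self.action = attrs.get("action") or ""
        elif self._in_form and tag in ("input", "select", "textarea"):
            name = attrs.get("name")
            kind = (attrs.get("type") or "").lower()
            # Skip buttons: they carry no query value to fill in.
            if name and kind not in ("submit", "button", "reset"):
                self.fields.append(name)

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


def build_query_urls(page_url, html, candidate_values):
    """Return one GET URL per combination of candidate values.

    candidate_values maps a field name to a list of values to try.
    Assumption: the form accepts GET submissions.
    """
    parser = FormFieldParser()
    parser.feed(html)
    if parser.action is None:
        return []
    target = urljoin(page_url, parser.action)
    names = [n for n in parser.fields if n in candidate_values]
    if not names:
        return []
    urls = []
    for combo in product(*(candidate_values[n] for n in names)):
        urls.append(target + "?" + urlencode(dict(zip(names, combo))))
    return urls


if __name__ == "__main__":
    page = (
        '<form action="/search" method="get">'
        '<input type="text" name="q">'
        '<input type="submit" value="Go">'
        "</form>"
    )
    for url in build_query_urls("http://example.com/index.html", page,
                                {"q": ["web crawler", "database"]}):
        print(url)
```

A real hidden Web crawler would also have to decide which values make sense for which field and then analyze the response pages. The abstract says the paper addresses both through semantic information extracted from forms and responses; this sketch leaves that part out.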
