计算机专业时文选读:Deep Web

Deep Web

Most writers these days do a significant part of their research using the World Wide Web, with the help of powerful search engines such as Google and Yahoo. There is so much information available that one could be forgiven for thinking that “everything” is accessible this way, but nothing could be further from the truth. For example, as of August 2005, Google claimed to have indexed 8.2 billion Web pages and 2.1 billion images. That sounds impressive, but it’s just the tip of the iceberg. Behold the deep Web.

According to Mike Bergman, chief technology officer at BrightPlanet Corp., more than 500 times as much information as traditional search engines “know about” is available in the deep Web. This massive store of information is locked up inside databases from which Web pages are generated in response to specific queries. Although these dynamic pages have a unique URL address with which they can be retrieved again, they are not persistent or stored as static pages, nor are there links to them from other pages.
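The dynamic pages Bergman describes are easier to picture with a small example. The sketch below is a minimal, hypothetical illustration in Python: a server that assembles each page on the fly from an in-memory database in response to a query parameter. The /lookup path, the record IDs and their contents are all invented for the example, not taken from any real deep Web site.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

# Tiny in-memory "database"; record IDs and contents are made up for illustration.
RECORDS = {"1874": "Census extract, county A", "2051": "Census extract, county B"}


class QueryHandler(BaseHTTPRequestHandler):
    """Builds every page on the fly from the database; nothing is stored as a static file."""

    def do_GET(self):
        params = parse_qs(urlparse(self.path).query)
        record = RECORDS.get(params.get("id", [""])[0])
        body = (record or "no such record").encode()
        self.send_response(200 if record else 404)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # http://localhost:8000/lookup?id=1874 is a unique, repeatable URL, but no static
    # page with that content exists on disk and no other page links to it.
    HTTPServer(("localhost", 8000), QueryHandler).serve_forever()
```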

The deep Web also includes sites that require registration or otherwise restrict access to their pages, prohibiting search engines from browsing them and creating cached copies.

Let’s recap how conventional search engines create their databases. Programs called spiders or Web crawlers start by reading pages from a starting list of Web sites. A spider reads each page on a site, indexes all of its content and adds the words it finds to the search engine’s growing database. When a spider finds a hyperlink to another page, it adds that new link to the list of pages to be indexed. In time, the program reaches all linked pages, presuming that the search engine doesn’t run out of time or storage space. These linked pages constitute what most of us use and refer to as the Internet or the Web. In fact, we have only scratched the surface, which is why this realm of information is often called the surface Web.
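The crawl loop just described (read a page, index its words, add any new hyperlinks to the list of pages still to visit) can be sketched in a few dozen lines of Python using only the standard library. The parser class, the seed URL and the in-memory index are assumptions made for illustration; a real search engine does far more.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkAndTextParser(HTMLParser):
    """Collects the hyperlinks and the visible text of one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        self.text_parts.append(data)


def crawl(seed_urls, max_pages=50):
    """Breadth-first crawl: read a page, index its words, enqueue its links."""
    frontier = deque(seed_urls)   # the starting list of pages to be indexed
    seen = set(seed_urls)         # never revisit the same URL
    index = {}                    # word -> set of URLs containing that word
    pages_read = 0

    while frontier and pages_read < max_pages:
        url = frontier.popleft()
        pages_read += 1
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue              # unreachable page: skip it and move on

        parser = LinkAndTextParser()
        parser.feed(html)

        # Add every word found on the page to the growing index.
        for word in " ".join(parser.text_parts).lower().split():
            index.setdefault(word, set()).add(url)

        # Every static hyperlink joins the list of pages to be indexed next.
        for link in parser.links:
            absolute, _ = urldefrag(urljoin(url, link))
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)

    return index


if __name__ == "__main__":
    # Example seed; any page not reachable by links from here stays invisible to this spider.
    idx = crawl(["https://example.com/"])
    print(sorted(idx.get("example", set())))
```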

Why don’t our search engines find the deeper information? For starters, let’s consider a typical data store that an individual or enterprise has collected, containing books, texts, articles, images, laboratory results and various other kinds of data in diverse formats. Typically we access such database information by means of a query or search: we type in the subject or keyword we’re looking for, the database retrieves the appropriate content, and we are shown a page of results to our query.
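At bottom, such a search is just a lookup against the site’s own database. The following sketch shows the idea with Python’s built-in sqlite3 module; the table name, its columns and the sample rows are purely illustrative assumptions.

```python
import sqlite3

# Illustrative in-memory article store; the schema and rows are invented for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (title TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?)",
    [
        ("Lab results, June", "Spectrometer calibration data and notes."),
        ("Deep Web overview", "Databases generate pages in response to queries."),
    ],
)

keyword = "calibration"  # what the visitor types into the search box
rows = conn.execute(
    "SELECT title FROM articles WHERE body LIKE ?", (f"%{keyword}%",)
).fetchall()
print(rows)  # -> [('Lab results, June',)]
```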

If we can do this easily, why can’t a search engine? We assume that the search engine can reach the query input (or search) page, and it will capture the text on that page and in any pages that may have static hyperlinks to it. But unlike the typical human user, the spider can’t know what words it should type into the query field. Clearly, it can’t type in every word it knows about, and it doesn’t know what’s relevant to that particular site or database. If there’s no easy way to query, the underlying data remains invisible to the search engine. Indeed, any pages that are not eventually connected by links from pages in a spider’s initial list will be invisible and thus are not part of the surface Web as that spider defines it.
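To see where the spider gets stuck, consider what it actually finds on a query page. In the toy example below, the form, its field name and the /results target are invented for illustration; the crawler can parse the form easily enough, but it has no sensible value to type into the text field, so the result pages behind the form are never generated, never linked and never indexed.

```python
from html.parser import HTMLParser

# A toy search page like the one a spider can reach; the form and its fields are assumptions.
SEARCH_PAGE = """
<html><body>
  <h1>Archive search</h1>
  <form action="/results" method="get">
    <input type="text" name="q">
    <input type="submit" value="Search">
  </form>
</body></html>
"""


class FormFinder(HTMLParser):
    """Records form targets and their text inputs, the way a crawler sees them."""

    def __init__(self):
        super().__init__()
        self.actions = []
        self.text_fields = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.actions.append(attrs.get("action"))
        elif tag == "input" and attrs.get("type") == "text":
            self.text_fields.append(attrs.get("name"))


finder = FormFinder()
finder.feed(SEARCH_PAGE)

# The crawler sees the form and the field name, but it has nothing to type into "q",
# so /results?q=<something> is never requested and the records behind it stay unindexed.
print("form targets:", finder.actions)      # ['/results']
print("query fields:", finder.text_fields)  # ['q']
```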

How Deep? How Big?

According to a 2001 BrightPlanet study, the deep Web is very big indeed: The company found that the 60 largest deep Web sources contained 84 billion pages of content with about 750TB of information. These 60 sources constituted a resource 40 times larger than the surface Web. Today, BrightPlanet reckons the deep Web totals 7500TB, with more than 250,000 sites and 500 billion individual documents. And that’s just for Web sites in English or European character sets. (For comparison, remember that Google, the largest crawler-based search engine, now indexes some 8 billion pages.)
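A little arithmetic with the figures just quoted puts the gap in perspective; this snippet only restates the article’s own numbers and adds no new measurements.

```python
# All values below are the figures quoted in the article.
deep_web_documents = 500e9    # individual documents BrightPlanet attributes to the deep Web
google_indexed_pages = 8e9    # pages indexed by Google, the largest crawler-based engine
top60_sources_tb = 750        # information held by the 60 largest deep Web sources, in TB
deep_web_total_tb = 7500      # BrightPlanet's estimate for the entire deep Web, in TB

print(f"deep Web documents per Google-indexed page: {deep_web_documents / google_indexed_pages:.0f}")  # ~62
print(f"share of deep Web data in the top 60 sources: {top60_sources_tb / deep_web_total_tb:.0%}")     # 10%
```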

The deep Web is getting deeper and bigger all the time. Two factors seem to account for this. First, newer data sources (especially those not in English) tend to be of the dynamic-query/searchable type, which is generally more useful than static pages. Second, governments at all levels around the world have committed to making their official documents and records available on the Web.

Interestingly, deep Web sites appear to receive 50% more monthly traffic than surface sites do, and they have more sites linked to them, even though they are not really known to the public. They are typically narrower in scope but likely to have deeper, more detailed content.

深度Web

如今,大多数作者都借助Google、Yahoo等强大的搜索引擎,在万维网上完成相当一部分研究工作。网上可资利用的信息如此之多,以至于有人以为“所有信息”都能通过这种方式获取,这也情有可原,然而事实远非如此。例如,截至2005年8月,Google公司称已为82亿个网页和21亿张图片建立了索引。这听起来十分惊人,但只是冰山一角。且看深度Web。

据BrightPlanet公司首席技术官Mike Bergman称,深度Web上可获得的信息量是传统搜索引擎“知道的”信息量的500多倍。这些海量信息被锁在数据库里,网页是响应具体查询而从数据库中生成的。虽然这些动态网页拥有唯一的URL地址,可以凭此地址再次检索到它们,但它们既不是持久保存的,也不以静态页面的形式存储,其他网页上也没有指向它们的链接。

深度Web还包括那些需要注册或以其他方式限制页面访问的网站,搜索引擎因此无法浏览这些页面并生成缓存副本。

让我们回顾一下常规搜索引擎是如何建立数据库的。被称为蜘蛛程序或爬行程序的程序,从一份网站起始列表开始读取网页。蜘蛛程序读取网站上的每一页,对其全部内容编制索引,并把发现的词加入不断扩大的搜索引擎数据库中。当蜘蛛程序发现指向另一页面的超链接时,就把这个新链接加入待编索引的页面列表中。只要搜索引擎没有用光时间或存储空间,该程序最终会到达所有被链接的页面。这些被链接的页面就构成了我们大多数人所使用和所说的因特网或万维网。事实上,我们只触及了表面,所以这部分信息领域常常被称为表层Web。

为什么我们的搜索引擎找不到更深层的信息呢?首先,让我们考虑一下个人或企业收集的典型数据存储,其中包含以各种格式保存的图书、文本、文章、图像、实验结果以及其他各类数据。一般而言,我们通过查询或搜索来访问这类数据库信息:输入要查找的主题或关键字,数据库检索出相应的内容,然后给我们显示一个查询结果页面。

如果我们能轻易做到这一点,为什么搜索引擎就做不到呢?我们假定搜索引擎能够到达查询输入(或搜索)页面,它会捕捉该页面上的文本,以及任何与该页面有静态超链接的页面上的文本。但与一般的人类用户不同,蜘蛛程序不知道应该往查询字段里键入什么词。显然,它不可能把自己知道的每个词都键入一遍,它也不知道哪些词与那个特定的网站或数据库相关。如果没有简便的查询途径,底层数据对搜索引擎来说就是不可见的。的确,任何最终没有被蜘蛛程序初始列表中的页面链接到的网页都是不可见的,因而也不属于该蜘蛛程序所定义的表层Web。

多深?多大?

据BrightPlanet公司2001年的一项研究,深度Web确实非常大:该公司发现,60个最大的深度Web信息源包含840亿页内容,约750TB的信息量。仅这60个信息源构成的资源就比表层Web大40倍。如今,BrightPlanet公司估计深度Web总量已达7500TB,拥有超过25万个网站和5000亿份单个文档,而这还仅仅是英文或欧洲字符集的网站。(作为比较,请记住Google这个最大的基于爬行程序的搜索引擎目前也只为约80亿个页面编制了索引。)

随着时间的推移,深度Web变得越来越深、越来越大。造成这种情况似乎有两个因素:第一,较新的数据源(特别是那些非英语的数据源)往往属于动态查询/可搜索类型,这类数据源通常比静态页面更有用;第二,世界各地各级政府都已承诺将其官方文件和记录发布到Web上。

有意思的是,深度Web网站每月获得的流量似乎比表层网站多50%,链接到它们的网站也更多,尽管它们并不为公众所熟知。它们的范围通常比较窄,但往往拥有更深入、更详细的内容。