The crawler takes a URL from the queue of URLs to be crawled and hands it to the web page downloader, which is responsible for fetching the page content. Downloaded pages are stored in a local page library, pending subsequent processing such as indexing; at the same time, the page's URL is placed in the crawled-URL queue, which records which URLs the crawler system has already downloaded, so that the same page is not fetched twice. For each newly downloaded page, all of its links are extracted and checked against the crawled-URL queue; if a link has not yet been crawled, its URL is appended to the end of the to-crawl queue, and the corresponding page will be downloaded in a later crawl cycle. This forms a loop that continues until the to-crawl queue is empty, which means the crawler system has exhausted the pages it can fetch and a complete crawl round is finished.
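The loop described above can be sketched in a few lines. The example below uses a toy in-memory link graph in place of real HTTP downloading and HTML link extraction (both are assumptions here), but the queue/set structure is the same:

```python
from collections import deque

# Toy in-memory "web": URL -> list of outgoing links. This stands in for
# real downloading and link extraction, which are assumed away here.
WEB = {
    "index": ["about", "news"],
    "about": ["index"],
    "news": ["story1", "story2"],
    "story1": [],
    "story2": ["index"],
}

def crawl(seed):
    to_crawl = deque([seed])  # queue of URLs awaiting download
    crawled = set()           # crawled-URL record: prevents double fetching
    library = {}              # local store of downloaded page content
    while to_crawl:           # loop ends when the queue is exhausted
        url = to_crawl.popleft()
        if url in crawled:
            continue
        links = WEB.get(url, [])   # "download" the page
        library[url] = links       # store it for later indexing
        crawled.add(url)
        for link in links:         # extract links, enqueue unseen ones
            if link not in crawled and link not in to_crawl:
                to_crawl.append(link)
    return library

library = crawl("index")
```

Using a FIFO queue gives breadth-first crawl order; real crawlers usually replace it with a priority queue that reflects page importance.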
In practice, a crawler often also has to deal with duplicate and spam (cheating) web pages.
The above describes the workflow of a general-purpose crawler. From the crawler's point of view, during this dynamic crawling process all the pages on the internet can be divided into five parts, as shown in Figure 2-2:
1. Downloaded web pages: pages the crawler has already fetched from the internet and stored locally for indexing.
2. Expired web pages: because the number of web pages is enormous, a full crawl round takes a long time, and many pages that have already been downloaded may become stale during the crawl. Internet pages change constantly, so the local copies easily become inconsistent with the actual pages on the internet.
3. Web pages to be downloaded: pages whose URLs sit in the to-crawl queue and are about to be fetched by the crawler.
4. Knowable web pages: these pages have not yet been downloaded, nor do their URLs appear in the to-crawl queue, but they can be reached by following links from already-crawled pages or from pages in the to-crawl queue, so the crawler will eventually fetch and index them.
5. Unknowable web pages: some pages cannot be reached by the crawler at all, and these form the set of unknowable pages. In practice, the proportion of such pages is quite high.
Depending on the application, crawler systems differ in many ways; broadly, they can be divided into three types:
1. Batch crawler: a batch crawler has a clearly defined crawl scope and target, and stops crawling once that target is reached. The target might be a set number of pages to fetch, a time limit on the crawl, and so on.
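The two stopping conditions mentioned, a page-count target and a time budget, can be bolted onto the basic crawl loop. This is a minimal sketch; the `fetch_links` callback and the toy graph are assumptions standing in for real downloading:

```python
import time
from collections import deque

def batch_crawl(seed, fetch_links, max_pages=100, max_seconds=60.0):
    """Breadth-first crawl that stops at a page count or a time budget."""
    start = time.monotonic()
    to_crawl, crawled = deque([seed]), set()
    while to_crawl:
        if len(crawled) >= max_pages:               # page-count target reached
            break
        if time.monotonic() - start > max_seconds:  # time budget exhausted
            break
        url = to_crawl.popleft()
        if url in crawled:
            continue
        crawled.add(url)                            # "download" the page
        to_crawl.extend(l for l in fetch_links(url) if l not in crawled)
    return crawled

# Toy link graph standing in for the web (an assumption for the example).
graph = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
limited = batch_crawl("a", lambda u: graph.get(u, []), max_pages=2)
full = batch_crawl("a", lambda u: graph.get(u, []))
```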
2. Incremental crawler: unlike a batch crawler, an incremental crawler keeps crawling continuously and periodically refreshes the pages it has already fetched. Because web pages change constantly, with new pages appearing and existing pages being deleted or modified, an incremental crawler must reflect such changes promptly; it is therefore in a perpetual crawl cycle, either fetching new pages or updating existing ones. The crawlers of common commercial search engines generally fall into this category.
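The "continuous refresh" behavior can be pictured as a re-crawl scheduler: each URL carries a refresh interval, and the crawler always fetches whichever URL is due next. The sketch below uses a logical clock and fixed intervals, which are simplifying assumptions; real systems estimate each page's change frequency:

```python
import heapq

def incremental_schedule(intervals, num_fetches):
    """Return the order in which URLs would be (re)fetched.

    intervals: dict mapping URL -> refresh interval in logical time units.
    """
    # Min-heap of (next-due time, url); everything is due at time 0 at first.
    heap = [(0, url) for url in sorted(intervals)]
    heapq.heapify(heap)
    order = []
    for _ in range(num_fetches):
        due, url = heapq.heappop(heap)        # fetch the most overdue URL
        order.append(url)
        heapq.heappush(heap, (due + intervals[url], url))  # reschedule it
    return order

# A fast-changing news page gets refetched far more often than a static one.
order = incremental_schedule({"news": 1, "static": 5}, 6)
```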
3. Vertical crawler (focused crawler): a vertical crawler targets pages on a specific topic or in a specific industry. For a health site, for example, it only needs to find health-related pages on the internet and can ignore pages from other industries. The greatest feature, and also the greatest difficulty, of a vertical crawler is identifying whether a page's content belongs to the designated topic or industry. To conserve system resources, it is impractical to download every page and filter afterwards, as that would waste far too much bandwidth and storage; the crawler usually needs to judge dynamically, at crawl time, whether a page is likely topic-relevant, and to avoid fetching irrelevant pages as far as possible. This type of crawler is typically needed by vertical search sites or vertical industry sites.
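A crude version of that crawl-time relevance check is keyword overlap against a topic vocabulary. The keyword set below is a hypothetical "health" vocabulary invented for the example; real focused crawlers use trained text classifiers rather than word counting:

```python
# Hypothetical keyword set for a "health" vertical (an assumption here).
TOPIC_KEYWORDS = {"health", "medicine", "diet", "exercise", "nutrition"}

def is_on_topic(page_text, threshold=2):
    """Decide at crawl time whether a page looks topic-relevant.

    A page counts as relevant if at least `threshold` distinct topic
    keywords appear in its text.
    """
    words = set(page_text.lower().split())
    return len(words & TOPIC_KEYWORDS) >= threshold

on = is_on_topic("Daily diet and exercise tips for better health")
off = is_on_topic("Quarterly earnings report for the steel industry")
```

A page judged off-topic is simply not enqueued, which is how the crawler avoids spending downloads on irrelevant content.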




