In the internet age, information is easier to reach than ever, but what we need is often scattered across many websites. Searching for it and collating it by hand is time-consuming and tedious. Is there a way to automate collecting content from other websites? Yes. This article describes how PHP can be used to collect content from other websites.
I. Fetching the HTML content of the target page
First, we need to fetch the HTML content of the target page. PHP provides several functions for reading remote content; the most commonly used is file_get_contents(). It reads the file or URL it is given into a string (fetching a URL this way requires allow_url_fopen to be enabled):
$url = ''; // URL of the target page
$html = file_get_contents($url);
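file_get_contents() returns false when the read fails, so the result is worth checking before it is parsed. The following is a minimal sketch of that check; fetch_html() is a hypothetical helper name, and a local temporary file stands in for the remote page so that the example runs offline.

```php
<?php
// fetch_html() is a hypothetical helper, not a PHP built-in.
function fetch_html(string $source): ?string
{
    // @ suppresses the warning file_get_contents() raises on failure;
    // the function itself returns false in that case.
    $html = @file_get_contents($source);
    return $html === false ? null : $html;
}

// A local temporary file stands in for the remote page here.
// A URL string is passed the same way when allow_url_fopen is enabled.
$tmp = tempnam(sys_get_temp_dir(), 'page');
file_put_contents($tmp, '<html><body>Hello</body></html>');

$found = fetch_html($tmp);                      // contents of the file
$missing = fetch_html($tmp . '.does-not-exist'); // null, the read failed
unlink($tmp);

var_dump($found !== null, $missing === null);
```

Returning null rather than false lets the caller use a nullable string type and skip pages that could not be read.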
If you need to send a POST request to the target site, you can use the cURL extension:
$url = ''; // URL of the target page
$data = array('key1' => 'value1', 'key2' => 'value2');
$options = array(
    CURLOPT_RETURNTRANSFER => true, // return the response instead of printing it
    CURLOPT_POST => true,
    CURLOPT_POSTFIELDS => http_build_query($data),
);
$ch = curl_init($url);
curl_setopt_array($ch, $options);
$html = curl_exec($ch); // false on failure
curl_close($ch);
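The POST body in the request above is produced by http_build_query(), which URL-encodes each key and value. A small sketch of what it generates, using made-up field values:

```php
<?php
// Illustrative form fields; the values are chosen to show the encoding.
$data = ['key1' => 'value 1', 'key2' => 'a&b'];

// Pass the separator explicitly so the result does not depend on php.ini.
$body = http_build_query($data, '', '&');
echo $body, "\n"; // key1=value+1&key2=a%26b

// parse_str() reverses the encoding, which is handy when debugging.
parse_str($body, $decoded);
var_dump($decoded === $data);
```

Encoding the fields this way sends the request as application/x-www-form-urlencoded, which is what an ordinary HTML form submits.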
II. Parsing the HTML content
Next, we need to extract the information we want from the HTML of the target page. This is the job of an HTML parser. Two parsers are commonly used in PHP: DOM and SimpleXML.
The DOM parser turns an HTML document into a tree, and we reach the information we need by walking that tree:
$dom = new DOMDocument();
$dom->loadHTML($html);
$links = $dom->getElementsByTagName('a');
foreach ($links as $link) {
    echo $link->getAttribute('href'), "\n"; // print each link target
}
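Real-world pages are rarely valid HTML, and DOMDocument::loadHTML() emits a warning for every problem it finds. A minimal sketch of the same link extraction with libxml warnings collected internally instead; extract_links() is a hypothetical helper name and the sample markup is made up.

```php
<?php
// extract_links() is a hypothetical helper, not part of the DOM extension.
function extract_links(string $html): array
{
    // Collect libxml parse warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        $links[] = [
            'href' => $a->getAttribute('href'),
            'text' => trim($a->textContent),
        ];
    }
    return $links;
}

$sample = '<html><body><a href="/one">One</a>'
        . '<p><a href="/two"> Two </a></p></body></html>';
$links = extract_links($sample);
print_r($links);
```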
The SimpleXML parser turns a document into an object whose properties and methods give access to the information we need. It only works when the markup is well-formed XML (for example XHTML):
$xml = simplexml_load_string($html);
foreach ($xml->xpath('//a') as $link) {
    echo (string) $link['href'], "\n"; // print each link target
}
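Because simplexml_load_string() returns false for markup that is not well-formed XML, it helps to test the result before calling xpath(). A minimal sketch, assuming made-up sample markup; xhtml_links() is a hypothetical helper name.

```php
<?php
// xhtml_links() is a hypothetical helper, not part of SimpleXML.
function xhtml_links(string $markup): ?array
{
    $previous = libxml_use_internal_errors(true);
    $xml = simplexml_load_string($markup);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($xml === false) {
        return null; // not well-formed XML, fall back to the DOM parser
    }

    $hrefs = [];
    foreach ($xml->xpath('//a') as $a) {
        $hrefs[] = (string) $a['href'];
    }
    return $hrefs;
}

$wellFormed = '<div><a href="/one">One</a><p><a href="/two">Two</a></p></div>';
$tagSoup = '<div><a href="/one">One<p></div>'; // unclosed <a> and <p>

var_dump(xhtml_links($wellFormed)); // two hrefs
var_dump(xhtml_links($tagSoup));    // NULL
```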
III. Processing the results
Once we have the information we need, we have to process it. For example, before storing it in a database it needs to be formatted and filtered.
PHP offers many functions for formatting. For example, to convert a date string into a given date format, you can combine strtotime() with date():
$date_str = '2023-04-28';
$date = date('Y-m-d', strtotime($date_str));
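Scraped dates come in many shapes, and strtotime() returns false when it cannot parse one. A minimal sketch of a normalizer that handles that case; normalize_date() is a hypothetical helper name, and the time zone is fixed to UTC only to keep the example deterministic.

```php
<?php
// Fix the time zone so the result does not depend on server settings.
date_default_timezone_set('UTC');

// normalize_date() is a hypothetical helper, not a PHP built-in.
function normalize_date(string $raw): ?string
{
    $timestamp = strtotime($raw);
    // strtotime() returns false for input it cannot parse.
    return $timestamp === false ? null : date('Y-m-d', $timestamp);
}

var_dump(normalize_date('28 April 2023'));    // 2023-04-28
var_dump(normalize_date('2023/04/28 15:30')); // 2023-04-28
var_dump(normalize_date('not a date'));       // NULL
```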
PHP also offers many functions for filtering. For example, to remove all HTML tags from a string, you can use the strip_tags() function:
$html = '<p>Hello, world!</p>';
$text = strip_tags($html); // "Hello, world!"
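strip_tags() leaves HTML entities and the original whitespace in place, so a little extra cleanup is usually needed before the text is stored. A minimal sketch; clean_text() is a hypothetical helper name and the input is made up.

```php
<?php
// clean_text() is a hypothetical helper, not a PHP built-in.
function clean_text(string $html): string
{
    $text = strip_tags($html);                                    // drop the tags
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8'); // &amp; becomes &
    return trim(preg_replace('/\s+/', ' ', $text));               // collapse whitespace
}

$cleaned = clean_text("<p>Tom &amp; <b>Jerry</b>\n   say  hi</p>");
echo $cleaned, "\n"; // Tom & Jerry say hi
```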
IV. Using third-party libraries to speed up development
When the sites we collect from are more complex, parsing the HTML by hand can become very difficult. In that case, third-party libraries can speed up development.
phpQuery, for example, is a PHP library modeled on jQuery that lets us query HTML with jQuery-style selectors:
require_once 'phpQuery/phpQuery.php'; // adjust the path to where phpQuery is installed
phpQuery::newDocumentHTML($html);
foreach (pq('a') as $link) {
    echo pq($link)->attr('href'), "\n";
}
Goutte is another example: a PHP library built on Symfony components that lets us fetch pages and pick out elements with jQuery-like CSS selectors:

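require_once 'vendor/autoload.php'; // Composer autoloader
$client = new \Goutte\Client();
$crawler = $client->request('GET', $url); // $url: the target page
foreach ($crawler->filter('a') as $link) {
    echo $link->getAttribute('href'), "\n";
}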
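When adding a dependency is not an option, a similar selector-style query can be approximated with the DOMXPath class bundled with PHP. Note that this swaps in a different technique: it uses neither phpQuery nor Goutte, and takes XPath expressions rather than CSS selectors. A minimal sketch on made-up markup:

```php
<?php
// Illustrative markup: links inside "post" blocks are wanted, the ad is not.
$html = '<html><body>'
      . '<div class="post"><a href="/a">A</a></div>'
      . '<div class="ad"><a href="/x">X</a></div>'
      . '<div class="post"><a href="/b">B</a></div>'
      . '</body></html>';

$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Rough XPath counterpart of the CSS selector "div.post a".
// It matches only when the class attribute is exactly "post".
$xpath = new DOMXPath($dom);
$hrefs = [];
foreach ($xpath->query('//div[@class="post"]//a') as $a) {
    $hrefs[] = $a->getAttribute('href');
}
print_r($hrefs);
```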
V. Anti-crawler mechanisms
Some websites use anti-crawler mechanisms to stop crawlers from harvesting their data. When that happens, we need countermeasures.
The most common anti-crawler mechanism is the CAPTCHA. If we run into a CAPTCHA while collecting data, a third-party service can be used to recognize it automatically. Death By Captcha, for example, is a widely used CAPTCHA recognition service:
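require_once 'deathbycaptcha.php';
// Class name as shipped in the Death By Captcha PHP client; check the version you install.
$client = new DeathByCaptcha_SocketClient('username', 'password');
$captcha_file = '/path/to/captcha.png';
$captcha = $client->decode($captcha_file); // result for the uploaded CAPTCHA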
Some websites also check the User-Agent header and block requests that appear to come from a crawler. In that case, we can send browser-like request headers, with a small random component, to imitate a browser. For example:
$url = ''; // URL of the target page
$options = array('http' => array(
    'method' => 'GET',
    'header' => 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.' . rand(1000, 9999) . '.0 Safari/537.36',
));
$context = stream_context_create($options);
$html = file_get_contents($url, false, $context);
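An alternative to randomizing a version number is to pick a complete User-Agent string at random from a small pool. A minimal sketch, assuming an illustrative pool of two strings; browser_like_options() is a hypothetical helper name, and the extra Accept-Language header and timeout are additions for illustration.

```php
<?php
// browser_like_options() is a hypothetical helper, not a PHP built-in.
function browser_like_options(array $userAgents): array
{
    // Pick one User-Agent at random for this request.
    $agent = $userAgents[array_rand($userAgents)];
    return ['http' => [
        'method'  => 'GET',
        'header'  => "User-Agent: $agent\r\nAccept-Language: en-US,en;q=0.9\r\n",
        'timeout' => 10, // seconds
    ]];
}

// Illustrative pool; real crawlers usually keep a longer, current list.
$agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0',
];

$options = browser_like_options($agents);
$context = stream_context_create($options); // pass as the third argument of file_get_contents()
echo $options['http']['header'];
```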
VI. Multi-threaded collection
If we have many sites to collect from, fetching them one at a time can be very slow. In that case, we can run the work in parallel to improve throughput.
In PHP, two commonly used ways to do this are pcntl and cURL.
pcntl is a PHP extension that lets us fork child processes (not threads, strictly speaking) to run tasks in parallel:
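$urls = array('', '', ''); // URLs of the target pages
foreach ($urls as $url) {
    $pid = pcntl_fork();
    if ($pid == -1) {
        die("fork failed\n");
    } elseif ($pid == 0) {
        // Child process: fetch one page, then exit.
        $html = file_get_contents($url);
        // do something with $html
        exit(0);
    }
}
// Parent process: wait for all children so none is left as a zombie.
while (pcntl_wait($status) > 0) {
}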
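A forked child cannot hand variables back to its parent, so the simplest way to report success or failure is the exit code. The following sketch illustrates that pattern under stated assumptions: run_in_children() is a hypothetical helper name, the tasks are stand-in closures rather than real downloads, and the demo is skipped when the pcntl extension is not loaded.

```php
<?php
// run_in_children() is a hypothetical helper, not part of pcntl.
function run_in_children(array $tasks): array
{
    $pids = [];
    foreach ($tasks as $name => $task) {
        $pid = pcntl_fork();
        if ($pid === -1) {
            throw new RuntimeException('fork failed');
        }
        if ($pid === 0) {
            // Child: run the task and report success through the exit code.
            exit($task() ? 0 : 1);
        }
        $pids[$pid] = $name; // Parent: remember which child runs which task.
    }

    $results = [];
    foreach ($pids as $pid => $name) {
        pcntl_waitpid($pid, $status); // blocks until that child has finished
        $results[$name] = pcntl_wifexited($status)
            && pcntl_wexitstatus($status) === 0;
    }
    return $results;
}

// pcntl is normally available only in CLI builds on Unix-like systems.
if (function_exists('pcntl_fork')) {
    $results = run_in_children([
        'ok'      => function () { return true; },  // stand-in for a successful fetch
        'failing' => function () { return false; }, // stand-in for a failed fetch
    ]);
    print_r($results);
}
```

For anything larger than a status flag, the children would typically write their results to files or a database that the parent reads afterwards.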
cURL's multi interface runs several transfers in parallel within one process through the curl_multi_* functions:
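$urls = array('', '', ''); // URLs of the target pages
$mh = curl_multi_init();
$handles = array();
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
    $handles[] = $ch;
}
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh); // wait for activity instead of busy-looping
} while ($running > 0);
foreach ($handles as $ch) {
    $html = curl_multi_getcontent($ch);
    // do something with $html
    curl_multi_remove_handle($mh, $ch);
}
curl_multi_close($mh);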
VII. Processing the results
Finally, once collection is complete, we need to process the results. For example, data that is to be stored in a database should first be deduplicated and sorted.
PHP provides many functions for this. For example, to remove all duplicate elements from an array, you can use the array_unique() function:
$data = array('a', 'b', 'c', 'a', 'd', 'b');
$data = array_unique($data);
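Two details are worth a sketch here: array_unique() keeps the original keys, leaving gaps in the index, and scraped records are usually arrays that need deduplicating by one field. unique_by() is a hypothetical helper name and the records are made up.

```php
<?php
// array_unique() keeps the first occurrence of each value and its key,
// so array_values() is used to reindex from zero.
$values = array_values(array_unique(['a', 'b', 'c', 'a', 'd', 'b']));
print_r($values); // a, b, c, d

// unique_by() is a hypothetical helper: keep the first record per key value.
function unique_by(array $rows, string $key): array
{
    $seen = [];
    foreach ($rows as $row) {
        if (!isset($seen[$row[$key]])) {
            $seen[$row[$key]] = $row;
        }
    }
    return array_values($seen);
}

// Illustrative scraped records with one duplicate URL.
$rows = [
    ['url' => '/one', 'title' => 'One'],
    ['url' => '/two', 'title' => 'Two'],
    ['url' => '/one', 'title' => 'One (again)'],
];
$uniqueRows = unique_by($rows, 'url');
print_r($uniqueRows);
```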
PHP also provides many functions for sorting. For example, to sort an array alphabetically, you can use the sort() function:
$data = array('c', 'b', 'a', 'd');
sort($data); // $data is now array('a', 'b', 'c', 'd')
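Plain sort() compares strings byte by byte, so uppercase letters sort before lowercase ones, and it cannot order records by a field. A minimal sketch of both cases on made-up data:

```php
<?php
// Default sort: "B" (byte 66) comes before "a" (byte 97).
$plain = ['apple', 'Banana', 'cherry'];
sort($plain);
print_r($plain); // Banana, apple, cherry

// Case-insensitive string sort.
$natural = ['apple', 'Banana', 'cherry'];
sort($natural, SORT_FLAG_CASE | SORT_STRING);
print_r($natural); // apple, Banana, cherry

// Illustrative records sorted by date, newest first.
$rows = [
    ['title' => 'banana', 'date' => '2023-04-30'],
    ['title' => 'Apple',  'date' => '2023-04-27'],
    ['title' => 'cherry', 'date' => '2023-04-28'],
];
usort($rows, function (array $a, array $b): int {
    // Dates in Y-m-d form compare correctly as plain strings.
    return strcmp($b['date'], $a['date']);
});
$titles = array_column($rows, 'title');
print_r($titles); // banana, cherry, Apple
```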
VIII. Summary
Collecting content from other websites with PHP is a very useful skill. This article has shown how to fetch and parse pages with PHP, how to deal with anti-crawler mechanisms, how to collect from many sites in parallel, and how to process the results. Note, however, that web scraping must comply with applicable laws, regulations and ethical norms, and must not infringe on the rights and interests of others.




