PHP is a scripting language widely used in web development, and its strong data-processing capabilities and gentle learning curve have made it popular with many developers. In practical development we often need to extract information from web pages, such as page titles and URLs. This article describes how to use PHP to extract page titles and automate the collection of web page information.
I. Fetching web content
To fetch web content, we use the cURL function library in PHP. cURL is a very powerful tool that supports many protocols and authentication methods, and it can simulate browser behaviour to retrieve the complete content of a page.
Here is a simple cURL example:
$url = '';  // set this to the address of the page to fetch
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);  // return the body instead of printing it
$content = curl_exec($ch);
curl_close($ch);
The above code uses the cURL function library to send a request to the specified URL and saves the response in the $content variable. In practical development, we can fetch pages in bulk by looping over multiple URLs.
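The looping approach mentioned above could be sketched roughly as follows. The helper names fetch_page and fetch_all are hypothetical and chosen only for illustration; the sketch assumes the cURL extension is enabled and does no error reporting beyond returning null for a failed request.

```php
<?php
// Hypothetical helper: fetch one page and return its body, or null on failure.
function fetch_page(string $url): ?string
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $content = curl_exec($ch);   // string on success, false on failure
    curl_close($ch);
    return is_string($content) ? $content : null;
}

// Hypothetical helper: fetch each URL in turn.
// The result maps every URL to its body, or to null if the request failed.
function fetch_all(array $urls): array
{
    $pages = [];
    foreach ($urls as $url) {
        $pages[$url] = fetch_page($url);
    }
    return $pages;
}
```

Calling fetch_all with an array of addresses returns one entry per address, so the caller can process the pages that arrived and skip those that are null. The requests run one after another, which is simple but slow for long lists.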
II. Parsing HTML
Once the content of the web page is available, we need to parse it to extract the required information. PHP offers several HTML parsers, such as DOMDocument. This article uses DOMDocument as its example of parsing HTML in PHP.
Here is a simple DOMDocument example:
$doc = new DOMDocument();
@$doc->loadHTML($content);  // @ suppresses warnings about malformed HTML
$title = $doc->getElementsByTagName('title')->item(0)->nodeValue;
The above code uses DOMDocument to parse the HTML. The getElementsByTagName method returns the elements with the given tag name, the item method returns the element at the given position, and the nodeValue property gives the text of that element.
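One thing the short example does not cover is a page with no title element, in which case item(0) returns null and reading nodeValue from it fails. A minimal, more defensive sketch is shown below. The function name extract_title is hypothetical, and the sketch assumes the DOM extension is available.

```php
<?php
// Hypothetical helper: return the trimmed <title> text, or null if there is none.
function extract_title(string $html): ?string
{
    if ($html === '') {
        return null;  // loadHTML() does not accept an empty string
    }

    $doc = new DOMDocument();
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $node = $doc->getElementsByTagName('title')->item(0);
    if ($node === null) {
        return null;  // the page has no <title> element
    }

    $title = trim($node->nodeValue);
    return $title === '' ? null : $title;
}
```

Using libxml_use_internal_errors instead of the @ operator keeps the warning suppression limited to the parser, so other errors in the same statement remain visible.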
III. Handling encoding issues
In practical development, we often run into character-encoding problems. If the encoding of the web page differs from the encoding we use, the text can come out garbled. To solve this problem, we need to convert the encoding of the page content.
The following is a simple encoding-conversion example:
// Without a candidate list, only the encodings in mb_detect_order() are tried.
$charset = mb_detect_encoding($content);
if ($charset !== false) {
    $content = iconv($charset, 'UTF-8//IGNORE', $content);
}
The above code uses the mb_detect_encoding function to detect the encoding of the page and the iconv function to convert the content to UTF-8.
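Detection is more reliable when the likely encodings are listed explicitly and strict mode is turned on. The sketch below illustrates this. The function name to_utf8 and the default candidate list (UTF-8 and GBK) are assumptions made for the example, and it uses mb_convert_encoding in place of iconv so that detection and conversion both come from the mbstring extension.

```php
<?php
// Hypothetical helper: convert page content to UTF-8.
// $candidates is an assumed list of encodings the pages are expected to use.
function to_utf8(string $content, array $candidates = ['UTF-8', 'GBK']): string
{
    // Strict mode rejects encodings in which the bytes are not valid.
    $charset = mb_detect_encoding($content, $candidates, true);

    if ($charset === false || $charset === 'UTF-8') {
        return $content;  // unknown encoding or already UTF-8: leave unchanged
    }

    return mb_convert_encoding($content, 'UTF-8', $charset);
}
```

The candidate list should be adjusted to the sites being fetched. Detection from raw bytes is always a guess, so a charset declared in the HTTP headers or in the page itself is usually a better source when one is available.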
IV. Processing URLs
When working with a web page, we usually need to collect the URLs it contains. If a link in the page uses a relative path, it must be converted to an absolute URL.
The following is a simple URL-processing example:
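$url = '';   // address of the page being processed
$href = '';  // link found in that page
$parts = parse_url($url);
$root = $parts['scheme'] . '://' . $parts['host'];
// Directory of the current page, without a trailing slash.
$base_url = $root . preg_replace('#/[^/]*$#', '', $parts['path'] ?? '/');
if (substr($href, 0, 2) == '//') {
    $href = $parts['scheme'] . ':' . $href;   // protocol-relative link
} elseif (substr($href, 0, 1) == '/') {
    $href = $root . $href;                    // path relative to the site root
} elseif (substr($href, 0, 4) != 'http') {
    while (substr($href, 0, 3) == '../') {    // climb one directory per "../"
        if ($base_url != $root) {
            $base_url = dirname($base_url);
        }
        $href = substr($href, 3);
    }
    $href = $base_url . '/' . $href;
}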
The above code converts a relative path to an absolute URL and handles several cases: links that are already absolute, and links that start with "//", "/" or "../".
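The same conversion can be wrapped in a reusable function that also collapses "." and ".." segments wherever they appear in the path. The following is a minimal sketch under stated assumptions: resolve_url is a hypothetical name, the base is assumed to be a well-formed http or https URL, and the function is not a full implementation of the URL resolution rules in RFC 3986.

```php
<?php
// Hypothetical helper: resolve $href against the URL of the page it came from.
function resolve_url(string $base, string $href): string
{
    // Already absolute (has a scheme such as http: or mailto:): return unchanged.
    if (preg_match('#^[a-z][a-z0-9+.\-]*:#i', $href)) {
        return $href;
    }

    $parts  = parse_url($base);
    $scheme = $parts['scheme'] ?? 'http';
    $origin = $scheme . '://' . ($parts['host'] ?? '')
            . (isset($parts['port']) ? ':' . $parts['port'] : '');

    // Protocol-relative link: reuse the scheme of the base URL.
    if (strncmp($href, '//', 2) === 0) {
        return $scheme . ':' . $href;
    }

    // Set any query string or fragment aside so it is not treated as path.
    $cut    = strcspn($href, '?#');
    $suffix = substr($href, $cut);
    $href   = substr($href, 0, $cut);

    $basePath = $parts['path'] ?? '/';
    if ($href === '') {
        $path = $basePath;                    // only a query or fragment was given
    } elseif ($href[0] === '/') {
        $path = $href;                        // relative to the site root
    } else {
        // Relative to the directory of the base page.
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $href;
    }

    // Collapse "." and ".." segments.
    $segments = [];
    foreach (explode('/', $path) as $segment) {
        if ($segment === '..') {
            array_pop($segments);
        } elseif ($segment !== '.') {
            $segments[] = $segment;
        }
    }

    return $origin . '/' . ltrim(implode('/', $segments), '/') . $suffix;
}
```

For example, resolving "../c.html" against "http://example.com/a/b/page.html" gives "http://example.com/a/c.html".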
V. Handling redirects
Some pages redirect to another address when they are requested. If we want the information on the final page, we need to handle these redirects.
The following is a simple redirect-handling example:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects automatically
$content = curl_exec($ch);
$url = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);  // address of the final page
curl_close($ch);
The above code uses the CURLOPT_FOLLOWLOCATION option of the cURL function library to follow redirects automatically, and uses the curl_getinfo function to obtain the final URL.
VI. Handling errors
In practical development, we must allow for error conditions such as network connection timeouts and pages that do not exist. To keep the program correct and stable, we need to handle these conditions.
The following is a simple error-handling example:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);  // give up after 10 seconds
$content = curl_exec($ch);
if (curl_errno($ch)) {
    echo 'error: ' . curl_error($ch);
}
$http_code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
if ($http_code != 200) {
    echo 'error: http status code is ' . $http_code;
}
curl_close($ch);
The above code uses the CURLOPT_TIMEOUT option of the cURL function library to set the timeout, the curl_errno and curl_error functions to obtain error information, and the curl_getinfo function to obtain the HTTP status code.
VII. Fetching pages in bulk
In actual development, we usually need to fetch many pages in bulk and save the results to a file or a database. To achieve this, we can use techniques such as multithreading or asynchronous requests.
Here is a simple example that fetches several pages concurrently:
$urls = array(/* ... list of page URLs ... */);
$mh = curl_multi_init();
$conn = array();
foreach ($urls as $i => $url) {
    $conn[$i] = curl_init($url);
    curl_setopt($conn[$i], CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($conn[$i], CURLOPT_CONNECTTIMEOUT, 10);
    curl_multi_add_handle($mh, $conn[$i]);
}
do {
    curl_multi_exec($mh, $active);
    if ($active) {
        curl_multi_select($mh);  // wait for activity instead of spinning
    }
} while ($active);
foreach ($urls as $i => $url) {
    $content = curl_multi_getcontent($conn[$i]);
    // process web content
    curl_multi_remove_handle($mh, $conn[$i]);
    curl_close($conn[$i]);
}
curl_multi_close($mh);
The above code uses the curl_multi_init and curl_multi_exec functions of the cURL function library to fetch several pages concurrently. The transfers run in parallel within a single PHP thread.
VIII. Application scenarios
Extracting page titles with PHP can be applied in various scenarios, such as:
1. Automated web testing: fetching page titles automatically to determine whether the test results are correct.
2. Website monitoring: fetching page titles at regular intervals to check whether the site is functioning.
3. Data collection: extracting information from multiple websites for data analysis and processing.
4. SEO optimization: obtaining competitors' page titles and keywords in order to refine your own SEO strategy.
IX. Summary
This article has described how to use PHP to extract page titles and automate the collection of web page information. It covered fetching web content, parsing HTML, handling encoding issues, processing URLs, handling redirects, handling errors, fetching pages in bulk, and application scenarios. After working through it, readers should have the basic skills needed to extract page titles with PHP and be able to apply them flexibly in practical development.




