<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Get links with XPath</title>
	<atom:link href="http://blog.agilephp.com/2008/10/06/get-links-with-xpath/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.agilephp.com/2008/10/06/get-links-with-xpath/</link>
	<description>Dagfinn Reiersøl on PHP, agile development, Ruby and other addictive substances</description>
	<lastBuildDate>Fri, 23 Jul 2010 20:17:39 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: Anup</title>
		<link>http://blog.agilephp.com/2008/10/06/get-links-with-xpath/comment-page-1/#comment-99</link>
		<dc:creator>Anup</dc:creator>
		<pubDate>Mon, 20 Oct 2008 17:41:58 +0000</pubDate>
		<guid isPermaLink="false">http://localhost/wordpress/?p=1414#comment-99</guid>
		<description>&lt;p&gt;Using // is a real performance killer as it causes node traversal of every single element in the document.&lt;/p&gt;
&lt;p&gt;Admittedly finding links throughout a document means you need to use some kind of traversal through lots of unknown elements.&lt;/p&gt;
&lt;p&gt;To address this to some extent it is good to be as specific as you can.&lt;/p&gt;
&lt;p&gt;In (X)HTML documents you could start off by trying an XPath such as this:&lt;/p&gt;
&lt;p&gt;/html/body//a&lt;/p&gt;
&lt;p&gt;This saves traversal of all head elements.&lt;/p&gt;
&lt;p&gt;If you want all anchors inside a div with id content that is immediately inside body you could use something like this:&lt;/p&gt;
&lt;p&gt;/html/body/div[@id=&#039;content&#039;]//a&lt;/p&gt;
&lt;p&gt;If your div could appear anywhere, in a position you cannot easily predict or control, then something like this is okay:&lt;/p&gt;
&lt;p&gt;/html/body//div[@id=&#039;content&#039;]/a&lt;/p&gt;
&lt;p&gt;XPath is quite flexible so you can do a lot more if you know something about the document you are traversing.&lt;/p&gt;
&lt;p&gt;Basically, the more precise you can be, the less wasteful traversal you&#039;ll need. That will really help with performance, especially for large documents.&lt;/p&gt;
&lt;p&gt;(Of course, the more precise your XPath the more likely it will fail when the source HTML changes, so this needs to be considered carefully.)&lt;/p&gt;</description>
		<content:encoded><![CDATA[<p>Using // is a real performance killer as it causes node traversal of every single element in the document.</p>
<p>Admittedly finding links throughout a document means you need to use some kind of traversal through lots of unknown elements.</p>
<p>To address this to some extent it is good to be as specific as you can.</p>
<p>In (X)HTML documents you could start off by trying an XPath such as this:</p>
<p>/html/body//a</p>
<p>This saves traversal of all head elements.</p>
<p>If you want all anchors inside a div with id content that is immediately inside body you could use something like this:</p>
<p>/html/body/div[@id='content']//a</p>
<p>If your div could appear anywhere, in a position you cannot easily predict or control, then something like this is okay:</p>
<p>/html/body//div[@id='content']/a</p>
<p>XPath is quite flexible so you can do a lot more if you know something about the document you are traversing.</p>
<p>Basically, the more precise you can be, the less wasteful traversal you&#8217;ll need. That will really help with performance, especially for large documents.</p>
<p>(Of course, the more precise your XPath the more likely it will fail when the source HTML changes, so this needs to be considered carefully.)</p>
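<p>A minimal sketch of how the three paths above differ, assuming a small hypothetical page (the markup, the <code>hrefs()</code> helper, and the variable names are illustrative and not part of the original post). Note that a path ending in <code>/a</code> matches only anchors that are direct children of the div, while <code>//a</code> matches anchors at any depth.</p>

```php
<?php
// Hypothetical sample page; any (X)HTML document with this shape would do.
$html = <<<'HTML'
<html>
  <head><title>Sample</title></head>
  <body>
    <a href="/top">Top</a>
    <div id="content">
      <a href="/one">One</a>
      <p><a href="/two">Two</a></p>
    </div>
  </body>
</html>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Illustrative helper: collect the href of every node a query returns.
function hrefs(DOMXPath $xpath, string $query): array
{
    $out = [];
    foreach ($xpath->query($query) as $a) {
        $out[] = $a->getAttribute('href');
    }
    return $out;
}

// Every anchor anywhere under body (the head is never visited).
$all = hrefs($xpath, '/html/body//a');

// Anchors at any depth inside a div#content that is a direct child of body.
$inDiv = hrefs($xpath, "/html/body/div[@id='content']//a");

// div#content anywhere under body, but only its direct child anchors.
$children = hrefs($xpath, "/html/body//div[@id='content']/a");

print_r($all);
print_r($inDiv);
print_r($children);
```

<p>On this sample the first query returns all three links, the second returns the two inside the div, and the third returns only the link that is a direct child of the div.</p>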
]]></content:encoded>
	</item>
	<item>
		<title>By: dagfinn</title>
		<link>http://blog.agilephp.com/2008/10/06/get-links-with-xpath/comment-page-1/#comment-97</link>
		<dc:creator>dagfinn</dc:creator>
		<pubDate>Tue, 07 Oct 2008 13:28:18 +0000</pubDate>
		<guid isPermaLink="false">http://localhost/wordpress/?p=1414#comment-97</guid>
		<description>&lt;p&gt;@Zilvinas: That&#039;s correct, I haven&#039;t benchmarked it. I wasn&#039;t the one who claimed the DOM is faster.&lt;/p&gt;
&lt;p&gt;And I&#039;m not saying it&#039;s a sin to use regular expressions. But DOM and XPath tend to be more precise since they respect the structure of the HTML document. For instance, you can search for the presence of an attribute without worrying about the possibility that another attribute might be inserted between the tag name and the attribute you&#039;re searching for.&lt;/p&gt;
&lt;p&gt;And thanks for the other comment, which I&#039;ve deleted for reasons I&#039;m sure you understand.&lt;/p&gt;</description>
		<content:encoded><![CDATA[<p>@Zilvinas: That&#8217;s correct, I haven&#8217;t benchmarked it. I wasn&#8217;t the one who claimed the DOM is faster.</p>
<p>And I&#8217;m not saying it&#8217;s a sin to use regular expressions. But DOM and XPath tend to be more precise since they respect the structure of the HTML document. For instance, you can search for the presence of an attribute without worrying about the possibility that another attribute might be inserted between the tag name and the attribute you&#8217;re searching for.</p>
<p>And thanks for the other comment, which I&#8217;ve deleted for reasons I&#8217;m sure you understand.</p>
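<p>To illustrate the point about attributes, here is a minimal sketch, assuming a hypothetical snippet of markup in which <code>href</code> appears before, after, or without other attributes. The markup and variable names are made up for the example.</p>

```php
<?php
// Hypothetical markup: href appears alone, after another attribute,
// before another attribute, and not at all.
$html = '<html><body>'
      . '<a href="/a">plain</a>'
      . '<a class="nav" href="/b">class first</a>'
      . '<a href="/c" rel="nofollow">rel last</a>'
      . '<a name="anchor-only">no href</a>'
      . '</body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// [@href] tests only for the presence of the attribute,
// wherever it sits inside the tag.
$withHref = [];
foreach ($xpath->query('//a[@href]') as $a) {
    $withHref[] = $a->getAttribute('href');
}

// [@rel='nofollow'] matches on an attribute value in the same
// order-independent way.
$nofollow = [];
foreach ($xpath->query("//a[@rel='nofollow']") as $a) {
    $nofollow[] = $a->getAttribute('href');
}

print_r($withHref);
print_r($nofollow);
```

<p>A regular expression for the same task would have to allow for arbitrary attributes on either side of <code>href</code>, which is where such patterns tend to become fragile.</p>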
]]></content:encoded>
	</item>
	<item>
		<title>By: William Candillon</title>
		<link>http://blog.agilephp.com/2008/10/06/get-links-with-xpath/comment-page-1/#comment-98</link>
		<dc:creator>William Candillon</dc:creator>
		<pubDate>Tue, 07 Oct 2008 12:36:28 +0000</pubDate>
		<guid isPermaLink="false">http://localhost/wordpress/?p=1414#comment-98</guid>
		<description>&lt;p&gt;Using XQuery can also help: http://www.zorba-xquery.com/index.php/24/&lt;/p&gt;</description>
		<content:encoded><![CDATA[<p>Using XQuery can also help: <a href="http://www.zorba-xquery.com/index.php/24/" rel="nofollow">http://www.zorba-xquery.com/index.php/24/</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Zilvinas</title>
		<link>http://blog.agilephp.com/2008/10/06/get-links-with-xpath/comment-page-1/#comment-96</link>
		<dc:creator>Zilvinas</dc:creator>
		<pubDate>Tue, 07 Oct 2008 08:00:00 +0000</pubDate>
		<guid isPermaLink="false">http://localhost/wordpress/?p=1414#comment-96</guid>
		<description>&lt;p&gt;Did you try to benchmark this? I think not. Here is the code for the XPath approach to get all IMDb links:&lt;/p&gt;
&lt;p&gt;$dom = new DOMDocument;&lt;br /&gt;
// set before loading; changing it afterwards has no effect&lt;br /&gt;
$dom-&gt;preserveWhiteSpace = false;&lt;br /&gt;
// @ silences the warnings libxml emits for malformed HTML&lt;br /&gt;
@$dom-&gt;loadHTML(file_get_contents(&#039;http://www.imdb.com/&#039;));&lt;br /&gt;
$xpath = new DOMXPath($dom);&lt;br /&gt;
$links = $xpath-&gt;query(&#039;//a&#039;);&lt;br /&gt;
$ret = array();&lt;br /&gt;
foreach ($links as $tag) {&lt;br /&gt;
	// map each href to its link text; textContent also works for empty anchors&lt;br /&gt;
	$ret[$tag-&gt;getAttribute(&#039;href&#039;)] = $tag-&gt;textContent;&lt;br /&gt;
}&lt;br /&gt;
print_r($ret);&lt;/p&gt;
&lt;p&gt;And here&#039;s the code to do it with regular expressions:&lt;/p&gt;
&lt;p&gt;preg_match_all(&quot;/&lt;a&gt;/i&quot;, file_get_contents(&#039;http://www.imdb.com/&#039;), $matches);&lt;br /&gt;
print_r($matches);&lt;/p&gt;
&lt;p&gt;Sadly, DOM and XPath do not make a big difference here. Regular expressions are even slightly faster on my PC. DOM and XPath leave a bigger memory footprint than the preg functions in this particular case. And the DOM approach is a hell of a lot uglier than preg_match :]&lt;/p&gt;</description>
		<content:encoded><![CDATA[<p>Did you try to benchmark this? I think not. Here is the code for the XPath approach to get all IMDb links:</p>
<p>$dom = new DOMDocument;<br />
// set before loading; changing it afterwards has no effect<br />
$dom-&gt;preserveWhiteSpace = false;<br />
// @ silences the warnings libxml emits for malformed HTML<br />
@$dom-&gt;loadHTML(file_get_contents('http://www.imdb.com/'));<br />
$xpath = new DOMXPath($dom);<br />
$links = $xpath-&gt;query('//a');<br />
$ret = array();<br />
foreach ($links as $tag) {<br />
	// map each href to its link text; textContent also works for empty anchors<br />
	$ret[$tag-&gt;getAttribute('href')] = $tag-&gt;textContent;<br />
}<br />
print_r($ret);</p>
<p>And here&#8217;s the code to do it with regular expressions:</p>
<p>preg_match_all("/&lt;a&gt;/i", file_get_contents('http://www.imdb.com/'), $matches);<br />
print_r($matches);</p>
<p>Sadly, DOM and XPath do not make a big difference here. Regular expressions are even slightly faster on my PC. DOM and XPath leave a bigger memory footprint than the preg functions in this particular case. And the DOM approach is a hell of a lot uglier than preg_match :]</p>
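<p>A comparison like this can be made repeatable with a small timing harness. The following is only a sketch under stated assumptions: it uses a generated page instead of fetching a live site, the <code>timeIt()</code> helper is hypothetical, and the regular expression is an illustrative pattern, not the one used in the comment above. The timings it prints will vary by machine and PHP version.</p>

```php
<?php
// Hypothetical input: a generated page with 2000 links, so no network access is needed.
$html = '<html><body>';
for ($i = 0; $i < 2000; $i++) {
    $html .= '<p><a href="/page/' . $i . '">Page ' . $i . '</a></p>';
}
$html .= '</body></html>';

// Illustrative helper: run a callable $runs times and return
// the last result together with the average seconds per run.
function timeIt(callable $fn, int $runs = 20): array
{
    $result = null;
    $start = microtime(true);
    for ($i = 0; $i < $runs; $i++) {
        $result = $fn();
    }
    return [$result, (microtime(true) - $start) / $runs];
}

// DOM + XPath: parse the document, then count the anchors.
[$domCount, $domTime] = timeIt(function () use ($html) {
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);
    return $xpath->query('//a')->length;
});

// Regular expression: count opening anchor tags that carry attributes.
[$regexCount, $regexTime] = timeIt(function () use ($html) {
    return preg_match_all('/<a\s[^>]*>/i', $html, $matches);
});

printf("DOM/XPath: %d links, %.6f s per run\n", $domCount, $domTime);
printf("Regex:     %d links, %.6f s per run\n", $regexCount, $regexTime);
```

<p>Checking that both approaches report the same number of links before comparing the timings helps ensure the two are doing equivalent work.</p>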
]]></content:encoded>
	</item>
	<item>
		<title>By: daliada</title>
		<link>http://blog.agilephp.com/2008/10/06/get-links-with-xpath/comment-page-1/#comment-95</link>
		<dc:creator>daliada</dc:creator>
		<pubDate>Tue, 07 Oct 2008 07:48:06 +0000</pubDate>
		<guid isPermaLink="false">http://localhost/wordpress/?p=1414#comment-95</guid>
		<description>&lt;p&gt;Thank you for promoting the right tools for PHP.&lt;/p&gt;</description>
		<content:encoded><![CDATA[<p>Thank you for promoting the right tools for PHP.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: dagfinn</title>
		<link>http://blog.agilephp.com/2008/10/06/get-links-with-xpath/comment-page-1/#comment-94</link>
		<dc:creator>dagfinn</dc:creator>
		<pubDate>Tue, 07 Oct 2008 06:48:25 +0000</pubDate>
		<guid isPermaLink="false">http://localhost/wordpress/?p=1414#comment-94</guid>
		<description>&lt;p&gt;Yes, but for slightly different purposes. ;-)&lt;/p&gt;</description>
		<content:encoded><![CDATA[<p>Yes, but for slightly different purposes. <img src='http://blog.agilephp.com/wp-includes/images/smilies/icon_wink.gif' alt=';-)' class='wp-smiley' /> </p>
]]></content:encoded>
	</item>
	<item>
		<title>By: David M</title>
		<link>http://blog.agilephp.com/2008/10/06/get-links-with-xpath/comment-page-1/#comment-93</link>
		<dc:creator>David M</dc:creator>
		<pubDate>Tue, 07 Oct 2008 03:42:53 +0000</pubDate>
		<guid isPermaLink="false">http://localhost/wordpress/?p=1414#comment-93</guid>
		<description>&lt;p&gt;What&#039;s even cooler: jQuery :-)&lt;/p&gt;</description>
		<content:encoded><![CDATA[<p>What&#8217;s even cooler: jQuery <img src='http://blog.agilephp.com/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
]]></content:encoded>
	</item>
</channel>
</rss>
