In a previous article I mentioned that I had made some effort to adapt the GetSimple CMS to use relative URIs by default. This represents a complete reversal of GetSimple’s normal behaviour, which prefers fully qualified URLs.
I also noted that a consequence of using relative URIs was that it broke the links and images in my RSS feed items. My relative URIs became relative to their host domain —
The obvious solution was to parse the RSS XML file and prepend the Perpetual βeta’s base URL to all relative image src and link href attributes.
A naive implementation of this would probably involve a RegEx which, with my Perl background, I should have no problem composing. But RegExes and HTML are not a good fit. Thus, whenever I consider using a RegEx for such a task, I remember this now-infamous thread on Stack Overflow and I quickly turn to a proper XML parser.
I really can’t stress this enough: RegEx is not an appropriate tool for parsing an SGML-based language like HTML.
Parsing the DOM
PHP has a native XML parser, DOMDocument. It is a library of tools for creating, reading from and writing to a DOM. Internally, it represents a DOM as a tree of nodes. Each of those nodes is accessible through DOMDocument, as are each node’s attributes.
An XML file looks something like this:
<books>
<book>
<author>Jack Herrington</author>
<title>PHP Hacks</title>
<publisher>O'Reilly</publisher>
</book>
<book>
<author>Jack Herrington</author>
<title>Podcasting Hacks</title>
<publisher>O'Reilly</publisher>
</book>
</books>
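If we were to load that file into DOMDocument we’d end up with a model of that data that looks something like this: a root books node with two book child nodes, each of which holds an author, a title and a publisher node.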
With that model, I can make queries of the individual nodes including their attributes and data. So I can ask for (pseudo language): $dom->books->book[0]->title and the parser will return the string “PHP Hacks”. Or I could request the values of the src attributes of all the img tags and of the href attributes of the a tags. Sweet!
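As a minimal sketch of that first query in real PHP rather than the pseudo language, the following loads the sample XML with the native DOMDocument class and reads the first title. The variable names are my own, chosen purely for illustration.

```php
<?php
// The sample XML from above, inlined as a string for the sake of the example
$xml = <<<XML
<books>
  <book>
    <author>Jack Herrington</author>
    <title>PHP Hacks</title>
    <publisher>O'Reilly</publisher>
  </book>
  <book>
    <author>Jack Herrington</author>
    <title>Podcasting Hacks</title>
    <publisher>O'Reilly</publisher>
  </book>
</books>
XML;

$dom = new DOMDocument();
$dom->loadXML($xml);

// getElementsByTagName returns every matching node, in document order
$titles = $dom->getElementsByTagName('title');

// item(0) is the first <title> node; textContent is the text it contains
$first_title = $titles->item(0)->textContent;

echo $first_title, PHP_EOL; // PHP Hacks
```

The same getElementsByTagName call is what the code later in this article uses to collect the a tags.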
The Code
So, without further ado, here is the (self-documenting) code I have used to convert my relative URIs to absolute URLs.
<?php
// Shown as a standalone function; inside a class, declare it public
function syndicationURLs($content = NULL) {
    if ( is_null($content) ) {
        return FALSE;
    }

    // Pro Tip #1: Use the SmartDOMDocument extension
    // http://bit.ly/1d7duDg
    if ( !class_exists('SmartDOMDocument') ) {
        return FALSE;
    }

    global $SITEURL; // Set to https://www.perpetual-beta.org/ in my case
    $base_url = rtrim($SITEURL, '/'); // Remove trailing slash

    // Instantiate the object
    $doc = new SmartDOMDocument();

    // Build the DOM from the input (X)HTML snippet
    $doc->loadHTML($content);

    // Method #1: getElementsByTagName
    // Find the link tags
    $links = $doc->getElementsByTagName('a');
    foreach ($links as $link) {
        // Get the value of the href attribute
        $href = $link->getAttribute('href');

        // De-construct the UR(L|I)
        $url_parts = parse_url($href);

        // Is it a relative link (URI)? It is if there is no host
        if ( empty($url_parts['host']) ) {
            // It is, so prepend our base URL
            $link->setAttribute('href', $base_url . $href);
        }
    }

    // Method #2: XPath
    // Pro Tip #2: Use the SimpleXML extension
    $xml = simplexml_import_dom($doc);

    // Find the image tags
    $images = $xml->xpath('//img');
    foreach ($images as $img) {
        // The attribute is a SimpleXMLElement, so cast it to a string first
        $src = (string) $img['src'];

        // De-construct the UR(L|I)
        $url_parts = parse_url($src);

        // Is it a relative link (URI)? It is if there is no host
        if ( empty($url_parts['host']) ) {
            // It is, so prepend our base URL
            $img['src'] = $base_url . $src;
        }
    }

    // Return the processed (X)HTML
    return $doc->saveHTMLExact();
}
The code demonstrates two different methods of accessing tags and attributes. The first, using getElementsByTagName, getAttribute and setAttribute, is the one you’ll see referenced most often in DOMDocument tutorials. The second uses XPath to locate the tags, and the SimpleXML extension allows us to get and set our attributes with PHP’s familiar array syntax.
The final element (no pun intended) of our equation is PHP’s parse_url function. We use this to see if we can extract a host from the attribute value. If we can, then it’s an absolute URL and we need take no further action. Otherwise, it’s a relative URI, in which case we need to prepend our host (and protocol, port, etc., as required).
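To make the host test concrete, here is a small sketch that runs parse_url over a handful of attribute values and records whether a host was found. The sample values are hypothetical, picked only to show the different shapes a link can take.

```php
<?php
// Hypothetical attribute values, chosen only to show how parse_url splits them
$samples = [
    'https://www.perpetual-beta.org/images/logo.png', // absolute URL
    '/images/logo.png',                               // root-relative URI
    '//cdn.example.com/logo.png',                     // protocol-relative URL
    'mailto:someone@example.com',                     // has a scheme, but no host
];

$has_host = [];
foreach ($samples as $sample) {
    // parse_url returns an array of components, or false for badly malformed input
    $parts = parse_url($sample);
    $has_host[$sample] = is_array($parts) && !empty($parts['host']);

    echo $sample, ' => ', ($has_host[$sample] ? 'has host' : 'no host'), PHP_EOL;
}
```

Note the last sample: a mailto: link has a scheme but no host, so a test on the host alone would treat it as relative. Whether that matters depends on what your feed content contains; if it does, checking for a scheme as well is one way to handle it.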
We return the whole HTML snippet with SmartDOMDocument’s saveHTMLExact method and that’s all there is to it. Way, way better than trying to do the job with a RegEx.
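If SmartDOMDocument is not available, something similar can be sketched with PHP’s native DOMDocument and DOMXPath classes alone. This is not the code I run on this site, just an illustration under a few assumptions: the function name absolutise_urls is hypothetical, every relative value is treated as relative to the site root, values that carry a scheme (such as mailto:) or that have no path (such as #top) are left untouched, and no attempt is made to deal with character encoding.

```php
<?php
// Hypothetical helper, for illustration only: not part of GetSimple or SmartDOMDocument
function absolutise_urls(string $html, string $site_url): string
{
    // loadHTML refuses an empty string, so bail out early
    if (trim($html) === '') {
        return '';
    }

    $base_url = rtrim($site_url, '/'); // Remove trailing slash

    $doc = new DOMDocument();

    // Collect parser warnings quietly instead of emitting them
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // One XPath query for both kinds of element
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//a[@href] | //img[@src]') as $node) {
        $attr  = ($node->nodeName === 'img') ? 'src' : 'href';
        $value = $node->getAttribute($attr);
        $parts = parse_url($value);

        // Skip unparseable values, anything with a scheme or a host,
        // and values without a path (fragment-only or query-only links)
        if ($parts === false
            || isset($parts['scheme'])
            || isset($parts['host'])
            || !isset($parts['path'])
            || $parts['path'] === '') {
            continue;
        }

        // Assumption: the value is relative to the site root
        $node->setAttribute($attr, $base_url . '/' . ltrim($value, '/'));
    }

    // loadHTML wraps the snippet in <html><body>; return only the body's children
    $out  = '';
    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            $out .= $doc->saveHTML($child);
        }
    }

    return $out;
}

// Usage, with a made-up snippet
echo absolutise_urls(
    '<p><a href="/about">About</a> <img src="/img/a.png" alt="a"></p>',
    'https://www.perpetual-beta.org/'
), PHP_EOL;
```

Passing the base URL in as a parameter, rather than reading a global, is simply a choice that makes the sketch easier to try out in isolation.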
NOTE: The XML file sample and DOM illustration above are courtesy of IBM.