Using PHP DOMDocument to Convert Relative URIs to Absolute URLs

In a previous article I mentioned that I had made some effort to adapt the GetSimple CMS to use relative URIs by default. This represents a complete reversal of GetSimple’s normal behaviour, which prefers fully qualified URLs.

I also noted that a consequence of using relative URIs was that it broke the links and images in my RSS feed items. My relative URIs became relative to their host domain (either the syndicating site, or “localhost” when used within a standalone RSS client) and thus they referenced resources that were unlikely to exist.

The obvious solution was to parse the RSS XML file and prepend the Perpetual βeta’s base URL to all relative image src and link href attributes.

A naive implementation of this would probably involve a RegEx which, with my Perl background, I should have no problem composing. But RegExes and HTML are not a good fit. Thus, whenever I consider using a RegEx for such a task, I remember this now infamous thread on Stack Overflow and I quickly turn to a proper XML parser.

I really can’t stress this enough: RegEx is not an appropriate tool for parsing an SGML-based language like HTML.

Parsing the DOM

PHP has a native XML parser, DOMDocument. It is a library of tools for creating, reading from and writing to a DOM. Internally, it represents a DOM as a tree of nodes. Each of those nodes is accessible through DOMDocument as are each node’s attributes.

An XML file looks something like this:

<books>
  <book>
    <author>Jack Herrington</author>
    <title>PHP Hacks</title>
    <publisher>O'Reilly</publisher>
  </book>
  <book>
    <author>Jack Herrington</author>
    <title>Podcasting Hacks</title>
    <publisher>O'Reilly</publisher>
  </book>
</books>

If we were to load that file into DOMDocument we’d end up with a model of that data that looks something like this:

[Figure: illustration of a DOM.]

With that model, I can query the individual nodes, including their attributes and data. So I can ask for (in pseudo-code) $dom->books->book[0]->title and the parser will return the string “PHP Hacks”. Or I could request the values of the src attributes of all the img tags and of the href attributes of all the a tags. Sweet!
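As a rough illustration of that kind of query, here is a minimal sketch using PHP’s bundled DOMDocument class on the sample XML above. The variable names ($first_title, $titles) are mine and purely illustrative; the real API has no $dom->books shorthand, so the sketch walks the tree with getElementsByTagName instead.

```php
<?php
// Sample data from the article, inlined so the sketch is self-contained.
$xml = <<<XML
<books>
  <book>
    <author>Jack Herrington</author>
    <title>PHP Hacks</title>
    <publisher>O'Reilly</publisher>
  </book>
  <book>
    <author>Jack Herrington</author>
    <title>Podcasting Hacks</title>
    <publisher>O'Reilly</publisher>
  </book>
</books>
XML;

$dom = new DOMDocument();
$dom->loadXML($xml);

// Rough equivalent of the pseudo-code $dom->books->book[0]->title
$books = $dom->getElementsByTagName('book');
$first_title = $books->item(0)
  ->getElementsByTagName('title')->item(0)->textContent;

// Collect every title in document order
$titles = [];
foreach ($dom->getElementsByTagName('title') as $title) {
  $titles[] = $title->textContent;
}

echo $first_title, "\n";
echo implode(', ', $titles), "\n";
```

Here textContent returns the text held inside a node, which is the closest match to “the string the parser returns” in the pseudo-code.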

The Code

So, without further ado, here is the (self-documenting) code I have used to convert my relative URIs to absolute URLs.

<?php
// Shown as a standalone function; add "public" when it lives inside a class
function syndicationURLs($content = NULL) {
  if ( is_null($content) ) {
    return FALSE;
  } else {
    // Pro Tip #1: Use the SmartDOMDocument extension
    // http://bit.ly/1d7duDg
    if ( class_exists('SmartDOMDocument') ) {
      global $SITEURL; // Set to https://www.perpetual-beta.org/ in my case
      $base_url = rtrim($SITEURL, '/'); // Remove trailing slash
      // A value is relative when it has a path but neither scheme nor host;
      // this leaves mailto:, data: and #fragment-only values untouched
      $is_relative = function ($url) {
        $url_parts = parse_url($url); // De-construct the UR(L|I)
        return is_array($url_parts)
          && empty($url_parts['scheme']) && empty($url_parts['host'])
          && isset($url_parts['path']) && ($url_parts['path'] !== '');
      };
      // Instantiate the object
      $doc = new SmartDOMDocument();
      // Build the DOM from the input (X)HTML snippet
      $doc->loadHTML($content);
      // Method #1: getElementsByTagName
      // Find the link tags
      $links = $doc->getElementsByTagName('a');
      foreach ($links as $link) {
        // Get the value of the href attribute
        $href = $link->getAttribute('href');
        // Is it a relative link (URI)?
        if ( $is_relative($href) ) {
          // It is, so prepend our base URL (paths are assumed root-relative)
          $link->setAttribute('href', $base_url . '/' . ltrim($href, '/'));
        }
      }
      // Method #2: XPath
      // Pro Tip #2: Use the SimpleXML extension
      $xml = simplexml_import_dom($doc);
      // Find the image tags
      $images = $xml->xpath('//img');
      foreach ($images as $img) {
        // Cast the SimpleXMLElement attribute to a plain string
        $src = (string) $img['src'];
        // Is it a relative link (URI)?
        if ( $is_relative($src) ) {
          // It is, so prepend our base URL
          $img['src'] = $base_url . '/' . ltrim($src, '/');
        }
      }
      // Return the processed (X)HTML
      return $doc->saveHTMLExact();
    } else {
      return FALSE;
    }
  }
}

The code demonstrates two different methods of accessing tags and attributes. The first, using getElementsByTagName, getAttribute and setAttribute, is the one you’ll see referenced most often in DOMDocument tutorials. The second uses XPath to locate the tags, while the SimpleXML extension allows us to get and set their attributes with PHP’s familiar array syntax.
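For readers without SmartDOMDocument installed, a third variation is possible with only the bundled DOM extension. The following is a sketch under stated assumptions, not the code this site runs: absolutise_fragment is a hypothetical helper name, https://example.com is a placeholder base URL, relative paths are assumed to be relative to the site root, and the input is assumed to be plain ASCII (character-encoding handling, one of the things SmartDOMDocument looks after, is left out).

```php
<?php
// Hypothetical helper, not from the article: make href/src attributes in an
// HTML fragment absolute using DOMDocument and DOMXPath only.
function absolutise_fragment(string $html, string $base_url): string {
  $base_url = rtrim($base_url, '/'); // Remove trailing slash

  $doc = new DOMDocument();
  // Collect parser warnings internally instead of emitting them
  $previous = libxml_use_internal_errors(true);
  // The wrapper div gives us a single node to serialise from afterwards
  $doc->loadHTML('<div>' . $html . '</div>');
  libxml_clear_errors();
  libxml_use_internal_errors($previous);

  $xpath = new DOMXPath($doc);
  // One query finds both links with an href and images with a src
  foreach ($xpath->query('//a[@href] | //img[@src]') as $el) {
    $name  = ($el->nodeName === 'a') ? 'href' : 'src';
    $value = $el->getAttribute($name);
    $parts = parse_url($value);
    // Relative: a path is present, but no scheme and no host
    if (is_array($parts) && empty($parts['scheme']) && empty($parts['host'])
        && isset($parts['path']) && $parts['path'] !== '') {
      $el->setAttribute($name, $base_url . '/' . ltrim($value, '/'));
    }
  }

  // Serialise only the children of the wrapper div, not the html/body
  // elements that loadHTML adds around the fragment
  $out = '';
  $wrapper = $doc->getElementsByTagName('div')->item(0);
  foreach ($wrapper->childNodes as $child) {
    $out .= $doc->saveHTML($child);
  }
  return $out;
}

$snippet = '<p><a href="/about">About</a> '
  . '<img src="img/logo.png" alt="Logo"> '
  . '<a href="mailto:me@example.com">Mail</a></p>';
$out = absolutise_fragment($snippet, 'https://example.com/');
echo $out, "\n";
```

Selecting the elements rather than the attribute nodes keeps the update on the familiar getAttribute and setAttribute path, at the cost of one small lookup to decide which attribute name applies.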

The final element (no pun intended) of our equation is PHP’s parse_url function. We use this to split the attribute value into its components. If those include a host, the value is already an absolute URL and we need take no further action. Otherwise, it is a relative reference, in which case we need to prepend our base URL (protocol, host and, if required, port).
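To make that concrete, here is a small sketch of what parse_url reports for a handful of values. The sample strings are illustrative only and example.com is a placeholder host. Note that mailto: links and fragment-only links have no host either, so a check based on the host alone would classify them as relative; looking at the scheme and path as well avoids that.

```php
<?php
// Illustrative values only; example.com is a placeholder host.
$samples = [
  'https://example.com/images/logo.png', // absolute URL
  '/images/logo.png',                    // root-relative reference
  '//example.com/images/logo.png',       // protocol-relative reference
  'mailto:someone@example.com',          // scheme but no host
  '#top',                                // fragment only
];

$results = [];
foreach ($samples as $sample) {
  $parts = parse_url($sample);
  $has_host   = isset($parts['host']) && $parts['host'] !== '';
  $has_scheme = isset($parts['scheme']);
  $results[$sample] = [$has_host, $has_scheme];
  printf("%-38s host: %-3s scheme: %s\n", $sample,
    $has_host ? 'yes' : 'no', $has_scheme ? 'yes' : 'no');
}
```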

We return the whole HTML snippet with SmartDOMDocument’s saveHTMLExact method and that’s all there is to it. Way, way better than trying to do the job with a RegEx.

NOTE: The XML file sample and DOM illustration above are courtesy of IBM.