Managing the Field Notes (Part III)

In the previous articles in this series, we looked at how I publish Field Notes and how I prevent duplicates. In this final instalment, we’re going to look at how the system checks the integrity of the links so that we can avoid the dreaded link rot.

The methodology is simple: the software creates a list of all the URLs in the Field Notes, then requests the headers of the resources those links reference. If the HTTP status code returned is anything other than 200, the link is probably broken and the system should flag it. The testing should also be able to follow redirects and handle SSL links.

I should also point out that I handle the testing from my local copy of the website and the test suite is only available from the local instance.

We create an array of the URLs in a given Field Note with the following code:

<?php
if ($_SERVER['SERVER_ADDR'] == '127.0.0.1') {
  // Get URLs in Field Note, if any
  $doc = new SmartDOMDocument();
  $doc->loadHTML($field_note['text']);
  $links = $doc->getElementsByTagName('a');
  $found_urls = array();
  foreach($links as $link){
    $href = $link->getAttribute('href');
    $found_urls[] = $href;
  }
  unset($doc);
}
?>

The code is largely self-explanatory: it collects the href of every link it finds in the $found_urls array.
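Note that the loop above keeps every href it finds, including fragment links, relative paths and mailto: addresses, none of which can be tested over HTTP. As a minimal sketch of how such entries could be screened out before testing, here is a JavaScript helper. The name isCheckableUrl and the sample hrefs are my own assumptions for illustration; they are not part of the original system.

```javascript
// Hypothetical helper: decide whether an href is worth sending to the checker.
function isCheckableUrl(href) {
  if (typeof href !== 'string' || href.trim() === '') return false;
  let parsed;
  try {
    // No base URL is supplied, so relative links and bare fragments throw here.
    parsed = new URL(href.trim());
  } catch (e) {
    return false;
  }
  // Only absolute http(s) links can be tested with an HTTP request.
  return parsed.protocol === 'http:' || parsed.protocol === 'https:';
}

// Sample hrefs (made up, apart from the bit.ly link shown in the article).
const hrefs = ['http://bit.ly/15nKLqT', '#top', 'mailto:me@example.com', '/about', ''];
const checkable = hrefs.filter(isCheckableUrl);
console.log(checkable); // logs [ 'http://bit.ly/15nKLqT' ]
```

In the system described here, check_url.php performs its own validation with FILTER_VALIDATE_URL, so a filter like this would only save a few pointless requests.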

We then add the list to the Field Note’s container element as follows:

<?php
echo '<div class="field_note"'. ( (isset($found_urls) && count($found_urls)) ? ' data-urls="' . htmlspecialchars(implode('|', $found_urls), ENT_QUOTES) . '"' : '') . '>' . $field_note['text'] . '</div>' . PHP_EOL;
?>

This results in an HTML snippet like this:

<div class="field_note" data-urls="http://bit.ly/15nKLqT|http://bit.ly/15nKSm3|http://www.bytecellar.com/?p=1434">

We can see that the Field Note’s container element now has a data-urls attribute that contains a list of URLs separated by a pipe character |.
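To illustrate how a pipe-separated attribute value like this can be turned back into a list on the client, here is a small sketch in plain JavaScript. The helper name parseUrlList is an assumption of mine, and the de-duplication step goes beyond what the original code does. One caveat of the format itself: a URL that contains a literal | would be split in two.

```javascript
// Hypothetical helper: turn the raw data-urls attribute value into a clean list.
function parseUrlList(attributeValue) {
  if (typeof attributeValue !== 'string') return [];
  const seen = new Set();
  return attributeValue
    .split('|')
    .map((url) => url.trim())
    .filter((url) => {
      // Drop empty segments and URLs we have already seen.
      if (url === '' || seen.has(url)) return false;
      seen.add(url);
      return true;
    });
}

const raw = 'http://bit.ly/15nKLqT|http://bit.ly/15nKSm3||http://bit.ly/15nKLqT';
console.log(parseUrlList(raw));
// logs [ 'http://bit.ly/15nKLqT', 'http://bit.ly/15nKSm3' ]
```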

Control then passes to the following jQuery snippet:

if ( $('.asides_container').length && $('.field_note').length && $('#testURLs').length ) {
  $('#testURLs').click(function() {
    $('.field_note').each(function() {
      var subject = $(this);
      if (subject.data('urls') !== undefined) {
        var urls = $(this).data('urls').split('|');
        $.each(urls, function(index, value) {
          // Give each request its own spinner so that one response does not clear the others
          var spinner = $('<i class="icon-refresh icon-spin"></i>').prependTo(subject);
          $.get('/check_url.php', { url: value }, function(data) {
            spinner.remove();
            if (data != '200') subject.prepend('<i class="icon-warning-sign" style="color: #990000; font-size: 1.2em;"></i>');
          });
        });
      }
    });
  });
}

When launched, the jQuery calls the PHP script that actually performs the check, via Ajax. It also displays busy icons for the links it is currently checking and warning icons for any that have failed the test. I’m using the fabulous Font Awesome for the icons. The cool thing about using Ajax to initiate the tests is that the requests are asynchronous, so the browser issues several of them concurrently. That is, it tests a batch of URLs simultaneously. This is a real time-saver when there are lots of links to test, as there are with the Field Notes.
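The concurrency described above can be sketched without jQuery. The following is only an illustration under my own assumptions: checkAll, its limit parameter and the fakeCheck stand-in are hypothetical names, and the explicit cap on in-flight requests is something the original snippet leaves to the browser.

```javascript
// Run checkFn over every URL with at most `limit` checks in flight at once.
// checkFn stands in for the Ajax call; it must return a Promise of a status string.
async function checkAll(urls, checkFn, limit) {
  const results = new Array(urls.length);
  let next = 0;

  async function worker() {
    while (next < urls.length) {
      const index = next++; // claim the next URL before awaiting
      try {
        results[index] = { url: urls[index], status: await checkFn(urls[index]) };
      } catch (err) {
        // A rejected check is recorded rather than aborting the whole run.
        results[index] = { url: urls[index], status: 'error' };
      }
    }
  }

  const workers = [];
  for (let i = 0; i < Math.min(limit, urls.length); i++) workers.push(worker());
  await Promise.all(workers);
  return results;
}

// Stand-in checker: pretends that any URL containing "dead" returns a 404.
const fakeCheck = (url) =>
  new Promise((resolve) => setTimeout(() => resolve(url.includes('dead') ? '404' : '200'), 10));

checkAll(['http://example.com/a', 'http://example.com/dead', 'http://example.com/b'], fakeCheck, 2)
  .then((results) => {
    const broken = results.filter((r) => r.status !== '200').map((r) => r.url);
    console.log(broken); // logs [ 'http://example.com/dead' ]
  });
```

Results come back in the same order as the input list, whichever request happens to finish first, which makes it easy to match each verdict to its link.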

Now we come to the code that actually tests the URLs (i.e. the file check_url.php):

<?php
define('TIMEOUT', 10);
if ( isset($_GET['url']) && filter_var($_GET['url'], FILTER_VALIDATE_URL) ) {
  $url = $_GET['url'];
  $ch = curl_init();
  // set cURL options
  $opts = array(
    // do not output to browser
    CURLOPT_RETURNTRANSFER => TRUE,
    // set URL
    CURLOPT_URL => $url,
    // do a HEAD request only
    CURLOPT_NOBODY => TRUE,
    // follow location headers
    CURLOPT_FOLLOWLOCATION => TRUE,
    // automatically set referrer
    CURLOPT_AUTOREFERER => TRUE,
    // set timeout
    CURLOPT_TIMEOUT => TIMEOUT,
    // Handle SSL URLs too! (this skips certificate verification, so keep it to local tooling)
    CURLOPT_SSL_VERIFYPEER => FALSE,
    CURLOPT_SSL_VERIFYHOST => FALSE,
  );
  curl_setopt_array($ch, $opts);
  // do it!
  curl_exec($ch);
  // find HTTP status
  $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
  // close handle
  curl_close($ch);
  echo $status;
} else {
  echo FALSE;
}
?>

The code above takes a URL and uses cURL to test it. The cURL request follows redirects to get to the final resource and can also negotiate SSL links. The script simply returns the status code of the resource it is testing. Remember, it’s a status of 200 that we’re testing for; anything else is a fail. That includes a status of 0, which cURL reports when it gets no HTTP response at all (for example, on a timeout).
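On the client side, the text that check_url.php echoes back could be interpreted a little more finely than a plain pass or fail. The sketch below is an assumption on my part rather than part of the original system: classifyStatus and its three labels are hypothetical. It relies on two behaviours of the script above: echo FALSE prints an empty string, and cURL reports a status of 0 when it receives no HTTP response.

```javascript
// Hypothetical helper: interpret the text body that check_url.php echoes back.
function classifyStatus(body) {
  const text = String(body == null ? '' : body).trim();
  if (text === '200') return 'ok';
  // An empty body means the URL failed validation (PHP's `echo FALSE` prints nothing);
  // '0' means cURL got no HTTP response at all, e.g. on a timeout or DNS failure.
  if (text === '' || text === '0') return 'unreachable';
  // Any other status code (404, 500, ...) is treated as a broken link.
  return 'broken';
}

console.log(classifyStatus('200')); // logs ok
console.log(classifyStatus('404')); // logs broken
console.log(classifyStatus(''));    // logs unreachable
```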

That’s all there is to it. This simple handler tests 100+ URLs every 30 seconds. It’s perfect for helping me to maintain the integrity of the Field Notes. I hope you find it useful too.

Updated: 17th February, 2014.

I have since removed the Field Notes section from the website. However, this series of articles is still valid in that it describes a process that might be useful to others.