In the previous articles in this series, we looked at how I publish Field Notes and how I prevent duplicates. In this final instalment, we’re going to look at how the system checks the integrity of the links so that we can avoid the dreaded link rot.
The methodology is simple: the software creates a list of all the URLs in the Field Notes, then tries to get the headers of the resources those links reference. If the HTTP status code is anything other than 200, then the link is probably broken and the system should advise accordingly. Additionally, the testing should be able to follow redirects and it should also be able to handle SSL links.
I should also point out that I handle the testing from my local copy of the website and the test suite is only available from the local instance.
We create an array of the URLs in a given Field Note with the following code:
<?php
if ($_SERVER['SERVER_ADDR'] == '127.0.0.1') {
    // Get URLs in Field Note, if any
    $doc = new SmartDOMDocument();
    $doc->loadHTML($field_note['text']);
    $links = $doc->getElementsByTagName('a');
    $found_urls = array();
    foreach ($links as $link) {
        $href = $link->getAttribute('href');
        // skip anchors that have no href
        if ($href !== '') {
            $found_urls[] = $href;
        }
    }
    unset($doc);
}
?>
The code is largely self-explanatory. On the local instance only, it parses the Field Note’s HTML, walks every a element and collects the href values in the $found_urls array.
We then add the list to the Field Note’s container element as follows:
<?php
echo '<div class="field_note"'. ( (isset($found_urls) && count($found_urls)) ? ' data-urls="' . htmlspecialchars(implode('|', $found_urls), ENT_QUOTES) . '"' : '') . '>' . $field_note['text'] . '</div>' . PHP_EOL;
?>
Which results in an HTML snippet like this:
<div class="field_note" data-urls="http://bit.ly/15nKLqT|http://bit.ly/15nKSm3|http://www.bytecellar.com/?p=1434">
We can see that the Field Note’s container element now has a data-urls attribute that contains a list of URLs separated by a pipe character (|).
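To make the format concrete, here is a minimal sketch of how such an attribute value can be turned back into a list of URLs in plain JavaScript. The helper name parseUrlList is hypothetical and not part of the site's code; it also drops empty entries, which is an assumption about how stray separators should be treated.

```javascript
// Hypothetical helper: turn the pipe-separated data-urls value into an array.
function parseUrlList(attributeValue) {
  // A Field Note without links has no data-urls attribute at all.
  if (typeof attributeValue !== 'string') {
    return [];
  }
  return attributeValue
    .split('|')
    .map(function (url) { return url.trim(); })
    .filter(function (url) { return url.length > 0; });
}

var urls = parseUrlList('http://bit.ly/15nKLqT|http://bit.ly/15nKSm3|http://www.bytecellar.com/?p=1434');
console.log(urls.length); // 3
```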
Control then passes to the following jQuery snippet:
if ( $('.asides_container').length && $('.field_note').length && $('#testURLs').length ) {
    $('#testURLs').click(function() {
        $('.field_note').each(function() {
            var subject = $(this);
            if (subject.data('urls') !== undefined) {
                var urls = subject.data('urls').split('|');
                $.each(urls, function(index, value) {
                    // one busy icon per URL, removed when that URL's check completes
                    var spinner = $('<i class="icon-refresh icon-spin"></i>').prependTo(subject);
                    var warning = '<i class="icon-warning-sign" style="color: #990000; font-size: 1.2em;"></i>';
                    $.get('/check_url.php', { url: value }, function(data) {
                        if (data != '200') subject.prepend(warning);
                    }).fail(function() {
                        // the check itself could not be run, so flag the link
                        subject.prepend(warning);
                    }).always(function() {
                        spinner.remove();
                    });
                });
            }
        });
    });
}
The jQuery, when triggered, calls the PHP script that actually performs the check, via Ajax. It also displays busy icons for the links it is currently checking and warning icons for any that have failed the test. I’m using the fabulous Font Awesome for the icons. The cool thing about using Ajax to initiate the tests is that the requests are asynchronous, so the browser sends several of them concurrently. That is, it will test a batch of URLs simultaneously. This is a real time-saver when there are lots of links to test, as there are with the Field Notes.
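The time saving from concurrent requests can be illustrated with a small sketch that uses plain promises rather than jQuery. The names checkAll and fakeCheck are hypothetical, and fakeCheck is only a stub standing in for the Ajax call to /check_url.php; it is not how the site performs the check.

```javascript
// Hypothetical stand-in for the Ajax call to /check_url.php:
// resolves with a status string after a short delay.
function fakeCheck(url) {
  return new Promise(function (resolve) {
    var status = url.indexOf('broken') === -1 ? '200' : '404';
    setTimeout(function () { resolve(status); }, 50);
  });
}

// Start every check at once, then wait for all of them to finish.
// Returns a promise for the list of URLs whose status was not '200'.
function checkAll(urls, check) {
  var pending = urls.map(function (url) {
    return check(url).then(function (status) {
      return { url: url, status: status };
    });
  });
  return Promise.all(pending).then(function (results) {
    return results
      .filter(function (r) { return r.status !== '200'; })
      .map(function (r) { return r.url; });
  });
}
```

Because all the checks are started before any of them is awaited, the batch should take roughly as long as its slowest check rather than the sum of all of them.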
Now we come to the code that actually tests the URLs (i.e. the file check_url.php):
<?php
define('TIMEOUT', 10);
if ( isset($_GET['url']) && filter_var($_GET['url'], FILTER_VALIDATE_URL) ) {
    $url = $_GET['url'];
    $ch = curl_init();
    // set cURL options
    $opts = array(
        // do not output to browser
        CURLOPT_RETURNTRANSFER => TRUE,
        // set URL
        CURLOPT_URL => $url,
        // do a HEAD request only
        CURLOPT_NOBODY => TRUE,
        // follow location headers
        CURLOPT_FOLLOWLOCATION => TRUE,
        // automatically set referrer
        CURLOPT_AUTOREFERER => TRUE,
        // set timeout (in seconds)
        CURLOPT_TIMEOUT => TIMEOUT,
        // Handle SSL URLs too! Certificate checks are skipped so that
        // links with self-signed or expired certificates are still tested
        CURLOPT_SSL_VERIFYPEER => FALSE,
        CURLOPT_SSL_VERIFYHOST => FALSE,
    );
    curl_setopt_array($ch, $opts);
    // do it!
    curl_exec($ch);
    // find HTTP status of the final response (0 if there was no response)
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    // close handle
    curl_close($ch);
    echo $status;
} else {
    // missing or invalid URL: prints an empty string, which the caller treats as a fail
    echo FALSE;
}
?>
The code above takes a URL and uses cURL to test it. The cURL request will follow all redirects to get to the final resource and can also handle SSL links. The script simply returns the status code of the resource it is testing (remember, it’s a status of 200 that we’re testing for; anything else is a fail).
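As a small illustration of that pass/fail rule, here is a sketch of how the text returned by check_url.php might be interpreted on the client. The helper name isLinkOk is hypothetical, and trimming whitespace is an assumption made in case the response carries a trailing newline.

```javascript
// Hypothetical helper: interpret the text returned by check_url.php.
// Only the exact status '200' counts as a pass. An empty response
// (missing or invalid URL), '0' (no response, e.g. a timeout) and
// every other status code count as a fail.
function isLinkOk(responseText) {
  return String(responseText).trim() === '200';
}

console.log(isLinkOk('200')); // true
console.log(isLinkOk('404')); // false
console.log(isLinkOk(''));    // false
```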
That’s all there is to it. This simple handler tests 100+ URLs every 30 seconds. It’s perfect for helping me to maintain the integrity of the Field Notes. I hope you find it useful too.
Updated: 17th February, 2014.
I have since removed the Field Notes section from the website. However, this series of articles is still valid in that they describe a process that might be useful to another party.