In the previous article in this series I described how I publish Field Notes on this website using basic email. Now I’ll describe the process by which I ensure, as far as possible, that I don’t publish a link that I have already posted.
The workflow looks like this:
- Extract all links from the incoming Field Note;
- Check each new link against those already published;
- If the link is in the list, inform the operator (me!);
- If the link isn’t in the list, then add it to the list and proceed to the next link;
- If all new links are unique, publish the Field Note.
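The decision at the heart of that workflow can be sketched in a few lines. This is only an illustration, assuming the links have already been extracted into an array: the `partition_links` helper, the example URLs and the filename are all hypothetical, not the code that ran on this site.

```php
<?php
// Hypothetical helper: split a Field Note's links into duplicates and new ones.
// $urls_seen maps each published URL to the filename of the Field Note it is on.
function partition_links(array $hrefs, array $urls_seen): array
{
    $duplicates = array();
    $new_urls   = array();
    foreach ($hrefs as $href) {
        if (array_key_exists($href, $urls_seen)) {
            // Already published: remember which Field Note it appeared in
            $duplicates[$href] = $urls_seen[$href];
        } else {
            $new_urls[] = $href;
        }
    }
    return array($duplicates, $new_urls);
}

// Illustrative data only
$urls_seen = array('http://example.com/a' => 'note-001.txt');
$hrefs     = array('http://example.com/a', 'http://example.com/b');

list($duplicates, $new_urls) = partition_links($hrefs, $urls_seen);

if (count($duplicates) === 0) {
    echo 'All links are unique; safe to publish.' . PHP_EOL;
} else {
    foreach ($duplicates as $href => $filename) {
        echo 'Already posted: ' . $href . ' (in ' . $filename . ')' . PHP_EOL;
    }
}
```

With the sample data above, the first link is reported as already posted in `note-001.txt`, so the Field Note would be held back for the operator to review.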
The inbound Field Note is plain text (it starts life as an email, remember), so the first step is to convert every URL in that text into an HTML anchor element. I use Søren Løvborg’s UrlLinker to perform the conversion.
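To give a feel for what that conversion step produces, here is a deliberately naive stand-in. It is not UrlLinker's algorithm or API: the `naive_link_urls` function is hypothetical, it only recognises `http://` and `https://` URLs, and it will happily swallow trailing punctuation, which is exactly the kind of edge case a dedicated library handles properly.

```php
<?php
// Naive stand-in for a URL linker. NOT UrlLinker's algorithm or API.
function naive_link_urls(string $text): string
{
    // Escape the plain text first so it is safe to embed in HTML
    $escaped = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

    // Wrap anything that looks like an http(s) URL in an anchor element
    return preg_replace(
        '~https?://[^\s<]+~i',
        '<a href="$0">$0</a>',
        $escaped
    );
}

echo naive_link_urls('Worth reading: http://example.com/article today') . PHP_EOL;
// Prints: Worth reading: <a href="http://example.com/article">http://example.com/article</a> today
```

The output is an HTML snippet in which every URL sits inside an `a` element, which is what makes the later DOM-based extraction possible.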
We then take the resulting HTML snippet and parse it with PHP’s DOMDocument to extract the links. Each link is then tested against the existing URL corpus, a serialised array that maps each URL to the filename of the Field Note it appeared in:
<?php
// Link Extraction and Testing
// $url_corpus (path to the corpus file) and $field_note (the parsed email)
// are assumed to have been set earlier in the script.

// Load the URL corpus: an array of URL => Field Note filename
$urls_seen = unserialize(file_get_contents($url_corpus));
$new_urls = array();

$doc = new SmartDOMDocument();
// Load the body of the Field Note email into DOMDocument
$doc->loadHTML($field_note['text']);
// Find the links
$links = $doc->getElementsByTagName('a');

foreach ($links as $link) {
    // Get the URL of the link
    $href = $link->getAttribute('href');
    // Test it
    if (array_key_exists($href, $urls_seen)) {
        // Posted this link already, tell me the URL of the Field Note it's on
        echo 'URL: ' . $href . ' appears in file: ' . $urls_seen[$href] . '. Skipping...' . PHP_EOL;
        // and test the next URL
        continue;
    } else {
        // URL is unique. Queue it for addition to the corpus.
        $new_urls[$href] = $field_note['filename'];
    }
}

// Release some memory
unset($doc);

// Merge the "seen" and "new" URL arrays
$urls_seen = array_merge($urls_seen, $new_urls);

// Update the disk file
file_put_contents($url_corpus, serialize($urls_seen));
?>
Easy, isn’t it?
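If you want to experiment with the extraction step on its own, the following self-contained sketch may help. It makes some assumptions of its own: it uses PHP's stock `DOMDocument` rather than SmartDOMDocument, the `extract_hrefs` and `load_corpus` helpers are hypothetical names, and the corpus lives in a temporary file created purely for the demonstration. It also starts with an empty corpus when the file is missing or unreadable, which the listing above does not do.

```php
<?php
// Hypothetical helper: return every href found in an HTML fragment.
function extract_hrefs(string $html): array
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of printing them
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = array();
    foreach ($doc->getElementsByTagName('a') as $link) {
        $hrefs[] = $link->getAttribute('href');
    }
    return $hrefs;
}

// Hypothetical helper: load the corpus, or start empty if it cannot be read.
function load_corpus(string $path): array
{
    if (!is_file($path)) {
        return array();
    }
    $data = @unserialize(file_get_contents($path));
    return is_array($data) ? $data : array();
}

// Illustrative input only
$html  = '<p>See <a href="http://example.com/a">this</a> and '
       . '<a href="http://example.com/b">that</a>.</p>';
$hrefs = extract_hrefs($html);

// Build a throwaway corpus file containing one previously published URL
$corpus_file = tempnam(sys_get_temp_dir(), 'urls');
file_put_contents($corpus_file, serialize(array('http://example.com/a' => 'note-001.txt')));

$urls_seen = load_corpus($corpus_file);
foreach ($hrefs as $href) {
    $status = array_key_exists($href, $urls_seen)
        ? ' seen in ' . $urls_seen[$href]
        : ' is new';
    echo $href . $status . PHP_EOL;
}

unlink($corpus_file);
```

Keying the corpus by URL is what keeps the lookup cheap: `array_key_exists` is a hash lookup, so the check does not slow down noticeably as the list of published links grows.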
In the next (and final) article in this series, we’ll look at how I’ve used PHP and jQuery to create a fast, multi-threaded link checker that quickly scans all my Field Notes and highlights any that have broken links. Stay tuned!
Updated: 17th February, 2014.
I have since removed the Field Notes section from the website. However, the articles in this series remain valid in that they describe a process that might be useful to others.