Finding mismatched URLs in the migration to Squarespace

I recently migrated from WordPress to Squarespace, and in a summary post yesterday about the experience, I wrote:

Most of the WordPress URLs are the same in Squarespace, but I found one that was not, which turned out to be caused by a timezone issue. The old article was posted at 1:30am one day, and Squarespace imported it as being posted at 6:30pm the day before, which changed the post's URL. This obviously breaks both incoming links and Disqus comments. To solve this problem, I need to figure out which URLs on the old site no longer match URLs on the new site.
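To make the failure mode concrete, here's a sketch using GNU date. The specific date and timezones are my illustration, not something I've confirmed about either platform's internals; what's certain is the seven-hour offset between 1:30am and 6:30pm the previous day:

```shell
# Hypothetical: a post stamped 01:30 on June 15 whose timestamp is read as UTC,
# then rendered in US Pacific time (UTC-7 in summer). It lands on June 14.
TZ=America/Los_Angeles date -d '2010-06-15 01:30 UTC' '+%Y/%m/%d %H:%M'
# If the blog's URLs embed the publication date, the path changes with it:
#   /journal/2010/6/15/slug  becomes  /journal/2010/6/14/slug
```

The `-d` flag is specific to GNU date; BSD/macOS date spells this differently.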

I wrote a Perl program today to crawl all of the journal-entry URLs in Squarespace and see if they exist on my old server. If they don't, I will need to change the Squarespace post URLs to match.

First, I created an archive page on Squarespace that lists all of the posts in a particular blog, which generated a gigantic page with over 2000 post titles and URLs. I grabbed the source code, dumped it into a file, and used sed to extract all of the URLs that matched my journal, writing that into a text file.

sed -ne 's/.*\(http[^"]*\).*/\1/p' < source.html | grep "journal/" > ssurls.txt
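For a single hypothetical line of archive-page source (the real page has over 2,000 entries, and the slug format here is my guess), the extraction looks like this:

```shell
# One hypothetical anchor tag from the archive page source
echo '<a href="http://www.echeng.com/journal/2010/6/14/a-post.html">A post</a>' > source.html

# Capture from "http" up to the closing quote, then keep journal URLs only
sed -ne 's/.*\(http[^"]*\).*/\1/p' < source.html | grep "journal/"
```

One caveat: the leading `.*` is greedy, so if a line ever contains more than one URL, only the last one on that line survives.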

Then, I wrote a Perl program to convert each URL to its equivalent on the old server, try to fetch that content, and write successful and failed URLs to two different files. I'm no Perl expert (I wrote this by looking up literally everything on the web), but it seems to work.

#!/usr/local/bin/perl

use strict;
use warnings;
use LWP::Simple;
use 5.010;
 
my $sourcefilename = 'ssurls.txt'; # a textfile with a bunch of URLs in it
my $matchedoutputfilename = 'out_matched.txt'; # output matched URLs here
my $nomatchoutputfilename = 'out_nomatch.txt'; # output no match URLs here

open(my $fh, '<:encoding(UTF-8)', $sourcefilename)
  or die "Could not open file '$sourcefilename' $!";

open(my $fhMatch, '>', $matchedoutputfilename)
  or die "Could not open file '$matchedoutputfilename' $!";

open(my $fhNoMatch, '>', $nomatchoutputfilename)
  or die "Could not open file '$nomatchoutputfilename' $!";
 
while (my $url = <$fh>) {
    chomp $url;
    my $oldurl = $url;
    
    # Map the URL to the old server (change these hostnames if you reuse this code).
    # The dot is escaped so it matches literally rather than any character.
    $oldurl =~ s/www\.echeng/old.echeng/ig;
    
    # Get the old web content
    my $content = get($oldurl);
        
    if (defined $content) {
        print "OK: $oldurl\n";
        say $fhMatch $oldurl;
    } else {
        print "Can't be found: $oldurl\n";
        say $fhNoMatch $oldurl;
    }
}

close $fh;
close $fhMatch;
close $fhNoMatch;
Program output. It's working!
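For anyone who'd rather skip Perl entirely, the same check can be sketched in plain shell with curl. This is a rough equivalent under my assumptions, not a tested replacement; the sample URL is a placeholder, and the `${var/pat/repl}` substitution is bash-specific:

```shell
# Hypothetical stand-in for ssurls.txt (normally produced by the sed step above)
printf 'http://www.echeng.com/journal/2010/6/14/a-post.html\n' > ssurls.txt

while read -r url; do
    # bash equivalent of the Perl s/www\.echeng/old.echeng/ substitution
    oldurl="${url/www.echeng/old.echeng}"
    # -f exits nonzero on HTTP errors like 404; -s silences progress; -m caps wait time
    if curl -sf -m 10 -o /dev/null "$oldurl"; then
        echo "$oldurl" >> out_matched.txt
    else
        echo "$oldurl" >> out_nomatch.txt
    fi
done < ssurls.txt
```

Each URL ends up in exactly one of the two output files, same as the Perl version.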

I hate brute-force approaches to migrating content, but there are tradeoffs when moving to less-open platforms (and I'm OK with that).