PapaScott I like big blogs and I cannot lie! 🐘

Creating an MT Import File from HTML

Nico recently trashed his Movable Type database. He had a backup, but it was old (from October), so he was missing about 200 entries. Knowing that I had trashed my database once or twice, he asked me for advice.

200 entries are a bit too many to copy and paste, so I came up with a little Perl script to create an MT import file from the HTML files of his individual entries. It's customized for Nico's template, but it might be a useful starting point for others stuck in the same situation.

Nico's template was something like this:
<div><h3>Title</h3>
Post
<a name="more">
More post
</div>

So I was able to look for the title between the h3 tags, the body of the post between the title and the 'more' anchor, the extended entry between the more anchor and the closing div tag, and so on. The category wasn't included, so I had to skip it. You'll need to adjust this logic to fit your template.
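
To make that concrete, here is a minimal sketch of the same idea run against the simplified template above. The sub name parse_entry and the sample markup are made up for illustration; Nico's real pages had more attributes and more fields, so treat this as a starting point rather than the actual script.

```perl
use strict;
use warnings;

# Hypothetical helper: pull title, body and extended entry out of one
# page that follows the simplified template shown above.
sub parse_entry {
    my ($html) = @_;
    # title: whatever sits between the h3 tags
    my ($title) = $html =~ m|<h3>(.+?)</h3>|s;
    # body: from the end of the title up to the 'more' anchor
    my ($body) = $html =~ m|</h3>\s*(.+?)\s*<a name="more">|s;
    # extended entry: from the 'more' anchor up to the first closing div
    my ($more) = $html =~ m|<a name="more">\s*(.+?)\s*</div>|s;
    return { title => $title, body => $body, more => $more };
}

# Sample page: the simplified template, verbatim.
my $sample = <<'HTML';
<div><h3>Title</h3>
Post
<a name="more">
More post
</div>
HTML

my $entry = parse_entry($sample);
print "$entry->{title} | $entry->{body} | $entry->{more}\n";
# prints "Title | Post | More post"
```

The captures here are non-greedy (.+?), so the extended entry stops at the first closing div. If your entries contain nested divs, a greedy (.+) that runs to the last closing div may fit better.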

The Perl is pretty basic. I'm using 'slurp mode' to read the entire HTML file into a variable, instead of reading the file line by line. I'm also in the habit of using | to delimit my regexes, so they look like m|...| instead of the typical /.../. And when you evaluate a match in list context, you get a list of the captures $1, $2, etc. So instead of '$content =~ m|...|; $author = $1;', I can put this on a single line as '($author) = ($content =~ m|...|);'.
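
Here is a small sketch of those two idioms in isolation. The slurp helper and the sample string are invented for the example, and an in-memory file handle stands in for a real HTML file, so nothing here depends on Nico's pages.

```perl
use strict;
use warnings;

# Hypothetical helper: read a whole file handle in one go ("slurp mode").
sub slurp {
    my ($fh) = @_;
    local $/;               # record separator is undefined inside this sub only
    return scalar <$fh>;    # a single read now returns everything up to EOF
}

# Stand-in for an HTML file: an in-memory file handle over a string.
my $html = qq{<div class="posted">\nNico /\n</div>\n};
open my $fh, '<', \$html or die "cannot open in-memory file: $!";
my $content = slurp($fh);
close $fh;

# A match in list context returns the captured groups, so the capture
# can be assigned directly instead of copying $1 afterwards.
my ($author) = $content =~ m|<div class="posted">\s+(.+?)\s+/|s;
print "author: $author\n";
# prints "author: Nico"
```

Because $/ is localized inside the sub, the rest of the program keeps reading line by line as usual.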

#!/usr/bin/perl
# parse.pl - parse HTML files to import into Movable Type
# usage: parse.pl *html > output.txt
# Note: you _will_ need to adjust the regex and date conversion
use strict;
use warnings;
# slurp mode: with $/ undefined, each read of <> returns one entire
# file, line feeds and all
local $/;
while (my $content = <>) { # for each file on the command line
    # locate the fields we need using regex
    # some matches may include newlines
    my ($author) = ($content =~ m|<div class="posted">\s+(.+?)\s+/|s);
    my ($title) = ($content =~ m|<h3 class="title">(.+)</h3>|);
    my ($text) = ($content =~ m|</h3>\s+(.+)\s+<a name="more">|s);
    my ($more) = ($content =~ m|<a name="more">\s+(.+)\s+</div>|s);
    my ($date) = ($content =~ m|title="updated: (.+)" name="updated"|);
    # convert the date to MM/DD/YYYY hh:mm:ss
    $date =~ s|([-\d])02|${1}2002|; # expand the two-digit year 02 to 2002
    $date =~ tr|-,|/|d; # turn dashes into slashes, delete commas
    # print out the fields in the proper format
    print "AUTHOR: $author\n";
    print "TITLE: $title\n";
    print "DATE: ${date}:00\n";
    print "-----\n";
    print "BODY:\n$text\n";
    print "-----\n";
    print "EXTENDED BODY:\n$more\n";
    print "-----\n";
    print "--------\n"; # eight dashes mark the end of an entry
}
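
The date conversion is the part most likely to need rework, since a blind substitution on "02" can just as easily hit a day or month. Here is a stricter sketch. It assumes the timestamp looks like "MM-DD-YY, hh:mm" and that every two-digit year belongs to the 2000s; both are guesses for the sake of the example, and mt_date is a made-up name, so adjust the pattern to whatever your template actually prints.

```perl
use strict;
use warnings;

# Hypothetical helper: turn "MM-DD-YY, hh:mm" into the
# "MM/DD/YYYY hh:mm:ss" form used in the import file.
# The input layout is an assumption, not Nico's real format.
sub mt_date {
    my ($raw) = @_;
    my ($mon, $day, $yy, $hh, $mm) =
        $raw =~ m|\A(\d{1,2})-(\d{1,2})-(\d{2}),?\s+(\d{1,2}):(\d{2})\z|
        or return;    # no match: return nothing and let the caller decide
    # assume every two-digit year belongs to the 2000s; seconds are set to 00
    return sprintf '%02d/%02d/20%02d %02d:%02d:00', $mon, $day, $yy, $hh, $mm;
}

print mt_date('12-31-02, 14:30'), "\n";
# prints "12/31/2002 14:30:00"
```

Matching the whole string with named pieces means an unexpected timestamp returns nothing instead of quietly producing a wrong date in the import file.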
