E2 node backup

created by sleeping wolf
(thing) by sleeping wolf (2.5 wk) 3 C!s Mon Feb 19 2001 at 22:27:41

10/17/2003 Removed a semicolon that had crept into this copy, fixed the useragent string, and added the version to the useragent string.
12/3/2002 Revised to use displaytype=xmltrue, and now allows login again.
7/23/2001 Ok, I fixed the login bit. As long as you DON'T use the login feature, backup will work fine. This also allows you to back up another user's writeups. Furthermore, I've added a delay (I'd love to use LWP::RobotUA, but given that all robots are disallowed, it's suicide).
7/17/2001 Apparently, the login is broken again, but so is XML display for one's own writeups. I'll try to figure something out.
3/16/2001 Updated to work with the new login.

Ever want to have the text of your writeups handy? Want to revise a bunch of similar nodes? Worried since you're not big enough to get into Node Heaven yet? Worry no longer! This Perl program will back up your writeups to an autonodeable single file containing all writeups, or to individual files, each of which contains exactly what you entered. It can also dump in XML format.

Like Cow of Doom's E2 node tracker (which, coincidentally, it was obviously based upon -- gotta love the GPL), it requires the LWP and HTTP::Cookies modules from libwww-perl, which you can get very easily from CPAN. It also requires Getopt::Long and File::Spec::Functions (both of which should be part of the standard distribution) and the CGI module (for one function that undoes the entity escaping in the XML displaytype).

Basic usage is simple -- run it followed by your username (quoted if it has spaces in it), and it will save all your writeups to individually-named files. A password is only needed with the --login option. There are a few options to modify its behavior:
usage: $0 [--spaced-names | --nospaced-names ] [--dump-to=dirname] [--xml] [--regexp=expr] [--single=name] [--login=username,pass] <username>
<username> should be quoted if it has spaces in it (single quotes for UNIX, double quotes for DOS/Windows).
--spaced-names and --nospaced-names control whether or not spaces will appear in the filenames. The default is --nospaced-names.
--dump-to will place the files in dirname.
--regexp will only dump nodes fitting the regular expression expr. For those not familiar with regular expressions, one way to use it is to wrap a case-sensitive substring for titles you want in quotation marks.
--single will output everything to the named filename, rather than separate filenames. The format is compatible with kaatanut's autonoder.
--login will log you in (the username and password might need quoting). This currently breaks things when backing up your own nodes due to a bug in E2's XML display.
--xml will output in raw XML format, based on E2's exact XML output, which escapes characters.
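
The --spaced-names / --nospaced-names pair comes from Getopt::Long's negatable option syntax (a trailing "!" in the option spec). Here is a minimal sketch of how a few of these options parse, kept separate from the script itself. The helper name parse_args and the %opt keys are made up for this illustration; the real script calls GetOptions directly on @ARGV.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# parse_args is a hypothetical helper for illustration only.
sub parse_args {
    my @args = @_;
    my %opt = (spaced => 0, dump_to => '.', xml => 0);
    GetOptionsFromArray(
        \@args,
        'spaced-names!' => \$opt{spaced},   # "!" also accepts --nospaced-names
        'dump-to=s'     => \$opt{dump_to},  # "=s" requires a string value
        'xml'           => \$opt{xml},      # plain boolean flag
    ) or die "bad options\n";
    $opt{username} = shift @args;           # what is left over is the username
    return \%opt;
}

my $opt = parse_args('--spaced-names', '--dump-to=backup', 'sleeping wolf');
print "spaces in filenames: $opt->{spaced}\n";
print "directory: $opt->{dump_to}, user: $opt->{username}\n";
```

Because the option is negatable, leaving it off and passing --nospaced-names behave the same way here: the flag stays 0.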

I suggest you cut and paste the following to a file, presumably e2back.pl. My preferred method is to view source and run the code through the E2 Source code formatter in deformat mode.

#!/usr/bin/perl -w
# e2back.pl - gathers user nodes from everything2.com.
# Portions Copyright (C) 2001,2002,2003 Arthur Shipkowski aka "sleeping wolf" 
#                             <Art_Kowolf at yahoo.com>
# Portions Copyright (C) 2000,2001 Will Woods <wwoods@cowofdoom.com>
# Distributed under the terms of the GNU General Public License,
# included here by reference.
#
# send comments, questions, and stories to the top address above, or just 
# /msg me.
#
# To-Do:
#  * Perhaps a switch to just route it all to stdout would be nice, rather than having
#    to figure out the appropriate filename.
#  * Figure out if my XML is Everything Core compatible.
#
# history: 
# v1.0.0: (initial release)
# v1.0.1: Added single non-XML and XML file output.
#         Made the handling of command-line directory setup less braindead.
#         Replaced the 'delete chars from filename' approach with the 
#             'substitute chars into filename' approach.
# v1.0.2: Updated the login routine from Cow of Doom's latest since it stopped 
#             working.
# v1.0.3: Added usersearch option to save someone else's WUs instead of your 
#             own.
# v1.0.4: Fixed login (again) and added sleeps, since the robots.txt or whatnot
#             prohibits me from using the robot agent.
# v1.1.0: Changeover to use displaytype=xmltrue.
# v1.1.1: Bugfix to getXMLwu and the LWP->UA instantiation
# v1.1.2: Formatted to fit in 80 columns per request.

$0="$0"; # Perl magic to clean the commandline from the process list
my $version = "1.1.2";

use LWP::UserAgent; # these are both part of libwww-perl, available
use HTTP::Cookies;  # at your friendly local CPAN mirror
# The CGI package should be available at your friendly local CPAN mirror, too.
use CGI qw/unescapeHTML/; 
use Getopt::Long;   # This should be part of the standard distribution.
use File::Spec::Functions; # Should also be part of the standard distribution

# Parenthesize the list so that "my" applies to every variable, not just 
# the first one.
my ($spaces_ok, $XML_mode, $dump_to_dir, $regexp, $singlefileMode, $loginpass)
    = (0, 0, curdir(), "", "", "");

GetOptions('spaced-names!' => \$spaces_ok, 'dump-to=s' => \$dump_to_dir,
'regexp=s' => \$regexp, 'single=s' => \$singlefileMode, 'xml' => \$XML_mode, 
'login=s' => \$loginpass);

$baseurl="http://www.everything2.com/index.pl";

$|=1;
my $ua = LWP::UserAgent->new(agent => "e2backup/$version");
$ua->env_proxy();
$cookies = HTTP::Cookies->new();
$ua->cookie_jar($cookies);

if ($#ARGV < 0) {
print "
usage: $0 [--spaced-names | --nospaced-names ] [--dump-to=dirname] 
[--xml] [--regexp=expr] [--single=name] [--login=username,pass] <username> 
<username> should be quoted if it has spaces in it 
(single quotes for UNIX, double quotes for DOS/Windows).
--spaced-names and --nospaced-names control whether or not spaces will appear 
in the filenames. The default is --nospaced-names.
--dump-to will place the files in dirname.
--regexp will only dump nodes fitting the regular expression expr.  For those 
not familiar with regular expressions, one way to use it is to wrap a 
case-sensitive substring for titles you want in quotation marks.
--single will output everything to the named filename, rather than separate
filenames.  The format is compatible with kaatanut's autonoder.
--login will log you in (the username and password might need quoting)
--xml will output in raw XML format, based on E2's exact XML output, which
escapes characters.\n";
exit(1);}

#
# Overwrite the old file if in single-file-mode
#
if ($singlefileMode) {
    $fullnodeoutputfilename = catfile($dump_to_dir, $singlefileMode);
    open(NODEFILE, ">$fullnodeoutputfilename") or die 
      "Couldn't open $fullnodeoutputfilename: $!";
    close(NODEFILE);
}

$usersearch = $ARGV[0];

if ($loginpass) 
{
    # Split on the first comma only, so the password may contain commas.
    ($login, $pass) = split(/,/, $loginpass, 2);

    $username = $ARGV[0];
    $usersearch = $username unless $usersearch;
    print "Logging in...";
    login($login, $pass) or die "failed";
    print "ok.\n";
    sleep(10);
}

# get the User Search XML page, and array-ify it
print "Doing user search...";
@data = split(/\n/,&getUserSearchXMLTicker) or die "failed";
print "ok.\n";
sleep(10);

# Read the info out of the User Search page.
foreach (@data) { # loop over each line in the page
    if (/^<writeup/g) { # if this line is about a writeup..

        while (/ (\w+)=\"(.*?)\"/gc) { $n{$1}=$2; } # get node info

        ($name, $type) = />(.*) \(([a-z]+)\)<\/writeup>/gc;

        next unless (($regexp eq "") or ($name =~ m{$regexp}));

        print "\rCurrent node title: ", substr($name.' 'x59,0,59);

        $nodecontent = &getXMLwu($n{node_id});
        sleep(10);

        unless ($XML_mode) {
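            # \C was removed from regexes in Perl 5.24; with /s, "." also
            # matches newlines, so .* captures the whole doctext.
            if ( $nodecontent =~ m{<doctext>(.*)</doctext>}is ) {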
                if ($singlefileMode) 
                    { singlefileDump($dump_to_dir, $singlefileMode, 
                        "$name\n$type\n" . &unescapeHTML($1) . "\n----\n"); }
                else
                    { multifileDump($dump_to_dir, $name, &unescapeHTML($1)); }
            }
            else
                { print "Warning! Unable to get content for $name!\n"; }
        }
        else
        {
            if ($singlefileMode) 
                { 
                  singlefileDump($dump_to_dir, $singlefileMode, $nodecontent ); 
                }
            else
                { multifileDump($dump_to_dir, $name, $nodecontent ); }

        }


    }
}

#----- end of main program ------------------------------

#----- subroutines --------------------------------------

sub getnode {
# takes one argument: $node_id
# assumes that $ua is a valid LWP::UserAgent object
# returns the contents of the page in a scalar variable
# example: $page = getnode($node_id);
my $req = HTTP::Request->new('GET', "$baseurl?node_id=$_[0]");
return($ua->request($req)->content());
}

sub getUserSearchXMLTicker {
# 762826 = User Search XML Ticker
my $req = HTTP::Request->new('GET', "$baseurl?node_id=762826&usersearch=$usersearch");
return ($ua->request($req)->content());
}


sub getXMLwu {
# takes one argument: $node_id
# assumes that $ua is a valid LWP::UserAgent object
# returns the contents of the XML writeup page in a scalar variable
# example: $page = getXMLwu($node_id);
    my $req = 
        HTTP::Request->new('GET', 
                           "$baseurl?node_id=$_[0]&displaytype=xmltrue");
    return($ua->request($req)->content());
}


sub login {
# takes two arguments: $username, $password
# assumes that $ua is a valid LWP::UserAgent object
# returns true on success, false on failure
# example: login($username, $password) or die "failed";
  my $req = HTTP::Request->new('POST', "$baseurl");
  $req->content_type('application/x-www-form-urlencoded');
  $req->content("op=login&user=$_[0]&passwd=$_[1]&displaytype=null");
  my $response = $ua->request($req);
  return($cookies->as_string() ne "");
}

sub multifileDump {
# Takes three arguments, $dump_to_dir, $nodetitle, and $nodecontent
# Dumps $nodecontent to a file named based on $nodetitle.
# Example: multifileDump("/home/sleepingwolf", "Fred the Node", 
#                        $nodecontent);
    my $dump_to_dir = $_[0];
    my $nodetitle   = $_[1];
    my $nodecontent = $_[2];

    $nodefilename = $nodetitle;

# Need to escape /:\?'*"<>;&! and \0
    $nodefilename =~ s,\/,(slash),g;
    $nodefilename =~ s,\:,(colon),g;
    $nodefilename =~ s,\\,(backslash),g;
    $nodefilename =~ s,\?,(questionmark),g;
    $nodefilename =~ s,\',(singlequot),g;
    $nodefilename =~ s,\*,(asterix),g;
    $nodefilename =~ s,\",(doublequot),g;
    $nodefilename =~ s,\<,(lessthan),g;
    $nodefilename =~ s,\>,(greaterthan),g;
    $nodefilename =~ s,\;,(semicolon),g;
    $nodefilename =~ s,\&,(ampersand),g;
    $nodefilename =~ s,\!,(bang),g;
    $nodefilename =~ s,\0,(null),g;

    if ($nodefilename eq ".")
        { $nodefilename = "(dot)" }
    elsif ($nodefilename eq "..")
        { $nodefilename = "(dot)(dot)" }

    $nodefilename =~ s/ /(space)/g unless ($spaces_ok);

    $fullnodefilename = catfile($dump_to_dir, $nodefilename);

    open(NODEFILE,">$fullnodefilename") or die 
      "Couldn't open $fullnodefilename: $!";

    print NODEFILE $nodecontent;

    close(NODEFILE);
}

sub singlefileDump {
# Takes three arguments: $dump_to_dir, $backupfilename, and $nodecontent
# Dumps it all to one file, using append mode.
# Example: singlefileDump("/home/sleepingwolf", "nodebackups", "Fred the Node" 
#                         . "\n" . $nodecontent);
    my $dump_to_dir = $_[0];
    my $backupfilename = $_[1];
    my $nodecontent = $_[2];

    $fullnodeoutputfilename = catfile($dump_to_dir, $backupfilename);    

    open(NODEFILE, ">>$fullnodeoutputfilename") or 
        die "Couldn't open $fullnodeoutputfilename: $!";  

    print NODEFILE $nodecontent;

    close(NODEFILE);

}
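
The main loop above pulls each writeup's attributes and title out of one line of the User Search XML Ticker. The following is a rough standalone sketch of that step, using the same two patterns but without relying on pos() and /gc. The helper name parse_ticker_line and the sample line are invented for illustration; only the node_id and createtime attribute names come from the scripts on this page, and the real ticker output may differ.

```perl
use strict;
use warnings;

# parse_ticker_line is a hypothetical helper, not part of e2back.pl.
# Returns a hash reference for a <writeup ...> line, or nothing otherwise.
sub parse_ticker_line {
    my ($line) = @_;
    return unless $line =~ /^<writeup/;

    # Collect every name="value" pair on the line.
    my %info;
    $info{$1} = $2 while $line =~ / (\w+)="(.*?)"/g;

    # The element text looks like "Title (type)".
    my ($name, $type) = $line =~ />(.*) \(([a-z]+)\)<\/writeup>/;
    return unless defined $name;

    $info{name} = $name;
    $info{type} = $type;
    return \%info;
}

# Made-up sample line, shaped the way the patterns expect.
my $w = parse_ticker_line(
    '<writeup node_id="12345" createtime="2001-02-19 22:27:41">'
    . 'E2 node backup (thing)</writeup>');
print "$w->{node_id}: $w->{name} [$w->{type}]\n";
```

Since the title pattern is greedy, a title that itself contains parentheses still ends up whole in the name, and only the last parenthesized word is taken as the type.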
(thing) by winged (3.2 y) Tue Jun 12 2001 at 10:15:31

I found it helpful to put a sleep(1); just before 'unless ($XML_mode) {'; this helps alleviate a bit of the lag on the database, and is very nice to do in any case.

Also, I modified it a bit for my Windows 2000 machine -- I've not yet figured out the way to autodetect when running on Windows, so I'm stuck for it in any case. But, for the 'open(NODEFILE,...' statement, I added the .txt extension.

For content issues, I added a comment inside the file itself as to the original node title that it came from -- very handy if the file gets renamed. (And I also took out the '(space)' substitution, replacing it with '_', since it's MUCH easier for me to read.)
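
On the autodetection question: Perl exposes the operating system name in the built-in variable $^O, which is "MSWin32" on native Windows builds. Below is a small sketch of the filename tweaks described above, keyed off that variable. The helper name backup_filename is made up, and it takes the OS name as an optional second argument purely so it can be tried out on any machine.

```perl
use strict;
use warnings;

# backup_filename is a hypothetical helper, not part of e2back.pl.
# $os defaults to $^O, which Perl sets to "MSWin32" on native Windows.
sub backup_filename {
    my ($title, $os) = @_;
    $os = $^O unless defined $os;
    (my $name = $title) =~ s/ /_/g;        # '_' instead of '(space)'
    $name .= '.txt' if $os eq 'MSWin32';   # add the extension only on Windows
    return $name;
}

print backup_filename('Fred the Node'), "\n";
```

The other substitutions in multifileDump (slashes, colons and so on) would still need to be applied first; this only covers the space and extension handling.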

Good job. :)

(idea) by enth (5.2 hr) 4 C!s Fri Dec 19 2003 at 16:39:15

I don't know about the rest of you, but when the site went down for the move (parts of November and December 2003) it really made me wish I had captured a node backup a long time ago. Unfortunately, due to various problems like a) not having a working Perl development environment and b) my awe-inspiring laziness, I hadn't. So, with my free time (ha!) over the past month or so, I re-wrote the script to provide a CGI interface, so it could be run from a central server and non-coders could use it with a minimum of effort.

As of earlier today, it is finished and working perfectly, so I am providing a server running it at http://www.postreal.org/nodebackup . Check it out.

If you want to mirror the script on your own server, message me once your copy is running, and I will post a link to it in this writeup. The interface and requirements of the script should be fairly obvious from its source.


#!/usr/bin/perl -w
# e2backup_cgi.pl - gathers user nodes from everything2.com and lets 
# users download it from a server.
# Portions Copyright (C) 2000,2001 Will Woods <wwoods AT cowofdoom.com>
# Portions Copyright (C) 2001,2002,2003 Arthur Shipkowski aka "sleeping wolf" 
#                             <Art_Kowolf AT yahoo.com>
# Portions Copyright (C) 2003 J. Chatterton <cee aitch ay tee tee jay AT gmail>
# Distributed under the terms of the GNU General Public License,
# included here by reference.
#
# This program is being maintained by J. Chatterton, please email him
# at the address above with any questions, patches, etcetera.

use Archive::Zip; # This will itself require Compress::Zlib.
use LWP::UserAgent; # these are both part of libwww-perl, available
use CGI qw/unescapeHTML/; # The CGI package should be available at your friendly local CPAN mirror, too.
use HTTP::Request;

my $query = new CGI;
my $username = $query->param('username');
$username = lc($username);
my $singleFileMode = $query->param('singleFileMode');
my $sysdate = localtime;
my $baseurl = "http://www.everything2.com/index.pl";
my $ua = LWP::UserAgent->new(agent => "e2backup_cgi");
$ua->env_proxy();
# get the User Search XML page, and array-ify it
my @data = split(/\n/,&getusernameXMLTicker) or die "failed ($!)";
my $outputfilename = "../e2generated/".$username."_index.html";
my $zipfilename = "../e2generated/".$username."_index.zip";
if ($singleFileMode) {
    $outputfilename = "../e2generated/".$username.".html";
    $zipfilename = "../e2generated/".$username.".zip";
} 
sleep(3);
## Begin CGI output,
print "Content-type: text/html\n\n";
print htmlheader(); # open the <html><body> that is closed before each exit
if (-e $zipfilename) { 
    ## put out a link to the already-generated content.
    if ($singleFileMode) {
        print "Content has already been generated in the past 24 hours. Right click and 
save it <a href=\"../e2generated/$username.zip\">here</a>.\n";
    } else {
        print "Content has already been generated in the past 24 hours. Right click and 
save it <a href=\"../e2generated/$username"."_index.zip\">here</a>.\n";
    }
    print "</body></html>\n";
    exit 1;
} else {
    ## New search. Create main file.
    open(NODEFILE, ">$outputfilename") or die "Couldn't open $outputfilename: $!";
    print NODEFILE htmlheader();
    print NODEFILE "<center><big>Writeups by $username</big><br>Snapshot taken: $sysdate</center><br><br>\n";
    close(NODEFILE);
}

my $writeupcount = scalar(@data);
if ($writeupcount <= 1) {
    print "<p><b>E2 server error</b>, unable to get content for $username!</p>\n";
    print "<p>Check that the username is correct and try again in ten minutes.</p>\n";
    print "</body></html>\n";
    exit;
}
print "<p>Checking $writeupcount lines, please stand by until complete:</p>\n<p>\n";
$writeupcount = 0;
my %nodelist;
# Read the info out of the User Search page.
foreach (@data) { # loop over each line in the page
    $writeupcount++;
    print " $writeupcount "; 
    $| = 1; # Flush output to browser.
    if (/^<writeup/g) { # if this line is about a writeup..
        ## Put line's info into hash.
        while (/ (\w+)=\"(.*?)\"/gc) {
            $n{$1}=$2; 
        } 
        # get node info

        ($name, $type) = />(.*) \(([a-z]+)\)<\/writeup>/gc;
        $type =~ s///; ## I am tired of looking at the warning.
        $title = substr($name.' 'x59,0,59);
        $title =~ s/(.*)\w*?/$1/;
        my $createtime = $n{createtime};
        my $nodeid = $n{node_id};
        my $nodecontent = &getXMLwu($nodeid);
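        ## \C was removed from regexes in Perl 5.24; with /s, "." matches newlines too.
        if (!($nodecontent =~ m{<doctext>(.*)</doctext>}is )) {
            print "<b>E2 server error</b>, unable to get content for $name";
            sleep(3);
            next; ## Skip this writeup rather than reuse a stale $1.
        }
        $nodecontent = &unescapeHTML($1);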
        ## Create a friendly html-ish formatted writeup.
        my $htmlformat = "";
        if (!($singleFileMode)) {
            $htmlformat .= htmlheader();
        }
        $htmlformat .= "<!-- Below is e2 node #$nodeid -->\n";
        $htmlformat .= '<table border="3" bordercolor="000000"><tr><td>'."\n";
        $htmlformat .= "<b>Node title:</b> <a href=\"http://www.everything2.com/index.pl?node_id=$nodeid\">$name</a> <br> \n";
        $htmlformat .= "<b>Submit date:</b> $createtime\n";
        $htmlformat .= "</td></tr></table>\n";
        $htmlformat .= $nodecontent . "\n";
        $htmlformat .= "<br><br><br>\n";
        ## Open whatever should be open. 
        if($singleFileMode) {
            open(NODEFILE, ">>$outputfilename");
        } else {
            ## Relies on nodeid being unique.
            open(IDXFILE, ">>$outputfilename");
            print IDXFILE "<a href=\"$nodeid.html\">$name</a><br>\n";
            $nodelist{$nodeid} = 1;
            close(IDXFILE);
            my $singlefilename = "../e2generated/".$nodeid.".html";
            open(NODEFILE, ">$singlefilename");
        }
        ## Add the content.
        print NODEFILE $htmlformat;
        close(NODEFILE);
    }
    sleep(3);
}
print "<b>Done!</b></p>\n";
## Archive and delete downloaded data.
my $zip = Archive::Zip->new();
my $zname = "$username.zip";
if ($singleFileMode) {
    $zip->addFile($outputfilename, "$username.html");
    $zip->writeToFileNamed("../e2generated/$username.zip");
    
} else {
    my @keys = keys %nodelist;
    $zip->addFile($outputfilename, "$username\_index.html");
    foreach $nodeid (@keys) {
        $zip->addFile("../e2generated/$nodeid.html", "$nodeid.html");
    }
    $zip->writeToFileNamed("../e2generated/$username\_index.zip");
    foreach $nodeid (@keys) {
        unlink("../e2generated/$nodeid.html");
    }
    $zname = "$username\_index.zip";
}
unlink($outputfilename);
print "<p>Right click and save this zip file: <a href=\"$zipfilename\">$zname</a></p>\n";
print "</body></html>\n";

########## Subs are delicious. ##########

## Gets the node list for a given username. Seems to be
## case-insensitive, at the discretion of everything2.
sub getusernameXMLTicker {
    # 762826 = User Search XML Ticker
    my $req = HTTP::Request->new('GET', "$baseurl?node_id=762826&usersearch=$username");
    return ($ua->request($req)->content());
}
# takes one argument: $node_id
# assumes that $ua is a valid HTTP::UserAgent object
# returns the contents of the XML writeup page in a scalar variable
sub getXMLwu {
    my $req = HTTP::Request->new('GET', "$baseurl?node_id=$_[0]&displaytype=xmltrue");
    return($ua->request($req)->content());
}
## Returns a valid html header. Prettier code than having it above.
sub htmlheader {
    my $header = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">'."\n";
    $header .= "<html><head></head><body>\n";
    return $header;
}
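
The CGI script builds its output paths directly from the username parameter. Anyone mirroring it might want to reduce that name to a conservative character set before using it in a filename. This is a sketch of one way to do that, offered as an assumption rather than something the script above does; safe_basename is a made-up name, and the original username would still be what gets sent to E2 for the search.

```perl
use strict;
use warnings;

# safe_basename is a hypothetical helper, not part of e2backup_cgi.pl.
# It keeps lowercase letters, digits, '_' and '-', and replaces everything
# else (including '.' and '/') with '_'.
sub safe_basename {
    my ($username) = @_;
    $username = '' unless defined $username;
    my $safe = lc $username;
    $safe =~ s/[^a-z0-9_-]/_/g;
    $safe = 'unnamed' if $safe eq '';
    return $safe;
}

print safe_basename('Sleeping Wolf'), "\n";
print safe_basename('../etc/passwd'), "\n";
```

Two different usernames can map to the same safe name this way, so a mirror that cares about that could append something unique as well.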
