In this tutorial I'll show how to download an HTML page, including all of its images, and send it by e-mail using the Perl modules LWP::UserAgent and MIME::Lite. I use this method to receive fresh thumbnails from www.deviantart.com every day, because I don't want to download and save these pages manually. :-) The script needs the following modules (you can download them from http://www.cpan.org):
LWP::UserAgent - WWW user agent class
MIME::Lite - lightweight MIME message generator
URI::URL - for working with URLs
HTML::LinkExtor - extracts all the links from an HTML document
Time::Local - converts a date and time to epoch seconds
As an example, we'll download all of one day's thumbnails from the sci-fi wallpapers section of http://www.deviantart.com.
The site is updated constantly, so it is better to download only yesterday's thumbnails, because they no longer change. The URL of the page we want has the form:
http://browse.deviantart.com/wallpaper/scifi/?startts=<start_time_stamp>&endts=<end_time_stamp>
For example, the page http://browse.deviantart.com/wallpaper/scifi/?startts=1071648000&endts=1071734400 contains all the sci-fi wallpaper thumbnails for December 17, 2003.
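The two numbers in the URL are Unix timestamps (seconds since January 1, 1970, UTC). A minimal sketch for decoding them, using only the core POSIX module; the helper name describe_ts is made up for illustration:

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical helper: format a Unix timestamp as a UTC date and time.
sub describe_ts {
    my ($ts) = @_;
    return strftime('%Y-%m-%d %H:%M:%S UTC', gmtime($ts));
}

print describe_ts(1071648000), "\n";   # 2003-12-17 08:00:00 UTC
print describe_ts(1071734400), "\n";   # 2003-12-18 08:00:00 UTC
```

Both values fall at 08:00 UTC, exactly 24 hours apart, which suggests that the site's day starts at midnight in a UTC-8 time zone. That is an inference from the example values, not something stated by the site.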
Now, a few words for those who have just begun to learn Perl.
How do I download a web page?
    require LWP::UserAgent;
    $ua = LWP::UserAgent->new;
    # optional: send requests through a proxy server
    $ua->proxy(['http', 'ftp'], 'proxy-server address');
    $req = HTTP::Request->new('GET' => 'page to be downloaded');
    # perform the request; the response object carries the status and the body
    $res = $ua->request($req);
    if ($res->is_success) {
        $page = $res->content;
    }
How do I send an e-mail with an attachment?
    require MIME::Lite;
    $msg = MIME::Lite->new(
        From    => 'your@address.com',
        To      => 'recipient@address.com',
        Subject => 'Subject',
        Type    => 'multipart/related',
    );
    # the message text
    $msg->attach(
        Type => 'text/plain; charset=windows-1251',
        Data => 'message text',
    );
    # the attached file
    $msg->attach(
        Type     => 'image/gif',
        Path     => 'path to the file',
        Filename => 'img.gif',
    );
    $msg->send();
Now let's see how the script works.
Determine the URL of the document
Download the web page content
Find all images on the page and download them
Convert the links in the document to absolute URLs
Inline the external CSS and JavaScript files
Encode all images and assemble the MIME object
Send the message by e-mail
I'll describe the implementation of the script only in outline; if anything is unclear, see the script itself.
First, let's compute the timestamps for yesterday and for the day before yesterday.
    $yesterday = time() - 86400;
    $before_yesterday = time() - 2 * 86400;
Build the URL of the page to download according to the template above. The earlier timestamp goes into startts and the later one into endts.
    $url_page = "http://www.deviantart.com/wallpaper/scifi/?startts=" . $before_yesterday . "&endts=" . $yesterday;
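Subtracting multiples of 86400 from time() gives a sliding 24-hour window rather than a calendar day. If the range should start and end at midnight, the Time::Local module from the list at the top can compute the boundaries. The sketch below is one possible way to do it; the function name day_bounds and the fixed UTC offset of -8 hours are assumptions (the offset is inferred from the example timestamps 1071648000 and 1071734400), and a fixed offset ignores daylight saving time.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

# Hypothetical helper: return the start and end timestamps of the
# calendar day containing $epoch, for a fixed UTC offset in hours.
sub day_bounds {
    my ($epoch, $utc_offset_hours) = @_;
    my $shift = ($utc_offset_hours // 0) * 3600;

    # Shift into "local" time, read the date, and rebuild its midnight.
    my (undef, undef, undef, $mday, $mon, $year) = gmtime($epoch + $shift);
    my $start = timegm(0, 0, 0, $mday, $mon, $year + 1900) - $shift;

    return ($start, $start + 86400);
}

# Yesterday's boundaries, assuming the site's day starts at midnight UTC-8.
my ($start, $end) = day_bounds(time() - 86400, -8);
my $url_page = "http://www.deviantart.com/wallpaper/scifi/?startts=$start&endts=$end";
print "$url_page\n";
```

For any moment during December 17, 2003 (UTC-8), this returns the same pair of values as the example URL.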
Now download the page contents using LWP:
    if ($url_page && $url_page =~ /^(https?|ftp|file|nntp):\/\//) {
        my $req = HTTP::Request->new('GET' => $url_page);
        my $res = $ua->request($req);
        $gabarit = $res->content;
    }
Inline the external CSS and JavaScript. I'll show this in a very simplified form, but the idea is easy to follow: download each external file and insert its contents at the corresponding location in the HTML file.
    # $css and $js stand for the downloaded contents of the external files;
    # $gabarit is the HTML page downloaded in the previous step
    $css_block = '<style type="text/css">' . "\n" . '<!--' . "\n" . $css . "\n-->\n</style>\n";
    $gabarit =~ s/<link([^<>]*?)href="?([^\" ]*)"?([^>]*)>/$css_block/igmx;
    $js_block = '<script><!--' . "\n" . $js . "\n-->\n</script>\n";
    $gabarit =~ s/<script([^>]*)src="?([^\" ]*js)"?([^>]*)>/$js_block/igmx;
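The simplified substitutions above put the same block in place of every matching tag. A slightly fuller sketch would fetch each stylesheet by its own URL. In the version below the names inline_stylesheets and replace_link, and the $fetch callback (any code reference that takes a URL and returns the file contents, or undef on failure), are assumptions made for illustration and are not taken from the original script. It also handles only double-quoted or unquoted href values.

```perl
use strict;
use warnings;

# Hypothetical helper: decide what a single <link> tag becomes.
sub replace_link {
    my ($tag, $href, $fetch) = @_;
    return $tag unless $tag =~ /\brel="?stylesheet/i;   # not CSS: keep as is
    my $css = $fetch->($href);
    return $tag unless defined $css;                     # fetch failed: keep as is
    return qq{<style type="text/css">\n$css\n</style>\n};
}

# Replace every <link ... href=...> tag with the result of replace_link().
sub inline_stylesheets {
    my ($html, $fetch) = @_;
    $html =~ s{(<link\b[^<>]*?\bhref="?([^" >]+)"?[^>]*>)}{replace_link($1, $2, $fetch)}gei;
    return $html;
}

# Usage with a stand-in fetcher; a real one could wrap LWP::UserAgent.
my %fake_site = ('/main.css' => 'body { color: #ccc }');
my $page = '<link rel="stylesheet" href="/main.css"><link rel="icon" href="/fav.ico">';
print inline_stylesheets($page, sub { $fake_site{ $_[0] } }), "\n";
```

Passing the download step in as a callback keeps the substitution logic separate from the network code, so it can be tried out without a connection.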
Now walk over all the links and replace relative paths with absolute ones. This ensures that every link in the e-mailed page leads to the same location as on the original web page.
    my $analyseur = HTML::LinkExtor->new;
    $analyseur->parse($gabarit);
    my @l = $analyseur->links;
    foreach my $url (@l) {
        # each entry is [tag, attribute name, URL]; $racinePage is the base URL of the page
        my $urlAbs = URI::WithBase->new($$url[2], $racinePage)->abs;
        chomp $urlAbs;
        if (   ($$url[0] eq 'a')
            && ($$url[1] eq 'href')
            && ($$url[2])
            && ($$url[2] !~ m!^http://!)
            && ($$url[2] !~ m!^mailto:!) )
        {
            # \Q...\E makes characters such as "?" in the URL match literally
            $gabarit =~ s/\s href= [\"']? \Q$$url[2]\E [\"']?/ href="$urlAbs"/gimx;
        }
    }
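As a cross-check on the loop above, here is a minimal sketch of the same idea for a single link. It uses URI->new_abs from the URI distribution to resolve the relative path, and \Q...\E so that characters such as "?" in the link are matched literally. The function name rewrite_href and the example URLs are hypothetical.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: make one known link absolute wherever it
# appears as an href value in $html.
sub rewrite_href {
    my ($html, $link, $base) = @_;
    my $abs = URI->new_abs($link, $base)->as_string;

    # \Q...\E escapes regex metacharacters in the link; the lookahead
    # stops "/art/1" from also matching the start of "/art/10".
    $html =~ s/(\shref=)["']?\Q$link\E["']?(?=[\s>])/$1"$abs"/gi;
    return $html;
}

my $base = 'http://www.example.com/wallpaper/scifi/';
print rewrite_href('<a href="/art/1?x=1">one</a>', '/art/1?x=1', $base), "\n";
# <a href="http://www.example.com/art/1?x=1">one</a>
```

Without the escaping, the "?" in the link would be read as a regex quantifier and the substitution would silently fail to match.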
Now we need to find all the images in the document, download them, determine their types, and return them as MIME-encoded parts.
    # inside the loop over the links: collect a MIME part for every image
    if ( (lc($$url[0]) eq 'img') || (lc($$url[0]) eq 'src') ) {
        push(@mail, create_image_part($urlAbs));
    }

    # inside create_image_part(): $ur is the URL of the image
    if    (lc($ur) =~ /gif$/) { $type = "image/gif"; }
    elsif (lc($ur) =~ /jpg$/) { $type = "image/jpeg"; }
    else                      { $type = "application/x-shockwave-flash"; }
    my $res2 = $ua->request(HTTP::Request->new('GET' => $ur));
    $buff1 = $res2->content;
    # the file name is everything after the last slash
    $file_name = substr($ur, rindex($ur, "/") + 1, length($ur));
    # encode the next image
    my $mail = MIME::Lite->new(
        Data       => $buff1,
        Encoding   => 'base64',
        'Filename' => $file_name,
    );
    $mail->attr('Content-type' => $type);
    $mail->attr('Content-Location' => $ur);
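The type check in the fragment above looks only at the end of the URL and treats everything that is not a GIF or JPEG as Flash. A lookup table is a little more robust. This is a sketch only: the table contents, the fallback type application/octet-stream, and the names %MIME_TYPE_FOR, path_only, guess_type and file_name_of are assumptions, not part of the original script.

```perl
use strict;
use warnings;

# Assumed mapping from file extension to MIME type.
my %MIME_TYPE_FOR = (
    gif  => 'image/gif',
    jpg  => 'image/jpeg',
    jpeg => 'image/jpeg',
    png  => 'image/png',
    swf  => 'application/x-shockwave-flash',
);

# Strip the query string and fragment, leaving only the path part.
sub path_only {
    my ($url) = @_;
    (my $path = $url) =~ s/[?#].*//s;
    return $path;
}

# Guess the MIME type from the extension, with a generic fallback.
sub guess_type {
    my ($url) = @_;
    my ($ext) = path_only($url) =~ /\.([A-Za-z0-9]+)$/;
    return $MIME_TYPE_FOR{ lc($ext // '') } // 'application/octet-stream';
}

# The file name is whatever follows the last slash of the path.
sub file_name_of {
    my ($url) = @_;
    my ($name) = path_only($url) =~ m{([^/]*)$};
    return $name;
}

my $url = 'http://www.example.com/thumbs/Nebula.JPG?size=150';
print guess_type($url), ' ', file_name_of($url), "\n";   # image/jpeg Nebula.JPG
```

Dropping the query string first matters for thumbnail URLs that carry parameters, where a plain "ends with gif" test would miss the extension.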
Create the MIME object and fill in the "From", "To" and "Subject" fields. If there were no images on the page, the message has the type "text/html"; otherwise it is "multipart/related".
    $mail = MIME::Lite->new(
        'From'    => 'somebody@somewhere.com',
        'To'      => $to_email,
        'Subject' => $url_page,
        'Data'    => $html,
    );
    $mail->attr("Content-type" => $content_type);
    if (@mail) {
        $mail->replace("Type" => "multipart/related");
        # attach every image
        foreach (@mail) { $mail->attach($_); }
    }
Now send the page by e-mail.
    MIME::Lite->send('smtp', "SMTP-server address", Timeout => 60);
    $mail->send();
Script execution
Place the script in a directory where execution of CGI scripts is enabled, and make the file executable:
chmod 750 /usr/local/www/cgi-bin/html_on_email3.pl
To automate the process entirely, we can run the script from cron. To do that, add one line to the file /etc/crontab:
0 9 * * * root /usr/local/www/cgi-bin/html_on_email3.pl
Every morning we will then have a fresh set of sci-fi wallpaper thumbnails in our mailbox, complete with working links to the full-size images.