Well, I have a nice script that works as an image scraper: in the first trials and tests all goes well.
Here is a list of URLs that I keep in urls.txt and run the script against. Note that this is only a short list. I need to run against 2500 URLs, so it would be great if the script were a bit more robust and kept running when some URLs are not available or take too long to load. I think the script runs into problems when some URLs are unavailable, take too long, or block mozrepl and WWW::Mechanize::Firefox for too long.
Well, do you think my guesses are the likely cause of the issue or not? If so, how can we improve the script and make it more robust, so that it does not stop too soon?
Love to hear from you.
Greetings
- Code: Select all
http://www.bez-zofingen.ch
http://www.schulesins.ch
http://www.schulen-turgi.ch/pages/bezirksschule/startseite.php
http://www.schinznach-dorf.ch
http://www.schule-seengen.ch
http://www.gilgenberg.ch/schule/bez/2005-06/
http://www.rheinfelden-schulen.ch/bezirksschule/
http://www.bezmuri.ch
http://www.moehlin.ch/schulen/
http://www.schule-mewo.ch
http://www.bez-frick.ch
http://www.bezendingen.ch
http://www.bezbrugg.ch
http://www.schule-bremgarten.ch/content/view/20/37/
http://www.bez-balsthal.ch
http://www.schule-baden.ch
http://bezaarau.educanet2.ch/info/.ws_gen/index.htm
http://www.benedict-basel.ch
http://www.institut-beatenberg.ch/
http://www.schulewilchingen.ch
http://www.ksuo.ch
http://www.international-school.ch
http://www.vsgtaegerwilen.ch/
http://www.vgk.ch/
http://www.vstb.ch
Well, I would be very happy if it were more robust than it is now.
Sure, it is driving a real browser, as WWW::Mechanize::Firefox does,
so it may be somewhat less stable than other screen-scraping solutions. I sometimes get errors like the ones shown after the script below. Note that I also had a closer look at the WWW::Mechanize::Firefox::Troubleshooting page on search.cpan.org, with its hints and workarounds for various bugs and problems.
- Code: Select all
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize::Firefox;

my $mech = WWW::Mechanize::Firefox->new();

open my $urls, '<', 'urls.txt' or die $!;

while (<$urls>) {
    chomp;
    next unless /^http/i;
    print "$_\n";

    # Load the page in Firefox and take a screenshot of it.
    $mech->get($_);
    my $png = $mech->content_as_png;

    # Derive the file name from the URL: drop the scheme, the slashes,
    # surrounding whitespace and a leading "www.".
    my $name = $_;
    $name =~ s#^http://##i;
    $name =~ s#/##g;
    $name =~ s/\s+\z//;
    $name =~ s/\A\s+//;
    $name =~ s/^www\.//;
    $name .= ".png";

    # Write the PNG data to disk.
    open(my $out, '>', "/home/martin/images/$name") or die $!;
    binmode $out;
    print $out $png;
    close $out;

    sleep 5;
}
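One possible direction, as a rough sketch only: every fatal error inside the loop above ends the whole run, so the per-URL work could be wrapped in an eval and a failing URL logged and skipped. The helper name process_urls and its callback interface are my own assumptions, not part of WWW::Mechanize::Firefox.

```perl
use strict;
use warnings;

# Run $work->($url) for each URL. A die inside $work is caught, reported
# on STDERR and recorded, instead of ending the whole run.
# process_urls is a hypothetical helper, not part of any module.
sub process_urls {
    my ($work, @urls) = @_;
    my (@ok, @failed);
    for my $url (@urls) {
        # eval returns 1 only if $work finished without dying.
        my $done = eval { $work->($url); 1 };
        if ($done) {
            push @ok, $url;
        }
        else {
            my $err = $@ || 'unknown error';
            chomp $err;
            warn "skipping $url: $err\n";
            push @failed, $url;
        }
    }
    # Two array references: the URLs that worked and the ones that did not.
    return (\@ok, \@failed);
}
```

In the script, $work would be a sub that does the get, the content_as_png and the file writing for one URL. The list of failed URLs could then be written to a file and retried later.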
Here are the results and also the errors where it stops.
- Code: Select all
martin@linux-wyee:~/perl> perl test_10.pl
http://www.bez-zofingen.ch
Datei oder Verzeichnis nicht gefunden at test_10.pl line 24, <$urls> line 3.
martin@linux-wyee:~/perl> perl test_10.pl
http://www.bez-zofingen.ch
http://www.schulesins.ch
http://www.schulen-turgi.ch/pages/bezirksschule/startseite.php
http://www.schinznach-dorf.ch
http://www.schule-seengen.ch
http://www.gilgenberg.ch/schule/bez/2005-06/
http://www.rheinfelden-schulen.ch/bezirksschule/
Not Found at test_10.pl line 15
martin@linux-wyee:~/perl>
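The first error, "Datei oder Verzeichnis nicht gefunden", is German for "No such file or directory". My guess (an assumption, since the line numbers in the messages do not match the listing exactly) is that it comes from the open for the output file, which would happen if /home/martin/images did not exist yet. A minimal sketch that creates the directory when it is missing and builds a safer file name; url_to_filename and save_png are hypothetical helper names:

```perl
use strict;
use warnings;
use File::Path qw(make_path);
use File::Spec;

# Turn a URL into a flat file name. Same idea as the substitutions in the
# script, but every run of awkward characters becomes "_" instead of
# being dropped. url_to_filename is a hypothetical helper.
sub url_to_filename {
    my ($url) = @_;
    my $name = $url;
    $name =~ s/\A\s+|\s+\z//g;           # trim surrounding whitespace
    $name =~ s{\Ahttps?://}{}i;          # drop the scheme
    $name =~ s/\Awww\.//i;               # drop a leading "www."
    $name =~ s{[^A-Za-z0-9._-]+}{_}g;    # replace everything else with "_"
    $name =~ s/\A_+|_+\z//g;             # no leading or trailing "_"
    return "$name.png";
}

# Create the target directory if it is missing, then write the data.
# Returns the full path of the file written; dies on failure.
sub save_png {
    my ($dir, $url, $png) = @_;
    make_path($dir) unless -d $dir;
    my $path = File::Spec->catfile($dir, url_to_filename($url));
    open my $out, '>', $path or die "cannot write $path: $!";
    binmode $out;
    print {$out} $png;
    close $out or die "cannot close $path: $!";
    return $path;
}
```

Note that this names the files slightly differently from the script: slashes inside the URL become underscores rather than disappearing.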
What do you suggest? How can we make the script a bit more robust, so that it does not stop so early?
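For the pages that take too long, one generic Perl technique is an alarm inside an eval. This is only a sketch: with_timeout is a hypothetical helper, and whether the alarm reliably interrupts a call that is blocked inside mozrepl is an assumption I have not tested.

```perl
use strict;
use warnings;

# Run $code, but give up after $seconds. Returns (1, result) on success
# and (0, error message) on failure or timeout.
# with_timeout is a hypothetical helper built on Perl's alarm().
sub with_timeout {
    my ($seconds, $code) = @_;
    my $result;
    my $ok = eval {
        # The handler turns the alarm signal into an ordinary die.
        local $SIG{ALRM} = sub { die "timed out after ${seconds}s\n" };
        alarm $seconds;
        $result = $code->();
        alarm 0;
        1;
    };
    my $err = $@;
    alarm 0;    # make sure no alarm is left pending after a failure
    return (1, $result) if $ok;
    chomp $err;
    return (0, $err || 'unknown error');
}
```

With the script above, a call could look like `my ($ok, $png) = with_timeout(60, sub { $mech->get($url); $mech->content_as_png });`, followed by `next unless $ok;` to skip the URL. The 60 seconds are an arbitrary example value.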