Apache Friends Support Forum

by **unleash** » 01. April 2012 14:33

Hello dear Perl friends,

I have a nice script that works as an image scraper. In my first trials and tests all went well.

Below is the list of URLs in urls.txt that I run the script against. Note that this is only a short list. I need to run against 2500 URLs, so it would be great if the script were more robust and kept running when some URLs are unavailable or take too long to load. I think the script runs into problems when some URLs are unavailable, take too long, or block MozRepl and WWW::Mechanize::Firefox for too long.

Do you think this is the likely cause of the issue? If so, how can we improve the script and make it robust enough that it does not stop too soon?

Love to hear from you.

Greetings

Code:

    http://www.bez-zofingen.ch
    http://www.schulesins.ch
    http://www.schulen-turgi.ch/pages/bezirksschule/startseite.php
    http://www.schinznach-dorf.ch
    http://www.schule-seengen.ch
    http://www.gilgenberg.ch/schule/bez/2005-06/
    http://www.rheinfelden-schulen.ch/bezirksschule/
    http://www.bezmuri.ch
    http://www.moehlin.ch/schulen/
    http://www.schule-mewo.ch
    http://www.bez-frick.ch
    http://www.bezendingen.ch
    http://www.bezbrugg.ch
    http://www.schule-bremgarten.ch/content/view/20/37/
    http://www.bez-balsthal.ch
    http://www.schule-baden.ch
    http://bezaarau.educanet2.ch/info/.ws_gen/index.htm
    http://www.benedict-basel.ch
    http://www.institut-beatenberg.ch/
    http://www.schulewilchingen.ch
    http://www.ksuo.ch
    http://www.international-school.ch
    http://www.vsgtaegerwilen.ch/
    http://www.vgk.ch/
    http://www.vstb.ch

Still, I would be very happy if the script were more robust than it is now.

Of course, it drives a real browser via WWW::Mechanize::Firefox, so it may be somewhat unstable, perhaps more so than other screen-scraping solutions. I sometimes get errors like the ones shown below. Note that I also had a closer look at the WWW::Mechanize::Firefox::Troubleshooting page on search.cpan.org, with its hints and workarounds for various bugs and problems.

Code:

    #!/usr/bin/perl

    use strict;
    use warnings;
    use WWW::Mechanize::Firefox;

    my $mech = new WWW::Mechanize::Firefox();

    open my $urls, '<', 'urls.txt' or die $!;

    while (<$urls>) {
      chomp;
      next unless /^http/i;
      print "$_\n";
      $mech->get($_);
      my $png = $mech->content_as_png;
      my $name = $_;
      $name =~ s#^http://##i;
      $name =~ s#/##g;
      $name =~ s/\s+\z//;
      $name =~ s/\A\s+//;
      $name =~ s/^www\.//;
      $name .= ".png";
      open(my $out, '>', "/home/martin/images/$name") or die $!;
      binmode $out;
      print $out $png;
      close $out;
      sleep 5;
    }

Here are the results, including the errors where it stops.

Code:

    martin@linux-wyee:~/perl> perl test_10.pl
    http://www.bez-zofingen.ch
    Datei oder Verzeichnis nicht gefunden at test_10.pl line 24, <$urls> line 3.
    martin@linux-wyee:~/perl> perl test_10.pl
    http://www.bez-zofingen.ch
    http://www.schulesins.ch
    http://www.schulen-turgi.ch/pages/bezirksschule/startseite.php
    http://www.schinznach-dorf.ch
    http://www.schule-seengen.ch
    http://www.gilgenberg.ch/schule/bez/2005-06/
    http://www.rheinfelden-schulen.ch/bezirksschule/
    Not Found at test_10.pl line 15
    martin@linux-wyee:~/perl>

("Datei oder Verzeichnis nicht gefunden" is the German system message for "No such file or directory".)
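The first run stops at line 24, the `open` that writes the PNG, with "No such file or directory". One possible cause, and this is only an assumption since the post does not confirm it, is that the output directory did not exist at that point or that the derived file name was unusable. A minimal sketch of a helper that builds a flat, filesystem-safe name and creates the directory if needed is shown here. The function names `url_to_filename` and `output_path` are hypothetical, and unlike the script above this version replaces slashes with underscores instead of deleting them.

```perl
use strict;
use warnings;
use File::Path qw(make_path);
use File::Spec;

# Hypothetical helper: turn a URL into a flat, filesystem-safe PNG file name.
sub url_to_filename {
    my ($url) = @_;
    my $name = $url;
    $name =~ s/\A\s+|\s+\z//g;           # trim leading and trailing whitespace
    $name =~ s{\Ahttps?://}{}i;          # drop the http:// or https:// scheme
    $name =~ s{\Awww\.}{}i;              # drop a leading "www."
    $name =~ s{/+\z}{};                  # drop trailing slashes
    $name =~ s{[^A-Za-z0-9._-]+}{_}g;    # replace every other character run with "_"
    $name = 'unnamed' if $name eq '';    # fallback for degenerate input
    return "$name.png";
}

# Hypothetical helper: make sure the target directory exists, then
# return the full path for the screenshot of $url.
sub output_path {
    my ($dir, $url) = @_;
    make_path($dir) unless -d $dir;      # create the directory tree if it is missing
    return File::Spec->catfile($dir, url_to_filename($url));
}
```

With this, `output_path('/home/martin/images', $_)` could replace the hard-coded path in the `open` call, so a missing images directory would no longer be a reason for the `open` to fail.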

What do you suggest? How can we make the script more robust so that it does not stop so early?
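Both failures in the output come from a `die` inside the loop: `$mech->get` dies on "Not Found" at line 15, and `open ... or die` dies at line 24. One common way to keep going, sketched here under assumptions and not a tested fix for this setup, is to wrap the per-URL work in `eval`, log the error, and move on to the next URL. The names `process_urls` and `firefox_fetcher` and the `tries` option are hypothetical. The browser is passed in as a code reference so the loop can be tried out without Firefox running.

```perl
use strict;
use warnings;

# Hypothetical helper: run $fetch and $save for every URL, retry a few
# times on failure, and return the URLs that never succeeded instead of dying.
sub process_urls {
    my ($urls, $fetch, $save, %opt) = @_;
    my $tries = $opt{tries} || 2;
    my @failed;
    URL: for my $url (@$urls) {
        for my $attempt (1 .. $tries) {
            my $ok = eval {
                my $png = $fetch->($url);    # may die, e.g. "Not Found"
                $save->($url, $png);         # may die, e.g. open() failure
                1;
            };
            next URL if $ok;                 # success: go to the next URL
            my $err = $@ || 'unknown error';
            chomp $err;
            warn "attempt $attempt failed for $url: $err\n";
        }
        push @failed, $url;                  # all attempts failed: remember it
    }
    return @failed;
}

# Wiring for the real browser. The module is only loaded when this is called.
sub firefox_fetcher {
    require WWW::Mechanize::Firefox;
    my $mech = WWW::Mechanize::Firefox->new();
    return sub {
        my ($url) = @_;
        $mech->get($url);                    # dies on errors, as seen at line 15
        return $mech->content_as_png;
    };
}
```

A call could look like `my @failed = process_urls(\@urls, firefox_fetcher(), \&save_png);`, where `save_png` is a hypothetical sub that receives the URL and the PNG data and writes the file. The returned list can be written to a file and retried later. Note that `eval` only catches errors that are raised; a page that hangs forever would additionally need some form of timeout, which this sketch does not cover.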


WWW::Mechanize::Firefox runs well: some attempts to make the
