Issues with a parser: WWW::Mechanize and MozRepl

Simply things that have nothing to do with XAMPP, Apache Friends, Apache, MySQL, PHP and all that. Just a bit of everything. ;)


Postby unleash » 25. October 2012 17:01

Good evening dear friends here at devshed,

Well, I have some trouble with a Perl script that turns out to be not 100% optimal. Now I am trying to find a better solution, either in Perl or Ruby, but if you have ideas to rework the Perl script, I would be glad too.

The question: is there a way to specify the Net::Telnet timeout with WWW::Mechanize::Firefox?
At the moment my internet connection [normally quite a fast DSL one] is very slow, and sometimes I get this error with $mech->get():

[PHP]
command timed-out at /usr/local/share/perl/5.12.3/MozRepl/Client.pm line 186
[/PHP]

So I tried this one:

[PHP]
$mech->repl->repl->timeout(100000);
[/PHP]

Unfortunately it does not work: Can't locate object method "timeout" via package "MozRepl"

The documentation says this should work:

[PHP]
$mech->repl->repl->setup_client( { extra_client_args => { timeout => 180 } } );
[/PHP]
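
Here is a minimal sketch of how I would place that call; note that putting setup_client() right after new() and before the first get() is my assumption about where it belongs, not something the documentation spells out:

[PHP]
# a minimal sketch, assuming the setup_client() call has to run before
# the first get() so the longer timeout applies to later commands
use strict;
use warnings;
use WWW::Mechanize::Firefox;

my $mech = WWW::Mechanize::Firefox->new();
$mech->repl->repl->setup_client(
    { extra_client_args => { timeout => 3 * 60 } }    # three minutes
);
$mech->get('http://www.google.com');
[/PHP]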
The problem: I have a list of 2500 websites and need to grab a thumbnail screenshot (!) of each of them. How do I do that?
I could try to parse the sites with Perl; Mechanize would be a good thing for that.
Note: I only need the results as thumbnails with a maximum of 240 pixels in the long dimension.
At the moment I have a solution which is slow and does not give back thumbnails. How do I make the script run faster, with less overhead, while spitting out the thumbnails? (See the Imager sketch further down.)

My prerequisites:

- the mozrepl addon for Firefox
- the module WWW::Mechanize::Firefox
- the module Imager

This is my source ... see a snippet [example] of the sites I have in the URL list.

urls.txt [the list of sources in a file]

www.google.com
www.cnn.com
www.msnbc.com
news.bbc.co.uk
www.bing.com
www.yahoo.com - and so on and so forth ...


What I have tried already; here it is:


[PHP]
#!/usr/bin/perl

use strict;
use warnings;
use WWW::Mechanize::Firefox;

my $mech = WWW::Mechanize::Firefox->new();

open( my $input, '<', 'urls.txt' ) or die "Cannot open urls.txt: $!";

while ( my $url = <$input> ) {
    chomp $url;
    print "$url\n";
    $mech->get($url);

    # grab the rendered page as a PNG
    my $png = $mech->content_as_png();

    # derive the file name from the URL
    my $name = $url;
    $name =~ s/^www\.//;
    $name .= '.png';

    open( my $output, '>', $name ) or die "Cannot write $name: $!";
    binmode $output;    # PNG data is binary
    print {$output} $png;
    close $output;

    sleep 5;
}
close $input;
[/PHP]

Well, this does not care about the size.

See the command-line output:

[PHP]linux-vi17:/home/martin/perl # perl mecha_test_1.pl
www.google.com
www.cnn.com
www.msnbc.com
command timed-out at /usr/lib/perl5/site_perl/5.12.3/MozRepl/Client.pm line 186
linux-vi17:/home/martin/perl #
[/PHP]

Question: how do I extend the solution to make sure that it does not stop on a timeout? Note again: I only need the results as thumbnails with a maximum of 240 pixels in the long dimension.
As a prerequisite, I have already installed the module Imager.
How do I make the script run faster, with less overhead, spitting out the thumbnails?
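
For the 240-pixel thumbnails, here is a minimal sketch of what I have in mind with Imager (the helper name save_thumbnail and the 240-pixel box are just my example values):

[PHP]
# a minimal sketch: shrink the PNG that content_as_png() returns so that
# the longer side is at most 240 pixels, using the Imager module
use strict;
use warnings;
use Imager;

sub save_thumbnail {
    my ( $png_data, $name ) = @_;

    # read the screenshot PNG from the in-memory scalar
    my $img = Imager->new;
    $img->read( data => $png_data, type => 'png' )
        or die "Cannot read PNG data: " . $img->errstr;

    # type => 'min' keeps the aspect ratio and fits the image inside
    # the 240 x 240 box, so the long side ends up at 240 pixels
    my $thumb = $img->scale(
        xpixels => 240,
        ypixels => 240,
        type    => 'min',
    );

    $thumb->write( file => $name, type => 'png' )
        or die "Cannot write $name: " . $thumb->errstr;
}
[/PHP]

In the loop above I would then call save_thumbnail($png, $name) instead of printing the raw PNG straight to the file.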


I also tried out this one here:

[PHP]$mech->repl->repl->setup_client( { extra_client_args => { timeout => 5*60 } } );
[/PHP]

Putting the links into @list and using eval:
[PHP]
# read all links into @list first
my @list;
open( my $input, '<', 'urls.txt' ) or die "Cannot open urls.txt: $!";
chomp( @list = <$input> );
close $input;

while ( scalar @list ) {
    my $link = shift @list;
    print "trying $link\n";
    eval {
        $mech->get($link);
        sleep 5;
        my $png = $mech->content_as_png();

        my $name = $link;    # was "$_", which is empty inside this loop
        $name =~ s/^www\.//;
        $name .= '.png';

        open( my $output, '>', $name ) or die "Cannot write $name: $!";
        binmode $output;
        print {$output} $png;
        close $output;
    };
    if ($@) {
        print "link: $link failed\n";
        push @list, $link;    # put it back at the end of the list
        next;
    }
    print "$link is done!\n";
}
[/PHP]
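
One thing I noticed while testing: because a failed link always goes back onto the list, a site that is permanently down keeps coming back and the loop never ends. Here is a minimal sketch of a retry cap (the limit of 3 tries is just my example value):

[PHP]
# a minimal sketch: count the attempts per link and give up after
# 3 tries, so one dead site cannot block the whole run
my %attempts;

while (@list) {
    my $link = shift @list;
    print "trying $link\n";
    eval {
        $mech->get($link);
        # ... take and save the screenshot as above ...
    };
    if ($@) {
        if ( ++$attempts{$link} < 3 ) {
            push @list, $link;    # re-queue for another try
        }
        else {
            print "link: $link failed for good\n";
        }
        next;
    }
    print "$link is done!\n";
}
[/PHP]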


Question: is there a Ruby / Python / PHP solution that runs more efficiently, or can you suggest a Perl solution that is more stable?


I look forward to hearing from you.

Thanks for any and all help in advance.

Have a great day!

Greetings,
your unleash