LWP, HTML::TokeParser - recursively parsing a forum like this one

Post by salsa_experience » 25 August 2006, 10:55

Hi all,

I'm a Perl newbie; I need to learn some Perl for my final thesis at university. The goal is a linguistic analysis of a forum, asking how people talk to each other and how the exchange works.

My object of study is a phpBB forum, and I want to read out its data as completely as possible, namely:

user data
forum categories
forum threads
forum postings

The intermediate goal: I need an overview that is backed by valid data. A solid data basis will decide whether the thesis succeeds.

When I run the following code to "monitor" a forum, I get the excerpt shown further down:

Code:
#!/usr/bin/perl
use strict;
use warnings;

use LWP::RobotUA;
use HTML::LinkExtor;
use HTML::TokeParser;
use URI::URL;

use Data::Dumper; # for show and troubleshooting

my $url = "http://www.nukeforums.com/forums/viewforum.php?f=17";
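# LWP::RobotUA requires an agent name and a contact address;
# both values here are placeholders (assumptions), replace them with your own
my $ua = LWP::RobotUA->new('forum-study/0.1', 'you@example.com');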
my $lp = HTML::LinkExtor->new(\&wanted_links);

my @links;
get_threads($url);

foreach my $page (@links) { # this loops over each link collected from the index
   my $r = $ua->get($page);
   if ($r->is_success) {
      my $stream = HTML::TokeParser->new(\$r->content) or die "Parse error in $page: $!";
      # just printing what was collected
      print Dumper get_thread($stream);
      # would instead have database insert statement at this point
    } else {
      warn $r->status_line;
    }
}

sub get_thread {
   my $p = shift;
   my ($title, $name, @thread);
   while (my $tag = $p->get_tag('a','span')) {
      if (exists $tag->[1]{'class'}) {
         if ($tag->[0] eq 'span') {
            if ($tag->[1]{'class'} eq 'name') {
               $name = $p->get_trimmed_text('/span');
            } elsif ($tag->[1]{'class'} eq 'postbody') {
               my $post = $p->get_trimmed_text('/span');
               push @thread, {'name'=>$name, 'post'=>$post};
            }
         } else {
            if ($tag->[1]{'class'} eq 'maintitle') {
               $title = $p->get_trimmed_text('/a');
            }
         }
      }
   }
   return {'title'=>$title, 'thread'=>\@thread};
}

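sub get_threads {
   my $page = shift;
   # Fetch the page that was passed in (not the global $url) and
   # feed each chunk of the response to the link extractor
   my $r = $ua->request(HTTP::Request->new(GET => $page), sub {$lp->parse($_[0])});
   $lp->eof; # signal end of document to the parser
   warn $r->status_line unless $r->is_success;
   # Expand URLs to absolute ones
   my $base = $r->base;
   @links = map { url($_, $base)->abs } @links;
   return \@links;
}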

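sub wanted_links {
   my($tag, %attr) = @_;
   # Only anchors that point at a topic page are of interest
   return unless $tag eq 'a' && exists $attr{'href'};
   return if $attr{'href'} !~ /^viewtopic\.php\?t=/;
   push @links, $attr{'href'}; # push just the href value
}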
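
To try the parsing logic without fetching anything from the forum, a fragment of markup can be handed to HTML::TokeParser as a string reference. The following is a minimal sketch, assuming the same class names the script above relies on (maintitle, name, postbody); the sample HTML and the function name parse_thread are invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Invented sample markup (an assumption); a real phpBB page is far more verbose.
my $html = <<'HTML';
<a class="maintitle" href="viewtopic.php?t=1">Sample topic</a>
<span class="name"><b>alice</b></span>
<span class="postbody">First post</span>
<span class="name"><b>bob</b></span>
<span class="postbody">A reply</span>
HTML

# Same approach as get_thread above: walk the <a> and <span> start tags
# and dispatch on the class attribute.
sub parse_thread {
    my ($doc) = @_;
    my $p = HTML::TokeParser->new(\$doc) or die "Cannot parse document";
    my ($title, $name, @posts);
    while (my $tag = $p->get_tag('a', 'span')) {
        my $class = $tag->[1]{class};
        next unless defined $class;
        if ($tag->[0] eq 'a' && $class eq 'maintitle') {
            $title = $p->get_trimmed_text('/a');
        } elsif ($tag->[0] eq 'span' && $class eq 'name') {
            $name = $p->get_trimmed_text('/span');
        } elsif ($tag->[0] eq 'span' && $class eq 'postbody') {
            push @posts, { name => $name, post => $p->get_trimmed_text('/span') };
        }
    }
    return { title => $title, thread => \@posts };
}

my $thread = parse_thread($html);
printf "%s: %d posts\n", $thread->{title}, scalar @{ $thread->{thread} };
```

Note that get_trimmed_text('/span') stops at the first closing span tag, so a post body that itself contains nested spans would be cut short with this approach.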


Code:

$VAR1 = {
          'thread' => [
                        {
                          'post' => 'Hello, I\'m pretty new to PHPNuke. I\'ve got my site up and running great! I\'m now starting to make modifications, add modules etc. I\'m using the most recent RavenPHP76. I want to display the 5 most recent forum posts at the top of the forum page. I\'m not sure if this functionality is built in, if so, how to activate. Or if there is a module or block made to do this. I looked at Raven\'s Collapsing Forum block but wasn\'t crazy about the format, and I don\'t want it to be collapsable. Thanks! mopho',
                          'name' => 'mopho'
                        },
                        {
                          'post' => 'hi there',
                          'name' => 'sail'
                        },
                        {
                          'post' => 'thanks for asking this; :not very sure if i got you right; Do you want to have a feed of the last forumthreads? guess the easiest way is to go to raven and ask how he did it. hth sail.',
                          'name' => 'sail'
                        },
                        {
                          'post' => 'Thanks. i found what I was looking for. It wasn\'t so easy to find! It\'s called glance_mod. mopho',
                          'name' => 'mopho'
                        },
                        {
                          'post' => 'hi there thx',
                          'name' => 'sail'
                        },
                        {
                          'post' => 'it sound interesting - i will have also a look i google after it - and try to find out more regards sailor',
                          'name' => 'sail'
                        }
                      ],
          'title' => 'Recent Forum Posts Module'
        };
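
For the linguistic side, a structure like the one dumped above can be walked directly. The following is a rough sketch that tallies posts and whitespace-separated words per author; it uses three of the posts from the dump, and the function name author_stats is an assumption, not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A subset of the dump above, in the structure that get_thread returns.
my $thread = {
    title  => 'Recent Forum Posts Module',
    thread => [
        { name => 'sail',  post => 'hi there' },
        { name => 'mopho', post => 'Thanks. i found what I was looking for. It wasn\'t so easy to find! It\'s called glance_mod. mopho' },
        { name => 'sail',  post => 'hi there thx' },
    ],
};

# Tally posts and whitespace-separated words per author.
sub author_stats {
    my ($t) = @_;
    my %stats;
    for my $entry (@{ $t->{thread} }) {
        my @words = split ' ', $entry->{post};
        $stats{ $entry->{name} }{posts}++;
        $stats{ $entry->{name} }{words} += @words;
    }
    return \%stats;
}

my $stats = author_stats($thread);
for my $name (sort keys %$stats) {
    printf "%-6s posts=%d words=%d\n",
        $name, $stats->{$name}{posts}, $stats->{$name}{words};
}
```

Splitting on whitespace is only a crude word count; punctuation stays attached to the words, which may or may not matter for the analysis.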


My question is this:


Can I use the code above to read out both of these categories?


http://www.nukeforums.com/forums/viewforum.php?f=3
http://www.nukeforums.com/forums/viewforum.php?f=17


Is that possible with the code above?
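
One way to approach several categories is to collect the topic links per index page and then loop over the forum ids. The following is a minimal sketch of the link-collection step only, assuming topic links have the form viewtopic.php?t=N as in the script above; the function name topic_links and the sample HTML are invented, and the network part is left as a comment.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::LinkExtor;
use URI;

# Return the absolute topic URLs found in $html, resolved against $base.
# Each topic id is kept once, even if the page links to it several times.
sub topic_links {
    my ($html, $base) = @_;
    my (%seen, @found);
    my $extor = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        return unless $tag eq 'a' && defined $attr{href};
        my ($id) = $attr{href} =~ /^viewtopic\.php\?t=(\d+)/ or return;
        return if $seen{$id}++;
        push @found, URI->new_abs($attr{href}, $base)->as_string;
    });
    $extor->parse($html);
    $extor->eof;
    return @found;
}

# Invented sample of an index page (an assumption, not real forum output).
my $sample = <<'HTML';
<a href="viewtopic.php?t=101">Topic A</a>
<a href="viewtopic.php?t=101&amp;start=15">2</a>
<a href="viewtopic.php?p=555#555">last post</a>
<a href="viewtopic.php?t=102">Topic B</a>
<a href="profile.php?mode=viewprofile&amp;u=7">alice</a>
HTML

my $base  = 'http://www.nukeforums.com/forums/viewforum.php?f=17';
my @links = topic_links($sample, $base);
print "$_\n" for @links;

# With network access, the same function could be applied per category:
#   for my $f (3, 17) {
#       my $url = "http://www.nukeforums.com/forums/viewforum.php?f=$f";
#       my $r   = $ua->get($url);   # $ua as in the script above
#       push @all, topic_links($r->content, $r->base) if $r->is_success;
#   }
```

Longer categories are probably spread over several index pages, so a complete crawl would also have to follow the pagination links of each viewforum page.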

Thanks in advance for any answer. PS: after that I would still have to load everything into a MySQL database. That is the next task.
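
For the database step, the hash returned by get_thread could be flattened into rows and written through a DBI handle with placeholders. This is only a sketch under assumptions: the table posts, its columns, the database name and the function names thread_rows and store_thread are all hypothetical and do not come from the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical schema (an assumption):
#   CREATE TABLE posts (
#       id INT AUTO_INCREMENT PRIMARY KEY,
#       thread_title VARCHAR(255), position INT,
#       author VARCHAR(64), body TEXT
#   );

# Flatten one thread into rows: [title, position, author, body].
sub thread_rows {
    my ($thread) = @_;
    my $i = 0;
    return map { [ $thread->{title}, ++$i, $_->{name}, $_->{post} ] }
           @{ $thread->{thread} };
}

# Insert all rows of a thread through a DBI database handle.
# Placeholders (?) leave the quoting of post text to the driver.
sub store_thread {
    my ($dbh, $thread) = @_;
    my $sth = $dbh->prepare(
        'INSERT INTO posts (thread_title, position, author, body) VALUES (?, ?, ?, ?)'
    );
    my $n = 0;
    for my $row (thread_rows($thread)) {
        $sth->execute(@$row);
        $n++;
    }
    return $n;
}

# Show the rows for two posts taken from the dump above.
my $demo = {
    title  => 'Recent Forum Posts Module',
    thread => [
        { name => 'sail', post => 'hi there' },
        { name => 'sail', post => 'hi there thx' },
    ],
};
print join(' | ', @$_), "\n" for thread_rows($demo);

# With DBI and DBD::mysql installed, the handle could be created like this
# (database name and credentials are placeholders):
#   use DBI;
#   my $dbh = DBI->connect('dbi:mysql:database=forum_study', $user, $pass,
#                          { RaiseError => 1 });
#   store_thread($dbh, get_thread($stream));
```

Keeping the row-building separate from the insert makes it possible to inspect what would be written before a database is involved.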

Many thanks for your help.

Greetings,
sals
