Extracting metatags for URLs with Apache Nutch

Postby gideoncaller » 14. February 2016 11:06

Hi everyone,

I'm trying to crawl several websites with Apache Nutch and extract only their title, keywords and description (and nothing else).
I have seen several examples of how to do that.
However, they all propose plugin configurations and settings that are complicated (at least for a Nutch newbie). Since my use case sounds like a very common one, I was wondering whether there is a simpler solution.
If there is no easier solution, can anyone at least explain the steps required to extract just these specific tags?

Thanks in advance
gideoncaller
 
Posts: 2
Joined: 14. February 2016 10:55
Operating System: Linux

Re: Extracting metatags for URLs with Apache Nutch

Postby Kobold90 » 22. February 2016 06:45

You can find this information here: https://clgiles.ist.psu.edu/IST441/materials/crawling/nutch-crawling-and-searching.pdf
Maybe it will be helpful.
Kobold90
 
Posts: 5
Joined: 13. January 2016 09:45
Operating System: win10

Re: Extracting metatags for URLs with Apache Nutch

Postby gideoncaller » 23. February 2016 09:38

Thanks for the reply!
I've looked at it, but it mostly deals with crawling and indexing. I'm not interested in the indexing part at all; I only want to fetch some parts of the HTML content. Is there any other guide?
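
To make the goal concrete, here is a minimal sketch of the extraction itself, outside Nutch. It is not a Nutch plugin and does not use any Nutch API; it relies only on Python's standard `html.parser` module. The names `MetaTagExtractor`, `extract_metatags` and `SAMPLE` are made up for illustration, and the sketch assumes the page's HTML has already been fetched into a string.

```python
from html.parser import HTMLParser


class MetaTagExtractor(HTMLParser):
    """Collects the <title> text and the keywords/description <meta> tags."""

    WANTED = {"keywords", "description"}

    def __init__(self):
        super().__init__()
        self.result = {"title": "", "keywords": "", "description": ""}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            # Meta names are matched case-insensitively ("Keywords" == "keywords").
            name = (attr.get("name") or "").lower()
            if name in self.WANTED:
                self.result[name] = (attr.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text may arrive in several chunks, so append rather than assign.
        if self._in_title:
            self.result["title"] += data


def extract_metatags(html):
    """Return a dict with the title, keywords and description of an HTML string."""
    parser = MetaTagExtractor()
    parser.feed(html)
    parser.close()
    parser.result["title"] = parser.result["title"].strip()
    return parser.result


# Hypothetical sample page, used only to demonstrate the function.
SAMPLE = (
    "<html><head><title> Example Page </title>"
    '<meta name="Keywords" content="nutch, crawler">'
    '<meta name="description" content="A test page.">'
    "</head><body><p>Everything else is ignored.</p></body></html>"
)

if __name__ == "__main__":
    print(extract_metatags(SAMPLE))
```

Everything outside the three fields is discarded, which is the behaviour I'm after; what I'm missing is how to get the same result from within a Nutch crawl.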
gideoncaller
 
Posts: 2
Joined: 14. February 2016 10:55
Operating System: Linux

