Extract Data From HTML Table With Mechanize
Solution 1:
A more succinct version that leans more heavily on the black magic of XPath :)
require 'nokogiri'
require 'open-uri'
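# Kernel#open no longer opens URLs as of Ruby 3.0, so use URI.open
doc = Nokogiri::HTML(URI.open('http://www.alpineascents.com/8000m-peaks.asp'))
# Fifth cell of the row that has a <strong> cell matching the argument
last_td = doc.xpath("//tr[td[strong[text()='#{ARGV[0]}']]]/td[5]")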
puts last_td.text.gsub(/.*?;/, '').strip
Solution 2:
I believe this is what you want (you will need to run gem install nokogiri first):
require 'nokogiri'
require 'open-uri'
# Kernel#open no longer opens URLs as of Ruby 3.0, so use URI.open
doc = Nokogiri::HTML(URI.open('http://www.alpineascents.com/8000m-peaks.asp'))
rows = doc.search('//table')[6]./('tr')
# Drop the first two rows (presumably headers)
rows.shift
rows.shift
rows.each do |row|
  if row.text.include? ARGV[0]
    puts row./('td')[4].text.gsub(/.*?;/, '').strip
  end
end
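Both solutions above finish with gsub(/.*?;/, '').strip, so here is a small sketch of what that cleanup actually does to a cell's text. The helper names are hypothetical; the regex is the one from the answers.

```ruby
# Hypothetical helper wrapping the cleanup used in the solutions.
def strip_year(cell_text)
  # .*?; lazily matches up to and including a semicolon; gsub repeats the
  # match, so every semicolon-terminated chunk is removed.
  cell_text.gsub(/.*?;/, '').strip
end

# Variant that removes only the first chunk (sub instead of gsub).
def strip_first_chunk(cell_text)
  cell_text.sub(/.*?;/, '').strip
end

puts strip_year('1953; Sir E. Hillary, T. Norgay')   # prints "Sir E. Hillary, T. Norgay"
puts strip_year('a; b; c')                           # prints "c"
puts strip_first_chunk('a; b; c')                    # prints "b; c"
```

If a cell could ever contain more than one semicolon and you only want the leading year gone, sub is probably the safer choice.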
Solution 3:
The first mistake that I see is that you are calling the following:
p=Mechanize.new.get('www.alpineascents.com/8000m-peaks.asp').body
Unfortunately, calling body on the result just returns the raw HTML of the response as one big string.
That is quite annoying to parse through, so I would recommend doing the following instead:
require 'mechanize'
p = Mechanize.new.get('http://www.alpineascents.com/8000m-peaks.asp')
This will return a Mechanize::Page object which you can play with (http://mechanize.rubyforge.org/Mechanize/Page.html).
With that object we can simply perform a search, which is Nokogiri's search, by doing the following:
elems = p.search('tr')
This will return all the tr elements as a Nokogiri::XML::NodeSet of Nokogiri::XML::Element objects, which we can use pretty cleanly to get the information that we want. Note that you may want to play around with all of this in IRB to figure out exactly what you need, but the idea should be clear from the following:
elems.first.search('td').last.text
which will return the text of the last td element in the first tr element found by the search above.
If you have any questions / want me to clarify feel free to ask away.
I have been hacking on things with mechanize for a long while now.
EDIT:
If you want to be able to look up the values using some argument, this is how I imagine you would solve the problem:
values = {}
elems.each do |e|
  td = e.search('td')
  # Skip rows without td cells (e.g. header rows), where td.first would be nil
  next if td.empty?
  values[td.first.text] = td.last.text
end
When you have the values hash filled you can do the following:
For example, if ARGV[0] == "Everest", then:
> values[ARGV[0]]
=> "1953; Sir E. Hillary, T. Norgay"