Wednesday, February 20, 2008

Ruby: How to Parse an HTML page and retrieve embedded table information

I couldn’t find a readily available HTML table parser for Ruby like the HTML::TABLE one I had for Perl, so I thought I’d try to make a basic one for a page that I like to grab information from. I also couldn’t find any good examples posted, so here is a sample method for extracting information from a table on an HTML page.

First, I usually start with something very simple: get the data and print it out. The main purpose is to verify that you are actually retrieving the data.


require 'net/http'
require 'uri'

url = 'http://your.url.here/'

# Fetch the page and keep only the response body (the raw HTML).
url_data = Net::HTTP.get_response(URI.parse(url)).body

p url_data

Second, look at the output data to see how the tables are marked up in the page.
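One way to inspect the markup without reading the whole dump is to list every opening table tag as the page writes it. This is only a sketch: `sample_html` is a made-up stand-in for `url_data`, and the pattern assumes the tags are not split across odd places.

```ruby
# Hypothetical sample standing in for url_data from the fetch step.
sample_html = '<html><body><TABLE border="1"><tr><td>a</td></tr></TABLE>' \
              '<table class="inner"><tr><td>b</td></tr></table></body></html>'

# Collect every opening <table ...> tag, ignoring case, exactly as written.
table_tags = sample_html.scan(/<table[^>]*>/i)

# Print a numbered list so you can see the case and attributes the page uses.
table_tags.each_with_index do |tag, i|
  puts "#{i + 1}: #{tag}"
end
```

Seeing the tags this way tells you whether the split pattern needs to be case-insensitive or has to allow for attributes.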

Next I add a piece that splits the page data along the different embedded tables. This will become a loop that accesses the individual tables. Right now I just want to print out all of the tables.

# The split pattern was cut off in the published post; /<table/ is the presumed original.
data = url_data.split(/<table/)

data.each do |x|
  p x
end
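To see what a split like this produces, here is a small sketch on a made-up string. It assumes the split is done on the opening table tag; `sample_html` and `pieces` are illustrative names, not part of the original script.

```ruby
# Hypothetical page fragment with some leading text and two tables.
sample_html = '<p>intro</p><table><tr><td>one</td></tr></table>' \
              '<table><tr><td>two</td></tr></table>'

# Splitting on the opening tag consumes "<table" itself, so each piece
# after the first begins with the remainder of that tag (here just ">").
pieces = sample_html.split(/<table/i)

pieces.each_with_index do |piece, i|
  puts "#{i}: #{piece}"
end
```

Note that piece 0 is whatever comes before the first table, so the count of pieces is one more than the count of tables. That matters when you later pick a piece by number.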

The table I want is 24 pieces down in the split. In this case I’m looking for an embedded table inside the original table (table 24 for this page), so this time I split on a different tag, the closing "forward slash" table tag. The page I’m looking at is very table-heavy in its layout. I reset a counter before going into the each loop over the split table pieces. Once I hit the table that I want to drill into, I start another loop. At this stage I’m printing out each entire row of that table.

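y = 0

data.each do |x|
  y = y + 1
  if y == 24
    # Everything before the first closing tag belongs to the embedded table.
    data_table = x.split(/<\/table>/)
    z = 0
    data_table.each do |t2|
      if z == 0
        # Split on the opening row tag rather than the bare letters "tr",
        # which would also break on any text containing "tr".
        data_row = t2.split(/<tr/)
        data_row.each do |row|
          p row
        end
      end
      # Advance the counter so only the first piece is processed.
      z = z + 1
    end
  end
end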
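The counters above can also be replaced by indexing straight into the split result. The following is a sketch on made-up data, not the page from this post: `sample_html` and `table_index` are hypothetical, and a 1-based counter value of 24 would correspond to index 23 here.

```ruby
# Hypothetical sample page; the second table is the one we want.
sample_html = '<table><tr><td>skip</td></tr></table>' \
              '<table><tr><td></td><td>Name</td><td>Widget</td></tr>' \
              '<tr><td></td><td>Price</td><td>9.99</td></tr></table>'

table_index = 2 # hypothetical; piece 0 is whatever precedes the first table

# Pick the wanted piece; to_s guards against an index past the end.
chunk = sample_html.split(/<table/i)[table_index].to_s

# Keep only the text before the first closing table tag.
table_body = chunk.split(/<\/table>/i).first.to_s

# Split into rows and drop the leftover before the first <tr.
rows = table_body.split(/<tr/i).drop(1)

rows.each { |row| p row }
```

Indexing keeps the nesting shallow, at the cost of hiding the step-by-step walk that the counters make visible.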

The next step is to isolate the individual elements in each row that you want to extract. In my case each row has three columns: column 1 is padded space, column 2 holds the labels, and column 3 holds the data values. So to display only the column 3 values I would loop through like this.

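y = 0 # reset the table counter before looping again

data.each do |x|
  y = y + 1
  if y == 24
    data_table = x.split(/<\/table>/)
    z = 0
    data_table.each do |t2|
      if z == 0
        data_row = t2.split(/<tr/)
        data_row.each do |row|
          # The split pattern was cut off in the published post. Splitting on
          # the closing cell tag is an assumption; it makes piece 3 the third column.
          data_column = row.split(/<\/td>/)
          c = 0
          data_column.each do |column|
            c = c + 1
            if c == 3
              p column
            end
          end
        end
      end
      # Advance the counter so only the first piece is processed.
      z = z + 1
    end
  end
end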
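As an alternative to counting columns in a loop, the cells of a row can be captured in one pass with `scan`. This is a sketch under assumptions: the `rows` array is invented sample data, every cell is assumed to have a closing tag, and no table is nested inside a cell.

```ruby
# Hypothetical rows in the three-column layout: padding, label, value.
rows = [
  '<tr><td>&nbsp;</td><td>Name</td><td>Widget</td></tr>',
  '<tr><td>&nbsp;</td><td>Price</td><td>9.99</td></tr>'
]

# For each row, capture the text between <td ...> and </td>, then take
# the third cell (index 2), which holds the data value.
values = rows.map do |row|
  cells = row.scan(/<td[^>]*>(.*?)<\/td>/im).flatten
  cells[2]
end

p values
```

Because the capture group excludes the cell tags themselves, the values come out with less markup left to clean up.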

After this point it’s a matter of cleaning up the extra HTML formatting tags, saving the values to local variables, and working with the data.
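The clean-up might look something like the sketch below. `clean_cell` is a hypothetical helper, not something from the original script, and it only handles simple tags plus the `&nbsp;` entity.

```ruby
# Hypothetical helper: reduce a raw cell fragment to its plain text.
def clean_cell(fragment)
  text = fragment.sub(/\A[^<>]*>/, '') # drop a leftover tag remainder such as ' class="x">'
  text = text.gsub(/<[^>]*>/, '')      # remove any complete tags
  text = text.gsub('&nbsp;', ' ')      # turn non-breaking space entities into spaces
  text.strip                           # trim surrounding whitespace
end

p clean_cell('<td><b>9.99</b>')
p clean_cell(' class="v"> Widget </td></tr>')
```

The first `sub` is there because splitting on an opening tag leaves the rest of that tag at the front of each piece.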
