I'm fairly new to Ruby and am experimenting with scraping a website for data. The code below is what I've put together after a few days of research; however, the output from Nokogiri is not as "clean" as I expected. When I print my array, I get a lot of line breaks ("\n") in the output. I'm hoping someone can provide some guidance on how to clean this up so I can assign labels to each piece of data for reference.
require 'httparty'
require 'nokogiri'
require 'open-uri'
require 'pry'
require 'csv'
# Fetch the page to scrape
page = HTTParty.get('http://ift.tt/2fZ2qfE')
# Turn the HTTP response body into a Nokogiri document so it can be parsed
parse_page = Nokogiri::HTML(page.body)
# Collect the text of each property address into an array
address_array = parse_page.css('li.srp-item-address.ellipsis').map do |a|
  a.text
end
# Since I can't get both items in a single pull, build a second array with the property details
details_array = parse_page.css('div.srp-item-body').map do |d|
  d.text
end
Pry.start(binding)
While in Pry, if I display "details_array" or "address_array", the output looks like:
[2] pry(main)> details_array => ["\n \n \n \n 2265 Tanglewood Cir NE,\n Atlanta,\n GA\n 30345\n
\n \n\n \n Dresden East\n \n \n
\n $289,900\n \n \n \n
3 bd\n 2 ba\n 1,566 sq ft\n
0.3 acres lot\n \n \n \n \n Single Family Home\n \n \n \n \n
Brokered by Re/Max Town And Country\n \n \n
\n \n \n Brokered by \n Re/Max Town And Country\n \n \n \n ", "\n \n
\n \n 2141 Dunwoody Gln,\n
Atlanta,\n GA\n 30338\n \n \n\n \n \n $469,900\n \n \n
\n 4 bd\n 3 ba\n 2,850 sq ft\n 0.3 acres lot\n 2 car\n
\n \n \n \n Single Family Home\n
\n \n \n \n Brokered by Buckhead Home Realty Llc\n \n \n \n
\n \n Brokered by \n Buckhead Home Realty Llc\n \n \n \n ", "\n \n
\n \n 1048 Martin St SE,\n
Atlanta,\n GA\n 30315\n \n \n\n \n Intown South\n Peoplestown\n \n \n \n $164,900\n \n \n \n
5 bd\n 3 ba\n 2,376 sq ft\n
7,405 sq ft lot\n \n \n \n \n
Single Family Home\n \n \n \n \n
Brokered by Greenlet Llc\n \n \n \n
\n \n Brokered by \n Greenlet Llc\n
\n \n \n ", "\n \n \n \n
1048 Martin St SE,\n Atlanta,\n GA\n
30315\n \n \n\n \n Intown South\n
Peoplestown\n \n \n \n $164,900\n
\n \n \n 5 bd\n 3 ba\n 2,055 sq ft\n 7,584 sq ft lot\n
\n \n \n \n Single Family Home\n
\n \n \n \n Brokered by Greenlet, Llc\n \n \n \n \n
\n Brokered by \n Greenlet, Llc\n \n
\n \n ", "\n \n \n \n
1991 Woodbine Ter NE,\n Atlanta,\n GA\n
30329\n \n \n\n \n Sagamore Hills\n
\n \n \n $299,900\n \n \n \n 3 bd\n 1+ ba\n 1,449 sq ft\n 0.8 acres lot\n \n \n
\n \n Single Family Home\n \n \n
\n :
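One possible way to clean up text like the output above is to split each string on its line breaks, trim every piece, and drop the pieces that end up empty. The sketch below uses only core Ruby string methods and no network access. The helper name clean_fields, the abbreviated sample string, and the hash keys are assumptions made for illustration; they are not part of Nokogiri or of the original script.

```ruby
# Hypothetical helper (not a Nokogiri method): split a node's text into
# trimmed, non-empty lines.
def clean_fields(raw_text)
  raw_text.split("\n").map(&:strip).reject(&:empty?)
end

# Abbreviated sample shaped like the first entry of details_array.
sample = "\n \n  2265 Tanglewood Cir NE,\n Atlanta,\n GA\n 30345\n \n $289,900\n \n 3 bd\n 2 ba\n 1,566 sq ft\n"

fields = clean_fields(sample)
p fields
# prints ["2265 Tanglewood Cir NE,", "Atlanta,", "GA", "30345", "$289,900", "3 bd", "2 ba", "1,566 sq ft"]

# Label a few pieces by their shape rather than by position, since some
# listings include extra lines (a neighborhood name, for example).
listing = {
  price: fields.find { |f| f.start_with?('$') },
  beds:  fields.find { |f| f.end_with?(' bd') },
  baths: fields.find { |f| f.end_with?(' ba') }
}
p listing
```

In the script itself this would probably mean mapping each node through the helper, for example clean_fields(d.text) in place of d.text. If a single-line string is enough, d.text.gsub(/\s+/, ' ').strip collapses every run of whitespace into one space instead.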