- Python - 2.7.5
- Google Chrome
First off, I am a self-taught coder and will accept any critique and/or suggestions on the code posted below. This issue has been a joy to work through because I love challenging myself, but I am afraid I have hit a brick wall and need some guidance. I will be as detailed as possible below to explain the overall picture of my script and then show where I am at with the actual issue described in the title.
I am putting together a script that will automatically download data, unzip it, and export it to a GDB. We serve a wide region of users and have a very large enterprise SDE setup containing a large amount of public data that we have to search for and update for our end users. Most of our data is updated monthly by local government entities, and we currently have to go out and find the data manually, download it, unzip it, QAQC it, etc. I want to put together a script that automates the first part of this process by downloading all of my data for me and exporting it to a local GDB; from there I can QAQC everything and upload it to our SDE for our users to access.
The process has been pretty straightforward so far, until I hit the issue now in front of me. My script searches a webpage for specific keywords, finds the relevant link, and begins the download. For this post I will use two examples: one that works and one that is currently giving me issues. What works is my function for searching and downloading the Metro GIS dataset, and the code below shows my current process for finding it. So far, every http website I have included uses the function posted below. As with Metro, I plan on having a defined function for each group of data.
import requests, zipfile, StringIO, time, arcpy, urllib2, urlparse
from BeautifulSoup import BeautifulSoup

arcpy.env.overwriteOutput = True

workPath = --  # The output GDB
timestr = time.strftime("%Y%m%d")
gdbName = "GlobalSDEUpdate_" + timestr
gdbPath = workPath + "\\" + gdbName + ".gdb"


class global_DataFinder(object):
    def __init__(self):
        object.__init__(self)
        self.gdbSetup()
        self.metro()

    def gdbSetup(self):
        arcpy.CreateFileGDB_management(workPath, gdbName)

    def fileDownload(self, key, url, dlPath, dsName):
        # Collect every .zip link on the page whose URL contains the keyword
        page = urllib2.urlopen(url).read()
        urlList = []
        soup = BeautifulSoup(page)
        soup.prettify()
        for link in soup.findAll('a', href=True):
            if 'http://' not in link['href']:
                if urlparse.urljoin(url, link['href']) not in urlList:
                    zipDL = urlparse.urljoin(url, link['href'])
                    if zipDL.endswith(".zip"):
                        if key in zipDL:
                            urlList.append(zipDL)

        # Download and extract each matching zipfile
        for x in urlList:
            print x
            r = requests.get(x, stream=True)
            z = zipfile.ZipFile(StringIO.StringIO(r.content))
            z.extractall(dlPath)

        # Copy the extracted shapefiles into a new feature dataset in the GDB
        arcpy.CreateFeatureDataset_management(gdbPath, dsName)
        arcpy.env.workspace = dlPath
        shpList = []
        for shp in arcpy.ListFeatureClasses():
            shpList.append(shp)
        arcpy.FeatureClassToGeodatabase_conversion(shpList, (gdbPath + "\\" + dsName))
        del shpList[:]

    def metro(self):
        key = "METRO_GIS_Data_Layers"
        url = "http://ift.tt/2oQO6JM"
        dlPath = --  # Where my zipfiles output to
        dsName = "Metro"
        self.fileDownload(key, url, dlPath, dsName)


global_DataFinder()
As you can see above, this is the method I started with, using Metro as my first test case, and it is currently working great. I was hoping all my sites going forward would work like this, but when I got to FEMA I ran into an issue.
The National Flood Hazard Layer (NFHL) Status website hosts floodplain data for many counties across the country and is available for free to anyone who wishes to use it. When you arrive at the website you can search for the county you want, the table queries out the search, and then you can simply click and download the county you desire. When checking the source, this is what I came across, and I noticed the table is inside an iframe.
When accessing the iframe source link through Chrome and checking the png source URL, this is what you get - http://ift.tt/2oR5YnS
Now here is where my problem lies. Unlike the http sites, I have quickly learned that accessing a secured https site and scraping the page is different, especially when it is using JavaScript to show the table. I have spent hours searching through forums and have tried different Python packages like selenium, mechanize, requests, urllib, and urllib2, and I always seem to hit a dead end before I can securely establish a connection, parse the webpage, and search for my county's zipfile. The code below shows the closest I have gotten and the error I am getting.
(I always test in a separate script and then, when it works, I bring it over to my main script, so that's why the code snippet below is separate from my original.)
import urllib2, httplib, socket, ssl
from BeautifulSoup import BeautifulSoup

url = "http://ift.tt/2otNehp"


def test():
    # Find the iframe on the page and try to open its source
    page = urllib2.urlopen(url).read()
    urlList = []
    soup = BeautifulSoup(page)
    soup.prettify()
    for link in soup.findAll("iframe", src=True):
        r = urllib2.urlopen(link['src'])
        iFrame = link['src']
        print iFrame


def connect_patched(self):
    "Connect to a host on a given (SSL) port."
    sock = socket.create_connection((self.host, self.port),
                                    self.timeout, self.source_address)
    if self._tunnel_host:
        self.sock = sock
        self._tunnel()
    self.sock = ssl.wrap_socket(sock, self.key_file, self.cert_file,
                                ssl_version=ssl.PROTOCOL_SSLv2)

# Monkey-patch httplib so every HTTPS connection is forced to use SSLv2
httplib.HTTPSConnection.connect = connect_patched

test()
This is the error I get when running the test:
urllib2.URLError: urlopen error [Errno 6] _ssl.c:504: TLS/SSL connection has been closed
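From the reading I have done, I suspect (though I am not certain) that forcing ssl.PROTOCOL_SSLv2 in the patch above may itself be part of the problem, since most servers have SSLv2 disabled. One workaround I have seen suggested is mounting a requests transport adapter that forces TLSv1 instead; a rough, untested sketch of that idea is below (the https URL is just a placeholder, not the real NFHL address):

import ssl
import requests
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.poolmanager import PoolManager


class TLSv1Adapter(HTTPAdapter):
    """Transport adapter that forces TLSv1 on every HTTPS connection."""
    def init_poolmanager(self, connections, maxsize, block=False):
        self.poolmanager = PoolManager(num_pools=connections,
                                       maxsize=maxsize,
                                       block=block,
                                       ssl_version=ssl.PROTOCOL_TLSv1)


s = requests.Session()
s.mount('https://', TLSv1Adapter())
r = s.get('https://example.com/NFHL')  # placeholder URL, not the real page
print r.status_code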
I am hoping a more experienced coder can look at what I have done and tell me if my current methods are the way to go and, if so, how to get past this final error and parse the data table properly.
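For the JavaScript-rendered table itself, the other direction I have been looking at is letting a real browser render the page and then handing the HTML to BeautifulSoup, rather than fighting the handshake in urllib2. A rough sketch of that idea is below (it assumes chromedriver is installed and on PATH, and the URL is a placeholder for the iframe source found above); I have not gotten this working end to end, which is partly why I am asking whether it is even the right tool here.

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from BeautifulSoup import BeautifulSoup

driver = webdriver.Chrome()  # assumes chromedriver is installed and on PATH
driver.get("https://example.com/nfhl_iframe_source")  # placeholder for the iframe source URL

# Wait until the JavaScript table has rendered at least one .zip link
WebDriverWait(driver, 30).until(lambda d: '.zip' in d.page_source)

soup = BeautifulSoup(driver.page_source)
for link in soup.findAll('a', href=True):
    if link['href'].endswith('.zip'):
        print link['href']

driver.quit()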