Saturday, August 31, 2019

Web scraping multiple repeated tags from each page across multiple webpages

I am trying to scrape data from multiple saved webpages, each of which contains several repetitions of the same group of tags (see the example HTML below).

I've been able to pull the data from each page and store the first tag_row block, but I can't get the remaining ones. I suspect I need another for loop, but I'm not sure where to add it, if that is what's required.

import os

import pandas as pd
from bs4 import BeautifulSoup

df_list = []
folder = 'camupdate'
for cam_data in os.listdir(folder):
    with open(os.path.join(folder, cam_data), encoding="utf8") as file:
        soup = BeautifulSoup(file, 'html.parser')
        title = soup.find('title').contents[0]
        # Comes back as None for every page (see Actual below)
        main_title = soup.find('div', id_='login_inputs')
        # find() only returns the first tag_row on the page
        tag = soup.find('div', class_='tag_row')
        tag_title = tag.find('span').contents
        viewers = tag.find_all('span')[1].contents
        rooms = tag.find_all('span')[2].contents
        df_list.append({'title': title,
                        'main_title': main_title,
                        'tag_title': tag_title,
                        'viewers': viewers,
                        'rooms': rooms})

df = pd.DataFrame(df_list, columns=['title', 'main_title', 'tag_title', 'viewers', 'rooms'])
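One way the extra loop could look is sketched below: a minimal version that swaps `find` for `find_all` so every `tag_row` block yields its own row. It assumes each row is a properly closed `<div class="tag_row">` holding `tag`, `viewers`, and `rooms` spans. `SAMPLE_HTML` and `rows_from_page` are hypothetical names used only for illustration, and the inline HTML is a stand-in for one saved page, not the real file.

```python
from bs4 import BeautifulSoup

# Hypothetical, well-formed stand-in for one saved page (not the real file).
SAMPLE_HTML = """
<html><head><title>Tags</title></head><body>
<div class="tag_row">
  <span class="tag"><a href="/tag/x1/" title="x1">x1</a></span>
  <span class="viewers">48804</span>
  <span class="rooms">550</span>
</div>
<div class="tag_row">
  <span class="tag"><a href="/tag/x2/" title="x2">x2</a></span>
  <span class="viewers">22067</span>
  <span class="rooms">400</span>
</div>
</body></html>
"""


def rows_from_page(html):
    """Return one dict per tag_row div found in the page."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("title").get_text(strip=True)
    rows = []
    # find_all returns every matching div, whereas find stops at the first.
    for tag in soup.find_all("div", class_="tag_row"):
        rows.append({
            "title": title,
            "tag_title": tag.find("span", class_="tag").get_text(strip=True),
            "viewers": tag.find("span", class_="viewers").get_text(strip=True),
            "rooms": tag.find("span", class_="rooms").get_text(strip=True),
        })
    return rows


rows = rows_from_page(SAMPLE_HTML)
for row in rows:
    print(row)
```

In the per-file loop, the returned list could be added with `df_list.extend(...)` in place of the single `df_list.append(...)`, so each page contributes as many rows as it has `tag_row` blocks.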


Example of the HTML:

# I am trying to pull the page number (page=1) from this tag
<div id="login_inputs">
<form method='post' action='/auth/login/?next=/tags/?page=1' target="_top">

<div class="tag_row">
<span class="tag">
<a href="/tag/x1/" title="x1">x1</a>
</span>
<span class="viewers">48804</span>
<span class="rooms">550</span>
</div>

<div class="tag_row">
<span class="tag">
<a href="/tag/x2/" title="x2">x2</a>
</span>
<span class="viewers">22067</span>
<span class="rooms">400</span>
</div>

<div class="tag_row">
<span class="tag">
<a href="/tag/x3/" title="x3">x3</a>
</span>
<span class="viewers">12857</span>
<span class="rooms">253</span>
</div>

# I am trying to pull the page number (page=2) from this tag
<div id="login_inputs">
<form method='post' action='/auth/login/?next=/tags/?page=2' target="_top">

<div class="tag_row">
<span class="tag">
<a href="/tag/y1/" title="y1">y1</a>
</span>
<span class="viewers">1425</span>
<span class="rooms">16</span>
</div>

<div class="tag_row">
<span class="tag">
<a href="/tag/y2/" title="y2">y2</a>
</span>
<span class="viewers">785</span>
<span class="rooms">32</span>
</div>

<div class="tag_row">
<span class="tag">
<a href="/tag/y3/" title="y3">y3</a>
</span>
<span class="viewers">492</span>
<span class="rooms">12</span>
</div>
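The page number in the form's `action` attribute above could be pulled out roughly as follows. This is a sketch under assumptions: the form is taken to sit inside the `login_inputs` div and to be closed properly, and `SAMPLE_HTML` and `page_label` are hypothetical names. Note that BeautifulSoup expects the keyword `id`, not `id_`; only `class` needs the trailing underscore because it is a reserved word in Python.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical, well-formed stand-in for the login block of one saved page.
SAMPLE_HTML = """
<div id="login_inputs">
<form method='post' action='/auth/login/?next=/tags/?page=2' target="_top">
</form>
</div>
"""


def page_label(html):
    """Return 'Page N' from the login form's action, or None if not found."""
    soup = BeautifulSoup(html, "html.parser")
    # The keyword is id, not id_; id_ would look for an attribute named "id_".
    box = soup.find("div", id="login_inputs")
    form = box.find("form") if box else None
    if form is None:
        return None
    # Look for "page=<digits>" anywhere in the action URL.
    match = re.search(r"[?&]page=(\d+)", form.get("action", ""))
    return f"Page {match.group(1)}" if match else None


label = page_label(SAMPLE_HTML)
print(label)
```

The returned label could then be stored as `main_title` for every row scraped from the same page.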


Actual

    title   main_title  tag_title   viewers    rooms
0   Tags    None        x1          [48804]    [388]
1   Tags    None        y1          [1425]     [16]


Expected

    title   main_title  tag_title   viewers    rooms
0   Tags    Page 1      x1          [48804]    [550]
1   Tags    Page 1      x2          [22067]    [400]
2   Tags    Page 1      x3          [12857]    [253]
3   Tags    Page 2      y1          [1425]     [16]
4   Tags    Page 2      y2          [785]      [32]
5   Tags    Page 2      y3          [492]      [12]



