I am trying to scrape data from multiple saved web pages, each of which contains several repetitions of the same tags. See the example below.
I can pull the data from each page and store the first occurrence of the table tags, but I can't get the remaining ones. I suspect I need another for loop somewhere, but I'm not sure where to add it, if that is what's required.
import os

import pandas as pd
from bs4 import BeautifulSoup

df_list = []
folder = 'camupdate'
# Parse every saved page in the folder
for cam_data in os.listdir(folder):
    with open(os.path.join(folder, cam_data), encoding="utf8") as file:
        soup = BeautifulSoup(file)
        title = soup.find('title').contents[0]
        # This comes back as None (see the Actual output below)
        main_title = soup.find('div', id_='login_inputs')
        # find() returns only the first matching div on the page
        tag = soup.find('div', class_='tag_row')
        tag_title = tag.find('span').contents
        viewers = tag.find_all('span')[1].contents
        rooms = tag.find_all('span')[2].contents
        df_list.append({'title': title,
                        'main_title': main_title,
                        'tag_title': tag_title,
                        'viewers': viewers,
                        'rooms': rooms})
df = pd.DataFrame(df_list, columns=['title', 'main_title', 'tag_title', 'viewers', 'rooms'])
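One possible way to collect every row rather than only the first is to replace `soup.find` with `soup.find_all` and loop over the result. The sketch below is only an illustration under assumptions: the inline `sample_html` string and the helper name `extract_rows` are made up for this example, and it assumes each row is a `div` with class `tag_row` holding `tag`, `viewers`, and `rooms` spans, as in the example HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical sample standing in for one saved page (assumed structure).
sample_html = """
<html><head><title>Tags</title></head><body>
<div class="tag_row">
  <span class="tag"><a href="/tag/x1/" title="x1">x1</a></span>
  <span class="viewers">48804</span>
  <span class="rooms">550</span>
</div>
<div class="tag_row">
  <span class="tag"><a href="/tag/x2/" title="x2">x2</a></span>
  <span class="viewers">22067</span>
  <span class="rooms">400</span>
</div>
</body></html>
"""


def extract_rows(html):
    """Return one dict per tag_row div (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("title").get_text(strip=True)
    rows = []
    # find_all returns every matching div; find returns only the first.
    for row in soup.find_all("div", class_="tag_row"):
        rows.append({
            "title": title,
            "tag_title": row.find("span", class_="tag").get_text(strip=True),
            "viewers": row.find("span", class_="viewers").get_text(strip=True),
            "rooms": row.find("span", class_="rooms").get_text(strip=True),
        })
    return rows


rows = extract_rows(sample_html)
```

Inside the loop over files, something like `df_list.extend(extract_rows(file.read()))` would then add one entry per row instead of one per page.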
Example of the HTML:
# Page 1: I am trying to pull the page number (page=1) from the form tag below
<div id="login_inputs">
  <form method='post' action='/auth/login/?next=/tags/?page=1' target="_top">
    <!-- ... -->
  </form>
</div>
<div class="tag_row">
  <span class="tag">
    <a href="/tag/x1/" title="x1">x1</a>
  </span>
  <span class="viewers">48804</span>
  <span class="rooms">550</span>
</div>
<div class="tag_row">
  <span class="tag">
    <a href="/tag/x2/" title="x2">x2</a>
  </span>
  <span class="viewers">22067</span>
  <span class="rooms">400</span>
</div>
<div class="tag_row">
  <span class="tag">
    <a href="/tag/x3/" title="x3">x3</a>
  </span>
  <span class="viewers">12857</span>
  <span class="rooms">253</span>
</div>
# Page 2: I am trying to pull the page number (page=2) from the form tag below
<div id="login_inputs">
  <form method='post' action='/auth/login/?next=/tags/?page=2' target="_top">
    <!-- ... -->
  </form>
</div>
<div class="tag_row">
  <span class="tag">
    <a href="/tag/y1/" title="y1">y1</a>
  </span>
  <span class="viewers">1425</span>
  <span class="rooms">16</span>
</div>
<div class="tag_row">
  <span class="tag">
    <a href="/tag/y2/" title="y2">y2</a>
  </span>
  <span class="viewers">785</span>
  <span class="rooms">32</span>
</div>
<div class="tag_row">
  <span class="tag">
    <a href="/tag/y3/" title="y3">y3</a>
  </span>
  <span class="viewers">492</span>
  <span class="rooms">12</span>
</div>
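The page number sits in the query string of the form's `action` attribute, nested inside the `next` parameter. Here is a minimal sketch of pulling it out with the standard library's `urllib.parse`, assuming the action always has the shape shown in the example HTML. The helper name `page_from_action` is hypothetical.

```python
from urllib.parse import urlparse, parse_qs


def page_from_action(action):
    """Return the page number embedded in a login form's action URL, or None."""
    # Outer query string, e.g. next=/tags/?page=1
    outer = parse_qs(urlparse(action).query)
    next_urls = outer.get("next")
    if not next_urls:
        return None
    # Inner query string of the "next" URL, e.g. page=1
    inner = parse_qs(urlparse(next_urls[0]).query)
    pages = inner.get("page")
    return int(pages[0]) if pages else None


page = page_from_action("/auth/login/?next=/tags/?page=1")
```

With BeautifulSoup, the action string could probably be read with `soup.find('div', id='login_inputs').find('form')['action']`. Note the keyword is `id`, not `id_`: only `class_` takes the trailing underscore, so `id_='login_inputs'` looks for an attribute literally named `id_` and finds nothing.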
Actual:
   title  main_title  tag_title  viewers  rooms
0  Tags   None        x1         [48804]  [388]
1  Tags   None        y1         [1425]   [16]
Expected:
   title  main_title  tag_title  viewers  rooms
0  Tags   Page 1      x1         [48804]  [550]
1  Tags   Page 1      x2         [22067]  [400]
2  Tags   Page 1      x3         [12857]  [253]
3  Tags   Page 2      y1         [1425]   [16]
4  Tags   Page 2      y2         [785]    [32]
5  Tags   Page 2      y3         [492]    [12]