Goal: search Baidu Tieba for the keyword "克拉玛依大火" (the Karamay fire) using "全吧搜索" (search across all bars), then select "只看主题帖" (show topic posts only).
Now I need to scrape the user ID and the corresponding comment content out of every topic post.
The code is as follows:
from bs4 import BeautifulSoup
import requests
import time

# Pretend to be a regular browser so Tieba serves the normal page.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36'
}

def get_links(url):
    # Fetch one page of search results and visit each thread it links to.
    wb_data = requests.get(url, headers=headers)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    links = soup.select('div.s_post_list > div:nth-of-type(1) > span > a')
    for link in links:
        href = 'http://tieba.baidu.com' + link.get('href')
        get_info(href)

def get_info(url):
    # Fetch one thread page and print each poster's name with their post text.
    wb_data = requests.get(url, headers=headers)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    names = soup.select('#j_p_postlist > div:nth-of-type(1) > div.d_author > ul > li.d_name > a')
    contents = soup.select('#post_content_111337133239')
    for name, content in zip(names, contents):
        data = {
            'name': name.get_text(),
            'content': content.get_text()
        }
        print(data)

if __name__ == '__main__':
    urls = ['http://tieba.baidu.com/f/search/res?isnew=1&kw=&qw=%BF%CB%C0%AD%C2%EA%D2%C0%B4%F3%BB%F0&rn=10&un=&only_thread=1&sm=1&sd=&ed=&pn={}'.format(number) for number in range(1, 2)]
    for single_url in urls:
        get_links(single_url)
        time.sleep(1)
However, when I run the code now it only prints a single result. Could I ask everyone: is the problem caused by the "div:nth-of-type(1)" in this step, "links=soup.select('div.s_post_list > div:nth-of-type(1) > span > a')"? How should I change "div:nth-of-type(1)" so that the loop prints every result? Thanks!
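For reference, here is a rough sketch of the change I am considering for get_links(): dropping the :nth-of-type(1) so that select() matches every result block instead of only the first one. The div.s_post class is only my guess at how each result is marked up on the search page, so it may need adjusting against the actual HTML:

def get_links(url):
    wb_data = requests.get(url, headers=headers)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    # Match the title link inside every result block, not just the first child div.
    # 'div.s_post' is an assumption about the per-result container class.
    links = soup.select('div.s_post_list div.s_post span a')
    for link in links:
        href = 'http://tieba.baidu.com' + link.get('href')
        get_info(href)

If that is the right direction, I suppose get_info() has the same kind of problem with its hard-coded '#post_content_111337133239' selector, since an id can only match one post on the page?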