Preface
It had been a while since I last used selenium. In a scraping group I happened to see a request: the site being scraped rate-limits requests. I suggested just simulating a human with a no-frills selenium script, spent about five minutes writing one, and took the chance to brush up on the newer selenium API.
import time

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

url = "http://search.ccgp.gov.cn/"
options = Options()
options.add_argument("--incognito")  # incognito mode
options.add_experimental_option("excludeSwitches", ["enable-automation"])
# Selenium 4: the driver path goes through a Service object
# (the old executable_path keyword is deprecated/removed)
driver = webdriver.Chrome(
    service=Service(r"D:\chrome driver\chromedriver"), options=options
)
# Hide navigator.webdriver so simple in-page bot checks come back empty
driver.execute_cdp_cmd(
    "Page.addScriptToEvaluateOnNewDocument",
    {
        "source": """
        Object.defineProperty(navigator, 'webdriver', {
            get: () => undefined
        })
        """
    },
)
driver.get(url)
time.sleep(1)
driver.find_element(By.XPATH, '//*[@id="kw"]').click()
time.sleep(1)
driver.find_element(By.XPATH, '//*[@id="kw"]').send_keys("供应商")
time.sleep(1)
driver.find_element(By.XPATH, '//*[@id="doSearch1"]').click()
time.sleep(1)
soup = BeautifulSoup(driver.page_source, "html.parser")
li = soup.find("ul", class_="vT-srch-result-list-bid").find_all("li")
pd.DataFrame(
    [
        (
            i.find("a").text.replace(" ", "").replace("\n", ""),
            i.find("span").text.replace(" ", "").replace("\n", ""),
        )
        for i in li
    ]
)
The main piece still missing is having selenium page through the result list and keep appending rows; remember to keep all the time.sleep calls. This was dashed off quickly, so it surely has other problems; if more requirements come up later I'll deal with them then.