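Python selenium: attaching to a browser opened on a specified port to continue a session
When scraping data with selenium, the usual workflow has selenium do everything, starting from launching the browser. Sometimes, though, we want the user to open the browser themselves, navigate to the target page, and complete login and authentication first (username, password, SMS verification code, and the various hard-to-automate image CAPTCHAs), and only then let selenium take over from the logged-in page and scrape the data. How can these two stages be joined together?
The usual approach
The usual approach looks like the code below: set the initial options, then call get directly to open the page.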
from selenium import webdriver
from selenium.webdriver.chrome.service import Service


class DriverClass:
    def __init__(self):
        self.driver = self._init_driver()

    def _init_driver(self):
        option = webdriver.ChromeOptions()
        # Hide the usual automation hints
        option.add_experimental_option('excludeSwitches', ['enable-automation'])
        option.add_experimental_option('useAutomationExtension', False)
        # Turn off the password manager prompts
        prefs = dict()
        prefs['credentials_enable_service'] = False
        prefs['profile.password_manager_enabled'] = False
        prefs['profile.name'] = "Person 1"
        option.add_experimental_option('prefs', prefs)
        option.add_argument('--disable-gpu')
        option.add_argument("--disable-blink-features=AutomationControlled")
        option.add_argument('--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36')
        option.add_argument('--no-sandbox')
        option.add_argument('--ignore-certificate-errors')
        # Selenium 4 takes the chromedriver path through a Service object
        driver = webdriver.Chrome(service=Service(r"./driver/chromedriver.exe"), options=option)
        driver.implicitly_wait(2)
        driver.maximize_window()
        return driver

    def get_driver(self) -> webdriver.Chrome:
        if isinstance(self.driver, webdriver.Chrome):
            return self.driver
        raise Exception('Failed to initialize the browser')


if __name__ == '__main__':
    dc = DriverClass()
    driver = dc.get_driver()
    print(driver)
    driver.get("https://www.baidu.com")
Attaching to a running browser
Attaching works by using the same port on both sides: the browser is launched with a given remote debugging port, and selenium connects to that same port (otherwise selenium would not know which browser to take over).
The user opens the browser
When opening the browser manually, the user specifies the port (9527 here) and a user data directory (any directory of your choosing).
C:\Program Files\Google\Chrome\Application>chrome.exe --remote-debugging-port=9527 --user-data-dir="E:\lky_project\tmp_project\handle_qcc_data\chrome_user_data"
Running the command above opens a new browser window.
Once the browser is open, the user can go to the relevant page by hand and complete login, authentication, and so on.
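Before attaching, it can help to confirm that a browser is really listening on the chosen port. The sketch below is one possible check using only the standard library; it assumes the browser was started with --remote-debugging-port as above and queries Chrome's DevTools HTTP endpoint /json/version. The function name debug_endpoint_info is made up for this illustration.

```python
import http.client
import json
import urllib.request


def debug_endpoint_info(host="127.0.0.1", port=9527, timeout=2.0):
    """Return the parsed /json/version payload, or None if nothing usable answers."""
    url = f"http://{host}:{port}/json/version"
    # Bypass any system proxy: the debugging port is a local endpoint.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    except (OSError, ValueError, http.client.HTTPException):
        # Connection refused, timeout, or a reply that is not valid JSON
        return None


if __name__ == "__main__":
    info = debug_endpoint_info()
    if info is None:
        print("No browser is listening on 127.0.0.1:9527")
    else:
        print(info.get("Browser"), info.get("webSocketDebuggerUrl"))
```

If the function returns None, the browser was most likely started without the port flag, or on a different port.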
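The program attaches to the browser
selenium attaches to the browser the user has already opened on that port once the following option is added:
option.add_experimental_option("debuggerAddress", "127.0.0.1:9527")
With this option selenium does not launch a new browser. From then on, the program drives the existing browser through the returned driver handle and carries on with the remaining tasks.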
driver_class.py
from selenium import webdriver
from selenium.webdriver.chrome.service import Service


class DriverClass:
    def __init__(self):
        self.driver = self._init_driver()

    def _init_driver(self):
        option = webdriver.ChromeOptions()
        # option.add_experimental_option('excludeSwitches', ['enable-automation'])
        # option.add_experimental_option('useAutomationExtension', False)
        # prefs = dict()
        # prefs['credentials_enable_service'] = False
        # prefs['profile.password_manager_enabled'] = False
        # prefs['profile.name'] = "Person 1"
        # option.add_experimental_option('prefs', prefs)
        option.add_argument('--disable-gpu')
        option.add_argument("--disable-blink-features=AutomationControlled")
        option.add_argument('--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36')
        option.add_argument('--no-sandbox')
        option.add_argument('--ignore-certificate-errors')
        # Attach to the browser the user started on port 9527
        option.add_experimental_option("debuggerAddress", "127.0.0.1:9527")
        # Selenium 4 takes the chromedriver path through a Service object
        driver = webdriver.Chrome(service=Service(r"./driver/chromedriver.exe"), options=option)
        driver.implicitly_wait(2)
        # driver.maximize_window()
        return driver

    def get_driver(self) -> webdriver.Chrome:
        if isinstance(self.driver, webdriver.Chrome):
            return self.driver
        raise Exception('Failed to initialize the browser')


if __name__ == '__main__':
    dc = DriverClass()
    driver = dc.get_driver()
    print(driver)
    # From here on the program uses the attached browser handle `driver` for the remaining work
Notes
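Note that in the attach version above, some of the option settings are commented out. Attaching means continuing from a browser that is already running; those options were fixed when the user launched the browser, so they can no longer be set again from the attaching session.
A practical example
For example, after manually opening a browser on port 9527, log in to 企查查, go to advanced search, and then use the program to fetch the number of companies holding each qualification (operating too frequently may trigger verification or an account ban, so proceed with care!). The final result is written to data.json. Because a run may be interrupted partway, the script below uses data.json as a checkpoint, so a later run only queries the qualifications that have not been queried yet.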
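The checkpointing idea can be shown on its own, separate from the page-specific code. The following is a minimal sketch, not the article's script: load_checkpoint, run_with_checkpoint, and the query callback are hypothetical names standing in for whatever actually fetches one count.

```python
import json


def load_checkpoint(path):
    """Load previously saved results, or start empty if none are usable."""
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except (OSError, ValueError):
        return {}


def run_with_checkpoint(names, query, path="data.json"):
    """Query only the names missing from the checkpoint; save even on failure."""
    data = load_checkpoint(path)
    try:
        for name in names:
            if data.get(name):  # already answered in an earlier run
                continue
            data[name] = query(name)
    finally:
        # Runs on success and on error, so finished work is not lost.
        with open(path, "w", encoding="utf-8") as f:
            json.dump(data, f, indent=2, ensure_ascii=False)
    return data
```

Writing the file in a finally block is what makes the rerun cheap: whatever was collected before an exception is already on disk, and the next run skips it.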
The driver_class.py shown above can be reused as is.
main.py
import json
import re
import time

from selenium.webdriver.common.by import By

from driver_class import DriverClass

dc = DriverClass()
driver = dc.get_driver()
# The label must match the page text exactly ("资质证书" is the qualification certificate filter)
xpath_prefix = '//div/div/div/div/span[text()="资质证书"]/following-sibling::div'
def checkbox_select(element_checkbox):
    """Tick the checkbox if it is not already ticked."""
    class_attribute = element_checkbox.get_attribute("class")
    if "checked" not in class_attribute:
        element_checkbox.find_element(By.XPATH, './span[@class="qccd-tree-checkbox-inner"]').click()


def checkbox_unselect(element_checkbox):
    """Untick the checkbox if it is currently ticked."""
    class_attribute = element_checkbox.get_attribute("class")
    if "checked" in class_attribute:
        element_checkbox.find_element(By.XPATH, './span[@class="qccd-tree-checkbox-inner"]').click()
def get_amount(element_checkbox):
    """Return the number of companies matching the given checkbox."""
    checkbox_select(element_checkbox)
    # "确定" is the OK button
    xpath_confirm = xpath_prefix + '/div/div/div/div/div[text()="确定"]'
    driver.find_element(By.XPATH, xpath_confirm).click()
    time.sleep(0.5)
    try:
        xpath_result = '//div/div/div[@class="search-btn limit-svip"]'
        result = str(driver.find_element(By.XPATH, xpath_result).text)
    except Exception as e:
        print(f"Error: {e}")
        result = "0"
    result = result.replace(",", "")
    match_object = re.search(r"(\d+)", result)
    # Fall back to "0" when the result text contains no digits
    amount = match_object.group(1) if match_object else "0"
    print(f"Count: {amount}")
    # Clear the result ("清除") so that clicking an option does not close the panel by mistake
    xpath_clear = '//div/div/a[contains(text(), "清除")]'
    try:
        driver.find_element(By.XPATH, xpath_clear).click()
    except Exception:
        pass
    xpath_select = xpath_prefix + '[@class="trigger-container"]'
    driver.find_element(By.XPATH, xpath_select).click()
    time.sleep(0.2)
    checkbox_unselect(element_checkbox)
    return amount
def extend_options():
    """Expand collapsed tree nodes and collect the counts; only three levels are expanded."""
    # Resume from an earlier run if data.json exists and is readable
    try:
        with open("data.json", encoding="utf-8") as f:
            data = json.load(f)
    except (OSError, ValueError):
        data = {}
    try:
        xpath_first_class = xpath_prefix + '//div/ul/li[@role="treeitem"]'
        first_item_list = driver.find_elements(By.XPATH, xpath_first_class)
        for item_li in first_item_list:
            text_dk1 = item_li.find_element(By.XPATH, './span/span/div/span[@class="text-dk"]').text
            data[text_dk1] = data.get(text_dk1, {})
            print(f"{text_dk1}")
            switcher = item_li.find_element(By.XPATH, './span[contains(@class, "qccd-tree-switcher")]')
            class_attribute = switcher.get_attribute("class")
            element_checkbox = item_li.find_element(By.XPATH, './span[contains(@class, "checkbox")]')
            if "close" in class_attribute:
                switcher.click()
                time.sleep(0.1)
            elif "noop" in class_attribute:
                # This node has no children; query it only if it has no saved count
                if not data[text_dk1]:
                    amount = get_amount(element_checkbox)
                    data[text_dk1] = amount
                continue
            # Once expanded, the next level's ul/li elements become visible
            second_item_list = item_li.find_elements(By.XPATH, "./ul/li")
            for second_item_li in second_item_list:
                text_dk2 = second_item_li.find_element(By.XPATH, './span/span/div/span[@class="text-dk"]').text
                data[text_dk1][text_dk2] = data[text_dk1].get(text_dk2, {})
                print(f"--{text_dk2}")
                switcher = second_item_li.find_element(By.XPATH, './span[contains(@class, "qccd-tree-switcher")]')
                class_attribute = switcher.get_attribute("class")
                element_checkbox = second_item_li.find_element(By.XPATH, './span[contains(@class, "checkbox")]')
                if "close" in class_attribute:
                    switcher.click()
                    time.sleep(0.1)
                elif "noop" in class_attribute:
                    # This node has no children; query it only if it has no saved count
                    if not data[text_dk1][text_dk2]:
                        amount = get_amount(element_checkbox)
                        data[text_dk1][text_dk2] = amount
                    continue
                # Once expanded, the next level's ul/li elements become visible
                third_item_list = second_item_li.find_elements(By.XPATH, "./ul/li")
                for third_item_li in third_item_list:
                    text_dk3 = third_item_li.find_element(By.XPATH, './span/span/div/span[@class="text-dk"]').text
                    data[text_dk1][text_dk2][text_dk3] = data[text_dk1][text_dk2].get(text_dk3, {})
                    print(f"----{text_dk3}")
                    # At the third level, stop expanding and tick the checkbox directly
                    element_checkbox = third_item_li.find_element(By.XPATH, './span[contains(@class, "checkbox")]')
                    if not data[text_dk1][text_dk2][text_dk3]:
                        amount = get_amount(element_checkbox)
                        data[text_dk1][text_dk2][text_dk3] = amount
    finally:
        # Save whatever has been collected, even if the run was interrupted
        with open("data.json", 'w', encoding="utf-8") as f:
            json.dump(data, f, indent=2, ensure_ascii=False)
def spider_data():
    # Try to close the qualification picker and clear any current selection
    xpath_close = xpath_prefix + '/div/div/div/a[@class="nclose"]'
    xpath_clear = '//div/div/a[contains(text(), "清除")]'
    try:
        driver.find_element(By.XPATH, xpath_close).click()
    except Exception:
        pass
    try:
        driver.find_element(By.XPATH, xpath_clear).click()
    except Exception:
        pass
    # Open the qualification picker
    xpath_select = xpath_prefix + '[@class="trigger-container"]'
    driver.find_element(By.XPATH, xpath_select).click()
    time.sleep(2)
    extend_options()
    # Cancel button ("取消"), not used here
    xpath_cancel = xpath_prefix + '/div/div/div/div/div[text()="取消"]'
    # OK button ("确定")
    xpath_confirm = xpath_prefix + '/div/div/div/div/div[text()="确定"]'
    driver.find_element(By.XPATH, xpath_confirm).click()


if __name__ == '__main__':
    spider_data()
The resulting data.json file looks like this:
{
  "建筑业资质": {
    "工程设计资质证书": {
      "工程设计专项资质": "26329",
      "建筑工程设计事务所": "356",
      "工程设计行业资质": "4487",
      "工程设计专业资质": "19902",
      "工程设计综合资质": "98"
    },
    "工程勘察资质证书": {
      "工程勘察综合资质": "377",
      "工程勘察专业资质": "7464",
      "工程勘察劳务资质": "3019"
    },
    ...
  },
  "食品农产品认证": {
    "有机产品(OGA)": "49868",
    "良好农业规范(GAP)": "6449",
    "食品质量认证(酒类)": "151",
    "绿色食品认证": "34723",
    "绿色市场认证": "318",
    "无公害农产品": "31067",
    "食品安全管理体系认证": "72075",
    "危害分析与关键控制点认证": "51844",
    "乳制品生产企业良好生产规范认证": "445",
    "乳制品生产企业危害分析与关键控制点(HACCP)体系认证": "570",
    "饲料产品": "85"
  },
  "其他资质": {
    "办学许可证": "192010",
    "代理记账许可证书": "34588",
    "会计师事务所执业证书": "12252",
    "DOC证书": "982",
    "SMC证书": "1886",
    "名特优新农产品证书": "1818",
    "招投标类综合资质": "36317",
    "区块链信息服务备案": "2765",
    "医疗机构执业许可证": "570877",
    "CCC工厂认证": "16154",
    "卫生许可证": "3244"
  }
}
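Because the counts are stored as strings inside a nested dictionary, a small helper can make the result easier to work with. The sketch below is only an illustration: iter_counts is a hypothetical helper, and the sample dictionary reuses a few entries from the output above rather than reading the real file.

```python
def iter_counts(node, path=()):
    """Yield (path, count) pairs for every leaf of the nested result dict."""
    for name, value in node.items():
        if isinstance(value, dict):
            # Descend into sub-categories, extending the path
            yield from iter_counts(value, path + (name,))
        else:
            yield path + (name,), int(value)


sample = {
    "其他资质": {"DOC证书": "982", "SMC证书": "1886"},
    "建筑业资质": {"工程勘察资质证书": {"工程勘察综合资质": "377"}},
}

for path, count in iter_counts(sample):
    print(" / ".join(path), count)
```

Entries that are still empty dictionaries (categories not yet queried) simply produce no pairs, which fits the checkpointing behaviour described earlier.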