So I have a script that scrapes mineral names and prices from 14 pages (so far) and saves them to a .txt file. I first tried it with Page 1 only, then I wanted to add more pages to get more data. But then the code scraped something it shouldn't have - a random name/string. I didn't expect it to grab that, but it did, and it got the wrong price! It happens right after the mineral with that "unexpected name", and from there on the whole rest of the list has the wrong prices. See the screenshot below:
So, because that string is different from the others, the code further down can't split it and raises this error:
    cutted2 = split2.pop(1)
              ^^^^^^^^^^^^^
IndexError: pop index out of range
I tried to ignore these errors using one of the approaches from a different Stack Overflow post:
try:
    cutted2 = split2.pop(1)
except IndexError:
    continue
It did work and no errors showed up... but then it assigned the wrong prices to the wrong minerals (as I noticed)! How can I change the code so that it skips these "strange" names and carries on with the list? Below is the full code; as I remember it stopped on URL5 and gave that pop index error:
import requests
from bs4 import BeautifulSoup
import re


def collecter(URL):
    headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36"}
    soup = BeautifulSoup(requests.get(URL, headers=headers).text, "lxml")

    names = [n.getText(strip=True) for n in soup.select("table tr td font a")]
    prices = [
        p.getText(strip=True).split("Price:")[-1]
        for p in soup.select("table tr td font font")
    ]

    names[:] = [" ".join(n.split()) for n in names if not n.startswith("[")]
    prices[:] = [p for p in prices if p]

    with open("Minerals.txt", "a+", encoding='utf-8') as file:
        for name, price in zip(names, prices):
            # print(f"{name}\n{price}")
            # print("-" * 50)
            filename = str(name)+" "+str(price)+"\n"
            split1 = filename.split(' / ')
            cutted1 = split1.pop(0)
            split2 = cutted1.split(": ")
            try:
                cutted2 = split2.pop(1)
            except IndexError:
                continue
            two_prices = cutted2+" "+split1.pop(0)+"\n"
            file.write(two_prices)


URL1 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=0"
URL2 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=25"
URL3 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=50"
URL4 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=75"
URL5 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=100"
URL6 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=125"
URL7 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=150"
URL8 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=175"
URL9 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=200"
URL10 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=225"
URL11 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=250"
URL12 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=275"
URL13 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=300"
URL14 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=325"

collecter(URL1)
collecter(URL2)
collecter(URL3)
collecter(URL4)
collecter(URL5)
collecter(URL6)
collecter(URL7)
collecter(URL8)
collecter(URL9)
collecter(URL10)
collecter(URL11)
collecter(URL12)
collecter(URL13)
collecter(URL14)
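To see why one stray name throws off every price after it (as described above), here is a tiny made-up example of my own: zip pairs the two lists element by element, so an extra name that has no matching price shifts every later pair.

# Made-up data for illustration: the middle "name" is a stray link text
# that was scraped into names but has no corresponding price entry.
names = ["RM90AE7: Fluorite", "Unexpected link text", "RM92HH2: Calcite"]
prices = ["90 EUR", "120 EUR"]

print(list(zip(names, prices)))
# [('RM90AE7: Fluorite', '90 EUR'), ('Unexpected link text', '120 EUR')]
# -> the stray entry takes Calcite's price and Calcite itself is dropped.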
EDIT: here is the fully working code below, thanks to the people who helped!
import requests
from bs4 import BeautifulSoup
import re

# walk through every results page; the "First" offset advances in steps of 25
for URL in range(0, 2569, 25):
    headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36"}
    soup = BeautifulSoup(requests.get(f'https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First={URL}', headers=headers).text, "lxml")

    # ">" matches only direct children of <font>, so unrelated nested links are skipped
    names = [n.getText(strip=True) for n in soup.select("table tr td font>a")]
    prices = [p.getText(strip=True).split("Price:")[-1] for p in soup.select("table tr td font>font")]

    names[:] = [" ".join(n.split()) for n in names if not n.startswith("[")]
    prices[:] = [p for p in prices if p]

    with open("MineralsList.txt", "a+", encoding='utf-8') as file:
        for name, price in zip(names, prices):
            # print(f"{name}\n{price}")
            # print("-" * 50)
            filename = str(name)+" "+str(price)+"\n"
            split1 = filename.split(' / ')
            cutted1 = split1.pop(0)
            split2 = cutted1.split(": ")
            cutted2 = split2.pop(1)
            # some entries have no second " / " part (only one price listed),
            # so fall back to writing just the first piece
            try:
                two_prices = cutted2+" "+split1.pop(0)+"\n"
            except IndexError:
                two_prices = cutted2+"\n"
            file.write(two_prices)
But after some changes it stops with a new error: it can't find the strings by the given attributes, so I get "IndexError: pop from empty list"... not even soup.select("table tr td font>font") helps here, the way it did for the names.
You can try the next example along with pagination:
Output:
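As a rough sketch of that paginated approach (my own reconstruction based only on the URLs in the question, not the answerer's exact code): every results page shares the same query string and only the First offset changes in steps of 25, so it can be passed through requests' params argument.

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://www.fabreminerals.com/search_results.php"
HEADERS = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36"}

# range(0, 350, 25) covers the 14 pages mentioned in the question;
# widen the upper bound (e.g. 2569, as in the edited code) for more pages.
for first in range(0, 350, 25):
    params = {
        "LANG": "EN", "SearchTerms": "", "submit": "Buscar",
        "MineralSpeciment": "", "Country": "", "Locality": "",
        "PriceRange": "", "checkbox": "enventa", "First": first,
    }
    soup = BeautifulSoup(requests.get(BASE_URL, params=params, headers=HEADERS).text, "lxml")

    # direct-child selectors, as in the edited code above
    names = [" ".join(n.getText(strip=True).split())
             for n in soup.select("table tr td font>a")
             if not n.getText(strip=True).startswith("[")]
    prices = [p.getText(strip=True).split("Price:")[-1]
              for p in soup.select("table tr td font>font")]
    prices = [p for p in prices if p]

    for name, price in zip(names, prices):
        print(name, price)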
You just need to make the CSS selectors more specific, so that only links directly inside a font element (rather than several levels further down) are matched:
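Concretely, that is the difference between the descendant combinator (a space) and the direct-child combinator (>). A small self-contained demo with made-up markup (the real page structure may differ):

from bs4 import BeautifulSoup

# Made-up markup for illustration only: one mineral link sits directly
# inside <font>, one unrelated link is nested a level deeper.
html = """
<table><tr><td>
  <font><a href="/item1">RM90AE7: Fluorite</a></font>
  <font><span><a href="/page2">next page</a></span></font>
</td></tr></table>
"""
soup = BeautifulSoup(html, "lxml")

print([a.getText() for a in soup.select("table tr td font a")])    # matches both links
print([a.getText() for a in soup.select("table tr td font > a")])  # matches only the direct child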
Adding a further condition, that the link points to an individual item rather than to the next/previous page links at the bottom of the page, will also help:
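One way to express that extra condition is a CSS attribute selector on the href. This is only a sketch: the actual URL pattern of the individual item links depends on the site's markup, and "specimen" below is a placeholder to replace after inspecting the real hrefs.

from bs4 import BeautifulSoup

# Made-up markup: an item link and a pagination link, distinguished by href.
html = """
<font><a href="specimen.php?id=123">RM90AE7: Fluorite</a></font>
<font><a href="search_results.php?First=25">next page</a></font>
"""
soup = BeautifulSoup(html, "lxml")

# [href*="..."] keeps only links whose URL contains the given substring.
print([a.getText() for a in soup.select('font > a[href*="specimen"]')])  # ['RM90AE7: Fluorite']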