Is it possible for the code to skip an iteration in web scraping? IndexError: popup index out of range
P粉846294303
P粉846294303 2024-02-21 19:40:15
0
2
412

So I have a code that removes the name and price of a mineral from 14 pages (so far) and saves it to a .txt file. I first tried using only Page1, then I wanted to add more pages to get more data. But then the code grabs something it shouldn't - random names/strings. I didn't expect it to grab that one, but it did and assigned the wrong price to this one! It happens after a mineral has this "unexpected name" and then the entire rest of the list has the wrong price. See below:

So, since the string is different from other strings, further code cannot split it and gives error:

cutted2 = split2.pop(1)
              ^^^^^^^^^^^^^
IndexError: pop index out of range

I tried to ignore these errors and use one of the methods used in different Stackoverflow pages:

try:
   cutted2 = split2.pop(1)
except IndexError:
   continue

It does work, no errors occur...but then it assigns the wrong price to the wrong mineral (as I noticed)! How can I change the code to ignore these "weird" names and continue with the list? Below is the full code, I remember it stopped at URL5 and gave this popup index error:

import requests
from bs4 import BeautifulSoup
import re
def collecter(URL):
    headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36"}

    soup = BeautifulSoup(requests.get(URL, headers=headers).text, "lxml")

    names = [n.getText(strip=True) for n in soup.select("table tr td font a")]
    prices = [
        p.getText(strip=True).split("Price:")[-1] for p
        in soup.select("table tr td font font")
    ]
    
    names[:] = [" ".join(n.split()) for n in names if not n.startswith("[")]
    prices[:] = [p for p in prices if p]

    with open("Minerals.txt", "a+", encoding='utf-8') as file:
        for name, price in zip(names, prices):
                # print(f"{name}\n{price}")
                # print("-" * 50)
                filename = str(name)+" "+str(price)+"\n"
                split1 = filename.split(' / ')          
                cutted1 = split1.pop(0)
                split2 = cutted1.split(": ")
                try:
                    cutted2 = split2.pop(1)
                except IndexError:
                    continue
                two_prices = cutted2+" "+split1.pop(0)+"\n"
                file.write(two_prices)
                      
URL1 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=0"
URL2 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=25"
URL3 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=50"
URL4 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=75"
URL5 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=100"
URL6 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=125"
URL7 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=150"
URL8 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=175"
URL9 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=200"
URL10 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=225"
URL11 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=250"
URL12 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=275"
URL13 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=300"
URL14 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=325"

collecter(URL1)
collecter(URL2)
collecter(URL3)
collecter(URL4)
collecter(URL5)
collecter(URL6)
collecter(URL7)
collecter(URL8)
collecter(URL9)
collecter(URL10)
collecter(URL11)
collecter(URL12)
collecter(URL13)
collecter(URL14)

EDIT: This is the fully valid code below, thanks to the helper!

import requests
from bs4 import BeautifulSoup
import re
for URL in range(0,2569,25):
    headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36"}

    soup = BeautifulSoup(requests.get(f'https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First={URL}', headers=headers).text, "lxml")

    names = [n.getText(strip=True) for n in soup.select("table tr td font>a")]

    prices = [p.getText(strip=True).split("Price:")[-1] for p in soup.select("table tr td font>font")]   
       
    names[:] = [" ".join(n.split()) for n in names if not n.startswith("[") ]
    prices[:] = [p for p in prices if p]

    with open("MineralsList.txt", "a+", encoding='utf-8') as file:
        for name, price in zip(names, prices):
                # print(f"{name}\n{price}")
                # print("-" * 50)
                filename = str(name)+"  "+str(price)+"\n"
                split1 = filename.split(' / ')          
                cutted1 = split1.pop(0)
                split2 = cutted1.split(": ")
                cutted2 = split2.pop(1)
                try:
                    two_prices = cutted2+"  "+split1.pop(0)+"\n"
                except IndexError:
                    two_prices = cutted2+"\n"
                file.write(two_prices)

But after some changes it stops with a new error - it cannot find the string by the given property, hence the error "IndexError: popping from empty list"...even soup.select( "table tr td font>font" ) provides help, as it does in "name"

P粉846294303
P粉846294303

reply all(2)
P粉391955763

You can try the next example along with pagination

import requests
from bs4 import BeautifulSoup

for URL in range(0,100,25):
    headers = {"User-Agent": "Mozilla/5.0"}

    soup = BeautifulSoup(requests.get(f'https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First={URL}', headers=headers).text, "lxml")

    names = [ x.get_text(strip=True) for x in soup.select('table tr td font a')][:25]
    print(names)
    prices = [ x.get_text(strip=True) for x in soup.select('table tr td font:nth-child(3)')][:25]
    print(prices)

    # with open("Minerals.txt", "a+", encoding='utf-8') as file:
    #     for name, price in zip(names, prices):
    #             # print(f"{name}\n{price}")
    #             # print("-" * 50)
    #             filename = str(name)+" "+str(price)+"\n"
    #             split1 = filename.split(' / ')          
    #             cutted1 = split1.pop(0)
    #             split2 = cutted1.split(": ")
    #             try:
    #                 cutted2 = split2.pop(1)
    #             except IndexError:
    #                 continue
    #             two_prices = cutted2+" "+split1.pop(0)+"\n"
    #             file.write(two_prices)

Output:

["NX51AH2:\n'lepidolite' after Elbaite with Elbaite", "TH27AL9:\n'Pearceite' with Calcite", "TFM69AN5:\n'Stilbite'", 'SM90CEX:\nAcanthite', 'TMA97AN5:\nAcanthite', 'TB90AE8:\n Acanthite', 'TZ71AK9:\nAcanthite', 'EC63G1:\nAcanthite', 'MN56K9:\nAcanthite', 'TF89AL3:\nAcanthite (Se-bearing) with Polybasite (Se-bearing) and Calcite', 'TP66AJ8:\nAcanthite (Se-bearing) with Pyrite', 'TY86AN2:\nAcanthite after Polybasite', 'TA66AF6:\nAcanthite with Calcite', 'JFD104AO2:\nAcanthite with Calcite', 'TX36AL6:\nAcanthite with Calcite', 'TA48AH1:\nAcanthite with Chalcopyrite', 'EF89L9:\nAcanthite with Pyrite and Calcite', 'TX89AN0:\nAcanthite with Siderite and Proustite', 'EA56K0:\nAcanthite with Silver', 'EC48K0:\nAcanthite with Silver', '11AT12:\nAcanthite, Calcite', '9EF89L9:\nAcanthite, Pyrite, Calcite', 'SM75TDA:\nAdamite', '2M14:\nAdamite', '20MJX66:\nAdamite']
['Price:€580 / US8 / ¥84010 / AUD0', 'Price:€220 / US7 / ¥31860 / AUD0', 'Price:€450 / US4 / ¥65180 / AUD0', 'Price:€90 / US / ¥13030 / AUD0', 'Price:€240 / US7 / ¥34760 / AUD0', 'Price:€540 / US7 / 
¥78220 / AUD0', 'Price:€580 / US8 / ¥84010 / AUD0', 'Price:€85 / US / ¥12310 / AUD0', 'Price:€155 / US9 / ¥22450 / AUD0', 'Price:€460 / US4 / ¥66630 / AUD0', 'Price:€1500 / US47 / ¥217290 / AUD10', 'Price:€1600 / US51 / ¥231770 / AUD60', 'Price:€160 / US5 / ¥23170 / AUD0', 'Price:€240 / US7 / ¥34760 / AUD0', 'Price:€1200 / US38 / ¥173830 / AUD50', 'Price:€290 / US9 / ¥42000 / AUD0', 'Price:€480 / US5 / ¥69530 / AUD0', 'Price:€4800 / US53 / ¥695320 / AUD00', 'Price:€150 / US4 / ¥21720 / AUD0', 'Price:€290 / US9 / ¥42000 / AUD0', 'Price:€70 / US / ¥10140 / AUD0', 'Price:€320 / US0 / ¥46350 / AUD0', 'Price:€75 / US / ¥10860 / AUD0', 'Price:€90 / US / ¥13030 / AUD0', 'Price:€140 / US4 / ¥20280 / AUD5']
['5TD76M9:\nAdamite', 'MA54AE9:\nAdamite (variety Cu-bearing adamite) with Calcite', 'EA11Y6:\nAdamite (variety cuprian)', 'EB14Y6:\nAdamite (variety cuprian)', 'MC11X8:\nAdamite (variety cuprian) with Smithsonite', 'JRM10AN8:\nAegirine', 'MFA46AP3:\nAegirine with Zircon, Orthoclase and Quartz (variety smoky)', 'EM48AF8:\nAlabandite with Calcite', 'MC92T6:\nAlabandite with Calcite and Rhodochrosite', 'TF16AN1:\nAlabandite with Rhodochrosite', 'TX17S1:\nAlabandite with Rhodochrosite', 'TD34S1:\nAlabandite with Rhodochrosite', '10TR46:\nAlmandine', 'HM90EJ:\nAnalcime', 'EFH36AP3:\nAnalcime with Natrolite, Rhodochrosite and Serandite', 'ELR67AP1:\nAnalcime with Quartz', 'EML88AP1:\nAnalcime with Quartz', 'TF87AF4:\nAndorite', 'TR88AJ3:\nAndorite', 'ND56AN0:\nAndorite with Zinkenite', 'SM180NH:\nAndradite (variety demantoid)', 'MT86AL3:\nAndradite (variety demantoid) with Calcite', 'MA27AL7:\nAndradite (variety demantoid) with Calcite', 'TC80TL:\nAndradite (variety topazolite) with Clinochlore', 'TC85TE:\nAndradite (variety topazolite) with Clinochlore']
['Price:€180 / US5 / ¥26070 / AUD0', 'Price:€840 / US6 / ¥121680 / AUD90', 'Price:€60 / US / ¥8690 / 
AUD', 'Price:€90 / US / ¥13030 / AUD0', 'Price:€70 / US / ¥10140 / AUD0', 'Price:€580 / US8 / ¥84010 / AUD0', 'Price:€1600 / US51 / ¥231770 / AUD68', 'Price:€2700 / US86 / ¥391120 / AUD60', 'Price:€740 / US3 / ¥107190 / AUD40', 'Price:€110 / US3 / ¥15930 / AUD0', 'Price:€220 / US7 / ¥31860 / AUD0', 'Price:€920 / US9 / ¥133270 / AUD10', 'Price:€140 / US4 / ¥20280 / AUD0', 'Price:€90 / US / ¥13030 / AUD0', 'Price:€130 / US4 / ¥18830 / AUD0', 'Price:€260 / US8 / ¥37660 / AUD0', 'Price:€380 / US2 / ¥55040 / AUD0', 'Price:€240 / US7 / ¥34760 / AUD0', 'Price:€390 / US2 / ¥56490 / AUD0', 'Price:€150 / US4 / ¥21720 / AUD0', 'Price:€180 / US5 / ¥26070 / AUD0', 'Price:€1600 / US51 / ¥231770 / AUD60', 'Price:€2200 / US70 / ¥318690 / AUD90', 'Price:€80 / US / ¥11580 / AUD0', 'Price:€85 / US / ¥12310 / AUD0']
['T29NAK3:\nAndradite (variety topazolite) with Clinochlore', 'TC85TV:\nAndradite (variety topazolite) with Clinochlore', 'T89GH5:\nAndradite (variety topazolite) with Clinochlore', 'TQ94Q0:\nAndradite (variety topazolite) with Stilbite', 'SM140TFV:\nAndradite on Microcline', 'HM140NG:\nAndradite with Calcite', 'GM66R9:\nAndradite with Clinochlore', 'SM70TYW:\nAndradite with Epidote', 'TC290TVH:\nAndradite with Epidote and Microcline', 'TKX11AO7:\nAndradite with Microcline', 'TC2100TEJ:\nAndradite with Microcline', 'TH16AN2:\nAndradite with Microcline', 'TTX66AO7:\nAndradite with Microcline', 'TC2150TJL:\nAndradite with Microcline', 'TQ96AN2:\nAndradite with Microcline', 'TF48AF2:\nAnglesite', 'MA47AL4:\nAnglesite with Galena', 'LQ88AE6:\nAnglesite with Galena', 'ER90AL8:\nAnglesite with Galena', 'TP70AE1:\nAnglesite with Galena', 'N54NAL5:\nAnglesite with Galena', 'GV96R9:\nAnhydrite with Calcite and Pyrite', '11TV99:\nAnhydrite, Calcite', 'MG26AL4:\nAnorthoroselite with Calcite', 'XM260NFF:\nAragonite']
['Price:€240 / US7 / ¥34760 / AUD0', 'Price:€85 / US / ¥12310 / AUD0', 'Price:€220 / US7 / ¥31860 / AUD0', 'Price:€980 / US11 / ¥141960 / AUD10', 'Price:€140 / US4 / ¥20280 / AUD0', 'Price:€140 / US4 / ¥20280 / AUD0', 'Price:€160 / US5 / ¥23170 / AUD0', 'Price:€70 / US / ¥10140 / AUD0', 'Price:€90 / US / ¥13030 / AUD0', 'Price:€70 / US / ¥10140 / AUD0', 'Price:€100 / US3 / ¥14480 / AUD0', 'Price:€110 / US3 / ¥15930 / AUD0', 'Price:€140 / US4 / ¥20280 / AUD0', 'Price:€150 / US4 / ¥21720 / AUD0', 'Price:€220 / US7 / ¥31860 / AUD0', 'Price:€380 / US2 / ¥55040 / AUD0', 'Price:€220 / US7 / ¥31860 / AUD0', 'Price:€360 / US1 / ¥52140 / AUD0', 'Price:€540 / US7 / ¥78220 / AUD0', 'Price:€540 / US7 / ¥78220 / AUD0', 'Price:€940 / US9 / ¥136160 / AUD50', 'Price:€220 / US7 / ¥31860 / AUD0', 'Price:€460 / US4 / ¥66630 / AUD0', 'Price:€140 / US4 / ¥20280 / AUD0', 'Price:€60 / US / ¥8690 / AUD'] 
['XM295EAR:\nAragonite', 'ETE46AP2:\nAragonite', 'EXM26AP0:\nAragonite', 'EYB26AP0:\nAragonite', 'EXE56AP2:\nAragonite', 'ETF46AP0:\nAragonite', 'XM2160ERF:\nAragonite', 'EXM46AP0:\nAragonite', 'XM2190MEX:\nAragonite', 'XM2780EFT:\nAragonite', 'EHM93AO9:\nAragonite', 'TYB37AO8:\nAragonite (variety Cu-bearing aragonite)', 'SM99AM3:\nAragonite (variety cuprian)', '1M06:\nAragonite (variety flos ferri)', 'TG69AL3:\nAragonite (variety tarnowitzite)', 'MLC96AO2:\nAragonite on Calcite', 'MLE68AO2:\nAragonite on Calcite', 'MTB66AP3:\nAragonite with Quartz (variety hematoide)', 'MXF96AP3:\nAragonite with Quartz (variety hematoide)', 'MRR47AP3:\nAragonite with Quartz (variety hematoide)', 'MTR37AP3:\nAragonite with Quartz (variety hematoide)', 'JFD193AP3:\nArfvedsonite with Microcline', 'TFX76AO7:\nArsenopyrite with Calcite, Pyrite, Sphalerite and Rhodochrosite', 'NB37AL3:\nArsenopyrite with Muscovite', 'HM220NX:\nArsenopyrite with Muscovite']
['Price:€95 / US / ¥13760 / AUD6', 'Price:€140 / US4 / ¥20280 / AUD0', 'Price:€140 / US4 / ¥20280 / AUD0', 'Price:€140 / US4 / ¥20280 / AUD0', 'Price:€150 / US4 / ¥21720 / AUD0', 'Price:€150 / US4 / 
¥21720 / AUD0', 'Price:€160 / US5 / ¥23170 / AUD6', 'Price:€160 / US5 / ¥23170 / AUD0', 'Price:€190 / US6 / ¥27520 / AUD3', 'Price:€780 / US4 / ¥112990 / AUD03', 'Price:€880 / US8 / ¥127470 / AUD50', 'Price:€240 / US7 / ¥34760 / AUD0', 'Price:€480 / US5 / ¥69530 / AUD0', 'Price:€100 / US3 / ¥14480 / AUD0', 'Price:€460 / US4 / ¥66630 / AUD0', 'Price:€190 / US6 / ¥27520 / AUD0', 'Price:€360 / US1 
/ ¥52140 / AUD0', 'Price:€160 / US5 / ¥23170 / AUD6', 'Price:€190 / US6 / ¥27520 / AUD3', 'Price:€230 / US7 / ¥33310 / AUD4', 'Price:€230 / US7 / ¥33310 / AUD4', 'Price:€240 / US7 / ¥34760 / AUD0', 'Price:€170 / US5 / ¥24620 / AUD0', 'Price:€220 / US7 / ¥31860 / AUD0', 'Price:€220 / US7 / ¥31860 / AUD0']
P粉677684876

You just need to make the CSS selector more specific so that only links that are directly inside the font element (rather than a few levels down) are recognized:

soup.select("table tr td font>a")

Adding a further condition that the link points to a single item rather than the next/previous page links at the bottom of the page would also help:

soup.select("table tr td font>a[href*='CODE']")
Latest Downloads
More>
Web Effects
Website Source Code
Website Materials
Front End Template