Upload a CSV file containing URLs from an HTML page and use Flask to read the URLs you want to crawl

Question

I currently need to build a web system that can upload a CSV file containing a list of URLs. After the upload, the system should read the URLs line by line and use them for the next step, the crawl. The crawl requires logging in to the website first, and I already have the source code for the login. The problem is that I want to connect an HTML page named "upload_page.html" to a Flask file named "upload_csv.py". Where in the Flask file should the login and scraping code go? upload_page.html<d

P粉207969787 · Answer

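import time

import pandas as pd
from flask import request
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# The code below belongs inside the Flask view function that handles the upload
csv_file = request.files['file']
# Load the CSV data into a DataFrame (the CSV needs a column named "Link")
df = pd.read_csv(csv_file)
final_data = []
# Initialize the web driver
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--disable-gpu")
driver = webdriver.Chrome(options=chrome_options)
# Log in to the website once, before the loop; the same driver keeps the session
# Replace this with your own login code
driver.get("https://example.com/login")
username_field = driver.find_element(By.NAME, "username")
password_field = driver.find_element(By.NAME, "password")
username_field.send_keys("myusername")
password_field.send_keys("mypassword")
password_field.send_keys(Keys.RETURN)
# Wait for the login to complete
WebDriverWait(driver, 10).until(EC.url_changes("https://example.com/login"))
# Loop over the rows in the DataFrame and scrape each link
for index, row in df.iterrows():
    link = row['Link']
    # Scrape the website
    driver.get(link)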
    start = time.time()
    # Vertical offset (in pixels) to scroll to on the next pass of the while loop
    finalScroll = 1000

    while True:
        # window.scrollTo(x, y) takes coordinates, not a start and an end,
        # so keep x at 0 and scroll down to the pixel value in finalScroll
        driver.execute_script(f"window.scrollTo(0, {finalScroll})")
        finalScroll += 1000

        # Pause for 2 seconds so that the data can load
        time.sleep(2)
        end = time.time()
        # Scroll for about 20 seconds, then stop
        if round(end - start) > 20:
            break
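
To tie this back to the question of where the login and scraping code goes, here is a minimal sketch of how "upload_csv.py" could be laid out. It is only one possible arrangement, not a definitive one. Assumptions that are not in the original post: the form in "upload_page.html" posts to a route called /upload with enctype="multipart/form-data" and a file input named "file", the template lives in a templates/ folder, and the CSV has a "Link" column as in the code above. The helper names read_links and scrape_links are hypothetical; scrape_links is a placeholder where the Selenium login and scrolling code would be called.

```python
import csv
import io

from flask import Flask, jsonify, render_template, request

app = Flask(__name__)


def read_links(raw_bytes):
    """Return the non-empty values of the 'Link' column from CSV bytes."""
    text = raw_bytes.decode("utf-8-sig")  # tolerate a BOM from spreadsheet exports
    reader = csv.DictReader(io.StringIO(text))
    return [row["Link"].strip() for row in reader if (row.get("Link") or "").strip()]


def scrape_links(links):
    """Hypothetical placeholder for the scraping step.

    The Selenium code (log in once, then visit and scroll each link)
    would be called from here.
    """
    return [{"link": link, "status": "queued"} for link in links]


@app.route("/", methods=["GET"])
def upload_page():
    # Assumes templates/upload_page.html contains the upload form.
    return render_template("upload_page.html")


@app.route("/upload", methods=["POST"])
def upload_csv():
    uploaded = request.files.get("file")
    if uploaded is None or uploaded.filename == "":
        return jsonify(error="no CSV file uploaded"), 400
    links = read_links(uploaded.read())
    return jsonify(results=scrape_links(links))


if __name__ == "__main__":
    app.run(debug=True)
```

Keeping the scraping in its own function means the view only deals with the upload, and the login code runs once per uploaded file rather than once per URL. A long crawl will still block the request until it finishes, so a background job may be worth considering for large CSV files.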