Upload a CSV file of URLs from an HTML page and read the URLs to crawl with Flask

Question

I need to build a web-based system where a user can upload a CSV file containing a list of URLs. After the upload, the system should read the URLs line by line and pass them to the next step, crawling. The crawler has to log in to the website before it can crawl, and I already have the source code for the login. My problem is connecting an HTML page named "upload_page.html" to a Flask file named "upload_csv.py". Where in the Flask file should the login and scraping code go?

P粉207969787 · Answer

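from flask import Flask, render_template, request
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

app = Flask(__name__)

# Serve the upload form (assumes upload_page.html is in the templates/ folder)
@app.route("/")
def upload_page():
    return render_template("upload_page.html")

# The form should POST here with enctype="multipart/form-data" and a file
# input named "file". The "/upload" path is an assumption; match it to your form.
@app.route("/upload", methods=["POST"])
def upload_csv():
    csv_file = request.files['file']
    # Load the CSV data into a DataFrame
    df = pd.read_csv(csv_file)
    final_data = []
    # Initialize the web driver
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--disable-gpu")
    driver = webdriver.Chrome(options=chrome_options)
    try:
        # Log in to the website once, before crawling any link
        # Replace this with your own login code
        driver.get("https://example.com/login")
        username_field = driver.find_element(By.NAME, "username")
        password_field = driver.find_element(By.NAME, "password")
        username_field.send_keys("myusername")
        password_field.send_keys("mypassword")
        password_field.send_keys(Keys.RETURN)
        # Wait for the login to complete
        WebDriverWait(driver, 10).until(EC.url_changes("https://example.com/login"))
        # Loop over the rows in the DataFrame and scrape each link
        for index, row in df.iterrows():
            link = row['Link']
            # Scrape the website
            driver.get(link)
            start = time.time()
            # Scroll positions used in the while loop
            initialScroll = 0
            finalScroll = 1000

            while True:
                # Scroll the window from the pixel value stored in initialScroll
                # to the pixel value stored in finalScroll
                driver.execute_script(f"window.scrollTo({initialScroll},{finalScroll})")
                initialScroll = finalScroll
                finalScroll += 1000

                # Pause for 2 seconds so that the data can load
                time.sleep(2)
                end = time.time()
                # Stop scrolling after 20 seconds
                if round(end - start) > 20:
                    break
            # Extract the data you need from the loaded page here and append it to final_data
    finally:
        # Always close the browser, even if a step above fails
        driver.quit()
    return f"Crawled {len(df)} links"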
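
If you only need the URLs and not a full DataFrame, you could also read the upload with the standard library and drop blank or malformed rows before starting the browser. This is a minimal sketch, not part of the code above: the helper name read_urls is made up for illustration, and it assumes the CSV has a header row with a "Link" column, as the answer does.

```python
import csv
import io
from urllib.parse import urlparse


def read_urls(csv_bytes, column="Link"):
    """Return the absolute http(s) URLs found in one column of a CSV upload.

    read_urls is a hypothetical helper; the "Link" default mirrors the
    column name used in the answer.
    """
    # utf-8-sig strips the byte-order mark that spreadsheet exports often add
    text = csv_bytes.decode("utf-8-sig")
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise ValueError(f"CSV has no '{column}' column")

    urls = []
    for row in reader:
        # A short row yields None for missing cells, so fall back to ""
        url = (row[column] or "").strip()
        parsed = urlparse(url)
        # Keep only absolute http(s) URLs; skip blanks and malformed values
        if parsed.scheme in ("http", "https") and parsed.netloc:
            urls.append(url)
    return urls
```

Inside the Flask route you would call it as read_urls(request.files['file'].read()) and loop over the returned list instead of df.iterrows(). Filtering first means a bad row cannot make driver.get() fail halfway through a crawl.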