This tutorial demonstrates web scraping using JavaScript's Cheerio library to extract Academy Award-winning films from Wikipedia and save them to a CSV file.
First, install the required packages:
<code class="language-bash">npm install cheerio axios</code>
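Start scraper.js by importing the dependencies and defining the Wikipedia page URL. The script relies on import statements and top-level await, so Node must treat it as an ES module, for example by setting "type": "module" in package.json:
<code class="language-javascript">import axios from 'axios';
import * as cheerio from 'cheerio';
import fs from 'node:fs';

const url = 'https://en.wikipedia.org/wiki/List_of_Academy_Award%E2%80%93winning_films';</code>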
The script fetches the page's HTML with axios, then loads it into Cheerio for parsing:
<code class="language-javascript">// Download the page and load its HTML into Cheerio
const { data: html } = await axios.get(url);
const $ = cheerio.load(html);
const theadData = []; // header cells per tbody
const tableData = []; // rows for the CSV</code>
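The script then walks the DOM. It first collects the header cells (th) of each table body and keeps the first set as the header row, then collects the text of every data cell (td), one array per table row:
<code class="language-javascript">// Header cells: collect the th text of each table body
$('tbody').each((i, column) => {
  const columnData = [];
  $(column).find('th').each((j, cell) => {
    columnData.push($(cell).text().trim());
  });
  theadData.push(columnData);
});
tableData.push(theadData[0]); // header row of the first table

// Data cells: one array per table row
$('table tr').each((i, row) => {
  const rowData = [];
  $(row).find('td').each((j, cell) => {
    rowData.push($(cell).text().trim());
  });
  if (rowData.length) tableData.push(rowData); // skip rows without td cells
});</code>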
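Cell text scraped from Wikipedia can carry numeric footnote markers such as [1] and line breaks inside the cell, which trimming the ends does not remove. The following is a minimal sketch of a cleanup step, not part of the original script: cleanCellText is a hypothetical helper, and it assumes that only purely numeric markers should be dropped, so other bracketed text is left alone.

```javascript
// Hypothetical helper (not part of the original script):
// normalizes the text of a single table cell.
function cleanCellText(text) {
  return text
    .replace(/\[\d+\]/g, '') // drop numeric footnote markers such as [1]
    .replace(/\s+/g, ' ')    // collapse newlines, tabs and repeated spaces
    .trim();                 // remove leading and trailing whitespace
}

// Made-up sample strings, for illustration only.
console.log(cleanCellText(' Example Film\n[2] ')); // prints "Example Film"
console.log(cleanCellText('Example[1][3]'));       // prints "Example"
```

In the scraper, such a helper could be applied to the result of $(cell).text() before the value is pushed into rowData.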
Finally, the rows are joined into CSV text, with semicolons as delimiters, and written to disk with fs.writeFileSync:
<code class="language-javascript">// One line per row, cells separated by semicolons
const csvContent = tableData.map((row) => row.join(';')).join('\n');
fs.writeFileSync('academy_awards.csv', csvContent, 'utf-8');</code>
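A plain join produces a malformed row if a cell itself contains a semicolon, a double quote, or a line break. The sketch below shows one common way to guard against that, using the quoting convention described in RFC 4180 with a semicolon as the delimiter. The helpers escapeField and toCsv are hypothetical names introduced here for illustration, and the sample rows are made up.

```javascript
// Hypothetical helpers (not part of the original script).

// Quote a field only when it contains the delimiter, a quote, or a line break.
function escapeField(value, delimiter = ';') {
  const text = String(value);
  const needsQuotes =
    text.includes(delimiter) || text.includes('"') || /[\r\n]/.test(text);
  // Double any embedded quotes, then wrap the whole field in quotes.
  return needsQuotes ? `"${text.replace(/"/g, '""')}"` : text;
}

// Join rows of cells into CSV text, one line per row.
function toCsv(rows, delimiter = ';') {
  return rows
    .map((row) => row.map((cell) => escapeField(cell, delimiter)).join(delimiter))
    .join('\n');
}

// Made-up sample rows, for illustration only.
const sample = [
  ['Film', 'Year'],
  ['Title; with semicolon', '2001'],
  ['Title "quoted"', '2002'],
];
console.log(toCsv(sample));
```

With these helpers, the csvContent line could become toCsv(tableData), and spreadsheet tools that follow the same quoting convention would read the affected cells back intact.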
Run the script using:
<code class="language-bash">node scraper.js</code>
The resulting academy_awards.csv file contains the scraped data.
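Because the script collects rows from every table on the page, some rows may not have the same number of columns as the header. As a rough sanity check, and purely as a sketch, a hypothetical findRaggedRows helper could list such rows before the file is written. The sample rows and column names below are made up.

```javascript
// Hypothetical check (not part of the original script): reports rows whose
// column count differs from that of the first (header) row.
function findRaggedRows(rows) {
  if (rows.length === 0) return [];
  const expected = rows[0].length;
  return rows
    .map((row, index) => ({ index, length: row.length }))
    .filter(({ length }) => length !== expected);
}

// Made-up rows for illustration; in the scraper this would be tableData.
const rows = [
  ['Film', 'Year', 'Awards'],
  ['Example Film A', '2001', '1'],
  ['Example Film B', '2002'],
];
console.log(findRaggedRows(rows)); // prints [ { index: 2, length: 2 } ]
```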
This tutorial builds on earlier scraping tutorials that used Go and Python.