Java Web Crawler Tutorial: Hands-On Data Scraping and Analysis
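With the rise of the internet, data has become essential to the success of businesses and individuals alike, and its importance keeps growing. Crawlers are a key tool for acquiring that data and are widely used across industries. This article shows how to write a crawler in Java to scrape and analyze data.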
1. Prerequisites
Before learning to write crawlers in Java, you need the following fundamentals:
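2. Java crawler basics
A crawler (web crawler) is an automated program that visits websites the way a person would, extracts information from the pages, and processes it. Java has solid network programming support and strong object-oriented features, which makes it well suited to writing crawlers.
A Java crawler generally consists of three parts: a URL manager, a page downloader, and a page parser.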
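To make the division of labor concrete, here is a minimal sketch of how the three parts could cooperate in one crawl loop. The class name SimpleCrawler, the use of Function for the downloader and parser, and the maxPages limit are all assumptions made for illustration; they do not come from the article.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical skeleton wiring the three parts together; every name here is illustrative. */
final class SimpleCrawler {
    private final Function<String, String> downloader;    // URL -> page source
    private final Function<String, List<String>> parser;  // page source -> links found on the page

    SimpleCrawler(Function<String, String> downloader,
                  Function<String, List<String>> parser) {
        this.downloader = downloader;
        this.parser = parser;
    }

    /** Crawls breadth-first from seed, visiting at most maxPages pages; returns them in visit order. */
    Set<String> crawl(String seed, int maxPages) {
        Queue<String> pending = new ArrayDeque<>();  // URL manager: still to be crawled
        Set<String> seen = new HashSet<>();          // URL manager: already queued or crawled
        Set<String> visited = new LinkedHashSet<>();
        pending.add(seed);
        seen.add(seed);
        while (!pending.isEmpty() && visited.size() < maxPages) {
            String url = pending.poll();
            String html = downloader.apply(url);      // page downloader
            visited.add(url);
            for (String link : parser.apply(html)) {  // page parser
                if (seen.add(link)) {                 // add() is false for URLs seen before
                    pending.add(link);
                }
            }
        }
        return visited;
    }
}
```

Passing the downloader and parser in as functions keeps the loop independent of any particular HTTP or HTML library, so it can be exercised with stubs before real network code is plugged in.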
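The URL manager tracks the URLs the crawler needs to fetch, recording which have already been crawled and which are still pending. It is typically implemented in one of two ways:
(1) In-memory URL manager: keeps the crawled and pending URLs in in-memory collections such as a Set or a Queue.
(2) Database-backed URL manager: stores the crawled and pending URLs in a database.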
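As an illustration of the in-memory variant, the following sketch pairs a Queue of pending URLs with a Set of every URL ever added, so duplicates are rejected in constant time. The class name UrlManager and its methods are hypothetical and chosen only for this example.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

/** Hypothetical in-memory URL manager; class and method names are illustrative. */
final class UrlManager {
    private final Queue<String> pending = new ArrayDeque<>(); // URLs waiting to be crawled
    private final Set<String> known = new HashSet<>();        // every URL ever added

    /** Queues the URL unless it is empty or was added before; returns true if it was queued. */
    boolean add(String url) {
        if (url == null || url.isEmpty()) {
            return false;
        }
        if (!known.add(url)) {   // Set.add returns false for duplicates
            return false;
        }
        pending.add(url);
        return true;
    }

    /** True while at least one URL is waiting to be crawled. */
    boolean hasNext() {
        return !pending.isEmpty();
    }

    /** Returns the next URL to crawl, or null when nothing is pending. */
    String next() {
        return pending.poll();
    }
}
```

Because a URL stays in the known set after it has been handed out, a link that points back to an already crawled page is never queued a second time. This version is not thread-safe; a multi-threaded crawler would need concurrent collections or external locking.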
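The page downloader is the core of the crawler and is responsible for fetching pages from the network. In Java it is typically implemented in one of two ways:
(1) URLConnection: built on the URLConnection class and fairly simple to use. The core code is as follows: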
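import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

URL url = new URL("http://www.example.com");
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
// try-with-resources closes the reader and the underlying stream.
// UTF-8 is an assumption here; use the charset the page actually declares.
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
    String line;
    while ((line = reader.readLine()) != null) {
        System.out.println(line);
    }
} finally {
    conn.disconnect();
}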
(2) HttpClient: built on the HttpClient framework, which is more powerful than URLConnection and can handle cookies and HTTP headers such as a custom User-Agent. The core code is as follows:
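// This snippet uses the legacy Apache Commons HttpClient 3.x API.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;

HttpClient httpClient = new HttpClient();
GetMethod getMethod = new GetMethod("http://www.example.com");
try {
    int status = httpClient.executeMethod(getMethod);
    if (status == HttpStatus.SC_OK) {
        // UTF-8 is an assumption; use the charset the page actually declares.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                getMethod.getResponseBodyAsStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
} finally {
    getMethod.releaseConnection(); // always hand the connection back
}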
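For readers on Java 11 or later, the JDK ships its own client in java.net.http, which covers the same needs (cookies, a custom User-Agent) without a third-party dependency. This is a swapped-in alternative, not the framework used in the snippet above. The sketch below is one possible shape; the class name PageDownloader, the timeouts, and the choice to reject every non-200 status are assumptions.

```java
import java.io.IOException;
import java.net.CookieManager;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

/** Hypothetical downloader built on the JDK's java.net.http client (Java 11+). */
final class PageDownloader {
    private final HttpClient client = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(10))       // assumed timeout
            .followRedirects(HttpClient.Redirect.NORMAL)
            .cookieHandler(new CookieManager())           // keeps cookies between requests
            .build();

    /** Builds a GET request that carries a custom User-Agent header. */
    static HttpRequest buildRequest(String url, String userAgent) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", userAgent)
                .timeout(Duration.ofSeconds(15))          // assumed timeout
                .GET()
                .build();
    }

    /** Returns the body of a 200 response; throws IOException for any other status. */
    String download(String url, String userAgent) throws IOException, InterruptedException {
        HttpResponse<String> response =
                client.send(buildRequest(url, userAgent), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected status " + response.statusCode() + " for " + url);
        }
        return response.body();
    }
}
```

A call such as new PageDownloader().download("http://www.example.com", "demo-crawler/0.1") would return the page source as a String; the User-Agent value is a placeholder.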
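import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Page parser with a regular expression: extract the page title.
// html holds the page source returned by the downloader.
String pattern = "<title>(.*?)</title>";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(html);
if (m.find()) {
    System.out.println(m.group(1));
}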
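The pattern above only matches a lowercase title tag that sits on a single line. As a hedged variation, the sketch below wraps the same idea in a helper that also tolerates uppercase tags, attributes on the tag, and line breaks inside the title. TitleExtractor and extractTitle are hypothetical names introduced for this example.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper around the title regex; names are illustrative. */
final class TitleExtractor {
    // CASE_INSENSITIVE also matches <TITLE>; DOTALL lets '.' span line breaks inside the tag.
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the trimmed text of the first title element, or empty if there is none. */
    static Optional<String> extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }
}
```

Regular expressions work for small, well-delimited fragments like this one, but they become fragile on nested or irregular markup, which is where an HTML parser is the safer choice.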
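import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Page parser with Jsoup: print the text and target of every link on the page.
Document doc = Jsoup.connect("http://www.example.com").get();
Elements links = doc.select("a[href]");
for (Element link : links) {
    String text = link.text();
    String href = link.attr("href");
    System.out.println(text + " " + href);
}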
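The href values printed above are often relative (for example "/about" or "../c.html"), so they usually need to be turned into absolute URLs before being handed back to the URL manager. Jsoup offers link.absUrl("href") for this; the sketch below shows the same idea with only the JDK's java.net.URI. LinkResolver and the rule of keeping only http and https links are assumptions made for illustration.

```java
import java.net.URI;
import java.util.Optional;

/** Hypothetical helper that turns an href into an absolute, crawlable URL. */
final class LinkResolver {
    /** Resolves href against pageUrl; empty for blank, malformed, or non-HTTP(S) links. */
    static Optional<String> resolve(String pageUrl, String href) {
        if (href == null || href.trim().isEmpty()) {
            return Optional.empty();
        }
        try {
            URI absolute = URI.create(pageUrl).resolve(href.trim());
            String scheme = absolute.getScheme();
            boolean web = "http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme);
            return web ? Optional.of(absolute.toString()) : Optional.empty();
        } catch (IllegalArgumentException e) {
            // URI.create and resolve reject syntactically invalid links, e.g. ones containing spaces
            return Optional.empty();
        }
    }
}
```

Filtering on the scheme drops links such as mailto: addresses, which a page downloader cannot fetch.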
https://movie.douban.com/chart
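import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Scrape every div.item entry on the chart page into a Movie object.
// Movie is a plain data class with title, rating, director and actor fields.
Document doc = Jsoup.connect("https://movie.douban.com/chart").get();
Elements items = doc.select("div.item");
List<Movie> movieList = new ArrayList<>();
for (Element item : items) {
    Elements title = item.select("div.info div.hd a");
    Elements rating = item.select("div.info div.bd div.star span.rating_num");
    Elements paragraphs = item.select("div.info div.bd p");
    Elements director = paragraphs.eq(0);
    Elements actor = paragraphs.eq(1);

    Movie movie = new Movie();
    movie.setTitle(title.text());
    // Leave the rating unset when the element is missing; Double.valueOf("") would throw.
    if (!rating.text().isEmpty()) {
        movie.setRating(Double.valueOf(rating.text()));
    }
    // The page labels these fields in Simplified Chinese ("导演" = director, "主演" = starring).
    movie.setDirector(director.text().replace("导演: ", ""));
    movie.setActor(actor.text().replace("主演: ", ""));
    movieList.add(movie);
}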
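The scraping code relies on a Movie class that the article does not show. A minimal sketch that satisfies the setters and getters used in this tutorial could look like the following; the field types are inferred from how the values are used (a String for text, a double for the rating) and are otherwise assumptions.

```java
/** Hypothetical data class matching the setters and getters the tutorial code calls. */
final class Movie {
    private String title;
    private double rating;   // stays 0.0 until setRating is called
    private String director;
    private String actor;

    String getTitle() { return title; }
    void setTitle(String title) { this.title = title; }

    double getRating() { return rating; }
    void setRating(double rating) { this.rating = rating; }

    String getDirector() { return director; }
    void setDirector(String director) { this.director = director; }

    String getActor() { return actor; }
    void setActor(String actor) { this.actor = actor; }

    @Override
    public String toString() {
        return title + " (" + rating + ") " + director + " / " + actor;
    }
}
```

Because setRating takes a primitive double, the call movie.setRating(Double.valueOf(...)) in the scraping loop compiles through auto-unboxing.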
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class DBHelper {
    // Connector/J 8 and later name this class com.mysql.cj.jdbc.Driver.
    private static final String JDBC_DRIVER = "com.mysql.jdbc.Driver";
    private static final String DB_URL = "jdbc:mysql://localhost:3306/db";
    private static final String USER = "root";
    private static final String PASS = "password";

    // Throws instead of returning null, so callers never work with a missing connection.
    public static Connection getConnection() throws SQLException {
        try {
            Class.forName(JDBC_DRIVER);
        } catch (ClassNotFoundException e) {
            throw new SQLException("JDBC driver not on classpath: " + JDBC_DRIVER, e);
        }
        return DriverManager.getConnection(DB_URL, USER, PASS);
    }

    // Inserts all movies in a single batch.
    public static void saveMovies(List<Movie> movieList) {
        String sql = "INSERT INTO movie(title,rating,director,actor) VALUES (?,?,?,?)";
        try (Connection conn = getConnection();
             PreparedStatement stmt = conn.prepareStatement(sql)) {
            for (Movie movie : movieList) {
                stmt.setString(1, movie.getTitle());
                stmt.setDouble(2, movie.getRating());
                stmt.setString(3, movie.getDirector());
                stmt.setString(4, movie.getActor());
                stmt.addBatch();
            }
            stmt.executeBatch();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

public class MovieAnalyzer {
    // Prints every director with more than one movie: name, movie count,
    // and average rating, highest average first.
    public static void analyzeMovies() {
        String sql = "SELECT director, COUNT(*) AS cnt, AVG(rating) AS avg_rating "
                + "FROM movie "
                + "GROUP BY director "
                + "HAVING cnt > 1 "
                + "ORDER BY avg_rating DESC";
        try (Connection conn = DBHelper.getConnection();
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                String director = rs.getString("director");
                int cnt = rs.getInt("cnt");
                double avgRating = rs.getDouble("avg_rating");
                System.out.printf("%-20s %5d %7.2f%n", director, cnt, avgRating);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
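The same aggregation can be tried without a database, which is handy for checking the logic on a few scraped records before anything is stored. The sketch below mirrors the GROUP BY, HAVING and ORDER BY steps of the query with Java streams. DirectorStats and its nested Rated class are hypothetical stand-ins that carry only the two fields the query reads.

```java
import java.util.Comparator;
import java.util.DoubleSummaryStatistics;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical in-memory counterpart of the SQL aggregation; names are illustrative. */
final class DirectorStats {
    /** Minimal stand-in for a scraped movie: only the fields the aggregation needs. */
    static final class Rated {
        private final String director;
        private final double rating;

        Rated(String director, double rating) {
            this.director = director;
            this.rating = rating;
        }

        String getDirector() { return director; }
        double getRating() { return rating; }
    }

    /** Directors with more than one movie, ordered by average rating, highest first. */
    static List<Map.Entry<String, DoubleSummaryStatistics>> rank(List<Rated> movies) {
        // GROUP BY director, collecting count and average in one pass
        Map<String, DoubleSummaryStatistics> byDirector = movies.stream()
                .collect(Collectors.groupingBy(Rated::getDirector,
                        Collectors.summarizingDouble(Rated::getRating)));
        Comparator<Map.Entry<String, DoubleSummaryStatistics>> byAverage =
                Comparator.comparingDouble(e -> e.getValue().getAverage());
        return byDirector.entrySet().stream()
                .filter(e -> e.getValue().getCount() > 1)  // HAVING cnt > 1
                .sorted(byAverage.reversed())              // ORDER BY avg_rating DESC
                .collect(Collectors.toList());
    }
}
```

Each entry in the returned list exposes getKey() for the director and getValue().getCount() and getValue().getAverage() for the two aggregates, so it can be printed with the same format string as the JDBC version.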