Ubuntu 12.04 + Nutch 2.2.1 + MySQL Configuration Notes
Date: 2013/10/13
OS: Ubuntu 12.04 LTS
JDK: 1.7.0_21
Nutch: 2.2.1
MySQL: 5.5.32
------------------------------------------------------------
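Pre1: Install and configure Oracle JDK
Pre2: Install and configure MySQL
sudo apt-get install mysql-server mysql-client
Pre3: Install and configure Apache Ant
sudo apt-get install ant
Start: Set up Nutch 2.2.1 on Ubuntu with MySQL as the data store and UTF-8 as the default encoding
Step1: MySQL configuration
First edit /etc/mysql/my.cnf and add the following under [mysqld]: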
innodb_file_format=barracuda
innodb_file_per_table=true
innodb_large_prefix=true
character-set-server=utf8mb4
collation-server=utf8mb4_unicode_ci
max_allowed_packet=500M
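Before restarting MySQL it can help to confirm that every setting listed above actually made it into the file. The following is a minimal sketch, not part of the original setup: check_mycnf is a hypothetical helper name, and it only checks that each key is present with the spelling used in this note, not that the value is valid.

```shell
# check_mycnf FILE: report which of the Step1 settings are missing from FILE.
# Returns 0 when all keys are present, 1 otherwise.
# (check_mycnf is a hypothetical helper, not something shipped with MySQL.)
check_mycnf() {
    file="$1"
    missing=0
    for key in innodb_file_format innodb_file_per_table innodb_large_prefix \
               character-set-server collation-server max_allowed_packet; do
        # Match "key = value" at the start of a line, allowing whitespace.
        if ! grep -Eq "^[[:space:]]*${key}[[:space:]]*=" "$file"; then
            echo "missing: ${key}"
            missing=1
        fi
    done
    return "$missing"
}
```

A typical call would be check_mycnf /etc/mysql/my.cnf, followed by a MySQL restart once nothing is reported as missing.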
Then create the database and the table:
CREATE DATABASE nutch DEFAULT CHARACTER SET utf8mb4 DEFAULT COLLATE utf8mb4_unicode_ci;
CREATE TABLE `webpage` (
  `id` varchar(767) NOT NULL,
  `headers` blob,
  `text` mediumtext DEFAULT NULL,
  `status` int(11) DEFAULT NULL,
  `markers` blob,
  `parseStatus` blob,
  `modifiedTime` bigint(20) DEFAULT NULL,
  `score` float DEFAULT NULL,
  `typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
  `batchId` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
  `baseUrl` varchar(767) DEFAULT NULL,
  `content` longblob,
  `title` varchar(2048) DEFAULT NULL,
  `reprUrl` varchar(767) DEFAULT NULL,
  `fetchInterval` int(11) DEFAULT NULL,
  `prevFetchTime` bigint(20) DEFAULT NULL,
  `inlinks` mediumblob,
  `prevSignature` blob,
  `outlinks` mediumblob,
  `fetchTime` bigint(20) DEFAULT NULL,
  `retriesSinceFetch` int(11) DEFAULT NULL,
  `protocolStatus` blob,
  `signature` blob,
  `metadata` blob,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB ROW_FORMAT=COMPRESSED DEFAULT CHARSET=utf8;
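Step2: Nutch configuration
Download Nutch 2.2.1 from the official site, http://www.apache.org/dyn/closer.cgi/nutch/, and extract it to a local installation directory. The root of that directory is referred to below as ${APACHE_NUTCH_HOME}.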
Edit ${APACHE_NUTCH_HOME}/ivy/ivy.xml and uncomment the following lines:
<dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>
<dependency org="org.apache.gora" name="gora-sql" rev="0.1.1-incubating" conf="*->default"/>
Then change the following line:
<dependency org="org.apache.gora" name="gora-core" rev="0.3" conf="*->default"/>
to:
<dependency org="org.apache.gora" name="gora-core" rev="0.2.1" conf="*->default"/>
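The gora-core version change can also be applied with sed instead of by hand. This is only a sketch: pin_gora_core_rev is a hypothetical helper name, and it assumes the dependency is written on a single line with rev="0.3", as quoted above. The two commented-out dependencies still need to be uncommented manually.

```shell
# pin_gora_core_rev FILE: rewrite the gora-core dependency from rev 0.3 to
# rev 0.2.1 and leave every other dependency untouched.
# (pin_gora_core_rev is a hypothetical helper used only in this sketch.)
pin_gora_core_rev() {
    # -i.bak edits in place and keeps the original as FILE.bak.
    sed -i.bak '/name="gora-core"/ s/rev="0\.3"/rev="0.2.1"/' "$1"
}
```

It would be called as pin_gora_core_rev "${APACHE_NUTCH_HOME}/ivy/ivy.xml"; the .bak copy makes it easy to diff or roll back.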
Step3: Database connection configuration
Edit ${APACHE_NUTCH_HOME}/conf/gora.properties, comment out the default database connection settings, and add the following:
###############################
# MySQL configuration         #
###############################
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true
# Replace xxxx with your MySQL user name and password.
gora.sqlstore.jdbc.user=xxxx
gora.sqlstore.jdbc.password=xxxx
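Both parts of this step, commenting out the existing connection settings and appending the MySQL ones, can be done in one pass. The sketch below assumes that every active default connection line starts with gora.sqlstore.jdbc., which should be checked against the actual file; use_mysql_store is a hypothetical helper name.

```shell
# use_mysql_store FILE USER PASSWORD: disable the active gora.sqlstore.jdbc.*
# lines in FILE, then append the MySQL connection settings from this step.
# (use_mysql_store is a hypothetical helper used only in this sketch.)
use_mysql_store() {
    file="$1"; user="$2"; pass="$3"
    # Prefix every active gora.sqlstore.jdbc.* line with '#'; keep FILE.bak.
    sed -i.bak 's/^gora\.sqlstore\.jdbc\./#&/' "$file"
    # Append the MySQL block; the heredoc body is kept at column 0 on purpose.
    cat >> "$file" <<EOF
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true
gora.sqlstore.jdbc.user=${user}
gora.sqlstore.jdbc.password=${pass}
EOF
}
```

Run it once only, for example use_mysql_store "${APACHE_NUTCH_HOME}/conf/gora.properties" xxxx xxxx, because a second run would comment out the block it appended the first time and append another copy.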
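Step4: Table mapping configuration
Edit ${APACHE_NUTCH_HOME}/conf/gora-sql-mapping.xml. The recommended approach is to follow the automatic table-generation method introduced earlier; guides online also say to change the primarykey length from 512 to 767.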
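The primarykey change from 512 to 767 could be scripted roughly as follows. This is a sketch under two assumptions that this note does not show: that the mapping lives in ${APACHE_NUTCH_HOME}/conf/gora-sql-mapping.xml, and that the element is written on one line with a length="512" attribute, in the form <primarykey column="id" length="512"/>. Check the file before relying on it. widen_primarykey is a hypothetical helper name.

```shell
# widen_primarykey FILE: change length="512" to length="767", but only on
# lines that contain a <primarykey element; other 512-length fields are kept.
# (widen_primarykey is a hypothetical helper used only in this sketch.)
widen_primarykey() {
    # -i.bak edits in place and keeps the original as FILE.bak.
    sed -i.bak '/<primarykey/ s/length="512"/length="767"/' "$1"
}
```

The value 767 matches the varchar(767) id column created in Step1, which is why the two lengths are changed together.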
Step5: nutch-site.xml configuration
Add the following configuration:
<property>
  <name>http.agent.name</name>
  <value>Your Nutch Spider</value>
</property>
<property>
  <name>http.accept.language</name>
  <value>zh-cn, en-us,en-gb,en;q=0.7,*;q=0.3</value>
  <description>*</description>
</property>
<property>
  <name>parser.character.encoding.default</name>
  <value>utf-8</value>
  <description>*</description>
</property>
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.sql.store.SqlStore</value>
  <description>*</description>
</property>
If the generate phase fails with the following exception:
java.lang.NullPointerException
    at org.apache.avro.util.Utf8.<init>(Utf8.java:37)
    at org.apache.nutch.crawl.GeneratorReducer.setup(GeneratorReducer.java:100)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
also add this property to nutch-site.xml:
<property>
  <name>generate.batch.id</name>
  <value>*</value>
</property>
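Step6: Build Nutch
The ant commands themselves are not explained here. Just switch to ${APACHE_NUTCH_HOME} and run ant clean, then ant. When the build finishes, a runtime folder is generated under ${APACHE_NUTCH_HOME}.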
Step7: Web crawling and seed configuration
Create the seed file:
cd ${APACHE_NUTCH_HOME}/runtime/local
mkdir -p urls
echo 'http://www.sina.com.cn' > urls/seed.txt
echo 'http://www.ifeng.com' >> urls/seed.txt
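A seed file needs one URL per line, and a plain > redirect replaces the whole file each time it runs, so only an append (>>) or a single write keeps more than one seed. The sketch below writes all seeds in one go; write_seeds is a hypothetical helper name, not a Nutch command.

```shell
# write_seeds FILE URL...: write each URL on its own line of FILE,
# creating the parent directory if needed. Replaces any existing FILE.
# (write_seeds is a hypothetical helper used only in this sketch.)
write_seeds() {
    file="$1"
    shift
    mkdir -p "$(dirname "$file")"
    # printf repeats its format for every remaining argument,
    # so all URLs go through a single redirect.
    printf '%s\n' "$@" > "$file"
}
```

From ${APACHE_NUTCH_HOME}/runtime/local this would be called as write_seeds urls/seed.txt 'http://www.sina.com.cn' 'http://www.ifeng.com'.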
Then start the crawl:
bin/nutch crawl urls -depth 5 -topN 10
At this point the basic configuration is complete.
