Should crawled data (around 100 GB, more than 20 million records) be stored in MySQL or MongoDB?


PHPz

Jun 06, 2016 pm 04:22 PM

mongodb mysql

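As a non-relational database, MongoDB's main advantage is that it is schema-less. Crawled data tends to be "dirty": a given record will not contain every field that can be scraped, so a store that does not require a strictly defined schema suits it very well. MongoDB's built-in sharding also makes it horizontally scalable. Apart from joins, its aggregation framework can stand in for SQL statements and supports very fast statistical analysis.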
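To make the comparison with SQL concrete, here is a minimal sketch of an aggregation pipeline that plays the role of a GROUP BY query. The field names `domain` and `status` and the function name are assumptions made for illustration; they do not come from the original answer.

```python
from typing import Any, Dict, List


def pages_per_domain_pipeline(min_status: int = 200,
                              max_status: int = 299) -> List[Dict[str, Any]]:
    """Build a pipeline roughly equivalent to:

        SELECT domain, COUNT(*) AS pages
        FROM pages
        WHERE status BETWEEN 200 AND 299
        GROUP BY domain
        ORDER BY pages DESC

    The "domain" and "status" fields are hypothetical.
    """
    return [
        # WHERE: keep only successfully fetched pages
        {"$match": {"status": {"$gte": min_status, "$lte": max_status}}},
        # GROUP BY domain with COUNT(*)
        {"$group": {"_id": "$domain", "pages": {"$sum": 1}}},
        # ORDER BY pages DESC
        {"$sort": {"pages": -1}},
    ]
```

With PyMongo, the list would be passed to `collection.aggregate(pages_per_domain_pipeline())`, where `collection` is whatever collection holds the crawled pages.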

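As for the asker's 100 GB and 20 million records (about 5 KB per record), in my experience that is not much of a problem for MongoDB. If you need global statistics, use sharding plus the built-in Map-Reduce; if you need to filter, build indexes (earlier answers also claimed that MongoDB's query speed is beyond what MySQL can match). Joins are unlikely to be needed either, since there is no need to normalize.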
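The "sharding plus Map-Reduce" idea can be sketched in plain Python: each shard counts its own records (map), and the partial results are merged (reduce). This is only an illustration of the pattern, not MongoDB's own mapReduce command, which takes JavaScript functions; the `domain` field and the sample data are made up.

```python
from collections import Counter
from typing import Iterable


def map_shard(records: Iterable[dict]) -> Counter:
    """Map phase on one shard: count records per domain locally."""
    counts: Counter = Counter()
    for record in records:
        # Dirty data: some records may lack the field entirely.
        counts[record.get("domain", "unknown")] += 1
    return counts


def reduce_counts(partials: Iterable[Counter]) -> Counter:
    """Reduce phase: merge the per-shard counts into global totals."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return total


# Two hypothetical shards holding a few crawled records each.
shards = [
    [{"domain": "a.example"}, {"domain": "b.example"}],
    [{"domain": "a.example"}, {}],
]
totals = reduce_counts(map_shard(shard) for shard in shards)
```

Because each shard only produces a small table of counts, the merge step stays cheap even when the shards themselves are large.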

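In short, it depends on what you want to do with the data. For plain raw-data storage you can simply write it to txt files and load them into HDFS later. For a data warehouse design, MySQL can serve as a lightweight carrier for aggregate tables, acting as the back-end data source for OLAP.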
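For the "just write it to text files" option, one common layout is one JSON object per line, which is easy to append to and to load in bulk later. The sketch below assumes that layout; the answer itself only says "txt files", and the function names are hypothetical.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator


def append_records(path: Path, records: Iterable[dict]) -> int:
    """Append records to a text file, one JSON object per line."""
    written = 0
    with path.open("a", encoding="utf-8") as f:
        for record in records:
            # json.dumps escapes newlines, so each record stays on one line.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            written += 1
    return written


def read_records(path: Path) -> Iterator[dict]:
    """Stream records back without loading the whole file into memory."""
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)
```

Records with different fields can sit side by side in the same file, so no schema has to be decided before crawling.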

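In any case, I do not see the need to use MySQL purely for storage in this situation.

Well, at this scale and for this purpose, MySQL or MongoDB makes little difference.
But since this is crawler data, it has to change whenever the source data structure changes, so MongoDB's schema model is the more convenient fit.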
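As a rough sketch of what "following the source structure" looks like in practice, the function below upserts a page keyed by its URL and stores only the fields that were actually scraped. `update_one` with `$set` and `upsert=True` is the PyMongo call; the `url` key, the function name, and the database and collection names in the comment are assumptions.

```python
from typing import Any, Dict


def save_page(collection: Any, page: Dict[str, Any]) -> None:
    """Upsert one crawled page keyed by URL.

    `collection` is expected to behave like a pymongo Collection.
    Fields that are missing or None are simply not stored, so pages
    with different shapes can live in the same collection.
    """
    if "url" not in page:
        raise ValueError("page must have a 'url' field")
    fields = {key: value for key, value in page.items() if value is not None}
    collection.update_one({"url": page["url"]}, {"$set": fields}, upsert=True)


# With a MongoDB server assumed to be running locally, usage might look like:
#   from pymongo import MongoClient
#   pages = MongoClient("mongodb://localhost:27017")["crawler"]["pages"]
#   save_page(pages, {"url": "http://example.com/", "title": "Example"})
#   save_page(pages, {"url": "http://example.com/a", "price": 9.9, "tags": ["x"]})
```

When the source site adds or drops a field, the crawler can start writing the new shape immediately, without a table migration.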

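MongoDB is fast mainly for the following reasons:
Writes:
- Before 3.0 the storage engine was MMAP (memory-mapped files): a write can return as soon as it completes in memory, which gives high concurrency. There are also several write-safety levels for different durability requirements.
- From 3.0 on, the WiredTiger (WT) engine and its MVCC (multi-version concurrency control) greatly improve write efficiency: concurrency goes up and lock granularity gets finer.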

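Reads:
Unlike MySQL, MongoDB's document-oriented design keeps the contents of a single document in contiguous locations (in memory and on disk). A relational database has to gather the data from several places, with joins and the like; the document layout avoids this and so reduces random I/O.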
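The difference in layout can be illustrated without any database: the sketch below folds child rows into their parent rows, producing one self-contained document per page. The `pages` and `links` tables and their columns are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def embed_links(pages: Iterable[dict], links: Iterable[dict]) -> List[dict]:
    """Fold child rows (links) into their parent rows (pages).

    Relational layout: two tables joined on links.page_id = pages.id.
    Document layout: each page carries its own list of links.
    """
    links_by_page: Dict[int, List[str]] = defaultdict(list)
    for link in links:
        links_by_page[link["page_id"]].append(link["href"])
    return [{**page, "links": links_by_page.get(page["id"], [])} for page in pages]
```

Reading one page with its links then touches a single document rather than rows from two tables.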

MongoDB's design patterns also encourage us to keep the working set small enough to fit in RAM.

Since 3.0, WiredTiger's MVCC has also greatly improved read efficiency.

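Finally, sharding gives us horizontal scalability for reads and writes that include the shard key, which also improves efficiency.

This question is bait, surely...
Any database will comfortably store 100 GB and 20 million records; what actually drives the choice is how you are going to use the data...

Use whichever one you know best.

I think it depends on what language your crawler is written in: for PHP, MySQL is the better fit; for Node.js, go with MongoDB. You also need to think about the data structure, and you have not given the read/write requirements, so it is hard to recommend anything. Still, I favor Mongo because it fits seamlessly with Node. The downside is that it is fairly new and has many pitfalls. Later, once you have the data, you can build the crawler application directly on the MEAN stack O(∩_∩)O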

First of all, once the data gets large, storage is definitely not easy, and many factors have to be considered.

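Storing large volumes of crawled data in a relational database is often not a good fit: when data volume and concurrency are high, the capacity and read/write throughput of a relational database become the bottleneck. Besides, the page information a crawler saves generally does not need relations between records.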

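A better approach is to store it in a column-family NoSQL database. Google's BigTable paper proposes using BigTable to store web page information, and open-source column-family databases such as HBase and Cassandra are also well suited to this kind of data. For each page crawled, construct a key (for example the URL with its domain reversed, or a hashed key) and a set of columns (page content and so on), and insert them into HBase as one row.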
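The two key styles mentioned above could be built roughly as follows. This is a sketch under assumptions: the exact key format (dropping the scheme, keeping the query string) and the choice of MD5 for the hashed variant are illustrative choices, not something the answer specifies.

```python
import hashlib
from urllib.parse import urlsplit


def reversed_domain_key(url: str) -> str:
    """Build a row key with the host name reversed, e.g.
    http://www.example.com/a/b?x=1 -> com.example.www/a/b?x=1

    Pages from the same domain then sort next to each other.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    reversed_host = ".".join(reversed(host.split(".")))
    key = reversed_host + (parts.path or "/")
    if parts.query:
        key += "?" + parts.query
    return key


def hashed_key(url: str) -> str:
    """Build a fixed-length key that spreads rows evenly across the key space."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()
```

The reversed-domain key favors range scans over one site, while the hashed key favors an even write load; which matters more depends on the access pattern.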

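A fairly general solution for large-scale distributed crawling is Nutch + Gora + HBase + Solr/Elasticsearch: the crawled data is stored in HBase with Gora as the data abstraction layer, then imported into Solr or Elasticsearch for indexing. You can also run MapReduce through Gora or load the data into Spark for computation.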

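However, this solution is not really suitable for ordinary developers: setting up and maintaining HBase is tedious, brings a lot of learning cost, and problems have to be tracked down when they occur. More importantly, none of this has anything to do with crawling; it is purely a storage problem.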

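So what I ultimately recommend is a cloud solution. Alibaba Cloud OTS is an HBase-like NoSQL database with low cost and good read/write performance, which suits the crawler scenario very well:

- No need to build or operate it yourself; just create an instance and use it, with no worries about scale.
- Billing is based on reserved read/write capacity: raise it while the crawler is running and lower it once the crawl is finished.
- Storage cost is very low.
- With the data in OTS, compute resources can be scaled up or down elastically. For example, if the crawl needs many cloud servers, they can be released as soon as it is done; likewise, if you want to analyze the crawled data, you only need to buy cloud servers for the duration of the computation, export the data from OTS, and release the servers when the computation completes.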

Open Table Service (OTS): massive data storage

Disclosure: I work on OTS development.

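What kind of data is it? What processing will you do once you have it? What is the query pattern?
In general, use MySQL when there are many joins and filters, and MongoDB when you read whole chunks of data.
If the use case is simple, plain-text JSON is not bad either.

MongoDB should be somewhat better; it was designed from the start to handle big-data storage.

Mongo.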

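- No strictly defined schema; data is stored and read as JSON.
- A crawler's fields change often and field definitions may be altered; MongoDB is lenient about this.
- MongoDB is document-oriented and built for storing massive amounts of data.
- It scales horizontally with ease; sharding and replica sets can be set up in minutes.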
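To show why a shard key makes horizontal scaling possible, here is a deliberately simplified routing function: the key alone decides which shard a record lives on, so any reader or writer that knows the key goes straight to one machine. This is a toy hash-modulo scheme for illustration, not the chunk-based routing MongoDB actually performs, and the function name is hypothetical.

```python
import hashlib


def shard_for(shard_key: str, shard_count: int) -> int:
    """Map a shard key to a shard index in [0, shard_count).

    Toy scheme: hash the key and take it modulo the number of shards.
    """
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    digest = hashlib.sha1(shard_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count
```

A query without the shard key has no such shortcut and must ask every shard, which is why the choice of key matters.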

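MongoDB has pitfalls too. Since 3.2 the default storage engine is the new WiredTiger, whose memory usage is a bit painful: even for a database that mostly stores data and serves few queries, memory can still occasionally give out. That is not a big deal; running it inside a Docker container keeps things just as convenient.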

