MongoDB and Hadoop: A Step-by-Step Tutorial
The following is a guest post from Jeremy Karn. This article is excerpted from ‘MongoDB + Hadoop: A Step-by-Step Tutorial’. Jeremy is a cofounder at Mortar Data, a Hadoop-as-a-service provider, and creator of mortar, an open source framework for data processing.
People who are worried about scalability often find themselves looking at two tools: MongoDB for storing large amounts of data easily and Hadoop for processing that data. But a common question is: “How do I combine these two to really get the most out of my data?”
Here’s a step-by-step tutorial that will get you up and running with MongoDB and Hadoop in a matter of minutes. And the best part about this tutorial is that at the end you’ll be ready to jump right into using your own MongoDB data with Hadoop.
For this tutorial you’ll be using Apache Pig, a high-level data flow language that compiles down into Hadoop MapReduce jobs. It was designed to be easy to learn and simple to write. If you’ve written SQL, Pig will feel familiar: it reads like procedural SQL.
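If you haven’t seen Pig before, here’s a minimal sketch of a Pig data flow (the file path and field names are made up for illustration); notice how FILTER, GROUP, and FOREACH play roughly the roles of SQL’s WHERE, GROUP BY, and SELECT:

-- Load a tab-delimited file of tweets (hypothetical path and schema)
tweets = LOAD 's3n://my-bucket/tweets.tsv' USING PigStorage('\t')
         AS (user_lang:chararray, text:chararray);
-- Keep only tweets that actually have a language set
with_lang = FILTER tweets BY user_lang IS NOT NULL;
-- Count tweets per language
by_lang = GROUP with_lang BY user_lang;
counts  = FOREACH by_lang GENERATE group AS lang, COUNT(with_lang) AS num_tweets;
-- Write the result back out
STORE counts INTO 's3n://my-bucket/output/tweet_counts';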
To run your Hadoop jobs, you’re going to use a free Mortar account. Mortar provides Hadoop as a service, which means you can run your jobs without worrying about how to set up and manage a multi-node Hadoop cluster.
To get started, we’ve already set up a small MongoDB instance on MongoLab, populated it with a random sampling of Twitter data from a single day (around 120,000 tweets), and created a read-only user for you.
We’ve also set up a public Github repo with a Mortar project that has three Pig scripts ready to run. Here’s what you need to do:
- If you don’t already have a free GitHub account, create one. You’ll need your GitHub username in step 4.
- Sign into (or create) your free Mortar account.
- After you receive the confirmation email, log into Mortar at https://app.mortardata.com.
- Install the Mortar Development Framework:
gem install mortar
- Clone the example git project and register it as a Mortar project:
git clone git@github.com:mortardata/mongo-pig-examples.git
cd mongo-pig-examples
mortar register mongo-pig-examples
Script 1 - Characterize Collection
If you’re like most MongoDB users, you may not have a great sense of the different fields, data types, or values in your collection. We built characterize_collection.pig to deeply inspect your collection to extract that information.
From the base directory of the mongo-pig-examples project you just cloned, take a look at pigscripts/characterize_collection.pig. It loads all the data in the collection as a map, sends the map to Python (udfs/python/mongo_util.py) to gather a bunch of metadata, calculates some basic information about the collection, and then writes the results out to an S3 bucket.
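As a rough sketch of that pattern (this is not the literal contents of characterize_collection.pig; the UDF name, loader arguments, and output path below are placeholders), the flow looks something like this:

-- Register the project's Python UDFs (Mortar runs CPython UDFs via streaming_python)
REGISTER 'udfs/python/mongo_util.py' USING streaming_python AS mongo_util;
-- Load each document in the collection as a single Pig map
raw = LOAD 'mongodb://readonly_user:password@host:27017/twitter.tweets'
      USING com.mongodb.hadoop.pig.MongoLoader();
-- Have Python walk every document and emit (field, value, type) records
field_info = FOREACH raw GENERATE FLATTEN(mongo_util.describe_document($0));
-- Count how often each (field, value, type) combination appears
grouped = GROUP field_info BY (field_name, value, type);
summary = FOREACH grouped GENERATE FLATTEN(group), COUNT(field_info) AS occurrences;
-- Write the summary out to S3
STORE summary INTO 's3n://my-bucket/characterize_output' USING PigStorage('\t');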
To see this script in action, let’s run it on a 4-node Hadoop cluster. In your terminal (from the base directory of your mongo-pig-examples project) run:
mortar run characterize_collection --clustersize 4
This job will take about 10 minutes to finish. You can monitor the job’s status on the command line or at https://app.mortardata.com/jobs.
Once the job has finished, you’ll receive an email with a link to your job results. Clicking on this link will bring you into the Mortar web app, where you can download the results from S3. The output is described at the top of the characterize_collection script, but as an example you can scroll down the output and find:
…
user.is_translator   2    false   unicode   118806
user.is_translator   2    true    unicode   31
user.lang            26   en      unicode   114108
user.lang            26   es      unicode   3462
user.lang            26   fr      unicode   532
user.lang            26   pt      unicode   281
user.lang            26   ja      unicode   79
user.listed_count    398  0       int       73757
user.listed_count    398  1       int       18518
Looking at the values for user.lang, we see that there are 26 unique values for the field in our dataset. The most common was “en” with 114108 occurrences; the next most common was “es” with 3462 occurrences, and so on. To see the full results without running the job, you can view the output file here.
Script 2 - MongoDB Schema Generator
It can be tricky to declare MongoDB’s highly nested schemas properly in Pig. Pig is forgiving: it can work without a schema, or with an inconsistent or incorrect one. But your Pig code is easier to read and write when you have a schema, because it lets you (and the Pig optimizer) focus on just the relevant data.
So this next script automatically generates a Pig schema by examining your MongoDB collection. If you don’t need the whole schema, you can easily edit it to keep just the fields you want.
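For a sense of what such a declaration looks like, a trimmed-down schema for the Twitter data in this tutorial might resemble the following (the connection string is a placeholder, and the nesting is illustrative rather than the generator’s exact output):

-- Load only the fields you care about, with an explicit nested schema
tweets = LOAD 'mongodb://readonly_user:password@host:27017/twitter.tweets'
         USING com.mongodb.hadoop.pig.MongoLoader('created_at:chararray, text:chararray, user:tuple(lang:chararray, listed_count:int, is_translator:chararray)');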
Running this script is similar to running the previous one. If you ran the Characterize Collection script in the past hour, the same cluster you used for that job should still be running. In that case, you can just run:
mortar run mongo_schema_generator
If you don’t have a cluster that’s still running, just run the job on a new 4-node cluster like this:
mortar run mongo_schema_generator --clustersize 4
Script 3 - Twitter Hourly Coffee Tweets
The third script (pigscripts/hourly_coffee_tweets.pig) demonstrates how to work with just a small subset of the fields in our MongoDB collection. For our example, we’ll look at how often the word “coffee” is tweeted throughout the day. As with the Mongo Schema Generator script, you can run this job on an existing cluster or start up a new one.
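The heart of that script boils down to something like the following sketch (again, placeholders rather than the literal contents of hourly_coffee_tweets.pig):

-- Load just the two fields we need: the tweet text and its creation time
tweets = LOAD 'mongodb://readonly_user:password@host:27017/twitter.tweets'
         USING com.mongodb.hadoop.pig.MongoLoader('created_at:chararray, text:chararray');
-- Keep only tweets that mention coffee (case-insensitive)
coffee = FILTER tweets BY LOWER(text) MATCHES '.*coffee.*';
-- Twitter timestamps look like 'Wed Oct 10 14:32:10 +0000 2012';
-- characters 11-12 are the hour, so group and count by that slice
hourly  = FOREACH coffee GENERATE SUBSTRING(created_at, 11, 13) AS hour;
grouped = GROUP hourly BY hour;
counts  = FOREACH grouped GENERATE group AS hour, COUNT(hourly) AS num_tweets;
STORE counts INTO 's3n://my-bucket/hourly_coffee_tweets';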
Next Steps
If you already have a MongoDB instance or cluster running in EC2 US-East, the first two example scripts should run against one of your collections with only minor modifications. You’ll just need to:
- Update the MongoLoader connection strings in the Pig scripts to connect to your MongoDB collections with one of your own users (see the connection-string sketch after this list). If your mongo instance is on a non-standard port (any port other than 27017), just email us at support@mortardata.com so we can allow your Mortar account to access that port.
- If you’d like your jobs to write to one of your own S3 buckets, you can update the AWS keys associated with your Mortar account by following these instructions to enable S3 access.
- If you run out of free cluster hours with Mortar, you can upgrade your account to get additional free hours each month.
- You can find more resources for learning Pig here.
- If you have any questions or feedback, please contact us at support@mortardata.com or ping us via in-app chat at app.mortardata.com.
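As mentioned in the first item above, the only change the example scripts usually need is the MongoLoader connection string. The general format (every value below is a placeholder for your own instance) looks like this:

-- mongodb://<user>:<password>@<host>:<port>/<database>.<collection>
tweets = LOAD 'mongodb://your_user:your_password@your-host.ec2.amazonaws.com:27017/your_db.your_collection'
         USING com.mongodb.hadoop.pig.MongoLoader();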