MongoDB and Hadoop: A Step-by-Step Tutorial Using
The following is a guest post from Jeremy Karn. This article is excerpted from ‘MongoDB + Hadoop: A Step-by-Step Tutorial’. Jeremy is a cofounder at Mortar Data, a Hadoop-as-a-service provider, and creator of mortar, an open source framework for data processing.
People who are worried about scalability often find themselves looking at two tools: MongoDB for storing large amounts of data easily and Hadoop for processing that data. But a common question is: “How do I combine these two to really get the most out of my data?”
Here’s a step-by-step tutorial that will get you up and running with MongoDB and Hadoop in a matter of minutes. And the best part about this tutorial is that at the end you’ll be ready to jump right into using your own MongoDB data with Hadoop.
For this tutorial you’ll be using Apache Pig, a high-level data flow language that compiles down into Hadoop MapReduce jobs. It was designed to be easy to learn and simple to write. If you’ve written SQL, Pig will feel familiar: it is like procedural SQL.
To run your Hadoop jobs, you’re going to use a free Mortar account. Mortar provides Hadoop as a service, which means you can run your jobs without worrying about how to set up and manage a multi-node Hadoop cluster.
To get started, we’ve already set up a small MongoDB instance on MongoLab, populated it with a random sampling of Twitter data from a single day (around 120,000 tweets), and created a read-only user for you.
We’ve also set up a public GitHub repo with a Mortar project that has three Pig scripts ready to run. Here’s what you need to do:
- If you don’t already have a free GitHub account, create one. You’ll need a GitHub username in step 4.
- Sign into (or create) your free Mortar account.
- After you receive the confirmation email, log into Mortar at https://app.mortardata.com.
- Install the Mortar Development Framework:
gem install mortar
- Clone the example git project and register it as a mortar project:
git clone git@github.com:mortardata/mongo-pig-examples.git
cd mongo-pig-examples
mortar register mongo-pig-examples
Script 1 - Characterize Collection
If you’re like most MongoDB users, you may not have a great sense of the different fields, data types, or values in your collection. We built characterize_collection.pig to deeply inspect your collection to extract that information.
From the base directory of the mongo-pig-examples project you just cloned, take a look at pigscripts/characterize_collection.pig. It loads all the data in the collection as a map, sends the map to Python (udfs/python/mongo_util.py) to gather a bunch of metadata, calculates some basic information about the collection, and then writes the results out to an S3 bucket.
To see this script in action, let’s run it on a 4-node Hadoop cluster. In your terminal (from the base directory of your mongo-pig-examples project) run:
mortar run characterize_collection --clustersize 4
This job will take about 10 minutes to finish. You can monitor the job’s status on the command line or by going to https://app.mortardata.com/jobs.
Once the job has finished, you’ll receive an email with a link to your job results. Clicking on this link will bring you into the Mortar web app, where you can download the results from S3. The output is described at the top of the characterize_collection script, but as an example, you can scroll down the output and find:
…
user.is_translator 2 false unicode 118806
user.is_translator 2 true unicode 31
user.lang 26 en unicode 114108
user.lang 26 es unicode 3462
user.lang 26 fr unicode 532
user.lang 26 pt unicode 281
user.lang 26 ja unicode 79
user.listed_count 398 0 int 73757
user.listed_count 398 1 int 18518
Looking at the values for user.lang, we see that there are 26 unique values for the field in our dataset. The most common was “en” with 114108 occurrences, the next most common was “es” with 3462 occurrences, and so on. To see the full results without running the job you can view the output file here.
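To make the shape of that output concrete, here is a minimal Python sketch of the same idea: flatten each nested document into dotted field names, then count how often each value appears per field. This is not the code in udfs/python/mongo_util.py; the function names and sample documents are hypothetical, and Python 3 reports string values as str where the output above shows unicode.

```python
from collections import Counter, defaultdict


def flatten(doc, prefix=""):
    """Yield (dotted_field_path, value) pairs for every leaf in a nested document."""
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            yield from flatten(value, prefix=path + ".")
        else:
            yield path, value


def characterize(docs):
    """Return rows of (field, unique_values, value, type_name, occurrences)."""
    counts = defaultdict(Counter)
    for doc in docs:
        for path, value in flatten(doc):
            # Key on the string form so unhashable values such as lists can be counted.
            counts[path][(str(value), type(value).__name__)] += 1

    rows = []
    for path in sorted(counts):
        unique = len(counts[path])
        # most_common() orders values from most to least frequent.
        for (value, type_name), occurrences in counts[path].most_common():
            rows.append((path, unique, value, type_name, occurrences))
    return rows


# Hypothetical sample documents, not taken from the tutorial's dataset.
sample = [
    {"user": {"lang": "en", "listed_count": 0}},
    {"user": {"lang": "en", "listed_count": 1}},
    {"user": {"lang": "es", "listed_count": 0}},
]

for row in characterize(sample):
    print(*row)
```

Each printed row follows the same column order as the job output: field name, number of unique values, a value, its type, and how many times that value occurred.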
Script 2 - MongoDB Schema Generator
It can be tricky to properly declare MongoDB’s highly nested schemas in Pig. Now, Pig is graceful: it can run without a schema, or with an inconsistent or incorrect one. But it’s easier to read and write your Pig code if you have a schema because it allows you (and the Pig optimizer) to focus on just the relevant data.
So this next script automatically generates a Pig schema by examining your MongoDB collection. If you don’t need the whole schema, you can easily edit it to keep just the fields you want.
Running this script is similar to running the previous one. If you ran the Characterize Collection script in the past hour, the same cluster you used for that job should still be running. In that case, you can just run:
mortar run mongo_schema_generator
If you don’t have a cluster that’s still running, just run the job on a new 4-node cluster like this:
mortar run mongo_schema_generator --clustersize 4
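As a rough illustration of what deriving a Pig schema from data involves, the Python sketch below walks one nested document and emits a Pig-style schema string. It is not the project's generator script: the type mapping is a simplified assumption, it looks at a single hypothetical document rather than a whole collection, and it does not handle field names that are invalid in Pig.

```python
def pig_type(value):
    """Map a Python value to a Pig type declaration (a simplified, assumed mapping)."""
    # bool must be checked before int because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, dict):
        return f"tuple({pig_schema(value)})"
    if isinstance(value, list):
        # Describe a list by its first element only (an assumption of this sketch).
        if value and isinstance(value[0], dict):
            return f"bag{{t: tuple({pig_schema(value[0])})}}"
        if value:
            return f"bag{{t: tuple(item: {pig_type(value[0])})}}"
        return "bag{}"
    # Strings, None, and anything unrecognized fall back to chararray.
    return "chararray"


def pig_schema(doc):
    """Build a comma-separated 'name: type' list for one document."""
    return ", ".join(f"{name}: {pig_type(value)}" for name, value in doc.items())


# Hypothetical tweet-like document for illustration.
doc = {
    "text": "coffee time",
    "retweet_count": 3,
    "user": {"lang": "en", "verified": False},
}
print(pig_schema(doc))
```

Trimming the result down to the fields you care about is then a matter of deleting entries from the generated string.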
Script 3 - Twitter Hourly Coffee Tweets
Using a Twitter coffee tweets script (pigscripts/hourly_coffee_tweets.pig), we’re going to demonstrate how we can use a small subset of the fields in our MongoDB collection. For our example, we’ll look at how often the word “coffee” is tweeted throughout the day. As with the Mongo Schema Generator script, you can run this job on an existing cluster or start up a new one.
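The logic of such a job can be sketched in a few lines of Python: keep only two fields per tweet, filter on the word “coffee”, and count matches per hour of the day. This is a minimal sketch, not the contents of hourly_coffee_tweets.pig; the field names text and created_at and the timestamp layout are assumptions based on Twitter's classic API format, and the sample tweets are hypothetical.

```python
from collections import Counter
from datetime import datetime

# Assumed timestamp layout, e.g. "Mon Sep 24 03:35:21 +0000 2012".
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def hourly_coffee_counts(tweets):
    """Count tweets mentioning 'coffee', grouped by hour of day (0-23)."""
    counts = Counter()
    for tweet in tweets:
        text = tweet.get("text") or ""
        if "coffee" in text.lower():
            hour = datetime.strptime(tweet["created_at"], CREATED_AT_FORMAT).hour
            counts[hour] += 1
    # Return the hours in ascending order for easier reading.
    return dict(sorted(counts.items()))


# Hypothetical sample tweets for illustration.
tweets = [
    {"text": "Need COFFEE now", "created_at": "Mon Sep 24 08:15:00 +0000 2012"},
    {"text": "coffee break", "created_at": "Mon Sep 24 08:45:10 +0000 2012"},
    {"text": "tea please", "created_at": "Mon Sep 24 09:00:00 +0000 2012"},
    {"text": "Afternoon coffee", "created_at": "Mon Sep 24 15:05:00 +0000 2012"},
]
print(hourly_coffee_counts(tweets))
```

The hours here are in the timestamp's own offset (UTC in the sample); converting to each user's local time would need extra fields and is left out of this sketch.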
Next Steps
If you already have a mongo instance/cluster based in US-East EC2, the first two example scripts should run on one of your collections with only minor modifications. You’ll just need to:
- Update the MongoLoader connection strings in the pig scripts to connect to your MongoDB collections with one of your own users. If your mongo instance is on a non-standard port (any port other than 27017), just email us at support@mortardata.com to allow your Mortar account to access that port.
- If you’d like your jobs to write to one of your own S3 buckets, you can update the AWS keys associated with your Mortar account by following these instructions to enable S3 access.
- If you run out of free cluster hours with Mortar, you can upgrade your account to get additional free hours each month.
- You can find more resources for learning Pig here.
- If you have any questions or feedback, please contact us at support@mortardata.com or ping us on in-app chat at app.mortardata.com.
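Before pointing the scripts at your own database, it may help to check whether your connection string uses the standard port mentioned above. The Python sketch below does that with the standard library's URL parser; the URI, host name, and credentials are hypothetical, and the helper only handles single-host URIs (not comma-separated replica set host lists).

```python
from urllib.parse import urlsplit

DEFAULT_MONGO_PORT = 27017


def mongo_port(uri):
    """Return the port in a single-host MongoDB URI, or the default if none is given."""
    port = urlsplit(uri).port
    return DEFAULT_MONGO_PORT if port is None else port


def uses_standard_port(uri):
    """True when the URI targets port 27017, explicitly or by omission."""
    return mongo_port(uri) == DEFAULT_MONGO_PORT


# Hypothetical connection strings for illustration.
standard = "mongodb://readonly:secret@db.example.com:27017/twitter.tweets"
custom = "mongodb://readonly:secret@db.example.com:37017/twitter.tweets"

print(uses_standard_port(standard))
print(uses_standard_port(custom))
```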