
Hadoop Example: Multi-Table Join

Jun 07, 2016 pm 04:31 PM

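A multi-table join is similar to a single-table join: it processes the raw data to extract the information of interest.

The input consists of two files. One is the factory table, with a factory-name column and an address-ID column. The other is the address table, with an address-ID column and an address-name column. The task is to find which address name each factory name corresponds to and to output a table of factory names and address names.

Sample data: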

factory:

factoryname addressed
Beijing Red Star 1
Shenzhen Thunder 3
Guangzhou Honda 2
Beijing Rising 1
Guangzhou Development Bank 2
Tencent 3
Bank of Beijing 1

address:

addressID addressname
1 Beijing
2 Guangzhou
3 Shenzhen
4 Xian
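
In both samples the address ID is the only token that starts with a digit: it comes last in a factory row and first in an address row. That position is enough to tell the two tables apart. The following is a minimal sketch of that tagging step in plain Java, without Hadoop; the class name `TagLine` and the method `tag` are hypothetical names chosen for illustration, and the sketch assumes that no factory or address name contains a token starting with a digit.

```java
import java.util.StringTokenizer;

public class TagLine {
    /**
     * Splits one input line into {joinKey, tag, name}.
     * tag is "1" for a factory row (ID after the name)
     * and "2" for an address row (ID before the name).
     */
    static String[] tag(String line) {
        StringTokenizer itr = new StringTokenizer(line);
        String key = "";
        String tag = "";
        StringBuilder name = new StringBuilder();
        int nameTokens = 0; // name tokens seen before the ID
        while (itr.hasMoreTokens()) {
            String token = itr.nextToken();
            if (Character.isDigit(token.charAt(0))) {
                key = token;
                tag = nameTokens > 0 ? "1" : "2";
                continue;
            }
            if (name.length() > 0) {
                name.append(' ');
            }
            name.append(token);
            nameTokens++;
        }
        return new String[] {key, tag, name.toString()};
    }

    public static void main(String[] args) {
        // prints 2|1|Guangzhou Development Bank
        System.out.println(String.join("|", tag("Guangzhou Development Bank 2")));
        // prints 3|2|Shenzhen
        System.out.println(String.join("|", tag("3 Shenzhen")));
    }
}
```

The join key and the tag travel together, so a later step can group rows by key and still know which table each row came from.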



Result:

factoryname     addressname
Beijing Red Star        Beijing
Beijing Rising  Beijing
Bank of Beijing         Beijing
Guangzhou Honda         Guangzhou
Guangzhou Development Bank      Guangzhou
Shenzhen Thunder        Shenzhen
Tencent         Shenzhen
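
The result above is what a reduce-side join produces: rows from both tables are grouped by address ID, and within each group every factory name is paired with every address name. An address with no factory, such as Xian, yields no output row. The following is a minimal sketch of that grouping and pairing in plain Java (Java 8 or later), without Hadoop. The class name `ReduceSideJoinSketch`, the method `join`, and the `{key, tag, name}` record layout are assumptions made for illustration, with tag "1" marking a factory row and "2" an address row.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ReduceSideJoinSketch {
    /** Joins tagged records of the form {key, tag, name}. */
    static List<String> join(List<String[]> tagged) {
        // Stand-in for the shuffle phase: group records by join key, keys sorted.
        Map<String, List<String[]>> groups = new TreeMap<>();
        for (String[] rec : tagged) {
            groups.computeIfAbsent(rec[0], k -> new ArrayList<>()).add(rec);
        }
        List<String> out = new ArrayList<>();
        for (List<String[]> group : groups.values()) {
            // Separate the group into left (factory) and right (address) rows.
            List<String> factories = new ArrayList<>();
            List<String> addresses = new ArrayList<>();
            for (String[] rec : group) {
                if (rec[1].equals("1")) {
                    factories.add(rec[2]);
                } else if (rec[1].equals("2")) {
                    addresses.add(rec[2]);
                }
            }
            // Cartesian product; empty when either side is missing.
            for (String f : factories) {
                for (String a : addresses) {
                    out.add(f + "\t" + a);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> tagged = new ArrayList<>();
        tagged.add(new String[] {"1", "1", "Beijing Red Star"});
        tagged.add(new String[] {"3", "1", "Tencent"});
        tagged.add(new String[] {"1", "2", "Beijing"});
        tagged.add(new String[] {"3", "2", "Shenzhen"});
        tagged.add(new String[] {"4", "2", "Xian"});
        // Prints two rows: Beijing Red Star / Beijing, then Tencent / Shenzhen.
        for (String row : join(tagged)) {
            System.out.println(row);
        }
    }
}
```

Using growable lists rather than fixed-size arrays keeps the sketch safe when one address ID has many factories.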



The code is as follows:

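import java.io.IOException;
import java.util.*;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

// The class name must match the file name MTJoin.java used in the commands below.
public class MTJoin {

    // Guards the one-time header row written by the reducer.
    public static int time = 0;

    /*
     * The mapper first decides whether an input line belongs to the left
     * (factory) or right (address) table, then splits it into its two columns.
     * The join column becomes the key; the remaining column, prefixed with the
     * left/right table flag, becomes the value.
     */
    public static class Map extends Mapper<Object, Text, Text, Text> {

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString(); // one line of the input file
            String relationtype = "";       // left/right table flag

            // Skip the header line of either input file.
            if (line.contains("factoryname") || line.contains("addressed")
                    || line.contains("addressname")) {
                return;
            }

            // Tokenize the line on whitespace.
            StringTokenizer itr = new StringTokenizer(line);
            String mapkey = "";
            String mapvalue = "";
            int i = 0; // number of name tokens read so far
            while (itr.hasMoreTokens()) {
                String token = itr.nextToken();
                // A token that starts with a digit is the address ID (the join key).
                if (token.charAt(0) >= '0' && token.charAt(0) <= '9') {
                    mapkey = token;
                    // ID after the name: factory table ("1"); ID first: address table ("2").
                    if (i > 0) {
                        relationtype = "1";
                    } else {
                        relationtype = "2";
                    }
                    continue;
                }
                // Accumulate the factory or address name.
                mapvalue += token + " ";
                i++;
            }

            // Emit the tagged record keyed by address ID.
            context.write(new Text(mapkey),
                    new Text(relationtype + "+" + mapvalue.trim()));
        }
    }

    /*
     * The reducer parses the map output, stores the values of the left and
     * right tables separately, then computes and writes their Cartesian product.
     */
    public static class Reduce extends Reducer<Text, Text, Text, Text> {

        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Write the header row once.
            if (0 == time) {
                context.write(new Text("factoryname"), new Text("addressname"));
                time++;
            }

            // Lists instead of fixed String[10] arrays, so a key with more than
            // ten records cannot overflow.
            List<String> factory = new ArrayList<String>();
            List<String> address = new ArrayList<String>();

            for (Text val : values) {
                String record = val.toString();
                if (record.length() == 0) {
                    continue;
                }
                // The first character is the table flag; the name starts after "+".
                char relationtype = record.charAt(0);
                if ('1' == relationtype) {        // left table
                    factory.add(record.substring(2));
                } else if ('2' == relationtype) { // right table
                    address.add(record.substring(2));
                }
            }

            // Cartesian product; nothing is written if either side is empty.
            for (String f : factory) {
                for (String a : address) {
                    context.write(new Text(f), new Text(a));
                }
            }
        }
    }

    /*
     * The source listing is cut off at the start of the Cartesian-product loop.
     * The loop above and this driver are reconstructed from the imports and
     * from the run command below.
     */
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: MTJoin <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "Multiple Table Join");
        job.setJarByClass(MTJoin.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Compile and package:

javac -classpath hadoop-core-1.1.2.jar:/opt/hadoop-1.1.2/lib/commons-cli-1.2.jar -d firstProject firstProject/MTJoin.java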


jar -cvf MTJoin.jar -C firstProject/ .

Delete the output directory if it already exists:

hadoop fs -rmr output


hadoop fs -mkdir input


hadoop fs -put factory input


hadoop fs -put address input

Run:

hadoop jar MTJoin.jar MTJoin input output

View the result:

hadoop fs -cat output/part-r-00000


Author: a331251021, published 2013-8-4 16:20:52
