3. Offline Data Analysis Process Introduction
Note: This section focuses on the macro concepts and overall workflow of a data analysis system, and gives a first look at where Hadoop and related frameworks are applied. Do not pay too much attention to code details yet.
A widely used kind of data analysis system: web log data mining.
3.1 Requirements Analysis
3.1.1 Case Name
"Website / App Clickstream Log Data Mining System"
3.1.2 Case requirement description
“Web "Clickstream log" contains very important information for website operation. Through log analysis, we can know the number of visits to the website, which webpage has the most visitors, which webpage is the most valuable, advertising conversion rate, visitor source information, and visitor terminal information. wait.
3.1.3 Data source
The data in this case consists mainly of records of users' click behavior.
How it is obtained: a small JavaScript program is pre-embedded in each page to be monitored, with event handlers bound to the page elements of interest. Whenever a user clicks on or moves over such an element, an Ajax request is triggered to a backend servlet, which uses log4j to record the event on the web server (Nginx, Tomcat, etc.). This produces an ever-growing log file.
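The server-side half of this collection mechanism can be sketched, for illustration only, in plain Python: a handler that appends one access-log-style line per event, with the field layout mimicking the combined log format shown in the next section. This is a stand-in for the servlet + log4j setup described in the text, not the actual implementation; the function name `log_event` is hypothetical.

```python
from datetime import datetime, timezone

def log_event(logfile, ip, method, path, status, size, referer, user_agent):
    """Append one event as a line in the combined access-log format.

    Illustrative stand-in for the servlet + log4j setup described in
    the text; the field layout matches the sample log line in 3.1.3.
    """
    ts = datetime.now(timezone.utc).strftime("%d/%b/%Y:%H:%M:%S +0000")
    line = (f'{ip} - - [{ts}] "{method} {path} HTTP/1.1" '
            f'{status} {size} "{referer}" "{user_agent}"\n')
    # Append so the file keeps growing, as the text describes.
    with open(logfile, "a", encoding="utf-8") as f:
        f.write(line)
```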
Format (one log line):
58.215.204.118 - - [18/Sep/2013:06:51:35 +0000] "GET /wp-includes/js/jquery/jquery.js?ver=1.10.2 HTTP/1.1" 304 0 "http://blog.fens.me/nodejs-socketio-chat/" "Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0"
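A line like this follows the standard combined access-log format, so each field can be pulled out with a regular expression. The following is a minimal parsing sketch (not part of the original case; the names `LOG_PATTERN` and `parse_log_line` are our own):

```python
import re

# Regex for the combined access-log format of the sample line above.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_log_line(line):
    """Parse one access-log line into a dict of fields, or None if malformed."""
    m = LOG_PATTERN.match(line.strip())
    return m.groupdict() if m else None

sample = ('58.215.204.118 - - [18/Sep/2013:06:51:35 +0000] '
          '"GET /wp-includes/js/jquery/jquery.js?ver=1.10.2 HTTP/1.1" 304 0 '
          '"http://blog.fens.me/nodejs-socketio-chat/" '
          '"Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0"')
rec = parse_log_line(sample)
print(rec["ip"], rec["status"], rec["url"])
# → 58.215.204.118 304 /wp-includes/js/jquery/jquery.js?ver=1.10.2
```

In a real pipeline this kind of parsing would happen inside the MapReduce preprocessing step described below.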
3.2 Data processing process
3.2.1 Flow Chart Analysis
This case is very similar to a typical BI system; the overall process is as follows:
However, since the premise of this case is handling massive amounts of data, the technology used in each step of the process is completely different from that of traditional BI. Subsequent lessons will explain them one by one:
1) Data collection: a custom-developed collection program, or the open-source framework Flume
2) Data preprocessing: custom-developed MapReduce programs running on a Hadoop cluster
3) Data warehousing: Hive, which is based on Hadoop
4) Data export: Sqoop, a Hadoop-based data import/export tool
5) Data visualization: custom-developed web programs, or products such as Kettle
6) Scheduling of the whole process: Oozie, a workflow tool in the Hadoop ecosystem, or other similar open-source products
3.2.2 Project Technical Architecture Diagram
3.2.3 Project-Related Screenshots (for perceptual understanding only)
a) A MapReduce program running
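The screenshot itself is not reproduced here. As a hedged, local-only sketch of what such a preprocessing job computes, the snippet below imitates the map and reduce phases (counting page views per URL) in plain Python; a real job would be a Java MapReduce program or a Hadoop Streaming job, and the function names here are our own.

```python
from collections import defaultdict

def map_phase(log_lines):
    """Mapper: emit (url, 1) for every GET request in the access log."""
    for line in log_lines:
        parts = line.split('"')
        if len(parts) > 1:
            request = parts[1].split()   # e.g. ['GET', '/path', 'HTTP/1.1']
            if len(request) == 3 and request[0] == "GET":
                yield request[1], 1

def reduce_phase(pairs):
    """Reducer: sum the counts for each url key."""
    counts = defaultdict(int)
    for url, n in pairs:
        counts[url] += n
    return dict(counts)

logs = [
    '1.1.1.1 - - [t] "GET /index.html HTTP/1.1" 200 10 "-" "UA"',
    '2.2.2.2 - - [t] "GET /index.html HTTP/1.1" 200 10 "-" "UA"',
    '3.3.3.3 - - [t] "GET /about.html HTTP/1.1" 200 10 "-" "UA"',
]
print(reduce_phase(map_phase(logs)))
# → {'/index.html': 2, '/about.html': 1}
```

On a cluster, the framework would shuffle the mapper's (url, 1) pairs by key before the reducer sums them; this local version skips the shuffle since everything fits in one process.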
b) Querying data in Hive
c) Importing statistical results into MySQL
./sqoop export --connect jdbc:mysql://localhost:3306/weblogdb --username root --password root --table t_display_xx --export-dir /user/hive/warehouse/uv/dt=2014-08-03
3.3 Final Effect of the Project
After the complete data processing pipeline runs, reports of various statistical indicators are output periodically. In production practice, these reports ultimately need to be displayed in visual form; this case uses a web program to implement the data visualization.
The effect is as follows:
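The effect screenshot is not reproduced here. As a purely illustrative stand-in for the web visualization layer, the snippet below turns a dict of statistics (hypothetical data, e.g. daily UV counts such as those exported by the Sqoop step above) into a minimal HTML fragment that a web page could render:

```python
def render_report(title, stats):
    """Render {label: count} statistics as a minimal HTML report fragment."""
    rows = "\n".join(
        f'<li>{label}: <span class="bar" style="width:{count}px"></span> {count}</li>'
        for label, count in sorted(stats.items())
    )
    return f"<h3>{title}</h3>\n<ul>\n{rows}\n</ul>"

# Hypothetical daily figures, for demonstration only.
html = render_report("Daily UV", {"2014-08-03": 120, "2014-08-04": 95})
```

A real deployment would instead read the figures from the MySQL tables populated by Sqoop and render them with a charting library.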
The above is the detailed content of this introduction to the offline data analysis process. For more information, please follow other related articles on the PHP Chinese website.