Detailed explanation of Apache Druid in one article


王林

Feb 18, 2021 am 10:20 AM


Foreword:

What is Apache Druid?

It is an analytical data platform that combines the characteristics of time series databases, data warehouses, and full-text search systems.

This article gives a brief overview of Druid's characteristics, usage scenarios, technical features, and architecture. It should help when choosing a data storage solution and when building a deeper understanding of Druid storage and time series storage.

Overview

A modern cloud-native, stream-native, analytical database

Druid is designed for workflows where fast queries and fast data ingestion matter. Its strengths are powering UIs, running operational (ad-hoc) queries, and handling high concurrency. Druid can be regarded as an open source alternative to a data warehouse for a variety of use cases.

Easy integration with existing data pipelines

Druid can stream data from a message bus (such as Kafka, Amazon Kinesis), or batch load files from a data lake (such as HDFS, Amazon S3 and other similar data sources).

Up to 100x faster than traditional solutions

In benchmarks, Druid significantly outperforms traditional solutions for both data ingestion and data querying.

Druid's architecture combines the best features of data warehouses, time series databases and retrieval systems.

Unlock new workflows

Druid unlocks new query methods and workflows for clickstream, APM (application performance management), supply chain, network telemetry, digital marketing, and other forms of event-driven data. Druid is built for fast ad-hoc querying of both real-time and historical data.

Deploy on AWS/GCP/Azure, hybrid cloud, Kubernetes, and rented servers

Druid can be deployed in any *NIX environment, whether on-premises or in the cloud. Deployment is straightforward, and you scale up or down by adding or removing services.

Usage Scenarios

Apache Druid suits scenarios that demand real-time data ingestion, high-performance queries, and high availability. Druid is therefore often used as the backend of analysis systems with a rich GUI, or of high-concurrency APIs that need fast aggregation. It works best with event-oriented data.

Common usage scenarios:

Clickstream analysis (web and mobile analytics)

Risk control analysis

Network telemetry analysis (network performance monitoring)

Server metrics storage

Supply chain analysis (manufacturing metrics)

Application performance metrics

Business intelligence / real-time online analytical processing (OLAP)

These usage scenarios will be analyzed in detail below:

User activities and behaviors

Druid is often used for clickstream, view stream, and activity stream data. Typical scenarios include measuring user engagement, tracking A/B test data for product launches, and understanding usage patterns. Druid can compute user metrics such as distinct counts either exactly or approximately. This means that a metric such as daily active users can be computed approximately (with an average accuracy of 98%) within a second to see overall trends, or computed exactly to present to stakeholders. Druid can also be used for funnel analysis, which measures how many users took one action but did not take another. This is useful for products that track user registrations.
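To make the trade-off between exact and approximate distinct counts concrete, here is a minimal Python sketch. It is not Druid's implementation: it uses a simple K-Minimum-Values (KMV) estimator as a stand-in for the sketch-based aggregators a real system would use, and the function names and the `user-N` identifiers are hypothetical.

```python
import hashlib
import heapq


def hash_to_unit_interval(value: str) -> float:
    """Map a string to a roughly uniform float between 0 and 1."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def approx_distinct(values, k=1024):
    """Estimate the number of distinct values with a KMV estimator.

    Only the k smallest hash values are kept, so memory stays bounded
    no matter how many events are processed.
    """
    heap = []        # max-heap (via negation) holding the k smallest hashes
    members = set()  # hashes currently stored in the heap
    for value in values:
        h = hash_to_unit_interval(value)
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            # New hash is smaller than the largest kept one: swap them.
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        return len(heap)  # fewer than k distinct values seen: count is exact
    kth_smallest = -heap[0]
    return int((k - 1) / kth_smallest)


# 60,000 events produced by 20,000 distinct (hypothetical) users.
events = [f"user-{i % 20000}" for i in range(60000)]
exact = len(set(events))            # exact count, memory grows with users
estimate = approx_distinct(events)  # approximate count, bounded memory
```

The exact count needs memory proportional to the number of users, while the estimator keeps at most `k` hashes. A larger `k` gives a tighter estimate at the cost of more memory.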

Network flow

Druid is often used to collect and analyze network flow data, which can be sliced and diced by arbitrary attributes. Druid can ingest large volumes of network flow records and quickly group and rank them by dozens of attributes at query time, which facilitates network flow analysis. These attributes include core attributes such as IP addresses and port numbers, as well as enriched attributes such as location, service, application, device, and ASN. Druid handles flexible schemas, which means you can add any attributes you want.

Digital marketing

Druid is often used to store and query online advertising data. This data usually comes from ad service providers and is crucial for measuring and understanding campaign performance, click-through rate, conversion rate, and other metrics.

Druid was originally designed as a powerful user-facing analytics application for advertising data. Druid has a long record of production use for advertising data, and users around the world store petabytes of data on thousands of servers.

Application Performance Management

Druid is often used to track operational data generated by applications. As in the user activity scenarios, this data can describe how users interact with the application, or it can be metrics reported by the application itself. Druid can be used to drill down into how different components of an application are performing, locate bottlenecks, and identify problems.

Unlike many traditional solutions, Druid combines a smaller storage footprint and lower complexity with higher data throughput. It can quickly analyze application events across thousands of attributes and compute complex load, performance, and utilization metrics. For example, it can rank API endpoints by 95th-percentile query latency. Data can be sliced by any ad-hoc attribute, such as by day, by user profile, or by data center location.
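As an illustration of the 95th-percentile example above, the following Python sketch ranks endpoints by p95 latency in plain code. It shows the computation only, not how Druid executes it. The endpoint names and latencies are made up, and the nearest-rank percentile definition is an assumption.

```python
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct is an integer 1..100)."""
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100 avoids floating point rounding issues.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


def rank_endpoints_by_p95(events):
    """events: iterable of (endpoint, latency_ms) pairs.

    Returns (endpoint, p95_latency) pairs, slowest endpoint first.
    """
    by_endpoint = defaultdict(list)
    for endpoint, latency_ms in events:
        by_endpoint[endpoint].append(latency_ms)
    p95 = {ep: percentile(latencies, 95) for ep, latencies in by_endpoint.items()}
    return sorted(p95.items(), key=lambda item: item[1], reverse=True)


# Hypothetical events: /search has latencies 1..100 ms, /login is always 10 ms.
events = [("/search", ms) for ms in range(1, 101)] + [("/login", 10)] * 20
ranking = rank_endpoints_by_p95(events)
# ranking == [("/search", 95), ("/login", 10)]
```

Slicing by another attribute (day, data center, and so on) amounts to adding that attribute to the grouping key.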

IoT and Device Metrics

Druid can be used as a time series database solution to store and process server and device metrics. It can collect machine-generated data in real time and run fast ad-hoc analysis to measure performance, optimize hardware resources, and locate problems.

Unlike many traditional time series databases, Druid is essentially an analysis engine. Druid combines the concepts of time series databases, columnar analysis databases, and retrieval systems. It supports time-based partitioning, column storage, and search indexing in a single system. This means that time-based queries, numeric aggregations, and retrieval filter queries will be extremely fast.
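The benefit of time-based partitioning can be sketched in a few lines of Python: if events are grouped into per-day buckets, a time-range query only needs to scan the buckets that overlap the range. This is a simplified illustration with hypothetical function names, not Druid's actual storage layout.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def day_bucket(ts: datetime) -> datetime:
    """Truncate a timestamp to the start of its day."""
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)


def partition_by_day(events):
    """events: iterable of (timestamp, payload). Returns {day_start: [events]}."""
    partitions = defaultdict(list)
    for ts, payload in events:
        partitions[day_bucket(ts)].append((ts, payload))
    return partitions


def query_range(partitions, start, end):
    """Return events with start <= timestamp < end, plus the number of
    partitions that actually had to be scanned."""
    hits, scanned = [], 0
    for day, rows in partitions.items():
        if day + timedelta(days=1) <= start or day >= end:
            continue  # partition lies entirely outside the range: skip it
        scanned += 1
        hits.extend(row for row in rows if start <= row[0] < end)
    return hits, scanned


events = [
    (datetime(2021, 2, 16, 9, 0), "a"),
    (datetime(2021, 2, 17, 10, 30), "b"),
    (datetime(2021, 2, 17, 23, 59), "c"),
    (datetime(2021, 2, 18, 0, 0), "d"),
]
partitions = partition_by_day(events)
hits, scanned = query_range(partitions, datetime(2021, 2, 17), datetime(2021, 2, 18))
# Only the 2021-02-17 partition is scanned; the other two days are pruned.
```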

You can include millions of unique dimension values in your metrics, and freely group and filter by any combination of dimensions (dimensions in Druid are similar to tags in time series databases). You can group and rank by tags and compute a wide range of complex metrics. Searching and filtering on tags is also faster than in traditional time series databases.

OLAP and Business Intelligence

Druid is often used in business intelligence scenarios. Companies deploy Druid to speed up queries and power applications. Unlike Hadoop-based SQL engines (such as Presto or Hive), Druid is designed for high concurrency and sub-second queries, and can power interactive data exploration through a UI. This makes Druid better suited to truly interactive visual analysis.

Technology

Apache Druid is an open source distributed data store. Druid's core design combines ideas from OLAP/analytic databases, time series databases, and search systems to create a unified system suitable for a wide range of use cases. Druid merges the key characteristics of these three kinds of systems into its ingestion layer, storage format, querying layer, and core architecture.

Druid’s main features include:

Column storage

Druid stores and compresses each column of data separately. A query reads only the columns it needs, which supports fast scans, rankings, and groupBy operations.

Native search index

Druid creates an inverted index for string values to achieve fast search and filtering of data.

Streaming and batch data ingestion

Out-of-the-box connectors for Apache Kafka, HDFS, AWS S3, and stream processors.

Flexible data schemas

Druid elegantly adapts to changing data schemas and nested data types.

Time-based optimized partitioning

Druid intelligently partitions data based on time. Therefore, Druid time-based queries will be significantly faster than traditional databases.

SQL support

In addition to its native JSON-based queries, Druid also supports SQL over HTTP and JDBC.

Horizontal scalability

Druid scales to ingestion rates of millions of events per second, massive data storage, and sub-second queries.

Easy to operate and maintain

Capacity can be scaled up or down by adding or removing servers. Druid supports automatic rebalancing and failover.

Data Ingestion

Druid supports both streaming and batch data ingestion. Druid typically connects to raw data sources through a message bus like Kafka (loading streaming data) or through a distributed file system like HDFS (loading batch data).
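For streaming ingestion from Kafka, Druid is configured with a JSON "supervisor spec". The Python sketch below only builds such a spec as a dictionary to show its general shape. The datasource, topic, broker address, and column names are hypothetical. The field layout follows the Kafka ingestion format of recent Druid versions as far as we know, so check it against the documentation of your Druid version before use.

```python
import json


def kafka_supervisor_spec(datasource, topic, bootstrap_servers, dimensions):
    """Build a minimal Kafka supervisor spec (a sketch, not a tuned config)."""
    return {
        "type": "kafka",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                # Assumes each event carries an ISO-8601 "timestamp" field.
                "timestampSpec": {"column": "timestamp", "format": "iso"},
                "dimensionsSpec": {"dimensions": dimensions},
                "granularitySpec": {
                    "segmentGranularity": "hour",
                    "queryGranularity": "none",
                    "rollup": False,
                },
            },
            "ioConfig": {
                "topic": topic,
                "inputFormat": {"type": "json"},
                "consumerProperties": {"bootstrap.servers": bootstrap_servers},
            },
            "tuningConfig": {"type": "kafka"},
        },
    }


spec = kafka_supervisor_spec(
    datasource="clickstream",           # hypothetical datasource name
    topic="clicks",                     # hypothetical Kafka topic
    bootstrap_servers="kafka-1:9092",   # hypothetical broker address
    dimensions=["country", "device", "page"],
)
payload = json.dumps(spec, indent=2)
```

In a typical setup, a payload like this is submitted over HTTP to the ingestion service (the `/druid/indexer/v1/supervisor` endpoint), after which Druid reads the topic continuously.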

Druid converts raw data into segments through an indexing process and stores the segments on data nodes. A segment is a data structure optimized for queries.

Data Storage

Like most analytical databases, Druid uses columnar storage. Depending on the data type of different columns (string, number, etc.), Druid uses different compression and encoding methods. Druid also builds different types of indexes for different column types.
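A small Python sketch can illustrate why a columnar layout helps analytical queries: once rows are pivoted into columns, an aggregation touches only the column it needs, and string columns can be stored compactly using dictionary encoding. The row data and function names are hypothetical, and the sketch assumes every row has the same fields.

```python
def to_columns(rows):
    """Pivot a list of row dicts into {column_name: [values...]}."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def dictionary_encode(values):
    """Replace repeated strings with small integer ids.

    Returns (sorted dictionary of unique values, list of ids per row).
    """
    dictionary = sorted(set(values))
    ids = {value: i for i, value in enumerate(dictionary)}
    return dictionary, [ids[value] for value in values]


rows = [
    {"country": "US", "device": "ios", "bytes": 120},
    {"country": "DE", "device": "android", "bytes": 80},
    {"country": "US", "device": "android", "bytes": 200},
]
columns = to_columns(rows)

# Summing "bytes" reads one column and never touches "country" or "device".
total_bytes = sum(columns["bytes"])

# The string column becomes a small dictionary plus integer ids.
dictionary, encoded = dictionary_encode(columns["country"])
# dictionary == ["DE", "US"], encoded == [1, 0, 1]
```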

Like a search system, Druid creates an inverted index for string columns to speed up search and filtering. Like a time series database, Druid intelligently partitions data by time to speed up time-based queries.
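The idea of an inverted index on string columns can be sketched as follows: for every distinct value, store the set of row ids that contain it, so a filter becomes a set intersection instead of a full scan. This is an illustration only. Python sets stand in for the compressed bitmaps a real system would use, and the column data is made up.

```python
from collections import defaultdict


def build_inverted_index(column):
    """Map each distinct value in a column to the set of row ids holding it."""
    index = defaultdict(set)
    for row_id, value in enumerate(column):
        index[value].add(row_id)
    return index


def filter_rows(indexes, conditions):
    """Return sorted row ids matching ALL conditions ({column: wanted_value})."""
    matches = None
    for column_name, wanted in conditions.items():
        row_ids = indexes[column_name].get(wanted, set())
        matches = row_ids if matches is None else matches & row_ids
    return sorted(matches or [])


country = ["US", "DE", "US", "FR"]
device = ["ios", "ios", "android", "ios"]
indexes = {
    "country": build_inverted_index(country),
    "device": build_inverted_index(device),
}

rows = filter_rows(indexes, {"country": "US", "device": "ios"})
# rows == [0]
```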

Unlike most traditional systems, Druid can pre-aggregate data at ingestion time. This pre-aggregation is called rollup, and it can significantly reduce storage costs.
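Rollup can be sketched in Python as grouping raw events by a truncated timestamp plus their dimensions and keeping only aggregated metrics. The hourly granularity, the field names, and the sample events are assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime


def rollup_hourly(events):
    """events: iterable of (timestamp, country, bytes).

    Returns {(hour_start, country): {"count": n, "bytes": total}}.
    """
    totals = defaultdict(lambda: {"count": 0, "bytes": 0})
    for ts, country, nbytes in events:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        totals[(hour, country)]["count"] += 1
        totals[(hour, country)]["bytes"] += nbytes
    return dict(totals)


raw = [
    (datetime(2021, 2, 18, 10, 5), "US", 100),
    (datetime(2021, 2, 18, 10, 20), "US", 50),
    (datetime(2021, 2, 18, 10, 59), "DE", 10),
    (datetime(2021, 2, 18, 11, 0), "US", 7),
]
rolled = rollup_hourly(raw)
# Four raw events become three stored rows; the two US events in the
# 10:00 hour are merged into one row with count 2 and bytes 150.
```

The trade-off is that individual events can no longer be queried after rollup, only the aggregates at the chosen granularity.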

Query

Druid supports both native JSON-over-HTTP queries and SQL. In addition to standard SQL operators, Druid provides operators of its own, backed by a suite of approximate algorithms for fast counting, ranking, and quantile calculations.
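As a sketch of the SQL-over-HTTP path, the Python snippet below builds (but does not send) a request for Druid's SQL endpoint, `/druid/v2/sql`, which accepts a JSON body with a `query` field. The base URL, the `clickstream` table, and its columns are assumptions for illustration.

```python
import json
import urllib.request


def build_sql_request(sql, base_url="http://localhost:8888"):
    """Build an HTTP POST request for Druid's SQL endpoint.

    base_url is an assumption; point it at your own Router or Broker.
    """
    body = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/druid/v2/sql",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical table and columns: events per country over the last day.
sql = """
SELECT country, COUNT(*) AS events
FROM "clickstream"
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY country
ORDER BY events DESC
LIMIT 10
"""
request = build_sql_request(sql)

# To run the query against a live cluster:
# with urllib.request.urlopen(request) as response:
#     rows = json.loads(response.read())
```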

Architecture

Druid has a microservice-based architecture, which can be thought of as a database disassembled into multiple services. Each of Druid's core services (ingestion, querying, and coordination) can be deployed separately or together on commodity hardware.

Druid explicitly names each service so that operators can tune the corresponding service according to usage and load. For example, when the workload demands it, operators can give more resources to the data ingestion service and fewer to the data query service.

Druid services can fail independently without affecting the operation of other services.

Operation and Maintenance

Druid is designed to be a robust system that runs 24/7. It has the following features to support long-term operation and prevent data loss.

Data replication

Druid keeps multiple copies of the data according to the configured replica count, so a single machine failure does not affect Druid queries.

Independent services

Druid explicitly names each main service, and each service can be tuned according to usage. Services can fail independently without affecting the normal operation of other services. For example, if the data ingestion service fails, no new data is loaded into the system, but existing data can still be queried.

Automatic data backup

Druid automatically backs up all indexed data to a file system, which can be a distributed file system such as HDFS. Even if all data in the Druid cluster is lost, it can be quickly reloaded from this backup.

Rolling update

With rolling updates, you can upgrade a Druid cluster without downtime, so the upgrade is invisible to users. All Druid versions are backwards compatible.

To learn more about time series databases and how they compare, see this related article:

First introduction and selection of time series database (TSDB)
