[Introduction to Data Mining] Types of Data

Jun 07, 2016 pm 03:59 PM

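Types of Data
Data sets differ in many ways. For example, the attributes that describe data objects can be of different types, quantitative or qualitative, and a data set may have special characteristics, such as containing time series or objects that are explicitly related to one another. For this reason, the type of data determines which tools and techniques should be used to analyze it. In addition, data mining research is also driven by the need to accommodate new application areas and new types of data.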
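Data Quality
Data is often far from perfect. Although most data mining techniques tolerate some imperfection in the data, a focus on understanding and improving data quality is one of the main ways to improve the accuracy of the resulting analysis.
Preprocessing Steps to Make Data Suitable for Mining
Raw data usually has to be processed before it is suitable for analysis. The processing aims, on the one hand, to improve the quality of the data and, on the other, to make the data fit a specific data mining technique or tool better.
Analyzing Data in Terms of Its Relationships
One approach to data analysis is to find relationships among the data objects and then perform the remaining analysis using these relationships rather than the data objects themselves.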
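In general, a data set can be viewed as a collection of data objects. Depending on the context, a data object is also called a record, point, vector, pattern, and so on. Data objects are described by a set of attributes that capture their basic characteristics; attributes are also called variables, fields, features, or dimensions.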
Attributes and Measurement
An attribute is a property or characteristic of an object that may vary from one object to another or from one time to another. A measurement scale is a rule (function) that associates a numerical or symbolic value with an attribute of an object.

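The type of an attribute tells us which properties of the attribute are reflected in the values used to measure it. Knowing the type of an attribute matters because it tells us which properties of the measured values are consistent with the underlying properties of the attribute, and so lets us avoid foolish operations such as computing the average employee ID. Note that the type of an attribute is commonly referred to as the type of a measurement scale.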

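Different Types of Attributes
A useful way to specify the type of an attribute is to identify the properties of numbers that correspond to underlying properties of the attribute. For example, an attribute such as length has many of the properties of numbers: it is meaningful to compare and order objects by length, and to talk about differences and ratios of lengths. The following properties (operations) of numbers are typically used to describe attributes: distinctness (= and ≠), order (<, ≤, >, ≥), addition (+ and −), and multiplication (× and /).
Given these properties, we can define four types of attributes: nominal, ordinal, interval, and ratio.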
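The link between attribute type and meaningful operations can be sketched in a few lines of Python. This is an illustration under stated assumptions, not part of the original text: the `AttributeType` enum, the `summarize` helper, and the sample values are all hypothetical, and the pairing of mode, median, and mean with nominal, ordinal, and interval or ratio attributes is the usual convention rather than something the text above spells out.

```python
from enum import IntEnum
from statistics import mean, median_low, mode


class AttributeType(IntEnum):
    """The four attribute types, ordered so each level adds one property."""
    NOMINAL = 1   # distinctness only: =, !=
    ORDINAL = 2   # adds order: <, <=, >, >=
    INTERVAL = 3  # adds meaningful differences: +, -
    RATIO = 4     # adds meaningful ratios: *, /


def summarize(values, attr_type):
    """Return a central-tendency statistic that is meaningful for attr_type."""
    if attr_type >= AttributeType.INTERVAL:
        return "mean", mean(values)
    if attr_type == AttributeType.ORDINAL:
        # Assumes ordinal values are encoded as ranks (1 = lowest).
        # median_low always returns an observed value, never an average.
        return "median", median_low(values)
    return "mode", mode(values)


# Hypothetical sample data, for illustration only.
employee_ids = [1003, 1001, 1003, 1007]  # nominal: numbers used as labels
grades = [1, 2, 2, 3, 5]                 # ordinal: ranks on a grade scale
temperatures_c = [18.0, 21.0, 24.0]      # interval: degrees Celsius

id_summary = summarize(employee_ids, AttributeType.NOMINAL)
grade_summary = summarize(grades, AttributeType.ORDINAL)
temp_summary = summarize(temperatures_c, AttributeType.INTERVAL)
print(id_summary, grade_summary, temp_summary)
```

Because the type is passed explicitly, the helper never averages employee IDs, even though they are stored as integers.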
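The types of attributes can also be described in terms of transformations that do not change the meaning of an attribute; for example, length can be measured in meters or in feet. The permissible transformations for the four attribute types are: any one-to-one mapping for nominal attributes; any order-preserving (monotonic) mapping for ordinal attributes; new_value = a × old_value + b, with a and b constants, for interval attributes; and new_value = a × old_value for ratio attributes.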
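A small numeric check can make the idea of a meaning-preserving transformation concrete. The sketch below assumes two unit changes: meters to feet for a ratio attribute (the length example from the text) and Celsius to Fahrenheit for an interval attribute (an example added here for illustration). The function names and sample values are hypothetical.

```python
import math


def meters_to_feet(m):
    # Ratio scale: new = a * old, using 1 m = 3.28084 ft (rounded).
    return 3.28084 * m


def celsius_to_fahrenheit(c):
    # Interval scale: new = a * old + b, with a = 9/5 and b = 32.
    return 9 / 5 * c + 32


# Ratio attribute: the ratio of two lengths survives the change of unit.
len_a, len_b = 2.0, 4.0
ratio_m = len_b / len_a
ratio_ft = meters_to_feet(len_b) / meters_to_feet(len_a)

# Interval attribute: ratios of values change, ratios of differences do not.
t1, t2, t3 = 10.0, 20.0, 40.0
value_ratio_c = t2 / t1
value_ratio_f = celsius_to_fahrenheit(t2) / celsius_to_fahrenheit(t1)
diff_ratio_c = (t3 - t2) / (t2 - t1)
diff_ratio_f = (
    (celsius_to_fahrenheit(t3) - celsius_to_fahrenheit(t2))
    / (celsius_to_fahrenheit(t2) - celsius_to_fahrenheit(t1))
)

print("length ratio preserved:", math.isclose(ratio_m, ratio_ft))
print("temperature ratio preserved:", math.isclose(value_ratio_c, value_ratio_f))
print("difference ratio preserved:", math.isclose(diff_ratio_c, diff_ratio_f))
```

This is why a statement like "twice as long" holds in any unit, while "twice as hot" depends on the temperature scale chosen.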
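Describing Attributes by the Number of Values
An independent way of distinguishing between attributes is by the number of values they can take.
Discrete: A discrete attribute has a finite or countably infinite set of values. Discrete attributes are usually represented with integer variables. Binary attributes are a special case of discrete attributes that take only two values, for example true/false, yes/no, or 0/1, and are usually represented with Boolean variables.
Continuous: A continuous attribute is one whose values are real numbers, such as temperature or height. Continuous attributes are typically represented with floating-point variables.
In theory, any of the measurement scale types (nominal, ordinal, interval, and ratio) can be combined with any of the types based on the number of attribute values (binary, discrete, and continuous). Some combinations, however, occur only rarely or make little sense.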
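The distinction between binary, discrete, and continuous attributes can be approximated from a sample of values. The function below is a rough heuristic sketch, not a definitive test: `value_kind` is a hypothetical helper, it treats any float as evidence of a continuous attribute, and a finite sample can never prove that an attribute is continuous.

```python
def value_kind(values):
    """Guess whether an attribute is binary, discrete, or continuous.

    Heuristic only: floats are taken as a sign of a continuous attribute,
    exactly two distinct non-float values as a sign of a binary one.
    """
    distinct = set(values)
    if any(isinstance(v, float) for v in distinct):
        return "continuous"
    if len(distinct) == 2:
        return "binary"
    return "discrete"


# Hypothetical sample columns, for illustration only.
print(value_kind([0, 1, 1, 0]))            # enrolled or not
print(value_kind([3, 1, 4, 1, 5]))         # counts
print(value_kind([36.6, 37.2, 36.9]))      # body temperature
print(value_kind(["yes", "no", "yes"]))    # survey answer
```

In practice the analyst's knowledge of what the attribute means should override any guess made from stored values.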
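Asymmetric Attributes
For asymmetric attributes, only presence, that is, a nonzero attribute value, is regarded as important. Consider a data set in which each object is a student and each attribute records whether the student took a particular course at a university. For a given student, an attribute has the value 1 if the student took the corresponding course and 0 otherwise. Because students take only a small fraction of all available courses, most of the values in such a data set are 0, so it is more meaningful to focus on the nonzero values. Binary attributes for which only nonzero values are important are called asymmetric binary attributes.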

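Types of Data Sets
There are many types of data sets. We generally group them into three categories: record data, graph-based data, and ordered data.
General Characteristics of Data Sets
Dimensionality: The dimensionality of a data set is the number of attributes that the objects in the data set possess; data sets may be of low, moderate, or high dimensionality. Analyzing high-dimensional data runs into what is called the curse of dimensionality, so it is often best to reduce the dimensionality before analysis. An important motivation in data preprocessing is therefore to reduce the number of dimensions, which is known as dimensionality reduction.
Sparsity: In some data sets, such as those with asymmetric features, most attributes of an object have the value 0; in many cases, fewer than 1% of the entries are nonzero. In practical terms, sparsity is an advantage, because only the nonzero values need to be stored and processed, which saves considerable computation time and storage.
Resolution: Data can often be obtained at different levels of resolution, and the properties of the data often differ from one resolution to another. For example, the surface of the Earth looks very uneven at a resolution of a few meters but relatively smooth at a resolution of tens of kilometers.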
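The storage advantage of sparsity described above can be illustrated with the student and course example. The sketch below uses a small, made-up enrollment matrix and a plain dictionary keyed by (row, column) as the storage scheme; both the data and the choice of a coordinate-style dictionary are assumptions for illustration.

```python
# Rows are students, columns are courses; 1 means "enrolled" (toy data).
enrollment = [
    [0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]

# Coordinate storage: keep only the positions of the nonzero entries.
nonzero = {
    (i, j): value
    for i, row in enumerate(enrollment)
    for j, value in enumerate(row)
    if value != 0
}

total_cells = len(enrollment) * len(enrollment[0])
sparsity = 1 - len(nonzero) / total_cells

print(f"stored entries: {len(nonzero)} of {total_cells}")
print(f"fraction of zeros: {sparsity:.3f}")
```

Any algorithm that only needs the nonzero values can iterate over `nonzero` and skip the zero cells entirely.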

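Record Data
Many data mining tasks assume that the data set is a collection of records (data objects), each of which consists of a fixed set of data fields (attributes). The different types of record data are described below.
Transaction or market basket data: Transaction data is a special type of record data in which each record (transaction) involves a set of items. Consider the set of products a customer buys during one shopping trip: it constitutes a transaction, and the individual products purchased are the items. This type of data is called market basket data.
Data matrix: If all the data objects in a collection have the same fixed set of numeric attributes, the objects can be thought of as points (vectors) in a multidimensional space, where each dimension represents a distinct attribute of the object. Such a set of data objects can be represented as an m × n matrix, with m rows, one per object, and n columns, one per attribute. This matrix is called a data matrix or a pattern matrix.
Sparse data matrix: A sparse data matrix is a special case of a data matrix in which the attributes are of the same type and are asymmetric, that is, only nonzero values are important. Transaction data is an example of a sparse data matrix that has only 0-1 entries. Another common example is document data. The representation of a collection of documents is usually called a document-term matrix, as in Figure 2-2d: documents are the rows of the matrix and terms are the columns.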
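A document-term matrix can be built with nothing more than the standard library. The sketch below is a minimal illustration with made-up documents; splitting on whitespace stands in for real tokenization, and the entries are term counts rather than 0-1 values.

```python
from collections import Counter

# Hypothetical documents, for illustration only.
documents = [
    "data mining finds patterns in data",
    "graph data links objects",
    "time series data",
]

# One Counter per document: only terms that occur are stored (sparse rows).
rows = [Counter(doc.split()) for doc in documents]

# Columns are the sorted vocabulary of the whole collection.
vocabulary = sorted(set().union(*rows))

# Dense document-term matrix: rows are documents, columns are terms.
# A Counter returns 0 for a term that the document does not contain.
matrix = [[row[term] for term in vocabulary] for row in rows]

print(vocabulary)
for line in matrix:
    print(line)
```

The `rows` list is the sparse form of the same matrix: each document stores only the terms it actually contains.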

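Graph-Based Data
A graph can sometimes be an effective way to represent data. We consider two specific cases: the graph captures relationships among data objects, and the data objects themselves are represented as graphs.
Data with relationships among objects: The relationships among objects frequently carry important information. In such cases, the data is often represented as a graph: the data objects are mapped to the nodes of the graph, and the relationships among objects are captured by links between the nodes and by link properties such as direction and weight. Web pages that link to one another are an example.
Data with objects that are graphs: If objects have structure, that is, they contain subobjects that are related to one another, such objects are frequently represented as graphs. For example, the structure of a chemical compound can be represented as a graph.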
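The first case, objects mapped to nodes and relationships mapped to links, can be sketched with an adjacency list. The page names below are hypothetical, and counting incoming links is just one simple example of an analysis that uses the relationships rather than the pages themselves.

```python
# Hypothetical pages; each key links to the pages in its list (directed edges).
links = {
    "index.html": ["about.html", "blog.html"],
    "about.html": ["index.html"],
    "blog.html": ["index.html", "about.html"],
    "orphan.html": [],
}


def in_degree(graph):
    """Count the incoming links of every node in a directed graph."""
    counts = {node: 0 for node in graph}
    for targets in graph.values():
        for target in targets:
            counts[target] = counts.get(target, 0) + 1
    return counts


incoming = in_degree(links)
print(incoming)
```

Link weights could be added by replacing each target list with a dictionary from target to weight.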

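Ordered Data
For some types of data, the attributes have relationships that involve order in time or space, as described below.
Sequential data: Sequential data, also called temporal data, can be thought of as an extension of record data in which each record has a time associated with it. A time can also be associated with each attribute; for example, each record could be the purchase history of one customer, with a list of items bought at different times. Using this information, we might discover, say, that people who have bought an iPhone no longer pay attention to low-end Android phones.
Sequence data: Sequence data is a data set consisting of sequences of individual entities, such as sequences of words or letters, or genomic sequences.
Time series data: Time series data is a special type of sequential data in which each record is a time series, that is, a series of measurements taken over time. Figure 2-4c, for example, shows a time series of monthly averages for one location from 1982 to 1994. When analyzing temporal data, it is important to take temporal autocorrelation into account: if two measurements are close in time, their values are usually very similar.
Spatial data: Some data also has spatial attributes, such as positions or areas. There are many examples of spatial data, such as weather data collected at different locations. An important aspect of spatial data is spatial autocorrelation: objects that are physically close tend to be similar in other ways as well.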
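Temporal autocorrelation can be quantified in several ways; the text above does not name one. The sketch below assumes the common sample autocorrelation at a fixed lag and applies it to two made-up series. The function name and the data are illustrative only, and a constant series, which would give a zero denominator, is not handled.

```python
from statistics import mean


def autocorrelation(series, lag=1):
    """Sample autocorrelation of a series at the given lag."""
    mu = mean(series)
    deviations = [x - mu for x in series]
    denominator = sum(d * d for d in deviations)
    numerator = sum(
        deviations[t] * deviations[t + lag] for t in range(len(series) - lag)
    )
    return numerator / denominator


# Hypothetical measurements, for illustration only.
smooth = [10, 11, 12, 13, 14, 15, 16, 17]       # neighbouring values are close
alternating = [10, 17, 10, 17, 10, 17, 10, 17]  # neighbouring values differ

print(f"smooth:      {autocorrelation(smooth):.3f}")
print(f"alternating: {autocorrelation(alternating):.3f}")
```

A value near +1 means that measurements close in time tend to be similar, which is the situation the text describes; the alternating series shows the opposite pattern and yields a negative value.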

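Handling Non-Record Data
Most data mining algorithms are designed for record data or its variations, such as transaction data and data matrices. Record-oriented techniques can also be applied to non-record data by extracting features from the data objects and using these features to create a record for each object. Consider chemical structure data: given a set of common substructures, each compound can be represented as a record with binary attributes that indicate whether the compound contains a specific substructure. Such a representation is in fact a transaction data set, in which the transactions are the compounds and the items are the substructures.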
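The conversion from compounds to binary records can be sketched as follows. Everything in this example is hypothetical: the substructure names, the compounds, and the assumption that the substructures present in each compound have already been detected (in practice that step requires matching against the chemical graph).

```python
# Hypothetical substructure vocabulary and compounds, for illustration only.
SUBSTRUCTURES = ["benzene_ring", "hydroxyl", "carboxyl", "amine"]

# Each compound is given as the set of substructures it is assumed to contain,
# which is exactly a transaction whose items are substructures.
compounds = {
    "compound_A": {"benzene_ring", "hydroxyl"},
    "compound_B": {"carboxyl", "amine"},
    "compound_C": {"benzene_ring", "carboxyl", "hydroxyl"},
}


def to_record(present):
    """Binary record: 1 if the compound contains the substructure, else 0."""
    return [1 if s in present else 0 for s in SUBSTRUCTURES]


records = {name: to_record(present) for name, present in compounds.items()}

for name, record in records.items():
    print(name, record)
```

Once the data is in this form, any technique written for transaction data or 0-1 data matrices can be applied to the compounds.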