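Data Mining: The Apriori Algorithm Explained, with a Python Implementation
Association rule mining is one of the most active research areas in data mining. It is used to discover relationships between things, and was first applied to finding relationships between different products in supermarket transaction databases (the well-known beer and diapers example).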
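Basic concepts
1. Support: support(X-->Y) = |X ∩ Y|/N, where |X ∩ Y| denotes the number of records in which all items of X and all items of Y appear together, and N is the total number of records. For example: support({beer}-->{diapers}) = (records containing both beer and diapers) / (number of records) = 3/5 = 60%.
2. Confidence: confidence(X-->Y) = |X ∩ Y|/|X| = (records in which all items of X and Y appear together) / (records in which X appears). For example: confidence({beer}-->{diapers}) = (records containing both beer and diapers) / (records containing beer) = 3/3 = 100%; confidence({diapers}-->{beer}) = (records containing both beer and diapers) / (records containing diapers) = 3/4 = 75%.
A rule that satisfies both the minimum support threshold (min_sup) and the minimum confidence threshold (min_conf) is called a strong rule. An itemset that satisfies the minimum support is called a frequent itemset.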
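The two definitions can be checked numerically. The following is a minimal sketch, not part of the original article: the five transactions are an assumed toy dataset, chosen only so that the counts agree with the example above (beer in 3 records, diapers in 4, both in 3), and the helper names count, support, confidence and is_strong are hypothetical.

```python
# Assumed toy dataset: 5 records, one set of items per record.
TRANSACTIONS = [
    {"milk", "bread", "diapers"},
    {"milk", "bread", "diapers", "beer"},
    {"milk", "bread", "diapers", "beer"},
    {"diapers", "beer"},
    {"milk", "bread"},
]


def count(itemset, transactions):
    """Number of records that contain every item of `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)


def support(x, y, transactions):
    """support(X --> Y): share of records containing all items of X and Y."""
    return count(set(x) | set(y), transactions) / float(len(transactions))


def confidence(x, y, transactions):
    """confidence(X --> Y): records with X and Y divided by records with X."""
    return count(set(x) | set(y), transactions) / float(count(x, transactions))


def is_strong(x, y, transactions, min_sup, min_conf):
    """A rule is strong when it meets both thresholds."""
    return (support(x, y, transactions) >= min_sup
            and confidence(x, y, transactions) >= min_conf)


print(support({"beer"}, {"diapers"}, TRANSACTIONS))     # 0.6
print(confidence({"beer"}, {"diapers"}, TRANSACTIONS))  # 1.0
print(confidence({"diapers"}, {"beer"}, TRANSACTIONS))  # 0.75
```

Note that confidence divides by the count of X, so this sketch assumes X occurs in at least one record.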
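"How are association rules mined from a large database?" Association rule mining is a two-step process:
1. Find all frequent itemsets: by definition, each of these itemsets occurs at least as often as the predefined minimum support count.
2. Generate strong association rules from the frequent itemsets: by definition, these rules must satisfy both the minimum support and the minimum confidence.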
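The two steps can be sketched directly, before any optimization. The code below is an illustrative brute-force version, assuming a small toy dataset (not the article's data) and hypothetical function names; it enumerates every possible itemset, which is exactly the cost that Apriori is designed to avoid.

```python
from itertools import combinations

# Assumed toy dataset, one set of items per record.
TRANSACTIONS = [
    {"milk", "bread", "diapers"},
    {"milk", "bread", "diapers", "beer"},
    {"milk", "bread", "diapers", "beer"},
    {"diapers", "beer"},
    {"milk", "bread"},
]


def count(itemset, transactions):
    """Number of records that contain every item of `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)


def frequent_itemsets_bruteforce(transactions, min_count):
    """Step 1: keep every itemset whose support count reaches min_count."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            c = count(combo, transactions)
            if c >= min_count:
                frequent[frozenset(combo)] = c
    return frequent


def strong_rules(frequent, min_conf):
    """Step 2: split each frequent itemset into X --> Y, keep confident rules."""
    rules = []
    for itemset, c in frequent.items():
        for size in range(1, len(itemset)):
            for x in combinations(sorted(itemset), size):
                x = frozenset(x)
                # x is itself frequent, so its count is already stored.
                conf = c / float(frequent[x])
                if conf >= min_conf:
                    rules.append((x, itemset - x, conf))
    return rules


frequent = frequent_itemsets_bruteforce(TRANSACTIONS, min_count=3)
rules = strong_rules(frequent, min_conf=0.7)
for x, y, conf in rules:
    print(sorted(x), "-->", sorted(y), conf)
```

Every rule produced in step 2 already satisfies the minimum support, because it is built from an itemset that passed step 1; only the confidence needs to be checked.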
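The Apriori laws
To reduce the time spent generating frequent itemsets, we should eliminate as early as possible any sets that cannot be frequent. The two Apriori laws do exactly that.
Apriori law 1: if a set is a frequent itemset, then all of its subsets are frequent itemsets. Example: suppose {A,B} is a frequent itemset, i.e. the number of records containing both A and B is at least the minimum support count min_support. Every record that contains both A and B also contains A and contains B, so {A} and {B} each occur at least min_support times, and all subsets of {A,B} are therefore frequent.
Apriori law 2: if a set is not a frequent itemset, then none of its supersets is a frequent itemset. Example: suppose {A} is not frequent, i.e. A occurs fewer than min_support times. Any superset such as {A,B} can occur no more often than {A} does, so it also occurs fewer than min_support times and cannot be frequent.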
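Law 2 is what allows a candidate to be discarded without scanning the data. As a rough sketch (the function name has_infrequent_subset and the sample frequent 2-itemsets are assumptions made for illustration), a k-item candidate can be rejected as soon as one of its (k-1)-item subsets is missing from the previous level's frequent itemsets:

```python
from itertools import combinations


def has_infrequent_subset(candidate, frequent_prev):
    """True if some (k-1)-subset of the k-item candidate is not frequent."""
    k = len(candidate)
    return any(frozenset(sub) not in frequent_prev
               for sub in combinations(candidate, k - 1))


# Assumed frequent 2-itemsets for the illustration.
frequent_2 = {
    frozenset(pair)
    for pair in [("milk", "bread"), ("milk", "diapers"),
                 ("bread", "diapers"), ("diapers", "beer")]
}

# All three 2-subsets are frequent, so the candidate must still be counted.
print(has_infrequent_subset(("milk", "bread", "diapers"), frequent_2))  # False
# {bread, beer} is not frequent, so the candidate is pruned without counting.
print(has_infrequent_subset(("milk", "bread", "beer"), frequent_2))     # True
```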
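The figure above illustrates how the Apriori algorithm proceeds. Note that when the candidate 3-itemsets are generated from the frequent 2-itemsets, {milk, bread, beer} is absent: {bread, beer} is not a frequent 2-itemset, so the candidate is ruled out by the Apriori laws. Once the frequent 3-itemsets have been produced, no higher-level candidates remain, so the algorithm terminates, and {milk, bread, diapers} is the maximal frequent itemset.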
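The level-wise procedure described above can be sketched compactly. This is an illustrative version, not the author's implementation: the transactions are an assumed toy dataset (the figure's data is not reproduced in the text), built so that {bread, beer} is infrequent and {milk, bread, diapers} is the largest frequent itemset at a minimum support count of 3. The function name apriori is hypothetical.

```python
from itertools import combinations

# Assumed toy dataset, one set of items per record.
TRANSACTIONS = [
    {"milk", "bread", "diapers"},
    {"milk", "bread", "diapers", "beer"},
    {"milk", "bread", "diapers", "beer"},
    {"diapers", "beer"},
    {"milk", "bread"},
]


def apriori(transactions, min_count):
    """Return {level k: {itemset: count}} for all frequent itemsets."""
    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_count}

    levels = {}
    k = 1
    while current:
        levels[k] = current
        k += 1
        # Join: unite pairs of frequent (k-1)-itemsets into k-item candidates.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k:
                candidates.add(union)
        # Prune: drop candidates that have an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        # Count the surviving candidates against the data.
        counted = {c: sum(1 for t in transactions if c <= t)
                   for c in candidates}
        current = {c: n for c, n in counted.items() if n >= min_count}
    return levels


levels = apriori(TRANSACTIONS, min_count=3)
for k in sorted(levels):
    print(k, sorted(sorted(s) for s in levels[k]))
```

On this dataset the loop stops after level 3, because a single frequent 3-itemset cannot be joined with anything to form a 4-item candidate.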
Python implementation:
Source: taizilongxu/datamining, branch master, apriori/apriori.py
# -*- encoding: UTF-8 -*-
#---------------------------------import------------------------------------
from __future__ import print_function  # lets the listing run on Python 2 and 3

import itertools
#---------------------------------------------------------------------------


class Apriori(object):

    def __init__(self, filename, min_support, item_start, item_end):
        self.filename = filename
        self.min_support = min_support  # minimum support
        self.min_confidence = 50  # minimum confidence, in percent
        self.line_num = 0  # number of data rows
        self.item_start = item_start  # first item column to read
        self.item_end = item_end  # last item column to read
        # Each location is a list of column indexes; start with single items.
        self.location = [[i] for i in range(self.item_end - self.item_start + 1)]
        self.support = self.sut(self.location)
        self.num = list(sorted(set([j for i in self.location for j in i])))  # item indexes in use
        self.pre_support = []  # support, location and num of the previous level
        self.pre_location = []
        self.pre_num = []
        self.item_name = []  # item names
        self.find_item_name()
        self.loop()
        self.confidence_sup()

    def deal_line(self, line):
        "Extract the item columns from one line"
        return [i.strip() for i in line.split(' ') if i][self.item_start - 1:self.item_end]

    def find_item_name(self):
        "Read item_name from the first line of the file"
        with open(self.filename, 'r') as F:
            for index, line in enumerate(F.readlines()):
                if index == 0:
                    self.item_name = self.deal_line(line)
                    break

    def sut(self, location):
        """
        Input: [[1,2,3],[2,3,4],[1,3,5]...]
        Output: the support count of each location set [123,435,234...]
        """
        with open(self.filename, 'r') as F:
            support = [0] * len(location)
            for index, line in enumerate(F.readlines()):
                if index == 0:
                    continue  # skip the header line
                # Extract the item columns of this row
                item_line = self.deal_line(line)
                for index_num, i in enumerate(location):
                    flag = 0
                    for j in i:
                        if item_line[j] != 'T':
                            flag = 1
                            break
                    if not flag:
                        support[index_num] += 1
            self.line_num = index  # number of rows, excluding the item_name header
        return support

    def select(self, c):
        "Return the candidate locations of size c"
        stack = []
        for i in self.location:
            for j in self.num:
                if j in i:
                    if len(i) == c:
                        stack.append(i)
                else:
                    stack.append([j] + i)
        # Remove duplicates from the list of lists
        s = sorted([sorted(i) for i in stack])
        location = list(s for s, _ in itertools.groupby(s))
        return location

    def del_location(self, support, location):
        "Remove candidates that do not qualify"
        # Drop candidates below the minimum support.
        # NOTE: the comparison is truncated in the source ("if i"). The
        # threshold below is an assumption: min_support is read as a
        # percentage of the number of data rows.
        min_count = self.line_num * self.min_support / 100.0
        for index, i in enumerate(support):
            if i < min_count:
                support[index] = 0
        # Second Apriori law: drop candidates with an infrequent subset
        for index, j in enumerate(location):
            sub_location = [j[:index_loc] + j[index_loc + 1:] for index_loc in range(len(j))]
            flag = 0
            for k in sub_location:
                if k not in self.location:
                    flag = 1
                    break
            if flag:
                support[index] = 0
        # Delete the locations that were zeroed out
        location = [i for i, j in zip(location, support) if j != 0]
        support = [i for i in support if i != 0]
        return support, location

    def loop(self):
        "Iterate over the levels of frequent itemsets"
        s = 2
        while True:
            print('-' * 80)
            print('The', s - 1, 'loop')
            print('location', self.location)
            print('support', self.support)
            print('num', self.num)
            print('-' * 80)
            # Generate the next level of candidates
            location = self.select(s)
            support = self.sut(location)
            support, location = self.del_location(support, location)
            num = list(sorted(set([j for i in location for j in i])))
            s += 1
            if location and support and num:
                self.pre_num = self.num
                self.pre_location = self.location
                self.pre_support = self.support
                self.num = num
                self.location = location
                self.support = support
            else:
                break

    def confidence_sup(self):
        "Compute the confidence of each rule"
        if sum(self.pre_support) == 0:
            print('min_support error')  # the first iteration already failed
        else:
            for index_location, each_location in enumerate(self.location):
                # Build the subsets that belong to the previous level
                del_num = [each_location[:index] + each_location[index + 1:] for index in range(len(each_location))]
                # Keep only subsets that are frequent at the previous level
                del_num = [i for i in del_num if i in self.pre_location]
                # Look up their support counts from the previous level
                del_support = [self.pre_support[self.pre_location.index(i)] for i in del_num if i in self.pre_location]
                for index, i in enumerate(del_num):  # support and confidence of each rule
                    support = float(self.support[index_location]) / self.line_num * 100  # support
                    s = [j for index_item, j in enumerate(self.item_name) if index_item in i]
                    if del_support[index]:
                        confidence = float(self.support[index_location]) / del_support[index] * 100
                        if confidence > self.min_confidence:
                            print(','.join(s), '->>', self.item_name[each_location[index]],
                                  ' min_support: ', str(support) + '%',
                                  ' min_confidence:', str(confidence) + '%')


def main():
    c = Apriori('basket.txt', 14, 3, 13)
    d = Apriori('simple.txt', 50, 2, 6)


if __name__ == '__main__':
    main()
Apriori algorithm
Apriori(filename, min_support, item_start, item_end)
Parameters
filename: file name (with path)
min_support: minimum support
item_start: position of the first item column
item_end: position of the last item column
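How item_start and item_end select columns can be seen by running the line-splitting logic on its own. The sketch below copies the slicing expression from deal_line in the listing; the header and row strings are hypothetical, and the input layout (a space-separated header of item names, then rows with 'T' marking a present item) is inferred from the code rather than taken from the actual basket.txt.

```python
def deal_line(line, item_start, item_end):
    """Split on spaces and keep columns item_start..item_end (1-based, inclusive)."""
    return [i.strip() for i in line.split(' ') if i][item_start - 1:item_end]


# Hypothetical file content: one id column followed by three item columns.
header = "id milk bread beer\n"
row = "1 T F T\n"

names = deal_line(header, 2, 4)   # item names from the header line
flags = deal_line(row, 2, 4)      # 'T' marks an item present in this record
present = [n for n, f in zip(names, flags) if f == 'T']

print(names)    # ['milk', 'bread', 'beer']
print(flags)    # ['T', 'F', 'T']
print(present)  # ['milk', 'beer']
```

With this layout, item_start=2 skips the id column, and the index 0 used inside the class refers to the first selected column (milk here).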
Usage example:
import apriori
c = apriori.Apriori('basket.txt', 11, 3, 13)
Output:
--------------------------------------------------------------------------------
The 1 loop
location [[0], [1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]
support [299, 183, 177, 303, 204, 302, 293, 287, 184, 292, 276]
num [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
The 2 loop
location [[0, 9], [3, 5], [3, 6], [5, 6], [7, 10]]
support [145, 173, 167, 170, 144]
num [0, 3, 5, 6, 7, 9, 10]
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
The 3 loop
location [[3, 5, 6]]
support [146]
num [3, 5, 6]
--------------------------------------------------------------------------------
frozenmeal,beer ->> cannedveg min_support: 14.6% min_confidence: 0.858823529412
cannedveg,beer ->> frozenmeal min_support: 14.6% min_confidence: 0.874251497006
cannedveg,frozenmeal ->> beer min_support: 14.6% min_confidence: 0.843930635838
--------------------------------------------------------------------------------
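The confidence values in the last three lines follow from the support counts printed for loops 2 and 3. The short check below reuses those counts; the mapping of indexes 3, 5 and 6 to cannedveg, frozenmeal and beer is inferred from the arithmetic, not stated in the output, and the row count of 1000 is implied by 146 records corresponding to 14.6%.

```python
# Support counts copied from the output above.
pair_support = {(3, 5): 173, (3, 6): 167, (5, 6): 170}
triple_support = 146   # support count of [3, 5, 6]
line_num = 1000        # implied by 146 records being reported as 14.6%


def rule_confidence(itemset_count, antecedent_count):
    """confidence = count(antecedent and consequent) / count(antecedent)."""
    return float(itemset_count) / antecedent_count


support_percent = float(triple_support) / line_num * 100
confidences = [rule_confidence(triple_support, pair_support[a])
               for a in [(5, 6), (3, 6), (3, 5)]]

print(support_percent)
for value in confidences:
    print(value)
```

Each rule takes one frequent 2-itemset as its antecedent and the remaining item as its consequent, so the three rules share the same support and differ only in the denominator.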
