ML Algorithms and Practice (Basics)

machine_learning

1. Vectorized Programming Basics

Matrices

Three main uses:
Solving linear equations: via distance computation
Reducing the degree of equations: quadratic forms, raising the dimension
Transformations: dimensionality reduction

Vectorized programming

Built on basic matrix operations.
MATLAB and Python delegate matrix operations to C routines for numerical and parallel computation.
Graphics processing unit (GPU)

NumPy matrix operations

Initialization

    # initialize 3*4 matrices
    import numpy as np
    allZero = np.zeros([3,4])        # all-zero matrix
    allOne = np.ones([3,4])          # all-one matrix
    myrandom = np.random.rand(3,4)   # random matrix
    myeye = np.eye(3)                # 3*3 identity matrix
    # initialize a matrix from a nested list
    myMatrix = np.mat([[1,2],[2,4],[4,8]])

Element-wise operations

    print(matrixA + matrixB)
    print(matrixA - matrixB)
    print(c * matrixA)
    # sum of all elements
    np.sum(myMatrix)
    # multiply elements at the same positions;
    # broadcasting expands the operands if the dimensions do not match
    np.multiply(matrixA, matrixB)
    # raise each element to a power
    np.power(matrix, 2)

Matrix multiplication

    matrixA * matrixB
    # the * operator multiplies matrices; use multiply() for element-wise products

Matrix transpose

    matrix.T
    matrix.transpose()

Other operations: row/column slicing, copying, comparison

    # number of rows and columns
    m, n = np.shape(matrix)
    # slice a row
    row = matrix[0]
    # slice a column
    col = matrix.T[0]
    # copy a matrix
    copied = matrix.copy()
    # compare each pair of elements at the same position;
    # returns a boolean matrix
    matrixA > matrixB

NumPy's linalg module provides linear-algebra routines

1 Determinant
np.linalg.det(matrix)
2 Inverse
np.linalg.inv(matrix)
3 Rank
np.linalg.matrix_rank(matrixA)
4 Solving a system of linear equations

    # solve A*s = B
    s = np.linalg.solve(A, B)

2. Mathematical Foundations

The three pillars of modern mathematics: probability theory (what states things may be in), numerical analysis (how they change), and linear algebra (different dimensions for observing them).

2.1 Similarity Measurement (distances between vectors)

Minkowski distance: d = (sum_i |a_i - b_i|^p)^(1/p)

When p = 2, d is the Euclidean distance:

    sqrt((vectorA - vectorB) * (vectorA - vectorB).T)

When p -> infinity, d is the Chebyshev distance, which is the same as:

    abs(vectorA - vectorB).max()

Hamming distance (the number of positions at which the two vectors differ):

    tmp = np.nonzero(vectorA - vectorB)   # indices of the differing positions
    tmp2 = np.shape(tmp)
    tmp2[1]                               # number of differing positions

Jaccard distance (via SciPy):

    from numpy import *
    import scipy.spatial.distance as dist
    matV = mat([vectorA, vectorB])
    dist.pdist(matV, 'jaccard')
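A quick runnable check of the Euclidean and Chebyshev expressions above, using two hypothetical vectors:

    import numpy as np

    vectorA = np.mat([1, 2, 3])
    vectorB = np.mat([4, 6, 3])
    euclid = np.sqrt((vectorA - vectorB) * (vectorA - vectorB).T)   # sqrt(9 + 16 + 0) = 5
    cheby = np.abs(vectorA - vectorB).max()                         # max(3, 4, 0) = 4
    print(euclid, cheby)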

2.2 Probability Theory

Bayes' formula: P(A|B) = P(B|A) * P(A) / P(B)

Covariance matrix: entry (i, j) is cov(X_i, X_j) = E[(X_i - E[X_i])(X_j - E[X_j])]

Correlation coefficient: rho = cov(A, B) / (std(A) * std(B))

Its value lies in [-1, 1]: -1 means perfect negative linear correlation, 1 means perfect positive linear correlation.

Correlation distance: D = 1 - rho

    # means
    m1 = np.mean(vectorA)
    m2 = np.mean(vectorB)
    # standard deviation = square root of the variance
    dv1 = np.std(vectorA)
    dv2 = np.std(vectorB)
    # correlation coefficient
    corref = np.mean(np.multiply(vectorA - m1, vectorB - m2)) / (dv1 * dv2)
    # correlation coefficient matrix
    print(np.corrcoef(np.mat([vectorA, vectorB])))

Mahalanobis distance
Removes the influence of differing scales/units on the measured correlation. With S the covariance matrix:
D_M = sqrt((vectorA - vectorB) * inv(S) * (vectorA - vectorB).T)

    matrix = np.mat([vectorA, vectorB])
    covinv = np.linalg.inv(np.cov(matrix))
    tp = vectorA - vectorB
    mdistance = np.sqrt(np.dot(np.dot(tp, covinv), tp.T))

2.3 Linear Space Transformations

Vector times matrix: transforms a vector from one linear space into another.
Matrix times matrix: a space transformation of a group of vectors; dimensions must align.
Eigenvectors: vectors that are only scaled, never rotated, by the transformation.
Eigenvalues: the corresponding scaling factors.

    # eigenvalues and eigenvectors
    evals, evecs = np.linalg.eig(matrix)
    # reconstruct the original matrix
    matrix = evecs * np.diag(evals) * np.linalg.inv(evecs)

2.4 Data Normalization

Scale values into decimals in (0, 1), or turn dimensional quantities into dimensionless ones.
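A minimal sketch of min-max normalization in NumPy (column-wise scaling is one common choice; the function and variable names are illustrative):

    import numpy as np

    def min_max_normalize(data):
        # column-wise min-max scaling: (x - min) / (max - min) maps each column into [0, 1]
        min_vals = data.min(axis=0)
        max_vals = data.max(axis=0)
        return (data - min_vals) / (max_vals - min_vals)

    print(min_max_normalize(np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])))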

2.5 Data Processing and Visualization

    import numpy as np
    def fileToMatrix(path, delimiter):   # 'path' is the path of a text file
        fp = open(path, "r")
        content = fp.read()
        fp.close()
        rowlist = content.splitlines()
        recordlist = [row.split(delimiter) for row in rowlist if row.strip()]
        return np.mat(recordlist)
    import matplotlib.pyplot as plt
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(X, Y, c='red', marker='o')   # draw dots
    ax.plot(X, Y, 'r')                      # draw a curve
    plt.show()

http://matplotlib.org/examples/index.html

3. Chinese Text Classification

Overall pipeline:
preprocessing -- Chinese word segmentation -- building the term vector space -- weighting strategy -- applying a classifier -- evaluating the results

3.1 Preprocessing

3.2 Chinese Word Segmentation

Chinese text has no formal delimiters between words, which makes segmentation one of the core problems of NLP.

Chinese segmentation algorithm: conditional random fields (CRF), a probabilistic graphical model (Lafferty).

Structured text models: the term vector space model, topic models, dependency-parse trees, RDF graph representations.

The jieba segmentation toolkit: CRF-based segmentation, implemented in Python.

    # each call returns an iterable generator
    # precise (default) mode
    seg_list = jieba.cut("中文文本串", cut_all=False)
    # full mode
    seg_list = jieba.cut("中文文本串", cut_all=True)
    # search-engine mode
    seg_list = jieba.cut_for_search("中文文本串")
    print("/".join(seg_list))
    list(seg_list)

More usage examples: https://www.oschina.net/p/jieba

    import sys
    import os
    import jieba
    import codecs
    # reload(sys) / sys.setdefaultencoding('unicode') are obsolete Python 2 idioms

    def savefile(path, content):
        fp = open(path, "w")
        fp.write(content)
        fp.close()

    def readfile(path):
        fp = open(path, "r")
        try:
            content = fp.read()
        except:
            content = ""
        fp.close()
        return content

    corpus_path = "D:/Data Science Experiment/Chinese Text Clustering/train_corpus/"
    seg_path = "D:/Data Science Experiment/Chinese Text Clustering/corpus_segments/"
    # get all the class sub-directories under this path
    catelist = os.listdir(corpus_path)
    print(catelist)
    for mydir in catelist:
        class_path = corpus_path + mydir + "/"
        seg_dir = seg_path + mydir + "/"
        if not os.path.exists(seg_dir):
            os.makedirs(seg_dir)
        file_list = os.listdir(class_path)
        print(file_list)
        for file_path in file_list:
            print(file_path)
            # build the full path of the source txt
            fullname = class_path + file_path
            content = readfile(fullname).strip()
            content = content.replace("\r\n", "").strip()
            content_seg = jieba.cut(content)
            savefile(seg_dir + file_path, "/".join(content_seg))
    import sys
    import os
    import pickle
    from sklearn.datasets.base import Bunch

    bunch = Bunch(target_name=[], label=[], filename=[], contents=[])
    seg_path = "D:/Data Science Experiment/Chinese Text Clustering/corpus_segments/"
    bunch_path = "D:/Data Science Experiment/Chinese Text Clustering/words_bag.dat"
    catelist = os.listdir(seg_path)
    print(catelist)
    for mydir in catelist:
        class_path = seg_path + mydir + "/"
        file_list = os.listdir(class_path)
        for file_path in file_list:
            fullname = class_path + file_path
            bunch.label.append(mydir)
            bunch.filename.append(fullname)
            fileObj = open(fullname)
            bunch.contents.append(fileObj.read().strip())
            fileObj.close()
    file_obj = open(bunch_path, "wb")
    pickle.dump(bunch, file_obj)
    file_obj.close()
    print("finished!")

3.3 Weighting Strategy: TF-IDF

Term frequency (TF)
The frequency with which a given term appears in a given document (repeated occurrences are counted):
TF(w, d) = count of w in d / total number of terms in d

Inverse document frequency (IDF)
Measures how generally important a term is across the corpus; a common smoothed form is
IDF(w) = log(N / (1 + number of documents containing w))

TF-IDF
TF-IDF(w, d) = TF(w, d) * IDF(w)
measures the importance of a term w: a term that is frequent inside one document but rare across the whole document collection discriminates between classes well, so its TF-IDF weight is high.
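A quick numeric check of these formulas on a hypothetical three-document toy corpus (using the smoothed IDF variant written above):

    import math

    docs = [["apple", "banana", "apple"], ["banana", "cherry"], ["cherry", "durian"]]
    N = len(docs)
    word = "apple"
    tf = docs[0].count(word) / len(docs[0])    # 2/3
    df = sum(1 for d in docs if word in d)     # appears in 1 document
    idf = math.log(N / (1 + df))               # log(3/2)
    print(tf * idf)                            # roughly 0.27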

    from sklearn.datasets.base import Bunch
    import pickle
    from sklearn import feature_extraction
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.feature_extraction.text import TfidfVectorizer

    def readBunchObj(path):
        fileObj = open(path, "rb")
        bunch = pickle.load(fileObj)
        fileObj.close()
        return bunch

    def writeBunchObj(path, bunchObj):
        fileObj = open(path, "wb")
        pickle.dump(bunchObj, fileObj)
        fileObj.close()

    path = "D:/Data Science Experiment/Chinese Text Clustering/words_bag.dat"
    bunch = readBunchObj(path)
    tfidfspace = Bunch(target_name=bunch.target_name, label=bunch.label, filename=bunch.filename, tdm=[], vocabulary=[])
    vectorizer = TfidfVectorizer(max_df=0.5)
    transformer = TfidfTransformer()
    tfidfspace.tdm = vectorizer.fit_transform(bunch.contents)
    tfidfspace.vocabulary = vectorizer.vocabulary_
    space_path = "D:/Data Science Experiment/Chinese Text Clustering/tfidf_space.dat"
    writeBunchObj(space_path, tfidfspace)
    print("finished!")

3.4 Text Classification Methods

1. kNN (nearest neighbors): simple, moderate accuracy, slow
2. Naive Bayes: works well on short texts, high accuracy
3. Support vector machines: handle linearly non-separable cases

Naive Bayes implementation

    import numpy as np

    class bayes(object):
        def __init__(self):
            self.vocab = []        # vocabulary list
            self.idf = 0           # idf matrix
            self.tf = 0            # tf matrix
            self.tdm = 0           # P(x|y)
            self.Pcates = {}       # P(y): prior probability of each class
            self.labels = []
            self.doclenth = 0
            self.vocablen = 0
            self.testset = 0

        def train_set(self, trainset, classvec):
            self.cate_pro(classvec)
            self.doclenth = len(trainset)
            tempset = set()
            [tempset.add(word) for doc in trainset for word in doc]
            self.vocab = list(tempset)
            self.vocablen = len(self.vocab)
            self.cul_word_freq(trainset)
            self.build_tdm()

        def cate_pro(self, classvec):
            self.labels = classvec
            labeltmp = set(self.labels)
            for label in labeltmp:
                self.Pcates[label] = float(self.labels.count(label)) / float(len(self.labels))

        def cul_word_freq(self, trainset):
            # build the idf (1 x vocablen) and tf (doclenth x vocablen) matrices
            self.idf = np.zeros([1, self.vocablen])
            self.tf = np.zeros([self.doclenth, self.vocablen])
            for f in range(self.doclenth):
                for word in trainset[f]:
                    self.tf[f, self.vocab.index(word)] += 1
                for signalword in set(trainset[f]):
                    self.idf[0, self.vocab.index(signalword)] += 1

        def build_tdm(self):
            # accumulate term frequencies per class and normalize to get P(x|y)
            self.tdm = np.zeros([len(self.Pcates), self.vocablen])
            sumlist = np.zeros([len(self.Pcates), 1])
            for index in range(self.doclenth):
                self.tdm[self.labels[index]] += self.tf[index]
                sumlist[self.labels[index]] = np.sum(self.tdm[self.labels[index]])
            self.tdm = self.tdm / sumlist

        def mapTolist(self, testdata):
            self.testset = np.zeros([1, self.vocablen])
            for w in testdata:
                self.testset[0, self.vocab.index(w)] += 1

        def predict(self, testset):
            if np.shape(testset)[1] != self.vocablen:
                print("dimension error")
                exit(0)
            else:
                prevalue = 0
                preclass = ""
                for tdm_vec, keyclass in zip(self.tdm, self.Pcates):
                    tmp = np.sum(testset * tdm_vec * self.Pcates[keyclass])
                    if tmp > prevalue:
                        prevalue = tmp
                        preclass = keyclass
                return preclass
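A minimal usage sketch with hypothetical toy data; the class labels are assumed to be integer indices 0..C-1, since they are used to index rows of tdm:

    trainset = [["good", "great", "fun"], ["bad", "awful"], ["great", "enjoyable"]]
    classvec = [0, 1, 0]

    nb = bayes()
    nb.train_set(trainset, classvec)
    nb.mapTolist(["great", "fun"])
    print(nb.predict(nb.testset))   # expected to print 0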

3.5 Evaluating Classification Results

Three metrics for evaluating machine-learning algorithms:

Recall
The fraction of the relevant documents that are retrieved:
R = TP / (TP + FN)

Precision
The fraction of the retrieved documents that are relevant:
P = TP / (TP + FP)

F-score
F_t = (1 + t^2) * P * R / (t^2 * P + R)
where t is a parameter, P is precision and R is recall.
With t = 1 this is the common F1 measure, the harmonic mean of P and R: F1 = 2PR / (P + R).
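A small scikit-learn check of these formulas (hypothetical label vectors):

    from sklearn.metrics import precision_score, recall_score, f1_score

    y_true = [1, 0, 1, 1, 0, 1]
    y_pred = [1, 0, 0, 1, 1, 1]
    print(precision_score(y_true, y_pred))   # P = TP / (TP + FP) = 3/4
    print(recall_score(y_true, y_pred))      # R = TP / (TP + FN) = 3/4
    print(f1_score(y_true, y_pred))          # F1 = 2PR / (P + R) = 0.75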

3.6 Naive Bayes Classification

The derivation of the formulas is omitted.
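For reference, the standard naive Bayes decision rule (stated from general knowledge, not reproduced from the omitted derivation): the predicted class is

    \hat{y} = \arg\max_{y} P(y) \prod_{i} P(x_i \mid y)

i.e. pick the class that maximizes the prior times the product of per-feature likelihoods.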

3.7 KNN Classification

k-nearest neighbors: a simple ML algorithm that classifies by measuring distances between feature vectors. If most of the K nearest neighbors of a sample in feature space belong to a certain class, the sample is assigned to that class.
Steps:
1. Choose the value of K (often an odd number).
2. Choose the distance (or similarity) formula; for text classification cosine similarity is often used. Select the nearest K samples.
3. Count how often each class appears among the K samples; the most frequent class is the prediction.

    import numpy as np

    def cosine(vec1, vec2):
        v1 = np.array(vec1)
        v2 = np.array(vec2)
        return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

    # return the predicted label of test_vec
    def classify(test_vec, train_set, list_class, k):
        data_set_size = len(train_set)
        distance = np.array(np.zeros(data_set_size))   # 1-D array of cosine similarities
        for index in range(data_set_size):
            distance[index] = cosine(test_vec, train_set[index])
        sorted_dist = np.argsort(-distance)   # descending similarity = nearest first
        print(sorted_dist)
        class_cnt = {}
        for i in range(k):
            label = list_class[sorted_dist[i]]
            class_cnt[label] = class_cnt.get(label, 0) + 1
        sorted_class_cnt = sorted(class_cnt.items(), key=lambda d: d[1], reverse=True)
        print(sorted_class_cnt)
        return sorted_class_cnt[0][0]
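A quick usage sketch with hypothetical toy vectors:

    train_set = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    list_class = ["sports", "sports", "finance", "finance"]
    print(classify([0.8, 0.2], train_set, list_class, k=3))   # expected: "sports"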

4 Decision Trees

(CLS learning system -- ID3 -- ID4 -- C4.5 & CART)

4.1 The Basic Idea of Decision Trees

CLS (Concept Learning System): the foundation of decision trees.
There are three kinds of tree nodes: the root, inner nodes, and leaves.
Steps:
1. Start from an empty tree; randomly select the first feature as the root.
2. Split the data according to that feature. If a resulting subset is empty or all of its samples carry the same label, it becomes a leaf; otherwise it becomes an inner node.
3. For every inner node, select a new feature and split again, until all subsets are leaves.

The root and the inner nodes correspond to features; the leaves carry the class decision.
The branches correspond to the values the chosen feature can take.

4.2 Algorithmic Framework of Decision Trees

1. Main function

A recursive function responsible for growing nodes and terminating the algorithm.

2. Best-feature selection subfunction

This is where decision-tree algorithms fundamentally differ:
traverse the current dataset, evaluate every feature, and return the best one.

3. Split function

Split the dataset on the chosen feature and remove that feature.

4. Classifier

Classify or predict new samples.

4.3 Information Entropy

The information entropy of a dataset D whose classes appear with proportions p_i is
H(D) = - sum_i p_i * log2(p_i)
(this is what computeEntropy in the code below computes). The information gain of splitting D on feature A is
Gain(D, A) = H(D) - sum_v (|D_v| / |D|) * H(D_v).

4.4 ID3 Decision Tree

The ID3 algorithm selects, at every node, the feature with the largest information gain.

Code:

    import copy
    import math

    class ID3_DTree(object):
        def __init__(self):
            self.tree = {}
            self.dataset = []
            self.labels = []

        def train(self):
            labels = copy.deepcopy(self.labels)
            self.tree = self.buildTree(self.dataset, labels)

        """
        dataSet input format:
        one row per object, one column per feature;
        the last column holds the class label
        """
        def buildTree(self, dataset, labels):
            cateList = [data[-1] for data in dataset]          # collect all class labels
            if cateList.count(cateList[0]) == len(cateList):   # all samples share one class
                return cateList[0]
            if len(dataset[0]) == 1:                           # no features left
                return self.maxCate(cateList)
            bestFeat = self.getBestFeat(dataset)
            bestFeatLabel = labels[bestFeat]
            tree = {bestFeatLabel: {}}
            del(labels[bestFeat])
            # all values taken by the chosen feature column
            uniqueVals = set([data[bestFeat] for data in dataset])
            for value in uniqueVals:
                subLabels = labels[:]                           # copy the label list
                splitDataSet = self.splitDataSet(dataset, bestFeat, value)
                subTree = self.buildTree(splitDataSet, subLabels)
                tree[bestFeatLabel][value] = subTree
            return tree

        # return the label that appears most often
        def maxCate(self, cateList):
            items = dict([(cateList.count(i), i) for i in cateList])
            return items[max(items.keys())]

        def getBestFeat(self, dataset):
            numFeat = len(dataset[0]) - 1
            baseEntropy = self.computeEntropy(dataset)
            bestInfoGain = 0.0
            bestFeat = -1
            for i in range(numFeat):
                uniqueVals = set([data[i] for data in dataset])
                newEntropy = 0.0
                for value in uniqueVals:
                    subDataSet = self.splitDataSet(dataset, i, value)
                    prob = len(subDataSet) / float(len(dataset))
                    newEntropy += prob * self.computeEntropy(subDataSet)
                infoGain = baseEntropy - newEntropy
                if infoGain > bestInfoGain:
                    # painter's method: keep the best candidate seen so far
                    bestInfoGain = infoGain
                    bestFeat = i
            return bestFeat

        def computeEntropy(self, dataset):
            datalen = float(len(dataset))
            cateList = [data[-1] for data in dataset]
            items = dict([(i, cateList.count(i)) for i in cateList])
            infoEntropy = 0.0
            for key in items:
                prob = float(items[key]) / datalen
                infoEntropy -= prob * math.log(prob, 2)
            return infoEntropy

        # keep the rows whose column 'axis' equals 'value' and drop that column,
        # so the dataset shrinks with every split
        def splitDataSet(self, dataset, axis, value):
            rList = []
            for featVec in dataset:
                if featVec[axis] == value:
                    rFeatVec = featVec[:axis]
                    rFeatVec.extend(featVec[axis+1:])
                    rList.append(rFeatVec)
            return rList

        def predict(self, testVec):
            return self.classify(self.tree, self.labels, testVec)

        def classify(self, inputTree, featLabels, testVec):
            root = list(inputTree.keys())[0]
            secondDict = inputTree[root]
            featIndex = featLabels.index(root)
            key = testVec[featIndex]
            valueOfFeat = secondDict[key]
            if isinstance(valueOfFeat, dict):   # an inner node: recurse into the subtree
                classLabel = self.classify(valueOfFeat, featLabels, testVec)
            else:                               # a leaf: return its class
                classLabel = valueOfFeat
            return classLabel

4.5 C4.5 Algorithm

C4.5 replaces the information gain with the information gain ratio:
GainRatio(D, A) = Gain(D, A) / SplitInfo(D, A), where SplitInfo(D, A) = - sum_v (|D_v| / |D|) * log2(|D_v| / |D|).

4.6 Regression Trees

The principle of tree-based regression

Classification and regression trees (CART) are mature and widely applied; they implement regression with a decision tree.
In a regression model the sample values are split into observed inputs and outputs, both continuous.
CART uses squared-residuals minimization to choose the best split of a regression tree, and fits a linear regression model on each partition. If a partition is still hard to fit, it keeps splitting into subtrees, until every leaf node holds a linear regression model.

Squared-residuals minimization

A double loop scans every sample point of every feature column, splits the dataset in two at that point, and keeps the split with the smallest total variance (the sum of the two subsets' total variances).
The total variance of a subset is the sum of squared deviations of its values from their mean.
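A sketch of that double loop in plain NumPy (an assumed illustration of the criterion, not the book's implementation; best_split and its arguments are hypothetical names):

    import numpy as np

    def best_split(X, y):
        # return (feature index, threshold) minimizing the summed variance of the two halves
        best = (None, None, np.inf)
        for feat in range(X.shape[1]):
            for thr in np.unique(X[:, feat]):
                left, right = y[X[:, feat] <= thr], y[X[:, feat] > thr]
                if len(left) == 0 or len(right) == 0:
                    continue
                total = len(left) * left.var() + len(right) * right.var()
                if total < best[2]:
                    best = (feat, thr, total)
        return best[0], best[1]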

Model trees

The leaves hold a set of piecewise linear functions that approximate the original data curve. Model trees have a number of further properties.

Pruning

Continuous data generates a large number of branches, so the prediction tree must be pruned.
Pre-pruning: define a split threshold in advance; splitting stops when the improvement falls below the threshold.
Post-pruning: compute the misclassification counts at inner nodes; if a subtree's misclassification count minus one standard deviation is still greater than that of the corresponding leaf node, the subtree is pruned.

Regression trees in scikit-learn
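A minimal scikit-learn sketch (hypothetical data; DecisionTreeRegressor fits a piecewise-constant regression tree, the library's standard regression tree rather than a model tree):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    X = np.linspace(0, 10, 100).reshape(-1, 1)
    y = np.sin(X).ravel()
    reg = DecisionTreeRegressor(max_depth=4)
    reg.fit(X, y)
    print(reg.predict([[2.5]]))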

5 Recommender System Principles

5.1

Typical recommendation scenarios: items frequently bought together (bundled sales), customers who bought this item also bought (collaborative filtering), other items bought by customers who viewed this item, item rating lists, item review lists.

Architecture of a recommender system (figure)

Recommendation algorithms:
- demographic-based recommendation
- content-based recommendation
- collaborative filtering: user-based and item-based
- latent-factor / model-based recommendation: the SVD latent-factor model

5.2 Collaborative Filtering

CF, collaborative filtering, comes in user-based and item-based forms: find the items liked by people with similar taste, or find items similar to the items a user already likes.

1. Data preprocessing

Group and weight user behaviors.
De-noise (with a denoising algorithm) and normalize (to a common scale, for example by dividing by the maximum value).
Cluster the data to reduce the amount of computation.

2. KMeans clustering with scikit-learn

Fix the number of clusters k.
Create an initial partition: randomly pick k objects as the cluster centers; every other object joins the cluster whose center is nearest.
Iterate: whenever an object joins or leaves a cluster, recompute the cluster centers and reassign the objects; repeat until no object changes cluster.

    from sklearn.cluster import KMeans
    kmeans = KMeans(init='k-means++', n_clusters=4)
    kmeans.fit(dataMatrix)
    kmeans.cluster_centers_

3. User-based CF

User-item matrix: rows are users, columns are items, and the entries are user preference scores.

User similarity matrix: normalized by row, so each row sums to 1.

Similarity is judged with a distance function:
Euclidean distance, correlation coefficient, Jaccard distance, or cosine distance.
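A compact sketch of the user-based CF idea with cosine similarity (a hypothetical ratings matrix; rows are users, columns are items; this only illustrates the row-normalized similarity matrix described above, not a production recommender):

    import numpy as np

    ratings = np.array([[5, 3, 0, 1],
                        [4, 0, 0, 1],
                        [1, 1, 0, 5],
                        [0, 1, 5, 4]], dtype=float)

    norms = np.linalg.norm(ratings, axis=1, keepdims=True)
    sim = (ratings / norms) @ (ratings / norms).T     # user-user cosine similarity
    sim = sim / sim.sum(axis=1, keepdims=True)        # row-normalize, as described above
    scores = sim @ ratings                            # predicted preference for every item
    user = 1
    unseen = ratings[user] == 0
    print(np.argsort(-scores[user] * unseen))         # best unseen items first for user 1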

4. Item-based CF

The most widely applied variant.
Item similarity matrix: normalized by column, so each column sums to 1.

Problems: manual item categorization biases the recommendations; item similarity is influenced by individual consumption behavior and is hard for users to weight.
What is needed is an algorithm that computes item similarities separately for each type of user consumption behavior.

5. SVD

The latent-factor model (singular value decomposition, SVD) computes similarity through latent features.
Singular Value Decomposition:
The singular value decomposition of an m*n real or complex matrix M is the factorization M = U Σ V^T (V* in the complex case), where U is an m*m orthonormal matrix, Σ is an m*n rectangular diagonal matrix with non-negative values on its diagonal, and V is an n*n orthonormal matrix.
The diagonal entries of Σ are the singular values of M.
The columns of U are the left-singular vectors; the columns of V are the right-singular vectors.
U is a set of orthonormal eigenvectors of M M^T.
V is a set of orthonormal eigenvectors of M^T M, so it can be obtained by solving the eigenproblem of M^T M.

    import numpy as np
    U, s, V = np.linalg.svd(matrix)
    # s is returned in descending order, so the original (unsorted) order cannot be recovered

But why does this work?
The singular values in Σ are sorted in descending order and decay very fast; the top 10% or even 1% of them often account for 99% of the total, so the first few singular values can approximately describe the matrix. The singular values are uniquely determined by the matrix itself.

Partial (truncated) SVD:

M ≈ U_r Σ_r V_r^T, where r << m and r << n.
Storing U_r, Σ_r and V_r saves space.

A DIY SVD

    import numpy as np
    def SVD(M):
        # left-singular vectors: eigenvectors of M * M.T
        lam, hU = np.linalg.eig(M * M.T)
        # right-singular vectors: eigenvectors of M.T * M
        eV, hVT = np.linalg.eig(M.T * M)
        hV = hVT.T
        sigma = np.sqrt(lam)    # singular values
        return hU, sigma, hV

After the decomposition, keep only the first r columns/values of U, s and V, then project the vector to be evaluated into the reduced space:

    U, s, V = np.linalg.svd(dataset.T)
    V = V.T
    Ur = U[:, :r]
    Sr = np.diag(s)[:r, :r]
    Vr = V[:, :r]
    testResult = testVec * Ur * np.linalg.inv(Sr)            # project the test vector into the latent space
    result = array([dist(testResult, vi) for vi in Vr])      # dist: a user-supplied distance/similarity function
    descIndex = argsort(-result)[:r]

5.3 Evaluating KMeans

A cluster is a group of objects that are close to one another.

KMeans does not always find the correct clustering.
It is best at handling data with roughly spherical distributions, and works well when the clusters are clearly separated.
Its complexity is O(nkt), where n is the number of objects, k the number of clusters, and t the number of iterations.

Problems:
The choice of initial centers affects the number of iterations and may trap the algorithm in a local optimum.
k must be given in advance, and a good value does not transfer between datasets.
Sensitive to noise and outliers.

5.4 Bisecting KMeans

Principle: the smaller a cluster's sum of squared errors (SSE), the closer its points are to the centroid and the denser the cluster; a large SSE suggests that several natural clusters have been merged into one, so that cluster should be split again.

The code is omitted here.
代码略