Pandas & Numpy

machine_learning

Pandas: basic application of Series & DataFrame

1. preparation

  1. import pandas as pd
  2. from pandas import Series,DataFrame

2. Series

  1. ser = Series([1,2,3]) # from a list
  2. ser = Series([1,2,3,4],index = ['a','b','c','d'])
  3. ser = Series({'p':1,'a':2,'n':3}) # from a dictionary
  4. ser.values
  5. ser.index
  6. ser[2] # access by default integer position
  7. ser['b'] # access by custom index label
  8. ser[['a','c']] # several labels at once
  9. pd.isnull(ser) # check for NaN values
  10. pd.notnull(ser)
  11. ser.isnull()
  12. ser.notnull()
  13. ser1 + ser2 # values are added where index labels match
  14. ser.name = 'python' # set the name of the series
  15. ser.index.name = 'index name' # set the name of the index column
  16. ser.index = ['a','b','v'] # replace the index labels
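A minimal runnable sketch combining the calls above (the values are illustrative):

  1. import pandas as pd
  2. from pandas import Series
  3. ser = Series([1,2,3,4],index = ['a','b','c','d'])
  4. ser.name = 'python' # name of the Series
  5. ser.index.name = 'index name' # name of the index column
  6. print(ser['b']) # access by label -> 2
  7. print(ser[['a','c']]) # several labels at once
  8. print(ser.isnull()) # elementwise NaN check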

3. DataFrame

  1. dic = {'id':[1,2,3,4],'num':[2,4,6,8],'rank':[3,5,7,9]}
  2. f = DataFrame(dic)
  3. f = DataFrame(dic,columns = ['id','num','rank'],index = ['a','b','c','d'])
  4. f = DataFrame({'dic1':{1:1,2:4},'dic2':{1:3,2:6}}) # nested dicts: outer keys become columns, inner keys the index
  5. f['id']
  6. f.id # without [ ]
  7. f.loc['a'] # row access by label (ix is removed from modern pandas)
  8. f.iloc[1] # row access by position
  9. f['rank'] = np.arange(4) # overwrite a column; length must match
  10. f['rank'] = Series([1,2,3],index = ['a','b','c']) # aligned by index; missing labels get NaN
  11. del f['rank']
  12. f.columns.name = 'name'
  13. f.columns.tolist() # list of column names
  14. f.index.name = 'info'
  15. f.values
  16. df[val] # column selection
  17. df.loc[val] # row by label (formerly df.ix[val])
  18. df.loc[:,val] # column by label
  19. df.loc[val1,val2] # rows and columns by label
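Putting the DataFrame calls above together (a runnable sketch; the data is made up):

  1. import numpy as np
  2. from pandas import DataFrame
  3. dic = {'id':[1,2,3,4],'num':[2,4,6,8],'rank':[3,5,7,9]}
  4. f = DataFrame(dic,index = ['a','b','c','d'])
  5. print(f['id']) # a column by name
  6. print(f.loc['a']) # a row by label
  7. print(f.iloc[1]) # a row by position
  8. print(f.loc[:,'num']) # all rows, one column
  9. print(f.loc['a','num']) # a single cell by labels
  10. f['rank'] = np.arange(4) # overwrite a column
  11. del f['rank'] # drop a column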

icol and irow were the old positional column/row accessors; current pandas uses iloc instead.
get_value and set_value were the old scalar accessors; current pandas uses at (by label) and iat (by position) instead.
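A short sketch of the modern scalar accessors, reusing the f built above:

  1. print(f.at['a','id']) # fast scalar access by labels
  2. print(f.iat[0,0]) # fast scalar access by positions
  3. f.at['a','id'] = 10 # scalar assignment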

4. Methods for Pandas

describe computes summary statistics for a Series or each DataFrame column
mean mean of the values
count number of non-NA values
min, max minimum and maximum
cumsum cumulative sum of the values
argmin, argmax integer positions at which the minimum and maximum are reached
median arithmetic median (the 50% quantile)
idxmin, idxmax index labels at which the minimum and maximum are reached
sum sum of the values
var sample variance
std sample standard deviation
diff first-order difference (useful for time series)

quantile sample quantile, from 0 to 1
mad mean absolute deviation from the mean
skew sample skewness (third moment)
kurt sample kurtosis (fourth moment)
cummin, cummax cumulative minimum and maximum
cumprod cumulative product
pct_change percent change

Common arguments for these reductions:
axis the axis to reduce over: 0 for a DataFrame's rows, 1 for its columns
skipna exclude missing values; defaults to True
level if the axis is hierarchically indexed (MultiIndex), reduce grouped by level
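A quick sketch applying a few of these to a toy DataFrame:

  1. import numpy as np
  2. import pandas as pd
  3. df = pd.DataFrame({'a':[1.0,2.0,np.nan],'b':[4.0,5.0,6.0]})
  4. print(df.describe()) # per-column summary statistics
  5. print(df.sum()) # column sums; NaN skipped by default
  6. print(df.sum(axis=1)) # row sums
  7. print(df.mean(skipna=False)) # NaN propagates when skipna=False
  8. print(df.cumsum()) # cumulative sum down each column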

Scikit-Learn Machine Learning

1. concept:

  1. ML considers how to predict some characteristic of unknown data from a set of sample data.
  2. classification:
    • supervised learning:
      the sample data carries the quantity we want to predict. it can be divided into:
      Classification: samples belong to two or more classes; we predict the class of unknown data by learning from samples of known classes.
      Regression: the desired output consists of continuous variables.
    • unsupervised learning:
      the sample data contains no target values. the aim is to discover groups of similar samples within the data (clustering), to estimate the density distribution of the input space (density estimation), or to reduce dimensionality for data visualization.
  3. use what we learn from the training set to predict the test set (see the split sketch below)
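A common way to realize this split, sketched with sklearn.model_selection.train_test_split:

  1. from sklearn import datasets
  2. from sklearn.model_selection import train_test_split
  3. iris = datasets.load_iris()
  4. X_train,X_test,y_train,y_test = train_test_split(iris.data,iris.target,test_size=0.25,random_state=0) # hold out 25% as a test set
  5. print(X_train.shape,X_test.shape)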

2. basic operations:

  1. from sklearn import datasets
  2. iris = datasets.load_iris()
  3. iris.data.shape # see how the data is stored
  4. iris.target.shape # see how the target labels are stored
  5. import numpy as np
  6. np.unique(iris.target) # show all distinct classes
  7. digit = datasets.load_digits()
  8. print(digit.data)
  9. digit.target

3. learn and predict

in scikit-learn, an estimator is a python object which implements the fit(X,y) and predict(T) methods.
the class sklearn.svm.SVC is an estimator that performs support vector classification. we treat it as a black box, regardless of the underlying algorithm and the choice of parameters.

  1. from sklearn import svm
  2. clf = svm.SVC(gamma=0.001,C=100.)
  3. # set gamma manually
  4. # better parameters can be found with grid search and cross-validation

pass the training set to the fit method for data fitting. here we hold out the last sample for prediction and use the rest as training data

  1. clf.fit(digit.data[:-1],digit.target[:-1])

the output is

SVC(C=100.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,gamma=0.001, kernel='rbf', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False)

then predict the held-out sample

  1. clf.predict(digit.data[-1:]) # predict expects a 2-D array, hence the slice

output:
array([8])
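Going beyond a single prediction, a sketch of scoring the classifier on a held-out split (using train_test_split as above):

  1. from sklearn import datasets,svm
  2. from sklearn.model_selection import train_test_split
  3. digit = datasets.load_digits()
  4. X_train,X_test,y_train,y_test = train_test_split(digit.data,digit.target,test_size=0.25,random_state=0)
  5. clf = svm.SVC(gamma=0.001,C=100.)
  6. clf.fit(X_train,y_train)
  7. print(clf.score(X_test,y_test)) # mean accuracy on the test set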

4. Regression

in a regression model, the target value is a linear combination of the input values:
y(w,x) = w_0 + w_1*x_1 + w_2*x_2 + ... + w_p*x_p
w = (w_1,w_2,...,w_p) is stored in coef_ (the coefficients)
w_0 is stored in intercept_ (the intercept)
the least squares method fits the model by minimizing the sum of squared residuals

  1. from sklearn import linear_model
  2. clf = linear_model.LinearRegression()
  3. clf.fit([[0,0],[1,1]],[0,1])
  4. clf.coef_

output:
array([0.5,0.5])
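Continuing the sketch, the fitted model can report its intercept and predict new points:

  1. clf.intercept_ # w_0, here 0.0
  2. clf.predict([[2,2]]) # -> array([2.]) for this fit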

5. Classification

the simplest classification algorithm is nearest neighbor: given a new observation, take the label of the nearest training sample in N-dimensional space as its label.

  1. from sklearn import neighbors
  2. knn = neighbors.KNeighborsClassifier()
  3. knn.fit(iris.data,iris.target)
  4. knn.predict([[0.1,0.2,0.3,0.4]])

output:
array([0])

6. Cluster

The simplest clustering algorithm is k-means: it partitions the data into k clusters, assigning each sample to the cluster whose current mean is closest.
As samples are assigned, the cluster means are updated, and the procedure repeats until the clusters converge or a maximum number of iterations is reached (determined by max_iter). A sketch follows below.
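The section gives no code, so here is a minimal sketch, assuming sklearn.cluster.KMeans and the iris data from above:

  1. from sklearn import cluster,datasets
  2. iris = datasets.load_iris()
  3. k_means = cluster.KMeans(n_clusters=3,max_iter=300,n_init=10)
  4. k_means.fit(iris.data)
  5. k_means.labels_ # cluster assignment of each sample
  6. k_means.cluster_centers_ # the k cluster means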

Numpy basic operations

import numpy as np

1. numeric types

2. Numpy array object

  1. a = np.arange(10) # generate ordered numbers
  2. a[3:7] # slice
  3. a[:7:2] # with a step
  4. a[::-1] # reverse
  5. a = np.array([np.arange(3),np.arange(3)]) # a 2x3 array
  6. a.dtype # check the data type
  7. a.dtype.itemsize # bytes per element
  8. a.shape # dimensions as a tuple
  9. a[1,2] # access an element
  10. ar = np.arange(24)
  11. ma = ar.reshape(2,3,4) # returns a reshaped view; ar is unchanged
  12. ar.shape = (2,3,4) # reshape ar in place
  13. # resize() is like reshape() but modifies the array itself:
  14. br = np.arange(24)
  15. br.resize(2,12)

flatten & ravel

  1. ma.flatten() # returns a flattened copy
  2. ma.ravel() # returns a flattened view when possible
  3. # both collapse multiple dimensions into one

transpose
ma.transpose()

  1. a = np.array([[0,1,2],[3,4,5],[6,7,8]])
  2. np.hsplit(a,3) # split into 3 pieces along columns
  3. np.vsplit(a,3) # split into 3 pieces along rows
  4. c = np.arange(27).reshape(3,3,3)
  5. np.dsplit(c,3) # split along the depth axis

array attributes:

# b is a numpy array
b.ndim # the number of dimensions
b.size # the count of elements
b.itemsize # the count of bytes per element
b.nbytes # the total bytes (= size * itemsize)
b.T # the same as b.transpose()

complex numbers can be held like:
c = np.array([1.+2.j,3.+4.j])
c.real and c.imag give the real and imaginary parts respectively

make it flat:
f = b.flat gives back a numpy.flatiter object which can be used to iterate over each element

  1. for item in f:
  2.     print(item)

obtain elements with the flatiter object:
b.flat[2] # a single element; b.flat[[1,3]] # several elements

  1. import scipy.misc
  2. import matplotlib.pyplot as plt
  3. img = scipy.misc.ascent() # a 512x512 sample image; lena() was removed from SciPy (newer versions: scipy.datasets.ascent())
  4. img.copy() # a deep copy of the data
  5. img.view() # a new view onto the same data
  6. plt.subplot(221)
  7. plt.imshow(img)
  8. plt.show()

fancy indexing

  1. import scipy.misc
  2. import matplotlib.pyplot as plt
  3. img = scipy.misc.ascent() # a 512x512 sample image (see above)
  4. xmax = img.shape[0]
  5. ymax = img.shape[1]
  6. img[range(xmax),range(ymax)] = 0 # zero the main diagonal
  7. img[range(xmax-1,-1,-1),range(ymax)] = 0 # zero the anti-diagonal
  8. plt.imshow(img)
  9. plt.show()
Numpy I/O

read from a csv file:

  1. np.genfromtxt("file.csv", delimiter=',')
  2. # returns a two-dimensional array
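A round-trip sketch (the file name is made up):

  1. import numpy as np
  2. data = np.arange(12.0).reshape(3,4)
  3. np.savetxt("file.csv", data, delimiter=',') # write a CSV
  4. loaded = np.genfromtxt("file.csv", delimiter=',')
  5. loaded.shape # (3, 4): a two-dimensional array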