∎ k-Nearest Neighbors (kNN): determines the attribute of a test point from the attributes of its k nearest neighbors, where k is a system parameter, usually chosen to be odd. For classification the decision is made by majority vote; for regression it is the average. kNN has wide application and can handle binary, multi-class, and multi-label classification as well as all kinds of regression problems.
Classification
- Binary classification: predict one of two labels. Classifying among many classes is multi-class classification.
- A lazy learner (instance-based learner) merely stores the dataset and does almost no model processing; kNN belongs to this category.
- An eager learner earnestly estimates model parameters and classifies according to them.
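Before the worked example below, here is a minimal from-scratch sketch of the kNN decision rule (the helper `knn_predict` is illustrative, not a library function):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3, task='classification'):
    """Illustrative kNN: majority vote for classification, mean for regression."""
    # Euclidean distance from the query point x to every training instance
    distances = np.sqrt(np.sum((X_train - x) ** 2, axis=1))
    # indices of the k nearest training instances
    nearest = distances.argsort()[:k]
    if task == 'classification':
        # classification: majority vote among the k nearest labels
        return Counter(np.take(y_train, nearest)).most_common(1)[0][0]
    # regression: average of the k nearest target values
    return np.mean(np.take(y_train, nearest))
```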
| Height (cm) | Weight (kg) | Sex |
|---|---|---|
| 158 | 64 | male |
| 170 | 86 | male |
| 183 | 84 | male |
| 191 | 80 | male |
| 155 | 49 | female |
| 163 | 59 | female |
| 180 | 67 | female |
| 158 | 54 | female |
| 170 | 67 | female |
Suppose we take this table as the training set; given a new data point such as (155, 70), estimate whether it is male or female:
- Set k = 3.
- The three nearest by Euclidean distance are at indices [0 5 8], with attributes [male female female].
- The output is female.
```python
import numpy as np
import matplotlib.pyplot as plt

X_train = np.array([
    [158, 64],
    [170, 86],
    [183, 84],
    [191, 80],
    [155, 49],
    [163, 59],
    [180, 67],
    [158, 54],
    [170, 67]
])
y_train = ['male', 'male', 'male', 'male',
           'female', 'female', 'female', 'female', 'female']

plt.figure()
plt.title('Human Heights and Weights by Sex')
plt.xlabel('Height in cm')
plt.ylabel('Weight in kg')
for i, x in enumerate(X_train):
    # plot males as 'x' markers and females as diamonds
    plt.scatter(x[0], x[1], c='k', marker='x' if y_train[i] == 'male' else 'D')
plt.grid(True)
plt.show()
```

Running this produces the figure below.
*(Figure: the training data)*
Predict the sex of a person 155 cm tall weighing 70 kg:
```python
x = np.array([[155, 70]])
distances = np.sqrt(np.sum((X_train - x)**2, axis=1))
distances
```

which gives

```
array([ 6.70820393, 21.9317122 , 31.30495168, 37.36308338, 21.        ,
       13.60147051, 25.17935662, 16.2788206 , 15.29705854])
```
In the code, axis=1 sums the element-wise squared differences across the features of each instance, giving one result per row; axis=0 would instead sum down each column, across instances; with no axis specified, all elements are summed into a single value.
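As a quick check of these axis semantics on a toy 2x2 array (the values here are only for illustration):

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
print(np.sum(a, axis=1))  # [3 7]  -> one sum per row (instance)
print(np.sum(a, axis=0))  # [4 6]  -> one sum per column (feature)
print(np.sum(a))          # 10     -> single sum over all elements
```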
```python
nearest_neighbor_indices = distances.argsort()[:3]
nearest_neighbor_genders = np.take(y_train, nearest_neighbor_indices)
print(nearest_neighbor_indices)
nearest_neighbor_genders
```

which outputs

```
[0 5 8]
array(['male', 'female', 'female'], dtype='<U6')
```
The dtype '<U6' denotes a little-endian Unicode string of at most 6 characters (the length of the longest label, 'female'). In the code, argsort()[:3] takes the indices of the three smallest distances. Of the three nearest (by Euclidean distance), two are female and one is male.
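For example, numpy infers the string dtype from the longest label, and argsort() returns indices ordered by ascending value (the distances here are rounded toy values):

```python
import numpy as np

labels = np.array(['male', 'female'])
print(labels.dtype)        # <U6: little-endian Unicode, at most 6 characters

d = np.array([6.7, 21.9, 31.3, 13.6, 15.3])
print(d.argsort())         # [0 3 4 1 2] -> indices from smallest to largest value
print(d.argsort()[:3])     # [0 3 4] -> indices of the three smallest distances
```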
```python
plt.figure()
plt.title('Human Heights and Weights by Sex')
plt.xlabel('Height in cm')
plt.ylabel('Weight in kg')
for i, x in enumerate(X_train):
    plt.scatter(x[0], x[1], c='k', marker='x' if y_train[i] == 'male' else 'D')
# enlarge the three nearest neighbors (indices 0, 5, 8) and mark the query point
plt.scatter(158, 64, s=200, c='k', marker='x')
plt.scatter(163, 59, s=200, c='k', marker='D')
plt.scatter(170, 67, s=200, c='k', marker='D')
plt.scatter(155, 70, s=200, c='k', marker='o')
plt.grid(True)
plt.show()
```

Running the program above gives
*(Figure: the three nearest neighbors)*
```python
from collections import Counter

b = Counter(np.take(y_train, distances.argsort()[:3]))
b.most_common(1)[0][0]
```

which gives 'female', so the output is female. The argument in most_common(1) asks for the single most frequent element; it returns a list of (element, count) pairs, and [0][0] extracts the element itself from the first pair, here 'female'.
∎ Binary classification example (using sklearn)
sklearn provides LabelBinarizer, which converts binary text labels into binary values; after the computation, inverse_transform() can convert them back into the original text labels. It also provides KNeighborsClassifier() for kNN classification, as in the following code:
```python
from sklearn.preprocessing import LabelBinarizer
from sklearn.neighbors import KNeighborsClassifier

lb = LabelBinarizer()
y_train_binarized = lb.fit_transform(y_train)
y_train_binarized
```

which displays

```
array([[1],
       [1],
       [1],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0]])
```
```python
K = 3
clf = KNeighborsClassifier(n_neighbors=K)
clf.fit(X_train, y_train_binarized.reshape(-1))
prediction_binarized = clf.predict(np.array([155, 70]).reshape(1, -1))[0]
predicted_label = lb.inverse_transform(prediction_binarized)
predicted_label
```

which gives `array(['female'], dtype='<U6')`.
Suppose we have a set of test data and feed it in for testing:
```python
X_test = np.array([
    [168, 65],
    [180, 96],
    [160, 52],
    [169, 67]
])
y_test = ['male', 'male', 'female', 'female']
y_test_binarized = lb.transform(y_test)
print('Binarized labels: %s' % y_test_binarized.T[0])
predictions_binarized = clf.predict(X_test)
print('Binarized predictions: %s' % predictions_binarized)
print('Predicted labels: %s' % lb.inverse_transform(predictions_binarized))
```

which gives

```
Binarized labels: [1 1 0 0]
Binarized predictions: [0 1 0 0]
Predicted labels: ['female' 'male' 'female' 'female']
```
∎ Validating binary classification accuracy
For binary classification with labels 1 (positive) and 0 (negative), let $n_{10}$ be the number of times a 1 is detected as a 0; $n_{11}$, $n_{00}$, $n_{01}$ are defined similarly. Define $n_{1*}=n_{10}+n_{11}$, $n_{0*}=n_{00}+n_{01}$, $n_{*1}=n_{01}+n_{11}$, $n_{*0}=n_{00}+n_{10}$, and let $n$ be the total count. Then define:
- Accuracy: $ACC =\frac{n_{11}+n_{00}}{n}$
- Precision: $P =\frac{n_{11}}{n_{*1}}=\frac{n_{11}}{n_{11}+n_{01}}$
- Recall: $R =\frac{n_{11}}{n_{1*}}=\frac{n_{11}}{n_{11}+n_{10}}$
- F1-Score: $F1 =2\frac{P\cdot R}{P+R}$
- Matthews correlation coefficient ($\phi $ coefficient): $\phi =\frac{n_{11}n_{00}-n_{10}n_{01}}{\sqrt{n_{1*}n_{0*}n_{*1}n_{*0}}}$
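To tie these formulas to the test results above (labels [1 1 0 0], predictions [0 1 0 0], so $n_{11}=1$, $n_{00}=2$, $n_{10}=1$, $n_{01}=0$), here is a direct hand computation; the sklearn calls below reproduce the same values:

```python
import numpy as np

y_true = np.array([1, 1, 0, 0])   # binarized test labels
y_pred = np.array([0, 1, 0, 0])   # binarized predictions

n11 = np.sum((y_true == 1) & (y_pred == 1))  # true positives  = 1
n00 = np.sum((y_true == 0) & (y_pred == 0))  # true negatives  = 2
n10 = np.sum((y_true == 1) & (y_pred == 0))  # false negatives = 1
n01 = np.sum((y_true == 0) & (y_pred == 1))  # false positives = 0
n = len(y_true)

acc = (n11 + n00) / n                                # 0.75
precision = n11 / (n11 + n01)                        # 1.0
recall = n11 / (n11 + n10)                           # 0.5
f1 = 2 * precision * recall / (precision + recall)   # 0.666...
mcc = (n11 * n00 - n10 * n01) / np.sqrt(
    (n10 + n11) * (n00 + n01) * (n01 + n11) * (n00 + n10))  # 0.577...
print(acc, precision, recall, f1, mcc)
```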
```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, matthews_corrcoef)

print('Accuracy: %s' % accuracy_score(y_test_binarized, predictions_binarized))
print('Precision: %s' % precision_score(y_test_binarized, predictions_binarized))
print('Recall: %s' % recall_score(y_test_binarized, predictions_binarized))
print('f1_score: %s' % f1_score(y_test_binarized, predictions_binarized))
print('Matthews correlation coefficient: %s'
      % matthews_corrcoef(y_test_binarized, predictions_binarized))
```

which gives

```
Accuracy: 0.75
Precision: 1.0
Recall: 0.5
f1_score: 0.6666666666666666
Matthews correlation coefficient: 0.5773502691896258
```
```python
from sklearn.metrics import classification_report
print(classification_report(y_test_binarized, predictions_binarized,
                            target_names=['male'], labels=[1]))
```

which gives

```
             precision    recall  f1-score   support

       male       1.00      0.50      0.67         2

avg / total       1.00      0.50      0.67         2
```
kNN regression with sklearn
∎ Application: predict a real value from given features, for example predicting weight from height and sex. Commonly used prediction-error measures include:
- Mean Absolute Error (MAE): $MAE=\frac{1}{n}\sum_{i=0}^{n-1}\left | y_i-\hat{y}_i \right |$
- Mean Squared Error (MSE): $MSE=\frac{1}{n}\sum_{i=0}^{n-1}(y_i-\hat{y}_i)^2$
- R-squared ($R^2$) score: $R^2=1-\frac{\sum_{i=0}^{n-1}(y_i-\hat{y}_i)^2}{\sum_{i=0}^{n-1}(y_i-\bar{y})^2}$; the better the model, the closer $R^2$ is to 1.
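As a quick illustration of these three measures, a sketch with made-up targets and predictions (the arrays y and y_hat are assumptions, not data from the text):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y = np.array([60.0, 70.0, 80.0])      # true targets (made-up)
y_hat = np.array([62.0, 69.0, 77.0])  # predictions (made-up)

# direct arithmetic from the formulas above
mae = np.mean(np.abs(y - y_hat))             # (2 + 1 + 3) / 3 = 2.0
mse = np.mean((y - y_hat) ** 2)              # (4 + 1 + 9) / 3 = 4.666...
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2)  # 0.93

# the sklearn.metrics equivalents agree
print(mae, mean_absolute_error(y, y_hat))
print(mse, mean_squared_error(y, y_hat))
print(r2, r2_score(y, y_hat))
```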
- In this example, height and sex are the features; height may range over 155-175 while sex is 0 or 1, so height dominates the whole feature vector.
- StandardScaler: transforms each feature of the training set to have mean 0 and variance 1, which is called standardization (a sketch follows below).
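Putting the two bullets together, a minimal sketch of kNN regression with standardization; the training set here (features are height in cm and sex encoded as 0/1, target is weight in kg) is made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor

# features: [height in cm, sex (1 = male, 0 = female)]; targets: weight in kg (made-up)
X_train_reg = np.array([[158, 1], [170, 1], [183, 1], [155, 0], [163, 0], [180, 0]])
y_train_reg = np.array([64, 86, 84, 49, 59, 67])

# standardize so the large-range height feature does not dominate the 0/1 sex feature
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_reg)  # each column: mean 0, variance 1

clf = KNeighborsRegressor(n_neighbors=3)
clf.fit(X_train_scaled, y_train_reg)

X_test_reg = np.array([[168, 0]])
X_test_scaled = scaler.transform(X_test_reg)  # reuse the training-set mean and variance
print(clf.predict(X_test_scaled))             # predicted weight in kg
```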
References:
Gavin Hackeling, *Mastering Machine Learning with scikit-learn*, 2nd ed., Packt Publishing, 2017.