Although logistic regression has "regression" in its name, it is used for classification. The response is modeled as a linear combination of all the features, $z=\mathbf{w}^\top\mathbf{x}+b$, which is then passed through the logistic function, $P(y{=}1\mid\mathbf{x})=f(z)$. For any input, this function outputs a value between 0 and 1, so different output values can be mapped to different class decisions.
logistic function
The logistic function, also called the sigmoid function (S-shaped function), is defined as $f(x)=\frac{1}{1+e^{-x}}$.
There is also a related function called the logit function, defined as
$g(p)=\ln \frac{p}{1-p}$.
Because "logistic" and "logit" look so similar, calling the former the sigmoid function makes them easier to tell apart. The domain of the sigmoid is the real line ($-\infty\sim\infty$), while the domain of the logit is a probability ($0\sim 1$). Moreover,
$g(f(x))=\ln \frac{\frac{1}{1+e^{-x}}}{1-\frac{1}{1+e^{-x}}}=x$,
so $f(x)$ and $g(p)$ are inverses of each other. The sigmoid function plays an important role in machine learning and deep learning. Plotting the sigmoid over $-6\sim 6$:
```python
import matplotlib.pyplot as plt
import numpy as np

# Evaluate the sigmoid on a fine grid over [-6, 6]
x = np.arange(-6, 6, 0.01)
y = 1 / (1 + np.exp(-x))
plt.grid(True)
plt.plot(x, y)
plt.show()
```
[Figure: sigmoid function]
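As a quick numerical check of the inverse relationship above, a minimal NumPy sketch (the helper names sigmoid and logit are our own):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def logit(p):
    return np.log(p / (1 - p))

# g(f(x)) should recover x for any real x
x = np.linspace(-5, 5, 11)
print(np.allclose(logit(sigmoid(x)), x))  # True
```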
Spam filtering application
We use the SMS Spam Collection Data Set, which can be downloaded from http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection. The download is a smsspamcollection.zip file; unzipping it yields an SMSSpamCollection file, which should be placed in the notebook's working directory. Each row of the dataset is one SMS with two fields: the first is the label, indicating spam or ham (non-spam), and the second is the message content.
```python
import pandas as pd

# Tab-separated file with no header row: column 0 is the label, column 1 the message
df = pd.read_csv('SMSSpamCollection', delimiter='\t', header=None)
print(df.head())
```

```
      0                                                  1
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...
```
This lists the first five SMS messages. The program uses pandas, Python's data-handling package; delimiter='\t' specifies the tab character that separates the two fields in each row.
```python
print('Number of spam messages: %s' % df[df[0] == 'spam'][0].count())
print('Number of ham messages: %s' % df[df[0] == 'ham'][0].count())
```

```
Number of spam messages: 747
Number of ham messages: 4825
```
So 747 messages are spam and 4825 are ham. Next, TfidfVectorizer() is used to convert each message into a vector, and LogisticRegression()'s fit and predict methods train the model and make predictions.
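Before the full pipeline, a small sketch of what TfidfVectorizer produces, on a hypothetical two-message corpus (get_feature_names_out requires scikit-learn 1.0+):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical two-message corpus, purely for illustration
corpus = ['free prize call now', 'are we meeting for lunch']
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)       # sparse matrix, one row per message
print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(X.toarray().round(2))                # TF-IDF weight of each term
```

The full spam-filtering pipeline is then: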
```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X = df[1].values  # messages
y = df[0].values  # labels ('spam' / 'ham')
# Default split: 75% training, 25% test
X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y)
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)
X_test = vectorizer.transform(X_test_raw)
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)
for i, prediction in enumerate(predictions[:5]):
    print('Predicted: %s, message: %s' % (prediction, X_test_raw[i]))
```

```
Predicted: ham, message: How do you guys go to see movies on your side.
Predicted: ham, message: Huh? 6 also cannot? Then only how many mistakes?
Predicted: ham, message: Just got part Nottingham - 3 hrs 63miles. Good thing i love my man so much, but only doing 40mph. Hey ho
Predicted: ham, message: Yar he quite clever but aft many guesses lor. He got ask me 2 bring but i thk darren not so willing 2 go. Aiya they thk leona still not attach wat.
Predicted: spam, message: RT-KIng Pro Video Club>> Need help? info@ringtoneking.co.uk or call 08701237397 You must be 16+ Club credits redeemable at www.ringtoneking.co.uk! Enjoy!
```
The program above calls train_test_split(), which by default randomly assigns 75% of the samples to the training set and the remaining 25% to the test set (see the sketch after this paragraph); after training, the test set is used to evaluate the system's performance. Of the five messages shown, one is predicted spam and the rest ham. But how should the classifier's performance be evaluated?
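A minimal sketch of the default split, using hypothetical toy arrays just to show the resulting sizes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 toy samples, 2 features each
y = np.arange(10)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print(X_train.shape, X_test.shape)  # (7, 2) (3, 2): 75% train, 25% test
```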
Binary classification performance metrics
∎ accuracy, precision, recall, F1-score
These metrics were already discussed in unit 04, kNN classification and regression; here we examine them from another angle. In binary classification, the class of interest is designated 1 (the positive class) and the other class 0 (the negative class). In spam filtering, the class of interest is naturally spam, so spam is assigned 1 (positive) and ham 0 (negative), as shown in the figure.
[Figure: spam filtering — of $n$ messages, $n_s$ are spam and $n_h$ are ham; $x_4$ = spam predicted spam (TP), $x_3$ = ham predicted spam (FP), $x_2$ = spam predicted ham (FN), $x_1$ = ham predicted ham (TN)]
◆ accuracy measures correctness: the fraction of decisions that are right, i.e. $(x_1+x_4)/n$. This metric becomes much less useful when the data are extremely imbalanced, because a classifier can score well simply by always predicting the majority class.
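A small illustration with hypothetical labels: on a 95:5 ham/spam mix, a degenerate classifier that always answers ham still scores 95% accuracy:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical 95:5 imbalance; a "classifier" that always predicts ham
y_true = np.array(['ham'] * 95 + ['spam'] * 5)
y_pred = np.array(['ham'] * 100)
print(accuracy_score(y_true, y_pred))  # 0.95, yet it catches zero spam
```

Returning to the spam data, the accuracy of the logistic-regression classifier can be estimated by cross-validation: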
```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

# Preprocessed CSV with 'label' and 'message' columns
df = pd.read_csv('sms.csv')
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    df['message'], df['label'], random_state=11)
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)
X_test = vectorizer.transform(X_test_raw)
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
scores = cross_val_score(classifier, X_train, y_train, cv=5)
print('Accuracies: %s' % scores)
print('Mean accuracy: %s' % np.mean(scores))
```

```
Accuracies: [0.95221027 0.95454545 0.96172249 0.96052632 0.95209581]
Mean accuracy: 0.9562200683094717
```
The program calls cross_val_score(), which here computes the accuracy on each fold. The cv parameter specifies the cross-validation generator, i.e. how many folds to validate over; the default was 3 in older scikit-learn versions (5 since version 0.22).
◆ precision measures how trustworthy a positive call is: of all samples judged positive, the fraction that are truly positive, i.e. $x_4/(x_4+x_3)$.
◆ recall measures whether the samples of interest are correctly identified: the positive→positive rate $x_4/n_s$, called the detection probability in detection theory. There is also the negative→positive rate $x_3/n_h$, called the false alarm rate, later referred to as the false positive rate. Higher recall is better and fewer false alarms are better, but lowering the decision threshold raises recall while also increasing false alarms, so the two are conflicting objectives.
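A small sketch of this trade-off with hypothetical scores and labels: lowering the threshold from 0.5 to 0.3 raises recall but lowers precision:

```python
import numpy as np

# Hypothetical scores (e.g., predicted probabilities) and true labels
scores = np.array([0.1, 0.3, 0.45, 0.6, 0.8, 0.9])
labels = np.array([0, 0, 1, 0, 1, 1])

for threshold in (0.5, 0.3):
    pred = (scores >= threshold).astype(int)
    tp = np.sum((pred == 1) & (labels == 1))
    fp = np.sum((pred == 1) & (labels == 0))
    fn = np.sum((pred == 0) & (labels == 1))
    print('threshold=%.1f  precision=%.2f  recall=%.2f'
          % (threshold, tp / (tp + fp), tp / (tp + fn)))
# threshold=0.5  precision=0.67  recall=0.67
# threshold=0.3  precision=0.60  recall=1.00
```

Cross-validated precision and recall for the spam classifier: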
```python
# Note: scoring='precision'/'recall' assumes binary numeric labels (1 = spam);
# with string labels, use make_scorer(precision_score, pos_label='spam') instead
precisions = cross_val_score(classifier, X_train, y_train, cv=5, scoring='precision')
print('Precision: %s' % np.mean(precisions))
recalls = cross_val_score(classifier, X_train, y_train, cv=5, scoring='recall')
print('Recall: %s' % np.mean(recalls))
```

```
Precision: 0.992542742398164
Recall: 0.6836050302748021
```
◆ F1-score, defined as
$\mbox{F1}=2\,\frac{\mbox{precision}\cdot \mbox{recall}}{\mbox{precision}+\mbox{recall}}$,
balances the model's precision and recall: it is their harmonic mean, with a maximum of 1 and a minimum of 0. As a sanity check, plugging in the means above gives $2\cdot(0.9925\cdot 0.6836)/(0.9925+0.6836)\approx 0.810$, consistent with the cross-validated F1 below.
```python
f1 = cross_val_score(classifier, X_train, y_train, cv=5, scoring='f1')
print('F1 score: %s' % np.mean(f1))
```

```
F1 score: 0.8090678466269784
```
∎ ROC AUC score
The Receiver Operating Characteristic (ROC) curve plots the classifier's true positive rate against its false positive rate as the decision threshold varies; the area under this curve is called the Area Under the Curve (AUC). An AUC of 1.0 means perfect separation, while 0.5 corresponds to random guessing. As shown below.
[Figure: ROC curve, with AUC the area under it]
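A minimal sketch for plotting the ROC curve, assuming classifier, X_test, and y_test from the first spam-filtering pipeline above (string labels 'ham'/'spam'):

```python
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

# Scores for the positive class; classes_ is sorted, so column 1 is 'spam'
probas = classifier.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, probas, pos_label='spam')
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, label='ROC curve (AUC = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], 'k--')  # chance-level diagonal
plt.xlabel('False positive rate')
plt.ylabel('True positive rate (recall)')
plt.legend(loc='lower right')
plt.show()
```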
∎ Confusion matrix
The confusion matrix presents the results in matrix form: rows are the true classes and columns are the predicted classes, i.e. the counts $x_1, x_2, x_3, x_4$ above displayed directly as a matrix. Graphical displays are also available and are especially suited to multi-class problems. In text form the matrix is
$\begin{bmatrix}
x_4 & x_3\\
x_2 & x_1
\end{bmatrix}$.
Note that scikit-learn sorts the labels, so for 0/1 labels confusion_matrix() returns $\begin{bmatrix}\mbox{TN} & \mbox{FP}\\ \mbox{FN} & \mbox{TP}\end{bmatrix}$. For example:
```python
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt

y_test = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 0, 0, 0, 0, 0, 1, 1, 1]
cm = confusion_matrix(y_test, y_pred)  # avoid shadowing the imported function
print(cm)
plt.matshow(cm)  # render the matrix as a heat map
plt.title('Confusion matrix')
plt.colorbar()
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()
```

```
[[4 1]
 [2 3]]
```
[Figure: confusion matrix displayed as a heat map]
Multi-class classification
Rotten Tomatoes dataset: each phrase is classified as negative (0), somewhat negative (1), neutral (2), somewhat positive (3), or positive (4). The dataset can be downloaded from http://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/data and contains 156060 instances.
```python
import pandas as pd

df = pd.read_csv('train.tsv', header=0, delimiter='\t')
print(df.count())
```

```
PhraseId      156060
SentenceId    156060
Phrase        156060
Sentiment     156060
dtype: int64
```
There are four columns; header=0 indicates that the column titles are in row 0 of the file.
```python
print(df.head())
```

```
   PhraseId  SentenceId                                             Phrase  Sentiment
0         1           1  A series of escapades demonstrating the adage ...          1
1         2           1  A series of escapades demonstrating the adage ...          2
2         3           1                                           A series          2
3         4           1                                                  A          2
4         5           1                                             series          2
```

df.head() previews the first rows of the DataFrame.
```python
print(df['Phrase'].head(5))
```

```
0    A series of escapades demonstrating the adage ...
1    A series of escapades demonstrating the adage ...
2                                             A series
3                                                    A
4                                               series
Name: Phrase, dtype: object
```
This lists the first five phrases.
```python
print(df['Sentiment'].describe())
```

```
count    156060.000000
mean          2.063578
std           0.893832
min           0.000000
25%           2.000000
50%           2.000000
75%           3.000000
max           4.000000
Name: Sentiment, dtype: float64
```
This prints summary statistics of the Sentiment labels.
```python
print(df['Sentiment'].value_counts())
```

```
2    79582
3    32927
1    27273
4     9206
0     7072
Name: Sentiment, dtype: int64
```
This lists the count of each class.
```python
print(df['Sentiment'].value_counts() / df['Sentiment'].count())
```

```
2    0.509945
3    0.210989
1    0.174760
4    0.058990
0    0.045316
Name: Sentiment, dtype: float64
```
This lists the proportion of each class. Note that class 2 (neutral) alone accounts for about 51% of the samples, so always predicting 2 would already achieve about 51% accuracy; a useful model must beat that baseline (see the sketch after the results below). The following program trains the classifier and prints the results.
```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

df = pd.read_csv('train.tsv', header=0, delimiter='\t')
X, y = df['Phrase'], df['Sentiment'].values
# Default split: 75% training, 25% test
X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y)
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)
X_test = vectorizer.transform(X_test_raw)
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)
print('Accuracy: %s' % accuracy_score(y_test, predictions))
print('Confusion Matrix:')
print(confusion_matrix(y_test, predictions))
print('Classification Report:')
print(classification_report(y_test, predictions))
```
```
Accuracy: 0.6265538895296681
Confusion Matrix:
[[  271   901   557    52     2]
 [  156  2231  4100   289    10]
 [   35   956 17917  1048    29]
 [    3   210  4148  3518   234]
 [    0    36   485  1319   508]]
Classification Report:
             precision    recall  f1-score   support

          0       0.58      0.15      0.24      1783
          1       0.51      0.33      0.40      6786
          2       0.66      0.90      0.76     19985
          3       0.57      0.43      0.49      8113
          4       0.65      0.22      0.32      2348

avg / total       0.61      0.63      0.59     39015
```
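For comparison, a majority-class baseline sketch, assuming y_test from the snippet above; it simply predicts the most frequent class (2, neutral) everywhere:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Majority-class baseline: always predict 2 (neutral)
baseline = np.full_like(y_test, 2)
print('Baseline accuracy: %s' % accuracy_score(y_test, baseline))  # roughly 0.51
```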
References
Gavin Hackeling, Mastering Machine Learning with scikit-learn, 2nd Edition, Packt Publishing, 2017.