2018/12/1

07 Logistic regression

Although logistic regression has "regression" in its name, it is used for classification. The response variable is modeled as a linear combination of all the features and then passed through the logistic function, whose output lies between 0 and 1 for any input, so different output values can be mapped to different class decisions.

logistic function

The logistic function, also called the sigmoid function (the S-shaped function), is defined as
$f(x)=\frac{1}{1+e^{-x}}$
There is also a related function called the logit function, defined as
$g(p)=\ln \frac{p}{1-p}$
Because "logistic" and "logit" look so similar, calling the former the sigmoid function makes the two easier to distinguish. The domain of the sigmoid is the real line ($-\infty\sim \infty$), while the domain of the logit is a probability ($0\sim 1$). Moreover,
$g(f(x))=\ln \frac{\frac{1}{1+e^{-x}}}{1-\frac{1}{1+e^{-x}}}=x$,
so $f(x)$ and $g(p)$ are inverses of each other. The sigmoid function plays a very important role in machine learning and deep learning. The following code plots the sigmoid for values between $-6\sim 6$:
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(-6, 6, 0.01)   # inputs from -6 to 6 in steps of 0.01
y = 1 / (1 + np.exp(-x))     # the sigmoid function
plt.grid(True)
plt.plot(x, y)
plt.show()
(Figure: the sigmoid function)
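As a quick sanity check, here is a minimal sketch verifying numerically that the logit undoes the sigmoid:

import numpy as np

x = np.linspace(-6, 6, 13)
p = 1 / (1 + np.exp(-x))       # sigmoid: maps the real line into (0, 1)
x_back = np.log(p / (1 - p))   # logit: maps (0, 1) back to the real line
print(np.allclose(x, x_back))  # True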


Spam filtering application

We use the SMS Spam Collection Data Set, which can be downloaded from http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
The download is a smsspamcollection.zip file; unzipping it yields a file named SMSSpamCollection, which should be placed in the notebook's working directory. Each row of the dataset is one SMS message with two fields: the first is the label, marking the message as spam or ham (not spam), and the second is the message text.
import pandas as pd
df = pd.read_csv('SMSSpamCollection', delimiter='\t', header=None)
print(df.head())
     0                                                  1
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...
This lists the first five SMS messages. The program uses pandas, Python's data-analysis library; delimiter='\t' tells read_csv that the two fields in each row are separated by a tab.
print('Number of spam messages: %s' % df[df[0] == 'spam'][0].count())
print('Number of ham messages: %s' % df[df[0] == 'ham'][0].count())
Number of spam messages: 747
Number of ham messages: 4825
So 747 messages are spam and 4825 are ham. Next we call TfidfVectorizer() to convert the messages into vectors, then train and predict with the fit and predict methods of LogisticRegression().
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
X = df[1].values
y = df[0].values
X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y)
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)
X_test = vectorizer.transform(X_test_raw)
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)
for i, prediction in enumerate(predictions[:5]):
    print('Predicted: %s, message: %s' % (prediction, X_test_raw[i]))
Predicted: ham, message: How do you guys go to see movies on your side.
Predicted: ham, message: Huh? 6 also cannot? Then only how many mistakes?
Predicted: ham, message: Just got part Nottingham - 3 hrs 63miles. Good thing i love my man so much, but only doing 40mph. Hey ho
Predicted: ham, message: Yar he quite clever but aft many guesses lor. He got ask me 2 bring but i thk darren not so willing 2 go. Aiya they thk leona still not attach wat.
Predicted: spam, message: RT-KIng Pro Video Club>> Need help? info@ringtoneking.co.uk or call 08701237397 You must be 16+ Club credits redeemable at www.ringtoneking.co.uk! Enjoy!

The call to train_test_split() in the program above by default randomly assigns 75% of the data samples to the training set and the remaining 25% to the test set, which is used after training to evaluate the system's performance. Of the five messages shown above, one is spam and the others are ham. But how should the classifier's performance be evaluated?
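A minimal sketch of controlling the split proportion (reusing the df loaded earlier; the test_size value here is just an illustration):

# Default split is 75/25; test_size can change the proportion
from sklearn.model_selection import train_test_split
X_tr, X_te, y_tr, y_te = train_test_split(df[1].values, df[0].values)
print(len(X_tr), len(X_te))   # about 4179 and 1393 of the 5572 messages
X_tr, X_te, y_tr, y_te = train_test_split(df[1].values, df[0].values, test_size=0.2)
print(len(X_tr), len(X_te))   # about 4457 and 1115, an 80/20 split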

Performance metrics for binary prediction

∎ accuracy, precision, recall, F1-score
We already discussed these in unit 04 (kNN classification and regression); here we examine them again from another angle. In binary classification the class of interest is assigned label 1 (the case) and the other class label 0 (the non-case). In spam filtering, for example, spam is the class of interest, so it is assigned 1 (positive) and ham 0 (negative), as shown in the figure below.

(Figure: spam filtering decision outcomes)
In the figure, $x_1$ counts spam judged as spam, $x_2$ spam judged as ham, $x_3$ ham judged as spam, and $x_4$ ham judged as ham. The number of spam samples is therefore $n_s=x_1+x_2$, the number of ham samples is $n_h=x_3+x_4$, and the total sample count is $n=n_s+n_h$.
◆ accuracy is the proportion of decisions that are correct, i.e. $(x_1+x_4)/n$. This metric becomes much less useful when the data are extremely imbalanced, because a classifier can always predict the majority class and still obtain a good score.
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
df = pd.read_csv('sms.csv')
X_train_raw, X_test_raw, y_train, y_test = train_test_split(df['message'],
    df['label'], random_state=11)
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)
X_test = vectorizer.transform(X_test_raw)
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
scores = cross_val_score(classifier, X_train, y_train, cv=5)
print('Accuracies: %s' % scores)
print('Mean accuracy: %s' % np.mean(scores))
Accuracies: [0.95221027 0.95454545 0.96172249 0.96052632 0.95209581]
Mean accuracy: 0.9562200683094717

The call to cross_val_score() above computes the accuracy; the cv parameter specifies the cross-validation splitting, i.e. how many folds to validate over (its default is 3 in the scikit-learn version used here).
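To make the imbalance caveat above concrete, a minimal sketch using scikit-learn's DummyClassifier (not part of the original code) shows the majority-class baseline: always answering ham already scores about 4825/5572 ≈ 0.87, so the mean accuracy of 0.956 should be judged against that.

from sklearn.dummy import DummyClassifier

baseline = DummyClassifier(strategy='most_frequent')  # always predicts the majority class
baseline.fit(X_train, y_train)
print(baseline.score(X_test, y_test))  # about 0.87, yet it never catches a single spam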

◆ precision measures how trustworthy a positive call is: of the samples the classifier labels as spam, the fraction that really are spam, i.e. $x_1/(x_1+x_3)$.
◆ recall, defined as the positive→positive rate $x_1/n_s=x_1/(x_1+x_2)$, is the fraction of truly positive samples that the classifier catches; in detection theory this is called the detection probability. The companion negative→positive rate, $x_3/n_h$, is called the false alarm rate (later, the false positive rate). In disease screening a truly sick patient is either diagnosed sick (sick→sick) or missed (sick→healthy); recall is the proportion of the sick that the test actually finds, i.e. those correctly recalled for treatment. A high detection rate with few false alarms is ideal, but lowering the decision threshold raises recall while also increasing false alarms, so the two are conflicting objectives.
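Before turning to scikit-learn's scorers below, a minimal sketch (with made-up counts, for illustration only) computes the three quantities directly from $x_1,\dots,x_4$:

# Hypothetical counts: x1 = spam→spam, x2 = spam→ham, x3 = ham→spam, x4 = ham→ham
x1, x2, x3, x4 = 30, 10, 5, 55
precision = x1 / (x1 + x3)      # 30/35 ≈ 0.857
recall = x1 / (x1 + x2)         # 30/40 = 0.750 (detection probability)
false_alarm = x3 / (x3 + x4)    # 5/60 ≈ 0.083 (false positive rate)
print(precision, recall, false_alarm)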
precisions = cross_val_score(classifier, X_train, y_train, cv=5, scoring='precision')
print('Precision: %s' % np.mean(precisions))
recalls = cross_val_score(classifier, X_train, y_train, cv=5, scoring='recall')
print('Recall: %s' % np.mean(recalls))
Precision: 0.992542742398164
Recall: 0.6836050302748021

◆ F1-score is defined as
$\mbox {F1}=2\,\frac{\mbox {precision}\cdot \mbox {recall}}{\mbox {precision}+\mbox {recall}}$,
which balances precision and recall: the F1 score is their harmonic mean, with a maximum of 1 and a minimum of 0.
f1 = cross_val_score(classifier, X_train, y_train, cv=5, scoring='f1')
print('F1 score: %s' % np.mean(f1))
F1 score: 0.8090678466269784
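As a rough consistency check (only approximate, since cross_val_score averages the per-fold F1 rather than combining the mean precision and recall), plugging the means printed above into the definition gives
$\mbox {F1}\approx 2\,\frac{0.9925\times 0.6836}{0.9925+0.6836}\approx 0.810$,
which agrees well with the 0.809 reported.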

∎ ROC AUC score
The Receiver Operating Characteristic (ROC) describes the operating behavior of a receiver (detector) and is drawn as a curve; the area under that curve is called the Area Under the Curve (AUC), as in the figure below.
(Figure: ROC curves with AUC 0.5 and 0.98)
The x-axis is the false positive rate and the y-axis is the recall (true positive rate). The red line represents random guessing, for which the two rates are equal; the blue curve is traced out by varying the decision threshold: when the allowed false positive rate is small the recall is also small, and as the allowed false positive rate grows, the recall can grow as well. In this figure the red random-guess line has an AUC of 0.5 while the blue curve's AUC is 0.98, so the blue classifier is far better than random.
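The imports at the top of the accuracy example already include roc_curve and auc; a sketch of drawing the ROC curve for the spam classifier trained there (assuming, as that example does, that the labels in sms.csv are 0/1 with spam = 1) could look like this:

# Sketch: ROC curve, reusing classifier, X_test, y_test from the accuracy example
probabilities = classifier.predict_proba(X_test)
false_positive_rate, recall, thresholds = roc_curve(y_test, probabilities[:, 1])
roc_auc = auc(false_positive_rate, recall)
plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rate, recall, 'b', label='AUC = %0.2f' % roc_auc)
plt.plot([0, 1], [0, 1], 'r--')   # the random-guess diagonal
plt.xlabel('False positive rate')
plt.ylabel('Recall')
plt.legend(loc='lower right')
plt.show()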

∎ Confusion matrix
The results are presented as a matrix whose rows are the true classes and whose columns are the predicted classes, i.e. the counts $x_1, x_2, x_3, x_4$ above displayed directly in matrix form; there is also a graphical display, which suits multi-class problems better. With ham = 0 and spam = 1 the textual matrix is
$\begin{bmatrix}
x_4 & x_3\\
x_2 & x_1
\end{bmatrix}$, as in
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt

y_test = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 0, 0, 0, 0, 0, 1, 1, 1]
cm = confusion_matrix(y_test, y_pred)  # named cm so the imported function is not shadowed
print(cm)
plt.matshow(cm)
plt.title('Confusion matrix')
plt.colorbar()
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()
[[4 1]
 [2 3]]
(Figure: graphical confusion matrix)

Multi-class classification

In the Rotten Tomatoes dataset, each phrase is classified as negative (0), somewhat negative (1), neutral (2), somewhat positive (3), or positive (4). The dataset can be downloaded from:
http://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/data
It contains 156060 instances.
import pandas as pd
df = pd.read_csv('train.tsv', header=0, delimiter='\t')
print(df.count())
PhraseId      156060
SentenceId    156060
Phrase        156060
Sentiment     156060
dtype: int64
There are four columns; header=0 indicates that the column titles are in row 0.
print(df.head())
PhraseId  SentenceId                                             Phrase  \
0         1           1  A series of escapades demonstrating the adage ...  
1         2           1  A series of escapades demonstrating the adage ...  
2         3           1                                           A series  
3         4           1                                                  A  
4         5           1                                             series  

   Sentiment
0          1
1          2
2          2
3          2
4          2
df.head() gives a preview of the data frame.
print(df['Phrase'].head(5))
0    A series of escapades demonstrating the adage ...
1    A series of escapades demonstrating the adage ...
2                                             A series
3                                                    A
4                                               series
Name: Phrase, dtype: object
This lists the first five entries of the Phrase column.
print(df['Sentiment'].describe())
count    156060.000000
mean          2.063578
std           0.893832
min           0.000000
25%           2.000000
50%           2.000000
75%           3.000000
max           4.000000
Name: Sentiment, dtype: float64
This shows summary statistics of the Sentiment labels.
print(df['Sentiment'].value_counts())
2    79582
3    32927
1    27273
4     9206
0     7072
Name: Sentiment, dtype: int64
This lists the total count of each class.
print(df['Sentiment'].value_counts()/df['Sentiment'].count())
2    0.509945
3    0.210989
1    0.174760
4    0.058990
0    0.045316
Name: Sentiment, dtype: float64
This lists the proportion of each class. The following program runs the prediction and prints the results.
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
df = pd.read_csv('train.tsv', header=0, delimiter='\t')
X, y = df['Phrase'], df['Sentiment'].values
X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y)
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)
X_test = vectorizer.transform(X_test_raw)
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)
print('Accuracy: %s' % accuracy_score(y_test, predictions))
print('Confusion Matrix:')
print(confusion_matrix(y_test, predictions))
print('Classification Report:')
print(classification_report(y_test, predictions))
Accuracy: 0.6265538895296681
Confusion Matrix:
[[  271   901   557    52     2]
 [  156  2231  4100   289    10]
 [   35   956 17917  1048    29]
 [    3   210  4148  3518   234]
 [    0    36   485  1319   508]]
Classification Report:
             precision    recall  f1-score   support

          0       0.58      0.15      0.24      1783
          1       0.51      0.33      0.40      6786
          2       0.66      0.90      0.76     19985
          3       0.57      0.43      0.49      8113
          4       0.65      0.22      0.32      2348

avg / total       0.61      0.63      0.59     39015
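By default in the scikit-learn version used here, LogisticRegression handles the five classes with a one-vs-rest scheme, so predict_proba exposes one probability per class. A quick sketch (reusing the classifier and X_test from above):

proba = classifier.predict_proba(X_test[:1])  # probabilities for the first test phrase
print(classifier.classes_)                    # [0 1 2 3 4]
print(proba.round(3))                         # one probability per sentiment class
print(proba.argmax(axis=1))                   # index of the most probable class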

References
Gavin Hackeling, Mastering Machine Learning with scikit-learn, 2nd ed., Packt Publishing, 2017.
