Function

From scikit-learn: This is the loss function used in (multinomial) logistic regression and extensions of it such as neural networks, defined as the negative log-likelihood of the true labels given a probabilistic classifier’s predictions. The log loss is only defined for two or more labels. For a single sample with true label yt in {0,1} and estimated probability yp that yt = 1, the log loss is $$-\log P(yt|yp) = -(yt \log(yp) + (1 - yt) \log(1 - yp))$$

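However, the actual value of ‘is_duplicate’ is always either 1 or 0, so only one of the two terms is non-zero for any given sample, and the function reduces to $$-\log P(yt|yp) = \begin{cases} -\log(yp) & \text{if } yt = 1 \\ -\log(1 - yp) & \text{if } yt = 0 \end{cases}$$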
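As a small worked example of the per-sample formula, here is a minimal sketch in plain Python. The function name `sample_log_loss` and the probability 0.9 are only illustrative assumptions, not part of the dataset or of scikit-learn.

```python
import math

def sample_log_loss(y_true, y_prob):
    """Log loss of one sample; y_prob is the predicted probability that y_true == 1."""
    return -(y_true * math.log(y_prob) + (1 - y_true) * math.log(1 - y_prob))

# Predicting 0.9 when the true label is 1: only the log(yp) term contributes.
loss_correct = sample_log_loss(1, 0.9)
# Predicting 0.9 when the true label is 0: only the log(1 - yp) term contributes.
loss_wrong = sample_log_loss(0, 0.9)

print(loss_correct)  # about 0.105
print(loss_wrong)    # about 2.303
```

The confident but wrong prediction costs roughly twenty times more than the confident and correct one, which is why log loss punishes overconfidence so strongly.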

Source data

Because of copyright restrictions, I can only share part of the data. The original dataset contains 6 columns: id, qid1, qid2, question1, question2, is_duplicate, and is used to train a model that decides whether two questions are duplicates.

| id | qid1 | qid2 | question1 | question2 | is_duplicate |
|----|------|------|-----------|-----------|--------------|
| 0 | 1 | 2 | What is the step by step… | What is the step by step guide to invest in sh… | 0 |
| 1 | 3 | 4 | What is the story of… | What would happen if the Indian government sto… | 0 |
| 2 | 5 | 6 | How can I increase the speed… | How can Internet speed be increased by hacking… | 0 |

However, we don’t need the actual question text in order to learn the LogLoss function. As a result, I deleted ‘question1’ and ‘question2’ and released this edition: {% asset_link train.csv trainingData %}

| id | qid1 | qid2 | is_duplicate |
|----|------|------|--------------|
| 0 | 1 | 2 | 0 |
| 1 | 3 | 4 | 0 |
| 2 | 5 | 6 | 0 |
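For reference, the column removal described above could be done with pandas roughly as follows. This is only a sketch: the DataFrame `df_full` is rebuilt by hand from the three sample rows (with the question text truncated exactly as shown), and the variable names are my own assumptions.

```python
import pandas as pd

# The three sample rows from the original table; question text is truncated in the source.
df_full = pd.DataFrame({
    "id": [0, 1, 2],
    "qid1": [1, 3, 5],
    "qid2": [2, 4, 6],
    "question1": [
        "What is the step by step…",
        "What is the story of…",
        "How can I increase the speed…",
    ],
    "question2": [
        "What is the step by step guide to invest in sh…",
        "What would happen if the Indian government sto…",
        "How can Internet speed be increased by hacking…",
    ],
    "is_duplicate": [0, 0, 0],
})

# Drop the two text columns; only the ids and the label remain.
df_reduced = df_full.drop(columns=["question1", "question2"])

# index=False keeps the pandas row index out of the CSV output.
print(df_reduced.to_csv(index=False))
```

Passing a file path to `to_csv` instead of printing the result would write the reduced table to disk.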

Code

  1. Self-Defined LogLoss
import pandas as pd
import numpy as np
path = 'path/to/train.csv'
df_train = pd.read_csv(path)
df_train.head()

def logloss(actual, predict):
    # Clip the prediction into [1e-15, 1 - 1e-15] so that log(0) can never occur
    predict = min(max(predict, 1e-15), 1 - 1e-15)

    # Log-likelihood of the true label (np.log is the natural logarithm, base e)
    if actual == 1:
        return np.log(predict)
    else:  # actual == 0
        return np.log(1 - predict)

# Baseline: assume the model predicts the mean of is_duplicate for every question pair
predict = np.average(df_train['is_duplicate'])
print(predict)
# logloss() returns the log-likelihood, so negate it to obtain the loss
LoglossList = [-logloss(actual, predict) for actual in df_train['is_duplicate']]
LoglossScore = np.average(LoglossList)
print(LoglossScore)
  2. Use sklearn
from sklearn.metrics import log_loss

p = np.average(df_train['is_duplicate'])    # constant prediction: the mean of is_duplicate
print(p)
# Build an array holding the same prediction p for every row
LoglossScore = log_loss(df_train['is_duplicate'], np.full(len(df_train), p))
print(LoglossScore)
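The self-defined version loops over the rows one by one. A vectorized variant with NumPy might look like the sketch below. The function name `log_loss_vectorized` and the eight sample labels are hypothetical stand-ins for `df_train['is_duplicate']`, chosen so the example runs without the CSV file.

```python
import numpy as np

def log_loss_vectorized(actual, predicted, eps=1e-15):
    """Mean binary log loss; predictions are clipped to [eps, 1 - eps] to avoid log(0)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.clip(np.asarray(predicted, dtype=float), eps, 1 - eps)
    per_sample = -(actual * np.log(predicted) + (1 - actual) * np.log(1 - predicted))
    return float(np.mean(per_sample))

# Hypothetical labels standing in for df_train['is_duplicate']
labels = np.array([0, 0, 1, 0, 1, 0, 0, 1])

# Same baseline as above: predict the mean label for every pair
p = labels.mean()
baseline = log_loss_vectorized(labels, np.full(labels.shape, p))

# For a constant prediction equal to the mean, the log loss equals the binary entropy of p
entropy = -(p * np.log(p) + (1 - p) * np.log(1 - p))

print(p)
print(baseline)
print(entropy)
```

The last two printed numbers agree up to floating-point error, which gives a quick sanity check for the baseline score: any useful model should score below the entropy of the label distribution.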