The OP mistakenly believes that the relationship between these two functions is due to the number of samples (i.e. a single one vs. all of them). However, the actual difference is simply how we select our training labels.
In the case of binary classification we may assign the labels $y = \pm 1$ or $y = 0, 1$.
As has already been stated, the logistic function $\sigma(z)$ is a good choice since it has the form of a probability, i.e. $\sigma(-z) = 1 - \sigma(z)$ and $\sigma(z) \in (0, 1)$, with $\sigma(z) \to 0$ as $z \to -\infty$ and $\sigma(z) \to 1$ as $z \to +\infty$. If we pick the labels $y = 0, 1$ we may assign
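$$\begin{aligned}
P(y=1 \mid z) &= \sigma(z) = \frac{1}{1+e^{-z}} \\
P(y=0 \mid z) &= 1-\sigma(z) = \frac{1}{1+e^{z}}
\end{aligned}$$
which can be written more compactly as $P(y \mid z) = \sigma(z)^{y}\,(1-\sigma(z))^{1-y}$.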
It is easier to maximize the log-likelihood than the likelihood itself, and maximizing the log-likelihood is the same as minimizing the negative log-likelihood. For $m$ samples $\{x_i, y_i\}$, after taking the natural logarithm and some simplification, we obtain:
$$l(z) = -\log\Big(\prod_{i=1}^{m} P(y_i \mid z_i)\Big) = -\sum_{i=1}^{m} \log P(y_i \mid z_i) = \sum_{i=1}^{m} \Big[-y_i z_i + \log\big(1+e^{z_i}\big)\Big]$$
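The "some simplification" step can be sketched for a single sample as follows. It only uses the definitions above, together with the identities $\log \sigma(z) = z - \log(1+e^{z})$ and $\log(1-\sigma(z)) = -\log(1+e^{z})$, which follow directly from $\sigma(z) = 1/(1+e^{-z})$:

$$\begin{aligned}
-\log P(y_i \mid z_i) &= -y_i \log \sigma(z_i) - (1-y_i)\log\big(1-\sigma(z_i)\big) \\
&= -y_i\big[z_i - \log(1+e^{z_i})\big] + (1-y_i)\log(1+e^{z_i}) \\
&= -y_i z_i + \log\big(1+e^{z_i}\big)
\end{aligned}$$

Summing over the $m$ samples gives the expression for $l(z)$.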
The full derivation and additional information can be found in this Jupyter notebook. On the other hand, we may instead have used the labels $y = \pm 1$. In that case we can assign
$$P(y \mid z) = \sigma(yz).$$
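The negative class gets the same probability under both conventions: $P(y=0 \mid z)$ with the $0, 1$ labels equals $P(y=-1 \mid z) = \sigma(-z)$ with the $\pm 1$ labels. Following the same steps as before, in this case we minimize the loss function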
$$L(z) = -\log\Big(\prod_{j=1}^{m} P(y_j \mid z_j)\Big) = -\sum_{j=1}^{m} \log P(y_j \mid z_j) = \sum_{j=1}^{m} \log\big(1+e^{-y_j z_j}\big)$$
The last step follows because the negative sign turns the logarithm into the logarithm of the reciprocal: $-\log \sigma(y_j z_j) = \log\big(1/\sigma(y_j z_j)\big) = \log\big(1+e^{-y_j z_j}\big)$. We should not equate the two forms symbol for symbol, since $y$ takes different values in each, but they are nevertheless equivalent:
$$-y_i z_i + \log\big(1+e^{z_i}\big) \equiv \log\big(1+e^{-y_j z_j}\big)$$
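Here $z_i = z_j = z$ is the same score, and only the label encoding differs. The positive case, $y_i = 1$ on the left-hand side and $y_j = 1$ on the right-hand side, is easy to show: $-z + \log(1+e^{z}) = \log\big(e^{-z}(1+e^{z})\big) = \log(1+e^{-z})$. In the negative case, $y_i = 0$ on the left-hand side and $y_j = -1$ on the right-hand side, and both sides reduce to $\log(1+e^{z})$.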
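As a quick numerical sanity check of this equivalence, here is a minimal sketch in plain Python. The function names (`loss_01`, `loss_pm1`, `to_pm1`) and the test scores are made up for illustration and do not come from the notebook mentioned above.

```python
import math

def loss_01(y, z):
    """Per-sample loss for labels y in {0, 1}: -y*z + log(1 + e^z)."""
    return -y * z + math.log1p(math.exp(z))

def loss_pm1(y, z):
    """Per-sample loss for labels y in {-1, +1}: log(1 + e^(-y*z))."""
    return math.log1p(math.exp(-y * z))

def to_pm1(y01):
    """Map a {0, 1} label to the {-1, +1} convention (0 -> -1, 1 -> +1)."""
    return 2 * y01 - 1

# Arbitrary scores chosen for illustration.
scores = [-3.0, -0.5, 0.0, 1.2, 4.0]

# Largest absolute difference between the two forms over all label/score pairs.
max_gap = max(
    abs(loss_01(y, z) - loss_pm1(to_pm1(y), z))
    for y in (0, 1)
    for z in scores
)
print(max_gap)  # expected to be at floating-point rounding level
```

The two functions agree only once the labels are mapped between conventions. Feeding a $0$ label into the $\pm 1$ form gives $\log 2$ regardless of the score, which is the kind of mix-up the two notations invite.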
While there may be fundamental reasons why we have two different forms (see Why there are two different logistic loss formulation / notations?), one reason to choose the former is practical. In the former we can use the property $\partial\sigma(z)/\partial z = \sigma(z)(1-\sigma(z))$ to trivially calculate $\nabla l(z)$ and $\nabla^2 l(z)$, both of which are needed for convergence analysis (i.e. to determine the convexity of the loss function by calculating the Hessian).
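To illustrate how that property is used, here is a minimal sketch for a single sample, taking derivatives with respect to the score $z$ itself (an assumption made for brevity; with $z = \theta^{\top}x$ the chain rule adds factors of $x$). For the $0, 1$ form the first derivative is $\sigma(z) - y$ and the second is $\sigma(z)(1-\sigma(z))$, which is strictly positive, so the per-sample loss is convex in $z$. The helper names and the finite-difference check are illustrative, not taken from the original answer.

```python
import math

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def loss(y, z):
    """Per-sample loss for labels y in {0, 1}: -y*z + log(1 + e^z)."""
    return -y * z + math.log1p(math.exp(z))

def grad(y, z):
    """First derivative of the loss with respect to z: sigma(z) - y."""
    return sigmoid(z) - y

def hess(z):
    """Second derivative with respect to z: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def central_diff(f, z, h=1e-5):
    """Central finite-difference approximation of f'(z)."""
    return (f(z + h) - f(z - h)) / (2.0 * h)

# Arbitrary label and score chosen for illustration.
y, z = 1, 0.7

# Compare the analytic derivatives with numerical ones.
grad_err = abs(grad(y, z) - central_diff(lambda t: loss(y, t), z))
hess_err = abs(hess(z) - central_diff(lambda t: grad(y, t), z))
print(grad_err, hess_err, hess(z) > 0)
```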