Logistic regression is the standard first model for binary classification: take a linear score, push it through a sigmoid, and interpret the result as a probability. The motivation is simple: we still want a linear function of the inputs, but we also want the output to stay in $[0,1]$ so it can represent the posterior probability of class 1.

Forward pass: Start with the linear score $$ z = Xw + b $$ then map it to class-1 probabilities with the sigmoid $$ p = \sigma(z) = \frac{1}{1+e^{-z}}. $$ We predict class 1 when $p \ge 0.5$, which is equivalent to $z \ge 0$, so the decision boundary is still linear.
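The forward pass above can be sketched in a few lines of NumPy (the feature matrix, weights, and bias here are made-up illustrative values, not from any real dataset):

```python
import numpy as np

def sigmoid(z):
    # p = 1 / (1 + exp(-z)); maps any real score into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, w, b):
    # Linear score z = Xw + b, then class-1 probability p = sigma(z).
    z = X @ w + b
    return sigmoid(z)

# Tiny illustrative example: 2 samples, 2 features.
X = np.array([[1.0, 2.0], [-1.0, 0.5]])
w = np.array([0.3, -0.2])
b = 0.1
p = forward(X, w, b)
y_hat = (p >= 0.5).astype(int)  # predict class 1 when p >= 0.5, i.e. z >= 0
```

Note that thresholding `p` at 0.5 is the same as thresholding `z` at 0, which is why the decision boundary stays linear.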

Loss: The training objective is binary cross-entropy, averaged over the $N$ training samples: $$ \mathcal L(w,b)= -\frac1N\sum_{i=1}^N \left[y_i\log p_i+(1-y_i)\log(1-p_i)\right]. $$ If we add $L_2$ regularization, we usually apply it to the weights but not the bias: $$ \mathcal L_{\text{reg}}(w,b)=\mathcal L(w,b)+\frac{\lambda}{2N}\lVert w\rVert^2. $$ Note: since $\lVert w\rVert^2 = w^\top w$, its gradient is $2w$, so differentiating the penalty contributes the extra term $\frac{\lambda}{N}w$.
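A minimal NumPy sketch of the regularized loss, with a small `eps` added inside the logs to guard against `log(0)` (the function and parameter names here are my own, not from any library):

```python
import numpy as np

def bce_loss(p, y, w, lam=0.0):
    """Binary cross-entropy averaged over N samples, plus an L2 penalty
    of (lam / 2N) * ||w||^2 on the weights only (bias excluded)."""
    eps = 1e-12  # numerical guard: keeps log arguments strictly positive
    N = y.shape[0]
    ce = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    return ce + lam / (2 * N) * np.dot(w, w)
```

For example, with confident and correct predictions `p = [0.9, 0.1]` against labels `y = [1, 0]` and no regularization, the loss is simply $-\log 0.9 \approx 0.105$.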

Backward pass: For one sample, $$ \ell = -\left[y\log p+(1-y)\log(1-p)\right]. $$ Differentiate with respect to $p$: $$ \frac{\partial \ell}{\partial p} = \frac{p-y}{p(1-p)}. $$ Since $$ \frac{\partial p}{\partial z}=p(1-p), $$ the chain rule gives $$ \begin{aligned} \frac{\partial \ell}{\partial z} &= \frac{\partial \ell}{\partial p}\frac{\partial p}{\partial z} \\ &= p-y. \end{aligned} $$ That is the key simplification: cross-entropy and sigmoid combine so that the extra sigmoid factor cancels out.

Because $z=x^\top w+b$, we get $$ \begin{aligned} \frac{\partial \ell}{\partial w} &= (p-y)x, \\ \frac{\partial \ell}{\partial b} &= p-y. \end{aligned} $$ Stacking all samples: $$ \begin{aligned} \nabla_w \mathcal L &= \frac{1}{N}X^\top(p-y)+\frac{\lambda}{N}w, \\ \frac{\partial \mathcal L}{\partial b} &= \frac{1}{N}\sum_{i=1}^N (p_i-y_i). \end{aligned} $$ This is exactly why a NumPy implementation looks so direct: compute logits, apply sigmoid, form $p-y$, and update the parameters.
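Putting the forward pass and the stacked gradients together gives a complete gradient-descent loop. This is a minimal sketch (the hyperparameters `lr` and `n_iter` are illustrative choices, not tuned values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lam=0.0, lr=0.1, n_iter=1000):
    """Gradient descent on the L2-regularized binary cross-entropy loss."""
    N, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)       # forward pass: logits, then sigmoid
        residual = p - y             # the key quantity p - y
        grad_w = X.T @ residual / N + lam / N * w  # (1/N) X^T (p - y) + (lam/N) w
        grad_b = residual.mean()     # (1/N) sum_i (p_i - y_i)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```

On a tiny separable 1-D toy set such as `X = [[0], [1], [2], [3]]` with labels `[0, 0, 1, 1]`, the loop learns a boundary between 1 and 2 and classifies all four points correctly.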

Why this matters: If you used mean squared error with a sigmoid output, the gradient would keep an extra $p(1-p)$ term. With binary cross-entropy, that term cancels, which makes optimization cleaner and is one reason logistic regression is such a useful first classification model.
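The cancellation is easy to verify numerically. The snippet below compares $\partial\ell/\partial z$ under the two losses for a positive example ($y=1$) whose prediction is badly wrong ($z$ very negative, so $p \approx 0$); the specific value $z=-8$ is just an illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A positive example (y = 1) predicted almost certainly negative.
y, z = 1.0, -8.0
p = sigmoid(z)  # close to 0

grad_bce = p - y                  # BCE + sigmoid: the residual, near -1
grad_mse = (p - y) * p * (1 - p)  # MSE + sigmoid: keeps the extra p(1-p) factor

# grad_bce stays large, so the model gets a strong corrective signal;
# grad_mse is tiny because p(1-p) vanishes when the sigmoid saturates.
```

This vanishing-gradient behavior in the saturated regime is precisely why MSE is a poor fit for sigmoid outputs, while cross-entropy keeps learning fast even when the model is confidently wrong.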

One-line takeaway: Logistic regression = linear score, sigmoid forward pass, binary cross-entropy objective, and a backward pass that collapses to the simple residual term $p-y$.

Citation

If you found this post useful for your work, please consider citing it as:

Razvan Florian Vasile. (March 2026). "Logistic Regression with NumPy". Atom Blog. Retrieved from https://atomwalk12.github.io/posts/ml_basics/logistic_regression/.

or

@misc{vasile2026logistic-regression-numpy,
    title = "Logistic Regression with NumPy",
    author = "Razvan Florian Vasile",
    note = "Personal blog",
    year = "2026",
    month = "March",
    url = "https://atomwalk12.github.io/posts/ml_basics/logistic_regression/"
}