Update (2026-03-09): See GitHub for a NumPy/Torch implementation of the linear regressor described below.
Linear regression is one of the simplest and most important models in supervised learning. In The Elements of Statistical Learning (Hastie et al. [2]), it appears early because it captures the basic prediction setup: approximate the regression function with something linear, then fit the coefficients by minimizing squared error.
Key insight: Linear regression assumes that the conditional mean of the output can be modeled, or at least approximated, by a linear function of the inputs: $$ f(X)=\beta_0+\sum_{j=1}^p X_j\beta_j $$ The model is linear in the parameters, even if the inputs include transformations, polynomial terms, dummy variables, or interactions.
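As a minimal sketch of the model above (the coefficients and input here are made-up numbers, not fitted values), computing $f(X)$ is just an intercept plus a dot product:

```python
import numpy as np

# Hypothetical coefficients for a model with p = 2 inputs
beta0 = 1.0                   # intercept beta_0
beta = np.array([2.0, -0.5])  # slopes beta_1, beta_2

x = np.array([3.0, 4.0])      # one input vector
f_x = beta0 + x @ beta        # beta_0 + sum_j x_j * beta_j
print(f_x)                    # → 5.0
```

Transformed inputs (e.g. `x**2` or interaction terms) would simply become extra columns of `x`; the model stays linear in `beta`.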
How it works: Given training data, we choose the coefficients that minimize the residual sum of squares $$ RSS(\beta)=\sum_{i=1}^N (y_i-f(x_i))^2 $$ Equivalently, in matrix form, if $\hat y = Xw$, we can write the loss as $$ \begin{aligned} L(w) &= \frac{1}{N}\sum_{i=1}^N (y_i-\hat y_i)^2 \\ &= \frac{1}{N}\lVert y-Xw \rVert^2 \end{aligned} $$ This is the ordinary least squares objective (up to the constant $1/N$ factor, which does not change the minimizer) and can be viewed as estimating the conditional expectation $E(Y\mid X=x)$.
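The two forms of the loss above are the same quantity, which is easy to check numerically. A small sketch with synthetic data (the shapes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 3
X = rng.normal(size=(N, p))          # design matrix, one row per sample
w_true = np.array([1.0, -2.0, 0.5])  # arbitrary ground-truth weights
y = X @ w_true                       # noiseless targets for this check

w = np.zeros(p)                      # some candidate weights
y_hat = X @ w

loss_sum = np.mean((y - y_hat) ** 2)                 # (1/N) * sum_i (y_i - yhat_i)^2
loss_mat = np.linalg.norm(y - X @ w) ** 2 / N        # (1/N) * ||y - Xw||^2
assert np.isclose(loss_sum, loss_mat)
```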
If we optimize this objective with gradient descent, we differentiate the loss with respect to the weights using the chain rule. For one prediction, $$ L_i = (y_i-\hat y_i)^2 $$ so $$ \frac{\partial L_i}{\partial \hat y_i} = 2(\hat y_i-y_i) $$ Since the model is linear, $$ \hat y_i = x_i^\top w \quad\Rightarrow\quad \frac{\partial \hat y_i}{\partial w} = x_i $$ Applying the chain rule, $$ \begin{aligned} \frac{\partial L_i}{\partial w} &= \frac{\partial L_i}{\partial \hat y_i} \frac{\partial \hat y_i}{\partial w} \\ &= 2(\hat y_i-y_i)x_i \end{aligned} $$ Averaging over all samples gives $$ \begin{aligned} \nabla_w L(w) &= \frac{2}{N}\sum_{i=1}^N (\hat y_i-y_i)x_i \\ &= \frac{2}{N}X^\top(\hat y-y) \end{aligned} $$ or, equivalently, $$ \begin{aligned} \nabla_w L(w) &= -\frac{2}{N}X^\top(y-\hat y) \end{aligned} $$ This is exactly the gradient used in a from-scratch gradient descent implementation.
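Putting the pieces together, a from-scratch gradient descent loop using this gradient might look as follows (the data, learning rate, and iteration count are illustrative choices, not tuned values):

```python
import numpy as np

rng = np.random.default_rng(42)
N, p = 200, 2
X = rng.normal(size=(N, p))
w_true = np.array([3.0, -1.5])
y = X @ w_true + 0.01 * rng.normal(size=N)  # targets with a little noise

w = np.zeros(p)   # initial weights
lr = 0.1          # learning rate

for _ in range(500):
    y_hat = X @ w
    grad = (2.0 / N) * X.T @ (y_hat - y)  # the gradient derived above
    w -= lr * grad                        # gradient descent step

# w should now be close to w_true
```

Note that for plain linear regression the exact minimizer is also available in closed form via the normal equations (`np.linalg.lstsq`); the loop is shown because the same gradient pattern carries over to models without a closed-form solution.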
Note: For a full-fledged, generic backpropagation implementation that handles multilayer neural networks, see Deep Learning (Goodfellow et al. [1]), pp. 212-213.
Why it matters: Linear regression is a low-variance but potentially high-bias model: stable when the linear approximation is reasonable, limited when the true relationship is strongly nonlinear. It also serves as the starting point for a broader family of methods such as ridge regression and lasso.
One-line takeaway: Linear regression = model the output as an approximately linear function of the inputs, fit it by minimizing squared error, and compute gradients by chaining $\partial L / \partial \hat y$ with $\partial \hat y / \partial w$.
Citation
If you found this post useful for your work, please consider citing it as:
Razvan Florian Vasile. (March 2026). "Linear Regression with NumPy". Atom Blog. Retrieved from https://atomwalk12.github.io/posts/ml_basics/regression/.
or
@misc{vasile2026linear-regression-numpy,
title = "Linear Regression with NumPy",
author = "Razvan Florian Vasile",
note = "Personal blog",
year = "2026",
month = "March",
url = "https://atomwalk12.github.io/posts/ml_basics/regression/"
}
References
- [1] Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
- [2] Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). Springer.