Update (2026-03-09): See GitHub for a NumPy/Torch implementation of the linear regressor described below.

Linear regression is one of the simplest and most important models in supervised learning. In The Elements of Statistical Learning (ESL) (Hastie, Tibshirani, & Friedman, 2009), it appears early because it captures the basic prediction setup: approximate the regression function with something linear, then fit the coefficients by minimizing squared error.

Key insight: Linear regression assumes that the conditional mean of the output can be modeled, or at least approximated, by a linear function of the inputs: $$ f(X)=\beta_0+\sum_{j=1}^p X_j\beta_j $$ The model is linear in the parameters, even if the inputs include transformations, polynomial terms, dummy variables, or interactions.
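As a quick sketch of this point, the NumPy snippet below builds a design matrix that includes an intercept column and a polynomial term; the model is still linear in the coefficients even though it is nonlinear in the original input (the data and coefficient values here are illustrative, not from the post):

```python
import numpy as np

# f(X) = beta_0 + sum_j X_j * beta_j, where the "inputs" may themselves be
# transformations (here: a polynomial term x^2) of an original variable x.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=5)

# Design matrix: intercept column, raw input, and a squared term.
X = np.column_stack([np.ones_like(x), x, x**2])

beta = np.array([1.0, 2.0, -0.5])  # [beta_0, beta_1, beta_2]
y_hat = X @ beta                   # one prediction per sample
```

Because the parameters enter linearly, everything that follows (the squared-error loss and its gradient) applies unchanged to such transformed inputs.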

How it works: Given training data, we choose the coefficients that minimize the residual sum of squares $$ RSS(\beta)=\sum_{i=1}^N (y_i-f(x_i))^2 $$ Equivalently, in matrix form, if $\hat y = Xw$, we can write the loss as $$ \begin{aligned} L(w) &= \frac{1}{N}\sum_{i=1}^N (y_i-\hat y_i)^2 \\ &= \frac{1}{N}\lVert y-Xw \rVert^2 \end{aligned} $$ This is the ordinary least squares objective and can be viewed as estimating the conditional expectation $E(Y\mid X=x)$.
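The matrix form of the loss translates directly into NumPy. A minimal sketch (function and variable names are illustrative):

```python
import numpy as np

def mse_loss(w, X, y):
    """L(w) = (1/N) * ||y - Xw||^2, the OLS objective in matrix form."""
    residuals = y - X @ w
    return residuals @ residuals / len(y)

# Synthetic noiseless data: the loss is exactly zero at the true weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

loss_at_optimum = mse_loss(true_w, X, y)   # ~0.0
loss_elsewhere = mse_loss(np.zeros(3), X, y)  # > 0
```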

If we optimize this objective with gradient descent, we differentiate the loss with respect to the weights using the chain rule. For one prediction, $$ L_i = (y_i-\hat y_i)^2 $$ so $$ \frac{\partial L_i}{\partial \hat y_i} = 2(\hat y_i-y_i) $$ Since the model is linear, $$ \hat y_i = x_i^\top w \quad\Rightarrow\quad \frac{\partial \hat y_i}{\partial w} = x_i $$ Applying the chain rule, $$ \begin{aligned} \frac{\partial L_i}{\partial w} &= \frac{\partial L_i}{\partial \hat y_i} \frac{\partial \hat y_i}{\partial w} \\ &= 2(\hat y_i-y_i)x_i \end{aligned} $$ Averaging over all samples gives $$ \begin{aligned} \nabla_w L(w) &= \frac{2}{N}\sum_{i=1}^N (\hat y_i-y_i)x_i \\ &= \frac{2}{N}X^\top(\hat y-y) \end{aligned} $$ or, equivalently, $$ \begin{aligned} \nabla_w L(w) &= -\frac{2}{N}X^\top(y-\hat y) \end{aligned} $$ This is exactly the gradient used in a from-scratch gradient descent implementation.
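Putting the gradient $\nabla_w L(w) = \frac{2}{N}X^\top(\hat y - y)$ to work, a from-scratch gradient descent loop might look like the following (the learning rate, step count, and synthetic data are illustrative choices, not taken from the post):

```python
import numpy as np

def fit_linear(X, y, lr=0.1, n_steps=500):
    """Fit w by gradient descent on the mean squared error."""
    w = np.zeros(X.shape[1])
    N = len(y)
    for _ in range(n_steps):
        y_hat = X @ w
        grad = (2.0 / N) * X.T @ (y_hat - y)  # the gradient derived above
        w -= lr * grad
    return w

# Noiseless synthetic data: gradient descent should recover true_w.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -0.7, 2.0])
y = X @ true_w

w = fit_linear(X, y)
```

Note that for plain OLS a closed-form solution also exists (solving the normal equations), but the gradient descent version is the one that generalizes to the backpropagation setting discussed next.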

Note: For a full-fledged generic backpropagation implementation that handles multilayer neural networks, see Deep Learning (Goodfellow, Bengio, & Courville, 2016), pp. 212-213.

Why it matters: Linear regression is a low-variance but potentially high-bias model: stable when the linear approximation is reasonable, limited when the true relationship is strongly nonlinear. It also serves as the starting point for a broader family of methods such as ridge regression and lasso.

One-line takeaway: Linear regression = model the output as an approximately linear function of the inputs, fit it by minimizing squared error, and compute gradients by chaining $\partial L / \partial \hat y$ with $\partial \hat y / \partial w$.

Citation

If you found this post useful for your work, please consider citing it as:

Razvan Florian Vasile. (March 2026). "Linear Regression with NumPy". Atom Blog. Retrieved from https://atomwalk12.github.io/posts/ml_basics/regression/.

or

@misc{vasile2026linear-regression-numpy,
    title = "Linear Regression with NumPy",
    author = "Razvan Florian Vasile",
    note = "Personal blog",
    year = "2026",
    month = "March",
    url = "https://atomwalk12.github.io/posts/ml_basics/regression/"
}

References

  1. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
  2. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer.