Simple Neural Network from scratch

Idea

This code implements a neural network (NN) classifier from scratch for the MNIST handwritten digits dataset.

The goal is to understand the basic flow:

data -> forward pass -> loss -> backpropagation -> parameter update -> evaluate

Avoided:

deep learning frameworks
autograd
ready-made neural network layers

Used:

numpy
pandas
sklearn, only for fetching MNIST and splitting the data
manual forward pass
manual backpropagation
minibatch gradient descent

Setup

Install the needed packages:

pip install numpy pandas scikit-learn

Then run the model file:

python src/model.py

The first run can take a while because fetch_openml("mnist_784") downloads the MNIST dataset. Later runs reuse scikit-learn's local cache.

Data preprocessing

MNIST images have this shape:

28 x 28 = 784 pixels

Each image is flattened into one row of 784 values.

Pixels are normalized from:

0-255

to:

0-1

Labels are one-hot encoded.

Example:

3 -> [0,0,0,1,0,0,0,0,0,0]

The current split (80% train, 20% test of the 70000 images) gives:

X_train: (56000, 784)
X_test:  (14000, 784)
y_train: (56000, 10)
y_test:  (14000, 10)

Used architecture

layer_sizes = [784, 128, 64, 10]

where:

784 - input pixels
128 - first hidden layer
64 - second hidden layer
10 - output classes

The network flow is:

Input -> Linear -> ReLU -> Linear -> ReLU -> Linear -> Softmax

Main functions

`normalize_input(X)`

Scales the raw pixel values to floats between 0 and 1.

X / 255.0

Before normalization, pixel values are between 0 and 255.

After normalization, they are between 0 and 1.

This keeps the inputs on a small, consistent scale, so the first-layer activations and gradients do not start out very large and training with a fixed learning rate is more stable.
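
A minimal sketch of this step is shown below. The cast to float32 and the tiny `raw` array are assumptions made for illustration; the source only states the division by 255.0.

```python
import numpy as np

def normalize_input(X):
    """Scale raw pixel values from [0, 255] to floats in [0, 1]."""
    # Casting to float32 is an assumption; the documented step is X / 255.0.
    return np.asarray(X, dtype=np.float32) / 255.0

# Hypothetical 2-pixel "images", only to show the value range.
raw = np.array([[0, 255], [51, 204]], dtype=np.uint8)
scaled = normalize_input(raw)
print(scaled.min(), scaled.max())
```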

`one_hot(y, num_classes=10)`

Converts normal labels into one-hot encoded labels.

Example:

label: 3
one-hot: [0,0,0,1,0,0,0,0,0,0]

This is useful because the output layer gives 10 probabilities, one for each digit.
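
One possible way to write this with NumPy indexing is sketched below. It assumes the labels are already integers (or can be cast to integers); the real function may differ.

```python
import numpy as np

def one_hot(y, num_classes=10):
    """Turn integer labels of shape (n,) into a one-hot matrix of shape (n, num_classes)."""
    y = np.asarray(y, dtype=np.int64)
    Y = np.zeros((y.shape[0], num_classes), dtype=np.float32)
    # For every row i, set the column given by the label y[i] to 1.
    Y[np.arange(y.shape[0]), y] = 1.0
    return Y

labels = np.array([3, 0, 9])
Y = one_hot(labels)
print(Y[0])  # row for label 3
```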

`sigmoid(Z)` and `sigmoid_derivative(Z)`

Sigmoid maps numbers into the range 0 to 1.

It is implemented for learning and experimenting, but the main model does not use it in the hidden layers.

`tanh(Z)` and `tanh_derivative(Z)`

Tanh maps numbers into the range -1 to 1.

It is also included as another activation option, but the current network uses ReLU instead.
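
For reference, the two optional activations and their derivatives can be sketched as follows. These are the textbook formulas, not necessarily the exact code in the repository.

```python
import numpy as np

def sigmoid(Z):
    # Maps any real number into (0, 1).
    return 1.0 / (1.0 + np.exp(-Z))

def sigmoid_derivative(Z):
    S = sigmoid(Z)
    return S * (1.0 - S)

def tanh(Z):
    # Maps any real number into (-1, 1).
    return np.tanh(Z)

def tanh_derivative(Z):
    return 1.0 - np.tanh(Z) ** 2

Z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(Z), sigmoid_derivative(Z))
print(tanh(Z), tanh_derivative(Z))
```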

`relu(Z)` and `relu_derivative(Z)`

ReLU is used in the hidden layers.

ReLU(Z) = max(0, Z)

So:

negative values -> 0
positive values -> stay the same

The derivative is:

1 if Z > 0
0 otherwise

This derivative is needed during backpropagation.
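
A vectorized sketch of both functions, matching the rules above (derivative 0 at Z = 0), could look like this:

```python
import numpy as np

def relu(Z):
    # Element-wise max(0, Z).
    return np.maximum(0.0, Z)

def relu_derivative(Z):
    # 1 where Z > 0, 0 otherwise (including Z == 0).
    return (Z > 0).astype(Z.dtype)

Z = np.array([[-1.5, 0.0, 2.0]])
print(relu(Z))             # [[0. 0. 2.]]
print(relu_derivative(Z))  # [[0. 0. 1.]]
```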

`leaky_relu(Z)` and `leaky_relu_derivative(Z)`

Leaky ReLU is included as an experiment.

Instead of turning negative values fully into 0, it multiplies them by a small slope, so a small gradient still flows for negative inputs.

Note: in the current code this function is written for scalar values, so it is not used in the main vectorized network.
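
If a vectorized version is wanted later, one option is `np.where`. This is only a sketch: the slope `alpha=0.01` is an assumed default, not a value taken from the current code.

```python
import numpy as np

def leaky_relu(Z, alpha=0.01):
    # alpha is a hypothetical default slope for negative inputs.
    return np.where(Z > 0, Z, alpha * Z)

def leaky_relu_derivative(Z, alpha=0.01):
    # 1 for positive inputs, alpha otherwise.
    return np.where(Z > 0, 1.0, alpha)

Z = np.array([-2.0, 3.0])
print(leaky_relu(Z), leaky_relu_derivative(Z))
```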

`softmax(logits)`

Softmax converts the final layer scores into probabilities.

Example output:

[0.01, 0.02, 0.80, 0.04, ...]

The values in each row add up to 1.

The model prediction is the index with the highest probability:

np.argmax(P, axis=1)

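The implementation subtracts the row maximum before using exp. This avoids overflow for large logits and does not change the result, because softmax is unchanged when the same constant is subtracted from every logit in a row.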
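
The row-maximum trick can be sketched as below. The second row of `logits` uses deliberately huge, made-up values to show that the shifted version stays finite.

```python
import numpy as np

def softmax(logits):
    # Subtract the per-row maximum so the largest exponent is exp(0) = 1.
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=1, keepdims=True)

logits = np.array([[1.0, 2.0, 3.0],
                   [1000.0, 1001.0, 1002.0]])  # same differences, much larger scale
P = softmax(logits)
print(P)
print(np.argmax(P, axis=1))
```

Both rows give the same probabilities, since only the differences between logits matter.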

`cross_entropy(Y, P)`

Measures how far the predicted probabilities are from the true labels.

Y is the true one-hot label.

P is the predicted probability from softmax.

The simplified idea is:

high probability on correct class -> low loss
low probability on correct class  -> high loss

The probabilities are clipped with a small eps value so log(0) does not happen.
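
A sketch of the loss, averaged over the batch, is shown below. The value `eps=1e-12` and the upper clip bound of 1.0 are assumptions; the source only says that a small eps is used.

```python
import numpy as np

def cross_entropy(Y, P, eps=1e-12):
    # Clip so that log never sees exactly 0 (eps is an assumed value).
    P = np.clip(P, eps, 1.0)
    # Only the log-probability of the correct class survives the multiplication by Y.
    return -np.mean(np.sum(Y * np.log(P), axis=1))

Y = np.array([[0.0, 1.0, 0.0]])
P_good = np.array([[0.05, 0.90, 0.05]])  # high probability on the correct class
P_bad = np.array([[0.90, 0.05, 0.05]])   # low probability on the correct class
loss_good = cross_entropy(Y, P_good)
loss_bad = cross_entropy(Y, P_bad)
print(loss_good, loss_bad)
```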

`accuracy(Y, P)`

Checks how many predictions are correct.

It compares:

np.argmax(Y, axis=1)

with:

np.argmax(P, axis=1)

Then it returns the fraction of examples where the two match.
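
As a small sketch with made-up probabilities for a 3-class problem:

```python
import numpy as np

def accuracy(Y, P):
    # Compare the true class index with the most probable predicted class.
    return np.mean(np.argmax(Y, axis=1) == np.argmax(P, axis=1))

Y = np.eye(3)[[0, 1, 2, 1]]          # true labels 0, 1, 2, 1 as one-hot rows
P = np.array([[0.7, 0.2, 0.1],       # predicts 0 (correct)
              [0.1, 0.8, 0.1],       # predicts 1 (correct)
              [0.2, 0.5, 0.3],       # predicts 1 (wrong, true label is 2)
              [0.3, 0.4, 0.3]])      # predicts 1 (correct)
acc = accuracy(Y, P)
print(acc)  # 0.75
```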

`init_parameters(layer_sizes, seed=42)`

Creates the weights and biases for every layer.

For:

layer_sizes = [784, 128, 64, 10]

it creates:

W1: (784, 128)
b1: (1, 128)

W2: (128, 64)
b2: (1, 64)

W3: (64, 10)
b3: (1, 10)

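Weights use He initialization, which scales the random initial values by:

sqrt(2 / fan_in)

where fan_in is the number of inputs to the layer (784, 128, and 64 here).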

Biases start at zero.
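
A sketch of such an initializer follows. Storing the parameters in a dict with keys like "W1" and "b1", drawing from a standard normal distribution, and using `np.random.default_rng` are all assumptions for illustration.

```python
import numpy as np

def init_parameters(layer_sizes, seed=42):
    rng = np.random.default_rng(seed)
    params = {}  # assumed layout: {"W1": ..., "b1": ..., "W2": ..., ...}
    for i in range(1, len(layer_sizes)):
        fan_in, fan_out = layer_sizes[i - 1], layer_sizes[i]
        # He initialization: normal samples scaled by sqrt(2 / fan_in).
        params[f"W{i}"] = rng.standard_normal((fan_in, fan_out)) * np.sqrt(2.0 / fan_in)
        params[f"b{i}"] = np.zeros((1, fan_out))
    return params

params = init_parameters([784, 128, 64, 10])
for name, value in params.items():
    print(name, value.shape)
```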

`forward(X, params)`

Runs the model from input to prediction.

The flow is:

Z1 = X  @ W1 + b1
A1 = ReLU(Z1)

Z2 = A1 @ W2 + b2
A2 = ReLU(Z2)

Z3 = A2 @ W3 + b3
P  = Softmax(Z3)

It returns:

P - final probabilities
cache - saved intermediate values for backpropagation

The cache is important because the backward pass needs values like X, A1, A2, Z1, Z2, and P.
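
The flow above can be sketched as follows. The dict layout of `params` and `cache` is an assumption, and the demo uses a tiny hypothetical network [4, 5, 3, 2] with random inputs instead of MNIST.

```python
import numpy as np

def relu(Z):
    return np.maximum(0.0, Z)

def softmax(logits):
    exp = np.exp(logits - np.max(logits, axis=1, keepdims=True))
    return exp / np.sum(exp, axis=1, keepdims=True)

def forward(X, params):
    Z1 = X @ params["W1"] + params["b1"]
    A1 = relu(Z1)
    Z2 = A1 @ params["W2"] + params["b2"]
    A2 = relu(Z2)
    Z3 = A2 @ params["W3"] + params["b3"]
    P = softmax(Z3)
    # Keep everything the backward pass will need.
    cache = {"X": X, "Z1": Z1, "A1": A1, "Z2": Z2, "A2": A2, "P": P}
    return P, cache

# Tiny hypothetical network, only to check shapes.
rng = np.random.default_rng(0)
sizes = [4, 5, 3, 2]
params = {}
for i in range(1, len(sizes)):
    params[f"W{i}"] = rng.standard_normal((sizes[i - 1], sizes[i])) * np.sqrt(2.0 / sizes[i - 1])
    params[f"b{i}"] = np.zeros((1, sizes[i]))

X = rng.random((6, 4))  # batch of 6 examples with 4 features each
P, cache = forward(X, params)
print(P.shape)
```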

`backward(Y, params, cache)`

Computes the gradients manually.

This is the backpropagation part.

Because softmax and cross-entropy are used together, the gradient of the mean loss with respect to the final layer scores simplifies to:

dZ3 = (P - Y) / batch_size

Then gradients are calculated backwards:

dW3, db3
dW2, db2
dW1, db1

Each gradient has the same shape as the parameter it updates.

For example:

dW1 has the same shape as W1
db1 has the same shape as b1
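
A sketch of the backward pass for this three-layer layout is given below, followed by a finite-difference check on one weight. The dict keys, the helper `loss`, and the tiny network [4, 5, 3, 2] are assumptions for illustration, not the repository's exact code.

```python
import numpy as np

def relu(Z):
    return np.maximum(0.0, Z)

def softmax(logits):
    exp = np.exp(logits - np.max(logits, axis=1, keepdims=True))
    return exp / np.sum(exp, axis=1, keepdims=True)

def forward(X, params):
    Z1 = X @ params["W1"] + params["b1"]; A1 = relu(Z1)
    Z2 = A1 @ params["W2"] + params["b2"]; A2 = relu(Z2)
    P = softmax(A2 @ params["W3"] + params["b3"])
    return P, {"X": X, "Z1": Z1, "A1": A1, "Z2": Z2, "A2": A2, "P": P}

def backward(Y, params, cache):
    batch_size = Y.shape[0]
    dZ3 = (cache["P"] - Y) / batch_size              # softmax + cross-entropy shortcut
    grads = {"dW3": cache["A2"].T @ dZ3, "db3": dZ3.sum(axis=0, keepdims=True)}
    dZ2 = (dZ3 @ params["W3"].T) * (cache["Z2"] > 0)  # chain rule through ReLU
    grads["dW2"] = cache["A1"].T @ dZ2
    grads["db2"] = dZ2.sum(axis=0, keepdims=True)
    dZ1 = (dZ2 @ params["W2"].T) * (cache["Z1"] > 0)
    grads["dW1"] = cache["X"].T @ dZ1
    grads["db1"] = dZ1.sum(axis=0, keepdims=True)
    return grads

def loss(X, Y, params):
    # Hypothetical helper: mean cross-entropy, used only for the check below.
    P, _ = forward(X, params)
    return -np.mean(np.sum(Y * np.log(P), axis=1))

rng = np.random.default_rng(0)
sizes = [4, 5, 3, 2]
params = {}
for i in range(1, len(sizes)):
    params[f"W{i}"] = rng.standard_normal((sizes[i - 1], sizes[i])) * np.sqrt(2.0 / sizes[i - 1])
    params[f"b{i}"] = np.zeros((1, sizes[i]))
X = rng.standard_normal((6, 4))
Y = np.eye(2)[rng.integers(0, 2, size=6)]

_, cache = forward(X, params)
grads = backward(Y, params, cache)

# Finite-difference check: nudge one weight and compare with the analytic gradient.
h = 1e-5
plus = {k: v.copy() for k, v in params.items()}
minus = {k: v.copy() for k, v in params.items()}
plus["W3"][0, 0] += h
minus["W3"][0, 0] -= h
numeric = (loss(X, Y, plus) - loss(X, Y, minus)) / (2 * h)
analytic = grads["dW3"][0, 0]
print(numeric, analytic)
```

The two printed numbers should agree to several decimal places; a mismatch usually points to a missing transpose or a forgotten division by the batch size.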

`update_parameters(params, grads, lr)`

Updates every weight and bias using gradient descent.

The update rule is:

parameter = parameter - learning_rate * gradient

If the learning rate is too high, training can become unstable.

If it is too low, training becomes slow.
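
A sketch of the update rule is shown below. It assumes gradients are stored under the parameter name prefixed with "d" (for example "dW1" for "W1") and returns a new dict; the real function may update in place instead.

```python
import numpy as np

def update_parameters(params, grads, lr):
    # parameter = parameter - learning_rate * gradient, for every weight and bias.
    return {name: value - lr * grads["d" + name] for name, value in params.items()}

# Made-up values, only to show one step.
params = {"W1": np.array([[1.0, -2.0]]), "b1": np.array([[0.5]])}
grads = {"dW1": np.array([[0.5, -0.5]]), "db1": np.array([[1.0]])}
updated = update_parameters(params, grads, lr=0.1)
print(updated["W1"], updated["b1"])
```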

`iterate_minibatches(X, Y, batch_size, rng)`

Shuffles the dataset and returns small batches.

Instead of training on all 56000 examples at once, the model trains on smaller parts like:

batch_size = 32

Each update is cheap to compute because it only uses one small batch, and the model gets many parameter updates per epoch (56000 / 32 = 1750) instead of one.
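
One way to write such a generator is sketched here. It assumes `rng` is a `np.random.Generator` and that the last batch may be smaller than `batch_size`.

```python
import numpy as np

def iterate_minibatches(X, Y, batch_size, rng):
    # Shuffle the row indices once, then slice them into consecutive batches.
    indices = rng.permutation(X.shape[0])
    for start in range(0, X.shape[0], batch_size):
        batch = indices[start:start + batch_size]
        yield X[batch], Y[batch]

# Toy data: 10 rows, so batch_size=4 gives batches of 4, 4 and 2 rows.
rng = np.random.default_rng(0)
X = np.arange(20).reshape(10, 2)
Y = np.arange(10).reshape(10, 1)
batches = list(iterate_minibatches(X, Y, batch_size=4, rng=rng))
print([xb.shape[0] for xb, _ in batches])  # [4, 4, 2]
```

Indexing X and Y with the same shuffled indices keeps every input row paired with its label.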

`train(...)`

Runs the full training loop.

For every epoch:

create minibatches
run forward
calculate cross_entropy
run backward
call update_parameters
evaluate validation loss and accuracy

The current settings are:

epochs = 100
batch_size = 32
lr = 0.003

The function returns:

params - trained weights and biases
history - train loss, validation loss, validation accuracy
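
The loop can be sketched end to end as below. To stay self-contained, this version uses condensed stand-ins for the helper functions, shuffles inline instead of calling iterate_minibatches, and trains on synthetic 2-D blobs rather than MNIST. The `train` signature, the history keys, and the toy settings (10 epochs, lr = 0.05) are assumptions, not the script's real configuration.

```python
import numpy as np

def init_parameters(layer_sizes, seed=42):
    rng = np.random.default_rng(seed)
    params = {}
    for i in range(1, len(layer_sizes)):
        fan_in, fan_out = layer_sizes[i - 1], layer_sizes[i]
        params[f"W{i}"] = rng.standard_normal((fan_in, fan_out)) * np.sqrt(2.0 / fan_in)
        params[f"b{i}"] = np.zeros((1, fan_out))
    return params

def forward(X, params):
    n_layers = len(params) // 2
    cache, A = {"A0": X}, X
    for i in range(1, n_layers + 1):
        Z = A @ params[f"W{i}"] + params[f"b{i}"]
        if i < n_layers:
            A = np.maximum(0.0, Z)                     # ReLU on hidden layers
        else:
            exp = np.exp(Z - Z.max(axis=1, keepdims=True))
            A = exp / exp.sum(axis=1, keepdims=True)   # softmax on the output layer
        cache[f"Z{i}"], cache[f"A{i}"] = Z, A
    return A, cache

def backward(Y, params, cache):
    n_layers = len(params) // 2
    grads = {}
    dZ = (cache[f"A{n_layers}"] - Y) / Y.shape[0]
    for i in range(n_layers, 0, -1):
        grads[f"dW{i}"] = cache[f"A{i - 1}"].T @ dZ
        grads[f"db{i}"] = dZ.sum(axis=0, keepdims=True)
        if i > 1:
            dZ = (dZ @ params[f"W{i}"].T) * (cache[f"Z{i - 1}"] > 0)
    return grads

def cross_entropy(Y, P, eps=1e-12):
    return float(-np.mean(np.sum(Y * np.log(np.clip(P, eps, 1.0)), axis=1)))

def accuracy(Y, P):
    return float(np.mean(np.argmax(Y, axis=1) == np.argmax(P, axis=1)))

def train(X_train, Y_train, X_val, Y_val, layer_sizes, epochs, batch_size, lr, seed=42):
    rng = np.random.default_rng(seed)
    params = init_parameters(layer_sizes, seed)
    history = {"train_loss": [], "val_loss": [], "val_acc": []}
    for epoch in range(1, epochs + 1):
        order = rng.permutation(X_train.shape[0])      # new shuffle every epoch
        batch_losses = []
        for start in range(0, order.size, batch_size):
            idx = order[start:start + batch_size]
            P, cache = forward(X_train[idx], params)
            batch_losses.append(cross_entropy(Y_train[idx], P))
            grads = backward(Y_train[idx], params, cache)
            params = {k: v - lr * grads["d" + k] for k, v in params.items()}
        P_val, _ = forward(X_val, params)
        history["train_loss"].append(float(np.mean(batch_losses)))
        history["val_loss"].append(cross_entropy(Y_val, P_val))
        history["val_acc"].append(accuracy(Y_val, P_val))
        print(f"epoch={epoch:03d} train_loss={history['train_loss'][-1]:.4f} "
              f"val_loss={history['val_loss'][-1]:.4f} val_acc={history['val_acc'][-1]:.4f}")
    return params, history

# Synthetic stand-in for MNIST: three 2-D Gaussian blobs (hypothetical data).
rng = np.random.default_rng(0)
centers = np.array([[-3.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
labels = rng.integers(0, 3, size=400)
X = centers[labels] + rng.standard_normal((400, 2))
Y = np.eye(3)[labels]
params, history = train(X[:300], Y[:300], X[300:], Y[300:],
                        layer_sizes=[2, 16, 8, 3], epochs=10, batch_size=16, lr=0.05)
```

The training loss here is the average over the minibatches of one epoch, so it is measured while the parameters are still changing; the validation numbers use the parameters at the end of the epoch.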

`confusion_matrix(y_true, y_pred, num_classes=10)`

Counts how often each true label was predicted as each class.

Rows are true labels.

Columns are predicted labels.

Example:

cm = confusion_matrix(y_true, y_pred, num_classes=10)
cm[3, 8]

is the number of examples where the real digit was 3, but the model predicted 8.
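
A sketch of the counting logic is shown below. It assumes `y_true` and `y_pred` are integer class indices (not one-hot rows), and the four labels in the demo are made up.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes=10):
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    # Add 1 at (true label, predicted label) for every example;
    # np.add.at handles repeated index pairs correctly.
    np.add.at(cm, (y_true, y_pred), 1)
    return cm

y_true = np.array([3, 3, 8, 3])
y_pred = np.array([3, 8, 8, 8])
cm = confusion_matrix(y_true, y_pred, num_classes=10)
print(cm[3, 8])  # 2
```

The diagonal holds the correct predictions, so np.trace(cm) / cm.sum() gives the accuracy.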

Training output

The script prints training progress like:

epoch=001
train_loss=...
val_loss=...
val_acc=...

After training it prints the confusion matrix.

Short summary

This is a small MNIST neural network built with NumPy.

The main goal is to make the basic math less magical:

forward pass -> loss -> backward pass -> parameter update
