Simple NN from Scratch
A small neural network built from scratch to classify MNIST digits and to understand forward passes, backpropagation, gradients, and training.
Simple Neural Network from scratch
Idea
The purpose of this code is to implement a NN classifier from scratch for the MNIST handwritten digits dataset.
The goal is to understand the basic flow:
data -> forward pass -> loss -> backpropagation -> parameter update -> evaluate
Deliberately avoided:
- deep learning frameworks
- autograd
- ready-made neural network layers
Used instead:
- numpy
- pandas
- sklearn, only for fetching MNIST and splitting the data
- manual forward pass
- manual backpropagation
- minibatch gradient descent
Setup
Install the needed packages:
pip install numpy pandas scikit-learn
Then run the model file:
python src/model.py
The first run can take a bit because fetch_openml("mnist_784") downloads the
MNIST dataset.
Data preprocessing
MNIST images have this shape:
28 x 28 = 784 pixels
The image is flattened into one row with 784 values.
Pixels are normalized from:
0-255
to:
0-1
Labels are one-hot encoded.
Example:
3 -> [0,0,0,1,0,0,0,0,0,0]
The current split gives:
X_train: (56000, 784)
X_test: (14000, 784)
y_train: (56000, 10)
y_test: (14000, 10)
Used architecture
layer_sizes = [784, 128, 64, 10]
where:
- 784 - input pixels
- 128 - first hidden layer
- 64 - second hidden layer
- 10 - output classes
The network flow is:
Input -> Linear -> ReLU -> Linear -> ReLU -> Linear -> Softmax
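To get a feel for the size of this architecture, the number of trainable parameters can be counted directly from layer_sizes. This is only a small arithmetic sketch; it assumes each Linear step has one weight matrix and one bias per output unit, as described in the init_parameters section.

```python
layer_sizes = [784, 128, 64, 10]

total = 0
for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    # weights (fan_in x fan_out) plus one bias per output unit
    count = fan_in * fan_out + fan_out
    print(f"{fan_in} -> {fan_out}: {count} parameters")
    total += count

print("total:", total)
```

The total is 109386 parameters, almost all of them in the first layer.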
Main functions
normalize_input(X)
Converts the raw pixel values to smaller float values.
X / 255.0
Before normalization, pixel values are between 0 and 255.
After normalization, they are between 0 and 1.
This helps the model train more smoothly because the input values are not too large.
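A minimal sketch of what normalize_input could look like. The cast to float32 is an assumption made here for illustration; the original code may keep a different dtype.

```python
import numpy as np

def normalize_input(X):
    """Scale raw pixel values from the range 0-255 to the range 0-1."""
    # float32 is an assumed dtype, chosen only to keep memory use small
    return np.asarray(X, dtype=np.float32) / 255.0

raw = np.array([[0, 128, 255]], dtype=np.uint8)
scaled = normalize_input(raw)
print(scaled.min(), scaled.max())
```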
one_hot(y, num_classes=10)
Converts integer class labels into one-hot encoded labels.
Example:
label: 3
one-hot: [0,0,0,1,0,0,0,0,0,0]
This is useful because the output layer gives 10 probabilities, one for each digit.
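One possible way to write one_hot with NumPy indexing. This sketch assumes the labels are already integers from 0 to num_classes - 1; the original implementation may differ in details.

```python
import numpy as np

def one_hot(y, num_classes=10):
    """Turn integer labels of shape (n,) into a one-hot matrix (n, num_classes)."""
    y = np.asarray(y, dtype=int)
    encoded = np.zeros((y.shape[0], num_classes), dtype=np.float32)
    # for every row i, set the column given by the label y[i] to 1
    encoded[np.arange(y.shape[0]), y] = 1.0
    return encoded

encoded = one_hot([3, 0, 9])
print(encoded[0])
```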
sigmoid(Z) and sigmoid_derivative(Z)
Sigmoid maps numbers into the range 0 to 1.
It is implemented for learning and experimenting, but the main model does not use it in the hidden layers.
tanh(Z) and tanh_derivative(Z)
Tanh maps numbers into the range -1 to 1.
It is also included as another activation option, but the current network uses ReLU instead.
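For reference, sigmoid, tanh, and their derivatives can be sketched as below. These are the standard textbook formulas, not necessarily line-for-line what the repository contains.

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def sigmoid_derivative(Z):
    # derivative expressed through the sigmoid itself: s * (1 - s)
    s = sigmoid(Z)
    return s * (1.0 - s)

def tanh(Z):
    return np.tanh(Z)

def tanh_derivative(Z):
    # derivative of tanh: 1 - tanh(Z)^2
    return 1.0 - np.tanh(Z) ** 2

Z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(Z), tanh(Z))
```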
relu(Z) and relu_derivative(Z)
ReLU is used in the hidden layers.
ReLU(Z) = max(0, Z)
So:
negative values -> 0
positive values -> stay the same
The derivative is:
1 if Z > 0
0 otherwise
This derivative is needed during backpropagation.
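A vectorized sketch of ReLU and its derivative, following the rules above. Treating the derivative at exactly Z = 0 as 0 matches the "0 otherwise" rule.

```python
import numpy as np

def relu(Z):
    return np.maximum(0, Z)

def relu_derivative(Z):
    # 1 where Z > 0, 0 everywhere else (including Z == 0)
    return (Z > 0).astype(Z.dtype)

Z = np.array([[-2.0, 0.0, 3.0]])
print(relu(Z))
print(relu_derivative(Z))
```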
leaky_relu(Z) and leaky_relu_derivative(Z)
Leaky ReLU is included as an experiment.
Instead of turning negative values fully into 0, it keeps a very small part of
them.
Note: in the current code this function is written for scalar values, so it is not used in the main vectorized network.
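If a vectorized version is wanted later, one way to write it is with np.where. This is a hypothetical variant, not the code in the repository, and the slope alpha=0.01 is an assumed default.

```python
import numpy as np

def leaky_relu(Z, alpha=0.01):
    # positive values pass through, negative values are scaled by alpha
    return np.where(Z > 0, Z, alpha * Z)

def leaky_relu_derivative(Z, alpha=0.01):
    # slope is 1 for positive inputs and alpha otherwise
    return np.where(Z > 0, 1.0, alpha)

Z = np.array([-2.0, 3.0])
print(leaky_relu(Z), leaky_relu_derivative(Z))
```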
softmax(logits)
Softmax converts the final layer scores into probabilities.
Example output:
[0.01, 0.02, 0.80, 0.04, ...]
The values in each row add up to 1.
The model prediction is the index with the highest probability:
np.argmax(P, axis=1)
The implementation subtracts the row maximum before using exp. This makes it
more numerically stable.
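The row-maximum trick can be sketched like this. Subtracting the same constant from every logit in a row does not change the softmax result, but it keeps exp from overflowing.

```python
import numpy as np

def softmax(logits):
    # shift each row so its largest value is 0 before calling exp
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=1, keepdims=True)

logits = np.array([[1.0, 2.0, 3.0],
                   [1000.0, 1000.0, 1000.0]])
P = softmax(logits)
print(P.sum(axis=1))
print(np.argmax(P, axis=1))
```

Without the shift, the second row would compute exp(1000), which overflows to inf in floating point.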
cross_entropy(Y, P)
Calculates how wrong the predicted probabilities are.
Y is the true one-hot label.
P is the predicted probability from softmax.
The simplified idea is:
high probability on correct class -> low loss
low probability on correct class -> high loss
The probabilities are clipped with a small eps value so log(0) does not
happen.
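A minimal sketch of the loss, assuming it is averaged over the batch (which is consistent with the division by batch_size in the backward pass). The value eps=1e-12 is an assumption; the source only says a small eps is used.

```python
import numpy as np

def cross_entropy(Y, P, eps=1e-12):
    # clip so that log never sees exactly 0
    P = np.clip(P, eps, 1.0 - eps)
    # per-example loss is -log(probability of the true class)
    return -np.mean(np.sum(Y * np.log(P), axis=1))

Y = np.array([[0.0, 1.0, 0.0]])
confident = np.array([[0.05, 0.90, 0.05]])
wrong = np.array([[0.90, 0.05, 0.05]])

low_loss = cross_entropy(Y, confident)
high_loss = cross_entropy(Y, wrong)
print(low_loss, high_loss)
```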
accuracy(Y, P)
Checks how many predictions are correct.
It compares:
np.argmax(Y, axis=1)
with:
np.argmax(P, axis=1)
Then it returns the fraction of predictions that match the true labels.
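The comparison above fits in one line of NumPy. A small sketch with three made-up examples:

```python
import numpy as np

def accuracy(Y, P):
    # compare the index of the true class with the index of the largest probability
    return np.mean(np.argmax(Y, axis=1) == np.argmax(P, axis=1))

Y = np.eye(3)[[0, 1, 2]]            # true classes 0, 1, 2
P = np.array([[0.7, 0.2, 0.1],      # predicts 0 (correct)
              [0.1, 0.8, 0.1],      # predicts 1 (correct)
              [0.6, 0.3, 0.1]])     # predicts 0 (wrong, true class is 2)
acc = accuracy(Y, P)
print(acc)
```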
init_parameters(layer_sizes, seed=42)
Creates the weights and biases for every layer.
For:
layer_sizes = [784, 128, 64, 10]
it creates:
W1: (784, 128)
b1: (1, 128)
W2: (128, 64)
b2: (1, 64)
W3: (64, 10)
b3: (1, 10)
Weights use He initialization, where random values are scaled by:
sqrt(2 / fan_in)
Biases start at zero.
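A sketch of how init_parameters could produce these shapes. Storing the parameters in a dict with keys like "W1" and "b1", and drawing the weights from a standard normal distribution, are assumptions made for this example.

```python
import numpy as np

def init_parameters(layer_sizes, seed=42):
    rng = np.random.default_rng(seed)
    params = {}
    for i in range(1, len(layer_sizes)):
        fan_in, fan_out = layer_sizes[i - 1], layer_sizes[i]
        # He initialization: normal samples scaled by sqrt(2 / fan_in)
        params[f"W{i}"] = rng.standard_normal((fan_in, fan_out)) * np.sqrt(2.0 / fan_in)
        # biases start at zero
        params[f"b{i}"] = np.zeros((1, fan_out))
    return params

params = init_parameters([784, 128, 64, 10])
for name, value in params.items():
    print(name, value.shape)
```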
forward(X, params)
Runs the model from input to prediction.
The flow is:
Z1 = X @ W1 + b1
A1 = ReLU(Z1)
Z2 = A1 @ W2 + b2
A2 = ReLU(Z2)
Z3 = A2 @ W3 + b3
P = Softmax(Z3)
It returns:
- P - final probabilities
- cache - saved intermediate values for backpropagation
The cache is important because the backward pass needs values like X, A1,
A2, Z1, Z2, and P.
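The forward pass described above can be sketched as follows. The dict layout of params and cache is an assumption for illustration, and the parameters here are random just to show the shapes.

```python
import numpy as np

def relu(Z):
    return np.maximum(0, Z)

def softmax(logits):
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=1, keepdims=True)

def forward(X, params):
    Z1 = X @ params["W1"] + params["b1"]
    A1 = relu(Z1)
    Z2 = A1 @ params["W2"] + params["b2"]
    A2 = relu(Z2)
    Z3 = A2 @ params["W3"] + params["b3"]
    P = softmax(Z3)
    # keep everything the backward pass will need
    cache = {"X": X, "Z1": Z1, "A1": A1, "Z2": Z2, "A2": A2, "P": P}
    return P, cache

# random parameters and a fake batch of 4 "images", only to check shapes
rng = np.random.default_rng(0)
sizes = [784, 128, 64, 10]
params = {}
for i in range(1, len(sizes)):
    params[f"W{i}"] = rng.standard_normal((sizes[i - 1], sizes[i])) * np.sqrt(2.0 / sizes[i - 1])
    params[f"b{i}"] = np.zeros((1, sizes[i]))

X = rng.random((4, 784))
P, cache = forward(X, params)
print(P.shape, P.sum(axis=1))
```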
backward(Y, params, cache)
Computes the gradients manually.
This is the backpropagation part.
Because softmax and cross-entropy are used together, the first gradient is:
dZ3 = (P - Y) / batch_size
Then gradients are calculated backwards:
dW3, db3
dW2, db2
dW1, db1
Each gradient has the same shape as the parameter it updates.
For example:
dW1 has the same shape as W1
db1 has the same shape as b1
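A sketch of the backward pass, together with a quick numerical check of one gradient entry. The dict keys, the tiny layer sizes [6, 5, 4, 3], and the finite-difference check are assumptions added for illustration; they are not taken from the repository.

```python
import numpy as np

def forward(X, params):
    Z1 = X @ params["W1"] + params["b1"]
    A1 = np.maximum(0, Z1)
    Z2 = A1 @ params["W2"] + params["b2"]
    A2 = np.maximum(0, Z2)
    Z3 = A2 @ params["W3"] + params["b3"]
    E = np.exp(Z3 - Z3.max(axis=1, keepdims=True))
    P = E / E.sum(axis=1, keepdims=True)
    return P, {"X": X, "Z1": Z1, "A1": A1, "Z2": Z2, "A2": A2, "P": P}

def backward(Y, params, cache):
    batch_size = Y.shape[0]
    # softmax + cross-entropy combined gradient
    dZ3 = (cache["P"] - Y) / batch_size
    dW3 = cache["A2"].T @ dZ3
    db3 = dZ3.sum(axis=0, keepdims=True)
    # propagate through layer 3, then through the ReLU of layer 2
    dZ2 = (dZ3 @ params["W3"].T) * (cache["Z2"] > 0)
    dW2 = cache["A1"].T @ dZ2
    db2 = dZ2.sum(axis=0, keepdims=True)
    # propagate through layer 2, then through the ReLU of layer 1
    dZ1 = (dZ2 @ params["W2"].T) * (cache["Z1"] > 0)
    dW1 = cache["X"].T @ dZ1
    db1 = dZ1.sum(axis=0, keepdims=True)
    return {"dW1": dW1, "db1": db1, "dW2": dW2, "db2": db2, "dW3": dW3, "db3": db3}

rng = np.random.default_rng(0)
sizes = [6, 5, 4, 3]  # tiny network so the check runs instantly
params = {}
for i in range(1, len(sizes)):
    params[f"W{i}"] = rng.standard_normal((sizes[i - 1], sizes[i])) * np.sqrt(2.0 / sizes[i - 1])
    params[f"b{i}"] = np.zeros((1, sizes[i]))
X = rng.standard_normal((8, 6))
Y = np.eye(3)[rng.integers(0, 3, size=8)]

P, cache = forward(X, params)
grads = backward(Y, params, cache)

def loss(params):
    P, _ = forward(X, params)
    return -np.mean(np.sum(Y * np.log(P), axis=1))

# central finite difference on a single weight, compared with backprop
h = 1e-5
original = params["W1"][0, 0]
params["W1"][0, 0] = original + h
loss_plus = loss(params)
params["W1"][0, 0] = original - h
loss_minus = loss(params)
params["W1"][0, 0] = original
numeric = (loss_plus - loss_minus) / (2 * h)
print(numeric, grads["dW1"][0, 0])
```

If the two printed numbers agree to several decimal places, the manual gradient for that weight is very likely correct.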
update_parameters(params, grads, lr)
Updates every weight and bias using gradient descent.
The update rule is:
parameter = parameter - learning_rate * gradient
If the learning rate is too high, the updates overshoot and training can become unstable.
If it is too low, training becomes slow.
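The update rule applied to every parameter could look like this. The naming convention, where the gradient of "W1" is stored under "dW1", is an assumption for this sketch.

```python
import numpy as np

def update_parameters(params, grads, lr):
    for name in params:
        # parameter = parameter - learning_rate * gradient
        params[name] = params[name] - lr * grads["d" + name]
    return params

params = {"W1": np.array([[1.0, -2.0]]), "b1": np.array([[0.5]])}
grads = {"dW1": np.array([[0.5, -0.5]]), "db1": np.array([[1.0]])}
params = update_parameters(params, grads, lr=0.1)
print(params["W1"], params["b1"])
```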
iterate_minibatches(X, Y, batch_size, rng)
Shuffles the dataset and returns small batches.
Instead of training on all 56000 examples at once, the model trains on smaller parts like:
batch_size = 32
This makes each update cheap to compute and gives many parameter updates per epoch: with 56000 training examples and batch_size = 32, that is 56000 / 32 = 1750 updates.
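A sketch of the shuffling and batching, written as a generator. Whether the original returns a generator or a list is not stated, so treat that as an assumption; rng is assumed to be a NumPy Generator.

```python
import numpy as np

def iterate_minibatches(X, Y, batch_size, rng):
    # one random order per call, shared by X and Y so pairs stay aligned
    indices = rng.permutation(X.shape[0])
    for start in range(0, X.shape[0], batch_size):
        batch = indices[start:start + batch_size]
        yield X[batch], Y[batch]

X = np.arange(10).reshape(10, 1)
Y = np.arange(10).reshape(10, 1)
rng = np.random.default_rng(0)

batches = list(iterate_minibatches(X, Y, batch_size=4, rng=rng))
sizes = [xb.shape[0] for xb, _ in batches]
print(sizes)
```

The last batch is smaller when the dataset size is not a multiple of batch_size.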
train(...)
Runs the full training loop.
For every epoch:
- create minibatches
- run forward
- calculate cross_entropy
- run backward
- call update_parameters
- evaluate validation loss and accuracy
The current settings are:
epochs = 100
batch_size = 32
lr = 0.003
The function returns:
- params - trained weights and biases
- history - train loss, validation loss, validation accuracy
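To show how the steps of the loop fit together, here is a reduced sketch. It is not the repository's train function: it uses only one hidden layer, synthetic two-class data instead of MNIST, and assumed names for the arguments and the history keys.

```python
import numpy as np

def softmax(Z):
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def cross_entropy(Y, P, eps=1e-12):
    return -np.mean(np.sum(Y * np.log(np.clip(P, eps, 1.0)), axis=1))

def forward(X, params):
    Z1 = X @ params["W1"] + params["b1"]
    A1 = np.maximum(0, Z1)
    P = softmax(A1 @ params["W2"] + params["b2"])
    return P, {"X": X, "Z1": Z1, "A1": A1, "P": P}

def backward(Y, params, cache):
    dZ2 = (cache["P"] - Y) / Y.shape[0]
    dZ1 = (dZ2 @ params["W2"].T) * (cache["Z1"] > 0)
    return {"dW2": cache["A1"].T @ dZ2, "db2": dZ2.sum(axis=0, keepdims=True),
            "dW1": cache["X"].T @ dZ1, "db1": dZ1.sum(axis=0, keepdims=True)}

def train(X_train, Y_train, X_val, Y_val, params, epochs, batch_size, lr, seed=0):
    rng = np.random.default_rng(seed)
    history = {"train_loss": [], "val_loss": [], "val_acc": []}
    for epoch in range(1, epochs + 1):
        order = rng.permutation(X_train.shape[0])          # create minibatches
        batch_losses = []
        for start in range(0, len(order), batch_size):
            idx = order[start:start + batch_size]
            P, cache = forward(X_train[idx], params)       # forward
            batch_losses.append(cross_entropy(Y_train[idx], P))
            grads = backward(Y_train[idx], params, cache)  # backward
            for name in params:                            # parameter update
                params[name] -= lr * grads["d" + name]
        P_val, _ = forward(X_val, params)                  # evaluate once per epoch
        history["train_loss"].append(float(np.mean(batch_losses)))
        history["val_loss"].append(float(cross_entropy(Y_val, P_val)))
        history["val_acc"].append(float(np.mean(P_val.argmax(axis=1) == Y_val.argmax(axis=1))))
    return params, history

# synthetic data: class 1 when x0 + x1 > 0, otherwise class 0
rng = np.random.default_rng(1)
X = rng.standard_normal((400, 2))
Y = np.eye(2)[(X[:, 0] + X[:, 1] > 0).astype(int)]
params = {"W1": rng.standard_normal((2, 8)) * np.sqrt(2.0 / 2), "b1": np.zeros((1, 8)),
          "W2": rng.standard_normal((8, 2)) * np.sqrt(2.0 / 8), "b2": np.zeros((1, 2))}

params, history = train(X[:320], Y[:320], X[320:], Y[320:], params,
                        epochs=20, batch_size=32, lr=0.1)
print(history["train_loss"][0], history["train_loss"][-1])
```

The learning rate of 0.1 here is chosen for the toy data only and is unrelated to the lr = 0.003 used for MNIST.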
confusion_matrix(y_true, y_pred, num_classes=10)
Counts which labels were predicted correctly or incorrectly.
Rows are true labels.
Columns are predicted labels.
Example:
cm = confusion_matrix(y_true, y_pred, num_classes=10)
cm[3,8]
means the real digit was 3, but the model predicted 8.
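A compact way to build such a matrix with NumPy is shown below. It assumes y_true and y_pred are integer class indices (for example from np.argmax), and the use of np.add.at is a choice made for this sketch, not necessarily what the repository does.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes=10):
    y_true = np.asarray(y_true, dtype=int)
    y_pred = np.asarray(y_pred, dtype=int)
    cm = np.zeros((num_classes, num_classes), dtype=int)
    # add 1 at (true label, predicted label) for every example,
    # correctly handling repeated index pairs
    np.add.at(cm, (y_true, y_pred), 1)
    return cm

y_true = np.array([3, 3, 8, 1])
y_pred = np.array([8, 3, 8, 1])
cm = confusion_matrix(y_true, y_pred, num_classes=10)
print(cm[3, 8], cm[3, 3])
```

The diagonal holds the correct predictions, so cm.trace() / cm.sum() gives the accuracy.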
Training output
The script prints training progress like:
epoch=001
train_loss=...
val_loss=...
val_acc=...
After training it prints the confusion matrix.
Short summary
This is a small MNIST neural network built with NumPy.
The main goal is to make the basic math less magical:
forward pass -> loss -> backward pass -> parameter update

