Neural Networks Basics
Learn how neural networks learn patterns through layers, weights, and backpropagation.
What Is a Neural Network?
Think of a neural network as a pattern-recognition machine. It takes some input (text, an image, numbers), passes it through a series of processing steps, and produces an output (a prediction, a classification, a generated token).
The name comes from a loose analogy to the brain. A neural network is made up of neurons organized into layers. Each neuron receives input values, multiplies them by weights (numbers that represent how important each input is), adds them up, and passes the result through an activation function that decides whether and how strongly the neuron “fires.”
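The neuron described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the sigmoid activation, the bias term (an extra number added to the weighted sum), and all numeric values are assumptions chosen for the example.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    """Multiply each input by its weight, add everything up, then apply the activation."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return weighted_sum, sigmoid(weighted_sum)

# Illustrative values only (not taken from a trained model).
weighted_sum, output = neuron(inputs=[0.9, 0.3], weights=[1.2, -0.8], bias=-0.15)
print(round(weighted_sum, 2), round(output, 2))  # 0.69 0.67
```

The first input has a positive weight and pushes the neuron toward firing, while the second has a negative weight and pushes against it.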
A single neuron isn’t very powerful. But stack thousands of neurons in multiple layers, and the network can learn incredibly complex patterns — from recognizing faces in photos to generating human-like text.
How It Learns
The learning process has three key steps that repeat over and over:
1. Forward pass — Data flows through the network from input to output. Each layer transforms the data using its current weights. The final layer produces a prediction.
2. Loss calculation — The prediction is compared to the correct answer using a loss function. The loss is a number that measures how wrong the prediction is. Lower is better.
3. Backpropagation — This is the clever part. The network works backward through the layers, calculating how much each weight contributed to the error (the gradient of the loss with respect to that weight). An optimizer then nudges each weight slightly in the direction that would reduce the error. The size of each nudge is controlled by a value called the learning rate: too large and the model overshoots, too small and learning is painfully slow.
This cycle — predict, measure error, adjust weights — repeats millions of times across the training data. Gradually, the weights converge to values that make good predictions.
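The predict, measure, adjust cycle can be sketched for the smallest possible network: one sigmoid neuron with one weight and one bias. The toy dataset, the binary cross-entropy loss, the learning rate of 0.5, and the 200 steps are all assumptions picked for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical toy data: the label is 1 when the input is positive.
data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]

w, b = 0.0, 0.0        # start from uninformative weights
learning_rate = 0.5    # assumed step size for this toy problem
losses = []

for step in range(200):
    total_loss, grad_w, grad_b = 0.0, 0.0, 0.0
    for x, y in data:
        # 1. Forward pass: compute the prediction with the current weights.
        p = sigmoid(w * x + b)
        # 2. Loss: binary cross-entropy between prediction p and label y.
        total_loss += -(y * math.log(p) + (1 - y) * math.log(1 - p))
        # 3. Backpropagation: for sigmoid + cross-entropy, dLoss/dz = p - y,
        #    and the chain rule gives dLoss/dw = (p - y) * x, dLoss/db = (p - y).
        grad_w += (p - y) * x
        grad_b += (p - y)
    n = len(data)
    losses.append(total_loss / n)
    # Weight update: step against the averaged gradient.
    w -= learning_rate * grad_w / n
    b -= learning_rate * grad_b / n

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

With both parameters starting at zero, every prediction is 0.5, so the first recorded loss is ln 2 (about 0.693). The loss then falls as the weight grows in the direction that separates the two classes.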
Key Building Blocks
Layers come in different types. The most basic is a dense (fully connected) layer where every neuron connects to every neuron in the next layer. Modern networks use specialized layers — convolutional layers for images, attention layers for sequences (the foundation of Transformers).
Activation functions add non-linearity. Without them, stacking layers would be pointless — multiple linear operations collapse into a single linear operation. Common activations include ReLU (passes positive values through, blocks negatives) and softmax (converts raw scores into probabilities that sum to 1, used in the output layer of classifiers and language models).
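A short sketch of both points follows: the two activations named above, and the collapse of stacked linear layers. The layer coefficients (2, 1, 3, -4) and the function names are arbitrary choices for illustration.

```python
import math

def relu(z):
    """Pass positive values through, block negatives."""
    return max(0.0, z)

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtracting the max avoids overflow
    total = sum(exps)
    return [e / total for e in exps]

def two_linear_layers(x):
    """Two stacked linear layers with no activation in between."""
    hidden = 2.0 * x + 1.0
    return 3.0 * hidden - 4.0

def one_linear_layer(x):
    """The same function written as a single layer: 3 * (2x + 1) - 4 = 6x - 1."""
    return 6.0 * x - 1.0

def with_relu(x):
    """Putting ReLU between the layers breaks the equivalence."""
    return 3.0 * relu(2.0 * x + 1.0) - 4.0

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
print(two_linear_layers(-2.0), one_linear_layer(-2.0), with_relu(-2.0))
```

For x = -2 the two linear versions agree, while the ReLU version gives a different value because the hidden unit is clipped to zero. That clipping is what lets additional layers express something a single linear layer cannot.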
Bias is an extra number added to each neuron’s weighted sum. It lets the neuron shift its threshold, that is, how much input it needs before it activates. Without a bias, the weighted sum is always zero whenever every input is zero, so the neuron’s response is forced to pass through the origin.
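The effect of the bias is easy to see with a one-input neuron. The sketch below assumes a sigmoid activation and uses made-up weights and biases.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, weight, bias=0.0):
    return sigmoid(weight * x + bias)

# Without a bias, a zero input always gives sigmoid(0) = 0.5, whatever the weight is.
no_bias = [neuron(0.0, w) for w in (-3.0, 0.5, 10.0)]

# A negative bias raises the threshold: the same zero input now lands below 0.5 ...
with_bias = neuron(0.0, weight=2.0, bias=-1.0)
# ... and the output only reaches 0.5 once the input is 0.5 (2 * 0.5 - 1 = 0).
crossing = neuron(0.5, weight=2.0, bias=-1.0)

print(no_bias, with_bias, crossing)
```

Changing the weight alone tilts the response but cannot move the point where it crosses 0.5. Only the bias can do that.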
From Shallow to Deep
A network with just one hidden layer (hidden layers are the layers between input and output) can, in theory, approximate any continuous function to arbitrary accuracy, but it might need an impractically large number of neurons. Deep networks, meaning networks with many layers, can often represent the same patterns with far fewer total neurons by building up features hierarchically.
In an image network, early layers might learn to detect edges. Middle layers combine edges into shapes. Deep layers recognize objects. In a language model, early layers might capture word similarities, middle layers handle grammar and syntax, and deep layers grasp meaning and context.
This hierarchical feature learning is why depth matters, and why we call it deep learning.
Key Terminology
- Epoch — One complete pass through the entire training dataset. Training typically runs for many epochs.
- Batch — A subset of training data processed together. Instead of updating weights after every single example, the network processes a batch, averages the gradients over it, and applies one update. This is faster and produces smoother learning.
- Gradient — The vector of partial derivatives of the loss with respect to each weight. It points in the direction in which the loss increases fastest, and its magnitude shows how steep that slope is. Backpropagation computes the gradient for every weight, and the optimizer steps in the opposite direction to reduce the loss.
- Overfitting — When a model memorizes the training data instead of learning generalizable patterns. It performs well on training data but poorly on new data.
- Underfitting — When a model is too simple to capture the patterns in the data. It performs poorly on both training and new data.
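Epochs, batches, and gradients fit together as in the following sketch. It is a toy setup under stated assumptions: a hypothetical dataset where y = 2x, a model with a single weight, a squared-error loss, and arbitrary values for the batch size, learning rate, and epoch count.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical dataset: y = 2x, so the ideal weight is 2.0.
dataset = [(i / 10.0, 2.0 * i / 10.0) for i in range(-10, 11)]

w = 0.0
learning_rate = 0.1
batch_size = 4
epochs = 50

for epoch in range(epochs):                 # one epoch = one full pass over the dataset
    random.shuffle(dataset)                 # new ordering each epoch
    for start in range(0, len(dataset), batch_size):
        batch = dataset[start:start + batch_size]
        # Gradient of the squared error (w*x - y)^2 with respect to w,
        # averaged over the examples in this batch.
        grad = sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
        w -= learning_rate * grad           # one weight update per batch

print(round(w, 3))
```

Each epoch here performs six updates (five batches of four examples and one leftover example), and the weight ends up very close to 2.0.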
Why Does It Matter?
Most modern AI systems, from image recognition to language models to game-playing agents, are built on neural networks. Understanding the basics of how they learn (forward pass → loss → backpropagation) gives you the mental model needed to understand more advanced concepts like Transformers, attention mechanisms, and fine-tuning.
You don’t need to implement neural networks from scratch to use AI effectively, but knowing what’s happening under the hood helps you make better decisions: why some models need more data, why training is expensive, why fine-tuning works, and why models sometimes fail in predictable ways.
Common Misconceptions
“Neural networks work like the brain.” The analogy is very loose. Biological neurons are far more complex than artificial ones. Neural networks are inspired by the brain’s structure but work very differently in practice.
“More layers always help.” Deeper isn’t always better. Very deep networks can suffer from vanishing gradients (the gradient signal shrinks toward zero as it flows backward through many layers) and are harder to train. Techniques like residual connections (skip connections) were invented to address this.
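A rough numeric sketch shows why gradients can vanish. It assumes a chain of one-neuron sigmoid layers, each with weight 1 and the same pre-activation value, which is a deliberate simplification. During the backward pass, the gradient is multiplied by each layer's local derivative, and the sigmoid derivative is never larger than 0.25.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # peaks at 0.25 when z = 0

def gradient_through_chain(depth, z=0.0, residual=False):
    """Multiply the local derivatives of `depth` stacked layers (toy model, weight 1)."""
    grad = 1.0
    for _ in range(depth):
        local = sigmoid_derivative(z)
        # A residual layer computes x + f(x), so its derivative is 1 + f'(x).
        grad *= (1.0 + local) if residual else local
    return grad

for depth in (1, 5, 20):
    plain = gradient_through_chain(depth)
    skip = gradient_through_chain(depth, residual=True)
    print(depth, plain, skip)
```

In the plain chain the product shrinks by a factor of at least four per layer. With skip connections, each factor is at least 1, so the signal no longer collapses toward zero. In this toy version the product grows instead, so the sketch demonstrates only the direction of the effect, not how real residual networks keep gradients well scaled.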
“Neural networks understand what they’re doing.” They optimize a mathematical objective (minimize loss). They don’t have goals, understanding, or awareness. A network that classifies cat photos doesn’t know what a cat is — it has learned pixel patterns that correlate with the label “cat.”
Further Reading
- 3Blue1Brown’s “Neural Networks” YouTube series — brilliant visual explanations
- Michael Nielsen’s “Neural Networks and Deep Learning” — free online book
- The Transformer Architecture concept in this hub for how modern LLMs build on these foundations
Read the article first
These tools reinforce the concepts above — you'll get more out of them after reading through the article.
Interactive: Neuron Playground
Manipulate inputs, weights, bias, and activation functions to see how a neuron fires and why hidden layers only become more expressive with nonlinearity.
Playground mode
A single neuron shows the core weighted-sum -> activation pattern. Move the inputs and weights to see how one score turns into one decision.
Activation function: Sigmoid. Squashes outputs into a 0..1 range, which is useful when you want a probability-like score.
Weighted sum: 0.69
Activated output: 0.67
Confidence: 67%
Classification: Positive
What changed after activation
Linear score 0.69 becomes 0.67 after Sigmoid.
Without nonlinearity, extra layers collapse into one bigger linear transform.
Forward pass: (0.90 x 1.20) + (0.30 x -0.80) - 0.15 = 0.69
This is the entire neuron story: multiply, add, then pass the result through an activation function.
Interactive: Training Loop Visualizer
Scrub through a deterministic training run and inspect the forward pass, loss, backpropagation, and weight updates alongside the evolving decision map.
Training setup
A healthy step size for this toy problem. Fast enough to learn without obvious overshoot.
Epoch scrubber
Epoch: 0
Loss: 0.69
Accuracy: 50%
Fit state: Underfitting
One training step
Forward pass: The run starts in the same uncertain state, but it moves away from it faster.
Loss: The prediction error at this checkpoint is 0.69 with 50% accuracy on the toy dataset.
Backprop: Balanced learning rate gives clearer progress without exploding the weights.
Weight update: The boundary rotates faster than in the gentle run, but it is still only one line.
Loss trajectory (Shallow + Balanced)
Lower loss means the network's predictions are matching the training labels more closely.
Decision map snapshot
Teal cells mean "predict positive" and warm cells mean "predict negative". Deeper runs can bend this map in ways a single straight boundary cannot.
Highlight sample
Point (-1, 1) should be positive. Current confidence: 44%.
Watch this confidence rise in healthy runs and wobble in unstable ones.
Dataset checkpoint view
Sample        Label  Position         Confidence  Result
nw-positive   1      (-1, 1)          44%         wrong
se-positive   1      (1, -1)          56%         correct
ne-negative   0      (1, 1)           50%         wrong
sw-negative   0      (-1, -1)         50%         wrong
nw-soft       1      (-0.75, 0.75)    46%         wrong
se-soft       1      (0.7, -0.7)      54%         correct
ne-soft       0      (0.72, 0.72)     50%         wrong
sw-soft       0      (-0.72, -0.72)   50%         wrong
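The eight points listed above form an XOR-like pattern: positives sit in the north-west and south-east, negatives in the north-east and south-west, so no single straight boundary can separate them. The sketch below shows how one hidden layer with a nonlinearity can. The two ReLU units and their weights are picked by hand for illustration. They are not weights learned by the visualizer.

```python
def relu(z):
    return max(0.0, z)

# The eight points from the dataset checkpoint view: (x1, x2, label).
points = [
    (-1.0, 1.0, 1), (1.0, -1.0, 1), (1.0, 1.0, 0), (-1.0, -1.0, 0),
    (-0.75, 0.75, 1), (0.7, -0.7, 1), (0.72, 0.72, 0), (-0.72, -0.72, 0),
]

def hidden_layer_score(x1, x2):
    """Two ReLU hidden units with hand-picked (not trained) weights."""
    h1 = relu(x1 - x2)    # active toward the south-east
    h2 = relu(x2 - x1)    # active toward the north-west
    return h1 + h2 - 1.0  # a positive score means "predict label 1"

predictions = [1 if hidden_layer_score(x1, x2) > 0 else 0 for x1, x2, _ in points]
labels = [label for _, _, label in points]
print(predictions == labels)  # True
```

Each hidden unit draws its own line, and the output combines them into a band-shaped decision region. A single neuron, which has only one line to work with, cannot produce that shape.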