FeedForward and the Transformer Block: The Other Half

This is the fourth post (of four) in my MiniMind learning series. It takes a deep dive into the FeedForward network and shows how the four core components — RMSNorm, RoPE, Attention, and FeedForward — assemble into a complete Transformer Block. By the end, you’ll have a thorough grasp of the full Transformer architecture.

About this series#

MiniMind ↗ is a concise yet complete project for training a large language model, covering the full pipeline from data processing and model training to inference and deployment. As I worked through it, I distilled the core technical points into the minimind-notes ↗ repository and produced this four-part blog series, walking through the core components of the Transformer in a systematic way.

This series covers:

Normalization — why we need RMSNorm
RoPE positional encoding — how to make the model understand word order
The Attention mechanism — the core engine of the Transformer
FeedForward and the full architecture (this post) — how the components work together

1. Introduction#

1.1 The neglected other half#

When people think of the Transformer, the usual associations are:

✅ The Attention mechanism (the star component)
✅ Positional encoding (RoPE)
❓ FeedForward? What’s that?

The facts:

FeedForward accounts for 40% of the code in a Transformer Block
FeedForward makes up about two-thirds of each Transformer Block's parameters (see the tally in 7.3)
You can’t train a good model with Attention alone, without FeedForward

1.2 Questions this post answers#

What does FeedForward actually do?
Why “expand then compress” (768 → 2048 → 768)?
What makes SwiGLU better than a plain FFN?
How do Attention and FeedForward divide the labor?
How do the four components assemble into a complete Transformer Block?
What does the residual connection do?

1.3 Who this is for#

You’ve studied Attention but FFN is still fuzzy
You want a complete understanding of the Transformer architecture
You’re getting ready to implement a Transformer from scratch
You want to know why the Transformer “works”

2. What is FeedForward?#

2.1 The core idea#

“Apply a complex nonlinear transformation to each token’s vector.”

Key characteristics:

✅ Each token is processed independently (no token-to-token interaction)
✅ Input dimension = output dimension (768)
✅ But the content changes completely (it goes through a nonlinear transformation)
✅ It boosts expressive power by routing through a high-dimensional space

2.2 A typical structure#

input:   [batch, seq_len, 768]
  ↓
expand:  Linear(768 → 2048)
  ↓
activate: a nonlinear function (ReLU/GELU/SiLU)
  ↓
compress: Linear(2048 → 768)
  ↓
output:  [batch, seq_len, 768]

plaintext

2.3 A simple implementation#

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFeedForward(nn.Module):
    def __init__(self, hidden_size=768, intermediate_size=2048):
        super().__init__()
        self.w1 = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.w2 = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x):
        # x: [batch, seq_len, 768]
        h = self.w1(x)       # expand: [batch, seq_len, 2048]
        h = F.relu(h)        # activation
        output = self.w2(h)  # compress: [batch, seq_len, 768]
        return output

python
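To make the "each token is processed independently" property concrete, here is a minimal sketch that runs the same layer on a full sequence and on a single token pulled out of it. The class name `TinyFFN` and the small sizes are assumptions chosen so the check runs instantly; the structure mirrors `SimpleFeedForward` above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

class TinyFFN(nn.Module):
    """Same expand -> activate -> compress shape as SimpleFeedForward, with small sizes."""
    def __init__(self, hidden_size=8, intermediate_size=32):
        super().__init__()
        self.w1 = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.w2 = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x):
        return self.w2(F.relu(self.w1(x)))

ffn = TinyFFN()
x = torch.randn(2, 5, 8)        # [batch, seq_len, hidden]

full = ffn(x)                   # the whole sequence at once
single = ffn(x[:, 2:3, :])      # only the token at position 2

same_shape = full.shape == x.shape
token_independent = torch.allclose(full[:, 2:3, :], single, atol=1e-5)
print(same_shape, token_independent)
```

If the layer mixed information across positions, the second check would fail, because the single-token call never sees the other four tokens.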

2.4 Comparison with Attention#

# Attention
sentence: "I love coding"
# "love" can see "I" and "coding"
# → tokens interact, fusing context

# FeedForward
sentence: "I love coding"
# "love" only looks at itself, transformed independently
# → each token is independent, processed in depth

python

An analogy:

Attention = a meeting where everyone exchanges information
FeedForward = everyone thinking on their own, digesting that information independently

3. Why “expand then compress”?#

3.1 A common question#

“Why not just go 768 → 768 directly? Taking a big detour out to 2048 and back — isn’t that wasted compute?”

That’s a great question!

3.2 The intuition: expressive power#

Option 1: a direct transformation

# 768 → 768
output = W @ x  # W is a [768, 768] matrix

# This is just a linear transformation!
# Expressive power = a single matrix multiply

python

Option 2: expand then compress

# 768 → 2048 → 768
h = W1 @ x           # [2048, 768] @ [768] = [2048]
h = activation(h)    # nonlinear!
output = W2 @ h      # [768, 2048] @ [2048] = [768]

# Expressive power = two matrix multiplies + a nonlinear activation
# Can fit far more complex functions!

python
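The activation is what makes the detour worthwhile. Without it, the two matrix multiplies collapse into a single 768 × 768 matrix and nothing is gained. A small sketch of that collapse, using tiny made-up dimensions (`d` and `d_ff` are illustrative, not MiniMind's sizes):

```python
import torch

torch.manual_seed(0)
d, d_ff = 4, 16
W1 = torch.randn(d_ff, d, dtype=torch.float64)
W2 = torch.randn(d, d_ff, dtype=torch.float64)
x = torch.randn(d, dtype=torch.float64)

# Without an activation, the two multiplies collapse into one [d, d] matrix.
W_merged = W2 @ W1
no_activation = W2 @ (W1 @ x)
collapses = torch.allclose(no_activation, W_merged @ x)

# With ReLU in between, the merged matrix no longer reproduces the output.
with_relu = W2 @ torch.relu(W1 @ x)
relu_differs = not torch.allclose(with_relu, W_merged @ x)

print(collapses, relu_differs)
```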

3.3 The mathematical essence: the magic of high-dimensional space#

Key insight: in a high-dimensional space, vectors have more “degrees of freedom.”

A simplified view:

Direct transformation (768 → 768): can only form linear combinations, constrained by the original dimensionality
Expanding to high dimensions (768 → 2048): more degrees of freedom in the high-dimensional space
Nonlinear activation: introduces nonlinear transformation capacity
Compressing back to the original dimension (2048 → 768): retains the complex patterns learned in the high-dimensional space

Experimental verification: when fitting a complex function (such as y = sin(x) + cos(2x)):

Direct transformation Linear(1, 1): can only fit a linear function, large error ❌
Expand-compress Linear(1, 64) → ReLU → Linear(64, 1): can fit nonlinear functions, small error ✅
Conclusion: expand-compress = expressive power++
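The comparison above can be reproduced in a few lines. This is a rough sketch rather than the experiment script from the repository: the sampling range, optimizer, step count, and learning rate are all assumptions I picked for a quick run.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

# Target from the text: y = sin(x) + cos(2x), sampled on [-pi, pi].
x = torch.linspace(-math.pi, math.pi, 256).unsqueeze(1)
y = torch.sin(x) + torch.cos(2 * x)

def fit(model, steps=1000, lr=1e-2):
    """Train with Adam on mean squared error and return the final loss."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()
    with torch.no_grad():
        return nn.functional.mse_loss(model(x), y).item()

linear_mse = fit(nn.Linear(1, 1))
mlp_mse = fit(nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1)))

print(f"linear: {linear_mse:.4f}  expand-compress: {mlp_mse:.4f}")
```

The linear model cannot do better than the best straight line through the curve, however long it trains; the expand-compress model has no such ceiling.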

3.4 Analogies#

Analogy 1: cooking

Direct transformation (768 → 768):
  raw ingredients → plated
  no processing, one-note flavor ❌

Expand-compress (768 → 2048 → 768):
  ingredients → chopped, seasoned, cooked (2048-dim high-dimensional space) → plated
  "processed" in a high-dimensional space, rich flavor ✅

plaintext

Analogy 2: photo editing

Direct transformation:
  original → a simple filter → output
  limited effect ❌

Expand-compress:
  original → extract features (more dimensions) → complex transformation → compress back to original size
  enables complex operations like denoising and super-resolution ✅

plaintext

3.5 Why 2048 dimensions (and not 1024 or 4096)?#

Rule of thumb: intermediate_size ≈ hidden_size × 2.67 to 4

Model	hidden_size	intermediate_size	Ratio
BERT-Base	768	3072	4.0
GPT-2	768	3072	4.0
Llama-7B	4096	11008	2.69
MiniMind	768	2048	2.67

Why not larger?

Larger = more parameters = slower and more memory-hungry
2.67–4× is the empirically validated sweet spot
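One way to read the table: the 4× models use a two-matrix FFN, while the roughly 2.67× models use the three-matrix gated FFN covered in section 4. Shrinking the intermediate size by 2/3 (4 × 2/3 = 8/3 ≈ 2.67) keeps the parameter count unchanged, which is the adjustment described in the GLU variants paper listed in the references. The helper names below are mine, for illustration only.

```python
def plain_ffn_params(hidden, intermediate):
    """Two matrices: hidden -> intermediate and intermediate -> hidden (no biases)."""
    return 2 * hidden * intermediate

def gated_ffn_params(hidden, intermediate):
    """Three matrices: gate, up, and down projections (no biases)."""
    return 3 * hidden * intermediate

hidden = 768
plain_4x = plain_ffn_params(hidden, 4 * hidden)         # 768 -> 3072 -> 768
gated_8_3 = gated_ffn_params(hidden, 8 * hidden // 3)   # 768 -> 2048 -> 768

print(plain_4x, gated_8_3)  # 4718592 4718592
```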

4. SwiGLU: the modern Transformer’s choice#

4.1 The evolution of FeedForward#

Generation 1 (the original Transformer, 2017):

h = ReLU(W1 @ x)
output = W2 @ h

python

Generation 2 (BERT / GPT-2 / GPT-3, 2018–2020):

h = GELU(W1 @ x)  # a smoother activation function
output = W2 @ h

python

Generation 3 (Llama / MiniMind, 2023):

gate = SiLU(W_gate @ x)
up = W_up @ x
h = gate * up  # a gating mechanism! ⭐
output = W_down @ h

python

4.2 SwiGLU in detail#

Full name: Swish-Gated Linear Unit

Core idea: use two branches, one controlling the other.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, dim=768, hidden_dim=2048):
        super().__init__()
        # three linear layers
        self.gate_proj = nn.Linear(dim, hidden_dim, bias=False)  # gating branch
        self.up_proj = nn.Linear(dim, hidden_dim, bias=False)    # up-projection branch
        self.down_proj = nn.Linear(hidden_dim, dim, bias=False)  # down-projection

    def forward(self, x):
        # two branches
        gate = self.gate_proj(x)  # [batch, seq, 2048]
        up = self.up_proj(x)      # [batch, seq, 2048]

        # SiLU activation + gating (element-wise product)
        hidden = F.silu(gate) * up

        # compress back to the original dimension
        output = self.down_proj(hidden)
        return output

python

Dimension changes:

input:  [batch, seq, 768]
  ↓
gate:  [batch, seq, 2048]  ← W_gate @ x
up:    [batch, seq, 2048]  ← W_up @ x
  ↓
hidden: [batch, seq, 2048]  ← SiLU(gate) * up
  ↓
output:  [batch, seq, 768]   ← W_down @ hidden

plaintext
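A quick sanity check of those shapes, plus the parameter count. The module restates the class from 4.2 so the snippet runs on its own; the batch and sequence sizes are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, dim=768, hidden_dim=2048):
        super().__init__()
        self.gate_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.up_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.down_proj = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

torch.manual_seed(0)
ffn = SwiGLU()
x = torch.randn(2, 4, 768)            # [batch, seq, 768]

gate = ffn.gate_proj(x)               # [batch, seq, 2048]
up = ffn.up_proj(x)                   # [batch, seq, 2048]
out = ffn(x)                          # [batch, seq, 768]
n_params = sum(p.numel() for p in ffn.parameters())

print(gate.shape, up.shape, out.shape, n_params)
```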

4.3 The SiLU activation function#

# SiLU(x) = x * sigmoid(x)
def silu(x):
    return x * torch.sigmoid(x)

# also known as Swish

python

Compared with common activation functions:

Activation	Formula	Characteristics
ReLU	`max(0, x)`	simple, but the gradient can be 0
GELU	`x * Φ(x)`	smooth, but slightly slower to compute
SiLU/Swish	`x * σ(x)`	smooth and fast to compute ✅

4.4 The power of the gating mechanism#

Intuition: the gate branch controls which information from the up branch gets through.

# example
gate = torch.tensor([0.1, 0.9, 0.5, 0.2])  # gating values (illustrative; SiLU outputs are not limited to [0, 1])
up = torch.tensor([5.0, 3.0, 2.0, 8.0])    # up-projection values

# element-wise product
hidden = gate * up
# = [0.5, 2.7, 1.0, 1.6]

# observe:
# where gate=0.9: most of the information passes (2.7 ≈ 3.0)
# where gate=0.1: only a little passes (0.5 << 5.0)

python

An analogy:

Plain FFN:
  all information passes through the same "gate" (a single activation function)

SwiGLU:
  the gate branch acts like a "security guard," deciding which information from the up branch can enter
  a dynamic, more flexible choice!

plaintext

4.5 Plain FFN vs SwiGLU#

Feature	Plain FFN	SwiGLU
Number of branches	1	2 (gate + up)
Activation	ReLU/GELU	SiLU
Gating	none	yes (gate × up)
Parameters	2 × 768 × 2048	3 × 768 × 2048 (50% more at the same intermediate size)
Compute	less	slightly more
Performance	good	better (in published comparisons)
Models used in	GPT-2, BERT	Llama, MiniMind, PaLM

4.6 Why the extra parameters are worth it#

# parameter comparison (using MiniMind as an example)
plain FFN parameters:
  W1: 768 × 2048 = 1,572,864
  W2: 2048 × 768 = 1,572,864
  total: 3,145,728

SwiGLU parameters:
  gate_proj: 768 × 2048 = 1,572,864
  up_proj: 768 × 2048 = 1,572,864
  down_proj: 2048 × 768 = 1,572,864
  total: 4,718,592 (50% more)

# but!
# Attention layer parameters: 768 × 768 × 4 ≈ 2.4M
# SwiGLU parameters: 4.7M

# FeedForward share of total parameters: 4.7M / (2.4M + 4.7M) ≈ 66%
# improving this part has a big effect on the model as a whole!

python

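Experimental results (from GLU Variants Improve Transformer, the SwiGLU paper listed in the references):

With the intermediate size reduced so that parameter count and compute match the baseline, the gated variants, SwiGLU among them, reach lower perplexity than the ReLU and GELU feed-forward layers
The Llama paper adopts SwiGLU on the strength of those results rather than reporting its own ablation; the size of the gain depends on task and scale, so treat any single percentage as approximate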

5. The division of labor: Attention vs FeedForward#

5.1 The full comparison#

Feature	Attention	FeedForward
How it processes	tokens interact	each token independently
Role	information exchange (a meeting)	deep thinking (digesting independently)
Input	[seq, 768]	[seq, 768]
Intermediate dimension	[seq, seq] (score matrix)	[seq, 2048] (expansion)
Output	[seq, 768]	[seq, 768]
Positional encoding	required (RoPE)	not needed
Parameters	about 33%	about 67%
Compute bottleneck	seq² (quadratic in sequence length)	batch×seq (linear)
Analogy	looking up a dictionary, a meeting	solving a math problem, thinking

5.2 Why you can’t do without either one#

Attention only:

✅ Tokens can interact
❌ Lacks “deep understanding”
It’s like having meetings to discuss but never thinking on your own
The model can’t learn complex patterns

FeedForward only:

✅ Each token can be transformed in complex ways
❌ Has no sense of context
It’s like working in isolation, never listening to others’ input
The model has no idea how tokens relate to one another

The two combined:

Step 1: Attention
  → lets the model know "which tokens are relevant"
  → fuses contextual information

Step 2: FeedForward
  → lets the model know "how to process that information"
  → a deep nonlinear transformation

Step 3: repeat N times
  → refine the understanding layer by layer

plaintext

5.3 A full walkthrough#

sentence: "I love coding"

# ========== Attention stage ==========
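# "love" gathers information from "I" and "coding"
# (illustrative weights, shown without a causal mask; in a causal LM "love" would only attend to "I" and itself)
"love" ← "I" (29%) + "love" (36%) + "coding" (35%)
# "love" now knows: it connects "I" and "coding"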

# ========== FeedForward stage ==========
# "love" thinks deeply based on the information it gathered
representation of "love" [768-dim]
  ↓ expand into a high-dimensional space
[2048-dim]
  ↓ gating mechanism + nonlinear transformation
[2048-dim]  # complex reasoning in the high-dimensional space
  ↓ compress back to the original dimension
[768-dim]  # the refined understanding

# in the end
# "love" has both fused context (Attention)
# and completed a deep understanding (FeedForward)

python

6. Assembling the Transformer Block#

6.1 The four core components#

A recap of the pieces that go into one block (RoPE, covered earlier in the series, is applied inside Attention):

RMSNorm: stabilizes the numbers (normalization)
Attention: tokens interact (information exchange)
FeedForward: independent deepening (deep thinking)
Residual connection: a safety net around each sublayer

6.2 The complete Transformer Block structure#

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, config):
        super().__init__()
        # RMSNorm #1: before Attention
        self.input_layernorm = RMSNorm(config.hidden_size)

        # Multi-Head Attention
        self.self_attn = Attention(config)

        # RMSNorm #2: before FeedForward
        self.post_attention_layernorm = RMSNorm(config.hidden_size)

        # FeedForward (SwiGLU)
        self.mlp = FeedForward(config)

    def forward(self, x, position_embeddings):
        # ========== Part 1: Attention ==========
        # 1. save the input (for the residual connection)
        residual = x

        # 2. RMSNorm #1 (normalization)
        x = self.input_layernorm(x)

        # 3. Multi-Head Attention
        x = self.self_attn(x, position_embeddings)

        # 4. residual connection
        x = residual + x

        # ========== Part 2: FeedForward ==========
        # 5. save the current state (for the residual connection)
        residual = x

        # 6. RMSNorm #2 (normalization)
        x = self.post_attention_layernorm(x)

        # 7. FeedForward (SwiGLU)
        x = self.mlp(x)

        # 8. residual connection
        x = residual + x

        return x

python
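The block above depends on `RMSNorm`, `Attention`, and `FeedForward` from the earlier posts. For readers who want something that runs on its own, here is a self-contained sketch with simplified stand-ins. These are assumptions, not MiniMind's classes: the attention here has a causal mask but no RoPE, no GQA, and no KV cache, and `TinyBlock` and its sizes are made up for the demo.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class CausalSelfAttention(nn.Module):
    """Stand-in attention: multi-head, causal mask, no RoPE, no GQA."""
    def __init__(self, dim, n_heads):
        super().__init__()
        self.n_heads = n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.o_proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x):
        b, t, d = x.shape
        hd = d // self.n_heads
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # [b, t, d] -> [b, heads, t, head_dim]
        q, k, v = (z.reshape(b, t, self.n_heads, hd).transpose(1, 2) for z in (q, k, v))
        scores = q @ k.transpose(-2, -1) / math.sqrt(hd)
        future = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1)
        scores = scores.masked_fill(future, float("-inf"))
        out = F.softmax(scores, dim=-1) @ v
        return self.o_proj(out.transpose(1, 2).reshape(b, t, d))

class SwiGLU(nn.Module):
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.gate_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.up_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.down_proj = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

class TinyBlock(nn.Module):
    """Pre-Norm block: x + Attn(Norm(x)), then x + FFN(Norm(x))."""
    def __init__(self, dim=64, n_heads=4, hidden_dim=176):
        super().__init__()
        self.input_layernorm = RMSNorm(dim)
        self.self_attn = CausalSelfAttention(dim, n_heads)
        self.post_attention_layernorm = RMSNorm(dim)
        self.mlp = SwiGLU(dim, hidden_dim)

    def forward(self, x):
        x = x + self.self_attn(self.input_layernorm(x))
        x = x + self.mlp(self.post_attention_layernorm(x))
        return x

torch.manual_seed(0)
block = TinyBlock()
x = torch.randn(2, 6, 64)
y = block(x)

# Causality check: changing the last token must not change earlier positions.
x_changed = x.clone()
x_changed[:, -1, :] += 1.0
y_changed = block(x_changed)
earlier_unchanged = torch.allclose(y[:, :-1], y_changed[:, :-1], atol=1e-5)
print(y.shape, earlier_unchanged)
```

The causality check at the end is a handy test when you write your own block: Attention is the only place where positions mix, so it is the only place a masking bug can hide.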

6.3 The data flow diagram#

input x: [batch, seq_len, 768]
  ↓
  ├──────┐ (save residual)
  ↓      │
RMSNorm #1  ← normalization
  ↓
Multi-Head Attention (+ RoPE)  ← tokens interact
  ↓
  └──────┘ (add residual) ← residual connection
  ↓
  ├──────┐ (save residual)
  ↓      │
RMSNorm #2  ← normalization
  ↓
FeedForward (SwiGLU)  ← independent deepening
  ↓
  └──────┘ (add residual) ← residual connection
  ↓
output x: [batch, seq_len, 768]

plaintext

6.4 The residual connection#

The formula:

y = x + F(x)

# rather than
y = F(x)  # no residual

python

Three big benefits:

Benefit 1: a safety net#

# worst case: F learns nothing
y = x + 0 = x  # at least you still have the input!

# without the residual
y = F(x) = noise  # completely broken ❌

python

Benefit 2: incremental learning#

# with a residual: only need to learn the "adjustment"
y = x + Δx  # Δx is a small adjustment

# without a residual: need to learn the "full output"
y = F(x)  # F has to learn to build y from scratch

python

An analogy:

Without a residual: each edit fully overwrites the original
  original → filter → new image (the original is lost)

With a residual: original + each adjustment
  original → original + adjustment 1 → original + adjustment 1 + adjustment 2 ...
  all the information is preserved!

plaintext

Benefit 3: a gradient highway#

# backpropagation
dy/dx = 1 + dF/dx

# even if F's gradient vanishes (dF/dx → 0)
dy/dx = 1  # the gradient can still flow back! ✅

# without a residual
dy/dx = dF/dx  # once it vanishes, the path is cut entirely ❌

python
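The same argument can be checked with autograd. In this sketch `F_layer` is a linear layer with all-zero weights, a deliberately extreme stand-in for a sublayer whose gradient has vanished.

```python
import torch
import torch.nn as nn

# F is a layer whose weights are all zero, so dF/dx = 0 everywhere.
F_layer = nn.Linear(4, 4, bias=False)
nn.init.zeros_(F_layer.weight)

x = torch.ones(4, requires_grad=True)

# With the residual: y = x + F(x)
(x + F_layer(x)).sum().backward()
grad_with_residual = x.grad.clone()

# Without the residual: y = F(x)
x.grad = None
F_layer(x).sum().backward()
grad_without_residual = x.grad.clone()

print(grad_with_residual)     # tensor([1., 1., 1., 1.])
print(grad_without_residual)  # tensor([0., 0., 0., 0.])
```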

6.5 Pre-Norm vs Post-Norm#

Post-Norm (the original Transformer, 2017):

# normalization comes after the sublayer
x = x + Attention(x)
x = Norm(x)
x = x + FeedForward(x)
x = Norm(x)

python

Pre-Norm (modern Transformers, Llama / MiniMind):

# normalization comes before the sublayer
x = x + Attention(Norm(x))
x = x + FeedForward(Norm(x))

python

Pre-Norm’s advantages:

Feature	Post-Norm	Pre-Norm
Training stability	difficult for deep networks	more stable ✅
Gradient flow	can be interrupted by Norm	a cleaner residual path ✅
Learning rate	needs warmup	can use a larger learning rate ✅

Most modern LLMs use Pre-Norm (GPT-3, Llama, MiniMind, Mistral, …).
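The "cleaner residual path" row can be made concrete with a sublayer that contributes nothing. In Pre-Norm the input then passes through untouched; in Post-Norm it is still rescaled by the normalization on the main path. This sketch uses `nn.LayerNorm` as a stand-in for the `Norm` in the snippets above, and `silent_sublayer` is a made-up helper.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
norm = nn.LayerNorm(8)   # stand-in for Norm; MiniMind uses RMSNorm

def silent_sublayer(x):
    """A sublayer that contributes nothing."""
    return torch.zeros_like(x)

def post_norm_step(x):
    # Norm sits on the main path, after the residual add.
    return norm(x + silent_sublayer(x))

def pre_norm_step(x):
    # Norm sits inside the branch; the main path is a plain addition.
    return x + silent_sublayer(norm(x))

x = torch.randn(2, 3, 8) * 5.0

pre_keeps_input = torch.equal(pre_norm_step(x), x)
post_keeps_input = torch.allclose(post_norm_step(x), x)
print(pre_keeps_input, post_keeps_input)
```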

7. The complete MiniMind architecture#

7.1 The overall structure#

MiniMindForCausalLM
├─ lm_head: output layer (hidden_size → vocab_size)
│   maps a 768-dim vector to logits over the 6400-token vocabulary (softmax turns them into probabilities)
│
└─ MiniMindModel
    ├─ embed_tokens: token embedding layer (vocab_size → hidden_size)
    │   converts a token ID into a 768-dim vector
    │
    ├─ layers: N TransformerBlocks (8 by default)
    │   └─ TransformerBlock × 8
    │       ├─ input_layernorm: RMSNorm
    │       ├─ self_attn: Multi-Head Attention
    │       ├─ post_attention_layernorm: RMSNorm
    │       └─ mlp: FeedForward (SwiGLU)
    │
    └─ norm: the final RMSNorm
        normalizes the last layer's output once more

plaintext

7.2 The forward pass#

# input
token_ids = [34, 128, 556, 89, ...]  # the IDs for "I love coding"

# ========== token embedding ==========
x = embed_tokens(token_ids)
# [batch, seq_len] → [batch, seq_len, 768]

# ========== precompute RoPE ==========
cos, sin = precompute_freqs_cis(...)  # positional encoding

# ========== TransformerBlock #1 ==========
x = block_1(x, position_embeddings=(cos, sin))
# [batch, seq_len, 768] → [batch, seq_len, 768]

# ========== TransformerBlock #2 ==========
x = block_2(x, position_embeddings=(cos, sin))

# ...

# ========== TransformerBlock #8 ==========
x = block_8(x, position_embeddings=(cos, sin))

# ========== final normalization ==========
x = self.norm(x)
# [batch, seq_len, 768]

# ========== output layer ==========
logits = lm_head(x)
# [batch, seq_len, 768] → [batch, seq_len, 6400]
# each position yields logits for the next token; softmax turns them into a probability distribution

# ========== generation ==========
next_token_id = torch.argmax(logits[:, -1, :], dim=-1)
# greedy decoding: pick the highest-scoring token at the last position

python
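The step from logits to a choice of token is easy to gloss over, so here is a tiny illustration with made-up logits over a 5-token vocabulary. Because softmax preserves ordering, greedy decoding can take the argmax of the logits directly.

```python
import torch

# Made-up logits for the last position over a 5-token vocabulary.
logits = torch.tensor([[2.0, 0.5, -1.0, 3.0, 0.0]])

probs = torch.softmax(logits, dim=-1)            # logits -> probabilities
greedy_from_logits = torch.argmax(logits, dim=-1)
greedy_from_probs = torch.argmax(probs, dim=-1)

print(probs.sum(dim=-1).item(), greedy_from_logits.item(), greedy_from_probs.item())
```

Sampling strategies such as temperature or top-p work on `probs` instead of taking the argmax; they are outside the scope of this post.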

7.3 Parameter breakdown#

# MiniMind2 (104M parameters)

# token embedding
embed_tokens: 6400 × 768 = 4,915,200

# 8 TransformerBlocks
each Block:
  Attention:
    q_proj: 768 × 768 = 589,824
    k_proj: 768 × 768 = 589,824
    v_proj: 768 × 768 = 589,824
    o_proj: 768 × 768 = 589,824
    subtotal: 2,359,296

  FeedForward (SwiGLU):
    gate_proj: 768 × 2048 = 1,572,864
    up_proj: 768 × 2048 = 1,572,864
    down_proj: 2048 × 768 = 1,572,864
    subtotal: 4,718,592

  per-Block total: 7,077,888

8 Blocks: 8 × 7,077,888 = 56,623,104

# output layer
lm_head: 768 × 6400 = 4,915,200

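# grand total for the figures listed above
4,915,200 + 56,623,104 + 4,915,200 = 66,453,504 ≈ 66.5M
# note: these figures do not reach the 104M quoted in the header, so that number
# must come from a different configuration than the one tallied here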

python

Observations:

FeedForward parameters (4.7M) are 2× the Attention parameters (2.4M)!
FeedForward accounts for 67% of each Block’s parameters
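The tally is easy to script, which also makes it easy to try other assumptions. The sketch below covers three cases: the figures as listed above, the same block with K/V projections sized for `num_key_value_heads = 2` (the GQA setting in the config in 7.4), and a third configuration that is purely my assumption (16 blocks, GQA, one matrix shared between `embed_tokens` and `lm_head`), included only because it happens to land on about 104M. Biases and the small RMSNorm weight vectors are ignored throughout.

```python
def block_params(hidden, intermediate, n_heads, n_kv_heads):
    """Weight count for one block, ignoring biases and RMSNorm vectors."""
    head_dim = hidden // n_heads
    kv_dim = n_kv_heads * head_dim
    attention = 2 * hidden * hidden + 2 * hidden * kv_dim   # q, o + k, v
    feed_forward = 3 * hidden * intermediate                # gate, up, down
    return attention, feed_forward

hidden, intermediate, vocab = 768, 2048, 6400
embedding = vocab * hidden

# Case 1: the tally in the text (full-size K/V, 8 blocks, separate lm_head).
attn_mha, ffn = block_params(hidden, intermediate, n_heads=8, n_kv_heads=8)
total_text = 2 * embedding + 8 * (attn_mha + ffn)

# Case 2: K/V sized for num_key_value_heads = 2.
attn_gqa, _ = block_params(hidden, intermediate, n_heads=8, n_kv_heads=2)
ffn_share_gqa = round(ffn / (attn_gqa + ffn), 3)

# Case 3 (assumption, not stated in this post): 16 blocks, GQA, tied embeddings.
total_assumed = embedding + 16 * (attn_gqa + ffn)

print(attn_mha, ffn, total_text)   # 2359296 4718592 66453504
print(attn_gqa, ffn_share_gqa)     # 1474560 0.762
print(total_assumed)               # 104005632
```

With GQA-sized K/V projections, FeedForward's share of a block rises from about 67% to about 76%, so the "other half" is, if anything, understated.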

7.4 Configuration parameters#

# MiniMind config (model/model_minimind.py)
class MiniMindConfig:
    hidden_size = 768           # hidden dimension
    num_hidden_layers = 8       # number of Transformer Block layers
    num_attention_heads = 8     # number of attention heads
    num_key_value_heads = 2     # GQA: number of KV heads
    intermediate_size = 2048    # FFN intermediate dimension
    vocab_size = 6400           # vocabulary size
    max_position_embeddings = 32768  # maximum sequence length
    rope_theta = 1000000.0      # RoPE base frequency
    rms_norm_eps = 1e-5         # RMSNorm epsilon
    use_moe = False             # whether to use MoE

python

8. Hands-on experiments#

To really understand FeedForward, try the following experiments:

Compare a plain FFN with SwiGLU: implement both architectures and compare parameter counts (SwiGLU has 50% more) and training performance.
Verify the necessity of expand-compress: try a direct 768→768 transformation vs 768→2048→768, and observe the difference in fitting ability.
Test the residual connection: compare the training stability and convergence speed of networks with and without a residual connection.

See the MiniMind project’s learning_materials/feedforward_explained.py for the complete experiment code.

9. Summary#

9.1 Key takeaways#

✅ What FeedForward does: independent deepening — a complex nonlinear transformation applied to each token
✅ Why expand-compress is necessary: a high-dimensional space has greater expressive power and can fit complex functions
✅ SwiGLU’s advantage: a gating mechanism with two branches, which beats a plain FFN at matched parameter counts in the GLU variants paper
✅ Attention vs FFN: interaction vs independence, a meeting vs thinking — you can’t do without either
✅ The Transformer Block: two pre-normed sublayers (Attention, then FFN), each wrapped in a residual connection
✅ The residual connection: a safety net + incremental learning + a gradient highway
✅ Pre-Norm: the standard choice for modern deep Transformers

9.2 The Transformer Block mantra#

Norm → Attention → residual
Norm → FeedForward → residual
repeat N times → a complete model!

plaintext

9.3 A look back at the series#

Congratulations on completing the study of MiniMind’s core architecture!

A recap of the four posts:

✅ RMSNorm: the principles of normalization, and why it’s 7.7× faster than LayerNorm
✅ RoPE: positional encoding, with an in-depth analysis of the multi-frequency mechanism
✅ Attention: Q, K, V, and a complete understanding of Multi-Head
✅ FeedForward + architecture: expand-compress, and the full assembly

What you now have:

✅ All the core components of the Transformer
✅ The mathematical principles and code implementation of each component
✅ How the components work together
✅ Why the Transformer “works”
✅ The ability to implement a small Transformer from scratch!

9.4 Suggested next steps#

1. Get hands-on#

# clone MiniMind
git clone https://github.com/jingyaogong/minimind
cd minimind

# train a small model
cd trainer
python train_pretrain.py

bash

2. Go deeper#

GQA (Grouped Query Attention): saves memory
Flash Attention: optimizes compute efficiency
MoE (Mixture of Experts): increases capacity
KV Cache: speeds up inference

3. The implementation challenge#

# challenge: implement MiniMind from scratch
import torch.nn as nn

class MyMiniMind(nn.Module):
    def __init__(self):
        super().__init__()
        # over to you!

python

10. References#

Papers#

Attention Is All You Need ↗ - the original Transformer paper
GLU Variants Improve Transformer ↗ - the SwiGLU paper
Deep Residual Learning for Image Recognition ↗ - ResNet / residual connections

Code#

MiniMind source: github.com/jingyaogong/minimind ↗
This post’s learning materials: learning_materials/feedforward_explained.py
Transformer Block: model/model_minimind.py:359-380

Author: joye Published: 2025-12-30 Last updated: 2025-12-30 Series: MiniMind learning notes (4/4)

If you found this helpful, feel free to:

⭐ Star the original project, MiniMind ↗
⭐ Star my study notes, minimind-notes ↗
💬 Leave a comment with your own learning takeaways
🔗 Share it with other friends learning about LLMs
