A recurrent neural network (RNN), as opposed to a regular fully connected neural network (FCNN) or just neural network (NN), has layers that are connected to themselves.
In an FCNN there are no connections between nodes in a single layer. For instance, \( h_1^1 \) is not connected to \( h_2^1 \). In addition, the input and output are always of a fixed length.
In an RNN, however, this is no longer the case. Nodes in the hidden layers are connected to themselves.
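A common way of writing this recurrence (our own choice of notation, kept consistent with the LSTM equations further below) is
$$
\mathbf{h}^{(t)} = \tanh(W_{hx}\mathbf{x}^{(t)} + W_{hh}\mathbf{h}^{(t-1)} + \mathbf{b}_h),
$$
so that the hidden state at time \( t \) depends both on the current input \( \mathbf{x}^{(t)} \) and on the hidden state \( \mathbf{h}^{(t-1)} \) from the previous time step.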
Recurrent neural networks work very well with sequential data, that is, data where the order matters. In a regular fully connected network, the order of the input does not really matter.
Another property of RNNs is that they can handle input and output of variable length. Consider again the simplified breast cancer dataset. If you have trained a regular FCNN on the dataset with the two features, it makes no sense to suddenly add a third feature: the network would not know what to do with it and would reject inputs with three features (or any other number of features that is not two, for that matter).
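As a minimal sketch of this point (our own illustration in PyTorch, which is also used later in this section), the same recurrent layer can process sequences of different lengths, whereas a fully connected layer is tied to a fixed input size:

import torch
import torch.nn as nn

# A single recurrent layer with two input features per time step
rnn = nn.RNN(input_size=2, hidden_size=8, batch_first=True)

short_seq = torch.randn(1, 5, 2)    # batch of 1, 5 time steps, 2 features each
long_seq = torch.randn(1, 50, 2)    # same layer, 50 time steps

out_short, _ = rnn(short_seq)       # output shape (1, 5, 8)
out_long, _ = rnn(long_seq)         # output shape (1, 50, 8)
print(out_short.shape, out_long.shape)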







See notebook at https://github.com/CompPhysics/AdvancedMachineLearning/blob/main/doc/pub/week7/ipynb/rnnmath.ipynb
Besides a simple recurrent neural network layer, as discussed above, there are two other commonly used types of recurrent neural network layers: Long Short Term Memory (LSTM) and Gated Recurrent Unit (GRU). For a short introduction to these layers see for example https://medium.com/mindboard/lstm-vs-gru-experimental-comparison-955820c21e8b.
The LSTM uses a memory cell to model long-range dependencies and avoid vanishing-gradient problems. It is capable of modeling longer-term dependencies by having memory cells and gates that control the information flow along with the memory cells. The LSTM was first introduced to overcome the vanishing gradient problem.
To preserve information for a long time in the activities of an RNN, we use a circuit that implements an analog memory cell.
The LSTM is a unit cell that is made of three gates: the forget gate, the input gate and the output gate, all discussed below.
It also introduces a cell state \( c \), which can be thought of as the long-term memory, and a hidden state \( h \) which can be thought of as the short-term memory.

The first stage is called the forget gate, where we combine the input at time \( t \) with the hidden state from time \( t-1 \), pass the combination through the Sigmoid activation function, and then perform an element-wise multiplication with the previous cell state, denoted by \( \odot \).
Mathematically we have (see also figure below)
$$
\mathbf{f}^{(t)} = \sigma(W_{fx}\mathbf{x}^{(t)} + W_{fh}\mathbf{h}^{(t-1)} + \mathbf{b}_f)
$$
where the \( W \)s are the weights to be trained and \( \mathbf{b}_f \) is the bias of the forget gate.





The naming forget gate stems from the fact that the Sigmoid activation function's output is very close to \( 0 \) if its argument is very negative, and very close to \( 1 \) if the argument is very positive. Hence we can control the amount of information we want to take from the long-term memory.
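As a small numerical illustration (our own sketch, not part of the lecture code), the saturation of the Sigmoid at \( 0 \) and \( 1 \) means that the forget gate acts as a soft, element-wise switch on the long-term memory:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical gate pre-activations: very negative, zero, very positive
z = np.array([-10.0, 0.0, 10.0])
f = sigmoid(z)                       # approximately [0.00005, 0.5, 0.99995]
c_prev = np.array([2.0, 2.0, 2.0])   # previous cell state (long-term memory)
print(f * c_prev)                    # approximately [0.0001, 1.0, 1.9999]: forget, halve, keep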

The next stage is the input gate, which consists of a Sigmoid function (\( \sigma_i \)), which decides what percentage of the input will be stored in the long-term memory, and a \( \tanh \) function, which produces the candidate values that could be stored in the long-term memory. These two results are multiplied element-wise, and the product is then added to the cell state, that is, to the long-term memory; this addition is denoted by \( \oplus \).
We have
$$
\mathbf{i}^{(t)} = \sigma_i(W_{ix}\mathbf{x}^{(t)} + W_{ih}\mathbf{h}^{(t-1)} + \mathbf{b}_i),
$$
and
$$
\mathbf{g}^{(t)} = \tanh(W_{gx}\mathbf{x}^{(t)} + W_{gh}\mathbf{h}^{(t-1)} + \mathbf{b}_g),
$$
where again the \( W \)s are the weights to be trained.

The forget gate and the input gate together also update the cell state with the following equation,
$$
\mathbf{c}^{(t)} = \mathbf{f}^{(t)} \odot \mathbf{c}^{(t-1)} + \mathbf{i}^{(t)} \odot \mathbf{g}^{(t)},
$$
where \( \mathbf{f}^{(t)} \) and \( \mathbf{i}^{(t)} \) are the outputs of the forget gate and the input gate, respectively.

The final stage of the LSTM is the output gate, and its purpose is to update the short-term memory. To achieve this, we take the newly generated long-term memory and process it through a hyperbolic tangent (\( \tanh \)) function, creating a potential new short-term memory. We then multiply this potential memory by the output of the Sigmoid function (\( \sigma_o \)). This multiplication generates the final output as well as the input for the next hidden cell (\( \mathbf{h}^{(t)} \)) within the LSTM cell.
We have
$$
\begin{aligned}
\mathbf{o}^{(t)} &= \sigma_o(W_{ox}\mathbf{x}^{(t)} + W_{oh}\mathbf{h}^{(t-1)} + \mathbf{b}_o), \\
\mathbf{h}^{(t)} &= \mathbf{o}^{(t)} \odot \tanh(\mathbf{c}^{(t)}), \\
\end{aligned}
$$
where \( W_{ox} \) and \( W_{oh} \) are the weights of the output gate and \( \mathbf{b}_o \) is the bias of the output gate.
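To tie the equations together, here is a minimal NumPy sketch of a single LSTM cell step (our own illustration with randomly initialized weights; the variable names mirror the equations above and are not taken from any library API):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x_t, h_prev, c_prev, params):
    # One step of the LSTM equations above; params holds (W_*x, W_*h, b_*) for each gate
    f = sigmoid(params['Wfx'] @ x_t + params['Wfh'] @ h_prev + params['bf'])   # forget gate
    i = sigmoid(params['Wix'] @ x_t + params['Wih'] @ h_prev + params['bi'])   # input gate
    g = np.tanh(params['Wgx'] @ x_t + params['Wgh'] @ h_prev + params['bg'])   # candidate memory
    o = sigmoid(params['Wox'] @ x_t + params['Woh'] @ h_prev + params['bo'])   # output gate
    c = f * c_prev + i * g             # new cell state (long-term memory)
    h = o * np.tanh(c)                 # new hidden state (short-term memory)
    return h, c

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4
params = {}
for gate in 'figo':
    params['W' + gate + 'x'] = rng.normal(size=(n_hidden, n_in))
    params['W' + gate + 'h'] = rng.normal(size=(n_hidden, n_hidden))
    params['b' + gate] = np.zeros(n_hidden)

h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for t in range(5):                      # feed a short random sequence through the cell
    x_t = rng.normal(size=n_in)
    h, c = lstm_cell_step(x_t, h, c, params)
print(h)

In practice one relies on library implementations such as torch.nn.LSTM or tf.keras.layers.LSTM, as in the examples below.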
LSTMs provide a basic approach for modeling long-range dependencies in sequences. If you wish to read more, see An Empirical Exploration of Recurrent Network Architectures by Rafal Jozefowicz et al., Proceedings of ICML, pages 2342-2350, 2015.
An important recent development is the so-called gated recurrent unit (GRU); see for example the article by Junyoung Chung et al. at https://arxiv.org/abs/1412.3555. This article is an excellent read if you are interested in learning more about these modern RNN architectures.
GRUs have a simpler architecture than LSTMs. This leads to computationally more efficient methods, while their performance on some tasks, such as polyphonic music modeling, is comparable to that of LSTMs.
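To make the simpler-architecture point concrete, here is a small sketch of our own (using PyTorch) that compares the number of trainable parameters in GRU and LSTM layers of the same size; the LSTM has four gate blocks while the GRU has three, so the GRU ends up roughly 25 percent smaller:

import torch.nn as nn

hidden = 128
lstm = nn.LSTM(input_size=28, hidden_size=hidden, batch_first=True)
gru = nn.GRU(input_size=28, hidden_size=hidden, batch_first=True)

def count_params(module):
    return sum(p.numel() for p in module.parameters())

print('LSTM parameters:', count_params(lstm))   # four gate blocks
print('GRU parameters :', count_params(gru))    # three gate blocks

The two examples that follow train an LSTM to classify the MNIST digits, first with TensorFlow/Keras and then with PyTorch, treating each 28x28 image as a sequence of 28 rows with 28 features each.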
"""
Key points:
1. The input images (28x28 pixels) are treated as sequences of 28 timesteps with 28 features each
2. The LSTM layer processes this sequential data
3. A final dense layer with softmax activation handles the classification
4. Typical accuracy ranges between 95-98% (lower than CNNs but reasonable for demonstration)
Note: LSTMs are not typically used for image classification (CNNs are more efficient), but this demonstrates how to adapt them for such tasks. Training might take longer compared to CNN architectures.
To improve performance, you could (a sketch of some of these options follows the code below):
1. Add more LSTM layers
2. Use Bidirectional LSTMs
3. Increase the number of units
4. Add dropout for regularization
5. Use learning rate scheduling
"""
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.utils import to_categorical
# Load and preprocess data
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
# Normalize pixel values to [0, 1]
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0
# Reshape data for LSTM (samples, timesteps, features)
# MNIST images are 28x28, so we treat each image as 28 timesteps of 28 features
x_train = x_train.reshape((-1, 28, 28))
x_test = x_test.reshape((-1, 28, 28))
# Convert labels to one-hot encoding
y_train = to_categorical(y_train, 10)
y_test = to_categorical(y_test, 10)
# Build LSTM model
model = Sequential()
model.add(LSTM(128, input_shape=(28, 28))) # 128 LSTM units
model.add(Dense(10, activation='softmax'))
# Compile the model
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
# Display model summary
model.summary()
# Train the model
history = model.fit(x_train, y_train,
                    batch_size=64,
                    epochs=10,
                    validation_split=0.2)
# Evaluate on test data
test_loss, test_acc = model.evaluate(x_test, y_test, verbose=2)
print(f'\nTest accuracy: {test_acc:.4f}')
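As a follow-up to the improvement suggestions listed before the code, here is a sketch of our own (not tuned or benchmarked) that combines a bidirectional first LSTM layer, a second stacked LSTM layer and dropout for regularization:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, LSTM, Bidirectional, Dense, Dropout

# A deeper variant of the model above: stacked LSTMs, the first bidirectional,
# with dropout between layers
deeper_model = Sequential([
    Input(shape=(28, 28)),
    Bidirectional(LSTM(128, return_sequences=True)),
    Dropout(0.2),
    LSTM(64),
    Dropout(0.2),
    Dense(10, activation='softmax')
])
deeper_model.compile(loss='categorical_crossentropy',
                     optimizer='adam',
                     metrics=['accuracy'])
deeper_model.summary()

The same MNIST task is solved with PyTorch in the next example.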
"""
Key components:
1. **Data Handling**: Uses PyTorch DataLoader with MNIST dataset
2. **LSTM Architecture**:
- Input sequence of 28 timesteps (image rows)
- 128 hidden units in LSTM layer
- Fully connected layer for classification
3. **Training**:
- Cross-entropy loss
- Adam optimizer
- Automatic GPU utilization if available
This implementation typically achieves **97-98% accuracy** after 10 epochs. The main differences from the TensorFlow/Keras version:
- Explicit device management (CPU/GPU)
- Manual training loop
- Different data loading pipeline
- More explicit tensor reshaping
To improve performance, you could:
1. Add dropout regularization
2. Use bidirectional LSTM
3. Implement learning rate scheduling
4. Add batch normalization
5. Increase model capacity (more layers/units)
"""
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
# Hyperparameters
input_size = 28 # Number of features (pixels per row)
hidden_size = 128 # LSTM hidden state size
num_classes = 10 # Digits 0-9
num_epochs = 10 # Training iterations
batch_size = 64 # Batch size
learning_rate = 0.001
# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# MNIST dataset
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))  # MNIST mean and std
])
train_dataset = datasets.MNIST(root='./data',
                               train=True,
                               transform=transform,
                               download=True)
test_dataset = datasets.MNIST(root='./data',
                              train=False,
                              transform=transform)
train_loader = DataLoader(dataset=train_dataset,
                          batch_size=batch_size,
                          shuffle=True)
test_loader = DataLoader(dataset=test_dataset,
                         batch_size=batch_size,
                         shuffle=False)
# LSTM model
class LSTMModel(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(LSTMModel, self).__init__()
        self.hidden_size = hidden_size
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        # Reshape input to (batch_size, sequence_length, input_size)
        x = x.reshape(-1, 28, 28)
        # Forward propagate LSTM
        out, _ = self.lstm(x)  # out: (batch_size, seq_length, hidden_size)
        # Decode the hidden state of the last time step
        out = out[:, -1, :]
        out = self.fc(out)
        return out
# Initialize model
model = LSTMModel(input_size, hidden_size, num_classes).to(device)
# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
# Training loop
total_step = len(train_loader)
for epoch in range(num_epochs):
    model.train()
    for i, (images, labels) in enumerate(train_loader):
        images = images.to(device)
        labels = labels.to(device)
        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)
        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if (i+1) % 100 == 0:
            print(f'Epoch [{epoch+1}/{num_epochs}], Step [{i+1}/{total_step}], Loss: {loss.item():.4f}')
# Test the model
model.eval()
with torch.no_grad():
    correct = 0
    total = 0
    for images, labels in test_loader:
        images = images.to(device)
        labels = labels.to(device)
        outputs = model(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
    print(f'Test Accuracy: {100 * correct / total:.2f}%')
print('Training finished.')
Let us illustrate how we could train an RNN using data from the solution of a well-known differential equation, namely Newton's equation for oscillatory motion for an object being forced into harmonic oscillations by an applied external force.
We will start with the basic algorithm for solving this type of equation using the Runge-Kutta-4 approach. The first code example is a standalone differential equation solver that yields positions and velocities as functions of time, starting at an initial time \( t_0 \) and ending at a final time.
The data the program produces will in turn be used to train an RNN on a selected set of training data. With the trained RNN we will then make predictions for data not included in the training. That is, we will train a model which should be able to reproduce positions and velocities not included in the training data.
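In the dimensionless units used in the code below (the mass is set to one, with damping coefficient \( \gamma \), driving frequency \( \tilde{\Omega} \) and driving amplitude \( \tilde{F} \)), the equation of motion we solve reads
$$
\frac{d^2x}{dt^2} + 2\gamma\frac{dx}{dt} + x = \tilde{F}\cos(\tilde{\Omega} t),
$$
so the acceleration returned by the SpringForce function below is \( a = -2\gamma v - x + \tilde{F}\cos(\tilde{\Omega} t) \).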
import numpy as np
import pandas as pd
from math import *
import matplotlib.pyplot as plt
import os
# Where to save the figures and data files
PROJECT_ROOT_DIR = "Results"
FIGURE_ID = "Results/FigureFiles"
DATA_ID = "DataFiles/"
if not os.path.exists(PROJECT_ROOT_DIR):
    os.mkdir(PROJECT_ROOT_DIR)
if not os.path.exists(FIGURE_ID):
    os.makedirs(FIGURE_ID)
if not os.path.exists(DATA_ID):
    os.makedirs(DATA_ID)

def image_path(fig_id):
    return os.path.join(FIGURE_ID, fig_id)

def data_path(dat_id):
    return os.path.join(DATA_ID, dat_id)

def save_fig(fig_id):
    plt.savefig(image_path(fig_id) + ".png", format='png')

def SpringForce(v,x,t):
    # note here that we have divided by mass and we return the acceleration
    return -2*gamma*v-x+Ftilde*cos(t*Omegatilde)

def RK4(v,x,t,n,Force):
    for i in range(n-1):
        # Setting up k1
        k1x = DeltaT*v[i]
        k1v = DeltaT*Force(v[i],x[i],t[i])
        # Setting up k2
        vv = v[i]+k1v*0.5
        xx = x[i]+k1x*0.5
        k2x = DeltaT*vv
        k2v = DeltaT*Force(vv,xx,t[i]+DeltaT*0.5)
        # Setting up k3
        vv = v[i]+k2v*0.5
        xx = x[i]+k2x*0.5
        k3x = DeltaT*vv
        k3v = DeltaT*Force(vv,xx,t[i]+DeltaT*0.5)
        # Setting up k4
        vv = v[i]+k3v
        xx = x[i]+k3x
        k4x = DeltaT*vv
        k4v = DeltaT*Force(vv,xx,t[i]+DeltaT)
        # Final result
        x[i+1] = x[i]+(k1x+2*k2x+2*k3x+k4x)/6.
        v[i+1] = v[i]+(k1v+2*k2v+2*k3v+k4v)/6.
        t[i+1] = t[i] + DeltaT
# Main part begins here
DeltaT = 0.001
#set up arrays
tfinal = 20 # in dimensionless time
n = ceil(tfinal/DeltaT)
# set up arrays for t, v, and x
t = np.zeros(n)
v = np.zeros(n)
x = np.zeros(n)
# Initial conditions (can change to more than one dim)
x0 = 1.0
v0 = 0.0
x[0] = x0
v[0] = v0
gamma = 0.2
Omegatilde = 0.5
Ftilde = 1.0
# Start integrating using the RK4 method
# Note that we pass the SpringForce function defined above as the force
RK4(v,x,t,n,SpringForce)
# Plot position as function of time
fig, ax = plt.subplots()
ax.set_ylabel('x[m]')
ax.set_xlabel('t[s]')
ax.plot(t, x)
fig.tight_layout()
save_fig("ForcedBlockRK4")
plt.show()
In the code below we have reworked the previous example in order to generate data that can be handled by a recurrent neural network and used to train our model.
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import matplotlib.pyplot as plt
# Newton's equation for harmonic oscillations with external force
# Global parameters
gamma = 0.2 # damping
Omegatilde = 0.5 # driving frequency
Ftilde = 1.0 # driving amplitude
def spring_force(v, x, t):
    """
    SpringForce:
    note: divided by mass => returns acceleration
    a = -2*gamma*v - x + Ftilde*cos(Omegatilde * t)
    """
    return -2.0 * gamma * v - x + Ftilde * np.cos(Omegatilde * t)

def rk4_trajectory(DeltaT=0.001, tfinal=20.0, x0=1.0, v0=0.0):
    """
    Returns t, x, v arrays.
    """
    n = int(np.ceil(tfinal / DeltaT))
    t = np.zeros(n, dtype=np.float32)
    x = np.zeros(n, dtype=np.float32)
    v = np.zeros(n, dtype=np.float32)
    x[0] = x0
    v[0] = v0
    for i in range(n - 1):
        # k1
        k1x = DeltaT * v[i]
        k1v = DeltaT * spring_force(v[i], x[i], t[i])
        # k2
        vv = v[i] + 0.5 * k1v
        xx = x[i] + 0.5 * k1x
        k2x = DeltaT * vv
        k2v = DeltaT * spring_force(vv, xx, t[i] + 0.5 * DeltaT)
        # k3
        vv = v[i] + 0.5 * k2v
        xx = x[i] + 0.5 * k2x
        k3x = DeltaT * vv
        k3v = DeltaT * spring_force(vv, xx, t[i] + 0.5 * DeltaT)
        # k4
        vv = v[i] + k3v
        xx = x[i] + k3x
        k4x = DeltaT * vv
        k4v = DeltaT * spring_force(vv, xx, t[i] + DeltaT)
        # Update
        x[i + 1] = x[i] + (k1x + 2.0 * k2x + 2.0 * k3x + k4x) / 6.0
        v[i + 1] = v[i] + (k1v + 2.0 * k2v + 2.0 * k3v + k4v) / 6.0
        t[i + 1] = t[i] + DeltaT
    return t, x, v
# Sequence generation for RNN training
def create_sequences(x, seq_len):
    """
    Given a 1D array x (e.g., position as a function of time),
    create input/target sequences for next-step prediction.
    Inputs:  [x_i, x_{i+1}, ..., x_{i+seq_len-1}]
    Targets: [x_{i+1}, ..., x_{i+seq_len}]
    """
    xs = []
    ys = []
    for i in range(len(x) - seq_len):
        seq_x = x[i : i + seq_len]
        seq_y = x[i + 1 : i + seq_len + 1]  # shifted by one step
        xs.append(seq_x)
        ys.append(seq_y)
    xs = np.array(xs, dtype=np.float32)  # shape: (num_samples, seq_len)
    ys = np.array(ys, dtype=np.float32)  # shape: (num_samples, seq_len)
    # Add feature dimension (1 feature: the position x)
    xs = np.expand_dims(xs, axis=-1)  # (num_samples, seq_len, 1)
    ys = np.expand_dims(ys, axis=-1)  # (num_samples, seq_len, 1)
    return xs, ys
class OscillatorDataset(Dataset):
    def __init__(self, seq_len=50, DeltaT=0.001, tfinal=20.0, x0=1.0, v0=0.0):
        t, x, v = rk4_trajectory(DeltaT=DeltaT, tfinal=tfinal, x0=x0, v0=v0)
        self.t = t
        self.x = x
        self.v = v
        xs, ys = create_sequences(x, seq_len=seq_len)
        self.inputs = torch.from_numpy(xs)    # (N, seq_len, 1)
        self.targets = torch.from_numpy(ys)   # (N, seq_len, 1)

    def __len__(self):
        return self.inputs.shape[0]

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]
# RNN model (LSTM-based in this example)
class RNNPredictor(nn.Module):
    def __init__(self, input_size=1, hidden_size=32, num_layers=1, output_size=1):
        super().__init__()
        self.lstm = nn.LSTM(input_size=input_size,
                            hidden_size=hidden_size,
                            num_layers=num_layers,
                            batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # x: (batch, seq_len, input_size)
        out, _ = self.lstm(x)  # out: (batch, seq_len, hidden_size)
        out = self.fc(out)     # (batch, seq_len, output_size)
        return out
# Training part
def train_model(
    seq_len=50,
    DeltaT=0.001,
    tfinal=20.0,
    batch_size=64,
    num_epochs=10,
    hidden_size=64,
    lr=1e-3,
    device=None,
):
    if device is None:
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"Using device: {device}")
    # Dataset & DataLoader
    dataset = OscillatorDataset(seq_len=seq_len, DeltaT=DeltaT, tfinal=tfinal)
    train_loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    # Model, loss, optimizer
    model = RNNPredictor(input_size=1, hidden_size=hidden_size, output_size=1)
    model.to(device)
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    # Training loop
    model.train()
    for epoch in range(num_epochs):
        epoch_loss = 0.0
        for batch_x, batch_y in train_loader:
            batch_x = batch_x.to(device)
            batch_y = batch_y.to(device)
            optimizer.zero_grad()
            outputs = model(batch_x)
            loss = criterion(outputs, batch_y)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item() * batch_x.size(0)
        epoch_loss /= len(train_loader.dataset)
        print(f"Epoch {epoch+1}/{num_epochs}, Loss: {epoch_loss:.6f}")
    return model, dataset
# Evaluation / visualization
def evaluate_and_plot(model, dataset, seq_len=50, device=None):
    if device is None:
        device = next(model.parameters()).device
    model.eval()
    with torch.no_grad():
        # Take a single sequence from the dataset
        x_seq, y_seq = dataset[0]  # shapes: (seq_len, 1), (seq_len, 1)
        x_input = x_seq.unsqueeze(0).to(device)  # (1, seq_len, 1)
        # Model prediction (next-step for whole sequence)
        y_pred = model(x_input).cpu().numpy().squeeze(-1).squeeze(0)  # (seq_len,)
        # True target
        y_true = y_seq.numpy().squeeze(-1)  # (seq_len,)
    # Plot comparison
    plt.figure()
    plt.plot(y_true, label="True x(t+Δt)", linestyle="-")
    plt.plot(y_pred, label="Predicted x(t+Δt)", linestyle="--")
    plt.xlabel("Time step in sequence")
    plt.ylabel("Position")
    plt.legend()
    plt.title("RNN next-step prediction on oscillator trajectory")
    plt.tight_layout()
    plt.show()
# Main part: train the network on the oscillator data and compare its predictions with the true trajectory
if __name__ == "__main__":
    # Hyperparameters can be tweaked as you like
    seq_len = 50
    DeltaT = 0.001
    tfinal = 20.0
    num_epochs = 10
    batch_size = 64
    hidden_size = 64
    lr = 1e-3
    model, dataset = train_model(
        seq_len=seq_len,
        DeltaT=DeltaT,
        tfinal=tfinal,
        batch_size=batch_size,
        num_epochs=num_epochs,
        hidden_size=hidden_size,
        lr=lr,
    )
    evaluate_and_plot(model, dataset, seq_len=seq_len)