Introduction
Recently, I’ve been tinkering with training different kinds of neural networks. Long Short-Term Memory (LSTM) networks are well suited to natural language processing tasks, particularly sentiment analysis. In this demonstration, I’ll build an LSTM-based sentiment classifier using TensorFlow and the IMDB movie reviews dataset, then compare different optimizers and learning rates to understand their impact on model performance.
The Problem: Movie Review Sentiment Analysis
The IMDB dataset contains 50,000 movie reviews labeled as positive or negative, making it perfect for binary sentiment classification. Our goal is to build an LSTM network that can accurately predict whether a review expresses positive or negative sentiment.
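The dataset ships with TensorFlow Datasets, so loading it takes only a few lines. A minimal sketch (the notebook may load it differently):

import tensorflow as tf
import tensorflow_datasets as tfds

# 25,000 labeled reviews each for train and test: 0 = negative, 1 = positive
train_dataset, test_dataset = tfds.load('imdb_reviews',
                                        split=['train', 'test'],
                                        as_supervised=True)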
Architecture Overview
Our LSTM network follows a carefully designed architecture:
- Text Vectorization Layer: Converts raw text into numerical tokens
- Embedding Layer: Maps tokens to dense vector representations
- Dual LSTM Layers: Captures sequential patterns in the text
- Dense Layers: Provides final classification capability
model = tf.keras.Sequential([
    encoder,  # TextVectorization: raw strings -> integer token ids
    tf.keras.layers.Embedding(input_dim=len(encoder.get_vocabulary()),
                              output_dim=EMBEDDING_DIM,
                              mask_zero=True),  # mask padding tokens
    tf.keras.layers.LSTM(64, return_sequences=True),  # pass the full sequence on
    tf.keras.layers.LSTM(64),                         # keep only the final state
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')    # P(review is positive)
])
Design
Hyperparameters
- Vocabulary Size: 1,000 most frequent words
- Embedding Dimension: 64-dimensional word vectors
- Batch Size: 64 samples per batch
- Training Epochs: 10 iterations through the dataset
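In code, these amount to a few constants plus the encoder referenced in the model above. A minimal sketch, assuming the train_dataset and test_dataset pipelines from earlier:

VOCAB_SIZE = 1000     # keep only the 1,000 most frequent words
EMBEDDING_DIM = 64    # size of each word vector
BATCH_SIZE = 64
EPOCHS = 10

# Batch the raw (text, label) pairs
train_dataset = train_dataset.shuffle(10_000).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
test_dataset = test_dataset.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

# Build the vocabulary from the raw training text
encoder = tf.keras.layers.TextVectorization(max_tokens=VOCAB_SIZE)
encoder.adapt(train_dataset.map(lambda text, label: text))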
Architecture Choices
- Dual LSTM Structure: The first LSTM (with return_sequences=True) passes full sequences to the second LSTM, allowing the model to capture both local and global patterns
- Masking: The embedding layer uses mask_zero=True to handle variable-length sequences efficiently (illustrated below)
- Activation Functions: ReLU for the dense layer and sigmoid for binary classification
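To see what the mask does, here is a toy illustration (the values are invented for demonstration):

emb = tf.keras.layers.Embedding(input_dim=10, output_dim=4, mask_zero=True)
tokens = tf.constant([[5, 2, 0, 0]])  # two real tokens followed by padding
print(emb.compute_mask(tokens))       # [[ True  True False False]]
# Downstream LSTM layers receive this mask and skip the padded timesteps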
Systematic Optimizer Comparison
To understand how optimizer choice and learning rate interact, we systematically compare the following:
Optimizers:
- Adam: Adaptive learning rate with momentum
- RMSprop: Adaptive learning rate optimizer
Learning Rates:
- 0.001 (standard)
- 0.0001 (conservative)
- 0.00001 (very conservative)
This creates a 2×3 grid of experiments, totaling 6 different model configurations.
Implementation Strategy
The code implements a clean, reusable training function. A sketch of its shape (build_model, train_dataset, and test_dataset are assumed to be defined as above):

def create_and_train_model(optimizer, learning_rate):
    # Model creation and compilation; optimizer is a class such as
    # tf.keras.optimizers.Adam, build_model() an assumed fresh-model helper
    model = build_model()
    model.compile(loss='binary_crossentropy',
                  optimizer=optimizer(learning_rate=learning_rate),
                  metrics=['accuracy'])
    # Training with validation; the history carries the learning curves
    history = model.fit(train_dataset, epochs=EPOCHS, validation_data=test_dataset)
    return history.history  # per-epoch metrics used for evaluation below
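Running the full 2×3 grid is then two nested loops (again a sketch; the dictionary keys are my own naming choice):

results = {}
for optimizer in (tf.keras.optimizers.Adam, tf.keras.optimizers.RMSprop):
    for lr in (1e-3, 1e-4, 1e-5):
        results[(optimizer.__name__, lr)] = create_and_train_model(optimizer, lr)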
This approach ensures:
- Consistency: Each model trains under identical conditions
- Reproducibility: Results can be easily verified
- Scalability: Easy to add more optimizers or learning rates
Results Analysis
The code automatically generates a comprehensive results table showing training and validation accuracy for each configuration. It also identifies the best-performing model and generates learning curves for detailed analysis; a sketch of the tabulation step follows the list below.
Key metrics tracked:
- Training Accuracy: Performance on the training set
- Validation Accuracy: Performance on unseen test data
- Learning Curves: Training dynamics over epochs
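With the per-run histories collected in results above, both the table and the best-model selection reduce to a few lines (a sketch; the notebook's exact formatting will differ):

print('Optimizer | LR      | Train acc | Val acc')
for (name, lr), h in results.items():
    print(f"{name:9s} | {lr:<7g} | {h['accuracy'][-1]:9.4f} | {h['val_accuracy'][-1]:7.4f}")

best = max(results, key=lambda k: results[k]['val_accuracy'][-1])
print('Best configuration:', best)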
Visualization and Insights
The implementation includes automatic generation of learning curves for the best-performing model, showing both accuracy and loss progression; a plotting sketch appears after the list below. These visualizations help identify:
- Overfitting: Gap between training and validation performance
- Convergence: Whether the model has stabilized
- Optimization Efficiency: How quickly the model learns
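A minimal plotting helper along these lines (assuming matplotlib and the per-epoch history dict returned by create_and_train_model):

import matplotlib.pyplot as plt

def plot_learning_curves(history):
    # history: per-epoch metrics dict, e.g. results[best] from above
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.plot(history['accuracy'], label='train')
    ax1.plot(history['val_accuracy'], label='validation')
    ax1.set_title('Accuracy'); ax1.set_xlabel('epoch'); ax1.legend()
    ax2.plot(history['loss'], label='train')
    ax2.plot(history['val_loss'], label='validation')
    ax2.set_title('Loss'); ax2.set_xlabel('epoch'); ax2.legend()
    plt.show()

plot_learning_curves(results[best])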
Expected Outcomes and Interpretations
Based on the experimental design, we can expect:
- Learning Rate Impact: Lower learning rates typically provide more stable training but may converge slower
- Optimizer Differences: Adam often performs well on NLP tasks due to its adaptive learning rate
- Overfitting Patterns: The dual LSTM architecture may overfit with higher learning rates
Results
Best Model:
Optimizer: rmsprop, Learning Rate: 0.001
Training accuracy: 0.8926
Validation accuracy: 0.8662
Results Table (Training Accuracy / Validation Accuracy):
Optimizer | LR=0.001      | LR=0.0001     | LR=0.00001
----------|---------------|---------------|--------------
adam      | 0.8274/0.8116 | 0.8830/0.8646 | 0.8464/0.8380
rmsprop   | 0.8926/0.8662 | 0.8616/0.8485 | 0.6422/0.6387

Discussion
The winning configuration, RMSprop with a learning rate of 0.001, achieved an impressive 86.62% validation accuracy on the IMDB sentiment analysis task, edging out every Adam configuration. This doesn’t mean RMSprop is always the better optimizer; if anything, one would expect Adam to match or beat RMSprop, since Adam is essentially RMSprop plus momentum. The results were quite close in any case, and Adam outperformed RMSprop at two of the three learning rates.
The complete code and results from this experiment are available as a Python notebook in my GitHub repository. Of course, you are free to do whatever you want with the code.