Introduction
Recently, I’ve been tinkering with training different kinds of neural networks. Long Short-Term Memory (LSTM) networks are well suited to natural language processing tasks, particularly sentiment analysis. In this demonstration, I’ll build an LSTM-based sentiment classifier using TensorFlow and the IMDB movie reviews dataset, then compare different optimizers and learning rates to understand their impact on model performance.
The Problem: Movie Review Sentiment Analysis
The IMDB dataset contains 50,000 movie reviews labeled as positive or negative, making it perfect for binary sentiment classification. Our goal is to build an LSTM network that can accurately predict whether a review expresses positive or negative sentiment.
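The dataset ships with TensorFlow Datasets, so loading it takes only a few lines. A minimal sketch (the notebook may load it differently):

import tensorflow as tf
import tensorflow_datasets as tfds

# 25,000 labeled reviews each for train and test: 0 = negative, 1 = positive
train_dataset, test_dataset = tfds.load('imdb_reviews',
                                        split=['train', 'test'],
                                        as_supervised=True)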
Architecture Overview
Our LSTM network follows a carefully designed architecture:
- Text Vectorization Layer: Converts raw text into numerical tokens
- Embedding Layer: Maps tokens to dense vector representations
- Dual LSTM Layers: Captures sequential patterns in the text
- Dense Layers: Provides final classification capability
model = tf.keras.Sequential([
    encoder,  # TextVectorization: raw strings -> integer token ids
    tf.keras.layers.Embedding(input_dim=len(encoder.get_vocabulary()),
                              output_dim=EMBEDDING_DIM,
                              mask_zero=True),  # mask padding tokens
    tf.keras.layers.LSTM(64, return_sequences=True),  # pass the full sequence on
    tf.keras.layers.LSTM(64),                         # keep only the final state
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')    # P(review is positive)
])
Design
Hyperparameters
- Vocabulary Size: 1,000 most frequent words
- Embedding Dimension: 64-dimensional word vectors
- Batch Size: 64 samples per batch
- Training Epochs: 10 iterations through the dataset
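In code, these amount to a few constants plus the encoder referenced in the model above. A minimal sketch, assuming the train_dataset and test_dataset pipelines from earlier:

VOCAB_SIZE = 1000     # keep only the 1,000 most frequent words
EMBEDDING_DIM = 64    # size of each word vector
BATCH_SIZE = 64
EPOCHS = 10

# Batch the raw (text, label) pairs
train_dataset = train_dataset.shuffle(10_000).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
test_dataset = test_dataset.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

# Build the vocabulary from the raw training text
encoder = tf.keras.layers.TextVectorization(max_tokens=VOCAB_SIZE)
encoder.adapt(train_dataset.map(lambda text, label: text))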
Architecture Choices
- Dual LSTM Structure: The first LSTM (with return_sequences=True) passes full sequences to the second LSTM, allowing the model to capture both local and global patterns
- Masking: The embedding layer uses mask_zero=True to handle variable-length sequences efficiently (illustrated below)
- Activation Functions: ReLU for the dense layer and sigmoid for binary classification
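To see what the mask does, here is a toy illustration (the values are invented for demonstration):

emb = tf.keras.layers.Embedding(input_dim=10, output_dim=4, mask_zero=True)
tokens = tf.constant([[5, 2, 0, 0]])  # two real tokens followed by padding
print(emb.compute_mask(tokens))       # [[ True  True False False]]
# Downstream LSTM layers receive this mask and skip the padded timesteps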
Systematic Optimizer Comparison
To understand how optimizer choice and learning rate interact, we systematically compare the following:
Optimizers:
- Adam: Adaptive learning rate with momentum
- RMSprop: Adaptive learning rate optimizer
Learning Rates:
- 0.001 (standard)
- 0.0001 (conservative)
- 0.00001 (very conservative)
This creates a 2×3 grid of experiments, totaling 6 different model configurations.
Implementation Strategy
The code implements a clean, reusable training function. A sketch of its shape (build_model, train_dataset, and test_dataset are assumed to be defined as above):

def create_and_train_model(optimizer, learning_rate):
    # Model creation and compilation; optimizer is a class such as
    # tf.keras.optimizers.Adam, build_model() an assumed fresh-model helper
    model = build_model()
    model.compile(loss='binary_crossentropy',
                  optimizer=optimizer(learning_rate=learning_rate),
                  metrics=['accuracy'])
    # Training with validation; the history carries the learning curves
    history = model.fit(train_dataset, epochs=EPOCHS, validation_data=test_dataset)
    return history.history  # per-epoch metrics used for evaluation below
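Running the full 2×3 grid is then two nested loops (again a sketch; the dictionary keys are my own naming choice):

results = {}
for optimizer in (tf.keras.optimizers.Adam, tf.keras.optimizers.RMSprop):
    for lr in (1e-3, 1e-4, 1e-5):
        results[(optimizer.__name__, lr)] = create_and_train_model(optimizer, lr)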
This approach ensures:
- Consistency: Each model trains under identical conditions
- Reproducibility: Results can be easily verified
- Scalability: Easy to add more optimizers or learning rates
Results Analysis
The code automatically generates a comprehensive results table showing training and validation accuracy for each configuration. It also identifies the best-performing model and generates learning curves for detailed analysis; a sketch of the tabulation step follows the list below.
Key metrics tracked:
- Training Accuracy: Performance on the training set
- Validation Accuracy: Performance on unseen test data
- Learning Curves: Training dynamics over epochs
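With the per-run histories collected in results above, both the table and the best-model selection reduce to a few lines (a sketch; the notebook's exact formatting will differ):

print('Optimizer | LR      | Train acc | Val acc')
for (name, lr), h in results.items():
    print(f"{name:9s} | {lr:<7g} | {h['accuracy'][-1]:9.4f} | {h['val_accuracy'][-1]:7.4f}")

best = max(results, key=lambda k: results[k]['val_accuracy'][-1])
print('Best configuration:', best)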
Visualization and Insights
The implementation includes automatic generation of learning curves for the best-performing model, showing both accuracy and loss progression; a plotting sketch appears after the list below. These visualizations help identify:
- Overfitting: Gap between training and validation performance
- Convergence: Whether the model has stabilized
- Optimization Efficiency: How quickly the model learns
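A minimal plotting helper along these lines (assuming matplotlib and the per-epoch history dict returned by create_and_train_model):

import matplotlib.pyplot as plt

def plot_learning_curves(history):
    # history: per-epoch metrics dict, e.g. results[best] from above
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.plot(history['accuracy'], label='train')
    ax1.plot(history['val_accuracy'], label='validation')
    ax1.set_title('Accuracy'); ax1.set_xlabel('epoch'); ax1.legend()
    ax2.plot(history['loss'], label='train')
    ax2.plot(history['val_loss'], label='validation')
    ax2.set_title('Loss'); ax2.set_xlabel('epoch'); ax2.legend()
    plt.show()

plot_learning_curves(results[best])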
Expected Outcomes and Interpretations
Based on the experimental design, we can expect:
- Learning Rate Impact: Lower learning rates typically provide more stable training but may converge slower
- Optimizer Differences: Adam often performs well on NLP tasks due to its adaptive learning rate
- Overfitting Patterns: The dual LSTM architecture may overfit with higher learning rates
Results
Best Model:
Optimizer: rmsprop, Learning Rate: 0.001
Training accuracy: 0.8926
Validation accuracy: 0.8662
Results Table (Training Accuracy / Validation Accuracy):
Optimizer | LR=0.001      | LR=0.0001     | LR=0.00001
----------|---------------|---------------|--------------
adam      | 0.8274/0.8116 | 0.8830/0.8646 | 0.8464/0.8380
rmsprop   | 0.8926/0.8662 | 0.8616/0.8485 | 0.6422/0.6387

Discussion
The winning configuration, RMSprop with a learning rate of 0.001, achieved an impressive 86.62% validation accuracy on the IMDB sentiment analysis task, edging out every Adam configuration. This doesn’t mean RMSprop is always the better optimizer; if anything, one would expect Adam to match or beat RMSprop, since Adam is essentially RMSprop plus momentum. The results were quite close in any case, and Adam outperformed RMSprop at two of the three learning rates.
The complete code and results from this experiment are available as a Python notebook in my GitHub repository. Of course, you are free to do whatever you want with the code.