Motion-to-Text

Transforming Human Movements into Natural Language

[Figure: Motion-to-Text model overview]

Introduction

Understanding human motion and converting it into natural language is a complex challenge at the intersection of computer vision, deep learning, and natural language processing. This project introduces an autoencoder-based model that translates human motion sequences into textual descriptions using a Transformer-based Motion Encoder and a GPT-2-based Text Decoder.


Model Architecture

The model consists of three main components:

1. Motion Encoder Transformer

The Motion Encoder processes a sequence of human poses (joint positions over time) and encodes them into a latent representation:

  • 1D Convolutions to capture local motion patterns
  • Transformer Encoder to model long-range temporal dependencies
  • Projection Layer to map motion representations to a format compatible with GPT-2
  • L2 Normalization and Noise Injection to stabilize embeddings and improve generalization

Output: a 768-dimensional latent vector representing the motion, sized to match GPT-2’s hidden dimension. A minimal sketch of this encoder follows below.
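
Here is a minimal PyTorch sketch of such an encoder. The class name MotionEncoder and every hyperparameter (pose_dim, hidden_dim, the layer counts, noise_std) are illustrative assumptions, not the project’s actual code:

    import torch
    import torch.nn as nn

    class MotionEncoder(nn.Module):
        def __init__(self, pose_dim=66, hidden_dim=256, latent_dim=768,
                     num_layers=4, num_heads=8, noise_std=0.01):
            super().__init__()
            # 1D convolution over time captures local motion patterns
            # (pose_dim=66 assumes e.g. 22 joints x 3 coordinates per frame)
            self.conv = nn.Sequential(
                nn.Conv1d(pose_dim, hidden_dim, kernel_size=3, padding=1),
                nn.GELU(),
            )
            # Transformer encoder models long-range temporal dependencies
            layer = nn.TransformerEncoderLayer(
                d_model=hidden_dim, nhead=num_heads, batch_first=True)
            self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)
            # Projection into GPT-2's 768-dimensional embedding space
            self.proj = nn.Linear(hidden_dim, latent_dim)
            self.noise_std = noise_std

        def forward(self, poses):                    # poses: (batch, frames, pose_dim)
            x = self.conv(poses.transpose(1, 2))     # -> (batch, hidden_dim, frames)
            x = self.transformer(x.transpose(1, 2))  # -> (batch, frames, hidden_dim)
            z = self.proj(x.mean(dim=1))             # temporal mean-pool -> (batch, 768)
            z = nn.functional.normalize(z, dim=-1)   # L2 normalization
            if self.training:                        # noise injection during training
                z = z + self.noise_std * torch.randn_like(z)
            return z

Mean-pooling over time is one simple way to collapse the frame dimension into a single vector; attention pooling or a learned summary token would work just as well.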


2. Latent Space Representation

The latent space acts as an intermediate representation between motion and text:

  • Dimensionality Reduction to condense motion data
  • Noise Addition to improve robustness
  • Transformation to GPT-2 Embeddings for smooth integration with the text decoder

This intermediate space is what couples the two modalities: the decoder never sees raw poses, only this 768-dimensional summary, which makes it crucial for effective motion-to-text transformation. A minimal sketch of the handoff to GPT-2 follows below.
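
One simple way to realize that handoff, assuming the latent vector conditions GPT-2 as a single prefix embedding prepended to the caption’s token embeddings (the prefix-token framing and all values here are illustrative assumptions):

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")

    z = torch.randn(1, 768)                          # stand-in for an encoder output
    z = torch.nn.functional.normalize(z, dim=-1)     # L2 normalization
    z = z + 0.01 * torch.randn_like(z)               # training-time noise addition

    ids = tokenizer("a person walks forward", return_tensors="pt").input_ids
    tok_emb = gpt2.transformer.wte(ids)              # (1, T, 768) token embeddings
    inputs_embeds = torch.cat([z.unsqueeze(1), tok_emb], dim=1)  # prepend motion token
    logits = gpt2(inputs_embeds=inputs_embeds).logits
    print(logits.shape)                              # (1, T + 1, vocab_size)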


3. Text Decoder with GPT-2

The Text Decoder generates coherent sentences from the latent motion representation:

  • Latent-to-GPT-2 Projection to align motion embeddings with GPT-2’s embedding space
  • Cross-Attention Layer that lets the decoder attend to the motion representation while generating
  • Autoregressive Text Generation for natural sentence construction
  • Controlled Generation using temperature, top-p sampling, and repetition penalties (sketched below)

Example Output:
“A person walks forward and picks something up with their right hand.”
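
A hedged sketch of the controlled-generation step, using Hugging Face’s generate() with the motion latent as a prefix embedding; the sampling values are illustrative rather than the project’s reported settings, and the cross-attention refinement is omitted for brevity:

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")

    prefix = torch.randn(1, 1, 768)       # motion latent as a one-token prefix
    out_ids = gpt2.generate(
        inputs_embeds=prefix,
        max_new_tokens=30,
        do_sample=True,
        temperature=0.8,                  # soften the next-token distribution
        top_p=0.9,                        # nucleus (top-p) sampling
        repetition_penalty=1.2,           # discourage repeated phrases
        pad_token_id=tokenizer.eos_token_id,
    )
    print(tokenizer.decode(out_ids[0], skip_special_tokens=True))

Note that generate() accepts inputs_embeds for decoder-only models only in recent transformers releases; with older versions the sampling loop would need to be written by hand.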


Training Strategy

The training process is divided into two phases:

1. Pretraining Motion Encoder

  • GPT-2 is frozen for the first four epochs to stabilize motion embeddings
  • Layer Normalization and Weight Decay to keep gradients and weight magnitudes stable
  • ReduceLROnPlateau Scheduler to lower the learning rate when the validation loss plateaus (see the sketch below)
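
A minimal sketch of the phase-1 setup, reusing the MotionEncoder from the earlier sketch; the optimizer and scheduler hyperparameters are illustrative:

    import torch
    from transformers import GPT2LMHeadModel

    gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
    encoder = MotionEncoder()              # from the earlier sketch

    for p in gpt2.parameters():            # GPT-2 stays frozen in phase 1
        p.requires_grad = False

    optimizer = torch.optim.AdamW(
        encoder.parameters(), lr=1e-4, weight_decay=1e-2)   # weight decay
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.5, patience=2)      # LR drops on plateau

    for epoch in range(4):                 # the four frozen epochs
        # ... one epoch over (motion, caption) batches would run here ...
        val_loss = 1.0                     # placeholder for this epoch's validation loss
        scheduler.step(val_loss)           # scheduler reacts to validation loss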

2. Fine-Tuning GPT-2

  • Gradual Unfreezing of GPT-2 for better adaptation
  • Scheduled Sampling replaces 10% of ground-truth tokens with model predictions for robustness
  • Gradient Clipping to prevent unstable updates (a sketch of this phase follows below)
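
The sketch below shows one way these three ideas combine in a single training step. The number of unfrozen blocks, the learning rate, and the example sentence are all invented for illustration:

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")

    # Gradual unfreezing: start by training only the top transformer blocks
    for p in gpt2.parameters():
        p.requires_grad = False
    for block in gpt2.transformer.h[-4:]:
        for p in block.parameters():
            p.requires_grad = True

    optimizer = torch.optim.AdamW(
        (p for p in gpt2.parameters() if p.requires_grad), lr=5e-5)

    ids = tokenizer("a person walks forward", return_tensors="pt").input_ids

    # Scheduled sampling: swap ~10% of input tokens for the model's own predictions
    with torch.no_grad():
        preds = gpt2(ids).logits.argmax(dim=-1)          # prediction for token t+1
    shifted = torch.cat([ids[:, :1], preds[:, :-1]], dim=1)
    mask = torch.rand(ids.shape) < 0.10
    mixed = torch.where(mask, shifted, ids)

    loss = gpt2(mixed, labels=ids).loss  # supervision still targets the ground truth
    loss.backward()
    torch.nn.utils.clip_grad_norm_(gpt2.parameters(), max_norm=1.0)  # gradient clipping
    optimizer.step()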

This approach steadily improves the BLEU score and text coherence.
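
For reference, one common way to track that score during validation; NLTK is one library choice among several, and the sentences below are invented for demonstration:

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    reference = ["a person walks forward and picks something up".split()]
    candidate = "a person walks forward and grabs an object".split()
    score = sentence_bleu(reference, candidate,
                          smoothing_function=SmoothingFunction().method1)
    print(f"BLEU: {score:.3f}")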


Results & Key Takeaways

  • Improved Text Quality: The model generates concise and natural descriptions
  • Higher BLEU Score: Optimized training strategies enhance fluency
  • Better Generalization: Noise injection prevents overfitting to specific motion patterns
  • More Robust Predictions: Cross-attention improves motion understanding

Next Steps

  • Extending the model to 3D motion datasets
  • Improving diversity in sentence generation
  • Exploring real-time motion captioning for applications in AR/VR

This project represents a step forward in bridging human motion and language, with potential applications in action recognition, robotics, and accessibility.