Motion-to-Text

Transforming Human Movements into Natural Language

[Figure: Motion-to-Text model overview]

Introduction

Understanding human motion and converting it into natural language is a complex challenge at the intersection of computer vision, deep learning, and natural language processing. This project introduces an autoencoder-based model that translates human motion sequences into textual descriptions using a Transformer-based Motion Encoder and a GPT-2-based Text Decoder.


Model Architecture

The model consists of three main components:

1. Motion Encoder Transformer

The Motion Encoder processes a sequence of human poses (joint positions over time) and encodes them into a latent representation:

  • 1D Convolutions to capture local motion patterns
  • Transformer Encoder to model long-range temporal dependencies
  • Projection Layer to map motion representations to a format compatible with GPT-2
  • L2 Normalization and Noise Injection to stabilize embeddings and improve generalization

Output: a 768-dimensional latent vector representing the motion, sized to match GPT-2’s hidden dimension. A minimal sketch of this encoder follows below.
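
Here is a minimal PyTorch sketch of such an encoder. The class name MotionEncoder and every hyperparameter (pose_dim, hidden_dim, the layer counts, noise_std) are illustrative assumptions, not the project’s actual code:

    import torch
    import torch.nn as nn

    class MotionEncoder(nn.Module):
        def __init__(self, pose_dim=66, hidden_dim=256, latent_dim=768,
                     num_layers=4, num_heads=8, noise_std=0.01):
            super().__init__()
            # 1D convolution over time captures local motion patterns
            # (pose_dim=66 assumes e.g. 22 joints x 3 coordinates per frame)
            self.conv = nn.Sequential(
                nn.Conv1d(pose_dim, hidden_dim, kernel_size=3, padding=1),
                nn.GELU(),
            )
            # Transformer encoder models long-range temporal dependencies
            layer = nn.TransformerEncoderLayer(
                d_model=hidden_dim, nhead=num_heads, batch_first=True)
            self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)
            # Projection into GPT-2's 768-dimensional embedding space
            self.proj = nn.Linear(hidden_dim, latent_dim)
            self.noise_std = noise_std

        def forward(self, poses):                    # poses: (batch, frames, pose_dim)
            x = self.conv(poses.transpose(1, 2))     # -> (batch, hidden_dim, frames)
            x = self.transformer(x.transpose(1, 2))  # -> (batch, frames, hidden_dim)
            z = self.proj(x.mean(dim=1))             # temporal mean-pool -> (batch, 768)
            z = nn.functional.normalize(z, dim=-1)   # L2 normalization
            if self.training:                        # noise injection during training
                z = z + self.noise_std * torch.randn_like(z)
            return z

Mean-pooling over time is one simple way to collapse the frame dimension into a single vector; attention pooling or a learned summary token would work just as well.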


2. Latent Space Representation

The latent space acts as an intermediate representation between motion and text:

  • Dimensionality Reduction to condense motion data
  • Noise Addition to improve robustness
  • Transformation to GPT-2 Embeddings for smooth integration with the text decoder

This intermediate space is what couples the two modalities: the decoder never sees raw poses, only this 768-dimensional summary, which makes it crucial for effective motion-to-text transformation. A minimal sketch of the handoff to GPT-2 follows below.
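
One simple way to realize that handoff, assuming the latent vector conditions GPT-2 as a single prefix embedding prepended to the caption’s token embeddings (the prefix-token framing and all values here are illustrative assumptions):

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")

    z = torch.randn(1, 768)                          # stand-in for an encoder output
    z = torch.nn.functional.normalize(z, dim=-1)     # L2 normalization
    z = z + 0.01 * torch.randn_like(z)               # training-time noise addition

    ids = tokenizer("a person walks forward", return_tensors="pt").input_ids
    tok_emb = gpt2.transformer.wte(ids)              # (1, T, 768) token embeddings
    inputs_embeds = torch.cat([z.unsqueeze(1), tok_emb], dim=1)  # prepend motion token
    logits = gpt2(inputs_embeds=inputs_embeds).logits
    print(logits.shape)                              # (1, T + 1, vocab_size)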


3. Text Decoder with GPT-2

The Text Decoder generates coherent sentences from the latent motion representation:

  • Latent-to-GPT-2 Projection to align motion embeddings with GPT-2’s embedding space
  • Cross-Attention Layer that lets the decoder attend to the motion representation while generating
  • Autoregressive Text Generation for natural sentence construction
  • Controlled Generation using temperature, top-p sampling, and repetition penalties (sketched below)

Example Output:
“A person walks forward and picks something up with their right hand.”
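
A hedged sketch of the controlled-generation step, using Hugging Face’s generate() with the motion latent as a prefix embedding; the sampling values are illustrative rather than the project’s reported settings, and the cross-attention refinement is omitted for brevity:

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")

    prefix = torch.randn(1, 1, 768)       # motion latent as a one-token prefix
    out_ids = gpt2.generate(
        inputs_embeds=prefix,
        max_new_tokens=30,
        do_sample=True,
        temperature=0.8,                  # soften the next-token distribution
        top_p=0.9,                        # nucleus (top-p) sampling
        repetition_penalty=1.2,           # discourage repeated phrases
        pad_token_id=tokenizer.eos_token_id,
    )
    print(tokenizer.decode(out_ids[0], skip_special_tokens=True))

Note that generate() accepts inputs_embeds for decoder-only models only in recent transformers releases; with older versions the sampling loop would need to be written by hand.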


Training Strategy

The training process is divided into two phases:

1. Pretraining Motion Encoder

  • GPT-2 is frozen for the first four epochs to stabilize motion embeddings
  • Layer Normalization and Weight Decay to keep gradients and weight magnitudes stable
  • ReduceLROnPlateau Scheduler to lower the learning rate when the validation loss plateaus (see the sketch below)
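
A minimal sketch of the phase-1 setup, reusing the MotionEncoder from the earlier sketch; the optimizer and scheduler hyperparameters are illustrative:

    import torch
    from transformers import GPT2LMHeadModel

    gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
    encoder = MotionEncoder()              # from the earlier sketch

    for p in gpt2.parameters():            # GPT-2 stays frozen in phase 1
        p.requires_grad = False

    optimizer = torch.optim.AdamW(
        encoder.parameters(), lr=1e-4, weight_decay=1e-2)   # weight decay
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.5, patience=2)      # LR drops on plateau

    for epoch in range(4):                 # the four frozen epochs
        # ... one epoch over (motion, caption) batches would run here ...
        val_loss = 1.0                     # placeholder for this epoch's validation loss
        scheduler.step(val_loss)           # scheduler reacts to validation loss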

2. Fine-Tuning GPT-2

  • Gradual Unfreezing of GPT-2 for better adaptation
  • Scheduled Sampling replaces 10% of ground-truth tokens with model predictions for robustness
  • Gradient Clipping to prevent unstable updates (a sketch of this phase follows below)
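
The sketch below shows one way these three ideas combine in a single training step. The number of unfrozen blocks, the learning rate, and the example sentence are all invented for illustration:

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")

    # Gradual unfreezing: start by training only the top transformer blocks
    for p in gpt2.parameters():
        p.requires_grad = False
    for block in gpt2.transformer.h[-4:]:
        for p in block.parameters():
            p.requires_grad = True

    optimizer = torch.optim.AdamW(
        (p for p in gpt2.parameters() if p.requires_grad), lr=5e-5)

    ids = tokenizer("a person walks forward", return_tensors="pt").input_ids

    # Scheduled sampling: swap ~10% of input tokens for the model's own predictions
    with torch.no_grad():
        preds = gpt2(ids).logits.argmax(dim=-1)          # prediction for token t+1
    shifted = torch.cat([ids[:, :1], preds[:, :-1]], dim=1)
    mask = torch.rand(ids.shape) < 0.10
    mixed = torch.where(mask, shifted, ids)

    loss = gpt2(mixed, labels=ids).loss  # supervision still targets the ground truth
    loss.backward()
    torch.nn.utils.clip_grad_norm_(gpt2.parameters(), max_norm=1.0)  # gradient clipping
    optimizer.step()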

This approach steadily improves the BLEU score and text coherence.
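
For reference, one common way to track that score during validation; NLTK is one library choice among several, and the sentences below are invented for demonstration:

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    reference = ["a person walks forward and picks something up".split()]
    candidate = "a person walks forward and grabs an object".split()
    score = sentence_bleu(reference, candidate,
                          smoothing_function=SmoothingFunction().method1)
    print(f"BLEU: {score:.3f}")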


Results & Key Takeaways

  • Improved Text Quality: The model generates concise and natural descriptions
  • Higher BLEU Score: Optimized training strategies enhance fluency
  • Better Generalization: Noise injection prevents overfitting to specific motion patterns
  • More Robust Predictions: Cross-attention improves motion understanding

Next Steps

  • Extending the model to 3D motion datasets
  • Improving diversity in sentence generation
  • Exploring real-time motion captioning for applications in AR/VR

This project represents a step forward in bridging human motion and language, with potential applications in action recognition, robotics, and accessibility.