Word Mover's Distance (WMD) in NLP

Word Mover's Distance (WMD) is a powerful metric in Natural Language Processing (NLP) for quantifying the semantic distance between two pieces of text. It leverages word embeddings, such as Word2Vec or GloVe, to measure the minimum cumulative distance that the words of one text must "travel" in semantic space to reach the words of the other; a small distance indicates high semantic similarity.


WMD, introduced by Kusner et al. in 2015, is based on the Earth Mover's Distance (EMD), a transportation metric originally applied to image retrieval in computer vision. Because WMD considers the semantic meanings of words, it is more effective than traditional representations like Bag of Words (BoW) or TF-IDF, which rely on word frequency counts and often fail to capture the semantic relationships between words.

How WMD Works

WMD calculates the dissimilarity between two text documents as the minimum cumulative distance that the words of one document must "move" in the word embedding space to reach the words of the other document, where each word carries a weight equal to its normalized frequency in the document. This "movement" is measured over the word embeddings, which capture the semantic meanings of the words.
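To make the cost of a "move" concrete, here is a minimal sketch of the pairwise travel costs between two short documents. The 2-D vectors below are hypothetical stand-ins for real Word2Vec or GloVe embeddings, which typically have hundreds of dimensions:

```python
import numpy as np

# Hypothetical 2-D embeddings; real systems use pretrained 100-300-D vectors.
embeddings = {
    "obama":     np.array([1.0, 0.9]),
    "president": np.array([0.9, 1.0]),
    "speaks":    np.array([0.1, 0.8]),
    "greets":    np.array([0.2, 0.7]),
    "media":     np.array([0.8, 0.1]),
    "press":     np.array([0.7, 0.2]),
}

doc1 = ["obama", "speaks", "media"]
doc2 = ["president", "greets", "press"]

# Cost matrix: Euclidean distance between every cross-document word pair.
cost = np.array([[np.linalg.norm(embeddings[w1] - embeddings[w2])
                  for w2 in doc2] for w1 in doc1])
print(cost.round(3))
```

Semantically related pairs such as "obama"/"president" get small costs, so moving mass between them is cheap; unrelated pairs are expensive.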

The distance between two words is computed as the Euclidean distance between their word embeddings. The resulting optimal transport problem, which seeks to minimize the total cost of moving the words, is then solved with standard solvers for the transportation problem, such as network-flow algorithms or general linear programming.
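Putting the pieces together, the following is a minimal end-to-end sketch that solves the transport problem with a general-purpose LP solver. The words and 2-D vectors are hypothetical, and a production implementation would use a specialized optimal transport solver instead of `scipy.optimize.linprog`:

```python
import numpy as np
from scipy.optimize import linprog

def wmd(doc1, doc2, embeddings):
    """Word Mover's Distance between two tokenized documents.

    Solves the optimal transport linear program:
      minimize   sum_ij T_ij * C_ij
      subject to row sums of T equal doc1's normalized word frequencies,
                 column sums equal doc2's, and T >= 0.
    """
    w1, w2 = sorted(set(doc1)), sorted(set(doc2))
    # Normalized bag-of-words weights: how much "mass" each word carries.
    d1 = np.array([doc1.count(w) / len(doc1) for w in w1])
    d2 = np.array([doc2.count(w) / len(doc2) for w in w2])
    # Cost matrix: Euclidean distance between embeddings.
    C = np.array([[np.linalg.norm(embeddings[a] - embeddings[b]) for b in w2]
                  for a in w1])
    n, m = len(w1), len(w2)
    A_eq, b_eq = [], []
    for i in range(n):            # each source word ships out all of its mass
        row = np.zeros(n * m)
        row[i * m:(i + 1) * m] = 1.0
        A_eq.append(row)
        b_eq.append(d1[i])
    for j in range(m):            # each target word receives exactly its mass
        col = np.zeros(n * m)
        col[j::m] = 1.0
        A_eq.append(col)
        b_eq.append(d2[j])
    res = linprog(C.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method="highs")
    return res.fun

# Hypothetical 2-D embeddings; real systems use pretrained Word2Vec or GloVe.
emb = {
    "obama": np.array([1.0, 0.9]), "president": np.array([0.9, 1.0]),
    "speaks": np.array([0.1, 0.8]), "greets": np.array([0.2, 0.7]),
    "media": np.array([0.8, 0.1]), "press": np.array([0.7, 0.2]),
    "weather": np.array([-0.9, -0.8]),
}
near = wmd(["obama", "speaks", "media"], ["president", "greets", "press"], emb)
far = wmd(["obama", "speaks", "media"], ["weather"], emb)
print(near, far)
```

The paraphrase pair ("obama speaks media" vs. "president greets press") scores a much smaller distance than the unrelated document, even though the two paraphrases share no words, which is exactly what BoW and TF-IDF miss.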

Applications of WMD

WMD has been successfully applied in various NLP tasks, including:

  • Document Similarity and Clustering: WMD can be used to measure the semantic similarity between documents, which can be useful in document clustering, information retrieval, and recommendation systems.

  • Text Classification: WMD can be used as a feature in machine learning models for text classification tasks, such as sentiment analysis or topic classification.

  • Question Answering: In question answering systems, WMD can be used to find the most semantically similar questions to a given query, improving the system’s ability to provide relevant answers.
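As a sketch of the retrieval use case, candidates can be ranked cheaply with the relaxed WMD (RWMD) lower bound from Kusner et al. (2015), in which each word's mass simply moves to its nearest counterpart in the other document. The vocabulary and 2-D vectors below are hypothetical:

```python
import numpy as np

def rwmd(doc1, doc2, emb):
    """Relaxed WMD lower bound (Kusner et al., 2015): each word's mass
    moves entirely to its nearest counterpart in the other document;
    taking the max over both directions tightens the bound."""
    def one_way(a, b):
        words_a, words_b = sorted(set(a)), sorted(set(b))
        weights = np.array([a.count(w) / len(a) for w in words_a])
        cost = np.array([[np.linalg.norm(emb[x] - emb[y]) for y in words_b]
                         for x in words_a])
        return float(weights @ cost.min(axis=1))
    return max(one_way(doc1, doc2), one_way(doc2, doc1))

# Hypothetical 2-D embeddings; a real system would use pretrained vectors.
emb = {w: np.array(v) for w, v in {
    "install": [0.8, 0.2], "setup": [0.75, 0.25], "remove": [-0.8, 0.2],
    "uninstall": [-0.75, 0.25], "python": [0.2, 0.9], "java": [0.25, 0.85],
}.items()}

query = ["install", "python"]
candidates = [["setup", "python"], ["uninstall", "java"], ["remove", "python"]]
# Rank candidate questions by semantic closeness to the query.
ranked = sorted(candidates, key=lambda c: rwmd(query, c, emb))
print(ranked)
```

Here "setup python" ranks first despite sharing only one word with the query, because "setup" sits near "install" in the embedding space.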

Advantages and Limitations

Advantages:

  • WMD captures the semantic meanings of words, making it more effective than methods that rely on raw word frequency counts.
  • WMD handles synonyms and near-synonyms naturally, since semantically similar words lie close together in the embedding space; this is a challenge for other methods.

Limitations:

  • WMD requires pre-trained word embeddings, and solving the underlying optimal transport problem is computationally expensive, which makes it slow for long documents and large corpora.
  • WMD is sensitive to the quality of the word embeddings: poorly trained embeddings lead to inaccurate distance measurements.


References

  1. Kusner, M., Sun, Y., Kolkin, N., & Weinberger, K. (2015). From Word Embeddings To Document Distances. In International Conference on Machine Learning.

  2. Rubner, Y., Tomasi, C., & Guibas, L. J. (2000). The Earth Mover’s Distance as a Metric for Image Retrieval. International Journal of Computer Vision, 40(2), 99-121.

  3. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems.

  4. Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing (EMNLP).