This post follows on from the previous “Get Busy with Word Embeddings” post, and provides code samples and methods for you to use and create Word Embeddings / Word Vectors with your systems in Python.

To use word embeddings, you have two primary options:

  • Use pre-trained models that you can download online (easiest)
  • Train custom models using your own data and the Word2Vec (or another) algorithm (harder, but maybe better!).

Two Python natural language processing (NLP) libraries are mentioned here:

  1. Spacy is a natural language processing (NLP) library for Python designed to have fast performance, and with word embedding models built in, it’s perfect for a quick and easy start.
  2. Gensim is a topic modelling library for Python that provides access to Word2Vec and other word embedding algorithms for training, and can also load pre-trained word embeddings that you have downloaded from the internet.

In this post, we examine how to load pre-trained models first, and then provide a tutorial for creating your own word embeddings using Gensim and the 20_newsgroups dataset.

Pre-trained Word Embeddings

Pre-trained models are the simplest way to start working with word embeddings. A pre-trained model is a set of word embeddings that have been created elsewhere that you simply load onto your computer and into memory.

The advantage of these models is that they can leverage massive datasets that you may not have access to, built from billions of words, with a vast corpus of language that captures word meanings in a statistically robust manner. Example training data sets include the entire corpus of Wikipedia text, the Common Crawl dataset, or the Google News dataset. Using a pre-trained model removes the need for you to spend time obtaining, cleaning, and (intensively) processing such large datasets.

Pre-trained models are also available in languages other than English, opening up multi-lingual opportunities for your applications.

The disadvantage of pre-trained word embeddings is that the words contained within may not capture the peculiarities of language in your specific application domain. For example, Wikipedia may not have great word exposure to particular aspects of legal doctrine or religious text, so if your application is specific to a domain like this, your results may not be optimal due to the generality of the downloaded model’s word embeddings.

Pre-trained models in Spacy

Using pre-trained models in Spacy is incredibly convenient, given that they come built in. Simply download the core English model using:

Spacy has a number of models available for use, in several languages (including English, Polish, German, Spanish, Portuguese, French, Italian, and Dutch) and in different sizes to suit your requirements. The code snippet above installs the larger-than-standard en_core_web_md model, which includes 20k unique vectors with 300 dimensions.

Spacy makes it easy to use word embeddings

Spacy parses entire blocks of text and seamlessly assigns word vectors from the loaded models.

Use the vectors in Spacy by first loading the model, and then processing text (see below):
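The original snippet is not reproduced here, so the following is only a minimal sketch of the load-and-process step. It assumes Spacy is installed and that the en_core_web_md model has already been downloaded (for example with `python -m spacy download en_core_web_md`). The helper name `embed_text` is made up for illustration and is not part of Spacy.

```python
def embed_text(text, model_name="en_core_web_md"):
    """Process `text` and return the document-level vector.

    Hypothetical helper for illustration. The import sits inside the
    function so the sketch can be read without Spacy installed.
    """
    import spacy

    nlp = spacy.load(model_name)  # load the pre-trained pipeline and its vectors
    doc = nlp(text)               # tokenise the text and attach vectors

    for token in doc:
        # token.vector is a NumPy array; has_vector is False for words
        # that the model has no embedding for
        print(token.text, token.has_vector, token.vector[:3])

    # By default, doc.vector is the average of the token vectors
    return doc.vector
```

Calling `embed_text("The court ruled in favour of the defendant")` would print the first few components of each token vector and return a single vector for the whole sentence.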

The vectors can be accessed directly using the .vector attribute of each processed token (word). The mean vector for the entire sentence is also calculated simply using .vector, providing a very convenient input for machine learning models based on sentences.

Once assigned, word embeddings in Spacy are accessed for words and sentences using the .vector attribute.

Pre-trained models in Gensim

Gensim doesn’t come with the same built-in models as Spacy, so to load a pre-trained model into Gensim, you first need to find and download one. This post on Ahogrammer’s blog provides a list of pre-trained models that can be downloaded and used.

A popular pre-trained option is the Google News dataset model, containing 300-dimensional embeddings for 3 million words and phrases. Download the binary file ‘GoogleNews-vectors-negative300.bin’ (1.3 GB compressed) from

Loading and accessing vectors is then straightforward:

Gensim includes functions to explore the loaded vectors, examine word similarity, and find synonyms of words using ‘similar’ vectors:
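Loading and exploring the vectors might look something like the sketch below. It assumes Gensim is installed and that the Google News file has been decompressed into the working directory; the function name `explore_vectors` and the query words are illustrative choices, not taken from the original post.

```python
def explore_vectors(path="GoogleNews-vectors-negative300.bin"):
    """Load word2vec-format vectors and print a few lookups.

    Hypothetical helper; `path` is assumed to point at the decompressed
    binary file.
    """
    from gensim.models import KeyedVectors

    # binary=True because the Google News file uses the binary word2vec format
    vectors = KeyedVectors.load_word2vec_format(path, binary=True)

    print(vectors["king"].shape)                 # each word maps to a NumPy array
    print(vectors.most_similar("king", topn=5))  # nearest neighbours of a word
    print(vectors.similarity("king", "queen"))   # similarity score for a word pair
    # Pick the word that fits least well with the others
    print(vectors.doesnt_match(["breakfast", "lunch", "dinner", "car"]))
    return vectors
```

If memory is tight, `load_word2vec_format` also accepts a `limit` argument that loads only the first N vectors from the file.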

Gensim provides a number of helper functions to interact with word vector models. Similarity is measured as the cosine similarity between two vectors.
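To make the similarity measure concrete, here is a small pure-Python sketch of cosine similarity. The three-dimensional vectors are made-up toy values for illustration; real embeddings have hundreds of dimensions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors, invented for this example
king = [0.9, 0.1, 0.4]
queen = [0.8, 0.2, 0.5]
car = [-0.3, 0.9, -0.1]

print(cosine_similarity(king, queen))  # close to 1: vectors point the same way
print(cosine_similarity(king, car))    # much lower: vectors point apart
```

A value near 1 means the two vectors point in almost the same direction, a value near 0 means they are unrelated, and negative values mean they point in opposing directions.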

Create Custom Word Embeddings

Training your own word embeddings need not be daunting and, for specific problem domains, can lead to better performance than pre-trained models. The Gensim library provides a simple API to the Google word2vec algorithm, which is a go-to algorithm for beginners.

To train your own model, the main challenge is getting access to a training data set. Computation is not massively onerous – you’ll manage to process a large model on a powerful laptop in hours rather than days.

In this tutorial, we will train a Word2Vec model based on the 20_newsgroups data set which contains approximately 20,000 posts distributed across 20 different topics. The simplicity of the Gensim Word2Vec training process is demonstrated in the code snippets below.

Training the model in Gensim requires the input data in a list of sentences, with each sentence being a list of words, for example:
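The original example is missing from this copy of the post, so the sentences below are invented purely to show the shape of the input. The helper `is_valid_corpus` is a hypothetical sanity check, not a Gensim function.

```python
# A corpus is a list of sentences; each sentence is a list of word strings
sentences = [
    ["the", "court", "ruled", "in", "favour", "of", "the", "defendant"],
    ["word", "embeddings", "capture", "meaning"],
]


def is_valid_corpus(corpus) -> bool:
    """Check that every sentence is a list and every token is a string."""
    return all(
        isinstance(sentence, list) and all(isinstance(word, str) for word in sentence)
        for sentence in corpus
    )


print(is_valid_corpus(sentences))
```

A common slip is to pass a list of raw strings instead. Iterating over a string yields single characters, so each character would end up being treated as a word.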

As such, our initial efforts will be in cleansing and formatting the data to suit this form.

Preparing 20 Newsgroups Data

Once the newsgroups archive is extracted into a folder, a few cleaning and extraction steps are needed to get the data into the input form before training the model:
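The loading step could be sketched as follows. This assumes the archive extracts to one sub-folder per newsgroup with one file per post; the function name `load_documents` is illustrative.

```python
import os
from typing import List


def load_documents(root_dir: str) -> List[str]:
    """Read every file below `root_dir` into a list of raw document strings."""
    texts = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # latin-1 maps every byte to a character, so decoding never fails
            # on the odd non-UTF-8 post
            with open(path, encoding="latin-1") as handle:
                texts.append(handle.read())
    return texts
```

With the archive extracted to a folder (the name depends on where you unpack it), `texts = load_documents(folder)` gives one raw string per post.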

The data is loaded into memory (a single list, ‘texts’) at this point. For preprocessing, remove all punctuation and excess information.
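One possible version of this cleaning step is sketched below. It makes two assumptions that are not in the original post: that the "excess information" is the header block at the top of each post (ending at the first blank line), and that splitting on `.`, `!` and `?` is a good enough sentence boundary for this data. Both function names are hypothetical.

```python
import re
from typing import List


def strip_headers(raw: str) -> str:
    """Drop the header block, assumed to end at the first blank line."""
    _headers, separator, body = raw.partition("\n\n")
    return body if separator else raw


def to_sentences(text: str) -> List[List[str]]:
    """Split text into sentences, then each sentence into lowercase words."""
    sentences = []
    for chunk in re.split(r"[.!?]+", text):
        # Replace anything that is not a letter or whitespace with a space
        words = re.sub(r"[^a-zA-Z\s]", " ", chunk).lower().split()
        if words:
            sentences.append(words)
    return sentences


raw_post = "From: someone@example.com\nSubject: test\n\nThe court ruled today. It is final!"
print(to_sentences(strip_headers(raw_post)))
```

Applying `to_sentences(strip_headers(doc))` to every entry in ‘texts’ turns each document into a list of tokenised sentences.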

Each original document is now represented in the list, ‘texts’, as a list of sentences, and each sentence is a list of words.

Sentence formatted as a list for Word2Vec Training.

For training word embedding models, a list of sentences, where each sentence is a list of words, is created. The source data here is the 20_newsgroups data set.

Finally, combine all of the sentences from every document into a single list of sentences.
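This flattening step can be done with `itertools.chain`. The tiny ‘texts’ list here is invented to keep the sketch self-contained.

```python
from itertools import chain

# Two toy documents, each already split into tokenised sentences
texts = [
    [["the", "court", "ruled", "today"], ["it", "is", "final"]],
    [["new", "york", "is", "big"]],
]

# Flatten one level: documents -> sentences
sentences = list(chain.from_iterable(texts))

print(len(sentences))
print(sentences[2])
```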

Phrase Detection using Gensim Phraser

Commonly occurring multiword expressions (bigrams / trigrams) in text carry a different meaning from the same words occurring individually. For example, the words ‘new’ and ‘York’ on their own are inherently different to the utterance ‘New York’. Detecting frequently co-occurring words and combining them can enhance word vector accuracy.

A ‘Phraser’ from Gensim can detect frequently occurring bigrams easily, and apply a transform to data to create pairs, i.e. ‘New York’ -> ‘New_York’. Pre-processing text input to account for such bigrams can improve the accuracy and usefulness of the resulting word vectors. Ultimately, instead of training vectors for ‘new’ and ‘york’ separately, a new vector for ‘New_York’ is created.

The gensim.models.phrases module provides everything required in a simple form:
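A minimal sketch of training and applying a Phraser, assuming Gensim is installed. The wrapper name `merge_bigrams` and the default parameter values are illustrative and would need tuning on real data.

```python
from typing import List


def merge_bigrams(sentences: List[List[str]], min_count: int = 5, threshold: float = 10.0):
    """Detect frequent bigrams and join them with an underscore.

    Hypothetical wrapper around gensim.models.phrases. The import sits
    inside the function so the sketch can be read without Gensim installed.
    """
    from gensim.models.phrases import Phrases, Phraser

    # Phrases counts unigrams and bigrams and scores each candidate pair;
    # min_count and threshold control how eagerly pairs are merged
    phrases = Phrases(sentences, min_count=min_count, threshold=threshold)

    # Phraser is a lighter, read-only version used to transform sentences
    bigram = Phraser(phrases)

    # Indexing the Phraser with a sentence returns the transformed token list
    return [bigram[sentence] for sentence in sentences]
```

Lower values of `threshold` merge more pairs, and higher values keep only the strongest collocations.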


A Phraser detects frequently co-occurring words in sentences and combines them. Training and applying is simple using the Gensim library.

The Gensim Phraser process can be repeated to detect trigrams (groups of three words that co-occur) and more by training a second Phraser object on the already processed data (see the Gensim docs). The parameters are tuneable to include or exclude terms based on their frequency, and should be fine-tuned. In the example above, ‘court_of_law’ is a good example phrase, whereas ‘been_established’ may indicate an overly greedy application of the phrase detection algorithm.

Creating the Word Embeddings using Word2Vec

The final step, once the data has been preprocessed and cleaned, is creating the word vectors.
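The training call might look like the sketch below. It assumes Gensim 4.x, where the dimensionality argument is `vector_size` (it was called `size` in 3.x). The wrapper name `train_word2vec` and the parameter values are illustrative defaults, not the settings used in the original post.

```python
from typing import List


def train_word2vec(
    sentences: List[List[str]],
    vector_size: int = 100,
    window: int = 5,
    min_count: int = 5,
    workers: int = 4,
):
    """Train a Word2Vec model on a list of tokenised sentences.

    Hypothetical wrapper, assuming Gensim 4.x. The import sits inside the
    function so the sketch can be read without Gensim installed.
    """
    from gensim.models import Word2Vec

    model = Word2Vec(
        sentences=sentences,
        vector_size=vector_size,  # dimensionality of each word vector
        window=window,            # context words considered either side of the target
        min_count=min_count,      # ignore words seen fewer times than this
        workers=workers,          # number of training threads
    )
    return model
```

After training, the vectors live on `model.wv`, so `model.wv.most_similar("court")` or `model.wv["court"]` work in the same way as for a loaded pre-trained model, and `model.save(...)` writes the model to disk.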

This example, with only 564k sentences, is a toy example, and the resulting word embeddings would not be expected to be as useful as those trained by Google / Facebook on larger corpora of training data.

In total, the 20_newsgroups dataset provided 80,167 different words for our model, and, even with the smaller data set, relationships between words can be observed.

The word embedding dimension and number of words for the 20_newsgroups data is found in the model.

Even with the relatively small (80k unique words) dataset, some informative relations are seen in trained word embeddings.

There is a range of tuneable parameters for the Word2Vec algorithm provided by Gensim to assist in achieving the desired result.

For larger data sets, training time will be much longer, and memory can be an issue if all of the training data is loaded at once, as in our example above. The Rare Technologies blog provides some useful information on formatting input data as an iterable, which reduces the memory footprint during training, and on methods for evaluating word vector performance after training.
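As a rough sketch of the iterable approach, the class below streams sentences from disk one at a time. It assumes the preprocessed corpus has been saved to a text file with one sentence per line and tokens separated by whitespace; the class name `SentenceStream` is made up for this example.

```python
class SentenceStream:
    """Restartable iterable that yields one tokenised sentence per file line."""

    def __init__(self, path, encoding="utf-8"):
        self.path = path
        self.encoding = encoding

    def __iter__(self):
        # The file is reopened on every pass, so the corpus can be iterated
        # repeatedly without ever holding it all in memory
        with open(self.path, encoding=self.encoding) as handle:
            for line in handle:
                words = line.split()
                if words:  # skip blank lines
                    yield words
```

An object like this can be passed in place of the in-memory list of sentences. A class with `__iter__` is used rather than a plain generator because training makes several passes over the data, and a generator is exhausted after the first pass. Gensim also ships `gensim.models.word2vec.LineSentence`, which reads the same one-sentence-per-line format.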

Once trained, you can access the newly encoded word vectors in the same way as for pretrained models, and use the outputs in any of your text classification or visualisation tasks.

In addition to Word2Vec, Gensim also includes algorithms for fastText, VarEmbed, and WordRank.


Ideally, this post will have given enough information to start working in Python with Word embeddings, whether you intend to use off-the-shelf models or models based on your own data sets.

A third option exists, which is to take an off-the-shelf model, and then ‘continue’ the training using Gensim, but with your own application-specific data and vocabulary, also mentioned on the Rare Technologies blog.

For further useful reading on these topics, please see:

  1. @Shane, great post on a hot topic!

    You guys are so lucky working with an English corpus!

    If you change the language, it all gets exponentially more complicated:
    – To my knowledge, you can’t train GloVe with your own corpus.
    – Word2Vec is easily trained with Gensim, but everything else is not easy 🙂
    – Word2Vec needs a huge corpus. My guess is >500MB

    My personal conclusion for non-English word modeling:
    – stick to BoW (bi-gram). Unless you’re a monster tech firm, BoW (bi-gram) works surprisingly well.

    PS: for those into Deep Learning + Natural Language Processing, check out (from the makers of SpaCy). enables you to manually label sentences efficiently.
