Word2vec - The CBOW Architecture

Ideas about representing words as vectors arose roughly in the 1980s. One-hot encoding, developed in the mid-20th century, embeds each word as a vector of 0s and 1s that is often very high-dimensional. Given a vocabulary of 10,000 words, the first word would be represented as [1,0,0,0,...,0]: a vector with 10,000 entries in which the ith word has a 1 in the ith position, as explained by Po (2025). This representation is inefficient for two reasons: the vectors are sparse, which makes them slow and wasteful to process, and they are mutually orthogonal, meaning the dot product of any two distinct vectors is always 0. There is no notion of meaning: similar words such as "cat" and "kitten" and unrelated words such as "cooking" and "pencil" are all equally distant from each other.
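
To make the orthogonality point concrete, here is a small Python sketch; the toy four-word vocabulary is an illustration, not something from the cited sources. The dot product of any two distinct one-hot vectors is 0, so a related pair and an unrelated pair look identical to the model.

import numpy as np

vocab = ["cat", "kitten", "cooking", "pencil"]   # toy vocabulary, V = 4
V = len(vocab)

def one_hot(word):
    # Build the V-dimensional one-hot vector for a word in the toy vocabulary.
    vec = np.zeros(V)
    vec[vocab.index(word)] = 1.0
    return vec

# Distinct one-hot vectors are always orthogonal:
print(one_hot("cat") @ one_hot("kitten"))      # 0.0, even though the words are related
print(one_hot("cooking") @ one_hot("pencil"))  # 0.0, the same score for unrelated words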


Instead, Mikolov et al. (2013) at Google developed a shallow feedforward neural network with one hidden layer as a way to represent words as dense, low-dimensional vectors. Their paper was published, and the method patented, in 2013. The word embeddings of word2vec have several advantages: semantically similar words have nearby vectors in the vector space, the space has linear structure (analogies such as king - man + woman ≈ queen can be answered with vector arithmetic), and the embeddings are compact, low-dimensional vectors, which keeps the model fast.
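
As a hedged illustration of the linear-structure claim, a pretrained word2vec model can be queried through gensim; the model name and download step below describe one common setup and are assumptions, not part of Mikolov et al. (2013).

import gensim.downloader as api

wv = api.load("word2vec-google-news-300")   # pretrained vectors; large download on first use
# Vector arithmetic: king - man + woman should land near "queen".
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))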


There are two training architectures introduced in word2vec: Continuous Bag of Words (CBOW) and Skip-gram. According to GeeksforGeeks (2023), CBOW's goal is to predict a target word from its context words, while Skip-gram does the opposite, using a target word to predict its surrounding context words. The CBOW architecture proceeds in a few clear steps. First, the input is chosen, such as the words of a sentence. For example, if we had "I love eating hotdogs" and our target word was "eating," the context words would be "I," "love," and "hotdogs." From Trivedi (2024), each context word starts as a one-hot vector and is converted to a dense vector by multiplying it with an embedding matrix of size V × N, where V is the vocabulary size and N is the embedding dimension. The choice of N depends on the size of the dataset, the level of nuance required by the NLP task, and how efficient the model needs to be.
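
A minimal sketch of this lookup step, assuming toy sizes V = 10 and N = 5, random weights, and arbitrary context-word indices (all of these values are illustrative, not from Trivedi (2024)):

import numpy as np

rng = np.random.default_rng(0)
V, N = 10, 5
W = rng.normal(scale=0.1, size=(V, N))   # embedding matrix of size V x N

context_ids = [0, 1, 3]                  # hypothetical indices for "I", "love", "hotdogs"

# Multiplying a one-hot row vector by W simply selects a row of W,
# so in practice the dense vector is obtained by indexing.
one_hot = np.zeros(V)
one_hot[context_ids[0]] = 1.0
assert np.allclose(one_hot @ W, W[context_ids[0]])

context_embeddings = W[context_ids]      # shape (3, N): one dense vector per context word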


The embeddings are then averaged or summed to form the context vector. Since the neural network expects a fixed-size input, averaging or summing compresses the individual embeddings into a single context vector h. As the name suggests, the context words are treated as a bag of words in which order doesn't matter; all context words are weighted equally, which keeps the model simple and efficient.
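
A minimal sketch of this pooling step, again with illustrative sizes (three context words, N = 5); averaging collapses any number of context embeddings into one fixed-size vector h.

import numpy as np

context_embeddings = np.random.default_rng(1).normal(size=(3, 5))  # stand-in dense vectors
h = context_embeddings.mean(axis=0)    # averaged context vector, shape (5,)
# h = context_embeddings.sum(axis=0)   # the summing variant works the same way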

Finally, h is multiplied by a weight matrix W', which projects h into the vocabulary space. This produces a vector of logits: raw scores for how likely each word is to be the target. A softmax function converts these logits into a probability distribution over the vocabulary. W' can be thought of as the decoder (Hughes, 2024).
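
A minimal sketch of this output step, with the same illustrative toy sizes (V = 10, N = 5) and a random vector standing in for h; W_out below plays the role of the decoder matrix W'.

import numpy as np

rng = np.random.default_rng(2)
V, N = 10, 5
h = rng.normal(size=N)                        # context vector from the previous step
W_out = rng.normal(scale=0.1, size=(N, V))    # output ("decoder") matrix W', size N x V

logits = h @ W_out                      # raw scores over the vocabulary, shape (V,)
probs = np.exp(logits - logits.max())   # softmax, with max subtraction for numerical stability
probs /= probs.sum()
predicted_id = int(probs.argmax())      # index of the word the model considers most likely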

The CBOW training process follows these same steps, with an additional loss computation and backpropagation step to improve the model's accuracy, discussed in more detail in the presentation.
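
As a rough sketch of the loss being minimized (the exact formulation in the presentation may differ), cross-entropy compares the softmax output with the one-hot target; backpropagation then updates the embedding and decoder matrices.

import numpy as np

probs = np.array([0.05, 0.70, 0.10, 0.15])   # toy softmax output over a 4-word vocabulary
target_id = 1                                # index of the true target word
loss = -np.log(probs[target_id])             # cross-entropy loss for this training example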

References

GeeksforGeeks (2023). Continuous bag of words (CBOW) in NLP.

Hughes, C. (2024). A brief overview of cross entropy loss. Medium.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Po, L. (2025). The evolution of word embeddings. Medium.

Trivedi, A. (2024). Understanding continuous bag of words (CBOW).