Back2School with Vectors, Cosine Similarity, and Word2Vec

By Lucas | May 12, 2017


Tomorrow, I’ll be making a return visit to the high school where I spent a decade in the mathematics department as a teacher. I’ve got the chance to speak to ten classes over the course of six class periods and tell them a little bit about what I do as a data scientist.

Since many of the students will be familiar with concepts like vectors and trigonometry, I’ve decided to do an activity involving the Python gensim package and Word2Vec. Specifically, each student was asked to submit a “Tweet” about the most interesting thing they’ve done in the last couple of weeks. I was given those Tweets last week and have prepared a short talk and code walkthrough showing how we can use Word2Vec to identify similar Tweets: we transform the unstructured text into word embeddings, then compare the resulting vectors by their cosine similarity.
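To make the idea concrete, here is a minimal sketch of the comparison step. It is not the code from the repo, and it makes some loud assumptions: the three-dimensional vectors in `TOY_VECTORS` are made up for illustration (real Word2Vec embeddings are learned from text and usually have a hundred or more dimensions), and representing a Tweet as the average of its word vectors is just one common, simple choice. The helper names `tweet_vector` and `cosine_similarity` are hypothetical.

```python
import math

# Hypothetical toy embeddings, invented for illustration only.
# Real Word2Vec vectors are learned from a corpus and are much longer.
TOY_VECTORS = {
    "soccer": [0.9, 0.1, 0.0],
    "game": [0.8, 0.2, 0.1],
    "won": [0.7, 0.3, 0.0],
    "baked": [0.0, 0.9, 0.2],
    "cookies": [0.1, 0.8, 0.3],
    "pie": [0.0, 0.9, 0.3],
}


def tweet_vector(text, vectors=TOY_VECTORS):
    """Average the vectors of the words we have embeddings for.

    Returns None if no word in the text is in the vocabulary.
    """
    words = [w for w in text.lower().split() if w in vectors]
    if not words:
        return None
    dims = len(vectors[words[0]])
    return [sum(vectors[w][i] for w in words) / len(words) for i in range(dims)]


def cosine_similarity(a, b):
    """Cosine of the angle between a and b: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat a zero vector as similar to nothing.
    return dot / (norm_a * norm_b)


sports = tweet_vector("won the soccer game")
baking = tweet_vector("baked cookies")
dessert = tweet_vector("baked a pie")

# The two baking Tweets point in nearly the same direction,
# so their cosine similarity is higher than baking vs. sports.
print(cosine_similarity(baking, dessert))
print(cosine_similarity(baking, sports))
```

With these toy numbers, the two baking Tweets score much closer to 1 than the baking and sports pair does. That is the trigonometry connection: cosine similarity depends only on the angle between two vectors, not on their lengths.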

I’ve decided to share the code in a GitHub repo. If you’re interested in word embeddings, I hope you’ll find it helpful. I’m also posting the presentation I’m giving tomorrow below. Some of the formatting (indents, margins, etc.) got lost when I wrapped it in an iframe, so if you want to see it in the best possible form, check it out here.
