Although I’m late to the start, I’ve been working on the Kaggle Toxic Comment Challenge. The dataset contains only about 560K comments, so before trying a deep learning model I was curious to see how well a relatively simple approach might do. At the same time, I still wanted to use word embeddings to maximize generalization to unseen text.
The Coursera Deep Learning Sequence Models course describes summarizing a document by averaging its words’ embedding vectors. Some quick research led me to this paper, which, in part, describes using the TF-IDF weighted sum of embedding vectors instead. That seemed cool, so I decided to give it a try.
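Concretely, each word’s vector gets scaled by its TF-IDF weight before the vectors are summed. Here’s a minimal sketch of the idea with hypothetical inputs (the actual pipeline follows below):

import numpy as np

def doc_vector(doc_terms, word_vecs, tfidf_weights, dim=300):
    # word_vecs: dict mapping term -> embedding vector (hypothetical)
    # tfidf_weights: dict mapping term -> TF-IDF weight for this document
    vec = np.zeros(dim)
    for term in doc_terms:
        if term in word_vecs and term in tfidf_weights:
            vec += tfidf_weights[term] * word_vecs[term]
    return vec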
To try the idea out, I used a small subset of the 20 newsgroups data set available through scikit-learn:

from sklearn.datasets import fetch_20newsgroups

news_train = fetch_20newsgroups(subset='train',
                                categories=['sci.space', 'rec.autos'],
                                data_home='~/local/tmp')
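As a quick sanity check, the document counts can be confirmed directly:

import numpy as np

print(len(news_train.data))            # 1187 documents in total
print(np.bincount(news_train.target))  # [594 593]
print(news_train.target_names)         # ['rec.autos', 'sci.space']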
This data set has 1187 documents: 594 from rec.autos and 593 from sci.space. Next, we’ll load the spaCy Python package and its medium-size English language model, which includes pre-computed GloVe embedding vectors. See the spaCy page for instructions on downloading the language model (typically python -m spacy download en_core_web_md).
import spacy

nlp = spacy.load('en_core_web_md')
With spaCy loaded, the newsgroup documents can be lemmatized.
def keep_token(t):
    return (t.is_alpha and
            not (t.is_space or t.is_punct or
                 t.is_stop or t.like_num))

def lemmatize_doc(doc):
    return [t.lemma_ for t in doc if keep_token(t)]

docs = [lemmatize_doc(nlp(doc)) for doc in news_train.data]
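A quick peek at what this produces:

print(docs[0][:10])  # first ten lemmas of the first document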
Next, the Gensim package can be used to create a dictionary and filter out lemmas that are either too rare or too common (the latter acting as corpus-specific stop words).
from gensim.corpora import Dictionary
from gensim.models.tfidfmodel import TfidfModel
from gensim.matutils import sparse2full

docs_dict = Dictionary(docs)
docs_dict.filter_extremes(no_below=20, no_above=0.2)
docs_dict.compactify()
This process yields a vocabulary with 1191 words. Gensim can again be used to create a bag-of-words representation of each document, build the TF-IDF model, and compute the TF-IDF vector for each document.
import numpy as np

docs_corpus = [docs_dict.doc2bow(doc) for doc in docs]
model_tfidf = TfidfModel(docs_corpus, id2word=docs_dict)
docs_tfidf = model_tfidf[docs_corpus]
docs_vecs = np.vstack([sparse2full(c, len(docs_dict)) for c in docs_tfidf])
docs_vecs is a matrix with 1187 rows (docs) and 1191 columns (TF-IDF terms). Next, we use spaCy to get the 300-dimensional GloVe embedding vector for each TF-IDF term.
tfidf_emb_vecs = np.vstack([nlp(docs_dict[i]).vector for i in range(len(docs_dict))])
The tfidf_emb_vecs matrix has 1191 rows (terms) and 300 columns (GloVe vector components). To get a TF-IDF weighted GloVe vector summary of each document, we just need to compute the matrix product of docs_vecs and tfidf_emb_vecs:
docs_emb = np.dot(docs_vecs, tfidf_emb_vecs)
docs_emb is a matrix with 1187 rows (docs) and 300 columns (GloVe vector components).
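The shapes can be verified directly:

print(docs_vecs.shape)       # (1187, 1191): documents x TF-IDF terms
print(tfidf_emb_vecs.shape)  # (1191, 300):  terms x GloVe dimensions
print(docs_emb.shape)        # (1187, 300):  documents x GloVe dimensions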
To get a sense of how well these document summaries do, we can use PCA to reduce the dimensionality,
from sklearn.decomposition import PCA

docs_pca = PCA(n_components=8).fit_transform(docs_emb)
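If you’re curious how much structure survives the reduction, the retained variance is easy to check:

pca = PCA(n_components=8).fit(docs_emb)
print(pca.explained_variance_ratio_.sum())  # fraction of variance kept by 8 components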
and then use t-SNE to project the vectors to 2D.
from sklearn import manifold

tsne = manifold.TSNE()
viz = tsne.fit_transform(docs_pca)
Plotting with matplotlib…
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.margins(0.05)

zero_indices = np.where(news_train.target == 0)[0]
one_indices = np.where(news_train.target == 1)[0]

ax.plot(viz[zero_indices, 0], viz[zero_indices, 1],
        marker='o', linestyle='', ms=8, alpha=0.3,
        label=news_train.target_names[0])
ax.plot(viz[one_indices, 0], viz[one_indices, 1],
        marker='o', linestyle='', ms=8, alpha=0.3,
        label=news_train.target_names[1])
ax.legend()
plt.show()
This produces the following plot, which shows that the TF-IDF weighted GloVe vectors did a decent job of separating the documents.
I did some additional feature engineering to further improve separability for the Kaggle competition. After fitting an SVM binary classification model for each label, I was able to achieve a score of 0.9470 on the competition’s mean column-wise ROC AUC metric. While not bad, that’s far from the best leaderboard scores. I may experiment some more with weighted embedding vectors, but I fear too much information is lost in the summarizing to reach the top of the board.
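For the 20 newsgroups example above, a comparable classifier is easy to sketch (this is just an illustration of the approach, not the competition models, which used different features and one model per label):

from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Linear SVM on the TF-IDF weighted GloVe document vectors,
# scored with 5-fold cross-validation.
svm = LinearSVC()
scores = cross_val_score(svm, docs_emb, news_train.target, cv=5)
print(scores.mean())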