One of my goals for 2019 is to resume competing in Kaggle competitions. Thinking about this made me realize that I never posted about my solution for last year’s Toxic Comment Challenge.
The competition’s goal was to train a model to detect toxic comments such as threats, obscenity, insults, and identity-based hate. The data set consisted of comments from Wikipedia’s talk page edits. The training data set had ~500K examples, each with one or more of the following labels:

- toxic
- severe_toxic
- obscene
- threat
- insult
- identity_hate
The test data set had an additional ~500K examples.
This was a difficult competition to place well in precisely because it was fairly easy to achieve a high score, so the leaderboard was tightly packed. The text also contained a lot of non-dictionary curse words, obfuscated with symbols, misspellings, and so on, which added to the complexity.
I tried a variety of models and ultimately settled on a bi-directional RNN with 80 GRU units. I wrote my solution in Python using TensorFlow, spaCy, Gensim, and scikit-learn, along with pre-trained FastText embedding vectors.
I preprocessed the raw comment text before feeding it to the model (the full preprocessing code is in the repo linked at the end).
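As a rough illustration, the tokenization step amounts to something like the sketch below, assuming spaCy's tokenizer, a FastText vocabulary lookup, and padding/truncating to the model's sequence length of 150; the helper names are illustrative, not code from the repo:

```python
import numpy as np
import spacy

MAX_LEN = 150  # fixed sequence length used by the model

# A blank English pipeline gives us just the tokenizer, which is all we need here.
nlp = spacy.blank("en")

def comment_to_ids(text, vocab, pad_id=0, unk_id=1):
    """Tokenize a comment and map each token to its FastText vocabulary ID,
    truncating/padding to MAX_LEN. `vocab` maps token -> embedding row index."""
    tokens = [tok.text.lower() for tok in nlp(text)]
    ids = [vocab.get(tok, unk_id) for tok in tokens[:MAX_LEN]]
    ids += [pad_id] * (MAX_LEN - len(ids))
    return np.array(ids, dtype=np.int32)

# Hypothetical usage, given the token list extracted from the pre-trained vectors:
# vocab = {token: i for i, token in enumerate(fasttext_tokens)}
# term_ids = comment_to_ids("What a toxic comment!", vocab)
```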
My final network design is outlined below as a data flow from the model's inputs up to the output logits. Each tensor is annotated with its dimensions (excluding the batch size).

- Inputs: term IDs (150x1) and term scores (150x14).
- The term IDs index into the pre-trained FastText embeddings, producing a 150x300 tensor.
- A 1D convolution collapses the term scores into a single weight per term (150x1), which scales the embeddings (embedding weighting, 150x300); dropout is applied to the result.
- A bi-directional GRU with 80 units per direction processes the sequence; the forward and backward outputs (150x80 each) are concatenated into a 150x160 tensor.
- 1D average pooling and max pooling over the sequence each yield a 1x160 vector, concatenated into a 1x320 vector.
- A reduce-max over the term scores yields a 1x14 vector, which is concatenated with the pooled features into a 1x334 vector.
- A dense layer with 6 units maps the 1x334 features to the logits (1x6).
The model’s inputs were:

- Term IDs: a 150x1 sequence of token indices used to look up the pre-trained FastText embeddings.
- Term scores: a 150x14 matrix of per-token scores.
The term scores were used in two ways:

- A 1D convolution collapsed each token’s 14 scores into a single weight, which was multiplied into that token’s embedding (the embedding weighting step).
- A reduce-max over the sequence produced a 1x14 vector that was concatenated with the pooled RNN outputs just before the final dense layer.
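The design above maps fairly directly onto code. Here is a minimal Keras-style sketch of the same architecture; it is not my original code, and details beyond the diagram (the sigmoid on the term-weight convolution, the dropout rate) are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers

MAX_LEN, EMB_DIM, N_SCORES, N_LABELS = 150, 300, 14, 6

def build_model(embedding_matrix):
    term_ids = layers.Input(shape=(MAX_LEN,), dtype="int32", name="term_ids")
    term_scores = layers.Input(shape=(MAX_LEN, N_SCORES), name="term_scores")

    # Pre-trained FastText embeddings: 150 token IDs -> 150x300.
    emb = layers.Embedding(embedding_matrix.shape[0], EMB_DIM,
                           weights=[embedding_matrix], trainable=False)(term_ids)

    # 1D convolution collapses the 14 per-term scores to one weight per term
    # (150x1), which then scales that term's embedding vector.
    term_weight = layers.Conv1D(1, kernel_size=1, activation="sigmoid")(term_scores)
    weighted = emb * term_weight          # broadcasts 150x1 over 150x300
    dropped = layers.Dropout(0.5)(weighted)

    # Bi-directional GRU, 80 units per direction -> 150x160.
    rnn = layers.Bidirectional(layers.GRU(80, return_sequences=True))(dropped)

    # Average and max pooling over the sequence, concatenated -> 1x320.
    pooled = layers.Concatenate()([layers.GlobalAveragePooling1D()(rnn),
                                   layers.GlobalMaxPooling1D()(rnn)])

    # Reduce-max over the raw term scores -> 1x14, giving 1x334 in total.
    score_max = layers.GlobalMaxPooling1D()(term_scores)
    features = layers.Concatenate()([pooled, score_max])

    # Dense layer producing one logit per label.
    logits = layers.Dense(N_LABELS)(features)
    return tf.keras.Model(inputs=[term_ids, term_scores], outputs=logits)
```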
Weighting the embeddings was inspired by previous experiments with TF-IDF-weighted embeddings. I don’t recall exactly how much the weighting helped, but I believe it had a positive effect.
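For reference, those earlier experiments boiled down to something like the following simplified sketch using scikit-learn's TfidfVectorizer and a Gensim KeyedVectors-style lookup; the function is illustrative, not code from the repo:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_weighted_doc_vectors(texts, word_vectors, dim=300):
    """Represent each comment as the TF-IDF-weighted average of its word
    vectors. `word_vectors` behaves like a Gensim KeyedVectors object
    (supports `word in word_vectors` and `word_vectors[word]`)."""
    tfidf = TfidfVectorizer()
    weights = tfidf.fit_transform(texts)            # n_docs x n_terms, sparse
    terms = tfidf.get_feature_names_out()
    docs = np.zeros((len(texts), dim))
    for d in range(weights.shape[0]):
        row = weights.getrow(d)
        total = 0.0
        for j, w in zip(row.indices, row.data):
            term = terms[j]
            if term in word_vectors:                # skip out-of-vocabulary terms
                docs[d] += w * word_vectors[term]
                total += w
        if total > 0:
            docs[d] /= total                        # normalize by total TF-IDF mass
    return docs
```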
Another novel thing I tried was weighting the loss for each category by its log odds ratio. The rationale was to boost the contribution of the rarer classes and thereby address class imbalance. Again, I don’t recall how much this helped, but I must have had a good reason to keep it!
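I no longer remember the exact formula, but the idea looks roughly like this sketch; the use of the absolute log odds of each label's prevalence and the per-label sigmoid cross-entropy are assumptions:

```python
import numpy as np
import tensorflow as tf

def log_odds_weights(train_labels):
    """One weight per label: the magnitude of the log odds of that label's
    prevalence in the training set, so rarer labels weigh more."""
    p = train_labels.mean(axis=0)                  # fraction of positives per label
    return np.abs(np.log(p / (1.0 - p))).astype(np.float32)

def weighted_bce_loss(class_weights):
    """Per-label sigmoid cross-entropy scaled by the per-label weights."""
    def loss(y_true, logits):
        per_label = tf.nn.sigmoid_cross_entropy_with_logits(
            labels=tf.cast(y_true, tf.float32), logits=logits)   # batch x 6
        return tf.reduce_mean(per_label * class_weights)
    return loss
```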
I trained the model for 8 epochs with a batch size of 128 on my openSUSE Linux box with a Core i7-6850K (6 cores), 32GB of RAM, and an Nvidia Titan X (Pascal) GPU. My final score was a ROC AUC of 0.9804, which would normally be excellent. However, I only ranked 2443rd out of 4551 teams (53%). Regardless, I mainly compete in Kaggle competitions to learn, and on that front I definitely succeeded.
Source code is available on GitHub. It includes all of the preprocessing code as well as additional unused models that may serve as good examples.
Tags: NLP , Kaggle , TensorFlow , DeepLearning