February 28, 2019

Kaggle Toxic Comments Competition

One of my goals for 2019 is to resume competing in Kaggle competitions. Thinking about this made me realize that I never posted about my solution for last year’s Toxic Comment Challenge.

The competition’s goal was to train a model to detect toxic comments like threats, obscenity, insults, and identity-based hate. The data set consisted of comments from Wikipedia’s talk page edits. The training data set had ~500K examples, each with one or more of the following labels:

  • Toxic
  • Severe Toxic
  • Obscene
  • Threat
  • Insult
  • Identity Hate

The test data set had an additional ~500K examples.

This was a difficult competition precisely because it was fairly easy to achieve a high score: the leaderboard was tightly packed, so small improvements mattered. The text also contained a lot of non-dictionary curse words disguised with symbols, misspellings, etc., which added to the complexity.

I tried a variety of models and ultimately settled on a bi-directional RNN with 80 GRU units. I wrote my solution in Python using TensorFlow, spaCy, Gensim, and scikit-learn. I also used pre-trained FastText embedding vectors.

I preprocessed the data by (see the code sketch after this list):

  1. Tokenizing and lemmatizing the data (spaCy)
  2. Learning the vocabulary (Gensim)
  3. Creating TF-IDF vector models of each comment (Gensim)
  4. Scoring each vocabulary term’s toxic/non-toxic discrimination using Chi2 (scikit-learn) and Delta IDF metrics
  5. Manually correcting a small number of discriminating non-dictionary words
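
Roughly, steps 1 through 4 looked like the sketch below. The helper name, the small spaCy model, and the smoothed Delta IDF formula are illustrative, not my exact code (that’s in the GitHub repo linked at the end):

    import numpy as np
    import spacy
    from gensim.corpora import Dictionary
    from gensim.matutils import corpus2csc
    from gensim.models import TfidfModel
    from sklearn.feature_selection import chi2

    nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

    def preprocess(comments, y_toxic):
        # y_toxic: binary (0/1) NumPy array marking which comments are toxic
        # 1. Tokenize and lemmatize (spaCy)
        docs = [[tok.lemma_.lower() for tok in doc if not tok.is_space]
                for doc in nlp.pipe(comments)]
        # 2. Learn the vocabulary (Gensim)
        vocab = Dictionary(docs)
        # 3. TF-IDF vectors for each comment (Gensim)
        tfidf = TfidfModel(dictionary=vocab)
        bows = [vocab.doc2bow(doc) for doc in docs]
        X = corpus2csc((tfidf[b] for b in bows), num_terms=len(vocab)).T
        # 4a. Chi2 discrimination score per vocabulary term (scikit-learn)
        chi2_scores, _ = chi2(X, y_toxic)
        # 4b. Delta IDF: difference between each term's IDF in the toxic
        #     and non-toxic subsets (smoothed; an illustrative version)
        pos, neg = X[y_toxic == 1], X[y_toxic == 0]
        df_pos = (pos > 0).sum(axis=0).A1   # document frequencies per term
        df_neg = (neg > 0).sum(axis=0).A1
        delta_idf = np.log((pos.shape[0] / (1.0 + df_pos)) /
                           (neg.shape[0] / (1.0 + df_neg)))
        return docs, vocab, chi2_scores, delta_idf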

The following diagram illustrates my final network design. Each line is a tensor annotated with its dimensions (excluding batch size). Each box is a simplified representation of operations.

                                                 Logits                      
                                                    ▲                        
                                                    │ 1x6                    
                                                    │                        
                                         ┌─────────────────────┐             
                                         │   Dense Layer (6)   │             
                                         └─────────────────────┘             
                                                    ▲                        
                                                    │ 1x334                  
                                                    │                        
                                         ┌─────────────────────┐             
                                         │       Concat        │             
                                         └─────────────────────┘             
                                                    ▲                        
                    ┌───────────────────────────────┤                        
               1x14 │                               │ 1x320                  
           ┌────────────────┐            ┌─────────────────────┐             
           │   Reduce Max   │            │       Concat        │             
           └────────────────┘            └─────────────────────┘             
                    ▲                               ▲                        
                    │                  ┌────────────┴────────────┐           
                    │             1x160│                         │ 1x160     
                    │                  │                         │           
                    │       ┌─────────────────────┐   ┌─────────────────────┐
                    │       │   Avg Pooling 1D    │   │   Max Pooling 1D    │
                    │       └─────────────────────┘   └─────────────────────┘
                    │                  ▲                         ▲           
                    │                  │                         │           
                    │                  └────────────┬────────────┘           
                    │                               │ 150x160                
                    │                               │                        
                    │                    ┌─────────────────────┐             
                    │                    │       Concat        │             
                    │                    └─────────────────────┘             
                    │                               ▲                        
                    │                  ┌────────────┴────────────┐           
                    │           150x80 │                         │150x80     
                    │                  │                         │           
                    │       ┌─────────────────────┐   ┌─────────────────────┐
                    │       │  Forward GRU (80)   │   │  Backward GRU (80)  │
      ┌─────────────┘       └─────────────────────┘   └─────────────────────┘
      │                                ▲                         ▲           
      │                                │                         │           
      │                                └────────────┬────────────┘           
      │                                             │                        
      │                                             │ 150x300                
      │                                             │                        
      │                                ┌─────────────────────────┐           
      │                                │         Dropout         │           
      │                                └─────────────────────────┘           
      │                                             ▲                        
      │                                             │ 150x300                
      │                                             │                        
      │                                ┌─────────────────────────┐           
      │               ┌───────────────▶│   Embedding Weighting   │           
      │               │                └─────────────────────────┘           
      │               │                             ▲                        
      │               │ 150x1                       │ 150x300                
      │               │                             │                        
      │  ┌─────────────────────────┐   ┌─────────────────────────┐           
      │  │     1D Convolution      │   │   FastText Embeddings   │           
      │  └─────────────────────────┘   └─────────────────────────┘           
      │               ▲                             ▲                        
      └───────────────┤ 150x14                      │ 150x1                  
                      │                             │                        
                                                                             
                Term Scores                     Term IDs                     
                                                                             

The model’s inputs were:

  • The comment’s first 150 preprocessed tokens
  • The Chi2 and Delta IDF scores for each token and label

The term scores were used in two ways (see the code sketch after this list):

  • To weight the FastText embeddings via a 1D convolutional layer that merged each token’s 14 scores into a single scalar weight
  • As features for the final dense layer, after a reduce-max over the sequence kept the highest value of each of the 14 scores
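
In Keras terms, the diagram translates to roughly the following sketch. The layer sizes come from the figure; the vocabulary size, dropout rate, and the sigmoid on the convolution are assumptions, and the pre-trained FastText vectors would be loaded into the embedding layer’s weights:

    import tensorflow as tf
    from tensorflow.keras import layers

    SEQ_LEN, VOCAB_SIZE, EMB_DIM, N_SCORES, N_LABELS = 150, 100_000, 300, 14, 6

    term_ids = layers.Input(shape=(SEQ_LEN,), dtype="int32")    # Term IDs
    term_scores = layers.Input(shape=(SEQ_LEN, N_SCORES))       # Term Scores

    # FastText embeddings: 150x1 -> 150x300
    emb = layers.Embedding(VOCAB_SIZE, EMB_DIM)(term_ids)

    # 1D convolution merges each token's 14 scores into one weight: 150x14 -> 150x1
    w = layers.Conv1D(1, kernel_size=1, activation="sigmoid")(term_scores)

    # Embedding weighting followed by dropout: 150x300
    x = layers.Dropout(0.5)(layers.Multiply()([emb, w]))

    # Forward and backward GRUs (80 units each), concatenated: 150x160
    x = layers.Bidirectional(layers.GRU(80, return_sequences=True))(x)

    # Average and max pooling over the sequence, concatenated: 1x320
    pooled = layers.Concatenate()([layers.GlobalAveragePooling1D()(x),
                                   layers.GlobalMaxPooling1D()(x)])

    # Reduce max keeps the highest value of each score over the sequence: 1x14
    top_scores = layers.GlobalMaxPooling1D()(term_scores)

    # Concat (1x334) and the dense layer producing the six logits
    logits = layers.Dense(N_LABELS)(layers.Concatenate()([pooled, top_scores]))

    model = tf.keras.Model([term_ids, term_scores], logits)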

Weighting the embeddings was inspired by earlier experiments with TF-IDF weighted embeddings. I don’t recall exactly how much it helped, but I believe the effect was positive.

Another novel thing I tried was weighting the losses for each category by their log odds ratio. The rationale was a boosting-like correction for class imbalance: the rarer a label, the more its loss counts. Again, I don’t recall how much this helped, but I must have had good reason to keep it!
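
I no longer have the exact formula, but the idea was roughly the following, assuming each label’s weight was the absolute log odds of its positive rate (y_train being the binary label matrix; this is a reconstruction, not my original code):

    import numpy as np
    import tensorflow as tf

    # y_train: (num_examples, 6) binary label matrix
    p = y_train.mean(axis=0)                 # positive rate per label
    label_weights = tf.constant(np.abs(np.log(p / (1 - p))), dtype=tf.float32)

    def weighted_bce(y_true, logits):
        # Per-label binary cross-entropy, scaled so rarer labels count more
        bce = tf.nn.sigmoid_cross_entropy_with_logits(labels=y_true, logits=logits)
        return tf.reduce_mean(bce * label_weights)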

I trained the model for 8 epochs at a batch size of 128 on my openSUSE Linux box with a Core i7-6850K (6 cores), 32GB DRAM, and an Nvidia Titan X (Pascal) GPU. My final score was an ROC AUC of 0.9804, which would normally be a great score. However, I only ranked 2443 out of 4551 teams (53%). Regardless, I mainly compete in Kaggle competitions to learn, and I definitely succeeded on that front.
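
In Keras terms, the training run boiled down to something like this; the Adam optimizer and the validation split are assumptions, and ids_train / scores_train are placeholder names:

    # Compile with the weighted loss from above and track ROC AUC during training
    model.compile(optimizer="adam",
                  loss=weighted_bce,
                  metrics=[tf.keras.metrics.AUC(from_logits=True)])

    model.fit([ids_train, scores_train], y_train,
              batch_size=128, epochs=8,      # the settings from this writeup
              validation_split=0.1)          # assumed holdout fraction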

Source code is available on GitHub. It includes all of the preprocessing code as well as additional unused models that may serve as good examples.

Tags:  NLP , Kaggle , TensorFlow , DeepLearning