./src/PlanB.py scores 0.40957 on the private leaderboard, ranking 32nd among 2384 teams.
-
v8:
- embeddings for category, brand, condition, ...
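  A minimal sketch of what these categorical embeddings could look like, assuming a tf.keras model and label-encoded integer inputs; vocabulary sizes and embedding dimensions below are placeholders, not the actual v8 settings:

  ```python
  # Sketch only: tf.keras embeddings for label-encoded categorical inputs.
  # Vocabulary sizes / dimensions are placeholders, not the real v8 values.
  import tensorflow as tf
  from tensorflow.keras import layers

  def embed_input(name, vocab_size, dim):
      inp = layers.Input(shape=(1,), name=name, dtype="int32")
      vec = layers.Flatten()(layers.Embedding(vocab_size, dim)(inp))
      return inp, vec

  cat_in, cat_vec = embed_input("category", vocab_size=1300, dim=16)
  brand_in, brand_vec = embed_input("brand", vocab_size=5000, dim=16)
  cond_in, cond_vec = embed_input("condition", vocab_size=6, dim=3)

  x = layers.Concatenate()([cat_vec, brand_vec, cond_vec])
  x = layers.Dense(128, activation="elu")(x)     # ELU, as in v4
  out = layers.Dense(1)(x)                       # predicts log1p(price)
  model = tf.keras.Model([cat_in, brand_in, cond_in], out)
  model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")
  ```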
-
v7:
- Adam learning rate
-
v6:
- TF: more epochs (v)
- FM: 18 ?
- FTRL: beta=0.01 (v)
- WB: (no)
-
v5:
- delete stemmer and add two epochs (v)
- log price (v)
- REMEMBER to add the FC layer back (v)
-
v4:
- ELU instead of ReLU (v)
- delete one FC layer (v)
- delete dropout after FC
-
v3:
- CNN
- dropout after CNN (v)
- 2-gram for CNN
- RNN
Strategy:
- Ridge: performance / computation-time trade-off
- ensemble averaging
- Why Ridge is much better than other sklearn models
- Efficient Way to do TFIDF
- Use log price as the dependent variable. But be careful with the "without zero price" kernels: they also remove zero-price rows from the validation set, which makes the local CV score useless. If you want to remove zero prices, remove them inside the fold, so the validation set still resembles the original dataset and your CV score tracks the LB (see the sketch after this list).
- Wordbatch(TFIDF) vs WordSequence
- Best single model
- Wordbatch for preprocessing and modeling
- Surpass 0.40000
- LB shake up
- CNN or RNN: Best single model
- FastText: 1-2 gram
- TF dataset
- My model improves from 0.433 to 0.410 with these attempts (sketched after this list):
- Len of text
- Mean price of each category
- Mean of brand/shipping
- Average of word embeddings: look up all words in Word2vec and take their average (paper, GitHub, Quora)
- Better way to remove stop words (cached)
- Reduce TF time
- Drop price = 0 or < 3 (link, link)
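
A minimal sketch of the CV point above, assuming a pandas DataFrame `train` with a `price` column and a placeholder `fit_model` function (both names are illustrative): the price is log-transformed, and zero/low-price rows are dropped only from the training split of each fold, so the validation split still resembles the original data.

```python
# Sketch (assumptions: pandas DataFrame `train` with a `price` column;
# `fit_model` is a placeholder for whatever model is actually trained).
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

train["target"] = np.log1p(train["price"])        # log price as dependent variable

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for tr_idx, va_idx in kf.split(train):
    tr, va = train.iloc[tr_idx], train.iloc[va_idx]
    tr = tr[tr["price"] >= 3]                     # drop price 0 / < 3 inside the fold only
    model = fit_model(tr)
    pred = model.predict(va)
    scores.append(np.sqrt(mean_squared_error(va["target"], pred)))  # RMSLE, since target is log1p(price)
print("CV:", np.mean(scores))
```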
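
And a sketch of the feature attempts listed above (text length, mean price per category/brand/shipping, averaged word vectors), assuming a pandas DataFrame `train` with the Mercari columns and a gensim `KeyedVectors` model `w2v`; all names are illustrative, and the mean-price features should be computed on training data only to avoid leakage.

```python
# Sketch (assumptions: pandas DataFrame `train` with Mercari columns;
# `w2v` is a gensim KeyedVectors word2vec model; all names are illustrative).
import numpy as np

# Length of text
train["desc_len"] = train["item_description"].fillna("").str.len()

# Mean price of each category / brand / shipping value (computed on the training split only)
for col in ["category_name", "brand_name", "shipping"]:
    means = train.groupby(col)["price"].mean()
    train[col + "_mean_price"] = train[col].map(means)

# Average of word embeddings: look up every word in word2vec and average the vectors
def avg_embedding(text, dim=300):
    vecs = [w2v[w] for w in text.split() if w in w2v]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

desc_vectors = np.vstack([avg_embedding(t) for t in train["item_description"].fillna("")])
```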
-
Rewrite the code:
- Combine condition and shipping into one feature (sketch after this list)
- Concatenation of brand, item description, and product name
- One dimension for item_condition: https://www.kaggle.com/nvhbk16k53/associated-model-rnn-ridge/versions#base=2256015&new=2410057
- Other features for TF: Quora solutions
- No. 1: number of capital letters, question marks, etc. (sketch after this list)
- No 3: We used TFIDF and LSA distances, word co-occurrence measures (pointwise mutual information), word matching measures, fuzzy word matching measures (edit distance, character ngram distances, etc), LDA, word2vec distances, part of speech and named entity features, and some other minor features. These features were mostly recycled from a previous NLP competition, and were not nearly as essential in this competition.
- No 8 -> a lot
- https://www.kaggle.com/shujian/naive-xgboost-v2/versions
- Tune FM: Compare 1 and 2. topic, kernel.
- Ridge tuning (sketch after this list):
- RDizzl3: Try playing with these parameters and see if you can get similar results to Ridge: alpha, eta0, power_t and max_iter. I have been able to get within 0.002 of my ridge predictions (validation) and it is faster.
- Text cleaning
- RDizzl3: I have created rules for replacing text and even missing brand names that do bring some improvement to my score.
- Darragh: I didn't do too much hand-built feature engineering, but got some boost from improving the tokenization. Still looking for what the top guys have done :) (nltk ToktokTokenizer; sketch after this list)
- Text normalization
- Wordbatch FTRL+FM+TF: Public Score 0.41803
- Wordbatch FTRL+FM+LGB: Public Score 0.42497
- Tensorflow starter: conv1d + emb: Public Score 0.43839
- Wordbatch FTRL+FM+TF+new features: Public Score 0.41502
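
A sketch of the "combine condition and shipping" and "concatenate brand, description, name" items, assuming a pandas DataFrame `df` with the standard Mercari columns:

```python
# Sketch (assumption: pandas DataFrame `df` with the standard Mercari columns).
df["cond_ship"] = df["item_condition_id"].astype(str) + "_" + df["shipping"].astype(str)
df["text"] = (df["name"].fillna("") + " "
              + df["brand_name"].fillna("") + " "
              + df["item_description"].fillna(""))
```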
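A sketch of the Quora "No. 1"-style count features; column names are illustrative:

```python
# Sketch: simple count features over the description text.
desc = df["item_description"].fillna("")
df["n_caps"] = desc.str.count(r"[A-Z]")           # number of capital letters
df["n_excl"] = desc.str.count("!")                # exclamation marks
df["n_quest"] = desc.str.count(r"\?")             # question marks
df["n_words"] = desc.str.split().str.len()        # word count
```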
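For the Ridge-tuning note: eta0 and power_t suggest RDizzl3 is referring to sklearn's SGDRegressor, which can get close to Ridge on sparse TFIDF features while training faster. A sketch under that assumption; the parameter values are illustrative, not the tuned ones:

```python
# Sketch (assumptions: sparse feature matrix X and log1p-price target y;
# parameter values are illustrative, not the tuned ones).
from sklearn.linear_model import Ridge, SGDRegressor

ridge = Ridge(alpha=3.0, solver="sag", random_state=42)
sgd = SGDRegressor(alpha=1e-6, eta0=0.05, power_t=0.25,   # learning_rate defaults to "invscaling"
                   max_iter=30, tol=None, random_state=42)

# Fit both on the same split and compare validation RMSLE and wall-clock time:
# ridge.fit(X_tr, y_tr); sgd.fit(X_tr, y_tr)
```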
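And for the tokenization note, a sketch using nltk's ToktokTokenizer; applying it to the text columns before TFIDF/Wordbatch is an assumption about where it fits, not a confirmed detail:

```python
# Sketch: retokenize text with nltk's ToktokTokenizer before vectorizing.
from nltk.tokenize.toktok import ToktokTokenizer

toktok = ToktokTokenizer()

def retokenize(text):
    return " ".join(toktok.tokenize(text.lower()))

# e.g. df["item_description"] = df["item_description"].fillna("").apply(retokenize)
```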