./src/PlanB.py scores 0.40957 on the private leaderboard, ranking 32nd among 2384 teams.
-
v8:
- embeddings for category, brand, condition, ...
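  A minimal sketch of what these categorical embeddings could look like, assuming a tf.keras model and label-encoded integer inputs; vocabulary sizes and embedding dimensions below are placeholders, not the actual v8 settings:

  ```python
  # Sketch only: tf.keras embeddings for label-encoded categorical inputs.
  # Vocabulary sizes / dimensions are placeholders, not the real v8 values.
  import tensorflow as tf
  from tensorflow.keras import layers

  def embed_input(name, vocab_size, dim):
      inp = layers.Input(shape=(1,), name=name, dtype="int32")
      vec = layers.Flatten()(layers.Embedding(vocab_size, dim)(inp))
      return inp, vec

  cat_in, cat_vec = embed_input("category", vocab_size=1300, dim=16)
  brand_in, brand_vec = embed_input("brand", vocab_size=5000, dim=16)
  cond_in, cond_vec = embed_input("condition", vocab_size=6, dim=3)

  x = layers.Concatenate()([cat_vec, brand_vec, cond_vec])
  x = layers.Dense(128, activation="elu")(x)     # ELU, as in v4
  out = layers.Dense(1)(x)                       # predicts log1p(price)
  model = tf.keras.Model([cat_in, brand_in, cond_in], out)
  model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")
  ```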
-
v7:
- Adam learning rate
-
v6:
- TF: more epochs (v)
- FM: 18 ?
- FTRL: beta=0.01 (v)
- WB: (no)
-
v5:
- delete stemmer and add two epochs (v)
- log price (v)
- REMEMBER to add the FC layer back (v)
-
v4:
- ELU instead of ReLU (v)
- delete one FC layer (v)
- delete dropout after FC
-
v3:
- CNN
- dropout after CNN (v)
- 2-gram for CNN
- RNN
Strategy:
- Ridge: performance / computation-time trade-off
- ensemble averaging
- Why Ridge is much better than other sklearn models
- Efficient Way to do TFIDF
- Use log price as the dependent variable. But be careful with the "without zero price" kernels: they also remove zero-price rows from the validation set, which makes the local CV score useless. If you want to remove zero prices, remove them inside the fold, so the validation set still resembles the original dataset and your CV score tracks the LB (see the sketch after this list).
- Wordbatch(TFIDF) vs WordSequence
- Best single model
- Wordbatch for preprocessing and modeling
- Surpass 0.40000
- LB shake up
- CNN or RNN: Best single model
- FastText: 1-2 gram
- TF dataset
- My model improves from 0.433 to 0.410 with these attempts (sketched after this list):
- Len of text
- Mean price of each category
- Mean of brand/shipping
- Average of word embeddings: look up all words in Word2vec and take their average (paper, GitHub, Quora)
- Better way to remove stop words (cached)
- Reduce TF time
- Drop price = 0 or < 3 (link, link)
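
A minimal sketch of the CV point above, assuming a pandas DataFrame `train` with a `price` column and a placeholder `fit_model` function (both names are illustrative): the price is log-transformed, and zero/low-price rows are dropped only from the training split of each fold, so the validation split still resembles the original data.

```python
# Sketch (assumptions: pandas DataFrame `train` with a `price` column;
# `fit_model` is a placeholder for whatever model is actually trained).
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

train["target"] = np.log1p(train["price"])        # log price as dependent variable

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for tr_idx, va_idx in kf.split(train):
    tr, va = train.iloc[tr_idx], train.iloc[va_idx]
    tr = tr[tr["price"] >= 3]                     # drop price 0 / < 3 inside the fold only
    model = fit_model(tr)
    pred = model.predict(va)
    scores.append(np.sqrt(mean_squared_error(va["target"], pred)))  # RMSLE, since target is log1p(price)
print("CV:", np.mean(scores))
```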
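
And a sketch of the feature attempts listed above (text length, mean price per category/brand/shipping, averaged word vectors), assuming a pandas DataFrame `train` with the Mercari columns and a gensim `KeyedVectors` model `w2v`; all names are illustrative, and the mean-price features should be computed on training data only to avoid leakage.

```python
# Sketch (assumptions: pandas DataFrame `train` with Mercari columns;
# `w2v` is a gensim KeyedVectors word2vec model; all names are illustrative).
import numpy as np

# Length of text
train["desc_len"] = train["item_description"].fillna("").str.len()

# Mean price of each category / brand / shipping value (computed on the training split only)
for col in ["category_name", "brand_name", "shipping"]:
    means = train.groupby(col)["price"].mean()
    train[col + "_mean_price"] = train[col].map(means)

# Average of word embeddings: look up every word in word2vec and average the vectors
def avg_embedding(text, dim=300):
    vecs = [w2v[w] for w in text.split() if w in w2v]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

desc_vectors = np.vstack([avg_embedding(t) for t in train["item_description"].fillna("")])
```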
-
Rewrite the code:
- Combine condition and shipping into one feature (sketch after this list)
- Concatenation of brand, item description, and product name
- One dimension for item_condition: https://www.kaggle.com/nvhbk16k53/associated-model-rnn-ridge/versions#base=2256015&new=2410057
- Other features for TF: Quora solutions
- No. 1: number of capital letters, question marks, etc. (sketch after this list)
- No 3: We used TFIDF and LSA distances, word co-occurrence measures (pointwise mutual information), word matching measures, fuzzy word matching measures (edit distance, character ngram distances, etc), LDA, word2vec distances, part of speech and named entity features, and some other minor features. These features were mostly recycled from a previous NLP competition, and were not nearly as essential in this competition.
- No 8 -> a lot
- https://www.kaggle.com/shujian/naive-xgboost-v2/versions
- Tune FM: Compare 1 and 2. topic, kernel.
- Ridge tuning (sketch after this list):
- RDizzl3: Try playing with these parameters and see if you can get similar results to Ridge: alpha, eta0, power_t and max_iter. I have been able to get within 0.002 of my ridge predictions (validation) and it is faster.
- Text cleaning
- RDizzl3: I have created rules for replacing text and even missing brand names that do bring some improvement to my score.
- Darragh: I didn't do too much hand-built feature engineering, but got some boost from improving the tokenization. Still looking for what the top guys have done :) (nltk ToktokTokenizer; sketch after this list)
- Text normalization
- Wordbatch FTRL+FM+TF: Public Score 0.41803
- Wordbatch FTRL+FM+LGB: Public Score 0.42497
- Tensorflow starter: conv1d + emb: Public Score 0.43839
- Wordbatch FTRL+FM+TF+new features: Public Score 0.41502
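
A sketch of the "combine condition and shipping" and "concatenate brand, description, name" items, assuming a pandas DataFrame `df` with the standard Mercari columns:

```python
# Sketch (assumption: pandas DataFrame `df` with the standard Mercari columns).
df["cond_ship"] = df["item_condition_id"].astype(str) + "_" + df["shipping"].astype(str)
df["text"] = (df["name"].fillna("") + " "
              + df["brand_name"].fillna("") + " "
              + df["item_description"].fillna(""))
```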
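A sketch of the Quora "No. 1"-style count features; column names are illustrative:

```python
# Sketch: simple count features over the description text.
desc = df["item_description"].fillna("")
df["n_caps"] = desc.str.count(r"[A-Z]")           # number of capital letters
df["n_excl"] = desc.str.count("!")                # exclamation marks
df["n_quest"] = desc.str.count(r"\?")             # question marks
df["n_words"] = desc.str.split().str.len()        # word count
```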
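For the Ridge-tuning note: eta0 and power_t suggest RDizzl3 is referring to sklearn's SGDRegressor, which can get close to Ridge on sparse TFIDF features while training faster. A sketch under that assumption; the parameter values are illustrative, not the tuned ones:

```python
# Sketch (assumptions: sparse feature matrix X and log1p-price target y;
# parameter values are illustrative, not the tuned ones).
from sklearn.linear_model import Ridge, SGDRegressor

ridge = Ridge(alpha=3.0, solver="sag", random_state=42)
sgd = SGDRegressor(alpha=1e-6, eta0=0.05, power_t=0.25,   # learning_rate defaults to "invscaling"
                   max_iter=30, tol=None, random_state=42)

# Fit both on the same split and compare validation RMSLE and wall-clock time:
# ridge.fit(X_tr, y_tr); sgd.fit(X_tr, y_tr)
```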
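And for the tokenization note, a sketch using nltk's ToktokTokenizer; applying it to the text columns before TFIDF/Wordbatch is an assumption about where it fits, not a confirmed detail:

```python
# Sketch: retokenize text with nltk's ToktokTokenizer before vectorizing.
from nltk.tokenize.toktok import ToktokTokenizer

toktok = ToktokTokenizer()

def retokenize(text):
    return " ".join(toktok.tokenize(text.lower()))

# e.g. df["item_description"] = df["item_description"].fillna("").apply(retokenize)
```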