
Shujian2015/Kaggle-Mercari


Kaggle-Mercari

./src/PlanB.py scored 0.40957 on the private leaderboard, ranking 32nd among 2384 teams.


Ideas/things to do


Versions:

  • v8:

    • embedding for cat, brand, cond ...
  • v7:

    • adam lr
  • v6

    • TF: more epochs (v)
    • FM: 18 ?
    • FTRL: beta=0.01 (v)
    • WB: (no)
  • v5

    • delete stemmer and add two epochs (v)
    • log price (v)
    • REMEMBER to add fc back (v)
  • v4

    • elu instead of relu (v)
    • delete one fc layer (v)
    • delete dropout after fc
  • v3

    • cnn
    • dropout after cnn (v)
    • 2gram for cnn
    • rnn
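
The "log price" item in v5 lines up with the competition metric, RMSLE: fitting a squared-error loss on log1p(price) targets the same quantity the leaderboard measures. Below is a minimal sketch of that transform and its inverse, assuming plain Python lists of prices. The function names are illustrative and are not taken from src/PlanB.py.

```python
import math

def to_log_price(prices):
    """Map raw prices to log1p space for use as the training target."""
    return [math.log1p(p) for p in prices]

def from_log_price(log_prices):
    """Invert the transform; clip at 0 so predictions are never negative."""
    return [max(0.0, math.expm1(v)) for v in log_prices]

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error computed on raw prices."""
    sq = [(math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(sq) / len(sq))

prices = [10.0, 25.0, 3.0]
targets = to_log_price(prices)      # what the model would be trained on
restored = from_log_price(targets)  # what would be submitted
```

Using log1p rather than log keeps a price of 0 finite, which matters if zero-priced rows are not dropped beforehand.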

Worth a read:

Top players


Useful features

  • Len of text
  • Mean price of each category
  • Mean of brand/shipping
  • Average of word embeddings: Look up all words in Word2vec and take the average of their vectors. paper, GitHub, Quora
  • Better way to remove stop words: cached
  • Reduce TF time
  • Drop rows with price = 0 or price < 3 (link, link)
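
Several of the features above can be sketched in a few lines. The example below is an illustration under stated assumptions, not the code used in this repository: rows are plain dicts, the field names ("name", "category", "price") are hypothetical, and a two-word toy dictionary stands in for Word2vec. Category means are computed on training rows only, with the global mean as a fallback for unseen categories.

```python
from collections import defaultdict

def drop_low_price(rows, min_price=3.0):
    """Keep only training rows whose price is at least min_price."""
    return [r for r in rows if r["price"] >= min_price]

def category_mean_price(train_rows):
    """Mean training price per category, plus the global mean as a fallback."""
    totals, counts = defaultdict(float), defaultdict(int)
    for r in train_rows:
        totals[r["category"]] += r["price"]
        counts[r["category"]] += 1
    means = {c: totals[c] / counts[c] for c in totals}
    global_mean = sum(totals.values()) / sum(counts.values())
    return means, global_mean

def average_embedding(text, vectors, dim):
    """Average the vectors of known words; zero vector if none are known."""
    found = [vectors[w] for w in text.lower().split() if w in vectors]
    if not found:
        return [0.0] * dim
    return [sum(col) / len(found) for col in zip(*found)]

# Toy data; field names are assumptions, not the competition's schema.
train = [
    {"name": "red dress", "category": "women", "price": 20.0},
    {"name": "blue dress", "category": "women", "price": 10.0},
    {"name": "free sample", "category": "other", "price": 0.0},
]
train = drop_low_price(train)
means, global_mean = category_mean_price(train)
vectors = {"red": [1.0, 0.0], "dress": [0.0, 1.0]}  # stand-in for Word2vec

features = [
    {
        "name_len": len(r["name"]),  # "Len of text"
        "cat_mean_price": means.get(r["category"], global_mean),
        "name_vec": average_embedding(r["name"], vectors, dim=2),
    }
    for r in train
]
```

Mean-price features leak the target if they are computed on the same rows they are later applied to, so in a real pipeline they would likely be computed out-of-fold.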

Tricks

  • Stage 2: 1, 2, Mine

  • Rewrite the code:

    • "without merge(fitting on train and transforming on test) my CV and LB loss increased by 0.009. I can't figure out the reason." Link
    • Test set into batches. link
    • Better val set for TF
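
The "test set into batches" idea presumably keeps memory bounded when the stage 2 test set is much larger than the public one. A minimal sketch, assuming the test rows fit in a list and that the model exposes some batch prediction function; predict_fn and dummy_predict are hypothetical placeholders.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def predict_in_batches(predict_fn, test_rows, batch_size):
    """Run predict_fn on each batch and concatenate the results in order."""
    predictions = []
    for batch in iter_batches(test_rows, batch_size):
        predictions.extend(predict_fn(batch))
    return predictions

# Hypothetical stand-in model: predicts the length of each item name.
def dummy_predict(batch):
    return [float(len(name)) for name in batch]

preds = predict_in_batches(
    dummy_predict, ["a", "bb", "ccc", "dddd", "eeeee"], batch_size=2
)
```

Any fitted transformer (vocabulary, TF-IDF weights, category encoders) has to be fitted before the loop and only applied inside it, otherwise each batch would be encoded differently.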

Tried:

  • Combine (condition and shipping)
  • Concatenation of brand, item description and product name
  • One dimension for item_condition: https://www.kaggle.com/nvhbk16k53/associated-model-rnn-ridge/versions#base=2256015&new=2410057
  • Other features for TF: Quora solutions
    • No 1: Number of capital letters, question marks etc...
    • No 3: We used TFIDF and LSA distances, word co-occurrence measures (pointwise mutual information), word matching measures, fuzzy word matching measures (edit distance, character ngram distances, etc), LDA, word2vec distances, part of speech and named entity features, and some other minor features. These features were mostly recycled from a previous NLP competition, and were not nearly as essential in this competition.
    • No 8 -> a lot
    • https://www.kaggle.com/shujian/naive-xgboost-v2/versions
    • Tune FM: Compare 1 and 2. topic, kernel.
  • Ridge tuning:
    • RDizzl3: Try playing with these parameters and see if you can get similar results to Ridge: alpha, eta0, power_t and max_iter. I have been able to get within 0.002 of my ridge predictions (validation) and it is faster.
  • Text cleaning
    • RDizzl3: I have created rules for replacing text and even missing brand names that do bring some improvement to my score.
    • Darragh: I didn't do too much on hand built feature engineering, but have got some boost with working on improving the tokenization. Still looking for what the top guys have done :) - nltk - ToktokTokenizer
    • Text norm
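
The surface-level text features mentioned under the Quora No 1 solution (number of capital letters, question marks, and so on) are cheap to compute. The sketch below is one possible set of such counts; the feature names are made up for illustration, and the list does not record which counts were actually tried.

```python
def text_stats(text):
    """Simple count-based features for a single text field."""
    return {
        "n_chars": len(text),
        "n_words": len(text.split()),
        "n_upper": sum(1 for ch in text if ch.isupper()),
        "n_question": text.count("?"),
        "n_exclaim": text.count("!"),
    }

stats = text_stats("NEW with tags! Size M?")
```

These counts have to be taken from the raw text, before lowercasing or punctuation stripping, or the capital-letter and punctuation signals are lost.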

Models:

About

32/2384 Solution to Kaggle Mercari Competition (solo silver medal winner)
