Change HP Name & Include Text example #1410

Merged: 14 commits, Mar 2, 2022

Louquinze
Collaborator

@Louquinze Louquinze commented Feb 19, 2022

Handles the following issue:
#1373 (comment)

This commit fixes the first 3 bullet points on the to-do list.

  1. Rename the hyperparameter "ngram_range" to "ngram_upper_bound".
    This includes changing all *csv and *json files.

  2. Create a new text preprocessing example, example_text_preprocessing.py, which features the 20 Newsgroups dataset.

The imports in example_text_preprocessing.py are too long, but I cannot come up with a good solution.
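
For readers who want to reproduce the rename in item 1 on their own files, here is a minimal sketch. It is not the script used in this PR (the thread does not say how the files were changed); the function name `rename_hyperparameter` and the plain substring replacement are assumptions made for illustration.

```python
from pathlib import Path
from typing import List, Sequence


def rename_hyperparameter(
    root: Path,
    old: str,
    new: str,
    suffixes: Sequence[str] = (".csv", ".json"),
) -> List[Path]:
    """Replace every occurrence of `old` with `new` in matching files under `root`.

    Hypothetical helper: a plain substring replacement, so it would also
    rewrite unrelated strings that happen to contain `old`.
    Returns the list of files that were modified.
    """
    changed = []
    for path in sorted(root.rglob("*")):
        # Only touch regular files with one of the requested extensions.
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8")
        if old in text:
            path.write_text(text.replace(old, new), encoding="utf-8")
            changed.append(path)
    return changed
```

A call such as `rename_hyperparameter(Path("some/dir"), "ngram_range", "ngram_upper_bound")` would rewrite the files in place, so it is best run on a clean working tree where the result can be reviewed with `git diff`.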

@codecov

codecov bot commented Feb 21, 2022

Codecov Report

Merging #1410 (bac27b9) into development (00b8e6e) will decrease coverage by 0.00%.
The diff coverage is n/a.

@@               Coverage Diff               @@
##           development    #1410      +/-   ##
===============================================
- Coverage        84.52%   84.51%   -0.01%     
===============================================
  Files              146      146              
  Lines            11283    11283              
  Branches          1929     1929              
===============================================
- Hits              9537     9536       -1     
- Misses            1231     1232       +1     
  Partials           515      515              

@Louquinze Louquinze changed the title Development Change HP Name & Include Text example Feb 21, 2022
Contributor

@mfeurer mfeurer left a comment

@eddiebergman any idea how to fix PEP8 here?

examples/40_advanced/example_text_preprocessing.py
# ==========================

automl = autosklearn.classification.AutoSklearnClassifier(
# set the time high enough text preprocessing can create many new features
Contributor

Does 20 Newsgroups work in the setting on the left? That would be preferable for running this example in GitHub Actions.

Contributor

@mfeurer mfeurer Feb 23, 2022

Maybe we should also use a smaller dataset? You can use the following script to scan OpenML for datasets containing string data:

import openml

datasets = openml.datasets.list_datasets()
for did in datasets:
    try:
        dataset = openml.datasets.get_dataset(did, download_data=False, download_qualities=False)
        for feat in dataset.features:
            if dataset.features[feat].data_type == 'string':
                print(did, dataset.name)
                break
    except Exception as e:
        print(e)
        continue

Collaborator Author

The example yields ~80% accuracy on the test set. Random guessing would give 5% for 20 labels, so I would say that the example works. However, it also runs for 300 seconds, which is 5 minutes. If that is too long, I can search for another dataset.
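
The 5% figure is the chance level for uniform random guessing. A minimal sketch of that arithmetic, assuming the 20 classes are roughly balanced (the function name is hypothetical, and the 0.80 is the approximate accuracy quoted above):

```python
def chance_accuracy(n_classes: int) -> float:
    """Expected accuracy of uniform random guessing over balanced classes."""
    if n_classes < 1:
        raise ValueError("n_classes must be at least 1")
    return 1.0 / n_classes


baseline = chance_accuracy(20)  # 1 / 20 = 0.05
reported = 0.80  # approximate test accuracy quoted in the comment
print(f"chance: {baseline:.0%}, reported: {reported:.0%}")
```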

Contributor

Sorry, I meant, would the example work when you restrict it to use only a single configuration?

Collaborator Author

Is there a parameter for setting auto-sklearn to that, or does it mean max_time == time per model?

Contributor

I would read through the entire API and manual now that you have a bit more familiarity, to know what's possible and what's not
https://automl.github.io/auto-sklearn/master/api.html

Contributor

It has been there in the previous version of the example: smac_scenario_args={"runcount_limit": 1}
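
A minimal sketch of how that argument could be passed, assuming auto-sklearn is installed. The helper names are hypothetical, and the time limits are illustrative values taken from elsewhere in this thread rather than a recommendation; only `smac_scenario_args={"runcount_limit": 1}` comes from the comment above.

```python
def single_configuration_kwargs(time_left: int = 60, per_run: int = 30) -> dict:
    """Keyword arguments that limit the SMAC search to a single run."""
    return {
        "time_left_for_this_task": time_left,
        "per_run_time_limit": per_run,
        # Stop after one evaluated configuration, as quoted in the comment.
        "smac_scenario_args": {"runcount_limit": 1},
    }


def build_classifier(**overrides):
    """Construct the classifier; auto-sklearn is imported lazily."""
    import autosklearn.classification  # requires auto-sklearn to be installed

    kwargs = {**single_configuration_kwargs(), **overrides}
    return autosklearn.classification.AutoSklearnClassifier(**kwargs)
```

Calling `build_classifier().fit(X_train, y_train)` would then evaluate at most one configuration, which keeps the example cheap enough for CI.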

@eddiebergman
Contributor

The "line too long" errors in pre-commit can be fixed by adding # noqa: E501 to the end of those lines. I was rethinking whether a line length of 100 is fine, but that's a separate thing to discuss; it wouldn't prevent these errors anyway.

The other solution is to import the modules in the package's __init__ so the imports aren't this long.
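
A minimal sketch of the first option, using a standard-library import as a stand-in since the actual import from the example is not shown in this thread. The parenthesized form is an additional, standard Python alternative that was not proposed above; it only helps when the line is long because of the imported names, not because of the module path itself.

```python
# Option 1: keep the long import on one line and silence flake8's E501 for it.
from concurrent.futures import ALL_COMPLETED, FIRST_COMPLETED, ProcessPoolExecutor, ThreadPoolExecutor, as_completed, wait  # noqa: E501

# Option 2 (not from the thread): wrap the names in parentheses so that
# no single line exceeds the length limit.
from concurrent.futures import (
    ALL_COMPLETED,
    FIRST_COMPLETED,
    ProcessPoolExecutor,
    ThreadPoolExecutor,
    as_completed,
    wait,
)
```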

include feedback from 02.24.
@mfeurer
Contributor

mfeurer commented Feb 24, 2022

The doc build failure appeared to be unrelated, so I just restarted it. However, the pre-commit fails right now, could you please have a look into this?

@eddiebergman
Contributor

eddiebergman commented Feb 24, 2022

The doc build failure appeared to be unrelated, so I just restarted it. However, the pre-commit fails right now, could you please have a look into this?

Run make format for the formatting. This bug with the leaderboard, as shown in the docs, is triggered by this example. While not directly related to the example, it often occurs when no models are found, as the IDs of the models get messed up.

@Louquinze
Collaborator Author

The doc build failure appeared to be unrelated, so I just restarted it. However, the pre-commit fails right now, could you please have a look into this?

Run make format for the formatting. This bug with the leaderboard, as shown in the docs, is triggered by this example. While not directly related to the example, it often occurs when no models are found, as the IDs of the models get messed up.

I ran make format; pre-commit works now.

…le contains only one model. Therefore we reduced the problem complexity
Contributor

@mfeurer mfeurer left a comment

That looks good now. I believe the example could be drastically simplified by restricting the categories via the categories argument of the function that loads 20 Newsgroups.
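
A minimal sketch of that suggestion, assuming scikit-learn's `fetch_20newsgroups` is the loading function meant here. The two category names are the ones that appear in the diff further down; the helper name `load_subset` and the `random_state` value are assumptions for illustration. The first call downloads the dataset, so the import and the fetch are kept inside the function.

```python
CATEGORIES = ["comp.sys.ibm.pc.hardware", "rec.sport.baseball"]


def load_subset(subset: str, categories=CATEGORIES):
    """Return (texts, labels) for one split, restricted to `categories`."""
    from sklearn.datasets import fetch_20newsgroups  # requires scikit-learn

    return fetch_20newsgroups(
        subset=subset,  # "train" or "test"
        categories=list(categories),  # keep only the listed newsgroups
        shuffle=True,
        random_state=42,  # assumed seed, for reproducible shuffling
        return_X_y=True,  # return (data, target) instead of a Bunch
    )
```

With two categories the task becomes binary classification, for example `X_train, y_train = load_subset("train")`.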

@Louquinze Louquinze requested a review from mfeurer March 1, 2022 16:10
cats = ["comp.sys.ibm.pc.hardware", "rec.sport.baseball"]
X_train, y_train = fetch_20newsgroups(
subset="train", # select train set
shuffle=True, # shuffle the data set for unbiased validation results
Contributor

Suggested change
shuffle=True, # shuffle the data set for unbiased validation results
shuffle=True, # shuffle the data set for unbiased validation results

) # load this two columns separately as numpy array

X_test, y_test = fetch_20newsgroups(
subset="test", # select test set for unbiased evaluation
Contributor

Suggested change
subset="test", # select test set for unbiased evaluation
subset="test", # select test set for unbiased evaluation

# set the time high enough text preprocessing can create many new features
time_left_for_this_task=300,
per_run_time_limit=30,
time_left_for_this_task=60, # absolute time limit for fitting the ensemble
Contributor

Suggested change
time_left_for_this_task=60, # absolute time limit for fitting the ensemble
time_left_for_this_task=60,

@Louquinze Louquinze requested a review from mfeurer March 1, 2022 17:20
@mfeurer mfeurer merged commit ab5c016 into automl:development Mar 2, 2022
eddiebergman pushed a commit that referenced this pull request Aug 18, 2022
* Rename "ngram_range" to "ngram_upper_bound"; this includes renaming it in all *csv and *json files for metalearning

* Handle the following issue:
#1373 (comment)

This commit fixes the first 3 bullet points on the to-do list.

1. Rename the hyperparameter "ngram_range" to "ngram_upper_bound".
   This includes changing all *csv and *json files.

2. Create a new text preprocessing example, example_text_preprocessing.py, which features the 20 Newsgroups dataset.

The imports in example_text_preprocessing.py are too long, but I cannot come up with a good solution.

Include feedback from 02.24.

* Limit 20NG to 5 labels. automl.leaderboard has problems if the ensemble contains only one model. Therefore we reduced the problem complexity

* Limit 20NG to 2 labels. automl.leaderboard has problems if the ensemble contains only one model. Therefore we reduced the problem complexity
