Commit b8482d8: Add tokenizer

Asing1001 committed Feb 16, 2024 · 1 parent 304339f
Showing 1 changed file with 15 additions and 2 deletions: source/_posts/Classification-with-scikit-learn.md

## Building the Classification Pipeline

With our data ready, we construct a classification pipeline using scikit-learn. The pipeline consists of a TF-IDF vectorizer, with Jieba as the Chinese tokenizer, for feature extraction, and a linear support vector classifier (LinearSVC) as the classification model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC
from util.tokenizer import tokenizer  # defined below

# Split the data into training and testing sets
X = data['text'].values
y = data['categories'].values  # label column assumed from the earlier value_counts() snippet
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42  # random_state value assumed; truncated in the diff view
)

# Define the pipeline
pipeline = Pipeline([
    ('jieba', TfidfVectorizer(tokenizer=tokenizer)),  # TF-IDF features over Jieba tokens
    ('clf', CalibratedClassifierCV(LinearSVC())),     # calibrated LinearSVC
])
```
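Note that LinearSVC itself does not expose `predict_proba`; wrapping it in `CalibratedClassifierCV` calibrates its decision scores and provides probability estimates for each category.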

`util/tokenizer.py` provides the Chinese tokenizer:

```python
import re
import jieba

def tokenizer(text):
    # Segment the text into words with Jieba
    words = list(jieba.cut(text))
    # Keep only tokens made of word characters or CJK ideographs,
    # dropping punctuation and whitespace
    words = [word for word in words if re.match(r'^[\w\u4e00-\u9fff]+$', word)]
    return words
```
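As a quick sanity check, the tokenizer keeps both CJK and Latin tokens while dropping punctuation; the exact segmentation depends on Jieba's dictionary:

```python
print(tokenizer("我们用 scikit-learn 做文本分类！"))
# e.g. ['我们', '用', 'scikit', 'learn', '做', '文本', '分类']
# ('-' is filtered by the regex, so 'scikit-learn' splits into two tokens)
```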

## Model Training and Evaluation

Next, we train our model using grid search to find the optimal hyperparameters and evaluate its performance on the test set.
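The training code itself is collapsed in this diff, so the following is only a minimal sketch: the parameter grid (`jieba__ngram_range`, `clf__estimator__C`) is illustrative rather than taken from the post, and `clf__estimator__C` assumes scikit-learn ≥ 1.2, where `CalibratedClassifierCV` exposes its wrapped model as `estimator` (older versions use `base_estimator`).

```python
from sklearn.model_selection import GridSearchCV

# Illustrative grid -- the post's actual hyperparameters are not shown in this diff
param_grid = {
    'jieba__ngram_range': [(1, 1), (1, 2)],  # unigrams vs. unigrams + bigrams
    'clf__estimator__C': [0.1, 1, 10],       # LinearSVC regularization strength
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)

print(grid_search.best_params_)
print(grid_search.score(X_test, y_test))  # mean accuracy on the held-out test set
```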
