# Week 1: The NLP Pipeline

# Code

Notebook: [NLP Pipeline](https://github.com/purvasingh96/Natural-Language-Specialization/blob/master/Week-1/text_processing.ipynb)

# Summary
## Cleaning
In this step we perform the following tasks -

1. Get the text (`requests.get(url).text`).
2. Remove HTML tags 🏷 using `BeautifulSoup`.

```python
import requests
from bs4 import BeautifulSoup

r = requests.get(url)  # url points to the page we want to clean
soup = BeautifulSoup(r.text, "html5lib")
print(soup.get_text())  # the page text with HTML tags stripped
```
3. Perform web scraping 🕷 (a combined sketch of all three steps follows this list).
```python
# Extract the title of the first story
# (`summaries` is assumed to hold the story elements found on the page)
summaries[0].find("a", class_="storylink").get_text().strip()
```
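
Putting the three steps together, here is a minimal sketch. It assumes the page being scraped is Hacker News, where story rows carried the class `athing` and title links the class `storylink` at the time of writing; both selectors are assumptions to adapt for other pages.

```python
import requests
from bs4 import BeautifulSoup

url = "https://news.ycombinator.com"  # assumed target page
r = requests.get(url)
soup = BeautifulSoup(r.text, "html5lib")

# Each story sits in a <tr class="athing"> row (assumed HN markup)
summaries = soup.find_all("tr", class_="athing")
titles = [s.find("a", class_="storylink").get_text().strip() for s in summaries]
print(titles[:5])
```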

## Normalization

### Case Normalization

Convert all text to lower case: `text.lower()`.

### Punctuation Removal

Remove all punctuation marks.
```python
import re

# Replace every character that is not a letter or digit with a space
text = re.sub(r"[^a-zA-Z0-9]", " ", text)
```
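
For instance, the pattern swaps every non-alphanumeric character for a space, so punctuation can leave runs of whitespace behind; the tokenization step that follows absorbs them.

```python
import re

re.sub(r"[^a-zA-Z0-9]", " ", "Dogs are the best!")
# -> 'Dogs are the best '
re.sub(r"[^a-zA-Z0-9]", " ", "Hello, World!")
# -> 'Hello  World ' (the comma leaves a double space)
```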

## Tokenization

### Split the text
Tokenize all words in a text, or tokenize the text at the sentence level.

```python
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download("punkt")  # one-time download of the Punkt tokenizer models

# Split text into words using NLTK
words = word_tokenize(text)

# Split text into sentences
sentences = sent_tokenize(text)
```
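
This is why a trained sentence tokenizer beats a naive split on periods: Punkt's pretrained English model knows common abbreviations, so a sentence such as the one below should stay intact.

```python
from nltk.tokenize import sent_tokenize

sent_tokenize("Dr. Smith arrived. He was late.")
# Expected: ['Dr. Smith arrived.', 'He was late.']
```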

### Remove stop-words
Stop words are words such as *'i', 'me', 'my', 'myself', 'we', 'our', 'ours', etc.*, which increase our vocabulary size unnecessarily. We remove them as follows -
```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")  # one-time download of the stop-word lists

# Remove stop words
words = [w for w in words if w not in stopwords.words("english")]
```
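
Note that `stopwords.words("english")` returns a list, so the membership test above re-scans it for every token; building a set once keeps the filter fast on long texts.

```python
stop_words = set(stopwords.words("english"))
words = [w for w in words if w not in stop_words]
```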

## Stemming / Lemmatization

Stemming reduces a word to its *stem*. Lemmatization reduces a word to its *root*. The difference between the two processes is that stemming may not generate meaningful words, whereas the root word generated by lemmatization is always meaningful.

```python
import nltk
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

nltk.download("wordnet")  # one-time download of the WordNet data

stemmed = [PorterStemmer().stem(w) for w in words]
lemmed = [WordNetLemmatizer().lemmatize(w) for w in words]
```
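
A concrete word shows the difference: the Porter stemmer clips suffixes mechanically, while the WordNet lemmatizer maps to a dictionary form (by default it treats words as nouns; pass `pos="v"` to lemmatize verbs).

```python
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

PorterStemmer().stem("studies")           # 'studi' - not a real word
WordNetLemmatizer().lemmatize("studies")  # 'study' - a valid dictionary form
```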

## Final NLP Pipeline

The final pipeline for text pre-processing looks as follows -

<img src="./images/NLP Pipeline.png" height="300"></img>
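
Tying the steps together, below is a minimal end-to-end sketch; the function name `nlp_pipeline` and the choice to stem rather than lemmatize are illustrative, not fixed by the pipeline.

```python
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

nltk.download("punkt")
nltk.download("stopwords")

def nlp_pipeline(text):
    # Normalize: lower-case and strip punctuation
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    # Tokenize into words
    words = word_tokenize(text)
    # Remove stop words
    stop_words = set(stopwords.words("english"))
    words = [w for w in words if w not in stop_words]
    # Stem each remaining token
    return [PorterStemmer().stem(w) for w in words]

print(nlp_pipeline("The first time you see the Second Renaissance it may look boring."))
```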