master : Term 2 begins!
purvasingh96 committed Sep 6, 2020
1 parent 3f2ed40 commit 7780cd3
Showing 56 changed files with 21,191 additions and 0 deletions.
@@ -0,0 +1,86 @@
# Week 1 : The NLP Pipeline

# Code

Notebook : [NLP Pipeline](https://github.com/purvasingh96/Natural-Language-Specialization/blob/master/Week-1/text_processing.ipynb)

# Summary

## Cleaning
This step involves the following tasks:

1. Get the text (`requests.get(url).text`)
2. Remove HTML tags 🏷 using `BeautifulSoup`.

```python
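import requests
from bs4 import BeautifulSoup

r = requests.get(url)  # `url` is the address of the page to fetch
soup = BeautifulSoup(r.text, "html5lib")  # the "html5lib" parser must be installed separately
print(soup.get_text())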
```
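3. Perform web scraping 🕷
```python
# Extract title
# `summaries` is assumed to be a list of elements selected earlier, e.g. with soup.find_all(...)
title = summaries[0].find("a", class_="storylink").get_text().strip()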
```
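
To make the tag-removal step concrete without any third-party install, here is a minimal sketch that uses Python's standard-library `html.parser` in place of `BeautifulSoup`. The names `TextExtractor` and `strip_tags` are illustrative and not part of the original notebook, and the result will not match `soup.get_text()` exactly on real-world pages.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for every run of text between tags
        self.chunks.append(data)


def strip_tags(html):
    """Return the text content of `html` with all tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


print(strip_tags("<p>Hello <b>world</b>!</p>"))  # prints: Hello world!
```

For messy pages, `BeautifulSoup` with a lenient parser is usually the more robust choice; this sketch only shows the idea behind the step.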

## Normalization

### Case Normalization

Convert all text to lowercase with `text.lower()`.

### Punctuation Removal

Remove all punctuation marks by replacing every character that is not a letter or a digit with a space.
```python
import re
text = re.sub(r"[^a-zA-Z0-9]", " ", text)
```
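
The two normalization steps can be combined into one helper. The following is a small sketch, assuming plain ASCII English text; the function name `normalize` and the final whitespace-collapsing step are additions for illustration and are not in the notes above.

```python
import re


def normalize(text):
    """Lowercase `text` and replace non-alphanumeric characters with spaces."""
    text = text.lower()
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)
    # Extra step (an assumption, not in the notes above): collapse repeated spaces
    return " ".join(text.split())


print(normalize("Hello, World! It's 2020."))  # prints: hello world it s 2020
```

Note that the apostrophe is replaced as well, so "It's" ends up as the two pieces "it" and "s". Whether that is acceptable depends on the downstream task.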

## Tokenization

### Split the text
Split the text into individual words, or tokenize it at the sentence level.


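```python
# The tokenizers need NLTK's Punkt models; depending on the NLTK version,
# download them once with nltk.download("punkt") or nltk.download("punkt_tab")
from nltk.tokenize import word_tokenize, sent_tokenize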

# Split text into words using NLTK
words = word_tokenize(text)

# Split text into sentences
sentences = sent_tokenize(text)
```
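
For comparison, here is a rough regex-based approximation of both kinds of tokenization using only the standard library. It is not how NLTK works internally; the function names `simple_word_tokenize` and `simple_sent_tokenize` are hypothetical and exist only for this sketch.

```python
import re


def simple_word_tokenize(text):
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def simple_sent_tokenize(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


sample = "Dr. Smith is here. Hello!"
print(simple_word_tokenize(sample))
# ['Dr', '.', 'Smith', 'is', 'here', '.', 'Hello', '!']
print(simple_sent_tokenize(sample))
# ['Dr.', 'Smith is here.', 'Hello!']
```

The naive sentence splitter breaks after the abbreviation "Dr.", which is the kind of case NLTK's Punkt-based `sent_tokenize` is designed to handle better.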

### Remove stop-words
Stop words are very common words such as *'i', 'me', 'my', 'myself', 'we', 'our', 'ours'*, etc., which increase the vocabulary size unnecessarily while carrying little meaning. Remove them as follows:
```python
from nltk.corpus import stopwords

# Build the stop-word set once instead of reloading the list for every word
stop_words = set(stopwords.words("english"))

# Remove stop words
words = [w for w in words if w not in stop_words]
```
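
The filtering logic itself does not depend on NLTK. Below is a minimal sketch with a hypothetical seven-word stop list (NLTK's English list is much longer); `STOP_WORDS` and `remove_stop_words` are illustrative names. Using a `set` keeps each membership check cheap.

```python
# Hypothetical mini stop list for illustration; NLTK's English list is much longer
STOP_WORDS = {"i", "me", "my", "myself", "we", "our", "ours"}


def remove_stop_words(words, stop_words=STOP_WORDS):
    """Keep only the words that are not in `stop_words` (case-insensitive)."""
    return [w for w in words if w.lower() not in stop_words]


print(remove_stop_words(["I", "walk", "my", "dog"]))  # prints: ['walk', 'dog']
```

NLTK's English stop words are all lowercase, so case normalization should run before this step, or the comparison should lowercase each word as done here.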

## Stemming/ Lemmatization

Stemming reduces a word to its *stem*, typically by stripping suffixes with heuristic rules. Lemmatization reduces a word to its *root* (its lemma, or dictionary form). The difference between the two processes is that stemming may produce strings that are not real words, whereas the root produced by lemmatization is a valid dictionary word.

```python
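from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

# Create each object once and reuse it for every word
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

stemmed = [stemmer.stem(w) for w in words]
# lemmatize() treats each word as a noun unless a `pos` argument is passed
lemmed = [lemmatizer.lemmatize(w) for w in words]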
```
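
The contrast between the two approaches can be seen with a toy example. The sketch below is neither the Porter algorithm nor WordNet: the suffix rules in `SUFFIXES` and the lookup table `LEMMAS` are made up for illustration, as are the function names.

```python
# Toy rules and lookup table, chosen only to illustrate the contrast
SUFFIXES = ("ing", "ed", "es", "s")
LEMMAS = {"studies": "study", "wolves": "wolf", "cats": "cat"}


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in a dictionary; return it unchanged if unknown."""
    return LEMMAS.get(word, word)


for w in ["studies", "wolves", "cats"]:
    print(w, toy_stem(w), toy_lemmatize(w))
# studies studi study
# wolves wolv wolf
# cats cat cat
```

Rule-based stripping is cheap but yields non-words such as "studi" and "wolv", while the dictionary lookup returns real words at the cost of needing a vocabulary.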

## Final NLP Pipeline

The final pipeline for text pre-processing looks as follows:

<img src="./images/NLP Pipeline.png" height="300"></img>
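
As a rough sketch of how the stages chain together in code, the function below runs normalization, tokenization, stop-word removal and stemming in order. It assumes the cleaning step (HTML removal) has already happened, uses whitespace splitting instead of NLTK's tokenizer, and takes the stop list and stemming function as parameters; the name `preprocess` and its signature are assumptions made for this example.

```python
import re


def preprocess(text, stop_words, stem=lambda w: w):
    """Run normalization, tokenization, stop-word removal and stemming in order."""
    text = text.lower()                        # case normalization
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)  # punctuation removal
    words = text.split()                       # simple whitespace tokenization
    words = [w for w in words if w not in stop_words]
    return [stem(w) for w in words]


# Hypothetical stop list and input, for illustration only
print(preprocess("We love our NLP pipeline!", {"we", "our"}))
# ['love', 'nlp', 'pipeline']
```

With NLTK installed, `set(stopwords.words("english"))` and `PorterStemmer().stem` could be passed in as `stop_words` and `stem`.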



