Basic utilties for processing Tweets. Includes:
TweetTokenizer
for tokenizing the Tweet contentTweetReader
for easily iterating over TweetsTweetWriter
for conveniently writing one or more Tweets to a file in JSONlines format
There are two options for installing littlebird
.
git clone https://github.com/AADeLucia/littlebird.git
cd littlebird
python setup.py develop
pip install git+git://github.com/AADeLucia/littlebird.git#egg=littlebird
The example below reads in a Tweet file, filters to Tweets that have a hashtag, and writes out to a new file.
TweetWriter
can write a single Tweet or list of Tweets to a file in JSONlines format. It will also automatically open a GZIP file if the provided filename has a .gz
extension. If you are writing to a GZIP file, it is recommended to write all Tweets at once instead of writing incrementally; this provides better file compression. If you do need to write incrementally, I recommend writing to a normal file and GZIPping after.
from littlebird import TweetReader, TweetWriter
# File in JSONlines form. Automatically handles GZIP files.
tweet_file = "2014_01_02.json.gz"
reader = TweetReader(tweet_file)
# Iterate over Tweets
# Save Tweets that contain hashtags
filtered_tweets = []
for tweet in reader.read_tweets():
if tweet.get("truncated", False):
num_hashtags = len(tweet["extended_tweet"]["entities"]["hashtags"])
else:
num_hashtags = len(tweet["entities"]["hashtags"])
if num_hashtags > 0:
filtered_tweets.append(tweet)
# Write out filtered Tweets
writer = TweetWriter("filtered.json")
writer.write(filtered_tweets)
You can also skip retweeted and quoted statuses
for tweet in reader.read_tweets(skip_retweeted_and_quoted=True):
There are multiple pre-defined tokenizers that are useful for different downstream usecases.
- TweetTokenizer: default customizable tokenizer
- BERTweetTokenizer: tokenizer based off the original BERTweet pre-processing from the BERTweet model repo.
- GloVeTweetTokenizer: tokenizer based off a modified version of the GloVe preprocessor. It's based off of a modified version because the original ruby processing script does not work.
You can also write your own tokenizer. Examples are below.
A basic example using the default Tokenizer settings is below.
from littlebird import TweetReader, TweetTokenizer
# File in JSONlines form. Automatically handles GZIP files.
tweet_file = "2014_01_02.json.gz"
reader = TweetReader(tweet_file)
tokenizer = TweetTokenizer()
# Iterate over tweets and get the tokenized text.
# Automatically checks for retweeted and quoted text except if
# tokenizer = TweetTokenizer(include_retweeted_and_quoted_content=False)
for tweet in reader.read_tweets():
# Tokenize the Tweet's text
tokens = tokenizer.get_tokenized_tweet_text(tweet)
# Get tweet hashtags. Also includes hashtags from retweeted and quoted statuses
hashtags = tokenizer.get_hashtags(tweet)
You can directly access the internal tokenize
method with tokenizer.tokenize(text)
.
If you do not need to skip over tweets and only need the text, then you can use the tokenize_tweet_file()
method. This method gets the tokenized text from the entire tweet (retweet, quoted, and full text).
from littlebird import TweetTokenizer
tweet_file = "2014_01_02.json.gz"
tokenizer = TweetTokenizer()
# Returns a tokenized string for each tweet
tokenized_tweets = tokenizer.tokenize_tweet_file(tweet_file)
# Returns a list of tokens for each tweet
tokenized_tweets = tokenizer.tokenize_tweet_file(tweet_file, return_tokens=True)
Available TweetTokenizer
settings:
language
: right now it only really supports English, but as long as you change thetoken_pattern
accordingly, it should work with other languages. A future integration is usingMoses
for Arabic tokenization.token_pattern
: Pattern to match for acceptable tokens. Default isr"\b\w+\b"
stopwords
: provide a list of stopwords to remove from the text. Default isNone
for no stopword removal.remove_hashtags
: Default isFalse
to keep hashtags in the text (only strips the "#" symbol)lowercase
: Default isTrue
to lowercase all of the text. Change this toFalse
if you are doing case-sensitive tasks like Name Entity Recognition (NER)
The tokenizer works in the following steps:
- Remove hashtags (optional)
- Remove URLs, handles, and "RT"
- Lowercase the text (optional)
- Find all tokens that match the
token_pattern
withregex.findall(token_pattern, text)
- Remove stopwords (optional)
The token pattern is extremely important to set for your use case. Below are some sample token patterns, but I highly recommend refreshing on your regular expressions if you need something more advanced.
Note: the regex
package is used to access character classes like \p{L}
. Basically Java regex patterns.
r"\b\w+\b"
matches any token with one or more letters, numbers, and underscores. This is equivalent to"[\p{L}\_\p{N}]+"
r"\b\p{L}+\b"
matches any token with one or more "letters" (across all alphabets).r"\b[\w']\b
matches any token with one or more letters, numbers, and underscores. This pattern also preserves apostraphes "'", which is useful for not splitting contractions (e.g. not "can't" -> "can t"). It does not preserve quotes (e.g. "She said 'hello'" -> "she said hello")
If you would like to write your own, simply extend the BaseTweetTokenizer
class. See below and tweet_tokenizer.py
for examples.
from littlebird import BaseTweetTokenizer
class NewTweetTokenizer(BaseTweetTokenizer):
def __init__(self, include_retweeted_and_quoted_content = True):
# Initialize super class
super().__init__(include_retweeted_and_quoted_content)
# Define class variables and settings
self.token_pattern = "aa*"
# Overwrite the tokenize method
def tokenize(self, tweet):
regex.findall(self.token_pattern, tweet)
return
# Define any helper methods as needed
def _helper(self):
return
This package is a work in progress. Feel free to open any issues you run into or recommend features. I started this package for my own NLP research with tweets.
@misc{DeLucia2020,
author = {Alexandra DeLucia},
title = {Little Bird},
year = {2020},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/aadelucia/littlebird}},
}
Copyright (c) 2020 Alexandra DeLucia