# Twitter data collection for Open Assistant

Conversations on Twitter can be an interesting and useful source of data for our model to learn from. Certain Twitter threads contain helpful prompts and replies, similar to how we want our model to respond usefully to prompts.

These scripts are therefore intended to collect Twitter data from a variety of sources, process it into cleaner and more useful formats, and combine the various outputs into a unified training set that can be fed to our model as a conversation, or at least as a prompt with replies.

Note: Based on issue #126

## Possible Data Paths

- Twitterstream archive: https://archive.org/details/twitterstream These are large .tar files containing compressed JSON files. However, some data points such as reply counts always seem to be 0 due to limitations when scraping the Twitter API.
- Alternative APIs such as snscrape, twint, etc. These are often harder to use than the official Twitter API but can bypass API limits, which makes them useful for larger-scale data collection. The downsides are potentially slower speed and fewer features.
- The official Twitter API

## Currently Completed Items

- Downloaded various archive files (both are .tar, but each uses a different JSON compression format: one .gz, the other .bz2). Each JSON file contains roughly 2000 rows of tweets, and there are thousands of these compressed JSON files. Managing the IO of opening lots of small files is one of the challenges, which is why later steps consolidate the data into larger, easier-to-process files.
- Wrote a script that loops through the compressed JSON files and cleans them up by removing truncated tweets and tweets that aren't replies. The script then standardizes the columns and exports the Polars dataframes to Parquet files for further processing. Note: using Polars instead of pandas for performance reasons. (A rough sketch of this step follows this list.)
- Wrote scripts that process the large dump of tweets into conversation threads using a tree/node structure. This results in around 17K conversation threads from a dump of 90M tweets. (See the second sketch after this list.)
- The script can output the conversation threads to a JSONL file for further filtering or use in models.
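Below is a minimal sketch of the archive-processing step. The file paths are placeholders, the cleanup rules are simplified, and it assumes the archive members are newline-delimited Twitter v1.1 tweet objects with fields like `text`, `truncated`, and `in_reply_to_status_id_str`:

```python
import bz2
import gzip
import json
import tarfile

import polars as pl


def iter_tweets(tar_path: str):
    """Yield tweet dicts from every compressed JSON member of a twitterstream .tar."""
    with tarfile.open(tar_path) as tar:
        for member in tar:
            if not member.isfile():
                continue
            raw = tar.extractfile(member).read()
            if member.name.endswith(".gz"):
                data = gzip.decompress(raw)
            elif member.name.endswith(".bz2"):
                data = bz2.decompress(raw)
            else:
                continue
            # Each member is newline-delimited JSON, roughly 2000 tweets per file.
            for line in data.splitlines():
                if line.strip():
                    yield json.loads(line)


def clean_archive(tar_path: str, out_path: str) -> None:
    """Keep non-truncated replies, standardize columns, and write a Parquet file."""
    rows = [
        {
            "id": t.get("id_str"),
            "text": t.get("text"),
            "in_reply_to": t.get("in_reply_to_status_id_str"),
            "user": (t.get("user") or {}).get("screen_name"),
            "created_at": t.get("created_at"),
        }
        for t in iter_tweets(tar_path)
        if not t.get("truncated", False) and t.get("in_reply_to_status_id_str")
    ]
    pl.DataFrame(rows).write_parquet(out_path)


if __name__ == "__main__":
    # Hypothetical file names, just to show the flow.
    clean_archive("twitterstream-2023-01.tar", "replies.parquet")
```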
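And a minimal sketch of the tree/node threading and JSONL export. The nested node layout (`text` plus a list of `replies`) is an assumed schema for illustration, not the confirmed output format:

```python
import json

import polars as pl


def build_threads(df: pl.DataFrame) -> list[dict]:
    """Link replies to their parents and return root nodes as nested dicts."""
    nodes = {
        row["id"]: {"id": row["id"], "text": row["text"], "replies": []}
        for row in df.iter_rows(named=True)
    }
    roots = []
    for row in df.iter_rows(named=True):
        node = nodes[row["id"]]
        parent_id = row["in_reply_to"]
        if parent_id in nodes:
            nodes[parent_id]["replies"].append(node)  # parent found in the dump
        else:
            roots.append(node)                        # parent missing -> treat as a root
    # Keep only roots that actually started a thread within the dump.
    return [r for r in roots if r["replies"]]


def write_jsonl(threads: list[dict], path: str) -> None:
    with open(path, "w", encoding="utf-8") as f:
        for thread in threads:
            f.write(json.dumps(thread, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    df = pl.read_parquet("replies.parquet")  # output of the cleanup step above
    write_jsonl(build_threads(df), "threads.jsonl")
```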

## Main Issues

- We can easily scrape replies, but there is no guarantee the original tweet is in the archive. Furthermore, the archives are large, so they would need to be kept entirely in memory or in a database to reference. We still need to decide whether to mine the archive to piece the conversations together, or to take the list of replied-to tweet IDs, fetch the original tweet text via alternative APIs, and match it with the confirmed replies already in our archive to generate the prompt/reply data (a rough sketch of that flow follows this list). Currently, the script can extract conversations from the dump, but they are a small percentage of the overall dump, and there is no guarantee of tweet quality.
- Tweet quality is the other major issue. The current scripts can produce conversations, but most of them don't resemble a useful instruction -> fulfillment pair. We are trying to filter the tweets by various means, such as matching useful hashtags or computing cosine similarity against known instructions.
- The modern Twitter API exposes a `conversation_id` field, which can be used to gather all tweets in a thread more or less automatically, although there are pagination limits. The main problem is that it seems hard to search by `conversation_id` using the alternative APIs.
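If we go the alternative-API route, the flow could look roughly like this sketch. It assumes snscrape's unofficial Python interface (`TwitterTweetScraper`), whose attribute names have changed across releases and which depends on Twitter remaining scrapeable, so treat it as illustrative only:

```python
import json

import polars as pl
import snscrape.modules.twitter as sntwitter  # unofficial interface; may change or break


def fetch_original(tweet_id: str) -> str | None:
    """Fetch the text of a single tweet by ID via snscrape (no API key needed)."""
    try:
        for tweet in sntwitter.TwitterTweetScraper(tweet_id).get_items():
            # Attribute name differs between snscrape versions.
            return getattr(tweet, "rawContent", None) or getattr(tweet, "content", None)
    except Exception:
        return None  # deleted, protected, or scraping blocked
    return None


def build_prompt_replies(replies_parquet: str, out_path: str, limit: int = 100) -> None:
    """Pair archived replies with their externally fetched original tweets."""
    df = pl.read_parquet(replies_parquet)
    parent_ids = df["in_reply_to"].unique().to_list()
    with open(out_path, "w", encoding="utf-8") as f:
        for parent_id in parent_ids[:limit]:
            prompt = fetch_original(parent_id)
            if prompt is None:
                continue
            replies = df.filter(pl.col("in_reply_to") == parent_id)["text"].to_list()
            f.write(json.dumps({"prompt": prompt, "replies": replies}, ensure_ascii=False) + "\n")
```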

## TODO

- Write scripts to filter the existing conversations into useful instruction -> fulfillment pairs using hashtags or cosine similarity. (A rough sketch follows this list.)
- Train a model to detect whether a text is a suitable instruction. This could then be run over the conversations (or the full tweet dump) to simplify the process. Related to issue #143.
- Write a script that matches the original tweets and their text against the archive data to create the prompt/reply dataset. (Optional)
- Decide on the final output format and storage options for the dataset. Currently JSONL with the tree/node structure serialized as Python dicts, which I believe is acceptable.
- Alternatively: store the processed tweets in a database or another backend. (Optional)
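For the cosine-similarity filtering item, a rough sketch using sentence-transformers is below. The model name, the 0.5 threshold, and the placeholder instruction list are all assumptions; the input is the `threads.jsonl` format sketched above:

```python
import json

from sentence_transformers import SentenceTransformer, util

# Placeholder instruction examples; the real set would come from existing
# instruction datasets or Open Assistant prompts.
KNOWN_INSTRUCTIONS = [
    "Explain how photosynthesis works.",
    "Write a short poem about the ocean.",
    "Summarize this article in two sentences.",
]


def filter_threads(in_path: str, out_path: str, threshold: float = 0.5) -> None:
    """Keep threads whose root tweet is similar to a known instruction."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
    ref = model.encode(KNOWN_INSTRUCTIONS, convert_to_tensor=True)

    with open(in_path, encoding="utf-8") as fin, open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            thread = json.loads(line)
            emb = model.encode(thread["text"], convert_to_tensor=True)
            score = util.cos_sim(emb, ref).max().item()  # best match against references
            if score >= threshold:
                fout.write(line)


if __name__ == "__main__":
    filter_threads("threads.jsonl", "instruction_threads.jsonl")
```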