Using the “pandas”, “os”, and “csv” libraries, we processed our Netflix content data and our Covid data in Python.
First, our Netflix content data was downloaded in CSV format from Kaggle (URL: https://www.kaggle.com/shivamb/netflix-shows). We defined a list containing the names of all the columns we wanted to drop and removed them using the ‘drop’ function. Those columns were 'director', 'cast', 'listed_in', and 'description'. Then, we renamed the column 'country' to 'released_country' using the ‘rename’ function so that it matches our relation schemas and ER diagram from the previous phases. Next, we removed the rows that contained null values in any of the columns using the ‘dropna’ function. After stripping leading whitespace from 'date_added', we converted the column from its original month day, year format (e.g. September 21, 2017) to yyyy-mm-dd format (e.g. 2017-09-21) so that it matches the dates found in our other data. Finally, the cleaned data was written to a new CSV file named “preprocessed_netflix.csv”, which was then converted to a text file named “netflix.txt”.
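The Netflix steps above can be sketched roughly as follows. This is a minimal illustration, not our exact script: the function name preprocess_netflix, the in-memory sample rows, and the use of pd.to_datetime with the "%B %d, %Y" format are assumptions made for the example.

```python
import pandas as pd

# Columns we did not need for our relation schema.
NETFLIX_DROP = ["director", "cast", "listed_in", "description"]


def preprocess_netflix(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper that mirrors the cleaning steps described above."""
    out = df.drop(columns=NETFLIX_DROP)
    out = out.rename(columns={"country": "released_country"})
    # Drop rows with a null value in any remaining column.
    out = out.dropna()
    # Strip stray whitespace, then convert "September 21, 2017" to "2017-09-21".
    stripped = out["date_added"].str.strip()
    out["date_added"] = pd.to_datetime(stripped, format="%B %d, %Y").dt.strftime("%Y-%m-%d")
    return out.reset_index(drop=True)


# Made-up rows for illustration only (not taken from the Kaggle file).
netflix_sample = pd.DataFrame(
    {
        "show_id": ["s1", "s2", "s3"],
        "title": ["Example A", "Example B", "Example C"],
        "director": ["Someone", "Someone Else", None],
        "cast": ["x", "y", "z"],
        "country": ["United States", None, "India"],
        "date_added": [" September 21, 2017", "March 3, 2019", "January 1, 2020"],
        "listed_in": ["Dramas", "Comedies", "Dramas"],
        "description": ["...", "...", "..."],
    }
)

netflix_clean = preprocess_netflix(netflix_sample)
print(netflix_clean)

# Writing the result would then be a single call, for example:
# netflix_clean.to_csv("preprocessed_netflix.csv", index=False)
```

In this sample, the row with a missing 'country' is removed, while the row with a missing 'director' is kept because that column is dropped before ‘dropna’ runs.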
Then, our Covid data was downloaded in CSV format from GitHub (URL: https://github.com/owid/covid-19-data/blob/master/public/data/owid-covid-data.csv). We defined an array of the columns we wanted to drop and passed it as a parameter to the ‘drop’ function of the pandas data frame. As a result, we were able to create a processed CSV with five columns: country, date, total_case, new_case, and new_case_per_million. Then, we removed the rows with null attributes using the ‘dropna’ function of the data frame. For convenience, we also renamed the columns so that they match the names used in our ER diagram. We created a new column called ‘record_id’ by merging two columns, ‘date’ and ‘country’. In our previous ER diagram, we had designated ‘date’ as the sole primary key of the Covid table, but we realized that this does not work because multiple countries can have a Covid report recorded on the same date. Hence, we merged the country’s name and the date of the record to create a unique primary key for the relation. After preprocessing, we wrote the processed data to a CSV file and then converted it to a text file.
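A rough sketch of the Covid steps is given below. It is an illustration under stated assumptions rather than our exact script: the source column names (location, total_cases, new_cases, new_cases_per_million) are assumed to be the ones in the OWID file, the drop list is derived from a keep list instead of being written out by hand, and the "country_date" layout of ‘record_id’ with an underscore separator is a choice made for the example.

```python
import pandas as pd

# Assumed OWID column names for the five attributes we kept.
COVID_KEEP = ["location", "date", "total_cases", "new_cases", "new_cases_per_million"]

# Rename to the attribute names used in our ER diagram.
COVID_RENAME = {
    "location": "country",
    "total_cases": "total_case",
    "new_cases": "new_case",
    "new_cases_per_million": "new_case_per_million",
}


def preprocess_covid(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper that mirrors the cleaning steps described above."""
    # Build the list of columns to drop: everything we do not keep.
    drop_cols = [c for c in df.columns if c not in COVID_KEEP]
    out = df.drop(columns=drop_cols)
    out = out.dropna()
    out = out.rename(columns=COVID_RENAME)
    # 'date' alone repeats across countries, so combine it with 'country'.
    out["record_id"] = out["country"] + "_" + out["date"]
    return out.reset_index(drop=True)


# Made-up rows for illustration only (not taken from the OWID file).
covid_sample = pd.DataFrame(
    {
        "iso_code": ["CAN", "JPN", "JPN"],
        "location": ["Canada", "Japan", "Japan"],
        "date": ["2020-03-01", "2020-03-01", "2020-03-02"],
        "total_cases": [10.0, 20.0, 25.0],
        "new_cases": [1.0, 2.0, None],
        "new_cases_per_million": [0.1, 0.2, 0.3],
        "total_deaths": [None, 1.0, 1.0],
    }
)

covid_clean = preprocess_covid(covid_sample)
print(covid_clean)
```

Here two countries share the date 2020-03-01, so ‘date’ alone could not serve as a key, while the combined ‘record_id’ values remain distinct.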
Lastly, for our Netflix financial data, the initial CSV file was created manually because financial data covering 2019-2022 was not available for download in CSV format elsewhere. The data was retrieved from https://www.macrotrends.net/stocks/charts/NFLX/netflix/income-statement?freq=Q. As a side note, the income statements we are interested in are issued quarterly, so this dataset is small relative to the other two.