This repository is for Bram Adams
I looked through the columns in the Stack Overflow posts table and shortlisted 5 columns, other than the ID (primary key), that would be helpful for data analysis. These were: answer_count, comment_count, creation_date, score, view_count
I first wrote an SQL query on the Google BigQuery platform (https://console.cloud.google.com/bigquery?project=stack-overflow-analysis-343407&d=stackoverflow&p=bigquery-public-data&t=stackoverflow_posts&page=table&ws=!1m10!1m4!1m3!1sstack-overflow-analysis-343407!2sbquxjob_778569bb_17f741986a5!3sUS!1m4!4m3!1sbigquery-public-data!2sstackoverflow!3sstackoverflow_posts)
For just ReactJS:
SELECT id, answer_count, comment_count, creation_date, score, view_count
FROM `bigquery-public-data.stackoverflow.posts_*`
WHERE creation_date BETWEEN '2021-06-01 00:00:00' AND '2021-12-01 23:59:59'
AND tags LIKE "%reactjs%"
For just VueJS:
SELECT id, answer_count, comment_count, creation_date, score, view_count
FROM `bigquery-public-data.stackoverflow.posts_*`
WHERE creation_date BETWEEN '2021-06-01 00:00:00' AND '2021-12-01 23:59:59'
AND tags LIKE "%vuejs%"
For a mix of both:
SELECT id, answer_count, comment_count, creation_date, score, view_count, tags
FROM `bigquery-public-data.stackoverflow.posts_*`
WHERE creation_date BETWEEN '2021-06-01 00:00:00' AND '2021-12-01 23:59:59'
AND (tags LIKE "%reactjs%" OR tags LIKE "%vuejs%")
I exported 6 months of data from the posts table as CSV, which I downloaded to analyze in this notebook.
I downloaded the CSV files for:
- Just the posts that included the tag "reactjs" (2,202 KB)
- Just the posts that included the tag "vuejs" (156 KB)
- All the posts with tags that include either "vuejs" or "reactjs" (4,011 KB)
I ran the .describe() function on all 3 datasets to get basic information about count, min, max, distribution and averages.
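As a rough illustration of this step, here is a minimal sketch of calling `describe()` on the four numeric metrics. The `react_df` name and the four rows of data are made up for the example; the real notebook works on the downloaded CSV files instead.

```python
import pandas as pd

# Tiny made-up sample standing in for one of the exported CSV files
# (hypothetical values, not the real Stack Overflow data).
react_df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "answer_count": [2, 0, 1, 3],
    "comment_count": [0, 1, 4, 2],
    "score": [3, 0, 1, 5],
    "view_count": [120, 45, 80, 310],
})

# Restrict to the metrics of interest so the id column is not summarised.
metrics = ["answer_count", "comment_count", "score", "view_count"]
summary = react_df[metrics].describe()

# Rows of the summary: count, mean, std, min, 25%, 50%, 75%, max.
print(summary)
```

Running the same call on each of the three datasets gives tables that can be compared side by side.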
The describe() output immediately shows that ReactJS is more popular than VueJS: it garners a greater overall number of answers, comments and views, and a wider range of scores due to the greater number of voters. However, VueJS seems to have a slightly greater average view count per answer than ReactJS.
It is important to understand how the metrics correlate with each other. A very high correlation may signal that one metric is directly or inversely proportional to another and is strongly linked to it. Such a metric would add little independent information when analysing our results.
From the correlation matrix we can see that none of the metrics are strongly correlated. There seems to be a weak correlation between score and view_count, more so for the ReactJS data than for the VueJS data.
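A correlation matrix of this kind can be sketched as follows. The data is made up, and the 0.9 cut-off for calling a pair "strongly correlated" is an assumption for illustration, not a value taken from the notebook.

```python
import pandas as pd

# Made-up sample of the four numeric metrics (not the real data).
posts = pd.DataFrame({
    "answer_count": [2, 0, 1, 3, 1],
    "comment_count": [0, 1, 4, 2, 0],
    "score": [3, 0, 1, 5, 2],
    "view_count": [120, 45, 80, 310, 95],
})

# Pairwise Pearson correlation (the pandas default) between all metrics.
corr = posts.corr()

# Hypothetical cut-off above which two metrics would be treated as redundant.
THRESHOLD = 0.9

# Collect each unordered pair of distinct metrics whose |r| exceeds the cut-off.
strong_pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > THRESHOLD
]

print(corr.round(2))
print(strong_pairs)
```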
I will now generate a scatter plot of the view count to get a better understanding of the spread and distribution of the Vue + React data. I suspect view count is correlated to some extent with the other values, so it should give an idea of which posts got the most attention and how common such posts were.
Most values are around the mean (~95), and only 7 outliers lie above the 20000 mark. I assumed that older posts would have a greater number of views, but the plot does not show this at this scale; zooming in would reveal more useful information. The views are spread fairly evenly over time, with a few newer posts having a greater number of views.
I then decided to aggregate the data into the number of posts created per day, as a measure of each framework's popularity over time.
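The per-day count can be sketched roughly like this. The rows are made up, the `framework` and `day` columns are names introduced for the example, and the pipe-separated tag format is an assumption about how the export stores tags.

```python
import pandas as pd

# Made-up posts from the mixed dataset (hypothetical values).
posts = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "creation_date": pd.to_datetime([
        "2021-06-01 08:15:00",
        "2021-06-01 17:40:00",
        "2021-06-02 11:05:00",
        "2021-06-05 09:30:00",
        "2021-06-05 23:59:00",
    ]),
    "tags": ["reactjs", "vuejs2", "reactjs|redux", "reactjs", "vuejs3"],
})

# Label each post; a post carrying both tags would be counted as ReactJS here.
posts["framework"] = posts["tags"].apply(
    lambda t: "ReactJS" if "reactjs" in t else "VueJS"
)

# Drop the time of day so that all posts from one day share the same key.
posts["day"] = posts["creation_date"].dt.normalize()

# One row per day, one column per framework, 0 where a framework had no posts.
daily = posts.groupby(["day", "framework"]).size().unstack(fill_value=0)
print(daily)
```

Days on which neither framework had a post do not appear in `daily` at all, so a full calendar would need an extra reindexing step.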
The plots in the ipynb files show the day-to-day comparison between VueJS and ReactJS in terms of the number of posts.
ReactJS has a much greater average number of new posts per day and a greater range between the minimum and maximum number of new posts in a day. The sharp peaks and troughs are likely because more posts are created on weekdays than on weekends, as there are approximately 4 peaks and troughs in each one-month period.
The data supports my weekday hypothesis. The number of new posts for both VueJS and ReactJS is lower on weekends, with the highest number of posts in the middle of the week, between Tuesday and Wednesday, and the lowest number of posts on Saturday.
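One way to check the weekday pattern is to count posts by the name of the day on which they were created. This is a sketch on made-up timestamps; `weekday_order` and `weekday_counts` are names chosen for the example.

```python
import pandas as pd

# Made-up creation timestamps (2021-06-01 was a Tuesday).
posts = pd.DataFrame({
    "creation_date": pd.to_datetime([
        "2021-06-01 08:00:00", "2021-06-01 19:30:00",                         # Tuesday
        "2021-06-02 07:45:00", "2021-06-02 12:10:00", "2021-06-02 22:05:00",  # Wednesday
        "2021-06-03 10:00:00",                                                # Thursday
        "2021-06-05 14:20:00",                                                # Saturday
        "2021-06-07 09:15:00",                                                # Monday
    ]),
})

weekday_order = [
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday",
]

# Count posts per weekday name, keep calendar order, and show 0 for empty days.
weekday_counts = (
    posts["creation_date"].dt.day_name()
    .value_counts()
    .reindex(weekday_order, fill_value=0)
)
print(weekday_counts)
```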
Because plotting graphs for the entire dataset requires more compute and time, I decided to aggregate each metric by sum and mean per day so that it can be compared between VueJS and ReactJS.
This reduced the number of rows to 183, as there are 183 days in the 6-month time period.
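A minimal sketch of this daily aggregation, using three made-up rows, could look like the following. It produces one row per day and a sum and a mean column for every metric.

```python
import pandas as pd

# Made-up posts (hypothetical values, not the real data).
posts = pd.DataFrame({
    "creation_date": pd.to_datetime([
        "2021-06-01 08:15:00",
        "2021-06-01 17:40:00",
        "2021-06-02 11:05:00",
    ]),
    "answer_count": [2, 0, 1],
    "comment_count": [0, 1, 4],
    "score": [3, 0, 1],
    "view_count": [120, 45, 80],
})

metrics = ["answer_count", "comment_count", "score", "view_count"]

# Group by calendar day and compute both aggregates for every metric.
# The result has a two-level column index: (metric, "sum") and (metric, "mean").
daily = (
    posts.groupby(posts["creation_date"].dt.normalize())[metrics]
    .agg(["sum", "mean"])
)
print(daily)
```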
The day of post creation appears to be correlated with the number of answers a post receives. We will confirm this by comparing the weekday of post creation with the average number of answers the posts got.
This shows that ReactJS posts created on Wednesday are more likely to receive answers than those made on other days. Similarly, VueJS posts made on Tuesday are more likely to receive answers than those made on other days. The best time to post a question would be between Tuesday and Wednesday.
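The comparison of average answers per weekday can be sketched as below. The six rows are made up and were chosen only so the example is easy to follow; the `framework` column is a name introduced for the illustration.

```python
import pandas as pd

# Made-up posts for both frameworks (hypothetical values).
posts = pd.DataFrame({
    "framework": ["ReactJS", "ReactJS", "ReactJS", "VueJS", "VueJS", "VueJS"],
    "creation_date": pd.to_datetime([
        "2021-06-01 09:00:00",  # Tuesday
        "2021-06-02 09:00:00",  # Wednesday
        "2021-06-02 15:00:00",  # Wednesday
        "2021-06-01 10:00:00",  # Tuesday
        "2021-06-01 18:00:00",  # Tuesday
        "2021-06-05 12:00:00",  # Saturday
    ]),
    "answer_count": [1, 2, 3, 2, 1, 0],
})

posts["weekday"] = posts["creation_date"].dt.day_name()

# Mean answer count per weekday, with one column per framework.
mean_answers = (
    posts.groupby(["framework", "weekday"])["answer_count"]
    .mean()
    .unstack("framework")
)

# Weekday with the highest mean answer count for each framework
# (weekdays without any posts are NaN and are skipped).
best_day = mean_answers.idxmax()
print(mean_answers)
print(best_day)
```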
All the above graphs show that every metric is correlated with the day on which the post was created. I will now check whether there is a similar pattern in the time of day at which posts are created.
This graph shows that posts created between 2am and 12pm get more answers on average. That makes around 6am the best time to post.
We have learned from our Exploratory Data Analysis that the best day and time to post on Stack Overflow to receive an answer is around 6am on Wednesday for questions related to ReactJS and 12am on Tuesday for VueJS.
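The hourly view can be sketched the same way, by grouping on the hour of creation instead of the weekday. The data here is made up, so the resulting "best hour" is only an artefact of the sample.

```python
import pandas as pd

# Made-up posts (hypothetical values); timestamps are taken to be in UTC.
posts = pd.DataFrame({
    "creation_date": pd.to_datetime([
        "2021-06-01 06:10:00",
        "2021-06-02 06:45:00",
        "2021-06-01 14:00:00",
        "2021-06-03 14:30:00",
        "2021-06-04 22:20:00",
    ]),
    "answer_count": [3, 2, 1, 0, 1],
})

# Hour of day (0-23) in which each post was created.
posts["hour"] = posts["creation_date"].dt.hour

# Mean number of answers for posts created in each hour.
hourly_mean = posts.groupby("hour")["answer_count"].mean()
best_hour = hourly_mean.idxmax()
print(hourly_mean)
```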
There is a correlation between the time and day a post is made and how many views, answers and comments it will get. Since the creation_date time information is in UTC, and assuming 10am-7pm is the window in which a person is typically online and active, this suggests that the most active timezone is approximately 5 hours behind UTC. This would make Eastern Standard Time a prime candidate, since it includes some of the largest population centers of North America.
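To make the timezone arithmetic concrete, here is a small sketch that shifts a UTC timestamp by a fixed five hours. The fixed offset is a simplifying assumption (it ignores daylight saving time), and `to_utc_minus_5` is a helper name made up for the example.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC-5 offset as a rough stand-in for Eastern Standard Time.
# Assumption: daylight saving time is ignored.
UTC_MINUS_5 = timezone(timedelta(hours=-5), name="UTC-5")


def to_utc_minus_5(utc_timestamp: str) -> datetime:
    """Parse a 'YYYY-MM-DD HH:MM:SS' UTC timestamp and shift it to UTC-5."""
    utc_dt = datetime.strptime(utc_timestamp, "%Y-%m-%d %H:%M:%S")
    return utc_dt.replace(tzinfo=timezone.utc).astimezone(UTC_MINUS_5)


# Under this assumption, 10am-7pm local time corresponds to 15:00-00:00 UTC.
start_local = to_utc_minus_5("2021-06-02 15:00:00")
end_local = to_utc_minus_5("2021-06-03 00:00:00")
print(start_local.hour, end_local.hour)
```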
Until now I have done Descriptive Analysis and Exploratory Data Analysis. I can also attempt a Predictive Analysis of the popularity trends of ReactJS and VueJS. To do so, I will normalize the number of views for VueJS posts and ReactJS posts and compare their popularity over a 6-month period to predict their popularity going forward.
This plot shows a sharp increase in popularity for VueJS in July, but the posts per month declined for VueJS until October. VueJS, after being adjusted to match the mean of ReactJS, looks like it will maintain approximately the same ratio to the ReactJS posts going forward. We see that for each VueJS post there are 14 ReactJS posts, and this ratio should hold in the near future. This prediction would be more substantial if data from a longer time period were analysed.
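The adjustment "to match the mean of ReactJS" can be read as multiplying the VueJS series by the ratio of the two means. The sketch below shows that reading on made-up monthly counts; the numbers are illustrative only and the notebook may normalize differently.

```python
import pandas as pd

# Made-up monthly post counts (hypothetical values, not the real data).
monthly = pd.DataFrame(
    {
        "ReactJS": [1400, 1540, 1260, 1400],
        "VueJS": [100, 120, 90, 90],
    },
    index=["2021-06", "2021-07", "2021-08", "2021-09"],
)

# Ratio of the means: how many ReactJS posts there are per VueJS post on average.
scale = monthly["ReactJS"].mean() / monthly["VueJS"].mean()

# Rescale VueJS so that both series share the same mean and their shapes
# can be compared on one axis.
monthly["VueJS_scaled"] = monthly["VueJS"] * scale
print(scale)
print(monthly)
```

After scaling, months where `VueJS_scaled` is above `ReactJS` are months in which VueJS did relatively better than its average ratio.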
- Which tag has a greater percentage of accepted answers? (Would be very easy to do)
- Which tag gets an accepted answer sooner? (This could show that the smaller VueJS community could have more active and responsive members)
- Which tag paired with VueJS and ReactJS attracts the most users to answer?
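As a starting point for the first question, the share of posts with an accepted answer could be computed roughly as follows. This assumes an `accepted_answer_id` column is added to the query (it was not among the columns selected above) and that it is empty when no answer was accepted; the rows are made up.

```python
import pandas as pd

# Made-up posts; accepted_answer_id is None when no answer was accepted.
# Assumption: this column would have to be added to the SQL query first.
posts = pd.DataFrame({
    "framework": ["ReactJS", "ReactJS", "ReactJS", "ReactJS", "VueJS", "VueJS"],
    "accepted_answer_id": [101, None, None, None, 201, None],
})

# Percentage of posts per framework that have an accepted answer.
accepted_pct = (
    posts["accepted_answer_id"].notna()
    .groupby(posts["framework"])
    .mean()
    * 100
)
print(accepted_pct)
```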