Hashtag-Analysis
#66DaysofData
Analysis of the tweets posted under the #66DaysofData hashtag. The challenge was created by data science YouTuber Ken Jee. Have a look at the code on my GitHub: Github
Libraries: numpy, pandas, searchtweets, tweepy, nltk, matplotlib, wordcloud, streamlit
Brief Overview:
Goal
- Analyze tweets posted under the hashtag #66DaysofData.
- Independently collect data and create a usable dataset
- Opportunity to engage in text analytics (NLP).
Methodology
- Tweepy for tweets up to 7 days in the past
- searchtweets for historical tweets
Results (as of 2023-04-12)
- 40,191 tweets collected from #66DaysofData
- Tweets from 2020-08-29 to 2023-04-07
- 1,902 unique participants took part in the challenge
- Streamlit Dashboard
What did I learn?
- Methodology for data collection.
- Data collection over a longer period of time
- Preparing and handling text data for analysis
- Deploying a web app with Streamlit share
ToDo
- How many participants finished the challenge?
- What are the main topics?
- Differences between the first and second round
- Sentiment analysis
- Topic analysis
Introduction
The #66DaysofData Challenge started on 2020-09-01 and was created by data scientist and YouTuber Ken Jee. The goal of the challenge is to spend at least five minutes every day on a topic related to data science and to share the progress on social media (mainly via Twitter).
Since I had been following Ken Jee on YouTube for a while, I also started the challenge on 2020-09-01 and posted what I did every day on Twitter. After the first month, I had the idea to collect the tweets: I was interested in the topics the participants were working on, and it seemed like a great project to take on.
Data Gathering
I started collecting the tweets two months into the challenge (2020-10-29), which turned out to be somewhat of a problem. I used tweepy to run a manual query every week to get the current tweets (tweepy returns tweets up to 7 days in the past). By now, a cron job runs the query automatically every 6 days.
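A minimal sketch of such a weekly query with tweepy (v4-style API; the environment variable names and query parameters are illustrative assumptions, not the exact setup from the repo):

```python
import os
import tweepy

# Assumed credential setup: keys are read from environment variables.
auth = tweepy.OAuth1UserHandler(
    os.environ["API_KEY"], os.environ["API_SECRET"],
    os.environ["ACCESS_TOKEN"], os.environ["ACCESS_SECRET"],
)
api = tweepy.API(auth, wait_on_rate_limit=True)

# The standard search endpoint only reaches back ~7 days;
# "-filter:retweets" drops retweets directly in the query.
tweets = [
    status._json
    for status in tweepy.Cursor(
        api.search_tweets,
        q="#66DaysofData -filter:retweets",
        tweet_mode="extended",  # full, untruncated tweet text
        count=100,              # maximum page size for this endpoint
    ).items()
]
```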
For tweets older than 7 days there is the premium API "Search Tweets: Full Archive (Sandbox)", but it is heavily limited for free users and does not allow filtering out unimportant tweets (e.g., retweets) in the query, so I had to run queries over several months to finally cover the whole period. I queried the premium API with searchtweets.
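A minimal sketch of one such full-archive query with searchtweets (the credentials file, yaml key, date range, and page size are illustrative assumptions):

```python
from searchtweets import collect_results, gen_rule_payload, load_credentials

# Assumed credentials file for the premium full-archive sandbox endpoint.
search_args = load_credentials(
    "~/.twitter_keys.yaml",
    yaml_key="search_tweets_fullarchive_dev",
)

# The sandbox tier does not support the retweet-filter operator, so the rule
# only matches the hashtag; retweets have to be removed later during cleaning.
rule = gen_rule_payload(
    "#66DaysofData",
    from_date="2020-09-01",  # query one slice of the period at a time
    to_date="2020-10-01",
    results_per_call=100,    # sandbox maximum per request
)

tweets = collect_results(rule, max_results=100, result_stream_args=search_args)
```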
Data Cleaning
- Removed duplicate tweets
- Created separate date and time columns
- Analyzed the dates and times of the tweets to check whether I had managed to collect every tweet
- Used regex to create new columns for used hashtags, mentioned users, day of the challenge, and links
- Used the nltk library to tokenize the text data, remove stop words, and lemmatize (a sketch of the last two steps follows this list)
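A minimal sketch of the regex and nltk steps, assuming the raw tweets sit in a pandas DataFrame with a "text" column (column names and patterns are illustrative, not the exact ones from the repo):

```python
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads of the nltk resources used below.
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

df = pd.DataFrame(
    {"text": ["Day 12 of #66DaysofData with @KenJee_DS: built a model! https://t.co/xyz"]}
)

# Regex-based feature columns: hashtags, mentioned users, challenge day, links.
df["hashtags"] = df["text"].str.findall(r"#(\w+)")
df["mentions"] = df["text"].str.findall(r"@(\w+)")
df["day"] = df["text"].str.extract(r"[Dd]ay\s*(\d+)")
df["links"] = df["text"].str.findall(r"https?://\S+")

# nltk pipeline: tokenize, drop stop words and non-alphabetic tokens, lemmatize.
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    tokens = word_tokenize(text.lower())
    return [
        lemmatizer.lemmatize(tok)
        for tok in tokens
        if tok.isalpha() and tok not in stop_words
    ]

df["tokens"] = df["text"].apply(preprocess)
```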
Results
The results are shown in a dashboard made with Streamlit. Streamlit describes itself as 'a faster way to build and share data apps'. The dashboard shows quantitative data about the tweets, and participants can create a wordcloud based on their own tweets. The web app can be accessed here.
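A minimal sketch of how such a wordcloud feature looks in Streamlit (the CSV path and column names are hypothetical stand-ins for the real dataset):

```python
import matplotlib.pyplot as plt
import pandas as pd
import streamlit as st
from wordcloud import WordCloud

# Hypothetical path to the cleaned dataset with "user" and "text" columns.
df = pd.read_csv("tweets_cleaned.csv")

st.title("#66DaysofData Dashboard")
st.metric("Collected tweets", len(df))

# Let a participant pick their handle and render a wordcloud of their tweets.
user = st.selectbox("Choose a participant", sorted(df["user"].unique()))
text = " ".join(df.loc[df["user"] == user, "text"])

wc = WordCloud(width=800, height=400, background_color="white").generate(text)
fig, ax = plt.subplots()
ax.imshow(wc, interpolation="bilinear")
ax.axis("off")
st.pyplot(fig)
```

Locally, such an app starts with `streamlit run app.py`; the actual dashboard is deployed via Streamlit share.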