Hi. We are the PyJaMas!

This is our project, Analysis of Tweets Red-Tagging Chad Booc and Lumad Schools, which aims to investigate tweets that red-tag the mentioned parties, and analyse the behaviours of these Twitter users.

  • Pips
  • Julia
  • Marius

Overview

This project hones students' critical thinking as they analyze tweets and detect misinformation and disinformation related to Chad Booc's death and his activities with the Lumads. We also examined tweets that contain misinformation about Lumad schools and communities.

Motivation

Red-tagging is an issue that hits close to home for any UP student. Chad Booc, a graduate of our department and an alumnus of UP CURSOR, is a well-known victim of red-tagging. The circumstances surrounding his death are particularly polarizing. While many condemned the state for his death, quite a few lauded the government for its actions. This is an indication of the success of the state's red-tagging. Our frustration regarding this and the inherent dangers of red-tagging led us to choose this issue as the focus of our CS 132 project.

Problem

Twitter has become a prominent platform for red-taggers to spread disinformation and target activists like Chad Booc. The ease with which information can be disseminated on Twitter means that disinformation can quickly gain traction and be amplified, leading to potentially severe consequences for targeted individuals and movements. These consequences include defamation, harassment, or even death, like in Chad Booc's case.

Solution

Through this project, we aim to analyze the behavior and intent of red-taggers on Twitter and identify their patterns of red-tagging. This gives us a better understanding of how red-taggers operate, which can serve as a reference for combatting this kind of disinformation in the future.

Problem Formulation

At this stage, we decided what we wanted our topic to be and what kind of problem we hoped to answer. We then came up with our hypotheses and our general plan of action.

Research Question

What kind of behaviour did red-taggers exhibit in 2022 concerning Chad Booc and his activities, with respect to:

  1. frequency of tweets with mis/disinformation, and
  2. the content of these tweets?

Hypotheses

  1. The frequency of tweets red-tagging Chad Booc and his activities significantly increased after his death.
  2. The tweets containing dis-/misinformation predominantly talk about Chad Booc's death.

Null Hypotheses

  1. There is no significant increase in the frequency of tweets red-tagging Chad Booc and his activities after his death compared with before it.
  2. There is an equal distribution of topic clusters regarding Chad Booc's death; no specific topic cluster is significantly more predominant than others.

Action Plan

Analyse the frequency of tweets and usage of relevant keywords to identify red-taggers' behaviours.

Data Collection

We collected our data from Twitter and reviewed each tweet to remove irrelevant ones from the dataset.

Topic

We gathered tweets that red-tagged Chad Booc and Lumad schools. The scope of our data collection is 2022, the year he was killed.

Keywords

In scraping for tweets, we used two sets of keywords, one for Chad Booc and one for Lumad schools. Specifically, these are:

  1. "chad booc (cpp OR npa OR terorista)"
  2. "lumad (cpp OR npa OR terorista)"

With this set of keywords, we were able to collect a majority of tweets from 2022 that red-tagged Chad Booc and Lumad schools.

Tools

To speed up the data collection process, we used snscrape, a scraper for social networking services (SNS).
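As a minimal sketch of how the two keyword sets could be restricted to 2022, the snippet below appends Twitter's `since:` and `until:` search operators to each query. The helper name `build_query` and the exact date bounds are our assumptions for illustration, not the original collection script; to our understanding `until:` is exclusive, so the end date is set to the first day of 2023.

```python
from datetime import date

# Search terms quoted from the Keywords section.
KEYWORD_QUERIES = [
    "chad booc (cpp OR npa OR terorista)",
    "lumad (cpp OR npa OR terorista)",
]


def build_query(keywords, start, end):
    """Append since:/until: date operators to a keyword query (hypothetical helper)."""
    return f"{keywords} since:{start.isoformat()} until:{end.isoformat()}"


# Assumed bounds covering the whole of 2022.
queries = [build_query(q, date(2022, 1, 1), date(2023, 1, 1)) for q in KEYWORD_QUERIES]

for query in queries:
    print(query)

# Each string could then be handed to a scraper, for example
# snscrape.modules.twitter.TwitterSearchScraper(query).get_items().
```

Keeping query construction separate from the scraping call makes it easy to check the search strings by pasting them into Twitter's own search box first.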

Data Exploration

After gathering and sorting the data, we preprocessed it and formatted the values so that outliers and missing values would be handled properly. This helps produce better results for Natural Language Processing and Time Series analysis.

Data Structure

Each row in our dataset contains one tweet. The columns we collected to describe each tweet are the following:

  1. Tweet URL
  2. Keywords Used
  3. Account Handle
  4. Account Display Name
  5. Account Bio
  6. Account Type
  7. Account Join Date
  8. Account Following Count
  9. Account Followers Count
  10. Account Location
  11. Tweet Content
  12. Tweet Type
  13. Tweet Date Posted
  14. Tweet Content Type
  15. Tweet Likes
  16. Tweet Replies
  17. Tweet Retweets
  18. Tweet Quote Tweets
  19. Tweet Views

However, for the purposes of our analysis, we only used the following columns:

  1. Tweet URL
  2. Tweet Content
  3. Tweet Date Posted

Values Formatting

In the columns that we used, we did not encounter any missing values that needed filling. Similarly, none of these columns needed normalisation or standardisation.

The column Tweet Date Posted contained datetime values in the format MM/DD/YY hh:mm. To make these values readable by code, each one was parsed into a datetime object and stored as an ISO 8601 format string.
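The conversion can be sketched with Python's standard library as follows. We assume here that hh:mm uses a 24-hour clock and that the year is in the 2000s; the function name `to_iso8601` is hypothetical.

```python
from datetime import datetime

# MM/DD/YY hh:mm, assuming a 24-hour clock.
SOURCE_FORMAT = "%m/%d/%y %H:%M"


def to_iso8601(raw):
    """Parse a 'Tweet Date Posted' value and return it as an ISO 8601 string."""
    parsed = datetime.strptime(raw.strip(), SOURCE_FORMAT)
    return parsed.isoformat()


print(to_iso8601("02/25/22 14:05"))  # 2022-02-25T14:05:00
```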

Handling Outliers

Given our project objectives and data, we did not remove or adjust any outliers during preprocessing. These outliers are precisely what we look for in our time series analysis, where they appear as peaks in the dataset.

Natural Language Processing

We preprocessed the data in this stage in order to prepare it for Natural Language Processing (NLP). This was done by making use of procedures such as cleaning, tokenising, lemmatising and more.

Preprocessing

In order to perform topic clustering on the tweets collected, the following basic procedures were performed:

  1. Translate the content
  2. Remove mentions, hashtags, links, punctuation, emojis, and extraneous whitespace
  3. Lowercase the content

This step was performed via Google Apps Script.
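For readers who prefer to stay in Python, the cleaning steps (2 and 3) can be approximated with regular expressions. This is a sketch under our own assumptions, not the Apps Script that was actually used: it skips translation, treats anything outside ASCII letters, digits, and whitespace as punctuation or emoji, and the patterns are illustrative.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_HASHTAG_RE = re.compile(r"[@#]\w+")
# After lowercasing, anything that is not a-z, 0-9, or whitespace is dropped.
# This covers punctuation and emojis, assuming the text is already in English.
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")


def clean_tweet(text):
    """Remove links, mentions, hashtags, punctuation, and emojis, then lowercase."""
    text = URL_RE.sub(" ", text)
    text = MENTION_HASHTAG_RE.sub(" ", text)
    text = text.lower()
    text = NON_ALNUM_RE.sub(" ", text)
    # Collapse runs of whitespace left behind by the removals.
    return " ".join(text.split())


print(clean_tweet("@user Chad Booc was NOT a terrorist!!! #StopTheAttacks https://t.co/abc"))
# chad booc was not a terrorist
```

Links are removed before punctuation so that a URL is dropped as a whole rather than being broken into stray word fragments. One side effect of this approach is that contractions such as "don't" are split into "don" and "t".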

Tokenising and Stopword Removal

To keep only the important words and concepts in each tweet, we filtered its words against a list of English stopwords. This removes unnecessary words such as articles and pronouns from the tweet content. We used the stopwords list from NLTK's corpus module and extended it with additional words such as meaningless expressions. NLTK's word_tokenize function was then used to tokenise the words in preparation for the next steps.
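A minimal sketch of this step is shown below. The entries in `EXTRA_STOPWORDS` are hypothetical, since the project's actual additions are not listed here, and the NLTK calls assume the "stopwords" and "punkt" resources have already been downloaded with `nltk.download`.

```python
# Hypothetical additions; the project's real extra stopwords are not listed.
EXTRA_STOPWORDS = {"haha", "lol", "omg"}


def remove_stopwords(tokens, stopword_set):
    """Keep only the tokens that are not in the stopword set."""
    return [token for token in tokens if token not in stopword_set]


def tokenise_and_filter(text):
    """Tokenise cleaned tweet text with NLTK and drop English stopwords."""
    # Imported here so remove_stopwords stays usable without NLTK installed.
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    stopword_set = set(stopwords.words("english")) | EXTRA_STOPWORDS
    return remove_stopwords(word_tokenize(text), stopword_set)


print(remove_stopwords(["the", "lumad", "school", "haha"], {"the"} | EXTRA_STOPWORDS))
# ['lumad', 'school']
```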

Lemmatisation

To find the common topics across the gathered tweets, it is important that the words in each tweet are in the same form. This was done by lemmatising the tokenised words using NLTK's WordNetLemmatizer. NLTK's pos_tag was used to determine each word's part of speech for more accurate lemmatisation.
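The link between pos_tag and WordNetLemmatizer is a small mapping from Penn Treebank tags to WordNet's part-of-speech letters. The sketch below shows one common way to write it; the helper names are our own, defaulting unknown tags to noun is an assumption, and the NLTK calls require the tagger and WordNet resources to be downloaded beforehand.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (as returned by nltk.pos_tag) to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"  # noun, also used as the fallback


def lemmatise(tokens):
    """Lemmatise tokens using their part of speech for better accuracy."""
    # Imported here so penn_to_wordnet can be used without NLTK installed.
    from nltk import pos_tag
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    return [
        lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
        for word, tag in pos_tag(tokens)
    ]
```

Without the part-of-speech hint, the lemmatizer treats every word as a noun, so verb forms such as "killed" would be left unchanged.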

After these preprocessing procedures, we were able to get the top keywords used in the tweets:

Time Series Analysis

Binning

In order to group the tweets by date, pandas' resample method is used with a frequency of 1 day.
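As an illustration of the binning, the sketch below counts tweets per calendar day with resample. The timestamps are made up for the example and the function name `tweets_per_day` is hypothetical.

```python
import pandas as pd


def tweets_per_day(timestamps):
    """Count tweets per calendar day; days without tweets get a count of 0."""
    index = pd.DatetimeIndex(pd.to_datetime(timestamps))
    # One row per tweet, each contributing a count of 1.
    per_tweet = pd.Series(1, index=index).sort_index()
    return per_tweet.resample("1D").sum()


# Made-up timestamps, for illustration only.
example = ["2022-02-25T08:00:00", "2022-02-25T21:30:00", "2022-02-27T10:15:00"]
print(tweets_per_day(example))
```

Because resample fills in the empty days between the first and last tweet, quiet periods show up as zeros rather than disappearing from the series, which keeps the peaks in proportion when the result is plotted.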

Graphing this binned dataset, we can see the number of tweets posted per day:

Results and Discussion

We determined the topic clusters and dates where there was the most tweet activity from the given dataset using topic clustering and time series analysis.

Time Analysis

  1. February 25 - March 2

    Within 6 days, there was an influx of tweets regarding Chad Booc's death.

    Link1 Link2
  2. March 3 - March 27

    There was a relatively low but stable number of tweets present for 25 days. Also, a peak was detected on March 12 when Chad's autopsy report was released.

    Link1 Link2
  3. July 25 - August 3

    Another peak was recorded during the week of President Marcos' first State of the Nation Address.

    Link1 Link2

Topic Clustering

Using Latent Dirichlet Allocation (LDA), we grouped the tweets into four clusters. The cluster of tweets predominantly talking about Chad Booc's death made up 27.46% of the total. Although this is one of the largest clusters, we found no statistically significant difference between its tweet count and those of the other clusters (χ² = 1.96, p = 0.58). The four clusters are:

  1. Volunteers being red-tagged as NPA and Lumad schools being called NPA training grounds

    Link
  2. Makabayan bloc and other Filipino activists getting killed and red-tagged

    Link
  3. Death of Chad Booc

    Link
  4. Chad Booc red-tagged as a member of the NPA

    Link
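The significance check on the cluster sizes can be sketched as a chi-square goodness-of-fit test against equal cluster sizes. We assume this is the form of the test used, since four clusters give 3 degrees of freedom, which is consistent with the reported statistic and p-value. The counts in the example are hypothetical, not the project's data, and the p-value formula below is the closed form that holds only for 3 degrees of freedom.

```python
import math


def chi_square_uniform(counts):
    """Chi-square statistic for observed counts against equal expected counts."""
    expected = sum(counts) / len(counts)
    return sum((observed - expected) ** 2 / expected for observed in counts)


def p_value_df3(statistic):
    """Upper-tail probability of a chi-square distribution with 3 degrees of freedom."""
    half = statistic / 2
    return math.erfc(math.sqrt(half)) + math.sqrt(2 * statistic / math.pi) * math.exp(-half)


# Hypothetical cluster sizes, for illustration only.
example_counts = [30, 20, 25, 25]
statistic = chi_square_uniform(example_counts)
print(statistic, round(p_value_df3(statistic), 2))

# Plugging in the reported statistic gives the reported p-value.
print(round(p_value_df3(1.96), 2))  # 0.58
```

With SciPy available, `scipy.stats.chisquare(counts)` performs the same test against equal expected frequencies and works for any number of clusters.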

Conclusion

We identified several dates with distinct patterns of tweet frequency: February 25, the day of Chad Booc's death; March 12, when the Inquirer released its article on Chad's autopsy; and July 26, PBBM's SONA. Furthermore, topic clustering grouped the tweets into four topics, none of which was significantly more frequent than the others: (1) Lumad schools as NPA training grounds, (2) red-tagging of Makabayan bloc and Filipino activists, (3) Chad Booc's death, and (4) Chad being red-tagged as an NPA member.

From the results, we conclude that the frequency of tweets red-tagging Chad Booc and his activities significantly increased after his death, but these tweets did not predominantly talk about his death.

About Us

Hi! We are a group of second-year BS Computer Science students from the University of the Philippines Diliman and members of the UP Association of Computer Science Majors (UP CURSOR).

Marius Barcenas

Hi! I'm Marius. I'm interested in web development and data science, particularly in the field of NLP.

Julia Dy

Hello, I'm Julia! I'm interested in learning more about different fields of computer science. Outside of academics, I like to read, draw, and look at cat pictures.

Joshua Felipe

Hi! I'm Joshua. I seek opportunities in different fields of computer technology, especially in software engineering and web development.