How to Scrape Twitter Data for Sentiment Analysis with Python and Power BI


Guest Flora_Oladipupo

OUTLINE

 

  1. Introduction: What is Sentiment Analysis?
  2. Use Case: Twitter Data
  3. Aim of the project
  4. Tools used and workflow
  5. Data Gathering
  6. Data Wrangling/Preprocessing
  7. Sentiment Analysis
  8. Visualization

 

 

 

Introduction

 

 

As a data analyst, you will sometimes work with data from secondary sources, and some of that data has to be collected through web scraping. A simple use case: a business wants to understand how its customers perceive and feel about its brand, based on their activity on Twitter.

To get the data for the analysis, you first have to scrape it, then clean it, analyze it, and finally present it to the business with a visualization tool.

 

 

 

This project is a collaboration between Abisola Agboola (@Abisola_Agboola) and me. We are both Beta Microsoft Learn Student Ambassadors. This article links to Part 2 of the work (Visualizing the Twitter Data with Microsoft Power BI), written by @Abisola_Agboola.

 

 

 

What is Sentiment Analysis?

 

Sentiment analysis is an application of Natural Language Processing (NLP). It is used to determine the tone behind an opinion, a sentence, or any other piece of text, which makes it easier to understand how people feel about a subject. It is useful in many settings, such as tracking how a business is perceived, how well a product is accepted, and how customers rate a service.

 

 

 

Use Case: Twitter Data

 

 

Take the scenario from the introduction: a business wants to understand how its customers perceive and feel about its brand, based on their activity on Twitter. The tweets first have to be scraped, then cleaned and analyzed, and the results presented to the business with a visualization tool.

 

 

 

Aim of the Project

 

 

Disclaimer

 

This analysis is not a prediction of the result of Nigeria's 2023 election. It is a use case that demonstrates scraping, transforming, analyzing, and visualizing Twitter data.

 

 

 

Through this project, we want to tell a compelling story and make the public aware of the overall tone of their Twitter activity towards the forthcoming 2023 general election.

 

 

 

Important Libraries Used

The libraries and modules used in this project are listed in the Jupyter notebook that contains the code. The two main tools were Snscraper, for scraping historical tweets, and TextBlob, for determining the polarity of the text and, from that, its sentiment.

 

 

 

Workflow

 

  1. Data gathering
  2. Data Preprocessing
  3. Sentiment Analysis
  4. Data Visualization
  5. Communicating the results

 

 

 

Data Gathering

 

 

The data was collected with snscraper because the library imposes few restrictions: it can scrape historical tweets and, unlike libraries such as Tweepy, does not require API keys. In total, 58,633 tweets were collected, covering 1 January 2022 to 30 July 2022. The number of tweets varied by month: some months did not reach the 20,000-tweet limit set in the code, while others went past it.
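One way to apply a per-month limit over that date range is to split it into monthly windows first. This is a minimal sketch in plain Python, not the notebook's own code: the function name `month_windows` and the constant `MONTHLY_LIMIT` are hypothetical, and the end of each window is treated as exclusive on the assumption that it will be used with Twitter's `until:` search operator.

```python
from datetime import date, timedelta

MONTHLY_LIMIT = 20_000  # per-month cap mentioned in the article


def month_windows(start, end):
    """Split [start, end] into (since, until) pairs, one per calendar month.

    `until` is exclusive: the window covers every day before it.
    """
    windows = []
    since = start
    while since <= end:
        # First day of the following month.
        if since.month == 12:
            next_month = date(since.year + 1, 1, 1)
        else:
            next_month = date(since.year, since.month + 1, 1)
        # The last window stops the day after `end`.
        until = min(next_month, end + timedelta(days=1))
        windows.append((since, until))
        since = next_month
    return windows


windows = month_windows(date(2022, 1, 1), date(2022, 7, 30))
for since, until in windows:
    print(f"since:{since} until:{until} (up to {MONTHLY_LIMIT} tweets)")
```

Each pair can then be turned into one search query, with the limit applied to that query's results.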

 

 

 

 

 

The query holds the search terms for the tweets you are interested in. A for loop then runs over the results returned for that query and collects each tweet.
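A rough sketch of such a query and loop follows. It assumes the `snscrape` package (the library this article calls snscraper) and its `TwitterSearchScraper` class. The search terms, the helper names `build_query` and `scrape`, and the fields kept are illustrative, not the exact ones in the notebook. snscrape reads Twitter's public web pages, so it may not work against the current site.

```python
from itertools import islice


def build_query(terms, since, until):
    """Build a Twitter search string: any of `terms`, within a date range."""
    # Multi-word terms are quoted so they are searched as a phrase.
    joined = " OR ".join(f'"{t}"' if " " in t else t for t in terms)
    return f"({joined}) since:{since} until:{until}"


def scrape(query, limit):
    """Collect up to `limit` tweets for `query` as a list of dicts."""
    # Third-party import kept inside the function so that build_query
    # can be used without snscrape installed.
    import snscrape.modules.twitter as sntwitter

    rows = []
    scraper = sntwitter.TwitterSearchScraper(query)
    for tweet in islice(scraper.get_items(), limit):
        rows.append({
            "date": tweet.date,
            "username": tweet.user.username,
            "location": tweet.user.location,
            # Newer snscrape releases call the text `rawContent`,
            # older ones `content`.
            "tweet": getattr(tweet, "rawContent", None) or tweet.content,
        })
    return rows


# Hypothetical search terms, for illustration only.
query = build_query(["APC", "PDP", "Labour Party"], "2022-01-01", "2022-02-01")
print(query)
```

Passing the list returned by `scrape(query, 20_000)` to `pandas.DataFrame` would give a table with one row per tweet.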

 

 

 

 

 

The results of the query are then stored in a dataframe.

 

 

 

Data Wrangling/Pre-processing

 

 

Missing locations were filled with the word ‘Unknown’. Words that appeared with several different spellings were replaced with a single uniform spelling so that they would be counted accurately. New columns were created for the parties of the top three presidential candidates: the APC, the PDP, and the Labour Party. Another set of columns was created for the names of the top three candidates, so that the number of times each name appeared in tweets could be counted accurately.
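The first two steps can be sketched in plain Python, so the example runs without extra packages; the notebook itself works on a dataframe. The spelling variants in `SPELLING_FIXES` and the helper names are hypothetical, chosen only to show the idea.

```python
import re

# Hypothetical spelling variants mapped to one uniform spelling.
SPELLING_FIXES = {
    "labor party": "labour party",
    "p.d.p": "pdp",
    "a.p.c": "apc",
}


def fill_location(location):
    """Return 'Unknown' when the location is missing or blank."""
    if location is None or not location.strip():
        return "Unknown"
    return location.strip()


def unify_spelling(text):
    """Replace each known variant with its uniform spelling, ignoring case."""
    for variant, uniform in SPELLING_FIXES.items():
        text = re.sub(re.escape(variant), uniform, text, flags=re.IGNORECASE)
    return text


rows = [
    {"location": None, "tweet": "The Labor Party rally was huge"},
    {"location": " Lagos ", "tweet": "P.D.P and A.P.C supporters clash online"},
]
cleaned = [
    {"location": fill_location(r["location"]), "tweet": unify_spelling(r["tweet"])}
    for r in rows
]
print(cleaned)
```

On a pandas dataframe, the location step corresponds to `fillna("Unknown")` on the location column.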

 

 

 

To know the number of times each of the top three candidates' names and their parties were mentioned in tweets, the names need to be extracted into separate columns with a function.
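A function of that kind might look like the sketch below. The party names come from this article; the candidate surnames and the alias lists are assumptions made for illustration, since the article does not list them, and the notebook's own function may differ. Whole-word matching is used so that, for example, "mobile" is not counted as a mention of "Obi".

```python
import re

# Aliases are illustrative; the notebook's actual lists may differ.
ALIASES = {
    "APC": ["apc"],
    "PDP": ["pdp"],
    "Labour Party": ["labour party"],
    "Tinubu": ["tinubu"],
    "Atiku": ["atiku"],
    "Obi": ["obi"],
}


def mentions(tweet, aliases):
    """Return True if any alias appears in the tweet as a whole word."""
    pattern = r"\b(?:" + "|".join(re.escape(a) for a in aliases) + r")\b"
    return re.search(pattern, tweet, flags=re.IGNORECASE) is not None


def add_mention_columns(row):
    """Return a copy of `row` with one True/False column per name or party."""
    out = dict(row)
    for column, aliases in ALIASES.items():
        out[column] = mentions(row["tweet"], aliases)
    return out


rows = [
    {"tweet": "Obi and the Labour Party are trending"},
    {"tweet": "APC vs PDP: the mobile vote debate"},
]
tagged = [add_mention_columns(r) for r in rows]

# Number of tweets that mention each name or party.
counts = {col: sum(r[col] for r in tagged) for col in ALIASES}
print(counts)
```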

 

 

 

The result is a set of new columns, one for each candidate name and one for each party.

 

 

 

Data preprocessing: This step is the bulk of the project. The sentiment analysis is only as good as this stage, so it has to be done accurately. Preprocessing steps are not cast in stone. They depend on the nature of the data you are working with and on what needs to change. However, some transformations are always needed before sentiment analysis can be carried out. These steps are listed in no particular order:

 

 

 

  • Converting the text to lower case: the tweet column is converted to lower case so that the same word is always written the same way.
  • Removing URLs, digits, punctuation, emojis, and anything else that is not needed for the sentiment analysis.
  • Tokenizing the tweet column, that is, breaking each sentence down into individual words.
  • Removing stop words: these are words that add little meaning to a sentence, for example "is" and "the".
  • Lemmatizing words: this reduces each word to its base form, for example "bags" becomes "bag".
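The steps above can be sketched with the standard library alone. Two parts are deliberate stand-ins and not what the notebook uses: `STOP_WORDS` is a tiny hand-written list, and `naive_lemma` only strips a plural "s". A real project would normally take the stop-word list and the lemmatizer from an NLP library.

```python
import re

# Tiny illustrative stop-word list; a library list would be much longer.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "it",
              "of", "on", "the", "this", "to", "was"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")


def naive_lemma(word):
    """Rough stand-in for a lemmatizer: strips a trailing plural 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def preprocess(tweet):
    """Lower-case, strip noise, tokenize, drop stop words, lemmatize."""
    text = tweet.lower()
    text = URL_RE.sub(" ", text)
    # Drops digits, punctuation, emojis and any other non-letter symbol.
    text = NON_LETTER_RE.sub(" ", text)
    tokens = text.split()
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [naive_lemma(t) for t in tokens]


print(preprocess("The BAGS are in https://t.co/abc 2023!! 😀"))
```

Joining the returned tokens with spaces gives a cleaned string that can be stored next to the original tweet.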

 

 

 

A new column called Processed tweets is created in the dataframe to hold the result.

 

 

 

Some further data wrangling was then carried out on the Processed tweets column.

 

 

 

Sentiment Analysis

 

 

After data wrangling and pre-processing, the TextBlob library is used to get the polarity of each text, that is, a score for how positive, negative, or neutral the text is, in the range -1 to 1. A condition then assigns the sentiment: a polarity > 0 is positive, == 0 is neutral, and < 0 is negative.
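A minimal sketch of this step, assuming the `textblob` package is installed. It uses the usual convention for polarity scores: above 0 is positive, exactly 0 is neutral, and below 0 is negative. The function names `label_sentiment` and `polarity_of` are hypothetical.

```python
def label_sentiment(polarity):
    """Map a polarity score in [-1, 1] to a sentiment label."""
    if polarity > 0:
        return "Positive"
    if polarity < 0:
        return "Negative"
    return "Neutral"


def polarity_of(text):
    """Polarity of `text` according to TextBlob (third-party package)."""
    # Imported lazily so the labelling rule can be used on its own.
    from textblob import TextBlob

    return TextBlob(text).sentiment.polarity


# The labelling rule itself needs no extra packages:
for score in (0.6, 0.0, -0.25):
    print(score, label_sentiment(score))
```

With TextBlob available, `label_sentiment(polarity_of(tweet))` gives the label for a single processed tweet.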

 

 

 

The code for this project is available on my GitHub page.

 

 

 

Data Visualization

 

 

To visualize the data and tell a more compelling story, we used Microsoft Power BI.

We did not use Python for the visuals because, in our view, its default charts are less appealing than a Power BI dashboard. The data used for this project was saved to a file and sent to my partner for visualization.
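Saving the data can be as simple as writing a CSV file, which Power BI can import. The sketch below uses Python's built-in `csv` module. The column names and the file name in the comment are hypothetical, and the notebook may export from a dataframe instead.

```python
import csv
import io


def write_csv(rows, fileobj):
    """Write a list of dicts to `fileobj` as CSV with a header row."""
    fieldnames = list(rows[0].keys())
    writer = csv.DictWriter(fileobj, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


rows = [
    {"tweet": "great rally today", "polarity": 0.8, "sentiment": "Positive"},
    {"tweet": "no comment", "polarity": 0.0, "sentiment": "Neutral"},
]

# Written to an in-memory buffer here; for a real file use
# open("tweets_sentiment.csv", "w", newline="", encoding="utf-8").
buffer = io.StringIO()
write_csv(rows, buffer)
print(buffer.getvalue())
```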

 

 

 

The visualization was carried out by my partner, @Abisola_Agboola, and the result can be seen below. To see how this dashboard was built, check out Part II of this article.

 

 

 

Part II

 

 

Follow this link, https://aka.ms/twitterdataanalysispart2, to see how this Power BI visual was built and to create your own.

 

 

 

 

 

[Image: the Power BI dashboard built from the Twitter data]

Summary

 

 

In this article, we showed that in many scenarios you will have to work with secondary data in your organization, and that one way to get this data is through web scraping.

 

 

 

By following this article, you have learnt how to:

  • Scrape Twitter using the snscraper library.
  • Clean the data and transform it into tabular form.
  • Use the TextBlob library to calculate a sentiment score for each tweet.
  • Export the data to CSV/Excel so that it can be imported into Power BI for visualization.

 

 

 

You can check out Part II here: https://aka.ms/twitterdataanalysispart2. There you will be able to build your own Power BI visualization and hone your skills.

 

 

 

Resources

Power BI Learning Overview | Microsoft Power BI

Azure for Students – Free Account Credit | Microsoft Azure

 
