LIVE TWITTER STREAM ANALYSIS

UCB W205 DATA STORAGE AND RETRIEVAL | by TED PHAM | April 9, 2017

APPLICATION SUMMARY

The application captures, processes, and stores live tweets in "&lt;word&gt;: count" form in a Postgres database. This process runs continuously until the user terminates it (Ctrl+C). Analyses of the captured words and their occurrence counts can be performed with the accompanying Python serving scripts, which are described in detail in the file structure section below.

The code was developed and tested to be fully functional in a Linux environment on an Amazon Web Services EC2 instance of UCB's community AMI UCB MIDS W205 EX2-FULL. The AMI provides the technologies required by the application: 1. HDFS; 2. Storm; 3. Postgres; 4. psycopg2; 5. Tweepy; 6. Python 2; 7. Streamparse.

APPLICATION TOPOLOGY

(Figure: application topology diagram)

The application topology is the crucial component for capturing and processing Twitter data in real time. A tweet-spout pulls tweets from the Twitter streaming API and runs on three threads. Users can obtain their own Twitter credentials for the tweet-spout by creating a Twitter application at https://apps.twitter.com . For convenience, a set of working credentials is provided. While the tweet-spout retrieves raw tweets, the two bolts, parse-tweet-bolt and count-bolt, parse the tweets into valid words and count those words, respectively. The count-bolt also pushes the counts into a Postgres database and keeps them updated. The Postgres database is reinitialized before each run and therefore contains only the tweet data from that individual run.
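As an illustration, a minimal sketch of the count-bolt's core logic is shown below, assuming streamparse's Bolt API and a two-column (word, count) table; the class name and upsert strategy here are illustrative, not necessarily the repository's exact code:

import psycopg2
from streamparse import Bolt

class WordCountBolt(Bolt):
    def initialize(self, conf, ctx):
        # Keep an in-memory tally and a persistent connection to tcount.
        self.counts = {}
        self.conn = psycopg2.connect(database="tcount")
        self.cur = self.conn.cursor()

    def process(self, tup):
        word = tup.values[0]
        self.counts[word] = self.counts.get(word, 0) + 1
        # Update the row if the word already exists, otherwise insert it.
        self.cur.execute(
            "UPDATE tweetwordcount SET count = %s WHERE word = %s",
            (self.counts[word], word))
        if self.cur.rowcount == 0:
            self.cur.execute(
                "INSERT INTO tweetwordcount (word, count) VALUES (%s, %s)",
                (word, self.counts[word]))
        self.conn.commit()
        self.emit([word, self.counts[word]])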

The database is called tcount with data stored in a table called tweetwordcount.
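initialize.py presumably sets this up along the following lines (a sketch; the actual column definitions in the repository may differ):

import psycopg2

# Recreate the tcount database from scratch so each run starts clean.
conn = psycopg2.connect(database="postgres")
conn.autocommit = True  # CREATE/DROP DATABASE cannot run inside a transaction
cur = conn.cursor()
cur.execute("DROP DATABASE IF EXISTS tcount")
cur.execute("CREATE DATABASE tcount")
conn.close()

# Create the tweetwordcount table inside the new database.
conn = psycopg2.connect(database="tcount")
cur = conn.cursor()
cur.execute("CREATE TABLE tweetwordcount (word TEXT PRIMARY KEY, count INT)")
conn.commit()
conn.close()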

FILE STRUCTURE

The application files are stored in the GitHub repository https://github.com/tedapham/UCB_Tweepy.git , which contains an exercise_2 folder. The file structure described below is relative to this exercise_2 folder.

Name of the program | Location | Description
extweetwordcount.clj | extweetwordcount/topologies/ | Topology for the application
tweets.py | extweetwordcount/src/spouts/ | tweet-spout
parse.py | extweetwordcount/src/bolts/ | parse-tweet-bolt
initialize.py | extweetwordcount/ | Creates a fresh Postgres database; must be run before the application
finalresults.py | extweetwordcount/ | When passed a single word as an argument, returns the total number of occurrences of that word in the stream; without an argument, returns all the words in the stream and their total occurrence counts, sorted alphabetically, one word per line (see the sketch below)
histogram.py | extweetwordcount/ | Given two integers k1 and k2, returns all the words with a total number of occurrences between k1 and k2
top20.py | extweetwordcount/ | Returns the 20 words with the largest number of occurrences
plot.png | /exercise_2 | Plot of the top 20 words
README.txt | /exercise_2 | Execution instructions
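As referenced in the table, here is a sketch of how finalresults.py might implement its two modes (the output format is illustrative):

import sys
import psycopg2

conn = psycopg2.connect(database="tcount")
cur = conn.cursor()

if len(sys.argv) > 1:
    # One word given: report its total occurrence count.
    cur.execute("SELECT count FROM tweetwordcount WHERE word = %s",
                (sys.argv[1],))
    row = cur.fetchone()
    total = row[0] if row else 0
    print("Total number of occurrences of %s: %d" % (sys.argv[1], total))
else:
    # No argument: all words and their counts, sorted alphabetically.
    cur.execute("SELECT word, count FROM tweetwordcount ORDER BY word")
    for word, count in cur.fetchall():
        print("%s: %d" % (word, count))

conn.close()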

EXECUTION INSTRUCTIONS

Whether or not UCB's community AMI is used, make sure the seven technologies listed in the application summary are installed and that the Postgres server is running on the Linux platform. Once the repo is cloned, navigate to the extweetwordcount subfolder inside UCB_Tweepy/exercise_2.
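For example:

$ git clone https://github.com/tedapham/UCB_Tweepy.git

$ cd UCB_Tweepy/exercise_2/extweetwordcount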

Run the application as follows:

$ python initialize.py

$ sparse run

initialize.py must be run first to create the tcount database and the tweetwordcount table in Postgres before sparse run can start.

Once the application is running successfully, a continuous log like the following will be displayed:

(Screenshot: tweet stream log output)

Stop the process with Ctrl+C. At this point, the Twitter data have been tabulated in the Postgres table tweetwordcount.

Run finalresults.py, histogram.py, and top20.py for analyses:

(Screenshot: sample output of finalresults.py)
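For example, the serving scripts can be invoked as follows (the word and the bounds are arbitrary, and the exact argument format for histogram.py may differ):

$ python finalresults.py

$ python finalresults.py hello

$ python histogram.py 5 20

$ python top20.py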
