
Commit 84dcc8e

More robust (and faster!) handling of tweet fetching/analysis.
Parent: 1aab180

3 files changed: 37 additions & 7 deletions

ch06/README.md

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
+Chapter 6 - Classification II - Sentiment Analysis
+==================================================
+
+When doing the last code sanity checks for the book, Twitter
+was still using API version 1.0, which did not require authentication.
+With its switch to version 1.1, this has now changed.
+
+If you have not yet created your personal Twitter access keys and
+tokens, please do so at
+[https://dev.twitter.com/docs/auth/tokens-devtwittercom](https://dev.twitter.com/docs/auth/tokens-devtwittercom) and paste the keys/secrets into twitterauth.py.
+
+Note that some tweets might be missing when you run install.py.
+We experimented a bit with the tweet fetch rate and found that
+max_tweets_per_hr=10000 works just fine now that we are using OAuth. If you experience issues, you might want to lower this value.
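For readers setting this up, here is a minimal sketch of what twitterauth.py could look like. The variable names are assumptions, since this commit does not include the file itself:

```python
# twitterauth.py -- sketch only; the variable names are assumptions.
# Paste in the values generated at
# https://dev.twitter.com/docs/auth/tokens-devtwittercom
CONSUMER_KEY = "<your consumer key>"
CONSUMER_SECRET = "<your consumer secret>"
ACCESS_TOKEN_KEY = "<your access token>"
ACCESS_TOKEN_SECRET = "<your access token secret>"
```

The authenticated python-twitter client that the download loop in install.py relies on (the `api` object) could then be built along these lines; again a sketch, not necessarily the exact wiring in the repository:

```python
# Build an OAuth-authenticated python-twitter client from twitterauth.py.
import twitter
import twitterauth

api = twitter.Api(consumer_key=twitterauth.CONSUMER_KEY,
                  consumer_secret=twitterauth.CONSUMER_SECRET,
                  access_token_key=twitterauth.ACCESS_TOKEN_KEY,
                  access_token_secret=twitterauth.ACCESS_TOKEN_SECRET)
```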

ch06/install.py

Lines changed: 17 additions & 5 deletions
@@ -14,9 +14,6 @@
 # Right now we use unauthenticated requests, which are rate-limited to 150/hr.
 # We use 125/hr to stay safe.
 #
-# We could more than double the download speed by using authentication with
-# OAuth logins. But for now, this is too much of a PITA to implement. Just let
-# the script run over a weekend and you'll have all the data.
 #
 # - Niek Sanders
@@ -139,7 +136,7 @@ def download_tweets(fetch_list, raw_dir):
     os.mkdir(raw_dir)

     # stay within rate limits
-    max_tweets_per_hr = 125
+    max_tweets_per_hr = 10000
     download_pause_sec = 3600 / max_tweets_per_hr

     # download tweets
@@ -159,7 +156,22 @@ def download_tweets(fetch_list, raw_dir):
         # urllib.urlretrieve(url, raw_dir + item[2] + '.json')

         # New Twitter API 1.1
-        json_data = api.GetStatus(item[2]).AsJsonString()
+        try:
+            json_data = api.GetStatus(item[2]).AsJsonString()
+        except twitter.TwitterError, e:
+            # raise on unexpected errors, but skip tweets that no longer exist
+            fatal = True
+            for m in e.message:
+                if m['code'] == 34:
+                    print "Tweet missing: ", item
+                    # [{u'message': u'Sorry, that page does not exist', u'code': 34}]
+                    fatal = False
+                    break
+
+            if fatal:
+                raise
+            else:
+                continue
+
         with open(raw_dir + item[2] + '.json', "w") as f:
             f.write(json_data + "\n")
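For reference, the skip-on-missing behaviour added to the download loop boils down to the following standalone sketch. It assumes an authenticated python-twitter `api` object as above; the helper name fetch_tweet_json is illustrative, and error code 34 is Twitter's "Sorry, that page does not exist":

```python
import twitter  # python-twitter, as used by install.py


def fetch_tweet_json(api, tweet_id):
    """Return the tweet as a JSON string, or None if Twitter reports the
    tweet as missing (error code 34); re-raise any other TwitterError."""
    try:
        return api.GetStatus(tweet_id).AsJsonString()
    except twitter.TwitterError, e:
        # e.message is a list of error dicts, e.g.
        # [{u'message': u'Sorry, that page does not exist', u'code': 34}]
        if any(m['code'] == 34 for m in e.message):
            print "Tweet missing: ", tweet_id
            return None
        raise
```

Callers can then simply skip None results instead of aborting the whole download, which is what the continue in the loop above achieves.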

ch06/utils.py

Lines changed: 6 additions & 2 deletions
@@ -54,9 +54,13 @@ def load_sanders_data(dirname=".", line_count=-1):

             tweet_fn = os.path.join(
                 DATA_DIR, dirname, 'rawdata', '%s.json' % tweet_id)
-            tweet = json.load(open(tweet_fn, "r"))
-            if 'text' in tweet and tweet['user']['lang'] == "en":
+            try:
+                tweet = json.load(open(tweet_fn, "r"))
+            except IOError:
+                print("Tweet '%s' not found. Skip." % tweet_fn)
+                continue

+            if 'text' in tweet and tweet['user']['lang'] == "en":
                 topics.append(topic)
                 labels.append(label)
                 tweets.append(tweet['text'])
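The same guard can be factored into a small helper, shown here only as a sketch; utils.py in this commit keeps the try/except inline as shown above:

```python
import json


def load_tweet(tweet_fn):
    """Return the parsed tweet dict, or None if the raw .json file was never
    downloaded (install.py skips tweets that Twitter reports as missing)."""
    try:
        with open(tweet_fn, "r") as f:
            return json.load(f)
    except IOError:
        print("Tweet '%s' not found. Skip." % tweet_fn)
        return None
```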
