In the previous recipes, we have seen rivers that fetch data from data stores, both SQL and NoSQL. In this recipe, we'll discuss how to use the Twitter river to collect tweets from Twitter and store them in ElasticSearch.
You need a working ElasticSearch instance and a Twitter OAuth token. To obtain the token, log in to Twitter (https://dev.twitter.com/apps/) and create a new app at https://dev.twitter.com/apps/new.
To use the Twitter river, perform the following steps:
First, install the Twitter river plugin:
bin/plugin -install elasticsearch/elasticsearch-river-twitter/1.4.0
The output should be similar to the following:
-> Installing elasticsearch/elasticsearch-river-twitter/1.4.0...
Trying http://download.elasticsearch.org/elasticsearch/elasticsearch-river-twitter/elasticsearch-river-twitter-1.4.0.zip...
Downloading …....DONE
Installed river-twitter into …/elasticsearch/plugins/river-twitter
Restart the ElasticSearch node so that the river plugin is loaded. The log should contain lines similar to the following:
…
[2013-08-18 14:59:10,143][INFO ][node ] [Fight-Man] initializing ...
[2013-08-18 14:59:10,163][INFO ][plugins ] [Fight-Man] loaded [river-twitter, transport-thrift, jdbc-river], sites []
Create a config (.json) file to configure the river, as follows:
{
  "type" : "twitter",
  "twitter" : {
    "oauth" : {
      "consumer_key" : "*** YOUR Consumer key HERE ***",
      "consumer_secret" : "*** YOUR Consumer secret HERE ***",
      "access_token" : "*** YOUR Access token HERE ***",
      "access_token_secret" : "*** YOUR Access token secret HERE ***"
    },
    "type" : "sample",
    "ignore_retweet" : true
  },
  "index" : {
    "index" : "my_twitter_river",
    "type" : "status",
    "bulk_size" : 100
  }
}
Now create the river with this configuration:
curl -XPUT 'http://127.0.0.1:9200/_river/twitterriver/_meta' -d @config.json
The result will be as follows:
{"ok":true,"_index":"_river","_type":"twitterriver", "_id":"_meta","_version":1}
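The same registration can also be scripted. The following is a minimal Python sketch that assumes the node address and river name used in the curl command; the helper names build_river_request and register_river are hypothetical and are not part of ElasticSearch or the river plugin.

```python
import json
import urllib.request


def build_river_request(config, river_name="twitterriver",
                        base_url="http://127.0.0.1:9200"):
    """Build (but do not send) the PUT request that registers a river.

    The URL layout mirrors the curl call: <node>/_river/<name>/_meta.
    """
    url = "{}/_river/{}/_meta".format(base_url.rstrip("/"), river_name)
    body = json.dumps(config).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def register_river(config_path="config.json", **kwargs):
    """Load the river settings from a file and send them to the node."""
    with open(config_path, encoding="utf-8") as handle:
        config = json.load(handle)  # fails early if the file is not valid JSON
    request = build_river_request(config, **kwargs)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Loading the file with json.load before sending it catches a malformed config file locally, before anything reaches the node. Against a running node, register_river() should return the same acknowledgement that curl prints.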
After authenticating with Twitter, the river starts collecting tweets and sends them to ElasticSearch in bulk.
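To illustrate what sending in bulk means for the bulk_size setting, here is a simplified Python sketch of grouping a stream of items into batches. It is not the plugin's actual code; the function name batches is hypothetical, and a real river may also flush partial batches on other conditions.

```python
from itertools import islice


def batches(items, bulk_size=100):
    """Yield lists of at most bulk_size items from any iterable."""
    if bulk_size < 1:
        raise ValueError("bulk_size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, bulk_size))
        if not chunk:
            return  # the stream is exhausted
        yield chunk  # one batch corresponds to one bulk request
```

With bulk_size set to 100, a run of 250 tweets would be sent as three batches, the last one holding the remaining 50.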
The river type is twitter and all client configurations live on the twitter object. The following are the most common parameters:
oauth: An object containing the four keys used to access the Twitter API. They are generated when you create a Twitter application, and are as follows:
    consumer_key
    consumer_secret
    access_token
    access_token_secret
type: One of the following three stream types allowed by the Twitter API:
    sample
    filter (refer to https://dev.twitter.com/docs/api/1.1/post/statuses/filter)
    firehose
raw (default false): If true, the tweets are indexed in ElasticSearch without any change.
ignore_retweet (default false): If true, retweets are skipped.

To control the Twitter flow, we need to define an additional filter object.
Defining a filter automatically switches the type to filter. The Twitter filter API allows us to define the following additional filter parameters:
tracks: The keywords to track.
follow: The IDs of the Twitter users to follow.
locations: A set of bounding boxes to track.

These are the filter capabilities that Twitter provides to reduce the number of tweets sent to you and to focus the search on particular targets.
A filter river config file will look as follows:
{
  "type" : "twitter",
  "twitter" : {
    "oauth" : {
      "consumer_key" : "*** YOUR Consumer key HERE ***",
      "consumer_secret" : "*** YOUR Consumer secret HERE ***",
      "access_token" : "*** YOUR Access token HERE ***",
      "access_token_secret" : "*** YOUR Access token secret HERE ***"
    },
    "filter" : {
      "tracks" : ["elasticsearch", "cookbook", "packtpub"]
    }
  }
}
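Hand-written config files are easy to get wrong, for example through a stray trailing comma. As a hedged sketch, the settings can instead be assembled in Python and serialized with the json module, which always produces valid JSON. The helper names build_twitter_river_config and dump_config are hypothetical. Only the tracks filter is covered, because its array form is the one shown in this recipe.

```python
import json

REQUIRED_OAUTH_KEYS = ("consumer_key", "consumer_secret",
                       "access_token", "access_token_secret")
ALLOWED_TYPES = ("sample", "filter", "firehose")


def build_twitter_river_config(oauth, stream_type="sample", tracks=None,
                               ignore_retweet=False):
    """Assemble a Twitter river settings dictionary (illustrative only)."""
    missing = [key for key in REQUIRED_OAUTH_KEYS if key not in oauth]
    if missing:
        raise ValueError("missing OAuth keys: " + ", ".join(missing))
    if stream_type not in ALLOWED_TYPES:
        raise ValueError("unknown stream type: " + repr(stream_type))

    twitter = {"oauth": dict(oauth), "ignore_retweet": bool(ignore_retweet)}
    if tracks:
        # Per the recipe, defining a filter switches the type to filter,
        # so no explicit "type" key is written in this case.
        twitter["filter"] = {"tracks": list(tracks)}
    elif stream_type == "filter":
        raise ValueError("the filter type needs at least one filter parameter")
    else:
        twitter["type"] = stream_type
    return {"type": "twitter", "twitter": twitter}


def dump_config(config):
    """Serialize the settings; json.dumps never emits trailing commas."""
    return json.dumps(config, indent=2)
```

Calling build_twitter_river_config with the four OAuth values and tracks=["elasticsearch", "cookbook", "packtpub"] yields the same structure as the filter config file shown above, and the output of dump_config can be saved as config.json.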