=== Dataset of Frequent Users of Foursquare v1 ===

DATASET STATISTICS

Data collection interval: 31 August - 1 October 2011
Data source: Foursquare
Number of users: 9,167
Number of check-ins: 959,122

DATASET DOWNLOAD

The dataset download is made up of 2 parts:

1. Extracting the Tweet information (timestamp and text)

Since Twitter's terms of service do not allow redistribution of tweets, only Tweet Ids and usernames are provided. The .dat files in the archive are in the same format as the TREC Microblog Corpus. The twitter-corpus-tools downloader can then be used to retrieve the content of the tweets:
https://github.com/lintool/twitter-corpus-tools

2. Extracting the Foursquare venue information

The venue information can be extracted via the Venues Platform of the Foursquare API:
https://developer.foursquare.com/overview/venues
The rate limit for downloading venues is 5,000 venues/hour, with more available on request.

The .dat files with the dataset are in the archive FreqUsersV1.tar.gz . They contain lines with the following format:
\TAB\TAB

DATASET CREATION PROCESS

- from the entire Twitter Gardenhose stream (a 10% sample of all Twitter data) over the desired period, we keep only the tweets related to Foursquare by examining the `source' field. Although messages related to Foursquare can also be posted from other sources, we estimated that around 80% of such messages have `foursquare' as the source, which avoids following the URLs, a costly operation.
- we filter out of these tweets the messages that contain mayorships or other Foursquare alerts that are not check-ins.
- from the resulting set of check-ins, we group and sort users by their number of check-ins. We select users with more than 12 check-ins/month, giving an expected average of roughly 4 check-ins/day once we take into consideration that we have a 10% sample of the data.
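The user-selection step above can be sketched as follows. This is a minimal illustration, not the actual pipeline code: the (user, timestamp) layout and the function name are assumptions, and the input is taken to cover roughly one month of the 10% sample.

```python
from collections import Counter

def select_frequent_users(checkins, min_per_month=12):
    """Keep users with more than `min_per_month` check-ins.

    `checkins` is assumed to be a list of (user, timestamp) pairs
    covering roughly one month of the 10% Gardenhose sample.
    """
    counts = Counter(user for user, _ in checkins)
    return {user for user, n in counts.items() if n > min_per_month}

# Synthetic example: user 'a' has 13 check-ins, user 'b' has 2.
sample = [('a', t) for t in range(13)] + [('b', t) for t in range(2)]
frequent = select_frequent_users(sample)  # only 'a' exceeds 12/month
```

Because 12 check-ins/month are observed in a 10% sample, the selected users are expected to average roughly 120 check-ins/month, i.e. about 4 check-ins/day, in the full stream.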
- for each of these users, we downloaded all of their Foursquare check-ins that were pushed to Twitter in the selected time span, using the Twitter Search API. We highlight that, by fixing a set of users and using the same technique, data can also be collected in real time.
- for each user's check-in, we crawled the link to the Foursquare website contained in the tweet and extracted the venue id where the check-in was registered.
- we use the Foursquare Venues API to extract the information associated with each check-in and venue, thus obtaining a mapping between the tweet and the full venue information.
- we then removed all check-ins registered within 2 minutes of each other, because these are likely spam and thus uninformative for our purposes.
- we removed all check-ins made at the same venue within one hour.
- for each user, we delete all information for a given day if it contains fewer than 3 check-ins, so that we keep only days with high usage of the system.
- for every user, we consider their `home' location to be the venue in which they have checked in the most times in our dataset.

Daniel Preotiuc-Pietro
www.preotiuc.ro
daniel@dcs.shef.ac.uk

06 August 2012
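As an illustration, the cleaning rules described in the creation process (2-minute spam removal, same-venue-within-an-hour removal, dropping low-usage days, and the `home' venue) can be sketched as below. This is a hypothetical reconstruction, not the original pipeline: the (timestamp, venue_id) layout is an assumption, and the 2-minute rule is read as keeping the first of each close pair.

```python
from collections import Counter
from datetime import datetime, timedelta

def clean_checkins(checkins):
    """Apply the cleaning rules to one user's check-ins.

    `checkins` is a list of (timestamp, venue_id) pairs sorted by
    timestamp; this layout is an assumption for illustration.
    """
    # Rule 1: drop a check-in registered within 2 minutes of the
    # previously kept one (likely spam). One simple reading of the
    # rule: keep the first of each close pair.
    kept = []
    for ts, venue in checkins:
        if kept and ts - kept[-1][0] < timedelta(minutes=2):
            continue
        kept.append((ts, venue))

    # Rule 2: drop repeated check-ins at the same venue within one hour.
    last_seen, deduped = {}, []
    for ts, venue in kept:
        if venue in last_seen and ts - last_seen[venue] < timedelta(hours=1):
            continue
        last_seen[venue] = ts
        deduped.append((ts, venue))

    # Rule 3: keep only days with at least 3 surviving check-ins.
    per_day = Counter(ts.date() for ts, _ in deduped)
    final = [(ts, v) for ts, v in deduped if per_day[ts.date()] >= 3]

    # Rule 4: the `home' venue is the most frequently visited one.
    venues = Counter(v for _, v in final)
    home = venues.most_common(1)[0][0] if venues else None
    return final, home

# Synthetic example: 5 raw check-ins by one user on one day.
base = datetime(2011, 9, 1, 8, 0)
raw = [
    (base, 'v1'),
    (base + timedelta(minutes=1), 'v1'),  # < 2 min after previous: dropped
    (base + timedelta(hours=1), 'v2'),
    (base + timedelta(hours=3), 'v2'),    # same venue, > 1 hour later: kept
    (base + timedelta(hours=5), 'v3'),
]
cleaned, home = clean_checkins(raw)       # 4 check-ins survive; home is 'v2'
```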