Recently, Yelp made available a sample dataset from the greater Phoenix metropolitan area, including around 11,000 businesses, 8,000 checkin sets, 43,000 users and 230,000 user reviews. With the help of this data, data scientists can run real-life experiments with various data mining and machine learning algorithms. In our case, we are interested in finding out whether it is possible to visually cluster businesses by category, based purely on their checkin data. The checkin data itself is available at a day-hour level: for each business, it is possible to retrieve, for example, the number of checkins on a Sunday afternoon between 3 and 4. So, with only this data in mind, are we able to cluster businesses as being restaurants or fashion stores, based purely on the correlations calculated amongst their checkin data? For this experiment, we use the Neo4J graph database for storing our checkin-based correlation graph and employ the Gephi graph visualisation platform for interpreting the identified business communities/clusters. As always, the full source code of this article can be found on the Datablend public github repository (although you will need to acquire the dataset yourself through the Yelp Dataset Challenge portal).

Building the Neo4J checkin correlation graph

We start by parsing both the business and checkin json-files from the Yelp Dataset Challenge. Unfortunately, checkin data is available for only 8,282 out of the 11,537 supplied businesses. In addition, many of these have only a limited set of associated checkins. Hence, to make sure that only relevant correlations are calculated, we ignore the businesses that have fewer than 100 checkins, which leaves around 1,920 businesses. Next, we try to identify the correlation between each pair of businesses by using the Pearson Correlation Coefficient (read this site for a nice introduction). Simply put, we try to identify whether a linear association exists between the checkins of two individual businesses.
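To make the filtering and correlation steps concrete, here is a minimal Python sketch. It is not the code from the Datablend repository. It assumes that each business's checkins have already been parsed into a dictionary mapping a hypothetical (day, hour) pair to a count, with day 0-6 and hour 0-23; the function names and the `min_checkins` parameter are illustrative only.

```python
import math

HOURS_PER_WEEK = 7 * 24  # one slot per day-hour combination


def to_vector(checkins):
    """Flatten {(day, hour): count} into a 168-slot list (assumed input shape)."""
    vec = [0] * HOURS_PER_WEEK
    for (day, hour), count in checkins.items():
        vec[day * 24 + hour] = count
    return vec


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences.

    Returns None when either sequence is constant, because the
    coefficient is undefined in that case.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)


def correlations(businesses, min_checkins=100):
    """Correlate every pair of businesses that has enough checkins.

    `businesses` maps a business id to its {(day, hour): count} dictionary.
    """
    vectors = {
        bid: to_vector(checkins)
        for bid, checkins in businesses.items()
        if sum(checkins.values()) >= min_checkins
    }
    ids = sorted(vectors)
    result = {}
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            r = pearson(vectors[a], vectors[b])
            if r is not None:
                result[(a, b)] = r
    return result
```

Each resulting pair and its coefficient could then be stored as a weighted relationship between two business nodes in the graph. A coefficient close to 1 means the two businesses tend to be busy during the same hours of the week.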