Tuesday, March 7, 2017

Visualizing and Clustering Pizza Places in Pittsburgh

I am using the dataset provided by Yelp.com to explore the spatial distribution of restaurants in Pittsburgh. I wrote a small program to parse the JSON file provided on the website and extract restaurant information from it. The program is available here.
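The parsing step could look roughly like the sketch below. This is not the program linked above, only a minimal illustration. It assumes the business file holds one JSON object per line and uses the field names "name", "city", "latitude", "longitude", "stars", "review_count", and "categories"; these names, and the function extract_pizza_places itself, are assumptions and may differ between releases of the dataset.

```python
import json


def extract_pizza_places(lines, city="Pittsburgh", category="Pizza"):
    """Return a list of dicts for businesses in `city` tagged with `category`.

    `lines` is any iterable of strings, each holding one JSON object
    (an assumed layout for the business file).
    """
    places = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)

        # Assumption: "categories" may be a list, a comma-separated
        # string, or null, depending on the dataset release.
        categories = record.get("categories") or []
        if isinstance(categories, str):
            categories = [c.strip() for c in categories.split(",")]

        if record.get("city") == city and category in categories:
            places.append({
                "name": record.get("name"),
                "latitude": record.get("latitude"),
                "longitude": record.get("longitude"),
                "stars": record.get("stars"),
                "review_count": record.get("review_count"),
            })
    return places
```

Because the function accepts any iterable of lines, an open file object can be passed to it directly, so the whole file never has to be loaded into memory at once.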

The graph below shows the locations of pizza places plotted on the map of Pittsburgh. The colors of the nodes represent the Stars (ratings from reviewers), and the sizes represent the number of reviews.

I clustered the pizza places into five groups using the k-means algorithm, according to their geographical locations and numbers of reviews.
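For readers who want to try a similar clustering, here is a minimal pure-Python sketch of k-means (Lloyd's algorithm) on rows of (latitude, longitude, review count). It is not necessarily the implementation used for the results in this post. Standardizing each feature before clustering is my assumption, made so that review counts do not swamp the much smaller differences in coordinates; the function names and the fixed random seed are also illustrative choices.

```python
import random
from statistics import mean, pstdev


def standardize(rows):
    """Z-score each column so all features have comparable scale."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Fall back to 1.0 for a constant column to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        tuple((v - m) / s for v, m, s in zip(row, means, stds))
        for row in rows
    ]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=100, seed=0):
    """Return (labels, centers) after running Lloyd's algorithm."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # pick k distinct points as initial centers
    labels = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = tuple(mean(col) for col in zip(*members))
    return labels, centers


def summarize(rows, labels, k):
    """Per cluster: (item count, mean of each column) in original units."""
    summary = []
    for j in range(k):
        members = [row for row, label in zip(rows, labels) if label == j]
        if members:
            summary.append(
                (len(members),) + tuple(mean(col) for col in zip(*members))
            )
    return summary
```

Calling kmeans(standardize(rows), 5) and then summarize(rows, labels, 5) would produce one line per cluster with the item count and the center latitude, longitude, and review count. Since k-means depends on its random starting centers, the exact numbers can vary from run to run.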

Cluster     Number of Items   Center Latitude   Center Longitude   Center Review Count
Cluster 1   145               40.345            -80.055            17.952
Cluster 2   73                40.353            -79.791            11.945
Cluster 3   215               40.444            -79.966            38.888
Cluster 4   121               40.531            -80.086            18.397
Cluster 5   79                40.494            -79.78             12.823

It is interesting that the center of Cluster 3 (red) has a much higher review count than the centers of the other clusters. Comparing the results with the map of Pittsburgh neighborhoods (Image: Tom Murphy VII.) shows that most points in Cluster 3 are located in the central district and areas near the city center. That may be why these pizza places receive more reviews, which implies that they are more popular.

Next, I clustered the pizza places into five groups according to their geographical locations and Stars instead of the number of reviews. This time, Cluster 3 (red) and Cluster 4 (light green) overlap substantially near the city center. The Stars of the centers of Cluster 3 and Cluster 4 are very different: 2.4 and 3.8, respectively. That means the pizza places in the same area are split into two groups by rating. Therefore, while pizza places near the city center are more popular than others, they are not necessarily higher rated. I find it very interesting to explore social science questions by analyzing social media data, and I hope to do more in the future!