Tuesday, March 7, 2017

Visualizing and Clustering Pizza Places in Pittsburgh

I am using the dataset provided by Yelp.com to explore the spatial distribution of restaurants in Pittsburgh. I wrote a small program to parse the JSON file provided on the website and extract restaurant information from it. The program is available here.

The graph below shows the locations of pizza places plotted on the map of Pittsburgh. The colors of the nodes represent the Stars (ratings from reviewers), and the sizes represent the number of reviews.

I clustered the pizza places into five groups using the k-means algorithm, according to their geographical locations and numbers of reviews.

Cluster     Number of Items   Center Latitude   Center Longitude   Center Review Count
Cluster 1   145               40.345            -80.055            17.952
Cluster 2   73                40.353            -79.791            11.945
Cluster 3   215               40.444            -79.966            38.888
Cluster 4   121               40.531            -80.086            18.397
Cluster 5   79                40.494            -79.78             12.823
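
For readers who want to try the clustering step, here is a minimal sketch of plain k-means (Lloyd's algorithm) in Python. It is not the program behind this post: the function name kmeans, the fixed seed and the toy points are assumptions for illustration. Each point would be a (latitude, longitude, review count) tuple. Whether the features were rescaled before clustering is not stated above, so the sketch leaves them as they are.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Plain Lloyd's k-means on equal-length numeric tuples (illustrative)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center
        # (squared Euclidean distance).
        labels = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return centers, labels


# Hypothetical toy data: (latitude, longitude, review_count)
places = [(40.44, -79.97, 40.0), (40.45, -79.96, 38.0), (40.35, -80.05, 12.0)]
centers, labels = kmeans(places, k=2)
```

Because review counts span a much larger range than coordinates, they dominate the distance unless the columns are rescaled first.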

It is interesting that the center of Cluster 3 (red) has a much higher review count than the other clusters. Comparing the results with the map of Pittsburgh neighborhoods (Image: Tom Murphy VII.) shows that most nodes of Cluster 3 are located in the central district and areas near the city center. That may be why these pizza places receive more reviews, which implies that they are more popular.

Next, I clustered the pizza places into five groups according to their geographical locations and Stars instead of the number of reviews. This time, Cluster 3 (red) and Cluster 4 (light green) overlap substantially near the city center. The Stars of the centers of Cluster 3 and 4 are very different: 2.4 and 3.8 respectively. That means the pizza places in the same area are split into two groups by rating. Therefore, while pizza places near the city center are more popular than others, they are not higher rated. I find it very interesting to explore social science questions by analyzing social media data, and hope to do more in the future!

Wednesday, February 22, 2017

A location model based on rent

I have developed a location model based on rent. In this model, the rent of each cell is calculated by taking the average of agents' income in this area. Agents have different income levels and requirements on space. Agents want to be located in the most accessible area they can afford where their preferences for space are matched.
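
The rent rule can be sketched in a few lines of Python. This is only an illustration, not the model's code: the names cell_rents and affordable_cells are made up here, and the affordability test (rent no higher than income) is an assumption, since the exact rule is not spelled out above.

```python
from collections import defaultdict


def cell_rents(agents):
    """Map each cell to the mean income of the agents located in it.

    `agents` is an iterable of (cell, income) pairs.
    """
    incomes = defaultdict(list)
    for cell, income in agents:
        incomes[cell].append(income)
    return {cell: sum(values) / len(values) for cell, values in incomes.items()}


def affordable_cells(rents, income):
    """Cells whose rent does not exceed the agent's income (assumed rule)."""
    return [cell for cell, rent in rents.items() if rent <= income]


# Hypothetical example: two agents share cell (0, 0), one lives in (1, 0).
rents = cell_rents([((0, 0), 10), ((0, 0), 30), ((1, 0), 50)])
```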

There are two types of agents: residents and employers. Residents are high income (e.g. financial services), middle income (e.g. teachers and other professional occupations) or low income workers, which are classed as ‘commerce’, ‘service’ and ‘industry’ respectively. These classes are additionally broken down by age as young (18-34), middle aged (35-65) and old (66+). The agent’s age is drawn randomly (18-67) when it is first created. Each agent desires a certain amount of space, which depends on its age category.

Employer agents were designed to reflect the residential agents’ employers, so the same three groups of ‘commerce’, ‘service’ and ‘industrial’ are used to represent employers’ different roles. Instead of an age, employers have a tenure set between 0 and 6. Employer agents decrease their tenure until it reaches zero. Once zero is reached, the employer can move. As with residents, employers have a space requirement. For example, industrial firms are driven by the need for large amounts of land, while financial services (i.e. ‘commerce’ employers) need less land but want a more central location. Each employer also has an income which is four times that of residents.

It is assumed that younger residential agents move more frequently (every 2 iterations on average) than those who are middle aged (every 5 iterations), with the older residents moving the least (every 10 iterations). Employers, on the other hand, only move if their tenure is 0. Once an employer agent has found a suitable location and moved, its tenure is reset to 6 and it cannot move for 6 iterations of the model.
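
One way to read these moving rules is sketched below in Python. It is an interpretation, not the model's code: "every N iterations on average" is treated as a per-iteration probability of 1/N, tenure is assumed to drop by one per iteration, and all names are hypothetical.

```python
import random

# Average number of iterations between moves, by age group (from the text).
MOVE_INTERVAL = {"young": 2, "middle": 5, "old": 10}
MAX_TENURE = 6


def age_group(age):
    """Classify an age into the three groups used by the model."""
    if age <= 34:
        return "young"
    if age <= 65:
        return "middle"
    return "old"


def resident_wants_to_move(age, rng=random):
    """Assumed rule: move with probability 1/N in each iteration."""
    return rng.random() < 1.0 / MOVE_INTERVAL[age_group(age)]


def step_employer(tenure, found_location):
    """Advance an employer by one iteration; return (new_tenure, moved)."""
    if tenure > 0:
        return tenure - 1, False  # still locked in place
    if found_location:
        return MAX_TENURE, True   # moved: tenure is reset to 6
    return 0, False               # free to move, but nothing suitable yet
```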

Agents of either residential or employer type want to be located in the most accessible area they can afford where their preferences for space are matched. An alternative zonal system is used, based on a series of small overlapping areas. Because the search is not restricted by zone boundaries, agents can search the entire area and identify clusters that spread across such boundaries.

When an agent decides to move, it goes through the list of areas and finds the most attractive one (in this case, attractiveness is based on accessibility). The agent initially moves to the centre of the area, then searches the area for an affordable neighborhood.

The results with one city center:

The results with new city center:

The code can be found here: https://github.com/YangZhouCSS/Bitrent

Saturday, April 23, 2016

Walk This Way: Pedestrian agent-based model using mobility datasets

This is a NetLogo reimplementation of the pedestrian model in “Walk This Way: Improving Pedestrian Agent-Based Models through Scene Activity Analysis” by Andrew Crooks et al. The purpose of pedestrian models in general is to better understand and model how pedestrians utilize and move through space. This model makes use of mobility datasets from video surveillance to explore the potential that this type of information offers for the improvement of agent-based pedestrian models.

The visualization of the model looks like this:
(Grey boxes are the obstacles. Yellow triangles are the agents.)

Here is a video showing the simulation process:

There are 16 entrances and 18 exits in the model. An agent is created at an entrance and chooses one exit as its destination. Agents move towards their destinations using the shortest route while avoiding both the fixed obstacles and the other agents. The rule for selecting the shortest route is simple: among the patches an agent can see, set the one with the lowest gradient as the target and move towards it. An agent can see a patch if it is both within vision and not blocked by obstacles. The method of calculating gradients is explained below.
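
The target-selection rule can be sketched as follows in Python. This is an illustration rather than the NetLogo code: the names line_of_sight and choose_target are hypothetical, patches are integer grid cells, and visibility is approximated by sampling points along the straight line between two cells.

```python
import math


def line_of_sight(a, b, obstacles):
    """True if no sampled point on the segment a-b falls in an obstacle cell."""
    (x0, y0), (x1, y1) = a, b
    # Two samples per cell of distance; at least one step when a == b.
    steps = max(abs(x1 - x0), abs(y1 - y0)) * 2 or 1
    for i in range(steps + 1):
        t = i / steps
        cell = (round(x0 + t * (x1 - x0)), round(y0 + t * (y1 - y0)))
        if cell in obstacles:
            return False
    return True


def choose_target(pos, gradient, obstacles, vision):
    """Pick the visible patch with the lowest gradient, or None if none is visible.

    `gradient` maps each patch (x, y) to its gradient value.
    """
    visible = [
        cell
        for cell in gradient
        if cell not in obstacles
        and math.dist(pos, cell) <= vision
        and line_of_sight(pos, cell, obstacles)
    ]
    return min(visible, key=gradient.get) if visible else None


# Hypothetical corridor: gradient falls towards x = 4, with a wall at x = 2.
corridor = {(x, 0): 4 - x for x in range(5)}
target = choose_target((0, 0), corridor, obstacles={(2, 0)}, vision=10)
```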

Diagram of the route-planning algorithm:

Two types of empirical data are used in this model. First, the empirical probability of choosing each entrance and exit is used when creating agents and assigning their entrances and exits. Second, the empirical data of how people moved on this map on August 25th is used to construct the gradient maps, according to which agents select their paths towards their destinations. The more frequently a patch was chosen as part of a path and the closer it is to the destination, the lower its gradient. When the empirical gradient maps are not used, the gradient map is constructed purely from distance to the destination. Four scenarios are designed to compare the simulation results with the empirical result, in order to show how mobility data could help to improve pedestrian models.

Scenario 1: No Realistic Information about Entrance/Exit Probabilities or Heat Maps
In this scenario, entrance and exit locations are considered known, but traffic flow through them is considered unknown. Under such conditions, we run the model to understand its basic functionality without calibrating it with real data about entrance and exit probabilities or activity-based heat maps. This serves as a comparison benchmark, to assess later on how the ABM calibration through such information improves (or reduces) our ability to model movement within our scene.

Scenario 2: Realistic Entrance/Exit Probabilities But Disabled Heat Maps
In this scenario, we explore the effects of introducing realistic entrance and exit probabilities on the model. The heat map models used are distance-based, and not informed by the real datasets. Instead, we use distance-based gradients (i.e., agents choose an exit and walk the shortest route to that exit).

Scenario 3: Realistic Heat Maps but Disabled Entrance/Exit Probabilities
In this scenario we introduce real data-derived heat maps in the model calibration. These activity-based heat map-informed gradients are derived from harvesting the scene activity data; however, entrance and exit probabilities are turned off. In a sense one could consider this a very simple form of learning how agents walk on paths more frequently traveled within the scene. It also allows us to compare the extent to which the quality of the results is due to the heat maps versus entrance and exit probabilities.

Scenario 4: Realistic Entrance/Exit Probabilities and Heat Maps Enabled
In the final scenario we use all available information to calibrate our ABM, namely, the heat map-informed gradients and entrance-exit combinations and see how this knowledge impacts the performance of the ABM.

Please note that there is one gradient map for each pair of entrance and exit, so 16 * 18 = 288 maps are loaded. However, the final result is compared to only one path frequency map, which is empirical data obtained on August 25th. Also note that when the entrance/exit probabilities table is used, some entrances and exits have a probability of being chosen equal to zero. When the table is not used, agents choose any entrance or exit at random.
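
The difference between using and not using the probabilities table can be sketched in Python as below. The function name choose_gate and the example probabilities are hypothetical; the real table comes from the empirical data.

```python
import random


def choose_gate(gates, probabilities=None, rng=random):
    """Pick an entrance or exit.

    With a probabilities table, gates are drawn in proportion to their
    weights, so a gate with probability zero is never chosen. Without the
    table, every gate is equally likely.
    """
    if probabilities is None:
        return rng.choice(gates)
    return rng.choices(gates, weights=probabilities, k=1)[0]


# Hypothetical table: exit "B" was never used in the empirical data.
exits = ["A", "B", "C"]
picked = choose_gate(exits, probabilities=[0.7, 0.0, 0.3])
```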

Please find the model here:

Monday, February 15, 2016

Pedestrian model of agents exiting a building

I built a model of pedestrians who try to leave the floor through one or two exits. The map being used is from GMU’s Krasnow Institute. The model records the frequency of each cell being chosen as a path and draws the result into a path graph, which can be exported to ArcGIS for further analysis.

Here is a graph showing the path graph opened in ArcGIS:  

Here is a video showing the simulation process:

Each patch has a variable called elevation, which is determined by (1) the shortest distance to the exit and (2) if the patch is in a room, its distance to the gate of the room, so that elevation is lower closer to the gate. If there is more than one exit patch, the elevation is equal to the distance to the closest exit patch. People use the gravity model (always flow to lower elevation, if space is available) to move to the exit.

In this model, the “elevation” of a patch is decided by its distance to the exits as well as how close it is to the gate of the room, so that people can run out of rooms. When running the model, people always try to move to lower elevation. This algorithm can also be used to build a rainfall model to analyze the movement of rain drops on the ground. See this link for the Rainfall model. (http://geospatialcss.blogspot.com/2015/10/rainfall-model-of-crater-lake-national.html)
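
The elevation idea can be sketched in Python as follows. This is a simplified illustration, not the NetLogo code: it measures distance in grid steps with a breadth-first search from the exit patches, which handles walls and doorways in one pass, whereas the model combines distance to the exit with distance to the room gate. The names elevation_map and step_downhill are hypothetical.

```python
from collections import deque

NEIGHBOR_OFFSETS = ((1, 0), (-1, 0), (0, 1), (0, -1))


def elevation_map(walkable, exits):
    """Elevation of each reachable patch = number of steps to the closest exit."""
    elevation = {e: 0 for e in exits}
    queue = deque(exits)
    while queue:
        x, y = queue.popleft()
        for dx, dy in NEIGHBOR_OFFSETS:
            n = (x + dx, y + dy)
            if n in walkable and n not in elevation:
                elevation[n] = elevation[(x, y)] + 1
                queue.append(n)
    return elevation


def step_downhill(pos, elevation, occupied):
    """Move to the lowest free neighboring patch, or stay put if none is lower."""
    x, y = pos
    options = [
        (x + dx, y + dy)
        for dx, dy in NEIGHBOR_OFFSETS
        if (x + dx, y + dy) in elevation
        and (x + dx, y + dy) not in occupied
        and elevation[(x + dx, y + dy)] < elevation[pos]
    ]
    return min(options, key=elevation.get) if options else pos


# Hypothetical corridor of four patches with the exit at the right end.
corridor = {(x, 0) for x in range(4)}
elev = elevation_map(corridor, exits=[(3, 0)])
```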

I have also added the export function to export the path frequency graph to an asc file. You may open the file in ArcGIS for further analysis.

Here is the code:

Saturday, February 6, 2016

Agents Exiting A Room

This is a model of agents who try to leave the room through the exit on the right hand side. The model also records the frequency of each cell being chosen as a path and draws the result into a path graph, which can be exported to GIS for further analysis.  

Here is a graph showing the path graph opened in GIS:  
In order to calculate the “elevation”, each patch calculates its distance to each exit patch and sets the lowest distance as its elevation. When running the model, people always try to move to lower elevation. This algorithm can also be used to build a rainfall model to analyze the movement of rain drops on the ground.

A video showing the process:


Saturday, January 30, 2016

Path finding model using the A-star algorithm in NetLogo

This is a path-finding model using the A-star algorithm to find the shortest path. The model uses the map of George Mason University, including the buildings, walkways, driveways, and water bodies. Commuters randomly select a building as their destination, then find and follow the shortest path to reach it.

The following is the original map this model uses. It has been simplified in the model for faster computation.

Here is a video showing the process:

How does it work?

In the beginning, each commuter randomly selects a destination and then identifies the shortest path to it. The A-star algorithm is used to find the shortest path in terms of distance. The commuters move one node per tick. When they reach the destination, they stay there for one tick, then pick the next destination and move again.

The code for path selection can be explained simply as follows:

Each node has a variable "distance" that records the shortest distance to the origin. It is set to 9999 by default. The origin has distance 0.

While not all nodes have updated their neighbors:
     ask those nodes to update their neighbors
           if the distance through this node is shorter than the existing distance of a neighbor, update the neighbor, and mark the updated neighbor as "has not updated its neighbors"
           mark this node as "has updated its neighbors"

The loop stops when all nodes have updated their neighbors, in other words, no node can be updated with a shorter distance. The nodes of the shortest path are then put into a list for the commuter to follow.
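
The loop described above can be sketched in Python as below. It is an illustration, not the NetLogo code: the graph format and the name shortest_path are assumptions. The sketch follows the description literally, repeatedly relaxing neighbor distances until nothing changes, and does not add the distance-to-goal heuristic that A-star is usually described with.

```python
def shortest_path(graph, origin, destination):
    """Find a shortest path by repeated neighbor updates.

    `graph` maps each node to {neighbor: edge_length}; lengths are assumed
    to be positive. Returns the list of nodes from origin to destination,
    or None if the destination cannot be reached.
    """
    unreached = 9999  # default distance, as in the description above
    distance = {node: unreached for node in graph}
    previous = {}
    distance[origin] = 0
    pending = {origin}  # nodes that have not yet updated their neighbors
    while pending:
        node = pending.pop()
        for neighbor, length in graph[node].items():
            candidate = distance[node] + length
            if candidate < distance[neighbor]:
                distance[neighbor] = candidate
                previous[neighbor] = node
                pending.add(neighbor)  # its neighbors must be re-checked
    if distance[destination] == unreached:
        return None
    # Walk back from the destination to build the list the commuter follows.
    path = [destination]
    while path[-1] != origin:
        path.append(previous[path[-1]])
    path.reverse()
    return path


# Hypothetical walkway network with edge lengths.
campus = {
    "A": {"B": 1, "C": 5},
    "B": {"A": 1, "C": 1},
    "C": {"A": 5, "B": 1, "D": 1},
    "D": {"C": 1},
}
route = shortest_path(campus, "A", "D")
```

Only the origin starts in the pending set, because nodes still at the default distance cannot improve any neighbor.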

How is the map simplified?

For faster computation, this model simplifies the original data by reducing the number of nodes. To do that, the walkway data is loaded onto a small 20 x 20 grid in NetLogo, so many nodes fall on the same patch. Only one node is kept in each patch: duplicate nodes are removed, and their neighbors are connected to the node that is left.
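
The simplification step can be sketched in Python as follows. This is an illustration under assumptions, not the NetLogo code: nodes are given as coordinates already scaled to the grid, a node's patch is found by rounding its coordinates, and the first node seen in a patch is the one that is kept. The name simplify is hypothetical.

```python
def simplify(nodes, edges):
    """Keep one node per patch and reconnect edges to the kept nodes.

    `nodes` maps node id -> (x, y) in grid coordinates.
    `edges` is an iterable of (node_id, node_id) pairs.
    Returns (kept_nodes, new_edges).
    """
    keeper = {}       # patch -> id of the node kept for that patch
    merged_into = {}  # every node id -> id of the node that replaces it
    for node_id, (x, y) in nodes.items():
        patch = (round(x), round(y))
        merged_into[node_id] = keeper.setdefault(patch, node_id)
    new_edges = set()
    for a, b in edges:
        a, b = merged_into[a], merged_into[b]
        if a != b:  # drop edges that collapsed inside a single patch
            new_edges.add((min(a, b), max(a, b)))
    kept_nodes = {node_id: nodes[node_id] for node_id in keeper.values()}
    return kept_nodes, new_edges


# Hypothetical data: nodes 1 and 2 fall on the same patch.
kept, links = simplify({1: (0.1, 0.2), 2: (0.3, -0.1), 3: (2.0, 2.0)}, [(1, 2), (2, 3)])
```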

Also, links are created in this model to represent roads. This is so far the best way I can find to deal with road-related problems in NetLogo. However, because I create links by linking nodes one by one (see the code for more details), some roads are likely to be left out. I have not found a better way, so I also use a loop in setup to delete nodes that are not connected to the main network.

The code and data are here:

Wednesday, November 18, 2015

Segregation Model and Calculation of Moran's I

I recently created a segregation model that calculates Moran's I, a measure of spatial autocorrelation developed by Patrick Alfred Pierce Moran. The model uses a map of Washington DC, stored as vector data.

Each turtle here represents a household that is either blue or red. All turtles want to have neighbors of the same color. The simple rule is that they move to unoccupied patches until they are happy with their neighbors.

Here is the map I am using in this model.     

In the beginning, 10 to 80 turtles are created in each polygon, depending on the population data. Turtles are either blue or red. Red polygons have 60% red and 40% blue. Blue polygons have 60% blue and 40% red.

In each tick, turtles look at two kinds of neighborhoods to decide whether they are happy or not. One is the geometrically neighboring polygons; the other is the 8-connected neighbors. If, in either neighborhood, the share of neighbors with a different color exceeds the specified unhappiness threshold, the turtle moves to an unoccupied patch in a polygon that is unoccupied or has the same color as the turtle. The color of each polygon is decided by the majority of turtles living in it, and the colors are updated every tick.
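
The happiness check can be sketched in Python as below. It is an illustration with hypothetical names, not the NetLogo code. Two details are assumptions: a turtle with no neighbors counts as happy, and ties in a polygon's majority go to whichever color appears first.

```python
from collections import Counter


def too_many_different(own_color, neighbor_colors, max_different_share):
    """True if the share of differently colored neighbors exceeds the threshold."""
    if not neighbor_colors:
        return False  # assumption: no neighbors means nothing to be unhappy about
    different = sum(1 for c in neighbor_colors if c != own_color)
    return different / len(neighbor_colors) > max_different_share


def is_unhappy(own_color, polygon_neighbor_colors, patch_neighbor_colors,
               max_different_share):
    """A turtle is unhappy if either neighborhood has too many different neighbors."""
    return (
        too_many_different(own_color, polygon_neighbor_colors, max_different_share)
        or too_many_different(own_color, patch_neighbor_colors, max_different_share)
    )


def polygon_color(turtle_colors):
    """Color of a polygon = majority color of the turtles living in it."""
    counts = Counter(turtle_colors)
    return counts.most_common(1)[0][0] if counts else None
```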

Here is a video recording the simulation process.

How to identify polygon neighbors?

It is tricky to find the geometrical neighbors of each polygon, since NetLogo does not have this function. I used the Polygon Neighbors function in ArcGIS 10.2 to create a text file which maps each polygon to its neighbors. Then I deleted unnecessary information such as headers and had NetLogo read the file. Note that neighbors are polygons that share either a boundary (edge) or a corner (node).

How to export to ArcGIS?

There is an Export button to export the map to GIS. It exports the current map to finalmap.csv in the data folder. The file includes the color and percentage red for each polygon. To analyze it in ArcGIS, open the csv file in ArcGIS and export the data as a dbf file to replace the original DC.dbf file.

How to calculate Moran's I and verify it?

Moran’s I is a measure of spatial autocorrelation. Values range from −1 (indicating perfect dispersion) to +1 (perfect correlation). If the different items are randomly distributed, Moran’s I is close to 0. There is a slider to choose whether to do row standardization or not. Row standardization is a technique for adjusting the weights in a spatial weights matrix. When weights are row standardized, each weight is divided by its row sum. The row sum is the sum of weights for a feature’s neighbors.
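
For reference, here is a minimal Python sketch of the standard global Moran's I formula with optional row standardization. It is not the NetLogo code from the model: the function name morans_i and the list-of-lists weights matrix are choices made for this example. With values x, mean m, weights w and S0 the sum of all weights, I = (n / S0) * sum over i, j of w[i][j] * (x[i] - m) * (x[j] - m), divided by the sum over i of (x[i] - m)^2.

```python
def morans_i(values, weights, row_standardize=False):
    """Global Moran's I for `values` under a square spatial weights matrix.

    weights[i][j] is the weight between features i and j (0 if they are
    not neighbors). With row standardization, each weight is divided by
    its row sum before the statistic is computed.
    """
    n = len(values)
    if row_standardize:
        weights = [
            [w / sum(row) if sum(row) else 0.0 for w in row] for row in weights
        ]
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)  # sum of all weights
    cross = sum(
        weights[i][j] * deviations[i] * deviations[j]
        for i in range(n)
        for j in range(n)
    )
    variance_sum = sum(d * d for d in deviations)
    return (n / s0) * (cross / variance_sum)


# Hypothetical example: four polygons in a row, each adjacent to the next.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
clustered = morans_i([1, 1, 0, 0], chain)
dispersed = morans_i([1, 0, 1, 0], chain)
```

In this toy example the clustered pattern gives a positive value and the alternating pattern gives a negative one.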

I have verified the Moran's I calculated in my model against ArcGIS, and the values are the same. To verify it, open the final map in GIS and create a new numeric field equal to pcetred. Then use the tool "Spatial Autocorrelation (Morans I)" in ArcGIS. Choose the numeric field as the input, "CONTIGUITY_EDGES_CORNERS" as the conceptualization of spatial relationships, and whether to do row standardization. See below for the settings.

Compare the results.

Here is the code: