Working Paper
Toward an Open Data Bias Assessment Tool
Ajjit Narayanan, Graham MacDonald

Data are a critical resource for government decisionmaking, and in recent years, local governments, in a bid for transparency, community engagement, and innovation, have released many municipal datasets on publicly accessible open data portals. Advocates, reporters, and others have voiced concerns about the bias of algorithms used to guide public decisions and the data that power them.

Although significant progress is being made in developing tools to assess algorithmic bias and improve transparency, we could not find any standardized tools for assessing bias in open data itself. In other words, how can policymakers, analysts, and advocates systematically measure the level of bias in the data that power city decisionmaking, whether or not an algorithm is used?

To fill this gap, we present a prototype of an automated bias assessment tool for geographic data. This new tool will allow city officials, concerned residents, and other stakeholders to quickly assess the bias and representativeness of their data. The tool allows users to upload a file with latitude and longitude coordinates and receive simple metrics of spatial and demographic bias across their city.

The tool is built on geographic and demographic data from the Census and assumes that the population distribution in a city represents the “ground truth” of the underlying distribution in the uploaded data. To provide an illustrative example of the tool’s use and output, we test our bias assessment on three datasets—bikeshare station locations, 311 service request locations, and Low Income Housing Tax Credit (LIHTC) building locations—across a few hand-selected example cities.
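As a rough illustration of the kind of demographic bias metric such a tool could compute, the sketch below compares the average demographic composition of the neighborhoods where uploaded points fall against the city's population-weighted average. The tract names, demographic shares, and point counts are entirely made up for demonstration, and we assume points have already been assigned to census tracts (e.g., via a spatial join of the uploaded coordinates against tract boundaries); this is not the paper's actual implementation.

```python
# Hypothetical example: each tract has a population, a demographic share
# (here, share of non-Hispanic Black residents), and a count of uploaded
# points (e.g., bikeshare stations) that fall inside it.
tracts = {
    # tract_id: (population, share_black, points_in_tract)
    "tract_a": (4000, 0.10, 12),
    "tract_b": (6000, 0.60, 3),
    "tract_c": (5000, 0.35, 5),
}

def weighted_share(weight_index):
    """Average demographic share across tracts, weighted by the chosen column."""
    total = sum(t[weight_index] for t in tracts.values())
    return sum(t[1] * t[weight_index] for t in tracts.values()) / total

city_share = weighted_share(0)  # population-weighted: the "ground truth"
data_share = weighted_share(2)  # point-weighted: what the uploaded data imply

# Negative values mean the group's neighborhoods are underrepresented
# in the uploaded data relative to the city as a whole.
bias = data_share - city_share
```

With these made-up numbers, the points cluster in the low-share tract, so `bias` comes out negative: neighborhoods with more Black residents are underserved by the uploaded point locations relative to the citywide population distribution.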

Across the small sample of cities we studied, we consistently find that bikeshare stations are concentrated in downtown areas, overserve neighborhoods with high numbers of non-Hispanic white, non-Hispanic Asian, and college-educated residents, and underserve neighborhoods with large numbers of non-Hispanic Black, Hispanic, unemployed, and poor residents. The results from our analysis of bias in 311 service requests and LIHTC building location data are much more mixed across cities. Of particular note: 311 service requests from Boston and DC overrepresent white and college-educated neighborhoods while 311 service requests from Philadelphia overrepresent non-Hispanic Black and poorer neighborhoods. LIHTC location data from Raleigh demonstrate that buildings tend to be in neighborhoods with higher shares of Black and poor residents and lower shares of white and college-educated residents relative to the city average, in contrast to the other cities we studied, which tended to have much smaller differences.

Research Areas: Neighborhoods, cities, and metros; Social safety net; Race and equity
Tags: Racial barriers to accessing the safety net; Structural racism in research, data, and technology
Policy Centers: Statistical Methods Group