Geocoding in SAS and R
By Jonggyu Baek on Oct 17th, 2019.
Geocoding is the process of converting addresses (e.g., street address) to geographic coordinates (e.g., longitude and latitude) so that you can use to place markers on a map or position the map. Geocoding street address may not be an easy task for those who do not have any previous experience.
If you want to geocode a handful of subjects' (e.g., 5 subjects) addresses, you can simply go to the census website geocoder (https://geocoding.geo.census.gov/geocoder/locations/address?form), where you can type street address, city, state, zipcode. Then, it provides its longitude and latitude. If you want to know GEOID (https://www.census.gov/programs-surveys/geography/guidance/geo-identifiers.html) in addition to longitude and latitude, you can use another option (https://geocoding.geo.census.gov/geocoder/geographies/address?form) in geocoder in census where you can find its GEOID from the provided street address or longitude and latitude.
However, if you are a programmer to geocode thousands of subjects at once, you have a few options to use geocoding software (e.g., SAS, R, ArcGIS, etc).
1. Geocoding street address for longitude and latitude in SAS
The below is an example provided in SAS reference (http://support.sas.com/documentation/cdl/en/graphref/65389/HTML/default/viewer.htm#p1o71xg9evwp20n17x3n92qt1vgf.htm)
The below SAS code generates an example data set for subjects' street addresses. For most subjects, street addresses are quite accurate, but for some subjects their street addresses are not accurately provided.
data WORK.CUSTOMERS (label='Input data for street geocoding'); infile datalines dlm='#'; length address $ 32 city $ 24 state $ 2; input address /* House number and street name */ zip /* Customer ZIP code (numeric) */ city /* City name */ state; /* Two-character postal code */ datalines; 555 Junk Street # 99999 # Beverly Hills # CA 305 Cross Lake Drive # 27526 # Fuquay-Varina # NC 2525 Banks Road # 27603 # Raleigh # NC 2222 SAS Campus Drive # 27513 # Cary # NC 1150 SE Maynard Rd. # 27511 # Cary # NC 2117 Graceland # 27606 # Raleigh # NC 1313 Mockingbird Lane # # Delray # CC 133 Jade Circle # 27545 # Knightdale # NC 1005 W South St # 27603 # Raleigh # NC N Winds North Drive # 27591 # Wendell # NC 622 Roundabout Road # 27540 # Holly Springs # NC Johnson Family Rd # 27526 # # 822 Water Plant Road # # Zebulon # NC 502 Possum Track Road # 27614 # # NC 2590 Wolfpack Lane # 27604 # Raleigh # NC 125 Ferris Wheel Ct # 27513 # Cary # NC ; run;
Running the above returns the below data for 16 subjects with their street addresses and cities, states, and zipcodes for separate columns.
Now, let's say we want to know each subject's longitude (LONG) and latitude (LAT) based on their address info so that we can use LONG and LAT to link their census block IDs.
proc geocode /* Invoke geocoding procedure */ method=STREET /* Specify geocoding method */ data=WORK.CUSTOMERS /* Input data set of addresses */ out=WORK.GEOCODED /* Output data set with X/Y values */ lookupstreet=SASHELP.GEOEXM /* Primary street lookup data set */ attributevar=(Tract Block BlkGrp); /* Assign Census Tract to locations */ run;
proc print data=WORK.GEOCODED noobs; var address m_addr m_zip m_obs _matched_ _status_ _notes_ _score_ x y Tract Block BlkGrp; run; quit;
The below is the output from the above proc geocode procedure. In the output data, "Geocoded", there are a few variables generated.
- M_ADDR: contains the street address for a street match. The M_ADDR value is the match value from the lookup data.
- M_ZIP: contains the ZIP code value for a ZIP level match from the lookup data.
- M_OBS: contains the row number for the matching observation in the street matching lookup data set.
- _MATCHED_: indicates how the coordinates were found.
- _STATUS_: indicates the type of match that was found.
- _NOTES_: contains tokens that provide additional information about the match. For more information, see Street Geocoding Note Values.
- _SCORE_: Contains a numeric value indicating the relative accuracy of the match.
- X and Y are longitude and latitude, respectively.
For more information, http://support.sas.com/documentation/cdl/en/graphref/65389/HTML/default/viewer.htm#p0volumqexbvwcn1kcopke78nqc7.htm
2. Geocoding street address for longitude and latitude in R
R is a programming language and open source (free) software environment for statistical computing and graphics supported by the R Foundation for Statistical Computing. There are many R libraries available for a wide variety of statistical and graphical techniques. Here, we will learn how to geocode street addresses in an R package, "ggmap
".
Briefly, ggmap
enables us to visualize spatial data and modes on top of static maps from various online sources (e.g., Google Maps and Stamen Maps). That is, it basically uses application programming interface (API) between Google Maps and R for repeated procedures.
Recently, Google changed their policy of API use and so geocoding street addresses in some extent may not be free. Anyway, to use any functions in ggmap, you first need to install ggmap in R and register your API key in R.
Assuming that a recent version of R is already installed, you can install the ggmap R package.
install.packages("ggmap")
library(ggmap)
How can you get your API Key in google?
- Go to google cloud platform console (https://console.cloud.google.com)
- Login your google account
- Click tap on the top left
- Go to APIs & Services and click library
- Search for Geocoding API and get started to enable it
Note that it says that for most of users get $200 free monthly credit usage so that using geocoding API is enough to support their needs. You can refer their pricing for more details. In terms of geocoding pricing, it is $5 per 1000 requests. You may think you can freely geocode 40,000 addresses monthly.
- Once you enable Geocoding API, you can create your credentials (i.e., API key) for its use
- Finally, you need to create your billing account associated with that API key.
Once you set up your Google Cloud Platform API, you can now use ggmap in R. I recommend that you can save your key in somewhere safe in your drive and import it in R.
### To use "geocode", need to register google api key in the package #
key_dir = "H:\\StatMethods\\Geocoding\\key.txt"
key = read.table(key_dir)$V1[1]
### this sets your google map for this session
register_google(key = key)
This is an example for geocoding my address where I provided my address info as exactly as possible.
address = "106 Holman St., Shrewsbury, MA"
geo = geocode(address)
c(geo$lon, geo$lat)
#[1] -71.70786 42.30394
This is an example for geocoding my address without spacing between letters. As you notice, Google Maps automatically spaced between letters.
address_wo_space = "106HolmanSt.ShrewsburyMA"
geo2 = geocode(address_wo_space, output = "more")
c(geo2$lon, geo2$lat)
#[1] -71.70786 42.30394
geo2
# A tibble: 1 x 9
# lon lat type loctype address north south east west
# <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
# -71.7 42.3 premise rooftop 106 holman st, shrewsbury, ma 01545, usa 42.3 42.3 -71.7 -71.7
This is an example for geocoding my address with a typo; "Halmam" instead of "Holman".
address_typo = "106 Halmam St., Shrewsbury, MA"
geo3 = geocode(address_typo, output = "more")
geo3
# A tibble: > geo3
# A tibble: 1 x 9
lon lat type loctype address north south east west
<dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 -71.7 42.3 premise rooftop 106 holman st, shrewsbury, ma 01545, usa 42.3 42.3 -71.7 -71.7
c(geo3$lon, geo3$lat)
[1] -71.70786 42.30394
Now, we know how to geocode in R using ggmap. What about linking information about its GEOID? To obtain GEOID for linking census tract, or census block, an R package, "censusr" is useful. This censusr uses API between census and R so that we don't need to type its address or longitude or latitude repeatedly to obtain GEOID.
install.packages("censusr")
library(censusr)
If you have a list of accurate street addresses, you don't need to use ggmap for geocoding. You can directly use censusr and geocode it. For instance,
street = "106 Holman St."
city = "Shrewsbury"
state = "MA"
call_geolocator(street=street, city=city, state=state)
[1] "250277394002004"
If you want to get GEOID from longitude and latitude, you may use
call_geolocator_latlon(lat=42.30394, lon=-71.70786)
[1] "250277394002004"
Now, we want to geocode street addresses in the same example data set we generated in SAS. The SAS example data set was exported from SAS and imported in R.
dir_dat = "H:\\StatMethods\\Geocoding\\SASDat\\customers.csv"
dat = read.csv(dir_dat, header = TRUE)
dat
address city state zip
1 555 Junk Street Beverly Hills CA 99999
2 305 Cross Lake Drive Fuquay-Varina NC 27526
3 2525 Banks Road Raleigh NC 27603
4 2222 SAS Campus Drive Cary NC 27513
5 1150 SE Maynard Rd. Cary NC 27511
6 2117 Graceland Raleigh NC 27606
7 1313 Mockingbird Lane Delray CC NA
8 133 Jade Circle Knightdale NC 27545
9 1005 W South St Raleigh NC 27603
10 N Winds North Drive Wendell NC 27591
11 622 Roundabout Road Holly Springs NC 27540
12 Johnson Family Rd 27526
13 822 Water Plant Road Zebulon NC NA
14 502 Possum Track Road NC 27614
15 2590 Wolfpack Lane Raleigh NC 27604
16 125 Ferris Wheel Ct Cary NC 27513
The below R code is to get GEOID using the function call_geolocator
in censusr R package, but it may not work well unless you have all exactly correct addresses.
n = nrow(dat)
dat$GEOID_cen = NULL
for(i in 1:n) dat$GEOID_cen[i] = call_geolocator(street = paste(dat$address[i]), city = paste(dat$city[i]), state = paste(dat$state[i]))
Address (555 Junk Street Beverly Hills CA) returned no address matches. An NA was returned.
Address (2222 SAS Campus Drive Cary NC) returned no address matches. An NA was returned.
Address (2117 Graceland Raleigh NC) returned no address matches. An NA was returned.
Address (1313 Mockingbird Lane Delray CC) returned no address matches. An NA was returned.
Address (N Winds North Drive Wendell NC) returned no address matches. An NA was returned.
Address (622 Roundabout Road Holly Springs NC) returned no address matches. An NA was returned.
Error in call_geolocator(street = paste(dat$address[i]), city = paste(dat$city[i]), :
Bad Request (HTTP 400).
i
[1] 12 ## This loop stopped at i=12
dat
address city state zip GEOID_cen
1 555 Junk Street Beverly Hills CA 99999 <NA>
2 305 Cross Lake Drive Fuquay-Varina NC 27526 371830531091037
3 2525 Banks Road Raleigh NC 27603 371830531101019
4 2222 SAS Campus Drive Cary NC 27513 <NA>
5 1150 SE Maynard Rd. Cary NC 27511 371830535171013
6 2117 Graceland Raleigh NC 27606 <NA>
7 1313 Mockingbird Lane Delray CC NA <NA>
8 133 Jade Circle Knightdale NC 27545 371830541121010
9 1005 W South St Raleigh NC 27603 371830510002017
10 N Winds North Drive Wendell NC 27591 <NA>
11 622 Roundabout Road Holly Springs NC 27540 <NA>
12 Johnson Family Rd 27526 <NA>
13 822 Water Plant Road Zebulon NC NA <NA>
14 502 Possum Track Road NC 27614 <NA>
15 2590 Wolfpack Lane Raleigh NC 27604 <NA>
16 125 Ferris Wheel Ct Cary NC 27513 <NA>
Another way we get their GEOID is to use the function geocode in ggmap to obtain longitude and latitude. Then, use the resulted longitude and latitude to obtain its corresponding GEOID.
str_add = paste(dat$address, dat$city, dat$state, dat$zip, sep=",") ## to put characters together ##
geo_customers = geocode(str_add, output="more")
geo_customers
# A tibble: 16 x 9
lon lat type loctype address north south east west
<dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 -118. 34.1 locality approximate beverly hills, ca, usa 34.1 34.1 -118. -118.
2 -78.8 35.6 premise rooftop 305 cross lake dr, fuquay-varina, nc 27526, usa 35.6 35.6 -78.8 -78.8
3 -78.7 35.6 street_address range_interpolated 2525 banks rd, raleigh, nc 27603, usa 35.6 35.6 -78.7 -78.7
4 -78.8 35.8 route geometric_center sas campus dr, cary, nc 27513, usa 35.8 35.8 -78.7 -78.8
5 -78.8 35.8 premise rooftop 1150 se maynard rd, cary, nc 27511, usa 35.8 35.8 -78.8 -78.8
6 -78.7 35.8 route geometric_center graceland ct, raleigh, nc 27606, usa 35.8 35.8 -78.7 -78.7
7 -80.1 26.5 route geometric_center mockingbird ln, delray beach, fl 33445, usa 26.5 26.5 -80.1 -80.1
8 -78.5 35.8 street_address range_interpolated 133 jade cir, knightdale, nc 27545, usa 35.8 35.8 -78.5 -78.5
9 -78.7 35.8 premise rooftop 1005 w south st, raleigh, nc 27603, usa 35.8 35.8 -78.7 -78.7
10 -78.4 35.8 route geometric_center northwinds n dr, wendell, nc 27591, usa 35.8 35.8 -78.4 -78.4
11 -78.8 35.7 locality approximate holly springs, nc, usa 35.7 35.6 -78.8 -78.9
12 -78.8 35.6 postal_code approximate fuquay-varina, nc 27526, usa 36.0 35.4 -78.5 -79.0
13 -78.3 35.8 street_address range_interpolated 822 water plant rd, zebulon, nc 27597, usa 35.8 35.8 -78.3 -78.3
14 -78.6 36.0 route geometric_center possum track rd, bartons creek, nc 27614, usa 36.0 35.9 -78.6 -78.6
15 -78.6 35.8 street_address range_interpolated 2590 wolfpack ln, raleigh, nc 27609, usa 35.8 35.8 -78.6 -78.6
16 -78.8 35.8 street_address range_interpolated 125 ferris wheel ct, cary, nc 27513, usa 35.8 35.8 -78.8 -78.8
Google API fixes an address with typo as correctly as possible, but it also provides how accurately the current address is geocoded with 4 categories.
- “ROOFTOP” indicates that the returned result is a precise geocode for which we have location information accurate down to street address precision.
- “RANGE_INTERPOLATED” indicates that the returned result reflects an approximation (usually on a road) interpolated between two precise points (such as intersections). Interpolated results are generally returned when rooftop geocodes are unavailable for a street address.
- “GEOMETRIC_CENTER” indicates that the returned result is the geometric center of a result such as a polyline (for example, a street) or polygon (region).
- “APPROXIMATE” indicates that the returned result is approximate. localities and admin areas (i.e., zipcode) have this value.
Once we have all the longitudes and latitudes from all the addresses, we use censusr api
### Copying lat and long to the data set ###
dat$lat = geo_customers$lat
dat$lon = geo_customers$lon
### for loop to obtain individual GEOID ###
dat$GEOID= NULL
for(i in 1:n)
dat$GEOID[i] = call_geolocator_latlon(dat$lat[i], dat$lon[i])
dat$BLOCK = substring(dat$GEOID, 12, 15)
dat$BlkGrp = substring(dat$GEOID, 12, 12)
dat
address city state zip lat lon GEOID BLOCK BlkGrp
1 555 Junk Street Beverly Hills CA 99999 34.07362 -118.40036 060377008011021 1021 1
2 305 Cross Lake Drive Fuquay-Varina NC 27526 35.60470 -78.76358 371830531091037 1037 1
3 2525 Banks Road Raleigh NC 27603 35.63626 -78.67413 371830531101019 1019 1
4 2222 SAS Campus Drive Cary NC 27513 35.82424 -78.75530 371830535212021 2021 2
5 1150 SE Maynard Rd. Cary NC 27511 35.78415 -78.76352 371830535171013 1013 1
6 2117 Graceland Raleigh NC 27606 35.78908 -78.71075 371830524071005 1005 1
7 1313 Mockingbird Lane Delray CC NA 26.47410 -80.11788 120990066041035 1035 1
8 133 Jade Circle Knightdale NC 27545 35.81396 -78.46105 371830541121010 1010 1
9 1005 W South St Raleigh NC 27603 35.77327 -78.65398 371830510002017 2017 2
10 N Winds North Drive Wendell NC 27591 35.80038 -78.35514 371830544041072 1072 1
11 622 Roundabout Road Holly Springs NC 27540 35.65127 -78.83362 371830532041005 1005 1
12 Johnson Family Rd 27526 35.58434 -78.80003 371830531073000 3000 3
13 822 Water Plant Road Zebulon NC NA 35.83091 -78.34040 371830543023004 3004 3
14 502 Possum Track Road NC 27614 35.95331 -78.61212 371830538051000 1000 1
15 2590 Wolfpack Lane Raleigh NC 27604 35.82415 -78.61026 371830527012027 2027 2
16 125 Ferris Wheel Ct Cary NC 27513 35.79505 -78.79908 371830535242000 2000 2
Now, you can compare the geocoded results in R with the one from SAS