Batch Geocoding with R and Google Maps

Turn address strings into GPS coordinates for free with Google

I recently wanted to geocode a large number of addresses (circa 60k) in Ireland as part of a visualisation of the Irish property market. Geocoding can be achieved in R using the geocode() function from the ggmap library, which uses Google’s Geocoding API to turn address text into latitude and longitude pairs.

There is a usage limit on the geocoding service for free users of 2,500 addresses per IP address per day. This hard limit cannot be overcome without using a new IP address or paying for a business account. To ease the pain of restarting an R process every 2,500 addresses (that is, every day), I’ve built a script that geocodes addresses up to the API query limit every day, with a few handy features:

  • Once it hits the geocoding limit, it patiently waits for Google’s servers to let it proceed.
  • The script pings Google once per hour during the down time to start geocoding again as soon as possible.
  • A temporary file containing the current data state is maintained during the process. Should the script be interrupted, it will start again from the place it left off once any problems with the data or connection have been rectified.
Map generated with the Google Static Maps API (via a direct URL).
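
To put the daily limit in perspective, here is a rough back-of-the-envelope estimate using only the figures quoted above (about 60k addresses and 2,500 lookups per day). It assumes the full daily quota is used every day with no failures or retries.

```r
# Rough estimate of how many days the free tier needs for the full dataset.
n_addresses <- 60000   # approximate dataset size mentioned above
daily_limit <- 2500    # free-tier limit per IP address per day
days_needed <- ceiling(n_addresses / daily_limit)
print(days_needed)
```

This is why an unattended script that waits and resumes by itself is more practical than restarting R by hand each day.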

The R script assumes that you are starting with a database contained in a single *.csv file, “input.csv”, with the addresses in the “address” column. Feel free to use or modify it to suit your own needs!
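
As a minimal sketch of the expected input, the snippet below writes a tiny file with a single "address" column to a temporary folder and reads it back. The second sample row and the temporary file location are made up for illustration. Note that R column names are case-sensitive, so the column name used in the script has to match the CSV header exactly.

```r
# Write a tiny example input file with a single "address" column.
# The rows are illustrative sample data, not data from the original project.
example_file <- file.path(tempdir(), "input.csv")
example_data <- data.frame(
  address = c("9 Abbey Street Lower, Dublin 1", "1 Grand Parade, Cork"),
  stringsAsFactors = FALSE
)
write.csv(example_data, example_file, row.names = FALSE)

# Read it back and check the column name before running any geocoding.
data <- read.csv(example_file, stringsAsFactors = FALSE)
print(names(data))

# Append the country to each address, as the script does.
print(paste0(data$address, ", Ireland"))
```

If your own file uses a different separator (for example ";"), pass it to read.csv() through the sep argument; the file path itself always goes in as a character string.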

Comments are included where possible:

# Geocoding script for large list of addresses. 

# Shane Lynn 10/10/2013

#load up the ggmap library
library(ggmap)
# get the input data
infile <- "input"
data <- read.csv(paste0('./', infile, '.csv'))

# get the address list, and append "Ireland" to the end to increase accuracy 
# (change or remove this if your addresses already include a country etc.)
# note: column names are case-sensitive, so this must match the "address" header in the csv
addresses = data$address
addresses = paste0(addresses, ", Ireland")

#define a function that will process Google's server responses for us.
getGeoDetails <- function(address){   
   #use the geocode function to query Google's servers
   geo_reply = geocode(address, output='all', messaging=TRUE, override_limit=TRUE)
   #now extract the bits that we need from the returned list
   answer <- data.frame(lat=NA, long=NA, accuracy=NA, formatted_address=NA, address_type=NA, status=NA)
   #if the request itself failed (e.g. no connection or a bad request), geocode() can
   #return a non-list value, so return NAs with a local marker (not a Google status code)
   if (!is.list(geo_reply)){
       answer$status <- "REQUEST_FAILED"
       return(answer)
   }
   answer$status <- geo_reply$status

   #if we are over the query limit - want to pause for an hour
   while(geo_reply$status == "OVER_QUERY_LIMIT"){
       print("OVER QUERY LIMIT - Pausing for 1 hour at:") 
       time <- Sys.time()
       print(as.character(time))
       Sys.sleep(60*60)
       geo_reply = geocode(address, output='all', messaging=TRUE, override_limit=TRUE)
       #the retry can fail outright as well
       if (!is.list(geo_reply)){
           answer$status <- "REQUEST_FAILED"
           return(answer)
       }
       answer$status <- geo_reply$status
   }

   #return NAs if we didn't get a match:
   if (geo_reply$status != "OK"){
       return(answer)
   }   
   #else, extract what we need from the Google server reply into a dataframe:
   answer$lat <- geo_reply$results[[1]]$geometry$location$lat
   answer$long <- geo_reply$results[[1]]$geometry$location$lng   
   if (length(geo_reply$results[[1]]$types) > 0){
       answer$accuracy <- geo_reply$results[[1]]$types[[1]]
   }
   answer$address_type <- paste(geo_reply$results[[1]]$types, collapse=',')
   answer$formatted_address <- geo_reply$results[[1]]$formatted_address

   return(answer)
}

#initialise a dataframe to hold the results
geocoded <- data.frame()
# find out where to start in the address list (if the script was interrupted before):
startindex <- 1
#if a temp file exists - load it up and count the rows!
tempfilename <- paste0(infile, '_temp_geocoded.rds')
if (file.exists(tempfilename)){
       print("Found temp file - resuming from index:")
       geocoded <- readRDS(tempfilename)
       #rows 1..nrow(geocoded) are already saved, so resume at the next one
       startindex <- nrow(geocoded) + 1
       print(startindex)
}

# Start the geocoding process - address by address. geocode() function takes care of query speed limit.
# seq_len() gives an empty sequence when every address is already done, so the loop is skipped.
for (ii in seq_len(length(addresses) - startindex + 1) + startindex - 1){
   print(paste("Working on index", ii, "of", length(addresses)))
   #query the google geocoder - this will pause here if we are over the limit.
   result = getGeoDetails(addresses[ii]) 
   print(result$status)     
   result$index <- ii
   #append the answer to the results file.
   geocoded <- rbind(geocoded, result)
   #save temporary results as we are going along
   saveRDS(geocoded, tempfilename)
}

#now we add the latitude and longitude to the main data
data$lat <- geocoded$lat
data$long <- geocoded$long
data$accuracy <- geocoded$accuracy

#finally write it all to the output files
saveRDS(data, paste0("../data/", infile ,"_geocoded.rds"))
write.table(data, file=paste0("../data/", infile ,"_geocoded.csv"), sep=",", row.names=FALSE)

Let me know if you find a use for the script, or if you have any suggestions for improvements.

Please be aware that it is against the Google Geocoding API terms of service to geocode addresses without displaying them on a Google map. Please see the terms of service for more details on usage restrictions.

102 Comments

Thanks for sharing your code. It was very helpful.

Hi, do you happen to know how I should change the script to handle records whose addresses have no house number, just a street name? Right now it only geocodes addresses with numbers; when a record has just a street name, I get NA.

Thanks for the nice write up. I usually use Bing for those purposes as it has a limit of 25,000 requests. See http://stackoverflow.com/questions/17361909/determing-the-distance-between-two-zip-codes-alternatives-to-mapdist. Even though this example is with routes, the package “taRifx.geo” also has a function for geocoding by using Bing.

This is awesome! Thanks for posting. I can use this for my project. 🙂

I’ve been using this code and would love to discuss it a little more in-depth. I recently used it to code ~6100 records. Unfortunately, I couldn’t get the geocoding process to start up again once I hit the 2500 limit. I also found one bug – when the code stores the index to restart the geocoding process, it repeats the index. (Example: code runs through 1180 and pauses. I re-run the code, and it picks up at 1180, repeating the data pull for that index.) Can we discuss sometime soon? Thanks!

Hi again – it said “over query limit” and timed out for another hour (after I had already waited an hour). Any suggestions on a fix, or is this another issue do you think? Thanks!

Hi Shane
Nice code – am hitting slight hitch. At the R prompt if I type
> str1 = "9 Abbey Street Lower Dublin 1"
> geocode(str1)
I get the ‘correct’ lon lat -6.249584 53.35223
However, if I use this script pointing to the same address in input.csv it gives me lat lon for somewhere in the midlands. Any suggestions?
Thanks

Dear Shane,
Many thanks for your R code. It works perfectly. I used it for geocoding “Asking Prices” retrieved from the web for the real estate market in Austria (PhD). I use a lookup table to avoid geocoding the same address twice.
Many thanks again, Gerhard
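
The lookup-table idea mentioned above can be sketched roughly as follows. This is only an illustration of the technique, not Gerhard's actual code: fake_geocode is a hypothetical offline stand-in that returns placeholder coordinates, and the addresses are invented. In real use the stand-in would be replaced by a call such as getGeoDetails().

```r
# Hypothetical stand-in for a real geocoder so the sketch runs offline.
# It returns placeholder coordinates and counts how often it is called.
calls <- 0
fake_geocode <- function(address) {
  calls <<- calls + 1
  data.frame(address = address, lat = nchar(address), long = -nchar(address),
             stringsAsFactors = FALSE)
}

# Invented sample addresses containing duplicates.
addresses <- c("1 Main St", "2 High St", "1 Main St", "2 High St", "3 Low Rd")

# Geocode each distinct address once to build the lookup table.
unique_addresses <- unique(addresses)
lookup <- do.call(rbind, lapply(unique_addresses, fake_geocode))

# Map the results back onto the full list, duplicates included.
rows <- match(addresses, lookup$address)
result <- data.frame(address = addresses,
                     lat = lookup$lat[rows],
                     long = lookup$long[rows],
                     stringsAsFactors = FALSE)
print(result)
print(calls)  # number of geocoder calls actually made
```

With a daily quota, every duplicate avoided is a lookup saved, so this can shorten the total run noticeably on datasets with repeated addresses.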

Thanks for posting, very useful!

On line 77 – you have data$long <- geocoded$lat when it should say geocoded$long –

Easy fix!! Thanks for the great code!

I have the same issue as rowan… Did you figure it out?

Hi,
I’m a fresh student of R, so I’m discovering how everything works. I have a silly doubt.
Here:
# get the input data:
infile <- "input"
data <- read.csv(paste0('./', infile, '.csv'))
I should input my data normally as:
infile<- read.csv("C:/Users/University/Project/Local.csv", sep=";")
and then run the rest normally? I'm doing that and I receive this warning "Error in file(file, "rt") : invalid 'description' argument"

Thank you for your help!

Hi Shane, firstly thanks for sharing your code.
I’m having problems once the query limit is exceeded. The function gets stuck in an infinite loop in the “while”, because even after 24 hours, when it calls geocode() again for the same iteration of the ‘for’ (the same address), the status doesn’t change. I really don’t know why, but I think it’s because geocode() is “using stored information”. Do you have any suggestions?

Nice code, thanks.
I had the same issue as Rowan: restarting and ending up with a larger vector that couldn’t be appended to the original data. My workaround was to delete the temp file before restarting the process. The geocoded address is stored with the workspace, so it isn’t requested again from Google’s server (assuming you save the workspace).
A slightly better solution is to comment out the temp file save and read.
An even better solution would be to delete the last row of the temp file and start from there?
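
One way to avoid the duplicated row on restart is to resume at the row after the last saved one. The sketch below shows that checkpoint-and-resume pattern offline; process_one, n_total, run and the temporary file are hypothetical names used only for this illustration, and process_one stands in for the real geocoding call.

```r
# Hypothetical stand-in for the geocoding step so the sketch runs offline.
process_one <- function(i) data.frame(index = i, status = "OK")

n_total <- 5
temp_file <- tempfile(fileext = ".rds")

run <- function(stop_after = Inf) {
  done <- data.frame()
  if (file.exists(temp_file)) done <- readRDS(temp_file)
  start <- nrow(done) + 1              # first row that has NOT been saved yet
  if (start > n_total) return(done)    # nothing left to do
  for (i in start:n_total) {
    if (i > stop_after) break          # simulate an interruption
    done <- rbind(done, process_one(i))
    saveRDS(done, temp_file)           # checkpoint after every row
  }
  done
}

partial <- run(stop_after = 3)   # interrupted after three rows
final <- run()                   # picks up at row 4, with no repeats
print(final$index)
```

Because the saved file always holds exactly the completed rows, the result has one row per address and can be attached to the original data without a length mismatch.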

Just to add a small piece of related info on what Mark and Rowan wondered:
Personally, I didn’t find a function to force a “fresh” geocode without deleting the workspace. So, Q: “What if I don’t want to delete my workspace (or create a new one) but still want to force a fresh geocode?”
A: ggmap stores locations in .GeocodedInformation in the workspace, and
rm(.GeocodedInformation) can be used to remove them. After that, a fresh geocode can be initiated.

Hey Shane, thank you for the great code.
I get the error message “Error in geo_reply$status : $ operator is invalid for atomic vectors” when I get to running the code with my own data. Has this come up before?
Thank you!

Hey Hi Shane,
Thank you so much for your code.
I’m doing a project and wanted to get the Lat/Longs for some streets to make my project more meaningful.
I tried the code but I keep getting an error, is.character(location), even though my address(ii) value is all string. I can’t fathom the cause, though I suspect the data. I’d appreciate it if I could share a screenshot with you for some suggestions.

Hey, just tried something which seems to run but the output does lack the Lat/Long. Line 72 modified to :
result = getGeoDetails(as.character(substitute(addresses[ii]))). Sorry for the multiple messages.

I have a not-so-smart Python script that sort of works. If you want to have a peek, let me know:

Shane,

Thanks for the code. I have been using it quite a bit recently. I do have one question though. I have quite a few addresses to geocode and I’m wondering if you know how to set up the code so that I can go over the 2500 limit per day. I have signed up with Google to do this and created a project but I don’t know how to implement it in the code I have (i.e. your code). I don’t mind paying a few dollars to complete my list of addresses but I just can’t see how to do it. Any ideas?

Many thanks,
Barry

Hi Shane,

Thanks for sharing the code with us. Works like a charm, even though I’m missing the possibility of adding an API key within the ggmap package. But this has nothing to do with your code. In case someone wants to transform German addresses, one small hint:

You need to encode into UTF-8 before passing it into the function; just add the function enc2utf8 in line 66 and it works like a charm.

line 66: result = getGeoDetails(enc2utf8(addresses[ii]))
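
To illustrate what enc2utf8() does to an address with an umlaut, here is a small offline sketch. The string is built from raw bytes so the example does not depend on how the script file itself is encoded; the word used ("München") is just a sample value.

```r
# Build "München" as Latin-1 bytes; 0xfc is the Latin-1 code for "ü".
street <- rawToChar(as.raw(c(0x4d, 0xfc, 0x6e, 0x63, 0x68, 0x65, 0x6e)))
Encoding(street) <- "latin1"

# Convert to UTF-8 before handing the string to a web API.
street_utf8 <- enc2utf8(street)
print(Encoding(street_utf8))
print(charToRaw(street_utf8))  # "ü" is now two bytes: c3 bc
```

Whether the conversion is needed depends on the encoding of the input file and the platform; on a system that already reads the data as UTF-8, enc2utf8() leaves the strings unchanged.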

Hi, Thanks for this code and it helps a lot with my project, I actually have an error when running the script,

“Working on index 93 of 953”
contacting http://maps.googleapis.com/maps/api/geocode/json?address=T8W5J8%20%20%20%20,%20CANADA&sensor=false….Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=T8W5J8%20%20%20%20,%20CANADA&sensor=false
Error in geo_reply$status : $ operator is invalid for atomic vectors
In addition: Warning messages:
1: In readLines(connect, warn = FALSE) :
InternetOpenUrl failed: ‘A connection with the server could not be established’
2: In geocode(address, output = “all”, messaging = TRUE, override_limit = TRUE) :
geocoding failed for “T8W5J8 , CANADA”.
if accompanied by 500 Internal Server Error with using dsk, try google.

Do you know how can I solve this problem? Thanks for your help!!!!!
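
The "$ operator is invalid for atomic vectors" message suggests that the reply was not a list at that point, which fits the failed connection reported in the warnings. A minimal sketch of a guard is shown below; safe_status is a hypothetical helper, "REQUEST_FAILED" is a made-up local marker rather than a Google status code, and the two replies are hand-built stand-ins so the example runs offline.

```r
# Hand-built stand-ins: a failed request (atomic NA) and a successful reply.
failed_reply <- NA
ok_reply <- list(status = "OK")

# Hypothetical helper: read the status only when the reply is a list.
safe_status <- function(reply) {
  if (is.list(reply) && !is.null(reply$status)) reply$status else "REQUEST_FAILED"
}

print(safe_status(failed_reply))
print(safe_status(ok_reply))

# For comparison, the unguarded access reproduces the error message.
msg <- tryCatch(failed_reply$status, error = function(e) conditionMessage(e))
print(msg)
```

With a guard like this inside getGeoDetails(), a single dropped connection yields a row of NAs instead of stopping the whole run.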

Hi Shane,

Did you have a chance to finalize the python equivalent code for this?

Thank you for the code! What changes I should make to your code, if I don’t want to bypass the API limit, and allow R to collect the geocodes for several days? Thanks!!!

Thank you for the code, Shane. I am a novice in R, and hence, my question below may appear less intelligent; I apologize in advance.

I am trying to convert 8,615 U.S. addresses to their latitude and longitude. Interestingly, at index 2694, Google Maps API encounters a bad request and reports the error mentioned below.

Question: I expected “address” dataframe to have been populated with the correct values till index 2693, but the dataframes does not seem to have been created. Can you please tell me where I have gone wrong? Thanks, Shane.

[1] “Working on index 2694 of 8615”
contacting http://maps.googleapis.com/maps/api/geocode/json?address=#2050N%201620%2026th%20Street%20Santa%20Monica%20CA%2090404,%20USA&sensor=false…query max exceeded, see ?geocode. current total = 2559
.Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=#2050N%201620%2026th%20Street%20Santa%20Monica%20CA%2090404,%20USA&sensor=false
Error in geo_reply$status : $ operator is invalid for atomic vectors
In addition: Warning messages:
1: In readLines(connect, warn = FALSE) :
cannot open: HTTP status was ‘400 Bad Request’
2: In geocode(address, output = “all”, messaging = TRUE, override_limit = TRUE) :
geocoding failed for “#2050N 1620 26th Street Santa Monica CA 90404, USA”.
if accompanied by 500 Internal Server Error with using dsk, try google.

Shane: please ignore my last comment. I could export “geocoded” dataframe and obtain the latitude and longitude of the 2693 addresses.
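
A likely cause of the "400 Bad Request" above is the "#" at the start of the address: in a URL, an unencoded "#" starts the fragment part, so the request that reaches the server is cut short. The sketch below shows two ways to deal with the character. Dropping it assumes the unit marker is not needed for a match, and whether geocode() accepts an already percent-encoded string has not been verified here, so the second option is shown only to illustrate the encoded form.

```r
address <- "#2050N 1620 26th Street Santa Monica CA 90404, USA"

# Option 1 (assumption: the "#" is not needed for a match): drop it.
cleaned <- trimws(gsub("#", "", address, fixed = TRUE))
print(cleaned)

# Option 2: percent-encode reserved characters, which turns "#" into "%23".
encoded <- utils::URLencode(address, reserved = TRUE)
print(encoded)
```

Cleaning addresses like this before the main loop also makes it easier to spot other problem rows, such as empty strings.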

Hey Shane!

Im using your awesome code but for some reason I’m having this Error back from R:
I made sure that ggmap is well installed, but nothing.

[1] “Working on index 1 of 2496”
Error in getGeoDetails(address[ii]) : could not find function “geocode”

any idea what could have been wrong?

I have the opposite. I have the latitude and longitude. I would like to find the district (D1, D2, D3…). Do you know how can I do it?

I can’t get the code to work at all…..from
infile <- "input"
data <- read.csv(paste0('./', infile, '.csv'))
that i get a warning saying error in / in paste0.

As a layman of codes, does this work with street address in a csv? what change should i make to the following part to get it work with my csv file?
#define a function that will process googles server responses for us.
getGeoDetails <- function(address){   
   #use the gecode function to query google servers
   geo_reply = geocode(address, output='all', messaging=TRUE, override_limit=TRUE)
   #now extract the bits that we need from the returned list
   answer <- data.frame(lat=NA, long=NA, accuracy=NA, formatted_address=NA, address_type=NA, status=NA)
   answer$status <- geo_reply$status
i only have 1800 entries so i guess the later parts aren't really relevant.

Hi, I having the same error. How did you fix this??
