Batch Geocoding with R and Google maps

Turn address strings into GPS coordinates for free with Google

I recently wanted to geocode a large number of addresses (think circa 60k) in Ireland as part of a visualisation of the Irish property market. Geocoding can be achieved in R with the geocode() function from the ggmap library. The geocode() function uses Google’s Geocoding API to turn addresses from text into latitude and longitude pairs.
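To make the shape of the data concrete, here is a minimal offline sketch. The list below is hand-built to mimic the structure that geocode(address, output = "all") hands back (the same fields the script further down reads); the address and coordinate values are placeholders, not real API output.

```r
# Hand-built stand-in for the list returned by geocode(address, output = "all").
# Field names follow Google's Geocoding API reply; the values are placeholders.
mock_reply <- list(
  status = "OK",
  results = list(
    list(
      formatted_address = "1 Example Street, Dublin, Ireland",
      types = list("street_address"),
      geometry = list(
        location = list(lat = 53.35, lng = -6.26)
      )
    )
  )
)

# Pull the latitude/longitude pair out of the first match in the reply.
first_match <- mock_reply$results[[1]]
coords <- c(
  lat  = first_match$geometry$location$lat,
  long = first_match$geometry$location$lng
)
print(coords)
```

With a real reply, the same indexing applies once the status field is "OK".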

There is a usage limit on the geocoding service for free users of 2,500 addresses per IP address per day. This hard limit cannot be overcome without using a new IP address or paying for a business account. To ease the pain of restarting an R process every 2,500 addresses / day, I’ve built a script that geocodes addresses up to the API query limit every day, with a few handy features:

  • Once it hits the geocoding limit, it patiently waits for Google’s servers to let it proceed.
  • The script pings Google once per hour during the down time to start geocoding again as soon as possible.
  • A temporary file containing the current data state is maintained during the process. Should the script be interrupted, it will start again from the place it left off once any problems with the data/connection have been rectified.
Map generated with the Google Static Maps API.
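The wait-and-retry behaviour from the list above can be tried out without contacting Google. This is only a sketch: make_fake_geocoder() and query_with_wait() are hypothetical helpers invented for the illustration, and the fake geocoder simply reports OVER_QUERY_LIMIT a fixed number of times before answering OK.

```r
# Hypothetical stand-in for a geocoding call: reports OVER_QUERY_LIMIT for the
# first `failures` calls, then OK. It never contacts Google.
make_fake_geocoder <- function(failures = 2) {
  calls <- 0
  function(address) {
    calls <<- calls + 1
    if (calls <= failures) {
      list(status = "OVER_QUERY_LIMIT")
    } else {
      list(status = "OK", address = address)
    }
  }
}

# Keep retrying while the service reports that the quota is exhausted.
# wait_seconds would be 60 * 60 in real use; 0 keeps the demo instant.
query_with_wait <- function(address, query_fn, wait_seconds = 60 * 60) {
  attempts <- 1
  reply <- query_fn(address)
  while (reply$status == "OVER_QUERY_LIMIT") {
    message("Over query limit at ", format(Sys.time()),
            "; waiting ", wait_seconds, " seconds")
    Sys.sleep(wait_seconds)
    reply <- query_fn(address)
    attempts <- attempts + 1
  }
  reply$attempts <- attempts
  reply
}

fake_geocode <- make_fake_geocoder(failures = 2)
reply <- query_with_wait("1 Example Street, Ireland", fake_geocode, wait_seconds = 0)
print(reply$status)
print(reply$attempts)
```

Passing the query function in as an argument is just a convenience for testing; the real script calls geocode() directly inside the loop.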

The R script assumes that you are starting with a database that is contained in a single *.csv file, “input.csv”, where the addresses are contained in the “address” column. Feel free to use/modify to suit your own devices!
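If you just want to try things out first, a tiny input file in the expected layout can be created like this. The two rows are made-up placeholder addresses, the extra id column is only there to show that other columns are carried along, and the file is written to a temporary folder rather than your working directory.

```r
# Two made-up rows standing in for a real address database.
sample_data <- data.frame(
  id = 1:2,
  address = c("1 Example Street, Dublin", "2 Sample Road, Galway"),
  stringsAsFactors = FALSE
)

# Write to a temporary folder so nothing in the working directory is touched.
input_path <- file.path(tempdir(), "input.csv")
write.csv(sample_data, input_path, row.names = FALSE)

# Read the file back and build the address strings with the country appended.
data <- read.csv(input_path, stringsAsFactors = FALSE)
stopifnot("address" %in% names(data))
addresses <- paste0(data$address, ", Ireland")
print(addresses)
```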

Comments are included where possible:

# Geocoding script for large list of addresses. 

# Shane Lynn 10/10/2013

#load up the ggmap library
library(ggmap)
# get the input data
infile <- "input"
data <- read.csv(paste0('./', infile, '.csv'))

# get the address list, and append "Ireland" to the end to increase accuracy 
# (change or remove this if your addresses already include a country etc.)
# note: column names are case-sensitive, so this must match the "address" column header
addresses = data$address
addresses = paste0(addresses, ", Ireland")

#define a function that will process Google's server responses for us.
getGeoDetails <- function(address){   
   #use the geocode function to query Google's servers
   geo_reply = geocode(address, output='all', messaging=TRUE, override_limit=TRUE)
   #now extract the bits that we need from the returned list
   answer <- data.frame(lat=NA, long=NA, accuracy=NA, formatted_address=NA, address_type=NA, status=NA)
   answer$status <- geo_reply$status

   #if we are over the query limit - want to pause for an hour
   while(geo_reply$status == "OVER_QUERY_LIMIT"){
       print("OVER QUERY LIMIT - Pausing for 1 hour at:") 
       time <- Sys.time()
       print(as.character(time))
       Sys.sleep(60*60)
       geo_reply = geocode(address, output='all', messaging=TRUE, override_limit=TRUE)
       answer$status <- geo_reply$status
   }

   #return NAs if we didn't get a match:
   if (geo_reply$status != "OK"){
       return(answer)
   }   
   #else, extract what we need from the Google server reply into a dataframe:
   answer$lat <- geo_reply$results[[1]]$geometry$location$lat
   answer$long <- geo_reply$results[[1]]$geometry$location$lng   
   if (length(geo_reply$results[[1]]$types) > 0){
       answer$accuracy <- geo_reply$results[[1]]$types[[1]]
   }
   answer$address_type <- paste(geo_reply$results[[1]]$types, collapse=',')
   answer$formatted_address <- geo_reply$results[[1]]$formatted_address

   return(answer)
}

#initialise a dataframe to hold the results
geocoded <- data.frame()
# find out where to start in the address list (if the script was interrupted before):
startindex <- 1
#if a temp file exists - load it up and count the rows!
tempfilename <- paste0(infile, '_temp_geocoded.rds')
if (file.exists(tempfilename)){
       print("Found temp file - resuming from index:")
       geocoded <- readRDS(tempfilename)
       #resume at the first address that has not been saved yet
       startindex <- nrow(geocoded) + 1
       print(startindex)
}

# Start the geocoding process - address by address. geocode() function takes care of query speed limit.
# (the subsetting below gives an empty loop if every address is already in the temp file)
for (ii in seq_along(addresses)[seq_along(addresses) >= startindex]){
   print(paste("Working on index", ii, "of", length(addresses)))
   #query the google geocoder - this will pause here if we are over the limit.
   result = getGeoDetails(addresses[ii]) 
   print(result$status)     
   result$index <- ii
   #append the answer to the results file.
   geocoded <- rbind(geocoded, result)
   #save temporary results as we are going along
   saveRDS(geocoded, tempfilename)
}

#now we add the latitude and longitude to the main data
data$lat <- geocoded$lat
data$long <- geocoded$long
data$accuracy <- geocoded$accuracy

#finally write it all to the output files
saveRDS(data, paste0("../data/", infile ,"_geocoded.rds"))
write.table(data, file=paste0("../data/", infile ,"_geocoded.csv"), sep=",", row.names=FALSE)
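The save-and-resume behaviour can be checked without using up any API quota. The sketch below is an illustration only: fake_geo_details() is a hypothetical stand-in for getGeoDetails(), the stop_after argument exists purely to simulate an interruption, and the checkpoint is written to a temporary file.

```r
# Hypothetical stand-in for getGeoDetails(): returns a one-row data frame
# without contacting any service.
fake_geo_details <- function(address) {
  data.frame(address = address, status = "OK", stringsAsFactors = FALSE)
}

geocode_with_checkpoint <- function(addresses, checkpoint_file, stop_after = Inf) {
  geocoded <- data.frame()
  start_index <- 1
  if (file.exists(checkpoint_file)) {
    geocoded <- readRDS(checkpoint_file)
    start_index <- nrow(geocoded) + 1  # first address not yet saved
  }
  # The subsetting yields an empty loop when everything is already done.
  for (ii in seq_along(addresses)[seq_along(addresses) >= start_index]) {
    if (ii > stop_after) stop("simulated interruption")
    result <- fake_geo_details(addresses[ii])
    result$index <- ii
    geocoded <- rbind(geocoded, result)
    saveRDS(geocoded, checkpoint_file)  # checkpoint after every address
  }
  geocoded
}

addresses <- paste("Address", 1:5)
checkpoint_file <- tempfile(fileext = ".rds")

# First run is interrupted once two addresses have been saved.
first_run <- try(
  geocode_with_checkpoint(addresses, checkpoint_file, stop_after = 2),
  silent = TRUE
)

# Second run picks up at index 3 and finishes the job.
resumed <- geocode_with_checkpoint(addresses, checkpoint_file)
print(resumed$index)
```

Each address should appear exactly once in the resumed result, which is what the final step of the script needs when it copies the columns back onto the input data.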

Let me know if you find a use for the script, or if you have any suggestions for improvements.

Please be aware that it is against the Google Geocoding API terms of service to geocode addresses without displaying them on a Google map. Please see the terms of service for more details on usage restrictions.

100 thoughts on “Batch Geocoding with R and Google maps”

  1. Hello! I am having some fun getting the query running – I have just gotten stuck when starting the geocoding processes and getting this error.
    Error: Google now requires an API key.
    See ?register_google for details.
    I am beyond new to R and also to APIs. I am guessing I need to register for a key, but I just don’t know how, or where to store it!! Sorry if this is a silly question!!

    Emma 🙂

    1. Hi Emma,

      You need to get an API Key (just google “get API key” and steps will come up).

      Once you have your API Key it is automatically stored on your profile, but you can also copy+paste it into Notepad which you can save as a text file.

      Then, you just need to add this to the beginning of your script:

      ggmap::register_google(key = "copy+paste API Key here")

      And you should be good to go. Good luck!

  2. Hi! I have a question about the “get the input data” step. Which input data should I use, and where can I find it?

  3. rob.sutherland.III@gmail.com

    Hi Shane,
    Your batch geocode script has been a life saver. I’ve been using it for 2+ years, w/latest effort involving geocoding of 220K unique addresses! Thank you for sharing this.

    In an effort to give back, I’d like to share some error handling I’ve added to your script to deal with ‘NA’ responses from the Google Maps API, plus a run time capture to measure the average response time from Google.

    Code snippets starting here:

    # Start the geocoding process - address by address. geocode() function takes care of query speed limit.

    ptm <- proc.time() ## ***start clock for runtime measurement***

    for (ii in seq(startindex, length(addresses))){
       print(paste("Working on index", ii, "of", length(addresses)))
       #query the google geocoder - this will pause here if we are over the limit.

       ### ***added error handling for 'NA' responses ***
       ### (note: this check sends an extra geocode() request for each address)
       if (all(is.na(geocode(addresses[ii])))) {
          result = data.frame(lat=NA, long=NA, accuracy=NA, formatted_address=addresses[ii], address_type=NA, status='Failed for NA response', index=ii)
          geocoded <- rbind(geocoded, result)
          saveRDS(geocoded, tempfilename)
          #skip to the next address (reassigning ii does not advance an R for loop)
          next
       } ### *** end of error handling for 'NA' responses ***

       result = getGeoDetails(addresses[ii])
       print(result$status)
       result$index <- ii
       #append the answer to the results file.
       geocoded <- rbind(geocoded, result)
       #save temporary results as we are going along
       saveRDS(geocoded, tempfilename)
    }
    Runtime <- proc.time() - ptm ## ***stop clock for runtime measurement***

    The remainder of script is the same. Hopefully this helps, Thanks again!
