Turn address strings into GPS coordinates for free with Google

Batch Geocoding with R and Google maps

I’ve recently wanted to geocode a large number of addresses (think circa 60k) in Ireland as part of a visualisation of the Irish property market. Geocoding can be achieved simply in R using the geocode() function from the ggmap library, which uses Google’s Geocoding API to turn text addresses into latitude and longitude pairs.
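To make the shape of the data concrete, here is a minimal sketch that pulls the coordinates out of a hand-built mock reply. The nested layout mirrors the fields the script further down reads (status, results, geometry, location); the values, the mock itself, and the helper name extract_first_match are made up for illustration, and no request is sent to Google.

```r
# Hand-built stand-in for a geocoder reply (illustrative values only).
mock_reply <- list(
  status = "OK",
  results = list(
    list(
      formatted_address = "Example Street, Dublin, Ireland",
      geometry = list(location = list(lat = 53.35, lng = -6.26)),
      types = list("route")
    )
  )
)

# Hypothetical helper: take the first match, or leave NAs when there is none.
extract_first_match <- function(reply) {
  out <- data.frame(lat = NA_real_, long = NA_real_,
                    formatted_address = NA_character_,
                    status = reply$status, stringsAsFactors = FALSE)
  if (reply$status == "OK") {
    best <- reply$results[[1]]
    out$lat <- best$geometry$location$lat
    out$long <- best$geometry$location$lng
    out$formatted_address <- best$formatted_address
  }
  out
}

match_ok <- extract_first_match(mock_reply)
match_none <- extract_first_match(list(status = "ZERO_RESULTS", results = list()))
print(match_ok)
print(match_none)
```

A reply without a match keeps its status but leaves the coordinates as NA, so failed rows are easy to spot afterwards.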

There is a usage limit on the geocoding service for free users of 2,500 addresses per IP address per day. This hard limit cannot be overcome without employing a new IP address or paying for a business account. To ease the pain of restarting an R process every 2,500 addresses / day, I’ve built a script that geocodes addresses up to the API query limit every day, with a few handy features:

  • Once it hits the geocoding limit, it patiently waits for Google’s servers to let it proceed.
  • The script pings Google once per hour during the down time to start geocoding again as soon as possible.
  • A temporary file containing the current data state is maintained during the process. Should the script be interrupted, it will start again from the place it left off once any problems with the data/connection have been rectified.
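The wait-and-retry behaviour can be sketched on its own, without touching the network. This is only an illustration: query_with_wait and stub_query are hypothetical names, and the stub stands in for the real geocoder call by refusing the first two requests.

```r
# query_fn: any function that takes an address and returns a list with a
#           `status` element (hypothetical stand-in for the geocoder call).
# wait_seconds: pause between retries; one hour matches the behaviour above.
query_with_wait <- function(address, query_fn, wait_seconds = 60 * 60) {
  reply <- query_fn(address)
  while (identical(reply$status, "OVER_QUERY_LIMIT")) {
    message("Over query limit, pausing at ", format(Sys.time()))
    Sys.sleep(wait_seconds)
    reply <- query_fn(address)
  }
  reply
}

# Stub that refuses the first two calls, then succeeds.
calls <- 0
stub_query <- function(address) {
  calls <<- calls + 1
  if (calls <= 2) list(status = "OVER_QUERY_LIMIT") else list(status = "OK")
}

# wait_seconds = 0 so the demonstration finishes immediately.
reply <- query_with_wait("1 Example Street, Dublin, Ireland", stub_query,
                         wait_seconds = 0)
print(reply$status)
print(calls)
```

Passing the query function in as an argument keeps the waiting logic separate from the geocoder, which makes it easy to try out with a stub before spending any real quota.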

The R script assumes that you are starting with a database that is contained in a single *.csv file, “input.csv”, where the addresses are contained in the “address” column. Feel free to use/modify to suit your own devices!
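As a rough illustration of that input layout, the sketch below writes a tiny made-up file to a temporary location and reads it back. It assumes a lowercase address header; R column names are case-sensitive, so the name used in the script has to match the header in the file exactly. The file contents and variable names are hypothetical.

```r
# Write a tiny example file to a temporary location (made-up addresses).
example_csv <- tempfile(fileext = ".csv")
writeLines(c(
  'id,address,other_column',
  '0,"Test Address, Test Town",other value',
  '1,"Test Address2, Test Town2",other value'
), example_csv)

# stringsAsFactors = FALSE keeps the address column as plain text.
example_data <- read.csv(example_csv, stringsAsFactors = FALSE)

# Fail early if the expected column is missing or spelled differently.
stopifnot("address" %in% names(example_data))

# Append the country to help the geocoder along.
example_addresses <- paste0(example_data$address, ", Ireland")
print(example_addresses)
```

Quoting each address in the file matters here, since the addresses themselves contain commas.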

Comments are included where possible:

# Geocoding script for large list of addresses. 

# Shane Lynn 10/10/2013

#load up the ggmap library
library(ggmap)
# get the input data
infile <- "input"
data <- read.csv(paste0('./', infile, '.csv'), stringsAsFactors = FALSE)

# get the address list, and append "Ireland" to the end to increase accuracy 
# (change or remove this if your addresses already include a country etc.)
# note: the column name is case-sensitive and must match the csv header ("address")
addresses <- data$address
addresses <- paste0(addresses, ", Ireland")

#define a function that will process Google's server responses for us.
getGeoDetails <- function(address){   
   #use the geocode function to query Google's servers
   geo_reply = geocode(address, output='all', messaging=TRUE, override_limit=TRUE)
   #now extract the bits that we need from the returned list
   answer <- data.frame(lat=NA, long=NA, accuracy=NA, formatted_address=NA, address_type=NA, status=NA)
   answer$status <- geo_reply$status

   #if we are over the query limit - want to pause for an hour
   while(geo_reply$status == "OVER_QUERY_LIMIT"){
       print("OVER QUERY LIMIT - Pausing for 1 hour at:") 
       time <- Sys.time()
       print(as.character(time))
       Sys.sleep(60*60)
       geo_reply = geocode(address, output='all', messaging=TRUE, override_limit=TRUE)
       answer$status <- geo_reply$status
   }

   #return NAs if we didn't get a match:
   if (geo_reply$status != "OK"){
       return(answer)
   }   
   #else, extract what we need from the Google server reply into a dataframe:
   answer$lat <- geo_reply$results[[1]]$geometry$location$lat
   answer$long <- geo_reply$results[[1]]$geometry$location$lng   
   if (length(geo_reply$results[[1]]$types) > 0){
       answer$accuracy <- geo_reply$results[[1]]$types[[1]]
   }
   answer$address_type <- paste(geo_reply$results[[1]]$types, collapse=',')
   answer$formatted_address <- geo_reply$results[[1]]$formatted_address

   return(answer)
}

#initialise a dataframe to hold the results
geocoded <- data.frame()
# find out where to start in the address list (if the script was interrupted before):
startindex <- 1
#if a temp file exists - load it up and count the rows!
tempfilename <- paste0(infile, '_temp_geocoded.rds')
if (file.exists(tempfilename)){
       print("Found temp file - resuming from index:")
       geocoded <- readRDS(tempfilename)
       startindex <- nrow(geocoded) + 1
       print(startindex)
}

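# Start the geocoding process - address by address. geocode() function takes care of query speed limit.
# (if the temp file already covers every address, there is nothing left to do and the loop is skipped)
remaining <- if (startindex <= length(addresses)) seq(startindex, length(addresses)) else integer(0)
for (ii in remaining){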
   print(paste("Working on index", ii, "of", length(addresses)))
   #query the google geocoder - this will pause here if we are over the limit.
   result = getGeoDetails(addresses[ii]) 
   print(result$status)     
   result$index <- ii
   #append the answer to the results file.
   geocoded <- rbind(geocoded, result)
   #save temporary results as we are going along
   saveRDS(geocoded, tempfilename)
}

#now we add the latitude and longitude to the main data
data$lat <- geocoded$lat
data$long <- geocoded$long
data$accuracy <- geocoded$accuracy

#finally write it all to the output files (the ../data/ directory must already exist)
saveRDS(data, paste0("../data/", infile ,"_geocoded.rds"))
write.table(data, file=paste0("../data/", infile ,"_geocoded.csv"), sep=",", row.names=FALSE)
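The resume step depends on the temporary file belonging to the current input. A minimal sketch of working out which rows are still outstanding is shown here, with a guard for a stale temp file that has more rows than the input. Without such a guard, seq() would count backwards and index past the end of the address list. The helper name remaining_indices is hypothetical and is not part of the script above.

```r
# n_done:  number of rows already saved in the temporary results
# n_total: number of addresses in the current input file
remaining_indices <- function(n_done, n_total) {
  if (n_done > n_total) {
    stop("Temporary file has more rows than the input; delete it before re-running.")
  }
  if (n_done == n_total) {
    return(integer(0))  # everything is already geocoded
  }
  seq(n_done + 1, n_total)
}

print(remaining_indices(0, 3))   # fresh start
print(remaining_indices(2, 5))   # resuming part-way through
print(remaining_indices(3, 3))   # nothing left to do
```

When switching to a different input file, deleting the old temporary file first avoids mixing results from two runs.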

Let me know if you find a use for the script, or if you have any suggestions for improvements.

Please be aware that it is against the Google Geocoding API terms of service to geocode addresses without displaying them on a Google map. Please see the terms of service for more details on usage restrictions.

80 thoughts on “Batch Geocoding with R and Google maps”

  1. Hi Ann – what’s the error that you are getting? Have you got your csv file in the same place as your R script? And finally, have you set the “working directory” of R to the same directory?

    As another thing to try – change read.csv(paste0(‘./’, infile, ‘.csv’)) to just read.csv(paste0(infile, ‘.csv’))

    Hope this works!

  2. Hi Shane, thanks for this highly useful code. What do the rows of your input file look like? I mean, how is the address typed in each row? I need to add the US state after the primary address; for example, the primary address is “abcd high school”, followed by Oklahoma, US. Could you please help me out here?

    1. The address can be just a string column in the csv file. So something like:

      id, address, other_column
      0, “Test Address, Test Town, State”, “other value”
      1, “Test Address2, Test Town2, State”, “other value”

      Does that make sense? You can use paste() to add a state at the end if you have it in another column etc.
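A minimal sketch of that paste() step, assuming hypothetical columns named address and state (the school names and states here are made up):

```r
# Hypothetical data: a name/address column plus a separate state column.
schools <- data.frame(
  address = c("abcd high school", "efgh middle school"),
  state = c("Oklahoma", "Texas"),
  stringsAsFactors = FALSE
)

# Join the pieces with ", " so the geocoder sees one full address string.
schools$full_address <- paste(schools$address, schools$state, "US", sep = ", ")
print(schools$full_address)
```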

  4. Here is the error – Error in gzfile(file, mode) : cannot open the connection
    In addition: Warning message:
    In gzfile(file, mode) :
    cannot open compressed file ‘../data/C:/Users/*******/Documents/April 2017/Addresses.csv_geocoded.rds’, probable reason ‘Invalid argument’

    As a beginner, I find the error confusing and unclear. I followed your script without any changes and created my own csv file for testing purposes. All went well until I neared the end, unfortunately. Maybe the script should not be run exactly as shown, or maybe I am missing an important part of the instructions. Advice and suggestions are welcome.

  5. Emily Edwards

    Hi Shanelynn,

    Thank you for the script, it has been very helpful for the purposes of my project. I noticed above in the comments that someone spotted the error on line 77 and it is now correct here. However, I accessed the script from Rbloggers here and the error is still there. I wanted to let you know in case you had any control over fixing the error on Rbloggers as well. Thank you!

  6. Pingback: Le géocodage (comparaison)

  7. Hello,

    I am getting an error when I run lines 63–73, which says:

    “Error in if (location == "") return(failedGeocodeReturn(output)) :
    missing value where TRUE/FALSE needed”

    Thoughts? Ideas? Fixes?

    Thanks!

  8. Hi 🙂 I noticed many people mentioning the repeating of addresses when it continues from index – would simply adding a +1 to the condition not work? Like this:

    if (file.exists(tempfilename)){
      print("Found temp file - resuming from index")
      geocoded <- readRDS(tempfilename)
      startindex <- nrow(geocoded)+1
      print(startindex)
    }

    I tried it and it seems to work for me, but maybe I’m missing something :/

  9. Hi, Shane, thanks for your amazing code. Super helpful.
    I have a dumb question about tempfilename: I suppose I can use any name, but in which format should I save it?

    Sincerely
    Oleksiy

  10. Hi Shane,

    Thanks for the code! I have a couple of questions, but I’m brand new to programming, so forgive my ignorance.

    1) I have received the following error: “Error in data$Address (from DB_Geocode_trial1) : object of type ‘closure’ is not subsettable”. What is this? And how do I fix it?

    2) I am working on a water-related project in which I need to find coordinates for “Habitations” (similar to a village) in India. In other words, I don’t have street names. Is this code meant to accomplish a task like this? If not, do you have any ideas for how I can modify it?

    Sorry that I have so many questions. I appreciate any and all assistance!

    Thanks in advance,

    Brooke

    1. Hi Brooke. It sounds like your data didn’t read in correctly for the first error – make sure your CSV file is correctly formatted, and the data shows up correctly if you run “head(data)”. For the water project, the geocoder from google should work fine with areas rather than street addresses, as long as those habitations are marked on Google. To help it along, you can add “India” to the end of the geocoding string.

    1. Hi Rajanna, this is called “reverse-geocoding” – you’ll need to look at the specific API for that – Google has one if you look around the documentation!

  11. This worked for me and I don’t even have addresses. I passed it “Region, Country”, sometimes “,Region, Country” and it’s working well.

    Thank you!

  12. Pingback: Geocoding (comparison)

  13. Pingback: Batch CSV Geocoding in Python with Google Maps API | Shane Lynn

    1. I know this is months beyond the date you posted, but when you load the .csv file, make sure you add “stringsAsFactors = FALSE” in your read.csv line… something like this: data <- read.csv('file_location.csv', stringsAsFactors = FALSE)

  14. This is great! Would it be possible to adapt the code for the gmapsdistance function? Is this something you have already done? That is, I need to calculate multiple travel distances (in time)…rather than geocode addresses. Thanks for your thoughts!

  15. Prakash Hullathi

    Hi Shane,
    Thanks for the useful code. When I tried it with 252 addresses it worked fine. After 4 days I used a new input.csv file with 221 rows and it started giving the error below:
    Error in if (location == "") return(failedGeocodeReturn(output)) :
    missing value where TRUE/FALSE needed

    It also seems to be using the values from the old 252-row file. Please provide a solution; I am waiting for your reply.

    Thanks Regards
    Prakash Hullathi

    1. Hi Prakash – it looks as if there is a missing address or location in your input file. That or a comma in the wrong place – you need to examine the file to find any errors in the input data.

    1. It processed around 900 records in half an hour for me. This was with Sys.sleep(1) instead of Sys.sleep(60*60)

  16. HI Shane,

    Thanks for the extremely useful code.

    I am using this for schools. However it is taking an extremely long time to do, around 3 schools every hour (sometimes no schools in an hour), before it hits the query limit.

    Below is an example of the format I am using. As you can see, some may have longer addresses. Could this be the issue? Or is it normal for it to take this long?

    PRESENTATION PRIMARY SCHOOL, TERENURE, DUBLIN 6W, Ireland
    BEACAN MIXED N S, BEKAN, CLAREMORRIS, CO MAYO, Ireland

    Many Thanks,
    PMc

    1. Change where it says Sys.sleep(60*60) to Sys.sleep(1)

      This will then retry every 1 second.

      The limit is so many records per second.

  17. Pingback: Batch Geocoding with R and Google maps | All Around GIS

  18. Shane you legend. Worked perfectly. I changed the time to retry every 1 second and it is currently processing 4800 records. No more time consuming geo coding! 🙂

  19. I’ve noticed it doesn’t work with foreign characters such as á æ ü.

    It falls over on that record.

    I got round it by using a replace all in Excel so á with a – æ with ae etc.

    (These are Scandinavian characters. I’m not sure if it is an issue for other foreign countries)
