Batch Geocoding with R and Google Maps

Turn address strings into GPS coordinates for free with Google

I recently wanted to geocode a large number of addresses (circa 60k) in Ireland as part of a visualisation of the Irish property market. Geocoding can be achieved simply in R using the geocode() function from the ggmap library, which uses Google’s Geocoding API to turn text addresses into latitude and longitude pairs.

There is a usage limit on the geocoding service for free users of 2,500 addresses per IP address per day. This hard limit cannot be overcome without using a new IP address or paying for a business account. To ease the pain of restarting an R process every 2,500 addresses (that is, every day), I’ve built a script that geocodes addresses up to the API query limit every day, with a few handy features:

  • Once it hits the geocoding limit, it patiently waits for Google’s servers to let it proceed.
  • The script pings Google once per hour during the downtime to start geocoding again as soon as possible.
  • A temporary file containing the current data state is maintained during the process. Should the script be interrupted, it will start again from the place it left off once any problems with the data or connection have been rectified.
Figure: map generated with the Google Static Maps API.
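
To put the daily limit in perspective, here is a rough back-of-the-envelope calculation in R. The figures (roughly 60,000 addresses, 2,500 queries per day) come from the text above; the variable names are only for illustration.

```r
# Rough estimate of how many days a job takes under a daily quota.
n_addresses <- 60000   # approximate size of the address list
daily_limit <- 2500    # free queries per IP address per day

days_needed <- ceiling(n_addresses / daily_limit)
print(days_needed)
```

At that rate the full list needs a little over three weeks of unattended running, which is why being able to resume after an interruption matters.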

The R script assumes that your data is in a single *.csv file, “input.csv”, with the addresses in the “address” column. Feel free to use or modify it to suit your own purposes!
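
Before any queries are sent, it can help to check that the address column is in a shape the geocoder will accept. The sketch below is one possible way to do that using only base R. The helper name `prepare_addresses()` and the sample data frame are hypothetical and not part of the script that follows. It converts factors to character strings, trims whitespace, and flags rows with missing or empty addresses so they can be fixed or skipped.

```r
# Hypothetical helper: tidy an address column before geocoding.
prepare_addresses <- function(x, suffix = ", Ireland") {
  x <- trimws(as.character(x))          # factors -> character, strip padding
  usable <- !is.na(x) & nzchar(x)       # TRUE where there is something to geocode
  full <- ifelse(usable, paste0(x, suffix), NA_character_)
  data.frame(address = full, usable = usable, stringsAsFactors = FALSE)
}

# Illustrative input, standing in for the contents of input.csv
sample_data <- data.frame(
  address = c(" 1 Main Street, Dublin ", "", NA, "2 High Street, Cork"),
  stringsAsFactors = TRUE
)

prepared <- prepare_addresses(sample_data$address)
print(prepared)
```

Rows where `usable` is FALSE could be left out of the geocoding loop, which avoids spending queries on addresses that cannot match.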

Comments are included where possible:

# Geocoding script for large list of addresses. 

# Shane Lynn 10/10/2013

#load up the ggmap library
library(ggmap)
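# get the input data
infile <- "input"
# stringsAsFactors = FALSE keeps text columns as character strings rather than factors
data <- read.csv(paste0('./', infile, '.csv'), stringsAsFactors = FALSE)

# get the address list, and append "Ireland" to the end to increase accuracy 
# (change or remove this if your addresses already include a country etc.)
# note: column names are case-sensitive, so this must match the "address" column in the csv
addresses <- data$address
addresses <- paste0(addresses, ", Ireland")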

#define a function that will process Google's server responses for us.
getGeoDetails <- function(address){   
   #use the geocode function to query Google's servers
   geo_reply = geocode(address, output='all', messaging=TRUE, override_limit=TRUE)
   #now extract the bits that we need from the returned list
   answer <- data.frame(lat=NA, long=NA, accuracy=NA, formatted_address=NA, address_type=NA, status=NA)
   answer$status <- geo_reply$status

   #if we are over the query limit - want to pause for an hour
   while(geo_reply$status == "OVER_QUERY_LIMIT"){
       print("OVER QUERY LIMIT - Pausing for 1 hour at:") 
       time <- Sys.time()
       print(as.character(time))
       Sys.sleep(60*60)
       geo_reply = geocode(address, output='all', messaging=TRUE, override_limit=TRUE)
       answer$status <- geo_reply$status
   }

   #return NAs if we didn't get a match:
   if (geo_reply$status != "OK"){
       return(answer)
   }   
   #else, extract what we need from the Google server reply into a dataframe:
   answer$lat <- geo_reply$results[[1]]$geometry$location$lat
   answer$long <- geo_reply$results[[1]]$geometry$location$lng   
   if (length(geo_reply$results[[1]]$types) > 0){
       answer$accuracy <- geo_reply$results[[1]]$types[[1]]
   }
   answer$address_type <- paste(geo_reply$results[[1]]$types, collapse=',')
   answer$formatted_address <- geo_reply$results[[1]]$formatted_address

   return(answer)
}

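#initialise a dataframe to hold the results
geocoded <- data.frame()
# find out where to start in the address list (if the script was interrupted before):
startindex <- 1
#if a temp file exists - load it up and count the rows!
tempfilename <- paste0(infile, '_temp_geocoded.rds')
if (file.exists(tempfilename)){
       print("Found temp file - resuming from index:")
       geocoded <- readRDS(tempfilename)
       #rows 1..nrow(geocoded) are already done, so resume at the next address
       #(resuming at nrow(geocoded) would geocode the last saved row twice)
       startindex <- nrow(geocoded) + 1
       print(startindex)
}

# Start the geocoding process - address by address. geocode() function takes care of query speed limit.
# If every address is already done, loop over nothing; seq(startindex, length(addresses))
# on its own would count backwards in that case.
remaining <- if (startindex <= length(addresses)) seq(startindex, length(addresses)) else integer(0)
for (ii in remaining){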
   print(paste("Working on index", ii, "of", length(addresses)))
   #query the google geocoder - this will pause here if we are over the limit.
   result = getGeoDetails(addresses[ii]) 
   print(result$status)     
   result$index <- ii
   #append the answer to the results file.
   geocoded <- rbind(geocoded, result)
   #save temporary results as we are going along
   saveRDS(geocoded, tempfilename)
}

#now we add the latitude and longitude to the main data
data$lat <- geocoded$lat
data$long <- geocoded$long
data$accuracy <- geocoded$accuracy

#finally write it all to the output files
saveRDS(data, paste0("../data/", infile ,"_geocoded.rds"))
write.table(data, file=paste0("../data/", infile ,"_geocoded.csv"), sep=",", row.names=FALSE)
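
The final merge in the script above relies on `geocoded` having exactly one row per input row, in the same order. If a run was interrupted and resumed, it may be worth checking that before writing the output. The following is a minimal sketch of one way to do it, assuming the `index` column that the loop adds to each result. The helper name `merge_by_index()` and the toy data are hypothetical.

```r
# Hypothetical helper: attach geocoding results to the input rows by index
# instead of relying on row order.
merge_by_index <- function(data, geocoded) {
  # keep the first result for each index, in case a row was geocoded twice
  geocoded <- geocoded[!duplicated(geocoded$index), ]
  # line results up with the input rows; rows with no result get NA
  pos <- match(seq_len(nrow(data)), geocoded$index)
  data$lat <- geocoded$lat[pos]
  data$long <- geocoded$long[pos]
  data$accuracy <- geocoded$accuracy[pos]
  data
}

# Toy example: row 2 was geocoded twice, row 3 not at all.
toy_data <- data.frame(address = c("a", "b", "c"), stringsAsFactors = FALSE)
toy_geocoded <- data.frame(
  index = c(1, 2, 2),
  lat = c(53.1, 53.2, 53.2),
  long = c(-6.1, -6.2, -6.2),
  accuracy = c("street_address", "route", "route"),
  stringsAsFactors = FALSE
)

merged <- merge_by_index(toy_data, toy_geocoded)
print(merged)
```

Matching on the index means a duplicated or missing result shows up as a single NA row rather than as a length-mismatch error or silently shifted coordinates.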

Let me know if you find a use for the script, or if you have any suggestions for improvements.

Please be aware that it is against the Google Geocoding API terms of service to geocode addresses without displaying them on a Google map. Please see the terms of service for more details on usage restrictions.


Hi Shane, thanks for this highly useful code. What do the rows of your input file look like? I mean, how is the address typed in each row? I need to add the US state after the primary address; for example, the primary address is “abcd high school”, followed by Oklahoma, US. Could you please help me out here?

rirodrig

Hello Shane, thank you for the great code, it works great!

Address | lati | long | formatted address
1004B Jessica’s Court,Bel Air,MD, | 39.53376 | -76.37507 | 1099 Jessicas Ct, Bel Air, MD 21014, USA

I need one more column called formatted address. How do I change the code?

RobertC

Here is the error – Error in gzfile(file, mode) : cannot open the connection
In addition: Warning message:
In gzfile(file, mode) :
cannot open compressed file ‘../data/C:/Users/*******/Documents/April 2017/Addresses.csv_geocoded.rds’, probable reason ‘Invalid argument’

As a beginner, I find the error confusing and unclear. I followed your script without any changes and created my own csv file for testing purposes. All went well until I neared the end, unfortunately. Maybe the script should not be run exactly as shown, or maybe I am missing an important part of the instructions. Advice and suggestions are welcome.
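
The path in that error message suggests `infile` was set to a full file path (including the `.csv` extension) rather than a bare name like "input", so "../data/" was glued onto the front of it. Below is a minimal sketch of one way to make the output path robust to that, using base R only. `output_path()` and the example paths are hypothetical.

```r
# Hypothetical helper: derive an output file path from whatever was supplied
# as the input file (bare name, file name, or full path).
output_path <- function(infile, suffix, out_dir = "../data") {
  stem <- tools::file_path_sans_ext(basename(infile))  # "dir/input.csv" -> "input"
  file.path(out_dir, paste0(stem, suffix))
}

rds_path <- output_path("input", "_geocoded.rds")
csv_path <- output_path("some/folder/Addresses.csv", "_geocoded.csv")
print(rds_path)
print(csv_path)

# The output directory must exist before saveRDS() or write.table() is called, e.g.:
# dir.create("../data", showWarnings = FALSE, recursive = TRUE)
```

With a helper like this, the last two lines of the script could call `output_path(infile, "_geocoded.rds")` and `output_path(infile, "_geocoded.csv")` instead of pasting the pieces together by hand.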

Emily Edwards

Hi Shanelynn,

Thank you for the script, it has been very helpful for the purposes of my project. I noticed above in the comments that someone spotted the error on line 77 and it is now correct here. However, I accessed the script from Rbloggers here and the error is still there. I wanted to let you know in case you had any control over fixing the error on Rbloggers as well. Thank you!

[…] a fifth option, you might say, … I am going to test the R package (ggmap), which has a geocoding function linked to the Google API. I am just trying to make sure […]

Ryan

Hello,

I am getting an error when I run lines 63 – 73, which says:

Error in if (location == "") return(failedGeocodeReturn(output)) :
missing value where TRUE/FALSE needed

Thoughts? Ideas? Fixes?

Thanks!

Sknizkov

Hi 🙂 I noticed many people mentioning the repeating of addresses when it continues from index – would simply adding a +1 to the condition not work? Like this:

if (file.exists(tempfilename)){
print("Found temp file - resuming from index")
geocoded <- readRDS(tempfilename)
startindex <- nrow(geocoded)+1
print(startindex)
}

I tried it and it seems to work for me, but maybe I’m missing something :/

Oleksiy

Hi, Shane, thanks for your amazing code. Super helpful.
A dumb question about tempfilename: I can use any name, I suppose, but in which format should I save it?

Sincerely
Oleksiy

Brooke Patton

Hi Shane,

Thanks for the code! I have a couple of questions, but I’m brand new to programming, so forgive my ignorance.

1) I have received the following error: “Error in data$Address (from DB_Geocode_trial1) : object of type ‘closure’ is not subsettable”. What is this? And how do I fix it?

2) I am working on a water-related project in which I need to find coordinates for “Habitations” (similar to a village) in India. In other words, I don’t have street names. Is this code meant to accomplish a task like this? If not, do you have any ideas for how I can modify it?

Sorry that I have so many questions. I appreciate any and all assistance!

Thanks in advance,

Brooke

Max

Out of curiosity, is it possible to request the route between two points?

rajanna ap

Hi Shane, is there code for getting the city and country from latitude and longitude data? Please let me know.

Andrew Dodd

This worked for me and I don’t even have addresses. I passed it “Region, Country”, sometimes “,Region, Country” and it’s working well.

Thank you!

[…] about the fifth option? … I will test the R package (ggmap) which has a geocoding function related to the Google API. I’m just trying to make sure the […]

Brock

This was extremely helpful, thank you for sharing this!

[…] a recent project, I ported the “batch geocoding in R” script over to Python. The script allows geocoding of large numbers of string addresses to […]

Ana

Hi, I am getting this error: is.character(location) is not TRUE. Any ideas?

Jackie

I know this is months after the date you posted, but when you load the .csv file, make sure you add “stringsAsFactors = FALSE” in your read.csv line, something like this: data <- read.csv('file_location.csv', stringsAsFactors = FALSE)

Liz

This is great! Would it be possible to adapt the code for the gmapsdistance function? Is this something you have already done? That is, I need to calculate multiple travel distances (in time)…rather than geocode addresses. Thanks for your thoughts!

Prakash Hullathi

Hi Shane,
Thanks for the useful code. When I tried it with 252 addresses it worked fine. After 4 days I used a new input.csv file with 221 rows, and it started giving the error below:
Error in if (location == "") return(failedGeocodeReturn(output)) :
missing value where TRUE/FALSE needed

It is also still using the values from the old 252-row file. Please provide a solution; I am waiting for your reply.

Thanks and regards,
Prakash Hullathi

Nayanmoni Baishya

Hi,
What is the approximate runtime for 10K locations?

Brad Gough

It processed around 900 records in half an hour for me. This was with Sys.sleep(1) instead of Sys.sleep(60*60).

PMc

Hi Shane,

Thanks for the extremely useful code.

I am using this for schools. However it is taking an extremely long time to do, around 3 schools every hour (sometimes no schools in an hour), before it hits the query limit.

Below is an example of the format I am using. As you can see, some may have longer addresses. Could this be the issue? Or is it normal for it to take this long?

PRESENTATION PRIMARY SCHOOL, TERENURE, DUBLIN 6W, Ireland
BEACAN MIXED N S, BEKAN, CLAREMORRIS, CO MAYO, Ireland

Many Thanks,
PMc

Brad Gough

Change where it says Sys.sleep(60*60) to Sys.sleep(1)

This will then retry every 1 second.

The limit is so many records per second.

[…] Batch Geocoding with R and Google maps […]

Brad Gough

Shane, you legend. Worked perfectly. I changed the time to retry every 1 second and it is currently processing 4800 records. No more time-consuming geocoding! 🙂

Brad Gough

I’ve noticed it doesn’t work with foreign characters such as á æ ü.

It falls over on that record.

I got round it by using replace-all in Excel: á with a, æ with ae, etc.

(These are Scandinavian characters. I’m not sure if it is an issue for other foreign countries)
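
The same substitution can be done in R rather than Excel. This is a minimal sketch, assuming a hand-picked replacement table like the one described above. `replace_special()` and the mapping vectors are hypothetical, cover lowercase letters only, and would need extending for other characters.

```r
# Hypothetical replacement table, written with \u escapes so the script
# itself stays plain ASCII: a-acute, ae ligature, u-umlaut.
from <- c("\u00e1", "\u00e6", "\u00fc")
to   <- c("a", "ae", "u")

# Hypothetical helper: apply each literal replacement in turn.
replace_special <- function(x, from, to) {
  for (i in seq_along(from)) {
    x <- gsub(from[i], to[i], x, fixed = TRUE)
  }
  x
}

# Illustrative place names containing the characters above
places <- c("M\u00fcnchen", "Sk\u00e1li", "N\u00e6stved")
cleaned <- replace_special(places, from, to)
print(cleaned)
```

Running this on the address column before the geocoding loop keeps the original csv untouched, at the cost of maintaining the replacement table by hand.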

Jessica

Thanks for this code, Shane! It worked perfectly for me all summer, but for the last week or two I keep getting an OVER QUERY LIMIT message on the first query, and never get past that. I already tried changing the order of the requests, checking the input data for any weirdness, and changing Sys.sleep to 5. No luck. Could this have something to do with the changes at Google since 11 June? Is there a way to work an API key into the code? Experiences, anyone?

Thanks,

Jessica

Richard Charger

Hi, I am getting this error. I’ve narrowed it down to the following snip.

for (ii in seq(startindex, length(addresses))){
print(paste("Working on index", ii, "of", length(addresses)))
#query the google geocoder - this will pause here if we are over the limit.
result = getGeoDetails(addresses[ii])
print(result$status)
result$index <- ii
#append the answer to the results file.
geocoded <- rbind(geocoded, result)
#save temporary results as we are going along
saveRDS(geocoded, tempfilename)
}

$ operator is invalid for atomic vectors

Do I need to change the data type? That might be a good approach, but I'm still new, so I'm still investigating.

I'm also seeing that there is a change to the Google API. Is this approach still worth exploring? Or is your code broken at the moment? Nice job though m8. It was good while it lasted. 🙂

[…] nominatim : github.com/hrbrmstr/nominatim ; ggmap::geocode ; geocodeHERE::geocodeHERE_simple ; geonames package; also google r street address geolocation. Excellent article with example code (using the ggmap package): shanelynn.ie/masivo-geocodificación-con-r-y-google-maps […]

mcoirad

Just want to say, if you are having to wait to geocode 2500 addresses a day, a better solution would be to set up your own geocoder using PostGIS and Census TIGER shapefiles. I set that up for a project that needed to geocode 20 million addresses.

mcoirad

At least for US addresses; I assume there is a comparable open-source solution for European ones.