FAST TRACK: There is some python code that allows you to scrape bike availability from bike schemes at the bottom of this post…
SLOW TRACK: As a recent aside, I was interested in collecting Dublin Bikes usage data over a long time period for data visualisation and exploration purposes. The Dublinbikes scheme was launched in September 2009 and is operated by JCDeaux and the Dublin City Council and is one of the more successful public bike schemes that has been implemented. To date, there have been over 6 million journeys and over 37,000 long term subscribers to the scheme. The bike scheme has attracted considerable press recently since an expansion to 1500 bikes and 102 stations around the city.
I wanted to collect, in real time, the status of all of the Dublin Bike Stations across Dublin over a number of months, and then visualise the bike usage and journey numbers at a number of different stations. Things like this:
There is no official public API that allows a large number of requests without IP blocking. The slightly-hidden API at the Dublin Cyclocity website started to block me after only a few minutes of requests. However, the good people at Citybik.es provide a wonderful API that provides real-time JSON data for a host of cities in Europe, America, and Australasia.
During this process, I scraped approximately 3.5 million entries from the Dublin Bikes website over a period of 30 days in February 2014. You can download the data for your own analysis in CSV form here, and in SQL-Lite form here. The data has columns for station name, station index/id, date and time, and number of bikes and spaces available.
Scrape Data from Dublin Bikes
The code below provides a short and simple scraper that queries the Citybik.es API at a predefined rate and stores all of the results into a CSV or SQLite Database file. All data is returned from the API as a JSON dump detailing bike availability at all stations, this data is parsed, converted into a pandas data frame, and inserted into the requested data container. You’ll need python 2.7, and a few dependencies that are accessible using Pip or easy_install.
Maybe you’ll find it useful on your data adventures.
# City Bikes Scraper.
# Simple functions to use the citybik.es API to record bike availability in a specific city.
# Settings for scrapers can be changed in lines 18-22
# Built using Python 2.7
# Shane Lynn 24/03/2014
import pandas as pd
import pandas.io.sql as pdsql
from time import sleep, strftime, gmtime
# define the city you would like to get information from here:
# for full list see http://api.citybik.es
API_URL = "http://api.citybik.es/dublinbikes.json"
SAMPLE_TIME = 120 # number of seconds between samples
SQLITE = False # If true - data is stored in SQLite file, if false - csv.
SQLITE_FILE = "bikedb.db" # SQLite file to save data in
CSV_FILE = "output.csv" # CSV file to save data in
print "\n\nScraping at " + strftime("%Y%m%d%H%M%S", gmtime())
# this url has all the details
decoder = json.JSONDecoder()
station_json = requests.get(API_URL, proxies='')
station_data = decoder.decode(station_json.content)
print "---- FAIL ----"
#remove unnecessary data - space saving
# we dont need latitude and longitude
for ii in range(0, len(station_data)):
print " --- SUCCESS --- "
def writeToCsv(data, filename="output.csv"):
Take the list of results and write as csv to filename.
data_frame = pd.DataFrame(data)
data_frame['time'] = strftime("%Y%m%d%H%M%S", gmtime())
data_frame.to_csv(filename, header=False, mode="a")
def writeToDb(data, db_conn):
Take the list of results and write to sqlite database
data_frame = pd.DataFrame(data)
data_frame['scrape_time'] = strftime("%Y%m%d%H%M%S", gmtime())
pdsql.write_frame(data_frame, "bikedata", db_conn, flavor="sqlite", if_exists="append", )
if __name__ == "__main__":
# Create / connect to Sqlite3 database
conn = sqlite3.connect(SQLITE_FILE) # or use :memory: to put it in RAM
cursor = conn.cursor()
# create a table to store the data
cursor.execute("""CREATE TABLE IF NOT EXISTS bikedata
(name text, idx integer, timestamp text, number integer,
free integer, bikes integer, id integer, scrape_time text)
#run main function
# we need to run the full collection, parsing, and writing every minute.
station_data = getAllStationDetails()
print "Sleeping for 120 seconds."