Shane Lynn

Personal site of Shane Lynn Ph.D., Data Analytics Consultant, Tech lover, and startup enthusiast. Currently building KillBiller.com


  • Jun 14 / 2015
  • 1
blog, data science, python, Uncategorized

Summarising, Aggregating, and Grouping data in Python Pandas

I’ve recently started using Python’s excellent Pandas library as a data analysis tool and, while the transition from R’s excellent data.table library has been frustrating at times, I’m finding my way around and most things work quite well.

One aspect that I’ve recently been exploring is the task of grouping large data frames by different variables, and applying summary functions to each group. This is accomplished in Pandas using the “groupby()” and “agg()” functions of Pandas’ DataFrame objects.
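
As a minimal sketch of the pattern (the column names and values below are invented for illustration and are not taken from the phone usage dataset), grouping and summarising might look something like this:

```python
import pandas as pd

# Hypothetical call records; columns and values are made up for illustration.
data = pd.DataFrame({
    "month": ["2014-11", "2014-11", "2014-12", "2014-12", "2014-12"],
    "item": ["call", "sms", "call", "call", "data"],
    "duration": [120.0, 1.0, 60.0, 30.0, 34.5],
})

# Group rows by month, then apply several summary functions to one column.
summary = data.groupby("month")["duration"].agg(["sum", "mean", "count"])
print(summary)

# Group by two variables at once and total the duration within each group.
by_month_item = data.groupby(["month", "item"])["duration"].sum()
print(by_month_item)
```

Passing a list of function names to agg() gives one output column per function, which is roughly the role that a data.table summary expression plays in R.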

A Sample DataFrame

In order to demonstrate the effectiveness and simplicity of the grouping commands, we will need some data. For an example dataset, I have extracted my own mobile phone usage records. I analyse this type of data using Pandas during my work on KillBiller. If you’d like to follow along – the full csv file is available here.

Continue Reading

  • Dec 18 / 2014
  • 0
blog, python

Using Python Threading and Returning Multiple Results (Tutorial)

I recently had an issue with a long running web process that I needed to substantially speed up due to timeouts. The delay arose because the system needed to fetch data from a number of URLs. The total number of URLs varied from user to user, and the response time for each URL was quite long (circa 1.5 seconds).

Problems arose with 10-15 URL requests taking over 20 seconds, and my server HTTP connection was timing out. Rather than extending my timeout, I turned to Python’s threading library. It’s easy to learn, quick to implement, and solved my problem very quickly. The system was implemented in Python’s web micro-framework, Flask.

Parallel programming allows you to speed up your code execution - very useful for data science and data processing

Using Threads for a low number of tasks

Threading in Python is simple. It allows you to manage concurrent threads doing work at the same time. The library is called “threading”, you create “Thread” objects, and they run target functions for you. You can start potentially hundreds of threads that will operate in parallel. The first solution was inspired by a number of StackOverflow posts, and involves launching an individual thread for each URL request. This turned out to not be the ideal solution, but provides a good learning ground.
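
A rough sketch of the one-thread-per-URL idea is shown here. The fetch() function is a hypothetical stand-in that sleeps instead of making a real HTTP request, and the URLs are placeholders; the point is only to show how each thread can hand its result back through a shared list.

```python
import threading
import time


def fetch(url):
    """Stand-in for a slow HTTP request (hypothetical; no network access)."""
    time.sleep(0.1)
    return "response from " + url


def fetch_into(url, results, index):
    # Each thread writes only to its own slot, so no lock is needed here.
    results[index] = fetch(url)


def fetch_all(urls):
    results = [None] * len(urls)
    threads = [
        threading.Thread(target=fetch_into, args=(url, results, i))
        for i, url in enumerate(urls)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for every request to finish
    return results


urls = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]
start = time.time()
results = fetch_all(urls)
print(results)
print("elapsed: %.2f s" % (time.time() - start))
```

Because the threads spend their time waiting rather than computing, the waits overlap and the total time is close to that of the slowest single request instead of the sum of all of them.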

Continue Reading

  • Jul 29 / 2014
  • 2
blog, python, Software, web

Asynchronous updates to a webpage with Flask and Socket.io

This post is about creating Python Flask web pages that can be asynchronously updated by your Python Flask application at any point without any user interaction. We’ll be using Python Flask, and the Flask-SocketIO plug-in to achieve this. In short, the final result is hosted on GitHub.

What I want to achieve here is a web page that is automatically updated for each user as a result of events that happened in the background on my server system. For example, allowing events like a continually updating message stream, a notification system, or a specific Twitter monitor / display. In this post, I show how to develop a bare-bones Python Flask application that updates connected clients with random numbers. Flask is an extremely lightweight and simple framework for building web applications using Python.

Flask logo

If you haven’t used Flask before, it’s amazingly simple, and to get started serving a very simple webpage only requires a few lines of Python:

Running this file with python application.py will start a server on your local machine with one page saying “Hello World!” A quick look through the documentation and the first few sections of the brilliant mega-tutorial by Miguel Grinberg will have you creating multi-page python-based web applications in no time. However, most of the tutorials out there focus on the production of non-dynamic pages that load when first accessed, and don’t describe further updates.

For the purpose of updating the page once our user has first visited, we will be using Socket.io and the accompanying Flask add-on built by the same Miguel Grinberg, Flask-SocketIO (Miguel appears to be some sort of Python Flask God). Socket.io is a genius engine that allows real-time bidirectional event-based communication. Gone are the days of static HTML pages that load when you visit; with Socket technology, the server can continuously update your view with new information.

Continue Reading

  • Mar 25 / 2014
  • 2
blog, data science, python, Software

Scraping Dublin City Bikes Data Using Python

Dublin bikes by Dublin City Council

FAST TRACK: There is some python code that allows you to scrape bike availability from bike schemes at the bottom of this post…

SLOW TRACK: As a recent aside, I was interested in collecting Dublin Bikes usage data over a long time period for data visualisation and exploration purposes. The Dublinbikes scheme was launched in September 2009, is operated by JCDecaux and Dublin City Council, and is one of the more successful public bike schemes to have been implemented. To date, there have been over 6 million journeys, and the scheme has over 37,000 long-term subscribers. The bike scheme has attracted considerable press recently following an expansion to 1500 bikes and 102 stations around the city.

I wanted to collect, in real time, the status of all of the Dublin Bike Stations across Dublin over a number of months, and then visualise the bike usage and journey numbers at a number of different stations. Things like this:

Example of plot from Dublin bikes

There is no official public API that allows a large number of requests without IP blocking. The slightly-hidden API at the Dublin Cyclocity website started to block me after only a few minutes of requests. However, the good people at Citybik.es provide a wonderful API that provides real-time JSON data for a host of cities in Europe, America, and Australasia.
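
To give a feel for what working with such a JSON feed involves, here is a small sketch. The endpoint URL and the field names (“network”, “stations”, “free_bikes”, “empty_slots”) are assumptions based on a Citybik.es v2-style response and should be checked against the current API documentation; the sample payload is invented so that the parsing can be tried without making a request.

```python
import json
import urllib.request

# Assumed endpoint and field names (Citybik.es v2-style API); verify before use.
API_URL = "http://api.citybik.es/v2/networks/dublinbikes"


def fetch_network(url=API_URL):
    """Download and decode the JSON document for one bike network."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def station_rows(payload):
    """Flatten the payload into (name, free_bikes, empty_slots) tuples."""
    stations = payload["network"]["stations"]
    return [(s["name"], s["free_bikes"], s["empty_slots"]) for s in stations]


# Invented offline sample with the assumed shape.
sample = {"network": {"stations": [
    {"name": "Station A", "free_bikes": 5, "empty_slots": 15},
    {"name": "Station B", "free_bikes": 0, "empty_slots": 30},
]}}
rows = station_rows(sample)
print(rows)
```

To collect data over several months, station_rows(fetch_network()) could be called on a timer and each batch of rows appended, with a timestamp, to a file or database.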

Continue Reading

  • Feb 03 / 2014
  • 31
blog, data science, R, Software

Self-Organising Maps for Customer Segmentation using R

Self-Organising Maps (SOMs) are an unsupervised data visualisation technique that can be used to visualise high-dimensional data sets in lower (typically 2) dimensional representations. In this post, we examine the use of R to create a SOM for customer segmentation. The figures shown here use the 2011 Irish Census information for the greater Dublin area as an example data set. This work is based on a talk given to the Dublin R Users group in January 2014.

If you are keen to get down to business:

  • The slides from a talk on this subject that I gave to the Dublin R Users group in January 2014 are available here
  • The code for the Dublin Census data example is available for download from here (zip file containing code and data – filesize 25MB).

SOM diagram

SOMs were first described by Teuvo Kohonen in Finland in 1982, and Kohonen’s work in this space has made him the most cited Finnish scientist in the world. Typically, visualisations of SOMs are colourful 2D diagrams of ordered hexagonal nodes.

Continue Reading

  • Dec 18 / 2013
  • 4
blog, data science

Online Learning Curriculum for Data Scientists

“Is there any online reading or courses I can do to get into data analysis?”

At my workplace, I get asked the question above. It is usually posed by someone with a finance background who is working as a management consultant. In this post I propose a learning path for such people to “get into data analysis”. I will assume that the prospective student is someone with decent Excel skills, who is not afraid of a VLOOKUP or a touch of VB and can throw together decent plots / dashboards using the same Microsoft package, but who has little or no knowledge of programming / command line operations.

A data scientist can be defined by Drew Conway‘s Data Science Venn diagram which suggests that data scientists must have a solid mathematical background, skills in coding and computer hacking, and a healthy mix of subject matter expertise.

Data science venn diagram

The courses mentioned below are by no means an “over a weekend” type of engagement – if you are serious about entering the world of data science as a profession, allow yourself at least 3-6 months to complete and study the content of the courses below.

Continue Reading

  • Nov 19 / 2013
  • 9
C++, ROS, Software

CSV Data Extraction Tool for ROS bag files

So you’ve been using ROS to record data from a robot that you use? And you have the data in a rosbag file? And you’ve spent a while googling to find out how to extract images, data, imu readings, gps positions, etc. out of said rosbag file?

This post provides a tool to extract data to CSV format for a number of ROS message types.

ROS (robot operating system) is a software system gaining popularity in robotics for control and automation.

CRUISE vehicle with SICK sensors for autonomous vehicle research

ROS records data in binary .bag files, or bagfiles for short. Getting data out of so-called bagfiles for analysis in MATLAB, Excel, or <insert your favourite analysis software here> isn’t the easiest thing in the world. I’ve put together a small ROS package to extract data from ROS bag files and create CSV files for use in other applications.

** update 6th July 2015 – This code has now been added to Github at https://github.com/shanealynn/ros_csv_extraction/ **

Continue Reading

  • Nov 02 / 2013
  • 0
blog, data science

Data Science Videos from Dublin WebSummit 2013

The Web Summit, Europe’s largest technology-industry conference, was held in Dublin this week. An annual event since 2010, the Web Summit attracted over ten thousand visitors from over 90 countries. The Web Summit puts Ireland on the international startup and internet scene. With speakers like Elon Musk (founder of Paypal, SpaceX, and Tesla) and representatives from new and successful internet companies such as Coursera, Stripe, Hailo, Vine and Mailbox, I was sickened not to have one of the coveted €1000 tickets. (Granted, only €1000 for the last week or so!)

The speakers were spread over 4 different stages and spoke on a broad range of topics. Some of the best talks about data science and data visualisation are embedded here:

Des Traynor – Designing Dashboards: From Data to Insights

Des Traynor, founder of Intercom, takes us through the fundamentals of good data display techniques. What are the disadvantages of bubble charts? Why not use a bar chart? What are the common tricks used by people to deceive us with data?


David Coallier – Data Science… What Even?!

Continue Reading

  • Oct 12 / 2013
  • 15
blog, R

Batch Geocoding with R and Google maps

I’ve recently wanted to geocode a large number of addresses (think circa 60k) in Ireland as part of a visualisation of the Irish property market. Geocoding can be simply achieved in R using the geocode() function from the ggmap library. The geocode function uses Google’s Geocoding API to turn addresses from text to latitude and longitude pairs very simply.

There is a usage limit on the geocoding service for free users of 2,500 addresses per IP address per day. This hard limit cannot be overcome without employing a new IP address or paying for a business account. To ease the pain of starting an R process every 2,500 addresses / day, I’ve built a script that geocodes addresses up to the API query limit every day, with a few handy features:

  • Once it hits the geocoding limit, it patiently waits for Google’s servers to let it proceed.
  • The script pings Google once per hour during the down time to start geocoding again as soon as possible.
  • A temporary file containing the current data state is maintained during the process. Should the script be interrupted, it will start again from the place it left off once any problems with the data / connection have been rectified.
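
The control flow described in the list above can be sketched as follows. This is written in Python rather than R, and is only an illustration of the wait-and-resume logic, not the script itself: the geocode callable, the QuotaExceeded exception, and the JSON state file format are all hypothetical.

```python
import json
import os
import tempfile
import time


class QuotaExceeded(Exception):
    """Raised by the (hypothetical) geocoder when the daily limit is hit."""


def geocode_all(addresses, geocode, state_path, wait_seconds=3600, sleep=time.sleep):
    # Resume from the temporary state file if an earlier run was interrupted.
    results = {}
    if os.path.exists(state_path):
        with open(state_path) as f:
            results = json.load(f)

    for address in addresses:
        if address in results:
            continue  # already geocoded in a previous run
        while True:
            try:
                results[address] = geocode(address)
                break
            except QuotaExceeded:
                sleep(wait_seconds)  # wait, then try the service again
        # Persist progress after every address so little work is lost.
        with open(state_path, "w") as f:
            json.dump(results, f)
    return results


# Demonstration with a fake geocoder that hits the "limit" on its second call.
calls = {"n": 0}


def fake_geocode(address):
    calls["n"] += 1
    if calls["n"] == 2:
        raise QuotaExceeded()
    return [53.3, -6.2]  # made-up latitude / longitude pair


state_file = os.path.join(tempfile.mkdtemp(), "geocode_state.json")
demo = geocode_all(["Address 1", "Address 2"], fake_geocode, state_file,
                   sleep=lambda seconds: None)
print(demo)
```

Running geocode_all() a second time with the same state file makes no further geocoding calls, since every address is already recorded in the file.
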
  • Sep 06 / 2013
  • 0

Learning JavaScript with CodeSchool

Codeschool paths

Codeschool offer four different streams of classes in different technologies: Ruby, JavaScript, HTML/CSS, and iOS.

As a practicing data scientist, I have regular need to present and distribute the results of analyses, or to provide descriptive statistics on data sets. Ideally these results can be presented in the form of interactive graphics, standalone applications, or as continually updating dashboards. There is a range of excellent dashboard and visualisation-building software packages available, such as Qlikview, Tableau, and Pentagon, but the license fees for such programs can run into the thousands of euro. This expense makes the delivery of such solutions prohibitively expensive for a great many small businesses that would otherwise be interested in analytics and data visualisation solutions.

The closest free solution is the brilliant Shiny package developed by RStudio. Shiny is an R package that “makes it super simple for R users to turn analyses into interactive web applications that anyone can use.” Shiny is excellent for rapid prototyping of visualisations. However, the software requires a standalone Linux server for hosting, and is still in beta release. Also, as of yet, I’ve struggled to find a way to effectively password-protect a Shiny application, which isn’t ideal for sensitive data analyses.

As such, I’ve taken to developing custom-built HTML and CSS-based interactive dashboards and visualisations. Unfortunately, there is no handy plug and play solution for this, and I’ve had to improve my knowledge in the areas of Javascript, jQuery, CSS, HTML, and server technology.

Continue Reading
