Cyclist in the rain. Blog about python scraping data from wunderground rainfall data.

The most recent post on this site was an analysis of how often people cycling to work actually get rained on in different cities around the world. You can check it out here.

The analysis was completed using data from the Wunderground weather website and Python, specifically the Pandas and Seaborn libraries. In this post, I will provide the Python code to replicate the work and analyse information for your own city. During the analysis, I used Jupyter notebooks to interactively explore and cleanse the data; setup is simple if you use something like the Anaconda Python distribution to install everything you need.

If you want to skip data downloading and scraping, all of the data I used is available to download here.

Scraping Weather Data

Wunderground.com has a “Personal Weather Station (PWS)” network for which fantastic historical weather data is available – covering temperature, pressure, wind speed and direction, and of course rainfall in mm – all available on a per-minute level. Individual stations can be examined at specific URLs, for example here for station “IDUBLIND35”.

There’s no official API for the PWS stations that I could see, although there is a very good API for forecast data. However, CSV-format data with hourly rainfall, temperature, and pressure information can be downloaded from the website with some simple Python scripts.
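As a rough sketch of such a script: the URL template below is the one quoted in the comments on this post, while the function names, the "<br>" clean-up, and the one-second pause are assumptions made for illustration. Wunderground may have changed or retired this endpoint since, so treat it as a starting point rather than a guaranteed recipe.

```python
import io
import time
import urllib.request

import pandas as pd

# URL template as quoted in the comments on this post.
URL_TEMPLATE = (
    "http://www.wunderground.com/weatherstation/WXDailyHistory.asp"
    "?ID={station}&day={day}&month={month}&year={year}&graphspan=day&format=1"
)


def build_url(station, day, month, year):
    """Fill the template for one station and one calendar day."""
    return URL_TEMPLATE.format(station=station, day=day, month=month, year=year)


def parse_station_csv(text, station):
    """Turn raw response text into a DataFrame tagged with the station ID.

    Assumption: the response is CSV that may carry stray "<br>" markers,
    which are stripped before parsing.
    """
    cleaned = text.replace("<br>", "")
    frame = pd.read_csv(io.StringIO(cleaned), index_col=False)
    frame["station"] = station
    return frame


def fetch_day(station, day, month, year):
    """Download and parse one day of observations (needs network access)."""
    with urllib.request.urlopen(build_url(station, day, month, year)) as response:
        text = response.read().decode("utf-8")
    return parse_station_csv(text, station)


def fetch_range(station, start, end, pause_seconds=1.0):
    """Fetch every day from start to end (inclusive) and stack the results."""
    frames = []
    for day in pd.date_range(start, end, freq="D"):
        frames.append(fetch_day(station, day.day, day.month, day.year))
        time.sleep(pause_seconds)  # be polite to the server
    return pd.concat(frames, ignore_index=True)
```

Something like `fetch_range("IDUBLIND35", "2015-01-01", "2015-12-31").to_csv("IDUBLIND35_weather.csv")` would then store a year of data for one station, provided the station actually reports the fields you need.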

The hardest part here is to actually find stations that contain enough information for your analysis – you’ll need to switch to “yearly view” on the website to find stations that have been around more than a few months, and that record all of the information you want. If you’re looking for temperature info – you’re laughing, but precipitation records are more sparse.

Wunderground have an excellent site with interactive graphs to look at weather data on a daily, monthly, and yearly level. Data is also available to download in CSV format, which is great for data science purposes.


Cleansing and Data Processing

The data downloaded from Wunderground needs a little bit of work. Again, if you want the raw data, it’s here. Ultimately, we want to work out when it’s raining at certain times of the day and aggregate this result to daily, monthly, and yearly levels. As such, we use Pandas to add month, year, and date columns. Simple stuff in preparation, and we can then output plots as required.
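A minimal sketch of that preparation step is shown here. The column names "Time" and "HourlyPrecipMM" are assumptions about the downloaded CSV, and the extra "hour" and "raining" columns are added for convenience; adjust all of them to match your own file.

```python
import pandas as pd


def add_date_parts(frame, time_col="Time", rain_col="HourlyPrecipMM"):
    """Parse timestamps and add date, month, year, hour and a raining flag.

    Column names are assumptions; change the defaults to match your CSV.
    """
    out = frame.copy()
    out[time_col] = pd.to_datetime(out[time_col])
    out["date"] = out[time_col].dt.date
    out["month"] = out[time_col].dt.month
    out["year"] = out[time_col].dt.year
    out["hour"] = out[time_col].dt.hour
    # Treat any positive precipitation rate as "raining".
    out["raining"] = out[rain_col] > 0
    return out


# Two hand-written rows, purely to show the shape of the result.
raw = pd.DataFrame({
    "Time": ["2015-01-31 08:15:00", "2015-02-01 17:40:00"],
    "HourlyPrecipMM": [0.0, 1.2],
})
clean = add_date_parts(raw)
```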

At this point, the dataset is relatively clean, and ready for analysis. If you are not familiar with grouping and aggregation procedures in Python and Pandas, here is another blog post on the topic.

Data after cleansing from Wunderground.com. This data is now in good format for grouping and visualisation using Pandas.

Data summarisation and aggregation

With the data cleansed, we now have non-uniform samples of the weather at a given station throughout the year, at a sub-hour level. To make meaningful plots on this data, we can aggregate over the days and months to gain an overall view and to compare across stations.
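One way this aggregation could look is sketched below. The commute windows (08:00 to 09:00 and 17:00 to 18:00) and the column names are assumptions chosen for illustration, not necessarily the values used for the original analysis.

```python
import pandas as pd

# Hypothetical commute windows as (start_hour, end_hour), end exclusive.
MORNING = (8, 9)
EVENING = (17, 18)


def in_window(hours, window):
    """Boolean mask for hours falling inside a [start, end) window."""
    start, end = window
    return (hours >= start) & (hours < end)


def summarise(frame):
    """Return (daily, monthly) summaries from cleansed observations.

    Expects a datetime column "Time" and a numeric "HourlyPrecipMM"
    (both assumed names).
    """
    obs = frame.copy()
    hours = obs["Time"].dt.hour
    wet = obs["HourlyPrecipMM"] > 0
    obs["wet_morning"] = wet & in_window(hours, MORNING)
    obs["wet_evening"] = wet & in_window(hours, EVENING)
    obs["date"] = obs["Time"].dt.normalize()

    # One row per day: did any sample in each window record rain?
    daily = obs.groupby("date")[["wet_morning", "wet_evening"]].any()
    daily["wet_commute"] = daily["wet_morning"] | daily["wet_evening"]

    # One row per month: days with a wet commute, and days with data.
    monthly = (
        daily.groupby(daily.index.month)["wet_commute"]
        .agg(wet_days="sum", days="count")
    )
    monthly.index.name = "month"
    return daily, monthly


# A few hand-written samples to exercise the function.
obs = pd.DataFrame({
    "Time": pd.to_datetime([
        "2015-01-05 08:10", "2015-01-05 12:00",
        "2015-01-06 08:30", "2015-01-06 17:45",
        "2015-02-02 17:05",
    ]),
    "HourlyPrecipMM": [0.5, 2.0, 0.0, 0.0, 0.3],
})
daily, monthly = summarise(obs)
```

Because the samples are non-uniform, using `any()` per window avoids giving more weight to days that simply have more readings.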

At this point, we have two basic data frames which we can use to visualise patterns for the city being analysed.

Visualisation using Pandas and Seaborn

At this point, we can start to plot the data. It’s well worth reading the documentation on plotting with Pandas, and looking over the API of Seaborn, a high-level data visualisation library built on top of matplotlib.

This is not a tutorial on how to plot with Seaborn or Pandas (that’ll be a separate blog post), but rather instructions on how to reproduce the plots shown on this blog post.

Barchart of Monthly Rainy Cycles

The monthly summarised rainfall data is the source for this chart.
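A bar chart along these lines can be drawn with the plain Pandas plotting interface, as in the sketch below; Seaborn's `barplot` would work just as well. The `wet_days` column name is an assumption, and the numbers are placeholders, not the Dublin results.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line inside a notebook
import pandas as pd


def plot_monthly(monthly, column="wet_days"):
    """Bar chart of wet-commute days per month; returns the matplotlib Axes."""
    ax = monthly[column].plot(kind="bar", figsize=(10, 4), color="steelblue")
    ax.set_xlabel("Month")
    ax.set_ylabel("Days with a wet commute")
    ax.set_title("Rainy commutes per month")
    return ax


# Placeholder counts purely for illustration.
monthly = pd.DataFrame(
    {"wet_days": [3, 2, 4, 1, 2, 1, 1, 2, 3, 3, 4, 5]},
    index=pd.Index(range(1, 13), name="month"),
)
ax = plot_monthly(monthly)
ax.figure.tight_layout()
```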

Number of days monthly when cyclists get wet commuting at typical work times in Dublin, Ireland.

Heatmaps of Rainfall and Rainy Hours per day

The heatmaps shown on the blog post are generated using the “calmap” Python library, installable using pip. Simply import the library and form a Pandas series with a DatetimeIndex, and the library takes care of the rest. I had some difficulty here with font sizes, so I had to increase the overall size of the plot to compensate.
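A hedged sketch of that workflow follows. `calmap.yearplot` is the library's single-year plotting function; the helper names, the `hours_raining` column, and the figure size are assumptions. The large `figsize` is one way to work around the font-size issue mentioned above.

```python
import pandas as pd


def to_daily_series(daily, column):
    """Return one column as a Series with a sorted DatetimeIndex, as calmap expects."""
    series = daily[column].copy()
    series.index = pd.to_datetime(series.index)
    return series.sort_index()


def plot_year_heatmap(series, year, figsize=(16, 4)):
    """Draw a calendar heatmap for one year and return the figure."""
    # Imported here so the helper above works without calmap installed.
    import calmap  # third-party: pip install calmap
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=figsize)
    calmap.yearplot(series, year=year, ax=ax)
    return fig


# Hand-written example with string dates in arbitrary order.
daily = pd.DataFrame(
    {"hours_raining": [2, 0]},
    index=["2015-01-02", "2015-01-01"],
)
series = to_daily_series(daily, "hours_raining")
# fig = plot_year_heatmap(series, year=2015)
```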

Hours raining per day heatmap

The Calmap package is very useful for generating heatmaps. Note that if you have highly outlying points of data, these will skew your color mapping considerably – I’d advise removing or reducing them for visualisation purposes.

Total Daily Rainfall Heatmap

Heatmap of total daily rainfall over 2015. Note that if you are looking at rainfall data like this, outlying values such as that in August in this example will skew the overall visualisation and reduce the colour resolution of smaller values. It’s best to normalise the data or reduce the outliers prior to plotting.
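One simple way to reduce outliers before plotting is to cap values at a chosen quantile, as sketched here. The function name and the default quantile are assumptions; pick a threshold that suits your data.

```python
import pandas as pd


def cap_outliers(series, quantile=0.95):
    """Cap values above the given quantile so they no longer dominate the colour scale."""
    ceiling = series.quantile(quantile)
    return series.clip(upper=ceiling)


# One extreme day among otherwise small values.
rain = pd.Series([0.0, 1.0, 2.0, 3.0, 60.0])
capped = cap_outliers(rain, quantile=0.75)
```

Here the 60 mm day is pulled down to the 75th-percentile value of 3 mm, so the remaining days spread across the full colour range. Passing a `vmax` value to the calmap plotting call is another option that leaves the data itself untouched.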

Exploratory Line Plots

Remember that Pandas can be used on its own for quick visualisations of the data – this is really useful for error checking and sense checking your results. For example:

Quickly view and analyse your data with Pandas straight out of the box. The .plot() command will plot each column against the index by default, but you can specify x and y variables as required.
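For instance, with a small made-up daily frame (the column names and values are placeholders):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line inside a notebook
import pandas as pd

daily = pd.DataFrame(
    {"rain_mm": [0.0, 4.2, 1.1, 0.0], "hours_raining": [0, 5, 2, 0]},
    index=pd.date_range("2015-01-01", periods=4, freq="D"),
)

# With no arguments, each numeric column is drawn as a line against the index.
ax_all = daily.plot()

# Or name the x and y columns explicitly, here as a scatter plot.
ax_xy = daily.plot(x="hours_raining", y="rain_mm", kind="scatter")
```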

Comparison of Every City in Dataset

To compare every city in the dataset, summary stats for each city were calculated in advance and then the plot was generated using the Seaborn library. To achieve this as quickly as possible, I wrapped the entire data preparation and cleansing phase described above into a single function called “analyse_data”, used this function on each city’s dataset, and extracted the pieces of information needed for the plot.

Here’s the wrapped analyse_data function:
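A minimal sketch of what such a wrapper might look like is given below. The column names, the commute windows, and the returned keys are assumptions for illustration, and weekends are not excluded in this simplified version.

```python
import pandas as pd


def analyse_data(frame, morning=(8, 9), evening=(17, 18)):
    """Cleanse one station's observations and summarise wet commutes.

    Windows are (start_hour, end_hour) with the end exclusive.
    Column names and windows are assumptions; adjust to your data.
    """
    obs = frame.copy()
    obs["Time"] = pd.to_datetime(obs["Time"])
    hours = obs["Time"].dt.hour
    wet = obs["HourlyPrecipMM"] > 0
    obs["wet_morning"] = wet & hours.between(morning[0], morning[1] - 1)
    obs["wet_evening"] = wet & hours.between(evening[0], evening[1] - 1)
    obs["date"] = obs["Time"].dt.normalize()

    daily = obs.groupby("date")[["wet_morning", "wet_evening"]].any()
    # Each day holds two commutes; count every leg that saw rain.
    wet_commutes = int(daily["wet_morning"].sum() + daily["wet_evening"].sum())
    total_commutes = 2 * len(daily)
    percent_wet = 100.0 * wet_commutes / total_commutes if total_commutes else 0.0
    return {
        "daily": daily,
        "wet_commutes": wet_commutes,
        "total_commutes": total_commutes,
        "percent_wet": percent_wet,
    }


# Hand-written samples: one wet morning and one wet evening over two days.
sample = pd.DataFrame({
    "Time": ["2015-01-05 08:10", "2015-01-05 17:20",
             "2015-01-06 08:30", "2015-01-06 17:45"],
    "HourlyPrecipMM": [0.5, 0.0, 0.0, 1.0],
})
result = analyse_data(sample)
```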

The following code was used to individually analyse the raw data for each city in turn. Note that this could be done in a more memory-efficient manner by saving only the aggregate statistics for each city rather than loading everything into memory. I would recommend that approach if you are dealing with many more cities.
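The memory-efficient variant mentioned above could be sketched as follows: each city's CSV is loaded, reduced to a handful of numbers, and discarded before the next one is read. The `summarise_cities` name and the mapping of city names to file paths are hypothetical.

```python
import pandas as pd


def summarise_cities(city_files, analyse):
    """Run `analyse` on each city's CSV and keep only the small summary.

    city_files maps a city name to a CSV path (hypothetical layout);
    `analyse` takes a DataFrame and returns a dict of scalar statistics.
    """
    rows = []
    for city, path in city_files.items():
        frame = pd.read_csv(path)  # only one city's raw data in memory at a time
        stats = analyse(frame)
        rows.append({"city": city, **stats})
    return pd.DataFrame(rows).set_index("city")
```

The `analyse` argument could be a thin wrapper around your own analysis function that returns only scalar values such as the percentage of wet commutes.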

The final step in the process is to actually create the diagram using Seaborn.
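One possible form for this chart is a horizontal bar plot sorted from wettest to driest, sketched below with `seaborn.barplot`. The `percent_wet` column, the index name `city`, and the placeholder values are assumptions, not the figures from the original analysis.

```python
import pandas as pd


def order_for_plot(summary, value_col="percent_wet"):
    """Sort cities from wettest to driest and expose the city as a column."""
    return summary.sort_values(value_col, ascending=False).reset_index()


def plot_city_comparison(summary, value_col="percent_wet"):
    """Horizontal bar chart of the wet-commute percentage per city."""
    # Imported here so the sorting helper works without Seaborn installed.
    import matplotlib.pyplot as plt
    import seaborn as sns  # third-party: pip install seaborn

    ordered = order_for_plot(summary, value_col)
    fig, ax = plt.subplots(figsize=(8, 0.4 * len(ordered) + 1))
    sns.barplot(data=ordered, x=value_col, y="city", ax=ax)
    ax.set_xlabel("Commutes with rain (%)")
    ax.set_ylabel("")
    return fig


# Placeholder cities and values purely for illustration.
summary = pd.DataFrame(
    {"percent_wet": [10.0, 30.0, 20.0]},
    index=pd.Index(["City A", "City B", "City C"], name="city"),
)
ordered = order_for_plot(summary)
# fig = plot_city_comparison(summary)
```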

Percentage of times you got wet cycling to work in 2015 for cities globally. Galway comes out consistently as one of the wettest places for a cycling commute in the data available, but 2015 was a particularly bad year for Irish weather. Here's hoping for 2016.

If you do use this code in any of your work, please do let me know!

    • Hi Andrea, great that it’s useful. You sometimes find errors depending on the weather station that you specify – some of the stations are missing data for individual dates, or don’t have rainfall data at all! This might be the error source.

  1. Shane,
    thanks for your reply. I am using only the first code snippet, which should download the data and store it in a csv file (I am not after rainfall but temperature and humidity).

    At first I thought the station I selected did not have the data, so I extracted the url:
    “http://www.wunderground.com/weatherstation/WXDailyHistory.asp?ID={station}&day={day}&month={month}&year={year}&graphspan=day&format=1”
    from your code and replaced with:
    https://www.wunderground.com/weatherstation/WXDailyHistory.asp?ID=ILONDONL28&day=01&month=08&year=2016&graphspan=day&format=1
    and guess what: there is data! How can that be explained?

      • I think when you add:
        import os
        below “import time ” the script will work fine with Python3+

        If this does not help remove the second part with “data/” from the code below. This will save the csv in the same directory your code runs.

        pd.concat(data[station]).to_csv("data/{}_weather.csv".format(station))
        =>
        pd.concat(data[station]).to_csv("{}_weather.csv".format(station))

  2. Rob,
    when you suggested that “maybe import io might help” I visited https://www.import.io and could not figure out how to implement any of the website functions with the script.

    I have finally realised you meant: add “import io” to your script! It now works!
    Thanks!
