I recently had an issue with a long running web process that I needed to substantially speed up due to timeouts. The delay arose because the system needed to fetch data from a number of URLs. The total number of URLs varied from user to user, and the response time for each URL was quite long (circa 1.5 seconds).

Problems arose with 10-15 URL requests taking over 20 seconds, and my server HTTP connection was timing out. Rather than extending my timeout time, I have turned to Python’s threading library. It’s easy to learn, quick to implement, and solved my problem very quickly. The system was implemented in Pythons web micro-framework Flask.

Parallel programming allows you to speed up your code execution - very useful for data science and data processing

Using Threads for a low number of tasks

Threading in Python is simple. It allows you to manage concurrent threads doing work at the same time. The library is called “threading“, you create “Thread” objects, and they run target functions for you. You can start potentially hundreds of threads that will operate in parallel. The first solution was inspired by a number of StackOverflow posts, and involves launching an individual thread for each URL request. This turned out to not be the ideal solution, but provides a good learning ground.

You first need to define a “work” function that each thread will execute separately. In this example, the work function is a “crawl” method that retrieves data from a url. Returning values from threads is not possible and, as such, in this example we pass in a globally accessible (to all threads) “results” array with the index of the array in which to store the result once fetched. The crawl() function will look like:

To actually start Threads in python, we use the “threading” library and create “Thead” objects. We can specify a target function (‘target’) and set of arguments (‘args’) for each thread and, once started, the theads will execute the function specified all in parallel. In this case, the use of threads will effectively reduce our URL lookup time to 1.5 seconds (approx) no matter how many URLs there are to check. The code to start the threaded processes is:

The only peculiarity here is the join()  function. Essentially, join() pauses the calling thread (in this case the main thread of the program) until the thread in question has finished processing. Calling join prevents our program from progressing until all URLs have been fetched.

This method of starting one thread for each task will work well unless you have a high number (many hundreds) of tasks to complete.

Using Queue for a high number of tasks

The solution outlined above operated successfully for us, with users to our web application requiring, on average, 9-11 threads per request. The threads were starting, working, and returning results successfully. Issues arose later when users required much more threaded processes (>400).  With such requests, Python was starting hundreds of threads are receiving errors like:

For these users, the original solution was not viable. There is a limit in your environment to the maximum number of threads that can be started by Python. Another of Pythons built-in libraries for threading, Queue, can be used to get around obstacle. A queue is essentially used to store a number of “tasks to be done”. Threads can take tasks from the queue when they are available, do the work, and then go back for more. In this example, we needed to ensure maximum of 50 threads at any one time, but the ability to process any number of URL requests. Setting up a queue in Python is very simple:

To return results from the threads, we will use the same technique of passing a results list along with an index for storage to each worker thread. The index needs to be included in the Queue when setting up tasks since we will not be explicitly calling each “crawl” function with arguments ( we also have no guarantee as to which order the tasks are executed).

The threaded “crawl” function will be different since it now relies on the queue. The threads are set up to close and return when the queue is empty of tasks.

The new Queue object itself is passed to the threads along with the list for storing results. The final location for each result is contained within the queue tasks – ensuring that the final “results” list is in the same order as the original “urls” list. We populate the queue with this job information:

Our tasks will now not be completely processed in parallel, but rather by 50 threads operating in parallel. Hence, 100 urls will take 2 x 1.5 seconds approx. Here, this delay was acceptable since the number of users requiring more than 50 threads is minimal. However, at least the system is flexible enough to handle any situation.

This setup is well suited for the example of non-computationally intesive input/output work (fetching URLs), since much of the threads time will be spent waiting for data. In data-intensive or data science work, the multiprocessing or celery libraries can be better suited since they split work across multiple CPU cores. Hopefully the content above gets you on the right track!

Further information on Python Threading

There is some great further reading on threads and the threading module if you are looking for more in-depth information:

 

Leave a Reply