The problem is that not all classes and libraries are thread-safe, meaning that we cannot transparently share instances or resources between processes or threads.
Using resources that are not thread-safe requires careful, thoughtful concurrency management to avoid bugs that can sometimes seem completely unpredictable.
The best scenario is when you have a resource that can be shared by several threads. In this case we would have only one resource or connection shared by everyone without concurrency issues. The world is not perfect though.
One solution is to use locks or another synchronization technique. Python's standard library has a module named threading, which provides a primitive lock object that lets you block execution around the critical parts of the code, where you use shared resources that do not allow concurrency. The cost is that your code loses concurrency, performance and parallelism, besides gaining unpleasant and undesired complexity.
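As a rough illustration of the locking approach, the sketch below guards a shared object with threading.Lock. SharedResource is a hypothetical stand-in for any object that is not thread-safe (it is not part of ftplib), and the code is written for Python 3.

```python
import threading
import time

class SharedResource:
    """Hypothetical stand-in for an object that is not thread-safe."""
    def __init__(self):
        self.busy = False
        self.calls = 0
        self.overlaps = 0   # times a thread entered use() while another was inside

    def use(self):
        if self.busy:
            self.overlaps += 1
        self.busy = True
        time.sleep(0.001)   # pretend to do some I/O
        self.calls += 1
        self.busy = False

resource = SharedResource()
lock = threading.Lock()

def worker():
    # Only one thread at a time may enter the critical section.
    with lock:
        resource.use()

def run(n_threads=20):
    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return resource.calls, resource.overlaps

calls, overlaps = run()
```

The lock keeps the resource consistent, but every call now waits for the previous one to finish, so the 20 calls effectively run one after another. That is the loss of parallelism described above.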
Another option would be to use more instances or more resources, though consuming more resources is not always possible. The situation I intend to explore is one where resources or connections are limited by a server that is not always under your control.
One example of a class that is not thread-safe is FTP, from the standard library module ftplib. An instance of ftplib.FTP cannot be used simultaneously by two or more threads, which leaves you with the option of establishing connections on demand or using an object pool. Establishing connections on demand can lead to performance problems or even problems with the FTP server, as connections are usually a scarce resource. FTP servers usually have a limited number of connections per user and a global connection limit. This makes it difficult to use an indiscriminate number of connections and limits the parallelism of the transfers.
A particularly interesting solution is to create a pool of objects or connections. A resource pool allows a significant performance gain and eases the management of the complexity inherent in parallel code and non-shareable resources, because it encapsulates the concurrency management, removes undesired complexity from the body of the code and optimizes the use of resources.
Let's see a practical situation.
Take the following class, FTPConn, as an example of a non-shareable (not thread-safe) resource:
import ftplib
import time
import StringIO

class FTPConn:
    def __init__(self, retries = 5):
        self.retries = retries
        self.server = None
        self.user = None
        self.password = None

    def get(self, server, user, password, filename):
        # (Re)connect only when the server or the credentials change.
        if (self.server != server) or (self.user != user) or (self.password != password):
            self.server = server
            self.user = user
            self.password = password
            self.ftp = ftplib.FTP(self.server)
            self.ftp.login(self.user, self.password)
        ftp_errors = 0
        while True:
            # Fresh buffer on every attempt, so a failed transfer
            # does not leave partial data in front of the retry.
            sio = StringIO.StringIO()
            try:
                self.ftp.retrbinary('RETR %s' % str(filename), sio.write)
                data = sio.getvalue()
                break
            except ftplib.all_errors:
                ftp_errors += 1
                time.sleep(3)
                if ftp_errors > self.retries:
                    raise
                # Reconnect before trying again.
                self.ftp = ftplib.FTP(self.server)
                self.ftp.login(self.user, self.password)
        return data
This class simply wraps ftplib.FTP and would be used as follows:
ftp = FTPConn()
data = ftp.get('10.0.0.1', 'user', 'password', 'filename')

We could try to use this class for several parallel transfers in the following way:
import threading

ftp = FTPConn()

def get_file(server, user, password, filename):
    print ftp.get(server, user, password, filename)

def test():
    for i in range(100):
        t = threading.Thread(target = get_file, args = ('10.0.0.1', 'user', 'passwd', 'filename'))
        t.start()

test()
This example would fail with apparently unpredictable errors and consequences, because ftplib.FTP is not thread-safe. Of course the errors would be traceable, though not easily. And the error handling would be pure insanity and chaos.
The obvious problem is that all the threads are using a single instance of FTPConn and, worse, in a completely uncoordinated manner.
An easy and fast fix is to give every thread its own instance of FTPConn. The code would work at the expense of several FTP connections (or not, if the server does not allow that many connections). Anyway, it would look like this:
import threading

def get_file(server, user, password, filename):
    ftp = FTPConn()
    print ftp.get(server, user, password, filename)

def test():
    for i in range(100):
        t = threading.Thread(target = get_file, args = ('10.0.0.1', 'user', 'passwd', 'filename'))
        t.start()

test()
This example would work or not depending on the availability of connections on the server, timeouts and each transfer's length. This is a solution for an ideal world, not the real one with limited resources.
Fortunately, Python's sophisticated data types and introspection capabilities allow us to build proxy classes with an object pool very easily. The implementation is actually very simple and uses a queue (Queue.Queue) to hold the FTPConn instances. All the access control and concurrency management is delegated to the queue, leaving the main code completely free from the parallelism management.
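To make the queue's role concrete, here is a small sketch of the blocking behaviour a pool relies on. It is written for Python 3, where the Queue module is named queue, and the strings are just placeholders standing in for connection objects.

```python
import queue   # named Queue in Python 2

pool = queue.Queue()
pool.put("conn-1")      # placeholder objects standing in for connections
pool.put("conn-2")

a = pool.get()          # an item is available, so this returns at once
b = pool.get()          # same here; the queue hands items out in FIFO order

# The pool is now empty. A plain pool.get() would block until some other
# thread calls put(); with a timeout it raises queue.Empty instead.
try:
    pool.get(timeout=0.1)
    got_third = True
except queue.Empty:
    got_third = False

pool.put(a)             # hand a connection back to the pool
c = pool.get()          # and it can be borrowed again
```

A thread that asks for a connection while none is free simply waits inside get() until another thread puts one back, so no explicit lock is needed in the calling code.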
FTPPool implementation and example of parallel transfer test:
import Queue
import threading

class FTPPool:
    def __init__(self, pool_size = 3):
        self.proxy_pool = Queue.Queue()
        for i in range(pool_size):
            instance = FTPConn()
            self.proxy_pool.put(instance)

    def get(self, server, user, password, filename):
        # Blocks until an FTPConn instance is free.
        proxy = self.proxy_pool.get()
        try:
            result = proxy.get(server, user, password, filename)
        finally:
            # Always hand the instance back, even if the transfer failed,
            # otherwise the pool would shrink on every error.
            self.proxy_pool.put(proxy)
        return result

pool = FTPPool(pool_size = 10)

def get_file(server, user, password, filename):
    print pool.get(server, user, password, filename)

def test():
    for i in range(100):
        t = threading.Thread(target = get_file, args = ('10.0.0.1', 'user', 'passwd', 'filename'))
        t.start()

test()

Note that FTPPool has its own version of the method “get”, using the same parameters and passing the call to the instances of FTPConn that it holds. In this way FTPPool is not reusable for other kinds of resources, but it is easily modifiable. Of course a generic pool can be built, but it's beyond the scope of this article.
Note that all the complexity of locking and concurrency is left for the Queue class to manage. This connection pool, configured with a suitable number of instances or connections, allows better parallelism, less complexity in the main code and less load on the server.
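For readers curious about the generic pool mentioned above, one possible shape is sketched here. It is only a sketch under assumptions: ObjectPool and its borrow() method are hypothetical names, plain dicts stand in for connections, and the code is written for Python 3.

```python
import queue
import threading
from contextlib import contextmanager

class ObjectPool:
    """Hypothetical generic pool; factory is any zero-argument callable."""
    def __init__(self, factory, size=3):
        self._items = queue.Queue()
        for _ in range(size):
            self._items.put(factory())

    @contextmanager
    def borrow(self):
        item = self._items.get()       # blocks while the pool is empty
        try:
            yield item
        finally:
            self._items.put(item)      # returned even if the caller raises

# Demo: 30 threads share 3 pooled objects.
pool = ObjectPool(factory=dict, size=3)
in_use = 0
max_in_use = 0
counter_lock = threading.Lock()        # protects the two demo counters only

def worker():
    global in_use, max_in_use
    with pool.borrow() as conn:
        with counter_lock:
            in_use += 1
            max_in_use = max(max_in_use, in_use)
        # conn is held by this thread alone, so no lock is needed here.
        conn["uses"] = conn.get("uses", 0) + 1
        with counter_lock:
            in_use -= 1

threads = [threading.Thread(target=worker) for _ in range(30)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With the FTPConn class from this article, the same idea would read roughly as pool = ObjectPool(FTPConn, size=10), followed by "with pool.borrow() as ftp:" around each ftp.get(...) call. The pool size caps how many connections are ever in use at once, whatever the number of threads.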
Edited by Roger, 16 November 2011 - 09:51 PM.