Object pooling with Python


3 replies to this topic

#1 ricardomattar


Posted 16 November 2011 - 06:37 AM

With the intent of improving performance and scalability, we often use multi-threading or parallel programming.

The problem is that not all classes and libraries are thread-safe, meaning that we cannot transparently share instances or resources between processes or threads.

Using resources that are not thread-safe requires careful management of concurrency to avoid bugs that can seem completely unpredictable.

The best scenario is when you have a resource that can be shared by several threads. In this case we would have only one resource or connection shared by everyone without concurrency issues. The world is not perfect though.

One solution is to use locks or another synchronization technique. Python's standard library has a module named threading, which provides a primitive lock object that lets you block execution on critical parts of the code, where you use shared resources that do not allow concurrency. The cost is that your code loses concurrency, performance and parallelism, and gains unwelcome complexity.
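As a minimal sketch of the lock approach, the snippet below guards a shared object with threading.Lock. The Counter class and the worker function are hypothetical stand-ins for a resource that is not thread-safe; they are not part of the FTP code discussed later.

```python
import threading

class Counter:
    """Hypothetical stand-in for a resource that is not thread-safe."""
    def __init__(self):
        self.value = 0

    def increment(self):
        # Read-modify-write: not atomic, so unsynchronized calls can lose updates.
        current = self.value
        self.value = current + 1

counter = Counter()
counter_lock = threading.Lock()

def worker(times):
    for _ in range(times):
        # Only one thread at a time may run the critical section.
        with counter_lock:
            counter.increment()

threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter.value)  # 8000, because every increment ran under the lock
```

Every caller has to remember to take the lock, and all threads queue up behind a single resource, which is the loss of parallelism described above.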

Another option would be to use more instances or more resources, though consuming more resources would not always be an option. The situation that I intend to explore is one where resources or connections are limited by a server which is not always under your control.

One example of a class that is not thread-safe is FTP, from the ftplib module of the standard library. An instance of ftplib.FTP cannot be used simultaneously by two or more threads, which leaves you with the option of establishing connections on demand or using an object pool. Establishing connections on demand can lead to performance problems or even problems with the FTP server, as connections are usually a scarce resource. FTP servers usually limit the number of connections per user and also enforce a global connection limit. This makes it difficult to use an indiscriminate number of connections and limits the parallelism of the transfers.

A particularly interesting solution is to create a pool of objects or connections. A pool of resources allows a significant performance gain and eases the management of the complexity inherent in parallel code and non-shareable resources: it encapsulates the concurrency management, removes undesired complexity from the body of the code and optimizes the use of resources.

Let's look at a practical situation.
Take the following class, FTPConn, as an example of a resource that is not shareable, that is, not thread-safe:
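import ftplib
import time
import StringIO

class FTPConn:
    """Thin wrapper around ftplib.FTP that reconnects and retries on errors."""

    def __init__(self, retries = 5):
        self.retries = retries
        self.server = None
        self.user = None
        self.password = None
        self.ftp = None

    def connect(self):
        # Open a fresh connection with the stored credentials.
        self.ftp = ftplib.FTP(self.server)
        self.ftp.login(self.user, self.password)

    def get(self, server, user, password, filename):
        # Reconnect only when the target server or the credentials change.
        if (self.server != server) or (self.user != user) or (self.password != password):
            self.server = server
            self.user = user
            self.password = password
            self.ftp = None

        ftp_errors = 0
        while True:
            # Use a fresh buffer per attempt so a failed partial transfer
            # does not end up in the data returned by a later attempt.
            sio = StringIO.StringIO()
            try:
                if self.ftp is None:
                    self.connect()
                self.ftp.retrbinary('RETR %s' % str(filename), sio.write)
                return sio.getvalue()
            except ftplib.all_errors:
                # Catch only FTP and socket errors, not KeyboardInterrupt.
                ftp_errors += 1
                if ftp_errors > self.retries:
                    raise
                # Drop the broken connection and wait before retrying.
                self.ftp = None
                time.sleep(3)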

This class only encapsulates ftplib.FTP and would be used as follows:
ftp = FTPConn()
data = ftp.get('10.0.0.1', 'user', 'password', 'filename')
We could try and use this class for several parallel transfers in the following way:
import threading

ftp = FTPConn()

def get_file(server, user, password, filename):
    print ftp.get(server, user, password, filename)
    
def test():
    for i in range(100):
        t = threading.Thread(target = get_file, args = ('10.0.0.1', 'user', 'passwd', 'filename'))
        t.start()
        
test()

This example would fail with apparently unpredictable errors and consequences, because ftplib.FTP is not thread-safe. Of course the errors would be traceable, though not easily, and handling them would be pure insanity and chaos.

The obvious problem is that all the threads are using a single instance of FTPConn and, worse, in a completely uncoordinated manner.

An easy and fast fix is to give every thread its own instance of FTPConn. The code would work at the expense of opening several FTP connections (or would not work, if the server does not allow that many connections). Anyway, it would look like this:
import threading

def get_file(server, user, password, filename):
    ftp = FTPConn()
    print ftp.get(server, user, password, filename)
    
def test():
    for i in range(100):
        t = threading.Thread(target = get_file, args = ('10.0.0.1', 'user', 'passwd', 'filename'))
        t.start()
        
test()

Whether this example works depends on the availability of connections on the server, on timeouts and on the length of each transfer. This is a solution for an ideal world, not the real one with limited resources.

Fortunately, Python's sophisticated data types and introspection capabilities allow us to build proxy classes with an object pool very easily. The implementation is actually very simple and uses a queue (Queue.Queue) to hold the FTPConn instances. All the access control and concurrency management is delegated to the queue, leaving the main code completely free from parallelism management.

FTPPool implementation and example of parallel transfer test:
import Queue
import threading

class FTPPool:
    def __init__(self, pool_size = 3):
        self.proxy_pool = Queue.Queue()
        for i in range(pool_size):
            instance = FTPConn()
            self.proxy_pool.put(instance)
            
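    def get(self, server, user, password, filename):
        # Blocks here until an FTPConn instance is available.
        proxy = self.proxy_pool.get()
        try:
            return proxy.get(server, user, password, filename)
        finally:
            # Always hand the instance back, even if the transfer failed;
            # otherwise every error would shrink the pool by one.
            self.proxy_pool.put(proxy)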

pool = FTPPool(pool_size = 10)

def get_file(server, user, password, filename):
    print pool.get(server, user, password, filename)

def test():
    for i in range(100):
        t = threading.Thread(target = get_file, args = ('10.0.0.1', 'user', 'passwd', 'filename'))
        t.start()
        
test()
Note that FTPPool has its own version of the method “get”, using the same parameters and passing the call to the instances of FTPConn that it holds. This means FTPPool is not reusable for other kinds of resources, but it is easily modifiable. Of course a generic pool can be built, but it's beyond the scope of this article.

Note that all the complexity of locking and concurrency is left to the Queue class. Configured with a suitable number of instances or connections, this connection pool allows better parallelism, less complexity in the main code and less load on the server.
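For readers curious what a generic pool might look like, here is a rough sketch under a few assumptions. It is written for Python 3, where the module is named queue rather than Queue. ObjectPool, borrow and FakeConn are hypothetical names invented for this illustration, and FakeConn merely stands in for a real connection class such as FTPConn.

```python
import queue
import threading
from contextlib import contextmanager

class ObjectPool:
    """Fixed-size pool of objects built by a caller-supplied factory."""

    def __init__(self, factory, pool_size=3):
        self._items = queue.Queue()
        for _ in range(pool_size):
            self._items.put(factory())

    @contextmanager
    def borrow(self, timeout=None):
        # Blocks until an object is free; raises queue.Empty on timeout.
        item = self._items.get(timeout=timeout)
        try:
            yield item
        finally:
            # Return the object even if the caller's code raised.
            self._items.put(item)


class FakeConn:
    """Hypothetical stand-in for a connection that is not thread-safe."""
    def fetch(self, name):
        return "contents of %s" % name


pool = ObjectPool(FakeConn, pool_size=2)
results = []
results_lock = threading.Lock()

def get_file(name):
    # At most pool_size threads are inside this block at any moment.
    with pool.borrow() as conn:
        data = conn.fetch(name)
    with results_lock:
        results.append(data)

threads = [threading.Thread(target=get_file, args=("file%d" % i,)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(results))  # 10
```

The pool knows nothing about FTP here: the caller decides which method to call on the borrowed object, so the same class could hold any kind of resource.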

Edited by Roger, 16 November 2011 - 09:51 PM.
removed email address


#2 Flying Dutchman


Posted 16 November 2011 - 08:50 AM

Excellent read!

I don't understand why you put the proxy back on the queue in the FTPPool.get method. Could you please elaborate on that?
The roots of education are bitter, but the fruit is sweet.

#3 ricardomattar


Posted 16 November 2011 - 09:42 AM

It is actually very simple.
You get an instance to use from the proxy_pool, use it and then put it back in the pool.
The Queue class blocks your execution when it is empty and you are trying to .get() something,
so when someone else has finished using a resource it will be put back ( .put()) in the Queue, so you can get it
and your thread execution continues.
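The blocking behaviour described above can be seen in a small sketch. It assumes Python 3 (module queue instead of Queue), and the "conn-1" string and the borrower function are hypothetical placeholders for a pooled resource and a worker thread.

```python
import queue
import threading
import time

pool = queue.Queue()
pool.put("conn-1")                # pool holding a single, hypothetical resource

events = []

def borrower():
    resource = pool.get()         # blocks while the pool is empty
    events.append("borrower got %s" % resource)
    pool.put(resource)            # hand it back for the next user

held = pool.get()                 # the main thread takes the only resource
t = threading.Thread(target=borrower)
t.start()
time.sleep(0.2)                   # give the borrower time to block in pool.get()
events.append("main returns %s" % held)
pool.put(held)                    # this wakes the borrower up
t.join()

print(events)  # ['main returns conn-1', 'borrower got conn-1']
```

The borrower cannot proceed until the main thread calls put(), which is why the instance must go back on the queue after each use.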

#4 Flying Dutchman


Posted 16 November 2011 - 02:46 PM

Thanks for the additional explanation. :)