Object pooling with Python

3 replies to this topic

#1
ricardomattar

    Newbie

  • Members
  • 2 posts
With the intent of improving performance and scalability, we often use multi-threading or parallel programming.

The problem is that not all classes and libraries are thread-safe, meaning that we cannot transparently share instances or resources between processes or threads.

Using resources that are not thread-safe requires careful management of concurrency to avoid bugs that can seem completely unpredictable.

The best scenario is when you have a resource that can be shared by several threads. In this case we would have only one resource or connection shared by everyone without concurrency issues. The world is not perfect though.

One solution is to use locks or another synchronization technique. Python's standard library has a module named threading, which provides a primitive lock object that lets you block execution in the critical sections of the code, where you use shared resources that do not allow concurrent access. The cost is that your code loses concurrency, performance and parallelism, and gains unwelcome complexity.
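
As a rough illustration of the locking approach, here is a minimal sketch written for Python 3. The Counter class, counter_lock, worker and run are hypothetical names made up for this example; Counter simply stands in for any resource that is not thread-safe.

```python
import threading


class Counter:
    """Hypothetical stand-in for a resource that is not thread-safe."""

    def __init__(self):
        self.value = 0

    def increment(self):
        # Read-modify-write: unsafe if two threads interleave here.
        current = self.value
        self.value = current + 1


counter = Counter()
counter_lock = threading.Lock()


def worker(iterations):
    for _ in range(iterations):
        # Only one thread at a time may run the body of this block.
        with counter_lock:
            counter.increment()


def run(num_threads=8, iterations=10000):
    threads = [threading.Thread(target=worker, args=(iterations,))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value
```

The lock keeps the result correct, but every thread now waits its turn at the critical section, which is the loss of parallelism described above.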

Another option would be to use more instances or more resources, though consuming more resources would not always be an option. The situation that I intend to explore is one where resources or connections are limited by a server which is not always under your control.

One example of a class that is not thread-safe is FTP, from the standard library module ftplib. An instance of ftplib.FTP cannot be used simultaneously by two or more threads, which leaves you with the options of establishing connections on demand or using an object pool. Establishing connections on demand can lead to performance problems or even problems with the FTP server, as connections are usually a scarce resource: FTP servers usually enforce a per-user connection limit and a global connection limit. This makes it difficult to open an arbitrary number of connections and limits the parallelism of the transfers.

A particularly interesting solution is to create a pool of objects or connections. A pool of resources can improve performance significantly and eases the management of the complexity inherent in parallel code and non-shareable resources, because it encapsulates the concurrency management, removes that complexity from the body of the code and optimizes the use of resources.

Let's see a practical situation.
Take the following class, FTPConn, as an example of a non-shareable (not thread-safe) resource:
import ftplib
import time
import StringIO


class FTPConn:
    def __init__(self, retries = 5):
        self.retries = retries
        self.server = None
        self.user = None
        self.password = None
        self.ftp = None

    def get(self, server, user, password, filename):
        # (Re)connect only when the server or the credentials change.
        if (self.server != server) or (self.user != user) or (self.password != password):
            self.ftp = ftplib.FTP(server)
            self.ftp.login(user, password)
            # Remember the settings only after a successful login, so a
            # failed connection is attempted again on the next call.
            self.server = server
            self.user = user
            self.password = password

        ftp_errors = 0
        while True:
            # Start every attempt with an empty buffer, so a failed
            # transfer cannot leave partial data in the result.
            sio = StringIO.StringIO()
            try:
                self.ftp.retrbinary('RETR %s' % str(filename), sio.write)
                data = sio.getvalue()
                break
            except ftplib.all_errors:
                ftp_errors += 1
                if ftp_errors > self.retries:
                    raise
                # Wait, then reconnect before trying again.
                time.sleep(3)
                self.ftp = ftplib.FTP(self.server)
                self.ftp.login(self.user, self.password)
        return data


This class only encapsulates ftplib.FTP and would be used as follows:
ftp = FTPConn()
data = ftp.get('10.0.0.1', 'user', 'password', 'filename')

We could try and use this class for several parallel transfers in the following way:
import threading


# A single FTPConn shared by every thread: this is the problem.
ftp = FTPConn()


def get_file(server, user, password, filename):
    print ftp.get(server, user, password, filename)


def test():
    for i in range(100):
        t = threading.Thread(target = get_file, args = ('10.0.0.1', 'user', 'passwd', 'filename'))
        t.start()


test()


This example would fail with apparently unpredictable errors and consequences, because ftplib.FTP is not thread-safe. Of course the errors would be traceable, though not easily, and handling them would be pure insanity and chaos.

The obvious problem is that all the threads are using a single instance of FTPConn and, worse, in a completely uncoordinated manner.

An easy and fast fix is to give every thread its own instance of FTPConn. The code would work at the expense of several FTP connections (or not, if the server does not allow that many connections). Anyway, it would look like this:
import threading


def get_file(server, user, password, filename):
    # Each thread creates its own FTPConn, and so its own connection.
    ftp = FTPConn()
    print ftp.get(server, user, password, filename)


def test():
    for i in range(100):
        t = threading.Thread(target = get_file, args = ('10.0.0.1', 'user', 'passwd', 'filename'))
        t.start()


test()

This example may or may not work, depending on the availability of connections on the server, timeouts and the length of each transfer. This is a solution for an ideal world, not the real one with limited resources.

Fortunately, Python's sophisticated data types and introspection capabilities allow us to build proxy classes with an object pool very easily. The implementation is actually very simple and uses a queue (Queue.Queue) to hold the FTPConn instances. All the access control and concurrency management is delegated to the queue, leaving the main code completely free from the parallelism management.

FTPPool implementation and example of parallel transfer test:
import Queue
import threading


class FTPPool:
    def __init__(self, pool_size = 3):
        self.proxy_pool = Queue.Queue()
        for i in range(pool_size):
            instance = FTPConn()
            self.proxy_pool.put(instance)

    def get(self, server, user, password, filename):
        # Blocks until an FTPConn instance is free.
        proxy = self.proxy_pool.get()
        try:
            return proxy.get(server, user, password, filename)
        finally:
            # Put the instance back even if the transfer failed, so the
            # pool never shrinks.
            self.proxy_pool.put(proxy)


pool = FTPPool(pool_size = 10)


def get_file(server, user, password, filename):
    print pool.get(server, user, password, filename)


def test():
    for i in range(100):
        t = threading.Thread(target = get_file, args = ('10.0.0.1', 'user', 'passwd', 'filename'))
        t.start()


test()

Note that FTPPool has its own version of the method “get”, which takes the same parameters and passes the call on to one of the FTPConn instances that it holds. As a result FTPPool is not reusable for other kinds of resources, but it is easily modifiable. Of course a generic pool can be built, but it's beyond the scope of this article.
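
For readers who are curious anyway, one possible shape for such a generic pool is sketched below. This is only an illustration written for Python 3 (where the module is spelled queue), not part of the FTPPool above; the names ObjectPool, factory and borrow are invented for this sketch.

```python
import contextlib
import queue  # this module is named Queue in Python 2


class ObjectPool:
    """Hypothetical generic pool holding pool_size objects built by factory."""

    def __init__(self, factory, pool_size=3):
        self._items = queue.Queue()
        for _ in range(pool_size):
            self._items.put(factory())

    @contextlib.contextmanager
    def borrow(self):
        # Blocks until one of the pooled objects is available.
        item = self._items.get()
        try:
            yield item
        finally:
            # Always hand the object back, even if the caller raised.
            self._items.put(item)


# Usage with a trivial factory (plain lists stand in for real resources).
pool = ObjectPool(list, pool_size=2)
with pool.borrow() as item:
    item.append('used')
```

With a Python 3 port of the class from this article, the same idea would presumably read ObjectPool(FTPConn, pool_size=10), followed by "with pool.borrow() as ftp:" around each call to ftp.get(...).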

Note that all the complexity of locking and concurrency is left for the Queue class to manage. Configured with a suitable number of instances or connections, this connection pool allows better parallelism, less complexity in the main code and less load on the server.

Edited by Roger, 16 November 2011 - 09:51 PM.
removed email address


#2
Flying Dutchman

    Programming God

  • Members
  • 889 posts
  • Location:::1
Excellent read!

I don't understand why you put the proxy back on the queue in the FTPPool.get method. Could you please elaborate on that?
A conclusion is where you got tired of thinking.
#define class struct    // All is public.

#3
ricardomattar

    Newbie

  • Members
  • 2 posts
It is actually very simple.
You get an instance to use from the proxy_pool, use it and then put it back in the pool.
The Queue class blocks your execution when it is empty and you are trying to .get() something,
so when someone else has finished using a resource it will be put back (.put()) in the Queue, so you can get it
and your thread's execution continues.
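
The blocking behaviour can be seen in a small sketch like the one below, written for Python 3 (where the module is spelled queue). The names borrower, demo and events, and the 'connection-1' string standing in for a real connection, are made up for this illustration.

```python
import queue  # this module is named Queue in Python 2
import threading
import time

pool = queue.Queue()
pool.put('connection-1')  # a pool that holds a single resource

events = []


def borrower(name, hold_seconds, got_it=None):
    resource = pool.get()         # blocks while the pool is empty
    events.append((name, 'got', resource))
    if got_it is not None:
        got_it.set()              # tell the caller the resource is taken
    time.sleep(hold_seconds)      # pretend to use the resource
    events.append((name, 'returned', resource))
    pool.put(resource)            # lets a waiting thread continue


def demo():
    taken = threading.Event()
    first = threading.Thread(target=borrower, args=('first', 0.2, taken))
    second = threading.Thread(target=borrower, args=('second', 0.0))
    first.start()
    taken.wait()                  # make sure 'first' holds the resource
    second.start()                # 'second' now blocks inside pool.get()
    first.join()
    second.join()
    return events
```

Calling demo() should record that the second thread only gets the resource after the first one has returned it, which is exactly why the proxy has to be put back.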

#4
Flying Dutchman

    Programming God

  • Members
  • 889 posts
  • Location:::1
Thanks for additional explanation. :)



