Saturday, 12 August 2017

Word clouds: Text analysis for interesting patterns

This post utilises text analysis to identify trends in popular web pages. Of interest are home pages of news sites, Twitter pages of popular personalities and other sites where content changes frequently. Analysing words and projecting to a cloud provides a visual representation of text data which in turn helps to quickly perceive overall changes in trends and interests. Here the cloud is generated on content from Saturday and Sunday.

About the application:

The web application generates a word cloud for each submitted link at a given time. The application first generates a word count for each page using natural language tool kit and stores the results in a database. The counts are converted to a word cloud using opensource javascript libraries. Web links can be added to the application as needed. The word count task is performed asynchronously using celery worker nodes. 

Software built using:

Django1.10/Python3.5, Asynchronous tasks using Celery 4.1.0, Javascript, Natural Language Processing Tool Kit (NLTK), Postgres Database. 3 hosts. One for web application, one for celery workers and Database and third for the message broker. 

Home pages analysed:
Google News
Fox News
Indian Prime Minister Twitter Page
Obama Twitter page
President Trump Twitter Page

Celery Django Architecture

>> Word cloud on Saturday to left and Sunday to right

Fox News

Google News


 President Trump and  the Indian Prime Minister Twitter Handles

Tuesday, 20 June 2017

Band-aid for profiling in a hurry

This post profiles two solutions to a problem to see which one has low latency. As it will be obvious the better solution employs a simple technique and requires no explicit confirmation that it returns faster. Quickly profiling does not hurt either.

The objective is to consecutively calculate the n-th fibonacci where each n is taken from a sequence. Solution 1 is a recursive algorithm. This divides the problem into sub-problems and builds the final solution by combining solutions to individual sub-problems. This solution does not make any other attempts at gaining faster speeds. Solution 2 is also recursive but, uses a map to hold the results of sub-problems encountered during the calculation. That way for each subsequent n in the sequence, the results of sub-problems from the previous run are re-used. The same can be achieved with memoization.

As seen in the profiling results Solution 2 not only saves execution cycles but also saves itself from the overhead of context switching into the sub-problems.

Solution 1 without state

Solution 2 (with state in a class)

Now we profile the two solutions by calculating over a list of 10 numbers and the results are shown below.


Monday, 15 May 2017

Sharded database on a MongoDB cluster

MongoDB is a NoSQL document store. As with other NoSQL databases there are a number of advantages to it including dynamic schema, scalability/performance and no need for ORM layer. A production deployment of MongoDb should be clustered. This will increase scalability and durability for the data. For high volume applications the number of disk operations and network transfers can overwhelm a single node. Example is any back-end for mobile apps with millions of users. On a single node memory is also a precious resource that can become scarce. Finally the whole data set can be too large to fit in one node or one set of nodes. Sharding helps to address these issues. 

This post shows the setup of a sharded cluster and checking the behavior when the primary in a replicaset fails.

A sharded MongoDB topology looks like the following.
There are at least 3 config servers, 1 router and then the shards themselves. Each shard should ideally include at least 3 nodes/servers. One of the nodes in the shard will be elected primary and data replicates to the others in a shard. There can be many shards depending on data set size. The router routes a query to a particular shard after consulting the config servers. So on a minimum there needs to be 7 hosts for a basic sharded cluster. These have to be separate and not vms on the same machine. If these are vms on a single big server in production then we are not addressing the issues that led to the move towards sharding. However the post used vms all on ssd. The setup is described as below.

Setup details

Build 7 vms or configure separate hosts (recommended) with the following. Either an available image can be used or clone using virtualbox.

Config server

Shard's data nodes


Each node runs on 64-bit Ubuntu Server 16.04 and uses MongoDB 3.4.1.

Database used is the MoT dataset from

Config servers

A) Start the config servers one by one. Log into the config server and run the following command.
mongod --configsvr --replSet config_set1 --dbpath ~/mongo-data --port 27019

In the version of MongoDB used the config servers themselves are a replicaset. mongo-data is the path for storing the data files. Do this for each config server.

Log into one of the config servers in mongo and initialise them as shown:

Shard nodes

B)  On each node in the shard start the mongo daemon as
mongod --shardsvr --dbpath ./mongo-data/ --replSet data_rs_1

Log into one of the data nodes and initialise them as shown


C) Start the router. On the router node 

mongos --configdb config_set1/,,

Adding shards

D) Connect to the router and add the shard to the cluster.

View the config server to confirm that the shard was added. This shows up in the SHARDING logs.

Shard the database

E) Enable sharding for the database and collection in that order. Again connect to the router to do the following.

mongo> sh.enableSharding("mot")

Then enable sharding for the collection. Also specify the key field that needs to be used for distributing the data between shards.

F) Check the shard status. On the router do a sh.status(). This should look like


G) Connect a client to the router and populate the database. After this the shard status also shows chunks of data.

H) Check the config on the cluster. Connect MongoDB Compass to the router and look into the config db.

Note that an index is created on the DB for the shard key. This is shown below.

Electing a primary on failure:

Setting the above up can be time consuming but it is all worth it as when the primary node of the shard is killed, one of the other data nodes assumes the primary position through an election. So we go and kill the primary node process. This result in heart beat failure and after waiting for a specific time frame one of the secondaries assumes primary position. This is shown below.

The primary was node

On failure of that is elected as primary. This shows up on the 40 node when it comes back online.

Although queries during primary election can fail it can be caught in exceptions and tired again in the application. This requires no mentioning but the daemons should be run as service.


MongoDB docs at

This can be a bit outdated but is a good read too

Saturday, 22 April 2017

Distributed Processing: Python celery for 1 M tasks

This post is on python distributed processing using celery and a basic profiling of the same. Also some notes on celery is posted at the end. 

Celery is a distributed task queue system in Python. While custom multiprocessing code can be written, for well defined tasks it is better to leverage a framework than re-invent. Celery works together with a message broker from where enqueued tasks are consumed.

Producers of tasks are programs that want one or more tasks done. Workers are consumers of (execute) these tasks. The tasks are submitted to a message broker like Rabbit-MQ. Workers take the tasks from the broker and execute them. The results of tasks can either be ignored or stored some where. Usually this is memcached or a database.

Versions used from pip in virtual env:


Ubuntu Server 16.04  on which workers run has 8 cores and 16 GB RAM.
Message broker runs on vm with 4 cores and ~ 5 GB RAM. 

For this post 1 Million tasks are run on 2 worker processes. Each worker has 3 queues and the particular task is routed to a particular queue via configuration. Each task is to calculate the nth fibonacci using a recursive algorithm. Since this algorithm does not use any optimization techniques it takes a while to run and simulates a cpu intensive task. Thus demanding distribution of load!

- Message broker is rabbit-mq server and runs on a different virtual machine.
- Workers run on another machine as a service. Named w1 and w2.

Concurrency is set to default which is the number of processors i.e 8 on the machine. A screen of the processors being used is immediately visible as shown below.

Three queues on the workers are shown below. Celery has not yet got the notion of task priorities so color coded queues are used here. The names of the queues can be anything as long as each queue is being used for a particular subset of tasks based on cpu intensive, io driven and the like.

Flower is a tool used to monitor celery tasks. It shows details per worker and also graphs in a monitor page. The flower status page is shown below after start.

10 different threads are used to post 1 M fibonacci tasks. Ids of submitted tasks are stored and results later retrieved based on these.  The same screen after 1 M tasks have finished is shown below.

At the point where all the tasks had succeeded cProfile run at the client shows

The function calls also include calls for most of the results but not all.


1. Celery version 4.0.2 has an issue which causes the workers to crash after a restart and that too when there are pending messages in the message queue. 4.0.2 from PyPi exhibited this issue and so 3.1.24 is used. This issue is described in more detail here at There are a number of related issues too. However there is a merge for the same but is not yet available from PyPi at the time of this writing.

2. For a source structure <project>/src/<celery package with the Celery app>, the worker or the service needs to be triggered from the folder src (in this case).

3. Each worker can be configured to take tasks from a particular queue. 

4. By running workers on multiple virtual machines, the solution becomes more distributed.

Celery project links:

Sunday, 25 October 2015

coils: A python data structure library

Follow Hari's python datastructure library on BitBucket

Wednesday, 2 September 2015

Coding lessons from Hollywood: Readable Python conditionals with visitor pattern

Programs have conditional statements. They read like the following

if condition do action 
else do something else. 

These can get long, hairy and will read like

if condition_1 do this 
else if condition_2 do that 
else if condition_3 do that 
finally if nothing matches our expected conditions do something else. 

Switch-case statements which allow you to code this in a more reader friendly manner are not present in Python which means that we do it with chained conditional statements. Chained conditional statements can be ok when there are small operations to be done with each condition. For example, 

However as the number of conditions, context and operations to be performed get complicated, the code can become unwieldy, difficult to understand and thus non-maintainable. The code architecture of an application can go south. 

On the lighter side if you need Hollywood to convince you on this, have a look at this guy reading his colleague's code. Your / Others life may depend on it. :)) 

Chains of code like that will qualify as a hack to sleep in less than 30 seconds. For example if we are reading data off a network and need to perform different operations on the packet type. Imagine 20 different packet types. Something like the following can be cumbersome to read after 3-6 months of initial release. The following is an example of using long conditional to do a job.

Where as the following

looks better, reads better and takes the operation done away to another place (decoupled). Even better we can code the handler to take the packet itself and figure things out inside the handler. This can be done in Python using the visitor pattern. This pattern comes across in data structures but serves well in this context too. Plus this approach has been around for quite sometime in Python. But less frequently utilized for a long switch-case scenario.

So given are a number of options and corresponding operations to be performed. We can use different methods of a Python class to handle the different options. This means that we need to know which method for which option? Python makes this very easy as methods are also attributes in the dynamic __dict__ key value pair of objects. i.e we can query a method in an object, add a method in run-time or even change it for a class. 

Finding out which method handles which option can be done as shown in the following code.

We take the option and check if there is a method tied to that option in __dict__ attribute set. If we find one we use that. If no method is found we just do the default operation corresponding to default switch-case statement. Also we can use Python's handy *args and **kwargs to pass data into the handlers.

A class that implements the different methods for the various options can be like the following. Here we do some work on numeric options and some operations based on country codes.

These can be called as follows

To make things better, the Base class can look for functions named after the PacketTypeClass too, if we have different packet classes like IPv4Packet, IPv6Packet etc. i.e handler methods are named after the object type that they handle. 

It is arguable that introducing a class and object everywhere to handle switch-case statements or long conditionals is not helpful all the time. If the conditional chains are small and context is not that complicated it is acceptable to do conditionals as such. A simple dict with options as keys and method names as values will suffice. But that is exactly what a class in Python does with its __dict__ and whichever way chosen, handler code need to go somewhere in the code structure. Better done in a readable way.

Thursday, 13 August 2015

Using *args and **kwargs : Python code post

In a previous post we looked at *args and **kwargs in different function calls. This post is about a practical use of these two in object oriented programming and inheritance.  Again, these can be used is to avoid having to write __init__ for every class that you create in a large set of small classes.

To avoid having to write __init__ for every class in the hierarchy a parent class can implement __init__ based on *args and **kwargs. Each child class that inherits this class can then be written without an __init__. Note that this makes the code or calling less intelligible for API users. For example,

class Parent(object):
    def __init__(self, *args, **kwargs):
        # loop through args (need names attributes) and kwargs to set attributes

class Child(object):

We can call the child as c = Child(name = 'Jason', age = 10). and so on. Notice that we do not see what all the Child takes in its constructor. To avoid this use the technique discussed below with named parameters for __init__.

With in a class we can define a __init__ which is the equivalent of a constructor in Python. For a basic class which has a fixed set of arguments and is not present in a hierarchy this can be straight forward. In a hierarchy of classes a child class's __init__ could be called with a number of arguments, some of which can be inherited attributes from the parent class. In such a scenario the __init__ can use named parameters for its own attributes and then user *args or **kwargs to pass on the rest of the attributes to the parent class. The parent class can build its attributes from these and if there are any extras it can pass them to the grand parent and so on.

Here we look at an example hierarchy. Items -> RetailItems. RetailItems->FoodItems. RetailItems -> ElectronicAccessories. The code for the base class shown below. *args and **kwargs are used to build the instance in __init__. To process args we need to know names of attributes to assign to each value in args. We can use a list as shown below.

A RetailItem which derieves from Item and a FoodItem are shown below.

FoodItem can then be called as shown below. Notice than when we code up a FoodItem we are not in a position to know the attributes except to lookup the class for _expected_attributes or the parent class. As a advantage this saves time taken to write up an __init__.

To make the API more developer friendly we can use name parameters for __init__. So __init__ in the child class will have parameters of its own in addition to *args and **kwargs. To know what else goes in to the parent class we still need to look at the Parent class documentation. However the child class __init__ can take a variable number of named parameters. These parameters will then be handled by parent classed as needed. This is a usage scenario for *args and **kwargs. Notice that there is no expected list of attributes.

To make sure developers know how to build the object we can use the _expected_attributes. However the attributes other than specifications is handled by parent class (thanks to *args and **kwargs).

Code is available here