Beautiful Plots With Pandas and Matplotlib


Data visualization plays a crucial role in the communication of results from data analyses, and it should always help transmit insights in an honest and clear way. Recently, the highly recommended blog Flowing Data posted a review of data visualization highlights during 2013, and at The Data Science Lab we felt like doing a bit of pretty plotting as well.

For Python lovers, matplotlib is the library of choice when it comes to plotting. Quite conveniently, the data analysis library pandas comes equipped with useful wrappers around several matplotlib plotting routines, allowing for quick and handy plotting of data frames. Nice examples of plotting with pandas can be seen for instance in this ipython notebook. Still, for customized plots or not so typical visualizations, the pandas wrappers need a bit of tweaking and playing with matplotlib’s internal machinery. If one is willing to devote a bit of time to googling and experimenting, very beautiful plots can emerge.

Visualizing demographic data

For this pre-Christmas data visualization table-top experiment we are going to use demographic data from countries in the European Union obtained from Wolfram|Alpha. Our data set contains information on population, extension (land area) and life expectancy in 24 European countries. We create a pandas data frame from three series that we simply construct from lists, setting the countries as index for each series, and consequently for the data frame.

    import pandas as pd
    import matplotlib as mpl
    import matplotlib.pyplot as plt
    from matplotlib.colors import LinearSegmentedColormap
    from matplotlib.lines import Line2D

    countries = ['France','Spain','Sweden','Germany','Finland','Poland','Italy',
                 'United Kingdom','Romania','Greece','Bulgaria','Hungary',
                 'Portugal','Austria','Czech Republic','Ireland','Lithuania','Latvia',
                 'Croatia','Slovakia','Estonia','Denmark','Netherlands','Belgium']
    extensions = [547030,504782,450295,357022,338145,312685,301340,243610,238391,
                  131940,110879,93028,92090,83871,78867,70273,65300,64589,56594,
                  49035,45228,43094,41543,30528]
    populations = [63.8,47,9.55,81.8,5.42,38.3,61.1,63.2,21.3,11.4,7.35,
                   9.93,10.7,8.44,10.6,4.63,3.28,2.23,4.38,5.49,1.34,5.61,
                   16.8,10.8]
    life_expectancies = [81.8,82.1,81.8,80.7,80.5,76.4,82.4,80.5,73.8,80.8,73.5,
                         74.6,79.9,81.1,77.7,80.7,72.1,72.2,77,75.4,74.4,79.4,81,80.5]

    data = {'extension': pd.Series(extensions, index=countries),
            'population': pd.Series(populations, index=countries),
            'life expectancy': pd.Series(life_expectancies, index=countries)}

    df = pd.DataFrame(data)
    # df.sort() has been removed in recent pandas versions; sort_values is the current equivalent
    df = df.sort_values(by='life expectancy')

Now, thanks to the pandas plotting machinery, it is extremely straightforward to show the contents of this data frame by calling the data frame’s plot method. The code below generates a figure with three subplots displayed vertically, each of which shows a bar plot for a particular column of the data frame. The plots are automatically labelled with the column names of the data frame, and the whole procedure takes literally no time.

    fig, axes = plt.subplots(nrows=3, ncols=1)

    for i, c in enumerate(df.columns):
        df[c].plot(kind='bar', ax=axes[i], figsize=(12, 10), title=c)

    plt.savefig('EU1.png', bbox_inches='tight')

The output figure looks like this:

[Figure EU1: default pandas bar plots of extension, population and life expectancy for the 24 EU countries]

Customization with matplotlib directives

While this is an acceptable plot for the first steps of data exploration, the figure is not really publication-ready. It also looks very much “academic” and lacks that subtle flair that infographics in mainstream media have. Over the next paragraphs we will turn this plot into a much more beautiful object by playing around with the options that matplotlib supplies.

Let us first start by creating a figure and an axis object that will contain our subfigure:

    # Create a figure of given size
    fig = plt.figure(figsize=(16,12))
    # Add a subplot
    ax = fig.add_subplot(111)
    # Set title
    ttl = 'Population, size and age expectancy in the European Union'

Colors are very important for data visualizations. By default, the matplotlib color palette offers solid hues, which can be softened by applying transparencies. Similarly, the default colorbars can be customized to match our taste (see below how one can define a custom-made color map with a gradient that softly changes from orange to gray-blue hues).

    # Set color transparency (0: transparent; 1: solid)
    a = 0.7
    # Create a colormap (a list of RGB tuples, one per country)
    customcmap = [(x/24.0, x/48.0, 0.05) for x in range(len(df))]

The main plotting instruction in our figure uses the pandas plot wrapper. In the initialization options, we specify the type of plot (horizontal bar), the transparency, the color of the bars following the above-defined custom color map, the x-axis limits and the figure title. We also set the color of the bar borders to white for a cleaner look.

    # Plot the 'population' column as horizontal bar plot
    df['population'].plot(kind='barh', ax=ax, alpha=a, legend=False, color=customcmap,
                          edgecolor='w', xlim=(0,max(df['population'])), title=ttl)

After this simple pandas plot directive, the figure already looks very promising. Note that, because we sorted the data frame by life expectancy and applied a gradient color map, the color of the different bars in itself carries information. We will explicitly label that information below when constructing a color bar. For now we want to remove the grid, frame and axes lines from our plot, as well as customize its title and x,y axes labels.

    # Remove grid lines (dotted lines inside plot)
    ax.grid(False)
    # Remove plot frame
    ax.set_frame_on(False)
    # Pandas trick: remove weird dotted line on axis
    ax.lines[0].set_visible(False)

    # Customize title, set position, allow space on top of plot for title
    ax.set_title(ax.get_title(), fontsize=26, alpha=a, ha='left')
    plt.subplots_adjust(top=0.9)
    ax.title.set_position((0,1.08))

    # Set x axis label on top of plot, set label text
    ax.xaxis.set_label_position('top')
    xlab = 'Population (in millions)'
    ax.set_xlabel(xlab, fontsize=20, alpha=a, ha='left')
    ax.xaxis.set_label_coords(0, 1.04)

    # Position x tick labels on top
    ax.xaxis.tick_top()
    # Remove tick lines in x and y axes
    ax.yaxis.set_ticks_position('none')
    ax.xaxis.set_ticks_position('none')

    # Customize x tick labels
    xticks = [5,10,20,50,80]
    ax.xaxis.set_ticks(xticks)
    ax.set_xticklabels(xticks, fontsize=16, alpha=a)

    # Customize y tick labels
    yticks = [item.get_text() for item in ax.get_yticklabels()]
    ax.set_yticklabels(yticks, fontsize=16, alpha=a)
    ax.yaxis.set_tick_params(pad=12)

So far, the lengths of our horizontal bars display the population (in millions) of the EU countries. All bars have the same height (which is set to 50% of the total space between bars by default by pandas). An interesting idea is to use the height of the bars to display further data. If we could make the bar height dependent on, say, the countries’ extension, we would be adding a supplementary piece of information to the plot. This is possible in matplotlib by accessing the elements that contain the bars and assigning them a specific height in a for loop. Each bar is an element of the class Rectangle, and all the corresponding class methods can be applied to it. To assign a height according to each country’s extension, we code a simple linear interpolation and create a lambda function to apply it.

    # Set bar height dependent on country extension
    # Set min and max bar thickness (from 0 to 1)
    hmin, hmax = 0.3, 0.9
    xmin, xmax = min(df['extension']), max(df['extension'])
    # Function that interpolates linearly between hmin and hmax
    f = lambda x: hmin + (hmax-hmin)*(x-xmin)/(xmax-xmin)
    # Make array of heights
    hs = [f(x) for x in df['extension']]

    # Iterate over bars
    for container in ax.containers:
        # Each bar has a Rectangle element as child
        for i, child in enumerate(container.get_children()):
            # Reset the lower left point of each bar so that the bar is centered
            child.set_y(child.get_y() - 0.125 + 0.5 - hs[i]/2)
            # Attribute height to each Rectangle according to country's size
            plt.setp(child, height=hs[i])

Having added this “dimension” to the plot, we need a way of labelling the information so that the countries’ extension is understandable. A legend would be the ideal solution, but since our plotting directive was set to display the column ['population'], we cannot use the default one. We can, however, construct a “fake” legend and craft its handles by hand to roughly match the height of the bars. We position the legend in the lower right part of our plot.

    # Legend
    # Create fake labels for legend
    l1 = Line2D([], [], linewidth=6, color='k', alpha=a)
    l2 = Line2D([], [], linewidth=12, color='k', alpha=a)
    l3 = Line2D([], [], linewidth=22, color='k', alpha=a)

    # Set three legend labels to be min, mean and max of countries extensions
    # (rounded up to 10k km2)
    rnd = 10000
    labels = [str(int(round(l/rnd)*rnd)) for l in
              (min(df['extension']), df['extension'].mean(), max(df['extension']))]

    # Position legend in lower right part
    # Set ncol=3 for horizontally expanding legend
    leg = ax.legend([l1, l2, l3], labels, ncol=3, frameon=False, fontsize=16,
                    bbox_to_anchor=[1.1, 0.11], handlelength=2,
                    handletextpad=1, columnspacing=2, title='Size (in km2)')

    # Customize legend title
    # Set position to increase space between legend and labels
    plt.setp(leg.get_title(), fontsize=20, alpha=a)
    leg.get_title().set_position((0, 10))
    # Customize transparency for legend labels
    [plt.setp(label, alpha=a) for label in leg.get_texts()]

Finally, there is another piece of information in the plot that needs to be labelled, and that is the color map indicating the average life expectancy in the EU countries. Since we used a custom-made color map, the regular call to plt.colorbar() would not work. We need to create a LinearSegmentedColormap instead and “trick” matplotlib into displaying it as a colorbar. Then we can use the usual customization methods from colorbar to set fonts, transparency, position and size of the diverse elements in the color legend.

    # Create a fake colorbar
    ctb = LinearSegmentedColormap.from_list('custombar', customcmap, N=2048)
    # Trick from http://stackoverflow.com/questions/8342549/
    # matplotlib-add-colorbar-to-a-sequence-of-line-plots
    sm = plt.cm.ScalarMappable(cmap=ctb, norm=plt.Normalize(vmin=72, vmax=84))
    # Fake up the array of the scalar mappable
    sm._A = []

    # Set colorbar, aspect ratio
    cbar = plt.colorbar(sm, ax=ax, alpha=0.05, aspect=16, shrink=0.4)
    cbar.solids.set_edgecolor("face")
    # Remove colorbar container frame
    cbar.outline.set_visible(False)
    # Fontsize for colorbar ticklabels
    cbar.ax.tick_params(labelsize=16)
    # Customize colorbar tick labels
    mytks = range(72,86,2)
    cbar.set_ticks(mytks)
    cbar.ax.set_yticklabels([str(t) for t in mytks], alpha=a)

    # Colorbar label, customize fontsize and distance to colorbar
    cbar.set_label('Age expectancy (in years)', alpha=a,
                   rotation=270, fontsize=20, labelpad=20)

    # Remove color bar tick lines, while keeping the tick labels
    cbarytks = plt.getp(cbar.ax.axes, 'yticklines')
    plt.setp(cbarytks, visible=False)


The final and most rewarding step consists of saving the figure in our preferred format.

    # Save figure in png with tight bounding box
    plt.savefig('EU.png', bbox_inches='tight', dpi=300)

The final result looks this beautiful:

[Figure EU: final customized horizontal bar plot of population, with bar height encoding country size and bar color encoding life expectancy]

Table-top data experiment take-away message

When producing a plot based on multidimensional data, it is a good idea to resort to shapes and colors that visually guide us through the variables on display. Matplotlib offers a high level of customization for all details of a plot, although finding exactly which knob to tweak can at times be bewildering. Beautiful plots can be created by experimenting with various settings, among which hues, transparencies and simple layouts are the focal points. The result is publication-ready figures, produced with open-source software, that can easily be replicated by means of structured python code.

Clustering With K-Means in Python

A very common task in data analysis is that of grouping a set of objects into subsets such that all elements within a group are more similar to one another than they are to elements of other groups. The practical applications of such a procedure are many: given a medical image of a group of cells, a clustering algorithm could aid in identifying the centers of the cells; looking at the GPS data of a user’s mobile device, their most frequently visited locations within a certain radius can be revealed; for any set of unlabeled observations, clustering helps establish the existence of some sort of structure that might indicate that the data is separable.

Mathematical background

The k-means algorithm takes a dataset X of N points as input, together with a parameter K specifying how many clusters to create. The output is a set of K cluster centroids and a labeling of X that assigns each of the points in X to a unique cluster. All points within a cluster are closer in distance to their centroid than they are to any other centroid.

The mathematical condition for the K clusters C_k and the K centroids \mu_k can be expressed as:

Minimize \displaystyle \sum_{k=1}^K \sum_{\mathrm{x}_n \in C_k} ||\mathrm{x}_n - \mu_k ||^2 with respect to \displaystyle C_k, \mu_k.

Lloyd’s algorithm

Finding the solution is unfortunately NP hard. Nevertheless, an iterative method known as Lloyd’s algorithm exists that converges (albeit to a local minimum) in a few steps. The procedure alternates between two operations. (1) Once a set of centroids \mu_k is available, the clusters are updated to contain the points closest in distance to each centroid. (2) Given a set of clusters, the centroids are recalculated as the means of all points belonging to a cluster.

\displaystyle C_k = \{\mathrm{x}_n : ||\mathrm{x}_n - \mu_k|| \leq ||\mathrm{x}_n - \mu_l|| \,\,\mathrm{for\,\,all}\,\, l\}\qquad(1)

\displaystyle \mu_k = \frac{1}{|C_k|}\sum_{\mathrm{x}_n \in C_k}\mathrm{x}_n\qquad(2)

The two-step procedure continues until the assignments of clusters and centroids no longer change. As already mentioned, the convergence is guaranteed but the solution might be a local minimum. In practice, the algorithm is run multiple times with different initializations and the best solution is kept. For the starting set of centroids, several methods can be employed, for instance random assignment.

Below is a simple implementation of Lloyd’s algorithm for performing k-means clustering in python:

    import random
    import numpy as np

    def cluster_points(X, mu):
        clusters = {}
        for x in X:
            bestmukey = min([(i[0], np.linalg.norm(x-mu[i[0]]))
                             for i in enumerate(mu)], key=lambda t: t[1])[0]
            try:
                clusters[bestmukey].append(x)
            except KeyError:
                clusters[bestmukey] = [x]
        return clusters

    def reevaluate_centers(mu, clusters):
        newmu = []
        keys = sorted(clusters.keys())
        for k in keys:
            newmu.append(np.mean(clusters[k], axis=0))
        return newmu

    def has_converged(mu, oldmu):
        return set([tuple(a) for a in mu]) == set([tuple(a) for a in oldmu])

    def find_centers(X, K):
        # Initialize to K random centers (random.sample needs a plain sequence)
        oldmu = random.sample(list(X), K)
        mu = random.sample(list(X), K)
        while not has_converged(mu, oldmu):
            oldmu = mu
            # Assign all points in X to clusters
            clusters = cluster_points(X, mu)
            # Reevaluate centers
            mu = reevaluate_centers(oldmu, clusters)
        return(mu, clusters)

Clustering in action

Let’s see the algorithm in action! For an ensemble of 100 random points on the plane, we set the k-means function to find 7 clusters. The code converges in 7 iterations after initializing with random centers. In the following plots, dots correspond to the target data points X and stars represent the centroids \mu_k of the clusters. Each cluster is distinguished by a different color.

[Figure p_N100_K7: 100 random points clustered into 7 groups; dots are data points, stars are the centroids]

The initial configuration of points for the algorithm is created as follows:

    import random

    def init_board(N):
        X = np.array([(random.uniform(-1, 1), random.uniform(-1, 1)) for i in range(N)])
        return X
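As a minimal usage sketch of the functions above (the plotting details and the file name here are our own, not from the original code), the figure shown earlier can be reproduced along these lines:

    import matplotlib.pyplot as plt

    # Draw 100 random points and group them into 7 clusters
    X = init_board(100)
    mu, clusters = find_centers(X, 7)

    # Dots are data points, stars are the centroids of the clusters
    fig, ax = plt.subplots(figsize=(6, 6))
    for j, k in enumerate(sorted(clusters.keys())):
        pts = np.array(clusters[k])
        ax.plot(pts[:, 0], pts[:, 1], 'o', alpha=0.5)
        ax.plot(mu[j][0], mu[j][1], '*', markersize=18, color='k')
    plt.savefig('p_N100_K7.png', bbox_inches='tight')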

For a configuration with twice as many points and a target of 3 clusters, often the algorithm needs more iterations to converge.

[Figure p_N200_K3: 200 random points clustered into 3 groups]

Obviously, an ensemble of randomly generated points does not possess a natural cluster-like structure. To make things slightly more tricky, we want to modify the function that generates our initial data points to output a more interesting structure. The following routine constructs a specified number of Gaussian distributed clusters with random variances:

    def init_board_gauss(N, k):
        n = float(N)/k
        X = []
        for i in range(k):
            c = (random.uniform(-1, 1), random.uniform(-1, 1))
            s = random.uniform(0.05, 0.5)
            x = []
            while len(x) < n:
                a, b = np.array([np.random.normal(c[0], s), np.random.normal(c[1], s)])
                # Continue drawing points from the distribution in the range [-1,1]
                if abs(a) < 1 and abs(b) < 1:
                    x.append([a, b])
            X.extend(x)
        X = np.array(X)[:N]
        return X

Let us look at a data set constructed as X = init_board_gauss(200,3): 7 iterations are needed to find the 3 centroids.

[Figure p_N200_K3_G: 200 Gaussian-distributed points and the 3 centroids found after 7 iterations]

If the target distribution consists of disjoint, well separated clusters and only one run of Lloyd’s algorithm is used, the danger exists that the local minimum reached is not the optimal solution. This is shown in the example below, where the initial data is constructed from very peaked Gaussians:

[Figure p_N100_K9_G: k-means on very peaked Gaussians converging to a suboptimal local minimum]

The yellow and black stars serve two different clusters each, while the orange, red and blue centroids are cramped within one unique blob due to an unfortunate random initialization. For cases like this, a cleverer choice of the initial centroids should help.

To conclude our table-top experiment on k-means clustering, we might want to take a look at what happens when the original data space is densely populated:

[Figure p_N2000_K15: 2000 points partitioned into 15 clusters, forming a Voronoi tessellation]

The k-means algorithm induces a partition of the observations corresponding to a Voronoi tessellation generated by the K centroids. And it is certainly very pretty!
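For readers who want to draw that tessellation explicitly, here is a hedged sketch of ours using scipy.spatial (scipy is not part of the original post); it reuses init_board and find_centers from above and overlays the Voronoi diagram of the centroids on the clustered points:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.spatial import Voronoi, voronoi_plot_2d

    # Cluster a densely populated board
    X = init_board(2000)
    mu, clusters = find_centers(X, 15)

    # The nearest-centroid assignment of k-means is exactly the
    # Voronoi tessellation generated by the centroids
    vor = Voronoi(np.array(mu))
    fig, ax = plt.subplots(figsize=(8, 8))
    ax.plot(X[:, 0], X[:, 1], ',', alpha=0.4)
    voronoi_plot_2d(vor, ax=ax)
    ax.set_xlim(-1, 1)
    ax.set_ylim(-1, 1)
    plt.savefig('voronoi_tessellation.png', bbox_inches='tight')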

Table-top data experiment take-away message

Lloyd’s two-step implementation of the k-means algorithm makes it possible to cluster data points into groups represented by a centroid. This technique is employed in many facets of machine learning, from unsupervised learning algorithms to dimensionality reduction problems. The general clustering problem is NP hard, but the iterative procedure always converges, albeit possibly to a local minimum. Proper initialization of the centroids is important. Additionally, this algorithm does not supply information as to which value of K is optimal; that has to be found out by alternative methods.

Update: We explore the gap statistic as a method to determine the optimal K for clustering in this post: Finding the K in K-Means Clustering and the f(K) method: Selection of K in K-means Clustering, Reloaded.

Update: For a proper initialization of the centroids at the start of the k-means algorithm, we implement the improved k-means++ seeding procedure.

A Peek at the ORCID Registry Public Data File

Proper authorship attribution of intellectual works is an ongoing and challenging problem for bibliographic services and the academic community alike. Often, an author’s signature in a publication is nothing more than a name and an affiliation, which might not suffice for confirming their identity beyond question, and this is particularly true when the name is especially common. A potential approach to solve this problem is the creation of a central registry of unique person identifiers together with a way of linking the authors’ profiles to their scientific output. One such initiative is the Open Researcher and Contributor ID (ORCID) project, which was launched in October 2012 as an open, non-profit, community-driven effort.

Release of ORCID’s public data file

One year after launching the ORCID Registry, a first Public Data File containing a snapshot of the database has been released. The data can be downloaded from the ORCID site, and at The Data Science Lab we of course want to have a peek at it. The compressed file contains public information associated with each user’s ORCID record. This is an example of an ORCID profile. Each person’s record is included as a separate file in both json and XML format. For this table-top exploratory experiment we will be using the entries in json format. To avoid performing too many I/O operations, we first concatenate all json files together (this takes a little while, but it pays off later):

 for f in /json/*.json; do (cat "${f}"; echo) >> orcid.json; done 

Using python pandas to manage the ORCID dataset

We load the records into pandas simply by reading this file line by line and storing certain fields in a dictionary class created for this purpose. For our exploratory analysis we are interested in the timestamp of the profile creation, the researcher’s name and the number of publications listed in their profile, in addition to the unique alphanumeric ORCID identifier. We then create a pandas data frame easily from this dictionary.

    import pandas as pd

    columns = ['orcid_id', 'family_name', 'first_name', 'num_pubs', 'date_created']
    # OrcidDict is a small helper class that parses each json record and
    # accumulates the selected fields (see the sketch below)
    orcidDict = OrcidDict(columns)
    with open('/path/to/orcid/orcid.json') as infile:
        for line in infile:
            orcidDict.add_orcid_record(line.strip())
    df = pd.DataFrame.from_dict(orcidDict.printItems())
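The OrcidDict helper class is not reproduced in the original post; below is a minimal sketch of ours of what such a class might look like. The json field names (orcid-profile, orcid-identifier, orcid-bio, orcid-activities, orcid-history) follow the ORCID 1.x message schema and should be read as assumptions; the sketch also relies on the format_json_timestamp helper defined in the next snippet, and simply stores missing fields as None or zero:

    import json

    class OrcidDict(object):
        """Accumulate selected fields of ORCID json records (a sketch)."""
        def __init__(self, columns):
            self.items = {c: [] for c in columns}

        def add_orcid_record(self, line):
            data = json.loads(line)
            profile = data.get('orcid-profile') or {}
            bio = (profile.get('orcid-bio') or {}).get('personal-details') or {}
            works = ((profile.get('orcid-activities') or {}).get('orcid-works')
                     or {}).get('orcid-work') or []
            history = profile.get('orcid-history') or {}
            self.items['orcid_id'].append((profile.get('orcid-identifier') or {}).get('path'))
            self.items['family_name'].append((bio.get('family-name') or {}).get('value'))
            self.items['first_name'].append((bio.get('given-names') or {}).get('value'))
            self.items['num_pubs'].append(len(works))
            self.items['date_created'].append(
                format_json_timestamp((history.get('submission-date') or {}).get('value', 0),
                                      '%Y-%m-%d'))

        def printItems(self):
            return self.items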

When reading timestamps from the json data, we drop the time of the day and keep only year, month, day:

    import datetime

    def format_json_timestamp(jsonTimestamp, myFormat):
        # The timestamps are stored in milliseconds, hence the division by 1000
        return(datetime.datetime.fromtimestamp(jsonTimestamp/1000).strftime(myFormat))

    # 'data' stands for one parsed json record, i.e. data = json.loads(line)
    history = data['orcid-profile']['orcid-history']
    dateFormat = '%Y-%m-%d'
    date_created = format_json_timestamp(history['submission-date']['value'], dateFormat)

The ORCID project has grown at a steady rate over the last year. The data contained in the public file amounts to 361,209 unique profiles (compare with their latest, live statistics). A simple grouping by profile creation month shows this evolution:

    # Keep only year and month of the creation date
    df['month_created'] = df.date_created.map(lambda t: t[:-3])
    # Count newly created profiles per month
    gCrMon = df.drop_duplicates().groupby('month_created').size().to_frame('counts_month')
    gCrMon.index = pd.to_datetime(gCrMon.index)
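A bar plot of these monthly counts then yields the figure below (a minimal sketch; the styling and axis labels are our own):

    import matplotlib.pyplot as plt

    # Bar plot of newly created profiles per month
    ax = gCrMon['counts_month'].plot(kind='bar', figsize=(12, 6), alpha=0.7, legend=False)
    ax.set_xlabel('Month of profile creation')
    ax.set_ylabel('New ORCID profiles')
    plt.savefig('pCreatedAllYear.png', bbox_inches='tight')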

[Figure pCreatedAllYear: ORCID profiles created per month since the launch of the registry]

If we group by creation day rather than month, we can investigate how the rate of profile creation evolved during the year.
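A hedged sketch of this daily grouping, analogous to the monthly one above (the variable names are our own):

    # Count newly created profiles per day; date_created is already 'YYYY-MM-DD'
    gCrDay = df.drop_duplicates().groupby('date_created').size().to_frame('counts_day')
    gCrDay.index = pd.to_datetime(gCrDay.index)

    # Zoom into a window of a few weeks, e.g. around mid October 2012
    gCrDay.loc['2012-10-15':'2012-11-05'].plot(kind='bar', legend=False, figsize=(12, 6))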

[Figure pCreatedOct12: daily profile creation around mid October 2012, with weekends shadowed]

[Figure pCreatedSep13: daily profile creation around mid September 2013]

In both plots, corresponding to mid October 2012 and mid September 2013 respectively, we observe the expected weekly periodic ups and downs, indicating less activity during the weekends. While ORCID was accreting new users at a rate of a few hundred per day a year ago, about 2,000 new profiles are now registered daily. Weekends are shadowed in the plot corresponding to October 2012, where a peak of unusually high activity on November 3rd 2012 can be seen. This most likely has to do with the Altmetrics workshop and hackathon held in San Francisco around that date, which some of the ORCID promoters attended.

[Figure complProfiles: proportion of ORCID profiles that list at least one publication]

If the ORCID project is to succeed at a global scale, we should expect that researchers not only reserve their ID in the system, but also use it as a repository for their publications. It is thus important that not only the researcher’s name be assigned an ORCID identifier, but that the corresponding profile be completed with biographical and bibliographic information. The plot above shows the proportion of ORCID profiles that have uploaded a list of publications so far, which amounts to slightly under 20% of all registered researchers. This is despite the fact that the procedure to upload a list of bibliographic records from Scopus is tremendously straightforward. Moving forward, this is an area that the ORCID promoters might want to emphasize.
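The proportion itself can be read directly off the data frame with a one-line sketch (num_pubs is the column defined earlier):

    # Fraction of profiles listing at least one publication (slightly under 0.2)
    frac_with_pubs = (df['num_pubs'] > 0).mean()
    print('Profiles with publications: {:.1%}'.format(frac_with_pubs))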

Update 09.12.13: As Amir Aryani correctly points out, the figure of 20% of researchers with uploaded publications might be obscured by the ORCID privacy settings (it is possible to limit the visibility of parts of the profile, including publications). We can roughly estimate how many of the 80% of profiles that don’t show any publications do indeed have some that are simply kept private: by looking at the ORCID statistics at the project page we find that 24% of the current live ORCID IDs have at least one work. This supports our claim that the majority of researchers at ORCID have not yet uploaded the list of their bibliographic records.
End update 09.12.13

Gender distribution of the researchers registered in ORCID

A sociologically important question we may ask in our little exploratory experiment involves the gender distribution of the researchers registered in ORCID. Generally speaking, gender assignment based on given names is not a trivial task, for many names are gender neutral or culture specific. We can, however, attempt to get an approximate result.

For this task, we resort to a python package that uses information from about 40,000 first names that have been manually curated and labeled by native speakers. Names that can not be given a definite gender are classified as unknown.

    # The detector below comes from a package such as SexMachine; the import is
    # an assumption, since the original post does not show it
    from sexmachine import detector as gender

    d = gender.Detector(unknown_value=u'unknown')
    names = ['Jill', 'Martin', 'Chris', 'Leslie', 'Wei']
    print('\n'.join(['\t'.join([name, d.get_gender(name)]) for name in names]))


Jill female
Martin male
Chris mostly_male
Leslie mostly_female
Wei unknown

The gender identifier is thus applied to the first names of all available ORCID records. Barely 53% of all first names in the ORCID registry are undoubtedly labeled male or female by this procedure. Of those, roughly 30% are female. The plot below shows the histogram of number of publications, broken down by gender, for those profiles that were either identified as male or female. Very interestingly, we observe that the percentage of female authors diminishes with increasing number of publications. Also noteworthy is the anomalous bump at 20 publications, which might correlate with some kind of technical artifact in the submission of the bibliographic data to the ORCID system. Other than that, the distribution shows exponential decay, corresponding to the fact that prolific authors (of both genders) are rare.
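A hedged sketch of how this breakdown might be computed from our data frame (d is the detector created above; the exact code behind the original figure is not shown in the post):

    import matplotlib.pyplot as plt

    # Assign a gender to the first token of each given name
    df['gender'] = df['first_name'].fillna('').map(lambda n: d.get_gender(n.split(' ')[0]))
    resolved = df[df['gender'].isin(['male', 'female'])]

    # Histogram of the number of publications, broken down by gender
    fig, ax = plt.subplots()
    for g, grp in resolved.groupby('gender'):
        ax.hist(grp['num_pubs'], bins=range(0, 51), alpha=0.5, label=g)
    ax.set_xlabel('Number of publications')
    ax.set_ylabel('Number of profiles')
    ax.legend()
    plt.savefig('histPubsGender.png', bbox_inches='tight')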

[Figure histPubsGender: histogram of the number of publications per profile, broken down by gender]

Our exploratory analysis of the public data from the ORCID service shows that the initiative is growing its user base, and that the growth has so far been sustained. The relatively small number of profiles with accompanying publications shows that the majority of users are not uploading the list of their publications to the system. This is somewhat unfortunate, but not catastrophic, provided that a reliable mechanism is implemented to link the DOIs of those publications to the ORCID of their authors. Finally, a naive analysis of the gender of the given names behind the ORCID profiles provides a first look at the demographics of the researchers.

Table-top data experiment take-away message

Exploratory analysis is always a first step towards the understanding of any data science challenge. Exploration should serve to assess the quality and completeness of the data, and to inform further decisions in the process of devising more complicated data analysis techniques. At this stage, exploratory plots are very useful, for they can help formulate first hypotheses as well as clarify the task at hand.