Category Archives: Creating the Visualisations

Context of Visualisations

Figure: QM Project Workflow

Just as we mined our data according to our hypotheses, the visualisations we created can be classified in the same way. This section of the blog discusses how we created our visualisations, while the visualisations themselves and their accompanying analysis can be found here.

In investigating the respective hypotheses, we created these visualisations:

1. Amongst the 3 disciplines, Science will have the highest level of technicality.

  • Pie Charts – Comparison of technicality ratios

2. The lower a subject’s level of technicality, the more likely its technical words are shared with another subject.

  • Bar and Line Graph to compare Citation Metrics of disciplines
  • Word Clouds
    • Frequency of each discipline’s technical words in discipline-specific articles vs Frequency of each discipline’s technical words across all articles

3. A directly proportional relationship exists between a subject’s shared vocabulary and its impact in the published world of Academia.

  • Network Visualisation

Creating the Bar and Line Graphs

The purpose of these graphs is to compare the Citation Metrics between the various disciplines.

Using the .csv files holding the citation metrics dataset, the graphs were created in Excel. Although Python has the same graphing capabilities, we chose to construct these graphs in Excel: they required several axes and, given our Python abilities, we thought Excel would be the more efficient programme.
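(Just for illustration, a similar chart with a secondary axis could be sketched in matplotlib along these lines; the disciplines and numbers below are placeholder examples, not our actual citation metrics.)

import matplotlib.pyplot as plt

# Placeholder disciplines and example values, not our actual citation metrics
disciplines = ['History', 'Philosophy', 'Physics', 'Biology']
bar_metric = [20, 45, 65, 250]      # e.g. an h5-type index (left axis)
line_metric = [2.1, 3.4, 9.7, 12.8]  # e.g. citations per article (right axis)

positions = range(len(disciplines))

fig, ax1 = plt.subplots()
ax1.bar(positions, bar_metric, color='lightgrey', align='center')
ax1.set_xticks(positions)
ax1.set_xticklabels(disciplines)
ax1.set_ylabel('Index (illustrative)')

# Second y-axis for the line series
ax2 = ax1.twinx()
ax2.plot(positions, line_metric, 'k-o')
ax2.set_ylabel('Citations per article (illustrative)')

plt.title('Citation metrics by discipline (illustrative values)')
plt.savefig('citation_metrics.png')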

Building the Word Clouds

Word clouds provide a graphical representation of word frequency in a text – the more commonly a word appears, the bigger the word in the cloud. In the context of this project, word clouds are used to communicate the frequency of technical words across articles. The more times a particular technical word is used across articles, the bigger that word will appear.

Although cloud generators like Wordle are freely available on the internet, we wanted more control over the creation of our clouds and thus turned to Python. We found a particularly useful package on GitHub, https://github.com/amueller/word_cloud, which became the base of our cloud-generating process.

Step-by-Step process to create a word cloud:

1. Download and import the Word Cloud package in Python from GitHub

As the package used here is not available from Canopy’s Package Manager, the download and installation process differed slightly from the usual one.


Figure 1: Instructions detailing the installation of the Word Cloud package

2. Specify working directory and desired formatting e.g. font

And your word cloud is generated!

Figure 2: Word cloud generation
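The code in the screenshot is not reproduced here; as a rough sketch, using the word_cloud package (installed from the GitHub repository above) looks something like the following. The directory, file name and formatting options are placeholders rather than the ones we actually used:

import os
import matplotlib.pyplot as plt
from wordcloud import WordCloud  # package from https://github.com/amueller/word_cloud

# 1. Specify the working directory (placeholder path)
os.chdir('/path/to/working/directory')

# 2. Read the source text and set the desired formatting, e.g. font and canvas size
text = open('source_text.csv').read()
cloud = WordCloud(font_path=None,  # or a path to a .ttf font file
                  width=800, height=600,
                  background_color='white').generate(text)

# Display and save the cloud
plt.imshow(cloud)
plt.axis('off')
plt.savefig('wordcloud.png')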

As mentioned above, our word clouds are intended to communicate the frequency of technical words across articles. For the code to work, it needs a source text in which each technical word is repeated as many times as it appeared in the articles. From our dataset compilation, we already had tables listing all the technical words and their frequencies across the articles, saved as .csv files. To automate the writing of the required source text, the following code was written:

Step-by-Step process to create a source text for the word cloud:

1. Import required packages and open the relevant .csv file

Figure 3: Source text for the cloud (part 1)

2. Create a new data frame that returns the text as required and save as a .csv file

As seen above, the first .csv lists each word and the number of articles it is mentioned in. However, for the cloud generator to work, the .csv file has to repeat the word rather than merely state its frequency, e.g. “attribution attribution attribution attribution attribution” rather than “attribution 5”. As such, a new data frame was created by multiplying (repeating) each word by its frequency, as seen in [5]. This data frame is then exported as the .csv file required for [11] (Step 2 of the word cloud creation process).

Figure 4: Source text for the cloud (part 2)
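A minimal sketch of these two steps, assuming the first .csv has ‘word’ and ‘frequency’ columns (the file and column names here are placeholders, not necessarily those in our notebook):

import pandas as pd

# [Step 1] Open the relevant .csv file (placeholder file and column names)
df = pd.read_csv('technical_words.csv')   # columns: 'word', 'frequency'

# [Step 2] Repeat each word according to its frequency,
# e.g. 'attribution' with frequency 5 becomes 'attribution attribution ...'
df['repeated'] = (df['word'] + ' ') * df['frequency']

# Export the repeated words as the source text the cloud generator will read
df[['repeated']].to_csv('source_text.csv', index=False, header=False)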

View the word clouds that we have created here.

Building the Network Visualisation

In order to decipher the relationship between subjects’ shared vocabulary and their impact in Academia through a visualisation, it is essential that this visualisation has 2 indicators that can simultaneously reflect the values of both variables. Additionally, it has to be capable of adequately expressing the complex, non-linear concept of interdisciplinarity. As such, we decided that a network visualisation is most suitable for this purpose.

A network visualisation contains 2 main elements: nodes and edges. In our conceptualisation, a node represents a subject and an edge represents the relationship between 2 subjects. The size of a node, therefore, reflects a subject’s impact, and the thickness of an edge represents the percentage of shared vocabulary between two subjects. Altogether, we would be able to picture the relationship between the two variables as desired.

As we did not delve into network visualisations in class, the first step towards building this visualisation was exploring the available network visualisation packages in Python and picking one. We tested several packages and finally went with NetworkX, as it was the best documented and thus the most beginner-friendly.

Before I could even start coding the visualisation, I needed to gather the necessary data. Since a node’s size stands for a subject’s impact in Academia, a node’s value is simply its h5 Index (2014). We had already collected the indices individually while compiling the Citation Metrics, so I simply had to retrieve these from the database.

The thickness of an edge represents the shared vocabulary between two subjects. The measure of shared vocabulary between 2 subjects is not which words are shared, but the number of articles from either subject that contain a technical word from the other subject. For example, if I were looking at the relationship between History and Philosophy, I would need the number of History articles that contain Philosophy’s technical words and the number of Philosophy articles that contain History’s technical words. (The code used and further elaboration are here.)
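The actual code is linked above; as a rough sketch, one direction of that count could be obtained like this, assuming each article is available as plain text (the articles and words below are made-up stand-ins):

def count_articles_with_shared_words(articles, technical_words):
    # Count how many articles contain at least one of the given technical words
    shared = 0
    for text in articles:
        words_in_article = set(text.lower().split())
        if any(word.lower() in words_in_article for word in technical_words):
            shared += 1
    return shared

# Hypothetical example: History articles checked against Philosophy's technical words
history_articles = ["The epistemology of archival sources ...",
                    "Trade routes in the early modern period ..."]
philosophy_words = ["epistemology", "ontology", "dialectic"]
print(count_articles_with_shared_words(history_articles, philosophy_words))  # -> 1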

However, I could not simply input the article counts straight into the visualisation’s dataset, seeing as the number of technical words per subject, and thus the maximum possible count of articles, differs for every subject. So, we converted all the counts into percentages. Then, as the network visualisation is non-directional, to establish a single percentage of shared vocabulary between 2 subjects, we took the average of the 2 subjects’ percentages.

Essentially, the value representing the thickness of an edge is the average of the percentage of subject 1’s articles that contain subject 2’s technical words and the percentage of subject 2’s articles that contain subject 1’s technical words.

As an example, let’s assume that Philosophy and History have 20 technical words between them. This means that the maximum number of articles between them in which the other subject’s technical words can be found is 20 x 10 = 200, though the actual number shared is 10. On the other hand, Philosophy and Psychology have 30 technical words between them, so the maximum number of articles between them in which the other subject’s technical words can be found is 30 x 10 = 300, though the actual number shared is again 10. Although in both pairs 10 articles contain the other subject’s technical words, the percentage for the former is 10/200 x 100% = 5% while for the latter it is 10/300 x 100% = 3.33%. These percentages are the values we ultimately used for the edges’ thickness (width).
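A small helper capturing the averaging described above; the counts and maxima passed in below are hypothetical, not our real data:

def edge_width(count_a_in_b, max_a_in_b, count_b_in_a, max_b_in_a):
    # count_a_in_b: number of subject B's articles containing subject A's technical words
    # max_a_in_b:   maximum possible value of that count
    # (and vice versa for the other pair of arguments)
    pct_a_in_b = count_a_in_b / float(max_a_in_b) * 100
    pct_b_in_a = count_b_in_a / float(max_b_in_a) * 100
    return (pct_a_in_b + pct_b_in_a) / 2.0

# Hypothetical numbers: 5 of 100 possible in one direction, 8 of 120 in the other
print(edge_width(5, 100, 8, 120))  # -> (5.0 + 6.67) / 2, roughly 5.83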


The network visualisation code is as follows:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import random
import networkx as nx
G = nx.Graph()
#size=h5 median (2014)
#width = edge's thickness i.e. average of percentage of subject 1's articles that contain subject 2's technical words and percentage of subject 2's articles that contain subject 1's technical words
G.add_node('History', size=19)
G.add_node('English', size=38)
G.add_node('Philosophy', size=43)
G.add_node('Visual Arts', size=13)
G.add_node('Biology', size=247.2)
G.add_node('Medicine', size=312)
G.add_node('Physics', size=63.2)
G.add_node('Psychology', size=106.4)

G.add_edge('History', 'English', width=1.128787879)
G.add_edge('History', 'Philosophy', width=1.30952381)
G.add_edge('History', 'Visual Arts', width=0.333333334)
G.add_edge('History', 'Biology', width=2.894736842)
G.add_edge('History', 'Medicine', width=2.2)
G.add_edge('History', 'Physics', width=3.5)
G.add_edge('History', 'Psychology', width=8.977272725)

G.add_edge('English', 'Philosophy', width=1.586038961)
G.add_edge('English', 'Visual Arts', width=3.977272727)
G.add_edge('English', 'Biology', width=0.113636364) 
G.add_edge('English', 'Medicine', width=4.740909091)
G.add_edge('English', 'Physics', width=7.142857143)
G.add_edge('English', 'Psychology', width=1.590909091)

G.add_edge('Philosophy', 'Visual Arts', width=3.906926407)
G.add_edge('Philosophy', 'Biology', width= 2.869674185) 
G.add_edge('Philosophy', 'Medicine', width= 0.357142857)
G.add_edge('Philosophy', 'Physics', width= 0.357142857)
G.add_edge('Philosophy', 'Psychology', width= 13.42532468)

G.add_edge('Visual Arts', 'Biology', width= 2.368421053) 
G.add_edge('Visual Arts', 'Medicine', width= 7.857142857)
G.add_edge('Visual Arts', 'Physics', width= 2.8)
G.add_edge('Visual Arts', 'Psychology', width= 10)

G.add_edge('Biology', 'Medicine', width= 9.126315789)
G.add_edge('Biology', 'Physics', width= 4.511278195)
G.add_edge('Biology', 'Psychology', width= 5.281100478)

G.add_edge('Medicine', 'Physics', width= 8.157142857)
G.add_edge('Medicine', 'Psychology', width= 11.93181818)

G.add_edge('Physics', 'Psychology', width= 12.27272727)
G.nodes(data=True)   # inspect the nodes and their attributes
G.edges(data=True)   # inspect the edges and their attributes

layout = nx.spring_layout(G)
#layout = nx.spectral_layout(G)

# Collect node sizes (scaled up by 10 so the nodes are clearly visible)
# and edge widths, in the order NetworkX iterates over them
# (nodes_iter/edges_iter are the NetworkX 1.x API)
l = list()
for n in G.nodes_iter():
    l.append(G.node[n]['size'])
l = [x*10 for x in l]

m = list()
for n1, n2 in G.edges_iter():
    m.append(G.edge[n1][n2]['width'])

nx.draw(G, with_labels=True, pos=layout, node_size=l, width=m,
        edge_color='k', node_color='w', alpha=0.4)
plt.savefig("nxprogress.png")

Figure 5: nxprogress.png – a preview of the network visualisation

The above is a preview of our final network visualisation (which can be viewed here!). A recap: the size of each subject’s node reflects its impact in Academia, and the thickness of an edge represents the percentage of shared vocabulary between two subjects. So, for instance, this visualisation thus far shows us that less vocabulary is shared between Medicine and Philosophy than between Psychology and Philosophy, as the edge between Medicine and Philosophy is thinner than the one between Psychology and Philosophy.

SOCIAL SCIENCES: Raw Data & Brief Conclusions

Here are my spreadsheets for each subject:
Anthropology
Figure: Anthropology raw data spreadsheet

Politics

Figure: Politics raw data spreadsheet

Law

Figure: Law raw data spreadsheet

These were then converted into averages and presented as pie charts, through the process outlined here. The results are presented below:
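(For reference, a chart like the ones below can be drawn with a few lines of matplotlib; the labels and values here are placeholders rather than the computed ratios.)

import matplotlib.pyplot as plt

# Placeholder labels and values, not the actual computed ratios
labels = ['Technical words', 'Other words']
ratios = [42, 58]  # e.g. a 42% Top-Word Ratio

plt.pie(ratios, labels=labels, autopct='%1.0f%%', colors=['lightblue', 'lightgrey'])
plt.axis('equal')  # keep the pie circular
plt.title('Top-Word Ratio (illustrative)')
plt.savefig('top_word_ratio.png')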

TOP-WORD RATIO

Figures: Top-Word Ratio pie charts (Anthropology, Politics and Law)

In terms of the Top-Word Ratio, the most technical subject is Law, with 42%, though it is very closely followed by Anthropology and Politics. The ratios are all very close to one another. This doesn’t tell us much about which subject is more technical; however, it does potentially point out a large flaw in our approach. Perhaps, when selecting the technical words, we have a natural urge to reach a certain number out of the 10, which pushes us to see technical words where there perhaps aren’t any. Therefore, I hope that the Common-Word Ratio will tell us more.

COMMON-WORD RATIO

Figures: Common-Word Ratio pie charts (Anthropology, Politics and Law)

Once again the percentages are quite close together, but the same pattern remains: Law is the most technical, followed by Anthropology and then Politics. This makes sense. Law is very technical, not just in terms of the different actors in the legal system (policymakers, defendant, plaintiff, judge, etc.), but also the different courts (supreme, magistrate, etc.) and the different types and forms of law. Moreover, there are a lot of words that do not have synonyms, such as case, judge, or plaintiff, to name a few, which means they have to be used over and over again, thereby increasing the frequency. Anthropology is a very broad subject, but many of its technical words are quite interchangeable (such as group, class, type, etc.), which would lower the frequency.