To decipher the relationship between subjects’ shared vocabulary and their impact in academia through a visualisation, the visualisation needs two indicators that can simultaneously reflect the values of both variables. It also has to be capable of adequately expressing the complex, non-linear concept of interdisciplinarity. We therefore decided that a network visualisation was the most suitable choice.
A network visualisation contains two main elements: nodes and edges. In our conception of the visualisation, a node represents a subject and an edge represents the relationship between two subjects. The size of a node would therefore reflect a subject’s impact, and the thickness of an edge would reflect the percentage of shared vocabulary between two subjects. Together, these would let us picture the relationship between the two variables as desired.
As we did not delve into network visualisations in class, the first step towards building this visualisation was exploring the network visualisation packages available in Python and picking one. We tested several and settled on NetworkX, as it was the best documented and thus the most beginner-friendly.
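As a minimal sketch of the idea, a NetworkX graph with sized nodes and weighted edges can be set up as below. All the subjects, h5 values, and weights here are made-up placeholders, not our actual data:

```python
import networkx as nx

# Build a small undirected graph: nodes are subjects, edges carry a
# shared-vocabulary weight. All numbers are hypothetical placeholders.
G = nx.Graph()
G.add_node("History", h5=30)      # node size will come from the h5-index
G.add_node("Philosophy", h5=40)
G.add_node("Psychology", h5=55)
G.add_edge("History", "Philosophy", weight=5.0)     # % shared vocabulary
G.add_edge("Philosophy", "Psychology", weight=3.33)

sizes = [G.nodes[n]["h5"] * 10 for n in G.nodes]    # scaled for visibility
widths = [G.edges[e]["weight"] for e in G.edges]

# Drawing (requires matplotlib):
# nx.draw(G, node_size=sizes, width=widths, with_labels=True)
```

The drawing call is commented out only because it needs matplotlib; the node and edge attributes are where the h5-index and shared-vocabulary values plug in.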
Before I could even start coding the visualisation, I needed to gather the necessary data. Since a node’s size would stand for a subject’s impact in academia, a node’s value is simply its h5-index (2014). We had already collected the indices individually while compiling the Citation Metrics, so I simply had to retrieve them from the database.
The thickness of an edge represents the shared vocabulary between two subjects. Our measure of shared vocabulary between two subjects is not which words are shared, but the number of articles from each subject that contain a technical word from the other subject. For example, to look at the relationship between History and Philosophy, I need the number of History articles that contain Philosophy’s technical words and the number of Philosophy articles that contain History’s technical words. (Code used and further elaboration is here.)
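A rough sketch of this counting step is below. The article texts and technical-word set are invented stand-ins for our corpus, and the whole-word matching is just one plausible way to do the check:

```python
# Hypothetical mini-corpus: a few History article texts.
history_articles = [
    "an epistemology of archival sources",
    "trade routes in the medieval mediterranean",
    "ethics and memory in oral history",
]
# Hypothetical technical words belonging to Philosophy.
philosophy_words = {"epistemology", "ethics", "ontology"}

def articles_containing(articles, technical_words):
    """Count articles that contain at least one of the given technical words."""
    return sum(
        1 for text in articles
        if any(word in text.lower().split() for word in technical_words)
    )

print(articles_containing(history_articles, philosophy_words))  # → 2
```

Running the same function in the other direction (Philosophy articles against History’s technical words) gives the second count needed for the pair.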
However, I could not feed the article counts straight into the visualisation’s dataset, since the number of technical words per subject, and thus the maximum possible article count, differs for every subject. So we converted all the counts into percentages. Then, because the network visualisation is non-directional, to establish a single percentage of shared vocabulary between two subjects we took the average of the two subjects’ percentages.
Essentially, the value representing the thickness of an edge is the average of the percentage of subject 1’s articles that contain subject 2’s technical words and the percentage of subject 2’s articles that contain subject 1’s technical words.
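This normalise-and-average step can be sketched as follows; the counts and maxima passed in at the end are hypothetical:

```python
def edge_weight(count_1_in_2, max_1, count_2_in_1, max_2):
    """Average of the two directional shared-vocabulary percentages.

    count_1_in_2: subject 1's articles containing subject 2's technical words
    max_1:        maximum possible count for subject 1 (and likewise for 2).
    """
    pct_1 = count_1_in_2 / max_1 * 100
    pct_2 = count_2_in_1 / max_2 * 100
    return (pct_1 + pct_2) / 2

# Hypothetical counts for an imagined History–Philosophy pair:
print(edge_weight(8, 200, 12, 200))  # → 5.0
```

Because the average is symmetric in the two subjects, the resulting weight suits a non-directional edge.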
As a worked example, let’s assume that Philosophy and History have 20 technical words between them. This means that the maximum number of articles between them in which the other subject’s technical words can be found is 20 x 10 = 200, while the actual number shared is 10. Philosophy and Psychology, on the other hand, have 30 technical words between them, so their maximum is 30 x 10 = 300, while the actual number shared is again 10. Although in both pairs 10 articles contain the other subject’s technical words, the percentage for the former is 10/200 x 100% = 5% and for the latter 10/300 x 100% = 3.33%. These percentages are the values we ultimately used for the edges’ thickness (width).