Category Archives: Context

Delve a little deeper into the context behind the .txt and join us for some indulgent food for thought. Here we investigate the prominent themes, observations and questions that have arisen from this project. Enjoy!

The Network of Knowledge


From the get-go this blog has set out to explore the relations between select disciplines within the wider system of academia, conceptualised not as a hierarchical tree of knowledge, but as a non-linear and heterogeneous network more representative of a rhizome. Unlike a tree, which is governed by hierarchy and linearity and derives its meaning from them, the rhizome is an unordered, interconnected and semiotic scaffold, arguably more characteristic of the structure of knowledge (Cabrera and Roland, 2014). Re-conceptualising knowledge as a network, that is, a system of nodes (disciplines) connected by edges (the concepts or language shared between those disciplines), means that it can be studied under the parameters of Network Theory.

 

In brief, Network Theory is a sub-field of computer science with relevance to many disciplines, including particle physics, biology, economics and sociology. Its applications include the study of metabolic networks in biology, social network platforms like Facebook and Twitter and, perhaps most famously, the World Wide Web. Although network structures are prevalent in all areas of life, the study of networks as a scientific process is still in its infancy and is by no means complete. Recent research into the mathematical properties of networks has largely been the result of observations of the properties of actual networks coupled with attempts to model them (Newman, 2003). There are many different kinds of networks, which can be divided loosely into four categories according to the circumstances they best describe: social networks, information or knowledge networks, technology networks and biological networks. The term ‘loosely’ is used because, although each type of network has some fundamental structural and behavioural differences, many of their properties are shared or overlap with one another.

Indeed, these network structures can be applied quite readily to the world of published academia. The protocol amongst academics of citing existing work on related topics creates an interconnecting network in which the nodes denote articles and the edges denote the citations between them: a knowledge network. As papers can only cite other papers that already exist, and not those yet to be written, edges within citation networks can only point backwards in time. This makes citation networks inherently acyclic structures, in which closed loops are impossible to form (Newman, 2003). The structure of a citation network reflects the structure of the information stored at each node. Whether between articles within an academic subject or between disciplines as a whole, conceptualising these citations as a network with defined structural parameters makes it possible to reflect upon the nature of the relationship between the different entities represented as nodes.
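This acyclic property is easy to sketch in code. The following toy example (the paper IDs, years and citations are invented, not drawn from our data) checks both that every citation points backwards in time and that the resulting graph contains no closed loops:

```python
# Toy citation network: nodes are papers, directed edges are citations.
# Because a paper can only cite earlier work, edges point backwards in
# time, so the graph is acyclic (a DAG).

papers = {               # paper id -> publication year (hypothetical)
    "A": 1998, "B": 2003, "C": 2007, "D": 2012,
}
citations = {            # paper id -> list of papers it cites
    "A": [],
    "B": ["A"],
    "C": ["A", "B"],
    "D": ["B", "C"],
}

def edges_point_backwards(papers, citations):
    """Check that every citation points to an earlier paper."""
    return all(papers[cited] < papers[citing]
               for citing, cited_list in citations.items()
               for cited in cited_list)

def is_acyclic(citations):
    """Depth-first search for closed loops in the citation graph."""
    visiting, done = set(), set()
    def dfs(node):
        if node in done:
            return True
        if node in visiting:
            return False        # found a closed loop
        visiting.add(node)
        ok = all(dfs(c) for c in citations[node])
        visiting.discard(node)
        done.add(node)
        return ok
    return all(dfs(p) for p in citations)

print(edges_point_backwards(papers, citations))  # True
print(is_acyclic(citations))                     # True
```

Adding a forward-in-time edge (a paper citing a later one) would break both checks at once, which is exactly why closed loops cannot form.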

Figure 2: The underlying power law behind citation patterns (Price, 1965)

Alongside network analysis comes link analysis: the investigation of the crucial relationships and associations between objects (nodes) of different types (i.e. different academic subjects), which are otherwise not apparent in isolation. Quantitative studies of publication and citation patterns have revealed an underlying power law driving citation trends within academia (Price, 1965). As shown in Figure 2, the percentage of papers falls off steeply as a power of the number of times they have been cited: in other words, very few papers are cited a great number of times in one year. In theory, different citation patterns would produce different power laws describing their behaviour, and the differences between these power-law equations can be studied to gain a deeper understanding of the nature of citation patterns within a certain subject or discipline.
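The shape of such a power law can be sketched numerically. In this minimal illustration the exponent (alpha = 3) is an arbitrary choice, not a value fitted to Price's data:

```python
import math

# The fraction of papers cited k times falls off as k^(-alpha).
# alpha = 3 here is illustrative, not fitted to Price's (1965) data.
alpha = 3.0
ks = range(1, 101)
norm = sum(k ** -alpha for k in ks)      # normalise into a distribution

def fraction_cited(k):
    """Fraction of papers receiving exactly k citations."""
    return (k ** -alpha) / norm

# Most papers sit at the low end; heavily cited papers are vanishingly rare.
print(fraction_cited(1) > 100 * fraction_cited(10))   # True

# On log-log axes the relationship is a straight line with slope -alpha,
# which is how different citation patterns can be compared.
slope = (math.log(fraction_cited(10)) - math.log(fraction_cited(1))) / math.log(10)
print(round(slope, 6))                                # -3.0
```

Comparing the fitted slope across subjects would be one concrete way to study the differences between their citation patterns.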

Although a combination of time and technical constraints meant that an investigation into the mathematical qualities of the citation links between different academic subjects lay beyond the scope of this particular project, we did nonetheless attempt to embody a network-centric approach in our efforts. The underlying mathematical patterns behind citation networks resonate with content covered in both the previous and current QM modules and provide interesting scope for future investigation.

Written by: Isabelle Blackmore

Bibliography

Newman, M., (2003) ‘The structure and function of complex networks’, SIAM Rev., vol. 45, no. 2, pp. 167-256

Price, D., (1965) ‘Networks of Scientific Papers’, Science, vol. 149, July, pp. 510-512

On getting some Python assistance

The best bit of Python is without a doubt running code that finally outputs what you were gunning for! But right before that – the endless tweaking of code to get something that isn’t actually an error message – is probably the most frustrating part.

Thankfully, I’ve found that while stumbling into a problem happens too frequently, finding an answer is almost always as easy! This is probably because Python has been around for such a long time that issues I ran into doing the project had already been encountered and solved by others before I even knew Python was a thing. Here are my suggestions on where to look for crossing a coding hurdle:

0. Google!!

1. Documentation

If you’re using a package, it’s almost certain that its creators have written documentation for it. All the documentation I have come across so far is very detailed, providing instructions for everything from installation to the most obscure function the package can perform. It’s the first place I look when I’m having trouble calling a specific function, and for good reason too. E.g. pandas was used heavily in this project and its documentation can be viewed here: http://pandas.pydata.org/pandas-docs/version/0.12.0/
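As a tiny example of the kind of call the documentation covers (the words and counts below are made up, not project data):

```python
import pandas as pd

# Build a small DataFrame of hypothetical word counts and sort it;
# sort_values is one of many methods documented in the pandas API.
counts = pd.DataFrame({
    "word":  ["network", "data", "rhizome"],
    "count": [12, 30, 4],
})
top = counts.sort_values("count", ascending=False).reset_index(drop=True)
print(top.iloc[0]["word"])   # prints "data", the most frequent word
```

Whenever a call like `sort_values` doesn’t behave as expected, its documentation page lists every parameter and is usually quicker than trial and error.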

2. Github’s Issues page

If you’re using a package/code that has been uploaded onto Github, this page will come in handy. The Issues tab can be found on the right column of a project’s main page. You can easily do filtered searches through previously asked questions or compose a new issue. Perhaps because Github projects mean something personal to their authors, I find that authors are quick to answer questions that users might have. Even if that’s not the case, the active Github community will do so eventually anyway.

3. stackoverflow.com

Probably my favourite of the lot. The archive of questions on this site is everything. You will almost certainly find whatever you’re looking for on here. It’s well tagged, which makes searching for specific answers easy. Even if you can’t find a solution, you could always post your own question. I initially hesitated to do so because I thought my question was probably very basic, but I went ahead anyway as I wanted a solution ASAP. This is my question. Voila, I got responses within minutes of posting and a satisfactory answer soon after. Even though some comments did not provide me with what I was looking for, they taught me something new to consider, which is always welcome in the learning process! Gained my answer and some rep points for my question, sweet.

Diving into a quantitative, relatively code-heavy project without prior experience in programming is daunting but these resources have definitely helped. Here’s a shoutout to the Python community for its (indirect) support in this project!

Review: Screen Time! by Ben Schmidt

Since our project is one that centres around text analysis, I did a little research on the various text-analysis projects and tools on the internet to gain some insight into the area. I happened to chance upon Ben Schmidt’s blog post about Bookworm, which I thought had a lot of potential as a graphical visualisation aid. Here I present a short review (a summary, rather) of his post and some thoughts.


Bookworm is a tool that enables a user to visualise trends in repositories of digitised texts. Screen Time! introduces us to Ben Schmidt’s application of Bookworm: Ben was interested in language shifts and thus employed Bookworm to explore lexical trends in the language of television and film aired over the last few decades. He entered key terms into the search engine, which then summarised the frequency of their appearances in the form of a line graph. In his blog post, he presents the data he has collected and their implications.

Ben’s post presents several line graphs based on his searches. The advantage of this form of visual representation is how easy the graphs are to comprehend: the two clearly labelled axes allow us to interpret what each graph represents, and the use of different colours makes it easy to compare terms. One look at the first graph reveals that the ratio of ‘need to’s to ‘ought to’s has risen significantly since 1982. In the second graph, while the use of ‘global warming’ and ‘climate change’ has fallen in both television and film, we can tell that the fall was steeper for films, implying that these phenomena were discussed less frequently than before, over a shorter span of time.
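At its core, Bookworm plots term frequency per year. A toy approximation of that calculation (with invented subtitle snippets, not Bookworm’s actual corpus) might look like this:

```python
# A toy approximation of what Bookworm computes: the relative frequency
# of a two-word term per year. The subtitle snippets are invented.
corpus = {
    1985: "we ought to go now and we ought to hurry",
    2005: "we need to go now and we need to talk we ought to",
}

def frequency(term, year):
    """Occurrences of a two-word term per word of that year's text."""
    first, second = term.split()
    words = corpus[year].split()
    hits = sum(1 for a, b in zip(words, words[1:]) if (a, b) == (first, second))
    return hits / len(words)

# 'need to' overtakes 'ought to' over time, as in the first graph
print(frequency("need to", 1985) < frequency("need to", 2005))    # True
print(frequency("ought to", 1985) > frequency("ought to", 2005))  # True
```

Plotting such frequencies year by year, one line per term, gives exactly the kind of comparison graph the post describes.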

Also, in comparison to other word-frequency presenters, Bookworm incorporates links to every text searched. This helps with verifying the integrity of the data obtained, as the raw data is easily accessible, something remarkable compared to other forms of data collection.

However, as Ben points out in his blog post, the data on Bookworm is limited in the sense that the scope of collection consists merely of open-source subtitles. If the focus of his project is language in TV and film per se, there are obviously going to be several TV programmes and films which have not been subtitled, or have been subtitled incorrectly (including spelling errors). Either would affect any findings. [I find this point to be particularly relevant to the challenge we faced while working on our project. Because the .txt files (converted from PDF) we used were oftentimes incomplete, in the sense that line breaks and oddly encoded characters, amongst other things, were littered throughout the files, our results in NLTK were often less than desirable, which led us to manually edit the source rather than automate the conversion process.]
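The kind of automated clean-up we would have needed can be sketched as follows; the sample string and the specific substitutions are illustrative, not our actual pipeline:

```python
import re

# A sketch of the clean-up a PDF-to-.txt conversion typically needs
# before NLTK can process it. The raw sample string is invented.
raw = "The net-\nwork of knowl-\nedge is\nnon\u00adlinear \x0cand rhizomatic."

def clean(text):
    text = text.replace("\u00ad", "")            # strip soft hyphens
    text = re.sub(r"-\n(?=\w)", "", text)        # re-join hyphenated words
    text = re.sub(r"[^\x20-\x7e\n]", " ", text)  # replace odd encodings
    return re.sub(r"\s+", " ", text).strip()     # collapse line breaks

print(clean(raw))
# The network of knowledge is nonlinear and rhizomatic.
```

Even a filter like this would not have caught everything in our files, which is why some manual editing remained unavoidable.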

Nevertheless, this visual representation aid is no doubt interesting and perhaps at the forefront of lexicon analysis with the incorporation of technology to the discipline of literature and linguistics.

[Although Bookworm is a relevant tool for our project, it depends on the existence of a digitised text database. While many articles are indeed available in digital plain-text format, the copyright restrictions of journals prevent these digital files from existing in the public domain. Hence, we were unable to apply Bookworm in this project.]

Bibliography

Schmidt, B., (2014) ‘Screen Time!’, Sapping Attention. (Online). Available at: http://sappingattention.blogspot.co.uk/2014/09/screen-time.html (Accessed: October 2015)

Through the Looking-Glass: Transparency, Open Data and Academia

Figure 1: Alice entering the Looking Glass. Illustration by Sir John Tenniel. (Online) Available at: http://en.wikipedia.org/wiki/Through_the_Looking-Glass#mediaviewer/File:Aliceroom3.jpg (Accessed 10th January 2015)

“Let’s pretend the glass has got all soft like gauze, so that we can get through. Why, it’s turning into a sort of mist now, I declare! It’ll be easy enough to get through…. In another moment Alice was through the glass”

(Carroll, 1871: 9)

Just like Alice stepping through her looking-glass, a world in which a culture of data transparency reigns supreme may hold much promise and value for innovation, management, public engagement, and trust in academic research and governmental initiatives. However, the world through the looking-glass is not quite what it seems, and this can certainly be said of the complex socio-technical systems arising from this rush of publicly available projects. Indeed, Open Data initiatives need to be more mindful of both the positive and the negative implications of how open data functions within such complexity, especially with regard to academia (Kitchin, 2013).

The Rise of Open Data and the Culture of Transparency

Recent years have seen the drive towards data transparency become a primary ideological aim amongst governments, companies and research bodies. The main objective is to provide members of the public with the analytical tools that might enable evidence-informed participation in governmental initiatives, debate or policy-making (DWP, 2013). Evidence of such endeavours can be seen in the establishment of publicly available, open databases such as The London Data Store and data.gov.uk (Kaye, 2012; Willetts, 2012; DWP, 2013). Now, it would seem, the tide of Open Data initiatives is starting to trickle into the world of academia, with more and more agencies, such as the Wellcome Trust Centre, demanding that all their research be made freely available to the public online (Willetts, 2012). Furthermore, within academia there is a strong focus on developing methods both to measure the impact of research and to promote the dissemination and re-use of the data it creates. Tools such as Digital Object Identifiers (DOIs) can be assigned to research outputs, enabling them to be tracked through citations, in the hope that this will provide evidence to encourage the sharing and re-use of such data (Kaye, 2012). Nevertheless, as is the case with all Open Data initiatives, there has been less emphasis on, and appreciation of, their potential negative consequences and problems. Rob Kitchin identifies four key issues regarding the growing prevalence of Open Data that have not yet been given appropriate attention, the most important of which, with regard to academia, are financial sustainability and the lack of utility and usability (2013).

The Financial Stability of Open-Access Academic Data

To date, the main focus of these open data initiatives has been on the ‘supply side’ of creating and accessing data, whereas little attention has been paid to the financial aspects of sustaining funding and upkeep for such initiatives (Kitchin, 2013). Although the distribution of this data can be achieved at marginal cost, the initial copy, as well as the ongoing management of the data, does require some expenditure (especially with respect to obtaining the appropriate technologies and skilled staff). In the majority of cases, the data generated by these research endeavours has been a major source of income for academic organisations. Therefore, if the academic world were to embrace open data and public availability, how might these research projects remain financially sustainable in the absence of their main source of revenue?

Indeed, this would appear to resonate with one of the most fundamental concepts in Economics. The Law of Supply and Demand[1], as represented in Figure 2, states that when demand is high and supply is low, producers can charge high prices for goods (knowledge being the commodity in question in this instance). Academic research endeavours are usually highly specialised and require a substantial gestation period before reliable data can be collected and compiled. In other words, the academic supply chain of such data was incredibly slow, making the commodity in question, knowledge, incredibly rare and therefore highly sought after; accordingly, academic data was concealed behind the paywalls of academic journals (Willetts, 2012). Remove these paywalls, by acting upon open data initiatives, and you effectively open up a wealth of academic data, but at the same time render that commodity valueless. The challenge then becomes one of viability: with an effectively endless supply and no price, how do you manage demand?

Figure 2: The infamous Supply and Demand Curve, describing perhaps one of the most fundamental relationships in Economics. Available at: http://noahpinionblog.blogspot.co.uk/2013/12/whining-about-price-gouging.html (Accessed January 10th 2015)
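This tension can be sketched with a minimal linear supply-and-demand model; the coefficients below are hypothetical, chosen only to show the effect of flooding the supply side:

```python
# Illustrative linear model: Qd = a - b*P (demand), Qs = c + d*P (supply).
# Equilibrium is where quantity demanded equals quantity supplied.
def equilibrium(a, b, c, d):
    """Return (equilibrium price, equilibrium quantity): P* = (a - c)/(b + d)."""
    price = (a - c) / (b + d)
    return price, a - b * price

# Scarce supply (small c): a positive equilibrium price behind the paywall.
p1, _ = equilibrium(a=100, b=2, c=10, d=3)
# Open access floods supply (large c): the equilibrium price is driven to
# zero or below, i.e. the commodity loses its market value.
p2, _ = equilibrium(a=100, b=2, c=150, d=3)
print(p1 > 0, p2 <= 0)   # True True
```

The model is deliberately crude, but it captures the point: with an effectively unlimited supply, price can no longer ration demand.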

Although a number of different models have been suggested as a solution to this problem, perhaps the most viable and pragmatic solution to securing a stable financial base is direct government subvention (Kitchin, 2013). However, Government subvention would mean that the data produced by the research would have to have either consumer surplus value – the ability to generate significant public goods that are worth the investment of public expenditure – or the ability to create new products, thereby creating new markets and generating additional corporate revenue and tax receipts (Kitchin, 2013). Therefore, academic research for the sake of academic research (i.e. those that generate data with no obvious functional applicability) may become less likely.

Utility and Usability

The world of academia is incredibly esoteric and specialised; the databases produced by research endeavours are therefore often very technically focused. With regard to open data projects and the technicality of academia, there is a danger of creating little more than websites of “miscellaneous data files, with no attention to the usability, quality of content, or consequences of its use” (Kitchin, 2013). In other words, there is a risk of creating data ‘dumps’ that lack the infrastructure of relevant backup and auditing policies, ethics policies, administrative arrangements and management mechanisms. Moreover, issues around data protection and information sensitivity mean that, for legal reasons, more complex data (e.g. health statistics) would be more difficult to make publicly accessible, as more management would be necessary to ensure that these datasets comply with data protection laws. Furthermore, without the above-mentioned financial stability, appropriately placed and managed infrastructures are not likely to exist at all.

Part of the issue of quality lies in the fact that most open data sites have arisen as a quick response to an emerging phenomenon, usually created by amateur enthusiasts as opposed to professional organisations (Kitchin, 2013). As a result, these databases often lack the contextual knowledge and the technology to fully represent the data. Rather than leading to a process of refinement, these teething issues have encouraged the release of more and more variations of these data sets, all with different formats and problems of their own. This in turn develops into a vicious cycle, producing a continuous stream of generation after generation of flawed data sets. Unlike conventional academic data, published periodically (either quarterly or annually), this continual stream of data lacks a consistent pattern of use and the refinement that comes from competing for a place within a particular journal (Kitchin, 2013). The arrival of open data therefore also brings a trade-off, quality and purpose vs. availability, which needs to be addressed if open data initiatives have any hope of becoming the norm amongst academics.

Conclusion

Despite the potential drawbacks and negative consequences, open data initiatives hold much promise and value for the future, particularly with regard to the dissemination and re-use of data generated by academic research. That being said, more critical attention needs to be paid to how this transparency initiative is affecting the quality and usability of current public data sets, as well as the future financial sustainability of academic research endeavours. We are currently at a critical point in the emergence of open data, where attention needs to shift away from the creation and supply of data and towards data management and the consequences of its dissemination. Although the world through the looking-glass might not be quite the Wonderland we were hoping for, it does contain incredible and wondrous scope beyond anything our past, human world could offer. Regardless, one thing we can at least be certain of is that we will never see our world in quite the same way again.

Written by: Isabelle Blackmore

Bibliography

Carroll, L., (1871) Through the Looking-Glass, London: Macmillan & Co

Department for Work and Pensions (DWP), (2013) ‘DWP Open Data’, UK Government (Online). Available at: https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/221158/dwp-open-data-story-and-vision.pdf (Accessed: January 9th 2015)

Kaye, J., (2012) ‘What can Public Open Data and Academia Learn From Each Other?’, The British Library: Social Science Blog (Online). Available at: http://britishlibrary.typepad.co.uk/socialscience/2012/11/what-can-public-open-data-and-academia-can-learn-from-each-other.html (Accessed: 9th January 2015)

Kitchin, R., (2013) ‘Four Critiques of Open Data Initiatives’, LSE Blogs (Online). Available at: http://blogs.lse.ac.uk/impactofsocialsciences/2013/11/27/four-critiques-of-open-data-initiatives/ (Accessed: 9th January 2015)

The Law of Supply and Demand (Online). Available at: http://www.whatiseconomics.org/the-law-of-supply-and-demand (Accessed 10th January 2015)

Willetts, D., (2012) ‘Open, free access to academic research? This will be a seismic shift’, The Guardian (Online). Available at: http://www.theguardian.com/commentisfree/2012/may/01/open-free-access-academic-research (Accessed: January 9th 2015)

Smith, N., (2013) ‘Whining about price gouging’, Noahpinion – Blogging as self-medication (Online). Available at: http://noahpinionblog.blogspot.co.uk/2013/12/whining-about-price-gouging.html (Accessed: January 10th 2015)

[1] http://www.whatiseconomics.org/the-law-of-supply-and-demand