The aim of this project was to determine the extent of shared language between published articles from different academic disciplines. That meant that, before we could even begin to think about analysing the data, we first had to build our own lexical databases from scratch. To do this we had to source our own articles manually, clean them up and then write code that produced, for each article, a dataset of tokenised words and their corresponding frequency counts. Four subjects were investigated within each discipline and 10 articles were collected for each subject, meaning 40 articles in total for each discipline. For the Sciences, this equated to a whopping 105 213 words that had to be processed and tokenised to create the dataset from which we might start our analysis. Such an approach was extremely labour intensive and definitely not for the faint-hearted. So, without further ado, here is a step-by-step account of how I went about collecting, processing and creating the database of words for the Sciences.
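Before going through those steps, it may help to see what "a dataset of tokenised words and their corresponding frequency count" can look like in practice. The following is only a minimal Python sketch, not the code we actually used: the function names, the sample sentence and the tokenisation rule (lowercased runs of letters, allowing an internal apostrophe or hyphen) are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenise(text):
    """Lowercase the text and split it into word tokens.

    Assumption: a token is a run of letters that may contain
    internal apostrophes or hyphens (e.g. "cell's", "well-known").
    """
    return re.findall(r"[a-z]+(?:['-][a-z]+)*", text.lower())


def word_frequencies(text):
    """Return a Counter mapping each token to its count in one article."""
    return Counter(tokenise(text))


def combine(per_article_counts):
    """Add up the counts of several articles into one overall Counter."""
    total = Counter()
    for counts in per_article_counts:
        total.update(counts)
    return total


# Hypothetical one-line "article", used only to show the shape of the output.
sample = "The cell divides. The cell's nucleus divides first."
freqs = word_frequencies(sample)
```

Here `freqs` maps each word to the number of times it occurs in the sample, and `combine` could be used to merge the per-article counts of one subject or of a whole discipline into a single dataset.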
Choosing Disciplines and Articles
Firstly, the most obvious criterion that any article chosen for analysis had to meet was that it was written in English: there would be no point in carrying out an analysis of vocabulary on articles in multiple languages. Secondly, as there were three people in our group, we decided to divide the workload equally according to the conventional grouping of disciplinary fields used in universities:
- The Arts
- The Social Sciences
- The Sciences
Taking advantage of the fact that, as BASc students at UCL, each member of our group has a slightly different academic background, we split the disciplinary categories accordingly. As a Health and Environment major, my natural inclination was towards the Sciences. Within the Sciences, further refinement was needed to choose which subjects, and then which articles, would be investigated. We did this by searching the disciplinary divisions in Google Scholar's Metrics database, which contains a comprehensive range of disciplinary categories and sub-categories. Because of the time constraints on this project and the large amount of processing work needed to create the lexical databases, we imposed a standardised method for sourcing articles within any one discipline: all the articles were taken from the one academic journal with the highest h5-index and had to have been published in 2014.
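The selection rule itself is simple enough to write down as code. The sketch below is purely illustrative: the journal names, h5-index values and article records are hypothetical placeholders (not figures from Google Scholar), and the field names are assumptions.

```python
def top_journal(journals):
    """Return the journal record with the highest h5-index."""
    return max(journals, key=lambda journal: journal["h5_index"])


def eligible_articles(articles, journal_name, year=2014):
    """Keep only articles from the chosen journal published in the given year."""
    return [
        article
        for article in articles
        if article["journal"] == journal_name and article["year"] == year
    ]


# Hypothetical candidates for one subject; the values are made up.
candidates = [
    {"name": "Journal A", "h5_index": 120},
    {"name": "Journal B", "h5_index": 95},
]

# Hypothetical article records; the titles are placeholders.
articles = [
    {"title": "Article 1", "journal": "Journal A", "year": 2014},
    {"title": "Article 2", "journal": "Journal A", "year": 2013},
    {"title": "Article 3", "journal": "Journal B", "year": 2014},
]

chosen = top_journal(candidates)
selected = eligible_articles(articles, chosen["name"])
```

With these placeholder records, only the 2014 article from the top-ranked journal survives the filter; in the real project the same rule was applied by hand for each subject.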
Table 1: Subjects chosen for investigation and their corresponding Journals for The Sciences
| Discipline | Journal |
| --- | --- |
| Biology | Cell |
| Medicine | New England Journal of Medicine |
| Physics | International Journal of Physics |
| Psychology | Trends in Cognitive Sciences |
All references for articles used in this investigation can be found here.