Category Archives: Creating the Database

Data Sourcing: Articles

[Workflow diagram: QM Project Workflow]

The aims and objectives of this project – determining the extent of shared language between published articles of different academic disciplines – meant that, before we could even begin to think about analysing the data, we first had to build our own lexical databases from scratch. To achieve this we had to manually source our own articles, clean them up and then write code that enabled us to create a dataset of tokenised words and their corresponding frequency counts for each article. Four subjects were investigated within each discipline and ten articles were sourced for each subject, meaning forty articles per discipline. For the Sciences, this equated to a whopping 105,213 words that had to be processed and tokenised in order to create the dataset from which we might start our analysis. Such an approach was extremely labour intensive and definitely not for the faint-hearted. So without further ado, here is a step-by-step account of how I went about collecting, processing and creating the database of words for the division of Science.

Choosing Disciplines and Articles

Firstly, the most obvious criterion that any article chosen for analysis had to meet was that it was written in English; there would be no point, in an analysis of vocabulary, in mixing articles from multiple languages. Secondly, as there were three people in our group, we decided to divide the workload equally according to the conventional grouping of disciplinary fields made in universities:

  1. The Arts
  2. The Social Sciences
  3. The Sciences

Taking advantage of the fact that – being BASc students at UCL – each member of our group has a slightly different academic background, we decided to split the disciplinary categories accordingly. Being a Health and Environment major, my natural inclination was towards the Sciences as an academic grouping of disciplines and the articles found within them. Within the Sciences, further refinement was necessary in choosing which subjects and subsequent articles were to be investigated. This refinement was achieved by searching the many disciplinary divisions within Google Scholar’s Metrics database, which contains a comprehensive range of disciplinary categories and sub-categories. Due to the time constraints on this project and the large amount of processing work needed to create the lexical databases, it was decided that – to impose a standardised method for sourcing articles within a given discipline – all the articles would be sourced from the one academic journal with the highest h5-index* and would have been published in 2014.

Table 1: Subjects chosen for investigation and their corresponding Journals for The Sciences

Discipline: Journal
Biology: Cell
Medicine: New England Journal of Medicine
Physics: International Journal of Physics
Psychology: Trends in Cognitive Sciences

All references for articles used in this investigation can be found here.

Data Sourcing: Articles

According to the standardised method of article collection – that Isabelle has described above – these are the subjects and corresponding journals chosen for the Arts:

Table 2: Subjects chosen for investigation and their corresponding Journals for The Arts

Discipline: Journal
English: Lingua
History: The American Historical Review
Philosophy: Synthese
Visual Arts: The Journal of Aesthetics and Art Criticism

Other than the h5-index, what also influenced our journal choices was the specificity of the journals. We did not want to use journals that are theme- or region-specific, as such journals could bias the vocabulary used and hence our measure of interdisciplinarity. For instance, in History, even though The Journal of Economic History has the highest h5-index, we avoided it because the articles within it (and hence the vocabulary they use) are likely to favour Economics more than other subjects, relative to a general history journal. Hence, we ultimately went with The American Historical Review.

All references for articles used in this investigation can be found here.

Data Sourcing: Articles

Originally, we had decided on six different subjects within the discipline of the Social Sciences – Geography, Law, Politics, Economics, Anthropology, Archaeology – and thought we would use five articles for each. However, upon advice from Martin, we decided to focus on fewer subjects and more articles for each, so that we would have a more accurate dataset thanks to the larger range. We therefore changed this to four subjects and ten articles per subject (forty articles in total).

The 4 subjects I chose were:

  • Law
  • Politics
  • Economics
  • Anthropology

I chose these because I felt they covered the main aspects of society – the legal, the political, the social, and the economic – and, after all, that is what the social sciences are: the study of society.

However, once I had converted and ‘cleaned up’ all my articles – perhaps one of the most labour-intensive tasks of the project (read about it here) – I had a technical malfunction that led to all my cleaned-up articles being deleted. This meant I had to do the entire process again. Due to time constraints [and for the sake of my own sanity], I decided to cut a subject, limiting my subjects to three. After careful evaluation, we decided to cut Economics, mostly because most of the law and politics articles were already heavily influenced (as the practices are in real life) by economics and economic theory. Therefore, we felt this subject was already represented in the others.

For each of these subjects, we chose the following publications:

Anthropology
Current Anthropology

Law
Harvard Law Review

Politics
American Political Science Review

(to find out more about the process we used to choose these publications please click here)

Data Sourcing: Articles

I found my articles using Web of Science, an online database of articles that can be accessed with my UCL login details. This allowed me to search for articles by publication name and year of publication, so that I could make sure all the articles the search turned up were from the same publication and the same year.

My first subject was Anthropology, for which the most ‘impacting’ publication was Current Anthropology. We had agreed to focus only on articles published in 2014, and I therefore put these two variables into the search engine as shown below:

[Screenshot: Web of Science search by publication name and year]

When the results came up, it was important that I changed the filter away from ‘Publication Date’, because otherwise the top articles would all have been from November/December 2014 and I would not have had a range of dates throughout the year. Also, more recent articles are harder to find online, meaning that actually obtaining the article would have been more difficult later on. Nor did I want to filter by ‘Times Cited’: although I originally thought this would be the most accurate representation of the publication, I later realised that the articles published earliest (in January and February) were cited the most simply because they had been available the longest. I therefore decided to go with the ‘Relevance’ filter, which seemed to give me a good mix of citation counts and publication dates.

[Screenshot: Web of Science results sorted by ‘Relevance’]

Once I had my list of articles, I tried to choose a selection of articles from different ‘branches’ of Anthropology. For example, some articles were obviously more scientific – such as “Craniofacial Feminization, Social Tolerance, and the Origins of Behavioral Modernity”, which focused on the biological changes and evolution of the human species – while others were more cultural, studying the impact of cultural factors or behaviours, such as religion, on specific groups or in specific places (for example: “Relatedness, Co-residence, and Shared Fatherhood among Ache Foragers of Paraguay”).

Data Sourcing: The Journey from PDF to .txt

[Workflow diagram: QM Project Workflow]

The conventional format for articles available for download from academic journals is PDF. This particular format, although on the web, is not really machine-readable and therefore, according to Tim Berners-Lee’s scheme, does not rate highly on the accessibility scale – as shown in Figure 1.

Figure 1: Berners-Lee’s 5-Star Development towards Open Data
The 5-Star Scale – The meaning behind the stars:

(adapted from http://5stardata.info/)

★ available on the Web (any format) under an open license
★★ available as structured data (e.g., Excel instead of image scan of a table like a PDF)
★★★ use non-proprietary formats (e.g., CSV instead of Excel)
★★★★ use URIs to denote things, so others might cite your work
★★★★★ link your data to other data – this provides context

It was therefore necessary – due both to proprietary rights and to the image-scan format – to convert these files into a format better suited to machine-readability: a text file (.txt).

Converting PDF to .txt files

Although there is an extensive collection of free PDF conversion sites, which can easily be found via a search engine, these ‘free’ services are incredibly misleading: the vast majority of them impose a cap on the number of articles one can convert in a given time period (usually a month). I encountered incredible difficulty finding a reliable method I could use to convert all 40 articles from PDF into rich text format (.rtf) and subsequently into a .txt file. To bypass the cap on free conversions, most sites demanded a subscription to their service and that a complete application be downloaded to your computer. To make matters worse, these applications were executable files (.exe), which are not compatible with Mac software, making it impossible to use the application beyond its capped limit. I eventually overcame this difficulty by paying a subscription to an online service that converted my PDF files into Rich Text Format (.rtf). My difficulties in finding a reliable service I could use consistently to convert my files prompted me to think more about issues surrounding data availability and the debate about Open Data, especially with regard to academia – a link to my subsequent blog post can be found here.

Once converted, the text within the files was severely distorted, in the sense that some words were split and others were concatenated. Furthermore, in the knowledge that these files would eventually be tokenised and the resulting tokens counted, in-line citations and references also needed to be removed, as these would disrupt the final word count for each article. It was therefore necessary to comb through each article individually and manually correct any words that had been incorrectly split or combined in the conversion process, as well as remove in-text citations and references. Again, this was an incredibly labour-intensive and time-consuming process. However, once this ‘clean-up’ procedure had been completed, the files were ready for processing in Python.

Data Sourcing: The Journey from PDF to .txt

As an alternative to Isabelle’s online service subscription, I employed a Python PDF parser and analyser called PDFMiner to convert the file format of my articles. After installing the package, it simply required the following command to convert a .pdf to .txt:
$ pdf2txt.py -o [desired .txt file name] [path to the source .pdf file]
The code in action:
[Screenshot: the pdf2txt.py command in action]
With this method, however, the text within the files was still distorted, just as Isabelle’s had been. I too had to manually look through and edit the files before they were ready for processing in Python. Thankfully, I found Sublime Text, which made the process far more efficient than using my laptop’s default editor, as it has several functions – such as multiple-line selection – that proved very handy.

Data Sourcing: From .txt to tokens

Once the articles had been selected, downloaded, converted and cleaned-up they were ready to be processed using the NLTK package in Python.

Step-by-Step process used to tokenise the text files:

1. Import the NLTK Package in Python and import the relevant article file (.txt)

Figure 1: Import the NLTK Package and the relevant article text file from the Working Directory

The NLTK package is case-sensitive and would therefore count the same word with a capital letter as a different word. For example, ‘However’ and ‘however’ would not be counted as the same word, as one begins with a capital letter and the other does not. If we did not account for this, our final vocabulary count would be inaccurate.

Figure 2: This code converts all words in the text file to lower case
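
Since the notebook screenshots are not reproduced here, a minimal sketch of this first step might look like the following (the filename is a placeholder, not one of the project’s actual files):

import nltk

# Read the cleaned-up article from the working directory (placeholder filename)
with open('biology_article1.txt', 'r') as f:
    raw = f.read()

# Convert everything to lower case so that 'However' and 'however'
# are counted as the same word
raw = raw.lower()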

2. Tokenising the data

A token is a subsection of a string that results from dividing the original string according to a defined rule. These divisions can be made at sentence level or, as in this project, at word level. NLTK has a built-in tokeniser – called here as word_punct_tokenizer.tokenize(text) – that divides text at word boundaries and punctuation, thus splitting the file into individual tokens of words and punctuation marks. However, as it is only the words we are interested in, it is necessary to remove any punctuation or spaces that might disrupt the final vocabulary count.

Figure 3: Built-in NLTK Package that allows the tokenisation of words in a string or text file
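
Continuing the sketch above, this step might look roughly like the following (word_punct_tokenizer simply mirrors the name used in the screenshots):

from nltk.tokenize import WordPunctTokenizer

# Split the lower-cased text into tokens of words and punctuation marks
word_punct_tokenizer = WordPunctTokenizer()
tokens = word_punct_tokenizer.tokenize(raw)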

3. Removing words: ‘the’, ‘and’, ‘he’, ‘she’, ‘it’

NLTK has a built-in package called ‘stopwords’, which contains a comprehensive list of common words for a given language (specified as ‘english’ in In[150]); these include words such as ‘he’, ‘she’ and ‘the’. The presence of these words in the text file would undoubtedly distort the final vocabulary count, and they therefore need to be removed. Fortunately, we can use this stopwords package to write a ‘for’ loop that scans through the tokens created in the previous line of code and skips over any words that match those contained in the stopwords list, thus creating a new list (defined as new_edit) that does not contain these common words.

Figure 4: The NLTK Package ‘Stopwords’ contains an inclusive record of common words for a specified language
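
A sketch of this step, continuing from the tokens produced above (the stopwords corpus must first be fetched with nltk.download('stopwords')):

from nltk.corpus import stopwords

# List of common English words to be excluded
stop_words = stopwords.words('english')

# Scan through the tokens, skipping any that appear in the stopwords list,
# to build a new list (new_edit) without these common words
new_edit = []
for word in tokens:
    if word not in stop_words:
        new_edit.append(word)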

4. Removing Spaces and Punctuation from the tokenised file

Although we have removed the most common – yet in this instance meaningless – words, the list still contains various pieces of punctuation and symbols that would otherwise make up a large proportion of the word count. It is therefore necessary to remove this punctuation and these symbols as well.

Figure 5: Removal of punctuation using the Punctuation Package and a For Loop

It is then merely a case of writing a ‘for’ loop that runs through all the tokens created by the previous code, skipping over any token that matches an item in the punctuation list. However, it is also necessary to append a space using the code in In[152], as this is not accounted for in the punctuation list but would still distort the dataset.
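
The ‘Punctuation Package’ referred to above is the punctuation list in Python’s string module; a sketch of this step, continuing from new_edit, might be:

import string

# Punctuation symbols to exclude; a space is appended because it is not
# included in string.punctuation but would otherwise distort the counts
punctuation = list(string.punctuation)
punctuation.append(' ')

# Skip over any token that is purely punctuation or a space
cleaned = []
for token in new_edit:
    if token not in punctuation:
        cleaned.append(token)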

5. Saving our tokenised list as a .csv file

We now have a collection of tuples – immutable Python sequences – each pairing a word with the frequency of that word in the article. The final stage in creating our vocabulary dataset is to convert this collection into a Data Frame and save it as a comma-separated values (.csv) file.

Figure 6: Converting the Tuples to a Data Frame using Python Pandas

The code in In[306] rearranges the Data Frame so that the words with the highest frequency appear at the top.

Figure 7: Saving the Data Frame as a comma separated value (.csv) file to the working directory
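
A sketch of this final step, where cleaned is the punctuation-free token list from the previous step (the output filename is a placeholder, and sort_values is a present-day pandas equivalent of the sorting shown in In[306]):

import nltk
import pandas as pd

# Count how often each word occurs, giving (word, frequency) tuples
freq = nltk.FreqDist(cleaned)
word_counts = list(freq.items())

# Convert the tuples into a Data Frame and sort so that the most frequent
# words appear at the top
df = pd.DataFrame(word_counts, columns=['word', 'count'])
df = df.sort_values(by='count', ascending=False)

# Save the dataset to the working directory as a .csv file
df.to_csv('biology_article1_wordcounts.csv', index=False)
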
HEY PRESTO we’ve created a dataset!

This process was repeated for each individual article for each of the four Science subjects.

Getting the Article Ready for Processing – Problems Encountered

From Article to .txt file

Most articles online are in PDF format, and the problem with this is that it is really tedious to get them into plain-text (.txt) format, which is the format Python needs in order to read them. The process usually goes something like this:

PDF -> using bought converter -> Word Document -> RichText -> PlainText

The issue with this process is the paid converter: to get PDF files into a Word document you need to use a converter (such as the one Isabelle bought), which costs money. Having no desire to spend money on an app I would never use again (and assuming it would be frowned upon to torrent it illegally), I needed another way to get my articles as Word documents.

I found that JSTOR (another database we can access with our UCL login) publishes full texts online. This meant all I had to do was copy and paste the text into a Word document. This worked beautifully, except that it took a lot of work to clean up the document.

First, I had to remove the citations. To do this I used the Find function (Ctrl+F) to locate all the brackets in the document, which would usually give me a citation, for example (Janssen, 2014); the entire citation would then be deleted. Seeing as a lot of the articles were over 20 pages long, this took a lot more time than expected. Next, all figures and their captions had to be removed. This was difficult because sometimes there were text boxes referring to specific figures and tables that had moved during the copy-and-paste process, meaning the entire article had to be scanned for them.

Finally, all hyperlinks and formatting had to be removed. The easiest way to do this was the ‘remove all formatting’ option in Word. This left just the text, which could then be saved as a .txt file. Despite being rather time-consuming and labour-intensive, this proved a good solution, as it didn’t cost me a penny. However, not all articles were available on JSTOR, which meant I had to adapt and, on two occasions, choose different articles.

Tokenising
Using the very basic code we learnt in class and some help from the internet (thank you, StackOverflow!), I was able to put together code that would tokenise the article files I uploaded so I could analyse them. However, I found that it was tokenising strangely – it was making each letter a token rather than each word – which meant that instead of the most common words, I would get a list of the most common letters. That obviously wasn’t much help.

I asked Isabelle, and she sent me a ‘cleaning code’ that basically removed all the punctuation, double spaces and stopwords (words that are very common – for example: is, that, the…). After using this code, the tokenising worked correctly.

[Screenshot: the cleaning code in action]
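
For anyone wondering why the first attempt produced letters rather than words, here is a minimal illustration (not our project code): iterating over a raw string yields single characters, so the frequency count has to be run on a list of word tokens instead.

import nltk
# (requires the 'punkt' tokeniser models: nltk.download('punkt'))

text = "the quick brown fox jumps over the lazy dog"

# Counting over the raw string treats every character as a token,
# which is why the 'most common words' came out as single letters
letter_counts = nltk.FreqDist(text)
print(letter_counts.most_common(3))   # [(' ', 8), ('o', 4), ('e', 3)]

# Tokenising into words first gives the intended result
words = nltk.word_tokenize(text)
word_counts = nltk.FreqDist(words)
print(word_counts.most_common(3))     # ('the', 2) now correctly tops the list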

Data Sourcing: From .txt to tokens

An alternative I used for Step 4 of the process Isabelle described is NLTK’s in-built Regular-Expression Tokenizer (RegexpTokenizer). “`RegexpTokenizer` splits a string into substrings using a regular expression… [it] forms tokens out of alphabetic sequences, money expressions, and any other non-whitespace sequences.” (Source)

E.g.
>>> s = "Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.\n\nThanks."
>>> tokenizer = RegexpTokenizer('\w+|\$[\d\.]+|\S+')
>>> tokenizer.tokenize(s)
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']

This is a sample of my code showing its application:

[Screenshot: a sample of the code applying RegexpTokenizer]

The rest of the code essentially returned a .csv file, like Isabelle’s, containing a data frame of all the words used within an article and their frequencies. I ran all 40 Arts articles through this code.
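
A condensed sketch of that pipeline, under the assumption that the article has already been cleaned and saved as a .txt file (the filenames and column names are placeholders):

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

# Read one cleaned article (placeholder filename) and lower-case it
with open('history_article1.txt', 'r') as f:
    text = f.read().lower()

# Keep alphanumeric sequences only, dropping punctuation in the process
tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(text)

# Remove common English stopwords
stopset = set(stopwords.words('english'))
tokens = [w for w in tokens if w not in stopset]

# Count word frequencies and save them as a csv, most frequent first
freq = nltk.FreqDist(tokens)
df = pd.DataFrame(freq.most_common(), columns=['word', 'count'])
df.to_csv('history_article1_wordcounts.csv', index=False)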

Creating the Dataset – The Original Approach

Originally, I had approached this project in the way we had outlined in our presentation, which was a little different from how we eventually tackled the articles. Below, I describe this process and the difficulties we encountered:

Finding the Technical Words
The plan was to come up with a list of technical words from the abstract of each article and its list of key words, both of which are given on the article’s Web of Science page. For every article I had chosen, I copied and pasted the title, abstract and key words into a Word document, creating one file listing all 10 articles I was looking at. You can find an example below:

[Screenshot: Word document of titles, abstracts and key words]

This is where I encountered my first problems: I soon realised that not all the articles had abstracts or key words. This meant I had to go back and change my original selection of articles to make sure that each had both.

Once I had copied and pasted all the abstracts into the document, I went through each abstract and highlighted all the words I found both discipline-specific and technical. I then did the same for the list of key words and made a list of these words for each article. However, I soon found more issues. The first was that not all articles had the same number of key words, and even then many key words were not remotely technical; for example, one article had four key words: ‘Anthropology, Papua New Guinea, Church, God’. In some cases an article did not have even one usable key word, which tempted me to go back and choose more technical articles – something that would obviously have unfairly influenced our results.

Counting the Technical Words
For the articles where I did find technical words, I wrote a very basic piece of code to count the number of times each technical word occurred. My aim was to see what percentage of the article could be considered ‘technical’, so I tokenised the article and used article.count(‘word’) to count how many times a word occurred. An example is shown below using the word ‘kinship’:

[Screenshot: counting occurrences of ‘kinship’]

I did this for all the technical words and counted up how many times they appeared throughout the article. I then added up (using the + sign in Python) the total occurrences of all the technical words and divided this by the total word count (found using len(final_article)).
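
A simplified sketch of this calculation (the token list and technical words below are toy examples, not data from the project):

# final_article stands in for the full list of tokens from one article
final_article = ['kinship', 'ties', 'among', 'ache', 'foragers', 'shape',
                 'co', 'residence', 'and', 'shared', 'fatherhood']

technical_words = ['kinship', 'foragers', 'residence']

# Count every occurrence of every technical word, then add them up
total_occurrences = 0
for word in technical_words:
    total_occurrences += final_article.count(word)

# Divide by the total word count to get the 'technicality' ratio
ratio = total_occurrences / len(final_article)
print(ratio)   # 3/11 here; for the real articles the values were tiny (e.g. 0.00012)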

Flawed Results
However, this is where the problems started: even for my most technical articles, I found that the decimal I was getting was so small it was basically zero. It would be pointless to compare a ratio of 0.00012 to a ratio of 0.00043, even if this indicated a huge difference in technicality. I decided this ratio could be increased by increasing the occurrence of technical words – by adding different versions of each word. For example, instead of just looking for ‘EVOLUTION’, I included ‘evolution’, ‘evolve’, ‘evolutionary’, etc. I found these derivatives using the digital library on my computer: I would enter the word and check the derivatives list, an example of which is shown below:
[Screenshot: derivatives list for ‘evolution’]

I considered perhaps creating a function that multiplied each ratio by a constant so that the differences would become more visible. However, the more I thought about the process, the more I saw how badly this approach represented the actual aims of the project. Just because a difficult word was not in the abstract didn’t mean that there were no difficult words in the article itself. Furthermore, many of the key words were used more as ‘search terms’, so sometimes these words didn’t even occur in the article – an obvious limitation.

This is when I realised we needed to change our approach. I touched base with Isabelle, and we decided to focus on the actual articles, rather than the abstracts, and look at the most common words.

– Isadora Janssen

Building a Dataset: Concatenating article csv files and visualising ‘Common Disciplinary Words’

[Workflow diagram: QM Project Workflow]

After having created the individual csv files, each containing a dataset of vocabulary and its occurrence frequency, it was necessary to concatenate these to create a comprehensive language dataset representing the language used in that discipline.

Concatenating Code:

1. Import the relevant Python packages:
  • Matplotlib –  a python 2D plotting library used for the creation of plots such as:
    • Histograms
    • Scatter Plots
    • Bar Charts
  • Pandas – a python library used for data structures and data analysis
  • Numpy – a python library used for scientific computing:
    • Linear algebra
    • Simple and complex mathematical computations
Figure 1: Import the relevant Python packages
2. Import relevant csv files into Python using Python Pandas
Figure 2: Importing the relevant csv files from directory under variable names
 3. Concatenate the different csv files using Python Pandas
Figure 3: Concatenate the csv files

In[18] was put in as a checkpoint in the code. The concat.head() function returns the first five rows of the Data Frame. Checkpoints – such as concat.head() – are always useful to write into your code to ensure it is functioning as intended and to guard against bugs.

The final step was to save the resulting Data Frame to the working directory as a csv file, so that it could be called upon later when compiling yet more general datasets.

Figure 4: Save resulting Data Frame to working directory as a csv file
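
A minimal sketch of steps 2 and 3 plus the save, assuming just three of the ten article files and placeholder filenames:

import pandas as pd

# Step 2: import the individual article csv files
article1 = pd.read_csv('physics_article1.csv')
article2 = pd.read_csv('physics_article2.csv')
article3 = pd.read_csv('physics_article3.csv')

# Step 3: concatenate them into one disciplinary dataset
concat = pd.concat([article1, article2, article3])

# Checkpoint: inspect the first five rows to confirm the concatenation worked
print(concat.head())

# Save the combined Data Frame to the working directory for later use
concat.to_csv('physics_all.csv', index=False)
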
4. Visualising the Top 50 Most Common Words within these newly compiled datasets
Figure 5: Uploading revised csv file to Python

As the csv file contained vocabulary sorted in descending order of word count, it was necessary to move the top five technical words to the top of the file so that they would be included in the visualisation. This meant that the saved csv file had to be re-uploaded to Python after this editing process had occurred; this was achieved in In[37] in the screenshot above.

Figure 6: Assigning Frequency Count and Plot Type

The frequency count (ct) was used to determine how many indices of the Data Frame are included in the final visualisation. However, in order to ensure that the appropriate indices are included, it is necessary to reset the index.

Furthermore, to add another informative element to the visualisation, we also incorporated some code to produce a line representing the mean number of times any word occurs in the dataset. This is projected on top of the bar chart for comparison. df_mean = df[:100] ensures that the mean is taken from the first 100 items in the dataset.

The parameters of the x-axis and the mean value are defined. The type of visualisation is then defined as a bar chart, it is given a title, and the axes are labelled accordingly. The final visualisation is then saved to the working directory.

Figure 7: Screenshots of the code used to visualise a Bar Chart of the Most Common Words in Physics – as sourced from the collection of Physics Articles
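
As the plotting code itself only appears in the screenshots, here is a rough present-day sketch of the same idea; the filenames and the ‘word’/‘count’ column names are assumptions:

import pandas as pd
import matplotlib.pyplot as plt

# Load the edited csv of common words and reset the index so the bar
# positions line up with the row numbers
df = pd.read_csv('physics_common_words.csv')
df = df.reset_index(drop=True)

ct = 50                     # number of words to plot
df_plot = df[:ct]
df_mean = df[:100]          # the mean is taken over the first 100 items
mean_count = df_mean['count'].mean()

# Bar chart of word frequencies with a horizontal line marking the mean
ax = df_plot.plot(kind='bar', x='word', y='count', legend=False,
                  title='Most Common Words in Physics')
ax.axhline(mean_count, color='red')
ax.set_xlabel('Word')
ax.set_ylabel('Frequency')

plt.tight_layout()
plt.savefig('physics_mostcommon.png')   # saved to the working directory
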
5. Example of the resulting Visualisation:
Figure 8: Example of visualisation resulting from above code

Identifying technical words

To identify technical words within an article, I opened the relevant .csv file containing all the words used in that article, sorted by frequency. As per our definition of technical words, I then hid all words with a count of less than 10 and recorded all the words that qualified as technical in a new spreadsheet. After this step, I ended up with .csv files of the technical words found in each Arts subject, as well as a concatenated file containing all the technical words of the Arts.
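
This filtering was done by hand in a spreadsheet; an equivalent pandas sketch (placeholder filename, assumed column names) would be:

import pandas as pd

# Load one article's word counts
df = pd.read_csv('lingua_article1_wordcounts.csv')

# Our definition of a technical word: a count of 10 or more in the article
technical = df[df['count'] >= 10]

# Record the qualifying words in a new spreadsheet
technical.to_csv('lingua_article1_technical.csv', index=False)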

Tackling Shared Vocabulary – How to Compile Dataset of Words shared Between Subjects

We then sought to identify vocabulary shared between subjects. Between two subjects, shared vocabulary is a technical word from either subject that can be found in articles of the other subject.

Initially I thought it would be possible to use df.merge() with the ‘inner’ join option from the pandas package, the premise being that the resulting dataset would contain only the words that occurred in both original data frames. However, when I ran the code the output was an empty data frame:

Figure 1: Unsuccessful attempt to conduct an ‘inner’ join between two datasets
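
For illustration, this is the kind of inner join that was attempted, shown here on toy data with an explicit join column (the real datasets may have been structured differently):

import pandas as pd

# Toy data frames standing in for two subjects' vocabulary datasets
physics = pd.DataFrame({'word': ['energy', 'field', 'model'], 'count': [40, 25, 12]})
biology = pd.DataFrame({'word': ['cell', 'model', 'protein'], 'count': [60, 9, 33]})

# The intended inner join: keep only the words present in both data frames
shared = pd.merge(physics, biology, on='word', how='inner')
print(shared)   # here only 'model' survives; on the real data the result was empty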

Unfortunately, when this code was run for the vocabulary dataset of every academic subject, the result was the same in all instances. It was therefore necessary to re-think my approach to combining the data frames.

Figure 2: Concatenating to create a dataset of all shared vocabulary between the subjects investigated

Although this new approach achieved the aim of creating a dataset of the vocabulary shared between different subjects, it fundamentally changed the nature of the data being investigated. The premise of using value_counts() after concatenating the data frames of individual subjects was to count the occurrence of a particular item of vocabulary across the total number of subjects investigated. If a word returned a value of 4 after running value_counts(), that meant it had occurred in at least one of the articles for every subject investigated.
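
A toy sketch of the concatenate-then-count idea, assuming each subject’s vocabulary has already been reduced to a single column of distinct words:

import pandas as pd

# Toy word lists standing in for the four Science subjects' vocabularies
biology    = pd.Series(['cell', 'positive', 'protein'])
medicine   = pd.Series(['patient', 'positive', 'trial'])
physics    = pd.Series(['energy', 'positive', 'field'])
psychology = pd.Series(['memory', 'positive', 'attention'])

# Concatenate the subjects and count how many subjects each word appears in
all_words = pd.concat([biology, medicine, physics, psychology])
counts = all_words.value_counts()

print(counts.head())   # 'positive' returns 4: it occurs in all four subjects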

Figure 3: An example of the data and the values returned after running value_counts()

Although in essence this was successful, in the sense that it demonstrates which words are shared between subjects, it does not account for the prevalence of each word within each subject. Take the word ‘positive’, for example: although the dataset shows that it occurs in all subjects, this does not mean that it features heavily in every article. It might be that ‘positive’ is used only once in one article out of the ten investigated for Physics, but the fact that it is also used at least once in at least one article of every other subject means that it is given the highest frequency value (4 out of the 4 scientific subjects investigated). This bias needs to be flagged and addressed for readers in order to avoid any misrepresentation of the data or misleading results.

This inherent bias within the data was resolved by manually searching through each subject’s vocabulary dataset to determine how many articles each particular word was present in. However, this meant manually searching through a database of an extremely large number of words, which would have been incredibly laborious and time-consuming if tackled without some sort of sorting mechanism.

Figure 4: Streamlining the dataset for more specific analysis

In the example shown in Figure 4, the original dataset containing all shared words between the investigated subjects was ‘streamlined’ to contain only a specified proportion. In this instance we are looking only at the words that occurred in 2 out of the 4 subjects investigated in ‘The Sciences’ category.
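
Continuing the toy example above, the ‘streamlining’ step amounts to filtering the value counts (rebuilt here so the snippet stands alone):

import pandas as pd

# counts stands in for the Series produced by value_counts() above
counts = pd.Series({'positive': 4, 'model': 3, 'field': 2, 'trial': 2, 'memory': 1})

# Keep only the words that occurred in exactly 2 of the 4 Science subjects
shared_in_two = counts[counts == 2]
print(shared_in_two)   # field and trial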

From this point onwards it was a case of returning to each original dataset, conducting a specific search for the words contained in the ‘streamlined’ dataset and then making a note of their prevalence among the individual articles for that subject.

Figure 5: A bar chart showing the prevalence, within each individual subject, of words shared between 3 out of 4 of the Science subjects

SOCIAL SCIENCES: Studying each Subject as a Whole

I wanted to look at the technicality of each subject as a whole, rather than each article individually. To do so, I took all 10 articles from a single subject – in this case Anthropology – and copied and pasted them all into the same text file. This effectively created a new ‘article’ containing all the words used in Anthropology. I ran this file through the code, tokenised and cleaned it up, and analysed the most common words; I repeated this for each subject. Below I have divided the results by subject:


ANTHROPOLOGY

Once I had cleaned up the text, I used the FreqDist function to find the top 50 words within the subject. I was mainly interested in the top 5, but it was interesting to see a fuller list of words. I soon noticed some small irregularities: the numbers ‘1’ and ‘2’ were on the most common words list, as were the letter ‘g’ and small basic words or abbreviations such as ‘co’ and ‘also’. I therefore needed to remove these from the word list before continuing. At first I tried to do this with the big_article.remove(‘co’) command; however, I soon found that this removed only one instance of the word, when sometimes the word occurred over 50 times. I therefore wrote a loop that removed all these small, insignificant words in one go. It was important to place this code BEFORE I converted the data from a list into an NLTK Text object, as the Text object doesn’t allow you to remove items. The code used is seen below:
[Screenshot: loop removing insignificant words]
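
As the screenshot is not reproduced here, a toy sketch of the same clean-up (a list comprehension stands in for the original loop):

# big_article stands in for the list of tokens from all ten Anthropology articles
big_article = ['human', 'g', 'kinship', 'co', '1', 'also', 'human', 'co', '2']

insignificant = ['1', '2', 'g', 'co', 'also']

# list.remove() only deletes one instance at a time, so rebuild the list
# instead, skipping every occurrence of the insignificant tokens
big_article = [word for word in big_article if word not in insignificant]

print(big_article)   # ['human', 'kinship', 'human']
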
Once this was done, I could use the frequency distribution to find the top 5 words. However, I encountered a second issue: the word ‘human’ was first and the word ‘humans’ was fifth. ‘Human’ and ‘humans’ mean the same thing, so I wanted the code to count them as the same word. I solved this by replacing every instance of ‘humans’ in the list with the word ‘human’, using the following bit of code:

[Screenshot: code replacing ‘humans’ with ‘human’]
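
Again as a stand-in for the screenshot, merging the plural and singular forms might look like this on a toy token list:

from nltk import FreqDist

big_article = ['human', 'evolution', 'humans', 'human', 'humans']

# Replace every occurrence of 'humans' with 'human' so both forms are counted together
big_article = ['human' if word == 'humans' else word for word in big_article]

print(FreqDist(big_article).most_common(2))   # [('human', 4), ('evolution', 1)]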

Running the frequency distribution after this replacement produced the following ‘Top 50’ word list:

[Screenshot: Top 50 word list for Anthropology]

From here I could use the same code outlined here to create a bar chart of the 50 most common words, which can be seen below:

[Screenshot: bar chart of the 50 most common words in Anthropology]

However, while this graph is an interesting visualisation, it gives me too many words to consider, so I needed to narrow down my list. I decided to go with the top 5 words and created a Data Frame of the five most frequently occurring words in Anthropology, which can be seen below:

[Screenshot: Data Frame of the top 5 words in Anthropology]

These are the words that could be found in (almost) every single Anthropology article. While they are very straightforward words (after all, there are few English speakers who do not know what the word ‘human’ means), they are also quite discipline-specific, and I am content with them.

POLITICS
I repeated the process above for Politics. Once again I found that there were words on the top 50 list that I knew for a fact appeared multiple times in a single article rather than across a range of articles. I therefore decided to focus on the top 5 again. The Data Frame created can be seen below:

[Screenshot: Data Frame of the top 5 words in Politics]

There aren’t any HUGE surprises in this list, though I did find it peculiar that the word ‘self’ is used so often in Politics articles – I would not have guessed that. I also found it very interesting that two of the five words (‘social’ and ‘group’) were the same as in Anthropology, showing some definite shared vocabulary between the two subjects. Once again, though, one can’t be too surprised: the entire discipline, the SOCIAL SCIENCES, is concerned with society and how we function as a collective, which is obviously directly related to the words ‘social’ and ‘group’.

LAW

When I ran the code for Law, I had a similar problem – it was counting ‘court’ and ‘courts’, and ‘right’ and ‘rights’, as separate words, which obviously affected the top 50 list, as can be seen below:

[Screenshot: Top 50 word list for Law before merging plural forms]

Therefore I had to include the same little string of code I used previously to replace words:

[Screenshot: code replacing plural forms in the Law word list]

This meant that the new list had both forms under the same count:

[Screenshot: updated Law word list]

Finally, the top 5 words were:

[Screenshot: Data Frame of the top 5 words in Law]

I was least surprised by this set of words, though perhaps I didn’t expect ‘copyright’ to be such a recurring theme in Law articles. Then again, copyright and copyright law were quite a hot topic in 2014 (click link for details and source), which may have led to several articles at least referring to examples of copyright law. Sadly, there were no shared words between Law and any of the other Social Science subjects, unlike Politics and Anthropology. Although I can’t say I am very surprised: Law is a very specific subject with very specific terminology, unlike perhaps the other social sciences, whose silo lines blur more into one another.

–Isadora Janssen

Processing Data

I calculated the Top-Word Ratio and the Common-Word Ratio for each of the 10 articles in each of the subjects and recorded these in an Excel spreadsheet, as in the example below:

[Screenshot: Excel spreadsheet of Top-Word and Common-Word Ratios]

I then found the mean of both these ratios for each subject, using Excel:

[Screenshot: mean ratios calculated in Excel]

(Adding them together and dividing by the total, 10)

and made these into pie charts, again using Excel (as seen below), which meant that I ended up with six pie charts (two for each subject), which can be found here.

[Screenshot: example pie chart]

Types of Measurement for Academic Influence: h-index

The h5-index of an academic journal is a measure that reflects the productivity and impact of a journal. It is the journal equivalent of an author’s h-index. Essentially, the more academic papers that cite articles from within that journal, the higher the h5-index. As such, a high h5-index implies a journal’s relative importance within its field.

Within a journal with a score of 30, there exist 30 articles that have each been cited at least 30 times.

While the h-index was formulated by Jorge Hirsch, the h5-index has been popularised by Google Scholar Metrics. A journal’s h5-index score reflects the fact that, over the past five years, the journal has published [score no.] articles that have each been cited at least [score no.] times. So within a journal with a score of 30, there exist 30 articles that have each been cited at least 30 times.

The h5-index is not without its limitations. Nonetheless, it provides a quantitative reflection of scientific impact – a vast concept – in a manner that is simple to understand, even for the layman. Additionally, its relevance in academia is likely to strengthen, given the ease with which the index can be calculated and accessed. Hence, we decided that it would be an appropriate measure of academic impact for our project.

Types of Measurement for Academic Influence: Impact Factor (IF)

The Impact Factor (IF) of an academic journal is a measure that reflects the average number of citations to recent articles published in that journal. It is frequently used as a proxy for the relative importance of a journal within its field; that is to say, journals with higher Impact Factors are deemed more important than those with lower ones.

Formulated by Eugene Garfield, the founder of the Institute for Scientific Information, the Impact Factor is calculated by dividing the number of citations in a specified year to articles published in the two previous years by the total number of articles published in those two years. Therefore, an Impact Factor of 1.0 means that, on average, the articles published one or two years ago have been cited once; an Impact Factor of 2.5 means that, on average, those articles have been cited two and a half times, and so on. The citing works may be articles published in the same journal, but most are from different journals, proceedings or books indexed by Web of Science, and the figures can be accessed from Thomson Reuters’ Journal Citation Reports.
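
As a worked illustration of that definition (the figures here are invented for the example):

\[
\mathrm{IF}_{2015} \;=\; \frac{\text{citations in 2015 to articles published in 2013--2014}}{\text{articles published in 2013--2014}} \;=\; \frac{250}{100} \;=\; 2.5
\]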

The Impact Factor provides us with a neat numerical value for what can be considered a highly vague concept. However, aside from the obvious issue of assigning a definitive numerical value to a subjective entity, there are other drawbacks to taking these measurements as indicators. This numerical evaluation system, as well as providing useful insight for investigations such as this one, is also responsible for the distribution of financial resources between research institutions. There is therefore a danger that the existence and use of these scores will drive top researchers to compete to be published by more ‘prestigious’ journals, thus introducing a bias towards a set collection of journals. Furthermore, the use of citation metrics to measure impact and influence shifts attention within the academic community away from innovative research and discovery and towards popularity. However, the advantages of using a standardised measurement – especially within an evidence-based investigation – greatly outweigh the potential drawbacks such a reliance can entail, particularly with regard to this investigation.

(Isabelle Blackmore)

Tackling Shared Vocabulary – How to Compile Dataset of Words shared Between Subjects

An alternative piece of code for this step, which essentially returns the same set of results – i.e. the number of articles within Subject 1 that mention technical words from Subject 2 – is as follows:

#Import packages
import nltk
from nltk.book import *
from nltk.tokenize import RegexpTokenizer
"""``RegexpTokenizer`` splits a string into substrings using a regular expression... It tokens out non-alphabetic sequences, money expressions, and any other non-whitespace sequences." (Source)
E.g.  >>> s = "Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.\n\nThanks."  >>> tokenizer = RegexpTokenizer('\w+|\$[\d\.]+|\S+')  >>> tokenizer.tokenize(s)  ['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York', '.',  'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']"""
from nltk.corpus import stopwords 
#Stopwords are words like "I" and "me" that have little lexical content 
stopset = set(stopwords.words('english'))
import requests
import nltk
import pickle
import matplotlib
from pylab import *
#Open file containing all words used in articles of a subject e.g. history, in this piece of code. This text file has already been streamlined in that a word used in an article only appears once in the file, no matter how often it was repeated in that article. A word can however, appear more than once if it was used in more than one article. I.e. A word in this .csv file can only be repeated 10 times at max.
with open('/Users/rain/Desktop/qm/history/historyall.csv', 'r') as text_file: 
   text = text_file.read() 
   text = text.lower() 
   #Making all letters lower-case 
   tokenizer = RegexpTokenizer(r'\w+') 
   tokens = tokenizer.tokenize(text) 
   #Removing stopwords 
   tokens = [w for w in tokens if not w in stopset]
#Open .txt file containing all technical words from a subject e.g. Visual Arts, in this piece of code
file = open('/Users/rain/Desktop/qm/visual arts/vartstechnical.txt', 'r')
technicalwords = file.readlines()
#At this point, technicalwords returns ['word1\n', 'word2\n', 'word3\n', 'word4\n'...]. To remove '\n', this line of code is needed:
technicalwords = [word.strip() for word in technicalwords]
for word in technicalwords:
   print(word)
   print(tokens.count(word))
#In this example, the result lists the number of History articles a technical word from Visual Arts can be found in. I.e. 'Sublime' can be found in 1 out of 10 of the History articles analysed.
sublime
1
purposiveness
0
purposive
0
kantian
0
contra
0

With this code, I compiled the shared vocabulary between every pair of subjects across all 3 disciplines that we looked at. The two file-opening lines (coloured blue in the original post) are what I edited each time. For example, if I wanted to study the shared vocabulary between History and Philosophy, I would have to look both at the number of History articles containing Philosophy’s technical words and at the number of Philosophy articles containing History’s technical words. For the former, I would open the file containing all of History’s words in the first of those lines and the file containing all of Philosophy’s technical words in the second, then run the code. For the latter, I would open the file containing all of Philosophy’s words in the first line and the file containing all of History’s technical words in the second, then run the code.

For every technical word, the highest possible number of subject-specific articles it can appear in is 10, since we looked at 10 articles per subject. Therefore, the maximum total count of History articles containing Philosophy’s technical words would be the total number of Philosophy technical words × 10.

In summary, from this step, I accumulated the shared vocabulary between every single subject measured in terms of the number of articles within either subject that contained technical words originating from the other.

Other Types of Impact Measurement – The Eigenfactor Score

Rather than being a definitive score out of a set maximum, Eigenfactor scores are cumulative and are scaled so that the Eigenfactor scores of all journals listed in Thomson Reuters’ Journal Citation Reports (JCR) sum to 100. To put these scores into perspective, the top thousand journals all have Eigenfactor scores above 0.01; in fact, the highest score ever recorded was attributed to the journal Nature in 2006. So, for the purpose of understanding and appreciating the meaning of the Eigenfactor score, any score above 0.01 means that the journal in question resides within Thomson Reuters’ top 1,000 journals.

The Eigenfactor score of a journal is an estimate of the percentage of time users spend with that journal.

The score is generated by an algorithm corresponding to a simple model of research that follows readers as they move from journal to journal through chains of citations. The idea is that the amount of time a researcher spends with each journal equates roughly to a measure of that journal’s importance within the network of academic citations.

The Eigenfactor ranking system also accounts for ‘prestige’ among citing journals: it is a ranking of a journal’s status, respectability and reputation among academic researchers.

Nature and Cell, for example, are considered formidable publications in the health, medical and biological sciences, so citations from them are valued highly relative to citations from third-tier journals with narrower readerships. Another advantage of the Eigenfactor score is that it also accounts and adjusts for differences in citation patterns among disciplines. Certain disciplines, such as Biology and Psychology, are predisposed to more citations – perhaps for methodological reasons – than disciplines like English or History, which can skew other measurements. A citation of a journal in a discipline that is not predisposed to many citations carries more weight than one from a discipline where citation is commonplace, something that is often not accounted for in other measurements such as raw citation counts.

Although this initially seemed a promising index to use, given its appreciation of ‘prestige’ among citing journals and of disciplinary predispositions towards citation, we did not include it in our final analysis. The reason is that the Eigenfactor score is used primarily for mapping the structure of scientific research and provides less insight into subjects from the Social Sciences and the Arts and Humanities. Furthermore, there is a difficulty in comparing Eigenfactor scores with JCR metrics. Unlike the JCR analysis, the Eigenfactor is compiled on the basis of a rigidly defined partition between subject categories – a hard partition in which each journal belongs to only one category – whereas the JCR categories form more of a soft partition in which journals are allowed multiple category memberships. Therefore, due to this fundamental mismatch between the methodologies behind measuring disciplinary impact, it was thought best to avoid using the Eigenfactor score, lest we run the risk of publishing misleading data.

More information can be found at eigenfactor.org