Category Archives: Level of Technicality

Data Sourcing: Articles

January 13, 2015Level of Technicality, Scienceisabelleblackmore

The aims and objectives of this project – trying to determine the extent of shared language between published articles of different academic disciplines – meant that, before we even began to think about analysing the data, we had to first formulate our own lexical databases from scratch. In order to achieve this we had to manually source our own articles, clean them up and then formulate a code that enabled us to create a dataset of tokenised words and their corresponding frequency count for that particular article. 4 subjects were investigated within each discipline and 10 articles were looked up for each subject, meaning 40 articles in total. For Sciences, this equated to a whopping 105 213 words that had to be processed and tokenised in order to create the dataset from which we might start our analysis. Such an approach was extremely labour intensive and definitely not for the faint-hearted. So without further ado, here is a step-by-step process of how I went about collecting, processing and creating the database of words for the division of Science.

Choosing Disciplines and Articles

Firstly, the most obvious criteria that any article chosen for analysis had to meet was that it was in English. There would be no point in an analysis of vocabulary to have articles of multiple languages. Secondly, as there were three people in our group, it was decided that we would divide the work load equally according to the conventional grouping of disciplinary fields made in universities:

The Arts
The Social Sciences
The Sciences

Taking advantage of the fact that – being BASc students at UCL – each member of our group has a slightly different academic background, we decided to split the disciplinary categories accordingly. Being a Health and Environment major, my natural inclination was towards the Sciences as an academic grouping of the disciplines and the articles found within them. Within the Sciences, further refinement was necessary in choosing which subjects and subsequent articles were to be investigated. This refinement was achieved through searching the many disciplinary divisions within Google Scholar’s Metric database, which contains a comprehensive range of disciplinary categories and sub-categories. Due to the time constraints on this project and the large amount of processing work needed to create the lexical databases, it was decided that – to impose a standardised method for sourcing articles within one given discipline – all the articles would be sourced from the one academic journal with the highest h5-index*(add link to relevant blog post) and were published in 2014.

Table 1: Subjects chosen for investigation and their corresponding Journals for The Sciences

Discipline	Journal
Biology	Cell
Medicine	New England Journal of Medicine
Physics	International Journal of Physics
Psychology	Trends in Cognitive Sciences

All references for articles used in this investigation can be found here.

Data Sourcing: Articles

January 13, 2015Arts, Creating the Database, Level of TechnicalityRain Soo Jamin

According to the standardised method of article collection – that Isabelle has described above – these are the subjects and corresponding journals chosen for the Arts:

Table 2: Subjects chosen for investigation and their corresponding Journals for The Arts

Discipline	Journal
English	Lingua
History	The American Historical Review
Philosophy	Synthese
Visual Arts	The Journal of Aesthetics and Art Criticism

Other than the h5-index, what also influenced our journal choices was the specificity of the journals. We did not want to use journals that are theme or region specific as such journals could bias the interdisciplinarity nature of the journal. For instance, in History, even though The Journal of Economic History has the highest h5-index, we avoided it as the articles within it (and hence, the vocabulary it uses) are likely to favour Economics more than other subjects, relative to a general history journal. Hence, we ultimately went with in The American Historical Review.

All references for articles used in this investigation can be found here.

Data Sourcing: Articles

January 13, 2015Creating the Database, Level of Technicality, Social SciencesIsadora_Janssen

I found my articles using Web of Science, an online database of articles that can be accessed using my UCL login details. This allowed be to search for articles based on the Publication Name and on the year they were published, so that I could make sure that all the articles that the search turned up were from the same publication and the same year.

My first Subject was Anthropology for which the most ‘impacting’ publication, was Current Anthropology. We had agreed to only focus on articles published in 2014, and therefore I had put these two variables into the search engine as shown below:

When the results came up, it was important that I changed the filter from ‘Publication Date’, because otherwise the top articles would have all been from December/November of 2014, and then I would not have had a range of different dates throughout 2014. Also, more recent articles are harder to find online, meaning that actually obtaining the article would have been more difficult later on. Neither did I want to filter by ‘Times Cited’ – because though I originally thought this would be the most accurate representation of the publications, I later realised that the articles that were published earliest (in January and February) were cited the most because they had been published the longest. Therefore, I decided to go with the ‘Relevance’ filter, which seemed to give me a good mix of times cited and dates posted.

Once I got my list of articles, I tried to choose a selection of different articles from different ‘branches’ of Anthropology. For example, some articles were obviously more scientific, such as “Craniofacial Feminization, Social Tolerance, and the Origins of Behavioral Modernity” focused on the biological changes and evolution of the human species in the context of Anthropology, while other were more cultural, studying the impact of different cultural factors or behaviour, such as religion, on specific groups or in specific places. (For example: “Relatedness, Co-residence, and Shared Fatherhood among Ache Foragers of Paraguay” )

Data Sourcing: The Journey from PDF to .txt

January 13, 2015Level of Technicality, Scienceisabelleblackmore

The conventional format for articles available for download within Academic Journals are in a PDF format. This particular format, although on the web, is not really machine-readable and therefore, according to Tim Berners-Lee, does not rate highly on the accessibility scale – as shown in Figure 1.

Berners-Lee's 5* Development towards Open Data — **Figure 1**: Berners-Lee’s 5-Star Development towards Open Data

The 5-Star Scale – The meaning behind the stars:

(adapted from http://5stardata.info/)

★	available on the Web (any format) under an open license
★★	available as structured data (e.g., Excel instead of image scan of a table like a PDF)
★★★	use non-proprietary formats (e.g., CSV instead of Excel)
★★★★	use URIs to denote things, so others might cite your work
★★★★★	link your data to other data – this provides context

It was therefore necessary, due to propriety rights and the image scan format, to convert these files into a format that was better suited to machine-readability: a text file (.txt).

Converting PDF to .txt files

Although there is an extensive collection of free pdf conversion sites, which can easily be found via a search engine, the majority of these ‘free’ services are incredibly misleading because the vast majority of them impose a policy of capping the number of articles one can convert in a given time period (usually a month). I encountered incredible difficulty finding a reliable method that I could use to convert all 40 articles from a pdf into a rich text format (.rtf) and subsequently a .txt file. In order to bypass this cap on free conversions, most sites demanded a subscription to their service and that the complete application be downloaded to your computer. However, to make matters worse, these applications were executable files (.exe), which aren’t compatible with Mac software, making it impossible to use the application beyond its capped limit. I eventually overcame this difficulty by paying a subscription to an online service that converted my PDF files in a Rich Text Format (.rtf). My difficulties in finding a reliable service I might use consistently to convert my files prompted me to think more about issues surrounding data availability and the debate about Open Data, especially with regard to academia – a link to my subsequent blog post can be found here.

Once converted, the text within the files was severely distorted in the sense that some words were split and others were concatenated. Furthermore, in the knowledge that these files would eventually be tokenised and the resulting tokens then counted, in-line citations and references also needed to be removed as these would disrupt the final word count for each article. It was therefore necessary to individually comb through each article and manually correct any words that were incorrectly split or combined in the conversion process as well as remove in-text citations and references. Again, this was an incredibly labour intensive and time-consuming process. However, once this ‘clean-up’ procedure had been completed the files were ready for processing through Python.

Data Sourcing: The Journey from PDF to .txt

January 13, 2015Arts, Creating the Database, Level of TechnicalityRain Soo Jamin

As an alternative to Isabelle’s online service subscription, I employed a Python PDF parser and analyzer called PDFMiner in convert the file format of my articles. After installation of the package, it simply required the following line of code to convert a .pdf to .txt:

$ pdf2txt.py -o [desired .txt file name] [path to save .pdf file to]

The code in action:

With this method however, the text within the files were also still distorted as Isabelle’s was. I had to manually look through and edit the files as well before they were ready for processing in Python. Thankfully, I found Sublime Text which made the process a lot more efficient than using my laptop’s default editor as it had several functions such as multiple-line selection that proved to be very handy.

Data Sourcing: From .txt to tokens

January 13, 2015Level of Technicality, Sciencefinal vocabulary, punctuation, text fileisabelleblackmore

Once the articles had been selected, downloaded, converted and cleaned-up they were ready to be processed using the NLTK package in Python.

Step-by-Step process used to tokenise the text files:

1. Import the NLTK Package in Python and import the relevant article file (.txt)

**Figure 1:** Import the NLTK Package and the relevant article text file from the Working Directory

The NLTK Package is case sensitive and would therefore count the same word – but with a capital letter – as different words. For example, ‘However‘ and ‘however‘ would not be counted as the same word as one begins with a capital letter and the other does not. If we do not account for this then our final vocabulary count would be inaccurate.

**Figure 2:** This code converts all words in the text file to lower case

2. Tokenising the data

A token in Python is an subsection of a string that results from dividing the original string by defined subclass. These divisions can be made based on sentences, or in the case for this project, words. NLTK has a built-in function: word_punct_tokenizer.tokenize(text) that allows you to divide sentences based on grammar and spaces, thus splitting the file into individual tokens of words, punctuation and spaces. However, as it is only the words we are interested in, it is necessary to remove any punctuation or spaces that might disrupt the final vocabulary count.

**Figure 3:** Built-in NLTK Package that allows the tokenisation of words in a string or text file

3. Removing words: ‘the’, ‘and’, ‘he’, ‘she’, ‘it’

NLTK has a built-in package called ‘stopwords’ which contains an all-inclusive list of common words for a given language (specified as ‘English’ in In[150]) these words include ‘he’, ‘she’, ‘the’ etc. as examples. The presence of these words in the text file will undoubtedly distort the final vocabulary count and therefore need to be removed. Fortunately we can remove these common words by using this stopword package to write a ‘for loop‘ that scans through the tokens created in the previous line of code and skips over any words that match those contained in the stopwords list, thus creating a new list (defined as new_edit) that do not contain these common words.

Figure 3: The NLTK Package 'Stopwords' contains an inclusive record of common words for a specified language — **Figure 4:** The NLTK Package ‘Stopwords’ contains an inclusive record of common words for a specified language

4. Removing Spaces and Punctuation from the tokenised file

Although we have removed the most common – yet in this instance meaningless – words, the list still contains various pieces of punctuation and symbols that will undoubtedly make up a large proportion of the word count. It is therefore necessary to remove these punctuations and symbols as well.

Figure 4: Removal of punctuation using the Punctuation Package and a For Loop — **Figure 5**: Removal of punctuation using the Punctuation Package and a For Loop

It is then merely a case of writing a ‘for’ loop, which runs through all tokens created by the previous code skipping over any token that matches with the punctuation in stopwords. However, is is necessary to append a space using the code in In[152] as this is not accounted for in stopwords but will still distort the dataset.

5. Saving our tokenised list as a .csv file

We now have a collection of tuples – a sequence of immutable Python objects, of the word and the frequency of that word in the article. The final stage in creating our vocabulary dataset is to convert this collection into a Data Frame and save it as a comma separated value (.csv) file.

Figure 6: Converting the Tuples to a Data Frame using Python Pandas — **Figure 6:** Converting the Tuples to a Data Frame using Python Pandas

The code in In[306] rearranges the Data Frame so that the words with the highest frequency appears at the top of the Data Frame.

Saving the Data Frame as a .csv file to the working directory — **Figure 7:** Saving the Data Frame as a comma separated value (.csv) file to the working directory

HEY PRESTO we’ve created a dataset!

This process was repeated for each individual article for each of the four Science subjects.

Getting the Article Ready for Processing – Problems Encountered

January 13, 2015Creating the Database, Level of Technicality, Social SciencesIsadora_Janssen

From Article to .txt file

Most Articles online are in PDF format, and the problem with this is that it is really tedious to get these into PlainText (.txt) format, which is the format you need it to be in for Python to be able to read it. The process usually goes something like this:

PDF -> using bought converter -> Word Document -> RichText -> PlainText

The issue with this process is the bought converter in order to get PDF files into a Word Document you would need to use a converter (as the one bought by Isabelle) which you need to buy. Having no desire to spend money on an app I would never use again, (and assuming it would be frowned upon to torrent it illegally), I needed a way to get my articles as word documents.

I found that JSTOR (another database we have access to with our UCL login) publishes full texts online. This meant all I had to do was copy and paste it into a word document. This worked beautifully, except that it took a lot of work to clean up the document.

First, all I had to remove the citations, to do this I used the Find Function (Ctrl+F) to find all the brackets in the document, which would usually give me a citation, for example (Janssen, 2014). This entire citation would then be deleted. Seeing as a lot of articles were over 20 pages long, this took a lot more time than expected. Next all figure and their captions had to be removed. This was difficult because sometimes there were text-boxes that referred to specific figure and tables which had moved during the Copy&Paste process, meaning the entire article would have to be scanned for these.

Finally, all hyperlinks and formatting had to be removed. The easiest way to do this is the ‘remove all formatting’ button on word. Now, this would leave just the text, which could then be saved as a txt file. Despite being rather time-consuming and labour intensive, it did prove a good solution as it didn’t cost me a penny. However, not all articles were available on JSTOR, which meant I had to adapt and on 2 occasions, choose different articles.

Tokenising
Using the very basic code we learnt in class and some help from the internet (Thank you StackOverflow!), I was able to get a code that would tokenise the article files I uploaded, so I could analyse it. However, I found that it was tokenising weird – it was making each letter a token, rather than each word. Which meant that instead of the most common words, I would get a list of the most common letters. Which obviously wasn’t much help.

After asking Isabelle, she sent me a ‘cleaning code’ that basically removed all the punctuation, double spaces and stopwords (words that are very common – example: is, that, the…). After Using this code, the tokenising worked again.

Data Sourcing: From .txt to tokens

January 13, 2015Creating the Database, Level of Technicality, UncategorizedRain Soo Jamin

An alternative I used for Step 4. of the process Isabelle described is NLTK’s in-built Regular-Expression Tokenizer (RegexpTokenizer). “`RegexpTokenizer` splits a string into substrings using a regular expression… It tokens out non-alphabetic sequences, money expressions, and any other non-whitespace sequences.” (Source)

E.g.  
>>> s = "Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.\n\nThanks."  
>>> tokenizer = RegexpTokenizer('\w+|\$[\d\.]+|\S+') 
>>> tokenizer.tokenize(s)  ['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York', '.',  'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']"""

This is a sample of my code showing its application:

The rest of the code essentially returned a .csv file like Isabelle’s that contains a data frame of all the words used within an article and their frequency. I ran all 40 Arts articles through this code.

Creating the Dataset – The Original Approach

January 13, 2015Creating the Database, Level of Technicality, Social SciencesIsadora_Janssen

Originally, I had approached this project in a way we had outlined in our presentation, which was a little different from how we eventually tackled the articles. Below, I specify this process and what difficulties we encountered:

Finding the Technical Words
The plan was to come up with a list of technical words from the abstract of the article and the list of key words that were both given on their respective Web of Science page. For every article I had chosen, I copied and pasted the title, abstract, and key words into a word document, creating one file, with a list of all 10 articles I was looking at. You can find an example below:

This is where I encountered my first problems, I soon realised not all the articles had abstracts or key-words. This meant i had to go back and change my original selection of articles, to make sure that each had these.

Once I had copied and pasted all abstracts into the document, I went through each abstract and highlighted all the words I found both discipline-specific and technical. I then did these same to the list of key words, and made a list for each article of these words. However, I soon found more issues. The first issue was that not all articles had the same number of keywords, and even then many key-words were not even remotely technical, for example, one article had 4 key words which were: ‘Anthropology, Papua New Guinea, Church, God‘. I started having a case where not all articles had even 1 keyword, which meant I was tempted to go back and choose more technical articles, which obviously would have unfairly influenced our results.

Counting the Technical Words
For the articles where I did find technical words, wrote a very basic code to count the number of time a technical word occurred. My aim was to see which percentage of the article can be considered ‘technical’, therefore, I tokenised the article and used the article.count(‘word’) code to count how many times a word occurred. An example is shown below using the word ‘kinship’:

I did this for all the technical words, and counted up how many times they appeared throughout the article. I then added up (using the + sign in python) all the total occurrences of each of the technical words and divided this by the total word count (found using the len(final_article) code ).

Flawed Results
However, this is where the problems started, because even for my most technical articles, I found that the decimal I was getting was so small, it was basically zero. I would be pointless to be comparing a ratio equal to 0.00012 to the ratio 0.00043, even if this indicated a huge different in technicality. I decided this ratio could be increased by increasing the occurrence of technical words, by adding different versions of a word. For example, instead of just looking for ‘EVOLUTION’ – I included ‘evolution’, ‘evolve’, ‘evolutionary’, etc. I found these derivatives using the digital library on my computer. I basically would enter the word, and check the derivativeslists, an example of which is found below:

I considered the perhaps creating a function that multiplied each by a constant so that the difference would become more visible. However, the more I started the thinking about the process, the more I saw how badly this approach represented the actual aims of the project. Just because a difficult word was not in the abstract, didn’t mean that there were no difficult words in the actual article itself. Furthermore, many of the key-words were more used as ‘search terms’, therefore sometimes these words didn’t even occur in the article, an obvious limitation.

This is when I realised I needed to change our approach and I touched base with Isabelle and decided to approach this by focusing on the actual articles, rather than the abstracts, and look at the most common words.

– Isadora Janssen

Building a Dataset: Concatenating article csv files and visualising ‘Common Disciplinary Words’

January 13, 2015Level of Technicality, Scienceisabelleblackmore

After having created individual csv files, each containing a dataset of vocabulary and their occurrence frequency, it was necessary to concatenate these to create a comprehensive language dataset, which would then represent the language used in that discipline.

Concatenating Code:

1. Import the relevant Python packages:

Matplotlib – a python 2D plotting library used for the creation of plots such as:
- Histograms
- Scatter Plots
- Bar Charts
Pandas – a python library used for data structures and data analysis
Numpy – a python library used for scientific computing:
- Linear algebra
- Simple and complex mathematical computations

2. Import relevant csv files into Python using Python Pandas

Figure 2: Importing the relevant csv files from directory under variable names — **Figure 2:** Importing the relevant csv files from directory under variable names

3. Concatenate the different csv files using Python Pandas

Figure 3: Concatenate the csv files — **Figure 3:** Concatenate the csv files

In[18] was put in as a checkpoint in the code. The concat.head() function prints the first 5 items in the Data Frame. Checkpoints – such as concat.head() – are always useful to write into your code to ensure that it is functioning as intended and to guard against bugs within the code.

The final step was to save the resulting Data Frame – to the working directory – as a csv file so that it may be called upon later when compiling yet more general datasets.

Figure 4: Save resulting Data Frame to working directory as a csv file — **Figure 4:** Save resulting Data Frame to working directory as a csv file

4. Visualising the Top 50 Most Common Words within these newly compiled datasets

common_wordplot — **Figure 5:** Uploading revised csv file to Python

As the csv file contained vocabulary that had been sorted descending order of word count, it was necessary to move the Top 5 Technical Words to the Top of the file, so that they might be included in the visualisation. This meant that the saved csv file had to then be reuploaded to Python after this editing process had occured. This was achieved in In[37] in the above screenshot.

plot — **Figure 6**: Assigning Frequency Count and Plot Type

The Frequency Count (ct) was used to determine the extent of the indexes in the Data Frame that are included in the final visualisation. However, in order to ensure that the appropriate indexes are included it is necessary to reset the index.

Furthermore, to add another informative element to the visualisation, we also incorporated some code that would produce a line representative of the mean number of times any word occurs in that article. This would then be projected on top of the bar chart for comparative reasons. df_mean = df[:100] ensures that this mean is taken from the first 100 items in the dataset.

The parameters of the x-axis and the mean value are defined. The type of visualisation is then defined as a bar chart and is given a title and the axis are labelled accordingly. The final visualisation is then saved into the working directory.

plot3 — **Figure 7**: Screenshots of the code used to visualise a Bar Chart of the Most Common Words in Physics – as sourced from the collection of Physics Articles

5. Example of the resulting Visualisation:

physics_mostcommon — **Figure 8:** Example of visualisation resulting from above code

Identifying technical words

January 13, 2015Arts, Creating the Database, Level of Technicality, UncategorizedRain Soo Jamin

To identify technical words within an article, I opened the relevant .csv file that contained all the words used in that article sorted by the frequency of the word. As per our definition of technical words, I then hid all words that had a count of less than 10 and recorded all the words qualified as technical in a new spreadsheet. After this step, I ended up with .csv files of the technical words found in each Arts subject as well as a concatenated file that contained all the technical words of the Arts.

SOCIAL SCIENCES: Studying each Subject as a Whole

January 13, 2015Creating the Database, Datasets, Level of Technicality, Social SciencesIsadora_Janssen

I wanted to look at the technicality of each subject as a whole, rather each article individually. To do so, I took all 10 articles from the a single subject, in this case Anthropology, and copy and pasted all them into the same Text File. This effectively created a new article, which had all the words used in Anthropology in it. I ran this article through the code, tokenised and cleaned it up for each discipline and analysed the most common words for each subject. Below I have divided it by Subject:

ANTHROPOLOGY

Once I had cleaned the code up, I used the FreqDist function to find the top 50 words within the subject. I was mainly interested in the top 5, however it was interesting to see a complete list of words. I soon noticed some small irregularities. First of all, the numbers ‘1’ and ‘2’ were on the most common words list, so was the letter ‘g’ and small basic words or abbreviations such as ‘co’ and ‘also’. I, therefore needed to remove these from the word list before I continued. At first I tried to do this with the big_article.remove(‘co’) command, however I soon found that this removed 1 instance of the word, when sometimes, the word occurred over 50 times. I wrote a loop that removed all these small insignificant words in one go. It was important to place the code BEFORE I changed the file type from list to Text, as Text file doesn’t allow you to remove them. The code used is seen below:

Once this was done, I could use the frequency distribution to find the top 5 words. However, I encountered a second issue: The word ‘human’ was first and the word ‘humans’ was 5th. The word Human and Humans means the same thing, therefore, I want to consider these the same word. Therefore, I wanted the code to count these as the same. I thought of solving this by replacing every instance of ‘humans’ on the list with the word ‘human’. Therefore, I wrote the following bit of code to replace the word:

Which produce the following ‘Top 50’ words list:

From here I could use the same code outlined here to create a Bar Chart of the 50 most common words. Which can be seen below:

However, while this graph is an interesting visualisation, it gives me too many words to consider. I needed to narrow down my list. I decided to go for the 5 top words. So from here, I created a Data frame of the top 5 most occurring words in Anthropology. The Dataframe can be seen below:

These are the words that could be found in (almost) every single anthropology article. While they are very straight-forward words (After all, there are little english speakers who do not know what the word ‘human’ means), they are also quite discipline specific, and I am content,

POLITICS
I repeated the process above for Politics. Once again I found that there were words on the top 50 list that I knew for a fact only appeared multiple times in the same article, rather than a range of articles. Therefore, I decided to only focus on the top 5 again. The Dataframe created can be seen below:

There aren’t any HUGE surprises in this list, though I did find it peculiar that the word ‘self’ is used so often in politics articles, I would not have guessed that. I also found it very interesting that 2 (social and group) out of the 5 were actually the same words as in Anthropology, showing some definite shared vocabulary between the 2 disciplines. However, once again, one can’t be too surprised at this, as the entire discipline, the SOCIAL SCIENCES, is related to society and how we function as a collective, which is obviously directly related to to the words ‘social’ and ‘group’.

LAW

When I ran the code for law, I had a similar problem – it was counting court and courts, and right and rights as separate words. Which obviously affected the top 50 list as can be seen below:

Therefore I had to include the same little string of code I used previously to replace words:

Which then meant that the new list, had both words under the same count:

Finally, the top 5 words were:

The least surprised about this set of words, though perhaps I didn’t know that ‘copyright’ was such a recurring theme in Law Articles, though I suppose copyrights and copyright law was quite a hot topic in 2014 (click link for details and source) which may have lead to several articles at least referring to examples of copyright law. Sadly, no shared words between Law and any of the other Social Science Subjects, unlike Politics and Anthropology. Although, I can’t say I am very surprised. Law is a very specific subject with very specific terminology, unlike perhaps the other social sciences, whose silo lines perhaps blur more into one another.

–Isadora Janssen