All posts by Isadora_Janssen

Technicality Between Subjects

January 20, 2015Hypothesis 1Isadora_Janssen

To analyse the overall technicality between subjects, we took the Top-Word and Common-Word ratios for every subject within a discipline and averaged these to find the mean ratios for that discipline. The results are displayed in the table below:

With these averages, we could then create pie-charts using Excel to visualise these. Again the Results are below:

TOP-WORD RATIO

C lick the graphs to see them in a larger size

As you can see from the charts, the most technical discipline is Science. This is quite expected, as science holds a stereotype of concerning itself with specific and specialist terms. Furthermore, many of the technical terms in science (such as terms for specific anatomical structures or physical phenomena) are not words one uses in their daily life. For example, the technical terms in the social science subjects, such as politics, include words that are understood by the majority of people, such as democracy, colonialism, feudal, constitution. However, in Science is not the case. Often technical words in Science have everyday synonyms, for example the word ‘Patella’ would be referred to in everyday life as your ‘knee cap’. The lack of Scientific words in our everyday vocabulary makes it a more technical subject. However, it is worth pointing out that in the case of the common-word Ratio, the Social Sciences are a close second to the Sciences, which could be due to its technicality or due to a fault in the approach we took. The Arts are very obviously the least technical, which is perhaps not surprising in general, however it surprised me how low these numbers were. On the whole The Arts has very technical words as well, such as literary devices or particular artistic movements and styles.

COMMON-WORD RATIO

Click the graphs to see them in a larger size

Again, Science is clearly the most technical, and this time by a much larger margin. This indicates that perhaps the close difference between the two disciplines in the Top-Word Ratio is because of the approach. As mentioned previously, perhaps when selecting the technical words, we have a natural urge to reach a certain number out of the 10, which will push us to see technical words, where there perhaps aren’t any. However, in the Common Word Ratio its a lot more likely that you will toss a word aside, and there is no specific number of words you feel you need to find. Once again, the arts is incredibly low, however this perhaps makes sense. If you are writing an analysis of a poem, you are likely to mention a range of different literary device that occur in the poem, rather than continue on for a whole article about the same one device. This might mean that these technical words don’t even appear 10 times in the article, meaning they won’t show up in our data. Obviously, this highlights a great flaw in the approach which should be discussed in the reflection.
RELATION TO HYPOTHESIS 1
Our first hypothesis was that:

Amongst the 3 disciplines, Science will have the highest level of technicality.

These results obviously prove that Science is indeed the most technical subject, proving our hypothesis right.

– Isadora Janssen

Creating the Dataset – The New Approach

January 20, 2015Social SciencesIsadora_Janssen

We needed a new approach. This is when I talked to Isabelle and Rain, and we discussed a new approach. There were several things that were important:

Must be an accurate representation of the documents’ technicality
Must produce a standardised ratio (more on that here) so we can compare outcomes

Isabelle said she had been looking at the articles by looking at the top 10 most common words, and then looking at how many of these words were technical. This approach inspired our first technicality ratio – we call this the Topword Ratio – the percentage of top 10 common words which are technical.

THE CODE

After having tokenised the article and removed all punctuation, you will have your article in the ‘list’ format – this means python sees it as a list of words, rather than an article. You need to convert this to a ‘text’ format, to use the FreqDist function, which will find the top most common words in the article. In this case, we decided on the 10 most common, therefore, we filled ’10’ in the brackets, as seen in the screenshot below:

This will then produce a list of the top 10 most common words. From this list, we will need to decided on which words we find technical. We then need to fill in these words into the “technical_words” list. I then wrote some very simple code which counts how many words there are in the list, and then takes this number and divides it by 10 (the total number of common words) – this gives our the first technicality ratio – also known as the TopWord Ratio:

LIMITATIONS OF THE TOP-WORD RATIO

The obvious limitation is that many technical terms might not make the top 10 – an article can be very technical but also continuously use less technical words. Therefore, this did not seem like an accurate representation of an article’s technicality. Therefore, we needed at least another ratio, that preferably treated the article as a whole, to give our project any credibility. This introduced the second ration – the Common-Word Ratio.

THE COMMON-WORD RATIO

First, we decided that in order for a technical word to influence your understanding of the article, it must appear at least 10 times. Therefore, we wanted to find out which words appear at least 10 times and then try to calculate how many of these were technical – this ratio would be our second, more accurate, technicality ratio.

THE CODE

First I needed to create a code that would create a list of all the words that occurred 10 times or more. First, I created a list, I called this list ‘common‘. At the time, I was unsure of how to create a blank list, so I added 1 variable: ‘hello’, which I then removed later (seen in figure 1 below). I realised later that I could just have create a blank list, which would have worked too (I could have done this by not putting anything inside the square bracket – as seen in figure 2 below).

Figure 1 - with the extra variable — Figure 1 – with the extra variable

Screen Shot 2015-01-20 at 3.47.36 AM — Figure 2 – Empty List

I then used a loop to add all the words that occurred equal to or more than 10 times to the list common. At first this code refused to work because it claimed that final_article was a Text type, not a list. The function .count(i) does not work for a text file. I was confused because at the start of the code, final_article is a list type. I quickly found the issue. Instead of introducing a new label, I found that before I found the common words, I changed the file type. I changed this by making nltk.Text(final_article) equal to text_article, a new label, instead of final_article. This is seen in the screenshots below:

Once this was fixed, the code worked and it gave me a list of all the words that occurred 10 times or more. However, this list had duplicates, meaning that the same word appeared several times, making it hard to analyse. As you can see in the screenshot, in just this small section, the word ‘kinship’ appears several times:

Therefore, for easier analysis, I had to remove the duplicates, and create a new list with every word only once. I called this list nodupli. This was done with the code below:

I then changed the type of nodupli from a list to Text, so that we could put it in a frequency distribution. Even though the frequency of the word was 1 and therefore there is little meaning in the outcome, this function prints the result in a list, rather than paragraph of words separated by commas, which again made it easier to analyse.

I then had to go through each of these words and decided which were considered technical and make a note of these. Then, once you had a list, you had to find the derivatives for each word (where possible) – for example for ‘evolution’, you should include ‘evolutionary’ and ‘evolutions’.

These words then needed to all be added into the code that counts how many times the appear in the article. Each word count was given a letter (so they were easier to plug into the calculation later on). All these word counts were then added together and the result was called alltechnical.

This then needed to be changed into a float (python-language for number) because it did not automatically register the type as a float. This was important for so that the calculations shown below would work.

I then calculated how many words were in the common list, which was the list of all the words that occured 10 or more times, where the duplicates had not been removed. These two values were then divided, which gave us a result that told us how many of the ‘common words’ were technical. Which gives us our second ratio – the Common-Word Ratio.

Isadora Janssen

January 20, 2015Meet the Jargon-BustersIsadora_Janssen

Official Nickname: Isadora from Waitrose – Cause I look just as good behind check-outs as Alex from Target did.

Primarily responsible for analysing articles and disciplines that fall into the Social Sciences bracket- came up with the technicality ratios used to compare articles between subjects and between disciplines. Also, wrote the code that calculates both ratios, removes repeats and replaces bits of string in a list. Oh and a couple of blog pages here and there.

Data Sourcing: Articles

January 13, 2015Creating the Database, Level of Technicality, Social SciencesIsadora_Janssen

I found my articles using Web of Science, an online database of articles that can be accessed using my UCL login details. This allowed be to search for articles based on the Publication Name and on the year they were published, so that I could make sure that all the articles that the search turned up were from the same publication and the same year.

My first Subject was Anthropology for which the most ‘impacting’ publication, was Current Anthropology. We had agreed to only focus on articles published in 2014, and therefore I had put these two variables into the search engine as shown below:

When the results came up, it was important that I changed the filter from ‘Publication Date’, because otherwise the top articles would have all been from December/November of 2014, and then I would not have had a range of different dates throughout 2014. Also, more recent articles are harder to find online, meaning that actually obtaining the article would have been more difficult later on. Neither did I want to filter by ‘Times Cited’ – because though I originally thought this would be the most accurate representation of the publications, I later realised that the articles that were published earliest (in January and February) were cited the most because they had been published the longest. Therefore, I decided to go with the ‘Relevance’ filter, which seemed to give me a good mix of times cited and dates posted.

Once I got my list of articles, I tried to choose a selection of different articles from different ‘branches’ of Anthropology. For example, some articles were obviously more scientific, such as “Craniofacial Feminization, Social Tolerance, and the Origins of Behavioral Modernity” focused on the biological changes and evolution of the human species in the context of Anthropology, while other were more cultural, studying the impact of different cultural factors or behaviour, such as religion, on specific groups or in specific places. (For example: “Relatedness, Co-residence, and Shared Fatherhood among Ache Foragers of Paraguay” )

Getting the Article Ready for Processing – Problems Encountered

January 13, 2015Creating the Database, Level of Technicality, Social SciencesIsadora_Janssen

From Article to .txt file

Most Articles online are in PDF format, and the problem with this is that it is really tedious to get these into PlainText (.txt) format, which is the format you need it to be in for Python to be able to read it. The process usually goes something like this:

PDF -> using bought converter -> Word Document -> RichText -> PlainText

The issue with this process is the bought converter in order to get PDF files into a Word Document you would need to use a converter (as the one bought by Isabelle) which you need to buy. Having no desire to spend money on an app I would never use again, (and assuming it would be frowned upon to torrent it illegally), I needed a way to get my articles as word documents.

I found that JSTOR (another database we have access to with our UCL login) publishes full texts online. This meant all I had to do was copy and paste it into a word document. This worked beautifully, except that it took a lot of work to clean up the document.

First, all I had to remove the citations, to do this I used the Find Function (Ctrl+F) to find all the brackets in the document, which would usually give me a citation, for example (Janssen, 2014). This entire citation would then be deleted. Seeing as a lot of articles were over 20 pages long, this took a lot more time than expected. Next all figure and their captions had to be removed. This was difficult because sometimes there were text-boxes that referred to specific figure and tables which had moved during the Copy&Paste process, meaning the entire article would have to be scanned for these.

Finally, all hyperlinks and formatting had to be removed. The easiest way to do this is the ‘remove all formatting’ button on word. Now, this would leave just the text, which could then be saved as a txt file. Despite being rather time-consuming and labour intensive, it did prove a good solution as it didn’t cost me a penny. However, not all articles were available on JSTOR, which meant I had to adapt and on 2 occasions, choose different articles.

Tokenising
Using the very basic code we learnt in class and some help from the internet (Thank you StackOverflow!), I was able to get a code that would tokenise the article files I uploaded, so I could analyse it. However, I found that it was tokenising weird – it was making each letter a token, rather than each word. Which meant that instead of the most common words, I would get a list of the most common letters. Which obviously wasn’t much help.

After asking Isabelle, she sent me a ‘cleaning code’ that basically removed all the punctuation, double spaces and stopwords (words that are very common – example: is, that, the…). After Using this code, the tokenising worked again.

Creating the Dataset – The Original Approach

January 13, 2015Creating the Database, Level of Technicality, Social SciencesIsadora_Janssen

Originally, I had approached this project in a way we had outlined in our presentation, which was a little different from how we eventually tackled the articles. Below, I specify this process and what difficulties we encountered:

Finding the Technical Words
The plan was to come up with a list of technical words from the abstract of the article and the list of key words that were both given on their respective Web of Science page. For every article I had chosen, I copied and pasted the title, abstract, and key words into a word document, creating one file, with a list of all 10 articles I was looking at. You can find an example below:

This is where I encountered my first problems, I soon realised not all the articles had abstracts or key-words. This meant i had to go back and change my original selection of articles, to make sure that each had these.

Once I had copied and pasted all abstracts into the document, I went through each abstract and highlighted all the words I found both discipline-specific and technical. I then did these same to the list of key words, and made a list for each article of these words. However, I soon found more issues. The first issue was that not all articles had the same number of keywords, and even then many key-words were not even remotely technical, for example, one article had 4 key words which were: ‘Anthropology, Papua New Guinea, Church, God‘. I started having a case where not all articles had even 1 keyword, which meant I was tempted to go back and choose more technical articles, which obviously would have unfairly influenced our results.

Counting the Technical Words
For the articles where I did find technical words, wrote a very basic code to count the number of time a technical word occurred. My aim was to see which percentage of the article can be considered ‘technical’, therefore, I tokenised the article and used the article.count(‘word’) code to count how many times a word occurred. An example is shown below using the word ‘kinship’:

I did this for all the technical words, and counted up how many times they appeared throughout the article. I then added up (using the + sign in python) all the total occurrences of each of the technical words and divided this by the total word count (found using the len(final_article) code ).

Flawed Results
However, this is where the problems started, because even for my most technical articles, I found that the decimal I was getting was so small, it was basically zero. I would be pointless to be comparing a ratio equal to 0.00012 to the ratio 0.00043, even if this indicated a huge different in technicality. I decided this ratio could be increased by increasing the occurrence of technical words, by adding different versions of a word. For example, instead of just looking for ‘EVOLUTION’ – I included ‘evolution’, ‘evolve’, ‘evolutionary’, etc. I found these derivatives using the digital library on my computer. I basically would enter the word, and check the derivativeslists, an example of which is found below:

I considered the perhaps creating a function that multiplied each by a constant so that the difference would become more visible. However, the more I started the thinking about the process, the more I saw how badly this approach represented the actual aims of the project. Just because a difficult word was not in the abstract, didn’t mean that there were no difficult words in the actual article itself. Furthermore, many of the key-words were more used as ‘search terms’, therefore sometimes these words didn’t even occur in the article, an obvious limitation.

This is when I realised I needed to change our approach and I touched base with Isabelle and decided to approach this by focusing on the actual articles, rather than the abstracts, and look at the most common words.

– Isadora Janssen

SOCIAL SCIENCES: Studying each Subject as a Whole

January 13, 2015Creating the Database, Datasets, Level of Technicality, Social SciencesIsadora_Janssen

I wanted to look at the technicality of each subject as a whole, rather each article individually. To do so, I took all 10 articles from the a single subject, in this case Anthropology, and copy and pasted all them into the same Text File. This effectively created a new article, which had all the words used in Anthropology in it. I ran this article through the code, tokenised and cleaned it up for each discipline and analysed the most common words for each subject. Below I have divided it by Subject:

ANTHROPOLOGY

Once I had cleaned the code up, I used the FreqDist function to find the top 50 words within the subject. I was mainly interested in the top 5, however it was interesting to see a complete list of words. I soon noticed some small irregularities. First of all, the numbers ‘1’ and ‘2’ were on the most common words list, so was the letter ‘g’ and small basic words or abbreviations such as ‘co’ and ‘also’. I, therefore needed to remove these from the word list before I continued. At first I tried to do this with the big_article.remove(‘co’) command, however I soon found that this removed 1 instance of the word, when sometimes, the word occurred over 50 times. I wrote a loop that removed all these small insignificant words in one go. It was important to place the code BEFORE I changed the file type from list to Text, as Text file doesn’t allow you to remove them. The code used is seen below:

Once this was done, I could use the frequency distribution to find the top 5 words. However, I encountered a second issue: The word ‘human’ was first and the word ‘humans’ was 5th. The word Human and Humans means the same thing, therefore, I want to consider these the same word. Therefore, I wanted the code to count these as the same. I thought of solving this by replacing every instance of ‘humans’ on the list with the word ‘human’. Therefore, I wrote the following bit of code to replace the word:

Which produce the following ‘Top 50’ words list:

From here I could use the same code outlined here to create a Bar Chart of the 50 most common words. Which can be seen below:

However, while this graph is an interesting visualisation, it gives me too many words to consider. I needed to narrow down my list. I decided to go for the 5 top words. So from here, I created a Data frame of the top 5 most occurring words in Anthropology. The Dataframe can be seen below:

These are the words that could be found in (almost) every single anthropology article. While they are very straight-forward words (After all, there are little english speakers who do not know what the word ‘human’ means), they are also quite discipline specific, and I am content,

POLITICS
I repeated the process above for Politics. Once again I found that there were words on the top 50 list that I knew for a fact only appeared multiple times in the same article, rather than a range of articles. Therefore, I decided to only focus on the top 5 again. The Dataframe created can be seen below:

There aren’t any HUGE surprises in this list, though I did find it peculiar that the word ‘self’ is used so often in politics articles, I would not have guessed that. I also found it very interesting that 2 (social and group) out of the 5 were actually the same words as in Anthropology, showing some definite shared vocabulary between the 2 disciplines. However, once again, one can’t be too surprised at this, as the entire discipline, the SOCIAL SCIENCES, is related to society and how we function as a collective, which is obviously directly related to to the words ‘social’ and ‘group’.

LAW

When I ran the code for law, I had a similar problem – it was counting court and courts, and right and rights as separate words. Which obviously affected the top 50 list as can be seen below:

Therefore I had to include the same little string of code I used previously to replace words:

Which then meant that the new list, had both words under the same count:

Finally, the top 5 words were:

The least surprised about this set of words, though perhaps I didn’t know that ‘copyright’ was such a recurring theme in Law Articles, though I suppose copyrights and copyright law was quite a hot topic in 2014 (click link for details and source) which may have lead to several articles at least referring to examples of copyright law. Sadly, no shared words between Law and any of the other Social Science Subjects, unlike Politics and Anthropology. Although, I can’t say I am very surprised. Law is a very specific subject with very specific terminology, unlike perhaps the other social sciences, whose silo lines perhaps blur more into one another.

–Isadora Janssen