
Tackling Shared Vocabulary – How to Compile a Dataset of Words Shared Between Subjects

We then sought to identify vocabulary shared between subjects. Between two subjects, a shared vocabulary item is a technical word from either subject that also appears in the articles of the other. For example, if ‘entropy’ is a technical word in Physics and also appears in a Chemistry article, it counts as vocabulary shared between Physics and Chemistry.

Initially I thought it would be possible to use df.merge() with the ‘inner’ join type from the Python pandas package. The premise was that the resulting dataset would contain only the words that occurred in both original data frames. However, when I ran the code the output was an empty data frame:

Figure 1: Unsuccessful attempt to conduct an ‘inner’ join between two datasets
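In principle, an inner merge of this sort should return only the rows present in both data frames. As a minimal sketch (the column name ‘word’ and the sample data here are my illustrative assumptions, not the original data frames):

import pandas as pd

#Illustrative one-column data frames of each subject's vocabulary
physics = pd.DataFrame({'word': ['quantum', 'entropy', 'positive']})
biology = pd.DataFrame({'word': ['cell', 'enzyme', 'positive']})

#An inner merge keeps only the rows whose 'word' appears in both frames
shared = physics.merge(biology, on='word', how='inner')
print(shared)
#       word
#0  positive

In practice, however, my data frames produced the empty result shown in Figure 1.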

Unfortunately, when this merge was run for each subject's vocabulary dataset, the result was the same in every instance. It was therefore necessary to rethink my approach to combining data frames.

Figure 2: Concatenating to create a dataset of all shared vocabulary between the subjects investigated

Although this new approach achieved the aim of creating a dataset of the shared vocabulary between different subjects, it fundamentally changed the nature of the data being investigated. The premise of using value_counts() after concatenating the data frames of individual subjects was to count the occurrences of a particular item of vocabulary across the total number of subjects investigated. If a word returned a value of 4 after running the value_counts() code, that meant it had occurred in at least one of the articles for every subject investigated.

Figure 3: An example of the data and the values returned after running value_counts()
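As a rough sketch of this step (the subject names and words are illustrative, not the real data):

import pandas as pd

#One deduplicated word list per subject
physics = pd.Series(['positive', 'entropy', 'quantum'])
biology = pd.Series(['positive', 'cell', 'entropy'])
chemistry = pd.Series(['positive', 'bond', 'entropy'])
geology = pd.Series(['positive', 'strata'])

#Stack all four subjects into one series; since each word occurs at most once
#per subject, value_counts() returns the number of subjects each word is in
all_words = pd.concat([physics, biology, chemistry, geology])
print(all_words.value_counts())
#positive    4   <- occurs in at least one article of all 4 subjects
#entropy     3
#...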

Although this was successful in the sense that it demonstrates which words are shared between subjects, it does not account for the prevalence of each word within each subject. Take the word ‘positive’, for example: although the dataset shows that it occurs in all subjects, this does not mean that it features heavily in every article. It might be that ‘positive’ is only used once in one article out of the ten investigated for Physics, but the fact that it is also used at least once in at least one article of every other subject means that it is given the highest frequency value (4 out of the 4 scientific subjects investigated). This bias needs to be flagged and addressed for readers in order to avoid any misrepresentation of the data or misleading results.

This inherent bias within the data was resolved by manually searching through each subject's vocabulary dataset to determine how many articles each particular word was present in. However, this meant manually searching through a database of an extremely large number of words, which would have been incredibly laborious and time-consuming without some sort of sorting mechanism.

Figure 4: Streamlining the dataset for more specific analysis

In the example shown in Figure 4, the original dataset containing all shared words between all investigated subjects was ‘streamlined’ to contain only the words meeting a specified threshold. In this instance we are only looking at the words that occurred in 2 out of the 4 subjects investigated in ‘The Sciences’ category.
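A minimal sketch of this ‘streamlining’ step, assuming the value_counts() output has been kept as a pandas Series (the words and values are illustrative):

import pandas as pd

#counts maps each word to the number of subjects it appears in
counts = pd.Series({'positive': 4, 'entropy': 3, 'bond': 2, 'strata': 2, 'cell': 1})

#Keep only the words that occurred in exactly 2 of the 4 science subjects
shared_in_two = counts[counts == 2]
print(shared_in_two)
#bond      2
#strata    2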

From this point onwards it was a case of returning to each original dataset, conducting a specific search for the words contained in the ‘streamlined’ dataset and then making a note of their prevalence among the individual articles for that subject.
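This search can be sketched as follows, under the assumption that each subject's articles are represented as sets of the words they use (the data structures and words are illustrative, not my actual files):

#One set of words per article, 10 articles per subject
articles = [
    {'positive', 'entropy', 'quantum'},
    {'cell', 'positive'},
    #... 8 more articles
]
streamlined = ['positive', 'bond']

#For each 'streamlined' word, count the articles it appears in
for word in streamlined:
    prevalence = sum(1 for article in articles if word in article)
    print(word, prevalence)  #e.g. positive 2, bond 0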

Figure 5: A bar chart to show the prevalence, within each individual subject, of words shared between 3 out of 4 scientific subjects

Tackling Shared Vocabulary – How to Compile a Dataset of Words Shared Between Subjects

An alternative piece of code for this step, which essentially returns the same set of results (i.e. the number of articles within Subject 1 that mention technical words from Subject 2), is as follows:

#Import packages
import nltk
from nltk.tokenize import RegexpTokenizer
#"A RegexpTokenizer splits a string into substrings using a regular expression...
#[it] forms tokens out of alphabetic sequences, money expressions, and any other
#non-whitespace sequences." (Source: NLTK documentation)
#E.g. >>> s = "Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.\n\nThanks."
#     >>> tokenizer = RegexpTokenizer('\w+|\$[\d\.]+|\S+')
#     >>> tokenizer.tokenize(s)
#     ['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York', '.',
#      'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
from nltk.corpus import stopwords
#Stopwords are words like "I" and "me" that have little lexical content
stopset = set(stopwords.words('english'))

#Open the file containing all words used in the articles of one subject
#(history, in this piece of code). This file has already been streamlined in
#that a word used in an article appears only once in the file, no matter how
#often it was repeated in that article. A word can, however, appear more than
#once if it was used in more than one article, i.e. a word in this .csv file
#can be repeated 10 times at most.
with open('/Users/rain/Desktop/qm/history/historyall.csv', 'r') as text_file:
    text = text_file.read()
    #Make all letters lower-case
    text = text.lower()
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(text)
    #Remove stopwords
    tokens = [w for w in tokens if w not in stopset]

#Open the .txt file containing all technical words from another subject
#(Visual Arts, in this piece of code)
with open('/Users/rain/Desktop/qm/visual arts/vartstechnical.txt', 'r') as file:
    technicalwords = file.readlines()
#At this point, technicalwords is ['word1\n', 'word2\n', 'word3\n', 'word4\n'...].
#To remove the trailing '\n', this line of code is needed:
technicalwords = [word.strip() for word in technicalwords]

#For each technical word, count how many History articles it appears in
for word in technicalwords:
    print(word)
    print(tokens.count(word))
#In this example, the result lists the number of History articles in which a
#technical word from Visual Arts can be found, e.g. 'sublime' can be found in
#1 out of the 10 History articles analysed.
sublime
1
purposiveness
0
purposive
0
kantian
0
contra
0

With this code, I compiled the shared vocabulary between every subject in all 3 disciplines that we looked at. The lines coloured in blue (the two file-opening lines) are what I edited each time. For example, if I wanted to study the shared vocabulary between History and Philosophy, I’d have to look at both the number of History articles that contain Philosophy’s technical words and the number of Philosophy articles that contain History’s technical words. To do the former, I’d open the file containing all of History’s words in the first blue line and the file containing all of Philosophy’s technical words in the second blue line and run the code. For the latter, I’d open the file containing all of Philosophy’s words in the first blue line and the file containing all of History’s technical words in the second blue line and run the code.
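To save hand-editing those two lines on every run, the same logic could also be wrapped in a function, sketched below; the function name and the example paths are my assumptions, following the file layout used above:

from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

def count_shared_vocabulary(all_words_path, technical_words_path):
    """For each technical word from one subject, count how many of the
    other subject's articles it appears in."""
    stopset = set(stopwords.words('english'))
    with open(all_words_path, 'r') as f:
        tokens = RegexpTokenizer(r'\w+').tokenize(f.read().lower())
    tokens = [w for w in tokens if w not in stopset]
    with open(technical_words_path, 'r') as f:
        technicalwords = [word.strip() for word in f.readlines()]
    return {word: tokens.count(word) for word in technicalwords}

#E.g. History articles containing Philosophy's technical words, and vice versa
#(example paths only):
#count_shared_vocabulary('/Users/rain/Desktop/qm/history/historyall.csv',
#                        '/Users/rain/Desktop/qm/philosophy/philosophytechnical.txt')
#count_shared_vocabulary('/Users/rain/Desktop/qm/philosophy/philosophyall.csv',
#                        '/Users/rain/Desktop/qm/history/historytechnical.txt')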

For every technical word, the highest possible number of subject-specific articles it can appear in is 10, since we looked at 10 articles from each subject. Therefore, the maximum possible total count of History articles containing Philosophy’s technical words would be the total number of Philosophy technical words x 10 (for instance, 50 technical words would give a ceiling of 500).

In summary, from this step I compiled the shared vocabulary between every pair of subjects, measured as the number of articles within either subject that contained technical words originating from the other.