We then sought to identify vocabulary shared between subjects. Between two subjects, shared vocabulary is a technical word from either subject that can be found in articles of the other subject.
Initially I thought it would be possible to use df.merge() with the ‘inner’ join option from the Python pandas package. The premise was that the resulting dataset would contain only the words that occurred in both original data frames. However, when I ran the code, the output was an empty data frame:
![Figure 1: Attempt to conduct an 'inner' join between two datasets](https://qm2awesome.wordpress.com/wp-content/uploads/2015/01/concatfail.png?w=604)
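For reference, here is a minimal sketch of how an inner join on a word column is expected to behave. The data and the column name `word` are made up for illustration and are not the project's actual datasets. I do not know why the original join came back empty; the second half only shows one common cause (keys that differ in case or whitespace), which may or may not apply here.

```python
import pandas as pd

# Hypothetical vocabularies for two subjects; the column name "word" is an assumption.
physics = pd.DataFrame({"word": ["energy", "positive", "field"]})
biology = pd.DataFrame({"word": ["cell", "positive", "energy"]})

# An inner join keeps only the rows whose key appears in both frames.
shared = physics.merge(biology, on="word", how="inner")
shared_words = sorted(shared["word"])

# Keys must match exactly: different casing or stray whitespace gives no matches,
# so the inner join returns an empty data frame.
biology_messy = pd.DataFrame({"word": ["Cell ", "Positive", " energy"]})
empty = physics.merge(biology_messy, on="word", how="inner")
```

If mismatched keys are suspected, normalising the column first (for example with `.str.strip().str.lower()`) before merging is a cheap thing to try.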
Unfortunately, running this code on the vocabulary dataset of every academic subject gave much the same result in each case. It was therefore necessary to rethink my approach and concatenate the data frames instead.
![Figure 2: Concatenating to create a dataset of all shared vocabulary between the disciplines investigated](https://qm2awesome.wordpress.com/wp-content/uploads/2015/01/concatalt1.png?w=604)
Although this new approach achieved the aim of creating a dataset of the vocabulary shared between different subjects, it fundamentally changed the nature of the data being investigated. The premise of using value_counts() after concatenating the data frames of individual subjects was to count the occurrences of a particular item of vocabulary across the total number of subjects investigated. If a word returned a value of 4 from value_counts(), it had occurred in at least one of the articles for every subject investigated.
![Figure 3: An example of the data and the values returned after running value_counts()](https://qm2awesome.wordpress.com/wp-content/uploads/2015/01/concat_head.png?w=604)
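The concatenate-then-count step described above can be sketched roughly as follows. The word lists, the subject names other than physics, and the column name `word` are all invented for the example. The sketch also assumes each subject's list is de-duplicated first, so that a count really does mean "number of subjects" and not "number of rows".

```python
import pandas as pd

# Hypothetical word lists per subject (illustrative only).
subjects = {
    "physics": ["positive", "energy", "field", "energy"],
    "chemistry": ["positive", "energy", "bond"],
    "biology": ["positive", "cell", "field"],
    "mathematics": ["positive", "proof"],
}

# One frame per subject, keeping each word at most once per subject.
frames = [
    pd.DataFrame({"word": words}).drop_duplicates()
    for words in subjects.values()
]

# Stack the frames, then count how many subjects each word appears in.
combined = pd.concat(frames, ignore_index=True)
subject_counts = combined["word"].value_counts()
```

Here `subject_counts["positive"]` is 4 because the word appears in all four made-up subjects, regardless of how often it is used within any one of them.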
Although this was successful in the sense that it shows which words are shared between subjects, it does not account for the prevalence of each word within each subject. Take the word ‘positive’, for example. Although the dataset shows that it occurs in all subjects, this does not mean that it features heavily in every article. It might be that ‘positive’ is used only once in one article out of the ten investigated for physics, but because it is also used at least once in at least one article of every other subject, it is given the highest frequency value (4 out of the 4 scientific subjects investigated). This bias needs to be flagged to readers in order to avoid any misrepresentation of the data or misleading results.
This inherent bias within the data was resolved by manually searching through each disciplinary dataset of vocabulary to determine how many articles each particular word was present in. However, this meant manually searching through a database containing an extremely large number of words, which would have been incredibly laborious and time-consuming if tackled without some sort of sorting mechanism.
![Figure 4: Streamlining the dataset for more specific analysis](https://qm2awesome.wordpress.com/wp-content/uploads/2015/01/streamline.png?w=604&h=231)
In the example shown in Figure 4, the original dataset containing all shared words between all investigated subjects was ‘streamlined’ to contain only a specified proportion. In this instance we are only looking at the words that occurred in 2 out of the 4 subjects investigated in ‘The Sciences’ category.
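The kind of filtering shown in Figure 4 can be approximated with a boolean mask on the counts. This is only a sketch: the Series below is made-up data standing in for the output of value_counts(), and the variable names are my own.

```python
import pandas as pd

# Hypothetical counts: word -> number of subjects it appears in.
subject_counts = pd.Series(
    {"positive": 4, "energy": 3, "field": 2, "bond": 2, "cell": 1},
    name="subjects",
)

# Keep only the words found in exactly 2 of the 4 subjects.
shared_in_two = subject_counts[subject_counts == 2]

# Alternatively, keep every word found in at least 2 subjects.
shared_in_two_or_more = subject_counts[subject_counts >= 2]
```

Changing the number in the comparison gives the other proportions (3 out of 4, 4 out of 4) without rebuilding the dataset.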
From this point onwards it was a case of returning to each original dataset, conducting a specific search for the words contained in the ‘streamlined’ dataset and then making a note of their prevalence among the individual articles for that subject.
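The lookup itself was done by hand, but for anyone repeating it, something like the following could count the articles automatically. It assumes a layout I have invented for the example: one row per word occurrence, with an `article` column identifying the article and a `word` column holding the word.

```python
import pandas as pd

# Hypothetical table for one subject: one row per (article, word) occurrence.
physics = pd.DataFrame({
    "article": [1, 1, 2, 3, 3, 3],
    "word": ["positive", "field", "field", "field", "bond", "positive"],
})

# Words taken from the 'streamlined' dataset (made up here).
streamlined_words = ["field", "bond", "cell"]

# For each streamlined word, count the distinct articles it appears in.
# Words that never occur in this subject are reported as 0.
article_counts = (
    physics[physics["word"].isin(streamlined_words)]
    .groupby("word")["article"]
    .nunique()
    .reindex(streamlined_words, fill_value=0)
)
```

Running this once per subject gives the per-discipline prevalence figures that a bar chart can then be built from.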
![Figure 5: A bar chart to show the prevalence of words shared between 3 out of 4 Scientific Disciplines in each individual discipline](https://qm2awesome.wordpress.com/wp-content/uploads/2015/01/df_e3_shared.png?w=783&h=298)