Using text analysis to quantify the similarity and evolution of scientific disciplines

We use an information-theoretic measure of linguistic similarity to investigate the organization and evolution of scientific fields. An analysis of almost 20 M papers from the past three decades reveals that linguistic similarity is related to, yet different from, expert- and citation-based classifications, leading to an improved view of the organization of science. A temporal analysis of the similarity of fields shows that some fields (e.g. computer science) are becoming increasingly central, but that on average the similarity between pairs of disciplines has not changed in the last decades. This suggests that tendencies of convergence (e.g. multi-disciplinarity) and divergence (e.g. specialization) of disciplines are in balance.


Introduction
The digitization of scientific production opens new possibilities for quantitative studies on scientometrics and science of science [1], bringing new insights into questions such as how knowledge is organized (maps of science) [2][3][4][5][6], how impact evolves over time (bibliometrics) [7,8], or how to measure the degree of interdisciplinarity [9,10]. At the heart of these questions lie the problems of identifying scientific fields and determining how they relate to each other. The difficulty of these problems, and the inadequacy of a purely essentialist approach, were already clear to Popper [11]: 'The belief that there is such a thing as

Dissimilarity of scientific fields
We are interested in the general problem [2,4] of quantifying the relationship between two scientific fields i, j through the computation of a dissimilarity measure D(i, j), i.e. a quantification of how different i and j are. Dissimilarity measures are symmetric (D(i, j) = D(j, i)), non-negative (D(i, j) ≥ 0) and satisfy D(i, i) = 0 [29]. Each scientific field is defined by (at least hundreds of) papers classified by Web of Science as belonging to the same category (see Material and methods §5.1 for details on the data). We consider dissimilarities computed from three different types of information: expert, citation and language.

Experts
The classification of disciplines by their relationships is as old as science itself. The most used structure is a strict hierarchical tree, as seen in the traditional departmental division of universities. The collection of papers used here, provided by ISI Web of Science [30], includes a classification of papers according to the Organisation for Economic Co-operation and Development (OECD) classification of fields of science and technology [31]. This scheme is a hierarchical tree with scientific fields defined at three levels (domains, disciplines and specialties). For instance, Applied Mathematics (a specialty) is part of Mathematics (a discipline), which is part of Natural Sciences (a domain). The natural dissimilarity measure D exp (i, j) between two fields in this structure is the number of links needed to reach a common ancestor of i and j. For instance, considering i, j at the specialty level, D exp can assume three different values: D exp = 1 for specialties belonging to the same discipline (e.g. Applied Mathematics and Statistics & Probability), D exp = 2 for specialties belonging to the same domain (e.g. Applied Mathematics and Condensed Matter Physics) and D exp = 3 for all other pairs of specialties (e.g. Applied Mathematics and Linguistics). While researchers have pointed out potential issues with the ISI Web of Science categories [4], it offers the most extensive available classification and remains widely used to relate articles and journals to disciplines [9,32].
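The tree distance described above can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical `OECD_PATH` mapping from each specialty to its (domain, discipline) path; the mapping below lists only four example specialties, not the full OECD table.

```python
# Sketch of the expert tree distance D_exp. OECD_PATH is a hypothetical,
# illustrative mapping: specialty -> (domain, discipline).
OECD_PATH = {
    "Applied Mathematics":      ("Natural Sciences", "Mathematics"),
    "Statistics & Probability": ("Natural Sciences", "Mathematics"),
    "Condensed Matter Physics": ("Natural Sciences", "Physical Sciences"),
    "Linguistics":              ("Humanities", "Languages & Literature"),
}

def d_exp(i: str, j: str) -> int:
    """Number of links needed to reach a common ancestor of specialties i, j."""
    dom_i, dis_i = OECD_PATH[i]
    dom_j, dis_j = OECD_PATH[j]
    if i == j:
        return 0
    if dis_i == dis_j:   # same discipline
        return 1
    if dom_i == dom_j:   # same domain, different discipline
        return 2
    return 3             # different domains
```

By construction this measure is symmetric and can only take the values {0, 1, 2, 3} at the specialty level, matching the three cases enumerated in the text.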

Citations
Another popular approach is to consider that fields i and j are more similar if there are citations from (to) papers in i to (from) papers in j [4,16,17]. Here, we consider a dissimilarity measure D cite (i, j) which decreases with every citation between papers in i and j, increases with every citation from i that is not to j (and vice versa), but remains unchanged by citations that do not involve either i or j. These requirements are achieved using (for i ≠ j) a symmetrized Jaccard-like dissimilarity [29,33]:

D cite (i, j) = 1 − (1/2) [ c(i, j) / (c(i, j) + C(i, ¬j) + C(¬i, j)) + c(j, i) / (c(j, i) + C(j, ¬i) + C(¬j, i)) ],    (2.1)

where c(a, b) is the number of citations from papers in a to papers in b, C(a, ¬b) = Σ_{t≠b} c(a, t) is the number of citations from a that are not to b, and C(¬a, b) = Σ_{t≠a} c(t, b) is the number of citations to b that are not from a.
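As a sketch, the symmetrized Jaccard-like dissimilarity described above can be computed from an aggregated citation-count matrix. The matrix `c` below is a hypothetical toy example (c[a, b] counts citations from field a to field b); this is one consistent reading of the requirements stated in the text, not the authors' exact implementation.

```python
import numpy as np

def d_cite(c: np.ndarray, i: int, j: int) -> float:
    """Symmetrized Jaccard-like citation dissimilarity between fields i and j.

    c[a, b] = number of citations from field a to field b. Citations not
    involving i or j leave the result unchanged, as required in the text.
    """
    if i == j:
        return 0.0
    # citations from i not to j, and to j not from i (and the reverse roles)
    C_i_notj = c[i].sum() - c[i, j]
    C_noti_j = c[:, j].sum() - c[i, j]
    C_j_noti = c[j].sum() - c[j, i]
    C_notj_i = c[:, i].sum() - c[j, i]
    # Jaccard-like similarity: shared citations over the union of citations
    s_ij = c[i, j] / (c[i, j] + C_i_notj + C_noti_j)
    s_ji = c[j, i] / (c[j, i] + C_j_noti + C_notj_i)
    return 1.0 - 0.5 * (s_ij + s_ji)
```

Two fields that cite only each other get D cite = 0, and every citation from i that does not go to j pushes the value towards 1 (the sketch assumes each field has at least one citation, so the denominators are non-zero).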

Language
We compare the language of fields i and j based on the frequency of words in each field, using methods from information theory. Measuring the frequency p(w) of each word w, for each field i we obtain a vector of frequencies p_i ≡ p_i(w) for w = 1, . . . , V, where V is the size of the vocabulary (i.e. the number of different words). From this, following [27], the dissimilarity between two fields i and j is

D lang (i, j) = [ H_2((p_i + p_j)/2) − (1/2)(H_2(p_i) + H_2(p_j)) ] / { (1/2) [ 1 − (1/2)(H_2(p_i) + H_2(p_j)) ] },    (2.2)

where H_2(p_i) = 1 − Σ_w p_i(w)² is the generalized entropy of order 2 and the denominator ensures normalization (i.e. 0 ≤ D lang (i, j) ≤ 1). To increase the discrimination power and to avoid statistical biases in our estimation, we removed a list of stop words and included only the V = 20 000 most frequent words (see Material and methods §5.3 for a justification). The dissimilarity (2.2) corresponds to a generalized (and normalized) Jensen-Shannon divergence, which yields statistically robust estimations in texts [27,28] (for details and motivation, see Material and methods §5.4).
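The normalized order-2 divergence can be computed directly from the two frequency vectors. The sketch below assumes `p_i` and `p_j` are numpy arrays of word frequencies over the same vocabulary (toy values, not real corpora); the normalization used here maps identical distributions to 0 and fields with disjoint vocabularies to 1.

```python
import numpy as np

def h2(p: np.ndarray) -> float:
    """Generalized entropy of order 2: H2(p) = 1 - sum_w p(w)^2."""
    return 1.0 - float(np.sum(p ** 2))

def d_lang(p_i: np.ndarray, p_j: np.ndarray) -> float:
    """Normalized generalized Jensen-Shannon divergence of order 2
    between the word-frequency vectors of two fields."""
    m = 0.5 * (p_i + p_j)                 # frequencies of the mixed corpus
    mean_h = 0.5 * (h2(p_i) + h2(p_j))    # mean entropy of the two fields
    # numerator: entropy gained by mixing; denominator: its maximum value,
    # attained when the two vocabularies are disjoint
    return (h2(m) - mean_h) / (0.5 * (1.0 - mean_h))
```

Because the measure depends only on p_i and p_j, it is unaffected by what other fields contain, which is the property the next paragraph emphasizes.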
In contrast with most previously proposed methods, equation (2.2) has two critical properties that are essential to obtain the interpretable results mentioned in the Introduction. On the one hand, it is well founded in information theory and its statistical properties (in terms of systematic and statistical errors) are well understood [27,34], distinguishing it from other, heuristic approaches. On the other hand, it has convenient properties: D lang (i, j) depends only on the papers contained in fields i and j and it is normalized, 0 ≤ D lang (i, j) ≤ 1. As a result, the measured distance between two fields, D lang (i, j), has an absolute meaning. This is in contrast with alternative similarity measures [2,4], including machine-learning approaches (e.g. topic models [15,35]) based on (un-)supervised classification of documents into coherent subgroups. Here, the main limitations stem from the fact that either (i) the division into subgroups is typically based on statistically significant differences in the usage of words between the different subgroups, independent of the actual effect size, or (ii) the resulting distance between two fields depends also on all other fields (e.g. the distance between 'Physics' and 'Chemistry' depends on whether one includes articles about 'Anthropology' in the classification).

Results
We now present and interpret results obtained by computing the three dissimilarity measures (D exp , D cite and D lang ) reported above for scientific fields i, j defined by papers published in different time intervals and categorized (by Web of Science) as belonging to the same specialty (e.g. Applied Mathematics), discipline (e.g. Mathematics) or domain (e.g. Natural Sciences). Figure 1 shows the three D(i, j) at the level of specialties (i, j) for the complete time interval 1991-2014. The concentration of low D(i, j) values close to the diagonal shows that both the citations and the language of scientific papers partially reflect the disciplinary classification made by the experts. However, visual inspection already reveals that citations and our language analysis show relationships not present in the expert classification, e.g. the low dissimilarity between Engineering and Natural Sciences (most clearly between Electrical Engineering and Physical Sciences) and between the disciplines inside the Agriculture domain and Biological Sciences.

Comparison of dissimilarity measures
We start by quantifying the relationship between the three dissimilarity measures (D exp , D cite and D lang ) across all pairs of specialties (i, j). In table 1, we report the rank correlations between the three measures, obtained by ranking, for each dissimilarity, the pairs (i, j) according to D(i, j). The choice of this non-parametric correlation is motivated by the fact that the ranges of the three measures differ dramatically (e.g. D exp ∈ {0, 1, 2, 3} and D lang ∈ [0, 1]). The positive, statistically significant correlation between all pairs of D(i, j)'s confirms the visual impression described above. The correlation between citations and language is higher than their correlations with the experts classification. Remarkably, language and citations show a very similar correlation with experts, but language is systematically less correlated than citations (p-value = 1.8 × 10⁻⁵ for Spearman-ρ and p-value = 2.2 × 10⁻⁵ for Kendall-τ). We conclude that the language dissimilarity D lang introduced here is able to retrieve the well-known relationships between disciplines to a similar extent as the (well-studied) citation analysis.
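The rank-based comparison can be sketched as follows: Spearman's ρ is simply the Pearson correlation of the ranks, which makes the very different ranges of D exp and D lang irrelevant. The toy vectors below are illustrative, and this simple ranking ignores the tie corrections a full implementation would apply.

```python
import numpy as np

def spearman_rho(x: np.ndarray, y: np.ndarray) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks.
    (Sketch only: ties are broken by position, not averaged.)"""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return float(np.corrcoef(rx, ry)[0, 1])

# toy dissimilarities over pairs (i, j): different ranges, same ordering
d_exp_pairs = np.array([1, 1, 2, 2, 3, 3], dtype=float)   # D_exp values
d_lang_pairs = np.array([0.2, 0.3, 0.4, 0.5, 0.7, 0.9])   # D_lang values
rho = spearman_rho(d_exp_pairs, d_lang_pairs)
```

In practice one would use a library routine (e.g. `scipy.stats.spearmanr` and `scipy.stats.kendalltau`) that also returns the p-values reported in table 1.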
We now explore how the relationship between the different dimensions depends on the different scientific fields. The results in figure 2 confirm the conclusions of the aggregated analysis but show further interesting features. First, the correlation in (D exp , D lang ) is smaller than in (D exp , D cite ) mainly in the natural sciences. Second, while the correlation between citations and language remains largely constant, there are large fluctuations in the correlations between experts and citations (as well as experts and language). This is seen both in the strong downward spikes and in the manifest dependence on disciplines and domains. The titles of the specialties at the low peaks already suggest that these are specialties with interdisciplinary connections. For instance, Chemistry, Medicinal is a specialty that (according to the experts classification) belongs to the discipline Basic Medicine and to the domain Medical Science. Therefore, D exp = 3 between Chemistry, Medicinal and all specialties of the Natural Sciences (in particular, all specialties from the discipline Chemical Sciences), whereas citations and language reveal a much lower dissimilarity between these specialties.

[Figure 3 caption: Results for citations (language) were obtained by agglomerative hierarchical clustering, applying the group average method [36] to D cite (i, j) (D lang (i, j)). The x-axis shows the clustering dissimilarity (i.e. the dissimilarity of the two clusters that are merged). The dashed line corresponds to a clustering dissimilarity equal to the percentile 0.92 of the values of all cluster dissimilarities for each measure (citations/language).]

Hierarchical clustering
A strict hierarchical classification of scientific fields is both aesthetically appealing and of practical use in bibliographical and document classification tasks. It also allows us to further highlight the differences in the relationships between scientific fields revealed by the different dissimilarity measures (in particular by D lang ). While D exp is precisely based on one such hierarchical classification, D cite and D lang are not. In the hierarchical clusterings obtained from these two measures (figure 3), Agriculture appears more isolated based on citations, while based on language this happens for Medical Science. A more detailed picture of the differences between language and citations is revealed at the level of disciplines (figure 3b). While at the first division both citations and language create a cluster in which all disciplines of the domains Humanities and Social Sciences appear, further divisions show more subtle differences between the two dissimilarity measures. Remarkably, the hierarchy obtained from language creates a cluster containing all and only Humanities disciplines. By contrast, the hierarchy based on citations creates a cluster with three of the five Humanities disciplines (Lang. and Literature, Arts and Other Humanities), while the two remaining ones (History & Archaeology and Philosophy, ethics, religion) are clustered together in the middle of a cluster of Social Science disciplines. Another interesting difference between the clusterings is revealed by looking at three disciplines of the domain Medical Science: in the analysis based on citations, the minimum cluster that includes the three disciplines also includes Biological Sciences and Other Natural Sciences, while in the language analysis this cluster additionally includes three related Engineering disciplines (Medical Eng., Ind. Biotechnology and Environ. Biotechnology).
Probably the most remarkable feature of the clustering obtained by, both, citations and language is that it repeatedly clusters together related disciplines from Natural Sciences with disciplines from Engineering and Medicine (e.g. Chemical Sciences and Materials Science). This clustering, not present in the experts classification, suggests that the distinction between fundamental and applied sciences present in the expert classification has no strong effect on citations and the language of the publications. Instead, in this specific case, the citation and language analysis seem to be capturing a connection between 'subject matters' that was necessarily absent from the strict hierarchical expert classification.
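The clustering step itself is standard: agglomerative hierarchical clustering with group-average (UPGMA) linkage applied to a precomputed dissimilarity matrix. The sketch below uses a hypothetical 4-field matrix in place of the real D cite or D lang values.

```python
# Sketch of the agglomerative clustering: group-average linkage [36]
# applied to a toy, symmetric dissimilarity matrix over four fields.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

D = np.array([[0.0, 0.1, 0.8, 0.9],
              [0.1, 0.0, 0.7, 0.8],
              [0.8, 0.7, 0.0, 0.2],
              [0.9, 0.8, 0.2, 0.0]])   # hypothetical D_lang between 4 fields

# linkage expects the condensed (upper-triangular) form of the matrix
Z = linkage(squareform(D), method='average')       # group average method
labels = fcluster(Z, t=2, criterion='maxclust')    # cut into two clusters
```

The merge heights stored in `Z` are the clustering dissimilarities plotted on the x-axis of figure 3; cutting the tree at a chosen height (here: two clusters) yields the flat groupings discussed in the text.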

Temporal evolution
While in the previous sections we looked at a static snapshot of the relations between disciplines, here we are interested in how the linguistic relationship D lang (i, j) between pairs (i, j) of disciplines evolved over the last three decades. In figure 4, we show the temporal evolution for five out of 703 pairs (i, j), with a focus on the discipline Physical Sciences, illustrating different types of dynamic patterns. On the one hand, the dissimilarity to Chemical Sciences (its most similar discipline) and to Mathematics stays roughly constant over time. On the other hand, we also observe systematic trends of disciplines becoming more or less similar over time. While the proximity to Biological Sciences and to Computer and Information Sciences has steadily increased (decreasing dissimilarity D lang (i, j)) after the year 2000, the opposite trend is seen for Electrical Engineering, Electronic Engineering, Information Engineering. These observations are consistent with the increasing number of biological and computational publications in Physics, and with a departure from the historical connections to Engineering.
The observations reported above raise the question as to whether scientific disciplines are showing an overall tendency to become more similar to each other. In a more general context, this amounts to the question of whether the purported increase in interdisciplinarity leads to a larger overlap in the language used by different disciplines. We address this question by computing, for each pair of disciplines, the mean yearly variation

ν(i, j) = [ D lang (i, j; t_f) − D lang (i, j; t_0) ] / Δt,    (3.2)

where the time interval Δt ≡ t_f − t_0 was usually from t_0 = 1991 to t_f = 2014. The distribution of values of ν for all discipline pairs (i, j) is shown in the (rightmost) box plot in figure 4b. We see that there are both positive and negative variations, consistent with our qualitative observations for the example of Physical Sciences in figure 4a. However, the average variation ⟨ν⟩ ≈ −0.00025 over all pairs of disciplines (i, j) is not distinguishable from zero (the null hypothesis ⟨ν⟩ = 0 has a p-value = 0.07 in the one-sample t-test for the mean and a p-value = 0.21 in the non-parametric Wilcoxon test), i.e. the typical dissimilarity remains unchanged. This result suggests that, while there are systematic trends for individual pairs of disciplines, on average there is no significant increase or decrease in the interdisciplinarity of science as a whole over the last three decades, as measured by language. On a more fine-grained level, however, we observe systematic trends suggesting that individual disciplines tend to become more (less) central. For this, we focus on the discipline pairs (i, j) which experienced the most extreme variation in the last decade (more than one standard deviation away from ⟨ν⟩). These pairs typically have |ν| ≳ 0.003, meaning that their (normalized) dissimilarity changes by roughly 3% in a decade. The three disciplines most frequently seen in the left tail (ν < 0) are: 1-02 Computer and Information Sciences, 2-08 Environmental Biotechnology and 3-01 Basic Medicine.
The language of these disciplines became significantly more similar to the language of other disciplines in the last three decades, suggesting that these disciplines became more central. By contrast, the three disciplines that experienced most strongly the opposite effect (most frequently seen in the right tail, ν > 0) are (in decreasing order): 5-01 Psychology, 2-05 Materials Engineering and 2-02 Electrical Engineering, Electronic Engineering, Information Engineering.
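The temporal analysis above reduces to a simple computation per discipline pair, followed by two location tests on the resulting distribution. The sketch below uses synthetic ν values (drawn around zero) in place of the real measurements, purely to show the shape of the computation.

```python
import numpy as np
from scipy.stats import ttest_1samp, wilcoxon

def nu(d_t0: float, d_tf: float, t0: int = 1991, tf: int = 2014) -> float:
    """Mean yearly variation of D_lang(i, j) between years t0 and tf."""
    return (d_tf - d_t0) / (tf - t0)

# synthetic stand-in for the 703 discipline-pair values of nu
rng = np.random.default_rng(1)
nus = rng.normal(0.0, 0.003, size=703)

# test the null hypothesis <nu> = 0 (parametric and non-parametric)
t_p = ttest_1samp(nus, popmean=0.0).pvalue
w_p = wilcoxon(nus).pvalue
```

A pair whose dissimilarity drops from, say, 0.63 to 0.40 over the 23 years has ν = −0.01, i.e. roughly a 10% change per decade, well into the tails discussed in the text.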
In the interpretation of the results reported in this section it is crucial to take into account that the measure D lang we use depends only on the frequency of the words in each of the fields and in each year. In particular, this means that the results can be interpreted as an absolute dissimilarity independent of the content or volume of other fields. Another advantage of our measure D lang is that it allows us to quantify the contribution of individual words [28]. This general feature of our method is illustrated in table 2 and allows for a deeper interpretation of the meaning of D lang (e.g. the contribution of topical words and stylistic differences).
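The word-level decomposition mentioned above has a particularly simple form for order-2 entropies: the unnormalized divergence H_2((p_i + p_j)/2) − (1/2)(H_2(p_i) + H_2(p_j)) splits exactly into a sum over words, with word w contributing (p_i(w) − p_j(w))²/4. The frequencies below are toy values used only to illustrate the identity.

```python
import numpy as np

def h2(p: np.ndarray) -> float:
    """Generalized entropy of order 2."""
    return 1.0 - float(np.sum(p ** 2))

def word_contributions(p_i: np.ndarray, p_j: np.ndarray) -> np.ndarray:
    """Per-word contributions to the (unnormalized) order-2 divergence;
    they sum exactly to H2((p_i+p_j)/2) - (H2(p_i)+H2(p_j))/2."""
    return 0.25 * (p_i - p_j) ** 2

p_i = np.array([0.5, 0.3, 0.2, 0.0])    # toy word frequencies, field i
p_j = np.array([0.1, 0.3, 0.2, 0.4])    # toy word frequencies, field j
contrib = word_contributions(p_i, p_j)  # words 0 and 3 dominate here
```

Ranking `contrib` identifies the words that drive the dissimilarity between two fields, which is the kind of interpretation illustrated in table 2.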

Discussion
We investigated the similarity between scientific fields from different perspectives: an expert classification, a citation analysis and a newly proposed measure of linguistic similarity. We found that these different dimensions are related yet different, yielding thus new insights on the relationship between disciplines, their hierarchical organization and their temporal evolution.
Our first main finding is that the language and citation relationships between disciplines are similar to each other and substantially different from the expert classification. This is consistent with the motivation presented in our introduction, which associated the expert classification with the (largely idealized) essentialist view of scientific disciplines, while the citation (social) and language (cognitive) dimensions are closer to those that play a more important role in the relationship between fields. Interestingly, our results indicate that the language relations between fields are more distinct from the expert classification than the citation relations are, especially in the natural sciences.
Our second main finding is that over the last 30 years the language of different scientific fields remained, on average, at the same distance from that of other fields. While individual disciplines show clear trends of increasing (or decreasing) centrality, this suggests that, overall, diverging tendencies in science (e.g. specialization) are in balance with converging tendencies (e.g. multidisciplinarity). This is a remarkable quantitative finding in view of the substantial changes science underwent in this period.
The latter result demonstrates that our textual measure is of practical relevance for the study of interdisciplinarity. In recent years, interdisciplinary research has achieved a central position [10] due to its broader relation to the concept of diversity [37], its effect on impact [38,39] and on the performance of teams [40], as well as its implications for policymaking, e.g. in terms of funding [41]. Is it just a fashion, or is science really becoming more and more interdisciplinary? The usual way to assess interdisciplinarity is based on citation networks, using heuristic approaches [9,32,42] or methods from complex networks [43][44][45][46]. In line with the arguments presented in the introduction, interdisciplinarity can be viewed through different dimensions, and the cognitive dimension is best measured using textual data. However, there are only very few works [47][48][49] relating textual measures to interdisciplinarity, despite the increasing availability of the text of scientific articles. In this view, the significance of our approach is that it provides a measure of interdisciplinarity based on how much the usage of words in different disciplines overlaps.
Finally, we hope our results and methodology will stimulate a multiple-dimensional approach in other problems related to the study of sciences, profiting from the modern availability of large (textual) databases of scientific publications that allow us to go beyond traditional bibliometric analysis [1,9]. These include, but are not limited to, the formulation of more meaningful bibliometric indicators [50], the identification and prediction of influential papers and disciplines [51][52][53], or the inclusion of textual information in recommending related scientific papers [54].

Data and grouping of corpora
We use the Web of Science database [30] and explore the following information available for individual articles: citations, title, abstract and the classification into one scientific specialty (per the OECD classification [31]). We use all papers published between 1991 and 2014 because the number of articles with abstract text is substantial only after 1991 and because, at the time we started our analysis, 2014 was the last complete year available to us. The text of an article was built by concatenating its title and abstract. The corpus representing a specialty in a given year is obtained by concatenating the text of all articles for that specialty in that year. The corpus for a discipline (or domain) concatenates all articles in all specialties belonging to that discipline (or domain).
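The grouping of articles into corpora can be sketched as follows, assuming a hypothetical list of article records; the field names ("title", "abstract", "specialty", "year") and the two example records are illustrative, not the Web of Science schema.

```python
from collections import defaultdict

# hypothetical article records (title/abstract text is made up)
articles = [
    {"title": "On spin glasses", "abstract": "We study ...",
     "specialty": "Condensed Matter Physics", "year": 1995},
    {"title": "Prime gaps", "abstract": "We prove ...",
     "specialty": "Applied Mathematics", "year": 1995},
]

# (specialty, year) -> concatenation of title + abstract of all its articles
corpora = defaultdict(str)
for a in articles:
    corpora[(a["specialty"], a["year"])] += a["title"] + " " + a["abstract"] + " "
```

Discipline- and domain-level corpora would then be obtained by concatenating the corpora of all specialties belonging to that discipline or domain, as described above.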
Our analysis is based on 19 589 166 articles for which both textual and classification information were available (92% of all articles indexed in Web of Science during 1991-2014). In our analysis, we considered only citations from and to the papers in our list because only for these papers did we have a reliable classification into specialties. These citations correspond to roughly half of the ≈625 M citations associated with these papers. See [55] for the divergence values we obtained from this dataset.