Usage frequency and lexical class determine the evolution of kinship terms in Indo-European

Languages do not replace their vocabularies at an even rate: words endure longer if they are used more frequently. This effect, which has parallels in evolutionary biology, has been demonstrated for the core vocabulary, a set of common, unrelated meanings. The extent to which it replicates in closed lexical classes remains to be seen, and may indicate how general this effect is in language change. Here, we use phylogenetic comparative methods to investigate the history of 10 kinship categories, a type of closed lexical class of content words, across 47 Indo-European languages. We find that their rate of replacement is correlated with their usage frequency, and this relationship is stronger than in the case of the core vocabulary, even though the envelope of variation is comparable across the two cases. We also find that the residual variation in the rate of replacement of kinship terms is related to genealogical distance of referent to kin. We argue that this relationship is the result of social changes and corresponding shifts in the entire semantic class of kinship terms, shifts typically not present in the core vocabulary. Thus, an understanding of the scope and limits of social change is needed to understand changes in kinship systems, and broader context is necessary to model cultural evolution in particular and the process of system change in general.


Kinship data
We collected kin terms from 45 languages for the following relations: B, D, F, M, MB, MZ, MZD, MZS, S, Z (i.e. brother, daughter, father, mother, mother's brother, mother's sister, mother's sister's daughter, mother's sister's son, son, sister), drawn from a combination of native speakers, ethnographies, and dictionaries in our 'Kinbank' database. Initially, the analyses used a broader set of kinship terms (e.g. FB, FZ, BW, ZH, i.e. father's brother, father's sister, brother's wife, sister's husband); however, we restricted the sample to form a comparable set of word frequencies. This is because most Indo-European languages do not have separate terms for e.g. MZD and MBD, and separate terms for e.g. BW are exceedingly rare (both across languages as types and within languages as tokens). We did not include husband and wife, because these are often synonymous with man and woman, respectively.
Frequency data were collected from 34 corpora in 21 languages, spanning three corpus types: spoken, written, and web-crawled. The list of corpora is given in Table S1 below (e.g. Benešová, Kren, and Waclawicová 2013). The analysis uses one term per language per kinterm. Where a language has multiple words for a particular kinterm, we use whichever is more frequent in the corpus data; where a language has multiple words and we do not have access to frequency data, we rely on expert judgement to select a term.
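Operationally, the corpus-based selection step amounts to a frequency argmax over a language's candidate terms. A minimal Python sketch (the terms and per-million frequencies below are hypothetical illustrations, not values from our corpora):

```python
def select_term(candidates):
    """Pick the corpus-preferred variant: among a language's synonyms
    for one kin relation, return the term with the highest frequency
    per million (hypothetical values used for illustration)."""
    return max(candidates, key=candidates.get)

# e.g. two candidate terms for M in one language
preferred = select_term({'mum': 112.0, 'mother': 95.3})
```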

Supplementary data table
See the supplementary data csv file for the raw data used in this study.

Data summary
We estimate the rate of replacement and compare it with frequency of use for ten types of kin relation. Here we use MB, MZ, MZS, and MZD as shorthand for broader terms (uncle, aunt, male cousin, and female cousin, respectively), because, as noted in the first section, commonly used terms in this set of languages do not distinguish these relatives by the gender of the linking parent.
Below is a table indicating the shorthand used for each kinterm, the number of languages for which we have data, and the number of states (or cognates) for that term.
We collected terms for a superset of the languages for which frequency data are available, in order to obtain a more robust estimate of the rate of replacement for each term.

Frequency data
Below are bar graphs of term frequency by language, across corpora types (web, written, or spoken).

Cognate data
We generated cognate classes using the Indo-European Etymological Dictionary (Buck 2008), LingPy (List, Greenhill, and Forkel 2018), and a panel of volunteer experts recruited through Linguist List (all faults remain ours). All terms were automatically transcribed into the Speech Assessment Methods Phonetic Alphabet (SAMPA) with LingPy's uni2sampa function. Cognates were automatically allocated using LingPy's cluster function. Using the cognate-coded Swadesh list of core vocabulary terms (subset to those languages for which we also have kinterms), we tested the appropriateness of the edit-distance, SCA, and Turchin algorithms, alongside phonemic and phonetic transcriptions, for cognate detection in our data, following the code examples from Lingpy.org. The F-score was highest for the edit-distance algorithm with a 0.4 threshold (see Table S4) (List, Greenhill, and Forkel 2018). We then manually adjusted the results, followed by expert review, which resulted in minor changes. The automatic decisions and our corrections are available in the supplementary data file.
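To illustrate the idea behind the winning method, the following stdlib-only Python sketch performs flat clustering on length-normalised edit distance with a 0.4 threshold, using greedy single linkage. This is not LingPy's implementation, and the word forms are illustrative only:

```python
def edit_distance(a, b):
    # standard Levenshtein distance via dynamic programming
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cognate_clusters(words, threshold=0.4):
    # Greedy single-linkage flat clustering on length-normalised
    # edit distance; returns a cluster id per input word.
    clusters, ids = [], []
    for w in words:
        for k, members in enumerate(clusters):
            if any(edit_distance(w, m) / max(len(w), len(m)) <= threshold
                   for m in members):
                members.append(w)
                ids.append(k)
                break
        else:
            ids.append(len(clusters))
            clusters.append([w])
    return ids

# illustrative forms: the first two land in one cluster, the third in its own
ids = cognate_clusters(['mother', 'mutter', 'pater'])
```

Real detection additionally depends on the transcription (phonemic vs phonetic) and on sound-class models such as SCA, which is why we compared methods by F-score rather than fixing one a priori.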

Phylogeny
We used 1,000 phylogenies from the most recent Bayesian posterior of Indo-European phylogenies (Bouckaert et al. 2012). Trees in the sample are rooted, with branch lengths in years derived from statistical and historical calibration. The Indo-European posterior has an approximate root age of 8,700 years. Trees initially have 111 taxa; these were pruned for each kinterm according to the available data. Table S3 shows, for each kinterm, the number of taxa, languages, and states used to estimate the rate of change. By using a sample of likely phylogenies within a Bayesian approach, we account for phylogenetic uncertainty. Each language is linked to a taxon in the Indo-European phylogeny. Following the methods in Pagel and Meade (2018), we use BayesTraits version 3.0.1 to implement a Bayesian MCMC approach to estimate the instantaneous global rate of change for each kinterm through Q-matrix normalisation. State probabilities were scaled to represent the empirical frequencies. We used a stepping-stone sampler with 100 stones of 1,000 iterations each. MCMC chains ran for a total of 10,010,000 iterations, with a burn-in of 10,000, sampling every 1,000 iterations. This left a posterior sample of 10,000 draws, approximately 10 per tree. To make the rates comparable to Pagel et al., we scale instantaneous rates to changes per 10,000 years.
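As a sketch of what Q-matrix normalisation amounts to (this is not BayesTraits' internal code): the rate matrix is rescaled so that the expected number of changes per unit branch length equals one, given the stationary state frequencies, after which an instantaneous per-year rate can be re-expressed per 10,000 years. The matrix and frequencies below are toy values:

```python
def normalise_q(q, pi):
    # Rescale a CTMC rate matrix (rows sum to zero) so that the
    # expected number of changes per unit branch length is one:
    # mean rate = -sum_i pi_i * Q_ii.
    mean_rate = -sum(p * q[i][i] for i, p in enumerate(pi))
    return [[x / mean_rate for x in row] for row in q]

def rate_per_10k_years(rate_per_year):
    # Branch lengths are in years; re-express an instantaneous
    # rate as expected changes per 10,000 years.
    return rate_per_year * 10_000

# toy 2-state example with equal stationary frequencies
q_norm = normalise_q([[-2.0, 2.0], [2.0, -2.0]], [0.5, 0.5])
```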

Rates of change
Each analysis was run three times to ensure the MCMC chain converged. Tables S5–S14 display the marginal log-likelihood for each MCMC run, the mean global rate of change for each run, and the average across the three runs; each table is labelled by kin code. For each kinterm, we also used the Gelman–Rubin diagnostic to test for convergence (Gelman and Rubin 1992). This tests for convergence between multiple MCMC chains by analysing the differences between them: the 'potential scale reduction factor' quantifies how much the spread of the pooled estimate could still shrink with further sampling, with a factor of 1 indicating no change needed. A rule of thumb suggests that a point estimate below 1.1, with an upper confidence limit also near this threshold, is sufficient to claim convergence.
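For concreteness, the basic (uncorrected) potential scale reduction factor can be computed in a few lines; this is a pure-Python sketch of the original Gelman–Rubin statistic, without the refinements (e.g. split chains, degrees-of-freedom correction) that diagnostic packages apply, and the chains below are toy data:

```python
from statistics import mean, variance

def gelman_rubin(chains):
    # chains: list of equal-length lists of MCMC samples for one parameter
    m, n = len(chains), len(chains[0])
    chain_means = [mean(c) for c in chains]
    w = mean(variance(c) for c in chains)   # within-chain variance W
    b = n * variance(chain_means)           # between-chain variance B
    var_plus = (n - 1) / n * w + b / n      # pooled variance estimate
    return (var_plus / w) ** 0.5            # potential scale reduction factor

# overlapping toy chains give a factor near 1; separated chains give a large one
r_hat = gelman_rubin([[0.0, 1.0, 0.0, 1.0], [0.0, 1.0, 0.0, 1.0]])
```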

Half-life
We calculate the half-life of each kinterm following the methods of Pagel and Meade (2018). The half-life of a term is the expected time until there is a 50% chance that its cognate class has changed.
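Under a constant-rate (Poisson) model of cognate replacement, the waiting time to a change is exponential, so the half-life follows directly from the rate: t½ = ln 2 / q. A minimal sketch (with the rate expressed per 10,000 years, the half-life comes out in units of 10,000 years; the rate value is illustrative):

```python
import math

def half_life(rate):
    # Time until a 50% chance of a cognate change, assuming
    # replacement is a constant-rate (Poisson) process:
    # P(no change by t) = exp(-rate * t) = 0.5  =>  t = ln 2 / rate
    return math.log(2) / rate

# illustrative: a rate of 0.5 changes per 10,000 years
t_half = half_life(0.5)
```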

Frequency of use and rates of change: Swadesh words and kin terms
We want to establish (i) whether rate of change correlates with frequency of use for kin terms, and (ii) how strong this relationship is compared to Swadesh terms. The difficulty is that the two data sets are structured differently. A given kin term can have a written / spoken / web frequency as well as a word / lemma frequency, whereas a given Swadesh term (core vocabulary term) has only one frequency (though it may be written / spoken / etc., depending on the source corpus). In addition, term length correlates with frequency of use in a way that is not directly relevant to our analysis.
To create comparable kin and Swadesh datasets, we fit a linear mixed model (M1.1) as a control on the kinterm data and use the word-meaning random intercepts from this model in a second, predictive, model (M2).

Control model: kin terms
Model 1.1: Centralised log frequency of use per million ~ Corpus genre + Frequency type + (1 | Word meaning). The aim of M1.1 is to provide a word-meaning random intercept for M, F, B, Z, etc. that incorporates genre and frequency-type information. As a result, random slopes were not tested. Table S17 shows the fixed effects for this model.
The word meaning-level random intercepts capture word frequency across data sources. These intercepts predict rate of change even when controlling for word length. Since the frequency measure and word length are both word-level predictors, there are no potential random slopes, and we report the model with a random intercept for language only. The estimates of the fixed effects are given in Table S18.
M1.1 provides us with aggregated information on the centralised log frequency per million of each kinship word. This allows direct comparison with the core vocabulary (where we only have one datum per word) without a considerable loss of information.
M1.2 serves only to demonstrate that the frequency effect is not an artefact of word length. Model 2: Centralised rate of replacement ~ Frequency measure * Word type + (Word type | Language). The predictive model (M2) uses the random word-meaning intercepts (named Frequency measure) from M1.1 as the measure of frequency of use for kin terms, and centralised log frequency per million for the Swadesh terms. We restrict the dataset to languages for which we have kin term data. We propose two possible random-effect structures: random intercepts for each language (M2.1), or random slopes for each word type in each language (M2.2). Word type is the only fixed effect that can vary across languages. Goodness-of-fit tests reveal that the random slope results in a better fit (Table S19), so we report M2.2, the model with the slope (Table S20). The interaction effect is plotted in Figure S5; Figure S6 shows the raw data used in the model, highlighting the kinship terms.