Emo, love and god: making sense of Urban Dictionary, a crowd-sourced online dictionary

The Internet facilitates large-scale collaborative projects and the emergence of Web 2.0 platforms, where producers and consumers of content unify, has drastically changed the information market. On the one hand, the promise of the ‘wisdom of the crowd’ has inspired successful projects such as Wikipedia, which has become the primary source of crowd-based information in many languages. On the other hand, the decentralized and often unmonitored environment of such projects may make them susceptible to low-quality content. In this work, we focus on Urban Dictionary, a crowd-sourced online dictionary. We combine computational methods with qualitative annotation and shed light on the overall features of Urban Dictionary in terms of growth, coverage and types of content. We measure a high presence of opinion-focused entries, as opposed to the meaning-focused entries that we expect from traditional dictionaries. Furthermore, Urban Dictionary covers many informal, unfamiliar words as well as proper nouns. Urban Dictionary also contains offensive content, but highly offensive content tends to receive lower scores through the dictionary’s voting system. The low threshold to include new material in Urban Dictionary enables quick recording of new words and new meanings, but the resulting heterogeneous content can pose challenges in using Urban Dictionary as a source to study language innovation.


Contemporary information communication technologies open up new ways of cooperation, leading to the emergence of large-scale crowd-sourced collaborative projects [1]. Examples of such projects are open software development [2], citizen science campaigns [3] and, most notably, Wikipedia [4]. All these projects are based on contributions from volunteers, who are often anonymous and non-experts. Although the success of most of these examples is beyond expectation, there are challenges and shortcomings to be considered as well. In the case of Wikipedia, for instance, inaccuracies [5], edit wars and destructive interactions between contributors [6,7] and biases in coverage and content [8,9] are only a few of the many undesirable aspects of the project that have been studied in detail.
The affordances of Internet-mediated crowd-sourced platforms have also led to the emergence of crowd-sourced online dictionaries. Language is constantly changing. Over time, new words enter the lexicon, others become obsolete, and existing words acquire new meanings (i.e. senses) [10]. Dictionaries record new words and new meanings, are regularly updated, and are sometimes used as a source to study language change [11]. However, a new word or a new meaning needs to have enough evidence backing it up before it can enter a traditional dictionary. For example, selfie was the Oxford Dictionaries word of the year in 2013 and its frequency in the English language increased by 17 000% in that year. Its first recorded use dates back to 2002, but it was only added to OxfordDictionaries.com in August 2013. Even though some of the traditional online dictionaries, such as Oxford Dictionaries or Macmillan Dictionary, have considered implementing crowdsourcing in their workflow [12] (see [13, pp. 3-6] for a typology of crowdsourcing activities in lexicography), most of them rely on professional lexicographers to select, design and compile their entries.
Unlike traditional online dictionaries [13, p. 11], the content in crowd-sourced online dictionaries comes from non-professional contributors; popular examples are Urban Dictionary and Wiktionary [14]. Collaborative online dictionaries are constantly updated and have a lower threshold for including new material compared to traditional dictionaries [13, p. 2]. Moreover, it has also been suggested that such dictionaries might be driving linguistic change, not only reflecting it [15,16]. Crowd-sourced dictionaries could potentially complement online sources such as Twitter, blogs and websites (e.g. [17][18][19]) to study language innovation. However, such dictionaries are subject to spam and vandalism, as well as 'unspecific, incorrect, outdated, oversimplified or overcomplicated descriptions' [12]. Another concern affecting such collaborative dictionaries is the question of whether their content reflects real language innovation, as opposed to the concerns of a specific community of users, their opinions, and generally neologisms and new word meanings that will not last in the language.
This paper presents an explorative study of Urban Dictionary (UD), an online crowd-sourced dictionary founded in December 1999. Users contribute by submitting an entry describing a word, and a word might therefore have multiple entries. According to Aaron Peckham, its founder, 'People write really opinionated definitions and incorrect definitions. There are also ones that have poor spelling and poor grammar [. . .] I think reading those makes definitions more entertaining and sometimes more accurate and honest than a heavily researched dictionary definition' [20]. A UD entry for selfie is shown in figure 1, in which selfie is defined as 'The beginning of the end of intelligent civilization' and accompanied by an example usage 'Future sociologists use the selfie as an artifact for the end of times'. Furthermore, entries can contain tags (e.g. #picture, #photograph). In total, UD contains 76 entries for selfie (July 2016), the earliest submitted in 2009, and a range of variations (e.g. selfie-conscious, selfied, selfieing and selfie-esteem). Overall, there are 353 entries that describe a word (or phrase) containing the string selfie (see figure 2 for a plot over time). Figure 3 shows a similar plot for fleek and on fleek, a phrase that went viral in 2014. UD thus not only captures new words rapidly, but it also captures the many variations that arise over time. Furthermore, the personal, informal and often offensive nature of the content in this popular site is different from the content typically found in both traditional dictionaries (see [13, pp. 3-4] and [13, p. 7]) and more regulated collaborative dictionaries like Wiktionary. The status of UD as a source of evidence for popular and current usage is widely recognized [21][22][23] and it has even been consulted in some legal cases [24]. UD has also been used as a source to cross-check emerging word forms identified through Twitter [18].
UD has also been used for the development of natural language processing systems that have to deal with informal language, non-standard language and slang. For example, UD has been consulted when building a text normalization system for Twitter [25] and it has been used to create more training data for a Twitter-specific sentiment lexicon [26]. In a recent study, UD is used to automatically generate explanations of non-standard words and phrases [24].
While UD seems a promising resource to record and analyse language innovation, so far little is known about the characteristics of its content. In this study, we take the first step towards characterizing UD. So far, UD has been featured in a few studies, but these qualitative analyses were based on a small number of entries [23,27]. We study a complete snapshot (December 1999-July 2016) of all the entries in the dictionary as well as selected samples using content analysis methods. To the best of our knowledge, this is the first systematic study of UD at this scale.

Results
We start with presenting an overall picture of UD ( §2.1), such as its growth and how content is distributed. Next, we compare its size to Wiktionary based on the number of headwords ( §2.2). We then present results based on two crowd-sourcing experiments in which we analyse the types of content and the offensiveness in the entries ( §2.3). Finally, we discuss how characteristics of the entries relate to their popularity on UD ( §2.4).

Overall picture
Since its inception in 1999, UD has had a rather steady growth. Figure 4 shows the number of new entries added each week. So far, UD has collected 1 620 438 headwords (after lower casing) and 2 661 625 entries, with an average of 1.643 entries per headword. However, as depicted in figure 5a, the distribution of the number of entries for each headword varies tremendously from one headword to another. While the majority of headwords have only one definition, there are headwords with more than 1000 definitions. Table 1 reports the headwords with the largest number of definitions.
This fat-tailed, almost power-law distribution is not limited to the number of definitions per headword; the number of definitions contributed by each user follows a similar distribution, shown in figure 5b. The majority of users have contributed only once, while there are few power-users with more than 1000 contributed definitions. These types of distributions are common in self-organized human systems, particularly similar crowd-based systems such as Wikipedia [28,29] or the citizen science projects Zooniverse [3], social media activity levels such as on Twitter [30] or content sharing systems such as Reddit or Digg [31].
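The counts behind these distributions can be illustrated with a minimal Python sketch. It is not the authors' R analysis code; the function names and the toy entry list are hypothetical, and it only assumes that the data are available as one headword string per entry.

```python
from collections import Counter


def entries_per_headword(entry_headwords):
    """Count entries per headword after lower casing (one item per entry)."""
    return Counter(h.lower() for h in entry_headwords)


def count_distribution(counts):
    """Map 'number of entries' to 'number of headwords with that many entries'."""
    return Counter(counts.values())


# Hypothetical toy data: each item is the headword of one submitted entry.
toy_entries = ["Selfie", "selfie", "SELFIE", "fleek", "on fleek", "Fleek", "loml"]

counts = entries_per_headword(toy_entries)
mean_entries = sum(counts.values()) / len(counts)  # entries per headword
dist = count_distribution(counts)                  # basis for a plot like figure 5a

# The average reported in the text follows from the two totals given there.
reported_mean = round(2661625 / 1620438, 3)
```

The same counting, applied to contributor identifiers instead of headwords, would give the per-user distribution described above.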
A noteworthy feature of UD is that users can express their evaluation of different definitions for each headword by up or down voting the definition. There is little to no guideline on 'what a good definition is' in UD and users are supposed to judge the quality of the definitions based on their own subjective perception of how an urban dictionary should be. Figure 6a shows the distribution of the number of up/down votes that each definition has received among all the definitions of all the headwords. A similar pattern is evident, in which many definitions have received very few votes (both up and down) and few definitions have many votes. Figure 6b shows a scatter plot of the number of down votes versus the number of up votes for each definition. There is a striking correlation between the number of up and down votes for each definition which emphasizes the role of visibility rather than quality in the number of votes. However, there seems to be a systematic deviation from a perfect correlation in which the number of up votes generally outperforms the number of down votes. This is more evident in figure 6c, where the distribution of the ratio of up votes to down votes is shown. Evidently, there is a wide variation among the definitions with some having more than 10 times more up votes than down votes and some the other way around.

Number of headwords
We now compare the number of unique headwords in UD to the number of unique headwords in Wiktionary, another crowd-sourced dictionary. Wiktionary manifests a different policy from that of UD. The content in Wiktionary is created and maintained by administrators (selected by the community), registered users and anonymous contributors [14]. In contrast to UD, there are many different mechanisms in Wiktionary to ensure that the content adheres to the community guidelines. Each page is accompanied by a talk page, where users can discuss the content of the page and resolve any possible conflicts. Furthermore, in Wiktionary guidelines can be found for the structure and content of the entries. Capitalization is consistent and content or headwords that do not meet the Wiktionary guidelines are removed. For example, while both UD and Wiktionary have misspelled headwords (e.g. [. . .]
Because of the inconsistent capitalization in UD, we experiment with three approaches to match the headwords between both dictionaries: no preprocessing, lower casing of all characters, and mixed. Table 2 reports the result of this matching. The number of unique headwords in UD is much higher and the lexical overlap is relatively low. Sometimes there is a match on the lexical level (i.e. the headwords match), but UD or Wiktionary cover different or additional meanings. For example, phased is described in UD as 'something being done bit by bit-in phases', a meaning also covered in Wiktionary. However, UD also describes several other meanings, including 'A word that is used when your asking if someone wants to fight' and 'to be "buzzed" when you arent drunk, but arent sober'.
Because there is little curation of UD content, there are many headwords that would not typically be included in a dictionary. Examples include nick names and proper names (e.g. shaskank defined as 'Akshay Kaushik's nick name for his boyfriend Shashank'; dan taylor, defined as 'A very wonderful man that cooks the best beef stew in the whole wide world. [. . .]'), as well as informal spelling (e.g. [. . .]) and made-up words that actually no one uses (e.g. Emptybottleaphobia). Based on manual inspection, it seems that these are often headwords with only one entry. We, therefore, also perform a matching considering only headwords from UD with at least two entries (table 3). In this way, we use the number of entries as a crude proxy for whether the headword is of interest to a wider group of people. Note that this filtering is not applied to Wiktionary, because each headword has only one page and headwords that do not match Wiktionary guidelines are already removed by the community. For example, an important criterion for inclusion in Wiktionary is that the term is reasonably widely attested, e.g. has widespread use or is used in permanently recorded media. Compared to the first analysis, the difference is striking. In this comparison, the number of unique headwords in Wiktionary is higher than that of UD. From a manual inspection we see that many Wiktionary-specific headwords include domain specific and encyclopaedic words (e.g. acacetins, dramaturge and shakespearean sonnets), archaic words (e.g. unaffrighted), as well as some commonly used words (e.g. deceptive, e-voucher). We also find that many of the popular UD headwords (i.e. headwords that have many entries) that are not covered in Wiktionary are proper nouns: the top five entries are canada's history, justin bieber, george w. bush, runescape and green day.
In some cases, entries uniquely appearing in UD refer to words with genuine general coverage, such as loml (in total 11 entries) defined as, for example, 'Acronym of "Love of My Life"' or broham 'a close buddy, compadre, smoking and/drinking buddy. a term of endearment between men to reaffirm heterosexuality' (in total 18 entries).
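The headword matching described in this section can be sketched as a set comparison. The Python below is an illustration under stated assumptions, not the authors' code: the function name, its parameters and the toy data are hypothetical, entry counts are assumed to be taken after normalization, and only the 'no preprocessing' and 'lower casing' variants are shown.

```python
from collections import Counter


def headword_overlap(ud_entry_headwords, wiktionary_headwords,
                     lowercase=False, min_ud_entries=1):
    """Compare unique headwords between the two dictionaries.

    ud_entry_headwords: one item per UD entry, so repeats give entry counts.
    min_ud_entries: keep only UD headwords with at least this many entries
                    (assumption: counted after normalization).
    """
    normalize = str.lower if lowercase else (lambda s: s)
    ud_counts = Counter(normalize(h) for h in ud_entry_headwords)
    ud = {h for h, n in ud_counts.items() if n >= min_ud_entries}
    wiktionary = {normalize(h) for h in wiktionary_headwords}
    return {
        "ud_only": len(ud - wiktionary),
        "both": len(ud & wiktionary),
        "wiktionary_only": len(wiktionary - ud),
    }


# Hypothetical toy data.
ud_entries = ["Phased", "phased", "loml", "loml", "Emptybottleaphobia"]
wiktionary = ["phased", "deceptive", "dramaturge"]

all_headwords = headword_overlap(ud_entries, wiktionary)
at_least_two = headword_overlap(ud_entries, wiktionary,
                                lowercase=True, min_ud_entries=2)
```

In the toy data, requiring at least two entries drops the single-entry headword from the UD side while leaving the Wiktionary side unchanged, which mirrors the direction of the shift between tables 2 and 3.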

Content analysis
In this section, we present our analyses on the different types of content as well as the offensiveness of the content in UD.

Content type
We now analyse several aspects of the content in UD that we expect to be different from content typically found in traditional dictionaries as well as Wiktionary. For example, manual inspection suggested that UD has a higher coverage of informal and infrequent words and of proper nouns (e.g. names of places or specific people). Many of the headwords are not covered in knowledge bases or encyclopaedias. To characterize the data, we therefore annotated a sample of the data using crowdsourcing (see Data and methods). In order to limit the dominance of headwords with only one entry (which represent the majority of headwords in UD), the sample was created by taking headwords from each of the 11 frequency bins (see table 10 for details on the way the bins were created and sampled from). Note that the last two bins are very small. For each headword, we include up to three entries (top ranked, second ranked and random based on up and down votes). Annotations were collected on the entry level and crowd workers were shown the headword, definition and example.

Proper nouns
Dictionaries are usually selective about including proper nouns (e.g. names of places or individuals) [32, p. 77]. In contrast, in UD many entries describe proper nouns. We therefore asked crowdworkers whether the entry described a proper noun (yes or no). In our stratified sample, 16.4% of the entries were annotated as being about a proper noun. Figure 7 shows the fraction of proper nouns by frequency bin.

Opinions
Most dictionaries strive towards objective content. For example, Wiktionary states 'Avoid bias. Entries should be written from a neutral point of view, representing all usages fairly and sympathetically'. In contrast, the entries provided in UD do not always describe the meaning of a word, but they sometimes contain an opinion (e.g. beer 'Possibly the best thing ever to be invented ever. I MEAN IT' or Bush 'A disgrace to America'). We therefore asked the crowdworkers whether the definition describes the meaning of the word, expresses a personal opinion, or both. Figures 8 and 9 show the fraction of entries labeled as opinion, meaning or both, separated according to whether they were annotated as describing proper nouns. In higher frequency bins, the fraction of entries marked as opinion is higher. We also find that the number of entries marked as opinion is higher for proper nouns. While most entries are marked as describing a meaning, the considerable presence of opinions suggests that the type of content in UD is different from that in traditional dictionaries [13, pp. 3-4].

Familiarity
UD enables quick recording of new words and new meanings, many of which may not yet have seen widespread usage. Furthermore, as discussed in the previous section, some entries are about made-up words or words that only concern a small community. In contrast, many dictionaries require that included headwords should be attested (i.e. have widespread use). These observations suggest that many definitions in UD may not be familiar to people. To quantify this, we asked crowdworkers whether they were familiar with the meaning of the word. The majority of the entries in UD were not familiar to the crowdworkers. Examples are common headwords with an uncommon meaning such as coffee defined as 'a person who is coughed upon' or shipwreck 'The opposite of shipmate. A crew member who is an all round liability and as competent as a one legged man in an arse kicking competition', as well as uncommon headwords and uncommon meanings (e.g. Once-A-Meeting defined as 'An annoying gathering of people for an hour or more once every pre-defined interval of time (e.g. once a day). Once-A-Meetings could easily be circumvented by a simple phone call or e-mail but are instead used to validate a project managers position within the company.'). Figure 10 shows that in higher frequency bins, more definitions are marked as being familiar, suggesting that the number of definitions per headword is indeed related to the general usage of a headword.

Formality
The focus of UD on slang words [33] means that many of the words are usually not appropriate in formal conversations, like a formal job interview. To quantify this, we asked crowdworkers whether the word in the described meaning can be used in a formal conversation. As figure 11 shows, most of the words in their described meanings were indeed not appropriate for use in formal settings.

Offensiveness
Online platforms with user generated content are often susceptible to offensive content, which may be insulting, profane and/or harmful towards individuals as well as social groups [34,35]. Furthermore, the existence of such content in platforms could signal to other users that such content is acceptable and impact the social norms of the platform [36]. As a response, various online platforms have integrated different mechanisms to detect, report and remove inappropriate content. In contrast, regulation is minimal in UD and one of its characteristics is its often offensive content. UD not only contains offensive entries describing the meaning of offensive words, but there are also offensive entries for non-offensive words (e.g. a definition describing women as 'The root of all evil'). We note, however, that UD also contains non-offensive definitions for offensive words (e.g. asshole defined as 'A person with no concept of boundaries, respect or common decency'). To investigate how offensive content is distributed in UD, we ran a crowdsourcing task on CrowdFlower (see Data and methods for more details). Workers were shown three definitions for the same headword, which they had to rank from the most to the least offensive. We only included headwords with at least three definitions. In total, we obtained annotations for 1322 headwords and thus 3966 definitions. Out of these 1322 headwords there are 326 headwords for which the majority of the workers agreed that none of the definitions were offensive. Table 4 reports the offensiveness scores separated by whether the definitions describe a meaning, opinion or both. A one-way ANOVA test indicates a marginally significant difference (F(2, 3963) = 2.766, p < 0.1). A post hoc comparison using the Tukey test indeed indicates a marginally significant difference between the scores of definitions describing a meaning and those describing an opinion (p < 0.1).
Thus, definitions stating an opinion tend to be ranked as more offensive compared to definitions describing a meaning. Table 5 reports the offensiveness scores by formality. Definitions for words that were annotated as not being appropriate for formal settings (based on their described meaning) tend to be ranked as being more offensive. A one-way ANOVA confirms that the differences between the groups are highly significant (F(2, 3963) = 22.72, p < 0.001). Post hoc comparisons using the Tukey test indicate significant differences between the formal and not formal categories (p < 0.001), and between the unclear and not formal categories (p < 0.05). We also find that definitions for which crowdworkers had indicated that they were familiar with the described meaning of the word tended to be perceived as less offensive (table 6, p < 0.001 based on a t-test). We observe the same trends when we only consider definitions that describe a meaning.
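To make the reported test statistics easier to read, the following minimal Python sketch computes the F statistic and degrees of freedom of a one-way ANOVA from scratch. It is an illustration only: the function name and toy scores are hypothetical, the analysis in this study was carried out in R, and the sketch does not compute the p-value (which requires the F distribution).

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of score lists."""
    k = len(groups)                       # number of groups
    n = sum(len(g) for g in groups)       # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]

    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of observations around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)

    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical offensiveness scores for three groups of definitions.
toy_groups = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
f_stat, df_b, df_w = one_way_anova(toy_groups)

# Degrees of freedom for 3966 definitions split into three groups.
paper_df = (3 - 1, 3966 - 3)
```

With 3966 definitions and three groups, the degrees of freedom are (2, 3963), which matches the values reported with the F statistics above.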

Content and popularity
An important feature of UD is the voting mechanism that allows the users to express their evaluation of entries by up or down voting them. For a given headword, entries are ranked according to these votes and the top ranked one is labeled as top definition. The votes thus drive the online visibility of entries, leading to the following implications. First, the top ranked entries are immediately visible when UD is consulted to look up the meaning of a headword. Many users might not browse the additional pages with lower ranked entries. Second, by users expressing their evaluation through votes, social norms are formed regarding what content is valued in UD. UD does not provide clear guidelines on 'what a good definition is'. Various factors could influence the up and down votes an entry receives, including whether the voter thinks the entry is offensive, informative, funny and whether the voter (dis)agrees with the expressed view. In this section, we analyse how characteristics of the content as discussed in the previous section relate to the votes the entries receive. Because the number of up and down votes varies highly depending on the popularity of the headword, we perform the analysis based on the rankings of entries (top ranked, second ranked and random) instead of the absolute number of up and down votes. Only headwords with at least three entries are included. Table 7 shows the distribution of opinion-based versus meaning-based definitions separated by whether the headwords are annotated as proper nouns by the crowdworkers. The proportion of definitions that are annotated as opinions is much higher for proper nouns, which is consistent with our previous analysis. However, among the top ranked definitions for proper nouns, the proportion of opinions is lower (but not significant). Table 8 characterizes the entries by formality and familiarity. 
We discard proper nouns and entries marked as opinion, since it is less clear what formality and familiarity mean in these contexts. We find that the top ranked definitions tend to be more familiar (χ²(2, N = 2991) = 15.385, p < 0.001) and more appropriate for formal settings (but not significant). Table 8 also reports the average offensiveness ranking of the definitions separated by their popularity (again, discarding proper nouns and entries marked as opinions). The difference in rankings between top ranked and second ranked definitions is minimal, but random definitions are more often ranked as being more offensive. A one-way ANOVA test confirms that the differences between the groups are highly significant (F(2, 2988) = 22.07, p < 0.001). Post hoc comparisons using the Tukey test indicate significant differences between the random and top ranked, and random and second ranked definitions (p < 0.001). A similar trend is observed when we consider all definitions (F(2, 3963) = 34.87, p < 0.001). Thus, although UD contains offensive content, very offensive definitions do tend to be ranked lower through the voting system. However, the small difference in scores between the groups indicates that offensiveness only plays a small role in the up and down votes a definition receives.
To analyse the different factors jointly, we fit an ordinal regression model (table 9) using the ordinal R library based on definitions that were annotated as not being an opinion and not describing proper nouns. We find that familiarity and offensiveness indeed have a significant effect. More familiar and less offensive definitions tend to have a higher ranking. Similar trends in coefficients were observed with fitting logistic regression models when dichotomizing the ranking variable.
Table 9. Ordinal regression results. The dependent variable is the ranking: top ranked (0), second ranked (1) or a random rank (2).

Discussion and conclusion
In this article, we have studied a complete snapshot (1999-2016) of UD to shed light on the characteristics of its content. We found that most contributors of UD only added one entry and very few added a high number of entries. Moreover, we found a number of skewed distributions, which need to be taken into account whenever performing analyses on the UD data. Very few headwords have a high number of entries, while the majority have only one entry. Similarly, few entries are highly popular (i.e. they collected a high number of votes). We also found a strong correlation between the number of up and down votes for each entry, illustrating the importance of visibility on the votes an entry receives.
The lexical content of UD is radically different from that of Wiktionary, another crowdsourced, but more highly moderated dictionary. In general, we can say that the overlap between the two dictionaries is small. Considering all unique UD headwords that are not found in Wiktionary, we found that this number is almost three times the number of headwords that uniquely occur in Wiktionary. However, if we exclude words with only one definition in UD (which tend to be infrequent or idiosyncratic words), we found the opposite pattern, with Wiktionary-only headwords amounting to almost three times the UD-only headwords.
Our analyses based on crowd-sourced annotations showed more details on the specific characteristics of UD content. In particular, we measured a high presence of opinion-focused entries, as opposed to the meaning-focused entries that we expect from traditional dictionaries. In addition, many entries in UD describe proper nouns. The crowdworkers were not familiar with most of the definitions presented to them and many words (and their described meaning) were found not to be appropriate for formal settings.
UD captures many infrequent, informal words and it also contains offensive content, but highly offensive definitions tend to get ranked lower through the voting system. The high content heterogeneity in UD could mean that, depending on the goal, considerable effort is needed to filter and process the data (e.g. the removal of opinions) compared to when traditional dictionaries are used. We also found that words with more definitions tended to be more familiar to crowdworkers, suggesting that UD content does reflect broader trends in language use to some extent.
There are several directions of future work that we aim to explore. We have compared the lexical overlap with Wiktionary in terms of headwords. As future work, we plan to extend the current study by performing a deeper semantic analysis and by comparing UD with other non-crowdsourced dictionaries. Furthermore, we plan to extend the current study by comparing the content in UD with language use in social media to advance our understanding of the extent to which UD reflects broader trends in language use.

Urban Dictionary
We crawled UD in July 2016. First, the list of words was collected by crawling the 'browse' pages of UD and by following the 'next' links. After collecting the list of words, the definitions themselves were crawled directly after (between 23 July and 29 July 2016). We did not make use of the API, since the API restricted the maximum number of definitions returned to 10 for each word.

Wiktionary
We downloaded the Wiktionary dump of the English language edition of 20 July 2016, so that the date matched our crawling process. To parse Wiktionary, we made use of code available through ConceptNet 5.2.2 [37]. Pages in the English Wiktionary edition can also include sections describing other languages (e.g. the page about boot contains an entry describing the meaning of boot in the Dutch language ('boat')). We only considered the English sections in this study.

Crowdsourcing
Most headwords in UD have only one entry, and, therefore, these headwords would dominate a random sample. Because such headwords tend to be uncommon, a random sample would not be able to give us much insight into the overall content of UD. We therefore sampled the headwords according to the number of their entries. For each headword (after lower casing), we counted the number of entries and placed the headword in a frequency bin (after taking a log base 2 transformation). For each bin, we randomly sampled up to 200 headwords. For each sampled headword, we included the top two highest scoring entries (scored according to the number of thumbs up minus the number of thumbs down) and another random entry. In total, we sampled 4465 entries (table 10).
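The sampling procedure can be sketched in Python as follows. This is a hedged illustration rather than the authors' implementation: the function names and toy data are hypothetical, and the bin is assumed to be the floor of the base-2 logarithm of the entry count (the exact bin boundaries are given in table 10 and are not reproduced here).

```python
import random
from collections import defaultdict


def frequency_bin(n_entries):
    """Assumed binning: floor(log2(n_entries)), computed exactly for integers."""
    return n_entries.bit_length() - 1


def sample_headwords(entry_counts, per_bin=200, seed=0):
    """Randomly sample up to per_bin headwords from each frequency bin."""
    rng = random.Random(seed)
    bins = defaultdict(list)
    for headword, n in entry_counts.items():
        bins[frequency_bin(n)].append(headword)
    return {b: rng.sample(sorted(hs), min(per_bin, len(hs)))
            for b, hs in bins.items()}


def pick_entries(entries, rng):
    """entries: list of (entry_id, thumbs_up, thumbs_down).

    Returns the two highest scoring entries (up minus down) plus one
    random entry from the remainder, if any.
    """
    ranked = sorted(entries, key=lambda e: e[1] - e[2], reverse=True)
    chosen = ranked[:2]
    if ranked[2:]:
        chosen.append(rng.choice(ranked[2:]))
    return chosen


# Hypothetical toy data.
toy_counts = {"selfie": 76, "loml": 11, "broham": 18, "shaskank": 1}
sampled = sample_headwords(toy_counts, per_bin=200)
picked = pick_entries([("a", 10, 2), ("b", 5, 5), ("c", 7, 1), ("d", 0, 9)],
                      random.Random(0))
```

Capping each bin at the same number of headwords keeps the single-entry headwords from dominating the sample, at the cost of the sample no longer being representative of UD as a whole.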
We collected the annotations using CrowdFlower. The quality was ensured using test questions and by restricting the contributors to quality levels two and three and the countries Australia, Canada, Ireland, New Zealand, UK, and the USA. We marked the crowdsourcing tasks as containing explicit content, so that the tasks were only sent to contributors that accepted to work with such content.

Content type
For each task, we collected three judgements. The workers were paid $0.03 per judgement. We

Agreement
For each definition we have three judgements. We calculate Fleiss' kappa (using the irr package in R) and the pairwise agreement (table 11). The agreement for the first question, asking whether the word is a proper noun, is the highest. In general the agreement is low, due to the difficulty of the task. For example, in these cases all three workers answered differently to the question whether the definition described a meaning or an opinion: AR-15 defined as 'AR does NOT stand for Assault Rifle' and Law School defined as 'Where you go for to school for four years after college to learn to become a lawyer. In these four years, you will work your butt off every day, slog through endless amounts of reading, suffer through so much writing, and after you graduate, you do not get to call yourself "doctor"'. We merge the answers for each question by taking the majority vote. We use 'both' for Q2 and 'unclear' for Q4 if there was no majority.
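The merging rule and the agreement statistic can be illustrated with a short Python sketch. The agreement values in this study were computed with the irr package in R; the code below is a hypothetical stand-alone version (function names and toy labels are assumptions) that follows the standard definition of Fleiss' kappa for a fixed number of raters per item.

```python
from collections import Counter


def majority_vote(labels, fallback):
    """Return the label chosen by more than half of the workers, else fallback."""
    label, n = Counter(labels).most_common(1)[0]
    return label if n > len(labels) / 2 else fallback


def fleiss_kappa(ratings):
    """ratings: list of per-item label lists, each with the same number of raters."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    totals = Counter()      # how often each category was assigned overall
    per_item = []           # observed agreement for each item
    for item in ratings:
        c = Counter(item)
        totals.update(c)
        agree = sum(v * v for v in c.values()) - n_raters
        per_item.append(agree / (n_raters * (n_raters - 1)))
    p_observed = sum(per_item) / n_items
    p_expected = sum((t / (n_items * n_raters)) ** 2 for t in totals.values())
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical judgements for the meaning/opinion question.
merged_split = majority_vote(["meaning", "opinion", "both"], fallback="both")
merged_clear = majority_vote(["meaning", "meaning", "opinion"], fallback="both")

kappa_perfect = fleiss_kappa([["m", "m", "m"], ["o", "o", "o"]])
kappa_mixed = fleiss_kappa([["m", "m", "o"], ["m", "o", "o"]])
```

With three judgements per definition, a three-way split is the only case without a majority for a three-option question, which is when the fallback label applies.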

Offensiveness
We experimented with different pilot setups in which we asked workers to annotate the level and type of offensiveness for individual definitions. However, we found that this led to confusion and disagreement among the crowdworkers. For example, an offensive word can be described in a non-offensive way and a non-offensive word can be described in an offensive way. Furthermore, people have different thresholds of what they consider to be offensive, making it challenging to ask for a binary judgement.
In the final setup, we therefore showed the sampled definitions for the same word and asked workers to rank the definitions according to their offensiveness, with 1 being the most offensive and 3 being the least offensive. Even if workers have different thresholds of what they consider offensive, they could still agree when being asked to rank the definitions. Indeed, we found that this led to a higher agreement. Note that in this article, we have reversed the ratings (3 = most offensive, 1 = least offensive) for a more intuitive presentation of the results. Workers were also asked to indicate whether they considered all definitions equally offensive, equally non-offensive, or none. For each task, we collected five judgements. We paid $0.04 per judgement. We collected 6610 judgements from a total of 158 workers (median number of judgements per worker: 44). Table 12 provides examples for two words (goosed and dad) and their ratings.
Def. 3 to apply pressure on one's taint (or space between genitalia and anus), preferably of the opposite sex!
Def. 2 the parent that takes the most shit. Sure, if you had a shitty father, then go ahead and bitch, but not all of us did. Some of us had great fathers, who really loved us, and weren't assholes. Honestly, if you could see how much damage a mother could do to one's self esteem, you wouldn't even place so much blame on 'dear old dad'

Agreement
We calculate agreement using Kendall's W (also called Kendall's coefficient of concordance), which ranges from 0 (no agreement) to 1 (complete agreement). We calculate Kendall's W for each word separately. The average value of Kendall's W is 0.511 (standard deviation = 0.303). If we exclude words for which a worker indicated that the definitions were equal in terms of offensiveness, the value increases to 0.714 (standard deviation = 0.238).
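For reference, Kendall's W for a single word can be computed as in the Python sketch below. It is a minimal illustration with hypothetical function names and toy rankings; it uses the standard formula without a correction for ties, and whether a tie correction was applied in this study is not stated here. The second helper shows the reversal of the ratings described in the previous section.

```python
def kendalls_w(rankings):
    """rankings: one list of ranks per worker, e.g. [1, 3, 2] for three definitions.

    W = 12 * S / (m^2 * (n^3 - n)), with m workers, n ranked items and
    S the sum of squared deviations of the rank sums from their mean.
    """
    m = len(rankings)
    n = len(rankings[0])
    rank_sums = [sum(r[i] for r in rankings) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((x - mean_sum) ** 2 for x in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


def reverse_rank(rank, n_items=3):
    """Map the collected ranks (1 = most offensive) to reversed scores (3 = most offensive)."""
    return n_items + 1 - rank


# Hypothetical rankings of three definitions.
w_full = kendalls_w([[1, 2, 3]] * 5)               # five workers in full agreement
w_none = kendalls_w([[1, 2, 3], [3, 2, 1]])        # two workers, opposite rankings
reversed_scores = [reverse_rank(r) for r in [1, 2, 3]]
```

Averaging the per-word values of W over all annotated words then gives a single summary figure of the kind reported above.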
Ethics. In this study we employ crowdsourcing to collect annotations. The tasks were marked as containing explicit content, so that the tasks were only visible to contributors that accepted to work with such content. The tasks also explicitly mentioned that the results will be used for scientific research ('By participating you agree that these results will be used for scientific research'). We closely monitored the crowdsourcing tasks and contributor satisfaction was consistently high.
Data accessibility. Despite several attempts to contact Urban Dictionary to confirm their data sharing policies, the authors have not been able to confirm that deposition of our data in a public repository would breach their terms and conditions. Furthermore, owing to these concerns it has not been possible to host the current dataset in a public repository. With this in mind, the authors note that the R analysis code and annotations are available through https://github.com/alan-turing-institute/urban-dictionary-rsos2018. The authors are happy to provide researchers with the original data in case they contact us personally. This statement has been agreed with the journal.