Journal of The Royal Society Interface
Research article

Automated identification of chicken distress vocalizations using deep learning models

Axiu Mao (1), Claire S. E. Giraudet (1,2), Kai Liu (1,3; [email protected]), Inês De Almeida Nolasco (4), Zhiqin Xie (5), Zhixun Xie (5), Yue Gao (6), James Theobald (7), Devaki Bhatta (7), Rebecca Stewart (8) and Alan G. McElligott (1,2; [email protected])

(1) Department of Infectious Diseases and Public Health, Jockey Club College of Veterinary Medicine and Life Sciences, City University of Hong Kong, Hong Kong SAR, People's Republic of China
(2) Centre for Animal Health and Welfare, Jockey Club College of Veterinary Medicine and Life Sciences, City University of Hong Kong, Hong Kong SAR, People's Republic of China
(3) Animal Health Research Centre, Chengdu Research Institute, City University of Hong Kong, Chengdu, People's Republic of China
(4) School of Electronic Engineering and Computer Science, Queen Mary University of London, London, UK
(5) Guangxi Key Laboratory of Veterinary Biotechnology, Guangxi Veterinary Research Institute, 51 North Road You Ai, Nanning 530001, Guangxi, People's Republic of China
(6) School of Computer Science and Electronic Engineering, University of Surrey, Guildford, UK
(7) Agsenze, Parc House, Kingston Upon Thames, London, UK
(8) Dyson School of Design Engineering, Imperial College London, London, UK

    Abstract

    The annual global production of chickens exceeds 25 billion birds, which are often housed in very large groups numbering thousands. Distress calling triggered by various sources of stress has been suggested as an ‘iceberg indicator’ of chicken welfare. However, to date, the identification of distress calls largely relies on manual annotation, which is very labour-intensive and time-consuming. Thus, a novel convolutional neural network-based model, light-VGG11, was developed to automatically identify chicken distress calls using recordings (3363 distress calls and 1973 natural barn sounds) collected on an intensive farm. The light-VGG11 was modified from VGG11 with significantly fewer parameters (9.3 million versus 128 million) and 55.88% faster detection speed while displaying comparable performance, i.e. precision (94.58%), recall (94.89%), F1-score (94.73%) and accuracy (95.07%), making it more suitable for deployment in practice. To further improve light-VGG11's performance, we investigated the impacts of different data augmentation techniques (i.e. time masking, frequency masking, mixed spectrograms of the same class and Gaussian noise) and found that they could improve distress call detection by up to 1.52%. Our demonstration of distress call detection on continuous audio recordings shows the potential for developing technologies to monitor the output of this call type in large, commercial chicken flocks.

    1. Introduction

    Vocalizations can be used to infer whether animals are experiencing positive or negative states [1,2]. Chickens (Gallus gallus domesticus) have a variety of vocalizations associated with different states, including pleasure and distress [3,4]. Distress vocalizations made by young chickens are repetitive, high-energy and relatively loud calls [5]. Importantly, the output of distress calls from commercial broiler chicken flocks (e.g. approximately 25 000 birds together in one location) can be used to predict growth rates and mortality levels during the production cycle [5]. However, to date, the process of assessing the number of distress calls produced in large-scale recordings largely relies on manual annotations, which is labour-intensive, time-consuming and prone to subjective judgements of individuals. Thus, it is essential to develop new automated, objective and cost-effective methods for identifying and quantifying distress vocalizations, against a background of other vocalizations and noises that are usually contained in the audio recordings.

    Despite research highlighting the potential for automated monitoring of vocalizations as a means to assess and monitor animal welfare states [6], progress has been slow. In chickens, most methods have focused on detecting issues associated with respiratory diseases or measuring growth [7–10]. However, given the links between emotional states and types of vocalizations, and recent advances in machine learning applied to audio data [3,11–15], we hypothesized that automated detection of chicken distress calls would be feasible.

    The use of bioacoustic methods as non-invasive techniques for monitoring health and welfare is becoming more widespread [12,13]. To distinguish healthy birds from those with infectious bronchitis, a support vector machine (SVM) applied to audio recordings reached 97.85% accuracy [10]. Sneezing is a clinical sign of many respiratory diseases, and it has been monitored at a group level through a linear discriminant analysis of chicken sound signals [16]. Du et al. [17] were the first to show a significant correlation between specific vocalizations (alarm and squawk calls) and a thermal comfort index (the temperature-humidity index), using an SVM-based hen vocalization detection algorithm. However, manual feature extraction in machine learning relies on domain expertise and easily causes feature engineering issues [18]. Furthermore, the separation of feature extraction and classification makes the process unsuitable for online and real-time analysis of large datasets [19].

    Recently, deep learning classification models have proven highly accurate and reliable in bioacoustics [15,20–22]. In particular, convolutional neural networks (CNNs) have been used predominantly to process image data and other image-like data, such as Mel spectrograms, the most common input for deep learning models [12]. A spectrogram visually represents signal strength over time at the different frequencies present in a specific waveform [23]. In practice, CNN-based models, including some existing CNN architectures (e.g. AlexNet, VGG-16 and ResNet50), have been used to recognize a wide variety of vocalizations (e.g. pecking activity of group-housed turkeys, alarm calls of laying hens and cough sounds of cattle) using audio spectrograms [21,24,25].

    In real-world scenarios, the available training data are often rather limited, either because of the enormous manual effort required to collect and annotate data or because large amounts of data are difficult to acquire in some cases [25]. Data augmentation offers an elegant solution for expanding dataset sizes [26]. It artificially enlarges the data representativeness through label-preserving transformations that simulate real-world scenarios containing many different kinds of noise [27]. The data augmentation techniques commonly used for image data comprise random cropping, rotation, flipping and scaling [26]. In bioacoustics, where images are obtained from a time-frequency representation of the sound signal, data augmentation techniques must be specifically devised for this context [27]. For bird audio detection using CNN-based models, time shift, frequency shift and time reversal were randomly applied to the spectrograms during training to reduce overfitting [28]. Recently, an augmentation scheme named SpecAugment, which acts exclusively on the audio spectrogram, has been validated to enhance the performance of end-to-end networks on public speech datasets [29,30].

    Herein, we exploited deep learning models to classify audio recordings of large groups of broiler chickens housed indoors on a commercial farm, to investigate the potential of deep learning for automatically identifying chicken distress calls. To automatically classify distress calls and other natural background sounds, we developed a novel CNN-based model called light-VGG11, modified from VGG11 with significantly fewer parameters (9.3 million versus 128 million). In addition, to improve the generalization capability of the proposed model, data augmentation was explored as a means of increasing the diversity of the limited training dataset. To the best of our knowledge, this is the first time deep learning methods have been used to identify chicken distress calls from audio recordings.

    2. Methods

    2.1. Data collection

    Recordings were collected in production facilities owned by Lengfeng Poultry Ltd, in Guangxi province, People's Republic of China, in November and December 2017 and in November 2018. Chickens (a mix of Chinese ‘spotted’ and ‘three-yellow’ breeds) were kept in stacked cages (three cages per stack, with 13–20 individuals per cage), with approximately 2000 to 2500 birds per house, as shown in figure 1. The microphone was positioned approximately 2 m above the floor, mounted on the top cage of a stack in the middle of the barn. This placement was chosen to ensure that the recording devices did not interfere with the farm staff cleaning and maintaining the barns. All recordings were sampled at 22.05 kHz with 16-bit depth throughout the natural life cycle (0–35 days) of the chickens using a portable recorder (Zoom H4n Pro, Zoom Corporation, Tokyo, Japan). The recordings were collected from two different chicken flocks at the same farm; one flock was used as the development dataset to train the model, and the other was used as a continuous testing dataset to verify the model's generalization capability.

    Figure 1. The layout of the chicken house and microphone placement (84 m length, 11.3 m width and 3.2 m height).

    2.2. Data annotation

    Data were annotated using the Sonic Visualiser software (Sonic Visualiser v. 4.3, Centre for Digital Music, Queen Mary University of London, UK) [31]. The audio taxonomy consisted of distress calls, work sounds and natural barn sounds. Distress calls were the loudest calls that could be heard, and an audio example (10 s duration) can be found in the electronic supplementary material. Work sounds were defined as any noises made by the farm staff while cleaning the barn or carrying out routine husbandry for the birds. Natural barn sounds were the natural sounds of the barn when there were neither distress calls nor work sounds inside the barn. Calls such as pleasure chirps and fear trills were not annotated because of their relatively low amplitude compared with distress calls [32]. Contact calls were not included because they are produced nearly continually by the animals and are not indicative of welfare states, e.g. emotions, appropriate shelter and disease prevention [32]. Distress calls and natural barn sounds were the sounds of interest, and examples of their annotation are shown in figure 2. The labels and timestamps of these two sound types were then exported to CSV files for further processing.

    Figure 2. An example of how the data were annotated. The squares in red highlight the distress calls, while the square in white shows what would be termed ‘natural barn sounds’ (the absence of work sounds and distress calls).

    2.3. Automatic classification

    We developed an audio classification method based on deep learning, i.e. light-VGG11 (see §2.3.2), and the development dataset (figure 3). First, recordings in the audio files were chunked into non-overlapping segments of 1 s to produce a set of 5336 samples, consisting of 3363 distress calls and 1973 natural barn sounds. Segments shorter than 1 s were excluded from the dataset. These raw audio segments were transformed into log-Mel spectrograms and then image-wise normalized to the range [0, 1]. After that, we split the normalized spectrograms into training, validation and test sets using a fivefold cross-validation scheme, which helps avoid overfitting and promotes the model's robustness. The training and validation sets were used to train the model, and the test set was used to test it. The average of the five test results was used to evaluate the model's performance.

    Figure 3. The overall flow chart of the audio classification method. (a) Raw waveform signals. (b) Log-Mel spectrograms. (c) Fivefold cross-validation. (d) The architecture of light-VGG11. (e) Predictions.

    2.3.1. Input preparation

    To represent audio signals as images for use with CNNs, log-Mel spectrograms, a typical and effective feature representation in sound recognition [24,29,33,34], were used. We first applied a short-time Fourier transform with a Hann window (length of 2048 points) and 75% overlap (hop length of 512 points) to extract the magnitude spectrum [22,27,35], generating 87 temporal frames for each sample. Eighty-eight log-Mel scale band filters were then applied to the magnitude spectrum. Thus, a log-Mel spectrogram (88 × 87) was obtained as the two-dimensional image input. The horizontal and vertical axes correspond to time (s) and frequency (Hz), respectively, and the intensity of each pixel represents the amplitude (dB) of the frequency component at a particular time. Herein, the reference value for calculating the log-Mel spectrogram in dB was set to 1.0. The Librosa library [36] was used for this feature extraction. Examples of normalized spectrogram images are presented in figure 4. The covered frequency range extends from 0 Hz to the Nyquist frequency of 11 025 Hz. The main high-frequency content of both natural barn sounds and distress calls lay between 2048 Hz and 4096 Hz, and the two sound types were distinguishable. Indeed, regular and repeated patterns occur in distress calls, as highlighted in the black boxes (figure 4b), whereas we did not find any in natural barn sounds (figure 4a).
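    As a minimal sketch of this preprocessing (not the authors' released code), the following Python snippet uses Librosa with the parameters stated above (2048-point Hann window, hop length 512, 88 Mel bands, dB reference of 1.0); the file name and the exact min-max normalization are illustrative assumptions.

```python
import numpy as np
import librosa

def log_mel_spectrogram(clip, sr=22050, n_fft=2048, hop_length=512, n_mels=88):
    """Convert a 1 s audio clip to an image-wise normalized log-Mel spectrogram."""
    mel = librosa.feature.melspectrogram(
        y=clip, sr=sr, n_fft=n_fft, hop_length=hop_length,
        n_mels=n_mels, window="hann")
    log_mel = librosa.power_to_db(mel, ref=1.0)   # dB relative to a reference of 1.0
    # image-wise normalization to [0, 1]
    return (log_mel - log_mel.min()) / (log_mel.max() - log_mel.min() + 1e-8)

# Illustrative usage: chunk a recording into non-overlapping 1 s segments,
# discarding any trailing segment shorter than 1 s (as described in §2.3).
audio, sr = librosa.load("recording.wav", sr=22050)   # hypothetical file
segments = [audio[i:i + sr] for i in range(0, len(audio) - sr + 1, sr)]
spectrograms = np.stack([log_mel_spectrogram(s, sr) for s in segments])
```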

    Figure 4. Log-Mel spectrograms samples of (a) natural barn sounds and (b) distress calls.

    2.3.2. Neural network structure

    CNNs with very small convolution filters (3 × 3), which effectively increase network depth and aggregate more complex patterns, have been highly successful in large-scale image classification [37]. This study adopted two types of popular CNNs, i.e. VGG Nets [37] and ResNets [38], for time-frequency representation learning. Preliminary results revealed that VGG11 displayed a favourable performance (table 1). The VGG11 network, which consists of 11 weight layers, i.e. eight convolutional layers and three fully connected layers, has been widely used in various classification tasks [33,39]. However, the fully connected layers lead to an explosion in the number of parameters (approximately 128.8 million), posing a challenge to computational efficiency in real-world applications. Thus, we modified its structure to obtain a lightweight variant, named light-VGG11 (figure 3d). We removed the last three fully connected layers and replaced them with two 1 × 1 convolutional layers followed by an average-pooling layer. The 1 × 1 convolutional layers enabled weight sharing among the pixels in the feature map, so the parameters of a 1 × 1 convolutional kernel depend only on the channel numbers of the adjacent layers, which effectively reduces the number of model parameters (to 9.3 million here) [38,40]. The operation is as follows:

    $$\mathbf{Z}' = \mathrm{Conv}^{1}_{1\times 1}\left(\mathrm{ReLU}\left(\mathrm{Conv}^{128}_{1\times 1}(\mathbf{Z})\right)\right) \tag{2.1}$$
    $$\mathbf{Z}_{\mathrm{out}} = \mathrm{GAP}(\mathbf{Z}') \tag{2.2}$$
    where $\mathbf{Z}$ and $\mathbf{Z}'$ denoted the feature maps after the last max-pooling layer and the last convolutional layer, respectively, and $\mathbf{Z}_{\mathrm{out}}$ denoted the output. $\mathrm{Conv}^{128}_{1\times 1}(\cdot)$ and $\mathrm{Conv}^{1}_{1\times 1}(\cdot)$ represented 1 × 1 convolution operations with 128 filters and one filter, respectively. $\mathrm{ReLU}(\cdot)$ denoted the rectified linear unit activation function [41], and $\mathrm{GAP}(\cdot)$ denoted the global average-pooling operation. In addition, each convolution operation was followed by a batch normalization operation. A sigmoid function was further applied to map the output to a probability between 0 and 1.
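    The following PyTorch sketch illustrates this design (it is not the authors' released code, which is linked under 'Data accessibility'): the classification head follows equations (2.1) and (2.2), while the use of torchvision's VGG11 (batch-norm) backbone and the adaptation of the first convolution to single-channel spectrogram input are assumptions on our part.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg11_bn

class LightVGG11(nn.Module):
    """Sketch of light-VGG11: the VGG11 convolutional backbone with the three
    fully connected layers replaced by two 1x1 convolutions and global average pooling."""
    def __init__(self):
        super().__init__()
        self.features = vgg11_bn().features          # eight conv layers, 512 output channels
        # Adapt the first convolution to single-channel spectrogram input (an assumption;
        # the paper only states that first-layer pre-trained weights were not loaded).
        self.features[0] = nn.Conv2d(1, 64, kernel_size=3, padding=1)
        self.head = nn.Sequential(
            nn.Conv2d(512, 128, kernel_size=1),      # Conv^128_{1x1}
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 1, kernel_size=1),        # Conv^1_{1x1}
            nn.BatchNorm2d(1),
            nn.AdaptiveAvgPool2d(1),                 # global average pooling (GAP)
        )

    def forward(self, x):                            # x: (batch, 1, 88, 87) spectrogram image
        z = self.features(x)                         # feature maps after the last max pooling
        z_out = self.head(z)                         # (batch, 1, 1, 1)
        return torch.sigmoid(z_out.flatten(1))       # probability of 'distress call'
```

    ImageNet-pretrained weights can then be copied into all layers except the first and last, as described in §2.6.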

    Table 1. Comparative results between different models. The best results for each evaluation metric are highlighted in italics.

    network^a precision (%) recall (%) F1 score (%) accuracy (%) parameters (M) speed (ms)
    machine learning
     naive Bayes 72.01 73.40 72.14 73.03 <1.0 6.74
     DT 75.18 75.39 75.23 76.84 <1.0 6.72
     SVM 84.13 84.77 84.39 85.33 <1.0 7.35
     RF 85.84 85.88 85.85 86.83 <1.0 6.74
     GB 87.40 87.64 87.51 88.33 <1.0 6.73
    deep learning
     ResNet18 94.44 94.42 94.43 94.81 11.2 37.32
     ResNet34 94.33 94.42 94.38 94.75 21.3 75.96
     VGG11 95.16 95.43 95.29 95.60 128.8 84.66
     VGG16 94.79 95.13 94.95 95.28 134.3 129.75
     VGG19 94.91 95.24 95.07 95.39 139.6 151.52
     light-VGG11 94.58 94.89 94.73 95.07 9.3 54.31

    ^a DT, decision tree; SVM, support vector machine; RF, random forest; GB, gradient boosting.

    2.4. Data augmentation

    To artificially increase the diversity of the training dataset, inspired by previous research [30,42,43], we separately evaluated the impact of four different data augmentation strategies on the model's performance: time masking, frequency masking, mixed spectrograms of the same class (SpecSameClassMix) and Gaussian noise. Each method, when chosen, was randomly applied during training to each input image with a probability of 0.5. Figure 5 visualizes spectrograms separately passed through these four augmentation methods. The details of these methods are as follows (an illustrative code sketch is given after the list):

    (1)

    Time masking. Time masking refers to randomly masking part of the spectrogram along the time axis with a specific probability. It can effectively reduce the interdependence of features appearing in the same image, thereby mitigating the information loss caused by interruptions in data collection due to hardware devices. In our implementation, we adopted the time masking method proposed by Park et al. [30], i.e. two time masks with a maximum adaptive size ratio (p_S = 0.2), where the maximum mask size was set to p_S times the time dimension of the spectrogram.

    (2)

    Frequency masking. Similar to time masking, we adopted the frequency masking method proposed by Park et al. [30] to enhance the model's robustness, i.e. three frequency masks with parameter F of 15.

    (3)

    SpecSameClassMix. Inspired by the SpectrogramSameClassSum and Mixup methods [26,44], a new method was created here, namely SpecSameClassMix. Each mixed sample z was generated as z = r × x + (1 − r) × y, where x was the original sample, y was a sample randomly selected from the same class and r was chosen from a uniform distribution in the range [0, 1] [45].

    (4)

    Gaussian noise. Gaussian noise addition, a common data augmentation method, has proven effective in some classification tasks [46,47]. The training data were augmented by adding noise generated from a Gaussian distribution with a mean of zero and a variance of σ². Herein, σ was set to 0.03 [46].
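    As a rough sketch under the parameter settings stated above (the random application with probability 0.5 is left to the caller, and the masked regions are assumed to be set to zero on the normalized spectrogram), the four transformations could be implemented as follows.

```python
import numpy as np

rng = np.random.default_rng()

def time_mask(spec, p_s=0.2, n_masks=2):
    """Zero out up to p_s * T consecutive frames, n_masks times (T = time dimension)."""
    spec = spec.copy()
    max_w = int(p_s * spec.shape[1])
    for _ in range(n_masks):
        w = rng.integers(0, max_w + 1)
        t0 = rng.integers(0, spec.shape[1] - w + 1)
        spec[:, t0:t0 + w] = 0.0
    return spec

def frequency_mask(spec, f=15, n_masks=3):
    """Zero out up to F consecutive Mel bands, n_masks times."""
    spec = spec.copy()
    for _ in range(n_masks):
        w = rng.integers(0, f + 1)
        f0 = rng.integers(0, spec.shape[0] - w + 1)
        spec[f0:f0 + w, :] = 0.0
    return spec

def spec_same_class_mix(spec, other_same_class):
    """SpecSameClassMix: z = r*x + (1-r)*y with r ~ U(0, 1), y drawn from the same class."""
    r = rng.uniform(0.0, 1.0)
    return r * spec + (1.0 - r) * other_same_class

def gaussian_noise(spec, sigma=0.03):
    """Additive Gaussian noise with zero mean and standard deviation sigma."""
    return spec + rng.normal(0.0, sigma, size=spec.shape)
```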

    Figure 5. The original log-Mel spectrogram and its four transformed versions. (a) Original input. (b) Input with time masking. (c) Input with frequency masking. (d) Input with SpecSameClassMix. (e) Input with Gaussian noise.

    2.5. Evaluation metrics

    The comprehensive performance of the audio classification model was indicated by the four commonly used evaluation metrics [24,48], i.e. precision, recall, F1 score and accuracy:

    $$\text{Precision} = \frac{TP}{TP+FP} \times 100\% \tag{2.3}$$
    $$\text{Recall} = \frac{TP}{TP+FN} \times 100\% \tag{2.4}$$
    $$\text{F1 score} = \frac{2TP}{2TP+FP+FN} \times 100\% \tag{2.5}$$
    $$\text{Accuracy} = \frac{TP+TN}{TP+TN+FP+FN} \times 100\% \tag{2.6}$$
    where TP, FP, TN and FN are the numbers of true positives, false positives, true negatives and false negatives, respectively. For the binary classification, we calculated these four metrics for each category separately and then took the macro average to reduce the impact of class imbalance.
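    A minimal sketch of this macro-averaged computation, assuming scikit-learn (the paper does not state which library was used for the metrics):

```python
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

def evaluate(y_true, y_pred):
    """Macro-averaged precision, recall and F1, plus overall accuracy, in per cent."""
    return {
        "precision": 100 * precision_score(y_true, y_pred, average="macro"),
        "recall":    100 * recall_score(y_true, y_pred, average="macro"),
        "f1":        100 * f1_score(y_true, y_pred, average="macro"),
        "accuracy":  100 * accuracy_score(y_true, y_pred),
    }
```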

    2.6. Implementation details

    To evaluate the effectiveness of deep learning models, we first tested five conventional machine learning methods, i.e. naive Bayes (NB), decision tree (DT), SVM, random forest (RF) and gradient boosting (GB). During feature extraction, Mel-frequency cepstral coefficients with 12 filter frequency bins were calculated from the audio files of the distress calls and natural barn sounds [12]. These features were then scaled by removing the mean and scaling to unit variance, and reshaped into one-dimensional arrays for the final classification. The experiments with these five machine learning methods were conducted on a CPU (i7-8665U, 1.90 GHz).
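    A rough sketch of such a baseline is given below; Librosa for MFCC extraction and scikit-learn for scaling and classification are assumptions, and the ordering of scaling and reshaping is simplified here.

```python
import numpy as np
import librosa
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def mfcc_features(clip, sr=22050, n_mfcc=12):
    """Extract 12 MFCCs per frame and flatten them into a one-dimensional vector."""
    return librosa.feature.mfcc(y=clip, sr=sr, n_mfcc=n_mfcc).flatten()

# clips: list of 1 s audio arrays; labels: 1 = distress call, 0 = natural barn sound
# X = np.stack([mfcc_features(c) for c in clips])
# model = make_pipeline(StandardScaler(), GradientBoostingClassifier())
# model.fit(X, labels)
```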

    To obtain quick convergence and improve the model's robustness, we pre-trained all selected deep learning models on the public ImageNet dataset. The pre-trained parameters, except those of the first and last layers, were loaded and then fine-tuned on the current dataset. During training, binary cross-entropy was chosen as the loss function, and an Adam optimizer with an initial learning rate of 1 × 10^−4 was employed. The number of epochs and the batch size were set to 100 and 128, respectively. To promote the model's robustness, we applied fivefold cross-validation. The training/validation/test sets were split in a 3 : 1 : 1 ratio in each fold. The training set was used to train the model, and the trained model with the highest accuracy on the validation set was saved as the best model. We then used the best model to obtain the model's performance on the test set of each fold. Finally, the test results of the five folds were averaged to evaluate the model's performance. All deep learning experiments were executed using the PyTorch framework on an NVIDIA Tesla V100 GPU.
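    A condensed sketch of this training configuration is shown below; the loss, optimizer, learning rate, epochs and batch size follow the values stated above, while `train_loader`, `val_loader` and `evaluate_accuracy` are hypothetical helpers.

```python
import torch
import torch.nn as nn

# train_loader / val_loader are assumed PyTorch DataLoaders over (spectrogram, label)
# pairs with a batch size of 128; LightVGG11 is the sketch from §2.3.2.
model = LightVGG11()
criterion = nn.BCELoss()                                    # binary cross-entropy
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # initial learning rate 1e-4

best_val_acc = 0.0
for epoch in range(100):                                    # 100 epochs
    model.train()
    for specs, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(specs).squeeze(1), labels.float())
        loss.backward()
        optimizer.step()

    val_acc = evaluate_accuracy(model, val_loader)          # hypothetical helper
    if val_acc > best_val_acc:                              # keep the best model on the validation set
        best_val_acc = val_acc
        torch.save(model.state_dict(), "best_light_vgg11.pt")
```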

    We measured the detection speeds of all selected models uniformly on the CPU (i7-8665U, 1.90 GHz). Specifically, we made a prediction on a single sample and measured the detection time simultaneously. We repeated this operation for each sample in the test dataset and then averaged the per-sample detection times as the value used for comparison.
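    A per-sample timing loop of this kind could look as follows (a sketch only; `test_set` and the millisecond conversion are assumptions).

```python
import time
import torch

model.eval()
times_ms = []
with torch.no_grad():
    for spec, _ in test_set:                        # one sample at a time, on the CPU
        start = time.perf_counter()
        _ = model(spec.unsqueeze(0))                # add a batch dimension of 1
        times_ms.append((time.perf_counter() - start) * 1000.0)

print(f"average detection speed: {sum(times_ms) / len(times_ms):.2f} ms per sample")
```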

    3. Results

    3.1. Comparison of different models

    We present the comparative results of the light-VGG11 against machine learning methods (i.e. NB, DT, SVM, RF and GB) and two types of deep learning methods (i.e. ResNets and VGG Nets) in table 1. Note that data augmentation was excluded here, as we only aimed to evaluate differences between network architectures. As shown in table 1, the machine learning models obtained much lower values than the deep learning algorithms on all metrics, despite having far fewer parameters and faster detection speeds. This demonstrates the promising ability of deep learning to identify chicken distress calls. Among the deep learning models, the VGG Nets performed better than the ResNets, with VGG11 showing the best performance: precision, recall, F1 score and accuracy of 95.16%, 95.43%, 95.29% and 95.60%, respectively. However, the parameter counts of the VGG Nets, which exceed one hundred million, are significantly larger than those of the other networks. This requires more computation and reduces operational efficiency, posing a challenge for online and real-time analysis of large datasets in practice. By contrast, ResNet18 had the fastest detection speed (37.32 ms), 126.85% faster than VGG11. Our light-VGG11 had 92.78% fewer parameters (9.3 million) and 55.88% faster detection speed than its parent architecture VGG11. In addition, it is worth noting that light-VGG11 obtained a comparable performance to VGG11, with precision, recall, F1 score and accuracy of 94.58%, 94.89%, 94.73% and 95.07%, respectively. This indicates that our light-VGG11 achieved a good trade-off between classification performance and computational cost.

    3.2. Comparison of different models with added data augmentation

    We implemented the network architectures of ResNet18, VGG11 and light-VGG11 and applied the same data augmentation methods to the training set. The comparative results of these models combined with data augmentation are shown in table 2. The light-VGG11 combined with data augmentation achieved a favourable performance improvement, demonstrating the capability of data augmentation to enrich the limited training dataset for this task. Specifically, Gaussian noise gave the best performance, with precision, recall, F1 score and accuracy of 95.40%, 95.81%, 95.60% and 95.88%, respectively. Time masking also enabled our model to obtain strong performance, with precision, recall, F1 score and accuracy of 95.49%, 95.11%, 95.30% and 95.63%, respectively. Frequency masking and SpecSameClassMix also exhibited improvements of different degrees compared with not using data augmentation.

    Table 2. Comparison between different deep learning networks (ResNet18, VGG11 and light-VGG11) combined with various data augmentation strategies. The best results for each metric are highlighted in italics.

    network Data Aug^a precision (%) recall (%) F1 score (%) accuracy (%)
    ResNet18 TM 94.56 94.55 94.55 94.92
    FM 94.52 94.46 94.50 94.87
    Spec_SCS 92.29 93.51 93.88 94.34
    GN_A 95.09 94.38 94.71 95.11
    VGG11 TM 95.72 95.13 95.41 95.75
    FM 94.39 94.98 94.67 95.00
    Spec_SCS 94.63 94.92 94.77 95.11
    GN_A 95.44 95.48 95.46 95.77
    light-VGG11 TM 95.49 95.11 95.30 95.63
    FM 95.25 94.79 95.01 95.37
    Spec_SCS 94.88 94.82 94.85 95.20
    GN_A 95.40 95.81 95.60 95.88

    ^a Data Aug, data augmentation; TM, time masking; FM, frequency masking; Spec_SCS, SpecSameClassMix; GN_A, Gaussian noise.

    Our light-VGG11 always displayed better performance than ResNet18, whichever data augmentation technique was applied. In comparison with VGG11, our light-VGG11 showed superior classification performance, with increments of 0.86%, 0.34% and 0.37% in precision, F1 score and accuracy, respectively, when frequency masking was applied. We observed increases of 0.25%, 0.08% and 0.09% in precision, F1 score and accuracy, respectively, when SpecSameClassMix was applied. Finally, using Gaussian noise led to increases of 0.33%, 0.14% and 0.11% in recall, F1 score and accuracy, respectively. Hence, the light-VGG11 outperformed ResNet18 and VGG11 when data augmentation was applied, reinforcing the suitability of our method for distress call identification using audio recordings in real scenarios, especially in resource-constrained environments.

    3.3. Identification performance analysis

    Among these metrics, recall represents the percentage of samples of each class that were correctly classified. The recall confusion matrices of our light-VGG11 without and with data augmentation are presented in figure 6. The recall values of distress calls improved when data augmentation was applied, which shows that data augmentation effectively enlarged the data representations of distress calls. Specifically, time masking, frequency masking, SpecSameClassMix and Gaussian noise enabled the recalls of distress calls to reach 97.12%, 97.03%, 96.28% and 96.07%, respectively. In addition, it is worth noting that, except for Gaussian noise, which helped the detection of natural barn sounds, the other three data augmentation strategies degraded the recall of natural barn sounds to different degrees. This indicates that more samples of natural barn sounds were misclassified as distress calls when applying time masking, frequency masking or SpecSameClassMix.

    Figure 6. The recall confusion matrix of light-VGG11 without data augmentation (a) and with four data augmentation strategies: time masking (b), frequency masking (c), SpecSameClassMix (d) and Gaussian noise (e).

    3.4. Demonstration

    To verify the generalization performance of our model in real-world scenarios, we selected two continuous audio episodes lasting 10 min each from the continuous testing dataset and used them to test the model's distress call detection capability. Our method achieved favourable detection of distress calls, with precision, recall and F1 score of 88.48%, 85.14% and 86.78%, respectively. We also characterized the total frequency of distress call events of various durations and their cumulative duration within the two 10 min continuous audio episodes (figure 7). Ground truths and predictions showed a similar trend, i.e. chickens produce short-duration distress calls at a higher frequency than long-duration distress calls. In addition, some long-duration distress call clips were detected as multiple shorter-duration clips. For example, the predicted number of 1 s duration distress call events was 29 higher than the actual number, while the predicted numbers of 2, 3 and 4 s duration distress call events were 7, 2 and 2 lower than the actual numbers, respectively. Correspondingly, the predicted cumulative duration was close to the actual cumulative duration for events within 4 s.
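    As an illustration of how such continuous detection can be run, the sketch below chunks a recording into consecutive 1 s clips and classifies each one; the 0.5 decision threshold and the grouping of consecutive positive seconds into call events are assumptions, and `log_mel_spectrogram` refers to the preprocessing sketch in §2.3.1.

```python
import numpy as np
import torch

def detect_distress(audio, sr, model, threshold=0.5):
    """Label each consecutive 1 s clip of a continuous recording as distress (True) or not."""
    model.eval()
    flags = []
    with torch.no_grad():
        for i in range(0, len(audio) - sr + 1, sr):
            spec = log_mel_spectrogram(audio[i:i + sr], sr)                # (88, 87) image
            x = torch.from_numpy(spec).float().unsqueeze(0).unsqueeze(0)   # (1, 1, 88, 87)
            flags.append(model(x).item() > threshold)
    return np.array(flags)  # runs of consecutive True values form distress call events
```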

    Figure 7. Frequency and cumulative time of the distress call events with various durations within two continuous 10 min audio episodes.

    4. Discussion

    In chickens, early-life welfare constraints often predict later-life welfare concerns [49]. The output of distress vocalizations in commercial flocks is linked to growth rates and mortality levels [5]. However, since the annual global production of chickens exceeds 25 billion birds [50], and they are often housed in very large groups numbering thousands, we need automated methods and advanced technologies to monitor the display of distress calls. Herein, we used deep learning classification methods to automatically identify chicken distress calls in audio recordings. Considering practical implementation, we modified the structure of VGG11, which performed best in preliminary tests but possesses a large number of parameters, to obtain a novel light-VGG11, and trained it on log-Mel spectrograms converted from raw audio signals. The results indicated that light-VGG11 exhibited comparable performance with 92.78% fewer parameters and 55.88% faster detection speed than VGG11. Data augmentation was further adopted and shown to improve the classification performance of light-VGG11 without increasing the number of parameters. This illustrates that our method achieves distress call identification at a low computational cost, giving it considerable potential for online and real-time analysis of large datasets in real applications, particularly in scenarios with limited computational resources and power budgets [19,51].

    To the best of our knowledge, our research is the first to exploit deep learning methods to identify chicken distress calls, and it showed superior performance compared with conventional machine learning methods. This is because deep learning models can aggregate more complex and general patterns without domain knowledge. Among the selected deep learning models, the test performance of the ResNets was not as good as that of the VGG Nets, although the short-cut connections in ResNets largely improve the ability of a model to learn highly complex patterns in the data. This may be because the patterns in the current dataset are less complex than those in public datasets such as ImageNet [52]. VGG11 outperformed the deeper VGG Nets, further indicating that favourable performance can be obtained on our dataset simply by using CNN models with shallower layers, as much deeper models tend to overfit. In addition, our light-VGG11 possessed competitive performance with VGG11 while having a significantly smaller parameter size and faster detection speed. This is because the convolutional layers of the original architecture, which effectively help the network extract features, were kept.

    Data augmentation is widely applied to improve a model's generalization ability in animal audio classification [26]. We found that data augmentation performed reasonably well in improving our model's performance, and particularly the detection rate of distress calls, without increasing the parameter size. This is because data augmentation accounts for the large amounts of noise and fluctuation present in real environments, thereby better simulating real-world complexity [46]. In addition, all data augmentation strategies negatively impacted the detection of natural barn sounds except for Gaussian noise, which gave better results. This suggests that Gaussian noise effectively promoted the robustness of the network to environmental noise, which can be attributed to its similarity to the actual noise [53]. Beyond the data augmentation strategies used in our model, more complicated data augmentation methods have been applied directly to audio recordings, such as time stretching, pitch shifting and mixing multiple audio clips [27,45,54]. However, those techniques scale the dataset several times, since they generate augmented samples before training, taking up a large amount of storage space. In contrast, the techniques used in our work were applied directly to spectrograms and performed on the fly during training only, which avoids storing the transformed images on disk [28].

    Our model showed a satisfactory detection performance (recall of 85.14%) for the distress calls in the two continuous 10 min audio episodes, although this was lower than the performance on the test data containing 1 s sound clips. The main reason is that our algorithm only predicts on 1 s sound clips, and some 1 s distress call clips segmented from the continuous audio contained other sudden noises (e.g. fans and work sounds), which largely degraded detection performance. In addition, some long-duration distress call clips were easily split into multiple shorter-duration detections, indicating that some distress calls were classified as natural barn sounds. This might also be a consequence of distress calls that overlapped with other noises. Thus, one future research direction will be the prediction of audio clips shorter than 1 s to enhance the detection accuracy of distress calls.

    In real-world audio identification scenarios, it is necessary to consider limited computational capabilities and resource constraints, particularly for low-powered devices [43]. Previous studies have tended to reduce computational cost by reducing model size and complexity [22,55], tuning the number of parameters [56], or replacing arithmetic operations with more efficient ones [57]. For instance, Anvarjon et al. [58] applied a lightweight CNN model with fewer layers for speech emotion recognition. A lightweight human action recognition algorithm based on a one-dimensional CNN was shown to outperform previous solutions in energy efficiency, with far fewer parameters (1.3 million) than the more than 88 million of commonly used networks [51]. In this study, our light-VGG11 is more lightweight and better suited to big data analysis and real-time monitoring, since it improves on the original VGG11 by identifying distress calls at a low computational cost. In the future, this method could allow staff to monitor chicken welfare remotely and in real time, promoting earlier husbandry interventions when necessary. It can also reduce analysts' workloads and facilitate the analysis of large datasets, improving the husbandry and management of animals [22].

    We investigated one type of vocalization (i.e. distress call), and the recordings came from one particular field season. In real-world identification, many other specific chicken sounds like alarm and gakel calls may also be regarded as potential welfare indicators [17], and distress calls might be inconsistent between different breeds and welfare scale systems. This will require us to build a larger dataset that incorporates more types of vocalizations from different breeds and production environments in the future. In addition, our model could be integrated into more complete systems with other detection methods to achieve additional functions like identifying the vocal source location [59].

    Overall, our method shows a way to identify chicken distress calls using deep learning combined with bioacoustics, which will allow the development of technologies that can monitor the output of distress calls in large, commercial flocks. We addressed the constraints on computational resources by slightly modifying an existing CNN architecture to reduce overhead, and the difficulty of acquiring real-world datasets by utilising data augmentation to increase dataset diversity. As part of a precision livestock farming system, it is crucial to interpret how the output of chicken distress calls reflects the birds' internal states and external surroundings [60,61]. Thus, by continuously monitoring and analysing chicken calls in different conditions, we can aim to adjust their care in order to try to ensure better welfare.

    Ethics

    This study was reviewed and approved by the Animal Welfare and Ethical Review Board committee of Queen Mary University of London. The research carried out was non-invasive and ethical approval was granted by the Head of Animal Ethics at Queen Mary University of London.

    Data accessibility

    An example of distress calls (10 s duration) is provided as electronic supplementary material [62]. Data for the analyses reported can be accessed via the Figshare Digital Repository: https://doi.org/10.6084/m9.figshare.20049722.

    The source code is available from the Figshare Digital Repository: https://doi.org/10.6084/m9.figshare.20050943.

    Authors' contributions

    A.M.: formal analysis, methodology, software, validation, visualization, writing—original draft and writing—review and editing; C.S.E.G.: visualization, writing—original draft and writing—review and editing; K.L.: formal analysis, methodology, supervision, validation, visualization and writing—review and editing; I.D.A.N.: formal analysis, methodology and writing—original draft; Z.X.: conceptualization, data curation, resources and writing—review and editing; Z.X.: conceptualization, data curation, resources and writing—review and editing; Y.G.: conceptualization and writing—review and editing; J.T.: conceptualization and writing—review and editing; D.B.: conceptualization, funding acquisition and writing—review and editing; R.S.: conceptualization, validation and writing—review and editing; A.G.M.: conceptualization, funding acquisition, project administration, supervision, writing—original draft and writing—review and editing.

    All authors gave final approval for publication and agreed to be held accountable for the work performed therein.

    Conflict of interest declaration

    We declare we have no competing interests.

    Funding

    The research was carried out as part of the LIVEQuest project supported by InnovateUK and BBSRC grant no. 2016YFE01242200.

    Acknowledgements

    We thank Kalpana Chaturvedi, Charlie Ellis, Ben McCarthy, Michael Mcloughlin and María Soledad Mercadal Menchaca for their help, and the farmers for access to their animals.

    Footnotes

    Electronic supplementary material is available online at https://doi.org/10.6084/m9.figshare.c.6051618.

    Published by the Royal Society. All rights reserved.