What can associative learning do for planning?

There is a new associative learning paradox. The power of associative learning for producing flexible behaviour in non-human animals is downplayed or ignored by researchers in animal cognition, whereas artificial intelligence research shows that associative learning models can beat humans in chess. One phenomenon in which associative learning often is ruled out as an explanation for animal behaviour is flexible planning. However, planning studies have been criticized and questions have been raised regarding both methodological validity and interpretations of results. Due to the power of associative learning and the uncertainty of what causes planning behaviour in non-human animals, I explored what associative learning can do for planning. A previously published sequence learning model which combines Pavlovian and instrumental conditioning was used to simulate two planning studies, namely Mulcahy & Call 2006 ‘Apes save tools for future use.’ Science 312, 1038–1040 and Kabadayi & Osvath 2017 ‘Ravens parallel great apes in flexible planning for tool-use and bartering.’ Science 357, 202–204. Simulations show that behaviour matching current definitions of flexible planning can emerge through associative learning. Through conditioned reinforcement, the learning model gives rise to planning behaviour by learning that a behaviour towards a current stimulus will produce high value food at a later stage; it can make decisions about future states not within current sensory scope. The simulations tracked key patterns both between and within studies. It is concluded that one cannot rule out that these studies of flexible planning in apes and corvids can be completely accounted for by associative learning. Future empirical studies of flexible planning in non-human animals can benefit from theoretical developments within artificial intelligence and animal learning.


Introduction
To the amazement of the world, associative learning models used in artificial intelligence (AI) research now achieve human-level skills in video games [1] and beat human masters in the Chinese board game [...] (Corvus corax) [28]. The simulations were found to track key patterns within and between these studies. It is concluded that one cannot rule out that studies of flexible planning in apes and corvids can be accounted for by associative learning. Therefore, associative learning can not only produce human-like behaviour (e.g. [1,2]) but is also a candidate explanation for observations of planning and self-control in non-human animals.

Material and methods
Here I describe our learning model [19], the logic of the two different studies that were used for the simulations, and details of the simulations.

A description of the model
An animal has a behaviour repertoire and it can use its behaviours to navigate in a world of detectable environmental states. A behaviour takes the animal from one state to another. Each state, or stimulus, has a primary reinforcement value that is genetically fixed. These values can be negative, neutral or positive, and they guide learning so that behaviours favouring survival and reproduction are promoted. Animals are assumed to make choices that maximize the total value, and expectations of the value of a future state can develop [19, §2.3]. The model can thus generate goal-directed behaviour (see [35, p. 32] for another discussion of goal-directed behaviour and learning).
In short, the model describes the learning of sequences of behaviour towards stimuli through changes in memory. It includes decision-making that takes memory into account to determine what behaviour should be selected when a given stimulus is perceived. Take for instance learning a single behaviour, such as when a dog learns to give its paw in response to the command 'shake'. Lifting the paw is the behaviour, the command 'shake' and the reward are stimuli. The event sequence to be learned is:
$$S_{\text{command 'shake'}} \to B_{\text{lift paw}} \to S_{\text{reward}}$$
The model collects information about the value of performing behaviours towards different stimuli (or states), and information about the value of different stimuli (or being in specific states) [19]. Learning occurs through updates of two different kinds of memories. These memories correspond to Pavlovian and instrumental learning and are updated after an event sequence like in the dog example, or in general terms the event sequence $S \to B \to S'$. The first kind of memory is a stimulus-response association. We used $v_{S \to B}$ to denote the associative strength between stimulus $S$ and behaviour $B$. In functional terms, $v_{S \to B}$ can be described as the estimated value of performing behaviour $B$ when perceiving stimulus $S$. The second memory stores the value of a stimulus. We used $w_S$ to denote this stimulus value and it is updated according to the value of a subsequent stimulus. In other words, $w_S$ is the conditioned reinforcement value of being in state $S$. These memories are updated according to
$$\begin{aligned} \Delta v_{S \to B} &= \alpha_v \,(u_{S'} + w_{S'} - v_{S \to B}) \\ \Delta w_S &= \alpha_w \,(u_{S'} + w_{S'} - w_S) \end{aligned} \tag{2.1}$$
after experiencing the event sequence $S \to B \to S'$. The stimulus-response association $v_{S \to B}$ is updated according to $u_{S'}$, a primary inborn fixed value of stimulus $S'$, $w_{S'}$, the conditioned reinforcement value of $S'$, and the previously stored stimulus-response association $v_{S \to B}$. With conditioned reinforcement, the value of performing behaviour $B$ when perceiving stimulus $S$ is the sum of the primary and conditioned reinforcement value of stimulus $S'$.
If only the first equation is used and $w$ is excluded, then it represents instrumental stimulus-response learning, that is an instrumental version of the classic Rescorla-Wagner learning model [43,44]. The learning rates $\alpha_v$ and $\alpha_w$ determine the rate at which memory updates take place. For the learning model to generate and select behaviour, a mechanism for decision-making is needed. We used a decision-making mechanism that selects behavioural responses and causes some variation in behaviour through exploration. This specifies the probability of behaviour $B$ in state $S$ as
$$\Pr(S \to B) = \frac{\exp(\beta\, v_{S \to B})}{\sum_{B'} \exp(\beta\, v_{S \to B'})} \tag{2.2}$$
which includes a parameter $\beta$ that regulates the amount of exploration. If $\beta = 0$, all behaviours are equally likely to be selected, without taking estimated values into account. If $\beta$ is large, then the behaviour with the highest estimated value ($v$) will mainly be selected.
rsos.royalsocietypublishing.org R. Soc. open sci. 5: 180778
Let us return to the dog for a practical example. The dog hears the command 'shake', stimulus $S$. If the dog moves its paw upwards, that is performing behaviour $B$, it will receive the reward $S'$. The food reward $S'$ has a primary inborn value $u$. When the dog receives this reward after having responded correctly to the command 'shake', the stimulus-response memory $v_{\text{command 'shake'} \to \text{lift paw}}$ will increase according to the top row in equation (2.1). In addition, the stimulus value $w$ of the command 'shake' will be updated according to the bottom row of equation (2.1). This value $w$ of the command 'shake' will approach the value $u$ of the food reward, and thereby gain reinforcing properties in its own right; it has become a conditioned reinforcer. The conditioned reinforcer can pave the way for learning more behaviours before moving the paw upwards. This can happen because behaviours that result in the dog hearing the command 'shake' can be reinforced.
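The dog example can be traced numerically. The following is a minimal Python sketch, not the published simulation code. It assumes that both memories move a fraction (the learning rate) of the way towards the sum of the primary value u and the stimulus value w of the stimulus that follows, and that behaviours are selected with probabilities proportional to exp(beta * v). Initial memory values of 0, a reward value of 6, the alternative behaviour 'sit' and all function and variable names are illustrative assumptions.

```python
import math

ALPHA_V = 0.2  # learning rate for stimulus-response values (assumed)
ALPHA_W = 0.2  # learning rate for stimulus values (assumed)
BETA = 1.0     # exploration parameter in the decision rule (assumed)


def update(v, w, u, s, b, s_next):
    """Update both memories after experiencing the event sequence S -> B -> S'."""
    # Value of arriving in S': primary value plus conditioned reinforcement value.
    outcome = u.get(s_next, 0.0) + w.get(s_next, 0.0)
    v_old = v.get((s, b), 0.0)
    w_old = w.get(s, 0.0)
    v[(s, b)] = v_old + ALPHA_V * (outcome - v_old)  # stimulus-response memory
    w[s] = w_old + ALPHA_W * (outcome - w_old)       # stimulus value memory


def choice_probabilities(v, s, behaviours):
    """Probability of each behaviour in state S, weighting v-values exponentially."""
    weights = {b: math.exp(BETA * v.get((s, b), 0.0)) for b in behaviours}
    total = sum(weights.values())
    return {b: weight / total for b, weight in weights.items()}


u = {"food reward": 6.0}  # primary value of the reward (illustrative)
v, w = {}, {}
for _ in range(50):       # 50 rewarded repetitions of 'shake' -> lift paw
    update(v, w, u, "command shake", "lift paw", "food reward")

# 'sit' is a hypothetical alternative behaviour that was never reinforced.
probs = choice_probabilities(v, "command shake", ["lift paw", "sit"])
print(round(v[("command shake", "lift paw")], 3))  # approaches u
print(round(w["command shake"], 3))                # the command gains value
print(round(probs["lift paw"], 3))
```

In this sketch the stimulus value of the command converges on the value of the food reward, which is the sense in which the command becomes a conditioned reinforcer.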

Simulating planning studies on great apes and ravens
The simulations of the planning experiments were based on detailed descriptions of the course of events in the two studies where key events were identified. Key events included what behaviours were trained before the tests and towards what objects, and what outcomes resulted from different choices during pretraining and tests. It is important to identify details in these studies [24,28], because test phases included a mix of rewarding and non-rewarding actions. Therefore, both stimulus-response values (v) and stimulus values (w) were expected to change throughout the tests.
To both make the simulations possible and realistic, it was assumed that the animals entered these studies with some necessary everyday skills. It was assumed that the animals had, for example, previously learned to hold objects, how to move between rooms and compartments, where different things were located, and some basic skills regarding how to interact with the experimenters. The apes were for instance ushered out of the test room after choices to later be allowed back into the test room. By ignoring such everyday skills, the simulations and the behaviour descriptions were focused on the unique behaviour sequences the animals had to learn as part of the experiments.
The two studies [24,28] share key features. Before testing started, animals were subjected to pretraining. Here they learned to perform behaviours later scored as correct. Apart from the pretraining of correct behaviours, the raven study [28] also included extinction training. During extinction training, the ravens had the chance to learn that non-functional objects did not result in rewards. The key events in both studies used for scoring correct vs. incorrect choices were forced choice tests. Here the animals were forced to choose between one object they had previously learned could result in a reward, versus other objects that could not be used for later rewards (distractor objects). The ravens learned during extinction training that these distractor objects could not result in rewards. After the forced choice, both studies included a delay, after which the animals were allowed to perform a behaviour using the previously chosen object. If an animal made a correct choice before the delay, it could later use its chosen object to get a reward. If an animal made an incorrect choice before the delay, there were no opportunities for rewarding behaviours after the delay.
The simulations followed the pretraining and test phases of the studies. Comparisons are made with the chance levels of correct choices set by the two studies. Mulcahy & Call [24] expected the apes to choose the correct object by chance 25% of the time (one functional object and three distractor objects). Kabadayi & Osvath [28] expected the ravens to make 25% correct choices by chance in experiments 1 and 2 (one functional object and three distractor objects), and 20% correct choices in experiments 3 and 4 (one functional object, one small reward and three distractor objects). See the simulation scripts for exact descriptions (electronic supplementary material). To make the simulations easier to follow, in-depth descriptions of the two studies are given next.

A description of Mulcahy and Call's study on great apes
These tests were performed with orangutans and bonobos [24]. The study started with pretraining. Here an animal was placed in a test room and trained on two different tool tasks to get a reward from an apparatus. These functional tools will be referred to as functional objects. One task was to choose a tube and insert this tube into an apparatus. The other task was to choose a hook and use this to reach a bottle that could not be reached without having the hook. After pretraining, the animal was subjected to a forced choice test between functional objects and three corresponding non-functional objects (later referred to as distractor objects). But during this forced choice, access to the apparatus containing a reward was blocked. After the choice was made, the animal was ushered away from the test room into a waiting room. Objects not taken by the animal were now cleared from the test room. At this point, there was a delay. After the delay the animal was again allowed into the test room and given access to the apparatus. If a functional object had been chosen in the forced choice test, the animal could now use the object to get a reward, thereby exhibiting the behaviour it had learned during pretraining.
This study included four tests that were slightly different. Tests varied with respect to what tool was the functional object and the duration of delays. In addition, in the last test, the animals did not have to use the tool to get a reward. Note that here, in experiment 4, two new individuals were used and they did not take part in experiments 1, 2 or 3. This last part was of little importance here for reasons mentioned in the Results section. The simulations followed the logic of the study, and here are the details of the key events and delays used in the simulation:
Pretraining: Before tests, all subjects learned to use the functional tools. In two steps, a minimum of three plus eight pretraining trials were allowed for the tube task and a minimum of five pretraining trials were allowed for the hook task.
Experiment 1, tube condition: (1) Forced choice with functional tube and distractor objects (16 trials). (2) After choice go to another room.
The behaviour sequences to learn were the following:
Tube condition: $S_{\text{tube}} \to B_{\text{take tube}} \to S_{\text{apparatus}} \to B_{\text{use tube}} \to S_{\text{reward}}$
Hook condition: $S_{\text{hook}} \to B_{\text{take hook}} \to S_{\text{apparatus}} \to B_{\text{use hook}} \to S_{\text{reward}}$
In both conditions, the apes were never rewarded for choosing the distractor objects.

A description of Kabadayi & Osvath's study on ravens
These tests were performed with ravens [28]. This study started with pretraining. Here an animal was placed in a test room and trained on two different tool tasks to get a reward from an apparatus. As above, functional tools will be referred to as functional objects. One task was to put a stone in an apparatus to get a reward. The other task was to take a bottle cap (called token) and give it to a human. In contrast with the study on apes, before the tests started the ravens were also allowed extinction trials. Here an animal was allowed to interact with the objects that would be present during the forced choice tests, but that could never be used to get rewards (later referred to as distractor objects). After pretraining, the animal was subjected to a forced choice test between a functional object and three distractor objects. After a choice was made, the animal was not allowed to use the functional object for some time. In other words, no reward could be collected immediately after the choice test (with the exception of experiment 4). At this point, there was a delay. After the delay, the animal was allowed to use its chosen object. If a functional object had been chosen in the forced choice test, the animal could now use that object to get a reward, thereby exhibiting the behaviour it had learned during pretraining. This study also included four tests that were slightly different. Tests varied with respect to the number of trials, the duration of delays, and in the last test, the animals did not have to wait before using a functional object to get a reward. It should be noted that in this study, two different rewards were used. One high value reward was used in pretraining and in all experiments. And in experiments 3 and 4, a known reward of little value was used in the forced choice situation alongside the functional tool and the distractor objects. Note that the experiments were not performed in the same order as they were numbered in the published study. 
I have chosen to present the tests in the temporal order in which they were performed (1,3,2,4). The simulations followed the logic of the study, and the key events before and during the experiments were:
Pretraining: Before tests, all subjects learned to use the functional tools. In two steps, a minimum of three plus five pretraining trials were allowed for the tool task and 35 pretraining trials were allowed for the token task.
Extinction trials: In this phase, subjects were allowed to manipulate distractor objects for 5 min without receiving any rewards.
Experiment 1: (1) Forced choice with functional object and distractor objects. 14 trials in tool condition and 12 × 3 trials in token condition. (2) [...]
The behaviour sequences to learn were the following:
Tool condition: $S_{\text{tool}} \to B_{\text{take tool}} \to S_{\text{apparatus}} \to B_{\text{use tool}} \to S_{\text{reward}}$
Token condition: $S_{\text{token}} \to B_{\text{take token}} \to S_{\text{human}} \to B_{\text{give token}} \to S_{\text{reward}}$
The ravens were also taught during an extinction phase that choosing or using distractor objects was never rewarding. This was also the case during all tests, or:
$S_{\text{distractor}} \to B_{\text{pick distractor}} \to S_{\text{no reward}}$
In the self-control phases of the study, the ravens had the opportunity to choose a small reward that was presented alongside the functional object (tool or token) and the distractor objects. Therefore, in experiments 3 and 4, these behaviour sequences were also possible:
Tool condition: $S_{\text{dog kibble}} \to B_{\text{take small reward}} \to S_{\text{small reward}}$
Token condition: $S_{\text{dog kibble}} \to B_{\text{take small reward}} \to S_{\text{small reward}}$

Illustration of memory updates during pretraining
To illustrate how these behaviour sequences are affected by learning, here is an example of memory updates for pretraining in the raven study. The behaviour sequence that developed during pretraining can be described as $S_{\text{tool}} \to B_{\text{take tool}} \to S_{\text{apparatus}} \to B_{\text{use tool}} \to S_{\text{reward}}$, where the value of inserting the stone into the apparatus increased, so that $v_{S_{\text{apparatus}} \to B_{\text{use tool}}} \gg 0$. As the model also includes conditioned reinforcement, the value of the stone itself is updated according to the value of the following stimulus, the large reward. With repeated experiences, the stimulus value ($w$) of $S_{\text{reward}}$ will cause the stimulus value of $S_{\text{tool}}$ to grow. As shown in our description of this model [19], with enough experiences the value of the tool will approximate the value of the large reward. By contrast, the extinction trials with repeated unrewarded experiences of the three distractor objects can be described as $S_{\text{distractor}} \to B_{\text{pick distractor}} \to S_{\text{no reward}}$. This event sequence will cause a reduction in both the associative strength of choosing a distractor, $v_{S_{\text{distractor}} \to B_{\text{pick distractor}}}$, and the conditioned reinforcement value ($w_{\text{distractor}}$) of the distractor. When the first test starts with a forced choice, the ravens' behaviour was influenced by the pretraining with both the stone and the distractors.
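The memory updates described in this illustration can be sketched in a few lines of Python. This is a simplified, deterministic walk-through rather than the published simulation script: every trial is assumed to run through the full rewarded sequence, 40 pretraining trials are used only to make the convergence visible, the initial stimulus value of 0 is an assumption, and all names are hypothetical. The learning rate, initial v-value, reward value and number of extinction experiences follow the values stated in the simulation details.

```python
ALPHA = 0.2      # learning rate for both v and w
U_REWARD = 6.0   # primary value of the large reward
V_INITIAL = 1.0  # initial stimulus-response value
W_INITIAL = 0.0  # initial stimulus value (assumption, not stated in the text)

u = {"reward": U_REWARD}
v = {}
w = {}


def experience(s, b, s_next):
    """Apply both memory updates for one event sequence S -> B -> S'."""
    outcome = u.get(s_next, 0.0) + w.get(s_next, W_INITIAL)
    v_old = v.get((s, b), V_INITIAL)
    w_old = w.get(s, W_INITIAL)
    v[(s, b)] = v_old + ALPHA * (outcome - v_old)
    w[s] = w_old + ALPHA * (outcome - w_old)


# Pretraining: tool -> take tool -> apparatus -> use tool -> reward.
w_tool_history = []
for _ in range(40):  # illustrative number of trials, not the study's
    experience("tool", "take tool", "apparatus")
    experience("apparatus", "use tool", "reward")
    w_tool_history.append(w["tool"])

# Extinction: distractor -> pick distractor -> no reward (five experiences).
for _ in range(5):
    experience("distractor", "pick distractor", "no reward")

print(round(w_tool_history[7], 2), round(w_tool_history[-1], 2))
print(round(v[("apparatus", "use tool")], 2))
print(round(v[("distractor", "pick distractor")], 3))
```

The stimulus value of the tool lags behind that of the apparatus, because it can only grow once the apparatus itself has acquired value, but with enough experiences it approaches the value of the large reward, while the distractor's stimulus-response value shrinks.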

Simulation details
The model above was incorporated in a Python program where learning occurred according to the detailed procedures of the two studies, as defined above, to get estimates of probabilities of choosing the different stimuli, and v- and w-values, throughout the studies. Two kinds of simulations were run. First simulations with the full model were run, and then simulations without stimulus values (w), that is only allowing our version of stimulus-response learning using only the first row in equation (2.1) together with decision-making (equation (2.2)). This was done to explore differences between our model that includes conditioned reinforcement [19] and a version of stimulus-response learning alone [43,44]. That version of stimulus-response learning is identical to the classic Rescorla-Wagner learning rule but in [19] we considered it in terms of an instrumental instead of a Pavlovian setting.
To account for delays, one time step per minute was included in the simulation at times of delay. During these time steps, only a background stimulus was experienced. This is not very important for the sake of memory updates because both stimulus-response and stimulus value memories are long-term memories. That animals remember stimulus-response associations and stimulus values for a very long time was not mentioned in either of the simulated studies [19].
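A small sketch may make the treatment of delays concrete. It assumes that only the memories of stimuli and behaviours that are actually experienced are updated, so time steps with a background stimulus leave the memories for the tool untouched. The stimulus and behaviour names, the stored values and the 15 min delay are hypothetical placeholders, not output from the published simulations.

```python
ALPHA = 0.2      # learning rate
V_INITIAL = 1.0  # initial stimulus-response value

# Memory state after pretraining (illustrative numbers, not simulation output).
v = {("tool", "take tool"): 5.0, ("background", "wait"): V_INITIAL}
w = {"tool": 5.0, "background": 0.0}
u = {"background": 0.0}  # the background stimulus has no primary value


def delay(minutes):
    """One time step per minute in which only the background stimulus is experienced."""
    key = ("background", "wait")
    for _ in range(minutes):
        outcome = u["background"] + w["background"]
        v[key] += ALPHA * (outcome - v[key])
        w["background"] += ALPHA * (outcome - w["background"])


before = (v[("tool", "take tool")], w["tool"])
delay(15)  # hypothetical 15 min delay
after = (v[("tool", "take tool")], w["tool"])
print(before == after)  # the tool memories are not changed by the delay
```

Under these assumptions the length of the delay changes only what has been learned about the background stimulus, which is why delays matter little for the simulated choices.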
The same learning parameters were used in all simulations. All behaviours started with an initial stimulus-response value v = 1, both v- and w-values were updated with learning rate α = 0.2, exploration was set to β = 1, and rewards were set to u = 6 apart from the low value rewards in experiments 3 and 4 in Kabadayi & Osvath [28] that were set to u = 2. Behaviour cost for all behaviours was 0.1 apart from passive responses that were set to 0 (see information for all behaviours and stimulus elements included in simulations in the electronic supplementary material). All simulations were run for 500 subjects and the number of trials followed approximately that of the experiments. That the number of trials did not perfectly match the empirical studies was due to the probabilistic nature of the decision-making equation. The lack of information about initial values of the animals makes exact quantitative comparisons difficult.
Although both the ravens and the apes had rich backgrounds, previously learned behaviour was ignored and initial values were assumed to be the same for distractor objects and functional objects. To be conservative, all associative strengths between behaviours and stimuli were assumed to be equal at the start of the simulations. Kabadayi & Osvath [28] did not calibrate the preferences of ravens with respect to the value of the two different food rewards, so there is no quantitative information about the differences between the rewards available. They stated in the method that the high quality food reward was both larger and more attractive. Exact information about the amount of extinction was lacking from the raven study, therefore it was assumed that the ravens had five extinction experiences with the distractors.
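As a rough illustration of these settings, the parameters can be collected in one place and used to check what the decision rule predicts before any learning has taken place. This is a sketch under stated assumptions: the class and function names are hypothetical, and the exponential decision rule is applied to a bare set of equally valued options rather than to the full stimulus descriptions in the simulation scripts.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Parameters:
    v_initial: float = 1.0       # initial stimulus-response value
    alpha: float = 0.2           # learning rate for v- and w-values
    beta: float = 1.0            # exploration parameter
    u_reward: float = 6.0        # high value reward
    u_small_reward: float = 2.0  # low value reward (raven experiments 3 and 4)
    behaviour_cost: float = 0.1  # cost of active behaviours
    passive_cost: float = 0.0    # cost of passive responses
    n_subjects: int = 500        # simulated subjects per simulation


def initial_choice_probabilities(n_options, params):
    """Choice probabilities when every option has the same initial v-value."""
    weights = [math.exp(params.beta * params.v_initial)] * n_options
    total = sum(weights)
    return [weight / total for weight in weights]


params = Parameters()
four_options = initial_choice_probabilities(4, params)  # 1 functional + 3 distractors
five_options = initial_choice_probabilities(5, params)  # plus 1 small reward
print(four_options[0], five_options[0])
```

With equal initial values, the probability of picking the functional object equals one divided by the number of options, which corresponds to the chance levels of 25% and 20% used in the two studies.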
The behaviours and stimulus elements used in the simulations were as follows:

Data from the empirical studies
To compare the simulation results with the empirical data from the two studies [24,28], averages were calculated from the available data in the two respective studies (see figures in Results). This resulted in the average proportion of correct and incorrect choices in the forced choice tests. Note that experiment 4 in the ape study did not involve any correct behaviour using the tool upon returning to the apparatus after the delay, making this experiment difficult to interpret. In addition, data on choices for experiment 4 were not available in the text, therefore data from [24, fig. S2] was used for that data point. It is unfortunate to mix data this way but I chose this in favour of leaving data from experiment 4 out.

Results
Overall, the simulations matched the results of the two studies, apart from experiment 4 in the ape study [24]. That the use of functional objects was rewarding throughout was sufficient for driving performance well above chance levels (figure 1). In the raven study, rewards delivered during the experiment account well for the near perfect performance in the two final parts of that study. The fit was good between the empirical tests (shown as filled circles in figure 1) and simulations in that functional objects were more likely to be chosen than the distractor objects. The simulations also followed the general trends in that performance increased in the great ape study during experiments 1 and 2 and that performance was reduced in experiment 3. Although the simulations underestimated the performance in the tool condition of the raven study, the simulations closely followed the pattern in that performance was high in experiment 1, decreased in experiment 3, and reached nearly perfect performance in experiment 4. One reason for the simulation to have a lower success rate in the tool condition could be that the ravens were well trained and had rich backgrounds that are helpful in test situations. These birds were raised by humans and interact regularly with humans. They are also familiar with many different objects, experimental set-ups and rewards. By contrast, the simulations started assuming no previous knowledge. There was a close match between the simulations and the empirical data for the token condition, but the reduction in performance during experiment 3 was greater in the empirical data.
The simulations also captured that the great apes exhibited an overall lower success rate than the ravens did. At least two factors could have contributed to this difference. The apes experienced less pretraining than the ravens and, in contrast to the ravens, the apes were not allowed extinction training with the distractor objects prior to testing. This is shown in figure 1 where the probability of choosing the correct object is much higher at the start of experiment 1 in the raven study as compared with the ape study. That a lot of pretraining trials (35 in the token condition) combined with extinction trials can result in high performance in the forced choices is most clearly shown in the token condition of the raven study. Here the simulation tracked the observed high success rate closely.
Pretraining and extinction training did not only influence the likelihood of making correct decisions. Simulations reveal how pretraining and extinction also affect the proportion of choosing the incorrect objects, such as small rewards (figure 1). The effect of pretraining and extinction was most pronounced in the token condition of the raven study where the simulation suggests that the likelihood that the ravens should choose the small rewards over the functional objects was close to zero. The large amount of rewarding experiences with the functional objects (tool and token) resulted in large conditioned reinforcement values for these objects (figure 2). The simulations corroborated the pattern that ravens did not choose small rewards instead of functional objects, and that self-control is expected to emerge from associative learning.
The growth of stimulus-response values and stimulus values is shown in the top panel of figure 2. Note that experiment 4 in the great ape study matches the simulations the least. Here two new apes were allowed to get the reward without using the previously functional tool and they returned with a correct tool 2 of 16 times, lower than in the simulation. This difference between empirical test and simulation could be reduced by increasing the cost of the behaviour. Increasing the cost of a behaviour that does not lead to a reward will lead to a reduction in performing the behaviour. But it is unclear what to expect from the animals when the connection between a tool and a reward is less clear. In addition, two of the four apes never attempted to solve the problem. To conclude, it is difficult to judge the precision and meaning of that data point (see [32, p. 922]).
The simulations also show the differences between associative learning models of different complexity. The limits of our version of stimulus-response learning [43,44] become obvious when compared with simulations using our learning model that incorporates both Pavlovian and instrumental learning [19]. In stimulus-response learning alone, behaviour sequences where a behaviour is not immediately followed by a reward cannot be learned (figure 2). For behaviour sequences to develop, stimuli more than one step before the reward need to become rewarding through conditioned reinforcement. When a previously neutral stimulus acquires a positive w-value, that is it becomes rewarding, it can drive the acquisition of positive v-values for behaviours that do not result in immediate rewards (top panel in figure 2). When comparing our model that can learn sequences of behaviour with the instrumental version of the Rescorla-Wagner model, it is clear that the probability of choosing the correct stimulus will not increase if only stimulus-response learning is allowed (figure 2). In addition, as v-values are only updated by the immediate reinforcer in stimulus-response learning, this also has the consequence that the small reward will be chosen in favour of the token and the tool, as the token and the tool cannot become valuable stimuli. This is shown in figure 2 as the incorrect choice of small rewards increases across trials when only our version of stimulus-response learning is allowed (marked with R-W in figure 2). Stimulus-response learning alone could not account for the results in either the raven or the ape study.
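The contrast between the two models can be illustrated with a deterministic toy comparison in Python. It is a sketch under simplifying assumptions, not a reproduction of figure 2: each option is experienced a fixed number of times (40, chosen for illustration), behaviour costs are ignored, the choice is reduced to two options, and all names are hypothetical. The learning rate, exploration parameter, initial v-value and reward values follow the simulation details.

```python
import math

ALPHA, BETA = 0.2, 1.0
V_INITIAL = 1.0
U = {"reward": 6.0, "small reward": 2.0}  # primary values of the two rewards


def train(use_stimulus_values, n_trials=40):
    """Return v-values after n_trials experiences of each behaviour sequence."""
    v, w = {}, {}

    def experience(s, b, s_next):
        outcome = U.get(s_next, 0.0)
        if use_stimulus_values:  # conditioned reinforcement only in the full model
            outcome += w.get(s_next, 0.0)
        v_old = v.get((s, b), V_INITIAL)
        v[(s, b)] = v_old + ALPHA * (outcome - v_old)
        if use_stimulus_values:
            w_old = w.get(s, 0.0)
            w[s] = w_old + ALPHA * (outcome - w_old)

    for _ in range(n_trials):
        experience("tool", "take tool", "apparatus")
        experience("apparatus", "use tool", "reward")
        experience("dog kibble", "take small reward", "small reward")
    return v


def probability_of_tool(v):
    """Decision rule restricted to two options: take tool vs. take small reward."""
    tool = math.exp(BETA * v[("tool", "take tool")])
    small = math.exp(BETA * v[("dog kibble", "take small reward")])
    return tool / (tool + small)


p_full = probability_of_tool(train(use_stimulus_values=True))
p_sr_only = probability_of_tool(train(use_stimulus_values=False))
print(round(p_full, 2), round(p_sr_only, 2))
```

Without stimulus values, taking the tool is never followed by a stimulus with primary value, so its v-value decays towards zero while the v-value for taking the small reward grows. With stimulus values, the value of the apparatus propagates back to the behaviour of taking the tool.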

Discussion
Simulations of the two planning studies on ravens and great apes suggest that behaviour previously claimed to have been generated by flexible planning [24,28] can be accounted for by associative learning. As shown in artificial intelligence research and animal behaviour research, these models of associative learning are powerful in generating flexible behaviour sequences [1,19,45]. Therefore, the conclusion drawn in both the raven and great ape studies [24,28], that ravens and apes solve these problems by a specific flexible mechanism, has little support. Simulations performed here support critics that interpreted these results as consequences of associative learning [33,34]. If future studies aim at distinguishing associative processes from other kinds of mental mechanisms, they would benefit from improved experimental design including proper controls taking advantage of state-of-the-art learning models.
It was interesting to note that the simulations captured the difference between the study on ravens [28] and the study on great apes [24]. This suggests that the simulations captured well the effects of the pretraining phases, extinction phases and rewards throughout the studies. High conditioned reinforcement values (w-values) for the correct objects (tool and token) and low values for the distractor objects were established before the first tests (figure 2). This was especially obvious in the token part of the raven experiment, where the ravens were subjected to 35 pretraining trials in which the behaviour sequence S token → B take token → S human → B give token → S reward was consistently rewarded (lower panel, figure 1).
Figure 2. Results from the simulations, enabling comparison between the output of our learning model, which includes conditioned reinforcement (stimulus values), and an instrumental version of the Rescorla-Wagner (R-W) model [19]. Simulations of the raven study are on the left and simulations of the ape study are on the right. The top panels show memory updates: stimulus-response associations v for behaviours towards functional objects, and stimulus values w of these objects. As the functional objects are not themselves rewarding, simulations show that stimulus-response associations for choosing functional objects will not develop with the simpler learning model (R-W). The bottom panels show that the stimulus-response learning model (R-W) cannot reproduce the behaviour patterns observed in the two studies, in stark contrast to our learning model that allows conditioned reinforcement. Experimental phases are the same as in figure 1, but here phases are not shown for clarity. Note that the X-axes in the right panels are broken because experiment 4 was done with new individuals that only experienced pretraining prior to the experiment. Raven and ape graphics were downloaded from openclipart.org.
rsos.royalsocietypublishing.org R. Soc. open sci. 5: 180778
Another important factor for the positive results in the raven and great ape studies was that choosing the correct objects was rewarded throughout the tests. This maintained high v- and w-values for correct behaviours and correct objects, respectively. This also explains why the ravens neglected the small reward when it was presented together with the functional objects (figure 1). The functional objects led to rewards repeatedly throughout the study, so they had acquired high stimulus values. As long as these values are higher than the value of the small reward, the functional objects will be chosen most of the time. However, with only stimulus-response learning (allowing updates of v-values only, as in the Rescorla-Wagner model), the small reward will be chosen because this model lacks conditioned reinforcement (figure 2). If one wants to avoid learning during tests, there are benefits to carrying out tests under extinction, as for instance in outcome revaluation studies (e.g. [46,47]). This way, tests can reveal the consequences of prior experimental manipulations.
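The contrast between learning with and without conditioned reinforcement can be illustrated with a minimal sketch. It assumes update rules of the form Δv = α_v (u' + w' − v) and Δw = α_w (u' + w' − w), where u' and w' are the primary and learned values of the next stimulus; the learning rates, the reward value and the function name train_chain are hypothetical choices for illustration and are not the parameters used in the simulations reported here. The sketch repeats the fully rewarded token chain for 35 trials, once with w-updates and once without.

```python
def train_chain(n_trials, use_conditioned_reinforcement,
                alpha_v=0.2, alpha_w=0.1, u_reward=1.0):
    """Repeat the rewarded chain S_token -> take -> S_human -> give -> S_reward.

    All parameter values are illustrative assumptions, not fitted values.
    """
    chain = [("token", "take"), ("human", "give")]       # (stimulus, behaviour)
    next_stimulus = {"token": "human", "human": "reward"}
    u = {"token": 0.0, "human": 0.0, "reward": u_reward}  # primary values
    w = {"token": 0.0, "human": 0.0, "reward": 0.0}       # learned stimulus values
    v = {pair: 0.0 for pair in chain}                     # stimulus-response values

    for _ in range(n_trials):
        for stimulus, behaviour in chain:
            s_next = next_stimulus[stimulus]
            # Value of the outcome: primary value plus learned value of what follows.
            target = u[s_next] + w[s_next]
            v[(stimulus, behaviour)] += alpha_v * (target - v[(stimulus, behaviour)])
            if use_conditioned_reinforcement:
                w[stimulus] += alpha_w * (target - w[stimulus])
    return v, w


v_cr, w_cr = train_chain(35, use_conditioned_reinforcement=True)
v_sr, w_sr = train_chain(35, use_conditioned_reinforcement=False)

print("with conditioned reinforcement:", v_cr[("token", "take")])
print("stimulus-response learning only:", v_sr[("token", "take")])
```

Under these assumptions, the value of taking the token grows only when w-values are updated, because the human stimulus must itself acquire value before it can reinforce the preceding behaviour. With stimulus-response learning alone, only the final, directly rewarded step of the chain is learned.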
The results support the idea that self-control emerged through associative learning. We have previously shown how animals can, through associative learning, acquire self-control, provided that they are given enough information and experience [19, §2.3]. Kabadayi & Osvath [28] did not define self-control, but in a previous study [48] they defined it as '[...] the suppression of immediate drives in favour of delayed rewards'. This functional view of self-control fits many descriptions of behaviour in the animal behaviour literature. Observations of animals learning to reject small rewards when expecting large rewards, or in other words to reject unprofitable prey when profitable prey are abundant, come from for instance fish (bluegill sunfish Lepomis macrochirus, [49]), crustaceans (shore crabs Carcinus maenas, [50]) and birds (great tits Parus major, [51] and redshanks Tringa totanus, [52]). These kinds of studies have to a large degree been ignored in studies where self-control is treated as a separate kind of mental mechanism and not as something that is subject to learning (e.g. [6,28,48]). Instead, in the light of these simulations, the results of previous studies of self-control within animal cognition research (e.g. [48]) may best be understood as being caused by learning that includes conditioned reinforcement [31].
Theoretically, self-control can develop in more than one way. Self-control can emerge through the acquisition of high conditioned reinforcement values for the functional objects, so that the functional object becomes more valuable than a small reward. But self-control can also emerge if, for example, 'wait' is considered as a behaviour in its own right. In this case, self-control can emerge through an increased v-value for 'wait' in the presence of a particular stimulus. Self-control in hunting cats might emerge through high v-values for waiting when faced with prey that is far away. More research is needed to better understand how different aspects of learning mechanisms interact to give rise to patterns of self-control. Genetic predispositions are likely to play a large role and interact with stimulus-response associations and stimulus values.
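As an illustration of the two routes to self-control described above, the following minimal sketch applies a softmax decision rule to v-values. The decision rule, the selectivity parameter beta, the function name choice_probabilities and all numerical values are assumptions chosen for illustration; they are not fitted to any of the studies discussed.

```python
import math


def choice_probabilities(v, beta=5.0):
    """Softmax over stimulus-response values v (dict: behaviour -> value).

    beta is an assumed selectivity parameter; higher values give more
    deterministic choices.
    """
    v_max = max(v.values())  # subtract the maximum for numerical stability
    weights = {b: math.exp(beta * (x - v_max)) for b, x in v.items()}
    total = sum(weights.values())
    return {b: weight / total for b, weight in weights.items()}


# Route 1: the functional object has acquired a high value through
# conditioned reinforcement, so taking it outcompetes the small reward.
with_cr = choice_probabilities(
    {"take functional object": 0.9, "take small reward": 0.3})

# Stimulus-response learning only: the v-value for the object stays at zero,
# so the small reward is preferred.
without_cr = choice_probabilities(
    {"take functional object": 0.0, "take small reward": 0.3})

# Route 2: 'wait' as a behaviour in its own right, with a high v-value
# in the presence of distant prey (hypothetical values).
distant_prey = choice_probabilities({"wait": 0.8, "attack": 0.2})

for label, probs in [("with conditioned reinforcement", with_cr),
                     ("stimulus-response only", without_cr),
                     ("distant prey", distant_prey)]:
    print(label, probs)
```

In both routes the apparent suppression of an immediate drive is simply the outcome of one learned value exceeding another; no separate self-control mechanism is assumed in the sketch.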
Another important result was that the difference between the ravens' performance in experiment 3 and experiment 4 was captured by the simulations. The reason for the perfect performance in experiment 4, in both the raven study and the simulation, was that the delay between the choice and the behaviour resulting in reward was omitted. Instead, there was an opportunity to use the object to collect a reward right after the forced choice. For this reason, every trial could lead directly to a reward, whereas choosing the correct object in experiment 3 was only rewarded after the delay. In other words, in experiments 1-3 the ravens could only get a reward every second time they chose the correct object, whereas in experiment 4 they got rewards every time, immediately after having chosen and used the functional item.
One similarity between our learning model and some reinforcement learning models in AI is that these mechanisms allow agents and animals to identify world states that are valuable, and which behaviours are productive in these valuable states. In an operational sense, these learning models generate planning when a behaviour (put in apparatus or give to human) towards a stimulus (stone or token) will produce high value food at a later stage. This happens despite the fact that the food (or another rewarding stimulus) is absent. Osvath & Kabadayi [53], in a reply to critics [33], defined flexible planning as 'making decisions about futures outside one's current sensory scope in domains for which one is not predisposed'. Irrespective of whether models come from AI [54] or animal behaviour [19], when conditioned reinforcement is included in learning models, planning behaviours that match this definition will emerge through the interplay of stimulus-response values and stimulus values. The key is that currently available stimuli can provide information about what behaviours should be performed to enter future valuable states. However, these learning models cannot simulate different outcomes mentally, travel mentally in time or reorganize information internally. To paraphrase Roberts [55], non-human animals can be 'stuck in time' while still exhibiting planning behaviour.
Mulcahy & Call [24] attempted to rule out instrumental conditioning as an explanation for the behaviour of the apes by performing experiment 4. This phase was similar to experiment 3, but the apes were not rewarded for using the functional tool. Instead of an ape entering the room with a functional tool that could be used to get a reward (as in experiment 3), an ape entered the room and found a reward if it had carried the functional tool to the test room from the waiting room. It was argued that if the apes performed better in the other experiments than in this one, it would suggest that the apes planned flexibly. Mulcahy & Call concluded that their results 'represent a genuine case of future planning'. A devil's advocate could identify differences between experiments 3 and 4 that render learning a more likely explanation. In experiment 3, the apes were explicitly rewarded for using the tool. This results in a high conditioned reinforcement value for the tool and a high stimulus-response value for using the tool on the apparatus. In experiment 4, however, as Mulcahy & Call point out, more time passed between picking up the tool in the waiting room, carrying it to the test room and subsequently getting a reward without using the tool. Perhaps the low performance in experiment 4 was caused by the unclear connection between the tool and the reward, as the delay inhibits learning to pick up the tool in order to receive a reward later. Proper control conditions are important to enable the unambiguous rejection of hypotheses (e.g. recent discussions in [56,57]). Our learning model can be used in future research to analyse such behavioural differences caused by variation in learning contingencies.
The simulations show that the ape study [24] and raven study [28] can be understood through associative learning. However, results from experiments with caching specialists [58,59], which probably depend upon genetic specializations [27,29,30], are currently beyond the scope of our learning model. Caching behaviour and feeding behaviour involve different motivational states in animals [60]. Motivational states can be regarded as internal stimuli and readily integrated in an associative learning model, which would result in increased flexibility in terms of making foraging and caching decisions. Our model does not include different motivational states in its current form, but we have given examples of how genetic predispositions can be integrated with the model [19, table 2]. One possible solution would be to introduce context-dependence, so that exploration is different for different external stimuli and/or for different internal states. Importantly, when making assumptions about more flexible mental mechanisms, the higher costs of exploration that are incurred by increased flexibility need to be taken into account (see [19, §3.3]). We expect that evolution has fine-tuned genetic predispositions that, together with associative learning, generate productive and species-specific behaviours.
Another important point for future studies is that when animals learn about the consequences of behaviour, and stimulus-response values and stimulus values are updated, these are long-term memories (e.g. [61-63], see also [40]). A raven trained to give tokens to a human does not simply forget how to do this one day later. Behaviourally, the tool condition of the raven study is identical to when dog owners teach their dogs to 'clean up' by putting toys in a designated basket. Instead of the raven being rewarded for putting a stone in an apparatus, a dog gets a reward for putting a toy in a basket. Such long-term memories that are updated through associative learning are very different from the short-term memory of arbitrary stimuli [23].
In conclusion, the development of associative learning models is impressive in AI research and the models have proven powerful in generating complex behaviour. One can ask why these powerful models are not more widely applied to non-human animal behaviour and why they are underestimated as a cause of flexible behaviour in non-human animals. This is especially relevant given that research in animal cognition, in which non-human animals are claimed to have insights, exhibit causal reasoning and plan, is criticized on a regular basis for making grand claims based on weak methodology (e.g. [31,64-70]). One way to solve this associative learning paradox is by integrating the fields of AI, animal learning and animal cognition [71]. To understand mechanisms generating behaviour, formal bottom-up associative models are likely to be more illuminating than verbal top-down 'higher-order' cognitive models, for instance because the latter are more difficult to reject and cannot be implemented in simulations or used when building robots. To sum up, it is concluded that one cannot rule out that flexible planning in apes and corvids, and probably many other species, emerges through associative learning.