Error cancellation

The human cognitive system houses efficient mechanisms to monitor ongoing actions. Upon detecting an erroneous course of action, these mechanisms are commonly assumed to adjust cognitive processing to mitigate the error's consequences and to prevent future action slips. Here, we demonstrate that error detection has far earlier consequences by feeding back directly onto ongoing motor activity, thus cancelling erroneous movements immediately. We tested this prediction of immediate auto-correction by analysing how the force of correct and erroneous keypress actions evolves over time while controlling for cognitive and biomechanical constraints relating to response time and the peak force of a movement. We conclude that the force profiles are indicative of active cancellation by showing indications of shorter response durations for errors already within the first 100 ms, i.e. between the onset and the peak of the response, a timescale that has previously been related solely to error detection. This effect increased in a late phase of responding, i.e. after response force peaked until its offset, further corroborating that it indeed reflects cancellation efforts instead of consequences of planning or initiating the error.


Introduction
Human errors can lead to drastic consequences for agents and their environment. It is therefore not surprising that human action control is governed by efficient mechanisms to monitor and regulate ongoing performance. These mechanisms ensure that the cognitive system detects erroneous actions readily and takes adaptive measures to steer future behaviour toward success [1][2][3][4][5]. Efficient error detection allows for swift error-correction responses, and it promotes adaptions for upcoming actions as documented by observations such as post-error slowing [6][7][8][9]. Converging results from event-related electroencephalography consistently yielded a negative deflection that peaks within only 100 ms after error commission, which has been related to error detection [10,11]. Computational models of performance monitoring further support the proposed link between early detection and subsequent countermeasures to correct unforeseen consequences and to avoid errors in upcoming actions [12,13].
Here we argue that efficient error detection might not only target future behaviour, but might even feed back immediately onto the erroneous action by cancelling ongoing motor activity as quickly as possible [14][15][16][17][18]. Unexpected events such as errors have indeed been proposed to trigger an orienting response followed by general motor inhibition [19], a process that has been proposed to underlie observations such as post-error slowing. However, motor inhibitory signals could also instil immediate cancellation of an ongoing erroneous action. The benefits of such countermeasures apply especially to real-world actions, which typically unfold over an extended timescale and multiple consecutive steps. For example, if an agent is about to throw a letter into the wrong mail box, or ring a bell on a door instead of turning on the light switch in a hall, an adaptive system would aim at cancelling the current course of action as soon as possible. Similar tendencies might even operate for more ballistic keypress responses that have commonly been employed to study error processing. Several observations point to this possibility by indicating that errors come with reduced peak forces (PFs) and shorter response durations (RDs) than correct responses [15,20,21,22]. Whether these observations are indicative of active cancellation or whether they derive from differences during motor planning and initiation is an open question that the present study aimed to resolve.
Cancellation on such short notice as in the case of a keypress response poses a profound challenge to current models of performance monitoring by suggesting auto-corrective effects for erroneous actions on a timescale that has previously been assumed to capture initial error detection instead. We hypothesize that active error cancellation would lead to shorter RDs (i.e. the time from onset of a keypress to its release) of erroneous than correct responses even when matching erroneous and correct responses for several cognitive and behavioural surface parameters. For one, erroneous responses often have shorter response times (RTs) than correct responses [15,20,23]. To arrive at a plausible estimate of the hypothesized effect of error commission on RD, and to examine the role of the predicted RT differences for our main measure of RDs, we re-analysed data from previous work in which we had participants perform a speeded choice reaction task and recorded onsets as well as offsets of their keypress responses (Experiments 1 of [24,25]). We analysed RDs and RTs as a joint function of accuracy (correct versus error) and quartiles of the RT distribution. Figure 1 summarizes the key findings of these analyses (see Pilot data for detailed results).
The pilot analyses document a marked difference in the duration of correct and erroneous responses across the RT distribution, consistent with the hypothesized process of active error cancellation (figure 1a,c). Moreover, RDs of erroneous responses became smaller at longer RTs whereas RDs of correct responses were relatively stable across the RT distribution. Visual inspection of figure 1b further indicated consistent pre-error speeding [27,28], with particularly fast error responses (figure 1b,d). These results suggest that erroneous responses come with systematically reduced duration that cannot be explained in terms of differing RTs.
Inferring error cancellation from reduced RDs, however, poses the additional challenge that errors also tend to come with smaller PFs than correct responses [15,18,20,21,22]. While this observation of attenuated error responses might itself be consistent with the notion of active cancellation, it might also originate from differences in how errors and correct responses are generated. The observation of lower PFs for errors therefore cannot make a clear case for active error cancellation. Furthermore, if errors were already triggered with reduced force, low PFs would imply different biomechanical constraints for errors as compared with (on average) more forceful correct responses. We therefore aimed at overcoming this potential confound by matching responses for RTs and especially for PFs. Such matching equates potential biases so that any remaining differences between errors and correct responses are difficult to explain without assuming error cancellation. Observing reduced RDs for errors even in the matched data would therefore provide convincing indication of active cancellation of erroneous responses.
The main study tackled this challenge by means of continuous force measurements as compared with the discrete distinction of response onsets and offsets used in the pilot data. We hypothesized that active error cancellation leads to shorter RDs of erroneous than correct responses even when matching for RT and when matching for PF (Hypothesis 1; table 1). We pre-registered that we would only infer active error cancellation if commission errors come with shorter RDs than correct responses for the unmatched datasets and for both matched datasets. Mixed results for this main hypothesis were planned to be followed up by Bayesian analyses and potentially additional sampling to either refute or support the current standard model without immediate error cancellation.
Moreover, we aimed at discerning whether error cancellation emerges in an early or a late phase of responding. RD comprises two clearly distinguishable time epochs of the force profile, namely from force onset to PF, and from PF to force offset. Early, anticipatory error cancellation would hold that effects of cancellation in RDs occur already when a response reaches its PF, whereas late, reactive cancellation would manifest only after the peak (Hypothesis 2). Continuous force measurements further allow for assessing whether erroneous responses are enacted with less overall force than correct responses in the unmatched and matched data, measured through the area under the curve (AUC) of the force profile (Hypothesis 3). In line with Hypothesis 2, we planned to test whether such attenuation is already present in an early phase of responding before reaching PF, or only emerges reactively afterward (Hypothesis 4). If the evidence pointed to an early reduction of force in the unmatched and both matched datasets, we planned to dissect the early force profile even further in relation to the registration of the response. For RT-matched data, we planned to scrutinize how the reduction of force develops for erroneous responses in five time steps leading up to the registration of the response (Hypothesis 5). In additional analyses, we aimed to assess whether overall stronger force reduction during erroneous relative to correct responding relates to the successful abortion and correction of subthreshold responses. In particular, we predicted that larger differences between correct and erroneous responding in AUCs would come with a higher percentage of low-threshold erroneous responses in correct trials of the unmatched data (Hypothesis 6). Finally, we planned to test whether we replicate shorter RTs (Hypothesis 7) and smaller PFs (Hypothesis 8) for erroneous than correct responses to validate our data against established findings in the literature.

Methods
The data and analysis code for the pilot analyses and for the pre-registered main study are publicly available, as are the data and analysis code for the validation study described in the electronic supplementary material and the Stage 1 protocol (osf.io/5v9es [30]). Table 1 provides an overview of our research questions, hypotheses, sampling plan, analysis plan and interpretation.

Figure 1 (caption, partial). RTs of trial sequences where an error (bright, orange dot) in trial n was preceded and followed by two correct responses (black squares), respectively. Error bars represent the 95% confidence interval for paired differences between consecutive trials (CI_PD; [26]).

Table 1 (sampling plan, partial). This power analysis suggested a sample size of at least 23 participants for a two-tailed paired t-test (α = 5%) and for the main effect of the factor accuracy in all analyses of variance (ANOVAs; see Data analysis). We recruited a total sample of 34 participants to allow for sufficient power also for all additional analyses (see following table rows).
Hypotheses (1) and (2)  In the case of a significant interaction between accuracy and time-window (i.e. evidence for differences in force attenuation between early and late phases of the response), we further explored the time course as per the following question. (2) RDs are already shorter for erroneous than correct responses before the force profile of a response reaches its peak (PF).
In the case of significant two-way interactions in the ANOVAs (see above), Hypothesis (2) was further tested in two one-tailed paired-samples t-tests. We tested for shorter RDs in erroneous than correct trials before and after the peak, respectively. In the presence of a two-way interaction, we would only infer early error cancellation if there was a significant difference between the RDs of correct and erroneous responses for the pre-peak condition in all three analyses. Observing reduced RDs for errors only in the post-peak interval would support active but late error cancellation.
Are erroneous responses enacted with less overall force than correct actions?
(3) AUCs as computed for the force profile are smaller for erroneous than correct responses. Effect sizes for comparisons of PFs (rather than overall force) in previous reports [15] point towards an effect of d_z = 0.65 for which about 33 participants allow for a power of 95%. Note that the effect size estimate can only be approximate, however, because the present study was the first to implement fine-grained analyses for force profiles rather than PF-based measures. The reported validation data suggested the estimate to be feasible (here:
Hypotheses (3) and (4) were tested in 2 × 2 ANOVAs with the within-subject factors accuracy (correct versus error) and time-window (pre-peak versus post-peak) and AUCs as the dependent variable. We performed this analysis on unmatched data and on data where erroneous and correct responses were matched by RT and PF, respectively. The analyses on the matched data were again conditional upon the success of our (iterative) matching procedures (see above). We would infer enactment of less overall force for erroneous than correct actions if commission errors came with lower AUCs than correct responses in all three analyses.
In the case of a significant interaction between accuracy and time-window (i.e. evidence for differences in force attenuation between early and late phases of the response), we further explored the time course as per the following question. (4) AUCs are smaller for erroneous than correct responses already before a response reaches its PF.
In the case of significant two-way interactions in the ANOVAs (see above), Hypothesis (4) was further tested in two one-tailed paired-samples t-tests. We tested for smaller AUCs in erroneous than correct trials before and after the peak, respectively. In the presence of a two-way interaction, we would only infer a weaker enactment of force for erroneous than correct responses in an early phase if there was a significant difference between the AUCs of correct and erroneous responses for the pre-peak condition in all three analyses.
Observing smaller AUCs only in the post-peak interval would support active but late attenuation of response force.
In the case of a significant two-way interaction, Hypothesis (5) was further tested in a one-tailed paired-samples t-test per time-window to probe for smaller mean forces in erroneous than correct trials. We would infer a weaker enactment of force for erroneous responses in all time-windows preceding the response if there was a significant main effect without a significant interaction in the omnibus ANOVA. In the case of a significant interaction, we confined this interpretation to time-windows with a significant difference between correct and erroneous responses.
Does a more pronounced attenuation of overall response force in erroneous relative to correct responses relate to the successful abortion of erroneous subthreshold responses in correct trials? Hypothesis (6)  Hypothesis (7) was tested in a one-tailed paired-samples t-test (i.e. error < correct) with RTs as the dependent variable.
We inferred faster initiation of erroneous than correct responses in the case of a significant result. Careful matching of erroneous and correct trials for one of the three main analyses (as described in the first row of this  (3) and (4).
Hypothesis (8) was tested in a one-tailed paired-samples t-test (i.e. error < correct) with PFs as the dependent variable.
We would infer enactment of less maximum force for erroneous than correct actions if commission errors come with lower PFs.
Careful matching of erroneous and correct trials for one of the three main analyses (as described in the first row of this

Pilot data
We reanalysed a dataset with 48 participants of which we had to exclude four datasets because of insufficient observations in at least one design cell (see below). The design of the pilot study resembled the main study (see Stimuli and apparatus and Procedure) with the major exception that participants responded on a standard German QWERTZ keyboard instead of the custom-built apparatus to measure force profiles as used in the main study. In short, participants responded to target letters in a speeded choice reaction task with a 4 : 2 mapping of target stimuli to response keys (see electronic supplementary material, figure S1A for the trial procedure). Four task-irrelevant distractor letters surrounded each target letter to increase perceptual noise and thus error likelihood. Participants worked on 1120 trials of this task in 20 blocks with the first block serving as practice.
For each response, we measured RT (i.e. time from target onset to keypress) and RD (i.e. time from pressing to releasing the key) to gather first evidence for active error cancellation (figure 1). We excluded the practice block and the first trial of each block. We selected trials with a correct response in the preceding trial (19.5% excluded). We then discarded trials with miscellaneous errors where participants used any other than the instructed keys or responded multiple times (2.0%), as well as omission errors (2.6%). We further excluded trials as outliers if either RT or RD deviated more than 2.5 s.d. from their cell mean, calculated separately for each participant and accuracy condition (2.3%). For visualization in figure 1a,b, we selected trial sequences with two correct responses preceding and following an error, respectively.
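The outlier criterion described above can be illustrated with a minimal Python sketch. It is not the deposited analysis code; the function name, the dictionary keys and the use of the sample standard deviation (n − 1) are assumptions made for illustration. The function is meant to be applied to one design cell at a time, i.e. to the trials of one participant in one accuracy condition.

```python
from statistics import mean, stdev


def drop_outliers(trials, measures=("rt", "rd"), criterion=2.5):
    """Keep trials whose RT and RD both lie within `criterion` s.d. of the cell mean.

    `trials` is a list of dicts for ONE cell (one participant, one accuracy
    condition). Keys and the sample s.d. are illustrative assumptions.
    """
    keep = [True] * len(trials)
    for key in measures:
        values = [t[key] for t in trials]
        if len(values) < 2:
            continue  # s.d. undefined for fewer than two observations
        m, sd = mean(values), stdev(values)
        for i, v in enumerate(values):
            # A trial is dropped if EITHER measure deviates too far.
            if sd > 0 and abs(v - m) > criterion * sd:
                keep[i] = False
    return [t for t, k in zip(trials, keep) if k]
```

Because the criterion is computed within each cell, a response that is unremarkable for one participant can still be flagged for another.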
We compared correct and erroneous RTs and RDs in separate two-tailed paired-samples t-tests. In addition, we computed RT-quartiles (ntiles function of the R package schoRsch v. 1.9.1; [31,32]) and analysed RTs and RDs as a function of accuracy and RT-quartile in separate 2 × 4 analyses of variance (ANOVAs). In the case of significant two-way interactions, we tested for effects of accuracy in each quartile via two-tailed paired-samples t-tests. We excluded four participants from the analyses because they provided less than 10 observations in at least one of the design cells of the ANOVAs.
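As an illustration of the quartile split, the following Python sketch assigns each RT to one of four equally sized, rank-based bins. It is a hypothetical stand-in and may differ from the ntiles function of schoRsch in details such as the handling of ties; the function name is an assumption.

```python
def rt_quartiles(rts, bins=4):
    """Assign each RT a bin label from 1 (fastest) to `bins` (slowest) by rank."""
    n = len(rts)
    order = sorted(range(n), key=lambda i: rts[i])  # indices from fastest to slowest
    labels = [0] * n
    for rank, i in enumerate(order):
        labels[i] = rank * bins // n + 1
    return labels
```

For the 2 × 4 design described above, such a function would presumably be applied separately to each participant's correct and erroneous trials, so that every design cell holds a comparable share of that participant's responses.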
Erroneous responses indeed came with markedly shorter RDs than correct ones (94 versus 112 ms;

Sample
For our main research question (Hypothesis 1), we computed the 95% CI_d for the effect size d_z = 1.34, 95% CI_d = [0.93; 1.74], as observed for the differences between RDs of erroneous and correct responses in the pilot analyses (ci.sm function of the R package MBESS v. 4.8.0; [33]). Because the planned matching procedures probably constrain the resulting effect size, we opted to use the lower bound of the confidence interval as a conservative estimate. About 23 participants would ensure a high power of 99% to detect the effect size corresponding to this lower bound in a two-tailed test (α = 5%; power.t.test function of the R package stats v. 4.0.3; [32]). Considering the remaining hypotheses, the effect size for the difference between correct and erroneous responses in PFs informed from previous reports [20] was the smallest relevant effect size for any of the hypotheses tested in this study (d_z = 0.65). We thus considered this effect size for the computation of our sample size, and a small-scale validation of the proposed design (N = 4) found effects to exceed this estimate consistently across dependent measures and matching procedures (see the electronic supplementary material). We therefore tested 34 participants to arrive at a counterbalanced sample that ensured a high power of 95% in a two-tailed paired test for this hypothesis (α = 5%) while ensuring even higher power for all remaining hypotheses (including a power of more than 99% for our main hypothesis).

Footnote 1: In the pilot study, participants did not receive immediate feedback for correct responses and commission errors, but they received a summary of their performance after each block. We handled performance feedback similarly in the main study (see Stimuli and apparatus and Procedure). Validating this design choice, we found similar effects of error cancellation in a replication study of the reported work, where we manipulated between participants whether the commission of an error was or was not fed back at the end of a trial (see electronic supplementary material, figure S1 for the trial procedure; [25]). Again, we found shorter RDs, F(1,92) = 62.20, p < 0.001, η²p = 0.40, d_z = 0.81, and shorter RTs, F(1,92) = 106.37, p < 0.001, η²p = 0.54, d_z = 1.06, for errors than for correct responses across feedback conditions (with non-significant main effects of feedback and two-way interactions of accuracy and feedback in RDs and RTs, Fs(1,92) ≤ 1.52, ps ≥ 0.220, η²p ≤ 0.02, d_z ≤ 0.13). We relied on the effect size estimate of the first pilot analyses for the power analysis of the main study, however, because feedback and temporal characteristics match the main study more closely than the replication design.

royalsocietypublishing.org/journal/rsos R. Soc. Open Sci. 9: 210397
We planned to exclude the data of participants who opted to abort the study prematurely (which did not occur), for whom the study could not proceed as planned (see Stimuli and apparatus and Procedure) because of technical errors (four participants), and based on a priori criteria to establish sufficient data quality (see Data selection; six participants). We replaced excluded datasets with new participants until we had 34 datasets for statistical analyses. If the analyses for our main research question (Hypothesis 1) had returned mixed results, we would have increased the sample size by means of adaptive sampling informed by Bayes factors (see Data analysis).

Stimuli and apparatus
Stimuli were presented on a 24-inch screen with a display resolution of 1920 × 1080 pixels and a refresh rate of 100 Hz. Participants responded with their two index fingers on custom-built keys that measured isometric force with a sampling rate of 250 Hz (see electronic supplementary material, figure S2). The force-sensitive parts of the keys had a size of 1.8 × 1.8 cm with an elevated circular platform (1.3 cm diameter) as finger rest. They were embedded in a frame that was 2.5 × 2.5 cm in width and depth and 1.4 cm in height. The combined apparatus of frame and the force-sensitive part was about 1.5 cm in height. Participants responded to four target letters, of which T and N mapped to one key while V and K mapped to the other key. The mapping of letter pairs to response keys was counterbalanced across participants. Target letters appeared in white font colour in the centre of a black screen closely surrounded by four distractor letters above, below (distance amounts to 6% of the screen height) and to both sides (3% of screen width; see footnote 2). The distractor letters were O, W, X, U, Z, Y, H and A. None of the distractors mapped onto a response but rather the presence of distractors increased visual noise to stimulate commission errors.

Procedure
Participants were explicitly instructed about the mapping rule before the study and they were encouraged to ignore the distractors that accompanied each target. The trial started with a fixation cross for 500 ms, followed by a display showing target and distractors for a maximum duration of 600 ms (target and distractors disappeared upon response onset). The force exerted on both response keys was measured from the onset of fixation. The first 10 measurements were averaged for each key and used as baseline. Response onsets were identified when the force on one key was at least 0.25 arbitrary units (a.u.; about 250 g or 2.5 N) above its baseline. Previous experience with this device as well as a small validation study reported in the electronic supplementary material yielded a reasonable amount of omission errors with this response criterion while ensuring that the keys operate with sufficient sensitivity. RT then denoted the time from onset of the target until the force on one response key reached the response threshold. After the response onset had been registered, the screen went black for an inter-trial interval of 1000 ms while the force on both keys was still measured. The offset of the response was registered when the force matched or exceeded the threshold for the last time.

Footnote 2: In the accepted Stage 1 version of this article, we announced that distractors would appear at a distance of 3% screen height and 5% screen width from the target although we intended to use the same stimulus setup as in the validation study and in the pilot study (6% and 3%, respectively). We noticed our error during the preparation of data collection. We decided to implement our original stimulus arrangement, deviating from the preregistered method in this regard; however, we made this decision before any data had been collected.
Participants worked through one initial practice block, followed by 17 experimental blocks. In the practice block, participants received feedback for each response for 1000 ms at the end of the trial. Correct responses were fed back with 'Good!' (German: 'Gut!'), early responses during the fixation with 'Too early!' (German: 'Zu früh!'), commission errors (left response when right response would be appropriate and vice versa) with 'Wrong!' (German: 'Falsch!') and omissions of any response, i.e. no response before the deadline of 600 ms, with 'Too slow!' (German: 'Zu langsam!'). In experimental blocks, participants only received feedback for early responses and omissions. At the end of each block, participants received a summary on their performance with the mean RT of correct responses and the number of commission and omission errors. They were also urged to respond as quickly as possible while trying to avoid high numbers of errors independently from their performance in the block.
The practice block featured a random sequence of 32 trials in which each target appeared once with each distractor. In each experimental block, each combination of targets and distractors appeared twice, resulting in a random order of 64 trials.
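The block structure can be made concrete with a short Python sketch that crosses the four targets with the eight distractor letters. The function name and the seeded random generators are illustrative assumptions; the sketch only reproduces the trial counts stated above and is not the experiment code.

```python
import random

TARGETS = ["T", "N", "V", "K"]
DISTRACTORS = ["O", "W", "X", "U", "Z", "Y", "H", "A"]


def make_block(repetitions, rng):
    """Return a shuffled list of (target, distractor) trials.

    Each of the 4 x 8 = 32 combinations occurs `repetitions` times.
    """
    trials = [(t, d) for t in TARGETS for d in DISTRACTORS] * repetitions
    rng.shuffle(trials)
    return trials


practice_block = make_block(1, random.Random(1))      # 32 trials
experimental_block = make_block(2, random.Random(2))  # 64 trials
```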

Data selection
We neither analysed the practice block nor the first trial of each block. We then determined the frequency of error types for each participant and excluded participants who responded correctly in less than 60% of the trials (five participants; remaining participants: M = 80.9%, s.d. = 8.6% correct responses and M = 9.4%, s.d. = 5.6% commission errors) as well as participants whose data came with less than 10 observations in at least one of the design cells for any of the main analyses (see Data analysis; one participant).
We only selected trials for further analyses with a correct response in the preceding trial to control for potential effects of post-error processing (19.7% excluded). From these trials, we selected trials with an above-threshold response (i.e. baseline-corrected force of at least 0.25 a.u.) that was correct or constituted a commission error (i.e. left response to a letter assigned to the right response, or vice versa). We excluded all other erroneous trials, i.e. anticipatory above-threshold responses during fixation (less than 0.1%) as well as omissions of an above-threshold response (9.8%). Individual force profiles were baseline-corrected by subtracting the average force of the first 10 measurements after target onset from each force measurement of that response. This correction might lead to seemingly negative force values although actual negative force values cannot emerge by design. To avoid confusion, we set all force values that turned negative due to the correction procedure to zero. We then identified the maximum force (i.e. PF) of each trial, the time to PF, as well as onset and offset times.
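The preprocessing of a single force trace can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the deposited analysis code: the function name and return format are hypothetical, the trace is assumed to start at target onset, and all times (including time to PF) are expressed relative to the first sample at the 250 Hz sampling rate.

```python
SAMPLE_MS = 4     # 250 Hz sampling: one measurement every 4 ms
THRESHOLD = 0.25  # response threshold in a.u., as stated in the text


def describe_response(force, n_baseline=10, threshold=THRESHOLD):
    """Baseline-correct one trace and extract PF, time to PF, onset and offset."""
    baseline = sum(force[:n_baseline]) / n_baseline
    # Values that turn negative through the correction are set to zero.
    corrected = [max(0.0, f - baseline) for f in force]
    peak_force = max(corrected)
    peak_index = corrected.index(peak_force)  # first occurrence of the peak
    above = [i for i, f in enumerate(corrected) if f >= threshold]
    if not above:
        return None  # no above-threshold response in this trace
    return {
        "peak_force": peak_force,
        "time_to_peak_ms": peak_index * SAMPLE_MS,
        "onset_ms": above[0] * SAMPLE_MS,    # first sample at or above threshold
        "offset_ms": above[-1] * SAMPLE_MS,  # last sample at or above threshold
    }
```

Taking the last above-threshold sample as the offset mirrors the definition given in the Procedure, so a brief dip below threshold within a response does not end it prematurely.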
For the same set of trials, we processed the force applied to the key that had not been pressed above-threshold and determined for each trial whether participants hit or exceeded a low-threshold force of 0.1 a.u. at least once in the trial. In correct trials, these covert responses corresponded to subthreshold commission errors and in error trials, they represented subthreshold correct responses.
We extracted forces in two ways: first, time-locked to target onset until 1600 ms after target onset (target-locked), and second, time-locked to the first occurrence of the PF in a trial, with a window from 300 ms before the peak to 300 ms after the peak (peak-locked). To allow for averaging, we employed linear interpolation to estimate force values for every millisecond in the corresponding timeframe. From the target-locked force data, we also derived the duration of each response (RD) for statistical analyses by determining the time-window between reaching a threshold of 0.25 a.u. for the first time in a trial and the time point where it reached this point for the last time in a trial. We also considered when force peaked (for the first time) in a trial, to analyse the effects of error processing on RD before and after this peak. For the peak-locked force, we computed relative force values by dividing each force value by its PF. We excluded trials from further analyses as outliers when RT, RD or PF deviated more than 2.5 standard deviations from their cell mean (4.9%).
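The interpolation step and the split of RD at the force peak can be illustrated with the following Python sketch. Function names are hypothetical, the input is assumed to be a baseline-corrected trace sampled every 4 ms, and the code is a sketch of the described logic rather than the analysis code itself.

```python
def interpolate_ms(force, sample_ms=4):
    """Linearly interpolate a trace to one force value per millisecond."""
    out = []
    for i in range(len(force) - 1):
        for step in range(sample_ms):
            out.append(force[i] + (force[i + 1] - force[i]) * step / sample_ms)
    out.append(force[-1])
    return out


def response_duration(force_ms, threshold=0.25):
    """Return total, pre-peak and post-peak response duration in ms."""
    above = [t for t, f in enumerate(force_ms) if f >= threshold]
    if not above:
        return None  # the trace never reaches the threshold
    onset, offset = above[0], above[-1]
    peak_time = force_ms.index(max(force_ms))  # first occurrence of the peak
    return {
        "rd": offset - onset,
        "pre_peak": peak_time - onset,
        "post_peak": offset - peak_time,
    }
```

By construction, the pre-peak and post-peak durations sum to the total RD, which is what allows the time-window factor in the analyses to partition the response.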
In a next step, we matched correct and error trials by (i) their RTs and (ii) their PFs for each participant and referred to these data sources as RT-matched and PF-matched data, respectively. As error trials were less frequent than correct trials, we selected error trials one-by-one from the lowest trial number to the highest trial number. For the error trial with the lowest trial number, we subtracted the RT (PF) of that error trial from each correct trial. The correct trial with the smallest absolute difference was chosen as a match. In the case of ties, we assessed the trial number of the error trial and all tied correct trials, selecting the correct trial that lay closest to the error trial; if two trials were tied also on this latter test, we would select the trial with the smaller trial number. After each match, we proceeded to the error trial with the next higher trial number considering only correct trials without a match until every error trial had a matching correct trial.
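The matching rule, including its two tie-breakers, can be sketched in Python as a greedy nearest-neighbour search without replacement. The function name and the dictionary layout (a "trial" number plus the matching variable) are assumptions for illustration; the sketch is not the deposited analysis code.

```python
def match_trials(errors, corrects, key):
    """Pair each error trial with the closest unmatched correct trial on `key`.

    `errors` and `corrects` are lists of dicts for ONE participant, each with
    a "trial" number and the matching variable (e.g. "rt" or "pf").
    """
    available = list(corrects)
    pairs = []
    # Error trials are processed from the lowest to the highest trial number.
    for err in sorted(errors, key=lambda t: t["trial"]):
        if not available:
            break  # fewer correct than error trials: stop matching
        best = min(
            available,
            key=lambda c: (
                abs(c[key] - err[key]),          # 1. smallest absolute difference
                abs(c["trial"] - err["trial"]),  # 2. closest in trial number
                c["trial"],                      # 3. smaller trial number
            ),
        )
        available.remove(best)  # each correct trial can be used only once
        pairs.append((err, best))
    return pairs
```

Because the search is greedy, the result depends on the order in which error trials are processed, which is why the procedure fixes that order by trial number.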
Data analysis
Main analysis: response durations as a function of accuracy and time-window
Our main analysis assessed RDs in a 2 × 2 ANOVA with the within-subject factors accuracy (correct versus error) and time-window (pre-peak versus post-peak). We performed this analysis on the unmatched data, on the RT-matched data and on the PF-matched data, and we would only infer active error cancellation if commission errors came with shorter RDs than correct responses in all three analyses. Significant two-way interactions in the ANOVAs were followed up with separate one-tailed paired-samples t-tests before and after the peak to determine whether error cancellation operated pre-peak, post-peak or in both timeframes. If any of the three ANOVAs returned a non-significant effect of accuracy, we would compute Bayes factors for the pairwise comparison of RDs between correct and erroneous responses and collect additional data in increments of two participants until reaching a Bayes factor of BF01 > 10 or BF01 < 0.10 for all three analyses (using a Cauchy distribution with a scale parameter of 1 as prior) or until reaching a sample of 100 analysable datasets. We used separate Bayesian t-tests rather than Bayesian ANOVA, to avoid the variability of Bayes factor estimates inherent in current approaches to the latter [34].

Temporal evolution of response force
To further characterize the temporal evolution of the force profiles, we computed the AUC from the aggregated relative forces for each participant (computed on the unmatched, RT-matched and PF-matched, peak-locked data). We analysed the AUCs in 2 × 2 ANOVAs with the within-subject factors accuracy (correct versus error) and time-window (pre-peak versus post-peak) with follow-up tests as for RDs.
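As an illustration of splitting the AUC at the force peak, the sketch below applies the trapezoidal rule to a single force trace. It rests on assumptions not stated in this section: the trace is sampled at a uniform interval `dt`, the peak sample belongs to both windows, and the function name `auc_pre_post` is hypothetical.

```python
def auc_pre_post(forces, dt=1.0):
    """Return (pre_peak_auc, post_peak_auc) for one force trace.

    forces: force samples from response onset to offset, uniformly spaced.
    dt: sampling interval (assumed uniform).
    """
    def trapezoid(samples):
        # Sum of trapezoid areas between consecutive samples.
        return sum((a + b) / 2.0 * dt for a, b in zip(samples, samples[1:]))

    peak = max(range(len(forces)), key=lambda i: forces[i])  # first maximum
    pre = trapezoid(forces[: peak + 1])   # onset up to and including the peak
    post = trapezoid(forces[peak:])       # peak up to offset
    return pre, post
```

For a symmetric triangular trace such as `[0, 1, 2, 1, 0]`, both windows yield the same area, so any asymmetry between pre-peak and post-peak AUCs reflects the shape of the force profile rather than the integration method.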
Finally, we summed AUCs of both time-windows of the unmatched data and computed differences between correct and error trials as an overall summary statistic (ΔAUC). One-tailed tests of the Pearson correlations between these ΔAUC and the percentage of low-threshold erroneous responses in correct trials (i.e. number of correct trials with low-threshold erroneous responses/number of correct trials) informed about the relation of error cancellation to successful abortion of subthreshold errors. In the case of a nonsignificant correlation, we would compute a Bayes factor using a shifted, scaled beta distribution as prior (r scale parameter = 1/3) [35,36].
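The summary statistic and its correlation with the rate of subthreshold errors can be sketched as follows. The function names and argument layout are hypothetical, and the significance test and Bayes factor are deliberately left out; only ΔAUC and the Pearson coefficient are computed.

```python
import math


def delta_auc(correct_pre, correct_post, error_pre, error_post):
    """Summed AUC of correct trials minus summed AUC of error trials."""
    return (correct_pre + correct_post) - (error_pre + error_post)


def pearson_r(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```

In use, `x` would hold one ΔAUC per participant and `y` the corresponding percentage of correct trials containing a low-threshold erroneous response.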

Response time and peak force analyses
To further validate our approach, we probed whether RTs and PFs were higher for correct than for erroneous responses in separate one-tailed paired-samples t-tests. We also checked the success of our matching procedure by employing equivalence tests [37]. In one-tailed one-sample t-tests, we tested whether differences between correct and erroneous responses were greater than −2 ms in RTs or −0.05 a.u. in PFs, and less than 2 ms in RTs or 0.05 a.u. in PFs. We chose these boundaries based on the differences observed in the unmatched data of our validation study, which were about 3 (RTs) or 4 (PFs) times as large as the effective boundaries (see electronic supplementary material). If differences were both significantly greater than the lower boundary and smaller than the upper boundary (we reported the test with the smaller t-statistic), we assumed equivalence of correct and erroneous responses. If a matching procedure did not yield comparable datasets, we would trim the distribution of the respective dependent variable (PF or RT) of the error data. We would remove the bottom 5% of data points for participants whose mean difference scores were equivalent to or lower than the negative test value or equivalent to or higher than the positive test value and re-run the matching procedure. This process would be iterated until reaching a satisfactory match. If all participants were within the critical values of the equivalence tests but there was still no significant equivalence, we would trim the bottom 5% of data points of erroneous responses for all participants. This process would again be iterated until reaching a satisfactory match.
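The equivalence check described above amounts to two one-sided one-sample t-tests against symmetric bounds. The sketch below computes only the two t-statistics and compares them with a critical value supplied by the caller; the function names are hypothetical, and looking up the critical t for the relevant degrees of freedom is assumed to happen elsewhere.

```python
import math
import statistics


def tost_t_statistics(diffs, bound):
    """t-statistics of the two one-sided tests against -bound and +bound.

    diffs: per-participant differences (correct minus error).
    bound: equivalence bound, e.g. 2 (ms) for RTs or 0.05 (a.u.) for PFs.
    """
    n = len(diffs)
    mean = statistics.mean(diffs)
    se = statistics.stdev(diffs) / math.sqrt(n)  # standard error of the mean
    t_lower = (mean + bound) / se  # tests mean > -bound; should be large
    t_upper = (mean - bound) / se  # tests mean < +bound; should be very negative
    return t_lower, t_upper


def is_equivalent(diffs, bound, t_crit):
    """True if both one-sided tests exceed the one-tailed critical value."""
    t_lower, t_upper = tost_t_statistics(diffs, bound)
    return t_lower > t_crit and t_upper < -t_crit
```

Equivalence is only claimed when both one-sided tests are significant, which is why the weaker of the two (the smaller absolute t-statistic) is the informative one to report.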
Finally, we determined the mean RT of the RT-matched data, which will be referred to as mean matched RT in the following. If all three AUC analyses pointed towards smaller differences between correct and erroneous responses before reaching the PF, we would conduct follow-up analyses on the time course of the force profile. More precisely, we would analyse mean forces of RT-matched data in time-windows of 20 ms preceding the mean matched RT in a 5 × 2 ANOVA with the within-subjects factors time-window (99 to 80 ms versus 79 to 60 ms versus 59 to 40 ms versus 39 to 20 ms versus 19 ms to mean matched RT) and accuracy (correct versus error). We would test for violations of sphericity and report Greenhouse-Geisser corrections along with the corresponding ε estimate if necessary. A significant interaction of both factors would be followed up by separate one-tailed paired-samples t-tests to test whether correct forces were higher than erroneous forces in each time-window.
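The five 20 ms windows preceding the mean matched RT can be extracted as in the sketch below. It assumes one force sample per millisecond and a known sample index for the mean matched RT; neither the sampling rate nor the function name `window_means` is taken from this section.

```python
def window_means(force, rt_index, width=20, n_windows=5):
    """Mean force in consecutive windows ending at the mean matched RT.

    force: force samples at 1 ms resolution (assumed).
    rt_index: index of the sample at the mean matched RT.
    Returns means ordered from the earliest window (99 to 80 ms before
    the RT) to the latest (19 ms before the RT up to the RT itself).
    """
    first = rt_index - n_windows * width + 1
    if first < 0 or rt_index >= len(force):
        raise ValueError("trace does not cover the requested windows")
    means = []
    for k in range(n_windows, 0, -1):
        start = rt_index - k * width + 1        # e.g. RT - 99 for k = 5
        stop = rt_index - (k - 1) * width + 1   # exclusive upper bound
        window = force[start:stop]
        means.append(sum(window) / len(window))
    return means
```

Each window contains exactly 20 samples, so the per-window means for correct and erroneous responses can enter the 5 × 2 ANOVA directly.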

Validation study
We conducted a small-scale validation study to establish that the proposed study procedures, data processing routines and power considerations were feasible (see the electronic supplementary material for a detailed report; electronic supplementary material, figure S3 summarizes the corresponding results). In short, these data indicated the proposed study design and analysis plan to be feasible and the planned sample size to deliver appropriate power for all relevant effect sizes (greater than 95% for all hypotheses, greater than 99% for the main hypothesis). AUCs for RT-matched trials were smaller for erroneous than for correct responses (155 versus 161 a.u.; Hypothesis 3), F(1,33) = 9.88, p = 0.004, ηp² = 0.23, and before than after the peak (66 versus 92 a.u.), F(1,33) = 318.62, p < 0.001, ηp² = 0.91. The interaction of both factors was not significant (Hypothesis 4), F(1,33) = 1.78, p = 0.191, ηp² = 0.05. Hypothesis 5 was not tested because differences between correct and erroneous responses in AUCs before the peak were not significant in the PF-matched data (significant effects in all datasets were a precondition for testing Hypothesis 5).

Discussion
The current study investigated whether agents cancel erroneous actions even on the short timescale of a simple keypress. We therefore assessed RDs of keypress responses and measured the force profile of each response to scrutinize how early this process of error cancellation kicks in. Crucially, we aimed at matching correct and erroneous responses for as many surface parameters as possible (specifically, RT and PF) to arrive at a pure comparison of the corresponding duration data. Our main results indeed showed that RDs were consistently shorter for erroneous than for correct responses even when controlling for the significantly shorter RTs and lower PFs of erroneous responses (Hypothesis 1). Matching for RT did not seem to exert an impact on effects of accuracy in RD (unmatched and matched ηp² = 0.79). However, the size of the effect considerably dropped for the PF-matched data although it was still highly systematic (ηp² = 0.31). This drop might indicate, first, that error cancellation does not only reduce the duration but also the strength of execution, or second, that the same initiation and planning processes that led to reduced PFs also reduced RDs, irrespective of any error cancellation efforts. The results of the PF-matched data deliver strong support, however, that RDs reflect error cancellation proper.
Differences in RDs already emerged early during responding, namely in the time epoch that spanned from response onset to PF, and continued to be evident after the peak (Hypothesis 2). In fact, these effects were robust across datasets at both timepoints. Accordingly, error cancellation should already emerge before or during the deflection of the error-related negativity (ERN), which is at odds with the interpretation of the ERN as reflecting the earliest time point of error processing [10,11]. Early error cancellation, however, complements findings on early preparation of error-correction responses that have been reported for manual actions [17,38,39] and visual search behaviour alike [40].³ Conflict between the execution and the cancellation of an erroneous response (here: pressing versus releasing a key) might instead contribute to the ERN. Recent data from our laboratory indeed point to a strong impact of erroneous RDs on the size of the ERN [41]. Another promising avenue to explore error detection and cancellation in this regard would be a comparison of RDs between correct and erroneous responses for subthreshold and supra-threshold responses. Early error detection during action planning might lead to traces of error cancellation for subthreshold responses. The results of the current study already demonstrate that active countermeasures against errors operate at a considerably earlier timepoint than previously assumed (see also [42,43] for evidence on early error sensations).
Erroneous responses did not only start earlier (Hypothesis 7), peak at a lower level (Hypothesis 8) and end earlier than correct responses; they were also enacted with less overall force than correct actions, as reflected in smaller AUCs for erroneous responses in all three datasets (Hypothesis 3). Although this difference applied to both parts of the force curve, i.e. before and after the peak of the response in the unmatched and the RT-matched data, it was only evident after the peak in the PF-matched data (Hypotheses 4 and 5). Therefore, we conclude that response force is attenuated for errors especially in a late phase of responding. Together with the finding of larger differences in RDs after than before the peak, these results suggest that cancellation efforts become stronger over the course of responding erroneously, further rebutting alternative explanations of these effects in terms of response planning and initiation. The late attenuation of overall force might also explain why we did not find evidence for a relationship between this effect and successful abortion of erroneous subthreshold responses. Instead, indicators for early cancellation success, as for example particularly short RDs of erroneous responses before force peaks, might show a stronger relation with the successful cancellation of errors before response threshold. Further, we might have chosen a relatively weak indicator of success in the abortion of erroneous subthreshold responses. High values might indicate that agents usually prepared both responses up to a certain level whereby the correct response gained activation more rapidly in the majority of episodes without any active cancellation of the erroneous response.
³ Erroneous saccades to a frequent distractor location during visual search have been shown to come with dwell times that are too small to allow for planning a new saccade only after landing on a distractor (less than 150 ms; [40]). These observations mirror early correction responses in manual tasks in that planning of a correction response starts even before the erroneous response is fully performed [38]. In contrast with manual actions, however, error correction (i.e. performing the intended correct response) and error cancellation (i.e. aborting the current erroneous response) are necessarily confounded for eye-movements. The present results suggest that low dwell times probably draw on contributions from correction and cancellation alike.
The observation of immediate error cancellation also suggests that the erroneous action might be subject to inhibition. Instead, erroneous responses seem to remain even more accessible in an upcoming action episode compared with a neutral response that neither corresponded with the preceding erroneous nor the actual correct response ([25]; see also [44,45]). This observation, however, does not contradict the assumption of an overall inhibition of motor activity [19]. It would still be feasible to assume that responding is inhibited in general whereby the specific erroneous response receives somewhat less inhibition. Whether the strength of error cancellation relates to the future accessibility of the erroneous response is an open question worthy of exploration. These considerations also establish intriguing links toward theories of maladaptive and adaptive error processing [4]. Error cancellation itself qualifies as an adaptive mechanism in potentially avoiding negative consequences of the error. However, the current perspective on maladaptive and adaptive processes relates to behaviour after an error; error cancellation therefore calls for an extension of this perspective.

Conclusion
At a timescale that researchers have attributed to mere error detection, erroneous actions already show a reliable pattern of cancellation. We assume that error cancellation will be even more powerful for more complex actions and sequences of actions where agents still have a good chance to mitigate (some of) the consequences of their errors by cancelling ongoing motor activity.
Ethics. This research complies with the ethical regulations of the Ethics Committee of the local Institute of Psychology, the German Psychological Society and the German Research Foundation. This study qualifies for approval without individual review by the local ethics committee as participants signed informed consent, data collection is anonymous, and the task does not pose any foreseeable risk for participants.
Data accessibility. The data and analysis code for the pilot analyses and for the pre-registered main study are publicly available, as are the data and analysis code for the validation study described in the electronic supplementary material [46] and the Stage 1 protocol (osf.io/5v9es).