Decoding and encoding models reveal the role of the depth of processing in the brain representation of meaning

How the brain representation of conceptual knowledge varies as a function of processing goals remains unclear. We hypothesized that the brain representation of semantic categories is shaped by the depth of processing. Participants were presented with visual words during functional MRI. During shallow processing, participants had to read the items; during deep processing, they had to mentally simulate the features associated with the words. Multivariate classification, informational connectivity and encoding models were used to reveal how the depth of processing determines the brain representation of word meaning. Decoding accuracy in putative substrates of the semantic network was enhanced when the depth of processing was high, and the brain representations were more generalizable in semantic space relative to shallow processing contexts. This pattern was observed even in association areas in inferior frontal and parietal cortex. Deep information processing during mental simulation also increased the informational connectivity within key substrates of the semantic network. To further examine the properties of the words encoded in brain activity, we compared computer vision models (associated with the image referents of the words) and word embeddings. Computer vision models explained more variance of the brain responses across multiple areas of the semantic network. These results indicate that the brain representation of word meaning is highly malleable by the depth of processing imposed by the task, relies on access to visual representations, and is highly distributed, including prefrontal areas previously implicated in semantic control.


Introduction
Grounded models of semantic cognition propose that knowledge about the world is re-enacted in the same modality-specific brain systems that are involved in perceptual or action processes. For instance, the concept of 'guitar' comprises the way it looks, how it is played, and the sound it makes. Damasio's convergence zone theory proposes that re-enactment is mediated by brain association areas that integrate information from modality-specific systems (Damasio, 1989). The perceptual symbols theory (Barsalou, 1999) proposes that co-activation patterns in sensory-motor substrates are critical. On this account, conceptual knowledge involves an agent's brain simulating the different properties of the object in question (e.g. shape, color, texture, sound, action) in a way that resembles how the information is encoded in sensorimotor systems during overt behaviour.
However, how the brain representational spaces of conceptual knowledge vary as a function of internal processing goals and strategies remains unclear. For instance, previous research has not determined the role of the depth of processing in the brain representation of concepts. In some studies participants were required to actively process the meaning of the experimental stimuli (T. M. Mitchell et al., 2008; Shinkareva et al., 2011; Buchweitz, Shinkareva, Mason, Mitchell, & Just, 2012), while in other studies participants were merely required to pay attention to the items (Simanova et al., 2012; Huth et al., 2016). Here we addressed how semantic decodability and generalizability in key substrates of the semantic network (Binder, Desai, Graves, & Conant, 2009) are affected by the depth of processing, and further determined the extent to which different types of features (i.e. semantic/syntactic vs. visual) are encoded in brain activity. According to perceptual symbols theory, conceptual knowledge is supported by simulation processes in perceptual systems, and this simulation is thought to occur automatically, without the need to commit the information to a conscious working memory system (Barsalou, 2008).
We used functional MRI during a visual word recognition task in which participants either engaged in mental simulation of word items (henceforth the deep processing context) or merely read the items (the shallow processing context), and then applied multivariate pattern (MVP) analyses and encoding models to test the following hypotheses in a set of left-lateralised regions involved in visual word processing based on a previous meta-analysis (Binder et al., 2009). If the depth of processing influences the brain representational spaces of meaning, then (i) the level of decoding accuracy of word category should be superior during mental simulation relative to when the words are merely read; (ii) the corresponding MVP of a concept ought to be more generalizable with respect to new examples of the same category (i.e. MVP classifiers trained with a subset of the words should better predict out-of-sample words not used for training); and (iii) if explicit mental simulation supports integrated brain representations of meaning, then the level of informational connectivity between key substrates of the semantic network (i.e. the temporal correlations in the level of semantic decodability across brain areas) ought to increase relative to shallower processing conditions. Lastly, (iv) we tested different encoding models based on word embeddings and on computer vision models fitted with the image referents of the words, to examine the properties of the corresponding brain representations, in particular how semantic/syntactic vs. visual properties associated with the words are encoded across the shallow and deep processing conditions. For instance, if simulation processes for word concepts occur automatically in perceptual systems (Barsalou, 2008), then we would expect computer vision models to explain brain responses during semantic processing.

Participants
Following informed consent, twenty-seven participants (20-33 years, mean age: 24±3, 10 males) took part in return for monetary compensation. The study conformed to the Declaration of Helsinki and was approved by the BCBL Research Ethics Board.

Experimental Task and Procedure
The experiment was programmed using Psychopy (Peirce, 2007) and comprised 8 fMRI runs. Each trial began with a fixation period of 250 ms followed by a blank screen of 500 ms (see Figure 1) and then by the target visual word, which was displayed for 1 s. The word was randomly selected from a pool of 18 living and 18 non-living words. Stimuli were centrally presented in white against a black background in uppercase Arial font. The offset of the word was followed by a blank screen of 4 s. During these 4 s, depending on the session instructions, participants were instructed either to read the word (the shallow processing condition) or to mentally simulate the properties associated with the word (e.g. its shape, its color, etc.), henceforth the deep processing condition. A red asterisk was centrally presented during the inter-trial interval, and participants were instructed to relax and wait for the next trial. To ensure that the participants focused on the stimuli and the task, a maximum of two catch trials were set to appear at random points in each of the sessions. These catch trials showed number words (zero, one and three) in place of the usual living/non-living words, and participants were asked to respond by pressing any one of the four buttons on the fMRI response pad. The total number of catch trials was kept equal across conditions. To maximize the separation between the brain responses for the individual trials, the time for which the asterisk stayed on the screen was jittered between 6 and 8 s. The jitter was based on a pseudo-exponential distribution, resulting in 50% of trials with an inter-trial interval of 6 s, 25% with 6.5 s, 12.5% with 7 s, and so on.
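The pseudo-exponential jitter described above can be sketched as follows. This is a minimal illustration, not the authors' code: the exact step size (0.5 s) and the handling of the residual probability mass at the 8 s ceiling are assumptions.

```python
# Sketch of the pseudo-exponential inter-trial jitter: 50% of trials at 6 s,
# 25% at 6.5 s, 12.5% at 7 s, and so on in 0.5 s steps up to 8 s (assumed).
import numpy as np

def sample_jitters(n_trials, lo=6.0, hi=8.0, step=0.5, seed=0):
    rng = np.random.default_rng(seed)
    values = np.arange(lo, hi + step, step)        # 6.0, 6.5, 7.0, 7.5, 8.0
    probs = 0.5 ** np.arange(1, len(values) + 1)   # 0.5, 0.25, 0.125, ...
    probs[-1] += 1.0 - probs.sum()                 # absorb remainder so probs sum to 1
    return rng.choice(values, size=n_trials, p=probs)

jitters = sample_jitters(200)
```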

MRI Data Preprocessing
The preprocessing of fMRI data was performed using FSL FEAT (FMRIB Software Library, v5.0). The first 9 volumes were discarded to ensure steady-state magnetisation; non-brain tissue was removed using the brain extraction tool (BET) (Smith, 2002); volume realignment was performed using MCFLIRT (Jenkinson, Bannister, Brady, & Smith, 2002); minimal spatial smoothing was performed using a Gaussian kernel with a FWHM of 3 mm. Next, ICA-based automatic removal of motion artifacts (ICA-AROMA) was used to remove motion-induced signal variations (Pruim et al., 2015), followed by a high-pass filter with a cutoff of 60 s. The sessions were aligned to a reference volume of the first session.

Figure 1: Illustration of the experimental protocol.

Multivariate Pattern Analysis for Decoding
Multivariate pattern analysis was conducted using scikit-learn (Pedregosa et al., 2011) and PyMVPA (Hanke et al., 2009). Specifically, classification based on a supervised machine learning algorithm, namely a linear support vector machine (Fan, Chang, Hsieh, Wang, & Lin, 2008), was used to evaluate whether multi-voxel patterns in each of the ROIs contained information about the semantic category of the word (living vs. non-living) in each of the experimental contexts (i.e. deep and shallow processing). The 15 pre-specified left-lateralized regions were: inferior parietal lobe, inferior temporal lobe, middle temporal lobe, precuneus, fusiform gyrus, parahippocampal gyrus, superior frontal gyrus, posterior cingulate gyrus, pars opercularis, pars triangularis, pars orbitalis, frontal pole, medial orbitofrontal cortex, lateral orbitofrontal cortex and anterior temporal lobe.

Data Preparation
For each participant, the relevant time points (scans) of the preprocessed fMRI data of each run were labeled with attributes such as word, category, and condition using the behavioural data files generated by Psychopy. Invariant voxels (features), i.e. voxels whose value did not vary throughout a session, were removed; if not removed, such features can cause numerical difficulties with procedures like z-scoring. Next, data from all sessions were stacked, and each voxel's time series was run-wise z-scored (normalized) and linearly detrended. Finally, to account for the hemodynamic lag, an example was created for each trial by averaging the 4 volumes in the interval between 3.4 s and 6.8 s after word onset.
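A minimal sketch of this trial-example construction on synthetic data is shown below. The TR of 1.7 s is an assumption for illustration (it is not stated in this section), as is the use of scipy for z-scoring and detrending.

```python
# Sketch of data preparation: per-run z-scoring, linear detrending, then
# averaging the 4 volumes falling 3.4-6.8 s after each word onset.
import numpy as np
from scipy.signal import detrend
from scipy.stats import zscore

def make_trial_examples(run_data, onsets, tr=1.7):
    """run_data: (n_volumes, n_voxels) array for one run; onsets in seconds."""
    data = detrend(zscore(run_data, axis=0), axis=0)       # normalize, then detrend
    first = np.round((np.asarray(onsets) + 3.4) / tr).astype(int)
    return np.stack([data[f:f + 4].mean(axis=0) for f in first])

run = np.random.default_rng(0).normal(size=(100, 50))      # toy run: 100 volumes x 50 voxels
examples = make_trial_examples(run, onsets=[10.2, 44.2, 78.2])
```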

Pattern Classification
A linear support vector machine (SVM) classifier was used, with all parameters set to the default values provided by the scikit-learn package (l2 regularization, C = 1.0, tolerance = 0.0001). The following procedure was repeated for each ROI separately. To obtain an unbiased generalization estimate, following Varoquaux et al. (2016), the data were randomly shuffled and split to create 300 balanced train-test (80%-20%) splits with separate items for training and testing. Each example was represented by a single feature vector, with each feature being the mean of a voxel's intensities across the 3.4-6.8 s subinterval; hence, the length of the feature vector was equal to the number of voxels in the ROI. To further reduce the dimensionality of the data and thus the chances of overfitting (Pereira, Mitchell, & Botvinick, 2009; T. Mitchell et al., 2004), Principal Component Analysis (PCA) was applied, with all parameters set to the default values provided by scikit-learn. The number of components was equal to the number of examples, thus resulting in all ROIs having an equal number of components. These components were linear combinations of the preprocessed voxel data, and since none of the components was excluded, this was a lossless change of coordinate system to a subspace spanned by the examples (Mourão-Miranda, Bokde, Born, Hampel, & Stetter, 2005). The features thus created were used to train the decoder, and its classification performance on the test set was recorded. This procedure was repeated separately for each of the 300 splits, and the mean of the corresponding accuracies was computed for each participant.
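The classification pipeline above can be sketched with scikit-learn on synthetic data. This is an illustrative reconstruction, not the authors' code; for speed, 3 stratified 80/20 splits are used instead of the 300 in the paper, and the trial counts and signal strength are invented.

```python
# Sketch of the ROI decoding pipeline: PCA (all components kept) followed by
# a linear SVM with default parameters, evaluated over stratified random
# 80/20 train-test splits.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(36, 500))          # 36 trial examples x 500 voxels (toy ROI)
y = np.repeat([0, 1], 18)               # living vs. non-living labels
X[y == 1] += 0.3                        # inject a weak category signal

clf = make_pipeline(PCA(), LinearSVC(C=1.0, tol=1e-4))
cv = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=0)
accs = cross_val_score(clf, X, y, cv=cv)
mean_accuracy = accs.mean()
```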

Statistics
To determine whether the observed decoding accuracy in a given ROI was significantly different from the chance level of 0.5 (50%), a two-tailed t-test was performed, with the p-values corresponding to each of the ROIs corrected for multiple comparisons using the false discovery rate (FDR).
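This test can be sketched as follows on simulated accuracies. The Benjamini-Hochberg procedure is one standard FDR correction and is assumed here (the paper does not specify which FDR variant was used); the per-subject accuracies are synthetic.

```python
# One-sample two-tailed t-tests of per-subject ROI accuracies against the
# 0.5 chance level, with Benjamini-Hochberg FDR correction across ROIs.
import numpy as np
from scipy.stats import ttest_1samp

def fdr_bh(pvals):
    """Benjamini-Hochberg adjusted p-values."""
    p = np.asarray(pvals, dtype=float)
    n = len(p)
    order = np.argsort(p)
    adj = p[order] * n / np.arange(1, n + 1)
    adj = np.minimum.accumulate(adj[::-1])[::-1]   # enforce monotonicity
    out = np.empty(n)
    out[order] = np.clip(adj, 0, 1)
    return out

rng = np.random.default_rng(0)
accuracies = 0.5 + rng.normal(0.05, 0.05, size=(27, 15))  # toy: 27 subjects x 15 ROIs
pvals = ttest_1samp(accuracies, 0.5, axis=0).pvalue
qvals = fdr_bh(pvals)
significant_rois = np.where(qvals < 0.05)[0]
```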

Informational Connectivity Pipeline
Informational connectivity analysis is used to identify regions of the brain that display temporal correlation in the multivariate patterns regarding the key stimulus classes (Coutanche & Thompson-Schill, 2013). The purpose of this analysis here was to investigate how informational connectivity between the 15 pre-specified ROIs varies across the deep and shallow processing conditions. The fMRI data were preprocessed and labeled as described in § 2.3 and § 2.4.1. PCA was used for dimensionality reduction, and SVM for classification (see § 2.4.2). A leave-one-trial-out cross-validation was performed. Specifically, the classifier was trained on all the volumes in the 3.4-6.8 s subinterval of all the trials except one, and was tested on all the volumes (not confined to any subinterval) of the left-out trial. All the trials were used, one by one, as test trials starting with the first trial, and the corresponding probability of detecting the correct class was recorded for each of the volumes. In this way, a time series of MVP discriminability values (one per time point) was obtained for each of the ROIs. To calculate the informational connectivity between the ROIs, these time series were correlated using the Pearson product-moment correlation for each pair of ROIs, and a matrix of informational connectivity was created. This procedure was performed separately for the shallow and deep processing sessions, resulting in two matrices for each participant.
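The final step of this pipeline (correlating per-ROI decodability time series) can be sketched as follows; the time series here are simulated rather than derived from classifier probabilities.

```python
# Sketch of the informational-connectivity computation: given one decoding-
# probability time series per ROI, correlate every pair of ROI time series
# to build the 15 x 15 connectivity matrix.
import numpy as np

rng = np.random.default_rng(0)
n_rois, n_timepoints = 15, 400
shared = rng.normal(size=n_timepoints)               # common decodability fluctuations
series = 0.5 * shared + rng.normal(size=(n_rois, n_timepoints))

ic_matrix = np.corrcoef(series)                      # pairwise Pearson correlations
```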

Word Embedding Models
The word embedding models used in the analysis (FastText, GloVe and Word2Vec) were pre-trained (Bravo-Marquez & Kubelka, 2018) on the Spanish Billion Word Corpus (http://crscardellino.github.io/SBWCE/). A common feature of these three models is that they are trained on a text corpus and are hence sensitive to characteristics of the text that capture both semantic and syntactic information. For each word used in the current experiment, the corresponding 300-dimensional (300-D) vector representation was extracted; 300 is a conventional dimensionality commonly used in the natural language processing community. To visualize the representational patterns of the word embedding features, we computed the representational dissimilarity matrices (RDMs) of the models. Prior to visualization, the representational feature of each word was normalized by subtracting the mean. The RDMs for FastText (see Figure S1), GloVe (Figure S2), and Word2Vec (Figure S3) are shown in the Supplementary materials.
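The RDM computation can be sketched as follows, with random vectors standing in for the real pre-trained 300-D embeddings. Mean-centering per word vector and correlation distance as the dissimilarity metric are assumptions, as the section does not name the distance used.

```python
# Sketch of an RDM for a word embedding model: mean-center each word's
# feature vector, then take pairwise correlation distances between words.
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(36, 300))                    # 36 words x 300-D vectors
centered = embeddings - embeddings.mean(axis=1, keepdims=True)
rdm = squareform(pdist(centered, metric='correlation'))    # 36 x 36 dissimilarities
```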

Computer Vision Models
We applied computer vision models to extract abstract representations of the image referents associated with the words used in the present study. VGG19 (Simonyan & Zisserman, 2014), MobileNetV2 (Howard et al., 2017), and DenseNet169 (Huang, Liu, Van Der Maaten, & Weinberger, 2017) were the computer vision models used in the current analyses. These were pre-trained models provided by the Keras Python library (TensorFlow 2.0 edition) (Chollet, 2015) using the ImageNet dataset (http://www.image-net.org/). The pre-trained models are, in general, multi-layer convolutional neural networks: after several convolution-pooling blocks, a global pooling layer represents the image with a dense vector, and several fully-connected feedforward layers then estimate the probability of each class. The training process involved predicting one of the one thousand ImageNet classes given an image. However, the representational dimensionality of the computer vision models differs from that of the word embedding models (e.g. VGG19 represents images with 512-D vectors, while DenseNet169 uses 1664-D vectors). Thus, we performed fine-tuning (Mesnil et al., 2011; Yosinski, Clune, Bengio, & Lipson, 2014) so that each of the computer vision models had a 300-D representation, matching the word embedding models. The computer vision models were fine-tuned using 101 assorted object category images (Caltech101; http://www.vision.caltech.edu/Image_Datasets/Caltech101/) (Fei-Fei, Fergus, & Perona, 2004). 'BACKGROUND Google', 'Faces', 'Faces easy', 'brain', 'stop sign', and 'trilobite' were removed from the categories due to the lack of a conceptual relationship to the living/non-living categories of our experiment. In order to balance the instances of each category, 30 images were randomly selected from each image set. Six images of each category were randomly selected to form the validation set, while the rest formed the training set.
We added an additional layer of 300 units, and a classification layer to decode the remaining 96 categories. The convolution-pooling blocks of the pre-trained model were frozen during training and validation, and only the newly added layers' weights were updated. In order to obtain robust performance, image augmentation such as rotation, shifting, zooming, and flipping was applied. Before training or validation, all images were resized to 128 by 128 pixels and normalized individually according to the selected model (see the preprocessing steps in the Keras documentation). The activation function of the layer with 300 units was the scaled exponential linear unit (SELU) (Klambauer, Unterthiner, Mayr, & Hochreiter, 2017), and the weights were initialized with the LeCun normal initialization procedure (Y. A. LeCun, Bottou, Orr, & Müller, 2012), as suggested in the TensorFlow documentation (https://www.tensorflow.org/api_docs/python/tf/keras/activations/selu/). The optimizer was Adam (Kingma & Ba, 2014) with a learning rate of 0.0001, and the loss function was categorical cross-entropy. No other regularization procedures were used. During training, fine-tuning could run for a maximum of 3000 epochs with a batch size of 16 images per step of back-propagation; however, if the model's performance on the validation set did not improve for 5 consecutive epochs, training was terminated and the fine-tuned weights were saved for later use.
These fine-tuned computer vision models were applied to extract abstract representations of the image referents of the words. In particular, for each word, we sampled 10 images collected from the internet. Images were cropped so that the object appeared at the center on a white background, and the values of each image were normalized according to the selected model. The output vector of the newly added 300-unit layer for a given image was the feature representation associated with the image referent of the word, and the 10 vector representations were averaged. Feature extraction was performed trial-wise for each participant. We selected the above three models on the following grounds. Evidence has shown that DenseNet169 (Freeman, Ziemba, Heeger, Simoncelli, & Movshon, 2013; Majaj, Hong, Solomon, & DiCarlo, 2015) can explain on average more variance of the brain response than many other computer vision models. MobileNetV2 was chosen to represent a simpler computer vision model with fewer parameters (VGG19 has 143,667,240, DenseNet169 has 14,307,880, and MobileNetV2 has 3,538,984). VGG19 was chosen to represent a baseline model with a shallower architecture (VGG19 has 26 layers, DenseNet169 has 169 layers, and MobileNetV2 has 88 layers), which has been well studied in previous work (Ren, He, Girshick, & Sun, 2015; He, Zhang, Ren, & Sun, 2016; Russakovsky et al., 2015).
To visualize the representational patterns of the computer vision models, we computed their representational dissimilarity matrices (RDMs), as for the word embedding models. These are shown in the Supplementary materials (DenseNet169, Figure S4; MobileNetV2, Figure S5; VGG19, Figure S6).

Encoding Model Pipeline
The encoding model pipeline was the same as in Miyawaki et al. (2008) and was implemented in Nilearn (Pedregosa et al., 2011; Buitinck et al., 2013). After standardizing the feature representations by subtracting the mean and dividing by the standard deviation, the feature representations were mapped to the BOLD signal of a given ROI by means of L2-regularized (ridge) regression with the regularization term set to 100, following Miyawaki et al. (2008).
To estimate the performance of the regression, we partitioned the data into 300 folds to perform cross-validation by stratified random shuffling, similar to the decoding pipeline above (Little et al., 2017; Varoquaux et al., 2016). It is important to note that the labels of the BOLD signals were only used for cross-validation partitioning; they were not used in the encoding model fitting or testing procedures. In each fold, we randomly held out 20% of the data for testing, while the remaining 80% was used to fit the ridge regression model, using the feature representations from the word embedding models or computer vision models as features and the BOLD signals as targets. Predictions were then derived for the held-out data, and the proportion of variance explained by the predictions was computed for each voxel and averaged across folds. The best possible score is 1; the score can also be negative if the model performs worse than random guessing. Voxels with positive variance-explained values were identified for further analysis (Miyawaki et al., 2008; Holdgraf et al., 2017) for each participant, ROI and condition. To estimate the empirical chance-level performance of the encoding models, a random shuffling step was added to the training phase during cross-validation, before model fitting: the order of the samples for the features was shuffled while the order of the samples for the targets remained the same.
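The encoding pipeline above can be sketched on synthetic data as follows. This is an illustrative reconstruction: 3 random splits stand in for the 300 stratified folds, and the trial, feature and voxel counts are invented.

```python
# Sketch of the encoding pipeline: ridge regression (alpha = 100) maps
# standardized stimulus features to voxel responses; per-voxel variance
# explained is computed on held-out splits and averaged.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import ShuffleSplit
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
features = rng.normal(size=(144, 300))                 # trials x model features
weights = rng.normal(size=(300, 50))
bold = features @ weights + rng.normal(scale=5.0, size=(144, 50))  # trials x voxels

features = StandardScaler().fit_transform(features)
scores = []
for train, test in ShuffleSplit(n_splits=3, test_size=0.2, random_state=0).split(features):
    model = Ridge(alpha=100).fit(features[train], bold[train])
    pred = model.predict(features[test])
    scores.append(r2_score(bold[test], pred, multioutput='raw_values'))

voxel_r2 = np.mean(scores, axis=0)                     # average variance explained per voxel
positive_voxels = np.where(voxel_r2 > 0)[0]            # kept for further analysis
```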

Multivariate Pattern Analyses: Decoding Results
Notably, decoding accuracy in most of the pre-specified regions was higher during the deep processing condition than during the shallow processing condition (see Figure 3). The black asterisks mark ROIs that showed a statistically significant improvement in decoding accuracy in the deep compared to the shallow processing condition.

Out-of-sample Generalization
We then repeated the decoding analyses with a different cross-validation procedure that allowed us to test the generalizability of the semantic representations and how it was modulated by the different task contexts. Specifically, the classifier was trained using all the words but leaving a pair of words out from each class; it was then tested on the left-out pair. Figure 4 presents the summary statistics of the ROIs for out-of-sample generalization in both shallow and deep processing conditions. In the shallow processing condition, decoding of the semantic category (living/non-living) was at chance level in all pre-specified ROIs, including the frontal pole. In the deep processing condition, decoding accuracy in 13 out of 15 ROIs was better than in the shallow processing condition (see Figure 4). This shows that generalization of brain representations of semantic knowledge is better when the depth of processing is higher, during mental simulation.
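The out-of-sample generalization scheme can be sketched as follows on synthetic data. As a simplifying assumption, one word per class is held out to form the test pair, and each word contributes a single trial; in the real design words repeat across trials.

```python
# Sketch of leave-pair-out generalization: train on all words except one
# held-out word per class, test on the held-out pair, repeat over random pairs.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(36, 200))           # one (toy) trial per word x 200 voxels
y = np.repeat([0, 1], 18)                # 18 living + 18 non-living words
words = np.arange(36)                    # synthetic word identities
X[y == 1] += 0.3                         # inject a weak category signal

accs = []
for _ in range(20):
    held = np.concatenate([rng.choice(words[y == c], 1, replace=False) for c in (0, 1)])
    test_mask = np.isin(words, held)
    clf = LinearSVC(C=1.0).fit(X[~test_mask], y[~test_mask])
    accs.append(clf.score(X[test_mask], y[test_mask]))
mean_acc = float(np.mean(accs))
```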

Informational Connectivity Results
We then assessed changes in the informational connectivity across the different ROIs, during deep relative to shallow processing conditions. This approach assesses the temporal correlation of multivariate patterns of responses across different ROIs and can reveal whether different brain regions interact in terms of the specific information that they carry (Coutanche & Thompson-Schill, 2013).
Figure 4: Summary statistics of the ROIs for out-of-sample generalization of the semantic category. The three dotted lines inside each violin are the quartiles. The black asterisks mark ROIs that showed a statistically significant improvement in decoding accuracy in the deep compared to the shallow processing condition.

Figure 5 illustrates the regions showing statistically higher correlations of decoding time courses during mental simulation compared to shallow processing. Two substrates showed the strongest changes in informational connectivity with the rest of the semantic network. First, informational connectivity between the left orbitofrontal cortex and the inferior temporal, parahippocampal, fusiform, anterior temporal and inferior parietal regions increased. A similar effect involved the middle temporal cortex, which showed increased informational connectivity during mental simulation with the pars triangularis, medial and lateral orbitofrontal cortex, frontal pole, precuneus, posterior cingulate and anterior temporal lobe. Figure 5 also shows that the pars opercularis and the superior frontal lobe showed no changes in informational connectivity as a function of the depth of processing.

Encoding Results
An encoding model predicts brain activity patterns from a set of features that are (non)linearly transformed from the stimuli (Diedrichsen & Kriegeskorte, 2017; Kriegeskorte & Douglas, 2018). Word embedding models (e.g. Word2Vec (Mikolov, Sutskever, Chen, Corrado, & Dean, 2013)) provide a way to characterize the geometry of semantic spaces. Computer vision models (i.e. deep convolutional neural networks (Y. LeCun, Bottou, Bengio, & Haffner, 1998)) can also reveal the structural organization of meaning in the ventral visual pathway (Simonyan & Zisserman, 2014). To examine the properties of the brain representations during word recognition, we tested both word embeddings and computer vision models based on the image referents of the words used in the experiment (see § 2.4.5 and § 2.4.6). Figure 6 shows the average variance explained by the computer vision models (VGG19, DenseNet169, MobileNetV2) and the word embedding models (FastText, GloVe, Word2Vec), averaged across 27 subjects. The error bars represent 95% confidence intervals from a bootstrap with 1000 iterations. For each ROI, a one-way analysis of variance (ANOVA) was performed within the computer vision models and within the embedding models, aiming to detect differences in variance explained within each type of model. After all the ANOVAs were performed, FDR correction was applied to the raw p-values to correct for multiple comparisons within each condition (deep vs. shallow). There was no difference among the word embedding models, and no difference among the computer vision models.
We then computed the difference between each computer vision model and each word embedding model within each ROI and condition, in order to assess whether word embedding models or computer vision models were better. One-sample t-tests against zero were conducted for each pair with FDR correction. All the computer vision models performed better than any of the word embedding models (see Figure 7).
Note the above results were obtained with pre-trained computer vision models that were fine tuned to ensure that the abstract layer had 300-D for comparison with the word embedding models. Otherwise, the results might not be comparable due to differences in the dimensionality of the models. However, we performed the same encoding analyses using the original dimensionality of the computer vision models and similar patterns were obtained. This indicates the robustness of the encoding results (see Supplementary Results and Supplementary Figures S7, S8, S9).
Finally, we computed the mean of the differences in variance explained between the computer vision models and the word embedding models, and then assessed the extent of this difference in the deep and shallow processing conditions using paired t-tests with FDR correction. Figure 8 illustrates the pattern of results within each ROI. We found that the advantage of computer vision models over word embedding models was higher in the deep processing condition relative to the shallow processing condition in PCG, PHG, and POP, while the opposite pattern was observed in FFG, IPL, and ITL.

Figure 7: Differences between computer vision and word embedding models in variance explained. Computer vision models explained significantly more variance of the BOLD responses than word embedding models.

Discussion
We sought to understand how the depth of processing shapes the brain representations of conceptual knowledge by using both decoding and encoding models. The results clearly show that the decoding of word category information in most of the putative substrates of the semantic network (Binder et al., 2009) is consistently higher during the mental simulation condition relative to a shallow processing condition in which participants merely read the words. Semantic category was also decoded in the read condition, though decoding performance was far lower by comparison and the classifier did not reliably generalise to new examples that were not seen during training. By contrast, the generalizability of the classifier at predicting the semantic category of out-of-sample examples increased during mental simulation. Shallower processing modes are sufficient for in-sample classification but not enough for out-of-sample generalization. Significant decoding of semantic category was observed in multiple areas of association (transmodal) cortex involving the middle temporal gyrus, anterior and inferior temporal, inferior parietal lobe and prefrontal substrates. Of particular relevance are the prefrontal areas, which are typically thought to be involved in semantic control, rather than representing semantic knowledge (Wagner, Paré-Blagoev, Clark, & Poldrack, 2001;Poldrack et al., 1999;Whitney, Kirk, O'Sullivan, Lambon Ralph, & Jefferies, 2010). The present multivariate classification results are consistent with a role of association cortex in the representation of semantic categories too. Interestingly, the level of informational connectivity (Anzellotti & Coutanche, 2018) observed during mental simulation also indicates that association transmodal cortices interact with multiple regions of the semantic network in terms of the specific information that multivoxel activity patterns carry across time. 
Relative to the shallow processing condition, informational connectivity increased during mental simulation between the left anterior prefrontal cortex and inferior parietal, temporal and occipital areas. Likewise, the middle temporal cortex, a region implicated in semantic control according to a prior meta-analysis (Noonan, Jefferies, Visser, & Lambon Ralph, 2013), showed increased informational connectivity with inferior frontal and anterior prefrontal areas, the posterior cingulate and the anterior temporal lobe. The depth of processing associated with mental simulation therefore triggered a general broadcasting effect of the specific content being represented, recruiting a highly distributed set of areas of the semantic network in an integrated manner. The highly distributed nature of semantic decodability across the brain is in keeping with previous fMRI pattern classification studies (Simanova et al., 2012; Shinkareva et al., 2011) and fMRI work employing encoding models (Huth et al., 2016; T. M. Mitchell et al., 2008).
Encoding models further revealed the properties of the semantic representations encoded in brain activity across the different processing depths. We found that computer vision models outperformed word embedding models in explaining brain responses across the different regions of the semantic network. We found that the embedding layer of the computer vision model, which is likely to contain a condensed summarization of visual features that are frequent in the examples of the image referents for a given word, explained more variance of the brain responses. Intriguingly, the advantage of computer vision models was similar regardless of the depth of processing. One may expect however that access to visual representations is more likely when the depth of processing is higher during mental simulation. The evidence here was mixed. While computer vision models explained more variance in the posterior cingulate, inferior frontal and anterior prefrontal cortex during deep compared to shallow processing, the opposite pattern was found in the fusiform, inferior parietal lobe and inferior temporal lobe, which are part of the ventral visual pathway. To account for this pattern of results, we propose that the presentation of a word during shallow processing automatically activates an image referent, but additional contextual features are not activated. By contrast, during mental simulation, additional contextual features associated with the word are likely to be activated. These contextual items may not be related to the image referents that were sampled for the encoding analyses, which only contained the object shape. This might lead to the computer vision models capturing less variance in the ventral visual pathway in the deep processing condition during mental simulation than in the shallow processing case. This would also explain why, in contrast to the pattern of decoding results, encoding model performance was not affected by the depth of processing. 
These results are however also consistent with the view that conceptual knowledge relies on modality specific perceptual representations that need no access to conscious visual (i.e. imagery-based) processing (Pecher, van Dantzig, & Schifferstein, 2009;Soto & Humphreys, 2007).
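The model comparison logic above can be sketched as follows: each candidate feature space (computer vision embedding vs. word embedding) is mapped to voxel responses with a regularized linear encoding model, and the feature spaces are compared on held-out variance explained. This is a minimal illustration with simulated data in which the "vision" features are, by construction, the ones driving the responses; the dimensionalities, ridge penalty and noise level are assumptions for the sketch, not the study's actual pipeline:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data: brain responses driven by one of two feature spaces.
n_words, n_voxels, n_feats = 100, 50, 300
vision_feats = rng.standard_normal((n_words, n_feats))
embed_feats = rng.standard_normal((n_words, n_feats))
weights = rng.standard_normal((n_feats, n_voxels))
brain = vision_feats @ weights + 5.0 * rng.standard_normal((n_words, n_voxels))

def ridge_r2(X, Y, alpha=100.0, n_train=80):
    """Closed-form ridge fit on a training split; mean held-out R^2 across voxels."""
    Xtr, Xte, Ytr, Yte = X[:n_train], X[n_train:], Y[:n_train], Y[n_train:]
    B = np.linalg.solve(Xtr.T @ Xtr + alpha * np.eye(X.shape[1]), Xtr.T @ Ytr)
    resid = Yte - Xte @ B
    ss_res = (resid ** 2).sum(axis=0)
    ss_tot = ((Yte - Yte.mean(axis=0)) ** 2).sum(axis=0)
    return float(np.mean(1 - ss_res / ss_tot))

# The feature space that actually generated the data explains more variance.
print(ridge_r2(vision_feats, brain), ridge_r2(embed_feats, brain))
```

In the real analysis, the same logic is applied per ROI and per condition, with the winning feature space taken as evidence about the format of the underlying representation.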
The encoding results indicate that the semantic and syntactic similarity metrics provided by word embedding models may not capture the actual semantic knowledge that is represented in the brain during word processing.
Rather, it appears that image-based properties are necessary to account for how the brain represents word meaning. That this occurred both during mental simulation and during the shallow processing condition is consistent with models proposing that simulation processes may occur, to some extent, automatically (Barsalou, 2008; Zwaan, Stanfield, & Yaxley, 2002; Stanfield & Zwaan, 2001). However, it is clear from the superior decoding performance and the increased informational connectivity during mental simulation that top-down factors associated with the depth of processing also play a key modulating role. Together, these results demonstrate that the depth of processing is a key factor for triggering highly distributed, generalizable and integrated brain representations in the semantic network.

Competing Interests Statement
The authors declare no competing interests.

Supplementary Results
Figure S7 shows the average variance explained by the computer vision models (VGG19, DenseNet169, MobileNetV2), without fine-tuning, and by the word embedding models (FastText, GloVe, word2vec), averaged across 26 subjects. Error bars represent bootstrapped 95% confidence intervals (1000 iterations). For each ROI, a one-way analysis of variance (ANOVA) was performed within the computer vision models and within the word embedding models across subjects, in order to detect differences in variance explained within each type of model. After all the ANOVAs were performed, FDR correction was applied to the raw p-values to correct for multiple comparisons within each condition (deep vs. shallow). There was no difference among the word embedding models, but there was a significant difference among the computer vision models in most of the ROIs for each condition. MobileNetV2, the simplest of the computer vision models given its smaller number of parameters, explained the most variance in most of the ROIs for each condition. We then computed the difference in variance explained between each computer vision model and each word embedding model within each ROI and condition, in order to assess whether word embeddings or computer vision models were better.

Figure S8: Differences between a word embedding and a computer vision model in variance explained. No fine-tuning was performed for the computer vision models.

One-sample t-tests against zero for each pair were conducted with FDR correction. All the computer vision models performed better than any of the word embedding models (see Figure S8).
Then, we computed the average difference between the computer vision models and the word embedding models, and performed paired t-tests, with FDR correction, to compare this difference between the deep and shallow processing conditions. Figure S9 illustrates the pattern of results within each ROI. We found that the advantage of computer vision models over word embeddings was larger in the deep processing condition than in the shallow processing condition in some of the ROIs (see Figure S9).

Figure S9: Overall difference between word embedding and computer vision models per ROI. No fine-tuning was performed for the computer vision models.
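The per-ROI comparison described above (paired tests followed by Benjamini-Hochberg FDR correction) can be sketched as follows. The subject count matches the study (26), but the ROI count, effect sizes and data are synthetic placeholders:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Synthetic per-subject model-advantage scores (computer vision minus
# word embedding variance explained) in each ROI, for both conditions.
n_subjects, n_rois = 26, 8
deep = rng.normal(0.05, 0.02, (n_subjects, n_rois))
shallow = rng.normal(0.04, 0.02, (n_subjects, n_rois))

# Paired t-test per ROI: is the model advantage larger for deep processing?
pvals = np.array([stats.ttest_rel(deep[:, r], shallow[:, r]).pvalue
                  for r in range(n_rois)])

def fdr_bh(p, q=0.05):
    """Benjamini-Hochberg procedure: boolean rejection mask at FDR level q."""
    order = np.argsort(p)
    ranked = p[order]
    thresh = q * np.arange(1, len(p) + 1) / len(p)
    below = ranked <= thresh
    reject = np.zeros(len(p), dtype=bool)
    if below.any():
        # Reject all hypotheses up to the largest rank passing its threshold.
        reject[order[: np.max(np.nonzero(below)[0]) + 1]] = True
    return reject

print(fdr_bh(pvals))
```

The same correction logic applies to the one-sample tests against zero reported for Figure S8, with the per-pair p-values substituted for the per-ROI ones.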