Bits and pieces: understanding information decomposition from part-whole relationships and formal logic

Partial information decomposition (PID) seeks to decompose the multivariate mutual information that a set of source variables contains about a target variable into basic pieces, the so-called ‘atoms of information’. Each atom describes a distinct way in which the sources may contain information about the target. For instance, some information may be contained uniquely in a particular source, some information may be shared by multiple sources and some information may only become accessible synergistically if multiple sources are combined. In this paper, we show that the entire theory of PID can be derived, firstly, from considerations of part-whole relationships between information atoms and mutual information terms, and secondly, based on a hierarchy of logical constraints describing how a given information atom can be accessed. In this way, the idea of a PID is developed on the basis of two of the most elementary relationships in nature: the part-whole relationship and the relation of logical implication. This unifying perspective provides insights into pressing questions in the field such as the possibility of constructing a PID based on concepts other than redundant information in the general n-sources case. Additionally, it admits of a particularly accessible exposition of PID theory.


Introduction
Partial information decomposition (PID) is an example of a rare class of problems where a deceptively simple question has perplexed researchers for many years, leading to heated disputes over possible solutions [18], simple but incomplete answers [13], and even to statements that the question should not be asked [11]. The core question of PID is how the information carried by multiple source variables about a target variable is distributed over the source variables. In other words, it is the information theoretic question of 'who knows what about the target variable'. Intuitively, answering this question involves finding out which information we could get from multiple variables alike (called redundant or shared information), which information we could get only from specific variables, but not the others (called unique information), and which information we can only obtain when looking at some variables together (called synergistic information).
Examples of questions involving PID are found in almost all fields of quantitative research. In neuroscience, for instance, we are interested in how the activity of multiple neurons recorded in response to a stimulus can provide information about (i.e. encode) the stimulus. Specifically, we are interested in whether the information provided by those neurons about the stimulus is provided redundantly, such that we can obtain it from many (or any) of the recorded neural responses, or whether certain aspects are only present uniquely in individual neurons, but not others; finally, we may find that we need to analyze all neural responses together to decode the stimulus: a case of synergy. All three ways of providing information about the stimulus may coexist and the aim of PID analysis is to determine to what degree each of them is present [24].
In this way PID can be used as a framework for systematically testing and comparing theories of neural processing (such as predictive coding [27] or coherent infomax [26]) in terms of their information theoretic "footprint", i.e. in terms of the amounts of unique, redundant or synergistic information processing predicted by the theory. The key idea is to identify such theories with a specific information theoretic goal function (e.g. "maximize redundancy while at the same time allowing for a certain degree of unique information"). One may then investigate empirically whether a given neural circuit in fact maximizes the goal function in question or one may use the PID framework to come up with entirely new goal functions [25].
The PID problem also arises in cryptography in the context of so-called "secret sharing" [14]. The idea is that multiple participants (the sources) each hold some partial information about a particular piece of information called the secret (the target). However, the secret can only be accessed if certain participants combine their information. In this context, PID describes how access to the secret is distributed over the participants.
The partial information decomposition framework has furthermore been used to operationalize several core concepts in the study of complex and computational systems. These concepts include for instance the notion of information modification [10,23], which has been suggested along with information storage and transfer as one of three fundamental component processes of distributed computation. It has also been proposed that the concepts of emergence and self-organisation can be made quantifiable within the PID framework [16,17].
Despite the universality of the PID problem, solutions have only arisen very recently, and the work of consolidating and distilling them into a coherent structure is still in progress. In this paper we aim to do so by rederiving the theory of partial information decomposition from the perspective of mereology (the study of parthood relations) and formal logic. The general structure of PID arrived at in this way is equivalent to the one originally described by Williams and Beer [29]. However, our derivation has the advantage of tackling the problem directly from the perspective of the parts into which the information carried by the sources about the target is decomposed, the so-called "atoms of information". By contrast, the formulation used until now takes an indirect approach via the concept of redundant information. Furthermore, the approach described here is based on particularly elementary concepts: parthood between information contributions and logical implication between statements about source realizations. The remainder of this paper is structured as follows: First, in §2 we derive the general structure underlying partial information decomposition from considerations of elementary parthood relationships between information contributions. This structure is general in the sense that it still leaves open the possibility for multiple alternative measures of information decomposition. We show that the axioms underlying the formulation by Williams and Beer [7,29] can be proven within the framework described here. In §3 we utilize formal logic to derive a specific PID measure and in this way provide a complete solution to the information decomposition problem. §4 shows that there is an intriguing connection between formal logic and PID in that the mathematical lattice structure underlying information decomposition is isomorphic to a lattice of logical statements ordered by logical implication.
This gives rise to a completely independent exposition of PID theory in terms of a hierarchy of logical constraints on how information about the target can be accessed. In §5 we show that the ideas presented here can be utilized to systematically answer the question of whether a (full n-sources) PID can be induced by measures other than redundant information such as synergy or unique information. Before concluding in §7, we briefly address the important distinction between parthood relations and quantitative relations in §6.

The parthood perspective
Suppose there are n source variables S 1 , . . . , Sn carrying some joint mutual information I(T ∶ S 1 , . . . , Sn) [4,19] about some target variable T (see Figure 1, left). The goal of partial information decomposition is to decompose this joint mutual information into its component parts, the so-called atoms of information. As explained in the introduction, these parts are supposed to represent unique, redundant, and synergistic information contributions. Now, what distinguishes these contributions are their defining part-whole relationships to the information provided by the different source variables: the information uniquely associated with one of the sources is only part of the information provided by that source and not part of the information provided by any other source. The information provided redundantly by multiple sources is part of the information carried by each of these sources. And the information provided synergistically by multiple sources is only part of the information carried by them jointly but not part of the information carried by any of them individually. For this reason, it seems natural to make the part-whole relationship between pieces of information the basic concept of PID. The goal of this section is to make this idea precise, and in this way, to open up a new perspective for thinking about partial information decomposition.
The underlying idea is that any theory should be put on the foundation of as simple and elementary concepts as possible. The part-whole relation is one of the most basic relationships in nature. It appears on all spatial and temporal scales: atoms are parts of molecules, planets are parts of solar systems, the phase of hyperpolarisation is part of an action potential, infancy is part of a human being's life. Moreover, it is not a purely scientific concept but is also ubiquitous in ordinary life: we say, for instance, that a prime minister is part of the government or that a slice of pizza is part of the whole pizza. This ubiquity makes it particularly easy to think in terms of part-whole relationships. We hope, therefore, that starting from this vantage point will provide a particularly accessible and intuitive exposition of partial information decomposition. This factor is of particular importance when it comes to the practical application of PID to specific scientific questions and the interpretation of the results of a PID analysis.
Developing the theory of partial information decomposition means that we have to answer three questions: (i) What do the different atoms of information mean, i.e. what type of information do they represent? (ii) How many atoms of information are there for a given number of information sources? (iii) How large are the different atoms of information given a specific joint probability distribution of sources and target, i.e. how many bits of information does each atom provide?
In the following sections we will tackle each of these questions in turn.
(a) What do the atoms of information mean?
Asking how to decompose the joint mutual information into its component parts is a bit like asking "How to slice a cake?". Of course, there are many possible ways to do so, and hence, there is no unique answer to the question. In order to make the question more precise we first have to provide a criterion according to which we would like to decompose the joint mutual information. This is what this section is about. What are the atoms of information supposed to mean in the end, i.e. what type of information do they represent? To a first approximation, the core idea underlying the parthood approach to partial information decomposition is to decompose the joint mutual information I(T ∶ S 1 , . . . , Sn) into information atoms, such that each atom is characterized by its parthood relations to the mutual information provided by the different sources. For instance, one atom of information will describe that part of the joint mutual information which is part of the information provided by each source, i.e. the information that is redundant to all sources. Another atom will describe the part of the joint mutual information that is only part of the information provided by the first source, i.e. it is unique to the first source. And so on. Now, we have to refine this idea a bit: it is important to realize that it would not be enough to consider parthood relations to information provided by individual sources. The reason is that a collection of sources may provide some information that is not contained in any individual source but which only arises by combining the information from multiple sources in that collection. The classical example for this phenomenon is the logical exclusive-or shown in Figure 1, right. In this example the sources are two independent coin flips. The target is the exclusive-or of the sources, i.e. the target is 0 just in case both coins come up heads or both come up tails, and it is 1 otherwise.
Initially, the odds for the target being zero or one respectively are 1:1 because there are four equally likely outcomes in two of which the target is 1 while it is 0 in the other two. Now, if we are told the value of one of the coins, these odds are not affected, and accordingly, we do not obtain any information about the target. For instance, if we are told that the first coin came up heads there are two equally likely outcomes left: Heads-Heads and Heads-Tails. In the first case, the target is zero and in the second case it is one. Hence, the odds are still 1:1. This example illustrates two important points: (i) There are cases in which multiple information sources combined provide some information that is not contained in any individual source. This type of information is generally called synergistic information. (ii) Any reasonable theory of information should be compatible with the existence of synergistic information. In particular, it should allow that, in some cases, the information provided jointly by multiple sources is larger than the sum of the individual information contributions provided by the sources.
Regarding the second point we may note that classical information theory satisfies this constraint because in some cases I(T ∶ S 1 , S 2 ) > I(T ∶ S 1 ) + I(T ∶ S 2 ). In fact, in the exclusive-or example, each individual source provides zero bits of information while the sources combined provide one bit of information.
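This can be checked directly by computing the classical mutual information terms from the joint distribution. The following is a minimal sketch in plain Python; the helper `mutual_information` is ours, not taken from any library, and represents the joint distribution as a dictionary from outcome tuples to probabilities.

```python
import math
from itertools import product

def mutual_information(joint, sources, target):
    """I(T : S_sources) in bits, where joint maps (s1, ..., sn, t) tuples
    to probabilities, sources is a tuple of coordinate indices, and
    target is the coordinate index of T."""
    p_st, p_s, p_t = {}, {}, {}
    for outcome, p in joint.items():
        s = tuple(outcome[i] for i in sources)
        t = outcome[target]
        p_st[(s, t)] = p_st.get((s, t), 0) + p
        p_s[s] = p_s.get(s, 0) + p
        p_t[t] = p_t.get(t, 0) + p
    return sum(p * math.log2(p / (p_s[s] * p_t[t]))
               for (s, t), p in p_st.items() if p > 0)

# XOR of two fair coins: outcomes are (s1, s2, t) with t = s1 XOR s2
xor = {(s1, s2, s1 ^ s2): 0.25 for s1, s2 in product((0, 1), repeat=2)}

print(mutual_information(xor, (0,), 2))     # source 1 alone: zero bits
print(mutual_information(xor, (1,), 2))     # source 2 alone: zero bits
print(mutual_information(xor, (0, 1), 2))   # both sources together: one bit
```

The individual mutual informations vanish while the joint mutual information is one bit, which is exactly the synergistic behaviour described above.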
Based on these considerations we may rephrase the basic idea of the parthood approach as: we are looking for a decomposition of the joint mutual information into atoms such that each atom is characterized by its parthood relations to the information carried by the different possible collections of sources about the target. Of course, we allow collections containing only a single source, such as {1}, as a special case. Note that we will generally refer to source variables and collections thereof by their indices. So instead of writing {S 1 } and {S 1 , S 2 } to refer to the first source and the collection containing the first and second source, we write {1} and {1, 2} respectively. There are several important technical reasons for this that will become apparent in the following sections. For now it is sufficient to just think of it as a shorthand notation.
Let us now investigate how the idea of characterizing the information atoms by parthood relations plays out in the simple case of two sources S 1 and S 2 . In this case, there are four collections: the empty collection ∅, the singleton collections {1} and {2}, and the full collection {1, 2}. Now, in order to characterize an information atom Π we have to ask for each collection a: Is Π part of the information provided by a? For two of the collections we can answer this question immediately for all Π: First, no atom of information should be contained in the information provided by the empty collection of sources because there is no information in the empty set. If we do not know any source, then we cannot obtain any information from the sources. Second, any atom of information should be contained in the mutual information provided by the full set of sources since this is precisely what we want to decompose into its component parts. Regarding the collections {1} and {2} we are free to answer yes or no, leaving four possibilities as shown in Table 1.
The first possibility (first row of Table 1) is an atom of information that is only part of the information provided by the sources jointly but not part of the information in either of the individual sources. This is the synergistic information. The second possibility (second row) is an atom that is part of the information provided by the first source but which is not part of the information in the second source. This atom of information describes the unique information of the first source. Similarly, the third possibility (third row) is an atom describing information uniquely contained in the second source. The fourth and last possibility (fourth row) is an atom that is part of the information provided by each source. This is the information redundantly provided or shared by the two sources. So based on considerations of parthood we arrived at the conclusion that there should be exactly four atoms of information in the case of two source variables. Each atom is characterized by its parthood relations to the mutual information provided by the different collections of sources. These relationships are described by the rows of Table 1, which we will call parthood distributions. Each atom Π is formally represented by its parthood distribution f Π .
Mathematically, a parthood distribution is a Boolean function from the powerset of {1, . . . , n} to {0, 1}, i.e. it takes a collection of source indices as an input and returns either 0 (the atom described by the distribution is not part of the information provided by the collection) or 1 (the atom described by the distribution is part of that information) as an output. But note that not all such functions qualify as a parthood distribution. We already saw that certain constraints have to be satisfied. For instance, the empty set of sources has to be mapped to 0. We propose that there are exactly three constraints a parthood distribution f has to satisfy, leading to the following definition: a Boolean function f on the powerset of {1, . . . , n} is a parthood distribution if and only if (1) f (∅) = 0, (2) f ({1, . . . , n}) = 1, and (3) f is monotone, i.e. a ⊆ b implies f (a) ≤ f (b). The third constraint says that if an atom of information is part of the information provided by some collection of sources a, then it also has to be part of the information provided by any superset of this collection. For example, if an atom is part of the information in source 1, then it also has to be part of the information in sources 1 and 2 combined. Note that this monotonicity constraint only matters if there are more than two information sources. Otherwise it is implied by the first two constraints. To fix ideas, an example of a Boolean function that is not a parthood distribution is shown in Table 2. We may now answer the question about the meaning of the atoms of information, i.e. what type of information they represent: they represent information that is part of the information provided by certain collections of sources but not part of the information of other collections. More precisely we can phrase this idea in terms of the following core principle:

Core Principle 1.
Each atom of information is characterized by a parthood distribution describing whether or not it is part of the information provided by the different possible collections of sources. The atom Π(f ) with parthood distribution f is exactly that part of the joint mutual information about the target which is 1) part of the information provided by all collections of sources a for which f (a) = 1, and 2), which is not part of the information provided by collections for which f (a) = 0.
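The three defining constraints can be written down in a few lines of code. The sketch below (illustrative Python; the function names are ours) represents a parthood distribution as a dictionary from frozensets of source indices to 0 or 1 and checks the constraints from the definition above:

```python
from itertools import combinations

def powerset(indices):
    """All subsets of the given source indices, as frozensets."""
    return [frozenset(c) for r in range(len(indices) + 1)
            for c in combinations(indices, r)]

def is_parthood_distribution(f, n):
    """Check the three constraints on a Boolean function f defined on
    the powerset of {1, ..., n}."""
    sources = list(range(1, n + 1))
    if f[frozenset()] != 0:               # (1) no information in the empty collection
        return False
    if f[frozenset(sources)] != 1:        # (2) every atom is part of the joint information
        return False
    for a in powerset(sources):           # (3) monotonicity: a ⊆ b implies f(a) <= f(b)
        for b in powerset(sources):
            if a <= b and f[a] > f[b]:
                return False
    return True

# two of the four parthood distributions for n = 2 (rows of Table 1)
synergy = {frozenset(): 0, frozenset({1}): 0, frozenset({2}): 0, frozenset({1, 2}): 1}
unique1 = {frozenset(): 0, frozenset({1}): 1, frozenset({2}): 0, frozenset({1, 2}): 1}
print(is_parthood_distribution(synergy, 2))  # True
print(is_parthood_distribution(unique1, 2))  # True
```

Feeding in a function that maps the empty collection to 1 returns False, mirroring the argument that no information can come from the empty set of sources.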
Given this characterization of the information atoms we are now in a position to answer the second question: How many atoms of information are there for a given number of information sources?
(b) How many atoms of information are there?
Since each atom is characterized by its parthood distribution, the answer is straightforward: there is one atom per parthood distribution, or in other words, one atom per Boolean function satisfying the constraints presented in the previous section. The monotonicity constraint turns out to be most restrictive. In fact, once the monotonicity constraint is satisfied the other two constraints only rule out one Boolean function each as shown in Table 3. The reason is the following: Firstly, there is only a single monotonic Boolean function that assigns the value 1 to the empty set, namely, the function that is always 1. Since the empty set is a subset of any other set, monotonicity forces us to assign a 1 to all sets once the empty set has value 1. However, this possibility is ruled out by the first constraint saying that there is no information in the empty set. Secondly, there is only a single monotonic Boolean function assigning the value 0 to the full set {1, . . . , n}, namely the function that is always 0. Since any other set of source indices is contained in the full set, monotonicity forces us to assign a 0 to all sets once the full set has value 0: if we were to assign a 1 to any other set, then we would have to assign a 1 to the full set as well. This means that the number of atoms is equal to the number of monotonic Boolean functions minus two. Now, the sequence of the numbers of monotonic Boolean functions of n bits is a very famous sequence in combinatorics called the Dedekind numbers. The Dedekind numbers are a very rapidly (in fact super-exponentially) growing sequence of numbers of which only the first eight entries are known to date [21]. The values of the Dedekind numbers for 2 ≤ n ≤ 6 are: 6, 20, 168, 7581, 7828354. Now that we have answered what type of information the different atoms represent and how many there are for a given number of information sources, there is one important question left: How large are these different atoms?
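For small n the count "Dedekind number minus two" can be verified by brute force. The sketch below (plain Python, ours; it enumerates all 2^(2^n) Boolean functions on the powerset, so it is only feasible up to about n = 3 here) counts the functions satisfying the three constraints:

```python
from itertools import combinations, product

def count_atoms(n):
    """Count parthood distributions for n sources, i.e. monotone Boolean
    functions on the powerset of {1, ..., n} excluding the two constant
    functions ruled out by the first two constraints."""
    subsets = [frozenset(c) for r in range(n + 1)
               for c in combinations(range(1, n + 1), r)]
    full = frozenset(range(1, n + 1))
    count = 0
    for values in product((0, 1), repeat=len(subsets)):
        f = dict(zip(subsets, values))
        monotone = all(f[a] <= f[b] for a in subsets for b in subsets if a <= b)
        if monotone and f[frozenset()] == 0 and f[full] == 1:
            count += 1
    return count

# Dedekind numbers minus two: 3 - 2 = 1, 6 - 2 = 4, 20 - 2 = 18
print([count_atoms(n) for n in (1, 2, 3)])  # [1, 4, 18]
```

The results match the counts used later in the text: four atoms for two sources and 18 atoms for three sources.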
How many bits of information does each atom provide?
(c) How large are the atoms of information?
The question of the sizes of the atoms is not a trivial one since the number of atoms grows so quickly. In the case of four information sources there are already 166 atoms. Hence, it does not appear to be feasible to define the amount of information of each of these atoms separately. What we need is a systematic approach that somehow fixes the sizes of all atoms at the same time. The core idea is to transform the problem into a much simpler one in which only a single type of informational quantity has to be defined. In the following we show how this can be achieved in three steps.

(i) Define a quantitative relationship between atoms and composite quantities
So far we have only discussed how the atoms of information relate qualitatively to composite information quantities that are made up of multiple atoms, in particular mutual information (in the next section we will encounter another non-atomic quantity). We saw for instance, that in the case of two sources, the mutual information contributions provided by the individual sources, I(T ∶ S 1 ) and I(T ∶ S 2 ), each consist of a unique and a redundant information atom, while the joint mutual information I(T ∶ S 1 , S 2 ) additionally consists of a synergistic part. This is illustrated in the information diagram shown in Figure 2. The inner two black circles represent the mutual information provided by the first source (left) and the second source (right) about the target. Each of these mutual information terms contains two atomic parts: I(T ∶ S 1 ) consists of the unique information in source 1 (Π unq 1 , blue patch) and the information shared with source 2 (Π red , red patch).
I(T ∶ S 2 ) consists of the unique information in source 2 (Π unq 2 , yellow patch) and again the shared information. The joint mutual information I(T ∶ S 1 , S 2 ) is depicted by the large black oval encompassing the inner two circles. I(T ∶ S 1 , S 2 ) consists of four atoms: the unique information in source 1 (Π unq 1 , blue patch), the unique information in source 2 (Π unq 2 , yellow patch), the shared information (Π red , red patch), and additionally the synergistic information (Π syn , green patch). Now the question arises: How are these mutual information terms related to the atoms they consist of quantitatively? The most straightforward answer (and the one generally accepted in the PID field) is that the mutual information is simply the sum of the atoms it consists of. We propose to extend this principle to any composite information quantity, i.e. any quantity that can be described as being made up out of multiple information atoms: Core Principle 2. The size of any non-atomic information quantity (i.e. the amount of information it contains) is the sum of the sizes of the information atoms it consists of.
We could also rephrase this as "wholes are the sums of their (atomic) parts". In the case of two information sources, this principle leads to the following three equations:

I(T ∶ S 1 ) = Π red + Π unq 1
I(T ∶ S 2 ) = Π red + Π unq 2
I(T ∶ S 1 , S 2 ) = Π red + Π unq 1 + Π unq 2 + Π syn

This already gets us quite far in terms of determining the sizes of the atoms: the sizes of the atoms are the solutions to a linear system of equations. The only problem is that the system is underdetermined. We have four unknowns but only three equations. In the case of three sources, the problem is even more severe. In this case, there are seven non-empty collections of sources, and hence, seven mutual information terms. Again each of these terms is the sum of certain atoms. But as shown in section §2b there are 18 atoms. So we are short of 11 equations! In general the equation relating the mutual information provided by some collection of sources a and the information atoms can be expressed easily in terms of their parthood distributions:

I(T ∶ a) = ∑_{f ∶ f (a)=1} Π(f ) (2.5)

where Π(f ) is the information atom corresponding to parthood distribution f and the summation notation means that we are summing over all f such that f (a) = 1. Note that on the left-hand side we are using the shorthand notation I(T ∶ a) for the mutual information I(T ∶ (S i ) i∈a ) provided by the collection a. Equation (2.5) can be taken to define a minimal notion of a partial information decomposition, i.e. any set of quantities Π(f ) at least has to satisfy this equation in order to be considered a partial information decomposition (or at least to be considered a parthood-based / Williams and Beer type PID). For a formal definition of such a minimally consistent PID see Appendix (a). This concludes the first step. The next one is to find a way to come up with the appropriate number of additional equations. In doing so we will follow the same approach as Williams and Beer and utilize the concept of redundant information to introduce additional constraints.
It should be noted that this is not the only way to derive a solution for the information atoms. In other words, a PID does not have to be "redundancy based". This issue is discussed in detail in §5. For now, however, let us follow the conventional path and see how it enables us to determine the sizes of the atoms of information.
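Core Principle 2 and equation (2.5) can be made concrete for the two-source case. In the sketch below (illustrative Python, ours), the atom sizes are arbitrary placeholder numbers, chosen only to show how each composite mutual information term is assembled from its atomic parts:

```python
# Hypothetical atom sizes for two sources; the numbers are arbitrary
atoms = {"red": 0.2, "unq1": 0.1, "unq2": 0.3, "syn": 0.4}

# Which atoms each collection's mutual information contains (rows of Table 1)
parts = {
    ("1",):     ["red", "unq1"],
    ("2",):     ["red", "unq2"],
    ("1", "2"): ["red", "unq1", "unq2", "syn"],
}

def mi_from_atoms(a):
    """Core Principle 2 / equation (2.5): each composite mutual information
    term is the sum of the atoms it consists of."""
    return sum(atoms[name] for name in parts[a])

# Three equations, four unknowns: knowing these three sums alone does not
# determine the four atoms, which is why a redundancy measure is needed.
for a in parts:
    print(a, mi_from_atoms(a))
```

Running this forward is easy; the PID problem is the inverse direction, recovering the four atom sizes from the three observable mutual information values, which is underdetermined without an additional constraint.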

(ii) Formulate additional equations using the concept of redundant information
The basic idea is now to extend the considerations of the previous step to another composite information quantity: the redundant information provided by multiple collections of sources about the target, which we will generically denote by I∩(T ∶ a 1 , . . . , am). The ∩-symbol refers to the idea that the redundant information of collections a 1 , . . . , am is the information contained in a 1 and a 2 and, . . . , and am. Intuitively, given two collections of sources a 1 and a 2 , their redundant information is the information "shared" by those collections, what they have "in common", or geometrically: their overlap. These informal ideas are illustrated on the left side in Figure 3. Note that the redundant information of multiple collections of information sources is not defined in classical information theory. We have to come up with an appropriate measure of redundant information ourselves. However, the informal ideas just described already tell us how redundant information, no matter how we define it, should be related qualitatively to the information atoms: Core Principle 3. The redundant information I∩(T ∶ a 1 , . . . , am) consists of exactly those atoms that are part of the information provided by each of the collections a 1 , . . . , am. In the case of two sources, the redundant information of sources 1 and 2 consists of a single atom: the shared information Π red . This is the only atom that is part of both the information provided by the first source and also part of the information provided by the second source. But this is really a special case. Note what happens if we add a third source to the scenario. In this case the redundant information I∩(T ∶ {1}, {2}) of sources 1 and 2 should consist of two parts: first, the information shared by all three sources (which is certainly also shared by sources 1 and 2), and secondly, the information shared only by sources 1 and 2 but not by source 3. This is illustrated on the right side in Figure 3. Note also that in the case of three sources there are actually many redundancies that we may compute: it turns out that in total there are 11 redundancies (strictly speaking we should say 11 "proper" redundancies, as will be explained below).
But this is exactly the number of missing equations in the case of three information sources (see the last paragraph of the previous section). Now, combining Core Principles 2 and 3 allows us to answer what the quantitative relationship between redundant information and information atoms has to be: the redundant information of collections of sources a 1 , . . . , am is the sum of all atoms that are part of the information provided by each collection:

I∩(T ∶ a 1 , . . . , am) = ∑_{f ∶ f (a 1 )=...=f (am)=1} Π(f ) (2.6)

where again the notation means that we are summing over all f that satisfy the condition below the summation sign. This equation can be read in two ways: first, as placing a constraint on the redundant information I∩, namely that it has to be the sum of specific atoms. This means that if we already knew the sizes of the Π's, we could compute I∩. However, the sizes of the Π's are precisely what we are trying to work out. Now the crucial idea is that we can also read the equation the other way around: if we can come up with some reasonable measure of redundant information I∩ we may be able to invert equation (2.6) in order to obtain the Π's. So the final step will be to show that such an inversion is in fact possible and will lead to a unique solution for the atoms of information. Before proceeding to this step, it is important to briefly clarify the relationships between the three central concepts we have discussed so far: (i) the mutual information (the quantity we want to decompose), (ii) the information atoms (the quantities we are looking for), and (iii) redundant information (the quantity we are going to use to find the information atoms). These concepts are easily confused with each other but should be clearly separated. The relationships between them are shown in Figure 4. First, based on what we have said so far, mutual information can be shown to be a special case of redundant information: the redundant information of a single collection I∩(T ∶ a 1 ), i.e. "the information the collection shares with itself about the target".
The reason for this is that Core Principle 3 tells us that the redundant information of a single collection consists of all the atoms that are part of the mutual information carried by that collection about the target. But this is simply the mutual information of that collection: I∩(T ∶ a 1 ) = I(T ∶ a 1 ). Accordingly, mutual information has been called "self-redundancy" in the PID literature (although not based on parthood arguments) [29]. The relationship between redundant information and atoms is as follows: only the "all-way" redundancy, i.e. the information shared by all n sources, is itself an atom. Any other redundancy, such as the redundancy of only a subset of sources, is a composite quantity made up out of multiple atoms.
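For two sources the promised inversion can be carried out by hand: once a redundancy value is fixed, each unique information follows by subtracting the redundancy from the corresponding individual mutual information, and synergy is whatever remains of the joint mutual information. The sketch below (plain Python, ours) uses a deliberately naive redundancy measure, the minimum of the two individual mutual informations, chosen for illustration only; it is not one of the redundancy measures proposed in the PID literature:

```python
import math
from itertools import product

def mi(joint, sources, target=2):
    """Mutual information I(T : S_sources) in bits for a joint
    distribution given as a dict from (s1, s2, t) tuples to probabilities."""
    p_st, p_s, p_t = {}, {}, {}
    for outcome, p in joint.items():
        s = tuple(outcome[i] for i in sources)
        t = outcome[target]
        p_st[(s, t)] = p_st.get((s, t), 0) + p
        p_s[s] = p_s.get(s, 0) + p
        p_t[t] = p_t.get(t, 0) + p
    return sum(p * math.log2(p / (p_s[s] * p_t[t]))
               for (s, t), p in p_st.items() if p > 0)

def pid_two_sources(joint, redundancy):
    """Invert the two-source system: once I_cap(T : {1},{2}) is fixed,
    the remaining three atoms follow from Core Principle 2."""
    i1, i2, i12 = mi(joint, (0,)), mi(joint, (1,)), mi(joint, (0, 1))
    red = redundancy(joint)
    unq1, unq2 = i1 - red, i2 - red
    syn = i12 - red - unq1 - unq2
    return {"red": red, "unq1": unq1, "unq2": unq2, "syn": syn}

# A deliberately naive redundancy measure, for illustration only
naive_red = lambda joint: min(mi(joint, (0,)), mi(joint, (1,)))

xor = {(s1, s2, s1 ^ s2): 0.25 for s1, s2 in product((0, 1), repeat=2)}
print(pid_two_sources(xor, naive_red))  # all atoms zero except syn = 1 bit
```

For the exclusive-or, any redundancy measure satisfying self-redundancy yields zero redundancy and zero unique informations here, so the entire bit of joint mutual information lands in the synergy atom.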

(iii) Show that a measure of redundant information leads to a unique solution for the information atoms
There is a very useful fact about parthood distributions that will help us to obtain a unique solution for the atoms given an appropriate measure of redundant information: parthood distributions can be ordered in a very natural way into a lattice structure that is tightly linked to the idea of redundancy. The lattice for the case of three sources is shown in Figure 5. The parthood distributions are ordered as follows: If there is a 1 in certain positions on a parthood distribution f , then all the parthood distributions g below it also have a 1 in the same positions, plus some additional ones. Or in terms of the atoms corresponding to these parthood distributions: If an atom Π(f ) is part of the information provided by some collections of sources, then all the atoms Π(g) below it are also part of the information provided by these collections. Formally, we will denote this ordering by ⊑ and it is defined as

f ⊑ g ∶⇔ ∀a ∶ (g(a) = 1 ⇒ f(a) = 1)

For n information sources we will denote the lattice of parthood distributions by (Bn, ⊑), where Bn is the set of all parthood distributions in the context of n sources (for proof that this structure is in fact a lattice in the formal sense see Appendix (b)). Note that the different "levels" of the lattice contain parthood distributions with the same number of ones and that higher-level parthood distributions contain fewer ones: At the very top in Figure 5, there is the parthood distribution describing the atom that is only part of the joint mutual information provided by all three sources combined, i.e. the synergy of the three sources. One level down, there are the three parthood distributions that assign the value 1 exactly two times. Yet another level down, we find the three possible parthood distributions that assign the value 1 exactly three times. And so on and so forth until we reach the bottom of the lattice, which corresponds to the information shared by all three sources.
Accordingly, the corresponding parthood distribution assigns the value 1 to all collections (except, of course, the empty collection). Ordering all the parthood distributions (and hence atoms) into such a lattice provides a good overview that tells us how many atoms exist for a given number of source variables and what their characteristic parthood relationships are. But the lattice plays a much more profound role because it is very closely connected to the concept of redundant information. The idea is to associate with each parthood distribution in the lattice a particular redundancy: the redundant information of all the collections that are assigned the value 1 by the distribution. In other words, for any parthood distribution f we consider the redundancy

I∩(T ∶ f ) ∶= I∩(T ∶ a 1 , . . . , am), where a 1 , . . . , am are exactly the collections with f(a i ) = 1    (2.9)

Which atoms make up this redundant information? These atoms are the ones that have value 1 on each a i . But, by definition of the ordering, these are precisely the ones corresponding to parthood distributions below and including the parthood distribution for which we are computing the associated redundancy. In other words, the redundant information associated with a parthood distribution f can always be expressed as

I∩(T ∶ f ) = ∑_{g ⊑ f} Π(g)    (2.10)

In this way we obtain one equation per parthood distribution. And since there are as many information atoms as parthood distributions, we obtain as many equations as unknowns. This is already a good sign. But is a unique solution for the information atoms guaranteed? This question can be answered affirmatively by noting that the system of equations described by (2.10) (one equation per f ) is not just any linear system, but has a very special structure: one function I∩(T ∶ f ) evaluated at a point f on a lattice is the sum of another function Π(f ) over all points on the lattice below and including the point f . The process of solving such a system for the Π(f )'s once all the I∩(T ∶ f )'s are given, or in other words inverting equation (2.10), is called Moebius Inversion.
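A minimal sketch of this inversion for the two-source lattice (the redundancy values below are hypothetical, chosen only for illustration): each parthood distribution is written as the tuple (f({1}), f({2}), f({1,2})), the ordering g ⊑ f becomes a componentwise comparison, and processing the lattice from the bottom upwards each atom equals its associated redundancy minus the atoms strictly below it.

```python
from itertools import product

# parthood distributions for n = 2, as tuples (f({1}), f({2}), f({1,2}));
# monotonicity plus f({1,2}) = 1 leaves exactly four distributions (atoms)
dists = [f for f in product([0, 1], repeat=3) if f[2] == 1]

def below(g, f):
    # g ⊑ f : g has a 1 wherever f does, i.e. g lies below f in the lattice
    return all(gi >= fi for gi, fi in zip(g, f))

def moebius_invert(redundancy):
    """Solve I∩(T:f) = sum over g ⊑ f of Π(g) for the atoms Π, bottom-up."""
    atoms = {}
    # distributions with more 1s sit lower in the lattice, so process them first
    for f in sorted(dists, key=sum, reverse=True):
        atoms[f] = redundancy[f] - sum(atoms[g] for g in dists
                                       if below(g, f) and g != f)
    return atoms

# hypothetical redundancies: shared = 0.5, I(T:S1) = 1.0, I(T:S2) = 0.7, joint = 2.0
I_red = {(1, 1, 1): 0.5, (1, 0, 1): 1.0, (0, 1, 1): 0.7, (0, 0, 1): 2.0}
pi = moebius_invert(I_red)
# yields shared 0.5, unique-1 0.5, unique-2 0.2, synergy 0.8
```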
Crucially, a unique solution is guaranteed for any real or even complex valued function I∩ that we may put on the lattice [22]. This means that we have now completely shifted the problem of determining the sizes of the information atoms to the problem of coming up with a reasonable definition of redundant information I∩(T ∶ f ). Even though we have to define this quantity for each parthood distribution f , this is still a much simpler task. The reason is that all the I∩'s represent exactly the same type of information, namely redundant information. On the other hand, the information atoms Π represent completely different types of information. Even in the simplest case of two sources we have to deal not only with redundant information, but also unique information and synergistic information. And the story gets more and more complicated the more information sources are considered. Now, note that apparently we only need to define quite special redundant information terms, namely the redundancies associated with parthood distributions I∩(T ∶ f ) (see definition (2.9)). However, we will now show that these are in fact all possible redundancies, i.e. the redundancy of any tuple of collections of sources a 1 , . . . , am is necessarily equal to a redundancy associated with a specific parthood distribution. The reason for this is that the quantitative relation between atoms and redundant information (equation (2.6)) not only provides a way to solve for the information atoms once we know I∩, it also implies that I∩ has to satisfy the following invariance properties:

(i) Symmetry: I∩(T ∶ a 1 , . . . , am) is invariant under reordering of the collections a i .
(ii) Idempotency: repeating a collection does not change I∩(T ∶ a 1 , . . . , am).
(iii) Superset removal/addition: adding or removing a collection a i that is a proper superset of some other collection a j does not change I∩(T ∶ a 1 , . . . , am).
(iv) Self-redundancy: I∩(T ∶ a) = I(T ∶ a).

We can easily ascertain that any measure of redundant information I∩ has to have these properties by taking a closer look at the condition describing which atoms to sum over in order to obtain a particular redundant information term I∩(T ∶ a 1 , . . . , am): we have to sum over the atoms with parthood distribution satisfying f(a i ) = 1 for all i = 1, . . . , m.
Now whether or not this condition is true of a given parthood distribution f , first, does not depend on the order in which the collections a i are given (symmetry), secondly, it does not depend on whether the same collection a is repeated multiple times (idempotency), and thirdly, it does not matter whether we add or remove some collection a i that is a proper superset of some other collection (superset removal/addition). This fact is due to the monotonicity constraint on parthood distributions. Finally, the "self-redundancy" property was already established in the previous section.
These invariance properties are referred to in the literature as the Williams and Beer axioms for redundant information [7] (in addition, there is a quantitative monotonicity axiom, which we reject; see §6). However, in the parthood formalism described here they are not themselves axioms but are implied by the core principles we have set out. The first two invariance properties imply that we may restrict ourselves to sets instead of tuples of collections in defining I∩. The third constraint additionally tells us that we can restrict ourselves to those sets of collections {a 1 , . . . , am} such that no collection a i is a superset of another collection a j . Such sets of collections are called antichains. Hence, the redundancy of any tuple of collections of sources a 1 , . . . , am is necessarily equal to the redundancy associated with a particular antichain. This antichain results from ignoring the order and repetitions of the a i , and removing any supersets.
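The reduction of an arbitrary tuple of collections to its antichain can be sketched in a few lines (an illustration; the function name is ours):

```python
def to_antichain(collections):
    """Reduce a tuple of collections to the antichain with the same redundancy:
    ignore order and repetitions, then drop every collection that is a proper
    superset of another one (superset removal)."""
    sets = {frozenset(a) for a in collections}        # symmetry + idempotency
    return {a for a in sets if not any(b < a for b in sets)}

# I∩(T : {1,2}, {2}, {2}, {1,3}) reduces to the antichain {{2}, {1,3}}
print(to_antichain([{1, 2}, {2}, {2}, {1, 3}]))
```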
We can now see that the redundancies I∩(T ∶ f ) are in fact all possible redundancies by associating with any antichain α = {a 1 , . . . , am} a parthood distribution fα that assigns the value 1 to all a i and all supersets of these collections, while it assigns the value 0 to all other collections. Now, due to the invariance of I∩ under removal of supersets, it immediately follows that I∩(T ∶ fα) = I∩(T ∶ α). So in conclusion, there is one redundancy for each antichain α and these redundancies are equal to the redundancies associated with the corresponding parthood distributions. Hence the redundancies I∩(T ∶ f ) are in fact all possible redundancies.
Of course, there is also an inverse mapping associating with any parthood distribution f an antichain α f . In fact, the lattice of parthood distributions (Bn, ⊑) is isomorphic to a lattice of antichains (An, ⪯) equipped with an ordering relationship that was originally introduced by Crampton and Loizou [5] and used by Williams and Beer in their original exposition of PID. The formal proof of this fact is postponed to section §4 where a third perspective on PID, the logical perspective, is introduced.
In the next section, we will tackle the problem of defining a measure of redundant information for each parthood distribution / antichain by connecting PID theory to formal logic. The measure I sx ∩ derived in this way is identical to the one described in [12]. In showing how this measure can be inferred from logical and parthood principles we aim to 1) strengthen the argument for I sx ∩ , and 2) open the gateway between PID theory and formal logic. This latter point is elaborated in §4.

Using logic to derive a measure of redundant information
We have now solved the PID problem up to specifying a reasonable measure of redundant information I∩ between collections that form an antichain. In this section, we will derive such a measure. In doing so we will first move from the level of random variables T, S 1 , . . . , Sn to the level of particular realizations t, s 1 , . . . , sn of these variables. This level of description is generally called the pointwise level and has been used as the basis of classical information theory by Fano [6]. Pointwise approaches to PID have been put forth by [7] and [12]. Note that moving to the level of realizations simplifies the problem considerably because realizations are much simpler objects than random variables. A realization is simply a specific symbol or number, whereas a random variable is an object that may take on various different values and can only be fully described by an entire probability distribution over these values.

(a) Going Pointwise
The idea underlying the pointwise approach is to consider the information provided by a particular joint realization (observation) of the source random variables about a particular realization (observation) of the target random variable. So from now on we assume that these variables have taken on specific values s 1 , . . . , sn, t. It was shown by Fano [6] that the whole of classical information theory can be derived from this pointwise level. By placing a certain number of reasonable constraints or axioms on pointwise information, it follows that this information must have a specific form. In particular, the pointwise mutual information i(t ∶ s) is given by

i(t ∶ s) = log 2 ( p(t∣s) / p(t) )

The mutual information I(T ∶ S) is then simply defined as the average pointwise mutual information. Note that pointwise mutual information (in contrast to mutual information) can be both positive and negative. It essentially measures whether we are guided in the right or wrong direction with respect to the actual target realization t. If the conditional probability of T = t given the observation of S = s is larger than the unconditional (prior) probability of T = t, then we are guided in the right direction: the actual target realization is in fact t and observing that S = s makes us more likely to think so. Accordingly, in this case the pointwise mutual information is positive. On the other hand, if the conditional probability of T = t given the observation of S = s is smaller than the unconditional (prior) probability of T = t, then we are guided in the wrong direction: observing S = s makes us less likely to guess the correct target value. In this case the pointwise mutual information is negative. The joint pointwise mutual information of source realizations s 1 , . . . , sn about the target realization is defined in just the same way:

i(t ∶ s 1 , . . . , sn) = log 2 ( p(t∣s 1 , . . . , sn) / p(t) )

The idea is now to perform the entire partial information decomposition on the pointwise level, i.e. to decompose the pointwise joint mutual information i(t ∶ s 1 , . . . , sn) that the source realizations provide about the target realization [7]. This leads to pointwise atoms π s 1 ,...,sn,t (in the following we will generally drop the subscript). Crucially, we are only changing the quantity to be decomposed from I(T ∶ S 1 , . . . , Sn) to i(t ∶ s 1 , . . . , sn). Otherwise, the idea is completely analogous to what we have discussed in §2 (simply replace I by i and Π by π): the goal is to decompose the pointwise mutual information into information atoms that are characterized by their parthood relations to the pointwise mutual information provided by the different possible collections of source realizations. These atoms have to stand in the appropriate relationship to pointwise redundancy: the pointwise redundancy i∩(t ∶ a 1 , . . . , am) is the sum of all pointwise atoms π(f ) that are part of the information provided by each collection of source realizations a i . By exactly the same argument as described in §2ciii, there is a unique solution for the pointwise atoms once a measure of pointwise redundancy i∩(t ∶ α) is fixed for all antichains α = {a 1 , . . . , am}. The variable-level atoms Π are then defined as the average of the corresponding pointwise atoms:

Π(f ) = ∑_{s 1 ,...,sn,t} p(s 1 , . . . , sn, t) π s 1 ,...,sn,t (f )    (3.3)

We are now left with defining the pointwise redundancy i∩ among collections of source realizations. As noted above, this is a much easier problem than coming up with a measure of redundancy among collections of entire source variables. In the next section, we show how the pointwise redundancy of multiple collections of source realizations can be expressed as the information provided by a particular logical statement about these realizations.
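The pointwise quantities are easy to compute directly. A small sketch for a single binary source and target with a hypothetical joint distribution (our toy example): matching realizations yield positive pointwise mutual information, mismatched ones negative, and the probability-weighted average over all realizations recovers the mutual information I(T ∶ S).

```python
from math import log2

# hypothetical joint distribution p(s, t) of one binary source and a binary target
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

p_t = {t: sum(v for (s, t2), v in p.items() if t2 == t) for t in (0, 1)}
p_s = {s: sum(v for (s2, t), v in p.items() if s2 == s) for s in (0, 1)}

def i_pointwise(t, s):
    # i(t : s) = log2 p(t|s) / p(t); positive if s guides us towards t
    return log2((p[(s, t)] / p_s[s]) / p_t[t])

print(i_pointwise(0, 0))   # positive: observing s=0 makes t=0 more likely
print(i_pointwise(1, 0))   # negative: a misinformative observation

# averaging the pointwise values recovers the mutual information I(T:S)
I_st = sum(p[(s, t)] * i_pointwise(t, s) for (s, t) in p)
```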

(b) Defining pointwise redundancy in terms of logical statements
The language of formal logic allows us to form statements about the source realizations. In particular, we will consider statements made up of the following ingredients: (i) n basic statements of the form S i = s i , i.e. "source S i has taken on value s i "; (ii) the logical connectives ∧ (and), ∨ (or), ¬ (not), → (if, then); (iii) brackets ( , ). In this way, we may form statements such as S 1 = s 1 ∧ S 2 = s 2 ("source S 1 has taken on value s 1 and source S 2 has taken on value s 2 ") or S 1 = s 1 ∨ (S 2 = s 2 ∧ S 3 = s 3 ) ("either source S 1 has taken on value s 1 , or source S 2 has taken on value s 2 and source S 3 has taken on value s 3 ").

[Figure 6 caption: A statement C that is logically weaker than both A and B has to be part of the information provided by A and also part of the information provided by B. Accordingly, it is contained in the "overlap", i.e. the redundant information of A and B. In order to obtain the entire redundant information, statement C has to be "maximized", i.e. it has to be chosen as the strongest statement weaker than both A and B (this is indicated by the arrows).]

Now
we may ask: what is the information provided by the truth of such statements about the target realization t? Classical information theory allows us to quantify this information as a pointwise mutual information: let A be any statement of the form just described; then the information i(t ∶ A) provided by the truth of this statement is

i(t ∶ A) = log 2 ( p(t ∣ I A = 1) / p(t) )

where I A is the indicator random variable of the event that the statement A is true, i.e. I A = 1 if the event occurred and I A = 0 if it did not. The interpretation of this information is that it measures whether and to what degree we are guided in the right or wrong direction with respect to the actual target value once we learn that statement A is true.
Note that according to this definition the pointwise mutual information provided by a collection of source realizations i(t ∶ a) is the information provided by the truth of the conjunction ⋀ i∈a S i = s i :

i(t ∶ a) = i(t ∶ ⋀ i∈a S i = s i )

Therefore, the information redundantly provided by collections of source realizations a 1 , . . . , am is the information redundantly provided by the truth of the corresponding conjunctions. Now, what is this information? We propose that in general the following principle describes redundancy among statements:

Core Principle 4. The information redundantly provided by the truth of the statements A 1 , . . . , Am is the information provided by the truth of their disjunction A 1 ∨ . . . ∨ Am.
There are two motivations for this principle: first, the logical inferences to be drawn from the disjunction A ∨ B are precisely the inferences that can be drawn redundantly from both A and B. If some conclusion C logically follows from both A and B, then it also follows from A ∨ B. Conversely, any conclusion C that follows from the disjunction A ∨ B follows from both A and B. Formally,

A ∨ B ⊧ C ⟺ (A ⊧ C and B ⊧ C)

where ⊧ denotes logical implication. The second motivation again invokes the idea of parthood relationships: if some statement C is logically weaker than a statement A, then the information provided by C should be part of the information provided by A. For instance, the information provided by the statement S 1 = s 1 has to be part of the information provided by the statement S 1 = s 1 ∧ S 2 = s 2 . This idea is illustrated in the information diagram on the left side in Figure 6. Now, this idea implies that if a statement C is weaker than both A and B, then the information provided by C is part of the information carried by A and also part of the information carried by B. But this means that the information provided by C is part of the redundant information of A and B. In order to obtain the entire redundant information, the statement C should therefore be chosen as the strongest statement logically weaker than both A and B (see right side of Figure 6). But this statement is the disjunction A ∨ B (or any equivalent statement).
Based on these ideas we can now finally formulate our proposal for a measure of pointwise redundancy i∩(t ∶ a 1 , . . . , am). We noted above that the information redundantly provided by collections of realizations a 1 , . . . , am is the information redundantly provided by the conjunctions ⋀ i∈a j S i = s i . And by the arguments just presented this is the information provided by the disjunction of these conjunctions. We denote the function that measures pointwise redundant information in this way by i sx ∩ (for reasons that will be explained shortly). It is formally defined as:

i sx ∩ (t ∶ a 1 , . . . , am) ∶= i(t ∶ ⋁ j=1,...,m ⋀ i∈a j S i = s i )

Recall that by definition this is the pointwise mutual information provided by the truth of the statement in question. Hence, it measures whether and to what degree we are guided in the right or wrong direction with respect to the actual target value t once we learn that the statement is true.
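As a concrete sketch (a toy example of ours, not from the text): for two i.i.d. uniform binary sources with the target a copy of the joint source, the pointwise shared information i sx ∩ (t ∶ {1}, {2}) at the all-zero realization comes out as log 2 (4/3) bits, while the self-redundancy of the full collection recovers the joint pointwise mutual information of 2 bits.

```python
from math import log2
from itertools import product

# two i.i.d. uniform bits; the target is a copy of the joint source, T = (S1, S2)
p = {((s1, s2), (s1, s2)): 0.25 for s1, s2 in product((0, 1), repeat=2)}

def i_sx(antichain, realization, t):
    """Pointwise redundancy i_sx(t : a1,...,am): the information provided by
    the truth of the disjunction of conjunctions of S_i = s_i statements."""
    def A(srcs):  # is the statement true for the source outcome `srcs`?
        return any(all(srcs[i - 1] == realization[i - 1] for i in a)
                   for a in antichain)
    p_A = sum(q for (srcs, _), q in p.items() if A(srcs))
    p_t = sum(q for (_, t2), q in p.items() if t2 == t)
    p_tA = sum(q for (srcs, t2), q in p.items() if t2 == t and A(srcs))
    return log2((p_tA / p_A) / p_t)   # log2 p(t | A) / p(t)

# shared information of the two sources at realization s1=0, s2=0, t=(0,0)
shared = i_sx([{1}, {2}], (0, 0), (0, 0))
print(shared)   # log2(4/3), about 0.415 bits
```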
We have now arrived at a complete solution to the partial information decomposition problem: Given the measure i sx ∩ we may carry out the Moebius-Inversion in order to obtain the pointwise atoms π sx . This has to be done for each realization s 1 , . . . , sn, t.
The obtained values can then be averaged as per Equation (3.3) to obtain the variable-level atoms Π sx . As shown in [12], the measure i sx ∩ can also be motivated in terms of the notion of shared exclusions (hence the superscript "sx"). The underlying idea is that redundant information is linked to possibilities (i.e. points in sample space) that are redundantly excluded by multiple source realizations. We argue that the fact that the measure i sx ∩ can be derived in these two independent ways provides further support for its validity. We offer a freely accessible implementation of the i sx ∩ PID as part of the IDTxl toolbox [28]. Worked examples of its computation and details on the computational complexity can be found in [12].
In the following section, we show that the value of formal logic within the theory of partial information decomposition goes far beyond helping us to define a measure of pointwise redundant information. In fact, similar to the lattices of parthood distributions and antichains, there is a lattice of logical statements that can equally be used as the basic mathematical structure of partial information decomposition. This lattice is particularly useful because the ordering relationship turns out to be very simple and well-understood: the relation of logical implication. We will show that this perspective also offers an independent starting point for the development of PID theory.

The logical perspective

(a) Logic Lattices
The considerations of the previous section identified the information redundantly provided by collections a 1 , . . . , am with the information provided by a particular logical statement: a disjunction of conjunctions of basic statements of the form S i = s i . This has an interesting implication: there is a one-to-one mapping between antichains α and logical statements. Let us now look at this situation a bit more abstractly by replacing the concrete statements S i = s i with propositional variables φ 1 , . . . , φn. Together with the logical connectives ¬, ∨, ∧, → (plus brackets) these form a language of propositional logic [20]. We will denote this language by 𝓛. We may now formally introduce a mapping Ψ from the set of antichains A into 𝓛 via

Ψ(α) = ⋁ a∈α ⋀ i∈a φ i

In other words, α is mapped to a statement by first conjoining the φ i corresponding to indices within each a i and then disjoining these conjunctions. For instance, the antichain {{1, 2}, {2, 3}} will be associated with the statement (φ 1 ∧ φ 2 ) ∨ (φ 2 ∧ φ 3 ). Note that if we interpret the propositional variables φ i as "source S i has taken on value s i ", then this is of course precisely the mapping of an antichain to the statement providing the redundant information (in the sense of i sx ∩ ) associated with that antichain.¹ The range L ⊆ 𝓛 of Ψ is the set of all disjunctions of logically independent conjunctions of pairwise distinct propositional variables. The logical independence of the conjunctions is the logical counterpart of the antichain property. The "pairwise distinct" condition ensures that the same atomic statement does not occur multiple times in any conjunction. The set L can now be equipped with the relationship of logical implication in order to obtain a new structure (L, ⫤) which we will show to be isomorphic to the lattices of antichains and parthood distributions. Here ⊧ means "implies" and ⫤ means "is implied by".
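The mapping Ψ is straightforward to implement; a small sketch (function name ours) that renders an antichain as its disjunction of conjunctions:

```python
def psi(antichain):
    """Map an antichain to its logical statement: a disjunction of
    conjunctions of the propositional variables phi_i."""
    conj = ["(" + " ∧ ".join(f"φ{i}" for i in sorted(a)) + ")"
            for a in sorted(antichain, key=sorted)]
    return " ∨ ".join(conj)

print(psi([{1, 2}, {2, 3}]))   # (φ1 ∧ φ2) ∨ (φ2 ∧ φ3)
```

Note that Ψ leaves the order of the conjunctions unspecified; the sketch simply fixes one by sorting.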
Based on these concepts, the following theorem expresses the isomorphism of (L, ⫤) to the lattices of antichains and parthood distributions:

Theorem 1. The mapping Ψ is an order isomorphism between the lattice of antichains (An, ⪯) and (L, ⫤), i.e. for all antichains α, β ∈ An we have α ⪯ β if and only if Ψ(α) ⫤ Ψ(β).

In this way the logical perspective is put on equal footing with the parthood perspective and the "antichain" perspective described by Williams and Beer [29]. They are in fact three equivalent ways to describe the mathematical structure underlying partial information decomposition. These three "worlds" of PID are illustrated in Figure 7 for the case of three information sources.

¹There is a slight ambiguity in the definition of Ψ since the order of the conjunctions ⋀ i∈a φ i and statements φ i is not specified.

Intuitively, the logic lattice can be understood as a hierarchy of logical constraints describing how (i.e. via which collections of sources) information about the target may be accessed. The information atom associated with a node α̂ in the logic lattice is exactly the information about the target that can be accessed in the way described by the constraint α̂. For example, the information shared by all sources Π({1}, {2}, {3}) is to be found at the very bottom of the logic lattice because access to this information is constrained in the least possible way: the shared information can be accessed via any source (i.e. via source 1 or source 2 or source 3). By monotonicity, the shared information is of course also accessible via any collection of sources, so that in total there are seven ways to access it (one per collection). By contrast, the all-way synergy Π({1, 2, 3}) is located at the very top of the logic lattice because access to it is most heavily constrained: the synergy can only be accessed if all sources are known at the same time. Hence, there is only a single way (collection) to access it. All other atoms are to be found in between these two extremes.
For instance, the atom corresponding to the constraint φ 1 ∨ (φ 2 ∧ φ 3 ) is exactly the information that can be accessed either via source 1 or via sources 2 and 3 jointly (and of course via any superset of these collections by monotonicity) but not in any other way. In general, the atoms on the k-th level of the logic lattice (starting to count at the top) are precisely the atoms that can be accessed via k collections of sources (compare this to the very similar insight in §2ciii).
Finally, one may also associate a redundant information term with each node in the logic lattice by interpreting the statements as merely sufficient conditions for access instead of as constraints, i.e. necessary and sufficient conditions, on access. For instance, the redundancy associated with the statement φ 1 ∧ φ 2 ∧ φ 3 would be all information for which joint knowledge of all three sources is sufficient. But this is of course all information contained in the sources, i.e. the entire joint mutual information. By contrast, the information atom associated with the same statement is the information for which joint knowledge of all three sources is not only sufficient but also necessary, i.e. it cannot be obtained via any other collection of sources. Or put generally: while the redundancy is the information we obtain if we have knowledge of certain collections of sources, the information atom is the information we obtain if and only if we have such knowledge. Defined in this way, the redundant information of a lattice node is again the sum of atoms associated with nodes below and including it.
In this way the logical perspective can be used as an independent starting point to develop PID theory. Instead of characterizing atoms by their defining parthood relations one might equally characterize them by their defining access constraints and relate them to the notion of redundant information in the way just described. This is summarized in the following Core Principle:

Core Principle 5. Each atom of information is characterized by a logical constraint describing via which collections of sources it can be accessed. The atom Π(α) associated with constraint α̂ = ⋁ a∈α ⋀ i∈a φ i is exactly that part of the joint mutual information about the target that can be accessed if and only if we have knowledge of any one of the collections of sources a ∈ α.

Now that we have fully introduced both the parthood and logical approaches to PID, it is worth noting their key difference from the original "antichain" approach by Williams and Beer: whereas the parthood and logic approaches look at the problem from the perspective of the atoms and seek to describe their defining parthood relations / access constraints, the antichain-based approach starts off by placing certain axioms on measures of redundant information, leading to the insight that the definition of redundancy may be restricted to antichains. The atoms are then indirectly introduced in terms of a Moebius Inversion over the lattice of antichains.
The next section highlights an additional use of logic lattices, namely as a mathematical tool to analyse the structure of PID lattices.

(b) Using logic lattices as a mathematical tool to analyse the structure of PID lattices

One advantage that logic lattices have over the lattices of antichains and parthood distributions is that their ordering relationship is particularly natural and well-understood: logical implication between statements. By contrast, the ordering relation ⪯ on the lattice of antichains only seems to have been studied in quite restricted order-theoretic contexts so far. Furthermore, it is a purely technical concept that does not have a clear-cut counterpart in ordinary language. Because of the simplicity of its ordering relation, many important order-theoretic concepts have a simple interpretation within the logic lattice. This makes it a useful tool for understanding the structure of the lattice itself, which in turn is relevant to the computation of information atoms.
There is an interesting fact about the statements in L that will be useful in the following investigations: they correspond to statements with monotonic truth-tables. The truth-table T α̂ ∶ 𝒱 → {0, 1} of a statement α̂ describes which models V ∈ 𝒱 satisfy α̂ ("make α̂ true"), i.e.

T α̂ (V ) = 1 if V ⊧ α̂, and T α̂ (V ) = 0 otherwise.
A truth-table T shall be called monotonic just in case for all valuations V, V ′ with V (φ i ) ≤ V ′ (φ i ) for all i ∈ {1, . . . , n}: if T (V ) = 1, then T (V ′ ) = 1. In other words, suppose a statement α̂ is satisfied by a valuation V . Now suppose further that a new valuation V ′ is constructed by flipping one or more zeros to ones in V . Then α̂ has to be satisfied by V ′ as well. Making some φ i true that were previously false cannot make α̂ false if it was previously true. With this terminology at hand the following proposition can be formulated:

Proposition 1. All α̂ ∈ L have monotonic truth-tables. Conversely, for every monotonic truth-table T , there is exactly one α̂ ∈ L such that T α̂ = T . In other words, the statements in L are, up to logical equivalence, exactly the statements of propositional logic with monotonic truth-tables.

Proof. See Appendix (c)
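The monotonicity condition of Proposition 1 is easy to check mechanically; a sketch (helper names ours) testing two example statements over three propositional variables:

```python
from itertools import product

def is_monotonic(T, n):
    """T maps each valuation (a 0/1-tuple of length n) to 0 or 1. Monotonic:
    flipping zeros to ones in a satisfying valuation keeps it satisfying."""
    for V in product((0, 1), repeat=n):
        if not T[V]:
            continue
        for i in range(n):
            if V[i] == 0:
                V2 = V[:i] + (1,) + V[i + 1:]
                if not T[V2]:
                    return False
    return True

# φ1 ∨ (φ2 ∧ φ3) is in L and has a monotonic truth-table ...
T_or_and = {V: int(V[0] or (V[1] and V[2])) for V in product((0, 1), repeat=3)}
# ... whereas ¬φ1 does not
T_neg = {V: int(not V[0]) for V in product((0, 1), repeat=3)}
print(is_monotonic(T_or_and, 3), is_monotonic(T_neg, 3))  # True False
```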
Now, it was shown in [7] that the information atoms have a closed form solution in terms of the meets of any subset of children of the corresponding node in the lattice. The meet (infimum) and join (supremum) operations, however, have quite straightforward interpretations on (L, ⫤): the meet of two statements α̂ and β̂ is the strongest statement logically weaker than both α̂ and β̂. Similarly, the join is the weakest statement logically stronger than both α̂ and β̂. The meet is logically equivalent (though not identical) to the disjunction of α̂ and β̂, while the join is logically equivalent (though not identical) to their conjunction. The disjunction and the conjunction of two elements of L do not generally lie in L themselves because they do not necessarily have the appropriate form (disjunction of logically independent conjunctions). However, this can easily be remedied because both the disjunction and the conjunction of elements of L have monotonic truth-tables. Thus, by Proposition 1, there is in both cases a unique element of L with the same truth-table. These elements are therefore the meet and join. The detailed construction of the meet and join operators is presented in Appendix (c).
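This construction can be sketched compactly by representing each statement in L by its antichain of conjunctions (a representation and set of helper names of ours): the meet is then the superset-reduced union of the two antichains (disjunction followed by absorption), and the join is the superset-reduced set of pairwise unions (the distributed conjunction).

```python
def antichain(sets):
    # normalize into L: drop conjunctions subsumed by a logically weaker one
    sets = {frozenset(a) for a in sets}
    return {a for a in sets if not any(b < a for b in sets)}

def meet(alpha, beta):
    # disjunction: strongest statement weaker than both, normalized into L
    return antichain(set(alpha) | set(beta))

def join(alpha, beta):
    # conjunction, distributed into a disjunction of conjunctions
    return antichain({a | b for a in alpha for b in beta})

a = antichain([{1}])            # φ1
b = antichain([{2}, {1, 3}])    # φ2 ∨ (φ1 ∧ φ3)
print(meet(a, b))   # {{1}, {2}}: φ1 ∨ φ2 (the term φ1 ∧ φ3 is absorbed by φ1)
print(join(a, b))   # {{1,2}, {1,3}}: (φ1 ∧ φ2) ∨ (φ1 ∧ φ3)
```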
Let us now turn to the notions of child and parent. A child of a statement α̂ ∈ L is a strongest statement strictly weaker than α̂. Similarly, a parent of α̂ is a weakest statement strictly stronger than α̂. The following three propositions provide, first, a characterization of children in terms of their truth-tables, second, a lower bound on the number of children of a statement, and third, an algorithm to determine all children of a statement. Due to the isomorphism of antichains, parthood distributions, and logical statements, the propositions can be utilized to study any of these three structures.

Proof. See Appendix (c)
Proposition 3 (Lower bound on number of children). Any α ∈ A such that there is at least one a ∈ α with ∣a∣ = k ≥ 1 has at least k children.

Proof. See Appendix (c)
Proposition 4 (Algorithm to determine children). The children of a statementα can be determined via the following algorithm (for a pseudocode version see Appendix (c)). It proceeds in three steps: (iii) If k > 0, decrease k by 1 and repeat Step 2. Otherwise, terminate.

Proof. See Appendix (c)
This concludes our discussion of the relationship between formal logic and PID. In the next section we return to the parthood side of our story. In particular, we will address an apparent arbitrariness in the argument presented in §2c. There we showed that the sizes of the atoms of information can be obtained once a measure of redundant information is specified. Now, one may of course ask: why redundant information? Couldn't the same purpose be achieved by utilizing some other informational quantity such as synergistic or unique information? We will now discuss how the parthood approach can help answer this question in a systematic way.

Non-Redundancy based PIDs
Let us briefly revisit the structure of the argument in §2c. It involved three steps (presented in slightly different order above): first, based on the very concept of redundant information, we phrased a condition describing which atoms are part of which redundancies (Core Principle 3). Secondly, we showed that this parthood criterion entails a number of constraints on the measure I∩. Finally, we showed that, as long as these constraints are satisfied, we obtain a unique solution for the atoms of information. There is actually a fourth step: we would have to check that the information decomposition satisfies the consistency equations relating atoms to mutual information terms (Equation 2.5). However, in the case of redundant information this condition is trivially satisfied due to the self-redundancy property. In other words, the consistency equations are themselves part of the system of equations used to solve for the information atoms.
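For n = 2 the scheme can be made fully concrete. The sketch below (a toy illustration of our own, not a construction from the paper) uses the Williams–Beer measure I_min as the redundancy measure and recovers the four atoms by subtracting, from each redundancy value, the atoms strictly below it in the redundancy lattice. For the XOR example all information turns out to be synergistic:

```python
from math import log2
from itertools import product

# joint distribution p(t, s1, s2) with T = S1 XOR S2, sources uniform
p = {(s1 ^ s2, s1, s2): 0.25 for s1, s2 in product((0, 1), repeat=2)}

def marg(keep):
    """Marginal distribution over the tuple positions in `keep`."""
    out = {}
    for k, pr in p.items():
        kk = tuple(k[i] for i in keep)
        out[kk] = out.get(kk, 0.0) + pr
    return out

def specific_info(t, coll):
    """Specific information I(T = t; A) for a collection of source
    positions, as used in Williams and Beer's I_min."""
    pa, pta, pt = marg(coll), marg((0,) + coll), marg((0,))[(t,)]
    return sum((pr / pt) * log2(pr / (pa[k[1:]] * pt))
               for k, pr in pta.items() if k[0] == t)

def i_min(collections):
    """Redundancy = expected minimum specific information."""
    return sum(pr * min(specific_info(t, c) for c in collections)
               for (t,), pr in marg((0,)).items())

# n = 2 redundancy lattice: {1}{2} below {1} and {2}, both below {12}
nodes = {"{1}{2}": ((1,), (2,)), "{1}": ((1,),),
         "{2}": ((2,),), "{12}": ((1, 2),)}
strictly_below = {"{1}{2}": [], "{1}": ["{1}{2}"], "{2}": ["{1}{2}"],
                  "{12}": ["{1}{2}", "{1}", "{2}"]}

atoms = {}
for name in ("{1}{2}", "{1}", "{2}", "{12}"):  # bottom-up order
    atoms[name] = (i_min(nodes[name])
                   - sum(atoms[b] for b in strictly_below[name]))
```

For the XOR distribution this yields Π({1}{2}) = Π({1}) = Π({2}) = 0 and Π({1, 2}) = 1 bit.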
In order to obtain an information decomposition based on a quantity other than redundant information, let us call it I*(T : a_1, . . . , a_m), we may use precisely the same scheme. Let us work through these steps in specific cases.

(a) Restricted Information PID
Recall that the redundant information of multiple collections of sources is the information we obtain if we have access to any of the collections. Similarly, we can define the information "restricted by" collections of sources a_1, . . . , a_m as any information we obtain only if we have access to at least one of the collections. For instance, assuming n = 2, the information restricted by the first source consists of its unique information and its synergy with the second source. Both of these quantities can only be obtained if we have access to the first source. Thus, in general, the restricted information Ires(T : a_1, . . . , a_m) should consist of all the atoms that are only part of the information carried by some of the a_i but not part of the information provided by any other collection of sources. This yields the parthood condition Cres and a corresponding relation to conditional mutual information, which can be established using the chain rule for mutual information as detailed in Appendix (d). The next step is to show that we may obtain a unique solution for the information atoms once a measure of restricted information satisfying these conditions is given. This can be achieved in much the same way as for redundant information: the restricted information associated with an antichain α can be expressed as a sum of information atoms Π(β) below and including α in a specific lattice of antichains (A, ⪯′). This lattice is simply the dual (inverted version) of the antichain lattice (A, ⪯).
Accordingly, a unique solution is guaranteed via Moebius-Inversion of this relationship. As a final step, we need to show that the resulting atoms stand in the appropriate relationships to mutual information terms. These relationships are given by the consistency equation (2.5). Again using the chain rule, it can be shown that this equation is equivalent to a condition relating conditional mutual information to the information atoms. Now consider any collection of source indices a = {j_1, . . . , j_m}; the desired equality follows because in the case of singletons the parthood condition Cres reduces to f(α_∪^C) = 0. This establishes that the resulting atoms satisfy the consistency condition, and we obtain a valid PID. In the following section we will use the same approach to analyse the question of whether a synergy based PID is possible.

(b) Synergy based PID
Note that the restricted information of multiple collections of sources stands in a direct correspondence to a weak form of synergy, which we will denote by Iws(T : a_1, . . . , a_m). This quantity is to be understood as the information about the target that we cannot obtain from any individual collection a_i, which fixes the corresponding parthood criterion. But this information is of course the same as the information that we can get only if some other collection is known (excepting subcollections of the a_i, of course), i.e.
Iws(T : a_1, . . . , a_m) = Ires(T : (b)_{∀i : ¬(b ⊆ a_i)}) (5.12)

Consider the case of two sources: the information we cannot get from source 2 alone, Iws(T : {2}), is the same as the information we can get only if the first source is known, Ires(T : {1}): the unique information of source 1 plus the synergistic information. Due to this correspondence, the argument presented above can also be used to show that a consistent PID can be obtained by fixing a measure Iws of weak synergy. Once such a measure is given, we can first translate it into the corresponding restricted information terms and then perform the Moebius-Inversion of Equation (5.6) (alternatively, the above argument could be redeveloped directly for Iws with minor modifications). Interestingly, if we associate with every antichain α in the lattice (A, ⪯) the corresponding Iws(T : β) (so that Ires(T : α) = Iws(T : β)), then the β form an isomorphic lattice but with a different ordering (see Figure 8). Just like the original antichain lattice, this structure on the antichains was introduced by Crampton and Loizou [5].
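The correspondence can be checked mechanically at the level of parthood distributions. The following sketch (our own encoding, with the two parthood conditions paraphrased from the text) enumerates all monotone parthood distributions for n = 2 and verifies that the atoms satisfying the weak-synergy condition for {2} are exactly those satisfying the restricted-information condition for {1}:

```python
from itertools import combinations, product

n = 2
universe = frozenset(range(1, n + 1))
subsets = [frozenset(c) for r in range(n + 1)
           for c in combinations(universe, r)]

def parthood_distributions():
    """All monotone f: 2^{1..n} -> {0,1} with f(empty set) = 0 and
    f(full set) = 1; each one corresponds to an atom of information."""
    for bits in product((0, 1), repeat=len(subsets)):
        f = dict(zip(subsets, bits))
        if f[frozenset()] == 0 and f[universe] == 1 and \
           all(f[a] <= f[b] for a in subsets for b in subsets if a <= b):
            yield f

def c_ws(f, colls):
    # weak synergy: the atom is not accessible from any of the colls
    return all(f[a] == 0 for a in colls)

def c_res(f, colls):
    # restricted information: every collection the atom is part of
    # already contains one of the colls
    return all(any(b >= a for a in colls)
               for b in subsets if f[b] == 1)
```

For n = 2 there are four such distributions (the four atoms), and the two satisfying Cws for {2} — the unique information of source 1 and the synergy — coincide with those satisfying Cres for {1}.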
In the PID field, a restricted version of this lattice (i.e. restricted to a certain subset of antichains) has been described by [9] and [1] under the name "constraint lattice". This term is also appropriate in the present context: intuitively, if we move up the constraint lattice we encounter information that satisfies more and more constraints. First, all of the information in the sources (Iws(T : ∅)); this is the case of no constraints. Then all the information that is not contained in a particular individual source (Iws(T : {1}) and Iws(T : {2})). And finally the information that is not contained in any individual source (Iws(T : {1}, {2})).
Most recently, the full version of the lattice (i.e. defined on all antichains) has been utilized by [15] to formulate a synergy-centered information decomposition. They call the lattice the extended constraint lattice and define "synergy atoms" S∂ in terms of a Moebius-Inversion over it. The concept of synergy Sα utilized in this approach closely resembles what we have called weak synergy. However, the decomposition is structurally different from the type of decomposition discussed here and generally assumed in previous work on PID. (Note that, following a widespread convention, we leave out the outer curly brackets around the antichains.) Even though it leads to the same number of atoms, these atoms do not stand in the expected relationships to mutual information. For instance, in the 2-sources case, there is no pair of atoms that necessarily adds up to the mutual information provided by the first source, and no such pair of atoms for the second source. The consistency equation (2.5) is not satisfied (except for the full set of sources). This means that synergy atoms S∂ are not directly comparable to standard PID atoms Π. They represent different types of information.
Let us now move towards stronger concepts of synergistic information. The reason for the term "weak" synergy is that a key ingredient of synergy seems to be missing from its definition: intuitively, the synergy of multiple sources is the information that cannot be obtained from any individual source but that becomes "visible" once we know all the sources at the same time. However, the definition of weak synergy only comprises the first part of this idea. The weak synergy Iws(T : a_1, . . . , a_m) also contains parts that do not become visible even if we have access to all a_i. For instance, given n = 3, the weak synergy Iws(T : {1}, {2}) also contains the unique information of the third source, Π({3}), because this quantity is accessible from neither the first nor the second source.
So let us add this missing ingredient by strengthening the parthood criterion. We obtain a moderate type of synergy we denote by Ims(T : a_1, . . . , a_m). It has a nice geometrical interpretation: in an information diagram it corresponds to all atoms outside of all areas associated with the mutual information carried by some a_i but inside the area associated with the mutual information carried by the union of the a_i (see Figure 9). Furthermore, we can immediately see that the parthood condition cannot be satisfied for individual collections a (it demands f(a) = 0 and f(a) = 1 at the same time). This makes intuitive sense because the synergy of an individual collection appears to be an ill-defined concept: at least two things have to come together for there to be synergy. We will get back to the case of individual collections below. Let us first see what properties are implied by Cms. It can readily be shown that Ims is symmetric, idempotent, and invariant under subset removal. This again allows us to restrict the domain of Ims to the antichains. Additionally, Ims satisfies the following condition:

If ∃i : α_∪ = a_i, then Ims(T : α) = 0 (zero condition) (5.14)

This property says that whenever the union of the collections happens to be equal to one of the collections, the moderate synergy must be zero. This is in particular the case for the moderate "self-synergy" of a single collection. At first sight this raises a problem, since the synergy equations associated with individual collections become trivial (0 = 0) and do not impose any constraints on the atoms. This situation can be remedied, however, by noting that these missing constraints are provided by the consistency equations relating the atoms to mutual information / conditional mutual information. In this way a unique solution for the atoms is indeed guaranteed (one could also axiomatically set the "self-synergies" to the respective conditional mutual information terms).
The proof of this statement is given in Appendix (d).
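The difference between Cws and Cms can be illustrated concretely. In the sketch below (our own encoding), the atom Π({3}) for n = 3 satisfies the weak-synergy condition for the collections {1}, {2} but is excluded by the moderate-synergy condition, and Cms for a single collection is contradictory:

```python
from itertools import combinations

n = 3
universe = frozenset(range(1, n + 1))
subsets = [frozenset(c) for r in range(n + 1)
           for c in combinations(universe, r)]

def atom(a):
    """Parthood distribution of the atom Pi(a): it is part of exactly
    the collections containing a."""
    return {b: int(b >= a) for b in subsets}

def c_ws(f, colls):
    # weak synergy: not accessible from any individual collection
    return all(f[a] == 0 for a in colls)

def c_ms(f, colls):
    # moderate synergy: additionally visible from the union of colls
    union = frozenset().union(*colls)
    return c_ws(f, colls) and f[union] == 1
```

Π({3}) is contained in Iws(T : {1}, {2}) but not in Ims(T : {1}, {2}), matching the discussion above; and for a single collection a, c_ms demands f(a) = 0 and f(a) = 1 simultaneously, so no atom can satisfy it.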
An instructive fact about the moderate synergy based PID is that the underlying system of equations does not have the structure of a Moebius-Inversion over a lattice: there is no arrangement of atoms into a lattice such that each Ims(T : α) turns out to be the sum of atoms below and including a particular lattice node. The reason is that any finite lattice always has a unique least element. In other words, some atom would have to appear at the very bottom of the lattice and would therefore be contained in all synergy terms. However, in the case of moderate synergy, there is no such atom for n ≥ 3. The only viable candidate would be the overall synergy Π({1, . . . , n}). But due to the condition that the synergistic information has to become visible if we know all collections in question, this atom is not contained, e.g., in Ims(T : {1}, {2}). Now one may wonder whether the concept of synergy can be strengthened even further by demanding that the synergistic information should not be accessible from the union of any proper subset of the collections in question. For instance, the synergistic information Isyn(T : {1}{2}{3}) of sources 1, 2, and 3 should not be accessible from the collections {1, 2}, {1, 3}, or {2, 3}. We have to know all three sources to get access to their synergy. Thus, we may add this third constraint to obtain a strong notion of synergy we denote by Isyn(T : a_1, . . . , a_m). An atom Π(f) should then satisfy the corresponding parthood condition Csyn(f : a_1, . . . , a_m). The last of its conditions is phrased as a conditional because the union of a proper subset of the collections might happen to be equal to the union of all collections in question. Consider the case of three sources and the synergy Isyn(T : {1, 2}{1, 3}{2, 3}). In this case the union of a proper subset of these collections, for instance {1, 2} ∪ {1, 3}, happens to be equal to the union of all a_i.

(c) Unique information PID
Let us briefly discuss the last obvious candidate quantity for determining the PID atoms: unique information [3]. The appropriate parthood criterion for a measure of unique information Iunq seems straightforward in the case of individual collections a: it should consist of all atoms that are part of the information provided by the collection a but not part of the information provided by any other collection. This is what makes the information "unique" to the collection. Since there is always just one such atom, this means that Iunq(T : a) = Π(a). For instance, Iunq(T : {1}) = Π({1}), as expected. However, defining Iunq only for individual collections does not yield enough equations to solve for the atoms. We need one equation per antichain / parthood distribution, and hence some notion of the unique information associated with multiple collections a_1, . . . , a_m. This is a trickier question. What does it mean for information to be unique to these collections? Certainly, uniqueness demands that this information should not be contained in any other collection. But what about the collections a_1, . . . , a_m themselves? It seems that the appropriate condition is that the unique information should consist of atoms that are contained in all of these collections. This idea aligns well with ordinary language: for instance, saying that a certain protein is unique to sheep and goats means that this protein is found in both sheep and goats and nowhere else. The resulting parthood criterion, however, simply defines the atom Π(a_1, . . . , a_m), making the unique information based PID possible but maybe not very helpful: it just amounts to defining all the atoms separately because Iunq(T : α) = Π(α) for all antichains α.

Parthood descriptions vs. quantitative descriptions
Before concluding, we would like to briefly point out an issue that arises quite naturally when thinking about information theory from a parthood perspective and that merits a few remarks. Throughout this paper we have drawn a distinction between parthood relationships and quantitative relationships between information contributions. In particular, Core Principles 1 and 3 express parthood relationships between information atoms on the one hand and mutual information / redundant information on the other. Core Principle 2, by contrast, describes the quantitative relationship between any information contribution and the parts it consists of. It is crucial to draw this distinction because these principles are logically independent. Consider the case of two sources: one could agree that the joint mutual information should consist of four parts while disagreeing that it should be the sum of these parts. On the other hand, one could agree that the joint mutual information should be the sum of its parts but disagree that it consists of four parts. The distinction between parthood relations and quantitative relations is also important in the argument that the redundant information provided by multiple statements is the information carried by the truth of their disjunction. One of the two motivations for this idea was based on the principle that the information provided by a statement A is always part of the information provided by any stronger statement B. This does not mean, however, that statement A necessarily provides quantitatively less information than B (i.e. fewer bits of information). In fact, this latter principle would contradict classical information theory. Here is why: suppose the pointwise mutual information i(t : s) = i(t : S = s) is negative. Now consider any tautology such as S = s ∨ ¬(S = s). Certainly, this statement is logically weaker than S = s because a tautology is implied by any other statement.
Furthermore, the probability of the tautology being true is equal to 1. Therefore, the information i(t : S = s ∨ ¬(S = s)) provided by it is equal to 0. But this means i(t : S = s) < i(t : S = s ∨ ¬(S = s)) even though S = s ∨ ¬(S = s) is logically weaker than S = s.
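A small numerical sketch (with a hypothetical joint distribution of our own choosing) makes this concrete: observing S = 1 lowers the probability of T = 1, so the pointwise mutual information is negative, while the tautology, being true with probability 1, provides exactly 0 bits:

```python
from math import log2

# hypothetical joint distribution p(t, s) in which S = 1 is
# misinformative about T = 1
p = {(1, 1): 0.1, (1, 0): 0.4, (0, 1): 0.3, (0, 0): 0.2}

p_t1 = sum(pr for (t, s), pr in p.items() if t == 1)   # prior p(T=1)
p_s1 = sum(pr for (t, s), pr in p.items() if s == 1)
p_t1_given_s1 = p[(1, 1)] / p_s1

# pointwise mutual information i(t : S = s) = log2 p(t|s) / p(t)
i_pointwise = log2(p_t1_given_s1 / p_t1)

# the tautology "S = 1 or not S = 1" holds with probability 1, so
# conditioning on it leaves p(t) unchanged: 0 bits
i_tautology = log2(1.0)
```

Here i(1 : S = 1) = log2(0.25 / 0.5) = −1 bit, strictly less than the 0 bits provided by the logically weaker tautology.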
Nonetheless, there certainly is a sense in which a stronger statement B provides "more" information than a weaker statement A: the information provided by A is part of the information provided by B. If we know B is true, then we can by assumption infer that A is true, and hence we have access to all the information provided by A. The fact that the stronger statement B may nonetheless provide fewer bits of information can be explained in terms of misinformation: if we know B is true, then we obtain all the information carried by A plus some additional information.
If it happens that this surplus information is misinformative, i.e. negative, then quantitatively B will provide less information than A. This idea is illustrated in Figure 10. Importantly, the possible negativity and non-monotonicity of i^sx_∩ as well as the potential negativity of π^sx can be completely explained in terms of misinformative contributions in the following sense: it is possible [8] to uniquely separate i^sx_∩ into an informative part i^sx+_∩ and a misinformative part i^sx−_∩ such that i^sx_∩ = i^sx+_∩ − i^sx−_∩. Now, each of these components can be shown to be non-negative and monotonically increasing over the lattice. Moreover, the induced informative and misinformative atoms π^sx+ and π^sx− are non-negative as well [12]. In other words, once we separate out informative and misinformative components, any violations of non-negativity and monotonicity disappear. Hence, these violations can be fully accounted for in terms of misinformative contributions.

Conclusion
In this paper we connected PID theory with ideas from mereology, i.e. the study of parthood relations, and formal logic. The main insights derived from these ideas are that the general structure of information decomposition as originally introduced by Williams and Beer [29] can be derived entirely from (1) parthood relations between information contributions and (2) a hierarchy of logical constraints on how information about the target can be accessed. In this way the theory is set up from the perspective of the atoms of information, i.e. the quantities we are ultimately interested in. The n-sources PID problem has conventionally been approached by defining a measure of redundant information which in turn implies a unique solution for the atoms of information. We showed how such a measure can be defined in terms of the information provided by logical statements of a specific form. We discussed furthermore how the parthood perspective can be utilized to systematically address the question of whether a PID may be determined based on concepts other than redundancy. In doing so, we showed that this is indeed possible in terms of measures of "restricted information", "weak synergy", and "moderate synergy", but not in terms of "strong synergy". We hope to have shown that there are deep connections between mereology, formal logic and information decomposition that future research in these fields may benefit from.

Appendix

A partial information decomposition of the mutual information provided by the sources S_1, . . . , S_n about the target T is any function Π_{P_J} : B_n → ℝ, determined by P_J, that satisfies the consistency equation for all a ⊆ {1, . . . , n}. The subscripts P_J indicate that both the mutual information and the information atoms are functions of the underlying joint distribution.

A valuation V is said to satisfy a statement α, written ⊧_V α, under the following conditions:
(i) If α is an atomic statement, then ⊧_V α ⇐⇒ V(α) = 1
(ii) If α is of the form β ∧ γ, then ⊧_V α ⇐⇒ ⊧_V β and ⊧_V γ
(iii) If α is of the form β ∨ γ, then ⊧_V α ⇐⇒ ⊧_V β or ⊧_V γ
In this way, the satisfaction relationship is inductively defined for all statements of the propositional language we are considering here. The relation of logical implication is now defined such that a statement α implies a statement β just in case all valuations that satisfy α also satisfy β. Formally, α ⊧ β ⇐⇒ ∀V ∈ 𝒱 : ⊧_V α → ⊧_V β.

First, ϕ is surjective: let f ∈ B, then ϕ(α_f) = f for the set α_f of minimal elements with value 1.
We now turn to the isomorphism between (L, ⊧) and (A, ⪯). The mapping Ψ : A → L defined in the main text is an isomorphism. Ψ is injective, for let α, β ∈ A be two distinct antichains. Then there has to be an a ∈ α not contained in β (or vice versa). But then the conjunction ⋀_{i∈a} φ_i will appear in α̃ = Ψ(α) while it does not appear in β̃ = Ψ(β). Accordingly, α̃ and β̃ are distinct elements of L. Ψ is surjective as well, for let α̃ ∈ L. Then α̃ is of the form ⋁_{j∈J} ⋀_{i∈j} φ_i for some set of index sets J = {j_1, . . . , j_m} where j_i ⊆ {1, . . . , n}. Because the conjunctions ⋀_{i∈j} φ_i have to be logically independent, it follows that the index sets cannot be subsets of each other, i.e. ¬(j_k ⊇ j_l) for k ≠ l. But this implies that J is an antichain which is, by definition of Ψ, mapped onto α̃. It only remains to be shown that β ⪯ α ⇐⇒ α̃ ⊧ β̃. First, suppose that β ⪯ α. We need to show that for all valuations V ∈ 𝒱 = {0, 1}^{φ_1, . . . , φ_n}: ⊧_V α̃ → ⊧_V β̃, i.e. all Boolean valuations of the φ_i that make α̃ true also make β̃ true. So suppose ⊧_V α̃; then there must be an a ∈ α such that ⊧_V ⋀_{i∈a} φ_i. But since β ⪯ α, there must be a b ∈ β such that a ⊇ b. Therefore, ⊧_V ⋀_{i∈b} φ_i. Hence, V also satisfies the disjunction over all b ∈ β. Regarding the other direction, suppose that α̃ ⊧ β̃, i.e. all valuations satisfying α̃ also satisfy β̃. Now suppose for contradiction that ¬(β ⪯ α), i.e. ∃a* ∈ α ∀b ∈ β : ¬(a* ⊇ b). In this case, we can construct a valuation V that satisfies α̃ but not β̃ by setting V(φ_i) = 1 exactly for i ∈ a*. By construction, all b ∈ β contain at least one index i not contained in a*. Therefore, V does not satisfy any of the conjunctions ⋀_{i∈b} φ_i, and thus it does not satisfy β̃, in contradiction to the initial assumption. Hence, β ⪯ α, concluding the proof. Proof. Follows from the isomorphism and the fact that (A, ⪯) is a lattice as shown by [5].

(c) Proofs of Propositions (i) Monotonic truth tables
Proof of Proposition 1. Let α̃ ∈ L and let V, V′ ∈ 𝒱 be valuations such that V(φ_i) = 1 → V′(φ_i) = 1 for all i. Suppose that T_α̃(V) = 1. Then V must satisfy at least one of the conjunctions ⋀_{i∈a} φ_i. But since V(φ_i) = 1 → V′(φ_i) = 1, any conjunction satisfied by V must also be satisfied by V′. Hence, T_α̃(V′) = 1.
Regarding the converse: let T be a monotonic truth table. Then T = T_α̃* for the statement α̃* given by the disjunction, over all valuations V with T(V) = 1, of the conjunctions of the atomic statements made true by V. Note that α̃* is generally not in L because the conjunctions are not necessarily logically independent. But one can obtain an equivalent statement α̃ ∈ L by removing all conjunctions from α̃* that logically imply another conjunction in α̃*. Let α̃ be this statement. Then, if α̃ is true, certainly α̃* is true because the latter differs from the former only through additional disjuncts. Conversely, if α̃* is true, then one of its conjuncts must be true. If the true conjunct in α̃* appears in α̃ as well (i.e. it has not been removed), then trivially α̃ has to be true as well. On the other hand, if this conjunct does not appear in α̃, then it must have been removed, which implies that there is a logically weaker conjunct in α̃. But then this logically weaker conjunct has to be true as well, thereby making α̃ true. Therefore, α̃* and α̃ have the same truth table T, and α̃ ∈ L as desired. Furthermore, α̃ is unique because ⊧ is antisymmetric on L by Corollary 1. Hence, there can be no two distinct but logically equivalent elements (i.e. elements with the same truth table) in L.

(ii) Characterization of Children
Proof of Proposition 2. Concerning the if-part we show the contraposition: suppose that there is a β̃ strictly in between γ̃ and α̃. Then there must be a valuation V_1 such that T_β̃(V_1) = 1 while T_α̃(V_1) = 0, and a distinct valuation V_2 such that T_γ̃(V_2) = 1 while T_β̃(V_2) = 0. But for both of these valuations it holds that T_γ̃ = 1 while T_α̃ = 0. Thus, γ̃ would be true in at least two additional cases. Concerning the only-if part we show the contraposition again: suppose that γ̃ is true in the k ≥ 2 additional cases contained in 𝒱* = {V_1, V_2, . . . , V_k}. Consider the subset 𝒱*^min of these valuations assigning a minimal number of ones, pick some V* ∈ 𝒱*^min, and modify the truth table of γ̃ by setting its value at V* to 0. Then the truth table is monotonic, and the statement β̃ associated with this truth table is strictly in between γ̃ and α̃. The latter is true because all valuations that satisfy α̃ also satisfy β̃, and all valuations that satisfy β̃ also satisfy γ̃. At the same time there is a valuation, namely V*, that satisfies γ̃ but not β̃, and a set of valuations with at least one element, namely 𝒱* ∖ {V*}, whose members satisfy β̃ but not α̃. Thus, all three statements have to be distinct. Regarding the monotonicity: by assumption γ̃ has a monotonic truth table, and the truth table of β̃ is identical except that T_β̃(V*) = 0. So the only way T_β̃ could fail to be monotonic would be for there to exist a valuation V′*, distinct from V*, that would enforce T_β̃(V*) = 1 via monotonicity, i.e. a valuation that results from flipping some ones in V* to zeros and that satisfies β̃. Suppose there is such a valuation. V′* would have to satisfy β̃ while not satisfying α̃, since if it did satisfy α̃, V* would have to satisfy α̃ as well, in contradiction to V* ∈ 𝒱*. Furthermore, as V′* satisfies β̃, it also satisfies γ̃. Therefore, V′* ∈ 𝒱*. However, V′* would then assign fewer ones than V*, contradicting the fact that V* ∈ 𝒱*^min.

(iii) Lower bound on children
Proof of Proposition 3. Let α be such an antichain and let a ∈ α be a set of indices with |a| = k. We utilize the isomorphism between A and L by showing that α̃ has at least k children. Since |a| = k there are exactly k distinct indices i_1, . . . , i_k ∈ a, and we can define k corresponding subsets of valuations 𝒱_1, . . . , 𝒱_k.
In other words, the valuations in 𝒱_1, first, do not satisfy α̃ and, second, assign a one to all φ_i with i in the collection a but not equal to i_1. The definition of the other 𝒱_i is analogous. The goal is now to find 'maximal' valuations (making as many φ_i true as possible) in these sets and to modify the truth table of α̃ by assigning a one to exactly one of these valuations. This can be done for each of the k valuations separately to obtain k novel monotonic truth tables. These monotonic truth tables are uniquely associated with specific statements via Proposition 1, which can then be shown to be children by Proposition 2 since they are true in exactly one more case than α̃. To make this argument, note first that 𝒱_1, . . . , 𝒱_k each contain at least one element, V_1, . . . , V_k respectively.
These valuations do not satisfy α̃: they do not satisfy the conjunction ⋀_{i∈a} φ_i, and since α is an antichain each a′ ≠ a has to contain at least one index j not contained in a. The corresponding conjunctions ⋀_{i∈a′} φ_i = φ_j ∧ ⋀_{i∈a′∖{j}} φ_i are therefore, by construction, not satisfied by any V_i.
Due to the maximality of these valuations, the truth tables obtained by assigning a one to exactly one of the maximal valuations V*_1, . . . , V*_k are monotonic. This is because, first, the truth table of α̃ is already monotonic and, second, if a zero is flipped to a one in V*_1 or . . . or V*_k, the resulting valuations are by construction guaranteed to satisfy α̃. Otherwise, we would obtain valuations in 𝒱_1 or . . . or 𝒱_k, respectively, containing more ones than V*_1 or . . . or V*_k, respectively, in contradiction to the maximality of these valuations. The uniquely defined statements γ̃_1, . . . , γ̃_k corresponding to these truth tables via Proposition 1 are children of α̃ by Proposition 2 because each of them is true in exactly one additional valuation compared to α̃. Finally, all of these statements are distinct since they are pairwise logically independent and a single statement cannot have multiple truth tables.
(iv) Algorithm to determine children

Proof of Proposition 4. Firstly, any γ̃ produced by the algorithm is a direct child since its truth table differs from that of α̃ only through an additional one, i.e. γ̃ is true in exactly one more case than α̃ and is thus a direct child by Proposition 2. Secondly, there is no child of α̃ that is not generated by the algorithm. Again by Proposition 2, the truth table of any such child would differ from that of α̃ only through a single one. But the algorithm systematically explores all possibilities of adding a single one to the truth table of α̃. Thus any child γ̃ will be generated at some point.
A pseudocode version of the algorithm is shown in Algorithm 1.

(v) Meet and Join operations on logic lattices
The meet ∧ and join ∨ operations can be explicitly constructed in the following way. The element of L logically equivalent to the disjunction α̃ ∨ β̃ can be obtained by simply removing all disjuncts that logically imply another disjunct. The element of L logically equivalent to the conjunction α̃ ∧ β̃ can be obtained by, first, applying the distributive law to obtain a disjunction of conjunctions, second, applying the idempotency law to all conjunctions to remove repeated statements, and third, again removing all disjuncts that logically imply another disjunct. Denoting the first two operations by D and I, and the removal of implying disjuncts by underlining, the meet and join have the explicit expressions given in the following proposition.

Proof. By construction, the underlined α̃ ∨ β̃ and I(D(α̃ ∧ β̃)) are in L. Furthermore, since the operations D, I, and the removal of implying disjuncts do not affect the truth conditions of statements, these elements are logically equivalent to α̃ ∨ β̃ and α̃ ∧ β̃, respectively. Hence, it only needs to be shown that the latter statements satisfy the conditions of meet and join, respectively. Now, clearly α̃ ∨ β̃ is logically weaker than both α̃ and β̃, while α̃ ∧ β̃ is logically stronger than both α̃ and β̃. It remains to be shown that the former is the strongest such statement while the latter is the weakest such statement. Suppose there were a statement γ̃ stronger than α̃ ∨ β̃; then there would have to be a model M* making γ̃ false and α̃ ∨ β̃ true. But since α̃ ∨ β̃ is true whenever either α̃ is true or β̃ is true, this means that γ̃ would have to be false in a case where one of α̃ or β̃ is true. However, this implies that γ̃ cannot be logically weaker than both α̃ and β̃, and hence α̃ ∨ β̃ must be the strongest statement logically weaker than α̃ and β̃. Now suppose there were a statement γ̃ weaker than α̃ ∧ β̃; then there would have to be a model M* making γ̃ true but α̃ ∧ β̃ false. But this means that γ̃ would be true in a case in which either α̃ or β̃ is false.
Accordingly, γ̃ cannot be stronger than both α̃ and β̃, and hence I(D(α̃ ∧ β̃)) must be the weakest statement logically stronger than α̃ and β̃.

(ii) Proof that moderate synergy induces a unique PID

The claim that defining a measure of moderate synergy leads to a unique solution for the atoms of information can be shown by starting from the system of equations associated with weak synergy. These equations can be transformed into the moderate synergy equations by operations that preserve invertibility. First, the "self-synergy" equations take the form

Iws(T : α_∪) = I(T : α_∪^C | α_∪) (8.28)

using that the monotonicity of parthood distributions implies f(α_∪) = 0 → f(a) = 0 for all a ∈ α. Therefore, we obtain

Ims(T : α) = Iws(T : α) − I(T : α_∪^C | α_∪) (8.29)
= Iws(T : α) − Iws(T : α_∪) (8.30)

showing that the moderate synergy equation associated with α is the difference between two weak synergy equations. Since subtracting one equation from another leaves invertibility unaffected, this establishes that the moderate synergy system of equations is invertible as well.