Breaking the MDS-PIR Capacity Barrier via Joint Storage Coding

Sun, Hua; Tian, Chao

doi:10.3390/info10090265


by Hua Sun ¹,* and Chao Tian ²,*

¹ Department of Electrical Engineering, University of North Texas, Denton, TX 76203, USA

² Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, USA

* Authors to whom correspondence should be addressed.

Information 2019, 10(9), 265; https://doi.org/10.3390/info10090265

Submission received: 20 June 2019 / Revised: 1 August 2019 / Accepted: 19 August 2019 / Published: 22 August 2019

(This article belongs to the Special Issue Private Information Retrieval: Techniques and Applications)


Abstract:

The capacity of private information retrieval (PIR) from databases coded using maximum distance separable (MDS) codes was previously characterized by Banawan and Ulukus, where it was assumed that the messages are encoded and stored separately in the databases. This assumption is also made in most related works in the literature, and this capacity is colloquially referred to as the MDS-PIR capacity. In this work, we consider the question of if and when this capacity barrier can be broken through joint encoding and storing of the messages. Our main results are two classes of novel code constructions, which allow joint encoding, together with the corresponding PIR protocols, which indeed outperform the separate MDS-coded systems. Moreover, we show that a simple but novel expansion technique allows us to generalize these two classes of codes, resulting in a wider range of cases where this capacity barrier can be broken.

Keywords:

private information retrieval; maximum distance separable codes; capacity

1. Introduction

Private information retrieval (PIR) [1] has attracted significant attention from researchers in the fields of theoretical computer science, cryptography, information theory, and coding theory. In the classical PIR model, a user wishes to retrieve one of the K available messages, from N non-communicating databases, each of which has a copy of these K messages. User privacy needs to be preserved during the retrieval process, which requires that the identity of the desired message not be revealed to any single database. To accomplish the task efficiently, good codes need to be designed such that the least amount of data should be downloaded. The inverse of the minimum amount of the downloaded data per-bit of desired message is referred to as the capacity of the PIR system. The capacity of the classical PIR system was characterized precisely in a recent work by Sun and Jafar [2].

In distributed systems, databases may fail; moreover, each storage node (database) is also constrained on the storage space. Erasure codes can be used to improve both storage efficiency and failure resistance, which motivated the investigation of PIR from data encoded with maximum distance separable (MDS) codes [3,4,5,6,7], with coding parameter

(N, T)

, i.e., the messages can be recovered by accessing any T databases. The capacity of PIR from MDS-coded databases (MDS-PIR) was characterized by Banawan and Ulukus [5], which is usually referred to as the MDS-PIR capacity colloquially.

In all these existing works, the storage code was designed such that each message was independently encoded and stored in the databases and thus could also be recovered individually. In fact, even when the storage codes are not necessarily MDS codes, most existing works on the capacity of private information retrieval assumed this separate coding architecture [8,9,10,11,12,13], and the only exceptions we are aware of are [14,15,16]. Though the architecture of separately encoding each message offers a simple storage solution with good data reliability, it is by no means the only possible MDS storage coding strategy. Instead, the messages can be stored jointly using an MDS code, which could provide the same level of data reliability at the same amount of storage overhead. Motivated by this simple observation, which was first mentioned as a footnote in [15] and can be further traced back to a code example in [14], we ask the following natural question: When can the MDS-PIR capacity barrier, which was established in [5] for separately encoding the messages using an MDS code, be broken, by allowing joint encoding of the messages using an MDS code?

In this work, we show that there are many cases, where through jointly encoding and storing the messages, the messages can be protected using an

(N, T)

MDS code, but retrieved with less data download than the separate coding architecture. In other words, the capacity barrier for separately encoding the messages can be broken for these cases. More precisely, the mathematical question we ask is under what

(K, N, T)

parameters jointly encoding and storing the MDS-coded messages can provide strict PIR retrieval rate improvement; we show that this can be done at least in the following two cases:

$(K, N, T) = (2, N, 2)$ and $N \geq 3$ ;
$(K, N, T) = (K, K + 1, K)$ and $K \geq 2$ .

To establish this result, we provide two novel code constructions and PIR protocols, which yield strict performance improvement over the strategy of encoding and storing messages separately using an MDS code. Moreover, we show that through a simple, but novel code expansion technique, the MDS-PIR capacity barrier can also be broken for the following cases for an arbitrary integer

m \geq 1

:

$(K, N, T) = (2, m N, 2 m)$ and $N \geq 3$ ;
$(K, N, T) = (K, m (K + 1), m K)$ and $K \geq 2$ .

It should be noted that related to the line of works that focus on PIR capacity, there exists another interesting line of work in coding theory [17,18,19,20,21,22] focusing on a different metric—the virtual server rate [19]—which studies how to store the messages jointly such that a minimum number of servers are used to simulate existing PIR protocols. In contrast to our current work, since the resulting joint storage code is not required to be MDS, it often turns out to be non-MDS, and thus cannot be compared directly.

The rest of the paper is organized as follows. In Section 2, we provide a precise description of the system model and problem formulation. In Section 3 and Section 4, we provide two novel joint storage codes and the corresponding PIR protocols. In Section 5, we present a technique that yields more general classes of codes, which strictly improve upon separately encoding and storing the messages. Finally, Section 6 concludes the paper.

2. System Model and Problem Formulation

In this section, we first provide a formal description of the system model, then proceed to pose the problem we seek to answer in this work. A couple of additional remarks to clarify the relation between our system model and those seen in the literature are given at the end of the section.

2.1. System Model

There is a total of K mutually-independent messages

W^{1}, W^{2}, \dots, W^{K}

in the system. Each message is uniformly distributed over

X^{L}

, i.e., the set of length L sequences in the finite alphabet

X

. The messages are MDS-coded and then distributed to N databases, such that from any T databases, the messages can be fully recovered. Since the messages are

(N, T)

MDS-coded, it is without loss of generality to assume that

L \cdot K = M \cdot T

for some integer M.

When a user wishes to retrieve a particular message

W^{k^{*}}

, N queries

Q_{1 : N}^{[k^{*}]} = (Q_{1}^{[k^{*}]}, \dots, Q_{N}^{[k^{*}]})

are sent to the databases, where

Q_{n}^{[k^{*}]}

is the query for database n. The retrieval needs to be information-theoretically private, i.e., no database is able to infer any knowledge as to which message is being requested. For this purpose, a random key

F

in the set

F

is used together with the desired message index

k^{*}

to generate the set of queries

Q_{1 : N}^{[k^{*}]}

. Each query

Q_{n}^{[k^{*}]}

belongs to the set of allowed queries for database n, denoted as

Q_{n}

. After receiving query

Q_{n}^{[k^{*}]}

, database n responds with an answer

A_{n}^{[k^{*}]}

. Each symbol in the answers from database n belongs to a finite alphabet

A_{n}

, and the answers may have multiple (and different numbers of) symbols. Using the answers

A_{1 : N}^{[k^{*}]}

from all N databases, together with

F

and

k^{*}

, the user then reconstructs

{\hat{W}}^{k^{*}}

. We shall refer to such a system as a

(K, N, T)

MDS-PIR system.

A more rigorous definition of a

(K, N, T)

system can be specified by a set of coding functions as follows. In the following, we denote the cardinality of a set

B

as

| B |

.

Definition 1.

A

(K, N, T)

MDS-PIR code consists of the following coding components:

1.: A set of MDS encoding functions:

$\begin{matrix} Φ_{n} : X^{L K} \to X^{M}, n \in {1, \dots, N}, \end{matrix}$

(1)

where each $Φ_{n}$ encodes all the messages together in the information to be stored at database n.
2.: A set of MDS decoding recovery functions:

$\begin{matrix} Ψ_{T} : X^{L K} \to X^{L K}, \end{matrix}$

(2)

for each $T \subseteq {1, \dots, N}$ such that $| T | = T$ , whose outputs are denoted as ${\tilde{W}}_{T}^{1 : K}$ ;
3.: A query function:

$\begin{matrix} ϕ_{n} : {1, \dots, K} \times F \to Q_{n}, n \in {1, \dots, N}, \end{matrix}$

i.e., for retrieving message $W^{k^{*}}$ , the user sends the query $Q_{n}^{[k^{*}]} = ϕ_{n} (k^{*}, F)$ to database n;
4.: An answer length function:

$\begin{matrix} ℓ_{n} : Q_{n} \to {0, 1, \dots}, n \in {1, \dots, N}, \end{matrix}$

(3)

i.e., the length of the answer from each database, a non-negative integer, is a deterministic function of the query, but not the particular realization of the messages;
5.: An answer-generating function:

$\begin{matrix} ϕ_{n}^{(q_{n})} : X^{M} \times Q_{n} \to A_{n}^{ℓ_{n}}, q_{n} \in Q_{n}, n \in {1, \dots, N}, \end{matrix}$

(4)

i.e., the answer when $q_{n} = Q_{n}^{[k^{*}]}$ is the query received by database n;
6.: A reconstruction function:

$\begin{matrix} ψ : \prod_{n = 1}^{N} A_{n}^{ℓ_{n}} \times {1, \dots, K} \times F \to X^{L}, \end{matrix}$

(5)

i.e., after receiving the answers, the user reconstructs the message as ${\hat{W}}^{k^{*}} = ψ (A_{1 : N}^{[k^{*}]}, k^{*}, F)$ .

These functions satisfy the following three requirements:

1.: MDS recoverable: For any $T \subseteq {1, \dots, N}$ such that $| T | = T$ , we have ${\tilde{W}}_{T}^{1 : K} = W^{1 : K}$ .
2.: Retrieval correctness: For any $k^{*} \in {1, \dots, K}$ , we have ${\hat{W}}^{k^{*}} = W^{k^{*}}$ .
3.: Privacy: For every $k, k^{'} \in {1, \dots, K}$ , $n \in {1, \dots, N}$ and $q \in Q_{n}$ ,

$\begin{matrix} \Pr (Q_{n}^{[k]} = q) = \Pr (Q_{n}^{[k^{'}]} = q) . \end{matrix}$

(6)

The retrieval rate is defined as:

\begin{matrix} R : = \frac{L log | X |}{\sum_{n = 1}^{N} E (ℓ_{n}) log | A_{n} |} . \end{matrix}

(7)

This is the number of bits of desired message information that can be privately retrieved per bit of downloaded data. The maximum possible retrieval rate is referred to as the capacity of the

(K, N, T)

system.
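As a small illustration of the rate definition (7), the following sketch evaluates the ratio of desired bits to expected downloaded bits. The function name `retrieval_rate` and the sample numbers (three ternary message symbols, four one-symbol ternary answers) are assumptions chosen for illustration only.

```python
from math import log2

def retrieval_rate(L, alphabet_size, expected_lengths, answer_alphabet_sizes):
    """Rate (7): L * log|X| divided by sum over n of E[l_n] * log|A_n|."""
    desired_bits = L * log2(alphabet_size)
    downloaded_bits = sum(
        length * log2(size)
        for length, size in zip(expected_lengths, answer_alphabet_sizes)
    )
    return desired_bits / downloaded_bits

# Illustrative numbers: L = 3 ternary symbols, one ternary symbol from each of 4 databases.
print(retrieval_rate(3, 3, [1, 1, 1, 1], [3, 3, 3, 3]))
```

When the message and answer alphabets coincide, the logarithms cancel and the rate reduces to L divided by the expected total number of downloaded symbols.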

2.2. Separate vs. Joint MDS Storage Codes

In the general problem definition we have provided above, the MDS encoding functions

Φ_{n}

allow the messages to be jointly encoded. For example, suppose we have

K = 2

messages,

N = 3

databases, and from any

T = 2

databases, we may decode both messages. A simple jointly-encoded MDS storage code is as follows. Each message has

L = 2

bits, denoted as

W^{1} = (a_{1}, a_{2}), W^{2} = (b_{1}, b_{2})

. Each database stores

M = L K / T = 2

bits, i.e., Database 1 stores

(a_{1}, a_{2})

, Database 2 stores

(b_{1}, b_{2})

, and Database 3 stores

(a_{1} + b_{1}, a_{2} + b_{2})

. However, in almost all existing works in the literature, e.g., [3,5,7,23,24,25,26], the messages are encoded separately. In other words, the MDS encoding functions have the special form:

\begin{matrix} Φ_{n} = (Φ_{n}^{1}, Φ_{n}^{2}, \dots, Φ_{n}^{K}), \end{matrix}

(8)

where:

\begin{matrix} Φ_{n}^{k} : X^{L} \to X^{M / K}, n \in {1, \dots, N}, k \in {1, \dots, K}, \end{matrix}

(9)

which encodes message

W^{k}

into its MDS-coded form to be stored at database n. Correspondingly, the MDS decoding functions have the form:

\begin{matrix} Ψ_{T} = (Ψ_{T}^{1}, Ψ_{T}^{2}, \dots, Ψ_{T}^{K}), \end{matrix}

(10)

where:

\begin{matrix} Ψ_{T}^{k} : X^{L} \to X^{L}, k \in {1, \dots, K}, \end{matrix}

(11)

which decodes message k from the information regarding

W^{k}

stored in the databases in the set

T

. Particularly, since most practical MDS codes are linear, several existing works have directly assumed the MDS encoding functions to be linear, and moreover, the component coding functions

Φ_{n}^{k}

for different messages

W^{k}

’s are the same; see, e.g., [5,23]. In other words, in this class of codes, the encoding function

Φ_{n}^{k}

can be written as the multiplication of the message vector

W^{k}

with an

L \times M / K

encoding matrix

G_{n}

, whose elements are also in the finite field

X

. To compare with the jointly-encoded MDS storage example above, we consider the same setting where

K = 2

messages,

L = 2

bits per message,

N = 3

servers, and the MDS parameter

T = 2

. A separate MDS storage code where each database stores

M / K

= 1 bit per message is as follows. Database 1 stores

(a_{1}, b_{1})

; Database 2 stores

(a_{2}, b_{2})

; and Database 3 stores

(a_{1} + a_{2}, b_{1} + b_{2})

. It is easy to see that for separately-encoded MDS storage codes, the storage space is divided evenly for each message, and each divided storage space can only be a function of the corresponding message.
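The two small storage codes above (K = 2, N = 3, T = 2, L = 2 bits) can be checked by brute force. The sketch below is only a sanity check under the assumption that "+" denotes addition over the binary field (XOR); the helper names are hypothetical.

```python
from itertools import combinations, product

def joint_storage(a1, a2, b1, b2):
    # Joint code from the text: DB1 = W^1, DB2 = W^2, DB3 = W^1 + W^2 (bitwise XOR).
    return [(a1, a2), (b1, b2), (a1 ^ b1, a2 ^ b2)]

def separate_storage(a1, a2, b1, b2):
    # Separate code from the text: each message gets its own single-parity code.
    return [(a1, b1), (a2, b2), (a1 ^ a2, b1 ^ b2)]

def is_mds_recoverable(storage, n_db=3, t=2):
    """True if the contents of every t databases determine all four message bits."""
    for subset in combinations(range(n_db), t):
        seen = set()
        for bits in product((0, 1), repeat=4):
            stored = storage(*bits)
            view = tuple(stored[n] for n in subset)
            if view in seen:  # two different message pairs would look identical
                return False
            seen.add(view)
    return True

print(is_mds_recoverable(joint_storage), is_mds_recoverable(separate_storage))
```

Both codes pass the check, so they offer the same reliability at the same storage overhead; they differ only in whether a database's contents may mix symbols of different messages.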

Let us denote the capacity of the

(K, N, T)

MDS-PIR system as

C (K, N, T)

, that of separate MDS coding as

C_{⊥} (K, N, T)

, and that of separate linear MDS coding with a uniform component function as

C_{\oplus} (K, N, T)

. It is clear from the definitions that:

\begin{matrix} C (K, N, T) \geq C_{⊥} (K, N, T) \geq C_{\oplus} (K, N, T) . \end{matrix}

(12)

It was shown in [5] that:

\begin{matrix} C_{\oplus} (K, N, T) = {(1 + \frac{T}{N} + \dots + {(\frac{T}{N})}^{K - 1})}^{- 1} . \end{matrix}

(13)

Moreover, a close inspection of the converse proof in [5] reveals that:

\begin{matrix} C_{⊥} (K, N, T) = C_{\oplus} (K, N, T) . \end{matrix}

(14)

The issue we thus wish to understand in this work is the relation between

C (K, N, T)

and

C_{⊥} (K, N, T)

. In particular, we wish to identify the set of the

(K, N, T)

triples such that:

\begin{matrix} C (K, N, T) > C_{⊥} (K, N, T), \end{matrix}

(15)

if the set is not empty. We shall show in this work that such triples indeed exist, and they in fact span a rather wide range.
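For reference, the closed form (13) is easy to evaluate exactly. The sketch below uses Python's `fractions` module; the function name `separate_mds_pir_capacity` is a hypothetical label, not notation from the paper.

```python
from fractions import Fraction

def separate_mds_pir_capacity(K, N, T):
    """Closed form (13): (1 + T/N + ... + (T/N)^(K-1))^(-1), as an exact fraction."""
    ratio = Fraction(T, N)
    return 1 / sum(ratio ** k for k in range(K))

print(separate_mds_pir_capacity(2, 4, 2))  # (1 + 1/2)^(-1)
print(separate_mds_pir_capacity(3, 4, 3))  # (1 + 3/4 + 9/16)^(-1)
```

Any joint-coding scheme whose rate exceeds this value for some (K, N, T) certifies that the triple belongs to the set described by (15).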

2.3. Further Remarks on the System Model

The result in [5] is in fact slightly stronger than we stated in (13). Let us assume a particular MDS storage code

C

is used in the

(K, N, T)

system, then the corresponding capacities of the

(K, N, T)

systems as described above can be denoted as

C (K, N, T, C)

,

C_{⊥} (K, N, T, C)

, and

C_{\oplus} (K, N, T, C)

, respectively. The result in [5] can then be stated as that for any linear MDS code

C

,

\begin{matrix} C_{\oplus} (K, N, T, C) = C_{\oplus} (K, N, T) = {(1 + \frac{T}{N} + \dots + {(\frac{T}{N})}^{K - 1})}^{- 1} . \end{matrix}

(16)

It is natural to ask whether for any particular MDS code

C

, which is not necessarily linear or does not necessarily use a uniform component MDS coding function, we have

C_{⊥} (K, N, T, C) = C_{⊥} (K, N, T)

and, more generally, whether for any MDS code

C

,

C (K, N, T, C) = C (K, N, T)

. We believe this is in general not true; however, it appears difficult to prove or disprove this conjecture.

The MDS recovery requirement implies the following information theoretic relation:

\begin{matrix} \sum_{n \in T} H (Φ_{n} (W^{1 : K})) = K L log | X |, \end{matrix}

(17)

\begin{matrix} H (W^{1 : K} | Φ_{n} (W^{1 : K}), n \in T) = 0, \end{matrix}

(18)

for any

T \subseteq {1, 2, \dots, N}

and

| T | = T

. These conditions can be used to derive converse results for a

(K, N, T)

system and sometimes are stated directly (e.g., [24]) as the MDS recovery requirement, instead of enforcing the MDS recovery property on the coding functions.

3. Code Construction: $(K, N, T) = (2, N, 2), N \geq 3$

In this section, we present the storage and PIR code construction when

K = T = 2, N \geq 3

and show that the PIR rate achieved with the proposed joint MDS storage code is strictly higher than the capacity of PIR with separate MDS storage code, i.e.,

C (2, N, 2) > C_{⊥} (2, N, 2)

.

3.1. Example: $N = 4$

To illustrate the main idea in a simpler setting, we start with an example where

N = 4

. We set the message size

L = 3

so that each message consists of three symbols from

F_{3}

. Denote

W^{1} = (a_{0}; a_{1}; a_{2}) \in F_{3}^{3 \times 1}, W^{2} = (b_{0}; b_{1}; b_{2}) \in F_{3}^{3 \times 1}

.

Storage code: From the joint MDS storage code constraint, each database stores

\frac{L K}{T} = 3

symbols, and the stored variables are specified in Table 1.

It is easy to verify that we may recover both messages from the storage of any two databases. For example, consider Database 3 and Database 4. It suffices to show that the difference of their stored vectors, which no longer involves the symbols of the second message,

(a_{1} - 2 a_{2}; a_{2} - 2 a_{0}; a_{0} - 2 a_{1})

uniquely determines

W^{1} = (a_{0}; a_{1}; a_{2})

. Equivalently, we show that the following matrix has full rank over

F_{3}

.

\begin{matrix} [\begin{matrix} 0 & 1 & - 2 \\ - 2 & 0 & 1 \\ 1 & - 2 & 0 \end{matrix}] \equiv [\begin{matrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{matrix}] \pmod 3, \quad \det [\begin{matrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{matrix}] = 2 \neq 0 \end{matrix}

(19)
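The determinant computation in (19) can be reproduced with a few lines of integer arithmetic. This is a minimal sketch; `det3_mod` is a hypothetical helper that expands along the first row and reduces modulo the field characteristic.

```python
def det3_mod(matrix, p):
    """Determinant of a 3x3 integer matrix, reduced modulo p."""
    (a, b, c), (d, e, f), (g, h, i) = matrix
    return (a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)) % p

# Coefficients of (a1 - 2*a2, a2 - 2*a0, a0 - 2*a1) in the unknowns (a0, a1, a2).
difference_matrix = [[0, 1, -2],
                     [-2, 0, 1],
                     [1, -2, 0]]

print(det3_mod(difference_matrix, 3))
```

A nonzero result modulo 3 confirms that the matrix is invertible over the ternary field.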

PIR code: When we retrieve

W^{1}

, the answers are shown in Table 2.

When we retrieve

W^{2}

, the answers are shown in Table 3.

Correctness and privacy: Both correctness and privacy are easy to verify. Correctness follows from the observation that from the four symbols downloaded (one from each database), we may decode the three desired symbols, as only one undesired symbol appears in the answers. Privacy is guaranteed because no matter which message is desired, for each database, the answers are identically distributed. For example, consider Database 3. The answers are equally likely to be

a_{0} + b_{2}, a_{1} + b_{0}

, and

a_{2} + b_{1}

, regardless of the desired message index.

Rate that outperforms separate MDS-PIR capacity: The desired message has

L = 3

symbols, and we are downloading one symbol from each database,

l_{n} = 1, \forall n \in {1, 2, 3, 4}

. Then, the rate achieved is

\frac{L}{\sum_{n} l_{n}} = \frac{3}{4} \leq C (2, 4, 2)

, which is strictly higher than

C_{⊥} (2, 4, 2) = {(1 + \frac{2}{4})}^{- 1} = \frac{2}{3}

, the capacity of the separate MDS storage code.

3.2. General Proof: Arbitrary $N \geq 3$

We set message size

L = N - 1

, so that each message consists of

N - 1

symbols from

F_{p^{m}}

for a prime number p and an integer m such that

p^{m} \geq (N - 3) (N - 1) + 2

. A primitive element of the finite field

F_{p^{m}}

is denoted as

α

. Denote

W^{1} = (a_{0}; a_{1}; \dots; a_{N - 2}) \in F_{p^{m}}^{(N - 1) \times 1}, W^{2} = (b_{0}; b_{1}; \dots; b_{N - 2}) \in F_{p^{m}}^{(N - 1) \times 1}

.

Storage code: From the joint MDS storage code constraint, each database stores

\frac{L K}{T} = N - 1

symbols, and the stored variables

S_{n} \in F_{p^{m}}^{(N - 1) \times 1}, n \in {1, \dots, N}

are set as follows.

Denote the cyclically-shifted message vector as

{\tilde{W}}^{1} (i) = (a_{\bar{i}}; a_{\bar{i + 1}}; \dots; a_{\bar{i + N - 2}}), i \in {1, \dots, N - 2}

where

\bar{i} = i mod (N - 1)

, i.e., the symbol indices are interpreted modulo

N - 1

.

\begin{matrix} S_{1} & = W^{1} \end{matrix}

(20)

\begin{matrix} S_{2} & = W^{2} \end{matrix}

(21)

\begin{matrix} S_{3} & = α {\tilde{W}}^{1} (1) + W^{2} \end{matrix}

(22)

\begin{matrix} ⋮ \end{matrix}

(23)

\begin{matrix} S_{n} & = α^{n - 2} {\tilde{W}}^{1} (n - 2) + W^{2} \end{matrix}

(24)

\begin{matrix} ⋮ \end{matrix}

(25)

\begin{matrix} S_{N} & = α^{N - 2} {\tilde{W}}^{1} (N - 2) + W^{2} \end{matrix}

(26)

Specifically,

\begin{matrix} S_{1} = (S_{1, 0}; \dots; S_{1, N - 2}) = (a_{0}; \dots; a_{N - 2}) \end{matrix}

(27)

\begin{matrix} S_{2} = (S_{2, 0}; \dots; S_{2, N - 2}) = (b_{0}; \dots; b_{N - 2}) \end{matrix}

(28)

\begin{matrix} S_{n} = (S_{n, 0}; \dots; S_{n, N - 2}) = (α^{n - 2} a_{\bar{n - 2}} + b_{0}; \dots; α^{n - 2} a_{\bar{n + N - 4}} + b_{N - 2}), n \in {3, \dots, N} \end{matrix}

(29)

The proof that the above storage code satisfies the MDS criterion is deferred to Section 3.2.1.

PIR code: When we retrieve

W^{1}

, the answers are set as follows.

F

is uniformly distributed over

{0, 1, \dots, N - 2}

. When

F = f \in {0, 1, \dots, N - 2}

, we set:

\begin{matrix} A_{1}^{[1]} = S_{1, f} = a_{f} \end{matrix}

(30)

\begin{matrix} A_{2}^{[1]} = S_{2, f} = b_{f} \end{matrix}

(31)

\begin{matrix} A_{3}^{[1]} = S_{3, f} = α a_{\bar{f + 1}} + b_{f} \end{matrix}

(32)

\begin{matrix} ⋮ \end{matrix}

(33)

\begin{matrix} A_{n}^{[1]} = S_{n, f} = α^{n - 2} a_{\bar{f + n - 2}} + b_{f} \end{matrix}

(34)

\begin{matrix} ⋮ \end{matrix}

(35)

\begin{matrix} A_{N}^{[1]} = S_{N, f} = α^{N - 2} a_{\bar{f + N - 2}} + b_{f} \end{matrix}

(36)

When we retrieve

W^{2}

, the answers are set as follows.

F

is uniformly distributed over

{0, 1, \dots, N - 2}

. When

F = f \in {0, 1, \dots, N - 2}

, we set:

\begin{matrix} A_{1}^{[2]} = S_{1, f} = a_{f} \end{matrix}

(37)

\begin{matrix} A_{2}^{[2]} = S_{2, f} = b_{f} \end{matrix}

(38)

\begin{matrix} A_{3}^{[2]} = S_{3, \bar{f - 1}} = α a_{f} + b_{\bar{f - 1}} \end{matrix}

(39)

\begin{matrix} ⋮ \end{matrix}

(40)

\begin{matrix} A_{n}^{[2]} = S_{n, \bar{f - (n - 2)}} = α^{n - 2} a_{f} + b_{\bar{f - (n - 2)}} \end{matrix}

(41)

\begin{matrix} ⋮ \end{matrix}

(42)

\begin{matrix} A_{N}^{[2]} = S_{N, \bar{f - (N - 2)}} = α^{N - 2} a_{f} + b_{\bar{f - (N - 2)}} \end{matrix}

(43)

Correctness and privacy: Similar to the example presented in the previous section, both correctness and privacy are easy to verify. Correctness follows from the observation that the N symbols downloaded (one from each database) contain all

N - 1

desired symbols and only one undesired symbol. Specifically, when

W^{1}

is desired, we may recover

W^{1}

from

(A_{1}^{[1]}, A_{3}^{[1]} - A_{2}^{[1]}, \dots, A_{N}^{[1]} - A_{2}^{[1]})

, and when

W^{2}

is desired, we may recover

W^{2}

from

(A_{2}^{[2]}, A_{3}^{[2]} - α A_{1}^{[2]}, \dots, A_{N}^{[2]} - α^{N - 2} A_{1}^{[2]})

. Privacy is guaranteed because no matter which message is desired,

A_{n}^{[1]}

and

A_{n}^{[2]}

are identically distributed. For

n = 1, 2

, this is trivial to see; when

n \geq 3

, since

A_{n}^{[1]} = S_{n, f}

,

A_{n}^{[2]} = S_{n, \bar{f - (n - 2)}}

and

f \in {0, 1, \dots, N - 2}

, it is seen that f and

\bar{f - (n - 2)} = (f - (n - 2)) mod (N - 1)

take values from the same set

{0, 1, \dots, N - 2}

for any n, and moreover, the queries follow the same uniform distribution on this set for both messages.

Rate that outperforms separate MDS-PIR capacity: The desired message has

L = N - 1

symbols, and we are downloading one symbol from each database,

l_{n} = 1, \forall n \in {1, \dots, N}

. Then, the rate achieved is

\frac{L}{\sum_{n} l_{n}} = \frac{N - 1}{N} \leq C (2, N, 2)

. When

N \geq 3

,

C (2, N, 2) \geq \frac{N - 1}{N} > \frac{N}{N + 2} = C_{⊥} (2, N, 2)

, the capacity of separate MDS storage code.
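The storage code (20)-(29) and the answers (30)-(43) can be simulated end to end. The sketch below is an illustration under simplifying assumptions: it restricts to a prime field (the case m = 1), finds a primitive element by brute force, and uses hypothetical function names. It checks retrieval correctness for every key f and that each database sees the same set of query indices for both messages.

```python
import random

def find_primitive_root(p):
    """Smallest generator of the multiplicative group of F_p (p prime), by brute force."""
    for g in range(2, p):
        if len({pow(g, e, p) for e in range(1, p)}) == p - 1:
            return g
    raise ValueError("no primitive root found; is p an odd prime?")

def storage(w1, w2, N, p, alpha):
    """Stored vectors S_1..S_N of (20)-(29); the list entry n - 1 holds S_n."""
    L = N - 1
    S = [list(w1), list(w2)]
    for n in range(3, N + 1):
        coeff = pow(alpha, n - 2, p)
        S.append([(coeff * w1[(n - 2 + t) % L] + w2[t]) % p for t in range(L)])
    return S

def query_index(n, k, f, N):
    """Position of the symbol requested from database n for message k and key f."""
    if k == 1 or n <= 2:
        return f
    return (f - (n - 2)) % (N - 1)

def answers(S, k, f, N):
    """One symbol per database, as in (30)-(43)."""
    return [S[n - 1][query_index(n, k, f, N)] for n in range(1, N + 1)]

def decode(ans, k, f, N, p, alpha):
    """Recover the desired message from the N answers."""
    L = N - 1
    w = [None] * L
    if k == 1:
        w[f] = ans[0]
        for n in range(3, N + 1):
            inv = pow(pow(alpha, n - 2, p), p - 2, p)  # inverse via Fermat's little theorem
            w[(f + n - 2) % L] = (ans[n - 1] - ans[1]) * inv % p
    else:
        w[f] = ans[1]
        for n in range(3, N + 1):
            w[(f - (n - 2)) % L] = (ans[n - 1] - pow(alpha, n - 2, p) * ans[0]) % p
    return w

N, p = 5, 11  # assumed parameters: 11 >= (N - 3)(N - 1) + 2 = 10
alpha = find_primitive_root(p)
w1 = [random.randrange(p) for _ in range(N - 1)]
w2 = [random.randrange(p) for _ in range(N - 1)]
S = storage(w1, w2, N, p, alpha)

for k, wanted in ((1, w1), (2, w2)):
    for f in range(N - 1):
        assert decode(answers(S, k, f, N), k, f, N, p, alpha) == wanted

for n in range(1, N + 1):
    seen_1 = sorted(query_index(n, 1, f, N) for f in range(N - 1))
    seen_2 = sorted(query_index(n, 2, f, N) for f in range(N - 1))
    assert seen_1 == seen_2  # uniform f gives the same index distribution for both messages
print("retrieval and privacy checks passed")
```

Each retrieval downloads N symbols to obtain N - 1 desired symbols, which matches the rate (N - 1)/N stated above.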

3.2.1. Proof of MDS Storage Criterion

We show that from the stored variables of any two databases,

S_{i}, S_{j}, i < j, i, j \in {1, \dots, N}

, we may recover both

W^{1}

and

W^{2}

.

When

i = 1, 2

, the proof is immediate. Henceforth, we consider

i \geq 3

. To show that from

(S_{i}, S_{j})

, we may recover

(W^{1}, W^{2})

, it suffices to prove that from

S_{i} - S_{j}

, we may recover

W^{1}

. Note that:

\begin{matrix} S_{i} - S_{j} & = α^{i - 2} {\tilde{W}}^{1} (i - 2) - α^{j - 2} {\tilde{W}}^{1} (j - 2) \end{matrix}

(44)

\begin{matrix} = (α^{i - 2} a_{\bar{i - 2}} - α^{j - 2} a_{\bar{j - 2}}; \dots; α^{i - 2} a_{\bar{i + N - 4}} - α^{j - 2} a_{\bar{j + N - 4}}) \end{matrix}

(45)

\begin{matrix} = C_{i, j} (a_{0}; \dots; a_{N - 2}) \end{matrix}

(46)

where

C_{i, j}

is an

(N - 1) \times (N - 1)

circulant matrix whose rows consist of all possible cyclic shifts of the following

1 \times (N - 1)

row vector,

\begin{matrix} c = (c_{0}, c_{1}, \dots, c_{N - 2}) = (α^{i - 2}, \underbrace{0, \dots, 0}_{j - i - 1 \text{ zeros}}, - α^{j - 2}, 0, \dots, 0) . \end{matrix}

(47)

We are left to prove the circulant matrix

C_{i, j}

has full rank. From a result by Ingleton [27], a circulant matrix has full rank if the following two polynomials have no common root.

\begin{matrix} f (x) & = c_{0} + c_{1} x + \dots + c_{N - 2} x^{N - 2} = α^{i - 2} - α^{j - 2} x^{j - i}, \end{matrix}

(48)

\begin{matrix} g (x) & = x^{N - 1} - 1 . \end{matrix}

(49)

To show that

f (x), g (x)

have no common root for all integers

i, j, 3 \leq i < j \leq N

, we prove by contradiction. Suppose on the contrary that there exists an element

x_{0} \in F_{p^{m}}

and two integers

i, j, 3 \leq i < j \leq N

such that

f (x_{0}) = 0

and

g (x_{0}) = 0

, i.e.,

\begin{matrix} α^{i - 2} & = α^{j - 2} {x_{0}}^{j - i} \end{matrix}

(50)

\begin{matrix} x_{0}^{N - 1} & = 1 \end{matrix}

(51)

Taking (50) to the

(N - 1) th

power, we have:

\begin{matrix} α^{(i - 2) (N - 1)} & = α^{(j - 2) (N - 1)} {({x_{0}}^{N - 1})}^{j - i} \end{matrix}

(52)

\begin{matrix} \overset{(51)}{\Rightarrow} α^{(i - 2) (N - 1)} & = α^{(j - 2) (N - 1)} \end{matrix}

(53)

\begin{matrix} \Rightarrow 1 & = α^{(j - i) (N - 1)} \end{matrix}

(54)

Note that

(j - i) (N - 1) \leq (N - 3) (N - 1)

. Combining with the assumption that

p^{m} - 2 \geq (N - 3) (N - 1)

and

α

is a primitive element of

F_{p^{m}}

, we have [28]:

\begin{matrix} 1 \notin {α, α^{2}, α^{3}, \dots, α^{p^{m} - 2}} \Rightarrow α^{(j - i) (N - 1)} \neq 1, \end{matrix}

(55)

which contradicts (54). The proof is now complete.
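The full-rank claim can also be checked numerically for small parameters. The following sketch assumes a prime field with N = 5, p = 11, and the primitive element 2; the helper names are hypothetical. It builds the matrix that maps the first message to the difference of two stored vectors and computes its rank by Gaussian elimination modulo p.

```python
from itertools import combinations

def rank_mod_p(rows, p):
    """Rank of an integer matrix over F_p (p prime) via Gaussian elimination."""
    m = [[x % p for x in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        pivot = next((r for r in range(rank, len(m)) if m[r][col]), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        inv = pow(m[rank][col], p - 2, p)
        m[rank] = [x * inv % p for x in m[rank]]
        for r in range(len(m)):
            if r != rank and m[r][col]:
                factor = m[r][col]
                m[r] = [(x - factor * y) % p for x, y in zip(m[r], m[rank])]
        rank += 1
    return rank

def difference_matrix(i, j, N, p, alpha):
    """Matrix C_{i,j} with S_i - S_j = C_{i,j} W^1, for 3 <= i < j <= N."""
    L = N - 1
    rows = []
    for t in range(L):
        row = [0] * L
        row[(i - 2 + t) % L] += pow(alpha, i - 2, p)
        row[(j - 2 + t) % L] -= pow(alpha, j - 2, p)
        rows.append(row)
    return rows

N, p, alpha = 5, 11, 2  # assumed parameters; 2 generates the multiplicative group of F_11
full_rank = all(
    rank_mod_p(difference_matrix(i, j, N, p, alpha), p) == N - 1
    for i, j in combinations(range(3, N + 1), 2)
)
print(full_rank)
```

Such a check covers only the chosen parameters; the general statement relies on the argument above.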

Remark 1.

The field size may be further reduced by a result from [29]. To ensure

C_{i, j}

has full rank, it suffices to ensure

f (x)

and

g^{'} (x) = x^{r} - 1

have no common root, where

N - 1 = r p^{l}

and

p, r

are co-prime [29]. Using this result and following similar proof steps as above, we may set

p^{m} \geq (N - 3) r + 2

. Note that here, r depends on p, so to find the smallest field size, we may search by first fixing p.

4. Code Construction: $(K, N, T) = (K, K + 1, K), K \geq 2$

In this section, we present the storage and PIR code construction when

N = K + 1 = T + 1

and show that the PIR rate achieved with the proposed joint MDS storage code is strictly higher than the capacity of PIR with separate MDS storage code, i.e.,

C (K, K + 1, K) > C_{⊥} (K, K + 1, K)

.

4.1. Example: $(K, N, T) = (3, 4, 3)$

To illustrate the main idea in a simpler setting, we start with an example where

K = 3, N = 4, T = 3

. We set the message size

L = 2

so that each message consists of two bits from

F_{2}

. Denote

W^{1} = (a_{1}; a_{2}), W^{2} = (b_{1}; b_{2}), W^{3} = (c_{1}; c_{2})

.

Storage code: From the joint MDS storage code constraint, each database stores

\frac{L K}{T} = 2

bits, and the stored variables are specified in Table 4.

The MDS storage criterion is easily verified, i.e., we may recover all three messages from the storage of any three databases.

PIR code: When we retrieve

W^{1}

, the answers are shown in Table 5.

When we retrieve

W^{2}

or

W^{3}

, the answers are shown in Table 6 and Table 7.

Correctness and privacy: Both correctness and privacy are easy to see.

Rate that outperforms separate MDS-PIR capacity: The rate achieved is

\frac{L}{\sum_{n} l_{n}} = \frac{2}{4} = \frac{1}{2} \leq C (3, 4, 3)

, which is strictly higher than

C_{⊥} (3, 4, 3) = {(1 + \frac{3}{4} + {(\frac{3}{4})}^{2})}^{- 1} = \frac{16}{37}

, the capacity of separate MDS storage code.

4.2. General Proof: $(K, N, T) = (K, K + 1, K), K \geq 2$

The proof is a simple generalization of the example presented above. We set

L = 2

, and each message consists of two bits from

F_{2}

. Denote

W^{k} = (W_{1}^{k}; W_{2}^{k}), k \in {1, \dots, K}

.

Storage code: Each database stores

\frac{L K}{T} = 2

bits, and the stored variables are specified in Table 8. Note that

K = T = N - 1

.

The MDS storage criterion is easily verified, i.e., we may recover all K messages from the storage of any

T = N - 1

databases.

PIR code: When we retrieve

W^{k}

, the answers are shown in Table 9.

Correctness and privacy: These follow immediately.

Rate that outperforms separate MDS-PIR capacity: The rate achieved is

\frac{L}{\sum_{n} l_{n}} = \frac{2}{N} \leq C (K, K + 1, K)

, while the capacity of separate MDS storage code is

C_{⊥} (K, K + 1, K) = {(1 + \frac{N - 1}{N} + \dots + {(\frac{N - 1}{N})}^{N - 2})}^{- 1} = \frac{1 - \frac{N - 1}{N}}{1 - {(\frac{N - 1}{N})}^{N - 1}} = \frac{1}{N (1 - {(\frac{N - 1}{N})}^{N - 1})}

, where the geometric sum has $K = N - 1$ terms. To prove

C (K, K + 1, K) > C_{⊥} (K, K + 1, K)

, it remains to show that:

\begin{matrix} \frac{2}{N} & > \frac{1}{N (1 - {(\frac{N - 1}{N})}^{N - 1})} \end{matrix}

(56)

\begin{matrix} \Leftrightarrow {(1 - \frac{1}{N})}^{N - 1} & < \frac{1}{2} \end{matrix}

(57)

\begin{matrix} \Leftarrow {(1 - \frac{1}{N})}^{N - 1} & \leq {(\frac{2}{3})}^{2} = \frac{4}{9} < \frac{1}{2}, \quad N \geq 3 . \end{matrix}

(58)

The last step holds because ${(1 - \frac{1}{N})}^{N - 1} = {(1 + \frac{1}{N - 1})}^{- (N - 1)}$ is decreasing in N and equals $\frac{4}{9}$ at $N = 3$ . The proof is thus complete.
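The comparison between the achieved rate 2/N and the separate-coding capacity can be spot-checked by evaluating the closed form (13) directly with K = T = N - 1. The sketch below uses exact rational arithmetic; the function name and the tested range of N are arbitrary choices for illustration.

```python
from fractions import Fraction

def separate_capacity(K, N, T):
    """Closed form (13), evaluated exactly."""
    ratio = Fraction(T, N)
    return 1 / sum(ratio ** k for k in range(K))

for N in range(3, 13):
    K = N - 1
    joint_rate = Fraction(2, N)
    separate = separate_capacity(K, N, K)
    assert joint_rate > separate
    print(N, joint_rate, separate, float(joint_rate - separate))
```

For N = 4 this reproduces the values 1/2 and 16/37 from the example in Section 4.1.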

5. Regime Expansion Building upon Base Codes

We show that the two classes of base codes presented in previous sections for

(K, N, T)

systems can be extended to

(K, m N, m T)

systems (m is a positive integer). We present this result in the next two subsections, one for each class of base codes. Let us start from the simpler case of

(K, K + 1, K)

systems.

5.1. From $(K, K + 1, K)$ to $(K, m (K + 1), m K)$ Systems

We show that

C (K, m (K + 1), m K) > C_{⊥} (K, m (K + 1), m K)

, where

K \geq 2

and m is a positive integer.

The key idea is that we may split the messages and databases into m generic copies so that the same PIR rate is preserved. Note that, for a fixed K, the separate MDS-PIR capacity depends on N and T only through the ratio

\frac{T}{N}

, i.e.,

C_{⊥} (K, m (K + 1), m K) = C_{⊥} (K, K + 1, K)

. As

C (K, K + 1, K) > C_{⊥} (K, K + 1, K)

, it suffices to provide a joint MDS storage code for a

(K, m (K + 1), m K)

system that achieves the same PIR rate as that of a

(K, K + 1, K)

system (i.e., rate

\frac{2}{K + 1}

). Such a storage and PIR code construction is presented next.

Each message is “multiplied” by m, so that we set

L = 2 m

, and each message consists of

2 m

symbols from

F_{q}

, where q is an integer power of a prime number and is no smaller than

(m + 1) K

. To highlight that the message symbols form two segments, we denote

W^{k} = (W_{1}^{k}; W_{2}^{k}) \in F_{q}^{2 \times m}

, where

W_{i}^{k} = (W_{i, 1}^{k}, \dots, W_{i, m}^{k}) \in F_{q}^{1 \times m}, i \in {1, 2}

.

Storage code: Each database stores

\frac{L K}{T} = 2

symbols, as specified in Table 10. For the ease of presentation, the

N = m (K + 1)

databases are divided into

K + 1

groups (m databases each) and labeled as

D B (1, 1), \dots, D B (1, m), \dots, D B (K + 1, m)

. Denote a group of databases as

DB (k, :)

= (D B (k, 1), \dots, D B (k, m)), k \in {1, 2, \dots, K + 1}

. A database in group

k, k \in {1, \dots, K}

stores two distinct

W^{k}

symbols (one from

W_{1}^{k}

and one from

W_{2}^{k}

). The

(K + 1) th

group of databases stores generic combinations of the message symbols. Denote

W_{1} = ({(W_{1}^{1})}^{T}; \dots; {(W_{1}^{K})}^{T}) \in F_{q}^{m K \times 1}, W_{2} = ({(W_{2}^{1})}^{T}; \dots; {(W_{2}^{K})}^{T}) \in F_{q}^{m K \times 1}

.

C (i, :) \in F_{q}^{1 \times m K}, i \in {1, \dots, m}

denotes the

i th

row of an

m \times m K

Cauchy matrix

C

with elements

C (i, j)

in the form:

\begin{matrix} C (i, j) = \frac{1}{α_{i} - β_{j}}, α_{i} \neq β_{j}, \forall i \in {1, \dots, m}, j \in {1, \dots, m K} . \end{matrix}

(59)

Note that

q \geq (m + 1) K

; therefore, such distinct

α_{i}

’s and

β_{j}

’s exist.

We now verify that the MDS storage criterion is satisfied, i.e., all $K$ messages can be recovered from the storage of any $T = m K$ databases. The two message segments $W_{1}, W_{2}$ are encoded in the same manner, so it suffices to consider one segment, say Segment 1, $W_{1}$. Suppose that among the $T = m K$ databases, $T_{1}$ databases are from the first K database groups and the remaining $T - T_{1}$ databases are from the $(K + 1)$th database group, so that $T - T_{1} \leq m$, i.e., $T_{1} \geq m (K - 1)$. The $T_{1}$ databases from the first K database groups contribute $T_{1}$ raw message symbols from $W_{1}$, so we only need to show that the remaining $T - T_{1}$ symbols from $W_{1}$ can be recovered from the $T - T_{1}$ databases of the $(K + 1)$th database group. This is equivalent to proving that a $(T - T_{1}) \times (T - T_{1})$ sub-matrix of the Cauchy matrix $C \in F_{q}^{m \times m K}$ has full rank, which holds because every square sub-matrix of a Cauchy matrix is invertible.
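The MDS argument above can be checked by brute force on a toy instance. The following is a minimal sketch, assuming $m = 2$, $K = 2$ and the prime field GF(7); the choices of $α_{i}$, $β_{j}$ and all function names are illustrative. It builds the generator matrix of one segment (an identity block for the first K groups stacked on the Cauchy matrix for the last group) and tests every choice of $T = m K$ databases.

```python
from itertools import combinations

P = 7  # toy prime field size (assumption)

def rank_mod_p(rows, p=P):
    """Rank of a matrix over GF(p) via Gauss-Jordan elimination."""
    mat = [[x % p for x in row] for row in rows]
    rank = 0
    for col in range(len(mat[0])):
        pivot = next((r for r in range(rank, len(mat)) if mat[r][col]), None)
        if pivot is None:
            continue
        mat[rank], mat[pivot] = mat[pivot], mat[rank]
        inv = pow(mat[rank][col], -1, p)  # modular inverse (Python 3.8+)
        mat[rank] = [(x * inv) % p for x in mat[rank]]
        for r in range(len(mat)):
            if r != rank and mat[r][col]:
                factor = mat[r][col]
                mat[r] = [(x - factor * y) % p for x, y in zip(mat[r], mat[rank])]
        rank += 1
    return rank

def cauchy_matrix(alphas, betas, p=P):
    """Entries 1 / (alpha_i - beta_j) over GF(p)."""
    return [[pow((a - b) % p, -1, p) for b in betas] for a in alphas]

m, K = 2, 2
alphas = [0, 1]        # m distinct elements
betas = [2, 3, 4, 5]   # m*K distinct elements, disjoint from alphas
C = cauchy_matrix(alphas, betas)

# One row per database: groups 1..K store raw symbols, group K+1 stores C(i,:) W_1.
identity = [[int(r == c) for c in range(m * K)] for r in range(m * K)]
generator = identity + C
T = m * K

is_mds = all(
    rank_mod_p([generator[i] for i in subset]) == T
    for subset in combinations(range(len(generator)), T)
)
print("every", T, "databases recover the segment:", is_mds)
```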

PIR code: When we retrieve

W^{k}

, the answers are shown in Table 11.

Correctness and privacy: Privacy follows from the observation that, no matter which message is desired, the answer from any database is equally likely to come from Message Segment 1 or 2. To see the correctness, note that all non-desired message symbols appearing in the answers from the $(K + 1)$th database group are directly downloaded, and thus they can be canceled. Of the $2 m$ desired symbols, m are directly downloaded, and the other m can be recovered because the m linear combinations of desired symbols downloaded from the $(K + 1)$th database group have full rank (note that $C \in F_{q}^{m \times m K}$ is a Cauchy matrix). The rate achieved is $\frac{2}{K + 1}$, as $L = 2 m$ and we download one symbol from each of the $m (K + 1)$ databases.
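The retrieval and decoding steps described above can be simulated end to end. This is a minimal sketch under stated assumptions: $m = 2$, $K = 3$, the prime field GF(11), zero-based indices, and the segment choice F represented as 0 or 1; the helper names are hypothetical and not part of the construction.

```python
import random

P = 11        # toy prime field size (assumption)
m, K = 2, 3   # illustrative parameters

def solve_mod_p(A, b, p=P):
    """Solve A x = b over GF(p) for a square invertible A (Gauss-Jordan)."""
    n = len(A)
    aug = [[x % p for x in row] + [y % p] for row, y in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col])
        aug[col], aug[pivot] = aug[pivot], aug[col]
        inv = pow(aug[col][col], -1, p)
        aug[col] = [(x * inv) % p for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col]:
                factor = aug[r][col]
                aug[r] = [(x - factor * y) % p for x, y in zip(aug[r], aug[col])]
    return [row[n] for row in aug]

# m x mK Cauchy matrix with entries 1 / (alpha_i - beta_j)
alphas = list(range(m))
betas = list(range(m, m + m * K))
C = [[pow((a - b) % P, -1, P) for b in betas] for a in alphas]

def retrieve(W, k, F):
    """W[k][s][j] is symbol j of segment s of message k; returns (decoded W[k], download)."""
    # Database side: group k answers with segment 1-F, every other group with segment F.
    direct = {kk: W[kk][1 - F if kk == k else F] for kk in range(K)}
    stacked = [W[kk][F][j] for kk in range(K) for j in range(m)]
    coded = [sum(c * w for c, w in zip(C[i], stacked)) % P for i in range(m)]
    # User side: cancel the directly downloaded non-desired symbols.
    residual = []
    for i in range(m):
        interference = sum(C[i][kk * m + j] * direct[kk][j]
                           for kk in range(K) if kk != k for j in range(m))
        residual.append((coded[i] - interference) % P)
    # The remaining m x m system uses a square sub-matrix of the Cauchy matrix.
    sub = [[C[i][k * m + j] for j in range(m)] for i in range(m)]
    decoded = [None, None]
    decoded[F] = solve_mod_p(sub, residual)
    decoded[1 - F] = direct[k]
    return decoded, m * (K + 1)

rng = random.Random(0)
W = [[[rng.randrange(P) for _ in range(m)] for _ in range(2)] for _ in range(K)]
for k in range(K):
    for F in (0, 1):
        decoded, downloaded = retrieve(W, k, F)
        assert decoded == W[k]
        assert 2 * m * (K + 1) == 2 * downloaded  # rate L / download = 2 / (K + 1)
```

Each database contributes exactly one symbol, so the download is $m (K + 1)$ symbols for $L = 2 m$ desired symbols.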

5.2. From $(2, N, 2)$ to $(2, m N, 2 m)$ Systems

We show that

C (2, m N, 2 m) > C_{⊥} (2, m N, 2 m)

, where

N \geq 3

and m is a positive integer. Similar to the reasoning in the previous section, it suffices to provide a joint MDS storage code for a

(2, m N, 2 m)

system that achieves the PIR rate

\frac{N - 1}{N}

(same as that of a

(2, N, 2)

system from Section 3). The idea is also based on splitting the messages and databases. Let us start with an example where

N = 4, m = 2

.

5.2.1. Example: $N = 4, m = 2$

The message size is multiplied by $m = 2$, so we set $L = m (N - 1) = 6$, and each message consists of six symbols from $F_{q}$, where q will be specified later. At this point, it is useful to view q as a sufficiently large prime number. Denote $W^{1} = (\mathbf{a}_{0}; \mathbf{a}_{1}; \mathbf{a}_{2})$, where $\mathbf{a}_{i} = (a_{i}; a_{i}^{'}) \in F_{q}^{2 \times 1}, i \in \{0, 1, 2\}$, and $W^{2} = (\mathbf{b}_{0}; \mathbf{b}_{1}; \mathbf{b}_{2})$, where $\mathbf{b}_{i} = (b_{i}; b_{i}^{'}) \in F_{q}^{2 \times 1}, i \in \{0, 1, 2\}$. Boldface denotes a vector of two symbols.

Storage code: Each database stores

\frac{L K}{T} = 3

symbols, as specified in Table 12. Define:

\begin{matrix} \mathbf{h}_{i} = (h_{i}, h_{i}^{'}) \in F_{q}^{1 \times 2}, \mathbf{g}_{i} = (g_{i}, g_{i}^{'}) \in F_{q}^{1 \times 2}, i \in {1, 2, \dots, 12} . \end{matrix}

(60)

We will show that there exist feasible choices of

\mathbf{h}_{i}, \mathbf{g}_{i}

. Specifically, we may choose

h_{i}, h_{i}^{'}, g_{i}, g_{i}^{'}

i.i.d. and uniform over

F_{q}

.

To verify the MDS storage criterion, we need to show that both messages can be recovered from the storage of any four databases. The detailed proof is deferred to the general proof in Section 5.2.3, and we give a sketch here. Any four databases contribute 12 linear combinations of the 12 message symbols, and this linear mapping is given by a $12 \times 12$ matrix. We view its determinant as a polynomial in the variables $(h_{i}, h_{i}^{'}, g_{i}, g_{i}^{'})$. As shown in the general proof, these determinant polynomials are not zero polynomials. Overall, we have $\binom{8}{4}$ determinant polynomials, and each polynomial has degree at most 12. Consider the product of all such determinant polynomials, which is another polynomial with degree at most $12 \times \binom{8}{4}$. Therefore, by the Schwartz–Zippel lemma, if we set $q > 12 \times \binom{8}{4}$, then the probability that this product polynomial evaluates to a non-zero value is non-zero. In other words, there exists a feasible choice of $(h_{i}, h_{i}^{'}, g_{i}, g_{i}^{'})$ that guarantees the storage code satisfies the MDS criterion.
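For reference, the Schwartz–Zippel step in the sketch above can be written out as follows. This is the standard form of the lemma; the symbols $P$, $d$, $n$ and $r_{1}, \dots, r_{n}$ are introduced here for illustration. $P$ stands for the product of the determinant polynomials, which is non-zero and has total degree $d$, and the $r_{i}$ are the coefficients drawn i.i.d. and uniformly from $F_{q}$.

```latex
\Pr\left[ P(r_{1}, \dots, r_{n}) = 0 \right] \leq \frac{d}{q},
\qquad d \leq 12 \binom{8}{4} = 12 \times 70 = 840,
\qquad\text{so for } q > 840:\quad
\Pr\left[ P(r_{1}, \dots, r_{n}) \neq 0 \right] \geq 1 - \frac{840}{q} > 0 .
```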

PIR code: The PIR code is almost identical to that when

m = 1

. When we retrieve

W^{1}

, the answers are shown in Table 13.

When we retrieve

W^{2}

, the answers are shown in Table 14.

Correctness and privacy: Privacy is easily seen. To prove correctness, note that non-desired symbols can be canceled, so we only need to ensure that the received desired equations can be inverted to recover the message symbols. This follows from the Schwartz–Zippel lemma, which shows that the matrices $(\mathbf{h}_{2 i - 1}; \mathbf{h}_{2 i}) \in F_{q}^{2 \times 2}$ and $(\mathbf{g}_{2 i - 1}; \mathbf{g}_{2 i}) \in F_{q}^{2 \times 2}$, $i \in \{1, \dots, 6\}$, have full rank with non-zero probability over a sufficiently large field. Here, we have 12 matrices, each of which has dimension $2 \times 2$ and a determinant polynomial of degree at most two.

Overall, we need to guarantee the correctness and MDS criterion are simultaneously satisfied. Take the product of all determinant polynomials, whose degree is at most

12 \times (\binom{8}{4}) + 12 \times 2

. Therefore, we set

q > 12 \times (\binom{8}{4}) + 12 \times 2

, and by the Schwartz–Zippel lemma, there exists a feasible choice of

(h_{i}, h_{i}^{'}, g_{i}, g_{i}^{'})

over

F_{q}

.

5.2.2. General Proof: Arbitrary $N \geq 3, m \geq 2$

We set

L = m (N - 1)

, and each message consists of L symbols from

F_{q}

, where q is an integer power of a prime number and is strictly greater than

2 m (N - 2) (N - 1) + 2 m (N - 1) (\binom{m N}{2 m})

. Denote

W^{1} = (a_{0}; \dots; a_{N - 2}) \in F_{q}^{m (N - 1) \times 1}

, where

a_{i} = (a_{i, 1}; \dots; a_{i, m}) \in F_{q}^{m \times 1}, i \in {0, 1, \dots, N - 2}

and

W^{2} = (b_{0}; \dots; b_{N - 2}) \in F_{q}^{m (N - 1) \times 1}

, where

b_{i} = (b_{i, 1}; \dots; b_{i, m}) \in F_{q}^{m \times 1}, i \in {0, 1, \dots, N - 2}

.

Storage code: Each database stores

\frac{L K}{T} = N - 1

symbols. Denote the

m N

databases as

D B (1, 1), \dots, D B (1, m), \dots, D B (N, m)

. The stored variables

S_{n, j} \in F_{q}^{(N - 1) \times 1}, n \in {1, \dots, N}

,

j \in {1, \dots, m}

are set as follows.

Denote

\bar{i} = i mod (N - 1)

. For any

j \in {1, \dots, m}

,

\begin{matrix} S_{1, j} & = (S_{1, j, 0}; \dots; S_{1, j, N - 2}) = (a_{0, j}; a_{1, j}; \dots; a_{N - 2, j}) \end{matrix}

(61)

\begin{matrix} S_{2, j} & = (S_{2, j, 0}; \dots; S_{2, j, N - 2}) = (b_{0, j}; b_{1, j}; \dots; b_{N - 2, j}) \end{matrix}

(62)

\begin{matrix} S_{n, j} & = (S_{n, j, 0}; \dots; S_{n, j, N - 2}) = (h_{n, j, 0} a_{\bar{n - 2}} + g_{n, j, 0} b_{0}; \dots; h_{n, j, N - 2} a_{\bar{n + N - 4}} + g_{n, j, N - 2} b_{N - 2}), n \in {3, \dots, N} \end{matrix}

(63)

where for any

i \in {0, \dots, N - 2}

,

\begin{matrix} h_{n, j, i} \in F_{q}^{1 \times m}, g_{n, j, i} \in F_{q}^{1 \times m} . \end{matrix}

(64)

The proof that there exist choices of

h_{n, j, i}, g_{n, j, i}

such that the above storage code satisfies the MDS criterion is deferred to Section 5.2.3.

PIR code: When we retrieve

W^{1}

, we download one symbol from each database, and the answers are set as follows.

F

is uniform over

{0, 1, \dots, N - 2}

. When

F = f \in {0, 1, \dots, N - 2}

, for any

j \in {1, 2, \dots, m}

, we set:

\begin{matrix} A_{1, j}^{[1]} = S_{1, j, f} = a_{f, j} \end{matrix}

(65)

\begin{matrix} A_{2, j}^{[1]} = S_{2, j, f} = b_{f, j} \end{matrix}

(66)

\begin{matrix} A_{n, j}^{[1]} = S_{n, j, f} = h_{n, j, f} a_{\bar{f + n - 2}} + g_{n, j, f} b_{f}, n \in {3, \dots, N} \end{matrix}

(67)

When we retrieve

W^{2}

, the answers are set as follows.

F

is uniform over

{0, 1, \dots, N - 2}

. When

F = f \in {0, 1, \dots, N - 2}

, for any

j \in {1, 2, \dots, m}

, we set:

\begin{matrix} A_{1, j}^{[2]} = S_{1, j, f} = a_{f, j} \end{matrix}

(68)

\begin{matrix} A_{2, j}^{[2]} = S_{2, j, f} = b_{f, j} \end{matrix}

(69)

\begin{matrix} A_{n, j}^{[2]} = S_{n, j, \bar{f - (n - 2)}} = h_{n, j, \bar{f - (n - 2)}} a_{f} + g_{n, j, \bar{f - (n - 2)}} b_{\bar{f - (n - 2)}}, n \in {3, \dots, N} \end{matrix}

(70)

Correctness and privacy: Privacy is easy to verify. For any

n, j

,

A_{n, j}^{[1]}

and

A_{n, j}^{[2]}

are identically distributed due to the modulo operation. Next, consider correctness. Due to symmetry, we only need to consider the case when

W^{1}

is the desired message. From

A_{2, j}^{[1]}, \forall j \in {1, \dots, m}

, we have obtained all non-desired symbols

(b_{f, 1}; \dots; b_{f, m}) = b_{f}

. After canceling the contribution of

b_{f}

from

A_{n, j}^{[1]}, n \geq 3

, we need to show that for any

n, f

, the

m \times m

matrix

(h_{n, 1, f}; \dots; h_{n, m, f})

has full rank, which follows from the Schwartz–Zippel lemma over a sufficiently large field. We have

2 (N - 2) (N - 1)

such matrices, each of size

m \times m

. The product of all these determinant polynomials has degree at most

2 m (N - 2) (N - 1)

.
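The protocol of this subsection can be simulated directly from Equations (61) to (70). The sketch below is illustrative only: it assumes $N = 4$, $m = 2$, the prime field GF(1009), zero-based database indices j, and coefficients that are resampled until every $m \times m$ matrix needed for decoding is invertible. All function names are hypothetical.

```python
import random

P = 1009      # prime field size (assumption; larger than 864 for N = 4, m = 2)
N, m = 4, 2
B = N - 1     # blocks per message; each message has L = m * B symbols

def dot(u, v):
    return sum(x * y for x, y in zip(u, v)) % P

def solve_mod_p(A, b):
    """Solve A x = b over GF(P); return None if A is singular."""
    n = len(A)
    aug = [list(row) + [y] for row, y in zip(A, b)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col]), None)
        if pivot is None:
            return None
        aug[col], aug[pivot] = aug[pivot], aug[col]
        inv = pow(aug[col][col], -1, P)
        aug[col] = [(x * inv) % P for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col]:
                factor = aug[r][col]
                aug[r] = [(x - factor * y) % P for x, y in zip(aug[r], aug[col])]
    return [row[n] for row in aug]

def sample_coefficients(rng):
    """Draw h, g uniformly; retry until all decoding matrices are invertible."""
    keys = [(n, j, i) for n in range(3, N + 1) for j in range(m) for i in range(B)]
    while True:
        h = {key: [rng.randrange(P) for _ in range(m)] for key in keys}
        g = {key: [rng.randrange(P) for _ in range(m)] for key in keys}
        if all(solve_mod_p([coef[(n, j, i)] for j in range(m)], [0] * m) is not None
               for coef in (h, g) for n in range(3, N + 1) for i in range(B)):
            return h, g

def stored(a, b, h, g, n, j, i):
    """Symbol S_{n,j,i} from Equations (61)-(63)."""
    if n == 1:
        return a[i][j]
    if n == 2:
        return b[i][j]
    return (dot(h[(n, j, i)], a[(i + n - 2) % B]) + dot(g[(n, j, i)], b[i])) % P

def retrieve(a, b, h, g, want, f):
    """Answers (65)-(70) and decoding; want is 1 or 2, f is the random key F."""
    def row(n):
        return f if (want == 1 or n <= 2) else (f - (n - 2)) % B
    answers = {(n, j): stored(a, b, h, g, n, j, row(n))
               for n in range(1, N + 1) for j in range(m)}
    a_f = [answers[(1, j)] for j in range(m)]
    b_f = [answers[(2, j)] for j in range(m)]
    out = [None] * B
    out[f] = a_f if want == 1 else b_f
    for n in range(3, N + 1):
        i = row(n)
        if want == 1:   # cancel g * b_f, then invert the h rows
            known, kcoef, ucoef, target = b_f, g, h, (f + n - 2) % B
        else:           # cancel h * a_f, then invert the g rows
            known, kcoef, ucoef, target = a_f, h, g, i
        rhs = [(answers[(n, j)] - dot(kcoef[(n, j, i)], known)) % P for j in range(m)]
        out[target] = solve_mod_p([ucoef[(n, j, i)] for j in range(m)], rhs)
    return out, len(answers)

rng = random.Random(0)
h, g = sample_coefficients(rng)
a = [[rng.randrange(P) for _ in range(m)] for _ in range(B)]
b = [[rng.randrange(P) for _ in range(m)] for _ in range(B)]
for f in range(B):
    assert retrieve(a, b, h, g, 1, f) == (a, m * N)
    assert retrieve(a, b, h, g, 2, f) == (b, m * N)
```

For either desired message, each database in groups $n \geq 3$ is asked for a row index that is uniform over $\{0, \dots, N - 2\}$, which is the modulo argument used for privacy above.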

5.2.3. Proof of the MDS Storage Criterion

We show that when each element of

h_{n, j, i}, g_{n, j, i}

is drawn independently and uniformly from

F_{q}

, the probability that the MDS criterion is satisfied is non-zero, so a feasible choice exists.

Consider any

T = 2 m

databases. We show that there exists an assignment of

h_{n, j, i}, g_{n, j, i}

so that the mapping from the storage of the T databases to the

2 L

message symbols is invertible. This shows that the

2 L \times 2 L

matrix that describes the linear mapping has a non-zero determinant polynomial. Consider all choices of

(\binom{m N}{2 m})

databases, and take the product of all such determinant polynomials. Each polynomial has degree at most

2 L

, so the degree of the product polynomial is at most

2 L (\binom{m N}{2 m})

. Therefore, over a sufficiently large field, by the Schwartz–Zippel lemma, there exists a choice of

h_{n, j, i}, g_{n, j, i}

so that all polynomials evaluate to non-zero values, and the storage code is indeed MDS.

We are left to show that for any

T = 2 m

databases, we may assign

h_{n, j, i}, g_{n, j, i}

(for a given choice of

T = 2 m

databases) so that the storage is able to recover all

2 L

message symbols. The proof is based on a crucial property, stated in the following lemma. Define

{\vec{a}}_{j} = (a_{0, j}; a_{1, j}; \dots; a_{N - 2, j})

,

{\vec{b}}_{j} = (b_{0, j}; b_{1, j}; \dots; b_{N - 2, j}), j \in {1, \dots, m}

.

Lemma 1.

Consider any

n \in {3, \dots, N}, j \in {1, \dots, m}

; there exists a choice of

h_{n, j, i}, g_{n, j, i}, i \in {0, 1, \dots, N - 2}

so that from

S_{n, j}

, we may obtain

{\vec{a}}_{j^{*}}

for any

j^{*} \in {1, \dots, m}

and another choice of

h_{n, j, i}, g_{n, j, i}, i \in {0, 1, \dots, N - 2}

so that from

S_{n, j}

, we may obtain

{\vec{b}}_{j^{*}}

for any

j^{*} \in {1, \dots, m}

.

Proof of Lemma 1.

The proof is fairly simple because

S_{n, j}

contains all symbols from

{\vec{a}}_{j^{*}}

and

{\vec{b}}_{j^{*}}

for any

j^{*}

. Consider first

S_{n, j} = {\vec{a}}_{j^{*}}

. For all

i \in {0, 1, \dots, N - 2}

, set:

\begin{matrix} g_{n, j, i} & = 0 \end{matrix}

(71)

\begin{matrix} h_{n, j, i} & = e_{j^{*}} \end{matrix}

(72)

where

e_{j^{*}}

is a

1 \times m

unit vector so that only the element of the

j^{*}

-th position is one and all other elements are zero, then we have:

\begin{matrix} S_{n, j} & = (S_{n, j, 0}; \dots; S_{n, j, N - 2}) = (h_{n, j, 0} a_{\bar{n - 2}} + g_{n, j, 0} b_{0}; \dots; h_{n, j, N - 2} a_{\bar{n + N - 4}} + g_{n, j, N - 2} b_{N - 2}) \end{matrix}

(73)

\begin{matrix} = (h_{n, j, 0} a_{\bar{n - 2}}; \dots; h_{n, j, N - 2} a_{\bar{n + N - 4}}) \end{matrix}

(74)

\begin{matrix} = (a_{\bar{n - 2}, j^{*}}; \dots; a_{\bar{n + N - 4}, j^{*}}) \end{matrix}

(75)

which is a cyclic shift of

{\vec{a}}_{j^{*}} = (a_{0, j^{*}}; a_{1, j^{*}}; \dots; a_{N - 2, j^{*}})

.

The case of

S_{n, j} = {\vec{b}}_{j^{*}}

follows similarly by setting $h_{n, j, i} = 0$ and $g_{n, j, i} = e_{j^{*}}$ for all $i \in \{0, 1, \dots, N - 2\}$. □
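The cyclic shift in Equation (75) can be made concrete with a few lines of code. This is a small sketch with illustrative names and zero-based $j^{*}$; symbols are represented as strings purely for display.

```python
def shifted_block_indices(n, N):
    """Block indices of W^1 that appear in rows 0..N-2 of S_{n,j}, for n >= 3."""
    return [(i + n - 2) % (N - 1) for i in range(N - 1)]

def stored_with_unit_h(a, n, N, j_star):
    """S_{n,j} when g = 0 and h = e_{j*}: symbol j* of each shifted block."""
    return [a[idx][j_star] for idx in shifted_block_indices(n, N)]

# N = 4, m = 2: three blocks of two symbols each
a = [["a0,1", "a0,2"], ["a1,1", "a1,2"], ["a2,1", "a2,2"]]
print(stored_with_unit_h(a, n=3, N=4, j_star=0))  # ['a1,1', 'a2,1', 'a0,1']
print(stored_with_unit_h(a, n=4, N=4, j_star=1))  # ['a2,2', 'a0,2', 'a1,2']
```

Because the index map is a cyclic shift of $\{0, \dots, N - 2\}$, every symbol of the chosen vector appears exactly once in the stored column.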

Fix any

T = 2 m

databases. Suppose

T_{1} \leq 2 m

databases are from

D B (i, j)

where

i \in {1, 2}

,

j \in {1, \dots, m}

and they will contribute

T_{1}

distinct

{\vec{a}}_{j_{1}^{*}}

and

{\vec{b}}_{j_{2}^{*}}

vectors. The remaining

T - T_{1}

databases are from

D B (n, j)

where

n \in {3, \dots, N}, j \in {1, \dots, m}

, and our goal is to recover all remaining

T - T_{1}

{\vec{a}}_{j_{3}^{*}}

and

{\vec{b}}_{j_{4}^{*}}

vectors. We can identify a one-to-one mapping between the

T - T_{1}

databases and the remaining

(T - T_{1}) {\vec{a}}_{j_{3}^{*}}

and

{\vec{b}}_{j_{4}^{*}}

vectors and apply Lemma 1 to find the assignment such that the

{\vec{a}}_{j_{3}^{*}}

and

{\vec{b}}_{j_{4}^{*}}

vectors are fully recovered. Hence, from any T databases, we may recover

({\vec{a}}_{1}, \dots, {\vec{a}}_{m})

and

({\vec{b}}_{1}, \dots, {\vec{b}}_{m})

, i.e., all symbols from

W^{1}

and

W^{2}

. Therefore, there indeed exists a choice of

h_{n, j, i}, g_{n, j, i}

for which the determinant polynomial is not zero.
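The existence argument can also be explored numerically for small parameters. The sketch below assumes $N = 3$, $m = 2$ and the prime field GF(257), with zero-based j; it draws the coefficients at random, retries until the MDS criterion holds, and checks every choice of $T = 2 m$ databases by a rank computation. The function names and the retry loop are illustrative and not part of the proof.

```python
import random
from itertools import combinations

P = 257       # prime field size (assumption)
N, m = 3, 2
B = N - 1
L = m * B     # symbols per message; unknowns are ordered as all a-symbols, then all b-symbols

def rank_mod_p(rows):
    """Rank over GF(P) via Gauss-Jordan elimination."""
    mat = [list(row) for row in rows]
    rank = 0
    for col in range(len(mat[0])):
        pivot = next((r for r in range(rank, len(mat)) if mat[r][col]), None)
        if pivot is None:
            continue
        mat[rank], mat[pivot] = mat[pivot], mat[rank]
        inv = pow(mat[rank][col], -1, P)
        mat[rank] = [(x * inv) % P for x in mat[rank]]
        for r in range(len(mat)):
            if r != rank and mat[r][col]:
                factor = mat[r][col]
                mat[r] = [(x - factor * y) % P for x, y in zip(mat[r], mat[rank])]
        rank += 1
    return rank

def database_rows(n, j, h, g):
    """The N-1 linear combinations stored in DB(n, j), as rows over the 2L unknowns."""
    rows = []
    for i in range(B):
        row = [0] * (2 * L)
        if n == 1:
            row[i * m + j] = 1
        elif n == 2:
            row[L + i * m + j] = 1
        else:
            for t in range(m):
                row[((i + n - 2) % B) * m + t] = h[(n, j, i)][t]
                row[L + i * m + t] = g[(n, j, i)][t]
        rows.append(row)
    return rows

def is_mds(h, g):
    """True if every T = 2m databases determine all 2L message symbols."""
    dbs = [(n, j) for n in range(1, N + 1) for j in range(m)]
    for subset in combinations(dbs, 2 * m):
        rows = [row for (n, j) in subset for row in database_rows(n, j, h, g)]
        if rank_mod_p(rows) != 2 * L:
            return False
    return True

def sample(rng):
    keys = [(n, j, i) for n in range(3, N + 1) for j in range(m) for i in range(B)]
    h = {key: [rng.randrange(P) for _ in range(m)] for key in keys}
    g = {key: [rng.randrange(P) for _ in range(m)] for key in keys}
    return h, g

rng = random.Random(0)
h, g = sample(rng)
while not is_mds(h, g):   # resample in the unlikely event of a bad draw
    h, g = sample(rng)
print("found coefficients satisfying the MDS criterion")
```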

Finally, we need to consider the correctness and MDS criterion jointly and show that there exists a single choice of

h_{n, j, i}, g_{n, j, i}

that satisfies both constraints at the same time. The product of all determinant polynomials has degree at most

2 m (N - 2) (N - 1) + 2 m (N - 1) (\binom{m N}{2 m})

, and as

q > 2 m (N - 2) (N - 1) + 2 m (N - 1) (\binom{m N}{2 m})

, the Schwartz–Zippel lemma guarantees the existence of a feasible choice.
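The field-size threshold can be evaluated for concrete parameters. This is a small sketch with an illustrative function name; it reproduces the value $12 \times \binom{8}{4} + 12 \times 2$ from the example with $N = 4, m = 2$.

```python
from math import comb

def field_size_threshold(N, m):
    """Degree bound 2m(N-2)(N-1) + 2m(N-1) * C(mN, 2m); q must exceed this value."""
    correctness_degree = 2 * m * (N - 2) * (N - 1)
    mds_degree = 2 * m * (N - 1) * comb(m * N, 2 * m)
    return correctness_degree + mds_degree

print(field_size_threshold(4, 2))  # 864 = 12 * 70 + 12 * 2
print(field_size_threshold(3, 2))  # 128 = 8 + 8 * 15
```

The threshold grows quickly with $m N$ through the binomial coefficient, since it comes from a union-style argument over all choices of $2 m$ databases.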

6. Conclusions

We considered the problem of private information retrieval from MDS-coded databases. In contrast to the prevailing approach in the literature, where the messages are encoded separately using MDS codes, we considered encoding and storing the messages jointly using an MDS code. In many cases, joint MDS coding breaks the capacity barrier of separate-coding MDS-PIR. To establish this result, we presented two novel code constructions and the corresponding PIR protocols, and introduced an expansion technique that allows more general parameters. The capacity of PIR with joint MDS storage, especially the converse side, remains an interesting future direction.

Author Contributions

The authors contributed equally to this work.

Funding

The work of C. Tian was supported in part by the National Science Foundation under Grants CCF-18-32309 and CCF-18-16546.

Conflicts of Interest

The authors declare no conflict of interest.

Table 1. Stored variables.

Database 1	Database 2	Database 3	Database 4
$a_{0}$	$b_{0}$	$a_{1} + b_{0}$	$2 a_{2} + b_{0}$
$a_{1}$	$b_{1}$	$a_{2} + b_{1}$	$2 a_{0} + b_{1}$
$a_{2}$	$b_{2}$	$a_{0} + b_{2}$	$2 a_{1} + b_{2}$

Table 2. Answers for $W^{1}$.

$F$	Database 1	Database 2	Database 3	Database 4
0	$a_{0}$	$b_{0}$	$a_{1} + b_{0}$	$2 a_{2} + b_{0}$
1	$a_{1}$	$b_{1}$	$a_{2} + b_{1}$	$2 a_{0} + b_{1}$
2	$a_{2}$	$b_{2}$	$a_{0} + b_{2}$	$2 a_{1} + b_{2}$

Table 3. Answers for $W^{2}$.

$F$	Database 1	Database 2	Database 3	Database 4
0	$a_{0}$	$b_{0}$	$a_{0} + b_{2}$	$2 a_{0} + b_{1}$
1	$a_{1}$	$b_{1}$	$a_{1} + b_{0}$	$2 a_{1} + b_{2}$
2	$a_{2}$	$b_{2}$	$a_{2} + b_{1}$	$2 a_{2} + b_{0}$

Table 4. Stored variables.

Database 1	Database 2	Database 3	Database 4
$a_{1}$	$b_{1}$	$c_{1}$	$a_{1} + b_{1} + c_{1}$
$a_{2}$	$b_{2}$	$c_{2}$	$a_{2} + b_{2} + c_{2}$

Table 5. Answers for $W^{1}$.

$F$	Database 1	Database 2	Database 3	Database 4
1	$a_{2}$	$b_{1}$	$c_{1}$	$a_{1} + b_{1} + c_{1}$
2	$a_{1}$	$b_{2}$	$c_{2}$	$a_{2} + b_{2} + c_{2}$

Table 6. Answers for $W^{2}$.

$F$	Database 1	Database 2	Database 3	Database 4
1	$a_{1}$	$b_{2}$	$c_{1}$	$a_{1} + b_{1} + c_{1}$
2	$a_{2}$	$b_{1}$	$c_{2}$	$a_{2} + b_{2} + c_{2}$

Table 7. Answers for $W^{3}$.

$F$	Database 1	Database 2	Database 3	Database 4
1	$a_{1}$	$b_{1}$	$c_{2}$	$a_{1} + b_{1} + c_{1}$
2	$a_{2}$	$b_{2}$	$c_{1}$	$a_{2} + b_{2} + c_{2}$

Table 8. Stored variables.

Database 1	Database 2	⋯	Database $(N - 1)$	Database N
$W_{1}^{1}$	$W_{1}^{2}$	⋯	$W_{1}^{K}$	$\sum_{k = 1}^{K} W_{1}^{k}$
$W_{2}^{1}$	$W_{2}^{2}$	⋯	$W_{2}^{K}$	$\sum_{k = 1}^{K} W_{2}^{k}$

Table 9. Answers for $W^{k}$.

$F$	Database 1	⋯	Database k	⋯	Database $(N - 1)$	Database N
1	$W_{1}^{1}$	⋯	$W_{2}^{k}$	⋯	$W_{1}^{K}$	$\sum_{k = 1}^{K} W_{1}^{k}$
2	$W_{2}^{1}$	⋯	$W_{1}^{k}$	⋯	$W_{2}^{K}$	$\sum_{k = 1}^{K} W_{2}^{k}$

Table 10. Stored variables.

$DB (1, :)$	$DB (2, :)$	⋯	$DB (K, :)$	$DB (K + 1, 1)$	⋯	$DB (K + 1, m)$
$W_{1}^{1}$	$W_{1}^{2}$	⋯	$W_{1}^{K}$	$C (1, :) W_{1}$	⋯	$C (m, :) W_{1}$
$W_{2}^{1}$	$W_{2}^{2}$	⋯	$W_{2}^{K}$	$C (1, :) W_{2}$	⋯	$C (m, :) W_{2}$

Table 11. Answers for $W^{k}$.

$F$	$DB (1, :)$	⋯	$DB (k, :)$	⋯	$DB (K, :)$	$DB (K + 1, 1)$	⋯	$DB (K + 1, m)$
1	$W_{1}^{1}$	⋯	$W_{2}^{k}$	⋯	$W_{1}^{K}$	$C (1, :) W_{1}$	⋯	$C (m, :) W_{1}$
2	$W_{2}^{1}$	⋯	$W_{1}^{k}$	⋯	$W_{2}^{K}$	$C (1, :) W_{2}$	⋯	$C (m, :) W_{2}$

Table 12. Stored variables.

(DB1, DB2)	(DB3, DB4)	(DB5, DB6)	(DB7, DB8)
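$(a_{0}, a_{0}^{'})$	$(b_{0}, b_{0}^{'})$	$(\mathbf{h}_{1} \mathbf{a}_{1} + \mathbf{g}_{1} \mathbf{b}_{0}, \mathbf{h}_{2} \mathbf{a}_{1} + \mathbf{g}_{2} \mathbf{b}_{0})$	$(\mathbf{h}_{7} \mathbf{a}_{2} + \mathbf{g}_{7} \mathbf{b}_{0}, \mathbf{h}_{8} \mathbf{a}_{2} + \mathbf{g}_{8} \mathbf{b}_{0})$
$(a_{1}, a_{1}^{'})$	$(b_{1}, b_{1}^{'})$	$(\mathbf{h}_{3} \mathbf{a}_{2} + \mathbf{g}_{3} \mathbf{b}_{1}, \mathbf{h}_{4} \mathbf{a}_{2} + \mathbf{g}_{4} \mathbf{b}_{1})$	$(\mathbf{h}_{9} \mathbf{a}_{0} + \mathbf{g}_{9} \mathbf{b}_{1}, \mathbf{h}_{10} \mathbf{a}_{0} + \mathbf{g}_{10} \mathbf{b}_{1})$
$(a_{2}, a_{2}^{'})$	$(b_{2}, b_{2}^{'})$	$(\mathbf{h}_{5} \mathbf{a}_{0} + \mathbf{g}_{5} \mathbf{b}_{2}, \mathbf{h}_{6} \mathbf{a}_{0} + \mathbf{g}_{6} \mathbf{b}_{2})$	$(\mathbf{h}_{11} \mathbf{a}_{1} + \mathbf{g}_{11} \mathbf{b}_{2}, \mathbf{h}_{12} \mathbf{a}_{1} + \mathbf{g}_{12} \mathbf{b}_{2})$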

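Table 13. Answers for $W^{1}$.

$F$	(DB1, DB2)	(DB3, DB4)	(DB5, DB6)	(DB7, DB8)
0	$(a_{0}, a_{0}^{'})$	$(b_{0}, b_{0}^{'})$	$(\mathbf{h}_{1} \mathbf{a}_{1} + \mathbf{g}_{1} \mathbf{b}_{0}, \mathbf{h}_{2} \mathbf{a}_{1} + \mathbf{g}_{2} \mathbf{b}_{0})$	$(\mathbf{h}_{7} \mathbf{a}_{2} + \mathbf{g}_{7} \mathbf{b}_{0}, \mathbf{h}_{8} \mathbf{a}_{2} + \mathbf{g}_{8} \mathbf{b}_{0})$
1	$(a_{1}, a_{1}^{'})$	$(b_{1}, b_{1}^{'})$	$(\mathbf{h}_{3} \mathbf{a}_{2} + \mathbf{g}_{3} \mathbf{b}_{1}, \mathbf{h}_{4} \mathbf{a}_{2} + \mathbf{g}_{4} \mathbf{b}_{1})$	$(\mathbf{h}_{9} \mathbf{a}_{0} + \mathbf{g}_{9} \mathbf{b}_{1}, \mathbf{h}_{10} \mathbf{a}_{0} + \mathbf{g}_{10} \mathbf{b}_{1})$
2	$(a_{2}, a_{2}^{'})$	$(b_{2}, b_{2}^{'})$	$(\mathbf{h}_{5} \mathbf{a}_{0} + \mathbf{g}_{5} \mathbf{b}_{2}, \mathbf{h}_{6} \mathbf{a}_{0} + \mathbf{g}_{6} \mathbf{b}_{2})$	$(\mathbf{h}_{11} \mathbf{a}_{1} + \mathbf{g}_{11} \mathbf{b}_{2}, \mathbf{h}_{12} \mathbf{a}_{1} + \mathbf{g}_{12} \mathbf{b}_{2})$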

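Table 14. Answers for $W^{2}$.

$F$	(DB1, DB2)	(DB3, DB4)	(DB5, DB6)	(DB7, DB8)
0	$(a_{0}, a_{0}^{'})$	$(b_{0}, b_{0}^{'})$	$(\mathbf{h}_{5} \mathbf{a}_{0} + \mathbf{g}_{5} \mathbf{b}_{2}, \mathbf{h}_{6} \mathbf{a}_{0} + \mathbf{g}_{6} \mathbf{b}_{2})$	$(\mathbf{h}_{9} \mathbf{a}_{0} + \mathbf{g}_{9} \mathbf{b}_{1}, \mathbf{h}_{10} \mathbf{a}_{0} + \mathbf{g}_{10} \mathbf{b}_{1})$
1	$(a_{1}, a_{1}^{'})$	$(b_{1}, b_{1}^{'})$	$(\mathbf{h}_{1} \mathbf{a}_{1} + \mathbf{g}_{1} \mathbf{b}_{0}, \mathbf{h}_{2} \mathbf{a}_{1} + \mathbf{g}_{2} \mathbf{b}_{0})$	$(\mathbf{h}_{11} \mathbf{a}_{1} + \mathbf{g}_{11} \mathbf{b}_{2}, \mathbf{h}_{12} \mathbf{a}_{1} + \mathbf{g}_{12} \mathbf{b}_{2})$
2	$(a_{2}, a_{2}^{'})$	$(b_{2}, b_{2}^{'})$	$(\mathbf{h}_{3} \mathbf{a}_{2} + \mathbf{g}_{3} \mathbf{b}_{1}, \mathbf{h}_{4} \mathbf{a}_{2} + \mathbf{g}_{4} \mathbf{b}_{1})$	$(\mathbf{h}_{7} \mathbf{a}_{2} + \mathbf{g}_{7} \mathbf{b}_{0}, \mathbf{h}_{8} \mathbf{a}_{2} + \mathbf{g}_{8} \mathbf{b}_{0})$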

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
