Proofs of Storage for Dagger
This entry should serve as an overview of our ongoing research into proofs of storage. It captures some basic intuitions and outlines some potential future research directions.
As outlined in Dagger Research Scope, one of the desired outcomes of the first phase of Dagger is to categorize and validate existing proof-of-storage (also sometimes known as remote auditing) schemes. In this post we'll attempt to outline some of the approaches already explored and give a sense of the basic intuition behind each.
Proofs of Storage
It is far from easy to provide an unequivocal definition of what Proof of Storage means; in fact, several interpretations have emerged in the literature over the past two decades. Two major approaches to remote auditing are Provable Data Possession (PDP) and Proofs of Retrievability (PoR). Both schemes try to accomplish similar objectives, but their assumptions and security guarantees differ.
PDP focuses on showing, with some probability, that the data is stored in its entirety by the prover (as in [Ateniese2007], which coined the term PDP and whose scheme detects data loss with high probability). Data possession does not formally guarantee that the data is recoverable from the node possessing it.
PoRs, on the other hand, focus on guaranteeing retrieval, i.e. that the whole data can be extracted from the storage node. The initial version [Juels2007] in fact defines an extractor function that, given the correct security parameters, can determine whether the data in the file is still recoverable.
Both schemes were subsequently improved on and formally proved secure by Shacham and Waters in their Compact Proofs of Retrievability paper [Shacham2008].
The intuition behind both of these schemes is similar, but PoRs add a twist with the use of erasure coding and guarantees of extraction.
One potentially obvious but important aspect to note is that any proof of storage should be significantly more efficient than sending the entirety or even a portion of the data to the verifier, otherwise these schemes become impractical.
PDP vs PoR
The original PDP scheme uses homomorphic authenticators to generate proofs that some previously tagged blocks or chunks still exist.
Publicly verifiable PDP constructs usually rely on some form of public-key (asymmetric) cryptography; an important aspect of a practical public PDP is the lack of state that a verifier needs to keep between verifications. The original construct presented in [Ateniese2007] relies on RSA authenticators and was further improved to use BLS signatures in [Shacham2008].
To some extent, most public PDP schemes can be considered zero-knowledge proofs of knowledge, as verification requires no prior knowledge of the data and the proofs reveal no information about the data being proved.
The homomorphism provided by the authenticators is usually additive, although full homomorphism might offer additional desirable properties that PDPs and PoRs could exploit.
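To make the mechanics concrete, here is a minimal toy sketch in the spirit of the privately verifiable Shacham-Waters construction: each block m_i gets an authenticator sigma_i = alpha * m_i + f_k(i) over a prime field, and the additive homomorphism lets the prover compress a whole challenge into two field elements. The names, the SHA-256-based stand-in for the PRF, and the small modulus below are our own illustrative choices, not the paper's exact construction.

```python
import hashlib
import secrets

P = 2**61 - 1  # toy prime modulus; a real scheme uses a much larger group

def prf(key: bytes, i: int) -> int:
    """Pseudorandom value f_k(i), here built from SHA-256 for illustration."""
    return int.from_bytes(hashlib.sha256(key + i.to_bytes(8, "big")).digest(), "big") % P

def tag_blocks(blocks, alpha, key):
    """Client-side setup: one authenticator sigma_i = alpha*m_i + f_k(i) per block."""
    return [(alpha * m + prf(key, i)) % P for i, m in enumerate(blocks)]

def prove(blocks, tags, challenge):
    """Prover aggregates the challenged blocks and tags into a constant-size proof."""
    mu = sum(nu * blocks[i] for i, nu in challenge) % P
    sigma = sum(nu * tags[i] for i, nu in challenge) % P
    return mu, sigma

def verify(mu, sigma, challenge, alpha, key):
    """Verifier checks sigma == alpha*mu + sum(nu_i * f_k(i)) without the data."""
    expected = (alpha * mu + sum(nu * prf(key, i) for i, nu in challenge)) % P
    return sigma == expected

# Usage: tag the blocks, challenge a random subset, verify the aggregate proof.
blocks = [secrets.randbelow(P) for _ in range(16)]
alpha, key = secrets.randbelow(P), secrets.token_bytes(32)
tags = tag_blocks(blocks, alpha, key)
challenge = [(i, secrets.randbelow(P)) for i in secrets.SystemRandom().sample(range(16), 4)]
assert verify(*prove(blocks, tags, challenge), challenge, alpha, key)
```

Note that the proof is two field elements regardless of how many blocks are challenged, which is exactly the efficiency property mentioned above.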
PoRs, as originally defined by Juels and Kaliski, use a combination of erasure coding, sentinel blocks and encryption to prove possession of data. The introduction of erasure coding and extraction makes PoRs formally more secure than PDPs, but in practice a PDP scheme can easily be extended into a PoR scheme with the use of erasure coding.
The intuition behind this is as follows: given a data set erasure coded into 2n blocks (i.e. at least twice the size of the original data), where any n blocks suffice to recover the entire set, holding any 50% of the encoded blocks is enough to recover the data. Naturally, this lowers the burden on the proofs and at the same time makes them more robust: we're no longer trying to prove that every original block is there, only that enough blocks remain to recover the data.
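As a quick sanity check on that intuition, here is a small illustrative calculation. Once the code guarantees the file is only unrecoverable when at least half of the 2n encoded blocks are gone, a uniform random spot-check of c blocks misses such a large loss with probability roughly (1/2)^c; the numbers below are our own worked example.

```python
from math import comb

def miss_probability(total: int, lost: int, checked: int) -> float:
    """Probability that a random challenge of `checked` distinct blocks
    hits only surviving blocks, i.e. the audit misses the loss."""
    return comb(total - lost, checked) / comb(total, checked)

# With a rate-1/2 erasure code the file is unrecoverable only once half
# the 2n encoded blocks are gone; checking c blocks then misses the loss
# with probability ~(1/2)^c.
total_blocks = 2 * 512  # 2n encoded blocks for n = 512
for c in (1, 5, 10, 20):
    p = miss_probability(total_blocks, total_blocks // 2, c)
    print(f"check {c:>2} blocks -> miss probability {p:.2e}")
```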
What do these schemes really prove?
One important question to answer is what these schemes actually prove and how they can be used in the context of a storage solution.
Both of these proofs are used to determine that, at the time of querying, the prover was still in possession of the data, but nothing else. As the Gambler's Fallacy reminds us, past performance doesn't guarantee future outcomes. This means that, no matter how many times we've verified the data, the chances of it going missing between the last round and now are the same.
However, what these schemes do give us is the ability to efficiently, and with high probability, detect whether the data has gone missing or become corrupted beyond repair. Detecting this allows us to build protocols that recover the data from additional replicas, if they exist, and hold the storage provider accountable when something has gone awry.
Data Durability and Redundancy
We haven't been able to find a canonical definition of data durability, but a useful one (albeit taken from a random source) states: "Durability ... refers to long-term data protection, i.e. the stored data does not suffer from bit rot, degradation or other corruption." Another definition, coined by us, reads: "How likely is the data to persist in the network (for a given time?) without being lost, to the point that it is not recoverable in its entirety."
In other words, data durability refers to the persistence and integrity of the data. Neither proof-of-storage scheme, however, increases data durability by itself, so what does?
Redundancy.
Where proofs of storage are really useful is in guaranteeing that a certain redundancy factor is being upheld: if it drops below some threshold, we can detect that and potentially do something to restore it.
What is redundancy?
To quote Wikipedia: "In computer main memory, auxiliary storage and computer buses, data redundancy is the existence of data that is additional to the actual data and permits correction of errors in stored or transmitted data."
In other words, redundancy is the additional data required to recover the actual data. This can be independent copies of the original, some error-correcting code that can recover it, or a combination of both (a minimal example follows the list below).
For storage systems, redundancy has two main implications:
There should exist more than one copy of data sufficient to recover the original
The copies should be stored independently such that in case of loss of one or more of them (due to some catastrophic event such as a natural disaster), the others are unaffected and can be used to recover/reconstruct the original
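To make the "copies and/or error-correcting code" distinction concrete, here is a minimal sketch of the error-correcting side, our own toy example using a single XOR parity block (the idea behind RAID-4/5); real systems use stronger codes such as Reed-Solomon.

```python
from functools import reduce

def parity(blocks: list[bytes]) -> bytes:
    """XOR of equal-length blocks; the simplest error-correcting redundancy."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

data = [b"ab", b"cd", b"ef"]  # original blocks
p = parity(data)              # one extra redundant block

# Lose any single block; XORing the survivors with the parity restores it.
lost_index = 1
survivors = [blk for i, blk in enumerate(data) if i != lost_index]
recovered = parity(survivors + [p])
assert recovered == data[lost_index]
```

One parity block tolerates the loss of any one data block; tolerating more simultaneous losses requires more redundancy, which is where the 2n-out-of-n erasure codes discussed earlier come in.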
In the context of proofs of storage such as PoRs and PDPs, this means being able to tell not only that the current data set is intact or recoverable, but also that there is more than one physical copy of it. These types of schemes are usually known as Multi-Replica Auditing.
Multiple replica auditing
One of the first attempts to address multi-replica auditing was outlined in MR-PDP: Multiple-Replica Provable Data Possession by Curtmola et al. This approach uses a variation of PDP together with a technique that masks blocks with some secret data, which makes it possible to physically differentiate each replica. Unfortunately, this scheme only works in a private setting and doesn't lend itself well to a public setup such as the one we expect to have in Dagger. Nevertheless, several other schemes have been proposed since, some of which are surveyed in Multiple-replica integrity auditing schemes for cloud data storage by Li et al.
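The core masking trick can be sketched as follows; the PRF-pad construction and names below are our simplified illustration, not the paper's exact RSA-based scheme. Each replica is the encrypted file XORed with a per-replica pseudorandom pad, so replicas are physically distinct on disk, yet the owner holding the key can unmask and audit any of them.

```python
import hashlib

def prf_pad(key: bytes, replica: int, block: int, size: int) -> bytes:
    """Per-replica, per-block pseudorandom pad (illustrative SHAKE-based stand-in)."""
    seed = key + replica.to_bytes(4, "big") + block.to_bytes(4, "big")
    return hashlib.shake_256(seed).digest(size)

def mask(block: bytes, key: bytes, replica: int, index: int) -> bytes:
    """XOR a block with its replica-specific pad; applying it twice unmasks."""
    pad = prf_pad(key, replica, index, len(block))
    return bytes(b ^ p for b, p in zip(block, pad))

key = b"owner-secret-key"
block = b"encrypted block0"                 # MR-PDP masks *encrypted* data
replica_a = mask(block, key, 0, 0)
replica_b = mask(block, key, 1, 0)
assert replica_a != replica_b               # replicas are physically distinct
assert mask(replica_a, key, 0, 0) == block  # XOR masking is invertible
```

Because unmasking requires the secret key, only the data owner can run the audit, which is precisely why this style of scheme is confined to the private setting mentioned above.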
Multi-replica auditing will be covered in more detail in a future update, as it is part of our ongoing research.
Future direction
There is still much to cover when it comes to proofs of storage, and we'll continue to explore these and other potential solutions. A few potential directions to explore next, besides MR-PDP/PoR, are the use of more clever erasure coding in proofs of storage and the use of more sophisticated proofs of knowledge, such as ZK proofs and polynomial commitments.