s3storage - zope in the cloud
=============================
Setup
-----
You will need:

* boto, from http://code.google.com/p/boto
* the Python memcached bindings, from http://cheeseshop.python.org/pypi/memcached/

Start a memcached server with the -M option so it won't garbage collect your
locks.
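For example (port and memory size here are arbitrary choices, not requirements):

```shell
# Start memcached with 64 MB of memory on port 12345.
# -M makes it return an error when memory is exhausted instead of
# evicting entries, so locks are never silently dropped.
# -d runs it as a daemon.
memcached -M -m 64 -p 12345 -d
```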
Put this in your zope.conf
%import Products.s3storage

<zodb_db main>
    # Main S3Storage database
    mount-point /
    connection-class Products.s3storage.connection.StoragePerConnection
    <s3storage>
        aws_access_key_id 12345ABC...
        aws_secret_access_key abc123.....
        bucket_name mys3bucket
        memcache_servers 127.0.0.1:12345
    </s3storage>
</zodb_db>
Current Status
--------------
Dog slow ;-)
Far too many writes to S3. It will be interesting to see what performance you
get from EC2 once they hand out more beta keys.
Design ideas - not completely current!
--------------------------------------
s3storage is an experiment. Amazon's S3 storage service provides an eventual
consistency guarantee, but propagation delays mean that one must be careful
when trying to cram transactional semantics on top of the model. That said, it
has the potential (along with the Elastic Compute Cloud) to provide a massively
scalable object system.
Benefits of S3:
* Simple model, it's just a massive hash table
* Cheap. $0.15 per GB per month
* Scalable
* Reliable (eventually)
Drawbacks:
* Indeterminate propagation time (the time between when a write occurs and
  when a read on the resource can be relied upon)
* No built in transaction support
* No renaming (so we can't reuse Directory Storage)
So that leaves EC2 (the elastic compute cloud) to fill in the gaps.
Benefits of EC2:
* Powerful virtual machines on demand
* Cheap. $0.10 per vm per hour
* It's Linux (pretty much any distribution you want)
Drawbacks:
* Individual VMs are not reliable: this runs on commodity hardware, so expect
  them to fail occasionally
Bandwidth for both services is charged at $0.20 per GB, but only outside the
AWS system (so it's free between S3 and EC2).
So the challenge is to build a scalable zope system that leverages S3's
scalability while making up for its shortcomings with EC2.
Vision
======
Build an S3 storage backend that can scale to many readers (lots of zope
clients) but delegate locking to a known system such as zeo or postgres
for writes. Realtime systems are not the target here; think content and asset
management.
Model
-----
S3 provides no locking. So this must be provided for in EC2.
After a commit, an object will spend an indeterminate time before it becomes
reliably readable. Until that time has passed the view is not reliable. For the
sake of argument let's say Tp = 5 minutes. This means that the last five
minutes of object state must be managed through a transactional system (think
zeo or postgres). There's no use trying to solve the read-only case if the
write case is the bottleneck...
This gives us three pools of object revisions:

* Uncommitted pool.
* Committed, but not to be relied upon: the centrally accounted pool.
* Propagated and relied upon: the distributed pool.

And two distribution methods:

* Scalable but limited S3
* The central transactional system (zeo or postgres), reliable but a bottleneck
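The three pools and the Tp window above can be sketched as a small classifier
(names here are illustrative only, not from the actual implementation; Tp = 5
minutes as assumed above):

```python
from enum import Enum

class Pool(Enum):
    UNCOMMITTED = 1   # written during a transaction, not yet committed
    CENTRAL = 2       # committed, younger than Tp: served by the locking server
    DISTRIBUTED = 3   # older than Tp: safe to read directly from S3

TP_SECONDS = 5 * 60   # propagation window, 5 minutes for sake of argument

def pool_for(commit_time, now, committed=True):
    """Classify an object revision by its commit state and age."""
    if not committed:
        return Pool.UNCOMMITTED
    if now - commit_time < TP_SECONDS:
        return Pool.CENTRAL
    return Pool.DISTRIBUTED
```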
Performance considerations
--------------------------
Read performance encourages us to opt for the distributed path. But this incurs
a time gap that is potentially disastrous for write performance: we potentially
get a conflict error if any object is updated within Tp.
TODO: Understand commit retries. Could selected transactions be upgraded to a
more current view? Must avoid this central hit on every read.
Possible approaches to scalability
----------------------------------
* Data partitioning. i.e. spread load between several locking servers. e.g.
folder A on one server, folder B on another. Complicated to manage. Though
it would be necessary for very large systems.
* Something like memcached? Potentially simpler to manage.
* Or Amazon SQS for locking? Use a queue as a lock, and make oids a client id
  prefix (16 bits) followed by a 48-bit per-client increasing counter
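The oid scheme in that last bullet could be sketched like this (a hypothetical
illustration; ZODB oids are 8-byte strings, packed big-endian here so that oids
from one client sort in creation order):

```python
import struct

def make_oid(client_id, counter):
    """Pack a 16-bit client id and a 48-bit per-client counter into
    an 8-byte ZODB-style oid."""
    assert 0 <= client_id < 1 << 16
    assert 0 <= counter < 1 << 48
    # '>HHI' = big-endian: 2-byte client id, then the counter split
    # into its high 16 bits and low 32 bits (2 + 2 + 4 = 8 bytes).
    return struct.pack('>HHI', client_id, counter >> 32, counter & 0xFFFFFFFF)

def split_oid(oid):
    """Recover (client_id, counter) from an oid."""
    client_id, hi, lo = struct.unpack('>HHI', oid)
    return client_id, (hi << 32) | lo
```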
Reliability considerations
--------------------------
* EC2 is not reliable. We must consider the possibility of transaction manager
  node failure. This should not be disastrous: as S3 guarantees eventual
  consistency, potentially no more than ~Tp of down time would be necessary on
  node failure, and then only for writes. The storage scheme must prepare for
  this.
  NOTE: this assumes that a write, once accepted, is guaranteed to eventually
  succeed. I _think_ S3 gives this guarantee.
* But overhead of disaster preparedness must not be so onerous as to overly
limit the common case.
Filesystem Structure
====================
* S3 is a giant hash table, not a hierarchical filesystem
* S3 provides an efficient list operation. Listing a bucket can be filtered by
prefix, limited by max-results and also have results rolled up with a
delimiter (such as '/'). A marker can be provided to list only results that
are alphabetically after a particular key. List results are always returned
in alphabetical order.
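Those list semantics can be illustrated with a pure-Python simulation (a sketch
of the behaviour only; real code would go through boto, and real S3 also counts
rolled-up prefixes against max-results):

```python
def s3_list(keys, prefix='', marker='', delimiter='', max_results=1000):
    """Simulate an S3 bucket listing: keys come back in alphabetical
    order, filtered by prefix, starting strictly after marker, with
    sub-'directories' rolled up when a delimiter is given."""
    results, common = [], set()
    for key in sorted(keys):
        if not key.startswith(prefix) or key <= marker:
            continue
        rest = key[len(prefix):]
        if delimiter and delimiter in rest:
            # roll everything below the delimiter up into one common prefix
            common.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
            continue
        results.append(key)
        if len(results) >= max_results:
            break
    return results, sorted(common)
```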
* S3 is atomic only for single file puts.
* Only existence is certain
* Existing filestorage strategies do not map well
* Three places to store information: the key, the file contents and the
  metadata. Only the key is returned on a list (actually some S3 system
  metadata is returned here too). Keys must be UTF-8 and up to 1024 bytes.
  If keys contain data you don't already know, then two requests are required
  to read the contents.
Transaction Log
---------------
* Only single-file operations are atomic. They either succeed or fail.
  The latest timestamp wins for operations on the same resource.
On a transaction commit, a transaction log entry is written with a key such as

type:transaction,serial:<serial repr>,prev:<prev serial repr>,tid:<tid repr>,ltid:<prev tid repr>

It contains a list of modified oids.
Precondition: all pickle writes must have successfully completed before this
entry is written.

The prev: field is there so that readers can be certain they are seeing a
complete list (only existence is certain).
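Building and parsing such keys might be sketched as follows (the repr values
here are placeholders, not the actual encoding):

```python
def make_txn_key(serial, prev, tid, ltid):
    """Build a transaction-log key in the format described above."""
    return 'type:transaction,serial:%s,prev:%s,tid:%s,ltid:%s' % (
        serial, prev, tid, ltid)

def parse_key(key):
    """Parse a comma-separated field:value key back into a dict."""
    return dict(field.split(':', 1) for field in key.split(','))
```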
Efficiency of reading the transaction log
-----------------------------------------
We have a design decision here. We can choose to read efficiently either from
the end of the key list or from a certain point in the list, depending on the
representation format used for tids, i.e. whether tid=1 ends up alphabetically
lower or higher than tid=2.
(Note we could use another index: field, but it would add complication.)
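If we want newer tids to sort alphabetically lower (so a list from the start of
the keyspace returns the most recent transactions first), one option is to
encode the complement of the tid as fixed-width hex. This is only a sketch: the
actual repr format is exactly the design decision above.

```python
MAX_TID = (1 << 64) - 1

def tid_repr(tid):
    """Encode a 64-bit tid so that higher tids sort alphabetically
    lower: fixed-width hex of the tid's complement."""
    return '%016x' % (MAX_TID - tid)
```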
Records
-------
* A record is the state of an object at a particular transaction
On a store, a record is written with a key of the form

type:record,oid:<oid repr>,serial:<serial repr>

Note that records can exist for data in aborted or in-progress transactions.
Indexing
--------
We want to avoid having to read through the entire transaction log in order
to find:

* The record for an oid currently, or at a particular transaction
  (loadBefore, load)

This should be implemented as a list operation, so we write a 0-byte file

type:index_commit,oid:<oid repr>,serial:<serial repr>,prev:<prev serial repr>,tid:<tid repr>,ltid:<ltid repr>

and, for the serial of a tid,

type:index_tid,tid:<tid repr>,serial:<serial repr>,prev:<prev serial repr>,ltid:<ltid repr>

This means that the serial sequence must be reverse alphabetically sorted.

These are starting to look very like RDF data structures.
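Putting the pieces together, load/loadBefore could then be answered by a single
alphabetical scan over the index keys, which is what a marker-based S3 list
gives us. A sketch under the reverse-sorted encoding assumption (the trailing
prev/tid/ltid fields are omitted from the keys for brevity):

```python
def enc(serial):
    # reverse-chronological encoding: higher serials sort lower
    return '%016x' % ((1 << 64) - 1 - serial)

def load_before(index_keys, oid, serial):
    """Return the encoded serial of the newest revision of oid that is
    strictly older than `serial`. Because serials are encoded to sort
    reverse-chronologically, the first key alphabetically after the
    marker is the revision we want: one list request, no log scan."""
    prefix = 'type:index_commit,oid:%s,serial:' % oid
    marker = prefix + enc(serial)
    for key in sorted(index_keys):
        if key.startswith(prefix) and key > marker:
            return key[len(prefix):]
    return None
```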