Replication
Riak was built to act as a multi-node cluster. It distributes data across multiple physical servers, which enables it to provide strong availability guarantees and fault tolerance.
The CAP theorem, which undergirds many of the design decisions behind Riak’s architecture, defines distributed systems in terms of three desired properties: consistency, availability, and partition (i.e. failure) tolerance. Riak can be used either as an AP, i.e. available/partition-tolerant, system or as a CP, i.e. consistent/partition-tolerant, system. The former relies on an Eventual Consistency model, while the latter relies on a special strong consistency subsystem.
Although the CAP theorem dictates that there is a necessary trade-off between data consistency and availability, if you are using Riak in an eventually consistent manner, you can fine-tune that trade-off. The ability to make these kinds of fundamental choices has immense value for your applications and is one of the features that differentiates Riak from other databases.
At the bottom of the page, you’ll find a screencast that briefly explains how to adjust your replication levels to match your application and business needs.
An option introduced in Riak version 2.0 is to use Riak as a strongly consistent system for data in specified buckets. Using Riak in this way is fundamentally different from adjusting replication properties and fine-tuning the availability/consistency trade-off, as it sacrifices all availability guarantees when necessary. Therefore, you should consult the Using Strong Consistency documentation, as this option will not be covered in this tutorial.
How Replication Properties Work
When using Riak, there are two ways of choosing replication properties:
- On a per-request basis
- In a more programmatic fashion, using bucket types
Per-request Replication Properties
The simplest way to apply replication properties to objects stored in Riak is to specify those properties
Replication Properties Through Bucket Types
Let’s say, for example, that you want to apply an n_val
of 5, an r
of 3, and a w
of 3 to all of the data in some of the buckets that
you’re using. In order to set those replication properties, you should
create a bucket type that sets those properties. Below is an example:
riak-admin bucket-type create custom_props '{"props":{"n_val":5,"r":3,"w":3}}'
riak-admin bucket-type activate custom_props
Now, any time you store an object in a bucket with the type
custom_props
those properties will apply to it.
Available Parameters
The table below lists the most frequently used replication parameters
that are available in Riak. Symbolic values like quorum
are discussed
below. Each
parameter will be explained in more detail in later sections:
Parameter | Common name | Default value | Description |
---|---|---|---|
n_val |
N | 3 |
Replication factor, i.e. the number of nodes in the cluster on which an object is to be stored |
r |
R | quorum |
The number of servers that must respond to a read request |
w |
W | quorum |
Number of servers that must respond to a write request |
pr |
PR | 0 |
The number of primary vnodes that must respond to a read request |
pw |
PW | 0 |
The number of primary vnodes that must respond to a write request |
dw |
DW | quorum |
The number of servers that must report that a write has been successfully written to disk |
rw |
RW | quorum |
If R and W are undefined, this parameter will substitute for both R and W during object deletes. It is extremely unlikely that you will need to adjust this parameter. |
notfound_ok |
true |
This parameter determines how Riak responds if a read fails on a node. Setting to true (the default) is the equivalent to setting R to 1: if the first node to respond doesn’t have a copy of the object, Riak will immediately return a not found error. If set to false , Riak will continue to look for the object on the number of nodes specified by N (aka n_val ). |
|
basic_quorum |
false |
If notfound_ok is set to false , Riak will be more thorough in looking for an object on multiple nodes. Setting basic_quorum to true in this case will instruct Riak to wait for only a quorum of responses to return a notfound error instead of N responses. |
A Primer on N, R, and W
The most important thing to note about Riak’s replication controls is
that they can be at the bucket level. You can use bucket types
to set up bucket A
to use a particular set of replication properties
and bucket B
to use entirely different properties.
At the bucket level, you can choose how many copies of data you want to
store in your cluster (N, or n_val
), how many copies you wish to read
from at one time (R, or r
), and how many copies must be written to be
considered a success (W, or w
).
In addition to the bucket level, you can also specify replication properties on the client side for any given read or write. The examples immediately below will deal with bucket-level replication settings, but check out the section below for more information on setting properties on a per-operation basis.
The most general trade-off to be aware of when setting these values is the trade-off between data accuracy and client responsiveness. Choosing higher values for N, R, and W will mean higher accuracy because more nodes are checked for the correct value on read and data is written to more nodes upon write; but higher values will also entail degraded responsiveness, especially if one or more nodes is failing, because Riak has to wait for responses from more nodes.
N Value and Replication
All data stored in Riak will be replicated to the number of nodes in the
cluster specified by a bucket’s N value (n_val
). The default n_val
in Riak is 3, which means that data stored in a bucket with the default
N will be replicated to three different nodes, thus storing three
replicas of the object.
In order for this to be effective, you need at least three nodes in your cluster. The merits of this system, however, can be demonstrated using your local environment.
Let’s create a bucket type that sets the n_val
for any bucket with
that type to 2. To do so, you must create and activate a bucket type
that sets this property:
riak-admin bucket-type create n_val_equals_2 '{"props":{"n_val":2}}'
riak-admin bucket-type activate n_val_equals_2
Now, all buckets that bear the type n_val_equals_2
will have n_val
set to 2. Here’s an example write:
curl -XPUT http://localhost:8098/types/n_val_equals_2/buckets/test_bucket/keys/test_key \
-H "Content-Type: text/plain" \
-d "the n_val on this write is 2"
Now, whenever we write to a bucket of this type, Riak will write a replica of the object to two different nodes.
n_val
must be greater than 0 and less than or equal to the number of actual
nodes in your cluster to get all the benefits of replication. We advise
against modifying the n_val
of a bucket after its initial creation as this
may result in failed reads because the new value may not be replicated to all
the appropriate partitions.
R Value and Read Failure Tolerance
Read requests to Riak are sent to all N nodes that are known to be
currently responsible for the data. The R value (r
) enables you to
specify how many of those nodes have to return a result on a given read
for the read to be considered successful. This allows Riak to provide
read availability even when nodes are down or laggy.
You can set R anywhere from 1 to N; lower values mean faster response time but a higher likelihood of Riak not finding the object you’re looking for, while higher values mean that Riak is more likely to find the object but takes longer to look.
As an example, let’s create and activate a bucket type with r
set to
1
. All reads performed on data in buckets with this type require a
result from only one node.
riak-admin bucket-type create r_equals_1 '{"props":{"r":1}}'
riak-admin bucket-type activate r_equals_1
Here’s an example read request using the r_equals_1
bucket type:
bucket = client.bucket_type('r_equals_1').bucket('animal_facts')
obj = bucket.get('chimpanzee')
Location chimpanzeeFact =
new Location(new Namespace("r_equals_1", "animal_facts"), "chimpanzee");
FetchValue fetch = new FetchValue.Builder(chimpanzeeFact).build();
FetchValue.Response response = client.execute(fetch);
RiakObject obj = response.getValue(RiakObject.class);
System.out.println(obj.getValue().toString());
$response = (new \Basho\Riak\Command\Builder\FetchObject($riak))
->buildLocation('chimpanzee', 'animal_facts', 'r_equals_1')
->build()
->execute();
echo $response->getObject()->getData();
bucket = client.bucket_type('r_equals_1').bucket('animal_facts')
bucket.get('chimpanzee')
{ok, Obj} = riakc_pb_socket:get(Pid,
{<<"r_equals_1">>, <<"animal_facts">>},
<<"chimpanzee">>).
curl http://localhost:8098/types/r_equals_1/buckets/animal_facts/keys/chimpanzee
As explained above, reads to buckets with the r_equals_1
type will
typically be completed more quickly, but if the first node to respond
to a read request has yet to receive a replica of the object, Riak will
return a not found
response (which may happen even if the object lives
on one or more other nodes). Setting r
to a higher value will mitigate
this risk.
W Value and Write Fault Tolerance
As with read requests, writes to Riak are sent to all N nodes that are
know to be currently responsible for the data. The W value (w
) enables
you to specify how many nodes must complete a write to be considered
successful—a direct analogy to R. This allows Riak to provide write
availability even when nodes are down or laggy.
As with R, you can set W to any value between 1 and N. The same performance vs. fault tolerance trade-offs that apply to R apply to W.
As an example, let’s create and activate a bucket type with w
set to
3
:
riak-admin bucket-type create w_equals_3 '{"props":{"w":3}}'
riak-admin activate w_equals_3
Now, we can attempt a write to a bucket bearing the type w_equals_3
:
bucket = client.bucket_type('w_equals_3').bucket('animal_facts')
obj = Riak::RObject.new(bucket, 'giraffe')
obj.raw_data = 'The species name of the giraffe is Giraffa camelopardalis'
obj.content_type = 'text/plain'
obj.store
Location storyKey =
new Location(new Namespace("w_equals_3", "animal_facts"), "giraffe");
RiakObject obj = new RiakObject()
.setContentType("text/plain")
.setValue(BinaryValue.create("The species name of the giraffe is Giraffa camelopardalis"));
StoreValue store = new StoreValue.Builder(obj)
.withLocation("giraffe")
.build();
client.execute(store);
(new \Basho\Riak\Command\Builder\StoreObject($riak))
->buildLocation('giraffe', 'animal_facts', 'w_equals_3')
->build()
->execute();
bucket = client.bucket_type('w_equals_3').bucket('animal_facts')
obj = RiakObject(client, bucket, 'giraffe')
obj.content_type = 'text/plain'
obj.data = 'The species name of the giraffe is Giraffa camelopardalis'
obj.store()
Obj = riakc_object:new({<<"w_equals_3">>, <<"animal_facts">>},
<<"giraffe">>,
<<"The species name of the giraffe is Giraffa camelopardalis">>,
<<"text/plain">>),
riakc_pb_socket:put(Pid, Obj).
curl -XPUT \
-H "Content-type: text/plain" \
-d "The species name of the giraffe is Giraffa camelopardalis" \
http://localhost:8098/types/w_equals_3/buckets/animal_facts/keys/giraffe
Writing our story.txt
will return a success response from Riak only if
3 nodes respond that the write was successful. Setting w
to 1, for
example, would mean that Riak would return a response more quickly, but
with a higher risk that the write will fail because the first node it
seeks to write the object to is unavailable.
Primary Reads and Writes with PR and PW
In Riak’s replication model, there are N vnodes, called primary vnodes, that hold primary responsibility for any given key. Riak will attempt reads and writes to primary vnodes first, but in case of failure, those operations will go to failover nodes in order to comply with the R and W values that you have set. This failover option is called sloppy quorum.
In addition to R and W, you can also set integer values for the primary read (PR) and primary write (PW) parameters that specify how many primary nodes must respond to a request in order to report success to the client. The default for both values is zero.
Setting PR and/or PW to non-zero values produces a mode of operation called strict quorum. This mode has the advantage that the client is more likely to receive the most up-to-date values, but at the cost of a higher probability that reads or writes will fail because primary vnodes are unavailable.
If PW is set to a non-zero value, there is a higher risk (usually very small) that failure will be reported to the client upon write. But this does not necessarily mean that the write has failed completely. If there are reachable primary vnodes, those vnodes will still write the new data to Riak. When the failed vnode returns to service, it will receive the new copy of the data via either read repair or active anti-entropy.
Durable Writes with DW
The W and PW parameters specify how many vnodes must respond to a
write in order for it to be deemed successful. What they do not specify
is whether data has actually been written to disk in the storage backend.
The DW parameters enables you to specify a number of vnodes between 1
and N that must write the data to disk before the request is deemed
successful. The default value is quorum
(more on symbolic names below).
How quickly and robustly data is written to disk depends on the configuration of your backend or backends. For more details, see the documentation on Bitcask, LevelDB, and multiple backends.
Delete Quorum with RW
It is no longer necessary to specify an RW value when making delete requests.
We explain its meaning here, however, because RW still shows up as a property
of Riak buckets (as rw
) for the sake of backwards compatibility. Feel free
to skip this explanation unless you are curious about the meaning of RW.
Deleting an object requires successfully reading an object and then writing a tombstone to the object’s key that specifies that an object once resided there. In the course of their operation, all deletes must comply with any R, W, PR, and PW values that apply along the way.
If R and W are undefined, however, the RW (rw
) value will substitute
for both R and W during object deletes. In recent versions of Riak, it
is nearly impossible to make reads or writes that do not somehow specify
oth R and W, and so you will never need to worry about RW.
The Implications of notfound_ok
The notfound_ok
parameter is a bucket property that determines how
Riak responds if a read fails on a node. If notfound_ok
is set to
true
(the default value) and the first vnode to respond doesn’t have a
copy of the object, Riak will assume that the missing value is
authoritative and immediately return a not found
result to the client.
This will generally lead to faster response times.
On the other hand, setting notfound_ok
to false
means that the
responding vnode will wait for something other than a not found
error
before reporting a value to the client. If an object doesn’t exist under
a key, the coordinating vnode will wait for N vnodes to respond with
not found
before it reports not found
to the client. This setting
makes Riak search more thoroughly for objects but at the cost of slower
response times, a problem can be mitigated by setting basic_quorum
to
true
, which is discussed in the next section.
Early Failure Return with basic_quorum
Setting notfound_ok
to false
on a request (or as a bucket property)
is likely to introduce additional latency. If you read a non-existent
key, Riak will check all 3 responsible vnodes for the value before
returning not found
instead of checking just one.
This latency problem can be mitigated by setting basic_quorum
to
true
, which will instruct Riak to query a quorum of nodes instead of N
nodes. A quorum of nodes is calculated as floor(N/2) + 1, meaning that 5
nodes will produce a quorum of 3, 6 nodes a quorum of 4, 7 nodes a
quorum of 4, 8 nodes a quorum of 5, etc.
The default for basic_quorum
is false
, so you will need to
explicitly set it to true
on reads or in a bucket’s properties. While
the scope of this setting is fairly narrow, it can reduce latency in
read-heavy use cases.
Symbolic Consistency Names
Riak provides a number of “symbolic” consistency options for R, W, PR, RW, and DW that are often easier to use and understand than specifying integer values. The following symbolic names are available:
all
— All replicas must reply. This is the same as setting R, W, PR, RW, or DW equal to N.one
— This is the same as setting 1 as the value for R, W, PR, RW, or DW.quorum
— A majority of the replicas must respond, that is, half plus one. For the default N value of 3, this calculates to 2, an N value of 5 calculates to 3, and so on.default
— Uses whatever the per-bucket consistency property is for R, W, PR, RW, or DW, which may be any of the above symbolic values or an integer.
Not submitting a value for R, W, PR, RW, or DW is the same as using
default
.
Client-level Replication Settings
Adjusting replication properties at the bucket level by using bucket types is how you set default properties for all of a bucket’s reads and writes. But you can also set replication properties for specific reads and writes without setting those properties at the bucket level, instead specifying them on a per-operation basis.
Let’s say that you want to set r
to 2 and notfound_ok
to true
for
just one read. We’ll fetch John Stockton’s
statistics from the nba_stats
bucket.
bucket = client.bucket('nba_stats')
obj = bucket.get('john_stockton', r: 2, notfound_ok: true)
Location johnStocktonStats =
new Namespace(new Namespace("nba_stats"), "john_stockton");
FetchValue fetch = new FetchValue.Builder(johnStocktonStats)
.withOption(FetchOption.R, new Quorum(2))
.withOption(FetchOption.NOTFOUND_OK, true)
.build();
client.execute(fetch);
(new \Basho\Riak\Command\Builder\FetchObject($riak))
->buildLocation('john_stockton', 'nba_stats')
->withParameter('r', 2)
->withParameter('notfound_ok', true)
->build()
->execute();
bucket = client.bucket('nba_stats')
obj = bucket.get('john_stockton', r=2, notfound_ok=True)
{ok, Obj} = riakc_pb_socket:get(Pid,
<<"nba_stats">>,
<<"john_stockton">>,
[{r, 2}, {notfound_ok, true}]).
curl http://localhost:8098/buckets/nba_stats/keys/john_stockton?r=2¬found_ok=true
Now, let’s say that you want to attempt a write with w
set to 3 and
dw
set to 2. As in the previous example, we’ll be using the default
bucket type, which enables us to not specify a bucket type upon write.
Here’s what that would look like:
bucket = client.bucket('nba_stats')
obj = Riak::RObject.new(bucket, 'michael_jordan')
obj.content_type = 'application/json'
obj.data = '{"stats":{ ... large stats object ... }}'
obj.store(w: 3, dw: 2)
Location michaelJordanKey =
new Location(new Namespace("nba_stats"), "michael_jordan");
RiakObject obj = new RiakObject()
.setContentType("application/json")
.setValue(BinaryValue.create("{'stats':{ ... large stats object ... }}"));
StoreValue store = new StoreValue.Builder(obj)
.withLocation(michaelJordanKey)
.withOption(StoreOption.W, new Quorum(3))
.withOption(StoreOption.DW, new Quorum(2))
.build();
client.execute(store);
(new \Basho\Riak\Command\Builder\StoreObject($riak))
->buildJsonObject('{'stats':{ ... large stats object ... }}')
->buildLocation('john_stockton', 'nba_stats')
->withParameter('w', 3)
->withParameter('dw', 2)
->build()
->execute();
Obj = riakc_obj:new(<<"nba_stats">>,
<<"michael_jordan">>,
<<"{'stats':{ ... large stats object ... }}">>,
<<"application/json">>),
riakc_pb_socket:put(Pid, Obj).
curl -XPUT \
-H "Content-Type: application/json" \
-d '{"stats":{ ... large stats object ... }}' \
http://localhost:8098/buckets/nba_stats/keys/michael_jordan?w=3&dw=2
All of Basho’s official Riak clients enable you to set replication properties this way. For more detailed information, refer to the tutorial on basic key/value operations in Riak KV or to client-specific documentation:
Illustrative Scenarios
In case the above explanations were a bit too abstract for your tastes, the following table lays out a number of possible scenarios for reads and writes in Riak and how Riak is likely to respond. Some of these scenarios involve issues surrounding conflict resolution, vector clocks, and siblings, so we recommend reading the Vector Clocks documentation for more information.
Read Scenarios
These scenarios assume that a read request is sent to all 3 primary vnodes responsible for an object.
Scenario | What happens in Riak |
---|---|
All 3 vnodes agree on the value | Once the first 2 vnodes return the value, that value is returned to the client |
2 of 3 vnodes agree on the value, and those 2 are the first to reach the coordinating node | The value is returned to the client. Read repair will deal with the conflict per the later scenarios, which means that a future read may return a different value or siblings |
2 conflicting values reach the coordinating node and vector clocks allow for resolution | The vector clocks are used to resolve the conflict and return a single value, which is propagated via read repair to the relevant vnodes |
2 conflicting values reach the coordinating node, vector clocks indicate a fork in the object history, and allow_mult is set to false |
The object with the most recent timestamp is returned and propagated via read repair to the relevant vnodes |
2 siblings or conflicting values reach the coordinating node, vector clocks indicate a fork in the object history, and allow_mult is set to true |
All keys are returned as siblings, optionally with associated values (depending on how the request is made) |
Write Scenarios
These scenarios assume that a write request is sent to all 3 primary vnodes responsible for an object.
Scenario | What happens in Riak |
---|---|
A vector clock is included with the write request, and is newer than the vclock attached to the existing object | The new value is written and success is indicated as soon as 2 vnodes acknowledge the write |
A vector clock is included with the write request but conflicts with the vclock attached to the existing object, with allow_mult set to true |
The new value is created as a sibling for future reads |
A vector clock is included with the write request but conflicts with (or is older than) the vclock attached to the existing object, with allow_mult set to false |
Riak will decide which object “wins” on the basis of timestamps; no sibling will be created |
A vector clock is not included with the write request and an object already exists, with allow_mult set to true |
The new value is created as a sibling for future reads |
A vector clock is not included with the write request and an object already exists, with allow_mult set to false |
The new value overwrites the existing value |
Screencast
Here is a brief screencast that shows just how the N, R, and W values function in our running 3-node Riak cluster:
Tuning CAP Controls in Riak from Basho Technologies on Vimeo.