Started a comment here but it got too long: https://aphyr.com/posts/325-comments-on-you-do-it-too
There are some points where I just don’t understand you (GC are not partitions in the CAP proof, and it’s not new: see cap faq point 16 http://henryr.github.io/cap-faq/). You can push it into it (for example by putting some deadline constraints in the definition of availability), but it’s not really CAP anymore.
It’s easier to choose an algorithm which is safe in the general case of asynchronous networks,
Yeah this one is interesting. I agree on the algo choice (there is no other choice anyway), but many errors (typically not flushing the writes to the disk) are visible only with kill9 or node failures. You have to test independently. Partition tolerance does not mean fault tolerance.
Partitions in the G&L proof are any pattern of message loss. Any garbage collection period which exceeds the duration an application will wait for a message before considering it lost (e.g. tcp socket timeouts, application-level timeouts) will get you the same behavior.
Yep, that’s what I call pushing it into it (application cannot wait forever => there are deadlines constraints). CAP applies in this case (i.e. you really have to choose between consistency & availability).
GC are still a little bit simpler than a true network partition, because the process stops working for everybody. Agreed, there will be some nasty race conditions at the end. But you don’t loose a full rack when there is a GC. It’s a much nicer kind of fault, one node at a time. In a big data system, if you lose a rack you have to replicate the data again for safety. With a GC you do not need that: the node will come back to life before you start to replicate the 2Tb of data it was storing (not to mention the case w/ a rack!).
I do agree with you on the asynchronous part: the algorithm you need to choose when the network is asynchronous will help a lot with partitions and with GC. But you need to test both.
It’s a much nicer kind of fault, one node at a time
GC is a notorious cause of distributed systems data-loss, typically where it allows two nodes to believe they’re the primary at the same time. Take a look at the Elasticsearch mailing lists sometime for all kinds of horror stories around high load, memory pressure, or GC causing inconsistency.
I’m not sure if the size of the fault is necessarily relevant for this discussion, either.
Agreed again, but node failures and network partitions will add a few other horror stories.
I mean, I would expect a software vendor to say
We have tested dirty node crashes, no data loss
We have tested GC. No dataloss, no performance issue if the GC is contained within 10s.
We have not tested network partitions. Per our design it should be fine (we’re aiming at AP: availability during partition), but it’s still an edge case.
Rather than: “we’re partition tolerant of course.”
And for a system like ES (for example), the design for availability under network partition could be something with partial results and so on (harvest & yield). Not that easy to do (I don’t know ES).
Absolutely agreed. The reason I mention GC in this response is because you’ve argued that LANs won’t partition. Even if LAN fabric were totally reliable, I’m trying to remind people that partition tolerance is about message delivery, not just what we traditionally consider “the network”.
And for a system like ES (for example), the design for availability under network partition could be something with partial results and so on (harvest & yield).
Gilbert & Lynch, footnote 4: “Brewer originally only required almost all requests to receive a response. As allowing
probabilistic availability does not change the result when arbitrary failures occur, for simplicity we are requiring 100% availability.”
Absolutely agreed. The reason I mention GC in this response is because you’ve argued that LANs won’t partition.
I doubt I said something like this :-)
But yeah, for sure the whole post is only about network partitions. I will update the post to make this clear.
“CA exists, and is described as acceptable for systems running on LAN”
“Stonebraker was considering small systems of high range servers on a LAN”
“it is also fine to assume CA on a LAN”
None of those are mine (lynch/stonebraker/brewer)
Both Stonebraker and Brewer consider (quoting Brewer but Stonebraker said exactly the same thing) “CA should mean that the probability of a partition is far less than that of other systemic failures”, so even if they think that CA is acceptable in some specific cases on a LAN that does not mean they think that LANs won’t partition.
GC are still a little bit simpler than a true network partition… It’s a much nicer kind of fault, one node at a time
This is usually the case. However, I’ve also seen the back pressure resulting from a GC cause other nodes to become loaded. The other nodes then started a GC. Now there was a feedback loop and the entire system ended up falling over.
The system could have been configured better… but that’s kind of the same point about experiencing partitions on a LAN. It’s not cost-effective and you’re still going to miss something.
This article is making me really nervous but I know I don’t have the distributed chops to prove it wrong.
I’ll say this: when an author starts talking about probabilistic interpretations of the theorem and going on about cost-benefit analysis (seriously, why are we worked up about poor “administration interfaces” here?!) my BS needle starts twitching. And when they do that when an impossibility proof exists that shows element availability and atomic consistency are not both possible, it starts swinging around madly.
The article reads like an awful lot of language lawyering around fairly well understood concepts, but I’m not sure what the motivations of the author are.
Heh… Sigh. It reads like an attempt to illuminate, but a bad one. That seems worthwhile if it were shorter and clearer; I don’t think the concepts are actually all that well understood, unfortunately. At a previous job, after two months of arguing that Riak was the wrong choice for the company, I finally got through:
Me: “What exactly is the benefit you’re hoping for from using a distributed technology? Uptime? Preventing data loss?”
Them: “Yes, both of those.”
Me: “Those are mutually-exclusive in our situation.”
Them: “Oh… Maybe something else would be okay.”
(And no, they aren’t inherently mutually exclusive, but the data was peculiar and merging later, after resolving a partition, wasn’t an option. I can’t go into it.)
I definitely don’t want that to be read as an insult to the intelligence of the person involved; they were quite competent. It’s just that databases are a subject not all engineers actually know very much about, and distributed ones are a rather new technology in the scheme of things.
It’s worth noting that not all distributed systems are databases, too, of course!
That’s not what the impossibility proof says–he references that paper.
“In 2002, Seth Gilbert and Nancy Lynch publish the CAP proof. CA exists, and is described as acceptable for systems running on LAN.”
“If there are no partitions, it is clearly possible to provide atomic, available
data. In fact, the centralized algorithm described in Section 3.2.1 meets
these requirements. Systems that run on intranets and LANs are an example
of these types of algorithms” 
I don’t think CAP is very well understood. I think folks end up very confused about what consistent means, and what partition-tolerant means.
I think this is pretty well researched. I’m not sure why cost-benefit analysis makes you nervous.
James Hamilton of AWS says it best, I think:
Mike also notes that network partitions are fairly rare. I could quibble a bit on this one. Network partitions should be rare but net gear continues to cause more issues than it should. Networking configuration errors, black holes, dropped packets, and brownouts, remain a popular discussion point in post mortems industry-wide.
Gilbert & Lynch’s implicit assertion is that LANs are reliable and partition free; I can buy this in theory but does this happen in practice? When Microsoft performed a large analysis of failures in their data centers, they found frequent loss occurring that was only partially mitigated by network redundancy.
But either way you make a fair point: CA models aren’t strictly precluded by that proof. I’m just not certain I’ve seen a network that is trustworthy enough to preclude partitions.
Network partitions are not even remotely rare, honestly. LANs are actually worse culprits than the Internet, but both do happen.
You already cited one of the better sources for it, but mostly I believe this because I’ve been told it by network engineers who I respect a lot.
Even if network partitions were rare, I’ll tell you what aren’t (for most people): garbage collections. What I did not like about this post is it, over and over again, just talks about network partitions and the actual networking hardware. But weird application-specific things happen as well that appear to be unresponsive for longer than some timeout value and these are part of the ‘P’ as well.
In reality, I think CAP is too cute to go away but not actually adequate in talking about these things in detail. PACELC makes the trade-offs much clearer.
LANs are actually worse culprits than the Internet
Funny you mention that: over the past few days I’ve been fighting an issue with our internal network that has resulted in massive packet loss internally (>50% loss in some spikes), and ~0.5% to the Internet. That’s probably why this article raised my eyebrows - it’s my personal bugbear for the week.
The culprit seems to have been a software update to a Palo Alto device that stopped playing nice with certain Cisco switches… plug the two of them together and mumble mumble spanning tree mumble loops mumble. The network guys start talking and my eyes glaze over. But all I know is that I’ve learned the hard way to not trust the network - and when a proof exists that the network must be reliable in order to have CA systems, well…
Heh - my sympathies.
I think some of the confusion comes from describing all node failures as network partitions. In reality “true” network partitions are rare enough (lasting in durations long enough to matter to humans), but nodes failing due to hardware failure, operational mistakes, non-uniform utilization across the system, and faulty software deploys are sometimes overlooked in this context.
i like the comment above “It’s worth noting that not all distributed systems are databases, too, of course!”, but i think this is also a matter of perspective. most useful systems contain state, isn’t twitter.com as a network service a distributed database? kind of neat to think about
It’s not clear to me that the distinction the author makes between a CA and a CP system exists. He uses ZooKeeper as an example of a CP system, but the minority side of networking partition in ZooKeeper cannot make progress, just like his CA example. In reality, CP seems to be a matter of degree not boolean, to me. Why does a CP system that handles 0 failures have to be different than one that handles 2f-1?
When the system availability is zero (not available at all) after a partition, you can claim both CP and CA (that’s the overlap between CP/CA).
There are two corner cases when the system is not available at all:
the system does not even restart after the partition. You can claim CP theoretically. The proof’s definitions don’t prevent this formally. But it makes little sense in practice.
the system restarts after the partition and remains consistent. Both CP and CA are ok.
But ZooKeeper is not concerned by these corner cases, because it is partly available during the partition.
No, you can’t: a system which is not available during a partition does not satisfy A, and cannot be called CA. If you could claim both CA and CP you would have disproved CAP.
CA means: I have a magical network without partition. If my network is not that magical at the end, I will be CP/AP and more likely in a very bad state, not fully available and not fully consistent.
I’m responding to “When the system availability is zero (not available at all) after a partition, you can claim both CP and CA”. Please re-read Gilbert & Lynch’s definition of A: you cannot claim CA if you refuse to satisfy requests during a partition.
But those magic networks do not exist, so how can a CA system exist?
:-) It exists until there is a partition. Then the most probable exit is to restore manually the system state. 2PC with heuristic resolution being an example.
Or, if you build a system for machine learning: 20 nodes with GPU, 2 days of calculation per run. If there is a network partition during these two days you throw away the work in progress, fix the partition and start the calculation process again. I don’t see myself waiting for the implementation/testing of partition tolerance for such a system. I will put it in production even if I know that a network partition will break it apart.
That system is still CP. You are tolerating the notion of partitions, and in the case of a partition you sacrifice A (fail to fulfill a request–a job in this case) and restart the entire system for the sake of C.
It exists until there is a partition.
If a system reacts to a partition by sacrificing availability - as it must, and you haven’t demonstrated differently - how can you claim it is CA?
If there is a network partition during these two days you throw away the work in progress, fix the partition and start the calculation process again. I don’t see myself waiting for the implementation/testing of partition tolerance for such a system. I will put it in production even if I know that a network partition will break it apart.
I feel like I’m in bizarro world.
If the system sacrifices consistency (it could also be consistency, or both), then there is an overlap between CA and CP. That’s what Daniel Abadi said 5 years ago: “What does “not tolerant” mean? In practice, it means that they lose availability if there is a partition. Hence CP and CA are essentially identical.”
The key point is that forfeiting partitions does not mean they won’t happen. To quote Brewer (in 2012) “CA should mean that the probability of a partition is far less than that of other systemic failures”
That’s why there is an overlap. I can choose CA the probability of a partition is far less than that of other systemic failures, but I could have a partition. And if I have a partition I will be either non consistent, either non available, either both, and I may also have broken some of my system invariants.
I’m sure it does not help you as I’m just repeating my post, and this part is only a repetition of something that was said previously by others :-(
Trying differently, maybe the issue to understand this is that you have:
CAP as a theorem: you have to choose between consistency and availability during a partition. There are 3 options here:
full consistency (the CP category)
full availability (the AP category)
not consistent but only partial availability (not one of the CAP categories, but possible in practice, typically 2PC with heuristic resolutions: all cross-partition operations will fail).
CAP as a classification tool with 3 options: AP/CP/CA. There are a description of the system. CA means you forfeited partition tolerance, i.e. it’s a major issue for the system you build.
And, in case there is any doubt: most systems should not forfeit partitions. I always mention 2PC/heuristic because is a production proven exception.
Could you rephrase your statement? I am having trouble parsing what you have said.
the cr went away. let me edit.
If we take your second case - as it’s the only real case worth discussing, as you note :-) - how can you claim the system is available?
The system is CA under a clean network until time n when the network partitions. The partition clears up after m ticks. So from [1, n) and (m, inf) the system is CA, but from [n, m] it is unavailable. Can we really say the system maintains availability? That feels odd to me.
Maybe it makes more sense to discuss this in terms of PACELC - a system in your second case has PC behavior; in the presence of a partition it’s better to die hard than give a potentially inconsistent answer.
Having said all of this, my distributed systems skills are far below those of the commentators here, so please point out any obvious missteps.
CA is forfeiting partition tolerance (that’s how it was described by Eric Brewer in 2000). So if a partition occurs it’s out of the operating range, you can forfeit consistency and/or availability. It’s an easy way out of the partition tolerance debate ;-). But an honest one: it clearly says that the network is critical for the system.
Maybe it makes more sense to discuss this in terms of PACELC - a system in your second case has PC behavior;
Yep it works, Daniel Abadi solved the overlap by merging CA and CP (“What does “not tolerant” mean? In practice, it means that they lose availability if there is a partition. Hence CP and CA are essentially identical.”) It’s not totally true (a CA system can lose its consistency if there is a partition, like 2PC w/ heuristic resolutions), but it’s a totally valid choice. If you do the same choice as Daniel in CAP you choose CP for the system 2 above. CA says “take care of your network and read the documentation before it is too late”.
seriously, why are we worked up about poor “administration interfaces” here
:-) Because I’ve seen a lot of system where the downtime/data corruptions were caused mainly by: 1) software bugs 2) human errors.
I also think that a lot of people take partition tolerance for granted (i.e. “this system is widely deployed in production, so it is partition tolerant as I’m sure everybody has network issues all the time, so I can deploy it safely myself w/o thinking to much about the network”). Many systems are not partition tolerant (whatever they say). That’s why Aphyr’s test crash them (dataloss, lost counters,…), even if they are deployed in production.
It does not mean they have no value. It’s a matter of priority. See Aphyr’s post on ES, imho they should plan partition tolerance and implement immediately crash tolerance for example, instead of trying to do both at the same time.
I prefer a true “secure your network” rather than a false “of course we’re partition tolerant, CAP says anything else is impossible” statement (with extra points for “we’re not consistent so we’re available”).
CAP tells you that you can’t have both C and A when a partition happens. Most people take that to mean you must choose one or the other and have a CP or AP system. But it’s worth remembering that you do have the option of making sure that partitions never happen - either by making the system non-distributed or by making the communications reliable enough. And for some use cases that might be the correct approach.
 In a probabilistic sense - you can’t ensure that a network partition never happens, but nor can you ensure that you won’t lose all the nodes of your distributed system simultaneously. Any system will have an acceptable level of risk of total failure; it’s possible to lower the probability of a network partition to the point where “any network partition is a total system failure” is an acceptable risk.
I think it’s important to modify your statement a bit. What you have to do is ensure that in the face of a partition you remain consistent then try your darnedest to reduce the frequency of partitions. The distinction being you have control over what happens during a partition but not control over a partition happening.
you have control over what happens during a partition but not control over a partition happening.
I don’t think that this sharp distinction exists. You don’t have absolute control over what happens during a partition - to take an extreme example, the CPU you’re running on might have a microcode bug that means it executes different instructions from the one you intended. And you do have control - to the extent that you have control of anything - over the things that cause network partitions; you can construct your network (or pay people to construct your network) so as to mitigate the risks. It is absolutely possible to construct a network which won’t suffer partitions (or rather, in which partitions are less likely than simultaneous hardware failures on your nodes) if you’re willing to spend enough money to do so (this is rarely a smart choice, but it could be).
I do not think byzantine faults really matter for this discussion, they are a whole other issue to partitions. But I do not think your response invalidates my point at all. Partitions are something that happens to you, how your program handles them is something you do.