Threads for moses

    1. 1

      Hmm, I think the distinction between “introduction” and “service discovery” is a bit too fine for me. Can you elaborate a bit? Is DNS resolution a kind of “introduction”?

      1. 3

        Service discovery and DNS resolution are both kinds of introduction, yes. But not all introduction is service discovery - there are some examples of that in the article.

    2. 5

      Adopting the Contributor Covenant

      I am not too happy about this. I believe that all users should be treated equally, but something about these recent CoC changes just destroys communities:

      Off topic, I see that you are the author of the linked post. I also see that you have more posts than comments:

      Is it your intention to just use this platform to advertise for GitHub?

      1. 17

        I think codes of conduct are good. They’re important for making community norms explicit, and as tools to change community norms that are inappropriate. Both of these are important for marginalized members of the community.

        Something important to remember is the paradox of tolerance. If you’re willing to tolerate the intolerant, the intolerant will eventually turn the community into an intolerant community.

        The code of conduct that git has adopted is a pretty relaxed one. As long as you’re not a jerk you won’t even have to think about it.

        1. 10

          I’m not a CoC fan either (though primarily because it implies corporate professionalism (à la anti-hacker culture) rather than because of freedom-of-speech issues), but I still wonder: what examples are there of the “Contributor Covenant” in practice? When has it helped in ways that a simple “be kind and respectful” wouldn’t have?

          Also don’t forget that it’s this specific document that’s far more controversial than any self-written community guidelines or mailing list rules. I for example still don’t understand what’s so great about it specifically. You say it makes community norms explicit, but most of the time, it’s just a copy-pasted document from outside said community.

        2. 4

          While I think that code of conducts are probably a good thing, I don’t share your optimism about the Contributor Covenant.

          I admit I haven’t read the CC closely since a very early version[0], but what I remember of it was quite vague (perhaps this is what you mean by “relaxed”?). Moreover, its focus on “professionalism” thoroughly undermines its goals of inclusion, and (for people whom that isn’t reason enough) is counter to the hacker ethos.

          If you’re looking for a code of conduct similar to the CC, better (not perfect IMO, but much better!) examples include the Slack, Rust, and Django CoCs.

          [0]: which I really wanted to like, up until I suggested a minor clarification to the text and then was … promptly blocked by the author.

          [1]: Note: a one- or two-page CoC isn’t nearly enough space to make community norms explicit to the point where every decision to fire or censure a moderator won’t devolve into a flamewar (worse yet, it may invite exegesis). For comparison, the IWW—which Wikipedia describes as an anarchist organization—has a process document that is about 100 pages long. A CoC is not a process document.

          1. 2

            Thanks for the link regarding the use of the word “professionalism”. I’d like to note that the term is last in a list of examples of what not to do, so it’s not front and center in the main text.

            I’d also like to note that the term “hacker ethos” is similarly coded, as “hacker culture” is historically coded male and privileged. If there are attempts to reclaim the term, I’d love to have some more information to read about that.

            I do agree that the CC is maybe a bit too brief, and not enough to prevent acrimonious rules-lawyering.

            1. 4

              “hacker culture” is historically coded […] privileged

              Is it? Before these excessive amounts of money poured into tech in the late 90s, being part of hacker culture implied a certain danger in general society, socially but also physically.

              (no argument on it historically being a male domain)

              1. 2

                Re: the term “privileged” - up until the late 90s, just accessing a piece of computing machinery implied access to quite a bit of disposable income. I realize the word has more connotations than just material wealth, but it’s certainly a big part of it.

                1. 5

                  I grew up in a western “working poor” situation and got my home computer in 1990. It was horribly outdated by that point, but computing access was one of these situations where “trickle down” actually worked because those with the money had lots of motivation to go with the latest and greatest and got to get rid of their stuff from last year. Two or three iterations later it became affordable for those who didn’t really have the money but the interest in pursuing the topic.

                  Sure, somewhere in the poorer parts of Africa or Asia widespread access to computing is still not a given today (although smartphones are making inroads), but that’s generally not what the people mean when designating others as “privileged”.

                  1. 4

                    Stepping back from a particular time period, “hacker culture” has a long history, and that history is mostly students at prestigious universities and colleges (MIT, Stanford). Now, not all of these students were personally affluent, but by attending these seats of learning they were privileged.

                    In the aggregate, it was easier to get involved in computing and hacking if you were part of an affluent milieu, or, if you were not, were encouraged to study - and more importantly, had the opportunity to study instead of, for example, working for a living. Both contexts imply disposable “income” - either in money, or in the acceptance of the opportunity cost of preparing a young person for higher education.

                    1. 3

                      Both contexts imply disposable “income” - either in money, or in the acceptance of opportunity cost in preparing a young person for higher education.

                      So Europeans (no matter their intersectionality in other aspects) are off the scale thanks to access to higher education being taken care of, for the most part.

                      If we collect some more of these properties that cover entire continents, everybody is privileged, rendering the term meaningless.

                      1. 4

                        Higher education in much of Europe may be gratis in regard to attendance fees, etc., but there are not enough places for everyone who wishes to attend. So you still have a situation where the more well-to-do are able to prepare their offspring for the competition to attend through extra study time, language study abroad, etc.

                        Anyway, I don’t think it’s productive to continue this discussion. We obviously have different meanings of the word “privileged”, and that’s where the rub lies. Thanks for taking the time to respond.

            2. 2

              I’d also like to note that the term “hacker ethos” is similarly coded, as “hacker culture” is historically coded male and privileged.

              Male yes, but I think irrelevant. Privileged, no.

              It’s pretty tough to pin down a strict definition of “hacker ethos.” To me it means tinkering with computers, soaking in the culture of BBSes and meetups, and reading through books like Steven Levy’s Hackers [0] and stacks of 2600.

              But I’m fairly young and there are many different experiences.

              Historically, and currently, the vast majority of it has been written by males, but I think the hacker ethos does not exclude non-males; the difference is about curiosity and knowledge sharing rather than anything gender-specific.

              Note that I think it includes many characteristics of culture that aren’t specific to hacking.

              But I don’t think the hacker ethos is privileged at all. Its focus on low-resource environments and DIY and sharing is as close to anti-privileged as you can get.

              My own first experiences with computers were through a public school and library, and I didn’t have access to computing resources for many years. MIT hackers were great, but they aren’t everyone. Visiting hackers in many countries shows similar tinkerers and low/zero-resource learning systems. It was really neat meeting hackers in Kampala who grew up programming on Raspberry Pis, getting electricity through creative means because their home wasn’t connected to the electricity grid.

              So while there were certainly hackers with lots of resources, there were (and still are) many without privilege.


        3. 3

          Codes of conduct should only be needed if and when bad conduct repeatedly and systematically occurs. I have never come across any such bad conduct in my many years of working on free software, so I can only assume it is not widespread. It would be enlightening to see some examples of the bad conduct which led the Git project into adopting this code. If there are none, they’re on thin ice in adopting it as far as I’m concerned, especially since these codes themselves often lead to strife.

          1. 6

            Codes of conduct should only be needed if and when bad conduct repeatedly and systematically occurs

            By that time it’s too late, the damage is done.

            Much like political constitutions, it’s best to get these things in place before things get bad, and when everyone is on the same page.

          2. 10

            It’s possible that the way that you experience the world is not the way that marginalized people experience the world. And maybe we should be focusing on making it easier for them to contribute, rather than the people who are least marginalized.

            Git is part of the larger open source community, and part of the larger computer science community, both of which have had many problems over the years. You can find many examples of this if you google “open source sexual harassment”. Linus Torvalds, who started the Git project, is famous for his abusive rants. When we’ve seen so many examples of fires breaking out in other similar projects, it seems sensible to get a fire extinguisher.

            1. 13

              Linus Torvalds, who started the Git project, is famous for his abusive rants

              His rants were abusive in the sense of being directed at someone he considered “dumb” at the time, or similar. To my knowledge he wasn’t ranting at someone because of their race or gender identity. Do you have evidence to the contrary? Otherwise it comes off as you slandering him, which I won’t stand for.

            2. 6

              It’s possible that the way that you experience the world is not the way that marginalized people experience the world

              I’m not sure it’s wise to tell random folks you likely never met (or in case of standardized documents: whole communities) that they’re not marginalized as if speaking from authority.

            3. 2

              Linus Torvalds, who started the Git project, is famous for his abusive rants.

              I love Linus “rants” and they are not abusive, they just use colorful language, which is refreshing.

              You can only interpret these texts as abusive if you are less than 12 years old.

            1. 10

              (the twitter link leads to a post in which SS calls Ted Tso a ‘rape apologist’)

              Please note, I am not diminishing what rape is, and or any particular person’s experience. However, I am challenging the use of statistics that may be hyperbolic and misleading … – Ted Tso

              That Ted Tso?

              Throwing epithets does not a truth make. In this case Tso was called a ‘rape apologist’ but that does not make him one, it only means someone attached a label to him because he dared to disagree. Disagreement is not the same as bad conduct. Sometimes it can be solved by discussion, sometimes there is no solution other than acceptance of the fact that people disagree. Let it be, let them be, they have the same right to an opinion as you have.

              (to make it clear, I realise that cup posted this link as an example of how these policies lead to strife)

            2. 10

              Remember the Halloween documents [1], Microsoft’s infamous plan to derail free software in general and Linux in particular? Just imagine what they would have been able to achieve by strategically calling out developers as ‘rape apologists’ and such.

              Maybe someone did realise this after all? Identity politics is a potentially devastating way to break up communities, and many of these ‘codes of conduct’ can be traced back to this type of politics.


      2. 4

        I am not too happy about this. I believe that all users should be treated equally, but something about these recent CoC changes just destroys communities:

        I’m actually rather surprised that it all happened so quietly. I guess it was overshadowed by the RMS-situation.

      3. 8

        Me neither, identity politics - the basis of the contributor covenant - has no place in free software communities - or any other community for that matter. It only serves to create factions where none should be, it raises (often trivial) differences between individuals to their defining characteristics and uses those to split communities into groups, groups into hierarchies of oppressed and oppressors. To what purpose, other than to create strife? It is totally antithetical to the famous New Yorker cartoon of the dog-with-a-keyboard telling the world that “on the internet, nobody knows you’re a dog” [1].

        Sendmail was written and is maintained by Eric Allman, who was and is openly gay. He’s married to Marshall Kirk McKusick of BSD fame. Nobody cared. Nobody cares. That is how it should be and how it was, but that is not how it will be if identity politics really takes hold, because Allman will find himself pushed into a certain slot (white [-10] middle-aged [-6] gay [+7] man [-10]) instead of just being known as ‘the guy who developed Sendmail’. Same for McKusick, same for, well, everyone else.

        The History [2] and Talk [3] sections of the Contributor Covenant article [4] on Wikipedia are telling in this respect: criticism is silenced with claims of ‘The situation is completely solved. There is no need for this outdated section’.





      4. 3

        New contributors can be assured that the Git community is behind this adoption with the introduction of the Code of Conduct, Acked-by 16 prominent members of the Git community.

        … out of 1316 contributors currently listed on

        1. 1

          Thanks. I was trying to learn what such a vague statement meant.

      5. 3

        I also see that you have more posts than comments […] Is it your intention to just use this platform to advertise for GitHub?

        All five posts are “Git highlights” blog posts. Nothing is specific to GitHub, other than their being hosted on GitHub’s blog.

    3. 5

      I work on a library called Finagle that has spent roughly the last ten years plumbing this area, and you’re right that it’s still a pretty active area of research! I think academia used to be interested in this area ~30 or 40 years ago, but it’s no longer very fashionable.

      I think Finagle has pretty neat approaches to many of the problems you’ve laid out, so I’ll describe a few of them. In many cases we’ve come to similar conclusions as you, and in some we’ve come to very different conclusions. Feel free to reach out on the mailing list if you have more questions.

      Health Measurement: Finagle’s load balancer balances over a host abstraction, which has the concept of “status”. We keep track of whether an instance is “open”, “busy”, or “closed”. If an instance is “busy” or “closed” we assume it won’t serve any requests, and don’t route to it. At a different layer, we measure failure both from a connection point of view (if we’re unable to establish a connection, we use an exponential backoff before attempting to reconnect, and fail connections proactively in the meantime) and from a request point of view (you can specify a failure accrual policy; we have “consecutive” and “proportional” failure accrual policies). We also do connection healthchecking, but its benefits are sometimes unclear. You can also be made “busy” if your remote peer issues you a GOAWAY, or if your lease expires. We’ve also investigated whether we can have a continuous “liveness” measurement instead of the binary “healthy/unhealthy”, and although our experiments suggest it’s pretty fruitful, we haven’t done the work to switch over to it yet.
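
      Finagle itself is Scala, but as a rough Python sketch of what a “consecutive” failure accrual policy with exponential-backoff revival might look like (the names and constants here are hypothetical, not Finagle’s API):

```python
import time

class ConsecutiveFailureAccrual:
    """Mark a host dead after N consecutive failures and revive it
    after a backoff that grows exponentially with further failures."""

    def __init__(self, threshold=5, base_backoff=1.0, max_backoff=300.0):
        self.threshold = threshold
        self.base_backoff = base_backoff
        self.max_backoff = max_backoff
        self.failures = 0
        self.dead_until = 0.0

    def record_success(self):
        self.failures = 0          # any success resets the streak

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            exponent = self.failures - self.threshold
            backoff = min(self.base_backoff * 2 ** exponent, self.max_backoff)
            self.dead_until = time.monotonic() + backoff

    def is_available(self):
        return time.monotonic() >= self.dead_until
```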

      Load Balancer Metrics: We’ve done quite a lot of research here, mostly focused around least loaded and latency. We’ve also implemented p2c. We haven’t seen huge issues with the mobbing behavior that you’re describing, perhaps because we don’t see significant differences in latency, and because we use p2c so we don’t converge as quickly as you might with a more exact strategy. In general, we’ve seen very good results with a peak exponentially weighted moving average latency approach, which we call “Peak EWMA”, and you can check it out here. We collaborated with linkerd, a company that was originally built around “Finagle as a Service”, to compare behavior of a few of the different load balancers.
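
      The core of p2c is tiny; here’s a hypothetical Python sketch (Finagle’s real implementation is Scala and considerably more involved):

```python
import random

def pick_p2c(hosts, load):
    """Power of two choices: sample two distinct hosts uniformly at
    random and route to the less loaded of the pair. This is O(1) per
    pick and converges more gently than always picking the global
    minimum, which helps avoid mobbing a single fast host."""
    a, b = random.sample(hosts, 2)
    return a if load[a] <= load[b] else b
```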

      Recently we’ve been focusing on a strategy we’re calling aperture where we talk to just a subset of remote peers, instead of the whole cluster. This has significant advantages when the cluster you’re talking to is enormous, but it means that you now have two load balancing problems, one for picking which hosts to consider, and then a second one of picking which host within that subset to pick for a given request.

      We’ve considered having servers advertise a load metric to the client, but haven’t invested heavily in it, because what we have works pretty well so far.

      Load Shedding: We were able to opt every client at Twitter into load shedding by ensuring that we shed load gently, and that we only do it when servers are rejecting traffic at high volumes.

      Some prior art that might be interesting to you, at Spotify and Netflix. If you want to learn more about Finagle clients, take a look here.

      I thought many of your conclusions were sound–but one sticks out to me as worrisome, which is using weighted random selection. My concern is that if all of your weights are the same, it devolves to uniform random selection, which may converge to an even distribution on a long enough time horizon, but each host’s share is actually binomially distributed, and the unevenness will be obvious on a short time horizon, since you’ll always have uneven instantaneous load. I instead encourage you to use p2c, and use the weights that you would have assigned in weighted random selection as the load metric, which I believe achieves the same thing but converges much faster, so that instantaneous load won’t be wrong.

      1. 3

        Oh wow, this is great! Finagle (and the other resources you linked) somehow never came up in my searches. I’ll have to read up. :-)

        p2c is two-choice, right? It’s a pet peeve of mine when people call it the “power of two” method (nothing is being raised to a power!) and “p2c” seems like a reasonable name—I’ll have to start using that, especially if it’s more common. Interesting to hear that you haven’t seen much in the way of mobbing. I’d expect that with p2c (as you said, convergence is slower), but I’m curious if you’ve seen it with other methods. I wasn’t sure how realistic a concern it was.

        I share your concern about weighted random selection. In my test runs I wasn’t able to get good information about high-frequency variation in weights, but we’re doing a dark launch of an algorithm using weighted random selection and will be collecting metrics on the instantaneous ratio of the highest and lowest weight under actual production circumstances. The weights are multi-factorial, being the product of 4 health factors, each derived from one of latency (inverse of a decaying exponential average), success (decaying exponential average), concurrency (inverse), and connection age (ramp up from epsilon to 1 over the first minute). The success factor is raised to the power of 4 or 5 to give it more effect on the weighting. (I’m sure this can all be heavily optimized to avoid the floating point math, but that’s not our bottleneck.) In the trial runs, WRS did fantastically well, but I do have some concern about high frequency variation in the weighting. I’m OK with flapping if it’s solely due to the very coarse-grained concurrency factor, since that has near instant feedback, but I’m less sure how it will interact with the others. I won’t be able to gather high granularity data on weights, but if the max weight disparity metric isn’t too large then I don’t think it’s cause for concern, at least for our use-case…
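
        In case it helps to see the factors side by side, here’s a hypothetical Python sketch of the weighting described above (the function name and constants are my own invention, not the author’s actual code):

```python
def health_weight(avg_latency, success_rate, concurrency, age_seconds,
                  success_exponent=4, ramp_seconds=60.0, epsilon=1e-3):
    """Product of the four health factors: inverse latency, stretched
    success rate, inverse concurrency, and a connection-age ramp."""
    latency_factor = 1.0 / max(avg_latency, epsilon)        # inverse of the latency average
    success_factor = success_rate ** success_exponent       # amplify small differences
    concurrency_factor = 1.0 / (1.0 + concurrency)          # inverse of in-flight requests
    age_factor = min(max(age_seconds / ramp_seconds, epsilon), 1.0)  # warm-up ramp
    return latency_factor * success_factor * concurrency_factor * age_factor
```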

        The other thing I worry about with WRS is that when I derive the weight factors from the raw metrics, I “stretch” the numbers. For example, I want there to be a large difference between the weights of servers with 99% and 95% success rates, much larger than a 10% difference. Most of the requests should go to the 99% server. But if all the servers are suffering a bit (80%, 85%, 90%), I don’t necessarily want all the weight to be on the one that’s only marginally better (90%) even though that’s exactly what I want to have happen in a 99%, 99%, 95% scenario. I experimented a bit with taking the log of the failure rate and grouping hosts into “nines” buckets, but didn’t get anything satisfying. WRS is kind of a blunt instrument in this regard, and I feel like I could do a lot better with a more direct anomaly detection algorithm, perhaps one that explicitly checks for outliers.
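
        For what it’s worth, the log-of-the-failure-rate idea mentioned above (which the author found unsatisfying) would look roughly like this hypothetical sketch, mapping success rates to their “number of nines”:

```python
import math

def nines(success_rate, cap=5.0):
    """Map a success rate to its number of nines: 0.9 -> 1, 0.99 -> 2.
    This stretches 99% vs. 95% much further apart than the raw rates,
    while keeping hosts at 80/85/90% relatively close together."""
    failure = max(1.0 - success_rate, 10.0 ** -cap)  # cap to avoid log(0)
    return -math.log10(failure)
```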

        I didn’t follow the bit about binomial distributions (I’ve never been great with stats), but what I find confusing is that you seem to imply that p2c with the same weights would produce a less uneven distribution than WRS. That seems like it might be correct in the short term, but I have concerns about long-term unevenness (which I mentioned in the post). Is that something I’ve gotten wrong?

        1. 2

          Interesting to hear that you haven’t seen much in the way of mobbing. I’d expect that with p2c (as you said, convergence is slower), but I’m curious if you’ve seen it with other methods. I wasn’t sure how realistic a concern it was.

          I think it might have to do with your throughput and latency? I’ve heard concerns about mobbing from people working on realtime systems too, but distributed systems typically have latencies on the order of hundreds of microseconds to milliseconds–that might be saving us from the worst kinds of bursts. We haven’t really done the research to figure out why we aren’t affected though. Marc Brooker from AWS ELB has a good post about how p2c does well in those circumstances.

          I didn’t follow the bit about binomial distributions (I’ve never been great with stats) but what I find confusing is that you seem to imply that p2c with the same weights would produce a less uneven distribution that WRS. That seems like it might be correct in the short term, but I have concern about long term unevenness (which I mentioned in the post.) Is that something I’ve gotten wrong?

          It’s probably good to find someone at your company who has a background in stats to check your work; we’ve found that to be invaluable when reasoning about different load-balancing schemes. I’m not great at stats either, but I’ll try to explain =). We can model the number of requests that a given backend receives under a true random load balancer (I’m assuming all weights are the same, to make the stats easier and because that will probably often be the case) as a binomial distribution: you have an experiment (a remote peer picking a backend) that happens n times, and the experiment “succeeds” (picks this specific backend) with probability p. It will probably look sort of like a normal distribution. You can imagine the PDF as a dartboard, where each host is like a dart that could hit anywhere under the curve, and after the dart hits, you check the number of “successes” and that’s your server’s concurrency. So really what you want is a super tight distribution around a single number–that would produce a dartboard where every host is receiving the same amount of traffic.
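
          To make the spread concrete, here’s a small hypothetical simulation (mine, not from the thread) of assigning requests to hosts uniformly at random, where each host’s count is Binomial(n, 1/k):

```python
import random

def instantaneous_load(n_requests, n_hosts, seed=0):
    """Assign each request to a uniformly random host and return the
    min and max per-host counts. Each count is Binomial(n, 1/k), so
    even with equal weights the instantaneous load is uneven."""
    rng = random.Random(seed)
    counts = [0] * n_hosts
    for _ in range(n_requests):
        counts[rng.randrange(n_hosts)] += 1
    return min(counts), max(counts)
```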

          I see your concern if there’s a persistent difference in health–we’ve found that our remote peers usually either are healthy and basically never return failures, have a blip where they return failures for a bit and then recover, or are very unhealthy, decay very fast, and perish. We don’t typically run into the case where just one server has a low success rate forever; with that said, there’s a good chance you would simply want to route around that. The p2c paper does make the assumption that you’re using load as your load metric, which is evened out by the p2c algorithm itself adding load–since p2c can’t do anything about health (assuming it’s not related to load), that might be more of an issue for you if you find these persistent health differences to be common in your environment.

          latency (inverse of a decaying exponential average) … concurrency (inverse)

          One thing you might want to consider is that latency and concurrency (or “load”) are sort of measuring the same thing if your load balancer is working–a faster server should receive more requests but also maintain a lower level of concurrency, because it can clear requests faster. However, we’ve found that measuring latency is better than measuring concurrency, because it means that new hosts don’t get slammed on start-up, and it can automatically slow down when the remote peer starts to slow down under heavy load.

      2. 2

        Oh hey, I really like the Peak EWMA calculation here: As I understand it, the weight is adjusted by the call frequency to make a constant-time half-life (that’s EWMA for unevenly spaced time series) but also any measurement over the current average is just taken wholesale (that’s the peak-sensitivity). A coworker had suggested something like the peak-sensitivity aspect to deal with the low-latency failure situation that rachelbythebay calls the “load-balanced capture effect”, but I guess there’s a more general applicability. Very nice.
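
        As I read the description, the update rule is roughly the following hypothetical sketch (names and constants are mine, not Finagle’s):

```python
import math

class PeakEWMA:
    """Latency estimate with a constant half-life in wall-clock time
    (EWMA for unevenly spaced samples), except that any sample above
    the current estimate is taken wholesale (peak sensitivity)."""

    def __init__(self, halflife_seconds=10.0):
        self.tau = halflife_seconds / math.log(2)
        self.value = 0.0
        self.last_time = None

    def observe(self, latency, now):
        if self.last_time is None or latency > self.value:
            self.value = latency  # first sample, or a peak: take it wholesale
        else:
            w = math.exp(-(now - self.last_time) / self.tau)
            self.value = self.value * w + latency * (1.0 - w)
        self.last_time = now
        return self.value
```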

      3. 1

        We’ve considered having servers advertise a load metric to the client, but haven’t invested heavily in it, because what we have works pretty well so far.

        Load Shedding: We were able to opt every client at Twitter into load shedding by ensuring that we shed load gently, and that we only do it when servers are rejecting traffic at high volumes.

        You get both of these if you use a configuration where the frontend requests a token (or multiple tokens upfront) and you process requests in Last-In-First-Out (LIFO) order.

        When a request comes in, the load balancer broadcasts/multicasts a request asking “hey, who can do this work for me?”. Each node delays its response by a number of microseconds computed as a function of its load. The first reply the load balancer receives gets the actual request.

        If a backend is busy, it naturally gets fewer requests. If a backend fails, it never replies with its willingness to do work. You no longer need to store state in the load balancer about health (though in practice this is not really a big problem). Moving from offline to online, the node can have a forced 15-second-or-so warmup time to reduce service flapping.

        Another nice side effect is that your load balancer no longer needs routing logic, as the request for willingness to do work can contain metadata (or the full request), and the node can decide not to respond at all if it is not configured to be able to handle that request (think data sharding, or splitting your interactive queries from your batch jobs, etc.). Of course, while the node waits for the actual request to come through, it can pre-emptively start processing it.
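
        The delay function is the interesting knob here; a hypothetical sketch of the shape you’d want:

```python
def reply_delay_us(load, base_us=50, per_unit_us=200):
    """Delay (in microseconds) before a node answers the balancer's
    'who can do this work?' broadcast: lightly loaded nodes answer
    almost immediately, heavily loaded ones hang back, so the first
    reply comes from the least loaded willing node."""
    return base_us + int(load * per_unit_us)
```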

        1. 2

          That’s really interesting, but it also entails a bunch of extra round trips and network load–we’ve experimented with giving out leases periodically, but not on each request. Multicast is also extremely expensive–under an aperture scheme it wouldn’t be as bad, but in the normal case for small messages, it would increase your network overhead by a factor of your number of remote peers. The scheme you’re describing of delaying your response also seems like it would deliberately increase latency, which is something we’re trying to optimize for. Since network latency is often on the order of hundreds of microseconds already, you would have to have your delay also be on the order of hundreds of microseconds for it to not just be swallowed by network latency variance.

          I’m also not as optimistic as you about health–in particular, a server might be able to respond happily but still be serving unhealthy responses. There are different kinds of failures, and backends aren’t always good at measuring them.

          I think directly communicating server-load data via out of band requests, or piggy-backing updates via responses is probably more appropriate. As you mentioned, it’s not that burdensome to keep track of load data.

    4. 3

      I think this post sort of belabors the point: small changelists are better than large ones, and it’s easier to manage small PRs if you have better tooling for it.

      I disagree with the “one local commit is one diff” argument, and although my company uses Phabricator, we don’t use it that way. Sometimes a commit is only part of an idea, or doesn’t include tests, and that can still make sense as a local commit–with that said, it probably won’t make sense as a commit on master, and you should take advantage of the diff to collate the different pieces that will eventually become a single commit.

    5. 4

      Camille Fournier has a really excellent book on exactly this topic, called The Manager’s Path. I strongly recommend it.

      1. 3

        I love that book. Ordered it 2 weeks ago and currently in the middle of reading it. :)

    6. 19

      My one beef with these kinds of articles is that they phrase things so that it sounds like Google has a grand plan to destroy open standards, when it may actually be that many local decisions ended up doing it. The CEO of Google probably didn’t reach down one day and say, “Let’s get rid of XMPP”. I think it’s more likely that the Hangouts group decided to stop maintaining it so they could compete with other chat products that weren’t restricted by XMPP. This isn’t to say that a trend of this kind of behavior from Google isn’t something to talk about, but it’s probably either something fundamental about how to make money from open standards, or else something about Google’s incentive structure. If you asked Sundar Pichai to stop doing this, he would probably say, “I don’t know what you’re talking about; we have never made ‘destroying open standards’ part of our long-term strategy.”

      1. 37

        Their intentions are irrelevant; only their actions and the consequences of them matter.

        1. 17

          If the sole intention of the article is to encourage other folks to avoid this pitfall, sure! If we want to also convince Google to stop doing it, then the practical mechanics of how these things actually happen are of vital importance. It’s probably the rank and file that want to do something innovative in order to hit quarterly goals, and are sacrificing open standards at the altar–getting those kinds of people to understand the role that they are personally playing is important in that context.

          1. 6

            You’re right that it’s important that the labor understand what the consequences of their efforts actually are; I think that’s what I’m saying, too.

            It’s also important that other people who are impacted by these actions by powerful actors like Google or Apple or Microsoft, but who don’t work there, understand who is responsible for these social negatives.

            It’s further important, for everyone to understand, that the directly responsible parties for that social cost are the corporations themselves, whose individual human members’ culpability for those costs is proportional to those members’ remuneration. Pressure should be applied as closely and directly to the top of that hierarchy as possible in order to convince them to stop it, in whatever way you can do best. The OP article is addressing the top of that hierarchy in Google’s (Alphabet’s?) particular instance, since they’re a very powerful actor in the space of the Internet and software in general.

          2. 6

            Additionally, though it may be helpful for third parties to critique actors like Google by having concrete suggestions or perfect empathy for the foot-soldiers caught up in the inhumane machine that Google in some ways is, it’s not the obligation of the victims to make things easy for the powerful. It’s the moral obligation of the powerful to be mindful and careful with how they act, so that they don’t inadvertently cause human suffering.

            Before any ancap libertarian douchetards weigh in with “corporations aren’t moral entities”, they absolutely act within the human sphere, which makes them moral agents. Choosing to be blind to their moral obligations makes them monsters, not blameless. Defending their privilege to privatize profit and socialize cost is unethical and traitorous to the human race.

      2. 7

        I think it’s more likely that the hangouts group decided to stop maintaining it so they could compete with other chat products that weren’t restricted by XMPP.

        That’s reasonable – by all accounts, XMPP is terrible – but the replacement could have been open sourced. This detail makes it clear that closing off the chat application was intentional. When GChat’s userbase was small, it made sense to piggyback on the larger XMPP community. When GChat became the dominant chat client, it no longer needed the network effect that a federated protocol provided, and it moved to a proprietary protocol.

        1. 1

          By whose accounts?

          The vast majority of “commercial” chat networks are xmpp under the hood, with federation disabled.

          Being technically poor isn’t why they turned off federation, it’s because federated chat gives zero vendor lock-in.

          1. 2

            I believe you and @orib are in agreement when s/he says:

            This detail makes it clear that closing off the chat application was intentional

      3. 9

        On the other hand, the CEO of Google could decree that using open standards is important.

        I agree that this is closer to a natural disaster than a serial killer but an apathetic company doesn’t mean the outcome is better than an actively antagonistic company.

      4. 9

        I thought this article was fairly agnostic about how conscious Google’s embrace, extend, extinguish pattern is. This seems like the right approach, as we don’t have any way of knowing.

        We know from court proceedings that Microsoft executives used the term “embrace, extend, extinguish” (and no doubt they justified this to themselves as necessary and for the greater good). We don’t have the same window into Google executives’ communications, but it seems foolhardy to think that some of them wouldn’t recognize the similarities between Microsoft’s “embrace, extend, extinguish” and Google’s current approach. Sundar Pichai could be lying to himself, or he could just be lying to us. Either way the particular psychology of Google executives doesn’t seem important when the effects are predictable.

    7. 2

      Something the author doesn’t seem to understand is that google is trying to improve security for all users of Chrome. So it might be that no one ever gets man-in-the-middled on any of the author’s domains, but that doesn’t mean that no one will do it for the author’s site, which is served over unencrypted HTTP as of today. Chrome can guarantee you’re protected against certain classes of attacks over encrypted HTTP that it can’t over unencrypted HTTP – but once you’ve visited unencrypted websites, it can’t make those guarantees.

      My suspicion is that someone at Google has a metric they’re trying to optimize: the proportion of traffic Chrome delivers to users that it can prove has been encrypted. Unlike many metrics, this actually does seem to be one that’s good for all users. It’s certainly an inconvenience for website owners.

      With that said, for personal websites, I agree with commenters who don’t think it’s that big of a deal, especially since Cloudflare will do it for you for free.

    8. 2

      I’m a big fan of having a template for commit messages for your open source project. As an example, finagle and netty both have templates, and it makes it much easier to understand the purpose of a commit, and also how it achieved that purpose, which is what we typically want out of a commit message.
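      For reference, a minimal template in that Motivation/Modifications/Result style might look like the following – the wording here is illustrative, not copied from either project’s actual template:

```
Motivation:

Why is this change necessary? What problem does it solve?

Modifications:

What did you change to accomplish the above?

Result:

What is true after this change that wasn’t true before?
```

      You can have git pre-fill it for you with `git config commit.template <path-to-template>`.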

    9. 16

      The classic version of this is “How to ask questions the smart way”.

    10. 10

      Aside from being poorly written, this article tries to skewer event loops with the argument that CPU work will block your thread and prevent you from achieving high throughput. This doesn’t have to do with blocking vs non-blocking. Your CPU resources will be consumed regardless of the approach you take. The actual difference in throughput here is how many cores you can consume. Presumably if you were actually trying to achieve high throughput in production with node, you would have several node processes per machine and load balance your work across the different processes.

      I’m not a huge proponent of node, but this article is not good.

    11. 7

      I helped out with the Twitter decision to vote no on JSR 376, and here’s what we said. The short version is that we felt like the JSR doesn’t have very much bang for the buck as it stands, although it’s an opportunity to tackle a real problem.

    12. 5

      Something a bit less obvious – you can write an async recursive loop with futures if you’re clever, but your future implementation needs to have built-in support for it. In Scala (using Twitter futures, although Scala futures support this too as of 2.10.3):

      def loop(): Future[Unit] = Future.sleep(5.seconds).before { loop() }

      This is quite tricky – the originally returned future is never satisfied, and the chain keeps telescoping inward forever, so if you’re not smart about this you’ll end up with a big space leak. If you’re curious, the scala promise implementation has a good explanation of how this works.
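      For contrast, here is the same shape of loop in Python’s asyncio, which has no promise linking – each recursive await keeps its caller’s frame alive, which is exactly the telescoping chain the Scala promise implementation collapses. (The loop is bounded and the sleep shortened here purely so the sketch terminates.)

```python
import asyncio

async def loop(remaining: int, ticks: list) -> None:
    # Naive recursive async loop: each level awaits the next, so the chain
    # of suspended frames grows with every iteration unless the runtime
    # (like Scala's linked Promises) knows how to collapse them.
    if remaining == 0:
        return
    await asyncio.sleep(0)      # stands in for Future.sleep(5.seconds)
    ticks.append(None)
    await loop(remaining - 1, ticks)

ticks = []
asyncio.run(loop(100, ticks))
print(len(ticks))  # prints 100
```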

    13. 17

      I don’t think journalistic ethics have caught up with the ethics around doxing yet. The problem is that journalism tries to answer some basic questions, like “who, what, where, why, when, how” and historically, “who” has been a meatspace “who” because that was the only “who”. Now folks have persistent online identities, so it would be reasonable to refer to this guy just as MalwareTech and it would be fine. “Who” in this case doesn’t have to just be, “an anonymous person online” because MalwareTech is himself an identifiable person online, separate from what he does in meatspace.

      Clearly, some kinds of doxing aren’t OK in journalism, like publishing someone’s address, telephone number, or social security number, but violations of privacy have always been somewhat fuzzy. Cf paparazzi, or revealing who Elena Ferrante was.

      For what it’s worth, the SPJ code covers this kind of thing, but I suspect it will still take more time for journalists to get a good sense of how this works in the internet era.

      1. 3

        And now I was sitting here, slightly confused whether Simon Peyton-Jones gave an enlightening talk about online privacy that I missed before I followed the link …

    14. 3

      So eventually, after we platformize our hack and are comfortable from having run parallel infrastructures for some time, we’ll be handing off our DNS infra to the folks that probably know how to do it better than us.

      So far I’m 2 for 2 on “companies I’ve heard of running their own complicated DNS set up, despite it not being a core part of their business” vs “companies who would have been far better off outsourcing their DNS.”

      1. 1

        What does this look like when you’re in your own datacenter?

        1. 1

          One of:

          • You put the entirety of your zones on the external DNS service and you put only caching (if any) nameservers inside the DC.
          • You put the public-facing part of your zones on the external DNS service and you do split-horizon DNS to have a subdomain visible only inside your DC, inside which all your internal-only records go. You put at least one nameserver inside your DC which responds only to requests from inside your DC, believes itself to be authoritative for that internal subdomain, and forwards all other requests upstream.

          In both cases you get an API and UI for managing your publicly visible DNS entries because every worthwhile DNS provider does that.
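          The split-horizon variant can be sketched with BIND views; this fragment is purely illustrative (zone names, file names, and networks are made up):

```
// Clients inside the DC see the internal-only subdomain.
view "internal" {
    match-clients { 10.0.0.0/8; };
    zone "corp.example.com" {
        type master;
        file "zones/corp.example.com.zone";
    };
};

// Everyone else sees only the public zone.
view "external" {
    match-clients { any; };
    zone "example.com" {
        type master;
        file "zones/example.com.zone";
    };
};
```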

        2. 1

          You can still outsource DNS in that scenario. Maybe it makes less sense though, but it’s equally possible as when you’re entirely cloud hosted.

      2. 1

        I think that an ideal state is “companies should only do the core part of their business”, but the reality is that “companies have to own whatever they need to own to ensure their customers can access their product”.

        If that means running your own code or your own DNS or your own fileserver then that’s what you gotta do. It’s obviously more expensive but some companies don’t have the luxury of saying (as I’ve heard many on hn say) “amazon is down lol that means the internet’s broken guess we can go to lunch until it’s working again”.

        1. 1

          This can’t possibly be true if “own” means “run themselves”. Every company that sells products using the internet needs, amongst many other things, DNS service. Proportionally very few of those companies are capable of running a DNS service with higher uptime than, say, Route 53.

    15. 2

      My main problem with commit messages in git is typically that I find them inscrutable after a few months, or if it’s on a piece of code I’m not familiar with. My team has adopted a strategy of ensuring that commit messages have a motivation for the commit and then explain how it fixes the problem, which I really like, and you can find here. It’s quite lightweight.

    16. 4

      I work on finagle, which underlies the technology that Duolingo switched to (finatra), so I have a horse in this race, but I wanted to talk a little more about what you were saying about averages being useless.

      The tricky thing is that latency is something we get when we make a request and get a response. This means that latency is subject to all kinds of things that actually happen inside of a request. Typically requests are pretty similar, but sometimes they’re a little different–like the thread that’s supposed to epoll and find your request is busy with another request when you come in, so you wait a few extra microseconds before being helped, or you contend on a lock with someone else and that adds ten microseconds, and then all of those things add up to being your actual latency.

      In practice, this ends up meaning that your latency is subject to the whims of what happens inside of your application, which is probably obvious to you already. What’s interesting here is what kinds of things might happen in your application. Typically in garbage collected languages the most interesting thing is a garbage collection, but other things, like having to wait for a timer thread, might also have an effect. If 10% of your requests need to wait for a timer thread that ticks every 10ms, then they’ll create a uniform distribution from the normal request latency + [0, 10ms).
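      As a toy simulation of that effect (all of these numbers are assumed, not measured): a ~1ms happy path where 10% of requests also wait up to 10ms for the timer tick produces a p99 dominated entirely by the timer, even though the p50 barely moves:

```python
import random

random.seed(0)

def sample_latency_ms() -> float:
    latency = max(0.0, random.gauss(1.0, 0.2))   # happy-path latency ~1ms
    if random.random() < 0.10:                   # 10% hit the timer thread
        latency += random.uniform(0.0, 10.0)     # wait for the next 10ms tick
    return latency

samples = sorted(sample_latency_ms() for _ in range(100_000))
p50 = samples[len(samples) // 2]
p99 = samples[int(len(samples) * 0.99)]
print(f"p50 ~ {p50:.2f}ms, p99 ~ {p99:.2f}ms")   # p50 near 1ms, p99 near 10ms
```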

      This ends up meaning that when people say normal metrics are mostly useless for latency, they usually mean the aggregate latency, which includes samples that had to wait for the timer thread, samples that had to sit through a garbage collection, and so on. It isn’t that the distributions they construct are particularly odd, but that they are composed of many quite typical distributions. So there’s a normal distribution for the happy path, a normal distribution for the garbage collections, and a uniform distribution for when you were scheduled on the timer, and put all together they end up making a naive average difficult to interpret.

      But we can make an informed guess here, which is that probably the happy path is around 10ms now, and was probably around 750ms before, which is a quite nice improvement. As far as the unhappy path, my suspicion is that JVM gc pauses are better than Python gc pauses, but it’s quite difficult to tell for sure. My guess would be that their gc pauses are on the order of 100s of milliseconds, and were previously also on the order of 100s of milliseconds, so that the p9999 is probably still better than the p50 they saw previously.

      Anyway, this is just to say that averages are useless, but also that just knowing the p50 or p99 is also sort of useless. Really what I want to be able to see are the actual histograms. As a side note, finatra exports real histograms, so if you get ahold of one of the Duolingo people, I’d be pretty interested to see some of those graphs.

      1. 2

        Agreed, a histogram – or any more details around performance – would have been useful. It’s unclear what they measured and what was sped up, so it’s hard to evaluate anyway.

        And this is the problem with precision but not accuracy: if you’re telling me 750ms and 10ms, that sends me a different signal than 750ms and 14ms. In fact, if I wasn’t going to dive deep into the perf aspects, I might have either dropped the numbers altogether (and stated “more than an order of magnitude improvement”), or said “50 times faster”, and then I would’ve gotten the gist of the speedup (which seems awesome) without tripping over the concrete numbers (especially 14).

    17. 3

      I think this article is interesting, but unrelated to the core interests of lobsters, which are computer technology. I think folks can find these kinds of articles on other websites, and it would be best to keep this kind of stuff off lobsters to keep the signal to noise ratio high.

      1. 16

        I slightly disagree. It fits in well: UI design is a very important part of CS. I will caveat my comment by saying this probably should have the off-topic tag.

        Edit: I swear there used to be an off topic category.

        1. 19

          I think CS people tend to find these kinds of metro-layout type discussions interesting for the even more specific reason that they’re closely related to CS sub-areas like automated graph layout. In fact if automated graph layout worked ‘perfectly’ for some definition of perfectly, you would just use that to make metro maps.

          1. 4

            Indeed. The blog post from Transit App was an excellent read on this topic.


      2. 8

        I disagree completely. Computer science is largely about representation and communication of data, and few sets of data affect more people than those related to transit systems. Even if one finds the visual / graphic aspects uninteresting, the implicit analysis of the data set can inform all manner of algorithmic thinking.

    18. 22

      Are there any stripe folks on lobsters who know why stripe chose Consul as the service-discovery tool, instead of straight-up DNS or zookeeper? b0rk phrases it as “effective and practical”, but hashicorp’s first commit to Consul was almost exactly three years ago, so if they’ve been using it for a couple of years, they adopted it almost as soon as it came out. In comparison, kubernetes, which she contrasted as “a new technology coming out”, had its first commit two and a half years ago, but it was already 250 files, so it was probably in development for at least half a year before that. I wonder if maybe the way Stripe talks about Consul has changed since they started using it – since they’ve used it for a couple of years, they think of it as battle-hardened, even though in the larger world of distributed systems it is not particularly broadly used. This might be true for Stripe, since they have already worn down the rough edges for their use case, but if I were choosing a service discovery system for a company, I don’t think I would consider Consul the anti-flashy choice.

      One thing that worried me about Consul is exactly what Stripe ran into when running it: it’s built on top of a framework for guaranteeing consistency. In practice, strong consistency might not be what you want out of a service discovery framework. It might be appropriate if you really don’t want your servers to ever accidentally talk to the wrong server and you reuse names (or IP addresses, which is often what service discovery nodes point to), since after you remove a node from a cluster, you can be pretty certain that all clients will see the removal, provided the service discovery cluster can talk to them. In the long run, a better solution to this problem than requiring strong consistency in your service discovery tool is requiring that clients and servers authenticate each other, so they agree that they’re talking to who they think they’re talking to. If I were picking a flashy new service discovery framework, I would probably look at eventually consistent tools like Netflix’s Eureka. If I were trying to do something battle-hardened, I would probably pick Zookeeper.

      Looking at Zookeeper naively, you might ask, “Why is this strongly consistent hierarchical file system the go-to default for service discovery?” One thing is that it was designed for generic configuration changes, so it receives updates via the “watches” API, and the “ephemeral znodes” API, which can be the fundamental building blocks of a service discovery tool. That’s the long and short of why people have used it for practically a decade.
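      A sketch of why those two primitives are sufficient – note this is an in-memory stand-in, not the real ZooKeeper client API: each server registers an ephemeral child under its service’s path (ZooKeeper deletes it when the owning session dies), and clients simply watch the children list:

```python
class FakeZk:
    """In-memory stand-in for the two ZooKeeper primitives we need."""

    def __init__(self):
        self.children = {}   # path -> set of ephemeral child names
        self.watches = {}    # path -> list of callbacks

    def create_ephemeral(self, path, name):
        # A server announcing itself under its service's path.
        self.children.setdefault(path, set()).add(name)
        self._fire(path)

    def session_expired(self, path, name):
        # ZooKeeper removes ephemeral nodes when the owning session dies.
        self.children.get(path, set()).discard(name)
        self._fire(path)

    def watch_children(self, path, callback):
        # A client subscribing to membership changes.
        self.watches.setdefault(path, []).append(callback)
        callback(sorted(self.children.get(path, set())))

    def _fire(self, path):
        for callback in self.watches.get(path, []):
            callback(sorted(self.children.get(path, set())))

zk = FakeZk()
seen = []
zk.watch_children("/services/web", seen.append)
zk.create_ephemeral("/services/web", "host1:8080")
zk.create_ephemeral("/services/web", "host2:8080")
zk.session_expired("/services/web", "host1:8080")   # host1's session dies
print(seen[-1])  # prints ['host2:8080']
```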

      Other than that, zookeeper doesn’t have a lot that particularly commends it. It does great throughput (for a strongly consistent store), and many people have used it for service discovery for a long time, so you can be pretty confident in it. On the other hand, when it’s not confident it can make progress safely, it just doesn’t make progress – this can mean that new nodes can’t start up (because you can’t add them to the cluster) and that old nodes can’t be removed (because you can’t remove them from the cluster). Leader elections can also be pretty painful. Unfortunately, these are also problems that Consul faces, because it made the same choices about consistency that Zookeeper did.

      Now that they’re using DNS on top of Consul, they have two single points of failure. Although we treat DNS like air and assume it’s an unlimited resource, DNS is still a name server, and it can still go down. With that said, DNS is really battle-hardened, so usually the problem comes when somebody poisons your DNS server somehow. This problem is mitigated by being in an environment where experts run your DNS servers, but it can still be bad.

      The other thing is that network partitions are real, and you don’t necessarily want to take down your own website because your service discovery cluster can’t talk to every remote host. Just because your service discovery cluster is partitioned from them doesn’t mean that they’re partitioned from each other! The nastiest problem then isn’t when Consul is down, but when Consul is up and is sending you garbage that makes you think everyone is down. One solution ends up being to only trust Consul as a source of new information – your load balancer assumes that Consul tells it about new nodes, and ignores information about dead nodes until it can validate for itself that they’re dead.

      As b0rk mentioned, DNS can be pretty slow to update, which is usually the reason why people don’t want to use just DNS. If you’re happy with DNS’s propagation speed, it might make sense to cut out the middleman and skip running your own service discovery tool. With that said, it can be a hassle to have to wait on the order of minutes for a service discovery update – in particular, it makes rolling restarts especially slow, since if you want to roll 5% of your cluster at a time, you’ve added at least twenty minutes to your deploy. You can use blue/green deploys to make it easier to roll back, but as your cluster size grows, that becomes increasingly expensive.
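      The arithmetic behind that, with an assumed one minute of DNS propagation per batch:

```python
batch_fraction = 0.05                      # roll 5% of the cluster at a time
batches = round(1 / batch_fraction)        # 20 batches per deploy
dns_wait_minutes = 1                       # assumed propagation delay per batch
extra_minutes = batches * dns_wait_minutes
print(extra_minutes)  # prints 20
```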

      With all that said, I think this is a really cool story of taking a technology, getting it to work (upstreaming bugfixes! be still my heart), improving reliability by relaxing consistency, and simplifying healthchecking. Despite my skepticism of the newfangled hashicorp stuff, service discovery is a well-known danger area, so having zero incidents in a year is pretty dang good. I hope companies continue to put out blog posts like this one–the history and explanations of why decisions were made are great. Stripe does have an advantage since they’re employing b0rk though ;)

    19. 1

      Are these guys duplicating their own work?

      Sorry, bad wording: are the Scala folks duplicating the work done by the Java folks?

      1. 3

        I think not – scala-native doesn’t target the JVM, whereas this AOT compilation allows a mixed-mode style, so that you’re still on the JVM but some libraries are precompiled to native code. Note that there’s even a style of running that still allows JIT-ing of AOT-ed code.

        For what it’s worth, the scala-native people aren’t coming up with something that has never been done before. There are existing tools to compile java to native code, most notably gcj. I think what’s exciting about scala-native are the proposed extensions to scala which allow you to control your memory in a much finer-grained way, and the improved FFI.

      2. 1

        Is it the same guys?

        I would expect scala-native to support the native ABI (or at least lightweight bindings to it), whereas that doesn’t seem to be a goal for this project. Whether that’s actually an important use case (and/or worth the cost of working without support for anything written in Java) is an open question.

      3. 1

        More like Oracle trying to badly duplicate what already works in Scala.

        Will be Scala.js vs. GWT all over again: Two implementations, one works, one doesn’t.

        (I expect that Java-AOT like GWT will not even try to have any sensible kind of interop/integration into JS/native, making them foreign objects in the respective place. Java-AOT will likely be some shrink-wrapped JVM+app code thing.)

    20. 1

      This will be really useful when it comes out. From the perspective of someone who helps write (gently) latency-sensitive systems, we expect that we have to warm up all JVM-based services to help with many things. Off the top of my head: hydrating caches, resizing socket buffers, JIT-ing, getting GC heuristics going, connection establishment, and ensuring lazily evaluated code has already run. All of these have been tunable or fixable, except for JIT-ing, which absolutely must happen, and which can’t be done any way other than by exercising the code paths. This change will allow us to consider a brave new world where we don’t need to figure out how to coordinate warm-up requests for all of our applications. This will be especially useful for applications with a broad workload, where it’s a hassle to figure out how to warm up every workload, and for cases where it’s difficult or impossible to send synthetic traffic that doesn’t mutate a persistent store.

      I’m pretty excited for JDK9. It has been a long time coming, but it looks like there will be some really exciting goodies in there.