How to start learning high scalability

distributed systems

When we usually are interested about scalability we look for links, explanations, books, and references. This mini article links to the references I think might help you in this journey.

DISCLAIMER:

You don’t need to have N machines to build/test a cluster/high scalable system, currently you can use Vagrant or docker and up N machines easily.

THE REFERENCES:

Now that you know you can empower yourself with virtual servers, I challenge you to not only read these links but put them into practice.

Good questions to test your knowledge:

  • Why to scale? how people do that usually?
  • How to deal with user session on memory RAM with N servers? how LB know which server is up? how LB knows which server to send the request?
  • Isn’t LB another SPOF? how can we provide a failover for LB?
  • Isn’t my OS limited by 64K ports? is linux capable of doing that out of the box?
  • How does mongo solves failover and high scalability? how about cassandra? how cassandra does sharding when a new node come to the cluster?
  • What is cache lock? What caching policies should I use?
  • How can a single domain have multiple IP addresses (ex: $ host www.google.com)? What is BGP? How can we use DNS or BGP to serve geographically users?

Bonus round: sometimes simple things can achieve your goals of making even an AB test.

Please let me know any mistake, I’ll be happy to fix it.

Redundancy and failover on your life

Very simplified introduction: failover is the ability to keep using a service or device in case it fails, and you usually achieve that by having redundancy, having more than one service or device at time. A silly example could be when the power goes off in your house, you can handle that (failover) by having and using a flashlight (redundancy) as backup light system.

In my life I faced similar problems all the time and I think it is valid to share that. I’ll start with the very basic service (but currently very necessary) Internet, suppose we’re not at home or we’re travelling or our beloved ISP is off, I deal with that by having an extra 3G modem and Kindle with 3G free-worldwide.

travel quite often and everytime sometimes I face issues with outlets.  My notebook is meant to be used on Brazil outlet pattern but when I go to US I need to use an outlet adapter. It is not a very accurate failover mechanism however travelling with one world outlet adapter can save you from some pain.

The main way I have fun is by playing games; then in case of my console broke or the power went out I have the portable, again it’s not quite accurate a failover mechanism but for my purpose it is.

Another area is TV, suppose my TV was stealed I can keep watching it by using my usb tv or even my gps tv.

Going to digital world I can tell you endless stories and ways to have failover. The most obvious could be have your files in your computer but keep it also in a cloud storage. I love this thing about digital buying, I used to buy games digitally (Steam, eShop, PSN) and even though I change computer I don’t need dirty old DVD’S to recovery my games, they are all associated with my account.

These last two are the best IMHO: make all the documents you have digital copy (this is easy today since any smartphone can take pictures), try to attached them to cloud (email, storage…), it saved me a lot of time. You should have at least two phone numbers of a service (food delivery, cab, hospital and etc.) because sometimes you don’t have easy access to get this info.

And you, what do you do to have failover in your life?

Functional programming with Clojure

Clojure

I’ve been studying the new language called Clojure (all the cool kids are talking about Clojure). It is a functional language created by Rich Hickey around 2007. This is a(nother) dialect of Lisp. It is a dynamic language as Ruby, JavaScript and others. As said before Clojure (pronounced as closure) it’s a impure functional language in contrast with Haskell, a pure functional language. It runs over the JVM, so it’s fast, interoperable with Java among a lots of good stuffs that JVM give us. To put hands-on and try code something you can use the try Clojure online or you can download the clojure.jar file and run it. Surprisingly Clojure it’s easy to learn.


java -jar clojure-x.x.x.jar

What it a functional language? (concepts)

first-order functions -> functions are treated as values. You can store a function on a variable, you can pass one function to another or you can return a function from another function.

var sum = function(a,b){
  return a + b;
};

var obj = function(sum){
  return {
    hello: "hello",
    sum: sum
  };
}();

obj.sum(3,5);

functions constructs -> the language constructs are function instead of keyword. Constructions for conditions (if), for iterations (for, while), catch exceptions (try, catch) and others.


(if condition do-it else-do-it)

stateless -> it’s functional in the sense of math, you have functions which defines values input and output and doesn’t rely on outside global state. In such pure function you won’t produce any side-effect (read, write outside resource). Obviously we will produce programs which causes side-effects, clojure helps you build “mutable” data . On other pure languages like Haskell side-effects are treated as expections so you have concepts like actors and monad.

immutable data -> collections and local variable, in clojure, are immutable. The immutability, helps us in parallelism, since the “values” are immutable you can shared then without worry about locks.

currying -> is the technique of transforming a function that takes multiple arguments (or an n-tuple of arguments) in such a way that it can be called as a chain of functions each with a single argument (partial application).

memoization -> is an optimization technique used primarily to speed up computer programs by having function calls avoid repeating the calculation of results for previously processed inputs.

Resources

Tips for performance in your web sites

HTTP

HTTP is a networking protocol for distributed, collaborative, hypermedia information systems and it is also the foundation of data communication for the World Wide Web.

How it works?

You send (using a browser, for example) a request for the server: -Hey Internet, give me the Ruby overview web site.

 

GET /2011/07/11/ruby-overview.html
HTTP/1.1
Host: leandromoreira.com.br
User-Agent: Mozilla/5.0 (s.o.) Gecko/20100101 Firefox/5.0

And server can answer you: – Okay!

HTTP/1.1 200 OK
Content-Type: plaintext/html
Last-Modified: Fri, 15 Jul 2011 00:23:15 GMT
Content-Lenght: 896
blah blah content blah blah ruby blah blah

As you can see HTTP is a protocol where you can request resources for the sever, using some fields. This conversation between you and the server can results in further requests. Using some of HTTP fields, you can request compressed data from server (then use less network load), server can inform you how long you can keep components at cache.

The 14 golden rules

Steve Souders wrote the amazing book (based on his researches on Yahoo!) High Performance Web Sites to help us made more scalable and fast web sites. This book shows fourteen rules at practice to follow and achieve the fast web site, it also presents real web sites examples. I certainly recommended you this book. (Thank Guilherme Motta for this recommendation.)

Rule #01 – Make fewer HTTP requests

Yeah, this rule might be obvious but not so obvious is the possible solutions for that:

  • Use image map and css sprites instead of use multiply images.
  • Hard core -> Sometimes you can use inline images. (it doesn’t work with IE)
  • Combine & Minify your files: three js scripts to only one (with removed spaces and etc.) and nine stylesheets to one. In fact this thing of combine everything into one, broke our so beloved coupling OO rule. I think this combine and minify process should be made on deployment and delivery, this let’s free to create cohese and uncoupled scripts and stylesheets.

Rule #02 – Use Content Delivery Networks

I’ve worked in a company where we have some stylesheets and js shared between the projects. Our approach for that was, each project injects the files in its workspace. This take us to the hell of outdate files.

Thinking in terms of caching, if we had this files into a single known server, all the requests for our several web sites would take advantage of cache.

Taking this for a big picture, I mean dream that we are handling a huge web site, we could also take the advantage of customer’s proximity. If our components are next to our clients so the download time would be decreased.

Rule #03 – Add an Expires Header

In HTTP conversation, the server can tell to a client that certain component could be used at his local cache for a certain time using the HTTP field: Expires. You request a component for server and it answers with that component and give you a valid time for that component. The setup for this is made on server-side. (See also: Max-age and Cache-Control – overcome the limitation of Expires, when you need to specify the exact date, causing clock synchronizing issues. Using Cache-Control you can set time in seconds!)

People, usually, don’t put this because they fear the change. For example: I create a component (company’s logo) and put the expires date to ten days, but if I would change it before that time?! Well, you could starting writing your components with rev. number on its name. The company_logo_1.0.png is cached but your newer version company_logo.1.1.png isn’t.

Rule #04 – Gzip ’em all

Once again, you can inform to server that you understand and accept compressed files (Accept-Encoding: gzip) and maybe server answer you with compressed data. (Content-Enconding: gzip) Using compressing, saves the average of 66% in your network flow, it’s huge and worth to do.

Rule #05 – Stylesheets at top

The history is long (progressive rendering, browser way to render, blank screen when css at bottom) so just follow (and read the book to understand the beginning of this long history) the rule! BTW, by testing, it was proved that use LINK way of include css instead @import is better.

Rule #06 – Scripts at bottom

Again, long history (browser way to understand it, EVEN BLOCK parallel downloads because of scripts and middle of your html) just follow the rule!

Rule #07 – Don’t use CSS expression (okay, avoid)

This rule brings a new concept to me. I didn’t know that we could write css and javascritp into css rules (in this way). For example:

width: expression( document.body.clientWith < 600 ? "600px" : "auto" );

Rule #08 – Make JS & CSS external

Inline vs External -> In raw terms inline is faster. But don’t forget about others rules : compressing, CDN and caching. In general, for real world projects external files always win for performance. (Mainly for users with primed cache, components cached because they’re visited site before)

Rule #09 – Reduce DNS lookups

The roundtrip that one request does to find the it can be a huge time-consuming, so DNS caching would improve a lot your user first visit.

Rule #10 – Minify Javascript

I already cite this but it’s a valid rule, imagine the space you can save and bytes you can save by minifying your js.


myFunction(parameter0, paramenter1){

window.title = parameter0 + parameter1;

}

Works equality as:


myFunction(_0,_1){window.title=_0+_1;}

There is a bunch of minifyers on Internet. And even obsfucators but I don’t think it’s necessary and most of them can introduce bugs. There is also CSS minifyers too: write 10 instead of 10px, #606 instead of #660066.

Rule #14 – Make ajax cacheable

This is, maybe, the most hard rule to explain and apply! Then it’s better read it here.

PS: I skipped the rules 11 (avoid redirects), 12(remove duplicate scripts) and 13 (configure etags) just because they are most known for general people, except ETags but usually the tip is : avoid ETag, so it is.

Oh yes, you can see more rules of high performance at Yahoo! and at Google.com too.

Useful tools to help mensure your site

Today, we fortunately have tools to help us identify and fix performance issues on our web sites. You can simply grab one of that bellow or both and run it on your site. They give you a complete report about the pain points and thing you can do to increase performance. YSlow uses the fourteen rules as basis and Speed Trace is Google tool, it doesn’t need to say anything.

YSlow

YSlow analyzes web pages and suggests ways to improve their performance based on a set of rules for high performance web pages. It can be installed as browser plugin.

Speed Trace

Speed Tracer is a tool to help you identify and fix performance problems in your web applications. It visualizes metrics that are taken from low-level instrumentation points inside of the browser and analyzes them as your application runs. Speed Tracer is available as a Chrome extension and works on all platforms where extensions are currently supported (Windows and Linux).