I'm really impressed with Viddler's openness about their architecture in a post on the High Scalability blog. Their list of 10 lessons learned is great, and anyone thinking about doing a web startup should take a look.
- Mix and Match. They are using a combination of nodes from different providers. The CDN handles the content. Amazon is used for stateless operations like encoding and storage. Nodes in the colo are used for everything else. It may be a bit confusing having functionality in several different locations, but they are staying with what works unless there's a compelling business or ease-of-use reason to change. Moving everything to Amazon might make sense, but it would also take them away from their priorities, it would be risky, and it would cost more.
- Watch out for table growth. Queries that used to take a reasonable amount of time can suddenly crush a site once it grows larger. Use reporting instances to offload reporting traffic from interactive traffic. And don't write queries that suck.
- Look at costs. Balancing costs is a big part of their decision making process. They prefer growth via new features over consolidation of existing features. It's a tough balancing act, but consciously making this a strategic imperative helps everyone know where you are going. In the longer term they are thinking about how they can get the benefits of the cloud operations model while taking advantage of the lower cost structure of their own colo.
- Experiment. Viddler loves to experiment. They'll try different technologies to see what works and then actually make use of them in production. This gives them an opportunity to see if new technologies can help them bring down their costs and provide new customer features.
- Segment teams by technology stack and release flexibility. Having distributed teams can be a problem. Having distributed teams on different technology stacks and radically different release cycles is a big problem. Having distributed teams with strong dependencies and cross-functional responsibilities is a huge problem. If you have to be in this situation, then moving to a model with as few dependencies between the groups as possible is a good compromise.
- Learn from outages. Do a survey of why your site went down and see what you can do to fix the top problems. Seems obvious, but it isn't done enough.
- Use free users as guinea pigs. Free users have a lower SLA expectation so they are candidates for new infrastructure experiments. Having a free tier is useful for just this purpose, to try out new features without doing great harm.
- Pay more for top tier hosting. The biggest problem they've had is picking good datacenters. Their first and second datacenters had problems. Being a scrappy startup they looked for the cheapest yet highest quality datacenter they could find. It turns out datacenter quality is hard to judge. They went with a top name facility and got a great price. This worked fine for months and then problems started happening. Power outages, network outages, and they eventually were forced to move to another provider because the one they were with was pulling out of the facility. Being down for any length of time is not acceptable today and a redundant site would have been a lot of effort for such a small group. Paying more for a higher quality datacenter would have cost less in the long run.
- What matters in the end is what the user sees, not the architecture. Iterate and focus on customer experience above all else. Customer service is even valued above sane or maintainable architecture. Build only what is needed. They could not have kick-started this company while maintaining 100% employee ownership without running ultra scrappy. They are now taking what was learned in the scrappy stage and building a very resilient multi-site architecture in a top-tier facility. Their system is not the most efficient or the prettiest; the path they took was simple: when the customer needed something, they built it. They go after what the customer needs. The way they went about selecting hardware and building software, with an emphasis on getting the job done, is what built the company.
- Automate. While all the experimentation and the solve-the-immediate-problem-for-the-customer approach is nice, you still need an automated environment so you can reproduce builds, test software, and provide a consistent and stable development environment. Automate from the start.
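
To make the "watch out for table growth" lesson a little more concrete, here's a minimal sketch of sending reporting queries to a separate reporting instance so they can't slow down interactive traffic. This is only an illustration of the idea: the hostnames and the `pick_database` helper are hypothetical, and nothing here is taken from Viddler's actual setup.

```python
# Hypothetical hosts for illustration only; not Viddler's real configuration.
DATABASES = {
    "interactive": "db-primary.example.internal",
    "reporting": "db-reporting.example.internal",
}


def pick_database(workload: str) -> str:
    """Return the database host for a given kind of workload.

    Queries tagged "reporting" go to the reporting instance.
    Any other tag, including unknown ones, falls back to the
    interactive (primary) host.
    """
    return DATABASES.get(workload, DATABASES["interactive"])


# User-facing page loads stay on the primary, while a heavy monthly
# usage report is pointed at the reporting instance.
page_host = pick_database("interactive")
report_host = pick_database("reporting")
```

The point of keeping the choice in one small function is that a query that starts to hurt as tables grow can be moved off the primary by changing its tag, rather than by hunting down connection strings.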