Last year I attended Velocity 2010 from the Operations side of things as an Operations Engineer at SugarCRM. This year I attended as a Software Developer at Highgroove Studios, which influenced my choice of sessions and my perspective.
Outside of the sessions, there was almost too much going on to keep track of: a full job board, a full exhibit hall, and people from every company that is anybody in the internet space. I got face time with people from Facebook, Yahoo, Google, YouTube, Heroku, Opscode, Dyn, Amazon Web Services, eBay, AOL, GitHub, VMware, and plenty of others. Some were casual conversations about company culture between sessions or at a meal; others were detailed Q&A at their booths, getting specific needs I had worked out. Specifically, I learned some things from developers at Amazon and Heroku that will help me write better applications, and I might be moving a bunch of things over to Dyn's services. Steaks with Facebook's AppOps guys were delicious (they know I'm not ready to move out to the bay), and the guy sitting next to me had, as a grad student in Spain, been part of a team that held the Internet2 Land Speed Record for a period of time.
The slides and videos are slowly being posted, so head over there to see the full spread. This was going to be my recommendation for the things to watch/read, but it ended up being a short description of most of the sessions I went to, all of which are worth paying some attention to.
How to Scale Dirty and its Influence on People - Philip Kromer (Infochimps), Dennis Yang (Infochimps)
These guys have some pretty good processes going on, and have the approach to scaling at a startup figured out. A few quotes from them:
- "A startup is a tool to turn time and money into a validation of what the world wants"
- "Only automate out of boredom or terror"
- "Don't solve problems you'd like to have"
Perhaps the most interesting part of their talk was the introduction of 'radical decoupling': reducing the critical path edges by splitting code into multiple non-dependent components that can be developed/tested/run individually. Specifically for Ruby on Rails, they recommend using Rails Engines to split things up. Two separate codebases of 1000 lines of code each are much easier to develop/test/run than a single 2000 line codebase.
Advanced PostMortem Fu & Human Error 101 - John Allspaw (Etsy)
Allspaw is no stranger to these conferences and is one of the "Rockstar" Operations guys out there. He uses a lot of words and concepts from Human Factors and formal analysis of systems and errors that are pretty uncommon in the 'startup' scene, but they're extremely useful. (And he mentioned a favorite book of mine: Normal Accidents: Living with High-Risk Technologies.) Once the video of his talk gets posted, I highly recommend watching it. A few quotes:
- "Human error is an effect, not a cause"
- "There is no root cause"
- "Any explanation is better than none"
- "People's need to be right is stronger than their ability to be objective"
- "Accountability = Responsibility + Requisite Authority"
- "Reprimanding someone is like peeing your pants - first feels nice and warm, then it's pretty uncomfortable."
CSS3 & HTML5 - Beyond the Hype! - Nicole Sullivan (@stubbornella)
Lots of great knowledge in this talk with immediate practical applications. These slides/talk are worth reviewing for anyone writing client side markup.
- Don't worry about degradations in IE6 for non-functional things (rounded corners, drop shadows, etc)
- Use a CSS framework like oocss
- Use classes/ids instead of descendants (faster lookups)
- Use .className instead of div.className (KISS and faster performance)
- Avoid transparency calculations for anything involved in an effect (border-radius, box-shadow, rgba, etc)
- Use CSS3 to cut down on the number of images
- Use border-radius for rounded borders and make sure to place it after the vendor prefixed properties
- Really Cool CSS3 patterns
- CSS Lint open sourced during presentation! Use it to check your CSS for potential performance issues.
Why the Yahoo FrontPage Went Down and Why It Didn't Go Down For up to a Decade before That - Jake Loomis (Yahoo!)
This presentation is mainly interesting because they show some insight into a big company with lots of real traffic. Yahoo! peaks at 41,000 requests/second for their new page. They "error proof change" by treating staging as production, dark launching code, running continuous integration, and forking in production traffic to staging to test things out. This presentation is worth watching to see how they can shuffle around traffic between properties and datacenters, and how they drop features to add capacity when needed. The two most interesting facts:
- Yahoo runs metrics on all sites linked to from the front page. When any of them start experiencing performance degradations, Yahoo throttles back the traffic being sent to them by reducing or eliminating the number of users that see links to them. This prevents Yahoo from taking out smaller sites.
- APC was the root cause of the Yahoo.com several-minute outage because it caused cascading server failures in multiple datacenters at the same time when additional language information was added to the cache. APC problems have caused me plenty of headaches, and it's nice to see that a big company has trouble with it too sometimes.
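The link-throttling idea in the first point can be sketched roughly as below. This is only my guess at the shape of such a mechanism, not Yahoo's implementation: the function names, the latency thresholds, and the user-bucketing scheme are all assumptions made up for illustration.

```python
def exposure_fraction(latency_ms, healthy_ms=500.0, cutoff_ms=2000.0):
    """Return the fraction of front-page users who should see a link.

    Hypothetical policy: full exposure while the linked site responds
    within healthy_ms, no exposure at or beyond cutoff_ms, and a linear
    ramp down in between. The thresholds are illustrative assumptions.
    """
    if latency_ms <= healthy_ms:
        return 1.0
    if latency_ms >= cutoff_ms:
        return 0.0
    return (cutoff_ms - latency_ms) / (cutoff_ms - healthy_ms)


def should_show_link(user_id, fraction):
    """Deterministically place a user in one of 100 buckets.

    The same user always gets the same answer for a given fraction,
    so a link does not flicker in and out between page loads.
    """
    return (user_id % 100) < fraction * 100


# A linked site that has slowed to 1250 ms gets shown to half of the users.
fraction = exposure_fraction(1250.0)
visible = sum(should_show_link(user_id, fraction) for user_id in range(100))
```

Ramping down gradually instead of cutting the link entirely gives the smaller site a chance to recover while still receiving some traffic.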
Oh, To Be Single Again - Building a Single Codebase in a Client-server World - Daniel Hunt (Yahoo!)
Building for the Cloud: Lessons Learned at Heroku - Mark Imbriaco (Heroku)
Mark gave a good overview of how Heroku works, how they dealt with the Amazon EBS outage a few weeks ago, and some handy tips. Here are a few of the interesting tidbits:
- "Yo dawg, I heard you like platforms so I put a platform on your platform" - Heroku is deploying all the infrastructure they use to manage Heroku... on Heroku. It makes bootstrapping a little tricky but helps point out places where things are too tightly coupled.
- Heroku is opinionated about decisions and does their best to force the Right Way on their users. Examples: Getting rid of Varnish and forcing customers to use CDNs for delivery of large static content, and not offering a persistent filesystem.
- "Avoid the disk! EBS is tempting but don't do it. Keep everything in memory."
- They are working towards a disposable compute model where any node can fail at any time. This leads to a very well understood failure domain which makes fixing problems easier (or automated). This is the Netflix "Chaos Monkey" strategy.
- It's easier to throw away nodes with random problems than diagnosing every problem that comes up.
- Doozerd is a consistent distributed data store for storing very important information. Use this or something like it for storing state like master server IPs in a stateless environment.
World IPv6 day - Ian Flynt (Yahoo!)
- "Monitoring isn't the same in a dual-stack [IPv4/IPv6] environment" - Yahoo's health checks didn't know about IPv6 on their first test, so they fell back to the default rotation, which is US datacenters. This led to much slower traffic for customers in Europe and Asia during this test.
- "Don't start something big and risky at a traffic inflection point"
- "Always have more than one way to look at things." - If your monitoring server can't connect to your other datacenters, it may look like everything is down or underperforming when everything is actually going fine
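One way to act on the last point is to compare results from several monitoring vantage points before declaring an outage. The sketch below is my own illustration rather than anything shown in the talk; the function name, the vantage point names, and the three status labels are assumptions.

```python
from typing import Mapping


def classify(checks: Mapping[str, bool]) -> str:
    """Combine reachability results from several vantage points.

    checks maps a (hypothetical) vantage point name to whether that
    monitor could reach the service. If only some monitors fail, the
    problem may be the path between monitor and service, not the service.
    """
    if not checks:
        raise ValueError("need at least one vantage point")
    failures = [name for name, ok in checks.items() if not ok]
    if not failures:
        return "up"
    if len(failures) == len(checks):
        return "down"
    return "partial"


# Only the monitor in one datacenter cannot connect, so this is "partial".
status = classify({"us-east": True, "eu-west": False, "ap-south": True})
```

A "partial" result is the cue to look at the network or the monitoring server itself before paging someone about the application.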
Velocity Culture - Jon Jenkins (Amazon.com)
- Amazon averages a code deploy every 12s, averages 10k hosts receiving a deployment simultaneously, and as of November 10, 2010 every amazon.com web property is running on EC2.
- The success of culture depends on linking it to the business. Doing things just because they are the Right Way won't sell to management, but if developers/operators/etc can show how decreased load time leads to increased conversions, everybody wins.
- People focus on capacity planning which is really just a focus on spending money. It's better to do capacity optimization.
Artur on SSD's - Artur Bergman (Fastly)
Short, sweet, and to the point. SSDs are not that expensive any more, they save massive amounts of power and time, and there is not really a reason not to use them.
State of the Infrastructure - Rachel Chalmers (The 451 Group)
Rachel pointed out an interesting idea: Science Fiction is appealing because of our obsession with tools, while Fantasy is appealing because of our obsession with symbols. Her talk is worth watching as it's humorous and well informed, and I'll keep my summary to one more quote:
Holistic Performance - John Resig (Mozilla Corporation)
John spoke about performance in the jQuery project. It's more than just wall time: battery usage, parse time, number of requests, file size, etc. are also of concern. A few key tidbits:
- You shouldn't drop browser support for performance gains in another, and you shouldn't do things like slow down IE just to make others faster
- Use jsperf
- It doesn't matter how much you unroll a loop if that loop is doing DOM manipulation. Anything that interacts with the DOM is expensive
- Don't compromise code quality in exchange for performance improvements
- It's very hard to create realistic test cases for performance
Lightning Demos Thursday - Michael Schneider (Google), Andreas Grabner (dynaTrace Software), Paul Irish (jQuery Developer Relations), Sergey Chernyshev (truTV)
Lots of short, cool demos on cool technology. There's PageSpeed, which everyone should be using, and dynaTrace, which John Resig and Steve Souders both like, but the biggest things to me were things now available in the Chrome Dev Tools:
- In the task manager -> right click to get more info on a process
- `performance.timing` has timing information about the request in JS
- `performance.memory` is added via --enable-memory-info to get memory usage info in JS
- `window.onerror` - run some method to handle errors on the client, perhaps reporting to web server
- `console.markTimeline()` - Draw a vertical line on the resource timeline for events in your code
- new extension support for audits (usable to maintain standards like no images above 80k)
- heap profiler
- remote debugging via `--remote-debugging-port` - lets you run most debug tools from a separate machine, super useful for mobile devices
ShowSlow also got some updates to pull in custom metrics from Google Analytics, order systems, etc, and it now supports sending events to it via a web service.
Instrumenting the real-time web: Node.js, DTrace and the Robinson Projection - Bryan Cantrill (Joyent, Inc.)
This was a great talk on Node, both as an introduction and as a case study. Bryan, a former Sun employee and author of DTrace, is super sharp and a great speaker. This talk is worth watching the video of! First up, the three core ideas of Node are:
- High-performance VMs
- The system abstractions that God intended "dynamic c"
"If you're hitting GC in node you have a memory leak; You're like a drug addict that has hit rock bottom. If I give you more memory it's going right in your arm."
Joyent uses OS level virtualization instead of a hypervisor to allow introspection with tools like DTrace from dom-0 when all the apps are in containers. The code to run these traces is open source: node-libdtrace. Their leaderd/tickerd solution gave them 700ms latency from a connection to a competitor's app to it appearing on the dashboard. So what map to use to display it? All projections are terrible; Robinson is the best, but it's not actually a projection. Source for that: Robinson
The key takeaway: node.js is perfect for web-facing real-time systems that are hurt by long latency events and not CPU time, i.e. "Data Intensive Real Time" - DIRT
reddit.com War Stories: The mistakes we made and how you can avoid them. - Jeremy Edberg (reddit.com)
Jeremy gave some great insights into Reddit operations on mistakes they've made:
- mistake: relying on a single cloud product and expecting it to work as advertised (avoid EBS for now, or RAID around it)
- mistake: a single EBS volume for a database; they now use 13 disks: 6 pairs spanned and a spare
- mistake: not accounting for increased latency in virtualized environments
- mistake: not using a service based architecture sooner
- mistake: not using a consistent key hashing algorithm at first -> move to Cassandra (Dynamo model for consistent hashing)
- mistake: using bleeding edge software in production (Cassandra 0.7)
- mistake: not having enough monitoring and not having monitoring that is virtualization friendly (Use Ganglia, backed by RRD, not friendly to change)
- they use londiste for replication, which is great and flexible, but doesn't handle problems like a slow disk well
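For anyone who hasn't run into consistent hashing before, here is a minimal sketch of the idea in Python. It is not reddit's or Cassandra's code; the `HashRing` class, the node names, the choice of SHA-256, and the replica count are all assumptions for illustration. The point is that adding a server only moves the keys that land on the new server, instead of reshuffling nearly everything the way `hash(key) % server_count` does.

```python
import bisect
import hashlib


class HashRing:
    """A small consistent hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes, replicas=100):
        self._replicas = replicas  # virtual points per node, smooths the spread
        self._ring = []            # sorted list of (hash, node) points
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(key):
        return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)

    def add(self, node):
        # Place several points for the node around the ring.
        for i in range(self._replicas):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def node_for(self, key):
        # Walk clockwise to the first point at or after the key's hash.
        if not self._ring:
            raise ValueError("ring is empty")
        idx = bisect.bisect_left(self._ring, (self._hash(key), ""))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["cache-a", "cache-b", "cache-c"])
keys = [f"user:{i}" for i in range(1000)]
before = {key: ring.node_for(key) for key in keys}

ring.add("cache-d")
after = {key: ring.node_for(key) for key in keys}
moved = [key for key in keys if before[key] != after[key]]
```

Every key in `moved` now lives on `cache-d`; keys that stayed on the other three servers keep their cached data.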
And some general pointers
- users notice inconsistency and make comments about it (i.e. when replication is lagging and comments don't show up immediately)
- plan for 3 or more of everything whenever you are writing code: datacenters, servers, databases, job queues, etc
- queues are your friend
- treat logged out users as second class (i.e. serve all content as cached on cdn)
- reddit is open source: github.com/reddit
Choose Your Own Adventure 2: Electric Boogaloo ;-) - Adam Jacob (Opscode), Jesse Robbins (Opscode)
The last session of the conference was Adam talking about whatever people wanted him to. The guy who kept asking him to 'talk about something random' but never had a question or topic never got what he was looking for, except lots of laughs from everyone. The best tidbits:
- How marketing works: Marketing brings in leads via campaigns, and sales nurtures them into qualified prospects. This is called 'prosecuting'
- "No Assholes Rule" - positive interactions must outnumber negative ones 5:1. There are three types of assholes: withholders of effort, affectively negative, and interpersonal deviants.
- Things that should be automated: provisioning, DNS, server inventory, configuration management, identity management, version control, monitoring and trending, application deployment
- Sysadmins are software developers with shitty languages. They should learn Scheme.
- Sysadmin to a developer: "You don't care that I got divorced because your crappy code woke me up." Developers must be on call, sysadmins should be escalated to. Ops are responsible for system availability and efficiency, not bugs in code.
- Metrics should tie to money.
- Ops should be saying 'yes' instead of 'no', but making people commit. "If you need 1000 servers racked, I'll rent a bus and we'll take your department to the datacenter to do it."
- "People don't remember the tools used to build great things." Things can only be measured by the final solution. Your best skill is knowing systems and problems, not a specific tool.
- Open Source: We like to look at pretty cars, but we take the ugly ones home and work on them/fix them up/etc. You cannot leapfrog this stewardship process by just throwing an open source license on some code
- Devops is not a job description. You can't 'be' a 'devops'.
- Devops is all inclusive. If someone is not happy or exclusive, then someone is Doing It Wrong.
Whew, that's a lot of words. Velocity 2011 was great, perhaps more fun and useful than 2010, and hopefully I'll get the chance to apply a lot of this and make it out again next year! Already in play are some Feature Flag implementations in a project I'm working on and some browser performance tuning on another. I'm looking forward to a need to use Node.js on something, and there's a handful of new books on my wishlist.