April 4th, 2010
So much of our time these days is spent talking about all the new features & capabilities that people are building into their adserving platforms. One component often neglected in these conversations is scalability.
A hypothetical ad startup
Here’s a pretty typical story. A smart group of folks come up with a good idea for an advertising company. Company incorporates, raises some money, hires some engineers to build an adserver. Given that there are only so many people in the world who have built scalable serving systems the engineering team building said adserver is generally doing this for the first time.
Engineering team starts building the adserver and is truly baffled as to why the major guys like DoubleClick and Atlas haven’t built features like dynamic string matching in URLs or boolean segment targeting (eg, (A+B)OR(C+D)). Man, these features are only a dozen lines of code or so, let’s throw them in! This adserver is going to be pimp!
It’s not just the adserver that is going to be awesome. Why should it ever take anyone four hours to generate a report, that’s so old school. Let’s just do instant loads & 5-minute up to date reporting! No longer will people have to wait hours to see how their changes impacted performance and click-through rates.
The CEO isn’t stupid of course, and asks
“Is this system going to scale guys?”
Responds the engineering manager. We’re using this new thing called “cloud computing” and we can spool up equipment near instantly whenever we need it, don’t worry about it!
And so said startup launches with their new product. Campaign updates are near instant. Reporting is massively detailed and almost always up to date. Ads are matched dynamically according to 12 parameters. The first clients sign up and everything is humming along nicely at a few million impressions a day. Business is sweet.
Then the CEO signs a new big deal… a top 50 publisher wants to adopt the platform and is going to go live next week! No problem, let’s turn on a few more adservers on our computing cloud! Everything should be great.. and then…
New publisher launches and everything grinds to a halt. First, adserving latency sky-rockets. Turns out all those fancy features work great when running 10 ads/second but at 1000/s — not so much. Emergency patches are pushed out that rip out half the functionality just so that things keep running. Yet, there’s still weird unexplainable spikes in latency that can’t be explained.
Next all the databases start to crash with the new load of added adservers and increased volume. Front-end boxes no longer receive campaign updates anymore because the database is down and all of a sudden nothing seems to work anymore in production. Reports are now massively behind… and nobody can tell the CEO how much money was spent/lost in over 24 hours!
Oh crap… what to tell clients…
Yikes — Why?
I would guess that 99% of the engineers who have worked at an ad technology company can commiserate with some or all of the above. The thing is, writing software that does something once is easy. Writing software that does the same thing a trillion times a day not quite so much. Trillions you ask… we don’t serve trillions of ads! Sure, but don’t forget for any given adserver you will soon be evaluating *thousands* of campaigns. This means for a billion impressions you are actually running through the same dozen lines of code trillions of times.
Take for example boolean segment targeting — the idea of having complex targeting logic. Eg, “this user is in both segments A and B OR this user is in segments C and D”. From a computing perspective this is quite a bit more complicated than just a simple “Is this user in segment A”? I don’t have exact numbers on me, but imagine that the boolean codetakes about .02ms longer to compute on a single ad impression when written by your average engineer. So what you say, .02ms is nothing!
In fact, most engineers wouldn’t even notice the impact. WIth only 50 campaigns the total impact of the change is a 1ms increase in processing time — not noticeable. But what happens when you go from 50 campaigns to 5000? We now spend 100ms per ad-call evaluating segment targeting — enought to start getting complaints from clients about slow adserving. Not to mention the fact that each CPU core can now only process 10 ads/second versus the 1000/s it used to be able to do. This means to serve 1-billion ads in a day I now need 3,000 CPU cores at peak time –> or about 750 servers. Even at cheap Amazon AWS prices that’s still about $7k in hosting costs per day.
Optimizing individual lines of code isn’t the only thing that matters though. How systems interact, how log data is shipped back and forth and aggregated, how updates are pushed to front-end servers, how systems communicate, how systems are monitored … every mundane detail of ad-serving architecture gets strained at internet scale.
Separating the men from the boys…
What’s interesting about today’s market is that very few of the new ad technologies that are entering the market have truly been tested at scale. If RTB volumes grow as I expect they will throughout this year we’ll see a lot of companies struggling to keep up by Q4. Some will outright fail. Some will simply stop to innovate — only a few will manage to continue to both scale and innovate at the same time.
Don’t believe me? Look at every incumbent adserving technology. DoubleClick, Atlas, Right Media, MediaPlex, OAS [etc.] — of all of the above, only Google has managed to release a significant improvement with the updated release of DFP. Each of these systems is stuck in architecture hell — the original designs have been patched and modified so many times over that it’s practically impossible to add significant new functionality. In fact, the only way Google managed to release an updated DFP in the first place was by completely rebuilting the entire code base from scratch into the Google frameworks — and that took over two years of development.
I’ll write a bit more on scalability techniques in a future post!
- Scalability Follow-Up — The challenge customers impose on innovation
- Are you generating revenue?
- Redirects and Integration, Part II: Hacking Around the Browser
- RTB Serving Speed
- Can’t we all just 302? Report on redirect timings