The Challenge of Scaling an Adserver

So much of our time these days is spent talking about all the new features & capabilities that people are building into their adserving platforms. One component often neglected in these conversations is scalability.

A hypothetical ad startup

Here’s a pretty typical story. A smart group of folks comes up with a good idea for an advertising company. Company incorporates, raises some money, hires some engineers to build an adserver. Given that there are only so many people in the world who have built scalable serving systems, the engineering team building said adserver is generally doing so for the first time.

The engineering team starts building the adserver and is truly baffled as to why the major guys like DoubleClick and Atlas haven’t built features like dynamic string matching in URLs or boolean segment targeting (e.g., (A+B)OR(C+D)). Man, these features are only a dozen lines of code or so, let’s throw them in! This adserver is going to be pimp!

It’s not just the adserver that is going to be awesome. Why should it ever take anyone four hours to generate a report? That’s so old school. Let’s just do instant loads and five-minute up-to-date reporting! No longer will people have to wait hours to see how their changes impacted performance and click-through rates.

The CEO isn’t stupid, of course, and asks:

    “Is this system going to scale, guys?”

    “Of course,” responds the engineering manager. “We’re using this new thing called ‘cloud computing’ and we can spool up equipment near instantly whenever we need it. Don’t worry about it!”

And so said startup launches with their new product. Campaign updates are near instant. Reporting is massively detailed and almost always up to date. Ads are matched dynamically according to 12 parameters. The first clients sign up and everything is humming along nicely at a few million impressions a day. Business is sweet.

Then the CEO signs a big new deal… a top-50 publisher wants to adopt the platform and is going to go live next week! No problem, let’s turn on a few more adservers on our computing cloud! Everything should be great… and then…

KABLOOOOOOOOEY

The new publisher launches and everything grinds to a halt. First, adserving latency skyrockets. Turns out all those fancy features work great when running 10 ads/second, but at 1000/s — not so much. Emergency patches are pushed out that rip out half the functionality just to keep things running. Yet there are still weird spikes in latency that nobody can explain.

Next, the databases start to crash under the load of the added adservers and increased volume. Front-end boxes no longer receive campaign updates because the database is down, and all of a sudden nothing in production seems to work. Reports are now massively behind… and nobody can tell the CEO how much money was spent (or lost) over more than 24 hours!

Oh crap… what to tell clients…

Yikes — Why?

I would guess that 99% of the engineers who have worked at an ad technology company can commiserate with some or all of the above. The thing is, writing software that does something once is easy. Writing software that does the same thing a trillion times a day, not quite so much. Trillions, you ask? We don’t serve trillions of ads! Sure, but don’t forget that any given adserver will soon be evaluating *thousands* of campaigns per impression. This means that for a billion impressions you are actually running through the same dozen lines of code trillions of times.

Take for example boolean segment targeting — the idea of having complex targeting logic, e.g. “this user is in both segments A and B, OR this user is in segments C and D”. From a computing perspective this is quite a bit more complicated than a simple “Is this user in segment A?”. I don’t have exact numbers on me, but imagine that the boolean code takes about 0.02ms longer to compute on a single ad impression when written by your average engineer. So what, you say: 0.02ms is nothing!
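
To make that concrete, here is a minimal sketch of naive per-impression boolean evaluation (Python, with a made-up data model; this is illustrative, not any particular adserver’s actual code):

    # Naive per-impression targeting evaluation (illustrative sketch).
    # Hypothetical data model: each campaign carries a boolean expression
    # tree over segment IDs; user_segments is the set of segment IDs read
    # off the incoming cookie.

    def matches(expr, user_segments):
        """Recursively evaluate one campaign's targeting expression."""
        op = expr[0]
        if op == "segment":                  # leaf: is the user in segment X?
            return expr[1] in user_segments
        if op == "and":
            return all(matches(e, user_segments) for e in expr[1:])
        if op == "or":
            return any(matches(e, user_segments) for e in expr[1:])
        raise ValueError("unknown operator: %r" % op)

    # "(A and B) OR (C and D)" from the example above:
    targeting = ("or",
                 ("and", ("segment", "A"), ("segment", "B")),
                 ("and", ("segment", "C"), ("segment", "D")))

    # The hot loop: matches() runs for every campaign on every impression,
    # so each extra fraction of a millisecond is multiplied by thousands
    # of campaigns and billions of impressions.
    def eligible_campaigns(campaigns, user_segments):
        return [c for c in campaigns if matches(c["targeting"], user_segments)]

Nothing here is wrong, exactly; it’s just that every branch, tuple unpack, and recursive call lands on the critical path of every single ad call.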

In fact, most engineers wouldn’t even notice the impact. With only 50 campaigns the total impact of the change is a 1ms increase in processing time — not noticeable. But what happens when you go from 50 campaigns to 5,000? We now spend 100ms per ad call evaluating segment targeting — enough to start getting complaints from clients about slow adserving. Not to mention the fact that each CPU core can now only process 10 ads/second versus the 1,000/s it used to be able to do. This means that to serve 1 billion ads in a day I now need 3,000 CPU cores at peak time, or about 750 servers. Even at cheap Amazon AWS prices that’s still about $7k in hosting costs per day.
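
If you want to check my math, the arithmetic is simple enough to script. Every input below is an assumed figure from this post, not a measurement; the four-cores-per-server and $0.40-per-server-hour numbers in particular are illustrative assumptions:

    # Back-of-the-envelope check of the numbers above.
    extra_ms_per_campaign = 0.02    # assumed cost of naive boolean targeting
    campaigns = 5000
    ms_per_ad_call = extra_ms_per_campaign * campaigns  # = 100 ms per ad call
    ads_per_core_per_sec = 1000.0 / ms_per_ad_call      # = 10 ads/s per core

    peak_qps = 30000                # rule of thumb: ~1 billion/day peaks near 30k/s
    cores = peak_qps / ads_per_core_per_sec             # = 3,000 cores
    servers = cores / 4                                 # = 750 assumed quad-core boxes
    hosting_per_day = servers * 24 * 0.40               # assumed $0.40/server-hour
    print(round(hosting_per_day))                       # ~7,200 dollars/day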

Optimizing individual lines of code isn’t the only thing that matters, though. How systems interact, how log data is shipped back and forth and aggregated, how updates are pushed to front-end servers, how everything is monitored… every mundane detail of ad-serving architecture gets strained at internet scale.

Separating the men from the boys…

What’s interesting about today’s market is that very few of the new ad technologies entering the market have truly been tested at scale. If RTB volumes grow as I expect they will throughout this year, we’ll see a lot of companies struggling to keep up by Q4. Some will outright fail. Some will simply stop innovating — only a few will manage to continue to both scale and innovate at the same time.

Don’t believe me? Look at every incumbent adserving technology: DoubleClick, Atlas, Right Media, MediaPlex, OAS [etc.]. Of all of the above, only Google has managed to release a significant improvement, with the updated release of DFP. Each of these systems is stuck in architecture hell — the original designs have been patched and modified so many times over that it’s practically impossible to add significant new functionality. In fact, the only way Google managed to release an updated DFP in the first place was by completely rebuilding the entire code base from scratch on Google’s frameworks — and that took over two years of development.

I’ll write a bit more on scalability techniques in a future post!

Comments:

  • http://www.openx.org John Linden

    Mike,

    Great to see someone post this. We have a unique view of the world because we offer both an enterprise hosted (SaaS) service and a downloadable (free, open-source) product. We get flooded with people who believe that running a few billion ads through an open-source download sitting on a $50/month web host is a piece of cake — until they can’t scale, have system/data issues, get hacked, etc. Then we usually get blamed for offering a product that doesn’t work for them! :)

    Like AppNexus, we have poured millions into our internal SaaS infrastructure. I agree 100% with you that a lot of “RTB” companies will struggle to keep up by Q4 — we have already witnessed plenty of blow-ups at companies that believed they could handle the volumes.

    One thing you didn’t mention is the compounding effect of dealing not only with scale but also with the complexities of multiple geographic locations, redundancy, etc. Will definitely make for an interesting 2010!

  • http://www.eyeblaster.com Eldad Persky

    Very relevant and timely discussion. With the inflation of ad startups which try to “throw in” ad serving, one might be tempted to think there isn’t any sophistication to it. In fact, any serious company in this space can testify to the amount of investment going into infrastructure. As described above, we too learned that scaling up sometimes requires starting from scratch. The advantage is that once you have new infrastructure you can innovate much more quickly. When we developed MediaMind, halfway through the process of rebuilding the infrastructure we found that we were able to add new features twice as fast as we used to. It is risky to make such big investments in large projects, but it can pay off big-time.

  • http://SteveBurris.com Steve Burris

    Fascinating article and discussion; it leaves me feeling the need to keep things as simple as possible.

    I have a friend who has a newish network doing 85 million impressions a day.

    Just for discussion’s sake, let’s say I wanted to start a similar network and, being short-sighted, never intended to grow beyond 100 million impressions per day (tops).

    What would you recommend using for that that’s ready-made (nothing custom)?

  • DS

    “…Take for example boolean segment targeting — the idea of having complex targeting logic. E.g., ‘this user is in both segments A and B OR this user is in segments C and D’…”

    I completely agree — if an engineer/architect wants to do this in real time, it probably does not speak too highly of their technical skills. And I am not sure I agree that this is actually what the better-architected ad servers do, or that this particular use case can be cited as the cause of increased latency. The trick here is to simply serve an ad based on the incoming cookie ID — which ad to serve, which segment to allocate this cookie to, which set of creatives to serve within the ad (if it is a dynamic ad) — these are all offline decisions, and there are plenty of tools to do that kind of work (SAS Enterprise Miner, Hadoop et al.). Having engineers/architects who attempt these things in real time is a great way to head to the dead pool. That being said, what building scalable ad servers requires is not industry veterans who have spent their lives doing these things (and yes, of course, they are few and far between) but just smart, forward-thinking coders and a lot of refactoring! Hardly rocket science, I say.

  • CDConner

    New ideas in a stale product category are risky but often reap great rewards. There are a few things that innovative and pioneering companies should always do to ensure that the KABLOOOOOOOOEY moments are kept at bay.

    First, they need to do comprehensive load testing. This isn’t difficult, but it takes expertise, and a good QA team can set it up with automation. The QA team, NOT the engineers who built the product. Pushing your production environment to the very edge of its capabilities and going through as many configuration and use scenarios as possible will not only show you the boundaries of your current capabilities but will usually open your eyes to other weaknesses and inefficiencies in your code. There are many companies that will drive dummy traffic to your system to help you determine your scalability, and while it isn’t exactly the same as “real” real-time traffic, this is a must.

    The second thing these ad technology companies should do is be realistic about what types of clients they’re willing to take on in the first year or so as they continue to build out their product. Yes, it would be fantastic to land a large client that takes you to break-even sooner than expected, but the safer (and smarter) play is to go with as many small to medium-sized clients as possible for the first year. This will place a burden on your client services staff, but it’s easier to increase the size of that group than to let down a large partner and tarnish your name within the space.

    The last suggestion I have, and it’s maybe more important than the first two, is you need to specialize. At least in the beginning. Don’t try to be everything to everyone in the ad world. It took DoubleClick, Atlas and others years and armies of engineers and experience to build their products. Why try to create a system to serve everyone? You’re just asking for the pain train to come rolling over you and your engineering staff and to take you in twenty different directions based on the needs of different types of marketers. Focus on one area that you think you can do well and dominate and then work outward from there.

  • Mike

    Hi Steve,

    If indeed your volumes will never exceed 100M/day, I think most solutions will generally work out alright. My general rule of thumb is that 1B/day is 30k/second at peak time — so 100M/day is about 3k/second.

    Which solution to choose isn’t a question to be answered in a blog comment =) — that greatly depends on the type of network you are trying to build and what technology requirements you have!

    -Mike

  • http://www.aggregateknowledge.com David Jakubowski

    Mike, great article. I’ve had the pleasure (read sarcastically) of only ever working on next-gen adserving platform teams in my career. The one universal truth: we’ve made some scale mistake in every one of them. From Juno to Quigo to Microsoft… without fail, you never know all of what you needed to know. I would like to think we made new and interesting mistakes all along.
    I’ve also had the pleasure of working with guys like Oded Itzhak, Yaron Galai, Brian Burdick, Tarek Najm, Vikhas Jha, and most recently Paul Martino and Rob Gryzwinski, who have made careers out of big-scale technologies. Even these guys, with oodles of big-scale experience, run into problems. This is truly a space that separates the “men from the boys.” I can tell you, I recommend looking for scale first, features second. I would rather have a guy with scale experience than a guy who knows every ad bell and whistle imaginable.

    We’re already seeing companies fall down on scale all around us, and I anticipate many more in 2010 alone. Exchanges are a great example. All this talk about RTB, and I’ve yet to see anyone in the ecosystem (that’s right: no one) do it for real. All I see today are trumped-up re-targeting systems. I don’t think the real capabilities have yet hit the market… and they likely won’t at scale until 2011. Show me the system processing 100,000 requests/sec with real logic sifting through inventory, making dynamic decisions on multiple data points (yes, that means more than 1), and I’ll show you a bunch of tired, over-caffeinated platform veterans on their 3rd or 4th platform.

    Great article!

  • http://www.adjuggler.com John Shomaker

    Mike, totally agree. The decline in CPMs over the past 24 months highlights your point: ad serving technology companies have cut back the R&D needed to transition their platforms to higher scalability, and new media entrants have been lured by low-to-zero-priced ad servers to maintain media margins. The challenge with both situations is latency, customer loss, and ultimately much higher marginal costs. Fortunately or unfortunately, AdJuggler hasn’t had the luxury of DoubleClick’s coffers and, as such, has built a highly scalable yet small-footprint platform. Many of our publisher and network clients suggest that a given impression volume runs on our cluster at about 1/5 the footprint of other providers. To Mike’s point, building for scale — at the OS, data, Java, and UI layers — is not trivial, and is a key driver of why such platform segments will consolidate like utility companies. Perhaps more importantly, as CPMs have dropped, the sheer TCO — hardware, software, developers, backup, bandwidth, and other labor — makes the case that no media player should build or manage their own platform.


  • http://graphsofwrath.net nitsua ttocs

    Good article, Mike… not sure about the “men from the boys” title… is scale what is meant to separate them, or are you going for a Boyz II Men reunion tour? :)

    My view on the industry is that what separates businesses that succeed from ones that don’t is maturity combined with execution. Good ideas are great, but they don’t get sold easily, and I agree scale is important.

    Look forward to more posts!


  • http://www.viraladnetwork.net/blog/ Tim Wintle

    Hi Mike,

    Re: the (A^B)v(C^D) issue — to me it’s actually a question of on- vs. off-line processing.

    The final logical (A && B) || (C && D) is only going to end up as one extra instruction anyway (assuming the operands are all in registers) — far less than the time required to parse a cookie or HTTP header.

    The most important thing I’ve found, and something that took me a while to convince others of, is that you can give up synchronisation. E.g., if you’ve got quotas, you don’t want to be taking a lock on every request. It’s fine to have a delay in synchronisation and soak up the cost of the _tiny_ fraction of a percent extra that may be delivered, at an incorrect price, etc. (see the sketch at the end of this comment).

    All that said, I’m working at what’s your low end; >1K/s is a fairly big spike for me.

    (I just came across your blog by the way – it’s a breath of fresh air to find so many people working on serving ads commenting on one blog)
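
    To make the synchronisation point concrete, what I have in mind is something like this hypothetical sketch (Python for illustration; the names are made up and it is not anyone’s production code):

        # Hypothetical sketch: lock-free quota counting on the hot path,
        # with periodic reconciliation instead of a lock per request.
        import threading

        local = threading.local()       # per-thread delivery counts, no locking
        global_lock = threading.Lock()
        global_delivered = {}           # campaign_id -> count, synced periodically

        def record_delivery(campaign_id):
            counts = getattr(local, "counts", None)
            if counts is None:
                counts = local.counts = {}
            counts[campaign_id] = counts.get(campaign_id, 0) + 1  # hot path: no lock

        def flush_local_counts():
            # Called every few seconds per thread. Quotas may overshoot by
            # whatever was delivered between flushes; that tiny overage is
            # the cost you choose to soak up.
            counts = getattr(local, "counts", {})
            with global_lock:
                for cid, n in counts.items():
                    global_delivered[cid] = global_delivered.get(cid, 0) + n
            local.counts = {}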

  • http://www.tumri.com Pradeep Javangula

    Hi Mike,

    Well written post.

    I am going to get a bit math-geeky on you! :-)

    The problem in dynamic ad construction, as well as in optimization with algorithmic feedback loops at its core, boils down to “set computations” with a notion of relevance ranking and scoring. In general, the problem in real-time bidding is isomorphic to the dynamic ad selection problem, modulo some plumbing. It is the problem of matching a high-dimensional incoming vector against a set of high-dimensional content/document vectors. The constraints for selection are often expressed in SQL-like constructs with keyword-search-like semantics. In the literature these are often called Non-Ordered Discrete Data Spaces, where standard Euclidean geometry does not work. And standard SQL-oriented databases don’t work at all, in terms of the scale of computation and the flexibility demands. We found the best approach to be a new mechanism that does the absolute minimum of computation at request time, with the business rules for marketing encoded offline.

    Having fallen down a few times implementing parametric search engines (in such discrete data environments) in past lives, our team has solved a substantial part of the retrieval-scale problem in ad serving. We have had to run at volumes approaching 30K qps with response latencies of less than 1ms.

    One surprising burst of unexpected volume came from RightMedia — we hit such peak volumes within the first 2 seconds at the top of the hour. It was scary and fun.

    It is also important to point out to the brave souls attempting this sort of build-out that the scale of all of your other systems — logging, data aggregation, analytics — is not to be treated lightly either. Everything gets stressed significantly. There are substantial challenges in horizontal scalability on those fronts as well.

    There are some folks who appear to think that throwing hardware into a cluster will solve all problems. With battle scars to prove it: not really. You still have to look at your own computational pipeline and address bottlenecks one at a time. And nothing pays off like a great algorithm — but you know that already. Long live D.E. Knuth! :-)

    –pradeep

  • http://www.twitter.com/mtanski Milosz Tanski

    I spent a lot of time at Admeld discussing (erm, arguing) with the folks here about what seems at times like minute details of features and how they should be implemented (or how we need to go about things differently). Good designs can be scaled and grown rather than needing a rewrite in the future.

    There’s a lot to be said for doing as much work as possible ahead of time (offline) and sharing the upfront cost. But there’s absolutely no reason you shouldn’t be able to do a ridiculous amount of user targeting in real time without the CPU even breaking a sweat.

    I don’t think there’s anyone out there with a well-designed adserver who’s run into real machine limits — limits where the bottleneck is page faults or cache misses. Including us here.

    Lastly, it is my opinion that if you can’t get faster at what you do as you grow (request- and data-wise), then you’re not doing it right. In fact, as we’ve grown and added more features, our systems have actually gotten faster. We’re now able to serve more qps per machine (by an order of magnitude) than we used to.

