In webapp circles, denormalizing vs normalizing is an age-old debate. Normalization is the classical approach you learned in Databases 101, or from that Oracle graybeard back in the day. Denormalization is the new kid on the block, child of web scale and NoSQL and 100x read/write ratios.
There’s still plenty of debate, but denormalizing is now a standard technique for scaling a modern webapp, especially when combined with a NoSQL datastore, and for good reason: it works. A new caveat occurred to me the other day, though. Denormalizing isn’t so useful when you’re small or medium-sized, of course, since scaling isn’t a problem yet and denormalizing takes work. However, denormalizing may also be a bad idea when you’re extremely large.
At a high level, denormalizing buys faster reads by doing more writes up front. This works up to a point, but beyond that it can get out of hand. This writeup of Raffi Krikorian’s presentation on scaling Twitter illustrates it well:
These high fanout users are the biggest challenge for Twitter. Replies are being seen all the time before the original tweets for celebrities… If it takes minutes for a tweet from Lady Gaga to fan out, then people are seeing her tweets at different points in time… Causes much user confusion.
[Twitter is] not fanning out the high value users anymore. For people like Taylor Swift, don’t bother with fanout anymore, instead merge in her timeline at read time. Balances read and write paths. Saves 10s of percents of computational resources.
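To make that tradeoff concrete, here’s a minimal sketch of the hybrid approach in Python: fan out on write for ordinary accounts, merge high-fanout authors in at read time. It’s a toy under my own assumptions, not Twitter’s actual design — the in-memory dicts standing in for a datastore, the FANOUT_THRESHOLD value, and the function names are all invented for illustration.

```python
import heapq
from collections import defaultdict

FANOUT_THRESHOLD = 10_000  # hypothetical cutoff for "high fanout" authors

followers = defaultdict(set)           # author -> set of follower ids
tweets_by_author = defaultdict(list)   # author -> [(timestamp, text)]  (normalized)
timelines = defaultdict(list)          # user -> [(timestamp, text)]    (denormalized)

def post_tweet(author, timestamp, text):
    """Write path: denormalize (fan out) unless the author is too big."""
    tweets_by_author[author].append((timestamp, text))
    if len(followers[author]) < FANOUT_THRESHOLD:
        # Ordinary accounts: pay many writes now so reads are a single lookup.
        for follower in followers[author]:
            timelines[follower].append((timestamp, text))
    # High-fanout accounts: skip fanout entirely; their tweets get merged
    # into followers' timelines at read time instead.

def read_timeline(user, following):
    """Read path: precomputed timeline plus a read-time merge of big authors."""
    big_authors = [a for a in following if len(followers[a]) >= FANOUT_THRESHOLD]
    merged = list(heapq.merge(
        sorted(timelines[user]),
        *(sorted(tweets_by_author[a]) for a in big_authors),
    ))
    return merged[-50:]  # newest 50, say

# Example: alice follows bob (small) and gaga (huge).
followers["bob"] = {"alice"}
followers["gaga"] = {f"user{i}" for i in range(20_000)} | {"alice"}
post_tweet("bob", 1, "hi")       # fans out: lands in alice's timeline immediately
post_tweet("gaga", 2, "hello")   # no fanout: merged in when alice reads
print(read_timeline("alice", ["bob", "gaga"]))  # [(1, 'hi'), (2, 'hello')]
```

The write path is where the denormalization cost lives: one tweet from an account with a million followers becomes a million timeline writes unless you opt it out of fanout, while the read path only pays a small merge for the handful of big accounts a given user follows.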
So is there a sweet spot for denormalizing, if not for whole applications or datasets, then maybe at the level of individual queries or objects? It seems like it might follow an inverted U-shaped graph, with a sweet spot somewhere in the range of 10 to 1000 writes per operation. It’s still doable outside that range, but the cost-benefit tradeoff may start to go downhill. If the old adage is “normalize til it hurts, then denormalize til it works,” we may need to add a third step: renormalize when it starts hurting again.
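For a rough sense of where that curve bends, here’s a back-of-envelope comparison. The cost model and every number in it are made up for illustration: one timeline write per follower per tweet on the denormalized path, one extra read per merged author per timeline load on the normalized one.

```python
def writes_per_tweet(follower_count):
    # Denormalized: every follower's timeline gets a copy of the tweet.
    return follower_count

def extra_reads_per_load(big_authors_followed):
    # Normalized for big authors only: merge their tweets in at read time.
    return big_authors_followed

# A typical account, ~500 followers: 500 writes per tweet. Comfortably in
# the sweet spot, and every read is a single precomputed-timeline lookup.
print(writes_per_tweet(500))            # 500

# A celebrity account, 50M followers: 50,000,000 writes per tweet, and the
# last follower doesn't see it until the whole fanout finishes.
print(writes_per_tweet(50_000_000))     # 50000000

# Handling that celebrity at read time instead: a follower of 3 such
# accounts pays roughly 3 extra reads per timeline load. That's the trade.
print(extra_reads_per_load(3))          # 3
```

Past some follower count, the write side of that ledger dwarfs the read side, which is exactly when re-normalizing the hot objects starts to look attractive.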