How Discord stores trillions of messages (2023)

435 points by jakey_bakey 9 months ago

foobazgt 9 months ago

This blog post seems to blame GC heavily, but if you look back at their earlier blog post [0], it seems to be more shortcomings in either how they're using Cassandra or how Cassandra handles heavy deletes, or some combination:

"It was at that moment that it became obvious they deleted millions of messages using our API, leaving only 1 message in the channel. If you have been paying attention you might remember how Cassandra handles deletes using tombstones (mentioned in Eventual Consistency). When a user loaded this channel, even though there was only 1 message, Cassandra had to effectively scan millions of message tombstones (generating garbage faster than the JVM could collect it)."

And although the blog post talks about GC tuning, there's mention here [1] that they didn't do much tuning and were actually running on an old version of Cassandra (and presumably JVM) - having just switched over from CMS (!).

  0) https://discord.com/blog/how-discord-stores-billions-of-messages
  1) https://news.ycombinator.com/item?id=33136453
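
The tombstone scan described in the quoted passage can be sketched with a toy model. This is not Cassandra's actual read path: `Cell`, `read_partition`, and `compact` are hypothetical names, and a partition is modeled as a plain Python list in clustering order. It only illustrates why a read that returns one live message can still touch millions of deleted entries until a compaction purges them.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Cell:
    message_id: int
    body: Optional[str]  # None marks a tombstone (a deleted message)


def read_partition(cells: List[Cell], limit: int) -> Tuple[List[Cell], int]:
    """Return up to `limit` live cells plus the number of tombstones scanned."""
    live: List[Cell] = []
    tombstones_scanned = 0
    for cell in cells:  # cells are assumed to be in clustering order
        if cell.body is None:
            tombstones_scanned += 1  # still has to be read and skipped
            continue
        live.append(cell)
        if len(live) == limit:
            break
    return live, tombstones_scanned


def compact(cells: List[Cell]) -> List[Cell]:
    """Toy compaction: rewrite the partition without its tombstones."""
    return [cell for cell in cells if cell.body is not None]


# A channel where one million messages were deleted and a single one remains.
channel = [Cell(i, None) for i in range(1_000_000)]
channel.append(Cell(1_000_000, "the only message left"))
```

In this model, `read_partition(channel, 50)` walks past all one million tombstones to return a single message, while the same read against `compact(channel)` scans none.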

Aeolun 9 months ago

But then it’s still nice that they’re using ScyllaDB and now it’s not a concern at all right?
Even if they were using their original solution wrong, I think the solution that cannot be used wrong is superior.
- ericvolp12 9 months ago
  
  The funny part is ScyllaDB still uses tombstones for deletions, though they do have configurable compaction strategies and iirc Discord uses Scylla's Incremental Compaction Strategy that I suppose solves the specific issue they were dealing with. iirc that compaction strategy will trigger a compaction once a certain threshold of a partition is tombstones and then the table is rebuilt without the tombstoned content (which effectively pauses writes on that specific node and that specific table and partition for the duration of that process). Compacting a massive partition is really expensive. Scylla defaults to warning you that a partition is too large if it has at least 100,000 rows in it. My guess is when they moved to ScyllaDB they also adopted a new strategy for partitioning messages in a channel that keeps partition sizes reasonable so compactions don't take a super long time.
  - jhgg 9 months ago
    
    We did not change schema or partitioning strategy.
  - sroussey 9 months ago
    
    Good default configurations can mean quite a lot if people don’t tune them.
- roenxi 9 months ago
  
  I don't see anything here that looks untoward. They increased their data storage by 3 orders of magnitude and decided to use a different DB system. Fair enough, maybe they've learned more about the nature of their data.
  But that logic isn't sound. When dealing with huge amounts of data there are going to be trade-offs. Picking a system that makes different trade-offs to an existing system is not automatically helpful. Yes you don't have the old problems. However, you are about to discover new problems. There is always something of a gamble around which will be more of a problem to your business.
- frr149 9 months ago
  
  What's the problem with Scylla? Honest question, BTW
vips7L 9 months ago

> having just switched over from CMS (!)
This is really interesting. CMS was deprecated in Java 9, when G1GC became the default collector, and removed in Java 14. They were probably running an antiquated Java 8 or 11 runtime. So that means that in 2022 they were either running a 4 year old Java 11 runtime or an 8 year old Java 8 runtime. They were really leaving a lot of performance on the table.
- gorset 9 months ago
  
  They could also have gone the commercial route and gotten Zing with their pauseless GC. It’s been around forever and they even cover Cassandra in their marketing.
  https://www.azul.com/technologies/cassandra/
  - pebal 9 months ago
    
    This is not pauseless GC.

leetrout 9 months ago

Needs (2023)

That services layer reminds me of a big, fancy, distributed Varnish Cache... they don't mention caching and they chose the word coalesce so I assume it doesn't do much actual caching. But it made me think of Varnish's "grace mode" and its use to prevent the thundering herd problem (which is where I first heard of 'request coalescing') https://varnish-cache.org/docs/6.1/users-guide/vcl-grace.htm...
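
For anyone who hasn't run into request coalescing, here is a minimal sketch of the idea, assuming a thread-based service. `Coalescer`, its `fetch` callback, and the `coalesced` counter are hypothetical names for illustration, not anything from Discord's data services or from Varnish. Concurrent requests for the same key share one in-flight backend call instead of each hitting the database.

```python
import threading
from concurrent.futures import Future


class Coalescer:
    """Share one in-flight fetch among concurrent callers asking for the same key."""

    def __init__(self, fetch):
        self._fetch = fetch      # hypothetical backend call: key -> value
        self._lock = threading.Lock()
        self._in_flight = {}     # key -> Future for the fetch in progress
        self.coalesced = 0       # how many callers piggybacked on another fetch

    def get(self, key):
        with self._lock:
            future = self._in_flight.get(key)
            leader = future is None
            if leader:
                future = Future()
                self._in_flight[key] = future
            else:
                self.coalesced += 1
        if leader:
            # Only the first caller for this key talks to the backend.
            try:
                future.set_result(self._fetch(key))
            except Exception as exc:
                future.set_exception(exc)
            finally:
                with self._lock:
                    del self._in_flight[key]
        # Followers block here until the leader finishes; errors propagate to all.
        return future.result()
```

Note this does no caching at all: once the fetch completes, the next request for that key goes to the backend again, which matches the guess above that coalescing and caching are separate concerns.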

Also love to see consistent hashing come up again and again. It's a great piece of duct tape that has proven useful in many similar situations. If you know where something should be then you know where everything is gonna come look for it!
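
A rough sketch of the consistent hashing idea, for reference. `HashRing`, `node_for`, and the node names are made up for illustration, and the ring uses SHA-256 with virtual nodes, which is an assumption and not a description of how Discord routes requests. The useful property is that every caller computes the same owner for a key, and removing a node only moves the keys that node owned.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=64):
        if not nodes:
            raise ValueError("at least one node is required")
        points = []
        for node in nodes:
            for i in range(vnodes):
                # Each node gets several points on the ring to even out load.
                points.append((self._hash(f"{node}#{i}"), node))
        points.sort()
        self._hashes = [h for h, _ in points]
        self._owners = [n for _, n in points]

    @staticmethod
    def _hash(value):
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def node_for(self, key):
        """Return the node owning the first ring point clockwise from the key."""
        idx = bisect.bisect_right(self._hashes, self._hash(key))
        return self._owners[idx % len(self._owners)]
```

For example, `HashRing(["node-a", "node-b", "node-c"]).node_for("channel:42")` returns the same node no matter which process asks, so all requests for that channel land in one place, which is what makes coalescing them possible.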

loloquwowndueo 9 months ago

Coalescing and “origin shielding” tend to be more common terms for that - I’ve never heard of “grace” until today :)
- atombender 9 months ago
  
  Varnish does call it coalescing. Grace is used for a specific situation: When a previously cached object has expired, Varnish won't evict it from the cache immediately, but will continue to serve the old content, while sending exactly 1 request to the background to refetch. How long an object can live after expiring is called the grace. The HTTP standard calls this behaviour "stale-while-revalidate".
mnutt 9 months ago

Grace mode itself doesn’t prevent thundering herd; varnish coalesces all requests automatically and grace mode is used to increase the likelihood of clients receiving cached (albeit stale) responses.
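
To make the distinction concrete, here is a minimal sketch of grace-style serving (what HTTP calls stale-while-revalidate). `GraceCache` and its parameters are hypothetical, and this is not how Varnish implements it. Fresh objects are served directly, expired objects inside the grace window are served stale while a single background refresh runs, and anything older is fetched synchronously.

```python
import threading
import time


def _start_thread(fn):
    threading.Thread(target=fn, daemon=True).start()


class GraceCache:
    """Serve stale entries within a grace window while one refresh runs."""

    def __init__(self, fetch, ttl, grace, clock=time.monotonic,
                 run_in_background=_start_thread):
        self._fetch = fetch                    # hypothetical origin call: key -> value
        self._ttl = ttl                        # seconds an entry counts as fresh
        self._grace = grace                    # extra seconds it may be served stale
        self._clock = clock
        self._run_in_background = run_in_background
        self._entries = {}                     # key -> (value, fetched_at)
        self._refreshing = set()               # keys with a refresh in progress
        self._lock = threading.Lock()

    def get(self, key):
        now = self._clock()
        value = None
        start_refresh = False
        with self._lock:
            entry = self._entries.get(key)
            if entry is not None:
                value, fetched_at = entry
                age = now - fetched_at
                if age > self._ttl + self._grace:
                    entry = None               # too old even for grace
                elif age > self._ttl and key not in self._refreshing:
                    self._refreshing.add(key)  # exactly one refresh per key
                    start_refresh = True
        if entry is None:
            return self._store(key)            # miss: fetch synchronously
        if start_refresh:
            self._run_in_background(lambda: self._refresh(key))
        return value                           # fresh, or stale within grace

    def _store(self, key):
        value = self._fetch(key)
        with self._lock:
            self._entries[key] = (value, self._clock())
        return value

    def _refresh(self, key):
        try:
            self._store(key)
        finally:
            with self._lock:
                self._refreshing.discard(key)
```

Note that concurrent misses in this sketch are not coalesced; that would be a separate mechanism, which is the point being made above.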
hinkley 9 months ago
Nginx always more businesslike.
```
    proxy_cache_use_stale updating;
```
dang 9 months ago

Year added above. Thanks!

dorlaor 9 months ago

Some additional nuggets by ScyllaDB co-founder:
- Discord couldn't complete repair with Cassandra. Not the case with Scylla.
- Scylla has a lot in common with Cassandra, for good reason, like the LSM tree, compaction etc. However, Scylla has unique CPU and I/O schedulers which allow us to prioritize the queries over compaction, and defer compaction to the half millisecond where we have enough idle bandwidth. We have plenty of articles about it.
- Scylla has a new (1.5 years) tombstone_gc=repair, a much safer mode.
- Scylla's new architecture of Raft and tablets was recently launched and is the next big thing for our users. Watch the cool YouTube video of the tablet load balancing.

aaptel 9 months ago

This whole problem wouldn't exist if we used distributed chat protocols which have been around for over 40 years (IRC). With the added benefit of having an open specification and multiple implementations. No walled gardens.

And if you think IRC is too old for the modern world take a look at matrix or xmpp.

How did we let discord take over is a mystery to me, or rather a tragedy.

rollcat 9 months ago

IRC does not store messages, it only relays them to clients. You need an add-on solution to store chat history, something we've been taking for granted for ~30 years.
IRC all but requires using a bouncer to follow a conversation from more than a single device.
IRC does not encrypt messages, only (optionally) the client<->server connection. Without E2EE, you have no privacy against the server/operator, which is an easily targeted SPOF.
Matrix (the protocol) is still in flux, and the implementations are lagging behind the spec. If you're not using Element, you're behind on features and security.
XMPP is (similarly to IRC) relying on optional protocol add-ons for basic things, like E2EE, which clients may or may not support fully or correctly.
I recommend reading these breakdowns by soatok: https://soatok.blog/2024/08/04/against-xmppomemo/ https://soatok.blog/2024/08/14/security-issues-in-matrixs-ol...
2013/Snowden happened 11 years ago. E2EE should by now be considered a basic feature, a commodity, something we should be calling for as relentlessly as we did for HTTPS. (Discord of course does not implement E2EE.)
- grishka 9 months ago
  
  Truth is, E2EE isn't a "basic thing". It's an add-on feature that most people don't want. It is impossible to have E2EE that doesn't leak into the UX, and most people would rather have a streamlined UX than deal with key management. It is also much more complex to have robust E2EE in a group chat.
  The thing that sets E2EE apart from HTTPS is that HTTPS requires nothing from the end user. It just works. And as a site owner, you just set it up once and forget about it.
  - rollcat 9 months ago
    
    > It is impossible to have E2EE that doesn't leak into the UX
    True, but one is also free to study the UX solutions implemented on platforms such as iMessage, WhatsApp, and Signal, which all have strong E2EE and see plenty of mainstream usage.
    > [...] HTTPS requires nothing from the end user.
    Depends on how you define "nothing". We've collectively put an insane amount of work to bring HTTPS to where it is today. Also, HTTPS continues to rely heavily on each server operator's skills and diligence.
    There's also plenty of edge cases where HTTPS clients need to go an extra mile, such as containers (many base images do not include a cacert bundle), IoT/retrocomputing/other underpowered devices, and so on. There's always a cost, but it's usually worth it.
    
    grishka 9 months ago
    
    I should've said "true E2EE".
    On iMessage, your keys are managed by Apple. You effectively fully trust them (which seems to be the assumption in most of Apple products anyway). I wouldn't call this a "real" E2EE implementation.
    In WhatsApp, you're limited to one device logged into your account, and the rest are proxied through it. And message backups, those are annoying.
    In Signal, you have all those stupid backups too, and while you're able to log into multiple devices (it seems), your past messages don't load "for your own security", and there's also this stupid time component so you get logged out on your computer if you haven't used the Signal desktop app for some weeks (which I don't).
    Whereas on Discord, Telegram, Slack and other IM services without end-to-end encryption, you log in on a new device and that's it. You instantly get access to all your messages since the beginning of time, and stay logged in forever.
    
    rollcat 9 months ago
    
    > On iMessage, your keys are managed by Apple. You effectively fully trust them (which seems to be the assumption in most of Apple products anyway).
    I'd argue there are many scenarios in which this might be preferable to a lengthier/wider supply chain. Personally I'd sooner trust Apple than Microsoft+(Lenovo/HP/Dell/...)+(Intel/AMD/Qualcomm/Broadcom/...)+(every device with DMA (PCIe/TB), unless you trust your IOMMU)+(.../...)... (you get the point). And the alternatives to Microsoft are each its own kitchen sink.
    > In Signal [...] your past messages don't load "for your own security" [...]
    I agree that this is quite annoying. HTTPS clients resolved a somewhat similar problem (usage of self-signed certificates) by trusting the user to make an informed choice. I wish Signal would trust their user base to make their own choices there as well.
    > Whereas on Discord, Telegram, Slack and other IM services without end-to-end encryption, you log in on a new device and that's it. You instantly get access to all your messages since the beginning of time, and stay logged in forever.
    Same with iMessage. Whether this is a feature or a bug, depends on your threat model.
    But we're in a situation where we don't even get to make an informed choice - every solution (as you pointed out) comes with its own bag of UX shortcomings. These trade-offs should be user choices, not something the vendor forces upon you. But these are not fundamental shortcomings of E2EE as a concept, but particular issues with its different implementations. WhatsApp shows you can restore messages from a backup; Signal shows you can have "real" multi-device presence; etc. If we could spend 1/100th of the effort we did to push HTTPS everywhere, E2EE could be just as ubiquitous today.
    
    brobdingnagians 9 months ago
    
    Just spitballing, but couldn't you have a new device login as three fields: username, password, and encryption key? Then if you don't add the encryption key you don't get the history, but still access the account. Then if password managers really saved all three, that would simplify it for more people (at least those with password managers). But there still has to be a cultural shift for a lot of people to password managers asking non-tech people
    
    iknowstuff 9 months ago
    
    I think whatsapp no longer proxies via a single device.
    On iMessage, you can verify keys now.
    
    NotPractical 9 months ago
    
    > On iMessage, your keys are managed by Apple. You effectively fully trust them
    Not really? You can choose whether to upload your recovery key to iCloud or not. The software abstracts over the details of course, but Signal does that too. Unless you're arguing that it's impossible for closed source software to have "true E2EE", which may have some merit, but Discord is proprietary, and something is better than nothing.
    
    saberience 9 months ago
    
    Yes but see the group size limits on iMessage which is 32!!
    Effectively making it useless for so many people, the reason is due to e2e encryption.
    In contrast, Telegram has groups with 1000s of participants, but only possible as they don’t use e2e encryption.
  - NotPractical 9 months ago
    
    iMessage.
- AnonCoward42 9 months ago
  
  > IRC does not encrypt messages, only (optionally) the client<->server connection. Without E2EE, you have no privacy against the server/operator, which is an easily targeted SPOF.
  Same as Discord.
  > Matrix (the protocol) is still in flux, and the implementations are lagging behind the spec. If you're not using Element, you're behind on features and security.
  Discord also only has one reference client, but for me even with that client Matrix/Element was not as reliable. I still use and like it, but it's not a like for like in that regard.
  > XMPP is (similarly to IRC) relying on optional protocol add-ons for basic things, like E2EE, which clients may or may not support fully or correctly.
  But if you use current clients like Conversations or Dino or the likes it does work. There is no point in counting the clients that don't support it if these aren't the reference or biggest ones. The problem here is more that it's not meant to be used like Discord in any way. Not for big group chats/channels nor for big voice chats (not even sure this possible).
- Zambyte 9 months ago
  
  > IRC does not encrypt messages, only (optionally) the client<->server connection. Without E2EE, you have no privacy against the server/operator, which is an easily targeted SPOF.
  FWIW this point isn't relevant to the IRC vs Discord discussion, since Discord is also very not E2EE. That said, XMPP is my preferred protocol that checks all of the boxes.
  - rollcat 9 months ago
    
    > [...] since Discord is also very not E2EE.
    I have stated that at the end of my original comment. I'm not advocating for Discord (merely enumerating IRC's and XMPP's shortcomings), but I would like to point out once again, that post-2013 any solution that does not enable strong E2EE by default should not be advocated for - at all.
    > That said, XMPP is my preferred protocol that checks all of the boxes.
    Read up soatok's breakdown on the design & status of OMEMO. I'm not a cryptographer, but I do trust a cryptographer when they say some protocol's design/crypto is broken.
    
    vidarh 9 months ago
    
    Maybe for your use. For my use, not a single thing that goes over Discord is something I'd object to being posted on a public website. That includes DMs. Not having E2EE means something isn't a solution for actually private conversations, but a lot of conversations happen in settings that are not actually private in any sense.
    
    collingreen 9 months ago
    
    I personally think I am unable to perfectly guess today what I will want/need to have private forever.
    This is one of the tenets underpinning my thoughts about why privacy matters.
    
    multjoy 9 months ago
    
    But Discord & IRC aren't generally private spaces. They're no different to web forums in that you would reasonably expect that something you write today would be accessible without reference to you in 10 years hence.
    That's a very different proposition to a private/group message exchange in WhatsApp/iMessage etc.
    
    vidarh 9 months ago
    
    It's really quite simple: Would I be happy to discuss it in a public space where people might record?
    I wouldn't plan a controversial political movement in public, or on Discord. I would discuss video game programming in either place.
- crtasm 9 months ago
  
  Nothing stopping a server also acting as a bouncer and storing messages: https://ergo.chat/about
- timeon 9 months ago
  
  > IRC does not encrypt messages
  Wasn't SILC later used for this instead of IRC?
- voidnap 9 months ago
  
  > IRC does not store messages, it only relays them to clients.
  Some people consider this a feature and prefer using IRC bouncers to discord.
  OMEMO solved encryption for XMPP a decade ago. I haven't seen it on IRC yet though.
  - brysonreece 9 months ago
    
    Some (most) people want to easily talk to their friends or interest groups without having to worry about it.
    
    voidnap 9 months ago
    
    I get that, I wasn't passing judgement. You guys must be super sensitive to be downvoting me for just sharing another point of view.
    Personally, I find xmpp and IRC to be easier ways to talk to friends and interest groups when they use those networks. The software is simpler, faster, and a better experience for me.
    Matrix is a bit of an exception where it's slow and buggy and barely hanging on.
    But me and my friends don't care about discord stickers or nitro or giphy links or the discord store or any of that kinda stuff that you go to discord to use. And thats fine if you do.
    People can want and enjoy different things and also "want to easily talk to their friends or interest groups without having to worry about it."
  - dakom 9 months ago
    
    I do consider it a feature, in hindsight. Learning to program by asking "dumb" questions was great, because chats were ephemeral, nobody cared if the same question was asked for the 10 millionth time or risk of embarrassment being like 12 years old and asking greybeards for help.
    Nobody also felt bad saying "RTFM" because, whatever, it blows over in a minute, there's no permanent record of having a harsh moment, more free to just move on.
    The same old questions being asked due to no search also provided more opportunities to answer those questions, so, newbies could start to learn by teaching.
    So, yeah, I think something beneficial was lost, even if I wouldn't go back to that approach- it's more of a tradeoff than a definitive improvement
    
    znpy 9 months ago
    
    > I do consider it a feature, in hindsight. Learning to program by asking "dumb" questions was great, because chats were ephemeral, nobody cared if the same question was asked for the 10 millionth time or risk of embarrassment being like 12 years old and asking greybeards for help.
    I pity the new generations for not having this kind of opportunity: the opportunity to make mistakes, say dumb stuff and goof off with all these things vanishing in a matter of minutes, hours at most.
    I miss the old internet: at any point you could pick a new nickname and get a fresh and clean new email address from many of the webmail providers and just start a new online life.
    And it was considered normal. It was actually a "best practice" to never use nicknames.
    I miss the old internet.
    
    MichaelZuo 9 months ago
    
    This approach simply doesn’t work when users are allowed to vote or have any sort of scoring mechanism. Since bad actors will also create multiple “online lives” and manipulate those systems with a few clicks
    
    sham1 9 months ago
    
    Remember when phrases like "Never use your real name online" used to be near universal? Yeah, this is something I also miss about the old Internet.
    Like, even back then you could absolutely tie your IRL identity up with your online identity, but the difference of course was that it wasn't a requirement of existing online, like it is now. Like yeah, you can stay anonymous but a) it's super difficult since the modern day assumption is that you're not doing that and b) that you're up to no good, because why would you be hiding who you are, unless you were doing something shady. And now even "normal" people lament just where we went wrong and what happened to online privacy. To the aware, privacy dying like this was clear as day, but I suppose most just didn't hear, or chose to ignore, the alarm bells.
    And now everything is logged, analysed, and associated with the people who produced the messages and other sundry content. There is no ephemera, we need laws just to be forgotten by services (as an EU citizen, I'm glad about the law existing here, but it shouldn't need to be a law, it ideally should be assumed), and we're constantly getting watched by both states and surveillance capitalists alike. Not actively in most cases, mind you, but passively, with our movements, our interactions online, and just what we do, just getting aggregated into these humongous data sets of Big Data, to train statistical models on. Mostly to surveil us even harder, or to manipulate us in the form of advertisement, which can be even more insidious in some ways.
    I'm sure that stuff like the Cambridge Analytica fiasco could have occurred even without this destruction of privacy, anonymity, and ephemeral content, but I posit that it would have been way more difficult had people not been encouraged to put everything about themselves into services that would log them and build evermore complex models about them and their thoughts. And now this kind of stuff can be used to destroy democracies, and as alluded to earlier, manipulate for example our spending habits. And now we all wonder just where this all went wrong.
    I miss the old Internet.
Ecoste 9 months ago

> How did we let discord take over is a mystery to me, or rather a tragedy.
The fact that you're baffled why discord took over is exactly why it took over. You can't even acknowledge that the user experience is 10x better and it's suitable for a general non-technical audience.
- mystified5016 9 months ago
  
  New quest available! Buy nitro for stickers! Buy nitro as a gift! New quest available! New quest available! Restart to update. New quest available! Look at the new emojis you could use with nitro! New update available! New update available again! Third update today! New quest available! Look at these profile decorations you could use with nitro! Boost this server! *NEW QUEST AVAILABLE*
dewey 9 months ago

I’m a huge IRC fan and I dislike Discord, but all these other services are way too clunky and IRC is really only usable through IRCCloud that has a relatively okay mobile app these days.
Recently a very technical group I’m part of migrated from Telegram to Matrix and the user experience is just not very good. The apps are buggy, don’t look good, then in the new “Element” app SSO isn’t supported so I can’t use my account with it. There’s lots of paper cuts that are okay for someone like me who likes to figure it out but I’d never try to convince my friends to use it.
- nunobrito 9 months ago
  
  For telegram refugees then maybe SimpleX is an option, except it has no bots nor other options for clients at the moment.
  What I personally use is the nostr protocol through a client like Amethyst or OxChat. Messages and groups can be E2EE private, or you can just use the public groups.
  The biggest advantage is that you are joining a bigger community of apps and services built on top of the same protocol, rather than joining some isolated island (again).
  - dewey 9 months ago
    
    I recently listened to a nostr podcast and even people working on it said it would not be reasonable to recommend it for a secure messaging app at this point. Just because very early things like metadata leaking are not addressed yet. So not really an alternative.
    
    nunobrito 9 months ago
    
    I don't know what podcast you are mentioning or the context. Anyone can say anything on youtube.
    We are talking about a transition from Telegram. Compared to that platform, NOSTR is undoubtedly more secure, given that Telegram doesn't even encrypt conversations by default and users aren't informed of this. Whereas in NOSTR you are made aware when a conversation is private between both parties.
    Metadata is fetchable for 99% of messaging apps out there. If you'd ask me about making a more secure app then this involves continuous streaming of data, padding of messages to avoid content guessing and avoid the usage of internet as data channel.
    So it really depends on what you consider secure and what it is compared against. Compared to Telegram it is more secure. Compared to a piece of paper encrypted with a custom algorithm and delivered by a trusted human transporter? Not really.
high_na_euv 9 months ago

>How did we let discord take over is a mystery to me, or rather a tragedy.
Orders of magnitude better product than anything competition had at the time?
- doublerabbit 9 months ago
  
  > Orders of magnitude better product than anything competition had at the time?
  Nah, it just comes down to non-techy folk wanting to play/chat with their friends in a just-work configuration.
  Mumble, TeamSpeak were always janky, needed a hosted server. IRC is multiplayer notepad.
  Geeks care about E2E, and all that glory but these folks don't. And that's what Discord dishes; as did Y!M, MSN, ICQ, AIM back in the day.
  All discord has done is replaced those above as GitHub has replaced SourceForge.
  We didn't care if the message were encrypted or not back then. Why do we now?
  - StableAlkyne 9 months ago
    
    > Geeks care about E2E
    *Some* geeks. Specifically those who are into encryption.
    There is nothing wrong with wanting an application to just work, especially when it's significantly better than what came before (contemporary competitors were Skype and IRC)
  - pphysch 9 months ago
    
    You're just describing why Discord was a much better product.
  - high_na_euv 9 months ago
    
    I used all
    Ventrilo mumble ts3 skype
    Discord was way better and had more features and was more safe than self hosted alternatives
Krasnol 9 months ago

Usability did it.
You download an exe, install it, make an account and it runs. Just like that. Everybody can do it.
There are tons of useful and great software out there. Most of it is not easy for the public. Some (most?) of it doesn't even have a GUI. People rather sell their identity and even pay than suffer through too many hops.
- Intralexical 9 months ago
  
  Not even an EXE. The web version is feature-complete, so you only need to click a link.
  - Krasnol 9 months ago
    
    You're right. I forgot about that.
    I also forgot all those people who came from the TeamSpeak servers.
throw16180339 9 months ago

> How did we let discord take over is a mystery to me, or rather a tragedy.
Anyone can set up or join a Discord server. If you give users the choice between a complex open platform and an easy proprietary solution, they will pick the latter every time.
maccard 9 months ago

If you want to know why, look at the App Store reviews for Discord and TeamSpeak and compare them.
Discord just works.
tannhaeuser 9 months ago

There's no lack of open chat protocols and federated services but those have mostly torpedoed themselves: by usability and discoverability problems, holier-than-thou attitudes, and plain nerd attention wars. Such as XMPP (used a lot until around 2010 but easily dragged into the mud because XML and overengineering), Mastodon (saw a surge as Twitter was faltering but then seemingly stopped being everyone's darling as its limitations became obvious, among them Mastodon admins taking their audience hostage; also ActivityPub fans going around advertising it for each and everything when RSS is just fine for web sites, damaging news feeds altogether in the process).
Where spamming, or the systematic exploitation of digital communication by the "ad industry", was killing it in the past (Usenet, and arguably the web), today there's also the problem of being consumed by LLMs to push non-public messaging. Though I'm not sure the latter is really a concern for many, as developers not only are giving away their code, but their entire activity log/issues and their solutions on GitHub such that they can easily be digested and replaced by coding assistant LLMs, git being a distributed system in the first place.
- Terr_ 9 months ago
  
  > among them Mastodon admins taking their audience hostage
  I was excited first hearing all the "fediverse" stuff, but having to hand over control of your online identity to a particular node forever felt a little bit like "new boss, same as the old boss."
  (Yes, I know some folks are working on the identity issue.)
  - nunobrito 9 months ago
    
    Reminds me of when I joined the largest Mastodon server for my country. Advertised by the owner as a bastion for free speech, democracy and fair treatment. Then in 2020 it started mass banning everyone "that went against science" on the covid fraudemia in our country.
    Twitter in those days was bad, but that Mastodon server sure became even worse. Nowadays I found a fresh air of innovation with Nostr. No more servers with your data and followers locked inside.
    You can silence the people you don't want to hear, but you won't hold them hostage in forced silence any longer.
  - paulryanrogers 9 months ago
    
    Mastodon means you can at least pick your boss, be your own boss, and take your identity and followers to a new boss. (Possibly even taking your content too, though maybe not links)
    
    MichaelZuo 9 months ago
    
    Picking a ‘boss’ in a system where the average ‘employee’ has no credible way of assessing or evaluating them, or their superiors, and zero prospects of ever getting a face to face meeting with, is effectively no different to having the boss picked by an anonymous shareholder meeting in SF.
    If all of the potential bosses have roughly the same degree of accessibility… which is the case for Mastodon for anything over a few hundred users.
    
    ThrowawayTestr 9 months ago
    
    What's stopping you from messaging server owners or stalking their profile to see if they're ideologically compatible?
    
    throw16180339 9 months ago
    
    That's a lot more effort than using Discord and getting on with my life.
    
    paulryanrogers 9 months ago
    
    Compared to closed gardens like Discord and Xitter, Mastodon is a significant improvement.
    
    MichaelZuo 9 months ago
    
    But not in terms of the ‘choosing a boss’ aspect for the median user.
    
    StableAlkyne 9 months ago
    
    Did they ever address the problem of migration from a bad server?
    For example, a scenario where your server dies and does not return. Or a malicious actor takes over and bans the user base. Or a honeypot encouraging user account migration, followed by bans.
    In all 3 cases, you are effectively screwed the moment you migrate to a malicious server, or your server becomes malicious.
    I remember blue sky trying to address this by tying your identity to a DNS record or something, but it's a severe limitation in anything trying to be decentralized
elcomet 9 months ago

IRC and distributed protocols in general had a big issue: you lose history every time you disconnect
- menaerus 9 months ago
  
  In the age we are living this starts to sound more like a feature to me.
  - MatthiasPortzel 9 months ago
    
    The other reply goes to airplanes but there are much more common ways to get disconnected. Locking my phone or closing my laptop lid disconnects me from IRC. A lot of Discord users have desktops that are always on (since Discord originally advertised to gamers), but a lot of Discord users don’t.
    Discord is fundamentally a very versatile platform. If you lose one seemingly unimportant feature, you lose a lot of versatility. Maybe I’ll write a blog post just with examples of how I’ve used it. It replaces IRC, but it also replaces Facebook groups, Skype, a lot of group texts, and a lot of email for me.
  - agumonkey 9 months ago
    
    It does alter the meaning of chat tremendously. In discord, often things become heavy, because we're not talking, we're accumulating information, and you have to stay on purpose so data is manageable and seekable.
    The few times I join IRC I know we're only here to chat, it's semi-transient (a little bit more if logs are stored) and I feel lighter.
  - rtpg 9 months ago
    
    Is it really that much of a jump to say "I would like to see the chat that has happened between my friends between the time I got on a plane and then got back off"? Does that sound odd?
    Imagine if you couldn't receive e-mail while you were offline!
    This isn't to disparage IRC and friends too much, obviously there's huge value in it existing as a synchronous chat room. Just... async chat is a thing that totally happens for most people.
    
    serf 9 months ago
    
    a non-technical person wouldn't consider the implications of a history log with regards to security or data hoarding, they just see it work and think of it as a convenience.
    this value sell shifts in the mind of the non-technical person once they're told that the feature they want implies non-ephemeral data that will be systematically sifted through either for legal or financial benefit by a third party.
    in other words : the reason why 'async chat is a thing that totally happens for most people.' is because a vast majority of people are simply unqualified to even see the problem, much less seek alternatives or solutions to the data hoarding that they must comply with.
    this creates a social effect and pulls everyone into Discord, regardless of their beliefs on the matter, simply because it has become 'the only game in town'.
    regardless of personal preference, centralization of these kind of things is BAD for the user in nearly all circumstances aside from convenience.
    
    Shog9 9 months ago
    
    Please stop pretending that "data hoarding" didn't / doesn't happen on IRC. There's nothing inherently friendly to security or privacy in the protocol; if anything, it's quite the opposite.
    That you can, with augmentation and diligent op-sec, get something a bit better than Discord isn't a great selling point unless you have the time and resources and buy-in already, not just for yourself but from everyone in your group. At which point, there are still better options than IRC.
    For decades now, the main draw of IRC has remained a fetish for conspicuous configuration, as it embodies a sort of brutalist architecture of communication software. The excuses change every few years, but the love for cobbling together a barely workable system from parts remains core.
    
    menaerus 9 months ago
    
    Sure, the advantages of async communication are obvious but the crucial difference is that in that case vendor has to store your data somewhere in the data center. Reusing that data for unsolicited purposes is what many people will have a concern with.
    
    indeyets 9 months ago
    
    But logs are stored on IRC as well. It’s not a part of the standard protocol, but a lot of IRC servers can do that automatically, and there are bots which do that, not to mention personal archives. The difference is that end-users don’t have easy access to these logs. And on Discord they do (because it is a part of the protocol).
    
    cmiller1 9 months ago
    
    How about a secure async chat where the vendor simply stores a list of message IDs, and then the client requests if anyone has a copy of any message you haven't received yet from the other users in chat when you log on
    
    menaerus 9 months ago
    
    Such a vendor would have a hard time finding a business model, since plenty of chat services already exist on the market and all of them have access to the data of their users in one way or another. Thus I don't know what other type of leverage they would be able to pull off to sustain their business.
  - StableAlkyne 9 months ago
    
    You and your friends lost history, but the server owner never did :)
Intralexical 9 months ago

> How did we let discord take over is a mystery to me, or rather a tragedy.
I think I'm reasonably technically competent, and I also dislike Discord's issues with privacy, data sovereignty, siloing information away from the open web, etc.
But you know what I think whenever I click a Matrix link, or IRC? I just don't want to deal with it. You get a list of apps you've never heard of, some of which may not be feature-complete, some with more than one version, some which are advertised using words like "GNOME", "Rust", "Qt5", and "C++" that have no meaning or relation to actually using them as a chat app, and all of which I guess are different and would need to be tried and learned separately. Then picking and clicking one tries to open an outside program which probably isn't installed and I don't want to install because I don't really know/care what it is. And if at that point, out of the dozen or so app options it showed you, you happened to choose one with a web version like Element, and you figure out you can click the "Continue in your browser" button out of the four or five unexplained buttons that pop up as a result ("XDG-Open", "Cancel", "FlatHub", "Download", and "Continue in Browser")— You get a static screen that shows just enough message history to not be useful, with a confusing UI you can't seem to interact with, hidden behind a login wall that still hasn't really explained what in the Internet tubes you're actually looking at.
E.G.: https://matrix.to/#/#invidious:matrix.org
If you try to Google "What is Matrix"— You get pages about math. So then you Google "What is Matrix chat". And all the results harp on using words like "open network", "decentralised", "protocol", "real-time communication", "open standard", "federated"— Which, again, may be technically interesting if you're into that, but doesn't actually have anything to do with how it directly serves the user as a chat app and how you can use it or sign up for it.
It takes way too many clicks, and you get bombarded with way too much information… To still not end up using the app, and in fact end up more confused than before about what a "Matrix" even is. Let's say you lose 15% of incoming users at each step. That rapidly scares off most of the mainstream, before they've even tried it. Maybe Matrix and Element are great. But it just seems like such an ordeal.
Compare that with Discord. You click a link. And then either you're already in the server, or it has a single text box and a single button you click to funnel you through making an account and joining the server.
It doesn't try to convince you to install a Desktop app until you're already fully using it in the web version. You get clear answers and reasons to use it if you search "What is Discord" or go to the website. It doesn't overwhelm you with options and then hound you with technical explainers that you didn't ask for.
IRC goes the other way in usability. People want voice chat, message history, different channels in the same "server", PM channels, etc.
/rant
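To put rough numbers on the 15%-per-step guess above, here is a minimal sketch in Python. It assumes every step independently loses the same fraction of users; the 15% figure is the commenter's guess and the step counts are purely illustrative, not measured data.

```python
def remaining_fraction(steps: int, loss_per_step: float = 0.15) -> float:
    """Fraction of incoming users left after `steps` steps, each losing `loss_per_step`."""
    return (1.0 - loss_per_step) ** steps


def steps_until_majority_lost(loss_per_step: float = 0.15) -> int:
    """Smallest number of steps after which fewer than half of the users remain."""
    steps = 0
    while remaining_fraction(steps, loss_per_step) >= 0.5:
        steps += 1
    return steps


if __name__ == "__main__":
    for n in (1, 3, 5, 8):
        print(f"{n} steps: {remaining_fraction(n):.1%} of users remain")
    print("majority lost after", steps_until_majority_lost(), "steps")
```

Under that assumption, more than half of the incoming users are gone by the fifth step.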
weaksauce 9 months ago

because the voice chat function is so leaps and bounds better than anything out there and it was primarily used for that to game in real time. the text was an afterthought for gamers.
EGreg 9 months ago

I keep writing about this tragedy, but few people care. Even on HN:
https://cointelegraph.com/news/how-a-web-that-lost-its-way-c...
and
https://community.qbix.com/t/the-debate-about-end-to-end-enc...
- philipwhiuk 9 months ago
  
  > Own this piece of crypto history
  I would argue that the web lost its way as much with "web3" as with the platforms of web 2.
  - EGreg 9 months ago
    
    I didn’t write that.
    You must be quoting an ad, and dismissing everything else
RadiozRadioz 9 months ago

There are loads of comments exactly like OP's, and they always make the mistake of mentioning IRC alongside XMPP and Matrix. Inevitably repliers can't help themselves and spend their replies discussing IRC's unsuitability for modern IM and how it's not federated. When IRC is mentioned, commenters ignore XMPP and Matrix and attack the point in terms of IRC. (Though this thread in particular is better than average).
Matrix and XMPP are the far more appropriate competitors for Discord, we need to steer the conversation toward them. I deliberately never mention IRC when I make these types of comments so people don't latch onto it and ignore everything else I said.
lofaszvanitt 9 months ago

Discord wrapped irc in shiny paper.
urza 9 months ago

100% !! It's so sad :(

jimkoen 9 months ago

My takeaway from this is maybe somewhat different from what the authors intended:

> The last one? Our friend, cassandra-messages. [...] To start with, it’s a big cluster. With trillions of messages and nearly 200 nodes, any migration was going to be an involved effort.

To me, that's a surprisingly small number of nodes for message storage, given the size of Discord. I had honestly expected a much more intricate architecture, engineered towards quick scalability, involving a lot more moving parts. I'm sure the complexity is higher than stated in the article, but it makes me wonder, given that I've been partially responsible for more than 200 physical nodes that did less, how much of modern cloud architecture is over engineered.

romanhn 9 months ago

They are talking about 177 database nodes, which is not an indicator of architecture complexity. I assume they have dozens/hundreds of services consisting of multiple highly available nodes each across various geographies.
Having seen a much smaller set of Cassandra nodes used to store billions (rather than trillions) of records, I can say that Cassandra was definitely a total PITA for on-call, and a cause of several major outages.
nicholasjarnold 9 months ago

> ...how much of modern cloud architecture is over engineered.
I would wager a good majority of it is. The Stack Overflow architecture[0] sticks out to me in this regard as an example on the other end of the spectrum.
[0] https://news.ycombinator.com/item?id=34950843
hiyer 9 months ago

Also bear in mind that they're now doing the same with just 72 nodes.

hiyer 9 months ago

Very well-written article. I'm happy for them that part of the solution was switching from Cassandra to drop-in replacement Scylla, rather than having to deal with something entirely different.

dean2432 9 months ago

They make it literally impossible to delete your old messages. It's a privacy nightmare and I wonder why the EU hasn't stepped in.

Intralexical 9 months ago

I do think there is a balance to be struck, because directed communication means the recipients of old messages are also stakeholders, such that maintaining a consistent record by default is a fundamental part of the "service" they offer. The message contents are different from e.g. secretly hoovering up click patterns. Matrix had some thoughts when they faced the same questions:

  The key question boils down to whether Matrix should be considered more like email (where people would be horrified if senders could erase their messages from your mail spool), or should it be considered more like Facebook (where people would be horrified if their posts were visible anywhere after they avail themselves of their right to erasure).

  Solving this requires making a judgement call, which we've approached from two directions: firstly, considering what the spirit of the GDPR is actually trying to achieve…

https://matrix.org/blog/2018/05/08/gdpr-compliance-in-matrix...

Xen9 9 months ago

In Discord culture, indeed, users usually share a shit-ton of PII in "introduction" messages from images to specific hobbies to medical information (EG "support" communities).
The problem from a GDPR perspective is that Discord makes it impossible to delete those, since once they detect your interest in trying to delete any of your accounts' data, they will try to "anonymize" it instead. Then at least publicly your username is disconnected from those messages, but they can still be traced back to specific persons. Now if this is also done server side, then they would be in a situation where you'd either have to go through a ton of messages or bulk delete everyone's past messages to enforce the GDPR demands of a user wanting their PII deleted.
EU Parliament is not a real Parliament in the sense that ONLY the Commission can propose new laws, and the elected parliament basically just votes on those. Who controls the Commission if not the people? The US State Department. Newsguard and non-Musk US bigtechs including Discord are in the same poli-financial bed of the establishment here. And they are full of previous state department workers.*
Unless there is public outrage, the EU-level bodies at least will probably be owned. But public opinion is controlled by the cyberpunk establishment that trains their LLMs & targets their campaign ads using that illegal Discord data to get political advantage.
You in my view ought to "worry" about the fact that it's possible there will sooner or later no longer be escape from a permanent establishment, Orwell-style. Goes along with the theme that "cybersecurity" at the United States government level has been "war against hate speech" for years, and of course "hate speech" meaning "censorship of internal and external enemy speech."
Budd Dwyer, if I recall correctly, shot himself on TV after writing to Biden (???) that under some conditions (that became true), the Department of Justice should have "Justice" removed from its name.
---
Most of this I hold only at 50+% confidence of being broadly correct. Take with lots of salt.
- r3d0c 9 months ago
  
  incoherent babbling
  - Xen9 9 months ago
    
    Reasons as to why I should believe that the comment or parts of it were "incoherent babbling"?
    I did express a low confidence.
    My information is limited. You ought to expect to feel my points being "incoherent babbling".
intelVISA 9 months ago

Given the sheer size and extent of the user data collected and processed one imagines the EU is working on a big case... quietly.

robmccoll 9 months ago

Cassandra is essentially an append-mostly distributed fault-tolerant hash table. If you need specifically that with high write throughput, it's a good choice. I don't understand why people use it as a database. You run into its limitations immediately and the pain of trying to use it like a database only gets worse with scale.

LeifCarrotson 9 months ago

FTA:
> In Cassandra, reads are more expensive than writes.
This makes it insane as a message store for a chat server to me. It seems appropriate for a logging destination for a distributed system, one where you want lots of clients to dump data but most of the time you don't even need to audit the logs, so the number of reads for a given item is less than one. This is obviously not true for Discord messages.
- atombender 9 months ago
  
  The sentence makes it sound like Cassandra and Scylla are slow for reads, which isn't the case at all. It's just that writes require a bit less I/O. Reads are still very fast. If reads were slow, nobody would use Cassandra and Scylla for the purposes that they're being used for.
  - menaerus 9 months ago
    
    Actually read performance is one of the main challenges in LSM based storages.
- Squeeeez 9 months ago
  
  Not too sure - I would have guessed that most of the messages are written once, read by the constant number of participants (say 1-100 or so) and then they disappear off the screen and are never accessed again, ever. Maybe a few people will scroll or search, or use some custom extension to load and export the history, but very rarely.
mianos 9 months ago

All the Cassandra documentation and the web site say it is a database. You can't blame anyone for getting confused. In my experience, I have never seen a project that started to use it continue to use it after a year or so. It may take a year to run into its limitations before having to replace it with a database, like Postgres.

PaulHoule 9 months ago

How is it that they just can’t shard the thing? Isn’t each Discord ‘server’ isolated from the others (can’t send a message from one to the other)? Why can’t they address trillions of messages by having thousands of shards that each handle billions?

DylanSp 9 months ago

The partition key included the channel ID, and they were still having problems with hot partitions even with that fine-grained sharding.
hun3 9 months ago

Last time I checked the Discord bot API, it had explicit provisions for sharding.

codexon 9 months ago

> The ScyllaDB team prioritized improvements and implemented performant reverse queries, removing the last database blocker in our migration plan.

I wonder how much they paid ScyllaDB to do this before even using ScyllaDB.

jsnell 9 months ago

The article says they were using ScyllaDB for everything except the message store two years before they did the migration for messages.

molszanski 9 months ago

> In an afternoon, we extended our data service library to perform large-scale data migrations. It reads token ranges from a database, checkpoints them locally via SQLite, and then firehoses them into ScyllaDB. We hook up our new and improved migrator and get a new estimate: nine days!

How many machines was this migrator running on? One? :D Sounds absurdly amazing!
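For anyone wondering what "checkpoints them locally via SQLite" could look like, here is a minimal sketch in Python. It is not Discord's migrator: the `done_ranges` table, the `copy_range` callback, and the even split of the ring are all hypothetical, the token bounds assume the default Murmur3 partitioner, and the actual reads from the old cluster and writes to ScyllaDB are left to the caller.

```python
import sqlite3
from typing import Callable, Iterator, Tuple

# Assumption: signed 64-bit token ring, as used by the Murmur3 partitioner.
MIN_TOKEN = -(2**63)
MAX_TOKEN = 2**63 - 1


def split_token_ring(parts: int) -> Iterator[Tuple[int, int]]:
    """Yield `parts` contiguous (start, end) ranges covering the ring, end inclusive."""
    step = (MAX_TOKEN - MIN_TOKEN + 1) // parts
    start = MIN_TOKEN
    for i in range(parts):
        # The last range absorbs any remainder so the whole ring is covered.
        end = MAX_TOKEN if i == parts - 1 else start + step - 1
        yield (start, end)
        start = end + 1


def migrate(
    conn: sqlite3.Connection,
    parts: int,
    copy_range: Callable[[int, int], None],  # hypothetical: copies one range old -> new
) -> int:
    """Copy every token range not yet checkpointed; return how many were copied this run."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS done_ranges ("
        "start_token INTEGER, end_token INTEGER, "
        "PRIMARY KEY (start_token, end_token))"
    )
    copied = 0
    for start, end in split_token_ring(parts):
        already_done = conn.execute(
            "SELECT 1 FROM done_ranges WHERE start_token = ? AND end_token = ?",
            (start, end),
        ).fetchone()
        if already_done is not None:
            continue  # finished in an earlier run, skip it
        copy_range(start, end)
        conn.execute("INSERT INTO done_ranges VALUES (?, ?)", (start, end))
        conn.commit()  # checkpoint only after the copy succeeded
        copied += 1
    return copied
```

The point of the local checkpoint is that a crash partway through only costs the range that was in flight: rerunning `migrate` against the same SQLite file skips everything already recorded.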

tcfhgj 9 months ago

Storing is one thing. Performing data mining on them is another

philipwhiuk 9 months ago

That's a separate problem with hugely different latency concerns, likely done on a separate copy.
CamperBob2 9 months ago

Also, people need to keep in mind that those trillions of messages are archived nowhere. Thanks to the walled gardens we're obsessed with building, far-future anthropologists will know more about Pompeii and Machu Picchu than San Francisco.
- squigz 9 months ago
  
  Firstly, no they won't. That's silly.
  Secondly, how would such an archive work? Who would pay for it? How would it be safeguarded in such a way that it can be read by 'far future anthropologists' but not the people paying for the storage?
  - geysersam 9 months ago
    
    If we're only talking about public chat rooms, it shouldn't be difficult to archive the content of those.
    There are open repositories of the entire internet text content (common crawl). These scrapes are periodically repeated. That's orders of magnitude more data than all discord messages ever.
    So technically it's not a problem making such an archive. The financing is of course always an issue, but not because the costs are large.
- xboxnolifes 9 months ago
  
  I don't think every single individual message ever needs to be archived. Every text, every email, every post-it, every poke, every emoji, every reaction GIF...
  - ktosobcy 9 months ago
    
    Well, considering the annoying push for "let's resolve the issue on discord" it's very annoying. With things like github issues you can search for a problem and find a solution. Even ancient mailing lists most of the time have archives. Not so much with all those fancy "realtime" :/
    
    klabb3 9 months ago
    
    I agree with the sentiment but GitHub issues is not a good replacement. First, it’s also owned by a corporation and is available on the open web today because they let us (is it even scrape/api available today? Can people build tooling on top?). Anyway, this “openness” can easily be changed once the “value extraction knob” is turned.
    Secondly, GitHub is a developer platform, not a user/enjoyer platform. Issue reports are high-barrier even for devs. People get upset if you’re asking a random question, don’t check for duplicates, etc. Some people even get upset about issues without a PR.
    Again, I’m all for good open alternatives but when HN is like “you just configure Gentoo and type 30 commands” we don’t stand a chance to actually win users over, gotta accept reality before we can improve it…
    
    ktosobcy 9 months ago
    
    GH was only an example of something quite common and searchable. It could be codeberg.org or similar
  - famahar 9 months ago
    
    Definitely not everything, but it's still wild to me that so many products and services have all their troubleshooting and customer support in a discord server.
    
    proteal 9 months ago
    
    It makes sense to me. The number of people who actually create useful open source software is so vanishingly small compared to the number of people who use OSS, it seems obvious that we should optimize for their time, not the other way around. I agree with you that using mailing lists or GitHub issues or whatnot would be globally more efficient, but if I’m working on a product, I’m going to work in the way that is most efficient for my time. I owe my “customers” nothing because they are not paying for my work. We keep seeing discord as a means to communicate about products because devs see it as the best use of their time. The fact that so many people use it should be an indictment on the alternatives, not the devs who choose to use discord.
    
    foobazgt 9 months ago
    
    Sadly, I can understand why Discord doesn't have a lot of incentive to do this. Maybe the community should popularize an open-source free/low-costing bot and hosting solution for exported chat? (I couldn't find one in a few minutes of searching).
    
    tbrockman 9 months ago
    
    Here ya go: https://github.com/AnswerOverflow/AnswerOverflow
    
    ekianjo 9 months ago
    
    Even FOSS communities. shame on the devs who decide to do so.
    
    Kiro 9 months ago
    
    It used to be IRC channels on Freenode and I didn't see anyone complaining back then.
    
    CamperBob2 9 months ago
    
    That's the thing. No one ever complains at the time.
    
    squigz 9 months ago
    
    Why do you and GP think so many FOSS projects choose to use Discord like this?
- daedrdev 9 months ago
  
  For many people the fact that discord is not easily discoverable is a benefit, just like in many other messaging services

dang 9 months ago

Discussed (a bit) at the time:

How Discord Stores Trillions of Messages - https://news.ycombinator.com/item?id=35048410 - March 2023 (10 comments)

bofaGuy 9 months ago

I’m lost at why a DB (Cassandra) with better write performance than read performance was ever selected for a messaging system. I feel like it’s obvious that a message will be read more than it is written (once).

remram 9 months ago

The fact that it has better write speed than read speed doesn't mean that it has bad read speed. It just happens to have even better write speed.
It's like how I connect my phone to my home's cable connection to send a big file. It is better at downloading than uploading, but that doesn't mean it's not the best solution for uploading.
SpikeMeister 9 months ago

While it’s true that messages are read more, reading can be cached so not every read necessarily results in a DB call.
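A minimal sketch of that idea in Python, assuming a simple in-process read-through cache with a time-to-live. `fetch_from_db` is a hypothetical stand-in for the real query, and this is not a description of how Discord's services are actually built.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class ReadThroughCache:
    """Serve repeated reads from memory; only misses and expired entries hit the DB."""

    def __init__(
        self,
        fetch_from_db: Callable[[Hashable], Any],  # hypothetical DB query function
        ttl_seconds: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch_from_db
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}
        self.db_calls = 0  # how many reads actually reached the DB

    def get(self, key: Hashable) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: no DB call
        value = self._fetch(key)  # cache miss or expired entry
        self.db_calls += 1
        self._entries[key] = (now + self._ttl, value)
        return value
```

With something like this in front of the database, a thousand clients opening the same busy channel within the TTL cost one query rather than a thousand, at the price of serving slightly stale data.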
- axelthegerman 9 months ago
  
  Which seems something they added recently but was not part of the original design of using Cassandra

cynicalpeace 9 months ago

Is there a fundamental reason you wouldn't use postgres for something like this? Scale certainly wouldn't be it.

ericvolp12 9 months ago

ScyllaDB scales horizontally on a shard-per-core architecture with a ballpark throughput of 12,500 reads and 12,500 writes per second per shard. If you're running Scylla across a total of 64 cores (maybe on 4 VMs with 16 vCPUs each), you can get up to 800k reads and 800k writes per second of throughput with p99 writes of <500us and p99 reads of <2ms.
You will not be able to get that performance out of Postgres and the write scaling will also be impossible on a non-sharded DB.
If you're a company like Discord and are running dozens (70-something?) of ScyllaDB nodes, likely each with 32 or 64 vCPUs, you've got capacity for 50M+ reads/writes per second across the cluster assuming your read/write workloads are evenly balanced across shards.
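The back-of-the-envelope math above can be written out as a small Python sketch. It assumes one shard per vCPU and perfectly even load across shards; the 12,500 ops/sec figure is the ballpark quoted above, and the 72-node count is taken from elsewhere in this thread, not from any measurement here.

```python
PER_SHARD_OPS = 12_500  # ballpark reads (or writes) per second per shard, as quoted above


def cluster_ops_per_sec(
    nodes: int, vcpus_per_node: int, per_shard_ops: int = PER_SHARD_OPS
) -> int:
    """Upper-bound throughput, assuming one shard per vCPU and evenly balanced load."""
    return nodes * vcpus_per_node * per_shard_ops


if __name__ == "__main__":
    print(cluster_ops_per_sec(4, 16))   # 4 VMs x 16 vCPUs
    print(cluster_ops_per_sec(72, 32))  # 72 nodes x 32 vCPUs
    print(cluster_ops_per_sec(72, 64))  # 72 nodes x 64 vCPUs
```

These are ceilings, not predictions: larger rows, hot partitions, or compaction pressure would all pull real numbers below them.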
- jhgg 9 months ago
  
  Fwiw the benchmarked numbers are for writing very small rows. When doing the messages migration, with no read traffic, and the cluster/compaction settings tuned for writes we only managed approx 3m inserts/sec while fully saturating the Scylla cluster.
  - ericvolp12 9 months ago
    
    Interesting, we've got to 5M+ reads/sec in realistic simulated benchmarks and ~2M reads/sec of real-world-throughput on our clusters that are <10 nodes (though really high density). I don't think I've pushed writes beyond 1M QPS in real-world or simulated loads yet though. Thankfully our partitioning schemes are super well distributed though and our rows are very small (generally 1-5k) so I don't think we'd have a problem hitting some big numbers.
  - menaerus 9 months ago
    
    How about per-node memory pressure, did it change in favor of Scylla? I ask because I would legitimately expect that GC-based system would have a larger pressure on the memory subsystem.
    
    jhgg 9 months ago
    
    Scylla just eats all the ram it can with cache. So it's hard to say really. On Cassandra we allocated half the ram to the JVM which it gladly used up and left the other half to the OS for disk cache. On Scylla, since it uses direct io, there is no need for OS disk cache.
- riku_iki 9 months ago
  
  > You will not be able to get that performance out of Postgres
  if writes are batched, I get this and higher performance from postgres. If 800k on 64 cores is Scylla's best result, it is not that impressive.
  But also you probably mean writes/reads to indexed table, then it is another story.
- ryanjshaw 9 months ago
  
  Okay but this is where I get confused. Why does Discord need a single database system when discord servers are independent, right?
  And the volume of traffic per Discord server must be human-processable or what would the point be? A Discord server doing 800k writes per second makes no sense.
  So why not a RDBMS per Discord server, and if you want to ship all that out to a warehouse for analytics you do that as a separate problem?
  Or is it that spinning up a Postgres instance per Discord server ends up being significantly more expensive than these mega distributed database systems?
  - jhgg 9 months ago
    
    There are, ballpark, a few hundred million discord servers... do you really want to run that many Postgres instances? And even so what do you do about DM/GDMs? Easier to just run one big mega cluster for messages.
    
    ryanjshaw 9 months ago
    
    Okay so the latter then - economies of scale. Surprised to hear that few hundred million figure - I thought it'd be 1/10th of that at most! Wow.
    Although I did expect there'd be a very long tail, and you might choose to host a bunch of servers on a single RDBMS, at that scale yeah it wouldn't solve much.
    Thanks for coming back to me, appreciate it.
  - Drew_ 9 months ago
    
    Apple kind of does something like this with iCloud however their per user "databases" are only virtual:
    https://news.ycombinator.com/item?id=39028672
justnoise 9 months ago

I'd guess that Discord's storage systems lean towards processing a lot more writes than reads. Postgres and other databases that use B-tree indexing are ideally suited for read heavy workloads. LSM based databases like Cassandra/Scylla are designed for write intensive workloads and have very good horizontal scaling properties built into the system.
- Aeolun 9 months ago
  
  Would you actually have more writes than reads? Are messages read by fewer people than post them?
  - sadeshmukh 9 months ago
    
    When you send a message, afaik it sends to all people looking at it at the time. So there is no read when in a conversation, and maybe the reads are batched when reading multiple.
- jhgg 9 months ago
  
  Read traffic is much higher than write traffic due to mobile clients needing to sync chat history more often as their sessions are much shorter lived. Also search queries execute 1 query per result. And don't forget people doing GDPR data dump requests. It adds up.
cowthulhu 9 months ago

I’m not sure if Postgres would have enough horizontal scaling to accommodate the insane volume of reads and writes. I would be super interested to be proven wrong though… anyone know of a cluster being run at that scale?
riku_iki 9 months ago

> Scale certainly wouldn't be it.
vanilla postgres can't scale to such size, you need some sharding solution on top, which likely will be much harder to maintain than ScyllaDB..

pavel_lishin 9 months ago

Anyone else reading this and being quite happy that they're not working at this scale?

wavemode 9 months ago

I don't mind scale. I mind the bureaucracy and promotion-driven-development that comes with working in a bloated engineering org.
- pm90 9 months ago
  
  +100
  Many companies have products that operate at “scale”. They manage to do so with pretty boring techniques (sharding, autoscaling) and technologies (postgres, cloud storage).
  Because of the insane blog driven tech culture, many of these teams get questioned by clueless leadership (who read these blogs) and ask why the company isn’t using cassandra / some other hot technology. And it always causes much consternation and wastage.
  - rnts08 9 months ago
    
    Anyone wanting to introduce $new/$other language, database, library, deployment system, build system into a large enough system that doesn't solve any actual problem is a nightmare for someone working at this scale.
    I don't mind the scale, I like it. I don't like having to fend off questions and complaints why we aren't deploying the latest shiny new thing in our core this week.
  - secondcoming 9 months ago
    
    Well we use Cassandra (actually ScyllaDB) because Redis no longer cut it.
Twirrim 9 months ago

But that's where the really fun and complicated problems are. The ones that really make you stop and think, and not just think, but be creative.
95% of the work is still the same "treading in well trod paths", same old same old tech work, but that 5% is really something.
- Olreich 9 months ago
  
  This was a “double-pump” migration to a faster database and building a caching service. There’s nothing particularly fancy or creative about their solutions. The migration efforts and working out issues with the reverse table scan were probably way more creative, but they didn’t get into that unfortunately.
- pavel_lishin 9 months ago
  
  I think I can understand the appeal, but it's just not there for me. I have enough complicated problems outside of work, some of which are even fun to solve.
twelve40 9 months ago

I'm happy I'm currently not working at this scale. I'm not happy when idiots (including one of our self-important ex-Google VP's) set this as a benchmark for backend interviews (for careers that 99% likely will never come close to such problems).
mystified5016 9 months ago

Any time I read anything about any web-adjacent technology I'm incredibly thankful that I don't work anywhere near that industry.
Embedded can be complex, but web stuff is just a Lovecraftian nightmare in comparison
- milesvp 9 months ago
  
  I have stared into the abyss and seen the eyes of Cthulhu. I am much happier writing embedded drivers than I was trying to make sense of why previous devs thought it was a good idea to move bounded tunable server side api calls to the client, allowing it to effectively write arbitrary sql calls across multiple databases.
  - bdcravens 9 months ago
    
    Fortunately the web is starting (very slowly) to return to sanity, pushing back towards the simpler server-rendered pattern with Javascript being relegated to specific use cases.
    
    Aeolun 9 months ago
    
    I really like the client rendered UI part. It’s a lot more efficient than sending the whole page again every time.
    
    bdcravens 9 months ago
    
    Which is precisely what is meant by specific use cases. We don't have to throw out the first 25 years of the web and reimplement all of our business logic in a minified JS blob. Even when client side code is necessary, the trend of pushing rendered HTML rather than JSON that must be parsed and rendered keeps us as close to browser primitives as possible.
    
    Aeolun 9 months ago
    
    Why would you implement the business logic there? You can still keep (most of) that in the backend.
    The client just does orchestration.
    
    bdcravens 9 months ago
    
    Once you move beyond basic CRUD, business requirements work their way into the UI. For instance, making fields read-only based on access level. Adding additional form fields, etc. Conditionally hiding and showing entire portions of the UI. All of which requires you to either pass around UI-directives in your data or implement business logic in your client code. Better to just ship HTML, and if we're worried about full page loads, just use one of the many over-the-wire options to only change small bits of a page.
    This is before we get into having to implement application primitives like authentication on the client, and all of the state management that goes with. The absolute amount of scaffolding and plumbing we've built up just to save a few ms is always worth questioning. Doesn't mean the answer is no, just that we need to ask the question and not assume the default is carved in stone.
    
    gonzo41 9 months ago
    
    But you can cache the whole server side page and the cost is once. Whereas if you have the client side do the render then every client wears the cost.
    
    Aeolun 9 months ago
    
    That’s your generation that happens once. The browser still needs to render it. Sure, rendering it on the client may cost the client a bit more, but the client generally has the computational power to spare.
    
    bdcravens 9 months ago
    
    Which becomes a far more important issue when dealing with bandwidth or CPU constrained devices, or artificially imposed constraints due to data usage costs.
    
    asynchronous 9 months ago
    
    We can also cache some of the dynamic JavaScript, depending on the scenario but your point stands.
    
    iknowstuff 9 months ago
    
    You usually can't because of users who are signed in needing slightly different pages etc.
    
    bdcravens 9 months ago
    
    While not as fast as a purely client cached page, the server can selectively cache content, even when some bits of the page are dynamic.
  - qudat 9 months ago
    
    Iteration speed is significantly faster on the client. Perf is an afterthought, for better or worse.
    
    swyx 9 months ago
    
    spoken like someone who doesn't deploy clients at discord scale?
    the 200 backend nodes surely update significantly faster than the hundreds of millions of clients.
  - artursapek 9 months ago
    
    Sounds like a fun time lol
est 9 months ago

I am happy that I don't have to deal with this.
I am sad that my business isn't at this scale.
Aeolun 9 months ago

Honestly, 77 nodes doesn’t sound like a terrific scale? The more I scale things up, the more I realize that the tone of the problems doesn’t really change. You just get more layers to your data structures.

m-hodges 9 months ago

Fun article. Also fun to think about how many people have decided to document their crimes in these Cassandra nodes.

7bit 9 months ago

The blog post shows how great the technical expertise is at Discord. I work in IT and in my company devs are so incompetent, they don't even know how to create an M365/Azure dev tenant and constantly request *.ReadWrite.All to our production tenant. I'm so envious!

On the other hand, the HOME/END keys jump to the beginning of the input field rather than the line and the frontend devs are unable to fix this non-default behaviour for years, which makes it a fucking pain in the ass to use the Posts feature within a Discord channel. I believe the budget for the backend geniuses meant that frontend had to be juniors only.

crop_rotation 9 months ago

Hiring well is probably the most important thing for a company and also one of the hardest problems. I have seen a team of competent engineers outperform their sibling teams by 5-10x as long as each member of the team is good enough. Just 2 bad hires will slow down a team drastically. One terrible hire can do -5x the work of a normal engineer.
fastball 9 months ago

In their defense, Azure is terrible.
- 7bit 9 months ago
  
  I haven't found a difference to AWS, for example. They are all terrible in their own ways. But if one or the other is what you earn your money with, then at least put in the effort to be proficient with it, and not a complete dumbass. (Not you as in "you!")

andrewstuart 9 months ago

When you get to scale like this, I wonder if the access patterns of the application and its data might be best served by a custom data retrieval and storage application.

I may be wrong but I just wonder if efficiency is lost to the generalized nature of any data storage system.

The other question that comes to mind is, to what extent have the developers made a systematic effort to optimize how data is stored and retrieved? If you’re building a gigantic back end system and simply accepting that the system load is what it is then you might be missing a chance to dramatically impact the size of the task of managing that data.

lyu07282 9 months ago

They did give one example, if someone does a @everyone in a big channel, they specifically optimized their architecture to make that efficient using their custom data services.

znpy 9 months ago

Interesting read on one hand, a bit disappointing on the other: when the solution is just "we moved to this other product" it smells of a lack of serious and rigorous investigation.

Also, having worked with the JVM and with GC issues, I don't buy the "GC problems" point: there are a number of improvements in recent JVM releases, the main one being ZGC (and generational ZGC in particular).

ZGC is great. I've personally witnessed sub-millisecond GC pauses (and I mean sub-millisecond stop-the-world pauses) on machines serving millions of requests per second. Garbage Collection is largely a solved problem in the industry as of today, thanks to ZGC.

Other than this, comparing latencies for machines with 9TB disks against machines with 4TB disks is a bit like comparing apples and oranges: we will never know if issues at the storage layer were affecting tail latencies. Were the nodes having, I don't know, filesystem fragmentation issues? Does the 9TB storage configuration deliver higher IOPS than the previous 4TB storage configuration? Is it the same kind of hardware underneath (same disk type? same disk bus? or are we talking SSD vs NVMe?).

As somebody that's been doing performance engineering for work, this piece is a bit appalling.

Glad to see they've solved their issue though!

ozgrakkurt 9 months ago

GC is a problem, and it always will be at some level. You can improve it, but that doesn’t mean it is not a problem. Memory allocation and management is a problem even in C/C++ programs if you want to optimize your program. There is no universe where GC is not a problem.

tonetegeatinst 9 months ago

My love of embedded stuff is growing. I'm self-teaching C and assembly to get better at low-level programming and interactions with hardware, but it all seems much simpler than the big data systems. Granted, I'm sure it can all be broken down into steps and issues to solve like any programming issue, but I'm happy focusing on low-level stuff for now.

airocker 9 months ago

Just wondering if anyone considered using Postgres or another relational DB. I understand it won’t do multi-master replication as well, but it is much more stable and predictable if you give it the right amount of traffic. I guess the team had to do that part anyway for ScyllaDB.

crop_rotation 9 months ago

I don't think anyone runs Postgres at that scale (unless with a very specialized sharding setup). Given the choice between using ScyllaDB like everyone else and using Postgres in a super specialized best-in-the-world setup, the choice becomes clear. Also keep in mind that Discord is not a huge super profitable company, so for them to develop something like Vitess for Postgres would not make sense. For a small company with huge data like Discord, using existing data solutions makes a lot more sense.
- airocker 9 months ago
  
  They could use Vitess, Citus, or AlloyDB. They could use read replicas for read operations and a single master per shard for writes. They would get many SQL features (upgrades, referential integrity, etc.) for free. It would allow them to extend their business logic considerably.
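The replica-plus-single-master layout suggested above could be routed roughly like this. It is only a sketch: `Shard`, `ShardRouter`, and the connection names are hypothetical, the hash choice (CRC32 of the channel ID) is an assumption, and none of it reflects how Vitess, Citus, AlloyDB, or Discord actually route queries.

```python
import itertools
import zlib


class Shard:
    """One shard: a single primary for writes, replicas for reads (hypothetical)."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # Rotate through the replicas so reads are spread evenly.
        # With no replicas configured, reads fall back to the primary.
        self._readers = itertools.cycle(replicas or [primary])

    def reader(self):
        return next(self._readers)


class ShardRouter:
    """Pick a connection for a query based on the channel it touches."""

    def __init__(self, shards):
        self.shards = shards

    def shard_for(self, channel_id):
        # CRC32 is stable across processes, so a channel always
        # lands on the same shard (assumption: hash-based sharding).
        index = zlib.crc32(str(channel_id).encode()) % len(self.shards)
        return self.shards[index]

    def route(self, channel_id, is_write):
        shard = self.shard_for(channel_id)
        return shard.primary if is_write else shard.reader()
```

Writes for a given channel always go to that shard's one primary, while reads rotate across its replicas. Replication lag means a read may not see the latest write, which this sketch does not address.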

gigatexal 9 months ago

What a fun write up and a huge confidence building post for me in ScyllaDB.

yas_hmaheshwari 9 months ago

Does this article imply that you shouldn't use Cassandra, and should use ScyllaDB when you think you want Cassandra?

crakhamster01 9 months ago

Interesting technical read, but I appreciated the lighthearted jokes/comments the author threw in as well. Felt like they struck the right balance - nice work!

KaoruAoiShiho 9 months ago

Did they go with ScyllaDB just because it was compatible with Cassandra? Would it make sense to use a totally different solution altogether if they hadn't started with that?

jhgg 9 months ago

Yes, we wanted to migrate all our data stores away from Cassandra due to stability and performance issues. Moving to something that didn't have those issues (or at least had a different set of less severe issues) while also not having to rewrite a bunch of code was a positive.
- ericvolp12 9 months ago
  
  Did you guys end up redesigning the partitioning scheme to fit within Scylla's recommended partition sizes? I assume the tombstone issue didn't disappear with a move to Scylla but incremental compaction and/or STCS might have helped a bunch?
  - jhgg 9 months ago
    
    Nope. Didn't change the schema, mainly added read coalescing and used ICS. I think the big thing is when Scylla is processing a bunch of tombstones it's able to do so in a way that doesn't choke the whole server. Latest Scylla version also can send back partial/empty pages to the client to limit the amount of work per query that is run.
    
    ericvolp12 9 months ago
    
    Oh that's pretty neat. Did you just end up being okay with large partitions? I've been really afraid to let partition sizes grow beyond 100k rows even if the rows themselves are tiny but I'm not really sure how much of a real-world performance impact it has. It definitely complicates the data model to break the partitions up though.
    
    jhgg 9 months ago
    
    Yeah it just worked a lot better on scylla.

qntmfred 9 months ago

i usually start projects with postgres these days. i have reached the tens of millions of rows threshold without breaking a sweat, but is there any good reason postgres can't scale into the billions or trillions? any well known products at that scale that are known to use postgres?

mxscho 9 months ago

The raw amount of data alone is not enough of a metric to judge whether postgres is "enough". They seem to value horizontal scalability, e.g. in terms of write throughput, which is easier to handle with something like their solution compared to postgres.
bastawhiz 9 months ago

Postgres can pretty easily scale to billions or trillions of rows. It forces you to think carefully about how you query that data, though, and I think most beginners would find themselves in deep trouble jumping into the deep end.
- qntmfred 9 months ago
  
  > most beginners would find themselves in deep trouble jumping into the deep end
  probably true for any database platform. postgres probably easier for beginners than cassandra
postgresfan 9 months ago

The problem with Postgres is that you have to read the doc (boring), sometimes read database books beyond chapter 2 (lmao nerds).
This filters out 99% of software "engineers".
So it's better to use KookaburaDB, version 0.2 just got released. It's written in Rust, and it's modern of course (whatever that fucking means: config written in YAML I guess? Complicated build and deployment?)
- consteval 9 months ago
  
  Thank you postgresfan, it's nice to see a completely unbiased source on database technologies.
  Jokes aside, you're largely right. Postgres really does cover 99.9% of database usecases. But, I think Discord might still fall outside of that.
  The problem is that Discord's scale pretty much requires heavy sharding. While you can make this work with Postgres, you can tell it was never designed out of the gate for this.
  IMO, Discord isn't even taking it far enough. Working under the assumption every server is its own isolated pocket, I see no reason not to have 1 database (or database-like thing) per server. Then it's truly a distributed system, which matches Discord's business use cases. I often find matching business use cases to technology like this can greatly simplify architecture and reduce friction.

akimbostrawman 9 months ago

in cleartext

jaimehrubiks 9 months ago

Until they don't, or they can't, and they need to start deleting.

(Not trying to undermine the engineering efforts, or the welcome engineering blog posts though! I really think all of this is needed)

dobin 9 months ago

So the TL;DR is: Cassandra and ScyllaDB have bad performance when reading. So they put a cache in front.

jhgg 9 months ago

No cache. Just read coalescing. There is a big difference. Coalescing just ensures that while a query is executing if an identical query arrives, rather than sending the same query as an already executing query to the database it will wait for the existing query to complete and duplicate the result. If after this the same query arrives again, it will be issued against the database.
This means we don't have to deal with cache invalidation/consistency issues while also being able to handle thundering herds, for example a large server pinging @everyone and having a bunch of people click into the channel or launch their apps in response.
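A minimal sketch of the coalescing behaviour described above, using Python's asyncio. The `Coalescer` class and the `fake_query` stand-in for a database call are hypothetical names made up for illustration; this is not Discord's data service, only one way the idea could look.

```python
import asyncio


class Coalescer:
    """Share one in-flight query among identical concurrent requests."""

    def __init__(self, fetch):
        self._fetch = fetch    # hypothetical async function: key -> result
        self._in_flight = {}   # key -> asyncio.Task for the running query

    async def get(self, key):
        task = self._in_flight.get(key)
        if task is None:
            # No identical query is running: issue one and remember it.
            task = asyncio.ensure_future(self._fetch(key))
            self._in_flight[key] = task
            # Forget the task as soon as it finishes, so the next request
            # goes to the database again. Nothing is cached.
            task.add_done_callback(lambda _t: self._in_flight.pop(key, None))
        # shield() keeps one cancelled waiter from cancelling the shared query.
        return await asyncio.shield(task)


async def demo():
    calls = []

    async def fake_query(key):
        # Stand-in for a real database read (assumption for the demo).
        calls.append(key)
        await asyncio.sleep(0.01)
        return f"messages for {key}"

    coalescer = Coalescer(fake_query)
    # A "thundering herd": 100 identical requests arrive at once.
    results = await asyncio.gather(
        *(coalescer.get("channel-1") for _ in range(100))
    )
    queries_for_herd = len(calls)
    # A later request arrives after the shared query has completed.
    await coalescer.get("channel-1")
    return results, queries_for_herd, len(calls)
```

In `demo()`, the 100 concurrent requests share a single call to `fake_query`, and the request that arrives afterwards triggers a second call, because results are handed out only while the query is in flight and are never stored.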

pawelduda 9 months ago

Pretty fun read, even tho I'll never work at such scale lol

dancemethis 9 months ago

[flagged]

SupremumLimit 9 months ago

[flagged]

GrantMoyer 9 months ago

The post appears to consistently use past tense for things that were true in the past at time of writing, and present tense for things that are true in the present or are always true. So the use of tense appears to be valid, though not following commonly prescribed style.
phist_mcgee 9 months ago

Your question is rude and I hope you know that.
He's walking us through the process of designing the solution. Why wouldn't present tense work for this? We're discovering things with him as he takes us along for the journey.
nerdponx 9 months ago

No, what a ridiculous thing to say. Storytelling in the present tense is not new.

zombiwoof 9 months ago

I think it’s annoying that they interview engineers like they are Google when, reading the blog, they made it up and learned some basic “pitfalls” as they went along.

xyst 9 months ago

Having used Discord in the past, I found most of the conversations were just shit posts. Nothing serious. Why even bother storing a trillion messages of garbage in the first place?

huimang 9 months ago

Many people within niches have discord servers for researching and discussing specific things. There is a large wealth of information locked away behind them that can be lost pretty much whenever discord decides to start pursuing different monetization strategies.
adzm 9 months ago

Because that is literally what Discord is for
squigz 9 months ago

That sounds like it was a problem with the communities you engaged with.
jerryspringster 9 months ago

How do you sort the good from the bad? I'm sure most of my conversations were shit posts as well, but some weren't, especially when it came to figuring out how something new worked or how to fix a problem.
hypeatei 9 months ago

That's why I laugh when people say discord content needs to be indexed on the web so things are more discoverable. 99% is garbage and the useful messages are scattered across channels.
- retsibsi 9 months ago
  
  I'm not trying to be a smartarse but doesn't this describe the entire internet? The good stuff is rare and scattered, and that's why search is so important.
  - hypeatei 9 months ago
    
    At least with forums, there are dedicated pages for whatever is being discussed. Discord is just a collection of channels with topics being split up across multiple messages and shitposts in the middle.
  - jcgrillo 9 months ago
    
    Just wait until the LLM bots start arguing with each other on discord ;)
aurareturn 9 months ago

How do you differentiate shit posts vs quality ones if you’re Discord?

robertclaus 9 months ago

Very cool that even at this scale the right vanilla SQL database just works. No fancy document store, map-reduce, or GPU implementations needed.

salomonk_mur 9 months ago

How is ScyllaDB (the solution used in the article) a vanilla SQL DB? It's the complete opposite!
- melodyogonna 9 months ago
  
  The syntax is SQL
  - biorach 9 months ago
    
    That... doesn't necessarily mean that it's a "vanilla SQL server"
hinkley 9 months ago

It annoys me sometimes how effective B-trees are.
Every decade has some cool breakthrough in compression, and a handful of other disciplines. But OLTP databases are still basically better B-trees.
- menaerus 9 months ago
  
  LSM trees? ScyllaDB uses an LSM-based storage engine. RocksDB as well.
asjfkdlf 9 months ago

Aren’t they using a NoSQL store? They migrated from Cassandra to ScyllaDB.