This is very much worth watching. It is a tour de force.
Laurie does an amazing job of reimagining Google's strange job optimisation technique (for jobs reading from hard disk storage) that runs the same work on two machines at once. The technique simply takes the result of whichever machine finishes first, discarding the slower one's result... It seems expensive in resources, but it works and lets high-priority tasks complete with low tail latency.
Laurie re-imagines this process but for RAM!! In doing this she needs to deal with cores, RAM channels and other relatively undocumented CPU memory-management features.
She was even able to work out various undocumented CPU/RAM settings by using her tool to find where timing differences exposed them.
She's turned "Tailslayer" into a library now, available on GitHub: https://github.com/LaurieWired/tailslayer
You can see her having so much fun, doing cool victory dances as she works out ways of getting around each of the issues that she finds.
The experimentation, explanation and graphing of results are fantastic. Amazing stuff. Perhaps someone will use this somewhere?
As mentioned in the YT comments, the work done here is probably a Master's degree's worth of work, experimentation and documentation.
Go Laurie!
throwaway81523
today at 7:12 AM
This is a 54 minute video. I watched about 3 minutes and it seemed like some potentially interesting info wrapped in useless visuals. I thought about downloading and reading the transcript (that's faster than watching videos), but it seems to me that it's another video that would be much better as a blog post. Could someone summarize in a sentence or two? Yes we know about the refresh interval. What is the bypass?
Update: found the bypass via the youtube blurb: https://github.com/LaurieWired/tailslayer
"Tailslayer is a C++ library that reduces tail latency in RAM reads caused by DRAM refresh stalls.
"It replicates data across multiple, independent DRAM channels with uncorrelated refresh schedules, using (undocumented!) channel scrambling offsets that works on AMD, Intel, and Graviton. Once the request comes in, Tailslayer issues hedged reads across all replicas, allowing the work to be performed on whichever result responds first."
kelsolaar
today at 9:05 AM
The video could be shorter, and some of the goofiness might not please people who are pressed for time, but that is also what makes it fresh and helps it stand out.
fc417fc802
today at 8:20 AM
> using (undocumented!) channel scrambling offsets that works on AMD, Intel, and Graviton
Seems odd to me that all three architectures implement this yet all three leave it undocumented. Is it intended as some sort of debug functionality or what?
it's explained in the video, and there's no way I'll be explaining it better than her
satvikpendem
today at 7:18 AM
Just use the Ask button on YouTube videos to summarize, that's what it's for.
dspillett
today at 9:49 AM
Not complaining about this particular presenter: it is an interesting video with some decent content, I don't find the presentation style overly irritating, and it documents a lot of work that has obviously gone into experimentation to get the end result (rather than just summarising someone else's work). Such a goofy, elongated style, infuriating as it is if you are looking for quick hard information, is practically required in order to drive wider interest in the channel.
But the “ask the LLM” thing is a sign of how off kilter information passing has become in the current world. A lot of stuff is packaged deliberately inefficiently because that is the way to monetise it, or sometimes just to game the searching & recommendation systems so it gets out to potentially interested people at all, then we are encouraged to use a computationally expensive process to summarise that to distil the information back out.
MS's documentation for large chunks of Azure is that way, but with even less excuse (they aren't a content creator needing to drive interest by being a quirky presenter as well as a potential information source). Instead of telling me to ask Copilot to guess what I need to know, why not write some good documentation that you can reference directly (or that I can search through)? Heck, use Copilot to draft that documentation if you want to (but please have humans review the result for hallucinations, missed parts, and other inaccuracies before publishing).
Unnecessarily negative imo.
I like the video because I can't read a blog post in the background while doing other stuff, and I like Gadget Hackwrench narrating semi-obscure CS topics lol
fc417fc802
today at 8:32 AM
> I cant read a blog post in the background
You can consume technical content in the background?
saidnooneever
today at 9:40 AM
this is a thing people do: convince themselves they can consume technical content subconsciously. that's not how the brain works though; it just gives you the feeling that you are following something.
I hope this approach gets some visibility in the CPU field. It could obviously be improved with a special CPU instruction which simply races two reads and returns whichever one succeeds first. She's doing an insane amount of work, spinning up multiple threads and so on (and burning lots of performance), all to work around the lack of dedicated support for this in silicon.
>> It replicates data across multiple, independent DRAM channels with uncorrelated refresh schedules
This is the sort of thing that was done before in the NUMA world, but that is easy: just taskset and mbind your way around it to keep a copy in both places.
The crazy part of what she's done is determining that the two copies don't get hit by refresh cycles at the same time.
Particularly by experimenting on something proprietary like Graviton.
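For comparison, the NUMA version of that replication really is about this simple (a rough libnuma sketch, assuming Linux and linking with -lnuma):

    // Keep one copy of the data per NUMA node; readers on a node then use the
    // local copy, or race the copies the same way Tailslayer races channels.
    // Rough sketch only; node counts and allocation policy are assumptions.
    #include <numa.h>       // numa_available, numa_max_node, numa_alloc_onnode, numa_free
    #include <cstddef>
    #include <cstring>
    #include <vector>

    std::vector<void*> replicate_across_nodes(const void* src, std::size_t len) {
        std::vector<void*> copies;
        if (numa_available() < 0) return copies;         // kernel built without NUMA support
        for (int node = 0; node <= numa_max_node(); ++node) {
            void* copy = numa_alloc_onnode(len, node);   // pages physically backed on `node`
            if (!copy) continue;
            std::memcpy(copy, src, len);
            copies.push_back(copy);                      // free later with numa_free(ptr, len)
        }
        return copies;
    }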
She determines that by having three copies. Or four. Or eight.
'Tis just probabilities: the unlikelihood of hitting a refresh cycle on that many memory channels all at once.
GeneralMayhem
today at 7:16 AM
Right, but the impressive part is finding addresses that are actually on different memory channels.
Surprising to me that two memory channels are separated by as little as 256 bytes. The short distance makes it easier to find, surely?
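For anyone wondering what "finding" them looks like in practice, the basic userspace measurement is roughly flush-then-time (x86-only sketch; the video's methodology is far more careful than this):

    // Time an uncached read: flush the line, fence, then bracket the load with
    // rdtscp. Repeating this for candidate addresses and checking whether their
    // latency spikes (refresh stalls) line up is one way to infer channel
    // behaviour from userspace. Illustrative only.
    #include <cstdint>
    #include <x86intrin.h>   // _mm_clflush, _mm_mfence, __rdtscp

    uint64_t time_uncached_read(const uint64_t* p) {
        unsigned aux;
        _mm_clflush(p);                              // evict the line so the next read goes to DRAM
        _mm_mfence();                                // wait for the flush to complete
        uint64_t t0 = __rdtscp(&aux);
        uint64_t v = *(volatile const uint64_t*)p;   // the actual DRAM access
        uint64_t t1 = __rdtscp(&aux);
        (void)v;
        return t1 - t0;                              // cycles; a spike suggests a refresh (or other) stall
    }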
> Google's strange job optimisation technique (for jobs running on hard disk storage)
Can you give more context on this? Opus couldn't figure out a reference for it
why_only_15
today at 7:40 AM
This is a quite old technique. The idea, as I understood it, was that lots of data at Google was stored in triplicate for reliability purposes. Instead of fetching one, you fetched all three and then took the one that arrived first. Then you sent UDP packets cancelling the other two. For something like search where you're issuing hundreds of requests that have to resolve in a few hundred milliseconds, this substantially cut down on tail latency.
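For the reference question above: this is the "hedged requests" idea written up in Dean and Barroso's "The Tail at Scale". A toy version of the fetch-all-and-cancel pattern (the RPC calls here are fake stand-ins; the structure is the point):

    // Race N replicas of the same request; take the first answer and tell the
    // rest to stop. fetch_replica/cancel_replica are fake stand-ins for a real
    // RPC layer (here a random sleep and a no-op).
    #include <chrono>
    #include <condition_variable>
    #include <iostream>
    #include <mutex>
    #include <random>
    #include <string>
    #include <thread>
    #include <vector>

    std::string fetch_replica(int replica) {
        static thread_local std::mt19937 rng{std::random_device{}()};
        std::this_thread::sleep_for(std::chrono::milliseconds(rng() % 100));
        return "payload from replica " + std::to_string(replica);
    }
    void cancel_replica(int /*replica*/) {}   // would send the "stop working on this" message

    std::string fetch_first_of(int n_replicas) {
        std::mutex m;
        std::condition_variable cv;
        int winner = -1;
        std::string result;
        std::vector<std::thread> reqs;
        for (int i = 0; i < n_replicas; ++i) {
            reqs.emplace_back([&, i] {
                std::string r = fetch_replica(i);          // blocking request to replica i
                std::lock_guard<std::mutex> lk(m);
                if (winner == -1) { winner = i; result = std::move(r); cv.notify_one(); }
            });
        }
        {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return winner != -1; });     // wait only for the first response
            for (int i = 0; i < n_replicas; ++i)
                if (i != winner) cancel_replica(i);        // losers are told to stop
        }
        for (auto& t : reqs) t.join();                     // losers return once their call completes
        return result;
    }

    int main() { std::cout << fetch_first_of(3) << "\n"; }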
Aha, that makes more sense; I thought it was specifically to do with job scheduling from the description. You can do something similar at home as a poor man's CDN by racing requests to regionally replicated S3 buckets. Also Happy Eyeballs (the IPv4/IPv6 race done in browsers, and I think also for QUIC/HTTP selection) works pretty much the same way.
Tournament parallelism is the technical term IIRC.
I like the video, but this is hardly groundbreaking. You send out two or more messengers hoping at least one of them will get there on time.
Yeah. These are literally just mainframe techniques from yesteryear.
actionfromafar
today at 8:09 AM
Almost everything "new" was invented by IBM it seems like. And it goes by a completely different name there. It's still nice to rediscover what they knew.
and dropbox was just rsync
UltraSane
today at 4:32 AM
The clever part is figuring out what RAM is controlled by which controllers.
I have to say that using drawbridges and differently colored rail pieces to explain it was very clever.
saidnooneever
today at 8:23 AM
everyone says this but no one says why it was clever. i find her videos have cool results but i can't usually have the patience for them because it's recycled old stuff (can be cool but it's not groundbreaking).
there is a ton of info you can pull from SMBIOS, ACPI, MSRs, CPUID, etc. about CPU/RAM topology, connectivity, latencies and so on.
isn't the info on which controller/RAM relationships exist somewhere in there, provided by firmware or the platform?
i can hardly imagine it is not just plainly in there among the plethora of info...
there's SRAT/SLIT/HMAT etc. in ACPI, then there are MSRs with info (AMD exposes more than Intel ofc, as always), and then there are registers on the memory controller itself as well as the socket-to-socket interconnects (UPI links)...
it's just a lot of reading and finding bits here and there. LLMs are actually really good at pulling all sorts of stuff from various 6-10k page documents if you are too lazy to dig yourself -_-
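fwiw the node-level part is indeed already exposed: the kernel publishes the ACPI SLIT distances under sysfs (Linux sketch below). what it doesn't give you is the per-channel interleave/scrambling, which as far as I can tell from the video is exactly the bit she had to measure.

    // Print the ACPI SLIT (node-to-node distance) matrix that the Linux kernel
    // exposes under sysfs. This covers NUMA topology only; per-channel mapping
    // is not in here, hence the measurement approach in the video.
    #include <fstream>
    #include <iostream>
    #include <string>

    int main() {
        for (int node = 0; ; ++node) {
            std::ifstream f("/sys/devices/system/node/node" + std::to_string(node) + "/distance");
            if (!f) break;                       // no more nodes
            std::string row;
            std::getline(f, row);                // e.g. "10 21" on a two-socket box
            std::cout << "node" << node << ": " << row << "\n";
        }
    }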