Tiled Hacker news on React Router

A 40-line fix eliminated a 400x performance gap

370 points - 01/13/2026


ot
01/14/2026
You can go even faster, to about 8ns (almost an additional 10x improvement), by using software perf events: PERF_COUNT_SW_TASK_CLOCK is thread CPU time, it can be read through a shared page (so no syscall, see perf_event_mmap_page), and then you add the delta since the last context switch with a single rdtsc call within a seqlock.
This is not well documented unfortunately, and I'm not aware of open-source implementations of this.
EDIT: Or maybe not, I'm not sure if PERF_COUNT_SW_TASK_CLOCK allows selecting only user time. The kernel can definitely do it, but I don't know if the wiring is there. However, this definitely works for overall thread CPU time.
shermantanktop
01/14/2026
Flamegraphs are wonderful.
Me: looks at my code. "sure, ok, looks alright."
Me: looks at the resulting flamegraph. "what the hell is this?!?!?"
I've found all kinds of crazy stuff in codebases this way. Static initializers that aren't static, one-line logger calls that trigger expensive serialization, heavy string-parsing calls that don't memoize patterns, etc. Unfortunately some of those are my fault.
jerrinot
01/13/2026
Author here. After my last post about kernel bugs, I spent some time looking at how the JVM reports its own thread activity. It turns out that "What is the CPU time of this thread?" is/was a much more expensive question than it should be.
jonasn
01/14/2026
Author of the OpenJDK patch here.
Thanks for the write-up Jaromir :) For those interested, I explored memory overhead when reading /proc—including eBPF profiling and the history behind the poorly documented user-space ABI.
Full details in my write-up: https://norlinder.nu/posts/User-CPU-Time-JVM/
furyofantares
01/14/2026
> Flame graph image
> Click to zoom, open in a new tab for interactivity
I admit I did not expect "Open Image in New Tab" to do what it said on the tin. I guess I was aware that it was possible with SVG but I don't think I've ever seen it done and was really not expecting it.
pjmlp
01/14/2026
Which goes to show that code written in C, C++, or any other systems language isn't automatically blazing fast; it depends on what is being done.
Very interesting read.
higherhalf
01/14/2026
clock_gettime() goes through vDSO, avoiding a context switch. It shows up on the flamegraph as well.
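For readers who want to try the call being discussed, here is a minimal sketch in Python, whose `time.clock_gettime_ns` wraps the same `clock_gettime()` function. It assumes a Unix-like system where `time.CLOCK_THREAD_CPUTIME_ID` is available. The helper names `thread_cpu_ns` and `burn` are made up for illustration. The sketch only shows how to read the calling thread's CPU time; it does not show whether a given clock id is served without entering the kernel, which depends on the clock and the kernel version.

```python
import time


def thread_cpu_ns() -> int:
    """CPU time (user + system) consumed by the calling thread, in nanoseconds."""
    # Assumption: CLOCK_THREAD_CPUTIME_ID exists on this platform (Linux and most Unixes).
    return time.clock_gettime_ns(time.CLOCK_THREAD_CPUTIME_ID)


def burn(iterations: int) -> int:
    """Hypothetical busy loop, used only to consume some CPU time."""
    total = 0
    for i in range(iterations):
        total += i * i
    return total


if __name__ == "__main__":
    before = thread_cpu_ns()
    burn(200_000)
    after = thread_cpu_ns()
    print(f"thread CPU time used: {after - before} ns")
```

Note that this clock reports user plus system time together; selecting user time only is a separate question.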
goodroot
01/14/2026
The QuestDB team are among the best doing it.
Love the people and their software.
Great blog Jaromir!
burnt-resistor
01/14/2026
I really wished™ there was an API/ABI for userland- and kernelland-defined individual virtual files at arbitrary locations, backed by processes and kernel modules respectively. I've tried pipes, overlays, and FUSE to no avail. It would greatly simplify configuration management implementations while maintaining compatibility with the convention of plain text files, and there's often no need to have an actual file on any media or the expense of IOPS.
While I don't particularly like the IO overhead and churn consequences of real files for performance metrics, I get the 9p-like appeal of treating the virtual fs as a DBMS/API/ABI.
otterley
01/14/2026
It took seven years to address this concern following the initial bug report (2018). That seems like a lot, considering how instrumenting CPU time can be in the hot path for profiled code.
Ono-Sendai
01/14/2026
"look, I'm sorry, but the rule is simple: if you made something 2x faster, you might have done something smart; if you made something 100x faster, you definitely just stopped doing something stupid"
https://x.com/rygorous/status/1271296834439282690
ee99ee
01/14/2026
This is such a great writeup
squirrellous
01/14/2026
Does anyone knowledgeable know whether it’s possible to drastically reduce the overhead of reading from procfs? IIUC everything in it is in-memory, so there’s no real reason reading some data should take on the order of 10us.
mgaunard
01/14/2026
Obviously a vdso read is going to be significantly faster than a syscall switching to the kernel, writing serialized data to a buffer, switching back to userland, and parsing that data.
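To make the "serialize, then parse" half of that path concrete, here is a rough sketch of the user-space side in Python. It assumes Linux, the `/proc/thread-self/stat` file, and the field layout from the proc(5) man page (utime is field 14, stime is field 15, both in clock ticks). The function names are hypothetical, and this is not the JVM's code, just an illustration of the kind of text parsing a procfs read implies.

```python
import os
from typing import Tuple


def parse_stat_cpu_ticks(stat_line: str) -> Tuple[int, int]:
    """Extract (utime, stime) in clock ticks from one /proc/.../stat line."""
    # Field 2 (comm) is wrapped in parentheses and may itself contain spaces
    # or ')', so split after the LAST ')' rather than on whitespace alone.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so field 14 is rest[11] and field 15 is rest[12].
    return int(rest[11]), int(rest[12])


def thread_cpu_seconds_via_procfs(
    path: str = "/proc/thread-self/stat",
) -> Tuple[float, float]:
    """Return (user_seconds, system_seconds) for the calling thread (Linux only)."""
    # Opening and reading the file makes the kernel format the values as text.
    with open(path) as f:
        utime, stime = parse_stat_cpu_ticks(f.read())
    ticks_per_second = os.sysconf("SC_CLK_TCK")
    return utime / ticks_per_second, stime / ticks_per_second


if __name__ == "__main__":
    user_s, sys_s = thread_cpu_seconds_via_procfs()
    print(f"user={user_s:.2f}s system={sys_s:.2f}s")
```

Every call here opens a file, has the kernel format a full line of text, copies it to user space, and then parses integers back out, all to recover two numbers.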
xthe
01/14/2026
This is a great example of how a small change in the right place can outweigh years of incremental tuning.
amelius
01/14/2026
It's kinda crazy the amount of plumbing required to get a few bits across the CPU.
tomiezhang
01/14/2026
cool
