# a billion rows later > Date: 2026-03-30 > Language: EN > Tags: rust, performance, simd, optimization, 1brc > URL: https://guswid.com/blog/1brc how a college assignment turned into a performance rabbit hole --- --- title: a billion rows later date: 2026-03-30 excerpt: how a college assignment turned into a performance rabbit hole tags: [rust, performance, simd, optimization, 1brc] --- # one billion rows, zero platform intrinsics about a year ago, my algorithms professor handed out a challenge: parse a file of weather station measurements, compute min/max/average per station, output sorted. then he mentioned it was a billion rows. i spent a few weeks thinking about what the problem actually required. came out with something that runs in under 1.5 seconds on my M3 Max, no platform intrinsics. ## the problem the [one billion row challenge](https://github.com/gunnarmorling/1brc) started as a java challenge in early 2024. the rules: a text file, 1 billion lines, each formatted as `;`. compute min/mean/max per station, print alphabetically sorted. it's a map-reduce problem with a fixed, known dataset: ~413 unique station names, temperatures between -99.9 and 99.9 with one decimal place. ### the dataset the file is roughly 13GB uncompressed. the repo includes a generator: ```bash title="terminal" cargo run --example generate 1000000000 ``` ### what knowing the dataset unlocks there are always ~413 stations and the file divides cleanly. not assumptions you'd make in a general-purpose tool, but here they're free wins. ## getting started ```bash title="terminal" git clone https://github.com/GustavoWidman/1brc cd 1brc cargo run --release ``` `.cargo/config.toml` sets `target-cpu=native` globally. build and benchmark on the same machine. ## bypassing the kernel with memmap2 the first question was how to read the file. `BufReader` copies kernel buffers into userspace on every read. `memmap2` skips that: the file maps directly into the process address space, the OS pages it on demand. ```rust title="file.rs" pub fn open(path: &str) -> anyhow::Result { let file = std::fs::File::options().read(true).open(path)?; Ok(Self { mmap: Arc::new(unsafe { memmap2::MmapOptions::new().huge(None).map(&file)? }), }) } ``` the `huge(None)` hint requests transparent huge pages where available, which cuts TLB pressure on large sequential reads. ## never parse a float you don't need to temperatures like `-12.3` only ever appear with one decimal place. no reason to touch `f64`. store them as `i16` times 10 and divide by 10 at output only. integer arithmetic is faster and `i16` is half the size of `f32`. ```rust title="file.rs" fn parse_fake_float(mut bytes: &[u8]) -> i16 { let negative = unsafe { *bytes.get_unchecked(0) } == b'-'; if negative { bytes = &unsafe { bytes.get_unchecked(1..) }; } let mut val = 0i16; for &byte in bytes { if byte == b'.' { continue; } val = val * 10 + (byte - b'0') as i16; } if negative { -val } else { val } } ``` the measurement struct is `min: i16`, `max: i16`, `sum: i64`, `count: usize`. `i64` on the sum prevents overflow across a billion additions. the divide-by-10 happens once in `into()`. ### fitting in cache 24 bytes per measurement entry, pre-allocated to `STATIONS_IN_DATASET * 2 = 826` slots. the whole table fits in a couple of cache lines during the merge phase. ## parallel chunking with rayon the file splits into equal-sized chunks, one per worker. the catch is alignment: you can't cut at an arbitrary byte offset without splitting a line, so each chunk boundary scans forward to the next newline: ```rust title="file.rs" fn chunk_file(&self) -> Vec<&[u8]> { let chunk_size = total_size / NUM_WORKERS; let mut start = 0; while start < total_size { let mut end = (start + chunk_size).min(total_size); if end < total_size { while end < total_size && buffer[end] != b'\n' { end += 1; } end += 1; } chunks.push(&buffer[start..end]); start = end; } } ``` `NUM_WORKERS` is hardcoded to 14, the M3 Max core count. each chunk runs on its own thread via rayon and produces a local hashmap. merge at the end: ```rust title="file.rs" let measurements = chunks .into_par_iter() .map(|chunk| Self::parse_buffer(chunk)) .reduce( || HashMap::new(), |mut left, right| { left.merge(right); left }, ); ``` ### the merge `merge` iterates the smaller map and folds each entry into the accumulator. all keys are station names bounded to ~413, so the merged map never resizes. ## the hashmap `hashbrown` with `ahash`, pre-sized to `STATIONS_IN_DATASET * 2`: ```rust title="hashmap.rs" pub fn new() -> Self { Self { inner: InnerHashMap::with_capacity_and_hasher( STATIONS_IN_DATASET * 2, ahash::RandomState::new(), ), } } ``` `hashbrown` resizes when load factor hits 0.875. at 826 slots for 413 keys, the map never rehashes across a billion insertions. ### memchr for the hot loop finding the semicolon on every row is the hottest single operation. `memchr` does it in two lines. AVX2 on x86, NEON on arm, no architecture-specific code on my end: ```rust title="file.rs" let comma_separator = memchr::memchr(b';', &buffer).unwrap(); let end = memchr::memchr(b'\n', &buffer[comma_separator..]).unwrap(); ``` ## where it landed on my M3 Max (14 cores), warm cache: **~1.4 seconds**. ![m3 max warm cache run at 1.415s, 14 workers](./imgs/original-numbers.png) no platform intrinsics, no unsafe hot path. just the right abstractions doing their job. ## an autonomous agent takes a swing near the end of the project's first year, a friend was testing an early version of an autonomous coding agent he's been building, a project he hasn't shipped yet. he pointed it at this repo with a "make it better" prompt and let it run. the agent came back claiming 18.2 seconds to 1.66 seconds, a 110x speedup. the commit message was confident. the changes were 100% x86_64 platform intrinsics, completely incompatible with my M3 Max. it ran, but on scalar fallbacks. a follow-up PR added proper aarch64 support. that one actually ran natively. ![m3 max run after aarch64 pr at 1.240s, 14 workers](./imgs/post-colton.png) the result: 1.415s to 1.240s. 0.2 seconds over the generalist version, only after two PRs and needing platform-specific intrinsics to get there. the agent isn't public yet, but i'm watching it. the original "110x" claim lives on in git history. ## where it breaks `NUM_WORKERS = 14` is hardcoded. running on a different machine won't adapt. the hashmap is pre-sized for 413 known stations. a different dataset with more unique keys causes rehashing. malformed input produces wrong output, no panic, no error. this is a benchmarking tool, not a general-purpose one. it only works because i know exactly what the input looks like. code is here: [github.com/GustavoWidman/1brc](https://github.com/GustavoWidman/1brc)