go / runtime
The Go runtime is compiled into every Go binary.
It provides the memory allocator, garbage collector, goroutine scheduler,
and everything that makes go, channels, and defer work.
This article is my compressed notes on the Internals for Interns four-part series, which I verified against the Go 1.26 source. It ends with practical implications for day-to-day Go programming.
Bootstrap
Before func main() runs, the runtime:
- Creates g0 (the runtime's housekeeping goroutine) and m0 (the first OS thread)
- Sets up Thread-Local Storage so each thread knows which goroutine it's running
- Detects CPU features (AES for map hashing, etc.)
- Initializes the stack pool, memory allocator, and type/interface tables
- Creates P structs (one per GOMAXPROCS, defaulting to CPU count)
- Spawns the sysmon background thread
- Creates a goroutine for runtime.main, which calls your main.main
Memory allocator
The allocator sits between your program and the OS.
It grabs large arenas (64MB on 64-bit systems) via mmap
and subdivides them into 8KB pages, which are grouped into spans.
Each span holds objects of a single size class. There are 68 size classes from 8 bytes to 32KB. A 50-byte allocation rounds up to 64 bytes and fills a slot in a span of 64-byte slots. Objects larger than 32KB are allocated directly from the heap.
Allocation uses a three-level cache:
- mcache (per-P, no locks) — the fast path
- mcentral (per-size-class, shared) — refills mcache
- mheap (global page allocator) — refills mcentral
Most allocations hit the mcache and need no locks.
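The size-class rounding is observable from user code. Here's a rough sketch of my own (not from the series) that measures heap growth for one small allocation via runtime.MemStats; exact numbers depend on the Go version:

```go
package main

import (
	"fmt"
	"runtime"
)

var sink []byte // package-level sink forces the allocation onto the heap

// allocSize reports how many bytes the heap actually grew for an
// n-byte allocation, exposing the size-class rounding.
func allocSize(n int) uint64 {
	var before, after runtime.MemStats
	runtime.GC() // settle the heap so the TotalAlloc delta is clean
	runtime.ReadMemStats(&before)
	sink = make([]byte, n)
	runtime.ReadMemStats(&after)
	return after.TotalAlloc - before.TotalAlloc
}

func main() {
	fmt.Println(allocSize(50)) // rounds up to the 64-byte size class
}
```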
Scheduler
The scheduler multiplexes goroutines onto OS threads using three structures (the GMP model):
- G (goroutine): lightweight, starts with a 2KB stack
- M (machine): an OS thread that executes code
- P (processor): a scheduling context with a local run queue, memory cache, and GC state
An M must acquire a P to run Go code.
When an M blocks on a syscall, the P detaches and moves to a free M,
keeping work moving.
Each P has a 256-slot local run queue plus a runnext fast slot.
Idle Ps steal work from other Ps' queues.
The sysmon thread runs in the background to preempt
long-running goroutines and retake Ps stuck in syscalls.
Garbage collector
Go uses a concurrent, non-moving, tri-color mark-and-sweep collector. Collection runs in four phases:
- Sweep termination: brief stop-the-world to finish prior sweeps and enable the write barrier
- Mark: concurrent; traces the object graph using ~25% CPU
- Mark termination: brief stop-the-world to disable the write barrier and swap bitmaps
- Sweep: concurrent; frees unmarked objects
The write barrier intercepts pointer writes during marking so the GC doesn't miss reachable objects that your code is moving around. Go uses a hybrid Yuasa-Dijkstra barrier that shades both old and new pointer targets.
Go 1.26 introduces the Green Tea GC (GOEXPERIMENT=greenteagc),
which improves mark phase locality by batching objects within the same
span before scanning them.
So what
Knowing these internals changes how I write Go in a few concrete ways.
Reduce heap allocations
Every heap allocation is work the GC must trace. The compiler's escape analysis decides what goes on the heap. Check its decisions:
go build -gcflags='-m' ./...
Common ways to keep things off the heap:
- Return values instead of pointers when the value is small
- Pre-size slices and maps when you know the length
- Reuse buffers with sync.Pool for hot-path temporary objects
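For the sync.Pool point, a minimal sketch (the names here are mine): a pooled bytes.Buffer is reset and returned after each use, so hot-path calls stop allocating fresh buffers.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable buffers for hot-path formatting.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

func render(name string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	defer func() {
		buf.Reset() // return a clean buffer so the next Get can reuse it
		bufPool.Put(buf)
	}()
	fmt.Fprintf(buf, "hello, %s", name)
	return buf.String()
}

func main() {
	fmt.Println(render("gopher"))
}
```

Note that pooled objects can be dropped by the GC at any time, so a pool is a cache, not a free list with guarantees.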
Prefer values over pointers in collections
The GC scans every pointer in a live object.
A []User where User has no pointer fields is cheaper
for the GC than a []*User, because the collector
doesn't need to chase each element.
When structs are small, pass and store them by value.
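The allocation difference is visible with testing.AllocsPerRun. A sketch (the User type and sink variables are my illustration):

```go
package main

import (
	"fmt"
	"testing"
)

// User has no pointer fields, so a []User is one pointer-free block
// the GC never has to scan element by element.
type User struct {
	ID    int64
	Score float64
}

var (
	sinkUsers []User  // package-level sinks keep the slices on the heap
	sinkPtrs  []*User
)

func allocsByValue() float64 {
	return testing.AllocsPerRun(100, func() {
		sinkUsers = make([]User, 1000) // one allocation, nothing to trace
	})
}

func allocsByPointer() float64 {
	return testing.AllocsPerRun(100, func() {
		s := make([]*User, 1000)
		for i := range s {
			s[i] = &User{ID: int64(i)} // 1000 separate objects to chase
		}
		sinkPtrs = s
	})
}

func main() {
	fmt.Println(allocsByValue(), allocsByPointer()) // e.g. 1 vs 1001
}
```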
Tune GC for your workload
The GC triggers based on heap growth. Two knobs control it (see the GC guide):
- GOGC (default 100): the heap can grow 100% before the next GC. Raise it to trade memory for fewer GC cycles. Lower it for tighter memory at the cost of more CPU.
- GOMEMLIMIT: a soft memory ceiling. The GC works harder to stay under it. Useful when you know your memory budget.
import "runtime/debug"
// Set at startup, or via environment variables
// GOGC=200 GOMEMLIMIT=512MiB ./myapp
debug.SetGCPercent(200)
debug.SetMemoryLimit(512 << 20)
Don't fear goroutines, but don't ignore them
Goroutines are cheap (~2KB stack, recycled when done) but each live goroutine is a GC root the collector must scan. Millions of goroutines are fine if they're short-lived. Millions of long-lived, blocked goroutines holding pointers add GC pressure.
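The ~2KB figure is measurable. A sketch of mine: park a batch of goroutines and divide the growth in MemStats.StackInuse by the count (exact numbers vary by Go version and platform):

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// stackPerGoroutine parks n goroutines and reports the approximate
// stack bytes each one costs, read from MemStats.StackInuse.
func stackPerGoroutine(n int) uint64 {
	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)

	stop := make(chan struct{})
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			<-stop // parked: cheap, but still a GC root
		}()
	}
	runtime.ReadMemStats(&after) // stacks exist as soon as the g is created
	close(stop)
	wg.Wait()
	return (after.StackInuse - before.StackInuse) / uint64(n)
}

func main() {
	fmt.Printf("~%d bytes per parked goroutine\n", stackPerGoroutine(10_000))
}
```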
Use the profiling tools
go test -bench=. -benchmem # allocations per op
go test -cpuprofile cpu.prof # CPU profile
go test -memprofile mem.prof # memory profile
go tool pprof cpu.prof # analyze
GODEBUG=gctrace=1 ./myapp # GC cycle log
go tool trace trace.out # scheduler/GC timeline
-benchmem is especially useful: if allocs/op drops to zero,
you've moved everything to the stack.
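The same allocs/op number is available programmatically via testing.Benchmark, which approximates what go test -benchmem reports. A sketch (the benchmarked function is my pick):

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// joinResult benchmarks strings.Join, which allocates once per call
// for the result string.
func joinResult() testing.BenchmarkResult {
	return testing.Benchmark(func(b *testing.B) {
		parts := []string{"alpha", "beta", "gamma"}
		b.ReportAllocs() // record allocation stats in the result
		for i := 0; i < b.N; i++ {
			_ = strings.Join(parts, ",")
		}
	})
}

func main() {
	fmt.Println(joinResult().AllocsPerOp(), "allocs/op")
}
```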
GOMAXPROCS rarely needs changing
It defaults to the number of CPU cores,
which is the right answer for most workloads.
Raising it above the core count doesn't add throughput:
GOMAXPROCS only controls how many Ps exist (and thus how many
goroutines can run truly in parallel), and parallelism is
capped by the hardware anyway.
More Ms can exist than Ps to handle blocking syscalls.
Channel patterns get scheduler help
When a goroutine sends on a channel and the receiver is ready,
the scheduler uses the runnext slot to run the receiver
immediately on the same P.
This means tight producer-consumer pairs on channels
have low scheduling latency by design.
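A minimal producer-consumer pair of my own making, to ground the pattern: each send on the unbuffered channel finds a parked receiver, which the scheduler can wake through runnext on the same P.

```go
package main

import "fmt"

// pipe pushes n values through an unbuffered channel to a single
// consumer and returns how many the consumer received. Each send
// hands off directly to the ready receiver.
func pipe(n int) int {
	ch := make(chan int)
	counted := make(chan int)
	go func() {
		count := 0
		for range ch {
			count++
		}
		counted <- count
	}()
	for i := 0; i < n; i++ {
		ch <- i // unbuffered: blocks until the receiver takes the value
	}
	close(ch)
	return <-counted
}

func main() {
	fmt.Println(pipe(5)) // 5
}
```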