Linux System Calls: The User-Kernel Interface


Understand Linux system calls through interactive visualizations. Learn how user programs communicate with the kernel, protection rings, and syscall performance.


The Gateway to the Kernel

Every time your program needs to interact with hardware, read a file, or create a process, it must ask the kernel for help. But user programs can't directly access kernel memory or execute privileged instructions - that would be chaos! Instead, they use system calls - the carefully controlled gateway between user space and kernel space.

Think of system calls as a restaurant's service window. Customers (user programs) can't walk into the kitchen (kernel space) - that would be unsafe and chaotic. Instead, they place orders through the window (system calls), and the kitchen staff (kernel) fulfills those requests with proper safety checks and resource management.

Let's explore this fascinating boundary where user programs meet the almighty kernel.

Interactive System Call Visualization

[Interactive visualization: "File Read: read() System Call", a 13-step walkthrough from user-space preparation (Ring 3) to kernel execution (Ring 0) and back, showing the rax, rdi, rsi, and rdx registers at each step. Step 1: the application calls the C library wrapper read(fd, buffer, count) with fd = 3, a user-memory buffer at 0x7fff1234, and a count of 1024 bytes.]

Understanding System Calls

What Are System Calls?

System calls are the only way user programs can request kernel services:

  • File operations: open(), read(), write(), close()
  • Process management: fork(), execve(), exit(), wait4()
  • Memory: mmap(), brk(), munmap()
  • Network: socket(), connect(), send(), recv()
  • Devices: ioctl()
  • Signals: kill(), signal(), sigaction()

Why needed? Modern CPUs enforce privilege separation. User code runs in Ring 3 (restricted), kernel runs in Ring 0 (full access). System calls are the bridge.

The System Call Table

The kernel maintains sys_call_table[], an array of function pointers indexed by syscall number. On x86-64:

  • sys_call_table[0] = sys_read
  • sys_call_table[1] = sys_write
  • sys_call_table[2] = sys_open
  • sys_call_table[57] = sys_fork
  • sys_call_table[59] = sys_execve
  • Several hundred syscalls in total on modern Linux (x86-64 syscall numbers now extend past 450)

Each architecture (x86_64, ARM, etc.) has its own syscall numbers!
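The table lookup can be modelled in a few lines of Python. This is a toy model under stated assumptions, not kernel code: the handler bodies, the made-up PID, and the `dispatch` helper are hypothetical, and a dict stands in for the array. Only the x86-64 numbers 0, 1, and 39 (read, write, getpid) and the ENOSYS error for unknown numbers mirror real behavior.

```python
import errno

# Toy handlers standing in for kernel functions (hypothetical bodies).
def sys_read(fd, buf, count):
    return 0          # pretend we hit end-of-file

def sys_write(fd, buf, count):
    return count      # pretend every byte was written

def sys_getpid():
    return 1234       # made-up PID for illustration

# Key = syscall number (x86-64 numbering for these three entries).
SYS_CALL_TABLE = {0: sys_read, 1: sys_write, 39: sys_getpid}

def dispatch(nr, *args):
    """Look up syscall number `nr`; unknown numbers yield -ENOSYS."""
    handler = SYS_CALL_TABLE.get(nr)
    if handler is None:
        return -errno.ENOSYS
    return handler(*args)

print(dispatch(1, 1, b"hi", 2))   # write of 2 bytes
print(dispatch(9999))             # no such syscall: negated ENOSYS
```

The dict lookup plays the role of the bounds check the kernel performs before indexing the real table.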

CPU Protection Rings

Ring Architecture

x86 CPUs provide 4 privilege levels (rings), but Linux only uses 2:

Ring 0 (Kernel Mode):

  • Execute privileged instructions (hlt, lgdt, lidt, mov to CR3)
  • Access all memory (kernel + user)
  • Direct hardware I/O (in, out instructions)
  • Modify page tables and control registers

Ring 3 (User Mode):

  • Restricted instruction set only
  • Access only user memory (virtual addresses)
  • No hardware I/O (causes general protection fault)
  • Cannot modify page tables

Rings 1 & 2: Unused by Linux (originally intended for device drivers, but Linux puts drivers in Ring 0)

Mode Transitions

User → Kernel (via syscall instruction):

  1. CPU saves the return address (RIP) in rcx and RFLAGS in r11
  2. CPU loads the kernel entry RIP from the IA32_LSTAR MSR and masks RFLAGS using IA32_FMASK
  3. CPU switches from Ring 3 to Ring 0 (the instruction itself leaves RSP untouched)
  4. Kernel entry point (entry_SYSCALL_64) executes, saves the user RSP, and switches to the kernel stack

Kernel → User (via sysret instruction):

  1. Kernel prepares return value in rax
  2. Kernel restores the user registers, including RSP, then executes sysret
  3. CPU restores RIP from rcx and RFLAGS from r11
  4. CPU switches from Ring 0 to Ring 3
  5. User code resumes after syscall instruction

System Call Execution Flow

Step-by-Step: read(fd, buffer, count)

  1. User space: C library wrapper function called
  2. Load rax: Syscall number (0 for read)
  3. Load args: rdi=fd, rsi=buffer, rdx=count
  4. Execute: syscall instruction
  5. Mode switch: Ring 3 → Ring 0, switch to kernel stack
  6. Save context: Push all user registers to kernel stack
  7. Validate: Check syscall number, verify parameters
  8. Dispatch: Call sys_call_table[rax] → sys_read()
  9. Execute: Kernel reads from file, copies to user buffer
  10. Return value: Place result in rax
  11. Restore context: Pop user registers from kernel stack
  12. Mode switch: Ring 0 → Ring 3, switch to user stack
  13. Resume: User code continues with return value in rax

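Cost: ~100-300 CPU cycles of mode-switch overhead for the round trip, before any work the syscall itself does. This is a mode switch, not a full context switch between processes, and the figure can rise substantially when CPU vulnerability mitigations such as kernel page-table isolation are enabled.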

Parameter Passing

x86-64 Calling Convention

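Arguments passed in registers (up to 6):

Register    Syscall Argument
rax         Syscall number
rdi         Argument 1
rsi         Argument 2
rdx         Argument 3
r10         Argument 4
r8          Argument 5
r9          Argument 6

The syscall instruction itself overwrites rcx and r11, which is why argument 4 travels in r10 instead of rcx as it does in ordinary function calls.

More than 6 arguments? No Linux syscall takes more than six; calls that need more data pass a pointer to a structure in user memory.

Return value: rax register (a value from -4095 to -1 is a negated error code)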

Example: open(filename, flags, mode)

    mov rax, 2              ; SYS_open
    mov rdi, filename_ptr   ; Arg 1: pathname
    mov rsi, 0              ; Arg 2: O_RDONLY
    mov rdx, 0              ; Arg 3: mode (unused)
    syscall                 ; Invoke
    ; rax now contains file descriptor (or negative error)
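The same register convention can be exercised from a high-level language through libc's generic syscall() function, which loads the syscall number and arguments into the right registers. The sketch below assumes Linux on x86-64 or arm64 and a Python linked against a libc that exports syscall(); the `raw_write` helper and the small number table are illustrative names, not part of any library.

```python
import ctypes
import os
import platform

# write() has a different number on each architecture.
SYS_WRITE = {"x86_64": 1, "aarch64": 64}[platform.machine()]

libc = ctypes.CDLL(None, use_errno=True)
libc.syscall.restype = ctypes.c_long

def raw_write(fd, data):
    """Invoke write(fd, data, len(data)) through libc's generic syscall()."""
    ret = libc.syscall(
        ctypes.c_long(SYS_WRITE),      # syscall number (rax on x86-64)
        ctypes.c_long(fd),             # argument 1
        ctypes.c_char_p(data),         # argument 2
        ctypes.c_size_t(len(data)),    # argument 3
    )
    if ret < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return ret

r, w = os.pipe()
n = raw_write(w, b"hello via syscall\n")
print(n, os.read(r, 64))
os.close(r)
os.close(w)
```

Writing into a pipe keeps the example self-contained: the bytes that went in through the raw syscall come back out through an ordinary read.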

Security and Validation

Why Validation is Critical

Kernel must never trust user input. Malicious programs could:

  • Pass invalid pointers to crash kernel
  • Pass kernel addresses to read kernel memory
  • Exploit buffer overflows

Validation Checks

  1. Pointer validation: Is address in user space? (on x86-64 with 4-level paging, at or below 0x00007FFFFFFFFFFF)
  2. Read/write permission: Can user read from / write to this address?
  3. Range checks: Buffer within valid range?
  4. Resource limits: Does user have permission? (file permissions, capabilities)
  5. Numeric validation: Are numeric args reasonable?

Failed validation → Return -EFAULT, -EINVAL, -EACCES, etc.
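These checks can be observed from user space by handing the kernel deliberately bad arguments. The following is a minimal sketch assuming Linux and a libc reachable through ctypes; `try_read` is a hypothetical helper, and address 1 is used as a pointer that is assumed never to be mapped in a normal process.

```python
import ctypes
import errno
import os

libc = ctypes.CDLL(None, use_errno=True)
libc.read.argtypes = [ctypes.c_int, ctypes.c_void_p, ctypes.c_size_t]
libc.read.restype = ctypes.c_ssize_t

def try_read(fd, address, count):
    """Call read() with a raw address; return (result, errno name or None)."""
    ctypes.set_errno(0)
    ret = libc.read(fd, address, count)
    if ret < 0:
        return ret, errno.errorcode[ctypes.get_errno()]
    return ret, None

r, w = os.pipe()
os.write(w, b"data")                 # make sure read() has something to copy

bad_pointer = try_read(r, 1, 4)      # valid fd, unmapped destination buffer
bad_fd = try_read(-1, 1, 4)          # descriptor that cannot exist
print(bad_pointer, bad_fd)
os.close(r)
os.close(w)
```

The process is not killed in either case: the kernel detects the problem while handling the call and reports it through the return value instead.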

Tracing System Calls

strace - System Call Tracer

    # Trace all syscalls of a program
    strace ls

    # Trace specific syscalls (modern C libraries open files with openat)
    strace -e trace=openat,read,write cat file.txt

    # Show time spent in each syscall
    strace -T ls

    # Count syscalls
    strace -c ls

    # Trace running process
    strace -p 1234

    # Save to file
    strace -o trace.log program

Example output:

    openat(AT_FDCWD, "/etc/passwd", O_RDONLY) = 3
    read(3, "root:x:0:0:root:/root:/bin/bash\n"..., 4096) = 1534
    close(3) = 0

ltrace - Library Call Tracer

    # Trace library calls (wrapper functions)
    ltrace ls

    # See both library and system calls
    ltrace -S ls

Performance Considerations

Syscall Overhead

  • Direct syscall: ~100-300 cycles
  • Library call: +20-50 cycles (wrapper overhead)
  • vDSO syscalls: ~10-20 cycles (no mode switch!)
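A rough way to see per-call overhead is to time a trivial syscall in a loop. The sketch below is only indicative: it reports wall-clock nanoseconds rather than cycles, interpreter overhead is included in both numbers, and `ns_per_call` and `noop` are hypothetical helper names.

```python
import os
import time

def ns_per_call(func, iterations=200_000):
    """Average wall-clock nanoseconds per call of `func`."""
    start = time.perf_counter_ns()
    for _ in range(iterations):
        func()
    return (time.perf_counter_ns() - start) / iterations

def noop():
    return None

baseline = ns_per_call(noop)             # plain Python call, no syscall
with_syscall = ns_per_call(os.getppid)   # one getppid() syscall per call
print(f"Python call: {baseline:.0f} ns, getppid(): {with_syscall:.0f} ns")
```

For cycle-accurate figures a compiled benchmark with hardware counters is a better tool; this only shows that the cost is measurable per call.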

vDSO: Virtual Dynamic Shared Object

Some syscalls are so common and so simple that Linux provides userspace versions:

  • gettimeofday(): Read kernel's time variable (mapped read-only to userspace)
  • clock_gettime(): Same
  • getcpu(): Which CPU am I running on?

Benefit: no mode switch, so these calls can be roughly 10x faster than a regular syscall.

    # Check vDSO mapping
    cat /proc/self/maps | grep vdso
    # 7ffff7ffd000-7ffff7fff000 r-xp 00000000 00:00 0 [vdso]

    # Verify vDSO usage
    ldd /bin/date
    # linux-vdso.so.1 (0x00007ffff7ffd000)
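The same mapping check can be done from inside a process. This sketch assumes a Linux /proc filesystem; `vdso_mappings` is a hypothetical helper, and whether time.clock_gettime() actually takes the vDSO fast path depends on the C library, architecture, and clock source.

```python
import time

def vdso_mappings(maps_path="/proc/self/maps"):
    """Return the lines of the memory map that belong to the vDSO."""
    with open(maps_path) as maps:
        return [line.strip() for line in maps if "[vdso]" in line]

for line in vdso_mappings():
    print(line)

# Two back-to-back clock reads; on most systems these are served from the
# vDSO without entering the kernel (assumption, not verified by this code).
t0 = time.clock_gettime(time.CLOCK_MONOTONIC)
t1 = time.clock_gettime(time.CLOCK_MONOTONIC)
print(t1 >= t0)
```

Running the script under strace -c is one way to confirm that the clock reads do not show up as clock_gettime syscalls.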

Batching Syscalls

Problem: Making many small syscalls is slow (overhead per call).

Solution: Batch operations

    import os

    fd = os.open("data.bin", os.O_RDONLY)   # hypothetical input file

    # Bad: one read() syscall per byte
    for _ in range(1_000_000):
        os.read(fd, 1)

    # Good: one read() syscall for up to 1 MB
    os.read(fd, 1_000_000)

Vectored I/O: readv(), writev() - read/write multiple buffers in one syscall
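As a small illustration of vectored I/O, Python's os module exposes writev() and readv() directly. The sketch below assumes Linux and uses a pipe so it needs no files; the buffer names and the sample request line are arbitrary.

```python
import os

r, w = os.pipe()

# Gather write: three separate buffers, one writev() syscall.
written = os.writev(w, [b"GET ", b"/index.html ", b"HTTP/1.1\r\n"])

# Scatter read: one readv() syscall fills the buffers in order.
method = bytearray(4)
rest = bytearray(64)
received = os.readv(r, [method, rest])

print(written, received, bytes(method), bytes(rest[:received - 4]))
os.close(r)
os.close(w)
```

Without writev() the sender would either issue three write() calls or first copy the pieces into one contiguous buffer.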

Common System Calls

File I/O

    open()    # Open file, returns fd
    read()    # Read bytes from fd
    write()   # Write bytes to fd
    close()   # Close fd
    lseek()   # Change file position
    stat()    # Get file metadata
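Python's os module offers thin wrappers over these calls, which makes the sequence easy to try out. A minimal sketch, assuming a writable temporary directory; the file name and contents are arbitrary.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")

fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)   # open (openat on modern libc)
os.write(fd, b"hello, kernel\n")                     # write
os.lseek(fd, 7, os.SEEK_SET)                         # lseek: jump to byte offset 7
tail = os.read(fd, 6)                                # read
size = os.fstat(fd).st_size                          # fstat: metadata via the fd
os.close(fd)                                         # close

print(tail, size)
```

Each line maps to roughly one syscall, so running the script under strace shows the same sequence in the trace.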

Process Management

    fork()    # Create child process
    execve()  # Replace process image
    exit()    # Terminate process
    wait4()   # Wait for child to exit
    kill()    # Send signal to process
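The fork, exit, and wait trio can be sketched as follows. This assumes a Unix-like system where os.fork() is available; `run_child` is a hypothetical helper, and the exact syscalls the C library issues underneath (for example clone instead of fork) vary by platform.

```python
import os

def run_child(exit_code):
    """fork() a child that exits with `exit_code`; return what the parent sees."""
    pid = os.fork()
    if pid == 0:
        # Child: terminate immediately, skipping the parent's cleanup handlers.
        os._exit(exit_code)
    # Parent: block until the child has exited and collect its status.
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return None

print(run_child(7))
```

The child uses os._exit() rather than sys.exit() so that buffers and handlers inherited from the parent are not flushed or run twice.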

Memory Management

    brk()       # Grow/shrink heap
    mmap()      # Map file or anonymous memory
    munmap()    # Unmap memory
    mprotect()  # Change memory protection
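An anonymous mapping is the simplest mmap() case to try. A minimal sketch using Python's mmap module, which wraps mmap() and munmap(); the variable names are arbitrary.

```python
import mmap

PAGE = mmap.PAGESIZE

# Anonymous mapping: no backing file (fd = -1), one page long.
region = mmap.mmap(-1, PAGE)
region[:5] = b"hello"
first = bytes(region[:5])
untouched = region[PAGE - 1]      # fresh anonymous pages read as zero
region.close()                    # releases the mapping (munmap)

print(first, untouched)
```

Large allocations in many memory allocators are served this way rather than through brk().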

Network

    socket()   # Create socket
    bind()     # Bind socket to address
    listen()   # Listen for connections
    accept()   # Accept incoming connection
    connect()  # Connect to server
    send()     # Send data
    recv()     # Receive data
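The calls above can be exercised end to end over the loopback interface in a single process. This sketch assumes 127.0.0.1 is reachable; port 0 asks the kernel to pick any free port, and the 4-byte payload is arbitrary.

```python
import socket

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)   # socket()
server.bind(("127.0.0.1", 0))                                # bind(), any free port
server.listen(1)                                             # listen()

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(server.getsockname())                         # connect()
conn, _ = server.accept()                                    # accept()

client.sendall(b"ping")                                      # send()
reply = conn.recv(4)                                         # recv()

for sock in (conn, client, server):
    sock.close()
print(reply)
```

The connect() call completes before accept() runs because the kernel finishes the handshake for connections waiting in the listen backlog.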

Error Handling

Error Codes

System calls return -errno on failure:

  • -ENOENT: No such file or directory
  • -EACCES: Permission denied
  • -EINVAL: Invalid argument
  • -ENOMEM: Out of memory
  • -EFAULT: Bad address (invalid pointer)
  • -EBADF: Bad file descriptor

C library converts negative returns to errno:

    // Kernel returns -ENOENT (-2)
    // C library:
    if (ret < 0) {
        errno = -ret;   // errno = 2
        return -1;
    }

Check errno with perror() or strerror().
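Higher-level languages usually surface the same errno value through their own error type. In Python it arrives as OSError.errno; the sketch below assumes the path /no/such/file does not exist, and `open_or_explain` is a hypothetical helper.

```python
import errno
import os

def open_or_explain(path):
    """Return (fd, None) on success or (None, (code, name, message)) on failure."""
    try:
        return os.open(path, os.O_RDONLY), None
    except OSError as exc:
        details = (exc.errno, errno.errorcode[exc.errno], os.strerror(exc.errno))
        return None, details

fd, problem = open_or_explain("/no/such/file")
print(problem)
```

os.strerror() plays the role of C's strerror(), turning the numeric code into a readable message.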

Best Practices

  1. Check return values: Always! Don't ignore errors
  2. Batch operations: Reduce syscall count
  3. Use buffered I/O: stdio buffers writes (fewer syscalls)
  4. Avoid unnecessary syscalls: Cache results when possible
  5. Use async I/O: io_uring for high-performance servers
  6. Profile with strace -c: Find syscall hotspots
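The effect of buffered I/O (practice 3) can be shown without touching a real file. In this sketch `CountingSink` is a hypothetical raw stream that counts write() calls, standing in for the write() syscalls a file descriptor would receive; the 4096-byte buffer size is an arbitrary choice.

```python
import io

class CountingSink(io.RawIOBase):
    """Raw stream that counts write() calls instead of issuing syscalls."""

    def __init__(self):
        super().__init__()
        self.calls = 0
        self.total = 0

    def writable(self):
        return True

    def write(self, data):
        self.calls += 1
        self.total += len(data)
        return len(data)

sink = CountingSink()
buffered = io.BufferedWriter(sink, buffer_size=4096)
for _ in range(1000):
    buffered.write(b"0123456789")    # 1000 logical writes of 10 bytes each
buffered.flush()

print(sink.calls, sink.total)
```

All 10,000 bytes reach the sink, but in a handful of large writes rather than 1000 small ones, which is the saving buffered stdio provides.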

Modern Alternatives

io_uring (Linux 5.1+)

Asynchronous I/O interface avoiding syscall overhead:

  • Submit I/O operations to ring buffer
  • Kernel processes asynchronously
  • Retrieve completions from ring buffer
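  • Zero syscalls in the fast path when submission-queue polling (SQPOLL) is enabled; otherwise a single io_uring_enter() call can submit and reap many operations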

Used by high-performance databases and servers.
