Linux System Calls: The User-Kernel Interface

The Gateway to the Kernel

Every time your program needs to interact with hardware, read a file, or create a process, it must ask the kernel for help. But user programs can't directly access kernel memory or execute privileged instructions - that would be chaos! Instead, they use system calls - the carefully controlled gateway between user space and kernel space.

Think of system calls as a restaurant's service window. Customers (user programs) can't walk into the kitchen (kernel space) - that would be unsafe and chaotic. Instead, they place orders through the window (system calls), and the kitchen staff (kernel) fulfills those requests with proper safety checks and resource management.

Let's explore this fascinating boundary where user programs meet the almighty kernel.

Understanding System Calls

What Are System Calls?

System calls are the only way user programs can request kernel services:

File operations: open(), read(), write(), close()
Process management: fork(), execve(), exit(), wait4()
Memory: mmap(), brk(), munmap()
Network: socket(), connect(), send(), recv()
Devices: ioctl()
Signals: kill(), signal(), sigaction()

Why are they needed? Modern CPUs enforce privilege separation. On x86, user code runs in Ring 3 (restricted) and the kernel runs in Ring 0 (full access). System calls are the bridge between the two.

The System Call Table

The kernel maintains sys_call_table[] - an array of function pointers indexed by syscall number:

sys_call_table[0] = sys_read
sys_call_table[1] = sys_write
sys_call_table[2] = sys_open
sys_call_table[57] = sys_fork
sys_call_table[59] = sys_execve
Several hundred entries in total; on recent x86_64 kernels the syscall numbers run past 450

Each architecture (x86_64, ARM, etc.) has its own syscall numbers!
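
The table lookup can be pictured with a toy model. This is only a sketch of the idea, not kernel code: the handlers are placeholders, and SYS_CALL_TABLE and dispatch() are names made up for the illustration. The one behavior taken from the real kernel is that an unknown syscall number yields -ENOSYS.

```python
import errno


def sys_read(fd, buf, count):
    return 0          # placeholder handler: pretend end-of-file


def sys_write(fd, buf, count):
    return count      # placeholder handler: pretend everything was written


# Toy table keyed by syscall number (x86_64 numbering from the text above).
SYS_CALL_TABLE = {0: sys_read, 1: sys_write}


def dispatch(number, *args):
    """Look up the handler for a syscall number and call it."""
    handler = SYS_CALL_TABLE.get(number)
    if handler is None:
        return -errno.ENOSYS   # no such syscall
    return handler(*args)


if __name__ == "__main__":
    print(dispatch(1, 1, b"hi", 2))
    print(dispatch(9999))
```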

CPU Protection Rings

Ring Architecture

x86 CPUs provide 4 privilege levels (rings), but Linux only uses 2:

Ring 0 (Kernel Mode):

Execute privileged instructions (hlt, lgdt, lidt, mov to CR3)
Access all memory (kernel + user)
Direct hardware I/O (in, out instructions)
Modify page tables and control registers

Ring 3 (User Mode):

Restricted instruction set only
Access only user memory (virtual addresses)
No hardware I/O (causes general protection fault)
Cannot modify page tables

Rings 1 & 2: Unused by Linux (originally intended for device drivers, but Linux puts drivers in Ring 0)

Mode Transitions

User → Kernel (via syscall instruction):

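CPU saves the return address (RIP) in rcx and RFLAGS in r11
CPU loads the kernel entry address from the IA32_LSTAR MSR and masks RFLAGS with IA32_FMASK
CPU switches from Ring 3 to Ring 0
Kernel entry point (entry_SYSCALL_64) executes, saves the user RSP, and switches to the kernel stack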

Kernel → User (via sysret instruction):

Kernel prepares return value in rax
Kernel restores the user registers (including RSP) and executes sysret
CPU restores RIP from rcx and RFLAGS from r11
CPU switches from Ring 0 to Ring 3
User code resumes at the instruction after syscall
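
The register bookkeeping of the two instructions can be mimicked with a small toy model. On x86-64, syscall stores the return address in rcx and the flags in r11, and sysret reads them back; the stack pointer is handled by kernel code, not by the instruction. The Cpu class and the LSTAR and FMASK values below are invented for the sketch and do not correspond to any real machine state.

```python
from dataclasses import dataclass


@dataclass
class Cpu:
    rip: int
    rflags: int
    rcx: int = 0
    r11: int = 0
    cpl: int = 3      # current privilege level: 3 = user, 0 = kernel


LSTAR = 0xFFFFFFFF81000000   # made-up kernel entry address
FMASK = 0x200                # made-up mask: clear the interrupt flag (bit 9)


def syscall(cpu, next_rip):
    """Model the user -> kernel transition."""
    cpu.rcx = next_rip        # return address goes to rcx
    cpu.r11 = cpu.rflags      # flags go to r11
    cpu.rflags &= ~FMASK      # masked flags while in the kernel
    cpu.rip = LSTAR           # jump to the kernel entry point
    cpu.cpl = 0


def sysret(cpu):
    """Model the kernel -> user transition."""
    cpu.rip = cpu.rcx
    cpu.rflags = cpu.r11
    cpu.cpl = 3
```

This also explains why user code must treat rcx and r11 as clobbered by every system call.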

System Call Execution Flow

Step-by-Step: read(fd, buffer, count)

User space: C library wrapper function called
Load rax: Syscall number (0 for read)
Load args: rdi=fd, rsi=buffer, rdx=count
Execute: syscall instruction
Mode switch: Ring 3 → Ring 0, switch to kernel stack
Save context: Push all user registers to kernel stack
Validate: Check syscall number, verify parameters
Dispatch: Call sys_call_table[rax] → sys_read()
Execute: Kernel reads from file, copies to user buffer
Return value: Place result in rax
Restore context: Pop user registers from kernel stack
Mode switch: Ring 0 → Ring 3, switch to user stack
Resume: User code continues with return value in rax

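Cost: roughly 100-300 CPU cycles of mode-switch overhead on the bare syscall path (this is not a full context switch). Mitigations such as kernel page-table isolation can add considerably more.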
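
From a program's point of view the whole sequence is a single function call. A minimal sketch, using a pipe so that it needs no file on disk: Python's os.write() and os.read() are thin wrappers over the write() and read() system calls.

```python
import os

r, w = os.pipe()                 # two descriptors backed by a kernel buffer
os.write(w, b"hello, kernel")    # write(fd, buffer, count)
data = os.read(r, 1024)          # read(fd, buffer, count): up to 1024 bytes
os.close(r)
os.close(w)

print(data)
```

The length of the returned bytes object corresponds to the value the kernel placed in rax.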

Parameter Passing

x86-64 Calling Convention

Arguments passed in registers (up to 6):

Register	Syscall Argument
rax	Syscall number
rdi	Argument 1
rsi	Argument 2
rdx	Argument 3
r10	Argument 4
r8	Argument 5
r9	Argument 6

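More than 6 arguments? Linux system calls never take more than six. Calls that need more data take a pointer to a structure in user memory.

The syscall instruction overwrites rcx and r11, so their values are not preserved across the call.

Return value: rax register (a value from -4095 to -1 is a negated error code)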

Example: open(filename, flags, mode)

mov rax, 2              ; SYS_open
mov rdi, filename_ptr   ; Arg 1: pathname
mov rsi, 0              ; Arg 2: O_RDONLY
mov rdx, 0              ; Arg 3: mode (unused)
syscall                 ; Invoke
; rax now contains file descriptor (or negative error)
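
The same number-plus-arguments convention can be exercised from Python through libc's generic syscall() function. This is a sketch under a few assumptions: the SYS_WRITE table only covers x86_64 and aarch64, and raw_write() is a helper name invented here. Note that libc's syscall() returns -1 and sets errno on failure, so the raw negative value from rax is not visible at this level.

```python
import ctypes
import os
import platform

# write() syscall numbers differ per architecture; only two are listed here.
SYS_WRITE = {"x86_64": 1, "aarch64": 64}

libc = ctypes.CDLL(None, use_errno=True)   # the C library already loaded
libc.syscall.restype = ctypes.c_long


def raw_write(fd, data):
    """Invoke write(fd, data, len(data)) by syscall number."""
    return libc.syscall(
        ctypes.c_long(SYS_WRITE[platform.machine()]),  # syscall number
        ctypes.c_long(fd),                             # argument 1
        ctypes.c_char_p(data),                         # argument 2
        ctypes.c_size_t(len(data)),                    # argument 3
    )


r, w = os.pipe()
written = raw_write(w, b"abc")
echoed = os.read(r, 16)
os.close(r)
os.close(w)
```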

Security and Validation

Why Validation is Critical

Kernel must never trust user input. Malicious programs could:

Pass invalid pointers to crash kernel
Pass kernel addresses to read kernel memory
Exploit buffer overflows

Validation Checks

Pointer validation: Is the address in user space? (at most 0x00007FFFFFFFFFFF with 4-level paging on x86-64)
Read/write permission: Can user read from / write to this address?
Range checks: Buffer within valid range?
Permission checks: Is the caller allowed to do this? (file permissions, capabilities, resource limits)
Numeric validation: Are numeric args reasonable?

Failed validation → Return -EFAULT, -EINVAL, -EACCES, etc.
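
A failed pointer check can be observed safely from user space. The sketch below asks the kernel to write 16 bytes starting at address 8, which is not mapped in an ordinary process, into a pipe. The kernel refuses to copy from that address and the call fails with EFAULT instead of crashing anything. write_from_bad_pointer() is a helper name made up for this example.

```python
import ctypes
import errno
import os

libc = ctypes.CDLL(None, use_errno=True)
libc.write.argtypes = [ctypes.c_int, ctypes.c_void_p, ctypes.c_size_t]
libc.write.restype = ctypes.c_ssize_t


def write_from_bad_pointer():
    """Call write() with an unmapped source buffer; return (ret, errno)."""
    r, w = os.pipe()
    try:
        ctypes.set_errno(0)
        ret = libc.write(w, ctypes.c_void_p(8), 16)   # address 8 is unmapped
        return ret, ctypes.get_errno()
    finally:
        os.close(r)
        os.close(w)


ret, err = write_from_bad_pointer()
print(ret, errno.errorcode.get(err))
```

A pipe is used because the kernel has to copy the data out of the user buffer; a target like /dev/null never touches the buffer and would not trigger the check.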

Tracing System Calls

strace - System Call Tracer

# Trace all syscalls of a program
strace ls

# Trace specific syscalls (modern C libraries open files with openat)
strace -e trace=openat,read,write cat file.txt

# Show timing
strace -T ls

# Count syscalls
strace -c ls

# Trace running process
strace -p 1234

# Save to file
strace -o trace.log program

Example output:

openat(AT_FDCWD, "/etc/passwd", O_RDONLY) = 3
read(3, "root:x:0:0:root:/root:/bin/bash\n"..., 4096) = 1534
close(3) = 0
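
Output in this shape is easy to post-process. As a rough sketch, the snippet below tallies syscall names from trace lines, similar in spirit to strace -c. The regular expression only handles simple one-line entries like the ones shown above; unfinished or resumed calls and signal lines are ignored. count_syscalls() is a name chosen for the example.

```python
import re
from collections import Counter

# name(args...) = return-value, as in the simple lines of strace output
LINE = re.compile(r"^(\w+)\((.*)\)\s*=\s*(-?\d+)")


def count_syscalls(lines):
    """Count how often each syscall name appears in the given trace lines."""
    counts = Counter()
    for line in lines:
        match = LINE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    'openat(AT_FDCWD, "/etc/passwd", O_RDONLY) = 3',
    'read(3, "root:x:0:0:root:/root:/bin/bash\\n"..., 4096) = 1534',
    'close(3) = 0',
]

if __name__ == "__main__":
    print(count_syscalls(sample))
```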

ltrace - Library Call Tracer

# Trace library calls (wrapper functions)
ltrace ls

# See both library and system calls
ltrace -S ls

Performance Considerations

Syscall Overhead

Direct syscall: ~100-300 cycles
Library call: +20-50 cycles (wrapper overhead)
vDSO syscalls: ~10-20 cycles (no mode switch!)

vDSO: Virtual Dynamic Shared Object

Some syscalls are so common and so simple that Linux provides userspace versions:

gettimeofday(): Read kernel's time variable (mapped read-only to userspace)
clock_gettime(): Same
getcpu(): Which CPU am I running on?

Benefit: No mode switch, which can make these calls roughly 10x faster than a real syscall.

# Check vDSO mapping
cat /proc/self/maps | grep vdso
# 7ffff7ffd000-7ffff7fff000 r-xp 00000000 00:00 0 [vdso]

# Confirm the vDSO is mapped into a dynamically linked program
ldd /bin/date
# linux-vdso.so.1 (0x00007ffff7ffd000)
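
The same check can be done from inside a process. A small sketch that scans /proc/self/maps for the [vdso] entry; find_vdso_mapping() is a name invented here, and it returns None when the entry or the proc filesystem is unavailable.

```python
def find_vdso_mapping(maps_path="/proc/self/maps"):
    """Return (start, end) of the [vdso] mapping, or None if not found."""
    try:
        with open(maps_path) as maps:
            for line in maps:
                if line.rstrip().endswith("[vdso]"):
                    start, end = line.split()[0].split("-")
                    return int(start, 16), int(end, 16)
    except FileNotFoundError:
        pass          # no /proc on this system
    return None


mapping = find_vdso_mapping()

if __name__ == "__main__":
    if mapping is None:
        print("no [vdso] mapping found")
    else:
        print(f"vDSO mapped at {mapping[0]:#x}-{mapping[1]:#x}")
```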

Batching Syscalls

Problem: Making many small syscalls is slow (overhead per call).

Solution: Batch operations

import os

fd = os.open("data.bin", os.O_RDONLY)  # hypothetical input file

# Bad: one read() syscall per byte
for _ in range(1_000_000):
    os.read(fd, 1)

# Good: one large read() (may return fewer bytes than requested)
os.read(fd, 1_000_000)

Vectored I/O: readv(), writev() - read/write multiple buffers in one syscall
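
A minimal sketch of vectored I/O using Python's os.writev() and os.readv(), which wrap the system calls of the same names. A pipe stands in for a real file or socket, and the buffer contents are arbitrary.

```python
import os

r, w = os.pipe()

header, body = b"HDR:", b"payload"
written = os.writev(w, [header, body])     # one syscall, two source buffers

first, second = bytearray(4), bytearray(7)
received = os.readv(r, [first, second])    # one syscall, two destination buffers

os.close(r)
os.close(w)

print(written, received, bytes(first), bytes(second))
```

The kernel fills the destination buffers in order, so a fixed-size header and its body can land in separate buffers without an extra copy.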

Common System Calls

File I/O

open()    # Open file, returns fd
read()    # Read bytes from fd
write()   # Write bytes to fd
close()   # Close fd
lseek()   # Change file position
stat()    # Get file metadata

Process Management

fork()    # Create child process
execve()  # Replace process image
exit()    # Terminate process
wait4()   # Wait for child to exit
kill()    # Send signal to process

Memory Management

brk()     # Grow/shrink heap
mmap()    # Map file or anonymous memory
munmap()  # Unmap memory
mprotect()# Change memory protection

Network

socket()  # Create socket
bind()    # Bind socket to address
listen()  # Listen for connections
accept()  # Accept incoming connection
connect() # Connect to server
send()    # Send data
recv()    # Receive data

Error Handling

Error Codes

System calls return -errno on failure:

-ENOENT: No such file or directory
-EACCES: Permission denied
-EINVAL: Invalid argument
-ENOMEM: Out of memory
-EFAULT: Bad address (invalid pointer)
-EBADF: Bad file descriptor

C library converts negative returns to errno:

// Kernel returns -ENOENT (-2)
// C library wrapper (simplified):
if (ret < 0 && ret >= -4095) {
    errno = -ret;  // errno = 2
    return -1;
}
return ret;

Check errno with perror() or strerror().
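
Higher-level languages surface the same errno values. In Python a failed system call raises OSError, whose errno attribute holds the code. A short sketch; try_open() and the path are made up for the example.

```python
import errno
import os


def try_open(path):
    """Open a file read-only; return (errno, message), or (0, "ok")."""
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError as exc:
        return exc.errno, os.strerror(exc.errno)
    os.close(fd)
    return 0, "ok"


# Hypothetical path that should not exist.
code, message = try_open("/nonexistent/definitely-missing")

if __name__ == "__main__":
    print(errno.errorcode.get(code), message)
```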

Best Practices

Check return values: Always! Don't ignore errors
Batch operations: Reduce syscall count
Use buffered I/O: stdio buffers writes (fewer syscalls)
Avoid unnecessary syscalls: Cache results when possible
Use async I/O: io_uring for high-performance servers
Profile with strace -c: Find syscall hotspots
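
The effect of buffering can be shown without touching the kernel at all. In this sketch, CountingRaw is a made-up stand-in for a file descriptor that simply counts how often its write() is called. Each call would be one write() system call on a real file. The buffer size of 4096 is an arbitrary choice.

```python
import io


class CountingRaw(io.RawIOBase):
    """Fake raw stream that counts write() calls instead of doing I/O."""

    def __init__(self):
        super().__init__()
        self.calls = 0
        self.total = 0

    def writable(self):
        return True

    def write(self, data):
        self.calls += 1
        self.total += len(data)
        return len(data)


raw = CountingRaw()
buffered = io.BufferedWriter(raw, buffer_size=4096)

for _ in range(1000):
    buffered.write(b"0123456789")   # 1000 small application-level writes
buffered.flush()

print(f"{raw.calls} raw writes for {raw.total} bytes")
```

The 1000 small writes collapse into a handful of large ones, which is the saving stdio-style buffering provides.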

Modern Alternatives

io_uring (Linux 5.1+)

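Asynchronous I/O interface that reduces syscall overhead:

Submit I/O operations to ring buffer
Kernel processes asynchronously
Retrieve completions from ring buffer
With submission-queue polling (IORING_SETUP_SQPOLL), the fast path can need zero syscalls; otherwise a single io_uring_enter() call submits a whole batch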

Used by high-performance databases and servers.
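
The submit-then-reap pattern can be illustrated with a toy model. This is not the io_uring API: ToyRing and its methods are invented for the sketch, plain Python callables stand in for I/O requests, and kernel_poll() plays the role of the kernel draining the submission queue.

```python
from collections import deque


class ToyRing:
    """Toy pair of submission and completion queues (not real io_uring)."""

    def __init__(self):
        self.submissions = deque()
        self.completions = deque()

    def submit(self, user_data, operation):
        # The application queues a request tagged with its own identifier.
        self.submissions.append((user_data, operation))

    def kernel_poll(self):
        # Stand-in for the kernel processing queued requests.
        while self.submissions:
            user_data, operation = self.submissions.popleft()
            self.completions.append((user_data, operation()))

    def reap(self):
        # The application collects every finished request in one pass.
        results = list(self.completions)
        self.completions.clear()
        return results


ring = ToyRing()
ring.submit(1, lambda: len(b"abc"))
ring.submit(2, lambda: 42)
ring.kernel_poll()
results = ring.reap()
```

Tagging each request with user data is what lets the application match completions to submissions when they finish out of order.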
