Introduction
Python is one of the most popular programming languages, but what actually happens when you run `python script.py`? This article explores the internals of CPython, the reference implementation of Python: how source code is compiled to bytecode, how memory is managed, and why the Global Interpreter Lock (GIL) exists.
Python Execution Model
Python Execution Pipeline
- Source code: a .py file (e.g., def add(a, b): return a + b)
- Compilation cache: bytecode is cached in __pycache__/*.pyc files to speed up subsequent imports
- Adaptive interpreter: Python 3.11+ specializes bytecode based on runtime behavior
Python code goes through several stages before execution:
- Parsing: Source code → Abstract Syntax Tree (AST)
- Compilation: AST → Bytecode
- Execution: Bytecode → Python Virtual Machine
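These stages can be observed from Python itself with the ast and compile machinery (a small sketch):

```python
import ast
import dis

source = "def add(a, b):\n    return a + b\n"

# Stage 1: parse the source into an AST
tree = ast.parse(source)
print(type(tree).__name__)  # Module

# Stage 2: compile the AST down to a code object (bytecode)
code = compile(tree, "<example>", "exec")

# Stage 3: the PVM executes the bytecode
namespace = {}
exec(code, namespace)
print(namespace["add"](3, 5))  # 8

# Peek at the function's bytecode
dis.dis(namespace["add"])
```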
From Source to Bytecode
The Compilation Pipeline
# Python source code
def greet(name):
    return f"Hello, {name}!"

result = greet("World")
print(result)
Python Bytecode Visualization

def add(a, b):
    return a + b

result = add(3, 5)
How Python Bytecode Works
- Python compiles source code to bytecode before execution
- Bytecode is platform-independent and cached in .pyc files
- The Python Virtual Machine (PVM) executes bytecode instructions
- Each instruction manipulates the value stack and local/global namespaces
Understanding Python Bytecode
Python compiles source code to bytecode, which is executed by the Python Virtual Machine (PVM):
import dis

def add(a, b):
    return a + b

dis.dis(add)
Output:
  2           0 LOAD_FAST                0 (a)
              2 LOAD_FAST                1 (b)
              4 BINARY_ADD
              6 RETURN_VALUE

(Output from Python 3.10; in 3.11+ the opcodes differ, e.g. BINARY_ADD becomes BINARY_OP.)
Bytecode Instructions
Key bytecode instructions:
- LOAD_FAST: Load local variable
- LOAD_GLOBAL: Load global variable
- STORE_FAST: Store to local variable
- BINARY_ADD: Add two values from stack
- CALL_FUNCTION: Call a function
- RETURN_VALUE: Return from function
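A quick way to see these instructions in context is dis (exact opcode names vary by Python version; for instance, CALL_FUNCTION became CALL in 3.11):

```python
import dis

def shout(name):
    greeting = "Hello"            # compiles to STORE_FAST greeting
    return print(greeting, name)  # LOAD_GLOBAL print, a call, then RETURN_VALUE

dis.dis(shout)

opnames = [ins.opname for ins in dis.get_instructions(shout)]
print(opnames)
```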
Python Object Model
PyLongObject layout:
- Every Python object starts with PyObject_HEAD
- ob_refcnt tracks the reference count for memory management
- ob_type points to the object's type object (e.g., PyLong_Type)
- Additional fields store the actual data
Common PyObject Operations
Reference Counting
Py_INCREF(obj);
Py_DECREF(obj);
Type Checking
PyLong_Check(obj);
Py_TYPE(obj);
Object Creation
PyLong_FromLong(42);
PyUnicode_FromString("hello");
Everything is a PyObject
In CPython, every Python object is represented as a PyObject structure:
typedef struct _object {
    _PyObject_HEAD_EXTRA
    Py_ssize_t ob_refcnt;    // Reference count
    PyTypeObject *ob_type;   // Type pointer
} PyObject;
Type Objects
Every Python type (int, str, list, etc.) has a corresponding type object:
typedef struct _typeobject {
    PyObject_VAR_HEAD
    const char *tp_name;        // Type name
    Py_ssize_t tp_basicsize;    // Instance size
    destructor tp_dealloc;      // Deallocator
    getattrfunc tp_getattr;     // Get attribute
    setattrfunc tp_setattr;     // Set attribute
    // ... many more fields
} PyTypeObject;
Object Creation
When you create an object in Python:
x = 42 # Creates a PyLongObject
CPython:
- Allocates memory for PyLongObject
- Sets reference count to 1
- Sets type pointer to PyLong_Type
- Stores the value 42
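These details can be observed from Python (sizes and refcounts are CPython implementation details and vary by version and platform):

```python
import sys

x = 42  # a PyLongObject under the hood

print(type(x))             # ob_type points at the int type object
print(sys.getsizeof(x))    # size of the object header plus the stored value
print(sys.getrefcount(x))  # high, because small ints are shared interpreter-wide
```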
Memory Management
Memory Optimization Tips
- Use __slots__ to reduce memory overhead for classes
- Reuse objects when possible (especially small integers and strings)
- Use generators for large datasets to avoid loading everything into memory
- Profile memory usage with tools like memory_profiler or tracemalloc
PyMalloc: Python's Memory Allocator
CPython uses a hierarchical memory management system:
- Small objects (< 512 bytes): PyMalloc
- Large objects: System malloc
- Memory pools: Pre-allocated blocks
Memory Pools and Arenas
Arena (256 KB)
├── Pool 1 (4 KB) - 8-byte blocks
├── Pool 2 (4 KB) - 16-byte blocks
├── Pool 3 (4 KB) - 24-byte blocks
└── ... (up to 512-byte blocks)
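A small sketch of the 512-byte boundary using sys.getsizeof (whether a given object actually lands in a pymalloc pool is an implementation detail):

```python
import sys

small = "x" * 10      # well under 512 bytes: eligible for a pymalloc pool
large = "x" * 10_000  # over 512 bytes: handed to the system allocator

print(sys.getsizeof(small))
print(sys.getsizeof(large))

# CPython can dump pymalloc arena/pool statistics to stderr
# (CPython-specific, diagnostic only):
# sys._debugmallocstats()
```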
Object Allocation Strategy
# Small integer optimization
a = 256  # Uses cached object
b = 256  # Same object as 'a'
print(a is b)  # True

c = 257  # Creates new object
d = 257  # Different object
print(c is d)  # False in the REPL (in a script, constants within one
               # code object may be deduplicated, making this True)
Performance Optimization
CPython caches small integers (-5 to 256) and single-character strings for performance.
Reference Counting
How Reference Counting Works
import sys

x = []      # refcount = 1
y = x       # refcount = 2
z = [x, x]  # refcount = 4

print(sys.getrefcount(x))  # Shows 5 (includes the temporary reference from the call)
Reference Count Operations
// Increment reference count
Py_INCREF(obj);

// Decrement reference count
Py_DECREF(obj);  // Deallocates if refcount reaches 0
Circular References Problem
# Circular reference
class Node:
    def __init__(self):
        self.ref = None

a = Node()
b = Node()
a.ref = b
b.ref = a  # Circular reference!
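Reference counting alone can never free such a cycle, but CPython's cycle collector can. A minimal demonstration:

```python
import gc

class Node:
    def __init__(self):
        self.ref = None

a = Node()
b = Node()
a.ref = b
b.ref = a  # cycle: each node keeps the other's refcount above zero

# Drop the only external references; the refcounts stay nonzero,
# so reference counting alone never frees the pair...
del a, b

# ...but the cycle collector finds the unreachable cycle.
collected = gc.collect()
print(collected)  # at least the two Node objects
```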
Garbage Collection
Generational Garbage Collector
- Generation 0: new objects
- Generation 1: survived one collection
- Generation 2: long-lived objects

Circular reference detection: objects A-B-C form a cycle that is still reachable from live code and survives collection; objects D-E form an unreachable cycle and are reclaimed.
GC Configuration
import gc
gc.get_threshold()
# (700, 10, 10)
gc.set_threshold(800, 20, 20)
gc.collect() # Manual collection
Best Practices
- Break circular references explicitly
- Use weak references when appropriate
- Monitor with gc.get_stats()
- Profile with gc.set_debug()
Generational Garbage Collection
CPython uses a generational garbage collector for circular references:
- Generation 0: New objects
- Generation 1: Survived one collection
- Generation 2: Long-lived objects
GC Algorithm
- Subtract internal references: for each object in a generation, count how many incoming references originate within that same set
- Mark reachable: objects still referenced from outside, and everything they reach, are kept
- Sweep: the remaining unreachable cycles are collected

Note that CPython never compacts memory; objects do not move once allocated.
Controlling the GC
import gc

# Disable automatic collection
gc.disable()

# Manual collection
collected = gc.collect()
print(f"Collected {collected} objects")

# GC statistics
print(gc.get_stats())

# Set collection thresholds
gc.set_threshold(700, 10, 10)
The Global Interpreter Lock (GIL)
Working Around the GIL
- Use multiprocessing for CPU-bound parallelism
- Use asyncio for I/O-bound concurrency
- Write performance-critical code in C extensions
- Consider alternative Python implementations (PyPy, Jython)
- Use concurrent.futures for high-level parallelism
Why the GIL Exists
The GIL ensures thread safety for:
- Reference counting operations
- Memory allocation
- Python/C API calls
GIL Behavior
import threading
import time

def cpu_bound():
    total = 0
    for i in range(100_000_000):
        total += i
    return total

# Single thread
start = time.time()
cpu_bound()
print(f"Single thread: {time.time() - start:.2f}s")

# Multiple threads (doesn't help for CPU-bound work)
start = time.time()
threads = []
for _ in range(4):
    t = threading.Thread(target=cpu_bound)
    t.start()
    threads.append(t)
for t in threads:
    t.join()
print(f"Multi-thread: {time.time() - start:.2f}s")
GIL Release Points
The GIL is released during:
- Blocking I/O operations (file and socket reads/writes)
- time.sleep()
- C extension calls that explicitly release it
- Periodically between bytecode instructions: the interpreter offers to switch threads every 5 ms by default (see sys.setswitchinterval(); the 100-instruction check interval was Python 2 behavior)
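Because sleeping releases the GIL, I/O-bound threads genuinely overlap. A small demonstration (timings are approximate):

```python
import threading
import time

def io_bound():
    time.sleep(0.5)  # the GIL is released for the duration of the sleep

start = time.time()
threads = [threading.Thread(target=io_bound) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

elapsed = time.time() - start
print(f"4 sleeping threads: {elapsed:.2f}s")  # roughly 0.5s, not 2s
```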
Optimization Techniques
CPython Optimizations
Before Optimization
if True:
    x = 1
else:
    x = 2

y = 2 * 3 + 4
After Optimization
x = 1
y = 10
Compile-time optimizations: dead code elimination, constant folding
Quick Wins
- Use list comprehensions over loops
- Cache function results with @lru_cache
- Use built-in functions (written in C)
- Avoid repeated attribute lookups
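For example, caching with functools.lru_cache turns an exponential-time recursion into a linear one:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

print(fib(100))          # returns instantly thanks to memoization
print(fib.cache_info())  # hit/miss statistics
```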
Python 3.11+ Features
- Adaptive bytecode specialization
- Zero-cost exception handling
- Faster frame creation
- Improved error messages
Peephole Optimizer
CPython performs compile-time optimizations:
# Before optimization
if True:
    x = 1
else:
    x = 2

# After optimization (dead code eliminated)
x = 1
Constant Folding
# Compile-time evaluation
result = 2 * 3 + 4  # Becomes: result = 10
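The folded constant is visible in the compiled code object's constants table:

```python
code = compile("result = 2 * 3 + 4", "<example>", "exec")

# The folded value sits directly in the constants table; no arithmetic
# bytecode for 2 * 3 + 4 is emitted at all.
print(code.co_consts)  # contains 10
```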
String Interning
# String interning
a = "hello"
b = "hello"
print(a is b)  # True (interned)

c = "hello world"
d = "hello world"
print(c is d)  # May be False (not automatically interned)

# Force interning
import sys
e = sys.intern("hello world")
f = sys.intern("hello world")
print(e is f)  # True
Function Calls and Stack Frames
Python Stack Frame
typedef struct _frame {
    PyObject_VAR_HEAD
    struct _frame *f_back;      // Previous frame
    PyCodeObject *f_code;       // Code object
    PyObject *f_builtins;       // Builtin namespace
    PyObject *f_globals;        // Global namespace
    PyObject *f_locals;         // Local namespace
    PyObject **f_valuestack;    // Value stack
    // ... more fields
} PyFrameObject;
Function Call Overhead
def add(a, b):
    return a + b

# Each function call creates:
# 1. A new frame object
# 2. Argument parsing
# 3. Local namespace setup
# 4. Frame cleanup on return
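Frames are visible from Python via the CPython-specific sys._getframe() (a small sketch):

```python
import sys

def inner():
    frame = sys._getframe()  # the frame currently executing inner()
    # f_back links frames into the call stack
    return frame.f_code.co_name, frame.f_back.f_code.co_name

def outer():
    return inner()

name, caller = outer()
print(name, caller)  # inner outer
```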
C Extensions Interface
Python Code

# Python usage
import fastmath
result = fastmath.add(10, 20)
print(result)  # 30
C Extension
#include <Python.h>

static PyObject* add(PyObject* self, PyObject* args) {
    long a, b;
    if (!PyArg_ParseTuple(args, "ll", &a, &b))
        return NULL;
    return PyLong_FromLong(a + b);
}

static PyMethodDef methods[] = {
    {"add", add, METH_VARARGS, "Add two numbers"},
    {NULL, NULL, 0, NULL}
};

static struct PyModuleDef module = {
    PyModuleDef_HEAD_INIT, "fastmath", "Fast math operations", -1, methods
};

PyMODINIT_FUNC PyInit_fastmath(void) {
    return PyModule_Create(&module);
}
Build Process
.c file
Compile
.so/.pyd
python setup.py build_ext --inplace
Relative performance: pure Python (baseline) < Cython (compiled Python) < C extension (native C)
Writing C Extensions
#include <Python.h>

static PyObject* fast_add(PyObject* self, PyObject* args) {
    long a, b;
    if (!PyArg_ParseTuple(args, "ll", &a, &b))
        return NULL;
    return PyLong_FromLong(a + b);
}

static PyMethodDef module_methods[] = {
    {"fast_add", fast_add, METH_VARARGS, "Add two numbers"},
    {NULL, NULL, 0, NULL}
};

static struct PyModuleDef module = {
    PyModuleDef_HEAD_INIT, "fastmath", "Fast math operations", -1, module_methods
};

PyMODINIT_FUNC PyInit_fastmath(void) {
    return PyModule_Create(&module);
}
Using Cython for Optimization
# Pure Python
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)

# Cython version (fibonacci.pyx)
def fibonacci_cy(int n):
    if n <= 1:
        return n
    return fibonacci_cy(n-1) + fibonacci_cy(n-2)
Performance Profiling
Using cProfile
import cProfile
import pstats

def profile_code():
    # Your code here
    pass

cProfile.run('profile_code()', 'profile_stats')

# Analyze results
p = pstats.Stats('profile_stats')
p.sort_stats('cumulative')
p.print_stats(10)
Memory Profiling
from memory_profiler import profile

@profile
def memory_intensive():
    a = [1] * (10 ** 6)
    b = [2] * (2 * 10 ** 7)
    del b
    return a
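The standard library's tracemalloc offers similar insight without third-party dependencies (exact numbers vary by platform):

```python
import tracemalloc

tracemalloc.start()

data = [1] * (10 ** 6)  # roughly 8 MB of pointer storage for the list

current, peak = tracemalloc.get_traced_memory()
print(f"Current: {current / 1e6:.1f} MB, peak: {peak / 1e6:.1f} MB")

tracemalloc.stop()
```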
Python 3.11+ Optimizations
Adaptive Bytecode
Python 3.11 introduces adaptive bytecode that specializes based on runtime behavior:
def add_numbers(a, b):
    return a + b

# Specializes for int after seeing int inputs
# First calls: generic BINARY_OP
# After a short warmup with ints: specialized BINARY_OP_ADD_INT
Frame Objects Optimization
- Lazy frame creation
- Reduced memory overhead
- Faster function calls
Best Practices for Performance
- Use built-in functions: They're implemented in C
- List comprehensions: Faster than loops
- Local variables: Faster than global
- Use __slots__: Reduce memory for classes
- Profile before optimizing: Measure, don't guess
# Slower
result = []
for i in range(1000):
    if i % 2 == 0:
        result.append(i * 2)

# Faster
result = [i * 2 for i in range(1000) if i % 2 == 0]
Debugging CPython
Using gdb with Python
# Debug Python with gdb (the py-* commands come from CPython's gdb
# helpers in Tools/gdb/libpython.py)
gdb python
(gdb) run script.py
(gdb) py-bt      # Python backtrace
(gdb) py-list    # List Python source
(gdb) py-locals  # Show local variables
Inspecting Objects
import sys

def inspect_pyobject(obj):
    address = id(obj)  # in CPython, id() is the object's memory address
    refcount = sys.getrefcount(obj)
    size = sys.getsizeof(obj)
    print(f"Address: 0x{address:x}")
    print(f"Refcount: {refcount}")
    print(f"Size: {size} bytes")
Conclusion
Understanding CPython internals helps you:
- Write more efficient Python code
- Debug performance issues
- Understand Python's limitations
- Make informed decisions about optimization
- Contribute to CPython development
The journey from Python source code to execution involves complex machinery: bytecode compilation, memory management, garbage collection, and the GIL. While Python abstracts these details, knowing them makes you a better Python developer.