Introduction
Python is one of the most popular programming languages, but what actually happens when you run `python script.py`? This article explores the internals of CPython, the reference implementation of Python: how source code is compiled to bytecode, how memory is managed, and why the Global Interpreter Lock (GIL) exists.
Python Execution Model
Python Execution Pipeline
- Source code: a .py file (e.g., def add(a, b): return a + b)
- Compilation cache: bytecode is cached in __pycache__/*.pyc files to speed up subsequent imports
- Adaptive interpreter: Python 3.11+ specializes bytecode based on runtime behavior
Python code goes through several stages before execution:
- Parsing: Source code → Abstract Syntax Tree (AST)
- Compilation: AST → Bytecode
- Execution: Bytecode → Python Virtual Machine
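These stages can be observed from Python itself with the ast and compile machinery (a small sketch):

```python
import ast
import dis

source = "def add(a, b):\n    return a + b\n"

# Stage 1: parse the source into an AST
tree = ast.parse(source)
print(type(tree).__name__)  # Module

# Stage 2: compile the AST down to a code object (bytecode)
code = compile(tree, "<example>", "exec")

# Stage 3: the PVM executes the bytecode
namespace = {}
exec(code, namespace)
print(namespace["add"](3, 5))  # 8

# Peek at the function's bytecode
dis.dis(namespace["add"])
```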
From Source to Bytecode
The Compilation Pipeline
# Python source code
def greet(name):
    return f"Hello, {name}!"

result = greet("World")
print(result)
Python Bytecode Visualization

def add(a, b):
    return a + b

result = add(3, 5)
How Python Bytecode Works
- Python compiles source code to bytecode before execution
- Bytecode is platform-independent and cached in .pyc files
- The Python Virtual Machine (PVM) executes bytecode instructions
- Each instruction manipulates the value stack and local/global namespaces
Understanding Python Bytecode
Python compiles source code to bytecode, which is executed by the Python Virtual Machine (PVM):
import dis

def add(a, b):
    return a + b

dis.dis(add)
Output:
  2           0 LOAD_FAST                0 (a)
              2 LOAD_FAST                1 (b)
              4 BINARY_ADD
              6 RETURN_VALUE

(Output from Python 3.10; in 3.11+ the opcodes differ, e.g. BINARY_ADD becomes BINARY_OP.)
Bytecode Instructions
Key bytecode instructions:
- LOAD_FAST: Load local variable
- LOAD_GLOBAL: Load global variable
- STORE_FAST: Store to local variable
- BINARY_ADD: Add two values from stack
- CALL_FUNCTION: Call a function
- RETURN_VALUE: Return from function
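A quick way to see these instructions in context is dis (exact opcode names vary by Python version; for instance, CALL_FUNCTION became CALL in 3.11):

```python
import dis

def shout(name):
    greeting = "Hello"            # compiles to STORE_FAST greeting
    return print(greeting, name)  # LOAD_GLOBAL print, a call, then RETURN_VALUE

dis.dis(shout)

opnames = [ins.opname for ins in dis.get_instructions(shout)]
print(opnames)
```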
Python Object Model
PyLongObject layout:
- Every Python object starts with PyObject_HEAD
- ob_refcnt tracks the reference count for memory management
- ob_type points to the object's type object (e.g., PyLong_Type)
- Additional fields store the actual data
Common PyObject Operations
Reference Counting
Py_INCREF(obj);
Py_DECREF(obj);
Type Checking
PyLong_Check(obj);
Py_TYPE(obj);
Object Creation
PyLong_FromLong(42);
PyUnicode_FromString("hello");
Everything is a PyObject
In CPython, every Python object is represented as a PyObject structure:
typedef struct _object {
    _PyObject_HEAD_EXTRA
    Py_ssize_t ob_refcnt;    // Reference count
    PyTypeObject *ob_type;   // Type pointer
} PyObject;
Type Objects
Every Python type (int, str, list, etc.) has a corresponding type object:
typedef struct _typeobject {
    PyObject_VAR_HEAD
    const char *tp_name;        // Type name
    Py_ssize_t tp_basicsize;    // Instance size
    destructor tp_dealloc;      // Deallocator
    getattrfunc tp_getattr;     // Get attribute
    setattrfunc tp_setattr;     // Set attribute
    // ... many more fields
} PyTypeObject;
Object Creation
When you create an object in Python:
x = 42 # Creates a PyLongObject
CPython:
- Allocates memory for PyLongObject
- Sets reference count to 1
- Sets type pointer to PyLong_Type
- Stores the value 42
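These details can be observed from Python (sizes and refcounts are CPython implementation details and vary by version and platform):

```python
import sys

x = 42  # a PyLongObject under the hood

print(type(x))             # ob_type points at the int type object
print(sys.getsizeof(x))    # size of the object header plus the stored value
print(sys.getrefcount(x))  # high, because small ints are shared interpreter-wide
```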
Memory Management
Memory Optimization Tips
- Use __slots__ to reduce memory overhead for classes
- Reuse objects when possible (especially small integers and strings)
- Use generators for large datasets to avoid loading everything into memory
- Profile memory usage with tools like memory_profiler or tracemalloc
PyMalloc: Python's Memory Allocator
CPython uses a hierarchical memory management system:
- Small objects (< 512 bytes): PyMalloc
- Large objects: System malloc
- Memory pools: Pre-allocated blocks
Memory Pools and Arenas
Arena (256 KB)
├── Pool 1 (4 KB) - 8-byte blocks
├── Pool 2 (4 KB) - 16-byte blocks
├── Pool 3 (4 KB) - 24-byte blocks
└── ... (up to 512-byte blocks)
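A small sketch of the 512-byte boundary using sys.getsizeof (whether a given object actually lands in a pymalloc pool is an implementation detail):

```python
import sys

small = "x" * 10      # well under 512 bytes: eligible for a pymalloc pool
large = "x" * 10_000  # over 512 bytes: handed to the system allocator

print(sys.getsizeof(small))
print(sys.getsizeof(large))

# CPython can dump pymalloc arena/pool statistics to stderr
# (CPython-specific, diagnostic only):
# sys._debugmallocstats()
```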
Object Allocation Strategy
# Small integer optimization
a = 256  # Uses cached object
b = 256  # Same object as 'a'
print(a is b)  # True

c = 257  # Creates new object
d = 257  # Different object
print(c is d)  # False in the REPL (in a script, constants within one
               # code object may be deduplicated, making this True)
Performance Optimization
CPython caches small integers (-5 to 256) and single-character strings for performance.
Reference Counting
How Reference Counting Works
import sys

x = []      # refcount = 1
y = x       # refcount = 2
z = [x, x]  # refcount = 4

print(sys.getrefcount(x))  # Shows 5 (includes the temporary reference from the call)
Reference Count Operations
// Increment reference count
Py_INCREF(obj);

// Decrement reference count
Py_DECREF(obj);  // Deallocates if refcount reaches 0
Circular References Problem
# Circular reference
class Node:
    def __init__(self):
        self.ref = None

a = Node()
b = Node()
a.ref = b
b.ref = a  # Circular reference!
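Reference counting alone can never free such a cycle, but CPython's cycle collector can. A minimal demonstration:

```python
import gc

class Node:
    def __init__(self):
        self.ref = None

a = Node()
b = Node()
a.ref = b
b.ref = a  # cycle: each node keeps the other's refcount above zero

# Drop the only external references; the refcounts stay nonzero,
# so reference counting alone never frees the pair...
del a, b

# ...but the cycle collector finds the unreachable cycle.
collected = gc.collect()
print(collected)  # at least the two Node objects
```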
Garbage Collection
Generational Garbage Collector
- Generation 0: new objects
- Generation 1: survived one collection
- Generation 2: long-lived objects

Circular reference detection: objects A-B-C form a cycle that is still reachable from live code and survives collection; objects D-E form an unreachable cycle and are reclaimed.
GC Configuration
import gc
gc.get_threshold()
# (700, 10, 10)
gc.set_threshold(800, 20, 20)
gc.collect() # Manual collection
Best Practices
- Break circular references explicitly
- Use weak references when appropriate
- Monitor with gc.get_stats()
- Profile with gc.set_debug()
Generational Garbage Collection
CPython uses a generational garbage collector for circular references:
- Generation 0: New objects
- Generation 1: Survived one collection
- Generation 2: Long-lived objects
GC Algorithm
- Subtract internal references: for each object in a generation, count how many incoming references originate within that same set
- Mark reachable: objects still referenced from outside, and everything they reach, are kept
- Sweep: the remaining unreachable cycles are collected

Note that CPython never compacts memory; objects do not move once allocated.
Controlling the GC
import gc

# Disable automatic collection
gc.disable()

# Manual collection
collected = gc.collect()
print(f"Collected {collected} objects")

# GC statistics
print(gc.get_stats())

# Set collection thresholds
gc.set_threshold(700, 10, 10)
The Global Interpreter Lock (GIL)
Working Around the GIL
- Use multiprocessing for CPU-bound parallelism
- Use asyncio for I/O-bound concurrency
- Write performance-critical code in C extensions
- Consider alternative Python implementations (PyPy, Jython)
- Use concurrent.futures for high-level parallelism
Why the GIL Exists
The GIL ensures thread safety for:
- Reference counting operations
- Memory allocation
- Python/C API calls
GIL Behavior
import threading
import time

def cpu_bound():
    total = 0
    for i in range(100_000_000):
        total += i
    return total

# Single thread
start = time.time()
cpu_bound()
print(f"Single thread: {time.time() - start:.2f}s")

# Multiple threads (doesn't help for CPU-bound work)
start = time.time()
threads = []
for _ in range(4):
    t = threading.Thread(target=cpu_bound)
    t.start()
    threads.append(t)
for t in threads:
    t.join()
print(f"Multi-thread: {time.time() - start:.2f}s")
GIL Release Points
The GIL is released during:
- Blocking I/O operations (file and socket reads/writes)
- time.sleep()
- C extension calls that explicitly release it
- Periodically between bytecode instructions: the interpreter offers to switch threads every 5 ms by default (see sys.setswitchinterval(); the 100-instruction check interval was Python 2 behavior)
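Because sleeping releases the GIL, I/O-bound threads genuinely overlap. A small demonstration (timings are approximate):

```python
import threading
import time

def io_bound():
    time.sleep(0.5)  # the GIL is released for the duration of the sleep

start = time.time()
threads = [threading.Thread(target=io_bound) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

elapsed = time.time() - start
print(f"4 sleeping threads: {elapsed:.2f}s")  # roughly 0.5s, not 2s
```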
Optimization Techniques
CPython Optimizations
Before Optimization
if True:
    x = 1
else:
    x = 2

y = 2 * 3 + 4
After Optimization
x = 1
y = 10
Compile-time optimizations: dead code elimination, constant folding
Quick Wins
- Use list comprehensions over loops
- Cache function results with @lru_cache
- Use built-in functions (written in C)
- Avoid repeated attribute lookups
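For example, caching with functools.lru_cache turns an exponential-time recursion into a linear one:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

print(fib(100))          # returns instantly thanks to memoization
print(fib.cache_info())  # hit/miss statistics
```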
Python 3.11+ Features
- Adaptive bytecode specialization
- Zero-cost exception handling
- Faster frame creation
- Improved error messages
Peephole Optimizer
CPython performs compile-time optimizations:
# Before optimization
if True:
    x = 1
else:
    x = 2

# After optimization (dead code eliminated)
x = 1
Constant Folding
# Compile-time evaluation
result = 2 * 3 + 4  # Becomes: result = 10
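The folded constant is visible in the compiled code object's constants table:

```python
code = compile("result = 2 * 3 + 4", "<example>", "exec")

# The folded value sits directly in the constants table; no arithmetic
# bytecode for 2 * 3 + 4 is emitted at all.
print(code.co_consts)  # contains 10
```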
String Interning
# String interning
a = "hello"
b = "hello"
print(a is b)  # True (interned)

c = "hello world"
d = "hello world"
print(c is d)  # May be False (not automatically interned)

# Force interning
import sys
e = sys.intern("hello world")
f = sys.intern("hello world")
print(e is f)  # True
Function Calls and Stack Frames
Python Stack Frame
typedef struct _frame {
    PyObject_VAR_HEAD
    struct _frame *f_back;      // Previous frame
    PyCodeObject *f_code;       // Code object
    PyObject *f_builtins;       // Builtin namespace
    PyObject *f_globals;        // Global namespace
    PyObject *f_locals;         // Local namespace
    PyObject **f_valuestack;    // Value stack
    // ... more fields
} PyFrameObject;
Function Call Overhead
def add(a, b):
    return a + b

# Each function call creates:
# 1. A new frame object
# 2. Argument parsing
# 3. Local namespace setup
# 4. Frame cleanup on return
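Frames are visible from Python via the CPython-specific sys._getframe() (a small sketch):

```python
import sys

def inner():
    frame = sys._getframe()  # the frame currently executing inner()
    # f_back links frames into the call stack
    return frame.f_code.co_name, frame.f_back.f_code.co_name

def outer():
    return inner()

name, caller = outer()
print(name, caller)  # inner outer
```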
C Extensions Interface
Python Code

# Python usage
import fastmath
result = fastmath.add(10, 20)
print(result)  # 30
C Extension
#include <Python.h>

static PyObject* add(PyObject* self, PyObject* args) {
    long a, b;
    if (!PyArg_ParseTuple(args, "ll", &a, &b))
        return NULL;
    return PyLong_FromLong(a + b);
}

static PyMethodDef methods[] = {
    {"add", add, METH_VARARGS, "Add two numbers"},
    {NULL, NULL, 0, NULL}
};

static struct PyModuleDef module = {
    PyModuleDef_HEAD_INIT, "fastmath", "Fast math operations", -1, methods
};

PyMODINIT_FUNC PyInit_fastmath(void) {
    return PyModule_Create(&module);
}
Build Process
.c file
Compile
.so/.pyd
python setup.py build_ext --inplace
Relative performance: pure Python (baseline) < Cython (compiled Python) < C extension (native C)
Writing C Extensions
#include <Python.h>

static PyObject* fast_add(PyObject* self, PyObject* args) {
    long a, b;
    if (!PyArg_ParseTuple(args, "ll", &a, &b))
        return NULL;
    return PyLong_FromLong(a + b);
}

static PyMethodDef module_methods[] = {
    {"fast_add", fast_add, METH_VARARGS, "Add two numbers"},
    {NULL, NULL, 0, NULL}
};

static struct PyModuleDef module = {
    PyModuleDef_HEAD_INIT, "fastmath", "Fast math operations", -1, module_methods
};

PyMODINIT_FUNC PyInit_fastmath(void) {
    return PyModule_Create(&module);
}
Using Cython for Optimization
# Pure Python
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)

# Cython version (fibonacci.pyx)
def fibonacci_cy(int n):
    if n <= 1:
        return n
    return fibonacci_cy(n-1) + fibonacci_cy(n-2)
Performance Profiling
Using cProfile
import cProfile
import pstats

def profile_code():
    # Your code here
    pass

cProfile.run('profile_code()', 'profile_stats')

# Analyze results
p = pstats.Stats('profile_stats')
p.sort_stats('cumulative')
p.print_stats(10)
Memory Profiling
from memory_profiler import profile

@profile
def memory_intensive():
    a = [1] * (10 ** 6)
    b = [2] * (2 * 10 ** 7)
    del b
    return a
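The standard library's tracemalloc offers similar insight without third-party dependencies (exact numbers vary by platform):

```python
import tracemalloc

tracemalloc.start()

data = [1] * (10 ** 6)  # roughly 8 MB of pointer storage for the list

current, peak = tracemalloc.get_traced_memory()
print(f"Current: {current / 1e6:.1f} MB, peak: {peak / 1e6:.1f} MB")

tracemalloc.stop()
```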
Python 3.11+ Optimizations
Adaptive Bytecode
Python 3.11 introduces adaptive bytecode that specializes based on runtime behavior:
def add_numbers(a, b):
    return a + b

# Specializes for int after seeing int inputs
# First calls: generic BINARY_OP
# After a short warmup with ints: specialized BINARY_OP_ADD_INT
Frame Objects Optimization
- Lazy frame creation
- Reduced memory overhead
- Faster function calls
Best Practices for Performance
- Use built-in functions: They're implemented in C
- List comprehensions: Faster than loops
- Local variables: Faster than global
- Use __slots__: Reduce memory for classes
- Profile before optimizing: Measure, don't guess
# Slower
result = []
for i in range(1000):
    if i % 2 == 0:
        result.append(i * 2)

# Faster
result = [i * 2 for i in range(1000) if i % 2 == 0]
Debugging CPython
Using gdb with Python
# Debug Python with gdb (the py-* commands come from CPython's gdb
# helpers in Tools/gdb/libpython.py)
gdb python
(gdb) run script.py
(gdb) py-bt      # Python backtrace
(gdb) py-list    # List Python source
(gdb) py-locals  # Show local variables
Inspecting Objects
import sys

def inspect_pyobject(obj):
    address = id(obj)  # in CPython, id() is the object's memory address
    refcount = sys.getrefcount(obj)
    size = sys.getsizeof(obj)
    print(f"Address: 0x{address:x}")
    print(f"Refcount: {refcount}")
    print(f"Size: {size} bytes")
Conclusion
Understanding CPython internals helps you:
- Write more efficient Python code
- Debug performance issues
- Understand Python's limitations
- Make informed decisions about optimization
- Contribute to CPython development
The journey from Python source code to execution involves complex machinery: bytecode compilation, memory management, garbage collection, and the GIL. While Python abstracts these details, knowing them makes you a better Python developer.