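In the first and second parts of this series, we discussed the basic principles of how Python programs are compiled and executed. We will keep focusing on such principles in later parts. In this part, however, let's see how those ideas are actually implemented in code.
The CPython codebase contains about 350,000 lines of C code (excluding header files) and about 600,000 lines of Python code. Reading all of it at once is not realistic, so today we mainly look at the parts that run every time a Python program is executed. We start from the main() function of the python executable and work our way down to the evaluation loop, the place where Python bytecode gets executed.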

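We don't need to understand every line of code. Instead, we'll concentrate on the interesting parts and try to get a basic picture of how a Python program starts up.
Two more notes. First, we'll go deep into only some functions and give just an overview of the others; I will, however, present them in execution order. Second, apart from a few struct definitions, I show the code as it appears in the codebase; the only change is that I add some explanatory comments. In the code below, multi-line comments /* */ are original, and single-line comments // are mine.
Now, let's begin our journey through the CPython source.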
1. Getting the CPython source
First, download the CPython repository:
$ git clone https://github.com/python/cpython/ && cd cpython
At the time of writing, the master branch holds the future CPython 3.10. We want the latest stable release, which is CPython 3.9, so let's switch branches:
$ git checkout 3.9
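In the root directory we find the following contents:
$ ls -p
CODE_OF_CONDUCT.md      Objects/                config.sub
Doc/                    PC/                     configure
Grammar/                PCbuild/                configure.ac
Include/                Parser/                 install-sh
LICENSE                 Programs/               m4/
Lib/                    Python/                 netlify.toml
Mac/                    README.rst              pyconfig.h.in
Makefile.pre.in         Tools/                  setup.py
Misc/                   aclocal.m4
Modules/                config.guess
Some of these subdirectories are of particular interest to this series:
- Grammar/ contains the grammar files we discussed last time.
- Include/ contains header files. They are used both by CPython and by users of the Python/C API.
- Lib/ contains the standard library modules written in Python. Some modules, such as argparse and wave, are written in pure Python; others wrap C code. For example, the Python io module wraps the C _io module.
- Modules/ contains the standard library modules written in C. Some of them, such as itertools, are meant to be imported directly; others are wrapped by Python modules.
- Objects/ contains the implementations of built-in types. If you want to know how int or float is implemented, this is the place to go.
- Parser/ contains the old parser, the old parser generator, the new parser and the tokenizer.
- Programs/ contains the source files that are compiled to executables.
- Python/ contains the source files of the interpreter itself, including the compiler, the evaluation loop and the builtins module.
- Tools/ contains tools for building and managing CPython. The new parser generator lives here as well.
If you are the kind of person whose heart rate goes up when there is no tests directory in sight, relax: it is inside Lib/. The tests are useful not only for developing CPython but also for understanding how it works.
For example, to understand what the peephole optimizer is expected to optimize, look at Lib/test/test_peepholer.py. And to understand what some piece of code does, comment it out, recompile CPython, and run the tests: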
$ ./python.exe -m test test_peepholer
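Then see which tests fail.
Ideally, compiling CPython takes just two commands:
$ ./configure
$ make -j -s
The make command produces an executable named python. Don't be surprised to see python.exe on macOS: the .exe extension is only there to distinguish the executable from the Python/ directory on a case-insensitive file system. See the Python Developer's Guide for more on compiling.
Now we can proudly say that we've built our own copy of CPython:
$ ./python.exe
Python 3.9.0+ (heads/3.9-dirty:20bdeedfb4, Oct 10 2020, 16:55:24)
[Clang 10.0.0 (clang-1000.10.44.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> 2 ** 16
65536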
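2. The source code
Like any C program, CPython starts executing at a main() function, which lives in Programs/python.c:
/* Minimal main program -- everything is loaded from the library */
#include "Python.h"
#ifdef MS_WINDOWS
int
wmain(int argc, wchar_t **argv)
{
    return Py_Main(argc, argv);
}
#else
int
main(int argc, char **argv)
{
    return Py_BytesMain(argc, argv);
}
#endif
There isn't much going on here. The only thing worth mentioning is that on Windows CPython uses wmain() as the entry point so that it receives the arguments as UTF-16 encoded strings. On other platforms CPython has to perform an extra step to convert char strings to wchar_t strings. The encoding of a char string depends on the locale settings, and the encoding of a wchar_t string depends on the size of wchar_t. For example, if sizeof(wchar_t) is 4, the UCS-4 encoding is used.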
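As a rough illustration of this conversion, the following Python sketch mimics what happens when byte arguments are decoded into text. It assumes a UTF-8 locale and uses the surrogateescape error handler described in PEP 383, so that bytes that cannot be decoded still survive a round trip. The sample file name is made up for the example.

```python
# Illustration only: decode bytes as a UTF-8 locale with the
# "surrogateescape" error handler would, then encode them back.
raw = b"caf\xc3\xa9-\xff.py"  # hypothetical argument: valid UTF-8 plus a stray 0xFF byte

text = raw.decode("utf-8", errors="surrogateescape")
restored = text.encode("utf-8", errors="surrogateescape")

print(ascii(text))       # the undecodable byte shows up as a lone surrogate
print(restored == raw)   # the original bytes are recovered
```

The undecodable byte is smuggled through as a lone surrogate code point, which is why the original bytes can be recovered exactly.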
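Py_Main() and Py_BytesMain() are defined in Modules/main.c. They do little more than call pymain_main() with different arguments:
int
Py_Main(int argc, wchar_t **argv)
{
    _PyArgv args = {
        .argc = argc,
        .use_bytes_argv = 0,
        .bytes_argv = NULL,
        .wchar_argv = argv};
    return pymain_main(&args);
}
int
Py_BytesMain(int argc, char **argv)
{
    _PyArgv args = {
        .argc = argc,
        .use_bytes_argv = 1,
        .bytes_argv = argv,
        .wchar_argv = NULL};
    return pymain_main(&args);
}
Let's look at pymain_main(). At first sight it doesn't seem to do much either:
static int
pymain_main(_PyArgv *args)
{
    PyStatus status = pymain_init(args);
    if (_PyStatus_IS_EXIT(status)) {
        pymain_free();
        return status.exitcode;
    }
    if (_PyStatus_EXCEPTION(status)) {
        pymain_exit_error(status);
    }
    return Py_RunMain();
}
In the previous part we saw that CPython has to do a lot of compilation work before a Python program can run. It turns out that CPython does a lot even before compilation starts, and this work makes up the CPython initialization. In the first part we said that CPython works in three stages:
- initialization
- compilation, and
- interpretation.
So pymain_main() first calls pymain_init() to perform the initialization and then calls Py_RunMain() to proceed with the next stages. The question remains: what does CPython do during initialization? We can make an educated guess. At the very least it has to:
- choose suitable encodings for arguments, environment variables, the standard streams and the file system, depending on the OS
- parse command-line arguments and read environment variables to determine how to run
- initialize the runtime state, the main interpreter state and the main thread state
- initialize built-in types and the builtins module
- initialize the sys module
- set up the import system
- create the __main__ module.
Before stepping into pymain_init(), let's discuss the initialization process in more detail.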
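2.1 Initialization
Since CPython 3.8, initialization is done in three distinct phases:
- preinitialization
- core initialization
- main initialization.
The preinitialization phase initializes the runtime state, sets up the default memory allocator and performs very basic configuration. There is no sign of Python yet.
The core initialization phase initializes the main interpreter state and the main thread state, built-in types and exceptions, the builtins module, the sys module and the import system. At this point the "core" of Python is usable, but some things are not available yet. For example, the sys module is only partially initialized, and only built-in and frozen modules can be imported (frozen modules are written in Python, but their compiled code objects are embedded in the executable).
After the main initialization phase, CPython is fully initialized and ready to compile and execute.
What is the benefit of having distinct initialization phases? In short, it makes CPython easier to tune. For example, one can set a custom memory allocator in the preinitialized state or override the path configuration in the core-initialized state.
CPython itself doesn't need to customize anything, of course, but this ability matters to users of the Python/C API. PEP 432 and PEP 587 explain the advantages of multi-phase initialization in detail.
The pymain_init() function takes care of preinitialization and then calls Py_InitializeFromConfig() to perform the core and main initialization phases: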
static PyStatus
pymain_init(const _PyArgv *args)
{
    PyStatus status;
    // Initialize the runtime state
    status = _PyRuntime_Initialize();
    if (_PyStatus_EXCEPTION(status)) {
        return status;
    }
    // Initialize the default preconfig
    PyPreConfig preconfig;
    PyPreConfig_InitPythonConfig(&preconfig);
    // Perform preinitialization
    status = _Py_PreInitializeFromPyArgv(&preconfig, args);
    if (_PyStatus_EXCEPTION(status)) {
        return status;
    }
    // Preinitialized. Prepare the config for the next initialization phases
    // Initialize the default config
    PyConfig config;
    PyConfig_InitPythonConfig(&config);
    // Store the command line arguments in `config->argv`
    if (args->use_bytes_argv) {
        status = PyConfig_SetBytesArgv(&config, args->argc, args->bytes_argv);
    }
    else {
        status = PyConfig_SetArgv(&config, args->argc, args->wchar_argv);
    }
    if (_PyStatus_EXCEPTION(status)) {
        goto done;
    }
    // Perform core and main initialization
    status = Py_InitializeFromConfig(&config);
    if (_PyStatus_EXCEPTION(status)) {
        goto done;
    }
    status = _PyStatus_OK();
done:
    PyConfig_Clear(&config);
    return status;
}
_PyRuntime_Initialize() initializes the runtime state, which is stored in the global variable _PyRuntime. Its struct is defined as follows:
/* Full Python runtime state */
typedef struct pyruntimestate {
    /* Is running Py_PreInitialize()? */
    int preinitializing;
    /* Is Python preinitialized? Set to 1 by Py_PreInitialize() */
    int preinitialized;
    /* Is Python core initialized? Set to 1 by _Py_InitializeCore() */
    int core_initialized;
    /* Is Python fully initialized? Set to 1 by Py_Initialize() */
    int initialized;
    /* Set by Py_FinalizeEx(). Only reset to NULL if Py_Initialize()
       is called again. */
    _Py_atomic_address _finalizing;
    struct pyinterpreters {
        PyThread_type_lock mutex;
        PyInterpreterState *head;
        PyInterpreterState *main;
        int64_t next_id;
    } interpreters;
    unsigned long main_thread;
    struct _ceval_runtime_state ceval;
    struct _gilstate_runtime_state gilstate;
    PyPreConfig preconfig;
    // ... less interesting stuff for now
} _PyRuntimeState;
The last field shown, preconfig, holds the configuration used to preinitialize CPython. It is also used by the later phases. Here is its type definition:
typedef struct {
    int _config_init;     /* _PyConfigInitEnum value */
    /* Parse Py_PreInitializeFromBytesArgs() arguments?
       See PyConfig.parse_argv */
    int parse_argv;
    /* If greater than 0, enable isolated mode: sys.path contains
       neither the script's directory nor the user's site-packages directory.
       Set to 1 by the -I command line option. If set to -1 (default), inherit
       Py_IsolatedFlag value. */
    int isolated;
    /* If greater than 0: use environment variables.
       Set to 0 by -E command line option. If set to -1 (default), it is
       set to !Py_IgnoreEnvironmentFlag. */
    int use_environment;
    /* Set the LC_CTYPE locale to the user preferred locale? If equals to 0,
       set coerce_c_locale and coerce_c_locale_warn to 0. */
    int configure_locale;
    /* Coerce the LC_CTYPE locale if it's equal to "C"? (PEP 538)
       Set to 0 by PYTHONCOERCECLOCALE=0. Set to 1 by PYTHONCOERCECLOCALE=1.
       Set to 2 if the user preferred LC_CTYPE locale is "C".
       If it is equal to 1, LC_CTYPE locale is read to decide if it should be
       coerced or not (ex: PYTHONCOERCECLOCALE=1). Internally, it is set to 2
       if the LC_CTYPE locale must be coerced.
       Disable by default (set to 0). Set it to -1 to let Python decide if it
       should be enabled or not. */
    int coerce_c_locale;
    /* Emit a warning if the LC_CTYPE locale is coerced?
       Set to 1 by PYTHONCOERCECLOCALE=warn.
       Disable by default (set to 0). Set it to -1 to let Python decide if it
       should be enabled or not. */
    int coerce_c_locale_warn;
#ifdef MS_WINDOWS
    /* If greater than 1, use the "mbcs" encoding instead of the UTF-8
       encoding for the filesystem encoding.
       Set to 1 if the PYTHONLEGACYWINDOWSFSENCODING environment variable is
       set to a non-empty string. If set to -1 (default), inherit
       Py_LegacyWindowsFSEncodingFlag value.
       See PEP 529 for more details. */
    int legacy_windows_fs_encoding;
#endif
    /* Enable UTF-8 mode? (PEP 540)
       Disabled by default (equals to 0).
       Set to 1 by "-X utf8" and "-X utf8=1" command line options.
       Set to 1 by PYTHONUTF8=1 environment variable.
       Set to 0 by "-X utf8=0" and PYTHONUTF8=0.
       If equals to -1, it is set to 1 if the LC_CTYPE locale is "C" or
       "POSIX", otherwise it is set to 0. Inherit Py_UTF8Mode value value. */
    int utf8_mode;
    /* If non-zero, enable the Python Development Mode.
       Set to 1 by the -X dev command line option. Set by the PYTHONDEVMODE
       environment variable. */
    int dev_mode;
    /* Memory allocator: PYTHONMALLOC env var.
       See PyMemAllocatorName for valid values. */
    int allocator;
} PyPreConfig;
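After the call to _PyRuntime_Initialize(), the _PyRuntime global variable is initialized to its defaults. Next, PyPreConfig_InitPythonConfig() initializes a new default preconfig, and then _Py_PreInitializeFromPyArgv() performs the actual preinitialization.
Why initialize another preconfig when there is already one in _PyRuntime? Remember that many functions CPython calls are also exposed through the Python/C API, so CPython simply uses that API the way it is designed to be used. For the same reason, you will often come across calls in the CPython source that seem to do more than necessary. For instance, _PyRuntime_Initialize() is called several times during initialization, and the later calls have no effect.
_Py_PreInitializeFromPyArgv() reads command-line arguments, environment variables and global configuration variables, and based on them sets _PyRuntime.preconfig, the current locale and the memory allocator. It reads only the parameters that are relevant to preinitialization, for example the -E, -I and -X command-line options.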
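The effect of these preinitialization options can be observed from Python through sys.flags. The sketch below starts a child interpreter with different options and reports three flags; the helper name child_flags is made up for the example, and it assumes that sys.executable points to a regular python executable.

```python
import subprocess
import sys

# Printed by the child interpreter: three fields of sys.flags.
CODE = (
    "import sys; "
    "print(sys.flags.isolated, sys.flags.ignore_environment, sys.flags.utf8_mode)"
)


def child_flags(*options):
    """Run a child interpreter with the given options (hypothetical helper)."""
    result = subprocess.run(
        [sys.executable, *options, "-c", CODE],
        capture_output=True, text=True, check=True,
    )
    isolated, ignore_env, utf8_mode = (int(x) for x in result.stdout.split())
    return {"isolated": isolated, "ignore_environment": ignore_env, "utf8_mode": utf8_mode}


print(child_flags("-I"))          # isolated mode also ignores the environment
print(child_flags("-E"))          # ignore PYTHON* environment variables
print(child_flags("-X", "utf8"))  # enable the UTF-8 mode
```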
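At this point the runtime is preinitialized, and pymain_init() starts to prepare the config for the next initialization phase. Do not confuse this config with preconfig: it holds most of the Python configuration and is used heavily throughout initialization and during the execution of a Python program.
Take a look at its rather long definition: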
/* --- PyConfig ---------------------------------------------- */
typedef struct {
    int _config_init;     /* _PyConfigInitEnum value */
    int isolated;         /* Isolated mode? see PyPreConfig.isolated */
    int use_environment;  /* Use environment variables? see PyPreConfig.use_environment */
    int dev_mode;         /* Python Development Mode? See PyPreConfig.dev_mode */
    /* Install signal handlers? Yes by default. */
    int install_signal_handlers;
    int use_hash_seed;      /* PYTHONHASHSEED=x */
    unsigned long hash_seed;
    /* Enable faulthandler?
       Set to 1 by -X faulthandler and PYTHONFAULTHANDLER. -1 means unset. */
    int faulthandler;
    /* Enable PEG parser?
       1 by default, set to 0 by -X oldparser and PYTHONOLDPARSER */
    int _use_peg_parser;
    /* Enable tracemalloc?
       Set by -X tracemalloc=N and PYTHONTRACEMALLOC. -1 means unset */
    int tracemalloc;
    int import_time;        /* PYTHONPROFILEIMPORTTIME, -X importtime */
    int show_ref_count;     /* -X showrefcount */
    int dump_refs;          /* PYTHONDUMPREFS */
    int malloc_stats;       /* PYTHONMALLOCSTATS */
    /* Python filesystem encoding and error handler:
       sys.getfilesystemencoding() and sys.getfilesystemencodeerrors().
       Default encoding and error handler:
       * if Py_SetStandardStreamEncoding() has been called: they have the
         highest priority;
       * PYTHONIOENCODING environment variable;
       * The UTF-8 Mode uses UTF-8/surrogateescape;
       * If Python forces the usage of the ASCII encoding (ex: C locale
         or POSIX locale on FreeBSD or HP-UX), use ASCII/surrogateescape;
       * locale encoding: ANSI code page on Windows, UTF-8 on Android and
         VxWorks, LC_CTYPE locale encoding on other platforms;
       * On Windows, "surrogateescape" error handler;
       * "surrogateescape" error handler if the LC_CTYPE locale is "C" or "POSIX";
       * "surrogateescape" error handler if the LC_CTYPE locale has been coerced
         (PEP 538);
       * "strict" error handler.
       Supported error handlers: "strict", "surrogateescape" and
       "surrogatepass". The surrogatepass error handler is only supported
       if Py_DecodeLocale() and Py_EncodeLocale() use directly the UTF-8 codec;
       it's only used on Windows.
       initfsencoding() updates the encoding to the Python codec name.
       For example, "ANSI_X3.4-1968" is replaced with "ascii".
       On Windows, sys._enablelegacywindowsfsencoding() sets the
       encoding/errors to mbcs/replace at runtime.
       See Py_FileSystemDefaultEncoding and Py_FileSystemDefaultEncodeErrors.
       */
    wchar_t *filesystem_encoding;
    wchar_t *filesystem_errors;
    wchar_t *pycache_prefix;  /* PYTHONPYCACHEPREFIX, -X pycache_prefix=PATH */
    int parse_argv;           /* Parse argv command line arguments? */
    /* Command line arguments (sys.argv).
       Set parse_argv to 1 to parse argv as Python command line arguments
       and then strip Python arguments from argv.
       If argv is empty, an empty string is added to ensure that sys.argv
       always exists and is never empty. */
    PyWideStringList argv;
    /* Program name:
       - If Py_SetProgramName() was called, use its value.
       - On macOS, use PYTHONEXECUTABLE environment variable if set.
       - If WITH_NEXT_FRAMEWORK macro is defined, use __PYVENV_LAUNCHER__
         environment variable is set.
       - Use argv[0] if available and non-empty.
       - Use "python" on Windows, or "python3 on other platforms. */
    wchar_t *program_name;
    PyWideStringList xoptions;     /* Command line -X options */
    /* Warnings options: lowest to highest priority. warnings.filters
       is built in the reverse order (highest to lowest priority). */
    PyWideStringList warnoptions;
    /* If equal to zero, disable the import of the module site and the
       site-dependent manipulations of sys.path that it entails. Also disable
       these manipulations if site is explicitly imported later (call
       site.main() if you want them to be triggered).
       Set to 0 by the -S command line option. If set to -1 (default), it is
       set to !Py_NoSiteFlag. */
    int site_import;
    /* Bytes warnings:
       * If equal to 1, issue a warning when comparing bytes or bytearray with
         str or bytes with int.
       * If equal or greater to 2, issue an error.
       Incremented by the -b command line option. If set to -1 (default), inherit
       Py_BytesWarningFlag value. */
    int bytes_warning;
    /* If greater than 0, enable inspect: when a script is passed as first
       argument or the -c option is used, enter interactive mode after
       executing the script or the command, even when sys.stdin does not appear
       to be a terminal.
       Incremented by the -i command line option. Set to 1 if the PYTHONINSPECT
       environment variable is non-empty. If set to -1 (default), inherit
       Py_InspectFlag value. */
    int inspect;
    /* If greater than 0: enable the interactive mode (REPL).
       Incremented by the -i command line option. If set to -1 (default),
       inherit Py_InteractiveFlag value. */
    int interactive;
    /* Optimization level.
       Incremented by the -O command line option. Set by the PYTHONOPTIMIZE
       environment variable. If set to -1 (default), inherit Py_OptimizeFlag
       value. */
    int optimization_level;
    /* If greater than 0, enable the debug mode: turn on parser debugging
       output (for expert only, depending on compilation options).
       Incremented by the -d command line option. Set by the PYTHONDEBUG
       environment variable. If set to -1 (default), inherit Py_DebugFlag
       value. */
    int parser_debug;
    /* If equal to 0, Python won't try to write ``.pyc`` files on the
       import of source modules.
       Set to 0 by the -B command line option and the PYTHONDONTWRITEBYTECODE
       environment variable. If set to -1 (default), it is set to
       !Py_DontWriteBytecodeFlag. */
    int write_bytecode;
    /* If greater than 0, enable the verbose mode: print a message each time a
       module is initialized, showing the place (filename or built-in module)
       from which it is loaded.
       If greater or equal to 2, print a message for each file that is checked
       for when searching for a module. Also provides information on module
       cleanup at exit.
       Incremented by the -v option. Set by the PYTHONVERBOSE environment
       variable. If set to -1 (default), inherit Py_VerboseFlag value. */
    int verbose;
    /* If greater than 0, enable the quiet mode: Don't display the
       copyright and version messages even in interactive mode.
       Incremented by the -q option. If set to -1 (default), inherit
       Py_QuietFlag value. */
    int quiet;
    /* If greater than 0, don't add the user site-packages directory to
       sys.path.
       Set to 0 by the -s and -I command line options , and the PYTHONNOUSERSITE
       environment variable. If set to -1 (default), it is set to
       !Py_NoUserSiteDirectory. */
    int user_site_directory;
    /* If non-zero, configure C standard steams (stdio, stdout,
       stderr):
       - Set O_BINARY mode on Windows.
       - If buffered_stdio is equal to zero, make streams unbuffered.
         Otherwise, enable streams buffering if interactive is non-zero. */
    int configure_c_stdio;
    /* If equal to 0, enable unbuffered mode: force the stdout and stderr
       streams to be unbuffered.
       Set to 0 by the -u option. Set by the PYTHONUNBUFFERED environment
       variable.
       If set to -1 (default), it is set to !Py_UnbufferedStdioFlag. */
    int buffered_stdio;
    /* Encoding of sys.stdin, sys.stdout and sys.stderr.
       Value set from PYTHONIOENCODING environment variable and
       Py_SetStandardStreamEncoding() function.
       See also 'stdio_errors' attribute. */
    wchar_t *stdio_encoding;
    /* Error handler of sys.stdin and sys.stdout.
       Value set from PYTHONIOENCODING environment variable and
       Py_SetStandardStreamEncoding() function.
       See also 'stdio_encoding' attribute. */
    wchar_t *stdio_errors;
#ifdef MS_WINDOWS
    /* If greater than zero, use io.FileIO instead of WindowsConsoleIO for sys
       standard streams.
       Set to 1 if the PYTHONLEGACYWINDOWSSTDIO environment variable is set to
       a non-empty string. If set to -1 (default), inherit
       Py_LegacyWindowsStdioFlag value.
       See PEP 528 for more details. */
    int legacy_windows_stdio;
#endif
    /* Value of the --check-hash-based-pycs command line option:
       - "default" means the 'check_source' flag in hash-based pycs
         determines invalidation
       - "always" causes the interpreter to hash the source file for
         invalidation regardless of value of 'check_source' bit
       - "never" causes the interpreter to always assume hash-based pycs are
         valid
       The default value is "default".
       See PEP 552 "Deterministic pycs" for more details. */
    wchar_t *check_hash_pycs_mode;
    /* --- Path configuration inputs ------------ */
    /* If greater than 0, suppress _PyPathConfig_Calculate() warnings on Unix.
       The parameter has no effect on Windows.
       If set to -1 (default), inherit !Py_FrozenFlag value. */
    int pathconfig_warnings;
    wchar_t *pythonpath_env; /* PYTHONPATH environment variable */
    wchar_t *home;           /* PYTHONHOME environment variable,
                                see also Py_SetPythonHome(). */
    /* --- Path configuration outputs ----------- */
    int module_search_paths_set;  /* If non-zero, use module_search_paths */
    PyWideStringList module_search_paths;  /* sys.path paths. Computed if
                                              module_search_paths_set is equal
                                              to zero. */
    wchar_t *executable;        /* sys.executable */
    wchar_t *base_executable;   /* sys._base_executable */
    wchar_t *prefix;            /* sys.prefix */
    wchar_t *base_prefix;       /* sys.base_prefix */
    wchar_t *exec_prefix;       /* sys.exec_prefix */
    wchar_t *base_exec_prefix;  /* sys.base_exec_prefix */
    wchar_t *platlibdir;        /* sys.platlibdir */
    /* --- Parameter only used by Py_Main() ---------- */
    /* Skip the first line of the source ('run_filename' parameter), allowing use of non-Unix forms of
       "#!cmd".  This is intended for a DOS specific hack only.
       Set by the -x command line option. */
    int skip_source_first_line;
    wchar_t *run_command;   /* -c command line argument */
    wchar_t *run_module;    /* -m command line argument */
    wchar_t *run_filename;  /* Trailing command line argument without -c or -m */
    /* --- Private fields ---------------------------- */
    /* Install importlib? If set to 0, importlib is not initialized at all.
       Needed by freeze_importlib. */
    int _install_importlib;
    /* If equal to 0, stop Python initialization before the "main" phase */
    int _init_main;
    /* If non-zero, disallow threads, subprocesses, and fork.
       Default: 0. */
    int _isolated_interpreter;
    /* Original command line arguments. If _orig_argv is empty and _argv is
       not equal to [''], PyConfig_Read() copies the configuration 'argv' list
       into '_orig_argv' list before modifying 'argv' list (if parse_argv
       is non-zero).
       _PyConfig_Write() initializes Py_GetArgcArgv() to this list. */
    PyWideStringList _orig_argv;
} PyConfig;
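pymain_init() calls PyConfig_InitPythonConfig() to create a default config, then calls PyConfig_SetBytesArgv() or PyConfig_SetArgv() to store the command-line arguments in config.argv, and finally calls Py_InitializeFromConfig() to perform the core and main initialization phases.
Let's now look at Py_InitializeFromConfig():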
PyStatus
Py_InitializeFromConfig(const PyConfig *config)
{
    if (config == NULL) {
        return _PyStatus_ERR("initialization config is NULL");
    }
    PyStatus status;
    // See, it is called once again!
    status = _PyRuntime_Initialize();
    if (_PyStatus_EXCEPTION(status)) {
        return status;
    }
    _PyRuntimeState *runtime = &_PyRuntime;
    PyThreadState *tstate = NULL;
    // The core initialization phase
    status = pyinit_core(runtime, config, &tstate);
    if (_PyStatus_EXCEPTION(status)) {
        return status;
    }
    config = _PyInterpreterState_GetConfig(tstate->interp);
    if (config->_init_main) {
        // The main initialization phase
        status = pyinit_main(tstate);
        if (_PyStatus_EXCEPTION(status)) {
            return status;
        }
    }
    return _PyStatus_OK();
}
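The initialization phases are clearly visible. The core phase is done by pyinit_core(), and the main phase is done by pyinit_main(). pyinit_core() initializes the "core" of Python. More specifically, it works in two steps:
- It prepares the configuration: it parses command-line arguments, reads environment variables, calculates the path configuration, chooses the encodings for the standard streams and the file system, and writes all of this to the config.
- It applies the configuration: it configures the standard streams, generates the secret key for hashing, creates the main interpreter state and the main thread state, initializes and takes the GIL, enables the garbage collector, initializes built-in types and exceptions, initializes the sys module and the builtins module, and sets up the import system for built-in and frozen modules.
During the first step, CPython calculates config.module_search_paths, which is later copied to sys.path. The rest of that step is less interesting, so we skip it.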
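One visible consequence of the first step is that the PYTHONPATH environment variable ends up in sys.path unless environment variables are ignored. The sketch below checks this with a child interpreter and a temporary directory. The helper names are made up for the example, and it assumes that the child can locate its standard library without environment variables.

```python
import json
import os
import subprocess
import sys
import tempfile


def child_sys_path(pythonpath, *options):
    """Return sys.path of a child interpreter started with PYTHONPATH set."""
    env = dict(os.environ, PYTHONPATH=pythonpath)
    code = "import json, sys; print(json.dumps(sys.path))"
    result = subprocess.run(
        [sys.executable, *options, "-c", code],
        env=env, capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)


def contains_dir(paths, directory):
    """Compare resolved paths so that symlinked temp directories still match."""
    target = os.path.realpath(directory)
    return any(os.path.realpath(p) == target for p in paths if p)


with tempfile.TemporaryDirectory() as extra_dir:
    honoured = contains_dir(child_sys_path(extra_dir), extra_dir)
    ignored = not contains_dir(child_sys_path(extra_dir, "-E"), extra_dir)

print(honoured, ignored)
```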
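Let's look at pyinit_config(), which pyinit_core() calls to perform the second step:
static PyStatus
pyinit_config(_PyRuntimeState *runtime,
              PyThreadState **tstate_p,
              const PyConfig *config)
{
    // Set Py_* global variables from config.
    // Initialize C standard streams (stdin, stdout, stderr).
    // Set secret key for hashing.
    PyStatus status = pycore_init_runtime(runtime, config);
    if (_PyStatus_EXCEPTION(status)) {
        return status;
    }
    PyThreadState *tstate;
    // Create the main interpreter state and the main thread state.
    // Take the GIL.
    status = pycore_create_interpreter(runtime, config, &tstate);
    if (_PyStatus_EXCEPTION(status)) {
        return status;
    }
    *tstate_p = tstate;
    // Init types, exceptions, sys, builtins, importlib, etc.
    status = pycore_interp_init(tstate);
    if (_PyStatus_EXCEPTION(status)) {
        return status;
    }
    /* Only when we get here is the runtime core fully initialized */
    runtime->core_initialized = 1;
    return _PyStatus_OK();
}
First, pycore_init_runtime() copies some of the config fields to the corresponding global configuration variables. These global variables were used to configure CPython before PyConfig was introduced, and they remain part of the Python/C API.
Next, pycore_init_runtime() sets the buffering modes for the stdin, stdout and stderr file pointers. On Unix-like systems this is done by calling the setvbuf() library function.
Finally, pycore_init_runtime() generates the secret key for hashing, which is stored in the _Py_HashSecret global variable. The key is passed, along with the input, to SipHash24, the hash function that CPython uses. A new random key is generated each time CPython starts, to protect against hash collision attacks.
Python and many other languages, including PHP, Ruby, JavaScript and C#, were once vulnerable to hash collision attacks. An attacker could send an application a set of strings that all hash to the same value. Putting such strings into the same set or dictionary (hash table) makes every insertion and lookup scan over the colliding entries, which burns CPU time. The solution is to supply the hash function with a randomly generated key that the attacker does not know. Python also lets you control the key generation by setting the PYTHONHASHSEED environment variable.
To learn more about hash collision attacks, see this talk. To learn more about CPython's hash algorithm, see PEP 456.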
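The effect of PYTHONHASHSEED is easy to check from Python. The sketch below computes the hash of the same string in separate child interpreters; with a fixed seed the value is reproducible across runs, whereas without one it normally changes from run to run. The helper name hash_in_child is made up for the example.

```python
import os
import subprocess
import sys


def hash_in_child(seed, text="abc"):
    """Return hash(text) as computed by a child interpreter with a fixed seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(result.stdout)


first = hash_in_child(1)
second = hash_in_child(1)
other_seed = hash_in_child(2)

print(first == second)      # same seed, same hash
print(first == other_seed)  # a different seed gives a different key
```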
In the first part of this series we learned that CPython uses a thread state to store thread-specific data, such as the call stack and the exception state, and an interpreter state to store interpreter-specific data, such as loaded modules and import settings. The pycore_create_interpreter() function creates an interpreter state and a thread state for the main OS thread. Here is the struct definition of the interpreter state:
// The PyInterpreterState typedef is in Include/pystate.h.
struct _is {
    // _PyRuntime.interpreters.head stores the most recently created interpreter.
    // The `next` pointer lets us access all the interpreters.
    struct _is *next;
    // `tstate_head` points to the most recently created thread state.
    // Thread states of the same interpreter are linked together.
    struct _ts *tstate_head;
    /* Reference to the _PyRuntime global variable. This field exists
       to not have to pass runtime in addition to tstate to a function.
       Get runtime from tstate: tstate->interp->runtime. */
    struct pyruntimestate *runtime;
    int64_t id;
    // For tracking references to the interpreter
    int64_t id_refcount;
    int requires_idref;
    PyThread_type_lock id_mutex;
    int finalizing;
    struct _ceval_state ceval;
    struct _gc_runtime_state gc;
    PyObject *modules;  // sys.modules points to it
    PyObject *modules_by_index;
    PyObject *sysdict;  // points to sys.__dict__
    PyObject *builtins; // points to builtins.__dict__
    PyObject *importlib;
    // Codec search
    PyObject *codec_search_path;
    PyObject *codec_search_cache;
    PyObject *codec_error_registry;
    int codecs_initialized;
    struct _Py_unicode_state unicode;
    PyConfig config;
    PyObject *dict;  /* Stores per-interpreter state */
    PyObject *builtins_copy;
    PyObject *import_func;
    /* Initialized to PyEval_EvalFrameDefault(). */
    _PyFrameEvalFunction eval_frame;
    // See the `atexit` module
    void (*pyexitfunc)(PyObject *);
    PyObject *pyexitmodule;
    uint64_t tstate_next_unique_id;
    // See the `warnings` module
    struct _warnings_runtime_state warnings;
    // Audit hooks, see sys.addaudithook
    PyObject *audit_hooks;
#if _PY_NSMALLNEGINTS + _PY_NSMALLPOSINTS > 0
    // Small integers are preallocated here so that they can be reused.
    // The default range is [-5, 256].
    PyLongObject* small_ints[_PY_NSMALLNEGINTS + _PY_NSMALLPOSINTS];
#endif
    // ... less interesting stuff for now
};
Note in particular that the configuration we read earlier is stored in the config field of the newly created interpreter state. The config belongs to the interpreter state.
The thread state struct is defined as follows:
// The PyThreadState typedef is in Include/pystate.h.
struct _ts {
    // Doubly linked list of thread states of the same interpreter
    struct _ts *prev;
    struct _ts *next;
    PyInterpreterState *interp;
    // Reference to the current frame (it can be NULL).
    // The call stack is accessible via frame->f_back.
    PyFrameObject *frame;
    // ... checking whether the recursion level is too deep
    // ... tracing/profiling
    /* The exception currently being raised */
    PyObject *curexc_type;
    PyObject *curexc_value;
    PyObject *curexc_traceback;
    /* The exception currently being handled, if no coroutines/generators
     * are present. Always last element on the stack referred to be exc_info.
     */
    _PyErr_StackItem exc_state;
    /* Pointer to the top of the stack of the exceptions currently
     * being handled */
    _PyErr_StackItem *exc_info;
    PyObject *dict;  /* Stores per-thread state */
    int gilstate_counter;
    PyObject *async_exc; /* Asynchronous exception to raise */
    unsigned long thread_id; /* Thread id where this tstate was created */
    /* Unique thread state id. */
    uint64_t id;
    // ... less interesting stuff for now
};
After creating the main thread state, pycore_create_interpreter() initializes the GIL, which prevents multiple threads from working with Python objects at the same time. If you create a new thread with the threading module, that thread waits before each entry into the evaluation loop until it takes the GIL. Its thread state is passed as an argument to the evaluation function, so the thread always has access to it.
If you take the GIL manually through the Python/C API, you must provide the corresponding thread state as well. For that purpose the thread state is stored in thread-specific storage (on Unix-like systems, by calling the pthread_setspecific() library function).
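The per-thread frame pointer described above can be observed from Python. The sketch below uses sys._current_frames(), a CPython-specific function documented in the sys module, which returns the topmost frame of every thread; following f_back walks that thread's call stack. The worker function and event names are made up for the example.

```python
import sys
import threading

started = threading.Event()
release = threading.Event()


def worker():
    started.set()
    release.wait()  # block so that the main thread can inspect our stack


thread = threading.Thread(target=worker)
thread.start()
started.wait()

# Maps thread identifiers to the frame currently executing in that thread.
frames = sys._current_frames()

# Walk the worker's call stack through the f_back links.
stack_names = []
frame = frames[thread.ident]
while frame is not None:
    stack_names.append(frame.f_code.co_name)
    frame = frame.f_back

release.set()
thread.join()

print(stack_names)
```

Each thread has its own chain of frames, which matches the idea that the call stack is part of the thread state.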
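The GIL deserves a post of its own, and so do the Python object system and the import mechanism. Still, this post touches on them briefly.
Having created the first interpreter state and the first thread state, pyinit_config() calls pycore_interp_init() to finish the core initialization phase. The logic of pycore_interp_init() is easy to follow:
static PyStatus
pycore_interp_init(PyThreadState *tstate)
{
    PyStatus status;
    PyObject *sysmod = NULL;
    status = pycore_init_types(tstate);
    if (_PyStatus_EXCEPTION(status)) {
        goto done;
    }
    status = _PySys_Create(tstate, &sysmod);
    if (_PyStatus_EXCEPTION(status)) {
        goto done;
    }
    status = pycore_init_builtins(tstate);
    if (_PyStatus_EXCEPTION(status)) {
        goto done;
    }
    status = pycore_init_import_warnings(tstate, sysmod);
done:
    // Py_XDECREF() decreases the reference count of an object.
    // If the reference count becomes 0, the object is deallocated.
    Py_XDECREF(sysmod);
    return status;
}
The pycore_init_types() function initializes built-in types. But what does that involve, and what are built-in types in the first place? As we know, everything in Python is an object. Numbers, strings, lists, functions, modules, frame objects, user-defined classes and even built-in types are all Python objects.
Every Python object is an instance of the PyObject struct or of another C struct whose first field is a PyObject. PyObject has two fields: the first is the reference count, of type Py_ssize_t, and the second is a PyTypeObject pointer that points to the object's type. Here is the definition of the PyObject struct:
typedef struct _object {
    _PyObject_HEAD_EXTRA // for debugging only
    Py_ssize_t ob_refcnt;
    PyTypeObject *ob_type;
} PyObject;
And here is the struct definition of the familiar float type:
typedef struct {
    PyObject_HEAD // a macro that expands to PyObject ob_base;
    double ob_fval;
} PyFloatObject;
In C, a pointer to any struct can be converted to a pointer to that struct's first member, and vice versa. Since the first member of every Python object is a PyObject, CPython can treat any Python object as a PyObject. You can think of this as a way of doing subclassing in C. The benefit is polymorphism: for example, any Python object can be passed to a function as an argument by passing a PyObject pointer.
CPython can do a lot through a PyObject because the behavior of a Python object is determined by its type, and a PyObject always has a type. The type tells CPython how to create objects of that type, how to compute their hashes, how to add and subtract them, how to call them, how to access their attributes, how to deallocate them, and much more.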
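The memory layout described above can be inspected from Python with ctypes. This is a CPython-specific sketch, not something to rely on in real code: it assumes that id() returns the address of the object, that ob_type is the last field of the PyObject header, and that object.__basicsize__ equals sizeof(PyObject). The helper names are made up for the example.

```python
import ctypes

POINTER_SIZE = ctypes.sizeof(ctypes.c_void_p)
HEADER_SIZE = object.__basicsize__  # assumed to equal sizeof(PyObject)


def type_address(obj):
    """Read the ob_type pointer, assumed to be the last field of the header."""
    address = id(obj) + HEADER_SIZE - POINTER_SIZE
    return ctypes.c_void_p.from_address(address).value


def float_payload(value):
    """Read ob_fval, the C double stored right after the PyObject header."""
    return ctypes.c_double.from_address(id(value) + HEADER_SIZE).value


x = 3.14
print(type_address(x) == id(float))     # ob_type points to the float type
print(float_payload(x))                 # the raw double behind the object
print(type_address(float) == id(type))  # the type of a type is `type`
```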
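Types are Python objects too, represented by the PyTypeObject struct. All types have the same type, PyType_Type, and the type of PyType_Type is PyType_Type itself. This sounds complicated, but an example makes it clear:
$ ./python.exe -q
>>> type([])
<class 'list'>
>>> type(type([]))
<class 'type'>
>>> type(type(type([])))
<class 'type'>
The fields of PyTypeObject are documented in detail in the Python/C API Reference Manual. Here I only give the struct definition so that you get a rough idea of what a type object stores.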
// PyTypeObject is a typedef for struct _typeobject
struct _typeobject {
    PyObject_VAR_HEAD // expands to
                      // PyObject ob_base;
                      // Py_ssize_t ob_size;
    const char *tp_name; /* For printing, in format "<module>.<name>" */
    Py_ssize_t tp_basicsize, tp_itemsize; /* For allocation */
    /* Methods to implement standard operations */
    destructor tp_dealloc;
    Py_ssize_t tp_vectorcall_offset;
    getattrfunc tp_getattr;
    setattrfunc tp_setattr;
    PyAsyncMethods *tp_as_async; /* formerly known as tp_compare (Python 2)
                                    or tp_reserved (Python 3) */
    reprfunc tp_repr;
    /* Method suites for standard classes */
    PyNumberMethods *tp_as_number;
    PySequenceMethods *tp_as_sequence;
    PyMappingMethods *tp_as_mapping;
    /* More standard operations (here for binary compatibility) */
    hashfunc tp_hash;
    ternaryfunc tp_call;
    reprfunc tp_str;
    getattrofunc tp_getattro;
    setattrofunc tp_setattro;
    /* Functions to access object as input/output buffer */
    PyBufferProcs *tp_as_buffer;
    /* Flags to define presence of optional/expanded features */
    unsigned long tp_flags;
    const char *tp_doc; /* Documentation string */
    /* Assigned meaning in release 2.0 */
    /* call function for all accessible objects */
    traverseproc tp_traverse;
    /* delete references to contained objects */
    inquiry tp_clear;
    /* Assigned meaning in release 2.1 */
    /* rich comparisons */
    richcmpfunc tp_richcompare;
    /* weak reference enabler */
    Py_ssize_t tp_weaklistoffset;
    /* Iterators */
    getiterfunc tp_iter;
    iternextfunc tp_iternext;
    /* Attribute descriptor and subclassing stuff */
    struct PyMethodDef *tp_methods;
    struct PyMemberDef *tp_members;
    struct PyGetSetDef *tp_getset;
    struct _typeobject *tp_base;
    PyObject *tp_dict;
    descrgetfunc tp_descr_get;
    descrsetfunc tp_descr_set;
    Py_ssize_t tp_dictoffset;
    initproc tp_init;
    allocfunc tp_alloc;
    newfunc tp_new;
    freefunc tp_free; /* Low-level free-memory routine */
    inquiry tp_is_gc; /* For PyObject_IS_GC */
    PyObject *tp_bases;
    PyObject *tp_mro; /* method resolution order */
    PyObject *tp_cache;
    PyObject *tp_subclasses;
    PyObject *tp_weaklist;
    destructor tp_del;
    /* Type attribute cache version tag. Added in version 2.6 */
    unsigned int tp_version_tag;
    destructor tp_finalize;
    vectorcallfunc tp_vectorcall;
};
Built-in types, such as int and list, are implemented by statically defining instances of PyTypeObject:
PyTypeObject PyList_Type = {
    PyVarObject_HEAD_INIT(&PyType_Type, 0)
    "list",
    sizeof(PyListObject),
    0,
    (destructor)list_dealloc,                   /* tp_dealloc */
    0,                                          /* tp_vectorcall_offset */
    0,                                          /* tp_getattr */
    0,                                          /* tp_setattr */
    0,                                          /* tp_as_async */
    (reprfunc)list_repr,                        /* tp_repr */
    0,                                          /* tp_as_number */
    &list_as_sequence,                          /* tp_as_sequence */
    &list_as_mapping,                           /* tp_as_mapping */
    PyObject_HashNotImplemented,                /* tp_hash */
    0,                                          /* tp_call */
    0,                                          /* tp_str */
    PyObject_GenericGetAttr,                    /* tp_getattro */
    0,                                          /* tp_setattro */
    0,                                          /* tp_as_buffer */
    Py_TPFLAGS_DEFAULT | Py_TPFLAGS_HAVE_GC |
        Py_TPFLAGS_BASETYPE | Py_TPFLAGS_LIST_SUBCLASS, /* tp_flags */
    list___init____doc__,                       /* tp_doc */
    (traverseproc)list_traverse,                /* tp_traverse */
    (inquiry)_list_clear,                       /* tp_clear */
    list_richcompare,                           /* tp_richcompare */
    0,                                          /* tp_weaklistoffset */
    list_iter,                                  /* tp_iter */
    0,                                          /* tp_iternext */
    list_methods,                               /* tp_methods */
    0,                                          /* tp_members */
    0,                                          /* tp_getset */
    0,                                          /* tp_base */
    0,                                          /* tp_dict */
    0,                                          /* tp_descr_get */
    0,                                          /* tp_descr_set */
    0,                                          /* tp_dictoffset */
    (initproc)list___init__,                    /* tp_init */
    PyType_GenericAlloc,                        /* tp_alloc */
    PyType_GenericNew,                          /* tp_new */
    PyObject_GC_Del,                            /* tp_free */
    .tp_vectorcall = list_vectorcall,
};
After a type is declared, it needs to be initialized. For example, special methods such as __call__ and __eq__ have to be added to the type's dictionary and pointed at the corresponding tp_* functions. This initialization is done by calling PyType_Ready() for each type:
PyStatus
_PyTypes_Init(void)
{
    // The names of the special methods "__hash__", "__call__", etc. are interned by this call
    PyStatus status = _PyTypes_InitSlotDefs();
    if (_PyStatus_EXCEPTION(status)) {
        return status;
    }
#define INIT_TYPE(TYPE, NAME) \
    do { \
        if (PyType_Ready(TYPE) < 0) { \
            return _PyStatus_ERR("Can't initialize " NAME " type"); \
        } \
    } while (0)
    INIT_TYPE(&PyBaseObject_Type, "object");
    INIT_TYPE(&PyType_Type, "type");
    INIT_TYPE(&_PyWeakref_RefType, "weakref");
    INIT_TYPE(&_PyWeakref_CallableProxyType, "callable weakref proxy");
    INIT_TYPE(&_PyWeakref_ProxyType, "weakref proxy");
    INIT_TYPE(&PyLong_Type, "int");
    INIT_TYPE(&PyBool_Type, "bool");
    INIT_TYPE(&PyByteArray_Type, "bytearray");
    INIT_TYPE(&PyBytes_Type, "str");
    INIT_TYPE(&PyList_Type, "list");
    INIT_TYPE(&_PyNone_Type, "None");
    INIT_TYPE(&_PyNotImplemented_Type, "NotImplemented");
    INIT_TYPE(&PyTraceBack_Type, "traceback");
    INIT_TYPE(&PySuper_Type, "super");
    INIT_TYPE(&PyRange_Type, "range");
    INIT_TYPE(&PyDict_Type, "dict");
    INIT_TYPE(&PyDictKeys_Type, "dict keys");
    // ... 50 more types
    return _PyStatus_OK();
#undef INIT_TYPE
}
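The result of this initialization is visible from Python: the C slots that a type fills in show up as entries in the type's dictionary. The short sketch below checks a few of them for list, whose static definition is shown above; it only inspects the type and changes nothing.

```python
# tp_repr is set for list, so PyType_Ready() adds a "__repr__" slot wrapper.
repr_wrapper = list.__dict__["__repr__"]
print(type(repr_wrapper).__name__)

# tp_hash is PyObject_HashNotImplemented, which surfaces as __hash__ = None.
print(list.__dict__["__hash__"])

# tp_call is 0 for list, so there is no "__call__" entry in its dictionary,
# while `type` (whose instances are callable) has one.
print("__call__" in list.__dict__, "__call__" in type.__dict__)
```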
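Some built-in types require additional type-specific initialization. For example, int needs to preallocate small integers in the interp->small_ints array so that they can be reused, and float needs to determine how floating-point numbers are represented on the current machine.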
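The small-integer cache can be observed with an identity check. This is a CPython implementation detail, and the sketch assumes the default range of [-5, 256] mentioned in the struct comment above. Integers are built from strings at run time so that the compiler cannot merge equal constants. The helper name same_object is made up for the example.

```python
def same_object(text):
    """Build the same integer twice and check whether one object is reused."""
    return int(text) is int(text)


# Values inside the assumed default cache range are shared objects.
cached = {n: same_object(str(n)) for n in (-5, 0, 256)}
print(cached)
```

Identity checks like this are only a way to peek at the cache; ordinary code should compare integers with ==.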
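Once built-in types are initialized, pycore_interp_init() calls _PySys_Create() to create the sys module. Why is sys the first module to be created?
It is certainly an important module: it holds the command-line arguments (sys.argv), the module search path (sys.path), plenty of system- and implementation-specific data (sys.version, sys.implementation, sys.thread_info and so on) and various functions for interacting with the interpreter (sys.addaudithook(), sys.settrace() and so on). But the main reason to create it this early is to initialize sys.modules.
sys.modules points to the interp->modules dictionary, which is also created by _PySys_Create(). It acts as a cache for imported modules: all loaded modules are stored there, and it is the first place looked at when a module is imported. The import system depends heavily on sys.modules.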
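The caching role of sys.modules is easy to demonstrate: whatever object sits under a name in this dictionary is what an import of that name returns. The sketch below registers a module object created by hand; the module name demo_cached_module is made up and does not exist on disk.

```python
import importlib
import sys
import types

# Create a module object by hand and put it into the cache.
fake = types.ModuleType("demo_cached_module")
fake.answer = 42
sys.modules["demo_cached_module"] = fake

# The import system finds the entry in sys.modules and never searches sys.path.
imported = importlib.import_module("demo_cached_module")
print(imported is fake, imported.answer)

# Repeated imports of a real module return the cached object as well.
import json
print(sys.modules["json"] is json)

del sys.modules["demo_cached_module"]  # clean up the fake entry
```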
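In fact, _PySys_Create() initializes the sys module only partially. Invocation-specific data, such as sys.argv and sys._xoptions, and path-related data, such as sys.path and sys.exec_prefix, are set during the main initialization phase.
Next, pycore_interp_init() calls pycore_init_builtins() to initialize the builtins module. The builtins module contains built-in functions, such as abs(), dir() and print(); built-in types, such as dict, int and str; built-in exceptions, such as Exception and ValueError; and built-in constants, such as False, Ellipsis and None.
Built-in functions are part of the module definition, whereas built-in types, exceptions and constants must be placed into the module's dictionary explicitly. When code runs, frame->f_builtins points to this dictionary, so these names can be looked up there. That is why we never need to import builtins explicitly.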
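The lookup through the builtins dictionary can be made visible by running code with a custom "__builtins__" entry in its globals. In the sketch below, the replacement len that always returns 42 is deliberately silly and exists only for the demonstration.

```python
import builtins

# Names such as len are found in the builtins module's dictionary.
print(builtins.len is len)

# Code executed with a custom "__builtins__" mapping resolves built-in names
# through that mapping instead of the real builtins module.
namespace = {"__builtins__": {"len": lambda seq: 42}}
exec("result = len([1, 2, 3])", namespace)
print(namespace["result"])
```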
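The last step of the core initialization phase is the call to pycore_init_import_warnings(). You have probably seen Python's warning mechanism at work, for example:
$ ./python.exe -q
>>> import imp
<stdin>:1: DeprecationWarning: the imp module is deprecated in favour of importlib; ...
CPython has filters that can ignore warnings, turn them into exceptions, or display them to the user in various ways. pycore_init_import_warnings() turns these filters on. In addition, it sets up the import system for built-in and frozen modules.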
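The three behaviors of the filters can be reproduced with the warnings module. The sketch below applies the "ignore", "error" and "always" actions to the same warning; the function legacy() is made up for the example.

```python
import warnings


def legacy():
    """Hypothetical function that emits a deprecation warning."""
    warnings.warn("legacy() is deprecated", DeprecationWarning, stacklevel=2)
    return "ok"


# 1. Ignore the warning: the call behaves as if nothing happened.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    ignored_result = legacy()

# 2. Turn the warning into an exception.
with warnings.catch_warnings():
    warnings.simplefilter("error")
    try:
        legacy()
        raised = False
    except DeprecationWarning:
        raised = True

# 3. Record the warning instead of printing it.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy()

print(ignored_result, raised, len(caught))
```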
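Built-in and frozen modules are two special kinds of modules. Both are compiled directly into the python executable. The difference is that built-in modules are written in C, while frozen modules are written in Python. How can a module written in Python be compiled into the executable? By incorporating the marshalled code object of the module into the C source code. The marshalled code object is produced by the Freeze tool.
_frozen_importlib is one such frozen module, and it is the core of the whole import system. An import statement in Python code eventually ends up in _frozen_importlib._find_and_load(). To support the import of built-in and frozen modules, pycore_init_import_warnings() calls init_importlib(), and the very first thing that function does is import _frozen_importlib. It may seem that importing this module requires the module itself, but CPython sidesteps the problem.
_frozen_importlib depends on two other modules. One is sys, needed to access sys.modules. The other is _imp, which implements the low-level import functions, including the ones that create built-in and frozen modules. To avoid the module having to import itself, init_importlib() creates the _imp module directly and then calls _frozen_importlib._install(sys, _imp) to inject it, together with sys, into _frozen_importlib.
Once this bootstrapping is done, the core initialization phase is complete.
The next step is the main initialization phase, pyinit_main(). After some checks, it calls init_interp_main() to do the main work, which can be summarized as follows: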
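Some traces of this bootstrapping are visible from a running interpreter. The following sketch assumes CPython; _frozen_importlib and _imp are private implementation modules, so the names it touches may change between versions and are used here for illustration only.

```python
import importlib
import sys

import _frozen_importlib  # frozen bootstrap module (CPython implementation detail)
import _imp               # low-level import helpers (CPython implementation detail)

# The two modules injected by _install() are compiled into the executable.
sys_is_builtin = "sys" in sys.builtin_module_names
imp_is_builtin = "_imp" in sys.builtin_module_names
print(sys_is_builtin, imp_is_builtin)

# _frozen_importlib is a frozen module, not something loaded from sys.path.
is_frozen = _imp.is_frozen("_frozen_importlib")
print(is_frozen)

# The functions mentioned in the text exist on the frozen module.
has_entry_points = all(
    hasattr(_frozen_importlib, name) for name in ("_find_and_load", "_install")
)
print(has_entry_points)

# importlib exposes the very same module object as importlib._bootstrap.
same_object = importlib._bootstrap is _frozen_importlib
print(same_object)
```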
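- Get the system's realtime and monotonic clocks (translator's note: the latter counts the ticks elapsed since system start), so that time.time(), time.monotonic() and time.perf_counter() work correctly.
- Finish the initialization of the sys module: set the path variables, such as sys.path, sys.executable and sys.exec_prefix, and the invocation-related variables, such as sys.argv and sys._xoptions.
- Add support for path-based (external) module imports. The initialization imports a frozen module, importlib._bootstrap_external, which enables importing modules based on sys.path. Another frozen module, zipimport, is imported as well to support importing modules from ZIP archives; in other words, the entries on sys.path may themselves be ZIP archives.
- Normalize the names of the encodings for the filesystem and the standard streams, and set the error handlers for encoding and decoding.
- Install the default signal handlers, which handle signals such as SIGINT received by the process. Custom handlers can be set with the signal module.
- Import the io module and initialize sys.stdin, sys.stdout and sys.stderr. This is essentially done by calling io.open() on the file descriptors of the standard streams.
- Set builtins.open to io.OpenWrapper, so that open() is available as a built-in function.
- Create the __main__ module, set __main__.__builtins__ to builtins and __main__.__loader__ to _frozen_importlib.BuiltinImporter. At this point the __main__ module is still empty.
- Import the warnings and site modules. The site module adds site-specific directories, such as /usr/local/lib/python3.9/site-packages/, to sys.path.
- Set interp->runtime->initialized to 1.
With that, the initialization of CPython is complete.
Now let's take a look at Py_RunMain().
2.2 Running a Python program
At first glance, Py_RunMain() does not seem to do much itself: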
int
Py_RunMain(void)
{
    int exitcode = 0;

    pymain_run_python(&exitcode);

    if (Py_FinalizeEx() < 0) {
        /* Value unlikely to be confused with a non-error exit status or
           other special meaning */
        exitcode = 120;
    }

    // Free the memory that is not freed by Py_FinalizeEx()
    pymain_free();

    if (_Py_UnhandledKeyboardInterrupt) {
        exitcode = exit_sigint();
    }

    return exitcode;
}
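Py_RunMain() first calls pymain_run_python() to run Python, and then calls Py_FinalizeEx() to undo the initialization. Py_FinalizeEx() frees most of the memory that CPython is able to free; the rest is freed by pymain_free(). Py_FinalizeEx() also calls the exit functions, including the ones registered by the user with the atexit module.
As we know, there are several ways to run Python code: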
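The point about exit functions is easy to check. The sketch below starts a child interpreter with subprocess so that its finalization can be observed from outside; the child program and the goodbye() function are made up for the example.

```python
import subprocess
import sys

# Source of a tiny child program that registers an exit function.
CHILD_SOURCE = """
import atexit

def goodbye():
    print("exit handler called")

atexit.register(goodbye)
print("main code finished")
"""

# Run the child to completion and capture what it printed.
completed = subprocess.run(
    [sys.executable, "-c", CHILD_SOURCE],
    capture_output=True,
    text=True,
    check=True,
)
lines = completed.stdout.splitlines()
print(lines)
```

The handler's line comes after the line printed by the main code, because registered functions run only when the interpreter is being finalized.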
Interactively:
$ ./cpython/python.exe
>>> import sys
>>> sys.path[:1]
['']
From standard input:
$ echo "import sys; print(sys.path[:1])" | ./cpython/python.exe
['']
As a command:
$ ./cpython/python.exe -c "import sys; print(sys.path[:1])"
['']
As a script:
$ ./cpython/python.exe 03/print_path0.py
['/Users/Victor/Projects/tenthousandmeters/python_behind_the_scenes/03']
As a module:
$ ./cpython/python.exe -m 03.print_path0
['/Users/Victor/Projects/tenthousandmeters/python_behind_the_scenes']
And, perhaps less commonly, as a package run as a script (print_path0_package is a directory containing a __main__.py file):
$ ./cpython/python.exe 03/print_path0_package
['/Users/Victor/Projects/tenthousandmeters/python_behind_the_scenes/03/print_path0_package']
Note that the commands were run from outside the cpython/ directory, and that sys.path[0] has a different value in each mode of invocation. The next function on our way, pymain_run_python(), computes the value of sys.path[0] and runs Python in the appropriate mode:
static void
pymain_run_python(int *exitcode)
{
    PyInterpreterState *interp = _PyInterpreterState_GET();
    PyConfig *config = (PyConfig*)_PyInterpreterState_GetConfig(interp);

    // Prepend the search path to `sys.path`
    PyObject *main_importer_path = NULL;
    if (config->run_filename != NULL) {
        // Calculate the search path for the case when the filename is a package
        // (ex: directory or ZIP file) which contains __main__.py, store it in `main_importer_path`.
        // Otherwise, left `main_importer_path` unchanged.
        // Handle other cases later.
        if (pymain_get_importer(config->run_filename, &main_importer_path,
                                exitcode)) {
            return;
        }
    }

    if (main_importer_path != NULL) {
        if (pymain_sys_path_add_path0(interp, main_importer_path) < 0) {
            goto error;
        }
    }
    else if (!config->isolated) {
        PyObject *path0 = NULL;
        // Compute the search path that will be prepended to `sys.path`:
        // the directory of the script if run as a script,
        // the current working directory if run as a module (-m),
        // an empty string otherwise
        int res = _PyPathConfig_ComputeSysPath0(&config->argv, &path0);
        if (res < 0) {
            goto error;
        }

        if (res > 0) {
            if (pymain_sys_path_add_path0(interp, path0) < 0) {
                Py_DECREF(path0);
                goto error;
            }
            Py_DECREF(path0);
        }
    }

    PyCompilerFlags cf = _PyCompilerFlags_INIT;

    // In interactive mode, print the version and platform information
    pymain_header(config);

    // In interactive mode, import `readline` to enable
    // autocompletion, line editing and command history
    pymain_import_readline(config);

    // Run Python depending on the mode of invocation (script, -m, -c, etc.)
    if (config->run_command) {
        *exitcode = pymain_run_command(config->run_command, &cf);
    }
    else if (config->run_module) {
        *exitcode = pymain_run_module(config->run_module, 1);
    }
    else if (main_importer_path != NULL) {
        *exitcode = pymain_run_module(L"__main__", 0);
    }
    else if (config->run_filename != NULL) {
        *exitcode = pymain_run_file(config, &cf);
    }
    else {
        *exitcode = pymain_run_stdin(config, &cf);
    }

    // Enter the interactive mode after executing a program.
    // Enabled by `-i` and `PYTHONINSPECT`
    pymain_repl(config, &cf, exitcode);
    goto done;

error:
    *exitcode = pymain_exit_err_print();

done:
    Py_XDECREF(main_importer_path);
}
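The dependence of sys.path[0] on the mode of invocation can also be reproduced programmatically. The sketch below is not part of CPython: run_python() is a hypothetical helper that starts a child interpreter, and the script is written to a temporary directory. PYTHONSAFEPATH is removed from the child's environment because newer Python versions use it to suppress the sys.path[0] entry altogether.

```python
import os
import subprocess
import sys
import tempfile

PRINT_PATH0 = "import sys; print(sys.path[0])"


def run_python(*args):
    """Run a child interpreter with the given arguments and return its stdout."""
    env = {k: v for k, v in os.environ.items() if k != "PYTHONSAFEPATH"}
    done = subprocess.run(
        [sys.executable, *args],
        capture_output=True,
        text=True,
        check=True,
        env=env,
    )
    return done.stdout.strip()


with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "print_path0.py")
    with open(script, "w") as fh:
        fh.write(PRINT_PATH0 + "\n")

    command_path0 = run_python("-c", PRINT_PATH0)  # command mode
    script_path0 = run_python(script)              # script mode
    script_dir = os.path.realpath(tmp)

print(repr(command_path0))
print(os.path.realpath(script_path0) == script_dir)
```

The paths are compared after os.path.realpath() because the temporary directory may be reached through a symbolic link on some systems.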
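Let's follow the script mode as an example. The next step is pymain_run_file(), which checks whether the file can be opened, makes sure it is not a directory, and then calls PyRun_AnyFileExFlags(). If the file is a terminal (isatty(fd) returns 1), the program enters the interactive mode:
$ ./python.exe /dev/ttys000
>>> 1 + 1
2
Otherwise, it calls PyRun_SimpleFileExFlags().
You are probably familiar with the .pyc files that pile up in the __pycache__ directories next to your modules. A .pyc file contains the compiled source code, that is, the code object of the module. Thanks to .pyc files, a module does not have to be recompiled every time it is imported. I guess you knew that already. But did you know that a .pyc file can be run directly?
$ ./cpython/python.exe 03/__pycache__/print_path0.cpython-39.pyc
['/Users/Victor/Projects/tenthousandmeters/python_behind_the_scenes/03/__pycache__']
PyRun_SimpleFileExFlags() checks whether the user is running a .pyc file and whether that file was compiled for the current CPython version. If so, it calls run_pyc_file().
If the file is not a .pyc file, it calls PyRun_FileExFlags(). Most importantly, PyRun_SimpleFileExFlags() also imports the __main__ module and passes its dictionary to PyRun_FileExFlags() as both the global and the local namespace in which the file is executed: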
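Running a .pyc file directly is easy to try from Python. The sketch below uses the standard py_compile module to produce the compiled file and then deletes the source, so the child interpreter has nothing but the .pyc to work with. The file names are arbitrary.

```python
import os
import py_compile
import subprocess
import sys
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "hello.py")
    with open(source, "w") as fh:
        fh.write("print('hello from', __loader__.__class__.__name__)\n")

    # Compile to an explicit .pyc path instead of the __pycache__ directory.
    pyc = py_compile.compile(
        source, cfile=os.path.join(tmp, "hello.pyc"), doraise=True
    )
    os.remove(source)  # only the compiled file is left

    # Run the compiled file with the same interpreter that produced it.
    done = subprocess.run(
        [sys.executable, pyc], capture_output=True, text=True, check=True
    )
    pyc_output = done.stdout.strip()

print(pyc_output)
```

The child also prints the class of __main__.__loader__, which lets you see which loader the interpreter picked for the compiled file.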
int
PyRun_SimpleFileExFlags(FILE *fp, const char *filename, int closeit,
                        PyCompilerFlags *flags)
{
    PyObject *m, *d, *v;
    const char *ext;
    int set_file_name = 0, ret = -1;
    size_t len;

    m = PyImport_AddModule("__main__");
    if (m == NULL)
        return -1;
    Py_INCREF(m);
    d = PyModule_GetDict(m);

    if (PyDict_GetItemString(d, "__file__") == NULL) {
        PyObject *f;
        f = PyUnicode_DecodeFSDefault(filename);
        if (f == NULL)
            goto done;
        if (PyDict_SetItemString(d, "__file__", f) < 0) {
            Py_DECREF(f);
            goto done;
        }
        if (PyDict_SetItemString(d, "__cached__", Py_None) < 0) {
            Py_DECREF(f);
            goto done;
        }
        set_file_name = 1;
        Py_DECREF(f);
    }

    // Check whether the file is a .pyc file
    len = strlen(filename);
    ext = filename + len - (len > 4 ? 4 : 0);
    if (maybe_pyc_file(fp, filename, ext, closeit)) {
        FILE *pyc_fp;
        /* Try to run a pyc file. First, re-open in binary */
        if (closeit)
            fclose(fp);
        if ((pyc_fp = _Py_fopen(filename, "rb")) == NULL) {
            fprintf(stderr, "python: Can't reopen .pyc file\n");
            goto done;
        }

        if (set_main_loader(d, filename, "SourcelessFileLoader") < 0) {
            fprintf(stderr, "python: failed to set __main__.__loader__\n");
            ret = -1;
            fclose(pyc_fp);
            goto done;
        }
        v = run_pyc_file(pyc_fp, filename, d, d, flags);
    } else {
        /* When running from stdin, leave __main__.__loader__ alone */
        if (strcmp(filename, "<stdin>") != 0 &&
            set_main_loader(d, filename, "SourceFileLoader") < 0) {
            fprintf(stderr, "python: failed to set __main__.__loader__\n");
            ret = -1;
            goto done;
        }
        v = PyRun_FileExFlags(fp, filename, Py_file_input, d, d,
                              closeit, flags);
    }
    flush_io();
    if (v == NULL) {
        Py_CLEAR(m);
        PyErr_Print();
        goto done;
    }
    Py_DECREF(v);
    ret = 0;
  done:
    if (set_file_name) {
        if (PyDict_DelItemString(d, "__file__")) {
            PyErr_Clear();
        }
        if (PyDict_DelItemString(d, "__cached__")) {
            PyErr_Clear();
        }
    }
    Py_XDECREF(m);
    return ret;
}
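PyRun_FileExFlags() starts the compilation process: it runs the parser, gets the AST of the module and calls run_mod() to run the AST. It also creates the PyArena, a memory arena from which the parser and the compiler allocate the AST and related data, so that everything can be released in one go by PyArena_Free() once the code has run.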
PyObject *
PyRun_FileExFlags(FILE *fp, const char *filename_str, int start, PyObject *globals,
                  PyObject *locals, int closeit, PyCompilerFlags *flags)
{
    PyObject *ret = NULL;
    mod_ty mod;
    PyArena *arena = NULL;
    PyObject *filename;
    int use_peg = _PyInterpreterState_GET()->config._use_peg_parser;

    filename = PyUnicode_DecodeFSDefault(filename_str);
    if (filename == NULL)
        goto exit;

    arena = PyArena_New();
    if (arena == NULL)
        goto exit;

    // Run the parser.
    // The new PEG parser is used by default;
    // pass `-X oldparser` to use the old parser.
    // `mod` stands for module. It's the root node of the AST
    if (use_peg) {
        mod = PyPegen_ASTFromFileObject(fp, filename, start, NULL, NULL, NULL,
                                        flags, NULL, arena);
    }
    else {
        mod = PyParser_ASTFromFileObject(fp, filename, NULL, start, 0, 0,
                                         flags, NULL, arena);
    }

    if (closeit)
        fclose(fp);
    if (mod == NULL) {
        goto exit;
    }

    // Compile the AST and run it
    ret = run_mod(mod, filename, globals, locals, flags, arena);

exit:
    Py_XDECREF(filename);
    if (arena != NULL)
        PyArena_Free(arena);
    return ret;
}
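run_mod() runs the compiler by calling PyAST_CompileObject(), gets the code object of the module and calls run_eval_code_obj() to execute it. In between, it raises the exec event, which is CPython's way of notifying auditing tools about something that happens inside the runtime. The mechanism is described in PEP 578.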
static PyObject *
run_mod(mod_ty mod, PyObject *filename, PyObject *globals, PyObject *locals,
        PyCompilerFlags *flags, PyArena *arena)
{
    PyThreadState *tstate = _PyThreadState_GET();
    PyCodeObject *co = PyAST_CompileObject(mod, filename, flags, -1, arena);
    if (co == NULL)
        return NULL;

    if (_PySys_Audit(tstate, "exec", "O", co) < 0) {
        Py_DECREF(co);
        return NULL;
    }

    PyObject *v = run_eval_code_obj(tstate, co, globals, locals);
    Py_DECREF(co);
    return v;
}
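The same parse, compile, audit, execute sequence can be imitated at the Python level. This is only an analogy to what run_mod() does in C, built from the public ast, compile(), exec() and sys.addaudithook() APIs. Audit hooks cannot be removed once installed, so the sketch uses a flag (recording, an assumption of this example) to switch the hook off afterwards.

```python
import ast
import sys

SOURCE = "x = 6 * 7\n"
seen_events = []
recording = True


def audit_hook(event, args):
    # Hooks stay installed for the life of the process,
    # so a flag is used to stop recording after the demonstration.
    if recording and event == "exec":
        seen_events.append(args[0])


sys.addaudithook(audit_hook)

# Parser: source text -> AST.
tree = ast.parse(SOURCE, filename="<demo>", mode="exec")
# Compiler: AST -> code object.
code = compile(tree, filename="<demo>", mode="exec")
# Evaluation: run the code object in a fresh namespace.
namespace = {}
exec(code, namespace)
recording = False

print(namespace["x"])
print(any(obj is code for obj in seen_events))
```

The hook receives the code object itself as the argument of the exec event, mirroring the "O" format and the co argument passed to _PySys_Audit() above.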
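In part 2 we saw how the compiler works:
- it builds a symbol table;
- it creates a CFG of basic blocks;
- it assembles the CFG into a code object.
This is exactly what PyAST_CompileObject() does, so we won't discuss the function any further.
Through a chain of calls, run_eval_code_obj() eventually reaches _PyEval_EvalCode(). I paste the whole chain here so you can see how the arguments are passed along: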
static PyObject *
run_eval_code_obj(PyThreadState *tstate, PyCodeObject *co, PyObject *globals, PyObject *locals)
{
    PyObject *v;
    // Handles the case when CPython is embedded. We can ignore it
    /*
     * We explicitly re-initialize _Py_UnhandledKeyboardInterrupt every eval
     * _just in case_ someone is calling into an embedded Python where they
     * don't care about an uncaught KeyboardInterrupt exception (why didn't they
     * leave config.install_signal_handlers set to 0?!?) but then later call
     * Py_Main() itself (which _checks_ this flag and dies with a signal after
     * its interpreter exits).  We don't want a previous embedded interpreter's
     * uncaught exception to trigger an unexplained signal exit from a future
     * Py_Main() based one.
     */
    _Py_UnhandledKeyboardInterrupt = 0;

    /* Set globals['__builtins__'] if it doesn't exist */
    // In our case, it was already set to the `builtins` module during the main initialization
    if (globals != NULL && PyDict_GetItemString(globals, "__builtins__") == NULL) {
        if (PyDict_SetItemString(globals, "__builtins__",
                                 tstate->interp->builtins) < 0) {
            return NULL;
        }
    }

    v = PyEval_EvalCode((PyObject*)co, globals, locals);
    if (!v && _PyErr_Occurred(tstate) == PyExc_KeyboardInterrupt) {
        _Py_UnhandledKeyboardInterrupt = 1;
    }
    return v;
}

PyObject *
PyEval_EvalCode(PyObject *co, PyObject *globals, PyObject *locals)
{
    return PyEval_EvalCodeEx(co,
                             globals, locals,
                             (PyObject **)NULL, 0,
                             (PyObject **)NULL, 0,
                             (PyObject **)NULL, 0,
                             NULL, NULL);
}

PyObject *
PyEval_EvalCodeEx(PyObject *_co, PyObject *globals, PyObject *locals,
                  PyObject *const *args, int argcount,
                  PyObject *const *kws, int kwcount,
                  PyObject *const *defs, int defcount,
                  PyObject *kwdefs, PyObject *closure)
{
    return _PyEval_EvalCodeWithName(_co, globals, locals,
                                    args, argcount,
                                    kws, kws != NULL ? kws + 1 : NULL,
                                    kwcount, 2,
                                    defs, defcount,
                                    kwdefs, closure,
                                    NULL, NULL);
}

PyObject *
_PyEval_EvalCodeWithName(PyObject *_co, PyObject *globals, PyObject *locals,
                         PyObject *const *args, Py_ssize_t argcount,
                         PyObject *const *kwnames, PyObject *const *kwargs,
                         Py_ssize_t kwcount, int kwstep,
                         PyObject *const *defs, Py_ssize_t defcount,
                         PyObject *kwdefs, PyObject *closure,
                         PyObject *name, PyObject *qualname)
{
    PyThreadState *tstate = _PyThreadState_GET();
    return _PyEval_EvalCode(tstate, _co, globals, locals,
                            args, argcount,
                            kwnames, kwargs,
                            kwcount, kwstep,
                            defs, defcount,
                            kwdefs, closure,
                            name, qualname);
}
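As we said earlier, a code object describes what a piece of code does, but to execute a code object CPython must also create a state for it, and that state is the frame object. _PyEval_EvalCode() creates a frame object for the given code object from the arguments it receives.
In our case most of the arguments are NULL, so creating the frame takes little work. Much more has to be done when CPython executes a code object with various kinds of arguments passed in, which is why _PyEval_EvalCode() is nearly 300 lines long. We will see what all those lines are for in later parts. For now you can skip through them; just note that the function ends by calling _PyEval_EvalFrame() to evaluate the frame object: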
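A frame object can be inspected from Python, which makes the split between code (what to do) and frame (the state of one execution) easier to see. The sketch below relies on sys._getframe(), a CPython-specific helper; show_frame() and caller() are made-up function names.

```python
import sys


def show_frame():
    frame = sys._getframe()  # the frame that is executing show_frame()
    return {
        # The code object this frame is executing.
        "code_name": frame.f_code.co_name,
        # The namespaces attached to the frame.
        "same_globals": frame.f_globals is globals(),
        "has_len_builtin": "len" in frame.f_builtins,
        # Frames are linked: f_back is the frame of the caller.
        "caller": frame.f_back.f_code.co_name,
    }


def caller():
    return show_frame()


info = caller()
print(info)
```

Calling show_frame() twice creates two distinct frames that share one code object, which is exactly the relationship described above.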
PyObject *
_PyEval_EvalCode(PyThreadState *tstate,
           PyObject *_co, PyObject *globals, PyObject *locals,
           PyObject *const *args, Py_ssize_t argcount,
           PyObject *const *kwnames, PyObject *const *kwargs,
           Py_ssize_t kwcount, int kwstep,
           PyObject *const *defs, Py_ssize_t defcount,
           PyObject *kwdefs, PyObject *closure,
           PyObject *name, PyObject *qualname)
{
    assert(is_tstate_valid(tstate));

    PyCodeObject *co = (PyCodeObject*)_co;
    PyFrameObject *f;
    PyObject *retval = NULL;
    PyObject **fastlocals, **freevars;
    PyObject *x, *u;
    const Py_ssize_t total_args = co->co_argcount + co->co_kwonlyargcount;
    Py_ssize_t i, j, n;
    PyObject *kwdict;

    if (globals == NULL) {
        _PyErr_SetString(tstate, PyExc_SystemError,
                         "PyEval_EvalCodeEx: NULL globals");
        return NULL;
    }

    /* Create the frame */
    f = _PyFrame_New_NoTrack(tstate, co, globals, locals);
    if (f == NULL) {
        return NULL;
    }
    fastlocals = f->f_localsplus;
    freevars = f->f_localsplus + co->co_nlocals;

    /* Create a dictionary for keyword parameters (**kwags) */
    if (co->co_flags & CO_VARKEYWORDS) {
        kwdict = PyDict_New();
        if (kwdict == NULL)
            goto fail;
        i = total_args;
        if (co->co_flags & CO_VARARGS) {
            i++;
        }
        SETLOCAL(i, kwdict);
    }
    else {
        kwdict = NULL;
    }

    /* Copy all positional arguments into local variables */
    if (argcount > co->co_argcount) {
        n = co->co_argcount;
    }
    else {
        n = argcount;
    }
    for (j = 0; j < n; j++) {
        x = args[j];
        Py_INCREF(x);
        SETLOCAL(j, x);
    }

    /* Pack other positional arguments into the *args argument */
    if (co->co_flags & CO_VARARGS) {
        u = _PyTuple_FromArray(args + n, argcount - n);
        if (u == NULL) {
            goto fail;
        }
        SETLOCAL(total_args, u);
    }

    /* Handle keyword arguments passed as two strided arrays */
    kwcount *= kwstep;
    for (i = 0; i < kwcount; i += kwstep) {
        PyObject **co_varnames;
        PyObject *keyword = kwnames[i];
        PyObject *value = kwargs[i];
        Py_ssize_t j;

        if (keyword == NULL || !PyUnicode_Check(keyword)) {
            _PyErr_Format(tstate, PyExc_TypeError,
                          "%U() keywords must be strings",
                          co->co_name);
            goto fail;
        }

        /* Speed hack: do raw pointer compares. As names are
           normally interned this should almost always hit. */
        co_varnames = ((PyTupleObject *)(co->co_varnames))->ob_item;
        for (j = co->co_posonlyargcount; j < total_args; j++) {
            PyObject *name = co_varnames[j];
            if (name == keyword) {
                goto kw_found;
            }
        }

        /* Slow fallback, just in case */
        for (j = co->co_posonlyargcount; j < total_args; j++) {
            PyObject *name = co_varnames[j];
            int cmp = PyObject_RichCompareBool(keyword, name, Py_EQ);
            if (cmp > 0) {
                goto kw_found;
            }
            else if (cmp < 0) {
                goto fail;
            }
        }

        assert(j >= total_args);
        if (kwdict == NULL) {

            if (co->co_posonlyargcount
                && positional_only_passed_as_keyword(tstate, co,
                                                     kwcount, kwnames))
            {
                goto fail;
            }

            _PyErr_Format(tstate, PyExc_TypeError,
                          "%U() got an unexpected keyword argument '%S'",
                          co->co_name, keyword);
            goto fail;
        }

        if (PyDict_SetItem(kwdict, keyword, value) == -1) {
            goto fail;
        }
        continue;

      kw_found:
        if (GETLOCAL(j) != NULL) {
            _PyErr_Format(tstate, PyExc_TypeError,
                          "%U() got multiple values for argument '%S'",
                          co->co_name, keyword);
            goto fail;
        }
        Py_INCREF(value);
        SETLOCAL(j, value);
    }

    /* Check the number of positional arguments */
    if ((argcount > co->co_argcount) && !(co->co_flags & CO_VARARGS)) {
        too_many_positional(tstate, co, argcount, defcount, fastlocals);
        goto fail;
    }

    /* Add missing positional arguments (copy default values from defs) */
    if (argcount < co->co_argcount) {
        Py_ssize_t m = co->co_argcount - defcount;
        Py_ssize_t missing = 0;
        for (i = argcount; i < m; i++) {
            if (GETLOCAL(i) == NULL) {
                missing++;
            }
        }
        if (missing) {
            missing_arguments(tstate, co, missing, defcount, fastlocals);
            goto fail;
        }
        if (n > m)
            i = n - m;
        else
            i = 0;
        for (; i < defcount; i++) {
            if (GETLOCAL(m+i) == NULL) {
                PyObject *def = defs[i];
                Py_INCREF(def);
                SETLOCAL(m+i, def);
            }
        }
    }

    /* Add missing keyword arguments (copy default values from kwdefs) */
    if (co->co_kwonlyargcount > 0) {
        Py_ssize_t missing = 0;
        for (i = co->co_argcount; i < total_args; i++) {
            PyObject *name;
            if (GETLOCAL(i) != NULL)
                continue;
            name = PyTuple_GET_ITEM(co->co_varnames, i);
            if (kwdefs != NULL) {
                PyObject *def = PyDict_GetItemWithError(kwdefs, name);
                if (def) {
                    Py_INCREF(def);
                    SETLOCAL(i, def);
                    continue;
                }
                else if (_PyErr_Occurred(tstate)) {
                    goto fail;
                }
            }
            missing++;
        }
        if (missing) {
            missing_arguments(tstate, co, missing, -1, fastlocals);
            goto fail;
        }
    }

    /* Allocate and initialize storage for cell vars, and copy free
       vars into frame. */
    for (i = 0; i < PyTuple_GET_SIZE(co->co_cellvars); ++i) {
        PyObject *c;
        Py_ssize_t arg;
        /* Possibly account for the cell variable being an argument. */
        if (co->co_cell2arg != NULL &&
            (arg = co->co_cell2arg[i]) != CO_CELL_NOT_AN_ARG) {
            c = PyCell_New(GETLOCAL(arg));
            /* Clear the local copy. */
            SETLOCAL(arg, NULL);
        }
        else {
            c = PyCell_New(NULL);
        }
        if (c == NULL)
            goto fail;
        SETLOCAL(co->co_nlocals + i, c);
    }

    /* Copy closure variables to free variables */
    for (i = 0; i < PyTuple_GET_SIZE(co->co_freevars); ++i) {
        PyObject *o = PyTuple_GET_ITEM(closure, i);
        Py_INCREF(o);
        freevars[PyTuple_GET_SIZE(co->co_cellvars) + i] = o;
    }

    /* Handle generator/coroutine/asynchronous generator */
    if (co->co_flags & (CO_GENERATOR | CO_COROUTINE | CO_ASYNC_GENERATOR)) {
        PyObject *gen;
        int is_coro = co->co_flags & CO_COROUTINE;

        /* Don't need to keep the reference to f_back, it will be set
         * when the generator is resumed. */
        Py_CLEAR(f->f_back);

        /* Create a new generator that owns the ready to run frame
         * and return that as the value. */
        if (is_coro) {
            gen = PyCoro_New(f, name, qualname);
        } else if (co->co_flags & CO_ASYNC_GENERATOR) {
            gen = PyAsyncGen_New(f, name, qualname);
        } else {
            gen = PyGen_NewWithQualName(f, name, qualname);
        }
        if (gen == NULL) {
            return NULL;
        }

        _PyObject_GC_TRACK(f);

        return gen;
    }

    retval = _PyEval_EvalFrame(tstate, f, 0);

fail: /* Jump here from prelude on failure */

    /* decref'ing the frame can cause __del__ methods to get invoked,
       which can call back into Python.  While we're done with the
       current Python frame (f), the associated C stack is still in use,
       so recursion_depth must be boosted for the duration.
    */
    if (Py_REFCNT(f) > 1) {
        Py_DECREF(f);
        _PyObject_GC_TRACK(f);
    }
    else {
        ++tstate->recursion_depth;
        Py_DECREF(f);
        --tstate->recursion_depth;
    }
    return retval;
}
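_PyEval_EvalFrame() is a wrapper around interp->eval_frame(), the frame evaluation function. The user can, in fact, replace this function with a custom one. Why would anyone do that? For example, to add a JIT compiler that stores the compiled machine code in a code object and executes it. This capability was introduced by PEP 523 in CPython 3.6.
By default, interp->eval_frame() is set to _PyEval_EvalFrameDefault(). The function is defined in Python/ceval.c and is nearly 3,000 lines long. Today, though, we are interested in only one of them, line 1336, which starts what we have been looking for all along: the evaluation loop.
3. Summary
We covered a lot in this part. We started with an overview of the CPython project, then compiled CPython and stepped through the source code, studying the initialization stages along the way. I hope this gives you a rough understanding of what CPython does before it starts interpreting bytecode. What happens after that is the subject of the next part.
In the meantime, to consolidate what we learned today and to discover more interesting things, I strongly recommend that you spend some time exploring the CPython source code on your own. I bet you will have plenty of questions after reading this post, and going through the source with those questions in mind is a good way to answer them. Have a good time exploring!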