
Embed a Python 3 interpreter in a C program

Summary

In this post, we will use a fake use case to see how to embed a Python 3 interpreter in a C program, and call a Python script from a binary. As we'll see below, there are several reasons for doing this, such as making a C program more modular, or enjoying a more productive language for non-performance-critical tasks.

The code used in this post can be found on my GitHub account.



It is well known that the most widely used Python interpreter is written in C. Because of this original Python implementation (called CPython, not to be confused with Cython, a language that extends Python with C-like static typing), the two languages have naturally worked well together since the beginning of Python.

While C is an excellent choice for building software where performance is critical, it has some drawbacks:

  • The compilation phase makes it less flexible than an interpreted language,
  • Because of its low-level nature, productivity is not as good as with a higher-level language like Python or Ruby

If some part of your C application does not strictly need bleeding-edge performance, it may be a good idea to build it with a more productive tool like Python. Or, as mentioned in the official documentation (which is the main inspiration for this post), you can also use this feature to let your users easily develop plugins for your existing C application.


There are some interesting resources on the Web to guide you through embedding a Python interpreter in your C program, but I felt that some of them lacked clarity. In this post, I'd like to introduce you to this technique with a hello-world example.

Context

Let's say we want to build a C program that performs Natural Language Processing (NLP) on the content of a web page, whose URL is passed as an argument. While the NLP tasks themselves require high-performance computing, we'd like users to be able to write their own script that returns the text content of a web page, given its URL.

Disclosure: we won't do any kind of NLP here :) 

Implementation

Here is the Python script that a user may come up with:

import re
import sys
from bs4 import BeautifulSoup
import requests

def extract_content(url):
    """Given a URL, extract all the text content, 
    save it to a text file and return the number of
    words that have been processed.
    """
    
    webpage = requests.get(url)
    soup = BeautifulSoup(webpage.content, "lxml")
    content = soup.find("body").findAll("p")
    count = 0
    with open("content.txt", "w") as file:
        for paragraph in content:
            text = paragraph.get_text()
            words = re.findall(r'\b\w+', text)
            file.write(text)
            count += len(words)
    return count


if __name__ == '__main__':
    print("\n--- entering python script ---")
    url = sys.argv[1]
    count = extract_content(url)
    print("{} words have been saved to './content.txt'".format(count))
    print("--- exiting python script ---\n")

No need to provide lots of details here: what this script does is to parse a webpage based on the URL given as argument, and save the text content to a file in the current directory.
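To make the word count more concrete, here is a minimal sketch of that step alone, applied to an in-memory string instead of a downloaded page. The helper name count_words and the sample sentence are made up for illustration; only the regular expression comes from the script above.

```python
import re


def count_words(text):
    """Count words the way the script does: runs of word characters."""
    return len(re.findall(r'\b\w+', text))


# Hypothetical sample text; the apostrophe splits "isn't" into "isn" and "t"
sample = "Embedding Python isn't hard."
print(count_words(sample))  # prints 5
```

Note that this pattern treats an apostrophe as a word boundary, so the count is an approximation rather than a strict linguistic word count.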

Of course, we could have programmed this inside our C program, but as mentioned earlier, it is a task that doesn't require any particular performance, and it would probably take a few hundred lines of C code to produce the same result. Here, Python, with the appropriate modules, brings its high productivity and ease of implementation to get this work done quickly.


Here's our C code for this example:

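#include <Python.h>  // should come before the standard headers
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>  // sleep()

void nl_processor(FILE *source); // ideally, put this in 'main.h' file

int main(int argc, char *argv[])
{
    // Decode the program name from bytes to a wide-character string
    wchar_t *program = Py_DecodeLocale(argv[0], NULL);
    if (program == NULL) {
        fprintf(stderr, "Fatal error: cannot decode argv[0]\n");
        exit(1);
    }

    Py_SetProgramName(program);  // optional
    Py_Initialize();

    // Decode the command-line arguments and expose them as sys.argv
    wchar_t *args[argc];
    for (int i = 0; i < argc; i++) {
        args[i] = Py_DecodeLocale(argv[i], NULL);
        if (args[i] == NULL) {
            fprintf(stderr, "Fatal error: cannot decode argv[%d]\n", i);
            exit(1);
        }
    }

    PySys_SetArgv(argc, args);

    int status = -1;
    FILE *py_file = fopen("./main.py", "r");

    if (py_file != NULL) {
        // Import the third-party modules up front and report any failure
        const char *modules[] = {"bs4", "requests"};
        for (int i = 0; i < 2; i++) {
            PyObject *module = PyImport_ImportModule(modules[i]);
            if (module == NULL) {
                PyErr_Print();  // print the import error and clear it
            }
            Py_XDECREF(module);
        }
        status = PyRun_SimpleFile(py_file, "main.py");
        fclose(py_file);
        if (status != 0) {
            fprintf(stderr, "The Python script did not complete successfully.\n");
        }
    }
    else {
        fprintf(stderr, "Failed to open Python file. Please check that it exists...\n");
    }

    // Shut the interpreter down and release the decoded strings
    Py_Finalize();
    for (int i = 0; i < argc; i++) {
        PyMem_RawFree(args[i]);
    }
    PyMem_RawFree(program);

    if (status != 0) {
        return 1;
    }

    // The Python script writes its output to content.txt
    FILE *source = fopen("content.txt", "r");
    if (source == NULL) {
        fprintf(stderr, "Failed to open text file.\n");
        return 1;
    }

    nl_processor(source);
    fclose(source);

    printf("\n*** Task completed! ***\n");
    return 0;
}

void nl_processor(FILE *source)
{
    (void)source;  // unused in this fake example
    printf("-> Beginning NLP...\n");
    // do some complicated NLP here...
    // save results to a file
    // or whatever... :)
    sleep(2); // simulate 2 seconds of processing
    printf("...\n-> NLP completed.\n");
}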

Quick explanation of the different steps (see the official Python/C API documentation for more details):

  • Decode program name to get a Unicode from bytes, and set the program name
  • Initialize the Python interpreter
  • Decode the arguments passed to the program (in our case, it must be a single argument: the URL of the page we intend to do NLP on)
  • Pass the arguments to the Python session
  • Get the Python file
  • Import the modules needed to make our Python script run
  • Run the script
  • Proceed with the C program (here, the program will use a file created by our Python script to process whatever it needs to process)
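When debugging the user script, it can be handy to reproduce the "pass the arguments, then run the file" steps without rebuilding the C binary. The sketch below is only a rough Python-level analogue that uses the standard runpy module; it is not what the C API does internally. The names run_script_with_args and demo_main.py are hypothetical, and the tiny stand-in script replaces main.py so that no network access is needed.

```python
import os
import runpy
import sys
import tempfile


def run_script_with_args(script_path, args):
    """Run a script as __main__ with sys.argv set to [script_path, *args]."""
    saved_argv = sys.argv
    sys.argv = [script_path] + list(args)
    try:
        # Returns the globals of the executed script as a dict
        return runpy.run_path(script_path, run_name="__main__")
    finally:
        sys.argv = saved_argv  # restore the caller's arguments


# Hypothetical stand-in for main.py: it only reads the URL argument
script_source = "import sys\nurl = sys.argv[1]\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_main.py")
    with open(path, "w") as f:
        f.write(script_source)
    result = run_script_with_args(path, ["https://example.com"])

print(result["url"])  # prints https://example.com
```

Pointing the same helper at the real main.py with a real URL is a quick way to check the script on its own before involving the C program.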

Note that if the Python script fails in any way, no text file is produced for the source variable, and the program stops with an error instead of running the NLP step.

Compile and link the program

This is not as trivial as compiling a simple C program. The compiler needs to find the Python headers, and the linker needs to link the final binary against the Python library and everything it depends on to run a regular Python script. On top of that, since the script uses packages that are not included in the Python Standard Library, make sure that bs4 and requests (as well as lxml, the parser the script asks BeautifulSoup to use) are installed on the machine that will run the program.

To point the compiler and linker to the required libraries, we'll need to add the appropriate flags for each step. To make it easy, we'll simply reuse the flags with which your Python interpreter was compiled. For instance, if you're using Python 3.5 on Ubuntu/Debian, you can get the compiler flags by typing:

$ /usr/bin/python3.5-config --cflags

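And the linker flags can be found with this command (from Python 3.8 onwards, add the --embed option so that the Python library itself is included in the output):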

$ /usr/bin/python3.5-config --ldflags

So, to make sure we don't leave out any library the program needs, we will build it using the output of these two commands.
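If you are curious about where these flags come from, the interpreter can report most of the underlying values itself through the standard sysconfig module. The following is a minimal sketch that approximates the include and library flags; the helper name embed_flags is made up for this post, the output is not guaranteed to match python3.5-config exactly, and it assumes a Unix-like installation where the relevant configuration variables are set.

```python
import sys
import sysconfig


def embed_flags():
    """Approximate the compiler and linker flags needed to embed this interpreter."""
    include_dirs = {sysconfig.get_path("include"), sysconfig.get_path("platinclude")}
    cflags = ["-I" + path for path in sorted(d for d in include_dirs if d)]

    ldflags = []
    lib_dir = sysconfig.get_config_var("LIBDIR")  # may be None on some platforms
    if lib_dir:
        ldflags.append("-L" + lib_dir)
    # Library name such as python3.5m: version followed by the ABI flags
    ldflags.append("-lpython" + sysconfig.get_python_version() + getattr(sys, "abiflags", ""))
    # Extra system libraries the interpreter itself was linked against
    for var in ("LIBS", "SYSLIBS"):
        ldflags.extend((sysconfig.get_config_var(var) or "").split())
    return {"cflags": cflags, "ldflags": ldflags}


flags = embed_flags()
print("cflags: ", " ".join(flags["cflags"]))
print("ldflags:", " ".join(flags["ldflags"]))
```

For the actual build, the config script remains the safer choice, since it also carries the optimization and warning options the interpreter was built with.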

First, we compile and assemble the program, without linking:

$ gcc main.c -c $(/usr/bin/python3.5-config --cflags)

Then, we proceed with the final step, the linking:

$ gcc -o fake_nlp main.o $(/usr/bin/python3.5-config --ldflags)

This yields no errors or warnings. We now have, in our working directory, a binary called fake_nlp that should be able to call the Python script by itself.

Test and final words

Let's try our little program by feeding it a blog post URL:

$ ./fake_nlp https://medium.com/@edouardtheron/m%C3%A9moire-de-guerre-89b9b7c31dd5

--- entering python script ---
1864 words have been saved to './content.txt'
--- exiting python script ---

-> Beginning NLP...

-> NLP completed.

*** Task completed! ***

$ _

The program completed with no error: this means that it managed to read from the text file created by Python after the interpreter has completed its task. In other words: we did it!




Any suggestions or criticism? Please do not hesitate to contact me, feedback is welcome!