Using Poppler/pdftotext and other custom binaries on AWS Lambda

A recurring question with AWS Lambda functions is how to properly use third-party binary executables in the Lambda execution environment. A lot of people like to use precompiled binaries but often they’re not available and the only option is to compile the package you want to use from source.

Before I get into detail I’ll break it down into the 3 required steps:

  1. Compile the desired package from source in the Lambda Runtime kernel
  2. Package up the executables and dependent shared libraries into a zip
  3. Include the package within your Lambda Function/Layer and modify the function’s settings to find it

I found each of these to be pretty poorly documented, and I spent a mind-numbing afternoon working it out from snippets I’d found elsewhere. In this example I’m talking about poppler-utils, a required package for many PDF and OCR conversion tools, but the concept is broadly similar when trying to bundle any third-party binary into a Lambda function. You can’t just do a yum or apt-get in Lambda, after all.

Compile the desired package from source in the Lambda Runtime kernel

Quick lesson: Lambda natively supports a variety of different language runtimes: Python 2.7, 3.6, 3.7, Node.js 8.10, 10, Ruby 2.5, Java 8, Go 1.x, .NET Core 1.0 and 2.1.

Each one is based on a specific Amazon Linux AMI and Kernel version, and if you want to compile your own binaries from source, you have to do it in the matching AMI to ensure it works correctly. Other guides I’ve seen recommend that you spin up an EC2 instance and do it that way, but it’s current year and Docker can do the same thing but locally and for free!

So we head over to the official Docker Hub page for Amazon Linux and pull an appropriate image. At the current time Python is based on amzn-ami-hvm-2018.03.0.20181129-x86_64-gp2, which annoyingly enough did not have a precise Docker image available, so I opted for the general tag of amazonlinux:2018.03

From your preferred local command line terminal running docker, do the following:

docker pull amazonlinux:2018.03
docker run --name amzn -d -t amazonlinux:2018.03 cat
docker exec -it amzn bash

A standard trick to hold a container open is to run it in detached mode (-d) with cat as the entry command. The container will now run until we decide to kill it, and we get started by hopping inside with a docker exec command for bash.

Poppler has a dependency on OpenJPEG, which must be installed first. The compilation tools we need are also missing from the base image, and some must be particular versions, so we’ve got to bootstrap a bunch of packages within our Docker container:

yum -y install openjpeg-devel libjpeg-devel fontconfig-devel libtiff-devel libpng-devel xz gcc gcc-c++ epel-release zip cmake3

cd /root
curl -L https://github.com/uclouvain/openjpeg/archive/v2.3.1/openjpeg-2.3.1.tar.gz | tar xvz
cd openjpeg-2.3.1
mkdir -v build &&
cd       build &&
cmake3 -DCMAKE_BUILD_TYPE=Release \
      -DCMAKE_INSTALL_PREFIX=/usr \
      -DBUILD_STATIC_LIBS=OFF .. &&
make && make install && cd ../../

That’s OpenJPEG installed. Next for Poppler itself:

curl https://poppler.freedesktop.org/poppler-0.59.0.tar.xz | tar xJv
cd poppler-0.59.0/ && ./configure --enable-static --enable-build-type=release && make && make install

And voila, we’ve manually compiled Poppler in an Amazon Linux docker image. Now, what the heck do we do with it? This is the step that I’ve found so very badly explained elsewhere.

Package up the executables and dependent shared libraries into a zip

This is the trickiest part in a way, because whilst you can probably follow steps to compile what you want in a local filesystem, what then? How do we extract what we need to get into Lambda? What do we even need?! Sadly this will vary from package to package so I’ve got no fixed answer (but read on for what to do with Poppler). As a guide, the make and make install steps will produce the binaries that you’re after (perhaps in /usr/bin or /usr/local/bin), and these binaries will rely on some shared libraries in /usr/lib64, /lib64 or /lib.

In the case of Poppler the compiled utility binaries exist (in this example) in /root/poppler-0.59.0/utils, with the actual executables under its .libs subdirectory, and include things like pdfimages, pdffonts, pdftohtml, pdftotext, and pdftoppm. Chances are you’re reading this because you need one of these.
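
If you’re not sure which shared libraries a binary relies on, ldd will list them. Here’s a rough sketch of a helper that boils ldd output down to just the resolved file paths, which are the candidates for copying into the package. The function name resolved_libs is my own invention, and the filter assumes the usual glibc ldd line format of "name => /path (address)".

```shell
# Read ldd-style output on stdin and print the resolved path of each
# shared library, one per line, without duplicates.
# Lines with no "=> /path" part (linux-vdso, the dynamic loader) and
# entries reported as "not found" are skipped.
resolved_libs() {
  awk '$2 == "=>" && $3 ~ /^\// { print $3 }' | sort -u
}
```

Inside the build container you’d pipe ldd into it, for example: ldd /root/poppler-0.59.0/utils/.libs/pdftotext | resolved_libs. Because anything ldd reports as not found is dropped, it’s worth eyeballing the raw ldd output as well.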

The shared libraries you should probably include are noted below, because now we’re going to package everything we need up.

Still within our Docker container, do the following:

cd /root
mkdir -p package/lib
mkdir -p package/bin
cp /usr/lib64/{libopenjpeg.so.2,libtiff.so.5,libjpeg.so.62,libpng12.so.0,libfreetype.so.6,libfontconfig.so.1,libjbig.so.2.0} /root/package/lib
cp /lib64/{libz.so.1,libexpat.so.1} /root/package/lib
cp poppler-0.59.0/poppler/.libs/libpoppler.so.70 /root/package/lib
cp poppler-0.59.0/utils/.libs/{pdftotext,pdfinfo,pdfseparate} /root/package/bin

Modify the last line there to include any of the poppler-utils that your Lambda function needs. Now to zip it all up:

cd package
zip -r9 ../package.zip *

Now sitting in /root/package.zip is everything we need! But wait, we’re still working within our fragile and volatile Docker container here. As we haven’t mounted any shared volumes, everything we’ve done inside the container will be lost the moment we remove it. That would be bad! We can safely run exit to leave the container and go back to our local shell, because the container is still running from our original cat command that started it.

Now we’re back in our local shell, we can copy the file from within the container to our local filesystem.

docker cp amzn:/root/package.zip package.zip
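
Before uploading, it’s worth a quick check that bin and lib sit at the top level of the archive rather than under a package/ prefix, since that layout decides where the files land in Lambda. A minimal sketch, assuming you have a way to list the entry names one per line (Info-ZIP’s unzip -Z1 does this); check_layout is a hypothetical helper name.

```shell
# Read zip entry names on stdin, one per line.
# Prints every entry that is NOT under bin/ or lib/ and returns
# non-zero if at least one such entry exists.
check_layout() {
  ! grep -Ev '^(bin|lib)/'
}
```

From your local shell that could be used as: unzip -Z1 package.zip | check_layout && echo "layout looks fine". Any stray entries are printed so you can see what went wrong.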

Now we’ve got our local file, package.zip, that has everything we need. Now we need to put this in Lambda!

Include the package within your Lambda Function/Layer and modify the function’s settings to find it

Firstly, I’d strongly recommend using a Lambda Layer for this. Layers are what they sound like – they’re easily included packages of dependencies that you can reuse with your Lambda functions without having to include everything with each one individually.

So, all we need to do is create a Lambda Layer and upload our newly created zip as its source.
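
If you’d rather not click through the console, the AWS CLI can publish the layer too. A minimal sketch, assuming the CLI is installed and configured with credentials; the function name publish_layer and the layer name in the usage line are made up for illustration.

```shell
# Publish a zip file as a new version of a Lambda Layer.
#   $1: layer name (any name you like)
#   $2: path to the zip created earlier
publish_layer() {
  aws lambda publish-layer-version \
    --layer-name "$1" \
    --zip-file "fileb://$2"
}
```

Usage would be something like: publish_layer poppler-utils package.zip. The response should include the ARN of the new layer version, which you then attach to your function.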

The mild headache in Lambda is knowing what the default environment variables are for paths and what those paths look like in the execution environment. If you upload a package into a Lambda layer, what paths do they convert to?

The root of your function itself runs in /var/task

Layers are extracted into /opt

So if your Lambda Layer zip contains the folders bin and lib at its top level, they’ll appear as /opt/bin and /opt/lib in the Lambda execution environment.

If you were to include these same folders in your function itself, rather than a layer, they’d be placed in /var/task/bin and /var/task/lib

This is important to know as the two important environment variables have the following defaults:

PATH: /usr/local/bin:/usr/bin/:/bin:/opt/bin
LD_LIBRARY_PATH: /lib64:/usr/lib64:$LAMBDA_RUNTIME_DIR:$LAMBDA_RUNTIME_DIR/lib:$LAMBDA_TASK_ROOT:$LAMBDA_TASK_ROOT/lib:/opt/lib

So without doing anything else, Lambda automatically picks up the binaries and libraries from our Layer when your function executes. NOTE: if you were to include a bin directory with your main function, it would end up at /var/task/bin, which is not in the PATH variable by default and wouldn’t get picked up!

Finally, it’s important to note that if you want to place libraries or binaries in other paths, you can do so as long as you overwrite the environment variables in the Lambda function itself.
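
As a sketch of that last point: if you did ship bin inside the function package itself, you could override PATH so that /var/task/bin is searched. This assumes the AWS CLI is configured; set_function_path and the function name in the usage line are hypothetical, and the rest of the PATH value is simply the default quoted above. Be aware that update-function-configuration replaces the whole set of environment variables, so any existing ones need to be included as well.

```shell
# Override PATH on a Lambda function so that /var/task/bin is searched
# first, followed by the default entries.
#   $1: name of the Lambda function to update
set_function_path() {
  aws lambda update-function-configuration \
    --function-name "$1" \
    --environment '{"Variables":{"PATH":"/var/task/bin:/usr/local/bin:/usr/bin/:/bin:/opt/bin"}}'
}
```

Usage would be along the lines of: set_function_path my-function. LD_LIBRARY_PATH doesn’t need the same treatment for a lib folder in the function, because $LAMBDA_TASK_ROOT/lib is already in its default value.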

I’ve seen some crazy workarounds implemented for this on the internet: people copying binaries into Lambda’s /tmp directory and then fudging their functions to manually look for the executable there. In general this isn’t necessary if the function is properly configured and you’re not doing something strange like implementing a custom runtime!

About the Author

Pete
Pete is the person that owns this website. This is his face. His opinions are his own except when they're not, at which point you're forced to guess and your perception of what is truly real is diminished that little bit more.