Celery Docker Test

This PROTOTYPE illustrates an example setup for Celery running Talend Open Studio ETL Jobs with Docker. Needless to say, it is badly documented and does not allow for an easy setup whatsoever. It was tested on a cloud server by GWDG.

Setup

The basic setup includes: Redis as the message broker, PostgreSQL as the backend database and minio as an S3-compatible object storage. They are connected by Celery as a task queue.
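A minimal sketch of how the three services could be wired together in a Celery app; host names, credentials and the database name are assumptions, not the values from test_setup/docker-compose.yaml:

```python
from celery import Celery

# Redis as message broker, PostgreSQL (via SQLAlchemy) as result backend.
app = Celery(
    "tos_jobs",
    broker="redis://redis:6379/0",
    backend="db+postgresql://celery:celery@postgres/celery",
)
```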

Celery

Celery runs a farmer-worker model, i.e. a few farmers instruct a lot of workers what to do, and the farmers are responsible for collecting the results. In our example, the farmer would be the ActiveWorkflow agent (i.e. the TOS Agent) that should be enabled to work with Celery. Communication works solely by populating the message broker (Redis) and retrieving results from the backend database (PostgreSQL); there is no direct communication between farmers and workers. This limits the possibilities for exchanging data between farmers and workers: any data exchanged has to be pickleable.
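A hedged sketch of that interaction: the farmer only enqueues a message on Redis and later reads the result from PostgreSQL, it never talks to a worker directly. Task name, arguments and return value here are illustrative assumptions, not this repository's actual task.

```python
from celery import Celery

app = Celery(
    "tos_jobs",
    broker="redis://redis:6379/0",
    backend="db+postgresql://celery:celery@postgres/celery",
)

@app.task
def start_job(job_name, job_version):
    # Worker side: whatever is returned must be pickleable/serialisable,
    # because it travels back through the result backend.
    return {"job": job_name, "version": job_version, "status": "done"}

# Farmer side (e.g. the ActiveWorkflow / TOS Agent):
result = start_job.delay("ExampleJob", "0.1")   # writes a message to Redis
print(result.get(timeout=600))                  # reads the result from PostgreSQL
```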

Redis

Redis is used as the message broker. Apart from setting it up as specified in the example test_setup/docker-compose.yaml, I did not touch Redis once. It just works. There is some documentation on how to check on queues using redis-cli.
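The same check can be done from Python; in the sketch below the queue name "celery" and the Redis host are assumptions based on Celery's defaults (the default queue is a plain Redis list).

```python
import redis

r = redis.Redis(host="redis", port=6379, db=0)
print("pending messages:", r.llen("celery"))  # length of the default Celery queue
```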

PostgreSQL

PostgreSQL is used as a backend database to communicate results and store the states of all running Celery tasks. There are plenty of alternatives to PostgreSQL as a backend, some of which may make more sense in our infrastructure (I am looking at CouchDB).
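Because states live in the backend, any process that knows a task id can look it up later. A small sketch, with the app configuration from above and a made-up task id:

```python
from celery import Celery
from celery.result import AsyncResult

app = Celery(
    "tos_jobs",
    broker="redis://redis:6379/0",
    backend="db+postgresql://celery:celery@postgres/celery",
)

res = AsyncResult("9c3f2d2e-0000-0000-0000-000000000000", app=app)
print(res.state)    # PENDING, STARTED, SUCCESS, FAILURE, ...
print(res.result)   # return value (or exception) stored in the backend
```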

minio

minio is not necessary to run Celery at all. I chose it as an S3-compatible store to hold all TOS Jobs. minio has a very basic web interface that allows uploading TOS Jobs from the browser. Or, maybe more importantly, it allows uploading Jobs via the S3 API, e.g. from a CI/CD pipeline. Finally, minio can be operated as a single Docker Container, which makes it very easy to use. I use only the bare minimum of its configuration options; I assume that more advanced configurations are part of their enterprise program. minio may be able to fill the gap between using CDSTAR as a TOS Job storage and the upcoming GWDG S3 Data Lake.
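For the CI/CD case, a hedged sketch of uploading a TOS Job zip-file to minio via the S3 API; endpoint, credentials, bucket and file name are assumptions:

```python
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://minio:9000",
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
)
# upload_file(local path, bucket, object key)
s3.upload_file("ExampleJob_0.1.zip", "tos-jobs", "ExampleJob_0.1.zip")
```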

Technical Stuff

Internally, most of the logic is implemented in task.py, in the start_job task to be exact. This task takes the name and version of a Job, downloads it from minio, extracts the zip-file to a cache folder, starts an openjdk:8 Docker Container with the Job bound as a volume into the Container, and thus runs the Job. STDOUT and STDERR are returned directly to the backend. Moreover, an aw_message_filepath is read and returned. To get more technical: the Docker Container is run by mounting the Docker unix socket into the Container the worker runs in. This is a significant security risk, as it basically grants root privileges on any server this Container runs on. Use with caution!!
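A rough sketch of what start_job does, not the actual implementation in task.py; paths, bucket, credentials and the entry script name are assumptions:

```python
import tempfile
import zipfile

import boto3
import docker

def start_job_sketch(name, version):
    # Download the Job zip-file from minio into a cache folder.
    s3 = boto3.client(
        "s3",
        endpoint_url="http://minio:9000",
        aws_access_key_id="minioadmin",
        aws_secret_access_key="minioadmin",
    )
    cache_dir = tempfile.mkdtemp(prefix=f"{name}_{version}_")
    archive = f"{cache_dir}/{name}_{version}.zip"
    s3.download_file("tos-jobs", f"{name}_{version}.zip", archive)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(cache_dir)

    # Talk to the host Docker daemon through the mounted unix socket and run
    # the Job in an openjdk:8 Container with the cache folder bound as a volume.
    client = docker.from_env()
    logs = client.containers.run(
        "openjdk:8",
        command=f"sh /job/{name}_run.sh",  # assumed entry script inside the Job
        volumes={cache_dir: {"bind": "/job", "mode": "rw"}},
        remove=True,
        stderr=True,
    )
    return logs.decode()  # STDOUT/STDERR, handed back via the result backend
```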

Usage

Use test_setup/docker-compose.yaml to set up all necessary services quickly. In the test setup, I used two containers as workers. Upload your Job by accessing the server minio runs on at port 9000 and uploading your TOS Job zip-file via the web UI. You can then test everything by editing run_cli.sh with your server URI (or localhost to run locally) and experimenting from there. You will be put into an interactive Docker Container.
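From inside that interactive Container you can queue a Job by name and version; the task name "task.start_job" and its arguments are assumptions about this repository's task.py.

```python
from celery import Celery

app = Celery(
    "tos_jobs",
    broker="redis://redis:6379/0",
    backend="db+postgresql://celery:celery@postgres/celery",
)

result = app.send_task("task.start_job", args=["ExampleJob", "0.1"])
print(result.get(timeout=600))  # STDOUT/STDERR of the TOS Job
```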