Architecture options to run a workflow engine

This week a customer called and asked (translated into my own words and shortened):html

「We do composite services, orchestrating two or three CRUD-Services to do something more useful. Our architects want to use your workflow engine for this because the orchestration flow might be long running. Is this a valid scenario for workflow? Currently we run one big central cluster for the workflow engine — won’t this get a mess?」

These are valid questions which recently we get asked a lot, especially in the context of microservices, modern SOA initiatives or domain-driven design.java

Modern workflow engines are incredibly flexible. In this blog post I will look at possible architectures using them. To illustrate these architecture I use the open source products my company provides(Camunda and Zeebe.io) as I know them best and saw them 「in the wild「 at hundreds if not thousands of companies. But these thoughts are applicable to comparable tools on the market as well.node

Aspects to consider

Central or de-central workflow engine

For over a decade, workflow engines (or BPM tools) were positioned as being something central. Very often this was the only way of running complex beasts that required challenging setups. Most vendors even cemented this approach into their licensing model which meant, not only was it horribly expensive but also tied to CPU cores.git

This is not true any more with modern tools which are very flexible.github

In Avoiding the 「BPM monolith」 when using bounded contexts I show that in an architecture which cleanly separated responsibilities you should run multiple engines. And all modern trends like DDD or Microservices call for de-central components to allow more agility and autonomy which is about getting changes into production as fast as possible which will be the key capabilities of successful enterprises in future.web

So in the above scenario I would run an own workflow engine for every service that needs long running behavior.spring

Advantages:sql

  • Autonomy: Every team that builds a service can decide on the best solution themselves. They can decide on the concrete tool to use (even if most customers limit the diversity to avoid complexity). They can update the engines version whenever they like. They can redeploy things or plan outages without the need of any coordination with other teams.
  • Isolation/Scalability: Every service has a dedicated engine. This can be scaled depending on the concrete requirements. Workflows from other teams cannot harm your performance or stability.
  • Isolation/Security: Team A cannot read or mess with data from Team B. If you share one engine you always meet in the data store. In Camunda we trust JDBC connections or Java clients therefore it’s hard to avoid any friendly attacks or accidental changes by operators.
  • Isolation/DevOps: The local operating tool (Cockpit when using Camunda) is 100% focused on the workflows the DevOps team is really responsible for. No other data is shown so people don’t get confused or distracted. This also makes it easier to give the team full access to Cockpit allowing for a true DevOps mindset which is essential to start a good continuous improvement cycle.

But of course there are also drawbacks:docker

  • Operating the engine: You have to install, run and maintain multiple engines. Depending on the concrete deployment scenario (see below) this can be cumbersome but also every simple. This problem can be addressed by automated deployment pipelines.
  • Operating the workflows: You will have multiple operating tools showing only the local processes. If you have end-to-end processes stretch across the boundaries of multiple services you don’t get a central view on all of them for free.

To get a proper overview our customers typically setup monitoring solutions collecting events from all engines. The central monitoring doesn’t show too many details but provides links to the right operating tool. You could e.g. use the Elastic stack for this. And simple workflow engine plugins can generically push events within a workflow to the central tool. This is comparable to what Camunda Optimize does, it collects data from many engines and makes them available for central analyzing (but with the goal of business analysis and improvement, not technical operations).apache

However, there is one interesting hybrid possible, at least in the Camunda world: Running multiple engines on top of a shared database.

This is actually what a reasonable amount of customers do and it’s not only a valid approach but often a good compromise, especially because Camunda has two features supporting it:

  • Deployment aware engines: Every engine knows the workflow definitions owned by the current service and only touches them (just make sure you configured your engine to act deployment aware).
  • Rolling updates: Camunda guarantees that two subsequent versions can run on the same database. Assume you run 7.7 and want to upgrade the engine to 7.8. You can first update the database to 7.8, services can keep running on their engine in 7.7! They can now eventually update as soon as they like. You are just blocked to migrate to 7.9 before all services have migrated to 7.8.

Orchestration in service or infrastructure

Some customers see the workflow engine as an infrastructure component, living on its own and not being part of a specific service.

This is very much the view of a BPM or ESB-like component of the first wave of SOA projects, it is a central engine as described above. We see a lot of customers in this scenario use workflows as rather small integration flows, replacing an ESB. This is technically feasible but it poses questions around the ownership and life-cycle of the workflow models. I dedicate a section especially to this very important aspect. It is a valid approach if you sort out these questions even if a lot of customers turn away from this view because of bad experiences back in the old SOA days.

Orchestrating serverless functions

A very related use case in modern architectures is the orchestration of stateless and rather small functions. Using these functions (e.g. Azure Functions, AWS Lambda or similar) you need a way to implement end-to-end business capabilities which often stretch across multiple functions. This is a perfect fit for a workflow engine. I published details of a recent customer project in the Azure Cloud in Orchestrating Azure Functions using BPMN and Camunda — a case study, this is how the high-level architecture looks like:

Ownership and life-cycle of workflow models

Every workflow model needs a clear owner which can decide not only for changes, but is also responsible that the workflow runs smoothly. In Microservices architectures (or similar) this ownership is typically given to teams building the service. In this scenario ownership for services is baked into the organization already. This makes it quite natural to make workflow models part of the service. If you have workflow models as own artifacts — as e.g. in the serverless example above — it is a bit harder but not less important to define a proper ownership.

Additionally you have to clarify for every workflow model when and how it gets deployed. Very often workflow models have a strong relationship to a certain domain meaning that you have to deploy them in-sync with the corresponding service (or function). This is easier to do if they are logically part of the service. If it’s a separated artifact with a separated owner it means you have to coordinate between teams which slows you down. And if it’s not really separated, why it’s not part of your service?

Once again there is not only a black and white approach. A hybrid solution is to run a centralized engine but to distribute ownership and responsibility for deployment of the workflow models to the services.

This is a common scenario for non-Java environments using Camunda. It mitigates the biggest problems arising from a central engine or seeing workflow engines as infrastructure. It is a valid approach. If you are developing Java services, it might be a bigger overhead than running the engine within your service itself. But to understand this let’s turn our attention on how to run a workflow engine at all.

Engine as separate program, possibly remote

Typically a workflow engine runs as a separate application allowing you to connect remotely, e.g. via REST in Camunda or Message Pack in Zeebe. This allows you to run the engine on a different host but of course it can also run just beside the service itself.

A code example can be found here: Use Camunda without touching Java and get an easy-to-use REST-based orchestration and workflow engine.

Forces:

  • Good separation of concerns as the workflow engine lives in an isolated process and can be controlled independent of your own code. This makes it easier to hunt down failures and isolate components.
  • You do not have ACID transactions in this model as the remote channel is not transactional. So you have to think about retrying, idempotency or duplicate calls. I personally do not see this as a disadvantage. We rapidly move towards distributed systems as the new normal and in this world we have to accept that consistency is an illusion (a good read is Pet Halland: Life beyond distributed transactions).
  • Compared to an embedded engine (see below) you introduce remote communication. This increases complexity especially for unit tests as you have to make sure to run a workflow engine in a defined state only for a certain test case.

By the way, remote communication doesn’t have to feel bad for the developer. For example, Zeebe provides native client libraries (currently for Java and GoLang but more support is planned) which allow to use it like an embedded engine in terms of coding but keep the separation of concerns.

In love with Java? Embeddable workflow engines

Camunda is an example of an embeddable Java engine. That means the engine can also run as part of your Java application. In this case you can simply communicate with the engine by Java API and the engine can call Java code whenever it needs to transform data or call services. The following code snippet will work in any Java program and defines a workflow model, deploys it and starts instances (it is really that simple to get started!). You just need Camunda and a H2 database on the classpath:

 

This scenario is the de-facto standard for JUnit tests. I am very proud of the Camunda test support (also providing assertions, scenario tests and visualizations of test runs).

Spring Boot Starter

But running an embedded engine is also what you do if you use the Camunda Spring Boot Starter which is a nice way of implementing the one workflow engine per service approach described above.

A Camunda Spring Boot code example can be found here: https://github.com/berndruecker/camunda-spring-boot-amqp-microservice-cloud-example/

Container-managed engine

Within Camunda we provide something unique: A container-managed engine running as part of Tomcat or application servers like WildFly or WebSphere. Having this available, you can deploy normal WAR artifacts which can contain workflow models which are automatically deployed to the container-managed engine. This allows for a clean distinction between infrastructure (container) and domain logic (WAR).

A code example can be found here: https://github.com/camunda-consulting/camunda-showcase-insurance-application.

In the past, it was often advocated to deploy multiple applications (WAR files) onto a single app server. We also support this model in Camunda. However, we do not see this being applied often any more as it breaks isolation between deployments. If one application goes nuts it can very well affect other applications on this server. And with technologies like Docker it’s super easy to run multiple app servers instead so I also favor the one app per application server approach.

Spring Boot or container-managed engine?

Deciding between Spring Boot or an application server is hard as it‘s still a very religious decision. Conceptually I like separating the infrastructure into an application server. And if you are running Docker, the WildFly approach might even allow for faster turnaround times (see below). But it’s also true that the environment around Spring Boot is quite active, pragmatic, stable, innovative and often ahead of Java EE (sorry, EE4J of course). My advice would be to base your decision on what works best in your company, either way will be fine. Keep your team and the target runtime environment in mind when doing this selection.

In general, we do not recommend running an embedded Camunda engine and configuring it on your own (e.g. via plain Java or Spring). There are some things you have to do right (e.g. thread pooling, transactions, id generation) especially to guarantee proper behavior under load. So better use the Spring Boot Starter or the container-managed engine.

Pros and cons of the embeddable engine approach

Running the engine in the previously mentioned ways is optimal in a Java world as it provides many advantages including:

  • easy to use (Java API),
  • easy to write proper unit tests,
  • clear deployment path for workflow models (as part of the normal application deployment),
  • transactional integration,

But of course it has a major drawback:

  • It is nailed to Java. If you do not develop your services in Java, this might not be for you.

Our current rule of thumb for Camunda customers is:

  • If you do Java then use either Spring Boot (embedded engine) or any container-managed engine (Tomcat, WildFly, Websphere, …).
  • If you don’t do Java, run a remote engine and talk to it via REST.

For Camunda there is a funny recursion: When running the workflow engine as a separate application, the Camunda engine itself has to be started up using one of the options already stated above (e.g. container-managed or using Spring Boot). The typical way is to run a container-managed engine on Tomcat, maybe using Docker.

Communication pattern

Assume you have a payment workflow that needs to communicate with a remote service to charge a credit card:

There are two basic patterns to implement this communication:

  • Push: The workflow actively calls that service. It could be a Java invocation, a REST or SOAP Web-Service call but also sending a message to a queue is kind of an active push from the perspective of the workflow engine.
  • Pull: A service or some intermediary connector asks the workflow engine for work. This reverses the communication direction. In Camunda this is supported by so called External Tasks.

Pulling work has a couple of advantages:

  • The workflow engine does not have to know concrete endpoints (URL, queue names, etc).
  • The called service defines the scaling itself depending on how many tasks it consumes and how fast. This might depend on a various factors like the current load, which implements so-called back-pressure.
  • The called service can be implemented in any language and ask via REST API (or Message Pack in Zeebe).
  • The core workflow engine can concentrate on the coordination of the workflow and does not have to handle integration logic and protocols.
  • Transaction boundaries are enforced in a way which is natural for distributed systems thereby making the whole system more flexible to adapt to future technologies.

Of course it has drawbacks:

We regularly have customers which want to have true temporal decoupling in a way that service unavailability doesn’t cascade. They don’t want to do synchronous calls which needs to be retried when a service is unavailable. But a lot of them shy away from introducing messaging systems as they are an additional component in the stack which are hard to operate. As an alternative they often use the pull model to implement asynchronism and temporal decoupling. This works very well.

Overall we cannot give a general recommendation for the best communication style. It depends on too many factors. If you’re in doubt, we recently tend to favor the pull approach. But in the Java space most customers use the push approach as it is quite natural and simple to setup — and works very well for most use cases.

Running workflow engines in the cloud, on docker, cloud foundry or others

Embedded engines are part of your Java application and can basically run everywhere. Spring Boot packages a so-called uber-jar containing all dependencies. Some people also call this fat-jar as it typically grows around the same size as containers because it contains all dependencies they need (including the webserver, transaction manager, etc). Either way, you can run it on Docker, in Java build packs from CloudFoundry or Heroku or any other environment you like. A code example for Spring Boot and Cloud Foundry can be found here: https://github.com/berndruecker/camunda-spring-boot-amqp-microservice-cloud-example/

With app servers like Tomcat or WildFly you will typically run Docker nowadays. There is one interesting thought on this: Docker works in layers, and new docker images just add new layers on top of existing ones.

So you could define a base image for workflow applications and then just add WAR files to this. The WAR files are typically tiny (some kilobytes) allowing for very quick turnaround. Whether this advantage affects you depends on the development workflow in your company.

There are Docker images available of Camunda and Zeebe and we provide several examples on how to build custom Docker images. These Docker images can run in basically all cloud environments providing container services.

Summary

Gee, that was a long blog post. Sorry. I couldn’t stop writing. But I know that this is relevant information for a lot of people who are going to introduce workflow engines in their architecture — or to fix bad decisions from the past.

Modern tools allow for greater flexibility. And this flexibility comes with the burden that you have to make decisions. This post lists the most important aspects to consider and should help you make a well informed decision. Feel free to approach me or the Camunda team if you need help in this endeavor. It is really important to think this through as some basic decisions have a huge impact on your architectures future.

Bonus: Greenfield stack for Java folks

For everybody who thinks: 「I have no clue what this guy is writing about, I just want to get going」 we defined a so-called greenfield stack in the Camunda Best Practices (which are reserved for our Enterprise Customers, sorry!). It is not superior to any other stack, it is just the one we recommend if you don’t care. And because you read this till the end, you get it free as a bonus ;-)

We suggest as runtime environment:

  • Latest version of Camunda, use the Enterprise Edition (with patches),
  • Deployed on WildFly (or JBoss EAP if you want to have a supported Enterprise Edition),
  • using Java 8 (latest Sun JDK) ,
  • as Container Managed Engine,
  • using PostgreSQL — or the database you already operate.
  • For High Availability you normally have at least two application servers running, not necessarily forming a cluster but point to the same database.

For development:

As runtime environment on the local developer machine we recommend:

  • Same version of Camunda, WildFly and Java (as above)
  • H2 database with file back-end as this easily allows every developer to have its own database without the necessity of installing a database. H2 is already contained and configured in our our pre-packaged WildFly distribution, which works perfectly for the use case of a 「local developer machine」.

We strongly discouragethat multiple developers share the same database during development as this can lead to a multitude of problems.

相關文章
相關標籤/搜索