Apache NiFi and Philter


This article describes how Philter can be used with Apache NiFi to filter sensitive information such as PII and PHI within an Apache NiFi data flow.

Philter is available on the AWS, Azure, and Google Cloud marketplaces. You can also run Philter in containers. So fire up an instance of Philter and let's get started using it alongside your Apache NiFi data flow.

Integrating Philter with Apache NiFi

To integrate Philter with Apache NiFi we will make use of Philter's API to filter sensitive information from text. The Apache NiFi flow will send text to Philter and Philter will return the filtered text. We will use Apache NiFi's InvokeHTTP processor for making the API call to Philter.
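
As a rough sketch of what the InvokeHTTP processor does for each piece of text, the Python below builds an equivalent HTTP request. The hostname philter.example.com, port 8080, the /api/filter endpoint path, and the text/plain content type are assumptions for illustration and are not taken from this article, so check them against the API documentation for your Philter version.

```python
import urllib.request

# Assumed values for illustration only; adjust to your Philter deployment.
PHILTER_BASE_URL = "https://philter.example.com:8080"  # hypothetical host and port
FILTER_PATH = "/api/filter"  # assumed endpoint path


def build_filter_request(text: str, base_url: str = PHILTER_BASE_URL) -> urllib.request.Request:
    """Build a POST request that carries the raw text to be filtered."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + FILTER_PATH,
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


def filter_text(text: str, base_url: str = PHILTER_BASE_URL) -> str:
    """Send the text to Philter and return the response body as the filtered text."""
    request = build_filter_request(text, base_url)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")
```

Keeping the request construction separate from the network call makes it easy to inspect the URL, method, and headers before pointing the sketch at a real Philter instance.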

We are using Apache Kafka to manage the incoming and outgoing streaming text but this is not required. You could modify the Apache NiFi data flow to interact with Philter directly and omit Apache Kafka from the flow.

Figure: Apache NiFi Flow (an illustration of our data flow)

The Apache NiFi flow:

The text to be filtered has been previously published to an Apache Kafka cluster. The ConsumeKafka Apache NiFi processor consumes the text from the Kafka brokers and brings it into the data flow.

An InvokeHTTP processor sends the text consumed from the Kafka brokers to Philter via Philter's API. Philter responds with the filtered text which is then published onto a separate Kafka topic via the PutKafka processor.

When complete, we have two topics on Kafka: the first topic contains the unfiltered text and the second topic contains the filtered text.
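
The flow described above can be approximated in a few lines of Python. This is only a sketch of the logic and not how NiFi implements it: the Kafka topics are stood in for by in-memory lists, the topic names are hypothetical, and fake_philter is a placeholder for the real API call whose redaction format is invented for the example.

```python
from typing import Callable, Dict, List


def run_flow(
    topics: Dict[str, List[str]],
    source_topic: str,
    filtered_topic: str,
    filter_text: Callable[[str], str],
) -> None:
    """Mirror the three-processor flow: consume, filter, publish."""
    # Copy the source list so publishing never affects what is being consumed.
    for message in list(topics[source_topic]):  # ConsumeKafka: read unfiltered text
        filtered = filter_text(message)  # InvokeHTTP: call Philter's API
        topics.setdefault(filtered_topic, []).append(filtered)  # PutKafka: publish


def fake_philter(text: str) -> str:
    """Placeholder for Philter; the replacement token is made up for this sketch."""
    return text.replace("123-45-6789", "[REDACTED]")


# Hypothetical topic names; the source topic is left untouched by the flow.
topics = {"unfiltered-text": ["His SSN is 123-45-6789."]}
run_flow(topics, "unfiltered-text", "filtered-text", fake_philter)
```

After the run, the source topic still holds the original text and the second topic holds the filtered copy, which matches the two-topic outcome described above.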

If we already had a pipeline using Apache Kafka and Apache NiFi, this configuration would let us insert Philter into the pipeline with minimal changes. The downstream process would only need to read from the Apache Kafka topic containing the filtered text instead of the original topic. The configuration presented here is a powerful way to add the removal of sensitive information to an existing pipeline.

This flow uses only processors that are included with the standard Apache NiFi distribution, so it does not depend on any additional or custom processors being installed.

Processor Configurations

ConsumeKafka

InvokeHTTP

PutKafka

Considerations

We are using a single instance of Philter in this article. For a production environment, a cluster of Philter instances deployed behind a load balancer would provide improved performance. The only change to the Apache NiFi flow configuration would be to change the InvokeHTTP processor's Remote URL to point to the load balancer instead of an individual Philter hostname or IP address.