Using Custom NLP Models

Describes how to train and use a custom NLP model with Philter.

Using your own custom model gives you full control over how Philter identifies named-entities in text while still letting you take advantage of Philter's other features, such as filter profiles, redaction, and realistic value replacement.

Using a custom NLP model is most often not necessary when using Philter. Before proceeding, we highly recommend you ensure that a custom NLP model will provide benefits that outweigh the effort required to train it.

Important: Training your own NLP model for Philter may require extensive knowledge of areas such as machine learning, neural networks, and distributed computing.

Philter identifies named-entities in text through the use of a trained model. The model can identify entities, such as persons' names, that do not follow a well-defined pattern and are not easily referenced in a dictionary. Philter's NLP model is interchangeable, and we offer multiple models you can choose from to better tailor Philter to your use-case and your domain.

However, there are times when our models may not be sufficient, such as when your use-case does not closely match any of our available models, or when you want to improve performance by training a model on text very similar to your input text. In those cases you can train a custom NLP model for use with Philter.

Custom NLP Models

Training a Custom NLP Model

Philter is indifferent to the technologies and methods you choose to train your custom model. You can use any framework you like, such as Apache OpenNLP, spaCy, Stanford CoreNLP, or your own custom framework. Follow the framework's documentation for training a model.
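
As an illustration, the sketch below trains a small NER model with spaCy, one of the frameworks mentioned above. The single training sentence, the person label, and the output path ./custom-ner-model are hypothetical placeholders; a real model requires a large set of annotated training examples.

import random

import spacy
from spacy.training import Example

# A real training set would contain many thousands of annotated examples.
# Offsets are zero-based character positions with an exclusive end.
TRAIN_DATA = [
    ("George Washington was president.", {"entities": [(0, 17, "person")]}),
]

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
ner.add_label("person")

optimizer = nlp.initialize()
for epoch in range(30):
    random.shuffle(TRAIN_DATA)
    for text, annotations in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], sgd=optimizer)

# Save the trained model so the HTTP service can load it later.
nlp.to_disk("./custom-ner-model")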

Using Your Model

Once your model has been trained and you are satisfied with its performance, you must expose the model to Philter by implementing a simple HTTP service interface around it. This service facilitates communication between Philter and your model. The interface has two methods, described at a high level below and in detail later on this page.

Method   Endpoint    Description
POST     /process    Processes the text and returns the identified named-entities.
GET      /status     Gets the status of the model service, e.g. is the model still loading, or is it ready for inference?

Once your model is available behind the HTTP interface described above, you are ready to use it with Philter. On the Philter virtual machine, set the PHILTER_NER_ENDPOINT environment variable to the location of the running HTTP service. We recommend setting this environment variable in /etc/environment. If your HTTP service is running on the same host as Philter on port 8888, the environment variable would be set as:

export PHILTER_NER_ENDPOINT=http://localhost:8888/

Now restart the Philter service, then stop and disable the philter-ner service:

sudo systemctl restart philter.service
sudo systemctl stop philter-ner.service
sudo systemctl disable philter-ner.service

When a filter profile containing an NER filter is applied, Philter makes requests to your HTTP service, which invokes your model's inference and returns the identified named-entities.

Philter NER HTTP Service Interface

Your NLP model must be exposed by a service implementing the following HTTP API. The base URL https://localhost:8888 is only an example. Your service can run anywhere (on the same host as Philter, on a different host, etc.) and on any port, as long as it is accessible from Philter.

An HTTP service makes the NLP model accessible to Philter.

POST https://localhost:8888/process

Extract named-entities from input text.

This method receives the text to process and returns the named-entities identified by the model. The method requires no parameters; the text to process should be sent in the body of the request. When text is received, the service should invoke the model to identify named-entities. Any preprocessing of the text should also be performed here, because Philter does no preprocessing of the text prior to invoking this endpoint.

Request

Body (required, string): The text to process.

Response (200 OK)

The response is a list of identified entities. The text is the named-entity, the tag is the type of named-entity, the score is a value from 0.0 to 1.0 representing the model's confidence, and start and end are the zero-based character offsets of the named-entity in the input text. The model's confidence score is important because filter profiles can use it in filter conditions on entities. Most NLP libraries expose a confidence value that can be used here. The value of the tag can be any string you choose that identifies the type of the entity. For example, if you are training your model for persons' names you may use person, and if you are training your model for street addresses you may use address. Given the input text George Washington was president. the response would be:
[
  {
    "text": "George Washington",
    "tag": "person",
    "score": 0.97,
    "start": 0,
    "end": 17
  }
]
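
Once your service is running, you can sanity-check the /process endpoint with a short client script like the one below. It assumes the service is listening locally on port 8888, as in the configuration example above; the exact output depends on your model.

import requests

# Send raw text in the request body, as Philter does.
response = requests.post("http://localhost:8888/process",
                         data="George Washington was president.")

print(response.status_code)
print(response.json())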

GET https://localhost:8888/status

Get the status of the model service.

Gets the status of the model inference service. Philter uses this endpoint to determine whether the service is ready to process requests.

Response (200 OK)

When the service is ready to receive and process requests, this endpoint should return the text healthy. Anything other than an HTTP 200 response is considered unhealthy and could indicate the service is still loading or has encountered an error.

healthy
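
To make the interface concrete, below is a minimal sketch of a service implementing both endpoints. Flask, the model path ./custom-ner-model, and the fixed confidence score are illustrative assumptions, not requirements; any stack that implements the two endpoints above will work. Note that spaCy's default NER pipeline does not expose per-entity confidence scores, so this sketch returns a placeholder value.

from flask import Flask, request, jsonify
import spacy

app = Flask(__name__)

# Load the trained model once at startup. The path is a placeholder.
nlp = spacy.load("./custom-ner-model")

@app.route("/process", methods=["POST"])
def process():
    # Philter sends the raw text to process in the request body.
    text = request.get_data(as_text=True)
    doc = nlp(text)
    # spaCy's default NER pipeline does not expose per-entity confidence,
    # so a fixed placeholder score is returned here.
    entities = [{"text": ent.text,
                 "tag": ent.label_,
                 "score": 1.0,
                 "start": ent.start_char,
                 "end": ent.end_char} for ent in doc.ents]
    return jsonify(entities)

@app.route("/status", methods=["GET"])
def status():
    # Return the text healthy with HTTP 200 once the model is loaded.
    return "healthy", 200

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8888)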

Recommendations and Best Practices

You have complete freedom to train your custom NLP model using whatever tools and processes you choose. However, in our experience there are a few things that can help you be successful.

The first recommendation is to package your service in a Docker container. Doing so gives you a self-contained image that can be deployed and run virtually anywhere. It simplifies dependency management and insulates you from dependency version changes.

The second recommendation is to make your HTTP service as lightweight as possible. Avoid any unnecessary code or features that could negatively impact the speed of your model inference.

Lastly, thoroughly evaluate your model before deploying it with Philter so that you have a realistic expectation of its performance.