Splitting Input Text

Describes how to split long input text for increased performance.

Philter can split input text and process each piece individually. Splitting is configured per filter profile. It generally improves performance and allows Philter to handle long input text. Splitting is disabled by default.

An example split configuration in a filter profile is shown below:

{
  "name": "default",
  "crypto": null,
  "identifiers": {},
  "config": {
    "splitting": {
      "enabled": true,
      "threshold": 10000,
      "method": "newline"
    }
  }
}

In this example filter profile, splitting is enabled for inputs greater than or equal to 10,000 characters in length.

The text is split using the newline method, which splits the text at the locations of newline characters in the input text. Additional methods of text splitting may be added in future versions.

Because the newline method splits text based on the locations of new line characters in the text, the text contained in the reassembled filter responses may not be an exact match of the input text. This is due to white space and other characters that may reside near the new line characters that get omitted during processing.
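The effect described above can be illustrated with a minimal Python sketch. It is not Philter's implementation: the function name split_on_newlines is hypothetical, and trimming each piece is an assumption made only to show how splitting at newline characters can cause the reassembled text to differ from the input.

```python
def split_on_newlines(text, threshold=10000):
    """Hypothetical illustration only, not Philter's actual code."""
    # Inputs shorter than the threshold are processed as a single piece.
    if len(text) < threshold:
        return [text]
    # Split at newline characters. Trimming each piece and dropping empty
    # pieces is an assumption that mimics the loss of nearby white space.
    return [piece.strip() for piece in text.split("\n") if piece.strip()]


original = "Patient: John Smith  \n\n  Admitted: 2020-01-01\n"
pieces = split_on_newlines(original, threshold=10)
reassembled = "\n".join(pieces)

print(pieces)
print(reassembled == original)  # False: white space around the newlines was lost
```

If an exact round trip of the input matters, this is the kind of difference to check for when comparing the filtered output with the original text.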

Text Splitting Filter Profile Properties

Property   Description                            Allowed Values      Default Value
---------  -------------------------------------  ------------------  -------------
enabled    Whether or not input texts are split.  true or false       false
threshold  When to split the input text.          Any integer value.  10000
method     How to split the text.                 newline             newline
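As a rough sketch of how these properties and their default values fit together, the Python below reads the splitting section of a filter profile and fills in the defaults listed above. The helper name splitting_settings is hypothetical and is not part of Philter. The sketch only inspects a profile on the client side and assumes that omitted properties take the documented defaults.

```python
import json

# Default values taken from the property list above.
SPLITTING_DEFAULTS = {"enabled": False, "threshold": 10000, "method": "newline"}


def splitting_settings(profile_json):
    """Return a profile's splitting settings with defaults applied (hypothetical helper)."""
    profile = json.loads(profile_json)
    # "config" or "splitting" may be absent or null, so fall back to {}.
    configured = (profile.get("config") or {}).get("splitting") or {}
    return {**SPLITTING_DEFAULTS, **configured}


profile_json = '{"name": "default", "config": {"splitting": {"enabled": true}}}'
settings = splitting_settings(profile_json)
print(settings)
```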

Alternative to Philter Splitting Text

In some cases it may be best to split your input text client side prior to sending the text to Philter. This gives you full control over how the text will be split and provides more predictable responses from Philter because you know how the text is split.

An example of splitting text into chunks prior to sending the text to Philter is given in the commands below:

# Given a large file called largefile.txt, split it into pieces of 10 KiB (10,240 bytes) each.
$ split -b 10k largefile.txt /tmp/segment
# Now process the pieces. --data-binary sends each file unmodified; --data would strip its newlines.
$ curl -s -X POST -k "https://philter:8080/api/filter?d=document1" --data-binary "@/tmp/segmentaa" -H "Content-type: text/plain" > out1
$ curl -s -X POST -k "https://philter:8080/api/filter?d=document1" --data-binary "@/tmp/segmentab" -H "Content-type: text/plain" > out2
# Now recombine the outputs into a single file.
$ cat out1 out2 > filtered.txt
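
Note that split -b cuts at fixed byte offsets, so a cut can fall in the middle of a word or a multi-byte character. Where that matters, one option is to chunk on line boundaries instead. The Python below is a minimal sketch of that idea. The function name chunk_by_lines and the 10,000 character limit are illustrative assumptions and are not part of Philter.

```python
def chunk_by_lines(text, max_chars=10000):
    """Group whole lines into chunks of at most max_chars characters.

    A single line longer than max_chars becomes its own oversized chunk.
    """
    chunks, current = [], ""
    # keepends=True keeps each line ending, so nothing is lost on rejoin.
    for line in text.splitlines(keepends=True):
        if current and len(current) + len(line) > max_chars:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    return chunks


text = "first line\nsecond line\nthird line\n"
chunks = chunk_by_lines(text, max_chars=25)
print(chunks)
# Joining the chunks in order reproduces the input exactly.
assert "".join(chunks) == text
```

Each chunk can then be sent to Philter as its own request, and the filtered responses concatenated in the same order.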