tverskyIndexOf function
Finds the Tversky similarity index between two strings.
Parameters
- source is the variant string
- target is the prototype string
- alpha is the variant coefficient. Default is 0.5
- beta is the prototype coefficient. Default is 0.5
- if ignoreCase is true, character case is ignored.
- if ignoreWhitespace is true, whitespace characters such as spaces, tabs, and newlines are ignored.
- if ignoreNumbers is true, numbers are ignored.
- if alphaNumericOnly is true, only letters and digits are matched.
- ngram is the size of a single item group. Default is 1. If n = 1, each item is considered separately. If n = 2, two consecutive items are grouped together and treated as one.
TIP: Set both ignoreNumbers and alphaNumericOnly to true to ignore everything except letters.
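To make the effect of ngram, alpha, and beta more concrete, here is a minimal, self-contained sketch. The helpers ngramSet and tverskySketch are hypothetical names written for this illustration and are not part of the library. The sketch assumes that n-grams are taken with a sliding window and that the textbook set-based formula applies; the library's splitStringToSet and tverskyIndex may differ in both respects.

```dart
// Hypothetical helpers for illustration only; not the library's own code.

/// Collects every run of [n] consecutive UTF-16 code units in [text].
/// Assumes a sliding window, so 'abcd' with n = 2 yields {ab, bc, cd}.
Set<String> ngramSet(String text, int n) {
  final grams = <String>{};
  for (var i = 0; i + n <= text.length; i++) {
    grams.add(text.substring(i, i + n));
  }
  return grams;
}

/// Textbook Tversky index over two sets: [x] is the variant and
/// [y] is the prototype.
double tverskySketch<T>(
  Set<T> x,
  Set<T> y, {
  double alpha = 0.5,
  double beta = 0.5,
}) {
  final common = x.intersection(y).length;
  final onlyX = x.difference(y).length;
  final onlyY = y.difference(x).length;
  // A zero denominator gives 0 / 0, which is NaN for Dart doubles.
  return common / (common + alpha * onlyX + beta * onlyY);
}

void main() {
  final a = ngramSet('night', 2); // {ni, ig, gh, ht}
  final b = ngramSet('nacht', 2); // {na, ac, ch, ht}

  // One shared bigram (ht): 1 / (1 + 0.5 * 3 + 0.5 * 3)
  print(tverskySketch(a, b)); // 0.25

  // alpha = beta = 1 gives the Jaccard index: 1 / (1 + 3 + 3)
  print(tverskySketch(a, b, alpha: 1.0, beta: 1.0));

  // Two empty sets leave nothing to compare.
  print(tverskySketch(<String>{}, <String>{})); // NaN
}
```

Raising alpha penalizes items found only in the variant, and raising beta penalizes items found only in the prototype, which is what makes the measure asymmetric.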
Details
Tversky index is an asymmetric similarity measure between sets that compares a variant with a prototype. It is a generalization of the Sørensen–Dice coefficient and Jaccard index.
It may return NaN depending on the values of alpha and beta.
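For reference, the standard set-based definition of the Tversky index is shown below, with X as the variant set and Y as the prototype set. Whether the library's tverskyIndex follows this definition term for term is an assumption.

```latex
S(X, Y) = \frac{|X \cap Y|}{|X \cap Y| + \alpha\,|X \setminus Y| + \beta\,|Y \setminus X|}
```

Under this definition, alpha = beta = 0.5 gives the Sørensen–Dice coefficient and alpha = beta = 1 gives the Jaccard index. The denominator becomes zero, and the result 0/0, when both sets are empty or when the sets share no items and alpha = beta = 0.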
See Also: tverskyIndex
Complexity: Time O(n log n) | Space O(n)
Implementation
double tverskyIndexOf(
  String source,
  String target, {
  int ngram = 1,
  double alpha = 0.5,
  double beta = 0.5,
  bool ignoreCase = false,
  bool ignoreWhitespace = false,
  bool ignoreNumbers = false,
  bool alphaNumericOnly = false,
}) {
  // Normalize both inputs with the same cleanup options.
  source = cleanupString(
    source,
    ignoreCase: ignoreCase,
    ignoreWhitespace: ignoreWhitespace,
    ignoreNumbers: ignoreNumbers,
    alphaNumericOnly: alphaNumericOnly,
  );
  target = cleanupString(
    target,
    ignoreCase: ignoreCase,
    ignoreWhitespace: ignoreWhitespace,
    ignoreNumbers: ignoreNumbers,
    alphaNumericOnly: alphaNumericOnly,
  );
  if (ngram < 2) {
    // Compare individual UTF-16 code units.
    return tverskyIndex(
      source.codeUnits,
      target.codeUnits,
      alpha: alpha,
      beta: beta,
    );
  } else {
    // Compare groups of `ngram` consecutive items.
    return tverskyIndex(
      splitStringToSet(source, ngram),
      splitStringToSet(target, ngram),
      alpha: alpha,
      beta: beta,
    );
  }
}