According to Wikipedia, Locality-Sensitive hashing (LSH) is an algorithmic technique that hashes similar input items into the same “buckets” with high probability. The number of buckets is much smaller than the universe of possible input items. Since similar items end up in the same buckets, this technique can be used for data clustering and nearest neighbor search. It differs from conventional hashing techniques in that hash collisions are maximized, not minimized. Alternatively, the technique can be seen as a way to reduce the dimensionality of high-dimensional data; high-dimensional input items can be reduced to low-dimensional versions while preserving relative distances between items. In this problem, we will use a portion of LSH to determine similarities between strings.
Consider that you have two strings and . Assume that the lengths of the two strings are and , respectively. To determine the similarity between these two strings, you need to generate -shingles. A -shingle for a string is any substring of length found within the string. Then, you can represent each string using a set of shingles that appear one or more times in that string. For example, if our string is , then the set of -shingles for the string is . Note that the substring appears twice in the string, but appears only once as a shingle.
After generating the shingles for each string, you need to measure the Jaccard Similarity of and . The Jaccard Similarity is the ratio of the number of shingles that can be found in both strings and the number of shingles found in at least one string.
where is the number of shingles that can be found in both strings, and is the number of shingles that can be found in at least one string.
In this problem, you will be given , , and . You need to determine the Jaccard Similarity of the two strings.
The first line of the input contains two space-separated strings and . The strings contain only lowercase English letters.
The next line contains a single integer .
Print a single number denoting the Jaccard Similarity of the two strings. Absolute errors less than will be ignored.
Input | Output |
---|---|
abcdabd aacdabd 2 | 0.5714 |
For , the set of -shingles is: . So the Jaccard Similarity is: . |
A substring is a contiguous sequence of characters within a string. For example, the list of all substrings of the string "apple" would be "apple", "appl", "pple", "app", "ppl", "ple", "ap", "pp", "pl", "le", "a", "p", "l", "e", "" (note that, there is an empty string at the end).