Locality-Sensitive Hashing I

ishtupeed Criterion 2021 Round 14
Limits 1s, 512 MB

According to Wikipedia, Locality-Sensitive hashing (LSH) is an algorithmic technique that hashes similar input items into the same “buckets” with high probability. The number of buckets is much smaller than the universe of possible input items. Since similar items end up in the same buckets, this technique can be used for data clustering and nearest neighbor search. It differs from conventional hashing techniques in that hash collisions are maximized, not minimized. Alternatively, the technique can be seen as a way to reduce the dimensionality of high-dimensional data; high-dimensional input items can be reduced to low-dimensional versions while preserving relative distances between items. In this problem, we will use a portion of LSH to determine similarities between strings.

Consider that you have two strings SS and TT. Assume that the lengths of the two strings are S|S| and T|T|, respectively. To determine the similarity between these two strings, you need to generate kk-shingles. A kk-shingle for a string is any substring of length kk found within the string. Then, you can represent each string using a set of shingles that appear one or more times in that string. For example, if our string is abcdabdabcdabd, then the set of 22-shingles for the string is {ab,bc,cd,da,bd}\{ab, bc, cd, da, bd\}. Note that the substring abab appears twice in the string, but appears only once as a shingle.

After generating the shingles for each string, you need to measure the Jaccard Similarity of SS and TT. The Jaccard Similarity is the ratio of the number of shingles that can be found in both strings and the number of shingles found in at least one string.
Jaccard Similarity=CbCo\text{Jaccard Similarity} = \frac{C_b}{C_o}
where CbC_b is the number of shingles that can be found in both strings, and CoC_o is the number of shingles that can be found in at least one string.

In this problem, you will be given SS, TT, and kk. You need to determine the Jaccard Similarity of the two strings.


The first line of the input contains two space-separated strings SS and T(1S,T1000)T (1 \leq |S|, |T| \leq 1000). The strings contain only lowercase English letters.

The next line contains a single integer k(1kmin(S,T))k (1 \leq k \leq \min(|S|, |T|)).


Print a single number denoting the Jaccard Similarity of the two strings. Absolute errors less than 10410^{-4} will be ignored.


abcdabd aacdabd

For SS, the set of 22-shingles is: {ab,bc,cd,da,bd}\{ab, bc, cd, da, bd\}.
For TT, the set of 22-shingles is: {aa,ac,cd,da,ab,bd}\{aa, ac, cd, da, ab, bd\}.
The number of shingles that are common in both strings: 44.
The number of shingles that are in at least one string: 77.

So the Jaccard Similarity is: 47=0.5714\frac{4}{7} = 0.5714.


A substring is a contiguous sequence of characters within a string. For example, the list of all substrings of the string "apple" would be "apple", "appl", "pple", "app", "ppl", "ple", "ap", "pp", "pl", "le", "a", "p", "l", "e", "" (note that, there is an empty string at the end).


Login to submit.



86% Solution Ratio
fextivityEarliest, 7M ago
uchiha6125Fastest, 0.0s
TrAdBuLightest, 213 kB
white_monsterShortest, 158B
Toph uses cookies. By continuing you agree to our Cookie Policy.