Practice on Toph

Participate in exhilarating programming contests, solve unique algorithm and data structure challenges and be a part of an awesome community.

Locality-Sensitive Hashing I

By ishtupeed · Limits 1s, 512 MB

According to Wikipedia, Locality-Sensitive hashing (LSH) is an algorithmic technique that hashes similar input items into the same “buckets” with high probability. The number of buckets is much smaller than the universe of possible input items. Since similar items end up in the same buckets, this technique can be used for data clustering and nearest neighbor search. It differs from conventional hashing techniques in that hash collisions are maximized, not minimized. Alternatively, the technique can be seen as a way to reduce the dimensionality of high-dimensional data; high-dimensional input items can be reduced to low-dimensional versions while preserving relative distances between items. In this problem, we will use a portion of LSH to determine similarities between strings.

Consider that you have two strings SS and TT. Assume that the lengths of the two strings are S|S| and T|T|, respectively. To determine the similarity between these two strings, you need to generate kk-shingles. A kk-shingle for a string is any substring of length kk found within the string. Then, you can represent each string using a set of shingles that appear one or more times in that string. For example, if our string is abcdabdabcdabd, then the set of 22-shingles for the string is {ab,bc,cd,da,bd}\{ab, bc, cd, da, bd\}. Note that the substring abab appears twice in the string, but appears only once as a shingle.

After generating the shingles for each string, you need to measure the Jaccard Similarity of SS and TT. The Jaccard Similarity is the ratio of the number of shingles that can be found in both strings and the number of shingles found in at least one string.
Jaccard Similarity=CbCo\text{Jaccard Similarity} = \frac{C_b}{C_o}
where CbC_b is the number of shingles that can be found in both strings, and CoC_o is the number of shingles that can be found in at least one string.

In this problem, you will be given SS, TT, and kk. You need to determine the Jaccard Similarity of the two strings.


The first line of the input contains two space-separated strings SS and T(1S,T1000)T (1 \leq |S|, |T| \leq 1000). The strings contain only lowercase English letters.

The next line contains a single integer k(1kmin(S,T))k (1 \leq k \leq \min(|S|, |T|)).


Print a single number denoting the Jaccard Similarity of the two strings. Absolute errors less than 10410^{-4} will be ignored.


abcdabd aacdabd

For SS, the set of 22-shingles is: {ab,bc,cd,da,bd}\{ab, bc, cd, da, bd\}.
For TT, the set of 22-shingles is: {aa,ac,cd,da,ab,bd}\{aa, ac, cd, da, ab, bd\}.
The number of shingles that are common in both strings: 44.
The number of shingles that are in at least one string: 77.

So the Jaccard Similarity is: 47=0.5714\frac{4}{7} = 0.5714.


A substring is a contiguous sequence of characters within a string. For example, the list of all substrings of the string "apple" would be "apple", "appl", "pple", "app", "ppl", "ple", "ap", "pp", "pl", "le", "a", "p", "l", "e", "" (note that, there is an empty string at the end).



86% Solution Ratio

fextivityEarliest, 3w ago

Farhan20Fastest, 0.0s

TrAdBuLightest, 213 kB

anonyo.akandShortest, 188B


Login to submit


We can create two sets SSS_SSS​ and STS_TST​ by generating kkk-shingles for the strings SSS and TTT,...

Toph uses cookies. By continuing you agree to our Cookie Policy.