Locality-Sensitive Hashing I

ishtupeed

Criterion 2021 Round 14

93/107/215

Statement Editorial Statistics Discussion

HTML

Limits 1s, 512 MB

According to Wikipedia, Locality-Sensitive hashing (LSH) is an algorithmic technique that hashes similar input items into the same “buckets” with high probability. The number of buckets is much smaller than the universe of possible input items. Since similar items end up in the same buckets, this technique can be used for data clustering and nearest neighbor search. It differs from conventional hashing techniques in that hash collisions are maximized, not minimized. Alternatively, the technique can be seen as a way to reduce the dimensionality of high-dimensional data; high-dimensional input items can be reduced to low-dimensional versions while preserving relative distances between items. In this problem, we will use a portion of LSH to determine similarities between strings.

Consider that you have two strings $S$ and $T$ . Assume that the lengths of the two strings are $|S|$ and $|T|$ , respectively. To determine the similarity between these two strings, you need to generate $k$ -shingles. A $k$ -shingle for a string is any substring of length $k$ found within the string. Then, you can represent each string using a set of shingles that appear one or more times in that string. For example, if our string is $abcdabd$ , then the set of $2$ -shingles for the string is $\{ab, bc, cd, da, bd\}$ . Note that the substring $ab$ appears twice in the string, but appears only once as a shingle.

After generating the shingles for each string, you need to measure the Jaccard Similarity of $S$ and $T$ . The Jaccard Similarity is the ratio of the number of shingles that can be found in both strings and the number of shingles found in at least one string.
$\text{Jaccard Similarity} = \frac{C_b}{C_o}$
where $C_b$ is the number of shingles that can be found in both strings, and $C_o$ is the number of shingles that can be found in at least one string.

In this problem, you will be given $S$ , $T$ , and $k$ . You need to determine the Jaccard Similarity of the two strings.

Input

The first line of the input contains two space-separated strings $S$ and $T (1 \leq |S|, |T| \leq 1000)$ . The strings contain only lowercase English letters.

The next line contains a single integer $k (1 \leq k \leq \min(|S|, |T|))$ .

Output

Print a single number denoting the Jaccard Similarity of the two strings. Absolute errors less than $10^{-4}$ will be ignored.

Sample

Input	Output
abcdabd aacdabd 2	0.5714
For $S$ , the set of $2$ -shingles is: $\{ab, bc, cd, da, bd\}$ . For $T$ , the set of $2$ -shingles is: $\{aa, ac, cd, da, ab, bd\}$ . The number of shingles that are common in both strings: $4$ . The number of shingles that are in at least one string: $7$ . So the Jaccard Similarity is: $\frac{4}{7} = 0.5714$ .

Input

Output

abcdabd aacdabd
2

0.5714

For $S$ , the set of $2$ -shingles is: $\{ab, bc, cd, da, bd\}$ .
For $T$ , the set of $2$ -shingles is: $\{aa, ac, cd, da, ab, bd\}$ .
The number of shingles that are common in both strings: $4$ .
The number of shingles that are in at least one string: $7$ .

So the Jaccard Similarity is: $\frac{4}{7} = 0.5714$ .

Notes

A substring is a contiguous sequence of characters within a string. For example, the list of all substrings of the string "apple" would be "apple", "appl", "pple", "app", "ppl", "ple", "ap", "pp", "pl", "le", "a", "p", "l", "e", "" (note that, there is an empty string at the end).

Factors

	CPU	Memory	Source
Bash 5.2	1×	1×	1×
Brainf*ck	1×	1×	1×
C# Mono 6.0	1×	1×	1×
C++17 GCC 13.2	1×	1×	1×
C++20 Clang 16.0	1×	1×	1×
C++20 GCC 13.2	1×	1×	1×
C++23 GCC 13.2	1×	1×	1×
C11 GCC 13.2	1×	1×	1×
C17 GCC 13.2	1×	1×	1×
C23 GCC 13.2	1×	1×	1×
Common Lisp SBCL 2.0	1×	1×	1×
D8 11.8	1×	1×	1×
Erlang 22.3	1×	1×	1×
Free Pascal 3.0	1×	1×	1×
Go 1.22	1×	1×	1×
Grep 3.7	1×	1×	1×
Haskell 8.6	1×	1×	1×
Java 1.8	3×	3×	1×
Kotlin 1.1	1×	1×	1×
Lua 5.4	1×	1×	1×
Node.js 10.16	1×	1×	1×
Perl 5.30	1×	1×	1×
PHP 8.3	1×	1×	1×
PyPy 7.1 (3.6)	3×	3×	1×
Python 3.12	1×	1×	1×
Ruby 3.2	1×	1×	1×
Rust 1.57	1×	1×	1×
Swift 5.3	1×	1×	1×
Whitespace	1×	1×	1×

Adhoc, String

uDebug

Submit

Contributors

MR.Jukerburg11

Statistics

87% Solution Ratio

fextivityEarliest, Dec '21

uchiha6125Fastest, 0.0s

TrAdBuLightest, 213 kB

white_monsterShortest, 158B