Given the following three items, where w2, w4, etc. all represent different words, compare how well the shingle process works in determining which items are near duplicates by looking at shingles composed of 3 words versus shingles composed of 6 words. Use the rolling definition of shingles: for the three-word process, words 1-3 are shingle 1, words 2-4 are shingle 2, words 3-5 are shingle 3, and so on until the last 3 words form the last shingle. To determine the numeric value of a shingle, just concatenate its word numbers. Thus for shingle w1w1w4 the numeric value would be 114, and for shingle w1w1w4w2w2w1 it would be 114221. Use Broder's formula to calculate the resemblance between each item and the other items for the 3-word shingles and the 6-word shingles. Discuss the results and the impact of going to 6-word shingles.

Item 1: w1 w1 w4 w2 w2 w1 w4 w2 w3 w1 w1 w4 w2 w3 w4 w3
Item 2: w1 w4 w2 w4 w1 w1 w4 w2 w2 w1 w2 w3 w3 w2 w2 w4
Item 3: w1 w1 w4 w2 w2 w1 w4 w2 w3 w1 w1 w4 w2 w5 w4 w3

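To compare the effectiveness of the shingle process using 3-word shingles versus 6-word shingles, we need to calculate the resemblance between each pair of items using Broder's formula. The resemblance of items A and B is the number of distinct shingles they share divided by the number of distinct shingles that appear in either one: r(A, B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)|.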

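First, let's list the 3-word shingles for each item. Each item has 16 words, so the rolling window yields 16 - 3 + 1 = 14 shingles per item. The numeric value of a shingle is just its word numbers read off in order (w1w1w4 is 114):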

Item 1: w1w1w4, w1w4w2, w4w2w2, w2w2w1, w2w1w4, w1w4w2, w4w2w3, w2w3w1, w3w1w1, w1w1w4, w1w4w2, w4w2w3, w2w3w4, w3w4w3.
Item 2: w1w4w2, w4w2w4, w2w4w1, w4w1w1, w1w1w4, w1w4w2, w4w2w2, w2w2w1, w2w1w2, w1w2w3, w2w3w3, w3w3w2, w3w2w2, w2w2w4.
Item 3: w1w1w4, w1w4w2, w4w2w2, w2w2w1, w2w1w4, w1w4w2, w4w2w3, w2w3w1, w3w1w1, w1w1w4, w1w4w2, w4w2w5, w2w5w4, w5w4w3.
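
As a check on the lists above, the rolling window can be sketched in a few lines of Python. This is only an illustration, not part of the original exercise: the names numeric_value, shingles, and item1 are introduced here, and the code assumes every word is written as "w" followed by its number.

```python
def numeric_value(shingle):
    """Concatenate the word numbers of a shingle, e.g. ['w1', 'w1', 'w4'] -> 114."""
    return int("".join(word[1:] for word in shingle))


def shingles(text, k):
    """Return the numeric values of all rolling k-word shingles, in order."""
    words = text.split()
    return [numeric_value(words[i:i + k]) for i in range(len(words) - k + 1)]


# Item 1 from the question (variable name is an assumption for this sketch).
item1 = "w1 w1 w4 w2 w2 w1 w4 w2 w3 w1 w1 w4 w2 w3 w4 w3"

three_word = shingles(item1, 3)
print(three_word)            # 14 values, starting 114, 142, 422
print(len(set(three_word)))  # 10 distinct values: 114, 142 and 423 repeat
```

Keeping the list form shows the shingles in document order; converting to a set is what the resemblance calculation needs, because repeated shingles count only once.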

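Now let's list the 6-word shingles for each item. A 16-word item yields 16 - 6 + 1 = 11 of them:

Item 1: w1w1w4w2w2w1, w1w4w2w2w1w4, w4w2w2w1w4w2, w2w2w1w4w2w3, w2w1w4w2w3w1, w1w4w2w3w1w1, w4w2w3w1w1w4, w2w3w1w1w4w2, w3w1w1w4w2w3, w1w1w4w2w3w4, w1w4w2w3w4w3.
Item 2: w1w4w2w4w1w1, w4w2w4w1w1w4, w2w4w1w1w4w2, w4w1w1w4w2w2, w1w1w4w2w2w1, w1w4w2w2w1w2, w4w2w2w1w2w3, w2w2w1w2w3w3, w2w1w2w3w3w2, w1w2w3w3w2w2, w2w3w3w2w2w4.
Item 3: w1w1w4w2w2w1, w1w4w2w2w1w4, w4w2w2w1w4w2, w2w2w1w4w2w3, w2w1w4w2w3w1, w1w4w2w3w1w1, w4w2w3w1w1w4, w2w3w1w1w4w2, w3w1w1w4w2w5, w1w1w4w2w5w4, w1w4w2w5w4w3.

Now, let's calculate the resemblance using Broder's formula for the 3-word shingle process. The formula works on sets, so a shingle that repeats within an item counts once. Item 1 has 10 distinct 3-word shingles, Item 2 has 13, and Item 3 has 11.

Resemblance between Item 1 and Item 2: 4 / 19 = 0.2105 (shared: 114, 142, 422, 221)
Resemblance between Item 1 and Item 3: 8 / 13 = 0.6154 (shared: 114, 142, 422, 221, 214, 423, 231, 311)
Resemblance between Item 2 and Item 3: 4 / 20 = 0.2000 (shared: 114, 142, 422, 221)

Now, let's calculate the resemblance using Broder's formula for the 6-word shingle process. All 11 shingles in each item are distinct.

Resemblance between Item 1 and Item 2: 1 / 21 = 0.0476 (shared: 114221)
Resemblance between Item 1 and Item 3: 8 / 14 = 0.5714 (shared: the first 8 shingles of both items)
Resemblance between Item 2 and Item 3: 1 / 21 = 0.0476 (shared: 114221)

From the results, we can see that both shingle lengths identify Items 1 and 3 as the near-duplicate pair, which is what we expect: the two items differ in a single word (w3 versus w5 in position 14). Item 2 is not a near duplicate of either one.

The 6-word shingle process gives lower resemblances than the 3-word process for every pair, but the drop is very uneven. For Items 1 and 3 it is small (0.6154 to 0.5714), while for the pairs involving Item 2 it is large (about 0.21 and 0.20 down to 0.0476). With short shingles and a vocabulary of only five words, unrelated items share many shingles by chance: 114, 142, 422 and 221 occur in all three items. With 6-word shingles a match requires a run of six identical words, so chance matches nearly disappear; the only shingle Item 2 shares with the others is the run w1 w1 w4 w2 w2 w1. As a result, the near-duplicate pair scores about 3 times higher than the other pairs with 3-word shingles, but about 12 times higher with 6-word shingles.

On the other hand, longer shingles are more sensitive to small edits. A single changed word affects every shingle that contains it, which is up to 3 shingles in the 3-word process and up to 6 in the 6-word process. Here the changed word is near the end of the item, so only 3 of the 6-word shingles are affected; had it been in the middle, the resemblance between Items 1 and 3 would have dropped further. The number of shingles per item barely changes (14 versus 11), but each shingle is longer and the range of possible shingle values is much larger, which may affect storage and comparison cost when dealing with a large number of items.

In summary, going to 6-word shingles improves the separation between near duplicates and unrelated items by removing most chance matches, at the cost of lower resemblance values overall and greater sensitivity to small edits. The shingle length should be chosen with this trade-off in mind.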
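
The resemblance calculation can be recomputed directly from the three items with a short Python sketch. This is an illustration under stated assumptions rather than a reference implementation: the names ITEMS, shingle_set, and resemblance are introduced here, words are assumed to be written as "w" followed by their number, and Broder's formula is taken to be the ratio of shared distinct shingles to all distinct shingles.

```python
from itertools import combinations

# The three items from the question (dictionary name is an assumption).
ITEMS = {
    "Item 1": "w1 w1 w4 w2 w2 w1 w4 w2 w3 w1 w1 w4 w2 w3 w4 w3",
    "Item 2": "w1 w4 w2 w4 w1 w1 w4 w2 w2 w1 w2 w3 w3 w2 w2 w4",
    "Item 3": "w1 w1 w4 w2 w2 w1 w4 w2 w3 w1 w1 w4 w2 w5 w4 w3",
}


def shingle_set(text, k):
    """Set of numeric values of the rolling k-word shingles of text."""
    digits = [word[1:] for word in text.split()]
    return {int("".join(digits[i:i + k])) for i in range(len(digits) - k + 1)}


def resemblance(a, b, k):
    """Shared distinct shingles divided by all distinct shingles of a and b."""
    sa, sb = shingle_set(a, k), shingle_set(b, k)
    return len(sa & sb) / len(sa | sb)


# Print every pairwise resemblance for both shingle lengths.
for k in (3, 6):
    for (name_a, a), (name_b, b) in combinations(ITEMS.items(), 2):
        print(f"k={k}  {name_a} vs {name_b}: {resemblance(a, b, k):.4f}")
```

Because shingle_set returns a set, a shingle that repeats within an item is counted once, which is why the denominators are the sizes of the unions rather than the raw shingle counts.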