How to find 5 repeated values in O(n) time?

15

Suppose you have an array of size $n \geq 6$ containing integers from $1$ to $n − 5$, inclusive, with exactly five repeated. I need to propose an algorithm that can find the repeated numbers in $O(n)$ time. I cannot, for the life of me, think of anything. I think sorting, at best, would be $O(n\log n)$? Then traversing the array would be $O(n)$, resulting in $O(n^2\log n)$. However, I'm not really sure if sorting would be necessary, as I've seen some tricky stuff with linked lists, queues, stacks, etc.

algorithms arrays searching

— darylnak

16

$O(n \log n) + O(n)$ is not $O(n^2 \log n)$. It is $O(n \log n)$. It would be $O(n^2 \log n)$ if you did the sorting $n$ times.

— Fund Monica's Lawsuit

11

Sorting integers is $O(n)$.

— leftaroundabout

11

@leftaroundabout Those algorithms are $O(k\cdot n)$, where $n$ is the size of the array and $k$ is the size of the input set. Since $k=n-constant$, these algorithms work in $O(n^2)$.

— Roman Gräf

4

@RomanGräf It appears the actual situation is this: the algorithms work in $O(\log k \cdot n)$, where $k$ is the size of the domain. So for a problem like the OP's, it comes out the same whether you use such an algorithm on a domain of size $n$ or a traditional $O(n\cdot \log n)$ algorithm on a domain of unlimited size. Makes sense too.

— leftaroundabout

5

For $n=6$, the only allowed number is $1$, according to your description. But then $1$ would have to be repeated six, not five, times.

— Alex Reinking

22

You could create an additional array $B$ of size $n$. Initially set all elements of the array to $0$. Then loop through the input array $A$ and increase $B[A[i]]$ by 1 for each $i$. After that, simply check the array $B$: loop over $A$, and if $B[A[i]] > 1$ then $A[i]$ is repeated. You solve it in $O(n)$ time at the cost of memory, which is $O(n)$, because your integers are between $1$ and $n-5$.
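A minimal Python sketch of the counting approach described above. The function name `duplicates_by_counting` is hypothetical, and the final pass scans $B$ rather than $A$, a small deviation chosen so that each repeated value is reported once.

```python
def duplicates_by_counting(a):
    """Return the values that occur more than once in a.

    Assumes every value lies between 1 and len(a), as in the question.
    """
    n = len(a)
    b = [0] * (n + 1)          # b[v] counts occurrences of v; index 0 is unused
    for v in a:
        b[v] += 1
    # Scan the counts so that each repeated value is listed exactly once.
    return [v for v in range(1, n + 1) if b[v] > 1]
```

Both passes are linear, so the running time is $O(n)$ with $O(n)$ extra memory.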

— fade2black

26

The solution in fade2black's answer is the standard one, but it uses $O(n)$ space. You can improve this to $O(1)$ space as follows:

Let the array be $A[1],\ldots,A[n]$. For $d=1,\ldots,5$, compute $\sigma_d = \sum_{i=1}^n A[i]^d$.
Compute $\tau_d = \sigma_d - \sum_{i=1}^{n-5} i^d$ (you can use the well-known formulas to compute the latter sum in $O(1)$). Note that $\tau_d = m_1^d + \cdots + m_5^d$, where $m_1,\ldots,m_5$ are the repeated numbers.
Compute the polynomial $P(t) = (t-m_1)\cdots(t-m_5)$. The coefficients of this polynomial are symmetric functions of $m_1,\ldots,m_5$, which can be computed from $\tau_1,\ldots,\tau_5$ in $O(1)$.
Find all roots of the polynomial $P(t)$ by trying all $n-5$ possibilities.

This algorithm assumes the RAM machine model, in which basic arithmetic operations on $O(\log n)$-bit words take $O(1)$ time.
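The steps above can be sketched in Python as follows. This is an illustration under assumptions, not the answer's own code: the function name is hypothetical, the coefficients are recovered with Newton's identities (one standard way to get symmetric functions from power sums), and the sums $\sum i^d$ are computed by a loop instead of the closed-form formulas, which keeps the sketch short while staying linear overall. It also assumes each repeated value occurs exactly twice.

```python
def duplicates_by_power_sums(a):
    """Find the five repeated values using O(1) extra integers (plus the result)."""
    n = len(a)
    # tau[d] = m_1^d + ... + m_5^d, obtained by subtracting the power sum of 1..n-5.
    tau = [0] * 6
    for d in range(1, 6):
        tau[d] = sum(x ** d for x in a) - sum(i ** d for i in range(1, n - 4))

    # Newton's identities: k * e_k = sum_{i=1..k} (-1)^(i-1) * e_{k-i} * tau_i,
    # where e_k is the k-th elementary symmetric polynomial of m_1..m_5.
    e = [1, 0, 0, 0, 0, 0]
    for k in range(1, 6):
        s = sum((-1) ** (i - 1) * e[k - i] * tau[i] for i in range(1, k + 1))
        e[k] = s // k  # the division is exact for valid input

    def p(t):
        # P(t) = (t - m_1)...(t - m_5), expanded with alternating signs.
        return t**5 - e[1] * t**4 + e[2] * t**3 - e[3] * t**2 + e[4] * t - e[5]

    # Try all n-5 candidate roots.
    return [t for t in range(1, n - 4) if p(t) == 0]
```

Python integers are arbitrary precision, so the sketch sidesteps the word-size question that the RAM-model remark addresses.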

Another way to formulate this solution is along the following lines:

Compute $x_1 = \sum_{i=1}^n A[i]$, and deduce $y_1 = m_1 + \cdots + m_5$ using the formula $y_1 = x_1 - \sum_{i=1}^{n-5} i$.
Compute $x_2 = \sum_{1 \leq i < j \leq n} A[i] A[j]$ in $O(n)$ using the formula $x_2 = (A[1]) A[2] + (A[1] + A[2]) A[3] + (A[1] + A[2] + A[3]) A[4] + \cdots + (A[1] + \cdots + A[n-1]) A[n].$
Deduce $y_2 = \sum_{1 \leq i < j \leq 5} m_i m_j$ using the formula $y_2 = x_2 - \sum_{1 \leq i < j \leq n-5} ij - \left(\sum_{i=1}^{n-5} i\right) y_1.$
Compute $x_3,x_4,x_5$ and deduce $y_3,y_4,y_5$ along similar lines.
The values of $y_1,\ldots,y_5$ are (up to sign) the coefficients of the polynomial $P(t)$ from the preceding solution.
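One possible reading of "along similar lines", sketched in Python. The helper names are hypothetical, and the general recurrence $y_k = x_k - \sum_{j=1}^{k} b_j\, y_{k-j}$ (with $y_0 = 1$ and $b_j$ the $j$-th elementary symmetric sum of $1,\ldots,n-5$) is an assumption that extends the formulas given for $y_1$ and $y_2$; it is not spelled out in the answer.

```python
def elementary_sums(values, d=5):
    """Return [e_0, ..., e_d], the elementary symmetric sums of values, in one pass."""
    e = [1] + [0] * d
    for v in values:
        # Update from high k to low k so each value is used at most once per product.
        for k in range(d, 0, -1):
            e[k] += e[k - 1] * v
    return e


def duplicates_by_elementary_sums(a):
    n = len(a)
    x = elementary_sums(a)                 # x_k over the whole array
    b = elementary_sums(range(1, n - 4))   # b_k over 1..n-5
    y = [1, 0, 0, 0, 0, 0]
    for k in range(1, 6):
        y[k] = x[k] - sum(b[j] * y[k - j] for j in range(1, k + 1))

    def p(t):
        # P(t) = sum_k (-1)^k * y_k * t^(5-k)
        return sum((-1) ** k * y[k] * t ** (5 - k) for k in range(6))

    return [t for t in range(1, n - 4) if p(t) == 0]
```

For $k = 2$ the recurrence reduces to $y_2 = x_2 - b_2 - b_1 y_1$, matching the formula in the list above.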

This solution shows that if we replace 5 by $d$, then we get (I believe) an $O(d^2n)$ algorithm using $O(d^2)$ space, which performs $O(dn)$ arithmetic operations on integers of bit-length $O(d\log n)$, keeping at most $O(d)$ of these at any given time. (This requires a careful analysis of the multiplications we perform, most of which involve an operand of length only $O(\log n)$.) It is conceivable that this can be improved to $O(dn)$ time and $O(d)$ space using modular arithmetic.

— Yuval Filmus

Any interpretation of $\sigma_d$, $\tau_d$, $P(t)$, $m_i$ and so on? Why $d \in \{1, 2, 3, 4, 5\}$?

— styrofoam fly

3

The insight behind the solution is the summing trick, which appears in many exercises (for example, how do you find the missing element of an array of length $n-1$ containing all but one of the numbers $1,\ldots,n$?). The summing trick can be used to compute $f(m_1) + \cdots + f(m_5)$ for an arbitrary function $f$, and the question is which $f$ to choose in order to be able to deduce $m_1,\ldots,m_5$. My answer uses familiar tricks from the elementary theory of symmetric functions.
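For readers who have not met the summing trick, here is a minimal Python sketch of the missing-element exercise mentioned in the comment. The function name is hypothetical.

```python
def missing_element(a):
    """a holds all but one of the numbers 1..n, where n = len(a) + 1."""
    n = len(a) + 1
    # The full sum 1 + 2 + ... + n is n(n+1)/2; whatever is missing from
    # the array's sum is the missing element.
    return n * (n + 1) // 2 - sum(a)
```

With five unknowns instead of one, a single sum no longer suffices, which is why the answer collects five different sums.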

— Yuval Filmus

11

@hoffmale Actually, $O(d^2)$.

— Yuval Filmus 08/10

11

@hoffmale Each of them uses $d$ machine words.

— Yuval Filmus

11

@BurnsBA The problem with this approach is that $(n-5)\#$ is much larger than $\frac{(n-4)(n-5)}{2}$. Operations on large numbers are slower.

— Yuval Filmus

8

There's also a linear time and constant space algorithm based on partitioning, which may be more flexible if you're trying to apply this to variants of the problem that the mathematical approach doesn't work well on. This requires mutating the underlying array and has worse constant factors than the mathematical approach. More specifically, I believe the time and space costs, in terms of the total number of values $n$ and the number of duplicates $d$, are $\mathcal{O}(n \log d)$ and $\mathcal{O}(d)$ respectively, though proving it rigorously will take more time than I have at the moment.

Algorithm

Start with a list of pairs, where the first pair is the range over the whole array, or $[(1, n)]$ if 1-indexed.

Repeat the following steps until the list is empty:

1. Take and remove any pair $(i, j)$ from the list.
2. Find the minimum and maximum, $\text{min}$ and $\text{max}$, of the denoted subarray.
3. If $\text{min} = \text{max}$, the subarray consists only of equal elements. Yield its elements except one and skip steps 4 to 6.
4. If $\text{max} - \text{min} = j - i$, the subarray contains no duplicates. Skip steps 5 and 6.
5. Partition the subarray around $\frac{\text{min}+\text{max}}{2}$, such that elements up to some index $k$ are smaller than the separator and elements above that index are not.
6. Add $(i, k)$ and $(k + 1, j)$ to the list.
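A Python sketch of the six steps, for illustration only. The function name is hypothetical, indices are 0-based rather than 1-based, and the results are collected in a list rather than yielded. The no-duplicates test in step 4 relies on the question's guarantee that every value in the range is present at least once.

```python
def duplicates_by_partition(arr):
    """Return the duplicated values of arr; reorders arr in place."""
    found = []
    pending = [(0, len(arr) - 1)]          # inclusive index ranges still to examine
    while pending:
        i, j = pending.pop()               # step 1
        lo = hi = arr[i]
        for p in range(i + 1, j + 1):      # step 2: min and max of arr[i..j]
            lo = min(lo, arr[p])
            hi = max(hi, arr[p])
        if lo == hi:                       # step 3: all equal, keep all but one
            found.extend([lo] * (j - i))
            continue
        if hi - lo == j - i:               # step 4: as many slots as values
            continue
        # Step 5: move elements below (lo + hi) / 2 to the front.
        # Comparing 2 * value against lo + hi avoids fractions.
        k = i
        for p in range(i, j + 1):
            if 2 * arr[p] < lo + hi:
                arr[k], arr[p] = arr[p], arr[k]
                k += 1
        # Step 6: both halves are non-empty because lo < (lo + hi) / 2 <= hi.
        pending.append((i, k - 1))
        pending.append((k, j))
    return found
```

Each subrange covers a strictly narrower interval of values than its parent, so the loop terminates.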

Cursory analysis of time complexity.

Steps 1 to 6 take $\mathcal{O}(j - i)$ time, since finding the minimum and maximum and partitioning can be done in linear time.

Every pair $(i, j)$ in the list is either the first pair, $(1, n)$ , or a child of some pair for which the corresponding subarray contains a duplicate element. There are at most $d \lceil \log_2 n + 1\rceil$ such parents, since each traversal halves the range in which a duplicate can be, so there are at most $2d \lceil \log_2 n + 1\rceil$ total when including pairs over subarrays with no duplicates. At any one time, the size of the list is no more than $2d$ .

Consider the work to find any one duplicate. This consists of a sequence of pairs over an exponentially decreasing range, so the total work is the sum of the geometric sequence, or $\mathcal{O}(n)$ . This produces an obvious corollary that the total work for $d$ duplicates must be $\mathcal{O}(nd)$ , which is linear in $n$ .

To find a tighter bound, consider the worst-case scenario of maximally spread out duplicates. Intuitively, the search takes two phases, one where the full array is being traversed each time, in progressively smaller parts, and one where the parts are smaller than $\frac{n}{d}$ so only parts of the array are traversed. The first phase can only be $\log d$ deep, so has cost $\mathcal{O}(n \log d)$ , and the second phase has cost $\mathcal{O}(n)$ because the total area being searched is again exponentially decreasing.

— Veedrac

Thank you for the explanation. Now I understand. A very pretty algorithm!

— D.W.

5

Leaving this as an answer because it needs more space than a comment gives.

You make a mistake in the OP when you suggest a method. Sorting a list and then traversing it takes $O(n\log n)$ time, not $O(n^2\log n)$ time. When you do two things (that take $O(f)$ and $O(g)$ respectively) sequentially, the resulting time complexity is $O(f+g)=O(\max\{f,g\})$ (under most circumstances).

In order to multiply the time complexities, you need to be using a for loop. If you have a loop of length $f$ and for each value in the loop you do a function that takes $O(g)$ , then you'll get $O(fg)$ time.

So, in your case you sort in $O(n\log n)$ and then traverse in $O(n)$, resulting in $O(n\log n+n)=O(n\log n)$. If for each comparison of the sorting algorithm you had to do a computation that takes $O(n)$, then it would take $O(n^2\log n)$, but that's not the case here.
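To make the sequential case concrete, here is a small Python sketch of the sort-then-traverse method from the question. The function name is hypothetical. The two phases run one after the other, so their costs add rather than multiply.

```python
def duplicates_by_sorting(a):
    s = sorted(a)  # phase 1: O(n log n) comparison sort
    # Phase 2: one O(n) pass; equal values are adjacent after sorting.
    return [s[i] for i in range(1, len(s)) if s[i] == s[i - 1]]
```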

In case you're curious about my claim that $O(f+g)=O(\max\{f,g\})$, it's important to note that that's not always true. But if $f\in O(g)$ or $g\in O(f)$ (which holds for a whole host of common functions), it will hold. The most common time it doesn't hold is when additional parameters get involved and you get expressions like $O(2^cn+n\log n)$.

— Stella Biderman

3

There's an obvious in-place variant of the boolean array technique using the order of the elements as the store (where arr[x] == x for "found" elements). Unlike the partition variant, which can be justified as being more general, I'm unsure when you'd actually need something like this, but it is simple.

for idx from n-4 to n
    while arr[arr[idx]] != arr[idx]
        swap(arr[arr[idx]], arr[idx])

This just repeatedly puts arr[idx] at the location arr[idx] until you find that location already taken, at which point it must be a duplicate. Note that the total number of swaps is bounded by $n$ since each swap makes its exit condition correct.
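A runnable Python rendering of the pseudocode above, as a sketch. The function name is hypothetical, and because Python lists are 0-indexed, the value $v$ is stored at index $v - 1$ and the last five slots are indices $n-5$ to $n-1$. It assumes each repeated value occurs exactly twice.

```python
def duplicates_in_place(arr):
    """Reorder arr so that its last five slots hold the five repeated values."""
    n = len(arr)
    for idx in range(n - 5, n):
        # Send arr[idx] to its home slot until the home slot already holds
        # the same value, which means arr[idx] is a second copy.
        while arr[arr[idx] - 1] != arr[idx]:
            target = arr[idx] - 1
            arr[idx], arr[target] = arr[target], arr[idx]
    return arr[n - 5:]
```

A home slot that has been filled is never disturbed again, which is what bounds the total number of swaps.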

— Veedrac

You're going to have to give some sort of argument that the inner while loop runs in constant time on average. Otherwise, this isn't a linear-time algorithm.

— David Richerby

@DavidRicherby It doesn't run constant time on average, but the outer loop only runs 5 times so that's fine. Note that the total number of swaps is bounded by $n$ since each swap makes its exit condition correct, so even if the number of duplicate values increases the total time is still linear (aka. it takes $n$ steps rather than $nd$).

— Veedrac

Oops, I somehow didn't notice that the outer loop runs a constant number of times! (Edited to include your note about the number of swaps and also so I could reverse my downvote.)

— David Richerby

1

Subtract the values you have from the sum $\sum_{i=1}^{n} i = \frac{n \cdot (n+1)}{2}$.

So, after $\Theta(n)$ time (assuming arithmetic is O(1), which it isn't really, but let's pretend) you have a sum $\sigma_1$ of 5 integers between 1 and n:

$x_1 + x_2 + x_3 + x_4 + x_5 = \sigma_1$

Supposedly, this is no good, right? You can't possibly figure out how to break this up into 5 distinct numbers.

Ah, but this is where it gets to be fun! Now do the same thing as before, but subtract the squares of the values from $\sum_{i=1}^{n} i^2$ . Now you have:

${x_1}^2 + {x_2}^2 + {x_3}^2 + {x_4}^2 + {x_5}^2 = \sigma_2$

See where I'm going with this? Do the same for powers 3, 4 and 5 and you have yourself 5 independent equations in 5 variables. I'm pretty sure you can solve for $\vec{x}$ .

Caveats: Arithmetic is not really O(1). Also, you need a bit of space to represent your sums; but not as much as you would imagine - you can do most everything modularly, as long as you have, oh, $\lceil\log(5n^6)\rceil$ bits; that should do it.

— einpoklum

Doesn't @YuvalFilmus propose the same solution?

— fade2black

@fade2black: Oh, yes, it does, sorry, I just saw the first line of his solution.

— einpoklum

0

The easiest way to solve the problem is to create an array in which we count the appearances of each number in the original array, then traverse all numbers from $1$ to $n-5$ and check whether each number appears more than once. The complexity of this solution in both memory and time is linear, or $O(N)$.

— someone12321

1

This is the same as @fade2black's answer (although a bit easier on the eyes)

— LangeHaare

0

Map an array to 1 << A[i] and then XOR everything together. Your duplicates will be the numbers where corresponding bit is off.
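A minimal Python sketch of this idea, using one arbitrary-precision integer as the bit vector. The function name is hypothetical, and the sketch assumes each repeated value occurs exactly twice, so that its bit is toggled an even number of times.

```python
def duplicates_by_xor(a):
    n = len(a)
    mask = 0
    for v in a:
        mask ^= 1 << v  # toggle bit v; a value seen twice ends up cleared
    # Values in 1..n-5 whose bit is off were toggled twice.
    return [v for v in range(1, n - 4) if not (mask >> v) & 1]
```

Each XOR here touches an integer of up to about $n$ bits, so the per-operation cost is not constant.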

— Hauleth

There are five duplicates, so the xor trick will not break in some cases.

— Evil

1

The running time of this is $O(n^2)$. Each bitvector is $n$ bits long, so each bitvector operation takes $O(n)$ time, and you do one bitvector operation per element of the original array, for a total of $O(n^2)$ time.

— D.W.

@D.W. But given that the machines we normally use are fixed at either 32 or 64-bits, and these don't change at run-time (i.e. they're constant), why shouldn't they be treated as such and assume that the bit operations are in $O(1)$ instead of $O(n)$?

— code_dredd

1

@ray, I think you answered your own question. Given that the machines we normally use are fixed at 64-bits, the running time to do an operation on an $n$-bit vector is $O(n)$, not $O(1)$. It takes something like $n/64$ instructions to do some operation on all $n$ bits of an $n$-bit vector, and $n/64$ is $O(n)$, not $O(1)$.

— D.W.

@D.W. What I got out of prev. comments was that a bit vector referred to a single element in an $n$-sized array, with the bit vector being 64-bits, which would be the constant I'm referring to. Obviously, processing an array of size $n$ will take $O(kn)$ time, if we assume there're $k$-bits per element and $n$ the number of elements in the array. But $k=64$, so an operation for an array element w/ a constant bit count should be $O(1)$ instead of $O(k)$ and the array $O(n)$ instead of $O(kn)$. Are you keeping the $k$ for the sake of completeness/correctness or am I missing something else?

— code_dredd

-2

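from collections import defaultdict

DATA = [1, 2, 2, 2, 2, 2]

def find_value_repeated_five_times(data):
    # Group equal items together; return the first value collected five times.
    collated = defaultdict(list)
    for item in data:
        collated[item].append(item)
        if len(collated[item]) == 5:
            return item
    return None  # no value occurs five times

# n time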

— user78484

4

Welcome to the site. We're a computer science site, so we're looking for algorithms and explanations, not code dumps that require understanding of a particular language and its libraries. In particular, your claim that this code runs in linear time assumes that collated[item].append(item) runs in constant time. Is that really true?

— David Richerby

3

Also, you are looking for a value which is repeated five times. In contrast, the OP is looking for five values, which are each repeated twice.

— Yuval Filmus