First problem: Minimizing $\|w\|$ or $\|w\|^2$:
It is correct that one wants to maximize the margin. This is actually done by maximizing $\frac{2}{\|w\|}$. This would be the "correct" way of doing it, but it is rather inconvenient. Let's first drop the 2, as it is just a constant. Now if $\frac{1}{\|w\|}$ is maximal, $\|w\|$ will have to be as small as possible. We can thus find the identical solution by minimizing $\|w\|$.
$\|w\|$ can be calculated as $\sqrt{w^T w}$. As the square root is a monotonically increasing function, any point $x$ which minimizes $\sqrt{f(x)}$ also minimizes $f(x)$ (for non-negative $f$). We thus don't have to compute the square root at all and can minimize $w^T w = \|w\|^2$ instead.
Finally, as we often have to calculate derivatives, we multiply the whole expression by a factor $\frac{1}{2}$. This is done very often because of the derivative: $\frac{d}{dx}x^2 = 2x$, and thus $\frac{d}{dx}\frac{1}{2}x^2 = x$, so the factor of 2 cancels.
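The same cancellation, written out for the gradient with respect to $w$ (a worked step I'm adding here, using the convention $\|w\|^2 = w^T w$):
$$\nabla_w \left(\tfrac{1}{2}\|w\|^2\right) = \tfrac{1}{2}\,\nabla_w\!\left(w^T w\right) = \tfrac{1}{2}\cdot 2w = w.$$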
This is how we end up with the problem: minimize $\frac{1}{2}\|w\|^2$.
tl;dr: yes, minimizing $\|w\|$ instead of $\frac{1}{2}\|w\|^2$ would work.
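If you want to convince yourself of this numerically, here is a minimal sketch (mine, not part of the derivation above): it solves both problems with SciPy's general-purpose SLSQP solver on a tiny hand-made data set (the data points, starting value and tolerance are just assumptions for illustration) and checks that both objectives return the same hyperplane.

```python
# My own sanity check, not part of the original answer: under the same
# constraints y_i(<w, x_i> + b) >= 1, minimizing ||w|| and minimizing
# 1/2 ||w||^2 should yield the same hyperplane (w, b).
import numpy as np
from scipy.optimize import minimize

# Hypothetical, linearly separable toy data (chosen only for illustration).
X = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])

# One inequality constraint per training point; t = (w_1, w_2, b).
cons = [{"type": "ineq",
         "fun": lambda t, i=i: y[i] * (X[i] @ t[:2] + t[2]) - 1.0}
        for i in range(len(y))]

def solve(objective):
    # Start from a feasible point; SLSQP handles the inequality constraints.
    res = minimize(objective, x0=np.array([1.0, 1.0, 0.0]),
                   constraints=cons, method="SLSQP")
    return res.x

sol_norm = solve(lambda t: np.linalg.norm(t[:2]))   # minimize ||w||
sol_sq   = solve(lambda t: 0.5 * t[:2] @ t[:2])     # minimize 1/2 ||w||^2

print(sol_norm, sol_sq)                             # both should be ~ (0.5, 0.5, 0.0)
print(np.allclose(sol_norm, sol_sq, atol=1e-3))     # expected: True
```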
Second problem: $\geq 0$ or $\geq 1$:
As already stated in the question, $y_i(\langle w, x_i\rangle + b) \geq 0$ only means that the point is on the correct side of the hyperplane. However, this isn't enough: we want the point to be at least as far away as the margin (in which case the point is a support vector), or even further away.
Remember the definition of the hyperplane,
$$H = \{x \mid \langle w, x\rangle + b = 0\}.$$
This description, however, is not unique: if we scale $w$ and $b$ by a constant $c \neq 0$, we get an equivalent description of the same hyperplane. To make sure our optimization algorithm cannot simply rescale $w$ and $b$ to make $\frac{1}{\|w\|}$ larger without actually moving the hyperplane, we fix the scale: we require that the points closest to the hyperplane, the support vectors, satisfy $|\langle w, x_i\rangle + b| = 1$, so that their geometric distance to the hyperplane, the margin, is $\frac{1}{\|w\|}$. A support vector is thus characterized by $y_i(\langle w, x_i\rangle + b) = 1$.
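A quick numerical illustration of this non-uniqueness (my own sketch; the numbers are arbitrary): the signed distance of a point to the hyperplane, $\frac{\langle w, x\rangle + b}{\|w\|}$, does not change when $(w, b)$ is replaced by $(cw, cb)$ for $c > 0$.

```python
# My own illustration with arbitrary numbers: scaling (w, b) by c > 0 describes
# the same hyperplane -- the signed distance (<w, x> + b) / ||w|| is unchanged.
import numpy as np

w, b, c = np.array([2.0, -1.0]), 0.5, 7.3   # arbitrary values for illustration
x = np.array([1.0, 3.0])                    # an arbitrary test point

d_original = (w @ x + b) / np.linalg.norm(w)
d_scaled   = (c * w @ x + c * b) / np.linalg.norm(c * w)
print(np.isclose(d_original, d_scaled))     # True
```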
As already mentioned earlier, we want all points to either be support vectors or lie even further away from the hyperplane. In training, we thus add the constraint $y_i(\langle w, x_i\rangle + b) \geq 1$, which ensures exactly that.
tl;dr: Training points don't only need to be classified correctly, they have to lie on the margin or further away.
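Putting both pieces together, here is a minimal sketch of the resulting optimization problem (again mine, not from the answer above): it solves $\min \frac{1}{2}\|w\|^2$ subject to $y_i(\langle w, x_i\rangle + b) \geq 1$ with a general-purpose solver on a hypothetical toy data set (data, seed and tolerances are assumptions for illustration; in practice one would use a dedicated QP or SVM solver) and then reads off the support vectors as the points with $y_i(\langle w, x_i\rangle + b) \approx 1$.

```python
# My own sketch of the hard-margin primal, not the answer's code:
#   minimize 1/2 ||w||^2   subject to   y_i (<w, x_i> + b) >= 1
import numpy as np
from scipy.optimize import minimize

# Hypothetical linearly separable toy data (assumption for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, size=(20, 2)),
               rng.normal(+2.0, 0.5, size=(20, 2))])
y = np.array([-1] * 20 + [+1] * 20)

# t = (w_1, w_2, b); one inequality constraint per training point.
cons = [{"type": "ineq",
         "fun": lambda t, i=i: y[i] * (X[i] @ t[:2] + t[2]) - 1.0}
        for i in range(len(y))]

res = minimize(lambda t: 0.5 * t[:2] @ t[:2],        # 1/2 ||w||^2
               x0=np.array([1.0, 1.0, 0.0]),
               constraints=cons, method="SLSQP")
w, b = res.x[:2], res.x[2]

functional_margins = y * (X @ w + b)                  # all should be >= 1
print("margin 1/||w|| =", 1.0 / np.linalg.norm(w))
print("support vectors:", np.where(np.isclose(functional_margins, 1.0, atol=1e-3))[0])
```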