A vector is an ordered, finite list of numbers.
In code, it is usually represented by an array or a list (not a tuple).
Vector notation can differ wildly among authors, so caution is advised.
Some notations don’t distinguish between a 1-vector and a scalar (a number). In code, they are different data structures.
Special vectors include the zero vector (all entries 0), the unit vectors $e_i$ (a single entry 1, the rest 0), the ones vector $\mathbf{1}$ (all entries 1), and sparse vectors (most entries 0).
Adding/subtracting vectors (entrywise operations) is different from stacking (concatenating) their contents.
The inner product (or dot product) of n-vectors $a$ and $b$ is written as $a^T b = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n$.
Sparse vectors reduce computation time; both Python (scipy.sparse) and Julia (SparseArrays) have dedicated data structures for them.
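A minimal sketch of the idea (plain Python of my own, not the dedicated library types mentioned above): store only the nonzero entries, so an inner product only touches those.

```python
# Sparse vector represented as {index: value}, keeping only nonzero entries.
# Illustrative only; scipy.sparse / SparseArrays are the real implementations.
def sparse_inner(a, b):
    """Inner product of two sparse vectors given as dicts."""
    # Iterate over the smaller dict; indices absent from either vector are zero.
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b[i] for i, v in a.items() if i in b)

x = {0: 1.0, 250_000: -2.0}       # n could be huge; we only store 2 entries
y = {250_000: 4.0, 999_999: 3.0}
print(sparse_inner(x, y))          # -8.0
```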
Functions that take vectors as input are written as $f(x)$, with $x$ an n-vector (i.e., $f : \mathbb{R}^n \rightarrow \mathbb{R}$).
Example: the inner product function $f(x) = a^T x = a_1 x_1 + \cdots + a_n x_n$.
This can be seen as a weighted sum of the elements of $x$, with weights $a_1, \ldots, a_n$.
A function $f$ is linear when both of these hold:
$f(\alpha x) = \alpha f(x)$ for any scalar $\alpha$ and n-vector $x$ (homogeneity)
$f(x + y) = f(x) + f(y)$ for any n-vectors $x$ and $y$ (additivity)
Examples: mean() is a linear function; max() and median() are not (see the check below).
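As a quick sanity check (a sketch of mine, not from the source): superposition $f(\alpha x + \beta y) = \alpha f(x) + \beta f(y)$ must hold for all inputs if $f$ is linear, so a single failing instance rules linearity out.

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=5), rng.normal(size=5)
alpha, beta = 2.0, -3.0

for f in (np.mean, np.max, np.median):
    lhs = f(alpha * x + beta * y)
    rhs = alpha * f(x) + beta * f(y)
    # Superposition holds for every input iff f is linear; one failing
    # random instance is enough to show max and median are not linear.
    print(f.__name__, np.isclose(lhs, rhs))
```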
A differentiable function (not necessarily linear) can be approximated near a point $z$ by an affine function via the first-order Taylor approximation: $\hat{f}(x) = f(z) + \nabla f(z)^T (x - z)$.
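A hand-rolled illustration (the function and expansion point are my own choices): the affine approximation $\hat{f}$ gets better as $x$ approaches $z$.

```python
import numpy as np

def f(x):
    return x[0] * x[1]             # differentiable but not linear

def grad_f(x):
    return np.array([x[1], x[0]])  # gradient of f

z = np.array([1.0, 2.0])           # expansion point
f_hat = lambda x: f(z) + grad_f(z) @ (x - z)  # first-order Taylor approximation

for dx in (0.5, 0.1, 0.01):
    x = z + dx
    print(dx, f(x), f_hat(x))      # approximation error shrinks with dx
```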
Regression can be written with vectors as $\hat{y} = x^T \beta + v$ (an affine function of the feature vector $x$, with weight vector $\beta$ and offset $v$).
The (Euclidean) norm of an n-vector $x$ is $\|x\| = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$.
Also written as $\|x\| = \sqrt{x^T x}$.
The norm generalizes the notion of “length” beyond 2- and 3-vectors.
The triangle inequality for norms: $\|x + y\| \leq \|x\| + \|y\|$.
The RMS of a vector can be calculated from its norm: $\mathbf{rms}(x) = \sqrt{\frac{x_1^2 + \cdots + x_n^2}{n}} = \frac{\|x\|}{\sqrt{n}}$.
Chebyshev’s inequality states that the number of entries of an n-vector $x$ with absolute value at least $a$ is no more than $\|x\|^2 / a^2$.
Chebyshev’s inequality (stats version): the fraction of entries of an n-vector $x$ with absolute value at least $a$ is no more than $(\mathbf{rms}(x)/a)^2$.
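A small numerical check of both forms of the bound (my own sketch; the data is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1000)
a = 2.0

count = np.sum(np.abs(x) >= a)             # entries with |x_i| >= a
bound = np.linalg.norm(x) ** 2 / a ** 2    # Chebyshev bound on that count
rms = np.sqrt(np.mean(x ** 2))
frac_bound = (rms / a) ** 2                # bound on the fraction of entries

print(count, "<=", bound)                  # always holds
print(count / len(x), "<=", frac_bound)
```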
The (Euclidean) distance between two vectors $a$ and $b$ is $\mathbf{dist}(a, b) = \|a - b\|$.
Average, RMS, and standard deviation: $\mathbf{avg}(x) = \mathbf{1}^T x / n$, $\mathbf{rms}(x) = \|x\| / \sqrt{n}$, $\mathbf{std}(x) = \|x - \mathbf{avg}(x)\mathbf{1}\| / \sqrt{n}$; they satisfy $\mathbf{rms}(x)^2 = \mathbf{avg}(x)^2 + \mathbf{std}(x)^2$.
We can standardize a vector by subtracting its mean and dividing by its standard deviation: $z = (x - \mathbf{avg}(x)\mathbf{1}) / \mathbf{std}(x)$; the entries of $z$ are called z-scores.
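These quantities and relations in numpy (a sketch; note that np.std with its default ddof=0 matches the definition above):

```python
import numpy as np

x = np.array([1.0, -2.0, 3.0, 2.0])
n = len(x)

avg = np.mean(x)                                  # avg(x) = 1^T x / n
rms = np.sqrt(np.mean(x ** 2))                    # rms(x) = ||x|| / sqrt(n)
std = np.linalg.norm(x - avg) / np.sqrt(n)        # std(x)

print(np.isclose(rms ** 2, avg ** 2 + std ** 2))  # True

z = (x - avg) / std                               # z-scores
print(np.mean(z), np.std(z))                      # ~0 and ~1 by construction
```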
The angle between two nonzero vectors $a$ and $b$ is $\theta = \arccos\left( \frac{a^T b}{\|a\| \, \|b\|} \right)$.
The correlation coefficient between two vectors $a$ and $b$, with de-meaned versions $\tilde{a} = a - \mathbf{avg}(a)\mathbf{1}$ and $\tilde{b} = b - \mathbf{avg}(b)\mathbf{1}$, is $\rho = \frac{\tilde{a}^T \tilde{b}}{\|\tilde{a}\| \, \|\tilde{b}\|}$.
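Both formulas in numpy (my own sketch; the vectors are arbitrary example data):

```python
import numpy as np

a = np.array([1.0, 2.0, -1.0])
b = np.array([2.0, 0.0, -3.0])

# Angle: arccos of the normalized inner product
theta = np.arccos((a @ b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Correlation coefficient: the same formula applied to de-meaned vectors
at, bt = a - np.mean(a), b - np.mean(b)
rho = (at @ bt) / (np.linalg.norm(at) * np.linalg.norm(bt))

print(theta, rho)   # theta in radians; rho always lies in [-1, 1]
```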
Clustering outside of simple cases:
In almost all applications the dimension $n$ is larger than 2 (we can’t just scatterplot the values).
There will be some or even many points between clusters.
In real examples, the best number of clusters is often unclear.
Formalizing a clustering objective:
All N vectors $x_1, \ldots, x_N$ get clustered into $k$ clusters $G_1, \ldots, G_k$, where $c_i$ is the index of the group that $x_i$ is assigned to.
With each cluster $G_j$ there is a group representative $z_j$ (it doesn’t need to be one of the $x_i$) which should be close to each vector in that group (small distance $\|x_i - z_{c_i}\|$).
$J^{\text{clust}} = \frac{1}{N} \sum_{i=1}^{N} \|x_i - z_{c_i}\|^2$ = mean squared distance of the vectors to their group representatives
Optimal clustering means minimizing the objective $J^{\text{clust}}$, but this is computationally impractical outside of small problems.
$k$-means is suboptimal but finds very good clusterings in practice ($J^{\text{clust}}$ close to the smallest possible value).
Algorithm: given N vectors $x_1, \ldots, x_N$ and an initial list of group representatives $z_1, \ldots, z_k$, repeat steps 1 & 2 until convergence (a numpy sketch follows these notes):
Step 1: Given the representatives $z_1, \ldots, z_k$, find the best cluster assignment $c_i$ for each $x_i$ (the nearest representative).
Step 2: For each cluster, set the group representative $z_j$ to the mean of the vectors in group $G_j$.
The algorithm is a heuristic. Different initial representatives can result in different clusterings and different final values of $J^{\text{clust}}$.
The resulting representatives are quite interpretable. Example: if the 4th entry of the vectors is age, then $(z_3)_4 = 37.8$ tells us the average age of group 3 is 37.8.
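The promised sketch: a minimal numpy implementation of the two-step loop above (my own code; initialization here picks $k$ data points at random, and sklearn.cluster.KMeans would be the usual library choice):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Minimal k-means. X has shape (N, n); returns assignments c,
    representatives Z, and the final objective J^clust."""
    rng = np.random.default_rng(seed)
    # Initialize representatives with k of the data points (one common choice).
    Z = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Step 1: assign each x_i to its nearest representative.
        dists = np.linalg.norm(X[:, None, :] - Z[None, :, :], axis=2)
        c = np.argmin(dists, axis=1)
        # Step 2: set each representative to the mean of its group
        # (if a group goes empty, this sketch keeps the old representative).
        Z_new = np.array([X[c == j].mean(axis=0) if np.any(c == j) else Z[j]
                          for j in range(k)])
        if np.allclose(Z_new, Z):   # converged: representatives stopped moving
            break
        Z = Z_new
    # J^clust: mean squared distance to the assigned representatives.
    J = np.mean(np.linalg.norm(X - Z[c], axis=1) ** 2)
    return c, Z, J

# Two well-separated blobs; k-means should recover them.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
               rng.normal(5, 0.5, size=(50, 2))])
c, Z, J = kmeans(X, k=2)
print(Z.round(2), J)
```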
A collection of n-vectors $a_1, \ldots, a_k$ is linearly independent if $\beta_1 a_1 + \cdots + \beta_k a_k = 0$ holds only for $\beta_1 = \cdots = \beta_k = 0$.
Like linear dependence, linear independence is an attribute of a collection of vectors, not individual vectors.
Independence-dimension inequality: a linearly independent collection of n-vectors can have at most $n$ elements.
A collection of $n$ linearly independent n-vectors is called a basis; any n-vector can then be written uniquely as a linear combination of them.
A collection of n-vectors $a_1, \ldots, a_k$ is orthonormal if each vector pair is orthogonal ($a_i^T a_j = 0$ for $i \neq j$) and each vector has unit norm ($\|a_i\| = 1$).
The Gram-Schmidt algorithm can be used to determine whether a collection of vectors is linearly independent.
If the collection is dependent, it finds the first vector $a_i$ that is a linear combination of the previous vectors $a_1, \ldots, a_{i-1}$.
For $i = 1, \ldots, k$:
Step 1: Orthogonalization: $\tilde{q}_i = a_i - (q_1^T a_i) q_1 - \cdots - (q_{i-1}^T a_i) q_{i-1}$
Step 2: Test for linear dependence: if $\tilde{q}_i = 0$, quit.
Step 3: Normalization: $q_i = \tilde{q}_i / \|\tilde{q}_i\|$
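A direct transcription of the three steps into numpy (a sketch of mine; in floating point, the exact test $\tilde{q}_i = 0$ is replaced by a small tolerance):

```python
import numpy as np

def gram_schmidt(a, tol=1e-10):
    """Gram-Schmidt on a list of n-vectors. Returns (q, None) with an
    orthonormal list q if the collection is independent, or (None, i)
    where a[i] is the first vector that depends on the previous ones."""
    q = []
    for i, ai in enumerate(a):
        # Step 1: orthogonalization - subtract projections onto earlier q's.
        qt = ai - sum((qj @ ai) * qj for qj in q)
        # Step 2: test for linear dependence.
        if np.linalg.norm(qt) <= tol:
            return None, i
        # Step 3: normalization.
        q.append(qt / np.linalg.norm(qt))
    return q, None

a = [np.array([1.0, 0.0, 1.0]),
     np.array([1.0, 1.0, 0.0]),
     np.array([2.0, 1.0, 1.0])]   # a[2] = a[0] + a[1], so dependent
q, dep = gram_schmidt(a)
print(dep)                         # 2: the third vector is dependent
```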
coming soon...