
Glossary

Last updated: 12 / May / 2020



A

Accuracy

[back to top]

Accuracy is defined as the fraction of correct classifications out of the total number of samples. It is one way to assess the performance of a predictor and is sometimes used synonymously with precision or specificity, although those measures are calculated differently. Accuracy is calculated as:

\frac{\text{True Positives} + \text{True Negatives}}{\text{Positives} + \text{Negatives}}

Source: wikipedia
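
As a minimal sketch (assuming NumPy and hypothetical toy label arrays), accuracy can be computed directly from predicted and true class labels:

```python
import numpy as np

# Hypothetical ground-truth and predicted class labels
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Accuracy = fraction of correct classifications over all samples
accuracy = np.mean(y_true == y_pred)
print(accuracy)  # 0.75
```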


Active Learning

[back to top]

Active learning is a variant of the on-line machine learning architecture in which feedback about the ground-truth class labels of unseen data can be requested if the classification is uncertain. The newly labeled training data can then be used to update the model, as in on-line learning.


Anomaly detection

[back to top]

Anomaly detection describes the task of identifying points that deviate from specific patterns in a dataset -- the so-called outliers. Different types of anomaly detection methods include graph-based, statistics-based, and distance-based techniques, which can be used in both unsupervised and supervised learning tasks.


Artificial Neural Networks (ANN)

[back to top]

Artificial Neural Networks (ANN) are a class of machine learning algorithms that are inspired by the neuron architecture of the human brain. Typically, a (multi-layer) ANN consists of a layer of input nodes, a layer of output nodes, and hidden layers in-between. The nodes are connected by weighted links that can be interpreted as the connections between neurons via axons of different strengths. The simplest version of an ANN is a single-layer perceptron.


B

Backtesting

[back to top]

Backtesting is a specific case of cross-validation in the context of finance and trading models where empirical data from previous time periods (data from the past) is used to evaluate a trading strategy.


Bagging

[back to top]

Bagging is an ensemble method for classification (or regression analysis) in which individual models are trained by random sampling of data, and the final decision is made by voting among individual models with equal weights (or averaging for regression analysis).


Bag of words

[back to top]

Bag of words is a model that is used to construct sparse feature vectors for text classification tasks. The bag of words is an unordered set of all words that occur in all documents that are part of the training set. Every word is then associated with a count of how often it occurs whereas the positional information is ignored. Sometimes, the bag of words is also called "dictionary" or "vocabulary" based on the training data.
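
A minimal sketch of the idea, assuming a hypothetical toy corpus of two documents and pure-Python counting; every word is counted per document against a shared vocabulary, and word order is discarded:

```python
from collections import Counter

# Hypothetical toy training documents
docs = ["the cat sat on the mat", "the dog sat on the log"]

# The bag of words ("vocabulary") is the unordered set of all words in all documents
vocabulary = sorted(set(word for doc in docs for word in doc.split()))

def to_count_vector(doc, vocabulary):
    # Each document becomes a count vector; positional information is ignored
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

for doc in docs:
    print(to_count_vector(doc, vocabulary))
```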


Batch Gradient Descent

[back to top]

Batch Gradient Descent is a variant of the Gradient Descent algorithm that optimizes a function by finding its local minimum. In contrast to Stochastic Gradient Descent, the gradient is computed from the whole dataset.
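
A minimal sketch of batch gradient descent for least-squares linear regression, assuming NumPy and a hypothetical toy dataset; note that the gradient is computed from the entire dataset in every iteration:

```python
import numpy as np

# Hypothetical toy data: a bias column plus one feature, with y = 1 + 2*x
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])

w = np.zeros(2)
learning_rate = 0.05
for _ in range(5000):
    # Gradient of the mean squared error, computed over the *whole* dataset
    gradient = 2.0 / len(y) * X.T @ (X @ w - y)
    w -= learning_rate * gradient

print(w)  # approximately [1.0, 2.0]
```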


Bayes Theorem

[back to top]

Naive Bayes' classifier:

  • posterior probability:
P(\omega_j|x) = \frac{p(x|\omega_j) \cdot P(\omega_j)}{p(x)}
\Rightarrow \text{posterior probability} = \frac{ \text{likelihood} \cdot \text{prior probability}}{\text{evidence}}
  • decision rule:
\text{Decide } \omega_1 \text{ if } P(\omega_1|x) > P(\omega_2|x) \text{ else decide } \omega_2 .
\frac{p(x|\omega_1) \cdot P(\omega_1)}{p(x)} > \frac{p(x|\omega_2) \cdot P(\omega_2)}{p(x)}
  • objective functions:
g_1(\pmb x) = P(\omega_1 | \; \pmb{x}), \quad g_2(\pmb{x}) = P(\omega_2 | \; \pmb{x}), \quad g_3(\pmb{x}) = P(\omega_3 | \; \pmb{x})
\quad g_i(\pmb{x}) = \pmb{x}^{\,t} \bigg( - \frac{1}{2} \Sigma_i^{-1} \bigg) \pmb{x} + \bigg( \Sigma_i^{-1} \pmb{\mu}_{\,i}\bigg)^t \pmb x + \bigg( -\frac{1}{2} \pmb{\mu}_{\,i}^{\,t} \Sigma_{i}^{-1} \pmb{\mu}_{\,i} -\frac{1}{2} ln(|\Sigma_i|)\bigg)

Batch Learning

[back to top]

Batch learning is an architecture used in machine learning tasks where the entire training dataset is available upfront to build the model. In contrast to on-line learning, the model is not updated once it has been built on a training dataset.


Big Data

[back to top]

There are many different, controversial interpretations and definitions for the term "Big Data". Typically, one refers to data as "Big Data" if its volume and complexity are of a magnitude that the data cannot be processed by "conventional" computing platforms anymore; storage space, processing power, and database structures are typically among the limiting factors.


Binomial distribution

[back to top]

  • Probability density function:
p_k = {n \choose k} \cdot p^k \cdot (1-p)^{n-k}

Bootstrapping

[back to top]

A resampling technique that is closely related to cross-validation, where a training dataset is divided into random subsets. Bootstrapping -- in contrast to cross-validation -- is random sampling with replacement. Bootstrapping is typically used for the statistical estimation of bias and standard error, and a common application in machine learning is to estimate the generalization error of a predictor.
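
A minimal sketch (assuming NumPy and a hypothetical toy sample) of a bootstrap estimate of the standard error of the mean; each bootstrap replicate is drawn with replacement from the original sample:

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=5.0, scale=2.0, size=100)  # hypothetical toy sample

# Repeatedly resample *with replacement* and recompute the statistic of interest
boot_means = [np.mean(rng.choice(sample, size=len(sample), replace=True))
              for _ in range(1000)]

# The spread of the bootstrap replicates estimates the standard error of the mean
print(np.std(boot_means))          # bootstrap standard error
print(sample.std(ddof=1) / 10.0)   # classical estimate s / sqrt(n), for comparison
```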


Bregman divergence

[back to top]

Bregman divergence describes a family of proximity functions (or distance measures) that share common properties and are often used in clustering algorithms. A popular example is the squared Euclidean distance.


C

Central Limit Theorem

[back to top]

The Central Limit Theorem is a theorem in the field of probability theory that expresses the idea that the distribution of sample means (from independent random variables) converges to a normal distribution when the sample size approaches infinity.


Co-Variance

[back to top]

S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})

example covariance matrix:

\pmb{\Sigma_1} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}

Confusion Matrix

[back to top]

The confusion matrix is used as a way to represent the performance of a classifier and is sometimes also called "error matrix". This square matrix consists of columns and rows that list the number of instances as absolute or relative "actual class" vs. "predicted class" ratios.
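
A minimal sketch (assuming NumPy and hypothetical binary labels) that builds the error matrix by counting "actual class" vs. "predicted class" pairs:

```python
import numpy as np

# Hypothetical labels for a binary problem (0 = negative, 1 = positive)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

n_classes = 2
confusion = np.zeros((n_classes, n_classes), dtype=int)
for actual, predicted in zip(y_true, y_pred):
    confusion[actual, predicted] += 1  # rows = actual class, columns = predicted class

print(confusion)
# [[3 1]   actual 0: 3 predicted as 0, 1 predicted as 1
#  [1 3]]  actual 1: 1 predicted as 0, 3 predicted as 1
```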


Contingency Table

[back to top]

A contingency table is used in clustering analysis to compare the overlap between two different clustering (grouping) results. The partitions from the two clustering results are represented as rows and columns in the table, and the individual elements of the table represent the number of elements that are shared between two partitions from each clustering result.


Correlation analysis

[back to top]

Correlation analysis describes and quantifies the relationship between two variables. Typically, in the case of a positive correlation, both variables have a tendency to increase together, and in the case of a negative correlation, one variable increases while the other decreases. It is important to mention the famous quotation "correlation does not imply causation".


Correlation analysis, Canonical

Let x and y be two vectors; the goal of canonical correlation analysis (CCA) is to maximize the correlation between linear transformations of those original vectors. With applications in dimensionality reduction and feature selection, CCA tries to find common dimensions between two vectors.


Correlation analysis - Matthews Correlation Coefficient (MCC)

[back to top]

MCC is an assessment metric for clustering or binary classification analyses that represents the correlation between the observed (ground truth) and predicted labels. MCC can be directly computed from the confusion matrix and returns a value between -1 and 1.


Correlation analysis - Kendall

[back to top]

Similar to the Pearson correlation coefficient, Kendall's tau measures the degree of a monotone relationship between variables, and like Spearman's rho, it calculates the dependence between ranked variables, which makes it feasible for non-normally distributed data. Kendall's tau can be calculated for continuous as well as ordinal data. Roughly speaking, Kendall's tau distinguishes itself from Spearman's rho by a stronger penalization of non-sequential (in the context of the ranked variables) dislocations.

\tau = \frac{c-d}{c+d} = \frac{S}{\binom{n}{2}} = \frac{2S}{n(n-1)}

where:

c = the number of concordant pairs
d = the number of discordant pairs

[Source]

If ties are present between the two ranked variables, the following equation should be used instead:

\tau = \frac{S}{\sqrt{n(n-1)/2-T}\sqrt{n(n-1)/2-U}}
T = \sum_t t(t-1)/2
U = \sum_u u(u-1)/2

where:

t = number of observations of variable x that are tied
u = number of observations of variable y that are tied


Correlation analysis - Pearson

[back to top]

The Pearson correlation coefficient is probably the most widely used measure for linear relationships between two normally distributed variables and is thus often just called the "correlation coefficient". Usually, the Pearson coefficient is obtained via a least-squares fit, and a value of 1 represents a perfect positive relationship, -1 a perfect negative relationship, and 0 the absence of a relationship between the variables.

\rho = \frac{\text{cov}(X,Y)}{\sigma_x \sigma_y}

And the sample estimate:

r = \frac{\sum_{i=1}^{n} (x_i - \overline{x})(y_i - \overline{y})} {\sqrt{\sum_{i=1}^{n} (x_i - \overline{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \overline{y})^2}}
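
A minimal sketch (assuming NumPy and hypothetical toy arrays) that evaluates the sample estimate r from the formula above and checks it against np.corrcoef:

```python
import numpy as np

# Hypothetical toy data with a roughly linear relationship
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Sample estimate r: covariance term divided by the product of the standard-deviation terms
r = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2))

print(r)                        # close to 1 (strong positive linear relationship)
print(np.corrcoef(x, y)[0, 1])  # same value computed by NumPy
```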

Correlation analysis - Spearman

[back to top]

Related to the Pearson correlation coefficient, the Spearman correlation coefficient (rho) measures the relationship between two variables. Spearman's rho can be understood as a rank-based version of Pearson's correlation coefficient, which can be used for variables that are not normally distributed and have a non-linear relationship. Also, its use is not restricted to continuous data; it can also be used in analyses of ordinal attributes.

\rho = 1- {\frac {6 \sum d_i^2}{n(n^2 - 1)}}

where: d_i = the difference between the ranks of the variables x_i and y_i.
n = the number of samples.


Cosine Similarity

[back to top]

Cosine similarity measures the orientation of two n-dimensional sample vectors irrespective of their magnitude. It is calculated as the dot product of two numeric vectors, normalized by the product of the vector lengths, so that output values close to 1 indicate high similarity.

cos(\pmb x, \pmb y) = \frac {\pmb x \cdot \pmb y}{||\pmb x|| \cdot ||\pmb y||}
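
A minimal sketch (assuming NumPy and hypothetical toy vectors); vectors with the same orientation yield a similarity of 1 regardless of their magnitude:

```python
import numpy as np

def cosine_similarity(x, y):
    # Dot product normalized by the product of the vector lengths
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Hypothetical toy vectors
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same orientation as a, different magnitude
c = np.array([-1.0, 0.0, 1.0])

print(cosine_similarity(a, b))  # 1.0
print(cosine_similarity(a, c))  # noticeably lower similarity
```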

Cost function

[back to top]

A cost function (synonymous to loss function) is a special case of an objective function, i.e., a function that is used for solving optimization problems. A cost function can take one or more input variables and the output variable is to be minimized. A typical use case for cost functions is parameter optimization.


Cross-validation

[back to top]

Cross-validation is a statistical technique to estimate the prediction error rate by splitting the data into training, cross-validation, and test datasets. A prediction model is obtained using the training set, and model parameters are optimized by the cross-validation set, while the test set is held primarily for empirical error estimation.


Cross-validation, K-fold

[back to top]

K-fold cross-validation is a variant of cross validation where contiguous segments of samples are selected from the training dataset to build two new subsets for every iteration (without replacement): a new training and test dataset (while the original test dataset is retained for the final evaluation of the predictor).


Cross-validation, Leave-One-Out

[back to top]

Leave-One-Out cross-validation is a variant of cross-validation where one sample is removed for every iteration (without replacement). The model is trained on the remaining N-1 samples and evaluated on the removed sample (while the original test dataset is retained for the final evaluation of the predictor).


Cross-validation, Random Sampling

[back to top]

Cross-validation via random sampling is a variant of cross validation where random chunks of samples are extracted from the training dataset to build two new subsets for every iteration (with or without replacement): a new training and test dataset for every iteration (while the original test dataset is retained for the final evaluation of the predictor).


Curse of dimensionality

[back to top]

For a fixed number of training samples, the curse of dimensionality describes the increased error rate for a large number of dimensions (or features) due to imprecise parameter estimations.


D

Data mining

[back to top]

A field that is closely related to machine learning and pattern classification. The focus of data mining does not lie in merely the collection of data, but the extraction of useful information: Discovery of patterns, and making inferences and predictions. Common techniques in data mining include predictive modeling, clustering, association rules, and anomaly detection.


DBSCAN

[back to top]

DBSCAN is a variant of a density-based clustering algorithm that identifies core points as regions of high-densities based on their number of neighbors (> MinPts) in a specified radius (ε). Points that are below MinPts but within ε are specified as border points; the remaining points are classified as noise points.


Decision rule

[back to top]

A function in pattern classification tasks that determines an "action", e.g., assigning a certain class label to an observation or pattern.


Decision tree classifier

[back to top]

Decision tree classifiers are tree-like graphs, where nodes in the graph test certain conditions on a particular set of features, and branches split the decision towards the leaf nodes. Leaves represent the lowest level in the graph and determine the class labels. Optimal trees are trained by minimizing Gini impurity or maximizing information gain.


Density-based clustering

[back to top]

In density-based clustering, regions of high density in n-dimensional space are identified as clusters. A key advantage of this class of clustering algorithms is that it does not require a priori knowledge of the number of clusters (as opposed to the k-means algorithm).


Descriptive modeling

[back to top]

Descriptive modeling is a common task in the field of data mining where a model is built in order to distinguish between objects and categorize them into classes -- a form of data summary. In contrast to predictive modeling, the scope of descriptive modeling does not extend to making predictions for unseen objects.


Dimensionality reduction

[back to top]

Dimensionality reduction is a data pre-processing step in machine learning applications that aims to avoid the curse of dimensionality and reduce the effect of overfitting. Dimensionality reduction is related to feature selection, but instead of selecting a feature subset, dimensionality reduction takes a projection-based approach (e.g., a linear transformation) in order to create a new feature subspace.


Distance Metric Learning

[back to top]

Distance metrics are fundamental for many machine learning algorithms. Distance metric learning - instead of learning a model - incorporates estimated relevances of features to obtain a distance metric for potentially optimal separation of classes and clusters: Large distances for objects from different classes, and small distances for objects of the same class, respectively.


Distance, Euclidean

[back to top]

The Euclidean distance is a distance measure between two points or vectors in a two- or multidimensional (Euclidean) space, based on Pythagoras' theorem. The distance is calculated by taking the square root of the sum of the squared pair-wise differences in every dimension.

\sqrt{\sum_{i=1}^n (x_i-y_i)^2}

Distance, Manhattan

[back to top]

The Manhattan distance (sometimes also called Taxicab distance) metric is related to the Euclidean distance, but instead of calculating the shortest diagonal path ("beeline") between two points, it calculates the distance based on gridlines. The Manhattan distance was named after the block-like layout of the streets in Manhattan.

\sum_{i=1}^n |x_i-y_i|

Distance, Minkowski

[back to top]

The Minkowski distance is a generalized form of the Euclidean distance (if p=2) and the Manhattan distance (if p=1).

\left(\sum_{i=1}^n |x_i-y_i|^p\right)^{1/p}
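
A minimal sketch (assuming NumPy and hypothetical toy points) of the Minkowski distance, which reduces to the Manhattan distance for p=1 and the Euclidean distance for p=2:

```python
import numpy as np

def minkowski_distance(x, y, p):
    # p = 1 gives the Manhattan distance, p = 2 the Euclidean distance
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

# Hypothetical toy points
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])

print(minkowski_distance(x, y, p=1))  # 7.0 (Manhattan)
print(minkowski_distance(x, y, p=2))  # 5.0 (Euclidean)
```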

E

Eager learners

[back to top]

Eager learners (in contrast to lazy learners) describe machine learning algorithms that learn a model for mapping attributes to class labels as soon as the data becomes available (e.g., Decision tree classifiers or naive Bayes classifiers) and do not require the training data for making predictions on unseen samples once the model was built. The most computationally expensive step is the creation of a prediction model from the training data, and the actual prediction is considered as relatively inexpensive.


Eigenvectors and Eigenvalues

[back to top]

Both eigenvectors and eigenvalues are fundamental in many applications involving linear systems and are related via A·v = λ·v (where A is a square matrix, v the eigenvector, and λ the eigenvalue). Eigenvectors describe the directions of the axes of a linear transformation, whereas eigenvalues describe the scale or magnitude.

\pmb A\pmb{v} = \lambda\pmb{v}

where:

\pmb A = S_{W}^{-1}S_B
\pmb{v} = \text{Eigenvector}
\lambda = \text{Eigenvalue}

Ensemble methods

[back to top]

Ensemble methods combine multiple classifiers which may differ in algorithms, input features, or input samples. Statistical analyses showed that ensemble methods yield better classification performances and are also less prone to overfitting. Different methods, e.g., bagging or boosting, are used to construct the final classification decision based on weighted votes.


Evolutionary algorithms

[back to top]

Evolutionary algorithms are a class of algorithms that are based on heuristic search methods inspired by biological evolution in order to solve optimization problems.


Exhaustive search

[back to top]

Exhaustive search (synonymous to brute-force search) is a problem-solving approach where all possible combinations are sequentially evaluated to find the optimal solution. Exhaustive search guarantees to find the optimal solution, whereas other approaches (e.g., heuristic searches) are regarded as sub-optimal. A downside of exhaustive searches is that the computational cost increases in proportion to the number of combinations to be evaluated.


Expectation Maximization algorithm - EM

[back to top]

The Expectation Maximization algorithm (EM) is a technique to estimate parameters of a distribution based on the Maximum Likelihood Estimate (MLE) that is often used for the imputation of missing values in incomplete datasets. After the EM algorithm is initialized with a starting value, alternating iterations between expectation and maximization steps are repeated until convergence. In the expectation step, parameters are estimated based on the current model to impute missing values. In the maximization step, the log-likelihood function of the statistical model is to be maximized by re-estimating the parameters based on the imputed values from the expectation step.


F

Feature Selection

[back to top]

Feature selection is an important pre-processing step in many machine learning applications in order to avoid the curse of dimensionality and overfitting. A subset of features is typically selected by evaluating different combinations of features and eventually retaining the subset that minimizes a specified cost function. Commonly used algorithms for feature selection, as alternatives to exhaustive search, include sequential selection algorithms and genetic algorithms.


Feature Space

[back to top]

A feature space describes the descriptive variables that are available for samples in a dataset as a d-dimensional Euclidean space. E.g., sepal length and width, and petal length and width for each flower sample in the popular Iris dataset.


Fuzzy C-Means Clustering

[back to top]

Fuzzy C-Means is a soft clustering algorithm in which each sample point has a membership degree to each cluster; in hard (crisp) clustering, membership of each point to each cluster is either 0 or 1. Fuzzy C-Means considers a weight matrix for cluster memberships and minimizes the sum of squared errors (SSE) of the weighted distances of sample points to the cluster centroids.


G

Generalization error

[back to top]

The generalization error describes how well new data can be classified and is a useful metric to assess the performance of a classifier. Typically, the generalization error is computed via cross-validation or simply the absolute difference between the error rate on the training and test dataset.


Genetic algorithm

[back to top]

The Genetic algorithm is a subclass of evolutionary algorithms that takes a heuristic approach inspired by Charles Darwin's theory of "natural selection" in order to solve optimization problems.


Gradient Descent

[back to top]

Gradient descent is an algorithm that optimizes a function by finding its local minimum. After the algorithm has been initialized with an initial guess, it takes the derivative of the function to make a step in the direction of steepest descent. This step-wise process is repeated until convergence.


Greedy Algorithm

[back to top]

Greedy algorithms are a family of algorithms that are used in optimization problems. A greedy algorithm makes locally optimal choices in order to find a local optimum (a suboptimal solution; also see heuristic search).


H

Heuristic search

[back to top]

Heuristic search is a problem-solving approach that is focused on efficiency rather than completeness in order to find a suboptimal solution to a problem. Heuristic search is often used as an alternative approach where exhaustive search is too computationally intensive and where solutions need to be approximated.


Hyperparameters

[back to top]

Hyperparameters are the parameters of a classifier or estimator that are not directly learned in the machine learning step from the training data but are optimized separately (e.g., via Grid Search). The goal of hyperparameter optimization is to achieve good generalization of a learning algorithm and to avoid overfitting to the training data.


I

iid

[back to top]

The abbreviation "iid" stands for "independent and identically distributed" and describes random variables that are independent from one another and are drawn from the same probability distribution. Independence means that the outcome of one observation does not affect the outcome of another (e.g., time series and network graphs are not independent). One popular example of iid would be the tossing of a coin: one coin toss does not affect the outcome of another coin toss, and the probability of the coin landing on either "heads" or "tails" is the same for every toss.


Imputation

[back to top]

Imputation algorithms are designed to replace missing data (NAs) with certain statistics rather than discarding it from downstream analysis. Commonly used imputation methods include mean imputation (replacement by the sample mean of an attribute), kNN imputation, and regression imputation.


Independent Component Analysis

[back to top]

Independent Component Analysis (ICA) is a statistical signal-processing technique that decomposes a multivariate dataset of mixed, non-Gaussian distributed source signals into independent components. A popular example is the separation of overlapping voice samples -- the so-called "cocktail party problem".


J

Jaccard coefficient

[back to top]

The Jaccard coefficient (bounded in [0, 1]) is used as a similarity measure for asymmetric binary data and is calculated by taking the number of matching attributes and dividing it by the number of all attributes except those where both variables have the value 0 (in contrast to the simple matching coefficient). A popular application is the identification of near-duplicate documents, for which the Jaccard coefficient can be calculated by dividing the size of the intersection of the two documents' sets of words by the size of their union.
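
A minimal sketch of the document use case (assuming two hypothetical toy documents), computing the Jaccard coefficient as the size of the intersection of the word sets divided by the size of their union:

```python
# Hypothetical toy documents
doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox leaps over a lazy cat"

words1, words2 = set(doc1.split()), set(doc2.split())

# Intersection of the word sets divided by their union
jaccard = len(words1 & words2) / len(words1 | words2)
print(jaccard)  # 6 shared words out of 11 distinct words in total
```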


Jackknifing

[back to top]

Jackknifing is a resampling technique that predates the related cross-validation and bootstrapping techniques and is mostly used for bias and variance estimations. In jackknifing, a dataset is split into N subsets where exactly one sample is removed from every subset so that every subset is of size N-1.


Jittering

[back to top]

Jittering is a sampling technique that can be used to measure the stability of a given statistical model (classification/regression/clustering). In jittering, some noise is added to the sample data points, and then a new model is fit and compared to the original model.


K

k-D Trees

[back to top]

k-D trees are data structures (recursive space-partitioning trees) that result from the binary partitioning of multi-dimensional feature spaces. A typical application of k-D trees is to increase the search efficiency of nearest-neighbor searches. A k-D tree construction can be described as an iterative process with the following steps: select the dimension of largest variance, draw a cutting plane at the median along that dimension to split the data into two halves, and move on to the next dimension.


Kernel Density Estimation

[back to top]

Non-parametric techniques to estimate probability densities from the available data without requiring prior knowledge of the underlying model of the probability distribution.


Kernel (in statistics)

[back to top]

In the context of kernel methods, the term “kernel” describes a function that calculates the dot product of the images of the samples x under the mapping φ (see kernel methods). Roughly speaking, a kernel can be understood as a similarity measure in a higher-dimensional space.


Kernel Methods

[back to top]

Kernel methods are algorithms that map the sample vectors of a dataset onto a higher-dimensional feature space via a so-called kernel function (φ(x)). The goal is to identify and simplify general relationships between data, which is especially useful for linearly non-separable datasets.


Kernel Trick

[back to top]

Since the explicit computation of the mapping into the higher-dimensional feature space becomes increasingly expensive for large sample sizes and high numbers of dimensions, the kernel trick calculates the kernel implicitly, without ever computing the mapping explicitly. The most popular kernels used for the kernel trick are Gaussian Radial Basis Function (RBF) kernels, sigmoid kernels, and polynomial kernels.


k-fold Cross-validation

[back to top]

In k-fold cross-validation, the data is split into k subsets; a prediction/classification model is then trained k times, each time holding out one subset as the test set and training the model parameters on the remaining k-1 subsets. Finally, the cross-validation error is evaluated as the average error across all k trained models.
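
A minimal sketch (assuming NumPy) of how the k index subsets can be generated by hand; the model training and error averaging themselves are left out:

```python
import numpy as np

def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    folds = np.array_split(np.arange(n_samples), k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# Hypothetical usage: train on train_idx, evaluate on test_idx, and average the k errors
for train_idx, test_idx in k_fold_indices(n_samples=10, k=5):
    print(train_idx, test_idx)
```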


K-Means Clustering

[back to top]

A method of partitioning a dataset into k clusters by picking k random initial points (where k < n, the number of total points - modified by S.R.), assigning clusters, averaging, reassigning, and repeating until stability is achieved. The number k must be chosen beforehand.
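
A minimal sketch (assuming NumPy and hypothetical toy data) of the assign-average-repeat loop; empty clusters and convergence checks are not handled here:

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # k random initial points
    for _ in range(n_iter):
        # Assign every point to its nearest center
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Recompute every center as the mean of its assigned points
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return centers, labels

# Hypothetical toy data: two well-separated blobs in 2-D
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(20, 2)), rng.normal(size=(20, 2)) + 5.0])
centers, labels = k_means(X, k=2)
print(centers)
```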


K-Means++ Clustering

[back to top]

A variant of k-means where instead of choosing all initial centers randomly, the first is chosen randomly, the second chosen with probability proportional to the squared distance from the first, the third chosen with probability proportional to the square distance from the first two, etc. See this paper.


K-Medoids Clustering

[back to top]

K-Medoids clustering is a variant of k-means algorithm in which cluster centroids are picked among the sample points rather than the mean point of each cluster. K-Medoids can overcome some of the limitations of k-means algorithm by avoiding empty clusters, being more robust to outliers, and being more easily applicable to non-numeric data types.


K-nearest neighbors algorithms

[back to top]

K-nearest neighbors algorithms find the k-points that are closest to a point of interest based on their attributes using a certain distance measure (e.g., Euclidean distance). K-nearest neighbors algorithms are being used in many different contexts: Non-parametric density estimation, missing value imputation, dimensionality reduction, and classifiers in supervised and unsupervised pattern classification and regression problems.
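
A minimal sketch (assuming NumPy and hypothetical toy data) of a k-nearest-neighbor classifier that uses the Euclidean distance and a majority vote among the k closest training points:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    distances = np.linalg.norm(X_train - x_query, axis=1)   # Euclidean distances
    nearest = np.argsort(distances)[:k]                     # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]   # majority vote

# Hypothetical toy data: two classes in 2-D
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([0.3, 0.1])))  # 0
print(knn_predict(X_train, y_train, np.array([4.8, 5.0])))  # 1
```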


Knowledge Discovery in Databases (KDD)

[back to top]

Knowledge Discovery in Databases (KDD) describes a popular workflow including data mining for extracting useful and meaningful information out of data. Typically, the individual steps are feature selection, pre-processing, transformation, data mining, and post-processing (evaluation and interpretation).


L

LASSO Regression

[back to top]

LASSO (Least Absolute Shrinkage and Selection Operator) is a regression model that uses the L1-norm (sum of absolute values) of model coefficients to penalize the model complexity. LASSO has the advantage that some coefficients can become zero, as opposed to ridge regression that uses the squared sum of model coefficients.


Latent Semantic Indexing

[back to top]

Latent Semantic Indexing (LSI) is a data mining technique to characterize documents by topics, word usage, or other contexts. The structures of the documents are compared by applying singular value decomposition to an input term-document matrix (e.g., a data table of word counts with terms as row labels and document numbers as column labels) in order to obtain the singular values and vectors.


Law of Large Numbers

[back to top]

The Law of Large Numbers is a theorem in the field of probability theory that expresses the idea that the actual value of a random sampling process approaches the expected value for growing sample sizes. A common example is that the observed ratio of "heads" in an unbiased coin-flip experiment will approach 0.5 for large sample sizes.


Lazy learners

[back to top]

Lazy learners (in contrast to eager learners) memorize the training data in order to make predictions for unseen samples. While there is no expensive learning step involved, the prediction step is generally considered to be more expensive compared to eager learners since it involves the evaluation of the training data. One example of lazy learners are k-nearest neighbor algorithms, where the class label of an unseen sample is estimated by, e.g., the majority of the class labels of its neighbors in the training data.


Least Squares fit

[back to top]

A linear regression technique that fits a straight line to a data set (or overdetermined system) by minimizing the sum of the squared residuals, which can be the minimized vertical or perpendicular offsets from the fitted line.

  • Linear equation
f(x) = a\cdot x + b
  • Slope:
a = \frac{S_{x,y}}{\sigma_{x}^{2}}
  • Y-axis intercept:
b = \bar{y} - a\bar{x}

where:

S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})\quad \text{(covariance)}
\sigma_{x}^{2} = \sum_{i=1}^{n} (x_i - \bar{x})^2\quad \text{(variance)}
  • Matrix equation
\pmb X \cdot \pmb a = \pmb y
\Bigg[ \begin{array}{cc} x_1 & 1 \\ ... & 1 \\ x_n & 1 \end{array} \Bigg] \bigg[ \begin{array}{c} a \\ b \end{array} \bigg] = \Bigg[ \begin{array}{c} y_1 \\ ... \\ y_n \end{array} \Bigg]
\pmb a = (\pmb X^T \; \pmb X)^{-1} \pmb X^T \; \pmb y
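
A minimal sketch (assuming NumPy and hypothetical toy data) that solves the normal equations above for the slope a and intercept b:

```python
import numpy as np

# Hypothetical toy data: y is roughly 2*x + 1 plus a little noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1])

# Design matrix with the feature column and a column of ones for the intercept
X = np.column_stack([x, np.ones_like(x)])

# Closed-form solution of the normal equations: a = (X^T X)^-1 X^T y
slope, intercept = np.linalg.solve(X.T @ X, X.T @ y)
print(slope, intercept)  # slope close to 2, intercept close to 1
```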

Lennard-Jones Potential

[back to top]

The Lennard-Jones potential describes the energy potential between two non-bonded atoms based on their distance to each other.

V = 4 \epsilon \bigg[ \bigg(\frac{\sigma}{r}\bigg)^{12} - \bigg(\frac{\sigma}{r}\bigg)^{6} \bigg]

V = intermolecular potential
σ = distance where V is 0
r = distance between atoms, measured from one center to the other
ε = interaction strength


Linear Discriminant Analysis

[back to top]

  • Within-class scatter matrix
S_W = \sum\limits_{i=1}^{c} S_i

Where:

S_i = \sum\limits_{\pmb x \in D_i}^n (\pmb x - \pmb m_i)\;(\pmb x - \pmb m_i)^T \text{ (scatter matrix for every class)}

and

\pmb m_i = \frac{1}{n_i} \sum\limits_{\pmb x \in D_i}^n \; \pmb x_k \text{ (mean vector)}
  • Between class scatter matrix
S_B = \sum\limits_{i=1}^{c} (\pmb m_i - \pmb m) (\pmb m_i - \pmb m)^T

Linear Discriminant Analysis (LDA)

[back to top]

A linear transformation technique (related to Principal Component Analysis) that is commonly used to project a dataset onto a new feature space or feature subspace, where the new component axes maximize the spread between multiple classes, or for classification of data.


Local Outlier Factor (LOF)

[back to top]

LOF is a density-based anomaly detection technique for outlier identification. The LOF for a point p refers to the average "reachability distance" towards its nearest neighbors. Eventually, the points with the largest LOF values (given a particular threshold) are identified as outliers.


Locality-sensitive hashing (LSH)

[back to top]

Locality-sensitive hashing (LSH) is a dimensionality reduction technique that groups objects that are likely similar (based on a similarity signature such as MinHash) into the same buckets in order to reduce the search space for pair-wise similarity comparisons. One application of LSH could be a combination with other dimensionality reduction techniques, e.g., MinHash, in order to reduce the computational costs of finding near-duplicate document pairs.


Logistic Regression

[back to top]

Logistic regression is a statistical model used for binary classification (binomial logistic regression) where class labels are mapped to "0" or "1" outputs. Logistic regression uses the logistic function (a general form of the sigmoid function), whose output ranges between 0 and 1.


M

Machine learning

[back to top]

A set of algorithmic instructions for discovering and learning patterns from data e.g., to train a classifier for a pattern classification task.


Mahalanobis distance

[back to top]

The Mahalanobis distance measure accounts for the covariance among variables by calculating the distance between a sample x and the sample mean μ in units of the standard deviation. The Mahalanobis distance becomes equal to the Euclidean distance for uncorrelated variables with equal variances.


MapReduce

[back to top]

MapReduce is a programming model for analyzing large datasets on distributed computer clusters, in which the task is divided into two steps, a map step and a reduce step. In the map step, the data are filtered by some criteria on each compute node; the filtered data are then shuffled and passed to the reduce function, which performs further analysis on each portion of the filtered data separately.


Markov chains

[back to top]

Markov chains (named after Andrey Markov) are mathematical systems that describe the transitioning between different states in a model. The transitioning from one state to the other (or back to itself) is a stochastic process.


Monte Carlo simulation

[back to top]

A Monte Carlo simulation is an iterative sampling method for solving deterministic models. Random numbers or variables from a particular probability distribution are used as input variables for uncertain parameters to compute the response variables.


Maximum Likelihood Estimates (MLE)

[back to top]

A technique to estimate the parameters that have been fit to a model by maximizing a known likelihood function. One common application is the estimation of "mean" and "variance" for a Gaussian distribution.

D = \left\{ \pmb x_1, \pmb x_2,..., \pmb x_n \right\}

The dataset D can be pictured as the probability of observing a particular sequence of patterns, where the probability of observing a particular pattern depends on θ, the parameters of the underlying (class-conditional) distribution. In order to apply MLE, we have to make the assumption that the samples are i.i.d. (independent and identically distributed).

p(D\; | \; \pmb \theta\;) = p(\pmb x_1 \; | \; \pmb \theta\;) \cdot p(\pmb x_2 \; | \;\pmb \theta\;) \cdot \; ... \; \cdot p(\pmb x_n \; | \; \pmb \theta\;) = \prod_{k=1}^{n} \; p(\pmb x_k \; | \; \pmb \theta \;)

Where θ is the parameter vector that contains the parameters of the particular distribution we want to estimate, and p(D | θ) is also called the likelihood of θ.

  • log-likelihood
p(D|\theta) = \prod_{k=1}^{n} p(x_k|\theta)
\Rightarrow l(\theta) = \sum_{k=1}^{n} ln \; p(x_k|\theta)
  • Differentiation
\nabla_{\pmb \theta} \equiv \begin{bmatrix} \frac{\partial \; }{\partial \; \theta_1} \\ \frac{\partial \; }{\partial \; \theta_2} \\ ... \\ \frac{\partial \; }{\partial \; \theta_p} \end{bmatrix}
\nabla_{\pmb \theta} l(\pmb\theta) \equiv \begin{bmatrix} \frac{\partial \; L(\pmb\theta)}{\partial \; \theta_1} \\ \frac{\partial \; L(\pmb\theta)}{\partial \; \theta_2} \\ ... \\ \frac{\partial \; L(\pmb\theta)}{\partial \; \theta_p} \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ ... \\ 0 \end{bmatrix}
  • parameter vector
\pmb \theta_i = \bigg[ \begin{array}{c} \theta_{i1} \\ \theta_{i2} \\ \end{array} \bigg] = \bigg[ \begin{array}{c} \pmb \mu_i \\ \pmb \Sigma_i \\ \end{array} \bigg]

Min-Max scaling

[back to top]

X_{norm} = \frac{X - X_{min}}{X_{max}-X_{min}}
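
A minimal sketch (assuming NumPy and a hypothetical toy feature column) that rescales the values into the unit range [0, 1]:

```python
import numpy as np

X = np.array([1.0, 5.0, 3.0, 9.0])  # hypothetical toy feature column

X_norm = (X - X.min()) / (X.max() - X.min())
print(X_norm)  # [0.   0.5  0.25 1.  ]
```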

MinHash

[back to top]

MinHash is a commonly used technique for dimensionality reduction in document similarity comparisons. The idea behind MinHash is to create a signature of reduced dimensionality while preserving the Jaccard similarity coefficient. A common implementation of MinHash is to generate k random permutations of the columns in an m × n document matrix (rows represent the sparse vectors of words for each document as binary data) and generate a new matrix of size m × k. The cells of the new matrix now contain the position labels of the first non-zero value for every document (1 column for each round of random permutation). Based on similarities of the position labels, the Jaccard coefficient for the pairs of documents can be calculated.


N

Naive Bayes Classifier

[back to top]

A classifier based on a statistical model (i.e., Bayes theorem: calculating posterior probabilities based on the prior probability and the so-called likelihood) in the field of pattern classification. Naive Bayes assumes that all attributes are conditionally independent, thereby, computing the likelihood is simplified to the product of the conditional probabilities of observing individual attributes given a particular class label.


N-grams

[back to top]

In context of natural language processing (NLP), a text is typically broken down into individual elements (see tokenization). N-grams describe the length of the individual elements where n refers to the number of words or symbols in every token. E.g., a unigram (or 1-gram) can represent a single word, and a bigram (or 2-gram) describes a token that consists of 2 words etc.


Non-parametric statistics

[back to top]

In contrast to parametric approaches, non-parametric statistics or approaches do not make prior assumptions about the underlying probability distribution of a particular variable or attribute.


Normal distribution (multivariate)

[back to top]

  • Probability density function
p(\pmb x) \sim N(\pmb \mu|\Sigma)
p(\pmb x) \sim \frac{1}{(2\pi)^{d/2} \; |\Sigma|^{1/2}} \exp \bigg[ -\frac{1}{2}(\pmb x - \pmb \mu)^t \Sigma^{-1}(\pmb x - \pmb \mu) \bigg]

Normal distribution (univariate)

[back to top]

  • Probability density function
p(x) \sim N(\mu|\sigma^2)
p(x) \sim \frac{1}{\sqrt{2\pi\sigma^2}} \exp{ \bigg[-\frac{1}{2}\bigg( \frac{x-\mu}{\sigma}\bigg)^2 \bigg] }

Normal Modes

[back to top]

Normal modes are the harmonic oscillations of a system of masses connected by springs, or roughly speaking "concerted motions," and all normal modes of a system are independent from each other. A classic example describes two masses connected by a middle spring, and each mass is connected to a fixed outer edge ( |m1~~m2|). The oscillation of this system where the middle spring does not move is defined as its normal mode.


Normalization - Min-Max scaling

[back to top]

A data pre-processing step (also often referred to as "Feature Scaling") for fitting features from different measurements within a certain range, typically the unit range from 0 to 1.


Normalization - Standard Scores

[back to top]

A data pre-processing step (also often just called "Standardization") for re-scaling features from different measurements to match proportions of a standard normal distribution (unit variance centered at mean=0).


O

Objective function

[back to top]

Objective functions are mathematical functions that are used for problem-solving and optimization tasks. Depending on the task, the objective function can be optimized through minimization (cost or loss functions) or maximization (reward functions). A typical application of an objective function in pattern classification tasks is to minimize the error rate of a classifier.


On-Line Learning

[back to top]

On-line learning is a machine learning architecture where the model is updated consecutively as new training data arrives, in contrast to batch learning, which requires the entire training dataset to be available upfront. On-line learning has the advantage that a model can be updated and refined over time to account for changes in the population of training samples. A popular example where on-line learning is beneficial is the task of spam detection.


On-Line Analytical Processing (OLAP)

[back to top]

On-Line Analytical Processing (OLAP) describes the general process of working with multidimensional arrays for exploratory analysis and information retrieval; often, OLAP is used to create summary data, e.g., via data aggregation across multiple dimensions or columns.


P

Parzen-Rosenblatt Window technique

[back to top]

A non-parametric kernel density estimation technique for probability densities of random variables if the underlying distribution/model is unknown. A so-called window function is used to count samples within hypercubes or Gaussian kernels of a specified volume to estimate the probability density.

\phi(\pmb u) = \Bigg[ \begin{array}{ll} 1 & \quad |u_j| \leq 1/2 \; ; \quad \quad j = 1, ..., d \\ 0 & \quad \text{otherwise} \end{array}

for a hypercube of unit length 1 centered at the coordinate system's origin. What this function basically does is assign the value 1 to a sample point if it lies within 1/2 of the edges of the hypercube, and 0 if it lies outside (note that the evaluation is done for all dimensions of the sample point).

If we extend on this concept, we can define a more general equation that applies to hypercubes of any edge length h_n that are centered at x:

k_n = \sum\limits_{i=1}^{n} \phi \bigg( \frac{\pmb x - \pmb x_i}{h_n} \bigg)

where:

\pmb u = \bigg( \frac{\pmb x - \pmb x_i}{h_n} \bigg)
  • probability density estimation with hypercube kernel
p_n(\pmb x) = \frac{1}{n} \sum\limits_{i=1}^{n} \frac{1}{h^d} \phi \bigg[ \frac{\pmb x - \pmb x_i}{h_n} \bigg]

where:

h^d = V_n \quad \text{and} \quad \phi \bigg[ \frac{\pmb x - \pmb x_i}{h_n} \bigg] = k
  • probability density estimation with Gaussian kernel
p_n(\pmb x) = \frac{1}{n} \sum\limits_{i=1}^{n} \frac{1}{h^d} \phi \Bigg[ \frac{1}{(\sqrt {2 \pi})^d h_{n}^{d}} \exp \; \bigg[ -\frac{1}{2} \bigg(\frac{\pmb x - \pmb x_i}{h_n} \bigg)^2 \bigg] \Bigg]

Pattern classification

[back to top]

The usage of patterns in datasets to discriminate between classes, i.e., to assign a class label to a new observation based on inference.


Perceptron

[back to top]

A (single-layer) perceptron is a simple Artificial Neural Network algorithm that consists of only two types of nodes: Input nodes and output nodes connected by weighted links. Perceptrons are being used as linear classifiers in supervised machine learning tasks.


Permissive transformations

[back to top]

Permissive transformations are transformations of data that do not change the "meaning" of the attributes, such as scaling or mapping. For example, the transformation of temperature measurements from a Celsius to a Kelvin scale would be a permissive transformation of a numerical attribute.


Poisson distribution (univariate)

[back to top]

  • Probability density function
p(x_k|\theta) = \frac{e^{-\theta}\theta^{x_k}}{x_k!}

Population mean

[back to top]

\mu = \frac{1}{N} \sum_{i=1}^N x_i

example mean vector:

\pmb{\mu_1} = \begin{bmatrix}0\\0\\0\end{bmatrix}

Power transform

[back to top]

Power transforms form a category of statistical transformation techniques that are used to transform non-normally distributed data to normality.


Principal Component Analysis (PCA)

[back to top]

A linear transformation technique that is commonly used to project a dataset (without utilizing class labels) onto a new feature space or feature subspace (for dimensionality reduction) where the new component axes are the directions that maximize the variance/spread of the data.

  • Scatter matrix
S = \sum\limits_{k=1}^n (\pmb x_k - \pmb m)\;(\pmb x_k - \pmb m)^T

where:

\pmb m = \frac{1}{n} \sum\limits_{k=1}^n \; \pmb x_k \text{ (mean vector)}

Precision and Recall

[back to top]

Precision (also called positive predictive value) and recall (synonymous with sensitivity) are two measures to assess the performance of a classifier if class label distributions are skewed. Precision is defined as the ratio of the number of relevant retrieved items to the total number of retrieved items, whereas recall is the fraction of all relevant items that are retrieved.


Predictive Modeling

[back to top]

Predictive modeling is a data mining task for predicting outcomes based on a statistical model that was built on previous observations (in contrast to descriptive modeling). Predictive modeling can be further divided into three sub-tasks: regression, classification, and ranking.


Proportion of Variance Explained

[back to top]

In the context of dimensionality reduction, the proportion of variance explained (PVE) describes how much of the total variance is captured by the new selected axes, for example, principal components or discriminant axes. It is computed by the sum of variance of new component axes divided by the total variance.


Purity Measure

[back to top]

In a cluster analysis with given truth cluster memberships (or classes), "purity" is used to assess the effectiveness of clustering. Purity is measured by assigning each cluster to the class that is maximally represented and computed via the weighted average of maximum number of samples from the same class in each cluster.


Q

Quantitative and qualitative attributes

[back to top]

Quantitative attributes are also often called "numeric"; those are attributes for which calculations and comparisons like ratios and intervals make sense (e.g., temperature in Celsius). Qualitative, or "categorical", attributes can be grouped into two subclasses: nominal and ordinal. Whereas ordinal attributes (e.g., street numbers) can be ordered, nominal attributes can only be distinguished by their category names (e.g., colors).


R

R-factor

[back to top]

The R-factor is one of several measures to assess the quality of a protein crystal structure. After building and refining an atomistic model of the crystal structure, the R-factor measures how well this model can describe the experimental diffraction patterns via the equation:

R = \frac{\sum ||F_{obs}| - |F_{calc}||}{\sum|F_{obs}|}

Random forest

[back to top]

Random forest is an ensemble classifier where multiple decision tree classifiers are learned and combined via the bagging technique. Unseen/test objects are then classified by taking the majority of votes from individual decision trees.


Rayleigh distribution (univariate)

[back to top]

  • Probability density function
p(x|\theta) = \Bigg\{ \begin{array}{c} 2\theta xe^{- \theta x^2},\quad \quad x \geq 0, \\ 0,\quad \text{otherwise.} \\ \end{array}

Receiver Operating Characteristic (ROC)

[back to top]

The Receiver Operating Characteristic (ROC, or ROC curve) is a quality measure for binary prediction algorithms by plotting the "False positive rate" vs. the "True positive rate" (sensitivity).


Regularization

[back to top]

Regularization is a technique to overcome overfitting by introducing a penalty term for model complexity. Usually, the penalty term is the squared sum of the model parameters, thereby promoting less complex models during training. Regularization may increase the training error but can potentially reduce the classification error on the test dataset.


Reinforcement learning

[back to top]

Reinforcement learning is a machine learning algorithm that learns from a series of actions by maximizing a "reward function". The reward function can either be maximized by penalizing "bad actions" and/or rewarding "good actions".


Rejection sampling

[back to top]

Rejection sampling is similar to the popular Monte Carlo sampling with the difference of an additional bound. The goal of rejection sampling is to simplify the task of drawing random samples from a complex probability distribution by using a uniform distribution instead; random samples drawn from the uniform distribution that lie outside certain boundary criteria are rejected, and all samples within the boundary are accepted, respectively.


Resubstitution error

[back to top]

The resubstitution error represents the classification error rate on the training dataset (the dataset that was used to train the classifier). The performance of a classifier cannot be directly deduced from resubstitution error alone, but it becomes a useful measure for calculating the generalization error.


Ridge Regression

[back to top]

Ridge regression is a regularized regression technique in which the squared sum of the model coefficients is used to penalize model complexity.


Rule-based classifier

[back to top]

Rule-based classifiers are classifiers that are based on one or more "IF ... THEN ..." rules. Rule-based classifiers are related to decision trees and can be extracted from the latter. If the requirements for a rule-based classifier (mutually exclusive: at most one rule per sample; exhaustive: at least one rule per sample) are violated, possible remedies include the addition of rules or the ordering of rules.


S

Sampling

[back to top]

Sampling is a data pre-processing procedure that is used to reduce the overall size of a dataset and to reduce computational costs by selecting a representative subset of the whole input dataset.


Sensitivity

[back to top]

Sensitivity (synonymous to recall), which is related to specificity -- in the context of error rate evaluation -- describes the "True Positive Rate" for a binary classification problem: the probability of making a correct prediction for a "positive/true" case (e.g., in an attempt to predict a disease, the disease is correctly predicted for a patient who truly has this disease). Sensitivity is calculated as TP/(TP+FN), where TP = True Positives, FN = False Negatives.


Sharding

[back to top]

Sharding is the non-redundant partitioning of a database into smaller databases; this process can also be understood as horizontal splitting. The rationale behind sharding is to divide a database among separate machines to avoid the storage or performance issues that come with growing database sizes.


Silhouette Measure (clustering)

[back to top]

Silhouette measure provides a metric to evaluate the performance of a clustering analysis. For each data point i, it measures the average distance of point i to all other points in the same cluster (a(i)), and the minimum distance to points from other clusters (b(i)). The average silhouette measures for each cluster can provide a visual way to pick the proper number of clusters.


Simple Matching Coefficient

[back to top]

The simple matching coefficient is a similarity measure for binary data and calculated by dividing the total number of matches by the total number of attributes. For asymmetric binary data, the related Jaccard coefficient is to be preferred in order to avoid highly similar scores.


Singular Value Decomposition (SVD)

[back to top]

Singular value decomposition (SVD) is a linear algebra technique that decomposes a matrix X into U D V^T, where U (left-singular vectors) and V (right-singular vectors) are both column-orthogonal and D is a diagonal matrix that contains the singular values. PCA is closely related to the right-singular vectors of SVD.


Soft classification

[back to top]

The general goal of pattern classification is to assign pre-defined class labels to particular observations. Typically, in (hard) classification, only one class label is assigned to every instance, whereas in soft classification, an instance can have multiple class labels. The degree to which an instance belongs to different classes is then defined by a so-called membership function.


Specificity

[back to top]

Specificity, which is related to sensitivity -- in the context of error rate evaluation -- describes the "True Negative Rate" for a binary classification problem: the probability of making a correct prediction for a "false/negative" case (e.g., in an attempt to predict a disease, no disease is predicted for a healthy patient). Specificity is calculated as TN/(FP+TN), where TN = True Negatives, FP = False Positives.


Standard deviation

[back to top]

\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^N (x_i - \mu)^2}

Stochastic Gradient Descent (SGD)

[back to top]

Stochastic Gradient Descent (SGD) (also see Gradient Descent) is a machine learning algorithm that seeks to minimize an objective (or cost) function and can be grouped into the category of linear classifiers for supervised learning tasks. In contrast to Batch Gradient Descent, the gradient is computed from a single sample.


Supervised Learning

[back to top]

The problem of inferring a mapping between the input space X and a target variable y when given labelled training data (i.e. (X,y) pairs). Encompasses the problems of classification (categorical y) and regression (continuous y).


Support Vector Machine

[back to top]

SVM is a classification method that tries to find the hyperplane which separates the classes with the highest margin. The margin is defined as the minimum distance from the sample points to the hyperplane. The sample point(s) that form the margin are called support vectors and eventually establish the SVM model.


T

Term frequency and document frequency

[back to top]

Term frequency and document frequency are commonly used measures in context of text classification tasks. Term frequency is the count of how often a particular word occurs in a particular document. In contrast, document frequency measures the presence or absence of a particular word in a document as a binary value. Thus, for a single document, the document frequency is either 1 or 0.


Term frequency - inverse document frequency, Tf-idf

[back to top]

Term frequency - inverse document frequency (Tf-idf) is a weighting scheme for term frequencies and document frequencies in text classification tasks that favors terms that occur in relatively few documents. The Tf-idf is calculated as a simple product of the term frequency and the inverse document frequency, where the latter is calculated as log("number of documents in total" / "number of documents that contain a particular term").
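
A minimal sketch (assuming a hypothetical toy corpus of tokenized documents) that computes the Tf-idf weight as the product of the term frequency and the logarithmic inverse document frequency:

```python
import math

# Hypothetical toy corpus of tokenized documents
docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]

def tf_idf(term, doc, docs):
    tf = doc.count(term)                              # term frequency in this document
    n_containing = sum(1 for d in docs if term in d)  # number of documents containing the term
    idf = math.log(len(docs) / n_containing)          # inverse document frequency
    return tf * idf

print(tf_idf("cat", docs[0], docs))  # positive weight: "cat" occurs in only 2 of 3 documents
print(tf_idf("the", docs[0], docs))  # 0.0: "the" occurs in every document
```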


Tokenization

[back to top]

Tokenization, in the context of natural language processing (NLP) is the process of breaking down a text into individual elements which can consist of words or symbols. Tokenization is usually accompanied by other processing procedures such as stemming, the removal of stop words, or the creation of n-grams.


U

Unsupervised Learning

[back to top]

The problem of inferring latent structure in data when not given any training cases. Encompasses the problems of clustering, dimensionality reduction and density estimation.


V

Variance

[back to top]

\sigma_{x}^{2} = \sum_{i=1}^{n} (x_i - \bar{x})^2

W

White noise

[back to top]

White noise is a source that produces random, statistically independent variables following a particular distribution. In the field of sound processing, white noise is also often referred to as a mixture of tones or sounds of different frequencies.


Whitening transformation

[back to top]

Whitening transformation is a normalization procedure to de-correlate the samples in a dataset if the covariance matrix is not a diagonal matrix. Features are uncorrelated after "whitening" and their variances equal unity; thus the covariance matrix becomes an identity matrix.


Z

Z-score

[back to top]

z = \frac{x - \mu}{\sigma}
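
A minimal sketch (assuming NumPy and a hypothetical toy feature column); the standardized values have zero mean and unit standard deviation:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])  # hypothetical toy feature column

z = (x - x.mean()) / x.std()
print(z)                   # [-1.342 -0.447  0.447  1.342]
print(z.mean(), z.std())   # 0.0 1.0
```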