What is label-propagation classification?

Traditionally, machine learning techniques fall into two categories: unsupervised learning and supervised learning.

Unsupervised learning uses only unlabeled samples, while supervised learning learns only from labeled samples.

In many practical problems, however, only a small amount of labeled data is available, because labeling data can be expensive. In biology, for example, the structural analysis or functional identification of a particular protein can take biologists many years, while large amounts of unlabeled data are readily available.

This has led to the rapid development of semi-supervised learning techniques, which can use labeled and unlabeled samples simultaneously.

 

A brief description of semi-supervised learning theory:

Semi-supervised learning works with two sample sets, one labeled and one unlabeled:

Labeled = {(xi, yi)}, Unlabeled = {xj}, with |L| << |U|.

1. Using only the labeled samples, we can train a supervised classification algorithm.

2. Using only the unlabeled samples, we can run an unsupervised clustering algorithm.

3. Using both, we hope to add the unlabeled samples to (1) to strengthen supervised classification, and to add the labeled samples to (2) to strengthen unsupervised clustering.

 

In general, semi-supervised learning focuses on adding unlabeled samples to a supervised classification algorithm to achieve semi-supervised classification; that is, unlabeled samples are added to (1) to improve the classification performance.

 

The motivation for semi-supervised learning

Whenever someone gives a talk, they keep bringing up the word motivation; in one afternoon it was stressed four or five times that a paper has to have motivation. So let's talk about the motivation for semi-supervised learning.

1. Labeled samples are difficult to obtain.

Labeling requires specialized personnel, special equipment, extra cost, and so on.

2. Unlabeled samples are comparatively cheap.

3. Another point is the promising future of machine learning.

 

The difference between semi-supervised learning and transductive learning:

This is also discussed online. The main point is that semi-supervised learning is inductive: the learned model can be applied to broader, unseen samples, whereas transductive learning only classifies the unlabeled samples currently at hand.

Simply put, the former uses unlabeled samples to help classify other samples in the future.

The latter is used only to classify the given, limited set of unlabeled samples.

 

The following figures clearly illustrate the advantage of semi-supervised learning:

 

In the figure above there are only two labeled samples, X and O; the remaining green dots are unlabeled. After the unlabeled samples are added, the original decision boundary shifts from 0 to 0.5, which fits the actual sample distribution better.

 

Semi-supervised classification learning algorithms:

1. Self-training algorithms

2. Generative models

3. S3VMs (semi-supervised support vector machines)

4. Graph-based methods

5. Multi-view learning

6. Other methods

 

The algorithms above are briefly introduced below.

 

Self-training algorithm:

As before, there are two sample sets: Labeled = {(xi, yi)}; Unlabeled = {xj}.

Execute the following algorithm:

Repeat:

1. Use L to train a classifier F;

2. Use F to classify U and compute the error (confidence) for each sample;

3. Choose the subset u of U with the smallest error, label it with F, and set L = L + u, U = U − u.

Repeat the above steps until U is empty.

 

In the algorithm above, the most reliably classified samples from U are continually added to L, the classifier F is retrained on the enlarged set, and in the end the best F is obtained.
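To make this concrete, here is a minimal self-training sketch in Python. It is only a sketch under assumptions: scikit-learn's LogisticRegression stands in for the base classifier F, and the confidence threshold is an illustrative choice, not part of the algorithm above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_training(X_labeled, y_labeled, X_unlabeled, threshold=0.9):
    """Minimal self-training loop: train F on L, move the most confidently
    classified samples from U into L, retrain, and repeat."""
    L_X, L_y = X_labeled.copy(), y_labeled.copy()
    U = X_unlabeled.copy()
    F = LogisticRegression().fit(L_X, L_y)          # 1. train F on L
    while len(U) > 0:
        proba = F.predict_proba(U)                  # 2. classify U
        conf = proba.max(axis=1)
        mask = conf >= threshold                    # 3. subset u with "small error"
        if not mask.any():                          # nothing confident enough: stop
            break
        L_X = np.vstack([L_X, U[mask]])             # L = L + u
        L_y = np.concatenate([L_y, F.classes_[proba[mask].argmax(axis=1)]])
        U = U[~mask]                                # U = U - u
        F = LogisticRegression().fit(L_X, L_y)      # retrain F on the enlarged L
    return F
```

Here "small error" is read as "high predicted confidence", which is the usual practical stand-in when the true error on U is unknown.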

 

A concrete example of self-training: the nearest neighbor algorithm

Let d(x1, x2) be the Euclidean distance between two samples, and execute the following algorithm:

Repeat:

1. Use L to train a classifier F;

2. Choose x* = argmin_{x∈U} min_{x0∈L} d(x, x0), i.e., select the unlabeled sample closest to any labeled sample;

3. Use F to assign x* the label F(x*);

4. Add (x*, F(x*)) to L and remove x* from U.

Repeat the above steps until U is empty.

 

In the algorithm above, self-training's "smallest error" is defined via Euclidean distance: the closest unlabeled sample is taken as the best-performing (most reliable) one, F assigns it a label, it is added to L, and F is dynamically updated.
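A sketch of this nearest-neighbour variant is shown below (plain numpy; here the "classifier" F is simply the label of the nearest labeled neighbour, which is one way to read the description above and is an assumption of this sketch):

```python
import numpy as np

def nn_self_training(X_labeled, y_labeled, X_unlabeled):
    """Nearest-neighbour self-training: repeatedly pick the unlabeled sample
    closest to the labeled set, give it its nearest neighbour's label, and
    move it into L until U is empty."""
    L_X, L_y = list(X_labeled), list(y_labeled)
    U = list(X_unlabeled)
    while U:
        # pairwise Euclidean distances d(x, x0) for x in U, x0 in L
        D = np.linalg.norm(np.array(U)[:, None, :] - np.array(L_X)[None, :, :], axis=2)
        i = D.min(axis=1).argmin()       # x* = argmin_{x in U} min_{x0 in L} d(x, x0)
        j = D[i].argmin()                # index of x*'s nearest labeled neighbour
        L_X.append(U.pop(i))             # add (x*, F(x*)) to L, remove x* from U
        L_y.append(L_y[j])               # F(x*) = label of the nearest neighbour
    return np.array(L_X), np.array(L_y)
```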

 

This algorithm is illustrated below:

 

Starting from the two labeled points, the figure above keeps adding the nearest neighbours and keeps updating F.

The above algorithm usually works well, but of course there are also cases where it performs poorly, as shown below:

 

Generative model

Generative algorithms rest on distributional assumptions. For example, we may assume the samples follow Gaussian distributions and then use maximum likelihood to fit them; a commonly used model of this kind is the Gaussian mixture model (GMM).

A brief introduction to the Gaussian mixture model:

Assume the samples are distributed as follows:

We assume they follow two Gaussian distributions.

The parameters of the Gaussian mixture are: θ = {w1, w2, μ1, μ2, Σ1, Σ2}

Using the idea of maximum likelihood, maximize:

p(x, y | θ) = p(y | θ) p(x | y, θ)

This yields the following fitted distribution:

 

Incidentally, here is a link to a blog post introducing the Gaussian mixture model:

http://blog.csdn.net/zouxy09/article/details/8537620

Next comes our semi-supervised generative algorithm:

The sample distribution is as follows:

According to the algorithm, the following distribution is obtained:


Comparing these two figures illustrates the difference between the (supervised) Gaussian mixture model and the semi-supervised generative model:

 

The likelihood functions of the two methods differ: the former maximizes only the probability of the labeled samples, while the latter also adds the probability of the unlabeled samples.
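Written out with the notation above (and the usual mixture factorization), the two objectives are roughly:

```latex
% Supervised: likelihood of the labeled samples only
\log p(L \mid \theta) = \sum_{(x_i, y_i) \in L} \log\, p(y_i \mid \theta)\, p(x_i \mid y_i, \theta)

% Semi-supervised: unlabeled samples contribute their marginal probability
\log p(L, U \mid \theta) = \sum_{(x_i, y_i) \in L} \log\, p(y_i \mid \theta)\, p(x_i \mid y_i, \theta)
  + \sum_{x_j \in U} \log \sum_{y} p(y \mid \theta)\, p(x_j \mid y, \theta)
```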

The specific implementation can be carried out with the EM algorithm.
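A minimal EM sketch for a two-component semi-supervised GMM follows (numpy/scipy). The binary 0/1 labels, the initialization from the labeled samples, the fixed iteration count, and having at least a few labeled samples per class are all illustrative assumptions of this sketch.

```python
import numpy as np
from scipy.stats import multivariate_normal

def semi_supervised_gmm(X_l, y_l, X_u, n_iter=50):
    """EM for a 2-component GMM: labeled samples keep fixed (one-hot)
    responsibilities, unlabeled samples get soft responsibilities."""
    X = np.vstack([X_l, X_u])
    n, d = X.shape
    K = 2
    R = np.zeros((n, K))                      # responsibilities
    R[np.arange(len(y_l)), y_l] = 1.0         # labeled rows: fixed one-hot
    # initialize parameters from the labeled samples only
    w   = np.array([np.mean(y_l == k) for k in range(K)])
    mu  = np.array([X_l[y_l == k].mean(axis=0) for k in range(K)])
    Sig = np.array([np.cov(X_l[y_l == k].T) + 1e-6 * np.eye(d) for k in range(K)])
    for _ in range(n_iter):
        # E-step: soft responsibilities for the unlabeled block only
        P = np.column_stack([w[k] * multivariate_normal.pdf(X_u, mu[k], Sig[k])
                             for k in range(K)])
        R[len(y_l):] = P / P.sum(axis=1, keepdims=True)
        # M-step: weighted updates over all samples (labeled + unlabeled)
        Nk = R.sum(axis=0)
        w  = Nk / n
        mu = (R.T @ X) / Nk[:, None]
        for k in range(K):
            Xc = X - mu[k]
            Sig[k] = (R[:, k][:, None] * Xc).T @ Xc / Nk[k] + 1e-6 * np.eye(d)
    return w, mu, Sig
```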

The generative algorithm also has cases where it performs poorly:

For example, suppose the original is distributed as follows:


The result obtained with a GMM fit looks like this:


Some things to consider:

1. Local convergence of the Gaussian mixture (EM may get stuck in a local optimum).

2. Reduce the weight of the unlabeled samples.

 

Semi-supervised SVMs, graph-based models, manifold models, etc.

The theory of SVMs is not repeated here; the idea is to find an optimal separating hyperplane:

Here is a borrowed, very illustrative SVM picture:

This topic is broad and cannot be covered within a single post; if you are interested, you can look into it further.

 

Finally, a summary.

Two final pictures:

Semi-supervised learning method based on a generative model

This type of method usually treats the class labels of the unlabeled samples as missing data and then uses the EM (expectation-maximization) algorithm to perform maximum-likelihood estimation of the generative model's parameters. Different methods differ mainly in which generative model is chosen as the base classifier, for example a Gaussian mixture [3], a mixture of experts [1], or naive Bayes [4]. Generative-model-based semi-supervised learning is simple and intuitive and can outperform discriminative models when training samples, especially labeled ones, are very scarce. However, if the model assumptions do not match the data distribution, using a large amount of unlabeled data to estimate the model parameters can actually degrade generalization ability [5]. Since finding a suitable generative model to fit the data requires a lot of domain knowledge, the practical applicability of generative-model-based semi-supervised learning is limited.

 

Semi-supervised learning method based on low-density separation

In this type of method, the decision boundary is required to pass through low-density regions of the data as far as possible, so that densely clustered data points are not split across the two sides of the boundary. Based on this idea, Joachims [6] proposed the TSVM algorithm (as shown in Figure 2, where the solid line is the TSVM decision boundary and the dashed line is the SVM boundary obtained without unlabeled data). During training, TSVM first uses the labeled data to train an SVM and estimates labels for the unlabeled data; then, under the margin-maximization criterion, it iteratively swaps the labels of samples on opposite sides of the decision boundary to enlarge the margin and updates the current prediction model accordingly, so as to classify the labeled data as correctly as possible while "pushing" the decision boundary toward regions where the data are relatively sparse. However, the TSVM loss function is non-convex, so the learning process can fall into local minima, which hurts generalization. To address this, various TSVM variants have been proposed to reduce the influence of the non-convex loss on the optimization; typical methods include deterministic annealing [7] and direct optimization via CCCP [8]. In addition, the idea of low-density separation has also been used to design semi-supervised learning methods other than TSVM, such as entropy regularization, which forces the learned decision boundary to avoid data-dense regions [9].

Figure 2: TSVM algorithm illustration [6]
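For intuition, here is a heavily simplified sketch of the label-swapping idea in Python. It is not Joachims' full TSVM: scikit-learn's SVC, the lower weight C_u for unlabeled samples, the {-1, +1} labels, and swapping only one pair per pass are all simplifying assumptions of this sketch.

```python
import numpy as np
from sklearn.svm import SVC

def simple_tsvm(X_l, y_l, X_u, C=1.0, C_u=0.1, max_outer=50):
    """Sketch of TSVM-style label swapping: pseudo-label U with an SVM trained
    on L, then repeatedly swap a (+1, -1) pair of pseudo-labels whose combined
    margin violation exceeds 2, retraining after each swap."""
    clf = SVC(kernel='linear', C=C).fit(X_l, y_l)
    y_u = clf.predict(X_u)                               # initial pseudo-labels
    X = np.vstack([X_l, X_u])
    # sample_weight scales C per sample: labeled samples get C, unlabeled get C_u
    w = np.concatenate([np.full(len(y_l), 1.0), np.full(len(y_u), C_u / C)])
    for _ in range(max_outer):
        y = np.concatenate([y_l, y_u])
        clf = SVC(kernel='linear', C=C).fit(X, y, sample_weight=w)
        slack = np.maximum(0.0, 1.0 - y_u * clf.decision_function(X_u))
        pos = np.where((y_u > 0) & (slack > 0))[0]       # margin-violating positives
        neg = np.where((y_u < 0) & (slack > 0))[0]       # margin-violating negatives
        if len(pos) == 0 or len(neg) == 0:
            break
        i, j = pos[slack[pos].argmax()], neg[slack[neg].argmax()]
        if slack[i] + slack[j] <= 2.0:                   # swap condition from [6]
            break
        y_u[i], y_u[j] = -1, 1                           # swap the worst-violating pair
    return clf
```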

Graph-based semi-supervised learning method

This type of method builds a data graph over both the labeled and unlabeled data and propagates labels from labeled data points to unlabeled ones according to the adjacency relations in the graph (as shown in Figure 3, where the light gray and black nodes are labeled samples of different classes and the hollow nodes are unlabeled samples). According to how the labels are propagated, graph-based semi-supervised learning methods can be divided into two categories. One type performs explicit label propagation by defining a propagation scheme that satisfies certain properties, such as label propagation via Gaussian random fields and harmonic functions [10], or label propagation based on global and local consistency [11]. The other type defines a regularizer on the graph that forces neighboring nodes to have similar outputs, thereby implicitly propagating labels from labeled to unlabeled samples [12]. In practice, the choice of label-propagation scheme affects learning performance far less than how the data graph itself is constructed: when the structure of the data graph deviates from the intrinsic structure of the data, it is difficult to obtain satisfactory results no matter which propagation scheme is used. However, building a data graph that reflects the internal relations of the data often requires considerable domain knowledge. Fortunately, in some cases a more robust data graph can still be obtained by exploiting the properties of the data; for example, when the data graph does not satisfy the metric axioms, the non-metric graph can be decomposed into several metric graphs on which labels are propagated separately, eliminating the negative influence of the non-metric graph on label propagation [13]. Graph-based semi-supervised learning has a solid mathematical foundation, but since the time complexity of most of its learning algorithms is O(n^3), it is difficult to meet the needs of semi-supervised learning on large amounts of unlabeled data.

Figure 3: Schematic illustration of label propagation
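As an illustration of explicit label propagation in the first category, here is a minimal numpy sketch of the basic iteration. The RBF graph, the kernel width, the fixed iteration count, and the convention y = -1 for unlabeled samples are assumptions of this sketch, not prescribed by the text.

```python
import numpy as np

def label_propagation(X, y, sigma=1.0, n_iter=1000):
    """Basic graph-based label propagation: build an RBF affinity graph over all
    samples, then iterate F <- T F while clamping the labeled rows.
    y uses -1 for unlabeled samples and 0..K-1 for labeled ones."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    n = len(X)
    labeled = y >= 0
    K = int(y[labeled].max()) + 1
    # affinity matrix W and row-normalized transition matrix T = D^-1 W
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-D2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    T = W / W.sum(axis=1, keepdims=True)
    # label matrix F: one-hot for labeled rows, zeros for unlabeled rows
    F = np.zeros((n, K))
    F[labeled, y[labeled].astype(int)] = 1.0
    clamp = F[labeled].copy()
    for _ in range(n_iter):
        F = T @ F              # spread labels along the graph edges
        F[labeled] = clamp     # clamp labeled samples to their true labels
    return F.argmax(axis=1)    # predicted label for every node
```

scikit-learn's LabelPropagation and LabelSpreading classes provide ready-made versions of this family of methods.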

Semi-supervised co-training algorithm

These methods assume that the data set has two sufficient and redundant views, i.e., two attribute sets satisfying the following conditions: first, each attribute set is by itself sufficient to describe the problem, meaning that with enough training examples a strong learner can be trained on either attribute set alone; second, given the class label, each attribute set is conditionally independent of the other. During co-training, each classifier selects from the unlabeled examples a number of examples it can label with high confidence (i.e., for which it is likely to assign the correct label), labels them, and adds them to the labeled training set of the other classifier, so that the other classifier can be updated with these newly labeled examples. The co-training process is repeated iteratively until some stopping condition is reached.

Such algorithms implicitly rely on the cluster assumption or the manifold assumption. They use two or more learners; during learning, these learners select several unlabeled examples they are highly confident about and label them for each other, so that the models can be updated.
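A compact co-training sketch under these assumptions is shown below. The two Gaussian naive Bayes learners (one per view), the number of rounds, and the number of examples added per round are illustrative choices of this sketch.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def co_training(X1_l, X2_l, y_l, X1_u, X2_u, n_rounds=10, per_round=5):
    """Co-training: two classifiers, one per view; in each round, each classifier
    labels the unlabeled examples it is most confident about and adds them to the
    shared labeled set used by the other classifier."""
    X1_l, X2_l, y_l = list(X1_l), list(X2_l), list(y_l)
    idx_u = list(range(len(X1_u)))                       # indices of still-unlabeled examples
    clf1, clf2 = GaussianNB(), GaussianNB()
    for _ in range(n_rounds):
        if not idx_u:
            break
        clf1.fit(np.array(X1_l), np.array(y_l))
        clf2.fit(np.array(X2_l), np.array(y_l))
        for clf, X_view in ((clf1, X1_u), (clf2, X2_u)):
            if not idx_u:
                break
            P = clf.predict_proba(np.array([X_view[i] for i in idx_u]))
            best = P.max(axis=1).argsort()[::-1][:per_round]     # most confident examples
            for b in sorted(best, reverse=True):                 # pop from the back first
                i = idx_u[b]
                X1_l.append(X1_u[i]); X2_l.append(X2_u[i])       # add to both views' labeled set
                y_l.append(clf.classes_[P[b].argmax()])
                idx_u.pop(b)
    return clf1, clf2
```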