
The nearest neighbor method: how it works

The nearest neighbor method is the simplest metric classifier; it is based on evaluating the similarity between objects.

The analyzed object is assigned to the class of the closest objects in the training sample. Let us find out what the nearest neighbor method is and work through examples of several related techniques.

Hypothesis of the method

The nearest-neighbor method is perhaps the most common classification algorithm. The object being classified is assigned to the class y_i of the closest training-sample object x_i.
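As a minimal sketch, a 1-NN classifier can be written in a few lines of Python; here the Euclidean distance stands in for the similarity measure $\rho$, and the sample data are invented purely for the example:

```python
import math

def nearest_neighbor_classify(train, u):
    """Assign u the class of the single closest training object (1-NN).

    train is a list of (x_i, y_i) pairs, where x_i is a feature vector
    and y_i its class label; Euclidean distance plays the role of rho.
    """
    def rho(a, b):
        return math.dist(a, b)  # Euclidean distance as the similarity model
    x_best, y_best = min(train, key=lambda pair: rho(pair[0], u))
    return y_best

sample = [((0.0, 0.0), "A"), ((0.1, 0.2), "A"), ((5.0, 5.0), "B")]
print(nearest_neighbor_classify(sample, (0.2, 0.1)))  # → A
```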

The method of k nearest neighbors

The method of k nearest neighbors improves the reliability of classification: the analyzed object is assigned to the class of the majority of its k nearest neighbors in the training sample. In two-class problems, k is chosen odd to rule out the ambiguous situation in which equal numbers of neighbors belong to the two classes.
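The majority vote over k neighbors can be sketched as follows (the Euclidean distance and the toy sample are assumptions of this example):

```python
import math
from collections import Counter

def knn_classify(train, u, k=3):
    """Majority vote among the k training objects closest to u.

    For two-class problems an odd k avoids ties, as noted above.
    """
    neighbors = sorted(train, key=lambda pair: math.dist(pair[0], u))[:k]
    votes = Counter(y for _, y in neighbors)
    return votes.most_common(1)[0][0]

sample = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
          ((5, 5), "B"), ((5, 6), "B")]
print(knn_classify(sample, (0.5, 0.5), k=3))  # → A
```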

The technique of weighted neighbors

The weighted nearest-neighbors method is used when there are three or more classes, so choosing an odd k no longer rules out ties. Ambiguity can still arise in these cases. Then the i-th neighbor is given a weight w_i that decreases as the neighbor's rank i grows, and the object is assigned to the class with the maximum total weight among the near neighbors.
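One common choice of decreasing weights, used here purely as an illustration, is the geometric weight w_i = q^i with 0 < q < 1; the class with the largest total weight wins:

```python
import math
from collections import defaultdict

def weighted_knn_classify(train, u, k=5, q=0.7):
    """Weighted k-NN: the i-th neighbor gets weight w_i = q**i (0 < q < 1),
    so the weight decreases with the neighbor's rank and ties become unlikely.
    The predicted class maximizes the total weight over the k neighbors.
    """
    neighbors = sorted(train, key=lambda pair: math.dist(pair[0], u))[:k]
    score = defaultdict(float)
    for i, (_, y) in enumerate(neighbors, start=1):
        score[y] += q ** i
    return max(score, key=score.get)

sample = [((0, 0), "A"), ((0, 1), "A"),
          ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B")]
print(weighted_knn_classify(sample, (0.2, 0.3), k=5))  # → A
```

Even though three of the five neighbors are of class "B", the two "A" objects are much closer and carry larger weights, so "A" wins.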

Hypothesis of compactness

At the heart of all the above methods lies the compactness hypothesis: the more similar two objects are, the more likely they belong to the same class. Under this hypothesis the boundary between classes has a simple form, and the classes form compact regions in the space of objects. In mathematical analysis, such regions are customarily taken to mean closed bounded sets; the hypothesis has nothing to do with the everyday meaning of the word "compact".

The basic formula

Let us analyze the nearest neighbor method in more detail. Suppose a training sample of "object-answer" pairs X^m = \{(x_1, y_1), \dots, (x_m, y_m)\} is given, and a distance function \rho(x, x') is defined on the set of objects. This function serves as an adequate model of object similarity: the larger its value, the less similar the objects x and x'.

For any object u, arrange the training-sample objects in order of increasing distance to u:

\rho(u, x_{1;u}) \leq \rho(u, x_{2;u}) \leq \cdots \leq \rho(u, x_{m;u}),

where x_{i;u} denotes the training-sample object that is the i-th neighbor of u; similarly, y_{i;u} denotes the answer on the i-th neighbor. Thus every object u induces its own renumbering of the sample.
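This renumbering can be expressed directly as a sort by distance (Euclidean distance is assumed here for illustration):

```python
import math

def order_by_distance(train, u):
    """Renumber the training sample for a query object u: the i-th element
    of the returned list is the pair (x_{i;u}, y_{i;u}), the i-th nearest
    neighbor of u, so that rho(u, x_{1;u}) <= rho(u, x_{2;u}) <= ... holds.
    """
    return sorted(train, key=lambda pair: math.dist(pair[0], u))

sample = [((5, 5), "B"), ((0, 0), "A"), ((0, 1), "A")]
u = (0.2, 0.1)
ordered = order_by_distance(sample, u)
print([y for _, y in ordered])  # → ['A', 'A', 'B']
```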

Determination of the number of neighbors k

At k = 1 the nearest neighbor method is prone to erroneous classification, not only on outlier objects but also on objects of other classes that happen to lie nearby.

If we take k = m, the algorithm is maximally stable but degenerates into a constant. For reliability it is therefore important to avoid extreme values of k.

In practice, the optimal k is chosen by the leave-one-out ("sliding control") criterion.
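A sketch of choosing k by leave-one-out control, again assuming Euclidean distance and a toy two-cluster sample:

```python
import math
from collections import Counter

def loo_error(train, k):
    """Leave-one-out ('sliding control') error of k-NN on the sample:
    each object is classified by the remaining m - 1 objects."""
    errors = 0
    for j, (xj, yj) in enumerate(train):
        rest = train[:j] + train[j + 1:]
        neighbors = sorted(rest, key=lambda p: math.dist(p[0], xj))[:k]
        predicted = Counter(y for _, y in neighbors).most_common(1)[0][0]
        errors += predicted != yj
    return errors / len(train)

def best_k(train, k_values):
    """Pick the k with the smallest leave-one-out error."""
    return min(k_values, key=lambda k: loo_error(train, k))

train = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
         ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B")]
print(best_k(train, [1, 3, 5]))  # k = 5 degenerates toward a constant here
```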

Removing outliers

Training objects are generally not of equal value. Among them are objects that carry the characteristic features of their class; these are called etalons (standards). The closer an object is to such an ideal sample, the higher the probability that it belongs to the given class.

How effective is the nearest neighbors method? Consider peripheral and uninformative categories of objects. An object densely surrounded by other representatives of its own class adds little; removing such objects from the sample does not hurt classification quality.

The sample may also contain a certain number of noisy outliers, objects sitting "in the thick of" another class. Removing them generally has a positive effect on the quality of the classification.

If uninformative and noisy objects are eliminated from the sample, several benefits can be expected at once.

First of all, the quality of classification improves; in addition, the amount of stored data is reduced, and so is the classification time spent searching for the nearest etalons.
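One simple way to thin the sample, shown here as an illustrative stand-in for full etalon selection, is a Wilson-style editing rule: drop every object that its own k nearest neighbors (among the remaining objects) would misclassify:

```python
import math
from collections import Counter

def edit_sample(train, k=3):
    """Keep x_j only if the majority vote of its k nearest neighbors among
    the *other* objects matches its own class; noisy outliers sitting deep
    inside another class fail this check and are dropped."""
    kept = []
    for j, (xj, yj) in enumerate(train):
        rest = train[:j] + train[j + 1:]
        neighbors = sorted(rest, key=lambda p: math.dist(p[0], xj))[:k]
        vote = Counter(y for _, y in neighbors).most_common(1)[0][0]
        if vote == yj:
            kept.append((xj, yj))
    return kept

train = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
         ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B"),
         ((5.1, 5.1), "A")]  # an "A" outlier lodged inside class B
cleaned = edit_sample(train, k=3)
print(len(cleaned))  # → 6: the outlier is removed
```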

Application to very large samples

The nearest-neighbor method is based on storing the training objects themselves, so very large samples raise technical problems. The task is not only to store a significant amount of information, but also to find the k nearest neighbors of an arbitrary object u in minimal time.

In order to cope with the task, two methods are used:

  • Thin out the sample by discarding uninformative objects;
  • Apply efficient data structures and indexes for fast nearest-neighbor search.
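As a sketch of the second approach, a minimal k-d tree supports nearest-neighbor search without scanning the whole sample; this toy implementation is an assumption of the example, not a production index:

```python
import math

def build_kdtree(points, depth=0):
    """Minimal k-d tree over (vector, label) pairs; splits cycle over axes."""
    if not points:
        return None
    axis = depth % len(points[0][0])
    points = sorted(points, key=lambda p: p[0][axis])
    mid = len(points) // 2
    return {"point": points[mid], "axis": axis,
            "left": build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid + 1:], depth + 1)}

def nn_search(node, u, best=None):
    """Return the stored (vector, label) pair whose vector is nearest to u."""
    if node is None:
        return best
    x, _ = node["point"]
    if best is None or math.dist(u, x) < math.dist(u, best[0]):
        best = node["point"]
    axis = node["axis"]
    near, far = ((node["left"], node["right"]) if u[axis] < x[axis]
                 else (node["right"], node["left"]))
    best = nn_search(near, u, best)
    # Descend into the far half-space only if it can hold a closer point.
    if abs(u[axis] - x[axis]) < math.dist(u, best[0]):
        best = nn_search(far, u, best)
    return best

points = [((2.0, 3.0), "A"), ((5.0, 4.0), "B"), ((9.0, 6.0), "B"),
          ((4.0, 7.0), "A"), ((8.0, 1.0), "B"), ((7.0, 2.0), "B")]
tree = build_kdtree(points)
print(nn_search(tree, (9.0, 2.0)))  # → ((8.0, 1.0), 'B')
```

For low-dimensional data this prunes most of the tree on each query; in production one would reach for a library structure such as `scipy.spatial.cKDTree` instead.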

Choosing the distance function

Classification was considered above under the assumption that the distance function \rho(x, x') is known beforehand; the nearest neighbor method is used for practical problems where this is the case. When objects are described by numerical vectors, the Euclidean metric is commonly used. This choice has no special justification, but it implies that all features are measured "on a single scale". If this is not taken into account, the metric will be dominated by the feature with the largest numerical values.
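A sketch of putting features "on a single scale": min-max rescaling of every feature to [0, 1] before the Euclidean metric is applied (the helper name is our own):

```python
def min_max_scale(vectors):
    """Rescale each feature to [0, 1] so that no single large-valued
    feature dominates the Euclidean metric."""
    dims = len(vectors[0])
    lo = [min(v[d] for v in vectors) for d in range(dims)]
    hi = [max(v[d] for v in vectors) for d in range(dims)]
    return [tuple((v[d] - lo[d]) / (hi[d] - lo[d]) if hi[d] > lo[d] else 0.0
                  for d in range(dims))
            for v in vectors]

# Feature 2 spans 100..200 and would swamp feature 1 (0..10) unscaled.
scaled = min_max_scale([(0, 100), (10, 200), (5, 150)])
print(scaled)  # → [(0.0, 0.0), (1.0, 1.0), (0.5, 0.5)]
```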

With a significant number of features, computing the distance as a sum of per-feature deviations runs into a serious dimensionality problem.

In a high-dimensional space all objects are far from one another, so in the end the set of k neighbors "closest" to the object under study is effectively arbitrary. To eliminate this problem, a small number of informative features are selected. Estimate-calculation algorithms are built on different subsets of features, constructing a proximity function for each subset.

Conclusion

Mathematical problems are often attacked with a variety of techniques, each with its own distinctive characteristics, advantages, and disadvantages. The nearest neighbors method considered here solves quite serious problems of classifying mathematical objects, and concepts based on it are now actively used in artificial intelligence tools.

In expert systems it is necessary not only to classify objects, but also to show the user an explanation of the classification. In this method the explanation is the object's relation to a particular class and its location relative to the sample used. Specialists in the legal industry, geologists, and physicians accept this "precedent" logic and actively use it in their work.

For the analyzed method to be as reliable and effective as possible and to give the desired result, it is necessary to choose k carefully and to keep outliers out of the analyzed sample. That is why etalon selection techniques are applied, and metric optimization is also carried out.


Copyright © 2018 en.atomiyme.com.