Definition 4.12 DF concept error bound: the number of misclassifications the learning model makes on the training samples before it successfully converges to a hypothesis.

4.3.2.1 Dynamic fuzzy probably approximately correct (DFPAC) learning framework theory

Prof. Valiant’s probably approximately correct (PAC) learning theory has been very important in the development of machine learning. The PAC learning framework, first proposed in 1984, states that the evaluation of machine learning should be based on “probably approximate correctness” [3], unlike traditional pattern recognition, which requires correctness with probability 1 [4].

The basic idea of this theory is that we do not need an absolutely correct learning algorithm. Instead, correctness is stated in the language of probability: a learning algorithm is required to output, with high probability, a hypothesis that is approximately correct, and to do so with polynomial complexity.

Let (X,X)_n be an instance set of learning problems of size n, let any (x,x) ∈ (X,X) be a training example, and let (X,X) = ∪_{n≥1}(X,X)_n be the instance space. The task of learner L is to generate a hypothesis of the target concept so that it can correctly label the category of each instance. For n ≥ 1, C_n ⊆ 2^{(X,X)_n} is a series of concepts on (X,X)_n, and C = ∪_{n≥1}C_n is the concept space on (X,X), i.e. the concept class. H denotes the hypothesis space, the set of hypotheses h that the algorithm can output.

Assume that instances are generated at random from (X,X) according to a probability distribution ℘. In general, ℘ may be any distribution and is unknown to learner L; it is only required to be stationary, i.e. it does not change over time. Each training example (x,x) is drawn at random according to ℘, and then (x,x) together with its target value c((x,x)) is supplied to L. Learner L considers the hypothesis space H when learning the target concept, and after observing a series of training examples of the target concept c, L must output a hypothesis h from H, which is its estimate of c. We evaluate the ability of learner L by the performance of h on new instances drawn from (X,X) according to ℘.

To characterize the degree of approximation between learner L’s output hypothesis h and the target concept c, the true error rate is defined as follows.

Definition 4.13 The true error rate of hypothesis h with respect to target concept c and distribution ℘ is

$$\mathrm{error}_{\wp}(h) \equiv \Pr_{(x,x)\sim\wp}\big[c((x,x)) \neq h((x,x))\big]. \tag{4.2}$$

This is the probability that h misclassifies an instance drawn at random according to ℘, where Pr_{(x,x)∼℘} denotes the probability computed over instances drawn from ℘.

Although we would like to know the true error rate error_℘(h), which represents the error that learner L incurs when classifying unknown samples, it cannot be measured directly because the full instance space is not available; what we can measure is the error on a finite sample. Therefore, we define the sample error rate.

Definition 4.14 The sample error rate of h with respect to target concept c and sample S is

$$\mathrm{error}_S(h) \equiv \frac{1}{n}\sum_{(x,x)\in S}\sigma\big(c((x,x)),\, h((x,x))\big), \tag{4.3}$$

where n is the size of the sample S, and σ(c((x,x)), h((x,x))) is 1 if c((x,x)) ≠ h((x,x)) and 0 otherwise.
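
As a small illustration of Definition 4.14, the following Python sketch computes the sample error rate of a hypothesis on a labelled sample; the function name, the threshold hypothesis, and the toy data are illustrative assumptions, and dynamic fuzzy instances are treated here simply as numeric values.

```python
# Minimal sketch of Definition 4.14: the sample error rate error_S(h).
# The hypothesis `h` and the toy data below are illustrative only.

def sample_error_rate(h, instances, targets):
    """Fraction of instances in S on which h disagrees with the target concept c."""
    mistakes = sum(1 for x, c_x in zip(instances, targets) if h(x) != c_x)
    return mistakes / len(instances)

if __name__ == "__main__":
    # Toy sample S: 1-D instances with binary concept labels.
    S = [0.1, 0.4, 0.6, 0.9]
    labels = [0, 0, 1, 1]
    h = lambda x: int(x > 0.5)              # a simple threshold hypothesis
    print(sample_error_rate(h, S, labels))  # 0.0 for this toy sample
```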

Definition 4.15 Consider a concept class C defined on an instance set (X,X) of size n, and let learner L use hypothesis space H. If, for every c ∈ C, every distribution ℘ on (X,X), and all 0 < ε < 1/2 and 0 < δ < 1/2, learner L outputs with probability at least 1 − δ a hypothesis h ∈ H such that error_℘(h) ≤ ε, in time polynomial in 1/ε, 1/δ, n, and size(c), then concept class C is DFPAC-learnable by L using H.

4.3.2.2 Sample complexity of the DF hypothesis space

According to the definition, learnability in the DFPAC framework is largely determined by the number of training samples. Therefore, the problem of the sample complexity of DF learning arises.

(1) The finite hypothesis space case

First, we consider the sample complexity of a consistent learner. A learner is said to be consistent if and only if, whenever possible, it outputs a hypothesis that perfectly fits the training examples.

Definition 4.16 Consider a hypothesis space H, a target concept c, an instance distribution ℘, and a set of training examples D of c. The version space VS_{H,D} is said to be ε-exhausted with respect to c and ℘ if every hypothesis h in VS_{H,D} has true error less than ε with respect to c and ℘.

ε-exhaustion of the version space means that all hypotheses consistent with the training examples (i.e. those with error_S(h) = 0) have true error less than ε. From the perspective of learner L, these hypotheses fit the training data equally well, all having zero training error; only an observer who knows the exact target concept can determine whether the version space is ε-exhausted.

Theorem 4.2 (ε-exhaustion of the version space) If the hypothesis space H is finite, and D consists of m ≥ 1 independent, randomly drawn training examples of concept c, then for any 0 ≤ ε ≤ 1, the probability that the version space VS_{H,D} is not ε-exhausted (with respect to c) is at most |H|e^{−εm}.

This gives an upper bound on the probability that the version space is not ε-exhausted; in other words, it bounds the probability that m training examples fail to eliminate all “bad” hypotheses (those with true error greater than ε) for any learner L using hypothesis space H. The theorem can therefore be used to determine the number of training examples required to reduce this failure probability below a given limit δ. Requiring |H|e^{−εm} ≤ δ, we conclude that

$$m \ge \frac{1}{\varepsilon}\big(\ln|H| + \ln(1/\delta)\big). \tag{4.4}$$

It can be seen that the number of training examples grows linearly with 1/ε, logarithmically with 1/δ, and also logarithmically with the size of the hypothesis space H.

When the learner is not consistent, i.e. error_D(h) ≠ 0 on a given sample D, we can only consider the hypothesis h_best with the smallest training error rate, and ask for the number of samples ensuring with high probability that error_℘(h_best) does not exceed error_D(h_best) + ε. When error_D is measured on m random samples of D, this probability can be bounded by the Hoeffding bound:

$$P\big(\exists h \in H \ \text{s.t.}\ \mathrm{error}_{\wp}(h_{\mathrm{best}}) > \mathrm{error}_D(h_{\mathrm{best}}) + \varepsilon\big) \le |H|e^{-2\varepsilon^2 m}.$$

The corresponding bound on the number of training samples is

$$m \ge \frac{1}{2\varepsilon^2}\big(\ln|H| + \ln(1/\delta)\big). \tag{4.5}$$
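
To make the two finite-hypothesis-space bounds concrete, the sketch below evaluates Eqs. (4.4) and (4.5) for given |H|, ε, and δ; the function names and the example values are assumptions, and the ceiling is taken because m must be an integer.

```python
import math

def m_consistent(h_size, eps, delta):
    """Eq. (4.4): samples sufficient for a consistent learner over a finite H."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / eps)

def m_agnostic(h_size, eps, delta):
    """Eq. (4.5): samples sufficient when the learner may be inconsistent (Hoeffding)."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / (2.0 * eps ** 2))

if __name__ == "__main__":
    # Example: |H| = 2**20 hypotheses, epsilon = 0.05, delta = 0.01.
    print(m_consistent(2 ** 20, 0.05, 0.01))  # ~370
    print(m_agnostic(2 ** 20, 0.05, 0.01))    # ~3694
```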

(2) The infinite hypothesis space case

For an infinite hypothesis space (concept class), we need the concept of the VC dimension to describe the complexity of the hypothesis space; it reflects the ability of H to shatter subsets of the instance set (X,X).

Definition 4.17 A set of instances S is shattered by the hypothesis space H if and only if, for every dichotomy of S, there exists a hypothesis in H that is consistent with that dichotomy.

The ability of the hypothesis space H to shatter a set of instances is thus a measure of its capacity to represent target concepts defined over those instances. The larger the subsets of the instance space that can be shattered, the stronger the representational ability of H.

Definition 4.18 The VC dimension of a hypothesis space H defined on the instance space (X,X), written VC(H), is the size of the largest finite subset of (X,X) that can be shattered by H. If arbitrarily large finite subsets of (X,X) can be shattered by H, then VC(H) ≡ ∞.
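
Definition 4.17 can be checked by brute force for a small finite hypothesis space: a set S is shattered by H if every dichotomy of S is realized by some hypothesis in H. The sketch below does exactly this; the one-dimensional threshold hypotheses are a hypothetical example chosen only to keep the code self-contained.

```python
from itertools import product

def is_shattered(instances, hypotheses):
    """True iff every dichotomy of `instances` is realized by some h in `hypotheses`."""
    realized = {tuple(h(x) for x in instances) for h in hypotheses}
    return all(tuple(d) in realized for d in product([0, 1], repeat=len(instances)))

if __name__ == "__main__":
    # Hypothetical H: 1-D threshold functions h_t(x) = [x > t].
    thresholds = [-0.5, 0.5, 1.5, 2.5]
    H = [lambda x, t=t: int(x > t) for t in thresholds]
    print(is_shattered([1.0], H))        # True: a single point can be shattered
    print(is_shattered([1.0, 2.0], H))   # False: the labelling (1, 0) is unrealizable
```

As expected, a single point can be shattered by threshold functions, but two distinct points cannot, so the VC dimension of this hypothetical class is 1.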

Using VC(H) as a measure of the complexity of H, we can derive a new sample-size bound, namely

$$m \ge \frac{1}{\varepsilon}\Big(4\log_2(2/\delta) + 8\,VC(H)\log_2(13/\varepsilon)\Big).$$

The number of training examples m grows logarithmically in 1/δ, linearly in VC(H), and linearly in 1/ε up to a logarithmic factor in 1/ε.

Theorem 4.3 (Lower bound on sample complexity) Consider any concept class C with VC(C) ≥ 2, any learner L, and any 0 < ε < 1/8 and 0 < δ < 1/100. Then there exist a distribution ℘ and a target concept in C such that, when L observes fewer samples than

$$\max\left[\frac{1}{\varepsilon}\log(1/\delta),\ \frac{VC(C)-1}{32\varepsilon}\right], \tag{4.6}$$

L will, with probability at least δ, output a hypothesis h with error_℘(h) > ε.

This theorem states that if there are too few training examples, then no learner L can PAC-learn every target concept in any nontrivial concept class C.
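
The sketch below evaluates both the VC-dimension-based sufficient sample size given above and the lower bound of Eq. (4.6); the function names are illustrative, the natural logarithm is assumed for the unspecified log in Eq. (4.6), and the numeric example is arbitrary.

```python
import math

def m_upper_vc(vc_dim, eps, delta):
    """Sufficient size: m >= (1/eps) * (4*log2(2/delta) + 8*VC(H)*log2(13/eps))."""
    return math.ceil((4 * math.log2(2 / delta) + 8 * vc_dim * math.log2(13 / eps)) / eps)

def m_lower_vc(vc_dim, eps, delta):
    """Lower bound of Eq. (4.6): max[(1/eps)*log(1/delta), (VC(C)-1)/(32*eps)]."""
    return math.ceil(max(math.log(1 / delta) / eps, (vc_dim - 1) / (32 * eps)))

if __name__ == "__main__":
    # Example: VC dimension 10, epsilon = 0.05, delta = 0.01.
    print(m_lower_vc(10, 0.05, 0.01))   # ~93
    print(m_upper_vc(10, 0.05, 0.01))   # ~13448
```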

4.3.2.3 DF error bound model

1) Learning algorithm error rate model

As real data may contain noise, errors in the learning process are inevitable. Assume that the training sample S is drawn at random from the instance set according to ℘, and that the true error rate error_℘(h) is estimated by the sample error rate error_S(h). Testing h on a randomly drawn sample S to estimate error_℘(h) is analogous to estimating the probability of heads from n random tosses of a biased coin, so the outcome can be described by the binomial distribution B(n, p).

Assume the number of instances in the set S is n, let r be the number of instances in S misclassified by h, and let p be the probability that an instance drawn at random from ℘ is misclassified by h. Then error_S(h) = r/n is an unbiased (moment) estimator of p. By the properties of the binomial distribution, error_S(h) = r/n has standard deviation σ_{error_S(h)} = √(p(1−p)/n), and if r/n is used to approximate p, we get

$$\sigma_{\mathrm{error}_S(h)} \approx \sqrt{\mathrm{error}_S(h)\big(1-\mathrm{error}_S(h)\big)/n}. \tag{4.7}$$
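
Following Eq. (4.7), the sketch below estimates the sample error rate and its standard deviation from the number of misclassified instances, and adds an approximate 95% two-sided interval (±1.96σ, a normal approximation to the binomial) for illustration; the function name and example counts are assumptions.

```python
import math

def error_estimate(r, n, z=1.96):
    """Sample error rate r/n, its std dev per Eq. (4.7), and an approx. 95% interval."""
    e = r / n
    sigma = math.sqrt(e * (1 - e) / n)       # Eq. (4.7)
    return e, sigma, (max(0.0, e - z * sigma), min(1.0, e + z * sigma))

if __name__ == "__main__":
    # Example: 12 of 100 test instances misclassified.
    e, sigma, ci = error_estimate(12, 100)
    print(e, round(sigma, 4), ci)   # 0.12, ~0.0325, roughly (0.056, 0.184)
```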

2) Optimal error bound model

To characterize the error bound for any learning algorithm and any target concept learning process, we introduce the concept of optimal error bounds. For a learning algorithm Al, let M_Al(c) denote the maximum number of errors Al makes, over all possible sequences of training examples, before concept c is learned, and let M_Al(C) ≡ max_{c∈C} M_Al(c).

Definition 4.19 Let C be any nonempty concept class. Opt(C) is the minimum value of M_Al(C) over all possible learning algorithms Al:

$$\mathrm{Opt}(C) \equiv \min_{Al\,\in\,\text{learning algorithms}} M_{Al}(C). \tag{4.8}$$

Littlestone [5] proved that, for any concept class C, the following relation holds between the VC dimension of C and the optimal error bound of C:

$$VC(C) \le \mathrm{Opt}(C) \le \log_2(|C|). \tag{4.9}$$
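
As a quick sanity check of relation (4.9), consider the hypothetical concept class consisting of all subsets of a finite instance set S of size d. Then |C| = 2^d, so log₂|C| = d, and VC(C) = d because the whole of S can be shattered, giving
$$d = VC(C) \le \mathrm{Opt}(C) \le \log_2|C| = d,$$
so the optimal error bound for this class is exactly d.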

Therefore, the error bound of DF concept learning is well worth studying; it is a topic of active research and an integral part of DF concept learning.

4.3.3 Dimensionality reduction model of DF instances

For practical dynamic fuzzy datasets (DFD), human factors such as data recording errors, missing records, overly specialized data, and so on make it necessary to preprocess the dataset to improve data quality before applying a DF concept learning algorithm.

Because DF concept learning is a supervised learning process, the concept category of each instance in the sample set is already labelled, which reduces the number of cluster centres that must be considered; a clustering method is therefore feasible. Clustering can detect instances that do not belong to the same concept class, while highly similar data can be attributed to the same concept class. Common clustering methods include density-based clustering, partitioning clustering, and hierarchical clustering. Although the theory of the fuzzy C-means (FCM) algorithm in nonclassical applications is still far from perfect, the following describes the clustering of DF data using an FCM algorithm.

(1) According to the labelled DF concept classes, specify the number of clusters c (2 ≤ c ≤ n), where n is the number of DF data points; set the iteration termination threshold (ξ,ξ); initialize the cluster centre matrix V^0; and set the iteration counter b = 0.

(2) Calculate the partition matrix U^b = [(u_{ij}, u_{ij})] as

$$(u_{ij},\,u_{ij}) = \left[\sum_{k=1}^{c}\left(\frac{\|(x_j,x_j)-(v_i,v_i)\|^2}{\|(x_j,x_j)-(v_k,v_k)\|^2}\right)^{1/(m-1)}\right]^{-1},\qquad 1\le i\le c,\ 1\le j\le n. \tag{4.10}$$

(3) Update the cluster centres according to

$$(v_i,\,v_i) = \frac{\sum_{j=1}^{n}(u_{ij},u_{ij})^m\,(x_j,x_j)}{\sum_{j=1}^{n}(u_{ij},u_{ij})^m},\qquad 1\le i\le c. \tag{4.11}$$

(4) If ||V^b − V^{b+1}|| ≤ (ξ,ξ), the algorithm terminates and outputs the partition matrix U and the cluster centres V; otherwise, set b = b + 1 and return to step (2).
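
A minimal NumPy sketch of steps (1)-(4) is given below; it treats each dynamic fuzzy datum as an ordinary numeric feature vector and the termination threshold as a scalar, and the fuzzifier m = 2, the toy data, and the function name are illustrative assumptions rather than part of the DF formulation.

```python
import numpy as np

def fcm(X, c, m=2.0, xi=1e-5, max_iter=100, seed=0):
    """Fuzzy C-means on data X (n x d): returns partition matrix U (c x n) and centres V (c x d)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    V = X[rng.choice(n, size=c, replace=False)]            # step (1): initial centres V^0
    for _ in range(max_iter):
        d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2) + 1e-12   # squared distances
        ratio = (d2[:, None, :] / d2[None, :, :]) ** (1.0 / (m - 1.0))
        U = 1.0 / ratio.sum(axis=1)                         # step (2): Eq. (4.10)
        V_new = (U ** m) @ X / (U ** m).sum(axis=1, keepdims=True)        # step (3): Eq. (4.11)
        if np.linalg.norm(V_new - V) <= xi:                 # step (4): convergence test
            V = V_new
            break
        V = V_new
    return U, V

if __name__ == "__main__":
    # Toy 2-D data with two visible groups.
    X = np.array([[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]])
    U, V = fcm(X, c=2)
    print(np.round(U, 3))   # membership of each point in each cluster
    print(np.round(V, 3))   # final cluster centres
```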

By optimizing the selection of the initial dataset in this way, the quality of the data is improved, which enhances the accuracy of subsequent data analysis.

4.3.4 Dimensionality reduction model of DF attribute space

DF data attribute space reduction is the main method for preprocessing massive datasets. Common methods include principal component analysis (PCA), linear discriminant analysis (LDA), the maximum margin criterion, and so on. PCA is one of the most widely used feature selection methods for large-scale datasets. Its main idea is to extract the most expressive eigenvectors from the high-dimensional data and remove irrelevant and redundant features so as to improve the accuracy of the learning problem.

4.3.4.1 Dynamic fuzzy principal component analysis (DFPCA) dimensionality reduction model

DFPCA applies the PCA principle to reduce the dimension of the data. The basic idea is to analyse, process, and extract the covariance matrix of the dynamic fuzzy variables according to the principle of variance maximization. The original DFD matrix is represented by a new set of linearly independent, mutually orthogonal vectors. The variances are sorted by size: the component with the largest variance (eigenvalue) is taken as the first principal component, the second-largest forms the second principal component, and so on, with all principal components orthogonal to each other. The specific procedure is as follows.

Consider an n-dimensional random dynamic fuzzy variable (X,X) with a known probability distribution function, namely (X,X) = [(x_1,x_1), (x_2,x_2), ..., (x_n,x_n)], with mean E[(X,X)] = (0,0).

The DF covariance matrix is (C,C) = E[((X,X) − E[(X,X)])((X,X) − E[(X,X)])^T]. Since E[(X,X)] = (0,0), this simplifies to

$$(C,C) = E\big[(X,X)(X,X)^T\big]. \tag{4.12}$$

The eigenvalues of (C,C) are (λ_1,λ_1), (λ_2,λ_2), ..., (λ_n,λ_n), and the corresponding eigenvectors are normalized to obtain (y_1,y_1), (y_2,y_2), ..., (y_n,y_n) as the input features satisfying the orthonormality condition. The projection onto the eigenvectors is

$$(M_i, M_i) = (y_i, y_i)^T (X,X)\qquad (i = 1, 2, \dots, n). \tag{4.13}$$

In matrix form, the principal components can be written as

$$(M,M) = (Y,Y)^T (X,X), \tag{4.14}$$

where (M,M) can be decomposed as (M,M) = {(M_1,M_1), (M_2,M_2), ..., (M_n,M_n)} and the eigenvector matrix (Y,Y) satisfies (Y,Y)(Y,Y)^T = (I,I), with (I,I) the dynamic fuzzy unit matrix. Each vector (M_i,M_i) in the feature space is the i-th principal component of the input n-dimensional vector (X,X).

The autocovariance matrix of the principal component vector (M,M) is

$$(C,C)_{(M,M)} = E\big[(M,M)(M,M)^T\big] = E\big[(Y,Y)^T(X,X)(X,X)^T(Y,Y)\big] = (Y,Y)^T E\big[(X,X)(X,X)^T\big](Y,Y). \tag{4.15}$$

Here, (Y,Y) is the eigenvector matrix of (X,X). Sorting the principal components by their eigenvalues, (λ_1,λ_1) ≥ (λ_2,λ_2) ≥ ... ≥ (λ_n,λ_n), the dimension is reduced by truncating (Y,Y): eigenvectors corresponding to larger values of (λ,λ) carry more of the information in the original DFD set, while those with smaller eigenvalues can be discarded with little loss.
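
A minimal NumPy sketch of the DFPCA procedure of Eqs. (4.12)-(4.15) follows, again treating the DF data as an ordinary numeric matrix and using the row-vector convention M = X·Y rather than the column form of Eq. (4.14); the number of retained components k and the toy data are assumptions.

```python
import numpy as np

def dfpca_reduce(X, k):
    """Project rows of X onto the k eigenvectors of the covariance matrix
    with the largest eigenvalues (variance-maximization principle)."""
    Xc = X - X.mean(axis=0)                  # enforce zero mean, E[(X,X)] = (0,0)
    C = Xc.T @ Xc / Xc.shape[0]              # covariance matrix, Eq. (4.12)
    eigvals, eigvecs = np.linalg.eigh(C)     # eigh: C is symmetric
    order = np.argsort(eigvals)[::-1]        # sort eigenvalues in decreasing order
    Y = eigvecs[:, order[:k]]                # truncated eigenvector matrix
    M = Xc @ Y                               # principal components, cf. Eq. (4.14)
    return M, eigvals[order], Y

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy data: 3 correlated features, reduced to 2 principal components.
    X = rng.normal(size=(100, 2)) @ np.array([[1.0, 0.5, 0.0], [0.0, 1.0, 2.0]])
    M, lam, Y = dfpca_reduce(X, k=2)
    print(M.shape, np.round(lam, 3))   # reduced data shape and sorted eigenvalues
```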
