| Preface to the Second Edition |
|
vii | |
| Preface to the First Edition |
|
ix | |
| Introduction: Four Periods in the Research of the Learning Problem |
|
1 | (300) |
|
Rosenblatt's Perceptron (The 1960s) |
|
|
1 | (6) |
|
Construction of the Fundamentals of Learning Theory (The 1960s--1970s) |
|
|
7 | (4) |
|
Neural Networks (The 1980s) |
|
|
11 | (3) |
|
Returning to the Origin (The 1990s) |
|
|
14 | (3) |
|
Setting of the Learning Problem |
|
|
17 | (18) |
|
Function Estimation Model |
|
|
17 | (1) |
|
The Problem of Risk Minimization |
|
|
18 | (1) |
|
Three Main Learning Problems |
|
|
18 | (2) |
|
|
|
19 | (1) |
|
|
|
19 | (1) |
|
Density Estimation (Fisher-Wald Setting) |
|
|
19 | (1) |
|
The General Setting of the Learning Problem |
|
|
20 | (1) |
|
The Empirical Risk Minimization (ERM) Inductive Principle |
|
|
20 | (1) |
|
The Four Parts of Learning Theory |
|
|
21 | (2) |
|
Informal Reasoning and Comments --- 1 |
|
|
23 | (1) |
|
The Classical Paradigm of Solving Learning Problems |
|
|
23 | (4) |
|
Density Estimation Problem (Maximum Likelihood Method) |
|
|
24 | (1) |
|
Pattern Recognition (Discriminant Analysis) Problem |
|
|
24 | (1) |
|
Regression Estimation Model |
|
|
25 | (1) |
|
Narrowness of the ML Method |
|
|
26 | (1) |
|
Nonparametric Methods of Density Estimation |
|
|
27 | (3) |
|
|
|
27 | (1) |
|
The Problem of Density Estimation Is Ill-Posed |
|
|
28 | (2) |
|
Main Principle for Solving Problems Using a Restricted Amount of Information |
|
|
30 | (1) |
|
Model Minimization of the Risk Based on Empirical Data |
|
|
31 | (2) |
|
|
|
31 | (1) |
|
|
|
31 | (1) |
|
|
|
32 | (1) |
|
Stochastic Approximation Inference |
|
|
33 | (2) |
|
Consistency of Learning Processes |
|
|
35 | (34) |
|
The Classical Definition of Consistency and the Concept of Nontrivial Consistency |
|
|
36 | (2) |
|
The Key Theorem of Learning Theory |
|
|
38 | (2) |
|
|
|
39 | (1) |
|
Necessary and Sufficient Conditions for Uniform Two-Sided Convergence |
|
|
40 | (5) |
|
Remark on Law of Large Numbers and Its Generalization |
|
|
41 | (1) |
|
Entropy of the Set of Indicator Functions |
|
|
42 | (1) |
|
Entropy of the Set of Real Functions |
|
|
43 | (2) |
|
Conditions for Uniform Two-Sided Convergence |
|
|
45 | (1) |
|
Necessary and Sufficient Conditions for Uniform One-Sided Convergence |
|
|
45 | (2) |
|
Theory of Nonfalsifiability |
|
|
47 | (2) |
|
Kant's Problem of Demarcation and Popper's Theory of Nonfalsifiability |
|
|
47 | (2) |
|
Theorems on Nonfalsifiability |
|
|
49 | (6) |
|
Case of Complete (Popper's) Nonfalsifiability |
|
|
50 | (1) |
|
Theorem on Partial Nonfalsifiability |
|
|
50 | (2) |
|
Theorem on Potential Nonfalsifiability |
|
|
52 | (3) |
|
Three Milestones in Learning Theory |
|
|
55 | (5) |
|
Informal Reasoning and Comments --- 2 |
|
|
59 | (1) |
|
The Basic Problems of Probability Theory and Statistics |
|
|
60 | (3) |
|
Axioms of Probability Theory |
|
|
60 | (3) |
|
Two Modes of Estimating a Probability Measure |
|
|
63 | (2) |
|
Strong Mode Estimation of Probability Measures and the Density Estimation Problem |
|
|
65 | (1) |
|
The Glivenko-Cantelli Theorem and its Generalization |
|
|
66 | (1) |
|
Mathematical Theory of Induction |
|
|
67 | (2) |
|
Bounds on the Rate of Convergence of Learning Processes |
|
|
69 | (24) |
|
|
|
70 | (2) |
|
Generalization for the Set of Real Functions |
|
|
72 | (3) |
|
The Main Distribution-Independent Bounds |
|
|
75 | (1) |
|
Bounds on the Generalization Ability of Learning Machines |
|
|
76 | (2) |
|
The Structure of the Growth Function |
|
|
78 | (2) |
|
The VC Dimension of a Set of Functions |
|
|
80 | (3) |
|
Constructive Distribution-Independent Bounds |
|
|
83 | (2) |
|
The Problem of Constructing Rigorous (Distribution-Dependent) Bounds |
|
|
85 | (2) |
|
Informal Reasoning and Comments --- 3 |
|
|
87 | (1) |
|
Kolmogorov-Smirnov Distributions |
|
|
87 | (2) |
|
|
|
89 | (1) |
|
Bounds on Empirical Processes |
|
|
90 | (3) |
|
Controlling the Generalization Ability of Learning Processes |
|
|
93 | (30) |
|
Structural Risk Minimization (SRM) Inductive Principle |
|
|
94 | (3) |
|
Asymptotic Analysis of the Rate of Convergence |
|
|
97 | (2) |
|
The Problem of Function Approximation in Learning Theory |
|
|
99 | (2) |
|
Examples of Structures for Neural Nets |
|
|
101 | (2) |
|
The Problem of Local Function Estimation |
|
|
103 | (1) |
|
The Minimum Description Length (MDL) and SRM Principles |
|
|
104 | (8) |
|
|
|
106 | (1) |
|
Bounds for the MDL Principle |
|
|
107 | (1) |
|
The SRM and MDL Principles |
|
|
108 | (2) |
|
A Weak Point of the MDL Principle |
|
|
110 | (1) |
|
Informal Reasoning and Comments --- 4 |
|
|
111 | (1) |
|
Methods for Solving Ill-Posed Problems |
|
|
112 | (1) |
|
Stochastic Ill-Posed Problems and the Problem of Density Estimation |
|
|
113 | (2) |
|
The Problem of Polynomial Approximation of the Regression |
|
|
115 | (1) |
|
The Problem of Capacity Control |
|
|
116 | (3) |
|
Choosing the Degree of the Polynomial |
|
|
116 | (1) |
|
Choosing the Best Sparse Algebraic Polynomial |
|
|
117 | (1) |
|
Structures on the Set of Trigonometric Polynomials |
|
|
118 | (1) |
|
The Problem of Features Selection |
|
|
119 | (1) |
|
The Problem of Capacity Control and Bayesian Inference |
|
|
119 | (4) |
|
The Bayesian Approach in Learning Theory |
|
|
119 | (2) |
|
Discussion of the Bayesian Approach and Capacity Control Methods |
|
|
121 | (2) |
|
Methods of Pattern Recognition |
|
|
123 | (58) |
|
Why Can Learning Machines Generalize? |
|
|
123 | (2) |
|
Sigmoid Approximation of Indicator Functions |
|
|
125 | (1) |
|
|
|
126 | (5) |
|
The Back-Propagation Method |
|
|
126 | (4) |
|
The Back-Propagation Algorithm |
|
|
130 | (1) |
|
Neural Networks for the Regression Estimation Problem |
|
|
130 | (1) |
|
Remarks on the Back-Propagation Method |
|
|
130 | (1) |
|
The Optimal Separating Hyperplane |
|
|
131 | (2) |
|
|
|
131 | (1) |
|
|
|
132 | (1) |
|
Constructing the Optimal Hyperplane |
|
|
133 | (5) |
|
Generalization for the Nonseparable Case |
|
|
136 | (2) |
|
Support Vector (SV) Machines |
|
|
138 | (8) |
|
Generalization in High-Dimensional Space |
|
|
139 | (1) |
|
Convolution of the Inner Product |
|
|
140 | (1) |
|
|
|
141 | (1) |
|
|
|
141 | (5) |
|
Experiments with SV Machines |
|
|
146 | (8) |
|
|
|
146 | (1) |
|
Handwritten Digit Recognition |
|
|
147 | (4) |
|
|
|
151 | (3) |
|
|
|
154 | (2) |
|
SVM and Logistic Regression |
|
|
156 | (7) |
|
|
|
156 | (3) |
|
The Risk Function for SVM |
|
|
159 | (1) |
|
The SVM_n Approximation of the Logistic Regression |
|
|
160 | (3) |
|
|
|
163 | (8) |
|
|
|
164 | (3) |
|
|
|
167 | (4) |
|
Informal Reasoning and Comments --- 5 |
|
|
171 | (1) |
|
The Art of Engineering Versus Formal Inference |
|
|
171 | (3) |
|
Wisdom of Statistical Models |
|
|
174 | (2) |
|
What Can One Learn from Digit Recognition Experiments? |
|
|
176 | (5) |
|
Influence of the Type of Structures and Accuracy of Capacity Control |
|
|
177 | (1) |
|
SRM Principle and the Problem of Feature Construction |
|
|
178 | (1) |
|
Is the Set of Support Vectors a Robust Characteristic of the Data? |
|
|
179 | (2) |
|
Methods of Function Estimation |
|
|
181 | (44) |
|
ε-Insensitive Loss-Function |
|
|
181 | (2) |
|
SVM for Estimating Regression Function |
|
|
183 | (7) |
|
SV Machine with Convolved Inner Product |
|
|
186 | (2) |
|
Solution for Nonlinear Loss Functions |
|
|
188 | (2) |
|
Linear Optimization Method |
|
|
190 | (1) |
|
Constructing Kernels for Estimating Real-Valued Functions |
|
|
190 | (4) |
|
Kernels Generating Expansion on Orthogonal Polynomials |
|
|
191 | (2) |
|
Constructing Multidimensional Kernels |
|
|
193 | (1) |
|
Kernels Generating Splines |
|
|
194 | (2) |
|
Spline of Order d With a Finite Number of Nodes |
|
|
194 | (1) |
|
Kernels Generating Splines With an Infinite Number of Nodes |
|
|
195 | (1) |
|
Kernels Generating Fourier Expansions |
|
|
196 | (2) |
|
Kernels for Regularized Fourier Expansions |
|
|
197 | (1) |
|
The Support Vector ANOVA Decomposition for Function Approximation and Regression Estimation |
|
|
198 | (2) |
|
SVM for Solving Linear Operator Equations |
|
|
200 | (4) |
|
The Support Vector Method |
|
|
201 | (3) |
|
Function Approximation Using the SVM |
|
|
204 | (4) |
|
Why Does the Value of ε Control the Number of Support Vectors? |
|
|
205 | (3) |
|
SVM for Regression Estimation |
|
|
208 | (11) |
|
Problem of Data Smoothing |
|
|
209 | (1) |
|
Estimation of Linear Regression Functions |
|
|
209 | (7) |
|
Estimation Nonlinear Regression Functions |
|
|
216 | (3) |
|
Informal Reasoning and Comments --- 6 |
|
|
219 | (1) |
|
Loss Functions for the Regression Estimation Problem |
|
|
219 | (2) |
|
Loss Functions for Robust Estimators |
|
|
221 | (2) |
|
Support Vector Regression Machine |
|
|
223 | (2) |
|
Direct Methods in Statistical Learning Theory |
|
|
225 | (42) |
|
Problem of Estimating Densities, Conditional Probabilities, and Conditional Densities |
|
|
226 | (3) |
|
Problem of Density Estimation: Direct Setting |
|
|
226 | (1) |
|
Problem of Conditional Probability Estimation |
|
|
227 | (1) |
|
Problem of Conditional Density Estimation |
|
|
228 | (1) |
|
Solving an Approximately Determined Integral Equation |
|
|
229 | (1) |
|
Glivenko-Cantelli Theorem |
|
|
230 | (3) |
|
Kolmogorov-Smirnov Distribution |
|
|
232 | (1) |
|
|
|
233 | (2) |
|
Three Methods of Solving Ill-Posed Problems |
|
|
235 | (2) |
|
|
|
236 | (1) |
|
Main Assertions of the Theory of Ill-Posed Problems |
|
|
237 | (3) |
|
Deterministic Ill-Posed Problems |
|
|
237 | (1) |
|
Stochastic Ill-Posed Problem |
|
|
238 | (2) |
|
Nonparametric Methods of Density Estimation |
|
|
240 | (4) |
|
Consistency of the Solution of the Density Estimation Problem |
|
|
240 | (1) |
|
|
|
241 | (3) |
|
SVM Solution of the Density Estimation Problem |
|
|
244 | (5) |
|
The SVM Density Estimate: Summary |
|
|
247 | (1) |
|
Comparison of the Parzen's and the SVM methods |
|
|
248 | (1) |
|
Conditional Probability Estimation |
|
|
249 | (7) |
|
Approximately Defined Operator |
|
|
251 | (2) |
|
SVM Method for Conditional Probability Estimation |
|
|
253 | (2) |
|
The SVM Conditional Probability Estimate: Summary |
|
|
255 | (1) |
|
Estimation of Conditional Density and Regression |
|
|
256 | (2) |
|
|
|
258 | (3) |
|
One Can Use a Good Estimate of the Unknown Density |
|
|
258 | (1) |
|
One Can Use Both Labeled (Training) and Unlabeled (Test) Data |
|
|
259 | (1) |
|
Method for Obtaining Sparse Solutions of the Ill-Posed Problems |
|
|
259 | (2) |
|
Informal Reasoning and Comments --- 7 |
|
|
261 | (1) |
|
Three Elements of a Scientific Theory |
|
|
261 | (2) |
|
Problem of Density Estimation |
|
|
262 | (1) |
|
Theory of Ill-Posed Problems |
|
|
262 | (1) |
|
Stochastic Ill-Posed Problems |
|
|
263 | (4) |
|
The Vicinal Risk Minimization Principle and the SVMs |
|
|
267 | (24) |
|
The Vicinal Risk Minimization Principle |
|
|
267 | (4) |
|
|
|
269 | (1) |
|
|
|
270 | (1) |
|
VRM Method for the Pattern Recognition Problem |
|
|
271 | (4) |
|
Examples of Vicinal Kernels |
|
|
275 | (4) |
|
|
|
276 | (3) |
|
|
|
279 | (1) |
|
|
|
279 | (2) |
|
Generalization for Estimation Real-Valued Functions |
|
|
281 | (3) |
|
Estimating Density and Conditional Density |
|
|
284 | (7) |
|
Estimating a Density Function |
|
|
284 | (1) |
|
Estimating a Conditional Probability Function |
|
|
285 | (1) |
|
Estimating a Conditional Density Function |
|
|
286 | (1) |
|
Estimating a Regression Function |
|
|
287 | (2) |
|
Informal Reasoning and Comments --- 8 |
|
|
289 | (2) |
|
Conclusion: What Is Important in Learning Theory? |
|
|
291 | (10) |
|
What Is Important in the Setting of the Problem? |
|
|
291 | (3) |
|
What Is Important in the Theory of Consistency of Learning Processes? |
|
|
294 | (1) |
|
What Is Important in the Theory of Bounds? |
|
|
295 | (1) |
|
What Is Important in the Theory of Controlling the Generalization Ability of Learning Machines? |
|
|
296 | (1) |
|
What Is Important in the Theory for Constructing Learning Algorithms? |
|
|
297 | (1) |
|
What Is the Most Important? |
|
|
298 | (3) |
| References |
|
301 | (10) |
|
|
|
301 | (1) |
|
|
|
302 | (9) |
| Index |
|
311 | |