Understanding Machine Learning: From Theory to Algorithms

References

Nesterov, Y. (2004), Introductory lectures on convex optimization: A basic course, Vol. 87, Springer Netherlands.
Novikoff, A. B. J. (1962), On convergence proofs on perceptrons, in 'Proceedings of the Symposium on the Mathematical Theory of Automata', Vol. XII, pp. 615–622.
Parberry, I. (1994), Circuit complexity and neural networks, The MIT Press.
Pearson, K. (1901), 'On lines and planes of closest fit to systems of points in space', The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2 (11), 559–572.
Phillips, D. L. (1962), 'A technique for the numerical solution of certain integral equations of the first kind', Journal of the ACM 9 (1), 84–97.
Pisier, G. (1980–1981), 'Remarques sur un résultat non publié de B. Maurey'.
Pitt, L. & Valiant, L. (1988), 'Computational limitations on learning from examples', Journal of the Association for Computing Machinery 35 (4), 965–984.
Poon, H. & Domingos, P. (2011), Sum-product networks: A new deep architecture, in 'Conference on Uncertainty in Artificial Intelligence (UAI)'.
Quinlan, J. R. (1986), 'Induction of decision trees', Machine Learning 1, 81–106.
Quinlan, J. R. (1993), C4.5: Programs for Machine Learning, Morgan Kaufmann.
Rabiner, L. & Juang, B. (1986), 'An introduction to hidden Markov models', IEEE ASSP Magazine 3 (1), 4–16.
Rakhlin, A., Shamir, O. & Sridharan, K. (2012), Making gradient descent optimal for strongly convex stochastic optimization, in 'International Conference on Machine Learning (ICML)'.
Rakhlin, A., Sridharan, K. & Tewari, A. (2010), Online learning: Random averages, combinatorial parameters, and learnability, in 'NIPS'.
Rakhlin, A., Mukherjee, S. & Poggio, T. (2005), 'Stability results in learning theory', Analysis and Applications 3 (4), 397–419.
Ranzato, M., Huang, F., Boureau, Y. & LeCun, Y. (2007), Unsupervised learning of invariant feature hierarchies with applications to object recognition, in 'IEEE Conference on Computer Vision and Pattern Recognition (CVPR)', IEEE, pp. 1–8.
Rissanen, J. (1978), 'Modeling by shortest data description', Automatica 14, 465–471.
Rissanen, J. (1983), 'A universal prior for integers and estimation by minimum description length', The Annals of Statistics 11 (2), 416–431.
Robbins, H. & Monro, S. (1951), 'A stochastic approximation method', The Annals of Mathematical Statistics 22 (3), 400–407.
Rogers, W. & Wagner, T. (1978), 'A finite sample distribution-free performance bound for local discrimination rules', The Annals of Statistics 6 (3), 506–514.
Rokach, L. (2007), Data mining with decision trees: theory and applications, Vol. 69, World Scientific.
Rosenblatt, F. (1958), 'The perceptron: A probabilistic model for information storage and organization in the brain', Psychological Review 65, 386–407. (Reprinted in Neurocomputing (MIT Press, 1988).)
Rumelhart, D. E., Hinton, G. E. & Williams, R. J. (1986), Learning internal representations by error propagation, in D. E. Rumelhart & J. L. McClelland, eds, 'Parallel Distributed Processing – Explorations in the Microstructure of Cognition', MIT Press, chapter 8, pp. 318–362.