References
- Abadi et al., 2016
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., … others. (2016). Tensorflow: a system for large-scale machine learning. 12th USENIX symposium on operating systems design and implementation (OSDI 16) (pp. 265–283).
- Abdel-Hamid et al., 2014
Abdel-Hamid, O., Mohamed, A.-r., Jiang, H., Deng, L., Penn, G., & Yu, D. (2014). Convolutional neural networks for speech recognition. IEEE/ACM Transactions on audio, speech, and language processing, 22(10), 1533–1545.
- Ahmed et al., 2012
Ahmed, A., Aly, M., Gonzalez, J., Narayanamurthy, S., & Smola, A. J. (2012). Scalable inference in latent variable models. Proceedings of the fifth ACM international conference on Web search and data mining (pp. 123–132).
- Akiba et al., 2019
Akiba, T., Sano, S., Yanase, T., Ohta, T., & Koyama, M. (2019). Optuna: a next-generation hyperparameter optimization framework. Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining.
- Alayrac et al., 2022
Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., … others. (2022). Flamingo: a visual language model for few-shot learning. arXiv preprint arXiv:2204.14198.
- Alsallakh et al., 2020
Alsallakh, B., Kokhlikyan, N., Miglani, V., Yuan, J., & Reblitz-Richardson, O. (2020). Mind the pad – cnns can develop blind spots. arXiv preprint arXiv:2010.02178.
- Anil et al., 2020
Anil, R., Gupta, V., Koren, T., Regan, K., & Singer, Y. (2020). Scalable second order optimization for deep learning. arXiv preprint arXiv:2002.09018.
- Aronszajn, 1950
Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American mathematical society, 68(3), 337–404.
- Ba et al., 2016
Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer normalization. arXiv preprint arXiv:1607.06450.
- Baevski & Auli, 2018
Baevski, A., & Auli, M. (2018). Adaptive input representations for neural language modeling. International Conference on Learning Representations.
- Bahdanau et al., 2014
Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
- Baptista & Poloczek, 2018
Baptista, R., & Poloczek, M. (2018). Bayesian optimization of combinatorial structures. Proceedings of the 35th International Conference on Machine Learning.
- Bardenet et al., 2013
Bardenet, R., Brendel, M., Kégl, B., & Sebag, M. (2013). Collaborative hyperparameter tuning. Proceedings of the 30th International Conference on Machine Learning (ICML'13).
- Bay et al., 2006
Bay, H., Tuytelaars, T., & Van Gool, L. (2006). Surf: speeded up robust features. European conference on computer vision (pp. 404–417).
- Bellman, 1966
Bellman, R. (1966). Dynamic programming. Science.
- Bellman, 1952
Bellman, R. (1952). On the theory of dynamic programming. Proceedings of the National Academy of Sciences, 38(8), 716–719. doi:10.1073/pnas.38.8.716
- Bellman, 1957a
Bellman, R. (1957). A markovian decision process. Journal of Mathematics and Mechanics, 6(5), 679–684. URL: http://www.jstor.org/stable/24900506
- Bellman, 1957b
Bellman, R. (1957). Dynamic Programming. Dover Publications.
- Beltagy et al., 2020
Beltagy, I., Peters, M. E., & Cohan, A. (2020). Longformer: the long-document transformer. arXiv preprint arXiv:2004.05150.
- Bengio et al., 2003
Bengio, Y., Ducharme, R., Vincent, P., & Jauvin, C. (2003). A neural probabilistic language model. Journal of machine learning research, 3(Feb), 1137–1155.
- Bengio et al., 1994
Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE transactions on neural networks, 5(2), 157–166.
- Bergstra et al., 2011
Bergstra, J., Bardenet, R., Bengio, Y., & Kégl, B. (2011). Algorithms for hyper-parameter optimization. Advances in neural information processing systems, 24.
- Bergstra et al., 2010
Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., … Bengio, Y. (2010). Theano: a cpu and gpu math compiler in python. Proc. 9th python in science conf (pp. 3–10).
- Beutel et al., 2014
Beutel, A., Murray, K., Faloutsos, C., & Smola, A. J. (2014). Cobafi: collaborative bayesian filtering. Proceedings of the 23rd international conference on World wide web (pp. 97–108).
- Bishop, 1995
Bishop, C. M. (1995). Training with noise is equivalent to tikhonov regularization. Neural computation, 7(1), 108–116.
- Bishop, 2006
Bishop, C. M. (2006). Pattern recognition and machine learning. springer.
- Black & Scholes, 1973
Black, F., & Scholes, M. (1973). The pricing of options and corporate liabilities. The Journal of Political Economy, pp. 637–654.
- Bodla et al., 2017
Bodla, N., Singh, B., Chellappa, R., & Davis, L. S. (2017). Soft-nms – improving object detection with one line of code. Proceedings of the IEEE international conference on computer vision (pp. 5561–5569).
- Bojanowski et al., 2017
Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135–146.
- Bollobas, 1999
Bollobás, B. (1999). Linear analysis. Cambridge University Press, Cambridge.
- Bommasani et al., 2021
Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., … others. (2021). On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258.
- Bottou, 2010
Bottou, L. (2010). Large-scale machine learning with stochastic gradient descent. Proceedings of COMPSTAT'2010 (pp. 177–186). Springer.
- Bottou & Le Cun, 1988
Bottou, L., & Le Cun, Y. (1988). Sn: a simulator for connectionist models. Proceedings of Neuro-Nîmes 88 (pp. 371–382). Nîmes, France. URL: http://leon.bottou.org/papers/bottou-lecun-88
- Boucheron et al., 2005a
Boucheron, S., Bousquet, O., & Lugosi, G. (2005). Theory of classification: a survey of some recent advances. ESAIM: probability and statistics, 9, 323–375.
- Boucheron et al., 2005b
Boucheron, S., Bousquet, O., & Lugosi, G. (2005). Theory of classification: a survey of some recent advances. ESAIM: probability and statistics, 9, 323–375.
- Bowman et al., 2015
Bowman, S. R., Angeli, G., Potts, C., & Manning, C. D. (2015). A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326.
- Boyd & Vandenberghe, 2004
Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge, England: Cambridge University Press.
- Bradley & Terry, 1952
Bradley, R. A., & Terry, M. E. (1952). Rank analysis of incomplete block designs: i. the method of paired comparisons. Biometrika, 39(3/4), 324–345.
- Brown & Sandholm, 2017
Brown, N., & Sandholm, T. (2017). Libratus: the superhuman ai for no-limit poker. IJCAI (pp. 5226–5228).
- Brown et al., 1990
Brown, P. F., Cocke, J., Della Pietra, S. A., Della Pietra, V. J., Jelinek, F., Lafferty, J., … Roossin, P. S. (1990). A statistical approach to machine translation. Computational linguistics, 16(2), 79–85.
- Brown et al., 1988
Brown, P. F., Cocke, J., Della Pietra, S. A., Della Pietra, V. J., Jelinek, F., Mercer, R. L., & Roossin, P. (1988). A statistical approach to language translation. Coling Budapest 1988 Volume 1: International Conference on Computational Linguistics.
- Brown et al., 2020
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., … others. (2020). Language models are few-shot learners. Advances in neural information processing systems, 33, 1877–1901.
- Buslaev et al., 2020
Buslaev, A., Iglovikov, V. I., Khvedchenya, E., Parinov, A., Druzhinin, M., & Kalinin, A. A. (2020). Albumentations: fast and flexible image augmentations. Information, 11(2), 125.
- Campbell et al., 2002
Campbell, M., Hoane Jr, A. J., & Hsu, F.-h. (2002). Deep blue. Artificial intelligence, 134(1-2), 57–83.
- Canny, 1987
Canny, J. (1987). A computational approach to edge detection. Readings in computer vision (pp. 184–203). Elsevier.
- Cer et al., 2017
Cer, D., Diab, M., Agirre, E., Lopez-Gazpio, I., & Specia, L. (2017). Semeval-2017 task 1: semantic textual similarity multilingual and crosslingual focused evaluation. Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017) (pp. 1–14).
- Chan et al., 2015
Chan, W., Jaitly, N., Le, Q. V., & Vinyals, O. (2015). Listen, attend and spell. arXiv preprint arXiv:1508.01211.
- Chen et al., 2021
Chen, L., Lu, K., Rajeswaran, A., Lee, K., Grover, A., Laskin, M., … Mordatch, I. (2021). Decision transformer: reinforcement learning via sequence modeling. Advances in neural information processing systems, 34, 15084–15097.
- Chen et al., 2015
Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., … Zhang, Z. (2015). Mxnet: a flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274.
- Cheng et al., 2016
Cheng, J., Dong, L., & Lapata, M. (2016). Long short-term memory-networks for machine reading. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (pp. 551–561).
- Chetlur et al., 2014
Chetlur, S., Woolley, C., Vandermersch, P., Cohen, J., Tran, J., Catanzaro, B., & Shelhamer, E. (2014). Cudnn: efficient primitives for deep learning. arXiv preprint arXiv:1410.0759.
- Cho et al., 2014a
Cho, K., Van Merriënboer, B., Bahdanau, D., & Bengio, Y. (2014). On the properties of neural machine translation: encoder-decoder approaches. arXiv preprint arXiv:1409.1259.
- Cho et al., 2014b
Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
- Chowdhery et al., 2022
Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., … others. (2022). Palm: scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.
- Chung et al., 2014
Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
- Clark et al., 2020
Clark, K., Luong, M.-T., Le, Q. V., & Manning, C. D. (2020). Electra: pre-training text encoders as discriminators rather than generators. International Conference on Learning Representations.
- Collobert et al., 2011
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of machine learning research, 12, 2493–2537.
- Cordonnier et al., 2020
Cordonnier, J.-B., Loukas, A., & Jaggi, M. (2020). On the relationship between self-attention and convolutional layers. International Conference on Learning Representations.
- Cover & Thomas, 1999
Cover, T., & Thomas, J. A. (1999). Elements of information theory. John Wiley & Sons.
- Csiszar, 2008
Csiszár, I. (2008). Axiomatic characterizations of information measures. Entropy, 10(3), 261–273.
- Cybenko, 1989
Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2(4), 303–314.
- Dalal & Triggs, 2005
Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR'05) (pp. 886–893).
- DeCock, 2011
De Cock, D. (2011). Ames, Iowa: alternative to the Boston housing data as an end of semester regression project. Journal of Statistics Education, 19(3).
- Dean et al., 2012a
Dean, J., Corrado, G. S., Monga, R., Chen, K., Devin, M., Le, Q. V., … others. (2012). Large scale distributed deep networks. Proceedings of the 25th International Conference on Neural Information Processing Systems-Volume 1 (pp. 1223–1231).
- Dean et al., 2012b
Dean, J., Corrado, G. S., Monga, R., Chen, K., Devin, M., Le, Q. V., … others. (2012). Large scale distributed deep networks. Proceedings of the 25th International Conference on Neural Information Processing Systems-Volume 1 (pp. 1223–1231).
- DeCandia et al., 2007
DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., … Vogels, W. (2007). Dynamo: amazon's highly available key-value store. ACM SIGOPS operating systems review (pp. 205–220).
- Deng et al., 2009
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). Imagenet: a large-scale hierarchical image database. 2009 IEEE conference on computer vision and pattern recognition (pp. 248–255).
- DerKiureghian & Ditlevsen, 2009
Der Kiureghian, A., & Ditlevsen, O. (2009). Aleatory or epistemic? Does it matter? Structural safety, 31(2), 105–112.
- Devlin et al., 2018
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Dinh et al., 2014
Dinh, L., Krueger, D., & Bengio, Y. (2014). Nice: non-linear independent components estimation. arXiv preprint arXiv:1410.8516.
- Dinh et al., 2017
Dinh, L., Sohl-Dickstein, J., & Bengio, S. (2017). Density estimation using real nvp. International Conference on Learning Representations.
- Doersch et al., 2015
Doersch, C., Gupta, A., & Efros, A. A. (2015). Unsupervised visual representation learning by context prediction. Proceedings of the IEEE international conference on computer vision (pp. 1422–1430).
- Dosovitskiy et al., 2021
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., … others. (2021). An image is worth 16x16 words: transformers for image recognition at scale. International Conference on Learning Representations.
- Duchi et al., 2011
Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul), 2121–2159.
- Dumoulin & Visin, 2016
Dumoulin, V., & Visin, F. (2016). A guide to convolution arithmetic for deep learning. arXiv preprint arXiv:1603.07285.
- Dwivedi & Bresson, 2020
Dwivedi, V. P., & Bresson, X. (2020). A generalization of transformer networks to graphs. arXiv preprint arXiv:2012.09699.
- Dwork et al., 2015
Dwork, C., Feldman, V., Hardt, M., Pitassi, T., Reingold, O., & Roth, A. L. (2015). Preserving statistical validity in adaptive data analysis. Proceedings of the forty-seventh annual ACM symposium on Theory of computing (pp. 117–126).
- Elman, 1990
Elman, J. L. (1990). Finding structure in time. Cognitive science, 14(2), 179–211.
- Elsken et al., 2018
Elsken, T., Metzen, J. H., & Hutter, F. (2018). Neural architecture search: a survey. arXiv preprint arXiv:1808.05377.
- Fechner, 1860
Fechner, G. T. (1860). Elemente der Psychophysik. Vol. 2. Breitkopf u. Härtel.
- Fedus et al., 2022
Fedus, W., Zoph, B., & Shazeer, N. (2022). Switch transformers: scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120), 1–39.
- Fernando, 2004
Fernando, R. (2004). GPU gems: programming techniques, tips, and tricks for real-time graphics. Vol. 590. Addison-Wesley Reading.
- Feurer & Hutter, 2018
Feurer, M., & Hutter, F. (2018). Hyperparameter optimization. Automatic Machine Learning: Methods, Systems, Challenges. Springer.
- Feurer et al., 2022
Feurer, M., Letham, B., Hutter, F., & Bakshy, E. (2022). Practical transfer learning for bayesian optimization. arXiv preprint arXiv:1802.02219.
- Field, 1987
Field, D. J. (1987). Relations between the statistics of natural images and the response properties of cortical cells. Josa a, 4(12), 2379–2394.
- Fisher, 1928
Fisher, R. (1928). Statistical methods for research workers. Stechert.
- Flammarion & Bach, 2015
Flammarion, N., & Bach, F. (2015). From averaging to acceleration, there is only a step-size. Conference on Learning Theory (pp. 658–695).
- Forrester et al., 2007
Forrester, A. I., Sóbester, A., & Keane, A. J. (2007). Multi-fidelity optimization via surrogate modelling. Proceedings of the royal society a: mathematical, physical and engineering sciences, 463(2088), 3251–3269.
- Franceschi et al., 2017
Franceschi, L., Donini, M., Frasconi, P., & Pontil, M. (2017). Forward and reverse gradient-based hyperparameter optimization. Proceedings of the 34th International Conference on Machine Learning (ICML'17).
- Frankle & Carbin, 2018
Frankle, J., & Carbin, M. (2018). The lottery ticket hypothesis: finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635.
- Frazier, 2018
Frazier, P. I. (2018). A tutorial on bayesian optimization. arXiv preprint arXiv:1807.02811.
- Freund et al., 1996
Freund, Y., Schapire, R. E., & others. (1996). Experiments with a new boosting algorithm. ICML (pp. 148–156).
- Friedman, 1987
Friedman, J. H. (1987). Exploratory projection pursuit. Journal of the American statistical association, 82(397), 249–266.
- Frostig et al., 2018
Frostig, R., Johnson, M. J., & Leary, C. (2018). Compiling machine learning programs via high-level tracing. Systems for Machine Learning.
- Fukushima, 1982
Fukushima, K. (1982). Neocognitron: a self-organizing neural network model for a mechanism of visual pattern recognition. Competition and cooperation in neural nets (pp. 267–285). Springer.
- Gardner et al., 2018
Gardner, J., Pleiss, G., Weinberger, K. Q., Bindel, D., & Wilson, A. G. (2018). GPyTorch: blackbox matrix-matrix Gaussian process inference with GPU acceleration. Advances in Neural Information Processing Systems.
- Garg et al., 2021
Garg, S., Balakrishnan, S., Kolter, Z., & Lipton, Z. (2021). Ratt: leveraging unlabeled data to guarantee generalization. International Conference on Machine Learning (pp. 3598–3609).
- Gatys et al., 2016
Gatys, L. A., Ecker, A. S., & Bethge, M. (2016). Image style transfer using convolutional neural networks. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2414–2423).
- Gauss, 1809
Gauss, C. F. (1809). Theoria motus corporum coelestium. Werke.
- Gibbs, 1902
Gibbs, J. W. (1902). Elementary principles of statistical mechanics. Charles Scribner's Sons.
- Ginibre, 1965
Ginibre, J. (1965). Statistical ensembles of complex, quaternion, and real matrices. Journal of Mathematical Physics, 6(3), 440–449.
- Girshick, 2015
Girshick, R. (2015). Fast r-cnn. Proceedings of the IEEE international conference on computer vision (pp. 1440–1448).
- Girshick et al., 2014
Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 580–587).
- Glorot & Bengio, 2010
Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. Proceedings of the thirteenth international conference on artificial intelligence and statistics (pp. 249–256).
- Goh, 2017
Goh, G. (2017). Why momentum really works. Distill. URL: http://distill.pub/2017/momentum, doi:10.23915/distill.00006
- Goldberg et al., 1992
Goldberg, D., Nichols, D., Oki, B. M., & Terry, D. (1992). Using collaborative filtering to weave an information tapestry. Communications of the ACM, 35(12), 61–71.
- Golub & VanLoan, 1996
Golub, G. H., & Van Loan, C. F. (1996). Matrix computations. Johns Hopkins studies in the mathematical sciences.
- Goodfellow et al., 2016
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. http://www.deeplearningbook.org.
- Goodfellow et al., 2014
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., … Bengio, Y. (2014). Generative adversarial nets. Advances in neural information processing systems (pp. 2672–2680).
- Gotmare et al., 2018
Gotmare, A., Keskar, N. S., Xiong, C., & Socher, R. (2018). A closer look at deep learning heuristics: learning rate restarts, warmup and distillation. arXiv preprint arXiv:1810.13243.
- Goyal et al., 2021
Goyal, A., Bochkovskiy, A., Deng, J., & Koltun, V. (2021). Non-deep networks. arXiv preprint arXiv:2110.07641.
- Graham, 2014
Graham, B. (2014). Fractional max-pooling. arXiv preprint arXiv:1412.6071.
- Graves, 2013
Graves, A. (2013). Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.
- Graves et al., 2008
Graves, A., Liwicki, M., Fernández, S., Bertolami, R., Bunke, H., & Schmidhuber, J. (2008). A novel connectionist system for unconstrained handwriting recognition. IEEE transactions on pattern analysis and machine intelligence, 31(5), 855–868.
- Graves & Schmidhuber, 2005
Graves, A., & Schmidhuber, J. (2005). Framewise phoneme classification with bidirectional lstm and other neural network architectures. Neural networks, 18(5-6), 602–610.
- Griewank, 1989
Griewank, A. (1989). On automatic differentiation. Mathematical Programming: recent developments and applications, 6(6), 83–107.
- Gulati et al., 2020
Gulati, A., Qin, J., Chiu, C.-C., Parmar, N., Zhang, Y., Yu, J., … others. (2020). Conformer: convolution-augmented transformer for speech recognition. Proc. Interspeech 2020, pp. 5036–5040.
- Gunawardana & Shani, 2015
Gunawardana, A., & Shani, G. (2015). Evaluating recommender systems. Recommender systems handbook (pp. 265–308). Springer.
- Guo et al., 2017
Guo, H., Tang, R., Ye, Y., Li, Z., & He, X. (2017). Deepfm: a factorization-machine based neural network for ctr prediction. Proceedings of the 26th International Joint Conference on Artificial Intelligence (pp. 1725–1731).
- Guyon et al., 2008
Guyon, I., Gunn, S., Nikravesh, M., & Zadeh, L. A. (2008). Feature extraction: foundations and applications. Vol. 207. Springer.
- Hadjis et al., 2016
Hadjis, S., Zhang, C., Mitliagkas, I., Iter, D., & Ré, C. (2016). Omnivore: an optimizer for multi-device deep learning on cpus and gpus. arXiv preprint arXiv:1606.04487.
- Hartley & Zisserman, 2000
Hartley, R., & Zisserman, A. (2000). Multiple View Geometry in Computer Vision. Cambridge University Press.
- Hartley & Kahl, 2009
Hartley, R. I., & Kahl, F. (2009). Global optimization through rotation space search. International Journal of Computer Vision, 82(1), 64–79.
- He et al., 2022
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2022). Masked autoencoders are scalable vision learners. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 16000–16009).
- He et al., 2017a
He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask r-cnn. Proceedings of the IEEE international conference on computer vision (pp. 2961–2969).
- He et al., 2015
He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving deep into rectifiers: surpassing human-level performance on imagenet classification. Proceedings of the IEEE international conference on computer vision (pp. 1026–1034).
- He et al., 2016a
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778).
- He et al., 2016b
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Identity mappings in deep residual networks. European conference on computer vision (pp. 630–645).
- He & Chua, 2017
He, X., & Chua, T.-S. (2017). Neural factorization machines for sparse predictive analytics. Proceedings of the 40th International ACM SIGIR conference on Research and Development in Information Retrieval (pp. 355–364).
- He et al., 2017b
He, X., Liao, L., Zhang, H., Nie, L., Hu, X., & Chua, T.-S. (2017). Neural collaborative filtering. Proceedings of the 26th international conference on world wide web (pp. 173–182).
- Hebb, 1949
Hebb, D. O. (1949). The organization of behavior. Vol. 65. Wiley New York.
- Hendrycks & Gimpel, 2016a
Hendrycks, D., & Gimpel, K. (2016). Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415.
- Hendrycks & Gimpel, 2016b
Hendrycks, D., & Gimpel, K. (2016). Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415.
- Hennessy & Patterson, 2011
Hennessy, J. L., & Patterson, D. A. (2011). Computer architecture: a quantitative approach. Elsevier.
- Herlocker et al., 1999
Herlocker, J. L., Konstan, J. A., Borchers, A., & Riedl, J. (1999). An algorithmic framework for performing collaborative filtering. 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 1999 (pp. 230–237).
- Hidasi et al., 2015
Hidasi, B., Karatzoglou, A., Baltrunas, L., & Tikk, D. (2015). Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939.
- Ho et al., 2020
Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33, 6840–6851.
- Hochreiter et al., 2001
Hochreiter, S., Bengio, Y., Frasconi, P., Schmidhuber, J., & others. (2001). Gradient flow in recurrent nets: the difficulty of learning long-term dependencies.
- Hochreiter & Schmidhuber, 1997
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735–1780.
- Hoffmann et al., 2022
Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., … others. (2022). Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.
- Howard et al., 2019
Howard, A., Sandler, M., Chu, G., Chen, L.-C., Chen, B., Tan, M., … Adam, H. (2019). Searching for mobilenetv3. Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 1314–1324).
- Hoyer et al., 2009
Hoyer, P. O., Janzing, D., Mooij, J. M., Peters, J., & Schölkopf, B. (2009). Nonlinear causal discovery with additive noise models. Advances in neural information processing systems (pp. 689–696).
- Hu et al., 2018
Hu, J., Shen, L., & Sun, G. (2018). Squeeze-and-excitation networks. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 7132–7141).
- Hu et al., 2008
Hu, Y., Koren, Y., & Volinsky, C. (2008). Collaborative filtering for implicit feedback datasets. 2008 Eighth IEEE International Conference on Data Mining (pp. 263–272).
- Hu et al., 2022
Hu, Z., Lee, R. K.-W., Aggarwal, C. C., & Zhang, A. (2022). Text style transfer: a review and experimental evaluation. SIGKDD Explor. Newsl., 24(1), 14–45. doi:10.1145/3544903.3544906
- Huang et al., 2018
Huang, C.-Z. A., Vaswani, A., Uszkoreit, J., Simon, I., Hawthorne, C., Shazeer, N., … Eck, D. (2018). Music transformer: generating music with long-term structure. International Conference on Learning Representations.
- Huang et al., 2017
Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017). Densely connected convolutional networks. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4700–4708).
- Huang et al., 2015
Huang, Z., Xu, W., & Yu, K. (2015). Bidirectional lstm-crf models for sequence tagging. arXiv preprint arXiv:1508.01991.
- Hubel & Wiesel, 1959
Hubel, D. H., & Wiesel, T. N. (1959). Receptive fields of single neurones in the cat's striate cortex. The Journal of physiology, 148(3), 574–591.
- Hubel & Wiesel, 1962
Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. The Journal of physiology, 160(1), 106–154.
- Hubel & Wiesel, 1968
Hubel, D. H., & Wiesel, T. N. (1968). Receptive fields and functional architecture of monkey striate cortex. The Journal of physiology, 195(1), 215–243.
- Hutter et al., 2011
Hutter, F., Hoos, H., & Leyton-Brown, K. (2011). Sequential model-based optimization for general algorithm configuration. Proceedings of the Fifth International Conference on Learning and Intelligent Optimization (LION'11).
- Hutter et al., 2019
Hutter, F., Kotthoff, L., & Vanschoren, J. (Eds.) (2019). Automated Machine Learning: Methods, Systems, Challenges. Springer.
- Ioffe, 2017
Ioffe, S. (2017). Batch renormalization: towards reducing minibatch dependence in batch-normalized models. Advances in neural information processing systems (pp. 1945–1953).
- Ioffe & Szegedy, 2015
Ioffe, S., & Szegedy, C. (2015). Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
- Izmailov et al., 2018
Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D., & Wilson, A. G. (2018). Averaging weights leads to wider optima and better generalization. arXiv preprint arXiv:1803.05407.
- Jacot et al., 2018
Jacot, A., Gabriel, F., & Hongler, C. (2018). Neural tangent kernel: convergence and generalization in neural networks. Advances in neural information processing systems.
- Jaeger, 2002
Jaeger, H. (2002). Tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the "echo state network" approach. Vol. 5. GMD-Forschungszentrum Informationstechnik Bonn.
- Jamieson & Talwalkar, 2016
Jamieson, K., & Talwalkar, A. (2016). Non-stochastic best arm identification and hyperparameter optimization. Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS'16).
- Jenatton et al., 2017
Jenatton, R., Archambeau, C., González, J., & Seeger, M. (2017). Bayesian Optimization with Tree-structured Dependencies. Proceedings of the 34th International Conference on Machine Learning (ICML'17).
- Jia et al., 2018
Jia, X., Song, S., He, W., Wang, Y., Rong, H., Zhou, F., … others. (2018). Highly scalable deep learning training system with mixed-precision: training imagenet in four minutes. arXiv preprint arXiv:1807.11205.
- Jia et al., 2014
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., … Darrell, T. (2014). Caffe: convolutional architecture for fast feature embedding. Proceedings of the 22nd ACM international conference on Multimedia (pp. 675–678).
- Joshi et al., 2020
Joshi, M., Chen, D., Liu, Y., Weld, D. S., Zettlemoyer, L., & Levy, O. (2020). Spanbert: improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8, 64–77.
- Jouppi et al., 2017
Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., … others. (2017). In-datacenter performance analysis of a tensor processing unit. 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA) (pp. 1–12).
- Kalchbrenner et al., 2014
Kalchbrenner, N., Grefenstette, E., & Blunsom, P. (2014). A convolutional neural network for modelling sentences. arXiv preprint arXiv:1404.2188.
- Kalman & Kwasny, 1992
Kalman, B. L., & Kwasny, S. C. (1992). Why tanh: choosing a sigmoidal function. Proceedings of the 1992 IJCNN International Joint Conference on Neural Networks (pp. 578–581).
- Kaplan et al., 2020
Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., … Amodei, D. (2020). Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
- Karnin et al., 2013
Karnin, Z., Koren, T., & Somekh, O. (2013). Almost optimal exploration in multi-armed bandits. In Proceedings of the 30th International Conference on Machine Learning (ICML'13).
- Karras et al., 2017
Karras, T., Aila, T., Laine, S., & Lehtinen, J. (2017). Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196.
- Kim et al., 2017
Kim, J., El-Khamy, M., & Lee, J. (2017). Residual lstm: design of a deep recurrent architecture for distant speech recognition. arXiv preprint arXiv:1701.03360.
- Kim, 2014
Kim, Y. (2014). Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.
- Kimeldorf & Wahba, 1971
Kimeldorf, G. S., & Wahba, G. (1971). Some results on Tchebycheffian spline functions. J. Math. Anal. Appl., 33, 82–95.
- Kingma & Ba, 2014
Kingma, D. P., & Ba, J. (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Kingma & Welling, 2014
Kingma, D. P., & Welling, M. (2014). Auto-Encoding Variational Bayes. International Conference on Learning Representations (ICLR).
- Kipf & Welling, 2016
Kipf, T. N., & Welling, M. (2016). Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
- Kojima et al., 2022
Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., & Iwasawa, Y. (2022). Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916.
- Koller & Friedman, 2009
Koller, D., & Friedman, N. (2009). Probabilistic graphical models: principles and techniques. MIT press.
- Kolmogorov, 1933
Kolmogorov, A. (1933). Sulla determinazione empirica di una legge di distribuzione. Inst. Ital. Attuari, Giorn., 4, 83–91.
- Kolter, 2008
Kolter, Z. (2008). Linear algebra review and reference. Available online: http://cs229.stanford.edu/section/cs229-linalg.pdf.
- Koren et al., 2009
Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix factorization techniques for recommender systems. Computer, pp. 30–37.
- Krizhevsky et al., 2012
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems (pp. 1097–1105).
- Kung, 1988
Kung, S. Y. (1988). Vlsi array processors. Englewood Cliffs, NJ: Prentice Hall.
- Kuzovkin et al., 2018
Kuzovkin, I., Vicente, R., Petton, M., Lachaux, J.-P., Baciu, M., Kahane, P., … Aru, J. (2018). Activations of deep convolutional neural networks are aligned with gamma band activity of human visual cortex. Communications biology, 1(1), 1–12.
- Lan et al., 2019
Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R. (2019). Albert: a lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.
- Lavin & Gray, 2016
Lavin, A., & Gray, S. (2016). Fast algorithms for convolutional neural networks. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4013–4021).
- Le, 2013
Le, Q. V. (2013). Building high-level features using large scale unsupervised learning. 2013 IEEE international conference on acoustics, speech and signal processing (pp. 8595–8598).
- LeCun et al., 1995a
LeCun, Y., Bengio, Y., & others. (1995). Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks. MIT Press.
- LeCun et al., 1989
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural computation, 1(4), 541–551.
- LeCun et al., 1998a
LeCun, Y., Bottou, L., Orr, G., & Muller, K.-R. (1998). Efficient backprop. Neural Networks: Tricks of the Trade. New York: Springer.
- LeCun et al., 1998b
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., & others. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
- LeCun et al., 1995b
LeCun, Y., Jackel, L., Bottou, L., Brunot, A., Cortes, C., Denker, J., … others. (1995). Comparison of learning algorithms for handwritten digit recognition. International conference on artificial neural networks (pp. 53–60).
- Legendre, 1805
Legendre, A. M. (1805). Mémoire sur les opérations trigonométriques: dont les résultats dépendent de la figure de la terre. F. Didot.
- Lewis et al., 2019
Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., … Zettlemoyer, L. (2019). Bart: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.
- Lewkowycz et al., 2022
Lewkowycz, A., Andreassen, A., Dohan, D., Dyer, E., Michalewski, H., Ramasesh, V., … others. (2022). Solving quantitative reasoning problems with language models. arXiv preprint arXiv:2206.14858.
- Li et al., 2018
Li, L., Jamieson, K., Rostamizadeh, A., Gonina, K., Hardt, M., Recht, B., & Talwalkar, A. (2018). Massively parallel hyperparameter tuning. arXiv preprint arXiv:1810.05934.
- Li, 2017
Li, M. (2017). Scaling distributed machine learning with system and algorithm co-design (Doctoral dissertation). Carnegie Mellon University.
- Li et al., 2014a
Li, M., Andersen, D. G., Park, J. W., Smola, A. J., Ahmed, A., Josifovski, V., … Su, B.-Y. (2014). Scaling distributed machine learning with the parameter server. 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14) (pp. 583–598).
- Li et al., 2014b
Li, M., Zhang, T., Chen, Y., & Smola, A. J. (2014). Efficient mini-batch training for stochastic optimization. Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 661–670).
- Liaw et al., 2018
Liaw, R., Liang, E., Nishihara, R., Moritz, P., Gonzalez, J., & Stoica, I. (2018). Tune: a research platform for distributed model selection and training. arXiv preprint arXiv:1807.05118.
- Lin et al., 2013
Lin, M., Chen, Q., & Yan, S. (2013). Network in network. arXiv preprint arXiv:1312.4400.
- Lin et al., 2017a
Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal loss for dense object detection. Proceedings of the IEEE international conference on computer vision (pp. 2980–2988).
- Lin et al., 2010
Lin, Y., Lv, F., Zhu, S., Yang, M., Cour, T., Yu, K., … others. (2010). Imagenet classification: fast descriptor coding and large-scale svm training. Large scale visual recognition challenge.
- Lin et al., 2017b
Lin, Z., Feng, M., Santos, C. N. d., Yu, M., Xiang, B., Zhou, B., & Bengio, Y. (2017). A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130.
- Lipton et al., 2015
Lipton, Z. C., Berkowitz, J., & Elkan, C. (2015). A critical review of recurrent neural networks for sequence learning. arXiv preprint arXiv:1506.00019.
- Lipton et al., 2016
Lipton, Z. C., Kale, D. C., Elkan, C., & Wetzel, R. (2016). Learning to diagnose with lstm recurrent neural networks. International Conference on Learning Representations (ICLR).
- Lipton & Steinhardt, 2018
Lipton, Z. C., & Steinhardt, J. (2018). Troubling trends in machine learning scholarship. Communications of the ACM (CACM).
- Liu & Nocedal, 1989
Liu, D. C., & Nocedal, J. (1989). On the limited memory bfgs method for large scale optimization. Mathematical programming, 45(1), 503–528.
- Liu et al., 2018
Liu, H., Simonyan, K., & Yang, Y. (2018). Darts: differentiable architecture search. arXiv preprint arXiv:1806.09055.
- Liu et al., 2016
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., & Berg, A. C. (2016). Ssd: single shot multibox detector. European conference on computer vision (pp. 21–37).
- Liu et al., 2019a
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., … Stoyanov, V. (2019). Roberta: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
- Liu et al., 2019b
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., … Stoyanov, V. (2019). Roberta: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
- Liu et al., 2021
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., … Guo, B. (2021). Swin transformer: hierarchical vision transformer using shifted windows. Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 10012–10022).
- Liu et al., 2022
Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., & Xie, S. (2022). A convnet for the 2020s. arXiv preprint arXiv:2201.03545.
- Long et al., 2015
Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3431–3440).
- Loshchilov & Hutter, 2016
Loshchilov, I., & Hutter, F. (2016). Sgdr: stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983.
- Lowe, 2004
Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60(2), 91–110.
- Luo et al., 2018
Luo, P., Wang, X., Shao, W., & Peng, Z. (2018). Towards understanding regularization in batch normalization. arXiv preprint.
- Maas et al., 2011
Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., & Potts, C. (2011). Learning word vectors for sentiment analysis. Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies-volume 1 (pp. 142–150).
- Mack & Silverman, 1982
Mack, Y.-p., & Silverman, B. W. (1982). Weak and strong uniform consistency of kernel regression estimates. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 61(3), 405–415.
- MacKay, 2003
MacKay, D. J. (2003). Information theory, inference and learning algorithms. Cambridge university press.
- Maclaurin et al., 2015
Maclaurin, D., Duvenaud, D., & Adams, R. (2015). Gradient-based hyperparameter optimization through reversible learning. Proceedings of the 32nd International Conference on Machine Learning (ICML'15).
- Mangasarian, 1965
Mangasarian, O. L. (1965). Linear and nonlinear separation of patterns by linear programming. Oper. Res., 13, 444–452.
- Mangram, 2013
Mangram, M. E. (2013). A simplified perspective of the markowitz portfolio theory. Global journal of business research, 7(1), 59–70.
- Matthews et al., 2018
Matthews, A. G. d. G., Rowland, M., Hron, J., Turner, R. E., & Ghahramani, Z. (2018). Gaussian process behaviour in wide deep neural networks. arXiv preprint arXiv:1804.11271.
- McCann et al., 2017
McCann, B., Bradbury, J., Xiong, C., & Socher, R. (2017). Learned in translation: contextualized word vectors. Advances in Neural Information Processing Systems (pp. 6294–6305).
- McCulloch & Pitts, 1943
McCulloch, W. S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. The bulletin of mathematical biophysics, 5(4), 115–133.
- McMahan et al., 2013
McMahan, H. B., Holt, G., Sculley, D., Young, M., Ebner, D., Grady, J., … others. (2013). Ad click prediction: a view from the trenches. Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 1222–1230).
- Mead, 1980
Mead, C. (1980). Introduction to vlsi systems. IEE Proceedings I-Solid-State and Electron Devices, 128(1), 18.
- Merity et al., 2016
Merity, S., Xiong, C., Bradbury, J., & Socher, R. (2016). Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843.
- Micchelli, 1984
Micchelli, C. A. (1984). Interpolation of scattered data: distance matrices and conditionally positive definite functions. Approximation theory and spline functions (pp. 143–145). Springer.
- Mikolov et al., 2013a
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
- Mikolov et al., 2013b
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems (pp. 3111–3119).
- Miller, 1995
Miller, G. A. (1995). Wordnet: a lexical database for english. Communications of the ACM, 38(11), 39–41.
- Mirhoseini et al., 2017
Mirhoseini, A., Pham, H., Le, Q. V., Steiner, B., Larsen, R., Zhou, Y., … Dean, J. (2017). Device placement optimization with reinforcement learning. Proceedings of the 34th International Conference on Machine Learning-Volume 70 (pp. 2430–2439).
- Mnih et al., 2014
Mnih, V., Heess, N., Graves, A., & others. (2014). Recurrent models of visual attention. Advances in neural information processing systems (pp. 2204–2212).
- Mnih et al., 2013
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013). Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
- Mnih et al., 2015
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., … others. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533.
- Moon et al., 2010
Moon, T., Smola, A., Chang, Y., & Zheng, Z. (2010). Intervalrank: isotonic regression with listwise and pairwise constraints. Proceedings of the third ACM international conference on Web search and data mining (pp. 151–160).
- Morey et al., 2016
Morey, R. D., Hoekstra, R., Rouder, J. N., Lee, M. D., & Wagenmakers, E.-J. (2016). The fallacy of placing confidence in confidence intervals. Psychonomic bulletin & review, 23(1), 103–123.
- Morozov, 1984
Morozov, V. A. (1984). Methods for solving incorrectly posed problems. Springer Science & Business Media.
- Nadaraya, 1964
Nadaraya, E. A. (1964). On estimating regression. Theory of Probability & Its Applications, 9(1), 141–142.
- Nair & Hinton, 2010
Nair, V., & Hinton, G. E. (2010). Rectified linear units improve restricted boltzmann machines. ICML.
- Nakkiran et al., 2021
Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., & Sutskever, I. (2021). Deep double descent: where bigger models and more data hurt. Journal of Statistical Mechanics: Theory and Experiment, 2021(12), 124003.
- Naor & Reingold, 1999
Naor, M., & Reingold, O. (1999). On the construction of pseudorandom permutations: Luby–Rackoff revisited. Journal of Cryptology, 12(1), 29–66.
- Neal, 1996
Neal, R. M. (1996). Bayesian learning for neural networks. Springer Science & Business Media.
- Nesterov & Vial, 2000
Nesterov, Y., & Vial, J.-P. (2000). Confidence level solutions for stochastic programming. Stochastic Programming E-Print Series.
- Nesterov, 2018
Nesterov, Y. (2018). Lectures on convex optimization. Vol. 137. Springer.
- Neyman, 1937
Neyman, J. (1937). Outline of a theory of statistical estimation based on the classical theory of probability. Philosophical Transactions of the Royal Society of London. Series A, Mathematical and Physical Sciences, 236(767), 333–380.
- Norelli et al., 2022
Norelli, A., Fumero, M., Maiorca, V., Moschella, L., Rodolà, E., & Locatello, F. (2022). Asif: coupled data turns unimodal models to multimodal without training. arXiv preprint arXiv:2210.01738.
- Novak et al., 2018
Novak, R., Xiao, L., Lee, J., Bahri, Y., Yang, G., Hron, J., … Sohl-Dickstein, J. (2018). Bayesian deep convolutional networks with many channels are Gaussian processes. arXiv preprint arXiv:1810.05148.
- Novikoff, 1962
Novikoff, A. B. J. (1962). On convergence proofs on perceptrons. Proceedings of the Symposium on the Mathematical Theory of Automata (pp. 615–622).
- Olshausen & Field, 1996
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583), 607–609.
- Ong et al., 2005
Ong, C. S., Smola, A., Williamson, R., & others. (2005). Learning the kernel with hyperkernels. Journal of Machine Learning Research.
- Ouyang et al., 2022
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., … others. (2022). Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155.
- Papineni et al., 2002
Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). Bleu: a method for automatic evaluation of machine translation. Proceedings of the 40th annual meeting of the Association for Computational Linguistics (pp. 311–318).
- Parikh et al., 2016
Parikh, A. P., Täckström, O., Das, D., & Uszkoreit, J. (2016). A decomposable attention model for natural language inference. arXiv preprint arXiv:1606.01933.
- Park et al., 2019
Park, T., Liu, M.-Y., Wang, T.-C., & Zhu, J.-Y. (2019). Semantic image synthesis with spatially-adaptive normalization. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2337–2346).
- Parzen, 1957
Parzen, E. (1957). On consistent estimates of the spectrum of a stationary time series. The Annals of Mathematical Statistics, pp. 329–348.
- Paszke et al., 2019
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., … others. (2019). Pytorch: an imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 8026–8037.
- Paulus et al., 2017
Paulus, R., Xiong, C., & Socher, R. (2017). A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304.
- Pennington et al., 2017
Pennington, J., Schoenholz, S., & Ganguli, S. (2017). Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice. Advances in neural information processing systems (pp. 4785–4795).
- Pennington et al., 2014
Pennington, J., Socher, R., & Manning, C. (2014). Glove: global vectors for word representation. Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1532–1543).
- Peters et al., 2017a
Peters, J., Janzing, D., & Schölkopf, B. (2017). Elements of causal inference: foundations and learning algorithms. MIT press.
- Peters et al., 2017b
Peters, M., Ammar, W., Bhagavatula, C., & Power, R. (2017). Semi-supervised sequence tagging with bidirectional language models. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 1756–1765).
- Peters et al., 2018
Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) (pp. 2227–2237).
- Petersen et al., 2008
Petersen, K. B., Pedersen, M. S., & others. (2008). The matrix cookbook. Technical University of Denmark, 7(15), 510.
- Pleiss et al., 2017
Pleiss, G., Chen, D., Huang, G., Li, T., Van Der Maaten, L., & Weinberger, K. Q. (2017). Memory-efficient implementation of densenets. arXiv preprint arXiv:1707.06990.
- Polyak, 1964
Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5), 1–17.
- Popper, 2005
Popper, K. (2005). The logic of scientific discovery. Routledge.
- Prakash et al., 2016
Prakash, A., Hasan, S. A., Lee, K., Datla, V., Qadir, A., Liu, J., & Farri, O. (2016). Neural paraphrase generation with stacked residual lstm networks. arXiv preprint arXiv:1610.03098.
- Quadrana et al., 2018
Quadrana, M., Cremonesi, P., & Jannach, D. (2018). Sequence-aware recommender systems. ACM Computing Surveys (CSUR), 51(4), 66.
- Quinlan, 2014
Quinlan, J. R. (2014). C4.5: programs for machine learning. Elsevier.
- Rabiner & Juang, 1993
Rabiner, L., & Juang, B.-H. (1993). Fundamentals of speech recognition. Prentice-Hall, Inc.
- Radford et al., 2021
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., … others. (2021). Learning transferable visual models from natural language supervision. International Conference on Machine Learning (pp. 8748–8763).
- Radford et al., 2015
Radford, A., Metz, L., & Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.
- Radford et al., 2018
Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training. OpenAI.
- Radford et al., 2019
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 9.
- Radosavovic et al., 2019
Radosavovic, I., Johnson, J., Xie, S., Lo, W.-Y., & Dollár, P. (2019). On network design spaces for visual recognition. Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 1882–1890).
- Radosavovic et al., 2020
Radosavovic, I., Kosaraju, R. P., Girshick, R., He, K., & Dollár, P. (2020). Designing network design spaces. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10428–10436).
- Rae et al., 2021
Rae, J. W., Borgeaud, S., Cai, T., Millican, K., Hoffmann, J., Song, F., … others. (2021). Scaling language models: methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446.
- Raffel et al., 2020
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., … Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21, 1–67.
- Rajpurkar et al., 2016
Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. (2016). Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.
- Ramachandran et al., 2019
Ramachandran, P., Parmar, N., Vaswani, A., Bello, I., Levskaya, A., & Shlens, J. (2019). Stand-alone self-attention in vision models. Advances in Neural Information Processing Systems, 32.
- Ramachandran et al., 2017
Ramachandran, P., Zoph, B., & Le, Q. V. (2017). Searching for activation functions. arXiv preprint arXiv:1710.05941.
- Ramesh et al., 2022
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., & Chen, M. (2022). Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125.
- Ranzato et al., 2007
Ranzato, M., Boureau, Y.-L., Chopra, S., & LeCun, Y. (2007). A unified energy-based framework for unsupervised learning. Artificial Intelligence and Statistics (pp. 371–379).
- Rasmussen & Williams, 2006
Rasmussen, C. E., & Williams, C. K. (2006). Gaussian processes for machine learning. Vol. 2. MIT press.
- Reddi et al., 2019
Reddi, S. J., Kale, S., & Kumar, S. (2019). On the convergence of adam and beyond. arXiv preprint arXiv:1904.09237.
- Redmon et al., 2016
Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: unified, real-time object detection. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 779–788).
- Redmon & Farhadi, 2018
Redmon, J., & Farhadi, A. (2018). Yolov3: an incremental improvement. arXiv preprint arXiv:1804.02767.
- Reed & DeFreitas, 2015
Reed, S., & De Freitas, N. (2015). Neural programmer-interpreters. arXiv preprint arXiv:1511.06279.
- Reed et al., 2022
Reed, S., Zolna, K., Parisotto, E., Colmenarejo, S. G., Novikov, A., Barth-Maron, G., … others. (2022). A generalist agent. arXiv preprint arXiv:2205.06175.
- Ren et al., 2015
Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster r-cnn: towards real-time object detection with region proposal networks. Advances in neural information processing systems (pp. 91–99).
- Rendle, 2010
Rendle, S. (2010). Factorization machines. 2010 IEEE International Conference on Data Mining (pp. 995–1000).
- Rendle et al., 2009
Rendle, S., Freudenthaler, C., Gantner, Z., & Schmidt-Thieme, L. (2009). Bpr: bayesian personalized ranking from implicit feedback. Proceedings of the twenty-fifth conference on uncertainty in artificial intelligence (pp. 452–461).
- Revels et al., 2016
Revels, J., Lubin, M., & Papamarkou, T. (2016). Forward-mode automatic differentiation in julia. arXiv preprint arXiv:1607.07892.
- Rezende et al., 2014
Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. International conference on machine learning (pp. 1278–1286).
- Riesenhuber & Poggio, 1999
Riesenhuber, M., & Poggio, T. (1999). Hierarchical models of object recognition in cortex. Nature neuroscience, 2(11), 1019–1025.
- Rockafellar, 1970
Rockafellar, R. T. (1970). Convex Analysis. Vol. 28. Princeton, NJ: Princeton University Press.
- Rolnick et al., 2017
Rolnick, D., Veit, A., Belongie, S., & Shavit, N. (2017). Deep learning is robust to massive label noise. arXiv preprint arXiv:1705.10694.
- Rudin, 1973
Rudin, W. (1973). Functional Analysis. New York: McGraw-Hill.
- Rumelhart et al., 1988
Rumelhart, D. E., Hinton, G. E., Williams, R. J., & others. (1988). Learning representations by back-propagating errors. Cognitive modeling, 5(3), 1.
- Russakovsky et al., 2013
Russakovsky, O., Deng, J., Huang, Z., Berg, A. C., & Fei-Fei, L. (2013). Detecting avocados to zucchinis: what have we done, and where are we going? International Conference on Computer Vision (ICCV).
- Russakovsky et al., 2015
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., … others. (2015). Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3), 211–252.
- Russell & Norvig, 2016
Russell, S. J., & Norvig, P. (2016). Artificial intelligence: a modern approach. Pearson Education Limited.
- Saharia et al., 2022
Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E., … others. (2022). Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487.
- Salinas et al., 2022
Salinas, D., Seeger, M., Klein, A., Perrone, V., Wistuba, M., & Archambeau, C. (2022). Syne tune: a library for large scale hyperparameter tuning and reproducible research. First Conference on Automated Machine Learning (Main Track).
- Sanh et al., 2019
Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
- Santurkar et al., 2018
Santurkar, S., Tsipras, D., Ilyas, A., & Madry, A. (2018). How does batch normalization help optimization? Advances in Neural Information Processing Systems (pp. 2483–2493).
- Sarwar et al., 2001
Sarwar, B. M., Karypis, G., Konstan, J. A., Riedl, J., & others. (2001). Item-based collaborative filtering recommendation algorithms. WWW, 1, 285–295.
- Schein et al., 2002
Schein, A. I., Popescul, A., Ungar, L. H., & Pennock, D. M. (2002). Methods and metrics for cold-start recommendations. Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 253–260).
- Scholkopf & Smola, 2002
Schölkopf, B., & Smola, A. J. (2002). Learning with kernels: support vector machines, regularization, optimization, and beyond. Adaptive Computation and Machine Learning Series.
- Schuhmann et al., 2022
Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., … others. (2022). Laion-5b: an open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402.
- Schuster & Paliwal, 1997
Schuster, M., & Paliwal, K. K. (1997). Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11), 2673–2681.
- Scholkopf et al., 2001
Schölkopf, B., Herbrich, R., & Smola, A. J. (2001). A generalized representer theorem. In D. P. Helmbold & B. Williamson (Eds.), Proc. Annual Conf. Computational Learning Theory (pp. 416–426). London, UK: Springer-Verlag.
- Scholkopf et al., 1996
Schölkopf, B., Burges, C., & Vapnik, V. (1996). Incorporating invariances in support vector learning machines. International Conference on Artificial Neural Networks (pp. 47–52).
- Sedhain et al., 2015
Sedhain, S., Menon, A. K., Sanner, S., & Xie, L. (2015). Autorec: autoencoders meet collaborative filtering. Proceedings of the 24th International Conference on World Wide Web (pp. 111–112).
- Sennrich et al., 2015
Sennrich, R., Haddow, B., & Birch, A. (2015). Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.
- Sergeev & DelBalso, 2018
Sergeev, A., & Del Balso, M. (2018). Horovod: fast and easy distributed deep learning in tensorflow. arXiv preprint arXiv:1802.05799.
- Shannon, 1948
Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical Journal, 27(3), 379–423.
- Shao et al., 2020
Shao, H., Yao, S., Sun, D., Zhang, A., Liu, S., Liu, D., … Abdelzaher, T. (2020). Controlvae: controllable variational autoencoder. Proceedings of the 37th International Conference on Machine Learning.
- Shaw et al., 2018
Shaw, P., Uszkoreit, J., & Vaswani, A. (2018). Self-attention with relative position representations. arXiv preprint arXiv:1803.02155.
- Shoeybi et al., 2019
Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., & Catanzaro, B. (2019). Megatron-lm: training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053.
- Silver et al., 2016
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., … others. (2016). Mastering the game of go with deep neural networks and tree search. Nature, 529(7587), 484–489.
- Silverman, 1986
Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. London: Chapman and Hall.
- Simard et al., 1998
Simard, P. Y., LeCun, Y. A., Denker, J. S., & Victorri, B. (1998). Transformation invariance in pattern recognition—tangent distance and tangent propagation. Neural networks: tricks of the trade (pp. 239–274). Springer.
- Simonyan & Zisserman, 2014
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- Sindhwani et al., 2015
Sindhwani, V., Sainath, T. N., & Kumar, S. (2015). Structured transforms for small-footprint deep learning. arXiv preprint arXiv:1510.01722.
- Sivic & Zisserman, 2003
Sivic, J., & Zisserman, A. (2003). Video google: a text retrieval approach to object matching in videos. Proceedings of the IEEE International Conference on Computer Vision (pp. 1470–1477).
- Smith et al., 2022
Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., … others. (2022). Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990.
- Smola & Narayanamurthy, 2010
Smola, A., & Narayanamurthy, S. (2010). An architecture for parallel topic models. Proceedings of the VLDB Endowment, 3(1-2), 703–710.
- Snoek et al., 2012
Snoek, J., Larochelle, H., & Adams, R. (2012). Practical Bayesian optimization of machine learning algorithms. Advances in Neural Information Processing Systems 25 (pp. 2951–2959).
- Sohl-Dickstein et al., 2015
Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., & Ganguli, S. (2015). Deep unsupervised learning using nonequilibrium thermodynamics. International Conference on Machine Learning (pp. 2256–2265).
- Song & Ermon, 2019
Song, Y., & Ermon, S. (2019). Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32.
- Song et al., 2021
Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., & Poole, B. (2021). Score-based generative modeling through stochastic differential equations. International Conference on Learning Representations.
- Speelpenning, 1980
Speelpenning, B. (1980). Compiling fast partial derivatives of functions given by algorithms (Doctoral dissertation). University of Illinois at Urbana-Champaign.
- Srivastava et al., 2022
Srivastava, A., Rastogi, A., Rao, A., Shoeb, A. A. M., Abid, A., Fisch, A., … others. (2022). Beyond the imitation game: quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615.
- Srivastava et al., 2014
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1), 1929–1958.
- Srivastava et al., 2015
Srivastava, R. K., Greff, K., & Schmidhuber, J. (2015). Highway networks. arXiv preprint arXiv:1505.00387.
- Strang, 1993
Strang, G. (1993). Introduction to linear algebra. Vol. 3. Wellesley, MA: Wellesley-Cambridge Press.
- Su & Khoshgoftaar, 2009
Su, X., & Khoshgoftaar, T. M. (2009). A survey of collaborative filtering techniques. Advances in artificial intelligence, 2009.
- Sukhbaatar et al., 2015
Sukhbaatar, S., Weston, J., Fergus, R., & others. (2015). End-to-end memory networks. Advances in neural information processing systems (pp. 2440–2448).
- Sutskever et al., 2013
Sutskever, I., Martens, J., Dahl, G., & Hinton, G. (2013). On the importance of initialization and momentum in deep learning. International conference on machine learning (pp. 1139–1147).
- Sutskever et al., 2014
Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. Advances in neural information processing systems (pp. 3104–3112).
- Szegedy et al., 2017
Szegedy, C., Ioffe, S., Vanhoucke, V., & Alemi, A. A. (2017). Inception-v4, inception-resnet and the impact of residual connections on learning. Thirty-First AAAI Conference on Artificial Intelligence.
- Szegedy et al., 2015
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., … Rabinovich, A. (2015). Going deeper with convolutions. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1–9).
- Szegedy et al., 2016
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer vision. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2818–2826).
- Tallec & Ollivier, 2017
Tallec, C., & Ollivier, Y. (2017). Unbiasing truncated backpropagation through time. arXiv preprint arXiv:1705.08209.
- Tan & Le, 2019
Tan, M., & Le, Q. (2019). Efficientnet: rethinking model scaling for convolutional neural networks. International conference on machine learning (pp. 6105–6114).
- Tang & Wang, 2018
Tang, J., & Wang, K. (2018). Personalized top-n sequential recommendation via convolutional sequence embedding. Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining (pp. 565–573).
- Taskar et al., 2004
Taskar, B., Guestrin, C., & Koller, D. (2004). Max-margin markov networks. Advances in neural information processing systems, 16, 25.
- Tay et al., 2020
Tay, Y., Dehghani, M., Bahri, D., & Metzler, D. (2020). Efficient transformers: a survey. arXiv preprint arXiv:2009.06732.
- Teye et al., 2018
Teye, M., Azizpour, H., & Smith, K. (2018). Bayesian uncertainty estimation for batch normalized deep networks. arXiv preprint arXiv:1802.06455.
- Thomee et al., 2016
Thomee, B., Shamma, D. A., Friedland, G., Elizalde, B., Ni, K., Poland, D., … Li, L.-J. (2016). Yfcc100m: the new data in multimedia research. Communications of the ACM, 59(2), 64–73.
- Tieleman & Hinton, 2012
Tieleman, T., & Hinton, G. (2012). Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2), 26–31.
- Tikhonov & Arsenin, 1977
Tikhonov, A. N., & Arsenin, V. Y. (1977). Solutions of ill-posed problems. W.H. Winston.
- Tolstikhin et al., 2021
Tolstikhin, I. O., Houlsby, N., Kolesnikov, A., Beyer, L., Zhai, X., Unterthiner, T., … others. (2021). Mlp-mixer: an all-mlp architecture for vision. Advances in Neural Information Processing Systems, 34.
- Torralba et al., 2008
Torralba, A., Fergus, R., & Freeman, W. T. (2008). 80 million tiny images: a large data set for nonparametric object and scene recognition. IEEE transactions on pattern analysis and machine intelligence, 30(11), 1958–1970.
- Touvron et al., 2021
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., & Jégou, H. (2021). Training data-efficient image transformers & distillation through attention. International Conference on Machine Learning (pp. 10347–10357).
- Tsoumakas & Katakis, 2007
Tsoumakas, G., & Katakis, I. (2007). Multi-label classification: an overview. International Journal of Data Warehousing and Mining (IJDWM), 3(3), 1–13.
- Turing, 1950
Turing, A. (1950). Computing machinery and intelligence. Mind, 59(236), 433–460.
- Toscher et al., 2009
Töscher, A., Jahrer, M., & Bell, R. M. (2009). The bigchaos solution to the netflix grand prize. Netflix prize documentation, pp. 1–52.
- Uijlings et al., 2013
Uijlings, J. R., Van De Sande, K. E., Gevers, T., & Smeulders, A. W. (2013). Selective search for object recognition. International journal of computer vision, 104(2), 154–171.
- VanLoan & Golub, 1983
Van Loan, C. F., & Golub, G. H. (1983). Matrix Computations. Johns Hopkins University Press.
- Vapnik, 1995
Vapnik, V. (1995). The Nature of Statistical Learning Theory. New York: Springer.
- Vapnik, 1998
Vapnik, V. (1998). Statistical Learning Theory. New York: John Wiley and Sons.
- Vapnik & Chervonenkis, 1964
Vapnik, V., & Chervonenkis, A. (1964). A note on one class of perceptrons. Automation and Remote Control, 25.
- Vapnik & Chervonenkis, 1968
Vapnik, V., & Chervonenkis, A. (1968). Uniform convergence of frequencies of occurrence of events to their probabilities. Dokl. Akad. Nauk SSSR, 181, 915–918.
- Vapnik & Chervonenkis, 1971
Vapnik, V., & Chervonenkis, A. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl., 16(2), 264–281.
- Vapnik & Chervonenkis, 1981
Vapnik, V., & Chervonenkis, A. (1981). The necessary and sufficient conditions for the uniform convergence of averages to their expected values. Teoriya Veroyatnostei i Ee Primeneniya, 26(3), 543–564.
- Vapnik & Chervonenkis, 1991
Vapnik, V., & Chervonenkis, A. (1991). The necessary and sufficient conditions for consistency in the empirical risk minimization method. Pattern Recognition and Image Analysis, 1(3), 283–305.
- Vapnik & Chervonenkis, 1974
Vapnik, V. N., & Chervonenkis, A. Y. (1974). Ordered risk minimization. Automation and Remote Control, 35, 1226–1235, 1403–1412.
- Vapnik, 1992
Vapnik, V. (1992). Principles of risk minimization for learning theory. Advances in neural information processing systems (pp. 831–838).
- Vapnik et al., 1994
Vapnik, V., Levin, E., & Le Cun, Y. (1994). Measuring the vc-dimension of a learning machine. Neural computation, 6(5), 851–876.
- Vaswani et al., 2017
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems (pp. 5998–6008).
- Wahba, 1990
Wahba, G. (1990). Spline models for observational data. SIAM.
- Waibel et al., 1989
Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., & Lang, K. J. (1989). Phoneme recognition using time-delay neural networks. IEEE transactions on acoustics, speech, and signal processing, 37(3), 328–339.
- Wang et al., 2022
Wang, H., Zhang, A., Zheng, S., Shi, X., Li, M., & Wang, Z. (2022). Removing batch normalization boosts adversarial training. International Conference on Machine Learning (pp. 23433–23445).
- Wang et al., 2018
Wang, L., Li, M., Liberty, E., & Smola, A. J. (2018). Optimal message scheduling for aggregation. NETWORKS, 2(3), 2–3.
- Wang et al., 2019
Wang, Q., Li, B., Xiao, T., Zhu, J., Li, C., Wong, D. F., & Chao, L. S. (2019). Learning deep transformer models for machine translation. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 1810–1822).
- Wang et al., 2023
Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., & Zhou, D. (2023). Self-consistency improves chain of thought reasoning in language models. International Conference on Learning Representations.
- Wang et al., 2016
Wang, Y., Davidson, A., Pan, Y., Wu, Y., Riffel, A., & Owens, J. D. (2016). Gunrock: a high-performance graph processing library on the gpu. ACM SIGPLAN Notices (p. 11).
- Warstadt et al., 2019
Warstadt, A., Singh, A., & Bowman, S. R. (2019). Neural network acceptability judgments. Transactions of the Association for Computational Linguistics, 7, 625–641.
- Wasserman, 2013
Wasserman, L. (2013). All of statistics: a concise course in statistical inference. Springer Science & Business Media.
- Watkins & Dayan, 1992
Watkins, C. J. C. H., & Dayan, P. (1992). Q-learning. Machine Learning, 8(3–4), 279–292.
- Watson, 1964
Watson, G. S. (1964). Smooth regression analysis. Sankhyā: The Indian Journal of Statistics, Series A, pp. 359–372.
- Wei et al., 2022a
Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., … others. (2022). Emergent abilities of large language models. arXiv preprint arXiv:2206.07682.
- Wei et al., 2022b
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., & Zhou, D. (2022). Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903.
- Welling & Teh, 2011
Welling, M., & Teh, Y. W. (2011). Bayesian learning via stochastic gradient langevin dynamics. Proceedings of the 28th international conference on machine learning (ICML-11) (pp. 681–688).
- Wengert, 1964
Wengert, R. E. (1964). A simple automatic derivative evaluation program. Communications of the ACM, 7(8), 463–464.
- Werbos, 1990
Werbos, P. J. (1990). Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10), 1550–1560.
- Wigner, 1958
Wigner, E. P. (1958). On the distribution of the roots of certain symmetric matrices. Ann. Math., 67(2), 325–327.
- Wilson & Izmailov, 2020
Wilson, A. G., & Izmailov, P. (2020). Bayesian deep learning and a probabilistic perspective of generalization. Advances in neural information processing systems, 33, 4697–4708.
- Wistuba et al., 2019
Wistuba, M., Rawat, A., & Pedapati, T. (2019). A survey on neural architecture search. arXiv preprint arXiv:1905.01392.
- Wistuba et al., 2018
Wistuba, M., Schilling, N., & Schmidt-Thieme, L. (2018). Scalable gaussian process-based transfer surrogates for hyperparameter optimization. Machine Learning.
- Wolpert et al., 1995
Wolpert, D. H., Macready, W. G., & others. (1995). No free lunch theorems for search. Technical Report SFI-TR-95-02-010, Santa Fe Institute.
- Wood et al., 2011
Wood, F., Gasthaus, J., Archambeau, C., James, L., & Teh, Y. W. (2011). The sequence memoizer. Communications of the ACM, 54(2), 91–98.
- Wu et al., 2018
Wu, B., Wan, A., Yue, X., Jin, P., Zhao, S., Golmant, N., … Keutzer, K. (2018). Shift: a zero flop, zero parameter alternative to spatial convolutions. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 9127–9135).
- Wu et al., 2016
Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., … others. (2016). Google's neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
- Xiao et al., 2017
Xiao, H., Rasul, K., & Vollgraf, R. (2017). Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747.
- Xiao et al., 2018
Xiao, L., Bahri, Y., Sohl-Dickstein, J., Schoenholz, S., & Pennington, J. (2018). Dynamical isometry and a mean field theory of cnns: how to train 10,000-layer vanilla convolutional neural networks. International Conference on Machine Learning (pp. 5393–5402).
- Xie et al., 2017
Xie, S., Girshick, R., Dollár, P., Tu, Z., & He, K. (2017). Aggregated residual transformations for deep neural networks. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1492–1500).
- Xiong et al., 2020
Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C., … Liu, T. (2020). On layer normalization in the transformer architecture. International Conference on Machine Learning (pp. 10524–10533).
- Xiong et al., 2018
Xiong, W., Wu, L., Alleva, F., Droppo, J., Huang, X., & Stolcke, A. (2018). The microsoft 2017 conversational speech recognition system. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5934–5938).
- Cajal & Azoulay, 1894
y Cajal, S. R., & Azoulay, L. (1894). Les nouvelles idées sur la structure du système nerveux chez l'homme et chez les vertébrés [The new ideas on the structure of the nervous system in man and vertebrates]. C. Reinwald.
- Yamaguchi et al., 1990
Yamaguchi, K., Sakamoto, K., Akabane, T., & Fujimoto, Y. (1990). A neural network for speaker-independent isolated word recognition. First International Conference on Spoken Language Processing.
- Yang et al., 2016
Yang, Z., Hu, Z., Deng, Y., Dyer, C., & Smola, A. (2016). Neural machine translation with recurrent attention modeling. arXiv preprint arXiv:1607.05108.
- Yang et al., 2015
Yang, Z., Moczulski, M., Denil, M., De Freitas, N., Smola, A., Song, L., & Wang, Z. (2015). Deep fried convnets. Proceedings of the IEEE International Conference on Computer Vision (pp. 1476–1483).
- Ye et al., 2011
Ye, M., Yin, P., Lee, W.-C., & Lee, D.-L. (2011). Exploiting geographical influence for collaborative point-of-interest recommendation. Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval (pp. 325–334).
- You et al., 2017
You, Y., Gitman, I., & Ginsburg, B. (2017). Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888.
- Yu et al., 2022
Yu, J., Xu, Y., Koh, J. Y., Luong, T., Baid, G., Wang, Z., … Wu, Y. (2022). Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789.
- Zaheer et al., 2018
Zaheer, M., Reddi, S., Sachan, D., Kale, S., & Kumar, S. (2018). Adaptive methods for nonconvex optimization. Advances in Neural Information Processing Systems (pp. 9793–9803).
- Zeiler, 2012
Zeiler, M. D. (2012). Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701.
- Zeiler & Fergus, 2013
Zeiler, M. D., & Fergus, R. (2013). Stochastic pooling for regularization of deep convolutional neural networks. arXiv preprint arXiv:1301.3557.
- Zhang et al., 2021a
Zhang, A., Tay, Y., Zhang, S., Chan, A., Luu, A. T., Hui, S. C., & Fu, J. (2021). Beyond fully-connected layers with quaternions: parameterization of hypercomplex multiplications with 1/n parameters. International Conference on Learning Representations.
- Zhang et al., 2021b
Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2021). Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3), 107–115.
- Zhang et al., 2019
Zhang, S., Yao, L., Sun, A., & Tay, Y. (2019). Deep learning based recommender system: a survey and new perspectives. ACM Computing Surveys (CSUR), 52(1), 5.
- Zhang et al., 2022
Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., … others. (2022). Opt: open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.
- Zhang & others, 1988
Zhang, W., & others. (1988). Shift-invariant pattern recognition neural network and its optical architecture. Proceedings of annual conference of the Japan Society of Applied Physics.
- Zhang et al., 2021c
Zhang, Y., Sun, P., Jiang, Y., Yu, D., Yuan, Z., Luo, P., … Wang, X. (2021). Bytetrack: multi-object tracking by associating every detection box. arXiv preprint arXiv:2110.06864.
- Zhang et al., 2023a
Zhang, Z., Zhang, A., Li, M., & Smola, A. (2023). Automatic chain of thought prompting in large language models. International Conference on Learning Representations.
- Zhang et al., 2023b
Zhang, Z., Zhang, A., Li, M., Zhao, H., Karypis, G., & Smola, A. (2023). Multimodal chain-of-thought reasoning in language models. arXiv preprint arXiv:2302.00923.
- Zhao et al., 2019
Zhao, Z.-Q., Zheng, P., Xu, S.-T., & Wu, X. (2019). Object detection with deep learning: a review. IEEE transactions on neural networks and learning systems, 30(11), 3212–3232.
- Zhou et al., 2023
Zhou, D., Schärli, N., Hou, L., Wei, J., Scales, N., Wang, X., … Chi, E. (2023). Least-to-most prompting enables complex reasoning in large language models. International Conference on Learning Representations.
- Zhu et al., 2017
Zhu, J.-Y., Park, T., Isola, P., & Efros, A. A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. Proceedings of the IEEE international conference on computer vision (pp. 2223–2232).
- Zhu et al., 2015
Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun, R., Torralba, A., & Fidler, S. (2015). Aligning books and movies: towards story-like visual explanations by watching movies and reading books. Proceedings of the IEEE international conference on computer vision (pp. 19–27).
- Zoph & Le, 2016
Zoph, B., & Le, Q. V. (2016). Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578.