Essential AI Research Papers for 2020: A Comprehensive Guide
Artificial Intelligence is rapidly evolving and has become a highly sought-after skill, closely associated with Data Science. This field encompasses a wide range of applications, categorized by the types of input (text, audio, images, video, graphs) and problem-solving approaches (supervised, unsupervised, reinforcement learning). Keeping pace with advancements can be overwhelming. To aid in this, I have compiled a list of essential readings that highlight both contemporary and classic breakthroughs in AI and Data Science.
While many papers focus on image and text, the principles discussed often apply broadly across various inputs, offering insights beyond specific domains. For each recommendation, I've outlined reasons for its significance and suggested additional readings for those interested in deeper exploration.
I apologize to those specializing in Audio and Reinforcement Learning, as my experience in these areas is limited, and thus, they are not included in this selection.
Let's dive in.
#1 AlexNet (2012)
Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. “Imagenet classification with deep convolutional neural networks.” Advances in neural information processing systems, 2012.
In 2012, the authors introduced the innovative use of GPUs for training a large Convolutional Neural Network (CNN) aimed at the ImageNet challenge. This was a groundbreaking decision, as CNNs were typically deemed too cumbersome for large-scale training. Remarkably, they achieved first place with a ~15% Top-5 error rate, outperforming the second-place team, which had a ~26% error rate and employed traditional image processing techniques.
Reason #1: While AlexNet's historical significance is widely recognized, many are unaware that techniques we still use today, such as dropout and ReLU, were brought into the mainstream by this paper.
Reason #2: The architecture featured 60 million parameters, a staggering figure by 2012 standards. Today, we encounter models with over a billion parameters. Reading the AlexNet paper provides valuable insights into the evolution of model complexity.
Further Reading: To trace the history of ImageNet champions, consider exploring the ZF Net, VGG, Inception-v1, and ResNet papers. ResNet, in particular, achieved superhuman performance, transforming the landscape of deep learning. ImageNet is now primarily utilized for Transfer Learning and assessing low-parameter models, such as:
#2 MobileNet (2017)
Howard, Andrew G., et al. “Mobilenets: Efficient convolutional neural networks for mobile vision applications.” arXiv preprint arXiv:1704.04861, 2017.
MobileNet is one of the best-known "low-parameter" networks, well suited to low-resource devices and to speeding up real-time applications such as object recognition on smartphones. Its key idea is to break an expensive operation, the standard convolution, into two smaller and faster ones (a depthwise convolution followed by a 1x1 pointwise convolution), which cuts both parameters and computation substantially.
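To make the efficiency gain concrete, here is a small sketch of the weight-count arithmetic behind depthwise separable convolutions, the building block MobileNet relies on. The function names and the example layer sizes are my own illustrative choices rather than values from the paper, and bias terms are ignored.

```python
def standard_conv_params(kernel_size: int, in_channels: int, out_channels: int) -> int:
    """Weights in a regular k x k convolution (bias terms ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels


def separable_conv_params(kernel_size: int, in_channels: int, out_channels: int) -> int:
    """Weights in a depthwise k x k convolution followed by a 1 x 1 pointwise one."""
    depthwise = kernel_size * kernel_size * in_channels  # one k x k filter per input channel
    pointwise = in_channels * out_channels               # 1 x 1 convolution mixes the channels
    return depthwise + pointwise


# Illustrative layer: 3 x 3 kernel, 128 input channels, 256 output channels.
standard = standard_conv_params(3, 128, 256)
separable = separable_conv_params(3, 128, 256)
ratio = separable / standard
print(standard, separable, round(ratio, 3))  # 294912 33920 0.115
```

For this layer the separable version needs roughly 11.5% of the weights, which matches the rule of thumb 1/out_channels + 1/k^2.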
Reason #1: Many of us lack the resources of major tech companies. Understanding low-parameter networks is essential for developing cost-effective models. In my experience, utilizing depth-wise convolutions can significantly reduce cloud inference costs without sacrificing accuracy.
Reason #2: There's a common misconception that larger models are inherently superior. Papers like MobileNet demonstrate that model elegance and efficiency are equally crucial.
Further Reading: MobileNet v2 and v3 have since been released, improving accuracy while further reducing model size. Additionally, other architectures, such as SqueezeNet, have emerged to minimize model size while maintaining performance. A comprehensive overview of model size versus accuracy is available in recent literature.
#3 Attention is All You Need (2017)
Vaswani, Ashish, et al. “Attention is all you need.” Advances in neural information processing systems, 2017.
This paper introduced the Transformer model, revolutionizing language models that previously relied heavily on Recurrent Neural Networks (RNNs) for sequence-to-sequence tasks. RNNs process a sequence one step at a time, which makes them slow and difficult to parallelize across multiple GPUs. In contrast, the Transformer drops recurrence altogether and is built around Attention layers, which capture the interdependencies between all sequence elements at once. This approach not only achieved superior results but also trained significantly faster than prior RNN models.
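For readers who prefer code to prose, below is a minimal pure-Python sketch of the scaled dot-product attention at the heart of the Transformer, softmax(Q K^T / sqrt(d)) V. It is a single head with no learned projections, masking, or batching, and the toy token vectors are made up for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention on lists of vectors."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # The output is the attention-weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Toy sequence of three 2-d tokens used as queries, keys and values (self-attention).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Note that each query is handled independently of the others, so the outer loop can run in parallel. An RNN, by contrast, has to wait for the previous step before it can compute the next one.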
Reason #1: Most contemporary architectures in Natural Language Processing (NLP) build upon the Transformer model. Innovations like GPT-2 and BERT are at the forefront of the field. Understanding the Transformer is essential for grasping later developments in NLP.
Reason #2: Many transformer models contain hundreds of millions or even billions of parameters. While MobileNets focus on efficient architectures, NLP research emphasizes efficient training methods. Together, these perspectives provide a comprehensive toolkit for enhancing training and inference efficiency.
Reason #3: Although the transformer model has primarily been applied to NLP, the proposed Attention mechanism holds potential across various domains. Models like Self-Attention GAN illustrate the versatility of global reasoning in diverse tasks, with new papers on Attention applications emerging monthly.
Further Reading: I highly recommend studying the BERT and SAGAN papers. The former extends the Transformer model, while the latter showcases the Attention mechanism's application to images within a GAN framework.
#4 Stop Thinking with Your Head / Reformer (~2020)
Merity, Stephen. “Single Headed Attention RNN: Stop Thinking With Your Head.” arXiv preprint arXiv:1911.11423, 2019.
Kitaev, Nikita, Łukasz Kaiser, and Anselm Levskaya. “Reformer: The Efficient Transformer.” arXiv preprint arXiv:2001.04451, 2020.
While Transformer and Attention models have garnered much attention, they often require substantial resources, making them unsuitable for typical consumer hardware. Both papers critique the architecture and suggest computationally efficient alternatives to the Attention module, echoing the sentiment that elegance is essential.
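As one example of such an alternative: as I understand the Reformer paper, it replaces full attention with locality-sensitive hashing (LSH), so that each position only attends to positions that fall in the same hash bucket. The sketch below shows only the bucketing step, using a random projection and an argmax over [xR, -xR]. The helper names, dimensions, and vectors are hypothetical, and the real model adds multiple hash rounds, sorting, and chunking on top.

```python
import random


def make_projection(dim, n_buckets, seed=0):
    """Random Gaussian matrix of shape dim x (n_buckets // 2)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_buckets // 2)] for _ in range(dim)]


def lsh_bucket(vector, projection):
    """Bucket id = argmax over the concatenation [xR, -xR]."""
    projected = [
        sum(x * row[j] for x, row in zip(vector, projection))
        for j in range(len(projection[0]))
    ]
    scores = projected + [-p for p in projected]
    return max(range(len(scores)), key=scores.__getitem__)


projection = make_projection(dim=4, n_buckets=8)
a = [1.0, 0.5, -0.2, 0.3]
b = [2.0, 1.0, -0.4, 0.6]     # same direction as a, so it shares a's bucket
c = [-1.0, -0.5, 0.2, -0.3]   # opposite direction, so it lands in another bucket
print(lsh_bucket(a, projection), lsh_bucket(b, projection), lsh_bucket(c, projection))
```

Vectors pointing in similar directions tend to share a bucket, which is why attention can be restricted to bucket-mates instead of the whole sequence. The SHA-RNN paper takes a different route and is not covered by this sketch.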
Reason #1: “Stop Thinking With Your Head” is a particularly entertaining read, which is reason enough to explore it.
Reason #2: Major corporations can rapidly scale their research across numerous GPUs, but most individuals cannot. Enhancing model efficiency rather than merely scaling them is paramount. Understanding efficiency is crucial for optimizing available resources.
Further Reading: Given the publication dates of these papers, there isn’t much additional material to reference. However, revisiting the MobileNet paper may yield further insights into efficiency strategies.
#5 Simple Baselines for Human Pose Estimation (2018)
Xiao, Bin, Haiping Wu, and Yichen Wei. “Simple baselines for human pose estimation and tracking.” Proceedings of the European conference on computer vision (ECCV), 2018.
While many papers propose novel techniques to elevate the state-of-the-art, this paper posits that a straightforward model, leveraging established best practices, can be surprisingly effective. They introduced a human pose estimation network that relies on a backbone network followed by three deconvolution operations. At the time, this approach performed exceptionally well on the COCO benchmark despite its simplicity.
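To see how little machinery this head needs, here is a sketch of the shape arithmetic only. It assumes a backbone that downsamples by 32 (as a ResNet does) and deconvolutions with a 4x4 kernel, stride 2, and padding 1, which is how I recall the paper's setup. Treat these numbers and the function names as assumptions rather than a verified reproduction.

```python
def deconv_output_size(size: int, kernel: int = 4, stride: int = 2, padding: int = 1) -> int:
    """Spatial size after one transposed convolution (no output padding, no dilation)."""
    return (size - 1) * stride - 2 * padding + kernel


def head_output_size(height: int, width: int, backbone_stride: int = 32, n_deconvs: int = 3):
    """Heatmap size after the backbone and the stack of deconvolutions."""
    h, w = height // backbone_stride, width // backbone_stride
    for _ in range(n_deconvs):
        h, w = deconv_output_size(h), deconv_output_size(w)
    return h, w


print(head_output_size(256, 192))  # (64, 48)
```

Under these assumptions each deconvolution doubles the resolution, so a 256x192 crop shrinks to 8x6 features and comes back up to 64x48 heatmaps, one per keypoint.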
Reason #1: Simplicity can sometimes yield the most effective results. While the allure of intricate architectures is strong, a baseline model may be quicker to implement and yield comparable outcomes. This paper serves as a reminder that not all effective models need complexity.
Reason #2: Scientific progress often occurs incrementally. Each new paper advances the field slightly, but it can be beneficial to revisit earlier approaches. The aforementioned “Stop Thinking with Your Head” and “Reformer” papers exemplify this idea.
Reason #3: Proper data augmentation, training schedules, and effective problem formulation are often undervalued yet crucial components of successful models.
Further Reading: For those interested in Pose Estimation, a comprehensive state-of-the-art review would be beneficial.
#6 Bag of Tricks for Image Classification (2019)
He, Tong, et al. “Bag of tricks for image classification with convolutional neural networks.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
Often, the solution lies not in developing a novel model but in employing a few new techniques. Many papers introduce one or two tricks that provide modest improvements. However, these enhancements can easily be overlooked amid more significant contributions. This paper compiles a collection of practical tips from the literature, presenting them for our benefit.
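To give a flavor of how small these tricks are, here is a sketch of two that the paper covers: learning-rate warmup and cosine decay. The step counts and base learning rate are arbitrary illustration values, and the function is my own minimal formulation rather than the authors' code.

```python
import math


def learning_rate(step: int, total_steps: int, warmup_steps: int, base_lr: float) -> float:
    """Linear warmup from 0 to base_lr, then cosine decay down to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


schedule = [learning_rate(s, total_steps=100, warmup_steps=5, base_lr=0.1) for s in range(101)]
print([round(lr, 4) for lr in schedule[:8]])
```

The schedule ramps up over the first five steps, peaks at the base rate, and then falls smoothly to zero by the final step.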
Reason #1: Most tips are straightforward to implement.
Reason #2: It's likely that you may not be familiar with several of these approaches, which go beyond the usual “use ELU” suggestions.
Further Reading: Numerous other techniques exist, some tailored to specific problems. A topic deserving more focus is class and sample weights. Consider reading about class weights for unbalanced datasets.
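As a quick illustration of class weights, the sketch below uses one common heuristic, n_samples / (n_classes * class_count), which is the same formula scikit-learn applies for its "balanced" option. The labels are made up, and other weighting schemes are equally valid.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Made-up unbalanced dataset: 90 cats and 10 dogs.
labels = ["cat"] * 90 + ["dog"] * 10
weights = balanced_class_weights(labels)
print(weights)
```

Here a mistake on the rare "dog" class costs nine times as much as one on "cat", which pushes the model to stop ignoring the minority class.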
#7 The SELU Activation (2017)
Klambauer, Günter, et al. “Self-normalizing neural networks.” Advances in neural information processing systems, 2017.
Most practitioners utilize Batch Normalization layers and either ReLU or ELU activation functions. The SELU paper presents a unifying solution: an activation function that self-normalizes its outputs, effectively eliminating the need for batch normalization layers. Consequently, models using SELU activations are simpler and require fewer operations.
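The self-normalizing claim is easy to check numerically. The sketch below implements SELU with the constants from the paper and pushes standard-normal samples through it. The sample count and seed are arbitrary, and this checks a single activation, not a full network.

```python
import math
import random

ALPHA = 1.6732632423543772
SCALE = 1.0507009873554805


def selu(x: float) -> float:
    """SELU: scale * x for x > 0, scale * alpha * (exp(x) - 1) otherwise."""
    return SCALE * x if x > 0 else SCALE * ALPHA * (math.exp(x) - 1.0)


# Push standard-normal samples through SELU and inspect the output statistics.
rng = random.Random(0)
outputs = [selu(rng.gauss(0.0, 1.0)) for _ in range(50_000)]
mean = sum(outputs) / len(outputs)
variance = sum((o - mean) ** 2 for o in outputs) / len(outputs)
print(round(mean, 2), round(variance, 2))
```

The mean and variance should come out close to 0 and 1. That fixed point is what lets the activation keep layer outputs normalized without a separate batch normalization step.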
Reason #1: The authors predominantly focus on standard machine learning problems (e.g., tabular data), providing a refreshing perspective for data scientists primarily engaged with images.
Reason #2: For those dealing with tabular data, this represents one of the most current approaches within the Neural Networks literature.
Reason #3: The paper is mathematically rigorous and presents a computationally derived proof, a rare and admirable feature.
Further Reading: If you're interested in the history and application of popular activation functions, I have written a guide on the topic.
#8 Bag-of-local-Features (2019)
Brendel, Wieland, and Matthias Bethge. “Approximating cnns with bag-of-local-features models works surprisingly well on imagenet.” arXiv preprint arXiv:1904.00760, 2019.
The premise is that if you divide an image into jigsaw-like segments, scramble them, and present them to a child, they won’t recognize the original object; however, a CNN might still succeed. This paper reveals that classifying all 33x33 patches of an image and averaging their predictions can yield near state-of-the-art results on ImageNet. Furthermore, they explore this concept with VGG and ResNet-50 models, demonstrating that CNNs heavily depend on local information with minimal global context.
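The scrambling argument can be shown in a few lines. In this sketch a hypothetical toy function stands in for the patch classifier, and the image is split into non-overlapping tiles for brevity, whereas the paper classifies all overlapping patches with a real network. The point is only that an average over patches cannot see where the patches came from.

```python
import random


def extract_patches(image, patch):
    """Split a 2-d list into non-overlapping patch x patch tiles."""
    rows, cols = len(image), len(image[0])
    return [
        [row[c:c + patch] for row in image[r:r + patch]]
        for r in range(0, rows, patch)
        for c in range(0, cols, patch)
    ]


def toy_patch_scores(tile):
    """Hypothetical stand-in for a patch classifier: two integer 'class scores'."""
    flat = [v for row in tile for v in row]
    return [sum(flat), max(flat) - min(flat)]


def bag_of_patches_prediction(patches):
    """Average the per-patch scores; patch positions play no role."""
    n = len(patches)
    totals = [sum(col) for col in zip(*(toy_patch_scores(p) for p in patches))]
    return [t / n for t in totals]


image = [[(r * 7 + c * 3) % 10 for c in range(8)] for r in range(8)]
patches = extract_patches(image, patch=4)
shuffled = patches[:]
random.Random(0).shuffle(shuffled)
print(bag_of_patches_prediction(patches) == bag_of_patches_prediction(shuffled))  # True
```

Shuffling the tiles leaves the prediction untouched, which is exactly the jigsaw behavior described above.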
Reason #1: Contrary to the belief that CNNs possess advanced visual recognition capabilities, this paper suggests they may rely on simpler mechanisms than we anticipate.
Reason #2: It’s rare to encounter research that offers a fresh viewpoint on the limitations of CNNs and their interpretability.
Further Reading: Related findings in adversarial attacks literature also unveil CNNs' limitations. Refer to articles that explore these vulnerabilities.
#9 The Lottery Ticket Hypothesis (2019)
Frankle, Jonathan, and Michael Carbin. “The lottery ticket hypothesis: Finding sparse, trainable neural networks.” arXiv preprint arXiv:1803.03635, 2018.
This paper proposes that if you train a large network, prune the low-magnitude weights, reset the surviving weights to their original initial values, and retrain, you obtain a much smaller network that matches or even exceeds the accuracy of the original. The lottery analogy views each weight as a “ticket.” With a large number of tickets, you're bound to find a winner, but only a few will succeed. If you could go back and select only the winning tickets, you would maximize your gains. This framework suggests that the initial large network can be pruned effectively to enhance performance.
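The procedure can be sketched on a flat list of weights. This is a one-shot magnitude-pruning illustration with made-up "trained" weights. The helper names are hypothetical and no actual training happens here.

```python
import random


def magnitude_mask(trained, keep_fraction):
    """Keep the largest-magnitude trained weights, mark the rest as pruned."""
    n_keep = max(1, round(len(trained) * keep_fraction))
    ranked = sorted(range(len(trained)), key=lambda i: abs(trained[i]), reverse=True)
    kept = set(ranked[:n_keep])
    return [1 if i in kept else 0 for i in range(len(trained))]


def winning_ticket(initial, mask):
    """Reset surviving weights to their initial values; pruned weights stay at zero."""
    return [w * m for w, m in zip(initial, mask)]


rng = random.Random(0)
initial = [rng.uniform(-1.0, 1.0) for _ in range(10)]
# Hypothetical "trained" weights: a real run would obtain these by gradient descent.
trained = [w + rng.gauss(0.0, 0.5) for w in initial]

mask = magnitude_mask(trained, keep_fraction=0.2)
ticket = winning_ticket(initial, mask)
print(mask)
```

The ticket is what you would retrain, with the mask held fixed so that pruned weights stay at zero. The selection uses the trained magnitudes, but the values you restart from are the original initialization.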
Reason #1: The concept is incredibly intriguing.
Reason #2: Similar to the Bag-of-Features paper, this research highlights the constraints of our current understanding of CNNs, revealing the potential for underutilizing millions of parameters. The authors successfully reduced networks to a tenth of their original size, raising questions about the future possibilities.
Reason #3: These ideas provide insight into the inefficiencies of large networks. The Reformer paper, mentioned earlier, significantly reduced the memory and compute footprint of Transformers through algorithmic improvements. How much further could the lottery technique push this reduction?
Further Reading: Weight initialization is often overlooked. Many stick to default settings that may not be optimal. “All You Need is a Good Init” is a crucial paper on this topic. For insights into the lottery hypothesis, consider reading a comprehensive review.
#10 Pix2Pix and CycleGAN (2017)
Isola, Phillip, et al. “Image-to-image translation with conditional adversarial networks.” Proceedings of the IEEE conference on computer vision and pattern recognition, 2017.
Zhu, Jun-Yan, et al. “Unpaired image-to-image translation using cycle-consistent adversarial networks.” Proceedings of the IEEE international conference on computer vision, 2017.
No list would be complete without discussing GANs. Pix2Pix and CycleGAN are foundational works in conditional generative models, which transform images from one domain to another using paired and unpaired datasets, respectively. The former can convert line drawings to fully rendered images, while the latter excels in tasks like transforming horses into zebras. These conditional models enable users to exert some control over the generated outputs through input adjustments.
Reason #1: GAN literature often emphasizes the quality of generated outputs without considering artistic control. Conditional models like these pave the way for practical applications of GANs, such as assisting artists.
Reason #2: Adversarial methods exemplify multi-network architectures. Even if generation isn't your focus, understanding multi-network setups could inspire solutions to various challenges.
Reason #3: The CycleGAN paper notably illustrates how a well-designed loss function can go a long way toward solving a hard problem. The Focal loss paper demonstrates a similar principle by enhancing object detectors through a better loss function.
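The loss in question is CycleGAN's cycle-consistency term: translating an image to the other domain and back should return the original. The sketch below computes that term with hypothetical stand-in functions in place of the two generator networks. The full objective in the paper also includes adversarial losses and a weighting factor, both omitted here.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equally sized 'images'."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cycle_consistency_loss(x, y, g, f):
    """L1 error of the round trips x -> G(x) -> F(G(x)) and y -> F(y) -> G(F(y))."""
    forward = l1_distance(f(g(x)), x)
    backward = l1_distance(g(f(y)), y)
    return forward + backward


# Hypothetical stand-ins for the two generators, acting on flat pixel lists.
def g(image):        # domain X -> domain Y
    return [2.0 * p for p in image]


def f_good(image):   # exact inverse of g
    return [p / 2.0 for p in image]


def f_bad(image):    # not an inverse of g
    return [p + 1.0 for p in image]


x = [0.0, 0.5, 1.0]
y = [0.0, 1.0, 2.0]
print(cycle_consistency_loss(x, y, g, f_good))  # 0.0
print(cycle_consistency_loss(x, y, g, f_bad))   # 4.5
```

A pair of generators that undo each other incurs no penalty, while a mismatched pair is punished. This constraint is what allows training without paired examples.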
Further Reading: As AI continues to advance rapidly, so do GANs. I encourage you to experiment with coding a GAN if you haven't yet. The official TensorFlow 2 documentation provides valuable resources. Additionally, exploring semi-supervised learning applications of GANs is worthwhile.
With these ten pivotal papers and their additional resources, you now have ample reading material to enhance your understanding of AI. This list is by no means exhaustive, but I've aimed to present some of the most insightful and significant works I’ve encountered. Please share any additional papers you believe should be included.
Happy reading! :)
Edit: After compiling this list, I created a follow-up featuring ten more AI papers from 2020 and another focused on GANs. If you appreciated this list, you may find those continuations intriguing as well:
- Ten More AI Papers to Read in 2020
- GAN Papers to Read in 2020
Thank you for your attention! :)