Table of Contents

Incremental Learning Papers

Refer to: Awesome Incremental Learning Papers

Survey

All survey papers are sorted by first submission date.

Online Continual Learning in Image Classification: An Empirical Survey

Paper: paper on arxiv

Code: code on GitHub

First Submission: 2021-01-25

Latest Submission: 2021-06-01

Dataset: CIFAR-100, MiniImageNet, CORe50-NC, NonStationary-MiniImageNet, CORe50-NI

Methods:

Related Survey:

Focus

Trends

A Comprehensive Study of Class Incremental Learning Algorithms for Visual Tasks

Paper: paper on arxiv

Code: code on GitHub

First Submission: 2020-11-03

Latest Submission: 2020-12-15

Dataset: ILSVRC, VGGFACE2, Google Landmarks, CIFAR-100

Methods:

Related Survey

None

Focus

Trends

Class-incremental learning: survey and performance evaluation on image classification

Paper: paper on arxiv

Code: code on GitHub

First Submission: 2020-10-28

Latest Submission: 2021-05-06

Dataset: CIFAR-100, Oxford Flowers, MIT Indoor Scenes, CUB-200-2011 Birds, Stanford Cars, FGVC Aircraft, Stanford Actions, VGGFace2, ImageNet

Methods:

Related Survey

Focus

Trends

A continual learning survey: Defying forgetting in classification tasks

Paper: paper on arxiv, paper on IEEE

Code: code on GitHub

First Submission: 2019-09-18

Latest Submission: 2021-04-16

Dataset: Tiny ImageNet, iNaturalist, RecogSeq (Oxford Flowers, MIT Scenes, Caltech-UCSD Birds, Stanford Cars, FGVC-Aircraft, VOC Actions, Letters, SVHN)

Methods:

Related Survey

None

Focus

Trends

Three scenarios for continual learning

Paper: paper on arxiv

Code: code on GitHub

First Submission: 2019-04-15

Latest Submission: 2019-04-15

Dataset: MNIST (split MNIST & permuted MNIST)

Methods:

Related Survey

None

Focus

Trends

None

Continual Lifelong Learning with Neural Networks: A Review

Paper: paper on arxiv

Code: None

First Submission: 2018-02-21

Latest Submission: 2019-02-11

Dataset: None

Methods:

Related Survey

None

Focus

Trends

Classic Papers

Learning without Forgetting

Paper: paper on arxiv

Code: code on GitHub

First Submission: 2016-06-29

Latest Submission: 2017-02-14

Focus: image classification problems with Convolutional Neural Network classifiers.

Parameters

A CNN has a set of shared parameters $\theta_s$ (e.g., five convolutional layers and two fully connected layers in the AlexNet architecture).

Task-specific parameters for previously learned tasks $\theta_o$ (e.g., the output layer for ImageNet classification and its corresponding weights).

Randomly initialized task-specific parameters for the new tasks $\theta_n$.

Related work

Feature extraction: $\theta_s$ and $\theta_o$ are unchanged, and the outputs of one or more layers are used as features for training $\theta_n$ on the new task. Drawback: feature extraction typically underperforms on the new task because the frozen shared parameters fail to represent some information that is discriminative for it.

Fine-tuning: $\theta_s$ and $\theta_n$ are both optimized for the new task, while $\theta_o$ is fixed. Drawback: fine-tuning degrades performance on previously learned tasks because the shared parameters change without new guidance for the original task-specific prediction parameters.

Joint training: all parameters $\theta_s$, $\theta_o$, $\theta_n$ are jointly optimized. Drawback: joint training becomes increasingly cumbersome as more tasks are learned, and it is not possible if the training data for previously learned tasks is unavailable. A minimal sketch contrasting these three strategies follows this list.
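The three baselines differ only in which parameter groups are updated. The PyTorch sketch below makes that distinction concrete; `shared`, `old_head`, and `new_head` are hypothetical module attributes standing in for $\theta_s$, $\theta_o$, and $\theta_n$, and the code is an illustration under those assumptions rather than the paper's implementation.

```python
import torch.nn as nn

def set_trainable(module: nn.Module, flag: bool) -> None:
    # Freeze or unfreeze every parameter in a module.
    for p in module.parameters():
        p.requires_grad = flag

def configure(model, strategy: str) -> None:
    if strategy == "feature_extraction":    # update theta_n only
        set_trainable(model.shared, False)
        set_trainable(model.old_head, False)
        set_trainable(model.new_head, True)
    elif strategy == "fine_tuning":         # update theta_s and theta_n, freeze theta_o
        set_trainable(model.shared, True)
        set_trainable(model.old_head, False)
        set_trainable(model.new_head, True)
    elif strategy == "joint_training":      # update everything (needs old-task data)
        set_trainable(model.shared, True)
        set_trainable(model.old_head, True)
        set_trainable(model.new_head, True)
```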

Algorithm of LwF
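LwF first records the old network's responses on the new-task images, then optimizes all parameters with a new-task cross-entropy loss plus a temperature-scaled distillation loss that keeps the old-task outputs close to those recorded responses. The sketch below is a hedged approximation of that objective written with a KL-divergence form of the distillation term; the function and argument names, the temperature T = 2, and the loss weight are illustrative assumptions, not the authors' code.

```python
import torch.nn.functional as F

def lwf_loss(old_task_logits, new_task_logits, recorded_old_logits,
             new_labels, T=2.0, lambda_old=1.0):
    # Standard cross-entropy on the new task (drives theta_s and theta_n).
    loss_new = F.cross_entropy(new_task_logits, new_labels)
    # Distillation term: keep old-task outputs close to the responses
    # recorded from the original network before training on the new task.
    log_p = F.log_softmax(old_task_logits / T, dim=1)
    q = F.softmax(recorded_old_logits / T, dim=1)
    loss_old = F.kl_div(log_p, q, reduction="batchmean") * (T * T)
    return loss_new + lambda_old * loss_old
```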

Backbone

AlexNet, VGG

Dataset

ImageNet, Places2, VOC

iCaRL: Incremental Classifier and Representation Learning

Paper: paper on arxiv, paper on CVF

Code: code on GitHub

First Submission: 2016-11-23

Latest Submission: 2017-04-14

First definition of class-incremental learning:

An algorithm qualifies as class-incremental if it satisfies the following three properties:

it should be trainable from a stream of data in which examples of different classes occur at different times;

it should at any time provide a competitive multi-class classifier for the classes observed so far;

its computational requirements and memory footprint should remain bounded, or at least grow very slowly, with respect to the number of classes seen so far.

Components of iCaRL

Introduction

Classification

Training

Why nearest-mean-of-exemplars classification

NME overcomes two major problems of the incremental learning setting:
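A minimal sketch of the nearest-mean-of-exemplars rule, assuming a `feature_extractor` that maps a batch of images to an (N, d) feature tensor; the names and shapes are illustrative assumptions, not iCaRL's released code.

```python
import torch
import torch.nn.functional as F

def nme_predict(feature_extractor, x, exemplar_sets):
    # One prototype per class: the re-normalized mean feature of its exemplars.
    prototypes = []
    for exemplars in exemplar_sets:                        # one image tensor per class
        feats = F.normalize(feature_extractor(exemplars), dim=1)
        prototypes.append(F.normalize(feats.mean(dim=0), dim=0))
    prototypes = torch.stack(prototypes)                   # (num_classes, d)

    feats = F.normalize(feature_extractor(x), dim=1)       # (N, d)
    # Predict the class whose prototype is closest in feature space.
    return torch.cdist(feats, prototypes).argmin(dim=1)
```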

Why representation learning

Exemplar management

Overall, iCaRL’s exemplar selection and reduction steps fit the incremental learning setting exactly: the selection step is required only once per class, when the class is first observed and its training data is available; at later times, only the reduction step is called, which does not need access to any earlier training data. A hedged sketch of both steps is given below.
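In this sketch (an illustration under assumptions, not the authors' code: the `feature_extractor`, tensor shapes, and names are hypothetical), herding-style selection greedily picks exemplars so that their running feature mean stays close to the class mean, and reduction simply keeps the first m stored exemplars because they were selected in priority order.

```python
import torch
import torch.nn.functional as F

def construct_exemplar_set(feature_extractor, images, m):
    # Herding-style selection: greedily pick images whose running mean of
    # features stays as close as possible to the overall class mean.
    feats = F.normalize(feature_extractor(images), dim=1)    # (N, d)
    class_mean = feats.mean(dim=0)
    selected, running_sum = [], torch.zeros_like(class_mean)
    for k in range(1, m + 1):
        candidate_means = (running_sum + feats) / k           # (N, d)
        dists = torch.norm(class_mean - candidate_means, dim=1)
        if selected:
            dists[selected] = float("inf")                    # do not reselect
        idx = int(dists.argmin())
        selected.append(idx)
        running_sum = running_sum + feats[idx]
    return images[selected]                                   # stored in priority order

def reduce_exemplar_set(exemplars, m):
    # Exemplars are kept in priority order, so shrinking the per-class budget
    # just means keeping the first m of them.
    return exemplars[:m]
```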

Related work

Dataset

CIFAR-100, ImageNet ILSVRC 2012

Future work