Dynamic Ensemble Size Adjustment for Memory
Constrained Mondrian Forest
Martin Khannouz and Tristan Glatard
Department of Computer Science and Software Engineering
Concordia University, Montreal, Quebec, Canada
Abstract—Supervised learning algorithms generally assume the
availability of enough memory to store data models during the
training and test phases. However, this assumption is unrealistic
when data comes in the form of infinite data streams, or
when learning algorithms are deployed on devices with reduced
amounts of memory. Such memory constraints impact the model
behavior and assumptions. In this paper, we show that under
memory constraints, increasing the size of a tree-based ensemble
classifier can worsen its performance. In particular, we experi-
mentally show the existence of an optimal ensemble size for a
memory-bounded Mondrian forest on data streams and we design
an algorithm to guide the forest toward that optimal number by
using an estimation of overfitting. We tested different variations
for this algorithm on a variety of real and simulated datasets,
and we conclude that our method can achieve up to 95% of the
performance of an optimally-sized Mondrian forest for stable
datasets, and can even outperform it for datasets with concept
drifts. All our methods are implemented in the OrpailleCC open-
source library and are ready to be used on embedded systems
and connected objects.
Index Terms—Mondrian Forest, Trimming, Concept Drift,
Data Stream, Memory Constraints.
I. INTRODUCTION
Supervised classification algorithms mostly assume the
availability of abundant memory to store data and models.
This is an issue when processing data streams — which are
infinite sequences by definition — or when using memory-
limited devices as is commonly the case in the Internet of
Things. We focus on the Mondrian forest, a popular online
classification method. Our ultimate goal is to optimize it for
data streams under memory constraints to make it compatible
with connected objects.
The Mondrian forest is a tree-based, ensemble, online learn-
ing method with comparable performance to offline Random
Forest [1]. Previous experiments highlighted the Mondrian
forest's sensitivity to the ensemble size in a memory-constrained
environment [2]. Indeed, introducing a memory limit to the
ensemble also introduces a trade-off between underfitting and
overfitting. On the one hand, low tree numbers make room
for deeper trees and increase the risk of overfitting. On the
other hand, large tree numbers constrain tree depth due to
the memory limitation and increase the risk of underfitting.
Depending on the memory limit and the data distribution, a
given ensemble size may either overfit or underfit the dataset.
The goal of this paper is to address this trade-off by adapting
the ensemble size dynamically.
In summary, this paper makes the following contributions:
1) Highlight the existence of an optimal tree count in a
memory-constrained Mondrian forest;
2) Propose a dynamic method to optimize the tree count;
3) Compare this method to a Mondrian forest with an
optimally-sized tree count.
II. MATERIALS AND METHODS
All the methods presented in this section are implemented
in the OrpailleCC framework [3]. The scripts to reproduce our
experiments are available on GitHub at
https://github.com/big-data-lab-team/benchmark-har-data-stream.
In this section, we start by presenting background on
Mondrian forests (II-A and II-B), then present the main
contribution of the paper, namely dynamically adjusting the
ensemble size of a memory-constrained Mondrian forest (II-C,
II-D, II-E, II-F), and finally describe the experimental evaluation
framework (II-G, II-H).
A. Mondrian Forest
The Mondrian forest [1] is an ensemble method that aggregates
Mondrian trees. Each tree recursively splits the feature space,
similarly to a regular decision tree. However, the feature used in
the split and the value of the split are picked randomly. The
probability of selecting a feature is proportional to its range,
and the split value is uniformly selected from the range of the
feature. In contrast with other decision trees, the Mondrian tree
does not split leaves to introduce new nodes. Instead, it
introduces a new parent and a sibling to the node where the split
occurs. The original node and its descendants are not modified,
and no data point is moved to the new sibling besides the data
points that initialized the split. This approach allows the
Mondrian tree to introduce new branches at internal nodes. The
training algorithm does not rely on labels to build the tree;
however, each node maintains a counter for each label seen.
Therefore, labels can be delayed, but they are needed before
prediction. In addition to the counters, each node keeps track of
the range of each feature, which defines a box containing all the
data points sorted to that node. A data point can create a new
branch only if it is sorted to a node and falls outside of the
node's box.
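To make the split mechanics concrete, the following is a minimal
Python sketch of how a Mondrian tree could sample a split under
the description above; the function name and the list-based box
representation are illustrative assumptions, not the OrpailleCC
implementation.

    import random

    def sample_split(lower_bound, upper_bound):
        # Ranges of each feature inside the node's box
        # (assumes at least one feature has a non-zero range).
        ranges = [hi - lo for lo, hi in zip(lower_bound, upper_bound)]
        total = sum(ranges)
        # Pick a feature with probability proportional to its range.
        r = random.uniform(0.0, total)
        cumulative = 0.0
        feature = len(ranges) - 1  # fallback for floating-point edge cases
        for i, extent in enumerate(ranges):
            cumulative += extent
            if r <= cumulative:
                feature = i
                break
        # Pick the split value uniformly within that feature's range.
        value = random.uniform(lower_bound[feature], upper_bound[feature])
        return feature, value

For instance, with lower_bound = [0.0, 0.0] and upper_bound =
[2.0, 1.0], the first feature is selected about twice as often as
the second, reflecting its larger range.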
B. Mondrian Forest for Data Stream Classification
The implementation of Mondrian forest presented in [1], [4]
is online because trees rely on potentially all the previously
seen data points to grow new branches. To support data
streams, the Mondrian forest has to access data points only
once as the dataset is assumed to be infinite in size.
The work in [2] describes a Data Stream Mondrian forest with a
memory bound. The ensemble grows trees from a shared pool of
nodes, and the trees are paused when there is no node left. That
work also proposes out-of-memory strategies to keep updating the
statistics of the trees without creating new branches. In
particular, it recommends using the Extend Node strategy when the
memory is full: the boxes of the nodes are automatically extended
to fit all incoming data points, and the counters are
automatically increased.
The attributes of a node are an array counting the labels that
fall inside a leaf, and two arrays, lower_bound and upper_bound,
that define the box of the node. The Extend Node strategy
automatically increases the counter of the label in the leaf and
automatically adjusts lower_bound and upper_bound so that the new
data point fits inside the box.
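As an illustration, here is a minimal Python sketch of what the
Extend Node strategy could look like, using the node attributes
described above; the class layout and function name are
hypothetical, not the OrpailleCC code.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Node:
        counters: List[int]        # one count per label seen at this node
        lower_bound: List[float]   # per-feature lower edge of the node's box
        upper_bound: List[float]   # per-feature upper edge of the node's box

    def extend_node(leaf: Node, x: List[float], label: int) -> None:
        # Grow the box so that the new data point fits inside it,
        # without allocating any new node.
        for i, value in enumerate(x):
            leaf.lower_bound[i] = min(leaf.lower_bound[i], value)
            leaf.upper_bound[i] = max(leaf.upper_bound[i], value)
        # Update the label statistics of the leaf.
        leaf.counters[label] += 1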
Having a shared pool of nodes for the ensemble has a direct
impact on the number of trees. As mentioned before, having
more trees limits the tree depth and may lead to underfitting,
whereas having fewer trees increases the risk of overfitting.
C. Dynamic Tree Count Optimization
Algorithm 1 describes the function that trains the forest with a
new data point and dynamically adjusts the ensemble size. The
main idea is to compare prequential and postquential errors to
decide whether or not to adjust the forest size.
Data: f = a Mondrian forest
Data: x = a data point
Data: l = the label of data point x
1  Function train_forest(f, x, l) is
2      predicted_label = f.test(x);
3      prequential.update(predicted_label, l);
4      for each tree i in f do
5          train_tree(i, x, l);
6      end
7      predicted_label = f.test(x);
8      postquential.update(predicted_label, l);
9      post_metric = postquential.metric();
10     pre_metric = prequential.metric();
11     if post_metric is significantly greater than pre_metric then
12         trim_trees(f);
13         add_tree(f);
14     end
15 end
Algorithm 1: Training function for a data stream Mondrian
forest with a dynamic ensemble size.
The forest evaluates prequential statistics, meaning that it
predicts the label of the new data point before using it for
training (lines 2-3). After training, the forest evaluates
postquential statistics (lines 7-8). The prequential accuracy is
widely used as an approximation of the test-set accuracy for data
streams [5]. Here we introduce the postquential accuracy, where
we test after training, to simulate the accuracy on a training set.
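For illustration, a minimal sketch of an accuracy tracker that
could serve for both statistics; the paper does not specify the
exact estimator at this point, so the exponential fading factor
and the class interface are assumptions.

    class PrequentialAccuracy:
        # Running accuracy with an exponential fading factor, a common
        # choice on data streams so that old errors fade away
        # (hypothetical; the paper's exact statistic may differ).
        def __init__(self, fading: float = 0.99):
            self.fading = fading
            self.correct = 0.0
            self.seen = 0.0

        def update(self, predicted_label, true_label) -> None:
            self.correct = self.fading * self.correct + (predicted_label == true_label)
            self.seen = self.fading * self.seen + 1.0

        def metric(self) -> float:
            return self.correct / self.seen if self.seen > 0.0 else 0.0

The same tracker can be instantiated twice: once updated before
training (prequential) and once after training (postquential).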
To find the number of trees that maximizes prediction
performance, we initialize the Mondrian forest with a single
tree, and therefore with a high risk of overfitting. The idea is
to test for overfitting by comparing the prequential and
postquential accuracies, and to add a tree whenever the forest is
deemed to overfit. A model that overfits is defined as a model
that performs significantly better on the training set than on
the test set. Therefore, the problem of detecting overfitting
becomes a statistical testing problem between the training and
test accuracies.
While overfitting is commonly detected by comparing the
performance of the classifier on the training and test sets,
detecting underfitting for memory-constrained data stream models
remains very challenging. Consequently, Algorithm 1 is written to
only support tree addition. Should an underfitting criterion
become available in the future, Algorithm 1 could easily be
adapted to support tree removal.
Overall, Algorithm 1 can be adjusted through three components:
the comparison test (line 11), the type of pre/postquential
statistics used (lines 3 and 8), and the method used to add trees
(lines 12-13).
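Putting these pieces together, the following is a minimal Python
rendering of Algorithm 1 with the comparison test injected as a
parameter; the forest and tree methods (test, train, trim_trees,
add_tree) are hypothetical stand-ins for the OrpailleCC API, not
its actual interface.

    def train_forest(forest, x, label, prequential, postquential, is_overfitting):
        # Prequential statistics: test before training (lines 2-3).
        prequential.update(forest.test(x), label)
        # Train every tree on the new data point (lines 4-6).
        for tree in forest.trees:
            tree.train(x, label)
        # Postquential statistics: test after training (lines 7-8).
        postquential.update(forest.test(x), label)
        # If the training accuracy significantly exceeds the test
        # accuracy, the forest is deemed to overfit: free up memory
        # and add a tree (lines 11-14).
        if is_overfitting(postquential.metric(), prequential.metric()):
            forest.trim_trees()
            forest.add_tree()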
D. Comparison Test
In Algorithm 1, the update process determines if it needs to
add a tree based on a comparison between the prequential and
the postquential accuracies. If both accuracies are significantly
different, the forest is deemed to overfit and thus, the algorithm
adds a tree to compensate.
There are different methods to test the statistical difference
between two accuracies and we experiment with four in this
paper: the sum of variances, the t-test, the z-test, and the sum
of standard deviations.
Notations for the following equations include: $\mu_{pre}$ and
$\mu_{post}$, the means of the prequential and postquential
accuracies respectively; $\sigma^2_{pre}$ and $\sigma^2_{post}$,
the variances of the prequential and postquential accuracies
respectively; $n$, the size of the sample; and $\mu$ and
$\sigma^2$, the mean and variance of $\mu_{post} - \mu_{pre}$.
1) Sum of Variances: Equation 1 shows the sum of variances used
as a comparison test. The two accuracies are considered different
when their difference is higher than the square root of the sum
of the prequential and postquential variances.

$$\mu_{post} - \mu_{pre} > \sqrt{\sigma^2_{post} + \sigma^2_{pre}} \quad (1)$$
2) T-test: Equation 2 describes the t-test used to compare the
prequential and postquential accuracies. We apply a one-sample
t-test where we check whether $\mu$ is different from 0 with a
99% confidence [6].

$$\frac{\sqrt{n}\,\mu}{\sigma} > 2.326 \quad (2)$$
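For illustration, the first two comparison tests could be
implemented as follows, directly transcribing Equations 1 and 2;
this is a sketch under the notation above, not the paper's code,
and it assumes sigma > 0 and n > 0.

    import math

    def sum_of_variances_test(mu_post, mu_pre, var_post, var_pre):
        # Equation 1: the accuracies differ when the gap exceeds
        # the square root of the summed variances.
        return (mu_post - mu_pre) > math.sqrt(var_post + var_pre)

    def t_test(mu, sigma, n, critical=2.326):
        # Equation 2: one-sample t-test on the mean gap mu, with
        # standard deviation sigma over n samples, at 99%
        # confidence (one-sided).
        return math.sqrt(n) * mu / sigma > critical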
3) Z-test: Equation 3 shows how we compute the pooled
two-proportion z-test for $\mu_{pre} = \mu_{post}$. We first
compute the z-score, then compare it with the Z-value that
ensures a confidence of 99% [7], [8]. The z-score is the
observed dif-