Dynamic Ensemble Size Adjustment for Memory
Constrained Mondrian Forest
Martin Khannouz and Tristan Glatard
Department of Computer Science and Software Engineering
Concordia University, Montreal, Quebec, Canada
Abstract—Supervised learning algorithms generally assume the
availability of enough memory to store data models during the
training and test phases. However, this assumption is unrealistic
when data comes in the form of infinite data streams, or
when learning algorithms are deployed on devices with reduced
amounts of memory. Such memory constraints impact the model
behavior and assumptions. In this paper, we show that under
memory constraints, increasing the size of a tree-based ensemble
classifier can worsen its performance. In particular, we experi-
mentally show the existence of an optimal ensemble size for a
memory-bounded Mondrian forest on data streams and we design
an algorithm to guide the forest toward that optimal number by
using an estimation of overfitting. We tested different variations
for this algorithm on a variety of real and simulated datasets,
and we conclude that our method can achieve up to 95% of the
performance of an optimally-sized Mondrian forest for stable
datasets, and can even outperform it for datasets with concept
drifts. All our methods are implemented in the OrpailleCC open-
source library and are ready to be used on embedded systems
and connected objects.
Index Terms—Mondrian Forest, Trimming, Concept Drift,
Data Stream, Memory Constraints.
I. INTRODUCTION
Supervised classification algorithms mostly assume the
availability of abundant memory to store data and models.
This is an issue when processing data streams — which are
infinite sequences by definition — or when using memory-
limited devices as is commonly the case in the Internet of
Things. We focus on the Mondrian forest, a popular online
classification method. Our ultimate goal is to optimize it for
data streams under memory constraints to make it compatible
with connected objects.
The Mondrian forest is a tree-based, ensemble, online learn-
ing method with comparable performance to offline Random
Forest [1]. Previous experiments highlighted the Mondrian
forest's sensitivity to the ensemble size in a memory-constrained
environment [2]. Indeed, introducing a memory limit to the
ensemble also introduces a trade-off between underfitting and
overfitting. On the one hand, low tree numbers make room
for deeper trees and increase the risk of overfitting. On the
other hand, large tree numbers constrain tree depth due to
the memory limitation and increase the risk of underfitting.
Depending on the memory limit and the data distribution, a
given ensemble size may either overfit or underfit the dataset.
The goal of this paper is to address this trade-off by adapting
the ensemble size dynamically.
In summary, this paper makes the following contributions:
1) Highlight the existence of an optimal tree count in a
memory-constrained Mondrian forest;
2) Propose a dynamic method to optimize the tree count;
3) Compare this method to a Mondrian forest with an
optimally-sized tree count.
II. MATERIALS AND METHODS
All the methods presented in this section are implemented
in the OrpailleCC framework [3]. The scripts to reproduce our
experiments are available on GitHub at
https://github.com/big-data-lab-team/benchmark-har-data-stream.
In this section, we start by presenting background on
Mondrian forests (II-A and II-B), then present the main
contribution of the paper, namely dynamically adjusting the
ensemble size of a memory-constrained Mondrian forest (II-C,
II-D, II-E, II-F), and finally describe the experimental evaluation
framework (II-G, II-H).
A. Mondrian Forest
The Mondrian forest [1] is an ensemble method that aggregates
Mondrian trees. Each tree recursively splits the feature space,
similarly to a regular decision tree. However, the feature used in
the split and the value of the split are picked randomly. The
probability of selecting a feature is proportional to its range,
and the split value is uniformly selected from the range of the
feature. In contrast with other decision trees, the Mondrian tree
does not split leaves to introduce new nodes. Instead, it
introduces a new parent and a sibling to the node where the split
occurs. The original node and its descendants are not modified,
and no data point is moved to the new sibling besides the data
points that initialized the split. This approach allows the
Mondrian tree to introduce new branches at internal nodes. The
training algorithm does not rely on labels to build the tree;
however, each node maintains a counter for each label seen.
Therefore, labels can be delayed, but they are needed before
prediction. In addition to the counters, each node keeps track of
the range of each feature, which defines a box containing all the
data points sorted to that node. A data point can create a new
branch only if it is sorted to a node and falls outside of the
node's box.
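To make the split mechanics concrete, the following is a minimal
Python sketch of how a Mondrian tree could sample a split under
the description above; the function name and the list-based box
representation are illustrative assumptions, not the OrpailleCC
implementation.

    import random

    def sample_split(lower_bound, upper_bound):
        # Ranges of each feature inside the node's box
        # (assumes at least one feature has a non-zero range).
        ranges = [hi - lo for lo, hi in zip(lower_bound, upper_bound)]
        total = sum(ranges)
        # Pick a feature with probability proportional to its range.
        r = random.uniform(0.0, total)
        cumulative = 0.0
        feature = len(ranges) - 1  # fallback for floating-point edge cases
        for i, extent in enumerate(ranges):
            cumulative += extent
            if r <= cumulative:
                feature = i
                break
        # Pick the split value uniformly within that feature's range.
        value = random.uniform(lower_bound[feature], upper_bound[feature])
        return feature, value

For instance, with lower_bound = [0.0, 0.0] and upper_bound =
[2.0, 1.0], the first feature is selected about twice as often as
the second, reflecting its larger range.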
B. Mondrian Forest for Data Stream Classification
The implementation of Mondrian forest presented in [1], [4]
is online because trees rely on potentially all the previously
seen data points to grow new branches. To support data
streams, the Mondrian forest has to access data points only
once as the dataset is assumed to be infinite in size.
The work in [2] describes a Data Stream Mondrian forest with a
memory bound. The ensemble grows trees from a shared pool of
nodes, and the trees are paused when there is no node left. That
work also proposes out-of-memory strategies to keep updating the
statistics of the trees without creating new branches. In
particular, it recommends using the Extend Node strategy when the
memory is full: the boxes of the nodes are automatically extended
to fit all incoming data points, and the counters are
automatically increased.
The attributes of a node are an array counting the labels that
fall inside a leaf, and two arrays, lower_bound and upper_bound,
that define the box of the node. The Extend Node strategy
automatically increases the counter of the label in the leaf and
automatically adjusts lower_bound and upper_bound so that the new
data point fits inside the box.
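As an illustration, here is a minimal Python sketch of what the
Extend Node strategy could look like, using the node attributes
described above; the class layout and function name are
hypothetical, not the OrpailleCC code.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Node:
        counters: List[int]        # one count per label seen at this node
        lower_bound: List[float]   # per-feature lower edge of the node's box
        upper_bound: List[float]   # per-feature upper edge of the node's box

    def extend_node(leaf: Node, x: List[float], label: int) -> None:
        # Grow the box so that the new data point fits inside it,
        # without allocating any new node.
        for i, value in enumerate(x):
            leaf.lower_bound[i] = min(leaf.lower_bound[i], value)
            leaf.upper_bound[i] = max(leaf.upper_bound[i], value)
        # Update the label statistics of the leaf.
        leaf.counters[label] += 1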
Having a shared pool of nodes for the ensemble has a direct
impact on the number of trees. As mentioned before, having
more trees limits the tree depth and may lead to underfitting,
whereas having fewer trees increases the risk of overfitting.
C. Dynamic Tree Count Optimization
Algorithm 1 describes the function that trains the forest with a
new data point and dynamically adjusts the ensemble size. The
main idea is to compare prequential and postquential errors to
decide whether or not to adjust the forest size.
Data: f = a Mondrian forest
Data: x = a data point
Data: l = the label of data point x
1  Function train_forest(f, x, l) is
2      predicted_label = f.test(x);
3      prequential.update(predicted_label, l);
4      for each tree i in f do
5          train_tree(i, x, l);
6      end
7      predicted_label = f.test(x);
8      postquential.update(predicted_label, l);
9      post_metric = postquential.metric();
10     pre_metric = prequential.metric();
11     if post_metric is significantly greater than pre_metric then
12         trim_trees(f);
13         add_tree(f);
14     end
15 end
Algorithm 1: Training function for a data stream Mondrian
forest with a dynamic ensemble size.
The forest evaluates prequential statistics, meaning that it
predicts the label of the new data point before using it for
training (lines 2-3). After training, the forest evaluates
postquential statistics (lines 7-8). The prequential accuracy is
widely used as an approximation of the test-set accuracy for data
streams [5]. Here we introduce the postquential accuracy, where
we test after training, to simulate the accuracy on a training set.
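For illustration, a minimal sketch of an accuracy tracker that
could serve for both statistics; the paper does not specify the
exact estimator at this point, so the exponential fading factor
and the class interface are assumptions.

    class PrequentialAccuracy:
        # Running accuracy with an exponential fading factor, a common
        # choice on data streams so that old errors fade away
        # (hypothetical; the paper's exact statistic may differ).
        def __init__(self, fading: float = 0.99):
            self.fading = fading
            self.correct = 0.0
            self.seen = 0.0

        def update(self, predicted_label, true_label) -> None:
            self.correct = self.fading * self.correct + (predicted_label == true_label)
            self.seen = self.fading * self.seen + 1.0

        def metric(self) -> float:
            return self.correct / self.seen if self.seen > 0.0 else 0.0

The same tracker can be instantiated twice: once updated before
training (prequential) and once after training (postquential).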
To find the number of trees that maximizes prediction
performance, we initialize the Mondrian forest with a single
tree, and therefore with a high risk of overfitting. The idea is
to test for overfitting by comparing the prequential and
postquential accuracies, and to add a tree whenever the forest is
deemed to overfit. A model that overfits is defined as a model
that performs significantly better on the training set than on
the test set. Therefore, the problem of detecting overfitting
becomes a statistical testing problem between the training and
test accuracies.
While overfitting is commonly detected by comparing the
performance of the classifier on the training and test sets,
detecting underfitting for memory-constrained data stream models
remains very challenging. Consequently, Algorithm 1 is written to
only support tree addition. Should an underfitting criterion
become available in the future, Algorithm 1 could easily be
adapted to support tree removal.
Overall, Algorithm 1 can be adjusted through three components:
the comparison test (line 11), the type of pre/postquential
statistics used (lines 3 and 8), and the method used to add trees
(lines 12-13).
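Putting these pieces together, the following is a minimal Python
rendering of Algorithm 1 with the comparison test injected as a
parameter; the forest and tree methods (test, train, trim_trees,
add_tree) are hypothetical stand-ins for the OrpailleCC API, not
its actual interface.

    def train_forest(forest, x, label, prequential, postquential, is_overfitting):
        # Prequential statistics: test before training (lines 2-3).
        prequential.update(forest.test(x), label)
        # Train every tree on the new data point (lines 4-6).
        for tree in forest.trees:
            tree.train(x, label)
        # Postquential statistics: test after training (lines 7-8).
        postquential.update(forest.test(x), label)
        # If the training accuracy significantly exceeds the test
        # accuracy, the forest is deemed to overfit: free up memory
        # and add a tree (lines 11-14).
        if is_overfitting(postquential.metric(), prequential.metric()):
            forest.trim_trees()
            forest.add_tree()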
D. Comparison Test
In Algorithm 1, the update process determines if it needs to
add a tree based on a comparison between the prequential and
the postquential accuracies. If both accuracies are significantly
different, the forest is deemed to overfit and thus, the algorithm
adds a tree to compensate.
There are different methods to test the statistical difference
between two accuracies and we experiment with four in this
paper: the sum of variances, the t-test, the z-test, and the sum
of standard deviations.
Notations for the following equations include: $\mu_{pre}$ and
$\mu_{post}$, the means of the prequential and postquential
accuracies respectively; $\sigma^2_{pre}$ and $\sigma^2_{post}$,
the variances of the prequential and postquential accuracies
respectively; $n$, the size of the sample; and $\mu$ and
$\sigma^2$, the mean and variance of $\mu_{post} - \mu_{pre}$.
1) Sum of Variances: Equation 1 shows the sum of variances used
as a comparison test. The two accuracies are considered different
when their difference is higher than the square root of the sum
of the prequential and postquential variances.

$$\mu_{post} - \mu_{pre} > \sqrt{\sigma^2_{post} + \sigma^2_{pre}} \quad (1)$$
2) T-test: Equation 2 describes the t-test used to compare the
prequential and postquential accuracies. We apply a one-sample
t-test where we check whether $\mu$ is different from 0 with a
99% confidence [6].

$$\frac{\sqrt{n}\,\mu}{\sigma} > 2.326 \quad (2)$$
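For illustration, the first two comparison tests could be
implemented as follows, directly transcribing Equations 1 and 2;
this is a sketch under the notation above, not the paper's code,
and it assumes sigma > 0 and n > 0.

    import math

    def sum_of_variances_test(mu_post, mu_pre, var_post, var_pre):
        # Equation 1: the accuracies differ when the gap exceeds
        # the square root of the summed variances.
        return (mu_post - mu_pre) > math.sqrt(var_post + var_pre)

    def t_test(mu, sigma, n, critical=2.326):
        # Equation 2: one-sample t-test on the mean gap mu, with
        # standard deviation sigma over n samples, at 99%
        # confidence (one-sided).
        return math.sqrt(n) * mu / sigma > critical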
3) Z-test: Equation 3 shows how we compute the pooled
two-proportion z-test for $\mu_{pre} = \mu_{post}$. We first
compute the z-score, then compare it with the Z-value that
ensures a confidence of 99% [7], [8]. The z-score is the
observed dif-