Handling categorical variables on new data

Hey,
I have two questions about how FLAML handles categorical variables on new data that differs from the initial training dataset (for example, during inference after model deployment).

1. Does it handle new categories in categorical features (categories unseen during training)?
2. The scikit-learn and XGBoost estimators use ordinal encodings of categorical features, but it seems the categorical codes are extracted at inference time ([code](https://github.com/microsoft/FLAML/blob/e3ca95bf8a5cc9c86638dab8ef0eb44bc1f3ee6f/flaml/automl/model.py#L1221)). Doesn't this mean the encodings will differ when running on a different dataset, so that the categories passed to the model get mixed up? If so, sklearn's [OrdinalEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html) would be a better choice here, since it persists the category codes learned during training.
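
To make the concern in question 2 concrete, here is a minimal sketch on toy data. It is not FLAML's code: the `color` column and both data frames are made up for illustration, and the choice of `-1` for unseen categories is just one possible option. It shows how codes derived independently from each data frame can disagree, and how a fitted `OrdinalEncoder` keeps the training mapping.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data: "yellow" is unseen during training.
train = pd.DataFrame({"color": ["blue", "green", "red"]})
new = pd.DataFrame({"color": ["red", "yellow", "green"]})

# Codes derived independently from each frame: the mapping depends on
# which categories happen to be present in that frame.
train_codes = train["color"].astype("category").cat.codes.tolist()
new_codes = new["color"].astype("category").cat.codes.tolist()
print(train_codes)  # "red" is encoded as 2 here
print(new_codes)    # "red" is encoded as 1 here

# An encoder fitted on the training data reuses the training mapping and
# sends unseen categories to a sentinel value (assumed to be -1 here).
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train[["color"]])
persisted_codes = encoder.transform(new[["color"]]).ravel().tolist()
print(persisted_codes)  # "red" stays 2, "yellow" becomes -1
```

With the fitted encoder, "red" keeps the same code in both datasets, and the unseen "yellow" is flagged explicitly instead of silently taking over another category's code.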

Handling categorical variables on new data #1101
