eogrow.pipelines.training

Implements a base training pipeline and specialized LightGBM classification and regression model training pipelines.

pydantic model eogrow.pipelines.training.RandomTrainTestSplitSchema[source]

Bases: Schema

Create a new model by parsing and validating input data from keyword arguments.

Raises ValidationError if the input data cannot be parsed to form a valid model.

Fields:
field random_state: int = 42

Seed used by the data splitter (either for sklearn.model_selection.train_test_split or for sklearn.utils.shuffle).

field train_size: float [Required]

Training size value (0.8 = 80/20 split for training/testing).

Constraints:
  • minimum = 0

  • maximum = 1
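The two fields above can be illustrated with a minimal configuration fragment (the dictionary keys come from the schema; the surrounding pipeline configuration is elided):

```python
# Hypothetical config fragment for RandomTrainTestSplitSchema.
# train_size is required and must lie in [0, 1]; random_state defaults to 42.
train_test_split_config = {
    "train_size": 0.8,   # 80/20 split for training/testing
    "random_state": 42,  # seed for the underlying splitter
}

# The schema's constraints, expressed directly:
assert 0 <= train_test_split_config["train_size"] <= 1
```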

class eogrow.pipelines.training.BaseTrainingPipeline(config, raw_config=None)[source]

Bases: Pipeline

A base pipeline for training an ML model

This class has a few abstract methods that must be implemented. In general, however, all public methods are designed so that they can be overridden in a child class.

Parameters:
  • config (Schema) – A dictionary with configuration parameters

  • raw_config (RawConfig | None) – The configuration parameters pre-validation, for logging purposes only

pydantic model Schema[source]

Bases: Schema

Create a new model by parsing and validating input data from keyword arguments.

Raises ValidationError if the input data cannot be parsed to form a valid model.

Fields:
  • input_folder_key (str)

  • input_patch_file (None)

  • model_filename (str)

  • model_folder_key (str)

  • model_parameters (Dict[str, Any])

  • patch_list (None)

  • skip_existing (Literal[False])

  • train_features (List[str])

  • train_reference (str)

  • train_test_split (eogrow.pipelines.training.RandomTrainTestSplitSchema)

field input_folder_key: str [Required]

The storage manager key pointing to the model training data.

Validated by:
  • validate_storage_key

field input_patch_file: None = None
field model_filename: str [Required]
field model_folder_key: str [Required]

The storage manager key pointing to the folder where the model will be saved.

Validated by:
  • validate_storage_key

field model_parameters: Dict[str, Any] [Optional]

Parameters to be provided to the model

field patch_list: None = None
field skip_existing: Literal[False] = False
field train_features: List[str] [Required]

A list of feature filenames to join into training features in the given order.

field train_reference: str [Required]

Name of file where the reference data is stored.

field train_test_split: RandomTrainTestSplitSchema [Required]
config: Schema
run_procedure()[source]

The main pipeline procedure

  1. Prepares data. The output serves as input to both the training and scoring methods, so the separation of training and testing data should be done within this object.

  2. Train model

  3. Save model

  4. Evaluate model

Return type:

tuple[list[str], list[str]]
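The four steps above can be sketched as a plain-Python skeleton. This is a simplified illustration of the control flow, not the actual implementation; the stand-in method bodies and the prepared-data key names are assumptions.

```python
class SketchTrainingPipeline:
    """Illustrative skeleton mirroring the run_procedure steps above."""

    def run_procedure(self):
        prepared_data = self.prepare_data()       # 1. prepare data (split done inside)
        model = self.train_model(prepared_data)   # 2. train model
        self.save_model(model)                    # 3. save model
        self.score_results(prepared_data, model)  # 4. evaluate model
        return [], []  # (finished, failed) patch lists

    # Minimal stand-ins so the skeleton runs; real pipelines override these.
    def prepare_data(self):
        return {"features": [[0.0], [1.0]], "reference": [0, 1]}

    def train_model(self, prepared_data):
        return {"trained_on": len(prepared_data["reference"])}

    def save_model(self, model):
        pass

    def score_results(self, prepared_data, model):
        pass
```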

prepare_data()[source]

Loads and preprocesses data.

Return type:

dict

preprocess_data(features, reference)[source]

Performs filtering and other preprocessing before splitting the data.

Parameters:
  • features (ndarray) –

  • reference (ndarray) –

Return type:

tuple[numpy.ndarray, numpy.ndarray]

train_test_split(features, reference)[source]

Computes a random train-test split

The returned order is train-features, test-features, train-reference, test-reference.

Parameters:
  • features (ndarray) –

  • reference (ndarray) –

Return type:

tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray]
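The output order of the split can be illustrated with a toy re-implementation (a pure-Python sketch of the contract, not the actual implementation, which presumably delegates to sklearn.model_selection.train_test_split):

```python
import random

def random_train_test_split(features, reference, train_size=0.8, random_state=42):
    """Toy split illustrating the output order:
    train-features, test-features, train-reference, test-reference."""
    rng = random.Random(random_state)
    indices = list(range(len(features)))
    rng.shuffle(indices)
    cut = int(len(indices) * train_size)  # e.g. 0.8 -> 80/20 split
    train_idx, test_idx = indices[:cut], indices[cut:]
    return (
        [features[i] for i in train_idx],   # train features
        [features[i] for i in test_idx],    # test features
        [reference[i] for i in train_idx],  # train reference
        [reference[i] for i in test_idx],   # test reference
    )
```

With `train_size=0.8` and 10 samples this yields 8 training and 2 test samples, with features and reference rows staying aligned.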

abstract train_model(prepared_data)[source]

Trains the model on the data.

Parameters:

prepared_data (dict) –

Return type:

object

save_model(model)[source]

Saves the resulting model.

Parameters:

model (object) –

Return type:

None

abstract score_results(prepared_data, model)[source]

Scores the resulting model and reports the metrics into the log files.

Parameters:
  • prepared_data (dict) –

  • model (Any) –

Return type:

None

predict(model, features)[source]

Evaluates the model on the given features. Should be overridden for models with a different interface.

Parameters:
  • model (Any) –

  • features (ndarray) –

Return type:

ndarray
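An override of this method might look as follows for a model that is invoked directly rather than exposing a `.predict` method. The subclass name and the toy callable model are hypothetical; only the `predict(model, features)` signature comes from the documentation above.

```python
class CallableModelPipeline:
    """Hypothetical subclass: overrides predict for a model that is
    invoked as model(features) instead of model.predict(features)."""

    def predict(self, model, features):
        # Call the model directly instead of using the default .predict interface.
        return model(features)

# Usage with a toy callable "model" (a simple threshold on the first feature):
pipeline = CallableModelPipeline()
threshold_model = lambda features: [1 if f[0] > 0.5 else 0 for f in features]
predictions = pipeline.predict(threshold_model, [[0.2], [0.9]])
```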

pydantic model eogrow.pipelines.training.ClassificationPreprocessSchema[source]

Bases: Schema

Create a new model by parsing and validating input data from keyword arguments.

Raises ValidationError if the input data cannot be parsed to form a valid model.

Fields:
field filter_classes: List[int] [Optional]

Specifies the IDs of the classes to be used for training. If empty, all classes are used.

field label_encoder_filename: str | None = None

If specified, a label encoder is used and saved under the given name.
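The two fields interact roughly as follows: `filter_classes` drops rows whose reference ID is not listed, and the optional label encoder remaps the remaining IDs onto a contiguous 0..n-1 range. This is a pure-Python sketch of the idea; the function name and the encoding details are assumptions (the actual pipeline presumably uses sklearn.preprocessing.LabelEncoder).

```python
def filter_and_encode(features, reference, filter_classes=None, use_encoder=True):
    """Keep only the listed classes, then remap class IDs to 0..n-1."""
    if filter_classes:
        # Drop rows whose class ID is not in filter_classes.
        kept = [i for i, label in enumerate(reference) if label in filter_classes]
        features = [features[i] for i in kept]
        reference = [reference[i] for i in kept]
    if use_encoder:
        # Remap the remaining class IDs onto 0..n-1 in sorted order.
        mapping = {cls: idx for idx, cls in enumerate(sorted(set(reference)))}
        reference = [mapping[label] for label in reference]
    return features, reference
```

For example, with reference IDs `[10, 20, 30, 20]` and `filter_classes=[10, 20]`, class 30 is dropped and the remaining IDs are encoded as 0 and 1.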

class eogrow.pipelines.training.ClassificationTrainingPipeline(config, raw_config=None)[source]

Bases: BaseTrainingPipeline

A base pipeline for training an ML classifier. Uses LGBMClassifier by default.

Parameters:
  • config (Schema) – A dictionary with configuration parameters

  • raw_config (RawConfig | None) – The configuration parameters pre-validation, for logging purposes only

pydantic model Schema[source]

Bases: Schema

Create a new model by parsing and validating input data from keyword arguments.

Raises ValidationError if the input data cannot be parsed to form a valid model.

Fields:
  • preprocessing (Optional[ClassificationPreprocessSchema])

field preprocessing: ClassificationPreprocessSchema | None = None
config: Schema
preprocess_data(features, reference)[source]

Performs filtering and other preprocessing before splitting the data.

Parameters:
  • features (ndarray) –

  • reference (ndarray) –

Return type:

tuple[numpy.ndarray, numpy.ndarray]

train_model(prepared_data)[source]

Trains the model on the data.

Parameters:

prepared_data (dict) –

Return type:

object
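A generic sketch of what this training step amounts to: instantiate the configured model class with `model_parameters` and fit it on the training split. In the actual pipeline the default model class is LGBMClassifier; the `MajorityClassifier` stand-in and the prepared-data key names here are illustrative assumptions.

```python
def train_model(prepared_data, model_class, model_parameters=None):
    """Instantiate the model with the configured parameters and fit it
    on the training split (key names are assumptions)."""
    model = model_class(**(model_parameters or {}))
    model.fit(prepared_data["train_features"], prepared_data["train_reference"])
    return model

class MajorityClassifier:
    """Stand-in exposing the fit/predict interface LGBMClassifier also has:
    always predicts the most frequent training class."""

    def fit(self, X, y):
        self.majority = max(set(y), key=y.count)
        return self

    def predict(self, X):
        return [self.majority] * len(X)

# Usage on toy data:
data = {"train_features": [[0], [1], [2]], "train_reference": [1, 1, 0]}
model = train_model(data, MajorityClassifier)
```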

score_results(prepared_data, model)[source]

Scores the resulting model and reports the metrics into the log files.

Parameters:
  • prepared_data (dict) –

  • model (Any) –

Return type:

None

class eogrow.pipelines.training.RegressionTrainingPipeline(config, raw_config=None)[source]

Bases: BaseTrainingPipeline

A base pipeline for training an ML regressor. Uses LGBMRegressor by default.

Parameters:
  • config (Schema) – A dictionary with configuration parameters

  • raw_config (RawConfig | None) – The configuration parameters pre-validation, for logging purposes only

train_model(prepared_data)[source]

Trains the model on the data.

Parameters:

prepared_data (dict) –

Return type:

object

score_results(prepared_data, model)[source]

Scores the resulting model and reports the metrics into the log files.

Parameters:
  • prepared_data (dict) –

  • model (Any) –

Return type:

None