eogrow.pipelines.training
Implements a base training pipeline and LGBM-specialized classification and regression model training pipelines.
- pydantic model eogrow.pipelines.training.RandomTrainTestSplitSchema[source]
Bases:
Schema
Create a new model by parsing and validating input data from keyword arguments.
Raises ValidationError if the input data cannot be parsed to form a valid model.
- Fields:
- field random_state: int = 42
Seed used in the data splitter (either for sklearn.model_selection.train_test_split or for sklearn.utils.shuffle).
- field train_size: float [Required]
Fraction of the data used for training (e.g. 0.8 gives an 80/20 train/test split).
- Constraints:
minimum = 0
maximum = 1
- class eogrow.pipelines.training.BaseTrainingPipeline(config, raw_config=None)[source]
Bases:
Pipeline
A base pipeline for training an ML model.
This class has a few abstract methods that must be implemented, but in general all public methods are designed so that they can be overridden in a child class.
- Parameters:
config (Schema) – A dictionary with configuration parameters
raw_config (RawConfig | None) – The configuration parameters pre-validation, for logging purposes only
- pydantic model Schema[source]
Bases:
Schema
Create a new model by parsing and validating input data from keyword arguments.
Raises ValidationError if the input data cannot be parsed to form a valid model.
- Fields:
input_folder_key (str)
input_patch_file (None)
model_filename (str)
model_folder_key (str)
model_parameters (Dict[str, Any])
patch_list (None)
skip_existing (Literal[False])
train_features (List[str])
train_reference (str)
train_test_split (eogrow.pipelines.training.RandomTrainTestSplitSchema)
- field input_folder_key: str [Required]
The storage manager key pointing to the model training data.
- Validated by:
validate_storage_key
- field input_patch_file: None = None
- field model_filename: str [Required]
- field model_folder_key: str [Required]
The storage manager key pointing to the folder where the model will be saved.
- Validated by:
validate_storage_key
- field model_parameters: Dict[str, Any] [Optional]
Parameters to be provided to the model.
- field patch_list: None = None
- field skip_existing: Literal[False] = False
- field train_features: List[str] [Required]
A list of feature filenames to join into training features in the given order.
- field train_reference: str [Required]
Name of file where the reference data is stored.
- field train_test_split: RandomTrainTestSplitSchema [Required]
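The fields above can be collected into a pipeline configuration. A hypothetical JSON config fragment (all filenames and parameter values below are illustrative, not defaults):

```json
{
    "pipeline": "eogrow.pipelines.training.ClassificationTrainingPipeline",
    "input_folder_key": "training_data",
    "model_folder_key": "models",
    "model_filename": "crop-classifier.joblib",
    "train_features": ["features_2021.npy", "features_2022.npy"],
    "train_reference": "reference.npy",
    "model_parameters": {"n_estimators": 100},
    "train_test_split": {"train_size": 0.8, "random_state": 42}
}
```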
- run_procedure()[source]
The main pipeline procedure:
1. Prepare data. The output serves as input to both the training and scoring methods, so the separation of training and testing data should be done within the prepared object.
2. Train the model.
3. Save the model.
4. Evaluate the model.
- Return type:
tuple[list[str], list[str]]
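The control flow above can be sketched as follows. This is a dependency-free illustration, not the real implementation: the helper names `prepare_data` and `score_model` and the stub bodies are assumptions, and the `Pipeline` base class is omitted entirely.

```python
class TrainingSketch:
    """Illustrative stand-in for BaseTrainingPipeline's run_procedure flow."""

    def prepare_data(self):
        # Stand-in: the real method loads features/reference and splits them.
        return {"features": [1, 2, 3], "reference": [0, 1, 0]}

    def train_model(self, prepared_data):
        # Stand-in for the abstract method implemented by subclasses.
        return {"trained_on": len(prepared_data["features"])}

    def save_model(self, model):
        self.saved = model

    def score_model(self, model, prepared_data):
        self.scored = True

    def run_procedure(self):
        prepared_data = self.prepare_data()      # step 1: data preparation
        model = self.train_model(prepared_data)  # step 2: training
        self.save_model(model)                   # step 3: persistence
        self.score_model(model, prepared_data)   # step 4: evaluation
        return [], []  # (finished, failed) patch names; none apply to training
```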
- preprocess_data(features, reference)[source]
Performs filtering and other preprocessing before splitting the data.
- Parameters:
features (ndarray) –
reference (ndarray) –
- Return type:
tuple[numpy.ndarray, numpy.ndarray]
- train_test_split(features, reference)[source]
Computes a random train-test split.
The returned order is: train features, test features, train reference, test reference.
- Parameters:
features (ndarray) –
reference (ndarray) –
- Return type:
tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray]
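A minimal numpy sketch of the split and its return order. The real method delegates to sklearn.model_selection.train_test_split, configured by RandomTrainTestSplitSchema; this function only illustrates the semantics:

```python
import numpy as np

def random_split(features, reference, train_size=0.8, random_state=42):
    """Shuffle sample indices with a fixed seed, then cut at train_size."""
    rng = np.random.default_rng(random_state)
    order = rng.permutation(len(features))
    cut = int(round(train_size * len(features)))
    train_idx, test_idx = order[:cut], order[cut:]
    # Return order matches the documented one:
    # train features, test features, train reference, test reference.
    return features[train_idx], features[test_idx], reference[train_idx], reference[test_idx]

X = np.arange(20).reshape(10, 2)  # row i is [2i, 2i + 1]
y = np.arange(10)
X_train, X_test, y_train, y_test = random_split(X, y, train_size=0.8)
```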
- abstract train_model(prepared_data)[source]
Trains the model on the data.
- Parameters:
prepared_data (dict) –
- Return type:
object
- save_model(model)[source]
Saves the resulting model.
- Parameters:
model (object) –
- Return type:
None
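save_model persists the trained model under model_filename in the folder referenced by model_folder_key. A dependency-free sketch using stdlib pickle; the actual pipeline writes through its storage manager and may use a different serializer:

```python
import os
import pickle
import tempfile

def save_model_sketch(model, folder, filename):
    """Illustrative save step: serialize the model under the given filename."""
    path = os.path.join(folder, filename)
    with open(path, "wb") as fh:
        pickle.dump(model, fh)
    return path

with tempfile.TemporaryDirectory() as folder:
    path = save_model_sketch({"dummy": "model"}, folder, "model.pkl")
    with open(path, "rb") as fh:
        restored = pickle.load(fh)
```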
- pydantic model eogrow.pipelines.training.ClassificationPreprocessSchema[source]
Bases:
Schema
Create a new model by parsing and validating input data from keyword arguments.
Raises ValidationError if the input data cannot be parsed to form a valid model.
- field filter_classes: List[int] [Optional]
Specifies the IDs of classes to use for training. If empty, all classes are used.
- field label_encoder_filename: str | None = None
If specified, a label encoder is used and saved under the given name.
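A numpy sketch of the preprocessing these two fields configure: keep only the listed class IDs, then remap them to consecutive labels (the remapping stands in for what a fitted label encoder would do). The function name and the searchsorted-based encoding are illustrative, not the pipeline's actual implementation:

```python
import numpy as np

def filter_and_encode(features, reference, filter_classes):
    """Drop samples outside filter_classes, then encode IDs to 0..n-1."""
    mask = np.isin(reference, filter_classes)
    features, reference = features[mask], reference[mask]
    classes = np.unique(reference)                 # sorted surviving class IDs
    encoded = np.searchsorted(classes, reference)  # map IDs to consecutive labels
    return features, encoded, classes

X = np.arange(12).reshape(6, 2)
y = np.array([3, 7, 3, 9, 7, 3])
X_f, y_enc, classes = filter_and_encode(X, y, filter_classes=[3, 7])
```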
- class eogrow.pipelines.training.ClassificationTrainingPipeline(config, raw_config=None)[source]
Bases:
BaseTrainingPipeline
A base pipeline for training an ML classifier. Uses LGBMClassifier by default.
- Parameters:
config (Schema) – A dictionary with configuration parameters
raw_config (RawConfig | None) – The configuration parameters pre-validation, for logging purposes only
- pydantic model Schema[source]
Bases:
Schema
Create a new model by parsing and validating input data from keyword arguments.
Raises ValidationError if the input data cannot be parsed to form a valid model.
- Fields:
preprocessing (Optional[ClassificationPreprocessSchema])
- field preprocessing: ClassificationPreprocessSchema | None = None
- preprocess_data(features, reference)[source]
Performs filtering and other preprocessing before splitting the data.
- Parameters:
features (ndarray) –
reference (ndarray) –
- Return type:
tuple[numpy.ndarray, numpy.ndarray]
- class eogrow.pipelines.training.RegressionTrainingPipeline(config, raw_config=None)[source]
Bases:
BaseTrainingPipeline
A base pipeline for training an ML regressor. Uses LGBMRegressor by default.
- Parameters:
config (Schema) – A dictionary with configuration parameters
raw_config (RawConfig | None) – The configuration parameters pre-validation, for logging purposes only
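To show the train_model override pattern these subclasses follow, here is a hypothetical minimal subclass. The shipped pipelines construct an LGBMRegressor (or LGBMClassifier) from model_parameters; a trivial mean-predicting estimator stands in so the sketch stays dependency-free, and the "train_features"/"train_reference" keys of prepared_data are assumptions, not the documented format:

```python
import numpy as np

class MeanModel:
    """Trivial stand-in estimator that predicts the mean of the targets."""

    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)

class SketchRegressionPipeline:
    """Hypothetical subclass illustrating the train_model hook."""

    def train_model(self, prepared_data):
        # The shipped pipeline would build LGBMRegressor(**model_parameters) here.
        model = MeanModel()
        return model.fit(prepared_data["train_features"], prepared_data["train_reference"])

data = {
    "train_features": np.zeros((4, 2)),
    "train_reference": np.array([1.0, 2.0, 3.0, 4.0]),
}
model = SketchRegressionPipeline().train_model(data)
```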