## High Level Overview

The two main categories of `eo-grow` building blocks are:

- configurable objects (subclasses of `EOGrowObject`),
- configuration schemas (subclasses of `EOGrowObject.Schema`).

Each `EOGrowObject` is initialized with a `Schema` object. The `Schema` is saved to the object as the attribute `config: Schema`, which stores the configuration information.

The configurable objects can be further separated into instances of:

- `Manager`, a helper class with a limited scope,
- `Pipeline`, a class for execution.

`Manager` classes are used to build configurations for specific aspects of the pipeline, such as area, storage, or logging, while the `Pipeline` class accepts the full configuration (pipeline-specific + all managers) and contains methods of execution.
## Schemas

The `Schema` is in general a pydantic model, but with some project-specific constraints and additions. It is best to always inherit from `EOGrowObject.Schema` to ensure a suitable pydantic configuration of the models.

The `EOGrowObject.Schema` model:

- rejects any additional parameters that are not listed,
- does not allow mutation,
- validates default values.

In case you are inheriting from a `Manager` or a `Pipeline` class, it is heavily advised to let `Schema` be a subclass of the superclass schema (type-checkers should warn you about it).
### Validators

You can use any kind of pydantic validators to verify the data in your schema. You can find some existing utility functions in `eogrow.utils.validators`:

- `field_validator`/`optional_field_validator` for wrapping callables defined elsewhere,
- `ensure_exactly_one_defined` and `ensure_defined_together` for linking together parameters that can be `None`,
- `ensure_storage_key_presence` for checking that storage keys are defined in the storage manager (see section on managers),
- `restrict_types` to restrict which feature types are allowed on a field that defines a feature.

Root validators can also be used, but are discouraged in the main `eo-grow` repository as they clutter the documentation pages.
For example, a storage key presence could be validated in the following way:

```python
class Schema(Pipeline.Schema):
    folder_key: str = "check_if_i_exist"
    _check_folder_key_presence = ensure_storage_key_presence("folder_key")
```
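The `ensure_exactly_one_defined` utility links two optional fields so that exactly one of them is set. The following plain-Python sketch (an illustration of the invariant such a validator enforces, not the `eo-grow` implementation) shows the check itself:

```python
from typing import Optional


def check_exactly_one_defined(first: Optional[str], second: Optional[str]) -> None:
    """Raise if not exactly one of the two values is defined (i.e. not None)."""
    if (first is None) == (second is None):
        raise ValueError("Exactly one of the two parameters must be defined.")


check_exactly_one_defined("input_folder", None)  # passes: exactly one is defined
```

In a schema, attaching such a check to a pair of `Optional` fields guarantees at validation time that a user provides one, and only one, of the two alternatives.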
### Parsers

Certain types do not provide direct parsing capabilities (for instance `numpy.dtype` or `datetime`). In such cases you can use pre-validators, which means that the validator is applied before pydantic checks that the type is right (see the pydantic documentation for more info). This is done by setting the `pre` flag of validators to `True`. The `field_validator` and `optional_field_validator` utilities also allow this setting, so you can do:
```python
from typing import Optional

import numpy as np

from eogrow.core.schemas import BaseSchema
from eogrow.utils.validators import optional_field_validator, parse_dtype


class MyModel(BaseSchema):
    maybe_dtype: Optional[np.dtype]

    _parse_maybe_dtype = optional_field_validator("maybe_dtype", parse_dtype, pre=True)
```
Other predefined parsers are `parse_time_period` and `parse_data_collection`.
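To illustrate what such a parser does: a pre-validator for `numpy.dtype` only needs to turn the raw configuration value (usually a string) into the target type before pydantic sees it. A minimal sketch of that parsing step (not necessarily the exact `eo-grow` implementation) could be:

```python
import numpy as np


def parse_dtype(value: str) -> np.dtype:
    """Turn a raw string such as "uint8" into a numpy dtype object."""
    return np.dtype(value)


parsed = parse_dtype("uint8")  # np.dtype("uint8")
```

Because it runs as a pre-validator, the field can then be annotated directly with `np.dtype` even though JSON configurations only carry strings.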
## Managers

Managers are helper classes of pipelines that focus on a single role. Each manager is defined through a `Schema` configuration, where the configuration fields are specific to the manager at hand. This section focuses on the different managers used by the `Pipeline` class and how to work with them. The pipelines section then touches on how to connect all these managers to create and run a custom pipeline.
### Storage Manager

The storage manager takes care of data storage and works both with local storage and Amazon S3. Its primary purpose is to provide correct filesystem objects and filepaths in said filesystem. A basic overview of the `StorageManager` schema can be found below; for more information visit the API docs.
```json
{
  "manager": "eogrow.core.storage.StorageManager",
  "project_folder": "some_path/project_root_folder",
  "structure": {
    "data": "s2_downloaded_data",
    "reference": "reference",
    "models": "lgbm_models/built-up-detectors/models",
    "results": "built-up-predictions"
  }
}
```
To avoid keeping track of absolute paths, the storage manager utilizes a `key: path` mapping, which is specified as the `structure` parameter. Pipelines then operate with `input_folder_key="data"` instead of `input_path="some_path/project_root_folder/s2_downloaded_data"`. This approach is also much more resilient to typos.

Notable attributes/methods are:

- the `filesystem` attribute, which can be used inside pipelines for IO,
- `get_folder`, which, given a folder-key, provides the path in the `filesystem` to the desired folder.

While the folder-key approach appears limiting at first, it turns out to be flexible enough for the majority of cases. For more advanced use see common configuration patterns.
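Conceptually, the folder-key lookup is just a mapping from keys to paths relative to the project folder. A simplified stdlib-only sketch of that resolution (the real `StorageManager` works with filesystem objects and also handles S3) could look like this:

```python
import posixpath


def resolve_folder(project_folder: str, structure: dict, folder_key: str) -> str:
    """Resolve a folder key from the `structure` mapping into a full path."""
    if folder_key not in structure:
        raise ValueError(f"Folder key {folder_key!r} is not defined in the storage structure.")
    return posixpath.join(project_folder, structure[folder_key])


structure = {"data": "s2_downloaded_data", "results": "built-up-predictions"}
path = resolve_folder("some_path/project_root_folder", structure, "data")
# path == "some_path/project_root_folder/s2_downloaded_data"
```

This is why a typo in a folder key fails loudly at configuration time, instead of silently writing to a wrong absolute path.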
### Area Managers

The area manager takes care of how the area-of-interest (AOI) is loaded and how it is split into chunks to be processed by `eo-grow`. There are several pre-defined area managers available in the project, focusing on the few most common use cases for providing AOI specifications.

All area managers provide the following functionalities for development:

- `get_patch_list()`, for obtaining the list of patch names and corresponding bboxes,
- `get_area_geometry()`, for obtaining the dissolved geometry of the AOI,
- `get_grid(filtered=True|False)`, for obtaining the split AOI in the form of a grid.
#### UTM Zone Area Manager

The `UtmZoneAreaManager` is probably the most commonly used area manager and the most intuitive to work with. The user-provided geometry is split into patches of the user-provided size. If the AOI spans multiple UTM zones, the patches are grouped per zone. Read the API docs on the `UtmZoneAreaManager` for more info.
#### Custom Grid Area Manager

For users who have a very specific way of splitting the AOI in mind, we provide the `CustomGridAreaManager`, which accepts a grid file of an already split AOI. The user only needs to provide the grid file folder key and name, along with the `name_column` parameter, which points to the column containing the patch names to be used. The folder key by default points to the `input_data` location, but could be any other location defined by the storage structure. Read the API docs on the `CustomGridAreaManager` for more info.
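A configuration for this manager might look along the following lines. Note that, apart from `name_column`, the exact parameter names (`grid_folder_key`, `grid_filename`) are assumptions here and should be checked against the API docs:

```json
{
  "manager": "eogrow.core.area.CustomGridAreaManager",
  "grid_folder_key": "input_data",
  "grid_filename": "my_custom_grid.gpkg",
  "name_column": "patch_name"
}
```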
#### Batch Area Manager

For users working with the Sentinel Hub Batch API, we have prepared the `BatchAreaManager`, which splits the area according to Sentinel Hub tiling grids. This area manager is meant for larger projects focusing on larger areas.

The interface of the `BatchAreaManager` relies heavily on the predefined configuration options of the Batch API, so be sure to provide sensible values for the parameters. For example, the `tiling_grid_id` and `resolution` parameters should correspond to values stated in the docs.

For existing projects involving the Batch API, it is possible to provide the `batch_id` parameter, which will search for existing grids corresponding to the batch request. If the `batch_id` is not provided (the default), the `BatchAreaManager` will generate a new batch job with the given parameters. Read the API docs on the `BatchAreaManager` for more info.
### Logging Manager

The logging manager ensures that logging handlers are set up and configured correctly. It allows adjusting which packages to log to files, which to stdout, and which to ignore. It is unlikely you'll ever need to access any of its methods directly; use the standard `LOGGER = logging.getLogger(__name__)` for logging.

Settings that reference packages to ignore/show have a collection of default packages. One can reference them in the configuration with `"..."`.

```json
{
  "manager": "eogrow.core.logging.LoggingManager",
  "show_logs": true,
  "stdout_log_packages": ["...", "cool_package", "cooler_package"]
}
```
With the above settings the stdout logs will include `cool_package`, `cooler_package`, and also all of the default packages: `eogrow`, `__main__`, `root`, and `sentinelhub.api.batch`.
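The `"..."` entry acts as a placeholder for the defaults. A small stdlib sketch of how such a placeholder could be expanded (a hypothetical helper for illustration, not the actual `eo-grow` code):

```python
# the default package list from the documentation above
DEFAULT_PACKAGES = ["eogrow", "__main__", "root", "sentinelhub.api.batch"]


def expand_packages(packages: list) -> list:
    """Replace the "..." placeholder with the default package list."""
    expanded = []
    for package in packages:
        expanded.extend(DEFAULT_PACKAGES if package == "..." else [package])
    return expanded


result = expand_packages(["...", "cool_package", "cooler_package"])
# result contains the four defaults followed by the two extra packages
```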
## Pipelines

A `Pipeline` is an object focused on executing a specific `EOWorkflow` over a collection of patches. It represents the interface for managing the data and logging with the use of managers, and contains instructions for execution in the form of pipeline-specific tasks.

The `Pipeline` class has multiple run methods that appear to have similar functionality:

- `run` is the main execution method. It sets up logging and error handlers around `run_procedure`. It is not meant to be changed.
- `run_procedure` contains instructions on what the pipeline does. By default, it creates a workflow with `build_workflow` and runs `run_execution`. Override it if you need the pipeline to also process things outside of an `EOWorkflow` (e.g. combine results).
- `run_execution` takes care of logging and execution of the workflow. It is not meant to be changed.
- `build_workflow` is a method that builds an `EOWorkflow` that the pipeline executes. This is the method you usually want to implement.
In fact, when writing a custom pipeline, the majority of cases only need the following:

1. defining the pipeline schema,
2. defining a custom `build_workflow` method,
3. constructing execution arguments (optional),
4. providing filtering logic (optional).

The following sections expand on each item in the list above.
### Defining the Pipeline Schema

The configuration schema of the `Pipeline` class already has some pre-defined parameters which need to be provided for execution in addition to the managers. A full list of the parameters and their detailed descriptions can be found in the pipeline schema API docs.

Building a custom pipeline is straightforward: you only need to provide a suitable nested subclass of `Pipeline.Schema`, which must always be named `Schema`, as shown below:
```python
# example of how to write a custom pipeline
class MyPipeline(Pipeline):
    class Schema(Pipeline.Schema):
        extra_field: str = "beep"
        ...

    # this line informs type-checkers that the type of `config` is no longer
    # `Pipeline.Schema` but `MyPipeline.Schema`
    config: Schema

    def custom_method(self):
        ...

    ...
```
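A configuration file for such a pipeline then carries the manager configurations alongside the pipeline-specific fields. The shape below is an assumed sketch (the key names and the `my_package.MyPipeline` path are hypothetical; only `extra_field` comes from the example above):

```json
{
  "pipeline": "my_package.MyPipeline",
  "extra_field": "boop",
  "storage": {
    "manager": "eogrow.core.storage.StorageManager",
    "project_folder": "some_path/project_root_folder",
    "structure": {"data": "s2_downloaded_data"}
  },
  "logging": {"manager": "eogrow.core.logging.LoggingManager"}
}
```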
### Building the Workflow

All pipelines expect an implementation of the `build_workflow` method, where the tasks for running specific work are defined and grouped into a workflow. Many workflows tend to be of the form:

1. load the patch,
2. perform specific tasks,
3. save the patch and/or results.

You can however load from multiple locations, merge patches, process and filter data, save some features and output some others. Anything you can do in `eo-learn` you can do here (but on a larger scale).
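The load → process → save pattern can be pictured as a linear chain of tasks. The toy stdlib sketch below mimics that chaining with plain functions (real pipelines use `eo-learn` tasks such as `LoadTask` and `SaveTask`, wired into an `EOWorkflow` via nodes, rather than a list of callables):

```python
from functools import reduce


def load_patch(name):
    """Stand-in for a load task: produce a patch object from its name."""
    return {"name": name, "features": []}


def compute_feature(patch):
    """Stand-in for a processing task: add a derived feature."""
    patch["features"].append("NDVI")
    return patch


def save_patch(patch):
    """Stand-in for a save task: here we simply return the patch."""
    return patch


# a workflow is an ordered chain of tasks applied to each patch
workflow = [load_patch, compute_feature, save_patch]
result = reduce(lambda value, task: task(value), workflow, "eopatch-0")
# result == {"name": "eopatch-0", "features": ["NDVI"]}
```

An `EOWorkflow` generalizes this linear chain to a directed acyclic graph, which is what allows loading from multiple locations or outputting several results at once.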
Constructing Execution Arguments
In some cases, a task requires additional information at runtime, which can be unique per patch, such as the load/save location of a patch, or a specific bbox used to create a patch at the beginning of a pipeline.
By default, the method get_execution_arguments
already configures execution arguments for SaveTask
, LoadTask
, and CreateEOPatchTask
with the area manager data.
However, in cases where a custom task requires an extra parameter at runtime, it can be provided by updating the get_execution_arguments
method of the Pipeline
class. The method must set the arguments for each task which expects them, for all patches.
```python
def get_execution_arguments(self, workflow, patch_list):
    # start from the default arguments (SaveTask, LoadTask, CreateEOPatchTask)
    exec_args = super().get_execution_arguments(workflow, patch_list)
    nodes = workflow.get_nodes()
    for patch_name, patch_args in exec_args.items():
        for node in nodes:
            if isinstance(node.task, MyCustomTask):
                patch_args[node] = my_custom_kwargs
    return exec_args
```
### Providing Filtering Logic

Filtering logic is an optional part of the pipeline class and provides information on which patches to skip, in case they have already been processed. This is controlled via the `skip_existing` parameter in the pipeline schema.

The filtering logic can be provided with the `filter_patch_list` method and depends very much on what the user's definition of "already processed" is. It could simply mean an existing patch directory in the storage, but it could also depend on whether some expected output is present or not.

Most commonly this boils down to checking for feature presence for all eopatches, and returning a list of patches where some/all features are missing. For this specific case we provide the utility method `get_patches_with_missing_features` under `eogrow.utils.filter`. The utility usage could look along the lines of:
```python
def filter_patch_list(self, patch_list):
    return get_patches_with_missing_features(filesystem, patches_folder, patch_list, features)
```

where the `features` parameter defines the list of features which must be present if the patch is to be skipped.
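To make the idea concrete, here is a stdlib-only sketch of the kind of check such a utility performs (a hypothetical stand-in, not the actual `eogrow.utils.filter` implementation): a patch stays in the processing list if any required feature is missing.

```python
def patches_with_missing_features(existing_features: dict, patch_list: list, required: set) -> list:
    """Return the patches that still need processing, i.e. lack a required feature.

    `existing_features` maps each patch name to the set of features already stored for it.
    """
    return [
        patch for patch in patch_list
        if not required <= existing_features.get(patch, set())
    ]


existing = {"eopatch-0": {"BANDS", "NDVI"}, "eopatch-1": {"BANDS"}}
todo = patches_with_missing_features(
    existing, ["eopatch-0", "eopatch-1", "eopatch-2"], {"BANDS", "NDVI"}
)
# todo == ["eopatch-1", "eopatch-2"]
```

With `skip_existing` enabled, only the patches returned by the filter are executed, so rerunning a pipeline after a partial failure picks up exactly where it left off.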