Technology Blogs by SAP
jackseeburger

Background


Large-scale distributed data has become the foundation for analytics and informed decision-making processes in most businesses. A large amount of this data is also utilized for predictive modeling and building machine learning models.

There has been a rise in the number and variety of hyperscaler platforms providing machine learning and modeling capabilities, along with data storage and processing. Businesses that use these platforms for data storage can now seamlessly utilize them for efficient training and deployment of machine learning models.

Training machine learning models on most of these platforms is smoother when the training data resides in the platform's native data store. This tight coupling between the training features and the native data storage creates a new challenge: extracting and migrating data from one data source to another is both expensive and time-consuming.

Proposed Solution


SAP Federated-ML, or FedML, is a library built to address this issue. The library applies the data federation architecture of SAP Data Warehouse Cloud and provides functions that enable businesses and data scientists to build, train and deploy machine learning models on hyperscalers, eliminating the need to replicate or migrate data out of its original source.

By abstracting the data connection, data load and model training on these hyperscalers, the FedML library provides end-to-end integration with just a few lines of code.


FedML Architecture



Training a Model on Vertex AI with FedMLGCP


In this article, we focus on building a machine learning model on Google Cloud Platform Vertex AI by federating the training data from Amazon Athena via SAP Data Warehouse Cloud, without replicating or moving the data from its original data store.

  1. Set up your environment

    1. Follow this guide to create a new Vertex AI notebook instance

    2. Create a Cloud Storage bucket to store your training artifacts

    3. Make your AWS data accessible through SAP Data Warehouse Cloud. Here is an example of how to connect Athena to SAP Data Warehouse Cloud.



  2. Download FedML GCP Library

    1. FedML GCP Library

    2. Upload the package to your Cloud Storage Bucket to be used later for training



  3. Install the library in your notebook instance with the following command


pip install fedml_gcp-1.0.0-py3-none-any.whl --force-reinstall


  4. Load the libraries with the following imports


from fedml_gcp import DwcGCP
import numpy as np
import pandas as pd


  5. Create a new DwcGCP class instance with the following (replace the project name and bucket name)


dwc = DwcGCP(project_name='your-project-name', bucket_name='your-bucket-name')


  6. Make a tar bundle of your training script files

    1. You can find example training script files here

      1. Open a folder and drill down to the trainer folder (which contains the scripts)



    2. More information about the GCP training application structure is available here




dwc.make_tar_bundle('your_training_app_name.tar.gz', 'Path_of_Training_Folder', 'gcp_bucket_path/training/')
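
To get a feel for what such a bundle contains, here is a minimal sketch that uses only Python's standard `tarfile` module. It is not the FedML implementation of `make_tar_bundle`: the helper name `bundle_training_folder`, the `trainer` folder layout and the file contents are assumptions made for illustration, and the upload to Cloud Storage is left out.

```python
import tarfile
import tempfile
from pathlib import Path


def bundle_training_folder(training_dir, archive_path):
    """Pack training_dir into a gzip-compressed tar file and return its member names."""
    source = Path(training_dir)
    with tarfile.open(archive_path, "w:gz") as archive:
        # arcname stores paths relative to the folder name rather than absolute paths
        archive.add(source, arcname=source.name)
    with tarfile.open(archive_path, "r:gz") as archive:
        return sorted(archive.getnames())


# Hypothetical layout: a "trainer" package folder that holds the training scripts
workdir = Path(tempfile.mkdtemp())
trainer = workdir / "trainer"
trainer.mkdir()
(trainer / "__init__.py").write_text("")
(trainer / "task.py").write_text("print('training entry point')\n")

members = bundle_training_folder(trainer, workdir / "your_training_app_name.tar.gz")
print(members)
```

Listing the archive members this way is a quick check that the `trainer` folder, and not its parent directory, sits at the top level of the bundle.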


  7. Create your training inputs

    1. More info about training inputs can be found here

    2. In package URI make sure to include the path to the FedML GCP whl file you uploaded in step 2

    3. Replace ‘DATA_VIEW’ with the name of the Athena table you exposed in DWC




training_inputs = {
    'scaleTier': 'BASIC',
    # Training app bundle from the previous step plus the FedML GCP wheel uploaded in step 2
    'packageUris': ['gs://gcp_bucket_path/training/your_training_app_name.tar.gz', 'gs://gcp_bucket_path/fedml_gcp-1.0.0-py3-none-any.whl'],
    'pythonModule': 'trainer.task',
    # Replace DATA_VIEW with the name of the Athena table exposed in DWC
    'args': ['--table_name', 'DATA_VIEW', '--table_size', '1', '--bucket_name', 'fedml-bucket'],
    'region': 'us-east1',
    'jobDir': 'gs://gcp_bucket_path',
    'runtimeVersion': '2.5',
    'pythonVersion': '3.7',
    'scheduling': {'maxWaitTime': '3600s', 'maxRunningTime': '7200s'}
}
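
The `args` list in the training inputs is handed to the module named in `pythonModule` as command-line arguments. Below is a minimal sketch of how a `trainer/task.py` could read them with `argparse`. The flag names are taken from the training inputs above; the helper name `parse_training_args`, the defaults and the help texts are assumptions, and the actual example scripts may parse them differently.

```python
import argparse


def parse_training_args(argv=None):
    """Parse the flags listed under 'args' in the training inputs (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Hypothetical training entry point")
    parser.add_argument("--table_name", required=True,
                        help="Name of the view or table exposed in DWC")
    # Kept as a string here: its meaning is defined by the training script itself
    parser.add_argument("--table_size", default="1",
                        help="Size setting for the data load")
    parser.add_argument("--bucket_name", required=True,
                        help="Cloud Storage bucket for training artifacts")
    # parse_known_args tolerates extra flags the service may append, such as a job directory
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_training_args(
    ["--table_name", "DATA_VIEW", "--table_size", "1", "--bucket_name", "fedml-bucket"]
)
print(args.table_name, args.table_size, args.bucket_name)
```

Running the parser locally with the same list you plan to submit is a cheap way to catch a misspelled flag before the job is queued.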


  8. Submit your training job (note that each job must have a unique ID)


dwc.train_model('your_training_job_id', training_inputs)
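
Since every job needs its own ID, one common approach is to append a timestamp and a short random suffix to a readable prefix. The sketch below assumes that job IDs should contain only letters, digits and underscores and should start with a letter; the helper name `make_job_id` and the prefix are hypothetical and not part of FedML.

```python
import re
import uuid
from datetime import datetime, timezone


def make_job_id(prefix="fedml_training"):
    """Build a job ID from a prefix, a UTC timestamp and a random suffix."""
    # Replace every character that is not a letter, digit or underscore
    safe_prefix = re.sub(r"[^A-Za-z0-9_]", "_", prefix)
    # Assumption: the ID should start with a letter
    if not safe_prefix or not safe_prefix[0].isalpha():
        safe_prefix = "job_" + safe_prefix
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
    suffix = uuid.uuid4().hex[:8]
    return f"{safe_prefix}_{timestamp}_{suffix}"


job_id = make_job_id("your-training-job")
print(job_id)
```

The resulting string could then be passed as the first argument to `dwc.train_model` in place of a hand-written ID.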

In summary, FedML makes it convenient for data scientists and developers to work with data across platforms and train machine learning models on hyperscalers without the overhead of data replication and migration.

 

For more information about this topic or to ask a question, please leave a comment below or contact us at ci_sce@sap.com.