Large-scale distributed data has become the foundation for analytics and informed decision-making processes in most businesses. A large amount of this data is also utilized for predictive modeling and building machine learning models.
There has been a rise in the number and variety of hyperscaler platforms providing machine learning and modeling capabilities alongside data storage and processing. Businesses that already use these platforms for data storage can now seamlessly use them to train and deploy machine learning models as well.
Training machine learning models on most of these platforms is considerably smoother if the training data resides in their respective platform-native data stores. This creates a new challenge: these capabilities are tightly coupled to the native data storage, and extracting and migrating data from one data source to another is both expensive and time-consuming.
SAP Federated-ML or FedML is a library built to address this issue. The library applies the Data Federation architecture of SAP Data Warehouse Cloud and provides functions that enable businesses and data scientists to build, train and deploy machine learning models on hyperscalers, thereby eliminating the need for replicating or migrating data out from its original source.
By abstracting the data connection, data load, and model training on these hyperscalers, the FedML library provides end-to-end integration with just a few lines of code.
This blog post will focus on training a machine learning model on Amazon SageMaker with data from Google BigQuery:
Note: This post assumes that the training data is already present in BigQuery and accessible through SAP Data Warehouse Cloud. Refer to this blog post for steps on how to integrate BigQuery with SAP DWC.
1. Create an Amazon SageMaker notebook instance
Follow Step 1 of this guide to create a notebook instance on SageMaker, create an IAM role, and add the required permissions.
2. Download Federated-ML for AWS
Download the library using the link below. It will be downloaded as a .whl file to your local system.
3. Install the library in the notebook instance
Upload the .whl file to your SageMaker notebook instance and install it with pip (for example, pip install fedml_aws-<version>-py3-none-any.whl) so that the fedml_aws imports in the next step resolve.
4. Use the following imports to utilize library functionalities
from fedml_aws import DbConnection
from fedml_aws import DwcSagemaker
import pandas as pd
5. Read BigQuery data from SAP DWC and load it into SageMaker notebook
db = DbConnection()
# The query should fetch only the data needed to train the model;
# extracting and loading the entire view is not required.
train_data = db.execute_query('<your_query_to_fetch_train_data>')
# Wrap the result rows in a DataFrame; pass columns=[...] if your
# result set comes back as plain tuples without column names.
train_data = pd.DataFrame(train_data)
Fetch and load only the rows and features needed to train the model into the SageMaker notebook.
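For example, the row and column filtering can be pushed down into the query itself rather than done in the notebook. The view and column names below are hypothetical placeholders:

```python
# Select only the features and the label needed for training, and filter
# rows in the source system instead of in the notebook (names illustrative).
query = (
    "SELECT feature_1, feature_2, label "
    "FROM MY_DWC_VIEW "
    "WHERE event_date >= '2021-01-01'"
)
# This query string would then be passed to db.execute_query(query).
```

Pushing the filter down this way keeps the data transferred into SageMaker to a minimum.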
6. Train a scikit-learn model on SageMaker using the extracted data
Details about the training script (train_script) and some example notebooks with their corresponding training scripts can be found here.
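As a rough sketch of how this step can look, the snippet below hands the DataFrame from step 5 to SageMaker through FedML. The DwcSagemaker parameter names shown (prefix, bucket_name, train_script, instance_count, instance_type) are assumptions based on typical FedML examples, not a definitive API reference; consult the linked notebooks for the exact signature:

```python
def train_sklearn_on_sagemaker(train_data):
    """Illustrative sketch: train a scikit-learn model on SageMaker via FedML.

    Parameter names are assumptions drawn from typical FedML examples;
    check the example notebooks for the library's current API.
    """
    # Imported inside the function so this sketch loads even where
    # fedml_aws and AWS credentials are not set up.
    from fedml_aws import DwcSagemaker

    dwcs = DwcSagemaker(prefix='<your_s3_prefix>',
                        bucket_name='<your_s3_bucket>')
    # train_script is a standard SageMaker scikit-learn entry-point script.
    clf = dwcs.train_sklearn_model(train_data,
                                   train_script='<your_train_script>.py',
                                   instance_count=1,
                                   instance_type='ml.c4.xlarge')
    return clf
```

Running this kicks off a SageMaker training job on the data fetched from BigQuery through SAP DWC, without the data ever being replicated into an AWS-native store first.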
FedML makes it extremely convenient for data scientists and developers to perform cross-platform ETL and train machine learning models on hyperscalers, without the hassle of data replication and migration.
If you're interested in learning how FedML works on other hyperscalers, refer to these blog posts: