Cloud Training Quickstart
Welcome to the Cloud Quickstart guide for Simplifine’s Train Engine. This guide will help you get up and running quickly with training models in the cloud using Distributed Data Parallel (DDP).
Prerequisites
Before you begin, ensure you have the following:
- A Simplifine API key.
- Access to a cloud GPU (L4 or A100) supported by Simplifine.
To obtain a Simplifine API key with free credit, express interest here.
Setting Up the Client
The first step is to initialize the Client class with your API key and GPU type. This client will handle communication with the Simplifine servers.
from simplifine_alpha.train_utils import Client
# Initialize the client with your API key and GPU type
api_key = '' # Enter your Simplifine API key here
gpu_type = 'a100' # 'l4' or 'a100'
client = Client(api_key=api_key, gpu_type=gpu_type)
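Rather than hard-coding the key in your script, you may prefer to read it from an environment variable. A minimal sketch (the SIMPLIFINE_API_KEY name below is a convention chosen for this example, not something the library requires):

```python
import os

def load_api_key(env_var='SIMPLIFINE_API_KEY'):
    """Fetch the API key from the environment; returns '' if the variable is unset."""
    return os.environ.get(env_var, '')

api_key = load_api_key()
```

This keeps the key out of version control; the rest of the setup is unchanged.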
Replace the empty api_key string with your actual Simplifine API key, and set gpu_type to either 'l4' or 'a100'.
Training a Model with Distributed Data Parallel (DDP)
The example below shows how to use DDP to distribute the training process across multiple GPUs.
Step 1: Define Your Training Job with DDP
You can train a model using the sft_train_cloud method and enable DDP by setting the use_ddp parameter to True.
client.sft_train_cloud(
    job_name='ddp_job',
    model_name='EleutherAI/gpt-neo-125M',
    dataset_name='my_dataset',
    data_from_hf=True,
    keys=['title', 'abstract', 'explanation'],
    data={'title': ['title 1', 'title 2'], 'abstract': ['abstract 1', 'abstract 2'], 'explanation': ['explanation 1', 'explanation 2']},
    template='### TITLE: {title}\n ### ABSTRACT: {abstract}\n ###EXPLANATION: {explanation}',
    response_template='###EXPLANATION:',
    use_zero=False,
    use_ddp=True
)
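Before launching a job, it can help to preview how the template will render a row of your data. This is plain Python string formatting that mirrors the template and fields above; it is not part of the Simplifine API:

```python
template = '### TITLE: {title}\n ### ABSTRACT: {abstract}\n ###EXPLANATION: {explanation}'
row = {'title': 'title 1', 'abstract': 'abstract 1', 'explanation': 'explanation 1'}

# str.format fills each {field} placeholder from the row dict,
# producing the exact string the trainer will see for this example
formatted = template.format(**row)
print(formatted)
```

Checking a rendered example like this catches template typos (a misspelled key raises a KeyError) before you spend cloud credits.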
Step 2: Monitor Your Jobs
After sending the query, you can check the status of your jobs. The status will be one of the following: completed, in progress, or pending.
status = client.get_all_jobs()
for num, i in enumerate(status[-5:]):
    print(f'Job {num}: {i}')
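If you queue many jobs, a small helper can tally them by state. The sketch below assumes, as in the loop above, that get_all_jobs returns a list of dicts with a 'status' field; the example payload is hypothetical:

```python
from collections import Counter

def summarize_jobs(jobs):
    """Count jobs by status (e.g. completed / in progress / pending)."""
    return Counter(job['status'] for job in jobs)

# Hypothetical payload shaped like the status list above
example = [{'job_id': 'a', 'status': 'completed'},
           {'job_id': 'b', 'status': 'pending'},
           {'job_id': 'c', 'status': 'completed'}]
print(summarize_jobs(example))
```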
Step 3: Retrieve Training Logs
You can retrieve the logs for any job to check detailed information about the training process.
job_id = status[-1]['job_id']
logs = client.get_train_logs(job_id)
print(logs['response'])
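Training logs can get long; a small helper that keeps only the last few lines makes them easier to scan. This is plain string handling, independent of the Simplifine API:

```python
def tail_lines(text, n=5):
    """Return the last n lines of a log string."""
    return '\n'.join(text.splitlines()[-n:])

# Toy log string standing in for logs['response']
sample_log = 'step 1\nstep 2\nstep 3\nstep 4\nstep 5\nstep 6'
print(tail_lines(sample_log, n=3))
```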
Step 4: Downloading the Trained Model
Once your model has finished training, you can download it using the download_model function.
import os
# Create the target folder (exist_ok avoids an error on re-runs)
os.makedirs('/content/sf_trained_model', exist_ok=True)
# Download and save the model
client.download_model(job_id=job_id, extract_to='/content/sf_trained_model')
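After the download finishes, a quick sanity check that model files actually landed in the target folder can save a confusing failure at load time. A minimal sketch using only the standard library:

```python
import os

def list_model_files(path):
    """Return sorted filenames in the extracted model directory, or [] if it is missing."""
    if not os.path.isdir(path):
        return []
    return sorted(os.listdir(path))
```

For a Hugging Face checkpoint you would expect entries such as config.json and the tokenizer files.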
Step 5: Loading and Using the Trained Model
Finally, you can load the trained model and tokenizer to generate text.
from transformers import AutoModelForCausalLM, AutoTokenizer
path = '/content/sf_trained_model'
sf_model = AutoModelForCausalLM.from_pretrained(path)
sf_tokenizer = AutoTokenizer.from_pretrained(path)
input_example = '''### TITLE: title 1\n ### ABSTRACT: abstract 1\n ###EXPLANATION: '''
input_example = sf_tokenizer(input_example, return_tensors='pt')
output = sf_model.generate(
    input_example['input_ids'],
    attention_mask=input_example['attention_mask'],
    max_length=30,
    eos_token_id=sf_tokenizer.eos_token_id,
    # pad_token_id is set explicitly because GPT-Neo has no pad token by default
    pad_token_id=sf_tokenizer.eos_token_id
)
print(sf_tokenizer.decode(output[0]))
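Because decode returns the prompt together with the continuation, you can isolate just the generated explanation by splitting on the response template. A sketch in plain Python (the decoded_example string is illustrative):

```python
def extract_response(decoded, response_template='###EXPLANATION:'):
    """Return only the text generated after the response template."""
    # partition splits on the first occurrence; the tail is the model's answer
    _, _, answer = decoded.partition(response_template)
    return answer.strip()

decoded_example = '### TITLE: title 1\n ### ABSTRACT: abstract 1\n ###EXPLANATION: an example answer'
print(extract_response(decoded_example))
```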
This quickstart guide covered setting up the client, training a model with DDP, monitoring jobs, downloading the trained model, and running inference with it.