Cloud Training Quickstart
Welcome to the Cloud Quickstart guide for Simplifine’s Train Engine. This guide will help you get up and running quickly with training models in the cloud using Distributed Data Parallel (DDP).
Prerequisites
Before you begin, ensure you have the following:
- A Simplifine API key.
- Access to a cloud GPU (L4 or A100) supported by Simplifine.
To obtain a Simplifine API key with free credit, express interest here.
Setting Up the Client
The first step is to initialize the Client class with your API key and GPU type. This client will handle communication with the Simplifine servers.
from simplifine_alpha.train_utils import Client
# Initialize the client with your API key and GPU type
api_key = '' # Enter your Simplifine API key here
gpu_type = 'a100' # 'l4' or 'a100'
client = Client(api_key=api_key, gpu_type=gpu_type)
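Rather than hard-coding the key in your script, you may prefer to read it from an environment variable. A minimal sketch (the SIMPLIFINE_API_KEY name below is a convention chosen for this example, not something the library requires):

```python
import os

def load_api_key(env_var='SIMPLIFINE_API_KEY'):
    """Fetch the API key from the environment; returns '' if the variable is unset."""
    return os.environ.get(env_var, '')

api_key = load_api_key()
```

This keeps the key out of version control; the rest of the setup is unchanged.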
Replace the empty api_key string with your actual Simplifine API key, and set gpu_type to either 'l4' or 'a100'.
Training a Model with Distributed Data Parallel (DDP)
The example below shows how to use DDP to distribute the training process across multiple GPUs.
Step 1: Define Your Training Job with DDP
You can train a model using the sft_train_cloud method and enable DDP by setting the use_ddp parameter to True.
client.sft_train_cloud(
    job_name='ddp_job',
    model_name='EleutherAI/gpt-neo-125M',
    dataset_name='my_dataset',
    data_from_hf=True,
    keys=['title', 'abstract', 'explanation'],
    data={'title': ['title 1', 'title 2'], 'abstract': ['abstract 1', 'abstract 2'], 'explanation': ['explanation 1', 'explanation 2']},
    template='### TITLE: {title}\n ### ABSTRACT: {abstract}\n ###EXPLANATION: {explanation}',
    response_template='###EXPLANATION:',
    use_zero=False,
    use_ddp=True
)
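Before launching a job, it can help to preview how the template will render a row of your data. This is plain Python string formatting that mirrors the template and fields above; it is not part of the Simplifine API:

```python
template = '### TITLE: {title}\n ### ABSTRACT: {abstract}\n ###EXPLANATION: {explanation}'
row = {'title': 'title 1', 'abstract': 'abstract 1', 'explanation': 'explanation 1'}

# str.format fills each {field} placeholder from the row dict,
# producing the exact string the trainer will see for this example
formatted = template.format(**row)
print(formatted)
```

Checking a rendered example like this catches template typos (a misspelled key raises a KeyError) before you spend cloud credits.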
Step 2: Monitor Your Jobs
After sending the query, you can check the status of your jobs. The status will be one of the following: completed, in progress, or pending.
status = client.get_all_jobs()
for num, i in enumerate(status[-5:]):
    print(f'Job {num}: {i}')
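If you queue many jobs, a small helper can tally them by state. The sketch below assumes, as in the loop above, that get_all_jobs returns a list of dicts with a 'status' field; the example payload is hypothetical:

```python
from collections import Counter

def summarize_jobs(jobs):
    """Count jobs by status (e.g. completed / in progress / pending)."""
    return Counter(job['status'] for job in jobs)

# Hypothetical payload shaped like the status list above
example = [{'job_id': 'a', 'status': 'completed'},
           {'job_id': 'b', 'status': 'pending'},
           {'job_id': 'c', 'status': 'completed'}]
print(summarize_jobs(example))
```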
Step 3: Retrieve Training Logs
You can retrieve the logs for any job to check detailed information about the training process.
job_id = status[-1]['job_id']
logs = client.get_train_logs(job_id)
print(logs['response'])
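Training logs can get long; a small helper that keeps only the last few lines makes them easier to scan. This is plain string handling, independent of the Simplifine API:

```python
def tail_lines(text, n=5):
    """Return the last n lines of a log string."""
    return '\n'.join(text.splitlines()[-n:])

# Toy log string standing in for logs['response']
sample_log = 'step 1\nstep 2\nstep 3\nstep 4\nstep 5\nstep 6'
print(tail_lines(sample_log, n=3))
```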
Step 4: Downloading the Trained Model
Once your model has finished training, you can download it using the download_model function.
import os
# Create the target folder (exist_ok avoids an error on re-runs)
os.makedirs('/content/sf_trained_model', exist_ok=True)
# Download and save the model
client.download_model(job_id=job_id, extract_to='/content/sf_trained_model')
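After the download finishes, a quick sanity check that model files actually landed in the target folder can save a confusing failure at load time. A minimal sketch using only the standard library:

```python
import os

def list_model_files(path):
    """Return sorted filenames in the extracted model directory, or [] if it is missing."""
    if not os.path.isdir(path):
        return []
    return sorted(os.listdir(path))
```

For a Hugging Face checkpoint you would expect entries such as config.json and the tokenizer files.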
Step 5: Loading and Using the Trained Model
Finally, you can load the trained model and tokenizer to generate text.
from transformers import AutoModelForCausalLM, AutoTokenizer
path = '/content/sf_trained_model'
sf_model = AutoModelForCausalLM.from_pretrained(path)
sf_tokenizer = AutoTokenizer.from_pretrained(path)
input_example = '''### TITLE: title 1\n ### ABSTRACT: abstract 1\n ###EXPLANATION: '''
input_example = sf_tokenizer(input_example, return_tensors='pt')
output = sf_model.generate(
    input_example['input_ids'],
    attention_mask=input_example['attention_mask'],
    max_length=30,
    eos_token_id=sf_tokenizer.eos_token_id,
    # pad_token_id is set explicitly because GPT-Neo has no pad token by default
    pad_token_id=sf_tokenizer.eos_token_id
)
print(sf_tokenizer.decode(output[0]))
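Because decode returns the prompt together with the continuation, you can isolate just the generated explanation by splitting on the response template. A sketch in plain Python (the decoded_example string is illustrative):

```python
def extract_response(decoded, response_template='###EXPLANATION:'):
    """Return only the text generated after the response template."""
    # partition splits on the first occurrence; the tail is the model's answer
    _, _, answer = decoded.partition(response_template)
    return answer.strip()

decoded_example = '### TITLE: title 1\n ### ABSTRACT: abstract 1\n ###EXPLANATION: an example answer'
print(extract_response(decoded_example))
```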
This quickstart guide covered setting up the client, training a model with DDP, monitoring jobs, downloading the trained model, and running inference with it.