SynthLite 🌞

🚨 Disclaimer: SynthLite is a work in progress. Expect bugs, fun, and room for improvement.

SynthLite is more than just a tool: it's a spark in our larger SynthArt vision to democratize synthetic data generation for everyone. By combining cutting-edge AI with deep research, SynthLite helps you create reliable, high-quality synthetic datasets in minutes. We believe the future of data solutions should be private, open source, scalable, accessible, and safe, and SynthLite is here to make that future a reality. 🔮

Introduction 💡

SynthLite ⚡️ is a synthetic data generation CLI tool and library written in TypeScript. It’s designed to help you quickly produce high-quality synthetic datasets—perfect for development, testing, or even for product features and experiments. 🥢

💬 Why? Because synthetic data opens new frontiers for experimentation, privacy-friendly testing, and robust model training—helping developers and researchers alike! 😎

Under the hood, synthlite demonstrates the speed and power of various large language models (LLMs), including those from OpenAI, Anthropic, Meta, and Groq, showcasing how seamlessly they can integrate for data generation. ⚙

Partnerships & Future Collaboration 🤝

🚦 synthlite is not affiliated with any of the mentioned organizations and is an independent "hacker" project. However, in this note, I wish to propose future partnerships or collaborations with any or all of OpenAI, Anthropic, Meta, and Groq.

Here at synthlite, we're always on the lookout for meaningful collaborations to take synthetic data generation to the next level. While our current setup already demonstrates the capabilities of various LLMs, we envision broader use cases and accelerated growth through strategic partnerships with:

- OpenAI: Explore how advanced AI models can be utilized effectively across diverse tasks, moving closer to Artificial General Intelligence (AGI) by leveraging existing technologies.
- Anthropic: Investigate the potential of AI models in creating nuanced synthetic data, contributing to the development of safe and reliable AI systems.
- Meta: Examine how Llama 3.x and future Llama variants can seamlessly integrate with synthlite for more sophisticated data generation scenarios.
- Groq: Further explore advanced hardware acceleration and develop cutting-edge benchmarks that highlight how synthlite combined with Groq can enhance synthetic data pipelines.

If you have any leads or are directly affiliated with these organizations (or similar), feel free to reach out! We believe that combining our open-source vision with innovative partners can push synthetic data tools even further—providing the community with faster, safer, and more adaptable ways to generate synthetic datasets. 🚀

💡 It would be amazing to see where our vision and hacker-like execution (which I picked up by studying Meta's culture) could take us. 🐺

Innovative Approaches 💡

In the development of SynthLite, we have employed several innovative approaches and techniques to enhance the synthetic data generation process. These methods draw from various computer science and engineering concepts, ensuring the generation of high-quality, diverse, and realistic synthetic data. Below are some of the key approaches:

Injecting Randomness via Mutation of Duplicates 🌱

To ensure the uniqueness and diversity of the generated data, we inject randomness by mutating duplicates. This involves making minor adjustments to existing data points to create new, unique entries. This technique helps in avoiding repetitive patterns and ensures that the synthetic data remains varied and realistic.
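
The idea above can be sketched as follows. This is a minimal illustration, not SynthLite's actual implementation: the `Row` type, `mutateRow`, and `dedupeByMutation` are hypothetical names, and the mutation rule (nudge a number, append a suffix to a string) is an assumption chosen for brevity.

```typescript
// Hypothetical row shape; SynthLite's real row type may differ.
type Row = Record<string, string | number>;

// Mutate one randomly chosen field: shift numbers, append a short suffix to strings.
function mutateRow(row: Row, rand: () => number): Row {
  const keys = Object.keys(row);
  if (keys.length === 0) return { ...row };
  const key = keys[Math.floor(rand() * keys.length)];
  const value = row[key];
  const mutated =
    typeof value === "number"
      ? value + 1 + Math.floor(rand() * 9) // shift by 1..9
      : `${value}-${Math.floor(rand() * 1e6).toString(36)}`;
  return { ...row, [key]: mutated };
}

// Keep the first occurrence of each row; mutate later duplicates until they are unique.
// Rows are compared by their JSON text, so key order matters in this sketch.
function dedupeByMutation(
  rows: Row[],
  rand: () => number = Math.random,
  maxAttempts = 10
): Row[] {
  const seen = new Set<string>();
  const out: Row[] = [];
  for (const row of rows) {
    let candidate = row;
    let attempts = 0;
    while (seen.has(JSON.stringify(candidate)) && attempts < maxAttempts) {
      candidate = mutateRow(candidate, rand);
      attempts++;
    }
    const key = JSON.stringify(candidate);
    if (!seen.has(key)) {
      // Rows that are still duplicates after maxAttempts are dropped.
      seen.add(key);
      out.push(candidate);
    }
  }
  return out;
}
```

Passing `rand` as a parameter keeps the sketch testable: a fixed function gives reproducible mutations, while the default `Math.random` gives varied ones.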

Mutation Concept and Genetic Programming 🐞

The concept of mutation used in SynthLite is inspired by genetic programming. In genetic programming, mutation is a genetic operator used to maintain genetic diversity within a population of solutions. Similarly, in SynthLite, mutation is applied to duplicate data points to introduce variations and prevent redundancy. This approach ensures that the generated data evolves and adapts, much like in genetic algorithms.

Schema-Driven Data Generation 🥤

SynthLite leverages JSON schemas to define the structure and constraints of the synthetic data. By converting JSON schemas to Zod schemas, we ensure that the generated data adheres to the specified format and validation rules. This schema-driven approach provides flexibility and precision in data generation.
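
To illustrate what schema-driven validation buys you, here is a minimal sketch. SynthLite itself converts JSON Schema to Zod; to stay dependency-free, this example instead hand-rolls a validator for a very small JSON Schema subset (flat objects with `string`, `number`, and `boolean` properties plus `required`). The names `ObjectSchema` and `validateRow` are hypothetical.

```typescript
// Tiny subset of JSON Schema, for illustration only.
type PropertySchema = { type: "string" | "number" | "boolean" };

interface ObjectSchema {
  type: "object";
  properties: Record<string, PropertySchema>;
  required?: string[];
}

// Returns human-readable errors; an empty list means the row is valid.
function validateRow(schema: ObjectSchema, row: unknown): string[] {
  if (typeof row !== "object" || row === null || Array.isArray(row)) {
    return ["row must be an object"];
  }
  const record = row as Record<string, unknown>;
  const errors: string[] = [];
  for (const key of schema.required ?? []) {
    if (!(key in record)) errors.push(`missing required property "${key}"`);
  }
  for (const [key, prop] of Object.entries(schema.properties)) {
    if (key in record && typeof record[key] !== prop.type) {
      errors.push(`property "${key}" should be ${prop.type}`);
    }
  }
  return errors;
}

const userSchema: ObjectSchema = {
  type: "object",
  properties: { name: { type: "string" }, age: { type: "number" } },
  required: ["name"],
};

// Keep only the generated rows that satisfy the schema.
const candidates: unknown[] = [{ name: "Ada", age: 36 }, { age: "unknown" }];
const validRows = candidates.filter(
  (row) => validateRow(userSchema, row).length === 0
);
```

In a real pipeline a schema library handles far more of the specification (nested objects, formats, enums); the point here is only that every generated row is checked against the declared structure before it is kept.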

AI-Powered Data Generation 🪄

SynthLite integrates AI models to generate synthetic data. By providing prompts and schemas to the AI, we harness the power of language models to create realistic and contextually appropriate data points. This AI-driven approach enhances the quality and coherence of the generated data.
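
A rough sketch of this prompt-plus-schema flow is shown below. It does not reflect SynthLite's real prompts or client code: `ModelCall` is a hypothetical stand-in for whichever LLM client you use, and `buildPrompt`, `parseRows`, and `generateRows` are illustrative names.

```typescript
// Hypothetical signature for an LLM client call: prompt in, raw text out.
type ModelCall = (prompt: string) => Promise<string>;

// Combine the requested count, the JSON schema, and optional extra instructions.
function buildPrompt(schema: object, count: number, instructions = ""): string {
  return [
    `Generate ${count} synthetic records as a JSON array.`,
    "Each record must conform to this JSON schema:",
    JSON.stringify(schema, null, 2),
    instructions,
  ]
    .filter((line) => line.length > 0)
    .join("\n");
}

// Parse model output defensively: invalid JSON or a non-array yields no rows.
function parseRows(raw: string): unknown[] {
  try {
    const parsed: unknown = JSON.parse(raw);
    return Array.isArray(parsed) ? parsed : [];
  } catch {
    return [];
  }
}

// Ask the model for rows and return whatever parses cleanly.
async function generateRows(
  callModel: ModelCall,
  schema: object,
  count: number
): Promise<unknown[]> {
  const raw = await callModel(buildPrompt(schema, count));
  return parseRows(raw);
}
```

Injecting `callModel` keeps the sketch independent of any provider and makes it easy to substitute a stub in tests.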

Event-Driven Architecture 🎪

The use of an event-driven architecture with the SynthliteEmitter allows for efficient handling of data generation events. This architecture enables real-time processing and writing of generated data, ensuring a smooth and responsive data generation workflow.
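
The pattern can be sketched with Node's built-in `EventEmitter`. The internals of SynthliteEmitter are not shown in this README, so the class name `DemoEmitter`, the event names `batch` and `done`, and the payload shape are all assumptions made for illustration.

```typescript
import { EventEmitter } from "node:events";

// Hypothetical payload for a "batch" event.
interface BatchEvent {
  batchIndex: number;
  rows: object[];
}

class DemoEmitter extends EventEmitter {}

const emitter = new DemoEmitter();
const written: string[] = [];

// Consumer: "write" each row as soon as its batch arrives
// (an in-memory array of JSON lines stands in for a file).
emitter.on("batch", (event: BatchEvent) => {
  for (const row of event.rows) written.push(JSON.stringify(row));
});
emitter.on("done", () => {
  written.push("# done");
});

// Producer: emit one event per generated batch, then signal completion.
function produce(batches: object[][]): void {
  batches.forEach((rows, batchIndex) => {
    emitter.emit("batch", { batchIndex, rows });
  });
  emitter.emit("done");
}

produce([[{ id: 1 }, { id: 2 }], [{ id: 3 }]]);
```

Because the producer only emits events, the writing logic can change (file, stdout, network) without touching the generation code.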

Performance Optimization ⚡️

To optimize performance, SynthLite processes data in batches and measures the time taken for each batch. This approach helps in identifying bottlenecks and ensures efficient utilization of resources during data generation.
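
A minimal sketch of batching with per-batch timing follows. The function names and the `BatchTiming` shape are hypothetical, and the generator is synchronous to keep the example short; real LLM calls would be asynchronous.

```typescript
interface BatchTiming {
  batchIndex: number;
  size: number;
  elapsedMs: number;
}

// Split a total row count into batch sizes, e.g. 25 rows in batches of 10.
function splitIntoBatches(total: number, batchSize: number): number[] {
  if (batchSize <= 0) throw new Error("batchSize must be positive");
  const sizes: number[] = [];
  for (let remaining = total; remaining > 0; remaining -= batchSize) {
    sizes.push(Math.min(batchSize, remaining));
  }
  return sizes;
}

// Run the generator once per batch and record how long each batch took.
function generateInBatches<T>(
  total: number,
  batchSize: number,
  generateBatch: (size: number) => T[]
): { rows: T[]; timings: BatchTiming[] } {
  const rows: T[] = [];
  const timings: BatchTiming[] = [];
  splitIntoBatches(total, batchSize).forEach((size, batchIndex) => {
    const start = Date.now();
    rows.push(...generateBatch(size));
    timings.push({ batchIndex, size, elapsedMs: Date.now() - start });
  });
  return { rows, timings };
}
```

Comparing `elapsedMs` across batches is a simple way to spot a slow batch, for example one that triggered retries or hit a rate limit.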

By combining these innovative approaches, SynthLite provides a robust and flexible solution for synthetic data generation, catering to a wide range of use cases and ensuring high-quality outputs.

Core Features 🔧

- Optimized Generation: Harness the power of various LLMs for efficient synthetic data generation.
- TypeScript Library & CLI: Use synthlite as a standalone CLI or integrate directly into your projects.
- Schema-Based Datasets: Initialize a dataset with your jsonSchema for structured, valid data every time.
- Flexible Output Formats: Save generated data in JSON or CSV, or just work with it in-memory as a JavaScript object.
- LLM Integration: (Optional) Use the power of models like Llama 3.x to enhance realism and variety in your synthetic data.
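
As an illustration of the JSON and CSV output formats listed above, here is a small serialization sketch. It is not SynthLite's own writer: `FlatRow`, `escapeCsv`, `toCsv`, and `toJson` are hypothetical names, and the CSV part assumes flat rows that all share the keys of the first row.

```typescript
// Hypothetical flat row shape used only for this sketch.
type FlatRow = Record<string, string | number | boolean | null>;

// Quote a field when it contains a comma, quote, or line break; double any quotes.
function escapeCsv(value: unknown): string {
  const text = value === null || value === undefined ? "" : String(value);
  return /[",\n\r]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Serialize rows to CSV, taking the header from the first row's keys.
function toCsv(rows: FlatRow[]): string {
  if (rows.length === 0) return "";
  const headers = Object.keys(rows[0]);
  const lines = [headers.map((h) => escapeCsv(h)).join(",")];
  for (const row of rows) {
    lines.push(headers.map((h) => escapeCsv(row[h])).join(","));
  }
  return lines.join("\n");
}

// Serialize rows to pretty-printed JSON.
function toJson(rows: FlatRow[]): string {
  return JSON.stringify(rows, null, 2);
}
```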

Potential Problem Statements & Research Areas 🔎

- Privacy & Compliance: Generate synthetic datasets that mimic real-world data distributions without exposing sensitive information.
- High-Volume Testing: Rapidly create large datasets for load testing or performance benchmarking.
- AI Model Training: Explore how synthetic data can be used to train or fine-tune AI models while preserving privacy.
- Performance Research: Investigate how hardware acceleration can supercharge the synthetic data generation process.
- Multi-Modal Future: Potential exploration of text, image, or even audio synthetic data using advanced AI models.

Relevance: As the need for large, diverse, and privacy-friendly datasets grows, synthlite aims to deliver a swift, flexible solution that caters to the modern data-driven ecosystem.

How It Works ⚙️

1. Create a Dataset
```
const dataset = new SynthliteDataset(jsonSchema);
```
   This sets up your data structure based on the JSON schema you provide.

2. Generate Data
```
const generatedDataset = dataset.generate({ count: 1000 });
```
   Produces a GeneratedDataset object containing your synthetic samples.

3. Save the Output
```
generatedDataset.save("output.json", "json");
```
   Exports the generated data in JSON or CSV format, whichever you prefer.

All these steps leverage the efficiency of various LLMs and can optionally tap into models like Llama 3.x for enhanced generative capabilities.

Setup Instructions 🔧

You can install SynthLite with your favorite package manager (npm, yarn, or pnpm). Commands for npm and yarn are shown below:

npm

```
npm install synthlite
```

yarn

```
yarn add synthlite
```

Usage

```
npx synthlite <options>
```
Example:
```
npx synthlite -sc schema.json -o output.json -env .env -r 20
```

Prerequisites

- Node.js v16+
- TypeScript 4.x
- Access to relevant AI models

Installation

Clone the repository:
```
git clone <repository-url>
cd synthlite
```
Install dependencies:
```
npm install
```

Build and Run

Build the project:
```
npm run build
```
Use the CLI (example):
```
npm start -- --schema ./mySchema.json --count 1000 --output data.json
```
This will generate 1,000 samples using mySchema.json and save them to data.json.

Usage Instructions 🕵️‍♂️

Library Usage (TypeScript)

```
import { SynthliteDataset, SynthliteGeneratedDataset } from "synthlite";

const jsonSchemaPath = "./schema.json";
const dataset = await SynthliteDataset.fromSchemaFile(jsonSchemaPath);

const generatedDataset: SynthliteGeneratedDataset = dataset.generate({
  count: 500,
});
await generatedDataset.save("output.csv", "csv");
```

CLI Usage
- Basic Command
```
npm start -- --schema ./mySchema.json --count 500
```
  This generates 500 samples and prints them to stdout.
- Save to File
```
npm start -- --schema ./mySchema.json --count 1000 --output data.csv
```
  Exports 1,000 samples to a data.csv file.

Optional Llama 3.x Hook

If you have a Llama 3.x model integrated, you can configure your dataset to add advanced generative power to your fields. See our docs for usage examples (if available).

Examples 📊

Generating a Simple JSON File

```
npm start -- --schema ./mySchema.json --count 50 --output myData.json
```

Contributors 💖

⚡️ synthlite is a product of AdiPat Labs.

Aditya Patange (Founder, AdiPat Labs)

We welcome contributions! Feel free to open issues, fork the repo, and submit pull requests.

License 📜

This project is licensed under the AGPL v3. See the LICENSE file for details.

✨ "It's not fake data dude, it's 'synthetic' data. 🥼" — Oen
