🚨 Disclaimer: SynthLite is a work in progress. Expect bugs, fun, and room for improvement.
SynthLite is more than just a tool: it's a spark in our larger SynthArt vision to democratize synthetic data generation for everyone. By combining cutting-edge AI with deep research, SynthLite empowers you to create reliable, high-quality synthetic datasets in minutes. We believe the future of data solutions should be private, open source, scalable, safe, and accessible, and SynthLite is here to make that future a reality. 🔮
SynthLite ⚡️ is a synthetic data generation CLI tool and library written in TypeScript. It’s designed to help you quickly produce high-quality synthetic datasets—perfect for development, testing, or even for product features and experiments. 🥢
💬 Why? Because synthetic data opens new frontiers for experimentation, privacy-friendly testing, and robust model training—helping developers and researchers alike! 😎
Under the hood, synthlite demonstrates the speed and power of various large language models (LLMs), including those from OpenAI, Anthropic, Meta, and Groq, showcasing how seamlessly they can integrate for data generation. ⚙
🚦 synthlite is not affiliated with any of the mentioned organizations and is an independent "hacker" project. However, in this note, I wish to propose future partnerships or collaborations with any or all of OpenAI, Anthropic, Meta, and Groq.
Here at synthlite, we're always on the lookout for meaningful collaborations to take synthetic data generation to the next level. While our current setup already demonstrates the capabilities of various LLMs, we envision broader use cases and accelerated growth through strategic partnerships with:
- OpenAI: Explore how advanced AI models can be utilized effectively across diverse tasks, moving closer to Artificial General Intelligence (AGI) by leveraging existing technologies.
- Anthropic: Investigate the potential of AI models in creating nuanced synthetic data, contributing to the development of safe and reliable AI systems.
- Meta: Examine how Llama 3.x and future Llama variants can seamlessly integrate with synthlite for more sophisticated data generation scenarios.
- Groq: Further explore advanced hardware acceleration and develop cutting-edge benchmarks that highlight how synthlite combined with Groq can enhance synthetic data pipelines.
If you have any leads or are directly affiliated with these organizations (or similar), feel free to reach out! We believe that combining our open-source vision with innovative partners can push synthetic data tools even further—providing the community with faster, safer, and more adaptable ways to generate synthetic datasets. 🚀
💡 It would be amazing to see where our vision and hacker-like execution (which I picked up by studying Meta's culture) could take us. 🐺
In the development of SynthLite, we have employed several innovative approaches and techniques to enhance the synthetic data generation process. These methods draw from various computer science and engineering concepts, ensuring the generation of high-quality, diverse, and realistic synthetic data. Below are some of the key approaches:
To ensure the uniqueness and diversity of the generated data, we inject randomness by mutating duplicates. This involves making minor adjustments to existing data points to create new, unique entries. This technique helps in avoiding repetitive patterns and ensures that the synthetic data remains varied and realistic.
The concept of mutation used in SynthLite is inspired by genetic programming. In genetic programming, mutation is a genetic operator used to maintain genetic diversity within a population of solutions. Similarly, in SynthLite, mutation is applied to duplicate data points to introduce variations and prevent redundancy. This approach ensures that the generated data evolves and adapts, much like in genetic algorithms.
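For illustration, here is one way such a mutation pass could look in TypeScript. This is a minimal sketch only: the `mutateDuplicates` helper, the numeric nudge, and the random string suffix are assumptions for illustration, not SynthLite's actual internals.

```typescript
type Row = { [key: string]: string | number | boolean };

// Sketch: detect exact duplicates via a stable serialization, then perturb the copies.
function mutateDuplicates(records: Row[]): Row[] {
  const seen = new Set<string>();
  return records.map((record) => {
    const key = JSON.stringify(record);
    if (!seen.has(key)) {
      seen.add(key);
      return record;
    }
    // Duplicate found: apply a small, field-level mutation so the row stays
    // plausible but is no longer an exact copy.
    const mutated: Row = { ...record };
    for (const [field, value] of Object.entries(mutated)) {
      if (typeof value === "number") {
        // Nudge numeric fields by up to roughly ±5%.
        mutated[field] = Math.round(value * (1 + (Math.random() - 0.5) * 0.1));
        break;
      }
      if (typeof value === "string") {
        // Append a short random suffix to string fields.
        mutated[field] = `${value}-${Math.random().toString(36).slice(2, 6)}`;
        break;
      }
    }
    return mutated;
  });
}
```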
SynthLite leverages JSON schemas to define the structure and constraints of the synthetic data. By converting JSON schemas to Zod schemas, we ensure that the generated data adheres to the specified format and validation rules. This schema-driven approach provides flexibility and precision in data generation.
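As a rough illustration of the schema-driven idea, the snippet below hand-writes a Zod schema equivalent to a small JSON schema and validates generated records against it. The field names and the `validateRecord` helper are assumptions; the actual JSON-schema-to-Zod conversion inside SynthLite may work differently.

```typescript
import { z } from "zod";

// A JSON schema like:
//   { "type": "object",
//     "properties": { "name": { "type": "string" },
//                      "age": { "type": "integer", "minimum": 0 } },
//     "required": ["name", "age"] }
// maps onto an equivalent Zod schema:
const userSchema = z.object({
  name: z.string(),
  age: z.number().int().min(0),
});

// Every generated record is validated before it is accepted into the dataset.
function validateRecord(record: unknown) {
  const result = userSchema.safeParse(record);
  if (!result.success) {
    throw new Error(`Invalid synthetic record: ${result.error.message}`);
  }
  return result.data;
}
```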
SynthLite integrates AI models to generate synthetic data. By providing prompts and schemas to the AI, we harness the power of language models to create realistic and contextually appropriate data points. This AI-driven approach enhances the quality and coherence of the generated data.
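The sketch below illustrates the general pattern of prompting a model with the schema and parsing its JSON response. The `callLLM` helper is a placeholder for whichever provider SDK you configure; none of these names are taken from SynthLite's API.

```typescript
// Hypothetical helper: wire this up to your provider SDK (OpenAI, Anthropic, Groq, ...).
async function callLLM(prompt: string): Promise<string> {
  throw new Error("connect a provider SDK here");
}

// Ask the model for records that conform to the JSON schema, as strict JSON.
async function generateBatch(jsonSchema: object, count: number): Promise<unknown[]> {
  const prompt = [
    `Generate ${count} JSON objects that conform to this JSON Schema:`,
    JSON.stringify(jsonSchema, null, 2),
    "Respond with a JSON array only, no prose.",
  ].join("\n");

  const raw = await callLLM(prompt);
  return JSON.parse(raw) as unknown[];
}
```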
The use of an event-driven architecture with the SynthliteEmitter allows for efficient handling of data generation events. This architecture enables real-time processing and writing of generated data, ensuring a smooth and responsive data generation workflow.
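To illustrate the pattern (not SynthLite's actual `SynthliteEmitter` API), the sketch below uses Node's built-in `EventEmitter` with assumed event names to stream records to disk as they are generated.

```typescript
import { EventEmitter } from "node:events";
import { appendFileSync } from "node:fs";

// Assumed event names ("record", "batch:done") for illustration only.
const emitter = new EventEmitter();

// The writer reacts to each generated record as soon as it is emitted,
// so output streams to disk instead of waiting for the full run to finish.
emitter.on("record", (record: object) => {
  appendFileSync("output.jsonl", JSON.stringify(record) + "\n");
});

emitter.on("batch:done", (batchIndex: number) => {
  console.log(`Batch ${batchIndex} written`);
});

// The generator only emits events; it knows nothing about the writer.
function emitBatch(records: object[], batchIndex: number) {
  for (const record of records) emitter.emit("record", record);
  emitter.emit("batch:done", batchIndex);
}
```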
To optimize performance, SynthLite processes data in batches and measures the time taken for each batch. This approach helps in identifying bottlenecks and ensures efficient utilization of resources during data generation.
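A minimal sketch of this batching-with-timing idea follows; the batch size, the `generateInBatches` signature, and the placeholder generator are assumptions for illustration only.

```typescript
import { performance } from "node:perf_hooks";

// Illustrative batching loop: generate records in fixed-size chunks and time each chunk.
async function generateInBatches(total: number, batchSize = 100) {
  const results: unknown[] = [];
  for (let offset = 0; offset < total; offset += batchSize) {
    const count = Math.min(batchSize, total - offset);
    const start = performance.now();

    const batch = await generateBatch(count);
    results.push(...batch);

    const elapsedMs = performance.now() - start;
    console.log(`Batch of ${count} records took ${elapsedMs.toFixed(1)} ms`);
  }
  return results;
}

// Placeholder generator so the sketch is self-contained.
async function generateBatch(count: number): Promise<unknown[]> {
  return Array.from({ length: count }, (_, i) => ({ id: i }));
}
```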
By combining these innovative approaches, SynthLite provides a robust and flexible solution for synthetic data generation, catering to a wide range of use cases and ensuring high-quality outputs.
- Introduction
- Core Features
- Potential Problem Statements & Research Areas
- How It Works
- Setup Instructions
- Usage Instructions
- Examples
- Contributors
- License
- Optimized Generation: Harness the power of various LLMs for efficient synthetic data generation.
- TypeScript Library & CLI: Use synthlite as a standalone CLI or integrate it directly into your projects.
- Schema-Based Datasets: Initialize a dataset with your `jsonSchema` for structured, valid data every time.
- Flexible Output Formats: Save generated data in JSON or CSV, or work with it in-memory as a JavaScript object.
- LLM Integration: (Optional) Use the power of models like Llama 3.x to enhance realism and variety in your synthetic data.
- Privacy & Compliance: Generate synthetic datasets that mimic real-world data distributions without exposing sensitive information.
- High-Volume Testing: Rapidly create large datasets for load testing or performance benchmarking.
- AI Model Training: Explore how synthetic data can be used to train or fine-tune AI models while preserving privacy.
- Performance Research: Investigate how hardware acceleration can supercharge the synthetic data generation process.
- Multi-Modal Future: Potential exploration of text, image, or even audio synthetic data using advanced AI models.
Relevance: As the need for large, diverse, and privacy-friendly datasets grows, synthlite aims to deliver a swift, flexible solution that caters to the modern data-driven ecosystem.
- Create a Dataset
  `const dataset = new SynthliteDataset(jsonSchema);`
  This sets up your data structure based on the JSON schema you provide.
- Generate Data
  `const generatedDataset = dataset.generate({ count: 1000 });`
  Produces a `GeneratedDataset` object containing your synthetic samples.
- Save the Output
  `generatedDataset.save("output.json", "json");`
  Exports the generated data in JSON or CSV format, whichever you prefer.
All these steps leverage the efficiency of various LLMs and can optionally tap into models like Llama 3.x for enhanced generative capabilities.
You can install SynthLite using your favorite package manager—npm, yarn, or pnpm. Just pick one of the commands below:
npm install synthlite
yarn add synthlite
pnpm add synthlite
npx synthlite <options>
Example: npx synthlite -sc schema.json -o output.json -env .env -r 20
- Node.js v16+
- TypeScript 4.x
- Access to relevant AI models.
- Clone the repository:
  `git clone <repository-url>`
  `cd synthlite`
- Install dependencies:
  `npm install`
- Build the project:
  `npm run build`
- Use the CLI (example):
  `npm start -- --schema ./mySchema.json --count 1000 --output data.json`
  This will generate 1,000 samples using `mySchema.json` and save them to `data.json`.
- Library Usage (TypeScript)

  import { SynthliteDataset, SynthliteGeneratedDataset } from "synthlite";

  const jsonSchemaPath = "./schema.json";
  const dataset = await SynthliteDataset.fromSchemaFile(jsonSchemaPath);

  const generatedDataset: SynthliteGeneratedDataset = dataset.generate({
    count: 500,
  });

  await generatedDataset.save("output.csv", "csv");
- CLI Usage
  - Basic Command
    `npm start -- --schema ./mySchema.json --count 500`
    This generates 500 samples and prints them to stdout.
  - Save to File
    `npm start -- --schema ./mySchema.json --count 1000 --output data.csv`
    Exports 1,000 samples to a `data.csv` file.
- Optional Llama 3.x Hook: If you have a Llama 3.x model integrated, you can configure your dataset to add advanced generative power to your fields. See our docs for usage examples (if available).
  `npm start -- --schema ./mySchema.json --count 50 --output myData.json`
⚡️ synthlite is a product of AdiPat Labs.
- Aditya Patange (Founder, AdiPat Labs)
We welcome contributions! Feel free to open issues, fork the repo, and submit pull requests.
This project is licensed under the AGPL v3. See the LICENSE file for details.
✨ "It's not fake data dude, it's 'synthetic' data. 🥼" — Oen