How to generate code-trajectory data with GPT4? #1
@SeungyounShin Hi, I am currently working on a very similar project, mainly generating a dataset for tool use. One of the datasets I am working on involves using a code interpreter tool. My method was basically to start with a few dozen instructions and ask GPT-4 to generate more similar ones. Using this slightly larger instruction set, I apply the Evol-Instruct [1] method to generate further instructions. So far I have only 4,628 instructions about using the code interpreter.

[1] WizardLM: Empowering Large Language Models to Follow Complex Instructions
Here's an output of the code generated by GPT-4 from my repository. The task was: "Can you plot Tesla's 90-day volume with the mean of the closing price and a marker at 't' where the mean until 't-1' plus the standard deviation until 't-1' is less than the price at 't'?" The performance of GPT-4 is impressive, but the data collection process tends to be slow. This is primarily because it operates in an iterative manner: generating code, executing it, then debugging and modifying the code, and repeating the process. This can lead to considerable latency. Your method is a valuable alternative, and I would greatly appreciate any further discussion on this topic. Please feel free to share your insights or suggestions.
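For reference, here is a minimal sketch of the kind of script GPT-4 tends to produce for that task. It is not the actual output from my repository; it assumes the `yfinance` package for price data and uses an expanding mean/std shifted by one day to express the "until t-1" condition.

```python
# Hypothetical sketch, not the actual GPT-4 output from the repository.
import yfinance as yf
import matplotlib.pyplot as plt

# Roughly 90 trading days of Tesla data (assumes the yfinance package).
df = yf.Ticker("TSLA").history(period="90d")
close, volume = df["Close"], df["Volume"]

# Mean/std over days 0..t-1 (shift by one so day t itself is excluded).
prior_mean = close.expanding().mean().shift(1)
prior_std = close.expanding().std().shift(1)
signal = close > (prior_mean + prior_std)

fig, (ax_price, ax_vol) = plt.subplots(2, 1, sharex=True, figsize=(10, 6))

ax_price.plot(close.index, close, label="Close")
ax_price.axhline(close.mean(), color="gray", linestyle="--", label="Mean close")
ax_price.scatter(close.index[signal], close[signal], color="red",
                 label="close_t > mean_{t-1} + std_{t-1}")
ax_price.legend()

ax_vol.bar(volume.index, volume, width=1.0, label="Volume")
ax_vol.legend()

plt.tight_layout()
plt.show()
```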
@SeungyounShin Oh, I have a code execution module as well; only the initial questions are generated via augmentation. Each round typically took me 20-120 seconds depending on complexity. My progress usually slows down due to a bad for loop or to training a 500M Hugging Face model on my Mac. What's the exact issue with #2? Could you provide more insight into the weird-answer problem? An example would be nice 😊
I recently explored the concept of Evol-Instruct and found it quite fascinating. Inspired by it, I crafted my own version. In the process, I observed that a significant number of human-engineered prompts are required. I also noticed that GPT often tends to respond with instructions like "Write ~" to create a Python function, but does not actively check the result or implement it itself; it then appears to congratulate itself on completing the task.

One thing that stood out to me was that Evol-Instruct seems to perform better than Self-Instruct: it produces not only higher-quality prompts but also a more diverse range of them. Generating higher-quality prompts is comparatively simple (for instance, we could just request "a more difficult one"), but generating diverse prompts is quite challenging. Transitioning from one topic to another can lead to significant deviations, such as moving from a simple '1+1=?' to a complex 'Use CAD to...'. Given these observations, maintaining a balance between diversity and quality seems like an interesting research topic.
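As a concrete illustration, here is a minimal sketch of an Evol-Instruct-style "make it more difficult" step, assuming the pre-v1 `openai` Python client. The prompt wording is my own illustration, not the exact template from the WizardLM paper; it adds a stay-on-topic constraint to limit the kind of topic drift described above.

```python
import openai  # assumes the pre-v1 openai client and an API key in the environment

# Illustrative evolution prompt (not the exact WizardLM template).
EVOLVE_PROMPT = (
    "Rewrite the following instruction into a more difficult version. "
    "Stay on the same topic and make sure it can still be solved by writing "
    "and executing Python code.\n\nInstruction: {seed}"
)

def evolve(seed_instruction: str) -> str:
    """Ask GPT-4 for a harder variant of a single seed instruction."""
    reply = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": EVOLVE_PROMPT.format(seed=seed_instruction)}],
    )
    return reply["choices"][0]["message"]["content"]

# One evolution round over a tiny seed set.
seeds = ["Plot TSLA's closing price for the last 90 days with its mean."]
evolved = [evolve(s) for s in seeds]
```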
[Still in progress] How can we enhance the generation of trajectories (code generation, execution, and debugging from it)?

Creation of SFT data of the form:

User:
Assistant:
<Thinking, GPT4>
<Debug...>
...

How can we automate this process with GPT-4 and collect data efficiently?
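A rough sketch of how such a loop could be automated is below. It assumes the pre-v1 `openai` client and a naive `exec`-based executor; the function names and the code-extraction step are illustrative, not from this repository.

```python
import contextlib
import io
import traceback

import openai  # assumes the pre-v1 openai client


def run_code(code: str) -> tuple[bool, str]:
    """Execute generated Python code, capturing stdout or the traceback."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})  # NOTE: use a real sandbox or Jupyter kernel in practice
        return True, buffer.getvalue()
    except Exception:
        return False, traceback.format_exc()


def collect_trajectory(instruction: str, max_rounds: int = 4) -> list[dict]:
    """Generate -> execute -> debug loop; returns the message trajectory as SFT data."""
    messages = [
        {"role": "system", "content": "Solve the task by writing Python code."},
        {"role": "user", "content": instruction},
    ]
    for _ in range(max_rounds):
        reply = openai.ChatCompletion.create(model="gpt-4", messages=messages)
        answer = reply["choices"][0]["message"]["content"]
        messages.append({"role": "assistant", "content": answer})

        # Extracting the code block from the markdown answer is elided here.
        ok, output = run_code(answer)
        messages.append({"role": "user", "content": f"Execution result:\n{output}"})
        if ok:
            break  # success: stop debugging and keep the full trajectory
    return messages
```

Running many such loops in parallel (asynchronous API calls plus separate executor processes) would be one way to reduce the per-trajectory latency discussed above.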