Request to share the data processing code #7

Open
theuserroot opened this issue Dec 22, 2024 · 10 comments

Comments

@theuserroot

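On assist12, I removed the rows where skill is NaN, split each student's data into segments of 100 interactions and encoded them, dropped segments with 5 or fewer records, and finally processed the data with DyGFormer's method. However, the training AUC stays around 0.767, which is 0.015 below the paper. Could you share the data processing code?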
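For reference, the preprocessing described in this comment can be sketched roughly as below. This is a minimal illustration, not the authors' code: the row layout (dicts with `student` and `skill` keys, already in chronological order) and the names `preprocess`, `max_len`, and `drop_if_at_most` are assumptions made for the example.

```python
import math


def is_missing(value):
    """True for None or a float NaN (how a missing skill is assumed to appear)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def preprocess(rows, max_len=100, drop_if_at_most=5):
    """Drop NaN-skill rows, split each student's sequence into chunks of
    at most max_len, and discard chunks with drop_if_at_most or fewer rows."""
    by_student = {}
    for row in rows:
        if is_missing(row["skill"]):
            continue  # step 1: remove interactions without a skill
        by_student.setdefault(row["student"], []).append(row)

    chunks = []
    for seq in by_student.values():
        # step 2: cut the sequence at every max_len interactions
        for start in range(0, len(seq), max_len):
            chunk = seq[start:start + max_len]
            # step 3: keep only chunks longer than the threshold
            if len(chunk) > drop_if_at_most:
                chunks.append(chunk)
    return chunks
```

A student with 205 valid interactions would yield chunks of 100, 100, and 5, and the last one would be discarded under this rule.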

@PengLinzhi
Owner

PengLinzhi commented Dec 22, 2024 via email

@theuserroot
Author

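In the paper you mention using the data processing method of Ma [16], and Ma [16] states: "For calculate effciency, we set the max sequence length to 100 and truncate student learning sequences longer than 100 to several sub-sequences following to [Shen et al., 2021]." I applied this method to assist17. The data I generated has only 2 more records than the dataset you provided, the number of nodes is identical, and the training performance is the same.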

@PengLinzhi
Owner

PengLinzhi commented Dec 23, 2024 via email

@theuserroot
Author

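Thank you for your answer. I dropped two steps on assist12, namely deleting the rows where skill is NaN and splitting each student's data into segments of 100 before encoding, and now both AUC and AP match the results in the paper.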

@theuserroot
Author

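Let me confirm once more. The data processing steps are: remove students and questions with fewer than 5 interactions; encode students and questions with LabelEncoder; pass the result to DyGFormer's data processing module. Is that correct?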
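The first two steps listed above might look roughly like this in plain Python. This is a sketch under assumptions: the `student` and `question` keys and the function names are hypothetical, the filter is applied in a single pass (the thread does not say whether it is repeated until stable), and `label_encode` only mirrors the sorted-classes mapping that scikit-learn's `LabelEncoder` produces rather than calling it. The DyGFormer step is not shown.

```python
from collections import Counter


def filter_rare(rows, min_count=5):
    """Drop interactions whose student or question occurs fewer than
    min_count times (single pass; counts are taken before filtering)."""
    student_counts = Counter(r["student"] for r in rows)
    question_counts = Counter(r["question"] for r in rows)
    return [
        r for r in rows
        if student_counts[r["student"]] >= min_count
        and question_counts[r["question"]] >= min_count
    ]


def label_encode(values):
    """Map each distinct value to an integer in sorted order.
    Returns the encoded list and the sorted list of classes."""
    classes = sorted(set(values))
    index = {value: i for i, value in enumerate(classes)}
    return [index[v] for v in values], classes
```

Because removing a rare student can push a question below the threshold (and vice versa), a single pass and an iterated filter can give slightly different interaction counts.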

@PengLinzhi
Owner

PengLinzhi commented Dec 23, 2024 via email

@theuserroot
Author

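OK, I understand. The model gathers 50 historical interactions per student during training and learns from those, rather than truncating the data when the input is preprocessed.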
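To make the distinction concrete, here is a small sketch of selecting history at training time instead of cutting sequences during preprocessing. It assumes the history consists of the most recent interactions strictly before the current timestamp; that selection rule, the function name `recent_history`, and the parameter `k` are illustrative assumptions, not the repository's actual sampler.

```python
from bisect import bisect_left


def recent_history(times, t, k=50):
    """Return indices of up to k interactions with timestamp strictly
    before t. `times` must be sorted in ascending order."""
    end = bisect_left(times, t)   # first position with times[pos] >= t
    start = max(0, end - k)       # keep at most the k most recent ones
    return list(range(start, end))
```

With this approach the stored data keeps each student's full sequence, and the window of 50 is applied per training sample.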

@theuserroot
Author

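However, is NaN filled with 0 during data processing? The assist12 statistics in the paper seem to describe the data with NaN-skill rows dropped: the number of interactions after discarding rows without a skill is 2,621.3k, while the original dataset has 6,123.2k interactions.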
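The two options asked about here lead to different interaction counts, which a small sketch can show. The row layout and function names are assumptions for illustration, and the fill value 0 is taken from the question above, not confirmed as the authors' choice.

```python
import math


def is_missing(value):
    """True for None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_missing_skill(rows):
    """Option A: discard every interaction whose skill is missing."""
    return [r for r in rows if not is_missing(r["skill"])]


def fill_missing_skill(rows, fill_value=0):
    """Option B: keep every interaction and replace a missing skill
    with fill_value. The input rows are left unmodified."""
    return [
        dict(r, skill=fill_value) if is_missing(r["skill"]) else r
        for r in rows
    ]
```

If the two counts quoted above differ only by the NaN-skill rows, option A removes about 3,501.9k interactions (6,123.2k minus 2,621.3k), whereas option B keeps all of them.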

@PengLinzhi
Owner

PengLinzhi commented Dec 23, 2024 via email

@theuserroot
Author

OK, thank you for your answer.
