Add data #18

Open
rhnvrm opened this issue Jan 12, 2021 · 3 comments
Labels
good first issue (Good for newcomers), help wanted (Extra attention is needed)

Comments


rhnvrm commented Jan 12, 2021

The bot has run out of data to tweet. If anyone wants to volunteer to add data, feel free to ping me or @kumbhakaran on this thread. We can help you get started and point out what to take care of while adding data.

rhnvrm added the help wanted and good first issue labels Jan 12, 2021

rhnvrm commented Feb 9, 2021

Steps to add a file:

  • Copy the data from the CAD archives into a new file.
  • Fix the spacing before line headers with this regex replace: %s/^3\./\n3./g (change 3 to the number that prefixes the headers in the data being added).
  • Trim leading spaces from every line: %s/^ *//g
  • Collapse runs of whitespace into a single space: %s/\s\s\+/ /g (8e339a3)
  • Manually review the result for visible issues and document them here.
  • Append the data in this file to data.txt. If the chapter is ending, move the existing contents to data-2.txt (and so on), truncate data.txt, and then add the work done above to data.txt.
  • If copying introduced duplicate rows, run awk -i inplace '!seen[$0]++' data.txt
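
For anyone who prefers a script over editor commands, the cleanup steps above could be sketched roughly as follows. This is only an illustration, not the code used in the commits: the function name clean_lines, the header_prefix parameter, and the sample text are hypothetical. It also assumes that empty lines are dropped and that the blank line before each header is inserted last, so that deduplication does not remove it.

```python
import re


def clean_lines(raw: str, header_prefix: str) -> str:
    """Clean raw text copied from the archives (hypothetical helper).

    header_prefix is the number that starts each line header, e.g. "3".
    """
    seen = set()
    out = []
    for line in raw.splitlines():
        line = re.sub(r"^ +", "", line)      # trim leading spaces
        line = re.sub(r"\s{2,}", " ", line)  # collapse runs of whitespace
        if not line.strip():
            continue  # assumption: drop empty or whitespace-only lines
        if line in seen:
            continue  # drop duplicate rows, keeping the first occurrence
        seen.add(line)
        if line.startswith(header_prefix + "."):
            out.append("")  # blank line before every line header
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Made-up sample input, only to show the shape of the data.
    sample = "  3.1.1  Mr. President:   hello\n3.1.1  Mr. President:   hello\n"
    print(clean_lines(sample, "3"))
```

The output can then be appended to data.txt after the manual review step.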

I used the above steps to add the following commits:

57924f4, cd3a347, e2aeba5, c4145b5, 5df311f, b268bed, 693705c


rhnvrm commented Jun 28, 2022

curl https://www.constitutionofindia.net/constitution_assembly_debates/volume/7/1948-11-26 \
  | htmlq '.ckeditor-content .row' -p \
  | htmlq -r .tooltiptext -r .summary-block -r .support-text -r .social-block -r .abt-events -t \
  | sed -e 's/^ *//g' \
  | sed -e 's/\s\s\+/ /g' \
  | sed '/^$/d' \
  | awk '!seen[$0]++' \
  | sed 's/^7\./\n7./g' >> data.txt
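
The htmlq stage of this pipeline removes nodes by class and keeps the remaining text. A rough standard-library sketch of that one stage is below, in case htmlq is not available. The names TextWithoutClasses and extract_text are hypothetical, the sketch assumes reasonably well-formed markup (every non-void tag is closed), and it does not perform the '.ckeditor-content .row' selection.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TextWithoutClasses(HTMLParser):
    """Collect text, skipping any subtree whose element has a listed class."""

    def __init__(self, skip_classes):
        super().__init__()
        self.skip_classes = set(skip_classes)
        self.skip_depth = 0  # > 0 while inside a skipped subtree
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth += 1  # nested element inside a skipped subtree
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.skip_classes.intersection(classes):
            self.skip_depth = 1  # start skipping at this element

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)


def extract_text(html, skip_classes):
    parser = TextWithoutClasses(skip_classes)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)
```

Calling extract_text(page_html, ["tooltiptext", "summary-block", "support-text", "social-block", "abt-events"]) would mirror the class list passed to htmlq -r above; the whitespace cleanup and deduplication would still need to run on the result.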


rhnvrm commented Sep 18, 2022

Added Volume 8 - 3944ede

This commit also automates generating the file with a bash script.
