Question

bash find consecutive acc. no. & insert text

0

3.8 years ago

6schulte ▴ 30

I have a table that I fill with taxonomy information, which I request from NCBI using efetch. How I do this is described here: How to get summary for acc.no. not starting with 'WP_' ?

Now I want to use a bash command line to find two consecutive accession numbers on the same line. This should not happen, but if NCBI does not recognize the acc. no. I provide, or the connection to the server is lost, such a pair appears in my file, because epost jumps to the next acc. no. instead of finishing the line properly.

What I am currently trying to do is find a pattern (composed of two consecutive accession numbers) and insert a new line in the middle of it.

sed -e 's/^[a-zA-Z][a-zA-Z][a-zA-Z_][0-9][0-9][0-9][0-9][0-9]*\t[a-zA-Z][a-zA-Z][a-zA-Z_][0-9][0-9][0-9][0-9][0-9]*\t/[a-zA-Z][a-zA-Z][a-zA-Z_][0-9][0-9][0-9][0-9][0-9]*\t\n[a-zA-Z][a-zA-Z][a-zA-Z_][0-9][0-9][0-9][0-9][0-9]*\t/g' $2 > text_test #the replacement does not work properly

[a-zA-Z][a-zA-Z][a-zA-Z_][0-9][0-9][0-9][0-9][0-9]* is the pattern describing the structure of the acc. no., as they are 3 characters + 5 digits or 3 characters + 7 digits.

..................................................................................................................................................................

I know this is not very reader-friendly, so I will split the code into parts and explain my thoughts:

sed -e 's/^[a-zA-Z][a-zA-Z][a-zA-Z_][0-9][0-9][0-9][0-9][0-9]*\t[a-zA-Z][a-zA-Z][a-zA-Z_][0-9][0-9][0-9][0-9][0-9]*\t   (...)

This first part tells the computer what I want to find.

(...)   /[a-zA-Z][a-zA-Z][a-zA-Z_][0-9][0-9][0-9][0-9][0-9]*\t\n[a-zA-Z][a-zA-Z][a-zA-Z_][0-9][0-9][0-9][0-9][0-9]*\t/g'

This second part tells the computer what I want to put there.

(...)   $2 > text_test #the replacement does not work properly

This last part only states where to look and where to write the result to.

The second part is causing the problem. Instead of writing the acc. no. as they are found in the file, it writes the regex [a-zA-Z][a-zA-Z][a-zA-Z_][0-9][0-9][0-9][0-9][0-9]* into the file as a literal string.
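The usual way around this is to capture what was matched and refer back to it in the replacement, because the replacement side of s/// is plain text rather than a regex. Below is a minimal sketch of that idea, not a tested drop-in for the real pipeline. It assumes GNU sed (for \t and \n) and extended regexes (-E); the variable acc, the temporary directory and the sample file are made up for illustration, and the sample reuses the two example lines with real tabs.

```shell
#!/usr/bin/env bash
# Hypothetical scratch directory and sample file (names are assumptions).
tmp=$(mktemp -d)
printf 'WP_112675856\tMicromonospora saelicesensis\tBacteria; Actinobacteria; Micromonosporales; Micromonosporaceae; Micromonospora\n' > "$tmp/sample.tsv"
printf 'CCH19814\tRSN10899\tStreptomyces sp. WAC 05977\tBacteria; Actinobacteria; Streptomycetales; Streptomycetaceae; Streptomyces\n' >> "$tmp/sample.tsv"

# Same acc. no. pattern as in the question, written with counts:
# 3 characters followed by at least 4 digits.
acc='[a-zA-Z]{2}[a-zA-Z_][0-9]{4,}'

# (...) captures each acc. no.; \1 and \2 put the captured text back.
# The first record is closed with "-" placeholders, then a newline
# starts the second record.
sed -E "s/^(${acc})\t(${acc})\t/\1\t-\t-\n\2\t/" "$tmp/sample.tsv" > "$tmp/text_test"

cat "$tmp/text_test"
```

If the placeholders are not wanted, dropping the two `\t-` parts from the replacement leaves just the first acc. no. on its own line.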

..................................................................................................................................................................

Example for file content:

WP_112675856    Micromonospora saelicesensis    Bacteria; Actinobacteria; Micromonosporales; Micromonosporaceae; Micromonospora
CCH19814    RSN10899    Streptomyces sp. WAC 05977  Bacteria; Actinobacteria; Streptomycetales; Streptomycetaceae; Streptomyces

With CCH19814 RSN10899 being the pattern to split.

Desired result:

WP_112675856    Micromonospora saelicesensis    Bacteria; Actinobacteria; Micromonosporales; Micromonosporaceae; Micromonospora
CCH19814    
RSN10899    Streptomyces sp. WAC 05977  Bacteria; Actinobacteria; Streptomycetales; Streptomycetaceae; Streptomyces

Or even:

WP_112675856    Micromonospora saelicesensis    Bacteria; Actinobacteria; Micromonosporales; Micromonosporaceae; Micromonospora
CCH19814    -    -
RSN10899    Streptomyces sp. WAC 05977  Bacteria; Actinobacteria; Streptomycetales; Streptomycetaceae; Streptomyces

bash NCBI efetch • 1.1k views

ADD COMMENT • link updated 19 months ago by Ram 44k • written 3.8 years ago by 6schulte ▴ 30

1

Not related to the problem, but you can tidy your regexes up considerably by using counts after your character classes, e.g. [A-Z][A-Z][A-Z] can simply be [A-Z]{3}.
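To illustrate the suggestion, here is a small sketch comparing the long form with the counted form. One thing to keep in mind is that the {n} syntax needs extended regexes (grep -E, sed -E); in basic regexes it would have to be written \{n\}. The file name and test strings below are invented for the example.

```shell
#!/usr/bin/env bash
# Hypothetical list of candidate strings, one per line.
tmp=$(mktemp -d)
printf 'WP_112675856\nCCH19814\nAB12\nStreptomyces\n' > "$tmp/ids"

# Pattern from the question (basic regex): 3 characters + at least 4 digits.
long='^[a-zA-Z][a-zA-Z][a-zA-Z_][0-9][0-9][0-9][0-9][0-9]*$'
# The same pattern with counts (extended regex, hence grep -E below).
short='^[a-zA-Z]{2}[a-zA-Z_][0-9]{4,}$'

long_hits=$(grep "$long" "$tmp/ids")
short_hits=$(grep -E "$short" "$tmp/ids")

# Both patterns should select the same lines.
printf '%s\n' "$long_hits"
printf '%s\n' "$short_hits"
```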

ADD REPLY • link 3.8 years ago by Joe 21k

0

I also thought about approaching this by using awk and checking if there is content in a fourth column.

If everything goes right, each line of the file should have three entries, separated by tabs.

In the case of two consecutive acc. no. there are four entries on that line, and therefore one more tab than on a normal line.
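A rough sketch of that awk idea could look like the following. It assumes the file is strictly tab-separated, that a good line has exactly three fields, and that a broken line has exactly four with an acc. no. in the second field; the file name and the "-" placeholders are only illustrative.

```shell
#!/usr/bin/env bash
# Hypothetical sample file built from the two example lines.
tmp=$(mktemp -d)
printf 'WP_112675856\tMicromonospora saelicesensis\tBacteria; Actinobacteria; Micromonosporales; Micromonosporaceae; Micromonospora\n' > "$tmp/sample.tsv"
printf 'CCH19814\tRSN10899\tStreptomyces sp. WAC 05977\tBacteria; Actinobacteria; Streptomycetales; Streptomycetaceae; Streptomyces\n' >> "$tmp/sample.tsv"

fixed=$(awk -F'\t' -v OFS='\t' '
    # Four fields with an acc. no. in field 2 means an unfinished record.
    NF == 4 && $2 ~ /^[a-zA-Z][a-zA-Z][a-zA-Z_][0-9][0-9][0-9][0-9][0-9]*$/ {
        print $1, "-", "-"    # close the unfinished record with placeholders
        print $2, $3, $4      # the rest of the line is a complete record
        next
    }
    { print }                 # every other line passes through unchanged
' "$tmp/sample.tsv")

printf '%s\n' "$fixed"
```

Compared with the sed approach, this handles all affected lines in a single pass and does not need a line number.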

ADD REPLY • link 3.8 years ago by 6schulte ▴ 30

score 1 · Accepted Answer · 2021-01-14

1

3.8 years ago

6schulte ▴ 30

I have a solution:

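# Line number of the first line that contains two consecutive acc. no.
number=$(awk '/[a-zA-Z][a-zA-Z][a-zA-Z_][0-9][0-9][0-9][0-9][0-9]*\t[a-zA-Z][a-zA-Z][a-zA-Z_][0-9][0-9][0-9][0-9][0-9]*/{ print NR; exit }' "$2")

# Replace the first tab on that line with a newline
sed -e "${number}s/\t/\n/" "$2" > text_test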

This can run in a loop.
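One possible shape for such a loop is sketched below. It assumes GNU sed, anchors the pattern at the start of the line (an addition to the original), and stops as soon as awk finds no more matches, since an empty line number would make sed act on every line. The file names, the pat variable and the input line with three acc. no. in a row are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical input: three acc. no. in a row, so two passes are needed.
tmp=$(mktemp -d)
printf 'WP_112675856\tCCH19814\tRSN10899\tStreptomyces sp. WAC 05977\tBacteria; Actinobacteria; Streptomycetales; Streptomycetaceae; Streptomyces\n' > "$tmp/input.tsv"
cp "$tmp/input.tsv" "$tmp/work.tsv"

# Two acc. no. at the start of a line, each followed by a tab.
pat='^[a-zA-Z][a-zA-Z][a-zA-Z_][0-9][0-9][0-9][0-9][0-9]*\t[a-zA-Z][a-zA-Z][a-zA-Z_][0-9][0-9][0-9][0-9][0-9]*\t'

while :; do
    # First line that still holds two consecutive acc. no. (empty if none).
    number=$(awk -v pat="$pat" '$0 ~ pat { print NR; exit }' "$tmp/work.tsv")
    [ -z "$number" ] && break

    # Split that line at its first tab, then continue with the result.
    sed -e "${number}s/\t/\n/" "$tmp/work.tsv" > "$tmp/next.tsv"
    mv "$tmp/next.tsv" "$tmp/work.tsv"
done

cat "$tmp/work.tsv"
```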

ADD COMMENT • link 3.8 years ago by 6schulte ▴ 30

0

My previous version was only halfway functional:

This

awk '/[a-zA-Z][a-zA-Z][a-zA-Z_][0-9][0-9][0-9][0-9][0-9]*\t[a-zA-Z][a-zA-Z][a-zA-Z_][0-9][0-9][0-9][0-9][0-9]*/{ print NR; exit }' $2 > number

tells me in which line two acc. no. occur after one another.

In my case it is in the second line. This will replace the first occurrence of a tab in the second line with a new line:

sed -e "2s/\t/\n/" $2 > text_test #(1)

I'd like to dynamically input the line number though, like:

sed -e "$number s/\t/\n/" $2 > text_test #(1)

or

sed -e "$($number)s/\t/\n/" $2 > text_test

But that does not work, so I modified it (see solution above).
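A likely explanation, assuming the shell variable number was never set in that session: the awk call above writes the line number into a file called number, which is unrelated to $number. With $number empty, the sed address disappears and the substitution runs on every line. $($number) is a command substitution, so it would try to run the content of the variable as a command. The sketch below reproduces this under those assumptions, using hypothetical file names and GNU sed.

```shell
#!/usr/bin/env bash
# Hypothetical sample file built from the two example lines.
tmp=$(mktemp -d)
printf 'WP_112675856\tMicromonospora saelicesensis\tBacteria; Actinobacteria; Micromonosporales; Micromonosporaceae; Micromonospora\n' > "$tmp/sample.tsv"
printf 'CCH19814\tRSN10899\tStreptomyces sp. WAC 05977\tBacteria; Actinobacteria; Streptomycetales; Streptomycetaceae; Streptomyces\n' >> "$tmp/sample.tsv"

# As in the previous version: the line number ends up in a FILE.
awk '/[a-zA-Z][a-zA-Z][a-zA-Z_][0-9][0-9][0-9][0-9][0-9]*\t[a-zA-Z][a-zA-Z][a-zA-Z_][0-9][0-9][0-9][0-9][0-9]*/{ print NR; exit }' "$tmp/sample.tsv" > "$tmp/number"

# The shell variable is still unset, so the sed address is empty
# and the first tab of EVERY line is replaced.
unset number
broken=$(sed -e "${number}s/\t/\n/" "$tmp/sample.tsv")

# Reading the file into the variable restricts sed to that one line.
number=$(cat "$tmp/number")
fixed=$(sed -e "${number}s/\t/\n/" "$tmp/sample.tsv")

printf 'without line number: %s lines\n' "$(printf '%s\n' "$broken" | wc -l)"
printf 'with line number %s: %s lines\n' "$number" "$(printf '%s\n' "$fixed" | wc -l)"
```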

...

=================================

(1) The same could be achieved using an awk-version.

(2) Will unfortunately insert a new line in both the first and the second line.

ADD REPLY • link 3.8 years ago by 6schulte ▴ 30