I have a table that I fill with taxonomy information, which I request from NCBI using efetch. How I do it is described here: How to get summary for acc.no. not starting with 'WP_' ?.
Now I want to use a bash one-liner to find two consecutive accession numbers. They should not appear, but if NCBI doesn't recognize the acc. no. I am providing, or the connection to the server is lost, they will appear in my file because epost jumps to the next acc. no. instead of finishing the line properly.
What I am currently trying to do is find a pattern (composed of two consecutive accession numbers) and insert a new line in the middle of it.
sed -e 's/^[a-zA-Z][a-zA-Z][a-zA-Z_][0-9][0-9][0-9][0-9][0-9]*\t[a-zA-Z][a-zA-Z][a-zA-Z_][0-9][0-9][0-9][0-9][0-9]*\t/[a-zA-Z][a-zA-Z][a-zA-Z_][0-9][0-9][0-9][0-9][0-9]*\t\n[a-zA-Z][a-zA-Z][a-zA-Z_][0-9][0-9][0-9][0-9][0-9]*\t/g' $2 > text_test #the replacement does not work properly
[a-zA-Z][a-zA-Z][a-zA-Z_][0-9][0-9][0-9][0-9][0-9]* is the pattern describing the structure of the acc. no., as they are 3 characters + 5 numbers or 3 characters + 7 numbers.
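To only locate the affected lines, something along these lines might work. It is a minimal sketch, assuming the columns really are tab-separated and that `grep -E` is available; the function name `find_double_acc` and the variables `tab` and `acc` are made up for illustration. The pattern mirrors the character classes above (three characters, the third possibly `_`, followed by four or more digits).

```shell
# Literal tab character, built with printf so the pattern does not rely on "\t".
tab=$(printf '\t')
# Same structure as the pattern above, written with interval counts.
acc='[A-Za-z]{2}[A-Za-z_][0-9]{4,}'

find_double_acc() {
    # -n prefixes each hit with its line number; "$1" is the table to check.
    grep -nE "^${acc}${tab}${acc}${tab}" "$1"
}
```

Calling `find_double_acc my_table.tsv` (file name hypothetical) would then list every line that starts with two accession numbers in a row.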
..................................................................................................................................................................
..................................................................................................................................................................
I know this is not very reader-friendly, so I will split the code into parts and explain my thoughts:
sed -e 's/^[a-zA-Z][a-zA-Z][a-zA-Z_][0-9][0-9][0-9][0-9][0-9]*\t[a-zA-Z][a-zA-Z][a-zA-Z_][0-9][0-9][0-9][0-9][0-9]*\t (...)
This first part tells the computer what I want to find.
(...) /[a-zA-Z][a-zA-Z][a-zA-Z_][0-9][0-9][0-9][0-9][0-9]*\t\n[a-zA-Z][a-zA-Z][a-zA-Z_][0-9][0-9][0-9][0-9][0-9]*\t/g'
This second part tells the computer what I want to put there.
(...) $2 > text_test #the replacement does not work properly
This last part only states where to look and where to write the result to.
The second part is causing the problems. It does not write the acc. no. as they are found in the file; instead, it writes the regex [a-zA-Z][a-zA-Z][a-zA-Z_][0-9][0-9][0-9][0-9][0-9]* as a literal string in the file.
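The behaviour described here is expected: the replacement side of `s/.../.../` is inserted as literal text, so a regex written there is not matched again. The matched text has to be captured with parentheses and re-inserted with back-references (`\1`, `\2`). Below is a minimal sketch of that idea, assuming GNU sed (for `-E`, `\t` in the pattern and `\n` in the replacement); the function name `split_double_acc_sed` and the variable `acc` are hypothetical.

```shell
# Accession pattern: three characters (the third may be "_") plus four or more digits.
acc='[A-Za-z]{2}[A-Za-z_][0-9]{4,}'

split_double_acc_sed() {
    # The two parenthesised groups capture the accession numbers;
    # \1 and \2 write the captured text back, separated by a newline.
    sed -E "s/^(${acc})\t(${acc})\t/\1\n\2\t/"
}

# Example: reads the table on stdin and writes the repaired table to stdout.
printf 'CCH19814\tRSN10899\tStreptomyces sp. WAC 05977\tBacteria\n' | split_double_acc_sed
```

Lines that start with only one accession number do not match the pattern and pass through unchanged.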
..................................................................................................................................................................
..................................................................................................................................................................
Example for file content:
WP_112675856 Micromonospora saelicesensis Bacteria; Actinobacteria; Micromonosporales; Micromonosporaceae; Micromonospora
CCH19814 RSN10899 Streptomyces sp. WAC 05977 Bacteria; Actinobacteria; Streptomycetales; Streptomycetaceae; Streptomyces
With CCH19814 RSN10899 being the pattern to split.
Desired result:
WP_112675856 Micromonospora saelicesensis Bacteria; Actinobacteria; Micromonosporales; Micromonosporaceae; Micromonospora
CCH19814
RSN10899 Streptomyces sp. WAC 05977 Bacteria; Actinobacteria; Streptomycetales; Streptomycetaceae; Streptomyces
Or even:
WP_112675856 Micromonospora saelicesensis Bacteria; Actinobacteria; Micromonosporales; Micromonosporaceae; Micromonospora
CCH19814 - -
RSN10899 Streptomyces sp. WAC 05977 Bacteria; Actinobacteria; Streptomycetales; Streptomycetaceae; Streptomyces
Not related to the problem, but you can tidy your regexes up considerably by using counts after your character classes, e.g. [A-Z][A-Z][A-Z] can simply be [A-Z]{3}.

I also thought about approaching this by using awk and checking if there is content in a fourth column. If everything goes right, the file should have three entries for each line, separated by a tab. In the case of two consecutive acc. no. we have four entries for that line, therefore four tabs in that line.
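The fourth-column idea could be sketched roughly as follows. This assumes a healthy line has exactly three tab-separated fields and that any line with a non-empty fourth field starts with a stray accession number; the function name `fix_double_acc_awk` is made up for illustration, and the `-` placeholders follow the second desired result shown above.

```shell
fix_double_acc_awk() {
    awk -F'\t' -v OFS='\t' '
        # A non-empty fourth field means the line holds one entry too many.
        NF >= 4 && $4 != "" {
            # Put the stray accession number on its own line with placeholders.
            print $1, "-", "-"
            # Rebuild the remaining fields as a normal line.
            rest = $2
            for (i = 3; i <= NF; i++) rest = rest OFS $i
            print rest
            next
        }
        # All other lines are printed unchanged.
        { print }
    '
}
```

It reads the table on stdin, e.g. `fix_double_acc_awk < my_table.tsv > text_test` (input file name hypothetical). Unlike the regex approach, this does not check what the first two fields look like, so it would also split lines that have four fields for some other reason.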