Grep a word from one file, use it to match in another file, and append what follows the match

I want to grep several words from file1 and use each word to grep for what follows its match in file2. Then I want to append the string that followed the match to the word I used, and write the result to file3, so that file3 contains

word1 [the thing that was found using word1 in a grep in file2]
word2 [the thing that was found using word2 in a grep in file2]

Part of the files I have is shown here. file1:

JAN1319964: PGSC|PGSC0003DMP400068385_PGSC0003DMT400096710  PGSC|PGSC0003DMP400062633_PGSC0003DMT400090958 PGSC|PGSC0003DMP400066271_PGSC0003DMT400094596 PGSC|PGSC0003DMP400064671_PGSC0003DMT400092996 PGSC|PGSC0003DMP400068967_PGSC0003DMT400097292
JAN1327159: PGSC|PGSC0003DMP400016823_PGSC0003DMT400024599 PGSC|PGSC0003DMP400017933_PGSC0003DMT400026257 Dul|Dul_comp58749_c0_seq2-1
JAN1330513: Des|Des_g36886.t1 PGSC|PGSC0003DMP400049952_PGSC0003DMT400073802

file2:

>Dul|Dul_g997.t1
ESECRVQYFSDDEVSPVTEVTGRRGSICVVCRLVPKASVSESSFLK
>Dul|Dul_g998.t1
MDDKRLWEEEERRRIAVRQREERGKIYERQKALEEQEKLAAIESYQDAIRREREEEERLKEKKKKKKKTEIRDDYLDDFLPRRNDRRIPDRDRSVKRRQTFESGRHAKEHAPPTKRRRGGEVGLSNILEEIVDTLKNNVNVSYLFLKPVTRKEAPDYHKYVKRPMDLSTIKERARKLEYKNRGQFRHDVAQITINAHLYNDGRNPGIPPLADQLLEICDYLLEENESILAEAESAI
>Dul|Dul_g999.t1
MDDKRLWEEEERRRIAVRQREERGKIYERQKALEEQEKLAAIESYQDAIRREREEEERLKEKKKKKKKTEIRDDYLDDFLPRRNDRRIPDRDRSVKRRQTFESGRHAKEHAPPTKRRRGGEVGLSNILEEIVDTLKNNVNVSYLFLKPVTRKEAPDYHKYVKRPMDLSTIKERARKLEYKNRGQFRHDVAQITINAHLYNDGRNPGIPPLADQLLEICDYLLEENESILAEAESGIEQ
>Des|Des_g1.t1
FRKQTVELSESDDTSISVETEDAEIENGNSPPAGLSNTTKVQLKPLYRSTIQLTPHPDGLSNTNEIK

The output I want for this example is:

JAN1319964: PGSC|PGSC0003DMP400068385_PGSC0003DMT400096710 [the line after the match to this name]: FRKQTVELSESDDTSISVETEDAEIENGNSPPAGLSNTTKVQLKPLYRSTIQLTPHPDGLSNTNEIK
PGSC|PGSC0003DMP400062633_PGSC0003DMT400090958 [the line after the match to this name]: FRKQTVELSESDDTSISVETEDAEIENGNSPPAGLSNTTKVQLKPLYRSTIQLTPHPDGLSNTNEIK
PGSC|PGSC0003DMP400066271_PGSC0003DMT400094596 [the line after the match to this name]: FRKQTVELSESDDTSISVETEDAEIENGNSPPAGLSNTTKVQLKPLYRSTIQLTPHPDGLSNTNEIK
PGSC|PGSC0003DMP400064671_PGSC0003DMT400092996 [the line after the match to this name]: FRKQTVELSESDDTSISVETEDAEIENGNSPPAGLSNTTKVQLKPLYRSTIQLTPHPDGLSNTNEIK
PGSC|PGSC0003DMP400068967_PGSC0003DMT400097292  [the line after the match to this name]: FRKQTVELSESDDTSISVETEDAEIENGNSPPAGLSNTTKVQLKPLYRSTIQLTPHPDGLSNTNEIK
JAN1327159: PGSC|PGSC0003DMP400016823_PGSC0003DMT400024599 [the line after the match to this name]: FRKQTVELSESDDTSISVETEDAEIENGNSPPAGLSNTTKVQLKPLYRSTIQLTPHPDGLSNTNEIK
PGSC|PGSC0003DMP400017933_PGSC0003DMT400026257 [the line after the match to this name]: FRKQTVELSESDDTSISVETEDAEIENGNSPPAGLSNTTKVQLKPLYRSTIQLTPHPDGLSNTNEIK
Dul|Dul_comp58749_c0_seq2-1
JAN1330513: Des|Des_g36886.t1  [the line after the match to this name]: FRKQTVELSESDDTSISVETEDAEIENGNSPPAGLSNTTKVQLKPLYRSTIQLTPHPDGLSNTNEIK
PGSC|PGSC0003DMP400049952_PGSC0003DMT400073802 [the line after the match to this name]: FRKQTVELSESDDTSISVETEDAEIENGNSPPAGLSNTTKVQLKPLYRSTIQLTPHPDGLSNTNEIK

As you can see, file1 is simply missing some information that is contained in file2 and needs to be added to file1. If anyone knows how to do this, I would be very grateful!

regex grep awk

— Stenemo

I don't understand: the first line of your file1 contains no identifiers that are present in file2 (the FASTA file), so what do you want to grep for? Are you trying to convert FASTA to tbl?

— terdon

I didn't really understand your question, so I'll answer what I think you are asking. Say you have a file of identifiers of interest like this (I assume the first field is never an identifier; I also assume that at least some of the IDs are present in the sequence file, although none of those in your example are):

Jan12345: ID1 ID2 ... IDN1
Jan67899: ID11 ID12 ... IDN2

And a FASTA file like this:

>ID1
ABCDEFG
>ID2
HIJKLMN
>IDN1
OPQRSTU
>ID11
WXYZABC
>ID12
DEFGHIJ
>IDN2
KLMNOPQ

And you want an output file like this:

Jan12345: ID1 ABCDEFG ID2 HIJKLMN ... IDN1 OPQRSTU

You could do something like this:

Save this script as FastaToTbl and make it executable (chmod 744 FastaToTbl):

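#!/bin/sh
# Convert FASTA to tbl: one "ID<TAB>SEQUENCE" record per line.
gawk '{
    if (substr($1, 1, 1) == ">") {
        # Header line: start a new record (newline first, except on line 1).
        if (NR > 1)
            printf "\n%s\t", substr($0, 2, length($0) - 1)
        else
            printf "%s\t", substr($0, 2, length($0) - 1)
    } else {
        # Sequence line: append it to the current record.
        printf "%s", $0
    }
}
END { printf "\n" }' "$@"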

This converts FASTA to tbl (ID<TAB>SEQUENCE).

Use FastaToTbl combined with this script to extract the IDs from file1 and the sequences from file2 (this assumes FastaToTbl is in a directory on your $PATH; otherwise call it as ./FastaToTbl):

FastaToTbl file2 |
  perl -ne 'chomp; @a = split(/\t/); $k{$a[0]} = $a[1];  ## Collect the sequences: $k{ID} = SEQUENCE
      END {
          open(A, "<", "file1") or die "Cannot open file1: $!";  ## Open the ID file
          while (<A>) {                   ## and process it line by line
              @a = split(/\s+/);          ## Gather the fields in array @a
              print shift(@a);            ## Print the first element (Jan123:)
              print " $_ $k{$_}" for @a;  ## Print each ID and its sequence
              print "\n";
          }
      }'

Output for the example files above:

Jan12345: ID1 ABCDEFG ID2 HIJKLMN IDN1 OPQRSTU
Jan67899: ID11 WXYZABC ID12 DEFGHIJ IDN2 KLMNOPQ
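
The same lookup can also be sketched in a single awk pass, without the intermediate tbl step. This is only an illustration, not part of the original answer: the function name annotate_ids is hypothetical, the whole header line after ">" is taken as the ID (as FastaToTbl does), and an ID with no sequence in the FASTA file is printed on its own.

```shell
#!/bin/sh
# Hypothetical helper (name and argument order are assumptions):
#   annotate_ids IDS_FILE FASTA_FILE
annotate_ids() {
    awk '
        # awk reads the FASTA file first: build seq[ID] = SEQUENCE,
        # joining sequences that are wrapped over several lines.
        NR == FNR {
            if (substr($0, 1, 1) == ">")
                id = substr($0, 2)
            else if (id != "")
                seq[id] = seq[id] $0
            next
        }
        # Then the ID file: field 1 is the label (e.g. "Jan12345:"),
        # every other field is an ID to look up.
        {
            printf "%s", $1
            for (i = 2; i <= NF; i++) {
                printf " %s", $i
                if ($i in seq)
                    printf " %s", seq[$i]
            }
            printf "\n"
        }
    ' "$2" "$1"
}
```

With the file names used in this thread, it would be called as: annotate_ids file1 file2 > file3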

— terdon

I tried this (for a while), but for some reason the first script (FastaToTbl) did not separate the "Jan ..." from the rest with a \t (tab), but with a normal space.

— Stenemo

Also, I'm not completely sure, but after adding a tab I am still not getting the sequences in my output (the last 5 lines look like this: PGSC|PGSC0003DMP400017933_PGSC0003DMT400026257 Dul|Dul_comp58749_c0_seq2-1 JAN1330513: Des|Des_g36886.t1 PGSC|PGSC0003DMP400049952_PGSC0003DMT400073802 SWtf|SW_g16502.t1).

— Stenemo

@Stenemo FastaToTbl should be run on the FASTA file. In my example, file2 is the FASTA file and file1 is the one with the IDs.

— terdon

OK, it works now. Thank you very much for your help!

— Stenemo