Introduction to UNIX command-line, episode II
Noe Fernandez
Boyce Thompson Institute, 2015
Sol Genomics Network

Class Content
• Terminal file system navigation
• Wildcards, shortcuts and special characters
• File permissions
• Compression UNIX commands
• Networking UNIX commands
• Basic NGS file formats
• Text files manipulation commands
• Command-line pipelines
• Introduction to bash scripts

UNIX Command-Line Cheat Sheet (BTI-SGN Bioinformatics Course 2014)

Text handling commands
    command > file             saves STDOUT in a file
    command >> file            appends STDOUT to a file
    cat file                   concatenates and prints files
    cat file1 file2 > file3    merges files 1 and 2 into file3
    cat *fasta > all.fasta     concatenates all fasta files in the current directory
    head file                  prints the first lines of a file
    head -n 5 file             prints the first five lines of a file
    tail file                  prints the last lines of a file
    tail -n 5 file             prints the last five lines of a file
    less file                  views a file
    less -N file               includes line numbers
    less -S file               does not wrap long lines
    grep 'pattern' file        prints lines matching a pattern
    grep -c 'pattern' file     counts lines matching a pattern
    cut -f 1,3 file            retrieves selected columns from a tab-delimited file
    sort file                  sorts the lines of a file
    sort -u file               sorts and returns unique lines
    uniq -c file               collapses adjacent repeated lines and counts them
    wc file                    counts lines, words and bytes
    paste file1 file2          joins the lines of the input files side by side
    paste -d ","               joins the lines of the input files with commas
    sed                        transforms text

File system commands
    ls                         lists directories and files
    ls -a                      lists all files, including hidden files
    ls -lh                     formatted list including more data
    ls -t                      lists sorted by date
    pwd                        returns the path to the working directory
    cd dir                     changes directory
    cd ..                      goes to the parent directory
    cd /                       goes to the root directory
    cd                         goes to the home directory
    touch file_name            creates an empty file
    cp file file_copy          copies a file
    cp -r                      copies directories and the files they contain
    rm file                    deletes a file
    rm -r dir                  deletes a directory and its files
    mv file1 file2             moves or renames a file
    mkdir dir_name             creates a directory
    rmdir dir_name             deletes an empty directory
    locate file_name           searches for a file
    man command                shows the manual of a command
    top                        shows process activity
    df -h                      shows disk space info

Networking commands
    wget URL                   downloads a file from a URL
    ssh user@server            connects to a server
    scp                        copies files between computers
    apt-get install            installs applications in Linux

Compression commands
    gzip/zip                   compresses a file
    gunzip/unzip               decompresses a file
    tar -cvf                   groups files
    tar -xvf                   ungroups files
    tar -zcvf                  groups and gzips files
    tar -zxvf                  gunzips and ungroups files

FASTA format
A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol at the beginning. (Source: http://www.ncbi.nlm.nih.gov/)

    >sequence_ID1 description
    ATGCGCGCGCGCGCGCGCGGGTAGCAGATGACGACACAGAGCGAGGATGCGCTGAGAGTA
    GTGTGACGACGATGACGGAAAATCAGATGGACCCGATGACAGCATGACGATGGGACGGGA
    AAGATTGGACCAGGACAGGACCAGGACCAGGACCAGGGATTAGA
    >sequence_ID2 description
    ATGGGGGGGACGACGATGGACACAGAGACAGAGACGACGACAGCAGACAGATTTACCTTA
    GACGAGATAGGAGAGACGACAGATATATATATATAGCAGACAGACAGACATTTAGACGAG
    ACGACGATAGACGATaaaaataa

Lines starting with ">" are description lines; all other lines are sequence data.
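
Because every FASTA record has exactly one description line, the layout above can be checked from the command line. The following is a minimal sketch, assuming a toy file named demo.fasta written to a scratch directory (both names are made up for illustration):

```shell
# Scratch directory and toy FASTA file (hypothetical names, not course data)
workdir=$(mktemp -d)
cat > "$workdir/demo.fasta" <<'EOF'
>sequence_ID1 description
ATGCGCGCGC
GTGTGACGAC
>sequence_ID2 description
ATGGGGGGGA
EOF

# One description line per record, so counting ">" lines counts sequences
n_seqs=$(grep -c '^>' "$workdir/demo.fasta")

# Everything that is not a description line is sequence data;
# drop the line breaks before counting the characters
n_bases=$(grep -v '^>' "$workdir/demo.fasta" | tr -d '\n' | wc -c)

echo "sequences: $n_seqs  bases: $n_bases"
```

The same two commands should work unchanged on a real FASTA file, since they rely only on the ">" convention described above.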

FASTQ format
A FASTQ file normally uses four lines per sequence. Line 1 begins with a '@' character, followed by a sequence identifier and an optional description. Line 2 is the raw sequence letters. Line 3 begins with a '+' character and is optionally followed by the same sequence identifier. Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence. (Source: Wikipedia)

    @D3B4KKQ1:291:D17NUACXX:8:1101:3630:2109 1:N:0:
    GACTTGCAGGCATGCAAGCTTGGCACTGGCCGTCGTTTTACAACGTCGTGACTGGGAAAACACTGGCGT
    +
    ?@<+ADDDDFDFFI<FGE=EHGIGFFGEFIIFFBGFIDEI>D?FFFFA4;C;DC=;=ABDD;
    @D3B4KKQ1:291:D17NUACXX:8:1101:3971:2092 1:N:0:
    ATTGCAGAAGCGGCCCCGCATCTGCGAAGGGTTAACCGCAGGTGCAGAAGCTGGCTTTAAGTGAGAAGT
    +
    =BAADBA?D?FGI<@FHDB6?ADFEGGIE8@FGGII3ABBBB(;;6@CC?C3;C<99?CCCCC;:::?

In each record, line 1 is the description line, line 2 the sequence data and line 4 the sequence quality.
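
The four-lines-per-record rule gives a simple way to count reads and to check that each quality string is as long as its sequence. This is a sketch under the assumption that the file strictly follows the four-line layout; demo.fastq and the read names are invented for the example. Counting lines that start with "@" is avoided here because "@" is also a valid quality character and can begin a quality line.

```shell
# Toy FASTQ file with two reads (hypothetical data)
workdir=$(mktemp -d)
cat > "$workdir/demo.fastq" <<'EOF'
@read1
GACTTGCA
+
?@<+ADDD
@read2
ATTGCAGA
+
=BAADBA?
EOF

# Four lines per record, so reads = lines / 4
n_lines=$(wc -l < "$workdir/demo.fastq")
n_reads=$(( n_lines / 4 ))

# Line 2 of each record is the sequence, line 4 the quality string;
# count the records where their lengths differ
n_bad=$(awk 'NR % 4 == 2 { len = length($0) }
             NR % 4 == 0 && length($0) != len { bad++ }
             END { print bad + 0 }' "$workdir/demo.fastq")

echo "reads: $n_reads  records with length mismatch: $n_bad"
```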

Tab-delimited text files
Tab-delimited files are a very common format in scientific data. They consist of columns of text separated by tabs. Other file formats could have different delimiters.
Examples: Blast, SAM (mapping), BED, VCF (SNPs), GTF, GFF ...

Tabular blast output example

    Query        Subject         id %   length  mismatch  gaps  qstart  qend  sstart  send  evalue  score
    ATCG00890.1  PACid:16418828  90.60  117     11        0     18      134   1       117   1e-71   220
    ATCG00890.1  PACid:16412855  90.48  147     14        2     41      387   27      173   1e-68   214
    ATCG00500.1  PACid:23047568  64.88  299     64        2     220     477   112     410   5e-131  388
    ATCG00500.1  PACid:23052247  58.88  321     69        3     220     477   381     701   3e-117  361
    ATCG00280.1  PACid:24129717  95.99  474     19        0     1       474   1       474   0.0     847
    ATCG00280.1  PACid:24095593  95.36  474     22        0     1       474   1       474   0.0     840
    ATCG00280.1  PACid:20871697  94.94  474     24        0     1       474   1       474   0.0     837
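
Since the columns are separated by tabs, column-oriented tools can address them by number. The sketch below rebuilds two rows of the table with printf (so the tabs are explicit) and pulls out the query id and e-value; the file name demo_blast.txt is an assumption made for the example.

```shell
# Two rows of the blast table, written with explicit tab characters
workdir=$(mktemp -d)
printf 'ATCG00890.1\tPACid:16418828\t90.60\t117\t11\t0\t18\t134\t1\t117\t1e-71\t220\n'  > "$workdir/demo_blast.txt"
printf 'ATCG00500.1\tPACid:23047568\t64.88\t299\t64\t2\t220\t477\t112\t410\t5e-131\t388\n' >> "$workdir/demo_blast.txt"

# Number of tab-separated columns in the first row
n_cols=$(head -n 1 "$workdir/demo_blast.txt" | awk -F'\t' '{ print NF }')

# Columns 1 (query) and 11 (evalue) of the first row; cut splits on tabs by default
first_pair=$(cut -f 1,11 "$workdir/demo_blast.txt" | head -n 1)

echo "columns: $n_cols"
echo "$first_pair"
```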

BIOINFORMATICIAN (quiz)
What is the best option to explore the content of a 2 GB file?
    A: MS Word
    B: Less
    C: Internet Explorer
    D: Cat

less: to view large files
    less blast_sample.txt       view file blast_sample.txt
    less -S blast_sample.txt    view file blast_sample.txt without wrapping long lines
    less -N blast_sample.txt    view file blast_sample.txt showing line numbers

Keys inside less
    /pattern     search pattern
    n            find next
    N            find previous
    arrow keys   scroll through the file
    < or g       go to file beginning
    > or G       go to file end
    space bar    page down
    b            page up
    q            quit

cat: concatenates and prints files
    cat sample1.fasta
        prints file sample1.fasta on the screen
    cat /home/bioinfo/Desktop/unix_data/sample1.fasta
        prints file sample1.fasta on the screen, using its full path
    cat sample1.fasta sample2.fasta > new_file.fasta
        concatenates files sample1.fasta and sample2.fasta and saves them in the file new_file.fasta (">" redirects output to a file)

cat: concatenates and prints files (continued)
    cat sample3.fasta >> new_file.fasta
        appends file sample3.fasta to new_file.fasta (">>" redirects output to the end of a file)
    cat *fasta > all_samples.fasta
        concatenates all FASTA files in the current directory and saves them in the file all_samples.fasta
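
The difference between ">" (overwrite) and ">>" (append) can be seen on three toy files. This is a sketch with made-up sequences; it uses the narrower wildcard sample*.fasta instead of *fasta so that previously merged files such as new_file.fasta are not matched and merged again.

```shell
# Three tiny FASTA files (hypothetical content) in a scratch directory
workdir=$(mktemp -d)
printf '>seq1\nACGT\n' > "$workdir/sample1.fasta"
printf '>seq2\nGGCC\n' > "$workdir/sample2.fasta"
printf '>seq3\nTTAA\n' > "$workdir/sample3.fasta"

# ">" creates (or overwrites) new_file.fasta with samples 1 and 2
cat "$workdir/sample1.fasta" "$workdir/sample2.fasta" > "$workdir/new_file.fasta"
after_merge=$(grep -c '^>' "$workdir/new_file.fasta")

# ">>" adds sample 3 at the end without touching what is already there
cat "$workdir/sample3.fasta" >> "$workdir/new_file.fasta"
after_append=$(grep -c '^>' "$workdir/new_file.fasta")

# The wildcard expands to sample1.fasta sample2.fasta sample3.fasta
cat "$workdir"/sample*.fasta > "$workdir/all_samples.fasta"
in_all=$(grep -c '^>' "$workdir/all_samples.fasta")

echo "after merge: $after_merge, after append: $after_append, all samples: $in_all"
```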

head: displays the first lines of a file
    head blast_sample.txt > blast10.txt
        prints the first lines of blast_sample.txt (10 by default) and saves them in blast10.txt
    head -n 5 blast_sample.txt
        prints the first five lines of blast_sample.txt

tail: displays the last part of a file
    tail blast_sample.txt
        prints the last 10 lines of blast_sample.txt
    tail -n 5 blast_sample.txt
        prints the last five lines of blast_sample.txt
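
A numbered file makes it easy to see which lines head and tail return. The sketch below assumes the seq command is available to generate the numbers 1 to 20; numbers.txt is a made-up file name.

```shell
# File with the numbers 1..20, one per line
workdir=$(mktemp -d)
seq 1 20 > "$workdir/numbers.txt"

# Without -n, head prints 10 lines
default_count=$(head "$workdir/numbers.txt" | wc -l)

# With -n 5, head stops after line 5
first5_last=$(head -n 5 "$workdir/numbers.txt" | tail -n 1)

# tail -n 5 starts five lines before the end, i.e. at line 16
last5_first=$(tail -n 5 "$workdir/numbers.txt" | head -n 1)

echo "default head lines: $default_count"
echo "last line of head -n 5: $first5_last"
echo "first line of tail -n 5: $last5_first"
```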

grep: searches patterns in files
    ^    search pattern at line start
    $    search pattern at line end

    grep '^>' sample1.fasta
        prints lines starting with ">", i.e., prints the description lines of a FASTA file
    grep -c '^>' sample1.fasta
        counts lines starting with ">", i.e., counts the number of sequences in a FASTA file
    grep -c '^+$' *fastq
        counts lines formed only by "+", i.e., counts the number of sequences in each FASTQ file in the current directory (one count per file)
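
When grep -c receives several files it reports one count per file rather than a grand total. A sketch of both behaviours, using two invented FASTQ files (a.fastq and b.fastq) whose "+" lines carry no identifier:

```shell
# Two toy FASTQ files: one read in a.fastq, two reads in b.fastq
workdir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n' > "$workdir/a.fastq"
printf '@r2\nACGT\n+\nIIII\n@r3\nGGCC\n+\nIIII\n' > "$workdir/b.fastq"

# Several files: grep -c prints "file:count" once per file
per_file=$(grep -c '^+$' "$workdir"/*fastq)

# Merging the files first gives a single overall count
total=$(cat "$workdir"/*fastq | grep -c '^+$')

echo "$per_file"
echo "total: $total"
```

Note that the '^+$' pattern only matches when the "+" line is not followed by the sequence identifier, which the format allows.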

grep: searches patterns in files (continued)
    grep -v 'Vvin' blast10.txt
        prints all lines except the ones containing 'Vvin'
    grep -i 'Vvin' blast10.txt
        prints lines containing 'Vvin' in any combination of upper and lower case
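
The effect of -v and -i is easiest to see by counting matches on a few lines. The following sketch uses a made-up three-line file, hits.txt, whose ids only imitate the 'Vvin' prefix used above:

```shell
# Three toy lines; only the first contains 'Vvin' with that exact case
workdir=$(mktemp -d)
printf 'q1\tVvin|gene1\nq2\tAtha|gene2\nq3\tvvin|gene3\n' > "$workdir/hits.txt"

exact=$(grep -c 'Vvin' "$workdir/hits.txt")       # case-sensitive match
any_case=$(grep -ci 'Vvin' "$workdir/hits.txt")   # -i ignores case
inverted=$(grep -vc 'Vvin' "$workdir/hits.txt")   # -v keeps the non-matching lines

echo "exact: $exact, any case: $any_case, not matching: $inverted"
```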

cut: gets columns from a tab-delimited file
    cut -f 1,2 blast10.txt
        prints columns 1 and 2 from blast10.txt
    cut -c 1-4,17-21 blast_sample.txt > tmp.txt
        prints characters 1 to 4 and 17 to 21 of each line in blast_sample.txt and saves them in tmp.txt

sort: sorts lines from a file
    sort tmp.txt > tmp2.txt
        sorts the lines of tmp.txt and saves them in tmp2.txt
    sort -u tmp.txt
        sorts the lines of tmp.txt and removes the repeated ones
    uniq -c tmp2.txt
        collapses repeated lines from tmp2.txt and counts how many times each one was repeated. Lines have to be sorted first, since only adjacent lines are compared
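
Why uniq needs sorted input can be shown with a file in which the repeated values are never adjacent. A minimal sketch, with ids.txt as a made-up file name:

```shell
# "a" and "b" alternate, so no two equal lines are adjacent
workdir=$(mktemp -d)
printf 'b\na\nb\na\nb\n' > "$workdir/ids.txt"

# Unsorted: uniq sees no adjacent repeats and reports every line separately
groups_unsorted=$(uniq -c "$workdir/ids.txt" | wc -l)

# Sorted first: equal lines become adjacent and collapse into one group each
groups_sorted=$(sort "$workdir/ids.txt" | uniq -c | wc -l)

# sort -u gives the same set of distinct lines, without the counts
distinct=$(sort -u "$workdir/ids.txt" | wc -l)

echo "unsorted groups: $groups_unsorted, sorted groups: $groups_sorted, distinct: $distinct"
```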

wc: counts lines, words and bytes
    wc blast10.txt
        counts lines, words and bytes in blast10.txt
    wc -l blast10.txt
        counts lines in blast10.txt
    wc -w blast10.txt
        counts words in blast10.txt
    wc -c blast10.txt
        counts bytes in blast10.txt (including the line returns)
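
The three counts can be verified by hand on a very small file. In this sketch mini.txt is an invented two-line, tab-delimited file; reading it through "<" makes wc print only the number, whereas passing the file name as an argument also prints the name.

```shell
# Two lines, each "qN<TAB>sN" plus a line return = 6 bytes per line
workdir=$(mktemp -d)
printf 'q1\ts1\nq2\ts2\n' > "$workdir/mini.txt"

n_lines=$(wc -l < "$workdir/mini.txt")
n_words=$(wc -w < "$workdir/mini.txt")
n_bytes=$(wc -c < "$workdir/mini.txt")   # tabs and line returns are counted too

# With a file argument the output ends with the file name
with_name=$(wc -l "$workdir/mini.txt")

echo "lines: $n_lines, words: $n_words, bytes: $n_bytes"
echo "$with_name"
```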

paste: concatenates files as columns
    cut -f 1 blast10.txt > col1.txt
    cut -f 2 blast10.txt > col2.txt
    cut -f 3 blast10.txt > col3.txt
        create one file each for columns 1, 2 and 3 of blast10.txt
    paste col2.txt col3.txt col1.txt
        joins the files line by line, appending each one to the right end of the previous one
    paste -d ',' col2.txt col3.txt col1.txt
        pastes the columns with commas as delimiters
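
Splitting a table with cut and joining the pieces with paste is a simple way to reorder columns. A sketch on a made-up two-row, three-column file (blast_mini.txt and its values are assumptions for the example):

```shell
# Toy table: query, subject, identity
workdir=$(mktemp -d)
printf 'q1\tsA\t90.6\nq2\tsB\t64.8\n' > "$workdir/blast_mini.txt"

# One file per column
cut -f 1 "$workdir/blast_mini.txt" > "$workdir/col1.txt"
cut -f 2 "$workdir/blast_mini.txt" > "$workdir/col2.txt"
cut -f 3 "$workdir/blast_mini.txt" > "$workdir/col3.txt"

# Join them in a new order (2, 3, 1) with commas instead of tabs
paste -d ',' "$workdir/col2.txt" "$workdir/col3.txt" "$workdir/col1.txt" > "$workdir/reordered.csv"
first_row=$(head -n 1 "$workdir/reordered.csv")

echo "$first_row"
```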

sed: replaces a pattern
    sed 's/A/a/g' col1.txt
        replaces all "A" characters with "a" in col1.txt
    sed 's/Atha/SGN/' col1.txt
        replaces the first Atha on each line with SGN in col1.txt
    sed -r 's/^([A-Za-z]+)\|(.+)/gene \2 from \1/' col2.txt
        gets the species and gene name from col2.txt and prints each line in a different format
        ([A-Za-z]+) saves the species name in \1
        (.+) saves the gene name in \2
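
The substitutions can be tried on a single id. This is a sketch that assumes ids of the form species|gene (the values below are invented) and writes the capture-group command with the backslashes needed to match a literal "|" and to refer to the saved groups. It uses sed -E, which GNU sed treats the same as -r and which BSD sed also accepts.

```shell
# Hypothetical ids in "species|gene" form
workdir=$(mktemp -d)
printf 'Vvin|GSVIVT01\nAtha|AT1G01010\n' > "$workdir/col2.txt"

# Without the g flag only the first "A" of the line is replaced
first_only=$(printf 'Atha|AT1G01010\n' | sed 's/A/a/')

# With g every "A" of the line is replaced
every_match=$(printf 'Atha|AT1G01010\n' | sed 's/A/a/g')

# \1 holds the species (letters before the "|"), \2 the gene (everything after it)
swapped=$(sed -E 's/^([A-Za-z]+)\|(.+)/gene \2 from \1/' "$workdir/col2.txt" | head -n 1)

echo "$first_only"
echo "$every_match"
echo "$swapped"
```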

Pipelines
A pipeline chains several commands, using the output of each command as the input of the next one. Two commands are connected by placing the sign "|" between them.
    ls | wc -l
        counts the entries (files and directories) in the current directory

Pipelines (continued)
    cat *fasta | grep "^>" | sed 's/>//'
        prints the sequence description line, without the ">", for all fasta files in the current directory
    cat *fasta | grep -c "^>"
        counts sequences in all fasta files in the current directory
    cut -f 1 blast_sample.txt | sort -u | wc -l
        counts the different query ids in a blast tabular file
    cut -f 1 blast_sample.txt | sort | uniq -c
        counts the occurrences of each query id in a blast tabular file
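
The counting pipelines above can be extended by one more step to rank the ids by number of hits. A sketch on an invented six-line table (blast_mini.txt, with query ids q1 to q3); the extra "sort -rn" step is an addition for illustration, not part of the original pipelines.

```shell
# Toy table: q2 appears three times, q1 twice, q3 once
workdir=$(mktemp -d)
printf 'q2\ts1\nq1\ts2\nq2\ts3\nq3\ts4\nq2\ts5\nq1\ts6\n' > "$workdir/blast_mini.txt"

# Number of different query ids
n_queries=$(cut -f 1 "$workdir/blast_mini.txt" | sort -u | wc -l)

# Occurrences per id, highest count first; keep the id of the top line
top_query=$(cut -f 1 "$workdir/blast_mini.txt" | sort | uniq -c | sort -rn | head -n 1 | awk '{ print $2 }')

echo "different queries: $n_queries, most frequent: $top_query"
```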

Shell script (bash) example
• All commands and programs we run in the terminal can be included in a text file with the extension .sh
• This file will execute the commands in the order they were written, from top to bottom.
[Annotated script screenshot; labels: head of bash scripts, comment line, command or program line, execution]
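
The parts named in the labels can be illustrated with a very small script. This is only a sketch: count_seqs.sh, its content and demo.fasta are invented for the example and are not the script shown on the slide.

```shell
workdir=$(mktemp -d)

# Write a three-line script: head line, comment line, command line
cat > "$workdir/count_seqs.sh" <<'EOF'
#!/bin/bash
# comment line: count the sequences in the FASTA file given as first argument
grep -c '^>' "$1"
EOF

# Toy input file with two sequences
printf '>s1\nACGT\n>s2\nGGCC\n' > "$workdir/demo.fasta"

# Execution: bash reads the script and runs its commands from top to bottom
result=$(bash "$workdir/count_seqs.sh" "$workdir/demo.fasta")

echo "sequences counted by the script: $result"
```

Running the script through bash, as done here, works even before the file has been made executable.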

Run a bash script on a server
    touch file.sh    creates an empty file
    emacs file.sh    opens file.sh in emacs

emacs: text editor
    save = ctrl-x ctrl-s
    exit = ctrl-x ctrl-c

Reviewing the permissions
    r    readable
    w    writable
    x    executable or searchable
    -    not r, w or x

First character of the permission string
    d    directory
    -    regular file

Example
    d  rwx   r-x    r-x
       user  group  other

Columns of a long listing: permissions, links #, owner user, owner group, size, date, file name

Run a bash script on a server (continued)
    chmod 755 ./file.sh     makes file.sh executable (see the chmod manual: man chmod)
    screen -L ./file.sh     runs the file.sh script in screen mode
    ctrl-a d                detaches the screen
    screen -r process_id    returns to the process screen
    less screenlog.0        shows the log from the screen execution
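
The number 755 maps directly onto the r, w and x letters: each digit is the sum of 4 (r), 2 (w) and 1 (x) for user, group and other, in that order. A sketch that sets the mode on a made-up file.sh and reads it back from a long listing:

```shell
# Minimal script file in a scratch directory (hypothetical content)
workdir=$(mktemp -d)
printf '#!/bin/bash\necho "done"\n' > "$workdir/file.sh"

# 7 = 4+2+1 = rwx for the user, 5 = 4+1 = r-x for group and for other
chmod 755 "$workdir/file.sh"

# The first 10 characters of a long listing are the file type plus the permissions
perms=$(ls -l "$workdir/file.sh" | cut -c 1-10)

echo "$perms"
```

Once the execute permission is set, the script can be started by its path, for example ./file.sh from its own directory.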

Exercises
1. Merge all fasta files, in the order sample3.fasta, sample1.fasta and sample2.fasta, and save them in a new file called all_samples.fasta
2. Merge all fastq files (sample1.fastq, sample2.fastq and sample3.fastq) using wildcards, and save them in a new file called all_samples.fastq
3. Save the first 100 lines from blast_sample.txt in a file called blast100.txt
4. Save the last 200 lines from blast_sample.txt in a file called blast200.txt
5. How many sequences are in all_samples.fasta?
6. How many sequences are in all_samples.fastq?
7. Create a file with the subject ids and their scores for the first 15 lines of blast_sample.txt
8. How many different query ids are in blast_sample.txt?
9. How many different subject ids are in blast_sample.txt?
10. Replace all '|' in blast_sample.txt with '_' and save the new file on the Desktop as tmp.txt.
11. Count how many genes are in each Arabidopsis thaliana chromosome, the chloroplast and the mitochondrion, based on the following file: ftp://ftp.arabidopsis.org/home/tair/Sequences/blast_datasets/TAIR10_blastsets/TAIR10_pep_20110103_representative_gene_model_updated

SGN Introduction to UNIX Command-line 2015 part 2
