
## This pipeline collapses ORFs whose similarity exceeds a given threshold


_Dependencies:_ 
* _perl with "Graph" module installed_
* _bioperl_
* _fasta36_
* _emboss (infoseq)_



### Step 1 : 
Align the ORFs all against all with fasta36, using the -m8 option to obtain tabular output:
```
$ fasta36  -m8  orfs.fasta orfs.fasta > ORF_vs_ORF.m8
```


### Step 2 : 
Add the sequence lengths to the alignment output using infoseq and the script "length_in_infoseq.pl":
```
$ infoseq orfs.fasta > orfs.fasta.infoseq
$ perl lenght_in_infoseq.pl orfs.fasta.infoseq ORF_vs_ORF.m8 > ORF_vs_ORF.m8.len.cg.dat
```


### Step 3 : 
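Add three columns to the ORF_vs_ORF.m8.len.cg.dat file using Excel, and save the result as ORF_vs_ORF.m8.len.cg.dat.modified (the input used in Step 4):
* Column 16: ratio of alignment length to query length
* Column 17: ratio of alignment length to hit length
* Column 18: the maximum of columns 16 and 17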
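
The three columns can also be computed on the command line instead of in a spreadsheet. The following is a minimal sketch, not part of the pipeline. The function name `add_overlap_columns` is hypothetical, and it assumes a tab-separated file with 15 columns and no header line (lines starting with `#` are passed through), with the alignment length in column 4 (the standard -m8 layout). The positions of the query and hit lengths (13 and 14 here) are guesses, since they depend on what the Step 2 script writes; check them against your file.

```shell
# Sketch: append columns 16-18 to the Step 2 table without a spreadsheet.
# ASSUMPTIONS (not confirmed by the pipeline scripts): tab-separated input,
# alignment length in column 4, query length in column 13, hit length in column 14.
# Usage: add_overlap_columns FILE [QUERY_LEN_COL] [HIT_LEN_COL]
add_overlap_columns() {
    awk -F'\t' -v OFS='\t' -v aln_col=4 \
        -v qlen_col="${2:-13}" -v hlen_col="${3:-14}" '
        /^#/ || NF == 0 { print; next }      # pass comment and blank lines through
        {
            q = $aln_col / $qlen_col         # column 16: aln len / query len
            h = $aln_col / $hlen_col         # column 17: aln len / hit len
            m = (q > h) ? q : h              # column 18: max of the two ratios
            print $0, q, h, m
        }' "$1"
}
```

Under those assumptions, it could be used as:

```shell
add_overlap_columns ORF_vs_ORF.m8.len.cg.dat > ORF_vs_ORF.m8.len.cg.dat.modified
```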



### Step 4 : 
Build the networks (graphs) of ORFs, which group similar sequences into clusters. The thresholds for this step can be adjusted inside the script (lines 33 to 35):
```
$ perl graph_S288C.genes.pl ORF_vs_ORF.m8.len.cg.dat.modified
```

Two files will be generated by this step:
1. "graph.*.m8.id.95.overl.0.overlper.0.75.YRQ", listing all the ORFs in each cluster of ORFs.
2. "graph_summary.*.graph.id.90.overl.200.overlper.0.75.YRQ.dat", containing three columns:
	1. The cluster number
	2. The name of the representative ORF for that cluster
	3. The number of ORFs in that cluster
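
To illustrate the idea of grouping ORFs through a similarity network, here is a rough sketch that treats each cluster as a connected component of the graph whose edges are the ORF pairs passing the cutoffs. This is not the logic of graph_S288C.genes.pl, which is not reproduced here. The function name `cluster_orfs`, the column positions (percent identity in column 3, maximum overlap ratio in column 18), and the default cutoffs (95 and 0.75) are all assumptions made for illustration.

```shell
# Sketch only: NOT the logic of graph_S288C.genes.pl.
# ASSUMPTIONS: tab-separated input, column 3 = percent identity,
# column 18 = max overlap ratio, cluster = connected component.
# Usage: cluster_orfs FILE [MIN_IDENTITY] [MIN_OVERLAP_RATIO]
cluster_orfs() {
    awk -F'\t' -v min_id="${2:-95}" -v min_ovl="${3:-0.75}" '
        function find(x) {                        # root of the set containing x
            while (parent[x] != x) x = parent[x]
            return x
        }
        /^#/ || NF < 18 { next }                  # skip comments and short lines
        {
            if (!($1 in parent)) parent[$1] = $1  # each ORF starts as its own cluster
            if (!($2 in parent)) parent[$2] = $2
            if ($3 + 0 >= min_id && $18 + 0 >= min_ovl) {
                a = find($1); b = find($2)
                if (a != b) parent[a] = b         # merge the two clusters
            }
        }
        END {
            for (x in parent) print find(x) "\t" x   # cluster root, member ORF
        }' "$1" | sort
}
```

Each output line pairs a cluster label (an arbitrary member acting as the root) with one ORF in that cluster. The real script may choose representatives and apply its thresholds differently.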



### Step 5 : 
File 2 from Step 4 (the summary file) is then used by another script to write the final FASTA file containing only the representative ORFs:
```
$ perl collapsing_from_graph.pl graph_summary.*.graph.id.90.overl.200.overlper.0.75.YRQ.dat orfs.fasta
```

The final file will have ".collapsed.fromgraph.fasta" appended to its name.
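
To make the collapsing step concrete, here is a minimal sketch of how representative sequences could be pulled from a FASTA file given the summary table. It is not the code of collapsing_from_graph.pl. The function name `extract_representatives` is hypothetical, and it assumes that column 2 of the summary holds the representative ORF name (as described in Step 4) and that each FASTA ID is the first word after `>`.

```shell
# Sketch only: NOT the logic of collapsing_from_graph.pl.
# ASSUMPTIONS: whitespace-separated summary with the representative ORF in
# column 2; FASTA IDs are the first word after ">".
# Usage: extract_representatives SUMMARY_FILE FASTA_FILE
extract_representatives() {
    awk 'NR == FNR { keep[$2]; next }        # first file: remember representatives
         /^>/ { id = substr($1, 2)           # FASTA header: strip the leading ">"
                printing = (id in keep) }    # print this record only if it is kept
         printing' "$1" "$2"
}
```

The output contains one record per representative found in the FASTA file, so its number of `>` lines can be compared with the number of clusters in the summary as a quick consistency check.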
