
Sequence Manipulation

Robert J. Gifford edited this page Sep 30, 2024 · 7 revisions

1. Importing Sequence Data using the import source Command

The default way to import sequence data into a GLUE project is the import source command. This command expects a folder containing individual sequence files, in either FASTA or GenBank XML format. For FASTA files, the sequence header is used as the sequence ID, so it is recommended to name each file after its sequence ID to keep the folder clear and consistent.

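In practice, import source is usually issued from a project build script rather than typed interactively. The example below runs such a script from the root mode; the import source step is assumed to be one of the commands inside buildCoreProject.glue:

   Mode path: /
   GLUE> run file buildCoreProject.glue

When import source runs, GLUE imports the sequences from the specified folder, recognizes each file's format (in this case, GenBank XML), and assigns the sequenceID based on the file contents. If the files were in FASTA format, the sequence header would be used as the sequenceID.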

Keeping one file per sequence identifier makes sequence data easier to manage, especially in large datasets.
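The folder layout described above (one file per sequence, named by sequence ID) can be prepared from a multi-record FASTA file before running import source. The following is a minimal Python sketch, not part of GLUE. The function names parse_fasta and write_source_folder are hypothetical, and it assumes that the sequence ID is the first whitespace-delimited token of each header and that this token is safe to use as a file name.

```python
from pathlib import Path


def parse_fasta(text):
    """Return a list of (sequence_id, sequence) tuples from FASTA text.

    Assumption: the sequence ID is the first whitespace-delimited
    token of the header line.
    """
    records = []
    seq_id, chunks = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if seq_id is not None:
                records.append((seq_id, "".join(chunks)))
            tokens = line[1:].split()
            if not tokens:
                raise ValueError("FASTA header with no sequence ID")
            seq_id, chunks = tokens[0], []
        elif seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        records.append((seq_id, "".join(chunks)))
    return records


def write_source_folder(records, out_dir):
    """Write one single-record FASTA file per sequence, named <ID>.fasta.

    The header of each file is exactly the sequence ID, so the file name
    and the ID taken from the header agree.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for seq_id, sequence in records:
        path = out_dir / f"{seq_id}.fasta"
        path.write_text(f">{seq_id}\n{sequence}\n")
        paths.append(path)
    return paths
```

Calling write_source_folder(parse_fasta(Path("all.fasta").read_text()), "sources/my-source") would then produce a folder suitable as input for import source; both path names here are placeholders.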

2. Other Ways to Import Sequence Data

Module Type: fastaImporter

The fastaImporter module allows you to import nucleotide data from a FASTA file, creating a set of Sequence objects.

  • Type-Specific Commands:

    • import: Imports sequences from a FASTA file.
  • Usage Example:

GLUE> import path/to/sequences.fasta

General Module Mode Commands: In addition to the import command, all general module mode commands are available for the fastaImporter module.

3. Exporting Sequence Data

Exporting sequence data from GLUE lets researchers save their results in various formats for further analysis or sharing. The primary method for exporting sequences is the fastaExporter module.

Using the fastaExporter Module

The fastaExporter module provides a command for exporting sequences to a FASTA file. Below are the command options and an example usage:

  • Command Syntax:
export (-w <whereClause> | -a) [-o <offset> -b <batchSize>] [-y <lineFeedStyle>] [-r] [-t] (-p | -f <fileName>)
  • Options:

    • -w <whereClause> or --whereClause <whereClause>: Qualifies the sequences to be exported based on specified criteria.
    • -a or --allSequences: Exports all sequences in the project.
    • -o <offset> or --offset <offset>: Paged query offset for batch processing.
    • -b <batchSize> or --batchSize <batchSize>: Number of sequences to export in each batch.
    • -y <lineFeedStyle> or --lineFeedStyle <lineFeedStyle>: Specifies the line feed style (LF or CRLF).
    • -r or --suppressReverseComplement: Suppresses the reverse complement of sequences in the output.
    • -t or --suppressRotation: Suppresses the rotation of sequences in the output.
    • -p or --preview: Displays a preview of the output without saving the file.
    • -f <fileName> or --fileName <fileName>: Name of the output FASTA file.

Example Command:

To export DENV sequences belonging to a specific major lineage and preview the output, you can use the following command:

GLUE> module fastaExporter export -w "major_lineage = '1V_E'" -p

In this example, the -w option filters the sequences to include only those with major_lineage equal to '1V_E', and the -p option previews the output instead of saving it.
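The command syntax above requires exactly one of -w or -a, exactly one of -p or -f, and pairs -o with -b. When export commands are generated by a script, these constraints can be checked before the command reaches GLUE. The following is a minimal Python sketch under that reading of the syntax line; build_export_command is a hypothetical helper, not a GLUE API, and it simply refuses where clauses containing a double quote because the escaping rules are not covered here.

```python
def build_export_command(
    where=None,
    all_sequences=False,
    file_name=None,
    preview=False,
    offset=None,
    batch_size=None,
    line_feed_style=None,
    suppress_reverse_complement=False,
    suppress_rotation=False,
    module_name="fastaExporter",
):
    """Assemble an export command string following the documented syntax."""
    # (-w <whereClause> | -a): exactly one selector.
    if (where is not None) == all_sequences:
        raise ValueError("give exactly one of where (-w) or all_sequences (-a)")
    # (-p | -f <fileName>): exactly one output target.
    if (file_name is not None) == preview:
        raise ValueError("give exactly one of file_name (-f) or preview (-p)")
    # [-o <offset> -b <batchSize>]: the two options appear together.
    if (offset is None) != (batch_size is None):
        raise ValueError("offset (-o) and batch_size (-b) must be given together")
    if line_feed_style not in (None, "LF", "CRLF"):
        raise ValueError("line_feed_style must be LF or CRLF")
    # Assumption: quoting rules for embedded double quotes are unknown.
    if where is not None and '"' in where:
        raise ValueError("double quotes in the where clause are not handled")

    parts = ["module", module_name, "export"]
    parts.append("-a" if all_sequences else f'-w "{where}"')
    if offset is not None:
        parts += ["-o", str(offset), "-b", str(batch_size)]
    if line_feed_style is not None:
        parts += ["-y", line_feed_style]
    if suppress_reverse_complement:
        parts.append("-r")
    if suppress_rotation:
        parts.append("-t")
    parts.append("-p" if preview else f"-f {file_name}")
    return " ".join(parts)


# Reproduces the example command shown above.
print(build_export_command(where="major_lineage = '1V_E'", preview=True))
```

The resulting string can be pasted at the GLUE prompt or written into a .glue script file.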

Export Location:

When -f is used, the output file name is resolved relative to the current load/save directory in GLUE.

Additional Notes:

  • The whereClause selects sequences by their attribute values (as with major_lineage in the example above), so an export can be restricted to any subset that the sequence attributes describe.
  • For large datasets, the batch size (-b) and offset (-o) options page through the query, which helps manage memory usage and improve performance during the export.
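To illustrate the paging options, the following Python sketch computes the -o/-b pairs needed to cover a dataset of known size. It is an illustration only: paged_export_options is a hypothetical helper, the offset is assumed to be zero-based, and writing each page to its own numbered file is an assumption made here to keep pages from overwriting each other, not documented GLUE behaviour.

```python
def paged_export_options(total, batch_size, stem="export"):
    """Return one option string per page needed to cover `total` sequences.

    Assumptions: offsets are zero-based, and a final page may request
    more sequences than remain.
    """
    if total < 0 or batch_size <= 0:
        raise ValueError("total must be >= 0 and batch_size must be > 0")
    return [
        f"-o {offset} -b {batch_size} -f {stem}_{page:03d}.fasta"
        for page, offset in enumerate(range(0, total, batch_size), start=1)
    ]


# Five sequences in pages of two need three pages.
for options in paged_export_options(5, 2):
    print(options)
```

Each option string would be appended to an export command that already carries its -w or -a selector.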

By utilizing the fastaExporter module, researchers can efficiently export sequence data from their GLUE projects for further analysis, sharing, or integration with other bioinformatics tools.
