.. _steps-ext-project-setup:

Extended project setup
^^^^^^^^^^^^^^^^^^^^^^

Before the first analysis, a sample table and a config file need to be set up.
Unless otherwise specified, StemCNV-check looks for a ``sample_table.tsv`` (or ``sample_table.xlsx``) and a ``config.yaml`` file by default.
To use sample table and config files with different names (e.g. to compare the results of different settings), the command line options ``--sample-table`` and ``--config`` can be used.

It is recommended to start by **creating a separate folder** for your project. This folder should include the raw data folder, the config.yaml and the sample table files.

Empty example files for the sample table and config can be created with this command:

.. code:: bash

   stemcnv-check setup-files

.. _steps-ext-config:

Setting up the config file
==========================

The command ``stemcnv-check setup-files`` generates a basic config file template containing only the required options.
The default config file (config.yaml) created by this command defines all settings required to start the analysis and inherits all other settings from the inbuilt defaults, as well as from system-wide array definitions if those exist.

By using the command line option ``--config-details``, an extended config template containing more options can be generated. Possible parameters are ``medium``, ``advanced`` or ``complete``.
**Altering these advanced parameters is not recommended for standard use as they can change the analysis outcome!**

| The complete config is described in the :ref:`technical detail section `

**Edit the config file** so that all entries marked as ``#REQUIRED`` are filled in.
While most of the settings can be left on default, the input files need to be defined. The file paths for these files need to be entered in the config under the 'array_definition' section.
**Array definition**

In this section you also need to give your array a name (which needs to match the 'Array_Name' column in the sample table) and define a genome version (hg19 or hg38).

.. caution::

   Please note that the Illumina bpm and csv manifest files are also specific to a certain genome version. Usually, files for hg19 end in 'A1' and those for hg38 end in 'A2' (the egt cluster file is not specific and can be used for any genome version).

Other array specific files mentioned in the config can be auto-generated (see the static files generation step below).

The files defining the array platform are the primary required settings apart from the raw data locations; they cannot have predefined defaults in the config file created by the setup-files command.

- **'ExampleArray'** should be renamed to the actual array name
- **genome_version** options are: hg38/GRCh38 or hg19/GRCh37

**Define files specific to the used array platform and genome build:**

- **egt_cluster_file**: the Illumina cluster file (.egt) for the array platform, available from Illumina or the provider running the array
- **bpm_manifest_file**: the beadpool manifest file (.bpm) for the array platform, available from Illumina or the provider running the array
- **csv_manifest_file** (optional): the manifest file in csv format, available from Illumina or the provider running the array
- **raw_data_folder**: input folder; the path to the directory under which the raw data (.idat) can be found. This folder should contain subfolders that match the Chip_Name column in the sample table (containing the array chip IDs). **idat files should be grouped in a subfolder per array-chip (sentrix_name).**
- **data_path**: output folder; the results of StemCNV-check will be written to this path
- **log_path**: output folder; the log files of StemCNV-check will be written to this path

.. code:: yaml

   array_definition:
     GSAMD-24v3-0:
       genome_version: 'hg19'
       bpm_manifest_file: '../cluster-manifest-data/GSAMD-24v3/gh19/GSAMD-24v3-0-EA_20034606_A1.bpm'
       egt_cluster_file: '../cluster-manifest-data/GSAMD-24v3/gh19/GSAMD_24v3-0_A1-LAB-2235HiQ-Samples.egt'
       csv_manifest_file: '../cluster-manifest-data/GSAMD-24v3/gh19/GSAMD-24v3-0-EA_20034606_A1.csv'
       penncnv_pfb_file: '__cache-default__'
       penncnv_GCmodel_file: '__cache-default__'
       array_density_file: '__cache-default__'
       array_gaps_file: '__cache-default__'

   raw_data_folder: '../RAW_DATA'
   data_path: data
   log_path: logs

   reports:
     StemCNV-check-report:
       file_type: 'html'

.. _steps-ext-sampletable:

Setting up the sample table
===========================

The sample table (default: sample_table.tsv) is a tab-separated file describing all samples to be analyzed.

**Excel or tsv** formats are supported. The default format of the sample table is tsv. If you prefer to use an xlsx file, you can create an example with:

``stemcnv-check setup-files --sampletable-format xlsx``

You can also use your own Excel file if the following criteria are met:

- The actual sample table is in the first sheet of the file and this sheet *only* contains columns for the sample table (optionally with commented lines starting with a '#')
- All required columns are present and correctly named (the order of columns is not important)
- It is possible to deviate from the standard column names, but the expected column names need to be contained in the actual column names and there needs to be a single, unambiguous way to extract them (via regex).

  - In this case you need to use the ``--column-remove-regex`` option to tell the pipeline how to modify your column names to derive the expected names. If the option is used without an explicit regex (providing one is meant for expert users), spaces and anything following them will be removed from your column names.
  - A simple example for using the ``--column-remove-regex`` option with its default behaviour would be column names such as: 'Sample_ID for pipeline', 'Chip_Name (Sentrix Barcode)', 'Chip_Pos (Sentrix Position)'

Filling in the sample table with your data
------------------------------------------

| **Required Columns**:
| *Sample_ID, Chip_Name, Chip_Pos, Array_Name, Sex, Reference_Sample, Regions_of_Interest, Sample_Group*
|
| Specific explanations for columns:

- Sample_ID
    The folder and file names for samples are derived from this entry. All entries *must* be unique.
    To prevent issues with filenames, only alphanumeric characters (all letters and numbers) and the characters ``-`` and ``_`` (dash and underscore) are allowed by default.
- Chip_Name and Chip_Pos
    These entries must match the Sentrix name (usually a 12 digit number) and position (usually ``R..C..``) on the Illumina array.
- Array_Name
    The name of the array used for the sample. This needs to match one of the arrays defined in the config under ``array_definition``.
- Sex
    The sex of the sample is needed for the analysis and is mandatory. Allowed values are: ``f``, ``female``, ``m`` and ``male`` (not case sensitive).
- Reference_Sample
    This column should refer to the (exact) Sample_ID of the reference sample (e.g. a parental fibroblast line or master bank). If there is no usable or applicable reference sample, the entry should be empty (e.g. for fibroblast samples).
    Reference samples are assumed to be the clonal "parents" of a sample.
- Regions_of_Interest
    This column can define sample specific regions of interest (e.g. gene edited sites). If none exist it can be left empty, but the column must still be included in the sample table.
    The syntax for regions of interest is ``NAME|region``; the ``NAME|`` part is optional and mainly useful for labeling or describing the region.
    The ``region`` part is mandatory and can be one of the following:

    1) **Position:** "chrN:start-end": ``chrN`` can be e.g. ``chr3`` or just ``3``, start and end are coordinates (which are genome build specific!)
    2) **Genomic band:** e.g. "4q21.3": a cytogenetic band; both full bands (q21) and subbands (q21.3) are allowed
    3) **Gene symbol:** e.g. "TP53": the gene name (or symbol) needs to exactly match the reference annotation (UCSC gtf). The validity of gene symbols can be checked using the `HGNC Multi-symbol checker `_

    Multiple regions for a single sample should all be in one column entry and be separated by a ``;``
- Sample_Group
    This column can be used for grouping related samples. By default, all samples within the same group will be included in the sample comparison based on SNP clustering.

.. list-table:: Example Sample table
   :widths: 15 15 10 10 10 10 10 10
   :header-rows: 1

   * - Sample_ID
     - Chip_Name
     - Chip_Pos
     - Array_Name
     - Sex
     - Reference_Sample
     - Regions_of_Interest
     - Sample_Group
   * - HG001
     - 207521920117
     - R09C02
     - ExampleArray
     - female
     - donor_fibroblasts_HG001
     -
     - Group1
   * - donor_fibroblasts_HG001
     - 207521920117
     - R05C02
     - ExampleArray
     - male
     -
     -
     - Group1
   * - hESC_1
     - 207521920117
     - R07C02
     - ExampleArray
     - female
     -
     - 4q21.3
     -
   * - HG005
     - 207521920117
     - R01C02
     - ExampleArray
     - male
     -
     -
     - HG006
   * - HG006
     - 207521920117
     - R03C02
     - ExampleArray
     - male
     -
     -
     -
   * - HG007
     - 207521920117
     - R11C02
     - ExampleArray
     - female
     -
     -
     -

.. _steps-ext-staticdata:

Static files generation
=======================

This step takes place once the sample data for the array is available and the sample table and config file have been set up.

**Array & genome-build specific static files** are generated automatically.

.. code:: bash

   stemcnv-check make-staticdata

.. note::

   This step will also include the **download of fasta and gtf** files for the reference genome build.

Array specific files and an updated array_definition block for the config will be written into the cache directory (default: ``~/.cache/stemcnv-check``).

StemCNV-check generally requires two types of static data files: those that are specific to the genome version (incl. the genome reference sequence) and those that are specific to the array platform.
All of these files can be downloaded or generated by StemCNV-check using the ``stemcnv-check make-staticdata`` command; however, array specific files can only be created if raw data for at least one sample is available.

Usually, genome version specific files are only downloaded once and saved in a central cache location, so they should already be available after running the example data.
The files specific to an array platform are also saved to this central cache, so that they can be shared between different projects.
Additionally, an updated array definition block for the config is written to the cache, so that the array definition is also saved. However, array definitions from a project specific config file will still take precedence over the central definitions.

To create the array specific files, follow these steps:

- Make sure that the sample table and config file, with all required entries, are correctly set up
- Run ``stemcnv-check make-staticdata``

  - This command will download missing genome specific files from the internet
  - Then it will generate the array specific files, which also requires processing the raw data from at least one sample.

.. tip::

   If you already have a genome reference fasta on your system, you can use that instead of downloading a second one. To do so, you need to provide the path to the fasta file for the corresponding genome version in the 'global_settings' block of the config file. This section will only be included in the config if you use at least the ``--config-details medium`` flag for the setup-files command.

The ``make-staticdata`` command will also print out the paths to the generated array specific files.
You can either copy these paths into your project specific config file to use a complete array definition, or you can simply remove the array definition block and rely on the automatically saved central definitions.
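
Since ``stemcnv-check make-staticdata`` expects a correctly filled sample table, it can help to check a tsv sample table against the column rules described above before running it. The following is a minimal sketch and not StemCNV-check's own validation: the function names ``split_regions`` and ``check_sample_table`` are hypothetical, only the tsv format is handled, and skipping lines that start with ``#`` in a tsv file is an assumption.

.. code:: python

   import csv
   import re
   from collections import Counter

   # Required column names as listed in the sample table section above.
   REQUIRED_COLUMNS = [
       "Sample_ID", "Chip_Name", "Chip_Pos", "Array_Name",
       "Sex", "Reference_Sample", "Regions_of_Interest", "Sample_Group",
   ]
   ALLOWED_SEX = {"f", "female", "m", "male"}
   # Letters, digits, dash and underscore only (the documented default).
   SAMPLE_ID_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


   def split_regions(entry):
       """Split a Regions_of_Interest entry into (name, region) pairs.

       Entries are separated by ';' and the 'NAME|' prefix is optional,
       so name is None when no '|' is present.
       """
       pairs = []
       for item in entry.split(";"):
           item = item.strip()
           if not item:
               continue
           name, sep, region = item.rpartition("|")
           pairs.append((name if sep else None, region))
       return pairs


   def check_sample_table(tsv_text):
       """Return a list of problems found in the text of a tsv sample table."""
       # Assumption: blank lines and lines starting with '#' are ignored.
       lines = [ln for ln in tsv_text.splitlines()
                if ln.strip() and not ln.startswith("#")]
       reader = csv.DictReader(lines, delimiter="\t")
       missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
       if missing:
           return ["missing columns: " + ", ".join(missing)]

       rows = list(reader)
       ids = [(row.get("Sample_ID") or "").strip() for row in rows]
       problems = []
       for sample_id, count in Counter(ids).items():
           if count > 1:
               problems.append(f"Sample_ID {sample_id!r} is used {count} times")
       for row, sample_id in zip(rows, ids):
           if not SAMPLE_ID_PATTERN.fullmatch(sample_id):
               problems.append(f"{sample_id!r}: Sample_ID contains disallowed characters")
           sex = (row.get("Sex") or "").strip().lower()
           if sex not in ALLOWED_SEX:
               problems.append(f"{sample_id!r}: Sex must be f, female, m or male")
           reference = (row.get("Reference_Sample") or "").strip()
           if reference and reference not in ids:
               problems.append(f"{sample_id!r}: Reference_Sample {reference!r} is not a Sample_ID")
           for _name, region in split_regions(row.get("Regions_of_Interest") or ""):
               if not region:
                   problems.append(f"{sample_id!r}: region of interest without a region part")
       return problems

Calling ``check_sample_table(open("sample_table.tsv").read())`` returns an empty list if none of these checks fail. The sketch does not verify that a region is a valid position, band or gene symbol, nor that ``Array_Name`` matches an entry under ``array_definition`` in the config.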