Extended project setup

Before the first analysis, a sample table and a config file need to be set up. Unless otherwise specified, StemCNV-check looks for a sample_table.tsv (or sample_table.xlsx) and a config.yaml file. To use sample table and config files with different names (e.g. to compare the results of different settings), use the command line options --sample-table or --config.

It is recommended to start by creating a separate folder for your project. This folder should contain the raw data folder, the config.yaml and the sample table file.

Empty example files for the sample table and config can be created with this command:

stemcnv-check setup-files

Setting up the config file

The command stemcnv-check setup-files generates a basic config file template containing only the required options.

The default config file (config.yaml) created by this command defines all settings required to start the analysis and inherits all other settings from the inbuilt defaults. An extended config template containing more options can be generated with the command line option --config-details. Possible values are medium, advanced or complete. Altering these advanced parameters is not recommended for standard use, as they can change the analysis outcome! The complete config is described in the technical detail section.

Edit the config file so that all entries marked as “#REQUIRED” are filled in.

The config file (default: config.yaml) defines all settings for the analysis and inherits from the inbuilt default, as well as system-wide array definitions if those exist. While most of the settings can be left on default, the input files need to be defined. The file paths for these files need to be entered in the config under the ‘array_definition’ section.

Array definition

In this section you also need to give your array a name (that needs to match the ‘Array_Name’ column in the sample table) and define a genome version (hg19 or hg38).

Caution

Please note that the Illumina bpm and csv manifest files are also specific to a certain genome version: files for hg19 usually end in ‘A1’ and those for hg38 in ‘A2’. The egt cluster file is not build specific and can be used with any genome version.
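The naming hint in the caution above can be turned into a quick sanity check before an analysis is started. The following is a minimal sketch in Python that relies only on the ‘A1’/‘A2’ convention described here; the function names are hypothetical and not part of StemCNV-check, and file names that do not follow the convention simply yield no hint.

```python
import re
from pathlib import Path

# Convention described above: 'A1' usually means hg19, 'A2' usually means hg38.
SUFFIX_TO_BUILD = {"A1": "hg19", "A2": "hg38"}
# The config accepts both naming schemes for the genome version.
BUILD_ALIASES = {"GRCh37": "hg19", "GRCh38": "hg38"}


def manifest_build_hint(manifest_path):
    """Guess the genome build from a bpm/csv manifest file name, or return None."""
    stem = Path(manifest_path).stem  # file name without the .bpm / .csv extension
    match = re.search(r"_(A\d)$", stem)
    if match is None:
        return None
    return SUFFIX_TO_BUILD.get(match.group(1))


def manifest_matches_build(manifest_path, genome_version):
    """Return False only if the file name hints at a different genome build."""
    hint = manifest_build_hint(manifest_path)
    build = BUILD_ALIASES.get(genome_version, genome_version)
    return hint is None or hint == build
```

A result of False is only a warning sign that the manifest and the genome_version entry may not belong together; it does not replace checking the files against the information from Illumina or the array provider.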

Other array specific files mentioned in the config can be auto-generated (see the static files generation step below). The files that define the array platform are, apart from the raw data locations, the primary required settings; they cannot have defaults in the config file created by the setup-files command.

  • ‘ExampleArray’ should be renamed to the actual array name

  • genome_version options are: hg38/GRCh38 or hg19/GRCh37

Define files specific to the used array platform and genome build:

  • egt_cluster_file: the Illumina cluster file (.egt) for the array platform, available from Illumina or the provider running the array

  • bpm_manifest_file: the beadpool manifest file (.bpm) for the array platform, available from Illumina or the provider running the array

  • csv_manifest_file (optional): the manifest file in csv format, available from Illumina or the provider running the array

  • raw_data_folder: path to the input directory under which the raw data (.idat) can be found. idat files should be grouped in one subfolder per array chip, and the subfolder names should match the Chip_Name column in the sample table (the array chip IDs, i.e. the sentrix name).

  • data_path: the output of StemCNV-check will be written to this path

  • log_path: output folder for logs; StemCNV-check will write its log files to this path

array_definition:
  GSAMD-24v3-0:
    genome_version: 'hg19'
    bpm_manifest_file: '../cluster-manifest-data/GSAMD-24v3/gh19/GSAMD-24v3-0-EA_20034606_A1.bpm'
    egt_cluster_file: '../cluster-manifest-data/GSAMD-24v3/gh19/GSAMD_24v3-0_A1-LAB-2235HiQ-Samples.egt'
    csv_manifest_file: '../cluster-manifest-data/GSAMD-24v3/gh19/GSAMD-24v3-0-EA_20034606_A1.csv'
    penncnv_pfb_file: '__cache-default__'
    penncnv_GCmodel_file: '__cache-default__'
    array_density_file: '__cache-default__'
    array_gaps_file: '__cache-default__'

raw_data_folder: '../RAW_DATA'
data_path: data
log_path: logs

reports:
  StemCNV-check-report:
    file_type: 'html'
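Because raw_data_folder has to contain one subfolder per chip, it can be worth checking the layout before starting an analysis. The sketch below is one way to do that in Python and is not part of StemCNV-check. It assumes the common Illumina naming scheme of two files per sample, <Chip_Name>_<Chip_Pos>_Grn.idat and <Chip_Name>_<Chip_Pos>_Red.idat; the function names are hypothetical.

```python
from pathlib import Path


def expected_idat_files(raw_data_folder, chip_name, chip_pos):
    """Paths of the two idat files expected for one sample (assumed naming)."""
    chip_dir = Path(raw_data_folder) / chip_name  # one subfolder per chip
    # Assumption: files are named '<Chip_Name>_<Chip_Pos>_Grn.idat' and '..._Red.idat'.
    return [chip_dir / f"{chip_name}_{chip_pos}_{channel}.idat" for channel in ("Grn", "Red")]


def missing_idat_files(raw_data_folder, samples):
    """Return the expected idat paths that do not exist on disk.

    `samples` is an iterable of (Chip_Name, Chip_Pos) pairs taken from the sample table.
    """
    missing = []
    for chip_name, chip_pos in samples:
        for path in expected_idat_files(raw_data_folder, chip_name, chip_pos):
            if not path.is_file():
                missing.append(path)
    return missing
```

With the example config above, calling missing_idat_files("../RAW_DATA", [("207521920117", "R09C02")]) would list the files that are expected under ../RAW_DATA/207521920117 but not found.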

Setting up the sample table

The sample table (default: sample_table.tsv) describes all samples to be analyzed. Both tsv (tab-separated) and Excel (xlsx) formats are supported.

The default format of the sample table is tsv. If you prefer to use an xlsx file here you can create an example by using:

stemcnv-check setup-files --sampletable-format xlsx

You can also use your own Excel file, if the following criteria are met:

  • The actual sample table is in the first sheet of the file and this sheet only contains columns for the sample table (optionally with commented lines starting with a ‘#’)

  • All required columns are present and correctly named (the order of columns is not important)

  • It is possible to deviate from the standard column names, but the expected column names need to be contained in the actual column names and there needs to be a single, unambiguous way to extract them (via regex).

    • In this case you need to use the --column-remove-regex option to tell the pipeline how to modify your column names to derive the expected names. If the option is used without an explicit regex (supplying one is meant for expert users), spaces and anything following them will be removed from your column names.

    • A simple example that works with the default behaviour of --column-remove-regex would be column names such as: ‘Sample_ID for pipeline’, ‘Chip_Name (Sentrix Barcode)’, ‘Chip_Pos (Sentrix Position)’
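The effect of removing spaces and everything after them can be illustrated with a few lines of Python. This is only a sketch of the behaviour described above: the pattern below is an assumption chosen to reproduce it, and the regex used internally by StemCNV-check may be written differently.

```python
import re

# Assumption: this pattern mimics the described default
# ("spaces and anything following them will be removed").
DEFAULT_REMOVE_PATTERN = r" .*"


def normalize_column(name, remove_pattern=DEFAULT_REMOVE_PATTERN):
    """Remove the part of a column name that matches remove_pattern."""
    return re.sub(remove_pattern, "", name)


headers = ["Sample_ID for pipeline", "Chip_Name (Sentrix Barcode)", "Chip_Pos (Sentrix Position)"]
print([normalize_column(header) for header in headers])
# prints ['Sample_ID', 'Chip_Name', 'Chip_Pos']
```

Trying a custom regex on your own header names in this way, before passing it to --column-remove-regex, helps confirm that every column reduces to exactly one expected name.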

Filling in the sample table with your data

Required Columns:
Sample_ID, Chip_Name, Chip_Pos, Array_Name, Sex, Reference_Sample, Regions_of_Interest, Sample_Group

Specific explanations for columns:
  • Sample_ID

    The folder and file names for samples are derived from this entry. All entries must be unique. To prevent issues with file names, only alphanumeric characters (all letters and numbers) and the characters - and _ (dash and underscore) are allowed by default.

  • Chip_Name and Chip_Pos

    These entries must match the Sentrix name (usually a 12 digit number) and position (usually R..C..) on the Illumina array

  • Array_Name

    The name of the array used for the sample. This needs to match one of the arrays defined in the config under array_definition

  • Sex

    The sex of the sample is needed for the analysis and is mandatory. Allowed values are: f, female, m and male (not case sensitive)

  • Reference_Sample

    This column should refer to the (exact) Sample_ID of the reference sample (e.g. a parental fibroblast line or master bank). If there is no usable or applicable reference sample (e.g. for fibroblast samples), the entry should be left empty. Reference samples are assumed to be the clonal “parents” of a sample.

  • Regions_of_Interest

    This column can define sample specific regions of interest (e.g. gene edited sites). If none exist, the entry can be left empty, but the column must still be included in the sample table. The syntax for regions of interest is NAME|region; the NAME| part is optional and mainly useful for labeling or describing the region. The region part is mandatory and can be one of the following:

    1. Position: “chrN:start-end”: chrN can be e.g. chr3 or just 3; start and end are coordinates (which are genome build specific!)

    2. Genomic band: e.g. “4q21.3”: a cytogenetic band; both full bands (q21) and subbands (q21.3) are allowed

    3. Gene symbol: e.g. “TP53”: the gene name (or symbol) needs to exactly match the reference annotation (UCSC gtf). The validity of gene symbols can be checked using the HGNC Multi-symbol checker

    Multiple regions for a single sample should all be in one column entry and be separated by a ;

  • Sample_Group

    This column can be used for grouping of related samples. By default all samples within the same group will be included in sample comparison based on SNP clustering.
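The Regions_of_Interest syntax described above (entries separated by ;, each with an optional NAME| prefix) can be sketched as a small parser. The Python code below only illustrates the syntax and is not the parser used by StemCNV-check. The function names are hypothetical, the patterns assume chromosomes 1-22, X and Y, and gene symbols are not checked against any annotation.

```python
import re

# Assumed patterns: chromosomes 1-22, X or Y, with an optional 'chr' prefix.
POSITION = re.compile(r"(?:chr)?(\d{1,2}|X|Y):(\d+)-(\d+)")
# Cytogenetic band with chromosome, e.g. '4q21' or '4q21.3'.
BAND = re.compile(r"(\d{1,2}|X|Y)[pq]\d+(\.\d+)?")


def parse_region(entry):
    """Split one 'NAME|region' entry and classify the region part."""
    name, _, region = entry.strip().rpartition("|")  # name is '' if there is no '|'
    region = region.strip()
    if not region:
        raise ValueError(f"empty region in entry {entry!r}")
    label = name.strip() or None
    position = POSITION.fullmatch(region)
    if position:
        chrom, start, end = position.groups()
        return {"name": label, "type": "position", "chrom": chrom,
                "start": int(start), "end": int(end)}
    if BAND.fullmatch(region):
        return {"name": label, "type": "band", "band": region}
    # Anything else is treated as a gene symbol (not validated here).
    return {"name": label, "type": "gene", "symbol": region}


def parse_regions_of_interest(cell):
    """Parse a full column entry with ';'-separated regions."""
    return [parse_region(part) for part in cell.split(";") if part.strip()]


# 'edit_site' and its coordinates are made up for illustration.
for region in parse_regions_of_interest("edit_site|chr3:1000-2000;4q21.3;TP53"):
    print(region["type"], region["name"])
```

Running a check like this over the column before an analysis can catch entries with a missing region part or a stray separator.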

Example Sample table

Sample_ID               | Chip_Name    | Chip_Pos | Array_Name   | Sex    | Reference_Sample        | Regions_of_Interest | Sample_Group
HG001                   | 207521920117 | R09C02   | ExampleArray | female | donor_fibroblasts_HG001 |                     | Group1
donor_fibroblasts_HG001 | 207521920117 | R05C02   | ExampleArray | male   |                         |                     | Group1
hESC_1                  | 207521920117 | R07C02   | ExampleArray | female |                         | 4q21.3              |
HG005                   | 207521920117 | R01C02   | ExampleArray | male   | HG006                   |                     |
HG006                   | 207521920117 | R03C02   | ExampleArray | male   |                         |                     |
HG007                   | 207521920117 | R11C02   | ExampleArray | female |                         |                     |
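The column rules described above (required columns, unique Sample_ID values with a restricted character set, allowed Sex values, and Reference_Sample entries that point to an existing Sample_ID) can be checked before an analysis is started. The following Python sketch shows one way to do this for a tsv sample table. It is not the validation performed by StemCNV-check itself, and the function name is hypothetical.

```python
import csv
import re
from collections import Counter

REQUIRED_COLUMNS = [
    "Sample_ID", "Chip_Name", "Chip_Pos", "Array_Name",
    "Sex", "Reference_Sample", "Regions_of_Interest", "Sample_Group",
]
ALLOWED_SEX = {"f", "female", "m", "male"}
SAMPLE_ID_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def validate_sample_table(tsv_text):
    """Return a list of problems found in a tab-separated sample table."""
    reader = csv.DictReader(tsv_text.splitlines(), delimiter="\t")
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        return ["missing required column(s): " + ", ".join(missing)]

    rows = list(reader)
    id_counts = Counter(row["Sample_ID"] for row in rows)
    problems = [f"duplicate Sample_ID: {sid}" for sid, n in id_counts.items() if n > 1]
    for row in rows:
        sid = row["Sample_ID"] or ""
        if not SAMPLE_ID_PATTERN.fullmatch(sid):
            problems.append(f"invalid characters in Sample_ID: {sid!r}")
        if (row["Sex"] or "").lower() not in ALLOWED_SEX:
            problems.append(f"{sid}: Sex must be one of f, female, m, male")
        reference = row["Reference_Sample"] or ""
        if reference and reference not in id_counts:
            problems.append(f"{sid}: Reference_Sample {reference!r} is not a Sample_ID")
    return problems


# First two rows of the example sample table.
EXAMPLE = (
    "Sample_ID\tChip_Name\tChip_Pos\tArray_Name\tSex\tReference_Sample\tRegions_of_Interest\tSample_Group\n"
    "HG001\t207521920117\tR09C02\tExampleArray\tfemale\tdonor_fibroblasts_HG001\t\tGroup1\n"
    "donor_fibroblasts_HG001\t207521920117\tR05C02\tExampleArray\tmale\t\t\tGroup1\n"
)
print(validate_sample_table(EXAMPLE))  # prints []
```

An empty list means that none of the checked rules is violated; it does not guarantee that the chip names and positions match the raw data.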

Static files generation

This step takes place once raw data for at least one sample on the array is available and both the sample table and the config file have been set up.

Array and genome-build specific static files are generated automatically with this command:

stemcnv-check make-staticdata

Note

This step also includes the download of the fasta and gtf files for the reference genome build. Array specific files and an updated array_definition block for the config will be written into the cache directory (default: ~/.cache/stemcnv-check).

StemCNV-check generally requires two types of static data files: those that are specific to the genome version (incl. the genome reference sequence) and those that are specific to the array platform. All of these files can be downloaded or generated by StemCNV-check using the stemcnv-check make-staticdata command; however, array specific files can only be created if raw data for at least one sample is available. Genome version specific files are usually downloaded only once and saved in a central cache location, so they should already be available after running the example data. The files specific to an array platform are also saved to this central cache, so that they can be shared between different projects. Additionally, an updated array definition block for the config is written to the cache, so that the array definition is also saved. However, array definitions from a project specific config file still take precedence over the central definitions.

To create the array specific files, follow these steps:

  • Make sure that the sample table and config file are correctly set up, with all required entries filled in

  • Run stemcnv-check make-staticdata

    • This command will download missing genome specific files from the internet

    • Then it will generate the array specific files, which also requires processing the raw data from at least one sample.

    Tip

    If you already have a genome reference fasta on your system you can also use that, instead of downloading a second one. To do so you need to provide the path to the fasta file for the corresponding genome version in the ‘global_settings’ block of the config file. This section will only be included in the config if you use at least the --config-details medium flag for the setup-files command.

This command will also print out the paths to the generated array specific files. You can either copy these paths into your project specific config file to use a complete array definition, or you can simply remove the array definition block and rely on the automatically saved central definitions.