Data workflow

Code availability: github.com/versarchey/AraENCODE-pipeline

Filtering low-quality datasets

Low-quality reads and artificial sequences were trimmed with Fastp (Chen et al., 2018) using the following parameters: sliding-window size (-W) of 4; required mean quality within the sliding window (-M) of 20; quality threshold for a qualified base (-q) of 15; maximum percentage of unqualified bases allowed per read (-u) of 40%; maximum number of N bases allowed per read (-n) of 5; and low-complexity filter threshold (-Y) of 0.
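The parameter list above can be sketched as a command line. The following is a minimal Python sketch that only assembles the argument list and does not run Fastp; the file names are hypothetical, and single-end input is assumed for brevity. Per the Fastp documentation, -W and -M only take effect when one of the sliding-window cut operations is enabled, and -Y only when the low-complexity filter is enabled; the text does not state which of these were switched on, so none are added here.

```python
import shlex


def build_fastp_command(in_fastq, out_fastq):
    """Return the fastp argument list with the thresholds listed in the text."""
    return [
        "fastp",
        "-i", in_fastq,   # input reads (hypothetical path)
        "-o", out_fastq,  # filtered output (hypothetical path)
        "-W", "4",        # sliding-window size
        "-M", "20",       # mean quality required within the window
        "-q", "15",       # Phred quality for a base to count as qualified
        "-u", "40",       # maximum percentage of unqualified bases per read
        "-n", "5",        # maximum number of N bases per read
        "-Y", "0",        # low-complexity filter threshold
    ]


cmd = build_fastp_command("sample.fastq.gz", "sample.clean.fastq.gz")
print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids quoting problems if the list is later passed to subprocess.run.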

HiChIP pipeline

   HiChIP data were processed with the updated ChIA-PET Tool (V3) (Li et al., 2019), a Java-based package for automatic processing of ChIA-PET and HiChIP sequence data, including linker filtering, read mapping, redundancy removal, and identification of protein-binding sites and chromatin interactions.

Hi-C, in-situ Hi-C pipeline

   Hi-C data were processed with the 4DN Hi-C data processing pipeline, which includes alignment (Li and Durbin, 2009), filtering (Song et al., 2022), and matrix aggregation and normalization (Abdennur and Mirny, 2020) steps. The pairs file (.bedpe) and the Hi-C matrix in two formats (.mcool and .hic) are provided in the "Download" module. Loops were detected with HiCCUPS. Compartment analysis was performed with the Juicer Tools "eigenvector" command (Durand et al., 2016). 3D structures were reconstructed with 3DMax (Oluwadare et al., 2018).
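To illustrate what the matrix aggregation step does conceptually, the following toy Python sketch bins read pairs into a sparse contact count table at a fixed resolution. It is not the pipeline's implementation (that step relies on Cooler, as cited above); the function name, the example coordinates, and the 10 kb bin size are assumptions chosen for illustration, and normalization is not shown.

```python
from collections import Counter


def bin_pairs(pairs, bin_size):
    """Count contacts between genomic bins at a fixed resolution.

    Each pair is (chrom1, pos1, chrom2, pos2). Keys of the returned
    Counter are ((chrom, bin_index), (chrom, bin_index)) tuples.
    """
    counts = Counter()
    for chrom1, pos1, chrom2, pos2 in pairs:
        a = (chrom1, pos1 // bin_size)
        b = (chrom2, pos2 // bin_size)
        # Store every contact once, with the smaller bin first,
        # so that (a, b) and (b, a) accumulate in the same cell.
        if b < a:
            a, b = b, a
        counts[(a, b)] += 1
    return counts


# Hypothetical read pairs for illustration only.
pairs = [
    ("Chr1", 1200, "Chr1", 25400),
    ("Chr1", 26000, "Chr1", 1900),
    ("Chr1", 500, "Chr2", 7300),
]
matrix = bin_pairs(pairs, bin_size=10_000)
print(matrix)
```

The first two pairs fall into the same pair of bins and are therefore counted together, which is the essence of aggregating pairs into a contact matrix.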

ChIP-Seq pipeline

   Reads from ChIP-Seq libraries were aligned to the assembly using the mem algorithm of bwa (v0.7.17) (Li and Durbin, 2009). Mapped reads with a MAPQ score below 30 and PCR duplicates were removed using SAMtools (v1.9) (Li et al., 2009) to retain only high-quality alignments. For H3K4me3, H3K27ac, and RNAPII libraries, peaks were called in narrow-peak mode with MACS2 (v2.1.1) (Zhang et al., 2008): macs2 callpeak -t <input file> -c <control file> -f BAM -n <output peak file> -B -g <genome size>. For H3K9me2, H3K4me1, and H3K27me3 libraries, broad-peak mode was used in MACS2 with FDR < 0.1. The quality of each ChIP-Seq dataset was assessed using thresholds of RSC (relative strand cross-correlation coefficient) > 0.8 and NSC (normalized strand cross-correlation coefficient) > 1.05; FRiP (fraction of reads in peaks) was used to assess peak quality in each dataset.
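The quality criteria above can be expressed as a small check. The following Python sketch assumes that the NSC and RSC values have already been computed by a cross-correlation tool; the function names and the example numbers are hypothetical. No FRiP cutoff is given in the text, so FRiP is only computed, not thresholded.

```python
def passes_cross_correlation_qc(nsc, rsc, min_nsc=1.05, min_rsc=0.8):
    """Return True when both strand cross-correlation metrics exceed the thresholds."""
    return nsc > min_nsc and rsc > min_rsc


def frip(reads_in_peaks, total_mapped_reads):
    """Fraction of reads in peaks: reads overlapping peaks / all mapped reads."""
    if total_mapped_reads <= 0:
        raise ValueError("total_mapped_reads must be positive")
    return reads_in_peaks / total_mapped_reads


# Hypothetical values for two datasets.
good = passes_cross_correlation_qc(nsc=1.10, rsc=1.2)
poor = passes_cross_correlation_qc(nsc=1.02, rsc=1.2)
fraction = frip(reads_in_peaks=2_500_000, total_mapped_reads=10_000_000)
print(good, poor, fraction)
```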

ATAC-Seq, MNase-Seq, DNase-Seq, FAIRE-Seq pipeline

  For ATAC-Seq, MNase-Seq, DNase-Seq, and FAIRE-Seq data, two independent libraries were created for each tissue or species. Reads were aligned as described for ChIP-Seq. Peaks were identified using MACS2 (v2.1.1) with the following settings: macs2 callpeak -t <input file> --nomodel --shift -100 --extsize 200 -f BAM -n <output peak file> -B -q 0.05 -g <genome size>.
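The combination of --shift -100 and --extsize 200 is easier to see with a worked example. As described in the MACS2 documentation, the 5' end of each read is first moved by the shift value (a negative value moves it in the 3' to 5' direction) and then extended by extsize in the 5' to 3' direction. The Python sketch below reproduces that arithmetic for a single read; the function name is hypothetical, and coordinate conventions and chromosome boundaries are ignored.

```python
def macs2_fragment(five_prime, strand, shift=-100, extsize=200):
    """Return (start, end) of the pseudo-fragment built from a read's 5' end."""
    if strand == "+":
        # 5'->3' runs toward larger coordinates on the plus strand.
        start = five_prime + shift
        return (start, start + extsize)
    if strand == "-":
        # 5'->3' runs toward smaller coordinates on the minus strand.
        end = five_prime - shift
        return (end - extsize, end)
    raise ValueError("strand must be '+' or '-'")


plus_fragment = macs2_fragment(1000, "+")
minus_fragment = macs2_fragment(1000, "-")
print(plus_fragment, minus_fragment)
```

With these settings, reads on either strand yield a 200 bp window centered on the 5' end, so the signal piles up around the cut site rather than downstream of it.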

RNA-Seq pipeline

   The sequence quality of the RNA-Seq libraries was evaluated, and adapter sequences and low-quality reads were filtered using Fastp (v0.20.0) (Chen et al., 2018). The cleaned reads were mapped to the reference genome using TopHat2 (v2.1.1) (Kim et al., 2013), and gene expression was quantified using Cufflinks (v2.2.1) (Trapnell et al., 2012).
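Cufflinks reports expression in FPKM (fragments per kilobase of transcript per million mapped fragments). The Python sketch below only illustrates the definition of that unit; Cufflinks itself estimates abundances with a statistical model, so this is not its algorithm. The function name and the example numbers are hypothetical.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Hypothetical gene: 500 fragments, 2 kb transcript, 20 million mapped fragments.
value = fpkm(500, 2000, 20_000_000)
print(value)
```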

BS-Seq pipeline

   The sequence quality of the whole-genome bisulfite sequencing (WGBS) libraries was evaluated, and adapter sequences and low-quality reads were filtered using Fastp (v0.20.0) (Chen et al., 2018). The cleaned reads were then mapped to the reference genome using BatMeth2 (v2.01) (Zhou et al., 2019). Only uniquely mapped reads were used for further analysis. Individual cytosines with a coverage of at least 3 were considered for methylation-level calling.
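The coverage filter can be sketched as follows. This Python example assumes the commonly used definition of methylation level at a cytosine, namely methylated reads divided by all reads covering the site; the text itself only states the coverage threshold, and the calling is done by BatMeth2, so the function name and the formula are illustrative assumptions.

```python
def methylation_level(methylated_reads, unmethylated_reads, min_coverage=3):
    """Return the methylation level of one cytosine, or None if coverage is too low."""
    coverage = methylated_reads + unmethylated_reads
    if coverage < min_coverage:
        return None  # site is skipped: fewer than min_coverage reads
    return methylated_reads / coverage


# Hypothetical read counts at three cytosines.
well_covered = methylation_level(6, 2)
low_coverage = methylation_level(1, 1)
unmethylated = methylation_level(0, 3)
print(well_covered, low_coverage, unmethylated)
```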

References

Li, H., and Durbin, R. (2009). Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25:1754-1760.

Zhang, Y., Liu, T., Meyer, C.A., Eeckhoute, J., Johnson, D.S., Bernstein, B.E., Nusbaum, C., Myers, R.M., Brown, M., Li, W., et al. (2008). Model-based Analysis of ChIP-Seq (MACS). Genome Biology 9:R137.

Chen, S., Zhou, Y., Chen, Y., and Gu, J. (2018). fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34:i884-i890.

Kim, D., Pertea, G., Trapnell, C., Pimentel, H., Kelley, R., and Salzberg, S.L. (2013). TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biology 14:R36.

Trapnell, C., Roberts, A., Goff, L., Pertea, G., Kim, D., Kelley, D.R., Pimentel, H., Salzberg, S.L., Rinn, J.L., and Pachter, L. (2012). Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols 7:562-578.

Zhou, Q., Lim, J.-Q., Sung, W.-K., and Li, G. (2019). An integrated package for bisulfite DNA methylation data analysis with Indel-sensitive mapping. BMC Bioinformatics 20:1-11.

Servant, N., Varoquaux, N., Lajoie, B.R., Viara, E., Chen, C.-J., Vert, J.-P., Heard, E., Dekker, J., and Barillot, E. (2015). HiC-Pro: an optimized and flexible pipeline for Hi-C data processing. Genome Biology 16:259.

Song, F., Xu, J., Dixon, J., and Yue, F. (2022). Analysis of Hi-C Data for Discovery of Structural Variations in Cancer. In Hi-C Data Analysis: Methods and Protocols, S. Bicciato and F. Ferrari, eds. (Springer US: New York, NY), pp. 143-161. 10.1007/978-1-0716-1390-0_7.

Li, G., Sun, T., Chang, H., Cai, L., Hong, P., and Zhou, Q. (2019). Chromatin interaction analysis with updated ChIA-PET Tool (V3). Genes 10:554.

Abdennur, N., and Mirny, L.A. (2020). Cooler: scalable storage for Hi-C data and other genomically labeled arrays. Bioinformatics. doi: 10.1093/bioinformatics/btz540.

Durand, N.C., Shamim, M.S., Machol, I., Rao, S.S., Huntley, M.H., Lander, E.S., and Aiden, E.L. (2016). Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments. Cell Systems 3:95-98.

Oluwadare, O., Zhang, Y., and Cheng, J. (2018). A maximum likelihood algorithm for reconstructing 3D structures of human chromosomes from chromosomal contact data. BMC Genomics 19:161. 10.1186/s12864-018-4546-8.