Skip to content

single cell STARR-seq analysis suite#905

Draft
sweavs111 wants to merge 111 commits intomainfrom
cellrangerBam
Draft

single cell STARR-seq analysis suite#905
sweavs111 wants to merge 111 commits intomainfrom
cellrangerBam

Conversation

@sweavs111
Copy link
Copy Markdown
Contributor

@sweavs111 sweavs111 commented Jul 31, 2023

This is a draft of a program that does scSTARR-seq data analysis. There is nothing conceptually crazy here, but it automates the process nicely in my opinion. The input it needs is a bam file from the cellranger count pipeline. Note that you will need to run something like samtools view -hb {whole genome cellranger bam} chrSS > starrSeqReadsOnly.bam to pull out only STARR-seq reads. It parses this file and pulls out the read that cellranger has selected as "best" for that UMI. By default it will sum raw read counts per construct to create a psuedobulk count matrix. There is also an option where the program will do cell-type analysis as well if provided a table with cell barcode and cell type for that barcode. The output for this option is a table that has raw read counts for every construct in every cell type. There is an additional normalization option that works for both pseudobulk and cell-type specific mode that will do input normalization for you if you provide it with a table that has construct \t normalization factor.

There are two more niche options if you want to grab some intermediate data in the pipeline for whatever reason. You can either have the output be the sam file that is only the "best" UMI read or you can get out a list of the construct the "valid" UMI read belongs to and what cell (the cell barcode) it is found in.

I am planning on changing the name of the function to something like analyseScStarrSeq or something similar.

Known issues currently. The testing input bam stores the extra flag values as uint8 (because I created the source sam file in vim) but actual bam files store the extra flag values as int32. Tests will fail unless you manually edit the bit variable declaration to uint8, however it works with real data as is. The other test that is currently failing is the sam out. The correct reads are being kept but for some reason one of the extra tags is missing. Not sure whats going on but I am looking into it. EDIT I looked into the missing tag issue. I think the particular tag that is missing has an incompatibility with the tag.go code. Hard to explain to here but I'm happy to chat more about it.

Last thing: like I said there is nothing crazy conceptual but I was worried that the program looks confusing and convoluted especially the single-cell function. I tried to extensively use comments to let y'all know what is going on, but let me know if anything is unclear or if there is anything I can do to make my code more clear.

The other important thing to note, which is maybe relevant context for this PR, is that this analysis method I believe work well for STARR-seq libraries where all constructs are dissimilar in sequence (ie: NOT different GWAS alleles or different species versions of the same sequence). Basically, only use this for constructs were you don't have to use barcodes. Cellranger count won't incorrectly map reads between similar constructs, but if there are ambiguously mapping reads they won't be counted. Therefore, constructs with more segregating SNPs, it will be easier to map reads to that construct, inflating their values. For these type libraries, it is probably best to use samFilter.go to filter the bam and collapse UMI, then use scCount.go with a custom gtf targeting barcodes.

A useful resource is the cellranger barcoded BAM tags manual page which explains some of the tags that I am using in the program: [https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/output/bam]

@codecov
Copy link
Copy Markdown

codecov Bot commented Jul 31, 2023

Codecov Report

Attention: Patch coverage is 39.86486% with 267 lines in your changes missing coverage. Please review.

Project coverage is 56.43%. Comparing base (958914d) to head (d0d5d2a).
Report is 170 commits behind head on main.

Current head d0d5d2a differs from pull request most recent head b23516d

Please upload reports for the commit b23516d to get more accurate results.

Files Patch % Lines
cmd/cellrangerBam/cellrangerBam.go 39.86% 258 Missing and 9 partials ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main     #905       +/-   ##
===========================================
- Coverage   65.70%   56.43%    -9.28%     
===========================================
  Files         411      341       -70     
  Lines      176407    36081   -140326     
===========================================
- Hits       115916    20363    -95553     
+ Misses      58965    14302    -44663     
+ Partials     1526     1416      -110     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@sweavs111
Copy link
Copy Markdown
Contributor Author

sweavs111 commented Aug 21, 2023

image

here is the output of the -singleCellAnalysis option with the first four clusters from our 2dayBrainIUE data (only the +/- controls are shown). But things appear to be working. This is with the -normalize option which will do input normalization. I think i will now add an option for GFP normalization

@sweavs111
Copy link
Copy Markdown
Contributor Author

open question for those reviewing. There are a lot of print statements while the program is running. Are those helpful? or should I take those out. A lot of them started as sanity checks/debugging but ive left some of them in because I thought they may be useful

@sweavs111 sweavs111 changed the title draft of Cellranger bam single cell STARR-seq analysis suite Aug 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants