Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #905 +/- ##
===========================================
- Coverage 65.70% 56.43% -9.28%
===========================================
Files 411 341 -70
Lines 176407 36081 -140326
===========================================
- Hits 115916 20363 -95553
+ Misses 58965 14302 -44663
+ Partials 1526 1416 -110 ☔ View full report in Codecov by Sentry. |
…provided. Changed CB naming conventions for mulitiple GEM wells
|
here is the output of the -singleCellAnalysis option with the first four clusters from our 2dayBrainIUE data (only the +/- controls are shown). But things appear to be working. This is with the -normalize option which will do input normalization. I think i will now add an option for GFP normalization |
|
open question for those reviewing. There are a lot of print statements while the program is running. Are those helpful? or should I take those out. A lot of them started as sanity checks/debugging but ive left some of them in because I thought they may be useful |

This is a draft of a program that does scSTARR-seq data analysis. There is nothing conceptually crazy here, but it automates the process nicely in my opinion. The input it needs is a bam file from the cellranger count pipeline. Note that you will need to run something like
samtools view -hb {whole genome cellranger bam} chrSS > starrSeqReadsOnly.bamto pull out only STARR-seq reads. It parses this file and pulls out the read that cellranger has selected as "best" for that UMI. By default it will sum raw read counts per construct to create a psuedobulk count matrix. There is also an option where the program will do cell-type analysis as well if provided a table with cell barcode and cell type for that barcode. The output for this option is a table that has raw read counts for every construct in every cell type. There is an additional normalization option that works for both pseudobulk and cell-type specific mode that will do input normalization for you if you provide it with a table that has construct \t normalization factor.There are two more niche options if you want to grab some intermediate data in the pipeline for whatever reason. You can either have the output be the sam file that is only the "best" UMI read or you can get out a list of the construct the "valid" UMI read belongs to and what cell (the cell barcode) it is found in.
I am planning on changing the name of the function to something like analyseScStarrSeq or something similar.
Known issues currently. The testing input bam stores the extra flag values as uint8 (because I created the source sam file in vim) but actual bam files store the extra flag values as int32. Tests will fail unless you manually edit the bit variable declaration to uint8, however it works with real data as is. The other test that is currently failing is the sam out. The correct reads are being kept but for some reason one of the extra tags is missing. Not sure whats going on but I am looking into it. EDIT I looked into the missing tag issue. I think the particular tag that is missing has an incompatibility with the tag.go code. Hard to explain to here but I'm happy to chat more about it.
Last thing: like I said there is nothing crazy conceptual but I was worried that the program looks confusing and convoluted especially the single-cell function. I tried to extensively use comments to let y'all know what is going on, but let me know if anything is unclear or if there is anything I can do to make my code more clear.
The other important thing to note, which is maybe relevant context for this PR, is that this analysis method I believe work well for STARR-seq libraries where all constructs are dissimilar in sequence (ie: NOT different GWAS alleles or different species versions of the same sequence). Basically, only use this for constructs were you don't have to use barcodes. Cellranger count won't incorrectly map reads between similar constructs, but if there are ambiguously mapping reads they won't be counted. Therefore, constructs with more segregating SNPs, it will be easier to map reads to that construct, inflating their values. For these type libraries, it is probably best to use samFilter.go to filter the bam and collapse UMI, then use scCount.go with a custom gtf targeting barcodes.
A useful resource is the cellranger barcoded BAM tags manual page which explains some of the tags that I am using in the program: [https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/output/bam]