single cell STARR-seq analysis suite by sweavs111 · Pull Request #905 · vertgenlab/gonomics

sweavs111 · 2023-07-31T17:54:48Z

This is a draft of a program for scSTARR-seq data analysis. There is nothing conceptually crazy here, but in my opinion it automates the process nicely. The input is a bam file from the cellranger count pipeline. Note that you will need to run something like samtools view -hb {whole genome cellranger bam} chrSS > starrSeqReadsOnly.bam to pull out only the STARR-seq reads. The program parses this file and pulls out the read that cellranger has selected as "best" for each UMI. By default it sums raw read counts per construct to create a pseudobulk count matrix. There is also an option to do cell-type analysis if you provide a table with each cell barcode and the cell type for that barcode. The output for this option is a table of raw read counts for every construct in every cell type. An additional normalization option, which works in both pseudobulk and cell-type-specific mode, will do input normalization for you if you provide a table of construct \t normalization factor.
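To make the counting step concrete, here is a minimal sketch in Go. It is not the code in this PR. It assumes the "best" read is the one flagged by cellranger's xf tag, where (per the 10x bam documentation linked below) bit value 8 marks the read that represents a UMI. The umiRead type, the countConstructs function, and the example barcodes are all hypothetical names made up for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// umiRead is a hypothetical, simplified view of one alignment from the
// STARR-seq-only bam. It is not a gonomics type.
type umiRead struct {
	Construct   string // reference name the read mapped to
	CellBarcode string // value of the CB tag
	ExtraFlags  int    // value of the xf tag (assumed to be the "extra flag" field)
}

// representativeBit is the xf bit assumed to mark the read that cellranger
// chose to represent a UMI.
const representativeBit = 8

// countConstructs sums representative reads per construct (pseudobulk) and,
// for barcodes present in cellTypes, per cell type and construct.
func countConstructs(reads []umiRead, cellTypes map[string]string) (map[string]int, map[string]map[string]int) {
	bulk := make(map[string]int)
	byType := make(map[string]map[string]int)
	for _, r := range reads {
		if r.ExtraFlags&representativeBit == 0 {
			continue // not the representative read for its UMI
		}
		bulk[r.Construct]++
		cellType, found := cellTypes[r.CellBarcode]
		if !found {
			continue // barcode absent from the cell-type table
		}
		if byType[cellType] == nil {
			byType[cellType] = make(map[string]int)
		}
		byType[cellType][r.Construct]++
	}
	return bulk, byType
}

func main() {
	reads := []umiRead{
		{"constructA", "AAAC-1", 25},
		{"constructA", "AAAC-1", 17}, // bit 8 unset, so this read is skipped
		{"constructB", "TTTG-1", 25},
		{"constructA", "TTTG-1", 25},
	}
	cellTypes := map[string]string{"AAAC-1": "neuron", "TTTG-1": "glia"}
	bulk, byType := countConstructs(reads, cellTypes)

	// Sort construct names so the output order is stable.
	constructs := make([]string, 0, len(bulk))
	for c := range bulk {
		constructs = append(constructs, c)
	}
	sort.Strings(constructs)
	for _, c := range constructs {
		fmt.Printf("%s\tbulk=%d\tneuron=%d\tglia=%d\n", c, bulk[c], byType["neuron"][c], byType["glia"][c])
	}
}
```

Passing a nil cellTypes map gives pseudobulk counts only, since every barcode lookup then fails and the per-cell-type table stays empty.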

There are two more niche options if you want to grab intermediate data from the pipeline for whatever reason. You can either have the output be a sam file containing only the "best" read for each UMI, or you can get a list of the construct each "valid" UMI read belongs to and the cell (the cell barcode) it was found in.

I am planning on changing the name of the function to something like analyseScStarrSeq.

Known issues currently: the testing input bam stores the extra flag values as uint8 (because I created the source sam file in vim), but actual bam files store the extra flag values as int32. Tests will fail unless you manually edit the bit variable declaration to uint8; however, it works with real data as is. The other test that is currently failing is the sam out. The correct reads are being kept, but for some reason one of the extra tags is missing. Not sure what's going on, but I am looking into it. EDIT: I looked into the missing tag issue. I think the particular tag that is missing has an incompatibility with the tag.go code. Hard to explain here, but I'm happy to chat more about it.
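One possible way around the uint8 vs int32 mismatch is to accept any integer width when reading the flag, since the BAM format allows integer tags to be stored as int8, uint8, int16, uint16, int32, or uint32. The sketch below is only an illustration of that idea, not the tag.go code. The tagToInt helper and the interface{} input are hypothetical stand-ins for however the parser exposes tag values.

```go
package main

import "fmt"

// tagToInt converts an integer tag value of any BAM integer width to int64.
// The second return value is false when v is not an integer type.
// This is a hypothetical helper, not part of gonomics.
func tagToInt(v interface{}) (int64, bool) {
	switch x := v.(type) {
	case int8:
		return int64(x), true
	case uint8:
		return int64(x), true
	case int16:
		return int64(x), true
	case uint16:
		return int64(x), true
	case int32:
		return int64(x), true
	case uint32:
		return int64(x), true
	default:
		return 0, false
	}
}

func main() {
	fromTestFile := uint8(25) // width reported for the hand-made test bam
	fromRealFile := int32(25) // width reported for real cellranger bams

	a, okA := tagToInt(fromTestFile)
	b, okB := tagToInt(fromRealFile)
	fmt.Println(okA, okB, a == b)
}
```

With a helper like this, the bit check would run on the widened value, so the same code path could serve both the test file and real data.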

Last thing: like I said, there is nothing conceptually crazy here, but I was worried that the program looks confusing and convoluted, especially the single-cell function. I tried to use comments extensively to let y'all know what is going on, but let me know if anything is unclear or if there is anything I can do to make my code clearer.

The other important thing to note, which is maybe relevant context for this PR, is that I believe this analysis method works well for STARR-seq libraries where all constructs are dissimilar in sequence (i.e., NOT different GWAS alleles or different species versions of the same sequence). Basically, only use this for constructs where you don't have to use barcodes. Cellranger count won't incorrectly map reads between similar constructs, but ambiguously mapping reads won't be counted. Therefore, reads will map more easily to constructs with more segregating SNPs, inflating their values. For these types of libraries, it is probably best to use samFilter.go to filter the bam and collapse UMIs, then use scCount.go with a custom gtf targeting barcodes.

A useful resource is the cellranger barcoded BAM tags manual page which explains some of the tags that I am using in the program: [https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/output/bam]

codecov · 2023-07-31T18:06:27Z

Codecov Report

Attention: Patch coverage is 39.86486% with 267 lines in your changes missing coverage. Please review.

Project coverage is 56.43%. Comparing base (958914d) to head (d0d5d2a).
Report is 170 commits behind head on main.

❗ Current head d0d5d2a differs from pull request most recent head b23516d

Please upload reports for the commit b23516d to get more accurate results.

Files	Patch %	Lines
cmd/cellrangerBam/cellrangerBam.go	39.86%	258 Missing and 9 partials ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##             main     #905       +/-   ##
===========================================
- Coverage   65.70%   56.43%    -9.28%     
===========================================
  Files         411      341       -70     
  Lines      176407    36081   -140326     
===========================================
- Hits       115916    20363    -95553     
+ Misses      58965    14302    -44663     
+ Partials     1526     1416      -110

sweavs111 · 2023-08-21T18:37:39Z

Here is the output of the -singleCellAnalysis option with the first four clusters from our 2dayBrainIUE data (only the +/- controls are shown). Things appear to be working. This is with the -normalize option, which does input normalization. I think I will now add an option for GFP normalization.
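For reviewers who want a picture of what input normalization might look like, here is a rough sketch in Go. It is not the code in this PR. It assumes the table is construct \t normalization factor and that the factor is multiplied into the raw count; the PR description does not say whether the factor multiplies or divides, so that direction is a guess. Constructs with no reads get a zero, which follows the inputNorm commit message in this PR. The parseFactors and inputNormalize names are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseFactors reads "construct<TAB>factor" lines into a map.
// Hypothetical helper for illustration only.
func parseFactors(table string) (map[string]float64, error) {
	factors := make(map[string]float64)
	for _, line := range strings.Split(strings.TrimSpace(table), "\n") {
		fields := strings.Split(line, "\t")
		if len(fields) != 2 {
			return nil, fmt.Errorf("expected 2 tab-separated fields, got %d in %q", len(fields), line)
		}
		factor, err := strconv.ParseFloat(fields[1], 64)
		if err != nil {
			return nil, err
		}
		factors[fields[0]] = factor
	}
	return factors, nil
}

// inputNormalize scales each raw count by its construct's factor.
// ASSUMPTION: the factor is a multiplier. Constructs listed in the factor
// table but absent from counts come out as zero.
func inputNormalize(counts map[string]int, factors map[string]float64) map[string]float64 {
	normalized := make(map[string]float64, len(factors))
	for construct, factor := range factors {
		// A missing key in counts reads as 0, so the product is 0.
		normalized[construct] = float64(counts[construct]) * factor
	}
	return normalized
}

func main() {
	factors, err := parseFactors("constructA\t0.5\nconstructB\t2\n")
	if err != nil {
		panic(err)
	}
	counts := map[string]int{"constructA": 4} // constructB has no reads
	normalized := inputNormalize(counts, factors)
	fmt.Println(normalized["constructA"], normalized["constructB"])
}
```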

sweavs111 · 2023-08-22T18:35:12Z

Open question for those reviewing: there are a lot of print statements while the program is running. Are those helpful, or should I take them out? A lot of them started as sanity checks/debugging, but I've left some of them in because I thought they may be useful.

sweavs111 added 6 commits July 25, 2023 10:07

cellranger bam umis

4b9f650

push

97a91a4

cellrangerBam command

599c69b

add fileio close

fd2cc78

typo

34892a3

1st draft of cellrangerBam.go

647f170

sweavs111 added 17 commits July 31, 2023 16:09

testing and sorting map output

f185864

added samOut option

97610cd

Update usage

e0f026c

changed default behavior to pseudobulk

9876b4c

Added single-cell analysis mode

37810e6

adding binning of cells for psuedoreplicates

ae8b0a4

fixed binning bug

6db5925

Add option to combine multiple GEM wells in a single analysis

b07b16b

Added header lines to writing functions

b66b002

changed inputNorm function to give zero values if no read was present for a construct

de4c5cc

remove combineGEMs opiton, switch to autodetect if multiple GEMs are provided. Changed CB naming conventions for mulitiple GEM wells

c855964

update usage

53e92cd

changed to settings struct

ddbc7e1

better warning messages in the normalization function

a61f1eb

added umiSat option

3ed2706

added godocs to some functions

48778e2

typo

fe93280

sweavs111 added 3 commits August 21, 2023 16:39

add GFP normalization for scAnalysis

dec982d

add testing for gfp normalization

d0d5d2a

bug fixes for gfp normalization

22c14e2

sweavs111 changed the title ~~draft of Cellranger bam~~ single cell STARR-seq analysis suite Aug 22, 2023

sweavs111 and others added 30 commits September 26, 2024 05:59

plasmidUMI

c061539

re-org

751e03b

improved help message

ea57687

re-add UMI to mkref

4a32daf

fix UMI index

b35e48c

Merge branch 'main' of https://github.com/vertgenlab/gonomics

461451f

Merge branch 'main' of https://github.com/vertgenlab/gonomics

68e7b41

Merge branch 'main' of https://github.com/vertgenlab/gonomics

e24d977

Merge branch 'main' of https://github.com/vertgenlab/gonomics

1394704

fixes

c66f8fd

4.10.25

fe635b4

samBedToBases

068d8c9

dualBX impliment

16fc9bd

begin bulkOut

7dcbda8

bulk output work

1d0e337

more bulk analysis

168e13d

progress

1e603c3

saving progress

a5e4840

add error handling

4fe911f

saving progress

84d966c

tmp

5a71c11

ready for review + testing

9ee717b

remove useless error handling

9055d19

add annotation string

26e23a5

merge seq.go

efea15a

save

303fcde

better output, more efficient

13929d1

Merge branch 'samBedToBases' into cellrangerBam

34872f4

save

cdfc893

save

890ac7f

sweavs111 wants to merge 111 commits into main from cellrangerBam


3 participants