var_reads = pd.merge(df_read_counts, df_AF,
                     left_on=['position', 'ref', 'base', 'chrom'],
                     right_on=['POS', 'REF', 'ALT', 'CHROM'],
                     how='inner')
ref_reads = pd.merge(df_read_counts, df_AF,
                     left_on=['position', 'ref', 'base', 'chrom'],
                     right_on=['POS', 'REF', 'REF', 'CHROM'],
                     how='inner')
merged_ref_var = pd.merge(ref_reads.iloc[:, :5], var_reads.iloc[:, :5],
                          on=['chrom', 'position'], how='inner')
Hi:
While examining the abundance estimation step in compute_abundances_all.py, I noticed that SNP sites are currently filtered so that only positions with observed ALT reads in the sample are retained (see the merge code at the top of this post).
However, all SNP sites observed in the sample, whether they show only REF reads or also include ALT reads, can provide information. In particular, sites with only REF reads in the sample may still carry information about other strains that have ALT alleles at that position.
Is this filtering intentional, or could it be a potential bug?
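To illustrate what I mean, here is a minimal sketch on toy data, not the actual code path in compute_abundances_all.py. The frames `ref_counts` and `alt_counts` and the columns `ref_count` and `alt_count` are hypothetical names I made up for this example. It compares an inner join, which keeps only sites present in both frames, with an outer join that keeps every observed site and fills missing counts with 0.

```python
import pandas as pd

# Toy per-site read counts. All frame and column names here are hypothetical
# and chosen for illustration; they may not match the real script.
ref_counts = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr1"],
    "position": [100, 200, 300],
    "ref_count": [12, 7, 20],
})
alt_counts = pd.DataFrame({
    "chrom": ["chr1", "chr1"],
    "position": [200, 400],
    "alt_count": [5, 3],
})

keys = ["chrom", "position"]

# Inner join: only sites that have both REF and ALT reads survive.
inner = pd.merge(ref_counts, alt_counts, on=keys, how="inner")

# Outer join: every site seen in either frame is kept.
outer = pd.merge(ref_counts, alt_counts, on=keys, how="outer")

# A missing count means no reads of that allele were observed, so fill with 0.
count_cols = ["ref_count", "alt_count"]
outer[count_cols] = outer[count_cols].fillna(0).astype(int)

print(f"inner join keeps {len(inner)} site(s), outer join keeps {len(outer)}")
```

With a join like this, REF-only sites would stay in the table with an ALT count of 0 instead of being dropped. Whether that is appropriate depends on how the downstream abundance model treats zero counts.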