Question

Ancestry estimation using 1000 genomes as reference

8 weeks ago

kasgel • 0

Hi,

I'm trying to perform ancestry estimation for a study sample of size one using 1000 genomes (g1k) as reference. That is, ancestry estimation for one individual. I have performed variant calling against GRCh38.

However, all the guides I can find for doing this with PLINK and g1k assume that the study data contains more than one individual. An example of such a guide is here. For instance, it's not possible to prune variants in high LD if the data contains only one individual (too few founders).

I'm not sure how to approach this. Should I be merging the g1k data with the study sample?

Any guidance would be appreciated! Thanks :)

ancestry plink 1000genomes • 433 views


I can't imagine that LD calculations derived from the query data would be helpful until you had several hundred individuals. Most people use LD measurements from reference data, which is what --indep-pairwise will do anyway.

Comment by Jeremy Leipzig 22k • 8 weeks ago

score 2 · Answer 1 · 2024-09-02

8 weeks ago

DBScan ▴ 450

If you only have a single sample, I would suggest using somalier.


score 0 · Answer 2 · 2024-09-02

8 weeks ago

chrchang523 11k

Partway through, the guide explicitly tells you to merge your study sample with the g1k data.
You correctly suspect that the guide's LD pruning recommendation is not applicable to your use case. It is reasonable to LD-prune the merged dataset (or just the g1k dataset) instead.
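
As a rough illustration of that order of operations (merge first, then LD-prune the merged data, then run PCA), here is a minimal sketch that only builds and prints PLINK 1.9 command lines; it does not run them. The fileset prefixes (g1k, study, merged), the pruning parameters 50 5 0.2, and the choice of 10 principal components are assumptions for the example, not values taken from the guide. It also assumes both datasets are already PLINK binary filesets restricted to shared variants with matching alleles and strand.

```python
import shlex


def plink_cmd(*args, exe="plink"):
    """Return a PLINK command as a list of string arguments."""
    return [exe, *map(str, args)]


def build_pipeline(ref="g1k", study="study", out="merged"):
    """Build merge -> LD-prune -> PCA commands.

    ref, study and out are hypothetical PLINK binary fileset prefixes
    (each standing for a .bed/.bim/.fam triple).
    """
    pruned = f"{out}_pruned"
    return [
        # 1. Merge the single study sample into the reference panel.
        plink_cmd("--bfile", ref,
                  "--bmerge", f"{study}.bed", f"{study}.bim", f"{study}.fam",
                  "--make-bed", "--out", out),
        # 2. LD-prune the merged data; with one study sample the LD
        #    estimates come almost entirely from the reference panel.
        #    Window 50, step 5, r^2 0.2 are assumed example values.
        plink_cmd("--bfile", out,
                  "--indep-pairwise", 50, 5, 0.2,
                  "--out", pruned),
        # 3. Keep only the pruned-in variants and compute 10 PCs.
        plink_cmd("--bfile", out,
                  "--extract", f"{pruned}.prune.in",
                  "--pca", 10,
                  "--out", f"{out}_pca"),
    ]


# Print the commands so they can be reviewed before running them by hand.
for cmd in build_pipeline():
    print(shlex.join(cmd))
```

To prune on the g1k dataset alone instead, step 2 could point --bfile at the reference prefix rather than the merged one; the resulting .prune.in list would then be applied to the merged fileset in step 3.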
