Supposedly anonymous people taking part in genomic studies can be identified using publicly accessible online resources, a team of Whitehead Institute researchers has shown.
They were able to work out the full names of genomic research participants, even though their genetic information was held in databases in de-identified form.
“This is an important result that points out the potential for breaches of privacy in genomics studies,” says Whitehead Fellow Yaniv Erlich.
He and his team began by analyzing unique genetic markers known as short tandem repeats on the Y chromosomes (Y-STRs) of men whose genetic material was collected by the Center for the Study of Human Polymorphisms (CEPH) and whose genomes were sequenced and made publicly available as part of the 1000 Genomes Project.
Because the Y chromosome is transmitted from father to son, as family surnames usually are, there’s a strong correlation between surnames and the DNA on the Y chromosome.
Genealogists and genetic genealogy companies have established publicly accessible databases that house Y-STR data by surname. And, in a process known as ‘surname inference’, the team was able to discover the family names of the men by submitting their Y-STRs to these databases.
Once they had this information, they were able to use other sources, including internet record search engines, obituaries, genealogical websites and public demographic data from the National Institute of General Medical Sciences (NIGMS) Human Genetic Cell Repository, to identify nearly 50 men and women in the US who were CEPH participants.
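The two steps described above, surname inference followed by narrowing with public records, can be sketched roughly as follows. This is a toy illustration only, not the team's actual method: the surnames, repeat counts, mismatch threshold and the `infer_surname` and `narrow_candidates` helpers are all hypothetical, and the choice of birth year and state as the demographic fields is an assumption made for the example.

```python
from collections import Counter

# Hypothetical surname database: (surname, Y-STR profile) pairs.
# A profile maps a marker name to its repeat count. All values are made up.
DATABASE = [
    ("Smith", {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS393": 13}),
    ("Smith", {"DYS19": 14, "DYS390": 24, "DYS391": 10, "DYS393": 13}),
    ("Jones", {"DYS19": 15, "DYS390": 23, "DYS391": 10, "DYS393": 12}),
]


def count_mismatches(query, record):
    """Return (number of differing markers, number of markers compared)."""
    shared = query.keys() & record.keys()
    differing = sum(query[m] != record[m] for m in shared)
    return differing, len(shared)


def infer_surname(query, database, max_mismatches=1, min_shared=3):
    """Guess a surname by voting among database records close to the query.

    The thresholds are arbitrary illustration values (assumptions).
    Returns None when no record is close enough.
    """
    votes = Counter()
    for surname, profile in database:
        differing, shared = count_mismatches(query, profile)
        if shared >= min_shared and differing <= max_mismatches:
            votes[surname] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


def narrow_candidates(people, surname, birth_year, state):
    """Keep only public-record entries matching all three fields."""
    return [
        p for p in people
        if p["surname"] == surname
        and p["birth_year"] == birth_year
        and p["state"] == state
    ]


# Usage with made-up data.
query_profile = {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS393": 13}
guess = infer_surname(query_profile, DATABASE)

public_records = [
    {"surname": "Smith", "birth_year": 1950, "state": "UT"},
    {"surname": "Smith", "birth_year": 1972, "state": "UT"},
    {"surname": "Jones", "birth_year": 1950, "state": "UT"},
]
matches = narrow_candidates(public_records, guess, birth_year=1950, state="UT")
```

The point of the sketch is that neither step needs the participant's own name: a close paternal relative in the surname database supplies the family name, and a few demographic fields then shrink the list of people who share it.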
“We show that if, for example, your Uncle Dave submitted his DNA to a genetic genealogy database, you could be identified,” says team member Melissa Gymrek. “In fact, even your fourth cousin Patrick, whom you’ve never met, could identify you if his DNA is in the database, as long as he is paternally related to you.”
Erlich has shared his findings with officials at the National Human Genome Research Institute (NHGRI) and NIGMS, which have responded by removing certain demographic information from the publicly accessible part of the NIGMS cell repository.
“Our aim is to better illuminate the current status of identifiability of genetic data. More knowledge empowers participants to weigh the risks and benefits and make more informed decisions when considering whether to share their own data,” he says.
“We also hope that this study will eventually result in better security algorithms, better policy guidelines, and better legislation to help mitigate some of the risks described.”