We are in the middle of a revolution that we don’t understand. Even experts in the field admit that developments are taking place at such tornado speed that it’s hard to predict what might be possible one year, let alone ten years, down the track. Any human – simply by spitting into a test-tube – can produce a ream of data that could variously help identify a murderer, contribute towards finding a cure for Parkinson’s, or predict the disease that will kill their grandchildren.
The ever-growing bank of information that we have from genomic data – combined with the ability of artificial intelligence to interpret it – is radically transforming our world, and as with all revolutions it will spark both positive and negative consequences.
George Church is the leading American geneticist and Harvard professor who in 2005 pioneered the Personal Genome Project (PGP). This entices people to contribute genetic information on the basis that the more data scientists have to analyse, the greater the payoff will be to science and society. He tells me that, ‘Genetics today seems analogous to seat belts in 1964, which were free but unused until additional motivation was provided.’ In other words it could save countless lives – ushering in a new era of medicine that will enable speedier diagnosis, tailor-made treatments, and at best full-on disease prevention. But the risks involved in putting such private data into the public domain mean that people are inevitably wary. Not least because – since the genomic data market is worth billions of pounds – plenty are poised to exploit it.
‘Scientists have predicted that by 2025 the details of sequenced genomes will generate more data on the internet than either Twitter or YouTube.’ Illustration by Mike McQuade.
At the start of this summer I talk to Caroline Rivett, Cyber Security Director for Life Sciences at KPMG. We meet in the sleek minimalist surroundings of Vinoteca at King’s Cross, just a microchip’s throw from Google’s London headquarters.
The location feels appropriate for this latest chapter in the Big Data story. Google’s strong interest in the genetic information revolution – which has already raised some privacy concerns – was made most manifest four years ago when it set up Google Genomics, a cloud computing service that lets scientists move DNA data onto its servers. Beyond this, Google’s parent company Alphabet was a key investor in 23andMe, a company that collects genomic data to tell people about their ancestry, when it started up. (This was in part connected to the fact that Sergey Brin, Google’s president of technology and company co-founder, was then married to 23andMe’s co-founder Anne Wojcicki – they divorced in 2015.)
Rivett, whose softly spoken demeanour fronts a laser-sharp intelligence, is particularly concerned about the increasing popularity of ancestry tests. Last year more people than ever before bought kits to work out their origins, and overall over 12 million people have submitted samples. A kit provided by 23andMe costs £79, as does one from its main competitor AncestryDNA. Yet at the point that people are excitedly spitting into test tubes to work out whether they are from the Near East or Iceland, they are not aware enough of the full implications for their privacy. While participants sign agreements about how their data might be used, as experts like Rivett point out, these are currently not adequate.
‘The whole question of the consumerisation of DNA is interesting,’ she declares. ‘I don’t know how scientifically valid a lot of it actually is.’ Recent accounts support her doubts – a review, for instance, in the American magazine Science News featured a reporter who had tested her DNA with National Geographic Geno 2.0, Living DNA, Family Tree DNA, 23andMe, and AncestryDNA. She found that her user experience and, more significantly, her results were quite different from company to company.
Yet if the upside of such tests is uncertain, the potential downsides could resonate for decades. ‘Obviously people have different DNA,’ Rivett says, ‘but DNA is shared with family members. Statistically I share 50% of my DNA with my children – there might be slight variations – and my husband the other 50%. At the moment we only know a small proportion of what that DNA means. However, that understanding of what DNA actually means is increasing rapidly. If the DNA that has been supplied contains information relating to, say, a mental illness, or an acute illness such as schizophrenia, then it also relates to the child of the person who has supplied it. This information could then become available to corporate entities.’
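Rivett’s 50% is the standard expectation for a parent and child, and the same halving at each generation shows how far a single sample reaches into a family. The short Python sketch below works through that arithmetic. The function names are illustrative rather than taken from any testing company, and the figures are statistical averages for autosomal DNA: real relatives beyond parent and child share somewhat more or less than the expected value.

```python
from fractions import Fraction


def expected_shared_fraction(meioses: int, common_ancestors: int = 1) -> Fraction:
    """Expected fraction of autosomal DNA shared by two relatives.

    meioses: number of parent-to-child steps on the path linking them.
    common_ancestors: 1 for a direct line (parent, grandparent),
    2 when the path runs through an ancestral couple (full cousins).
    """
    return common_ancestors * Fraction(1, 2) ** meioses


def nth_cousins(n: int) -> Fraction:
    # n-th cousins each sit n + 1 generations below a shared ancestral couple,
    # so the path between them has 2 * (n + 1) parent-to-child steps.
    return expected_shared_fraction(2 * (n + 1), common_ancestors=2)


relatives = {
    "parent and child": expected_shared_fraction(1),
    "grandparent and grandchild": expected_shared_fraction(2),
    "first cousins": nth_cousins(1),
    "third cousins": nth_cousins(3),
}

for name, share in relatives.items():
    print(f"{name}: {share} ({float(share):.2%})")
```

On these averages a third cousin still shares about 1/128 of their DNA with the person who took the test, which is why one sample says something about relatives who never agreed to anything.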
Through running these tests, the leading ancestry-testing companies 23andMe and Ancestry.com have become the world’s two leading personal genomics companies. Yet as Rivett implies, the service they sell to their clients is not the real story. Just a couple of weeks after I met her, 23andMe announced it had signed a £233 million deal with pharmaceuticals giant GlaxoSmithKline, giving the drugmaker full access to 23andMe’s genomic database in the biggest commercial venture of its kind to date. It was a public acknowledgement that anybody collecting genomic data commercially was sitting on a goldmine. Regardless of whether they could identify a customer as a Visigoth or a Hun, it was the information accrued beyond this that was of genuine value.
The rampant speed at which science and technology are advancing in this area inevitably means that people are struggling to keep up with the ethical implications. The first decoding of the human genome was done by an international team of more than a thousand scientists and took 13 years – from 1990 to 2003 – costing $2.7 billion. A decade later the cost was down to between a thousand and a few thousand dollars – depending on the technique used – and the process could take a matter of days. This February, the Rady Children’s Institute for Genomic Medicine broke the world record by sequencing a genome in just 19½ hours. Estimates of the cost of the process now range from a thousand to a few hundred dollars.
‘San Diego company Human Longevity – which has a business-model for building the world’s most comprehensive database of genotypes and phenotypes – claim they could use genomes to predict faces.’ Illustration by Mike McQuade.
As the process becomes cheaper and quicker, and the information that could be released ever more unpredictable, the question of what kind of third parties could gain access to genomic data becomes ever more relevant. I talk to William Knox Carey, a California-based genomic privacy expert who oversaw the building of Genecloud, a system that enables genomic data to be analysed both remotely and securely. Knox Carey – a man who seems to exude facts as naturally as he breathes – says of 23andMe, ‘The deal with GlaxoSmithKline is based on the fact that important medical discoveries are lurking in the genomes of all these people – and I’m very empathetic to that. There are extenuating circumstances, in which I believe that, yes, this data can actually be used for helping people. But there are duelling imperatives here. One of them is the privacy of those who have supplied the data, the other is how does 23andMe make money?’
Does he think that companies like 23andMe – which beyond ancestry tests also offers basic diagnostics for people wanting to assess, for instance, their fitness, or their predisposition to certain diseases – have always wanted to use the information they collect for other purposes? ‘I think that from the start they were following the data mining model for building their business,’ he says. ‘I would say as much as I’m sympathetic to the argument that data can be useful, I don’t think the companies have done a particularly wonderful job in explaining how they could use that data. Nor have they given consumers fine grain control over how the data is used.’
Some argue that if it is primarily entities like GlaxoSmithKline receiving the information, consumers have little to fear. Stewart Room, Global Head of Cyber Security and Data Protection at PricewaterhouseCoopers, says, ‘Genomic evidence is valuable in so many ways. Obviously not least to the pharmaceutical industry, the life sciences industry. It’s immensely valuable because it helps us develop drugs, or important insights. If it’s helping us in preventative medicine rather than curative medicine, what we’ve got there is a question of good. I’m not saying that every individual understands the full impact of these tests, but they do understand the fundamental purpose of what’s going on, which is “look at my DNA in a laboratory – tell me something about me.”’
Yet the storing of genomic information online means that like all other aspects of our virtual lives it will be vulnerable to hacking. American scientists have predicted that by 2025 the details of sequenced genomes will generate more data on the internet than either Twitter or YouTube. It’s an astounding prediction that inevitably comes with security implications.
This summer, for instance, the Israeli genealogy website MyHeritage was hacked, with email addresses and encrypted passwords of more than 92 million user accounts compromised. Though no DNA data was accessed, it was a warning shot. Last year the NHS – another major collector of genomic data – also endured a major cyber attack after its systems became infected by the WannaCry ransomware.
Those wanting a comprehensive account of the dangers of what might happen to their genomic data should it reach an unscrupulous third party can find it, ironically enough, on a website which wants to encourage them to go ahead and donate it anyway. It has been part of George Church’s campaign for radical transparency in the genetics revolution that when individuals sign up for the Personal Genome Project they are left in no doubt about the worst that might happen. Knox Carey declares, ‘It’s a really interesting list. Some of the examples seem crazy – like something from the plot of an airport novel – one says that if people have access to your DNA information they could synthesise strands of DNA and plant them at a crime scene. It’s very unlikely, but the ability to synthesise DNA is very advanced these days, so technically it is possible.’
When I read through the PDF available on the PGP website, as Knox Carey says, some of it reads like the stuff of bad fiction. One example goes as far as to warn that ‘someone could use your DNA or cells for in vitro fertilisation to create children without your knowledge or permission.’ But there’s plenty that is easily imaginable within a real-life context. Not least among these is being discriminated against for insurance purposes.
Although the risk of all these eventualities is small – hence the PGP’s willingness to raise them in the interests of transparency – what worries privacy experts is that if and when one does happen, protection is currently inadequate. For instance in America the Genetic Information Nondiscrimination Act (GINA) – passed in 2008 – forbids health insurance companies from charging higher premiums because of genetic information. Yet this doesn’t extend to life insurance or long-term care insurance. The PGP website declares that, ‘as with other types of discrimination, it could be extremely difficult to prove discrimination has occurred. You might never know whether your employer found your PGP data and read about your genetic findings. GINA and similar laws are new, incomplete, and relatively untested – you should not assume that your genetic information, if it should become associated with you, would never be used against you in a way you found objectionable, whether or not prohibited by law.’
Obviously all companies that invite individuals to contribute genomic data offer reassurance about how that data will be protected through privacy policies. But as Rivett points out, ‘companies change their privacy policies all the time. People don’t read them, and don’t worry, so this is actually a very poor form of control.’ On this side of the Atlantic, she has concerns about what is going to happen once Brexit stops British citizens from being covered by the EU’s General Data Protection Regulation (GDPR). ‘Under the GDPR you can go back to organisations and raise the subject of access to the information you’ve supplied. That company’s then legally obliged to come back within a certain period of time and give you information about how your personal data has been used. GDPR also gives you the right to be forgotten. You can write to the company and say please delete all of my information, and they are legally obliged to do so.’
A key promise that 23andMe and other ancestry companies make in their privacy agreements is that if you consent for your data to be used in scientific research, then it will be ‘de-identified’, or anonymised. This may sound like a satisfactory way forward, until you ask precisely how effective the anonymisation process is. In a notorious 2013 experiment, the former white-hat hacker and geneticist Yaniv Erlich was challenged to identify individuals whose data had been anonymised for the international 1000 Genomes Project. Erlich, in collaboration with a student, developed an algorithm that extracted genetic markers from DNA sequences. They then did some basic detective work with a genealogy website, using Google to check what they thought they had found out about the donor and his family. To their astonishment, they realised there was a 5% chance they could identify a person based on anonymised DNA. Advances in artificial intelligence and genetic science will only see that percentage rise.
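To see why stripping the name from a record is such thin protection, it helps to sketch the kind of cross-referencing involved. The toy Python example below is not Erlich’s algorithm, and every marker profile, surname and record in it is invented. It assumes two things that are not spelled out above: that a handful of markers can point to a surname, as markers on the Y chromosome tend to do where surnames pass from father to son, and that the ‘anonymised’ record still carries a birth year and a region.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PublicRecord:
    name: str
    surname: str
    birth_year: int
    state: str


# Hypothetical genealogy database: marker profile -> surname (invented data).
SURNAME_BY_MARKERS = {
    (13, 24, 14, 11): "Example",
    (14, 23, 15, 10): "Sample",
}

# Hypothetical public records of the sort a search engine might surface.
PUBLIC_RECORDS = [
    PublicRecord("Alan Example", "Example", 1950, "UT"),
    PublicRecord("Brian Example", "Example", 1982, "UT"),
    PublicRecord("Carl Sample", "Sample", 1950, "UT"),
]


def reidentify(markers, birth_year, state):
    """Return the public records consistent with an 'anonymised' genome.

    Step 1: look up the likely surname from the marker profile.
    Step 2: narrow the candidates using the leftover metadata.
    """
    surname = SURNAME_BY_MARKERS.get(tuple(markers))
    if surname is None:
        return []  # no genealogy match, so the donor stays anonymous
    return [
        record
        for record in PUBLIC_RECORDS
        if record.surname == surname
        and record.birth_year == birth_year
        and record.state == state
    ]


candidates = reidentify((13, 24, 14, 11), birth_year=1950, state="UT")
print([record.name for record in candidates])
```

No single field gives the donor away. It is the join between a genealogy lookup and ordinary public information that shrinks the list of candidates to one.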
More dramatically, just over a year ago the San Diego company Human Longevity – which has a business model for building the world’s most comprehensive database of genotypes and phenotypes – published a paper in which they claimed they could use genomes to predict faces. The experiment was headed up by the company’s founder, Craig Venter. A team sequenced the genomes of 1,061 people of different ages and ethnicities, and also took their photos in high-definition 3D. All this information was given to an algorithm, which calculated the relationship between faces and the genomes, and then – taking a new batch of genomes – attempted to generate faces independently.
Venter is one of the genetic world’s more colourful characters, who famously first stirred controversy when he founded Celera Genomics to challenge the internationally funded Human Genome Project to be the first to sequence the human genome. Though smaller scale, this face prediction project was also controversial, with many saying that it was in fact predicting average faces based on race and sex. Erlich himself, who is currently chief scientific officer of MyHeritage, was one of those who poured scorn on it. But as Knox Carey says, ‘[Face generation] isn’t particularly accurate yet, but that will change as machine-learning and artificial intelligence improve. There is a lot of information about your phenotype [observable characteristics like eye colour, the sound of your voice, your height] in the DNA. It’s only going to get better.’
A news story that broke this April has highlighted a very different angle to how identifiable anyone is by their genetic data. After 44 years of eluding the police, the notorious Golden State Killer – the American murderer and rapist who killed 13 men and women across California in the 1970s and 80s – was tracked down when a DNA sample of his was uploaded to the website GEDmatch. GEDmatch specialises in analysing genetic data to pinpoint distant relatives – and in this case found 20 or 30 people who were third cousins. A family tree was then constructed, looking for a common ancestor shared by these people and the killer. It went back to the early 1800s, and to the people who would prove to be the great-great-great grandparents of the man the police wanted.
Paul Holes, an investigator and DNA expert who had been tracking the Golden State Killer for 24 years, put together a team who painstakingly proceeded to identify all of the great-great-great grandparents’ descendants. This involved tracking down thousands of names and creating 25 family trees. Once this gargantuan task had been completed, the team then started to look at men who were the estimated age of the killer, who had a connection to Sacramento. Two men were identified, one of them the former cop Joseph James DeAngelo. In April he was arrested, and this August he was charged on 13 counts of murder. He is currently slated to stand trial this December.
In a twist that shows how tightly knit the genetic world can be, this August Holes also announced that he could finally reveal the individual who had provided crucial expertise and advice to his team on how to structure the research. Her name was Barbara Rae-Venter, the 70-year-old former attorney who was married to Craig Venter. Initially she was hesitant about disclosing her involvement, worried for her personal safety. But an ecstatic public response has led to her talking to law enforcement agencies about dozens of murder cases that may be solved in a similar way.
Obviously it’s a significant plus if sadistic killers are being brought to book through their genomic data. But it’s another example that shows that no genomic data will ever be truly anonymous, especially if cross-referenced with genealogy websites. As George Church constantly emphasises, therefore, what every person who reveals their genetic information has to ask is how the level of risk to the individual and their family balances out against the overall good contributed to society. Church’s mission is to reinforce the mechanisms in place to guarantee the confidentiality of the data supplied, and – in an interesting step beyond this – to ensure that the individual who supplies the information has their own financial incentive.
This February he launched Nebula Genomics, a start-up that uses blockchain to allow people not just to share their genomic data for research purposes, but also to keep ownership of that data, which means that they themselves will profit financially. As one science journal memorably declared, this combination of blockchain and whole genome sequencing is the business equivalent of the Hollywood supercouple. In Church’s accompanying white paper he hit out at 23andMe and Ancestry.com for their use of DNA microarray-based genotyping, saying, ‘It is an outdated and significantly less powerful alternative to DNA sequencing’. Dennis Grishin, a Nebula co-founder who is one of Church’s graduate students, also took aim, telling one newspaper that, ‘Under the current system, personal genomics companies effectively own your personal genomics data, and you don’t see any benefit at all. We want to eliminate the middleman.’
In an email exchange, Church tells me that he thinks the three key priorities for genomic data companies going forward are, ‘a) Powerful pooling and encryption of queries plus blockchain to protect via decentralisation, consensus and public time-stamping… b) Clearer statement of the value of genomes. A key, under-utilised but very high value application is carrier screening for pre-conception matchmaking. This has nearly eliminated nine genetic diseases in the populations that use it…but it needs to be adapted to cultures needing higher privacy and additional motivation. c) Transparency and education about the risks and benefits given the above considerations…Since so much financial upside is possible for carrier testing and research, Nebula is aiming to pass on enough of this cash to provide additional motivation and word-of-mouth education.’
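The ‘public time-stamping’ in Church’s first point is the easiest of these ideas to picture. The Python sketch below is an assumption-laden toy, not a description of Nebula’s system: the function names and log entries are invented, and it leaves out decentralisation and consensus altogether. What it shows is the core trick of a hash chain, in which each record of a data query includes a fingerprint of the record before it, so that quietly rewriting history breaks every later link.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(body: dict) -> str:
    # Hash a canonical JSON encoding so the same body always gives the same hash.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def add_entry(chain: list, query: str, timestamp: str) -> list:
    """Append a time-stamped query record that commits to the previous record."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    body = {"query": query, "timestamp": timestamp, "prev_hash": prev_hash}
    chain.append({**body, "hash": _digest(body)})
    return chain


def is_valid(chain: list) -> bool:
    """Recompute every hash and check that each entry points at its predecessor."""
    prev_hash = GENESIS_HASH
    for entry in chain:
        body = {key: entry[key] for key in ("query", "timestamp", "prev_hash")}
        if entry["prev_hash"] != prev_hash or entry["hash"] != _digest(body):
            return False
        prev_hash = entry["hash"]
    return True


# Invented example entries: who asked what of the data, and when.
log = []
add_entry(log, "researcher A: carrier-status query", "2018-09-01T10:00:00Z")
add_entry(log, "researcher B: ancestry-marker query", "2018-09-02T09:30:00Z")
print(is_valid(log))  # True: the untouched log verifies

log[0]["query"] = "nothing to see here"  # someone edits an old record
print(is_valid(log))  # False: the stored hash no longer matches the record
```

A real blockchain adds many independent copies of such a log and a way for them to agree on the next entry, which is where the decentralisation and consensus in Church’s list come in.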
In the UK, just a few weeks ago the health minister Lord O’Shaughnessy demanded a ‘profound rethink’ of the ethics of using any kind of health data in combination with artificial intelligence. In an observation that could have come straight from the Church school of thought, he declared, ‘People increasingly view the data that is held about them as…a reflection of some part of their human capital and, therefore, if it is being invested to create something with economic value, there ought to be some return.’ He continued that the NHS (which runs the 100,000 Genomes Project) itself represents ‘a comprehensive, universal data set, potentially, on 60m people, which means that you cannot just test new things, but you can actually look for patterns in historical data’. It is, in other words, ‘an extraordinary opportunity’.
Interestingly he cited a new business model that he thinks should inspire trust – a partnership between the healthcare technology company Sensyne Health, which floated on the stock exchange this August, and the NHS. Founded by the Labour peer Paul Drayson, it deploys AI to interpret health data for the pharmaceutical industry. Notably individuals who have supplied data don’t benefit directly – but defenders would argue that the huge plus is that there are three health trusts signed up to the arrangement that will receive £5m worth of shares in Sensyne, and a royalty on the products that are developed. This means that participating hospitals will own 10 per cent of the equity.
It’s just one of the latest developments in a scientific and technological revolution that – rather like a rapacious virus – seems to mutate every time you look at it. There are astonishing and positive implications for our future, yet maintaining public trust is absolutely essential. As Church says, ‘The price of a human genome has dropped to the price of a nice meal for a few friends, but adoption is still very limited.’ The stakes are huge, and it’s a matter of urgency to work out the implications so that it is the philanthropists rather than the predators who win out in this giant step for humankind.