The Sequencing and Tracking Of Phylogeny in COVID-19

As the world deals with the current SARS-CoV-2 pandemic, genomic epidemiology has become a proven tool in the fight against COVID-19. As the virus passes from person to person, mutations can occur that can be monitored and traced to identify chains of infection. Working as part of the COVID-19 Genomics UK Consortium (COG-UK), Dr Robson and his team work with NHS sites across the South and Lighthouse Labs throughout the UK to help with infection control processes and to provide a UK-wide database of viral genomes. The UK leads this field by a wide margin, and these data are used to track and trace novel variants of concern.

Working closely with PHE, and with reporting to government organisations such as the Scientific Advisory Group for Emergencies (SAGE) and the New and Emerging Respiratory Virus Threats Advisory Group (NERVTAG), this project has developed rapidly to assist in our understanding of the virus as we attempt to bring it under control and plot a course to a normal way of life.

Bios: Dr Samuel Robson is a Senior Research Fellow at the University of Portsmouth, where he is the Faculty Bioinformatics Lead, and Bioinformatics Lead at the Centre for Enzyme Innovation (CEI). He has developed a Bioinformatics-specific compute cluster here at the University where he has developed analysis pipelines for whole genome sequencing, genome/transcriptome assembly, RNA-seq, ChIP-seq, CLIP-seq, BS-seq, amplicon sequencing, and other typical sequencing data types used by researchers throughout the University.

What the Hell is Bioinformatics

so let's go a very good afternoon everyone a very warm welcome to yet another edition of

our research future interdisciplinary webinars i am leila choukroune i'm professor of

international law and director of the university of portsmouth theme in democratic citizenship

today we are absolutely delighted to welcome our colleague dr sam robson sam is going to address a

very timely topic the sequencing and tracking of phylogeny in COVID-19.

let me introduce sam sam already has quite an impressive academic career

he is at the university of portsmouth he's the bioinformatics lead at the centre for enzyme

innovation but also he is the same lead in the faculty of um

biology if i'm not mistaken he's written a lot already on a variety of topics as

you can imagine sequencing but not only really he's collaborating on very large number of

projects throughout the university and we met with sam in a completely different environment we

met working on heritage related issues so just to give you a few examples of very

diverse research areas sam is working on and that's most impressive he works on the analysis of microbial

biofilm diversity and the effects of antifouling technology he works on

understanding the enzymatic activity of wood-eating gribbles for biofuel development

well i'm pretending i'm understanding sam i don't but i mean anyway well

i will soon i know he works on the analysis of diverse

gene expression pathways in bacterial communities he works as well on identifying novel

biomarkers for prosthetic joint infection i'm sure it's together with our colleague gordon

blunn who's professor and director of the health and wellbeing theme he also works on understanding the

pathogenesis and treatment of duchenne muscular dystrophy he works as well on transcriptional

profiling of novel marine organisms and as you understood on viruses

including the sars virus and the coronavirus as well apologies for all the words i've

probably mispronounced and all the concepts i didn't know about but sam is going to well enlighten us now the

floor is yours thank you leila thank you for that introduction and thanks for inviting me along and you've just led very nicely

into my first slide there what the hell is bioinformatics um i get asked this quite a lot i've taken

What the Hell is Bioinformatics?

to just telling people i'm a baker now because it's a lot easier than trying to explain my job to people

uh but essentially i've been working in the faculty of science and health uh for the past four years um and my

main role is to work alongside people doing research mainly in biology um but these days the

technology available to biologists is uh generating such huge amounts of

data uh that we need people like myself bioinformaticians who kind of skirt

the uh areas of biology uh computer science and statistics

to help make sense of all the data that's generated um so within the venn diagram i sit

somewhere between somebody who works with computers uh a statistician i'm a chartered

statistician with the royal statistical society uh and biology

but to be honest if you ask my wife what i do for a living she'll say that i do this stuff here

where i generate pretty pictures for people to put into their publications uh but really it's using computers uh in

order to understand biological processes uh this is a more accurate representation of what my life looks

More accurately...

like uh generating huge amounts of data and trying my best to manage them

i'm going to take a bit of a step back though before i start delving into uh the project in full just because i'm

aware that not everybody i'm speaking to today will have a deep scientific background

um i'm sure everybody's aware of what dna is uh but just to give you a sort of brief

introduction to the systems ongoing inside your cells that allow dna to

actually have a function and generate proteins so ultimately dna is the blueprint

to life it sits within all of your cells and it tells us how to make you

What is DNA?

um i recently gave a presentation very much along these lines to the children of my daughter's school

so you'll forgive me if some of the uh some of the graphics about to come up are a little bit cartoony but

i thought it worked quite nicely to explain the concept so uh what's special about dna and one

of the most important things about it is that it's a very simple molecule really it's only made up of four

building blocks uh which are called nucleotides and those building blocks are

adenine thymine cytosine and guanine which we just call a t c

and g so dna can be thought of as a really really long word but it's only really

made up of four letters so uh those letters are decoded

within our cells in a process called translation where each set of three of what we call

bases or base pairs encodes a particular amino acid and amino acids join together in

chains to make proteins and it's the proteins that actually have an effect within your body so

they're molecules that fold together to do a certain job now that job might be

structural uh they might build cell walls or cell membranes uh it might be functional uh there's a

protein complex here called the ribosome which is kind of the uh the production machine of your cells

which puts all these things together um but essentially

the way that dna works is it sits in what we see here as a double helix and what's special about this is that

these bases pair together in a specific way so c always goes with g

and a always goes with t and because of this if we know one strand of dna we

automatically know what the other strand is as well and this is what's used to copy dna so when dna copies

uh for instance um during uh cell meiosis and mitosis

the dna splits into two and then second copies of the distinct strands are made to make

two lots of double-stranded dna but this process uh is also used to create what's called rna which is

an intermediary molecule which can be taken off to the ribosome to

generate these proteins we use technology such as what i'm going

Next Generation Sequencing - Nanopore

to talk about today called next generation sequencing technologies which allow us to essentially read the sequence of nucleotides that

exist within a strand of dna or rna and the system that we use here is called

nanopore sequencing because it uses these tiny little pores which are just like the pores that i showed there that allow um dna

and rna to sorry rna to be passed outside of the nucleus these sit across

a small membrane that's got a current passing across it and the pores are just big enough to

allow a single strand of dna to pass through um and those bases each have a distinct

charge on them so as they pass through that membrane they cause a distinct change in the

uh the potential that runs across the the membrane and that change in potential can be

converted into a sequence so by doing this literally the dna will

be passed through this and as it goes through we'll simply read off what that sequence is
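
Once a sequence has been read off as a string of letters, the pairing rule (a with t, c with g) and the three-base decoding described in the dna overview can both be applied in software. The following is a minimal Python sketch, assuming the standard genetic code and a DNA alphabet only; the function names and the example read are made up for illustration and are not taken from any pipeline mentioned in the talk.

```python
from itertools import product

# Base pairing: A pairs with T, C pairs with G.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(seq):
    """Return the opposite strand, read in its own 5' to 3' direction."""
    return "".join(PAIRS[base] for base in reversed(seq))


def translate(seq):
    """Decode a DNA sequence three bases at a time ('*' marks a stop codon)."""
    usable = len(seq) - len(seq) % 3  # ignore any trailing partial codon
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, usable, 3))


read = "ATGGATGGTTAA"  # hypothetical read, not real data
print(reverse_complement(read))
print(translate(read))
```

Knowing one strand is enough to write down the other, which is why `reverse_complement` needs no extra input.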

um and this simple technique has a huge amount of uh things that we can use it for so as

leila said there the research projects i'm involved in are very distinct from looking at rna viruses such as SARS-CoV-2

to looking at 500 year old sailors on the mary rose

to looking at biofilm formation in marine environments you name it but

they all use a similar approach where we digitize the sequence into a format that

we can read and that we can process and that we can analyze so the system that i'm talking about today is

nanopore sequencing i've actually got one of these MinIONs here and it really is very very small it's about the size of a

stapler um and and this machine is able to do all of the things that i'm

going to talk about today we run about 24 samples on one of these at a time um but we also use this bigger system

called the GridION and essentially this is five of these MinIONs kind of sellotaped to a big

powerful computer so it just gives us the capacity to run a lot more samples at any one time

Benefits of Long Reads

so one of the big benefits of using nanopore sequencing is read length other sequencing techniques involve

first of all chopping the dna or the rna up into small segments and then doing what's called a short

read sequencing which means that we can only read say 100 to 200 base pairs at a time

and what this means is that if you wanted to do something like here like you wanted to look at e coli and sequence the entire genome of e coli

it's about 4.6 million base pairs so if you were using a 50 base pair read

you'd need 92 000 separate pieces of dna which you'd then have to stitch together

kind of like doing a jigsaw puzzle by using nanopore sequencing you can

theoretically pass the entire genome through in a single read

realistically it tends to not work quite that way dna is notoriously easy to degrade but

either way you're going to end up with much much longer sequences instead of 50 base pairs you're talking about 500

000 base pairs which means that rather than doing a 92 000 piece puzzle you're just doing a nine piece puzzle
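
The jigsaw arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes perfectly non-overlapping reads of equal length, whereas real sequencing needs overlapping reads and many-fold coverage. The function name is made up for illustration.

```python
def min_reads(genome_length, read_length):
    """Fewest equal-length, non-overlapping reads that span the genome."""
    return -(-genome_length // read_length)  # integer ceiling division


E_COLI_GENOME = 4_600_000  # approximate size quoted in the talk, in base pairs

short_read_pieces = min_reads(E_COLI_GENOME, 50)
long_read_pieces = min_reads(E_COLI_GENOME, 500_000)
print(short_read_pieces, long_read_pieces)
```

With 500 000 base reads the exact ratio is 9.2, so the "nine piece puzzle" is a rounded figure; strictly, ten whole reads would be needed.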

so this is one of the big benefits of using long reads and often this technique is used to kind of

fill in the blanks for very complicated genomes so genomes have a lot of regions within

them that are incredibly complicated to look at because there's lots of deletions and duplications and

repetitive regions and long read sequencing allows you to stretch across that entire region

so another big benefit of nanopore sequencing is its portability as i've shown the MinION sequencer is very very small

Portability

and can be taken with you to lots of different locations to do field sampling so people have taken it

to locations in antarctica uh taken it to remote locations such as snowdonia national park

and the ecuadorian rainforest and even up in space on the iss so this technology can be used

essentially anywhere provided that you have the equipment that you need along with you which is quite minimal in fairness

Severe Acute Respiratory Syndrome coronavirus 2 (SARS-CoV-2)

okay so you've probably all heard of uh COVID-19 and the virus that causes

it SARS-CoV-2 i think we're probably all reaching a point where we're sick of hearing about it in all honesty um but i am going to talk

about it a lot today i'm afraid so SARS-CoV-2 is a coronavirus uh and

essentially what that means is it's a hard shell uh with spiky bits

sticking out of it that look a bit like a crown and that's where the name corona virus comes from

essentially it's a solid shell uh with an rna genome inside it it's a very very

simple uh organism all told and the way that it works is that those spikes on the outside that we've that

we've all seen in these spiky ball pictures they're designed specifically they're

proteins that match exactly unfortunately to what are called ace2 receptors

within human cells so like a lock and key they can fit within the receptors many

of which are located within our lungs and they can bind onto those receptors and stick to the cells

when this happens there's what's called a cleavage site within the spike protein which splits the spike in two and

allows it to pass its rna genome into our cells at that point our cells essentially will

do exactly the same thing that they're used to doing when they see rna they'll take that rna and they'll convert it into a protein so

essentially this is similar to the virus sneaking into our cells as i say this is one of the cartoons i stole i

used for my uh my daughter's um school so they don't really sneak into our cells but they do

pass in it's almost like within a factory they're sneaking in plans for making other things onto the

foreman's desk so your cells will then start to make lots of versions of this virus so you'll

get lots of copies of the virus being formed and put together by your own cells and in doing this it can make you feel

very poorly and it can make your cells not work properly and this kicks off

your immune response and our immune response is obviously very uh um

much more complicated than shown here uh but essentially you've got two main types of uh immune cells

involved in the process uh b cells produce what are called antibodies and antibodies are kind of

y-shaped proteins which are designed specifically to stick onto things that you want to

get rid of and the creation of these antibodies so

that they'll stick on to a specific virus can take a little bit of time

which is why we use vaccines so vaccines can kind of prepare the body

to know what it might be looking out for so because the vaccine contains either

a version of the virus that's no longer infectious or as with the new mrna vaccines they

create a small portion of the virus essentially the spike protein is made within your cells it's enough to

pre-warn the cells in your body what to expect to see

so when SARS-CoV-2 when COVID-19 first kicked off and

STOP COVID-19

the lockdown hit in march last year the first lockdown i got an email to say

that the labs at the university were about to be closed that afternoon i gave a phone

call to my colleague sharon glaysher over at portsmouth hospital trust uh and

asked if she'd mind if i brought some equipment over to her lab so i ran around with gary scarlett in

the biology department trying desperately to grab everything that i could out of there before all the

doors were locked and we were no longer allowed inside and drove it all down to sharon's lab at

the hospital the translational research lab and that is kind of that's what kicked

off this entire project the STOP COVID-19 project we worked together with the hospital to

uh to prepare ethics statements to get funding which was kindly provided by

the university and from the cei to get us started essentially to be able to use our nanopore sequencing

technology in order to be able to to sequence the virus

from a number of different patients being seen at the hospital now the reason for this is the rna virus has

a genome of about 30 000 bases in length so it's about a hundred thousand times smaller than the

human genome to give you some context on that it's not very big and we're able to

generate a lot of data from this sequencing to be able to tell us exactly what

the complete sequence of that virus looks like now over the time as the virus passes from person to person

and as it it sort of sits in these transmission chains it it slowly mutates so you get very

small changes occur every time uh it copies from one person to another now the mutation rate's very small uh

it's about two mutations a month currently uh although i don't know if there's a more up-to-date estimate for that this is based on

uh the first wave um but essentially it's quite slow as far as viruses go and

by understanding these changes we can identify if a particular version of the virus

came from a direct transmission from somebody else or if it came as a completely novel

introduction to the area so the idea was to work in the hospital to look at cases

of the virus and help to understand where transmission chains were occurring so that we could help to break those

chains and help improve infection control within the hospital so we initially set out to sequence i think

uh 400 samples is what we aimed to do over the course of a year um

and this can give us a lot of information about how the virus is being passed from person to person

we can link it together with patient information to do epidemiological analysis and try and understand

community spread as well as spread within the hospital um and it can help us to identify new mutations uh

which may be linked to things like increased virulence they might be linked to poorer outcomes

for the patients or they might be linked to different symptoms that we might see over time
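
The idea of telling a direct transmission apart from a novel introduction comes down to counting how many positions differ between two viral genomes. The sketch below is a simplified illustration, not the project's method: it assumes the genomes are already aligned to the same length, and the cut-off of two differences is a hypothetical value chosen only for the example. Real analyses also weigh sampling dates and patient information.

```python
def snp_distance(genome_a, genome_b):
    """Count positions where two aligned genomes differ.

    Positions where either genome has an uncalled base ('N') are skipped.
    """
    if len(genome_a) != len(genome_b):
        raise ValueError("genomes must be aligned to the same length")
    return sum(
        1
        for a, b in zip(genome_a, genome_b)
        if a != b and "N" not in (a, b)
    )


def possibly_linked(genome_a, genome_b, max_differences=2):
    """Flag a pair as a candidate transmission link (hypothetical cut-off)."""
    return snp_distance(genome_a, genome_b) <= max_differences


# Toy sequences standing in for 30 000-base genomes.
patient_1 = "ACGTACGTAC"
patient_2 = "ACGTACGTAT"
patient_3 = "TCGAACCTAT"

print(snp_distance(patient_1, patient_2), possibly_linked(patient_1, patient_2))
print(snp_distance(patient_1, patient_3), possibly_linked(patient_1, patient_3))
```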

whilst all this was going on we ended up joining up with a large national scale effort called the

cog uk consortium the COVID-19 genomics uk consortium and that kind of changed the project

significantly in its scope so this is just some pictures of

everybody working in the lab so sharon uh nicely modeling our GridION there

along with other members of the team and me working very hard there in the bottom right that is how when i'm not on

a meeting with you guys that's how i tend to uh tend to dress so the way that it works

Sample Flow-Through

is people who uh have symptoms of COVID-19 um will go in and they'll get tested so

this could be either through what's called pillar one which is uh within healthcare settings so e.g. on

wards or in hospitals or it can be community testing so if you're feeling poorly and you get

onto the government website you contact them for a test and this is so called pillar two testing

either way generally speaking there'll be some kind of swab used although there's potential of moving to things

like saliva testing moving forwards from now and these go off to microbiology lab now

this could be the microbiology lab at the hospital uh where where we're currently working

or it could be one of what are called lighthouse labs which are about five different labs centralized and spread

across the whole uk which take large volumes of samples from community testing

and generate these tests some kind of test is used now that might be a pcr

based test it might be a different kind of test such as the point of care testing currently used

in the uh the student asymptomatic screening program and then this information passes on to

our clinical to the clinical care um and helps inform how that patient is is looked after

so we basically take as an offshoot of that the rna that's generated and we use that for doing our sequencing

so essentially what we're using here is is an offshoot of this testing process we run

our sequencing analysis we do a wide array of analyses

and then these data are made publicly available as well so almost on a daily basis these data are

being uploaded to a centralized database and being made publicly available so that they can feed into

the uk picture of what the virus looks like and also we feed back directly to

clinicians so i work very very closely with the nhs labs who submit samples with us

to help them to understand what's currently going on within the hospital

so the cog uk consortium the COVID-19 genomics uk consortium was set up by professor sharon peacock

COVID-19 Genomics UK Consortium

and a number of other pis who joined together at the very start in a very similar uh approach to how we got started to

this in that it was a simple phone call they decided to do it and before you knew it they'd obtained 20 million pounds worth

of funding for setting this up as a sentinel surveillance program for SARS-CoV-2 throughout the uk

and essentially the way that it works is it's a it's a distributed approach utilizing academic partners so there's

about 16 different academic institutes including us down in the south

university of oxford university of cambridge the wellcome trust sanger institute and then other universities throughout

the uk all working very closely with public health agencies and nhs organizations to

essentially try and create a almost real-time map of how the virus

is passing from person to person so we're doing it based on nanopore sequencing using what's known as the

ARTIC pipeline which i'll discuss a little bit in a moment uh other sites are using different sequencing technologies but essentially

the outcome is the same what we're able to do is generate these uh

sequences uh whole genome sequences from the virus from patients from across the

entire uk and feed it into a um a a map with significant coverage across

the uk to be able to answer specific public health questions but on a national scale so currently the

cog uk i've said here generated over 250 000 high quality genome sequences i think actually today it's probably going to be

more like 300 000 it's both an impressive number but also a very sad number that

there are so many cases that are able to be sequenced um

but the system that's been generated by the cog uk is world leading at the moment and

and really what's being done by the cog uk is probably making up

a very large share possibly even around half of all global sequences

generated and made publicly available and it's this publicly available

information that's allowing us to fully understand how the virus is spreading how it's changing over time and importantly how

it's developing and the impacts this might have on things like vaccines

so just to give you a a bit of context obviously out of the 300 000 samples we initially set out to do

Throughput

400 samples um we've actually done about eight and a half thousand so far and we're at a

stage where our throughput has increased so significantly uh since the new year that we're probably doing about 700 plus samples a

week at the moment um we have about 10 different nhs trusts submitting samples to us regularly

and then we also get a lot of samples from the lighthouse labs which see tens of thousands of samples every day

and then as i say all of this data is public it is uploaded and then made publicly available through

mainly through the GISAID website which is the global initiative on sharing all influenza data which as

it sounds was set up to to share data on influenza but has now been

repurposed to make it the uh largest repository of SARS-CoV-2 genome data in the world

Geographic Coverage

and then this is just to give you an idea of the coverage that we have in our site we receive samples

across the south coast and the south east with new sites coming on board all the

time and and really at the moment the coverage of the cog uk consortium is quite significant

so this is maps of the uk showing um the proportion of positive cases that

have been sequenced throughout the uk and on the right we can see our weekly results and

and more and more areas are becoming darker which is what we want to see we want to try and

uh get more complete coverage over these regions and you can see wales in particular has an incredibly good system in place

for uh for managing and sequencing all cases that come through there

Naming conventions

so i just want to take a brief i'm going to be i'm going to be naming a lot of things and there's a lot of confusion around

the naming of different uh variants or lineages or whatever you want to call them now

the difficulty comes that there's no hard and fast rule for when you start calling it a

new lineage or a new variant one thing is that a lot of people use the word strain there

there is definitely only one strain of SARS-CoV-2 at the moment uh there are just different versions of

it so lineage and variant tend to be the phrases that are used most often and they're often used interchangeably

i'm going to say lineage from now on and this just gives you an idea of how the naming is used so i use the naming

convention suggested by andrew rambaut and colleagues and andrew is one of the

lead pis on cog uk and uh the way that it works is it's

kind of like a family tree so what we can see here is what's called a phylogenetic map um and here what you're seeing is over

on the left is when things were identical to one another so this is your ancestors so if you were

to trace back your family tree adam and eve would be over here on the left and then

as the family tree diverges over time we can see that it branches off into new

subsets so here we can see that the a subset is quite distinct from the b subset

because they differ from each other by if we come down here about two nucleotides so there's two

bases different uh in common between all of the ones that are in the a lineage uh sorry in

the a lineage and all of the ones that are in the b lineage and then that continues down and down and down and the way that

it works is that we name things so that everything that starts with a b is within this large purple group

but then this b is split up into B.2 and B.1 and then B.3 four five six

and seven and eight nine and then that's split down even further into B.1.5 here

and B.1.1 here which is much larger and what's confusing is that as we get

more and more cases this phylogeny develops over time so the naming conventions are trying to

be kept as uh consistent as possible but often you'll find that this B.1.1 lineage

actually consists of two distinct lineages so down the line this might split off

into two distinct lineages that are named differently it's all very confusing and nothing i

can say will make it any less confusing i'm afraid but that is the convention and that's what what i'll be talking about so when

i say B.1.1.7 and B.1.177

they're both similar to each other down to the B.1 location but then they have diverged off so

you may have heard in the news about certain variants of concern that have been highlighted by phe uh and

these are the three main ones at the moment that are uh the ones that phe are largely focused on

trying to identify over time more will occur over time more will be identified uh and

over time there'll be more things that we are keeping an eye on to make sure that they're not

uh causing significant problems but the main ones are the B.1.1.7 which is the

so-called kent lineage uh that came about just before christmas there's B.1.351 which is the

so-called south african lineage and then there's also P.1 and you'll notice here that this doesn't start with

a b and this is because once you reach a point beyond having three points within uh the naming

it then resets back to a different letter so this is actually B.1.1.28.1

i think is how uh the naming has occurred but then it's been renamed P.1

to avoid the numbers becoming endlessly long so there's obviously a

little bit of uh issue with naming them around geographical locations

which is why i'm going to try and stick to these ones where i can but just so that you have some context over which ones i'm talking about based

on what's been in the news recently no hang on there we go
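
The family-tree naming described above can be handled in code by splitting a lineage name on its dots and comparing the pieces from the left. The sketch below is illustrative only: the helper names are made up, and the alias table holds just the single P example from the talk, on the assumption that P stands for B.1.1.28 so that P.1 is B.1.1.28.1.

```python
# Assumed alias: P is shorthand for B.1.1.28, so P.1 expands to B.1.1.28.1.
# A real alias list would hold many more entries; only this one is used here.
ALIASES = {"P": "B.1.1.28"}


def expand(lineage):
    """Replace an aliased first letter with the full name it stands for."""
    head, _, rest = lineage.partition(".")
    full_head = ALIASES.get(head, head)
    return f"{full_head}.{rest}" if rest else full_head


def is_descendant(lineage, ancestor):
    """True if `lineage` sits strictly below `ancestor` in the naming tree."""
    child = expand(lineage).split(".")
    parent = expand(ancestor).split(".")
    return len(child) > len(parent) and child[:len(parent)] == parent


print(expand("P.1"))
print(is_descendant("B.1.1.7", "B.1.1"))
print(is_descendant("B.1.177", "B.1.1"))
```

Comparing the dot-separated parts rather than the raw text matters: B.1.177 starts with the characters "B.1.1" but is a child of B.1, not of B.1.1.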

so really of more interest at the moment isn't necessarily the variants and the lineages themselves

but specific mutations that might have uh some interesting properties or some

concerning properties so there are certain mutations and the way that they're named

uh such as here D614G these are on the spike protein and where

i said earlier on about proteins being made from uh sequences of amino acids being stitched together

what this is telling you is that the 614th amino acid that makes up the spike

protein is normally an aspartic acid which is called d but in this particular mutation

uh it's actually become a glycine g so uh there was an article in the

guardian recently uh about some of the names that people have been giving these and and you can see some of the uh highly

amusing i'm sure names that that scientists have been giving some of these mutations just to add a bit of levity to the process of

analyzing these data but i'll be talking in a little bit more detail about the D614G mutation in a

moment because that's quite an interesting one the N501Y mutation is one of the

defining characteristics of the B.1.1.7 lineage and then at the moment E484K

is actually one of the uh most concerning variants because it's been linked with uh antibody escape and there's some

concern over what impact that will have uh on vaccine efficacy
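
The mutation names mentioned here all follow the same pattern: the usual amino acid, its position in the protein, then the amino acid it has changed to. A short Python sketch can pull those three parts apart. The function names are made up for illustration, and the lookup table only covers the six amino acids that appear in the three mutations discussed.

```python
import re

# One-letter codes for the amino acids in D614G, N501Y and E484K.
AMINO_ACID_NAMES = {
    "D": "aspartic acid",
    "G": "glycine",
    "N": "asparagine",
    "Y": "tyrosine",
    "E": "glutamic acid",
    "K": "lysine",
}

MUTATION_PATTERN = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_mutation(name):
    """Split a name like 'D614G' into (original, position, replacement)."""
    match = MUTATION_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a substitution name: {name!r}")
    original, position, replacement = match.groups()
    return original, int(position), replacement


def describe(name):
    """Spell a mutation name out in words."""
    original, position, replacement = parse_mutation(name)
    before = AMINO_ACID_NAMES.get(original, original)
    after = AMINO_ACID_NAMES.get(replacement, replacement)
    return f"position {position}: {before} -> {after}"


for mutation in ("D614G", "N501Y", "E484K"):
    print(mutation, describe(mutation))
```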

Sequence Read Data

okay so going into the reading i don't want to dwell too much on this but this is just how we go from generating data to

having some idea of what the genome looks like so at the top here this is from left to right this is the entire

genome of SARS-CoV-2 and then each of these blocks is a set of reads

that have been generated and mapped back to the genome what that means is that we found we've got a sequence of letters and we

found the best place along this entire genome where that sequence of letters sits

in that exact order and if you notice they're split up into

two groups so we have an odd group and we have an even group and we use a process of amplification

using what are called primers which are short sections of dna that can kick off that amplification

process from a specific point so we use primers that will specifically

amplify just this short section of the dna sorry of the rna

and we split this up into two different pools we add all the odd numbered ones together into one pool

and all the even numbered ones into another pool because there is a slight overlap between the two and just to make sure that there's no

overamplification at those overlap stages but essentially what we do is we've tiled the entire SARS-CoV-2 genome

by about 100 different amplicons short amplicons of about 500 bases
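
The tiling and the odd/even pooling can be pictured with a small Python sketch. The numbers are assumptions chosen to match the rough figures in the talk (a genome of about 30 000 bases, amplicons of about 500 bases); the 200-base overlap is a made-up value, and none of this reproduces the actual primer scheme.

```python
def tile_amplicons(genome_length, amplicon_length, overlap):
    """Return (number, start, end) tiles covering the genome with overlaps."""
    step = amplicon_length - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than the amplicon length")
    amplicons = []
    start, number = 0, 1
    while True:
        end = min(start + amplicon_length, genome_length)
        amplicons.append((number, start, end))
        if end == genome_length:
            return amplicons
        start += step
        number += 1


def split_pools(amplicons):
    """Put odd-numbered and even-numbered amplicons into separate pools."""
    odd = [a for a in amplicons if a[0] % 2 == 1]
    even = [a for a in amplicons if a[0] % 2 == 0]
    return odd, even


def overlaps_within(pool):
    """True if any two neighbouring amplicons in the same pool overlap."""
    return any(
        previous[2] > current[1] for previous, current in zip(pool, pool[1:])
    )


amplicons = tile_amplicons(30_000, 500, 200)
odd_pool, even_pool = split_pools(amplicons)
print(len(amplicons), overlaps_within(amplicons))
print(overlaps_within(odd_pool), overlaps_within(even_pool))
```

Neighbouring amplicons overlap on the genome, but within each pool they do not, which is the point of keeping odd and even amplicons apart.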

and what we're interested in if we zoom in a little bit closer we can see anywhere that it's gray means

that it matches exactly to what we call the reference sequence so uh we use the very first SARS-CoV-2 genome

that was ever generated from a patient in wuhan china we use that as our reference which is kind of

the oldest ancestor that we have access to and we map against that and we find for

the most part they all match every now and then you'll see some colored sections and the colored dots

here and there mean that there's a nucleotide that's different in the read than it is in the reference sequence

but what we care about is not these ones that are kind of randomly distributed but we do care about things like this

where we see a consistent location where there's a difference in our reads

compared to that reference and if we zoom in even closer we can see that what we have

here is we have what should be an a is actually coming out as a g for the

vast majority of our reads so this is a mutation that is seen consistently across our reads and that

means that it's real and not down to some error in the sequencing and in this particular case this is

affecting uh this aspartic acid d so this is that D614G mutant that i mentioned

D614G Mutation • Mutation in the spike protein

earlier so uh as i mentioned it's mutating an aspartic acid to a glycine

at position 614 in the protein and then here you can just make out this is the spike protein

as it sits and then this is where that mutation lies and what it does is it slightly opens up the structure of

that spike so it's still able to bind to ace2 receptors but because of the structural change

it's actually able to bind to ace2 receptors slightly better and this was one of the very

first mutations that really came to prominence in the analyses that we were doing uh through cog uk

because back in january and february there were zero cases where uh this g existed however from

march onwards it very very quickly took over until actually by june almost a hundred percent of

every single virus we sequenced had the g form of the virus rather than the d

form um so douglas if you if you will um and some work that was

done by phe showed that this g-form

mutant was actually able to bind better to ace2 receptors than the

d-form of the virus which is why this one became a very prominent mutation to focus on to try and

understand whether or not that was having a significant effect on transmissibility of the virus
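
The distinction drawn above, between scattered mismatches that are probably sequencing errors and a difference seen at the same position in most reads, can be sketched as a simple majority call. This is a toy illustration under stated assumptions: the reads are already aligned with no insertions or deletions, the function name and the `min_depth` and `min_fraction` thresholds are invented for the example, and the short sequences are made up rather than taken from the real genome.

```python
from collections import Counter

# Toy sketch of consensus variant calling: a position is reported only when
# the non-reference base is carried by most of the reads covering it.
# Thresholds and input format are assumptions for illustration.

def call_variants(reference, reads, min_depth=10, min_fraction=0.75):
    """reads is a list of (start, sequence) pairs aligned without gaps."""
    pileup = [Counter() for _ in reference]
    for start, seq in reads:
        for offset, base in enumerate(seq):
            pileup[start + offset][base] += 1

    variants = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to trust any call here
        base, n = counts.most_common(1)[0]
        if base != reference[pos] and n / depth >= min_fraction:
            variants.append((pos, reference[pos], base, n / depth))
    return variants


reference = "TTGATGCC"          # made-up reference fragment
mutated = "TTGGTGCC"            # same fragment with A -> G at index 3
reads = (
    [(0, mutated)] * 18         # most reads carry the G
    + [(0, reference)]          # one read still shows the reference A
    + [(0, "TAGGTGCC")]         # one read has an extra, isolated mismatch
)

variants = call_variants(reference, reads)
print(variants)
```

The isolated mismatch at index 1 appears in only one read out of twenty and is ignored, while the A to G change at index 3 is carried by almost every read and is reported.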

Variant B.1.1.7

but then come christmas that was very quickly overshadowed by the appearance of the new

variant b.1.1.7 so this variant arose in the southeast of the uk

in i think end of november was the first case that was seen and it actually now accounts for

basically 100 percent of everything that we see so almost everything that we see at the minute is the new variant

it has very very quickly taken prominence over all other um lineages of the virus and this one's

very interesting so i just realized the video hasn't started playing

one second this was a video generated by professor john mcgeehan

who was able to model where in the spike protein one of the particular

mutations associated n501y is located to show the effect that this

mutation can have on the spike protein but what's interesting is that this one has a large number of mutations uh

compared to its closest ancestor so there's 17 mutations eight of which

are on the spike protein and when you look at this sample i'll show you in a moment of where this sits on the family

tree of the viruses it really does stick out like a sore thumb so whereas most versions of the virus

are uh sequentially uh generated from previous versions this one seems to have generated a huge

number of mutations uh completely independently of anything else so the current working hypothesis is

that this version of the virus is probably uh was probably contracted and then

mutated within somebody who had a very long case of covid uh and possibly somebody who was

undergoing convalescent plasma treatment so that it was able to not only mutate but mutate

specifically to account for um the changes that were being implemented

on it through the treatment so most of the analyses we've recently been involved with

an analysis that's recently been submitted to nervtag

and as a paper looking at whether or not this version of the virus

is associated with increased severity of disease and that that's part of also large-scale

surveillance from phe along with some of those other versions as well

and then we can track it across the world as well obviously it's largely known as the uk variant we do see it in the uk but it

has spread very significantly across the world and likely will continue to do so

over time and a lot of work is currently being done on linking in flight travel plans

and how international travel links in with transmissibility of the disease

i realize i skipped through one explanatory slide as well so this one is

From Mutations to Lineages

just to quickly show how we use these mutations so this is just an example what i've done here

all of these at the bottom are possible mutations that might exist in these samples

and then each row is a different sample if it's in red it means that it has that mutation if

it's in gray it means that it has the reference version of that particular snp

and what you can see is you can see groupings of samples together so you can see a group of samples here that share

this group of snps but actually three of them have an extra snp here that the other

ones don't so it's this type of information that we can piece together to try and understand

well okay these three individuals may be part of a shared transmission chain but it's unlikely that these samples and

this sample came from the same transmission chain because it would have had to have lost

one mutation and gained another
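
The reasoning above, that two samples are unlikely to share a transmission chain if one would have had to lose a mutation and gain another, can be sketched as a subset check on each sample's set of mutations. This is a deliberate simplification with hypothetical sample and mutation names; it ignores missing sequence data and the rare cases where a mutation genuinely reverts.

```python
# Sketch: two samples are treated as compatible with one direct chain only
# if one sample's mutations are all present in the other (mutations were
# only gained along the way). Names below are placeholders, not real data.

def compatible_with_chain(profile_a, profile_b):
    """True if one mutation set is a subset of the other."""
    return profile_a <= profile_b or profile_b <= profile_a


samples = {
    "s1": {"m1", "m2", "m3"},
    "s2": {"m1", "m2", "m3"},
    "s3": {"m1", "m2", "m3", "m4"},   # shares the group plus one extra snp
    "s4": {"m1", "m2", "m5"},         # lacks m3 but carries m5
}

print(compatible_with_chain(samples["s1"], samples["s3"]))  # extra snp only
print(compatible_with_chain(samples["s3"], samples["s4"]))  # loss and gain
```

Here `s1` and `s3` differ only by one additional mutation, so they could sit on the same chain, whereas linking `s3` and `s4` would need a loss and a separate gain.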

and then so this is analysis now just looking at numbers of cases so this

The National Picture

is just a quick analysis that i ran last night just to update the case numbers that have been

seen and what i've done is within each region of the uh of the country i've

uh over the entire course of the pandemic so far i've taken the number of cases and

normalized to the number of residents within that area and then i've lined everything up so

that it's based on the earliest occurrence of the biggest peak and what you can see here is that for

the most part peak one has paled into insignificance compared to peak two and peak three and peak two was

much higher in cases in places in the north of england compared to the south of england

but in those areas in the north the third peak was lower than what was seen in the second peak so

these areas of the country were places that saw local lockdowns before the complete lockdown in the country was

seen and we can see that following lockdown uh in in december

we can see case numbers have dropped significantly and that's continuing to go in in the right direction as we move

forwards as well okay this is the sort of crux

slide of what i wanted to show today it's quite busy so i'm going to go through it step by step

this is using a program called microreact which is a place that you can go to yourself and explore all the data that

we've generated through the cog uk database through the cog uk program and all the data that we generate is

made publicly available and then is incorporated into a variety of tools including microreact

and you can explore those data so the first plot on the left is going to identify different lineages

throughout the country so this is looking across the entire pandemic and it's

got a pie chart within each county uh showing the distribution of lineages within that region

and interestingly when you look at this plot from the first wave

there were much bigger differences between different areas of the country so the north had its own

distinct set of lineages compared to what was going on in the south for instance now it's a lot more standardized across

the entire uk and in particular the b117 lineage now accounts for so many cases

which is what we can see here so in this case the green plot this is essentially a

a pie chart for each week stretched out into a bar chart

and then pieced next to each other so what we can see is all of these colors represent a distinct lineage

and we can see that in the first wave there were certain lineages that were more enriched than others but this didn't

really change too much over that period of time but since october we've seen an increase

in one lineage this increased since summer started going up this was the d614g

mutant which was b.1.177 was the particular lineage being shown

here and that one was taking over as the most dominant lineage but then come december the b117 lineage

came in and very very quickly started to become the most dominant seen across

the country and you can see now that in the most recent data over 95 percent of every single case

within this period of time was a b117 lineage and you can see that broken down

a little bit more here so you can see the number of genomes and the the time distribution of those

genomes so again we can see that b.1 was quite significant in the first wave but actually

seems to be quite low these days there are other lineages like the b lineage which is the ancestral lineage which was

high in the early days but now is almost never seen whereas we can see these cases that are

new versions of the virus and in particular b.1.177 came about

just after summer and then there are other offshoots of this like 177 177.4

where we seem to have a large number of samples and then the plot on the right is

showing you the family tree the phylogeny but with the specific lineages identified in

colors and this green set here is the b117 lineage

and you can see whereas everything else they all kind of sit in amongst themselves there aren't really any what

we call long branches involved here the branch from its closest neighbor for b117

is significant is very very long so the longer this is the more different it is from its

closest neighbor so you can see that the b117 lineage jumps out and is is very very different

from everything else that we see and on the right here i've colored it uh based on whether or not

it has certain uh mutations so you can see the specific mutations

that are uh the d614g mutant is largely present in everything that we see here

uh but then there's the n501y uh and also a deletion of two amino acids

that's uh seen on the b117 lineage and all these data are available for you

to go in and play around with you can uh even start to look at specific

B.1.351 Prevalence

uh variants of interest so if you wanted to see what the prevalence of the new 117 variant looks like you

can limit it just on that and you could even look at a timeline so you can play a little video that will

show you where the first cases were seen and you can see it starting down in the southeast and gradually spreading around

the country and similarly you can do the same with the 351

variant the south african variant you can see here that there's been 218 cases identified all of which are tracked very

closely by phe um and many of which have been seen in london and you can track over time where

those have been located and you can see here that unlike 117 it doesn't jump out like a sore thumb it

just happens to have one particular mutation of particular concern which is the e484k mutation

The Local Picture

and we can zoom in as well so we can take a look at our local area uh and and try and get an understanding

of what the virus pandemic has looked like within our region so this is uh very roughly looking at uh

some of the areas that have been sequenced by our own lab um and and you can see that it very much

mirrors what's seen across the country with in particular the 117 variant being the most significant case but what's

quite interesting is you can see that actually there's a lot of them which we didn't see in the area until recently

and these all sort of cropped up at around the same time um whereas other cases uh in the first

wave we probably started with uh only a handful of different

local variants which accounted for most cases that we saw

and then this was a a small graphic that's been put together by uh simone gunto who's interested in

approaching the work that we're doing the data-driven work that we're doing approaching it from a a different

perspective of uh creative and artistic vision so what she's done here is she's

taken a very rough geographic location of the cases that we've seen

and used colors and shapes to indicate the different lineages involved

and what she's trying to do with this this is just a very uh early early stage rendition of this what she's

interested in doing is seeing how this looks from a creative perspective

and using this information you can actually get a lot of information so we can kind of see how these variants have changed over

time how they've spread around the region and the more that we do this and the more sequencing we're able to do from

local cases in the community in particular uh the better picture that we'll have of how

that spread has occurred over time uh i just wanted to highlight a few

COG UK Mutation Explorer sars2.cvr.gla.ac.uk/cog-uk

tools as well that you can use to go and investigate these data yourselves so um

there's the cog uk mutation explorer which has recently been generated to allow you to go in

and really start to explore these data and understand which are the most important mutations that you should be

aware of so if you are if you are at all interested in exploring these data i recommend going and having a look at

this because there's a lot of different things that you can learn so here you can see the uh the mutations of most importance

currently which are that 69-70 deletion and the n501y mutation that i mentioned earlier

and both of them together are defining of the 117 lineage and you can see obviously a large number

of uk sequences have this particular set but there's also a subset now of b117 which also has

that e484k mutation so this is another variant of concern

that's been identified there haven't been many cases so far but this is the kind of thing that we

are keeping a very close eye out for to make sure that as we sequence these samples

every time a new sample is generated we check to see whether any of them have the characteristics

of a lineage that should be flagged up to phe or chased up by the hospital or some

kind of track and trace should be put into place to try and understand uh who may have been in contact with that individual and that's where those

close relationships with the nhs sites really come into play

um and it lets you explore in a lot more detail so you can look over time at how the samples that have been

generated uh which particular mutations they have so if we look here at that 501

variant we can see that across the time course almost all of them in fact all of them

did have the wild type version right up until just before christmas at which point

we started to see an increase in those with the uh the y form of n501y and until now where

almost 100 percent of cases now have that particular version so you can go in and explore these in

the visualizer and a lot more information there's also links to antigenic information so it links to antigenic

databases which have looked at which antibodies might be escaped by

certain variants to try and understand which of the variants that we should be focusing on try and identify ones that

we should focus on in case they might have some link to uh potential vaccine dropout

Cluster Identification

and then in terms of feeding back with nhs sites and making a direct impact on

their infection control procedures what we try and do within local cases is we try and identify clusters of cases

so it's a bit of a busy picture but just to give you a brief idea of what you're seeing here these are

several hundred cases and it's the same along the rows as along the columns but

what we do is we work out how similar they are to one another and to do that we look at the mutations

associated with both samples and if they have exactly the same mutations as one another

it gets coloured blue if they have completely different mutations from one another it gets coloured red and anything else

gets the colour in between so what we're interested in are these clusters these groups of samples that

are almost identical to one another and it's those clusters which are most likely to be part of

a a single transmission chain and in these cases we can work with the hospital

to incorporate epidemiological studies to try and understand are they all seen on the same ward as

one another uh were they all part of the same cohort as one another

were they all seen by similar healthcare workers trying to understand why that particular

group of individuals has a single transmission chain associated with them
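
The sample-by-sample comparison described above, where identical mutation sets score at one end of the colour scale and completely different sets at the other, can be sketched with a pairwise distance matrix followed by a simple grouping step. The talk does not say which distance or clustering method is used, so the Jaccard distance, the 0.25 threshold, the single-linkage grouping, and all sample names below are assumptions chosen for illustration.

```python
# Sketch: build a symmetric distance matrix over samples (same order along
# rows and columns), then group samples whose mutation sets are identical
# or nearly identical. Metric, threshold and data are assumed.

def jaccard_distance(a, b):
    """0.0 for identical mutation sets, 1.0 for sets with nothing shared."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def distance_matrix(profiles):
    names = sorted(profiles)
    matrix = [[jaccard_distance(profiles[r], profiles[c]) for c in names]
              for r in names]
    return names, matrix


def find_clusters(names, matrix, max_distance=0.25):
    """Single-linkage grouping: link any pair closer than max_distance."""
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for j in range(i + 1, len(names)):
            if matrix[i][j] <= max_distance:
                parent[find(a)] = find(names[j])

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    # Keep only groups with more than one sample.
    return sorted(g for g in groups.values() if len(g) > 1)


profiles = {
    "p1": {"a", "b", "c", "d"},
    "p2": {"a", "b", "c", "d"},
    "p3": {"a", "b", "c", "d", "e"},
    "p4": {"x", "y", "z"},
    "p5": {"x", "y", "z"},
    "p6": {"q"},
}

names, matrix = distance_matrix(profiles)
clusters = find_clusters(names, matrix)
print(clusters)
```

Each group returned is a candidate cluster that could then be checked against ward, cohort and staffing information, as described above.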

um and then just finally just to mention about the work that's being done by the university on student testing

COVID-19 Student Testing Program

so uh back last year things started with uh the setting up of the pillar two

testing site within the elden building car park um where if you needed to get a a

community test you could go in there and get tested through the lighthouse lab process um but then last year working in

collaboration with ntl biologica the university set up an asymptomatic screening process so

while the elden building is for people who are feeling poorly and want to understand if they've got covid or not

the asymptomatic screening is a more preventative

method to try and identify cases of covid-19 from people who don't realize

that they've got it so this started off with uh doing both pcr and

along with portsmouth hospital university trust and lfd testing sorry which is a lateral flow device

which is a system that picks up whether or not you've got active versions of the protein

the spike protein currently present within you at that time um but that was developed with

the rollout from the government of increased testing for students like at

the end of last year just before christmas and now the spinnaker sports hall has been set up

for an asymptomatic screening program using just lfd testing

and we were originally sending off positives for confirmatory testing from pcr

at the hospital but actually now that's no longer necessary but for those cases where samples were

sent for pcr testing at the hospital we were able to incorporate those into our pipeline for sequencing

and using that information we've been able to do various analyses which we're currently in the process of finalizing

these have been fed back to sage in a recent report to sage last week um along with a

uh uh a report that's been put together through the uk on understanding uh the role of students

in transmission of the virus and kind of one of the main take-homes from this is that

all the positive cases that we've seen are actually part of only a small number uh of specific uh clusters of cases

and most of those were seen in the early days of uh um early days of

university students returning to the area but very very quickly the infection control procedures put

into place by the university reduced cases significantly among

students until students were very much lower transmission risk than others of the same age group

within the community and then oh just finally as well this is just another uh project that i've been working with

on um with southampton hospital uh southampton general hospital

which is a more direct involvement of the sequencing in infection control so it's based on looking for

cases that are of hospital onset covid-19 infection or so-called hocis what we do is when a hoci is identified

by the hospital it gets sent to us and we generate a report very very quickly within 48 hours of the

sample making its way to us and then feeding that information back to the infection control team at the

hospital so that they can use that information to understand if they're part of larger outbreak clusters or if there's

epidemiological evidence for how they've caught that infection within the hospital and

ultimately what we'll end up doing is we'll be comparing uh this intervention of providing this report back within two days

uh reporting back but doing it much slower and then not reporting back

at all to try and understand whether or not this uh this direct feeding back for infection control can have a large impact

on um cases at the hospital so just in conclusion that that's a very

Conclusions

broad analysis of what we've been doing over the past year with the covid-19 pandemic but really

the cog uk consortium it really is a vital part of the uk's response to the pandemic

and we're just a small cog in that cog and it's really important for this

process to continue so that we can continue to trace the virus as it spreads across the uk and

in particular to be able to to react quickly to new potential mutations of concern

and variants that might crop up in the future so this sentinel surveillance uh is being used

as well when looking for tracking of importation so as we come out of lockdown trying to understand if increased

transportation of people both due to international travel but also just to local travel

can have an effect on the sorts of lineages that are brought into the area

and then we're continuing to work closely with our nhs partners with public health agencies

to ensure that this ultimately goes into to improve patient care and and help

limit infections where possible and that sort of genomic epidemiology

is really key and it's certainly one of the uh one of the shining lights of the uk

scientific community of what's been able to be achieved using this process and our ability to do this rapidly like

we've done with hoci has been used throughout the consortium to help identify and control outbreaks

in a lot of different settings be they hospitals or workplaces or care homes and we also work very closely with other

studies such as siren genomicc react and hoci all of which are working very very closely with phe

to to try and uh really feed this information back and incorporate it into patient care moving

forward and then this was just a a graphic generated by alex kagan to

highlight the work done by the cog uk it's a little bit dramatic but

i thought it was quite a nice case from from that initial meeting in the bottom left uh right to uh saving the world from

covid i guess is what's going on in the top there but there's still a way to go yet but i think there's a lot of very very

positive uh uh a lot of positive coming out of the vaccine results coming out at the moment

uh numbers are going down in the right direction and hopefully they'll continue to go down

and really the work that we're doing along with other colleagues from across the uk will help ensure that we'll soon get to

the end of this pandemic and just to thank everybody that's involved um there's far too many people

Acknowledgements

uh to go through individually but particularly angie and sharon who i set this up with in the first at

the start and it really was just the three of us kind of against the world to start with and before you knew it

that that number of people increased dramatically and we're at the stage that we're at now

and just thank everybody from all the different nhs sites and from cog uk and uh thank you all for listening thank

you and thanks very much to you sam for this quite amazing work very impressive in

terms of network as well you see because all the themes that you've listed at the end of your presentation

are extremely telling of the very very impressive network one more time

that you've developed across the country and certainly internationally i did not interrupt you because i saw

that everybody was following everybody was interested we received a lot of questions so i kept

you know let you talk as much as you wanted and i think as well you've done a great job in making

something extremely complicated i'm sure it is relatively simple so that we all understand so you have questions

um a lot of questions actually and so i'd like us to take at least 10 minutes to go all through to

go through all these questions so the first question is from mona

just wondering how do you get around processing potential positive covid samples

in the open air because she thought that sorry the question just moved how do you

do that in the open air and correct me if i'm wrong i thought it

should be a class 3 organism that will have to be processed in class 1 biosafety cabinet

so yeah so first of all because we're working so closely with the hospital we don't ourselves process covid

positive samples so everything that we receive from our

submitting sites is rna only so we only receive it after the rna has been extracted

um so by that point it's entirely non-infectious that there's no risk of contamination

uh from from those samples um also they have um class three

facilities at the hospital for dealing with such cases and then similarly it was actually

dropped down to class two working so you can deal with covid positive samples

even primary samples primary swab samples at class two with certain

restrictions put in place so um that was largely done i think just to

because so many places needed to be able to process these samples it needed to be a a safe system needed

to be put in place that could be done at class 2 by many many more places that don't have class 3

facilities but with no risk of infection so actually most places have been working under class 2

for the majority of the pandemic thank you very much for that sam a

question from gary and i think it's a question we all have in mind there is some discussion about the general evolutionary trend of pathogens

to become more contagious but less lethal less symptoms as well

clearly you show mutation causing increased and again the questions moved causing

increased issue uh contagion is there any evidence of

reduced lethality or is it um in fact the reverse happening

so this sort of you know nexus between contagion and and lethality yeah so i mean i i'm

gonna preface this with saying i'm not a virologist i have learned a lot about virology in

the past year um but certainly my my feeling would be that the the as far as evolution goes

the optimum uh evolutionary state of a of a virus would be so that it didn't

cause any problems at all to the host so coughing is very useful because it helps spread the virus to other people

um but killing the host is not ideal um in a lot of ways sorry that came off

sounding far more callous than i meant it to but um it's as far as the virus is concerned uh

it doesn't help it to spread which is ultimately all a virus is it it just wants to pass its genetic material from

one person to another um so my my feeling would be with yours

gary that actually if anything we should see that mutations in the virus should result in it becoming less lethal

not more lethal um there was some work released

recently uh that showed for the new 117 mutation uh that there was some potential

increase in um uh lethality and uh but this could be

largely a result of its increased transmissibility uh and also to do with kind of the sorts

of people that it's likely to affect so most of this was done in the community rather than in hospital something

so the work that we've been doing with the hoci trial looking specifically at those in hospital

and in particular those that are suffering the worst from covid-19 should add to that

body of evidence and that paper should be uh released in the not too distant future i think

um but i think it sort of depends as well i mean you know evolution isn't uh

it's not it doesn't know what it's doing these things are just happening by chance

but in response to some stress that's being put on it we think in this case that stress has been treatment with convalescent plasma or

something similar um so really its main

change for the 117 mutation is to just become more transmissible to to sort of help evade those uh

uh the results of using the convalescent plasma um but yeah i think time will tell with

it i mean it's only been a year i think we're going to understand a lot more and hopefully over time

it will if anything get less severe rather than more severe but you only

have to look at flu to realize that it's an incredibly deadly disease year on year um so it hasn't disappeared

over time and and i don't think this one will either unfortunately thank you so much again

for that so a question from karen who stresses a sort of paradoxical situation i think we all

understood that we are in this country in a paradoxical situation aren't we we are world leading for the sequencing

and yet at the same time it seems that we have many cases and the system is not really facing and

right so question from karen we call this mutation of the virus the british variant

is that because it was the it was first discovered here is that because of good research or is

it because it was the first first generated here so this sort of paradox again

yeah that is a very good question i mean it's entirely possible that actually it first came about in another country

and we just happened to detect it here first in kent because it's spread to kent and our approach is

uh quite rapid at identifying these things that's entirely possible and it's

difficult to say one way or another until um you know i i guess

as more countries increase the rate of genomic surveillance that they're doing which many of them are

um i think that we'll start to see to be able to fill in those blanks from outside of our

country of our nation thank you so much i can't say for sure

i'm afraid well yeah as you said suppose we need more time to to understand better all these things another question we all

have in mind i suppose from cressida will vaccination programs drive the evolution of escape

does the timing of delivery play a role um good questions uh i mean

again as i say i'm i'm it's a little bit outside of my my knowledge space but i think my answer

would probably be yes if anything's going to going to lead to adaptation

mutations it will be a systematic vaccine approach but similarly with

influenza we have the same situation with influenza and it's relatively straightforward

to keep on top of in terms of vaccination with a seasonal vaccine rollout so i mean again time is going to

tell as we go forwards but that's very much kind of what we're what we're modeling and what we're

keeping track of with these data is to try and catch these things as early as possible

um so by understanding the sorts of you know it should be said as well a lot of these mutations occur completely

independently of one another so e484k for instance has cropped up multiple times independently of one

another um and it's those types of mutations that are cropping up

in response to something rather than just due to random randomness that are the ones of most

interest and i i think we've got a very good system in place and it's developing

rapidly over time as well so the coverage that we get is increasing all the time uh the tools available for linking these

data together and and understanding what they mean and how they impact uh things like the vaccine rollout uh i

think all of this is yeah it's a time will tell situation i remain very positive about

the future moving forwards uh but i like to consider myself an optimist so i don't know if that's a good thing or not

well that's that's a very good thing surely sam is going to be the last question if you don't mind from me

if you project yourself in the future in your superman spiderman costume what do you see next how are you going

to use this sequencing the work that you've done maybe to apply it to

other viruses or for cues how do you see that i i think

that the work that's been done by cog uk uh is is has been very much

seen by the uk government as having massive impacts and massive uh positive

impact on the covid-19 pandemic so my feeling is that we will see something

similar become a more general uh pathogen surveillance operation um

and i think that there's a lot of you know despite everything despite how bad everything's been over the last year i

think there have been the occasional positive thing to come out of it and one of those positive

things has been the introduction and uh making so ubiquitous of things like

sequencing and other high throughput approaches uh within clinical science so i think that moving

forward now so many places have been set up to allow for

these kinds of technologies to be used more and more people are coming online for doing this sequencing as we go and it the system that we have in place

for sars-cov-2 will be equally applicable to any other pathogen you might name so

um yeah i i think i think there will be a lasting uh you know even if even if by the end

of this year covid is gone we're entirely out of it it's only a matter of time until the next virus

until the next pathogen hits so i think it would be um jumping on the bandwagon now and you

know making taking advantage of everything now of what's been set up i think is going to be really important to create a

lasting uh legacy of the work that's been done absolutely fantastic and this is going to be the

concluding note the lasting legacy thank you so much sam extraordinary

discussion extraordinary work as well as you know everybody you can see this webinar watch it again on the

research features website i'm sure you'll have an even larger audience some thank you very much everyone for being

such a great uh attentive and interested audience i'll see you next week for yet another seminar on

something completely different economic crime and i'd like to thank the team as usual gloria

her claudia and olga in particular thanks very much again sam and see you soon thank you bye
