Fighting COVID-19 with Data and AI: A Review of Active Research Groups and Datasets

In this article we'll cover 8 different active groups doing game-changing research on COVID-19 treatment, diagnosis, and prognosis, plus open datasets.

4 years ago   •   14 min read

By Vihar Kurama

The economy has come to a halt; people are quarantined; work is stagnating; and governments fear the public health crisis this could turn into. The deadly coronavirus (SARS-CoV-2, the virus that causes COVID-19, short for "coronavirus disease 2019") is spreading fast. It's 2020, but even with the most advanced technology available, we haven’t been able to stop it. So far, there are well over a million cases and 60,000 registered deaths. As people say, prevention is better than cure: it's time to stay home and take care. In this piece, we'll see how artificial intelligence is helping people track, control, and find a cure for COVID-19.

Over the past ten years we've seen a tremendous amount of research and positive results in the fields of computer science and AI. The math and algorithms have been around for a long time; the real reason for this explosion is the availability of data, higher computational power, open-source tools and frameworks. From industries like manufacturing and energy to healthcare and education, artificial intelligence has revolutionized them all.

Let’s get started by understanding exactly what the problem is, from a data science perspective.


The Problem – COVID-19!

At the end of December 2019, China's health authorities reported a cluster of cases of acute respiratory syndrome in the city of Wuhan. On February 11, 2020, the World Health Organization (WHO) named the disease COVID-19. The virus is transmitted between people who are in close contact with one another by respiratory droplets produced when an infected person coughs, sneezes, or speaks. These droplets can be inhaled, land in the mouth, eyes, or nose, or be carried to one of these areas when you touch your own face. Commonly reported symptoms include fever, dry cough, and tiredness. In mild cases, people may get just a runny nose or a sore throat [2]. Meanwhile, computer scientists and machine learning researchers all over the world have been collaborating extensively to solve issues related to the coronavirus, either by compiling datasets or by building algorithms to learn from them.

There's not yet a cure or vaccine for this virus, but we can stop its spread by staying at home, frequently washing/sanitizing our hands, and avoiding public places. If you feel sick or ill, make sure you self-quarantine.

In the next section we’ll see how different research groups are using AI to tackle this problem.

AI-Based Systems Detecting COVID-19

1. DAMO Academy (Alibaba Group) Detects Coronavirus Cases in CT Scans

In early February, Alibaba Research Academy (DAMO Academy) came up with an AI-based solution that can detect COVID-19 in under 20 seconds with 96% accuracy. The network is a deep computer vision model which takes the CT scan of a patient as input and outputs whether or not they show signs of coronavirus. The model was fine-tuned with more than 5,000 training samples and deployed in more than 26 hospitals across China. So far, it's helped to diagnose over 30,000 cases [3].

AI-Based Software developed by DAMO Academy [3]

The same group has also developed an NLP solution based on pre-trained models for dissecting medical reports of COVID-19 in search of a cure. On March 3rd, 2020, their model topped the rankings of the GLUE benchmark, which is widely regarded as the most important baseline test for NLP-related tasks. It's currently being used and tested for text analysis of medical records and epidemiological investigation by CDCs in different cities in China [4].

2. Lung Infection Quantification

To reduce the analysis time of CT scans, researchers built a deep learning system to quantify the lung infection caused by COVID-19 [5]. The core idea is to develop a deep learning-based model that automatically segments and quantifies the affected regions, as well as the entire lungs, from chest CT scans.

The authors developed a network named VB-Net, which is a modification of V-Net [6]. They claim that their proposed model is faster than the original V-Net due to its special bottleneck structure. Below is an image of the architecture.

VB-Net Architecture [5]

As seen in the image, the bottleneck architecture is a stacked three-layer structure. It uses (1 x 1 x 1), (3 x 3 x 3), and (1 x 1 x 1) convolution kernels to extract the features from the images. To train this network the authors used a special strategy called the Human-In-The-Loop Strategy. In this strategy, to save time for radiologists during the outbreak, the training data is divided into several batches:

  1. The first batch contains CT data that is manually contoured by radiologists.
  2. Next, the segmentation network trains on this batch and an initial model is created.
  3. This initial model is then applied to segment infection regions in the next batch, and radiologists manually correct the segmentation results provided by the segmentation network.
  4. These results are fed as new training data, which increases the training dataset as well as the accuracy of the network.
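To see why the bottleneck structure described above can make a network cheaper, it helps to count weights. The sketch below compares a single 3 x 3 x 3 convolution with a 1 x 1 x 1, 3 x 3 x 3, 1 x 1 x 1 stack. The channel count of 64 and the reduction factor of 4 are assumptions chosen for illustration; they are not taken from the VB-Net paper.

```python
def conv3d_weights(in_ch, out_ch, k):
    """Number of weights in a 3D convolution with a k x k x k kernel (biases ignored)."""
    return in_ch * out_ch * k ** 3


def plain_block_weights(channels):
    # A single 3 x 3 x 3 convolution that keeps the channel count.
    return conv3d_weights(channels, channels, 3)


def bottleneck_block_weights(channels, reduction=4):
    # reduction=4 is an assumed value for illustration, not taken from the paper.
    mid = channels // reduction
    return (
        conv3d_weights(channels, mid, 1)    # 1 x 1 x 1: shrink the channels
        + conv3d_weights(mid, mid, 3)       # 3 x 3 x 3: spatial features on fewer channels
        + conv3d_weights(mid, channels, 1)  # 1 x 1 x 1: restore the channels
    )


plain = plain_block_weights(64)
bottleneck = bottleneck_block_weights(64)
print(plain, bottleneck, round(plain / bottleneck, 1))
```

Under these assumptions the plain block has 110,592 weights and the bottleneck block 8,960, roughly a 12x reduction. The real saving depends on the channel widths the authors actually used.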

The network was trained on 249 COVID-19 patients and validated using 300 new COVID-19 patients. To evaluate the results, the authors used the Dice Similarity Coefficient (DSC) and the Pearson correlation coefficient. The Dice similarity coefficient between automatic and manual segmentations was 91.6% ± 10.0%.

Results from VB-Net. In the last column, the CT images have the segmentation overlaid, along with a 3D surface rendering of the segmented infections
Results from VB-Net. The whisker plots of infected areas in bronchopulmonary segments
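For readers unfamiliar with the two evaluation metrics mentioned above, here is a minimal sketch of both in plain Python. Masks are represented as flat lists of 0s and 1s, which is a simplification of real 3D segmentation volumes, and the example masks are made up.

```python
import math


def dice_coefficient(pred, truth):
    """Dice similarity between two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same length")
    overlap = sum(p * t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * overlap / total


def pearson_correlation(xs, ys):
    """Pearson correlation between two equally long sequences of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up masks: the automatic result marks one voxel more than the manual one.
automatic = [0, 1, 1, 1, 0, 0]
manual = [0, 1, 1, 0, 0, 0]
print(dice_coefficient(automatic, manual))  # 2 * 2 / (3 + 2) = 0.8
```

Dice measures voxel-level overlap for a single scan. Pearson correlation would typically be computed across patients, for example between automatically and manually measured infection volumes, although how exactly the authors applied it is an assumption here.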

3. Abnormal Respiratory Pattern Classification for Large Scale Screening

Clinical observations suggest that people suffering from COVID-19 have a different pattern of respiration. Building on this, researchers from East China Normal University collaborated with other research organizations to develop a deep learning-based algorithm that can help in diagnosis, prognosis, and screening for infected patients based on breathing characteristics.

A GRU neural network with bidirectional and attention mechanisms was used to classify six clinically significant respiratory patterns from depth-camera recordings. The six patterns are Eupnea, Tachypnea, Bradypnea, Biot's, Cheyne-Stokes, and Central-Apnea.

Real-Time Respiratory Patterns Classification for COVID-19 [7]

The research consists of four core steps:

  1. Develop a respiratory simulation model for generating simulated data.
  2. Acquire real-world data using a Depth Camera.
  3. Establish and validate the BI-AT-GRU model.
  4. Conduct comparative experiments.

In the first step, the authors approximated the subject’s respiration cycle using a mathematical sine wave. Since respiration is a cyclic process of inhalation and exhalation, its graphical form is reflected by the rise and fall of a wave.
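A sine-wave approximation like the one described above is easy to sketch. The function name, sampling rate, and breathing rates below are assumptions for illustration; the paper's actual simulation parameters are not reproduced here.

```python
import math


def simulate_respiration(breaths_per_minute, duration_s=60.0,
                         sample_rate_hz=10.0, amplitude=1.0):
    """Return (times, depths) for an idealised sine-wave breathing signal."""
    n_samples = int(duration_s * sample_rate_hz)
    freq_hz = breaths_per_minute / 60.0
    times = [i / sample_rate_hz for i in range(n_samples)]
    depths = [amplitude * math.sin(2.0 * math.pi * freq_hz * t) for t in times]
    return times, depths


# Assumed, illustrative rates (breaths per minute) for three of the six patterns.
RATES = {"Eupnea": 15, "Tachypnea": 30, "Bradypnea": 8}

signals = {name: simulate_respiration(rate) for name, rate in RATES.items()}
```

A constant-rate sine wave only distinguishes patterns that differ in frequency. Patterns with pauses or waxing and waning depth, such as Cheyne-Stokes or Central-Apnea, would need the amplitude to vary over time as well.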

In the second step, to record depth images of subjects while they were breathing, a Kinect v2 depth camera was used to capture real data. Three regions of interest–namely the chest, abdomen and shoulder–were selected to capture a specific respiratory pattern for one minute at a time. Below is an image of how the data is captured through the camera.

Actual Measured Central-Apnea Waveforms [7]

For the respiratory pattern classification task in Step 3, the authors used a BI-AT-GRU. The BI-AT-GRU network is basically an improvement on GRU with the addition of bidirectional and attentional mechanisms. A GRU network is a simplified variant of LSTM. To give you an idea, below is an image of the network setup.
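To make the idea concrete, here is a deliberately tiny sketch of a bidirectional GRU with attention, using scalar inputs and a scalar hidden state so that it runs in plain Python. All weights are hypothetical fixed numbers rather than trained values, and the attention scoring is one simple choice among many; this is not the authors' BI-AT-GRU implementation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(x, h, p):
    """One GRU update for a scalar input x and scalar hidden state h."""
    z = sigmoid(p["wz"] * x + p["uz"] * h + p["bz"])  # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h + p["br"])  # reset gate
    cand = math.tanh(p["wh"] * x + p["uh"] * (r * h) + p["bh"])  # candidate state
    # Some references swap the roles of z and (1 - z); both conventions are in use.
    return (1.0 - z) * h + z * cand


def run_gru(xs, p):
    h, states = 0.0, []
    for x in xs:
        h = gru_step(x, h, p)
        states.append(h)
    return states


def bi_gru_with_attention(xs, fwd, bwd, attn):
    forward = run_gru(xs, fwd)
    backward = run_gru(xs[::-1], bwd)[::-1]  # read right-to-left, then re-align
    pairs = list(zip(forward, backward))
    # Attention: score each time step, softmax the scores, take the weighted sum.
    scores = [attn[0] * f + attn[1] * b for f, b in pairs]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    context = [sum(w * pair[i] for w, pair in zip(weights, pairs)) for i in range(2)]
    return context, weights


# Hypothetical weights and a short made-up breathing signal.
PARAMS = {"wz": 0.5, "uz": 0.1, "bz": 0.0, "wr": 0.4, "ur": 0.2, "br": 0.0,
          "wh": 0.9, "uh": 0.3, "bh": 0.0}
context, weights = bi_gru_with_attention([0.0, 0.8, 1.0, 0.3, -0.6],
                                         PARAMS, PARAMS, (1.0, 1.0))
```

In a real classifier the context vector would feed a final layer that outputs one score per respiratory pattern, and all weights would be learned from data.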

Lastly, the results show that the proposed model can classify the six mentioned respiratory patterns with an accuracy, precision, recall, and F1 score of 94.5%, 94.4%, 95.1%, and 94.8%, respectively.

4. Convolutional Neural Networks for COVID-19 and Pneumonia Screening

Convolutional Neural Networks are a well-established technique for identifying patterns in images. To make the screening process faster in China, several research groups collaborated to develop a CNN-based deep learning model to identify COVID-19 in its early stages from CT scans. For this research, a total of 618 lung CT scans were collected.

Below is an image explaining the pipeline from input to output.

To identify the location of infections in the CT scans the authors studied several features of COVID-19, including:

  1. Ground-glass appearance
  2. Striking peripheral distribution
  3. More than one independent focus of infections

Based on the appearance and resulting structures from different infections, the image classification model should be able to differentiate between illnesses.

The network architecture consists of two 3-D CNN classification models. The first is a classic ResNet-18 network that is used for feature extraction. Next, a few pooling operations were used to reduce the dimensionality of the data to prevent overfitting and improve generalization. Lastly, three fully-connected layers output the final classification result, together with the confidence score. The training was carried out on an Intel i7-8700k CPU connected with an NVIDIA GeForce GTX 1080 Ti. The network was trained for 1000 iterations with a common cross-entropy loss function. Below is an image of the CNN architecture.

Network structure of the location-attention oriented model
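The final classification step and the cross-entropy loss mentioned above can be illustrated in a few lines. This sketch assumes the last fully-connected layer emits one raw score (logit) per class; the class names and the logits are hypothetical.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Cross-entropy loss for one sample whose true class is target_index."""
    return -math.log(softmax(logits)[target_index])


# Hypothetical class names and network outputs for one CT sample.
CLASSES = ["COVID-19", "other pneumonia", "healthy"]
logits = [2.0, 0.5, -1.0]

probabilities = softmax(logits)
predicted = max(range(len(CLASSES)), key=lambda i: probabilities[i])
print(CLASSES[predicted], round(probabilities[predicted], 3))
print(round(cross_entropy(logits, target_index=0), 3))
```

The highest softmax probability serves as the confidence score, and the loss shrinks toward zero as the probability assigned to the true class approaches 1.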

This model was able to identify COVID-19 in its early stages with an accuracy of 86.7% [8].

5. COVID-19 Identification and Patient Monitoring Using Deep Learning for CT Image Analysis

This research was mainly carried out by RADLogics, a company based out of Boston, in collaboration with many other research groups across the world. The main intention was to build an AI-based automated CT image analysis tool that can achieve high accuracy in the detection of coronavirus-positive patients, and monitor them throughout treatment. This research provides a timely analysis of patients by delivering the quantitative measurements for smaller opacities resulting from the infection with respect to the volume and diameter (from CT scans). They also proposed a “corona score” that continuously measures the progression of the disease over time.

Below are the steps they take to detect the virus:

  • First, the lung regions of the CT scans are segmented using the U-Net architecture that was trained on 6,150 images.
  • Next, the COVID-19-related abnormalities are detected from the segmented lung images using a 2D ResNet-50 convolutional network that was pre-trained on the ImageNet dataset.
  • To mark a case as positive, first, the ratio of positively detected slices out of the total slices of the lung (the "positive ratio") is calculated. Next, a positive case-decision is made if the positive ratio exceeds a predefined threshold.
  • After the case is detected as positive, a 3D analysis is proposed by the authors for nodular and focal diffuse opacity (visualized below in green and red).
  • Lastly, the authors used an abnormality localization step where network activation maps are produced. The Grad-CAM technique is implemented for creating more appropriate visualizations.

Monitoring COVID-19 in a patient over time

The results of the trained network were 0.996 AUC (95%CI: 0.989-1.00) on datasets of Chinese control and infected patients. They reported two possible working points: 98.2% sensitivity and 92.2% specificity (high sensitivity point), or 96.4% sensitivity and 98% specificity (high specificity point). You can find the original paper here.
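The case-level decision rule from the steps above (the "positive ratio") is simple enough to sketch directly. The threshold value below is a placeholder; the source only says that the threshold is predefined.

```python
def positive_ratio(slice_predictions):
    """Fraction of lung slices flagged positive by the slice-level classifier."""
    if not slice_predictions:
        raise ValueError("need at least one slice")
    return sum(1 for flagged in slice_predictions if flagged) / len(slice_predictions)


def is_positive_case(slice_predictions, threshold=0.1):
    # threshold=0.1 is a placeholder, not the value used by the authors.
    return positive_ratio(slice_predictions) > threshold


# Hypothetical per-slice outputs for one CT study: 3 of 10 slices flagged.
study = [True, True, True] + [False] * 7
print(positive_ratio(study), is_positive_case(study))
```

Moving the threshold trades sensitivity against specificity, which is one way to arrive at different working points like the two reported above.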

6. Drug Screening for COVID-19

AI is not only useful for diagnosing the coronavirus, but also for screening drugs. Several research groups in China collaborated on research of how deep learning can help find medication faster than traditional methods.

After studying RNA sequences that were available in the GISAID database, the authors concluded that the virus behind COVID-19 is highly homologous to SARS-CoV-1. They found this homology by translating the RNA sequences into protein sequences and then building a 3D protein model using homology modeling (constructing an atomic-resolution model of a protein from a related protein whose structure is already known). A DFCNN (Deep Fully Convolutional Neural Network) was used to identify and rank the protein-ligand interactions for performing virtual screening quickly, since no docking or molecular dynamics simulation is needed. For clarification, a molecular dynamics simulation is a computer simulation method for analyzing the physical movements of atoms and molecules; docking is a computational method that predicts how a small molecule (the ligand) fits into and binds to a protein. With these techniques, the authors were able to identify potential drugs for COVID-19 by performing drug screening against four chemical compound databases.
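The first step mentioned above, translating RNA into a protein sequence, follows the standard genetic code: every three bases (a codon) map to one amino acid. Below is a minimal sketch that assumes the reading frame starts at the first base and that the input contains only the four standard bases; the example sequence is made up and is not a viral sequence.

```python
BASES = "UCAG"
# Standard genetic code, listed in UCAG order for the first, second and third base.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(rna):
    """Translate an RNA string codon by codon, stopping at the first stop codon."""
    rna = rna.upper().replace("T", "U")  # accept DNA-style input as well
    protein = []
    for start in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[start:start + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("AUGUUUGGAUAA"))  # prints "MFG"
```

Real genome files often contain ambiguous bases and several reading frames, so a production pipeline would normally rely on an established bioinformatics library rather than a hand-rolled table.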

The proposed DFCNN architecture was trained on the protein-ligand dataset from the PDBbind database. The model was able to predict three potential drugs with a score of higher than 0.997 and three more with a score greater than 0.99.

Below is a screenshot of the overall workflow.

The workflow of virtual screening of chemical compounds for COVID-19 proposed by Haiping Zhang and team.

7. Computational Predictions of Protein Structures with AlphaFold

DeepMind is one of the top research companies in the world. On April 2, 2019, they submitted their research, "Improved protein structure prediction using potentials from deep learning," to the journal Nature. It was accepted in December 2019 and published in January 2020. On January 15, 2020, they reviewed this research in a blog post titled AlphaFold: Using AI for scientific discovery, and the system is now helping identify the protein structures that are associated with COVID-19.

This research mainly addresses the problem of protein folding. You can think of proteins as large, complex molecules. What a protein can do depends largely on its three-dimensional structure. The authors gave an example of why identifying the structure of proteins is essential. Below are the lines from the article:

Antibody proteins utilized by our immune systems are 'Y-shaped' and form unique hooks. By latching on to viruses and bacteria, these antibody proteins can detect and tag disease-causing microorganisms for elimination. [11]

This explains why protein structure is crucial for us. But how do we determine it? Where does deep learning come into the picture? Let's break this down. Proteins are chains of amino acids whose sequence is encoded in DNA. But the DNA does not directly describe the folded 3D structure. With a brute-force approach, it would take ages to enumerate all possible configurations of a typical protein before reaching the true 3D structure.

So how do we infer the 3D structure from a complex protein sequence? Traditionally, these structures were determined experimentally using microscopes and X-rays, which involved a lot of trial and error. But with deep learning and research like AlphaFold, we can now much more easily identify the 3D structures of proteins.

The working of AlphaFold is divided into two phases. In the first phase, techniques commonly used in structural biology are applied: new protein fragments are created by repeatedly replacing pieces of an existing protein structure. These structures are continuously improved with the help of a generative neural network. The predicted protein structure is described by two properties:

  1. Distances between pairs of amino acids
  2. Angles between chemical bonds that connect those amino acids

In the second phase, the distance and the angles are improved by the Gradient Descent algorithm until they achieve the best scores. Below is an image of the entire workflow of AlphaFold.

Working of AlphaFold [12]
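The second phase can be illustrated with a toy version of the optimization. The sketch below places three points in 2D and uses gradient descent to move them until their pairwise distances match a set of target distances. The targets, the starting positions, and the learning rate are all hypothetical, and angles are left out entirely; AlphaFold's real scoring function is far richer.

```python
import math


def distance_loss(points, targets):
    """Sum of squared differences between current and target pairwise distances."""
    return sum(
        (math.dist(points[i], points[j]) - t) ** 2 for (i, j), t in targets.items()
    )


def gradient_step(points, targets, learning_rate):
    grads = [[0.0, 0.0] for _ in points]
    for (i, j), target in targets.items():
        d = math.dist(points[i], points[j])
        for axis in range(2):
            # d(loss)/d(point_i) for this pair; point_j gets the opposite sign.
            g = 2.0 * (d - target) * (points[i][axis] - points[j][axis]) / d
            grads[i][axis] += g
            grads[j][axis] -= g
    return [
        [coord - learning_rate * g for coord, g in zip(point, grad)]
        for point, grad in zip(points, grads)
    ]


# Hypothetical target distances for three "residues" (an equilateral triangle).
targets = {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0}
points = [[0.0, 0.0], [0.5, 0.1], [0.2, 0.7]]
for _ in range(1000):
    points = gradient_step(points, targets, learning_rate=0.05)

print(distance_loss(points, targets))
```

Each step nudges the coordinates in the direction that reduces the mismatch, which is the same principle the text describes for refining distances and angles until the score stops improving.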

The same team used the AlphaFold system to predict the 3D structures of proteins associated with the virus behind COVID-19. The prerequisite data used to understand the virus was collected from an open-access database. The authors noted that these predicted structures have not been experimentally verified, but they can still help in investigating how the virus functions. Below is an image of the predicted 3D-rendered SARS-CoV-2 membrane protein:

Rendering of COVID-19 Protein Membrane [12]

8. Prediction of Criticality of Patients with Severe COVID-19

In this research, the authors propose prognostic prediction models based on three indices which will predict the mortality risk and clinical route for recognizing critical cases from severe cases. The authors used 2,779 electronic patient records consisting of their medical conditions from January 10th to February 18th, 2020, at Tongji Hospital in Wuhan, China.

Image explaining how the patients are enrolled and classified [10]

Building this machine learning model included the following three steps:

  1. Data Preprocessing: The authors imported all the clinical measurements from their last available date as features, and added two new labels: survival and death. These clinical measurements of the patients include more than 35 features in the input data, including gender, Wuhan residency, familial cluster, fever, cough, fatigue, chest distress, lymphocytes, etc. If there were any incomplete clinical measures then the values were set to -1.
  2. Data Splitting: The data was divided into 70% for training and 30% for test, according to traditional ML guidelines.
  3. Training: In this step, the authors chose the multi-tree XGBoost algorithm to predict the severity of the patient's condition from the input data. The maximum depth of each tree was set to 4, with a learning rate of 0.2. The regularization parameter, subsample, and colsample_bytree were set to 1, 0.9, and 0.9, respectively, to reduce overfitting.

XGBoost Machine Learning Algorithm for Severity Detection of COVID-19

Using this pipeline, the model was able to achieve more than 90% accuracy, enabling early detection, early intervention, and a reduction of mortality in high-risk patients affected by COVID-19.
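The preprocessing and splitting steps described above can be sketched in plain Python, alongside the stated hyperparameters written with XGBoost's parameter names. Mapping "the regularization parameter" to reg_lambda is an assumption, and the feature names and patient records are hypothetical.

```python
import random

# Hyperparameters as described in the text, using XGBoost's parameter names.
# Mapping "the regularization parameter" to reg_lambda is an assumption.
XGB_PARAMS = {
    "max_depth": 4,
    "learning_rate": 0.2,
    "reg_lambda": 1,
    "subsample": 0.9,
    "colsample_bytree": 0.9,
}


def fill_missing(record, feature_names, missing_value=-1):
    """Return the feature vector for one patient, with gaps set to -1."""
    return [
        record[name] if record.get(name) is not None else missing_value
        for name in feature_names
    ]


def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle the rows reproducibly and split them 70/30."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical feature names and records, for illustration only.
FEATURES = ["fever", "cough", "lymphocytes"]
records = [{"fever": 1, "cough": None}, {"fever": 0, "cough": 1, "lymphocytes": 1.2}]
matrix = [fill_missing(r, FEATURES) for r in records]
train_rows, test_rows = train_test_split(range(10))
```

With the xgboost package installed, the dictionary could be passed to its scikit-learn style classifier, for example `xgboost.XGBClassifier(**XGB_PARAMS)`, and fitted on the training rows.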

Datasets for COVID-19

In the previous sections, we've seen different approaches and techniques for fighting COVID-19 using machine learning and AI. These all have one thing in common: data! With more data, the algorithms get better and better. To help understand COVID-19, several companies and open-source organizations have developed different datasets. Below are a few use cases of how individuals and teams can use available data to tackle the spread and consequences of COVID-19.

  1. Monitoring distribution of protective equipment for medical staff
  2. Finding average incubation and recovery time of COVID-19
  3. Real-time visualizations and monitoring across continents
  4. Faster diagnosing of COVID-19 symptoms
  5. AI assistance for frontline medical staff
  6. Hospital bed management
  7. Diagnosis from X-Rays or CAT scans
  8. Patient tracking using cluster networks
  9. Vaccine or antidote protein prediction
  10. Continuous patient monitoring

Here are links to a few datasets that are being extensively put to use:

  1. COVID-19 Open Research Dataset Challenge (CORD-19): CORD-19 is a dataset by the Allen Institute for AI in collaboration with several companies and organizations. It consists of over 45,000 scholarly articles, 33,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses.
  2. COVID-19 Korea Dataset: This is an open-sourced dataset by the Republic of Korea for tracking a COVID-positive patient’s travel history. Using this dataset, an ML and web-based platform has been developed for visualizing patient routes.
  3. Novel Coronavirus 2019 Dataset: This dataset has daily information on the number of cases, deaths, and recoveries from across different regions, including time-stamps.
  4. COVID19 ChestXRay Dataset: This dataset contains chest X-rays of COVID-19 cases. Credit goes to Joseph Paul Cohen for making this dataset open on GitHub. Using this, one could try building a neural network classifier for detecting COVID-19 from X-rays (note, however, that the data is quite limited for creating an effective model). The diagnostic performance of such a model should not be claimed without a clinical study.
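As a small example of putting case-count data like the Novel Coronavirus 2019 Dataset to work, the sketch below totals cases per date with Python's standard csv module. The rows are made up and the column names are an assumed layout; real files may use different headers.

```python
import csv
import io
from collections import defaultdict

# Made-up rows in an assumed column layout; real files may use different headers.
SAMPLE_CSV = """date,region,confirmed,deaths,recovered
2020-03-01,Region A,10,0,1
2020-03-01,Region B,5,1,0
2020-03-02,Region A,14,1,2
2020-03-02,Region B,9,1,1
"""


def totals_by_date(csv_text):
    """Sum confirmed, deaths and recovered over all regions for each date."""
    totals = defaultdict(lambda: {"confirmed": 0, "deaths": 0, "recovered": 0})
    for row in csv.DictReader(io.StringIO(csv_text)):
        for field in ("confirmed", "deaths", "recovered"):
            totals[row["date"]][field] += int(row[field])
    return dict(totals)


for date, counts in sorted(totals_by_date(SAMPLE_CSV).items()):
    print(date, counts)
```

The same aggregation would be a natural starting point for the real-time visualizations mentioned in the use cases above.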


AI has revolutionized many industries, and is now being used to revolutionize the battle against the novel coronavirus (COVID-19). While the AI community is working intensively on delivering applications that can help to contain the consequences of the virus, AI systems are still at a preliminary stage and it will take time before the results of such measures are visible. We are still far from the end of this tragic story. However, there is a significant amount of progress being made by the community and the next big breakthrough is not too far away.


  1. World Health Organization
  2. Covid-19 -
  3. How DAMO Academy's AI System Detects Coronavirus Cases
  4. Fighting Coronavirus with Technology: Another Breakthrough for Alibaba in NLP Research
  5. Lung Infection Quantification of COVID-19 in CT Images with Deep Learning
  6. V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation
  7. Abnormal Respiratory Patterns Classifier May Contribute to Large-scale Screening of People Infected With Covid-19 in an Accurate and Unobtrusive Manner
  8. Deep Learning System to Screen Coronavirus Disease 2019 Pneumonia
  9. Deep Learning-Based Drug Screening for Novel Coronavirus 2019-nCov
  10. Prediction of criticality in patients with severe Covid-19 infection using three clinical features: a machine learning-based prognostic model with clinical data in Wuhan
  11. AlphaFold: Using AI for scientific discovery
  12. Computational predictions of protein structures associated with COVID-19
  13. Computer Scientists Are Building Algorithms to Tackle COVID-19
