Public health · AI · 2024

Omdena · International collaboration

Anti-malaria AI: predictive app for prevention in Liberia

Collaboration on an AI-powered app to predict transmission risk and anticipate malaria outbreaks in Liberia, working with data scarcity and regression models.

On Wednesday, May 15, 2024, collaborators from the Omdena community presented the project Develop an AI-powered App for Predictive Modeling and Forecasting of Malaria Prevention in Liberia.

It brought together people connected to the world of data from places like Sri Lanka, India, the United States, Bolivia, Portugal, Brazil and several West African countries, including Liberia itself. For most, it was the moment to test everything they had learned about Data Analytics, Data Science and Machine Learning Engineering.

The project was managed by Liberian Data Scientist Daikukai Bindah.

In this article I focus on two challenges of the work: why finding solutions to this disease matters, and how the project was modeled despite the lack of sufficient continuous data.

I) Challenges in eliminating malaria worldwide

One species of mosquito is responsible for 67% of malaria deaths in children under 5.

In 2020, Africa carried the heaviest burden of any region, with 95% of malaria cases and 96% of deaths. Children under 5 accounted for 80% of the victims in that area, according to the report Mathematical Modelling and Optimal Control of Malaria Using Awareness-Based Interventions by Fahad Al Basir and Teklebirhan Abraha.

Pregnant women are the next most affected group: per the WHO, malaria during pregnancy can cause severe complications for both mother and fetus, including severe anemia, preterm birth and low birth weight.

For decades this scourge has been fought and controlled, even eliminated in regions of Europe and Asia. But Anopheles mosquitoes have developed resistance to insecticides, and the parasites to medications. Health infrastructure in endemic regions is insufficient. Financial resources are limited and depend on often-unstable international donations. Climate factors expand incidence, and even the movement of people within the same country contributes to its spread.

The Oxford University report A global map of dominant malaria vectors, by the Spatial Ecology and Epidemiology Group led by Marianne Sinka with fifteen scientists, notes that the protozoa of human malaria are transmitted by mosquitoes of the Anopheles genus, which includes 465 formally recognized species. About 70 of these species are capable of transmitting human malaria parasites and 41 are considered dominant vector species (DVS).

Among the malaria-causing parasite species, Plasmodium falciparum causes the most dangerous form, accounting for 80% of infections and 90% of deaths.

The parasite always has two hosts in its life cycle: a mosquito that acts as vector and a vertebrate host.

Hundreds of papers worldwide work on improving diagnosis, producing medications to mitigate symptoms and achieving a definitive cure. Each contributes to the goal of eliminating malaria.

According to the article Leveraging innovation technologies to respond to malaria, published in the Malaria Journal, of the 650 key tech innovations against malaria at the start of 2023, 34% are web-based, 28% are mobile apps, 25% are diagnostic tools and 13% are drone-based. The Malaria Atlas Project stands out among them.

The project itself is challenging.

II) Data scarcity

Each country and region keeps its own database. Collection requires identifying samples and is carried out by specialized health institutions, university research groups, NGOs and some industrial companies. Budget is essential. As far as possible, collection should be continuous, every one or two years. Here a gap opens between countries: not all have managed continuous collection over long periods, such as from 2000 to today.

The goal of the Omdena project was to develop an innovative AI-powered app using predictive modeling and forecasting techniques to significantly improve malaria prevention efforts in Liberia, integrating three key features:

1. Predicting malaria transmission risk: identify areas and populations at high outbreak risk, using historical data, climate patterns and human behavior.

2. Forecasting malaria outbreaks: anticipate the timing and severity of future outbreaks, analyzing real-time data on climate, mosquito populations and human mobility.

3. Identifying environmental and social determinants: examine large datasets to identify the factors that contribute to transmission and vulnerability.

How to take on the challenge?

It's important to remember that more data doesn't always mean better models. Adding noisy or unrepresentative records, or many extra variables, can push a model toward overfitting: it fits the training set closely and then performs poorly on the test set.
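The gap between training and test performance can be illustrated with a minimal sketch. It assumes NumPy and uses made-up, deterministic numbers rather than project data: a straight-line trend with alternating noise, fitted once with a simple polynomial and once with an overly flexible one. The function names are hypothetical.

```python
import numpy as np

# Illustrative data only (not project data): a linear trend plus
# alternating +1/-1 noise on the training points.
x_train = np.arange(10, dtype=float)
y_train = 2.0 * x_train + np.where(np.arange(10) % 2 == 0, 1.0, -1.0)

# Held-out points sit between the training points and follow the clean trend.
x_test = x_train[:-1] + 0.5
y_test = 2.0 * x_test


def mse(y_true, y_pred):
    """Mean squared error between two arrays."""
    return float(np.mean((y_true - y_pred) ** 2))


def fit_and_score(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    return (
        mse(y_train, np.polyval(coeffs, x_train)),
        mse(y_test, np.polyval(coeffs, x_test)),
    )


simple_train, simple_test = fit_and_score(1)      # straight line
flexible_train, flexible_test = fit_and_score(9)  # passes through every training point

print(f"degree 1: train={simple_train:.3g}, test={simple_test:.3g}")
print(f"degree 9: train={flexible_train:.3g}, test={flexible_test:.3g}")
```

The flexible fit reproduces the training points almost exactly, noise included, and pays for it on the held-out points. That is the pattern to watch for whenever training error looks much better than test error.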

Data augmentation techniques and synthetic data generators were ruled out. The first would transform the sample; the second isn't recommended because it doesn't identify seasonal patterns.

The TU Delft Research Portal publishes Is your dataset big enough?, recommending a minimum sample size of fifty times the number of weights in the neural network.
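As a worked example of that rule of thumb, the sketch below counts the weights of a small fully connected network and multiplies by fifty. The network shape is hypothetical, and whether bias terms count as weights is an assumption left as an option here.

```python
def count_weights(layer_sizes, include_biases=False):
    """Count connection weights in a fully connected network.

    layer_sizes lists the units per layer, input first, output last.
    """
    weights = sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))
    if include_biases:
        # One bias per unit in every layer after the input layer.
        weights += sum(layer_sizes[1:])
    return weights


def minimum_sample_size(layer_sizes, factor=50, include_biases=False):
    """Apply the 'fifty times the number of weights' rule of thumb."""
    return factor * count_weights(layer_sizes, include_biases)


# Hypothetical network: 8 input features, one hidden layer of 16 units, 1 output.
layers = [8, 16, 1]
print(count_weights(layers))        # 8*16 + 16*1 = 144
print(minimum_sample_size(layers))  # 50 * 144 = 7200
```

Even this small network would call for several thousand observations, which helps explain why neural networks are a hard fit when only a few years of data are available.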

However, if a year's data is large enough, high-quality and representative of the target population, it can suffice to train an accurate model — although in many cases, especially time series, data has seasonal patterns a single year can't capture.

Through research, the team obtained meaningful data from international sources with frequent collection: The DHS Program, World Bank Data Catalog, The Humanitarian Data Exchange and Malaria Atlas Project.

From there the team discovered that county-level data existed, which was very important for deciding how to leverage the information. As lead Data Scientist Thomas James commented during the presentation: given limited data, we had to choose between county-level or national data, and we focused more on county level; counties turned out to be key determinants of malaria prevalence in Liberia.

It was a great joint decision and a big step forward.

Exploratory data analysis (EDA) examined the intercorrelations of various rainfall measurements, given that these mosquitoes concentrate in clean-water pools during rainy seasons. It identified distribution patterns that correlate positively with malaria deaths.
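A correlation check of this kind can be sketched in a few lines. The numbers below are invented for illustration and are not the project's data; the variable names are hypothetical, and the Pearson coefficient is only one of several measures an EDA might use.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up illustrative values, not project data.
rainfall_mm = [50, 120, 200, 310, 400, 280]
malaria_deaths = [2, 4, 7, 11, 13, 9]

r = pearson(rainfall_mm, malaria_deaths)
print(f"Pearson r between rainfall and deaths: {r:.3f}")
```

A coefficient close to 1 indicates that the two series rise and fall together; it does not by itself show that rainfall causes the deaths.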

The country is organized into 15 counties, and each was found to have a different experience with malaria, as shown by cases and deaths in Grand Kru and River Gee counties.
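To make the county-level view concrete, here is a minimal sketch of summarizing records per county. The county names, years and counts are placeholders, not real figures, and the record layout is an assumption.

```python
from collections import defaultdict

# Hypothetical records: (county, year, cases, deaths). Placeholder values only.
records = [
    ("County A", 2019, 1200, 14),
    ("County A", 2020, 1500, 18),
    ("County B", 2019, 300, 9),
    ("County B", 2020, 260, 7),
]


def summarize_by_county(rows):
    """Total cases and deaths per county, plus deaths per 1000 cases."""
    totals = defaultdict(lambda: {"cases": 0, "deaths": 0})
    for county, _year, cases, deaths in rows:
        totals[county]["cases"] += cases
        totals[county]["deaths"] += deaths
    return {
        county: {**t, "deaths_per_1000_cases": 1000 * t["deaths"] / t["cases"]}
        for county, t in totals.items()
    }


summary = summarize_by_county(records)
for county, stats in sorted(summary.items()):
    print(county, stats)
```

In this toy data one county has far more cases while the other has a higher death rate per case, the kind of difference that a national total would hide.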

Using the data

Choosing the model took several days of team effort. After comparing candidates with the PyCaret tool, the team settled on Random Forest Regressor.
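PyCaret automates this kind of comparison. The sketch below shows the underlying idea with scikit-learn directly; it is not the team's pipeline. The data is synthetic, and the two candidate models and the choice of 5-fold cross-validated MAE are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (not project data): 200 rows, 3 numeric features.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 3))
# Nonlinear target with a little noise, so the candidates behave differently.
y = np.sin(6.0 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(0.0, 0.05, size=200)

candidates = {
    "LinearRegression": LinearRegression(),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=100, random_state=0),
}

# 5-fold cross-validated MAE per candidate (scikit-learn reports it negated).
cv_mae = {
    name: float(
        -cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error").mean()
    )
    for name, model in candidates.items()
}

# List candidates from lowest to highest error.
for name, mae in sorted(cv_mae.items(), key=lambda item: item[1]):
    print(f"{name}: cross-validated MAE = {mae:.3f}")
```

Scoring every candidate on held-out folds, rather than on the data it was trained on, keeps the comparison honest. Which model comes out ahead depends on the data.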

Key points: the best model was Random Forest Regression (RFR). Best scores: MAE 5.31e-13, MSE 6.93e-25, RMSE 8.33e-13.
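For readers less familiar with these three metrics, here is a minimal sketch of how they are computed and how they relate. The example values are made up; only the two reported scores in the final check come from the project. Errors this close to zero are worth confirming on held-out data.

```python
import math


def regression_errors(y_true, y_pred):
    """Return MAE, MSE and RMSE for two equal-length sequences."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(r) for r in residuals) / len(residuals)
    mse = sum(r * r for r in residuals) / len(residuals)
    return {"MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse)}


# Made-up values for illustration only.
actual = [10.0, 12.0, 9.0, 15.0]
predicted = [11.0, 12.0, 7.0, 14.0]
errors = regression_errors(actual, predicted)
print(errors)

# Consistency check on the reported scores: RMSE should equal sqrt(MSE).
reported_mse, reported_rmse = 6.93e-25, 8.33e-13
scores_consistent = math.isclose(math.sqrt(reported_mse), reported_rmse, rel_tol=1e-2)
print(scores_consistent)
```

MAE weighs all errors equally, while MSE and RMSE penalize large errors more heavily. RMSE is expressed in the same units as the target, which makes it easier to read than MSE.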

The application of the model

The model gives users greater understanding of factors like rainfall and interventions by health organizations — covering IRS (Indoor Residual Spraying), ITN (Insecticide Treated Net) and medical treatments — to show how they influence malaria cases and deaths in Liberia.

It also covers malaria prevalence and cost over the years and across counties, which helps inform prevention strategies and resource allocation.

Details can be seen in the open-source Streamlit app.

Having seen how to navigate data scarcity, there's now material to develop further topics: incorporating other features like human mobility, temperature, humidity, vegetation cover and socioeconomic factors; integrating real-time data feeds; and exploring CNN or LSTM architectures, which could capture more complex relationships than Random Forest.

Final reflections

I got to be part of this project that let me learn more about the day-to-day of a Data Scientist. It was my first internship, so I can say I made it through my career transition.

I discovered the functionality and usefulness of ML models for developing the project; they're so varied and robust they expand expectations of what's coming with these tools. Vivien Siew introduced me to PyCaret and led the approach to models very well. I loved how Dorothea Paulssen organized. I'm grateful for Priyanka N's effort. The Streamlit work managed by Maria Loureiro was very new to me and I'll consider it in my next projects.

In this decade we get to be the architects facing diverse diseases with the resources available, and involving more people. It's very important to persuade decision-makers in each country to increase the budget of statistical institutions.

Other data sources can be exploited, and we must learn to recognize which relationships are significant, without being misled by R² values near 1.

Machine Learning · Random Forest · PyCaret · Streamlit · Public health