
The Journal of Machine Learning Research

Volume 24, Issue 1

January 2023

JMLR publishes papers on machine learning, including:

  • new algorithms with empirical, theoretical, psychological, or biological justification;
  • experimental and/or theoretical studies yielding new insight into the design and behavior of learning in intelligent systems;
  • accounts of applications of existing techniques that shed light on the strengths and weaknesses of the methods;
  • formalization of new learning tasks (e.g., in the context of new applications) and of methods for assessing performance on those tasks;
  • development of new analytical frameworks that advance theoretical studies of practical learning methods;
  • computational models of data from natural learning systems at the behavioral or neural level; or
  • extremely well-written surveys of existing work.

JMLR has a commitment to rigorous yet rapid reviewing. Final versions are published electronically (ISSN 1533-7928) immediately upon receipt. Printed volumes (ISSN 1532-4435) are now published by Microtome Publishing and available for sale.


Announcements

ACM Updates Its Peer Review Policy

ACM is pleased to announce that its Publications Board has approved an updated Peer Review Policy. The associated FAQ addresses topics such as confidentiality, the use of large language models in the peer review process, conflicts of interest, and several other relevant concerns. For any issues not addressed in the FAQ, please contact ACM's Director of Publications, Scott Delman.

New ACM Policy on Authorship

ACM has a new Policy on Authorship covering a range of key topics, including the use of generative AI tools. Please familiarize yourself with the new policy and the associated list of Frequently Asked Questions.

Articles in this issue include:

The measure and mismeasure of fairness

Department of Statistics, Harvard University, Cambridge, MA

Department of Computer Science, Stanford University, Stanford, CA

Department of Applied Statistics, Social Science, and Humanities, New York University, New York, NY

Harvard Kennedy School, Harvard University, Cambridge, MA

Weisfeiler and Leman go machine learning: the story so far

Department of Computer Science, RWTH Aachen University, Aachen, Germany

Meta AI Research, Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot, Israel

NVIDIA Research, Tel Aviv, Israel


AIDOS Lab, Institute of AI for Health, Helmholtz Zentrum München, and Technical University of Munich, Munich, Germany

Faculty of Computer Science and Research Network Data Science, University of Vienna, Vienna, Austria

Kumo.AI, Mountain View, CA

Machine Learning & Computational Biology Lab, Department of Biosystems Science and Engineering, ETH Zürich, Basel, Switzerland and Swiss Institute of Bioinformatics, Lausanne, Switzerland


Export Citations

  • Please download or close your previous search result export first before starting a new bulk export. Preview is not available. By clicking download, a status dialog will open to start the export process. The process may take a few minutes but once it finishes a file will be downloadable from your browser. You may continue to browse the DL while the export process is in progress. Download
  • Download citation
  • Copy citation

We are preparing your search results for download ...

We will inform you here when the file is ready.

Your file of search results citations is now ready.

Your search export query has expired. Please try again.

machine learning research papers

Frequently Asked Questions

JMLR Papers

Select a volume number to see its table of contents with links to the papers.

Volume 25 (January 2024 - Present)

Volume 24 (January 2023 - December 2023)

Volume 23 (January 2022 - December 2022)

Volume 22 (January 2021 - December 2021)

Volume 21 (January 2020 - December 2020)

Volume 20 (January 2019 - December 2019)

Volume 19 (August 2018 - December 2018)

Volume 18 (February 2017 - August 2018)

Volume 17 (January 2016 - January 2017)

Volume 16 (January 2015 - December 2015)

Volume 15 (January 2014 - December 2014)

Volume 14 (January 2013 - December 2013)

Volume 13 (January 2012 - December 2012)

Volume 12 (January 2011 - December 2011)

Volume 11 (January 2010 - December 2010)

Volume 10 (January 2009 - December 2009)

Volume 9 (January 2008 - December 2008)

Volume 8 (January 2007 - December 2007)

Volume 7 (January 2006 - December 2006)

Volume 6 (January 2005 - December 2005)

Volume 5 (December 2003 - December 2004)

Volume 4 (April 2003 - December 2003)

Volume 3 (July 2002 - March 2003)

Volume 2 (October 2001 - March 2002)

Volume 1 (October 2000 - September 2001)

Special Topics

Bayesian Optimization

Learning from Electronic Health Data (December 2016)

Gesture Recognition (May 2012 - present)

Large Scale Learning (July 2009 - present)

Mining and Learning with Graphs and Relations (February 2009 - present)

Grammar Induction, Representation of Language and Language Learning (November 2010 - April 2011)

Causality (September 2007 - May 2010)

Model Selection (April 2007 - July 2010)

Conference on Learning Theory 2005 (February 2007 - July 2007)

Machine Learning for Computer Security (December 2006)

Machine Learning and Large Scale Optimization (July 2006 - October 2006)

Approaches and Applications of Inductive Programming (February 2006 - March 2006)

Learning Theory (June 2004 - August 2004)

Special Issues

In Memory of Alexey Chervonenkis (September 2015)

Independent Components Analysis (December 2003)

Learning Theory (October 2003)

Inductive Logic Programming (August 2003)

Fusion of Domain Knowledge with Data for Decision Support (July 2003)

Variable and Feature Selection (March 2003)

Machine Learning Methods for Text and Images (February 2003)

Eighteenth International Conference on Machine Learning (ICML2001) (December 2002)

Computational Learning Theory (November 2002)

Shallow Parsing (March 2002)

Kernel Methods (December 2001)


Machine Learning: Recently Published Documents


An explainable machine learning model for identifying geographical origins of sea cucumber Apostichopus japonicus based on multi-element profile

A comparison of machine learning- and regression-based models for predicting ductility ratio of RC beam-column joints

Alexa, is this a historical record?

Digital transformation in government has brought an increase in the scale, variety, and complexity of records and greater levels of disorganised data. Current practices for selecting records for transfer to The National Archives (TNA) were developed to deal with paper records and are struggling to deal with this shift. This article examines the background to the problem and outlines a project that TNA undertook to research the feasibility of using commercially available artificial intelligence tools to aid selection. The project AI for Selection evaluated a range of commercial solutions varying from off-the-shelf products to cloud-hosted machine learning platforms, as well as a benchmarking tool developed in-house. Suitability of tools depended on several factors, including requirements and skills of transferring bodies as well as the tools’ usability and configurability. This article also explores questions around trust and explainability of decisions made when using AI for sensitive tasks such as selection.

Automated Text Classification of Maintenance Data of Higher Education Buildings Using Text Mining and Machine Learning Techniques

Data-driven analysis and machine learning for energy prediction in distributed photovoltaic generation plants: a case study in Queensland, Australia

Modeling nutrient removal by membrane bioreactor at a sewage treatment plant using machine learning models

Big five personality prediction based in Indonesian tweets using machine learning methods

The popularity of social media has drawn the attention of researchers who have conducted cross-disciplinary studies examining the relationship between personality traits and behavior on social media. Most current work focuses on personality prediction from English texts, but Indonesian has received scant attention. Therefore, this research aims to predict users' personalities based on Indonesian text from social media using machine learning techniques. This paper evaluates several machine learning techniques, including naive Bayes (NB), K-nearest neighbors (KNN), and support vector machine (SVM), based on semantic features including emotion, sentiment, and publicly available Twitter profile attributes. We predict personality based on the Big Five personality model, the most appropriate model for predicting user personality in social media. We examine the relationships between the semantic features and the Big Five personality dimensions. The experimental results indicate that the Big Five personality traits exhibit distinct emotional, sentimental, and social characteristics and that SVM outperformed NB and KNN for Indonesian. In addition, we observe several terms in Indonesian that specifically refer to each personality type, each of which has distinct emotional, sentimental, and social features.
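As a rough sketch of the NB/KNN/SVM comparison the abstract describes, assuming scikit-learn and plain tf-idf features: the four tweets and Big Five labels below are invented placeholders, and the paper's emotion, sentiment, and profile features are omitted.

```python
# Minimal sketch, not the authors' pipeline: compare NB, KNN, and SVM on
# tf-idf features. Texts and labels are invented placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["saya suka belajar hal baru", "saya cemas dengan ujian besok",
         "ayo kumpul bareng teman-teman", "saya selesaikan tugas tepat waktu"]
labels = ["openness", "neuroticism", "extraversion", "conscientiousness"]

models = {"NB": MultinomialNB(),
          "KNN": KNeighborsClassifier(n_neighbors=3),
          "SVM": LinearSVC()}
for name, clf in models.items():
    pipe = make_pipeline(TfidfVectorizer(), clf)
    pipe.fit(texts, labels)  # toy fit; a real study needs cross-validation
    print(name, pipe.predict(["saya senang bertemu orang baru"]))
```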

Compressive strength of concrete with recycled aggregate; a machine learning-based evaluation

Temperature prediction of flat steel box girders of long-span bridges utilizing in situ environmental parameters and machine learning

Computer-assisted cohort identification in practice

The standard approach to expert-in-the-loop machine learning is active learning, where, repeatedly, an expert is asked to annotate one or more records and the machine finds a classifier that respects all annotations made until that point. We propose an alternative approach, IQRef, in which the expert iteratively designs a classifier and the machine helps him or her determine how well it is performing and, importantly, when to stop, by reporting statistics on a fixed, hold-out sample of annotated records. We justify our approach based on prior work giving a theoretical model of how to re-use hold-out data. We compare the two approaches in the context of identifying a cohort of EHRs and examine their strengths and weaknesses through a case study arising from an optometric research problem. We conclude that the two approaches are complementary, and we recommend that they be employed in conjunction to address the problem of cohort identification in health research.
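For contrast, here is a minimal sketch of the standard active-learning loop the abstract describes, using least-confident uncertainty sampling with scikit-learn; the dataset, model, seed set, and annotation budget are all illustrative assumptions, and the synthetic labels stand in for the expert.

```python
# Illustrative active-learning loop: repeatedly query the record the current
# model is least sure about, "annotate" it, and refit.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)  # y plays the expert
labeled = list(np.where(y == 0)[0][:5]) + list(np.where(y == 1)[0][:5])
pool = [i for i in range(len(y)) if i not in labeled]

clf = LogisticRegression()
for _ in range(20):                              # annotation budget
    clf.fit(X[labeled], y[labeled])
    proba = clf.predict_proba(X[pool])
    uncertainty = 1 - proba.max(axis=1)          # least-confident sampling
    pick = pool.pop(int(np.argmax(uncertainty)))
    labeled.append(pick)                         # expert reveals y[pick]
print("records annotated:", len(labeled))
```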


Artificial intelligence and machine learning research: towards digital transformation at a global scale

  • Published: 17 April 2021
  • Volume 13, pages 3319–3321 (2022)


  • Akila Sarirete 1,
  • Zain Balfagih 1,
  • Tayeb Brahimi 1,
  • Miltiadis D. Lytras 1,2 &
  • Anna Visvizi 3,4


Artificial intelligence (AI) is reshaping how we live, learn, and work. Until recently, AI was a fanciful concept, associated more closely with science fiction than with anything else. However, driven by unprecedented advances in sophisticated information and communication technology (ICT), AI today is synonymous with technological progress, both already attained and yet to come, in all spheres of our lives (Chui et al. 2018; Lytras et al. 2018, 2019).

Considering that Machine Learning (ML) and AI are apt to reach unforeseen levels of accuracy and efficiency, this special issue sought to promote research on AI and ML seen as functions of data-driven innovation and digital transformation. The combination of expanding ICT-driven capabilities and capacities across our socio-economic systems, along with growing consumer expectations vis-à-vis technology and its value added for our societies, requires a multidisciplinary research agenda on AI and ML (Lytras et al. 2021; Visvizi et al. 2020; Chui et al. 2020). Such a research agenda should revolve around the following five defining issues (Fig. 1):

Figure 1: An AI-driven digital transformation in all aspects of human activity. Source: The Authors.

  • Integration of diverse data warehouses into unified ecosystems of AI and ML value-based services;
  • Deployment of robust AI and ML processing capabilities for enhanced decision making and generation of value out of data;
  • Design of novel AI and ML applications for predictive and analytical capabilities;
  • Design of sophisticated AI- and ML-enabled intelligence components with critical social impact;
  • Promotion of digital transformation in all aspects of human activity, including business, healthcare, government, commerce, and social intelligence.

Such development will also have a critical impact on governments, policies, regulations, and initiatives aiming to translate the value of the AI-driven digital transformation into the sustainable economic development of our planet. Additionally, the disruptive character of AI and ML technology and research will require further research on business models and the management of innovation capabilities.

This special issue is based on submissions invited from the 17th Annual Learning and Technology Conference 2019, held at Effat University, together with an open call. Several very good submissions were received, and all of them were subjected to the rigorous peer review process of the Ambient Intelligence and Humanized Computing Journal.

The papers published in this special issue cover a variety of innovative topics, including:

  • Stock market prediction using machine learning
  • Detection of apple diseases and pests based on multi-model LSTM-based convolutional neural networks
  • ML for searching
  • Machine learning for learning automata
  • Entity recognition and relation extraction
  • Intelligent surveillance systems
  • Activity recognition and k-means clustering
  • Distributed mobility management
  • Review rating prediction with deep learning
  • Cybersecurity: botnet detection with deep learning
  • Self-training methods
  • Neuro-fuzzy inference systems
  • Fuzzy controllers
  • Monarch butterfly optimized control with robustness analysis
  • GMM methods for speaker age and gender classification
  • Regression methods for permeability prediction of petroleum reservoirs
  • Surface EMG signal classification
  • Pattern mining
  • Human activity recognition in smart environments
  • Teaching–learning-based optimization algorithms
  • Big data analytics
  • Diagnosis based on event-driven processing and machine learning for mobile healthcare

Over a decade ago, Effat University envisioned a timely platform that brings together educators, researchers, and tech enthusiasts under one roof and functions as a fount of creativity and innovation. The dream was that such a platform would bridge the existing gap and become a leading hub where innovators across disciplines share their knowledge and exchange novel ideas. That dream was realized in 2003, when the first Learning & Technology Conference was held. Since then, the conference has covered a variety of cutting-edge themes, such as Digital Literacy, Cyber Citizenship, Edutainment, and Massive Open Online Courses. The conference has also attracted prominent figures in science and technology, such as Farouq El Baz from NASA and Queen Rania Al-Abdullah of Jordan, who addressed large, eager-to-learn audiences and inspired many with unique stories.

While emerging innovations such as artificial intelligence technologies are seen today as promising instruments that could pave our way to the future, they have also been focal points of fruitful discussion at the L&T conference. AI was selected as the theme of this conference because of its great impact. The Saudi government has recognized this impact and has already taken concrete steps to invest in AI. The Kingdom's Vision 2030 states: "In technology, we will increase our investments in, and lead, the digital economy." Dr. Ahmed Al Theneyan, Deputy Minister of Technology, Industry and Digital Capabilities, stated: "The Government has invested around USD 3 billion in building the infrastructure so that the country is AI-ready and can become a leader in AI use." Vision 2030 programs also promote technological innovation. Another major step the country has taken is establishing NEOM, a model smart city.

Effat University embraced this ambition and has worked to make it a reality by offering academic programs that support the sectors such projects need. For example, the master's program in Energy Engineering was launched four years ago to support the energy sector, and the bachelor's program in Computer Science added tracks in Artificial Intelligence and Cyber Security in the Fall 2020 semester. Additionally, the Energy & Technology and Smart Building Research Centers were established to support innovation in the technology and energy sectors. In general, Effat University works effectively to support the KSA in achieving its vision in this time of national transformation by graduating skilled citizens in different fields of technology.

The guest editors would like to take this opportunity to thank all the authors for the effort they put into preparing their manuscripts and for their valuable contributions. We wish to express our deepest gratitude to the referees, who provided instrumental and constructive feedback to the authors. We also extend our sincere thanks and appreciation to the organizing team, under the leadership of the Chair of the L&T 2019 Conference Steering Committee, Dr. Haifa Jamal Al-Lail, University President, for her support and dedication.

Our sincere thanks go to the Editor-in-Chief for his kind help and support.

Chui KT, Lytras MD, Visvizi A (2018) Energy sustainability in smart cities: artificial intelligence, smart monitoring, and optimization of energy consumption. Energies 11(11):2869


Chui KT, Fung DCL, Lytras MD, Lam TM (2020) Predicting at-risk university students in a virtual learning environment via a machine learning algorithm. Comput Human Behav 107:105584

Lytras MD, Visvizi A, Daniela L, Sarirete A, De Pablos PO (2018) Social networks research for sustainable smart education. Sustainability 10(9):2974

Lytras MD, Visvizi A, Sarirete A (2019) Clustering smart city services: perceptions, expectations, responses. Sustainability 11(6):1669

Lytras MD, Visvizi A, Chopdar PK, Sarirete A, Alhalabi W (2021) Information management in smart cities: turning end users’ views into multi-item scale development, validation, and policy-making recommendations. Int J Inf Manag 56:102146

Visvizi A, Jussila J, Lytras MD, Ijäs M (2020) Tweeting and mining OECD-related microcontent in the post-truth era: A cloud-based app. Comput Human Behav 107:105958


Author information

Authors and Affiliations

Effat College of Engineering, Effat Energy and Technology Research Center, Effat University, P.O. Box 34689, Jeddah, Saudi Arabia

Akila Sarirete, Zain Balfagih, Tayeb Brahimi & Miltiadis D. Lytras

King Abdulaziz University, Jeddah, 21589, Saudi Arabia

Miltiadis D. Lytras

Effat College of Business, Effat University, P.O. Box 34689, Jeddah, Saudi Arabia

Anna Visvizi

Institute of International Studies (ISM), SGH Warsaw School of Economics, Aleja Niepodległości 162, 02-554, Warsaw, Poland


Corresponding author

Correspondence to Akila Sarirete.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article

Sarirete, A., Balfagih, Z., Brahimi, T. et al. Artificial intelligence and machine learning research: towards digital transformation at a global scale. J Ambient Intell Human Comput 13, 3319–3321 (2022). https://doi.org/10.1007/s12652-021-03168-y


Published: 17 April 2021

Issue Date: July 2022

DOI: https://doi.org/10.1007/s12652-021-03168-y


Machine Learning: Models, Challenges, and Research Directions


1. Introduction

The key contributions of this paper are as follows:

  • Brief discussion of data pre-processing;
  • Detailed classification of supervised, semi-supervised, unsupervised, and reinforcement learning models;
  • Study of known optimization techniques;
  • Challenges of machine learning in the field of cybersecurity.

2. Related Work and Research Methodology

Table: Related surveys, by year and study highlights. (The original table also marks each survey's coverage of data pre-processing, hyperparameter tuning, and supervised, unsupervised, semi-supervised, and reinforcement learning.)

  • [ ] (2021) Describes the known deep learning models, their principles, and characteristics.
  • [ ] (2019) Focuses on a limited set of machine learning techniques for software-defined networking only.
  • [ ] (2022) Investigates known issues in system design that can be solved using machine learning techniques.
  • [ ] (2021) Presents a detailed description of a few supervised models and their optimization techniques.
  • [ ] (2021) Provides an overview of semi-supervised machine learning techniques and their existing algorithms.
  • [ ] (2022) Provides the state of the art, challenges, and limitations of supervised models in maritime risk analysis.
  • [ ] (2022) Reviews hardware architectures for reinforcement learning algorithms.
  • [ ] (2022) Presents existing algorithms for wireless sensor networks and describes the challenges of using such techniques.
  • [ ] (2016) Describes most of the known supervised algorithms for classification problems.
  • [ ] (2019) Provides a description of known supervised and unsupervised models.
  • [ ] (2021) Discusses supervised and unsupervised deep learning models for intrusion detection systems.
  • [ ] (2021) Surveys existing supervised and unsupervised techniques in the smart grid.
  • [ ] (2021) Explains known algorithms for image classification.
  • [ ] (2022) Illustrates unsupervised deep learning models and summarizes their challenges.
  • [ ] (2023) Discusses techniques for future energy usage.
  • [ ] (2020) Reviews various ML techniques for Internet of Things security.
  • [ ] (2020) Proposes a taxonomy of machine learning techniques for Internet of Things security.
  • [ ] (2019) Surveys the taxonomy of machine learning models in intrusion detection systems.
  • [ ] (2022) Reviews ML techniques in industrial control systems.
  • [ ] (2022) Proposes a taxonomy of intrusion detection systems for supervised models.

3. Machine Learning Models

3.1. Supervised Learning

3.2. Semi-Supervised Learning

3.3. Unsupervised Learning

3.4. Reinforcement Learning

4. Machine Learning Processes

4.1. Data Pre-Processing

4.2. Tuning Approaches

4.3. Evaluation Metrics

4.3.1. Evaluation Metrics for Supervised Learning

4.3.2. Evaluation Metrics for Unsupervised Learning Models

4.3.3. Evaluation Metrics for Semi-Supervised Learning Models

4.3.4. Evaluation Metrics for Reinforcement Learning Models

5. Challenges and Future Directions

6. Conclusions

Author Contributions

Data Availability Statement

Conflicts of Interest

  • Sarker, I.H. Machine Learning: Algorithms, real-world applications and research directions. SN Comput. Sci. 2021 , 2 , 160. [ Google Scholar ] [ CrossRef ]
  • Vinuesa, R.; Azizpour, H.; Leite, I.; Balaam, M.; Dignum, V.; Domisch, S.; Felländer, A.; Langhans, S.D.; Tegmark, M.; Nerini, F.F. The role of artificial intelligence in achieving the sustainable development goals. Nat. Commun. 2020 , 11 , 233. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Ullah, Z.; Al-Turjman, F.; Mostarda, L.; Gagliardi, R. Applications of artificial intelligence and machine learning in smart cities. Comput. Commun. 2020 , 154 , 313–323. [ Google Scholar ] [ CrossRef ]
  • Ozcanli, A.K.; Yaprakdal, F.; Baysal, M. Deep learning methods and applications for electrical power systems: A comprehensive review. Int. J. Energy Res. 2020 , 44 , 7136–7157. [ Google Scholar ] [ CrossRef ]
  • Zhao, S.; Blaabjerg, F.; Wang, H. An Overview of Artificial Intelligence Applications for Power Electronics. IEEE Trans. Power Electron. 2021 , 36 , 4633–4658. [ Google Scholar ] [ CrossRef ]
  • Mamun, A.A.; Sohel, M.; Mohammad, N.; Sunny, M.S.H.; Dipta, D.R.; Hossain, E. A Comprehensive Review of the Load Fore-casting Techniques Using Single and Hybrid Predictive Models. IEEE Access 2020 , 8 , 134911–134939. [ Google Scholar ] [ CrossRef ]
  • Massaoudi, M.; Darwish, A.; Refaat, S.S.; Abu-Rub, H.; Toliyat, H.A. UHF Partial Discharge Localization in Gas-Insulated Switch-gears: Gradient Boosting Based Approach. In Proceedings of the 2020 IEEE Kansas Power and Energy Conference (KPEC), Manhattan, KS, USA, 13–14 July 2020; pp. 1–5. [ Google Scholar ]
  • Ali, S.S.; Choi, B.J. State-of-the-Art Artificial Intelligence Techniques for Distributed Smart Grids: A Review. Electronics 2020 , 9 , 1030. [ Google Scholar ] [ CrossRef ]
  • Yin, L.; Gao, Q.; Zhao, L.; Zhang, B.; Wang, T.; Li, S.; Liu, H. A review of machine learning for new generation smart dispatch in power systems. Eng. Appl. Artif. Intell. 2020 , 88 , 103372. [ Google Scholar ] [ CrossRef ]
  • Peng, S.; Sun, S.; Yao, Y.-D. A Survey of Modulation Classification Using Deep Learning: Signal Representation and Data Prepro-cessing. In IEEE Transactions on Neural Networks and Learning Systems ; IEEE: New York, NY, USA, 2021. [ Google Scholar ]
  • Arjoune, Y.; Kaabouch, N. A Comprehensive Survey on Spectrum Sensing in Cognitive Radio Networks: Recent Advances, New Challenges, and Future Research Directions. Sensors 2019 , 19 , 126. [ Google Scholar ] [ CrossRef ]
  • Meng, T.; Jing, X.; Yan, Z.; Pedrycz, W. A survey on machine learning for data fusion. Inf. Fusion 2020 , 57 , 115–129. [ Google Scholar ] [ CrossRef ]
  • Carvalho, D.V.; Pereira, E.M.; Cardoso, J.S. Machine Learning Interpretability: A Survey on Methods and Metrics. Electronics 2019 , 8 , 832. [ Google Scholar ] [ CrossRef ]
  • Khoei, T.T.; Ismail, S.; Kaabouch, N. Boosting-based Models with Tree-structured Parzen Estimator Optimization to Detect Intrusion Attacks on Smart Grid. In Proceedings of the 2021 IEEE 12th Annual Ubiquitous Computing, Electronics & Mobile Communication Conference (UEMCON), New York, NY, USA, 1–4 December 2021; pp. 165–170. [ Google Scholar ] [ CrossRef ]
  • Hutter, F.; Lücke, J.; Schmidt-Thieme, L. Beyond manual tuning of hyperparameters. KI-Künstliche Intell. 2015 , 29 , 329–337. [ Google Scholar ] [ CrossRef ]
  • Khoei, T.T.; Aissou, G.; Hu, W.C.; Kaabouch, N. Ensemble Learning Methods for Anomaly Intrusion Detection System in Smart Grid. In Proceedings of the IEEE International Conference on Electro Information Technology (EIT), Mt. Pleasant, MI, USA, 14–15 May 2021; pp. 129–135. [ Google Scholar ] [ CrossRef ]
  • Waubert de Puiseau, C.; Meyes, R.; Meisen, T. On reliability of reinforcement learning based production scheduling systems: A comparative survey. J. Intell. Manuf. 2022 , 33 , 911–927. [ Google Scholar ] [ CrossRef ]
  • Moos, J.; Hansel, K.; Abdulsamad, H.; Stark, S.; Clever, D.; Peters, J. Robust Reinforcement Learning: A Review of Foundations and Recent Advances. Mach. Learn. Knowl. Extr. 2022 , 4 , 276–315. [ Google Scholar ] [ CrossRef ]
  • Latif, S.; Cuayáhuitl, H.; Pervez, F.; Shamshad, F.; Ali, H.S.; Cambria, E. A survey on deep reinforcement learning for audio-based applications. Artif. Intell. Rev. 2022 , 56 , 2193–2240. [ Google Scholar ] [ CrossRef ]
  • Passah, A.; Kandar, D. A lightweight deep learning model for classification of synthetic aperture radar images. Ecol. Inform. 2023 , 77 , 102228. [ Google Scholar ] [ CrossRef ]
  • Verbraeken, J.; Wolting, M.; Katzy, J.; Kloppenburg, J.; Verbelen, T.; Rellermeyer, J.S. A survey on distributed machine learning. ACM Comput. Surv. 2020 , 53 , 1–33. [ Google Scholar ] [ CrossRef ]
  • Dargan, S.; Kumar, M.; Ayyagari, M.R.; Kumar, G. A survey of deep learning and its applications: A new paradigm to machine learning. Arch. Comput. Methods Eng. 2020 , 27 , 1071–1092. [ Google Scholar ] [ CrossRef ]
  • Pitropakis, N.; Panaousis, E.; Giannetsos, T.; Anastasiadis, E.; Loukas, G. A taxonomy and survey of attacks against machine learning. Comput. Sci. Rev. 2019 , 34 , 100199. [ Google Scholar ] [ CrossRef ]
  • Wu, X.; Xiao, L.; Sun, Y.; Zhang, J.; Ma, T.; He, L. A survey of human-in-the-loop for machine learning. Futur. Gener. Comput. Syst. 2022 , 135 , 364–381. [ Google Scholar ] [ CrossRef ]
  • Wang, Q.; Ma, Y.; Zhao, K.; Tian, Y. A comprehensive survey of loss functions in machine learning. Ann. Data Sci. 2022 , 9 , 187–212. [ Google Scholar ] [ CrossRef ]
  • Choi, H.; Park, S. A Survey of Machine Learning-Based System Performance Optimization Techniques. Appl. Sci. 2021 , 11 , 3235. [ Google Scholar ] [ CrossRef ]
  • Rawson, A.; Brito, M. A survey of the opportunities and challenges of supervised machine learning in maritime risk analysis. Transp. Rev. 2022 , 43 , 108–130. [ Google Scholar ] [ CrossRef ]
  • Ahmad, R.; Wazirali, R.; Abu-Ain, T. Machine Learning for Wireless Sensor Networks Security: An Overview of Challenges and Issues. Sensors 2022 , 22 , 4730. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Singh, A.; Thakur, N.; Sharma, A. A review of supervised machine learning algorithms. In Proceedings of the 2016 3rd International Conference on Computing for Sustainable Global Development (INDIACom), New Delhi, India, 16–18 March 2016; pp. 1310–1315. [ Google Scholar ]
  • Abdallah, E.E.; Eleisah, W.; Otoom, A.F. Intrusion Detection Systems using Supervised Machine Learning Techniques: A survey. Procedia Comput. Sci. 2022 , 201 , 205–212. [ Google Scholar ] [ CrossRef ]
  • Dike, H.U.; Zhou, Y.; Deveerasetty, K.K.; Wu, Q. Unsupervised Learning Based On Artificial Neural Network: A Review. In Proceedings of the 2018 IEEE International Conference on Cyborg and Bionic Systems (CBS), 25–27 October 2018; pp. 322–327. [ Google Scholar ]
  • van Engelen, J.E.; Hoos, H.H. A survey on semi-supervised learning. Mach. Learn. 2020 , 109 , 373–440. [ Google Scholar ] [ CrossRef ]
  • Rothmann, M.; Porrmann, M. A Survey of Domain-Specific Architectures for Reinforcement Learning. IEEE Access 2022 , 10 , 13753–13767. [ Google Scholar ] [ CrossRef ]
  • Dong, S.; Wang, P.; Abbas, K. A survey on deep learning and its applications. Comput. Sci. Rev. 2020 , 40 , 100379. [ Google Scholar ] [ CrossRef ]
  • Ray, S. A Quick Review of Machine Learning Algorithms. In Proceedings of the 2019 International Conference on Machine Learning, Big Data, Cloud and Parallel Computing (COMITCon), Faridabad, India, 14–16 February 2019; pp. 35–39. [ Google Scholar ]
  • Lansky, J.; Ali, S.; Mohammadi, M.; Majeed, M.K.; Karim, S.H.T.; Rashidi, S.; Hosseinzadeh, M.; Rahmani, A.M. Deep Learning-Based Intrusion Detection Systems: A Systematic Review. IEEE Access 2021 , 9 , 101574–101599. [ Google Scholar ] [ CrossRef ]
  • Massaoudi, M.; Abu-Rub, H.; Refaat, S.S.; Chihi, I.; Oueslati, F.S. Deep Learning in Smart Grid Technology: A Review of Recent Advancements and Future Prospects. IEEE Access 2021 , 9 , 54558–54578. [ Google Scholar ] [ CrossRef ]
  • Liu, H.; Lang, B. Machine Learning and Deep Learning Methods for Intrusion Detection Systems: A Survey. Appl. Sci. 2019 , 9 , 4396. [ Google Scholar ] [ CrossRef ]
  • Wu, N.; Xie, Y. A survey of machine learning for computer architecture and systems. ACM Comput. Surv. 2022 , 55 , 1–39. [ Google Scholar ] [ CrossRef ]
  • Schmarje, L.; Santarossa, M.; Schröder, S.-M.; Koch, R. A Survey on Semi-, Self- and Unsupervised Learning for Image Classification. IEEE Access 2021 , 9 , 82146–82168. [ Google Scholar ] [ CrossRef ]
  • Xie, J.; Yu, F.R.; Huang, T.; Xie, R.; Liu, J.; Wang, C.; Liu, Y. A Survey of Machine Learning Techniques Applied to Software Defined Networking (SDN): Research Issues and Challenges. In IEEE Communications Surveys & Tutorials ; IEEE: New York, NY, USA, 2019; Volume 21, pp. 393–430. [ Google Scholar ]
  • Yao, Z.; Lum, Y.; Johnston, A.; Mejia-Mendoza, L.M.; Zhou, X.; Wen, Y.; Aspuru-Guzik, A.; Sargent, E.H.; Seh, Z.W. Machine learning for a sustainable energy future. Nat. Rev. Mater. 2023 , 8 , 202–215. [ Google Scholar ] [ CrossRef ]
  • Al-Garadi, M.A.; Mohamed, A.; Al-Ali, A.K.; Du, X.; Ali, I.; Guizani, M. A Survey of Machine and Deep Learning Methods for Internet of Things (IoT) Security. In IEEE Communications Surveys & Tutorials ; IEEE: New York, NY, USA, 2020; Volume 22, pp. 1646–1685. [ Google Scholar ]
  • Messaoud, S.; Bradai, A.; Bukhari, S.H.R.; Quang, P.T.A.; Ahmed, O.B.; Atri, M. A survey on machine learning in internet of things: Algorithms, strategies, and applications. Internet Things 2020 , 12 , 100314. [ Google Scholar ] [ CrossRef ]
  • Umer, M.A.; Junejo, K.N.; Jilani, M.T.; Mathur, A.P. Machine learning for intrusion detection in industrial control systems: Ap-plications, challenges, and recommendations. Int. J. Crit. Infrastruct. Prot. 2022 , 38 , 100516. [ Google Scholar ] [ CrossRef ]
  • Von Rueden, L.; Mayer, S.; Garcke, J.; Bauckhage, C.; Schuecker, J. Informed machine learning–towards a taxonomy of explicit integration of knowledge into machine learning. Learning 2019 , 18 , 19–20. [ Google Scholar ]
  • Waring, J.; Lindvall, C.; Umeton, R. Automated machine learning: Review of the state-of-the-art and opportunities for healthcare. Artif. Intell. Med. 2020 , 104 , 101822. [ Google Scholar ] [ CrossRef ]
  • Wang, H.; Lv, L.; Li, X.; Li, H.; Leng, J.; Zhang, Y.; Thomson, V.; Liu, G.; Wen, X.; Luo, G. A safety management approach for Industry 5.0′ s human-centered manufacturing based on digital twin. J. Manuf. Syst. 2023 , 66 , 1–12. [ Google Scholar ] [ CrossRef ]
  • Reuther, A.; Michaleas, P.; Jones, M.; Gadepally, V.; Samsi, S.; Kepner, J. Survey and Benchmarking of Machine Learning Accelerators. In Proceedings of the 2019 IEEE High Performance Extreme Computing Conference (HPEC), Waltham, MA USA, 24–26 September 2019; pp. 1–9. [ Google Scholar ]
  • Kaur, B.; Dadkhah, S.; Shoeleh, F.; Neto, E.C.P.; Xiong, P.; Iqbal, S.; Lamontagne, P.; Ray, S.; Ghorbani, A.A. Internet of Things (IoT) security dataset evolution: Challenges and future directions. Internet Things 2023 , 22 , 100780. [ Google Scholar ] [ CrossRef ]
  • Paullada, A.; Raji, I.D.; Bender, E.M.; Denton, E.; Hanna, A. Data and its (dis)contents: A survey of dataset development and use in machine learning research. Patterns 2021 , 2 , 100336. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Slimane, H.O.; Benouadah, S.; Khoei, T.T.; Kaabouch, N. A Light Boosting-based ML Model for Detecting Deceptive Jamming Attacks on UAVs. In Proceedings of the 2022 IEEE 12th Annual Computing and Communication Workshop and Conference (CCWC), Las Vegas, NV, USA, 26–29 January 2022; pp. 328–333. [ Google Scholar ]
  • Manesh, M.R.; Kenney, J.; Hu, W.C.; Devabhaktuni, V.K.; Kaabouch, N. Detection of GPS spoofing attacks on unmanned aerial systems. In Proceedings of the 16th IEEE Annual Consumer Communications & Networking Conference (CCNC), Las Vegas, NV, USA, 11–14 January 2019; pp. 1–6. [ Google Scholar ]
  • Sharifani, K.; Amini, M. Machine Learning and Deep Learning: A Review of Methods and Applications. World Inf. Technol. Eng. J. 2023 , 10 , 3897–3904. [ Google Scholar ]
  • Obaid, H.S.; Dheyab, S.A.; Sabry, S.S. The Impact of Data Pre-Processing Techniques and Dimensionality Reduction on the Ac-curacy of Machine Learning. In Proceedings of the 2019 9th Annual Information Technology, Electromechanical Engineering and Microelectronics Conference (IEMECON), Jaipur, India, 13–15 March 2019; pp. 279–283. [ Google Scholar ]
  • Liu, B.; Ding, M.; Shaham, S.; Rahayu, W.; Lin, Z. When machine learning meets privacy: A survey and outlook. ACM Comput. Surv. (CSUR) 2021 , 54 , 1–36. [ Google Scholar ] [ CrossRef ]
  • Singh, S.; Gupta, P. Comparative study ID3, cart and C4. 5 decision tree algorithm: A survey. Int. J. Adv. Inf. Sci. Technol. (IJAIST) 2014 , 27 , 97–103. [ Google Scholar ]
  • Zhang, M.-L.; Zhou, Z.-H. ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognit. 2007 , 40 , 2038–2048. [ Google Scholar ] [ CrossRef ]
  • Musavi, M.T.; Ahmed, W.; Chan, K.H.; Faris, K.B.; Hummels, D.M. On the training of radial basis function classifiers. Neural Netw. 1992 , 5 , 595–603. [ Google Scholar ] [ CrossRef ]
  • Zhou, J.; Gandomi, A.H.; Chen, F.; Holzinger, A. Evaluating the Quality of Machine Learning Explanations: A Survey on Methods and Metrics. Electronics 2021 , 10 , 593. [ Google Scholar ] [ CrossRef ]
  • Jiang, T.; Fang, H.; Wang, H. Blockchain-Based Internet of Vehicles: Distributed Network Architecture and Performance Analy-sis. IEEE Internet Things J. 2019 , 6 , 4640–4649. [ Google Scholar ] [ CrossRef ]
  • Jia, W.; Dai, D.; Xiao, X.; Wu, H. ARNOR: Attention regularization based noise reduction for distant supervision relation classifi-cation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 1399–1408. [ Google Scholar ]
  • Abiodun, O.I.; Omolara, A.E.; Dada, K.V.; Mohamed, N.A.; Arshad, H. State-of-the-art in artificial neural network applications: A survey. Heliyon 2018 , 4 , e00938. [ Google Scholar ] [ CrossRef ]
  • Izeboudjen, N.; Larbes, C.; Farah, A. A new classification approach for neural networks hardware: From standards chips to embedded systems on chip. Artif. Intell. Rev. 2014 , 41 , 491–534. [ Google Scholar ] [ CrossRef ]
  • Wang, D.; He, H.; Liu, D. Intelligent Optimal Control With Critic Learning for a Nonlinear Overhead Crane System. IEEE Trans. Ind. Informatics 2018 , 14 , 2932–2940. [ Google Scholar ] [ CrossRef ]
  • Wang, S.-C. Artificial Neural Network. In Interdisciplinary Computing in Java Programming ; Springer: Berlin/Heidelberg, Germany, 2003; pp. 81–100. [ Google Scholar ]
  • Albawi, S.; Mohammed, T.A.; Al-Zawi, S. Understanding of a convolutional neural network. In Proceedings of the International Conference on Engineering and Technology (ICET), Antalya, Turkey, 21–23 August 2017. [ Google Scholar ]
  • Khoei, T.T.; Slimane, H.O.; Kaabouch, N. Cyber-Security of Smart Grids: Attacks, Detection, Countermeasure Techniques, and Future Directions. Commun. Netw. 2022 , 14 , 119–170. [ Google Scholar ] [ CrossRef ]
  • Gunturi, S.K.; Sarkar, D. Ensemble machine learning models for the detection of energy theft. Electr. Power Syst. Res. 2021 , 192 , 106904. [ Google Scholar ] [ CrossRef ]
  • Chafii, M.; Bader, F.; Palicot, J. Enhancing coverage in narrow band-IoT using machine learning. In Proceedings of the 2018 IEEE Wireless Communications and Networking Conference (WCNC), Barcelona, Spain, 15–18 April 2018; pp. 1–6. [ Google Scholar ]
  • Bithas, P.S.; Michailidis, E.T.; Nomikos, N.; Vouyioukas, D.; Kanatas, A.G. A Survey on Machine-Learning Techniques for UAV-Based Communications. Sensors 2019 , 19 , 5170. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Benos, L.; Tagarakis, A.C.; Dolias, G.; Berruto, R.; Kateris, D.; Bochtis, D. Machine Learning in Agriculture: A Comprehensive Updated Review. Sensors 2021 , 21 , 3758. [ Google Scholar ] [ CrossRef ]
  • Wagle, P.P.; Rani, S.; Kowligi, S.B.; Suman, B.H.; Pramodh, B.; Kumar, P.; Raghavan, S.; Shastry, K.A.; Sanjay, H.A.; Kumar, M.; et al. Machine Learning-Based Ensemble Network Security System. In Recent Advances in Artificial Intelligence and Data Engineering ; Springer: Berlin/Heidelberg, Germany, 2022; pp. 3–15. [ Google Scholar ]
  • Sutton, C.D. Classification and regression trees, bagging, and boosting. Handb. Stat. 2005 , 24 , 303–329. [ Google Scholar ]
  • Zaadnoordijk, L.; Besold, T.R.T.; Cusack, R. Lessons from infant learning for unsupervised machine learning. Nat. Mach. Intell. 2022 , 4 , 510–520. [ Google Scholar ] [ CrossRef ]
  • Khoei, T.T.; Kaabouch, N. A Comparative Analysis of Supervised and Unsupervised Models for Detecting Attacks on the Intrusion Detection Systems. Information 2023 , 14 , 103. [ Google Scholar ] [ CrossRef ]
  • Kumar, P.; Gupta, G.P.; Tripathi, R. An ensemble learning and fog-cloud architecture-driven cyber-attack detection framework for IoMT networks. Comput. Commun. 2021 , 166 , 110–124. [ Google Scholar ] [ CrossRef ]
  • Hady, M.; Abdel, A.M.F.; Schwenker, F. Semi-supervised learning. In Handbook on Neural Information Processing ; Springer: Berlin/Heidelberg, Germany, 2013. [ Google Scholar ]
  • Elsken, T.; Metzen, J.H.; Hutter, F. Neural architecture search: A survey. J. Mach. Learn. Res. 2019 , 20 , 1–21. [ Google Scholar ]
  • Luo, Y.; Zhu, J.; Li, M.; Ren, Y.; Zhang, B. Smooth neighbors on teacher graphs for semi-supervised learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Lake City, UT, USA, 18–22 June 2018; pp. 8896–8905. [ Google Scholar ]
  • Park, S.; Park, J.; Shin, S.; Moon, I. Adversarial dropout for supervised and semi-supervised learning. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; pp. 3917–3924. [ Google Scholar ]
  • Khoei, T.T.; Kaabouch, N. ACapsule Q-learning based reinforcement model for intrusion detection system on smart grid. In Proceedings of the IEEE International Conference on Electro Information Technology (eIT), Romeoville, IL, USA, 18–20 May 2023; pp. 333–339. [ Google Scholar ]
  • Polydoros, A.S.; Nalpantidis, L. Survey of model-based reinforcement learning: Applications on robotics. J. Intell. Robot. Syst. 2017 , 86 , 153–173. [ Google Scholar ] [ CrossRef ]
  • Degris, T.; Pilarski, P.M.; Sutton, R.S. Model-Free reinforcement learning with continuous action in practice. In Proceedings of the 2012 American Control Conference (ACC), Montreal, QC, Canada, 27–29 June 2012; pp. 2177–2182. [ Google Scholar ] [ CrossRef ]
  • Cao, D.; Hu, W.; Zhao, J.; Zhang, G.; Zhang, B.; Liu, Z.; Chen, Z.; Blaabjerg, F. Reinforcement learning and its applications in modern power and energy systems: A review. J. Mod. Power Syst. Clean Energy 2020 , 8 , 1029–1042. [ Google Scholar ] [ CrossRef ]
  • Zhang, J.M.; Harman, M.; Ma, L.; Liu, Y. Machine Learning Testing: Survey, Landscapes and Horizons. In IEEE Transactions on Software Engineering ; IEEE: New York, NY, USA, 2022; Volume 48, pp. 1–36. [ Google Scholar ]
  • Salahdine, F.; Kaabouch, N. Security threats, detection, and countermeasures for physical layer in cognitive radio networks: A survey. Phys. Commun. 2020 , 39 , 101001. [ Google Scholar ] [ CrossRef ]
  • Ramírez, J.; Yu, W.; Perrusquía, A. Model-free reinforcement learning from expert demonstrations: A survey. Artif. Intell. Rev. 2022 , 55 , 3213–3241. [ Google Scholar ] [ CrossRef ]
  • Yang, L.; Shami, A. On hyperparameter optimization of machine learning algorithms: Theory and practice. Neurocomputing 2020 , 415 , 295–316. [ Google Scholar ] [ CrossRef ]
  • Dev, K.; Maddikunta, P.K.R.; Gadekallu, T.R.; Bhattacharya, S.; Hegde, P.; Singh, S. Energy Optimization for Green Communication in IoT Using Harris Hawks Optimization. In IEEE Transactions on Green Communications and Networking ; IEEE: New York, NY, USA, 2022; Volume 6, pp. 685–694. [ Google Scholar ]
  • Khodadadi, N.; Snasel, V.; Mirjalili, S. Dynamic Arithmetic Optimization Algorithm for Truss Optimization Under Natural Fre-quency Constraints. IEEE Access 2022 , 10 , 16188–16208. [ Google Scholar ] [ CrossRef ]
  • Cummins, C.; Wasti, B.; Guo, J.; Cui, B.; Ansel, J.; Gomez, S.; Jain, S.; Liu, J.; Teytaud, O.; Steinerm, B.; et al. CompilerGym: Robust, Performant Compiler Optimization Environments for AI Research. In Proceedings of the 2022 IEEE/ACM In-ternational Symposium on Code Generation and Optimization (CGO), Seoul, Republic of Korea, 2–6 April 2022; pp. 92–105. [ Google Scholar ]
  • Zhang, W.; Gu, X.; Tang, L.; Yin, Y.; Liu, D.; Zhang, Y. Application of machine learning, deep learning and optimization algo-rithms in geoengineering and geoscience: Comprehensive review and future challenge. Gondwana Res. 2022 , 109 , 1–17. [ Google Scholar ] [ CrossRef ]
  • Mittal, S.; Vaishay, S. A survey of techniques for optimizing deep learning on GPUs. J. Syst. Arch. 2019 , 99 , 101635. [ Google Scholar ] [ CrossRef ]
  • Zhang, Q.; Yang, L.T.; Chen, Z.; Li, P. A survey on deep learning for big data. Inf. Fusion 2018 , 42 , 146–157. [ Google Scholar ] [ CrossRef ]
  • Oyelade, O.N.; Ezugwu, A.E.-S.; Mohamed, T.I.A.; Abualigah, L. Ebola Optimization Search Algorithm: A New Nature-Inspired Metaheuristic Optimization Algorithm. IEEE Access 2022 , 10 , 16150–16177. [ Google Scholar ] [ CrossRef ]
  • Blank, J.; Deb, K. Pymoo: Multi-Objective Optimization in Python. IEEE Access 2020 , 8 , 89497–89509. [ Google Scholar ] [ CrossRef ]
  • Qiao, K.; Yu, K.; Qu, B.; Liang, J.; Song, H.; Yue, C. An Evolutionary Multitasking Optimization Framework for Constrained Multi-objective Optimization Problems. IEEE Trans. Evol. Comput. 2022 , 26 , 263–277. [ Google Scholar ] [ CrossRef ]
  • Riaz, M.; Ahmad, S.; Hussain, I.; Naeem, M.; Mihet-Popa, L. Probabilistic Optimization Techniques in Smart Power System. Energies 2022 , 15 , 825. [ Google Scholar ] [ CrossRef ]
  • Yu, T.; Zhu, H. Hyper-parameter optimization: A review of algorithms and applications. arXiv 2020 , arXiv:2003.05689. [ Google Scholar ]
  • Yang, X.; Song, Z.; King, I.; Xu, Z. A Survey on deep semi-supervised learning. arXiv 2021 , arXiv:2103.00550. [ Google Scholar ] [ CrossRef ]
  • Gibson, B.R.; Rogers, T.T.; Zhu, X. Human semi-supervised learning. Top. Cogn. Sci. 2013 , 5 , 132–172. [ Google Scholar ] [ CrossRef ]
  • Nguyen, T.T.; Nguyen, N.D.; Nahavandi, S. Deep Reinforcement Learning for Multiagent Systems: A Review of Challenges, Solutions, and Applications. IEEE Trans. Cybern. 2020 , 50 , 3826–3839. [ Google Scholar ] [ CrossRef ]
  • Canese, L.; Cardarilli, G.C.; Di Nunzio, L.; Fazzolari, R.; Giardino, D.; Re, M.; Spanò, S. Multi-Agent Reinforcement Learning: A Review of Challenges and Applications. Appl. Sci. 2021 , 11 , 4948. [ Google Scholar ] [ CrossRef ]
  • Du, W.; Ding, S. A survey on multi-agent deep reinforcement learning: From the perspective of challenges and applications. Artif. Intell. Rev. 2020 , 54 , 3215–3238. [ Google Scholar ] [ CrossRef ]
  • Salwan, D.; Kant, S.; Pareek, H.; Sharma, R. Challenges with reinforcement learning in prosthesis. Mater. Today Proc. 2022 , 49 , 3133–3136. [ Google Scholar ] [ CrossRef ]
  • Narkhede, M.S.; Chatterji, S.; Ghosh, S. Trends and challenges in optimization techniques for operation and control of Mi-crogrid—A review. In Proceedings of the 2012 1st International Conference on Power and Energy in NERIST (ICPEN), Nirjuli, India, 28–29 December 2012; pp. 1–7. [ Google Scholar ]
  • Khoei, T.T.; Ismail, S.; Kaabouch, N. Dynamic Selection Techniques for Detecting GPS Spoofing Attacks on UAVs. Sensors 2022 , 22 , 662. [ Google Scholar ] [ CrossRef ]
  • Khoei, T.T.; Ismail, S.; Al Shamaileh, K.; Devabhaktuni, V.K.; Kaabouch, N. Impact of Dataset and Model Parameters on Machine Learning Performance for the Detection of GPS Spoofing Attacks on Unmanned Aerial Vehicles. Appl. Sci. 2022 , 13 , 383. [ Google Scholar ] [ CrossRef ]
  • Khoei, T.T.; Kaabouch, N. Densely Connected Neural Networks for Detecting Denial of Service Attacks on Smart Grid Network. In Proceedings of the IEEE 13th Annual Ubiquitous Computing, Electronics & Mobile Communication Conference (UEMCON), New York, NY, USA, 26–29 October 2022; pp. 0207–0211. [ Google Scholar ]
  • Khan, A.; Khan, S.H.; Saif, M.; Batool, A.; Sohail, A.; Khan, M.W. A Survey of Deep Learning Techniques for the Analysis of COVID-19 and their usability for Detecting Omicron. J. Exp. Theor. Artif. Intell. 2023 , 1–43. [ Google Scholar ] [ CrossRef ]
  • Gopinath, M.; Sethuraman, S.C. A comprehensive survey on deep learning based malware detection techniques. Comput. Sci. Rev. 2023 , 47 , 100529. [ Google Scholar ]
  • Gheisari, M.; Ebrahimzadeh, F.; Rahimi, M.; Moazzamigodarzi, M.; Liu, Y.; Pramanik, P.K.D.; Heravi, M.A.; Mehbodniya, A.; Ghaderzadeh, M.; Feylizadeh, M.R.; et al. Deep learning: Applications, architectures, models, tools, and frameworks: A com-prehensive survey. In CAAI Transactions on Intelligence Technology ; IET: Stevenage, UK, 2023. [ Google Scholar ]
  • Morgan, D.; Jacobs, R. Opportunities and challenges for machine learning in materials science. Annu. Rev. Mater. Res. 2020 , 50 , 71–103. [ Google Scholar ] [ CrossRef ]
  • Phoon, K.K.; Zhang, W. Future of machine learning in geotechnics. Georisk Assess. Manag. Risk Eng. Syst. Geohazards 2023 , 17 , 7–22. [ Google Scholar ] [ CrossRef ]
  • Krishnam, N.P.; Ashraf, M.S.; Rajagopal, B.R.; Vats, P.; Chakravarthy, D.S.K.; Rafi, S.M. Analysis of Current Trends, Advances and Challenges of Machine Learning (Ml) and Knowledge Extraction: From Ml to Explainable AI. Ind. Qualif.-Stitute Adm. Manag. UK 2022 , 58 , 54–62. [ Google Scholar ]
  • Li, Z.; Yoon, J.; Zhang, R.; Rajabipour, F.; Srubar, W.V., III; Dabo, I.; Radlińska, A. Machine learning in concrete science: Applications, challenges, and best practices. NPJ Comput. Mater. 2022 , 8 , 127. [ Google Scholar ] [ CrossRef ]
  • Houssein, E.H.; Abohashima, Z.; Elhoseny, M.; Mohamed, W.M. Machine learning in the quantum realm: The state-of-the-art, challenges, and future vision. Expert Syst. Appl. 2022 , 194 , 116512. [ Google Scholar ] [ CrossRef ]
  • Khan, T.; Tian, W.; Zhou, G.; Ilager, S.; Gong, M.; Buyya, R. Machine learning (ML)-centric resource management in cloud computing: A review and future directions. J. Netw. Comput. Appl. 2022 , 204 , 103405. [ Google Scholar ] [ CrossRef ]
  • Esterhuizen, J.A.; Goldsmith, B.R.; Linic, S. Interpretable machine learning for knowledge generation in heterogeneous catalysis. Nat. Catal. 2022 , 5 , 175–184. [ Google Scholar ] [ CrossRef ]
  • Bharadiya, J.P. Leveraging Machine Learning for Enhanced Business Intelligence. Int. J. Comput. Sci. Technol. 2023 , 7 , 1–19. [ Google Scholar ]
  • Talaei Khoei, T.; Ould Slimane, H.; Kaabouch, N. Deep learning: Systematic review, models, challenges, and research directions. In Neural Computing and Applications ; Springer: Berlin/Heidelberg, Germany, 2023; pp. 1–22. [ Google Scholar ]
  • Ben Amor, S.; Belaid, F.; Benkraiem, R.; Ramdani, B.; Guesmi, K. Multi-criteria classification, sorting, and clustering: A bibliometric review and research agenda. Ann. Oper. Res. 2023 , 325 , 771–793. [ Google Scholar ] [ CrossRef ]
  • Valdez, F.; Melin, P. A review on quantum computing and deep learning algorithms and their applications. Soft Comput. 2023 , 27 , 13217–13236. [ Google Scholar ] [ CrossRef ]
  • Fihri, W.F.; Arjoune, Y.; Hassan El Ghazi, H.; Kaabouch, N.; Abou El Majd, A.B. A particle swarm optimization based algorithm for primary user emulation attack detection. In Proceedings of the 2018 IEEE 8th Annual Computing and Communication Workshop and Conference (CCWC), Las Vegas, NV, USA, 8–10 January 2018; pp. 823–827. [ Google Scholar ]
Table: Supervised learning classification categories (with characteristics, advantages, and disadvantages): Bayesian-based, tree-based, instance-based, regularization-based, neural network-based, and ensemble-based.
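To ground these categories, the sketch below pairs each with one representative scikit-learn estimator; the pairings are our own illustrative choices (e.g., L2-regularized logistic regression standing in for the regularization-based family), not prescriptions from the paper.

```python
# One representative estimator per supervised category, fit on iris.
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB              # Bayesian-based
from sklearn.tree import DecisionTreeClassifier         # tree-based
from sklearn.neighbors import KNeighborsClassifier      # instance-based
from sklearn.linear_model import LogisticRegression     # regularization-based (L2)
from sklearn.neural_network import MLPClassifier        # neural network-based
from sklearn.ensemble import RandomForestClassifier     # ensemble-based

X, y = load_iris(return_X_y=True)
for clf in (GaussianNB(), DecisionTreeClassifier(), KNeighborsClassifier(),
            LogisticRegression(max_iter=1000), MLPClassifier(max_iter=1000),
            RandomForestClassifier()):
    print(type(clf).__name__, clf.fit(X, y).score(X, y))  # training accuracy
```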
Table: Semi-supervised learning classification categories.

  • Inductive-based: generates a model that can make predictions for any sample in the input space; the same model can be used for training and for predicting new data samples, and predictions for new samples are independent of old samples.
  • Transductive-based: predictive strength is limited to objects processed during the training steps; there is no difference between the training and testing steps, and no distinction among transductive algorithms in a supervised manner.
Table: Unsupervised learning classification categories.

  • Cluster-based: divides uncategorized data into similar groups.
  • Dimensionality reduction-based: decreases the number of features in the given dataset.
  • Neural network-based: inspired by the human brain.
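As a small illustration of the first two categories (our own example, not the paper's), k-means groups unlabeled samples into clusters and PCA reduces the feature count:

```python
# Clustering plus dimensionality reduction on unlabeled iris measurements.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)                       # labels ignored
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
X_2d = PCA(n_components=2).fit_transform(X)             # 4 features -> 2
print(clusters[:10], X_2d.shape)
```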
Table: Reinforcement learning classification categories.

  • Model-based: optimal actions are learned via a model of the environment.
  • Model-free: no transition probability distribution or reward function of the Markov decision process is modeled.
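To make the model-free row concrete, here is a tabular Q-learning sketch on an invented five-state corridor task; the environment, rewards, and hyperparameters are all illustrative assumptions, not from the paper.

```python
# Model-free Q-learning: the agent bootstraps from sampled transitions only
# and never estimates transition probabilities or a reward model.
import numpy as np

n_states, n_actions = 5, 2                  # corridor 0..4; actions: 0=left, 1=right
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.5, 0.9, 0.1
rng = np.random.RandomState(0)

for _ in range(200):                        # episodes
    s = 0
    while s != n_states - 1:
        explore = rng.rand() < eps or Q[s, 0] == Q[s, 1]  # break ties randomly
        a = rng.randint(n_actions) if explore else int(Q[s].argmax())
        s_next = max(0, s - 1) if a == 0 else s + 1
        r = 1.0 if s_next == n_states - 1 else 0.0        # reward at right end
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next
print(Q.round(2))                           # right-moving actions come to dominate
```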
Data Preprocessing Steps | Methodology | Techniques | Highlights
Data transformation | Standardization and normalization | Unit vector normalization; max-abs scaler; quantile transformer scaler; robust scaler; min-max scaling; power transformer scaler; standard scaler | Extract the given data and convert them to a usable format
Data cleaning | Missing value imputation | Complete case analysis; frequent category imputation; mean/median imputation; mode imputation; end-of-tail imputation; nearest neighbor imputation; iterative imputation; hot and cold deck imputation; exploration imputation; interpolation imputation; regression-based imputation | Loss of efficiency, strong bias, and complications in handling data
Data cleaning | Noise treatment | Data polishing; noise filters |
Data reduction/increasing | Feature selection | Wrapper; filter; embedded | Decrease or increase the number of samples or features that are not important in the process of training
Data reduction/increasing | Feature extraction | Principal component analysis; linear discriminant analysis; independent component analysis; partial least squares; multifactor dimensionality reduction; nonlinear dimensionality reduction; autoencoder; tensor decomposition |
Data reduction/increasing | Instance generation | Condensation algorithms; edition algorithms; hybrid algorithms |
Discretization | Discretization-based | Chi-squared discretization; efficient discretization | Loss of information, simplicity, readability, and faster learning process
Imbalanced learning | Under-sampling | Random under-sampling; Tomek links; condensed nearest neighbor; edited nearest neighbor; near-miss under-sampling | Presents true evaluation results
Imbalanced learning | Oversampling | Random oversampling; synthetic minority oversampling technique (SMOTE); adaptive synthetic sampling (ADASYN); borderline-SMOTE |
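To make the imbalanced-learning row above concrete, the following minimal sketch, assuming the third-party imbalanced-learn (imblearn) package is installed, applies SMOTE to an artificially imbalanced dataset; the dataset and class weights are illustrative, not taken from the article.

```python
# A minimal SMOTE sketch, assuming imbalanced-learn is installed
# (pip install imbalanced-learn). All values are illustrative.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Build a deliberately imbalanced binary dataset (~10% minority class).
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print("before:", Counter(y))

# SMOTE synthesizes new minority-class samples by interpolating between
# a minority sample and its nearest minority-class neighbors.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("after: ", Counter(y_res))
```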
Hyperparameter optimization methods:
  • Grid search
  • Random search
  • Genetic algorithm
  • Gradient-based techniques
  • Bayesian optimization with a Gaussian process
  • Particle swarm optimization
  • Bayesian optimization with a tree-structured Parzen estimator
  • Hyperband
  • Bayesian optimization with SMAC
  • Population-based methods
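As an illustration of the first two methods in this list, the following minimal sketch, assuming scikit-learn and SciPy are available, tunes an SVM with both grid search and random search; the parameter ranges and dataset are illustrative choices, not recommendations from the article.

```python
# A minimal sketch contrasting grid search and random search for
# hyperparameter tuning with scikit-learn (illustrative values only).
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Grid search exhaustively evaluates every combination on the grid.
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}, cv=5)
grid.fit(X, y)

# Random search samples a fixed budget of candidates from distributions,
# often finding good settings at a fraction of the cost of a full grid.
rand = RandomizedSearchCV(
    SVC(), {"C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-3, 1e1)},
    n_iter=20, cv=5, random_state=0,
)
rand.fit(X, y)
print(grid.best_params_, rand.best_params_)
```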
Table: evaluation metric names by learning category — supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning.
Challenges:
  • Interpretability and explainability
  • Bias and fairness
  • Adversarial robustness
  • Privacy and security
  • Reinforcement learning
  • Quantum computing
  • Multi-criteria models

Share and Cite

Talaei Khoei, T.; Kaabouch, N. Machine Learning: Models, Challenges, and Research Directions. Future Internet 2023, 15, 332. https://doi.org/10.3390/fi15100332


Machine Learning: Algorithms, Real-World Applications and Research Directions

Iqbal H. Sarker

1 Swinburne University of Technology, Melbourne, VIC 3122, Australia

2 Department of Computer Science and Engineering, Chittagong University of Engineering & Technology, Chattogram 4349, Bangladesh

Abstract

In the current age of the Fourth Industrial Revolution (4IR, or Industry 4.0), the digital world holds a wealth of data, such as Internet of Things (IoT) data, cybersecurity data, mobile data, business data, social media data, and health data. To analyze these data intelligently and to develop the corresponding smart and automated applications, knowledge of artificial intelligence (AI), and particularly machine learning (ML), is the key. Various types of machine learning algorithms exist in the area, such as supervised, unsupervised, semi-supervised, and reinforcement learning. In addition, deep learning, which is part of a broader family of machine learning methods, can analyze data intelligently and at scale. In this paper, we present a comprehensive view of these machine learning algorithms that can be applied to enhance the intelligence and capabilities of an application. The key contribution of this study is thus an explanation of the principles of different machine learning techniques and their applicability in various real-world application domains, such as cybersecurity systems, smart cities, healthcare, e-commerce, agriculture, and many more. We also highlight challenges and potential research directions based on our study. Overall, this paper aims to serve as a reference point for academia, industry professionals, and decision-makers in various real-world situations and application areas, particularly from a technical point of view.

Introduction

We live in the age of data, where everything around us is connected to a data source and everything in our lives is digitally recorded [ 21, 103 ]. For instance, the current electronic world holds a wealth of various kinds of data, such as Internet of Things (IoT) data, cybersecurity data, smart city data, business data, smartphone data, social media data, health data, COVID-19 data, and many more. These data can be structured, semi-structured, or unstructured, as discussed briefly in Sect. “ Types of Real-World Data and Machine Learning Techniques ”, and they are increasing day by day. Extracting insights from these data can be used to build various intelligent applications in the relevant domains. For instance, the relevant cybersecurity data can be used to build a data-driven automated and intelligent cybersecurity system [ 105 ], and the relevant mobile data can be used to build personalized, context-aware smart mobile applications [ 103 ]. Thus, data management tools and techniques that can extract insights or useful knowledge from data in a timely and intelligent way, on which real-world applications are based, are urgently needed.

Artificial intelligence (AI), and particularly machine learning (ML), have grown rapidly in recent years in the context of data analysis and computing, typically allowing applications to function in an intelligent manner [ 95 ]. ML usually provides systems with the ability to learn and improve from experience automatically, without being explicitly programmed, and is generally regarded as one of the most popular technologies of the fourth industrial revolution (4IR, or Industry 4.0) [ 103, 105 ]. “Industry 4.0” [ 114 ] refers to the ongoing automation of conventional manufacturing and industrial practices, including exploratory data processing, using new smart technologies such as machine learning automation. Thus, to analyze these data intelligently and to develop the corresponding real-world applications, machine learning algorithms are the key. The learning algorithms can be categorized into four major types: supervised, unsupervised, semi-supervised, and reinforcement learning [ 75 ], discussed briefly in Sect. “ Types of Real-World Data and Machine Learning Techniques ”. The popularity of these learning approaches is increasing day by day, as shown in Fig. 1, based on data collected from Google Trends [ 4 ] over the last five years. The x-axis of the figure indicates the specific dates, and the corresponding popularity score, within the range of 0 (minimum) to 100 (maximum), is shown on the y-axis. According to Fig. 1, the popularity indication values for these learning types were low in 2015 and have been increasing since. These statistics motivate our study of machine learning in this paper, which can play an important role in the real world through Industry 4.0 automation.

Fig. 1: The worldwide popularity score of various types of ML algorithms (supervised, unsupervised, semi-supervised, and reinforcement) in a range of 0 (min) to 100 (max) over time, where the x-axis represents the timestamp and the y-axis the corresponding score.

In general, the effectiveness and efficiency of a machine learning solution depend on the nature and characteristics of the data and on the performance of the learning algorithms. In the area of machine learning algorithms, classification analysis, regression, data clustering, feature engineering and dimensionality reduction, association rule learning, and reinforcement learning techniques exist to effectively build data-driven systems [ 41, 125 ]. Besides, deep learning, which originated from the artificial neural network and is part of a wider family of machine learning approaches, can be used to analyze data intelligently [ 96 ]. Thus, selecting a learning algorithm that is suitable for the target application in a particular domain is challenging. The reason is that different learning algorithms serve different purposes, and even the outcomes of learning algorithms in the same category may vary depending on the data characteristics [ 106 ]. It is therefore important to understand the principles of the various machine learning algorithms and their applicability in various real-world application areas, such as IoT systems, cybersecurity services, business and recommendation systems, smart cities, healthcare and COVID-19, context-aware systems, sustainable agriculture, and many more, which are explained briefly in Sect. “ Applications of Machine Learning ”.

Based on the importance and potential of “Machine Learning” to analyze the data mentioned above, in this paper we provide a comprehensive view of the various types of machine learning algorithms that can be applied to enhance the intelligence and capabilities of an application. The key contribution of this study is thus an explanation of the principles and potential of different machine learning techniques and their applicability in the various real-world application areas mentioned earlier. The purpose of this paper is, therefore, to provide a basic guide for those in academia and industry who want to study, research, and develop data-driven automated and intelligent systems in the relevant areas based on machine learning techniques.

The key contributions of this paper are listed as follows:

  • To define the scope of our study by taking into account the nature and characteristics of various types of real-world data and the capabilities of various learning techniques.
  • To provide a comprehensive view on machine learning algorithms that can be applied to enhance the intelligence and capabilities of a data-driven application.
  • To discuss the applicability of machine learning-based solutions in various real-world application domains.
  • To highlight and summarize the potential research directions within the scope of our study for intelligent data analysis and services.

The rest of the paper is organized as follows. The next section presents the types of data and machine learning algorithms in a broader sense and defines the scope of our study. We briefly discuss and explain different machine learning algorithms in the subsequent section followed by which various real-world application areas based on machine learning algorithms are discussed and summarized. In the penultimate section, we highlight several research issues and potential future directions, and the final section concludes this paper.

Types of Real-World Data and Machine Learning Techniques

Machine learning algorithms typically consume and process data to learn the related patterns about individuals, business processes, transactions, events, and so on. In the following, we discuss various types of real-world data as well as categories of machine learning algorithms.

Types of Real-World Data

Usually, the availability of data is considered the key to constructing a machine learning model or data-driven real-world system [ 103, 105 ]. Data can be of various forms, such as structured, semi-structured, or unstructured [ 41, 72 ]. Besides, “metadata” is another type, which typically represents data about the data. In the following, we briefly discuss these types of data.

  • Structured: Structured data have a well-defined structure, conform to a data model following a standard order, are highly organized, and are easily accessed and used by an entity or a computer program. Structured data are typically stored in well-defined schemes such as relational databases, i.e., in a tabular format. For instance, names, dates, addresses, credit card numbers, stock information, and geolocation are examples of structured data.
  • Unstructured: Unstructured data, on the other hand, have no pre-defined format or organization, making them much more difficult to capture, process, and analyze; they mostly contain text and multimedia material. For example, sensor data, emails, blog entries, wikis, word processing documents, PDF files, audio files, videos, images, presentations, web pages, and many other types of business documents can be considered unstructured data.
  • Semi-structured: Semi-structured data are not stored in a relational database like the structured data mentioned above, but they do have certain organizational properties that make them easier to analyze. HTML, XML, and JSON documents, NoSQL databases, etc., are some examples of semi-structured data.
  • Metadata: Metadata are not a normal form of data, but “data about data”. The primary difference between “data” and “metadata” is that data are simply the material that can classify, measure, or document something relative to an organization’s data properties, whereas metadata describes the relevant data information, giving it more significance for data users. A basic example of a document’s metadata might be the author, file size, date generated, and the keywords that define the document.

In the area of machine learning and data science, researchers use various widely used datasets for different purposes. These include, for example, cybersecurity datasets such as NSL-KDD [ 119 ], UNSW-NB15 [ 76 ], ISCX’12 [ 1 ], CIC-DDoS2019 [ 2 ], and Bot-IoT [ 59 ]; smartphone datasets such as phone call logs [ 84, 101 ], SMS logs [ 29 ], mobile application usage logs [ 117, 137 ], and mobile phone notification logs [ 73 ]; IoT data [ 16, 57, 62 ]; agriculture and e-commerce data [ 120, 138 ]; and health data such as heart disease [ 92 ], diabetes mellitus [ 83, 134 ], and COVID-19 [ 43, 74 ], among many more in various application domains. The data can be of the different types discussed above, which may vary from application to application in the real world. To analyze such data in a particular problem domain and to extract insights or useful knowledge from the data for building real-world intelligent applications, different types of machine learning techniques can be used according to their learning capabilities, as discussed in the following.

Types of Machine Learning Techniques

Machine learning algorithms are mainly divided into four categories: supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning [ 75 ], as shown in Fig. 2. In the following, we briefly discuss each type of learning technique along with the scope of its applicability to solving real-world problems.

Fig. 2: Various types of machine learning techniques.

  • Supervised: Supervised learning is typically the task of learning a function that maps an input to an output based on sample input-output pairs [ 41 ]. It infers the function from labeled training data, i.e., a collection of training examples. Supervised learning is carried out when certain goals are to be accomplished from a certain set of inputs [ 105 ], i.e., a task-driven approach. The most common supervised tasks are “classification”, which separates the data, and “regression”, which fits the data. For instance, predicting the class label or sentiment of a piece of text, like a tweet or a product review, i.e., text classification, is an example of supervised learning.
  • Unsupervised: Unsupervised learning analyzes unlabeled datasets without the need for human interference, i.e., a data-driven process [ 41 ]. It is widely used for extracting generative features, identifying meaningful trends and structures, grouping results, and exploratory purposes. The most common unsupervised learning tasks are clustering, density estimation, feature learning, dimensionality reduction, finding association rules, and anomaly detection.
  • Semi-supervised: Semi-supervised learning can be defined as a hybridization of the supervised and unsupervised methods mentioned above, as it operates on both labeled and unlabeled data [ 41, 105 ]. Thus, it falls between learning “without supervision” and learning “with supervision”. In the real world, labeled data can be scarce in several contexts while unlabeled data are plentiful, which is where semi-supervised learning is useful [ 75 ]. The ultimate goal of a semi-supervised learning model is to provide a better prediction outcome than could be produced using the labeled data alone. Application areas where semi-supervised learning is used include machine translation, fraud detection, data labeling, and text classification.
  • Reinforcement: Reinforcement learning is a type of machine learning that enables software agents and machines to automatically evaluate the optimal behavior in a particular context or environment in order to improve their efficiency [ 52 ], i.e., an environment-driven approach. This type of learning is based on reward or penalty, and its ultimate goal is to use the insights obtained from interaction with the environment to take actions that increase the reward or minimize the risk [ 75 ]. It is a powerful tool for training AI models that can help increase automation or optimize the operational efficiency of sophisticated systems such as robotics, autonomous driving, manufacturing, and supply chain logistics; however, it is not preferable for solving basic or straightforward problems.

Thus, to build effective models in various application areas, different types of machine learning techniques can play a significant role according to their learning capabilities, depending on the nature of the data discussed earlier and the target outcome. In Table 1, we summarize the various types of machine learning techniques with examples. In the following, we provide a comprehensive view of machine learning algorithms that can be applied to enhance the intelligence and capabilities of a data-driven application.

Table 1: Various types of machine learning techniques with examples

Learning type | Model building | Examples
Supervised | Algorithms or models learn from labeled data (task-driven approach) | Classification, regression
Unsupervised | Algorithms or models learn from unlabeled data (data-driven approach) | Clustering, associations, dimensionality reduction
Semi-supervised | Models are built using combined data (labeled + unlabeled) | Classification, clustering
Reinforcement | Models are based on reward or penalty (environment-driven approach) | Classification, control

Machine Learning Tasks and Algorithms

In this section, we discuss various machine learning algorithms, including classification analysis, regression analysis, data clustering, association rule learning, feature engineering for dimensionality reduction, and deep learning methods. A general structure of a machine learning-based predictive model is shown in Fig. 3, where the model is trained from historical data in phase 1 and the outcome is generated in phase 2 for new test data.

Fig. 3: A general structure of a machine learning-based predictive model, considering both the training and testing phases.

Classification Analysis

Classification is regarded as a supervised learning method in machine learning; it refers to a predictive modeling problem where a class label is predicted for a given example [ 41 ]. Mathematically, it maps a function (f) from input variables (X) to output variables (Y), which serve as targets, labels, or categories. Classification can be carried out on structured or unstructured data to predict the class of given data points. For example, spam detection, i.e., “spam” and “not spam” in email service providers, is a classification problem. In the following, we summarize the common classification problems.

  • Binary classification: It refers to the classification tasks having two class labels such as “true and false” or “yes and no” [ 41 ]. In such binary classification tasks, one class could be the normal state, while the abnormal state could be another class. For instance, “cancer not detected” is the normal state of a task that involves a medical test, and “cancer detected” could be considered as the abnormal state. Similarly, “spam” and “not spam” in the above example of email service providers are considered as binary classification.
  • Multiclass classification: Traditionally, this refers to those classification tasks having more than two class labels [ 41 ]. The multiclass classification does not have the principle of normal and abnormal outcomes, unlike binary classification tasks. Instead, within a range of specified classes, examples are classified as belonging to one. For example, it can be a multiclass classification task to classify various types of network attacks in the NSL-KDD [ 119 ] dataset, where the attack categories are classified into four class labels, such as DoS (Denial of Service Attack), U2R (User to Root Attack), R2L (Root to Local Attack), and Probing Attack.
  • Multi-label classification: In machine learning, multi-label classification is an important consideration where an example is associated with several classes or labels. Thus, it is a generalization of multiclass classification, where the classes involved in the problem are hierarchically structured, and each example may simultaneously belong to more than one class at each hierarchical level, e.g., multi-level text classification. For instance, a Google News article can be presented under the categories of a “city name”, “technology”, or “latest news”, etc. Multi-label classification requires advanced machine learning algorithms that support predicting multiple mutually non-exclusive classes or labels, unlike traditional classification tasks where class labels are mutually exclusive [ 82 ].

Many classification algorithms have been proposed in the machine learning and data science literature [ 41, 125 ]. In the following, we summarize the most common and popular methods that are used widely in various application areas; a short code sketch follows the list.

  • Naive Bayes (NB): The naive Bayes algorithm is based on Bayes’ theorem with the assumption of independence between each pair of features [ 51 ]. It works well and can be used for both binary and multi-class categories in many real-world situations, such as document or text classification and spam filtering. The NB classifier can be used to effectively classify noisy instances in the data and to construct a robust prediction model [ 94 ]. The key benefit is that, compared to more sophisticated approaches, it needs only a small amount of training data to estimate the necessary parameters quickly [ 82 ]. However, its performance may suffer due to its strong assumption of feature independence. Gaussian, Multinomial, Complement, Bernoulli, and Categorical are the common variants of the NB classifier [ 82 ].
  • Linear Discriminant Analysis (LDA): Linear discriminant analysis (LDA) is a linear decision boundary classifier created by fitting class-conditional densities to data and applying Bayes’ rule [ 51, 82 ]. The method is also known as a generalization of Fisher’s linear discriminant, which projects a given dataset into a lower-dimensional space, i.e., a dimensionality reduction that minimizes the complexity of the model or reduces the resulting model’s computational cost. The standard LDA model fits each class with a Gaussian density, assuming that all classes share the same covariance matrix [ 82 ]. LDA is closely related to ANOVA (analysis of variance) and regression analysis, which likewise seek to express one dependent variable as a linear combination of other features or measurements.
  • Logistic regression (LR): Another common probabilistic statistical model used to solve classification problems in machine learning is logistic regression (LR) [ 64 ]. Logistic regression typically uses a logistic function to estimate the probabilities, also referred to as the sigmoid function, defined mathematically in Eq. (1): $g(z) = \frac{1}{1 + \exp(-z)}$. (1) Logistic regression works well when the dataset can be separated linearly, but it may overfit high-dimensional datasets; regularization (L1 and L2) techniques [ 82 ] can be used to avoid over-fitting in such scenarios. The assumption of linearity between the dependent and independent variables is considered a major drawback of logistic regression. It can be used for both classification and regression problems, but it is more commonly used for classification.
  • K-nearest neighbors (KNN): K-Nearest Neighbors (KNN) [ 9 ] is an “instance-based learning” or non-generalizing learning, also known as a “lazy learning” algorithm. It does not focus on constructing a general internal model; instead, it stores all instances corresponding to training data in n -dimensional space. KNN uses data and classifies new data points based on similarity measures (e.g., Euclidean distance function) [ 82 ]. Classification is computed from a simple majority vote of the k nearest neighbors of each point. It is quite robust to noisy training data, and accuracy depends on the data quality. The biggest issue with KNN is to choose the optimal number of neighbors to be considered. KNN can be used both for classification as well as regression.
  • Support vector machine (SVM): In machine learning, another common technique that can be used for classification, regression, or other tasks is a support vector machine (SVM) [ 56 ]. In high- or infinite-dimensional space, a support vector machine constructs a hyper-plane or set of hyper-planes. Intuitively, the hyper-plane, which has the greatest distance from the nearest training data points in any class, achieves a strong separation since, in general, the greater the margin, the lower the classifier’s generalization error. It is effective in high-dimensional spaces and can behave differently based on different mathematical functions known as the kernel. Linear, polynomial, radial basis function (RBF), sigmoid, etc., are the popular kernel functions used in SVM classifier [ 82 ]. However, when the data set contains more noise, such as overlapping target classes, SVM does not perform well.

Fig. 4: An example of a decision tree structure.

Fig. 5: An example of a random forest structure considering multiple decision trees.

  • Adaptive Boosting (AdaBoost): Adaptive Boosting (AdaBoost) is an ensemble learning process that employs an iterative approach to improve poor classifiers by learning from their errors. It was developed by Freund et al. [ 35 ] and is also known as “meta-learning”. Unlike the random forest, which uses parallel ensembling, AdaBoost uses “sequential ensembling”. It creates a powerful classifier of high accuracy by combining many poorly performing classifiers. In that sense, AdaBoost is called an adaptive classifier, as it significantly improves the efficiency of the classifier, but in some instances it can trigger overfitting. AdaBoost is best used to boost the performance of decision trees, its base estimator [ 82 ], on binary classification problems; however, it is sensitive to noisy data and outliers.
  • Extreme gradient boosting (XGBoost): Gradient Boosting, like Random Forests [ 19 ] above, is an ensemble learning algorithm that generates a final model based on a series of individual models, typically decision trees. The gradient is used to minimize the loss function, similar to how neural networks [ 41 ] use gradient descent to optimize weights. Extreme Gradient Boosting (XGBoost) is a form of gradient boosting that takes more detailed approximations into account when determining the best model [ 82 ]. It computes second-order gradients of the loss function to minimize loss and advanced regularization (L1 and L2) [ 82 ], which reduces over-fitting, and improves model generalization and performance. XGBoost is fast to interpret and can handle large-sized datasets well.
  • Stochastic gradient descent (SGD): Stochastic gradient descent (SGD) [ 41 ] is an iterative method for optimizing an objective function with suitable smoothness properties, where the word ‘stochastic’ refers to random probability. It reduces the computational burden, particularly in high-dimensional optimization problems, allowing faster iterations in exchange for a lower convergence rate. A gradient is the slope of a function, measuring the degree of change of a variable in response to changes in another variable. Gradient descent minimizes a convex objective by following its partial derivatives with respect to the input parameters. Let $\alpha$ be the learning rate and $J_i$ the cost of the $i$th training example; then Eq. (4) represents the stochastic gradient descent weight update at the $j$th iteration: $w_j := w_j - \alpha \frac{\partial J_i}{\partial w_j}$. (4) In large-scale and sparse machine learning, SGD has been successfully applied to problems often encountered in text classification and natural language processing [ 82 ]. However, SGD is sensitive to feature scaling and requires a range of hyperparameters, such as the regularization parameter and the number of iterations.
  • Rule-based classification: The term rule-based classification can be used to refer to any classification scheme that makes use of IF-THEN rules for class prediction. Several classification algorithms with rule-generation ability exist, such as Zero-R [ 125 ], One-R [ 47 ], decision trees [ 87, 88 ], DTNB [ 110 ], Ripple Down Rule learner (RIDOR) [ 125 ], and Repeated Incremental Pruning to Produce Error Reduction (RIPPER) [ 126 ]. The decision tree is one of the most common rule-based classification algorithms among these techniques because it has several advantages, such as being easier to interpret, the ability to handle high-dimensional data, simplicity and speed, good accuracy, and the capability to produce rules that are clear and understandable to humans [ 127, 128 ]. Decision tree-based rules also provide significant accuracy in a prediction model for unseen test cases [ 106 ]. Since the rules are easily interpretable, these rule-based classifiers are often used to produce descriptive models that can describe a system, including the entities and their relationships.
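To ground the methods above, the following minimal sketch, assuming scikit-learn is installed, trains several of the classifiers discussed (naive Bayes, logistic regression, KNN, SVM, and a random forest ensemble) on a built-in binary dataset and compares test accuracy; all parameter values are illustrative defaults, not recommendations from the paper.

```python
# A minimal sketch comparing several classifiers with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)  # binary classification task
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale-sensitive models (LR, KNN, SVM) are wrapped with a scaler.
models = {
    "NB": GaussianNB(),
    "LR": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: test accuracy = {model.score(X_test, y_test):.3f}")
```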

Regression Analysis

Regression analysis includes several machine learning methods that allow us to predict a continuous (y) outcome variable based on the value of one or more (x) predictor variables [ 41 ]. The most significant distinction between classification and regression is that classification predicts distinct class labels, while regression facilitates the prediction of a continuous quantity. Figure 6 shows how classification differs from regression models. Some overlaps are often found between the two types of machine learning algorithms. Regression models are now widely used in a variety of fields, including financial forecasting, cost estimation, trend analysis, marketing, time series estimation, drug response modeling, and many more. Some of the familiar types of regression algorithms are linear, polynomial, lasso, and ridge regression, which are explained briefly in the following; a short code sketch follows the list.

  • Simple and multiple linear regression: This is one of the most popular ML modeling techniques as well as a well-known regression technique. In this technique, the dependent variable is continuous, the independent variable(s) can be continuous or discrete, and the form of the regression line is linear. Linear regression creates a relationship between the dependent variable (Y) and one or more independent variables (X), known as the regression line, using the best-fit straight line [ 41 ]. It is defined by the following equations: $y = a + bx + e$, (5) $y = a + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n + e$, (6) where $a$ is the intercept, $b$ is the slope of the line, and $e$ is the error term. These equations can be used to predict the value of the target variable based on the given predictor variable(s). Simple linear regression, defined in Eq. (5), has only one independent variable; multiple linear regression, defined in Eq. (6), extends it to allow two or more predictor variables to model a response variable $y$ as a linear function [ 41 ].
  • Polynomial regression: Polynomial regression is a form of regression analysis in which the relationship between the independent variable $x$ and the dependent variable $y$ is not linear but polynomial of degree $n$ in $x$ [ 82 ]. The equation for polynomial regression is derived from the linear regression equation (polynomial regression of degree 1) and is defined as: $y = b_0 + b_1 x + b_2 x^2 + b_3 x^3 + \cdots + b_n x^n + e$, (7) where $y$ is the predicted/target output, $b_0, b_1, \ldots, b_n$ are the regression coefficients, and $x$ is the independent input variable. In simple words, if the data are not distributed linearly but follow an $n$th-degree polynomial, polynomial regression is used to obtain the desired output.
  • LASSO and ridge regression: LASSO and ridge regression are well known as powerful techniques typically used for building learning models in the presence of a large number of features, owing to their capability to prevent over-fitting and reduce the complexity of the model. The LASSO (least absolute shrinkage and selection operator) regression model uses the $L_1$ regularization technique [ 82 ], i.e., shrinkage, which penalizes the absolute value of the magnitude of the coefficients ($L_1$ penalty). As a result, LASSO can shrink coefficients all the way to zero. Thus, LASSO regression aims to find the subset of predictors that minimizes the prediction error for a quantitative response variable. Ridge regression, on the other hand, uses $L_2$ regularization [ 82 ], which penalizes the squared magnitude of the coefficients ($L_2$ penalty). Thus, ridge regression forces the weights to be small but never sets a coefficient to exactly zero, yielding a non-sparse solution. Overall, LASSO regression is useful for obtaining a subset of predictors by eliminating less important features, while ridge regression is useful when a dataset exhibits “multicollinearity”, i.e., predictors that are correlated with other predictors.
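The following minimal sketch, assuming scikit-learn and NumPy, fits linear, polynomial, LASSO, and ridge regression models to a synthetic quadratic dataset; the data, degrees, and penalty strengths are illustrative choices only.

```python
# A minimal sketch comparing the regression variants discussed above.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
# Quadratic signal plus noise, so a plain line underfits.
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + rng.normal(scale=0.3, size=200)

models = {
    "linear": LinearRegression(),
    "polynomial (deg 2)": make_pipeline(PolynomialFeatures(2), LinearRegression()),
    "lasso (L1)": make_pipeline(PolynomialFeatures(2), Lasso(alpha=0.1)),
    "ridge (L2)": make_pipeline(PolynomialFeatures(2), Ridge(alpha=1.0)),
}
for name, model in models.items():
    model.fit(X, y)
    print(f"{name}: R^2 = {model.score(X, y):.3f}")
```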

Fig. 6: Classification vs. regression. In classification, the dotted line represents a linear boundary that separates the two classes; in regression, the dotted line models the linear relationship between the two variables.

Cluster Analysis

Cluster analysis, also known as clustering, is an unsupervised machine learning technique for identifying and grouping related data points in large datasets without concern for a specific outcome. It groups a collection of objects in such a way that objects in the same category, called a cluster, are in some sense more similar to each other than objects in other groups [ 41 ]. It is often used as a data analysis technique to discover interesting trends or patterns in data, e.g., groups of consumers based on their behavior. Clustering can be used in a broad range of application areas, such as cybersecurity, e-commerce, mobile data processing, health analytics, and user modeling and behavioral analytics. In the following, we briefly discuss and summarize various types of clustering methods.

  • Partitioning methods: Based on the features and similarities in the data, this clustering approach categorizes the data into multiple groups or clusters. Data scientists or analysts typically determine the number of clusters to produce either dynamically or statically, depending on the nature of the target application. The most common clustering algorithms based on partitioning methods are K-means [ 69 ], K-medoids [ 80 ], CLARA [ 55 ], etc.
  • Density-based methods: To identify distinct groups or clusters, it uses the concept that a cluster in the data space is a contiguous region of high point density isolated from other such clusters by contiguous regions of low point density. Points that are not part of a cluster are considered as noise. The typical clustering algorithms based on density are DBSCAN [ 32 ], OPTICS [ 12 ] etc. The density-based methods typically struggle with clusters of similar density and high dimensionality data.

Fig. 7: A graphical interpretation of the widely used hierarchical clustering (bottom-up and top-down) technique.

  • Grid-based methods: To deal with massive datasets, grid-based clustering is especially suitable. To obtain clusters, the principle is first to summarize the dataset with a grid representation and then to combine grid cells. STING [ 122 ], CLIQUE [ 6 ], etc. are the standard algorithms of grid-based clustering.
  • Model-based methods: There are mainly two types of model-based clustering algorithms: one that uses statistical learning, and the other based on a method of neural network learning [ 130 ]. For instance, GMM [ 89 ] is an example of a statistical learning method, and SOM [ 22 ] [ 96 ] is an example of a neural network learning method.
  • Constraint-based methods: Constrained-based clustering is a semi-supervised approach to data clustering that uses constraints to incorporate domain knowledge. Application or user-oriented constraints are incorporated to perform the clustering. The typical algorithms of this kind of clustering are COP K-means [ 121 ], CMWK-Means [ 27 ], etc.

Many clustering algorithms with the ability to group data have been proposed in the machine learning and data science literature [ 41, 125 ]. In the following, we summarize the popular methods that are used widely in various application areas; a short code sketch follows the list.

  • K-means clustering: K-means clustering [ 69 ] is a fast, robust, and simple algorithm that provides reliable results when data sets are well-separated from each other. The data points are allocated to a cluster in this algorithm in such a way that the amount of the squared distance between the data points and the centroid is as small as possible. In other words, the K-means algorithm identifies the k number of centroids and then assigns each data point to the nearest cluster while keeping the centroids as small as possible. Since it begins with a random selection of cluster centers, the results can be inconsistent. Since extreme values can easily affect a mean, the K-means clustering algorithm is sensitive to outliers. K-medoids clustering [ 91 ] is a variant of K-means that is more robust to noises and outliers.
  • Mean-shift clustering: Mean-shift clustering [ 37 ] is a nonparametric clustering technique that does not require prior knowledge of the number of clusters or constraints on cluster shape. Mean-shift clustering aims to discover “blobs” in a smooth distribution or density of samples [ 82 ]. It is a centroid-based algorithm that works by updating centroid candidates to be the mean of the points in a given region. To form the final set of centroids, these candidates are filtered in a post-processing stage to remove near-duplicates. Cluster analysis in computer vision and image processing are examples of application domains. Mean Shift has the disadvantage of being computationally expensive. Moreover, in cases of high dimension, where the number of clusters shifts abruptly, the mean-shift algorithm does not work well.
  • DBSCAN: Density-based spatial clustering of applications with noise (DBSCAN) [ 32 ] is a base algorithm for density-based clustering that is widely used in data mining and machine learning. It is a non-parametric density-based clustering technique for separating high-density clusters from low-density clusters. DBSCAN’s main idea is that a point belongs to a cluster if it is close to many points from that cluster. It can find clusters of various shapes and sizes in large volumes of data that are noisy and contain outliers. Unlike k-means, DBSCAN does not require a priori specification of the number of clusters in the data and can find arbitrarily shaped clusters. Although k-means is much faster than DBSCAN, DBSCAN is efficient at finding high-density regions and identifying outliers, i.e., it is robust to outliers.
  • GMM clustering: Gaussian mixture models (GMMs) are often used for data clustering, which is a distribution-based clustering algorithm. A Gaussian mixture model is a probabilistic model in which all the data points are produced by a mixture of a finite number of Gaussian distributions with unknown parameters [ 82 ]. To find the Gaussian parameters for each cluster, an optimization algorithm called expectation-maximization (EM) [ 82 ] can be used. EM is an iterative method that uses a statistical model to estimate the parameters. In contrast to k-means, Gaussian mixture models account for uncertainty and return the likelihood that a data point belongs to one of the k clusters. GMM clustering is more robust than k-means and works well even with non-linear data distributions.
  • Agglomerative hierarchical clustering: The most common method of hierarchical clustering used to group objects in clusters based on their similarity is agglomerative clustering. This technique uses a bottom-up approach, where each object is first treated as a singleton cluster by the algorithm. Following that, pairs of clusters are merged one by one until all clusters have been merged into a single large cluster containing all objects. The result is a dendrogram, which is a tree-based representation of the elements. Single linkage [ 115 ], Complete linkage [ 116 ], BOTS [ 102 ] etc. are some examples of such techniques. The main advantage of agglomerative hierarchical clustering over k-means is that the tree-structure hierarchy generated by agglomerative clustering is more informative than the unstructured collection of flat clusters returned by k-means, which can help to make better decisions in the relevant application areas.
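The following minimal sketch, assuming scikit-learn, runs three of the methods above (k-means, DBSCAN, and GMM clustering) on a synthetic “two moons” dataset and scores each against the known grouping; the dataset and parameters are illustrative.

```python
# A minimal sketch comparing clustering algorithms on non-convex data.
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

# K-means and GMM assume compact/elliptical clusters; DBSCAN follows
# density, so it tends to recover the non-convex moon shapes.
labels = {
    "k-means": KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X),
    "DBSCAN": DBSCAN(eps=0.3).fit_predict(X),
    "GMM": GaussianMixture(n_components=2, random_state=0).fit_predict(X),
}
for name, pred in labels.items():
    print(f"{name}: adjusted Rand index = {adjusted_rand_score(y_true, pred):.3f}")
```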

Dimensionality Reduction and Feature Learning

In machine learning and data science, high-dimensional data processing is a challenging task for both researchers and application developers. Thus, dimensionality reduction, which is an unsupervised learning technique, is important because it leads to better human interpretation, lower computational cost, and the avoidance of overfitting and redundancy through simpler models. Both feature selection and feature extraction can be used for dimensionality reduction. The primary distinction between the two is that “feature selection” keeps a subset of the original features [ 97 ], while “feature extraction” creates brand-new ones [ 98 ]. In the following, we briefly discuss these techniques.

  • Feature selection: The selection of features, also known as the selection of variables or attributes in the data, is the process of choosing a subset of unique features (variables, predictors) to use in building a machine learning and data science model. It decreases a model’s complexity by eliminating irrelevant or less important features and allows for faster training of machine learning algorithms. A right and optimal subset of selected features in a problem domain can minimize the overfitting problem by simplifying and generalizing the model, and it also increases the model’s accuracy [ 97 ]. Thus, “feature selection” [ 66, 99 ] is considered one of the primary concepts in machine learning, greatly affecting the effectiveness and efficiency of the target machine learning model. The chi-squared test, analysis of variance (ANOVA) test, Pearson’s correlation coefficient, and recursive feature elimination are some popular techniques that can be used for feature selection.
  • Feature extraction: In a machine learning-based model or system, feature extraction techniques usually provide a better understanding of the data, a way to improve prediction accuracy, and reduced computational cost or training time. The aim of “feature extraction” [ 66, 99 ] is to reduce the number of features in a dataset by generating new ones from the existing features and then discarding the originals. The majority of the information in the original set of features can then be summarized by this new, reduced set. For instance, principal component analysis (PCA) is often used as a dimensionality reduction technique to extract a lower-dimensional space, creating brand-new components from the existing features in a dataset [ 98 ].

Many algorithms have been proposed to reduce data dimensions in the machine learning and data science literature [ 41, 125 ]. In the following, we summarize the popular methods that are used widely in various application areas; a short code sketch follows the list.

  • Variance threshold: A simple basic approach to feature selection is the variance threshold [ 82 ]. This excludes all features of low variance, i.e., all features whose variance does not exceed the threshold. It eliminates all zero-variance characteristics by default, i.e., characteristics that have the same value in all samples. This feature selection algorithm looks only at the ( X ) features, not the ( y ) outputs needed, and can, therefore, be used for unsupervised learning.
  • Pearson correlation: Pearson’s correlation is another method for understanding a feature’s relation to the response variable and can be used for feature selection [ 99 ]. This method is also used for finding the association between features in a dataset. The resulting value lies in $[-1, 1]$, where $-1$ means perfect negative correlation, $+1$ means perfect positive correlation, and $0$ means that the two variables have no linear correlation. If two random variables are represented by $X$ and $Y$, then the correlation coefficient between $X$ and $Y$ is defined as [ 41 ]: $r(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}}$. (8)
  • ANOVA: Analysis of variance (ANOVA) is a statistical tool used to verify the mean values of two or more groups that differ significantly from each other. ANOVA assumes a linear relationship between the variables and the target and the variables’ normal distribution. To statistically test the equality of means, the ANOVA method utilizes F tests. For feature selection, the results ‘ANOVA F value’ [ 82 ] of this test can be used where certain features independent of the goal variable can be omitted.
  • Chi square: The chi-square ($\chi^2$) statistic [ 82 ] estimates the difference between the observed and expected frequencies of a series of events or variables. The value of $\chi^2$ depends on the magnitude of the difference between the observed and expected values, the degrees of freedom, and the sample size. The chi-square test is commonly used for testing relationships between categorical variables. If $O_i$ represents an observed value and $E_i$ an expected value, then: $\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$. (9)
  • Recursive feature elimination (RFE): Recursive Feature Elimination (RFE) is a brute force approach to feature selection. RFE [ 82 ] fits the model and removes the weakest feature before it meets the specified number of features. Features are ranked by the coefficients or feature significance of the model. RFE aims to remove dependencies and collinearity in the model by recursively removing a small number of features per iteration.
  • Model-based selection: To reduce the dimensionality of the data, linear models penalized with $L_1$ regularization can be used. Least absolute shrinkage and selection operator (LASSO) regression is a type of linear regression that has the property of shrinking some of the coefficients to zero [ 82 ]; such features can then be removed from the model. Thus, the penalized LASSO regression method is often used in machine learning to select a subset of variables. The Extra Trees classifier [ 82 ] is an example of a tree-based estimator that can be used to compute impurity-based feature importance, which can then be used to discard irrelevant features.
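The following minimal sketch, assuming scikit-learn, applies several of the techniques above (variance threshold, chi-square selection, recursive feature elimination, and PCA) to a small built-in dataset; the thresholds and numbers of retained features are illustrative.

```python
# A minimal sketch of feature selection and extraction with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFE, SelectKBest, VarianceThreshold, chi2
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)  # 4 original features

# Variance threshold: drop features whose variance is below 0.2.
X_vt = VarianceThreshold(threshold=0.2).fit_transform(X)

# Chi-square: keep the 2 features most associated with the target.
X_chi2 = SelectKBest(chi2, k=2).fit_transform(X, y)

# Recursive feature elimination with a linear model ranking the features.
X_rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2).fit_transform(X, y)

# PCA: extract 2 new components instead of selecting original features.
X_pca = PCA(n_components=2).fit_transform(X)

print(X_vt.shape, X_chi2.shape, X_rfe.shape, X_pca.shape)
```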

Fig. 8: An example of principal component analysis (PCA) and the created principal components PC1 and PC2 in a different dimension space.

Association Rule Learning

Association rule learning is a rule-based machine learning approach for discovering interesting relationships, expressed as “IF-THEN” statements, between variables in large datasets [ 7 ]. One example is that “if a customer buys a computer or laptop (an item), s/he is likely to also buy anti-virus software (another item) at the same time”. Association rules are employed today in many application areas, including IoT services, medical diagnosis, usage behavior analytics, web usage mining, smartphone applications, cybersecurity applications, and bioinformatics. In comparison to sequence mining, association rule learning does not usually take into account the order of items within or across transactions. A common way of measuring the usefulness of association rules is to use the ‘support’ and ‘confidence’ parameters introduced in [ 7 ].

In the data mining literature, many association rule learning methods have been proposed, such as logic dependent [ 34 ], frequent pattern based [ 8 , 49 , 68 ], and tree-based [ 42 ]. The most popular association rule learning algorithms are summarized below.

  • AIS and SETM: AIS is the first algorithm proposed by Agrawal et al. [ 7 ] for association rule mining. The AIS algorithm’s main downside is that too many candidate itemsets are generated, requiring more space and wasting a lot of effort. This algorithm calls for too many passes over the entire dataset to produce the rules. Another approach SETM [ 49 ] exhibits good performance and stable behavior with execution time; however, it suffers from the same flaw as the AIS algorithm.
  • Apriori: For generating association rules for a given dataset, Agrawal et al. [ 8 ] proposed the Apriori, Apriori-TID, and Apriori-Hybrid algorithms. These later algorithms outperform the AIS and SETM mentioned above due to the Apriori property of frequent itemsets [ 8 ]. The term ‘Apriori’ usually refers to having prior knowledge of frequent itemset properties. Apriori uses a “bottom-up” approach to generate the candidate itemsets. To reduce the search space, Apriori uses the property that “all subsets of a frequent itemset must be frequent, and if an itemset is infrequent, then all its supersets must also be infrequent”. Another approach, predictive Apriori [ 108 ], can also generate rules; however, it can produce unexpected results as it combines both support and confidence. Apriori [ 8 ] is the most widely applicable technique for mining association rules.
  • ECLAT: This technique was proposed by Zaki et al. [ 131 ] and stands for Equivalence Class Clustering and bottom-up Lattice Traversal. ECLAT uses a depth-first search to find frequent itemsets. In contrast to the Apriori [ 8 ] algorithm, which represents data in a horizontal pattern, it represents data vertically. Hence, the ECLAT algorithm is more efficient and scalable in the area of association rule learning. This algorithm is better suited for small and medium datasets whereas the Apriori algorithm is used for large datasets.
  • FP-Growth: Another common association rule learning technique based on the frequent-pattern tree (FP-tree) proposed by Han et al. [ 42 ] is Frequent Pattern Growth, known as FP-Growth. The key difference with Apriori is that while generating rules, the Apriori algorithm [ 8 ] generates frequent candidate itemsets; on the other hand, the FP-growth algorithm [ 42 ] prevents candidate generation and thus produces a tree by the successful strategy of ‘divide and conquer’ approach. Due to its sophistication, however, FP-Tree is challenging to use in an interactive mining environment [ 133 ]. Thus, the FP-Tree would not fit into memory for massive data sets, making it challenging to process big data as well. Another solution is RARM (Rapid Association Rule Mining) proposed by Das et al. [ 26 ] but faces a related FP-tree issue [ 133 ].
  • ABC-RuleMiner: A rule-based machine learning method, recently proposed in our earlier paper, by Sarker et al. [ 104 ], to discover the interesting non-redundant rules to provide real-world intelligent services. This algorithm effectively identifies the redundancy in associations by taking into account the impact or precedence of the related contextual features and discovers a set of non-redundant association rules. This algorithm first constructs an association generation tree (AGT), a top-down approach, and then extracts the association rules through traversing the tree. Thus, ABC-RuleMiner is more potent than traditional rule-based methods in terms of both non-redundant rule generation and intelligent decision-making, particularly in a context-aware smart computing environment, where human or user preferences are involved.

Among the association rule learning techniques discussed above, Apriori [ 8 ] is the most widely used algorithm for discovering association rules from a given dataset [ 133 ]. The main strength of the association learning technique is its comprehensiveness, as it generates all associations that satisfy the user-specified constraints, such as minimum support and confidence value. The ABC-RuleMiner approach [ 104 ] discussed earlier could give significant results in terms of non-redundant rule generation and intelligent decision-making for the relevant application areas in the real world.
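As a concrete illustration of Apriori-style mining with support and confidence thresholds, the following minimal sketch assumes the third-party mlxtend package (not used by the paper itself); the transactions and thresholds are illustrative.

```python
# A minimal Apriori sketch, assuming mlxtend is installed
# (pip install mlxtend). Transactions and thresholds are illustrative.
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# One-hot encoded transactions: each row is a shopping basket.
baskets = pd.DataFrame(
    [
        {"laptop": 1, "antivirus": 1, "mouse": 0},
        {"laptop": 1, "antivirus": 1, "mouse": 1},
        {"laptop": 0, "antivirus": 0, "mouse": 1},
        {"laptop": 1, "antivirus": 0, "mouse": 1},
    ],
    dtype=bool,
)

# Frequent itemsets with support >= 0.5, then IF-THEN rules filtered
# by a minimum confidence threshold.
itemsets = apriori(baskets, min_support=0.5, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```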

Reinforcement Learning

Reinforcement learning (RL) is a machine learning technique that allows an agent to learn by trial and error in an interactive environment using input from its actions and experiences. Unlike supervised learning, which is based on given sample data or examples, the RL method is based on interacting with the environment. The problem to be solved in reinforcement learning (RL) is defined as a Markov Decision Process (MDP) [ 86 ], i.e., all about sequentially making decisions. An RL problem typically includes four elements such as Agent, Environment, Rewards, and Policy.

RL can be split roughly into Model-based and Model-free techniques. Model-based RL is the process of inferring optimal behavior from a model of the environment by performing actions and observing the results, which include the next state and the immediate reward [ 85 ]. AlphaZero, AlphaGo [ 113 ] are examples of the model-based approaches. On the other hand, a model-free approach does not use the distribution of the transition probability and the reward function associated with MDP. Q-learning, Deep Q Network, Monte Carlo Control, SARSA (State–Action–Reward–State–Action), etc. are some examples of model-free algorithms [ 52 ]. The policy network, which is required for model-based RL but not for model-free, is the key difference between model-free and model-based learning. In the following, we discuss the popular RL algorithms.

  • Monte Carlo methods: Monte Carlo techniques, or Monte Carlo experiments, are a wide category of computational algorithms that rely on repeated random sampling to obtain numerical results [ 52 ]. The underlying concept is to use randomness to solve problems that are deterministic in principle. Optimization, numerical integration, and drawing samples from probability distributions are the three problem classes where Monte Carlo techniques are most commonly used.
  • Q-learning: Q-learning is a model-free reinforcement learning algorithm for learning the quality of behaviors that tell an agent what action to take under what conditions [ 52 ]. It does not need a model of the environment (hence the term “model-free”), and it can deal with stochastic transitions and rewards without the need for adaptations. The ‘Q’ in Q-learning usually stands for quality, as the algorithm calculates the maximum expected rewards for a given behavior in a given state.
  • Deep Q-learning: Q-learning works well when the setting is reasonably simple, but when the number of states and actions becomes large, deep learning can be used as a function approximator. In deep Q-learning [ 52 ], the current state is fed into a neural network, which returns the Q-values of all possible actions as its output.
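
As a concrete illustration of the model-free idea, here is a minimal tabular Q-learning sketch on a hand-made chain environment. The environment, constants, and variable names are illustrative assumptions, not drawn from the cited works.

    import random

    # Toy deterministic chain MDP: states 0..3; action 0 moves left, 1 moves right.
    # Reaching state 3 yields reward 1 and ends the episode.
    N_STATES, N_ACTIONS, GOAL = 4, 2, 3
    alpha, gamma, epsilon = 0.5, 0.9, 0.2   # learning rate, discount, exploration

    Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

    def step(state, action):
        nxt = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
        return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL

    for _ in range(500):                     # episodes
        state, done = 0, False
        while not done:
            if random.random() < epsilon:    # epsilon-greedy exploration
                action = random.randrange(N_ACTIONS)
            else:
                action = max(range(N_ACTIONS), key=lambda a: Q[state][a])
            nxt, reward, done = step(state, action)
            # Core update: move Q(s, a) toward r + gamma * max_a' Q(s', a').
            Q[state][action] += alpha * (reward + gamma * max(Q[nxt]) - Q[state][action])
            state = nxt

    print(Q)  # the greedy policy should prefer action 1 (right) in every state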

Reinforcement learning, along with supervised and unsupervised learning, is one of the basic machine learning paradigms. RL can be used to solve numerous real-world problems in various fields, such as game theory, control theory, operations research, information theory, simulation-based optimization, manufacturing, supply chain logistics, multi-agent systems, swarm intelligence, aircraft control, robot motion control, and many more.

Artificial Neural Network and Deep Learning

Deep learning is part of a wider family of machine learning approaches based on artificial neural networks (ANNs) with representation learning. Deep learning provides a computational architecture that combines several processing layers, such as input, hidden, and output layers, to learn from data [ 41 ]. The main advantage of deep learning over traditional machine learning methods is its better performance in several cases, particularly when learning from large datasets [ 105 , 129 ]. Figure 9 shows the general performance of deep learning versus traditional machine learning as the amount of data increases; the actual behavior may vary depending on the data characteristics and experimental setup.

[Figure 9: Machine learning and deep learning performance in general with the amount of data]

The most common deep learning algorithms are the multi-layer perceptron (MLP), the convolutional neural network (CNN or ConvNet), and the long short-term memory recurrent neural network (LSTM-RNN) [ 96 ]. In the following, we discuss various types of deep learning methods that can be used to build effective data-driven models for various purposes.

[Figure 10: A structure of an artificial neural network model with multiple processing layers]

[Figure 11: An example of a convolutional neural network (CNN or ConvNet) including multiple convolution and pooling layers]

  • LSTM-RNN: Long short-term memory (LSTM) is an artificial recurrent neural network (RNN) architecture used in the area of deep learning [ 38 ]. Unlike normal feed-forward neural networks, LSTM has feedback connections. LSTM networks are well suited for analyzing and learning from sequential data, such as classifying, processing, and making predictions based on time-series data, which differentiates them from conventional feed-forward networks. Thus, LSTM can be used whenever the data are in a sequential format, such as time series or sentences, and it is commonly applied in time-series analysis, natural language processing, speech recognition, and related areas; a minimal sketch follows.
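
The sketch below shows one way an LSTM might be set up for binary classification of short sequences, again using Keras; the data shapes and hyperparameters are made up for illustration.

    import numpy as np
    import tensorflow as tf

    # Toy sequential data: 100 sequences of 20 time steps with 8 features each.
    X = np.random.rand(100, 20, 8).astype("float32")
    y = np.random.randint(0, 2, size=(100,))

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(20, 8)),
        tf.keras.layers.LSTM(32),                        # recurrent layer with feedback links
        tf.keras.layers.Dense(1, activation="sigmoid"),  # binary sequence classification
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(X, y, epochs=2, verbose=0)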

In addition to the most common deep learning methods discussed above, several other deep learning approaches [ 96 ] exist for various purposes. For instance, the self-organizing map (SOM) [ 58 ] uses unsupervised learning to represent high-dimensional data by a 2D grid map, thus achieving dimensionality reduction. The autoencoder (AE) [ 15 ] is another learning technique widely used for dimensionality reduction as well as feature extraction in unsupervised learning tasks. Restricted Boltzmann machines (RBMs) [ 46 ] can be used for dimensionality reduction, classification, regression, collaborative filtering, feature learning, and topic modeling. A deep belief network (DBN) is typically composed of simple, unsupervised networks such as restricted Boltzmann machines or autoencoders, together with a backpropagation neural network (BPNN) [ 123 ]. A generative adversarial network (GAN) [ 39 ] is a deep learning framework that can generate data with characteristics close to those of the actual input data. Transfer learning, the re-use of a pre-trained model on a new problem, is currently very popular because it allows deep neural networks to be trained with comparatively little data [ 124 ]. A brief discussion of these artificial neural network (ANN) and deep learning (DL) models is provided in our earlier paper, Sarker et al. [ 96 ]. As one example from this family, a minimal autoencoder sketch follows.
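The following is a minimal autoencoder sketch for dimensionality reduction, using Keras; the 64-to-8 bottleneck is an arbitrary illustrative choice, not taken from the cited works.

    import tensorflow as tf

    inputs = tf.keras.Input(shape=(64,))                             # 64-dimensional input
    code = tf.keras.layers.Dense(8, activation="relu")(inputs)       # 8-dimensional bottleneck
    outputs = tf.keras.layers.Dense(64, activation="sigmoid")(code)  # reconstruction

    autoencoder = tf.keras.Model(inputs, outputs)
    autoencoder.compile(optimizer="adam", loss="mse")
    # After fitting on unlabeled data, the encoder half yields the reduced representation:
    encoder = tf.keras.Model(inputs, code)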

Overall, based on the learning techniques discussed above, we can conclude that various types of machine learning techniques, such as classification analysis, regression, data clustering, feature selection and extraction, dimensionality reduction, association rule learning, reinforcement learning, and deep learning, can play significant roles for various purposes according to their capabilities. In the following section, we discuss several application areas based on machine learning algorithms.

Applications of Machine Learning

In the current age of the Fourth Industrial Revolution (4IR), machine learning has become popular in various application areas because of its ability to learn from past data and make intelligent decisions. In the following, we summarize and discuss ten popular application areas of machine learning technology.

  • Predictive analytics and intelligent decision-making: A major application field of machine learning is intelligent decision-making by data-driven predictive analytics [ 21 , 70 ]. The basis of predictive analytics is capturing and exploiting relationships between explanatory variables and predicted variables from previous events to predict the unknown outcome [ 41 ]. Examples include identifying suspects or criminals after a crime has been committed, or detecting credit card fraud as it happens. In another application, machine learning algorithms can assist retailers in better understanding consumer preferences and behavior, managing inventory, avoiding out-of-stock situations, and optimizing logistics and warehousing in e-commerce. Various machine learning algorithms, such as decision trees, support vector machines, and artificial neural networks [ 106 , 125 ], are commonly used in the area. Since accurate predictions provide insight into the unknown, they can improve the decisions of industries, businesses, and almost any organization, including government agencies, e-commerce, telecommunications, banking and financial services, healthcare, sales and marketing, transportation, social networking, and many others.
  • Cybersecurity and threat intelligence: Cybersecurity, the practice of protecting networks, systems, hardware, and data from digital attacks, is one of the most essential areas of Industry 4.0 [ 114 ]. Machine learning has become a crucial cybersecurity technology that constantly learns by analyzing data to identify patterns, better detect malware in encrypted traffic, find insider threats, predict where "bad neighborhoods" are online, keep people safe while browsing, and secure data in the cloud by uncovering suspicious activity. For instance, clustering techniques can be used to identify cyber-anomalies, policy violations, etc., while machine learning classification models that take into account the impact of security features are useful for detecting various types of cyber-attacks or intrusions [ 97 ]. Various deep learning-based security models can also be applied to large-scale security datasets [ 96 , 129 ]. Moreover, security policy rules generated by association rule learning techniques can play a significant role in building rule-based security systems [ 105 ]. Thus, the various learning techniques discussed in Sect. "Machine Learning Tasks and Algorithms" can enable cybersecurity professionals to be more proactive in efficiently preventing threats and cyber-attacks.
  • Internet of things (IoT) and smart cities: The Internet of Things (IoT) is another essential area of Industry 4.0 [ 114 ]; it turns everyday objects into smart objects by allowing them to transmit data and automate tasks without the need for human interaction. IoT is, therefore, considered to be the big frontier that can enhance almost all activities in our lives, such as smart governance, smart homes, education, communication, transportation, retail, agriculture, healthcare, business, and many more [ 70 ]. The smart city is one of IoT's core fields of application, using technologies to enhance city services and residents' living experiences [ 132 , 135 ]. Because machine learning utilizes experience to recognize trends and create models that help predict future behavior and events, it has become a crucial technology for IoT applications [ 103 ]. For example, predicting traffic in smart cities, predicting parking availability, estimating citizens' total energy usage for a particular period, and making context-aware and timely decisions for people are some tasks that can be solved using machine learning techniques according to people's current needs.
  • Traffic prediction and transportation: Transportation systems have become a crucial component of every country's economic development. Nonetheless, several cities around the world are experiencing an excessive rise in traffic volume, resulting in serious issues such as delays, traffic congestion, higher fuel prices, increased CO2 pollution, accidents, emergencies, and a decline in modern society's quality of life [ 40 ]. Thus, an intelligent transportation system that predicts future traffic is important and is an indispensable part of a smart city. Accurate traffic prediction based on machine and deep learning modeling can help to minimize these issues [ 17 , 30 , 31 ]. For example, based on travel histories and trends of traveling through various routes, machine learning can assist transportation companies in predicting possible issues that may occur on specific routes and recommending that their customers take a different route. Ultimately, these learning-based, data-driven models help improve traffic flow, increase the usage and efficiency of sustainable modes of transportation, and limit real-world disruption by modeling and visualizing future changes.
  • Healthcare and COVID-19 pandemic: Machine learning can help to solve diagnostic and prognostic problems in a variety of medical domains, such as disease prediction, medical knowledge extraction, detecting regularities in data, patient management, etc. [ 33 , 77 , 112 ]. Coronavirus disease (COVID-19) is an infectious disease caused by a newly discovered coronavirus, according to the World Health Organization (WHO) [ 3 ]. Recently, learning techniques have become popular in the battle against COVID-19 [ 61 , 63 ]. During the COVID-19 pandemic, learning techniques have been used to classify patients at high risk, estimate mortality rates, and detect other anomalies [ 61 ]. They can also be used to better understand the virus's origin, predict COVID-19 outbreaks, and support disease diagnosis and treatment [ 14 , 50 ]. With the help of machine learning, researchers can forecast where and when COVID-19 is likely to spread and notify those regions so that the required arrangements can be made. Deep learning also provides exciting solutions to problems in medical image processing and is seen as a crucial technique for potential applications, particularly for the COVID-19 pandemic [ 10 , 78 , 111 ]. Overall, machine and deep learning techniques can help to fight the COVID-19 virus and the pandemic, as well as support intelligent clinical decision-making in the healthcare domain.
  • E-commerce and product recommendations: Product recommendation is one of the best-known and most widely used applications of machine learning, and it is one of the most prominent features of almost any e-commerce website today. Machine learning technology can assist businesses in analyzing their consumers' purchasing histories and making customized product suggestions for their next purchase based on their behavior and preferences. E-commerce companies, for example, can easily position product suggestions and offers by analyzing browsing trends and click-through rates of specific items. Using predictive modeling based on machine learning techniques, many online retailers, such as Amazon [ 71 ], can better manage inventory, prevent out-of-stock situations, and optimize logistics and warehousing. The future of sales and marketing lies in the ability to capture, evaluate, and use consumer data to provide a customized shopping experience. Furthermore, machine learning techniques enable companies to create packages and content tailored to the needs of their customers, allowing them to retain existing customers while attracting new ones.
  • NLP and sentiment analysis: Natural language processing (NLP) involves the reading and understanding of spoken or written language through the medium of a computer [ 79 , 103 ]. NLP thus helps computers, for instance, to read text, hear speech, interpret it, analyze sentiment, and decide which aspects are significant, and machine learning techniques can be used throughout. Virtual personal assistants, chatbots, speech recognition, document description, and language or machine translation are some examples of NLP-related tasks. Sentiment analysis [ 90 ] (also referred to as opinion mining or emotion AI) is an NLP sub-field that seeks to identify and extract public mood and views from a given text, drawn from blogs, reviews, social media, forums, news, etc. For instance, businesses and brands use sentiment analysis to understand the social sentiment around their brand, product, or service through social media platforms or the web as a whole. Overall, sentiment analysis is a machine learning task that analyzes texts for polarity, such as "positive", "negative", or "neutral", along with more fine-grained emotions such as very happy, happy, sad, very sad, angry, interested, or not interested; a minimal polarity-classification sketch is shown after this list.
  • Image, speech and pattern recognition: Image recognition [ 36 ] is a well-known and widespread example of machine learning in the real world, in which objects are identified in a digital image. For instance, labelling an x-ray as cancerous or not, character recognition, face detection in an image, and tagging suggestions on social media (e.g., Facebook) are common examples of image recognition. Speech recognition [ 23 ] is also very popular; it typically uses sound and linguistic models and powers assistants such as Google Assistant, Cortana, Siri, and Alexa [ 67 ], where machine learning methods are used. Pattern recognition [ 13 ] is defined as the automated recognition of patterns and regularities in data, e.g., in image analysis. Several machine learning techniques, such as classification, feature selection, clustering, and sequence labeling methods, are used in the area.
  • Sustainable agriculture: Agriculture is essential to the survival of all human activities [ 109 ]. Sustainable agriculture practices help to improve agricultural productivity while also reducing negative impacts on the environment [ 5 , 25 , 109 ]. Sustainable agriculture supply chains are knowledge-intensive and based on information, skills, technologies, etc., where knowledge transfer encourages farmers to make better decisions about adopting sustainable agriculture practices, utilizing the increasing amount of data captured by emerging technologies, e.g., the Internet of Things (IoT) and mobile technologies and devices [ 5 , 53 , 54 ]. Machine learning can be applied in various phases of sustainable agriculture: in the pre-production phase, for the prediction of crop yield, soil properties, irrigation requirements, etc.; in the production phase, for weather prediction, disease detection, weed detection, soil nutrient management, livestock management, etc.; in the processing phase, for demand estimation, production planning, etc.; and in the distribution phase, for inventory management, consumer analysis, etc.
  • User behavior analytics and context-aware smartphone applications: Context-awareness is a system's ability to capture knowledge about its surroundings at any moment and modify its behavior accordingly [ 28 , 93 ]. Context-aware computing uses software and hardware to automatically collect and interpret data for direct responses. The mobile app development environment has changed greatly with the power of AI, particularly through machine learning techniques and their ability to learn from contextual data [ 103 , 136 ]. Thus, developers of mobile apps can rely on machine learning to create smart apps that understand human behavior and support and entertain users [ 107 , 137 , 140 ]. Machine learning techniques are applicable to building various personalized, data-driven, context-aware systems, such as smart interruption management, smart mobile recommendation, context-aware smart searching, and decision-making that intelligently assists mobile phone users in a pervasive computing environment. For example, context-aware association rules can be used to build an intelligent phone call application [ 104 ]. Clustering approaches are useful in capturing users' diverse behavioral activities by taking into account time-series data [ 102 ]. Classification methods can be used to predict future events in various contexts [ 106 , 139 ]. Thus, the various learning techniques discussed in Sect. "Machine Learning Tasks and Algorithms" can help to build context-aware, adaptive, and smart applications according to the preferences of mobile phone users.
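
As a minimal illustration of the polarity classification mentioned in the NLP item above, the following scikit-learn sketch trains a TF-IDF plus logistic regression pipeline on a toy corpus. The texts and labels are invented; production systems train on far larger labeled datasets.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Tiny hand-made corpus (illustrative only).
    texts = ["great product, works perfectly", "absolutely terrible service",
             "very happy with this purchase", "poor quality, very disappointed"]
    labels = ["positive", "negative", "positive", "negative"]

    clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
    clf.fit(texts, labels)
    print(clf.predict(["the quality is great and delivery was quick"]))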

In addition to these application areas, machine learning-based models can also apply to several other domains such as bioinformatics, cheminformatics, computer networks, DNA sequence classification, economics and banking, robotics, advanced engineering, and many more.

Challenges and Research Directions

Our study on machine learning algorithms for intelligent data analysis and applications opens several research issues in the area. Thus, in this section, we summarize and discuss the challenges faced and the potential research opportunities and future directions.

In general, the effectiveness and efficiency of a machine learning-based solution depend on the nature and characteristics of the data and the performance of the learning algorithms. Collecting data in relevant domains such as cybersecurity, IoT, healthcare, and agriculture, discussed in Sect. "Applications of Machine Learning", is not straightforward, although the current cyberspace enables the production of a huge amount of data at very high frequency. Thus, collecting useful data for the target machine learning-based applications, e.g., smart city applications, and managing those data are important for further analysis. Therefore, a more in-depth investigation of data collection methods is needed when working with real-world data. Moreover, historical data may contain many ambiguous values, missing values, outliers, and meaningless entries. The machine learning algorithms discussed in Sect. "Machine Learning Tasks and Algorithms" depend heavily on the quality and availability of the data used for training, and consequently so does the resultant model. Thus, accurately cleaning and pre-processing the diverse data collected from diverse sources is a challenging task. Therefore, effectively modifying or enhancing existing pre-processing methods, or proposing new data preparation techniques, is required to use the learning algorithms effectively in the associated application domains.

To analyze the data and extract insights, many machine learning algorithms exist, as summarized in Sect. "Machine Learning Tasks and Algorithms". Thus, selecting a learning algorithm suitable for the target application is challenging, because the outcomes of different learning algorithms may vary depending on the data characteristics [ 106 ]. Selecting the wrong learning algorithm can produce unexpected outcomes, leading to wasted effort as well as reduced model effectiveness and accuracy. In terms of model building, the techniques discussed in Sect. "Machine Learning Tasks and Algorithms" can be directly used to solve many real-world problems in diverse domains, such as cybersecurity, smart cities, and healthcare, summarized in Sect. "Applications of Machine Learning". However, hybrid learning models, e.g., ensembles of methods, the modification or enhancement of existing learning techniques, or the design of new learning methods could be potential future work in the area.

Thus, the ultimate success of a machine learning-based solution and the corresponding applications depends mainly on both the data and the learning algorithms. If the data are unsuitable for learning, e.g., non-representative, of poor quality, containing irrelevant features, or insufficient in quantity for training, the machine learning models may become useless or produce lower accuracy. Therefore, effectively processing the data and handling the diverse learning algorithms are important for a machine learning-based solution and, eventually, for building intelligent applications.

Conclusion

In this paper, we have presented a comprehensive overview of machine learning algorithms for intelligent data analysis and applications. In line with our goal, we have briefly discussed how various types of machine learning methods can be used to solve various real-world problems. A successful machine learning model depends on both the data and the performance of the learning algorithms; sophisticated learning algorithms must be trained on real-world data and knowledge related to the target application before the system can assist with intelligent decision-making. We also discussed several popular application areas based on machine learning techniques to highlight their applicability to various real-world problems. Finally, we summarized and discussed the challenges faced and the potential research opportunities and future directions in the area; the challenges identified create promising research opportunities that must be addressed with effective solutions. Overall, we believe that our study on machine learning-based solutions opens up a promising direction and can serve as a reference guide for potential research and applications, both for academia and industry professionals and for decision-makers, from a technical point of view.

Declaration

The author declares no conflict of interest.

This article is part of the topical collection “Advances in Computational Approaches for Artificial Intelligence, Image Processing, IoT and Cloud Applications” guest edited by Bhanu Prakash K N and M. Shivakumar.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Machine Learning Research at Apple

Introducing Apple's On-Device and Server Foundation Models

At the 2024 Worldwide Developers Conference, we introduced Apple Intelligence, a personal intelligence system integrated deeply into iOS 18, iPadOS 18, and macOS Sequoia.

Apple Intelligence comprises multiple highly capable generative models that are specialized for our users' everyday tasks and can adapt on the fly to their current activity. The foundation models built into Apple Intelligence have been fine-tuned for user experiences such as writing and refining text, prioritizing and summarizing notifications, creating playful images for conversations with family and friends, and taking in-app actions to simplify interactions across apps.

Recent research

Evaluating the IWSLT 2023 Speech Translation Tasks: Human Annotations, Automatic Metrics, and Segmentation

Improved Modelling of Federated Datasets Using Mixtures-of-Dirichlet-Multinomials


Research highlights

Personalizing Health and Fitness with Hybrid Modeling

Recent research has explored clinical monitoring, cardiovascular events, and even clinical lab values from wearables data. As adoption increases, wearables data may become crucial in public health applications like disease monitoring and the design of epidemiological studies.

Enhancing Paragraph Generation with a Latent Language Diffusion Model

In the fast-evolving world of natural language processing (NLP), there is a strong demand for generating coherent and controlled text, as referenced in the work Toward Controlled Generation of Text. Traditional autoregressive models such as GPT, which have long been the industry standard, possess inherent limitations that sometimes manifest as repetitive and low-quality outputs, as seen in the work The Curious Case of Neural Text Degeneration. This is primarily due to a phenomenon known as "exposure bias," as seen in the work Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks. This imperfection arises due to a mismatch between how these models are trained and their actual use during inference, often leading to error accumulation during text generation.


IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2024

Apple is sponsoring the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), which is taking place in person from June 17 to 21 in Seattle, Washington. CVPR is the annual computer vision event comprising the main conference and several co-located workshops and short courses. Below is the schedule of our sponsored workshops and events at CVPR 2024.

ACM Human-Computer Interaction conference (CHI) 2024

Apple is sponsoring the ACM Human-Computer Interaction Conference (CHI), which is taking place in person from May 11 to May 16, 2024 in Honolulu, Hawai'i.




10,021 dataset results


The CIFAR-10 dataset (Canadian Institute for Advanced Research, 10 classes) is a subset of the Tiny Images dataset and consists of 60000 32x32 color images. The images are labelled with one of 10 mutually exclusive classes: airplane, automobile (but not truck or pickup truck), bird, cat, deer, dog, frog, horse, ship, and truck (but not pickup truck). There are 6000 images per class with 5000 training and 1000 testing images per class.

14,466 PAPERS • 104 BENCHMARKS
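
As a quick illustration of how such benchmark datasets are commonly accessed in practice, the following sketch loads CIFAR-10 through the Keras datasets API (one access path among several; the printed shapes follow the train/test split described above).

    import tensorflow as tf

    # Downloads CIFAR-10 on first use and returns NumPy arrays.
    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
    print(x_train.shape)  # (50000, 32, 32, 3): 5000 training images per class
    print(x_test.shape)   # (10000, 32, 32, 3): 1000 testing images per class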


The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010 the dataset is used in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), a benchmark in image classification and object detection. The publicly released dataset contains a set of manually annotated training images. A set of test images is also released, with the manual annotations withheld. ILSVRC annotations fall into one of two categories: (1) image-level annotation of a binary label for the presence or absence of an object class in the image, e.g., “there are cars in this image” but “there are no tigers,” and (2) object-level annotation of a tight bounding box and class label around an object instance in the image, e.g., “there is a screwdriver centered at position (20,25) with width of 50 pixels and height of 30 pixels”. The ImageNet project does not own the copyright of the images, therefore only thumbnails and URLs of images are provided.

13,763 PAPERS • 43 BENCHMARKS


The MS COCO (Microsoft Common Objects in Context) dataset is a large-scale object detection, segmentation, key-point detection, and captioning dataset. The dataset consists of 328K images.

10,461 PAPERS • 93 BENCHMARKS


The CIFAR-100 dataset (Canadian Institute for Advanced Research, 100 classes) is a subset of the Tiny Images dataset and consists of 60000 32x32 color images. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. There are 600 images per class. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). There are 500 training images and 100 testing images per class.

7,890 PAPERS • 55 BENCHMARKS


The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has a training set of 60,000 examples, and a test set of 10,000 examples. It is a subset of a larger NIST Special Database 3 (digits written by employees of the United States Census Bureau) and Special Database 1 (digits written by high school students) which contain monochrome images of handwritten digits. The digits have been size-normalized and centered in a fixed-size image. The original black and white (bilevel) images from NIST were size normalized to fit in a 20x20 pixel box while preserving their aspect ratio. The resulting images contain grey levels as a result of the anti-aliasing technique used by the normalization algorithm. the images were centered in a 28x28 image by computing the center of mass of the pixels, and translating the image so as to position this point at the center of the 28x28 field.

7,068 PAPERS • 52 BENCHMARKS


Cityscapes is a large-scale database which focuses on semantic understanding of urban street scenes. It provides semantic, instance-wise, and dense pixel annotations for 30 classes grouped into 8 categories (flat surfaces, humans, vehicles, constructions, objects, nature, sky, and void). The dataset consists of around 5000 fine annotated images and 20000 coarse annotated ones. Data was captured in 50 cities during several months, daytimes, and good weather conditions. It was originally recorded as video so the frames were manually selected to have the following features: large number of dynamic objects, varying scene layout, and varying background.

3,377 PAPERS • 54 BENCHMARKS


KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute) is one of the most popular datasets for use in mobile robotics and autonomous driving. It consists of hours of traffic scenarios recorded with a variety of sensor modalities, including high-resolution RGB, grayscale stereo cameras, and a 3D laser scanner. Despite its popularity, the dataset itself does not contain ground truth for semantic segmentation. However, various researchers have manually annotated parts of the dataset to fit their necessities. Álvarez et al. generated ground truth for 323 images from the road detection challenge with three classes: road, vertical, and sky. Zhang et al. annotated 252 (140 for training and 112 for testing) acquisitions – RGB and Velodyne scans – from the tracking challenge for ten object categories: building, sky, road, vegetation, sidewalk, car, pedestrian, cyclist, sign/pole, and fence. Ros et al. labeled 170 training images and 46 testing images (from the visual odome

3,290 PAPERS • 142 BENCHMARKS


CelebFaces Attributes dataset contains 202,599 face images of the size 178×218 from 10,177 celebrities, each annotated with 40 binary labels indicating facial attributes like hair color, gender and age.

3,138 PAPERS • 20 BENCHMARKS


Street View House Numbers (SVHN) is a digit classification benchmark dataset that contains 600,000 32×32 RGB images of printed digits (from 0 to 9) cropped from pictures of house number plates. The cropped images are centered in the digit of interest, but nearby digits and other distractors are kept in the image. SVHN has three sets: training, testing sets and an extra set with 530,000 images that are less difficult and can be used for helping with the training process.

3,130 PAPERS • 12 BENCHMARKS

Neural Radiance Fields (NeRF) is a method for synthesizing novel views of complex scenes by optimizing an underlying continuous volumetric scene function using a sparse set of input views. The dataset contains three parts with the first 2 being synthetic renderings of objects called Diffuse Synthetic 360◦ and Realistic Synthetic 360◦ while the third is real images of complex scenes. Diffuse Synthetic 360◦ consists of four Lambertian objects with simple geometry. Each object is rendered at 512x512 pixels from viewpoints sampled on the upper hemisphere. Realistic Synthetic 360◦ consists of eight objects of complicated geometry and realistic non-Lambertian materials. Six of them are rendered from viewpoints sampled on the upper hemisphere and the two left are from viewpoints sampled on a full sphere with all of them at 800x800 pixels. The real images of complex scenes consist of 8 forward-facing scenes captured with a cellphone at a size of 1008x756 pixels.

2,870 PAPERS • 1 BENCHMARK


Fashion-MNIST is a dataset comprising of 28×28 grayscale images of 70,000 fashion products from 10 categories, with 7,000 images per category. The training set has 60,000 images and the test set has 10,000 images. Fashion-MNIST shares the same image size, data format and the structure of training and testing splits with the original MNIST.

2,847 PAPERS • 17 BENCHMARKS


General Language Understanding Evaluation (GLUE) benchmark is a collection of nine natural language understanding tasks, including single-sentence tasks CoLA and SST-2, similarity and paraphrasing tasks MRPC, STS-B and QQP, and natural language inference tasks MNLI, QNLI, RTE and WNLI.

2,790 PAPERS • 25 BENCHMARKS


The Stanford Sentiment Treebank is a corpus with fully labeled parse trees that allows for a complete analysis of the compositional effects of sentiment in language. The corpus is based on the dataset introduced by Pang and Lee (2005) and consists of 11,855 single sentences extracted from movie reviews. It was parsed with the Stanford parser and includes a total of 215,154 unique phrases from those parse trees, each annotated by 3 human judges.

2,075 PAPERS • 9 BENCHMARKS


The Caltech-UCSD Birds-200-2011 (CUB-200-2011) dataset is the most widely-used dataset for fine-grained visual categorization task. It contains 11,788 images of 200 subcategories belonging to birds, 5,994 for training and 5,794 for testing. Each image has detailed annotations: 1 subcategory label, 15 part locations, 312 binary attributes and 1 bounding box. The textual information comes from Reed et al.. They expand the CUB-200-2011 dataset by collecting fine-grained natural language descriptions. Ten single-sentence descriptions are collected for each image. The natural language descriptions are collected through the Amazon Mechanical Turk (AMT) platform, and are required at least 10 words, without any information of subcategories and actions.

2,005 PAPERS • 46 BENCHMARKS


The LibriSpeech corpus is a collection of approximately 1,000 hours of audiobooks that are part of the LibriVox project. Most of the audiobooks come from Project Gutenberg. The training data are split into 3 partitions of 100 hr, 360 hr, and 500 hr, while the dev and test data are each split into 'clean' and 'other' categories depending on how challenging they are for automatic speech recognition systems. Each of the dev and test sets is around 5 hr in audio length. The corpus also provides n-gram language models and the corresponding texts excerpted from Project Gutenberg books, which contain 803M tokens and 977K unique words.

1,992 PAPERS • 8 BENCHMARKS


The Stanford Question Answering Dataset (SQuAD) is a collection of question-answer pairs derived from Wikipedia articles. In SQuAD, the correct answers of questions can be any sequence of tokens in the given text. Because the questions and answers are produced by humans through crowdsourcing, it is more diverse than some other question-answering datasets. SQuAD 1.1 contains 107,785 question-answer pairs on 536 articles. SQuAD2.0 (open-domain SQuAD, SQuAD-Open), the latest version, combines the 100,000 questions in SQuAD1.1 with over 50,000 un-answerable questions written adversarially by crowdworkers in forms that are similar to the answerable ones.

1,954 PAPERS • 11 BENCHMARKS


ShapeNet is a large-scale repository for 3D CAD models developed by researchers from Stanford University, Princeton University and the Toyota Technological Institute at Chicago, USA. The repository contains over 300M models, with 220,000 classified into 3,135 classes arranged using WordNet hypernym-hyponym relationships. The ShapeNet Parts subset contains 31,693 meshes categorised into 16 common object classes (i.e. table, chair, plane etc.). Each shape's ground truth contains 2-5 parts (with a total of 50 part classes).

1,731 PAPERS • 13 BENCHMARKS


The Multi-Genre Natural Language Inference (MultiNLI) dataset has 433K sentence pairs. Its size and mode of collection are modeled closely like SNLI. MultiNLI offers ten distinct genres (Face-to-face, Telephone, 9/11, Travel, Letters, Oxford University Press, Slate, Verbatim, Goverment and Fiction) of written and spoken English data. There are matched dev/test sets which are derived from the same sources as those in the training set, and mismatched sets which do not closely resemble any seen at training time.

1,696 PAPERS • 3 BENCHMARKS


UCF101 dataset is an extension of UCF50 and consists of 13,320 video clips, which are classified into 101 categories. These 101 categories can be classified into 5 types (Body motion, Human-human interactions, Human-object interactions, Playing musical instruments and Sports). The total length of these video clips is over 27 hours. All the videos are collected from YouTube and have a fixed frame rate of 25 FPS with the resolution of 320 × 240.

1,657 PAPERS • 25 BENCHMARKS


The nuScenes dataset is a large-scale autonomous driving dataset. The dataset has 3D bounding boxes for 1000 scenes collected in Boston and Singapore. Each scene is 20 seconds long and annotated at 2Hz. This results in a total of 28130 samples for training, 6019 samples for validation and 6008 samples for testing. The dataset has the full autonomous vehicle data suite: 32-beam LiDAR, 6 cameras and radars with complete 360° coverage. The 3D object detection challenge evaluates the performance on 10 classes: cars, trucks, buses, trailers, construction vehicles, pedestrians, motorcycles, bicycles, traffic cones and barriers.

1,640 PAPERS • 20 BENCHMARKS


The IMDb Movie Reviews dataset is a binary sentiment analysis dataset consisting of 50,000 reviews from the Internet Movie Database (IMDb) labeled as positive or negative. The dataset contains an even number of positive and negative reviews. Only highly polarizing reviews are considered. A negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10. No more than 30 reviews are included per movie. The dataset contains additional unlabeled data.

1,615 PAPERS • 11 BENCHMARKS


Visual Question Answering (VQA) is a dataset containing open-ended questions about images. These questions require an understanding of vision, language and commonsense knowledge to answer. The first version of the dataset was released in October 2015. VQA v2.0 was released in April 2017.

1,584 PAPERS • NO BENCHMARKS YET



MuJoCo (multi-joint dynamics with contact) is a physics engine used to implement environments to benchmark Reinforcement Learning methods.

1,426 PAPERS • 2 BENCHMARKS


ScanNet is an instance-level indoor RGB-D dataset that includes both 2D and 3D data. It is a collection of labeled voxels rather than points or objects. Up to now, ScanNet v2, the newest version of ScanNet, has collected 1513 annotated scans with an approximate 90% surface coverage. In the semantic segmentation task, this dataset is marked in 20 classes of annotated 3D voxelized objects.

1,293 PAPERS • 19 BENCHMARKS


Flickr-Faces-HQ (FFHQ) consists of 70,000 high-quality PNG images at 1024×1024 resolution and contains considerable variation in terms of age, ethnicity and image background. It also has good coverage of accessories such as eyeglasses, sunglasses, hats, etc. The images were crawled from Flickr, thus inheriting all the biases of that website, and automatically aligned and cropped using dlib. Only images under permissive licenses were collected. Various automatic filters were used to prune the set, and finally Amazon Mechanical Turk was used to remove the occasional statues, paintings, or photos of photos.

1,270 PAPERS • 16 BENCHMARKS

mini-ImageNet was proposed in Matching Networks for One Shot Learning (NeurIPS 2016). The dataset consists of 50,000 training images and 10,000 testing images, evenly distributed across 100 classes.

1,266 PAPERS • 19 BENCHMARKS


The ModelNet40 dataset contains synthetic object point clouds. As the most widely used benchmark for point cloud analysis, ModelNet40 is popular because of its various categories, clean shapes, well-constructed dataset, etc. The original ModelNet40 consists of 12,311 CAD-generated meshes in 40 categories (such as airplane, car, plant, lamp), of which 9,843 are used for training while the rest 2,468 are reserved for testing. The corresponding point cloud data points are uniformly sampled from the mesh surfaces, and then further preprocessed by moving to the origin and scaling into a unit sphere.

1,261 PAPERS • 17 BENCHMARKS


The SNLI dataset (Stanford Natural Language Inference) consists of 570k sentence-pairs manually labeled as entailment, contradiction, and neutral. Premises are image captions from Flickr30k, while hypotheses were generated by crowd-sourced annotators who were shown a premise and asked to generate entailing, contradicting, and neutral sentences. Annotators were instructed to judge the relation between sentences given that they describe the same event. Each pair is labeled as “entailment”, “neutral”, “contradiction” or “-”, where “-” indicates that an agreement could not be reached.

1,239 PAPERS • 1 BENCHMARK

OpenAI Gym is a toolkit for developing and comparing reinforcement learning algorithms. It includes environment such as Algorithmic, Atari, Box2D, Classic Control, MuJoCo, Robotics, and Toy Text.

1,222 PAPERS • 3 BENCHMARKS


The Kinetics dataset is a large-scale, high-quality dataset for human action recognition in videos. The dataset consists of around 500,000 video clips covering 600 human action classes with at least 600 video clips for each action class. Each video clip lasts around 10 seconds and is labeled with a single action class. The videos are collected from YouTube.

1,216 PAPERS • 29 BENCHMARKS


Visual Genome contains Visual Question Answering data in a multi-choice setting. It consists of 101,174 images from MSCOCO with 1.7 million QA pairs, 17 questions per image on average. Compared to the Visual Question Answering dataset, Visual Genome represents a more balanced distribution over 6 question types: What, Where, When, Who, Why and How. The Visual Genome dataset also presents 108K images with densely annotated objects, attributes and relationships.

1,162 PAPERS • 19 BENCHMARKS


The MovieLens datasets, first released in 1998, describe people’s expressed preferences for movies. These preferences take the form of tuples, each the result of a person expressing a preference (a 0-5 star rating) for a movie at a particular time. These preferences were entered by way of the MovieLens web site1 — a recommender system that asks its users to give movie ratings in order to receive personalized movie recommendations.

1,113 PAPERS • 16 BENCHMARKS

CARLA (CAR Learning to Act) is an open simulator for urban driving, developed as an open-source layer over Unreal Engine 4. It provides sensors in the form of RGB cameras (with customizable positions), ground-truth depth maps, ground-truth semantic segmentation maps with 12 semantic classes designed for driving (road, lane marking, traffic sign, sidewalk and so on), bounding boxes for dynamic objects in the environment, and measurements of the agent itself (vehicle location and orientation).

1,106 PAPERS • 3 BENCHMARKS

The PubMed dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words.

1,097 PAPERS • 24 BENCHMARKS


Oxford 102 Flower is an image classification dataset consisting of 102 flower categories. The flowers were chosen to be flowers commonly occurring in the United Kingdom. Each class consists of between 40 and 258 images.

1,084 PAPERS • 17 BENCHMARKS


The QNLI (Question-answering NLI) dataset is a Natural Language Inference dataset automatically derived from the Stanford Question Answering Dataset v1.1 (SQuAD). SQuAD v1.1 consists of question-paragraph pairs, where one of the sentences in the paragraph (drawn from Wikipedia) contains the answer to the corresponding question (written by an annotator). The dataset was converted into sentence pair classification by forming a pair between each question and each sentence in the corresponding context, and filtering out pairs with low lexical overlap between the question and the context sentence. The task is to determine whether the context sentence contains the answer to the question. This modified version of the original task removes the requirement that the model select the exact answer, but also removes the simplifying assumptions that the answer is always present in the input and that lexical overlap is a reliable cue. The QNLI dataset is part of GLUE benchmark.

1,077 PAPERS • 3 BENCHMARKS


The Natural Questions corpus is a question answering dataset containing 307,373 training examples, 7,830 development examples, and 7,842 test examples. Each example is comprised of a google.com query and a corresponding Wikipedia page. Each Wikipedia page has a passage (or long answer) annotated on the page that answers the question and one or more short spans from the annotated passage containing the actual answer. The long and the short answer annotations can however be empty. If they are both empty, then there is no answer on the page at all. If the long answer annotation is non-empty, but the short answer annotation is empty, then the annotated passage answers the question but no explicit short answer could be found. Finally 1% of the documents have a passage annotated with a short answer that is “yes” or “no”, instead of a list of short spans.

1,064 PAPERS • 8 BENCHMARKS


The Places dataset is proposed for scene recognition and contains more than 2.5 million images covering more than 205 scene categories with more than 5,000 images per category.

1,046 PAPERS • 4 BENCHMARKS


The ADE20K semantic segmentation dataset contains more than 20K scene-centric images exhaustively annotated with pixel-level objects and object parts labels. There are totally 150 semantic categories, which include stuffs like sky, road, grass, and discrete objects like person, car, bed.

1,034 PAPERS • 27 BENCHMARKS


Tiny ImageNet contains 100000 images of 200 classes (500 for each class) downsized to 64×64 colored images. Each class has 500 training images, 50 validation images and 50 test images.

997 PAPERS • 8 BENCHMARKS


The STL-10 is an image dataset derived from ImageNet and popularly used to evaluate algorithms of unsupervised feature learning or self-taught learning. Besides 100,000 unlabeled images, it contains 13,000 labeled images from 10 object classes (such as birds, cats, trucks), among which 5,000 images are partitioned for training while the remaining 8,000 images for testing. All the images are color images with 96×96 pixels in size.

982 PAPERS • 17 BENCHMARKS


The English Penn Treebank (PTB) corpus, and in particular the section of the corpus corresponding to the articles of Wall Street Journal (WSJ), is one of the most known and used corpus for the evaluation of models for sequence labelling. The task consists of annotating each word with its Part-of-Speech tag. In the most common split of this corpus, sections from 0 to 18 are used for training (38 219 sentences, 912 344 tokens), sections from 19 to 21 are used for validation (5 527 sentences, 131 768 tokens), and sections from 22 to 24 are used for testing (5 462 sentences, 129 654 tokens). The corpus is also commonly used for character-level and word-level Language Modelling.

980 PAPERS • 10 BENCHMARKS


Office-Home is a benchmark dataset for domain adaptation which contains 4 domains where each domain consists of 65 categories. The four domains are: Art – artistic images in the form of sketches, paintings, ornamentation, etc.; Clipart – collection of clipart images; Product – images of objects without a background and Real-World – images of objects captured with a regular camera. It contains 15,500 images, with an average of around 70 images per class and a maximum of 99 images in a class.

955 PAPERS • 11 BENCHMARKS


The Medical Information Mart for Intensive Care III (MIMIC-III) dataset is a large, de-identified and publicly-available collection of medical records. Each record in the dataset includes ICD-9 codes, which identify diagnoses and procedures performed. Each code is partitioned into sub-codes, which often include specific circumstantial details. The dataset consists of 112,000 clinical reports records (average length 709.3 tokens) and 1,159 top-level ICD-9 codes. Each report is assigned to 7.6 codes, on average. Data includes vital signs, medications, laboratory measurements, observations and notes charted by care providers, fluid balance, procedure codes, diagnostic codes, imaging reports, hospital length of stay, survival data, and more.

914 PAPERS • 8 BENCHMARKS


The NYU-Depth V2 data set is comprised of video sequences from a variety of indoor scenes as recorded by both the RGB and Depth cameras from the Microsoft Kinect. It features:

866 PAPERS • 20 BENCHMARKS


The Open Graph Benchmark (OGB) is a collection of realistic, large-scale, and diverse benchmark datasets for machine learning on graphs. OGB datasets are automatically downloaded, processed, and split using the OGB Data Loader. The model performance can be evaluated using the OGB Evaluator in a unified manner. OGB is a community-driven initiative in active development.

851 PAPERS • 16 BENCHMARKS


The MS MARCO (Microsoft MAchine Reading Comprehension) is a collection of datasets focused on deep learning in search. The first dataset was a question answering dataset featuring 100,000 real Bing questions and a human generated answer. Over time the collection was extended with a 1,000,000 question dataset, a natural language generation dataset, a passage ranking dataset, keyphrase extraction dataset, crawling dataset, and a conversational search.

847 PAPERS • 7 BENCHMARKS


  • Open access
  • Published: 05 June 2024

Scaling neural machine translation to 200 languages

Nature (2024)

  • Communication
  • Computer science

The development of neural techniques has opened up new avenues for research in machine translation. Today, neural machine translation (NMT) systems can leverage highly multilingual capacities and even perform zero-shot translation, delivering promising results in terms of language coverage and quality. However, scaling quality NMT requires large volumes of parallel bilingual data, which are not equally available for the 7,000+ languages in the world 1 . Focusing on improving the translation qualities of a relatively small group of high-resource languages comes at the expense of directing research attention to low-resource languages, exacerbating digital inequities in the long run. To break this pattern, here we introduce No Language Left Behind—a single massively multilingual model that leverages transfer learning across languages. We developed a conditional computational model based on the Sparsely Gated Mixture of Experts architecture 2 , 3 , 4 , 5 , 6 , 7 , which we trained on data obtained with new mining techniques tailored for low-resource languages. Furthermore, we devised multiple architectural and training improvements to counteract overfitting while training on thousands of tasks. We evaluated the performance of our model over 40,000 translation directions using tools created specifically for this purpose—an automatic benchmark (FLORES-200), a human evaluation metric (XSTS) and a toxicity detector that covers every language in our model. Compared with the previous state-of-the-art models, our model achieves an average of 44% improvement in translation quality as measured by BLEU. By demonstrating how to scale NMT to 200 languages and making all contributions in this effort freely available for non-commercial use, our work lays important groundwork for the development of a universal translation system.


The recent advent of neural machine translation (NMT) has pushed translation technologies to new frontiers, but its benefits are unevenly distributed 1 . The vast majority of improvements made have mainly benefited high-resource languages, leaving many low-resource languages behind. (For the purpose of our research, we define a high-resource language as a language for which we have at least 1 million sentences of aligned textual data (or bitext) with another language). This disparity could largely be attributed to a data gap: NMT models typically require large volumes of data to produce quality translations and, by definition, these volumes are not available for lower-resource languages. The No Language Left Behind (NLLB-200) project seeks to overcome this limitation by leveraging previously unknown approaches for building massively multilingual models with cross-lingual transfer abilities 8 , 9 , thereby enabling related languages to learn from each other 1 , 10 , 11 .

It has now been widely acknowledged that multilingual models have demonstrated promising performance improvement over bilingual models 12 . However, the question remains whether massively multilingual models can enable the representation of hundreds of languages without compromising quality. Our results demonstrate that doubling the number of supported languages in machine translation and maintaining output quality are not mutually exclusive endeavours. Our final model—which includes 200 languages and three times as many low-resource languages as high-resource ones—performs, as a mean, 44% better than the previous state-of-the-art systems. This paper presents some of the most important data-gathering, modelling and evaluation techniques used to achieve this goal.

First, compared with their high-resource counterparts, training data for low-resource languages are expensive and logistically challenging to procure 13 , 14 , 15 . Publicly available digital resources are either limited in volume or difficult for automated systems to detect (particularly in large public web datasets such as CommonCrawl). Regardless of whether a critical mass of human-translated seed data can be collected, sufficient data acquisition ultimately relies on large-scale data mining and monolingual data pipelines 16 , 17 , 18 , 19 . The latter techniques are often affected by noise and biases, making it tedious to validate the quality of the datasets they generate 20 . In NLLB-200, we show that a distillation-based sentence encoding technique, LASER3 (ref. 21 ), facilitates the effective mining of parallel data for low-resource languages.

Second, on the modelling side, we use an assemblage of seed, mined, open-source and back-translated datasets to train multilingual conditional computational models (more specifically, Sparsely Gated Mixtures-of-Experts models 2 , 3 , 4 , 5 , 6 , 7 that enable cross-lingual transfer between related languages without increasing interference between unrelated languages). We show how we can achieve state-of-the-art performance with a more optimal trade-off between cross-lingual transfer and interference, and improve performance for low-resource languages.

Finally, for the purpose of quality evaluation, we created FLORES-200—a massive multilingual benchmark that enables the measurement of translation quality across any of the approximately 40,000 translation directions covered by the NLLB-200 models. Apart from automatic metrics, we also created Cross-lingual Semantic Text Similarity (XSTS) and Evaluation of Toxicity (ETOX). XSTS is a human evaluation protocol that provides consistency across languages; ETOX is a tool to detect added toxicity in translations using toxicity word lists.

Beyond creating these models, we also reflect on the potential societal impact of NLLB. To amplify the practical applicability of our work in service of low-resource-speaking communities, we provide all the benchmarks, data, code and models described in this effort as resources freely available for non-commercial use ( https://github.com/facebookresearch/fairseq/tree/nllb ) (see Data and Code availability statements for details).

Automatically creating translation training data

The current techniques used for training translation models are difficult to extend to low-resource settings, in which aligned bilingual textual data (or bitext data) are relatively scarce 22 . Many low-resource languages are supported only by small targeted bitext data consisting primarily of translations of the Christian Bible 23 , which provide limited domain diversity.

To build a large-scale parallel training dataset that covers hundreds of languages, our approach centres around extending existing datasets by first collecting non-aligned monolingual data. Then, we used a semantic sentence similarity metric to guide a large-scale data mining effort aiming to identify sentences that have a high probability of being semantically equivalent in different languages 18 .

Language identification for monolingual data collection

Collecting monolingual data at scale requires a language identification (LID) system that accurately classifies textual resources for all NLLB-200 languages. Although LID could be seen as a solved problem in some domains 24 , it remains an open challenge for web data 25 , 26 . Specifically, issues coalesce around domain mismatch 26 , similar language disambiguation 27 and successful massively multilingual scaling 28 .

Devoted attention to advancing LID techniques led to a noticeable increase in both language coverage and accuracy over time. CLD3 ( https://github.com/google/cld3 ) and fasttext 29 are two readily available models offering high detection performance for 107 and 187 languages, respectively. By using numerous public datasets, previous studies 30 , 31 report even higher coverage—464 and 1,366 languages, respectively. Another study 32 scales LID performance up to 1,629 languages using word lists and self-supervision to bootstrap training data found on the web. However, these approaches using found data suffer from domain imbalance. That is, because the available text domains vary by language, classifiers conflate different domains with different languages.

In our work, we curated FLORES-200 to use as a development set so that our LID system performance 33 is tuned over a uniform domain mix. Our approach combines a data-driven fasttext model trained on FLORES-200 with a small set of handwritten rules to address human feedback on classification errors. These rules are specifically mentioned in section 5.1.3 of ref.  34 and include linguistic filters to mitigate the learning of spurious correlations due to noisy training samples while modelling hundreds of languages.

We compare our LID model with three publicly available models: CLD3, LangId ( https://github.com/saffsd/langid.py ) and LangDetect ( https://pypi.org/project/langdetect/ ). Table 1 reports the performance on three cascading sets of languages intersecting with NLLB-200: (1) 51 languages also supported by LangId, LangDetect and CLD3; (2) 78 languages also supported by LangId and CLD3; (3) 95 languages also supported by CLD3. We also report false-positive rates (FPR) to reflect the impact of false positives on unseen languages. Our results show that our model is equipped to handle all 200 languages found in FLORES-200 while achieving notably higher performance than LangId, LangDetect and CLD3. Furthermore, the gain in F1 score is accompanied by a notable improvement in FPR, suggesting a much stronger fit for extracting low-resource languages from web corpora 32 .

Mining for bitext

Previous work 35 notes that translation quality generally increases with the amount of high-quality training data, which is difficult to procure when working with low-resource languages. Existing parallel corpora for low-resource languages are often conveniently drawn from known multilingual collections, such as the Christian Bible or the publications of multinational organizations, which are limited in quantity and domain. To overcome this problem, we created training datasets through global bitext mining in publicly available web content (drawn from repositories such as CommonCrawl). The underlying idea of our bitext mining approach is first to learn a multilingual sentence embedding space and use a similarity measure in that space to decide whether two sentences are parallel. This comparison can be done for all possible pairs in two collections of monolingual texts.

As our mining approach requires a multilingual embedding space, there are several challenges when scaling this representation to all NLLB-200 languages. First, we had to ensure that all languages were well learnt and that we accounted for large imbalances in available training data. Second, training a massively multilingual sentence encoder from scratch each time a new set of languages is introduced is computationally expensive. Furthermore, the main drawback of this approach is that the learnt embedding spaces from each new model are not necessarily mutually compatible. This can make mining intractable because, for each new encoder, the entirety of available monolingual data needs to be re-embedded (for example, for English alone, this means billions of sentences and considerable computational resources). We solved this problem using a teacher–student approach 21 that extends the LASER embedding space 36 to all NLLB-200 languages. Languages are trained either as individual students or together with languages from the same family. The training of students follows the approach described in ref. 21 .

Our approach enables us to focus on the specifics of each language while taking advantage of related languages, which is crucial for dealing with very low-resource languages. (A language is defined as very low-resource if it has fewer than 100,000 samples across all pairings with any other language in our dataset). Using this method, we generated more than 1.1 billion new sentence pairs of training data for 148 languages. This additional training data, paired with back translation (a conventional technique for data augmentation in NMT; ref. 37 ), ushered in notable improvements in translation quality—specifically, +12.5 chrF++ (ref. 38 ) for translating very low-resource languages into English. For more details, see Supplementary Information D .

Even with marked increases in data volume, the main challenge of low-resource translation remains training models that adequately represent 200 languages while adjusting to variable data capacity per language pair. Apart from techniques such as data augmentation (for example, with back translation) and self-supervision strategies on monolingual data, we used conditional computational models—more specifically, Sparsely Gated Mixture of Experts (henceforth MoE)—to minimize interference between unrelated language directions.

MoE transformer models differ from dense transformer models in that some of the feed-forward network layers are replaced with MoE layers in both the encoder and the decoder. An MoE layer consists of E experts (each is a feed-forward network) and a gating network to decide how to route input tokens to experts. The transformer encoder–decoder model, supplemented with MoE layers and their respective gating networks, learns to route input tokens to the corresponding top two experts by optimizing a linearly weighted combination of label-smoothed cross entropy 39 and an auxiliary load balancing loss 6 .
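To make the routing concrete, below is a minimal PyTorch sketch of an MoE feed-forward layer with top-2 gating and an auxiliary load-balancing term in the spirit of refs. 4 and 6. The dimensions, the expert definition and the exact loss formulation are illustrative simplifications, not the NLLB-200 implementation.

```python
# A minimal sketch of a Sparsely Gated MoE layer with top-2 routing.
# Illustrative only: real implementations add expert capacity limits,
# expert parallelism and renormalized gate weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int, k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        self.gate = nn.Linear(d_model, num_experts)  # gating network
        self.k = k

    def forward(self, x):  # x: (tokens, d_model), batch/sequence pre-flattened
        probs = F.softmax(self.gate(x), dim=-1)      # (tokens, E)
        top_p, top_i = probs.topk(self.k, dim=-1)    # route each token to top-2 experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                sel = top_i[:, slot] == e
                if sel.any():
                    out[sel] += top_p[sel, slot].unsqueeze(-1) * expert(x[sel])
        # Auxiliary load-balancing loss: pushes the fraction of tokens and the
        # mean gate probability assigned to each expert towards uniformity.
        frac_tokens = F.one_hot(top_i[:, 0], len(self.experts)).float().mean(0)
        aux_loss = len(self.experts) * (frac_tokens * probs.mean(0)).sum()
        return out, aux_loss
```

The training objective would then combine label-smoothed cross entropy with a small multiple of `aux_loss`, as the linearly weighted combination described above.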

We find that vanilla MoE models with overall dropout are suboptimal for low-resource languages and significantly overfit on low-resource pairs. To remedy this issue, we designed Expert Output Masking (EOM), a regularization strategy specific to MoE architectures, and compared it with existing regularization strategies, such as Gating Dropout 40 . We find that Gating Dropout performs better than vanilla MoE with overall dropout but is outperformed by EOM.

To further reduce overfitting on low-resource language pairs, we devised a curriculum learning strategy that introduces language pairs in phases during model training. Pairs that empirically overfit within K updates are introduced K updates before the end of training. This reduces overfitting while allowing pairs that benefit from additional training to continue learning. Table 2 shows that combining curriculum learning and EOM improves performance, especially on low and very low-resource language pairs (see section ‘Modelling’ for more details).

To understand how MoE models are helpful for multilingual machine translation, we visualize similarities of experts in the MoE layers using heat maps (Fig. 1a–d ). These heat maps demonstrate that in late decoder layers (Fig. 1d ), languages are being separated (that is, dispatched to different sets of experts). Moreover, we observe that languages within the same family are highly similar in their choice of experts (that is, the late decoder MoE layers are language-specific). This is particularly the case for the Arabic dialects (the six rows and columns in the top-left corner), languages in the Benue–Congo subgrouping, as well as languages in the Devanagari script. By contrast, the early decoder MoE layers (Fig. 1c ) seem to be less language-specific. The late encoder MoE layers are particularly language-agnostic in how they route tokens as can be attested by the uniform heat map in Fig. 1b .

Figure 1: Expert similarity heat maps. a–d, The first (a) and last (b) encoder layers, and the first (c) and last (d) decoder layers. Similarity is measured with respect to the gating decisions (expert choice) per language (source side in the encoder and target side in the decoder). Lighter colours represent higher expert similarity and hence more language-agnostic processing.

Combining data (see section ‘ Automatically creating translation training data ’) and modelling contributions, Table 3 shows that NLLB-200 outperforms the nearest state-of-the-art system by almost +7.3 spBLEU (ref.  41 ) on average, constituting a 44% improvement. We then compared NLLB-200 with a few other state-of-the-art models, such as Deepnet 42 and M2M-100 (ref.  1 ), to report scores for 87 languages against FLORES-101. On this smaller subset, NLLB-200 again outperforms by +7.0 spBLEU on average. Overall, the results show that NLLB-200 improves on state-of-the-art systems by a notable margin despite supporting 200 languages, or twice as many languages (and more than 30,000 additional directions) compared with any previous work. We also show in additional experiments that NLLB-200 is a general-purpose NMT model, transferable to other domains by fine-tuning on small quantities of high-quality bitexts (see Supplementary Information E.3 ).

Evaluations

Among the many aspects of model performance that can be evaluated 43 , this section emphasizes three aspects that have a marked impact on the overall quality assessment: benchmarks for automatic evaluation, human evaluation protocols and toxicity evaluation.

A benchmark for automatic evaluation using FLORES-200

The quality of NMT outputs is typically evaluated by automatic metrics such as BLEU 44 or spBLEU 41 . The computation of automatic quality scores using these metrics requires benchmark datasets that provide gold-standard human translations as references. In turn, the apples-to-apples evaluation of different approaches made possible by these benchmark datasets gives us a better understanding of what requires further research and development. For example, creating benchmark data sets at the Workshop on Machine Translation (WMT) 45 led to rapid progress in translation directions such as English to German and English to French.

For massively multilingual NMT, the largest benchmark dataset available was FLORES-101, which supports roughly half the number of languages in NLLB-200. The necessary expansion of FLORES-101 to FLORES-200 constitutes a further challenge in terms of quality assurance, in part because of differences in standardization practices and limited access to professional translators for all languages involved. To overcome this challenge, we adapted our workflow to pay particular attention to quality assurance mechanisms. The FLORES-200 workflow consists of four phases: (1) alignment; (2) translation, initial quality assurance and iteration(s); (3) final quality assurance; and (4) completion. A language's FLORES-200 set is considered ready after passing a final human quality test with a score of at least 90 out of 100 (that is, independent raters agreed with 90% of the FLORES-200 reference translations in that direction).

As a result of this redesigned workflow, we produced a three-split (dev, devtest, test) data set of parallel human reference translations for all NLLB-200 languages meeting the 90% quality threshold in a maximum turnaround time of 287 days (119 days on average, 70 days minimum). (Note that to avoid leakage with our models, we filtered data from FLORES and other evaluation benchmarks used (such as WMT and IWSLT) from our training data. This was done by comparing the hashes of training sentences against those of evaluation sentences, using the xxHash algorithm). Please refer to Supplementary Information C for more details on the evaluation process. Figure 2 shows the quality scores for all languages, some of which are labelled as examples.

Figure 2: Quality assurance scores for the languages in FLORES-200. The minimum acceptable standard is 90%.

Reliable human evaluation

State-of-the-art automatic metrics often fail to capture aspects of language that, while subtle, can have a notable bearing on translation quality. Human evaluations are, therefore, essential to ensuring meaningful quality assessments 46 . That said, relying on them comes with two challenges: (1) any large-scale human evaluation of NMT quality, regardless of the number of translation directions involved, contends with potentially low inter-evaluator agreement (in the vicinity of 0.5 kappa); and (2) massively multilingual NMT introduces another complexity—that of quality evaluation consistency across language directions. We address these two issues by developing XSTS 47 , a new scoring metric focused on meaning, and by using a protocol that allows for the calibration of scores across evaluators and language pairs.

XSTS is a human evaluation protocol inspired by STS 48 , emphasizing meaning preservation over fluency. XSTS uses a five-point scale, in which 1 is the lowest score, and 3 represents the acceptability threshold. To ensure consistency not only across languages but also among different evaluators of any given language, we included the same subset of sentence pairs in the full set of sentence pairs given to each evaluator, making it possible to calibrate results.

We find that automated metrics such as spBLEU and chrF++ correlate reasonably well with calibrated human evaluations of translation quality, as shown in Fig. 3 . Spearman’s R correlation coefficients between aggregated XSTS and spBLEU, chrF++ (corpus) and chrF++ (average sentence-level) are 0.710, 0.687 and 0.694, respectively. Other correlation coefficients (Kendall’s τ and Pearson’s R ) have the same ordering. Corpus spBLEU provides the best nominal correlation, followed by average sentence-level chrF++.

Figure 3: a, The relationship between spBLEU and XSTS. b, The relationship between chrF++ and XSTS. c, The relationship between average sentence-level chrF++ and XSTS. All automated scores were computed only on the sentences evaluated for a given model and translation direction (either the full FLORES-200 dataset or a subset). NLLB-200 refers to a 55B-parameter MoE model, and NLLB-200 Baseline refers to a dense 3.3B-parameter model.

We also find that calibrated human evaluation scores correlate more strongly with automated scores than uncalibrated human evaluation scores across all automated metrics and choices of correlation coefficient. In particular, uncalibrated human evaluation scores have a Spearman’s R correlation coefficient of 0.625, 0.607 and 0.611 for spBLEU, chrF++ (corpus) and chrF++ (average sentence-level), respectively.

Overall, a sample of 55 language directions was evaluated, including 8 into English, 27 out of English and 20 other direct language directions. The overall mean of calibrated XSTS scores was 4.26, with 38/55 directions scoring over 4.0 (that is, high quality) and 52/55 directions scoring over 3.0.

We hypothesize that added toxicity may be due to the presence of toxicity in the training data, and we used our detectors to estimate, more specifically, unbalanced toxicity in the bitext data. We find that estimated levels of unbalanced toxicity vary from one corpus of bitext to the next and that unbalanced toxicity can be largely attributed to misaligned bitext. In other words, training with this misaligned bitext could encourage mistranslations with added toxicity.

To mitigate this issue, we designed a bitext filtering procedure based on the detection of multiple instances of added toxicity (that is, cases in which one sentence in the bitext pair contains at least two more toxic items than the other sentence in the pair). (A previous detector quality analysis showed that a higher precision was reached in this situation). We added this toxicity filtering procedure as an option to the filtering process and ran experiments with and without it for comparison.

The experimental results on the FLORES-200 dev set for 10 translation directions (from and into English for Somali, Southern Sotho, Twi, Umbundu and Venetian) show that after filtering around 30% of the parallel sentences on average, translation quality (chrF++) improves by 5% and added toxicity (ETOX) decreases by the same amount. The filtering pipeline that includes toxicity filtering therefore not only reduces the number of toxic items in the translation output but also improves overall translation performance.

In 2016, the United Nations declared internet access a basic human right. Although the intent of this declaration was to limit censorship and allow for information and ideas to flow without interference, much of the internet today remains inaccessible to many due to language barriers. Our effort was designed to contribute one solution to help alter this status quo.

For many low-resource language communities, NLLB-200 is one of the first models designed to support translation into or out of their languages. Although applications of these new translation capabilities could be found in several domains of everyday life, we believe their impact would be most significant in a domain such as education. In formal educational settings, for instance, students and educators belonging to low-resource language groups could, with the help of NLLB-200, tap into more books, research articles and archives than before. Within the realms of informal learning, low-resource language speakers could experience greater access to information from global news outlets and social media platforms, as well as online encyclopaedias such as Wikipedia. Access to machine translation motivates more low-resource language writers or content creators to share localized knowledge or various aspects of their culture. Giving individuals access to new translation tools could thus open up opportunities for bidirectional learning, thereby also challenging Western-centric modes of knowledge production and dissemination, ultimately aiding in revitalizing certain minority cultures and languages.

Since launching NLLB-200, we can already see the impact of the model across many directions. Four months after launch, Wikimedia reported that our model was the third most used machine translation engine among Wikipedia editors (accounting for 3.8% of all published translations) ( https://web.archive.org/web/20221107181300/https://nbviewer.org/github/wikimedia-research/machine-translation-service-analysis-2022/blob/main/mt_service_comparison_Sept2022_update.ipynb ). Compared with other machine translation services and across all languages, articles translated with NLLB-200 have the lowest percentage of deletions (0.13%) and the highest percentage of translations whose modification rate is kept under 10%.

In many ways, the composition of the NLLB-200 effort speaks to the centrality of interdisciplinarity in shaping our vision. Machine translation and AI advancements lie at the intersection of technological, cultural and societal development, and thus require scholars with diverse training and standpoints to fully comprehend every angle 49 , 50 . It is our hope that in future iterations, NLLB-200 continues to include scholars from fields underrepresented in the world of machine translation and AI, particularly those from humanities and social sciences backgrounds. More importantly, we hope that teams developing these initiatives would come from a wide range of race, gender and cultural identities, much like the communities whose lives we seek to improve.

Finally, we want to emphasize that overcoming the challenges that prevent the web from being accessible to speakers of all languages requires a multifaceted approach. At the technical level, NLLB-200 overcomes many data, modelling and evaluation challenges in NMT research, but it still has its limitations, some of which are documented in Supplementary Information G . As a single technological intervention, NLLB-200 is but one piece of a massive puzzle; policy interventions aimed at more fundamental issues surrounding education, internet access and digital literacy are imperative to eradicate the structural problem of language disparities.

This section describes the steps taken to design our language identification system and bitext mining protocol.

Language identification

To train language identification models, we used fasttext 33 , 51 , which has been widely used for text classification tasks because of its simplicity and speed. We embedded character-level n -grams from the input text and leveraged a multiclass linear classifier on top. The lightweight nature of fasttext enables our LID models to handle web-scale data. Furthermore, a linear model has the benefit of being easily explainable, allowing us to trace any classification error back to its root cause. This is instrumental in addressing common pitfalls that arise when detecting language on web corpora 32 .

Classifier design

We experimented with two different designs. First, we used a combination of multiple binary classifiers in which the final decision was obtained by selecting the language with the highest score after applying a threshold. We applied threshold optimization so that when the confidence of a classifier is low, the corresponding language is not considered for the final decision. A sentence was filtered out if none of the classifiers surpassed its threshold. Second, we built a multiclass classifier using softmax over all possible languages. In this case, the threshold optimization is done after the softmax.

Our results directed us to focus on the second approach, which offers several advantages. First, changing the threshold for one language did not affect the performance of the others (which is not true in the first setting). Second, this approach generalizes better to out-of-domain data, which is our primary use case (Wikipedia → web data). Finally, a single classifier has the added benefit of being computationally simpler, thus streamlining the language identification process.
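As an illustration of this second design, the sketch below applies per-language thresholds after the softmax; label names and threshold values are hypothetical.

```python
# A sketch of multiclass LID with post-softmax thresholding: the top-scoring
# language wins only if it clears its own threshold; otherwise the sentence is
# discarded. Values are illustrative.
import numpy as np

def classify(probs: np.ndarray, labels: list[str], thresholds: dict[str, float]):
    """probs: softmax scores over all languages for one sentence."""
    best = int(np.argmax(probs))
    lang = labels[best]
    if probs[best] < thresholds.get(lang, 0.5):
        return None  # filtered out: confidence below this language's threshold
    return lang

print(classify(np.array([0.7, 0.2, 0.1]),
               ["eng_Latn", "fra_Latn", "oci_Latn"],
               {"eng_Latn": 0.6}))  # -> "eng_Latn"
```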

Training data and handling massive class imbalance

We used publicly available datasets to train our LID system, partially covering our languages of interest. The public datasets deployed were mostly built from web pages such as CommonCrawl. We then supplemented these with NLLB-Seed data (Supplementary Information B ) for any missing languages. However, this supplementation is insufficient to ensure balance in the raw training data 30 , 32 . For example, English alone represents 10.1% of our training data, whereas Minangkabau (Latin script) represents only 0.06%. Following ref. 10 , we experimented with multiple settings of temperature upsampling for underrepresented languages, in which sentences from a language l representing $p_l$ per cent of the dataset are sampled proportionally to $p_l^{1/T}$. Optimal performance was obtained at 1/T = 0.3 (for more details, see section 5.1 of ref. 34 ).
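The effect of the temperature can be seen in a short sketch; the corpus shares below are the two examples quoted above.

```python
# Temperature upsampling: a language holding a share p_l of the corpus is
# sampled proportionally to p_l ** (1/T). With 1/T = 0.3, the English-to-
# Minangkabau sampling ratio shrinks from ~168x to ~4.7x.
shares = {"eng_Latn": 0.101, "min_Latn": 0.0006}
inv_T = 0.3
weights = {lang: p ** inv_T for lang, p in shares.items()}
total = sum(weights.values())
probs = {lang: w / total for lang, w in weights.items()}
print(probs)
```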

Training parameters

Our best-performing model was trained with softmax loss over two epochs, with a learning rate of 0.8 and embeddings with 256 dimensions. We discarded words with fewer than a thousand occurrences after upsampling, and selected minimum and maximum character n-gram lengths of two and five, respectively, with n-grams hashed into 1,000,000 buckets. (In fasttext, a ‘word’ is a string separated by spaces; for a non-segmenting language, the whole sentence is a single ‘word’, from which we take character n-grams). All hyperparameters were tuned on FLORES-200 dev (see section 5.1.2 of ref. 34 ).
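For concreteness, these hyperparameters map onto fasttext's Python API roughly as follows; the input path is a placeholder and the call is a sketch, not our exact training script.

```python
# A sketch of the fasttext training call implied by the hyperparameters above.
# Training lines follow fasttext's "__label__<lang> <sentence>" format.
import fasttext

model = fasttext.train_supervised(
    input="lid_train.txt",   # hypothetical path to the upsampled training data
    loss="softmax",          # multiclass softmax classifier
    epoch=2,
    lr=0.8,
    dim=256,                 # embedding dimensions
    minCount=1000,           # discard words with fewer than 1,000 occurrences
    minn=2, maxn=5,          # character n-gram lengths
    bucket=1_000_000,        # hash buckets for n-gram features
)
print(model.predict("Ceci est une phrase en français."))
```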

Improving LID with linguistic analysis

Language identification is a challenging task in which numerous failure modes exist, often exacerbated by the gaps between the clean data on which LID models are trained and noisy data on which LID models are applied. In other words, LID models trained in a supervised manner on fluently written sentences may have difficulty identifying grammatically incorrect and incomplete strings extracted from the web. Furthermore, models can easily learn spurious correlations that are not meaningful for the task itself. Given these challenges, we collaborated closely with a team of linguists throughout different stages of LID development to identify proper focus areas, mitigate issues and explore solutions (see section 5.1.3 of ref.  34 ).

Bitext mining

The overall approach for bitext mining focused on starting with a massively multilingual sentence encoder teacher model and adapting it to several different low-resource student models. This approach enabled us to add low-resource languages without competing with high-resource languages for capacity. Doing so circumvents the need to retrain the entire model from scratch while maintaining compatibility with the multilingual embedding spaces for subsequent mining. Extended data Fig. 1 summarizes the overall architecture of the teacher–student approach. The teacher, LASER2, is an improved version of the open-source LASER encoder ( https://github.com/facebookresearch/LASER ). The original training procedure 36 was adapted to include SentencePiece tokenization (including a vocabulary of 7,000 tokens) and the upsampling of low-resource languages.

The architecture of the five-layer BiLSTM encoder and the max pooling method to obtain sentence embeddings were left unchanged. The training was then performed on the same 93 languages with public resources obtained from OPUS 52 . See ref.  36 for details on the original LASER training procedure. Training of the students followed the approach described in greater detail in ref.  21 , summarized below:

students specialized in one language or several similar languages;

students were randomly initialized because we wanted to handle low-resource languages for which we did not have a pre-trained language model;

students may have a dedicated SentencePiece vocabulary different from the teacher to better accommodate scripts and tokens in the student languages;

as we used cosine distance for bitext mining (Fig. 1 ), students learnt to minimize the cosine loss with the teacher;

students can have an MLM loss to leverage student language monolingual data (Fig. 1 ).

Our student encoders used a 12-layer transformer with a hidden size of 1,024, four attention heads and around 250 million parameters. All students were trained on the available bitexts for their respective languages, complemented by 2 million sentences of English/English and English/Spanish data. The motivation behind this approach is to anchor the students to the English embedding space, increase robustness by including English/Spanish bitexts from CCMatrix and allow for the joint learning of new languages. This technique is particularly useful when only limited amounts of bitext are available to train the students. Teacher–student training was performed on 16 GPUs with the ADAM optimizer, a learning rate of 0.0005 and a batch size of 10,000. We trained student encoders for 148 languages and named these models LASER3.

Proxy metric for new encoders

Mined bitexts were subsequently used to improve translation quality for the languages of NLLB-200. However, mining and NMT training are computationally expensive, and it is intractable to perform this evaluation systematically for many different sentence encoder variants. As an evaluation proxy, we used a mining-based multilingual similarity search error rate, referred to here as xsim. In contrast to cosine accuracy, which aligns embeddings based on the highest cosine score, xsim aligns source and target embeddings based on the highest margin score, which has been shown to be beneficial in mining 53 . The margin-based score is defined as

$$\mathrm{score}(x,y)=\mathrm{margin}\left(\cos(x,y),\ \sum_{z\in \mathrm{NN}_k(x)}\frac{\cos(x,z)}{2k}+\sum_{z\in \mathrm{NN}_k(y)}\frac{\cos(y,z)}{2k}\right)$$

where $x$ and $y$ are the source and target sentences, and $\mathrm{NN}_k(x)$ denotes the k nearest neighbours of $x$ in the other language. We set k to 4. All xsim results are calculated on FLORES-200 devtest, using the ratio margin, where margin(a, b) = a/b. Moreover, all scores are calculated for translations into English (that is, xxx → eng). English is encoded by the teacher, and the other language is encoded by the LASER3 student. To facilitate further research using xsim, we also provide this evaluation method as an open-source resource ( https://github.com/facebookresearch/LASER/ ).
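A small numpy sketch of the ratio-margin computation may help; it assumes L2-normalized embeddings (so dot products are cosines) and scores all pairs densely, whereas real mining relies on approximate nearest-neighbour search.

```python
# Ratio-margin scoring for candidate pairs: cos(x, y) divided by the average
# cosine similarity to the k nearest neighbours of x and of y in the other
# language (the a / b ratio margin with k = 4, as in the definition above).
import numpy as np

def margin_scores(src: np.ndarray, tgt: np.ndarray, k: int = 4) -> np.ndarray:
    """src: (n, d) and tgt: (m, d) L2-normalized sentence embeddings."""
    cos = src @ tgt.T                                    # (n, m) cosines
    knn_src = np.sort(cos, axis=1)[:, -k:].mean(axis=1)  # mean sim over NN_k(x)
    knn_tgt = np.sort(cos, axis=0)[-k:, :].mean(axis=0)  # mean sim over NN_k(y)
    denom = (knn_src[:, None] + knn_tgt[None, :]) / 2    # sum of the two 1/(2k) terms
    return cos / denom                                   # ratio margin
```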

End-to-end encoder evaluation

Once we had identified the best sentence encoder for each language using the xsim scores, we performed mining, added the mined data to the existing bitexts and trained a bilingual NMT system. Initial experiments indicated that a margin threshold of 1.06 was the best compromise between precision and recall for most languages. For these NMT baselines, we do not apply extra filtering on the bitexts and leave this to the training procedure of our massively multilingual NMT system.

We did not attempt to optimize the architecture and parameters of the bilingual NMT systems to the characteristics of each language pair but used the same architecture for all. Therefore, the reported results should not be interpreted as the best possible ones given the available resources—they are mainly provided to validate the mined bitexts. We used a 12-layer encoder and decoder and trained for 100 epochs. Moreover, we looked for the best performance on the FLORES-200 development set and report detokenized BLEU on the FLORES-200 devtest.

In this section, we first describe the multilingual machine translation task setup, which includes tokenization and base model architecture. Then, we outline how we leveraged conditional computation for massively multilingual machine translation with EOM regularization and our Curriculum Learning (CL) strategy for low-resource languages.

We modelled multilingual NMT as a sequence-to-sequence task, in which we conditioned on an input sequence in the source language with an encoder and generated the output sequence in the expected target language with a decoder 54 . With the source sentence S , source language ℓ s , and target language ℓ t in hand, we trained to maximize the probability of the translation in the target language T —that is, P ( T ∣ S ,  ℓ s ,  ℓ t ). Below, we discuss details of the (1) tokenization of the text sequences in the source and target languages; and (2) model architecture with the input and output designed specifically for multilingual machine translation. For further details on the task setup, such as the amount of training data per language pair, please refer to Supplementary Information  F or section 8 of ref.  34 .

Segmentation with SentencePiece

To tokenize our text sequences, we trained a single SentencePiece model (SPM) 55 for all languages. We sampled a total of 100 million sentences from primary bitext data. To ensure low-resource languages are well-represented in the vocabulary, we downsampled high-resource and upsampled low-resource languages with a sampling temperature of five (ref.  10 ). Notably, vocabulary size is an important hyperparameter in multilingual translation models involving low-resource languages 56 , 57 , 58 . The vocabulary size of our trained SPM model is 256,000. Such a large vocabulary ensures adequate representation across the wide spectrum of languages we support.
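A sketch of the corresponding SentencePiece training call is below; the input path and character coverage are assumptions, not the exact SPM-200 recipe.

```python
# Training a single SentencePiece model over a temperature-sampled corpus with
# the 256,000-token vocabulary stated above. Paths and coverage are illustrative.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="sampled_sentences.txt",  # hypothetical temperature-sampled 100M-sentence file
    model_prefix="spm200_sketch",
    vocab_size=256_000,
    character_coverage=0.9999,      # keep rare scripts representable
)
sp = spm.SentencePieceProcessor(model_file="spm200_sketch.model")
```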

Model architecture

Our sequence-to-sequence multilingual machine translation model is based on the transformer encoder–decoder architecture 59 . The encoder transforms the source token sequence into a sequence of token embeddings. Then, the decoder attends to the encoder output and autoregressively generates the target sentence token by token. More precisely, the encoder takes the sequence of tokens $W = (w_1, \ldots, w_S)$ and the source language $\ell_s$, and produces a sequence of embeddings $H = (h_1, \ldots, h_S)$, which are then provided to the decoder with the target language $\ell_t$ to produce the target tokens $V = (v_1, \ldots, v_T)$ sequentially. In sum,

$$H=\mathrm{encoder}(W,\ell_s),\qquad v_i=\mathrm{decoder}(H,\ell_t,v_1,\ldots,v_{i-1})\quad\text{for } i=1,\ldots,T.$$

Note that we prefixed the source sequence with the source language, as opposed to the target language, as done in previous work 10 , 60 . We did so because we prioritized optimizing the zero-shot performance of our model on any pair of 200 languages at a minor cost to supervised performance. Empirically, we find zero-shot performance to be negatively affected when conditioning the encoder on the target language. When the source is conditioned on only the source language, the encoder generalizes better to pairs of source and target languages not encountered during training 1 .
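The sketch below illustrates this conditioning scheme; the special-token spellings are hypothetical and not the exact NLLB-200 vocabulary entries.

```python
# Source-language conditioning (sketch): the encoder input is prefixed with the
# source-language token, and generation is primed with the target-language
# token on the decoder side. Token spellings here are illustrative.
def build_example(src_tokens, src_lang, tgt_lang, eos="</s>"):
    encoder_input = [f"__{src_lang}__"] + src_tokens + [eos]
    decoder_prefix = [eos, f"__{tgt_lang}__"]  # decoding continues from here
    return encoder_input, decoder_prefix

enc_in, dec_prefix = build_example(["Hello", ",", "world", "!"],
                                   "eng_Latn", "fra_Latn")
print(enc_in, dec_prefix)
```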

Conditional computation for multilingual machine translation

A massively multilingual translation (MMT) model uses the same shared model capacity to train on several translation directions simultaneously. While doing so can lead to beneficial cross-lingual transfer between related languages, it can also add to the risk of interference between unrelated languages 1 , 61 . MoE models are a type of conditional computational models 62 , 63 that activate a subset of model parameters per input, as opposed to dense models that activate all model parameters per input. MoE models unlock marked representational capacity while maintaining the same inference and training efficiencies in terms of FLOPs compared with the core dense architecture.

However, as we increase the model capacity and the computational cost per update, the propensity for low or very low-resource languages to overfit increases, thus causing performance to deteriorate. In this section, we examine how we can use Sparsely Gated Mixture of Experts models 2 , 3 , 4 , 5 , 6 , 7 to achieve a more optimal trade-off between cross-lingual transfer and interference and improve performance for low-resource languages.

Sparsely gated mixture of experts

To build our MoE models, we substitute a quarter of the encoder and decoder feed-forward network layers with MoE layers, each with E distinct experts. We followed the Top- k -Gating algorithm in ref.  4 and dispatched each token to at most k  = 2 experts. For more details on the training of MoE models, see Supplementary Information  E .

Expert output masking

In this proposed regularization strategy, we masked the expert output for a random fraction ( p eom ) of the input tokens. For input tokens with dropped expert outputs, the first and/or second expert is effectively skipped. As shown in the second panel of Extended data Fig. 2 , we masked both experts for the first token ( x 1 in red), chose not to mask any of the expert outputs for the second token ( x 2 in blue) and in the final scenario, masked only one expert for the last token ( x 3 in green).
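A minimal sketch of EOM follows; it masks each routed expert's output independently per token, so a token can lose its first expert, its second, both or neither. Rescaling details and the exact masking granularity are simplifications.

```python
# Expert Output Masking (sketch): zero the output of each of the two routed
# experts independently with probability p_eom, before the residual addition.
import torch

def expert_output_masking(out1: torch.Tensor, out2: torch.Tensor,
                          p_eom: float, training: bool = True) -> torch.Tensor:
    """out1, out2: (tokens, d_model) outputs of the top-1 and top-2 experts."""
    if not training or p_eom == 0.0:
        return out1 + out2
    keep1 = (torch.rand(out1.shape[0], 1, device=out1.device) > p_eom).float()
    keep2 = (torch.rand(out2.shape[0], 1, device=out2.device) > p_eom).float()
    return out1 * keep1 + out2 * keep2  # both, one, or neither expert survives
```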

Curriculum learning for MMT

Orthogonal to model-side regularization methods such as dropout, we explored regularizing MMT models by means of CL. We proposed starting training with high-resource pairs first, then introducing low-resource pairs—prone to overfitting—in later phases (see the sketch after this paragraph). To derive the phases of the curriculum, we first trained a vanilla MoE model (without CL) and then partitioned the translation directions into n bins {b_1, …, b_n}. If T is the total number of training updates, we introduced each bin b_i after T − k_i updates. We chose the timings (k_i) and the directions (b_i) added at each phase based on when we observed each language pair starting to overfit. See the step-based CL algorithm in ref. 64 for more on how the directions are partitioned, and Supplementary Information E.2 for the list of directions added at each stage.
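The phase logic can be sketched as follows; the bin names, k_i values and total update count are placeholders standing in for values derived from the vanilla run.

```python
# Step-based curriculum (sketch): bin b_i is introduced after T - k_i updates,
# i.e. it trains only during its final k_i updates. Numbers are illustrative.
T = 200_000  # total training updates
bins_k = {
    "high_resource": 200_000,      # present from the start (k_i = T)
    "low_resource": 60_000,        # introduced for the final 60k updates
    "very_low_resource": 30_000,   # introduced for the final 30k updates
}

def active_bins(update: int) -> list[str]:
    return [name for name, k in bins_k.items() if update >= T - k]

print(active_bins(100_000))  # -> ['high_resource']
print(active_bins(175_000))  # -> all three bins
```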

Automatic evaluation

Many automatic translation quality assessment metrics exist, including model-based ones such as COMET 65 and BLEURT 66 . Although model-based metrics have shown better correlation with human judgement in recent metrics shared tasks of the WMT 43 , they require training and are not easily extendable to a large set of low-resource languages. In this work, we rely on BLEU (and a variant of it) and chrF++. Both measures draw on the idea that translation quality can be quantified based on how similar a machine translation output is compared with that produced by a human translator.

BLEU and spBLEU

The BLEU score 44 has been the standard metric for machine translation evaluation since its inception two decades ago. It measures the overlap between machine and human translations by combining the precision of 1-grams to 4-grams with a brevity penalty. The main disadvantage of BLEU is that it is tokenization-dependent. Efforts such as sacrebleu 67 have taken strides towards standardization, supporting the use of community-standard tokenizers under the hood. However, these tokenizers do not extend to many languages. Reference 41 proposes spBLEU, a BLEU metric based on a standardized SentencePiece model (SPM) covering 101 languages, released alongside FLORES-101. In this work, we provide SPM-200 along with FLORES-200 to enable the measurement of spBLEU. (Our analyses demonstrate that there are minor differences between SPM-200 from FLORES-200 and SPM-100 from FLORES-101 when measuring on the FLORES-101 languages. The major advantage of SPM-200 is that it covers 200 languages. More details on SPM-200 are reported in section 8.1.1 of ref.  34 ).

The chrF++ score 38 overcomes a limitation of the BLEU score, which requires that a sentence be broken up into word tokens. However, some languages, such as Chinese or Thai, do not use spaces to separate words, and word segmentation tools may not be readily available. There is also a concern about highly agglutinative languages, for which BLEU fails to assign any credit to morphological variants. chrF++ overcomes these weaknesses by basing the overlap calculation on a character-level n-gram F-score (n ranging from 1 to 6), complemented with word unigrams and bigrams. In this work, we primarily evaluated using chrF++ with the settings from sacrebleu. However, when comparing with other published work, we used BLEU and spBLEU where appropriate.
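Both metrics are available through sacrebleu; in recent versions, CHRF with word_order=2 computes chrF++ and a SentencePiece-based tokenizer yields spBLEU. The tokenizer name may vary with the sacrebleu version, so treat the sketch below as indicative.

```python
# Scoring with sacrebleu: chrF++ via CHRF(word_order=2) and spBLEU via a
# SentencePiece-based tokenizer. Sentences are toy examples.
from sacrebleu.metrics import BLEU, CHRF

hyps = ["The cat sits on the mat."]
refs = [["The cat sat on the mat."]]   # one reference stream

chrfpp = CHRF(word_order=2)            # char n-grams + word uni/bigrams
spbleu = BLEU(tokenize="flores200")    # spBLEU tokenization (version-dependent name)

print(chrfpp.corpus_score(hyps, refs))
print(spbleu.corpus_score(hyps, refs))
```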

Human evaluation methodology

When building machine translation systems for thousands of different language pairs, a core question is which pairs reach certain levels of quality. Therefore, we needed meaningful scores that are comparable across language pairs.

XSTS evaluation protocol

We adapted the recently proposed XSTS methodology 47 . In short, XSTS is a human evaluation protocol focusing on meaning preservation above fluency. See details on this protocol in Supplementary Information F . For low-resource languages, translations are usually of poorer quality, and so we focused more on usable (that is, meaning-preserving) translations, even if they are not fully fluent. Compared with Direct Assessment 68 with a 5-point scale (the original Direct Assessment uses a 100-point scale), XSTS was found to yield higher inter-annotator agreement 47 . XSTS rates each source sentence and its machine translation on a 5-point scale, in which 1 is the lowest and 5 is the highest.

Calibration set

To enable meaningful scores that are comparable across language pairs, we asked each evaluator to provide assessments using the XSTS scale on precisely the same set of sentence pairs. This aims to identify annotators who have a systematic tendency to be harsher or more generous in their scoring, and to correct for this effect. The calibration set consists of the machine translation output paired with the reference translation only in English. Based on how evaluators used the XSTS scale on this calibration set, we adjusted their raw scores on the actual evaluation task to ensure consistency across evaluators. Although this monolingual calibration task does not precisely mimic the bilingual XSTS task, it is a reasonable first approximation and has been shown to increase the correlation between human and automatic metrics, primarily by reducing one source of ‘noise’ in the human evaluations: the lack of score calibration between annotators.
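One simple way to realize such a correction is an additive per-evaluator offset estimated on the shared calibration set. This is an illustrative assumption rather than the exact scheme used; several calibration methodologies were compared, as discussed below.

```python
# Calibration sketch: shift each evaluator's raw XSTS scores by the gap between
# their mean calibration-set score and the cross-evaluator mean, clamped to 1-5.
import numpy as np

def calibrate(raw: dict[str, list[float]], calib: dict[str, list[float]]):
    """raw / calib: evaluator id -> XSTS scores on the task / calibration set."""
    means = {e: float(np.mean(s)) for e, s in calib.items()}
    grand = float(np.mean(list(means.values())))
    return {e: [min(5.0, max(1.0, x + grand - means[e])) for x in scores]
            for e, scores in raw.items()}
```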

Obtaining aggregated human quality metrics from multiple studies

To obtain an aggregate human quality metric for each language direction in an evaluation study, we take the majority XSTS score (that is, the median score) for each sentence and average these majority scores over all evaluated sentences. In a given study, the aggregate human evaluation score for any translation direction $l_s \to l_t$ is

$$\bar{X}_{l_s\to l_t}=\frac{1}{N_{l_s\to l_t}}\sum_{(S,T)\in \mathcal{T}_{l_s\to l_t}}\ \operatorname*{median}_{1\le i\le M_{l_s\to l_t}}\left[X_{l_s\to l_t,i}(S,T)\right]$$

where $l_s$ and $l_t$ denote the source language and the target language, respectively; $X_{l_s\to l_t,i}(S,T)$ denotes the XSTS score of the $i$th evaluator who evaluates sentences in translation direction $l_s \to l_t$ for a source sentence $S$ and a target sentence $T$; $M_{l_s\to l_t}$ denotes the total number of evaluators who evaluate the (source, translation) sentence pair $(S,T)$ for that direction; and $\mathcal{T}_{l_s\to l_t}=\{(S_{l_s\to l_t,k},T_{l_s\to l_t,k})\mid 1\le k\le N_{l_s\to l_t}\}$ is the set of $N_{l_s\to l_t}$ (source, translation) sentence pairs being evaluated for translation direction $l_s \to l_t$.

Every evaluator in a given study $s$ is also asked to provide ratings for all or part of a calibration set $\mathcal{C}_s=\{(S_{s,k},T_{s,k})\mid 1\le k\le K_s\}$, where $S_{s,k}$ denotes the $k$th source sentence in the calibration set of study $s$, $T_{s,k}$ denotes the translated sentence corresponding to $S_{s,k}$, and $K_s=|\mathcal{C}_s|$ is the number of sentence pairs in the calibration set.

For each language direction evaluated in a study, we obtained the majority score on the calibration set as follows:

$$C_{l_s\to l_t}^{(s)}=\frac{1}{K_s}\sum_{(S,T)\in \mathcal{C}_s}\ \operatorname*{median}_{i}\left[X_{l,i}^{(s)}(S,T)\right]$$

where $X_{l,i}^{(s)}(S,T)$ denotes the XSTS score provided by the $i$th evaluator for the language direction $l_s \to l_t$ in study $s$, for a given source sentence $S$ and a translated sentence $T$ in the calibration set $\mathcal{C}_s$ of the study.

To obtain aggregated calibrated XSTS scores on the language direction level, we explored several different calibration methodologies. None of the calibration methods we investigated showed a marked difference in correlation with automated scores, and all calibration methodologies we explored provided superior correlation compared with uncalibrated XSTS scores. For more details on these calibration methodologies, see section 7.2 of ref.  34 .

Added toxicity detection for 200 languages

To enable toxicity detection at scale, we used a detector based on word lists. In this section, we provide more details about our toxicity definition and describe the detector (ETOX) and associated word lists.

Toxic content

Owing to the subjective nature of toxicity, definitions of toxic language can vary. We included items that are commonly referred to as vulgar or profane language. (Note that vulgar or profane language is not always necessarily toxic. Some common slang, for instance, may be considered vulgar but is not necessarily toxic). Moreover, we also included items associated with depictions of pornographic content or sexual acts, some frequently used hate speech expressions and some expressions tied to bullying. We also included items, vulgar or not, referring to body parts that are commonly associated with sexual practices.

The ETOX detector

We started with the assumption that general-purpose machine translation systems should remain faithful to the source content and not add any toxic elements during the translation process. We define toxic elements as word tokens or short phrases present in our lists. ETOX identifies added toxicity using the following two criteria: number of toxic items and matched or non-matched toxicity. A toxic item is considered detected if it is present in a line and surrounded by spaces or the start or end of a line. ETOX tracks the number of unique toxic items found in a line but does not count a phrase again if it has multiple occurrences. Matched toxicity indicates that the number of toxic items is the same in both the source and the translated content (that is, no added toxicity). Added toxicity is an instance of non-matched toxicity in which more toxic items are found in the translation output than in the source. For non-segmenting languages or some languages that use complex diacritics, space tokenization is insufficient to distinguish words from one another. In those cases, we used SentencePiece tokenization of both the sentence and toxicity word list.
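A minimal sketch of the matching rule is given below; the toxicity list is a placeholder, and the space-padding trick stands in for the space/line-boundary criterion (the SentencePiece tokenization used for non-segmenting languages is omitted).

```python
# ETOX-style added-toxicity detection (sketch): count unique word-list items
# that appear surrounded by spaces or line boundaries, each at most once, and
# flag pairs where the translation contains more items than the source.
def toxic_count(line: str, toxic_items: set[str]) -> int:
    padded = f" {line.lower()} "
    return sum(1 for item in toxic_items if f" {item} " in padded)

def has_added_toxicity(src: str, hyp: str, toxic_items: set[str]) -> bool:
    return toxic_count(hyp, toxic_items) > toxic_count(src, toxic_items)
```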

Toxicity-200 lists

Lists are based on professional translations from English, which were then heuristically adapted by linguists to better serve the target language. As toxicity is culturally sensitive, attempting to find equivalents in a largely multilingual setting constitutes a challenge when starting from one source language. To address this issue, translators were allowed to forgo translating some of the source items and add more culturally relevant items.

In the initial release of the Toxicity-200 lists, the average number of items in a toxicity detection list was 271 entries, whereas the median was 143. The median may be a better measure of central tendency than the mean, given that languages with rich inflectional morphology constitute extreme outliers (for example, the Czech list had 2,534 entries and the Polish list 2,004). The shortest list had 36 entries, and the longest 6,078.

Data availability

All data generated and described in the Article and its Supplementary Information are available at GitHub ( https://github.com/facebookresearch/fairseq/tree/nllb ) 69 as follows. The FLORES-200 dataset contains human-translated evaluation data in 204 languages. The NLLB-Seed database contains human-translated seed training data in 39 languages (Supplementary Information I ). The NLLB-MD database contains human-translated seed data in different domains in six languages to assess generalization (Supplementary Information J ). The Toxicity-200 database contains word lists to detect toxicity in 200 languages. The mined bitext database contains publicly available web data for 148 English-centric and 1,465 non-English-centric language pairs. Publicly available data used to train NLLB models, with references for downloading them, are listed in Supplementary Table 2 .

Code availability

To make our work available to the community, we provide the following models and supporting code as resources freely available for non-commercial use, available at GitHub ( https://github.com/facebookresearch/fairseq/tree/nllb ) 69 as follows. The translation models cover 200 languages; the NLLB models come in multiple sizes (54.5B MoE, 3.3B and 1.3B Dense, and 1.3B and 600M distilled). The language identification models contain more than 200 languages. LASER3 comprises sentence encoders for identifying aligned bitext for 148 languages. Stopes consists of a data-mining library that can be used to process and clean monolingual data, followed by the creation of aligned bitext. Scripts to recreate our training data and training and generation scripts to reproduce our models are also included.

Fan, A. et al. Beyond English-centric multilingual machine translation. J. Mach. Learn. Res. 22, 1–48 (2021).

Du, N. et al. GLaM: efficient scaling of language models with mixture-of-experts. In Proc. 39th International Conference on Machine Learning Vol. 162, 5547–5569 (PMLR, 2022).

Hwang, C. et al. Tutel: adaptive mixture-of-experts at scale. In 6th Conference on Machine Learning and Systems (MLSys, 2023).

Lepikhin, D. et al. GShard: scaling giant models with conditional computation and automatic sharding. In International Conference on Learning Representations (ICLR, 2021).

Lewis, M., Bhosale, S., Dettmers, T., Goyal, N. & Zettlemoyer, L. BASE layers: simplifying training of large, sparse models. In Proc. 38th International Conference on Machine Learning Vol. 139, 6265–6274 (PMLR, 2021).

Shazeer, N. et al. Outrageously large neural networks: the sparsely-gated mixture-of-experts layer. In Proc. 2017 International Conference on Learning Representations (ICLR) 1–19 (ICLR, 2017).

Zoph, B. et al. ST-MoE: designing stable and transferable sparse expert models. Preprint at https://arxiv.org/abs/2202.08906 (2022).

Zoph, B., Yuret, D., May, J. & Knight, K. Transfer learning for low-resource neural machine translation. In Proc. 2016 Conference on Empirical Methods in Natural Language Processing (eds Su, J. et al.) 1568–1575 (Association for Computational Linguistics, 2016).

Nguyen, T. Q. & Chiang, D. Transfer learning across low-resource, related languages for neural machine translation. In Proc. Eighth International Joint Conference on Natural Language Processing Vol. 2 (eds Kondrak, G. & Watanabe, T.) 296–301 (Asian Federation of Natural Language Processing, 2017).

Arivazhagan, N. et al. Massively multilingual neural machine translation in the wild: findings and challenges. In Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies Vol. 1, 3874–3884 (ACL, 2019).

Zhang, B., Williams, P., Titov, I. & Sennrich, R. Improving massively multilingual neural machine translation and zero-shot translation. In Proc. 58th Annual Meeting of the Association for Computational Linguistics (eds Jurafsky, D. et al.) 1628–1639 (ACL, 2020).

Tran, C. et al. Facebook AI’s WMT21 news translation task submission. In Proc. Sixth Conference on Machine Translation (eds Barrault, L. et al.) 205–215 (ACL, 2021); https://aclanthology.org/2021.wmt-1.19 .

Orife, I. et al. Masakhane – machine translation for Africa. Preprint at https://arxiv.org/abs/2003.11529 (2020).

Kuwanto, G. et al. Low-resource machine translation training curriculum fit for low-resource languages. Preprint at https://arxiv.org/abs/2103.13272 (2021).

Nekoto, W. et al. Participatory research for low-resourced machine translation: a case study in African languages. In Findings of the Association for Computational Linguistics: EMNLP 2020 (eds Cohn, T. et al.) 2144–2160 (ACL, 2020).

Karakanta, A., Dehdari, J. & van Genabith, J. Neural machine translation for low-resource languages without parallel corpora. Mach. Transl. 32 , 167–189 (2018).

Bañón, M. et al. ParaCrawl: web-scale acquisition of parallel corpora. In Proc. 58th Annual Meeting of the Association for Computational Linguistics (eds Jurafsky, D. et al.) 4555–4567 (ACL, 2020).

Schwenk, H. et al. CCMatrix: mining billions of high-quality parallel sentences on the web. In Proc. 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing Vol. 1 (eds Zong, C. et al.) 6490–6500 (ACL, 2021).

Ramesh, G. et al. Samanantar : the largest publicly available parallel corpora collection for 11 Indic languages. Trans. Assoc. Comput. Linguist. 10 , 145–162 (2022).

Kreutzer, J. et al. Quality at a glance: an audit of web-crawled multilingual datasets. Trans. Assoc. Comput. Linguist. 10 , 50–72 (2022).

Heffernan, K., Çelebi, O. & Schwenk, H. Bitext mining using distilled sentence representations for low-resource languages. Preprint at https://arxiv.org/abs/2205.12654 (2022).

Gowda, T., Zhang, Z., Mattmann, C. & May, J. Many-to-English machine translation tools, data, and pretrained models. In Proc. 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations (eds Ji, H. et al.) 306–316 (ACL, 2021).

McCarthy, A. D. et al. The Johns Hopkins University Bible corpus: 1600+ tongues for typological exploration. In Proc. 12th Language Resources and Evaluation Conference (eds Calzolari, N. et al.) 2884–2892 (European Language Resources Association, 2020); https://aclanthology.org/2020.lrec-1.352 .

McNamee, P. Language identification: a solved problem suitable for undergraduate instruction. J. Comput. Sci. Coll. 20 , 94–101 (2005).

Abadji, J., Suárez, P. J. O., Romary, L. & Sagot, B. Towards a cleaner document-oriented multilingual crawled corpus. Preprint at https://arxiv.org/abs/2201.06642 (2022).

Widdows, D. & Brew, C. Language identification with a reciprocal rank classifier. Preprint at https://arxiv.org/abs/2109.09862 (2021).

Goutte, C., Léger, S., Malmasi, S. & Zampieri, M. Discriminating similar languages: evaluations and explorations. Preprint at http://arxiv.org/abs/1610.00031 (2016).

Jauhiainen, T., Lindén, K. & Jauhiainen, H. Evaluation of language identification methods using 285 languages. In Proc. 21st Nordic Conference on Computational Linguistics (eds. Tiedemann, J. & Tahmasebi, N.) 183–191 (2017).

Grave, É., Bojanowski, P., Gupta, P., Joulin, A. & Mikolov, T. Learning word vectors for 157 languages. In Proc. 11th International Conference on Language Resources and Evaluation (LREC 2018) (eds Calzolari, N. et al.) (ELRA, 2018).

Dunn, J. Mapping languages: the corpus of global language use. Lang. Resour. Eval. 54 , 999–1018 (2020).

Brown, R. D. Non-linear mapping for improved identification of 1300+ languages. In Proc. 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (eds Moschitti, A. et al.) 627–632 (ACL, 2014).

Caswell, I., Breiner, T., van Esch, D. & Bapna, A. Language ID in the wild: unexpected challenges on the path to a thousand-language web text corpus. In Proc. 28th International Conference on Computational Linguistics (eds Scott, D. et al.) 6588–6608 (International Committee on Computational Linguistics, 2020); https://aclanthology.org/2020.coling-main.579.

Joulin, A., Grave, E., Bojanowski, P. & Mikolov, T. Bag of tricks for efficient text classification. In Proc. 15th Conference of the European Chapter of the Association for Computational Linguistics Vol. 2 (eds Lapata, M. et al.) 427–431 (ACL, 2017).

NLLB Team et al. No language left behind: scaling human-centered machine translation. Preprint at https://arxiv.org/abs/2207.04672 (2022).

Koehn, P. & Knowles, R. Six challenges for neural machine translation. In Proc. First Workshop on Neural Machine Translation (eds Luong, T. et al.) 28–39 (ACL, 2017).

Artetxe, M. & Schwenk, H. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Trans. Assoc. Comput. Linguist. 7, 597–610 (2019).

Sennrich, R., Haddow, B. & Birch, A. Improving neural machine translation models with monolingual data. In Proc. 54th Annual Meeting of the Association for Computational Linguistics (ACL) Vol. 1 (eds Erk, K. & Smith, N. A.) 86–96 (ACL, 2016).

Popović, M. chrF++: words helping character n-grams. In Proc. Second Conference on Machine Translation Vol. 2 (eds Bojar, O. et al.) 612–618 (ACL, 2017).

Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J. & Wojna, Z. Rethinking the inception architecture for computer vision. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2818–2826 (IEEE, 2016).

Liu, R., Kim, Y. J., Muzio, A., Mozafari, B. & Awadalla, H. H. Gating dropout: communication-efficient regularization for sparsely activated transformers. In Proc. 39th International Conference on Machine Learning (PMLR, 2022).

Goyal, N. et al. The Flores-101 evaluation benchmark for low-resource and multilingual machine translation. Trans. Assoc. Comput. Linguist. 10, 522–538 (2022).

Wang, H. et al. DeepNet: scaling transformers to 1,000 layers. IEEE Trans. Pattern Anal. Mach. Intell. https://doi.org/10.1109/TPAMI.2024.3386927 (2024).

Freitag, M. et al. Results of the WMT21 metrics shared task: evaluating metrics with expert-based human evaluations on TED and news domain. In Proc. Sixth Conference on Machine Translation (eds Barrault, L. et al.) 733–774 (ACL, 2021); https://aclanthology.org/2021.wmt-1.73.

Papineni, K., Roukos, S., Ward, T. & Zhu, W.-J. BLEU: a method for automatic evaluation of machine translation. In Proc. 40th Annual Meeting of the Association for Computational Linguistics (eds Isabelle, P. et al.) 311–318 (ACL, 2002).

Akhbardeh, F. et al. Findings of the 2021 conference on machine translation (WMT21). In Proc. Sixth Conference on Machine Translation (eds Barrault, L. et al.) 1–88 (ACL, 2021); https://aclanthology.org/2021.wmt-1.1.

Kocmi, T. et al. To ship or not to ship: an extensive evaluation of automatic metrics for machine translation. In Proc. Sixth Conference on Machine Translation (eds Barrault, L. et al.) 478–494 (ACL, 2021).

Licht, D. et al. Consistent human evaluation of machine translation across language pairs. In Proc. 15th Biennial Conference of the Association for Machine Translation in the Americas Vol. 1, 309–321 (Association for Machine Translation in the Americas, 2022).

Agirre, E. et al. SemEval-2012 task 6: a pilot on semantic textual similarity. In Proc. *SEM 2012: The First Joint Conference on Lexical and Computational Semantics Vols 1–2 (eds Agirre, E. et al.) 385–393 (ACL, 2012).

Kusters, R. et al. Interdisciplinary research in artificial intelligence: challenges and opportunities. Front. Big Data 3, 577974 (2020).

Wang, S., Cooper, N., Eby, M. & Jo, E. S. From human-centered to social-centered artificial intelligence: assessing ChatGPT’s impact through disruptive events. Preprint at https://arxiv.org/abs/2306.00227 (2023).

Bojanowski, P., Grave, E., Joulin, A. & Mikolov, T. Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 5, 135–146 (2017).

Tiedemann, J. Parallel data, tools and interfaces in OPUS. In Proc. Eighth International Conference on Language Resources and Evaluation (eds Calzolari, N. et al.) 2214–2218 (ACL, 2012).

Artetxe, M. & Schwenk, H. Margin-based parallel corpus mining with multilingual sentence embeddings. In Proc. 57th Annual Meeting of the Association for Computational Linguistics (eds Korhonen, A. et al.) 3197–3203 (ACL, 2019).

Bahdanau, D., Cho, K. H. & Bengio, Y. Neural machine translation by jointly learning to align and translate. In Proc. 3rd International Conference on Learning Representations (ICLR, 2015).

Kudo, T. & Richardson, J. SentencePiece: a simple and language-independent subword tokenizer and detokenizer for neural text processing. In Proc. 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018) (eds Blanco, E. & Lu, W.) 66–71 (ACL, 2018); https://doi.org/10.18653/v1/d18-2012.

Gu, J., Hassan, H., Devlin, J. & Li, V. O. Universal neural machine translation for extremely low resource languages. In Proc. 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies Vol. 1 (eds Walker, M. et al.) 344–354 (ACL, 2018); https://aclanthology.org/N18-1032.

Wang, X., Pham, H., Arthur, P. & Neubig, G. Multilingual neural machine translation with soft decoupled encoding. Preprint at https://arxiv.org/abs/1902.03499 (2019).

Rajab, J. Effect of tokenisation strategies for low-resourced Southern African languages. In 3rd Workshop on African Natural Language Processing (ICLR, 2022).

Vaswani, A. et al. Attention is all you need. In Proc. 31st Conference on Neural Information Processing Systems 5998–6008 (NIPS, 2017).

Johnson, M. et al. Google’s multilingual neural machine translation system: enabling zero-shot translation. Trans. Assoc. Comput. Linguist. 5, 339–351 (2017).

Conneau, A. et al. Unsupervised cross-lingual representation learning at scale. In Proc. 58th Annual Meeting of the Association for Computational Linguistics (eds Jurafsky, D. et al.) 8440–8451 (ACL, 2020).

Bengio, Y., Léonard, N. & Courville, A. C. Estimating or propagating gradients through stochastic neurons for conditional computation. Preprint at http://arxiv.org/abs/1308.3432 (2013).

Almahairi, A. et al. Dynamic capacity networks. In Proc. 33rd International Conference on Machine Learning Vol. 48, 2091–2100 (PMLR, 2016).

Elbayad, M., Sun, A. & Bhosale, S. Fixing MoE over-fitting on low-resource languages in multilingual machine translation. In Findings of the Association for Computational Linguistics: ACL 2023 (eds Rogers, A. et al.) 14237–14253 (ACL, 2023); https://aclanthology.org/2023.findings-acl.897.

Rei, R., Stewart, C., Farinha, A. C. & Lavie, A. COMET: a neural framework for MT evaluation. In Proc. 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (eds Webber, B. et al.) 2685–2702 (ACL, 2020).

Sellam, T., Das, D. & Parikh, A. BLEURT: learning robust metrics for text generation. In Proc. 58th Annual Meeting of the Association for Computational Linguistics (eds Jurafsky, D. et al.) 7881–7892 (ACL, 2020).

Post, M. A call for clarity in reporting BLEU scores. In Proc. Third Conference on Machine Translation: Research Papers (eds Bojar, O. et al.) 186–191 (ACL, 2018); https://aclanthology.org/W18-6319.

Graham, Y., Baldwin, T., Moffat, A. & Zobel, J. Continuous measurement scales in human evaluation of machine translation. In Proc. 7th Linguistic Annotation Workshop and Interoperability with Discourse (eds Graham, Y. et al.) 33–41 (ACL, 2013).

NLLB Team et al. No Language Left Behind: scaling human-centered machine translation. GitHub https://github.com/facebookresearch/fairseq/tree/nllb (2022).

Acknowledgements

We thank the following interns for their contributions to the project: C. Baziotis, D. Dua, A. Guo, O. Ignat, A. Kamran, T. Mohiuddin, A. N. Rubungo, S. Sun, S. Tan, H. Xu, S. Wu and Y. Zhang. We are grateful to all the Wikimedia Foundation staff and volunteers who worked with us and provided helpful feedback on our project. We thank V. Chaudhary for help with the data pipeline; E. Grave for his help in scaling fasttext to all FLORES-200 languages; M. Diab for her work on XSTS; L. Specia for her feedback on toxicity and XSTS; J. Ferrando and C. Escolano for their help in using the ALTI+ method; G. Chang, C.-J. Wu and R. Raghavendra for helping us to compute the CO₂ cost of training our models; A. Sridhar for helping with FSDP; S. Jeschonek, G. Anantharaman, D. Sarina, J. Colombo, S. Krishnan, D. Kannappan, K. Saladi, V. Pai, A. Yajurvedi and S. Sengupta for their assistance with training infrastructure; K. Johnson for his help with UXR studies and model evaluation; B. O’Horo and J. Kao for their generative insights and guidance; P. Fung, N. Usunier, S. Riedel, S. Sengupta and E. Dinan for their helpful feedback on the paper. We would also like to thank A. Bordes, M. Zannoli and C. Moghbel for their overall support of this project. Finally, we are indebted to the translators, reviewers, human evaluators, linguists, as well as the translation and quality assurance agencies we partnered with, for helping to create FLORES-200, NLLB-Seed, NLLB-MD and Toxicity-200; performing human evaluations; and teaching us about their native languages.

Author information

Authors and Affiliations

Foundational AI Research (FAIR), Meta, Paris, France

Marta R. Costa-jussà, Onur Çelebi, Guillaume Wenzek, Loic Barrault, Shannon Spruit, Pierre Andrews, Alexandre Mourachko & Holger Schwenk

Foundational AI Research (FAIR), Meta, New York, NY, USA

James Cross, Angela Fan, Philipp Koehn & Safiyyah Saleem

Foundational AI Research (FAIR), Meta, Menlo Park, CA, USA

Maha Elbayad, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Al Youngblood, Bapi Akula, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Chau Tran, Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Cynthia Gao, Vedanuj Goswami, Francisco Guzmán, Christophe Ropers & Jeff Wang

Foundational AI Research (FAIR), Meta, London, UK

Kenneth Heafield & Kevin Heffernan

University of California, Berkeley, CA, USA

Skyler Wang

Johns Hopkins University, Baltimore, MD, USA

Philipp Koehn

Contributions

B.A., P.A., O.Ç., K. Heafield, K. Heffernan, S.J., H.S. and G.W. contributed to the data workstream of the project, which includes developing tools to facilitate data mining, cleaning and consolidation. L.B., S.B., J.C., M.E., V.G., J.M., K.R.S., A.S. and C.T. conducted research and experiments that gave rise to the models in this work. M.R.C., C.G., J.H., E.K., P.K., D.L., D.R., S.Spruit., S.W. and A.Y. implemented automatic and human evaluations of NLLB, including but not limited to quality, bias and toxicity. G.M.G., P.H., J.L. and C.R. performed all linguistics work in this project. N.F.A., S.E., A.F., F.G., A.M., S.S. and J.W. provided crucial technical and organizational leadership to help materialize this overall project. M.R.C., C.R., M.E. and S.W. prepared the paper for publication.

Corresponding author

Correspondence to Marta R. Costa-jussà.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature thanks David Adelani, Sunipa Dev and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Architecture of the LASER3 teacher-student approach.

See ref. 21 for more details.
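
As a minimal, runnable sketch of the teacher-student idea behind LASER3 (toy mean-pooling encoders and made-up sizes stand in for the transformer encoders trained in ref. 21), a frozen teacher embeds one side of a parallel sentence pair and the student is optimized so its embedding of the other side matches:

import torch
import torch.nn as nn

VOCAB, DIM = 1000, 64   # toy sizes, not the real configuration

class ToyEncoder(nn.Module):
    """Embeds token ids and mean-pools them into a single sentence vector."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, DIM)
    def forward(self, tokens):                 # tokens: (batch, seq_len)
        return self.emb(tokens).mean(dim=1)    # (batch, DIM)

teacher = ToyEncoder()   # stands in for a pretrained multilingual encoder
student = ToyEncoder()   # new encoder for a low-resource language
for p in teacher.parameters():
    p.requires_grad = False   # the teacher stays frozen

opt = torch.optim.Adam(student.parameters(), lr=1e-3)
loss_fn = nn.CosineEmbeddingLoss()

# One step on a toy bitext batch: the student sees the low-resource side,
# the teacher the high-resource side, and the student is pulled toward the
# teacher's embedding of the translation.
src = torch.randint(0, VOCAB, (8, 12))
tgt = torch.randint(0, VOCAB, (8, 12))
loss = loss_fn(student(src), teacher(tgt), torch.ones(8))  # +1 = "should match"
opt.zero_grad()
loss.backward()
opt.step()
print(f"distillation loss: {loss.item():.4f}")

Training against a frozen teacher is what lets the new encoder inherit the teacher's multilingual embedding space, the property that bitext-mining pipelines rely on.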

Extended Data Fig. 2 Illustration of EOM (panel c) in contrast to overall dropout (panel b) for MoE layers.

A color represents a token, and each token is dispatched to two experts (Top-2-Gating) depending on the gating decision (panel a). Faded colors correspond to dropped units or masked outputs.
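
To make the contrast concrete, here is a small, self-contained PyTorch sketch (illustrative sizes and masking rate, not the NLLB implementation) of Top-2 gating in which EOM masks each selected expert's entire output for a token, whereas overall dropout would instead zero random units inside those outputs:

import torch
import torch.nn as nn

DIM, N_EXPERTS, P_EOM = 64, 4, 0.2   # illustrative sizes and masking rate

class MoEWithEOM(nn.Module):
    def __init__(self):
        super().__init__()
        self.gate = nn.Linear(DIM, N_EXPERTS)
        self.experts = nn.ModuleList(nn.Linear(DIM, DIM) for _ in range(N_EXPERTS))

    def forward(self, x):                                        # x: (tokens, DIM)
        weights, idx = self.gate(x).softmax(-1).topk(2, dim=-1)  # Top-2-Gating
        weights = weights / weights.sum(-1, keepdim=True)        # renormalize the two gates
        out = torch.zeros_like(x)
        for slot in range(2):
            expert_out = torch.stack(
                [self.experts[int(idx[t, slot])](x[t]) for t in range(x.size(0))]
            )
            if self.training:
                # EOM: with probability P_EOM, zero the *entire* output of a
                # selected expert for a token (then rescale), in contrast to
                # overall dropout, which zeroes random units inside it.
                keep = (torch.rand(x.size(0), 1) > P_EOM).float() / (1 - P_EOM)
                expert_out = expert_out * keep
            out = out + weights[:, slot:slot + 1] * expert_out
        return out

moe = MoEWithEOM()
print(moe(torch.randn(10, DIM)).shape)   # torch.Size([10, 64])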

Supplementary information

Supplementary Information

This file contains Supplementary Information Sections A–K and Supplementary References – see Supplementary Contents page for details.

Peer Review File

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

NLLB Team. Scaling neural machine translation to 200 languages. Nature (2024). https://doi.org/10.1038/s41586-024-07335-x

Received: 08 May 2023

Accepted: 19 March 2024

Published: 05 June 2024

DOI: https://doi.org/10.1038/s41586-024-07335-x

This article is cited by

Meta’s AI system is a boost to endangered languages — as long as humans aren’t forgotten

Nature (2024)

NeurIPS 2024, the Thirty-eighth Annual Conference on Neural Information Processing Systems, will be held at the Vancouver Convention Center from Monday, December 9, through Sunday, December 15; Monday is an industry expo.

Registration

Our Hotel Reservation page is currently under construction and will be released shortly. NeurIPS has contracted hotel guest rooms for the conference at group pricing, and reservations are available only through this page. Please do not make room reservations through any other channel, as doing so impedes our ability to put on the best conference for you. We thank you for your assistance in helping us protect the NeurIPS conference.

Announcements

  • The call for High School Projects has been released
  • The Call For Papers has been released
  • See the Visa Information page for changes to the visa process for 2024.

Important Dates

  • Main Conference Paper Submission Deadline: May 22 '24, 01:00 PM PDT
  • Main Conference Author Notification: Sep 25 '24, 06:00 PM PDT
  • Datasets and Benchmarks Author Notification: Sep 26 '24 (Anywhere on Earth)
  • Workshop Accept/Reject Notification: Sep 29 '24 (Anywhere on Earth)

Organizing Committee

General Chair, Program Chair, Workshop Chair, Workshop Chair Assistant, Tutorial Chair, Competition Chair, Data and Benchmark Chair, Affinity Chair, Diversity, Inclusion and Accessibility Chair, Ethics Review Chair, Communication Chair, Social Chair, Journal Chair, Creative AI Chair, Workflow Manager, Logistics and IT

Mission Statement

The Neural Information Processing Systems Foundation is a non-profit corporation whose purpose is to foster the exchange of research advances in Artificial Intelligence and Machine Learning, principally by hosting an annual interdisciplinary academic conference with the highest ethical standards for a diverse and inclusive community.

About the Conference

The conference was founded in 1987 and is now a multi-track interdisciplinary annual meeting that includes invited talks, demonstrations, symposia, and oral and poster presentations of refereed papers. Along with the conference is a professional exposition focusing on machine learning in practice, a series of tutorials, and topical workshops that provide a less formal setting for the exchange of ideas.

The state of AI in 2023: Generative AI’s breakout year

The latest annual McKinsey Global Survey on the current state of AI confirms the explosive growth of generative AI (gen AI) tools. Less than a year after many of these tools debuted, one-third of our survey respondents say their organizations are using gen AI regularly in at least one business function. Amid recent advances, AI has risen from a topic relegated to tech employees to a focus of company leaders: nearly one-quarter of surveyed C-suite executives say they are personally using gen AI tools for work, and more than one-quarter of respondents from companies using AI say gen AI is already on their boards’ agendas. What’s more, 40 percent of respondents say their organizations will increase their investment in AI overall because of advances in gen AI. The findings show that these are still early days for managing gen AI–related risks, with less than half of respondents saying their organizations are mitigating even the risk they consider most relevant: inaccuracy.

The organizations that have already embedded AI capabilities have been the first to explore gen AI’s potential, and those seeing the most value from more traditional AI capabilities—a group we call AI high performers—are already outpacing others in their adoption of gen AI tools. We define AI high performers as organizations that, according to respondents, attribute at least 20 percent of their EBIT to AI adoption.

The expected business disruption from gen AI is significant, and respondents predict meaningful changes to their workforces. They anticipate workforce cuts in certain areas and large reskilling efforts to address shifting talent needs. Yet while the use of gen AI might spur the adoption of other AI tools, we see few meaningful increases in organizations’ adoption of these technologies. The percent of organizations adopting any AI tools has held steady since 2022, and adoption remains concentrated within a small number of business functions.

Table of Contents

  • It’s early days still, but use of gen AI is already widespread
  • Leading companies are already ahead with gen AI
  • AI-related talent needs shift, and AI’s workforce effects are expected to be substantial
  • With all eyes on gen AI, AI adoption and impact remain steady

1. It’s early days still, but use of gen AI is already widespread

The findings from the survey—which was in the field in mid-April 2023—show that, despite gen AI’s nascent public availability, experimentation with the tools  is already relatively common, and respondents expect the new capabilities to transform their industries. Gen AI has captured interest across the business population: individuals across regions, industries, and seniority levels are using gen AI for work and outside of work. Seventy-nine percent of all respondents say they’ve had at least some exposure to gen AI, either for work or outside of work, and 22 percent say they are regularly using it in their own work. While reported use is quite similar across seniority levels, it is highest among respondents working in the technology sector and those in North America.

Organizations, too, are now commonly using gen AI. One-third of all respondents say their organizations are already regularly using generative AI in at least one function—meaning that 60 percent of organizations with reported AI adoption are using gen AI. What’s more, 40 percent of those reporting AI adoption at their organizations say their companies expect to invest more in AI overall thanks to generative AI, and 28 percent say generative AI use is already on their board’s agenda. The most commonly reported business functions using these newer tools are the same as those in which AI use is most common overall: marketing and sales, product and service development, and service operations, such as customer care and back-office support. This suggests that organizations are pursuing these new tools where the most value is. In our previous research, these three areas, along with software engineering, showed the potential to deliver about 75 percent of the total annual value from generative AI use cases.

In these early days, expectations for gen AI’s impact are high: three-quarters of all respondents expect gen AI to cause significant or disruptive change in the nature of their industry’s competition in the next three years. Survey respondents working in the technology and financial-services industries are the most likely to expect disruptive change from gen AI. Our previous research shows that, while all industries are indeed likely to see some degree of disruption, the level of impact is likely to vary (“The economic potential of generative AI: The next productivity frontier,” McKinsey, June 14, 2023). Industries relying most heavily on knowledge work are likely to see more disruption—and potentially reap more value. While our estimates suggest that tech companies, unsurprisingly, are poised to see the highest impact from gen AI—adding value equivalent to as much as 9 percent of global industry revenue—knowledge-based industries such as banking (up to 5 percent), pharmaceuticals and medical products (also up to 5 percent), and education (up to 4 percent) could experience significant effects as well. By contrast, manufacturing-based industries, such as aerospace, automotives, and advanced electronics, could experience less disruptive effects. This stands in contrast to the impact of previous technology waves that affected manufacturing the most and is due to gen AI’s strengths in language-based activities, as opposed to those requiring physical labor.

Responses show many organizations not yet addressing potential risks from gen AI

According to the survey, few companies seem fully prepared for the widespread use of gen AI—or the business risks these tools may bring. Just 21 percent of respondents reporting AI adoption say their organizations have established policies governing employees’ use of gen AI technologies in their work. And when we asked specifically about the risks of adopting gen AI, few respondents say their companies are mitigating the most commonly cited risk with gen AI: inaccuracy. Respondents cite inaccuracy more frequently than both cybersecurity and regulatory compliance, which were the most common risks from AI overall in previous surveys. Just 32 percent say they’re mitigating inaccuracy, a smaller percentage than the 38 percent who say they mitigate cybersecurity risks. Interestingly, this figure is significantly lower than the percentage of respondents who reported mitigating AI-related cybersecurity last year (51 percent). Overall, much as we’ve seen in previous years, most respondents say their organizations are not addressing AI-related risks.

2. Leading companies are already ahead with gen AI

The survey results show that AI high performers—that is, organizations where respondents say at least 20 percent of EBIT in 2022 was attributable to AI use—are going all in on artificial intelligence, both with gen AI and more traditional AI capabilities. These organizations that achieve significant value from AI are already using gen AI in more business functions than other organizations do, especially in product and service development and risk and supply chain management. When looking at all AI capabilities—including more traditional machine learning capabilities, robotic process automation, and chatbots—AI high performers also are much more likely than others to use AI in product and service development, for uses such as product-development-cycle optimization, adding new features to existing products, and creating new AI-based products. These organizations also are using AI more often than other organizations in risk modeling and for uses within HR such as performance management and organization design and workforce deployment optimization.

Another difference from their peers: high performers’ gen AI efforts are less oriented toward cost reduction, which is a top priority at other organizations. Respondents from AI high performers are twice as likely as others to say their organizations’ top objective for gen AI is to create entirely new businesses or sources of revenue—and they’re most likely to cite the increase in the value of existing offerings through new AI-based features.

As we’ve seen in previous years, these high-performing organizations invest much more than others in AI: respondents from AI high performers are more than five times more likely than others to say they spend more than 20 percent of their digital budgets on AI. They also use AI capabilities more broadly throughout the organization. Respondents from high performers are much more likely than others to say that their organizations have adopted AI in four or more business functions and that they have embedded a higher number of AI capabilities. For example, respondents from high performers more often report embedding knowledge graphs in at least one product or business function process, in addition to gen AI and related natural-language capabilities.

While AI high performers are not immune to the challenges of capturing value from AI, the results suggest that the difficulties they face reflect their relative AI maturity, while others struggle with the more foundational, strategic elements of AI adoption. Respondents at AI high performers most often point to models and tools, such as monitoring model performance in production and retraining models as needed over time, as their top challenge. By comparison, other respondents cite strategy issues, such as setting a clearly defined AI vision that is linked with business value or finding sufficient resources.

The findings offer further evidence that even high performers haven’t mastered best practices regarding AI adoption, such as machine-learning-operations (MLOps) approaches, though they are much more likely than others to do so. For example, just 35 percent of respondents at AI high performers report that, where possible, their organizations assemble existing components rather than reinvent them; still, that is a much larger share than the 19 percent of respondents from other organizations who report that practice.

Many specialized MLOps technologies and practices may be needed to adopt some of the more transformative use cases that gen AI applications can deliver—and to do so as safely as possible. Live-model operations is one such area: monitoring systems and instant alerts can enable rapid issue resolution and keep gen AI systems in check. High performers stand out in this respect but have room to grow: one-quarter of respondents from these organizations say their entire system is monitored and equipped with instant alerts, compared with just 12 percent of other respondents.
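
As an illustration of what such live-model monitoring can look like in code (the class, window size, and threshold below are invented for the example, not taken from the survey), a rolling quality check over recent production predictions can trigger an instant alert the moment accuracy degrades:

from collections import deque

class DriftMonitor:
    def __init__(self, window=100, min_accuracy=0.90):
        self.results = deque(maxlen=window)   # 1 = correct, 0 = incorrect
        self.min_accuracy = min_accuracy

    def record(self, prediction, label):
        self.results.append(1 if prediction == label else 0)
        if len(self.results) == self.results.maxlen:
            acc = sum(self.results) / len(self.results)
            if acc < self.min_accuracy:
                self.alert(acc)

    def alert(self, acc):
        # In a real system this would page an on-call engineer or trigger
        # an automated retraining job rather than print to stdout.
        print(f"ALERT: rolling accuracy {acc:.2%} below "
              f"{self.min_accuracy:.0%} - investigate or retrain")

monitor = DriftMonitor(window=50, min_accuracy=0.85)
# monitor.record(model_output, ground_truth) would be called per request.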

3. AI-related talent needs shift, and AI’s workforce effects are expected to be substantial

Our latest survey results show changes in the roles that organizations are filling to support their AI ambitions. In the past year, organizations using AI most often hired data engineers, machine learning engineers, and AI data scientists—all roles that respondents commonly reported hiring in the previous survey. But a much smaller share of respondents report hiring AI-related software engineers—the most-hired role last year—than in the previous survey (28 percent in the latest survey, down from 39 percent). Roles in prompt engineering have recently emerged, as the need for that skill set rises alongside gen AI adoption, with 7 percent of respondents whose organizations have adopted AI reporting those hires in the past year.

The findings suggest that hiring for AI-related roles remains a challenge but has become somewhat easier over the past year, which could reflect the spate of layoffs at technology companies from late 2022 through the first half of 2023. Smaller shares of respondents than in the previous survey report difficulty hiring for roles such as AI data scientists, data engineers, and data-visualization specialists, though responses suggest that hiring machine learning engineers and AI product owners remains as much of a challenge as in the previous year.

Looking ahead to the next three years, respondents predict that the adoption of AI will reshape many roles in the workforce. Generally, they expect more employees to be reskilled than to be separated. Nearly four in ten respondents reporting AI adoption expect more than 20 percent of their companies’ workforces will be reskilled, whereas 8 percent of respondents say the size of their workforces will decrease by more than 20 percent.

Looking specifically at gen AI’s predicted impact, service operations is the only function in which most respondents expect to see a decrease in workforce size at their organizations. This finding generally aligns with what our recent research  suggests: while the emergence of gen AI increased our estimate of the percentage of worker activities that could be automated (60 to 70 percent, up from 50 percent), this doesn’t necessarily translate into the automation of an entire role.

AI high performers are expected to conduct much higher levels of reskilling than other companies are. Respondents at these organizations are over three times more likely than others to say their organizations will reskill more than 30 percent of their workforces over the next three years as a result of AI adoption.

4. With all eyes on gen AI, AI adoption and impact remain steady

While the use of gen AI tools is spreading rapidly, the survey data doesn’t show that these newer tools are propelling organizations’ overall AI adoption. The share of organizations that have adopted AI overall remains steady, at least for the moment, with 55 percent of respondents reporting that their organizations have adopted AI. Less than a third of respondents continue to say that their organizations have adopted AI in more than one business function, suggesting that AI use remains limited in scope. Product and service development and service operations continue to be the two business functions in which respondents most often report AI adoption, as was true in the previous four surveys. And overall, just 23 percent of respondents say at least 5 percent of their organizations’ EBIT last year was attributable to their use of AI—essentially flat with the previous survey—suggesting there is much more room to capture value.

Organizations continue to see returns in the business areas in which they are using AI, and they plan to increase investment in the years ahead. We see a majority of respondents reporting AI-related revenue increases within each business function using AI. And looking ahead, more than two-thirds expect their organizations to increase their AI investment over the next three years.

About the research

The online survey was in the field April 11 to 21, 2023, and garnered responses from 1,684 participants representing the full range of regions, industries, company sizes, functional specialties, and tenures. Of those respondents, 913 said their organizations had adopted AI in at least one function and were asked questions about their organizations’ AI use. To adjust for differences in response rates, the data are weighted by the contribution of each respondent’s nation to global GDP.
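
As a toy illustration of that weighting scheme (the respondents and GDP shares below are made up, not survey data), each answer counts in proportion to the respondent's nation's share of global GDP rather than one vote per respondent:

# Hypothetical respondents and GDP shares, purely for illustration.
respondents = [
    {"country": "USA",     "adopted_ai": True},
    {"country": "India",   "adopted_ai": False},
    {"country": "Germany", "adopted_ai": True},
]
gdp_share = {"USA": 0.25, "India": 0.08, "Germany": 0.04}

weighted_yes = sum(gdp_share[r["country"]] for r in respondents if r["adopted_ai"])
total_weight = sum(gdp_share[r["country"]] for r in respondents)
print(f"GDP-weighted adoption rate: {weighted_yes / total_weight:.1%}")
# Unweighted, 2 of 3 respondents adopted AI (66.7%); GDP weighting shifts
# the estimate toward the answers of respondents from larger economies.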

The survey content and analysis were developed by Michael Chui , a partner at the McKinsey Global Institute and a partner in McKinsey’s Bay Area office, where Lareina Yee is a senior partner; Bryce Hall , an associate partner in the Washington, DC, office; and senior partners Alex Singla and Alexander Sukharevsky , global leaders of QuantumBlack, AI by McKinsey, based in the Chicago and London offices, respectively.

They wish to thank Shivani Gupta, Abhisek Jena, Begum Ortaoglu, Barr Seitz, and Li Zhang for their contributions to this work.

This article was edited by Heather Hanselman, an editor in the Atlanta office.

Related articles

  • The economic potential of generative AI: The next productivity frontier
  • What is generative AI?
  • Exploring opportunities in the generative AI value chain

VIDEO

  1. Machine Learning Research Explained to a 5 Year Old #AI #musicgeneration

  2. ML Nomads Live Stream

  3. 2024 Empowering Minds Through Data Science and Machine Learning Symposium: Jinferg Zhang PHD

  4. Why you should read Research Papers in ML & DL? #machinelearning #deeplearning

  5. you should still read Machine Learning research papers

  6. Extreme Learning Machine: Learning Without Iterative Tuning

COMMENTS

  1. The latest in Machine Learning

    Explore the latest Machine Learning papers and code on various topics, such as image generation, time-series forecasting, object detection, and more. See the most popular and recent publications, ratings, and stars on Papers With Code.

  2. Journal of Machine Learning Research

    JMLR is an international forum for high-quality scholarly articles in all areas of machine learning. Browse the latest papers on topics such as PAC-Bayes bounds, neural networks, quality-diversity algorithms, and more.

  3. Machine Learning: Algorithms, Real-World Applications and Research

    A comprehensive review of machine learning techniques and their applications in various real-world domains, such as cybersecurity, smart cities, healthcare, and more. The paper also highlights the challenges and potential research directions based on the study.

  4. Home

    Demonstrates how to apply learning methods to solve significant application problems. Improves how machine learning research is conducted. Prioritizes verifiable and replicable supporting evidence in all published papers.

  5. The Journal of Machine Learning Research

    Machine learning models trained by different optimization algorithms under different data distributions can exhibit distinct generalization behaviors. In this paper, we analyze the generalization of models trained by noisy iterative algorithms.

  6. Machine learning

    Machine learning is the ability of a machine to improve its performance based on previous results. Machine learning methods enable computers to learn without being explicitly programmed and have ...

  7. THE JOURNAL OF MACHINE LEARNING RESEARCH Home

    The Journal of Machine Learning Research (JMLR) provides an international forum for the electronic and paper publication of high-quality scholarly articles in all areas of machine learning.JMLR seeks previously unpublished papers that contain:new algorithms with empirical, theoretical, psychological, or biological justification; experimental and/or theoretical studies yielding new insight into ...

  8. JMLR Papers

    Browse the table of contents and links to the papers published in JMLR, a peer-reviewed journal covering all aspects of machine learning. Find papers by volume number, special topics, special issues, or conference proceedings.

  9. machine learning Latest Research Papers

Find the latest published documents for machine learning, related hot topics, top authors, the most cited documents, and related journals.

  10. The latest in Machine Learning

    Papers With Code highlights trending Machine Learning research and the code to implement it.

  11. Machine Learning

Explore the latest research papers on machine learning, including topics on falsifiable, replicable, and reproducible empirical ML research.

  12. Papers with Code

    Find the latest and best papers for various machine learning tasks, such as computer vision, natural language processing, speech, and more. Explore the benchmarks, datasets, and methods for each task and compare the results with code.

  13. Machine Learning with Applications

    A peer reviewed, open access journal focused on research related to machine learning and its applications in various domains. Find the latest published articles, calls for papers, special issues and more on the journal website.

  14. Forecasting the future of artificial intelligence with machine learning

    How can AI predict its own future research directions? Explore a novel benchmark and diverse methods based on a semantic network of AI papers.

  15. Machine Learning

    The codebase will be made available upon publication. This paper is dedicated to Thomas Sankara. Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Atmospheric and Oceanic Physics (physics.ao-ph); Neurons and Cognition (q-bio.NC)

  16. Computer Science

    Papers on all aspects of machine learning research (supervised, unsupervised, reinforcement learning, bandit problems, and so on) including also robustness, explanation, fairness, and methodology. cs.LG is also an appropriate primary category for applications of machine learning methods.

  17. Top 20 Recent Research Papers on Machine Learning and Deep Learning

    Machine learning and Deep Learning research advances are transforming our technology. Here are the 20 most important (most-cited) scientific papers that have been published since 2014, starting with "Dropout: a simple way to prevent neural networks from overfitting".

  18. Artificial intelligence and machine learning research ...

    Considering that Machine Learning (ML) and AI are apt to reach unforeseen levels of accuracy and efficiency, this special issue sought to promote research on AI and ML seen as functions of data-driven innovation and digital transformation.

  19. Machine Learning: Models, Challenges, and Research Directions

    Machine learning techniques have emerged as a transformative force, revolutionizing various application domains, particularly cybersecurity. The development of optimal machine learning applications requires the integration of multiple processes, such as data pre-processing, model selection, and parameter optimization. While existing surveys have shed light on these techniques, they have mainly ...

  20. Machine Learning in Healthcare

    The compilation of articles and papers focused on the use of machine learning and artificial intelligence in healthcare as well as current and potential applications. Search terms included machine learning in healthcare, artificial intelligence medical imaging, BIG data and machine learning, machine learning in genomics, electronic health ...

  21. Machine Learning: Algorithms, Real-World Applications and Research

The learning algorithms can be categorized into four major types, such as supervised, unsupervised, semi-supervised, and reinforcement learning in the area [75], discussed briefly in Sect. “Types of Real-World Data and Machine Learning Techniques”. The popularity of these approaches to learning is increasing day-by-day, which is shown ...

  22. Overview

    Apple machine learning teams are engaged in state of the art research in machine learning and artificial intelligence. Learn about the latest advancements.

  23. Machine Learning Datasets

    Stay informed on the latest trending ML papers with code, research developments, libraries, methods, and datasets. ... (OGB) is a collection of realistic, large-scale, and diverse benchmark datasets for machine learning on graphs. OGB datasets are automatically downloaded, processed, and split using the OGB Data Loader. The model performance ...

  24. 5 Machine Learning Papers to Read in 2024

    There are many machine learning papers to read in 2024, and here are my recommendation papers to read: HyperFast: Instant Classification for Tabular Data. EasyRL4Rec: A User-Friendly Code Library for Reinforcement Learning Based Recommender Systems. Label Propagation for Zero-shot Classification with Vision-Language Models.

  25. Machine Learning

    a few innovative research works and their applications in real-world, such as stock trading, medical and healthcare systems, and software automation. The chapters in ... Machine learning (ML) is the ability of a system to automatically acquire, integrate, and then develop knowledge from large-scale data, and then expand the acquired ...

  26. Scaling neural machine translation to 200 languages

    The development of neural techniques has opened up new avenues for research in machine translation. Today, neural machine translation (NMT) systems can leverage highly multilingual capacities and ...

  27. 2024 Conference

    The Neural Information Processing Systems Foundation is a non-profit corporation whose purpose is to foster the exchange of research advances in Artificial Intelligence and Machine Learning, principally by hosting an annual interdisciplinary academic conference with the highest ethical standards for a diverse and inclusive community.

  28. The state of AI in 2023: Generative AI's breakout year

    In the past year, organizations using AI most often hired data engineers, machine learning engineers, and Al data scientists—all roles that respondents commonly reported hiring in the previous survey. ... About the research. The online survey was in the field April 11 to 21, 2023, and garnered responses from 1,684 participants representing ...