An Integrative Review with Word Cloud Analysis of STEM Education

  • Published: 01 July 2024

Cite this article

word cloud research paper

  • Wen-Song Su 1 , 2 &
  • Ching-Yi Chang 3  

In the twenty-first century, the effectiveness of science, technology, engineering, and mathematics (STEM) courses is an issue of significant concern for educators internationally. This article analyzed STEM articles published in the journals included in PubMed, Scopus, Web of Science (WoS), Embase, CINAHL, and Medline databases from 1999 to 2023 and explored the trends in educational research and educational scientific disciplines. The primary analyses and findings are as follows: (1) the major cooperative institutions identified were Michigan State University; (2) knowledge construction tools, inquiry-based learning, and peer competition or gaming were identified as the top three learning strategies; (3) four keyword clusters covering students, science, literacy, and performance were categorized. These results hold implications for researchers, school district administrators, and policymakers who are endeavoring to introduce STEM education into the curriculum for higher education teachers.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save.

  • Get 10 units per month
  • Download Article/Chapter or Ebook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime

Price includes VAT (Russian Federation)

Instant access to the full article PDF.

Rent this article via DeepDyve

Institutional subscriptions

word cloud research paper

Availability of Data and Materials

The data that support the findings of this study are available from the corresponding author upon request.

Allchin, D., Andersen, H. M., & Nielsen, K. (2014). Complementary approaches to teaching nature of science: Integrating student inquiry, historical cases, and contemporary cases in classroom practice. Science Education, 98 (3), 461–486.

Article   Google Scholar  

Bevan, B. (2017). The promise and the promises of making in science education. Studies in Science Education, 53 (1), 75–103.

Bryan, L., & Guzey, S. S. (2020). K-12 STEM education: An overview of perspectives and considerations. Hellenic Journal of STEM Education, 1 (1), 5–15.

Carnes, M., Devine, P. G., Isaac, C., Manwell, L. B., Ford, C. E., Byars-Winston, A., . . . Sheridan, J. (2012). Promoting institutional change through bias literacy. Journal of Diversity in Higher Education, 5 (2), 63.

Century, J., Ferris, K. A., & Zuo, H. (2020). Finding time for computer science in the elementary school day: A quasi-experimental study of a transdisciplinary problem-based learning approach. International Journal of STEM Education, 7 (1), 1–16.

Chandler-Olcott, K. E. L. L. Y., & Mahar, D. (2003). “Tech-savviness” meets multiliteracies: Exploring adolescent girls’ technology-mediated literacy practices. Reading Research Quarterly, 38 (3), 356–385.

Chang, D., Hwang, G. J., Chang, S. C., & Wang, S. Y. (2021). Promoting students’ cross-disciplinary performance and higher order thinking: A peer assessment-facilitated STEM approach in a mathematics course. Educational Technology Research and Development, 69 , 3281–3306.

Charlton, J. P., & Birkett, P. E. (1999). An integrative model of factors related to computing course performance. Journal of Educational Computing Research, 20 (3), 237–257.

Chu, H. C., Chang, C. Y., & Chao, H. C. (2022). Mapping deep learning technologies for mobile networks with the internet of things. Human-Centric Computing and Information Sciences, 12 , 59.

Google Scholar  

Cohen, A., & Gilead, T. (2022). Introducing complexity theory to consider practice-based teacher education for democratic citizenship. Studies in Philosophy and Education, 42 (2), 201–217.

DeFlorio, L., & Beliakoff, A. (2015). Socioeconomic status and preschoolers’ mathematical knowledge: The contribution of home activities and parent beliefs. Early Education and Development, 26 (3), 319–341.

Fajrina, S., Lufri, L., & Ahda, Y. (2020). Science, Technology, Engineering, and Mathematics (STEM) as a learning approach to improve 21st century skills: A review. International Journal of Online & Biomedical Engineering, 16 (7). https://doi.org/10.3991/ijoe.v16i07.14101

Fomunyam, K. G. (2020). Introductory chapter: Theorising STEM Education in the contemporary society. Theorizing STEM Education in the 21st Century . Londres: IntechOpen.

Gunckel, K. L., Covitt, B. A., Salinas, I., & Anderson, C. W. (2012). A learning progression for water in socio-ecological systems. Journal of Research in Science Teaching, 49 (7), 843–868.

Hébert, C., & Jenson, J. (2020). Making in schools: Student learning through an e-textiles curriculum. Discourse Studies in the Cultural Politics of Education, 41 (5), 740–761.

Hofstein, A., Eilks, I., & Bybee, R. (2011). Societal issues and their importance for contemporary science education—A pedagogical justification and the state-of-the-art in Israel, Germany, and the USA. International Journal of Science and Mathematics Education, 9 , 1459–1483.

Hsiao, J. C., Chen, S. K., Chen, W., & Lin, S. S. (2022). Developing a plugged-in class observation protocol in high-school blended STEM classes: Student engagement, teacher behaviors and student-teacher interaction patterns. Computers & Education, 178 , 104403.

Hwang, G. J., Li, K. C., & Lai, C. L. (2020). Trends and strategies for conducting effective STEM research and applications: A mobile and ubiquitous learning perspective. International Journal of Mobile Learning and Organisation, 14 (2), 161–183.

Ibáñez, M. B., & Delgado-Kloos, C. (2018). Augmented reality for STEM learning: A systematic review. Computers & Education, 123 , 109–123.

Jang, H. (2016). Identifying 21st century STEM competencies using workplace data. Journal of Science Education and Technology, 25 , 284–301.

Jiang, S., Shen, J., & Smith, B. E. (2019). Designing discipline-specific roles for interdisciplinary learning: Two comparative cases in an afterschool STEM+ L programme. International Journal of Science Education, 41 (6), 803–826.

Kennedy, T. J., & Odell, M. R. (2014). Engaging students in STEM education. Science Education International, 25 (3), 246–258.

Kieffer, M. J. (2012). Before and after third grade: Longitudinal evidence for the shifting role of socioeconomic status in reading growth. Reading and Writing, 25 , 1725–1746.

Ko, Y., Shim, S. S., & Lee, H. (2023). Development and validation of a scale to measure views of social responsibility of scientists and engineers (VSRoSE). International Journal of Science and Mathematics Education, 21 (1), 277–303.

Kuo, H. C., Tseng, Y. C., & Yang, Y. T. C. (2019). Promoting college student’s learning motivation and creativity through a STEM interdisciplinary PBL human-computer interaction system design and development course. Thinking Skills and Creativity, 31 , 1–10.

Loyalka, P., Liu, O. L., Li, G., Kardanova, E., Chirikov, I., Hu, S., . . . Li, Y. (2021). Skill levels and gains in university STEM education in China, India, Russia and the United States. Nature Human Behaviour, 5 (7), 892–904.

Luo, T., Wang, J., Liu, X., & Zhou, J. (2019). Development and application of a scale to measure students’ STEM continuing motivation. International Journal of Science Education, 41 (14), 1885–1904.

Martín-Páez, T., Aguilera, D., Perales-Palacios, F. J., & Vílchez-González, J. M. (2019). What are we talking about when we talk about STEM education? A Review of Literature. Science Education, 103 (4), 799–822.

Odden, T. O. B., Lockwood, E., & Caballero, M. D. (2019). Physics computational literacy: An exploratory case study using computational essays. Physical Review Physics Education Research, 15 (2), 020152.

Oladinrin, O. T., Arif, M., Rana, M. Q., & Gyoh, L. (2023). Interrelations between construction ethics and innovation: A bibliometric analysis using VOSviewer. Construction Innovation, 23 (3), 505–523.

Perera, V. L., Wei, T., & Mlsna, D. A. (2019). Impact of peer-focused recitation to enhance student success in general chemistry. Journal of Chemical Education, 96 (8), 1600–1608.

Repenning, A., Webb, D. C., Koh, K. H., Nickerson, H., Miller, S. B., Brand, C., . . . Repenning, N. (2015). Scalable game design: A strategy to bring systemic computer science education to schools through game design and simulation creation. ACM Transactions on Computing Education, 15 (2), 1–31.

Schultheis, E. H., Kjelvik, M. K., Snowden, J., Mead, L., & Stuhlsatz, M. A. (2022). Effects of data nuggets on student interest in STEM careers, self-efficacy in data tasks, and ability to construct scientific explanations. International Journal of Science and Mathematics Education . https://doi.org/10.1007/s10763-022-10295-1

Siregar, N. C., Rosli, R., Maat, S. M., & Capraro, M. M. (2019). The effect of science, technology, engineering and mathematics (STEM) program on students’ achievement in mathematics: A meta-analysis. International Electronic Journal of Mathematics Education, 15 (1), em0549.

Tang, K. S. (2022). Material inquiry and transformation as prerequisite processes of scientific argumentation: Toward a social-material theory of argumentation. Journal of Research in Science Teaching, 59 (6), 969–1009.

Tang, K. S., & Williams, P. J. (2019). STEM literacy or literacies? Examining the empirical basis of these constructs. Review of Education, 7 (3), 675–697.

Teasdale, R., Ryker, K., Viskupic, K., Czajka, C. D., & Manduca, C. (2020). Transforming education with community-developed teaching materials: Evidence from direct observations of STEM college classrooms. International Journal of STEM Education, 7 , 1–22.

Tosun, C. (2022). Analysis of the last 40 years of science education research via bibliometric methods. Science & Education . https://doi.org/10.1007/s11191-022-00400-9

Waite, A. M., & McDonald, K. S. (2019). Exploring challenges and solutions facing STEM careers in the 21st century: A human resource development perspective. Advances in Developing Human Resources, 21 (1), 3–15.

Wang, H. H., Hong, Z. R., She, H. C., Smith, T. J., Fielding, J., & Lin, H. S. (2022a). The role of structured inquiry, open inquiry, and epistemological beliefs in developing secondary students’ scientific and mathematical literacies. International Journal of STEM Education, 9 (1), 1–17.

Wang, L. H., Chen, B., Hwang, G. J., Guan, J. Q., & Wang, Y. Q. (2022b). Effects of digital game-based STEM education on students’ learning achievement: A meta-analysis. International Journal of STEM Education, 9 (1), 1–13.

Wiesner, E., Weinberg, A., Fulmer, E. F., & Barr, J. (2020). The roles of textual features, background knowledge, and disciplinary expertise in reading a calculus textbook. Journal for Research in Mathematics Education, 51 (2), 204–233.

Yahya, M. S., & Hashim, H. (2021). Interdisciplinary learning and multiple learning approaches in enhancing the learning of ESL among STEM learners. Creative Education, 12 (5), 1057–1065.

Download references

This research was funded by the Ministry of Science and Technology of Taiwan, grant number MOST 111-2410-H-038-029-MY2.

Author information

Authors and affiliations.

Department of Dentistry, Tri-Service General Hospital, Taipei City, Taiwan

Wen-Song Su

Department of Dentistry, Armed Forces General Hospital, Taoyuan City, Taiwan

School of Nursing, College of Nursing, Taipei Medical University, 250 Wuxing Street, Taipei City, Taiwan

Ching-Yi Chang

You can also search for this author in PubMed   Google Scholar

Contributions

Wen-Song Su oversaw the research’s framework, gathered and analyzed data, and contributed to the manuscript’s composition. Ching-Yi Chang took the lead in forming the concept and analyzing the data and was involved in both the writing and revision processes. The author(s) have reviewed and given their approval to the final version of the manuscript.

Corresponding author

Correspondence to Ching-Yi Chang .

Ethics declarations

Ethical approval.

Not applicable.

Consent to Participate

Consent for publication, competing interests.

The authors declare no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Su, WS., Chang, CY. An Integrative Review with Word Cloud Analysis of STEM Education. J Sci Educ Technol (2024). https://doi.org/10.1007/s10956-024-10134-8

Download citation

Accepted : 19 June 2024

Published : 01 July 2024

DOI : https://doi.org/10.1007/s10956-024-10134-8

Share this article

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

  • Educational research
  • Higher education
  • Social science
  • Find a journal
  • Publish with us
  • Track your research
Property Value
Status
Version
Ad File
Disable Ads Flag
Environment
Moat Init
Moat Ready
Contextual Ready
Contextual URL
Contextual Initial Segments
Contextual Used Segments
AdUnit
SubAdUnit
Custom Targeting
Ad Events
Invalid Ad Sizes

Society for Nutrition Education and Behavior

  • Submit       Member Login

Access provided by

Login to your account

If you don't remember your password, you can reset it by entering your email address and clicking the Reset Password button. You will then receive an email that contains a secure link for resetting your password

If the address matches a valid account an email will be sent to __email__ with instructions for resetting your password

  • PDF [52 KB] PDF [52 KB]
  • Add To Online Library Powered By Mendeley
  • Add To My Reading List
  • Export Citation
  • Create Citation Alert

Use of Word Clouds as a Novel Approach for Analysis and Presentation of Qualitative Data for Program Evaluation

  • Douglas Mathews, MS, RD Douglas Mathews Contact Affiliations University of Maine, 5735 Hitchner Hall, Orono, ME 04469 Search for articles by this author
  • L. Franzen-Castle, PhD, RD L. Franzen-Castle Affiliations University of Nebraska-Lincoln Search for articles by this author
  • S. Colby, PhD, RD S. Colby Affiliations University of Tennessee-Knoxville Search for articles by this author
  • K. Kattelmann, PhD, RD K. Kattelmann Affiliations South Dakota State University Search for articles by this author
  • M. Olfert, DrPH, MS, RDN, LDN M. Olfert Affiliations West Virginia University Search for articles by this author
  • A. White, PhD, RD A. White Affiliations University of Maine Search for articles by this author

Target Audience

Theory, prior research, rationale, description, conclusions and implications, article info, publication history, identification.

DOI: https://doi.org/10.1016/j.jneb.2015.04.071

ScienceDirect

  • View Large Image
  • Download Hi-res image
  • Download .PPT

Related Articles

  • Access for Developing Countries
  • Articles & Issues
  • Articles In Press
  • Current Issue
  • List of Issues
  • Supplements
  • For Authors
  • Author Guidelines
  • Submit Your Manuscript
  • Statistical Methods
  • Guidelines for Authors of Educational Material Reviews
  • Permission to Reuse
  • About Open Access
  • Researcher Academy
  • For Reviewers
  • General Guidelines
  • Methods Paper Guidelines
  • Qualitative Guidelines
  • Quantitative Guidelines
  • Questionnaire Methods Guidelines
  • Statistical Methods Guidelines
  • Systematic Review Guidelines
  • Perspective Guidelines
  • GEM Reviewing Guidelines
  • Journal Info
  • About the Journal
  • Disclosures
  • Abstracting/Indexing
  • Impact/Metrics
  • Contact Information
  • Editorial Staff and Board
  • Info for Advertisers
  • Member Access Instructions
  • New Content Alerts
  • Sponsored Supplements
  • Statistical Reviewers
  • Reviewer Appreciation
  • New Resources
  • New Resources for Nutrition Educators
  • Submit New Resources for Review
  • Guidelines for Writing Reviews of New Resources for Nutrition Educators
  • Podcast/Webinars
  • New Resources Podcasts
  • Press Release & Other Podcasts
  • Collections
  • Society News

The content on this site is intended for healthcare professionals.

  • Privacy Policy   
  • Terms and Conditions   
  • Accessibility   
  • Help & Contact

RELX

Session Timeout (2:00)

Your session will expire shortly. If you are still working, click the ‘Keep Me Logged In’ button below. If you do not respond within the next minute, you will be automatically logged out.

Portrait of Nan Xiao

Create Engaging Word Cloud Visualizations from Your Research

  • data visualization
  • natural language processing

Many outstanding researchers and labs have created visualizations of their research using word clouds. In this post, I present a simple, automated “paper2wordcloud” workflow to create eye-catching word cloud visualizations. It combines the efficiency of automation with the power of human intuition and aesthetic sense. The figure below was created using my published papers .

A word cloud visualization generated for my papers using this workflow.

The general steps in the workflow are:

  • Collect PDF files representing your research (10 min).
  • Run a Python script to extract the top words from the PDF files (10 min).
  • Review, edit, and finalize the list of top words (20 min).
  • Use a word cloud generator, adjust the look, and generate SVG (15 min).
  • Convert the SVG file to a PDF/PNG file (5 min).

Now let’s dive into it.

Step 1: Collect your research

Collect all the PDF files that can represent your research, for example, papers, slides, posters, and proposals. Place all PDF files in a single, flat directory, without subfolders. The PDF files should be machine-readable, that is, the pages should not be scanned photocopies, and the text should be selectable in PDF viewers.

Step 2: Extract top words

2.1 install python.

Install Python if you haven’t. For macOS users, install Python via Homebrew :

This will install the latest maintained release of Python 3 provided by Homebrew.

2.2 Get text processing script and install dependency

Clone this GitHub repo: nanxstats/pdf-word-extraction . It contains a Python script I wrote for extracting meaningful words, as defined by a statistical model, from the PDF files.

Follow the workflow section in the repo readme to create a virtual environment in the cloned repository, activate it, and install the required Python packages into the virtual environment. This includes pypdf for PDF parsing, ftfy for text cleaning, and spaCy for natural language processing.

Everything below assumes you are in the directory with the virtual environment activated.

2.3 Run the script

Now, copy all the PDF files prepared in step 1 into pdf/ .

Then, run the Python script:

This will print the top 250 words and their frequencies.

Step 3: Review, edit, and finalize top words

Review the output and identify any words that should be removed or replaced. The common reasons include:

Removal : Words that are meaningful in general but not meaningful in your research context should be removed. Examples include “journal”, “conference”, “Figure”, “Table”, and author names.

Replacement : Uncommon proper nouns that should be stylized in a specific way can be fixed via replacement. The frequency counts for plural and singular forms of the same word can be merged via replacement, too.

To add word removal or replacement rules, open pdf_word_extraction.py . Edit the entries in the list words_to_remove and the key-value pairs in the dictionary replacements . Save and run the Python script again with the same command as before:

Check the output again. Since some words in the original output have now been removed or replaced, the words newly popped into the list might give you more words to remove or replace. Continue this review-edit-run cycle until the top 250 words looks perfect. For me, I ended up removing 50 words and establishing 12 replacement rules.

Each time after running the script, a top_words.txt will be generated or overwritten under the directory. We will use this file in the next step.

Step 4: Use the word cloud generator

Open top_words.txt , select all content, copy and paste into the word cloud generator described in my previous blog post , then click the “Refresh Word Cloud” button to generate an initial layout.

Adjust the graphical parameters based on your aesthetic preferences. Key parameters to consider include the color palette, font, scale transformation method, and the number of words to display.

Keep clicking the “Refresh Word Cloud” button until you achieve a layout you are satisfied with. Personally, I prefer a layout where all the major words are displayed horizontally. Click the “Download SVG” button to save the word cloud as an SVG file.

Step 5: Convert word cloud to a PDF/PNG file

See the appendix section of my previous blog post for a robust command-line workflow to convert the SVG file into a vector PDF file or a 300 DPI PNG file.

With these steps, you now have a professional word cloud visualization based on your research. Enjoy exploring your data in this visually engaging format!

This “paper2wordcloud” workflow demonstrates how to use Python to automate a seemingly difficult task that involves processing natural language data, while allows incorporating human knowledge and preferences. I’m quite amazed by how the text data processing toolchain in Python has advanced, making it a perfect choice for tasks like this.

  • Search Menu
  • Sign in through your institution
  • Volume 2024, 2024 (In Progress)
  • Volume 2023, 2023
  • Author Guidelines
  • Submission Site
  • Open Access
  • About Database
  • About the International Society for Biocuration
  • Editorial Board
  • Advertising and Corporate Services
  • Journals Career Network
  • Self-Archiving Policy
  • Journals on Oxford Academic
  • Books on Oxford Academic

International Society for Biocuration

Article Contents

Introduction, supplementary data, acknowledgements, conflict of interest..

  • < Previous

Wormicloud: a new text summarization tool based on word clouds to explore the C. elegans literature

ORCID logo

Valerio Arnaboldi, Jaehyoung Cho contributed equally to this work.

  • Article contents
  • Figures & tables
  • Supplementary Data

Valerio Arnaboldi, Jaehyoung Cho, Paul W Sternberg, Wormicloud: a new text summarization tool based on word clouds to explore the C. elegans literature, Database , Volume 2021, 2021, baab015, https://doi.org/10.1093/database/baab015

  • Permissions Icon Permissions

Finding relevant information from newly published scientific papers is becoming increasingly difficult due to the pace at which articles are published every year as well as the increasing amount of information per paper. Biocuration and model organism databases provide a map for researchers to navigate through the complex structure of the biomedical literature by distilling knowledge into curated and standardized information. In addition, scientific search engines such as PubMed and text-mining tools such as Textpresso allow researchers to easily search for specific biological aspects from newly published papers, facilitating knowledge transfer. However, digesting the information returned by these systems—often a large number of documents—still requires considerable effort. In this paper, we present Wormicloud, a new tool that summarizes scientific articles in a graphical way through word clouds. This tool is aimed at facilitating the discovery of new experimental results not yet curated by model organism databases and is designed for both researchers and biocurators. Wormicloud is customized for the C aenorhabditis   elegans literature and provides several advantages over existing solutions, including being able to perform full-text searches through Textpresso, which provides more accurate results than other existing literature search engines. Wormicloud is integrated through direct links from gene interaction pages in WormBase. Additionally, it allows analysis on the gene sets obtained from literature searches with other WormBase tools such as SimpleMine and Gene Set Enrichment.

Database URL : https://wormicloud.textpressolab.com

Given the overwhelming and constantly growing number of research papers published in biomedical research, finding relevant information from the scientific literature has become a challenging task. There are many strategies researchers have adopted over time in order to keep pace with recent scientific discoveries, such as subscribing to topic-based research blogs or setting up alerts on scientific search platforms such as PubMed MEDLINE or Europe PMC (Europe PubMed Central) ( 1 ). These platforms provide tools for querying relevant documents by free-text searches or controlled vocabulary to help researchers overcome the information overload by suggesting the most relevant list of articles. However, digesting information from the scientific articles returned by the queries may still be challenging, especially for large numbers of papers.

Biological data curation (biocuration) is aimed at extracting valuable knowledge from experimental results in the literature and making the extracted information readily available to researchers in an easy-to-interpret format ( 2 ). Curated data regarding specific model organisms are maintained by model organism databases such as WormBase, the main information resource for the nematode C aenorhabditis   elegans ( 3 ), among others, and by the recently formed Alliance of Genome Resources ( 4 ). Although biocuration is vital for modern biological research, it is mostly a manual process performed by expert curators, and there is a considerable time lag between the publication of an article and the inclusion of curated data into model organism databases. Finding new scientific results in the literature in a timely fashion would be greatly beneficial for both bench scientists and biocurators.

One of the possible solutions to overcome the information overload is to automatically generate summaries of collections of scientific articles such that key aspects of new results can be easily digested without reading all the articles in the collection. Different summarization solutions have been proposed in the literature, including automated generation of text summaries using computational linguistic techniques ( 5 ) or graphical summary generation and visualization methods, such as graphical summaries based on word clouds.

Word clouds (also known as tag clouds or wordles) are visual representations of text data that depict the most important keywords in one or more textual documents such as cloud-shaped collections of words with different sizes, colors, orientations and fonts. The importance of each keyword is measured by its frequency of appearance in the source documents, and it is reflected in the different size of the words in the word cloud in order to make important keywords more visible than others.

Word clouds have been largely used on the web to summarize information and have become very popular with the advent of Web 2.0 and of social media such as Flickr ( www.flickr.com ) and Delicious ( http://del.icio.us/ , a social bookmarking service discontinued in 2019). In this context, they were mainly used to aid website navigation through visual representation of page content. However, overuse and the somewhat limited effectiveness of this specific application led to a gradual decline of their popularity. They have been recently rediscovered as powerful tools for data summarization and analysis, and their effectiveness has been formally analyzed by several research studies ( 6 , 7 ).

A handful of tools that summarize research articles through word clouds have been proposed in the literature, even though, at the time of writing this paper, all of these were not accessible online or were not properly functioning. Perhaps the most promising of these tools in the context of biology is Genes2WordCloud ( 8 ). This tool generates word clouds from a list of genes or keywords by searching documents from different sources, including biological annotations from the Gene Ontology consortium ( 9 , 10 ) and abstracts of scientific articles from PubMed ( https://pubmed.ncbi.nlm.nih.gov/ ). The resulting word clouds summarize information related to the provided genes or keywords, highlighting prominent research topics in the related articles.

Another word cloud tool that was designed to summarize information in research articles from PubMed is LigerCat ( 11 ). This tool presents MeSH terms (the keywords in the controlled vocabulary used to index papers in PubMed) of articles retrieved through PubMed search API to build word clouds. Even though this technique proved to be effective in summarizing research aspects of a collection of articles, it excludes words not indexed as MeSH terms by PubMed and may therefore miss some important keywords not yet in the controlled vocabulary.

Kuo et   al . ( 12 ) designed a tool that allows users to summarize the results of PubMed searches through word clouds based on words extracted from abstracts. This tool presents a simple interface but does not provide additional tools for data analysis such as word trends over time.

In this paper, we present Wormicloud (worm information cloud), a novel visual tool based on word clouds that summarizes knowledge about specific research topics from a large amount of textual documents and facilitates new discoveries from data not yet curated by model organism databases. Wormicloud uses keywords from abstracts and gene names mentioned in the full text of research articles to generate word clouds, thanks to the advanced search functionality provided by the Textpresso Central text-mining system ( 13 ). Textpresso allows fine-grained searches on keywords and categories specifically based for C. elegans and other model organisms (e.g. gene names and other biological entities extracted from articles). In particular, it provides searches on full text of articles, whereas PubMed searches exclusively on abstracts. This makes Wormicloud able to find more articles relevant to specific biological aspects than those that can be found via PubMed searches.

Wormicloud helps users dynamically refine their searches by adding keywords with a click on the displayed words to narrow down the original list of documents until the desired level of detail in the search is reached. Most importantly, Wormicloud does not depend on any manual curation. All the processes, from text mining to presentation of word clouds, are performed automatically, thereby ensuring that the data presented to the user is always up-to-date and complete. Wormicloud includes also a graphical word trends analysis tool that allows the user to trace the use of specific words in the obtained word clouds over time. Wormicloud is integrated into WormBase and can also be used in combination with other bioinformatics tools, such as SimpleMine ( https://wormbase.org/tools/mine/simplemine.cgi ) and Gene-set enrichment analysis tool ( https://wormbase.org/tools/enrichment/tea/tea.cgi ) from WormBase.

Wormicloud is structured into a backend and a frontend component, as depicted in Figure 1 . The frontend allows users to perform keyword-based searches and displays the articles matching the search parameters in the form of word clouds. The interface also has an interactive reference list with the details about the articles used to build the word cloud, and a word trends analysis tool that displays the usage of specific words in the word cloud over time. More details on the frontend are provided in the Results section, where we describe the implementation of the UI and provide some use cases to show how it can be used for different research-oriented tasks.

Wormicloud components and interactions with Textpresso Central through the Textpresso API.

Wormicloud components and interactions with Textpresso Central through the Textpresso API.

In this section, we focus on the backend component, which searches for scientific articles by interfacing with the Textpresso Central Application Programmer Interface (API) and extracts lists of words from these articles with their respective frequency counters. The backend is geared towards C. elegans specific literature and nomenclature, but we plan to expand it to other organisms in the future, as explained in the Discussion section.

Textpresso ( 13 ) allows programmatic access to its functions through a public API ( https://textpressoapi.readthedocs.io/en/latest/?badge=latest ). This API allows keywords- and category-based full-text searches on scientific articles, including all C. elegans papers included in WormBase. The API returns abstracts and full text of the articles matching the search criteria, but it also provides the list of words in the text of each of these articles belonging to a specific category. Wormicloud backend uses the API to retrieve documents containing a list of user-provided keywords through full-text searches and combines the words in the abstracts of returned articles to build word clouds. Using abstracts only makes articles retrieval from Textpresso Central faster. In fact, even though searches are performed on the full text, the text of the matching articles is not retrieved automatically due to specific Textpresso internal optimization rules. Therefore, accessing the full text of a large set of articles would require too much time to provide an acceptable user experience. In addition, abstracts usually describe the core aspects of the research work, and the full text often includes related work and discussions on previous results that could add noise to the resulting word clouds. Wormicloud also uses the API to retrieve all genes, sequence names and protein names in the articles that match the search criteria, in order to build word clouds containing these entities, called ‘gene names only clouds’ in the user interface. Protein names are transformed to match their related gene names by converting them to lowercase. C. elegans gene names are standardized and defined by a specific nomenclature ( https://wormbase.org/about/userguide/nomenclature ). Textpresso identifies gene names in the full text of articles through regular expressions and matches approved gene names, sequence names and synonyms.

To extract words and counters from the list of abstracts obtained through the Textpresso API, the backend tokenizes the text in each of the abstracts—i.e. it breaks the text into individual linguistic units. We designed a custom tokenizer that considers particular biological entities as single words (e.g. C. elegans gene names, which often contain a dash, such as ‘daf-16’). Then, the resulting tokens are lemmatized in order to group together variations of the same base words. Lemmatization uses morphological analysis of words to identify their root, an approach that works well in the context of biology. Stemming is another possible technique that reduces different forms of a word to their common base form through a heuristic process that cuts off the ends of words. Note that stemming is used instead of lemmatization by all the other word cloud-based summarization tools in the literature. However, we decided to not apply stemming, as it turned out to flatten important differences between biological concepts (e.g. germ and germline). Note that lemmatization alone is not able to group all word variations, but it provides the best results for our use cases. A custom list of stopwords is also used to get rid of keywords considered noise in the C. elegans biological context. The lists of stopwords for the keyword-based clouds and gene only clouds are available in Supplemental Table S1 .

The Textpresso API returns a list of articles sorted by a relevance score (hereinafter ‘Textpresso score’) that reflects how well the results match with the provided search parameters, limited to a maximum number that can be controlled through the Wormicloud user interface. This limit makes Textpresso Central searches faster and filters out less relevant articles. We decided to give the user the option to choose between 200, 400 and 1000 as the maximum result number. The default value for searches is set to 200, which is the fastest possible option (since Textpresso returns 200 maximum results per query). As supported by the statistical analysis below, this value is sufficient also for the accuracy of results returned by broad searches, even though the user can still manually set the maximum number of results through the interface to 400 or 1000 for more accurate results. The user can also decide whether to count words in abstracts by plain frequency or by frequency weighted by the Textpresso score received by each paper. In the latter case, words in papers with low Textpresso scores with respect to the search criteria count less than words in papers with higher scores.

Choosing the optimal parameters for Wormicloud searches

We performed a statistical analysis to choose the default maximum number of papers to be fetched from the Textpresso API looking for the best trade-off between search speed and accuracy of the results. We measured how increasing the maximum number of results impacts the retrieval time and accuracy of the resulting list of words and their respective frequencies used to build the word clouds. To do so, we performed searches through the Textpresso API for a broad biological term, which matches a large number of papers. We decided to use the term ‘meiosis’—which returned 2985 documents through a regular search on Textpresso Central ( https://www.textpressocentral.org ) at the time of writing—and we performed two separate analyses on these searches: (i) we measured the query time for searches with 200, 400 and 1000 maximum number of results; (ii) we calculated similarity measures between the lists of keywords obtained by the Wormicloud backend software processing pipeline and ranked by their frequencies (both plain frequency and weighted by the Textpresso score).

Query times

The Textpresso API has a caching mechanism that makes subsequent searches faster, so we measured the time required for the first query and we also measured the average time for subsequent queries. For non-cached queries, it took 16.18 s for the 200 results query, 68.82 s for 400 results and 123.47 s for the 1000 results one. With caching, the average query time, over 10 observations, was 13.33 s (±0.99 s) for 200 results maximum, 29.28 s (±1.73 s) for 400 and 75.86 s (±8.08 s) for 1000 results. These figures tell us that increasing the number of maximum results returned by Textpresso to more than 200 (the maximum number of results that are packed by Textpresso in a single query) can lead to long search times for the user and this significantly impacts user experience.

Similarity between lists of words obtained with different thresholds

We calculated indices to measure how similar the lists of words obtained by the searches with different maximum number of results are. To do so, we first considered the presence/absence of words across the three lists obtained for the search ‘meiosis’ with different thresholds and calculated the percentage of words in the list for 1000 maximum results which are also in the list with 400 and 200 maximum results. We calculated this percentage both for the full lists and by limiting the analysis to the first 100 words sorted by their counters, as the number of words included in the word clouds displayed in Wormicloud is limited to 100. For the latter analysis, we considered two cases, the first with counters obtained by plain word frequency in the returned documents, and the second with frequencies weighted by the Textpresso score assigned to each document. To further analyze the similarity between the lists of words, we calculated the correlation of the lists ranked by their counters, this time also taking into account variations in the position of each word across lists. Also in this case, we considered both plain frequencies and frequencies weighted by the Textpresso scores as counters. For this analysis, we calculated Kendall’s tau correlation coefficient, a standard coefficient for ranked lists.

Overlapping words between lists

Searches for ‘meiosis’ returned 3877 distinct words with the maximum number of results set to 200, 5960 with 400 and 11 085 with 1000 results. All of the words in the 200 and 400 lists are contained in the 1000 list. The 200 list has a 34.98% overlap with the 1000 list, whereas the 400 list has an overlap of 53.77% with the 1000 list. When taking only the first 100 entries per list into account, ranked by plain word counter, the overlap with the elements in the 1000 list is 74% for the 200 list and 83% for the 400 list. When ranked by counter weighted by Textpresso score, the overlap between the list with 1000 and 200 results is 74%, and the one between 1000 and 400 results is 84%. These results tell us that even though the full list of words with their counters obtained using 200 and 400 maximum results from the Textpresso API is significantly shorter than that obtained from 1000 results (especially the 200 results list), the first 100 words in the lists are quite similar and the resulting word clouds are comparable. Moreover, weighting the counters by the score returned by Textpresso for the relative papers does not significantly change the first 100 words in the lists. Figure 2 gives a visual representation of the word clouds obtained with the three thresholds and plain frequency counts. As can be noted from the figure, the main keywords related to ‘meiosis’ in C. elegans are all present in the three cases. This is another indication that limiting the number of results from Textpresso to 200 does not significantly alter the resulting word cloud. Nonetheless, users can manually select a higher number of results if they want to perform more accurate analyses, especially for word trends and to obtain a more accurate reference list.

Word clouds for the keyword ‘meiosis’ obtained by combining 200, 400 and 1000 maximum results from the Textpresso API and with plain frequency word counts.

Word clouds for the keyword ‘meiosis’ obtained by combining 200, 400 and 1000 maximum results from the Textpresso API and with plain frequency word counts.

Correlation between ranked lists

For plain counters, the correlation between the list obtained with 1000 maximum results and the one obtained with 200 results is 0.61 ( P  < 0.01), and it is 0.72 ( P  < 0.01) for the lists obtained from 1000 and 200 maximum articles, respectively. The correlations do not change when considering the Textpresso score for ranking the words in the lists. When considering the first 100 words in the lists only, the correlation is 0.56 ( P  < 0.01) and 0.68 ( P  < 0.01) for the pairs of lists with 1000 and 200, and 1000 and 400 maximum results, respectively. Also, in this case, the correlations do not change significantly when the counters are weighted by Textpresso score. These correlations tell us that the lists of words returned using the different parameters for searches that match large numbers of papers are similar to each other not only in the specific words returned but also in their ranking within the lists. Nonetheless, in use cases where results from Textpresso have large differences in match score, advanced users (Textpresso users in particular) may still want to use counters weighted by Textpresso score. For this reason, we decided to leave the option of choosing which counters to use to the user.

Wormicloud implementation

We developed Wormicloud as a web application available at https://wormicloud.textpressolab.com . The frontend component is written in JavaScript using the React framework, and the backend is a python program based on the Falcon framework ( https://falcon.readthedocs.io/en/stable/ ). The Textpresso Central API, developed in a separate project, is written in C++ ( https://textpressoapi.readthedocs.io/en/latest/?badge=latest ).

Wormicloud is open source and available at https://github.com/WormBase/wormicloud .

Search interface

The Wormicloud frontend component (UI) allows users to insert a list of keywords and the desired word cloud format (all keywords from abstracts or gene names only from full text). Also, with the ‘Advanced options’ button, users can add additional selectors such as publication year range, author names, the maximum number of articles to be used to build the word clouds, and the method for counting word frequencies, a plain count or weighted by TextpressoCentral paper score. As depicted in Figure 3 , keywords can be combined to search for documents containing at least one of them (‘OR’ option) or all of them (‘AND’ option). The ‘AND’ option generates word clouds containing only words that are present in all articles returned by searching each keyword separately instead of the union of all words contained in articles mentioning all the provided keywords. This makes the resulting word clouds more focused on aspects overlapping in all the returned articles. In addition, users can perform searches by author names only, without providing any specific keyword.

Wormicloud search interface with ‘advanced options’ menu expanded.

Wormicloud search interface with ‘advanced options’ menu expanded.

Wormicloud displays the results of searches with a combined view of word clouds ( Figure 4 ), reference list ( Figure 5 ) and word trends tool ( Figure 6 ).

Word cloud displayed by Wormicloud for the query ‘DREAM complex’.

Word cloud displayed by Wormicloud for the query ‘DREAM complex’.

Reference list displaying the articles used to generate the word cloud in Figure 4.

Reference list displaying the articles used to generate the word cloud in Figure 4 .

Word trends tool displaying the yearly usage of the first five words by number of mentions in the abstracts of the articles used to generate the word cloud in Figure 4. In this example, the query was ‘DREAM complex’.

Word trends tool displaying the yearly usage of the first five words by number of mentions in the abstracts of the articles used to generate the word cloud in Figure 4 . In this example, the query was ‘DREAM complex’.

Word cloud interface

Word clouds displayed by Wormicloud are based on a React word cloud package ( https://www.npmjs.com/package/react-wordcloud ). The package takes care of finding the best layout for words and displays them in different colors to maximize readability. The word clouds are interactive in that users can click on each word to add them back to the list of keywords in the search interface. In this way, users can refine their searches to find the most relevant research for their purposes. The word cloud component includes a set of buttons with different functions:

Redraw cloud button

Redraws the word cloud by re-applying the algorithm that positions the words and assigns colors.

Download counters

Downloads a csv file containing all the words obtained from the backend with their counters. Note that this file can contain more words than those displayed in the word cloud, which are limited to 100.

Export as JPEG

Downloads an image of the word cloud in jpeg format.

View on SimpleMine (for gene names word clouds only)

Opens SimpleMine search interface ( https://wormbase.org/tools/mine/simplemine.cgi ) with the field for the list of genes pre-filled with the gene names in the word cloud. SimpleMine is a WormBase tool for the retrieval of essential gene information.

View on Gene Set Enrichment tool (for gene names word clouds only)

Opens WormBase tissue enrichment analysis tool ( https://wormbase.org/tools/enrichment/tea/tea.cgi ) with the field for the list of genes pre-filled with the gene names in the word cloud.

Reference list interface

When a word cloud is displayed, users also see a list of the references used to build the word cloud at the bottom of the screen ( Figure 5 ). This is an interactive component that allows users to sort the list of references by relevance (the score received by each article from Textpresso based on how well it matched the search), title, journal, date, and WormBase and PubMed IDs. The list is paginated for easier navigation. In addition, the user can download the list of references to file in csv format.

Word trends interface

The word trends interface ( Figure 6 ) is accessible by clicking the ‘Word Trends’ tab that appears to the right of the ‘References’ tab when a word cloud is generated. This interface allows the user to visualize the number of mentions in abstracts per year of each of the words obtained from the backend and used to build the word cloud. These are displayed as an interactive graph. Note that the number of mentions included in the graph considers abstracts only, as for the word clouds. By default only the five words with the highest counters are displayed, but the user can add more words by searching them through the autocomplete text input on the right. The user can also remove words from the graph by unchecking the checkboxes near each word.

Finding new experimental results with Wormicloud

In the previous sections, we showed how Wormicloud generates a visual abstract starting from a combination of search keywords and helps users discover a common theme from the articles published in the research area related to each keyword. Here, we show some examples of how Wormicloud can be used to mine information on complex data types such as biological pathways or protein complexes from a simple gene pair. In most cases, protein–protein interaction data as a form of gene pair is not easy to understand, especially when the data are from high-throughput approaches such as mass-spectrometry and yeast two-hybrid screening. In fact, in this case, the importance of gene-gene interactions is usually not easy to assess if there is no further information on the context of the interactions. Using Wormicloud, if a user obtains a gene pair ( lin-9 and lin-35 ) from interaction data, they can simply insert the gene names sequentially in the search interface and get a word cloud displaying the keywords ‘transcriptional’, ‘repression’, ‘DREAM’ and ‘complex’, which successfully capture the underlying biological relationship between the genes ( Figure 7A ) and are not easy to obtain with other methods. For the readers who are not familiar with these keywords, the DREAM complex comprises six additional proteins (LIN-37, LIN-52, LIN-53, LIN-54, DPL-1 and EFL-1) as well as LIN-9 and LIN-35. This protein complex functions as a transcriptional repressor complex that controls the expression of the key genes for cell cycle and development ( 14 ).

Use cases of Wormicloud in mining complex data from literature and its analysis. (A) keyword cloud obtained by entering the keywords ‘lin-9’ and ‘lin-35’ in the Wormicloud search interface. Color highlighted entities show the biological function of lin-9 and lin-35. (B) Gene name cloud for ‘lin-9’ and ‘lin-35’ captures all the essential components in the DREAM complex, which are highlighted in color. (C) Gene ontology enrichment analysis of all genes obtained from the gene name word cloud in Figure 7B recapitulates the major information captured in Figure 7A. (Note that we have manually grayed out terms from Figure 7A and B to highlight the importance of some of the remaining terms in color and to improve readability, but since Wormicloud does not have a measure of ‘biological relevance’ of terms the results in the word clouds generated by the tool are all in color.)

Use cases of Wormicloud in mining complex data from literature and its analysis. (A) keyword cloud obtained by entering the keywords ‘lin-9’ and ‘lin-35’ in the Wormicloud search interface. Color highlighted entities show the biological function of lin-9 and lin-35 . (B) Gene name cloud for ‘lin-9’ and ‘lin-35’ captures all the essential components in the DREAM complex, which are highlighted in color. (C) Gene ontology enrichment analysis of all genes obtained from the gene name word cloud in Figure 7B recapitulates the major information captured in Figure 7A . (Note that we have manually grayed out terms from Figure 7A and B to highlight the importance of some of the remaining terms in color and to improve readability, but since Wormicloud does not have a measure of ‘biological relevance’ of terms the results in the word clouds generated by the tool are all in color.)

By selecting the option to generate word clouds with only gene names, users can also get a whole set of gene names studied in the literature related to the queried keywords ( Figure 7B ). From this gene name word cloud, users can get essential information about what genes often studied with lin-9 and lin-35 . The word cloud depicted in Figure 7B includes all the genes ( lin-9, lin-35, lin-37, lin-52, lin-53, lin-54, dpl-1 and efl-1 ) that encode the protein components of ‘DRM (or DREAM) complex’ ( 14 ) as well as other involved genes. Therefore, gene name clouds successfully provide important information about the components of a protein complex or genes in a biological pathway, which are not easily obtained from biological databases.

Gene name clouds also provide further analysis options for the gene set through other bioinformatics tools in WormBase, namely Gene-set enrichment analysis and SimpleMine. Wormicloud sends the list of genes in gene name clouds to the gene-set enrichment analysis tool and redirects the user to the web pages of the tool which show three different enrichment analysis results for tissue, phenotype and gene ontology terms annotated in WormBase. This list of genes, combined with additional analyses provided by the WormBase Gene-set enrichment analysis tool, provide important clues to find what kind of biological processes or molecular functions are related to the queried keywords ( Figure 7C ). When comparing the results in Figure 7A and C , we found that the word cloud is well matched to the gene ontology enrichment result for the molecular functions which are based on the annotated data in WormBase. Users can also analyze the gene list by using a batch data-mining tool, SimpleMine. SimpleMine retrieves all essential bioinformatics data in at least 30 different topics from WormBase as shown in Supplemental Table S2 . For the user convenience, we summarized all the procedures described above in the tutorial video ( supplemental movie 1 ).

C. elegans is one of the most popular model systems for study of genes implicated in human diseases ( 15 , 16 ). Therefore, it is useful to get a comprehensive list of genes used in the previous research articles related to a specific human disorder. By using the gene name cloud tool in Wormicloud and downloading the obtained list of keywords, users can easily obtain the list of all genes related to any disease names or disease phenotype terms of interest. For example, Alzheimer’s disease has been studied with 563 genes from the top 200 best matched research articles. In another example, human autism-related research articles have mentioned 1377 genes from the same number of best matched articles ( Supplemental Table S3 ).

Curation of certain data types such as lists of gene components of a protein complex or in biological pathways is not easy to keep up-to-date. Wormicloud can be a good alternative source of information for these data types, especially through gene name clouds. Wormicloud can also be used to get a summary of research articles from any type of keywords, including materials (e.g. drug and chemical names), phenotypes and allele names. In addition, understanding how information changes over time in a certain research area is not an easy task; Wormicloud can provide such longitudinal information with its word trends tool.

Current biological databases store a tremendous amount of information in diverse research areas. Therefore, finding the relevant information might be daunting for a naive user. In WormBase, automated gene descriptions make the key data easier to access by providing a detailed but easy to read summary of curated data for each gene ( 17 ). However, gene descriptions are related to single genes, and they do not help in the comparison of multiple genes to find any overlapping data. In this case, users still need to have advanced bioinformatics skills. Wormicloud can help solve this problem by generating a word cloud for multiple genes in WormBase.

Limitations and technical challenges

One of the main limitations of word clouds is that some displayed words may be closely related to each other. Grouping or clustering them can help improve the visualization and obtain more meaningful results. We plan to provide an option to group terms based on their distance in ontologies or other possible measures of distance as part of future improvements.

We designed Wormicloud to complement Textpresso Central with visual tools to explore the research literature and facilitate research, but Wormicloud has somewhat limited features compared to the main Textpresso search interface. In fact, Wormicloud cannot return more than 1000 results, and users should use Textpresso Central instead for large searches and for more advanced search options. Clearly, more results can provide a better chance of finding relevant papers. However, this goal can be achieved only when efficient filtering tools are available. Wormicloud is best suited for generating word clouds from up to 1000 papers at once, whereas the Textpresso search interface can return many more results, but it does not provide a graphical summary of the results.

Textpresso currently updates its corpus every month. Moreover, Textpresso includes only papers already present in the WormBase corpus. Therefore, some delay may occur between publication and inclusion in Textpresso and consequent availability in Wormicloud.

Wormicloud for other organisms

The flexible nature of Wormicloud is expected to make it easy to apply it to the literature of other organisms. In addition, Textpresso already covers several organisms through PubMed open access articles. Some components of the current Wormicloud implementation have been specifically designed for C. elegans . In particular, the Textpresso API used to return gene names works only with C. elegans genes, and the text-mining module in the Wormicloud backend component uses a list of stopwords specific to WormBase data. As part of our future work, we plan to extend Wormicloud to the literature of other organisms, starting from the Alliance of Genome Resources. To do this, we need to improve our text-mining module and expand searches on Textpresso.

Wormiclould is a useful tool for summarizing large and heterogeneous data from sources such as WormBase, and we think it would be applicable to a broad range of organisms and topics for which there are curated data. In particular, Wormicloud can be very useful to make a snapshot of the curated data from multiple gene pages of diverse model organisms such as those included in the Alliance of Genome Resources. This word cloud result can be interactive for filtering the original query and/or navigating the gene page related to a word or word group. From the single-cell level to the whole genomic level, integrating information from multiple sources has become vital for research. This more and more requires systematic approaches using comparative bioinformatics. Wormicloud, with its intuitive yet powerful interface, can be used to conveniently explore such comparative studies through word cloud images showing common topics among multiple genes.

Supplementary data are available at Database Online.

We thank Ranjana Kishore, Daniela Raciti and Eduardo da Veiga Beltrame for their comments and suggestions on the manuscript.

National Institutes of Health/National Human Genome Research Institute grants [U24HG002223 (WormBase) and U24HG010859 (Alliance Central)].

The authors declare that they have no conflict of interest.

Landhuis E. ( 2016 ) Scientific literature: information overload . Nature , 535 , 457 – 458 . doi: doi: 10.1038/nj7612-457a.

Google Scholar

International Society for Biocuration . ( 2018 ) Biocuration: distilling data into knowledge . PLoS Biol. , 16 , e2002846. doi: doi: 10.1371/journal.pbio.2002846.

Harris T.W. , Arnaboldi V. , Cain S.  et al.  ( 2020 ) WormBase: a modern model organism information resource . Nucleic Acids Res. , 48 , D762 – D767 . doi: doi: 10.1093/nar/gkz920.

The Alliance of Genome Resources Consortium . ( 2020 ) Alliance of Genome Resources Portal: unified model organism research platform . Nucleic Acids Res. , 48 , D650 – D658 . doi: doi: 10.1093/nar/gkz813.

Al Saied H. , Dugué N. and Lamirel J. ( 2018 ) Automatic summarization of scientific publications using a feature selection approach . Int. J. Digit Libr. , 19 , 203 – 215 . doi: doi: 10.1007/s00799-017-0214-x.

Felix C. , Franconeri S. and Bertini E. ( 2018 ) Taking word clouds apart: an empirical investigation of the design space for keyword summaries . IEEE Trans. Vis. Comput. Graph. , 24 , 657 – 666 . doi: doi: 10.1109/TVCG.2017.2746018.

Oesper L. , Merico D. , Isserlin R.  et al.  ( 2011 ) WordCloud: a Cytoscape plugin to create a visual semantic summary of networks . Source Code Biol. Med. , 6 , 7. doi: doi: 10.1186/1751-0473-6-7.

Baroukh C. , Jenkins S.L. , Dannenfelser R.  et al.  ( 2011 ) Genes2WordCloud: a quick way to identify biological themes from gene lists and free text . Source Code Biol. Med. , 6 , 15. doi: doi: 10.1186/1751-0473-6-15.

Ashburner M. , Ball C.A. , Blake J.A.  et al.  ( 2000 ) Gene ontology: tool for the unification of biology . Nat. Genet. , 25 , 25 – 29 . doi: doi: 10.1038/75556.

The Gene Ontology Consortium . ( 2019 ) The Gene Ontology Resource: 20 years and still going strong . Nucleic Acids Res. , 47 , D330 – D338 . doi: doi: 10.1093/nar/gky1055.

Sarkar I.N. , Schenk R. , Miller H.  et al.  ( 2009 ) LigerCat: using “MeSH Clouds” from journal, article, or gene citations to facilitate the identification of relevant biomedical literature . AMIA Annu. Symp. Proc. , 2009 : 563 – 567 .

Kuo B.Y.-L. , Hentrich T. , Good B.M.  et al.  ( 2007 ) Tag clouds for summarizing web search results . In: Proceedings of the 16th International Conference on World Wide Web (WWW ‘07) . Association for Computing Machinery , New York, NY, USA . pp. 1203 – 1204 . doi: doi: 10.1145/1242572.1242766.

Müller H.M. , Van Auken K.M. , Li Y.  et al.  ( 2018 ) Textpresso Central: a customizable platform for searching, text mining, viewing, and curating biomedical literature . BMC Bioinform. , 19 , 94. doi: doi: 10.1186/s12859-018-2103-8.

Harrison M.M. , Ceol C.J. , Lu X.  et al.  ( 2006 ) Some C. elegans class B synthetic multivulva proteins encode a conserved LIN-35 Rb-containing complex distinct from a NuRD-like complex . Proc. Natl. Acad. Sci. USA , 103 , 16782 – 16787 . doi: doi: 10.1073/pnas.0608461103.

Apfeld J. and Alper S. ( 2018 ) What can we learn about human disease from the Nematode C. elegans?   Methods Mol. Biol. (Clifton, N.J.) , 1706 , 53 – 75 . doi: doi: 10.1007/978-1-4939-7471-9_4.

Markaki M. and Tavernarakis N. ( 2020 ) Caenorhabditis elegans as a model system for human diseases . Curr. Opin. Biotechnol. , 63 , 118 – 125 . doi: doi: 10.1016/j.copbio.2019.12.011.

Kishore R. , Arnaboldi V. , Van Slyke C.E.  et al.  ( 2020 ) Automated generation of gene summaries at the Alliance of Genome Resources . Database , 2020 , baaa037. doi: doi: 10.1093/database/baaa037.

Author notes

Month: Total Views:
April 2021 646
May 2021 163
June 2021 121
July 2021 148
August 2021 113
September 2021 110
October 2021 108
November 2021 116
December 2021 76
January 2022 82
February 2022 91
March 2022 113
April 2022 74
May 2022 55
June 2022 64
July 2022 71
August 2022 49
September 2022 105
October 2022 79
November 2022 63
December 2022 40
January 2023 68
February 2023 53
March 2023 89
April 2023 57
May 2023 77
June 2023 84
July 2023 87
August 2023 100
September 2023 76
October 2023 63
November 2023 51
December 2023 66
January 2024 62
February 2024 72
March 2024 31
April 2024 41
May 2024 36
June 2024 44

Email alerts

Citing articles via.

  • Recommend to your Library

Affiliations

  • Online ISSN 1758-0463
  • Copyright © 2024 Oxford University Press
  • About Oxford Academic
  • Publish journals with us
  • University press partners
  • What we publish
  • New features  
  • Open access
  • Institutional account management
  • Rights and permissions
  • Get help with access
  • Accessibility
  • Advertising
  • Media enquiries
  • Oxford University Press
  • Oxford Languages
  • University of Oxford

Oxford University Press is a department of the University of Oxford. It furthers the University's objective of excellence in research, scholarship, and education by publishing worldwide

  • Copyright © 2024 Oxford University Press
  • Cookie settings
  • Cookie policy
  • Privacy policy
  • Legal notice

This Feature Is Available To Subscribers Only

Sign In or Create an Account

This PDF is available to Subscribers Only

For full access to this pdf, sign in to an existing account, or purchase an annual subscription.

  • UC Berkeley
  • Sign Up to Volunteer
  • I School Slack
  • Alumni News
  • Alumni Events
  • Alumni Accounts
  • Career Support
  • Academic Mission
  • Diversity & Inclusion Resources
  • DEIBJ Leadership
  • Featured Faculty
  • Featured Alumni
  • Work at the I School
  • Subscribe to Email Announcements
  • Logos & Style Guide
  • Directions & Parking

The School of Information is UC Berkeley’s newest professional school. Located in the center of campus, the I School is a graduate research and education community committed to expanding access to information and to improving its usability, reliability, and credibility while preserving security and privacy.

  • Career Outcomes
  • Degree Requirements
  • Paths Through the MIMS Degree
  • Final Project
  • Funding Your Education
  • Admissions Events
  • Request Information
  • Capstone Project
  • Jack Larson Data for Good Fellowship
  • Tuition & Fees
  • Women in MIDS
  • MIDS Curriculum News
  • MICS Student News
  • Dissertations
  • Applied Data Science Certificate
  • ICTD Certificate
  • Citizen Clinic

The School of Information offers four degrees:

The Master of Information Management and Systems (MIMS) program educates information professionals to provide leadership for an information-driven world.

The Master of Information and Data Science (MIDS) is an online degree preparing data science professionals to solve real-world problems. The 5th Year MIDS program is a streamlined path to a MIDS degree for Cal undergraduates.

The Master of Information and Cybersecurity (MICS) is an online degree preparing cybersecurity leaders for complex cybersecurity challenges.

Our Ph.D. in Information Science is a research program for next-generation scholars of the information age.

  • Summer 2024 Course Schedule
  • Fall 2024 Course Schedule

The School of Information's courses bridge the disciplines of information and computer science, design, social sciences, management, law, and policy. We welcome interest in our graduate-level Information classes from current UC Berkeley graduate and undergraduate students and community members.  More information about signing up for classes.

  • Ladder & Adjunct Faculty
  • MIMS Students
  • MIDS Students
  • 5th Year MIDS Students
  • MICS Students
  • Ph.D. Students

Photo of three people talking together at a table with a laptop computer.

  • Publications
  • Centers & Labs
  • Computer-mediated Communication
  • Data Science
  • Entrepreneurship
  • Human-computer Interaction (HCI)
  • Information Economics
  • Information Organization
  • Information Policy
  • Information Retrieval & Search
  • Information Visualization
  • Social & Cultural Studies
  • Technology for Developing Regions
  • User Experience Research

Research by faculty members and doctoral students keeps the I School on the vanguard of contemporary information needs and solutions.

The I School is also home to several active centers and labs, including the Center for Long-Term Cybersecurity (CLTC) , the Center for Technology, Society & Policy , and the BioSENSE Lab .

  • Why Hire I School?
  • Request a Resume Book
  • Leadership Development Program
  • Mailing List
  • For Nonprofit and Government Employers
  • Jobscan & Applicant Tracking Systems
  • Resume & LinkedIn Review
  • Resume Book

I School graduate students and alumni have expertise in data science, user experience design & research, product management, engineering, information policy, cybersecurity, and more — learn more about hiring I School students and alumni .

  • Press Coverage
  • Word Clouds: We Can’t Make Them Go Away, So Let’s Improve Them

images of Eric Meyer smiling with arms crossed wearing a yellow tie

Eric T. Meyer has been appointed dean of the UC Berkeley School of Information and will begin his new job on...

Michael Buckland

Michael Buckland, professor emeritus at the School of Information, is the recipient of this year’s Emeriti of the...

logo of an iceberg with stripes behind it

The UC Berkeley School of Information is launching IceBerk, a multidisciplinary lab focused on using...

photo of a dark conference hall where someone presents on a screen

Ten faculty, students, and alumni from the School of Information presented their research at this year’s CHI...

  • Distinguished Lecture Series
  • I School Lectures
  • Information Access Seminars
  • CLTC Events
  • Women in MIDS Events

I School 5K

By Marti Hearst

The field of information visualization still hasn’t found a lay-person friendly way to visualize the contents of text. Important speeches, conference paper titles, or letters in a historical archive most frequently wind up in a word cloud. For instance, USA Today showed this summary of President Obama’s final State of the Union Address:

word cloud research paper

Word clouds are eye-catching, engaging, and easy to produce using online tools. They showcase which words are the biggest, and therefore the most frequent. And you can format them in even more engaging ways:

Thanks to online tools, word clouds are easy to make, and they are engaging, and that is likely a big reason why they are widely used (see this paper by Viegas, Wattenberg, and Feinberg, and Chapter 3 of Beautiful Visualization by Feinberg). No harm done when people are having fun. But word clouds are often used for serious applications, such as data science work or communication of scientific data. And journalists love them for state of the union speeches. The problem is that word clouds are poor ways to show relative values and to summarize information.

Some Problems with Word Clouds

Consider the summary of the speech written on the USA Today website just above the word cloud:

Obama defended the progress made over the last seven years and set out an agenda that will likely remain unfinished long after his presidency ends: turning back the effects of climate change, launching a “moonshot” to cure cancer, and a grassroots movement to demand changes in the political system.

Take a look at the word clouds above, and ask yourself how much of this summary is visible in them at a first glance. What about a second glance?

Since Obama only mentioned the word cancer twice, its frequency is far from the top 150 words, and it has no hope of appearing in the word cloud. He also talks about Ebola , malaria , and HIV/AIDS , as well as science and research , which are all used to express his goals for medical research.

But these words individually are not frequent enough to turn up in a word cloud, despite the fact that they group together within a single concept. This is one of the problems of using raw word counts as summaries; they are biased towards making you notice words for which there are few alternatives. As this example shows, you can’t see the medical research words in the word clouds. You may have also noticed by now that you can’t spot the references to climate change either, although they are there in the speech: oil , fuel , solar , gas , wind , energy , planet , etc.

In fact, the speech covered a wide range of topics, including veterans returning from Afghanistan, the Iran-nuclear deal, fighting Ebola, and so on, each with its own sub-vocabulary. Obama’s discussion of jobs and the economy is the only topic that clearly stands out from the visualizations. The other prominent words, such as the variations on America , new , nation , and people , are the fabric from which the speech is woven.

We won’t dwell here on why word clouds are inaccurate ( see these references ), but suffice it to say that if you try to guestimate how often each word occurs, you’ll likely be pretty far off. Furthermore, most people do not know that the size of the word is supposed to correspond to its frequency (see our prior work), and users of word cloud tools regularly manipulate the size to make the visualization look better. And finally, some word clouds get downright goofy. Something that looked pretty similar to this one turned up in our email one day:

word cloud research paper

Designing a Better Word Cloud

So what design would be better? I’m partial to selecting a subset of important words and comparing the use of those words within or across speeches; Barbara Maseda has a nice collection of this kind of design.

But word clouds have an appeal that is hard to deny. We set out to see if we could build a better word cloud: retaining their visual appeal, but making them more comprehensible. Our thinking is if we select words more carefully and organize them by concept, that would lead to better understanding of the underlying topics of a document. We aren’t the first to suggest this. Jeff Clark, a master of text visualization, has a version of this idea for books , and the TopicPanorama project of Wang et al. has a word cloud composed of words drawn from up to four topics.

We tested the effectiveness of this idea through a sequence of careful studies. We showed that word clouds can be more understandable with just two major changes to how they are built today. First:

Semantically Organize

The words drawn from the document(s) must be grouped in a meaningful way into a few categories that make sense to the reader. The problem with this step is that most unsupervised algorithms for finding topics in text are unable to produce understandable, distinct categories. We’re looking at you, hierarchical clustering, LDA, LSA, etc. None of these automated algorithms generally succeed at subdividing the words in a document into semantically distinct and coherent groups. So for now, this subdivision needs to be done manually. Second:

Visually Subdivide

A big problem with standard word clouds is that the words are all jumbled; the words for business and for economy are far apart in the visualizations shown above. Words that are shown close to one another imply that they are related in meaning, that is, about the same topic or in the same category. There are several ways we can visually suggest closeness in a word cloud; we can do any or all of these:

Place words from the same category in a group near one another

Separate groups of words from one another with open space

Assign the same color to words in the same category

These strategies make use of basic principles from perceptual psychology . We call the application of this approach WordZones . Let’s try it with the Obama State of the Union speech. Compare it to the first two word clouds to see if you think it is easier to interpret.

word cloud research paper

Here we selected a subset of the words from the speech, based on a combination of which were most frequent and which grouped into categories. We could have shown more topics, but the design would have gotten pretty large. This is just one layout option we found had strong user preference scores; many others are possible.

A Controlled Study of Word Cloud Layouts

Read on if you want to know more about how we showed, in a paper recently published in IEEE TVCG , that this kind of design is a better way to go than standard word clouds.

Our key findings were:

Visually grouped layouts are more effective in time-constrained category understanding tasks, compared to ungrouped layouts.

Visual grouping can be done by separating categories via whitespace or by color distinction, or both together.

Layouts defined by white space tend to be preferred over more tightly packed, less organized looking layouts for analytic tasks.

We conducted four experiments. For the first three, the task was to name all the categories in the design without reusing any of the words shown, like the game of Taboo. This is a contribution of the work itself — up to now, each piece of research on word clouds has worked with a different dataset. We designed and tested a set of 60 categories, consisting of five words each, that can be mixed and combined in different ways for future evaluations of word cloud designs. For example:

In the fourth task, we asked participants to compare four designs and tell us their preferences of the different layouts according to four criteria: readability, informativeness, visual appeal, and engagement. They were told to assume the word clouds were on a flyer advertising a class that they were considering taking, and rate the designs along both aesthetic and functionality dimensions. This was inspired by the biology course design above. Which one do you think is best?

word cloud research paper

Participants gave low scores to the word cloud design across readability, informativeness, and visual appeal, with the exception of engagement, for which it was very similar to the other designs. The Column and Radial view were similar to each other across the criteria, suggesting that people might prefer them pretty much equally for many situations. The semantically organized spatial layout fell in between, which surprised us, since we thought people might prefer the spatial grouping typical of word clouds if there was no time pressure.

The Take-Home Message

If you organize the words into coherent groups, and arrange them into spatially proximal groups, people will be able to understand the underlying meaning of the document much better than if you use a standard word cloud. The groups are more effective if distinguished by color and by leaving space between them, and these arrangements are also preferred (at least for people trying to understand the underlying concepts).

The downsides of this approach are that it (currently) requires manual analysis to create the categories, which requires time and judgement calls. Word clouds do too, but you could argue that there is a standard way to make them, without editorializing, which might be important in political contexts. But we argue that the very randomness of word clouds, and the arbitrary cut offs of the counts, and the misleadingness of the relative size of the words, does not make them neutral by any means.

We hope that the next time you need to visualize a document, you will reconsider choosing a word cloud. Instead, try a more sophisticated algorithm and visualization, or if you have to use a word cloud, at least organize text into zones of meaning via spatial or color grouping, so everyone can get the gist, while they have something aesthetic to look at. :)

The article was originally published on Medium. Reprinted by permission.

Marti Hearst is a professor at UC Berkeley

Steven Franconeri , Paul Laskowski , Elsie Lee , Lekha Patil , and Emily Pedersen are co-authors on the research paper that this post describes.

Read the Full Research Article

M. Hearst, E. Pedersen, L. P. Patil, E. Lee, P. Laskowski and S. Franconeri, “An Evaluation of Semantically Grouped Word Cloud Designs,” in IEEE Transactions on Visualization and Computer Graphics . (2019)

Professor Marti Hearst

Last updated:

  • Application

WordCloud.app blog

Enhancing Academic Work with Word Clouds

Word clouds are not just a beautiful visual representation of text; they are also an incredibly useful tool for academics. Whether you are a student, researcher, or educator, integrating word clouds into your daily academic life can bring numerous benefits. In this article, we will explore how you can leverage word clouds to enhance your academic work and how WordCloud.app can be the go-to tool for all your word cloud needs.

1. Visualizing Concepts and Ideas

Word clouds offer a unique way to visualize complex concepts and ideas. Instead of poring over pages of text, you can create a word cloud that highlights the most important keywords and phrases. This helps you to quickly grasp the main themes and relationships within your research or study material. With WordCloud.app, you can easily create custom word clouds by either entering your own text or analyzing web pages and books.

2. Analyzing Textual Data

As an academic, you often need to analyze large volumes of text, such as research papers, articles, or books. Word clouds can be a valuable tool for understanding the key themes and trends within your textual data. By creating a word cloud, you can identify recurring terms, visualize the frequency of specific words, and gain insights into the overall composition of the text. WordCloud.app’s ability to analyze web pages and books makes it a powerful tool for text analysis.

3. Presenting Findings and Data

When presenting your research or findings, word clouds offer an engaging and visually appealing way to present data. Instead of presenting a lengthy bullet-point list or overwhelming charts, you can use a word cloud to highlight the most important keywords and concepts. WordCloud.app provides a wide range of customization options, allowing you to choose different shapes, colors, and fonts to ensure your word cloud suits your presentation needs.

4. Collaborative Work and Brainstorming

Whether you are working on a group project or brainstorming ideas, word clouds can facilitate collaboration and creativity. By creating a collaborative word cloud, each team member can contribute their ideas, and the word cloud helps to visualize the collective input. WordCloud.app’s integrations with Figma and Miro make it seamless to create and share word clouds within your collaborative workspace.

5. Personalized Study Aids

Word clouds are not just for research and presentations; they can also serve as effective study aids. By creating a word cloud of key terms, definitions, or important concepts, you can have a visual reference to review and reinforce your understanding. Customizing the shape, colors, and fonts of the word cloud can make it visually appealing and enjoyable to study.

With its extensive features and user-friendly interface, WordCloud.app is the ideal tool for all your word cloud needs. From its curated collection of inspiring word clouds to its ability to analyze web pages and books, WordCloud.app offers a diverse range of possibilities to create stunning and meaningful visual representations of text.

So, whether you are a student looking to enhance your study materials, a researcher aiming to make your findings visually appealing, or an educator seeking creative teaching aids, WordCloud.app is the perfect companion to transform your academic work into captivating word clouds.

Visit WordCloud.app to unleash your creativity and discover the endless possibilities of word clouds today!

Related Posts

How word clouds can help language learners, how teachers and educators can use word clouds to enhance learning.

Douglas C. Wu

From molecular biology to computation

word cloud research paper

Word Cloud From Publications

Word cloud ( Tag cloud ) has become a very popular visualization method for text data, despite it is almost useless in drawing statistically-relevent conclusions. Word clouds can, however, be a quick way to present research interests on personal webpages .

word cloud research paper

In this blog post, I will show how to use python to generate word cloud from a list of pdf files (a common file format for scientific publications).

We will need several python packages; wordcloud , PyPDF2 , nltk and matplotlib , which can all be install from the conda-forge channel from conda .

  • setting up the environment
  • activating the environment
  • here goes the scipt

TL;DR The script is deposited on github .

First import all the things that are needed:

In English, there are some general words (e.g. “you”, “me”, “is”) that are not necessarily helpful in natural language processings. We call these stop words and we want to exclude these words from our text database. Each of the NLTK and wordcloud package provides a list of stop words. So we will curate a list of stop words for filtering out the stop words in later steps.

I implemented the wordcloud as a python object, and only the required initializing input is the directory of the PDF files, and I also curated some extra words ( self.paper_stop ) that maybe publication-specific stop words (e.g. “Figure”, “Supplementary” and dates, in this case).

And then, I implemented a function to retrieve texts from the PDF files using PyPDF2:

And a also function for filtering out stop words, as well as verbs. NLTK offers implementations to 1. tokenizing words ( nltk.word_tokenize ) from the full text, and 2. identifying if a word is a noun or verb, etc ( nltk.pos_tag ).

Now, we can generate a wordcloud from the words we have curated.

So to run the whole thing:

Creative Commons License

Help | Advanced Search

Computer Science > Information Retrieval

Title: using word clouds for fast identification of papers' subject domain and reviewers' competences.

Abstract: Generating word (tag) clouds is a powerful data visualization technique that allows people to get easily acquainted with the content of a large collection of textual documents and identify their subject domains for a matter of seconds, without reading them at all. This paper suggests applying word clouds visualization to conference management systems (specialized document management and decision support systems) in order to support and facilitate decision making in at least two important processes - forming the Programme Committee by inviting suitable reviewers and manual (re)assignment of reviewers to papers. Word clouds proved to be very useful tool for fast identification of papers' subject domain and reviewers' competences.
Subjects: Information Retrieval (cs.IR); Digital Libraries (cs.DL)
classes: 68P20 (Information storage and retrieval), 90B50 (Management decision making), 68U35 (Information systems)
 classes: H.3; I.7
Cite as: [cs.IR]
  (or [cs.IR] for this version)
  Focus to learn more arXiv-issued DOI via DataCite
Journal reference: PROCEEDINGS OF UNIVERSITY OF RUSE - 2021, volume 60, book 3.2, pp. 114-119

Submission history

Access paper:.

  • Other Formats

license icon

References & Citations

  • Google Scholar
  • Semantic Scholar

DBLP - CS Bibliography

Bibtex formatted citation.

BibSonomy logo

Bibliographic and Citation Tools

Code, data and media associated with this article, recommenders and search tools.

  • Institution

arXivLabs: experimental projects with community collaborators

arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.

Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.

Have an idea for a project that will add value for arXiv's community? Learn more about arXivLabs .

Skip to Content

Other ways to search:

  • Events Calendar

Word Clouds

Words of different sizes and colors describing an event

When to Use:  At the end of a project

Estimated Time:  5 minutes for participants

Participants:  Young Children, Youth, Adults, Educators

  • Method of collecting responses by on-line polling, spreadsheet, or from a text document
  • word it out
  • polleverywhere
  • You may have to edit participant responses to correct misspellings, capitalization, past and present tenses before generating your word cloud. 
  • Different word cloud generators offer options for customizing color schemes, fonts, word orientation, etc.

Sample Topics: Ask participants to supply 3-5 descriptive words.

  • Feelings and Reactions 
  • Themes - subjects and topics of interest
  • Values - what is important
  • Favorites/ Least Favorites
  • Creating live word clouds with Google Slides from Poll Everywhere
  • How to interpret word cloud data from Boost Labs
  • Tips on using Wordclouds.com  from Europlanet Society

Search Faculty Experts

Research and expertise across CU Boulder.

  • Research Institutes

Our 12 research institutes conduct more than half of the sponsored research at CU Boulder.

  • Research Computing

A carefully integrated cyberinfrastructure supports CU Boulder research.

  • Research Centers

More than 75 research centers span the campus, covering a broad range of topics.

Research Development, Institutes & Centers

  • Research Development
  • Shared Instrumentation Network
  • Office of Postdoctoral Affairs
  • Research & Innovation Office Bulletin

Research Administration

  • Office of Contracts and Grants
  • Research Integrity (Compliance)
  • Human Research & IRB
  • Office of Animal Resources
  • Research Tools
  • Research Professor Series

Partnerships & Innovation

  • Innovation & Entrepreneurship
  • Venture Partners (formerly Technology Transfer Office)
  • Industry & Foundation Relations
  • AeroSpace Ventures
  • Grand Challenge
  • Center for National Security Initiatives

U.S. flag

An official website of the United States government

The .gov means it’s official. Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

The site is secure. The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

  • Publications
  • Account settings

Preview improvements coming to the PMC website in October 2024. Learn More or Try it out now .

  • Advanced Search
  • Journal List
  • Br J Gen Pract
  • v.62(596); March 2012

Logo of brjgenprac

Word cloud analysis of the BJGP

A ‘word cloud’ is a visual representation of word frequency. The more commonly the term appears within the text being analysed, the larger the word appears in the image generated. Word clouds are increasingly being employed as a simple tool to identify the focus of written material. They have been used in politics, business and education, for example, to visualise the content of political speeches. In the Health Board that I support, word clouds have been applied to analyse the content of Board committee papers to see whether sufficient attention is being given to the core business of the organisation.

Word clouds should be interpreted with certain caveats. They often fail to group words that have the same or similar meaning, 1 for example, ‘GP’ and ‘GPs’ or ‘Research’ and ‘research.’ As they tend to focus only on single word frequency, they also do not identify phrases, reducing context. 1

A word cloud analysis of the entire content of the British Journal of General Practice from 2011, constituting 600 000 words, was conducted using the online programme Wordle ( http://www.wordle.net/ ). A maximum word limit of 100 was set. Common English words were removed. The image generated can be seen on the cover of this Journal.

According to its editorial policy, the BJGP is ‘an international journal publishing articles of interest to primary care clinicians, researchers, and educators worldwide. Priority is given to research articles asking questions of direct relevance to patient care.’ 2

The word cloud generated from the analysis was measured against the stated editorial policy of the Journal above. Firstly, it can be seen that the two most prominent words highlighted are ‘care’ and ‘patients’. This finding does conform to the aim of the Journal to focus on these areas. In terms of the Journal's concentration on research articles, the words ‘study’ and ‘research’ as major terms both appear, again reflecting well the Journal's policy. The target audience of the Journal is given as ‘primary care clinicians, researchers and educators worldwide’; 2 the terms ‘GP/s’, ‘primary’, ‘general’, ‘practice’ and ‘clinical’ do emerge in the analysis. However, the word ‘education’ fails to materialise, although ‘training’ does appear, but only as a lower order term. In view of the international aim of the Journal, the analysis seems to suggest that its content is perhaps too weighted towards the home audience, with ‘UK’, ‘London’, and ‘NHS’ being the only geographic terms to appear in the word cloud.

Of interest, the medical conditions that figure most prominently in the analysis are ‘cancer’ and ‘depression’. The full extent of the patient journey is revealed, with ‘symptoms’, ‘diagnosis’, and ‘treatment’ all included. The words ‘time’ and ‘years’ are visible, perhaps reflecting the long-term nature of the GP–patient relationship. ‘Quality’ appears as a minor term, which is a surprise, given its importance in patient care; ‘risk’ appears larger. In terms of my own speciality, public health medicine, the words ‘social’, ‘screening’, ‘population’, and ‘need’ all make an appearance, but only as minor terms. In addition, there is no mention of ‘equity’ or ‘prevention’.

In conclusion, a word cloud analysis of the Journal has shown that it seems to have largely fulfilled its stated aim of ensuring that priority is given to research articles of direct relevance to patient care for primary care clinicians. However, it should, perhaps, reflect on whether it needs to become more international in outlook. Furthermore, the analysis has reflected the very diverse nature of primary care practice.

IEEE Account

  • Change Username/Password
  • Update Address

Purchase Details

  • Payment Options
  • Order History
  • View Purchased Documents

Profile Information

  • Communications Preferences
  • Profession and Education
  • Technical Interests
  • US & Canada: +1 800 678 4333
  • Worldwide: +1 732 981 0060
  • Contact & Support
  • About IEEE Xplore
  • Accessibility
  • Terms of Use
  • Nondiscrimination Policy
  • Privacy & Opting Out of Cookies

A not-for-profit organization, IEEE is the world's largest technical professional organization dedicated to advancing technology for the benefit of humanity. © Copyright 2024 IEEE - All rights reserved. Use of this web site signifies your agreement to the terms and conditions.

word cloud research paper

Academia.edu no longer supports Internet Explorer.

To browse Academia.edu and the wider internet faster and more securely, please take a few seconds to  upgrade your browser .

  •  We're Hiring!
  •  Help Center

Word Clouds

  • Most Cited Papers
  • Most Downloaded Papers
  • Newest Papers
  • The social impact of science and technology developmen Follow Following
  • Institutional and organizational design related to political participation and civic engagement Follow Following
  • Q Methodology Follow Following
  • Human development and organizational learning in community-based settings Follow Following
  • Visual Presentation Follow Following
  • Engineering Science and Technology Follow Following
  • Mixed Methods (Methodology) Follow Following
  • Image Follow Following
  • Women and Minority In STEM Fields Follow Following
  • Science teacher education Follow Following

Enter the email address you signed up with and we'll email you a reset link.

  • Academia.edu Publishing
  •   We're Hiring!
  •   Help Center
  • Find new research papers in:
  • Health Sciences
  • Earth Sciences
  • Cognitive Science
  • Mathematics
  • Computer Science
  • Academia ©2024

IMAGES

  1. Research word cloud stock illustration. Illustration of scientific

    word cloud research paper

  2. Create Engaging Word Cloud Visualizations from Your Research

    word cloud research paper

  3. Guide to Using Word Clouds for Applied Research Design

    word cloud research paper

  4. Research Report Word Cloud. Stock Illustration

    word cloud research paper

  5. Research word cloud stock illustration. Illustration of development

    word cloud research paper

  6. Word cloud research Royalty Free Vector Image

    word cloud research paper

VIDEO

  1. Conducting Scientific Research in the Cloud

  2. Language, Emotion, and Personality: How the Words We Use Reflect Who We Are

  3. Creating a Word Cloud using plotly

  4. eScience in the cloud

  5. How To Create Word Cloud in Canva

  6. A comparison word cloud Tutorial

COMMENTS

  1. Guide to Using Word Clouds for Applied Research Design

    Technically, a word cloud is based on n-grams that are in computational linguistics and probability fields sequences of n items from a sample of text or speech. The items can be syllables, letters, or words. Using Latin numerical prefixes, an n -gram of size 1 is referred to as a "unigram", size 2 is a "bigram", and size 3 is a "trigram.

  2. Get Your Head into the Clouds: Using Word Clouds for Analyzing

    Using word clouds, instructors can quickly and easily produce graphical depictions of text representing student knowledge. By investigating the patterns of words or phrases, or lack thereof, in ...

  3. Use of Word Clouds as a Novel Approach for Analysis and Presentation of

    Utilization of word clouds as a research tool to analyze and present qualitative data and enhance program evaluation. Target Audience The data used are from process evaluation instruments taken by youth aged 9-11, their primary meal preparer (n=54 dyads), and leaders (n=15) of the iCook 4-H Program.

  4. An Integrative Review with Word Cloud Analysis of STEM Education

    The 510 research papers were imported into VOSviewer to build and visualize word cloud networks. VOSviewer is a widely used software tool for constructing and visualizing word cloud networks based on the articles retrieved from academic databases. It can analyze various types of relationships between articles and generate graphic results in the ...

  5. From word clouds to Word Rain: Revisiting the classic word cloud to

    Therefore, based on previous criticism of word clouds, as well as on previous research in which the classic word cloud has been developed with new features, we here present, ... That is, a static image that could either be provided digitally, or on a printed paper or poster, and that fills the same role of providing an overview of textual ...

  6. USING WORD CLOUDS FOR FAST IDENTIFICATION

    Fig. 1. Word cloud showing the areas of research of more than 100 scientific papers Word clouds have been used for years in various subject domains, most commonly in web-based information systems such as content management systems, crowdsourcing platforms, social networks and others.

  7. [2210.08059v1] Word Clouds in the Wild

    Word clouds are frequently used to analyze and communicate text data in many domains. In order to help guide research on improving the legibility of word clouds, we have conducted a survey of their usage in Digital Humanities academia and journalism. Using a modified grounded theory approach, we sought to identify the most common purposes for which word clouds were employed and the most common ...

  8. Use of Word Clouds as a Novel Approach for Analysis and Presentation of

    Utilization of word clouds as a research tool to analyze and present qualitative data and enhance program evaluation. Use of Word Clouds as a Novel Approach for Analysis and Presentation of Qualitative Data for Program Evaluation - Journal of Nutrition Education and Behavior

  9. Create Engaging Word Cloud Visualizations from Your Research

    The general steps in the workflow are: Collect PDF files representing your research (10 min). Run a Python script to extract the top words from the PDF files (10 min). Review, edit, and finalize the list of top words (20 min). Use a word cloud generator, adjust the look, and generate SVG (15 min). Convert the SVG file to a PDF/PNG file (5 min).

  10. Word Clouds in the Wild

    Word clouds are frequently used to analyze and communicate text data in many domains. In order to help guide research on improving the legibility of word clouds, we have conducted a survey of their usage in Digital Humanities academia and journalism. Using a modified grounded theory approach, we sought to identify the most common purposes for which word clouds were employed and the most common ...

  11. Wormicloud: a new text summarization tool based on word clouds to

    The resulting word clouds summarize information related to the provided genes or keywords, highlighting prominent research topics in the related articles. Another word cloud tool that was designed to summarize information in research articles from PubMed is LigerCat . This tool presents MeSH terms (the keywords in the controlled vocabulary used ...

  12. (PDF) Using word clouds for fast identification of papers' subject

    INTRODUCTION. Generating word (tag) clouds is a powerful data visualization technique that allows people to. get easily acquainted with the content of a large collection of textual documents and ...

  13. Word Clouds: We Can't Make Them Go Away, So Let's Improve Them

    Important speeches, conference paper titles, or letters in a historical archive most frequently wind up in a word cloud. For instance, USA Today showed this summary of President Obama's final State of the Union Address: ... you can't see the medical research words in the word clouds. You may have also noticed by now that you can't spot ...

  14. Enhancing Academic Work with Word Clouds

    As an academic, you often need to analyze large volumes of text, such as research papers, articles, or books. Word clouds can be a valuable tool for understanding the key themes and trends within your textual data. By creating a word cloud, you can identify recurring terms, visualize the frequency of specific words, and gain insights into the ...

  15. Word Cloud From Publications · Douglas C. Wu

    Word Cloud From Publications. Word cloud has become a very popular visualization method for text data, despite it is almost useless in drawing statistically-relevent conclusions.Word clouds can, however, be a quick way to present research interests on personal webpages.. In this blog post, I will show how to use python to generate word cloud from a list of pdf files (a common file format for ...

  16. Using Word Clouds to Present Q Methodology Data and Findings

    The purpose of this paper is to demonstrate the use of word clouds to present Q methodology data and findings and its benefits for exploring themes and patterns.. ... enhance the clarity and support for research findings. Similarly, word clouds offer a visual way to communicate key words and phrases in a variety of contexts (Cui et al., 2010 ...

  17. [2112.14861] Using word clouds for fast identification of papers

    Download PDF Abstract: Generating word (tag) clouds is a powerful data visualization technique that allows people to get easily acquainted with the content of a large collection of textual documents and identify their subject domains for a matter of seconds, without reading them at all. This paper suggests applying word clouds visualization to conference management systems (specialized ...

  18. Word Cloud Explorer: Text Analytics Based on Word Clouds

    Word clouds have emerged as a straightforward and visually appealing visualization method for text. They are used in various contexts as a means to provide an overview by distilling text down to those words that appear with highest frequency. Typically, this is done in a static way as pure text summarization. We think, however, that there is a larger potential to this simple yet powerful ...

  19. Word Clouds

    Using word clouds to create a visual representation of participants' responses is an easy and creative way to highlight and share data and pull out common themes. Numerous free online tools allow you to create word clouds or tag clouds from text. Virtual polls conducted with mentimeter or polleverywhere allow for immediate display of word clouds.

  20. Word Clouds in Research: Use and Best-Practices?

    In workshops, on the other hand, I like using real-time wordclouds, as provided by Mentimeter ( www.mentimeter.com ), to visualize the actual diversity of e.g. participants, perspectives or ...

  21. Word cloud analysis of the BJGP

    A 'word cloud' is a visual representation of word frequency. The more commonly the term appears within the text being analysed, the larger the word appears in the image generated. ... word clouds have been applied to analyse the content of Board committee papers to see whether sufficient attention is being given to the core business of the ...

  22. scholargoggler.com

    Help keep Scholar Goggler running! We are glad so many people have enjoyed generating personalized word clouds and are working to keep the app freely available for everyone. If you found this tool useful, please consider supporting it with a donation. Your generous contributions will help ensure the availability, reliability and longevity of ...

  23. Word Cloud Analytics of the Computer Science Research Publications

    Consequently, the data and the research topics and publications in this field are also evolving rapidly. In this study we used the data form the Computer Science Bibliography DBLP database for the research publications in the field of Computer Science the past 50 years. ... text visualization tool was applied.The results generated by using word ...

  24. Word Clouds Research Papers

    Since 2019, the "word cloud" has been published for each paper. A word cloud is a visualization of word frequency in a text. The font size of each word in the "cloud" is determined by its prevalence in the text. The word cloud helps readers to understand what basic keywords and terms the author uses directly in the text of the paper.