Every hospital website conveys a different message with the images and text on their home pages. Some want to talk about the quality of care, others about the compassion and bedside manner of their staff. In this post, I do text analysis to uncover themes across sites and show the most common ones.
I analyzed 4,138 sites from seed #4 and used the natural language processing library spaCy to extract insights. I looked at both entity labeling and part-of-speech analysis. The first looks at elements that refer to real-world objects such as people, places, or organizations. The second looks at verbs, pronouns, and adjectives.
Here’s a description of the four entities I extracted from each hospital site’s main page:
ORG: Companies, agencies, institutions, etc.
LOC: Non-GPE locations, mountain ranges, bodies of water.
PERSON: People, including fictional.
GPE: Countries, cities, states.
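As a rough illustration of how these four labels can be pulled out of page text, here is a minimal sketch. It is not the notebook's actual code: the model name `en_core_web_sm` and the helper names `load_pipeline`, `extract_entities`, and `label_distribution` are assumptions made for this example. It relies on spaCy's `doc.ents`, where each entity exposes `text` and `label_`.

```python
from collections import Counter

# The four entity labels described above.
LABELS = {"ORG", "LOC", "PERSON", "GPE"}


def load_pipeline(model_name="en_core_web_sm"):
    """Load a spaCy pipeline. The model name is an assumption, not taken from the post."""
    import spacy  # imported lazily so the helpers below work without spaCy installed

    return spacy.load(model_name)


def extract_entities(doc, labels=LABELS):
    """Return (text, label) pairs for entities whose label is in `labels`."""
    return [(ent.text, ent.label_) for ent in doc.ents if ent.label_ in labels]


def label_distribution(pairs):
    """Count how many entities were found for each label."""
    return Counter(label for _, label in pairs)


# Example usage (requires spaCy and the model to be installed):
# nlp = load_pipeline()
# doc = nlp("Visit the Joint Commission site or find us in Texas.")
# print(label_distribution(extract_entities(doc)))
```

Running `extract_entities` on every home page and summing the per-label counters would give a distribution across all sites.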
Looking through all the websites, here's the distribution. The organization entity (ORG) comes up in large numbers:
Organization (ORG)
spaCy is pre-trained to detect some of these organizations, but it’s not perfect, and some of the detected organizations may be classified incorrectly. Across the sites, here’s the distribution of the top ten:
The most recognized organization (ORG) is Facebook, which indicates most sites have a link to it from their homepage. The second one is patient portal, which is technically not an organization, but nonetheless appears in many of the sites. One that stands out as a real organization is the Joint Commission, which accredits and certifies many hospitals.
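A top-ten list like this one can be tallied with a counter per label. The sketch below is an assumption about how that might be done, not the notebook's code. The function name `top_entities_by_label` is hypothetical, and lowercasing the entity text is my own choice so that "Facebook" and "facebook" count as the same entity.

```python
from collections import Counter, defaultdict


def top_entities_by_label(pairs, n=10):
    """Return the n most frequent entity strings for each label.

    `pairs` is an iterable of (entity_text, label) tuples, for example built
    from (ent.text, ent.label_) for each ent in a spaCy doc.ents.
    """
    counts = defaultdict(Counter)
    for text, label in pairs:
        # Assumption: lowercase and collapse whitespace so that
        # spelling variants of the same entity are counted together.
        key = " ".join(text.lower().split())
        if key:
            counts[label][key] += 1
    return {label: counter.most_common(n) for label, counter in counts.items()}
```

Calling `top_entities_by_label(pairs)["ORG"]` would then return the ten most frequent organization strings with their counts.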
Location (LOC)
The most mentioned location entity (LOC) was Google, because sites include a link to Google Maps with latitude and longitude, which spaCy identifies correctly as a location. The second and third are incorrectly classified as locations, but the next four are correct: valley, south, east, and southern california. Here’s the figure showing these results:
Country, City or State (GPE)
The entity classification for countries, cities, or states (GPE) missed many, but it got some right. For some reason, it treats youtube as one of these entities, and because it appears on many sites, it ranks number one. It also mistook ‘md’ for Maryland, and because these are healthcare sites, ‘md’ shows up among the top entries. The list also includes sitemap, a common word on websites, as a location. As for real locations, Texas appears near the top as both TX and Texas.
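One possible way to clean up misclassifications like these, which is not something the original analysis did, is to drop known false positives before counting. In the sketch below, the blocklist contents come from the terms mentioned above, while the names `FALSE_POSITIVES` and `drop_known_false_positives` are hypothetical.

```python
# Hypothetical blocklist, built from the GPE misclassifications described above.
FALSE_POSITIVES = {
    "GPE": {"youtube", "md", "sitemap"},
}


def drop_known_false_positives(pairs, blocklist=FALSE_POSITIVES):
    """Remove (text, label) pairs whose text is blocklisted for that label."""
    kept = []
    for text, label in pairs:
        # Compare case-insensitively; labels without a blocklist keep everything.
        if text.strip().lower() in blocklist.get(label, set()):
            continue
        kept.append((text, label))
    return kept
```

The trade-off is that a blocklist also discards legitimate hits: dropping "md" removes real references to Maryland along with the doctor titles.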
Person
The person entity was not classified correctly among the 10 most common results. spaCy got confused by Bill Pay, presumably thinking that Bill was a person, and likewise by the term bill. Here’s the distribution of the top 10 detected:
Verbs, Adjectives and Nouns
Now let’s take a look at the parts of speech classified by spaCy (verbs, adjectives and nouns). This analysis did a little better than the previous one. Starting with adjectives, number one is ‘best’, followed by ‘free’. Here’s the top 10 list:
It's no surprise that many hospitals claim to have the best program or the best doctors, and so 'best' is one of the most used adjectives. Here's a sample of how it's used at the Yale New Haven Hospital:
Now let’s take a look at the top 100 sites with the largest number of adjectives on their page. Here’s a list of the top 10 and the number of adjectives on each:
| Hospital URL | Adjective Count |
| --- | --- |
| www.tbhcare.com | 301 |
| www.ormc.org | 240 |
| www.susquehannahealth.org | 208 |
| www.charlotteregional.com | 182 |
| ksmedcenter.com | 180 |
| bestheartcare.com | 178 |
| www.nm.org | 178 |
| www.faithcommunityhospital.com | 177 |
| www.nmbhi.org | 165 |
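A ranking like this table could be produced by counting adjective tokens per page and sorting the sites by that count. This is a minimal sketch under stated assumptions: `count_pos` and `rank_sites` are hypothetical helper names, and `pages` in the usage comment stands for a dictionary of URL to home page text that is not part of the original post. It uses spaCy's coarse part-of-speech attribute `token.pos_`, where adjectives are tagged `"ADJ"`.

```python
def count_pos(doc, pos="ADJ"):
    """Count tokens in a spaCy Doc (or any iterable of tokens) with the given coarse POS tag."""
    return sum(1 for token in doc if token.pos_ == pos)


def rank_sites(counts_by_site, n=10):
    """Return the n (url, count) pairs with the highest counts, largest first."""
    return sorted(counts_by_site.items(), key=lambda item: item[1], reverse=True)[:n]


# Example usage (requires spaCy; `pages` maps each URL to its home page text):
# import spacy
# nlp = spacy.load("en_core_web_sm")
# counts = {url: count_pos(nlp(text)) for url, text in pages.items()}
# for url, count in rank_sites(counts):
#     print(url, count)
```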
Among these top 100 sites, the distribution of adjectives is very similar: 'best' is number one again, but there are a few new ones such as 'many', 'first' and 'better':
Verbs
Looking at the top verbs, there isn’t much of a surprise here:
Pronoun
As for the most common pronouns, most hospitals put patients first by emphasizing ‘you’, followed closely by ‘us’ and ‘we’:
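Tallies like the adjective, verb, and pronoun lists in this section could be computed by filtering tokens on spaCy's coarse part-of-speech tag and counting the words. The sketch below is an assumption about one way to do it, not the notebook's code. The function name `top_words` is hypothetical, and lowercasing the token text and skipping non-alphabetic tokens are choices made for this example.

```python
from collections import Counter


def top_words(doc, pos, n=10):
    """Return the n most common lowercased words tagged with coarse POS `pos`.

    `pos` is one of spaCy's coarse tags, such as "ADJ", "VERB", or "PRON".
    """
    counts = Counter(
        token.text.lower()
        for token in doc
        # Assumption: ignore tokens containing digits or punctuation.
        if token.pos_ == pos and token.is_alpha
    )
    return counts.most_common(n)


# Example usage (requires spaCy and a loaded pipeline `nlp`):
# doc = nlp(homepage_text)
# print(top_words(doc, "PRON"))
```

For verbs, counting `token.lemma_` instead of `token.text` would group forms such as "find" and "finding" together, at the cost of hiding which form the sites actually use.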
This was a quick look at analyzing hospital websites’ main pages with spaCy, a quick and effective way to do natural language processing.
The Jupyter notebook used to crunch some of this data can be found here, and it takes a deeper look at the sites with a large number of entities and parts of speech. The data files used to extract this information can also be found in this directory.
This one lists all parts of speech found in each one of the 4,138 sites, and this one lists all the entities per site. There’s a lot of information in there. For instance, the website with the most verbs was ocalahealthsystem.com, with 16,370 verbs on the site.