‘Qualities’ not ‘Quality’ – Text Analysis Methods to Classify Consumer Health Websites

Guocai Chen, Jim Warren, Joanne Evans. ‘Qualities’ not ‘Quality’ – Text Analysis Methods to Classify Consumer Health Websites. electronic Journal of Health Informatics, 2009; 4(1): e5.


There is an increasing need to help health consumers achieve timely, differentiated access to quality online healthcare resources. This paper describes and evaluates methods for automated classification of consumer health Web content with respect to qualitative attributes relevant to the preferences of individual health consumers. The approach is illustrated by identifying breast cancer consumer web pages that take a ‘supportive’ versus a ‘medical’ perspective, with results compared against an existing manual classification employed by a breast cancer portal that offers personalised search preference options. Classification is based on analysis of word co-occurrences and an enhanced decision tree classifier (a decision forest). In current tests, this decision forest classifier distinguishes ‘medical’ from ‘supportive’ resources with 90% accuracy (95% confidence interval, 86-94%). These early results indicate that language use patterns can be used to automate such classification with acceptable accuracy; however, a wider range of websites and metadata attributes needs to be assessed and compared with end-user feedback. Future applications may include a tool to assist metadata coders in populating the databases of domain-specific portals such as BCKOnline, or tagging or sorting by content type of live search results for health consumers.
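As an aid to reading the reported accuracy, the following is a minimal sketch of how a 95% confidence interval for a classification accuracy can be computed with the normal (Wald) approximation. The function name and the counts used are hypothetical and chosen only for illustration; the abstract does not state the test set size or which interval method the authors used.

```python
import math


def accuracy_confidence_interval(correct, total, z=1.96):
    """Return (accuracy, lower, upper) using the normal approximation.

    z=1.96 corresponds to a two-sided 95% interval. This is an
    illustrative helper, not the method reported in the paper.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    # Standard error of a binomial proportion, scaled by z.
    half_width = z * math.sqrt(p * (1 - p) / total)
    # Clamp to [0, 1] because an accuracy cannot fall outside that range.
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# Hypothetical counts (NOT the paper's test set): 180 of 200 pages correct.
accuracy, lower, upper = accuracy_confidence_interval(correct=180, total=200)
print(f"accuracy={accuracy:.2f}, 95% CI=({lower:.3f}, {upper:.3f})")
```

With these assumed counts the interval is roughly 86% to 94%, which shows how an interval of the reported width can arise from a test set of a few hundred pages. For small test sets or accuracies close to 100%, a Wilson or exact binomial interval would usually be a safer choice than this approximation.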

