ABSTRACT

Information
4.3 Data Collection
4.3.1 Collection Means
4.3.1.1 Existing Datasets
4.3.1.2 Data from Tools Supporting Credibility Evaluation
4.3.1.3 Data from Labelers
4.3.2 Supporting Web Credibility Evaluation
4.3.2.1 Support User’s Expertise
4.3.2.2 Crowdsourcing Systems
4.3.2.3 Databases, Search Engines, Antiviruses and Lists of Pre-Scanned Sites
4.3.2.4 Certification, Signatures and Seals
4.3.3 Reconcile – A Case Study
4.4 Analysis of Content Credibility Evaluations
4.4.1 Subjectivity
4.4.2 Consensus and Controversy
4.4.3 Cognitive Bias
4.4.3.1 Omnipresent Negative Skew – Shift Towards Positive
4.4.3.2 Users Characteristics Affecting Credibility Evaluation – Selected Personality Traits
4.4.3.3 Users Characteristics Affecting Credibility Evaluation – Cognitive Heuristics
4.5 Aggregation Methods – What Is The Overall Credibility?
4.5.1 How to Measure Credibility
4.5.2 Standard Aggregates
Learning
4.5.3 Combating Bias – Whose Vote Should Count More?
4.6 Classifying Credibility Evaluations Using External Web Content Features
4.6.1 How We Get Values of Outcome Variable
4.6.2 Motivation for Building a Feature-Based Classifier of Webpages Credibility
4.6.3 Classification of Web Pages Credibility – Related Work
4.6.4 Dealing with Controversy Problem
4.6.5 Aggregation of Evaluations
4.6.6 Features
4.6.7 Results of Experiments with Building of Classifier Determining whether a Webpage Is Highly Credible (HC), Neutral (N) or Highly Not Credible (HNC)
4.6.8 Results of Experiments with Build of Binary Classifier Determining whether Webpage Is Credible or Not
4.6.9 Results of Experiments with Build of Binary Classifier of Controversy
4.6.10 Summary and Improvement Suggestions

Millions of users already rely on the Internet as a key source of information in their everyday lives. Personal finance, education and security are just three examples of domains in which using the Internet has become almost a matter of course, and the number of such domains is growing rapidly. Medicine might be a glaring example of content for which users who do not rely on credible sources are effectively putting their health in jeopardy. The more users rely on information found on the Web, the more costly exposure to non-credible information becomes, and the more important it is to provide them with tools for efficient credibility assessment. Not every user possesses enough experience or knowledge to make correct assessments.