Hidden-Web Induced by Client-Side Scripting: An Empirical Study
Zahra Behfarshad, Ali Mesbah
University of British Columbia
Introduction
Do search engines present all the information on the World Wide Web?
[Diagram labels: Search Query, Results]
Visible-Web vs. Hidden-Web
What is crawled and indexed?
• Static hyperlinks, e.g. <a href="www.coffeeconnection.com/menu/">Menu</a>
[Diagram labels: Crawled and Indexed; Not Crawled Nor Indexed]
Hidden-Web Resources
Related Work
• Researchers discovered 7,500 terabytes of hidden content (He et al., 2007).
• The hidden content was 500 times larger than the visible content.
• Much research has focused on techniques for crawling content behind forms.
• All of these studies address the server-side hidden-web.
Motivation
• Most research targets server-side hidden content behind forms.
• 98% of the top Alexa websites use JavaScript.
• This study measures hidden-web content induced by client-side JavaScript.
JavaScript
• JavaScript is a scripting language mostly used to enhance user interaction and create dynamic websites.
• JavaScript can modify the run-time DOM tree dynamically, adding new structures and content.
Example

Initial DOM:

    <body>
      <h1>Sports News</h1>
      <p><span id="sportsContainer"></span></p>
      <div class="update" rel="sports">Update!</div>
    </body>

JavaScript (jQuery):

    1  $(document).ready(function () {
    2    $('div.update').click(function () {
    3      var updateID = $(this).attr('rel');
    4      $.get('/news/', { ref: updateID },
    5        function (data) {
    6          $('#' + updateID + 'Container').append(data);
    7        });
    8    });
    9  });

Line 6 uses an ID selector ('#' prefix) so that the response is appended to the element whose ID is 'sportsContainer'.
Updated DOM (after clicking "Update!"; the JavaScript and the initial DOM are the same as on the Example slide):

    <body>
      <h1>Sports News</h1>
      <p><span id="sportsContainer">
        <h3>US GP: Vettel fastest in Austin second practice</h3>
        <p>Vettel produced an ominous performance</p>
      </span></p>
      <div class="update" rel="sports">Update!</div>
    </body>
Challenges of Crawling Hidden-Web
[Diagram centered on "Crawl"]
• Expensive
• Insecure
• Dynamic
• Time intensive
Goal and RQs
Goal: Measure the pervasiveness and characterize the nature of hidden-web content induced by client-side scripting.
• RQ1: How pervasive is client-side hidden-web in today's web applications?
• RQ2: How much content is typically hidden due to client-side scripting?
• RQ3: Which clickable elements contribute most to client-side hidden-web content?
• RQ4: Are there any correlations between the degree of client-side hidden-web and a web application's characteristics?
Approach: JAVIS
[Diagram: Step 0, Step 1, Step 2, Step 3]
Tool Implementation
• Javis: open source (http://github.com/saltlab/javis)
• Written in Java
• Built on Crawljax (crawljax.com) [Mesbah TWEB'12]
• Crawljax automatically explores JavaScript-based Ajax applications through an event-driven dynamic crawling engine.
Experimental Objects (500 URLs)
• Alexa: 400
• Random: 100

Set up: Crawljax Configuration
• Maximum states: 50
• Depth level: 3
• Event type: click
• HTML elements: A, Div, Img, Span, Input, Button
• Crawling: randomized

Classify: State Flow Graph
• Visible: URL state transition
• Hidden: non-URL state transition

Analyze: Characterizing Hidden-Web
• Hidden-web: quantity
• Correlations
• Clickable: contributing most to hidden-web
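
The visible/hidden classification can be sketched in a few lines. This is only an illustration under stated assumptions: the edge shape (`fromUrl`, `toUrl`, `element`) and both function names are hypothetical and are not taken from Javis or Crawljax, and each edge is assumed to lead to one newly discovered state.

```javascript
// Hypothetical shape of one edge in a crawled state-flow graph.
// Field names are assumptions for illustration, not the Javis data model.
function classifyTransition(edge) {
  // URL changes: the target state is reachable through a link (visible).
  // URL stays the same: only the DOM was mutated by a script (hidden).
  return edge.fromUrl !== edge.toUrl ? "visible" : "hidden";
}

// Percentage of transitions (and so of discovered states) that are hidden.
function hiddenStatePercentage(edges) {
  if (edges.length === 0) return 0;
  const hidden = edges.filter((e) => classifyTransition(e) === "hidden").length;
  return (100 * hidden) / edges.length;
}

// Invented sample data: two link clicks and two script-only DOM updates.
const sampleEdges = [
  { fromUrl: "http://example.com/", toUrl: "http://example.com/menu/", element: "A" },
  { fromUrl: "http://example.com/", toUrl: "http://example.com/", element: "DIV" },
  { fromUrl: "http://example.com/", toUrl: "http://example.com/", element: "SPAN" },
  { fromUrl: "http://example.com/menu/", toUrl: "http://example.com/contact/", element: "A" },
];

console.log(hiddenStatePercentage(sampleEdges)); // 50
```

A real crawler would also need a policy for fragment-only URL changes (for example `#top`); the sketch treats any string difference as a URL transition.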
Addressing RQs
• RQ1: Count the percentage of hidden-web states.
• RQ2: Compute the total and average amount of hidden content in terms of textual differences.
• RQ3: Assess which DOM elements commonly used by web developers induce hidden content.
• RQ4: Analyze possible correlations.
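
For RQ2, one way to measure a textual difference between a hidden state and its previous state is sketched below. The slides do not describe the actual diffing method, so everything here is an assumption: tags are stripped with a naive regular expression, text is compared as a bag of whitespace-separated words, and the helper names are hypothetical.

```javascript
// Naive text extraction: drop tags, collapse whitespace.
// A real tool would walk the DOM instead of using a regular expression.
function visibleText(html) {
  return html.replace(/<[^>]*>/g, " ").replace(/\s+/g, " ").trim();
}

// Words present in the next state but not accounted for in the previous one
// (multiset difference, so repeated words are counted correctly).
function addedText(prevHtml, nextHtml) {
  const remaining = new Map();
  for (const word of visibleText(prevHtml).split(" ")) {
    if (word) remaining.set(word, (remaining.get(word) || 0) + 1);
  }
  const added = [];
  for (const word of visibleText(nextHtml).split(" ")) {
    if (!word) continue;
    const count = remaining.get(word) || 0;
    if (count > 0) remaining.set(word, count - 1);
    else added.push(word);
  }
  return added.join(" ");
}

// Size of a string in kilobytes when encoded as UTF-8.
function sizeInKB(text) {
  return new TextEncoder().encode(text).length / 1024;
}

const prevState = '<p><span id="sportsContainer"></span></p><div class="update">Update!</div>';
const nextState = '<p><span id="sportsContainer"><h3>Vettel fastest</h3></span></p><div class="update">Update!</div>';

const hiddenText = addedText(prevState, nextState);
console.log(hiddenText); // Vettel fastest
console.log(sizeInKB(hiddenText)); // 14 bytes expressed in KB
```

Summing `sizeInKB` over all hidden states of a site gives a per-site total, and dividing by the number of hidden states gives a per-state average.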
RQ1: Pervasiveness
• 95% exhibit some degree of hidden-web (476/500)

    Resources   Median   Mean
    Alexa       77 %     65.63 %
    Random      44 %     50.6 %
    Total       67 %     62.52 %

• Crawl time for 50 states: 25 minutes
• Crawl time for 500 URLs: 211 hours (~9 days)
RQ2: Quantity
Nature of textual content:
• Single words
• Numbers
• Short messages
• Sentences

                  Textual hidden content (KB)     All hidden content (KB)
    Hidden-Web    Minimum   Mean    Maximum       Minimum   Mean    Maximum
    Per State     0         0.60    11.65         0         18.91   286.4
    All States    0         27.6    536           0         869.7   13170
RQ3: Induction
<a href="#" onclick="updateNews();"></a>
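
Ranking clickable elements by how often they lead to hidden states amounts to a simple tally. The sketch below assumes each hidden transition records the tag name of the clicked element in a hypothetical `element` field; the sample data is invented and does not reproduce the study's results.

```javascript
// Tally which element types triggered hidden transitions, most frequent first.
function clickableShare(hiddenEdges) {
  const counts = new Map();
  for (const { element } of hiddenEdges) {
    const tag = element.toUpperCase(); // normalize "div" and "DIV"
    counts.set(tag, (counts.get(tag) || 0) + 1);
  }
  const total = hiddenEdges.length;
  return [...counts.entries()]
    .map(([tag, count]) => ({ tag, count, percent: (100 * count) / total }))
    .sort((a, b) => b.count - a.count);
}

// Invented sample: four hidden transitions.
const hiddenSample = [
  { element: "div" },
  { element: "a" },
  { element: "DIV" },
  { element: "span" },
];

const ranking = clickableShare(hiddenSample);
console.log(ranking[0]); // { tag: 'DIV', count: 2, percent: 50 }
```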
RQ4: Correlations
Between hidden-web and:
• Average DOM size
• JavaScript custom code size
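
The slides do not name the correlation coefficient used. As an illustration only, the sketch below computes Pearson's r between a per-site hidden-web percentage and a per-site characteristic such as average DOM size; the function name and all sample values are invented.

```javascript
// Pearson's correlation coefficient for two equally long numeric arrays.
// Returns NaN when either array has zero variance.
function pearson(xs, ys) {
  if (xs.length !== ys.length || xs.length < 2) {
    throw new Error("need two arrays of equal length with at least 2 values");
  }
  const n = xs.length;
  const mean = (values) => values.reduce((sum, v) => sum + v, 0) / n;
  const meanX = mean(xs);
  const meanY = mean(ys);
  let sxy = 0;
  let sxx = 0;
  let syy = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - meanX;
    const dy = ys[i] - meanY;
    sxy += dx * dy;
    sxx += dx * dx;
    syy += dy * dy;
  }
  return sxy / Math.sqrt(sxx * syy);
}

// Invented per-site values, for demonstration only.
const hiddenPercent = [10, 40, 55, 80];
const avgDomSizeKB = [30, 90, 120, 200];
console.log(pearson(hiddenPercent, avgDomSizeKB));
```

A rank-based coefficient such as Spearman's may be a better fit for skewed web data; it can be obtained by replacing each value with its rank before calling `pearson`.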
Findings
• Client-side hidden-web is omnipresent on the web.
• Of the 500 websites we analyzed, 95% contain client-side hidden-web content:
  (1) 62% of the web states are hidden;
  (2) per hidden state, 19 kilobytes of data are hidden on average, of which 0.6 kilobytes are textual content;
  (3) DIV is the most common clickable element contributing to hidden content (61%);
  (4) dynamically crawling 50 DOM states takes 25 minutes.
Implications
• Increasing use of modern Web 2.0 techniques and JavaScript can contribute to more hidden content.
• Developers should be aware of this and follow guidelines to decrease client-side hidden content.
• General search engines should support client-side hidden content in the future.


Editor's Notes

  • #10 We present a simple example of how JavaScript code can induce hidden-web content by dynamically changing the DOM tree. Figure 1 depicts a JavaScript code snippet using the popular jQuery library. Figure 2 illustrates the initial state of the DOM before any modification has occurred. Once the page is loaded (line 1 in Figure 1), the JavaScript code attaches an onclick event-listener to the DIV DOM element with class attribute 'update' (line 2). When a user clicks on this DIV element, the anonymous function associated with the event-listener is executed (lines 2-8). The function then sends an asynchronous call to the server (line 4), passing a parameter read from the DIV element (i.e., 'sports') (line 3). On the callback, the response content from the server is injected into the DOM element with ID 'sportsContainer' (line 6). The resulting updated DOM state is shown in Figure 3. All the data retrieved and injected into the DOM this way will be hidden content, as it is not indexed by search engines. Although the effect of client-side scripting on the hidden-web is clear, there is currently a lack of comprehensive investigation and empirical data in this area.
  • #16 (RQ2) Textual differences are computed between each hidden state and its previous state.
  • #20 We conducted a correlation analysis of the degree of hidden-web with respect to (1) average DOM size, taken over all the crawled states, and (2) JavaScript custom code size.