Web scraping for non-programmers
ITNIG | 25th September 2014
@algonpaje - www.quadrigram.com
Goal: Introduce non-programmers to APIs and scraping concepts (*)
(*) In a simple way...
How?: Using a few modules of a visual programming language called “Quadrigram”
> Quadrigram is software designed to make the practice of data analysis and data visualization more universal
> It is designed to gather, shape, and share data
> It lets you prototype and share ideas rapidly, and produce compelling solutions with data in the form of interactive visualizations, animations or dashboards
> The Quadrigram approach to data analysis and visualization is based on a visual programming language composed of around 500 modules
Example 1: Getting financial information in real time
> Data source: http://finance.yahoo.com/
[Figure label: Stock Ticker Input Box]
> Base URL: http://finance.yahoo.com/q?s=TEF.MC&ql=1/
1.- http://finance.yahoo.com/q?s=
2.- ticker (TEF.MC)
3.- &ql=1/
1 + 2 + 3 = Base URL
1.- Building the base URL using Quadrigram
1.1.- Module “Text” (String): “http://finance.yahoo.com/q?s=”
1.2.- Module “Text Entry Box”: input the stock ticker (e.g. TEF.MC)
1.3.- Module “Text” (String): “&ql=1/”
1.4.- Module “Addition of 5 objects”: concatenates the outputs of 1.1, 1.2 and 1.3
... result = “http://finance.yahoo.com/q?s=TEF.MC&ql=1/”
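For readers curious how the same concatenation looks in code, here is a minimal Python sketch. The function name `build_quote_url` is made up for illustration, and the call to `quote()` (which escapes unsafe characters in the ticker) is an addition that is not part of the Quadrigram flow.

```python
from urllib.parse import quote

PREFIX = "http://finance.yahoo.com/q?s="  # piece 1 from the slide
SUFFIX = "&ql=1/"                         # piece 3 from the slide


def build_quote_url(ticker):
    """Join prefix + ticker + suffix, like the "Addition" module does."""
    # quote() escapes characters that are not safe inside a URL;
    # a plain ticker such as TEF.MC passes through unchanged.
    return PREFIX + quote(ticker.strip()) + SUFFIX


print(build_quote_url("TEF.MC"))
# prints http://finance.yahoo.com/q?s=TEF.MC&ql=1/
```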
2.- Querying data
2.1.- Connect the output of “Addition of 5 objects” (“http://finance.yahoo.com/q?s=TEF.MC&ql=1/”) to the “Query HTTP GET” module
2.2.- Connect a “Periodic Pulse” module to “Query HTTP GET” to query the data every “X” seconds
... and so we get our HTML code, ready to be scraped
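The two modules above roughly correspond to "fetch a page" and "repeat on a timer". A rough Python equivalent, using only the standard library, might look like the sketch below. The names `http_get` and `poll` are hypothetical, and the `fetch` and `sleep` parameters exist only so the loop can be tried without touching the network.

```python
import time
from urllib.request import urlopen


def http_get(url, timeout=10.0):
    """Fetch a URL and return its body as text (the "Query HTTP GET" step)."""
    with urlopen(url, timeout=timeout) as response:
        # Use the charset the server declares, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def poll(url, every_seconds, times, fetch=http_get, sleep=time.sleep):
    """Fetch `url` `times` times, pausing `every_seconds` between requests
    (the "Periodic Pulse" step). Returns the list of fetched pages."""
    pages = []
    for i in range(times):
        pages.append(fetch(url))
        if i < times - 1:  # no need to wait after the last request
            sleep(every_seconds)
    return pages
```

Calling `poll(url, 30, 10)` would request the page ten times, thirty seconds apart. Whether a site tolerates that request rate is something to check before running it.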
3.- Scraping data
3.1.- Analyse the code and look for a “left - content - right” pattern. In this case, the pattern we are looking for is:
left = <span id="yfs_l84_tef.mc">
content = stock price (*)
right = </span>
(*) Real time while the market is open
3.- Scraping data (continued)
3.2.- Use the “Scrape Text” module to extract the data. “Scrape Text” inlets:
source text = HTML code (output of Query HTTP GET)
start sequence = <span id="yfs_l84_tef.mc">
end sequence = </span>
3.3.- Extract the stock price using the “Extract Object from List” module
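The “left - content - right” idea can be sketched in a few lines of Python. This is an illustration of the pattern, not Quadrigram's actual implementation. The function name `scrape_text` mirrors the module name for readability, and the HTML snippet and price are made-up sample data.

```python
def scrape_text(source, start, end):
    """Return every piece of text found between `start` and `end`."""
    results = []
    pos = 0
    while True:
        left = source.find(start, pos)
        if left == -1:
            break  # no more start sequences
        content_start = left + len(start)
        right = source.find(end, content_start)
        if right == -1:
            break  # start sequence without a matching end sequence
        results.append(source[content_start:right])
        pos = right + len(end)
    return results


# Made-up sample page; the real page is much longer.
html = '<div><span id="yfs_l84_tef.mc">12.34</span></div>'

matches = scrape_text(html, '<span id="yfs_l84_tef.mc">', '</span>')
price = matches[0] if matches else None  # the "Extract Object from List" step
print(price)
# prints 12.34
```

Because the function returns a list, an empty result is an easy way to notice that the page layout changed and the start sequence no longer appears.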
Example 2: Build a network of similarities using “The Echonest” API
> Data source: http://developer.echonest.com/raw_tutorials/artist_api/raw_artist_02.html
> Base URL: http://developer.echonest.com/api/v4/artist/similar?api_key=J1OPQ9MJ8G8FC19FH&name=strokes
1.- http://developer.echonest.com/api/v4/artist/similar?api_key=J1OPQ9MJ8G8FC19FH&name=
2.- artist's name (“strokes”)
1 + 2 = Base URL
1.- Building the base URL using Quadrigram
1.1.- Module “Text” (String): “http://developer.echonest.com/api/v4/artist/similar?api_key=J1OPQ9MJ8G8FC19FH&name=”
1.2.- Module “Text Entry Box”: input the artist's name (e.g. strokes)
1.3.- Module “Addition of 5 objects”: concatenates the outputs of 1.1 and 1.2
... result = “http://developer.echonest.com/api/v4/artist/similar?api_key=J1OPQ9MJ8G8FC19FH&name=strokes”
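Plain concatenation works for a one-word name like “strokes”, but a name containing spaces needs escaping before it goes into a URL. The sketch below shows one way to handle that with Python's standard `urlencode`. The function name `build_similar_url` and the `YOUR_API_KEY` placeholder are hypothetical; only the endpoint and the two parameter names come from the slides.

```python
from urllib.parse import urlencode

ENDPOINT = "http://developer.echonest.com/api/v4/artist/similar"  # from the slide


def build_similar_url(artist, api_key):
    """Build the query URL from the endpoint plus two parameters."""
    # urlencode() escapes each value and joins the pairs with "&";
    # spaces become "+", the standard query-string encoding.
    query = urlencode({"api_key": api_key, "name": artist})
    return ENDPOINT + "?" + query


print(build_similar_url("the strokes", "YOUR_API_KEY"))
# prints http://developer.echonest.com/api/v4/artist/similar?api_key=YOUR_API_KEY&name=the+strokes
```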
2.- Querying data
2.1.- Connect the output of “Addition of 5 objects” (“http://developer.echonest.com/api/v4/artist/similar?api_key=J1OPQ9MJ8G8FC19FH&name=strokes”) to the “Query HTTP GET” module
... and so we get the response text (JSON rather than HTML in this case)
3.- Scraping data
3.1.- Use the “Scrape Text” module to extract the data. “Scrape Text” inlets:
source text = response text (output of Query HTTP GET)
start sequence = "name": "
end sequence = "},
... and we obtain the list of artists similar to the one we queried
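The same start/end idea can be tried in Python with a regular expression. The response text below is a made-up sample, and the real response layout is not shown in the slides, so treat this as a sketch of the technique rather than a description of the API's output. The helper name `between` is hypothetical.

```python
import re


def between(text, start, end):
    """Return every (shortest) piece of text between `start` and `end`."""
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    return re.findall(pattern, text, flags=re.DOTALL)


# Made-up sample; artist names and ids are placeholders.
sample = (
    '{"artists": [{"id": "A1", "name": "Artist One"}, '
    '{"id": "A2", "name": "Artist Two"}, '
    '{"id": "A3", "name": "Artist Three"}]}'
)

print(between(sample, '"name": "', '"},'))
# prints ['Artist One', 'Artist Two']

print(between(sample, '"name": "', '"}'))
# prints ['Artist One', 'Artist Two', 'Artist Three']
```

In this sample the last entry is closed by `"}]` instead of `"},`, so the longer end sequence misses it. Whether that matters for a real response depends on its exact layout, so it is worth comparing the number of scraped names with the number of entries in the raw text.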
4.- Build a network of similarities
4.1.- Use the “Length of List” module to count how many similar artists there are
4.2.- Use the “Create List with repeated Object” module to create a list that repeats “strokes” once per similar artist
4.3.- Create a Pair Table using the “Create Custom Data Structure” module
4.4.- Convert the Pair Table to a Network using the “Convert PairTable to Network” module
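The four steps above can be sketched in Python as follows. A “pair table” is represented here as a list of (source, target) pairs and the “network” as a dictionary mapping each node to its neighbours. Both representations, the function names, and the artist names are assumptions made for illustration, not Quadrigram's internal data structures.

```python
def build_pair_table(query, similar):
    """Pair the queried artist with each similar artist (steps 4.1 to 4.3)."""
    count = len(similar)           # 4.1: how many similar artists there are
    sources = [query] * count      # 4.2: repeat the query name that many times
    return list(zip(sources, similar))  # 4.3: one (query, similar) pair per row


def pairs_to_network(pairs):
    """Turn pairs into an undirected network: node -> set of neighbours (4.4)."""
    network = {}
    for a, b in pairs:
        network.setdefault(a, set()).add(b)
        network.setdefault(b, set()).add(a)
    return network


similar_artists = ["Artist One", "Artist Two"]  # placeholder names
pairs = build_pair_table("strokes", similar_artists)
network = pairs_to_network(pairs)
print(pairs)
# prints [('strokes', 'Artist One'), ('strokes', 'Artist Two')]
```

Repeating the query name is what turns a flat list of results into edges: every row of the pair table says "the queried artist is connected to this one".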
More information: www.quadrigram.com
Thank you!!!
