Web scraping: Getting data from the web with R

Presentation:

An important aspect of working with data today is that they can very often be obtained from the web. Doing so is not necessarily straightforward, however: the data must be downloaded and then go through preprocessing and extraction steps, which depend on the format in which the data are stored on the web.

This course explores some of these formats, together with the methods and tools used to retrieve data from the web and extract the desired information.

The first part introduces some common web technologies, how they relate to one another, and some tools for manipulating and extracting information, such as regular expressions. Next, common formats for storing web information (HTML, XML, JSON) are presented, along with tools for extracting it, such as XPath and CSS selectors. Finally, we introduce some R packages suitable for processing web information and use them in some case studies.

Objectives:

Specifically, at the end of the course students should:

  • Be familiar with the main technologies for dealing with information stored on the web.
  • Be able to recognize the different formats that can be used for storage.
  • Know how to extract information from these formats using specific R packages.

Contents:

  1. Introducing web technologies. Web scraping and web scraping projects.
  2. Data representation on the web: HTML, XML, JSON. Other technologies.
  3. Regular expressions for data manipulation.
  4. Parsing HTML and XML. Using CSS selectors and XPath.
  5. Case studies:
    (1) Parsing data from semi-structured documents.
    (2) Scraping Twitter for Sentiment Analysis.
    (3) Gathering data from commercial sites.

Professor:

Alex Sánchez
Genetics, Microbiology and Statistics Department.
Faculty of Biology.
University of Barcelona.

Statistics and Bioinformatics (UEB).
Vall d'Hebron Institut de Recerca.

Audience:

The course is aimed at university students, teaching staff and professionals with basic statistical knowledge who want to learn web scraping to extract information from the web, from an applied perspective based on practical examples.

Prerequisites:

To take advantage of the practical sessions, participants need a basic knowledge of R.

Students must bring their own personal computer to the course.

Organization details:

The course Web scraping: Getting data from the web with R will be held on April 1st, 3rd, 8th and 11th, 2019, from 9:30 to 13:30.

The course duration is 16 hours.

The minimum number of participants for the course is 10, and the maximum is 20.

Complete the pre-registration: link

Once we have received your form, we will send an email to confirm whether you have been assigned a place or are on the waiting list.

Registration fees (2019):

Concept                            Quantity     Amount
                                                External   Esfera     UAB
Registration (before March 17th)   1 attendee   571,00 €   446,00 €   343,00 €
Registration (after March 17th)    1 attendee   743,00 €   668,00 €   514,00 €

UAB rate: UAB university community and students from other universities.
Esfera rate: Agencies, institutions and companies from the Esfera UAB or the public sector.
External rate: Agencies, institutions and companies from the private sector.

The rate is determined by the person/institution/company that makes the payment.

DISCOUNTS

- Special discounts for unemployed people, on presentation of the application for unemployment benefits or its renewal.

- Special discounts for groups from the same company. Send an e-mail to s.estadistica@uab.cat.

- Grants for students of the Degree in Statistics; see the conditions on the registration form.

Discounts cannot be combined.

Payment details:

Once pre-registration is complete, you will receive an email with the details for paying the registration fee.

Once the course has been paid for, no refund will be given except in circumstances beyond one’s control.

Please wait for our confirmation of your place on the course before making the payment.
