Forschungspraktikum 1+2: Computational Social Science

Session 09: Web Scraping

Dr. Christian Czymara

Agenda

  • Understanding Web Scraping
  • Static Website Scraping
  • APIs for Web Scraping
  • API packages for R
  • Dynamic Website Scraping
  • Ethics and Legality
  • Tutorial: Scraping Wikipedia & the IC2S2 website

What is Web Scraping?

  • Extracting structured data from websites
  • Collecting data from social media platforms
  • Gathering news articles for sentiment analysis
  • Compiling public datasets from government websites
  • Etc.

Steps of Web Scraping

  1. Accessing the website
  2. Parsing the HTML
  3. Extracting the data
  4. Combining and preparing the data
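The four steps can be sketched in a few lines of R. This is a minimal illustration, not a recipe for any particular site: the inline HTML, the class names `.name` and `.year`, and the resulting data frame are made up for the example, and a real scraper would pass a URL to `read_html()` instead of a string.

```r
library(rvest)

# Step 1: access the website.
# Assumption: an inline HTML string stands in for a real page here,
# so the example runs without internet access.
page_source <- '
  <ul>
    <li><span class="name">Ada Lovelace</span> <span class="year">1815</span></li>
    <li><span class="name">Alan Turing</span> <span class="year">1912</span></li>
  </ul>'

# Step 2: parse the HTML into a document object
page <- read_html(page_source)

# Step 3: extract the data with CSS selectors
person_names <- html_text(html_nodes(page, ".name"))
birth_years  <- html_text(html_nodes(page, ".year"))

# Step 4: combine and prepare the data
people <- data.frame(name = person_names,
                     birth_year = as.integer(birth_years),
                     stringsAsFactors = FALSE)
people
```

Keeping extraction (step 3) separate from preparation (step 4) makes it easier to spot whether a problem comes from the selector or from the cleaning code.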

Scraping Static Websites

Understanding HTML Structure

  • HTML: HyperText Markup Language
  • The standard markup language for creating web pages
  • Consists of elements enclosed in tags
  • Tags define the structure and content of a page

HTML Elements

  • HTML elements are the building blocks of a web page
  • Elements are enclosed in tags, which are written in angle brackets (< and >) and usually come in pairs (void elements such as <img> have no closing tag)
  • Tags enclose content (e.g., <p> and </p> for paragraphs)
  • Tags can have attributes that provide additional information

HTML Tags

  • Common HTML tags include:
    • <h1>, <h2>, <h3>: Headings
    • <p>: Paragraph
    • <a>: Hyperlink
    • <img>: Image
    • <table>, <tr>, <td>: Table
    • <div>, <span>: Division and span

HTML Example

<h1>This is a heading</h1>

<p>This is a paragraph</p>

<a href="https://www.example.com">This is a link</a>

<img src="image.jpg" alt="This is an image">
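To see the difference between an element's content and its attributes, the snippet above can be parsed in R. This sketch assumes the rvest package; the HTML is passed as a string so that no real website is needed.

```r
library(rvest)

# The HTML example from the slide, stored as a string
html_source <- '<h1>This is a heading</h1>
<p>This is a paragraph</p>
<a href="https://www.example.com">This is a link</a>
<img src="image.jpg" alt="This is an image">'

doc <- read_html(html_source)

# Content: the text between the opening and the closing tag
link_text <- html_text(html_nodes(doc, "a"))

# Attributes: additional information stored inside the opening tag
link_target <- html_attr(html_nodes(doc, "a"), "href")
image_alt   <- html_attr(html_nodes(doc, "img"), "alt")

link_text    # "This is a link"
link_target  # "https://www.example.com"
image_alt    # "This is an image"
```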

HTML Structure

  • HTML documents have a hierarchical structure
  • The root element is <html>
  • The document is divided into <head> and <body> sections
  • The body contains the content of the page
<html>
  <head>
    <title>Page Title</title>
  </head>
  <body>
    <h1>This is a Heading</h1>
    <p>Some text with a <a href="test.html">link.</a></p>
    <img src="image.jpg" alt="an image">
  </body>
</html>

Visualizing HTML as a Tree

  • HTML is hierarchical and can be represented as a tree:
  • <html>
    • <head>
      • <title>: Page title
    • <body>
      • <h1>: Heading
      • <p>: Text with a link
        • <a>: Link
      • <img>: An image
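The tree can also be inspected from R. The following sketch, assuming rvest, parses a small document with the same structure as the example and lists the children of selected nodes; the `alt` attribute on the image is an illustrative choice.

```r
library(rvest)

doc <- read_html('<html>
  <head><title>Page Title</title></head>
  <body>
    <h1>This is a Heading</h1>
    <p>Some text with a <a href="test.html">link.</a></p>
    <img src="image.jpg" alt="an image">
  </body>
</html>')

# Children of the root element <html>
root_children <- html_name(html_children(doc))

# Children of <body>
body_children <- html_name(html_children(html_node(doc, "body")))

# Children of <p>: the link is nested one level deeper
p_children <- html_name(html_children(html_node(doc, "p")))

root_children  # "head" "body"
body_children  # "h1" "p" "img"
p_children     # "a"
```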

CSS Selectors

  • CSS: Cascading Style Sheets
  • Used to style HTML elements (e.g., colors, fonts, layout)
  • CSS selectors are patterns used to select elements for styling
  • They can also be used to select elements for scraping

CSS Selectors

  • Common CSS selectors include:
    • Tag name: p (paragraph), h1 (heading), div (generic division)
    • Class: .class-name (for elements with a specific class)
    • ID: #id-name (for single element by ID)
    • Attribute: [attribute=value] (for elements based on attributes and values)
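Each selector type can be tried out on a small document. The sketch below assumes rvest; the HTML, the class `lead`, and the ID `intro` are invented for illustration.

```r
library(rvest)

doc <- read_html('
  <div id="intro" class="content">
    <p class="lead">First paragraph</p>
    <p>Second paragraph</p>
    <a href="https://www.example.com" target="_blank">External link</a>
    <a href="about.html">Internal link</a>
  </div>')

# Tag name: all <p> elements
by_tag <- html_text(html_nodes(doc, "p"))

# Class: elements with class "lead"
by_class <- html_text(html_nodes(doc, ".lead"))

# ID: the single element with id "intro"
by_id <- html_name(html_nodes(doc, "#intro"))

# Attribute: links whose target attribute equals "_blank"
by_attr <- html_text(html_nodes(doc, 'a[target="_blank"]'))

by_tag    # "First paragraph" "Second paragraph"
by_class  # "First paragraph"
by_id     # "div"
by_attr   # "External link"
```

Selectors can be combined, for example `div.content p` for all paragraphs inside a `<div>` with class `content`.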

CSS Selector Example

div {border: 2px solid black; }

h1 {color: blue; }

p {color: gray; }
  • A box with a black border surrounds all <div> elements
    • E.g., <div class="content">This is a section.</div>
  • Heading is blue
  • Paragraph text is gray

Web Scraping with R

  • Use the rvest package to scrape data from websites
    • read_html() reads HTML content from a URL
    • html_nodes() selects nodes based on CSS selectors (named html_elements() since rvest 1.0.0)
    • html_text() extracts text from nodes
    • html_table() extracts tables from a page

SelectorGadget

  • Use the SelectorGadget tool to identify CSS selectors
  • Available in Chrome’s web store
  • Install the Chrome extension and click on the elements you want to scrape

SelectorGadget Example

  • Go to my website
  • Use the SelectorGadget tool to identify the CSS selector
    1. For my description on the main page
    2. For my publications (only)

Example: Scraping my Website

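library(rvest)

url <- "https://www.czymara.com/"

# Download and parse the page
page <- read_html(url)

# Select the main content block by its CSS class
nodes <- html_nodes(page, ".page__content")

# Extract the text of the selected nodes
html_text(nodes)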
[1] "I am a migration and political communication scholar at Goethe University Frankfurt. My research focuses on migration attitudes, ethnic conflict, migrant integration, and mass media employing quantitative and computational methods as well as natural language processing. The results were published in esteemed general social science journals like Social Forces, European Sociological Review, and European Journal of Political Research, as well as top-tier specialty outlets such as Journal of Ethnic and Migration Studies, Journal of Computational Social Science, and Mass Communication and Society, amassed over 1,000 citations on Google Scholar, and earned the Janet A. Harkness Award of the World Association for Public Opinion Research and the German Preis der Fritz Thyssen Stiftung für sozialwissenschaftliche Aufsätze. Moreover, these findings were media in national and international media and influenced global and EU-level policy institutions.I have several years of experience in teaching quantitative methods and research design, which includes supervising Bachelor’s and Master’s theses.Feel free to contact me at czymara(at)soz.uni-frankfurt.de or follow me on Bluesky at @christian.czymara.com or Mastodon at @cczymara@sciences.social."

Example: Scraping my Publications

page_research <- read_html(paste0(url, "research/"))

nodes_research <- html_nodes(page_research, ".page__content li")

publications <- html_text(nodes_research)

publications
 [1] "Czymara, C. S. (2018). Discursive Determinants of Attitudes towards Immigrants: Political Parties and Mass Media as Contextual Sources of Threat Perceptions. Universitäts- und Stadtbibliothek Köln. 🔓 PDF"                                                                                                                                                                                       
 [2] "Czymara, C. S. & Gorodzeisky, A. (2024). Hostility on Twitter in the Aftermath of Terror Attacks. Journal of Computational Social Science. 🔓 PDF"                                                                                                                                                                                                                                                  
 [3] "May, A. C., & Czymara, C. S. (2024). Careless Whisper: Political Elite Discourses Activate National Identities for Far-right Voting Preferences. Nations and Nationalism 30 (1), 90 - 109. 🔓 PDF"                                                                                                                                                                                                  
 [4] "Czymara, C. S., & Bauer, L. (2023). Discursive Shifts in the German Right-Wing Newspaper Junge Freiheit 1997-2019: A Computational Approach. German Politics. 🔓 PDF"                                                                                                                                                                                                                               
 [5] "Nägel, C., Nivette A., & Czymara, C. S. (2024). Do Jihadist Terrorist Attacks Cause Changes in Institutional Trust? A Multi-Site Natural Experiment. European Journal of Political Research 63 (2), 411 - 432. 🔓 PDF"                                                                                                                                                                              
 [6] "Czymara, C. S. (2024). Real-World Developments Predict Immigration News in Right-Wing Media: Evidence from Germany. Mass Communication and Society 27 (1), 50 - 74. 🔓 PDF"                                                                                                                                                                                                                         
 [7] "Schmidt-Catran, A. W., & Czymara, C. S. (2023). Political Elite Discourses Polarize Attitudes toward Immigration Along Ideological Lines. A comparative longitudinal analysis of Europe in the 21st century. Journal of Ethnic and Migration Studies 49 (1), 85 - 109. 🔓 PDF"                                                                                                                      
 [8] "Breznau, N. & 100+ co-authors (2022). Observing Many Researchers Using the Same Data and Hypothesis Reveals a Hidden Universe of Uncertainty. Proceedings of the National Academy of Sciences 119 (44), 1 - 8."                                                                                                                                                                                     
 [9] "Czymara, C. S., Dochow-Sondershaus, S, Drouhot, L. G., Şimşek, M., & Spörlein, C. (2023). Catalyst of hate? Ethnic insulting on YouTube in the aftermath of terror attacks in France, Germany and the United Kingdom 2014–2017. Journal of Ethnic and Migration Studies 49 (2), 535 - 553. Special issue: Computational Approaches to Migration and Integration Research. 🔓 PDF"                   
[10] "Hoogeveen, S., & 100+ co-authors (2023). A many-analysts approach to the relation between religiosity and well-being. Religion, Brain and Behavior 13 (3), 237 - 283."                                                                                                                                                                                                                              
[11] "Czymara, C. S., & Mitchell, J. (2023). All Cops are Trusted? How Context and Time Shape Immigrants’ Trust in the Police in Europe. Ethnic and Racial Studies 46 (1), 72 - 96. 🔓 PDF"                                                                                                                                                                                                               
[12] "Langenkamp, A., Cano, T., & Czymara, C. S. (2022). My Home is my Castle? The Role of Living Arrangements on Experiencing the COVID-19 Pandemic: Evidence From Germany. Frontiers in Sociology 6, 1 - 14. 🔓 PDF"                                                                                                                                                                                    
[13] "Czymara, C. S. & van Klingeren, M. (2022). New perspective? Comparing Frame Occurrence in Online and Traditional News Media Reporting on Europe’s “Migration Crisis”. Communications: The European Journal of Communication Research 47 (1), 136 - 162. 🔓 PDF"                                                                                                                                     
[14] "Czymara, C. S., & Eisentraut, M. (2020). ‘A threat to the Occident’? Comparing human values of Muslim immigrants, Christian and non-religious natives in Western Europe. Frontiers in Sociology 5, 1 - 15. 🔓 PDF"                                                                                                                                                                                  
[15] "Czymara, C. S., Langenkamp, A. & Cano, T. (2021). Cause for Concern: Gender Inequality in Experiencing the COVID-19 Lockdown in Germany. European Societies 23 (S1), 68 - 81. Special issue: European Societies in the Time of the Coronavirus Crisis. 🔓 PDF"                                                                                                                                      
[16] "Schmidt-Catran, A. W., & Czymara, C. S. (2020). “Did you read about Berlin?” Terrorist attacks, online media reporting and support for refugees in Germany. Soziale Welt 71 (2–3), 305 – 337. 🔓 PDF"                                                                                                                                                                                               
[17] "Czymara, C. S. (2021). Attitudes toward Refugees in Contemporary Europe: A Longitudinal Perspective on Cross-national Differences. Social Forces 99 (3), 1306 – 1333. 🔓 PDF"                                                                                                                                                                                                                       
[18] "Czymara, C. S. (2020). Propagated Preferences? Political Elite Discourses and Europeans’ Openness toward Muslim Immigrants. International Migration Review 54 (4), 1212 - 1237. 🔓 PDF"                                                                                                                                                                                                             
[19] "Czymara, C. S., & Dochow, S. (2018). Mass Media and Concerns about Immigration in Germany in the 21st Century: Individual-Level Evidence over 15 Years. European Sociological Review 34 (4), 381 – 401. 🔓 PDF"                                                                                                                                                                                     
[20] "Czymara, C. S., & Schmidt-Catran, A. W. (2018). Konfundierungen in Vignettenanalysen mit einzelnen d-effizienten Vignettenstichproben (Confounding in Vignette Studies with Single D-Efficient Vignette Samples). Kölner Zeitschrift für Soziologie und Sozialpsychologie 70 (1), 93 - 103."                                                                                                        
[21] "Czymara, C. S., & Schmidt-Catran, A. W. (2017) : Refugees Unwelcome? Changes in the Public Acceptance of Immigrants and Refugees in Germany in the Course of Europe’s “Immigration Crisis”, European Sociological Review 33 (6). 735 – 751. 🔓 PDF"                                                                                                                                                 
[22] "Czymara, C. S., & Schmidt-Catran, A. W. (2016). Wer ist in Deutschland willkommen? Eine Vignettenanalyse zur Akzeptanz von Einwanderern (Who is welcome in Germany? A Vignette Study on the Acceptance of Immigrants). Kölner Zeitschrift für Soziologie und Sozialpsychologie 68 (2), 1 – 35. 🔓 PDF"                                                                                              
[23] "Czymara, C. S., Dochow-Sondershaus, S, Drouhot, L. G., Şimşek, M., & Spörlein, C. (2024). Catalyst of hate? Ethnic insulting on YouTube in the aftermath of terror attacks in France, Germany and the United Kingdom 2014–2017. in Deutschmann, E., Drouhot, L. G., Zuccotti, C. V. & Zagheni, E. (Eds.), Computational Research in Ethnic and Migration Studies (pp. 152 – 170). Routledge. 🔓 PDF"
[24] "Velásquez, P., Eger, M. A., Castañeda, H., Czymara, C. S., Ivarsflaten, E., Maxwell, R., Okamoto, D., & Wilkes, R. (2024). Processes and Pathways of Stigmatization and Destigmatization over Time. In Yang, L. H., Eger, M. A., & Link, B. G. (Eds.), Migration Stigma: Understanding Prejudice, Discrimination, and Exclusion (pp. 179 - 200). MIT Press."                                        
[25] "Brodeur, A., & 100+ co-authors: Mass Reproducibility and Replicability: A New Hope."                                                                                                                                                                                                                                                                                                                
[26] "Breznau, N., & 100+ co-authors: The Reliability of Replications: A Study in Computational Reproductions."                                                                                                                                                                                                                                                                                           

Example: Plotting Publications

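library(quanteda)
# Tokenize the publications; drop punctuation, numbers, and symbols
publications_toks <- tokens(corpus(publications), remove_punct = TRUE, remove_numbers = TRUE,
                            remove_symbols = TRUE, remove_separators = TRUE)
# Remove stopwords and uninformative tokens (author name, initials, "PDF")
publications_toks <- tokens_remove(publications_toks, c(stopwords(), "czymara", "pdf", "c", "s"), case_insensitive = TRUE)
# Build the document-feature matrix
publications_dfm <- dfm(publications_toks)

# Plot the most frequent terms as a word cloud
quanteda.textplots::textplot_wordcloud(publications_dfm)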

Scraping Tables

  • Use the html_table() function to extract tables from a page
  • The function returns a list of data frames
  • Use indexing to access the data frames (e.g., the first table on the page: tables[[1]])
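A small offline example shows the list structure that `html_table()` returns. The two tables below are made up for illustration; only the rvest functions are real.

```r
library(rvest)

# Assumption: an inline page with two small, invented tables
doc <- read_html('
  <table>
    <tr><th>Country</th><th>Population</th></tr>
    <tr><td>A</td><td>1,000</td></tr>
    <tr><td>B</td><td>250</td></tr>
  </table>
  <table>
    <tr><th>City</th><th>Country</th></tr>
    <tr><td>X</td><td>A</td></tr>
  </table>')

# One data frame per <table> element on the page
tables <- html_table(doc)
length(tables)

# Access individual tables by position
first_table  <- tables[[1]]
second_table <- tables[[2]]
names(first_table)
second_table
```

On real pages the order of tables is not always obvious, so it can help to print `length(tables)` and look at the column names of each entry before choosing an index.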

Example: Scraping a Table from Wikipedia

url_table <- "https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)"
page_table <- read_html(url_table)
tables <- html_table(page_table)

head(tables[[1]])
# A tibble: 6 × 6
  `Country or territory` `Population(1 July 2022)` `Population(1 July 2023)`
  <chr>                  <chr>                     <chr>                    
1 World                  8,021,407,192             8,091,734,930            
2 India                  1,425,423,212             1,438,069,596            
3 China[a]               1,425,179,569             1,422,584,933            
4 United States          341,534,046               343,477,335              
5 Indonesia              278,830,529               281,190,067              
6 Pakistan               243,700,667               247,504,495              
# ℹ 3 more variables: `Change(%)` <chr>, `UN continental region[1]` <chr>,
#   `UN statistical subregion[1]` <chr>
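The population columns arrive as character strings (`<chr>`) because of the thousands separators, and some country names carry footnote markers. One possible way to clean both with base R is sketched below; the small data frame reuses three rows from the output above, and the column names `country` and `pop_2023` are chosen for this example only.

```r
# Three rows copied from the output above, with simplified column names
population <- data.frame(
  country  = c("India", "China[a]", "United States"),
  pop_2023 = c("1,438,069,596", "1,422,584,933", "343,477,335"),
  stringsAsFactors = FALSE
)

# Remove the thousands separators, then convert to numeric
population$pop_2023 <- as.numeric(gsub(",", "", population$pop_2023, fixed = TRUE))

# Drop footnote markers such as "[a]" from the country names
population$country <- gsub("\\[.*?\\]", "", population$country, perl = TRUE)

population
```

`as.numeric()` is used rather than `as.integer()` because values such as the world population exceed R's integer range.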

Tutorial 09: Exercise 1.

Web Scraping with APIs

What are APIs?

  • API: Application Programming Interface
  • A set of rules and protocols for building and interacting with software applications
  • Allow different software applications to communicate with each other
  • Provide a structured way to access data

How Do APIs Work?

  • APIs use HTTP requests to communicate
  • APIs return data in a machine-readable format (e.g., JSON)
  • Use the httr package to interact with APIs in R
    • The GET() function sends a GET request to the API
    • The POST() function sends a POST request to the API

Example: The Google Books API

  • Google Books API documentation
    • Available keys (e.g., q for query)
    • Required vs. optional parameters
  • Endpoint: https://www.googleapis.com/books/v1/volumes
  • Returns JSON data
library(httr)

google_url <- "https://www.googleapis.com/books/v1/volumes?q=web+scraping"

google <- GET(google_url)
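Instead of pasting the query into the URL by hand, the request can be built from the endpoint and a list of parameters. The sketch below assumes httr; `fetch_volumes()` is a hypothetical helper written for this example, not part of httr or the Google Books API, and `maxResults` is an optional API parameter that limits the number of returned items.

```r
library(httr)

endpoint <- "https://www.googleapis.com/books/v1/volumes"

# Build the request URL from endpoint and query parameters;
# modify_url() URL-encodes the values (e.g., the space in the query)
request_url <- modify_url(endpoint, query = list(q = "web scraping"))
request_url

# Hypothetical helper: send the GET request, stop on HTTP errors,
# and return the parsed JSON body as a nested R list
fetch_volumes <- function(query, max_results = 5) {
  response <- GET(endpoint, query = list(q = query, maxResults = max_results))
  stop_for_status(response)         # turns 4xx/5xx responses into R errors
  content(response, as = "parsed")  # JSON is parsed into nested lists
}

# Usage (requires internet access):
# books <- fetch_volumes("web scraping")
# books$totalItems
```

Checking the status before parsing avoids working with an error page as if it were data.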

Result of the Google Books API Request

content(google, as="parsed")$items[[1]]
$kind
[1] "books#volume"

$id
[1] "7z_fCQAAQBAJ"

$etag
[1] "EB6+0Nmjb9g"

$selfLink
[1] "https://www.googleapis.com/books/v1/volumes/7z_fCQAAQBAJ"

$volumeInfo
$volumeInfo$title
[1] "Web Scraping with Python"

$volumeInfo$subtitle
[1] "Collecting Data from the Modern Web"

$volumeInfo$authors
$volumeInfo$authors[[1]]
[1] "Ryan Mitchell"


$volumeInfo$publisher
[1] "\"O'Reilly Media, Inc.\""

$volumeInfo$publishedDate
[1] "2015-06-15"

$volumeInfo$description
[1] "Learn web scraping and crawling techniques to access unlimited data from any web source in any format. With this practical guide, you’ll learn how to use Python scripts and web APIs to gather and process data from thousands—or even millions—of web pages at once. Ideal for programmers, security professionals, and web administrators familiar with Python, this book not only teaches basic web scraping mechanics, but also delves into more advanced topics, such as analyzing raw data or using scrapers for frontend website testing. Code samples are available to help you understand the concepts in practice. Learn how to parse complicated HTML pages Traverse multiple pages and sites Get a general overview of APIs and how they work Learn several methods for storing the data you scrape Download, read, and extract data from documents Use tools and techniques to clean badly formatted data Read and write natural languages Crawl through forms and logins Understand how to scrape JavaScript Learn image processing and text recognition"

$volumeInfo$industryIdentifiers
$volumeInfo$industryIdentifiers[[1]]
$volumeInfo$industryIdentifiers[[1]]$type
[1] "ISBN_13"

$volumeInfo$industryIdentifiers[[1]]$identifier
[1] "9781491910252"


$volumeInfo$industryIdentifiers[[2]]
$volumeInfo$industryIdentifiers[[2]]$type
[1] "ISBN_10"

$volumeInfo$industryIdentifiers[[2]]$identifier
[1] "1491910259"



$volumeInfo$readingModes
$volumeInfo$readingModes$text
[1] TRUE

$volumeInfo$readingModes$image
[1] TRUE


$volumeInfo$pageCount
[1] 264

$volumeInfo$printType
[1] "BOOK"

$volumeInfo$categories
$volumeInfo$categories[[1]]
[1] "Computers"


$volumeInfo$averageRating
[1] 4.5

$volumeInfo$ratingsCount
[1] 2

$volumeInfo$maturityRating
[1] "NOT_MATURE"

$volumeInfo$allowAnonLogging
[1] TRUE

$volumeInfo$contentVersion
[1] "1.13.15.0.preview.3"

$volumeInfo$panelizationSummary
$volumeInfo$panelizationSummary$containsEpubBubbles
[1] FALSE

$volumeInfo$panelizationSummary$containsImageBubbles
[1] FALSE


$volumeInfo$imageLinks
$volumeInfo$imageLinks$smallThumbnail
[1] "http://books.google.com/books/content?id=7z_fCQAAQBAJ&printsec=frontcover&img=1&zoom=5&edge=curl&source=gbs_api"

$volumeInfo$imageLinks$thumbnail
[1] "http://books.google.com/books/content?id=7z_fCQAAQBAJ&printsec=frontcover&img=1&zoom=1&edge=curl&source=gbs_api"


$volumeInfo$language
[1] "en"

$volumeInfo$previewLink
[1] "http://books.google.de/books?id=7z_fCQAAQBAJ&pg=PT4&dq=web+scraping&hl=&cd=1&source=gbs_api"

$volumeInfo$infoLink
[1] "http://books.google.de/books?id=7z_fCQAAQBAJ&dq=web+scraping&hl=&source=gbs_api"

$volumeInfo$canonicalVolumeLink
[1] "https://books.google.com/books/about/Web_Scraping_with_Python.html?hl=&id=7z_fCQAAQBAJ"


$saleInfo
$saleInfo$country
[1] "DE"

$saleInfo$saleability
[1] "NOT_FOR_SALE"

$saleInfo$isEbook
[1] FALSE


$accessInfo
$accessInfo$country
[1] "DE"

$accessInfo$viewability
[1] "PARTIAL"

$accessInfo$embeddable
[1] TRUE

$accessInfo$publicDomain
[1] FALSE

$accessInfo$textToSpeechPermission
[1] "ALLOWED"

$accessInfo$epub
$accessInfo$epub$isAvailable
[1] TRUE


$accessInfo$pdf
$accessInfo$pdf$isAvailable
[1] TRUE


$accessInfo$webReaderLink
[1] "http://play.google.com/books/reader?id=7z_fCQAAQBAJ&hl=&source=gbs_api"

$accessInfo$accessViewStatus
[1] "SAMPLE"

$accessInfo$quoteSharingAllowed
[1] FALSE


$searchInfo
$searchInfo$textSnippet
[1] "Collecting Data from the Modern Web Ryan Mitchell. Preface To those who have not developed the skill , computer programming can seem like a kind of magic . If programming is magic , then web ... <b>Web Scraping</b> ? If the only way you access."
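The parsed response is a deeply nested list, so single fields have to be pulled out item by item. The sketch below uses a hand-built stand-in for `content(google, as = "parsed")$items`, reduced to a few fields with values taken from the outputs in this section; the helper logic is one possible approach, not the only one.

```r
# Hand-built stand-in for the parsed items (only a few fields kept)
items <- list(
  list(id = "7z_fCQAAQBAJ",
       volumeInfo = list(title = "Web Scraping with Python",
                         subtitle = "Collecting Data from the Modern Web",
                         authors = list("Ryan Mitchell"))),
  list(id = "jHc5DwAAQBAJ",
       volumeInfo = list(title = "Python Web Scraping",
                         authors = list("Katharine Jarmul", "Richard Lawson")))
)

# Required field: one title per item
titles <- vapply(items, function(x) x$volumeInfo$title, character(1))

# Optional field: missing subtitles are NULL, so replace them with NA
subtitles <- vapply(items, function(x) {
  s <- x$volumeInfo$subtitle
  if (is.null(s)) NA_character_ else s
}, character(1))

# Multi-valued field: collapse all authors of a book into one string
authors <- vapply(items, function(x) {
  paste(unlist(x$volumeInfo$authors), collapse = ", ")
}, character(1))

books <- data.frame(title = titles, subtitle = subtitles, authors = authors,
                    stringsAsFactors = FALSE)
books
```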

JSON Objects

  • JSON: JavaScript Object Notation
  • Easy for humans to read and write
  • Easy for machines to parse and generate

JSON Example

{
  "name": "Homer Simpson",
  "age": 34,
  "address": {
    "street": "742 Evergreen Terrace",
    "city": "Springfield"
  }
}
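In R, the object above can be parsed with the jsonlite package. A JSON object becomes a named list, and nested objects become nested lists, as this short sketch shows.

```r
library(jsonlite)

json_text <- '{
  "name": "Homer Simpson",
  "age": 34,
  "address": {
    "street": "742 Evergreen Terrace",
    "city": "Springfield"
  }
}'

# Parse the JSON text into a named R list
homer <- fromJSON(json_text)

homer$name          # "Homer Simpson"
homer$age           # 34
homer$address$city  # "Springfield"

# Convert the list back to JSON text
homer_json <- toJSON(homer, auto_unbox = TRUE, pretty = TRUE)
homer_json
```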

JSON Structure

  • JSON objects can have multiple levels of nesting
  • Use the jsonlite package to convert JSON to a data frame
library(jsonlite)
library(tibble)

google_text <- content(google, as = "text", encoding = "UTF-8")

google_json <- fromJSON(google_text, flatten = TRUE)
head(as_tibble(google_json)$items)
          kind           id        etag
1 books#volume 7z_fCQAAQBAJ EB6+0Nmjb9g
2 books#volume V_l_CwAAQBAJ EGtGNKeSn/w
3 books#volume D5jXEAAAQBAJ QzCrrLEoq2I
4 books#volume ixUvEAAAQBAJ mlerhR7LHME
5 books#volume jHc5DwAAQBAJ YlFtR0ziK9Y
6 books#volume jQGGDwAAQBAJ +YcqSQB3KXY
                                                  selfLink
1 https://www.googleapis.com/books/v1/volumes/7z_fCQAAQBAJ
2 https://www.googleapis.com/books/v1/volumes/V_l_CwAAQBAJ
3 https://www.googleapis.com/books/v1/volumes/D5jXEAAAQBAJ
4 https://www.googleapis.com/books/v1/volumes/ixUvEAAAQBAJ
5 https://www.googleapis.com/books/v1/volumes/jHc5DwAAQBAJ
6 https://www.googleapis.com/books/v1/volumes/jQGGDwAAQBAJ
                   volumeInfo.title
1          Web Scraping with Python
2          Web Scraping with Python
3 Hands-On Web Scraping with Python
4   A Python Guide for Web Scraping
5               Python Web Scraping
6 Go Web Scraping Quick Start Guide
                                                                                                    volumeInfo.subtitle
1                                                                                   Collecting Data from the Modern Web
2                                                                                                                  <NA>
3                                                   Extract quality data from the web using effective Python techniques
4 Explore Python Tools, Web Scraping Techniques, and How to Automata Data for Industrial Applications (English Edition)
5                                                                                                                  <NA>
6                                                       Implement the power of Go to scrape and crawl data from the web
                volumeInfo.authors   volumeInfo.publisher
1                    Ryan Mitchell "O'Reilly Media, Inc."
2                   Richard Lawson   Packt Publishing Ltd
3                  Anish Chapagain   Packt Publishing Ltd
4        Pradumna Milind Panditrao       BPB Publications
5 Katharine Jarmul, Richard Lawson   Packt Publishing Ltd
6                    Vincent Smith   Packt Publishing Ltd
  volumeInfo.publishedDate
1               2015-06-15
2               2015-10-28
3               2023-10-06
4               2021-05-18
5               2017-05-30
6               2019-01-30
  volumeInfo.description
1 Learn web scraping and crawling techniques to access unlimited data from any web source in any format. With this practical guide, you’ll learn how to use Python scripts and web APIs to gather and process data from thousands—or even millions—of web pages at once. 
Ideal for programmers, security professionals, and web administrators familiar with Python, this book not only teaches basic web scraping mechanics, but also delves into more advanced topics, such as analyzing raw data or using scrapers for frontend website testing. Code samples are available to help you understand the concepts in practice. Learn how to parse complicated HTML pages Traverse multiple pages and sites Get a general overview of APIs and how they work Learn several methods for storing the data you scrape Download, read, and extract data from documents Use tools and techniques to clean badly formatted data Read and write natural languages Crawl through forms and logins Understand how to scrape JavaScript Learn image processing and text recognition
2 Successfully scrape data from any website with the power of Python About This Book A hands-on guide to web scraping with real-life problems and solutions Techniques to download and extract data from complex websites Create a number of different web scrapers to extract information Who This Book Is For This book is aimed at developers who want to use web scraping for legitimate purposes. Prior programming experience with Python would be useful but not essential. Anyone with general knowledge of programming languages should be able to pick up the book and understand the principals involved. What You Will Learn Extract data from web pages with simple Python programming Build a threaded crawler to process web pages in parallel Follow links to crawl a website Download cache to reduce bandwidth Use multiple threads and processes to scrape faster Learn how to parse JavaScript-dependent websites Interact with forms and sessions Solve CAPTCHAs on protected web pages Discover how to track the state of a crawl In Detail The Internet contains the most useful set of data ever assembled, largely publicly accessible for free. However, this data is not easily reusable. It is embedded within the structure and style of websites and needs to be carefully extracted to be useful. Web scraping is becoming increasingly useful as a means to easily gather and make sense of the plethora of information available online. Using a simple language like Python, you can crawl the information out of complex websites using simple programming. This book is the ultimate guide to using Python to scrape data from websites. In the early chapters it covers how to extract data from static web pages and how to use caching to manage the load on servers. 
After the basics we'll get our hands dirty with building a more sophisticated crawler with threads and more advanced topics. Learn step-by-step how to use Ajax URLs, employ the Firebug extension for monitoring, and indirectly scrape data. Discover more scraping nitty-gritties such as using the browser renderer, managing cookies, how to submit forms to extract data from complex websites protected by CAPTCHA, and so on. The book wraps up with how to create high-level scrapers with Scrapy libraries and implement what has been learned to real websites. Style and approach This book is a hands-on guide with real-life examples and solutions starting simple and then progressively becoming more complex. Each chapter in this book introduces a problem and then provides one or more possible solutions.
3                                                Work through practical examples to unlock the full potential of web scraping with Python and gain valuable insights from high-quality data Key Features Build an initial portfolio of web scraping projects with detailed explanations Grasp Python programming fundamentals related to web scraping and data extraction Acquire skills to code web scrapers, store data in desired formats, and employ the data professionally Purchase of the print or Kindle book includes a free PDF eBook Book DescriptionWeb scraping is a powerful tool for extracting data from the web, but it can be daunting for those without a technical background. Designed for novices, this book will help you grasp the fundamentals of web scraping and Python programming, even if you have no prior experience. Adopting a practical, hands-on approach, this updated edition of Hands-On Web Scraping with Python uses real-world examples and exercises to explain key concepts. Starting with an introduction to web scraping fundamentals and Python programming, you’ll cover a range of scraping techniques, including requests, lxml, pyquery, Scrapy, and Beautiful Soup. You’ll also get to grips with advanced topics such as secure web handling, web APIs, Selenium for web scraping, PDF extraction, regex, data analysis, EDA reports, visualization, and machine learning. This book emphasizes the importance of learning by doing. Each chapter integrates examples that demonstrate practical techniques and related skills. 
By the end of this book, you’ll be equipped with the skills to extract data from websites, a solid understanding of web scraping and Python programming, and the confidence to use these skills in your projects for analysis, visualization, and information discovery.What you will learn Master web scraping techniques to extract data from real-world websites Implement popular web scraping libraries such as requests, lxml, Scrapy, and pyquery Develop advanced skills in web scraping, APIs, PDF extraction, regex, and machine learning Analyze and visualize data with Pandas and Plotly Develop a practical portfolio to demonstrate your web scraping skills Understand best practices and ethical concerns in web scraping and data extraction Who this book is for This book is for beginners who want to learn web scraping and data extraction using Python. No prior programming knowledge is required, but a basic understanding of web-related concepts such as websites, browsers, and HTML is assumed. If you enjoy learning by doing and want to build a portfolio of web scraping projects and delve into data-related studies and application, then this book is tailored for your needs.
4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Get hands-on training on any web crawling/scraping tool and uses of web scraping in the real-time industry Ê KEY FEATURESÊÊ _ Includes numerous use-cases on the use of web scraping for industrial applications. _ Learn how to automate web scraping tasks. _ Explore ready-made syntaxes of Python scripts to run web scraping. DESCRIPTIONÊ A Python Guide for Web Scraping is a book that will give information about the importance of web scraping using Python. It includes real-time examples of web scraping. It implies the automation use cases of web scraping as well. It gives information about the different tools and libraries of web scraping so that readers get a wide idea about the features and existence of web scraping. In this book, we started with the basics of Python and its syntactical information. We briefed about the use cases and features of Python. We have explained the importance of Python in automation systems. Furthermore, we have added information about real-time industrial examples. We have concentrated and deep-dived into PythonÕs importance in web scraping, explained the different tools and their usages. We have explained the real-time industrial domain-wise use cases for web scraping. 
WHAT YOU WILL LEARN _ Explore the Python syntax and key features of using Python for web scraping. _ Usage of Python in the web scraping tasks and how to automate scraping. _ How to use different libraries and modules of Python. WHO THIS BOOK IS FORÊÊ This book is basically for data engineers and data programmers who have a basic knowledge of Python and for theÊ readers who want to learn about web scraping projects for industries. TABLE OF CONTENTS 1. Python Basics 2. Use Cases of Python 3. Automation Using Python 4. Industrial Automation-Python 5. Web Scraping 6. Web Scraping and Necessity 7. Python - Web Scraping and Different Tools 8. Automation in Web Scraping 9. Use Cases-Web Scraping 10. Industrial Benefits of Web Scraping
5 Successfully scrape data from any website with the power of Python 3.x About This Book A hands-on guide to web scraping using Python with solutions to real-world problems Create a number of different web scrapers in Python to extract information This book includes practical examples on using the popular and well-maintained libraries in Python for your web scraping needs Who This Book Is For This book is aimed at developers who want to use web scraping for legitimate purposes. Prior programming experience with Python would be useful but not essential. Anyone with general knowledge of programming languages should be able to pick up the book and understand the principals involved. What You Will Learn Extract data from web pages with simple Python programming Build a concurrent crawler to process web pages in parallel Follow links to crawl a website Extract features from the HTML Cache downloaded HTML for reuse Compare concurrent models to determine the fastest crawler Find out how to parse JavaScript-dependent websites Interact with forms and sessions In Detail The Internet contains the most useful set of data ever assembled, most of which is publicly accessible for free. However, this data is not easily usable. It is embedded within the structure and style of websites and needs to be carefully extracted. Web scraping is becoming increasingly useful as a means to gather and make sense of the wealth of information available online. This book is the ultimate guide to using the latest features of Python 3.x to scrape data from websites. In the early chapters, you'll see how to extract data from static web pages. You'll learn to use caching with databases and files to save time and manage the load on servers. After covering the basics, you'll get hands-on practice building a more sophisticated crawler using browsers, crawlers, and concurrent scrapers. You'll determine when and how to scrape data from a JavaScript-dependent website using PyQt and Selenium. 
You'll get a better understanding of how to submit forms on complex websites protected by CAPTCHA. You'll find out how to automate these actions with Python packages such as mechanize. You'll also learn how to create class-based scrapers with Scrapy libraries and implement your learning on real websites. By the end of the book, you will have explored testing websites with scrapers, remote scraping, best practices, working with images, and many other relevant topics. Style and approach This hands-on guide is full of real-life examples and solutions starting simple and then progressively becoming more complex. Each chapter in this book introduces a problem and then provides one or more possible solutions.
6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
                                                                                                                                                                                                                                                                                                                                                                        Web scraping is the process of extracting information from the web using various tools that perform scraping and crawling. Go is emerging as the language of choice for scraping using a variety of libraries. This book will quickly explain to you, how to scrape data data from various websites using Go libraries such as Colly and Goquery.
               volumeInfo.industryIdentifiers volumeInfo.pageCount
1 ISBN_13, ISBN_10, 9781491910252, 1491910259                  264
2 ISBN_13, ISBN_10, 9781782164371, 1782164375                  174
3 ISBN_13, ISBN_10, 9781837638512, 1837638519                  324
4 ISBN_13, ISBN_10, 9789390684991, 9390684994                  130
5 ISBN_13, ISBN_10, 9781786464293, 1786464292                  215
6 ISBN_13, ISBN_10, 9781789612943, 1789612942                  125
  volumeInfo.printType volumeInfo.categories volumeInfo.averageRating
1                 BOOK             Computers                      4.5
2                 BOOK             Computers                       NA
3                 BOOK             Computers                       NA
4                 BOOK             Computers                       NA
5                 BOOK             Computers                       NA
6                 BOOK             Computers                       NA
  volumeInfo.ratingsCount volumeInfo.maturityRating volumeInfo.allowAnonLogging
1                       2                NOT_MATURE                        TRUE
2                      NA                NOT_MATURE                        TRUE
3                      NA                NOT_MATURE                       FALSE
4                      NA                NOT_MATURE                        TRUE
5                      NA                NOT_MATURE                        TRUE
6                      NA                NOT_MATURE                        TRUE
  volumeInfo.contentVersion volumeInfo.language
1       1.13.15.0.preview.3                  en
2         2.3.4.0.preview.3                  en
3         0.3.4.0.preview.3                  en
4         1.2.2.0.preview.3                  en
5         1.5.8.0.preview.3                  en
6         1.3.3.0.preview.3                  en
                                                                         volumeInfo.previewLink
1   http://books.google.de/books?id=7z_fCQAAQBAJ&pg=PT4&dq=web+scraping&hl=&cd=1&source=gbs_api
2   http://books.google.de/books?id=V_l_CwAAQBAJ&pg=PA1&dq=web+scraping&hl=&cd=2&source=gbs_api
3  http://books.google.de/books?id=D5jXEAAAQBAJ&pg=PR14&dq=web+scraping&hl=&cd=3&source=gbs_api
4  http://books.google.de/books?id=ixUvEAAAQBAJ&pg=PT77&dq=web+scraping&hl=&cd=4&source=gbs_api
5 http://books.google.de/books?id=jHc5DwAAQBAJ&pg=PA172&dq=web+scraping&hl=&cd=5&source=gbs_api
6   http://books.google.de/books?id=jQGGDwAAQBAJ&pg=PA5&dq=web+scraping&hl=&cd=6&source=gbs_api
                                                              volumeInfo.infoLink
1 http://books.google.de/books?id=7z_fCQAAQBAJ&dq=web+scraping&hl=&source=gbs_api
2      https://play.google.com/store/books/details?id=V_l_CwAAQBAJ&source=gbs_api
3      https://play.google.com/store/books/details?id=D5jXEAAAQBAJ&source=gbs_api
4      https://play.google.com/store/books/details?id=ixUvEAAAQBAJ&source=gbs_api
5      https://play.google.com/store/books/details?id=jHc5DwAAQBAJ&source=gbs_api
6      https://play.google.com/store/books/details?id=jQGGDwAAQBAJ&source=gbs_api
                                                          volumeInfo.canonicalVolumeLink
1 https://books.google.com/books/about/Web_Scraping_with_Python.html?hl=&id=7z_fCQAAQBAJ
2                            https://play.google.com/store/books/details?id=V_l_CwAAQBAJ
3                            https://play.google.com/store/books/details?id=D5jXEAAAQBAJ
4                            https://play.google.com/store/books/details?id=ixUvEAAAQBAJ
5                            https://play.google.com/store/books/details?id=jHc5DwAAQBAJ
6                            https://play.google.com/store/books/details?id=jQGGDwAAQBAJ
  volumeInfo.readingModes.text volumeInfo.readingModes.image
1                         TRUE                          TRUE
2                         TRUE                          TRUE
3                         TRUE                          TRUE
4                         TRUE                          TRUE
5                         TRUE                          TRUE
6                         TRUE                          TRUE
  volumeInfo.panelizationSummary.containsEpubBubbles
1                                              FALSE
2                                              FALSE
3                                              FALSE
4                                              FALSE
5                                              FALSE
6                                              FALSE
  volumeInfo.panelizationSummary.containsImageBubbles
1                                               FALSE
2                                               FALSE
3                                               FALSE
4                                               FALSE
5                                               FALSE
6                                               FALSE
                                                                             volumeInfo.imageLinks.smallThumbnail
1 http://books.google.com/books/content?id=7z_fCQAAQBAJ&printsec=frontcover&img=1&zoom=5&edge=curl&source=gbs_api
2 http://books.google.com/books/content?id=V_l_CwAAQBAJ&printsec=frontcover&img=1&zoom=5&edge=curl&source=gbs_api
3 http://books.google.com/books/content?id=D5jXEAAAQBAJ&printsec=frontcover&img=1&zoom=5&edge=curl&source=gbs_api
4 http://books.google.com/books/content?id=ixUvEAAAQBAJ&printsec=frontcover&img=1&zoom=5&edge=curl&source=gbs_api
5 http://books.google.com/books/content?id=jHc5DwAAQBAJ&printsec=frontcover&img=1&zoom=5&edge=curl&source=gbs_api
6 http://books.google.com/books/content?id=jQGGDwAAQBAJ&printsec=frontcover&img=1&zoom=5&edge=curl&source=gbs_api
                                                                                  volumeInfo.imageLinks.thumbnail
1 http://books.google.com/books/content?id=7z_fCQAAQBAJ&printsec=frontcover&img=1&zoom=1&edge=curl&source=gbs_api
2 http://books.google.com/books/content?id=V_l_CwAAQBAJ&printsec=frontcover&img=1&zoom=1&edge=curl&source=gbs_api
3 http://books.google.com/books/content?id=D5jXEAAAQBAJ&printsec=frontcover&img=1&zoom=1&edge=curl&source=gbs_api
4 http://books.google.com/books/content?id=ixUvEAAAQBAJ&printsec=frontcover&img=1&zoom=1&edge=curl&source=gbs_api
5 http://books.google.com/books/content?id=jHc5DwAAQBAJ&printsec=frontcover&img=1&zoom=1&edge=curl&source=gbs_api
6 http://books.google.com/books/content?id=jQGGDwAAQBAJ&printsec=frontcover&img=1&zoom=1&edge=curl&source=gbs_api
  saleInfo.country saleInfo.saleability saleInfo.isEbook
1               DE         NOT_FOR_SALE            FALSE
2               DE             FOR_SALE             TRUE
3               DE             FOR_SALE             TRUE
4               DE             FOR_SALE             TRUE
5               DE             FOR_SALE             TRUE
6               DE             FOR_SALE             TRUE
                                                                                          saleInfo.buyLink
1                                                                                                     <NA>
2 https://play.google.com/store/books/details?id=V_l_CwAAQBAJ&rdid=book-V_l_CwAAQBAJ&rdot=1&source=gbs_api
3 https://play.google.com/store/books/details?id=D5jXEAAAQBAJ&rdid=book-D5jXEAAAQBAJ&rdot=1&source=gbs_api
4 https://play.google.com/store/books/details?id=ixUvEAAAQBAJ&rdid=book-ixUvEAAAQBAJ&rdot=1&source=gbs_api
5 https://play.google.com/store/books/details?id=jHc5DwAAQBAJ&rdid=book-jHc5DwAAQBAJ&rdot=1&source=gbs_api
6 https://play.google.com/store/books/details?id=jQGGDwAAQBAJ&rdid=book-jQGGDwAAQBAJ&rdot=1&source=gbs_api
                        saleInfo.offers saleInfo.listPrice.amount
1                                  NULL                        NA
2 1, TRUE, 18180000, EUR, 12730000, EUR                     18.18
3 1, TRUE, 28880000, EUR, 20220000, EUR                     28.88
4 1, TRUE, 10280000, EUR, 10280000, EUR                     10.28
5 1, TRUE, 24600000, EUR, 17220000, EUR                     24.60
6 1, TRUE, 14970000, EUR, 10480000, EUR                     14.97
  saleInfo.listPrice.currencyCode saleInfo.retailPrice.amount
1                            <NA>                          NA
2                             EUR                       12.73
3                             EUR                       20.22
4                             EUR                       10.28
5                             EUR                       17.22
6                             EUR                       10.48
  saleInfo.retailPrice.currencyCode accessInfo.country accessInfo.viewability
1                              <NA>                 DE                PARTIAL
2                               EUR                 DE                PARTIAL
3                               EUR                 DE                PARTIAL
4                               EUR                 DE                PARTIAL
5                               EUR                 DE                PARTIAL
6                               EUR                 DE                PARTIAL
  accessInfo.embeddable accessInfo.publicDomain
1                  TRUE                   FALSE
2                  TRUE                   FALSE
3                  TRUE                   FALSE
4                  TRUE                   FALSE
5                  TRUE                   FALSE
6                  TRUE                   FALSE
  accessInfo.textToSpeechPermission
1                           ALLOWED
2                           ALLOWED
3                           ALLOWED
4                           ALLOWED
5                           ALLOWED
6                           ALLOWED
                                                accessInfo.webReaderLink
1 http://play.google.com/books/reader?id=7z_fCQAAQBAJ&hl=&source=gbs_api
2 http://play.google.com/books/reader?id=V_l_CwAAQBAJ&hl=&source=gbs_api
3 http://play.google.com/books/reader?id=D5jXEAAAQBAJ&hl=&source=gbs_api
4 http://play.google.com/books/reader?id=ixUvEAAAQBAJ&hl=&source=gbs_api
5 http://play.google.com/books/reader?id=jHc5DwAAQBAJ&hl=&source=gbs_api
6 http://play.google.com/books/reader?id=jQGGDwAAQBAJ&hl=&source=gbs_api
  accessInfo.accessViewStatus accessInfo.quoteSharingAllowed
1                      SAMPLE                          FALSE
2                      SAMPLE                          FALSE
3                      SAMPLE                          FALSE
4                      SAMPLE                          FALSE
5                      SAMPLE                          FALSE
6                      SAMPLE                          FALSE
  accessInfo.epub.isAvailable
1                        TRUE
2                        TRUE
3                        TRUE
4                        TRUE
5                        TRUE
6                        TRUE
                                                                                                                                                    accessInfo.epub.acsTokenLink
1                                                                                                                                                                           <NA>
2                                                                                                                                                                           <NA>
3                                                                                                                                                                           <NA>
4 http://books.google.de/books/download/A_Python_Guide_for_Web_Scraping-sample-epub.acsm?id=ixUvEAAAQBAJ&format=epub&output=acs4_fulfillment_token&dl_type=sample&source=gbs_api
5                                                                                                                                                                           <NA>
6                                                                                                                                                                           <NA>
  accessInfo.pdf.isAvailable
1                       TRUE
2                       TRUE
3                       TRUE
4                       TRUE
5                       TRUE
6                       TRUE
                                                                                                                                                   accessInfo.pdf.acsTokenLink
1                                                                                                                                                                         <NA>
2                                                                                                                                                                         <NA>
3                                                                                                                                                                         <NA>
4 http://books.google.de/books/download/A_Python_Guide_for_Web_Scraping-sample-pdf.acsm?id=ixUvEAAAQBAJ&format=pdf&output=acs4_fulfillment_token&dl_type=sample&source=gbs_api
5                                                                                                                                                                         <NA>
6                                                                                                                                                                         <NA>
                                                                                                                                                                                                                                                                            searchInfo.textSnippet
1                                             Collecting Data from the Modern Web Ryan Mitchell. Preface To those who have not developed the skill , computer programming can seem like a kind of magic . If programming is magic , then web ... <b>Web Scraping</b> ? If the only way you access.
2                ... short, we cannot rely on APIs to access the online data we may want and therefore, need to learn about <b>web scraping</b> techniques. Is <b>web scraping</b> legal ? <b>Web scraping</b> is in the [ 1 ] Chapter 1: Introduction to <b>Web Scraping</b> When is web&nbsp;...
3 ... <b>web</b> documents. Examples of <b>scraping</b> using pyquery and writing data to JSON and CSV are also covered. Chapter 5, <b>Scraping</b> the <b>Web</b> with Scrapy and Beautiful Soup, provides an overview and examples of using and deploying a popular <b>web</b>-crawling&nbsp;...
4                       Explore Python Tools, <b>Web Scraping</b> Techniques, and How to Automata Data for Industrial Applications (English Edition) Pradumna Milind Panditrao. 1. Why is <b>web scraping</b> used ? 2. What are the differences between <b>web scraping</b> and crawling&nbsp;...
5                  ... <b>webscraping</b>.com/ my_example_site 8 Start page New sample SPIDER ? Show all spiders : * Katharine Q : Extracted items 0 JSON C Example <b>web scraping</b> website You will start to recognize some of the fields from. example.<b>webscraping</b>.com START&nbsp;...
6                        Implement the power of Go to scrape and crawl data from the <b>web</b> Vincent Smith. 1. Introducing. <b>Web</b>. <b>Scraping</b>. and. Go. Collecting, parsing, storing, and processing data are essential tasks that almost everybody will need to do in their&nbsp;...

Pagination

  • APIs often limit the number of results in a single response
  • Use pagination to retrieve more results
    • Continuation keys: Use a key from the response to fetch the next page
    • Start index: Specify where to begin retrieving the next results
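
The two strategies above can be sketched in R without touching a real service. This is a minimal illustration, not any particular API: the `records` vector, the functions `fetch_by_index()` and `fetch_by_key()`, and the `next_key` field are all hypothetical stand-ins for a server that returns at most 10 results per response.

```r
# Hypothetical in-memory "API": 25 records, served at most 10 at a time.
records <- sprintf("item_%02d", 1:25)

# Start-index style: the client states where to begin (zero-based)
# and how many results it wants.
fetch_by_index <- function(start_index, max_results = 10) {
  if (start_index >= length(records)) return(character(0))
  first <- start_index + 1                                  # R vectors are one-based
  last  <- min(start_index + max_results, length(records))
  records[first:last]
}

# Keep requesting pages until an empty page comes back.
collect_by_index <- function(max_results = 10) {
  out <- character(0)
  start_index <- 0
  repeat {
    page <- fetch_by_index(start_index, max_results)
    if (length(page) == 0) break
    out <- c(out, page)
    start_index <- start_index + length(page)
  }
  out
}

# Continuation-key style: each response carries a key for the next page
# (NULL once the last page has been served).
fetch_by_key <- function(key = NULL, max_results = 10) {
  start_index <- if (is.null(key)) 0 else as.integer(key)
  page <- fetch_by_index(start_index, max_results)
  next_index <- start_index + length(page)
  next_key <- if (next_index < length(records)) as.character(next_index) else NULL
  list(items = page, next_key = next_key)
}

# Follow the keys until the response no longer contains one.
collect_by_key <- function(max_results = 10) {
  out <- character(0)
  key <- NULL
  repeat {
    response <- fetch_by_key(key, max_results)
    out <- c(out, response$items)
    key <- response$next_key
    if (is.null(key)) break
  }
  out
}

all_by_index <- collect_by_index()
all_by_key   <- collect_by_key()
```

Both loops end up with all 25 records. With a real API, the fetch functions would wrap an HTTP request, and it is usually sensible to pause between requests (for example with `Sys.sleep()`) to respect rate limits.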

Pagination with Google Books API

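  • Use the startIndex parameter to specify the zero-based position of the first result to return
  • Example: Skip the first result with startIndex=1 (the first result has index 0); with the default of 10 results per response, the second page would start at startIndex=10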
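# Request results starting at the second item (startIndex is zero-based)
google_url_10 <- "https://www.googleapis.com/books/v1/volumes?q=web+scraping&startIndex=1"

# Send the GET request (httr)
google_10 <- GET(google_url_10)

# Extract the response body as text and parse the JSON into flattened data (jsonlite)
google_text_10 <- content(google_10, "text")
google_10_json <- fromJSON(google_text_10, flatten = TRUE)

# Show the first six returned volumes
head(as_tibble(google_10_json)$items)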
          kind           id        etag
1 books#volume V_l_CwAAQBAJ 8xMQdFfVmTU
2 books#volume D5jXEAAAQBAJ /TLWJwih/uY
3 books#volume ixUvEAAAQBAJ Tv5XImWc7hU
4 books#volume jHc5DwAAQBAJ dSj2LSBRgs8
5 books#volume jQGGDwAAQBAJ gYG05HNo1L8
6 books#volume Iel1DwAAQBAJ 9eNh0lV0N/4
                                                  selfLink
1 https://www.googleapis.com/books/v1/volumes/V_l_CwAAQBAJ
2 https://www.googleapis.com/books/v1/volumes/D5jXEAAAQBAJ
3 https://www.googleapis.com/books/v1/volumes/ixUvEAAAQBAJ
4 https://www.googleapis.com/books/v1/volumes/jHc5DwAAQBAJ
5 https://www.googleapis.com/books/v1/volumes/jQGGDwAAQBAJ
6 https://www.googleapis.com/books/v1/volumes/Iel1DwAAQBAJ
                   volumeInfo.title               volumeInfo.authors
1          Web Scraping with Python                   Richard Lawson
2 Hands-On Web Scraping with Python                  Anish Chapagain
3   A Python Guide for Web Scraping        Pradumna Milind Panditrao
4               Python Web Scraping Katharine Jarmul, Richard Lawson
5 Go Web Scraping Quick Start Guide                    Vincent Smith
6  R Web Scraping Quick Start Guide                      Olgun Aydin
  volumeInfo.publisher volumeInfo.publishedDate
1 Packt Publishing Ltd               2015-10-28
2 Packt Publishing Ltd               2023-10-06
3     BPB Publications               2021-05-18
4 Packt Publishing Ltd               2017-05-30
5 Packt Publishing Ltd               2019-01-30
6 Packt Publishing Ltd               2018-10-31
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   volumeInfo.description
1                                                                                                                                                             Successfully scrape data from any website with the power of Python About This Book A hands-on guide to web scraping with real-life problems and solutions Techniques to download and extract data from complex websites Create a number of different web scrapers to extract information Who This Book Is For This book is aimed at developers who want to use web scraping for legitimate purposes. Prior programming experience with Python would be useful but not essential. Anyone with general knowledge of programming languages should be able to pick up the book and understand the principals involved. What You Will Learn Extract data from web pages with simple Python programming Build a threaded crawler to process web pages in parallel Follow links to crawl a website Download cache to reduce bandwidth Use multiple threads and processes to scrape faster Learn how to parse JavaScript-dependent websites Interact with forms and sessions Solve CAPTCHAs on protected web pages Discover how to track the state of a crawl In Detail The Internet contains the most useful set of data ever assembled, largely publicly accessible for free. However, this data is not easily reusable. It is embedded within the structure and style of websites and needs to be carefully extracted to be useful. Web scraping is becoming increasingly useful as a means to easily gather and make sense of the plethora of information available online. Using a simple language like Python, you can crawl the information out of complex websites using simple programming. This book is the ultimate guide to using Python to scrape data from websites. In the early chapters it covers how to extract data from static web pages and how to use caching to manage the load on servers. 
After the basics we'll get our hands dirty with building a more sophisticated crawler with threads and more advanced topics. Learn step-by-step how to use Ajax URLs, employ the Firebug extension for monitoring, and indirectly scrape data. Discover more scraping nitty-gritties such as using the browser renderer, managing cookies, how to submit forms to extract data from complex websites protected by CAPTCHA, and so on. The book wraps up with how to create high-level scrapers with Scrapy libraries and implement what has been learned to real websites. Style and approach This book is a hands-on guide with real-life examples and solutions starting simple and then progressively becoming more complex. Each chapter in this book introduces a problem and then provides one or more possible solutions.
2                                                Work through practical examples to unlock the full potential of web scraping with Python and gain valuable insights from high-quality data Key Features Build an initial portfolio of web scraping projects with detailed explanations Grasp Python programming fundamentals related to web scraping and data extraction Acquire skills to code web scrapers, store data in desired formats, and employ the data professionally Purchase of the print or Kindle book includes a free PDF eBook Book DescriptionWeb scraping is a powerful tool for extracting data from the web, but it can be daunting for those without a technical background. Designed for novices, this book will help you grasp the fundamentals of web scraping and Python programming, even if you have no prior experience. Adopting a practical, hands-on approach, this updated edition of Hands-On Web Scraping with Python uses real-world examples and exercises to explain key concepts. Starting with an introduction to web scraping fundamentals and Python programming, you’ll cover a range of scraping techniques, including requests, lxml, pyquery, Scrapy, and Beautiful Soup. You’ll also get to grips with advanced topics such as secure web handling, web APIs, Selenium for web scraping, PDF extraction, regex, data analysis, EDA reports, visualization, and machine learning. This book emphasizes the importance of learning by doing. Each chapter integrates examples that demonstrate practical techniques and related skills. 
By the end of this book, you’ll be equipped with the skills to extract data from websites, a solid understanding of web scraping and Python programming, and the confidence to use these skills in your projects for analysis, visualization, and information discovery.What you will learn Master web scraping techniques to extract data from real-world websites Implement popular web scraping libraries such as requests, lxml, Scrapy, and pyquery Develop advanced skills in web scraping, APIs, PDF extraction, regex, and machine learning Analyze and visualize data with Pandas and Plotly Develop a practical portfolio to demonstrate your web scraping skills Understand best practices and ethical concerns in web scraping and data extraction Who this book is for This book is for beginners who want to learn web scraping and data extraction using Python. No prior programming knowledge is required, but a basic understanding of web-related concepts such as websites, browsers, and HTML is assumed. If you enjoy learning by doing and want to build a portfolio of web scraping projects and delve into data-related studies and application, then this book is tailored for your needs.
3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Get hands-on training on any web crawling/scraping tool and uses of web scraping in the real-time industry Ê KEY FEATURESÊÊ _ Includes numerous use-cases on the use of web scraping for industrial applications. _ Learn how to automate web scraping tasks. _ Explore ready-made syntaxes of Python scripts to run web scraping. DESCRIPTIONÊ A Python Guide for Web Scraping is a book that will give information about the importance of web scraping using Python. It includes real-time examples of web scraping. It implies the automation use cases of web scraping as well. It gives information about the different tools and libraries of web scraping so that readers get a wide idea about the features and existence of web scraping. In this book, we started with the basics of Python and its syntactical information. We briefed about the use cases and features of Python. We have explained the importance of Python in automation systems. Furthermore, we have added information about real-time industrial examples. We have concentrated and deep-dived into PythonÕs importance in web scraping, explained the different tools and their usages. We have explained the real-time industrial domain-wise use cases for web scraping. 
WHAT YOU WILL LEARN _ Explore the Python syntax and key features of using Python for web scraping. _ Usage of Python in the web scraping tasks and how to automate scraping. _ How to use different libraries and modules of Python. WHO THIS BOOK IS FORÊÊ This book is basically for data engineers and data programmers who have a basic knowledge of Python and for theÊ readers who want to learn about web scraping projects for industries. TABLE OF CONTENTS 1. Python Basics 2. Use Cases of Python 3. Automation Using Python 4. Industrial Automation-Python 5. Web Scraping 6. Web Scraping and Necessity 7. Python - Web Scraping and Different Tools 8. Automation in Web Scraping 9. Use Cases-Web Scraping 10. Industrial Benefits of Web Scraping
4 Successfully scrape data from any website with the power of Python 3.x About This Book A hands-on guide to web scraping using Python with solutions to real-world problems Create a number of different web scrapers in Python to extract information This book includes practical examples on using the popular and well-maintained libraries in Python for your web scraping needs Who This Book Is For This book is aimed at developers who want to use web scraping for legitimate purposes. Prior programming experience with Python would be useful but not essential. Anyone with general knowledge of programming languages should be able to pick up the book and understand the principals involved. What You Will Learn Extract data from web pages with simple Python programming Build a concurrent crawler to process web pages in parallel Follow links to crawl a website Extract features from the HTML Cache downloaded HTML for reuse Compare concurrent models to determine the fastest crawler Find out how to parse JavaScript-dependent websites Interact with forms and sessions In Detail The Internet contains the most useful set of data ever assembled, most of which is publicly accessible for free. However, this data is not easily usable. It is embedded within the structure and style of websites and needs to be carefully extracted. Web scraping is becoming increasingly useful as a means to gather and make sense of the wealth of information available online. This book is the ultimate guide to using the latest features of Python 3.x to scrape data from websites. In the early chapters, you'll see how to extract data from static web pages. You'll learn to use caching with databases and files to save time and manage the load on servers. After covering the basics, you'll get hands-on practice building a more sophisticated crawler using browsers, crawlers, and concurrent scrapers. You'll determine when and how to scrape data from a JavaScript-dependent website using PyQt and Selenium. 
You'll get a better understanding of how to submit forms on complex websites protected by CAPTCHA. You'll find out how to automate these actions with Python packages such as mechanize. You'll also learn how to create class-based scrapers with Scrapy libraries and implement your learning on real websites. By the end of the book, you will have explored testing websites with scrapers, remote scraping, best practices, working with images, and many other relevant topics. Style and approach This hands-on guide is full of real-life examples and solutions starting simple and then progressively becoming more complex. Each chapter in this book introduces a problem and then provides one or more possible solutions.
5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
                                                                                                                                                                                                                                                                                                                                                                        Web scraping is the process of extracting information from the web using various tools that perform scraping and crawling. Go is emerging as the language of choice for scraping using a variety of libraries. This book will quickly explain to you, how to scrape data data from various websites using Go libraries such as Colly and Goquery.
6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 Web Scraping techniques are getting more popular, since data is as valuable as oil in 21st century. Through this book get some key knowledge about using XPath, regEX; web scraping libraries for R like rvest and RSelenium technologies. Key FeaturesTechniques, tools and frameworks for web scraping with RScrape data effortlessly from a variety of websites Learn how to selectively choose the data to scrape, and build your datasetBook Description Web scraping is a technique to extract data from websites. It simulates the behavior of a website user to turn the website itself into a web service to retrieve or introduce new data. This book gives you all you need to get started with scraping web pages using R programming. You will learn about the rules of RegEx and Xpath, key components for scraping website data. We will show you web scraping techniques, methodologies, and frameworks. With this book's guidance, you will become comfortable with the tools to write and test RegEx and XPath rules. We will focus on examples of dynamic websites for scraping data and how to implement the techniques learned. You will learn how to collect URLs and then create XPath rules for your first web scraping script using rvest library. From the data you collect, you will be able to calculate the statistics and create R plots to visualize them. 
Finally, you will discover how to use Selenium drivers with R for more sophisticated scraping. You will create AWS instances and use R to connect a PostgreSQL database hosted on AWS. By the end of the book, you will be sufficiently confident to create end-to-end web scraping systems using R. What you will learnWrite and create regEX rulesWrite XPath rules to query your dataLearn how web scraping methods workUse rvest to crawl web pagesStore data retrieved from the webLearn the key uses of Rselenium to scrape dataWho this book is for This book is for R programmers who want to get started quickly with web scraping, as well as data analysts who want to learn scraping using R. Basic knowledge of R is all you need to get started with this book.
               volumeInfo.industryIdentifiers volumeInfo.pageCount
1 ISBN_13, ISBN_10, 9781782164371, 1782164375                  174
2 ISBN_13, ISBN_10, 9781837638512, 1837638519                  324
3 ISBN_13, ISBN_10, 9789390684991, 9390684994                  130
4 ISBN_13, ISBN_10, 9781786464293, 1786464292                  215
5 ISBN_13, ISBN_10, 9781789612943, 1789612942                  125
6 ISBN_13, ISBN_10, 9781788992633, 1788992636                  109
  volumeInfo.printType volumeInfo.categories volumeInfo.maturityRating
1                 BOOK             Computers                NOT_MATURE
2                 BOOK             Computers                NOT_MATURE
3                 BOOK             Computers                NOT_MATURE
4                 BOOK             Computers                NOT_MATURE
5                 BOOK             Computers                NOT_MATURE
6                 BOOK             Computers                NOT_MATURE
  volumeInfo.allowAnonLogging volumeInfo.contentVersion volumeInfo.language
1                        TRUE         2.3.4.0.preview.3                  en
2                       FALSE         0.3.4.0.preview.3                  en
3                        TRUE         1.2.2.0.preview.3                  en
4                        TRUE         1.5.8.0.preview.3                  en
5                        TRUE         1.3.3.0.preview.3                  en
6                        TRUE         1.3.3.0.preview.3                  en
                                                                         volumeInfo.previewLink
1   http://books.google.de/books?id=V_l_CwAAQBAJ&pg=PA1&dq=web+scraping&hl=&cd=2&source=gbs_api
2  http://books.google.de/books?id=D5jXEAAAQBAJ&pg=PR14&dq=web+scraping&hl=&cd=3&source=gbs_api
3  http://books.google.de/books?id=ixUvEAAAQBAJ&pg=PT77&dq=web+scraping&hl=&cd=4&source=gbs_api
4 http://books.google.de/books?id=jHc5DwAAQBAJ&pg=PA172&dq=web+scraping&hl=&cd=5&source=gbs_api
5   http://books.google.de/books?id=jQGGDwAAQBAJ&pg=PA5&dq=web+scraping&hl=&cd=6&source=gbs_api
6  http://books.google.de/books?id=Iel1DwAAQBAJ&pg=PA12&dq=web+scraping&hl=&cd=7&source=gbs_api
                                                         volumeInfo.infoLink
1 https://play.google.com/store/books/details?id=V_l_CwAAQBAJ&source=gbs_api
2 https://play.google.com/store/books/details?id=D5jXEAAAQBAJ&source=gbs_api
3 https://play.google.com/store/books/details?id=ixUvEAAAQBAJ&source=gbs_api
4 https://play.google.com/store/books/details?id=jHc5DwAAQBAJ&source=gbs_api
5 https://play.google.com/store/books/details?id=jQGGDwAAQBAJ&source=gbs_api
6 https://play.google.com/store/books/details?id=Iel1DwAAQBAJ&source=gbs_api
                               volumeInfo.canonicalVolumeLink
1 https://play.google.com/store/books/details?id=V_l_CwAAQBAJ
2 https://play.google.com/store/books/details?id=D5jXEAAAQBAJ
3 https://play.google.com/store/books/details?id=ixUvEAAAQBAJ
4 https://play.google.com/store/books/details?id=jHc5DwAAQBAJ
5 https://play.google.com/store/books/details?id=jQGGDwAAQBAJ
6 https://play.google.com/store/books/details?id=Iel1DwAAQBAJ
                                                                                                    volumeInfo.subtitle
1                                                                                                                  <NA>
2                                                   Extract quality data from the web using effective Python techniques
3 Explore Python Tools, Web Scraping Techniques, and How to Automata Data for Industrial Applications (English Edition)
4                                                                                                                  <NA>
5                                                       Implement the power of Go to scrape and crawl data from the web
6                                                           Techniques and tools to crawl and scrape data from websites
  volumeInfo.averageRating volumeInfo.ratingsCount volumeInfo.readingModes.text
1                       NA                      NA                         TRUE
2                       NA                      NA                         TRUE
3                       NA                      NA                         TRUE
4                       NA                      NA                         TRUE
5                       NA                      NA                         TRUE
6                        5                       1                         TRUE
  volumeInfo.readingModes.image
1                          TRUE
2                          TRUE
3                          TRUE
4                          TRUE
5                          TRUE
6                          TRUE
  volumeInfo.panelizationSummary.containsEpubBubbles
1                                              FALSE
2                                              FALSE
3                                              FALSE
4                                              FALSE
5                                              FALSE
6                                              FALSE
  volumeInfo.panelizationSummary.containsImageBubbles
1                                               FALSE
2                                               FALSE
3                                               FALSE
4                                               FALSE
5                                               FALSE
6                                               FALSE
                                                                             volumeInfo.imageLinks.smallThumbnail
1 http://books.google.com/books/content?id=V_l_CwAAQBAJ&printsec=frontcover&img=1&zoom=5&edge=curl&source=gbs_api
2 http://books.google.com/books/content?id=D5jXEAAAQBAJ&printsec=frontcover&img=1&zoom=5&edge=curl&source=gbs_api
3 http://books.google.com/books/content?id=ixUvEAAAQBAJ&printsec=frontcover&img=1&zoom=5&edge=curl&source=gbs_api
4 http://books.google.com/books/content?id=jHc5DwAAQBAJ&printsec=frontcover&img=1&zoom=5&edge=curl&source=gbs_api
5 http://books.google.com/books/content?id=jQGGDwAAQBAJ&printsec=frontcover&img=1&zoom=5&edge=curl&source=gbs_api
6 http://books.google.com/books/content?id=Iel1DwAAQBAJ&printsec=frontcover&img=1&zoom=5&edge=curl&source=gbs_api
                                                                                  volumeInfo.imageLinks.thumbnail
1 http://books.google.com/books/content?id=V_l_CwAAQBAJ&printsec=frontcover&img=1&zoom=1&edge=curl&source=gbs_api
2 http://books.google.com/books/content?id=D5jXEAAAQBAJ&printsec=frontcover&img=1&zoom=1&edge=curl&source=gbs_api
3 http://books.google.com/books/content?id=ixUvEAAAQBAJ&printsec=frontcover&img=1&zoom=1&edge=curl&source=gbs_api
4 http://books.google.com/books/content?id=jHc5DwAAQBAJ&printsec=frontcover&img=1&zoom=1&edge=curl&source=gbs_api
5 http://books.google.com/books/content?id=jQGGDwAAQBAJ&printsec=frontcover&img=1&zoom=1&edge=curl&source=gbs_api
6 http://books.google.com/books/content?id=Iel1DwAAQBAJ&printsec=frontcover&img=1&zoom=1&edge=curl&source=gbs_api
  saleInfo.country saleInfo.saleability saleInfo.isEbook
1               DE             FOR_SALE             TRUE
2               DE             FOR_SALE             TRUE
3               DE             FOR_SALE             TRUE
4               DE             FOR_SALE             TRUE
5               DE             FOR_SALE             TRUE
6               DE             FOR_SALE             TRUE
                                                                                          saleInfo.buyLink
1 https://play.google.com/store/books/details?id=V_l_CwAAQBAJ&rdid=book-V_l_CwAAQBAJ&rdot=1&source=gbs_api
2 https://play.google.com/store/books/details?id=D5jXEAAAQBAJ&rdid=book-D5jXEAAAQBAJ&rdot=1&source=gbs_api
3 https://play.google.com/store/books/details?id=ixUvEAAAQBAJ&rdid=book-ixUvEAAAQBAJ&rdot=1&source=gbs_api
4 https://play.google.com/store/books/details?id=jHc5DwAAQBAJ&rdid=book-jHc5DwAAQBAJ&rdot=1&source=gbs_api
5 https://play.google.com/store/books/details?id=jQGGDwAAQBAJ&rdid=book-jQGGDwAAQBAJ&rdot=1&source=gbs_api
6 https://play.google.com/store/books/details?id=Iel1DwAAQBAJ&rdid=book-Iel1DwAAQBAJ&rdot=1&source=gbs_api
                        saleInfo.offers saleInfo.listPrice.amount
1 1, TRUE, 18180000, EUR, 12730000, EUR                     18.18
2 1, TRUE, 28880000, EUR, 20220000, EUR                     28.88
3 1, TRUE, 10280000, EUR, 10280000, EUR                     10.28
4 1, TRUE, 24600000, EUR, 17220000, EUR                     24.60
5 1, TRUE, 14970000, EUR, 10480000, EUR                     14.97
6 1, TRUE, 21390000, EUR, 14970000, EUR                     21.39
  saleInfo.listPrice.currencyCode saleInfo.retailPrice.amount
1                             EUR                       12.73
2                             EUR                       20.22
3                             EUR                       10.28
4                             EUR                       17.22
5                             EUR                       10.48
6                             EUR                       14.97
  saleInfo.retailPrice.currencyCode accessInfo.country accessInfo.viewability
1                               EUR                 DE                PARTIAL
2                               EUR                 DE                PARTIAL
3                               EUR                 DE                PARTIAL
4                               EUR                 DE                PARTIAL
5                               EUR                 DE                PARTIAL
6                               EUR                 DE                PARTIAL
  accessInfo.embeddable accessInfo.publicDomain
1                  TRUE                   FALSE
2                  TRUE                   FALSE
3                  TRUE                   FALSE
4                  TRUE                   FALSE
5                  TRUE                   FALSE
6                  TRUE                   FALSE
  accessInfo.textToSpeechPermission
1                           ALLOWED
2                           ALLOWED
3                           ALLOWED
4                           ALLOWED
5                           ALLOWED
6                           ALLOWED
                                                accessInfo.webReaderLink
1 http://play.google.com/books/reader?id=V_l_CwAAQBAJ&hl=&source=gbs_api
2 http://play.google.com/books/reader?id=D5jXEAAAQBAJ&hl=&source=gbs_api
3 http://play.google.com/books/reader?id=ixUvEAAAQBAJ&hl=&source=gbs_api
4 http://play.google.com/books/reader?id=jHc5DwAAQBAJ&hl=&source=gbs_api
5 http://play.google.com/books/reader?id=jQGGDwAAQBAJ&hl=&source=gbs_api
6 http://play.google.com/books/reader?id=Iel1DwAAQBAJ&hl=&source=gbs_api
  accessInfo.accessViewStatus accessInfo.quoteSharingAllowed
1                      SAMPLE                          FALSE
2                      SAMPLE                          FALSE
3                      SAMPLE                          FALSE
4                      SAMPLE                          FALSE
5                      SAMPLE                          FALSE
6                      SAMPLE                          FALSE
  accessInfo.epub.isAvailable
1                        TRUE
2                        TRUE
3                        TRUE
4                        TRUE
5                        TRUE
6                        TRUE
                                                                                                                                                    accessInfo.epub.acsTokenLink
1                                                                                                                                                                           <NA>
2                                                                                                                                                                           <NA>
3 http://books.google.de/books/download/A_Python_Guide_for_Web_Scraping-sample-epub.acsm?id=ixUvEAAAQBAJ&format=epub&output=acs4_fulfillment_token&dl_type=sample&source=gbs_api
4                                                                                                                                                                           <NA>
5                                                                                                                                                                           <NA>
6                                                                                                                                                                           <NA>
  accessInfo.pdf.isAvailable
1                       TRUE
2                       TRUE
3                       TRUE
4                       TRUE
5                       TRUE
6                       TRUE
                                                                                                                                                   accessInfo.pdf.acsTokenLink
1                                                                                                                                                                         <NA>
2                                                                                                                                                                         <NA>
3 http://books.google.de/books/download/A_Python_Guide_for_Web_Scraping-sample-pdf.acsm?id=ixUvEAAAQBAJ&format=pdf&output=acs4_fulfillment_token&dl_type=sample&source=gbs_api
4                                                                                                                                                                         <NA>
5                                                                                                                                                                         <NA>
6                                                                                                                                                                         <NA>
                                                                                                                                                                                                                                                                            searchInfo.textSnippet
1                ... short, we cannot rely on APIs to access the online data we may want and therefore, need to learn about <b>web scraping</b> techniques. Is <b>web scraping</b> legal ? <b>Web scraping</b> is in the [ 1 ] Chapter 1: Introduction to <b>Web Scraping</b> When is web&nbsp;...
2 ... <b>web</b> documents. Examples of <b>scraping</b> using pyquery and writing data to JSON and CSV are also covered. Chapter 5, <b>Scraping</b> the <b>Web</b> with Scrapy and Beautiful Soup, provides an overview and examples of using and deploying a popular <b>web</b>-crawling&nbsp;...
3                       Explore Python Tools, <b>Web Scraping</b> Techniques, and How to Automata Data for Industrial Applications (English Edition) Pradumna Milind Panditrao. 1. Why is <b>web scraping</b> used ? 2. What are the differences between <b>web scraping</b> and crawling&nbsp;...
4                  ... <b>webscraping</b>.com/ my_example_site 8 Start page New sample SPIDER ? Show all spiders : * Katharine Q : Extracted items 0 JSON C Example <b>web scraping</b> website You will start to recognize some of the fields from. example.<b>webscraping</b>.com START&nbsp;...
5                        Implement the power of Go to scrape and crawl data from the <b>web</b> Vincent Smith. 1. Introducing. <b>Web</b>. <b>Scraping</b>. and. Go. Collecting, parsing, storing, and processing data are essential tasks that almost everybody will need to do in their&nbsp;...
6                                                Techniques and tools to crawl and scrape data from websites Olgun Aydin. JavaScript tools It is also possible to use JavaScript for web ... <b>Web Scraping</b> Chapter 1 JavaScript tools Web crawling frameworks Web crawling environment in R.

Pagination Loop

  • Use a loop to retrieve multiple pages of results (here: pages 0 to 10)
  • startIndex is the position of the first result, so with 10 results per page, page i starts at index i * 10
  • Store the results in a list
google_url <- "https://www.googleapis.com/books/v1/volumes?q=webscraping&startIndex="

results <- list() # Create an empty list to store the results

for (i in 0:10) {
  r <- GET(paste0(google_url, i * 10)) # Page i starts at result i * 10
  r_text <- content(r, "text")
  data_json <- fromJSON(r_text, flatten = TRUE)
  d <- as_tibble(data_json)
  results[[i + 1]] <- d # R lists are 1-indexed, so page 0 goes into slot 1
}

Pagination Loop

  • Combine the elements of the list into a single data frame
library(dplyr)

google_combined <- bind_rows(results)

nrow(google_combined) # 11 pages of 10 results each
[1] 110

Publications on “Web Scraping”

hist(as.Date(google_combined$items$volumeInfo.publishedDate), breaks = "years", main = "Publication Dates", xlab = "Year")

Another Example: Scraping (Fictitious) Book Prices

  • Dummy data available at books.toscrape.com
  • Each page contains 20 books
  • Loop to scrape the title and price of each book on the first 10 pages

Scraping Books

all_books <- list() # Create an empty list to store the results

for (i in 1:10) {  # Scraping the first 10 pages
  url <- paste0("http://books.toscrape.com/catalogue/page-", i, ".html")
  webpage <- read_html(url)
  
  titles <- webpage %>% html_nodes(".product_pod h3 a") %>% html_attr("title")
  prices <- webpage %>% html_nodes(".price_color") %>% html_text()
  
  all_books[[i]] <- data.frame(Title = titles,
                               Price = prices,
                               stringsAsFactors = FALSE)
}

Combining Results

all_books_data <- bind_rows(all_books)

head(all_books_data)
                                  Title  Price
1                  A Light in the Attic £51.77
2                    Tipping the Velvet £53.74
3                            Soumission £50.10
4                         Sharp Objects £47.82
5 Sapiens: A Brief History of Humankind £54.23
6                       The Requiem Red £22.65

Prepare and Analyze

  • What is the average price of the books?
  • Remove the pound sign, then convert the price to a numeric value
all_books_data$Price <- as.numeric(sub("£", "", all_books_data$Price))

mean(all_books_data$Price)
[1] 34.79625

API Rate Limiting

  • APIs may limit the number of requests you can make
  • Avoid making too many requests in a short period
  • Check the API documentation for rate limits
  • Use the Sys.sleep() function to pause between requests
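
A minimal sketch of pausing between requests, using base R only. The helper name polite_fetch() and its arguments are made up for this illustration; fetch can be any function that takes a URL, for example GET from httr.

```r
# Hypothetical helper: call fetch() on each URL and wait between calls
polite_fetch <- function(urls, fetch, delay = 1) {
  results <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    results[[i]] <- fetch(urls[i])
    if (i < length(urls)) {
      Sys.sleep(delay) # Pause (in seconds) before the next request
    }
  }
  results
}

# Offline demo: toupper() stands in for a real request function
demo <- polite_fetch(c("page-1", "page-2"), fetch = toupper, delay = 0.1)
```

With httr loaded, a call such as polite_fetch(urls, GET, delay = 1) would send at most about one request per second. The appropriate delay depends on the limits stated in the API documentation.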

API Authentication

  • Some APIs require authentication
  • Common methods:
    • API key: Include a key in the URL
    • OAuth: Use a token to authenticate requests
  • Always check the API documentation for authentication requirements
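
A rough sketch of token-based authentication with httr. OAuth access tokens are commonly sent as a "Bearer" token in the Authorization header, but the exact scheme depends on the API. The token value and the URL below are placeholders, not real credentials or endpoints.

```r
library(httr)

# Placeholder token; in practice, read it from an environment variable
token <- "example-token"

# Build the Authorization header (this does not send a request yet)
auth <- add_headers(Authorization = paste("Bearer", token))

# The header is then passed along with the request, e.g.:
# r <- GET("https://api.example.com/v1/items", auth)
```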

API Key Security

  • Keep your API key secure
  • Do not share it publicly
  • Store it in a secure location
  • Use environment variables to keep it out of your code

Environment Variables

  • Store sensitive information outside of your code
  • Use the Sys.setenv() function to set environment variables for the current R session
  • Example: The New York Times API
Sys.setenv(nyt_key = "qekEhoGTXqjsZnXpqHns0Vfa2U6T7ABf") # Usually you do not want to share this :)

Sys.getenv("nyt_key")
[1] "qekEhoGTXqjsZnXpqHns0Vfa2U6T7ABf"
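
Sys.setenv() in a script still leaves the key in the code. One common alternative is to put the key in the ~/.Renviron file, which R reads at startup, and only read it in the script. The sketch below assumes this setup; the helper get_api_key() and the variable name demo_api_key are made up for this illustration.

```r
# Hypothetical helper: read a key and stop with a clear message if it is missing
get_api_key <- function(name) {
  key <- Sys.getenv(name, unset = "")
  if (!nzchar(key)) {
    stop("Environment variable '", name, "' is not set. ",
         "Add it to ~/.Renviron and restart R.", call. = FALSE)
  }
  key
}

# Demo with a placeholder value (set here only so that the example runs)
Sys.setenv(demo_api_key = "not-a-real-key")
get_api_key("demo_api_key")
```

Failing early with a clear message avoids sending requests with an empty key, which would otherwise show up only as an authentication error from the API.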

Environment Variables

  • Use the Sys.getenv() function to access environment variables
nyt_url <- modify_url(
  url = "https://api.nytimes.com/",
  path = "svc/search/v2/articlesearch.json",
  query = list(q = "Angela Merkel", # modify_url() URL-encodes the space
               `api-key` = Sys.getenv("nyt_key")))

Scrape the NYT API

nyt_merkel <- GET(nyt_url)
r_merkel <- content(nyt_merkel, "text")
merkel_json <- fromJSON(r_merkel, flatten = TRUE)
merkel_json$response$docs$headline.main
 [1] "Angela Merkel Tells Us What She Really Thinks"                                  
 [2] "Things Are Terrible in Europe, and They’re Only Going to Get Worse"             
 [3] "Who Is Friedrich Merz, a Leading Candidate to Be Germany’s Next Chancellor?"    
 [4] "6 New Books We Recommend This Week"                                             
 [5] "She Was the Most Powerful Woman in the World. And She Isn’t Ready to Say Sorry."
 [6] "Merkel Memoir Recalls What It Was Like Dealing With Trump and Putin"            
 [7] "What the Collapse of Germany’s Ruling Coalition Means"                          
 [8] "Germany’s Coalition Collapses, Leaving the Government Teetering"                
 [9] "12 Books Coming in November"                                                    
[10] "‘If Germany Can Do It, Why Can’t We?’"                                          

Other APIs

Some API Packages for R

Scraping Eurostat with R

  • The Eurostat API can be accessed using the eurostat package
  • Here: population statistics, stored in the demo_gind table
library(eurostat)
eurostat_pop <- get_eurostat("demo_gind")

eurostat_pop <- label_eurostat(eurostat_pop)
head(eurostat_pop[eurostat_pop$geo=="Germany", ])
# A tibble: 6 × 5
  freq   indic_de                   geo     TIME_PERIOD   values
  <chr>  <chr>                      <chr>   <date>         <dbl>
1 Annual Average population - total Germany 1960-01-01  55607705
2 Annual Average population - total Germany 1961-01-01  56273735
3 Annual Average population - total Germany 1962-01-01  56918197
4 Annual Average population - total Germany 1963-01-01  57555878
5 Annual Average population - total Germany 1964-01-01  58225980
6 Annual Average population - total Germany 1965-01-01  58942021

Tutorial 09: Exercise 2.

Scraping Dynamic Websites

Interacting with Dynamic Websites

  • Some websites use JavaScript to load content dynamically
  • The RSelenium package provides functions to interact with a browser
  • Requires Java to be installed on your computer
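
A rough sketch of how RSelenium can be combined with rvest: open the page in a real browser, wait for the JavaScript to run, then parse the rendered HTML as before. It assumes that RSelenium, rvest, Java, and Firefox are installed. The function name get_rendered_html(), the fixed waiting time, and the URL are choices made for this illustration.

```r
# Hypothetical helper: return the HTML of a page after JavaScript has run
get_rendered_html <- function(url, wait = 3) {
  # Start a Selenium server and open a Firefox window
  driver <- RSelenium::rsDriver(browser = "firefox", verbose = FALSE)
  # Close the browser and stop the server when the function exits
  on.exit({
    driver$client$close()
    driver$server$stop()
  }, add = TRUE)

  driver$client$navigate(url)
  Sys.sleep(wait) # Crude pause so that scripts can finish loading content

  page_source <- driver$client$getPageSource()[[1]]
  rvest::read_html(page_source)
}

# Usage (opens a browser window, so it is commented out here):
# page <- get_rendered_html("https://www.example.com")
# rvest::html_text(rvest::html_elements(page, "h1"))
```

A fixed pause is the simplest option; for slow pages, a longer wait or repeated checks for the element of interest may be needed.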

Ethics and Legality of Web Scraping

  1. Ethical Considerations:
    • Respect website terms of service
    • Avoid disrupting servers (rate limiting)
  2. Legal Aspects:
    • Review copyright laws and data protection regulations
    • Is scraping this data allowed? Check the robots.txt file (for example: Washington Post)
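
A simplified sketch of reading a robots.txt file with base R. It only collects the Disallow rules that apply to all crawlers ("User-agent: *") and ignores Allow rules and wildcards, so it is a first check rather than a full parser. The function name disallowed_paths() and the example rules are made up for this illustration; the robotstxt package on CRAN offers a more complete implementation.

```r
# Hypothetical helper: Disallow rules for "User-agent: *" in robots.txt lines
disallowed_paths <- function(robots_lines) {
  lines <- trimws(sub("#.*$", "", robots_lines)) # Drop comments
  lines <- lines[nzchar(lines)]                  # Drop empty lines
  applies <- FALSE         # Does the current group apply to all crawlers?
  prev_was_agent <- FALSE  # Consecutive User-agent lines share one group
  paths <- character(0)
  for (line in lines) {
    field <- tolower(trimws(sub(":.*$", "", line)))
    value <- trimws(sub("^[^:]*:", "", line))
    if (field == "user-agent") {
      applies <- (prev_was_agent && applies) || value == "*"
      prev_was_agent <- TRUE
    } else {
      prev_was_agent <- FALSE
      if (field == "disallow" && applies && nzchar(value)) {
        paths <- c(paths, value)
      }
    }
  }
  paths
}

# Made-up example; a real file could be read with
# readLines("https://www.example.com/robots.txt")
example_robots <- c(
  "User-agent: *",
  "Disallow: /private/",
  "Disallow: /search",
  "",
  "User-agent: ExampleBot",
  "Disallow: /"
)
blocked <- disallowed_paths(example_robots)
```

Here blocked contains "/private/" and "/search". Note that robots.txt only states the wishes of the site owner; the terms of service and the legal aspects above still need to be checked separately.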

Tutorial 09: Exercise 3 (and 4).