In Smart Harvesting II, we asked ourselves: What kind of tool would a librarian need to extract bibliographic metadata from the Web? In another post, we also introduced the declarative web scraping language OXPath, which can help non-programmers get a web scraper up and running in less time.

In the beginning, we focused on OXPath, but we soon realized that, even though this declarative language is easier to read and write than a script in a full-blown programming language, there are still some hurdles that make OXPath a less than ideal choice for our user group. We also realized that, in the meantime, a good number of web scraping tools suitable for the layman have become available. In this post, we give an overview of the web scraping tools that are, in our view, the most promising.

Test Websites

For the evaluation of web scraping software below, we used test websites that are specifically designed to test the capabilities of web scrapers. They are functional mockups that pose challenges for web scrapers such as login, pagination, user input, and AJAX requests.

You have already learned that you can program a web scraper yourself. For interactive pages, this means imitating each interaction with the website programmatically. Additionally, you have to identify the location of your desired data in the DOM tree, for example by using web developer tools, and create a path addressing these locations.

Web scraping tools with graphical user interfaces (GUIs) are designed to encapsulate these steps and present the user only with a view of the web page as it would look (more or less) in a regular browser, enhanced with additional elements for the user to design an extraction workflow. How exactly this is done varies from software to software.
The general idea, however, is the same: the user defines an extraction workflow via point and click, which the software translates into hidden code that is then executed to perform the actual extraction.

While the general principle is the same, the tools differ in some respects. The most relevant criterion is the mode of installation: some tools are standalone applications that have to be installed on the system, while others are provided as plugins for web browsers like Chrome or Firefox. Standalone applications are often restricted to certain operating systems (this mostly depends on the programming language the software is written in). Most of the tools run only on Windows, some on Windows and macOS.
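
To make the idea of "hidden code" a little more concrete, here is a minimal sketch in Python of what a point-and-click workflow might boil down to once it has been translated. It is not taken from any of the tools discussed here: the page snippet, the `WORKFLOW` dictionary, and the `run_workflow` function are all hypothetical and serve only as illustration. The sketch assumes a small, well-formed page and uses the limited path syntax of Python's standard `xml.etree.ElementTree` module to address locations in the DOM tree.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page snippet standing in for a real website.
PAGE = """
<html>
  <body>
    <div class="record">
      <span class="title">Web Scraping Basics</span>
      <span class="author">A. Example</span>
    </div>
    <div class="record">
      <span class="title">Metadata Harvesting</span>
      <span class="author">B. Sample</span>
    </div>
  </body>
</html>
"""

# Hypothetical "workflow" of the kind a GUI tool might store after the user
# has clicked on a record and on the fields inside it: one path selecting
# each record, plus one relative path per field.
WORKFLOW = {
    "record": ".//div[@class='record']",
    "fields": {
        "title": "./span[@class='title']",
        "author": "./span[@class='author']",
    },
}


def run_workflow(page, workflow):
    """Apply the record path, then each field path, to the parsed page."""
    root = ET.fromstring(page)
    results = []
    for record in root.findall(workflow["record"]):
        row = {}
        for name, path in workflow["fields"].items():
            node = record.find(path)
            # Missing fields are stored as None instead of raising an error.
            has_text = node is not None and node.text
            row[name] = node.text.strip() if has_text else None
        results.append(row)
    return results


records = run_workflow(PAGE, WORKFLOW)
for row in records:
    print(row)
```

Real websites are rarely well-formed XML and often require interaction (login, pagination, AJAX), so an actual tool has to do considerably more than this. The sketch covers only the last step, turning selected locations into paths and paths into extracted records.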