Trafilatura - Web Text and Metadata Extraction

Overview

Trafilatura is a Python library and command-line tool for gathering text and metadata from the Web through crawling, scraping, and extraction. It outputs cleaned text in several formats, including plain text, JSON, and XML, which makes it well suited to corpus building and web archiving.

  • License: GNU GPL v3 (releases before 1.8.0); Apache 2.0 from version 1.8.0 onward
  • Status: Active
  • Platform: Cross-platform
  • Tech Stack: Python

Website

Repository

Guide