I'm a data engineer and scientist specializing in natural language processing. On GitHub, I'm the author and maintainer of projects like Trafilatura, a popular open-source package for gathering and extracting text data, used by researchers and the AI industry.
- Extracting the main text content from web pages using Python
- A simple multilingual lemmatizer for Python
- A module to extract date information from web pages
- Web scraping with R: Text and metadata extraction
Skills | Programming languages |
---|---|