From the course: Vector Databases in Practice: Deep Dive


Chunk Wikipedia articles

- [Instructor] Now let's see how chunking and text processing are done in practice. Here I've put together some scripts to process a set of articles from Wikipedia. Off the top, I'm using the same mediawikiapi library that you saw earlier, as well as a few other standard libraries. And this is a list of what I think, at least, are interesting articles, ranging from the history of computing to databases. For each of these Wikipedia articles, then, we need to download it, parse it to raw text, and chunk the body, as you just learned, to import it into the database. So what I did was set up a couple of functions to break up these tasks. The first task is to turn our text into just a list of words, so we can use the word count for chunking. That's what this word-splitting function does. It takes a string of source text as input, uses regular expressions to replace multiple white spaces with a single space, and then splits the text up based on these spaces. If you're not sure what the…
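The word-splitting step described above can be sketched like this. The function names `word_splitter` and `chunk_text` are illustrative assumptions, not necessarily the names used in the course's scripts; the regex-based whitespace handling and word-count chunking follow the approach described in the transcript.

```python
import re


def word_splitter(source_text: str) -> list[str]:
    # Illustrative sketch: collapse any run of whitespace (spaces, tabs,
    # newlines) into a single space, then split on that space to get words.
    cleaned = re.sub(r"\s+", " ", source_text).strip()
    return cleaned.split(" ")


def chunk_text(source_text: str, chunk_size: int = 150) -> list[str]:
    # Chunk the article body by word count: group every `chunk_size` words
    # back into a string ready for import into the database.
    words = word_splitter(source_text)
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


# Example: normalize messy whitespace, then chunk by word count.
words = word_splitter("history  of\ncomputing\tand databases")
chunks = chunk_text("a b c d e", chunk_size=2)
```

Splitting into words first keeps the chunking logic simple: chunk boundaries are always between words, never in the middle of one.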
