{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Sentiment Classification the old-fashioned way: \n", "## `Naive Bayes`, `Logistic Regression`, and `Ngrams`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The purpose of this notebook is to show how sentiment classification is done via the classic techniques of `Naive Bayes`, `Logistic Regression`, and `Ngrams`. We will be using `sklearn` and the `fastai` library.\n", "\n", "In a future lesson, we will revisit sentiment classification using `deep learning`, so that you can compare the two approaches." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The content here was extended from [Lesson 10 of the fast.ai Machine Learning course](https://course.fast.ai/lessonsml1/lesson10.html). A linear model comes pretty close to the state of the art here. Jeremy surpassed the state of the art using an RNN in fall 2017." ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "## 0. The fastai library" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "We will begin using [the fastai library](https://docs.fast.ai) (version 1.0) in this notebook. We will use it more once we move on to neural networks.\n", "\n", "The fastai library is built on top of PyTorch and encodes many state-of-the-art best practices. It is used in production at a number of companies. 
You can read more about it here:\n", "\n", "- [Fast.ai's software could radically democratize AI](https://www.zdnet.com/article/fast-ais-new-software-could-radically-democratize-ai/) (ZDNet)\n", "\n", "- [fastai v1 for PyTorch: Fast and accurate neural nets using modern best practices](https://www.fast.ai/2018/10/02/fastai-ai/) (fast.ai)\n", "\n", "- [fastai docs](https://docs.fast.ai/)\n", "\n", "### Installation\n", "\n", "With conda:\n", "\n", "`conda install -c pytorch -c fastai fastai=1.0`\n", "\n", "Or with pip:\n", "\n", "`pip install fastai==1.0`\n", "\n", "More [installation information here](https://github.com/fastai/fastai/blob/master/README.md).\n", "\n", "Beginning in lesson 4, we will be using GPUs, so if you want, you could switch to a [cloud option](https://course.fast.ai/#using-a-gpu) now to set up fastai." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. The IMDB dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The [large movie review dataset](http://ai.stanford.edu/~amaas/data/sentiment/) contains a collection of 50,000 reviews from IMDB. We will use the version hosted as part of the [fast.ai datasets](https://course.fast.ai/datasets.html) on AWS Open Datasets. \n", "\n", "The dataset contains an equal number of positive and negative reviews. The authors considered only highly polarized reviews. A negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10. Neutral reviews are not included in the dataset. The dataset is divided into training and test sets, each containing 25,000 labeled reviews.\n", "\n", "The **sentiment classification task** consists of predicting the polarity (positive or negative) of a given text." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Imports" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "scrolled": true }, "outputs": [], "source": [ "%reload_ext autoreload\n", "%autoreload 2\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from fastai import *\n", "from fastai.text import *\n", "from fastai.utils.mem import GPUMemTrace #call with mtrace" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import sklearn.feature_extraction.text as sklearn_text\n", "import pickle " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Preview the sample IMDb data set" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "fast.ai has a number of [datasets hosted via AWS Open Datasets](https://course.fast.ai/datasets.html) for easy download. We can see them by checking the docs for `URLs` (remember `??` is a helpful command):" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "?? URLs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is always good to start working on a sample of your data before you use the full dataset; this allows for quicker computations as you debug and get your code working. For IMDB, there is a sample dataset already available:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb_sample')" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path = untar_data(URLs.IMDB_SAMPLE)\n", "path" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Read the data set into a pandas dataframe, which we can inspect to get a sense of what our data looks like. We see that the three columns contain the review label, the review text, and the `is_valid` flag, respectively. 
`is_valid` is a boolean flag indicating whether the row is from the validation set or not." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>label</th>\n", "      <th>text</th>\n", "      <th>is_valid</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n",
 "    <tr>\n", "      <th>0</th>\n", "      <td>negative</td>\n", "      <td>Un-bleeping-believable! Meg Ryan doesn't even ...</td>\n", "      <td>False</td>\n", "    </tr>\n",
 "    <tr>\n", "      <th>1</th>\n", "      <td>positive</td>\n", "      <td>This is a extremely well-made film. The acting...</td>\n", "      <td>False</td>\n", "    </tr>\n",
 "    <tr>\n", "      <th>2</th>\n", "      <td>negative</td>\n", "      <td>Every once in a long while a movie will come a...</td>\n", "      <td>False</td>\n", "    </tr>\n",
 "    <tr>\n", "      <th>3</th>\n", "      <td>positive</td>\n", "      <td>Name just says it all. I watched this movie wi...</td>\n", "      <td>False</td>\n", "    </tr>\n",
 "    <tr>\n", "      <th>4</th>\n", "      <td>negative</td>\n", "      <td>This movie succeeds at being one of the most u...</td>\n", "      <td>False</td>\n", "    </tr>\n",
 "  </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " label text is_valid\n", "0 negative Un-bleeping-believable! Meg Ryan doesn't even ... False\n", "1 positive This is a extremely well-made film. The acting... False\n", "2 negative Every once in a long while a movie will come a... False\n", "3 positive Name just says it all. I watched this movie wi... False\n", "4 negative This movie succeeds at being one of the most u... False" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv(path/'texts.csv')\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Extract the movie reviews from the sample IMDb data set.\n", "#### We will be using [TextList](https://docs.fast.ai/text.data.html#TextList) from the fastai library:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "failure count is 1\n", "\n", "Wall time: 28.2 s\n" ] } ], "source": [ "%%time\n", "# Sometimes throws a `BrokenProcessPool` error. Keep trying until it works!\n", "\n", "count = 0\n", "error = True\n", "while error:\n", "    try:\n", "        # Preprocessing steps\n", "        movie_reviews = (TextList.from_csv(path, 'texts.csv', cols='text')\n", "                         .split_from_df(col=2)\n", "                         .label_from_df(cols=0))\n", "        error = False\n", "        print(f'failure count is {count}\\n')\n", "    except Exception:  # catch ordinary errors but still allow Ctrl-C to interrupt\n", "        # accumulate failure count\n", "        count = count + 1\n", "        print(f'failure count is {count}')\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exploring IMDb review data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A good first step for any data problem is to explore the data and get a sense of what it looks like. In this case we are looking at movie reviews, which have been labeled as \"positive\" or \"negative\". The reviews have already been `tokenized`, i.e. 
split into `tokens`, basic units such as words, prefixes, punctuation, capitalization, and other features of the text." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LabelLists;\n", "\n", "Train: LabelList (800 items)\n", "x: TextList\n", "xxbos xxmaj un - xxunk - believable ! xxmaj meg xxmaj ryan does n't even look her usual xxunk lovable self in this , which normally makes me forgive her shallow xxunk acting xxunk . xxmaj hard to believe she was the producer on this dog . xxmaj plus xxmaj kevin xxmaj kline : what kind of suicide trip has his career been on ? xxmaj xxunk ... xxmaj xxunk ! ! ! xxmaj finally this was directed by the guy who did xxmaj big xxmaj xxunk ? xxmaj must be a replay of xxmaj jonestown - hollywood style . xxmaj xxunk !,xxbos xxmaj this is a extremely well - made film . xxmaj the acting , script and camera - work are all first - rate . xxmaj the music is good , too , though it is mostly early in the film , when things are still relatively xxunk . xxmaj there are no really xxunk in the cast , though several faces will be familiar . xxmaj the entire cast does an excellent job with the script . \n", " \n", " xxmaj but it is hard to watch , because there is no good end to a situation like the one presented . xxmaj it is now xxunk to blame the xxmaj british for setting xxmaj hindus and xxmaj muslims against each other , and then xxunk xxunk them into two countries . xxmaj there is some merit in this view , but it 's also true that no one forced xxmaj hindus and xxmaj muslims in the region to xxunk each other as they did around the time of partition . xxmaj it seems more likely that the xxmaj british simply saw the xxunk between the xxunk and were clever enough to exploit them to their own ends . \n", " \n", " xxmaj the result is that there is much cruelty and inhumanity in the situation and this is very unpleasant to remember and to see on the screen . 
xxmaj but it is never painted as a black - and - white case . xxmaj there is xxunk and xxunk on both sides , and also the hope for change in the younger generation . \n", " \n", " xxmaj there is redemption of a sort , in the end , when xxmaj xxunk has to make a hard choice between a man who has ruined her life , but also truly loved her , and her family which has xxunk her , then later come looking for her . xxmaj but by that point , she has no xxunk that is without great pain for her . \n", " \n", " xxmaj this film carries the message that both xxmaj muslims and xxmaj hindus have their grave xxunk , and also that both can be xxunk and caring people . xxmaj the reality of partition makes that xxunk all the more wrenching , since there can never be real xxunk across the xxmaj india / xxmaj pakistan border . xxmaj in that sense , it is similar to \" xxmaj mr & xxmaj xxunk xxmaj xxunk \" . \n", " \n", " xxmaj in the end , we were glad to have seen the film , even though the resolution was xxunk . xxmaj if the xxup uk and xxup us could deal with their own xxunk of racism with this kind of xxunk , they would certainly be better off .,xxbos xxmaj every once in a long while a movie will come along that will be so awful that i feel compelled to warn people . xxmaj if i labor all my days and i can save but one soul from watching this movie , how great will be my joy . \n", " \n", " xxmaj where to begin my discussion of pain . xxmaj for xxunk , there was a musical xxunk every five minutes . xxmaj there was no character development . xxmaj every character was a stereotype . xxmaj we had xxunk guy , fat guy who eats donuts , goofy foreign guy , etc . xxmaj the script felt as if it were being written as the movie was being shot . xxmaj the production value was so incredibly low that it felt like i was watching a junior high video presentation . xxmaj have the directors , producers , etc . ever even seen a movie before ? 
xxmaj xxunk is getting worse and worse with every new entry . xxmaj the concept for this movie sounded so funny . xxmaj how could you go wrong with xxmaj gary xxmaj coleman and a handful of somewhat legitimate actors . xxmaj but trust me when i say this , things went wrong , xxup very xxup wrong .,xxbos xxmaj name just says it all . i watched this movie with my dad when it came out and having served in xxmaj xxunk he had great admiration for the man . xxmaj the disappointing thing about this film is that it only xxunk on a short period of the man 's life - interestingly enough the man 's entire life would have made such an epic bio - xxunk that it is staggering to imagine the cost for production . \n", " \n", " xxmaj some posters xxunk to the flawed xxunk about the man , which are cheap shots . xxmaj the theme of the movie \" xxmaj duty , xxmaj honor , xxmaj country \" are not just mere words xxunk from the lips of a high - xxunk officer - it is the deep xxunk of one man 's total devotion to his country . \n", " \n", " xxmaj ironically xxmaj xxunk being the liberal that he was xxunk a better understanding of the man . xxmaj he does a great job showing the xxunk general xxunk with the xxunk side of the man .,xxbos xxmaj this movie succeeds at being one of the most unique movies you 've seen . xxmaj however this comes from the fact that you ca n't make heads or xxunk of this mess . xxmaj it almost seems as a series of challenges set up to determine whether or not you are willing to walk out of the movie and give up the money you just paid . xxmaj if you do n't want to feel xxunk you 'll sit through this horrible film and develop a real sense of pity for the actors involved , they 've all seen better days , but then you realize they actually got paid quite a bit of money to do this and you 'll lose pity for them just like you 've already done for the film . 
i ca n't go on enough about this horrible movie , its almost something that xxmaj ed xxmaj wood would have made and in that case it surely would have been his masterpiece . \n", " \n", " xxmaj to start you are forced to sit through an opening dialogue the likes of which you 've never seen / heard , this thing has got to be five minutes long . xxmaj on top of that it is narrated , as to suggest that you the viewer can not read . xxmaj then we meet xxmaj mr. xxmaj xxunk and the xxunk of terrible lines gets xxunk , it is as if he is xxunk solely to get lines on to the movie poster xxunk line . xxmaj soon we meet xxmaj stephen xxmaj xxunk , who i typically enjoy ) and he does his best not to drown in this but ultimately he does . xxmaj then comes the ultimate insult , xxmaj tara xxmaj xxunk playing an intelligent role , oh help us ! xxmaj tara xxmaj xxunk is not a very talented actress and somehow she xxunk gets roles in movies , in my opinion though she should stick to movies of the xxmaj american pie type . \n", " \n", " xxmaj all in all you just may want to see this for yourself when it comes out on video , i know that i got a kick out of it , i mean lets all be honest here , sometimes its comforting to xxunk in the shortcomings of others .\n", "y: CategoryList\n", "negative,positive,negative,positive,negative\n", "Path: C:\\Users\\cross-entropy\\.fastai\\data\\imdb_sample;\n", "\n", "Valid: LabelList (200 items)\n", "x: TextList\n", "xxbos xxmaj this very funny xxmaj british comedy shows what might happen if a section of xxmaj london , in this case xxmaj xxunk , were to xxunk itself independent from the rest of the xxup uk and its laws , xxunk & post - war xxunk . xxmaj merry xxunk is what would happen . \n", " \n", " xxmaj the explosion of a wartime bomb leads to the xxunk of ancient xxunk which show that xxmaj xxunk was xxunk to the xxmaj xxunk of xxmaj xxunk xxunk ago , a small historical xxunk long since forgotten . 
xxmaj to the new xxmaj xxunk , however , this is an unexpected opportunity to live as they please , free from any xxunk from xxmaj xxunk . \n", " \n", " xxmaj stanley xxmaj xxunk is excellent as the minor city xxunk who suddenly finds himself leading one of the world 's xxunk xxunk . xxmaj xxunk xxmaj margaret xxmaj xxunk is a delight as the history professor who sides with xxmaj xxunk . xxmaj others in the stand - out cast include xxmaj xxunk xxmaj xxunk , xxmaj paul xxmaj xxunk , xxmaj xxunk xxmaj xxunk , xxmaj xxunk xxmaj xxunk & xxmaj sir xxmaj michael xxmaj xxunk . \n", " \n", " xxmaj welcome to xxmaj xxunk !,xxbos i saw this movie once as a kid on the late - late show and fell in love with it . \n", " \n", " xxmaj it took 30 + years , but i recently did find it on xxup dvd - it was n't cheap , either - in a xxunk that xxunk in war movies . xxmaj we watched it last night for the first time . xxmaj the audio was good , however it was grainy and had the trailers between xxunk . xxmaj even so , it was better than i remembered it . i was also impressed at how true it was to the play . \n", " \n", " xxmaj the xxunk is around here xxunk . xxmaj if you 're xxunk in finding it , fire me a xxunk and i 'll see if i can get you the xxunk . xxunk,xxbos xxmaj this is , in my opinion , a very good film , especially for xxmaj michael xxmaj jackson lovers . xxmaj it contains a message on drugs , stunning special effects , and an awesome music video . \n", " \n", " xxmaj the main film is xxunk around the song and music video ' xxmaj smooth xxmaj criminal . ' xxmaj unlike the four - minute music video , it is normal speed and , in my opinion , much xxunk to watch . \n", " \n", " xxmaj the plot is rather weird , however . xxmaj michael xxmaj jackson plays a xxunk ' gangster ' that , when he sees a shooting star , he xxunk into a piece of xxunk . xxmaj throughout the film , he xxunk into a race car , a giant robot , and a space ship . 
\n", " \n", " xxmaj the robot scene in particular is a bit drawn out and strange . i found it a little out - of - whack compared to the rest of the film . \n", " \n", " a child is kidnapped , xxmaj michael tries to save her , is tortured and beaten , and suddenly turns into a giant robot that blows up all the bad guys . a little weird ? xxmaj yeah . \n", " \n", " xxmaj but besides the bizarre robot scene , it 's a very good movie , and any xxmaj michael xxmaj jackson fan will enjoy both the xxmaj smooth xxmaj criminal music video and the movie .,xxbos xxmaj in xxmaj iran , women are not xxunk to attend men 's sporting events , apparently to \" xxunk \" them from all the xxunk and foul language they might hear xxunk from the male fans ( so since men ca n't xxunk or xxunk themselves , women are forced to suffer . xxmaj go figure . ) . \" xxmaj xxunk \" tells the tale of a half dozen or so young women who , dressed like men , attempt to xxunk into the high - xxunk match between xxmaj iran and xxmaj xxunk that , in xxunk , qualified xxmaj iran to go to the xxmaj world xxmaj cup ( the movie was actually filmed in large part during that game ) . \n", " \n", " \" xxmaj xxunk \" is a xxunk - of - life comedy that will remind you of all those great xxunk films ( \" xxmaj the xxmaj shop on xxmaj main xxmaj street , \" \" xxmaj loves of a xxmaj blonde , \" \" xxmaj closely xxmaj watched xxmaj trains \" etc . ) that xxunk out of xxmaj communist xxmaj xxunk as part of the \" xxmaj xxunk xxmaj xxunk \" in the mid xxunk 's . xxmaj as with many of those works , \" xxmaj xxunk \" is more concerned with xxunk life than with xxunk any kind of xxunk contrived fictional narrative . xxmaj indeed , it is the simplicity of the xxunk and the xxunk of the style that make the movie so effective . \n", " \n", " xxmaj once their xxunk is discovered , the girls are xxunk into a small xxunk right outside the xxunk where they can hear the xxunk xxunk xxunk from the game inside . 
xxmaj stuck where they are , all they can do is xxunk with the security guards to let them go in , guards who are basically xxunk , good - xxunk xxunk who are compelled to do their duty as a part of their xxunk military service . xxmaj even most of the men going into the xxunk do n't seem particularly xxunk at the thought of these women being allowed in . xxmaj still the prohibition xxunk . xxmaj yet , how can one not be impressed by the very real courage and xxunk displayed by these women as they go up against a system that continues to xxunk such a xxunk xxunk and xxunk xxunk ? xxmaj and , yet , the purpose of these women is not to xxunk behind a cause or to make a \" point . \" xxmaj they are simply obsessed fans with a burning desire to watch a soccer game and , like all the men in the country , xxunk on their team . \n", " \n", " xxmaj it 's hard to tell just how much of the dialogue is scripted and how much of it is xxunk , but , in either case , the actors , with their xxunk xxunk faces , do a magnificent job making each moment seem utterly real and convincing . xxmaj xxunk xxmaj xxunk - xxunk and xxmaj xxunk xxmaj xxunk are notable xxunk in a xxunk excellent cast . xxmaj the structure of the film is also very loose and xxunk , as writer / director xxmaj xxunk xxmaj xxunk and co - writer xxmaj xxunk xxmaj xxunk focus for a few brief moments on one or two of the characters , then move xxunk and xxunk onto others . xxmaj with this documentary - type approach , we come to feel as if we are xxunk an actual event xxunk in \" real time . \" xxmaj very often , it 's quite easy for us to forget we 're actually watching a movie . \n", " \n", " xxmaj it was a very smart move on the part of the filmmakers to include so much good - xxunk humor in the film ( it 's what the xxmaj xxunk filmmakers did as well ) , the better to point up the utter absurdity of the situation and xxunk the appeal of the film for audiences both domestic and foreign . 
\" xxmaj xxunk \" is obviously a cry for justice , but it is one that is made all the more effective by its xxunk to make of its story a heavy - breathing tragedy . xxmaj instead , it realizes that nothing breaks down social xxunk quite as xxunk as humor and an appeal to the audience 's common humanity . xxmaj and is n't that what true art is supposed to be all about ? xxmaj in its own quiet , xxunk way , \" xxmaj xxunk \" is one of the great , under - appreciated xxunk of xxunk .,xxbos \" xxmaj in xxmaj xxunk xxunk , the xxmaj university of xxmaj xxunk xxunk to xxunk xxmaj xxunk xxmaj national xxmaj xxunk , with an xxunk of xxmaj xxunk xxunk offering to xxunk the research . xxmaj xxunk xxunk became the first \" national \" xxunk . xxmaj it did not , however , remain at its original location in the xxmaj xxunk forest . xxmaj in xxunk , it moved xxunk west from the \" xxmaj xxunk xxmaj city \" to a new site on xxmaj xxunk xxunk . xxmaj when xxmaj xxunk xxmaj xxunk visited xxmaj xxunk 's director , xxmaj walter xxmaj xxunk , in xxunk , he asked him what kind of xxunk was to be built at the new site . xxmaj when xxmaj xxunk described a heavy - water xxunk xxunk at one - xxunk the power of the xxmaj xxunk xxmaj xxunk xxmaj xxunk under design at xxmaj xxunk xxmaj xxunk , xxmaj xxunk xxunk it would be xxunk if xxmaj xxunk took the xxmaj xxunk xxmaj xxunk design and xxunk the xxmaj xxunk xxmaj xxunk xxmaj xxunk at one - xxunk capacity . xxmaj the joke proved unintentionally xxunk . \" \n", " \n", " xxmaj the xxup xxunk plant used xxunk to separate the xxunk in thousands of tall xxunk . xxmaj it was built next to the xxup xxunk power plant , which provided the necessary steam . xxmaj much less xxunk than xxup xxunk , the xxup xxunk plant was torn down after the war . 
\n", " \n", " xxmaj concerned that the xxmaj xxunk xxmaj energy xxmaj xxunk research program might become too xxunk , xxmaj xxunk xxunk a xxunk of industrial xxunk , and during a xxmaj xxunk visit to xxmaj xxunk xxmaj xxunk , he xxunk with xxmaj clark xxmaj center , manager of xxmaj xxunk & xxmaj xxunk , a xxunk of xxmaj union xxmaj xxunk xxmaj corporation at xxmaj xxunk xxmaj xxunk , the possibility of the company xxunk xxunk of the xxmaj xxunk . \n", " \n", " xxmaj prince xxmaj henry ( of xxmaj xxunk ) xxmaj xxunk in xxmaj washington and xxmaj visiting the xxmaj german xxmaj xxunk ( xxunk ) . xxmaj xxunk , with xxmaj prince xxmaj henry of xxmaj xxunk according to the xxunk of science and its xxunk their were already concerns with the xxunk of new science with military xxunk . xxmaj the xxmaj xxunk ( xxunk / xxup ii ) , \" xxmaj xxunk xxmaj xxunk 's splendid xxunk at the xxunk xxmaj st. xxmaj xxunk , xxmaj new xxmaj york . xxmaj taken at the exact moment of xxmaj prince xxmaj henry 's xxunk , and the raising of the xxunk standard . \" xxmaj if xxmaj xxunk knew of these necessary xxunk to xxunk xxunk then what was the xxunk of the xxunk xxup xxunk and xxup wwii . xxmaj the quality of xxunk control i xxunk ? \n", " \n", " xxmaj thus , did the xxunk of xxmaj xxunk xxmaj xxunk xxunk for a military mission , or a business plan , based on the security xxunk of xxmaj xxunk xxunk ? xxmaj because supposedly their were no survivors , and the ones who were caught in xxmaj europe ordered to be executed . xxmaj of the xxunk man commando team the survivors who were captured were executed under orders of the xxmaj german xxmaj army against xxunk , and xxunk acts of the xxmaj state of xxmaj germany . \n", " \n", " xxmaj the xxmaj xxunk xxmaj no . xxunk / xxunk xxunk xxmaj xxunk . xxup xxunk / xxunk , xxmaj xxunk xxup xxunk , 18 xxmaj xxunk xxunk , ( xxunk ) xxmaj xxunk xxmaj hitler ; xxmaj translation of xxmaj document no . xxup xxunk , xxmaj office of xxup u.s. 
xxmaj chief of xxmaj xxunk , xxunk true copy xxmaj xxunk xxmaj major , xxunk xxup xxunk xxunk xxmaj march xxunk , xxunk , xxunk at the xxup u.s. xxmaj national xxmaj xxunk . \n", " \n", " xxmaj the xxup xxunk xxmaj society xxunk xxunk xxmaj xxunk xxmaj xxunk . , xxunk xxunk , xxup xxunk xxunk\n", "y: CategoryList\n", "positive,positive,positive,positive,positive\n", "Path: C:\\Users\\cross-entropy\\.fastai\\data\\imdb_sample;\n", "\n", "Test: None" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "movie_reviews" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Let's examine the `movie_reviews` object:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['__class__',\n", " '__delattr__',\n", " '__dict__',\n", " '__dir__',\n", " '__doc__',\n", " '__eq__',\n", " '__format__',\n", " '__ge__',\n", " '__getattr__',\n", " '__getattribute__',\n", " '__gt__',\n", " '__hash__',\n", " '__init__',\n", " '__init_subclass__',\n", " '__le__',\n", " '__lt__',\n", " '__module__',\n", " '__ne__',\n", " '__new__',\n", " '__reduce__',\n", " '__reduce_ex__',\n", " '__repr__',\n", " '__setattr__',\n", " '__setstate__',\n", " '__sizeof__',\n", " '__str__',\n", " '__subclasshook__',\n", " '__weakref__',\n", " 'add_test',\n", " 'add_test_folder',\n", " 'databunch',\n", " 'filter_by_func',\n", " 'get_processors',\n", " 'label_const',\n", " 'label_empty',\n", " 'label_from_df',\n", " 'label_from_folder',\n", " 'label_from_func',\n", " 'label_from_list',\n", " 'label_from_lists',\n", " 'label_from_re',\n", " 'lists',\n", " 'load_empty',\n", " 'load_state',\n", " 'path',\n", " 'process',\n", " 'test',\n", " 'train',\n", " 'transform',\n", " 'transform_y',\n", " 'valid']" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dir(movie_reviews)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `movie_reviews` splits the data 
into training and validation sets, `.train` and `.valid` " ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "There are 800 and 200 reviews in the training and validation sets, respectively.\n" ] } ], "source": [ "print(f'There are {len(movie_reviews.train.x)} and {len(movie_reviews.valid.x)} reviews in the training and validation sets, respectively.')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Reviews are composed of lists of tokens. In NLP, a **token** is the basic unit of processing (what the tokens are depends on the application and your choices). Here, the tokens mostly correspond to words or punctuation, as well as several special tokens, corresponding to unknown words, capitalization, etc." ] }, { "attachments": { "image.png": { "image/png": "" } }, "cell_type": "markdown", "metadata": {}, "source": [ "### Special tokens:\n", "All tokens starting with \"xx\" are fastai special tokens. You can see the full list and their meanings [in the fastai docs](https://docs.fast.ai/text.transform.html): \n", "\n", "![image.png](attachment:image.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Let's examine the structure of the `training set`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### movie_reviews.train is a `LabelList` object. \n", "#### movie_reviews.train.x is a `TextList` object that holds the reviews\n", "#### movie_reviews.train.y is a `CategoryList` object that holds the labels " ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\f", "There are 800 movie reviews in the training set\n", "\n", "LabelList (800 items)\n", "x: TextList\n", "xxbos xxmaj un - xxunk - believable ! 
xxmaj meg xxmaj ryan does n't even look her usual xxunk lovable self in this , which normally makes me forgive her shallow xxunk acting xxunk . xxmaj hard to believe she was the producer on this dog . xxmaj plus xxmaj kevin xxmaj kline : what kind of suicide trip has his career been on ? xxmaj xxunk ... xxmaj xxunk ! ! ! xxmaj finally this was directed by the guy who did xxmaj big xxmaj xxunk ? xxmaj must be a replay of xxmaj jonestown - hollywood style . xxmaj xxunk !,xxbos xxmaj this is a extremely well - made film . xxmaj the acting , script and camera - work are all first - rate . xxmaj the music is good , too , though it is mostly early in the film , when things are still relatively xxunk . xxmaj there are no really xxunk in the cast , though several faces will be familiar . xxmaj the entire cast does an excellent job with the script . \n", " \n", " xxmaj but it is hard to watch , because there is no good end to a situation like the one presented . xxmaj it is now xxunk to blame the xxmaj british for setting xxmaj hindus and xxmaj muslims against each other , and then xxunk xxunk them into two countries . xxmaj there is some merit in this view , but it 's also true that no one forced xxmaj hindus and xxmaj muslims in the region to xxunk each other as they did around the time of partition . xxmaj it seems more likely that the xxmaj british simply saw the xxunk between the xxunk and were clever enough to exploit them to their own ends . \n", " \n", " xxmaj the result is that there is much cruelty and inhumanity in the situation and this is very unpleasant to remember and to see on the screen . xxmaj but it is never painted as a black - and - white case . xxmaj there is xxunk and xxunk on both sides , and also the hope for change in the younger generation . 
\n", " \n", " xxmaj there is redemption of a sort , in the end , when xxmaj xxunk has to make a hard choice between a man who has ruined her life , but also truly loved her , and her family which has xxunk her , then later come looking for her . xxmaj but by that point , she has no xxunk that is without great pain for her . \n", " \n", " xxmaj this film carries the message that both xxmaj muslims and xxmaj hindus have their grave xxunk , and also that both can be xxunk and caring people . xxmaj the reality of partition makes that xxunk all the more wrenching , since there can never be real xxunk across the xxmaj india / xxmaj pakistan border . xxmaj in that sense , it is similar to \" xxmaj mr & xxmaj xxunk xxmaj xxunk \" . \n", " \n", " xxmaj in the end , we were glad to have seen the film , even though the resolution was xxunk . xxmaj if the xxup uk and xxup us could deal with their own xxunk of racism with this kind of xxunk , they would certainly be better off .,xxbos xxmaj every once in a long while a movie will come along that will be so awful that i feel compelled to warn people . xxmaj if i labor all my days and i can save but one soul from watching this movie , how great will be my joy . \n", " \n", " xxmaj where to begin my discussion of pain . xxmaj for xxunk , there was a musical xxunk every five minutes . xxmaj there was no character development . xxmaj every character was a stereotype . xxmaj we had xxunk guy , fat guy who eats donuts , goofy foreign guy , etc . xxmaj the script felt as if it were being written as the movie was being shot . xxmaj the production value was so incredibly low that it felt like i was watching a junior high video presentation . xxmaj have the directors , producers , etc . ever even seen a movie before ? xxmaj xxunk is getting worse and worse with every new entry . xxmaj the concept for this movie sounded so funny . xxmaj how could you go wrong with xxmaj gary xxmaj coleman and a handful of somewhat legitimate actors . 
xxmaj but trust me when i say this , things went wrong , xxup very xxup wrong .,xxbos xxmaj name just says it all . i watched this movie with my dad when it came out and having served in xxmaj xxunk he had great admiration for the man . xxmaj the disappointing thing about this film is that it only xxunk on a short period of the man 's life - interestingly enough the man 's entire life would have made such an epic bio - xxunk that it is staggering to imagine the cost for production . \n", " \n", " xxmaj some posters xxunk to the flawed xxunk about the man , which are cheap shots . xxmaj the theme of the movie \" xxmaj duty , xxmaj honor , xxmaj country \" are not just mere words xxunk from the lips of a high - xxunk officer - it is the deep xxunk of one man 's total devotion to his country . \n", " \n", " xxmaj ironically xxmaj xxunk being the liberal that he was xxunk a better understanding of the man . xxmaj he does a great job showing the xxunk general xxunk with the xxunk side of the man .,xxbos xxmaj this movie succeeds at being one of the most unique movies you 've seen . xxmaj however this comes from the fact that you ca n't make heads or xxunk of this mess . xxmaj it almost seems as a series of challenges set up to determine whether or not you are willing to walk out of the movie and give up the money you just paid . xxmaj if you do n't want to feel xxunk you 'll sit through this horrible film and develop a real sense of pity for the actors involved , they 've all seen better days , but then you realize they actually got paid quite a bit of money to do this and you 'll lose pity for them just like you 've already done for the film . i ca n't go on enough about this horrible movie , its almost something that xxmaj ed xxmaj wood would have made and in that case it surely would have been his masterpiece . 
\n", " \n", " xxmaj to start you are forced to sit through an opening dialogue the likes of which you 've never seen / heard , this thing has got to be five minutes long . xxmaj on top of that it is narrated , as to suggest that you the viewer can not read . xxmaj then we meet xxmaj mr. xxmaj xxunk and the xxunk of terrible lines gets xxunk , it is as if he is xxunk solely to get lines on to the movie poster xxunk line . xxmaj soon we meet xxmaj stephen xxmaj xxunk , who i typically enjoy ) and he does his best not to drown in this but ultimately he does . xxmaj then comes the ultimate insult , xxmaj tara xxmaj xxunk playing an intelligent role , oh help us ! xxmaj tara xxmaj xxunk is not a very talented actress and somehow she xxunk gets roles in movies , in my opinion though she should stick to movies of the xxmaj american pie type . \n", " \n", " xxmaj all in all you just may want to see this for yourself when it comes out on video , i know that i got a kick out of it , i mean lets all be honest here , sometimes its comforting to xxunk in the shortcomings of others .\n", "y: CategoryList\n", "negative,positive,negative,positive,negative\n", "Path: C:\\Users\\cross-entropy\\.fastai\\data\\imdb_sample\n" ] } ], "source": [ "print(f'\\fThere are {len(movie_reviews.train.x)} movie reviews in the training set\\n')\n", "print(movie_reviews.train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The text of the movie review is stored as a character `string`, which contains the tokens separated by spaces. Here is the text of the first review:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "xxbos xxmaj un - xxunk - believable ! xxmaj meg xxmaj ryan does n't even look her usual xxunk lovable self in this , which normally makes me forgive her shallow xxunk acting xxunk . xxmaj hard to believe she was the producer on this dog . 
xxmaj plus xxmaj kevin xxmaj kline : what kind of suicide trip has his career been on ? xxmaj xxunk ... xxmaj xxunk ! ! ! xxmaj finally this was directed by the guy who did xxmaj big xxmaj xxunk ? xxmaj must be a replay of xxmaj jonestown - hollywood style . xxmaj xxunk !\n", "\n", "There are 511 characters in the review\n" ] } ], "source": [ "print(movie_reviews.train.x[0].text)\n", "print(f'\\nThere are {len(movie_reviews.train.x[0].text)} characters in the review')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The text string can be split to get the list of tokens." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['xxbos', 'xxmaj', 'un', '-', 'xxunk', '-', 'believable', '!', 'xxmaj', 'meg', 'xxmaj', 'ryan', 'does', \"n't\", 'even', 'look', 'her', 'usual', 'xxunk', 'lovable', 'self', 'in', 'this', ',', 'which', 'normally', 'makes', 'me', 'forgive', 'her', 'shallow', 'xxunk', 'acting', 'xxunk', '.', 'xxmaj', 'hard', 'to', 'believe', 'she', 'was', 'the', 'producer', 'on', 'this', 'dog', '.', 'xxmaj', 'plus', 'xxmaj', 'kevin', 'xxmaj', 'kline', ':', 'what', 'kind', 'of', 'suicide', 'trip', 'has', 'his', 'career', 'been', 'on', '?', 'xxmaj', 'xxunk', '...', 'xxmaj', 'xxunk', '!', '!', '!', 'xxmaj', 'finally', 'this', 'was', 'directed', 'by', 'the', 'guy', 'who', 'did', 'xxmaj', 'big', 'xxmaj', 'xxunk', '?', 'xxmaj', 'must', 'be', 'a', 'replay', 'of', 'xxmaj', 'jonestown', '-', 'hollywood', 'style', '.', 'xxmaj', 'xxunk', '!']\n", "\n", "The review has 103 tokens\n" ] } ], "source": [ "print(movie_reviews.train.x[0].text.split())\n", "print(f'\\nThe review has {len(movie_reviews.train.x[0].text.split())} tokens')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The review tokens are `numericalized`, i.e., mapped to integers. 
So a movie review is also stored as an array of integers:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[ 2 5 4622 25 ... 10 5 0 52]\n", "\n", "The array contains 103 numericalized tokens\n" ] } ], "source": [ "print(movie_reviews.train.x[0].data)\n", "print(f'\\nThe array contains {len(movie_reviews.train.x[0].data)} numericalized tokens')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. The IMDb Vocabulary" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The `movie_reviews` object also contains a `.vocab` property, even though it is not shown with `dir()`. (This may be an error in the `fastai` library.) " ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "movie_reviews.vocab" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The `vocab` object is a kind of reversible dictionary that translates back and forth between tokens and their integer representations. 
It has two attributes of particular interest: `stoi` and `itos`, which stand for `string-to-index` and `index-to-string`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### `movie_reviews.vocab.stoi` maps vocabulary tokens to their `indexes` in vocab" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "defaultdict(int,\n", " {'xxunk': 0,\n", " 'xxpad': 1,\n", " 'xxbos': 2,\n", " 'xxeos': 3,\n", " 'xxfld': 4,\n", " 'xxmaj': 5,\n", " 'xxup': 6,\n", " 'xxrep': 7,\n", " 'xxwrep': 8,\n", " 'the': 9,\n", " '.': 10,\n", " ',': 11,\n", " 'and': 12,\n", " 'a': 13,\n", " 'of': 14,\n", " 'to': 15,\n", " 'is': 16,\n", " 'it': 17,\n", " 'in': 18,\n", " 'i': 19,\n", " 'that': 20,\n", " 'this': 21,\n", " '\"': 22,\n", " \"'s\": 23,\n", " '\\n \\n ': 24,\n", " '-': 25,\n", " 'was': 26,\n", " 'as': 27,\n", " 'for': 28,\n", " 'movie': 29,\n", " 'with': 30,\n", " 'but': 31,\n", " 'film': 32,\n", " 'you': 33,\n", " ')': 34,\n", " 'on': 35,\n", " '(': 36,\n", " \"n't\": 37,\n", " 'are': 38,\n", " 'he': 39,\n", " 'his': 40,\n", " 'not': 41,\n", " 'have': 42,\n", " 'be': 43,\n", " 'one': 44,\n", " 'they': 45,\n", " 'all': 46,\n", " 'at': 47,\n", " 'by': 48,\n", " 'an': 49,\n", " 'from': 50,\n", " 'like': 51,\n", " '!': 52,\n", " 'so': 53,\n", " 'who': 54,\n", " 'there': 55,\n", " 'about': 56,\n", " 'just': 57,\n", " 'out': 58,\n", " 'if': 59,\n", " 'or': 60,\n", " 'do': 61,\n", " 'what': 62,\n", " 'her': 63,\n", " 'has': 64,\n", " \"'\": 65,\n", " 'some': 66,\n", " 'more': 67,\n", " 'good': 68,\n", " 'when': 69,\n", " 'up': 70,\n", " 'very': 71,\n", " '?': 72,\n", " 'she': 73,\n", " 'would': 74,\n", " 'no': 75,\n", " 'really': 76,\n", " 'were': 77,\n", " 'their': 78,\n", " 'my': 79,\n", " 'had': 80,\n", " 'time': 81,\n", " 'can': 82,\n", " 'only': 83,\n", " 'which': 84,\n", " 'even': 85,\n", " 'see': 86,\n", " 'story': 87,\n", " 'me': 88,\n", " 'into': 89,\n", " 'did': 90,\n", " ':': 91,\n", " 'well': 92,\n", " 'we': 
93,\n", " 'will': 94,\n", " 'does': 95,\n", " 'than': 96,\n", " 'also': 97,\n", " 'get': 98,\n", " '...': 99,\n", " 'people': 100,\n", " 'other': 101,\n", " 'bad': 102,\n", " 'been': 103,\n", " 'could': 104,\n", " 'first': 105,\n", " 'much': 106,\n", " 'how': 107,\n", " 'most': 108,\n", " 'any': 109,\n", " 'because': 110,\n", " 'two': 111,\n", " 'then': 112,\n", " 'great': 113,\n", " 'him': 114,\n", " 'its': 115,\n", " 'too': 116,\n", " 'made': 117,\n", " 'them': 118,\n", " 'after': 119,\n", " 'movies': 120,\n", " 'make': 121,\n", " '/': 122,\n", " 'way': 123,\n", " 'think': 124,\n", " 'never': 125,\n", " 'watch': 126,\n", " 'acting': 127,\n", " 'seen': 128,\n", " ';': 129,\n", " 'films': 130,\n", " 'plot': 131,\n", " 'being': 132,\n", " 'many': 133,\n", " 'over': 134,\n", " 'where': 135,\n", " 'character': 136,\n", " 'man': 137,\n", " 'little': 138,\n", " 'better': 139,\n", " 'life': 140,\n", " 'characters': 141,\n", " 'love': 142,\n", " 'your': 143,\n", " 'here': 144,\n", " 'know': 145,\n", " 'scenes': 146,\n", " 'best': 147,\n", " 'end': 148,\n", " 'show': 149,\n", " 'while': 150,\n", " 'through': 151,\n", " 'should': 152,\n", " 'off': 153,\n", " 'ever': 154,\n", " 'these': 155,\n", " 'go': 156,\n", " 'such': 157,\n", " 'say': 158,\n", " '--': 159,\n", " 'something': 160,\n", " 'scene': 161,\n", " 'still': 162,\n", " 'before': 163,\n", " 'though': 164,\n", " 'watching': 165,\n", " 'between': 166,\n", " 'actually': 167,\n", " 'old': 168,\n", " '10': 169,\n", " 'find': 170,\n", " 'back': 171,\n", " 'now': 172,\n", " 'why': 173,\n", " 'years': 174,\n", " \"'ve\": 175,\n", " 'actors': 176,\n", " 'fact': 177,\n", " 'those': 178,\n", " \"'m\": 179,\n", " 'thing': 180,\n", " 'pretty': 181,\n", " 'quite': 182,\n", " 'part': 183,\n", " 'going': 184,\n", " 'same': 185,\n", " 'real': 186,\n", " 'another': 187,\n", " 'down': 188,\n", " 'funny': 189,\n", " 'nothing': 190,\n", " 'look': 191,\n", " 'makes': 192,\n", " '*': 193,\n", " 'new': 194,\n", " 'want': 195,\n", " 
'action': 196,\n", " '&': 197,\n", " 'director': 198,\n", " 'work': 199,\n", " 'few': 200,\n", " \"'re\": 201,\n", " 'seems': 202,\n", " 'around': 203,\n", " 'world': 204,\n", " 'point': 205,\n", " 'without': 206,\n", " 'cast': 207,\n", " 'again': 208,\n", " 'own': 209,\n", " 'both': 210,\n", " 'lot': 211,\n", " 'enough': 212,\n", " 'every': 213,\n", " 'family': 214,\n", " 'got': 215,\n", " 'ca': 216,\n", " \"'ll\": 217,\n", " 'probably': 218,\n", " 'big': 219,\n", " 'bit': 220,\n", " 'might': 221,\n", " 'things': 222,\n", " 'horror': 223,\n", " 'us': 224,\n", " 'almost': 225,\n", " 'may': 226,\n", " 'right': 227,\n", " 'must': 228,\n", " 'away': 229,\n", " 'thought': 230,\n", " 'interesting': 231,\n", " 'least': 232,\n", " 'whole': 233,\n", " 'series': 234,\n", " 'gets': 235,\n", " 'each': 236,\n", " 'give': 237,\n", " 'young': 238,\n", " 'however': 239,\n", " 'making': 240,\n", " 'day': 241,\n", " 'fun': 242,\n", " 'anything': 243,\n", " 'minutes': 244,\n", " 'kind': 245,\n", " 'come': 246,\n", " 'girl': 247,\n", " 'saw': 248,\n", " 'script': 249,\n", " 'take': 250,\n", " 'long': 251,\n", " 'times': 252,\n", " 'someone': 253,\n", " 'found': 254,\n", " 'done': 255,\n", " 'feel': 256,\n", " 'far': 257,\n", " 'since': 258,\n", " 'role': 259,\n", " 'original': 260,\n", " 'course': 261,\n", " 'goes': 262,\n", " 'last': 263,\n", " 'true': 264,\n", " 'simply': 265,\n", " 'always': 266,\n", " \"'d\": 267,\n", " 'tv': 268,\n", " 'hard': 269,\n", " 'place': 270,\n", " 'set': 271,\n", " 'trying': 272,\n", " 'believe': 273,\n", " 'shot': 274,\n", " 'comes': 275,\n", " 'actor': 276,\n", " 'yet': 277,\n", " '4': 278,\n", " 'having': 279,\n", " 'book': 280,\n", " 'looks': 281,\n", " 'guy': 282,\n", " 'screen': 283,\n", " 'later': 284,\n", " 'shows': 285,\n", " 'performance': 286,\n", " 'worth': 287,\n", " 'audience': 288,\n", " 'comedy': 289,\n", " 'sure': 290,\n", " 'looking': 291,\n", " 'sense': 292,\n", " 'star': 293,\n", " 'effects': 294,\n", " 'read': 295,\n", " 'takes': 
296,\n", " 'although': 297,\n", " 'ending': 298,\n", " 'john': 299,\n", " 'anyone': 300,\n", " 'worst': 301,\n", " 'american': 302,\n", " 'year': 303,\n", " 'especially': 304,\n", " 'women': 305,\n", " 'together': 306,\n", " 'dvd': 307,\n", " 'instead': 308,\n", " 'different': 309,\n", " 'am': 310,\n", " 'woman': 311,\n", " 'men': 312,\n", " '2': 313,\n", " 'our': 314,\n", " 'played': 315,\n", " 'music': 316,\n", " 'special': 317,\n", " 'three': 318,\n", " 'rest': 319,\n", " 'put': 320,\n", " 'maybe': 321,\n", " 'wife': 322,\n", " 'kids': 323,\n", " 'war': 324,\n", " 'left': 325,\n", " 'black': 326,\n", " 'once': 327,\n", " 'second': 328,\n", " 'watched': 329,\n", " 'next': 330,\n", " 'friends': 331,\n", " 'rather': 332,\n", " 'let': 333,\n", " '\\x96': 334,\n", " 'job': 335,\n", " 'start': 336,\n", " 'others': 337,\n", " 'budget': 338,\n", " 'need': 339,\n", " 'mind': 340,\n", " 'said': 341,\n", " 'main': 342,\n", " 'else': 343,\n", " 'wrong': 344,\n", " 'beautiful': 345,\n", " 'half': 346,\n", " 'high': 347,\n", " 'idea': 348,\n", " 'death': 349,\n", " 'tell': 350,\n", " 'help': 351,\n", " 'nice': 352,\n", " 'seem': 353,\n", " 'perhaps': 354,\n", " 'hollywood': 355,\n", " 'everyone': 356,\n", " 'play': 357,\n", " 'case': 358,\n", " 'production': 359,\n", " 'piece': 360,\n", " 'episode': 361,\n", " 'camera': 362,\n", " 'low': 363,\n", " 'already': 364,\n", " 'top': 365,\n", " 'poor': 366,\n", " 'during': 367,\n", " '3': 368,\n", " 'stars': 369,\n", " 'house': 370,\n", " '..': 371,\n", " 'couple': 372,\n", " 'boring': 373,\n", " 'reason': 374,\n", " 'try': 375,\n", " 'along': 376,\n", " 'name': 377,\n", " 'small': 378,\n", " 'plays': 379,\n", " 'father': 380,\n", " 'everything': 381,\n", " 'used': 382,\n", " 'video': 383,\n", " 'getting': 384,\n", " 'money': 385,\n", " 'full': 386,\n", " 'less': 387,\n", " 'performances': 388,\n", " 'often': 389,\n", " 'liked': 390,\n", " 'came': 391,\n", " '1': 392,\n", " 'robert': 393,\n", " 'either': 394,\n", " 'fan': 395,\n", " 
'given': 396,\n", " 'hand': 397,\n", " 'kill': 398,\n", " 'felt': 399,\n", " 'yes': 400,\n", " 'completely': 401,\n", " 'night': 402,\n", " 'children': 403,\n", " 'himself': 404,\n", " 'girls': 405,\n", " 'early': 406,\n", " 'awful': 407,\n", " 'oh': 408,\n", " 'live': 409,\n", " 'picture': 410,\n", " 'parts': 411,\n", " 'throughout': 412,\n", " 'until': 413,\n", " 'become': 414,\n", " 'town': 415,\n", " 'written': 416,\n", " 'terrible': 417,\n", " 'turn': 418,\n", " 'child': 419,\n", " 'despite': 420,\n", " 'moments': 421,\n", " 'boy': 422,\n", " 'problem': 423,\n", " 'able': 424,\n", " 'head': 425,\n", " 'stupid': 426,\n", " 'beginning': 427,\n", " 'home': 428,\n", " 'version': 429,\n", " 'excellent': 430,\n", " 'sometimes': 431,\n", " 'overall': 432,\n", " 'recommend': 433,\n", " 'sex': 434,\n", " 'keep': 435,\n", " 'human': 436,\n", " 'drama': 437,\n", " 'hero': 438,\n", " 'supposed': 439,\n", " 'seemed': 440,\n", " 'use': 441,\n", " 'writing': 442,\n", " 'wo': 443,\n", " 'remember': 444,\n", " 'went': 445,\n", " 'enjoy': 446,\n", " 'classic': 447,\n", " 'person': 448,\n", " 'killer': 449,\n", " 'lost': 450,\n", " 'late': 451,\n", " '5': 452,\n", " 'title': 453,\n", " 'king': 454,\n", " 'entire': 455,\n", " 'history': 456,\n", " 'son': 457,\n", " 'school': 458,\n", " 'lead': 459,\n", " 'english': 460,\n", " 'sound': 461,\n", " 'cinema': 462,\n", " 'seeing': 463,\n", " 'unfortunately': 464,\n", " 'genre': 465,\n", " 'sort': 466,\n", " 'mean': 467,\n", " 'friend': 468,\n", " 'fans': 469,\n", " 'close': 470,\n", " 'quality': 471,\n", " 'definitely': 472,\n", " 'james': 473,\n", " 'worse': 474,\n", " 'says': 475,\n", " 'except': 476,\n", " 'doing': 477,\n", " 'itself': 478,\n", " 'past': 479,\n", " 'certainly': 480,\n", " 'days': 481,\n", " 'five': 482,\n", " 'dialogue': 483,\n", " 'line': 484,\n", " 'anyway': 485,\n", " 'under': 486,\n", " 'tries': 487,\n", " 'called': 488,\n", " 'fine': 489,\n", " 'guys': 490,\n", " 'care': 491,\n", " 'style': 492,\n", " 'hope': 
493,\n", " 'short': 494,\n", " 'lines': 495,\n", " 'told': 496,\n", " 'car': 497,\n", " 'decent': 498,\n", " 'brother': 499,\n", " 'killed': 500,\n", " 'wanted': 501,\n", " 'entertaining': 502,\n", " 'based': 503,\n", " 'absolutely': 504,\n", " 'feeling': 505,\n", " 'truly': 506,\n", " 'etc': 507,\n", " 'heard': 508,\n", " 'serious': 509,\n", " 'run': 510,\n", " 'wonderful': 511,\n", " 'lives': 512,\n", " 'gives': 513,\n", " 'moment': 514,\n", " 'game': 515,\n", " 'documentary': 516,\n", " 'self': 517,\n", " 'several': 518,\n", " 'waste': 519,\n", " 'dead': 520,\n", " 'blood': 521,\n", " 'matter': 522,\n", " 'wonder': 523,\n", " 'humor': 524,\n", " 'thinking': 525,\n", " 'against': 526,\n", " 'white': 527,\n", " 'side': 528,\n", " 'works': 529,\n", " 'mother': 530,\n", " 'flick': 531,\n", " 'stuff': 532,\n", " 'turns': 533,\n", " 'finally': 534,\n", " 'loved': 535,\n", " 'group': 536,\n", " 'wants': 537,\n", " 'face': 538,\n", " 'guess': 539,\n", " 'dark': 540,\n", " 'city': 541,\n", " 'events': 542,\n", " 'starts': 543,\n", " 'hour': 544,\n", " 'took': 545,\n", " 'george': 546,\n", " 'themselves': 547,\n", " 'red': 548,\n", " 'behind': 549,\n", " 'talking': 550,\n", " 'hit': 551,\n", " 'eyes': 552,\n", " 'attempt': 553,\n", " 'direction': 554,\n", " 'novel': 555,\n", " 'saying': 556,\n", " 'word': 557,\n", " 'dull': 558,\n", " 'light': 559,\n", " 'view': 560,\n", " 'playing': 561,\n", " 'opinion': 562,\n", " 'expect': 563,\n", " 'evil': 564,\n", " 'ten': 565,\n", " 'violence': 566,\n", " 'local': 567,\n", " 'final': 568,\n", " 'gave': 569,\n", " 'leave': 570,\n", " 'paul': 571,\n", " 'crap': 572,\n", " 'happens': 573,\n", " 'knows': 574,\n", " 'problems': 575,\n", " 'example': 576,\n", " 'relationship': 577,\n", " 'non': 578,\n", " 'michael': 579,\n", " 'victor': 580,\n", " 'ridiculous': 581,\n", " 'god': 582,\n", " 'similar': 583,\n", " 'general': 584,\n", " 'major': 585,\n", " 'bunch': 586,\n", " 'sister': 587,\n", " 'oscar': 588,\n", " 'turned': 589,\n", " 
'brilliant': 590,\n", " 'highly': 591,\n", " 'nearly': 592,\n", " 'de': 593,\n", " 'please': 594,\n", " 'romance': 595,\n", " 'body': 596,\n", " 'extremely': 597,\n", " 'mr.': 598,\n", " 'soon': 599,\n", " 'yourself': 600,\n", " 'known': 601,\n", " 'lack': 602,\n", " 'age': 603,\n", " 'interest': 604,\n", " 'ago': 605,\n", " 'stories': 606,\n", " 'exactly': 607,\n", " 'finds': 608,\n", " 'modern': 609,\n", " 'voice': 610,\n", " 'perfect': 611,\n", " 'heart': 612,\n", " 'alone': 613,\n", " 'tells': 614,\n", " 'daughter': 615,\n", " 'directed': 616,\n", " 'needs': 617,\n", " 'kid': 618,\n", " 'lady': 619,\n", " 'sad': 620,\n", " 'fight': 621,\n", " 'happened': 622,\n", " 'eye': 623,\n", " 'favorite': 624,\n", " 'using': 625,\n", " 'upon': 626,\n", " 'ben': 627,\n", " 'none': 628,\n", " 'beyond': 629,\n", " 'nature': 630,\n", " 'change': 631,\n", " 'save': 632,\n", " 'shots': 633,\n", " 'country': 634,\n", " 'number': 635,\n", " 'shown': 636,\n", " 'surprised': 637,\n", " 'romantic': 638,\n", " 'huge': 639,\n", " 'murder': 640,\n", " 'steve': 641,\n", " 'slow': 642,\n", " 'myself': 643,\n", " 'woods': 644,\n", " 'apparently': 645,\n", " 'lake': 646,\n", " 'cheap': 647,\n", " 'involved': 648,\n", " 'roles': 649,\n", " '6': 650,\n", " 'gore': 651,\n", " 'obviously': 652,\n", " 'knew': 653,\n", " 'level': 654,\n", " '8': 655,\n", " 'experience': 656,\n", " 'became': 657,\n", " 'gone': 658,\n", " 'cover': 659,\n", " 'amazing': 660,\n", " 'create': 661,\n", " 'living': 662,\n", " 'usually': 663,\n", " 'order': 664,\n", " 'monster': 665,\n", " 'happen': 666,\n", " 'list': 667,\n", " 'clearly': 668,\n", " 'power': 669,\n", " 'features': 670,\n", " 're': 671,\n", " 'subject': 672,\n", " 'across': 673,\n", " 'parents': 674,\n", " 'seriously': 675,\n", " 'ways': 676,\n", " 'room': 677,\n", " 'filmed': 678,\n", " 'cheesy': 679,\n", " 'disappointed': 680,\n", " 'important': 681,\n", " 'plenty': 682,\n", " '7': 683,\n", " 'particular': 684,\n", " 'started': 685,\n", " 'today': 
686,\n", " 'enjoyed': 687,\n", " 'cinematography': 688,\n", " 'annoying': 689,\n", " 'looked': 690,\n", " 'supporting': 691,\n", " 'mostly': 692,\n", " 'message': 693,\n", " 'somewhat': 694,\n", " 'viewer': 695,\n", " 'type': 696,\n", " 'certain': 697,\n", " 'release': 698,\n", " 'effort': 699,\n", " 'possible': 700,\n", " 'add': 701,\n", " 'figure': 702,\n", " 'named': 703,\n", " 'wish': 704,\n", " 'difficult': 705,\n", " 'falls': 706,\n", " 'four': 707,\n", " 'husband': 708,\n", " 'score': 709,\n", " 'leads': 710,\n", " 'form': 711,\n", " 'working': 712,\n", " 'writer': 713,\n", " 'sets': 714,\n", " 'including': 715,\n", " 'enjoyable': 716,\n", " 'ok': 717,\n", " 'note': 718,\n", " 'spent': 719,\n", " 'review': 720,\n", " 'art': 721,\n", " 'police': 722,\n", " 'sit': 723,\n", " 'horrible': 724,\n", " 'actress': 725,\n", " 'ones': 726,\n", " 'bring': 727,\n", " 'greatest': 728,\n", " 'dance': 729,\n", " 'earth': 730,\n", " 'becomes': 731,\n", " 'happy': 732,\n", " 'cut': 733,\n", " 'straight': 734,\n", " 'soundtrack': 735,\n", " 'leading': 736,\n", " 'laugh': 737,\n", " 'strange': 738,\n", " 'space': 739,\n", " 'b': 740,\n", " 'tale': 741,\n", " 'comic': 742,\n", " 'near': 743,\n", " 'due': 744,\n", " 'weak': 745,\n", " 'earlier': 746,\n", " 'follow': 747,\n", " 'british': 748,\n", " 'ends': 749,\n", " 'typical': 750,\n", " 'attention': 751,\n", " 'points': 752,\n", " 'talent': 753,\n", " 'tom': 754,\n", " 'female': 755,\n", " 'future': 756,\n", " 'fall': 757,\n", " 'laughs': 758,\n", " 'stop': 759,\n", " 'easy': 760,\n", " 'moving': 761,\n", " 'apart': 762,\n", " 'chance': 763,\n", " 'running': 764,\n", " 'york': 765,\n", " 'particularly': 766,\n", " 'luke': 767,\n", " 'bill': 768,\n", " 'forced': 769,\n", " 'theme': 770,\n", " 'easily': 771,\n", " 'rating': 772,\n", " 'coming': 773,\n", " 'davis': 774,\n", " 'totally': 775,\n", " 'realistic': 776,\n", " 'simple': 777,\n", " 'hours': 778,\n", " 'taken': 779,\n", " 'indeed': 780,\n", " 'released': 781,\n", " 
'sexual': 782,\n", " 'feels': 783,\n", " 'french': 784,\n", " 'screenplay': 785,\n", " 'la': 786,\n", " 'jokes': 787,\n", " 'sequences': 788,\n", " 'chase': 789,\n", " 'portrayed': 790,\n", " 'dramatic': 791,\n", " 'mention': 792,\n", " 'talk': 793,\n", " 'gun': 794,\n", " 'thriller': 795,\n", " 'jimmy': 796,\n", " 'career': 797,\n", " 'reality': 798,\n", " 'incredibly': 799,\n", " 'whether': 800,\n", " 'towards': 801,\n", " 'entertainment': 802,\n", " 'feature': 803,\n", " 'western': 804,\n", " 'dialog': 805,\n", " 'business': 806,\n", " 'suspense': 807,\n", " 'focus': 808,\n", " 'doubt': 809,\n", " 'possibly': 810,\n", " 'water': 811,\n", " 'gay': 812,\n", " 'blob': 813,\n", " 'comments': 814,\n", " 'brothers': 815,\n", " 'clear': 816,\n", " 'agree': 817,\n", " 'allen': 818,\n", " 'door': 819,\n", " 'editing': 820,\n", " 'third': 821,\n", " 'deserves': 822,\n", " 'silly': 823,\n", " 'fantastic': 824,\n", " 'convincing': 825,\n", " 'hardly': 826,\n", " 'lame': 827,\n", " 'act': 828,\n", " 'former': 829,\n", " 'material': 830,\n", " 'appears': 831,\n", " 'understand': 832,\n", " 'twist': 833,\n", " 'episodes': 834,\n", " 'buy': 835,\n", " 'secret': 836,\n", " 'richard': 837,\n", " 'south': 838,\n", " 'bourne': 839,\n", " 'deal': 840,\n", " 'musical': 841,\n", " 'words': 842,\n", " 'unique': 843,\n", " 'mess': 844,\n", " 'opening': 845,\n", " 'society': 846,\n", " 'avoid': 847,\n", " 'footage': 848,\n", " 'joe': 849,\n", " 'free': 850,\n", " 'forget': 851,\n", " 'herself': 852,\n", " 'appear': 853,\n", " 'obvious': 854,\n", " 'box': 855,\n", " 'single': 856,\n", " 'average': 857,\n", " 'indian': 858,\n", " 'rent': 859,\n", " 'okay': 860,\n", " 'scary': 861,\n", " 'within': 862,\n", " 'office': 863,\n", " 'crime': 864,\n", " 'science': 865,\n", " '80': 866,\n", " 'believable': 867,\n", " 'period': 868,\n", " 'showing': 869,\n", " 'call': 870,\n", " 'return': 871,\n", " 'keeps': 872,\n", " 'lee': 873,\n", " 'expected': 874,\n", " 'stay': 875,\n", " 'middle': 876,\n", 
" 'jack': 877,\n", " 'hands': 878,\n", " 'david': 879,\n", " 'attempts': 880,\n", " 'strong': 881,\n", " 'tension': 882,\n", " 'crew': 883,\n", " 'hilarious': 884,\n", " 'grade': 885,\n", " 'outside': 886,\n", " 'means': 887,\n", " 'viewing': 888,\n", " 'sadly': 889,\n", " 'hell': 890,\n", " 'whatever': 891,\n", " 'sorry': 892,\n", " 'recently': 893,\n", " 'stage': 894,\n", " 'decides': 895,\n", " 'hear': 896,\n", " 'team': 897,\n", " 'learn': 898,\n", " 'nor': 899,\n", " 'open': 900,\n", " 'break': 901,\n", " 'question': 902,\n", " 'remake': 903,\n", " 'porn': 904,\n", " 'pain': 905,\n", " 'imagine': 906,\n", " 'deep': 907,\n", " 'zombie': 908,\n", " 'basically': 909,\n", " 'killing': 910,\n", " 'company': 911,\n", " 'poorly': 912,\n", " 'dr.': 913,\n", " 'predictable': 914,\n", " 'taking': 915,\n", " 'large': 916,\n", " 'language': 917,\n", " 'giving': 918,\n", " 'public': 919,\n", " 'audiences': 920,\n", " 'ask': 921,\n", " 'cool': 922,\n", " 'america': 923,\n", " 'slasher': 924,\n", " 'west': 925,\n", " 'mentioned': 926,\n", " 'die': 927,\n", " 'christmas': 928,\n", " 'complete': 929,\n", " 'needed': 930,\n", " 'martin': 931,\n", " 'makers': 932,\n", " 'cgi': 933,\n", " 'boys': 934,\n", " 'vargas': 935,\n", " 'usual': 936,\n", " 'begin': 937,\n", " 'dad': 938,\n", " 'total': 939,\n", " 'somehow': 940,\n", " 'stick': 941,\n", " 'shame': 942,\n", " 'successful': 943,\n", " 'sitting': 944,\n", " 'fred': 945,\n", " 'meets': 946,\n", " 'unless': 947,\n", " 'dancing': 948,\n", " 'sounds': 949,\n", " 'above': 950,\n", " 'elements': 951,\n", " 'whose': 952,\n", " 'german': 953,\n", " 'considering': 954,\n", " 'caught': 955,\n", " 'credit': 956,\n", " 'interested': 957,\n", " 'move': 958,\n", " 'filming': 959,\n", " 'truth': 960,\n", " 'eventually': 961,\n", " 'share': 962,\n", " 'ability': 963,\n", " 'meaning': 964,\n", " 'agent': 965,\n", " 'fast': 966,\n", " 'stand': 967,\n", " 'onto': 968,\n", " 'plain': 969,\n", " 'comment': 970,\n", " 'kept': 971,\n", " 
'situation': 972,\n", " 'setting': 973,\n", " 'value': 974,\n", " 'willing': 975,\n", " 'realize': 976,\n", " 'acted': 977,\n", " 'weird': 978,\n", " 'alive': 979,\n", " 'fairly': 980,\n", " 'dream': 981,\n", " 'building': 982,\n", " 'hair': 983,\n", " 'bored': 984,\n", " 'minute': 985,\n", " 'emotional': 986,\n", " 'directing': 987,\n", " 'theatrical': 988,\n", " 'famous': 989,\n", " 'begins': 990,\n", " 'front': 991,\n", " 'catch': 992,\n", " 'sequence': 993,\n", " 'runs': 994,\n", " 'follows': 995,\n", " 'song': 996,\n", " 'government': 997,\n", " 'miss': 998,\n", " 'actual': 999,\n", " ...})" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "movie_reviews.vocab.stoi" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### `movie_reviews.vocab.itos` maps the `indexes` of vocabulary tokens to `strings`" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "['xxunk',\n", " 'xxpad',\n", " 'xxbos',\n", " 'xxeos',\n", " 'xxfld',\n", " 'xxmaj',\n", " 'xxup',\n", " 'xxrep',\n", " 'xxwrep',\n", " 'the',\n", " '.',\n", " ',',\n", " 'and',\n", " 'a',\n", " 'of',\n", " 'to',\n", " 'is',\n", " 'it',\n", " 'in',\n", " 'i',\n", " 'that',\n", " 'this',\n", " '\"',\n", " \"'s\",\n", " '\\n \\n ',\n", " '-',\n", " 'was',\n", " 'as',\n", " 'for',\n", " 'movie',\n", " 'with',\n", " 'but',\n", " 'film',\n", " 'you',\n", " ')',\n", " 'on',\n", " '(',\n", " \"n't\",\n", " 'are',\n", " 'he',\n", " 'his',\n", " 'not',\n", " 'have',\n", " 'be',\n", " 'one',\n", " 'they',\n", " 'all',\n", " 'at',\n", " 'by',\n", " 'an',\n", " 'from',\n", " 'like',\n", " '!',\n", " 'so',\n", " 'who',\n", " 'there',\n", " 'about',\n", " 'just',\n", " 'out',\n", " 'if',\n", " 'or',\n", " 'do',\n", " 'what',\n", " 'her',\n", " 'has',\n", " \"'\",\n", " 'some',\n", " 'more',\n", " 'good',\n", " 'when',\n", " 'up',\n", " 'very',\n", " '?',\n", " 'she',\n", " 'would',\n", " 'no',\n", " 
'really',\n", " 'were',\n", " 'their',\n", " 'my',\n", " 'had',\n", " 'time',\n", " 'can',\n", " 'only',\n", " 'which',\n", " 'even',\n", " 'see',\n", " 'story',\n", " 'me',\n", " 'into',\n", " 'did',\n", " ':',\n", " 'well',\n", " 'we',\n", " 'will',\n", " 'does',\n", " 'than',\n", " 'also',\n", " 'get',\n", " '...',\n", " 'people',\n", " 'other',\n", " 'bad',\n", " 'been',\n", " 'could',\n", " 'first',\n", " 'much',\n", " 'how',\n", " 'most',\n", " 'any',\n", " 'because',\n", " 'two',\n", " 'then',\n", " 'great',\n", " 'him',\n", " 'its',\n", " 'too',\n", " 'made',\n", " 'them',\n", " 'after',\n", " 'movies',\n", " 'make',\n", " '/',\n", " 'way',\n", " 'think',\n", " 'never',\n", " 'watch',\n", " 'acting',\n", " 'seen',\n", " ';',\n", " 'films',\n", " 'plot',\n", " 'being',\n", " 'many',\n", " 'over',\n", " 'where',\n", " 'character',\n", " 'man',\n", " 'little',\n", " 'better',\n", " 'life',\n", " 'characters',\n", " 'love',\n", " 'your',\n", " 'here',\n", " 'know',\n", " 'scenes',\n", " 'best',\n", " 'end',\n", " 'show',\n", " 'while',\n", " 'through',\n", " 'should',\n", " 'off',\n", " 'ever',\n", " 'these',\n", " 'go',\n", " 'such',\n", " 'say',\n", " '--',\n", " 'something',\n", " 'scene',\n", " 'still',\n", " 'before',\n", " 'though',\n", " 'watching',\n", " 'between',\n", " 'actually',\n", " 'old',\n", " '10',\n", " 'find',\n", " 'back',\n", " 'now',\n", " 'why',\n", " 'years',\n", " \"'ve\",\n", " 'actors',\n", " 'fact',\n", " 'those',\n", " \"'m\",\n", " 'thing',\n", " 'pretty',\n", " 'quite',\n", " 'part',\n", " 'going',\n", " 'same',\n", " 'real',\n", " 'another',\n", " 'down',\n", " 'funny',\n", " 'nothing',\n", " 'look',\n", " 'makes',\n", " '*',\n", " 'new',\n", " 'want',\n", " 'action',\n", " '&',\n", " 'director',\n", " 'work',\n", " 'few',\n", " \"'re\",\n", " 'seems',\n", " 'around',\n", " 'world',\n", " 'point',\n", " 'without',\n", " 'cast',\n", " 'again',\n", " 'own',\n", " 'both',\n", " 'lot',\n", " 'enough',\n", " 'every',\n", " 
'family',\n", " 'got',\n", " 'ca',\n", " \"'ll\",\n", " 'probably',\n", " 'big',\n", " 'bit',\n", " 'might',\n", " 'things',\n", " 'horror',\n", " 'us',\n", " 'almost',\n", " 'may',\n", " 'right',\n", " 'must',\n", " 'away',\n", " 'thought',\n", " 'interesting',\n", " 'least',\n", " 'whole',\n", " 'series',\n", " 'gets',\n", " 'each',\n", " 'give',\n", " 'young',\n", " 'however',\n", " 'making',\n", " 'day',\n", " 'fun',\n", " 'anything',\n", " 'minutes',\n", " 'kind',\n", " 'come',\n", " 'girl',\n", " 'saw',\n", " 'script',\n", " 'take',\n", " 'long',\n", " 'times',\n", " 'someone',\n", " 'found',\n", " 'done',\n", " 'feel',\n", " 'far',\n", " 'since',\n", " 'role',\n", " 'original',\n", " 'course',\n", " 'goes',\n", " 'last',\n", " 'true',\n", " 'simply',\n", " 'always',\n", " \"'d\",\n", " 'tv',\n", " 'hard',\n", " 'place',\n", " 'set',\n", " 'trying',\n", " 'believe',\n", " 'shot',\n", " 'comes',\n", " 'actor',\n", " 'yet',\n", " '4',\n", " 'having',\n", " 'book',\n", " 'looks',\n", " 'guy',\n", " 'screen',\n", " 'later',\n", " 'shows',\n", " 'performance',\n", " 'worth',\n", " 'audience',\n", " 'comedy',\n", " 'sure',\n", " 'looking',\n", " 'sense',\n", " 'star',\n", " 'effects',\n", " 'read',\n", " 'takes',\n", " 'although',\n", " 'ending',\n", " 'john',\n", " 'anyone',\n", " 'worst',\n", " 'american',\n", " 'year',\n", " 'especially',\n", " 'women',\n", " 'together',\n", " 'dvd',\n", " 'instead',\n", " 'different',\n", " 'am',\n", " 'woman',\n", " 'men',\n", " '2',\n", " 'our',\n", " 'played',\n", " 'music',\n", " 'special',\n", " 'three',\n", " 'rest',\n", " 'put',\n", " 'maybe',\n", " 'wife',\n", " 'kids',\n", " 'war',\n", " 'left',\n", " 'black',\n", " 'once',\n", " 'second',\n", " 'watched',\n", " 'next',\n", " 'friends',\n", " 'rather',\n", " 'let',\n", " '\\x96',\n", " 'job',\n", " 'start',\n", " 'others',\n", " 'budget',\n", " 'need',\n", " 'mind',\n", " 'said',\n", " 'main',\n", " 'else',\n", " 'wrong',\n", " 'beautiful',\n", " 'half',\n", " 
'high',\n", " 'idea',\n", " 'death',\n", " 'tell',\n", " 'help',\n", " 'nice',\n", " 'seem',\n", " 'perhaps',\n", " 'hollywood',\n", " 'everyone',\n", " 'play',\n", " 'case',\n", " 'production',\n", " 'piece',\n", " 'episode',\n", " 'camera',\n", " 'low',\n", " 'already',\n", " 'top',\n", " 'poor',\n", " 'during',\n", " '3',\n", " 'stars',\n", " 'house',\n", " '..',\n", " 'couple',\n", " 'boring',\n", " 'reason',\n", " 'try',\n", " 'along',\n", " 'name',\n", " 'small',\n", " 'plays',\n", " 'father',\n", " 'everything',\n", " 'used',\n", " 'video',\n", " 'getting',\n", " 'money',\n", " 'full',\n", " 'less',\n", " 'performances',\n", " 'often',\n", " 'liked',\n", " 'came',\n", " '1',\n", " 'robert',\n", " 'either',\n", " 'fan',\n", " 'given',\n", " 'hand',\n", " 'kill',\n", " 'felt',\n", " 'yes',\n", " 'completely',\n", " 'night',\n", " 'children',\n", " 'himself',\n", " 'girls',\n", " 'early',\n", " 'awful',\n", " 'oh',\n", " 'live',\n", " 'picture',\n", " 'parts',\n", " 'throughout',\n", " 'until',\n", " 'become',\n", " 'town',\n", " 'written',\n", " 'terrible',\n", " 'turn',\n", " 'child',\n", " 'despite',\n", " 'moments',\n", " 'boy',\n", " 'problem',\n", " 'able',\n", " 'head',\n", " 'stupid',\n", " 'beginning',\n", " 'home',\n", " 'version',\n", " 'excellent',\n", " 'sometimes',\n", " 'overall',\n", " 'recommend',\n", " 'sex',\n", " 'keep',\n", " 'human',\n", " 'drama',\n", " 'hero',\n", " 'supposed',\n", " 'seemed',\n", " 'use',\n", " 'writing',\n", " 'wo',\n", " 'remember',\n", " 'went',\n", " 'enjoy',\n", " 'classic',\n", " 'person',\n", " 'killer',\n", " 'lost',\n", " 'late',\n", " '5',\n", " 'title',\n", " 'king',\n", " 'entire',\n", " 'history',\n", " 'son',\n", " 'school',\n", " 'lead',\n", " 'english',\n", " 'sound',\n", " 'cinema',\n", " 'seeing',\n", " 'unfortunately',\n", " 'genre',\n", " 'sort',\n", " 'mean',\n", " 'friend',\n", " 'fans',\n", " 'close',\n", " 'quality',\n", " 'definitely',\n", " 'james',\n", " 'worse',\n", " 'says',\n", " 
'except',\n", " 'doing',\n", " 'itself',\n", " 'past',\n", " 'certainly',\n", " 'days',\n", " 'five',\n", " 'dialogue',\n", " 'line',\n", " 'anyway',\n", " 'under',\n", " 'tries',\n", " 'called',\n", " 'fine',\n", " 'guys',\n", " 'care',\n", " 'style',\n", " 'hope',\n", " 'short',\n", " 'lines',\n", " 'told',\n", " 'car',\n", " 'decent',\n", " 'brother',\n", " 'killed',\n", " 'wanted',\n", " 'entertaining',\n", " 'based',\n", " 'absolutely',\n", " 'feeling',\n", " 'truly',\n", " 'etc',\n", " 'heard',\n", " 'serious',\n", " 'run',\n", " 'wonderful',\n", " 'lives',\n", " 'gives',\n", " 'moment',\n", " 'game',\n", " 'documentary',\n", " 'self',\n", " 'several',\n", " 'waste',\n", " 'dead',\n", " 'blood',\n", " 'matter',\n", " 'wonder',\n", " 'humor',\n", " 'thinking',\n", " 'against',\n", " 'white',\n", " 'side',\n", " 'works',\n", " 'mother',\n", " 'flick',\n", " 'stuff',\n", " 'turns',\n", " 'finally',\n", " 'loved',\n", " 'group',\n", " 'wants',\n", " 'face',\n", " 'guess',\n", " 'dark',\n", " 'city',\n", " 'events',\n", " 'starts',\n", " 'hour',\n", " 'took',\n", " 'george',\n", " 'themselves',\n", " 'red',\n", " 'behind',\n", " 'talking',\n", " 'hit',\n", " 'eyes',\n", " 'attempt',\n", " 'direction',\n", " 'novel',\n", " 'saying',\n", " 'word',\n", " 'dull',\n", " 'light',\n", " 'view',\n", " 'playing',\n", " 'opinion',\n", " 'expect',\n", " 'evil',\n", " 'ten',\n", " 'violence',\n", " 'local',\n", " 'final',\n", " 'gave',\n", " 'leave',\n", " 'paul',\n", " 'crap',\n", " 'happens',\n", " 'knows',\n", " 'problems',\n", " 'example',\n", " 'relationship',\n", " 'non',\n", " 'michael',\n", " 'victor',\n", " 'ridiculous',\n", " 'god',\n", " 'similar',\n", " 'general',\n", " 'major',\n", " 'bunch',\n", " 'sister',\n", " 'oscar',\n", " 'turned',\n", " 'brilliant',\n", " 'highly',\n", " 'nearly',\n", " 'de',\n", " 'please',\n", " 'romance',\n", " 'body',\n", " 'extremely',\n", " 'mr.',\n", " 'soon',\n", " 'yourself',\n", " 'known',\n", " 'lack',\n", " 'age',\n", " 
'interest',\n", " 'ago',\n", " 'stories',\n", " 'exactly',\n", " 'finds',\n", " 'modern',\n", " 'voice',\n", " 'perfect',\n", " 'heart',\n", " 'alone',\n", " 'tells',\n", " 'daughter',\n", " 'directed',\n", " 'needs',\n", " 'kid',\n", " 'lady',\n", " 'sad',\n", " 'fight',\n", " 'happened',\n", " 'eye',\n", " 'favorite',\n", " 'using',\n", " 'upon',\n", " 'ben',\n", " 'none',\n", " 'beyond',\n", " 'nature',\n", " 'change',\n", " 'save',\n", " 'shots',\n", " 'country',\n", " 'number',\n", " 'shown',\n", " 'surprised',\n", " 'romantic',\n", " 'huge',\n", " 'murder',\n", " 'steve',\n", " 'slow',\n", " 'myself',\n", " 'woods',\n", " 'apparently',\n", " 'lake',\n", " 'cheap',\n", " 'involved',\n", " 'roles',\n", " '6',\n", " 'gore',\n", " 'obviously',\n", " 'knew',\n", " 'level',\n", " '8',\n", " 'experience',\n", " 'became',\n", " 'gone',\n", " 'cover',\n", " 'amazing',\n", " 'create',\n", " 'living',\n", " 'usually',\n", " 'order',\n", " 'monster',\n", " 'happen',\n", " 'list',\n", " 'clearly',\n", " 'power',\n", " 'features',\n", " 're',\n", " 'subject',\n", " 'across',\n", " 'parents',\n", " 'seriously',\n", " 'ways',\n", " 'room',\n", " 'filmed',\n", " 'cheesy',\n", " 'disappointed',\n", " 'important',\n", " 'plenty',\n", " '7',\n", " 'particular',\n", " 'started',\n", " 'today',\n", " 'enjoyed',\n", " 'cinematography',\n", " 'annoying',\n", " 'looked',\n", " 'supporting',\n", " 'mostly',\n", " 'message',\n", " 'somewhat',\n", " 'viewer',\n", " 'type',\n", " 'certain',\n", " 'release',\n", " 'effort',\n", " 'possible',\n", " 'add',\n", " 'figure',\n", " 'named',\n", " 'wish',\n", " 'difficult',\n", " 'falls',\n", " 'four',\n", " 'husband',\n", " 'score',\n", " 'leads',\n", " 'form',\n", " 'working',\n", " 'writer',\n", " 'sets',\n", " 'including',\n", " 'enjoyable',\n", " 'ok',\n", " 'note',\n", " 'spent',\n", " 'review',\n", " 'art',\n", " 'police',\n", " 'sit',\n", " 'horrible',\n", " 'actress',\n", " 'ones',\n", " 'bring',\n", " 'greatest',\n", " 'dance',\n", " 
'earth',\n", " 'becomes',\n", " 'happy',\n", " 'cut',\n", " 'straight',\n", " 'soundtrack',\n", " 'leading',\n", " 'laugh',\n", " 'strange',\n", " 'space',\n", " 'b',\n", " 'tale',\n", " 'comic',\n", " 'near',\n", " 'due',\n", " 'weak',\n", " 'earlier',\n", " 'follow',\n", " 'british',\n", " 'ends',\n", " 'typical',\n", " 'attention',\n", " 'points',\n", " 'talent',\n", " 'tom',\n", " 'female',\n", " 'future',\n", " 'fall',\n", " 'laughs',\n", " 'stop',\n", " 'easy',\n", " 'moving',\n", " 'apart',\n", " 'chance',\n", " 'running',\n", " 'york',\n", " 'particularly',\n", " 'luke',\n", " 'bill',\n", " 'forced',\n", " 'theme',\n", " 'easily',\n", " 'rating',\n", " 'coming',\n", " 'davis',\n", " 'totally',\n", " 'realistic',\n", " 'simple',\n", " 'hours',\n", " 'taken',\n", " 'indeed',\n", " 'released',\n", " 'sexual',\n", " 'feels',\n", " 'french',\n", " 'screenplay',\n", " 'la',\n", " 'jokes',\n", " 'sequences',\n", " 'chase',\n", " 'portrayed',\n", " 'dramatic',\n", " 'mention',\n", " 'talk',\n", " 'gun',\n", " 'thriller',\n", " 'jimmy',\n", " 'career',\n", " 'reality',\n", " 'incredibly',\n", " 'whether',\n", " 'towards',\n", " 'entertainment',\n", " 'feature',\n", " 'western',\n", " 'dialog',\n", " 'business',\n", " 'suspense',\n", " 'focus',\n", " 'doubt',\n", " 'possibly',\n", " 'water',\n", " 'gay',\n", " 'blob',\n", " 'comments',\n", " 'brothers',\n", " 'clear',\n", " 'agree',\n", " 'allen',\n", " 'door',\n", " 'editing',\n", " 'third',\n", " 'deserves',\n", " 'silly',\n", " 'fantastic',\n", " 'convincing',\n", " 'hardly',\n", " 'lame',\n", " 'act',\n", " 'former',\n", " 'material',\n", " 'appears',\n", " 'understand',\n", " 'twist',\n", " 'episodes',\n", " 'buy',\n", " 'secret',\n", " 'richard',\n", " 'south',\n", " 'bourne',\n", " 'deal',\n", " 'musical',\n", " 'words',\n", " 'unique',\n", " 'mess',\n", " 'opening',\n", " 'society',\n", " 'avoid',\n", " 'footage',\n", " 'joe',\n", " 'free',\n", " 'forget',\n", " 'herself',\n", " 'appear',\n", " 'obvious',\n", 
" 'box',\n", " 'single',\n", " 'average',\n", " 'indian',\n", " 'rent',\n", " 'okay',\n", " 'scary',\n", " 'within',\n", " 'office',\n", " 'crime',\n", " 'science',\n", " '80',\n", " 'believable',\n", " 'period',\n", " 'showing',\n", " 'call',\n", " 'return',\n", " 'keeps',\n", " 'lee',\n", " 'expected',\n", " 'stay',\n", " 'middle',\n", " 'jack',\n", " 'hands',\n", " 'david',\n", " 'attempts',\n", " 'strong',\n", " 'tension',\n", " 'crew',\n", " 'hilarious',\n", " 'grade',\n", " 'outside',\n", " 'means',\n", " 'viewing',\n", " 'sadly',\n", " 'hell',\n", " 'whatever',\n", " 'sorry',\n", " 'recently',\n", " 'stage',\n", " 'decides',\n", " 'hear',\n", " 'team',\n", " 'learn',\n", " 'nor',\n", " 'open',\n", " 'break',\n", " 'question',\n", " 'remake',\n", " 'porn',\n", " 'pain',\n", " 'imagine',\n", " 'deep',\n", " 'zombie',\n", " 'basically',\n", " 'killing',\n", " 'company',\n", " 'poorly',\n", " 'dr.',\n", " 'predictable',\n", " 'taking',\n", " 'large',\n", " 'language',\n", " 'giving',\n", " 'public',\n", " 'audiences',\n", " 'ask',\n", " 'cool',\n", " 'america',\n", " 'slasher',\n", " 'west',\n", " 'mentioned',\n", " 'die',\n", " 'christmas',\n", " 'complete',\n", " 'needed',\n", " 'martin',\n", " 'makers',\n", " 'cgi',\n", " 'boys',\n", " 'vargas',\n", " 'usual',\n", " 'begin',\n", " 'dad',\n", " 'total',\n", " 'somehow',\n", " 'stick',\n", " 'shame',\n", " 'successful',\n", " 'sitting',\n", " 'fred',\n", " 'meets',\n", " 'unless',\n", " 'dancing',\n", " 'sounds',\n", " 'above',\n", " 'elements',\n", " 'whose',\n", " 'german',\n", " 'considering',\n", " 'caught',\n", " 'credit',\n", " 'interested',\n", " 'move',\n", " 'filming',\n", " 'truth',\n", " 'eventually',\n", " 'share',\n", " 'ability',\n", " 'meaning',\n", " 'agent',\n", " 'fast',\n", " 'stand',\n", " 'onto',\n", " 'plain',\n", " 'comment',\n", " 'kept',\n", " 'situation',\n", " 'setting',\n", " 'value',\n", " 'willing',\n", " 'realize',\n", " 'acted',\n", " 'weird',\n", " 'alive',\n", " 'fairly',\n", " 
'dream',\n", " 'building',\n", " 'hair',\n", " 'bored',\n", " 'minute',\n", " 'emotional',\n", " 'directing',\n", " 'theatrical',\n", " 'famous',\n", " 'begins',\n", " 'front',\n", " 'catch',\n", " 'sequence',\n", " 'runs',\n", " 'follows',\n", " 'song',\n", " 'government',\n", " 'miss',\n", " 'actual',\n", " ...]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "movie_reviews.vocab.itos" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Notice that `itos` (int-to-string) and `stoi` (string-to-int) have different lengths. Think for a moment about why this is.\n", "See the hint below." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "itos length 6016 \n", "stoi length 19160 \n" ] } ], "source": [ "print('itos ', 'length ',len(movie_reviews.vocab.itos),type(movie_reviews.vocab.itos) )\n", "print('stoi ', 'length ',len(movie_reviews.vocab.stoi),type(movie_reviews.vocab.stoi) )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Hint: `stoi` is an instance of the class `defaultdict`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### In `stoi`, rare words that appear fewer than three times in the corpus are mapped to zero, and because `stoi` is a `defaultdict`, words that are not in the dictionary at all are mapped to the `default value`, which is also zero" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0\n", "0\n", "0\n", "0\n", "0\n", "0\n" ] } ], "source": [ "rare_words = ['acrid','a_random_made_up_nonexistant_word','acrimonious','allosteric','anodyne','antikythera']\n", "for word in rare_words:\n", "    print(movie_reviews.vocab.stoi[word])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### What's the `token` corresponding to the `default` value?" 
] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "xxunk\n" ] } ], "source": [ "print(movie_reviews.vocab.itos[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Note that `stoi` (string-to-int) is larger than `itos` (int-to-string)." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "len(stoi) = 19165\n", "len(itos) = 6016\n", "len(stoi) - len(itos) = 13149\n" ] } ], "source": [ "print(f'len(stoi) = {len(movie_reviews.vocab.stoi)}')\n", "print(f'len(itos) = {len(movie_reviews.vocab.itos)}')\n", "print(f'len(stoi) - len(itos) = {len(movie_reviews.vocab.stoi) - len(movie_reviews.vocab.itos)}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### This is because many words map to `unknown`. We can confirm here:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "unk = []\n", "for word, num in movie_reviews.vocab.stoi.items():\n", "    if num==0:\n", "        unk.append(word)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "13155" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(unk)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Question: why isn't len(unk) = len(stoi) - len(itos)?\n", "Hint: remember the list of rare words we used to query `stoi` a few cells back?" 
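, "\n", "\n", "The hint points at a property of `defaultdict` that is easy to see in isolation: looking up a missing key with `[]` returns the default value and also stores the key. The next cell is a minimal sketch on a made-up toy vocabulary (`toy_itos` and `toy_stoi` are hypothetical names, not part of fastai), so it shows only the mechanism and may not account for the whole difference seen above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from collections import defaultdict\n", "\n", "# toy vocabulary (an assumption for illustration): index 0 plays the role of the unknown token\n", "toy_itos = ['xxunk', 'movie', 'great']\n", "toy_stoi = defaultdict(int, {word: idx for idx, word in enumerate(toy_itos)})\n", "\n", "size_before = len(toy_stoi)\n", "idx = toy_stoi['antikythera']         # missing key: returns the default 0 ...\n", "size_after = len(toy_stoi)            # ... and the key is now stored, so the dict grew by one\n", "\n", "idx_get = toy_stoi.get('anodyne', 0)  # .get() reads without storing the key\n", "size_after_get = len(toy_stoi)        # unchanged by the .get() call\n", "\n", "print(size_before, idx, size_after, idx_get, size_after_get)" 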
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Here are the first 25 words that are mapped to `unknown`" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/plain": [ "['xxunk',\n", " 'bleeping',\n", " 'pert',\n", " 'ticky',\n", " 'schtick',\n", " 'whoosh',\n", " 'banzai',\n", " 'chill',\n", " 'wooofff',\n", " 'cheery',\n", " 'superstars',\n", " 'fashionable',\n", " 'cruelly',\n", " 'separating',\n", " 'mistreat',\n", " 'tensions',\n", " 'religions',\n", " 'baseness',\n", " 'nobility',\n", " 'puro',\n", " 'disowned',\n", " 'option',\n", " 'faults',\n", " 'dignified',\n", " 'realisation']" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unk[:25]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Map the movie reviews into a vector space" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### There are 6016 unique tokens in the IMDb review vocabulary. Their numericalized values range from 0 to 6015" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "There are 6016 unique tokens in the IMDb review sample vocabulary\n", "The numericalized token values run from 0 to 6015 \n" ] } ], "source": [ "print(f'There are {len(movie_reviews.vocab.itos)} unique tokens in the IMDb review sample vocabulary')\n", "print(f'The numericalized token values run from {min(movie_reviews.vocab.stoi.values())} to {max(movie_reviews.vocab.stoi.values())} ')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Each review can be mapped to a 6016-dimensional `embedding vector` whose indices correspond to the numericalized tokens, and whose values are the number of times the corresponding token appeared in the review. To do this efficiently we need to learn a bit about `Counters`." 
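, "\n", "\n", "As a small preview, the next cell sketches this mapping on a made-up review (`toy_review` and `n_toy_terms` are hypothetical values chosen only for illustration), using a plain Python list in place of a NumPy array." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from collections import Counter\n", "\n", "# a made-up review, already numericalized; token 5 appears three times\n", "toy_review = [2, 5, 7, 5, 0, 5]\n", "n_toy_terms = 8  # pretend the vocabulary has only 8 tokens\n", "\n", "# count how often each token id occurs in the review\n", "toy_counts = Counter(toy_review)\n", "\n", "# index = token id, value = number of times that token occurs\n", "toy_vector = [0] * n_toy_terms\n", "for token_id, count in toy_counts.items():\n", "    toy_vector[token_id] = count\n", "\n", "print(toy_vector)" 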
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3A. Counters" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A **Counter** is a useful Python object. A **Counter** applied to a list returns a dictionary-like object (a `dict` subclass) whose keys are the unique elements in the list, and whose values are the counts of the unique elements. `Counter` comes from the `collections` module (along with `OrderedDict`, `defaultdict`, `deque`, and `namedtuple`).\n", "Here is how Counters work:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Let's make a TokenCounter for movie reviews" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_items([(2, 1), (5, 15), (4622, 1), (25, 3), (0, 8), (867, 1), (52, 5), (3776, 1), (1800, 1), (95, 1), (37, 1), (85, 1), (191, 1), (63, 2), (936, 1), (2740, 1), (517, 1), (18, 1), (21, 3), (11, 1), (84, 1), (2418, 1), (192, 1), (88, 1), (3777, 1), (1801, 1), (127, 1), (10, 3), (269, 1), (15, 1), (273, 1), (73, 1), (26, 2), (9, 2), (1360, 1), (35, 2), (1213, 1), (1144, 1), (1145, 1), (2419, 1), (91, 1), (62, 1), (245, 1), (14, 2), (1361, 1), (1447, 1), (64, 1), (40, 1), (797, 1), (103, 1), (72, 2), (99, 1), (534, 1), (616, 1), (48, 1), (282, 1), (54, 1), (90, 1), (219, 1), (228, 1), (43, 1), (13, 1), (3778, 1), (3779, 1), (355, 1), (492, 1)])" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "TokenCounter = lambda review_index : Counter((movie_reviews.train.x)[review_index].data)\n", "TokenCounter(0).items()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The TokenCounter `keys` are the numericalized `tokens` that appear in the review" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_keys([2, 5, 4622, 25, 0, 867, 52, 3776, 1800, 95, 37, 85, 191, 63, 936, 2740, 517, 18, 21, 11, 84, 2418, 192, 88, 3777, 1801, 127, 10, 269, 15, 273, 73, 26, 9, 1360, 
35, 1213, 1144, 1145, 2419, 91, 62, 245, 14, 1361, 1447, 64, 40, 797, 103, 72, 99, 534, 616, 48, 282, 54, 90, 219, 228, 43, 13, 3778, 3779, 355, 492])" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "TokenCounter(0).keys()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The TokenCounter `values` are the `token multiplicities`, i.e. the number of times each `token` appears in the review" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_values([1, 15, 1, 3, 8, 1, 5, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "TokenCounter(0).values()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3B. Mapping movie reviews to `embedding vectors`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Make a `count_vectorizer` function that represents a movie review as a 6016-dimensional `embedding vector`\n", "#### The `indices` of the `embedding vector` correspond to the 6016 numericalized tokens in the vocabulary; the `values` specify how often the corresponding token appears in the review. 
" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "n_terms = len(movie_reviews.vocab.itos)\n", "n_docs = len(movie_reviews.train.x)\n", "make_token_counter = lambda review_index: Counter(movie_reviews.train.x[review_index].data)\n", "def count_vectorizer(review_index,n_terms = n_terms,make_token_counter = make_token_counter):\n", " # input: review index, n_terms, and tokenizer function\n", " # output: embedding vector for the review\n", " embedding_vector = np.zeros(n_terms) \n", " keys = list(make_token_counter(review_index).keys())\n", " values = list(make_token_counter(review_index).values())\n", " embedding_vector[keys] = values\n", " return embedding_vector\n", "\n", "# make the embedding vector for the first review\n", "embedding_vector = count_vectorizer(0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Here is the `embedding vector` for the first review in the training data set" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The review is embedded in a 6016 dimensional vector\n" ] }, { "data": { "text/plain": [ "array([8., 0., 1., 0., ..., 0., 0., 0., 0.])" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(f'The review is embedded in a {len(embedding_vector)} dimensional vector')\n", "embedding_vector" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Create the document-term matrix for the IMDb" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### In non-deep learning methods of NLP, we are often interested only in `which words` were used in a review, and `how often each word got used`. This is known as the `bag of words` approach, and it suggests a really simple way to store a document (in this case, a movie review). 
\n", "\n", "#### For each review we can keep track of which words were used and how often each word was used with a `vector` whose `length` is the number of tokens in the vocabulary, which we will call `n`. The `indexes` of this `vector` correspond to the `tokens` in the `IMDb vocabulary`, and the `values` of the vector are the number of times the corresponding tokens appeared in the review. For example, the values stored at indexes 0, 1, 2, 3, 4 of the vector record the number of times the 5 tokens ['xxunk','xxpad','xxbos','xxeos','xxfld'] appeared in the review, respectively.\n", "\n", "#### Now, if our movie review database has `m` reviews, and each review is represented by a `vector` of length `n`, then vertically stacking the row vectors for all the reviews creates a matrix representation of the IMDb, which we call its `document-term matrix`. The `rows` correspond to `documents` (reviews), while the `columns` correspond to `terms` (or tokens in the vocabulary)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the previous lesson, we used [sklearn's CountVectorizer](https://github.com/scikit-learn/scikit-learn/blob/55bf5d9/sklearn/feature_extraction/text.py#L940) to generate the `vectors` that represent individual reviews. Today we will create our own (similar) version. 
This is for two reasons:\n", "- to understand what sklearn is doing under the hood\n", "- to create something that will work with a fastai TextList" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Form the embedding vectors for the movie_reviews in the training set and stack them vertically" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "there are 800 reviews, and 6016 unique tokens in the vocabulary\n" ] } ], "source": [ "# Define a function to build the full document-term matrix\n", "print(f'there are {n_docs} reviews, and {n_terms} unique tokens in the vocabulary')\n", "def make_full_doc_term_matrix(count_vectorizer,n_terms=n_terms,n_docs=n_docs):\n", "\n", "    # loop through the movie reviews\n", "    for doc_index in range(n_docs):\n", "\n", "        # make the embedding vector for the current review\n", "        embedding_vector = count_vectorizer(doc_index,n_terms)\n", "\n", "        # append the embedding vector to the document-term matrix\n", "        if(doc_index == 0):\n", "            A = embedding_vector\n", "        else:\n", "            A = np.vstack((A,embedding_vector))\n", "\n", "    # return the document-term matrix\n", "    return A\n", "\n", "# Build the full document term matrix for the movie_reviews training set\n", "A = make_full_doc_term_matrix(count_vectorizer)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Explore the `sparsity` of the document-term matrix" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The `sparsity` of a matrix is defined as the fraction of zero-valued elements" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Only 112413 of the 4812800 elements in the document-term matrix are nonzero\n", "The sparsity of the document-term matrix is 0.9766429105718085\n" ] } ], "source": [ "NNZ = np.count_nonzero(A)\n", "sparsity = (A.size-NNZ)/A.size\n", 
"print(f'Only {NNZ} of the {A.size} elements in the document-term matrix are nonzero')\n", "print(f'The sparsity of the document-term matrix is {sparsity}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Using matplotlib's `spy` method, we can visualize the structure of the `document-term matrix`\n", "`spy` plots the array, indicating each non-zero value with a dot." ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "fig = plt.figure()\n", "plt.spy(A, markersize=0.10, aspect = 'auto')\n", "fig.set_size_inches(8,6)\n", "fig.savefig('doc_term_matrix.png', dpi=800)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Several observations stand out:\n", "1. Evidently, the document-term matrix is `sparse`, i.e. it has a high proportion of zeros!\n", "2. The density of the matrix increases toward the `left` edge. This makes sense because the tokens are ordered by usage frequency, with frequency increasing toward the `left`.\n", "3. There is a perplexing pattern of curved vertical `density ripples`. If anyone has an explanation, please let me know! \n", "\n", "#### Next we'll see how to exploit matrix sparsity to save memory and compute time.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Sparse Matrix Representation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Even though we've reduced over 19,000 unique words in our corpus of reviews down to a vocabulary of 6,000 words, that's still a lot! But reviews are generally short, a few hundred words. So most tokens don't appear in a typical review. That means that most of the entries in the document-term matrix will be zeros, and therefore ordinary matrix operations will waste a lot of compute resources multiplying and adding zeros. \n", "\n", "#### We want to make efficient use of space and time by storing and performing matrix operations on our document-term matrix as a **sparse matrix**. `scipy` provides tools for efficient sparse matrix representation and operations. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Loosely speaking, a matrix with a high proportion of zeros is called `sparse` (the opposite of sparse is `dense`). 
For sparse matrices, you can save a lot of memory by only storing the non-zero values.\n", "\n", "#### More specifically, a class of matrices is called **sparse** if the number of non-zero elements is proportional to the number of rows (or columns) instead of being proportional to the product rows x columns. An example is the class of diagonal matrices.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Visualizing sparse matrix structure\n", "ref. https://scipy-lectures.org/advanced/scipy_sparse/introduction.html" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sparse matrix storage formats\n", "\n", "ref. https://scipy-lectures.org/advanced/scipy_sparse/storage_schemes.html\n", "\n", "The most common sparse storage formats are:\n", "- coordinate-wise (scipy calls this COO)\n", "- compressed sparse row (CSR)\n", "- compressed sparse column (CSC)\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Definition of the Compressed Sparse Row (CSR) format\n", "\n", "Let's start out with a prescription for the **CSR format** (ref. https://en.wikipedia.org/wiki/Sparse_matrix)\n", "\n", "Given a full matrix **`A`** that has **`m`** rows, **`n`** columns, and **`N`** nonzero values, the CSR (Compressed Sparse Row) representation uses three arrays as follows:\n", "\n", "1. **`Val[0:N]`** contains the **values** of the **`N` non-zero elements**.\n", "\n", "2. **`Col[0:N]`** contains the **column indices** of the **`N` non-zero elements**. \n", " \n", "3. For each row **`i`** of **`A`**, **`RowPointer[i]`** contains the index in **Val** of the first **nonzero value** in row **`i`**. If there are no nonzero values in the **ith** row, then **`RowPointer[i] = RowPointer[i+1]`**. By convention, an extra value **`RowPointer[m] = N`** is tacked on at the end, so that the nonzero values of row **`i`** are always **`Val[RowPointer[i]:RowPointer[i+1]]`**. 
\n", "\n", "Question: How many floats and ints does it take to store the matrix **`A`** in CSR format?\n", "\n", "Let's walk through [a few examples](http://www.mathcs.emory.edu/~cheung/Courses/561/Syllabus/3-C/sparse.html) at the Emory University website.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6. Store the document-term matrix in CSR format\n", "i.e. given the `TextList` object containing the list of reviews, return the three arrays (values, column_indices, row_pointer)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Scipy Implementation of sparse matrices\n", "\n", "From the [Scipy Sparse Matrix Documentation](https://docs.scipy.org/doc/scipy-0.18.1/reference/sparse.html)\n", "\n", "- To construct a matrix efficiently, use either dok_matrix or lil_matrix. The lil_matrix class supports basic slicing and fancy indexing with a similar syntax to NumPy arrays. As illustrated below, the COO format may also be used to efficiently construct matrices.\n", "- To perform manipulations such as multiplication or inversion, first convert the matrix to either CSC or CSR format.\n", "- All conversions among the CSR, CSC, and COO formats are efficient, linear-time operations." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### To really understand the CSR format, we need to be able to do two things:\n", "1. Translate a regular matrix A into CSR format\n", "2. Reconstruct a regular matrix from its CSR sparse representation\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 6.1. Translate a regular matrix A into CSR format\n", "This is done by implementing the definition of `CSR format`, given above." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "# construct the document-term matrix in CSR format\n", "# i.e. 
return (values, column_indices, row_pointer)\n", "def get_doc_term_matrix(text_list, n_terms):\n", "\n", "    # inputs:\n", "    #    text_list, a TextList object\n", "    #    n_terms, the number of tokens in our IMDb vocabulary\n", "\n", "    # output:\n", "    #    the CSR format sparse representation of the document-term matrix in the form of a\n", "    #    scipy.sparse.csr.csr_matrix object\n", "\n", "    # initialize arrays\n", "    values = []\n", "    column_indices = []\n", "    row_pointer = []\n", "    row_pointer.append(0)\n", "\n", "    # each document in the TextList object becomes one row of the matrix\n", "    for doc in text_list:\n", "        feature_counter = Counter(doc.data)\n", "        column_indices.extend(feature_counter.keys())\n", "        values.extend(feature_counter.values())\n", "        # record where the next row starts; after the last document this entry equals N,\n", "        # the number of nonzero elements in the matrix\n", "        row_pointer.append(len(values))\n", "\n", "    return scipy.sparse.csr_matrix((values, column_indices, row_pointer),\n", "                                   shape=(len(row_pointer) - 1, n_terms),\n", "                                   dtype=int)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Get the document-term matrix in CSR format for the training data" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Wall time: 129 ms\n" ] } ], "source": [ "%%time\n", "train_doc_term = get_doc_term_matrix(movie_reviews.train.x, len(movie_reviews.vocab.itos))" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "scipy.sparse.csr.csr_matrix" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(train_doc_term)" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(800, 6016)" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_doc_term.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Get the 
document-term matrix in CSR format for the validation data" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Wall time: 32.9 ms\n" ] } ], "source": [ "%%time\n", "valid_doc_term = get_doc_term_matrix(movie_reviews.valid.x, len(movie_reviews.vocab.itos))" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "scipy.sparse.csr.csr_matrix" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(valid_doc_term)" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(200, 6016)" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "valid_doc_term.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 6.2 Reconstruct a regular matrix from its CSR sparse representation\n", "#### Given a CSR format sparse matrix representation $(\\text{values},\\text{column_indices}, \\text{row_pointer})$ of a $\\text{m}\\times \\text{n}$ matrix $\\text{A}$,
how can we recover $\\text{A}$?\n", "\n", "First create an $\\text{m}\\times \\text{n}$ matrix with all zeros.\n", "We will recover $\\text{A}$ by overwriting the entries in the zeros matrix row by row with the non-zero entries in $\\text{A}$ as follows:" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "def CSR_to_full(values, column_indices, row_ptr, m, n):\n", "    # start from an all-zero m x n matrix\n", "    A = np.zeros((m, n))\n", "    # row_ptr[row]:row_ptr[row+1] delimits the nonzero entries of each of the m rows\n", "    for row in range(m):\n", "        start, end = row_ptr[row], row_ptr[row+1]\n", "        A[row, column_indices[start:end]] = values[start:end]\n", "    return A\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 7. IMDb data exploration exercises" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The `.todense()` method converts a sparse matrix back to a regular (dense) matrix." ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<200x6016 sparse matrix of type ''\n", "\twith 27848 stored elements in Compressed Sparse Row format>" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "valid_doc_term" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/plain": [ "matrix([[32, 0, 1, 0, ..., 1, 0, 0, 10],\n", " [ 9, 0, 1, 0, ..., 1, 0, 0, 7],\n", " [ 6, 0, 1, 0, ..., 0, 0, 0, 12],\n", " [78, 0, 1, 0, ..., 0, 0, 0, 44],\n", " ...,\n", " [ 8, 0, 1, 0, ..., 0, 0, 0, 8],\n", " [43, 0, 1, 0, ..., 8, 1, 0, 25],\n", " [ 7, 0, 1, 0, ..., 1, 0, 0, 9],\n", " [19, 0, 1, 0, ..., 2, 0, 0, 5]])" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "valid_doc_term.todense()[:10,:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Consider the second review in the validation set" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Text xxbos i saw this movie
once as a kid on the late - late show and fell in love with it . \n", " \n", " xxmaj it took 30 + years , but i recently did find it on xxup dvd - it was n't cheap , either - in a xxunk that xxunk in war movies . xxmaj we watched it last night for the first time . xxmaj the audio was good , however it was grainy and had the trailers between xxunk . xxmaj even so , it was better than i remembered it . i was also impressed at how true it was to the play . \n", " \n", " xxmaj the xxunk is around here xxunk . xxmaj if you 're xxunk in finding it , fire me a xxunk and i 'll see if i can get you the xxunk . xxunk" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "review = movie_reviews.valid.x[1]\n", "review" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exercise 1:** How many times does the word \"it\" appear in this review? Confirm that the correct value is stored in the document-term matrix, in the row corresponding to this review and the column corresponding to the word \"it\"." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Answer 1:" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": [ "# try it! \n", "# Your code here."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exercise 2**: Confirm that the review has 144 tokens, 81 of which are distinct" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Answer 2:" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<1x6016 sparse matrix of type ''\n", "\twith 81 stored elements in Compressed Sparse Row format>" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "valid_doc_term[1]" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "144" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "valid_doc_term[1].sum()" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "81" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(set(review.data))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exercise 3:** How could you convert review.data back to text (without just using review.text)?" 
] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 2, 19, 248, 21, ..., 9, 0, 10, 0], dtype=int64)" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "review.data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Answer 3:" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['xxbos', 'i', 'saw', 'this', 'movie', 'once', 'as', 'a', 'kid', 'on', 'the', 'late', '-', 'late', 'show', 'and', 'fell', 'in', 'love', 'with', 'it', '.', '\\n \\n ', 'xxmaj', 'it', 'took', '30', '+', 'years', ',', 'but', 'i', 'recently', 'did', 'find', 'it', 'on', 'xxup', 'dvd', '-', 'it', 'was', \"n't\", 'cheap', ',', 'either', '-', 'in', 'a', 'xxunk', 'that', 'xxunk', 'in', 'war', 'movies', '.', 'xxmaj', 'we', 'watched', 'it', 'last', 'night', 'for', 'the', 'first', 'time', '.', 'xxmaj', 'the', 'audio', 'was', 'good', ',', 'however', 'it', 'was', 'grainy', 'and', 'had', 'the', 'trailers', 'between', 'xxunk', '.', 'xxmaj', 'even', 'so', ',', 'it', 'was', 'better', 'than', 'i', 'remembered', 'it', '.', 'i', 'was', 'also', 'impressed', 'at', 'how', 'true', 'it', 'was', 'to', 'the', 'play', '.', '\\n \\n ', 'xxmaj', 'the', 'xxunk', 'is', 'around', 'here', 'xxunk', '.', 'xxmaj', 'if', 'you', \"'re\", 'xxunk', 'in', 'finding', 'it', ',', 'fire', 'me', 'a', 'xxunk', 'and', 'i', \"'ll\", 'see', 'if', 'i', 'can', 'get', 'you', 'the', 'xxunk', '.', 'xxunk']\n" ] } ], "source": [ "word_list = [movie_reviews.vocab.itos[a] for a in review.data]\n", "print(word_list)" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "xxbos i saw this movie once as a kid on the late - late show and fell in love with it . 
\n", " \n", " xxmaj it took 30 + years , but i recently did find it on xxup dvd - it was n't cheap , either - in a xxunk that xxunk in war movies . xxmaj we watched it last night for the first time . xxmaj the audio was good , however it was grainy and had the trailers between xxunk . xxmaj even so , it was better than i remembered it . i was also impressed at how true it was to the play . \n", " \n", " xxmaj the xxunk is around here xxunk . xxmaj if you 're xxunk in finding it , fire me a xxunk and i 'll see if i can get you the xxunk . xxunk\n" ] } ], "source": [ "reconstructed_text = ' '.join(word_list)\n", "print(reconstructed_text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## *Video 4 material ends here.* \n", "## *Video 5 material begins below.*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 8. What is a [Naive Bayes classifier](https://towardsdatascience.com/the-naive-bayes-classifier-e92ea9f47523)? " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "#### The `bag of words model` considers a movie review as equivalent to a list of the counts of all the tokens that it contains. When you do this, you throw away the rich information that comes from the sequential arrangement of the tokens into sentences and paragraphs. \n", "\n", "#### Nevertheless, even if you are not allowed to read the review but are only given its representation as `token counts`, you can usually still get a pretty good sense of whether the review was good or bad. How do you do this? By mentally gauging the overall `positive` or `negative` sentiment that the collection of words conveys, right? \n", "\n", "#### The `Naive Bayes Classifier` is an algorithm that encodes this simple reasoning process mathematically. It is based on two important pieces of information that we can learn from the training set:\n", "* The `class priors`, i.e. 
the probabilities that a randomly chosen review will be `positive` or `negative`\n", "* The `token likelihoods`, i.e. how likely it is that a given token would appear in a `positive` or `negative` review \n", "\n", "#### It turns out that this is all the information we need to build a model capable of predicting fairly accurately how any given review will be classified, given its text! \n", "\n", "#### We shall unfold the complete explanation of the magic of the Naive Bayes Classifier in the next section. \n", "\n", "#### Meanwhile, in this section, we focus on how to compute the necessary information from the training data, specifically the `prior probabilities` for reviews of each class, and the `class occurrence counts` and `class likelihood ratios` for each `token` in the `vocabulary`. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 8A. Class priors" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### From the training data we can determine the `class priors` $p$ and $q$, which are the overall probabilities that a randomly chosen review is in the `positive` or `negative` class, respectively. \n", "\n", "#### $p=\\frac{N^{+}}{N}$ \n", "#### and\n", "#### $q=\\frac{N^{-}}{N}$ \n", "\n", "#### Here $N^{+}$ and $N^{-}$ are the numbers of `positive` and `negative` reviews, and $N$ is the total number of reviews in the training set, so that \n", "\n", "#### $N = N^{+} + N^{-}$, \n", "\n", "#### and \n", "\n", "#### $q = 1-p$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 8B. Class `occurrence counts`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Let $C^{+}_{t}$ and $C^{-}_{t}$ be the `occurrence counts` of token $t$ in `positive` and `negative` reviews, respectively, and $N^{+}$ and $N^{-}$ be the total numbers of `positive` and `negative` reviews in the data set, respectively.
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 8B.1 Data exploration with class `occurrence counts`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Movie reviews classes and their integer representations" ] }, { "cell_type": "code", "execution_count": 197, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['__class__',\n", " '__delattr__',\n", " '__dict__',\n", " '__dir__',\n", " '__doc__',\n", " '__eq__',\n", " '__format__',\n", " '__ge__',\n", " '__getattr__',\n", " '__getattribute__',\n", " '__gt__',\n", " '__hash__',\n", " '__init__',\n", " '__init_subclass__',\n", " '__le__',\n", " '__lt__',\n", " '__module__',\n", " '__ne__',\n", " '__new__',\n", " '__reduce__',\n", " '__reduce_ex__',\n", " '__repr__',\n", " '__setattr__',\n", " '__setstate__',\n", " '__sizeof__',\n", " '__slotnames__',\n", " '__str__',\n", " '__subclasshook__',\n", " '__weakref__',\n", " 'add_test',\n", " 'add_test_folder',\n", " 'databunch',\n", " 'filter_by_func',\n", " 'get_processors',\n", " 'label_const',\n", " 'label_empty',\n", " 'label_from_df',\n", " 'label_from_folder',\n", " 'label_from_func',\n", " 'label_from_list',\n", " 'label_from_lists',\n", " 'label_from_re',\n", " 'lists',\n", " 'load_empty',\n", " 'load_state',\n", " 'path',\n", " 'process',\n", " 'test',\n", " 'train',\n", " 'transform',\n", " 'transform_y',\n", " 'valid']" ] }, "execution_count": 197, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dir(movie_reviews)" ] }, { "cell_type": "code", "execution_count": 196, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2" ] }, "execution_count": 196, "metadata": {}, "output_type": "execute_result" } ], "source": [ "movie_reviews.y.c" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['negative', 'positive']" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "movie_reviews.y.classes" ] }, { 
"cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Integer representations: positive: 1, negative: 0\n" ] } ], "source": [ "positive = movie_reviews.y.c2i['positive']\n", "negative = movie_reviews.y.c2i['negative']\n", "print(f'Integer representations: positive: {positive}, negative: {negative}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Brief names for training set document term matrix and its labels, validation labels, and vocabulary" ] }, { "cell_type": "code", "execution_count": 200, "metadata": {}, "outputs": [], "source": [ "x = train_doc_term\n", "y = movie_reviews.train.y\n", "valid_y = movie_reviews.valid.y\n", "v = movie_reviews.vocab" ] }, { "cell_type": "code", "execution_count": 198, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(800, 260402)" ] }, "execution_count": 198, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The `count arrays` `C1` and `C0` list the total `occurrence counts` of the tokens in `positive` and `negative` reviews, respectively." ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [], "source": [ "C1 = np.squeeze(np.asarray(x[y.items==positive].sum(0)))\n", "C0 = np.squeeze(np.asarray(x[y.items==negative].sum(0)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For each vocabulary token, we sum its counts over all positive reviews and, separately, over all negative reviews. A token that appears several times in one review is counted each time, so these are total occurrences rather than numbers of reviews. Here are the occurrence counts for the first 10 tokens in the vocabulary."
 ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[ 6468 0 383 0 0 10267 674 57 0 5260]\n", "[ 7153 0 417 0 0 10741 908 53 1 6150]\n" ] } ], "source": [ "print(C1[:10])\n", "print(C0[:10])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 8B.2 Exercise" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### We can use `C0` and `C1` to do some more data exploration!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exercise 4**: Compare how often the word \"love\" appears in positive reviews vs. negative reviews. Do the same for the word \"hate\"." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Answer 4:" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The word \"love\" appears 133 and 75 times in positive and negative documents, respectively\n" ] } ], "source": [ "# Exercise: How often does the word \"love\" appear in neg vs. pos reviews?\n", "ind = v.stoi['love']\n", "pos_counts = C1[ind] \n", "neg_counts = C0[ind] \n", "print(f'The word \"love\" appears {pos_counts} and {neg_counts} times in positive and negative documents, respectively')" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The word \"hate\" appears 5 and 13 times in positive and negative documents, respectively\n" ] } ], "source": [ "# Exercise: How often does the word \"hate\" appear in neg vs.
pos reviews?\n", "ind = v.stoi['hate']\n", "pos_counts = C1[ind] \n", "neg_counts = C0[ind] \n", "print(f'The word \"hate\" appears {pos_counts} and {neg_counts} times in positive and negative documents, respectively')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Let's look for an example of a positive review containing the word \"hated\"" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[ 15 49 304 351 393 612 695 773]\n", "[ 1 3 10 11 ... 787 789 790 797]\n" ] }, { "data": { "text/plain": [ "'xxbos xxmaj there are numerous films relating to xxup xxunk , but xxmaj mother xxmaj night is quite xxunk among them : xxmaj in this film , we are introduced to xxmaj howard xxmaj campbell ( xxmaj nolte ) , an xxmaj american living in xxmaj berlin and married to a xxmaj german , xxmaj xxunk xxmaj xxunk ( xxmaj lee ) , who decides to accept the role of a spy : xxmaj more specifically , a xxup cia agent xxmaj major xxmaj xxunk ( xxmaj goodman ) recruits xxmaj campbell who becomes a xxmaj nazi xxunk in order to enter the highest xxunk of the xxmaj hitler xxunk . xxmaj however , the deal is that the xxup us xxmaj government will never xxunk xxmaj campbell \\'s role in the war for national security reasons , and so xxmaj campbell becomes a hated figure across the xxup us . xxmaj after the war , he tries to xxunk his identity , but the past comes back and xxunk him . xxmaj his only \" friend \" is xxmaj xxunk , but even he can not do much for the xxunk of events that fall upon poor xxmaj campbell ... \\n \\n xxmaj the story is deeply touching , as we watch the tragedy of xxmaj campbell who although a great patriot , is treated by xxunk by everybody who xxunk him . xxmaj not only that , but he also gradually realizes that even the persons who are most close to him , have many xxunk of their own . 
xxmaj vonnegut provides us with a moving atmosphere , with xxmaj campbell \\'s despair building up and almost choking the viewer . \\n \\n xxmaj nolte plays the role of his life , in my opinion ; he is even better than in \" xxmaj xxunk \" , although in both roles he plays tragic figures who are destined to self - destruction . xxmaj xxunk xxmaj lee is also excellent , and the same can be said for the whole cast in general . \\n \\n i have n\\'t read the book , so i can not xxunk how the film compares to it . xxmaj in any case , this is something of no importance here : xxmaj my xxunk is upon the film per xxunk , and the film xxunk deserves a 9 / 10 .'" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "index = v.stoi['hated']\n", "a = np.argwhere((x[:,index] > 0))[:,0]\n", "print(a)\n", "b = np.argwhere(y.items==positive)[:,0]\n", "print(b)\n", "c = list(set(a).intersection(set(b)))[0]\n", "review = movie_reviews.train.x[c]\n", "review.text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Example of a negative review with the word \"loved\"" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[ 1 15 29 69 75 79 174 185 200 205 262 296 303 333 350 351 398 407 440 489 496 528 538 600 602 605 627 642 657\n", " 660 700 712 729 735 755 767 785]\n", "[ 0 2 4 5 ... 795 796 798 799]\n" ] }, { "data": { "text/plain": [ "'xxbos xxmaj oh if only i could give this rubbish less than one star ! xxmaj there were two mildly amusing parts in the whole film and that is it ! one was where a line or two from the song xxmaj do n\\'t xxmaj worry , xxmaj be xxmaj happy was xxunk by the slugs and the other was where xxmaj roddy fell of the toilet roll and landed with his feet and legs apart so that everything else he landed on on the way down hit him in the xxunk . 
xxmaj that is it there was nothing more amusing than that , at least not for me anyway ! xxmaj xxunk is not right in saying \\' xxmaj fans of the completely terrible \" xxmaj shrek \" might enjoy , but \" xxmaj wallace & xxmaj xxunk \" fans will probably turn away in xxunk . \\' xxmaj as i loved xxmaj shrek 1 2 and 3 and i also love xxmaj wallace and xxmaj xxunk . xxmaj you see what it xxunk down to is that if an animation is done extremely well then it is definitely worth watching , this however was about as far from done well as you can possibly get ! xxmaj the continuity mistakes were too big in number . xxmaj some were pointed out by the makers of this site others were not . i wo n\\'t point out all of the others , but here are a few more to see : xxmaj when the young daughter leaves at the start of the film the catch to the cage door comes down and the hook part of it that is on the right clearly goes back around behind the round xxunk thus effectively making sure xxmaj roddy would not be able to get out and yet he does just by simply kicking at it . xxmaj at one point the ruby falls down xxmaj roddy \\'s back and gets pushed straight up into the the air by xxmaj xxunk all the while the ship is moving forwards . xxmaj in the next scene xxmaj roddy has caught it again . xxmaj this is impossible . xxmaj seeing as how the ship is moving forwards the only place when the ruby was xxunk out from under the back of xxmaj roddy \\'s shirt the only place it could have landed was in the water not in xxmaj roddy \\'s hand . xxmaj there was a third one i wanted to point out but for now i have forgotten it . \\n \\n xxmaj too many , for want of a better word , \\' jokes \\' were repeated in one way or another , there was not enough time to establish any sort of connection with any of the characters , the characters were xxunk , shallow and empty , and the whole film left you wanting xxrep 4 . wanting to watch xxunk minutes of anything else ! 
xxmaj paint xxunk or grass growing are two superb xxunk !'" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "index = v.stoi['loved']\n", "a = np.argwhere((x[:,index] > 0))[:,0]\n", "print(a)\n", "b = np.argwhere(y.items==negative)[:,0]\n", "print(b)\n", "c = list(set(a).intersection(set(b)))[0]\n", "review = movie_reviews.train.x[c]\n", "review.text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 8C. Class likelihood ratios" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Given the knowledge that a review is classified as `positive`, the `conditional likelihood` that a token $t$ will appear in the review is\n", "### $ L(t|+) = \\frac{C^{+}_{t}}{N^{+}}$, \n", "#### and similarly, the `conditional likelihood` of a token appearing in a `negative` review is \n", "### $ L(t|-) = \\frac{C^{-}_{t}}{N^{-}}$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 8D. The `log-count ratio`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### From the class likelihood ratios, we can define a **log-count ratio** $R_{t}$ for each token $t$ as\n", "### $ R_{t} = \\log \\frac{L(t|+)} {L(t|-)}$\n", "#### The `log-count ratio` ranks tokens by their relative affinities for positive and negative reviews.\n", "#### We observe that\n", "* $R_{t} \\gt 0$ means `positive` reviews are more likely to contain this token \n", "* $R_{t} \\lt 0$ means `negative` reviews are more likely to contain this token \n", "* $R_{t} = 0$ indicates the token $t$ is equally likely to appear in `positive` and `negative` reviews\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 9. Building a Naive Bayes Classifier for IMDb movie reviews" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### From the `occurrence count` arrays, we can compute the `class likelihoods` and `log-count ratios` of all the tokens in the vocabulary.
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 9A. Compute the `class likelihoods`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### We compute slightly modified `conditional likelihoods`, adding 1 to the numerator and denominator to ensure numerical stability: the added 1 keeps every likelihood strictly positive, so the log-count ratio never involves a division by zero or the log of zero." ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [], "source": [ "L1 = (C1+1) / ((y.items==positive).sum() + 1)\n", "L0 = (C0+1) / ((y.items==negative).sum() + 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 9B. Compute the `log-count ratios`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The log-count ratios are" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[-0.015811 0.084839 0. 0.084839 ... 0.084839 0.084839 0.084839 0.084839]\n" ] } ], "source": [ "R = np.log(L1/L0)\n", "print(R)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Data Exercise: find the vocabulary words most likely to be associated with positive and negative reviews" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Get the indices of the tokens with the highest and lowest log-count ratios" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [], "source": [ "n_tokens = 10\n", "highest_R = np.argpartition(R, -n_tokens)[-n_tokens:]\n", "lowest_R = np.argpartition(R, n_tokens)[:n_tokens]" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Highest 10 log-count ratios: [2.569746 2.649788 2.649788 2.723896 2.723896 2.649788 2.792889 2.857428 2.975211 3.029278]\n", "\n", "Lowest 10 log-count ratios: [-2.68775 -2.554218 -2.8596 -3.134037 -2.623211 -3.093215 -2.805533 -2.748374 -2.636457 -2.554218]\n" ] } ], "source": [ "print(f'Highest {n_tokens} log-count ratios: {R[list(highest_R)]}\\n')\n", 
"print(f'Lowest {n_tokens} log-count ratios: {R[list(lowest_R)]}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Most positive words:" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([1723, 1662, 1620, 796, 1529, 1666, 1386, 1358, 1212, 1143], dtype=int64)" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "highest_R" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['sport',\n", " 'davies',\n", " 'jabba',\n", " 'jimmy',\n", " 'felix',\n", " 'gilliam',\n", " 'noir',\n", " 'astaire',\n", " 'fanfan',\n", " 'biko']" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[v.itos[k] for k in highest_R]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### There are only two movie reviews that mention \"biko\"" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<800x1 sparse matrix of type ''\n", "\twith 2 stored elements in Compressed Sparse Row format>" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "token = 'biko'\n", "train_doc_term[:,v.stoi[token]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Which movie review has the most occurrences of 'biko'?" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "review # 515 has 14 occurrences of \"biko\"\n", "\n", "xxbos \" xxmaj the xxmaj true xxmaj story xxmaj of xxmaj the xxmaj friendship xxmaj that xxmaj shook xxmaj south xxmaj africa xxmaj and xxmaj xxunk xxmaj the xxmaj world . 
\" \n", " \n", " xxmaj richard xxmaj attenborough , who directed \" a xxmaj bridge xxmaj too xxmaj far \" and \" xxmaj gandhi \" , wanted to bring the story of xxmaj steve xxmaj biko to life , and the journey and trouble that xxunk xxmaj donald xxmaj woods went through in order to get his story told . xxmaj the films uses xxmaj wood 's two books for it 's information and basis - \" xxmaj biko \" and \" xxmaj asking for xxmaj trouble \" . \n", " \n", " xxmaj the film takes place in the late 1970 's , in xxmaj south xxmaj africa . xxmaj south xxmaj africa is in the grip of the terrible apartheid , which keeps the blacks separated from the whites and xxunk the whites as the superior race . xxmaj the blacks are forced to live in xxunk on the xxunk of the cities and xxunk , and they come under frequent xxunk by the police and the army . xxmaj we are shown a dawn xxunk on a xxunk , as xxunk and armed police force their way through the camp beating and even killing the inhabitants . xxmaj then we are introduced to xxmaj donald xxmaj woods ( xxmaj kevin xxmaj kline ) , who is the editor of a popular newspaper . xxmaj after xxunk a negative story about black xxunk xxmaj steve xxmaj biko ( xxmaj denzel xxmaj washington ) , xxmaj woods goes to meet with him . xxmaj the two are xxunk of each other at first , but they soon become good friends and xxmaj biko shows the horrors of the apartheid system from a black persons point of view to xxmaj woods . xxmaj this xxunk xxmaj woods to speak out against what 's happening around him , and makes him desperate to bring xxmaj steve xxmaj biko 's story out of the xxunk of the white man 's xxmaj south xxmaj africa and to the world . xxmaj soon , xxmaj steve xxmaj biko is arrested and is killed in prison . xxmaj now xxmaj woods and his family are daring to escape from xxmaj south xxmaj africa to xxmaj england , where xxmaj woods can xxunk his book about xxmaj steve xxmaj biko and the apartheid . 
\n", " \n", " xxmaj when i first heard of \" xxmaj cry xxmaj freedom \" , i was under the impression that it was a movie completely dedicated to the life of xxmaj steve xxmaj biko . i had never actually heard of xxmaj steve xxmaj biko before i seen this film , as the events in this film were really before my time . xxmaj but it 's more about the story of xxmaj donald xxmaj woods and his journey across the border into xxmaj xxunk as he tried to xxunk the xxmaj south xxmaj african xxunk . xxmaj woods was put on a five year type house xxunk after xxmaj steve xxmaj biko was killed . xxmaj so in order to xxunk his xxunk on xxmaj steve xxmaj biko , he had to escape . xxmaj because the xxunk would be considered xxunk in xxmaj south xxmaj africa and that could have resulted in xxmaj woods meeting a fate similar to that of xxmaj biko 's . xxmaj the real xxmaj donald xxmaj woods and his wife acted as xxunk to this film . \n", " \n", " xxmaj denzel xxmaj washington is only in the film for the first hour , and i was disappointed with that as i was expecting to see him for the entire movie . xxmaj but he was amazing as xxmaj steve xxmaj biko , and captured his personality from what i 've read really well and his accent sounded perfect . xxmaj his performance earned him an xxmaj oscar nomination for xxmaj best xxmaj supporting xxmaj actor . xxmaj kevin xxmaj kline delivers a excellent and thought - xxunk performance as xxmaj donald xxmaj woods , and xxmaj penelope xxmaj xxunk is excellent as his wife xxmaj xxunk . \n", " \n", " xxmaj filming took place in xxmaj xxunk , as needless to say problems xxunk when they tried to film it in xxmaj south xxmaj africa . xxmaj while in xxmaj south xxmaj africa , the xxmaj south xxmaj african xxunk followed the film crew everywhere , so they got the bad xxunk and they pulled out and went to xxunk xxmaj xxunk instead . 
xxmaj despite everything , and the fact that the apartheid did n't end ' xxunk seven years later , \" xxmaj cry xxmaj freedom \" was n't xxunk in xxmaj south xxmaj africa . xxmaj but xxunk showing the movie received bomb threats . \n", " \n", " xxmaj richard xxmaj attenborough brings the horrors of the apartheid to the screen with extreme force and determination . xxmaj he does n't hold back at the end of the movie when showing what was supposed to be a xxunk xxunk by students in a xxunk , turns into a massacre when police open fire on them . xxmaj the film ends with the names of all the anti - apartheid xxunk who died in prison , and the explanations for their deaths . xxmaj many had \" xxmaj no xxmaj explanation \" . xxmaj quite a few were \" xxmaj xxunk \" , which is hard to believe , and many more either fell from the top of the xxunk or were \" xxmaj suicide from xxmaj hanging \" . xxmaj no one will ever know what really happened to them , but i think it 's fair to say that none of these men died at their own hands , but at the hands of others ; or to be more xxunk , at the hands of the police . \n", " \n", " \" xxmaj cry xxmaj freedom \" is a must - see movie for it 's portrayal and story of xxmaj steve xxmaj biko . 
xxmaj it 's also a xxunk and xxunk portrayal of a beautiful land divided and in the xxunk grips of racial xxunk and violence .\n" ] } ], "source": [ "index = np.argmax(train_doc_term[:,v.stoi[token]])\n", "n_times = train_doc_term[index,v.stoi[token]]\n", "print(f'review # {index} has {n_times} occurrences of \"{token}\"\\n')\n", "print(movie_reviews.train.x[index].text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Most negative words:" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([1345, 1545, 572, 904, 1438, 935, 1189, 1213, 301, 1544], dtype=int64)" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lowest_R" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['crater',\n", " 'soderbergh',\n", " 'crap',\n", " 'porn',\n", " 'disappointment',\n", " 'vargas',\n", " 'naschy',\n", " 'dog',\n", " 'worst',\n", " 'fuqua']" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[v.itos[k] for k in lowest_R]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### There's only one movie review that mentions \"soderbergh\"" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<800x1 sparse matrix of type ''\n", "\twith 1 stored elements in Compressed Sparse Row format>" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "token = 'soderbergh'\n", "train_doc_term[:,v.stoi[token]]" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "review # 434 has 13 occurrences of \"soderbergh\"\n", "\n", "xxbos xxmaj now that xxmaj che(2008 ) has finished its relatively short xxmaj australian cinema run ( extremely limited xxunk screen in xxmaj xxunk , after xxunk ) , i can 
xxunk join both xxunk of \" xxmaj at xxmaj the xxmaj movies \" in taking xxmaj steven xxmaj soderbergh to task . \n", " \n", " xxmaj it 's usually satisfying to watch a film director change his style / subject , but xxmaj soderbergh 's most recent stinker , xxmaj the xxmaj girlfriend xxmaj xxunk ) , was also missing a story , so narrative ( and editing ? ) seem to suddenly be xxmaj soderbergh 's main challenge . xxmaj strange , after xxunk years in the business . xxmaj he was probably never much good at narrative , just xxunk it well inside \" edgy \" projects . \n", " \n", " xxmaj none of this excuses him this present , almost diabolical failure . xxmaj as xxmaj david xxmaj xxunk xxunk , \" two parts of xxmaj che do n't ( even ) make a whole \" . \n", " \n", " xxmaj epic xxunk in name only , xxmaj che(2008 ) barely qualifies as a feature film ! xxmaj it certainly has no legs , xxunk as except for its xxunk ultimate resolution forced upon it by history , xxmaj soderbergh 's xxunk - long xxunk just goes nowhere . \n", " \n", " xxmaj even xxmaj margaret xxmaj xxunk , the more xxunk of xxmaj australia 's xxmaj at xxmaj the xxmaj movies duo , noted about xxmaj soderbergh 's xxunk waste of ( xxup xxunk digital xxunk ) : \" you 're in the woods ... xxunk in the woods ... xxunk in the woods ... \" . i too am surprised xxmaj soderbergh did n't give us another xxunk of xxup that somewhere between his xxunk two xxmaj parts , because he still left out massive xxunk of xxmaj che 's \" xxunk \" life ! \n", " \n", " xxmaj for a xxunk of an important but infamous historical figure , xxmaj soderbergh xxunk xxunk , if not deliberately insults , his audiences by \n", " \n", " 1 . never providing most of xxmaj che 's story ; \n", " \n", " 2 . xxunk xxunk film xxunk with mere xxunk xxunk ; \n", " \n", " 3 . xxunk both true xxunk and a narrative of events ; \n", " \n", " 4 . barely developing an idea , or a character ; \n", " \n", " 5 . remaining xxunk episodic ; \n", " \n", " 6 . 
xxunk proper context for scenes --- whatever we do get is xxunk in xxunk xxunk ; \n", " \n", " 7 . xxunk xxunk all audiences ( even xxmaj spanish - xxunk will be confused by the xxunk xxunk in xxmaj english ) ; and \n", " \n", " 8 . xxunk xxunk his main subject into one dimension . xxmaj why , at xxup this late stage ? xxmaj the t - shirt franchise has been a success ! \n", " \n", " xxmaj our sense of xxunk is surely due to xxmaj peter xxmaj xxunk and xxmaj benjamin xxunk xxmaj xxunk xxunk their screenplay solely on xxmaj xxunk 's memoirs . xxmaj so , like a poor student who has read only xxup one of his xxunk xxunk for his xxunk , xxmaj soderbergh 's product is xxunk limited in perspective . \n", " \n", " xxmaj the audience is held captive within the same xxunk knowledge , scenery and circumstances of the \" revolutionaries \" , but that does n't xxunk our sympathy . xxmaj instead , it xxunk on us that \" xxmaj ah , xxmaj soderbergh 's trying to xxunk his audiences the same as the xxmaj latino peasants were at the time \" . xxmaj but these are the xxup same illiterate xxmaj latino peasants who xxunk out the good doctor to his enemies . xxmaj why does xxmaj soderbergh feel the need to xxunk us with them , and keep us equally mentally captive ? xxmaj such audience xxunk must have a purpose . \n", " \n", " xxmaj part2 is more xxunk than xxmaj part1 , but it 's literally mind - numbing with its repetitive bush - bashing , misery of xxunk , and lack of variety or character xxunk . deltoro 's xxmaj che has no opportunity to grow as a person while he struggles to xxunk his own ill - xxunk troops . xxmaj the only xxunk is the humour as xxmaj che deals with his sometimes deeply ignorant \" revolutionaries \" , some of whom xxunk lack self - control around local peasants or food . xxmaj we certainly get no insight into what caused the conditions , nor any xxunk xxunk of their xxunk xxunk , such as it was . 
\n", " \n", " xxmaj part2 's xxunk xxunk remains xxunk episodic : again , nothing is telegraphed or xxunk . xxmaj thus even the scenes with xxmaj xxunk xxmaj xxunk ( xxmaj xxunk xxmaj xxunk ) are unexpected and disconcerting . xxmaj any xxunk events are portrayed xxunk and xxmaj latino - xxunk , with xxmaj part1 's interviews xxunk by time - xxunk xxunk between the corrupt xxmaj xxunk president ( xxmaj xxunk de xxmaj xxunk ) and xxup us xxmaj government xxunk promising xxup cia xxunk ( ! ) . \n", " \n", " xxmaj the rest of xxmaj part2 's \" woods \" and day - for - night blue xxunk just xxunk the audience until they 're xxunk the xxunk . \n", " \n", " xxmaj perhaps deltoro felt too xxunk the frustration of many non - xxmaj american xxmaj latinos about never getting a truthful , xxunk history of xxmaj che 's xxunk within their own countries . xxmaj when foreign xxunk still wo n't deliver a free press to their people -- for whatever reason -- then one can see how a popular xxmaj american indie producer might set out to xxunk the not - so - well - read ( \" i may not be able to read or write , but i 'm xxup not xxunk . xxmaj the xxmaj inspector xxmaj xxunk ) ) out to their own local xxunk . xxmaj the film 's obvious xxunk and gross over - xxunk hint very strongly that it 's aiming only at the xxunk of the less - informed xxup who xxup still xxup speak xxup little xxmaj english . xxmaj if they did , they 'd have read xxunk on the subject already , and xxunk the relevant social issues amongst themselves -- learning the lessons of history as they should . \n", " \n", " xxmaj such insights are precisely what societies still need -- and not just the remaining illiterate xxmaj latinos of xxmaj central and xxmaj south xxmaj america -- yet it 's what xxmaj che(2008 ) xxunk fails to deliver . xxmaj soderbergh xxunk his lead because he 's weak on narrative . i am xxunk why xxmaj xxunk deltoro deliberately chose xxmaj soderbergh for this project if he knew this . 
xxmaj it 's been xxunk , xxunk about xxmaj xxunk was xxunk wanted : it 's what i went to see this film for , but the director xxunk robs us of that . \n", " \n", " xxmaj david xxmaj xxunk , writing in xxmaj the xxmaj australian ( xxunk ) observed that while xxmaj part1 was \" uneven \" , xxmaj part2 actually \" goes rapidly downhill \" from there , \" xxunk xxmaj che 's final xxunk in xxmaj xxunk in xxunk detail \" , which \" ... feels almost unbearably slow and turgid \" . \n", " \n", " xxmaj che : xxmaj the xxmaj xxunk aka xxmaj part2 is certainly no xxunk for xxmaj xxunk , painting it a picture of misery and xxunk . xxmaj the entire second half is only xxunk by the aforementioned humour , and the dramatic -- yet tragic -- capture and execution of the film 's subject . \n", " \n", " xxmaj the rest of this xxunk cinema xxunk is just confusing , irritating misery -- xxunk , for a xxmaj soderbergh film , to be avoided at all costs . xxmaj it is bound to break the hearts of all who know even just a xxunk about the xxunk / 10 )\n" ] } ], "source": [ "index = np.argmax(train_doc_term[:,v.stoi[token]])\n", "n_times = train_doc_term[index,v.stoi[token]]\n", "print(f'review # {index} has {n_times} occurrences of \"{token}\"\\n')\n", "print(movie_reviews.train.x[index].text)\n" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<800x1 sparse matrix of type ''\n", "\twith 1 stored elements in Compressed Sparse Row format>" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_doc_term[:,v.stoi[token]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 9C. 
Compute the prior probabilities for each class" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The prior probabilities for positive and negative classes are 0.47875 and 0.52125\n" ] } ], "source": [ "p = (y.items==positive).mean()\n", "q = (y.items==negative).mean()\n", "print(f'The prior probabilities for positive and negative classes are {p} and {q}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The log probability ratio\n", "\n", "### $b = \\text{log} \\frac{p} {q}$ \n", "\n", "#### is a measure of the `bias`, or `imbalance`, in the data set. \n", "\n", "* $b = 0$ indicates a perfectly balanced data set\n", "* $b \\gt 0$ indicates bias towards `positive` reviews \n", "* $b \\lt 0$ indicates bias towards `negative` reviews " ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The log probability ratio is b = -0.08505123261815539\n" ] } ], "source": [ "b = np.log((y.items==positive).mean() / (y.items==negative).mean())\n", "print(f'The log probability ratio is b = {b}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### We see that the training set is slightly imbalanced toward `negative` reviews." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 9D. Putting it all together: the Naive Bayes Movie Review Classifier\n", "In this section, we'll start with a discussion of Bayes' Theorem, then we'll use it to derive the Naive Bayes Classifier. Next we'll apply the Naive Bayes classifier to our movie reviews problem. Finally we'll review the prescription for building a Naive Bayes Classifier. 
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 9D.1 What is Bayes' Theorem, and what does it have to say about IMDb movie reviews?\n", "\n", "Consider two events, $A$ and $B$ \n", "Then the probability of $A$ and $B$ occurring together can be written in two ways:\n", "$p(A,B) = p(A|B)\\cdot p(B)$\n", "$p(A,B) = p(B|A)\\cdot p(A)$\n", "\n", "where $p(A|B)$ and $p(B|A)$ are conditional probabilities:\n", "$p(A|B)$ is the probability of $A$ occurring given that $B$ has occurred,\n", "$p(B|A)$ is the probability of $B$ occurring given that $A$ has occurred,\n", "$p(A)$ is the probability that $A$ occurs,\n", "$p(B)$ is the probability that $B$ occurs\n", "\n", "\n", "$\\textbf{Bayes' Theorem}$ is just the statement that the right hand sides of the above two equations are equal:\n", "\n", "$p(A|B) \\cdot p(B) = p(B|A) \\cdot p(A)$\n", "\n", "Applying $\\textbf{Bayes' Theorem}$ to our IMDb movie review problem:\n", "\n", "We identify $A$ and $B$ as
\n", "$A \\equiv \\text{class}$, i.e. positive or negative, and
\n", "$B \\equiv \\text{tokens}$, i.e. the \"bag\" of tokens used in the review\n", "\n", "Then $\\textbf{Bayes Theorem}$ says\n", "\n", "$p(\\text{class}|\\text{tokens})\\cdot p(\\text{tokens}) = p(\\text{tokens}|\\text{class}) \\cdot p(\\text{class})$\n", "\n", "so that
\n", "$p(\\text{class}|\\text{tokens}) = p(\\text{tokens}|\\text{class})\\cdot \\frac{p(\\text{class})}{p(\\text{tokens})}$\n", "\n", "Since $p(\\text{tokens})$ does not depend on the class, it is the same constant for both classes, and we have the proportionality \n", "\n", "$p(\\text{class}|\\text{tokens}) \\propto p(\\text{tokens}|\\text{class})\\cdot p(\\text{class})$\n", "\n", "The left hand side of the above expression is called the $\\textbf{posterior class probability}$, the probability that the review is positive (or negative), given the tokens it contains. This is exactly what we want to predict!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 9D.2 The Naive Bayes Classifier\n", "\n", "#### Given the list of tokens in a review, we seek to predict whether the review is rated as `positive` or `negative` \n", "\n", "#### We can make the prediction if we know the `posterior class probabilities`\n", "\n", "#### $p(\\text{class}|\\text{tokens})$,\n", "#### where $\\text{class}$ is either `positive` or `negative`, and $\\text{tokens}$ is the list of tokens that appear in the review.\n", "#### [Bayes' Theorem](https://en.wikipedia.org/wiki/Bayes%27_theorem) tells us that the posterior probabilities, the likelihoods and the priors are related this way:\n", "\n", "#### $p(\\text{class}|\\text{tokens}) \\propto p(\\text{tokens}|\\text{class})\\cdot p(\\text{class})$\n", "\n", "#### Now the tokens are not independent of one another. For example, 'go' often appears with 'to', so if 'go' appears in a review it is more likely that the review also contains 'to'. Nevertheless, assuming the tokens are conditionally independent given the class allows us to simplify things, so we recklessly do it, hoping it's not too wrong!\n", "#### $p(\\text{tokens}|\\text{class}) = \\prod_{i=1}^{n} p(t_{i}|\\text{class})$\n", "\n", "#### where $t_{i}$ is the $i\\text{th}$ token in the vocabulary and $n$ is the number of tokens in the vocabulary. 
\n", "\n", "#### So Bayes' theorem is\n", "\n", "#### $p(\\text{class}|\\text{tokens}) \\propto p(\\text{class}) \\prod_{i=1}^{n} p(t_{i}|\\text{class}) $\n", "\n", "#### Taking the ratio of the $\\textbf{posterior class probabilities}$ for the `positive` and `negative` classes, we have\n", "\n", "#### $\\frac{p(+|\\text{tokens})}{p( - |\\text{tokens})} = \\frac{p(+)}{p( - )} \\cdot \\prod_{i=1}^{n} \\frac {p(t_{i}|+)} {p(t_{i}| - )} = \\frac{p}{q} \\cdot \\prod_{i=1}^{n} \\frac {L(t_{i}|+)} {L(t_{i}| - )}$\n", "#### since likelihoods are proportional to probabilities.\n", "#### Taking the log of both sides converts this to a `linear` problem:\n", "#### $\\text{log} \\frac{p(+|\\text{tokens})}{p( - |\\text{tokens})} = \\text{log}\\frac{p}{q} + \\sum_{i=1}^{n} \\text{log} \\frac {L(t_{i}|+)} {L(t_{i}| - )} = b + \\sum_{i=1}^{n} R_{t_{i}}$\n", "\n", "#### The first term on the right-hand side is the `bias`, and the second term is the dot product of the *binarized* embedding vector and the log-count ratios\n", "\n", "#### If the left-hand side is greater than or equal to zero, we predict the review is `positive`, else we predict the review is `negative`. 
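The decision rule above can be checked on a tiny made-up example. This is only a sketch under stated assumptions: the 4 reviews, the 3-token vocabulary and the labels are invented for illustration, the binarized matrix is used for the token counts as well, and a strict `> 0` threshold is used as in the implementation cell of section 9E.

```python
import numpy as np

# Hypothetical binarized document-term matrix: 4 reviews (rows) x 3 tokens (columns).
W = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1],
              [0, 1, 0]])
y = np.array([1, 1, 0, 0])  # made-up labels: 1 = positive, 0 = negative

# Smoothed class likelihoods of each token (add one to counts and to class sizes).
L1 = (W[y == 1].sum(0) + 1) / ((y == 1).sum() + 1)
L0 = (W[y == 0].sum(0) + 1) / ((y == 0).sum() + 1)

R = np.log(L1 / L0)                              # log-count ratio for each token
b = np.log((y == 1).mean() / (y == 0).mean())    # bias: log of the prior ratio p/q

log_odds = W @ R + b   # b plus the sum of R over the tokens present in each review
preds = log_odds > 0   # True means the review is predicted positive

print(R, b, preds)
```

Because the toy classes are balanced, the bias is zero and each prediction is decided entirely by the summed log-count ratios of the tokens present in the review.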
\n", "\n", "#### We can re-write the last equation in matrix form to generate an $m \\times 1$ column vector $\\textbf{preds}$ of log posterior ratios, one per review; comparing its entries with zero gives the boolean review predictions:\n", "\n", "#### $\\textbf{preds} = \\textbf{W} \\cdot \\textbf{R} + \\textbf{b}$\n", "#### where \n", "\n", "* $\\textbf{preds} \\equiv \\text{log} \\frac{p(+|\\text{tokens})}{p( - |\\text{tokens})}$\n", "* $\\textbf{W}$ is the $m\\times n$ `binarized document-term matrix`, whose rows are the binarized embedding vectors for the movie reviews\n", "* $\\textbf{R}$ is the $n\\times 1$ vector of `log-count ratios` for the tokens, and \n", "* $\\textbf{b}$ is an $m\\times 1$ vector whose entries are the bias $b$\n", "\n", "\n", "#### The Naive Bayes model consists of the log-count ratios vector $\\textbf{R}$ and the bias $\\textbf{b}$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 9E. Implement our Naive Bayes Movie Review classifier\n", "#### and use it to predict labels for the training and validation sets of the IMDb_sample data." ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The prediction accuracy for the training set is 0.9\n" ] } ], "source": [ "W = train_doc_term.sign()\n", "preds_train = (W @ R + b) > 0\n", "train_accuracy = (preds_train == y.items).mean()\n", "print(f'The prediction accuracy for the training set is {train_accuracy}')" ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The prediction accuracy for the validation set is 0.68\n" ] } ], "source": [ "W = valid_doc_term.sign()\n", "preds_valid = (W @ R + b) > 0\n", "valid_accuracy = (preds_valid == valid_y.items).mean()\n", "print(f'The prediction accuracy for the validation set is {valid_accuracy}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 9F. 
Summary: A recipe for the Naive Bayes Classifier\n", "#### Here is a summary of our procedure for predicting labels with the Naive Bayes Classifier, starting with the training set `x` and the training labels `y`\n", "\n", "\n", "#### 1. Compute the token count vectors\n", "> C0 = np.squeeze(np.asarray(x[y.items==negative].sum(0)))
\n", "> C1 = np.squeeze(np.asarray(x[y.items==positive].sum(0)))
\n", "\n", "#### 2. Compute the token class likelihood vectors\n", "> L0 = (C0+1) / ((y.items==negative).sum() + 1)
\n", "> L1 = (C1+1) / ((y.items==positive).sum() + 1)
\n", "\n", "#### 3. Compute the log-count ratios vector\n", "> R = np.log(L1/L0)\n", "\n", "#### 4. Compute the bias term\n", "> b = np.log((y.items==positive).mean() / (y.items==negative).mean())\n", "\n", "#### 5. The Naive Bayes model consists of the log-count ratios vector $\\textbf{R}$ and the bias $\\textbf{b}$\n", "#### 6. Predict the movie review labels from a linear transformation of the log-count ratios vector:\n", "> preds = (W @ R + b) > 0,
\n", "> where the weights matrix W = valid_doc_term.sign() is the binarized `valid_doc_term matrix` whose rows are the binarized embedding vectors for the movie reviews for which you want to predict ratings.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 10. Working with the full IMDb data set" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have our approach working on a smaller sample of the data, we can try using it on the full dataset." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 10A. Download the data" ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/data_clas.pkl'),\n", " WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/data_lm.pkl'),\n", " WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/finetuned.pth'),\n", " WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/finetuned_enc.pth'),\n", " WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/imdb.vocab'),\n", " WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/ld.pkl'),\n", " WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/ll_clas.pkl'),\n", " WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/ll_lm.pkl'),\n", " WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/models'),\n", " WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/pretrained'),\n", " WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/README'),\n", " WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/test'),\n", " WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/tmp_clas'),\n", " WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/tmp_lm'),\n", " WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/train'),\n", " WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/unsup'),\n", " WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/vocab_lm.pkl')]" ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" 
} ], "source": [ "path = untar_data(URLs.IMDB)\n", "path.ls()" ] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/train/labeledBow.feat'),\n", " WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/train/neg'),\n", " WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/train/pos'),\n", " WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/train/unsupBow.feat')]" ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(path/'train').ls()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 10B. Preprocess the data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The attempt to split and label the data fails most of the time, throwing a `BrokenProcessPool` error, so we apply a `brute force` approach and retry until it succeeds. It takes about 10 minutes if it succeeds on the first try." ] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "failure count is 9\n", "\n", "Wall time: 13min 4s\n" ] } ], "source": [ "%%time\n", "# Sometimes throws a `BrokenProcessPool` error. 
Keep trying until it works!\n", "count = 0\n", "error = True\n", "while error:\n", "    try:\n", "        # Preprocessing steps\n", "        reviews_full = (TextList.from_folder(path)\n", "            # Make a `TextList` object that is a list of `WindowsPath` objects,\n", "            # each of which contains the full path to one of the data files.\n", "            .split_by_folder(valid='test')\n", "            # Generate a `LabelLists` object that splits files by training and validation folders\n", "            # Note: .label_from_folder in next line causes the `BrokenProcessPool` error\n", "            .label_from_folder(classes=['neg', 'pos']))\n", "        # Create a `CategoryLists` object which contains the data and\n", "        # its labels that are derived from folder names\n", "        error = False\n", "        print(f'failure count is {count}\\n')\n", "    except Exception:  # catch ordinary errors, but still let a keyboard interrupt stop the loop\n", "        # accumulate failure count\n", "        count = count + 1\n", "        print(f'failure count is {count}')\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 10C. Create document-term matrices for training and validation sets. \n", "#### This takes about 4 sec per matrix" ] }, { "cell_type": "code", "execution_count": 81, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Wall time: 3.72 s\n" ] } ], "source": [ "%%time\n", "valid_doc_term = get_doc_term_matrix(reviews_full.valid.x, len(reviews_full.vocab.itos))" ] }, { "cell_type": "code", "execution_count": 82, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Wall time: 3.78 s\n" ] } ], "source": [ "%%time\n", "train_doc_term = get_doc_term_matrix(reviews_full.train.x, len(reviews_full.vocab.itos))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 10D. 
Save the data\n", "When storing data like this, always make sure it's included in your `.gitignore` file" ] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [], "source": [ "scipy.sparse.save_npz(\"train_doc_term.npz\", train_doc_term)" ] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [], "source": [ "scipy.sparse.save_npz(\"valid_doc_term.npz\", valid_doc_term)" ] }, { "cell_type": "code", "execution_count": 85, "metadata": {}, "outputs": [], "source": [ "with open('reviews_full.pickle', 'wb') as handle:\n", "    pickle.dump(reviews_full, handle, protocol=pickle.HIGHEST_PROTOCOL)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### In the future, we'll just be able to load our data:" ] }, { "cell_type": "code", "execution_count": 86, "metadata": {}, "outputs": [], "source": [ "train_doc_term = scipy.sparse.load_npz(\"train_doc_term.npz\")\n", "valid_doc_term = scipy.sparse.load_npz(\"valid_doc_term.npz\")" ] }, { "cell_type": "code", "execution_count": 87, "metadata": {}, "outputs": [], "source": [ "with open('reviews_full.pickle', 'rb') as handle:\n", "    # keep the loaded object; without the assignment the result is discarded\n", "    reviews_full = pickle.load(handle)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 11. Understanding Fastai's API$^\\dagger$ for text data sets
\n", "$^\\dagger$API $\\equiv$ Application Programming Interface" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### `reviews_full` is a `LabelLists` object, which contains `LabelList` objects `train`, `valid` and potentially `test`" ] }, { "cell_type": "code", "execution_count": 88, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "fastai.data_block.LabelLists" ] }, "execution_count": 88, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(reviews_full)" ] }, { "cell_type": "code", "execution_count": 89, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "fastai.data_block.LabelList" ] }, "execution_count": 89, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(reviews_full.valid)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### `reviews_full` also contains the `vocab` object, though it is not shown by the `dir()` command. This is an error." ] }, { "cell_type": "code", "execution_count": 90, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "print(reviews_full.vocab)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### We will store the `vocabulary` in a variable `full_vocab`" ] }, { "cell_type": "code", "execution_count": 91, "metadata": {}, "outputs": [], "source": [ "full_vocab = reviews_full.vocab" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Recall that a `vocab` object has an attribute `itos`, which is a list of tokens indexed by token id" ] }, { "cell_type": "code", "execution_count": 92, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "['bad',\n", " 'people',\n", " 'will',\n", " 'other',\n", " 'also',\n", " 'into',\n", " 'first',\n", " 'because',\n", " 'great',\n", " 'how']" ] }, "execution_count": 92, "metadata": {}, "output_type": "execute_result" } ], "source": [ "full_vocab.itos[100:110]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### A `LabelList` object 
contains a `TextList` object `x` and a `CategoryList` object `y` " ] }, { "cell_type": "code", "execution_count": 93, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LabelList (25000 items)\n", "x: TextList\n", "xxbos xxmaj once again xxmaj mr. xxmaj costner has dragged out a movie for far longer than necessary . xxmaj aside from the terrific sea rescue sequences , of which there are very few i just did not care about any of the characters . xxmaj most of us have ghosts in the closet , and xxmaj costner 's character are realized early on , and then forgotten until much later , by which time i did not care . xxmaj the character we should really care about is a very cocky , overconfident xxmaj ashton xxmaj kutcher . xxmaj the problem is he comes off as kid who thinks he 's better than anyone else around him and shows no signs of a cluttered closet . xxmaj his only obstacle appears to be winning over xxmaj costner . xxmaj finally when we are well past the half way point of this stinker , xxmaj costner tells us all about xxmaj kutcher 's ghosts . xxmaj we are told why xxmaj kutcher is driven to be the best with no prior inkling or foreshadowing . xxmaj no magic here , it was all i could do to keep from turning it off an hour in .,xxbos xxmaj this is an example of why the majority of action films are the same . xxmaj generic and boring , there 's really nothing worth watching here . a complete waste of the then barely - tapped talents of xxmaj ice - t and xxmaj ice xxmaj cube , who 've each proven many times over that they are capable of acting , and acting well . xxmaj do n't bother with this one , go see xxmaj new xxmaj jack xxmaj city , xxmaj ricochet or watch xxmaj new xxmaj york xxmaj undercover for xxmaj ice - t , or xxmaj boyz n the xxmaj hood , xxmaj higher xxmaj learning or xxmaj friday for xxmaj ice xxmaj cube and see the real deal . 
xxmaj ice - t 's horribly cliched dialogue alone makes this film grate at the teeth , and i 'm still wondering what the heck xxmaj bill xxmaj paxton was doing in this film ? xxmaj and why the heck does he always play the exact same character ? xxmaj from xxmaj aliens onward , every film i 've seen with xxmaj bill xxmaj paxton has him playing the exact same irritating character , and at least in xxmaj aliens his character died , which made it somewhat gratifying ... \n", " \n", " xxmaj overall , this is second - rate action trash . xxmaj there are countless better films to see , and if you really want to see this one , watch xxmaj judgement xxmaj night , which is practically a carbon copy but has better acting and a better script . xxmaj the only thing that made this at all worth watching was a decent hand on the camera - the cinematography was almost refreshing , which comes close to making up for the horrible film itself - but not quite . 4 / 10 .,xxbos xxmaj first of all i hate those moronic rappers , who could'nt act if they had a gun pressed against their foreheads . xxmaj all they do is curse and shoot each other and acting like xxunk version of gangsters . \n", " \n", " xxmaj the movie does n't take more than five minutes to explain what is going on before we 're already at the warehouse xxmaj there is not a single sympathetic character in this movie , except for the homeless guy , who is also the only one with half a brain . \n", " \n", " xxmaj bill xxmaj paxton and xxmaj william xxmaj sadler are both hill xxunk and xxmaj xxunk character is just as much a villain as the gangsters . i did'nt like him right from the start . \n", " \n", " xxmaj the movie is filled with pointless violence and xxmaj walter xxmaj hills specialty : people falling through windows with glass flying everywhere . xxmaj there is pretty much no plot and it is a big problem when you root for no - one . 
xxmaj everybody dies , except from xxmaj paxton and the homeless guy and everybody get what they deserve . \n", " \n", " xxmaj the only two black people that can act is the homeless guy and the junkie but they 're actors by profession , not annoying ugly brain dead rappers . \n", " \n", " xxmaj stay away from this crap and watch 48 hours 1 and 2 instead . xxmaj at lest they have characters you care about , a sense of humor and nothing but real actors in the cast .,xxbos xxmaj not even the xxmaj beatles could write songs everyone liked , and although xxmaj walter xxmaj hill is no mop - top he 's second to none when it comes to thought provoking action movies . xxmaj the nineties came and social platforms were changing in music and film , the emergence of the xxmaj rapper turned movie star was in full swing , the acting took a back seat to each man 's overpowering regional accent and transparent acting . xxmaj this was one of the many ice - t movies i saw as a kid and loved , only to watch them later and cringe . xxmaj bill xxmaj paxton and xxmaj william xxmaj sadler are firemen with basic lives until a burning building tenant about to go up in flames hands over a map with gold implications . i hand it to xxmaj walter for quickly and neatly setting up the main characters and location . xxmaj but i fault everyone involved for turning out xxmaj lame - o performances . xxmaj ice - t and cube must have been red hot at this time , and while i 've enjoyed both their careers as rappers , in my opinion they fell flat in this movie . xxmaj it 's about ninety minutes of one guy ridiculously turning his back on the other guy to the point you find yourself locked in multiple states of disbelief . xxmaj now this is a movie , its not a documentary so i wo nt waste my time recounting all the stupid plot twists in this movie , but there were many , and they led nowhere . i got the feeling watching this that everyone on set was xxunk of confused and just playing things off the cuff . 
xxmaj there are two things i still enjoy about it , one involves a scene with a needle and the other is xxmaj sadler 's huge 45 pistol . xxmaj bottom line this movie is like domino 's pizza . xxmaj yeah ill eat it if i 'm hungry and i do n't feel like cooking , xxmaj but i 'm well aware it tastes like crap . 3 stars , meh .,xxbos xxmaj brass pictures ( movies is not a fitting word for them ) really are somewhat brassy . xxmaj their alluring visual qualities are reminiscent of expensive high class xxup tv commercials . xxmaj but unfortunately xxmaj brass pictures are feature films with the pretense of wanting to entertain viewers for over two hours ! xxmaj in this they fail miserably , their undeniable , but rather soft and flabby than steamy , erotic qualities non withstanding . \n", " \n", " xxmaj xxunk ' 45 is a remake of a film by xxmaj luchino xxmaj visconti with the same title and xxmaj alida xxmaj valli and xxmaj farley xxmaj granger in the lead . xxmaj the original tells a story of senseless love and lust in and around xxmaj venice during the xxmaj italian wars of independence . xxmaj brass moved the action from the 19th into the 20th century , 1945 to be exact , so there are xxmaj mussolini xxunk , men in black shirts , xxmaj german uniforms or the tattered garb of the xxunk . xxmaj but it is just window dressing , the historic context is completely negligible . \n", " \n", " xxmaj anna xxmaj xxunk plays the attractive aristocratic woman who falls for the amoral xxup ss guy who always puts on too much lipstick . xxmaj she is an attractive , versatile , well trained xxmaj italian actress and clearly above the material . xxmaj her wide range of facial expressions ( xxunk boredom , loathing , delight , fear , hate ... and ecstasy ) are the best reason to watch this picture and worth two stars . xxmaj she endures this basically trashy stuff with an astonishing amount of dignity . i wish some really good parts come along for her . 
xxmaj she really deserves it .\n", "y: CategoryList\n", "neg,neg,neg,neg,neg\n", "Path: C:\\Users\\cross-entropy\\.fastai\\data\\imdb" ] }, "execution_count": 93, "metadata": {}, "output_type": "execute_result" } ], "source": [ "reviews_full.valid" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### A `TextList` object is a list of `Text` objects containing the reviews as items" ] }, { "cell_type": "code", "execution_count": 94, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "fastai.text.data.Text" ] }, "execution_count": 94, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(reviews_full.valid.x[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### A `Text` object has properties \n", "#### `text`, which is a `str` containing the review text:" ] }, { "cell_type": "code", "execution_count": 95, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"xxbos xxmaj once again xxmaj mr. xxmaj costner has dragged out a movie for far longer than necessary . xxmaj aside from the terrific sea rescue sequences , of which there are very few i just did not care about any of the characters . xxmaj most of us have ghosts in the closet , and xxmaj costner 's character are realized early on , and then forgotten until much later , by which time i did not care . xxmaj the character we should really care about is a very cocky , overconfident xxmaj ashton xxmaj kutcher . xxmaj the problem is he comes off as kid who thinks he 's better than anyone else around him and shows no signs of a cluttered closet . xxmaj his only obstacle appears to be winning over xxmaj costner . xxmaj finally when we are well past the half way point of this stinker , xxmaj costner tells us all about xxmaj kutcher 's ghosts . xxmaj we are told why xxmaj kutcher is driven to be the best with no prior inkling or foreshadowing . 
xxmaj no magic here , it was all i could do to keep from turning it off an hour in .\"" ] }, "execution_count": 95, "metadata": {}, "output_type": "execute_result" } ], "source": [ "reviews_full.valid.x[0].text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### and `data`, which is an array of integers representing the tokens in the review:" ] }, { "cell_type": "code", "execution_count": 96, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 2, 5, 303, 192, ..., 50, 555, 18, 10], dtype=int64)" ] }, "execution_count": 96, "metadata": {}, "output_type": "execute_result" } ], "source": [ "reviews_full.valid.x[0].data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### A `TextList` object also has an attribute `.items`, which holds the integer array representations of all the reviews" ] }, { "cell_type": "code", "execution_count": 97, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([array([ 2, 5, 303, 192, ..., 50, 555, 18, 10], dtype=int64),\n", " array([ 2, 5, 20, 16, ..., 236, 126, 182, 10], dtype=int64),\n", " array([ 2, 5, 106, 14, ..., 18, 9, 197, 10], dtype=int64),\n", " array([ 2, 5, 38, 77, ..., 399, 11, 23500, 10], dtype=int64), ...,\n", " array([ 2, 5, 279, 19, ..., 32312, 78, 608, 10], dtype=int64),\n", " array([ 2, 5, 53, 9, ..., 51, 336, 56, 10], dtype=int64),\n", " array([ 2, 5, 20, 30, ..., 44, 1161, 5947, 10], dtype=int64),\n", " array([ 2, 19, 161, 130, ..., 78, 127, 3208, 10], dtype=int64)], dtype=object)" ] }, "execution_count": 97, "metadata": {}, "output_type": "execute_result" } ], "source": [ "reviews_full.valid.x.items" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Review labels are stored as a `CategoryList` object" ] }, { "cell_type": "code", "execution_count": 98, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "fastai.data_block.CategoryList" ] }, "execution_count": 98, "metadata": {}, "output_type": "execute_result" } ], "source": [ 
"type(reviews_full.valid.y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### A `CategoryList` object is a list of `Category` objects" ] }, { "cell_type": "code", "execution_count": 99, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "fastai.core.Category" ] }, "execution_count": 99, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(reviews_full.valid.y[0])" ] }, { "cell_type": "code", "execution_count": 100, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Category neg" ] }, "execution_count": 100, "metadata": {}, "output_type": "execute_result" } ], "source": [ "reviews_full.valid.y[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### A `CategoryList` object also has an attribute `.items`, which holds an array of integer labels for all the reviews" ] }, { "cell_type": "code", "execution_count": 101, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0, 0, 0, 0, ..., 1, 1, 1, 1], dtype=int64)" ] }, "execution_count": 101, "metadata": {}, "output_type": "execute_result" } ], "source": [ "reviews_full.valid.y.items" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The label of the first review seems right" ] }, { "cell_type": "code", "execution_count": 102, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Category neg" ] }, "execution_count": 102, "metadata": {}, "output_type": "execute_result" } ], "source": [ "reviews_full.valid.y[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Names of classes" ] }, { "cell_type": "code", "execution_count": 103, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['neg', 'pos']" ] }, "execution_count": 103, "metadata": {}, "output_type": "execute_result" } ], "source": [ "reviews_full.valid.y.classes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Number of classes" ] }, { "cell_type": "code", "execution_count": 104, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2" ] 
}, "execution_count": 104, "metadata": {}, "output_type": "execute_result" } ], "source": [ "reviews_full.valid.y.c" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The classes have both integer and string representations:" ] }, { "cell_type": "code", "execution_count": 105, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'neg': 0, 'pos': 1}" ] }, "execution_count": 105, "metadata": {}, "output_type": "execute_result" } ], "source": [ "reviews_full.valid.y.c2i" ] }, { "cell_type": "code", "execution_count": 106, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 106, "metadata": {}, "output_type": "execute_result" } ], "source": [ "reviews_full.valid.y[0].data" ] }, { "cell_type": "code", "execution_count": 107, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'neg'" ] }, "execution_count": 107, "metadata": {}, "output_type": "execute_result" } ], "source": [ "reviews_full.valid.y[0].obj" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The training and validation data sets each have 25000 samples" ] }, { "cell_type": "code", "execution_count": 108, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(25000, 25000)" ] }, "execution_count": 108, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(reviews_full.train), len(reviews_full.valid)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 12. 
The Naive Bayes classifier with the full IMDb dataset" ] }, { "cell_type": "code", "execution_count": 109, "metadata": {}, "outputs": [], "source": [ "x=train_doc_term\n", "y=reviews_full.train.y\n", "valid_y = reviews_full.valid.y.items" ] }, { "cell_type": "code", "execution_count": 110, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<25000x38464 sparse matrix of type ''\n", "\twith 3716501 stored elements in Compressed Sparse Row format>" ] }, "execution_count": 110, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x" ] }, { "cell_type": "code", "execution_count": 111, "metadata": {}, "outputs": [], "source": [ "positive = y.c2i['pos']\n", "negative = y.c2i['neg']" ] }, { "cell_type": "code", "execution_count": 112, "metadata": {}, "outputs": [], "source": [ "C0 = np.squeeze(np.asarray(x[y.items==negative].sum(0)))\n", "C1 = np.squeeze(np.asarray(x[y.items==positive].sum(0)))" ] }, { "cell_type": "code", "execution_count": 113, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([26553, 0, 12500, 0, ..., 0, 0, 0, 0], dtype=int32)" ] }, "execution_count": 113, "metadata": {}, "output_type": "execute_result" } ], "source": [ "C0" ] }, { "cell_type": "code", "execution_count": 114, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([28399, 0, 12500, 0, ..., 0, 0, 0, 0], dtype=int32)" ] }, "execution_count": 114, "metadata": {}, "output_type": "execute_result" } ], "source": [ "C1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 12A. 
Data exploration: log-count ratios" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Token likelihoods conditioned on class" ] }, { "cell_type": "code", "execution_count": 115, "metadata": {}, "outputs": [], "source": [ "L1 = (C1+1) / ((y.items==positive).sum() + 1)\n", "L0 = (C0+1) / ((y.items==negative).sum() + 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### log-count ratios" ] }, { "cell_type": "code", "execution_count": 116, "metadata": {}, "outputs": [], "source": [ "R = np.log(L1/L0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Examples of log-count ratios for a few words\n", "Check that log-count ratios are negative for words with `negative` sentiment and positive for words with `positive` sentiment! " ] }, { "cell_type": "code", "execution_count": 117, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "-0.7133498878774648" ] }, "execution_count": 117, "metadata": {}, "output_type": "execute_result" } ], "source": [ "R[full_vocab.stoi['hated']]" ] }, { "cell_type": "code", "execution_count": 118, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.1563661500586044" ] }, "execution_count": 118, "metadata": {}, "output_type": "execute_result" } ], "source": [ "R[full_vocab.stoi['loved']]" ] }, { "cell_type": "code", "execution_count": 119, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.4418327522790391" ] }, "execution_count": 119, "metadata": {}, "output_type": "execute_result" } ], "source": [ "R[full_vocab.stoi['liked']]" ] }, { "cell_type": "code", "execution_count": 120, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "-2.2826243504315076" ] }, "execution_count": 120, "metadata": {}, "output_type": "execute_result" } ], "source": [ "R[full_vocab.stoi['worst']]" ] }, { "cell_type": "code", "execution_count": 121, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.7225576052173609" ] }, "execution_count": 121, "metadata": {}, "output_type": 
"execute_result" } ], "source": [ "R[full_vocab.stoi['best']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Since we have equal numbers of positive and negative reviews in this data set, the `bias` $b$ is 0." ] }, { "cell_type": "code", "execution_count": 122, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The bias term b is 0.0\n" ] } ], "source": [ "b = np.log((y.items==positive).mean() / (y.items==negative).mean())\n", "print(f'The bias term b is {b}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 12B. Predictions of the Naive Bayes Classifier for the full IMDb data set.\n", "#### We get much better accuracy this time, because of the larger training set." ] }, { "cell_type": "code", "execution_count": 123, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Validation accuracy is 0.83292 for the full data set\n" ] } ], "source": [ "# predict labels for the validation data\n", "W = valid_doc_term.sign()\n", "preds = (W @ R + b) > 0\n", "valid_accuracy = (preds == valid_y).mean()\n", "print(f'Validation accuracy is {valid_accuracy} for the full data set')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 13. The Logistic Regression classifier with the full IMDb data set" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### With the `scikit-learn` library, we can fit a logistic regression model where the features are the unigrams. Here $C$ is the inverse of the regularization strength, so smaller values of $C$ mean stronger regularization." 
] }, { "cell_type": "code", "execution_count": 124, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LogisticRegression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Using the full `document-term matrix`:" ] }, { "cell_type": "code", "execution_count": 125, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Validation accuracy is 0.88328 using the full doc-term matrix\n" ] } ], "source": [ "m = LogisticRegression(C=0.1, dual=False,solver = 'liblinear')\n", "# 'liblinear' and 'newton-cg' solvers both get 0.88328 accuracy\n", "# 'sag', 'saga', and 'lbfgs' don't converge\n", "m.fit(train_doc_term, y.items.astype(int))\n", "preds = m.predict(valid_doc_term)\n", "valid_accuracy = (preds==valid_y).mean()\n", "print(f'Validation accuracy is {valid_accuracy} using the full doc-term matrix')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Using the binarized `document-term` matrix gets a slightly higher accuracy:" ] }, { "cell_type": "code", "execution_count": 126, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Validation accuracy is 0.88532 using the binarized doc-term matrix\n" ] } ], "source": [ "m = LogisticRegression(C=0.1, dual=False,solver = 'liblinear')\n", "m.fit(train_doc_term.sign(), y.items.astype(int))\n", "preds = m.predict(valid_doc_term.sign())\n", "valid_accuracy = (preds==valid_y).mean()\n", "print(f'Validation accuracy is {valid_accuracy} using the binarized doc-term matrix')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 14. `Trigram` representation of the `IMDb_sample`: preprocessing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Our next model is a version of logistic regression with Naive Bayes features extended to include bigrams and trigrams as well as unigrams, described [here](https://www.aclweb.org/anthology/P12-2018). 
For every document we compute binarized features as described above, but this time we use bigrams and trigrams too. Each feature is a log-count ratio. A logistic regression model is then trained to predict sentiment. Because of the much larger number of features, we will return to the smaller `IMDb_sample` data set." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What are `ngrams`?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### An `n-gram` is a contiguous sequence of n items (where the items can be characters, syllables, or words). A `1-gram` is a `unigram`, a `2-gram` is a `bigram`, and a `3-gram` is a `trigram`.\n", "\n", "#### Here, we are referring to sequences of words. So examples of bigrams include \"the dog\", \"said that\", and \"can't you\"." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 14A. Get the IMDb_sample" ] }, { "cell_type": "code", "execution_count": 127, "metadata": {}, "outputs": [], "source": [ "path = untar_data(URLs.IMDB_SAMPLE)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Again we find that accessing the `TextList` API *sometimes* (about 50% of the time) throws a `BrokenProcessPool` error. This is puzzling, and I don't know why it happens, but it usually works on the first or second try." ] }, { "cell_type": "code", "execution_count": 128, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "failure count is 0\n", "\n", "Wall time: 14.9 s\n" ] } ], "source": [ "%%time\n", "# throws `BrokenProcessPool` error sometimes. 
Keep trying until it works!\n", "\n", "count = 0\n", "error = True\n", "while error:\n", "    try: \n", "        # Preprocessing steps\n", "        movie_reviews = (TextList.from_csv(path, 'texts.csv', cols='text')\n", "                         .split_from_df(col=2)\n", "                         .label_from_df(cols=0))\n", "\n", "        error = False\n", "        print(f'failure count is {count}\\n') \n", "    except Exception: # catch any error, but still allow a keyboard/kernel interrupt\n", "        # accumulate failure count\n", "        count = count + 1\n", "        print(f'failure count is {count}')\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### IMDb_sample vocabulary" ] }, { "cell_type": "code", "execution_count": 129, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "IMDb_sample vocabulary has 6016 tokens\n" ] } ], "source": [ "vocab_sample = movie_reviews.vocab.itos\n", "vocab_len = len(vocab_sample)\n", "print(f'IMDb_sample vocabulary has {vocab_len} tokens')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 14B. Create the `ngram-doc matrix` for the training data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Just as the `doc-term matrix` encodes the `token` features, the `ngram-doc matrix` encodes the `ngram` features." ] }, { "cell_type": "code", "execution_count": 130, "metadata": {}, "outputs": [], "source": [ "min_n=1\n", "max_n=3\n", "\n", "j_indices = []\n", "indptr = []\n", "values = []\n", "indptr.append(0)\n", "num_tokens = vocab_len\n", "\n", "itongram = dict()\n", "ngramtoi = dict()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### We will iterate through the sequences of words to create our n-grams. 
This takes several minutes:" ] }, { "cell_type": "code", "execution_count": 131, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Wall time: 2min 53s\n" ] } ], "source": [ "%%time\n", "for i, doc in enumerate(movie_reviews.train.x):\n", "    feature_counter = Counter(doc.data)\n", "    j_indices.extend(feature_counter.keys())\n", "    values.extend(feature_counter.values())\n", "    this_doc_ngrams = list()\n", "\n", "    m = 0\n", "    for n in range(min_n, max_n + 1):\n", "        # NOTE: k ranges over vocab_len, not len(doc.data), so windows that start\n", "        # near or past the end of the review come back shorter than n (or empty)\n", "        for k in range(vocab_len - n + 1):\n", "            ngram = doc.data[k: k + n]\n", "            if str(ngram) not in ngramtoi:\n", "                if len(ngram)==1:\n", "                    num = ngram[0]\n", "                    ngramtoi[str(ngram)] = num\n", "                    itongram[num] = ngram\n", "                else:\n", "                    ngramtoi[str(ngram)] = num_tokens\n", "                    itongram[num_tokens] = ngram\n", "                    num_tokens += 1\n", "            this_doc_ngrams.append(ngramtoi[str(ngram)])\n", "            m += 1\n", "\n", "    ngram_counter = Counter(this_doc_ngrams)\n", "    j_indices.extend(ngram_counter.keys())\n", "    values.extend(ngram_counter.values())\n", "    indptr.append(len(j_indices))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Using dictionaries to convert between indices and strings (in this case, for n-grams) is a common and useful approach! Here, we have created `itongram` (index to n-gram) and `ngramtoi` (n-gram to index) dictionaries." 
] }, { "cell_type": "code", "execution_count": 132, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Wall time: 161 ms\n" ] } ], "source": [ "%%time\n", "train_ngram_doc_matrix = scipy.sparse.csr_matrix((values, j_indices, indptr),\n", " shape=(len(indptr) - 1, len(ngramtoi)),\n", " dtype=int)" ] }, { "cell_type": "code", "execution_count": 133, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<800x260402 sparse matrix of type ''\n", "\twith 678912 stored elements in Compressed Sparse Row format>" ] }, "execution_count": 133, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_ngram_doc_matrix" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 14C. Examine some ngrams in the training data" ] }, { "cell_type": "code", "execution_count": 134, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(260402, 260402)" ] }, "execution_count": 134, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(ngramtoi), len(itongram)" ] }, { "cell_type": "code", "execution_count": 135, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([125, 340, 10], dtype=int64)" ] }, "execution_count": 135, "metadata": {}, "output_type": "execute_result" } ], "source": [ "itongram[20005]" ] }, { "cell_type": "code", "execution_count": 136, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "20005" ] }, "execution_count": 136, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ngramtoi[str(itongram[20005])]" ] }, { "cell_type": "code", "execution_count": 137, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('never', 'mind', '.')" ] }, "execution_count": 137, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vocab_sample[125],vocab_sample[340],vocab_sample[10], " ] }, { "cell_type": "code", "execution_count": 138, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([42, 49], dtype=int64)" ] }, "execution_count": 138, "metadata": 
{}, "output_type": "execute_result" } ], "source": [ "itongram[100000]" ] }, { "cell_type": "code", "execution_count": 139, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('have', 'an')" ] }, "execution_count": 139, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vocab_sample[42], vocab_sample[49]" ] }, { "cell_type": "code", "execution_count": 140, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 38, 862], dtype=int64)" ] }, "execution_count": 140, "metadata": {}, "output_type": "execute_result" } ], "source": [ "itongram[100010]" ] }, { "cell_type": "code", "execution_count": 141, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('are', 'within')" ] }, "execution_count": 141, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vocab_sample[38], vocab_sample[862]" ] }, { "cell_type": "code", "execution_count": 142, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([867, 52, 5], dtype=int64)" ] }, "execution_count": 142, "metadata": {}, "output_type": "execute_result" } ], "source": [ "itongram[6116]" ] }, { "cell_type": "code", "execution_count": 143, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('believable', '!', 'xxmaj')" ] }, "execution_count": 143, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vocab_sample[867], vocab_sample[52], vocab_sample[5]" ] }, { "cell_type": "code", "execution_count": 144, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([3776, 5, 1800], dtype=int64)" ] }, "execution_count": 144, "metadata": {}, "output_type": "execute_result" } ], "source": [ "itongram[6119]" ] }, { "cell_type": "code", "execution_count": 145, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('parallel', 'xxmaj', 'ryan')" ] }, "execution_count": 145, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vocab_sample[3376], vocab_sample[5], vocab_sample[1800]" ] }, { "cell_type": "code", "execution_count": 146, 
"metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 0, 1240, 0], dtype=int64)" ] }, "execution_count": 146, "metadata": {}, "output_type": "execute_result" } ], "source": [ "itongram[80000]" ] }, { "cell_type": "code", "execution_count": 147, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('xxunk', 'involving', 'xxunk')" ] }, "execution_count": 147, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vocab_sample[0], vocab_sample[1240], vocab_sample[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 14D. Create the `ngram-doc matrix` for the validation data" ] }, { "cell_type": "code", "execution_count": 148, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Wall time: 40.8 s\n" ] } ], "source": [ "%%time\n", "j_indices = []\n", "indptr = []\n", "values = []\n", "indptr.append(0)\n", "\n", "for i, doc in enumerate(movie_reviews.valid.x):\n", "    feature_counter = Counter(doc.data)\n", "    j_indices.extend(feature_counter.keys())\n", "    values.extend(feature_counter.values())\n", "    this_doc_ngrams = list()\n", "\n", "    m = 0\n", "    for n in range(min_n, max_n + 1):\n", "        # NOTE: as in the training loop, k ranges over vocab_len, not len(doc.data)\n", "        for k in range(vocab_len - n + 1):\n", "            ngram = doc.data[k: k + n]\n", "            # only n-grams already seen in the training data are counted\n", "            if str(ngram) in ngramtoi:\n", "                this_doc_ngrams.append(ngramtoi[str(ngram)])\n", "            m += 1\n", "\n", "    ngram_counter = Counter(this_doc_ngrams)\n", "    j_indices.extend(ngram_counter.keys())\n", "    values.extend(ngram_counter.values())\n", "    indptr.append(len(j_indices))" ] }, { "cell_type": "code", "execution_count": 149, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Wall time: 37.9 ms\n" ] } ], "source": [ "%%time\n", "valid_ngram_doc_matrix = scipy.sparse.csr_matrix((values, j_indices, indptr),\n", " shape=(len(indptr) - 1, len(ngramtoi)),\n", " dtype=int)" ] }, { "cell_type": "code", "execution_count": 150, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<200x260402 sparse matrix of type ''\n", 
"\twith 121597 stored elements in Compressed Sparse Row format>" ] }, "execution_count": 150, "metadata": {}, "output_type": "execute_result" } ], "source": [ "valid_ngram_doc_matrix" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 14E. Save the `ngram` data so we won't have to spend the time to generate it again" ] }, { "cell_type": "code", "execution_count": 151, "metadata": {}, "outputs": [], "source": [ "scipy.sparse.save_npz(\"train_ngram_matrix.npz\", train_ngram_doc_matrix)\n", "scipy.sparse.save_npz(\"valid_ngram_matrix.npz\", valid_ngram_doc_matrix)" ] }, { "cell_type": "code", "execution_count": 152, "metadata": {}, "outputs": [], "source": [ "with open('itongram.pickle', 'wb') as handle:\n", "    pickle.dump(itongram, handle, protocol=pickle.HIGHEST_PROTOCOL)\n", "\n", "with open('ngramtoi.pickle', 'wb') as handle:\n", "    pickle.dump(ngramtoi, handle, protocol=pickle.HIGHEST_PROTOCOL)" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "### 14F. Load the `ngram` data" ] }, { "cell_type": "code", "execution_count": 153, "metadata": { "hidden": true }, "outputs": [], "source": [ "train_ngram_doc_matrix = scipy.sparse.load_npz(\"train_ngram_matrix.npz\")\n", "valid_ngram_doc_matrix = scipy.sparse.load_npz(\"valid_ngram_matrix.npz\")" ] }, { "cell_type": "code", "execution_count": 154, "metadata": { "hidden": true }, "outputs": [], "source": [ "# restore each dictionary under its own name\n", "with open('itongram.pickle', 'rb') as handle:\n", "    itongram = pickle.load(handle)\n", "\n", "with open('ngramtoi.pickle', 'rb') as handle:\n", "    ngramtoi = pickle.load(handle)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 15. 
A Naive Bayes IMDb classifier using Trigrams instead of Tokens" ] }, { "cell_type": "code", "execution_count": 155, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<800x260402 sparse matrix of type ''\n", "\twith 678912 stored elements in Compressed Sparse Row format>" ] }, "execution_count": 155, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x=train_ngram_doc_matrix\n", "x" ] }, { "cell_type": "code", "execution_count": 156, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "There are 260402 1-gram, 2-gram, and 3-gram features in the IMDb_sample vocabulary\n" ] } ], "source": [ "k = x.shape[1]\n", "print(f'There are {k} 1-gram, 2-gram, and 3-gram features in the IMDb_sample vocabulary')" ] }, { "cell_type": "code", "execution_count": 157, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(800,)" ] }, "execution_count": 157, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y=movie_reviews.train.y\n", "y.items\n", "y.items.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Numerical label representation" ] }, { "cell_type": "code", "execution_count": 158, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "positive and negative review labels are represented numerically by 1 and 0\n" ] } ], "source": [ "positive = y.c2i['positive']\n", "negative = y.c2i['negative']\n", "print(f'positive and negative review labels are represented numerically by {positive} and {negative}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Boolean indicator of whether each validation label is positive" ] }, { "cell_type": "code", "execution_count": 159, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(200, 1)" ] }, "execution_count": 159, "metadata": {}, "output_type": "execute_result" } ], "source": [ "valid_labels = [label == positive for label in movie_reviews.valid.y.items]\n", 
"valid_labels=np.array(valid_labels)[:,np.newaxis]\n", "valid_labels.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Boolean indicators for `positive` and `negative` reviews in the training set" ] }, { "cell_type": "code", "execution_count": 160, "metadata": {}, "outputs": [], "source": [ "pos = (y.items == positive)\n", "neg = (y.items == negative)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 15A. Naive Bayes with Trigrams" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The input is the full `ngram_doc_matrix`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Token `occurrence count` vectors\n", "The kernel dies if I use the sparse matrix x here, so converting x to a dense matrix" ] }, { "cell_type": "code", "execution_count": 161, "metadata": {}, "outputs": [], "source": [ "C0 = np.squeeze(x.todense()[neg].sum(0))\n", "C1 = np.squeeze(x.todense()[pos].sum(0))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Token `class likelihood` vectors" ] }, { "cell_type": "code", "execution_count": 162, "metadata": {}, "outputs": [], "source": [ "L0 = (C0+1) / (neg.sum() + 1)\n", "L1 = (C1+1) / (pos.sum() + 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### `log-count ratio` column vector" ] }, { "cell_type": "code", "execution_count": 163, "metadata": {}, "outputs": [], "source": [ "R = np.log(L1/L0).reshape((-1,1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### bias" ] }, { "cell_type": "code", "execution_count": 164, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(0.47875, 0.52125)" ] }, "execution_count": 164, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(y.items==positive).mean(), (y.items==negative).mean()" ] }, { "cell_type": "code", "execution_count": 165, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-0.08505123261815539\n" ] } ], "source": [ "b = 
np.log((y.items==positive).mean() / (y.items==negative).mean())\n", "print(b)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The input is the `ngram_doc_matrix`" ] }, { "cell_type": "code", "execution_count": 166, "metadata": {}, "outputs": [], "source": [ "W = valid_ngram_doc_matrix" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Label predictions with the full ngram_doc_matrix" ] }, { "cell_type": "code", "execution_count": 167, "metadata": {}, "outputs": [], "source": [ "preds = W @ R + b\n", "preds = preds > 0" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Accuracy is much better than with the unigram model" ] }, { "cell_type": "code", "execution_count": 168, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy for Naive Bayes with the full trigrams Model = 0.76\n" ] } ], "source": [ "accuracy = (preds == valid_labels).mean()\n", "print(f'Accuracy for Naive Bayes with the full trigrams Model = {accuracy}' )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 15B. 
Binarized Naive Bayes with Trigrams" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The input data is the binarized `n_gram_doc_matrix`" ] }, { "cell_type": "code", "execution_count": 169, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<800x260402 sparse matrix of type ''\n", "\twith 566499 stored elements in Compressed Sparse Row format>" ] }, "execution_count": 169, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x = train_ngram_doc_matrix.sign()\n", "x" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Token `occurrence count` vectors\n", "The kernel dies if I use the sparse matrix x here, so converting x to a dense matrix" ] }, { "cell_type": "code", "execution_count": 170, "metadata": {}, "outputs": [], "source": [ "C0 = np.squeeze(x.todense()[neg].sum(0))\n", "C1 = np.squeeze(x.todense()[pos].sum(0))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Token `class likelihood` vectors" ] }, { "cell_type": "code", "execution_count": 171, "metadata": {}, "outputs": [], "source": [ "L1 = (C1+1) / ((y.items==positive).sum() + 1)\n", "L0 = (C0+1) / ((y.items==negative).sum() + 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### `log-count ratio` column vector" ] }, { "cell_type": "code", "execution_count": 172, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[-0.005675]\n", " [ 0.084839]\n", " [ 0. 
]\n", " [ 0.084839]\n", " ...\n", " [-0.608308]\n", " [-0.608308]\n", " [-0.608308]\n", " [-0.608308]]\n" ] } ], "source": [ "R = np.log(L1/L0).reshape((-1,1))\n", "print(R)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Input to the model is the binarized `ngram_doc_matrix`" ] }, { "cell_type": "code", "execution_count": 173, "metadata": {}, "outputs": [], "source": [ "W = valid_ngram_doc_matrix.sign()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Label predictions with the binarized ngram_doc_matrix" ] }, { "cell_type": "code", "execution_count": 174, "metadata": {}, "outputs": [], "source": [ "preds = W @ R + b\n", "preds = preds>0" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Accuracy is still much better than with the unigram model, but this time binarization makes it slightly worse (0.735 vs. 0.76 with full counts)" ] }, { "cell_type": "code", "execution_count": 175, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy for Binarized Naive Bayes with Trigrams Model = 0.735\n" ] } ], "source": [ "accuracy = (preds==valid_labels).mean()\n", "print(f'Accuracy for Binarized Naive Bayes with Trigrams Model = {accuracy}' )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 16. A Logistic Regression IMDb classifier using Trigrams" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Here we fit `regularized` logistic regression where the features are the trigrams." ] }, { "cell_type": "code", "execution_count": 176, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LogisticRegression\n", "from sklearn.feature_extraction.text import CountVectorizer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 16A. 
Use `CountVectorizer` to create the `train_ngram_doc` matrix" ] }, { "cell_type": "code", "execution_count": 177, "metadata": {}, "outputs": [], "source": [ "veczr = CountVectorizer(ngram_range=(1,3), preprocessor=noop, tokenizer=noop, max_features=800000)" ] }, { "cell_type": "code", "execution_count": 178, "metadata": {}, "outputs": [], "source": [ "train_docs = movie_reviews.train.x\n", "train_words = [[movie_reviews.vocab.itos[o] for o in doc.data] for doc in train_docs]" ] }, { "cell_type": "code", "execution_count": 179, "metadata": {}, "outputs": [], "source": [ "valid_docs = movie_reviews.valid.x\n", "valid_words = [[movie_reviews.vocab.itos[o] for o in doc.data] for doc in valid_docs]" ] }, { "cell_type": "code", "execution_count": 180, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Wall time: 1.35 s\n" ] }, { "data": { "text/plain": [ "<800x260401 sparse matrix of type ''\n", "\twith 565699 stored elements in Compressed Sparse Row format>" ] }, "execution_count": 180, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "train_ngram_doc_matrix_veczr = veczr.fit_transform(train_words)\n", "train_ngram_doc_matrix_veczr" ] }, { "cell_type": "code", "execution_count": 181, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<200x260401 sparse matrix of type ''\n", "\twith 93549 stored elements in Compressed Sparse Row format>" ] }, "execution_count": 181, "metadata": {}, "output_type": "execute_result" } ], "source": [ "valid_ngram_doc_matrix_veczr = veczr.transform(valid_words)\n", "valid_ngram_doc_matrix_veczr" ] }, { "cell_type": "code", "execution_count": 182, "metadata": {}, "outputs": [], "source": [ "vocab = veczr.get_feature_names()" ] }, { "cell_type": "code", "execution_count": 183, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['the running man',\n", " 'the rural',\n", " 'the rural xxmaj',\n", " 'the sad',\n", " 'the sad recognition']" ] }, "execution_count": 183, 
"metadata": {}, "output_type": "execute_result" } ], "source": [ "vocab[200000:200005]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Binarized trigram counts" ] }, { "cell_type": "code", "execution_count": 184, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy = 0.83 for Logistic Regression, with binarized trigram counts from `CountVectorizer`\n" ] } ], "source": [ "# fit model\n", "m = LogisticRegression(C=0.1, dual=False, solver = 'liblinear')\n", "m.fit(train_ngram_doc_matrix_veczr.sign(), y.items);\n", "\n", "# get predictions\n", "preds = m.predict(valid_ngram_doc_matrix_veczr.sign())\n", "valid_labels = [label == positive for label in movie_reviews.valid.y.items]\n", "\n", "# check accuracy\n", "accuracy = (preds==valid_labels).mean()\n", "print(f'Accuracy = {accuracy} for Logistic Regression, with binarized trigram counts from `CountVectorizer`' )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Full trigram counts\n", "Performance is worse with full trigram counts." ] }, { "cell_type": "code", "execution_count": 185, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy = 0.78 for Logistic Regression, with full trigram counts from `CountVectorizer`\n" ] } ], "source": [ "m = LogisticRegression(C=0.1, dual=False, solver = 'liblinear')\n", "m.fit(train_ngram_doc_matrix_veczr, y.items);\n", "\n", "preds = m.predict(valid_ngram_doc_matrix_veczr)\n", "accuracy =(preds==valid_labels).mean()\n", "print(f'Accuracy = {accuracy} for Logistic Regression, with full trigram counts from `CountVectorizer`' )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 16B. 
This time, use `our` ngrams to create the `train_ngram_doc` matrix" ] }, { "cell_type": "code", "execution_count": 186, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(800, 260402)" ] }, "execution_count": 186, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_ngram_doc_matrix.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Fit a model to the binarized trigram counts" ] }, { "cell_type": "code", "execution_count": 187, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy = 0.83 for Logistic Regression, with our binarized trigram counts\n" ] } ], "source": [ "m2=None\n", "m2 = LogisticRegression(C=0.1, dual=False, solver = 'liblinear')\n", "m2.fit(train_ngram_doc_matrix.sign(), y.items)\n", "\n", "preds = m2.predict(valid_ngram_doc_matrix.sign())\n", "accuracy = (preds==valid_labels).mean()\n", "print(f'Accuracy = {accuracy} for Logistic Regression, with our binarized trigram counts' )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Fit a model to the full trigram counts\n", "Performance is again worse with full trigram counts." ] }, { "cell_type": "code", "execution_count": 188, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy = 0.795 for Not-Binarized Logistic Regression, with our Trigrams\n" ] } ], "source": [ "m2 = LogisticRegression(C=0.1, dual=False,solver='liblinear')\n", "m2.fit(train_ngram_doc_matrix, y.items)\n", "preds = m2.predict(valid_ngram_doc_matrix)\n", "accuracy = (preds==valid_labels).mean()\n", "print(f'Accuracy = {accuracy} for Not-Binarized Logistic Regression, with our Trigrams' )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 16C. 
Logistic Regression with the log-count ratio gives a slightly better result" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Compute the $\\text{log-count ratio}, \\textbf{R}$ and the $\\text{bias}, \\textbf{b}$" ] }, { "cell_type": "code", "execution_count": 189, "metadata": {}, "outputs": [], "source": [ "x=train_ngram_doc_matrix.sign()\n", "valid_x=valid_ngram_doc_matrix.sign()" ] }, { "cell_type": "code", "execution_count": 190, "metadata": {}, "outputs": [], "source": [ "C0 = np.squeeze(x.todense()[neg].sum(axis=0))\n", "C1 = np.squeeze(x.todense()[pos].sum(axis=0))" ] }, { "cell_type": "code", "execution_count": 191, "metadata": {}, "outputs": [], "source": [ "L1 = (C1+1) / ((pos).sum() + 1)\n", "L0 = (C0+1) / ((neg).sum() + 1)" ] }, { "cell_type": "code", "execution_count": 192, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1, 260402)" ] }, "execution_count": 192, "metadata": {}, "output_type": "execute_result" } ], "source": [ "R = np.log(L1/L0)\n", "R.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Here we fit regularized logistic regression where the features are the binarized trigram indicators scaled by their log-count ratios:" ] }, { "cell_type": "code", "execution_count": 193, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(800, 260402)\n" ] } ], "source": [ "R_tile = np.tile(R,[x.shape[0],1])\n", "print(R_tile.shape)" ] }, { "cell_type": "code", "execution_count": 194, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy = 0.835 for Logistic Regression, with trigram log-count ratios\n" ] } ], "source": [ "# The next line causes the kernel to die?\n", "# x_nb = x.multiply(R)\n", "# As a workaround, use the full matrices\n", "x_nb = np.multiply(x.todense(),R_tile)\n", "m = LogisticRegression(dual=False, C=0.1,solver='liblinear')\n", "m.fit(x_nb, y.items);\n", "\n", "# why does valid_x.multiply(R) work but x.multiply(R) does not?\n", "valid_x_nb = 
valid_x.multiply(R) \n", "preds = m.predict(valid_x_nb)\n", "\n", "accuracy = (preds==valid_labels).mean()\n", "print(f'Accuracy = {accuracy} for Logistic Regression, with trigram log-count ratios' )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 17. Summary of movie review sentiment classifier results" ] }, { "cell_type": "code", "execution_count": 203, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
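<table>
<tbody>
<tr><td>Model</td><td>Data Set</td><td>Token Unit</td><td>Validation Accuracy (%)</td></tr>
<tr><td>Naive Bayes</td><td>IMDb_sample</td><td>Full Unigram</td><td>64.5 (from video #5)</td></tr>
<tr><td>Naive Bayes</td><td>IMDb_sample</td><td>Binarized Unigram</td><td>68.0</td></tr>
<tr><td>Naive Bayes</td><td>IMDb_sample</td><td>Full Trigram</td><td>76.0</td></tr>
<tr><td>Naive Bayes</td><td>IMDb_sample</td><td>Binarized Trigram</td><td>73.5</td></tr>
<tr><td>Logistic Regression</td><td>IMDb_sample</td><td>Full Trigram</td><td>78.0 (CountVectorizer), 79.5 (our Trigrams)</td></tr>
<tr><td>Logistic Regression</td><td>IMDb_sample</td><td>Binarized Trigram</td><td>83.0</td></tr>
<tr><td>Logistic Regression</td><td>IMDb_sample</td><td>Binarized Trigram log-count ratios</td><td>83.5</td></tr>
<tr><td>Naive Bayes</td><td>Full IMDb</td><td>Binarized Trigram</td><td>83.3</td></tr>
<tr><td>Logistic Regression</td><td>Full IMDb</td><td>Full Trigram</td><td>88.3</td></tr>
<tr><td>Logistic Regression</td><td>Full IMDb</td><td>Binarized Trigram</td><td>88.5</td></tr>
</tbody>
</table>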
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from IPython.display import HTML, display\n", "# Note: to install the `tabulate` package, \n", "# go to a shell terminal and run the command\n", "# `conda install tabulate`\n", "import tabulate\n", "table = [[\"Model\",\"Data Set\",\"Token Unit\",\"Validation Accuracy(%)\"],\n", " [\"Naive Bayes\",\"IMDb_sample\", \"Full Unigram\",\"64.5 (from video #5)\"],\n", " [\"Naive Bayes\",\"IMDb_sample\", \"Binarized Unigram\",\"68.0\"],\n", " [\"Naive Bayes\",\"IMDb_sample\", \"Full Trigram\",\"76.0\"],\n", " [\"Naive Bayes\",\"IMDb_sample\", \"Binarized Trigram\",\"73.5\"],\n", " [\"Logistic Regression\",\"IMDb_sample\", \"Full Trigram\",\"78.0, 80.0 (our Trigrams)\"],\n", " [\"Logistic Regression\",\"IMDb_sample\", \"Binarized Trigram\",\"83.0\"],\n", " [\"Logistic Regression\",\"IMDb_sample\", \"Binarized Trigram log-count ratios\",\"83.5\"],\n", " [\"Naive Bayes\",\"Full IMDb\",\"IMDb_sample\", \"Binarized Trigram\",\"83.3\"],\n", " [\"Logistic Regression\",\"Full IMDb\", \"Full Trigram\",\"88.3\"],\n", " [\"Logistic Regression\",\"Full IMDb\", \"Binarized Trigram\",\"88.5\"]]\n", "display(HTML(tabulate.tabulate(table, tablefmt='html')))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## References" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Baselines and Bigrams: Simple, Good Sentiment and Topic Classification. Sida Wang and Christopher D. Manning [pdf](https://www.aclweb.org/anthology/P12-2018)\n", "* [The Naive Bayes Classifier](https://towardsdatascience.com/the-naive-bayes-classifier-e92ea9f47523). 
Joseph Catanzarite, in Towards Data Science" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }