{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# IGNORE THIS CELL WHICH CUSTOMIZES LAYOUT AND STYLING OF THE NOTEBOOK !\n", "from numpy.random import seed\n", "\n", "seed(42)\n", "import tensorflow as tf\n", "\n", "tf.random.set_seed(42)\n", "import matplotlib as mpl\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "sns.set(style=\"darkgrid\")\n", "mpl.rcParams[\"lines.linewidth\"] = 3\n", "%matplotlib inline\n", "%config InlineBackend.figure_format = 'retina'\n", "%config IPCompleter.greedy=True\n", "import warnings\n", "\n", "warnings.filterwarnings('ignore')\n", "warnings.filterwarnings(\"ignore\", category=FutureWarning)\n", "from IPython.core.display import HTML\n", "\n", "HTML(open(\"custom.html\", \"r\").read())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Chapter 8e: Sequence modeling: Natural language processing\n", "## What is Natural language processing?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As the name suggests, natural language processing (NLP) refers to the processing of language data such as text and speech. It covers tasks such as:\n", "\n", "- Automatic document processing\n", "- Topic modeling\n", "- Language translation\n", "- Sentiment analysis\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Computers cannot work with raw text directly; they need numbers. So we need some mechanism to convert our text to numbers.\n", "\n", "**Important libraries to know:**\n", "- [Natural Language Toolkit (NLTK)](https://www.nltk.org/)\n", "- [Gensim](https://radimrehurek.com/gensim/)\n", "- [Tomotopy](https://bab2min.github.io/tomotopy/v0.12.3/en/)\n", "- [fastText](https://fasttext.cc/)\n", "\n", "## Text preprocessing\n", "\n", "### Tokenization\n", "\n", "Text -> tokens\n", "\n", "The process of reducing a piece of text to tokens is called tokenization. It is generally done at the word level but can also be done at other levels, such as the sentence level." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "import nltk\n", "\n", "nltk.download(\"all\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "text = \"Is Monty a python or a group of pythons in a flying circus? What about swimming circuses?\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from nltk.tokenize import word_tokenize\n", "\n", "print(word_tokenize(text))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Lemmatization and Stemming\n", "\n", "Most of the time we also want to reduce the inflectional forms of the same word. For example, consider a text that contains (organization, organizational, organizations).\n", "\n", "`Stemming`: This is a process of reducing a word to a stem form based on some predefined rules. The resulting stem might be a nonsensical word.\n", "\n", "`Lemmatization`: This is a process of reducing a word to its lemma, i.e. the dictionary form of the word. It follows lexicon rules and is much more comprehensive than `stemming`. However, it is also more computationally expensive. A quick illustration with the inflected forms mentioned above is shown below." ] }
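, { "cell_type": "markdown", "metadata": {}, "source": [ "As a minimal illustration (added here for clarity), the cell below applies NLTK's `PorterStemmer` and `WordNetLemmatizer` - the same tools used in the next cell - to the inflected forms mentioned above. The exact outputs may vary slightly with the NLTK version." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Minimal illustration: stem vs. lemma for the inflected forms mentioned above\n", "from nltk.stem import PorterStemmer, WordNetLemmatizer\n", "\n", "stemmer = PorterStemmer()\n", "lemmatizer = WordNetLemmatizer()\n", "\n", "for w in [\"organization\", \"organizational\", \"organizations\"]:\n", "    print(f\"{w:16s} stem: {stemmer.stem(w):12s} lemma: {lemmatizer.lemmatize(w)}\")" ] }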
, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from nltk.stem import PorterStemmer, WordNetLemmatizer\n", "from nltk.tokenize import word_tokenize\n", "from prettytable import PrettyTable\n", "\n", "words = word_tokenize(text)\n", "print(\"Tokens \\n\")\n", "print(words)\n", "\n", "stemmer = PorterStemmer()\n", "\n", "lemmatizer = WordNetLemmatizer()\n", "\n", "table = PrettyTable([\"Word\", \"Stem\", \"Lemma\"])\n", "\n", "for w in words:\n", "    table.add_row([w, stemmer.stem(w), lemmatizer.lemmatize(w)])\n", "\n", "print(table)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lemmatizer.lemmatize(\"swimming\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lemmatizer.lemmatize?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lemmatizer.lemmatize(\"swimming\", \"v\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Automatically find the POS tag for each word\n", "from nltk.corpus import wordnet\n", "\n", "\n", "def get_wordnet_pos(word):\n", "    \"\"\"Map POS tag to the first character lemmatize() accepts\"\"\"\n", "    tag = nltk.pos_tag([word])[0][1][0].upper()\n", "    tag_dict = {\n", "        \"J\": wordnet.ADJ,\n", "        \"N\": wordnet.NOUN,\n", "        \"V\": wordnet.VERB,\n", "        \"R\": wordnet.ADV,\n", "    }\n", "\n", "    return tag_dict.get(tag, wordnet.NOUN)\n", "\n", "\n", "words = word_tokenize(text)\n", "\n", "table = PrettyTable([\"Word\", \"Stem\", \"Lemma\"])\n", "\n", "for w in words:\n", "    table.add_row([w, stemmer.stem(w), lemmatizer.lemmatize(w, get_wordnet_pos(w))])\n", "\n", "print(table)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Other preprocessing steps:\n", "\n", "- Convert text to lower case\n", "- Remove punctuation\n", "- Remove stopwords" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Convert text to lower case\n", "text = text.lower()\n", "print(text)\n", "\n", "# Remove punctuation\n", "import string\n", "\n", "text = text.translate(str.maketrans(\"\", \"\", string.punctuation))\n", "print(text)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Remove stopwords\n", "from nltk.corpus import stopwords\n", "\n", "print(stopwords.words(\"english\"))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "words = word_tokenize(text)\n", "\n", "filtered_text = [w for w in words if w not in set(stopwords.words(\"english\"))]\n", "\n", "print(filtered_text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tokens to Vectors\n", "\n", "Once we have cleaned up our text, there are different ways to turn the tokens into vectors:\n", "\n", "### Bag-of-Words (BoW)\n", "\n", "Imagine that all the unique words in our text corpus are put together in one big bag.\n", "\n", "All or a subset of this bag is then considered as our `vocabulary`.\n", "\n", "Each unit (document/line/...) in our corpus can now be represented as a vector whose length equals the vocabulary size, with each index of the vector representing a word from our `vocabulary`.\n", "\n", "We count the number of occurrences of each word in a unit of text and put this number at the corresponding location in the vector. If the word does not occur in the unit, we enter 0."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Let's consider each sentence of our example text as a document/unit we want to process\n", "import numpy as np\n", "\n", "text = [\n", "    \"Is Monty a python or a group of pythons in a flying circus?\",\n", "    \"What about swimming circuses?\",\n", "]\n", "\n", "for index, value in enumerate(text):\n", "    text[index] = value.lower().translate(str.maketrans(\"\", \"\", string.punctuation))\n", "\n", "lemmatizer = WordNetLemmatizer()\n", "\n", "unique_words = {}\n", "\n", "bow_text = []\n", "\n", "for index, value in enumerate(text):\n", "    words = word_tokenize(value)\n", "    words = [w for w in words if w not in set(stopwords.words(\"english\"))]\n", "    words = [lemmatizer.lemmatize(w) for w in words]\n", "    print(words)\n", "    for token in words:\n", "        if token not in unique_words.keys():\n", "            unique_words[token] = 1\n", "        else:\n", "            unique_words[token] += 1\n", "    bow_text.append(words)\n", "\n", "print(unique_words)\n", "\n", "unique_words = list(unique_words.keys())\n", "\n", "bow_vectors = np.zeros((len(unique_words), len(text)))\n", "\n", "for column, value in enumerate(bow_text):\n", "    for _, word in enumerate(value):\n", "        if word in unique_words:\n", "            bow_vectors[unique_words.index(word), column] += 1\n", "print(bow_vectors)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A much better way of doing this is to use scikit-learn's `CountVectorizer`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from string import punctuation\n", "\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "\n", "# CountVectorizer automatically makes the text lowercase\n", "\n", "text = [\n", "    \"Is Monty a python or a group of python in a flying circus?\",\n", "    \"What about swimming circuses?\",\n", "]\n", "\n", "\n", "class LemmaTokenizer:\n", "    def __init__(self):\n", "        self.wnl = WordNetLemmatizer()\n", "\n", "    def __call__(self, doc):\n", "        return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]\n", "\n", "\n", "vectorizer = CountVectorizer(\n", "    stop_words=list(set(stopwords.words(\"english\")).union(set(punctuation))),\n", "    tokenizer=LemmaTokenizer(),\n", ")\n", "\n", "bow_vectors = vectorizer.fit_transform(text)\n", "\n", "print(f\"The vocabulary of our corpus is: \\n {vectorizer.vocabulary_}\\n\")\n", "\n", "print(f\"The vectorizer from scikit-learn creates sparse matrices: {type(bow_vectors)} \\n\")\n", "\n", "print(f\"The created vectors are: {bow_vectors.todense()}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Other tokenizers\n", "from string import punctuation\n", "\n", "from nltk.tokenize import TweetTokenizer\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "\n", "tokenizer = TweetTokenizer()\n", "\n", "text = [\n", "    \"Is Monty a python or a group of python's in a flying circus?\",\n", "    \"What about swimming circuses?\",\n", "]\n", "\n", "\n", "class LemmaTokenizer:\n", "    def __init__(self):\n", "        self.wnl = WordNetLemmatizer()\n", "\n", "    def __call__(self, doc):\n", "        return [self.wnl.lemmatize(t) for t in tokenizer.tokenize(doc)]\n", "\n", "\n", "vectorizer = CountVectorizer(\n", "    stop_words=list(set(stopwords.words(\"english\")).union(set(punctuation))),\n", "    tokenizer=LemmaTokenizer(),\n", ")\n", "\n", "bow_vectors = vectorizer.fit_transform(text)\n", "\n", "print(f\"The vocabulary of our corpus is: \\n {vectorizer.vocabulary_}\\n\")\n", "\n", "print(f\"The vectorizer from scikit-learn creates sparse matrices: {type(bow_vectors)} \\n\")\n", "\n", "print(f\"The created vectors are: {bow_vectors.todense()}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Term frequency - inverse document frequency (TF-IDF)\n", "\n", "A numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.\n", "\n", "A survey conducted in 2015 showed that 83% of text-based recommender systems in digital libraries use TF-IDF (*).\n", "\n", "*[Research-paper recommender systems: a literature survey](https://link.springer.com/article/10.1007/s00799-015-0156-0)\n", "\n", "$\\text{TF-IDF} = \\text{TF} \\cdot \\text{IDF}$\n", "\n", "**TF = term frequency**\n", "\n", "**IDF = inverse document/text frequency**\n", "\n", "$T_{t,d}$ = number of occurrences of a particular term $t$ in a document $d$.\n", "\n", "$\\sum_{t' \\in d} T_{t',d}$ = total number of terms in the document $d$.\n", "\n", "$N_T$ = total number of documents/text samples.\n", "\n", "$N_t$ = number of documents/text samples that contain the term $t$.\n", "\n", "$\\text{TF-IDF}(t, d) = \\dfrac{T_{t,d}}{\\sum_{t' \\in d} T_{t',d}} \\cdot \\dfrac{N_T}{N_t}$\n", "\n", "In practice, the IDF factor is usually log-scaled (e.g. $\\log \\dfrac{N_T}{N_t}$), which is also what scikit-learn's `TfidfVectorizer` does (with additional smoothing)." ] }
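, { "cell_type": "markdown", "metadata": {}, "source": [ "As a minimal sketch (added for illustration, not part of the original exercises), scikit-learn's `TfidfVectorizer` can be used in place of `CountVectorizer`. It combines the counting step with the (log-scaled) TF-IDF weighting described above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustrative TF-IDF sketch, reusing `text` and `LemmaTokenizer` from the cells above\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "\n", "tfidf_vectorizer = TfidfVectorizer(\n", "    stop_words=list(set(stopwords.words(\"english\")).union(set(punctuation))),\n", "    tokenizer=LemmaTokenizer(),\n", ")\n", "\n", "tfidf_vectors = tfidf_vectorizer.fit_transform(text)\n", "\n", "print(f\"The vocabulary of our corpus is: \\n {tfidf_vectorizer.vocabulary_}\\n\")\n", "\n", "print(f\"The TF-IDF weighted vectors are: {tfidf_vectors.todense()}\")" ] }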
, { "cell_type": "markdown", "metadata": {}, "source": [ "### IMDB dataset\n", "\n", "Let's have a look at a sample dataset.\n", "\n", "The IMDB dataset comprises 50,000 movie reviews. Each of them has a label `0` or `1` representing a bad or a good review, respectively.\n", "\n", "`Note`: This dataset is also contained in tensorflow.keras.datasets; however, that version is already preprocessed. Therefore, we import it from tensorflow_datasets instead." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise: Explore the IMDB dataset and vectorize the tokens" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "import tensorflow_datasets as tfds\n", "\n", "train_data, test_data = tfds.load(\n", "    name=\"imdb_reviews\", split=[\"train\", \"test\"], batch_size=-1, as_supervised=True\n", ")\n", "\n", "X_train, y_train = tfds.as_numpy(train_data)\n", "X_test, y_test = tfds.as_numpy(test_data)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(f\"Number of training samples: {len(y_train)}, test samples: {len(y_test)}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(X_train[:5])\n", "print(y_train[:5])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise: Apply tokenization and vectorization (e.g. CountVectorizer) to the IMDB dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"# Create a vectorizer, e.g. CountVectorizer\n", "# Pass max_features=20000 to the vectorizer to avoid running out of memory\n", "\n", "\n", "# Train it on the training set (HINT: one can pass an array of texts)\n", "\n", "\n", "# Look at the resulting vocabulary\n", "# vectorizer.vocabulary_\n", "\n", "\n", "# Transform the test data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "raises-exception" ] }, "outputs": [], "source": [ "# Build a simple vanilla neural network with 3 hidden layers\n", "# Don't forget to add dropout\n", "\n", "\n", "\n", "\n", "model.compile(optimizer=\"adam\", loss=\"binary_crossentropy\", metrics=[\"accuracy\"])\n", "\n", "# We need to convert the sparse vectors to dense\n", "results = model.fit(\n", "    train.todense(),\n", "    y_train,\n", "    epochs=10,\n", "    batch_size=512,\n", "    validation_data=(test.todense(), y_test),\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "solution" ] }, "outputs": [], "source": [ "# Solution\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "\n", "vectorizer = CountVectorizer(\n", "    stop_words=list(set(stopwords.words(\"english\")).union(set(punctuation))),\n", "    tokenizer=LemmaTokenizer(),\n", "    max_features=20000,\n", ")\n", "\n", "train = vectorizer.fit_transform(X_train)\n", "\n", "# print(vectorizer.vocabulary_)\n", "\n", "test = vectorizer.transform(X_test)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "solution" ] }, "outputs": [], "source": [ "# Build a simple vanilla neural network with 3 hidden layers\n", "# Don't forget to add dropout\n", "\n", "from tensorflow.keras.layers import Dense, Dropout\n", "from tensorflow.keras.models import Sequential\n", "\n", "\n", "model = Sequential()\n", "model.add(Dense(50, activation=\"relu\", input_shape=(test.shape[1],)))\n", "# Hidden layers\n", "model.add(Dropout(0.5))\n", "model.add(Dense(30, activation=\"relu\"))\n", "model.add(Dropout(0.5))\n", "model.add(Dense(20, activation=\"relu\"))\n", "# Output layer\n", "model.add(Dense(1, activation=\"sigmoid\"))\n", "model.summary()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "solution" ] }, "outputs": [], "source": [ "model.compile(optimizer=\"adam\", loss=\"binary_crossentropy\", metrics=[\"accuracy\"])\n", "\n", "# We need to convert the sparse vectors to dense\n", "results = model.fit(\n", "    train.todense(),\n", "    y_train,\n", "    epochs=10,\n", "    batch_size=512,\n", "    validation_data=(test.todense(), y_test),\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Word embeddings: Featurized representation of words" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<center>\n", "<figure>\n", "<img src=\"./images/neuralnets/word_embedding.png\" width=\"700\"/>\n", "<figcaption>Embedding words in a higher dimensional feature space</figcaption>\n", "</figure>\n", "</center>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "| <div style=\"width:150px\"></div> | <div style=\"width:150px\"></div> Apple | <div style=\"width:150px\"></div> Orange | <div style=\"width:150px\"></div> Pants | <div style=\"width:150px\"></div> Tiger |\n", "| :-----------: | :-----------: | :-----------: | :-----------: | :-----------: |\n", "| Animal | 0.01 | 0.015 | 0.006 | 0.96 |\n", "| Fruit | 0.99 | 0.97 | -0.001 | -0.01 |\n", "| Clothing | 0.02 | 0.07 | 0.97 | 0.002 |\n", "| FeatureX | - | - | - | - |\n", "| FeatureY | - | - | - | - |\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some models to compute word embeddings:\n", "- Word2Vec\n", "- GloVe\n", "- fastText\n", "- BERT" ] }
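, { "cell_type": "markdown", "metadata": {}, "source": [ "For example, Word2Vec embeddings can be trained on your own corpus with [Gensim](https://radimrehurek.com/gensim/) (listed among the libraries at the top of this notebook). The cell below is a minimal, illustrative sketch on a toy corpus; it assumes Gensim 4.x, where the embedding dimension is set via `vector_size`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Minimal sketch: training Word2Vec embeddings with Gensim (assumes Gensim 4.x)\n", "from gensim.models import Word2Vec\n", "\n", "# Toy corpus: a list of tokenized sentences\n", "sentences = [\n", "    [\"monty\", \"python\", \"flying\", \"circus\"],\n", "    [\"swimming\", \"circus\"],\n", "    [\"python\", \"swimming\"],\n", "]\n", "\n", "# Train a small Word2Vec model (vector_size = embedding dimension)\n", "w2v = Word2Vec(sentences=sentences, vector_size=50, window=2, min_count=1, epochs=50)\n", "\n", "print(w2v.wv[\"python\"][:10])  # first 10 dimensions of the vector for \"python\"\n", "print(w2v.wv.most_similar(\"python\"))  # nearest words by cosine similarity" ] }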
, { "cell_type": "markdown", "metadata": {}, "source": [ "### Pretrained embeddings\n", "\n", "Example:\n", "https://fasttext.cc/docs/en/crawl-vectors.html" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import fasttext\n", "\n", "ft = fasttext.load_model(\"./data/cc.en.100.bin\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "words = [\"cat\", \"dog\", \"cream\", \"pizza\", \"car\", \"tractor\"]\n", "\n", "word_vectors = {}\n", "for word in words:\n", "    word_vectors[word] = ft.get_word_vector(word)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from scipy import spatial\n", "\n", "\n", "def compute_similarity(a, b):\n", "    \"\"\"Compute the cosine similarity between two vectors\"\"\"\n", "    return 1 - spatial.distance.cosine(a, b)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "similarities = pd.DataFrame(columns=words, index=words)\n", "for word_a, vec_a in word_vectors.items():\n", "    for word_b, vec_b in word_vectors.items():\n", "        similarities.at[word_a, word_b] = compute_similarity(vec_a, vec_b)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "similarities" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Recurrent Neural Networks (RNNs)\n", "\n", "RNNs are used for sequence problems such as time-series forecasting, speech recognition and translation.\n", "\n", "<center>\n", "<figure>\n", "<img src=\"./images/neuralnets/RNNs.png\" width=\"700\"/>\n", "<figcaption>Recurrent neural network</figcaption>\n", "</figure>\n", "</center>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are newer variants that overcome some issues of a vanilla RNN (such as vanishing gradients on long sequences):\n", "- Gated Recurrent Unit (GRU)\n", "- Long Short-Term Memory (LSTM)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Example walkthrough**: https://keras.io/examples/vision/video_classification/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Transformer models\n", "\n", "Transformers are models based on an encoder-decoder architecture that rely mainly on the attention mechanism.\n", "\n", "<center>\n", "<table>\n", "    <tr><td>\n", "        <figure>\n", "        <img src=\"./images/neuralnets/transformer.png\" width=\"400\"/>\n", "        <figcaption>Transformer architecture</figcaption>\n", "        </figure>\n", "    </td></tr>\n", "    <tr><td><center><sub>Source: <a href=\"https://arxiv.org/abs/1706.03762/\">\"Attention is all you need\": https://arxiv.org/abs/1706.03762/</a></sub></center></td></tr>\n", "</table>\n", "</center>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Putting it all together\n", "\n", "https://paperswithcode.com/sota/sentiment-analysis-on-imdb" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.10" } },
"nbformat": 4, "nbformat_minor": 4 }