{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# IGNORE THIS CELL WHICH CUSTOMIZES LAYOUT AND STYLING OF THE NOTEBOOK !\n",
    "from numpy.random import seed\n",
    "\n",
    "seed(42)\n",
    "import tensorflow as tf\n",
    "\n",
    "tf.random.set_seed(42)\n",
    "import matplotlib as mpl\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "\n",
    "sns.set(style=\"darkgrid\")\n",
    "mpl.rcParams[\"lines.linewidth\"] = 3\n",
    "%matplotlib inline\n",
    "%config InlineBackend.figure_format = 'retina'\n",
    "%config IPCompleter.greedy=True\n",
    "import warnings\n",
    "\n",
    "warnings.filterwarnings('ignore')\n",
    "warnings.filterwarnings(\"ignore\", category=FutureWarning)\n",
    "from IPython.core.display import HTML\n",
    "\n",
    "HTML(open(\"custom.html\", \"r\").read())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Chapter 8e: Sequence modeling: Natural language processing\n",
    "## What is Natural language processing?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As the name suggests, it refers to processing of data such as text and speech. This involves tasks such as:\n",
    "\n",
    "- Automatic document processing\n",
    "- Topic modeling\n",
    "- Language translation\n",
    "- sentiment analysis\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we all know, computers cannot process data in text format. They need numbers. So we need some mechanism to convert our text to numbers.\n",
    "\n",
    "**Important to know libraries:**\n",
    "- [Natural language toolkit](https://www.nltk.org/)\n",
    "- [Gensim](https://radimrehurek.com/gensim/)\n",
    "- [Tomotopy](https://bab2min.github.io/tomotopy/v0.12.3/en/)\n",
    "- [fastext](https://fasttext.cc/)\n",
    "\n",
    "## Text prepocessing\n",
    "\n",
    "### Tokenization\n",
    "\n",
    "Text -> tokens\n",
    "\n",
    "The process of reducing a piece of text to tokens is called tokenization. It is genrally done at a word level but can also be done at other levels such as a sentence."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import nltk\n",
    "\n",
    "nltk.download(\"all\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "text = \"Is Monty a python or a group of pythons in a flying circus? What about swimming circuses?\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from nltk.tokenize import word_tokenize\n",
    "\n",
    "print(word_tokenize(text))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Lemmatization and Stemming\n",
    "\n",
    "Most of the time we want to also reduce the inflectional forms of the same word. For example, consider a text that has (organization, organizational, organizations)\n",
    "\n",
    "`Stemming`: This is a process of reducing a word to a stem form based on some pre-defined rules. The resulting stem might be a non-sensical word.\n",
    "\n",
    "`Lemmatization`: This is a process of reducing a word to a lemma or the dictionary form of the word. This follows lexicon rules and is much more comprehensive than `stemming`. However, it is also more computationally expensive."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from nltk.stem import PorterStemmer, WordNetLemmatizer\n",
    "from nltk.tokenize import word_tokenize\n",
    "from prettytable import PrettyTable\n",
    "\n",
    "words = word_tokenize(text)\n",
    "print(\"Tokens \\n\")\n",
    "print(words)\n",
    "\n",
    "stemmer = PorterStemmer()\n",
    "\n",
    "lemmatizer = WordNetLemmatizer()\n",
    "\n",
    "table = PrettyTable([\"Word\", \"Stem\", \"Lemma\"])\n",
    "\n",
    "for w in words:\n",
    "    table.add_row([w, stemmer.stem(w), lemmatizer.lemmatize(w)])\n",
    "\n",
    "print(table)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lemmatizer.lemmatize(\"swimming\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lemmatizer.lemmatize?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lemmatizer.lemmatize(\"swimming\", \"v\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Automatically find POS tag\n",
    "from nltk.corpus import wordnet\n",
    "\n",
    "\n",
    "def get_wordnet_pos(word):\n",
    "    \"\"\"Map POS tag to first character lemmatize() accepts\"\"\"\n",
    "    tag = nltk.pos_tag([word])[0][1][0].upper()\n",
    "    tag_dict = {\n",
    "        \"J\": wordnet.ADJ,\n",
    "        \"N\": wordnet.NOUN,\n",
    "        \"V\": wordnet.VERB,\n",
    "        \"R\": wordnet.ADV,\n",
    "    }\n",
    "\n",
    "    return tag_dict.get(tag, wordnet.NOUN)\n",
    "\n",
    "\n",
    "words = word_tokenize(text)\n",
    "\n",
    "table = PrettyTable([\"Word\", \"Stem\", \"Lemma\"])\n",
    "\n",
    "for w in words:\n",
    "    table.add_row([w, stemmer.stem(w), lemmatizer.lemmatize(w, get_wordnet_pos(w))])\n",
    "\n",
    "print(table)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Other:\n",
    "\n",
    "- Text to lower case\n",
    "- Remove punctuations\n",
    "- Remove stopwords"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Text to lower case\n",
    "text = text.lower()\n",
    "print(text)\n",
    "\n",
    "# Remove punctuations\n",
    "import string\n",
    "\n",
    "text = text.translate(str.maketrans(\"\", \"\", string.punctuation))\n",
    "print(text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Remove stopwords\n",
    "from nltk.corpus import stopwords\n",
    "\n",
    "print(stopwords.words(\"english\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "words = word_tokenize(text)\n",
    "\n",
    "filtered_text = [w for w in words if not w in set(stopwords.words(\"english\"))]\n",
    "\n",
    "print(filtered_text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Tokens to Vectors\n",
    "\n",
    "Once we have cleaned up our text we have different ways in which we can tokenize them:\n",
    "\n",
    "### Bag-of-Words (BoW)\n",
    "\n",
    "Imagine that all the unique words in our text corpus are put together in one big bag. \n",
    "\n",
    "All or a subset of this bag is then considered as our `vocabulary`.\n",
    "\n",
    "Each unit (document/line/...) in our corpus can now be represented as a vector of length equal to our vocabulary size with each index of the vector representing a word from our `vocabulary`.\n",
    "\n",
    "We count the number of occurences of each word in a unit of text and put this number at the corresponding location in this vector. If the word does not exist in the unit we enter 0."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Let's consider each sentence of our example text as a document/unit we want to process\n",
    "import numpy as np\n",
    "\n",
    "text = [\n",
    "    \"Is Monty a python or a group of pythons in a flying circus?\",\n",
    "    \"What about swimming circuses?\",\n",
    "]\n",
    "\n",
    "for index, value in enumerate(text):\n",
    "    text[index] = value.lower().translate(str.maketrans(\"\", \"\", string.punctuation))\n",
    "\n",
    "lemmatizer = WordNetLemmatizer()\n",
    "\n",
    "unique_words = {}\n",
    "\n",
    "bow_text = []\n",
    "\n",
    "for index, value in enumerate(text):\n",
    "    words = word_tokenize(value)\n",
    "    words = [w for w in words if not w in set(stopwords.words(\"english\"))]\n",
    "    words = [lemmatizer.lemmatize(w) for w in words]\n",
    "    print(words)\n",
    "    for token in words:\n",
    "        if token not in unique_words.keys():\n",
    "            unique_words[token] = 1\n",
    "        else:\n",
    "            unique_words[token] += 1\n",
    "    bow_text.append(words)\n",
    "\n",
    "print(unique_words)\n",
    "\n",
    "unique_words = list(unique_words.keys())\n",
    "\n",
    "bow_vectors = np.zeros((len(unique_words), len(text)))\n",
    "\n",
    "for column, value in enumerate(bow_text):\n",
    "    for _, word in enumerate(value):\n",
    "        if word in unique_words:\n",
    "            bow_vectors[unique_words.index(word), column] += 1\n",
    "print(bow_vectors)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Much better way of doing this is:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from string import punctuation\n",
    "\n",
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "\n",
    "# CountVectorizer automatically makes the text lowercase\n",
    "\n",
    "text = [\n",
    "    \"Is Monty a python or a group of python in a flying circus?\",\n",
    "    \"What about swimming circuses?\",\n",
    "]\n",
    "\n",
    "\n",
    "class LemmaTokenizer:\n",
    "    def __init__(self):\n",
    "        self.wnl = WordNetLemmatizer()\n",
    "\n",
    "    def __call__(self, doc):\n",
    "        return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]\n",
    "\n",
    "\n",
    "vectorizer = CountVectorizer(\n",
    "    stop_words=list(set(stopwords.words(\"english\")).union(set(punctuation))),\n",
    "    tokenizer=LemmaTokenizer(),\n",
    ")\n",
    "\n",
    "bow_vectors = vectorizer.fit_transform(text)\n",
    "\n",
    "print(f\"The vocabulary of our corpus is: \\n {vectorizer.vocabulary_}\\n\")\n",
    "\n",
    "print(f\"Vectorizer from Scikit learn creates sparse matrices: {type(bow_vectors)} \\n\")\n",
    "\n",
    "print(f\"The created vectors are: {bow_vectors.todense()}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Other tokenizers\n",
    "from string import punctuation\n",
    "\n",
    "from nltk.tokenize import TweetTokenizer\n",
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "\n",
    "tokenizer = TweetTokenizer()\n",
    "\n",
    "text = [\n",
    "    \"Is Monty a python or a group of python's in a flying circus?\",\n",
    "    \"What about swimming circuses?\",\n",
    "]\n",
    "\n",
    "\n",
    "class LemmaTokenizer:\n",
    "    def __init__(self):\n",
    "        self.wnl = WordNetLemmatizer()\n",
    "\n",
    "    def __call__(self, doc):\n",
    "        return [self.wnl.lemmatize(t) for t in tokenizer.tokenize(doc)]\n",
    "\n",
    "\n",
    "vectorizer = CountVectorizer(\n",
    "    stop_words=list(set(stopwords.words(\"english\")).union(set(punctuation))),\n",
    "    tokenizer=LemmaTokenizer(),\n",
    ")\n",
    "\n",
    "bow_vectors = vectorizer.fit_transform(text)\n",
    "\n",
    "print(f\"The vocabulary of our corpus is: \\n {vectorizer.vocabulary_}\\n\")\n",
    "\n",
    "print(f\"Vectorizer from Scikit learn creates sparse matrices: {type(bow_vectors)} \\n\")\n",
    "\n",
    "print(f\"The created vectors are: {bow_vectors.todense()}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Term frequency inverse document frequency (Tf-idf)\n",
    "\n",
    "A numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.\n",
    "\n",
    "A survey conducted in 2015 showed that 83% of text-based recommender systems in digital libraries use tf–idf(*)\n",
    "\n",
    "*[Research-paper recommender systems : a literature survey](https://link.springer.com/article/10.1007/s00799-015-0156-0)\n",
    "\n",
    "$TF-IDF = TF * IDF$\n",
    "\n",
    "**TF = Term frequency**\n",
    "\n",
    "**IDF = Inverse document/text frequency**\n",
    "\n",
    "$T_{t',d}$ = Number of occurences of a particular term ($t'$) in a document ($d$).\n",
    "\n",
    "$\\sum_{t' \\in d} T_{t',d}$ : Total number of terms in the document\n",
    "\n",
    "$N_T$ = Total number of documents/text samples.\n",
    "\n",
    "$N_{t'}$ = Number of documents/text samples that contain the term $t'$-\n",
    "\n",
    "$TF-IDF = \\dfrac{T_{t',d}}{\\sum_{t' \\in d} T_{t',d}} * \\dfrac{N_T}{N_{t'}}$\n"
   ]
  },
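  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick illustration, here is a minimal sketch using scikit-learn's `TfidfVectorizer` on the two example sentences from above (note that scikit-learn computes a smoothed, normalized variant of the formula):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Minimal tf-idf sketch on the two example sentences defined above.\n",
    "# scikit-learn uses a smoothed idf, log((1 + N_T) / (1 + N_t)) + 1, and\n",
    "# l2-normalizes each document vector, so the numbers differ slightly\n",
    "# from the plain formula shown above.\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "\n",
    "tfidf_vectorizer = TfidfVectorizer()\n",
    "tfidf_vectors = tfidf_vectorizer.fit_transform(text)\n",
    "\n",
    "print(tfidf_vectorizer.get_feature_names_out())\n",
    "print(tfidf_vectors.todense())"
   ]
  },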
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### IMDB dataset\n",
    "\n",
    "Let's have a look at a sample dataset.\n",
    "\n",
    "IMDB dataset comprises of 50,000 movie reviews. Each of them has a label `0` or `1` representing a bad or a good review, respectively.\n",
    "\n",
    "`Note`: This dataset is also contained in tensorflow.keras.datasets, however that data is already preprocessed. Therefore, we import it from tensorflow_datasets instead."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise: Explore the IMDB dataset and vectorize the tokens"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import tensorflow_datasets as tfds\n",
    "\n",
    "train_data, test_data = tfds.load(\n",
    "    name=\"imdb_reviews\", split=[\"train\", \"test\"], batch_size=-1, as_supervised=True\n",
    ")\n",
    "\n",
    "X_train, y_train = tfds.as_numpy(train_data)\n",
    "X_test, y_test = tfds.as_numpy(test_data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(f\"Number of: training samples - {len(y_train)}, test_samples - {len(y_test)}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(X_train[:5])\n",
    "print(y_train[:5])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise: Apply tokenization and vectorization (e.g. CountVectorizer) to the imdb dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create a vectorizer e.g. CountVectorizer\n",
    "# Pass maximum features=10000 to the vectorizer to avoid running out of memory\n",
    "\n",
    "\n",
    "# train it on the training set (HINT: one can pass an array of texts)\n",
    "\n",
    "\n",
    "# Look at the resulting vocabulary\n",
    "# vectorizer.vocabulary_\n",
    "\n",
    "\n",
    "# Transform the test data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "raises-exception"
    ]
   },
   "outputs": [],
   "source": [
    "# Build a 3 layer simple vanilla neural network\n",
    "# Dont forget to add dropout\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "model.compile(optimizer=\"adam\", loss=\"binary_crossentropy\", metrics=[\"accuracy\"])\n",
    "\n",
    "# We need to convert the sparse vector to dense\n",
    "results = model.fit(\n",
    "    train.todense(),\n",
    "    y_train,\n",
    "    epochs=10,\n",
    "    batch_size=512,\n",
    "    validation_data=(test.todense(), y_test),\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "solution"
    ]
   },
   "outputs": [],
   "source": [
    "# Solution\n",
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "\n",
    "vectorizer = CountVectorizer(\n",
    "    stop_words=list(set(stopwords.words(\"english\")).union(set(punctuation))),\n",
    "    tokenizer=LemmaTokenizer(),\n",
    "    max_features=20000,\n",
    ")\n",
    "\n",
    "train = vectorizer.fit_transform(X_train)\n",
    "\n",
    "# print(vectorizer.vocabulary_)\n",
    "\n",
    "test = vectorizer.transform(X_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "solution"
    ]
   },
   "outputs": [],
   "source": [
    "# Build a 3 layer simple vanilla neural network\n",
    "# Dont forget to add dropout\n",
    "\n",
    "from tensorflow.keras.layers import Dense, Dropout\n",
    "from tensorflow.keras.models import Sequential\n",
    "\n",
    "\n",
    "model = Sequential()\n",
    "model.add(Dense(50, activation=\"relu\", input_shape=(test.shape[1],)))\n",
    "# Hidden - Layers\n",
    "model.add(Dropout(0.5))\n",
    "model.add(Dense(30, activation=\"relu\"))\n",
    "model.add(Dropout(0.5))\n",
    "model.add(Dense(20, activation=\"relu\"))\n",
    "# Output- Layer\n",
    "model.add(Dense(1, activation=\"sigmoid\"))\n",
    "model.summary()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "solution"
    ]
   },
   "outputs": [],
   "source": [
    "model.compile(optimizer=\"adam\", loss=\"binary_crossentropy\", metrics=[\"accuracy\"])\n",
    "\n",
    "# We need to convert the sparse vector to dense\n",
    "results = model.fit(\n",
    "    train.todense(),\n",
    "    y_train,\n",
    "    epochs=10,\n",
    "    batch_size=512,\n",
    "    validation_data=(test.todense(), y_test),\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Word embeddings: Featurized representation of words"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<center>\n",
    "<figure>\n",
    "<img src=\"./images/neuralnets/word_embedding.png\" width=\"700\"/>\n",
    "<figcaption>Embedding words in a higher dimensional feature space</figcaption>\n",
    "</figure>\n",
    "</center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "| <div style=\"width:150px\"></div>  | <div style=\"width:150px\"></div> Apple | <div style=\"width:150px\"></div> Orange  | <div style=\"width:150px\"></div> Pants | <div style=\"width:150px\"></div> Tiger |\n",
    "| :-----------: | :-----------: | :-----------: | :-----------: | :-----------: |\n",
    "| Animal |0.01 |0.015 |0.006 | 0.96 |\n",
    "| Fruit | 0.99 | 0.97 | -0.001 | -0.01 |\n",
    "| Clothing | 0.02 | 0.07 | 0.97 | 0.002 |\n",
    "| FeatureX | - | - | - | - |\n",
    "| FeatureY | - | - | - | - |\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Some models to compute word embeddings:\n",
    "- Word2Vec\n",
    "- GloVe\n",
    "- fastText\n",
    "- BERT"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Pretrained embeddings\n",
    "\n",
    "Example:\n",
    "https://fasttext.cc/docs/en/crawl-vectors.html"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import fasttext\n",
    "\n",
    "ft = fasttext.load_model(\"./data/cc.en.100.bin\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "words = [\"cat\", \"dog\", \"cream\", \"pizza\", \"car\", \"tractor\"]\n",
    "\n",
    "word_vectors = {}\n",
    "for word in words:\n",
    "    word_vectors[word] = ft.get_word_vector(word)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from scipy import spatial\n",
    "\n",
    "\n",
    "def compute_similarity(a, b):\n",
    "    \"\"\"This function computes cosine similarity between two vectors\"\"\"\n",
    "    return 1 - spatial.distance.cosine(a, b)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# similarities = np.zeros([len(words)]*2)\n",
    "similarities = pd.DataFrame(columns=words, index=words)\n",
    "for word_a, vec_a in word_vectors.items():\n",
    "    for word_b, vec_b in word_vectors.items():\n",
    "        similarities.at[word_a, word_b] = compute_similarity(vec_a, vec_b)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "similarities"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###  Recurrent Neural Networks (RNNs)\n",
    "\n",
    "RNNs are used for problems such as time-series data, speech recognition and translation.\n",
    "\n",
    "<center>\n",
    "<figure>\n",
    "<img src=\"./images/neuralnets/RNNs.png\" width=\"700\"/>\n",
    "<figcaption>Recurrent neural network</figcaption>\n",
    "</figure>\n",
    "</center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are newer variants that overcome some issues with a vanilla RNN:\n",
    "- Gated Recurrent Unit (GRU)\n",
    "- Long Short Term Memory (LSTM)"
   ]
  },
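  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Below is a minimal sketch (not a tuned model) of an LSTM classifier for the raw IMDB reviews loaded earlier, assuming `X_train`/`y_train` are still in memory. It uses Keras' `TextVectorization` layer to map raw strings to integer sequences; the layer sizes and sequence length are illustrative choices:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# A minimal sketch of an LSTM-based sentiment classifier; the layer sizes\n",
    "# and sequence length are illustrative, not tuned values.\n",
    "import tensorflow as tf\n",
    "from tensorflow.keras.layers import LSTM, Dense, Embedding, TextVectorization\n",
    "\n",
    "max_tokens, seq_len = 10000, 200\n",
    "\n",
    "vectorize = TextVectorization(max_tokens=max_tokens, output_sequence_length=seq_len)\n",
    "vectorize.adapt(X_train)  # learn the vocabulary from the raw training reviews\n",
    "\n",
    "rnn_model = tf.keras.Sequential(\n",
    "    [\n",
    "        tf.keras.Input(shape=(1,), dtype=tf.string),\n",
    "        vectorize,  # string -> integer sequence\n",
    "        Embedding(max_tokens, 32),  # integer -> learned word vector\n",
    "        LSTM(32),  # summarize the sequence into one vector\n",
    "        Dense(1, activation=\"sigmoid\"),  # good/bad review\n",
    "    ]\n",
    ")\n",
    "rnn_model.compile(optimizer=\"adam\", loss=\"binary_crossentropy\", metrics=[\"accuracy\"])\n",
    "rnn_model.summary()\n",
    "\n",
    "# Training is slow on a CPU, hence commented out:\n",
    "# rnn_model.fit(X_train, y_train, epochs=2, batch_size=128,\n",
    "#               validation_data=(X_test, y_test))"
   ]
  },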
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Example walkthrough** : https://keras.io/examples/vision/video_classification/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###  Transformer models\n",
    "\n",
    "Transformers are models based on an encoder-decoder architecture and mainly using the attention.\n",
    "\n",
    "<center>\n",
    "<table>\n",
    "    <tr><td>\n",
    "        <figure>\n",
    "        <img src=\"./images/neuralnets/transformer.png\" width=\"400\"/>\n",
    "        <figcaption>Transformer architecture</figcaption>\n",
    "        </figure>\n",
    "    </td></tr>\n",
    "    <tr><td><center><sub>Source: <a href=\"https://arxiv.org/abs/1706.03762/\">\"Attention is all you need\": https://arxiv.org/abs/1706.03762/</a></sub></center></td></tr>\n",
    "</table>\n",
    "</center>ce"
chadhat's avatar
chadhat committed
   ]
  },
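  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the idea of attention concrete, here is a minimal NumPy sketch of the scaled dot-product attention used in the transformer, $\\text{Attention}(Q, K, V) = \\text{softmax}\\left(\\dfrac{QK^T}{\\sqrt{d_k}}\\right)V$ (the shapes below are illustrative):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Minimal sketch of scaled dot-product attention (single head, no masking)\n",
    "import numpy as np\n",
    "\n",
    "\n",
    "def scaled_dot_product_attention(Q, K, V):\n",
    "    d_k = Q.shape[-1]\n",
    "    scores = Q @ K.T / np.sqrt(d_k)  # similarity of each query to each key\n",
    "    # softmax over the keys, with the usual max-subtraction for stability\n",
    "    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))\n",
    "    weights = weights / weights.sum(axis=-1, keepdims=True)\n",
    "    return weights @ V  # attention-weighted sum of the values\n",
    "\n",
    "\n",
    "rng = np.random.default_rng(42)\n",
    "Q = rng.normal(size=(3, 4))  # 3 query positions, d_k = 4\n",
    "K = rng.normal(size=(5, 4))  # 5 key positions\n",
    "V = rng.normal(size=(5, 4))  # one value vector per key\n",
    "print(scaled_dot_product_attention(Q, K, V).shape)  # -> (3, 4)"
   ]
  },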
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Putting it all together\n",
    "\n",
    "https://paperswithcode.com/sota/sentiment-analysis-on-imdb"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}