Conditional Probability¶
 Conditional probability as the name suggests, comes into play when the probability of occurrence of a particular event changes when one or more conditions are satisfied (these conditions again are events). Speaking in technical terms, if X and Y are two events then the conditional probability of X w.r.t Y is denoted by P(XY). So when we talk in terms of conditional probability, just for an example, we make a statement like “The probability of event X given that Y has already occurred”.
 What if X and Y are independent events?
 From the definition of independent events, the occurrence of event X is not dependent on event Y. Therefore P(XY)=P(X).
 What if X and Y are mutually exclusive?
 As X and Y are disjoint events, the probability that X will occurs when Y has already occurred is 0. Therefore, P(XY)=0
 Bayes Theorem
 Product rule states that, P(X ∩ Y) = P(XY) ∗ P(Y)
 Product rule states that, P(Y ∩ X) = P(YX) ∗ P(X)
 as P(X ∩ Y) = P(Y ∩ X)
 P(XY) ∗ P(Y) = P(YX) ∗ P(X)
 P(YX) = $ \frac {P(XY) ∗ P(Y) }{P(X)}$
Naive Bayes algorithm¶
 In Naive Bayes algorithm uses Bayes Theorem. We have to predict Y (the target variable with k – classes), given X (set of $x_1, x_2, .., x_n$ for n – input features). We call this algorithm Naive as algorithm assumes that $x_1, x_2, .., x_n$ are independent.
 Naive Bayes algorithm assumes independence amoung features in predecting target class. Hence
$$ P(Y_k x_1, x_2, …, x_n) = \frac {P(x_1,x_2, …, x_nY_k)} {P(x_1,x_2, …, x_n)} P(Y_k) $$ As $x_1, x_2, …, x_n$ are independent $P(x_1,x_2, …, x_nY_k)$ (likelihood) can be represented as $P(x_1Y_k)P(x_2Y_k)… P(x_nY_k) $, the formula for Naive Bayes algorithm can be represented as below.
$$ P(Y_k x_1, x_2, …, x_n) = \frac {P(x_1Y_k)P(x_2Y_k)… P(x_nY_k)} {P(x_1,x_2, …, x_n)} P(Y_k) $$
No need to compute denominator $P(x_1,x_2, …, x_n)$ as we compare probabilities for different classes $Y_k$. Proir probability
 1) $ P(x_iY_k), i = 1,2,…,n $ – probability of occurrence of $x_i$ given class $y_k$.
 2) $ P(Y_k) $ – probability of occurrence of class – $Y_k$
 We can derive these two proir probability using given dataset (Training Data Set)
 Posterior probability: By using Naive Bayes algorithm we will determine posterior probability – $P(Y_k x_1, x_2, …, x_n)$ .
 Proir probability
 As $x_1, x_2, …, x_n$ are independent $P(x_1,x_2, …, x_nY_k)$ (likelihood) can be represented as $P(x_1Y_k)P(x_2Y_k)… P(x_nY_k) $, the formula for Naive Bayes algorithm can be represented as below.
Laplacian Correction (Laplacian smoothing)¶
 Let $Y_k$ be a class in target, w be a word in feature $x_i$, then maximum likelihood estimator for occurrence of w for class $Y_k$ is
$$ P(w/Y_k) = \frac {count(w,Y_k)}{count(Y_k)} $$  If w is not present in test set, above probability will become 0. To handle this we add 1 to numerator and deominator as shown below.
$$ P(w/Y_k) = \frac {count(w,Y_k) + 1}{count(Y_k) + 1} $$
Naive Bayes – Play Tennis Example¶
 We have got a train dataset, with some input features “Day”, “Outlook”, “Temperature”, “Humidity”, “Wind” and two classes in output feature “PlayTennis” ($Y_k$) “YES”, “NO”
 We have to predict “PlayTennis” – “YES” or “NO” for below records. Which is nothing but calculate posterior probabilities of each class and pick the class which has got more probability
Prior Probabilities of $Y_k$ – (‘YES’, ‘NO’)¶
 Probability(Play Tennis) = P(play) = 9/14
 Probability(Not Play Tennis) = P(NotPlay) = 5/14
To calculate posterior probability we have to follow three steps¶
 1) Constructing a frequency table for each attribute against the target.
 2) Transforming the frequency tables to likelihood tables.
 3) Use the Naive Bayesian equation to calculate the posterior probability for each class. The class with the highest posterior probability is the outcome of prediction.

$$ P(YES  [Outlook=Sunny],[Temperature=Cool],[Humidity=Normal],[Wind=Week]) = \frac {P(Outlook=SunnyYES) * P(Temperature=CoolYES) * P(Humidity=NormalYES) * P(Wind=WeekYES)}{P([Outlook=Sunny],[Temperature=Cool],[Humidity=Normal],[Wind=Week])} * P(YES) $$
$$ = \frac {2/9 * 3/9 * 6/9 * 5/9 * 9/14} {P([Outlook=Sunny],[Temperature=Cool],[Humidity=Normal],[Wind=Week])} $$  $$ P(NO  [Outlook=Sunny],[Temperature=Cool],[Humidity=Normal],[Wind=Week]) = \frac {P(Outlook=SunnyNO) * P(Temperature=CoolNO) * P(Humidity=NormalNO) * P(Wind=WeekNO)}{P([Outlook=Sunny],[Temperature=Cool],[Humidity=Normal],[Wind=Week])} * P(NO) $$
$$ = \frac {3/5 * 1/5 * 1/5 * 2/5 * 5/14} {P([Outlook=Sunny],[Temperature=Cool],[Humidity=Normal],[Wind=Week])} $$  In above two poserior probabilities, the denominator is same, there is no need of calculating it as we have to compare these two probabilities.
 Hence P(YES  X) approximately equal to 0.017636684
 P(NO  X) is approximately equal to 0.0034285714285714
 P(YES  X) is greater than P(NO  X), hence the prediction is PlayTennis = YES
Variations of the Naive Bayes algorithm¶

There are multiple variations of the Naive Bayes algorithm depending on the distribution of P(x_jC_i). Three of the commonly used variations are

Gaussian: The Gaussian Naive Bayes algorithm assumes distribution of features to be Gaussian or normal, i.e., $ P(x_jC_i)=\frac{1}{\sqrt{2\pi\sigma_{C_i}^2}}\exp{\left(\frac{(x_j\mu_{C_j})^2}{2\sigma_{C_i}^2}\right)} $ Read more about it here .

Multinomial: The Multinomial Naive Bayes algorithm is used when the data is distributed multinomially, i.e., multiple occurrences matter a lot. You can read more here .
 Bernoulli: The Bernoulli algorithm is used when the features in the data set are binaryvalued. It is helpful in spam filtration and adult content detection techniques. For more details, click here .

Pros and Cons of Naive Bayes algorithm¶

Pros
 It is a relatively easy algorithm to build and understand.
 It is faster to predict classes using this algorithm than many other classification algorithms.
 It can be easily trained using a small data set.

Cons
 Zero Conditional Probability Problem : If a given class and a feature have 0 frequency, then the conditional probability estimate for that category will come out as 0. This problem is known as the “Zero Conditional Probability Problem”. This is a problem because it wipes out all the information in other probabilities too. There are several sample correction techniques to fix this problem such as “Laplacian Correction”.
 Another disadvantage is the very strong assumption of independence class features that it makes. It is near to impossible to find such data sets in real life.
Text Analytics – BagofWords model¶
 Stop Words are words which are filtered out before or after processing of natural language data (text). Though “stop words” usually refers to the most common words in a language, there is no single universal list of stop words used by all natural language processing tools, and indeed not all tools even use such a list. Some tools specifically avoid removing these stop words to support phrase search.
 Tokenization Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens , perhaps at the same time throwing away certain characters, such as punctuation. Here is an example of tokenization:
 Input: Friends, Romans, Countrymen, lend me your ears;
 Output: Friends, Romans, Countrymen, lend, me, your, ears

Stemming algorithms work by cutting off the end of the word, and in some cases also the beginning while looking for the root. This indiscriminate cutting can be successful in some occasions, but not always, that is why we affirm that this is an approach that has some limitations.
 The bestknown and most popular stemming approach for English is the Porter stemming algorithm, also known as the Porter stemmer. It is a collection of rules (or, if you prefer, heuristics) designed to reflect how English handles inflections. For example, the Porter stemmer chops both apple and apples down to appl, and it stems berry and berries to berri.

Lemmatization on the other hand takes into consideration the morphological analysis of the words. To do so it is necessary to have detailed dictionaries the algorithm can look back at to link the form back to its lemma.
 Lemmatization does not simply chop off inflections, but instead relies on a lexical knowledge base like , WordNet to obtain the correct base forms of words.
 For example, WordNet lemmatizes geese to goose, meanness to meanness and meaning to meaning. In these examples, it outperforms than the Porter stemmer.
 Stemming VS Lemmatization : Lemmatization or Stemming has limits. For example, Porter stems both happiness and happy to happi, while WordNet lemmatizes the two words to themselves. The WordNet lemmatizer also requires specifying the word’s part of speech — otherwise, it assumes the word is a noun. Finally, lemmatization cannot handle unknown words: for example, Porter stems both iphone and iphones to iphon, while WordNet lemmatizes both words to themselves. In general, lemmatization offers better precision than stemming, but at the expense of recall.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
reviews = pd.read_csv('../data/Restaurant_Reviews.tsv', delimiter = '\t', quoting = 3)
reviews.shape
reviews.head()
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
nltk.download('stopwords')
corpus = []
for i in range(0, 1000):
review = re.sub('[^azAZ]', ' ', reviews['Review'][i]) #substitute non alphabets with space
review = review.lower()
review = review.split()
ps = PorterStemmer()
review = [ps.stem(word) for word in review if not word in set(stopwords.words('english'))]
review = ' '.join(review)
corpus.append(review)
review
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 11)
X = cv.fit_transform(corpus).toarray()
y = reviews.iloc[:, 1].values
X[:5,: ]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)
from sklearn.naive_bayes import BernoulliNB
classifier = BernoulliNB()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm