🧠 NAI - Tools of Artificial Intelligence

Comprehensive Lecture Notes | Key Concepts & Formulas

1. Introduction to Artificial Intelligence

What is AI?

Goal: To create machines/algorithms that can "think" — simulate human cognitive abilities like learning, reasoning, and problem-solving.

Historical Context

Key Tools Used in AI

Mathematical Foundations

  • Logic & reasoning
  • Probability calculus
  • Linear algebra
  • Optimization theory

Application Domains

  • Game theory & economics
  • Robotics & control
  • Natural language processing
  • Computer vision
Much of AI involves optimization: training ML models, tuning hyperparameters, finding optimal routes, selecting the best game moves, and more.

2. Vectors & Matrices

Vector Operations

| Operation | Formula | Result |
| --- | --- | --- |
| Dot Product | x·y = Σ xᵢyᵢ | Scalar |
| Vector Length | ||x|| = √(Σ xᵢ²) | Scalar |
| Euclidean Distance | d(x,y) = √(Σ (xᵢ-yᵢ)²) | Scalar |

Orthogonality

Two vectors are orthogonal (perpendicular) if and only if their dot product equals zero:
x·y = 0 ⟺ x ⊥ y
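
A minimal sketch of these operations with NumPy (the example vectors are illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0, 2.0])
y = np.array([2.0, -1.0, 0.0])

dot = np.dot(x, y)             # dot product: sum of x_i * y_i
length = np.linalg.norm(x)     # vector length ||x|| = sqrt(sum of x_i^2)
dist = np.linalg.norm(x - y)   # Euclidean distance between x and y

print(dot, length, dist)       # 0.0  3.0  ≈3.74
print(np.isclose(dot, 0.0))    # True -> x and y are orthogonal
```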

Matrix Operations

Note: Matrix multiplication is NOT commutative: AB ≠ BA in general. Also, (AB)ᵀ = BᵀAᵀ
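
A quick NumPy check of both facts on two small example matrices:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 1], [1, 0]])

print(np.array_equal(A @ B, B @ A))          # False: AB != BA in general
print(np.array_equal((A @ B).T, B.T @ A.T))  # True: (AB)^T = B^T A^T
```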

3. Machine Learning Basics

Decision Table Structure

Data is organized as a table where each row is a case/observation and each column is an attribute (feature). The last column is typically the decision/target attribute.

Types of Learning

Supervised Learning

Training data includes labeled examples (input-output pairs)

  • Classification: Discrete output (class labels)
  • Regression: Continuous output (numbers)

Unsupervised Learning

No labels — discover hidden patterns

  • Clustering: Group similar items
  • Dimensionality Reduction: PCA

Classic Example: Iris Dataset

Classify 3 species of iris flowers based on 4 attributes:
  • Sepal length (cm)
  • Sepal width (cm)
  • Petal length (cm)
  • Petal width (cm)
Species: Iris-setosa, Iris-versicolor, Iris-virginica
Unlabeled data is often easier to obtain (from instruments/computers), while labeled data requires human intervention (e.g., manual sentiment analysis).
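
If scikit-learn is available, the Iris data can be loaded directly; a minimal sketch (the 70/30 split is just illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target            # 150 cases, 4 attributes, 3 classes

# Hold out part of the labeled data to evaluate a supervised classifier later
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

print(X_train.shape, X_test.shape)       # (105, 4) (45, 4)
print(iris.target_names)                 # ['setosa' 'versicolor' 'virginica']
```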

4. Classifier Evaluation

Confusion Matrix

| | Predicted Positive | Predicted Negative |
| --- | --- | --- |
| Actually Positive | TP (True Positive) | FN (False Negative) |
| Actually Negative | FP (False Positive) | TN (True Negative) |

Key Metrics

| Metric | Formula | Meaning |
| --- | --- | --- |
| Accuracy | (TP + TN) / Total | Overall correctness |
| Precision | TP / (TP + FP) | Of predicted positives, how many are correct? |
| Recall | TP / (TP + FN) | Of actual positives, how many did we find? |
| F-measure | 2·P·R / (P + R) | Harmonic mean of Precision & Recall |
Precision vs Recall Trade-off: High precision = few false positives; High recall = few false negatives. F-measure balances both.
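
A small worked example, computing all four metrics from hypothetical confusion-matrix counts:

```python
# Hypothetical confusion-matrix counts
TP, FN, FP, TN = 40, 10, 5, 45

total = TP + FN + FP + TN
accuracy = (TP + TN) / total          # 0.85
precision = TP / (TP + FP)            # 40/45 ≈ 0.889
recall = TP / (TP + FN)               # 40/50 = 0.8
f_measure = 2 * precision * recall / (precision + recall)  # ≈ 0.842

print(accuracy, precision, recall, f_measure)
```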

5. k-Nearest Neighbors (k-NN)

Algorithm

To classify a query point:
  1. Compute the distance from the query to every training example.
  2. Select the k training examples with the smallest distances.
  3. Predict by majority vote of their class labels (or their average, for regression).

Distance Metrics

Euclidean: d(x,y) = √(Σ (xᵢ - yᵢ)²)

Pro tip: Compare d² instead of d (skip square root) — same ordering, easier math!

Choosing k

Small k is sensitive to noise; large k smooths over class boundaries. An odd k avoids ties in two-class problems, and in practice k is often chosen by cross-validation.

k-NN is a lazy learner — no explicit training phase. All computation happens at prediction time.
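
A minimal plain-Python sketch of such a classifier, using squared distances as suggested above (the toy data and names are illustrative):

```python
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among the k nearest training points."""
    # Squared Euclidean distance is enough: it preserves the ordering of distances
    dists = [(sum((a - b) ** 2 for a, b in zip(x, query)), label)
             for x, label in zip(train_X, train_y)]
    dists.sort(key=lambda pair: pair[0])
    top_k_labels = [label for _, label in dists[:k]]
    return Counter(top_k_labels).most_common(1)[0][0]

# Toy example
train_X = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.8)]
train_y = ["A", "A", "B", "B"]
print(knn_predict(train_X, train_y, (1.1, 0.9), k=3))   # "A"
```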

6. Naive Bayes Classifier

Bayes' Theorem

P(Class | Features) ∝ P(Class) × P(Features | Class)

The "Naive" Assumption

Assume all features are conditionally independent given the class:
P(f₁, f₂, ..., fₙ | Class) = P(f₁|Class) × P(f₂|Class) × ... × P(fₙ|Class)

Classification Rule

Predict the class with the highest (naive) posterior:
class* = argmax over classes of P(Class) × P(f₁|Class) × ... × P(fₙ|Class)

Laplace Smoothing

When a count is zero, so that P(feature|Class) = 0 would wipe out the whole product, apply smoothing:
P(value|Class) = (count + 1) / (class_total + number_of_possible_values)
where class_total is the number of training examples of that class.
Despite the "naive" independence assumption often being wrong, Naive Bayes works surprisingly well in practice, especially for text classification!
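
A minimal sketch of a categorical Naive Bayes with Laplace smoothing, following the formulas above (the toy weather data and function names are illustrative):

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Estimate P(class) and P(value | class) per feature, per the naive assumption."""
    class_counts = Counter(labels)
    # value_counts[feature_index][class][value] = how often `value` occurs in that class
    value_counts = defaultdict(lambda: defaultdict(Counter))
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[i][label][value] += 1
    # Number of possible values per feature, needed for Laplace smoothing
    possible = {i: {row[i] for row in rows} for i in range(len(rows[0]))}
    return class_counts, value_counts, possible, len(labels)

def predict_nb(model, row):
    class_counts, value_counts, possible, n = model
    best_class, best_score = None, -1.0
    for cls, cls_total in class_counts.items():
        score = cls_total / n                       # prior P(Class)
        for i, value in enumerate(row):
            count = value_counts[i][cls][value]
            # Laplace smoothing: (count + 1) / (class total + number of possible values)
            score *= (count + 1) / (cls_total + len(possible[i]))
        if score > best_score:
            best_class, best_score = cls, score
    return best_class

rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
labels = ["no", "no", "yes", "yes"]
model = train_nb(rows, labels)
print(predict_nb(model, ("rainy", "mild")))   # "yes"
```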

7. Perceptrons & Neural Networks

Perceptron Model

A perceptron is a simple mathematical model of a neuron:
net = Σ wᵢxᵢ = w·x (weighted sum)
output = activation(net - θ)
where θ (theta) is the threshold.

Activation Functions

| Type | Output Type | Values |
| --- | --- | --- |
| Unipolar (discrete) | Binary | {0, 1} |
| Bipolar (discrete) | Binary | {-1, 1} |
| Unipolar (continuous) | Sigmoid | [0, 1] |
| Bipolar (continuous) | Tanh | [-1, 1] |

Perceptron Learning Rule (Delta Rule)

w' = w + α × (d - y) × x
θ' = θ - α × (d - y)
Key insight: If the perceptron doesn't activate when it should → decrease the threshold. If it activates when it shouldn't → increase the threshold.
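
A minimal sketch of the delta rule with a unipolar step activation, learning the logical AND function (the learning rate and epoch count are illustrative):

```python
def train_perceptron(samples, alpha=0.1, epochs=20):
    """Perceptron with unipolar step activation, trained by the delta rule."""
    w = [0.0, 0.0]
    theta = 0.0
    for _ in range(epochs):
        for x, d in samples:
            net = sum(wi * xi for wi, xi in zip(w, x))
            y = 1 if net - theta >= 0 else 0                       # step activation
            error = d - y
            w = [wi + alpha * error * xi for wi, xi in zip(w, x)]  # w' = w + α(d−y)x
            theta = theta - alpha * error                          # θ' = θ − α(d−y)
    return w, theta

# Learn the logical AND function
samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, theta = train_perceptron(samples)
for x, d in samples:
    net = w[0] * x[0] + w[1] * x[1]
    print(x, 1 if net - theta >= 0 else 0)   # matches d after training
```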

8. Decision Trees & ID3 Algorithm

Knowledge Representation

ID3 Algorithm Concept

Builds trees by selecting the attribute that provides the most information gain at each step.

Entropy

H(S) = -Σ pᵢ log₂(pᵢ)

Information Gain

Gain(S, A) = H(S) - Σᵥ (|Sᵥ|/|S|) × H(Sᵥ)

where Sᵥ is the subset of S in which attribute A takes value v.

Select attribute A with highest information gain at each node.
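
A minimal sketch of the entropy and information-gain computations (the toy data and attribute values are illustrative):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(S) = -Σ p_i log2(p_i) over the class proportions in `labels`."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attribute_index):
    """Gain(S, A) = H(S) - Σ_v (|S_v|/|S|) * H(S_v)."""
    n = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute_index], []).append(label)
    remainder = sum(len(sub) / n * entropy(sub) for sub in subsets.values())
    return entropy(labels) - remainder

# Toy data: attribute 0 splits the classes perfectly, attribute 1 does not
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
labels = ["no", "no", "yes", "yes"]
print(information_gain(rows, labels, 0))   # 1.0 -> ID3 would split on this attribute
print(information_gain(rows, labels, 1))   # 0.5
```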

Decision trees are highly interpretable — they can be converted to simple IF-THEN rules that humans can understand!

9. Model Selection & Overfitting

The Bias-Variance Trade-off

Underfitting (High Bias)

  • Model too simple
  • Can't capture patterns
  • Poor on both train & test

Overfitting (High Variance)

  • Model too complex
  • Memorizes training data
  • Great train, poor test

Model Complexity Examples

Cross-Validation

k-Fold Cross-Validation:
  1. Split data into k equal parts (folds)
  2. Train on k-1 folds, test on 1 fold
  3. Repeat k times (each fold is test once)
  4. Average the results
Cross-validation gives more reliable performance estimates than a single train-test split, especially with limited data.
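
Assuming scikit-learn is installed, a minimal 5-fold cross-validation sketch on the Iris data with a 3-NN classifier:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: train on 4 folds, test on the remaining fold, 5 times
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=5)
print(scores)          # one accuracy value per fold
print(scores.mean())   # averaged estimate of generalization performance
```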

10. Clustering Algorithms

k-Means Algorithm

Centroid Update:
cᵢ = (1/|Cᵢ|) × Σ x for all x in cluster Cᵢ

Hierarchical Clustering

k-Means Tips

k-Means is sensitive to initial centroid positions. Multiple runs with different initializations can help find better solutions.
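
A minimal sketch of the k-Means loop (Lloyd's algorithm) in plain Python; the sample points are illustrative, and in practice you would also add a convergence check and several random restarts:

```python
import random

def kmeans(points, k, iterations=100):
    """Assign each point to the nearest centroid, then recompute the centroids."""
    centroids = random.sample(points, k)               # result depends on this initial choice
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its closest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: each centroid becomes the mean of the points assigned to it
        for i, cluster in enumerate(clusters):
            if cluster:                                 # keep old centroid if cluster is empty
                centroids[i] = tuple(sum(coords) / len(cluster) for coords in zip(*cluster))
    return centroids, clusters

points = [(1.0, 1.0), (1.5, 1.2), (0.8, 0.9), (5.0, 5.0), (5.5, 4.8)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
```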

11. Optimization & Greedy Algorithms

Role of Optimization in AI

Greedy Approach

Make the locally optimal choice at each step, hoping to find a global optimum. Works for some problems, not all!

Matroids & Rado-Edmonds Theorem

Key insight (Rado-Edmonds theorem): the greedy algorithm yields an optimal solution for every weight function if and only if the underlying independence system is a matroid.

Huffman Coding

A classic greedy algorithm: build an optimal prefix code by repeatedly merging the two least frequent symbols (or subtrees), as sketched below.
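
A minimal sketch of the construction using a binary heap (the symbol frequencies are illustrative):

```python
import heapq

def huffman_code(frequencies):
    """Build a prefix code by repeatedly merging the two least frequent subtrees."""
    # Heap entries: (frequency, tie-breaker, {symbol: code-so-far})
    heap = [(freq, i, {sym: ""}) for i, (sym, freq) in enumerate(frequencies.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, codes1 = heapq.heappop(heap)   # least frequent subtree
        f2, _, codes2 = heapq.heappop(heap)   # second least frequent subtree
        merged = {sym: "0" + code for sym, code in codes1.items()}
        merged.update({sym: "1" + code for sym, code in codes2.items()})
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

print(huffman_code({"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}))
```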

Local Search Methods

12. Knapsack Problem

Problem Definition

Given items with values and weights, and a knapsack with limited capacity, select items to maximize total value without exceeding capacity.

Problem Variants

| Variant | Description | Solution |
| --- | --- | --- |
| 0/1 Knapsack | Take or leave each item | Dynamic Programming |
| Fractional | Can take fractions of items | Greedy (by value/weight ratio) |

Solution Strategy (0/1)

Dynamic programming over items and remaining capacity:
dp[i][w] = max(dp[i-1][w], dp[i-1][w - wᵢ] + vᵢ)  if wᵢ ≤ w, otherwise dp[i-1][w]

Value/Weight Ratio

For fractional knapsack, sort items by value/weight ratio (efficiency) and take items greedily.
efficiency = value / weight
For small 0/1 knapsack instances (such as exam exercises), enumerate all valid combinations. For larger instances, use dynamic programming.
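
Minimal sketches of both strategies, assuming items are given as (value, weight) pairs (the example items are illustrative):

```python
def knapsack_01(items, capacity):
    """0/1 knapsack by dynamic programming over capacity; items are (value, weight)."""
    dp = [0] * (capacity + 1)                        # dp[w] = best value with weight limit w
    for value, weight in items:
        for w in range(capacity, weight - 1, -1):    # reverse: each item used at most once
            dp[w] = max(dp[w], dp[w - weight] + value)
    return dp[capacity]

def knapsack_fractional(items, capacity):
    """Fractional knapsack: greedily take items in decreasing value/weight order."""
    total = 0.0
    for value, weight in sorted(items, key=lambda it: it[0] / it[1], reverse=True):
        take = min(weight, capacity)                 # take the whole item, or what still fits
        total += value * (take / weight)
        capacity -= take
        if capacity == 0:
            break
    return total

items = [(60, 10), (100, 20), (120, 30)]
print(knapsack_01(items, 50))           # 220   (take the 20 kg and 30 kg items)
print(knapsack_fractional(items, 50))   # 240.0 (take 2/3 of the 30 kg item as well)
```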