📝 NAI - Practice Tests

Theory Questions & Computational Exercises for Exam Preparation

⚠️ Exam Structure

Tip: Practice computational tasks WITHOUT a calculator — they're not allowed on the exam! Use fractions instead of decimals.
📚 Part I: Elements of Machine Learning
1

Introduction to AI

🎯 Theory Questions

THEORY

Q1.1: What is the main goal of Artificial Intelligence?

To create machines/algorithms that can "think" — simulate human cognitive abilities like learning, reasoning, problem-solving, and decision-making.
THEORY

Q1.2: Name 4 mathematical tools commonly used in AI.

  • Logic (propositional, predicate)
  • Probability calculus
  • Optimization theory
  • Linear algebra (vectors, matrices)
THEORY

Q1.3: What is the difference between supervised and unsupervised learning?

Supervised: Training data includes labeled examples (input-output pairs). Model learns to predict outputs for new inputs.

Unsupervised: No labels provided. Model discovers hidden patterns or structure in data (e.g., clustering, dimensionality reduction).
ADVANCED

Q1.4: Explain the difference between classification and regression.

Classification: Output is a discrete class label (e.g., "spam" vs "not spam", species of iris).

Regression: Output is a continuous numerical value (e.g., house price, temperature prediction).
2

k-Nearest Neighbors (k-NN) Classifier

🎯 Theory Questions

THEORY

Q2.1: Describe the k-NN algorithm in 3 steps.

  1. Calculate distance from query point to all training points
  2. Select the k closest neighbors
  3. Classify by majority vote among the k neighbors
THEORY

Q2.2: Write the formula for Euclidean distance between vectors x and y.

d(x,y) = √(Σ (xᵢ - yᵢ)²) = √((x₁-y₁)² + (x₂-y₂)² + ... + (xₙ-yₙ)²)
THEORY

Q2.3: Why is k-NN called a "lazy learner"?

Because it has no explicit training phase. All computation is deferred until prediction time — it simply stores the training data and computes distances when a new query arrives.

🔢 Computational Exercises

PRACTICAL

P2.1: Given vectors belonging to 2 classes:

Class A: a=(1,0), b=(2,1), c=(0,2)
Class B: d=(4,3), e=(5,4), f=(3,5)

Classify x=(2,3) using k-NN with k=3.
Step 1: Calculate squared distances (easier than full Euclidean)
d²(x,a) = (2-1)² + (3-0)² = 1 + 9 = 10
d²(x,b) = (2-2)² + (3-1)² = 0 + 4 = 4
d²(x,c) = (2-0)² + (3-2)² = 4 + 1 = 5
d²(x,d) = (2-4)² + (3-3)² = 4 + 0 = 4
d²(x,e) = (2-5)² + (3-4)² = 9 + 1 = 10
d²(x,f) = (2-3)² + (3-5)² = 1 + 4 = 5
Step 2: Find 3 nearest
b (A): d²=4, d (B): d²=4, c (A): d²=5 — or f (B): d²=5
Tie at d²=5: c(A) and f(B). Need tie-breaker. If alphabetical: c chosen.

Step 3: Vote
3 nearest: b(A), d(B), c(A) → A:2, B:1

Answer: x is classified as Class A
PRACTICAL

P2.2: Calculate the Euclidean distance between v=(3,-1,2) and w=(1,2,0).

d(v,w) = √((3-1)² + (-1-2)² + (2-0)²)
= √(4 + 9 + 4)
= √17
Answer: √17 ≈ 4.12
3

Perceptron

🎯 Theory Questions

THEORY

Q3.1: What is a perceptron? Describe its components.

A perceptron is a simple mathematical model of a neuron. Components:
  • Inputs (x): Feature vector
  • Weights (w): Importance of each input
  • Threshold (θ): Activation threshold
  • Activation function: Determines output based on net = w·x - θ
THEORY

Q3.2: Write the perceptron learning rule (delta rule) for weights and threshold.

w' = w + α × (d - y) × x
θ' = θ - α × (d - y)
Where: α = learning rate, d = desired output, y = actual output, x = input vector
THEORY

Q3.3: What is the difference between unipolar and bipolar activation?

Unipolar: Output values are {0, 1} (discrete) or [0, 1] (continuous/sigmoid)
Bipolar: Output values are {-1, 1} (discrete) or [-1, 1] (continuous/tanh)

🔢 Computational Exercises

PRACTICAL

P3.1: Given: p=(1,-1,2), w=(2,1,-1), θ=1, α=1

a) Compute perceptron output (unipolar discrete)

b) If output is incorrect, compute new weights

c) Will threshold increase or decrease?

a) Compute output:
net = w·p = 2×1 + 1×(-1) + (-1)×2 = 2 - 1 - 2 = -1
net = -1 < θ=1 → Perceptron does NOT activate
Output y = 0
b) If incorrect (should have been d=1):
w' = w + α(d-y)x = (2,1,-1) + 1×(1-0)×(1,-1,2)
w' = (2,1,-1) + (1,-1,2) = (3, 0, 1)
c) Threshold:
θ' = θ - α(d-y) = 1 - 1×(1-0) = 1 - 1 = 0
Threshold DECREASES (from 1 to 0)
PRACTICAL

P3.2: Given: p=(2,0,-1), w=(-1,3,2), θ=-2. Compute the output for a unipolar discrete perceptron.

net = w·p = (-1)×2 + 3×0 + 2×(-1) = -2 + 0 - 2 = -4
net = -4 < θ=-2 → Does NOT activate
Output y = 0
4

Classifier Evaluation

🎯 Theory Questions

THEORY

Q4.1: What are TP, TN, FP, FN in a confusion matrix?

  • TP (True Positive): Actually positive, predicted positive ✓
  • TN (True Negative): Actually negative, predicted negative ✓
  • FP (False Positive): Actually negative, predicted positive ✗ (Type I error)
  • FN (False Negative): Actually positive, predicted negative ✗ (Type II error)
THEORY

Q4.2: Write formulas for Accuracy, Precision, Recall, and F-measure.

Accuracy = (TP + TN) / Total
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F-measure = 2 × P × R / (P + R)
ADVANCED

Q4.3: What does the F-measure represent and why is it useful?

F-measure is the harmonic mean of Precision and Recall. It provides a single score that balances both metrics. It's useful when you want to find the best trade-off between minimizing false positives (precision) and minimizing false negatives (recall).

🔢 Computational Exercises

PRACTICAL

P4.1: Given confusion matrix:

Pred P Pred N
Actual P 40 10
Actual N 20 30

Calculate: Accuracy, Precision, Recall, F-measure

TP=40, FN=10, FP=20, TN=30, Total=100

Accuracy = (40+30)/100 = 70/100 = 7/10

Precision = 40/(40+20) = 40/60 = 2/3

Recall = 40/(40+10) = 40/50 = 4/5

F-measure = 2×(2/3)×(4/5) / (2/3 + 4/5)
= 2×(8/15) / (10/15 + 12/15)
= (16/15) / (22/15)
= 16/22 = 8/11
PRACTICAL

P4.2: A classifier has: TP=15, TN=25, FP=5, FN=5. Calculate all metrics.

Total = 15+25+5+5 = 50

Accuracy = (15+25)/50 = 40/50 = 4/5

Precision = 15/(15+5) = 15/20 = 3/4

Recall = 15/(15+5) = 15/20 = 3/4

F-measure = 2×(3/4)×(3/4) / (3/4+3/4)
= 2×(9/16) / (6/4) = (18/16) / (24/16) = 18/24 = 3/4
5

Multi-layer Neural Networks

🎯 Theory Questions

THEORY

Q5.1: What problem can a multi-layer network solve that a single perceptron cannot?

A single perceptron can only solve linearly separable problems. Multi-layer networks can solve non-linearly separable problems like XOR, because hidden layers can learn complex decision boundaries.
THEORY

Q5.2: What is the purpose of the backpropagation algorithm?

Backpropagation is a method for computing gradients of the loss function with respect to all weights in the network. It propagates the error backward from the output layer to the input layer, allowing us to update weights using gradient descent.
ADVANCED

Q5.3: Why do we need non-linear activation functions in hidden layers?

Without non-linear activation functions, a multi-layer network would be equivalent to a single linear transformation, regardless of depth. Non-linear activations allow the network to learn complex, non-linear relationships in the data.
6

Knowledge Representation & Overfitting

🎯 Theory Questions

THEORY

Q6.1: What is overfitting? Give an example.

Overfitting occurs when a model is too complex and learns the training data too well, including noise and outliers. It performs excellently on training data but poorly on new, unseen data.

Example: Using a 100-leaf decision tree for a 150-sample Iris dataset.
THEORY

Q6.2: What is cross-validation and why do we use it?

Cross-validation is a technique to evaluate model performance by splitting data into multiple train-test folds. In k-fold CV:
  1. Split data into k equal parts
  2. Train on k-1 parts, test on 1 part
  3. Repeat k times, average results
It provides more reliable performance estimates than a single train-test split.
THEORY

Q6.3: What is the difference between a decision rule and a decision tree?

Decision rule: An IF-THEN statement (e.g., IF temperature=hot AND humidity=high THEN play=no)

Decision tree: A hierarchical structure of attribute tests that can be converted into a set of decision rules.
ADVANCED

Q6.4: What is entropy and how is it used in decision trees?

Entropy measures impurity or uncertainty in a dataset:
H(S) = -Σ pᵢ × log₂(pᵢ)
In decision trees (ID3 algorithm), we choose the attribute that provides the highest information gain (reduction in entropy) at each split.
7

Clustering (k-Means)

🎯 Theory Questions

THEORY

Q7.1: Describe the k-Means algorithm in 4 steps.

  1. Initialize k centroids (randomly or given)
  2. Assign each data point to the nearest centroid
  3. Recompute centroids as the mean of assigned points
  4. Repeat steps 2-3 until no points change clusters
THEORY

Q7.2: How is the centroid of a cluster computed?

The centroid is the mean (average) of all points in the cluster:
centroid = (1/n) × Σ xᵢ for all points in cluster
Each component is averaged separately.
THEORY

Q7.3: What is hierarchical clustering? Name its two types.

Hierarchical clustering builds a tree (dendrogram) of clusters.
  • Agglomerative (bottom-up): Start with each point as its own cluster, merge closest pairs
  • Divisive (top-down): Start with one cluster, recursively split

🔢 Computational Exercises

PRACTICAL

P7.1: Given vectors: a=(0,0), b=(1,0), c=(4,4), d=(5,4)

Initial centroids: c₁=(0,0), c₂=(4,4), k=2

Perform one iteration of k-Means.

Step 1: Assign to nearest centroid
d²(a,c₁)=0, d²(a,c₂)=32 → a→Cluster1
d²(b,c₁)=1, d²(b,c₂)=25 → b→Cluster1
d²(c,c₁)=32, d²(c,c₂)=0 → c→Cluster2
d²(d,c₁)=41, d²(d,c₂)=1 → d→Cluster2
Step 2: Update centroids
c₁ = ((0+1)/2, (0+0)/2) = (0.5, 0)
c₂ = ((4+5)/2, (4+4)/2) = (4.5, 4)
PRACTICAL

P7.2: Given: a=(1,1), b=(2,1), c=(5,5). Initial centroids: c₁=(1,1), c₂=(2,1). Simulate k-Means until convergence.

Iteration 1:
d²(a,c₁)=0, d²(a,c₂)=1 → a→C1
d²(b,c₁)=1, d²(b,c₂)=0 → b→C2
d²(c,c₁)=32, d²(c,c₂)=25 → c→C2
New centroids: c₁=(1,1), c₂=((2+5)/2, (1+5)/2)=(3.5, 3)

Iteration 2:
d²(a,c₁)=0, d²(a,c₂)=10.25 → a→C1
d²(b,c₁)=1, d²(b,c₂)=6.25 → b→C1
d²(c,c₁)=32, d²(c,c₂)=6.25 → c→C2
C1={a,b}, C2={c}
c₁=(1.5,1), c₂=(5,5)

Iteration 3: No changes → Converged!
8

Naive Bayes Classifier

🎯 Theory Questions

THEORY

Q8.1: What is the "naive" assumption in Naive Bayes?

The "naive" assumption is that all features are conditionally independent given the class. This means:
P(f₁, f₂, ..., fₙ | Class) = P(f₁|Class) × P(f₂|Class) × ... × P(fₙ|Class)
THEORY

Q8.2: Write Bayes' theorem in the context of classification.

P(Class | Features) ∝ P(Class) × P(Features | Class)
We classify to the class with the highest posterior probability.
THEORY

Q8.3: What is Laplace smoothing and when do we use it?

Laplace smoothing is used when a conditional probability is zero (no occurrences in training data). We add 1 to the numerator and the number of possible values to the denominator:
P(value|Class) = (count + 1) / (total + num_possible_values)

🔢 Computational Exercises

PRACTICAL

P8.1: Given training data for playing tennis:

Weather Wind Play?
Sunny Weak No
Sunny Strong No
Rainy Weak Yes
Rainy Weak Yes
Rainy Strong No
Sunny Weak Yes

Classify: (Sunny, Strong)

Count classes: Yes=3, No=3 (total=6)
P(Yes)=3/6=1/2, P(No)=3/6=1/2

Conditional probabilities:
P(Sunny|Yes) = 1/3, P(Sunny|No) = 2/3
P(Strong|Yes) = 0/3, P(Strong|No) = 2/3
Need Laplace smoothing for P(Strong|Yes)=0:
P(Strong|Yes) = (0+1)/(3+2) = 1/5
Also smooth P(Strong|No):
P(Strong|No) = (2+1)/(3+2) = 3/5
Calculate posteriors:
P(Yes|Sunny,Strong) ∝ 1/2 × 1/3 × 1/5 = 1/30
P(No|Sunny,Strong) ∝ 1/2 × 2/3 × 3/5 = 6/30 = 1/5
Since 1/5 > 1/30: Classification = No
📐 Part II: Elements of Discrete Optimization
9

Discrete Optimization: Knapsack, TSP, Greedy

🎯 Theory Questions

THEORY

Q9.1: Define the 0/1 Knapsack problem.

Given a set of items, each with a value and weight, and a knapsack with a maximum capacity, select a subset of items to maximize total value without exceeding capacity. Each item can be taken (1) or left (0).
THEORY

Q9.2: What is the greedy approach? When does it work optimally?

Greedy approach makes the locally optimal choice at each step, hoping to find a global optimum. It works optimally for problems with matroid structure (Rado-Edmonds theorem) or special cases like Fractional Knapsack.
THEORY

Q9.3: What is the difference between 0/1 and Fractional Knapsack?

0/1 Knapsack: Items must be taken whole or not at all. Greedy doesn't always work; use dynamic programming.

Fractional Knapsack: Items can be taken partially. Greedy by value/weight ratio gives optimal solution.

🔢 Computational Exercises

PRACTICAL

P9.1: Knapsack: Capacity=7, Items:

Item Value Weight
a 4 3
b 3 2
c 2 1
d 5 4

Find the optimal solution.

Check feasible combinations:
{a}: w=3, v=4
{b}: w=2, v=3
{c}: w=1, v=2
{d}: w=4, v=5
{a,b}: w=5, v=7
{a,c}: w=4, v=6
{a,b,c}: w=6, v=9 ✓
{b,c}: w=3, v=5
{b,d}: w=6, v=8
{c,d}: w=5, v=7
{a,c,b}: w=6, v=9 ✓
{b,c,d}: w=7, v=10 ✓ BEST!
Optimal: {b,c,d} with value=10, weight=7
PRACTICAL

P9.2: Capacity=6, values=(5,3,2,4), weights=(3,2,1,3). Find optimal value.

Items: a(v=5,w=3), b(v=3,w=2), c(v=2,w=1), d(v=4,w=3)
{a,b}: w=5, v=8
{a,c}: w=4, v=7
{a,b,c}: w=6, v=10 ✓
{b,c,d}: w=6, v=9
{a,d}: w=6, v=9
Optimal value = 10 (items {a,b,c})
10

Local Search Heuristics

🎯 Theory Questions

THEORY

Q10.1: What is the hill climbing algorithm?

Hill climbing is a local search algorithm that:
  1. Start with an initial solution
  2. Look at neighboring solutions
  3. Move to a better neighbor if one exists
  4. Stop when no better neighbor exists (local optimum)
THEORY

Q10.2: What is the main weakness of hill climbing? How does simulated annealing address it?

Weakness: Hill climbing gets stuck in local optima — it cannot accept worse solutions to escape.

Simulated Annealing: Accepts worse solutions with a probability that decreases over time (temperature). This allows exploration early on and convergence later.
ADVANCED

Q10.3: What is a "neighborhood" in local search? Give an example for TSP.

A neighborhood is the set of solutions reachable from the current solution by a single move operation.

TSP Example: 2-opt neighborhood — swap two edges in the tour. For tour A-B-C-D-A, a neighbor might be A-C-B-D-A (reverse segment B-C).

Quick Reference: Vector Operations

Essential Formulas for Exam

PRACTICE

V1: Given v=(2,-1,3) and w=(1,2,-1). Calculate:

a) Dot product v·w

b) Are they orthogonal?

c) Length of v

d) Distance between v and w

a) v·w = 2×1 + (-1)×2 + 3×(-1) = 2 - 2 - 3 = -3

b) Since v·w = -3 ≠ 0, they are NOT orthogonal

c) |v| = √(4 + 1 + 9) = √14

d) d(v,w) = √((2-1)² + (-1-2)² + (3-(-1))²)
= √(1 + 9 + 16) = √26
PRACTICE

V2: Which is longer: a=(1,2,2) or b=(2,1,1)?

|a|² = 1 + 4 + 4 = 9 → |a| = 3
|b|² = 4 + 1 + 1 = 6 → |b| = √6 ≈ 2.45
a is longer (3 > √6)