Data Science

The Complete NumPy Guide — Part 2: Math, Statistics, Linear Algebra & File I/O

Deep dive into NumPy's mathematical functions, statistical operations, linear algebra tools, and file I/O. Packed with real-world examples for data scientists.

1. Arithmetic Operations — Element-wise Math {#arithmetic}

All basic arithmetic in NumPy is element-wise and returns a new array; the in-place operators (+=, *=, and friends) modify the existing array instead.

Python
import numpy as np

a = np.array([1, 2, 3, 4, 5])
b = np.array([10, 20, 30, 40, 50])

# Basic operators
print(a + b)     # [11 22 33 44 55]
print(b - a)     # [9 18 27 36 45]
print(a * b)     # [10 40 90 160 250]
print(b / a)     # [10. 10. 10. 10. 10.]  — always float64
print(b // a)    # [10 10 10 10 10] — floor division
print(b % a)     # [0 0 0 0 0] — modulo
print(a ** 2)    # [1 4 9 16 25] — power
print(-a)        # [-1 -2 -3 -4 -5] — negate

# Operator functions (equivalent to operators but more control)
np.add(a, b)           # a + b
np.subtract(b, a)      # b - a
np.multiply(a, b)      # a * b
np.divide(b, a)        # b / a
np.floor_divide(b, a)  # b // a
np.mod(b, a)           # b % a
np.power(a, 2)         # a ** 2
np.negative(a)         # -a
np.absolute(np.array([-1, -2, 3]))  # [1 2 3]
np.fabs(np.array([-1.5, 2.5]))      # [1.5 2.5] — always returns float (no complex support)

# Out parameter — write result into existing array (memory optimization)
result = np.empty(5)
np.add(a, b, out=result)  # No new allocation!

# In-place operations — modify the array without allocating a new one
a += 10         # a = a + 10, in-place
a -= 5
a *= 2
a **= 2
# a /= 3 would raise TypeError here: true division produces floats,
# which cannot be written back into an integer array

Matrix Multiplication (Not Element-wise!)

Python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

# Element-wise multiplication (Hadamard product)
print(A * B)
# [[ 5 12]
#  [21 32]]

# Matrix multiplication — 3 ways
print(np.dot(A, B))    # Traditional
print(A @ B)           # Python 3.5+ — cleaner syntax (preferred!)
print(np.matmul(A, B)) # Same as @, but doesn't work with scalars

# Result:
# [[19 22]
#  [43 50]]

# Dot product of vectors
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print(np.dot(a, b))   # 1*4 + 2*5 + 3*6 = 32
print(a @ b)          # 32

# Outer product
print(np.outer(a, b))
# [[ 4  5  6]
#  [ 8 10 12]
#  [12 15 18]]

# Cross product
a = np.array([1, 0, 0])
b = np.array([0, 1, 0])
print(np.cross(a, b))  # [0 0 1]
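np.matmul and @ also handle stacks of matrices: leading axes are treated as a batch, and each matrix in the stack is multiplied separately. A minimal sketch (the shapes here are illustrative):

```python
import numpy as np

# Batch matrix multiplication: 4 separate (2, 3) @ (3, 2) products
batch = np.arange(24).reshape(4, 2, 3)   # stack of 4 matrices, each (2, 3)
weights = np.ones((3, 2))                # broadcast against every matrix in the stack
out = batch @ weights
print(out.shape)   # (4, 2, 2)
print(out[0])      # [[ 3.  3.]
                   #  [12. 12.]]
```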

2. Universal Functions (ufuncs) — The Real Power {#ufuncs}

ufuncs are functions that operate element-wise on ndarrays. They are compiled in C and extremely fast.

Math ufuncs

Python
import numpy as np

arr = np.array([0, 0.5, 1.0, 1.5, 2.0])

# --- TRIGONOMETRY ---
np.sin(arr)             # Sine
np.cos(arr)             # Cosine
np.tan(arr)             # Tangent
np.arcsin(arr)          # Inverse sine (input must be in [-1, 1])
np.arccos(arr)          # Inverse cosine
np.arctan(arr)          # Inverse tangent
np.arctan2([1, 0], [0, 1])  # atan2(y, x) — handles quadrants correctly
np.degrees(arr)         # Radians to degrees
np.radians([0, 90, 180])    # Degrees to radians

# --- EXPONENTIAL & LOGARITHM ---
arr2 = np.array([1.0, 2.0, 3.0, 10.0])
np.exp(arr2)            # e^x
np.exp2(arr2)           # 2^x
np.expm1(arr2)          # e^x - 1 (more accurate for small x)
np.log(arr2)            # Natural log (base e)
np.log2(arr2)           # Log base 2
np.log10(arr2)          # Log base 10
np.log1p(arr2)          # log(1 + x) (more accurate for small x)

# --- ROUNDING ---
arr3 = np.array([-2.7, -1.3, 0.5, 1.3, 2.7])
np.floor(arr3)          # [-3. -2.  0.  1.  2.] — round down
np.ceil(arr3)           # [-2. -1.  1.  2.  3.] — round up
np.trunc(arr3)          # [-2. -1.  0.  1.  2.] — round toward zero
np.round(arr3)          # [-3. -1.  0.  1.  3.] — rounds halves to the nearest even value (banker's rounding)
np.around(arr3, 1)      # Same as round, to 1 decimal place
np.rint(arr3)           # Round to nearest integer (same half-to-even rule)

# --- SIGN & COMPARISON ---
np.sign(arr3)           # [-1. -1.  1.  1.  1.] — sign of elements
np.abs(arr3)            # absolute value
np.sqrt(np.abs(arr3))   # Square root (use abs to avoid NaN)
np.cbrt(arr3)           # Cube root (works with negatives!)
np.square(arr3)         # Element-wise x²
np.reciprocal(arr3)     # 1/x (for non-zero!)

# --- SPECIAL ---
np.clip(arr3, -1.5, 1.5)  # Clamp values to [-1.5, 1.5]
np.heaviside(arr3, 0.5)   # 0 for x<0, 0.5 for x==0, 1 for x>0
np.sinc(arr3)             # Normalized sinc function

ufunc Special Methods

Python
import numpy as np
arr = np.array([1, 2, 3, 4, 5])

# .reduce — apply ufunc cumulatively
np.add.reduce(arr)       # 15 — sum of all elements
np.multiply.reduce(arr)  # 120 — product of all elements

# .accumulate — like reduce but keeps intermediate results
np.add.accumulate(arr)       # [1 3 6 10 15] — cumulative sum
np.multiply.accumulate(arr)  # [1 2 6 24 120] — cumulative product

# .outer — compute outer product
np.add.outer([1, 2, 3], [10, 20, 30])
# [[11 21 31]
#  [12 22 32]
#  [13 23 33]]

# .at — unbuffered in-place operation (handles repeated indices correctly)
arr = np.array([1, 2, 3, 4, 5])
np.add.at(arr, [0, 0, 2], 10)  # arr[0] += 10 twice, arr[2] += 10
print(arr)  # [21 2 13 4 5]
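The .at method above is the right tool whenever indices repeat. As an illustrative sketch (the vote data is made up), it can build a tally that plain fancy-indexed += silently gets wrong:

```python
import numpy as np

# Illustrative tally: candidate index per ballot
votes = np.array([0, 2, 1, 2, 2, 0])

# Plain fancy indexing buffers the reads, so repeated indices count only once:
wrong = np.zeros(3, dtype=int)
wrong[votes] += 1
print(wrong)   # [1 1 1] — each index incremented at most once

# np.add.at applies every increment, even for repeated indices:
tally = np.zeros(3, dtype=int)
np.add.at(tally, votes, 1)
print(tally)   # [2 1 3]
```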

3. Aggregate Functions — Reduce Operations {#aggregate}

Python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

# --- BASIC AGGREGATES ---
np.sum(arr)              # 45 — sum of all
np.sum(arr, axis=0)      # [12 15 18] — sum along columns (collapse rows)
np.sum(arr, axis=1)      # [6 15 24] — sum along rows (collapse columns)
np.sum(arr, keepdims=True)  # shape (1,1) instead of scalar — useful for broadcasting

np.prod(arr)             # 362880 — product of all
np.prod(arr, axis=0)     # [28 80 162]

np.cumsum(arr)           # [1 3 6 10 15 21 28 36 45] — flattened cumsum
np.cumsum(arr, axis=0)   # Cumulative sum down each column
np.cumsum(arr, axis=1)   # Cumulative sum across each row

np.cumprod(arr, axis=1)  # Cumulative product across each row

# --- MIN & MAX ---
np.min(arr)              # 1
np.max(arr)              # 9
np.min(arr, axis=0)      # [1 2 3] — min of each column
np.max(arr, axis=1)      # [3 6 9] — max of each row
np.argmin(arr)           # 0 — index of min (flattened)
np.argmax(arr)           # 8 — index of max (flattened)
np.argmin(arr, axis=0)   # [0 0 0] — row index of min in each column
np.argmax(arr, axis=1)   # [2 2 2] — column index of max in each row

# ptp — peak to peak (max - min)
np.ptp(arr)              # 8 = 9 - 1
np.ptp(arr, axis=0)      # [6 6 6]

# --- NaN-SAFE VERSIONS ---
arr_with_nan = np.array([[1, np.nan, 3], [4, 5, np.nan]])
np.nansum(arr_with_nan)      # 13.0 — ignores NaN
np.nanmean(arr_with_nan)     # 3.25
np.nanmax(arr_with_nan)      # 5.0
np.nanmin(arr_with_nan)      # 1.0
np.nanstd(arr_with_nan)      # ignores NaN in std
np.nanprod(arr_with_nan)     # ignores NaN in product
np.nanpercentile(arr_with_nan, 75)  # 75th percentile ignoring NaN

Axis Explained Visually

Python
arr = [[1, 2, 3],   # axis=0 collapses DOWN (across rows)
       [4, 5, 6],   # axis=1 collapses ACROSS (across columns)
       [7, 8, 9]]

# np.sum(arr, axis=0) → [1+4+7, 2+5+8, 3+6+9] = [12, 15, 18]  — one value per COLUMN
# np.sum(arr, axis=1) → [1+2+3, 4+5+6, 7+8+9] = [6, 15, 24]   — one value per ROW
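The keepdims=True option from the aggregate list is what makes per-row normalization a one-liner: the (n, 1) result broadcasts back against the original (n, m) array. A minimal sketch:

```python
import numpy as np

arr = np.array([[1., 2., 3.],
                [4., 5., 6.]])
row_sums = arr.sum(axis=1, keepdims=True)   # shape (2, 1) instead of (2,)
normalized = arr / row_sums                 # broadcasts cleanly against (2, 3)
print(normalized.sum(axis=1))               # [1. 1.]
```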

4. Statistical Functions — Full Coverage {#statistics}

Python
import numpy as np

data = np.array([4, 7, 13, 2, 1, 15, 8, 3, 9, 11])

# --- CENTRAL TENDENCY ---
np.mean(data)            # 7.3 — arithmetic mean
np.median(data)          # 7.5 — middle value
# Mode — NumPy doesn't have it; use scipy
from scipy import stats
stats.mode(data)         # Most frequent value

# Weighted mean
weights = np.array([1, 2, 3, 1, 1, 2, 1, 1, 1, 1])
np.average(data, weights=weights)  # Weighted average (different from mean!)

# --- SPREAD / DISPERSION ---
np.var(data)             # Variance (default ddof=0 — population variance)
np.var(data, ddof=1)     # Sample variance (ddof=1)
np.std(data)             # Standard deviation
np.std(data, ddof=1)     # Sample std dev (use this for samples!)

# --- PERCENTILES & QUANTILES ---
np.percentile(data, 25)         # Q1 — 25th percentile
np.percentile(data, 50)         # Q2 — median
np.percentile(data, 75)         # Q3 — 75th percentile
np.percentile(data, [25, 50, 75])  # All at once: [3.25 7.5 10.5]
np.quantile(data, 0.25)         # Same as percentile(data, 25)
np.quantile(data, [0.25, 0.5, 0.75])  # Using 0-1 scale

# IQR (Interquartile Range) — for outlier detection
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]
print(f"Outliers: {outliers}")

# --- CORRELATION & COVARIANCE ---
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

# Covariance matrix
cov_matrix = np.cov(x, y)
print(cov_matrix)
# [[2.5 1.5]
#  [1.5 1.5]]
# cov_matrix[0,0] = var(x), cov_matrix[1,1] = var(y)
# cov_matrix[0,1] = cov(x,y)

# Correlation matrix (normalized covariance)
corr_matrix = np.corrcoef(x, y)
print(corr_matrix)
# [[1.         0.77459667]
#  [0.77459667 1.        ]]
# Values between -1 and 1

# Correlation with multiple variables
data = np.random.randn(5, 100)  # 5 variables, 100 observations
corr = np.corrcoef(data)        # 5x5 correlation matrix

# --- HISTOGRAM ---
data = np.random.randn(1000)
counts, bin_edges = np.histogram(data, bins=20)
# counts: frequency in each bin
# bin_edges: 21 edges for 20 bins

# 2D histogram
x = np.random.randn(1000)
y = np.random.randn(1000)
hist2d, xedges, yedges = np.histogram2d(x, y, bins=20)
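One useful follow-up to np.histogram above: passing density=True normalizes the histogram so its total area is 1, which makes it comparable to a probability density function. A small sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=1000)
density, edges = np.histogram(data, bins=20, density=True)
widths = np.diff(edges)                      # bin widths (all equal here)
print(round(np.sum(density * widths), 6))    # 1.0 — total area under the histogram
```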

Practical Statistics — Outlier Detection & Z-scores

Python
import numpy as np

# Generate data with outliers
np.random.seed(42)
data = np.concatenate([
    np.random.normal(50, 10, 1000),   # Normal data
    np.array([200, -100, 300])         # Outliers
])

# Z-score based outlier detection
mean = np.mean(data)
std = np.std(data)
z_scores = (data - mean) / std
outlier_mask = np.abs(z_scores) > 3
print(f"Outliers found: {np.sum(outlier_mask)}")
print(f"Outlier values: {data[outlier_mask]}")

# Winsorize — cap outliers at percentiles
lower = np.percentile(data, 1)
upper = np.percentile(data, 99)
winsorized = np.clip(data, lower, upper)

# Min-Max Normalization (0 to 1 scaling)
normalized = (data - data.min()) / (data.max() - data.min())

# Standard Scaling (Z-score normalization)
standardized = (data - data.mean()) / data.std()

# Moving average (for time series)
def moving_average(arr, window=5):
    return np.convolve(arr, np.ones(window)/window, mode='valid')

prices = np.array([10, 11, 12, 11, 13, 14, 15, 13, 12, 14, 16])
ma3 = moving_average(prices, window=3)
print(ma3)  # [11. 11.33 12. 12.67 14. 14. 13.33 13. 14.]
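As an alternative sketch to the convolve-based moving average, sliding_window_view (available since NumPy 1.20) materializes the rolling windows as a 2-D view. This also enables rolling statistics that convolve cannot express, such as a rolling standard deviation. The price data mirrors the example above:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

prices = np.array([10, 11, 12, 11, 13, 14, 15, 13, 12, 14, 16])
windows = sliding_window_view(prices, window_shape=3)   # shape (9, 3)
print(windows.mean(axis=1))   # same values as the convolve version
print(windows.std(axis=1))    # rolling std — not expressible with convolve alone
```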

5. Sorting & Searching {#sorting}

Python
import numpy as np

arr = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3])

# --- SORTING ---
np.sort(arr)              # Returns sorted copy — original unchanged
arr.sort()                # In-place sort — original modified!
np.sort(arr)[::-1]        # Descending order (no direct parameter)
arr[np.argsort(arr)[::-1]]  # Descending via argsort

# Stable sort
np.sort(arr, kind='stable')  # Preserves relative order of equal elements
np.sort(arr, kind='mergesort')  # Always stable
np.sort(arr, kind='quicksort')  # Fastest average (default, not stable)
np.sort(arr, kind='heapsort')   # O(n log n) worst case

# 2D sorting
matrix = np.array([[3, 1, 2], [6, 4, 5]])
np.sort(matrix, axis=0)  # Sort each column: [[3 1 2] [6 4 5]]
np.sort(matrix, axis=1)  # Sort each row: [[1 2 3] [4 5 6]]

# --- ARGSORT — Returns indices that would sort the array ---
arr = np.array([30, 10, 20, 50, 40])
indices = np.argsort(arr)   # [1 2 0 4 3] — index of 10, 20, 30, 40, 50

# Use case: sort one array by values of another
names = np.array(['Charlie', 'Alice', 'Bob', 'Eve', 'Dave'])
scores = np.array([85, 92, 78, 95, 88])
sorted_names = names[np.argsort(scores)[::-1]]  # By descending score
print(sorted_names)  # ['Eve' 'Alice' 'Dave' 'Charlie' 'Bob']

# --- PARTITION — Faster than full sort for top-k ---
arr = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3])
np.partition(arr, 3)     # 3 smallest elements at start (not sorted among themselves)
np.partition(arr, -3)    # 3 largest elements at end

# Get top-3 values efficiently (O(n) vs O(n log n) for full sort)
top3_indices = np.argpartition(arr, -3)[-3:]
print(arr[top3_indices])   # [5, 6, 9] — not necessarily sorted

# --- SEARCHING ---
arr = np.array([0, 10, 20, 30, 40, 50])
# np.searchsorted — binary search on sorted array
idx = np.searchsorted(arr, 25)   # 3 — where 25 would be inserted
idx = np.searchsorted(arr, 20)   # 2 — leftmost position (default='left')
idx = np.searchsorted(arr, 20, side='right')  # 3 — rightmost position

# np.where — returns indices where condition is True
arr = np.array([5, 15, 25, 35, 45])
indices = np.where(arr > 20)          # (array([2, 3, 4]),)
print(arr[indices])                    # [25 35 45]
row_idx, col_idx = np.where(matrix > 3)  # 2D — returns tuple of (row_indices, col_indices)

# np.nonzero — indices of non-zero elements
arr = np.array([0, 1, 0, 0, 2, 0, 3])
print(np.nonzero(arr))   # (array([1, 4, 6]),)
print(arr[arr != 0])     # [1 2 3] — more readable alternative

# np.argmin / np.argmax — index of min/max
arr = np.array([3, 1, 4, 1, 5, 9, 2, 6])
print(np.argmin(arr))    # 1 (first occurrence of minimum)
print(np.argmax(arr))    # 5

# np.nanargmin / np.nanargmax — ignores NaN
arr_nan = np.array([3, np.nan, 1, 5])
print(np.nanargmin(arr_nan))  # 2 (ignores NaN)
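Related to argsort: np.lexsort sorts by several keys at once, with the last key acting as the primary one. The names here are purely illustrative:

```python
import numpy as np

last = np.array(['Smith', 'Jones', 'Smith', 'Adams'])
first = np.array(['Zoe', 'Amy', 'Abe', 'Bea'])
order = np.lexsort((first, last))   # sort by last name, ties broken by first name
print(last[order])    # ['Adams' 'Jones' 'Smith' 'Smith']
print(first[order])   # ['Bea' 'Amy' 'Abe' 'Zoe']
```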

6. Set Operations {#sets}

Python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 2, 3])
b = np.array([3, 4, 5, 6, 7])

# Unique
np.unique(a)                          # [1 2 3 4 5] — sorted unique values
vals, counts = np.unique(a, return_counts=True)  # With counts: [1,2,2,1,1]
vals, indices = np.unique(a, return_index=True)  # With first occurrence index
vals, inverse = np.unique(a, return_inverse=True) # Indices to reconstruct a

# Set operations
np.union1d(a, b)            # [1 2 3 4 5 6 7] — union
np.intersect1d(a, b)        # [3 4 5] — intersection
np.setdiff1d(a, b)          # [1 2] — in a but not in b
np.setdiff1d(b, a)          # [6 7] — in b but not in a
np.setxor1d(a, b)           # [1 2 6 7] — in one but not both

# Membership test
np.isin(a, b)               # [F F T T T F T] — is each element of a in b?
np.isin(a, b, invert=True)  # Inverse — not in b
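One subtlety worth knowing: np.unique always returns its values sorted. To deduplicate while preserving first-seen order, combine return_index with a sort of the indices. A minimal sketch:

```python
import numpy as np

a = np.array([3, 1, 3, 2, 1])
_, idx = np.unique(a, return_index=True)   # index of each value's first occurrence
print(a[np.sort(idx)])                     # [3 1 2] — original order preserved
```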

7. Linear Algebra — numpy.linalg {#linalg}

Linear algebra is the backbone of machine learning. NumPy provides a comprehensive linalg module.

Python
import numpy as np
from numpy import linalg as LA

A = np.array([[2, 1], [1, 3]], dtype=float)
B = np.array([[1, 0], [0, 1]], dtype=float)  # Identity

# --- BASIC OPERATIONS ---
LA.matrix_power(A, 3)    # A³ = A @ A @ A
np.trace(A)              # Sum of diagonal: 5.0

# --- DETERMINANT ---
det = LA.det(A)           # 5.0

# --- INVERSE ---
inv = LA.inv(A)           # A⁻¹ such that A @ A⁻¹ = I
# Verify: A @ inv ≈ Identity
print(np.allclose(A @ inv, np.eye(2)))   # True

# --- NORMS ---
v = np.array([3.0, 4.0])
LA.norm(v)               # 5.0 — L2 norm (Euclidean distance)
LA.norm(v, ord=1)        # 7.0 — L1 norm (Manhattan distance)
LA.norm(v, ord=np.inf)   # 4.0 — L∞ norm (max absolute value)
LA.norm(A)               # Frobenius norm (matrix)
LA.norm(A, ord='fro')    # Same — Frobenius explicitly
LA.norm(A, ord=2)        # Spectral norm (largest singular value)

# --- EIGENVALUES & EIGENVECTORS ---
eigenvalues, eigenvectors = LA.eig(A)
print(eigenvalues)   # [1.38 3.62] (approximately)
print(eigenvectors)  # Columns are eigenvectors

# For symmetric/Hermitian matrices (more stable)
eigenvalues_sym, eigenvectors_sym = LA.eigh(A)  # Eigenvalues returned in ascending order

# --- SINGULAR VALUE DECOMPOSITION (SVD) ---
# A = U @ np.diag(S) @ Vt
U, S, Vt = LA.svd(A)
# U: left singular vectors
# S: singular values (sorted descending)
# Vt: right singular vectors (transposed)
# Reconstruct A
A_reconstructed = U @ np.diag(S) @ Vt
print(np.allclose(A, A_reconstructed))  # True

# Truncated SVD for dimensionality reduction (like PCA)
U, S, Vt = LA.svd(A, full_matrices=False)  # Economy/thin SVD

# --- QR DECOMPOSITION ---
Q, R = LA.qr(A)  # A = Q @ R, Q orthogonal, R upper triangular
print(np.allclose(A, Q @ R))  # True
print(np.allclose(Q @ Q.T, np.eye(2)))  # Q is orthogonal: True

# --- CHOLESKY DECOMPOSITION ---
# For symmetric positive definite matrices
# A = L @ L.T (L is lower triangular)
A_sym = np.array([[4, 2], [2, 3]], dtype=float)
L = LA.cholesky(A_sym)
print(np.allclose(A_sym, L @ L.T))  # True

# --- SOLVING LINEAR SYSTEMS ---
# Solve Ax = b
A = np.array([[3, 1], [1, 2]], dtype=float)
b = np.array([9, 8], dtype=float)
x = LA.solve(A, b)  # x = [2, 3]
print(np.allclose(A @ x, b))  # True

# Least squares solution (overdetermined system — more equations than unknowns)
A = np.array([[1, 1], [1, 2], [1, 3]], dtype=float)  # 3 equations
b = np.array([1, 2, 2], dtype=float)                   # 2 unknowns
x, residuals, rank, sv = LA.lstsq(A, b, rcond=None)
print(x)  # Best-fit line coefficients: intercept ≈ 0.67, slope = 0.5

# --- MATRIX RANK & CONDITION NUMBER ---
A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=float)
print(LA.matrix_rank(A))     # 2 — rows are linearly dependent!
print(LA.cond(A))             # Condition number — high means ill-conditioned

# --- PSEUDO-INVERSE (Moore-Penrose) ---
# For non-square or singular matrices
A = np.array([[1, 2], [3, 4], [5, 6]], dtype=float)  # 3x2 (tall matrix)
pinv = LA.pinv(A)  # shape (2, 3) — left inverse when more rows than cols
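To make the truncated-SVD remark concrete, here is a sketch of a rank-k approximation: keep only the top k singular values and vectors. By the Eckart-Young theorem this is the best rank-k approximation in spectral norm (the matrix here is random, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.random((6, 4))
U, S, Vt = np.linalg.svd(M, full_matrices=False)

k = 2
M_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]    # best rank-2 approximation of M
print(np.linalg.matrix_rank(M_k))                         # 2
print(np.isclose(np.linalg.norm(M - M_k, ord=2), S[k]))   # True — error equals the first dropped singular value
```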

Real-World Linear Algebra — Linear Regression from Scratch

Python
import numpy as np

# Generate synthetic data: y = 2x + 3 + noise
np.random.seed(42)
X_raw = np.linspace(0, 10, 100)
y = 2 * X_raw + 3 + np.random.randn(100)

# Add bias column (intercept) — augment X with ones
X = np.column_stack([np.ones(100), X_raw])  # shape (100, 2)

# OLS solution: w = (X^T X)^{-1} X^T y (Normal Equation)
w = np.linalg.solve(X.T @ X, X.T @ y)
print(f"Intercept: {w[0]:.4f}, Slope: {w[1]:.4f}")
# Intercept: ~3.0, Slope: ~2.0

# Alternative with lstsq (numerically more stable)
w, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
print(f"Intercept: {w[0]:.4f}, Slope: {w[1]:.4f}")

# Predictions
y_pred = X @ w
mse = np.mean((y - y_pred) ** 2)
print(f"MSE: {mse:.4f}")

8. Fourier Transform — numpy.fft {#fft}

Python
import numpy as np

# --- 1D FFT ---
# Create a signal: sum of two sine waves
t = np.linspace(0, 1, 500, endpoint=False)   # 1 second, 500 samples
freq1, freq2 = 5, 20                          # 5 Hz and 20 Hz components
signal = np.sin(2 * np.pi * freq1 * t) + 0.5 * np.sin(2 * np.pi * freq2 * t)

# Compute FFT
fft_result = np.fft.fft(signal)
frequencies = np.fft.fftfreq(len(signal), d=1/500)  # Frequency bins

# Get magnitude spectrum
magnitude = np.abs(fft_result)

# Only positive frequencies (spectrum is symmetric)
positive_freq_idx = frequencies > 0
print(frequencies[positive_freq_idx][np.argmax(magnitude[positive_freq_idx])])
# Should show 5.0 (dominant frequency)

# --- INVERSE FFT ---
signal_reconstructed = np.fft.ifft(fft_result).real
print(np.allclose(signal, signal_reconstructed))  # True

# --- 2D FFT (Image frequency analysis) ---
image = np.random.rand(64, 64)
fft2d = np.fft.fft2(image)
fft2d_shifted = np.fft.fftshift(fft2d)  # Center the zero-frequency component
magnitude_spectrum = np.log1p(np.abs(fft2d_shifted))  # Log scale for visibility
# Reconstruct
image_back = np.fft.ifft2(fft2d).real

# --- REAL FFT (when input is real) — more efficient ---
rfft_result = np.fft.rfft(signal)     # Half the output (no redundancy)
frequencies = np.fft.rfftfreq(len(signal), d=1/500)
signal_back = np.fft.irfft(rfft_result, n=len(signal))
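A classic application of the rfft/irfft pair above is frequency-domain filtering: zero out the bins above a cutoff and transform back. This sketch reuses the two-sine signal from the example; the 10 Hz cutoff is an arbitrary choice between the 5 Hz and 20 Hz components:

```python
import numpy as np

t = np.linspace(0, 1, 500, endpoint=False)
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 20 * t)

spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(len(signal), d=1/500)
spectrum[freqs > 10] = 0                      # low-pass: zero everything above 10 Hz
filtered = np.fft.irfft(spectrum, n=len(signal))

# Only the 5 Hz component survives
print(np.allclose(filtered, np.sin(2 * np.pi * 5 * t)))   # True
```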

9. Polynomial Operations — numpy.polynomial {#polynomial}

Python
import numpy as np
from numpy.polynomial import polynomial as P

# --- CLASSIC INTERFACE (numpy.poly1d) ---
# Represent polynomial: 3x² + 2x - 1
p = np.poly1d([3, 2, -1])
print(p)         # 3x² + 2x - 1
print(p(4))      # 3(16) + 2(4) - 1 = 55
print(p.roots)   # Roots where polynomial = 0
print(p.deriv()) # Derivative: 6x + 2
print(p.integ()) # Integral

# Polynomial arithmetic
p1 = np.poly1d([1, -3])   # x - 3
p2 = np.poly1d([1, -5])   # x - 5
print(p1 * p2)             # x² - 8x + 15
print(np.polymul(p1.coeffs, p2.coeffs))  # Same

# Polynomial fitting
x = np.array([0, 1, 2, 3, 4])
y = np.array([1, 3, 9, 19, 33])
coeffs = np.polyfit(x, y, deg=2)  # Fit degree-2 polynomial
fitted_poly = np.poly1d(coeffs)
print(fitted_poly)    # ~2x² + 0x + 1

# Evaluate at new points
x_new = np.linspace(0, 4, 100)
y_new = fitted_poly(x_new)

# --- NEW API (numpy.polynomial) — Recommended ---
# More numerically stable, clearer API
from numpy.polynomial import Polynomial

# Create from coefficients [a0, a1, a2] = a0 + a1*x + a2*x²
p = Polynomial([1, 2, 3])   # 1 + 2x + 3x²
print(p(2))                  # 1 + 4 + 12 = 17
print(p.roots())             # Complex roots
print(p.deriv())             # 2 + 6x
print(p.integ())             # x + x² + x³

# Fit
x = np.array([0, 1, 2, 3, 4])
y = np.array([1, 3, 9, 19, 33])
poly_fit = Polynomial.fit(x, y, deg=2)
print(poly_fit.convert().coef)   # Coefficients

10. Random Number Generation — Deep Dive {#random}

Python
import numpy as np

# --- LEGACY API (still works but less recommended) ---
np.random.seed(42)

# Distributions
np.random.rand(3, 4)               # Uniform [0, 1)
np.random.randn(3, 4)              # Standard normal N(0, 1)
np.random.randint(0, 100, (3, 4)) # Random integers
np.random.uniform(1.5, 3.5, 10)   # Uniform [low, high)
np.random.normal(loc=50, scale=10, size=1000)   # Normal distribution
np.random.binomial(n=10, p=0.5, size=1000)      # Binomial
np.random.poisson(lam=3, size=1000)             # Poisson
np.random.exponential(scale=2.0, size=1000)     # Exponential
np.random.gamma(shape=2, scale=1.5, size=1000)  # Gamma
np.random.beta(a=2, b=5, size=1000)             # Beta
np.random.chisquare(df=5, size=1000)            # Chi-square
np.random.choice([1,2,3,4,5], size=10)         # Random choice

# --- NEW API (Recommended — numpy.random.Generator) ---
rng = np.random.default_rng(seed=42)  # Reproducible generator

# Equivalent functions
rng.random((3, 4))             # Uniform [0, 1)
rng.standard_normal((3, 4))   # N(0, 1)
rng.integers(0, 100, (3, 4))  # Random integers [low, high)
rng.uniform(1.5, 3.5, 10)
rng.normal(50, 10, 1000)
rng.binomial(10, 0.5, 1000)
rng.poisson(3, 1000)
rng.choice([1,2,3,4,5], size=10, replace=False)  # Without replacement!

# Shuffle
arr = np.arange(10)
rng.shuffle(arr)              # In-place shuffle
shuffled = rng.permutation(arr)  # Returns new shuffled array

# Multivariate distributions
mean = [0, 0]
cov = [[1, 0.8], [0.8, 1]]   # Correlated variables
samples = rng.multivariate_normal(mean, cov, size=1000)
print(samples.shape)           # (1000, 2)

# --- MONTE CARLO EXAMPLE: Estimate π ---
rng = np.random.default_rng(42)
n_points = 1_000_000
points = rng.uniform(-1, 1, (n_points, 2))
inside_circle = np.sum(np.sum(points**2, axis=1) <= 1)
pi_estimate = 4 * inside_circle / n_points
print(f"Estimated π: {pi_estimate:.5f}")  # ~3.14159

# --- BOOTSTRAP SAMPLING ---
data = rng.normal(50, 10, 100)
n_bootstraps = 1000
bootstrap_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(n_bootstraps)
])
ci_lower = np.percentile(bootstrap_means, 2.5)
ci_upper = np.percentile(bootstrap_means, 97.5)
print(f"95% CI for mean: [{ci_lower:.2f}, {ci_upper:.2f}]")

11. File I/O — Save and Load Arrays {#fileio}

Python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

# --- BINARY FORMAT (.npy and .npz) ---

# Single array — .npy (fast, exact, preserves dtype)
np.save('array.npy', arr)
loaded = np.load('array.npy')
print(np.array_equal(arr, loaded))   # True — exact preservation

# Multiple arrays — .npz (compressed zip of .npy files)
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
np.savez('arrays.npz', array_a=a, array_b=b)  # Named arrays
data = np.load('arrays.npz')
print(data['array_a'])   # [1 2 3]
print(data['array_b'])   # [4 5 6]
print(list(data.files))  # ['array_a', 'array_b']

# Compressed version (smaller file, slower)
np.savez_compressed('arrays_compressed.npz', a=a, b=b)

# --- TEXT FORMAT ---

# Save as CSV / whitespace-separated
np.savetxt('array.csv', arr, delimiter=',', fmt='%d')        # Integer format
np.savetxt('array.csv', arr, delimiter=',', fmt='%.4f')      # Float format
np.savetxt('array.csv', arr, delimiter=',', header='A,B,C', comments='')

# Load from CSV
loaded = np.loadtxt('array.csv', delimiter=',', dtype=np.int32)
loaded = np.loadtxt('data.csv', delimiter=',', skiprows=1)   # Skip header
loaded = np.loadtxt('data.csv', delimiter=',', usecols=[0, 2])  # Select columns

# np.genfromtxt — more powerful (handles missing values)
data = np.genfromtxt('data.csv',
                      delimiter=',',
                      skip_header=1,
                      filling_values=0,     # Replace NaN with 0
                      dtype=float,
                      usecols=(0, 1, 2))

# --- MEMORY-MAPPED FILES (for large arrays that don't fit in RAM) ---
# Create a memory-mapped file
mmapped = np.memmap('large_data.dat', dtype=np.float32, mode='w+', shape=(1000, 1000))
mmapped[:] = np.random.rand(1000, 1000)    # Write data
mmapped.flush()                            # Explicitly write changes to disk
del mmapped                                # Close the mapping

# Open existing memory-mapped file (read-only)
mmapped_ro = np.memmap('large_data.dat', dtype=np.float32, mode='r', shape=(1000, 1000))
print(mmapped_ro[0, :5])   # Read first row, first 5 elements
# Data is loaded from disk only when accessed!

# --- READING FROM STRING / BYTES ---
import io
csv_string = "1,2,3\n4,5,6\n7,8,9"
arr = np.loadtxt(io.StringIO(csv_string), delimiter=',')
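Where genfromtxt earns its keep is malformed input. A sketch with a missing field that loadtxt would reject outright (the CSV string and fill value are illustrative):

```python
import io
import numpy as np

csv = "1,2,3\n4,,6"                       # second row has a missing field
data = np.genfromtxt(io.StringIO(csv), delimiter=',', filling_values=-1)
print(data)
# [[ 1.  2.  3.]
#  [ 4. -1.  6.]]
```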

12. Structured Arrays & Record Arrays {#structured}

Structured arrays allow you to store heterogeneous data (like a C struct or database row).

Python
import numpy as np

# Define a structured dtype
employee_dtype = np.dtype([
    ('name', 'U20'),           # Unicode string, max 20 chars
    ('age', 'i4'),             # 32-bit integer
    ('salary', 'f8'),          # 64-bit float
    ('is_active', '?'),        # Boolean
])

# Create structured array
employees = np.array([
    ('Alice', 30, 75000.0, True),
    ('Bob', 25, 55000.0, True),
    ('Charlie', 35, 90000.0, False),
    ('Diana', 28, 80000.0, True),
], dtype=employee_dtype)

# Access by field name (like a column in a DataFrame)
print(employees['name'])     # ['Alice' 'Bob' 'Charlie' 'Diana']
print(employees['salary'])   # [75000. 55000. 90000. 80000.]

# Access by row (like a record)
print(employees[0])          # ('Alice', 30, 75000., True)
print(employees[0]['name'])  # 'Alice'

# Boolean indexing on structured arrays
active = employees[employees['is_active']]
print(active['name'])        # ['Alice' 'Bob' 'Diana']

high_earners = employees[employees['salary'] > 70000]
print(high_earners['name'])  # ['Alice' 'Charlie' 'Diana']

# Sorting by field
sorted_by_salary = np.sort(employees, order='salary')
print(sorted_by_salary['name'])  # ['Bob' 'Alice' 'Diana' 'Charlie']

# Sort by multiple fields
sorted_multi = np.sort(employees, order=['is_active', 'salary'])

# np.recarray — access fields as attributes
rec = employees.view(np.recarray)
print(rec.name)     # Same as employees['name']
print(rec.salary)   # Same as employees['salary']

13. Masked Arrays — numpy.ma {#masked}

Masked arrays let you mark certain elements as invalid or missing without removing them.

Python
import numpy as np
import numpy.ma as ma

# Create masked array
data = np.array([1, 2, -999, 4, -999, 6])
mask = data == -999                        # True where data is invalid
masked = ma.masked_array(data, mask=mask)
print(masked)    # [1 2 -- 4 -- 6]

# Operations ignore masked values
print(masked.mean())   # 3.25 — average of [1, 2, 4, 6]
print(masked.sum())    # 13
print(masked.min())    # 1

# Fill masked values
filled = masked.filled(fill_value=0)   # Replace masked with 0
filled_mean = masked.filled(masked.mean())  # Fill with mean

# Mask based on condition
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])
masked = ma.masked_where(arr % 2 == 0, arr)  # Mask even numbers
print(masked)    # [1 -- 3 -- 5 -- 7 --]

masked_greater = ma.masked_greater(arr, 5)   # Mask values > 5
masked_less = ma.masked_less(arr, 3)         # Mask values < 3
masked_outside = ma.masked_outside(arr, 3, 6)  # Mask outside [3, 6]
masked_equal = ma.masked_equal(arr, 4)       # Mask value == 4

# Combining masks
m1 = ma.masked_less(arr, 3)
m2 = ma.masked_greater(arr, 6)
combined = ma.mask_or(m1.mask, m2.mask)     # Union of masks

# Real use case: sensor data with invalid readings
sensor_data = np.array([23.5, 24.1, -9999.0, 23.8, 24.5, -9999.0])
valid_data = ma.masked_equal(sensor_data, -9999.0)
print(f"Valid readings mean: {valid_data.mean():.2f}")
print(f"Valid count: {valid_data.count()}")
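A related convenience: ma.masked_invalid masks both NaN and infinity in one call, which is handy for float data where bad readings show up as non-finite values rather than a sentinel. A minimal sketch:

```python
import numpy as np
import numpy.ma as ma

readings = np.array([1.0, np.nan, 3.0, np.inf, 5.0])
clean = ma.masked_invalid(readings)       # masks NaN and inf in one step
print(clean.mean())    # 3.0 — average of [1, 3, 5]
print(clean.count())   # 3
```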

14. Edge Cases & Debugging Tips {#errors}

Floating Point Gotchas

Python
import numpy as np

# NaN propagation
arr = np.array([1.0, 2.0, np.nan, 4.0])
print(np.sum(arr))    # nan — NaN poisons the result
print(np.nansum(arr)) # 7.0 — nan-safe version

# NaN comparisons
x = np.nan
print(x == x)           # False! NaN is not equal to itself
print(np.isnan(x))      # True — correct way to check
print(np.isnan(arr))    # [False False True False]

# Infinity
print(np.inf + 1)       # inf
print(np.inf - np.inf)  # nan
print(1 / np.inf)       # 0.0
print(np.isinf(arr))    # Check for infinity
print(np.isfinite(arr)) # Check for neither NaN nor inf

# Divide by zero — NumPy warns but returns inf, not error!
with np.errstate(divide='ignore', invalid='ignore'):
    result = np.array([1.0, 2.0]) / np.array([0.0, 1.0])
print(result)   # [inf 2.0]
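One more floating-point gotcha that trips people up constantly: equality comparison. Use np.isclose and np.allclose, which compare within a tolerance, instead of ==:

```python
import numpy as np

print(0.1 + 0.2 == 0.3)                  # False — binary floats can't represent these exactly
print(np.isclose(0.1 + 0.2, 0.3))       # True — tolerance-based scalar comparison
print(np.allclose([0.1 + 0.2], [0.3]))  # True — array-wide version
```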

Dtype Overflow

Python
import numpy as np

# Silent overflow — very dangerous!
arr = np.array([200], dtype=np.uint8)   # max is 255
arr += 60
print(arr)   # [4] — 200 + 60 = 260, wraps to 260 - 256 = 4!

# How to detect overflow risk
arr = np.array([200], dtype=np.uint8)
new_val = 200 + 60
if new_val > np.iinfo(np.uint8).max:
    print("Warning: overflow!")
    arr = arr.astype(np.int32) + 60   # Upcast first

Memory Errors with Large Arrays

Python
import numpy as np

# Check array size before creating
rows, cols = 100000, 100000
bytes_needed = rows * cols * 8  # float64 = 8 bytes
gb_needed = bytes_needed / 1e9
print(f"Memory needed: {gb_needed:.1f} GB")

if gb_needed > 4:
    # Use chunked processing or memory mapping
    chunk_size = 1000
    for i in range(0, rows, chunk_size):
        chunk = np.zeros((min(chunk_size, rows - i), cols), dtype=np.float32)
        # Process chunk...

15. Interview Questions — Part 2 {#interview}

Q1: What is the difference between np.dot, np.matmul, and @?

For 2D arrays all three perform matrix multiplication. Differences: np.dot accepts scalars (falling back to plain multiplication), while np.matmul and @ raise an error for scalar arguments. For arrays with 3+ dimensions, np.matmul broadcasts over the leading axes and performs batch matrix multiplication (stacks of matrices), whereas np.dot contracts the last axis of the first array with the second-to-last axis of the second, producing a different (often much larger) result.

Q2: What does ddof=1 mean in np.std()?

ddof = "Delta Degrees of Freedom". ddof=0 gives population std (divides by N). ddof=1 gives sample std (divides by N-1, Bessel's correction). Always use ddof=1 when computing statistics from a sample.

Q3: Why does np.sort have multiple kind options?

Different algorithms have different trade-offs. quicksort (default): O(n log n) average, not stable. mergesort/stable: O(n log n) always, stable, more memory. heapsort: O(n log n) worst-case, in-place. Use stable when relative order of equal elements matters.

Q4: When would you use np.linalg.solve vs np.linalg.lstsq?

solve requires square invertible matrix (Ax=b, unique solution). lstsq handles overdetermined systems (more equations than unknowns) — finds minimum-norm least-squares solution. Always prefer lstsq for regression problems.

Q5: What is the difference between np.save and np.savetxt?

np.save writes binary .npy format — fast, lossless, preserves dtype. np.savetxt writes human-readable text — slow, may lose precision for floats, but interoperable with other tools.

Q6: How does NumPy handle NaN values? How do you work around it?

NaN propagates through most operations (sum, mean, etc.). Use nan-safe functions: np.nansum, np.nanmean, np.nanmax, etc. Detect with np.isnan(). For integer arrays, NaN doesn't exist — use np.ma.masked_array instead.

Q7: Explain the difference between np.correlate and np.corrcoef.

np.correlate computes cross-correlation of two 1D signals (used in signal processing). np.corrcoef computes the Pearson correlation coefficient matrix (used in statistics) — values between -1 and 1.

Q8: What is SVD used for in data science?

SVD (A = U S Vt) has many applications: Dimensionality reduction (like PCA), collaborative filtering/recommendation systems, image compression (keep only top-k singular values), solving least squares, computing matrix rank and pseudo-inverse.


End of Part 2 — Math, Statistics & Linear Algebra

Up Next → Part 3: Performance Optimization, Memory Layout, Advanced Patterns, Real-World Pipelines & Complete Interview Guide

