1. Arithmetic Operations — Element-wise Math {#arithmetic}
All basic arithmetic in NumPy is element-wise and returns a new array (the operands are unchanged).
import numpy as np
a = np.array([1, 2, 3, 4, 5])
b = np.array([10, 20, 30, 40, 50])
# Basic operators
print(a + b) # [11 22 33 44 55]
print(b - a) # [9 18 27 36 45]
print(a * b) # [10 40 90 160 250]
print(b / a) # [10. 10. 10. 10. 10.] — always float64
print(b // a) # [10 10 10 10 10] — floor division
print(b % a) # [0 0 0 0 0] — modulo
print(a ** 2) # [1 4 9 16 25] — power
print(-a) # [-1 -2 -3 -4 -5] — negate
# Operator functions (equivalent to operators but more control)
np.add(a, b) # a + b
np.subtract(b, a) # b - a
np.multiply(a, b) # a * b
np.divide(b, a) # b / a
np.floor_divide(b, a) # b // a
np.mod(b, a) # b % a
np.power(a, 2) # a ** 2
np.negative(a) # -a
np.absolute(np.array([-1, -2, 3])) # [1 2 3]
np.fabs(np.array([-1.5, 2.5])) # [1.5 2.5] — float only
# Out parameter — write result into existing array (memory optimization)
result = np.empty(5)
np.add(a, b, out=result) # No new allocation!
# In-place operations
a += 10 # a = a + 10, modifies in-place
a -= 5
a *= 2
a //= 3 # In-place floor division (a /= 3 would raise — an integer array can't store the float result)
a **= 2
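That last point is a common trip-up: in-place true division fails on integer arrays because the float result can't be cast back into the integer buffer. A minimal demonstration:

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])       # integer array
try:
    a /= 3                          # float result can't be cast back to int
except TypeError as e:
    print("in-place /= failed:", type(e).__name__)

a //= 3                             # floor division stays integer — fine
print(a)                            # [0 0 1 1 1]

b = np.array([1, 2, 3, 4, 5], dtype=float)
b /= 3                              # float buffer accepts the result — fine
```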
Matrix Multiplication (Not Element-wise!)
import numpy as np
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
# Element-wise multiplication (Hadamard product)
print(A * B)
# [[ 5 12]
# [21 32]]
# Matrix multiplication — 3 ways
print(np.dot(A, B)) # Traditional
print(A @ B) # Python 3.5+ — cleaner syntax (preferred!)
print(np.matmul(A, B)) # Same as @, but doesn't work with scalars
# Result:
# [[19 22]
# [43 50]]
# Dot product of vectors
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print(np.dot(a, b)) # 1*4 + 2*5 + 3*6 = 32
print(a @ b) # 32
# Outer product
print(np.outer(a, b))
# [[ 4 5 6]
# [ 8 10 12]
# [12 15 18]]
# Cross product
a = np.array([1, 0, 0])
b = np.array([0, 1, 0])
print(np.cross(a, b)) # [0 0 1]
2. Universal Functions (ufuncs) — The Real Power {#ufuncs}
ufuncs are functions that operate element-wise on ndarrays. They are compiled in C and extremely fast.
Math ufuncs
import numpy as np
arr = np.array([0, 0.5, 1.0, 1.5, 2.0])
# --- TRIGONOMETRY ---
np.sin(arr) # Sine
np.cos(arr) # Cosine
np.tan(arr) # Tangent
np.arcsin(arr) # Inverse sine (input must be in [-1, 1])
np.arccos(arr) # Inverse cosine
np.arctan(arr) # Inverse tangent
np.arctan2([1, 0], [0, 1]) # atan2(y, x) — handles quadrants correctly
np.degrees(arr) # Radians to degrees
np.radians([0, 90, 180]) # Degrees to radians
# --- EXPONENTIAL & LOGARITHM ---
arr2 = np.array([1.0, 2.0, 3.0, 10.0])
np.exp(arr2) # e^x
np.exp2(arr2) # 2^x
np.expm1(arr2) # e^x - 1 (more accurate for small x)
np.log(arr2) # Natural log (base e)
np.log2(arr2) # Log base 2
np.log10(arr2) # Log base 10
np.log1p(arr2) # log(1 + x) (more accurate for small x)
# --- ROUNDING ---
arr3 = np.array([-2.7, -1.3, 0.5, 1.3, 2.7])
np.floor(arr3) # [-3. -2. 0. 1. 2.] — round down
np.ceil(arr3) # [-2. -1. 1. 2. 3.] — round up
np.trunc(arr3) # [-2. -1. 0. 1. 2.] — round toward zero
np.round(arr3) # [-3. -1. 0. 1. 3.] — rounds to nearest; .5 ties go to even (banker's rounding)
np.around(arr3, 1) # Alias of np.round — here rounding to 1 decimal place
np.rint(arr3) # Round to nearest int
# --- SIGN & COMPARISON ---
np.sign(arr3) # [-1. -1. 1. 1. 1.] — sign of elements
np.abs(arr3) # absolute value
np.sqrt(np.abs(arr3)) # Square root (use abs to avoid NaN)
np.cbrt(arr3) # Cube root (works with negatives!)
np.square(arr3) # Element-wise x²
np.reciprocal(arr3) # 1/x (for non-zero!)
# --- SPECIAL ---
np.clip(arr3, -1.5, 1.5) # Clamp values to [-1.5, 1.5]
np.heaviside(arr3, 0.5) # 0 for x<0, 0.5 for x==0, 1 for x>0
np.sinc(arr3) # Normalized sinc function
ufunc Special Methods
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
# .reduce — apply ufunc cumulatively
np.add.reduce(arr) # 15 — sum of all elements
np.multiply.reduce(arr) # 120 — product of all elements
# .accumulate — like reduce but keeps intermediate results
np.add.accumulate(arr) # [1 3 6 10 15] — cumulative sum
np.multiply.accumulate(arr) # [1 2 6 24 120] — cumulative product
# .outer — compute outer product
np.add.outer([1, 2, 3], [10, 20, 30])
# [[11 21 31]
# [12 22 32]
# [13 23 33]]
# .at — unbuffered in-place operation (handles repeated indices correctly)
arr = np.array([1, 2, 3, 4, 5])
np.add.at(arr, [0, 0, 2], 10) # arr[0] += 10 twice, arr[2] += 10
print(arr) # [21 2 13 4 5]
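Why `.at` matters: plain fancy-indexed `+=` is buffered, so a repeated index is incremented only once. A quick comparison:

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])
arr[[0, 0, 2]] += 10            # buffered — index 0 gets +10 only ONCE
print(arr)                      # [11  2 13  4  5]

arr = np.array([1, 2, 3, 4, 5])
np.add.at(arr, [0, 0, 2], 10)   # unbuffered — index 0 gets +10 twice
print(arr)                      # [21  2 13  4  5]
```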
3. Aggregate Functions — Reduce Operations {#aggregate}
import numpy as np
arr = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
# --- BASIC AGGREGATES ---
np.sum(arr) # 45 — sum of all
np.sum(arr, axis=0) # [12 15 18] — sum along columns (collapse rows)
np.sum(arr, axis=1) # [6 15 24] — sum along rows (collapse columns)
np.sum(arr, keepdims=True) # shape (1,1) instead of scalar — useful for broadcasting
np.prod(arr) # 362880 — product of all
np.prod(arr, axis=0) # [28 80 162]
np.cumsum(arr) # [1 3 6 10 15 21 28 36 45] — flattened cumsum
np.cumsum(arr, axis=0) # Cumulative sum down each column
np.cumsum(arr, axis=1) # Cumulative sum across each row
np.cumprod(arr, axis=1) # Cumulative product across each row
# --- MIN & MAX ---
np.min(arr) # 1
np.max(arr) # 9
np.min(arr, axis=0) # [1 2 3] — min of each column
np.max(arr, axis=1) # [3 6 9] — max of each row
np.argmin(arr) # 0 — index of min (flattened)
np.argmax(arr) # 8 — index of max (flattened)
np.argmin(arr, axis=0) # [0 0 0] — row index of min in each column
np.argmax(arr, axis=1) # [2 2 2] — column index of max in each row
# ptp — peak to peak (max - min)
np.ptp(arr) # 8 = 9 - 1
np.ptp(arr, axis=0) # [6 6 6]
# --- NaN-SAFE VERSIONS ---
arr_with_nan = np.array([[1, np.nan, 3], [4, 5, np.nan]])
np.nansum(arr_with_nan) # 13.0 — ignores NaN
np.nanmean(arr_with_nan) # 3.25
np.nanmax(arr_with_nan) # 5.0
np.nanmin(arr_with_nan) # 1.0
np.nanstd(arr_with_nan) # ignores NaN in std
np.nanprod(arr_with_nan) # ignores NaN in product
np.nanpercentile(arr_with_nan, 75) # 75th percentile ignoring NaN
Axis Explained Visually
arr = [[1, 2, 3], # axis=0 collapses DOWN (across rows)
[4, 5, 6], # axis=1 collapses ACROSS (across columns)
[7, 8, 9]]
np.sum(arr, axis=0) = [1+4+7, 2+5+8, 3+6+9] = [12, 15, 18] # One value per COLUMN
np.sum(arr, axis=1) = [1+2+3, 4+5+6, 7+8+9] = [6, 15, 24] # One value per ROW
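The `keepdims=True` flag ties these axis rules together: the reduced result keeps a size-1 axis, so it broadcasts cleanly against the original array. For example, normalizing each row to sum to 1:

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

row_sums = arr.sum(axis=1, keepdims=True)   # shape (3, 1) instead of (3,)
normalized = arr / row_sums                 # broadcasts (3,3) / (3,1)
print(normalized.sum(axis=1))               # [1. 1. 1.]
```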
4. Statistical Functions — Full Coverage {#statistics}
import numpy as np
data = np.array([4, 7, 13, 2, 1, 15, 8, 3, 9, 11])
# --- CENTRAL TENDENCY ---
np.mean(data) # 7.3 — arithmetic mean
np.median(data) # 7.5 — middle value
# Mode — NumPy doesn't have it; use scipy
from scipy import stats
stats.mode(data) # Most frequent value
# Weighted mean
weights = np.array([1, 2, 3, 1, 1, 2, 1, 1, 1, 1])
np.average(data, weights=weights) # Weighted average (different from mean!)
# --- SPREAD / DISPERSION ---
np.var(data) # Variance (default ddof=0 — population variance)
np.var(data, ddof=1) # Sample variance (ddof=1)
np.std(data) # Standard deviation
np.std(data, ddof=1) # Sample std dev (use this for samples!)
# --- PERCENTILES & QUANTILES ---
np.percentile(data, 25) # Q1 — 25th percentile
np.percentile(data, 50) # Q2 — median
np.percentile(data, 75) # Q3 — 75th percentile
np.percentile(data, [25, 50, 75]) # All at once: [3.25 7.5 10.5]
np.quantile(data, 0.25) # Same as percentile(data, 25)
np.quantile(data, [0.25, 0.5, 0.75]) # Using 0-1 scale
# IQR (Interquartile Range) — for outlier detection
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]
print(f"Outliers: {outliers}")
# --- CORRELATION & COVARIANCE ---
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])
# Covariance matrix
cov_matrix = np.cov(x, y)
print(cov_matrix)
# [[2.5 1.5]
# [1.5 1.5]]
# cov_matrix[0,0] = var(x), cov_matrix[1,1] = var(y)
# cov_matrix[0,1] = cov(x,y) — computed with ddof=1 by default
# Correlation matrix (normalized covariance)
corr_matrix = np.corrcoef(x, y)
print(corr_matrix)
# [[1. 0.77459667]
# [0.77459667 1. ]]
# Values between -1 and 1
# Correlation with multiple variables
data = np.random.randn(5, 100) # 5 variables, 100 observations
corr = np.corrcoef(data) # 5x5 correlation matrix
# --- HISTOGRAM ---
data = np.random.randn(1000)
counts, bin_edges = np.histogram(data, bins=20)
# counts: frequency in each bin
# bin_edges: 21 edges for 20 bins
# 2D histogram
x = np.random.randn(1000)
y = np.random.randn(1000)
hist2d, xedges, yedges = np.histogram2d(x, y, bins=20)
Practical Statistics — Outlier Detection & Z-scores
import numpy as np
# Generate data with outliers
np.random.seed(42)
data = np.concatenate([
np.random.normal(50, 10, 1000), # Normal data
np.array([200, -100, 300]) # Outliers
])
# Z-score based outlier detection
mean = np.mean(data)
std = np.std(data)
z_scores = (data - mean) / std
outlier_mask = np.abs(z_scores) > 3
print(f"Outliers found: {np.sum(outlier_mask)}")
print(f"Outlier values: {data[outlier_mask]}")
# Winsorize — cap outliers at percentiles
lower = np.percentile(data, 1)
upper = np.percentile(data, 99)
winsorized = np.clip(data, lower, upper)
# Min-Max Normalization (0 to 1 scaling)
normalized = (data - data.min()) / (data.max() - data.min())
# Standard Scaling (Z-score normalization)
standardized = (data - data.mean()) / data.std()
# Moving average (for time series)
def moving_average(arr, window=5):
return np.convolve(arr, np.ones(window)/window, mode='valid')
prices = np.array([10, 11, 12, 11, 13, 14, 15, 13, 12, 14, 16])
ma3 = moving_average(prices, window=3)
print(ma3) # [11. 11.33 12. 12.67 14. 14. 13.33 13. 14.]
5. Sorting & Searching {#sorting}
import numpy as np
arr = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3])
# --- SORTING ---
np.sort(arr) # Returns sorted copy — original unchanged
arr.sort() # In-place sort — original modified!
np.sort(arr)[::-1] # Descending order (no direct parameter)
arr[np.argsort(arr)[::-1]] # Descending via argsort
# Stable sort
np.sort(arr, kind='stable') # Preserves relative order of equal elements
np.sort(arr, kind='mergesort') # Always stable
np.sort(arr, kind='quicksort') # Fastest average (default, not stable)
np.sort(arr, kind='heapsort') # O(n log n) worst case
# 2D sorting
matrix = np.array([[3, 1, 2], [6, 4, 5]])
np.sort(matrix, axis=0) # Sort each column: [[3 1 2] [6 4 5]]
np.sort(matrix, axis=1) # Sort each row: [[1 2 3] [4 5 6]]
# --- ARGSORT — Returns indices that would sort the array ---
arr = np.array([30, 10, 20, 50, 40])
indices = np.argsort(arr) # [1 2 0 4 3] — index of 10, 20, 30, 40, 50
# Use case: sort one array by values of another
names = np.array(['Charlie', 'Alice', 'Bob', 'Eve', 'Dave'])
scores = np.array([85, 92, 78, 95, 88])
sorted_names = names[np.argsort(scores)[::-1]] # By descending score
print(sorted_names) # ['Eve' 'Alice' 'Dave' 'Charlie' 'Bob']
# --- PARTITION — Faster than full sort for top-k ---
arr = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3])
np.partition(arr, 3) # 3 smallest elements at start (not sorted among themselves)
np.partition(arr, -3) # 3 largest elements at end
# Get top-3 values efficiently (O(n) vs O(n log n) for full sort)
top3_indices = np.argpartition(arr, -3)[-3:]
print(arr[top3_indices]) # [5, 6, 9] — not necessarily sorted
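When the top-k values are needed in sorted order, sort only those k elements after partitioning — still cheaper than a full sort when k is small:

```python
import numpy as np

arr = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3])
k = 3
top_idx = np.argpartition(arr, -k)[-k:]     # indices of top-3, unordered
top_sorted = np.sort(arr[top_idx])[::-1]    # sort just k values, descending
print(top_sorted)                           # [9 6 5]
```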
# --- SEARCHING ---
arr = np.array([0, 10, 20, 30, 40, 50])
# np.searchsorted — binary search on sorted array
idx = np.searchsorted(arr, 25) # 3 — where 25 would be inserted
idx = np.searchsorted(arr, 20) # 2 — leftmost position (default='left')
idx = np.searchsorted(arr, 20, side='right') # 3 — rightmost position
# np.where — returns indices where condition is True
arr = np.array([5, 15, 25, 35, 45])
indices = np.where(arr > 20) # (array([2, 3, 4]),)
print(arr[indices]) # [25 35 45]
row_idx, col_idx = np.where(matrix > 3) # 2D — returns tuple of (row_indices, col_indices)
# np.nonzero — indices of non-zero elements
arr = np.array([0, 1, 0, 0, 2, 0, 3])
print(np.nonzero(arr)) # (array([1, 4, 6]),)
print(arr[arr != 0]) # [1 2 3] — more readable alternative
# np.argmin / np.argmax — index of min/max
arr = np.array([3, 1, 4, 1, 5, 9, 2, 6])
print(np.argmin(arr)) # 1 (first occurrence of minimum)
print(np.argmax(arr)) # 5
# np.nanargmin / np.nanargmax — ignores NaN
arr_nan = np.array([3, np.nan, 1, 5])
print(np.nanargmin(arr_nan)) # 2 (ignores NaN)
6. Set Operations {#sets}
import numpy as np
a = np.array([1, 2, 3, 4, 5, 2, 3])
b = np.array([3, 4, 5, 6, 7])
# Unique
np.unique(a) # [1 2 3 4 5] — sorted unique values
vals, counts = np.unique(a, return_counts=True) # With counts: [1 2 2 1 1]
vals, indices = np.unique(a, return_index=True) # With first occurrence index
vals, inverse = np.unique(a, return_inverse=True) # Indices to reconstruct a
# Set operations
np.union1d(a, b) # [1 2 3 4 5 6 7] — union
np.intersect1d(a, b) # [3 4 5] — intersection
np.setdiff1d(a, b) # [1 2] — in a but not in b
np.setdiff1d(b, a) # [6 7] — in b but not in a
np.setxor1d(a, b) # [1 2 6 7] — in one but not both
# Membership test
np.isin(a, b) # [F F T T T F T] — is each element of a in b?
np.isin(a, b, invert=True) # Inverse — not in b
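A handy pattern built on `np.unique`: `return_inverse` doubles as a quick integer label encoding for categorical data. A sketch with hypothetical color labels:

```python
import numpy as np

colors = np.array(['red', 'blue', 'red', 'green', 'blue'])
labels, codes = np.unique(colors, return_inverse=True)
print(labels)          # ['blue' 'green' 'red'] — sorted unique values
print(codes)           # [2 0 2 1 0] — index of each element in labels
print(labels[codes])   # reconstructs the original array exactly
```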
7. Linear Algebra — numpy.linalg {#linalg}
Linear algebra is the backbone of machine learning. NumPy provides a comprehensive linalg module.
import numpy as np
from numpy import linalg as LA
A = np.array([[2, 1], [1, 3]], dtype=float)
B = np.array([[1, 0], [0, 1]], dtype=float) # Identity
# --- BASIC OPERATIONS ---
LA.matrix_power(A, 3) # A³ = A @ A @ A
np.trace(A) # Sum of diagonal: 5.0
# --- DETERMINANT ---
det = LA.det(A) # 5.0
# --- INVERSE ---
inv = LA.inv(A) # A⁻¹ such that A @ A⁻¹ = I
# Verify: A @ inv ≈ Identity
print(np.allclose(A @ inv, np.eye(2))) # True
# --- NORMS ---
v = np.array([3.0, 4.0])
LA.norm(v) # 5.0 — L2 norm (Euclidean distance)
LA.norm(v, ord=1) # 7.0 — L1 norm (Manhattan distance)
LA.norm(v, ord=np.inf) # 4.0 — L∞ norm (max absolute value)
LA.norm(A) # Frobenius norm (matrix)
LA.norm(A, ord='fro') # Same — Frobenius explicitly
LA.norm(A, ord=2) # Spectral norm (largest singular value)
# --- EIGENVALUES & EIGENVECTORS ---
eigenvalues, eigenvectors = LA.eig(A)
print(eigenvalues) # [1.38 3.62] (approximately)
print(eigenvectors) # Columns are eigenvectors
# For symmetric/Hermitian matrices (more stable)
eigvals_sym, eigvecs_sym = LA.eigh(A) # Returns (eigenvalues, eigenvectors); eigenvalues in ascending order
# --- SINGULAR VALUE DECOMPOSITION (SVD) ---
# A = U @ np.diag(S) @ Vt
U, S, Vt = LA.svd(A)
# U: left singular vectors
# S: singular values (sorted descending)
# Vt: right singular vectors (transposed)
# Reconstruct A
A_reconstructed = U @ np.diag(S) @ Vt
print(np.allclose(A, A_reconstructed)) # True
# Truncated SVD for dimensionality reduction (like PCA)
U, S, Vt = LA.svd(A, full_matrices=False) # Economy/thin SVD
# --- QR DECOMPOSITION ---
Q, R = LA.qr(A) # A = Q @ R, Q orthogonal, R upper triangular
print(np.allclose(A, Q @ R)) # True
print(np.allclose(Q @ Q.T, np.eye(2))) # Q is orthogonal: True
# --- CHOLESKY DECOMPOSITION ---
# For symmetric positive definite matrices
# A = L @ L.T (L is lower triangular)
A_sym = np.array([[4, 2], [2, 3]], dtype=float)
L = LA.cholesky(A_sym)
print(np.allclose(A_sym, L @ L.T)) # True
# --- SOLVING LINEAR SYSTEMS ---
# Solve Ax = b
A = np.array([[3, 1], [1, 2]], dtype=float)
b = np.array([9, 8], dtype=float)
x = LA.solve(A, b) # x = [2, 3]
print(np.allclose(A @ x, b)) # True
# Least squares solution (overdetermined system — more equations than unknowns)
A = np.array([[1, 1], [1, 2], [1, 3]], dtype=float) # 3 equations
b = np.array([1, 2, 2], dtype=float) # 2 unknowns
x, residuals, rank, sv = LA.lstsq(A, b, rcond=None)
print(x) # Best-fit line coefficients: intercept ≈ 0.67, slope = 0.5
# --- MATRIX RANK & CONDITION NUMBER ---
A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=float)
print(LA.matrix_rank(A)) # 2 — rows are linearly dependent!
print(LA.cond(A)) # Condition number — high means ill-conditioned
# --- PSEUDO-INVERSE (Moore-Penrose) ---
# For non-square or singular matrices
A = np.array([[1, 2], [3, 4], [5, 6]], dtype=float) # 3x2 (tall matrix)
pinv = LA.pinv(A) # shape (2, 3) — left inverse when more rows than cols
Real-World Linear Algebra — Linear Regression from Scratch
import numpy as np
# Generate synthetic data: y = 2x + 3 + noise
np.random.seed(42)
X_raw = np.linspace(0, 10, 100)
y = 2 * X_raw + 3 + np.random.randn(100)
# Add bias column (intercept) — augment X with ones
X = np.column_stack([np.ones(100), X_raw]) # shape (100, 2)
# OLS solution: w = (X^T X)^{-1} X^T y (Normal Equation)
w = np.linalg.solve(X.T @ X, X.T @ y)
print(f"Intercept: {w[0]:.4f}, Slope: {w[1]:.4f}")
# Intercept: ~3.0, Slope: ~2.0
# Alternative with lstsq (numerically more stable)
w, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
print(f"Intercept: {w[0]:.4f}, Slope: {w[1]:.4f}")
# Predictions
y_pred = X @ w
mse = np.mean((y - y_pred) ** 2)
print(f"MSE: {mse:.4f}")
8. Fourier Transform — numpy.fft {#fft}
import numpy as np
# --- 1D FFT ---
# Create a signal: sum of two sine waves
t = np.linspace(0, 1, 500, endpoint=False) # 1 second, 500 samples
freq1, freq2 = 5, 20 # 5 Hz and 20 Hz components
signal = np.sin(2 * np.pi * freq1 * t) + 0.5 * np.sin(2 * np.pi * freq2 * t)
# Compute FFT
fft_result = np.fft.fft(signal)
frequencies = np.fft.fftfreq(len(signal), d=1/500) # Frequency bins
# Get magnitude spectrum
magnitude = np.abs(fft_result)
# Only positive frequencies (spectrum is symmetric)
positive_freq_idx = frequencies > 0
print(frequencies[positive_freq_idx][np.argmax(magnitude[positive_freq_idx])])
# Should show 5.0 (dominant frequency)
# --- INVERSE FFT ---
signal_reconstructed = np.fft.ifft(fft_result).real
print(np.allclose(signal, signal_reconstructed)) # True
# --- 2D FFT (Image frequency analysis) ---
image = np.random.rand(64, 64)
fft2d = np.fft.fft2(image)
fft2d_shifted = np.fft.fftshift(fft2d) # Center the zero-frequency component
magnitude_spectrum = np.log1p(np.abs(fft2d_shifted)) # Log scale for visibility
# Reconstruct
image_back = np.fft.ifft2(fft2d).real
# --- REAL FFT (when input is real) — more efficient ---
rfft_result = np.fft.rfft(signal) # Half the output (no redundancy)
frequencies = np.fft.rfftfreq(len(signal), d=1/500)
signal_back = np.fft.irfft(rfft_result, n=len(signal))
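The rfft/irfft pair enables a simple frequency-domain filter. A sketch reusing the 5 Hz + 20 Hz signal from above (the 10 Hz cutoff is an arbitrary choice):

```python
import numpy as np

t = np.linspace(0, 1, 500, endpoint=False)
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 20 * t)

spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(len(signal), d=1/500)
spectrum[freqs > 10] = 0                          # zero out the 20 Hz component
filtered = np.fft.irfft(spectrum, n=len(signal))

# Only the 5 Hz sine survives (both tones fall on exact frequency bins)
print(np.allclose(filtered, np.sin(2 * np.pi * 5 * t)))  # True
```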
9. Polynomial Operations — numpy.polynomial {#polynomial}
import numpy as np
from numpy.polynomial import polynomial as P
# --- CLASSIC INTERFACE (numpy.poly1d) ---
# Represent polynomial: 3x² + 2x - 1
p = np.poly1d([3, 2, -1])
print(p) # 3x² + 2x - 1
print(p(4)) # 3(16) + 2(4) - 1 = 55
print(p.roots) # Roots where polynomial = 0
print(p.deriv()) # Derivative: 6x + 2
print(p.integ()) # Integral
# Polynomial arithmetic
p1 = np.poly1d([1, -3]) # x - 3
p2 = np.poly1d([1, -5]) # x - 5
print(p1 * p2) # x² - 8x + 15
print(np.polymul(p1.coeffs, p2.coeffs)) # Same
# Polynomial fitting
x = np.array([0, 1, 2, 3, 4])
y = np.array([1, 3, 9, 19, 33])
coeffs = np.polyfit(x, y, deg=2) # Fit degree-2 polynomial
fitted_poly = np.poly1d(coeffs)
print(fitted_poly) # ~2x² + 0x + 1
# Evaluate at new points
x_new = np.linspace(0, 4, 100)
y_new = fitted_poly(x_new)
# --- NEW API (numpy.polynomial) — Recommended ---
# More numerically stable, clearer API
from numpy.polynomial import Polynomial
# Create from coefficients [a0, a1, a2] = a0 + a1*x + a2*x²
p = Polynomial([1, 2, 3]) # 1 + 2x + 3x²
print(p(2)) # 1 + 4 + 12 = 17
print(p.roots()) # Complex roots
print(p.deriv()) # 2 + 6x
print(p.integ()) # x + x² + x³
# Fit
x = np.array([0, 1, 2, 3, 4])
y = np.array([1, 3, 9, 19, 33])
poly_fit = Polynomial.fit(x, y, deg=2)
print(poly_fit.convert().coef) # Coefficients
10. Random Number Generation — Deep Dive {#random}
import numpy as np
# --- LEGACY API (still works but less recommended) ---
np.random.seed(42)
# Distributions
np.random.rand(3, 4) # Uniform [0, 1)
np.random.randn(3, 4) # Standard normal N(0, 1)
np.random.randint(0, 100, (3, 4)) # Random integers
np.random.uniform(1.5, 3.5, 10) # Uniform [low, high)
np.random.normal(loc=50, scale=10, size=1000) # Normal distribution (parameter is loc, not mean)
np.random.binomial(n=10, p=0.5, size=1000) # Binomial
np.random.poisson(lam=3, size=1000) # Poisson
np.random.exponential(scale=2.0, size=1000) # Exponential
np.random.gamma(shape=2, scale=1.5, size=1000) # Gamma
np.random.beta(a=2, b=5, size=1000) # Beta
np.random.chisquare(df=5, size=1000) # Chi-square
np.random.choice([1,2,3,4,5], size=10) # Random choice
# --- NEW API (Recommended — numpy.random.Generator) ---
rng = np.random.default_rng(seed=42) # Reproducible generator
# Equivalent functions
rng.random((3, 4)) # Uniform [0, 1)
rng.standard_normal((3, 4)) # N(0, 1)
rng.integers(0, 100, (3, 4)) # Random integers [low, high)
rng.uniform(1.5, 3.5, 10)
rng.normal(50, 10, 1000)
rng.binomial(10, 0.5, 1000)
rng.poisson(3, 1000)
rng.choice([1,2,3,4,5], size=3, replace=False) # Without replacement — size can't exceed the population!
# Shuffle
arr = np.arange(10)
rng.shuffle(arr) # In-place shuffle
shuffled = rng.permutation(arr) # Returns new shuffled array
# Multivariate distributions
mean = [0, 0]
cov = [[1, 0.8], [0.8, 1]] # Correlated variables
samples = rng.multivariate_normal(mean, cov, size=1000)
print(samples.shape) # (1000, 2)
# --- MONTE CARLO EXAMPLE: Estimate π ---
rng = np.random.default_rng(42)
n_points = 1_000_000
points = rng.uniform(-1, 1, (n_points, 2))
inside_circle = np.sum(np.sum(points**2, axis=1) <= 1)
pi_estimate = 4 * inside_circle / n_points
print(f"Estimated π: {pi_estimate:.5f}") # ~3.14159
# --- BOOTSTRAP SAMPLING ---
data = rng.normal(50, 10, 100)
n_bootstraps = 1000
bootstrap_means = np.array([
rng.choice(data, size=len(data), replace=True).mean()
for _ in range(n_bootstraps)
])
ci_lower = np.percentile(bootstrap_means, 2.5)
ci_upper = np.percentile(bootstrap_means, 97.5)
print(f"95% CI for mean: [{ci_lower:.2f}, {ci_upper:.2f}]")
11. File I/O — Save and Load Arrays {#fileio}
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
# --- BINARY FORMAT (.npy and .npz) ---
# Single array — .npy (fast, exact, preserves dtype)
np.save('array.npy', arr)
loaded = np.load('array.npy')
print(np.array_equal(arr, loaded)) # True — exact preservation
# Multiple arrays — .npz (compressed zip of .npy files)
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
np.savez('arrays.npz', array_a=a, array_b=b) # Named arrays
data = np.load('arrays.npz')
print(data['array_a']) # [1 2 3]
print(data['array_b']) # [4 5 6]
print(list(data.files)) # ['array_a', 'array_b']
# Compressed version (smaller file, slower)
np.savez_compressed('arrays_compressed.npz', a=a, b=b)
# --- TEXT FORMAT ---
# Save as CSV / whitespace-separated
np.savetxt('array.csv', arr, delimiter=',', fmt='%d') # Integer format
np.savetxt('array.csv', arr, delimiter=',', fmt='%.4f') # Float format
np.savetxt('array.csv', arr, delimiter=',', header='A,B,C', comments='')
# Load from CSV
loaded = np.loadtxt('array.csv', delimiter=',', dtype=np.int32)
loaded = np.loadtxt('data.csv', delimiter=',', skiprows=1) # Skip header
loaded = np.loadtxt('data.csv', delimiter=',', usecols=[0, 2]) # Select columns
# np.genfromtxt — more powerful (handles missing values)
data = np.genfromtxt('data.csv',
delimiter=',',
skip_header=1,
filling_values=0, # Replace NaN with 0
dtype=float,
usecols=(0, 1, 2))
# --- MEMORY-MAPPED FILES (for large arrays that don't fit in RAM) ---
# Create a memory-mapped file
mmapped = np.memmap('large_data.dat', dtype=np.float32, mode='w+', shape=(1000, 1000))
mmapped[:] = np.random.rand(1000, 1000) # Write data
mmapped.flush() # Explicitly write changes to disk
del mmapped # Release the handle (deletion alone doesn't guarantee a flush)
# Open existing memory-mapped file (read-only)
mmapped_ro = np.memmap('large_data.dat', dtype=np.float32, mode='r', shape=(1000, 1000))
print(mmapped_ro[0, :5]) # Read first row, first 5 elements
# Data is loaded from disk only when accessed!
# --- READING FROM STRING / BYTES ---
import io
csv_string = "1,2,3\n4,5,6\n7,8,9"
arr = np.loadtxt(io.StringIO(csv_string), delimiter=',')
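The same `StringIO` trick lets you try `genfromtxt`'s missing-value handling without a file on disk (hypothetical data with blank fields):

```python
import io
import numpy as np

csv_text = "1,2,3\n4,,6\n7,8,"        # two blank fields
data = np.genfromtxt(io.StringIO(csv_text), delimiter=',', filling_values=0)
print(data)
# [[1. 2. 3.]
#  [4. 0. 6.]
#  [7. 8. 0.]]
```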
12. Structured Arrays & Record Arrays {#structured}
Structured arrays allow you to store heterogeneous data (like a C struct or database row).
import numpy as np
# Define a structured dtype
employee_dtype = np.dtype([
('name', 'U20'), # Unicode string, max 20 chars
('age', 'i4'), # 32-bit integer
('salary', 'f8'), # 64-bit float
('is_active', '?'), # Boolean
])
# Create structured array
employees = np.array([
('Alice', 30, 75000.0, True),
('Bob', 25, 55000.0, True),
('Charlie', 35, 90000.0, False),
('Diana', 28, 80000.0, True),
], dtype=employee_dtype)
# Access by field name (like a column in a DataFrame)
print(employees['name']) # ['Alice' 'Bob' 'Charlie' 'Diana']
print(employees['salary']) # [75000. 55000. 90000. 80000.]
# Access by row (like a record)
print(employees[0]) # ('Alice', 30, 75000., True)
print(employees[0]['name']) # 'Alice'
# Boolean indexing on structured arrays
active = employees[employees['is_active']]
print(active['name']) # ['Alice' 'Bob' 'Diana']
high_earners = employees[employees['salary'] > 70000]
print(high_earners['name']) # ['Alice' 'Charlie' 'Diana']
# Sorting by field
sorted_by_salary = np.sort(employees, order='salary')
print(sorted_by_salary['name']) # ['Bob' 'Alice' 'Diana' 'Charlie']
# Sort by multiple fields
sorted_multi = np.sort(employees, order=['is_active', 'salary'])
# np.recarray — access fields as attributes
rec = employees.view(np.recarray)
print(rec.name) # Same as employees['name']
print(rec.salary) # Same as employees['salary']
13. Masked Arrays — numpy.ma {#masked}
Masked arrays let you mark certain elements as invalid or missing without removing them.
import numpy as np
import numpy.ma as ma
# Create masked array
data = np.array([1, 2, -999, 4, -999, 6])
mask = data == -999 # True where data is invalid
masked = ma.masked_array(data, mask=mask)
print(masked) # [1 2 -- 4 -- 6]
# Operations ignore masked values
print(masked.mean()) # 3.25 — average of [1, 2, 4, 6]
print(masked.sum()) # 13
print(masked.min()) # 1
# Fill masked values
filled = masked.filled(fill_value=0) # Replace masked with 0
filled_mean = masked.filled(masked.mean()) # Fill with mean
# Mask based on condition
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])
masked = ma.masked_where(arr % 2 == 0, arr) # Mask even numbers
print(masked) # [1 -- 3 -- 5 -- 7 --]
masked_greater = ma.masked_greater(arr, 5) # Mask values > 5
masked_less = ma.masked_less(arr, 3) # Mask values < 3
masked_outside = ma.masked_outside(arr, 3, 6) # Mask outside [3, 6]
masked_equal = ma.masked_equal(arr, 4) # Mask value == 4
# Combining masks
m1 = ma.masked_less(arr, 3)
m2 = ma.masked_greater(arr, 6)
combined = ma.mask_or(m1.mask, m2.mask) # Union of masks
# Real use case: sensor data with invalid readings
sensor_data = np.array([23.5, 24.1, -9999.0, 23.8, 24.5, -9999.0])
valid_data = ma.masked_equal(sensor_data, -9999.0)
print(f"Valid readings mean: {valid_data.mean():.2f}")
print(f"Valid count: {valid_data.count()}")
14. Edge Cases & Debugging Tips {#errors}
Floating Point Gotchas
import numpy as np
# NaN propagation
arr = np.array([1.0, 2.0, np.nan, 4.0])
print(np.sum(arr)) # nan — NaN poisons the result
print(np.nansum(arr)) # 7.0 — nan-safe version
# NaN comparisons
x = np.nan
print(x == x) # False! NaN is not equal to itself
print(np.isnan(x)) # True — correct way to check
print(np.isnan(arr)) # [False False True False]
# Infinity
print(np.inf + 1) # inf
print(np.inf - np.inf) # nan
print(1 / np.inf) # 0.0
print(np.isinf(arr)) # Check for infinity
print(np.isfinite(arr)) # Check for neither NaN nor inf
# Divide by zero — NumPy warns but returns inf, not error!
with np.errstate(divide='ignore', invalid='ignore'):
result = np.array([1.0, 2.0]) / np.array([0.0, 1.0])
print(result) # [inf 2.0]
Dtype Overflow
import numpy as np
# Silent overflow — very dangerous!
arr = np.array([200], dtype=np.uint8) # max is 255
arr += 60
print(arr) # [4] — 200 + 60 = 260, wraps to 260 - 256 = 4!
# How to detect overflow risk
arr = np.array([200], dtype=np.uint8)
new_val = 200 + 60
if new_val > np.iinfo(np.uint8).max:
print("Warning: overflow!")
arr = arr.astype(np.int32) + 60 # Upcast first
Memory Errors with Large Arrays
import numpy as np
# Check array size before creating
rows, cols = 100000, 100000
bytes_needed = rows * cols * 8 # float64 = 8 bytes
gb_needed = bytes_needed / 1e9
print(f"Memory needed: {gb_needed:.1f} GB")
if gb_needed > 4:
# Use chunked processing or memory mapping
chunk_size = 1000
for i in range(0, rows, chunk_size):
chunk = np.zeros((min(chunk_size, rows - i), cols), dtype=np.float32)
# Process chunk...
15. Interview Questions — Part 2 {#interview}
Q1: What is the difference between np.dot, np.matmul, and @?
For 2D arrays all three give matrix multiplication. Difference:
`np.dot` with scalars gives a scalar product; `np.matmul` (`@`) doesn't work with scalars. For 3D+ arrays, `np.matmul` does batch matrix multiplication (stacks of matrices), while `np.dot` sums over different axes, producing a larger result.
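A shape-level sketch of the 3D+ difference:

```python
import numpy as np

A = np.ones((10, 3, 4))   # stack of 10 (3x4) matrices
B = np.ones((10, 4, 5))   # stack of 10 (4x5) matrices

print((A @ B).shape)         # (10, 3, 5) — one product per matrix pair
print(np.dot(A, B).shape)    # (10, 3, 10, 5) — all pairwise combinations
```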
Q2: What does ddof=1 mean in np.std()?
ddof = "Delta Degrees of Freedom".
`ddof=0` gives the population std (divides by N). `ddof=1` gives the sample std (divides by N-1, Bessel's correction). Always use `ddof=1` when computing statistics from a sample.
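A tiny worked example of the two divisors:

```python
import numpy as np

data = np.array([2.0, 4.0, 6.0, 8.0])  # mean = 5, squared deviations sum to 20
print(np.var(data))            # 5.0   — population variance: 20 / 4
print(np.var(data, ddof=1))    # 6.67  — sample variance:     20 / 3
```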
Q3: Why does np.sort have multiple kind options?
Different algorithms have different trade-offs.
`quicksort` (default): O(n log n) average, not stable. `mergesort`/`stable`: O(n log n) always, stable, uses more memory. `heapsort`: O(n log n) worst case, in-place. Use `stable` when the relative order of equal elements matters.
Q4: When would you use np.linalg.solve vs np.linalg.lstsq?
`solve` requires a square, invertible matrix (Ax = b with a unique solution). `lstsq` handles overdetermined systems (more equations than unknowns) and finds the least-squares solution of minimum norm. Prefer `lstsq` for regression problems.
Q5: What is the difference between np.save and np.savetxt?
`np.save` writes the binary `.npy` format — fast, lossless, preserves dtype. `np.savetxt` writes human-readable text — slower, may lose float precision, but interoperable with other tools.
Q6: How does NumPy handle NaN values? How do you work around it?
NaN propagates through most operations (sum, mean, etc.). Use nan-safe functions:
`np.nansum`, `np.nanmean`, `np.nanmax`, etc. Detect NaN with `np.isnan()`. Integer arrays can't hold NaN at all — use `np.ma.masked_array` instead.
Q7: Explain the difference between np.correlate and np.corrcoef.
`np.correlate` computes the cross-correlation of two 1D signals (used in signal processing). `np.corrcoef` computes the Pearson correlation coefficient matrix (used in statistics) — values between -1 and 1.
Q8: What is SVD used for in data science?
SVD (A = U S Vt) has many applications: Dimensionality reduction (like PCA), collaborative filtering/recommendation systems, image compression (keep only top-k singular values), solving least squares, computing matrix rank and pseudo-inverse.
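The compression idea can be sketched on a random matrix (rank 5 is an arbitrary choice; a real image would use its actual pixel matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 20))

U, S, Vt = np.linalg.svd(A, full_matrices=False)
k = 5
A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]   # best rank-k approximation

# Relative error is governed by the discarded singular values S[k:]
err = np.linalg.norm(A - A_k) / np.linalg.norm(A)
print(f"rank-{k} relative error: {err:.3f}")
```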
End of Part 2 — Math, Statistics & Linear Algebra
Up Next → Part 3: Performance Optimization, Memory Layout, Advanced Patterns, Real-World Pipelines & Complete Interview Guide