Machine learning (ML) is one of the most sought-after fields in the technology industry, and Python skills are often a prerequisite because of the language's extensive libraries and ease of use. If you are preparing for an interview in this field, you need to be comfortable with both the theoretical concepts and their practical implementation. Here are some frequently asked Python ML interview questions and answers to help you prepare.
Preprocessing techniques are essential for preparing data for machine learning models. The most common techniques include normalization and the creation of dummy variables:
    from sklearn.preprocessing import MinMaxScaler
    import pandas as pd

    # Data normalization
    scaler = MinMaxScaler()
    normalized_data = scaler.fit_transform(data)

    # Creating dummy variables
    df_with_dummies = pd.get_dummies(data, drop_first=True)
Brute-force algorithms exhaustively try every possibility to find a solution. A common example is linear search, in which the algorithm checks each element of an array in turn until it finds a match.
    def linear_search(arr, target):
        # Check every element in order and return the index of the first match
        for i in range(len(arr)):
            if arr[i] == target:
                return i
        return -1  # target not found

    # Example usage
    arr = [2, 3, 4, 10, 40]
    target = 10
    result = linear_search(arr, target)  # 3
An imbalanced dataset is one whose class proportions are skewed. Strategies for handling it include oversampling the minority class, for example with SMOTE:
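    from imblearn.over_sampling import SMOTE
    from sklearn.model_selection import train_test_split

    # Split first so that the test set contains only real, untouched samples
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    # Oversample the minority class in the training data only
    X_resampled, y_resampled = SMOTE().fit_resample(X_train, y_train)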
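Class weighting is another widely used option that leaves the data itself unchanged. The sketch below is only an illustration: the synthetic dataset from `make_classification`, the 90/10 class ratio, and the choice of `LogisticRegression` are assumptions, not part of the original answer.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 90% of samples in class 0 (illustrative assumption)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# stratify=y keeps the class ratio the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# class_weight="balanced" weights errors inversely to class frequency
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)

# Recall on the minority class is usually more informative than accuracy here
minority_recall = recall_score(y_test, clf.predict(X_test))
print(minority_recall)
```

Whether weighting or resampling works better depends on the data and the model, so it is worth comparing both with a metric that reflects the minority class.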
Common strategies for handling missing data are omission (dropping the affected rows or columns) and imputation (filling in substitute values):
    from sklearn.impute import SimpleImputer

    # Imputing missing values
    imputer = SimpleImputer(strategy='median')
    data_imputed = imputer.fit_transform(data)
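The omission strategy can be sketched with pandas. The small DataFrame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column
df = pd.DataFrame({
    "age": [25, np.nan, 40],
    "income": [50000, 62000, np.nan],
})

# Drop every row that has at least one missing value
complete_rows = df.dropna()

# Drop only the rows where a specific column is missing
rows_with_age = df.dropna(subset=["age"])

print(len(complete_rows), len(rows_with_age))
```

Omission is simple but discards information, so it tends to be reasonable only when few rows are affected.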
Regression is a supervised learning technique used to model the relationship between variables and to predict a dependent variable. Common examples include linear regression and logistic regression (which, despite its name, is typically used for classification); both can be implemented with Scikit-learn.
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    # Split the dataset
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    # Create and train the model
    model = LinearRegression()
    model.fit(X_train, y_train)

    # Make predictions
    predictions = model.predict(X_test)
In Python, you can use Scikit-learn's train_test_split function to split your data into training and test sets.
    from sklearn.model_selection import train_test_split

    # Split the dataset: 60% training and 40% testing
    X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.4)
Some important parameters for tree-based learners include max_depth, n_estimators, and max_features:
    from sklearn.ensemble import RandomForestClassifier

    # Setting parameters for Random Forest
    model = RandomForestClassifier(max_depth=5, n_estimators=100, max_features='sqrt', random_state=42)
    model.fit(X_train, y_train)
Two common methods for hyperparameter optimization are grid search and random search:
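    from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

    # Assumes `model` is an estimator such as RandomForestClassifier and that
    # X_train and y_train are already defined

    # Grid Search: evaluates every combination in the grid (3 x 3 = 9 here)
    param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [5, 10, 15]}
    grid_search = GridSearchCV(model, param_grid, cv=5)
    grid_search.fit(X_train, y_train)

    # Random Search: samples n_iter combinations at random
    # (n_iter is kept below the 9 combinations available)
    param_dist = {'n_estimators': [50, 100, 200], 'max_depth': [5, 10, 15]}
    random_search = RandomizedSearchCV(model, param_dist, n_iter=5, cv=5, random_state=42)
    random_search.fit(X_train, y_train)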
You need to remove the days with no rain and then compute the median.
    def median_rainfall(df_rain):
        # Remove days with no rain
        df_rain_filtered = df_rain[df_rain['rainfall'] > 0]
        # Find the median amount of rainfall
        return df_rain_filtered['rainfall'].median()
You can use pandas to compute the median and impute it for the missing values.
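    def impute_median_price(df, column):
        # Median of the non-missing values in the column
        median_price = df[column].median()
        # Assign the result back; calling fillna(inplace=True) on a column
        # selection may not modify the original DataFrame in recent pandas versions
        df[column] = df[column].fillna(median_price)
        return df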
    def fill_none(input_list):
        # Replace each None with the most recent non-None value (forward fill)
        prev_value = None
        result = []
        for value in input_list:
            if value is None:
                result.append(prev_value)
            else:
                result.append(value)
                prev_value = value
        return result
    def grades_colors(df_students):
        # Keep students with a grade above 90 whose favorite color is green or red
        filtered_df = df_students[
            (df_students["grade"] > 90)
            & (df_students["favorite_color"].isin(["green", "red"]))
        ]
        return filtered_df
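    import pandas as pd

    def calculate_t_value(df, column, mu_0):
        # One-sample t-statistic: (sample mean - mu_0) / (sample std / sqrt(n))
        sample_mean = df[column].mean()
        sample_std = df[column].std()  # sample standard deviation (ddof=1)
        n = df[column].count()         # number of non-missing observations
        t_value = (sample_mean - mu_0) / (sample_std / (n ** 0.5))
        return t_value

    # Example usage (assumes df has a numeric column 'var' and mu_0 is the hypothesized mean)
    t_value = calculate_t_value(df, 'var', mu_0)
    print(t_value)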
    import numpy as np
    import pandas as pd

    def euclidean_distance(point1, point2):
        return np.sqrt(np.sum((point1 - point2) ** 2))

    def kNN(k, data, new_point):
        # Features are all columns except the last one ('label')
        features = data.iloc[:, :-1]
        new_point = np.asarray(new_point)
        distances = features.apply(lambda row: euclidean_distance(row.to_numpy(), new_point), axis=1)
        # sort_values() returns index labels, so select the rows with .loc
        nearest = distances.sort_values().index[:k]
        top_k = data.loc[nearest]
        # Majority vote among the k nearest neighbors
        return top_k['label'].mode()[0]

    # Example usage
    data = pd.DataFrame({
        'feature1': [1, 2, 3, 4],
        'feature2': [2, 3, 4, 5],
        'label': [0, 0, 1, 1]
    })
    new_point = [2.5, 3.5]
    k = 3
    result = kNN(k, data, new_point)
    print(result)
Note: This example uses simplified assumptions to meet the interview constraints.
    import pandas as pd
    import numpy as np

    def create_tree(dataframe, new_point):
        for col in dataframe.columns[:-1]:  # Exclude the 'class' column
            if new_point[col] == 1:
                sub_data = dataframe[dataframe[col] == 1]
                if len(sub_data) > 0:
                    return sub_data['class'].mode()[0]
        # Default to the most frequent class
        return dataframe['class'].mode()[0]

    def random_forest(df, new_point, n_trees):
        results = []
        # Every tree sees the same data in this simplified version, so all votes agree
        for _ in range(n_trees):
            tree_result = create_tree(df, new_point)
            results.append(tree_result)
        # Majority vote
        return max(set(results), key=results.count)

    # Example usage
    df = pd.DataFrame({
        'feature1': [0, 1, 1, 0],
        'feature2': [0, 0, 1, 1],
        'class': [0, 1, 1, 0]
    })
    new_point = {'feature1': 1, 'feature2': 0}
    n_trees = 5
    result = random_forest(df, new_point, n_trees)
    print(result)
    import pandas as pd
    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    def logistic_regression(X, y, num_iterations, learning_rate):
        weights = np.zeros(X.shape[1])
        for i in range(num_iterations):
            z = np.dot(X, weights)
            predictions = sigmoid(z)
            errors = y - predictions
            # Gradient of the log-likelihood with respect to the weights
            gradient = np.dot(X.T, errors)
            # Gradient ascent step (equivalent to descending the log-loss)
            weights += learning_rate * gradient
        return weights

    # Example usage
    df = pd.DataFrame({
        'feature1': [0, 1, 1, 0],
        'feature2': [0, 0, 1, 1],
        'class': [0, 1, 1, 0]
    })
    X = df[['feature1', 'feature2']].values
    y = df['class'].values
    num_iterations = 1000
    learning_rate = 0.01
    weights = logistic_regression(X, y, num_iterations, learning_rate)
    print(weights)
    import numpy as np

    def k_means(data_points, k, initial_centroids):
        centroids = initial_centroids
        while True:
            # Distance from every point to every centroid, shape (n_points, k)
            distances = np.linalg.norm(data_points[:, np.newaxis] - centroids, axis=2)
            # Assign each point to its nearest centroid
            clusters = np.argmin(distances, axis=1)
            # Recompute each centroid as the mean of its assigned points
            new_centroids = np.array([data_points[clusters == i].mean(axis=0) for i in range(k)])
            # Stop once the centroids no longer change
            if np.all(centroids == new_centroids):
                break
            centroids = new_centroids
        return clusters

    # Example usage
    data_points = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
    k = 2
    initial_centroids = np.array([[1, 2], [10, 2]])
    clusters = k_means(data_points, k, initial_centroids)
    print(clusters)  # [0 0 0 1 1 1]
Machine Learning is a field of artificial intelligence focused on building algorithms that enable computers to learn from data without explicit programming. It uses algorithms to analyze and identify patterns in data and make predictions based on those patterns.
"Machine learning is a branch of artificial intelligence that involves creating algorithms capable of learning from and making predictions based on data. It works by training a model on a dataset and then using that model to make predictions on new data."
There are three main types of machine learning algorithms:
Supervised Learning: Uses labeled data and makes predictions based on this information. Examples include linear regression and classification algorithms.
Unsupervised Learning: Processes unlabeled data and seeks to find patterns or relationships in it. Examples include clustering algorithms like K-means.
Reinforcement Learning: The algorithm learns from interacting with its environment, receiving rewards or punishments for certain actions. Examples include training AI agents in games.
"There are three main types of machine learning algorithms: supervised learning, unsupervised learning, and reinforcement learning. Supervised learning uses labeled data to make predictions, unsupervised learning finds patterns in unlabeled data, and reinforcement learning learns from interactions with the environment to maximize rewards."
Cross-validation is a technique for evaluating a machine learning model by repeatedly splitting the dataset into a training set and a validation set. In k-fold cross-validation, the data is divided into k folds; the model is trained on k-1 folds and validated on the remaining one, rotating until each fold has served as the validation set once, and the scores are averaged.
"Cross-validation is a technique used to evaluate a machine learning model's performance by repeatedly dividing the dataset into training and validation sets and averaging the results. It helps check that the model generalizes well to new data, making overfitting easier to detect and providing a more reliable measure of performance."
Artificial Neural Networks (ANNs) are models inspired by the human brain's structure. They consist of layers of interconnected nodes (neurons) that process input data and generate output predictions.
"An artificial neural network is a machine learning model inspired by the structure and function of the human brain. It comprises layers of interconnected neurons that process input data through weighted connections to make predictions."
Decision Trees are models for classification and regression tasks that split data into subsets based on the values of input variables to generate prediction rules.
"A decision tree is a tree-like model used for classification and regression tasks. It works by recursively splitting data into subsets based on input variables, creating rules for making predictions."
K-Nearest Neighbors (KNN) is a simple machine learning algorithm used for classification or regression tasks. It finds the k closest data points in the feature space to a given unseen data point and, for classification, assigns the majority class of those k nearest neighbors; for regression, it averages their values.
"The K-Nearest Neighbors (KNN) algorithm is a machine learning technique used for classification or regression. It works by identifying the k closest data points to a given point in the feature space and classifying it based on the majority class among the k nearest neighbors."
Support Vector Machines (SVMs) are models used for classification and regression tasks. In their basic form they are linear binary classifiers that find the boundary (hyperplane) separating the classes with the largest margin; kernel functions extend them to non-linear boundaries. Data points closest to the hyperplane, called support vectors, play a critical role in defining this boundary.
"The Support Vector Machine (SVM) algorithm is a model used for classification and regression tasks. It identifies the hyperplane that separates the classes with the widest margin, relying heavily on the data points closest to the hyperplane, known as support vectors."
Regularization is a technique to prevent overfitting in machine learning models by adding a penalty term to the loss function. This penalty discourages the model from learning overly complex relationships in the data.
"Regularization is a technique to prevent overfitting in machine learning models by adding a penalty term to the loss function, which discourages the model from learning overly complex patterns. Common types of regularization include L1 (Lasso) and L2 (Ridge) regularization."
    from sklearn.linear_model import Ridge

    # Applying L2 Regularization (Ridge Regression)
    ridge_model = Ridge(alpha=1.0)
    ridge_model.fit(X_train, y_train)
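To contrast L1 with L2, the following sketch fits both on synthetic data. The dataset from `make_regression` and the `alpha` values are assumptions for illustration. L1 tends to set some coefficients exactly to zero, while L2 only shrinks them.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 10 features, only 3 of which actually influence the target
X, y = make_regression(
    n_samples=100, n_features=10, n_informative=3, noise=5.0, random_state=0
)

ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty
lasso = Lasso(alpha=1.0).fit(X, y)  # L1 penalty

# Count how many coefficients each model set exactly to zero
ridge_zeros = int(np.sum(ridge.coef_ == 0))
lasso_zeros = int(np.sum(lasso.coef_ == 0))
print(ridge_zeros, lasso_zeros)
```

This sparsity is why Lasso is sometimes used as a rough form of feature selection.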
Gradient Descent is an optimization algorithm used to minimize a cost function in machine learning. It iteratively adjusts the parameters of the model in the direction of the negative gradient of the cost function until it reaches a minimum.
"Gradient Descent is an optimization algorithm used to minimize a cost function in machine learning. It iteratively updates the model parameters in the direction of the negative gradient of the cost function, aiming to find the parameters that minimize the cost."
Ensemble Learning is a technique where multiple models (often called "weak learners") are combined to solve a prediction task. The combined model is generally more robust and performs better than individual models.
"Ensemble learning is a machine learning technique where multiple models are combined to solve a prediction task. Common ensemble methods include bagging, boosting, and stacking. Combining the predictions of individual models can improve performance and reduce the risk of overfitting."
    from sklearn.ensemble import RandomForestClassifier

    # Ensemble learning using Random Forest
    model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
Preparing for a Python machine learning interview involves understanding both theoretical concepts and practical implementations. This guide has covered several essential questions and answers that frequently come up in interviews. By familiarizing yourself with these topics and practicing the provided code examples, you'll be well-equipped to handle a wide range of questions in your next machine learning interview. Good luck!