Understanding Your Data: The Essentials of Exploratory Data Analysis

WBOY
Published: 2024-08-10 06:56:02


Introduction

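Exploratory Data Analysis (EDA) is a critical first step that every data scientist and data analyst must carry out. Freshly collected data is raw, unprocessed fact, and neither a data scientist, an analyst, nor anyone else can readily understand its structure and content. This is where EDA comes in: analyzing and visualizing data to understand its key characteristics, uncover patterns, and identify relationships between variables.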

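Understanding data requires understanding its expected quality and characteristics: what you already know about the data, the needs it will serve, its content, and how it was created. Now let us delve deeper into EDA and see how to turn data into information. Information is data that has been processed, organized, interpreted, and structured.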

Exploratory Data Analysis

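As mentioned above, EDA refers to analyzing and visualizing data to understand its key characteristics, uncover patterns, and identify relationships between variables. It helps determine how best to manipulate data sources to get the answers you need, making it easier for data scientists to discover patterns, spot anomalies, and test hypotheses or check assumptions. It is an important first step in data analysis and the foundation for understanding and interpreting complex datasets.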

Types of EDA
These are the different methods and approaches used in the exploratory data analysis process. There are three main types of EDA:

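Univariate analysis: This is the simplest form of data analysis; it explores each variable in a dataset on its own. It involves looking at the range of values and their central tendency, and it describes the pattern of responses for that variable. For example, examining the ages of a company's employees.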
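
As a minimal sketch of the employee-age example, the range and central tendency of a single variable can be computed with pandas. The ages below are made-up sample values, not data from any real company.

```python
import pandas as pd

# Hypothetical sample data: ages of employees in a company.
ages = pd.Series([23, 25, 31, 35, 35, 41, 52], name="age")

# Range of values.
age_min, age_max = ages.min(), ages.max()

# Central tendency.
age_mean = ages.mean()
age_median = ages.median()
age_mode = ages.mode().tolist()  # mode() can return several values

print(f"range: {age_min} to {age_max}")
print(f"mean: {age_mean:.2f}, median: {age_median}, mode: {age_mode}")
```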

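Bivariate analysis: In this analysis, two variables are observed. It aims to determine whether there is a statistical association between the two variables and, if so, how strong it is. Bivariate analysis lets researchers examine the relationship between two variables. Before using it, you should understand why it matters:

 Bivariate analysis helps identify trends and patterns.
 It helps flag possible cause-and-effect relationships, although an association alone does not establish causation.
 It helps researchers make predictions.
 It also informs decision-making.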

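Techniques used in bivariate analysis include scatter plots, correlation, regression, chi-square tests, t-tests, and analysis of variance (ANOVA), all of which can be used to determine how two variables are related.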
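
Two of these techniques, correlation and simple linear regression, can be sketched as follows. The `experience` and `salary` columns and their values are hypothetical and exist only to illustrate the calls.

```python
import numpy as np
import pandas as pd

# Hypothetical data: years of experience and salary (in thousands).
df = pd.DataFrame({
    "experience": [1, 2, 3, 4, 5, 6],
    "salary": [30, 35, 41, 44, 52, 55],
})

# Pearson correlation coefficient between the two variables (between -1 and 1).
r = df["experience"].corr(df["salary"])

# Least-squares line: salary is approximated by slope * experience + intercept.
slope, intercept = np.polyfit(df["experience"], df["salary"], deg=1)

print(f"correlation: {r:.3f}")
print(f"fitted line: salary = {slope:.2f} * experience + {intercept:.2f}")
```

A coefficient close to 1 or -1 indicates a strong linear association; it says nothing by itself about which variable, if either, causes the other.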

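Multivariate analysis: This involves the statistical study of experiments in which multiple measurements are made on each experimental unit, and in which the relationships among the multivariate measurements and their structure are important to understanding the experiment. For example, how many hours a day a person spends on Instagram, studied together with other measurements taken for the same person.

Techniques include dependence techniques and interdependence techniques.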

Essentials of EDA

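a. Data collection: The first step in working with data is to have the data you need. Depending on the topic you are studying, collect data from various sources using methods such as web scraping or downloading datasets from platforms like Kaggle.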

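b. Understand your data: Before cleaning, you must first understand the data you have collected. Find out how many rows and columns you will be working with, what information each column holds, the characteristics of the data, the data types, and so on.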
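
A first look at a dataset might be sketched like this with pandas. The small DataFrame stands in for whatever data you have collected; its column names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical dataset standing in for your collected data.
df = pd.DataFrame({
    "name": ["Ann", "Ben", "Cara"],
    "age": [29, 41, 35],
    "department": ["HR", "IT", "IT"],
})

# Number of rows and columns.
n_rows, n_cols = df.shape
print(f"{n_rows} rows, {n_cols} columns")

# Data type of each column.
print(df.dtypes)

# First few records, to see what each column holds.
print(df.head())

# Column names, non-null counts, and memory usage in one report.
df.info()
```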

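c. Data cleaning: This step involves identifying and resolving errors, inconsistencies, duplicates, or incomplete entries in the data. Its main goal is to improve the quality and usefulness of the data so that the findings are more reliable and precise. Data cleaning involves several steps.
How to clean data: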

      i) Handle missing values: impute them using the mean, median, or mode of the column, fill them with a constant, forward-fill, backward-fill, or interpolate, or drop them with the pandas dropna() method.

      ii) Detect outliers: use the interquartile range (IQR), visualization, the Z-score, or a One-Class SVM.

      iii) Handle duplicates: drop duplicate records.

      iv) Fix structural errors: address issues with the layout and format of your data, such as inconsistent date formats or misaligned fields.

      v) Remove unnecessary values: your dataset might contain irrelevant or redundant information. Identify and remove any records or fields that will not contribute to the insights you are trying to derive.
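
The cleaning steps above can be sketched with pandas as follows. This is one possible ordering under stated assumptions: the `raw` DataFrame, its columns, the choice of median imputation, and the 1.5 × IQR rule are illustrative choices, not the only valid ones.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with a missing age, a missing city,
# one duplicated record, and one implausible age.
raw = pd.DataFrame({
    "age": [25, np.nan, 31, 31, 120, 28],
    "city": ["Paris", "Lagos", "Tokyo", "Tokyo", "Paris", None],
    "joined": ["2021-01-05", "2021-02-05", "2021-03-01",
               "2021-03-01", "2022-07-19", "2023-11-30"],
})

# Handle duplicates: drop repeated records (the first copy is kept).
clean = raw.drop_duplicates().copy()

# Handle missing values: impute the numeric column with its median
# and the categorical column with a constant.
clean["age"] = clean["age"].fillna(clean["age"].median())
clean["city"] = clean["city"].fillna("Unknown")

# Fix structural errors: parse date strings into a datetime column.
clean["joined"] = pd.to_datetime(clean["joined"])

# Detect outliers with the interquartile range (IQR) rule.
q1, q3 = clean["age"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (clean["age"] < q1 - 1.5 * iqr) | (clean["age"] > q3 + 1.5 * iqr)
outliers = clean[is_outlier]

print(clean)
print("Flagged as outliers:")
print(outliers)
```

Whether a flagged value should be removed, capped, or kept depends on the domain; the sketch only flags it.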

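d. Summary statistics: This step gives a quick overview of the central tendency and spread of the dataset's numeric features, including the mean, median, mode, standard deviation, minimum, and maximum, using the describe() method in pandas or the statistical functions in NumPy. For categorical features, we can use charts and summary counts such as category frequencies.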
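
A minimal sketch of this step with pandas, using an invented two-column dataset: `describe()` covers the numeric summary, `mode()` fills in the mode (which `describe()` does not report for numeric columns), and `value_counts()` gives frequencies for a categorical column.

```python
import pandas as pd

# Hypothetical dataset with one numeric and one categorical feature.
df = pd.DataFrame({
    "age": [22, 25, 25, 30, 38],
    "department": ["IT", "HR", "IT", "IT", "Sales"],
})

# Count, mean, standard deviation, min, quartiles (50% is the median), and max.
numeric_summary = df["age"].describe()

# The mode is computed separately.
age_mode = df["age"].mode().iloc[0]

# Frequency of each category.
category_counts = df["department"].value_counts()

print(numeric_summary)
print("mode:", age_mode)
print(category_counts)
```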

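e. Data visualization: This is the practice of designing and creating easy-to-communicate and easy-to-understand graphical or visual representations of large amounts of complex quantitative and qualitative data. Try using line charts, bar charts, scatter plots, and box plots, with tools such as Matplotlib, Seaborn, or Tableau, to identify trends and patterns in the dataset.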
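
The four chart types mentioned above can be sketched with Matplotlib as shown here. The monthly `sales` and `ad_spend` figures are hypothetical, and the `Agg` backend is selected only so the script also runs on machines without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to use your default
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical monthly data.
df = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar", "Apr", "May", "Jun"],
    "sales": [120, 135, 128, 160, 172, 400],
    "ad_spend": [10, 12, 11, 15, 16, 18],
})

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Line plot: trend over time.
axes[0, 0].plot(df["month"], df["sales"], marker="o")
axes[0, 0].set_title("Line plot: sales by month")

# Bar chart: comparison across categories.
axes[0, 1].bar(df["month"], df["sales"])
axes[0, 1].set_title("Bar chart: sales by month")

# Scatter plot: relationship between two numeric variables.
axes[1, 0].scatter(df["ad_spend"], df["sales"])
axes[1, 0].set_xlabel("ad_spend")
axes[1, 0].set_ylabel("sales")
axes[1, 0].set_title("Scatter plot: ad spend vs sales")

# Box plot: spread of one variable, with extreme values drawn as points.
axes[1, 1].boxplot(df["sales"].to_numpy())
axes[1, 1].set_title("Box plot: sales")

fig.tight_layout()
```

Call `fig.savefig("eda_overview.png")` to write the figure to a file (the file name is arbitrary), or use `plt.show()` with an interactive backend.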

f. Data relationships: Identify the relationships in your data by performing correlation analysis between variables.

  • Analyze relationships between categorical variables as well. Use techniques such as correlation matrices and heatmaps to visualize the relationships.
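
As a rough sketch, a correlation matrix covers the numeric columns, while a contingency table (cross-tabulation) is one common way to look at two categorical columns. All column names and values below are invented for illustration.

```python
import pandas as pd

# Hypothetical dataset with numeric and categorical columns.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "exam_score": [52, 55, 61, 70, 74, 80],
    "hours_gaming": [6, 5, 5, 3, 2, 1],
    "plan": ["free", "free", "paid", "paid", "paid", "free"],
    "churned": ["yes", "yes", "no", "no", "no", "yes"],
})

# Pairwise Pearson correlations between the numeric columns.
numeric_cols = ["hours_studied", "exam_score", "hours_gaming"]
corr_matrix = df[numeric_cols].corr()

# Contingency table of counts for two categorical columns.
contingency = pd.crosstab(df["plan"], df["churned"])

print(corr_matrix.round(2))
print(contingency)
```

The resulting matrix can be passed to a heatmap function such as `seaborn.heatmap` to visualize it.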

g. Hypothesis testing: Conduct tests such as t-tests, chi-square tests, and ANOVA to determine statistical significance.
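
The three tests named above are available in `scipy.stats`. The sketch below assumes SciPy is installed; the group measurements, the click counts, and the 0.05 significance threshold are hypothetical choices for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements from three groups.
group_a = [12.1, 11.8, 12.5, 12.0, 11.9, 12.3]
group_b = [13.0, 13.4, 12.9, 13.2, 13.5, 13.1]
group_c = [12.6, 12.4, 12.8, 12.7, 12.5, 12.9]

alpha = 0.05  # assumed significance threshold

# t-test: do two groups have different means?
# equal_var=False selects Welch's t-test, which does not assume equal variances.
t_stat, t_p = stats.ttest_ind(group_a, group_b, equal_var=False)

# Chi-square test of independence on a 2x2 table of counts
# (rows: variants A and B; columns: clicked, not clicked).
table = np.array([[30, 70],
                  [45, 55]])
chi2, chi_p, dof, expected = stats.chi2_contingency(table)

# One-way ANOVA: do three or more groups share the same mean?
f_stat, anova_p = stats.f_oneway(group_a, group_b, group_c)

print(f"t-test p-value:     {t_p:.4g} (significant: {t_p < alpha})")
print(f"chi-square p-value: {chi_p:.4g} (dof = {dof})")
print(f"ANOVA p-value:      {anova_p:.4g} (significant: {anova_p < alpha})")
```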

h. Communicate your findings and insights: This is the final step in carrying out EDA. It includes summarizing your evaluation, highlighting key discoveries, and presenting your results clearly.

  • Clearly state the objectives and scope of your analysis.
  • Use visualizations to display your findings.
  • Highlight critical insights, patterns, or anomalies you discovered in your EDA.
  • Discuss any limitations or caveats related to your analysis.

The next step after conducting Exploratory Data Analysis (EDA) in a data science project is feature engineering. This process involves transforming your features into a format that can be effectively understood and utilized by your model. Feature engineering builds on the insights gained from EDA to enhance the data, ensuring that it is in the best possible form for model training and performance. Let’s explore feature engineering in simple terms.

Feature Engineering.

This is the process of selecting, manipulating, and transforming raw data into features that can be used to build a model. It involves four main steps:

  1. Feature Creation: Create new features from the existing ones, using your domain knowledge or patterns observed in the data. This step can help improve model performance.

  2. Feature Transformation: Convert your features into a representation that is more suitable for your model, so that the model can learn effectively from the data. There are four types of transformation:

     i) Normalization: Rescale values to a common range without changing the shape of the distribution. Min-Max normalization maps data to a bounded range, typically [0, 1]; Z-score normalization centers data at zero mean with unit variance and is not bounded.

     ii) Scaling: Rescale your features to a similar scale so that features with large magnitudes do not dominate the model, using methods like Min-Max Scaling, Standardization, and MaxAbs Scaling.

     iii) Encoding: Convert your categorical features to numerical features using methods like label encoding, one-hot encoding, ordinal encoding, or another encoding that suits the structure of your categorical columns.

     iv) Transformation: Apply mathematical operations, for example a logarithm or square root, to change the distribution of a feature.
    
  3. Feature Extraction: Derive new features from the existing attributes. It is concerned with reducing the number of features in the model, for example by using Principal Component Analysis (PCA).

  4. Feature Selection: Identify and select the most relevant features for further analysis. Use a filter method (evaluate features based on statistical metrics and select the most relevant ones), a wrapper method (use machine learning models to evaluate feature subsets and select the best combination based on model performance), or an embedded method (perform feature selection as part of model training, e.g. regularization techniques).
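
Steps 1 and 2 can be sketched with pandas and NumPy alone. This is an illustrative sketch, not a complete pipeline: the dataset, the derived `bmi` feature, and the choice of which column receives which transformation are all assumptions made for the example. Feature extraction and selection are not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset.
df = pd.DataFrame({
    "height_m": [1.60, 1.75, 1.82, 1.68],
    "weight_kg": [55.0, 80.0, 95.0, 62.0],
    "income": [20_000, 45_000, 250_000, 38_000],
    "city": ["Paris", "Lagos", "Paris", "Tokyo"],
})

# 1. Feature creation: derive a new feature from existing ones.
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2

# 2. Feature transformation.
w = df["weight_kg"]

# Min-Max normalization: maps values to the range [0, 1].
df["weight_minmax"] = (w - w.min()) / (w.max() - w.min())

# Z-score standardization: zero mean, unit variance
# (ddof=0 uses the population standard deviation).
df["weight_z"] = (w - w.mean()) / w.std(ddof=0)

# Log transformation: compresses a right-skewed feature.
# log1p computes log(1 + x), so zero values are handled safely.
df["income_log"] = np.log1p(df["income"])

# One-hot encoding: one indicator column per category.
df = pd.get_dummies(df, columns=["city"], prefix="city")

print(df.round(3))
```

In a real project, the minimum, maximum, mean, and standard deviation would usually be computed on the training split only and then reused on validation and test data, to avoid leaking information.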

Tools Used for Performing EDA

Let's look at the tools we can use to perform our analysis efficiently.

Python libraries

         i)   Pandas: Provides extensive functions for data manipulation and analysis.

         ii)  Matplotlib: Used for creating static, animated, and interactive visualizations.

         iii) Seaborn: Built on top of Matplotlib, it provides a high-level interface for drawing attractive and informative statistical graphics.

         iv)  Plotly: Used for making interactive plots; it offers more sophisticated visualization capabilities.

R Packages

     i)  ggplot2: Used for making complex plots from data in a data frame.

    ii)  dplyr: Helps solve the most common data manipulation challenges.

   iii)  tidyr: Used to tidy your dataset, that is, to store it in a consistent form that matches the semantics of the dataset with the way it is stored.

Conclusion
Exploratory Data Analysis (EDA) forms the foundation of data science, offering insights and guiding informed decision-making. It helps data scientists uncover patterns that are not obvious in raw data and steer projects toward success. Always perform a thorough EDA before modeling, since effective model performance depends on it.


Source: dev.to