By adam-al-rahman

playground-series-s6e1

Overview

This is your new Kedro project, which was generated using kedro 1.1.1.

Take a look at the Kedro documentation to get started.

Rules and guidelines

In order to get the best out of the template:

Don't remove any lines from the .gitignore file we provide
Make sure your results can be reproduced by following a data engineering convention
Don't commit data to your repository
Don't commit any credentials or your local configuration to your repository. Keep all your credentials and local configuration in conf/local/

How to install dependencies

Declare any dependencies in requirements.txt for pip installation.

To install them, run:

pip install -r requirements.txt

How to run your Kedro pipeline

You can run your Kedro project with:

kedro run

How to test your Kedro project

Have a look at the file tests/test_run.py for instructions on how to write your tests. You can run your tests as follows:

pytest

You can configure the coverage threshold in your project's pyproject.toml file under the [tool.coverage.report] section.

Project dependencies

To see and update the dependency requirements for your project use requirements.txt. You can install the project requirements with pip install -r requirements.txt.

Further information about project dependencies

How to work with Kedro and notebooks

Note: Using kedro jupyter or kedro ipython to run your notebook provides these variables in scope: context, session, catalog, and pipelines.

Jupyter, JupyterLab, and IPython are already included in the project requirements by default, so once you have run pip install -r requirements.txt you will not need to take any extra steps before you use them.

Jupyter

To use Jupyter notebooks in your Kedro project, install Jupyter if it is not already present in your environment:

pip install jupyter

After installing Jupyter, you can start a local notebook server:

kedro jupyter notebook

JupyterLab

To use JupyterLab, install it if it is not already present in your environment:

pip install jupyterlab

After installing JupyterLab, you can start it:

kedro jupyter lab

IPython

And if you want to run an IPython session:

kedro ipython

How to ignore notebook output cells in git

To automatically strip out all output cell contents before committing to git, you can use tools like nbstripout. For example, you can add a hook in .git/config with nbstripout --install. This will run nbstripout before anything is committed to git.

Note: Your output cells will be retained locally.

Package your Kedro project

Further information about building project documentation and packaging your project

Project Structure

--------------------------------------------------------------------------------
PART 1: NOTEBOOKS (THE LABORATORY)
--------------------------------------------------------------------------------

Notebooks are for analysis, prototyping, and debugging. They are NOT for production pipelines.

1. notebooks/eda/ ("The Observer")
   -----------------------------------
   • When to use: Immediately after receiving new data.
   • Goal: Understand the physics of the dataset without changing it.
   • Why: You cannot model what you do not understand. Blind modeling leads to bias.
   
   - 01_raw_inspection.ipynb
     • Task: Check data types, count nulls, identify schema errors.
     • Question: "Is this column actually a date, or a string?"
   
   - 02_feature_correlation.ipynb
     • Task: Plot heatmaps, check distributions, look for target leakage.
     • Question: "Does feature A perfectly predict the target because of a glitch?"
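
A minimal sketch of the kind of check 01_raw_inspection.ipynb might start with, using only the Python standard library. The sample CSV, the column names, and the helper names (is_iso_date, inspect_columns) are hypothetical; a real notebook would read a file from data/raw_ingestion/ instead.

```python
import csv
import io
from datetime import datetime

# Hypothetical sample standing in for a file in data/raw_ingestion/.
RAW_CSV = """id,signup_date,score
1,2024-01-05,0.7
2,not available,
3,2024-02-11,0.9
"""


def is_iso_date(value: str) -> bool:
    """Return True if the string parses as YYYY-MM-DD."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False


def inspect_columns(text: str) -> dict:
    """Count empty cells and date-parsable cells per column (read-only)."""
    rows = list(csv.DictReader(io.StringIO(text)))
    report = {}
    for column in rows[0].keys():
        values = [row[column] for row in rows]
        non_empty = [v for v in values if v != ""]
        report[column] = {
            "rows": len(values),
            "nulls": len(values) - len(non_empty),
            "parsed_as_date": sum(is_iso_date(v) for v in non_empty),
        }
    return report


report = inspect_columns(RAW_CSV)
for column, stats in report.items():
    print(column, stats)
```

Here signup_date parses as a date in only two of three rows, which answers the "date or string?" question for this sample: the column needs cleaning before it can be typed as a date.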

2. notebooks/prototyping/ ("The Inventor")
   ------------------------------------------
   • When to use: When developing the logic that will eventually move to production code (src/).
   • Goal: Write messy code here so your pipeline code stays clean.
   • Why: Notebooks allow instant feedback. Debugging a pipeline script is slow; debugging a cell is fast.

   - 01_cleaning_logic.ipynb
     • Task: Develop the regex to clean strings or logic to handle outliers.
     • Output Logic: Defines how 'raw_ingestion' becomes 'processed/01_primary'.

   - 02_feature_ideas.ipynb
     • Task: Experiment with Log transforms, Interaction terms (A*B), and Encodings.
     • Output Logic: Defines how 'processed/01_primary' becomes 'processed/02_feature'.
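
The experiments listed for 02_feature_ideas.ipynb could look roughly like the sketch below. The rows, the column names (price, quantity, gender), and the build_features helper are assumptions made for illustration, not part of this project.

```python
import math

# Hypothetical cleaned rows, standing in for data in processed/01_primary.
primary_rows = [
    {"price": 100.0, "quantity": 2, "gender": "Male"},
    {"price": 0.0, "quantity": 5, "gender": "Female"},
]


def build_features(row: dict, categories: list) -> dict:
    """Turn one human-readable row into numeric features."""
    features = {
        # log1p(x) = log(1 + x), so a price of 0 stays defined.
        "log_price": math.log1p(row["price"]),
        # Interaction term (A*B).
        "price_x_quantity": row["price"] * row["quantity"],
    }
    # One-hot encoding: one 0/1 column per known category.
    for category in categories:
        features[f"gender_{category}"] = 1 if row["gender"] == category else 0
    return features


categories = sorted({row["gender"] for row in primary_rows})
feature_rows = [build_features(row, categories) for row in primary_rows]
```

Once an idea like this proves useful in the notebook, the function would move into the pipeline code under src/.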

   - 03_model_selection.ipynb
     • Task: Quick-and-dirty comparison of algorithms (RandomForest vs XGBoost).
     • Output Logic: Decides which model architecture to save in 'models/artifacts'.

3. notebooks/auditing/ ("The Detective")
   ---------------------------------------
   • When to use: After a model has been trained or deployed.
   • Goal: Understand failures and monitor health.
   • Why: An overall accuracy score (e.g., "90%") can hide the fact that the model fails most or all of the time on a minority group.

   - 01_error_analysis.ipynb
     • Task: Load the trained model and run it on specific edge cases.
     • Question: "Why did we predict 'Pass' for this student who failed?"
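
One way to start an error analysis is to break accuracy down by group. The sketch below uses made-up records (the group names, labels, and the accuracy_by_group helper are hypothetical) to show how a 90% overall score can coexist with a group the model always gets wrong.

```python
from collections import defaultdict

# Hypothetical predictions: (group, true_label, predicted_label).
records = (
    [("majority", "Pass", "Pass")] * 90
    + [("minority", "Fail", "Pass")] * 10
)


def accuracy_by_group(rows):
    """Return the fraction of correct predictions for each group."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, truth, predicted in rows:
        total[group] += 1
        correct[group] += int(truth == predicted)
    return {group: correct[group] / total[group] for group in total}


overall = sum(truth == predicted for _, truth, predicted in records) / len(records)
per_group = accuracy_by_group(records)

print(overall)    # 0.9
print(per_group)  # {'majority': 1.0, 'minority': 0.0}
```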

   - 02_drift_check.ipynb
     • Task: Compare statistical distributions of new incoming data vs. old training data.
     • Question: "Has the definition of 'low income' changed since we trained the model?"
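
A drift check can be as simple as comparing the empirical distributions of one feature. The sketch below computes the two-sample Kolmogorov-Smirnov statistic in plain Python; this is one possible measure, not necessarily the one this project uses, and the income values are invented for illustration.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Largest absolute gap between the two empirical CDFs.

    0.0 means the samples have identical distributions,
    1.0 means they do not overlap at all.
    """
    ref_sorted = sorted(reference)
    cur_sorted = sorted(current)
    gap = 0.0
    # Both CDFs are step functions, so checking every observed value is enough.
    for x in ref_sorted + cur_sorted:
        ref_cdf = bisect_right(ref_sorted, x) / len(ref_sorted)
        cur_cdf = bisect_right(cur_sorted, x) / len(cur_sorted)
        gap = max(gap, abs(ref_cdf - cur_cdf))
    return gap


# Hypothetical income values (in thousands).
training_income = [20, 25, 30, 35]
incoming_income = [30, 35, 40, 45]

drift = ks_statistic(training_income, incoming_income)
print(drift)  # 0.5
```

What value counts as "too much drift" is a project decision; any threshold would need to be chosen and validated on your own data.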

--------------------------------------------------------------------------------
PART 2: DATA (THE WAREHOUSE)
--------------------------------------------------------------------------------
Data is state. It flows from Top (Raw) to Bottom (Reporting) and gets cleaner at every step.

1. data/raw_ingestion/ (Bronze Layer)
   ----------------------------------
   • Content: The immutable original files (CSV, JSON, SQL Dumps).
   • Rule: READ-ONLY. Never edit these files manually (not even in Excel).
   • Why: If you corrupt the source, you lose the ability to prove your work is correct.
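
One way to back up the READ-ONLY rule is to record a checksum for every raw file and compare later. The sketch below is an assumption about how that could be done, not something this project ships; the helper names (sha256_of, build_manifest, changed_files) are hypothetical, and a temporary folder stands in for data/raw_ingestion/.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large raw files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder: Path) -> dict:
    """Map each file's relative path to its SHA-256 checksum."""
    return {
        str(p.relative_to(folder)): sha256_of(p)
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    }


def changed_files(folder: Path, manifest: dict) -> list:
    """List files that were added, removed, or edited since the manifest was built."""
    current = build_manifest(folder)
    names = manifest.keys() | current.keys()
    return sorted(n for n in names if manifest.get(n) != current.get(n))


# Demo on a temporary folder standing in for data/raw_ingestion/.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp)
    (raw / "train.csv").write_text("id,score\n1,0.7\n")
    manifest = build_manifest(raw)
    untouched = changed_files(raw, manifest)

    (raw / "train.csv").write_text("id,score\n1,0.9\n")  # simulated manual edit
    edited = changed_files(raw, manifest)

print(untouched)  # []
print(edited)     # ['train.csv']
```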

2. data/processed/ (The Assembly Line)
   -----------------------------------
   This folder contains the transformed data stages.

   - 01_primary/ (Silver Layer - "Business Truth")
     • Content: Cleaned, typed, and joined data. No machine learning math yet.
     • State: Readable by a human analyst (e.g., 'Gender' is "Male", not '0').
     • Why: You need a clean "Master Table" that can be used for BI dashboards AND machine learning.

   - 02_feature/ (Gold Layer - "Math Ready")
     • Content: Data transformed for math.
     • State: 'Gender' is '0', Prices are Log-Scaled, Text is TF-IDF vectors.
     • Why: Algorithms cannot read strings. This is the bridge between Human Reality and Machine Reality.

   - 03_model_input/ (The Vault - "Leakage Proof")
     • Content: The strict Train/Validation/Test splits (X_train.parquet, y_test.parquet).
     • State: Imputation happens AFTER this split, with fill values computed from the training split only.
     • Why: To guarantee that the Test Set never influences the Training Set (preventing Data Leakage).
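
The split-then-impute order can be sketched as follows. The values, the fixed seed, and the helper names (split, fit_mean, impute) are illustrative assumptions; the point is only that the fill value is learned from the training rows and then reused for the test rows.

```python
import random
from statistics import mean

# Hypothetical feature column with missing values (None).
values = [1.0, 2.0, None, 4.0, 5.0, None, 7.0, 8.0, 9.0, 10.0]


def split(rows, test_size, seed):
    """Shuffle a copy of the rows and hold out the last test_size items."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - test_size
    return shuffled[:cut], shuffled[cut:]


def fit_mean(train_rows):
    """Learn the fill value from the training rows only."""
    return mean(v for v in train_rows if v is not None)


def impute(rows, fill):
    """Replace missing values with a fill value that was learned elsewhere."""
    return [fill if v is None else v for v in rows]


# 1. Split first.
train, test = split(values, test_size=2, seed=42)

# 2. Fit the imputer on the training split only.
fill_value = fit_mean(train)

# 3. Apply the same fill value to both splits.
train_filled = impute(train, fill_value)
test_filled = impute(test, fill_value)
```

Computing the mean over all ten values before splitting would let the held-out rows shape the training data, which is the leakage this folder is meant to prevent.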

3. data/models/ (The Artifacts)
   ----------------------------
   • Content: The outputs of the training process.
   
   - artifacts/
     • Content: Binary files (.pkl, .joblib, .onnx, .pt).
     • Why: This is the "Cake" you ship to production servers (via KitOps).
   
   - predictions/
     • Content: CSVs containing Model Scores and Probabilities.
     • Why: Needed for offline analysis in the 'notebooks/auditing' notebooks.

4. data/reporting/ (The Presentation)
   ----------------------------------
   • Content: PNG plots, Confusion Matrices, HTML dashboards, SHAP value plots.
   • Why: Stakeholders (CEOs, PMs) do not read code or parquet files. They read charts.