Deepak Sahu committed
Commit abf887e · 1 Parent(s): 10b6366

section update

.resources/clean_1.png ADDED

Git LFS Details

  • SHA256: 3dc453350379a7244ba663f6a09b6d4e9a0d31572b788a4f78e0d11b2f35de49
  • Pointer size: 130 Bytes
  • Size of remote file: 33.6 kB
.resources/clean_2.png ADDED

Git LFS Details

  • SHA256: 6a0b4eacd7219ac1eaf1aa85d93e6f951bb4f1f701b99e44a35105a7bddfb3e1
  • Pointer size: 129 Bytes
  • Size of remote file: 9.91 kB
README.md CHANGED
@@ -17,9 +17,19 @@ Try it out: https://huggingface.co/spaces/LunaticMaestro/book-recommender
 
 ![image](.resources/preview.png)
 
+ ## Foreword
+
+ - All images are my own work; the source PowerPoint for them is in the `.resources` folder of this repo.
+
+ - Code documentation follows [Google's Python Style Guide](https://google.github.io/styleguide/pyguide.html)
+
+ - ALL file paths are set as CONST at the beginning of each script, to make the paths easier to reuse during inference & evaluation; hence they are not passed as CLI arguments
+
+ - The prefix `z_` in filenames is just to avoid (human) confusion, during imports, between prebuilt modules and custom ones.
+
 ## Table of Content
 
- > All images are my actual work please source powerpoint of them in `.resources` folder of this repo.
+ >
 
 - [Running Inference Locally](#libraries-execution)
 - [10,000 feet Approach overview](#approach)

@@ -31,6 +41,14 @@ Try it out: https://huggingface.co/spaces/LunaticMaestro/book-recommender
 
 ## Running Inference Locally
 
+ ### Memory Requirements
+
+ The code needs <2 GB RAM to run both of the following models; CPU-only inference works fine.
+
+ - https://huggingface.co/openai-community/gpt2 ~500 MB
+ - https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2 <500 MB
+
+
 ### Libraries
 I used Google Colab with the following extra libraries.
 

@@ -66,23 +84,32 @@ References:
 
 ## Training Steps
 
- **ALL files Paths are at set as CONST in beginning of each script, to make it easier while using the paths while inferencing; hence not passing as CLI arguments**
-
 ### Step 1: Data Clean
 
- I am going to do basic steps like unwanted column removal (the first column of index), missing values removal (drop rows), duplicate rows removal. Output Screenshot attached.
+ What is taken care of:
+ - unwanted column removal (the first column, an index)
+ - missing value removal (drop rows)
+ - duplicate row removal
+
+ What is not taken care of:
+ - stopword removal, stemming/lemmatization or special character removal
+
+ **because the approach is to use causal language modelling (later steps), so it makes no sense to rip apart word meaning with these word-level techniques**
+
+
+ ### Observations from `z_clean_data.ipynb`
 
- I am NOT doing any text pre-processing steps like stopword removal, stemming/lemmatization or special character removal because my approach is to use the casual language modelling (later steps) hence makes no sense to rip apart the word meaning via these word-based techniques.
+ - The same title can correspond to different categories
 
- A little tinker in around with the dataset found that some titles can belong to multiple categories. (*this code I ran separately, is not part of any script*)
+ ![image](.resources/clean_1.png)
 
- ![image](https://github.com/user-attachments/assets/cdf9141e-21f9-481a-8b09-913a0006db87)
+ - 1230 unique titles in total.
 
- A descriptive analysis shows that there are just 1230 unique titles. (*this code I ran separately, is not part of any script*)
+ ![image](.resources/clean_2.png)
 
- ![image](https://github.com/user-attachments/assets/072b4ed7-7a4d-48b2-a93c-7b08fc5bee45)
+ **Action**: We are not going to remove the rows that show the same titles (& summaries) with different categories, but rather create a separate file of unique titles.
 
- We are not going to remove them rows that shows same titles (& summaries) with different categories but rather create a separate file for unique titles.
+ **RUN**:
 
 ```SH
 python z_clean_data.py
__pycache__/z_utils.cpython-310.pyc ADDED
Binary file (1.22 kB).
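The Memory Requirements note added to the README above lists GPT-2 (~500 MB) and all-MiniLM-L6-v2 (<500 MB) and claims CPU-only inference under 2 GB. Below is a rough sketch of how one might sanity-check that claim locally; only the two model IDs come from the README, the helper function and the fp32 estimate are illustrative assumptions.

```python
# Rough, illustrative footprint check for the two models named in the README.
# Assumes `transformers` and `sentence-transformers` are installed; CPU only.
from transformers import AutoModelForCausalLM
from sentence_transformers import SentenceTransformer

def fp32_size_mb(model) -> float:
    """Approximate in-memory size of a model's parameters at float32."""
    n_params = sum(p.numel() for p in model.parameters())
    return n_params * 4 / 1024**2  # 4 bytes per float32 parameter

gpt2 = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
minilm = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

print(f"gpt2:   ~{fp32_size_mb(gpt2):.0f} MB of parameters")
print(f"MiniLM: ~{fp32_size_mb(minilm):.0f} MB of parameters")
```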
 
app.py CHANGED
@@ -16,7 +16,7 @@ GRADIO_TITLE = "Content Based Book Recommender"
 GRADIO_DESCRIPTION = '''
 This is a [HyDE](https://arxiv.org/abs/2212.10496) based searching mechanism that generates random summaries from your input book title and matches books whose summaries are similar to the generated ones. The books used for search come from the [Kaggle Dataset: arpansri/books-summary](https://www.kaggle.com/datasets/arpansri/books-summary)
 
- **Should take ~ 15s to 30s** for inferencing.
+ **Should take ~15s to 30s** for inference. If it takes longer, the HF Space is cold starting, which lasts ~300s and **decreases to ~15s once you have made sufficiently many calls (~10 to 15)**
 '''
 
 # Caching mechanism for gradio
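The GRADIO_DESCRIPTION above summarizes the HyDE idea: generate hypothetical summaries for the query title, then retrieve books whose real summaries are closest. Here is a minimal sketch of that retrieval pattern using the two models listed in the README; the function and variable names are illustrative, not the app's actual code.

```python
# Illustrative HyDE-style retrieval sketch; not the repo's actual app.py implementation.
# Model IDs are the two listed in the README's Memory Requirements section.
from transformers import pipeline
from sentence_transformers import SentenceTransformer, util

generator = pipeline("text-generation", model="openai-community/gpt2")
encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def recommend(title: str, book_summaries: list[str], top_k: int = 5):
    """Return (index, score) pairs of the book summaries closest to the query title."""
    # 1. HyDE step: sample a few hypothetical summaries for the query title.
    outputs = generator(
        f"Book title: {title}. Summary:",
        max_new_tokens=60,
        num_return_sequences=3,
        do_sample=True,
    )
    hypothetical = [o["generated_text"] for o in outputs]

    # 2. Embed the hypothetical summaries and the real ones.
    query_emb = encoder.encode(hypothetical, convert_to_tensor=True).mean(dim=0)
    corpus_emb = encoder.encode(book_summaries, convert_to_tensor=True)

    # 3. Rank real books by cosine similarity to the averaged hypothetical embedding.
    hits = util.semantic_search(query_emb.unsqueeze(0), corpus_emb, top_k=top_k)[0]
    return [(hit["corpus_id"], hit["score"]) for hit in hits]
```

Averaging the embeddings of several sampled summaries is one common way to stabilize the HyDE query vector; the app may implement this step differently.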
books_summary.csv CHANGED
The diff for this file is too large to render. See raw diff
 
z_clean_data.ipynb ADDED
File without changes
z_clean_data.py CHANGED
@@ -31,7 +31,7 @@ print(f"\n\nCleaned Shape: {books_df.shape}")
 
 # Saving these cleaned DF
 print("Storing cleaned as (this includes same titles with diff cats: "+CLEAN_DF)
- books_df.to_csv(ORIGNAL_DF, index=False)
+ books_df.to_csv(CLEAN_DF, index=False)
 
 # ==== NOW to store the unique titles ====
 books_df = books_df[["book_name", "summaries"]]
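The one-line fix above writes the cleaned frame to CLEAN_DF instead of overwriting the original file. For context, a hedged sketch of the cleaning flow this hunk belongs to, following the README's convention of paths as CONSTs at the top of the script; only the `to_csv(CLEAN_DF, index=False)` call, the print lines, and the `book_name`/`summaries` columns are confirmed by the diff, while the CONST values and the exact cleaning calls are assumptions.

```python
# Hypothetical reconstruction of z_clean_data.py's flow around the fixed line.
# Only to_csv(CLEAN_DF, index=False) and the book_name/summaries columns are
# confirmed by the diff; the CONST values and cleaning calls are assumptions.
import pandas as pd

ORIGNAL_DF = "books_summary.csv"                       # input (present in this repo)
CLEAN_DF = "clean_books_summary.csv"                   # assumed output name
UNIQUE_TITLES_DF = "unique_titles_books_summary.csv"   # assumed output name

books_df = pd.read_csv(ORIGNAL_DF)

# Cleaning steps described in the README: drop the index column, missing values, duplicates.
books_df = books_df.loc[:, ~books_df.columns.str.startswith("Unnamed")]
books_df = books_df.dropna().drop_duplicates()
print(f"\n\nCleaned Shape: {books_df.shape}")

# The fix in this commit: save the cleaned frame to CLEAN_DF, not back over the original file.
print("Storing cleaned as (this includes same titles with diff cats: " + CLEAN_DF)
books_df.to_csv(CLEAN_DF, index=False)

# ==== NOW to store the unique titles ====
books_df = books_df[["book_name", "summaries"]]
books_df = books_df.drop_duplicates(subset=["book_name"])
books_df.to_csv(UNIQUE_TITLES_DF, index=False)
```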