Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow for subdirectory structure within data directory #569

Merged
merged 5 commits into from
Aug 31, 2023

Conversation

PfisterAn
Copy link
Contributor

  • recursively read the data directory
    • @TODO/SUGGEST - may be worth to think about using the sub directory information for enriching the index and/or use it for access control (e.g., subdirectory names may be named like the Azure AD Group for access control)

Purpose

  • Ability to bring subfolder structure into data directory for better arranging the data, at least on the directory level for now

Does this introduce a breaking change?

[ ] Yes
[ x] No

Pull Request Type

What kind of change does this Pull Request introduce?

[ ] Bugfix
[ x] Feature
[ ] Code style update (formatting, local variables)
[ ] Refactoring (no functional changes, no api changes)
[ ] Documentation content changes
[ ] Other... Please describe:

How to Test

  • Run prepdocs on the data directory where content is arranged within subdirectories
git clone [repo-address]
cd [repo-name]
git checkout [branch-name]
npm install
  • Test the code

What to Check

Verify that the following are valid

  • ...

Other Information

- recursively read the data directory
  - @TODO/SUGGEST - may be worth to think about using the sub directory information for enriching the index and/or use it for access control (e.g., subdirectory names may be named like the Azure AD Group for access control)
scripts/prepdocs.py Outdated Show resolved Hide resolved
scripts/prepdocs.py Outdated Show resolved Hide resolved
scripts/prepdocs.py Outdated Show resolved Hide resolved
@pamelafox
Copy link
Collaborator

Thanks, this seems useful!

Copy link
Collaborator

@pamelafox pamelafox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check the CI results and see CONTRIBUTING.md for instructions on running linters locally.

* Convert comment for read_files function to DocString
* Handing over `use_vectors` variable in the recursive call
- Removed out-of-reach condition within read_files function
scripts/prepdocs.py Outdated Show resolved Hide resolved
scripts/prepdocs.py Outdated Show resolved Hide resolved
if args.remove:
remove_blobs(filename)
remove_from_index(filename)
elif args.removeall:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its in the main body, it was already there before, thats why you dont see it in the Review.

if args.removeall:
remove_blobs(None)
remove_from_index(None)
else:
if not args.remove:
create_search_index()
print("Processing files...")
read_files(args.files, use_vectors)

@pamelafox
Copy link
Collaborator

CI/CD is still failing - https://github.com/Azure-Samples/azure-search-openai-demo/actions/runs/6037247433/job/16381542344?pr=569
Please see CONTRIBUTING for details on running linters

* Fixing syntax error - filename isn't global anymore due to the recursive function introduced. Have to be explicitly handed over to the split_text function
@PfisterAn
Copy link
Contributor Author

Missed to pass the filename towards the split_text function. split_text was setup to utilise the global variable filename, which doesnt exist anymore after introducing the read_files function.

@pamelafox pamelafox merged commit 886f445 into Azure-Samples:main Aug 31, 2023
6 checks passed
@pamelafox
Copy link
Collaborator

Merged, thanks for the changes!

HughRunyan pushed a commit to RMI/RMI_chatbot that referenced this pull request Mar 26, 2024
…#569)

* Allow for subdirectory structure within data directory

- recursively read the data directory
  - @TODO/SUGGEST - may be worth to think about using the sub directory information for enriching the index and/or use it for access control (e.g., subdirectory names may be named like the Azure AD Group for access control)

* Update prepdocs.py

* Convert comment for read_files function to DocString
* Handing over `use_vectors` variable in the recursive call

* Update prepdocs.py

- Removed out-of-reach condition within read_files function

* Update prepdocs.py

* Update prepdocs.py

* Fixing syntax error - filename isn't global anymore due to the recursive function introduced. Have to be explicitly handed over to the split_text function
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants