tabular reader #47

Merged 9 commits on Jun 5, 2024
Changes from 4 commits
244 changes: 130 additions & 114 deletions poetry.lock

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions pyproject.toml
Original file line number Diff line number Diff line change
@@ -69,6 +69,7 @@ openinference-instrumentation = "^0.1.7"
llama-index-llms-huggingface = "^0.2.0"
pytest-asyncio = "^0.23.7"
pytest-cov = "^5.0.0"
xlrd = "^2.0.1"

[tool.poetry.scripts]
pai_rag = "pai_rag.main:main"
82 changes: 82 additions & 0 deletions src/pai_rag/docs/tabular_doc.md
@@ -0,0 +1,82 @@
# Tabular processing with PAI-RAG

## PaiCSVReader

PaiCSVReader(concat_rows=True, csv_config={})

### Parameters:

**concat_rows:** _bool, default=True._
Whether to concatenate rows into one document.

**csv_config:** _dict, default={}._
The configuration of the csv reader.
Set to an empty dict by default, in which case the reader sniffs the file for a header row on its own.

#### One important parameter:

**header:** _None or int, list of int, default 0._
Row (0-indexed) to use for the column labels of the parsed DataFrame. If a list of integers is passed those row
positions will be combined into a MultiIndex. Use None if there is no header.

### Functions:

load_data(file: Path, extra_info: Optional[Dict] = None)
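As an illustrative sketch (the sample data and variable names here are hypothetical, not from this PR), this is roughly how PaiCSVReader turns a CSV with a sniffed header row into one dict-shaped string per data row; header labels are kept as tuples so that multi-row headers can combine into a single key:

```python
import csv
import io

# Hypothetical CSV content; PaiCSVReader reads this from a file path instead.
raw = "name,age\nalice,30\nbob,25\n"

# Sniff whether the first row looks like a header, as the reader does.
has_header = csv.Sniffer().has_header(raw)
rows = list(csv.reader(io.StringIO(raw)))

if has_header:
    # Header labels become 1-tuples; multi-row headers would zip into n-tuples.
    headers = [tuple([label]) for label in rows[0]]
    text_list = [str(dict(zip(headers, line))) for line in rows[1:]]
else:
    # Without a header, each row is simply comma-joined.
    text_list = [", ".join(row) for row in rows]
```

With `concat_rows=True` (the default) these strings are joined into a single Document.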

## PaiPandasCSVReader

PaiPandasCSVReader(concat_rows=True, row_joiner="\n", pandas_config={})

### Parameters:

**concat_rows:** _bool, default=True._
Whether to concatenate rows into one document.

**row_joiner:** _str, default="\n"._
The separator used to join rows.

**pandas_config:** _dict, default={}._
The configuration of pandas.read_csv.
Refer to https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html for more information.
Set to empty dict by default; pandas will then try to figure out the separators, table head, etc. on its own.

#### One important parameter:

**header:** _None or int, list of int, default 0._
Row (0-indexed) to use for the column labels of the parsed DataFrame. If a list of integers is passed those row
positions will be combined into a MultiIndex. Use None if there is no header.

### Functions:

load_data(file: Path, extra_info: Optional[Dict] = None, fs: Optional[AbstractFileSystem] = None)
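For example (hypothetical data, not from this PR), passing a two-row header through pandas_config combines the header rows into a MultiIndex, as the header parameter described above:

```python
import io

import pandas as pd

# Hypothetical CSV with a two-row header; the reader would forward
# pandas_config={"header": [0, 1]} to pd.read_csv.
raw = "a,a,b\nx,y,z\n1,2,3\n"
df = pd.read_csv(io.StringIO(raw), header=[0, 1])

# The two header rows are combined into tuple column labels.
print(list(df.columns))  # [('a', 'x'), ('a', 'y'), ('b', 'z')]
```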

## PaiPandasExcelReader

PaiPandasExcelReader(concat_rows=True, row_joiner="\n", pandas_config={})

### Parameters:

**concat_rows:** _bool, default=True._
Whether to concatenate rows into one document.

**row_joiner:** _str, default="\n"._
The separator used to join rows.

**pandas_config:** _dict, default={}._
The configuration of pandas.read_excel.
Refer to https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html for more information.
Set to empty dict by default; pandas will then try to figure out the table head, etc. on its own.

#### One important parameter:

**header:** _None or int, list of int, default 0._
Row (0-indexed) to use for the column labels of the parsed DataFrame. If a list of integers is passed those row
positions will be combined into a MultiIndex. Use None if there is no header.

### Functions:

load_data(file: Path, extra_info: Optional[Dict] = None, fs: Optional[AbstractFileSystem] = None)
Only the first sheet of the workbook is processed.
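As a hypothetical sketch (the sample frame is invented here, standing in for the first sheet loaded by pd.read_excel), this shows what load_data produces: each row becomes a dict-shaped string, and with concat_rows=True the strings are joined with row_joiner into one document:

```python
import pandas as pd

# Hypothetical frame standing in for pd.read_excel(file, sheet_name=0, ...).
df = pd.DataFrame({"item": ["pen", "ink"], "qty": [2, 5]})

# Row serialization used by the reader: one dict-shaped string per row.
text_list = df.apply(
    lambda row: str(dict(zip(df.columns, row.astype(str)))), axis=1
).tolist()

# With concat_rows=True the rows are joined with row_joiner (default "\n").
doc_text = "\n".join(text_list)
```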
169 changes: 169 additions & 0 deletions src/pai_rag/integrations/readers/pai_csv_reader.py
@@ -0,0 +1,169 @@
"""Tabular parser-CSV parser.

Contains parsers for tabular data files.

"""

from pathlib import Path
from typing import Any, Dict, List, Optional
from fsspec import AbstractFileSystem

import pandas as pd
from llama_index.core.readers.base import BaseReader
from llama_index.core.schema import Document


class PaiCSVReader(BaseReader):
"""CSV parser.

Args:
concat_rows (bool): whether to concatenate all rows into one document.
If set to False, a Document will be created for each row.
True by default.
csv_config (dict): Options for the reader. Set to empty dict by default.
one important parameter:
"header": None or int, list of int, default 0.
Row (0-indexed) to use for the column labels of the parsed DataFrame. If a list of integers is passed those row
positions will be combined into a MultiIndex. Use None if there is no header.

"""

def __init__(
self, *args: Any, concat_rows: bool = True, csv_config: dict = {}, **kwargs: Any
) -> None:
"""Init params."""
super().__init__(*args, **kwargs)
self._concat_rows = concat_rows
self._csv_config = csv_config

def load_data(
self, file: Path, extra_info: Optional[Dict] = None
) -> List[Document]:
"""Parse csv file.

Returns:
Union[str, List[str]]: a string or a List of strings.

"""
import csv  # stdlib module, always available
text_list = []
headers = []
data_lines = []
data_line_start_index = 1
if (
"header" in self._csv_config
and self._csv_config["header"] is not None
and isinstance(self._csv_config["header"], list)
):
data_line_start_index = max(self._csv_config["header"]) + 1
elif (
"header" in self._csv_config
and self._csv_config["header"] is not None
and isinstance(self._csv_config["header"], int)
):
data_line_start_index = self._csv_config["header"] + 1
self._csv_config["header"] = [self._csv_config["header"]]

with open(file) as fp:
has_header = csv.Sniffer().has_header(fp.read(2048))
fp.seek(0)

if "header" not in self._csv_config and has_header:
self._csv_config["header"] = [0]
elif "header" not in self._csv_config and not has_header:
self._csv_config["header"] = None

csv_reader = csv.reader(fp)

if self._csv_config["header"] is None:
for row in csv_reader:
text_list.append(", ".join(row))
else:
for i, row in enumerate(csv_reader):
if i in self._csv_config["header"]:
headers.append(row)
elif i >= data_line_start_index:
data_lines.append(row)
headers = [tuple(group) for group in zip(*headers)]
for line in data_lines:
if len(line) == len(headers):
data_entry = str(dict(zip(headers, line)))
text_list.append(data_entry)

metadata = {"filename": file.name, "extension": file.suffix}
if extra_info:
metadata = {**metadata, **extra_info}

if self._concat_rows:
return [Document(text="\n".join(text_list), metadata=metadata)]
else:
return [Document(text=text, metadata=metadata) for text in text_list]


class PaiPandasCSVReader(BaseReader):
r"""Pandas-based CSV parser.

Parses CSVs using the separator detection from the Pandas `read_csv` function.
If special parameters are required, use the `pandas_config` dict.

Args:
concat_rows (bool): whether to concatenate all rows into one document.
If set to False, a Document will be created for each row.
True by default.

row_joiner (str): Separator to use for joining each row.
Only used when `concat_rows=True`.
Set to "\n" by default.

pandas_config (dict): Options for the `pandas.read_csv` function call.
Refer to https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
for more information.
Set to empty dict by default; pandas will then try to figure
out the separators, table head, etc. on its own.

"""

def __init__(
self,
*args: Any,
concat_rows: bool = True,
row_joiner: str = "\n",
pandas_config: dict = {},
**kwargs: Any
) -> None:
"""Init params."""
super().__init__(*args, **kwargs)
self._concat_rows = concat_rows
self._row_joiner = row_joiner
self._pandas_config = pandas_config

def load_data(
self,
file: Path,
extra_info: Optional[Dict] = None,
fs: Optional[AbstractFileSystem] = None,
) -> List[Document]:
"""Parse csv file."""
if fs:
with fs.open(file) as f:
df = pd.read_csv(f, **self._pandas_config)
else:
df = pd.read_csv(file, **self._pandas_config)

text_list = df.apply(
lambda row: str(dict(zip(df.columns, row.astype(str)))), axis=1
).tolist()

if self._concat_rows:
return [
Document(
text=(self._row_joiner).join(text_list), metadata=extra_info or {}
)
]
else:
return [
Document(text=text, metadata=extra_info or {}) for text in text_list
]
77 changes: 77 additions & 0 deletions src/pai_rag/integrations/readers/pai_excel_reader.py
@@ -0,0 +1,77 @@
"""Tabular parser-Excel parser.

Contains parsers for tabular data files.

"""

from pathlib import Path
from typing import Any, Dict, List, Optional
from fsspec import AbstractFileSystem

import pandas as pd
from llama_index.core.readers.base import BaseReader
from llama_index.core.schema import Document


class PaiPandasExcelReader(BaseReader):
r"""Pandas-based Excel parser.


Args:
concat_rows (bool): whether to concatenate all rows into one document.
If set to False, a Document will be created for each row.
True by default.

row_joiner (str): Separator to use for joining each row.
Only used when `concat_rows=True`.
Set to "\n" by default.

pandas_config (dict): Options for the `pandas.read_excel` function call.
Refer to https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html
for more information.
Set to empty dict by default; pandas will then try to figure
out the table head, etc. on its own.

"""

def __init__(
self,
*args: Any,
concat_rows: bool = True,
row_joiner: str = "\n",
pandas_config: dict = {},
**kwargs: Any
) -> None:
"""Init params."""
super().__init__(*args, **kwargs)
self._concat_rows = concat_rows
self._row_joiner = row_joiner
self._pandas_config = pandas_config

def load_data(
self,
file: Path,
extra_info: Optional[Dict] = None,
fs: Optional[AbstractFileSystem] = None,
) -> List[Document]:
"""Parse Excel file. only process the first sheet"""
if fs:
with fs.open(file) as f:
df = pd.read_excel(f, sheet_name=0, **self._pandas_config)
else:
df = pd.read_excel(file, sheet_name=0, **self._pandas_config)

text_list = df.apply(
lambda row: str(dict(zip(df.columns, row.astype(str)))), axis=1
).tolist()

if self._concat_rows:
return [
Document(
text=(self._row_joiner).join(text_list), metadata=extra_info or {}
)
]
else:
return [
Document(text=text, metadata=extra_info or {}) for text in text_list
]
11 changes: 11 additions & 0 deletions src/pai_rag/modules/datareader/datareader_factory.py
@@ -4,6 +4,8 @@
from pai_rag.integrations.readers.pai_pdf_reader import PaiPDFReader
from pai_rag.integrations.readers.llama_parse_reader import LlamaParseDirectoryReader
from pai_rag.integrations.readers.html.html_reader import HtmlReader
from pai_rag.integrations.readers.pai_csv_reader import PaiPandasCSVReader
from pai_rag.integrations.readers.pai_excel_reader import PaiPandasExcelReader
from llama_index.readers.database import DatabaseReader
from llama_index.core import SimpleDirectoryReader
import logging
@@ -25,6 +27,15 @@ def _create_new_instance(self, new_params: Dict[str, Any]):
enable_image_ocr=self.reader_config.get("enable_image_ocr", False),
model_dir=self.reader_config.get("easyocr_model_dir", None),
),
".csv": PaiPandasCSVReader(
concat_rows=self.reader_config.get("concat_rows", False),
),
".xlsx": PaiPandasExcelReader(
concat_rows=self.reader_config.get("concat_rows", False),
),
".xls": PaiPandasExcelReader(
concat_rows=self.reader_config.get("concat_rows", False),
),
}
return self
