This repository contains a Python script for efficiently extracting machine-readable file URLs corresponding to Anthem's Preferred Provider Organization (PPO) health plans in New York State. The script handles extremely large JSON files by leveraging the `ijson` library for streaming parsing, which minimizes memory usage and improves performance.
- Python 3.x
- `ijson` library
- Clone the repository:

  ```bash
  git clone https://github.com/asharm0662/serif_json-url-extractor.git
  ```

- Install the required Python package:

  ```bash
  cd <location to repo>
  pip install -r requirements.txt
  ```
- Usage:

  ```bash
  python extract_urls.py <input_file_path> <output_file_path>
  ```

  For example:

  ```bash
  python extract_urls.py large_input.json output_urls.csv
  ```
The script parses a large JSON file to find URLs that match specific criteria: entries describing Anthem's PPO plans located in New York. The matching URLs are written to the specified output file.
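As an illustration only, the matching test might look like the sketch below; the helper name and the exact substrings checked are assumptions, not the script's verbatim logic:

```python
# Hypothetical predicate: the keyword checks are assumptions about how
# Anthem PPO / New York entries are described in the file.
def matches_anthem_ppo_ny(description: str) -> bool:
    """True if a plan description looks like an Anthem PPO plan in New York."""
    text = description.lower()
    return "anthem" in text and "ppo" in text and "new york" in text
```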
- Input Handling: The script uses `argparse` to take input and output file paths from the command line, allowing flexibility in file management.
- JSON Parsing: Utilizes `ijson` for incremental parsing of JSON data, enabling the script to handle large files without loading the entire file into memory.
- Data Processing: Iterates through nested JSON structures to extract URLs based on matching criteria within descriptions (a condensed sketch of this flow appears after this list).
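Putting these pieces together, a minimal sketch of the overall flow might look like the following. It assumes the input is a top-level JSON array of plan objects with `description` and `url` fields; the actual `ijson` prefix and field names depend on the real file layout:

```python
import argparse

import ijson


def main() -> None:
    parser = argparse.ArgumentParser(description="Extract matching plan URLs.")
    parser.add_argument("input_file")
    parser.add_argument("output_file")
    args = parser.parse_args()

    with open(args.input_file, "rb") as src, open(args.output_file, "w") as dst:
        # ijson.items() yields one element at a time, so only the current
        # object is held in memory regardless of total file size. The
        # "item" prefix assumes a top-level JSON array.
        for entry in ijson.items(src, "item"):
            text = str(entry.get("description", "")).lower()
            if "anthem" in text and "ppo" in text and "new york" in text:
                dst.write(str(entry.get("url", "")) + "\n")


if __name__ == "__main__":
    main()
```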
- Development Time: Approximately 90 minutes.
- Execution Time: About 3 minutes for parsing and writing URLs from the large JSON file.
- Memory Efficiency: By using `ijson` for streaming parsing, the script maintains low memory usage, which is crucial for processing files that exceed the memory capacity of most standard systems. This approach avoids the out-of-memory crashes commonly associated with loading large files entirely into memory.
- Computational Speed: While `ijson` reduces memory load, it processes the file more slowly than loading the entire JSON into memory (e.g., with the standard `json.load`). Accepting this trade-off is what makes very large datasets feasible to process on resource-limited systems; the sketch below contrasts the two approaches.
- Scalability: The script is designed to scale with file size, maintaining performance across varying file sizes without additional memory-tuning adjustments.
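The memory/speed trade-off is easiest to see side by side. This sketch assumes a top-level JSON array in `large_input.json`:

```python
import json

import ijson

# Eager parsing: fast, but the whole document must fit in RAM at once,
# so multi-gigabyte files can exhaust memory and crash the process.
with open("large_input.json", "rb") as f:
    data = json.load(f)

# Streaming parsing: slower overall, but memory use stays flat because
# only one element is materialized at a time.
with open("large_input.json", "rb") as f:
    for entry in ijson.items(f, "item"):
        pass  # process each entry, then let it be garbage-collected
```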
- Data Complexity: The script assumes a specific JSON structure and may require adjustments if the data format varies.
- Error Handling: Error handling for unexpected data formats or read/write failures is currently limited and could be expanded in future versions for robustness; one possible approach is sketched below.
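One way such robustness could be added, sketched here with a hypothetical wrapper function (`ijson.JSONError` is the library's base parsing exception):

```python
import sys

import ijson


def extract_urls_safely(input_path: str, output_path: str) -> None:
    """Hypothetical wrapper: fail with a clear message instead of a traceback."""
    try:
        with open(input_path, "rb") as src, open(output_path, "w") as dst:
            for entry in ijson.items(src, "item"):  # assumes a top-level array
                url = entry.get("url")
                if url:
                    dst.write(str(url) + "\n")
    except FileNotFoundError:
        sys.exit(f"Input file not found: {input_path}")
    except ijson.JSONError as exc:
        sys.exit(f"Malformed JSON in {input_path}: {exc}")
    except OSError as exc:
        sys.exit(f"Read/write error: {exc}")
```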