This repository has been archived by the owner on Sep 17, 2024. It is now read-only.

New dataset: A15 Employees, farmholdings, utilized agricultural area and livestock on level 1 of classification by canton #37

Open
cmdoret opened this issue Jul 27, 2023 · 5 comments
Collaborator

cmdoret commented Jul 27, 2023

Proposal to include dataset: Employees, farmholdings, utilized agricultural area and livestock on level 1 of classification by canton

Dataset properties

  • URL: https://www.bfs.admin.ch/asset/en/px-x-0702000000_101
  • format: px
  • size: 83MB
  • dimensions: area type (plain, mountain, hill, ...), canton, year, size class (hectares range), system (organic, conventional, ...), form (part time, full time, ...)
  • units: amount of (employees, animals), surface area (hectares)
  • lang: en

Additional notes

  • The dataset combines indicators representing different things (cattle, employees, area). These need to be spread across different columns.
  • The meaning of the area types (mountain 1/2/3/4) is not clear, so we will need to do one of:
    • recode with more descriptive names
    • combine mountain categories
    • omit area type factor
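
A minimal sketch of what the two notes above could look like in base R, assuming the parsed data arrives in long format with one row per indicator. The column names (`canton`, `area_type`, `indicator`, `value`) and all values are made up for illustration and are not taken from the actual px file. It shows the "combine mountain categories" option followed by spreading indicators into columns:

```r
# Hypothetical long-format excerpt: names and values are illustrative only.
long <- data.frame(
  canton    = rep(c("ZH", "ZH", "BE"), each = 3),
  area_type = rep(c("Mountain 1", "Mountain 2", "Plain"), each = 3),
  indicator = rep(c("employees", "cattle", "area_ha"), times = 3),
  value     = c(100, 250, 1200.5, 50, 150, 800, 80, 400, 2300),
  stringsAsFactors = FALSE
)

# Option "combine mountain categories": collapse "Mountain 1".."Mountain 4"
# into a single label, then sum the values that now share the same keys.
long$area_type <- sub("^Mountain [1-4]$", "Mountain", long$area_type)
long <- aggregate(value ~ canton + area_type + indicator, data = long, FUN = sum)

# Spread the indicators so that each one gets its own column.
wide <- reshape(
  long,
  idvar = c("canton", "area_type"),
  timevar = "indicator",
  direction = "wide"
)

# reshape() prefixes the new columns with "value."; strip that prefix.
names(wide) <- sub("^value\\.", "", names(wide))
rownames(wide) <- NULL
print(wide)
```

Summing is only appropriate here because the toy indicators are counts and areas; any indicator that is a rate or a label would need a different rule.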

Questions

cmdoret added the "dataset" label (Proposal for a new dataset) on Jul 27, 2023
Collaborator Author

cmdoret commented Jul 28, 2023

The dataset is "only" 83MB, but I cannot seem to download it, even when setting it to "large". It is still waiting after 15min.

cmdoret self-assigned this on Jul 28, 2023
Collaborator

sabinem commented Aug 2, 2023

@cmdoret For that I think we would need another option, xlarge, for ds$size. In that case pxRRead works better on downloaded files, so the file needs to be downloaded first, then parsed, and then the download should be removed again.

Collaborator Author

cmdoret commented Aug 4, 2023

I tried implementing an xlarge option where the file is first downloaded and then parsed:

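# Download the px file to a temporary location, then parse the local copy.
tmp <- tempfile(fileext = ".px")
download.file(ds$read_path, tmp)
df <- pxRRead::scan_px_file(tmp,
  locale = ds$lang,
  encoding = ds$encoding
)
ds$data <- df$dataframe
# Remove the temporary download once it has been parsed.
unlink(tmp)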

But pxRRead::scan_px_file hangs for 10+ minutes at:

INFO [2023-08-04 11:15:54] unsupported keyword detected DATASYMBOL5[en]
INFO [2023-08-04 11:15:54] unsupported keyword detected DATASYMBOL6
INFO [2023-08-04 11:15:54] unsupported keyword detected DATASYMBOL6[fr]
INFO [2023-08-04 11:15:54] unsupported keyword detected DATASYMBOL6[en]

After investigation, it appears the culprit is pxRRead::parse_px_lines. The file contains ~1M lines and parsing them is pretty slow. I have opened an issue on pxRRead about this: sdsc-ordes/pxRRead#20

Collaborator Author

cmdoret commented Aug 4, 2023

Issue addressed upstream in sdsc-ordes/pxRRead#21. The parser can now accommodate huge files: this dataset is parsed in under 1 minute instead of 3 h.

Collaborator Author

cmdoret commented Oct 3, 2023

@nooralahzadeh is the structure OK for you?
Metadata and queries are at https://github.com/statistikZH/statbotData/tree/main/pipelines/A15

| year | farmholding_system | farmholdings | employees_total | full_time_employees_75_percent_or_more | part_time_employees_50_75_percent | part_time_employees_2_less_than_50_percent | employees_men | employees_women | employees_women_manager_label | employees_swiss | employees_foreign_nationals | family_employees | beef_cattle_and_cows_farm | horse_and_other_equine_farm | sheep_farm | goat_farm | pig_farms | poultry_farm | farms_with_other_animals | utilised_agricultural_area_in_hectares | arable_land_in_hectares | grassland_in_hectares | permanent_crops_in_hectares | other_utilised_agricultural_area_in_hectares | livestock_beef_cattle_and_cows | livestock_horses_and_other_equines | livestock_sheep | livestock_goats | livestock_pigs | livestock_poultry | livestock_other_animals | spatialunit_uid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2013 | Organic farming | 346 | 947 | 331 | 195 | 421 | 614 | 333 | 40 | 809 | 138 | 724 | 164 | 106 | 145 | 56 | 15 | 36 | 31 | 7252.4681 | 308.6694 | 6444.2811 | 486.4801 | 13.0375 | 5161 | 680 | 15332 | 1793 | 124 | 1973 | 314 | 23_A.ADM1 |
| 2011 | Farmholding system - total | 926 | 2369 | 1411 | 377 | 581 | 1626 | 743 | 43 | 2186 | 183 | 1900 | 712 | 255 | 86 | 73 | 49 | 81 | 49 | 32012.902 | 4207.36 | 27128.9113 | 597.5507 | 79.08 | 42151 | 1813 | 2818 | 580 | 7232 | 78785 | 963 | 24_A.ADM1 |
| 2021 | Not defined | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5_A.ADM1 |
| 2012 | Not defined | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 23_A.ADM1 |
| 2022 | Organic farming | 193 | 530 | 190 | 145 | 195 | 328 | 202 | 15 | 530 | 0 | 490 | 175 | 21 | 21 | 32 | 4 | 47 | 10 | 2796.25 | 6.92 | 2745.36 | 1.62 | 42.35 | 6151 | 104 | 471 | 331 | 142 | 20626 | 101 | 6_A.ADM1 |
| 2003 | Organic farming | 360 | 1336 | 703 | 226 | 407 | 844 | 492 | 0 | 1129 | 207 | 859 | 278 | 115 | 80 | 66 | 48 | 201 | 63 | 7019.6 | 1327.49 | 5382.56 | 55.93 | 253.62 | 9625 | 627 | 3370 | 455 | 852 | 34682 | 734 | 1_A.ADM1 |
| 2020 | Farmholding system - total | 1324 | 3874 | 1608 | 786 | 1480 | 2438 | 1436 | 114 | 3581 | 293 | 3088 | 873 | 414 | 200 | 123 | 134 | 369 | 155 | 31463.4045 | 10314.7904 | 20784.8522 | 167.3145 | 196.4474 | 40936 | 3351 | 7081 | 1311 | 25912 | 190494 | 1811 | 11_A.ADM1 |
| 2006 | Organic farming | 20 | 82 | 42 | 12 | 28 | 46 | 36 | 0 | 75 | 7 | 56 | 16 | 3 | 5 | 2 | 4 | 10 | 4 | 528.32 | 193.92 | 322.72 | 8.88 | 2.8 | 685 | 5 | 71 | 6 | 201 | 5417 | 8 | 14_A.ADM1 |
| 2011 | Farmholding system - total | 2866 | 8936 | 4463 | 1619 | 2854 | 5591 | 3345 | 101 | 7827 | 1109 | 6544 | 1757 | 513 | 341 | 217 | 432 | 738 | 396 | 50033.6 | 16951.245 | 30388.448 | 2376.727 | 317.18 | 75127 | 3647 | 19321 | 1454 | 198572 | 1070618 | 5381 | 20_A.ADM1 |
| 2017 | Conventional farming | 3044 | 9130 | 4084 | 1747 | 3299 | 5803 | 3327 | 181 | 8082 | 1048 | 6818 | 1674 | 640 | 303 | 217 | 144 | 723 | 398 | 64363.65 | 26172.95 | 35064.06 | 1424.68 | 1701.96 | 85384 | 5812 | 10690 | 1638 | 34204 | 453421 | 8672 | 1_A.ADM1 |
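
A quick sanity check on the structure, sketched in base R with three of the sample rows from the table above (a subset of columns, values copied as shown). In these rows the breakdowns by sex, by nationality, and by workload each add up to employees_total. This is an observation on the sample only, not a documented guarantee of the source data, so a check like this may need tolerances or exceptions on the full table:

```r
# Three sample rows copied from the table above (subset of columns).
sample_rows <- data.frame(
  year = c(2013, 2022, 2017),
  spatialunit_uid = c("23_A.ADM1", "6_A.ADM1", "1_A.ADM1"),
  employees_total = c(947, 530, 9130),
  full_time_employees_75_percent_or_more = c(331, 190, 4084),
  part_time_employees_50_75_percent = c(195, 145, 1747),
  part_time_employees_2_less_than_50_percent = c(421, 195, 3299),
  employees_men = c(614, 328, 5803),
  employees_women = c(333, 202, 3327),
  employees_swiss = c(809, 530, 8082),
  employees_foreign_nationals = c(138, 0, 1048)
)

# For each row, test whether each breakdown sums to employees_total.
checks <- with(sample_rows, data.frame(
  by_sex = employees_men + employees_women == employees_total,
  by_nationality = employees_swiss + employees_foreign_nationals == employees_total,
  by_workload = full_time_employees_75_percent_or_more +
    part_time_employees_50_75_percent +
    part_time_employees_2_less_than_50_percent == employees_total
))
print(cbind(sample_rows[c("year", "spatialunit_uid")], checks))
```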

nooralahzadeh added the "step_table_structure_approved" label (table layout approved, ready to work on testdata) on Oct 10, 2023