Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSV -> DTA Conversion Fails to Format Dates Properly #312

Closed
FlipperPA opened this issue Apr 16, 2024 · 3 comments
Closed

CSV -> DTA Conversion Fails to Format Dates Properly #312

FlipperPA opened this issue Apr 16, 2024 · 3 comments

Comments

@FlipperPA
Copy link

FlipperPA commented Apr 16, 2024

I'm attempting to convert a csv file to STATA dta format using the JSON metadata format.

I've put together an example with just a few columns:

permno,date,eom
10107,2020-01-31,2020-01-31
10107,2020-02-28,2020-02-29
[...]

This is the JSON file I've generated to provide the metadata for the conversion:

{
    "type": "STATA",
    "variables": [
        {
            "type": "NUMERIC",
            "name": "permno",
            "label": "CRSP permno (permno)",
            "format": "UNSPECIFIED"
        },
        {
            "type": "NUMERIC",
            "name": "date",
            "label": "Day of last price observation (date)",
            "format": "DATE"
        },
        {
            "type": "NUMERIC",
            "name": "eom",
            "label": "End of month (eom)",
            "format": "DATE"
        }
    ]
}

The conversion seems to run properly: Converted 3 variables and 120 rows in 0.01 seconds

However, when I open up the file in STATA v18, it doesn't properly format the dates. I've read through the code, and I seem to be triggering the proper type and format selection, as if I change them, it triggers an error. I'm a STATA novice, but here's what the CLI is reporting for the generated file:

. use "~/test.dta"
((null))

. describe

Contains data from ~/test.dta
 Observations:           120                  (null)
    Variables:             3                  16 Apr 2024 14:04
-------------------------------------------------------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------------------------------------------------------
permno          double  %10.0g                CRSP permno (permno)
date            long    %12.0g                Day of last price observation (date)
eom             long    %12.0g                End of month (eom)
-------------------------------------------------------------------------------------------------------------------------------
Sorted by: 

. tabulate eom

     End of |
month (eom) |      Freq.     Percent        Cum.
------------+-----------------------------------
      21945 |          1        2.50        2.50
      21974 |          1        2.50        5.00
[...]

As you can see, it doesn't seem to be properly applying type or format, and when I tabulate, it displays as a number rather than a date. Have I uncovered a bug, or am I doing something wrong? I tried to follow the README and the spec.

@NilsEnevoldsen
Copy link

Stata date variables are just longs with a certain formatting applied. Try running format %tdCCYY-NN-DD date eom on your imported data.

@FlipperPA
Copy link
Author

@NilsEnevoldsen Thanks so much for the quick response, and further information. I'll give it a whirl and come back to close the ticket. 💯

@FlipperPA FlipperPA reopened this Oct 9, 2024
@FlipperPA
Copy link
Author

FlipperPA commented Oct 9, 2024

We're continuing to see this issue, and several more experienced STATA users have told me that it is likely the format should be applied during the conversion process, since we are explicitly setting the DATE and DATE_TIME format in the JSON map. Could someone confirm for me that the format should be applied during the readstat conversion process? Am I doing something wrong during my conversion attempt?

When I use SAS to export the same data, the display format seems to be properly set to %td instead of %12.0g:

-----------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-----------------------------------------------------------
permno          double  %8.0g                 PERMNO
date            long    %td                   Names Date
eom             long    %td                   Names Ending Date

And it shows up properly in the data:

     +---------------------------------+
     | permno        date   eom        |
     |---------------------------------|
  1. |  10107   31jan2020          .   |
  2. |  10107   03feb2020          .   |
  3. |  10107   04feb2020          .   |
[...]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants