Skip to content

Latest commit

 

History

History
493 lines (398 loc) · 23.5 KB

week-14.adoc

File metadata and controls

493 lines (398 loc) · 23.5 KB

Week 14 - Consumer Complaints Database

In 2015 our data scientist Nicole White ran a Webinar "LOAD CSV in the Real World", which is really popular. It has more than 56k views and people are still referring to it frequently because it is so well done.

That’s why we decided this time in our live-stream to re-enact Nicole’s webinar with Neo4j 4.3 and AuraDB Free to give people todays equivalent of that session.

Here is this weeks video: https://youtu.be/dTp6iCMEcng

Dataset

The consumer complaints database is available with data since 2011 from data.gov with more than 2.3 million complaints about products from different companies totalling 1.4GB (380MB compressed).

We’re using xsv a really neat and fast CSV processing tool written in Rust for some pre-processing.

Table 1. xsv frequency -s 'Company' complaints.csv

field

value

count

Company

EQUIFAX, INC.

355140

Company

TRANSUNION INTERMEDIATE HOLDINGS, INC.

265086

Company

Experian Information Solutions Inc.

250538

Company

BANK OF AMERICA, NATIONAL ASSOCIATION

106775

Company

WELLS FARGO & COMPANY

90802

Company

JPMORGAN CHASE & CO.

83789

Company

CITIBANK, N.A.

70313

Company

CAPITAL ONE FINANCIAL CORPORATION

60117

Company

Navient Solutions, LLC.

35285

Company

SYNCHRONY FINANCIAL

34418

To fit the data into the LIMITs of AuraDB free, we need to subfilter to a certain range, the complaints received in November 2021 are 12295, which is a good number for us.

xsv search for filtering
xsv search -s 'Date received' '2021-11-' complaints.csv | wc -l
   12295

xsv search -s 'Date received' '2021-11-' complaints.csv > complaints-nov-2021.csv

The file is also available here.

Date received

Product

Sub-product

Issue

Sub-issue

Consumer complaint narrative

Company public response

Company

State

ZIP code

Tags

Consumer consent provided?

Submitted via

Date sent to company

Company response to consumer

Timely response?

Consumer disputed?

Complaint ID

2021-11-03

Credit reporting, credit repair services, or other personal consumer reports

Credit reporting

Incorrect information on your report

Information belongs to someone else

EQUIFAX, INC.

OH

45458

Web

2021-11-03

In progress

Yes

N/A

4874335

2021-11-02

Debt collection

Other debt

Attempts to collect debt not owed

Debt was paid

FISERV INC.

FL

33156

Older American

Other

Web

2021-11-02

In progress

Yes

N/A

4867119

To load the data into Neo4j, we need it on a publicly accessible URL.

We can put it into a GitHub Gist or a Google Spreadsheet

Can we load all rows?
LOAD CSV WITH HEADERS FROM "https://git.io/J1m7e" AS row
RETURN count(*);
// 12295
What does the data look like in Cypher
LOAD CSV WITH HEADERS FROM "https://git.io/J1m7e" AS row
RETURN row LIMIT 1;

When loading the data via Cypher we get each row `RETURN`ed as a map/dict/hash object that we then can use to create our graph.

{
  "Company": "EQUIFAX, INC.",
  "Date received": "2021-11-03",
  "Issue": "Incorrect information on your report",
  "Timely response?": "Yes",
  "Sub-product": "Credit reporting",
  "Consumer complaint narrative": null,
  "ZIP code": "45458",
  "Consumer consent provided?": null,
  "Product": "Credit reporting, credit repair services, or other personal consumer reports",
  "Consumer disputed?": "N/A",
  "Company public response": null,
  "Sub-issue": "Information belongs to someone else",
  "Date sent to company": "2021-11-03",
  "State": "OH",
  "Complaint ID": "4874335",
  "Submitted via": "Web",
  "Tags": null,
  "Company response to consumer": "In progress"
}

Data Model

We closely follow Nicole’s modeling, with the exception of new datatypes for dates. All names are capitalized to ensure uniqueness.

We have the following Nodes in our models that are connected with appropriate relationships.

  • Complaint - core entity, also holds user information

  • Company - the company the complaint is against

  • Product - the product that is complained about

  • SubProduct - optional subclassification of product

  • Issue - the categorized issue with the product

  • SubIssue - optional subclassification of the issue

  • Response - Response of the company

  • Tag - there are a few tagged values

Here is our data model which is very similar to the original.

consumer complaints

Data Import

First we create constraints for our key-properties, the complaint-id and the names for the other entities.

CREATE CONSTRAINT ON (c:Complaint) ASSERT c.id IS UNIQUE;

CREATE CONSTRAINT ON (c:Company) ASSERT c.name IS UNIQUE;

CREATE CONSTRAINT ON (c:Response) ASSERT c.name IS UNIQUE;

CREATE CONSTRAINT ON (c:Product) ASSERT c.name IS UNIQUE;
CREATE CONSTRAINT ON (c:SubProduct) ASSERT c.name IS UNIQUE;

CREATE CONSTRAINT ON (c:Issue) ASSERT c.name IS UNIQUE;
CREATE CONSTRAINT ON (c:SubIssue) ASSERT c.name IS UNIQUE;

CREATE CONSTRAINT ON (c:Tag) ASSERT c.name IS UNIQUE;

We can run :schema to see if our constraints were correctly created.

Then we import the data step by step, starting with the complaints. And then running additional passes to load and connect the other entities.

We need to deal with a few optional elements, like SubIssue, SubProduct, and Tag. Tags also form a comma separated list.

For complaints we MERGE (get-or-create) them by the key-id and set the date, zip-code and state as properties. Note the backticked field names if they are not just alphanumeric characters.

Complaint
LOAD CSV WITH HEADERS FROM "https://git.io/J1m7e" AS row
MERGE (c:Complaint {id:row.`Complaint ID`})
SET c.dateReceived = date(row.`Date received`)
SET c.zip = row.`ZIP code`
SET c.state = row.State;

For companies we do our second pass over the data. We find the Complaint by ID and MERGE the Company and then MERGE the relationship (so only one relationship can exist betwen a complaint and company) Then we set the date properties for when the complaint was sent to this company.

Company
LOAD CSV WITH HEADERS FROM "https://git.io/J1m7e" AS row

MATCH (c:Complaint {id:row.`Complaint ID`})

MERGE (co:Company {name:toUpper(row.Company)})

MERGE (c)-[rel:AGAINST]->(co)
SET rel.date = date(row.`Date sent to company`);
Most frequently companies complained against
MATCH (n:Company)<-[:AGAINST]-()
RETURN n.name, count(*) AS c
ORDER BY c DESC LIMIT 10;
╒════════════════════════════════════════╤════╕
│"n.name"                                │"c" │
╞════════════════════════════════════════╪════╡
│"EQUIFAX, INC."                         │5858│
├────────────────────────────────────────┼────┤
│"TRANSUNION INTERMEDIATE HOLDINGS, INC."│3583│
├────────────────────────────────────────┼────┤
│"EXPERIAN INFORMATION SOLUTIONS INC."   │707 │
├────────────────────────────────────────┼────┤
│"NAVY FEDERAL CREDIT UNION"             │114 │
├────────────────────────────────────────┼────┤
│"ALLY FINANCIAL INC."                   │96  │
└────────────────────────────────────────┴────┘

For adding Products we need to first connect the complaint to the product as before, but then filter out rows where the Sub-product cell contains an empty string. For all others we merge and connect the SubProduct both to the product and the complaint.

Product and SubProduct
LOAD CSV WITH HEADERS FROM "https://git.io/J1m7e" as row

MATCH (c:Complaint {id:row.`Complaint ID`})

MERGE (p:Product {name:toUpper(row.Product)})
MERGE (c)-[:ABOUT]->(p)

WITH * WHERE trim(row.`Sub-product`) <> ""

MERGE (sp:SubProduct {name:toUpper(row.`Sub-product`)})
MERGE (c)-[:ABOUT]->(sp)
MERGE (sp)-[:IN_CATEGORY]->(p);

The same way we add issues and subissues.

Issue and SubIssue
LOAD CSV WITH HEADERS FROM "https://git.io/J1m7e" as row

MATCH (c:Complaint {id:row.`Complaint ID`})

MERGE (iss:Issue {name:toUpper(row.Issue)})
MERGE (c)-[:WITH]->(iss)

WITH * WHERE trim(row.`Sub-issue`) <> ""

MERGE (si:SubIssue {name:toUpper(row.`Sub-issue`)})
MERGE (c)-[:WITH]->(si)
MERGE (si)-[:IN_CATEGORY]->(iss);

Now we can have a bit of fun with the data.

Complaint counts by state and issue
MATCH (n:Complaint)-[:WITH]->(iss:Issue)
RETURN n.state, iss.name, count(*) as c
ORDER BY c DESC LIMIT 5;
╒═════════╤═══════════════════════════════════════════╤═══╕
│"n.state"│"iss.name"                                 │"c"│
╞═════════╪═══════════════════════════════════════════╪═══╡
│"TX"     │"PROBLEM WITH A CREDIT REPORTING COMPANY'S │847│
│         │INVESTIGATION INTO AN EXISTING PROBLEM"    │   │
├─────────┼───────────────────────────────────────────┼───┤
│"PA"     │"INCORRECT INFORMATION ON YOUR REPORT"     │596│
├─────────┼───────────────────────────────────────────┼───┤
│"FL"     │"INCORRECT INFORMATION ON YOUR REPORT"     │542│
├─────────┼───────────────────────────────────────────┼───┤
│"CA"     │"PROBLEM WITH A CREDIT REPORTING COMPANY'S │455│
│         │INVESTIGATION INTO AN EXISTING PROBLEM"    │   │
├─────────┼───────────────────────────────────────────┼───┤
│"TX"     │"INCORRECT INFORMATION ON YOUR REPORT"     │425│
└─────────┴───────────────────────────────────────────┴───┘

If there is a Response, it is coming from the Company pointing to the Complaint. So we need to match both in our pass and then connect the response to the complaint and to the company. Additionally we store some properties on the relationship. Yes/No/Blank values are converted to boolean just with a boolean expression (value='text'). For other conditional expressions we can use CASE.

Response
LOAD CSV WITH HEADERS FROM "https://git.io/J1m7e" as row

MATCH (c:Complaint {id:row.`Complaint ID`})
MATCH (co:Company {name:toUpper(row.Company)})
WITH * WHERE trim(row.`Company response to consumer`) <> ""

MERGE (res:Response {name:toUpper(row.`Company response to consumer`)})

MERGE (c)<-[rel:TO]-(res)
SET rel.disputed = (row.`Consumer disputed?` = "Yes")
SET rel.timely = (row.`Timely response?` = "Yes")
SET rel.text = CASE row.`Company public response`
                WHEN "" THEN null
                ELSE row.`Company public response` END

MERGE (co)-[:ANSWERED]->(res);

For tags we need to do two things. First to split the comma separated name into its constituent parts. Then taking that list, turning it into rows with UNWIND and connect the newly merged Tag nodes to the previously found Complaint.

The WITH * WHERE …​ is there to filter arbirary data with a WHERE clause.

Tag
LOAD CSV WITH HEADERS FROM "https://git.io/J1m7e" as row

MATCH (c:Complaint {id:row.`Complaint ID`})

WITH * WHERE trim(row.`Tags`) <> ""
WITH distinct row.Tags as tagsName,c
UNWIND split(tagsName,", ") as tagName

MERGE (t:Tag {name:toUpper(tagName)})
MERGE (c)-[:TAGGED]->(t);

Now we can look at our imported graph model with call db.schema.visualization(); or better call apoc.meta.graph to see the schema of the data we have imported. It’s great to see how well it fits our model.

complaints schema graph viz

Exploration

The queries here were taken and adjusted from Nicole’s GitHub repository.

Subissues with communication tactics
MATCH (i:Issue {name:'COMMUNICATION TACTICS'})
MATCH (sub:SubIssue)-[:IN_CATEGORY]->(i)
RETURN sub.name AS subissue
ORDER BY subissue;
╒════════════════════════════════════════════════════════════╕
│"subissue"                                                  │
╞════════════════════════════════════════════════════════════╡
│"CALLED BEFORE 8AM OR AFTER 9PM"                            │
├────────────────────────────────────────────────────────────┤
│"FREQUENT OR REPEATED CALLS"                                │
├────────────────────────────────────────────────────────────┤
│"USED OBSCENE, PROFANE, OR OTHER ABUSIVE LANGUAGE"          │
├────────────────────────────────────────────────────────────┤
│"YOU TOLD THEM TO STOP CONTACTING YOU, BUT THEY KEEP TRYING"│
└────────────────────────────────────────────────────────────┘
Common Responses
MATCH (r:Response)-[:TO]->(:Complaint)
RETURN r.name AS response, COUNT(*) AS count
ORDER BY count DESC;
╒═════════════════════════════════╤═══════╕
│"response"                       │"count"│
╞═════════════════════════════════╪═══════╡
│"IN PROGRESS"                    │6097   │
├─────────────────────────────────┼───────┤
│"CLOSED WITH EXPLANATION"        │5970   │
├─────────────────────────────────┼───────┤
│"CLOSED WITH NON-MONETARY RELIEF"│192    │
├─────────────────────────────────┼───────┤
│"CLOSED WITH MONETARY RELIEF"    │35     │
└─────────────────────────────────┴───────┘
Sub-Issues in multiple different Issues
MATCH (sub:SubIssue)-[:IN_CATEGORY]->(i:Issue)
WITH sub, COLLECT(i.name) AS issues
WHERE size(issues) > 1
RETURN sub.name, issues LIMIT 2;
╒══════════════════════════════╤══════════════════════════════╕
│"sub.name"                    │"issues"                      │
╞══════════════════════════════╪══════════════════════════════╡
│"DIFFICULTY SUBMITTING A DISPU│["PROBLEM WITH A CREDIT REPORT│
│TE OR GETTING INFORMATION ABOU│ING COMPANY'S INVESTIGATION IN│
│T A DISPUTE OVER THE PHONE"   │TO AN EXISTING PROBLEM","PROBL│
│                              │EM WITH A COMPANY'S INVESTIGAT│
│                              │ION INTO AN EXISTING ISSUE"]  │
├──────────────────────────────┼──────────────────────────────┤
│"THEIR INVESTIGATION DID NOT F│["PROBLEM WITH A CREDIT REPORT│
│IX AN ERROR ON YOUR REPORT"   │ING COMPANY'S INVESTIGATION IN│
│                              │TO AN EXISTING PROBLEM","PROBL│
│                              │EM WITH A COMPANY'S INVESTIGAT│
│                              │ION INTO AN EXISTING ISSUE"]  │
└──────────────────────────────┴──────────────────────────────┘
Product and issues with 'EQUIFAX, INC.'
MATCH (ef:Company {name:'EQUIFAX, INC.'})
MATCH (complaint:Complaint)-[:AGAINST]->(ef)
MATCH (:Response)-[:TO]->(complaint)
MATCH (complaint)-[:ABOUT]->(p:Product)
MATCH (complaint)-[:WITH]->(i:Issue)
RETURN p.name AS product, i.name AS issue, COUNT(*) AS count
ORDER BY count DESC LIMIT 2;
╒═══════════════════════╤═══════════════════════╤═══════╕
│"product"              │"issue"                │"count"│
╞═══════════════════════╪═══════════════════════╪═══════╡
│"CREDIT REPORTING, CRED│"PROBLEM WITH A CREDIT │2394   │
│IT REPAIR SERVICES, OR │REPORTING COMPANY'S INV│       │
│OTHER PERSONAL CONSUMER│ESTIGATION INTO AN EXIS│       │
│ REPORTS"              │TING PROBLEM"          │       │
├───────────────────────┼───────────────────────┼───────┤
│"CREDIT REPORTING, CRED│"INCORRECT INFORMATION │2262   │
│IT REPAIR SERVICES, OR │ON YOUR REPORT"        │       │
│OTHER PERSONAL CONSUMER│                       │       │
│ REPORTS"              │                       │       │
└───────────────────────┴───────────────────────┴───────┘
Which (sub-)products have sub-issues about obscene language
MATCH (subIssue:SubIssue)
WHERE subIssue.name contains 'OBSCENE'
MATCH (complaint:Complaint)-[:WITH]->(subIssue)
MATCH (complaint)-[:ABOUT]->(p:Product)
OPTIONAL MATCH (complaint)-[:ABOUT]->(sub:SubProduct)
RETURN p.name AS product, sub.name AS subproduct, COUNT(*) AS count
ORDER BY count DESC;
╒═════════════════╤══════════════════╤═══════╕
│"product"        │"subproduct"      │"count"│
╞═════════════════╪══════════════════╪═══════╡
│"DEBT COLLECTION"│"OTHER DEBT"      │3      │
├─────────────────┼──────────────────┼───────┤
│"DEBT COLLECTION"│"CREDIT CARD DEBT"│2      │
├─────────────────┼──────────────────┼───────┤
│"DEBT COLLECTION"│"PAYDAY LOAN DEBT"│2      │
├─────────────────┼──────────────────┼───────┤
│"DEBT COLLECTION"│"MEDICAL DEBT"    │1      │
└─────────────────┴──────────────────┴───────┘
Typical Response percentages per product
MATCH ()<--(r:Response) with r, count(*) as rCount
MATCH (p:Product)<--(:Complaint)<--(r)
RETURN p.name, r.name, count(*) as c, (count(*)*100)/rCount as percent ORDER BY percent DESC LIMIT 10;
╒══════════════════════╤══════════════════════╤════╤═════════╕
│"p.name"              │"r.name"              │"c" │"percent"│
╞══════════════════════╪══════════════════════╪════╪═════════╡
│"CREDIT REPORTING, CRE│"IN PROGRESS"         │5672│93       │
│DIT REPAIR SERVICES, O│                      │    │         │
│R OTHER PERSONAL CONSU│                      │    │         │
│MER REPORTS"          │                      │    │         │
├──────────────────────┼──────────────────────┼────┼─────────┤
│"CREDIT REPORTING, CRE│"CLOSED WITH EXPLANATI│4752│79       │
│DIT REPAIR SERVICES, O│ON"                   │    │         │
│R OTHER PERSONAL CONSU│                      │    │         │
│MER REPORTS"          │                      │    │         │
├──────────────────────┼──────────────────────┼────┼─────────┤
│"CREDIT REPORTING, CRE│"CLOSED WITH NON-MONET│135 │70       │
│DIT REPAIR SERVICES, O│ARY RELIEF"           │    │         │
│R OTHER PERSONAL CONSU│                      │    │         │
│MER REPORTS"          │                      │    │         │
├──────────────────────┼──────────────────────┼────┼─────────┤

Conclusion

This was a fun dataset and a bit more challenging to import. There are lots of possibilities to build analytics and visualizations with the data now that we have it in the graph. Let us know if you found this helpful to explore with us and if you built anything on top of the data.