Dedupe It

Deduplicate your CSV in one click. Try it: https://dedupe.it

Overview

Dirty datasets suck. Common problems:

Same person signs up with business and personal email
Same company gets added to your CRM twice with slightly different naming, ex. "Amazon" vs. "Amazon.com"
Two overlapping datasets get merged - ex. website visitors and demo form submissions

I've wasted tons of time cleaning datasets with Excel formulas and SQL, but hand-crafting deterministic rules is tedious.

After trying "AI-powered" solutions like AWS Entity Resolution, I was frustrated to find that I still had to map fields and hand-tune "confidence scores".

Dedupe It obviates manual configuration by using LLM vector embeddings & chat completions.

How it works

The algorithm:

Perform an LLM vector embedding of each record
For each record, perform vector similarity search to identify its nearest neighbors (typically 2-5 neighbors)
For each nearest neighbor, ask an LLM to determine whether it's a duplicate of the record under consideration
If we identify two records as duplicates, we group them together
For each duplicate group, we have our LLM propose a merged record
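
The steps above can be sketched in a few lines of Python. This is a toy illustration, not the project's implementation: `toy_embed` (character trigram counts) stands in for the LLM embedding, a brute-force scan stands in for the vector database, and `is_duplicate` is a hypothetical callable standing in for the chat-completion call. The final merge step is omitted.

```python
import math
from collections import Counter, defaultdict

def toy_embed(record):
    # Stand-in for an LLM embedding: a bag of character trigrams.
    text = " ".join(str(v).lower() for v in record.values())
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine(a, b):
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def nearest_neighbors(vectors, i, k):
    # Brute-force scan for clarity; a real vector index avoids comparing against every record.
    scored = sorted(((cosine(vectors[i], v), j) for j, v in enumerate(vectors) if j != i), reverse=True)
    return [j for _, j in scored[:k]]

def dedupe(records, is_duplicate, k=3):
    vectors = [toy_embed(r) for r in records]
    parent = list(range(len(records)))  # union-find forest over record indices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(len(records)):
        for j in nearest_neighbors(vectors, i, k):
            # Skip the judge call when the pair is already in the same group.
            if find(i) != find(j) and is_duplicate(records[i], records[j]):
                parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i in range(len(records)):
        groups[find(i)].append(i)
    return sorted(groups.values())

records = [{"name": "Amazon"}, {"name": "Amazon.com"}, {"name": "Initech"}]

def toy_judge(a, b):
    # Hypothetical placeholder for the chat-completion call.
    return a["name"].lower().replace(".com", "") == b["name"].lower().replace(".com", "")

groups = dedupe(records, toy_judge, k=2)
print(groups)  # [[0, 1], [2]]
```

Grouping with union-find means duplicates chain transitively: if A matches B and B matches C, all three land in one group even if A and C were never compared directly.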

Here are some advantages of this approach:

Unlike a pure NxN record comparison, scales approximately linearly in time and cost
Unlike a pure vector clustering approach, avoids the hard problem of picking a cluster cutoff
Utilizes LLM chat completions to make fine-grained comparisons. Also makes it possible to add custom guidelines, ex. "consider international subsidiaries to be distinct entities".
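
To make the scaling point concrete, here is a rough count of LLM comparisons under each strategy. It assumes every record is compared against exactly k neighbors, which is an upper bound, and it ignores the cost of the vector search itself. The function names are illustrative only.

```python
def all_pairs_comparisons(n):
    # Every unordered pair of records is compared once.
    return n * (n - 1) // 2

def neighbor_comparisons(n, k):
    # Each record is compared only against its k nearest neighbors.
    return n * k

n, k = 10_000, 5
print(all_pairs_comparisons(n))    # 49995000
print(neighbor_comparisons(n, k))  # 50000
```

For 10,000 records and 5 neighbors each, that is roughly a thousandfold reduction in LLM calls.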

Future improvements

Incremental / stream deduplication: Persist the vector database and groupings to deduplicate new records without reprocessing the whole dataset.
Custom rules: Allow specifying custom guidelines, either fuzzy (ex. "consider international subsidiaries to be distinct entities") or deterministic (ex. "always use the values from the last updated record when merging").
Integrations: Excel, Google Sheets, Google Forms, Salesforce, HubSpot, Snowflake, etc.
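
As an illustration of what a deterministic merge rule might look like, here is a hypothetical "last updated record wins" merge. This feature does not exist in Dedupe It today; the function name, the `updated_at` field, and the choice to let blank values never override earlier ones are all assumptions made for the sketch.

```python
from datetime import date

def merge_latest_wins(group, updated_field="updated_at"):
    # Walk records from oldest to newest so newer values overwrite older ones.
    merged = {}
    for record in sorted(group, key=lambda r: r[updated_field]):
        for field, value in record.items():
            if value not in (None, ""):  # blank values never erase known data
                merged[field] = value
    return merged

group = [
    {"name": "Mike Schmidt", "phone": "555-0100", "updated_at": date(2024, 1, 5)},
    {"name": "Michael B. Schmidt", "phone": "", "updated_at": date(2024, 3, 1)},
]
merged = merge_latest_wins(group)
print(merged["name"], merged["phone"])  # Michael B. Schmidt 555-0100
```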

NOTE: Some of these features are already in the works. If you have requests or want early access, email [email protected] or fill out the form on the demo site (https://dedupe.it)

Self-hosting

Frontend: see instructions in frontend/README.md
Backend: see instructions in backend/README.md

Contributing

Contributions are welcome. We don't currently have a formal contribution process; feel free to simply open an issue or PR.

License

Dedupe It is open-source software, available under the Apache 2.0 license.

If you intend to use Dedupe It for commercial purposes, we simply ask that you include a copy of the original license in your code, and email the authors ([email protected]) with a short explanation of how you intend to use Dedupe It.

FAQ

What does Dedupe It cost to run?

With no modifications, this implementation will cost you ~$8.40 per 1k records. The actual cost depends on the size of your records, so please benchmark your costs on a small, representative sample of data before attempting to run on a large dataset.
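
A back-of-the-envelope estimate might look like the sketch below. It assumes cost scales linearly with record count, which only holds if your sample is representative. The function names are illustrative, and the $4.20 sample cost is a made-up figure, not a measurement.

```python
def estimate_cost(n_records, cost_per_1k=8.40):
    # Uses the ~$8.40 per 1k records figure quoted above.
    return n_records / 1000 * cost_per_1k

def extrapolate_from_sample(sample_cost, sample_size, total_records):
    # Scale the measured cost of a small benchmark run up to the full dataset.
    return sample_cost / sample_size * total_records

full_run = estimate_cost(100_000)                            # about $840
from_sample = extrapolate_from_sample(4.20, 500, 100_000)    # about $840
print(f"${full_run:,.2f}", f"${from_sample:,.2f}")
```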

NOTE: We're working on optimizations to reduce the cost 100x. If you need to run deduplication on a larger dataset (100k+ rows), please email us: [email protected]

When should I use Dedupe It?

Dedupe It is built for fuzzy deduplication of structured data. In other words, it's a good fit if your data:

Fits into a spreadsheet / CSV
Might have duplicates
Has duplicates that are not exact (ex. "Michael B. Schmidt, [email protected]" and "mike schmidt, [email protected]")

This is typically the case if:

Your data comes from manual entry, like someone filling out a form
Your data is a combination of multiple sources, each of which has different fields or formatting

Do you have performance benchmarks?

We'll release more detailed head-to-head comparisons and benchmarks in the future.

Anecdotally, on the datasets that we've tested (people, companies, and products):

Rate of false negatives (not catching a pair of duplicates): <5%
Rate of false positives (mistakenly identifying distinct records as duplicates): vanishingly small
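
If you have a hand-labeled sample, you can measure these rates on your own data. The sketch below is one way to do it, not the method behind the numbers above: it compares duplicate pairs implied by predicted groups against pairs implied by true groups. The source does not say how its rates are normalized, so the denominators here (true pairs for false negatives, predicted pairs for false positives) are assumptions.

```python
from itertools import combinations

def pairs_from_groups(groups):
    # Every unordered pair of record ids that share a group counts as a duplicate pair.
    return {frozenset(p) for g in groups for p in combinations(g, 2)}

def error_rates(predicted_groups, true_groups):
    predicted = pairs_from_groups(predicted_groups)
    actual = pairs_from_groups(true_groups)
    missed = actual - predicted  # true duplicate pairs that were not caught
    wrong = predicted - actual   # pairs flagged as duplicates that are distinct
    fn_rate = len(missed) / len(actual) if actual else 0.0
    fp_share = len(wrong) / len(predicted) if predicted else 0.0
    return fn_rate, fp_share

fn_rate, fp_share = error_rates(
    predicted_groups=[[0, 1], [2, 3, 4]],
    true_groups=[[0, 1], [2, 3], [4, 5]],
)
print(round(fn_rate, 3), fp_share)  # 0.333 0.5
```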

Why should I use Dedupe It over alternative X?

As of today, our main value proposition is simplicity.

There is no other fuzzy deduplication solution that works out-of-the-box without manual configuration.

That said, this implementation is a minimum viable product. Shortcomings of Dedupe It today include:

Relatively expensive (~$8.40/1k records). NOTE: we will be bringing this down 100x over the next month or two.
No incrementality. NOTE: we have incremental versions working locally.
No integrations. NOTE: on the roadmap.
No custom rules. NOTE: on the roadmap.

If you need any of those features, please contact us.

In the meantime, here are some popular alternatives:

Enterprise-ready: AWS Entity Resolution
Open-source: Zingg
Integrated with Salesforce: DataGroomr

Can you add feature X?

Probably! Add a thread in the Discussions section on GitHub, or email us at [email protected].

Contact

Dedupe It is built by the Snowpilot team

To contact us, email [email protected]

Cheers!
