Added data for bad-word-filtering
- Included data and its sources.
tfnribeiro committed Sep 30, 2024
1 parent 1c0ddd3 commit 26d1e01
Showing 36 changed files with 279,504 additions and 3 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -30,3 +30,5 @@ lu-mir-zeeguu-credentials.json

zenv*
tools/_playground.py

!zeeguu/core/word_filter/data/
5 changes: 5 additions & 0 deletions zeeguu/core/word_filter/data/README.md
@@ -0,0 +1,5 @@
## Sources

`bad-word-list` is cloned from: https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words

Both `name-list.txt` and `city-names.txt` are from: https://github.com/FinNLP
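
The files under `bad-word-list` are plain text with one entry per line, named by language code. A minimal sketch of how such a list could be read and applied is given here; the function names `load_word_list` and `contains_listed_word` and the token-based matching are assumptions made for illustration, not the code Zeeguu actually uses.

```python
from pathlib import Path


def load_word_list(data_dir: Path, language_code: str) -> set[str]:
    """Read one list (one entry per line) into a set of lowercase entries.

    Hypothetical helper: assumes the layout <data_dir>/bad-word-list/<code>.
    """
    list_path = data_dir / "bad-word-list" / language_code
    if not list_path.is_file():
        # No list is shipped for this language code.
        return set()
    with list_path.open(encoding="utf-8") as f:
        # Skip blank lines; lowercase so mixed-case entries still match.
        return {line.strip().lower() for line in f if line.strip()}


def contains_listed_word(text: str, listed: set[str]) -> bool:
    """Return True if any whitespace-separated token of `text` is listed."""
    tokens = (token.strip(".,;:!?\"'()").lower() for token in text.split())
    return any(token in listed for token in tokens)
```

Under these assumptions, `load_word_list(Path("zeeguu/core/word_filter/data"), "da")` would return the Danish entries as a set. Matching on single tokens is a deliberate simplification: entries that contain spaces are loaded but never matched, so a real filter would need phrase matching as well.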
393 changes: 393 additions & 0 deletions zeeguu/core/word_filter/data/bad-word-list/LICENSE

Large diffs are not rendered by default.

58 changes: 58 additions & 0 deletions zeeguu/core/word_filter/data/bad-word-list/README.md
@@ -0,0 +1,58 @@
# Our List of Dirty, Naughty, Obscene, and Otherwise Bad Words #

With millions of images in our library and billions of user-submitted keywords, we work hard at Shutterstock to make sure that bad words don't show up in places they shouldn't. This repo contains a list of words that we use to filter results from our autocomplete server and recommendation engine.

Please add to it as you see fit (particularly in non-English languages) or use it to spice up your next game of Scrabble :)

Obvious warning: These lists contain material that many will find offensive. (But that's the point!)

Miscellaneous caveat: Clearly, what goes in these lists is subjective. In our case, the question we use is, "What wouldn't we want to *suggest* that people look at?" This of course varies across cultures, languages, and geographies, so in the end we just have to make our best guess.

## Languages

| Name | Code |
| ---------------------------------- | ----------------- |
| [Arabic](ar) | ar |
| [Chinese](zh) | zh |
| [Czech](cs) | cs |
| [Danish](da) | da |
| [Dutch](nl) | nl |
| [English](en) | en |
| [Esperanto](eo) | eo |
| [Filipino](fil) | fil |
| [Finnish](fi) | fi |
| [French](fr) | fr |
| [French (CA)](fr-CA-u-sd-caqc) | fr-CA-u-sd-caqc |
| [German](de) | de |
| [Hindi](hi) | hi |
| [Hungarian](hu) | hu |
| [Italian](it) | it |
| [Japanese](ja) | ja |
| [Kabyle](kab) | kab |
| [Klingon](tlh) | tlh |
| [Korean](ko) | ko |
| [Norwegian](no) | no |
| [Persian](fa) | fa |
| [Polish](pl) | pl |
| [Portuguese](pt) | pt |
| [Russian](ru) | ru |
| [Spanish](es) | es |
| [Swedish](sv) | sv |
| [Thai](th) | th |
| [Turkish](tr) | tr |

See also the [list of projects, documents, and organizations](USERS.md) that use these lists.

## Node Module

If you are using the word lists as `.json`, or in an `npm` project, you can install them using the [naughty-words](https://github.com/LDNOOBW/naughty-words-js) package.

```bash
npm install naughty-words
```

© 2012–2020 Shutterstock, Inc.

[![Creative Commons License](http://i.creativecommons.org/l/by/4.0/80x15.png)](http://creativecommons.org/licenses/by/4.0/)

This work is licensed under a [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/).
31 changes: 31 additions & 0 deletions zeeguu/core/word_filter/data/bad-word-list/USERS.md
@@ -0,0 +1,31 @@
# Users of these lists

The following projects, documents, and organizations use these lists of dirty,
naughty, obscene, and otherwise bad words. To contribute additional uses, please
either [create an issue](https://github.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/issues/new)
or send a pull request.

## Projects

* [jQuery.ProfanityFilter](https://github.com/ChaseFlorell/jQuery.ProfanityFilter):
jQuery plugin to filter out profane words on the client.
* [grunt-naughty-words](https://www.npmjs.com/package/grunt-naughty-words):
Node.js npm library that provides a grunt task for creating an array of
profane words, and a regex to test against, as well as the capability to
compile your own obscene word list.
* [Arena Metrics](http://www.arena.co.ke):
Arena Metrics uses these lists to prevent users from popping offensive words
into polls and surveys.
* [osm-compare](https://github.com/mapbox/osm-compare) uses these lists to flag profanity in name tags when finding what changed during a feature edit on OpenStreetMap.
<!-- * [Liquidata](https://www.dolthub.com/repositories/Liquidata/bad-words) uses these lists to create demo SQL tables for [Dolt](https://www.reddit.com/r/git/comments/cdl022/dolt_its_git_for_data/) consumers. -->

## Documents

* [“Bad words” filter](http://stackoverflow.com/questions/24515/bad-words-filter)
on Stack Overflow
* [How do you implement a good profanity filter?](http://stackoverflow.com/questions/273516/how-do-you-implement-a-good-profanity-filter)
on Stack Overflow

## Organizations

* [Shutterstock, Inc.](http://code.shutterstock.com/)
38 changes: 38 additions & 0 deletions zeeguu/core/word_filter/data/bad-word-list/ar
@@ -0,0 +1,38 @@
سكس
طيز
شرج
لعق
لحس
مص
تمص
بيضان
ثدي
بز
بزاز
حلمة
مفلقسة
بظر
كس
فرج
شهوة
شاذ
مبادل
عاهرة
جماع
قضيب
زب
لوطي
لواط
سحاق
سحاقية
اغتصاب
خنثي
احتلام
نيك
متناك
متناكة
شرموطة
عرص
خول
قحبة
لبوة
41 changes: 41 additions & 0 deletions zeeguu/core/word_filter/data/bad-word-list/cs
@@ -0,0 +1,41 @@
bordel
buzna
čumět
čurák
debil
do piče
do prdele
dršťka
držka
flundra
hajzl
hovno
chcanky
chuj
jebat
kokot
kokotina
koňomrd
kunda
kurva
mamrd
mrdat
mrdka
mrdník
oslošoust
piča
píčus
píchat
pizda
prcat
prdel
prdelka
sračka
srát
šoustat
šulin
vypíčenec
zkurvit
zkurvysyn
zmrd
žrát
20 changes: 20 additions & 0 deletions zeeguu/core/word_filter/data/bad-word-list/da
@@ -0,0 +1,20 @@
anus
bøsserøv
cock
fisse
fissehår
fuck
hestepik
kussekryller
lort
luder
pik
pikhår
pikslugeri
piksutteri
pis
røv
røvhul
røvskæg
røvspræke
shit
66 changes: 66 additions & 0 deletions zeeguu/core/word_filter/data/bad-word-list/de
@@ -0,0 +1,66 @@
analritter
arsch
arschficker
arschlecker
arschloch
bimbo
bratze
bumsen
bonze
dödel
fick
ficken
flittchen
fotze
fratze
hackfresse
hure
hurensohn
ische
kackbratze
kacke
kacken
kackwurst
kampflesbe
kanake
kimme
lümmel
MILF
möpse
morgenlatte
möse
mufti
muschi
nackt
neger
nigger
nippel
nutte
onanieren
orgasmus
penis
pimmel
pimpern
pinkeln
pissen
pisser
popel
poppen
porno
reudig
rosette
schabracke
schlampe
scheiße
scheisser
schiesser
schnackeln
schwanzlutscher
schwuchtel
tittchen
titten
vögeln
vollpfosten
wichse
wichsen
wichser