197 add profanity filter to random words #245

Merged 7 commits on Oct 4, 2024
2 changes: 2 additions & 0 deletions .gitignore
@@ -30,3 +30,5 @@ lu-mir-zeeguu-credentials.json

 zenv*
 tools/_playground.py
+
+!zeeguu/core/word_filter/data/
16 changes: 13 additions & 3 deletions zeeguu/core/exercises/similar_words.py
@@ -1,19 +1,29 @@
 import random

 from zeeguu.core.word_stats import lang_info
+from zeeguu.core.word_filter import (
+    BAD_WORD_LIST,
+    PROPER_NAMES_LIST,
+    remove_words_based_on_list,
+)


-def similar_words(word, language, user):
+def similar_words(word, language, user, number_of_words_to_return=2):

     words_the_user_must_study = user.scheduled_bookmarks(10)

     if len(words_the_user_must_study) == 10:
         candidates = [each.origin.word for each in words_the_user_must_study]
     else:
         candidates = lang_info(language.code).all_words()
+    candidates_filtered = remove_words_based_on_list(candidates, BAD_WORD_LIST)
+    candidates_filtered = remove_words_based_on_list(
+        candidates_filtered, PROPER_NAMES_LIST
+    )
+    candidates_filtered = [w for w in candidates_filtered if len(w) > 1]

-    random_sample = random.sample(candidates, 2)
+    random_sample = random.sample(candidates_filtered, number_of_words_to_return)
     while word in random_sample:
-        random_sample = random.sample(candidates, 2)
+        random_sample = random.sample(candidates_filtered, number_of_words_to_return)

     return random_sample
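
Reviewer sketch, not part of the PR: the retry loop above resamples until the target word is absent. It never terminates if every possible sample contains the word, and `random.sample` raises `ValueError` when the filtered pool is smaller than the requested sample. One possible alternative is to remove the target word from the pool before sampling. The names `sample_distractors` and `blocked_words` are hypothetical and do not exist in the repository.

```python
import random


def sample_distractors(word, candidates, blocked_words, number_of_words_to_return=2):
    """Pick distractor words, excluding the target word and any blocked words."""
    blocked = set(blocked_words)
    # Drop the target word up front so that no retry loop is needed.
    pool = [
        w
        for w in set(candidates)
        if w != word and w not in blocked and len(w) > 1
    ]
    if len(pool) < number_of_words_to_return:
        # Fail loudly instead of looping or sampling from too small a pool.
        raise ValueError("not enough candidate words after filtering")
    return random.sample(pool, number_of_words_to_return)
```

A caller would have to decide what to do on `ValueError`, for example fall back to an unfiltered word list. That choice is left open here.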
9 changes: 9 additions & 0 deletions zeeguu/core/word_filter/__init__.py
@@ -0,0 +1,9 @@
from .profanity_filter import load_bad_words
from .proper_noun_filter import load_proper_name_list

BAD_WORD_LIST = load_bad_words()
PROPER_NAMES_LIST = load_proper_name_list()


def remove_words_based_on_list(candidates, words_to_remove_list):
    return list(set(candidates) - set(words_to_remove_list))
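
Reviewer sketch, not part of the PR: `remove_words_based_on_list` uses a plain set difference, so matching is exact and case-sensitive, and the result loses both duplicates and the original order. For instance, the `de` list contains the uppercase entry `MILF`, which a lowercase candidate would not match. If case-insensitive matching were wanted, a variant along these lines might work. The function name is hypothetical, and whether `all_words()` returns lowercase words is not shown in this diff.

```python
def remove_words_case_insensitive(candidates, words_to_remove_list):
    """Hypothetical variant: compare with casefold() and keep the original order."""
    blocked = {w.casefold() for w in words_to_remove_list}
    seen = set()
    kept = []
    for w in candidates:
        key = w.casefold()
        # Skip blocked words and case-insensitive duplicates.
        if key in blocked or key in seen:
            continue
        seen.add(key)
        kept.append(w)
    return kept
```

The trade-off is an explicit loop in place of a one-line set operation; the loop keeps the first spelling of each word as it appeared in `candidates`.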
5 changes: 5 additions & 0 deletions zeeguu/core/word_filter/data/README.md
@@ -0,0 +1,5 @@
## Sources

`bad-words` is cloned from: https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words

Both `person-names.txt` and `city-names.txt` are from: https://github.com/FinNLP
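
Reviewer sketch, not part of the PR: the bodies of `load_bad_words` and `load_proper_name_list` are not visible in this part of the diff. The following is only a guess at how one-entry-per-line files such as those under `bad-words/` could be read into a single set. The function name, the `data_dir` parameter, and the skip list are assumptions.

```python
from pathlib import Path

# Files in the cloned repository that are documentation, not word lists.
NON_WORD_FILES = {"LICENSE", "README.md", "USERS.md"}


def load_word_lists(data_dir):
    """Hypothetical loader: union of all one-entry-per-line files in data_dir."""
    words = set()
    for path in sorted(Path(data_dir).iterdir()):
        if not path.is_file() or path.name in NON_WORD_FILES:
            continue
        for line in path.read_text(encoding="utf-8").splitlines():
            entry = line.strip()
            if entry:  # ignore blank lines
                words.add(entry)
    return words
```

Some entries in the lists are multi-word phrases (the `cs` file has a few), so they can only ever match a candidate that is the same whole phrase.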
393 changes: 393 additions & 0 deletions zeeguu/core/word_filter/data/bad-words/LICENSE

Large diffs are not rendered by default.

58 changes: 58 additions & 0 deletions zeeguu/core/word_filter/data/bad-words/README.md
@@ -0,0 +1,58 @@
# Our List of Dirty, Naughty, Obscene, and Otherwise Bad Words #

With millions of images in our library and billions of user-submitted keywords, we work hard at Shutterstock to make sure that bad words don't show up in places they shouldn't. This repo contains a list of words that we use to filter results from our autocomplete server and recommendation engine.

Please add to it as you see fit (particularly in non-English languages) or use it to spice up your next game of Scrabble :)

Obvious warning: These lists contain material that many will find offensive. (But that's the point!)

Miscellaneous caveat: Clearly, what goes in these lists is subjective. In our case, the question we use is, "What wouldn't we want to *suggest* that people look at?" This of course varies between culture, language, and geographies, so in the end we just have to make our best guess.

## Languages

| Name | Code |
| ---------------------------------- | ----------------- |
| [Arabic](ar) | ar |
| [Chinese](zh) | zh |
| [Czech](cs) | cs |
| [Danish](da) | da |
| [Dutch](nl) | nl |
| [English](en) | en |
| [Esperanto](eo) | eo |
| [Filipino](fil) | fil |
| [Finnish](fi) | fi |
| [French](fr) | fr |
| [French (CA)](fr-CA-u-sd-caqc) | fr-CA-u-sd-caqc |
| [German](de) | de |
| [Hindi](hi) | hi |
| [Hungarian](hu) | hu |
| [Italian](it) | it |
| [Japanese](ja) | ja |
| [Kabyle](kab) | kab |
| [Klingon](tlh) | tlh |
| [Korean](ko) | ko |
| [Norwegian](no) | no |
| [Persian](fa) | fa |
| [Polish](pl) | pl |
| [Portuguese](pt) | pt |
| [Russian](ru) | ru |
| [Spanish](es) | es |
| [Swedish](sv) | sv |
| [Thai](th) | th |
| [Turkish](tr) | tr |

See also the [list of projects, documents, and organizations](USERS.md) that use these lists.

## Node Module

If you are using the word lists as `.json`, or in an `npm`project, you can install the word list using the [naughty-words](https://github.com/LDNOOBW/naughty-words-js) package.

```bash
npm install naughty-words
```

© 2012–2020 Shutterstock, Inc.

[![Creative Commons License](http://i.creativecommons.org/l/by/4.0/80x15.png)](http://creativecommons.org/licenses/by/4.0/)

This work is licensed under a [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/).
31 changes: 31 additions & 0 deletions zeeguu/core/word_filter/data/bad-words/USERS.md
@@ -0,0 +1,31 @@
# Users of these lists

The following projects, documents, and organizations use these lists of dirty,
naughty, obscene, and otherwise bad words. To contribute additional uses, please
either [create an issue](https://github.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/issues/new)
or send a pull request.

## Projects

* [jQuery.ProfanityFilter](https://github.com/ChaseFlorell/jQuery.ProfanityFilter):
jQuery plugin to filter out profane words on the client.
* [grunt-naughty-words](https://www.npmjs.com/package/grunt-naughty-words):
Node.js npm library that provides a grunt task for creating an array of
profane words, and a regex to test against, as well as the capability to
compile your own obscene word list.
* [Arena Metrics](http://www.arena.co.ke):
Arena Metrics uses these lists to prevent users from popping offensive words
into polls and surveys.
* [osm-compare](https://github.com/mapbox/osm-compare) uses these lists to flag profanity in name tags when finding what changed during a feature edit on OpenStreetMap.
<!-- * [Liquidata](https://www.dolthub.com/repositories/Liquidata/bad-words) uses these lists to create demo SQL tables for [Dolt](https://www.reddit.com/r/git/comments/cdl022/dolt_its_git_for_data/) consumers. -->

## Documents

* [“Bad words” filter](http://stackoverflow.com/questions/24515/bad-words-filter)
on Stack Overflow
* [How do you implement a good profanity filter?](http://stackoverflow.com/questions/273516/how-do-you-implement-a-good-profanity-filter)
on Stack Overflow

## Organizations

* [Shutterstock, Inc.](http://code.shutterstock.com/)
38 changes: 38 additions & 0 deletions zeeguu/core/word_filter/data/bad-words/ar
@@ -0,0 +1,38 @@
سكس
طيز
شرج
لعق
لحس
مص
تمص
بيضان
ثدي
بز
بزاز
حلمة
مفلقسة
بظر
كس
فرج
شهوة
شاذ
مبادل
عاهرة
جماع
قضيب
زب
لوطي
لواط
سحاق
سحاقية
اغتصاب
خنثي
احتلام
نيك
متناك
متناكة
شرموطة
عرص
خول
قحبة
لبوة
41 changes: 41 additions & 0 deletions zeeguu/core/word_filter/data/bad-words/cs
@@ -0,0 +1,41 @@
bordel
buzna
čumět
čurák
debil
do piče
do prdele
dršťka
držka
flundra
hajzl
hovno
chcanky
chuj
jebat
kokot
kokotina
koňomrd
kunda
kurva
mamrd
mrdat
mrdka
mrdník
oslošoust
piča
píčus
píchat
pizda
prcat
prdel
prdelka
sračka
srát
šoustat
šulin
vypíčenec
zkurvit
zkurvysyn
zmrd
žrát
20 changes: 20 additions & 0 deletions zeeguu/core/word_filter/data/bad-words/da
@@ -0,0 +1,20 @@
anus
bøsserøv
cock
fisse
fissehår
fuck
hestepik
kussekryller
lort
luder
pik
pikhår
pikslugeri
piksutteri
pis
røv
røvhul
røvskæg
røvspræke
shit
66 changes: 66 additions & 0 deletions zeeguu/core/word_filter/data/bad-words/de
@@ -0,0 +1,66 @@
analritter
arsch
arschficker
arschlecker
arschloch
bimbo
bratze
bumsen
bonze
dödel
fick
ficken
flittchen
fotze
fratze
hackfresse
hure
hurensohn
ische
kackbratze
kacke
kacken
kackwurst
kampflesbe
kanake
kimme
lümmel
MILF
möpse
morgenlatte
möse
mufti
muschi
nackt
neger
nigger
nippel
nutte
onanieren
orgasmus
penis
pimmel
pimpern
pinkeln
pissen
pisser
popel
poppen
porno
reudig
rosette
schabracke
schlampe
scheiße
scheisser
schiesser
schnackeln
schwanzlutscher
schwuchtel
tittchen
titten
vögeln
vollpfosten
wichse
wichsen
wichser