mBART-based model to accurately translate Weiss Schwarz cards.

Model on HuggingFace here: https://huggingface.co/EricZ0u/WeissTranslate

Weiss Schwarz is a trading card game by the Japanese company Bushiroad. It's played with a 50-card deck, and each card can have a number of lengthy, specifically-worded effects. This makes the game known for being very beginner-unfriendly, sometimes referred to as a "paragraph reading simulator". This is made worse by the fact that many sets are printed exclusively in Japanese, so the choices are to either print out a thick stack of translations, memorize every card, or fumble with Google Translate (which makes the wording very weird and confusing, especially for new players). I'm aiming to build an mBART-based model that can accurately translate Japanese cards so the wording matches their English counterparts. I plan to add a function to tag card effects to make them easier to search, and hopefully integrate this all into an app to image-translate cards instantly. Comparison lol

6/24/2024

Incrementally training the model by storing it on Hugging Face, then pulling it back down the next time (sketch below). At this point I'm limited by heartofthecards' DDoS protection in getting training data lol. The accuracy is very good, I'd say usable. The only problem is that in some specific contexts effects get cut off: if a card is a counter but also has another effect below the counter effect (pretty rare), the second effect gets cut off in the translation.
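
The incremental loop is basically: pull the last checkpoint from the Hub, fine-tune on the newly scraped sets, push it back. A minimal sketch, assuming the checkpoint lives at the repo linked above (the trainer details are omitted and not the literal notebook code):

```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

repo_id = "EricZ0u/WeissTranslate"

# Pull the last fine-tuned checkpoint from the Hub instead of the base model.
model = MBartForConditionalGeneration.from_pretrained(repo_id)
tokenizer = MBart50TokenizerFast.from_pretrained(repo_id)

# ... fine-tune on the newly scraped sets here ...

# Push the updated weights back so the next session can resume from them.
model.push_to_hub(repo_id)
tokenizer.push_to_hub(repo_id)
```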

6/20/2024

Trained the model again, this time with 10 sets (~1000 rows), and the results are noticeably better! Pretty much all the keywords show up right, but there's a new problem where it sometimes cuts off lines? I'll figure that one out. I made some tweaks, including reading from a CSV to be more space-efficient than the JSON, and rather than having a train.csv and a validation.csv I just use one train file and split it in the program (roughly the split sketched below). This makes it a lot easier to distribute the data evenly, as previously I was doing 4 sets for train and 1 for validation, which could lead to imbalance. It's a lot easier for me too: I can just send all the set data to one CSV and have it train. Also added a Python notebook to convert any number of set JSON files into a CSV, and neatened up the GitHub repo.
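
The single-CSV split is roughly something like this; the column layout, 90/10 ratio, and seed here are assumptions rather than the exact notebook values:

```python
from datasets import load_dataset

# One train.csv holding every set; shuffle-and-split in the program instead of
# keeping separate train/validation files.
dataset = load_dataset("csv", data_files="train.csv")["train"]
split = dataset.train_test_split(test_size=0.1, seed=42)  # shuffles, so sets mix evenly

train_ds = split["train"]
val_ds = split["test"]  # used as the validation set
```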

6/17/2024

Got the webscraping script to work using just requests. Took me a bit and I had to watch a YouTube video on webscraping, but it turns out I was trying to get "div" elements by class when I should've been using "td". It seems I can go faster now too; I'm limited by the website rather than Firecrawl's API rate limit.
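
For context, the fix boiled down to grabbing td cells instead of divs. A rough sketch of the requests-only version (the real script does more cleanup and pairing of English/Japanese text):

```python
import requests
from bs4 import BeautifulSoup

url = "https://www.heartofthecards.com/translations/hololive_production_booster_pack.html"
resp = requests.get(url, timeout=30)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
# The card text lives in <td> cells, not class-tagged <div>s.
cells = [td.get_text(strip=True) for td in soup.find_all("td")]
print(cells[:10])
```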

6/13/24

Kind of taken a sideline to my other projects, but the main issue I have right now is getting data. Firecrawl only gives 500 credits a month for free and I don't want to pay $20 a month. Besides that, I've been tinkering with Python's requests library, but HOTC seems to be formatted poorly, so there's no class to extract the English and Japanese text from. I can definitely find a way, but I'll need to play around more and maybe use some other webscraping libraries.

6/9/24

Added a couple more lines of data and retrained; not much visible improvement. Found out that HOTC has a "simplified view" for each card, where the URL is just the set code and card number, so I wrote a quick Python script to loop through every card in the set. Used Firecrawl to scrape the pages and then put the data into a JSON for training. The file is 'HOTCSetScrape.ipynb'. This lets me get a lot more training data, so I'm hoping it'll make my translations a lot more accurate.
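
The loop itself looks roughly like the sketch below. The URL pattern, set code, and card count are placeholders (the real simplified-view URL differs), and the actual notebook calls Firecrawl for the scrape where this sketch just uses requests:

```python
import json
import requests

SET_CODE = "HOL/W91"   # hypothetical set code
NUM_CARDS = 100        # however many cards the set has
BASE = "https://www.heartofthecards.com/"  # placeholder; real simplified-view path differs

cards = []
for i in range(1, NUM_CARDS + 1):
    card_id = f"{SET_CODE}-{i:03d}"          # set code + zero-padded card number
    resp = requests.get(BASE + card_id, timeout=30)
    if resp.ok:
        cards.append({"id": card_id, "html": resp.text})

with open("set_cards.json", "w", encoding="utf-8") as f:
    json.dump(cards, f, ensure_ascii=False)
```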

6/7/24

Took a break from fine-tuning the translation model today to work on making it do image translations. Tesseract still didn't work nearly well enough for my needs even after fine-tuning with a custom font, so I decided to switch gears and go for Google Cloud Vision. They have a pretty generous free plan and it was pretty easy to set up the API keys. I got it working quickly in Google Colab and the translations are pretty much spot on! Decided to try and wrap it up as an app, so I spent a couple hours learning Flask and got a rudimentary concept page to work: just an "upload" button, then it translates the image and spits out the original text and the translation, with the option to translate another. It feels usable, and if integrated with a camera it's probably better than Google Translate, but I think I can do a lot better.
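
The Cloud Vision OCR step is short; a minimal sketch (assumes GOOGLE_APPLICATION_CREDENTIALS is set, and leaves out the Flask wiring and the translation call):

```python
from google.cloud import vision

client = vision.ImageAnnotatorClient()

# Read a card photo and run text detection on it.
with open("card.jpg", "rb") as f:
    image = vision.Image(content=f.read())

response = client.text_detection(image=image)
# The first annotation holds the full detected text block.
card_text = response.text_annotations[0].description if response.text_annotations else ""
print(card_text)
```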

Also improved translation a bit: the cost represented by numbers in circles (①) was getting detected, but in the translation it was switched out for just a 1. I now encode them before translation so they won't be translated, then decode them after.
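
The encode/decode trick is just a placeholder swap; a sketch of the idea (the placeholder token format is an assumption, not the exact one I used):

```python
CIRCLED = "①②③④⑤⑥⑦⑧⑨⑩"

def encode_circled(text: str) -> str:
    """Swap circled cost numbers for tokens the translator will leave alone."""
    for i, ch in enumerate(CIRCLED, start=1):
        text = text.replace(ch, f"<COST{i}>")
    return text

def decode_circled(text: str) -> str:
    """Restore the circled cost numbers after translation."""
    for i, ch in enumerate(CIRCLED, start=1):
        text = text.replace(f"<COST{i}>", ch)
    return text

# usage: decode_circled(translate(encode_circled(japanese_text)))
```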

6/6/24

Did some more training with a slightly larger pool of data (40 training, 10 validation). I'm not using a defined metric (like BLEU) just yet, but going by eye the translations are a lot cleaner than the model I started with: "focus" is now BRAINSTORM and "mountain札" is now library. A lot better! There are still some wonky bits though, like "backup" becoming "skull sword"?? I definitely need to expand the dataset, but besides that I do want to tweak the hyperparameters. I'm on Google Colab Pro already, but their A100 instances are never available and I can't go past a batch size of 8 on their L4 instances. While that should theoretically be enough, I do want to try AWS to get more freedom.
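
If and when I add a proper metric, BLEU via the evaluate/sacrebleu libraries is the obvious first pick. A minimal sketch of the call shape (the prediction/reference strings are made-up placeholders, not real card text):

```python
import evaluate

bleu = evaluate.load("sacrebleu")

# Made-up example pair just to show the call shape.
predictions = ["【AUTO】 When this card is placed on the stage from your hand, draw a card."]
references = [["【AUTO】 When this card is placed on the stage from your hand, draw a card."]]

print(bleu.compute(predictions=predictions, references=references)["score"])  # 100.0 for an exact match
```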

Also tried to get image translation working. Spent a couple hours with Tesseract, but no matter what it doesn't want to work, even with third-party .traineddata files. Most likely I just need to fine-tune it using Weiss Schwarz's font.

6/5/24

Worked on getting a large corpus of training data. I'm going to need multiple series' worth of data, as no one set has an instance of every effect in the game. The current solution is taking a list of the English translations for a whole set (it looks like this: https://www.heartofthecards.com/translations/hololive_production_booster_pack.html) off of heartofthecards, extracting the text to a .txt, then using a simple Python script to populate the English part of the JSON for training. As I couldn't find a sheet of card text like this for Japanese, I just manually copy/pasted the data in. It was pretty quick, but if I need more data in the future I might need to use some Japanese search engines to find an easily scrapeable list of Japanese card effects.
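
The populate script is basically a zip of the two text dumps into JSON pairs. A sketch, assuming one effect per line and these file names (both are illustrative, not the real layout):

```python
import json

# English effects extracted from the HOTC translation page.
with open("english_effects.txt", encoding="utf-8") as f:
    english = [line.strip() for line in f if line.strip()]

# Japanese effects, manually copy/pasted for now.
with open("japanese_effects.txt", encoding="utf-8") as f:
    japanese = [line.strip() for line in f if line.strip()]

pairs = [{"ja": ja, "en": en} for ja, en in zip(japanese, english)]

with open("train.json", "w", encoding="utf-8") as f:
    json.dump(pairs, f, ensure_ascii=False, indent=2)
```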

6/4/24

Started training. I chose to keep the data in JSON and gathered some quick data for an initial proof of concept. The first issue I ran into was that the instance was running out of RAM and crashing while training. I solved this by decreasing the batch size from 16 to 8 and increasing the learning rate accordingly. There was also a weird issue with accelerate: it was up to date, but Colab wasn't recognizing that it was up to date. The solution I found on the HF forums was just to pip upgrade it, restart the runtime, and then not pip install any other packages. Weird, but it worked.
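
For reference, the training config after the RAM fix looks roughly like this. The batch size of 8 is from the fix above; the learning rate, epoch count, and other values are illustrative placeholders, not the notebook's literal config:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="weiss-mbart",
    per_device_train_batch_size=8,   # was 16, which ran the Colab instance out of memory
    per_device_eval_batch_size=8,
    learning_rate=3e-5,              # illustrative value
    num_train_epochs=3,
    save_total_limit=1,
    predict_with_generate=True,
)
```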

6/3/24

Exploring models to retrain. There weren't many Japanese-to-English-specific models on Hugging Face, so I settled on "facebook/mbart-large-50-many-to-many-mmt", which seems to be pretty popular. It supports 50 languages, so I'll just set the tokenizer to Japanese source / English target and do the fine-tuning on that.
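
A sketch of the base-model setup, using mBART-50's standard language codes for Japanese source and English target (the example sentence is a made-up placeholder):

```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
tokenizer = MBart50TokenizerFast.from_pretrained(
    "facebook/mbart-large-50-many-to-many-mmt",
    src_lang="ja_XX",
    tgt_lang="en_XX",
)

# Quick smoke test on one line of card text before any fine-tuning.
inputs = tokenizer("このカードが手札から舞台に置かれた時、1枚引く。", return_tensors="pt")
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"],  # force English output
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```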
