Compare commits

21 Commits
v0.2 ... master

Author SHA1 Message Date
lub 61d66b2b9b Update 'README.md' 1 year ago
lub 3b78b19330 use extra variable instead of post['category'] directly 2 years ago
lub b2c23d0446 fix old_cache check 2 years ago
lub 282b00daca set presence to offline (reduces inaccurate presence traffic) 2 years ago
lub 05f9da94c0 optimize cache logic 2 years ago
lub f3ec6c6b73 remove unused import 2 years ago
lub 61981726db fix linter errors 2 years ago
lub ecf32bc114 add support for multiple categories 2 years ago
lub 4a531d98f5 relicense 2 years ago
lub ce4f3401f4 reorder native imports before 3rd party 2 years ago
lub 65f83c75f2 remove .gitignore 2 years ago
lub 544291ac60 switch to python:slim 2 years ago
lub 9202a2f49d add docker repo URIs to readme 2 years ago
lub 0130d63551 use hardcoded cache size instead of len(blog) (It seems the blog posts are not rotated strictly chronologically, so sometimes previously already posted things got posted again.) 3 years ago
taire 00b6811a20 typo in comment 4 years ago
lub 11e4c753ce clarify admin room sharing in readme 4 years ago
lub 5b7f095470 add diablo bot to readme 4 years ago
lub f88f17bb65 use unbuffered output (this prints stdout directly instead of buffering it. otherwise docker logs is useless because its not realtime) 4 years ago
lub 7990d787fc refactor category and image into own functions 4 years ago
lub fed4b15d08 fix scraping 4 years ago
lub cd194b344d fix example env vars in Dockerfile 4 years ago
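
Note on commit 0130d63551: the message says posts are not rotated strictly chronologically, so a cache sized by `len(blog)` can forget a URL that later reappears on the page. The following is a minimal sketch of that effect, not the bot's actual code. The helper `trim_cache`, the URLs, and the numbers 120 and 30 are made up for illustration; only the limit of 100 comes from the diff.

```python
CACHE_SIZE = 100  # same hardcoded limit as in the diff


def trim_cache(url_list, max_size=CACHE_SIZE):
    """Keep only the newest max_size entries (oldest entries are at the front)."""
    return url_list[-max_size:]


# hypothetical data: 120 known URLs, while the blog page currently lists 30 posts
cached = [f'https://news.blizzard.com/en-us/post/{i}' for i in range(120)]
blog_length = 30

trimmed_by_blog = trim_cache(cached, blog_length)  # old behaviour: len(blog)
trimmed_fixed = trim_cache(cached)                 # new behaviour: fixed size

# an older post that rotates back onto the page
reappearing = cached[50]
print(reappearing in trimmed_by_blog)  # False: it would be announced again
print(reappearing in trimmed_fixed)    # True: it is still known as posted
```

A fixed size only widens the window. A post older than the last 100 cached URLs per category could still be announced twice.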

5
.gitignore vendored

@ -1,5 +0,0 @@
include
lib
lib64
bin
pyvenv.cfg

@ -1,9 +1,10 @@
FROM docker.io/python:latest
FROM docker.io/python:slim
ENV HOMESERVER_URL=https://matrix.org \
HOMESERVER_NAME=matrix.org \
MXID_PREFIX=snowstorm_ \
ADMIN_ROOM=!jaeisofjaosiefjoi:matrix.org
ENV HOMESERVER=https://example.com \
MXID=@snowstorm:example.com \
ACCESSTOKEN_FILE=/run/secrets/accesstoken \
ADMIN_ROOM=!jaeisofjaosiefjoi:example.com \
CATEGORY=examplecategory
WORKDIR /app
@ -13,4 +14,4 @@ RUN pip install -r requirements.txt
COPY scrape.py ./
USER nobody:nogroup
CMD [ "python", "scrape.py" ]
CMD [ "python", "-u", "scrape.py" ]
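
Note on the `-u` flag: it makes Python's stdout and stderr unbuffered, which is why `docker logs` shows output in real time after this change. Setting `PYTHONUNBUFFERED=1` or passing `flush=True` to `print` has a comparable effect. The sketch below reproduces the buffering in isolation; the in-memory stream is only a stand-in for stdout attached to a pipe, which is an assumption about how the container's output is captured.

```python
import io

# in-memory stand-in for stdout when it is connected to a pipe instead of a terminal
raw = io.BytesIO()
stream = io.TextIOWrapper(io.BufferedWriter(raw), encoding='utf8', newline='\n')

print('new post found', file=stream)
before_flush = raw.getvalue()  # still empty: the line sits in the buffer

print('another post', file=stream, flush=True)
after_flush = raw.getvalue()   # both lines have been written out now

print(before_flush)
print(after_flush)
```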

@ -1,6 +1,6 @@
### GNU GENERAL PUBLIC LICENSE
### GNU AFFERO GENERAL PUBLIC LICENSE
Version 3, 29 June 2007
Version 3, 19 November 2007
Copyright (C) 2007 Free Software Foundation, Inc.
<https://fsf.org/>
@ -10,17 +10,15 @@ license document, but changing it is not allowed.
### Preamble
The GNU General Public License is a free, copyleft license for
software and other kinds of works.
The GNU Affero General Public License is a free, copyleft license for
software and other kinds of works, specifically designed to ensure
cooperation with the community in the case of network server software.
The licenses for most software and other practical works are designed
to take away your freedom to share and change the works. By contrast,
the GNU General Public License is intended to guarantee your freedom
to share and change all versions of a program--to make sure it remains
free software for all its users. We, the Free Software Foundation, use
the GNU General Public License for most of our software; it applies
also to any other work released this way by its authors. You can apply
it to your programs, too.
our General Public Licenses are intended to guarantee your freedom to
share and change all versions of a program--to make sure it remains
free software for all its users.
When we speak of free software, we are referring to freedom, not
price. Our General Public Licenses are designed to make sure that you
@ -29,46 +27,34 @@ them if you wish), that you receive source code or can get it if you
want it, that you can change the software or use pieces of it in new
free programs, and that you know you can do these things.
To protect your rights, we need to prevent others from denying you
these rights or asking you to surrender the rights. Therefore, you
have certain responsibilities if you distribute copies of the
software, or if you modify it: responsibilities to respect the freedom
of others.
Developers that use our General Public Licenses protect your rights
with two steps: (1) assert copyright on the software, and (2) offer
you this License which gives you legal permission to copy, distribute
and/or modify the software.
For example, if you distribute copies of such a program, whether
gratis or for a fee, you must pass on to the recipients the same
freedoms that you received. You must make sure that they, too, receive
or can get the source code. And you must show them these terms so they
know their rights.
A secondary benefit of defending all users' freedom is that
improvements made in alternate versions of the program, if they
receive widespread use, become available for other developers to
incorporate. Many developers of free software are heartened and
encouraged by the resulting cooperation. However, in the case of
software used on network servers, this result may fail to come about.
The GNU General Public License permits making a modified version and
letting the public access it on a server without ever releasing its
source code to the public.
Developers that use the GNU GPL protect your rights with two steps:
(1) assert copyright on the software, and (2) offer you this License
giving you legal permission to copy, distribute and/or modify it.
The GNU Affero General Public License is designed specifically to
ensure that, in such cases, the modified source code becomes available
to the community. It requires the operator of a network server to
provide the source code of the modified version running there to the
users of that server. Therefore, public use of a modified version, on
a publicly accessible server, gives the public access to the source
code of the modified version.
For the developers' and authors' protection, the GPL clearly explains
that there is no warranty for this free software. For both users' and
authors' sake, the GPL requires that modified versions be marked as
changed, so that their problems will not be attributed erroneously to
authors of previous versions.
Some devices are designed to deny users access to install or run
modified versions of the software inside them, although the
manufacturer can do so. This is fundamentally incompatible with the
aim of protecting users' freedom to change the software. The
systematic pattern of such abuse occurs in the area of products for
individuals to use, which is precisely where it is most unacceptable.
Therefore, we have designed this version of the GPL to prohibit the
practice for those products. If such problems arise substantially in
other domains, we stand ready to extend this provision to those
domains in future versions of the GPL, as needed to protect the
freedom of users.
Finally, every program is threatened constantly by software patents.
States should not allow patents to restrict development and use of
software on general-purpose computers, but in those that do, we wish
to avoid the special danger that patents applied to a free program
could make it effectively proprietary. To prevent this, the GPL
assures that patents cannot be used to render the program non-free.
An older license, called the Affero General Public License and
published by Affero, was designed to accomplish similar goals. This is
a different license, not a version of the Affero GPL, but Affero has
released a new version of the Affero GPL which permits relicensing
under this license.
The precise terms and conditions for copying, distribution and
modification follow.
@ -77,7 +63,8 @@ modification follow.
#### 0. Definitions.
"This License" refers to version 3 of the GNU General Public License.
"This License" refers to version 3 of the GNU Affero General Public
License.
"Copyright" also means copyright-like laws that apply to other kinds
of works, such as semiconductor masks.
@ -546,37 +533,47 @@ from those to whom you convey the Program, the only way you could
satisfy both those terms and this License would be to refrain entirely
from conveying the Program.
#### 13. Use with the GNU Affero General Public License.
#### 13. Remote Network Interaction; Use with the GNU General Public License.
Notwithstanding any other provision of this License, if you modify the
Program, your modified version must prominently offer all users
interacting with it remotely through a computer network (if your
version supports such interaction) an opportunity to receive the
Corresponding Source of your version by providing access to the
Corresponding Source from a network server at no charge, through some
standard or customary means of facilitating copying of software. This
Corresponding Source shall include the Corresponding Source for any
work covered by version 3 of the GNU General Public License that is
incorporated pursuant to the following paragraph.
Notwithstanding any other provision of this License, you have
permission to link or combine any covered work with a work licensed
under version 3 of the GNU Affero General Public License into a single
under version 3 of the GNU General Public License into a single
combined work, and to convey the resulting work. The terms of this
License will continue to apply to the part which is the covered work,
but the special requirements of the GNU Affero General Public License,
section 13, concerning interaction through a network will apply to the
combination as such.
but the work with which it is combined will remain governed by version
3 of the GNU General Public License.
#### 14. Revised Versions of this License.
The Free Software Foundation may publish revised and/or new versions
of the GNU General Public License from time to time. Such new versions
will be similar in spirit to the present version, but may differ in
detail to address new problems or concerns.
of the GNU Affero General Public License from time to time. Such new
versions will be similar in spirit to the present version, but may
differ in detail to address new problems or concerns.
Each version is given a distinguishing version number. If the Program
specifies that a certain numbered version of the GNU General Public
License "or any later version" applies to it, you have the option of
following the terms and conditions either of that numbered version or
of any later version published by the Free Software Foundation. If the
Program does not specify a version number of the GNU General Public
License, you may choose any version ever published by the Free
Software Foundation.
specifies that a certain numbered version of the GNU Affero General
Public License "or any later version" applies to it, you have the
option of following the terms and conditions either of that numbered
version or of any later version published by the Free Software
Foundation. If the Program does not specify a version number of the
GNU Affero General Public License, you may choose any version ever
published by the Free Software Foundation.
If the Program specifies that a proxy can decide which future versions
of the GNU General Public License can be used, that proxy's public
statement of acceptance of a version permanently authorizes you to
choose that version for the Program.
of the GNU Affero General Public License can be used, that proxy's
public statement of acceptance of a version permanently authorizes you
to choose that version for the Program.
Later license versions may give you additional or different
permissions. However, no additional obligations are imposed on any
@ -634,42 +631,30 @@ the exclusion of warranty; and each file should have at least the
Copyright (C) <year> <name of author>
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
it under the terms of the GNU Affero General Public License as
published by the Free Software Foundation, either version 3 of the
License, or (at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
GNU Affero General Public License for more details.
You should have received a copy of the GNU General Public License
You should have received a copy of the GNU Affero General Public License
along with this program. If not, see <https://www.gnu.org/licenses/>.
Also add information on how to contact you by electronic and paper
mail.
If the program does terminal interaction, make it output a short
notice like this when it starts in an interactive mode:
<program> Copyright (C) <year> <name of author>
This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
This is free software, and you are welcome to redistribute it
under certain conditions; type `show c' for details.
The hypothetical commands \`show w' and \`show c' should show the
appropriate parts of the General Public License. Of course, your
program's commands might be different; for a GUI interface, you would
use an "about box".
If your software can interact with users remotely through a computer
network, you should also make sure that it provides a way for users to
get its source. For example, if your program is a web application, its
interface could display a "Source" link that leads users to an archive
of the code. There are many ways you could offer source, and different
solutions will be better for different programs; see section 13 for
the specific requirements.
You should also get your employer (if you work as a programmer) or
school, if any, to sign a "copyright disclaimer" for the program, if
necessary. For more information on this, and how to apply and follow
the GNU GPL, see <https://www.gnu.org/licenses/>.
The GNU General Public License does not permit incorporating your
program into proprietary programs. If your program is a subroutine
library, you may consider it more useful to permit linking proprietary
applications with the library. If this is what you want to do, use the
GNU Lesser General Public License instead of this License. But first,
please read <https://www.gnu.org/licenses/why-not-lgpl.html>.
the GNU AGPL, see <https://www.gnu.org/licenses/>.

@ -9,6 +9,7 @@ Information about which URLs already got posted is also saved into the admin roo
These bots are publicly hosted. Just invite them to a room and they should start doing their job.
Just kick them when you no longer want to receive updates.
[Diablo](https://matrix.to/#/@cain:imninja.net)
[Heroes of the Storm](https://matrix.to/#/@thelostvikings:imninja.net)
[Inside Blizzard](https://matrix.to/#/@snowstorm:imninja.net)
[Overwatch](https://matrix.to/#/@winston:imninja.net)
@ -20,28 +21,27 @@ First create all users you want to use.
Next you have to create a new room you can use as admin room. You have to specify the room id of that room later. Invite all bot users to that room.
Additionally you have to allow every user to post to the cache state:
```json
"de.lubiland.de.snowstorm-matrix.cache": 0
"de.lubiland.snowstorm-matrix.cache": 0
```
Additional bot users can be added via register -> invite to admin room -> access token file.
Multiple bots of different categories can share the same admin room. This can be handy when trying to debug the current RSS feed cache. Two bots with the same category can't share the same admin room.
## Running it
After initial configuration you can run it manually:
```bash
docker run --rm \
-v $(pwd)/heroesofthestorm:/heroesofthestorm:ro \
-e HOMESERVER=https://example.org \
-e MXID=@heeeroooooooes:example.org \
-e ACCESSTOKEN_FILE=/heroesofthestorm \
-e ADMIN_ROOM='!iesofojasief90429ewiofj:example.org' \
-e CATEGORY=heroesofthestorm \
snowstorm-matrix
-e CATEGORY=heroesofthestorm,insideblizzard,battlenet \
gitea.lubiland.de/lub/snowstorm-matrix:latest
```
Or via docker-compose/swarm:
```yaml
snowstorm-matrix_overwatch:
image: snowstorm-matrix
image: gitea.lubiland.de/lub/snowstorm-matrix:latest
deploy:
replicas: 1
secrets:
@ -51,9 +51,9 @@ Or via docker-compose/swarm:
- ACCESSTOKEN_FILE=/run/secrets/snowstorm-matrix_overwatch
- MXID=@bastionrulez:example.com
- ADMIN_ROOM=!jjpPluoxZoAOBQeYer:example.org
- CATEGORY=overwatch
- CATEGORY=overwatch,overwatch2
snowstorm-matrix_worldofwarcraft:
image: snowstorm-matrix
image: gitea.lubiland.de/lub/snowstorm-matrix:latest
deploy:
replicas: 1
secrets:

@ -1,60 +1,70 @@
from os import environ
import requests
import asyncio
import re
from bs4 import BeautifulSoup
from copy import deepcopy
from datetime import datetime, timedelta
from random import randrange
import asyncio
from nio import ClientConfig, AsyncClient, LoginResponse, InviteEvent
import requests
from bs4 import BeautifulSoup
from nio import ClientConfig, AsyncClient, LoginResponse
def get_accesstoken_from_file(accesstoken_path):
accesstoken_file = open(accesstoken_path, 'r')
accesstoken_file = open(accesstoken_path, 'r', encoding='utf8')
single_accesstoken = accesstoken_file.read().strip()
accesstoken_file.close()
return single_accesstoken
def extract_image_url(image_html):
# only recent articles wrap the url in double quotes, so we have to match
# it both with and without quotes
image_url_fragment = re.findall(r'url\("?(.*?)"?\)', image_html.attrs['style'])[0]
return 'https:'+image_url_fragment
def sanitize_category(raw_category):
return raw_category.replace(' ', '').replace(':', '').replace('.', '').lower()
def get_blog():
url = 'https://news.blizzard.com/en-us/'
html = requests.get(url).text
html = requests.get(url, timeout=60).text
soup = BeautifulSoup(html, 'html.parser')
base_url = 'https://news.blizzard.com'
blog = []
feature_list_html = soup.find_all(class_='FeaturedArticle-link')
for feature_html in feature_list_html:
image_html = feature_html.find(class_='Card-image')
image_url_fragment = re.findall('url\("(.*?)"\)', image_html.attrs['style'])[0]
image_url = 'https:'+image_url_fragment
for featured_article in soup.select('#featured-articles article'):
image_url = extract_image_url(featured_article.find(class_='Card-image'))
text_list = feature_html.find_all(class_='text-truncate-ellipsis')
text_list = featured_article.select('.text-truncate-ellipsis')
category = sanitize_category(text_list[0].text)
title = text_list[1].text
url = base_url+featured_article.find('a').attrs['href']
blog.append({
'image': image_url,
'category': text_list[0].contents[0].replace(' ', '').replace(':', '').lower(),
'title': text_list[1].contents[0],
'description': '',
'url': base_url+feature_html.attrs['href'],
'category': category,
'title': title,
'description': '', # featured articles don't have a description
'url': url,
})
article_list_html = soup.find_all(class_='ArticleListItem')
for article_html in article_list_html:
image_html = article_html.find(class_='ArticleListItem-image')
image_url_fragment = re.findall('url\((.*?)\)', image_html.attrs['style'])[0]
image_url = 'https:'+image_url_fragment
for recent_article in soup.select('#recent-articles article'):
image_url = extract_image_url(recent_article.find(class_='ArticleListItem-image'))
content_html = article_html.find(class_='ArticleListItem-contentGrid')
category = sanitize_category(recent_article.find(class_='ArticleListItem-subtitle').find(class_='ArticleListItem-labelInner').text)
title = recent_article.find(class_='ArticleListItem-title').text
description = recent_article.find(class_='ArticleListItem-description').find(class_='h6').text
url = base_url+recent_article.find('a').attrs['href']
blog.append({
'image': image_url,
'category': content_html.find(class_='ArticleListItem-subtitle').find(class_='ArticleListItem-labelInner').contents[0].replace(' ', '').replace(':', '').lower(),
'title': content_html.find(class_='ArticleListItem-title').contents[0],
'description': content_html.find(class_='ArticleListItem-description').find(class_='h6').contents[0],
'url': base_url+article_html.find(class_='ArticleLink').attrs['href'],
'category': category,
'title': title,
'description': description,
'url': url
})
# reverse order so the oldest article is at [0]
@ -92,6 +102,7 @@ async def main():
'xxx',
accesstoken)
await matrix.receive_response(login_response)
await matrix.set_presence('offline')
# filter out everything except m.room.member (for invites)
@ -127,20 +138,23 @@ async def main():
if next_update < datetime.now():
# refresh url cache
cache_state = await matrix.room_get_state_event(room_id=admin_room,
event_type=cache_event_type,
state_key=category)
if hasattr(cache_state, 'content') and 'url_list' in cache_state.content:
cache = cache_state.content['url_list']
else:
print('cache is empty')
cache = []
cache = {}
for category in category_list:
cache_state = await matrix.room_get_state_event(room_id=admin_room,
event_type=cache_event_type,
state_key=category)
if category not in cache:
cache[category] = []
if hasattr(cache_state, 'content') and 'url_list' in cache_state.content:
cache[category] += cache_state.content['url_list']
old_cache = deepcopy(cache)
# scape all blog posts and process them
# scrape all blog posts and process them
blog = get_blog()
for post in blog:
category = post['category']
# check if post url is in cache and matches our category
if post['url'] not in cache and post['category'] == category:
if category in category_list and post['url'] not in cache[category]:
# post url not found in cache
# announce new post to matrix rooms
print('new post: '+post['url'])
@ -160,23 +174,24 @@ async def main():
content=content)
# add url to cache
cache += [post['url']]
cache[category] += [post['url']]
else:
# no new posts found
pass
# trim the cache
# len(blog) is usually bigger than the count of posts in our category,
# so with len(blog) instead of the latter we have some buffer
while len(cache) > len(blog):
cache.remove(cache[0])
# cleanup cache and push it as room state
for category in cache.keys():
# trim the cache
while len(cache[category]) > 100:
cache[category].remove(cache[category][0])
# set new cache event
await matrix.room_put_state(room_id=admin_room,
event_type=cache_event_type,
state_key=category,
content={'url_list': cache})
# set new cache event
if old_cache[category] != cache[category]:
await matrix.room_put_state(room_id=admin_room,
event_type=cache_event_type,
state_key=category,
content={'url_list': cache[category]})
# wait between 15min and 30min to randomize scraping
next_update = datetime.now() + timedelta(minutes=randrange(15, 30))
@ -195,8 +210,9 @@ print('accesstoken_file: '+environ['ACCESSTOKEN_FILE'])
admin_room = environ['ADMIN_ROOM']
print('admin_room: '+admin_room)
category = environ['CATEGORY']
print('category: '+category)
category_list = environ['CATEGORY'].split(',')
print('categories:')
print(category_list)
asyncio.run(main())
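
Note on the new helpers: the sketch below exercises the regex from `extract_image_url` and the `sanitize_category` function outside the scraper, to show how scraped values end up matching the comma-separated `CATEGORY` variable. The style strings and category labels are hypothetical examples, not captured from the real page, and `extract_url_fragment` is a stripped-down stand-in that takes the style string directly.

```python
import re


def extract_url_fragment(style):
    # same pattern as extract_image_url in the diff: the quotes are optional
    return re.findall(r'url\("?(.*?)"?\)', style)[0]


def sanitize_category(raw_category):
    # copied from the diff: drop spaces, colons and dots, then lowercase
    return raw_category.replace(' ', '').replace(':', '').replace('.', '').lower()


# hypothetical style attributes, with and without quotes
quoted = extract_url_fragment('background-image: url("//example.com/a.jpg")')
unquoted = extract_url_fragment('background-image: url(//example.com/b.jpg)')
print(quoted)    # //example.com/a.jpg
print(unquoted)  # //example.com/b.jpg

# CATEGORY value as used in the README example
category_list = 'heroesofthestorm,insideblizzard,battlenet'.split(',')

# hypothetical labels as they might appear on the blog page
labels = ['Heroes of the Storm', 'Battle.net', 'Diablo: Immortal']
matches = [sanitize_category(label) in category_list for label in labels]
print(matches)  # [True, True, False]
```

Because `CATEGORY` is split on commas without stripping, a value such as `overwatch, overwatch2` would yield ` overwatch2` with a leading space and never match.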
