Announcements

News about SCC services: Service News
Report a malfunction: via ITB or the SCC ticketing system.
If this website is unavailable: scc.fail

As of: 30.03.2023 17:18:24

Incidents

 2023-02-22 17:00

Office Online not available for external users of KIT teamsites (SharePoint)

Description: External users who have access to KIT team sites (SharePoint) cannot edit Office documents with Office Online.
Editing with local Office instances works.
We are currently in contact with the Microsoft support team.
Affected users: All external users who have access to KIT team sites (SharePoint). Users with GuP accounts are not affected.
Workaround: Documents can be edited with a local Office instance.
Exists since: 2023-02-22 17:00
 2022-11-16 12:00 - 2023-03-20 19:00

LSDF Online Storage: back in production (28.03.2023) - FIXED

Description: UPDATE 2023-03-28

LSDF02 is back in production; we are closing this incident.


2023-03-21 Update

Finally, we are happy to inform you that we have applied the recovery procedures suggested by the vendor and have repaired the LSDF02 file system.

LSDF02's block allocation table passed the necessary integrity checks, and we have performed a few additional general checks on the data files.

With this, we resumed production in read-write (RW) mode at 7 pm on Monday, March 20th.

Access via all normally available protocols (ssh, scp, sftp, https, webdav, nfs, cifs, ..) is available again.

Access to the LSDF data via the connected HPC systems will also be reactivated in a few days (once confirmed with the remote HPC cluster).

We will retain the recovered data replica from the morning of March 20th for a few more months, just in case.

Very few corrupted files have been discovered so far (31 in total). If explicitly needed, we can try to recover them on individual user request.

These data corruptions are only indirectly related to the problems we had. We feel, however, that we should give a more detailed description of the root cause of these data corruptions so that you can understand their nature.
In November 2022, during a faulty file system check, and as a direct consequence of the file system issue, the file system's table of allocated blocks was corrupted; this was not known at the time.
Thereafter, the file system was mounted writable for a total of less than 24 hours between Nov 12 and 15, 2022, before it was unmounted or mounted read-only because of the irregularities we had observed. During that short write-mounted period, writing new data could overwrite falsely deallocated data blocks (i.e. portions of other files). However, user activity was disabled at the time, so write activity was low. Unfortunately, we cannot tell which files were corrupted, but the very small fraction of corrupted files is consistent with the above description.

Please let us know if you find any of your other files corrupted, or if you observe any other issues after 7 pm, March 20th.

We apologise for the inconvenience caused and thank you for your patience.



UPDATE 2023-03-13

Work is currently in progress to restore the lsdf02 file system (see maintenance).
It is expected to be completed on 2023-03-20.

_ _ _
UPDATE 2023-02-22

Finally, we have good news from our vendor. The root cause of the LSDF02 incident has been completely clarified.
Due to a rare combination of operational conditions, the file system went into an inconsistent state with respect to data block allocation.
Such a possibility is neither mentioned in any documentation nor was it expected by the vendor's support and development teams.
Now that we understand it, we will make sure to avoid such operational conditions from now on.
In the short term, the vendor has proposed a workaround procedure to restore full reliability of LSDF02.
For this we will need another downtime of a few days, which we preliminarily plan for March 3-5 (to be confirmed and announced later).
For the long-term solution, the vendor will patch its software and we will later deploy it to all our GPFS clusters.
We will keep you informed about our recovery actions and about the date when LSDF02 will return to RW operation.
Thank you very much for your patience and understanding!
_ _ _
UPDATE 2023-02-14

The LSDF-02 file system remains in read-only (RO) operation while we progress towards resolving the incident.

The data recovery on our side continues and is 91% complete as of today. The vendor support team has analysed the test data collected two weeks ago and has confirmed our hypothesis about the problem's root cause. They are now working on providing us with a possible fix.

We also continue various file checks on our side. Out of several million files checked so far, we have found only 31 corrupted large files (several hundred GB each), all with the same type of corruption.

We repeat our request to all LSDF-02 users to check the integrity of their data and to inform us of any suspected new corruption.
We apologize for the inconvenience and appreciate your understanding and patience.
_ _ _
UPDATE 2023-02-07

We are continuing to work on analysing and resolving the issues with the LSDF-02 file system, which remains in RO operation.

We have provided our vendor with all the necessary test results and are now waiting for their solution. Meanwhile, we are continuing the data recovery from LSDF-02; it is about 86% complete as of today, and we hope to finish it soon. We have also run several other tests on LSDF-02, which have not revealed any new types of problems so far. With all this, we believe that the likelihood of fully recovering LSDF-02 has increased over the past weeks. We have also tested a large set of big files on LSDF-01, all of which matched their reference checksums. So we believe that LSDF-01 should not have the problems we are dealing with on LSDF-02.

We ask all LSDF users again: please check the integrity of your data stored on LSDF-01 and LSDF-02 and report to us immediately if you find any corruption in your files. One possible way of doing this is sketched below.
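
For illustration only (this is not an official SCC tool), the following minimal Python sketch prints a SHA-256 checksum for every file below a directory, so that the list can be compared against an earlier checksum list or a second copy of the data. The directory path in the example invocation is a placeholder; use your own project path.

# Minimal sketch (not an official SCC tool): print a SHA-256 checksum for
# every file below a given directory so the output can be compared against an
# earlier checksum list or against another copy of the data.
import hashlib
import sys
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    # Read the file in chunks so large files do not need to fit into memory.
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

if __name__ == "__main__":
    # Example invocation (placeholder path): python checksum_tree.py /path/to/your/lsdf/project
    root = Path(sys.argv[1])
    for p in sorted(root.rglob("*")):
        if p.is_file():
            print(f"{sha256_of(p)}  {p}")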

If you have any LSDF data that you cannot access but need urgently for your work (thesis, publication, etc.), please let us know so we can try to retrieve it from the tape backups.

We apologize for any inconvenience and appreciate your understanding.
_ _ _
UPDATE, 31.01.2023

The problems with the lsdf02 file system are still not completely fixed.

In order to let our vendor work in parallel on their investigations and possible software fixes, we took the file system offline last Friday evening, as announced last week.

We ran intensive file system checks on Saturday and Sunday, January 28/29. We have collected the technical details the vendor asked for, so in this sense the downtime period has fulfilled its purpose. We will pass the collected test results to the vendor and hope they will provide a solution.

As promised, we resumed RO operation at 9.00 on Monday, January 30th, and resumed the file recovery from lsdf02, which will continue for several more days (it is about 70% done as of today).

In parallel, we are continuing several further tests and investigations and have so far observed NO other real data corruption cases, which increases the likelihood of recovering the integrity of lsdf02.

We would like to renew our appeal to all users to check their data as far as possible and to contact us as soon as there is even the slightest sign of data corruption.

The connection to the HPC cluster is still not available, for the sake of cluster stability.

Again, we would like to ask for your understanding and to apologize for the inconvenience.
_ _ _
UPDATE, 18.01.23, 3.30 p.m.

The problems with a part of the LSDF (the filesystem lsdf02) are still not completely fixed.
For further file system analysis, it will be necessary to take the file system offline again later. For this reason, we are currently copying data from the lsdf02 file system to a newly created file system to keep the data accessible. Users will be informed separately about the details.
We also have to report that some corrupt files have been identified. Whether the data corruption and the current problems are related has not been conclusively determined at this time.
We will now randomly fetch some data from the TSM backup and check more checksums.
We continue to ask all users to check their data as much as possible.
Connection to the HPC cluster remains unavailable for cluster stability reasons.

Thank you for your understanding and patience.
_ _ _
UPDATE, 09.01.23, 4.00 p.m.

Unfortunately, the file system issues could not be completely resolved so far. In particular, lsdf02 is still affected.

lsdf01 has been fully operational since 20.12.2022.

The file system lsdf02 can be made available for read-only access again. We have not detected any data corruption so far; however, the checking tools return inconsistent results for data block allocation, which indicates that data could be destroyed by write operations. For that reason, we are copying the ifh, imk-tro, imk-asf, and ibcs data off lsdf02 into newly created file systems.

We kindly ask all users to check their data as far as possible.

For the sake of cluster stability, access from the HPC cluster is being denied at present.

_ _ _
UPDATE, 20.12.22, 3:30 p.m.

Dear LSDF users,

We would like to inform you about the current status of the LSDF before the holidays.
All work on the file systems sdil, gridka, gpfsadmin and lsdf01 has been completed successfully. These file systems are now released for writing again.
For the file system lsdf02, further file system checks are in progress, which will continue for several days. This file system will not be available until after the holidays.
The connection of the file systems to the HPC systems at KIT cannot be restored yet. This can only take place at the beginning of January 2023.
In cooperation with the manufacturer, we have chosen a setup (IPoIB) for the current operation of the LSDF that identifies possible transmission problems in the Infiniband fabric with the help of internal checksums.

On January 5, 2023, we will again take the LSDF offline for a few hours to perform hardware maintenance.

We apologize for the ongoing outage.
_ _ _
UPDATE, 13.12.22, 12:15 p.m.

The file systems lsdf01, sdil, gridka and gpfsadmin have reached a formally consistent state after the latest measures. Several inconsistencies due to communication problems in the Infiniband fabric have been identified and repaired. However, we cannot exclude that files may have lost references to data blocks.

Thus, these file systems will be provided for read access only (read-only mounts) on the LSDF login cluster, WebDAV cluster and NFS/CES cluster via ssh/scp/sftp, https/WebDAV or NFS for now.

We kindly ask all users to check files for data corruptions as far as possible.

With respect to the file system lsdf02, we received information from the vendor support on the evening of Dec 9 that further corrective actions are required. These were started on Dec 10 but will presumably take several days.


For all file systems, the connection to the HPC cluster will not be available for now.

In close cooperation with the vendor support, we are making every effort to resolve the remaining problems as soon as possible.
_ _ _
UPDATE, 09.12.22, 3:30 p.m.

Since the last update, we were able to successfully complete the consistency checks, some of which had been running for several days. This work was delayed by further crashes of the checks themselves, but we were able to successfully deliver debug information to the manufacturer, so that such crashes will no longer occur in the future.

Soon we will mount the file systems internally (unfortunately still without user access) in order to identify any problematic data.

The LSDF will not be able to resume user operation until we receive precise information from the manufacturer about software releases and configuration that will guarantee secure and stable operation of the file systems. We already have some of this information, but some details still need to be clarified.

If no further fundamental problems occur, we expect the LSDF Online Storage to be available again within the next week.
_ _ _
UPDATE, 30.11.22, 10:45 a.m.

Since our last update, further file system checks have been carried out over the whole weekend and the following days in close cooperation with the manufacturer. Unfortunately, these are still not completed, because crashes of the file system software have occurred in the meantime, which the manufacturer must first analyze. The smaller file systems of the LSDF have been successfully checked in the meantime, but we cannot release them yet because this would interfere too much with the ongoing checks.

The manufacturer attributes the actual cause of the problems to communication problems in the Infiniband network, which occur with certain combinations of the file system software, the Infiniband hardware constellation and the Infiniband driver software. While workarounds have been put in place in the meantime, we expect reliable statements on the versions and setup of all components in the next few days, which will ensure stable and performant operation after the current problems have been fixed.

We apologize for the ongoing outage.
_ _ _
UPDATE, 25.11.22, 3:25 p.m.

Due to very large amounts of data, the file system checks could unfortunately not yet be completed.
Access to the data in the LSDF Online Storage is therefore still not possible.

We regret this long outage.
_ _ _
UPDATE, 24.11.22, 11.15 p.m.

Since Monday, 21.11.22, we have performed several consistency checks. The results of these checks are being verified by the manufacturer support.
Unfortunately, during the checks, which take up to 24 hours per file system, crashes occurred, so that they have to be repeated. Under normal circumstances such work is done in parallel, but the current problems in the cluster mean that we cannot work in parallel.

Unfortunately, this also means that it will take longer until we can make concrete statements about the availability of the data.

We apologize for the ongoing outage.
_ _ _
UPDATE, 21.11.22, 3.30 p.m.

On 18.11.2022 the LSDF had to be taken completely offline and isolated in order to stabilize internal operations for troubleshooting. This was successful, so that important maintenance processes could be carried out for all file systems over the weekend. Whether this work was successful must be confirmed today with load tests and renewed file system checks.

Together with the manufacturer support, the resulting data will be evaluated in order to be able to ensure the consistency of the user data.

Together with the manufacturer support, a possible candidate for the cause of the file system failure was identified. This points to problems in the Infiniband network that only occur in conjunction with certain firmware/software/configuration states and that may have been triggered by the load of the original file system check. We are currently investigating possible ways to update the above components.

We apologize for the ongoing outage.

_ _ _
UPDATE, 18.11.22, 2 p.m.

Due to recent events and to avoid further problems in the file system, the LSDF Online Storage has to be switched off urgently.
This also means that read-only access to the data is NOT possible anymore!

Thank you for your understanding.

_ _ _
LSDF Online Storage (bulk storage for scientific data) - https://www.scc.kit.edu/en/services/11228.php - is not available.

AFFECTED/IMPACTS
Access to the LSDF data via all normally available protocols (ssh, scp, sftp, https, webdav, nfs, cifs, ..), as well as access via the connected HPC systems, is affected.

On the HPC login nodes, the LSDF login cluster, the LSDF WebDAV cluster, and the NFS/SMB cluster, all LSDF file systems except lsdf02 are available read-only.

We are working with the support team to resolve the issues as quickly as possible.
Affected users: LSDF users
Existed since: 2022-11-16 12:00
Fixed since: 2023-03-20 19:00
 2023-02-16 09:30 - 2023-03-21 11:00

Exchange malfunction workaround (17.02.2023) - FIXED

Description: Update 02/17/2023 09:45 AM

A workaround has been installed that brings significant improvement.

However, the problems are not yet completely resolved. That will only be possible with a new security update.

=====

The problem is caused by a security update for Microsoft Exchange that we installed today. The MS security update has been classified as critical and therefore had to be installed promptly.

Access is only sporadically possible.

- At the moment, macOS users cannot send or receive emails.
- CalDAV cannot connect.
- Free/busy times cannot be retrieved.
- Out-of-office messages cannot be changed in Outlook.
- Calendars cannot be opened in Outlook/OWA.
- etc.

We are evaluating possible further steps and will keep you informed.
Additional Information: Update 03/21/2023: A security update that fixes the problem has been installed and the workaround has been rolled back.
Affected users: All users
Workaround: Use OWA (via https://owa.kit.edu)
Existed since: 2023-02-16 09:30
Fixed since: 2023-03-21 11:00

Maintenance

  2023-03-30 09:30 - 2023-03-31 18:00

bwCloud: Maintenance days - Virtual machines are stopped

AFFECTED
Services:
  • bwCloud SCOPE - Virtualized server and application infrastructure.

IMPACT
  • bwCloud SCOPE: All virtual machines (VMs) will be stopped unless users have shut them down themselves by 03/30/2023.
    • Note: Users must restart their VMs themselves after the maintenance (see the sketch at the end of this entry).

DESCRIPTION
During the specified period, the following work will be performed on the bwCloud components:
  • Update of the host operating systems,
  • Restart of all bwCloud servers,
  • Restart of all Openstack services.
Maintenance work will also take place on the storage system.
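
For users who prefer to restart their VM from a script rather than the dashboard, here is a minimal sketch using the openstacksdk Python library. The cloud profile name "bwcloud" and the server name "my-vm" are placeholders and assumptions; use the names from your own clouds.yaml and project.

# Minimal sketch (assumptions: openstacksdk is installed, a cloud profile named
# "bwcloud" exists in clouds.yaml, and "my-vm" is a placeholder for your VM name).
import openstack

conn = openstack.connect(cloud="bwcloud")

server = conn.compute.find_server("my-vm")
if server is None:
    raise SystemExit("Server not found; check the name in the dashboard")

# Only stopped (SHUTOFF) instances need to be started again after the maintenance.
if server.status == "SHUTOFF":
    conn.compute.start_server(server)
    server = conn.compute.wait_for_server(server, status="ACTIVE")

print(f"{server.name}: {server.status}")
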
  2023-03-30 17:00 - 2023-03-30 18:00

HIS (SOS, POS): Database server update - HIS applications not available

On Thursday, 30.03.2023 (this week!), between 5 and 6 pm, the database server for the HIS SOS/POS database will be updated.
The HIS applications (SOS-GX, POS-GX, BSOS) and QIS (qis.studium.kit.edu) will not be available during this time. Likewise, some functions in the student portal (campus.studium.kit.edu) will be out of service.